SSK-DDoS: distributed stream processing framework based classification system for DDoS attacks

Patil, Nilesh Vishwasrao; Krishna, C. Rama; Kumar, Krishan

doi:10.1007/s10586-022-03538-x

Published: 17 January 2022

Volume 25, pages 1355–1372, (2022)

Cluster Computing


Nilesh Vishwasrao Patil ORCID: orcid.org/0000-0002-1983-668X¹,
C. Rama Krishna¹ &
Krishan Kumar²


Abstract

Distributed denial of service (DDoS) attacks are a major threat to Internet-based applications and their resources. A DDoS attack floods the victim system with a large number of network packets, so that the victim's resources become unavailable to legitimate users. Several security approaches have been proposed in the literature to protect Internet-based applications from this type of threat. However, the frequency and strength of DDoS attacks are increasing day by day. Further, most traditional and distributed processing framework-based DDoS attack detection systems analyze network flows in offline batch mode and therefore fail to classify network flows in real time. This paper proposes a novel Spark Streaming and Kafka-based distributed classification system, named SSK-DDoS, for classifying different types of DDoS attacks and legitimate network flows. The classification approach is implemented using distributed Spark MLlib machine learning algorithms on a Hadoop cluster and deployed on the Spark Streaming platform to classify streams in real time. Incoming streams are consumed from Kafka topics and preprocessed (features are extracted and formulated) so that each flow can be classified into one of seven groups: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. Further, the SSK-DDoS classification system stores the formulated features with their predicted class in HDFS, which helps to retrain the distributed classification approach using a new set of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that the proposed SSK-DDoS efficiently classified network flows into seven classes and stored the formulated features with the predicted value of each incoming network flow in HDFS.


1 Introduction

Over the past decade, companies have moved their services online to grow revenue and to be reachable by users anywhere and at any time. In recent times, there has also been huge growth in Internet subscribers and connected devices. However, this growth has brought unsafe network routes and insecure connected devices. Attackers use this opportunity to compromise numerous nodes and form a botnet for performing DDoS attacks on the victim system.

1.1 DDoS attacks

A DDoS attack is one of the biggest threats to Internet-based applications and their resources [1, 2]. The motive of this attack is to overwhelm Internet-based services by transmitting a large amount of attack traffic [3, 4]. A typical example of a DDoS attack on a victim system is presented in Fig. 1. Here, a master takes control of various slaves with the help of handler programs. The handler is the intermediary program between the master and slave nodes that helps to perform a large-scale DDoS attack on victim applications.

1.2 Summary of DDoS attack events

Every country has been struggling with the COVID-19 situation since January 2020. During this pandemic, people have been working, shopping, and entertaining themselves online. Attackers use this opportunity to compromise numerous nodes to form a botnet. The Q4-2020 DDoS attacks statistical report [5] is summarized as follows:

1. The countries that experienced the most attacks were China (44%+), the USA (23%+), and Hong Kong (7%+).
2. The highest number of attacks, 1349 incidents, was reported on Dec 31, 2020.
3. After an exception in the last few quarters, Linux-based botnets were once again used to launch every DDoS attack.
4. The majority of C&C servers were located in the USA (36%+), the Netherlands (19%+), and Germany (8%+).
5. Once again, the highest number of incidents was observed on Thursdays, and the number dropped on Sundays.

The country-wise distribution of DDoS attack incidents for Q3-2020 and Q4-2020 is given in Fig. 2. From this, we can conclude that both the frequency and the strength of attacks are increasing year after year. Further, attack strength has shifted from the Gbps range to the Tbps range. Researchers therefore face the additional challenge of systematically analyzing such a large volume of traffic.

1.3 Challenges

In this big data world, detection approaches based on traditional frameworks themselves become victims while examining a massive number of packets. Therefore, there is a need to deploy the detection approach on a distributed stream processing framework (DSPF). A DSPF can handle (store and analyze) a large volume of data in real time by employing multiple nodes. Further, data transfer between nodes, secure communication protocols, and metadata are systematically managed by the DSPF. Traditional and distributed processing framework (DPF) based DDoS attack detection systems are designed to examine flows in offline mode and therefore fail to analyze incoming streams in real time. Additionally, most approaches have been tested on outdated datasets. Therefore, there is a need to design a distributed classification model using a recent dataset and deploy it on a DSPF (such as the Spark Streaming platform).

1.4 Open-source technologies

In this section, we summarize the open-source technologies required to design the proposed SSK-DDoS classification system for DDoS attacks. We split this section into four sub-sections: Apache Hadoop, Spark Streaming, Apache Kafka, and CICFlowMeter.

A good DSPF must have the following features:

1. It analyzes streaming data, such as network traffic flows, as it arrives and takes immediate action based on the prediction.
2. It supports real-time applications with a loosely-coupled architecture, so that multiple publishers and consumers can independently access the application without delay.
3. It offers distributed data analysis, extremely low latency, reliability, scalability, and fault tolerance.

1.4.1 Apache Hadoop

Apache Hadoop [7, 8] is a powerful DPF for storing and analyzing a large amount of data. It is specially designed to analyze data using batch processing on a cluster of nodes. It consists of three major modules:

1. Hadoop Distributed File System (HDFS): It stores a large amount of data on a cluster of nodes called datanodes. The data is divided into multiple blocks that are systematically stored on datanodes. Metadata about each block is stored in the namenode.
2. Yet Another Resource Negotiator (YARN): This module allocates resources for analyzing a large amount of data.
3. MapReduce: A programming model for analyzing a large amount of data in a distributed manner.
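
To make the MapReduce programming model concrete, the following is a minimal single-process sketch in plain Python, not Hadoop code. It counts flows per source address across two data blocks; the record layout and the `src_ip` field name are assumptions made only for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit one (key, 1) pair per flow record, keyed by source IP."""
    for record in records:
        yield record["src_ip"], 1

def shuffle(pairs):
    """Shuffle step: group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def count_flows_per_source(blocks):
    """Map each block independently, then shuffle and reduce the merged output."""
    pairs = []
    for block in blocks:
        pairs.extend(map_phase(block))
    return reduce_phase(shuffle(pairs))

# Two hypothetical blocks, standing in for data stored on two datanodes.
blocks = [
    [{"src_ip": "10.0.0.1"}, {"src_ip": "10.0.0.2"}],
    [{"src_ip": "10.0.0.1"}, {"src_ip": "10.0.0.1"}],
]
counts = count_flows_per_source(blocks)
```

In a real cluster the map calls would run on the nodes holding each block, and only the grouped intermediate pairs would move across the network.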

1.4.2 Apache Spark streaming

Apache Spark [9] is a large-scale data analytics engine that provides an API for processing large datasets. Spark Streaming is an extension of the core Spark API for developing real-time applications. The Apache Spark Streaming platform is commonly used:

1. To design applications that analyze a large amount of data in real time.
2. To respond to streaming data immediately, so that action can be taken without delay.

Apache Spark consists of four essential components: Spark SQL, MLlib, GraphX, and Spark Streaming. These four components can be combined to design a machine learning-based real-time application. The Spark Machine Learning Library (MLlib) is a distributed in-memory machine learning library. It provides:

1. A way to design a model in a distributed manner.
2. Robust APIs.
3. High scalability for the machine learning model when deployed on a DPF/DSPF.
4. Support for various programming languages: Python, Java, Scala, etc.

Several tools are available to design traditional and non-traditional machine learning models, such as Python, Java, R, and WEKA. Further, a few authors [10,11,12,13] have systematically discussed machine/deep learning methods and feature selection. However, a model designed with these tools faces scalability issues when deployed on a DPF/DSPF. The Spark MLlib library provides a way to design a distributed, in-memory machine learning model that is intended for deployment on a DPF/DSPF (Hadoop, Kafka, Spark, etc.). This motivates implementing a distributed classification approach for DDoS attacks using MLlib and deploying it on the Spark Streaming platform.

1.4.3 Apache Kafka

Apache Kafka [14] is an open-source distributed and high-throughput publish-subscribe messaging system. It consists of six essential components: Brokers, Zookeeper, Topics, Partitions, Publishers, and Subscribers. The publishing/consuming feature of Kafka helps to provide a loosely-coupled architecture to real-time applications.
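
The loose coupling that publish-subscribe messaging provides can be illustrated with a small in-memory sketch. This is a toy stand-in and not the Kafka client API: the class `InMemoryBroker`, the topic name `demo_topic`, and the group names are all hypothetical. Each topic is an append-only log, and each consumer group tracks its own read position, so publishers and consumers never call each other directly.

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy stand-in for a broker: each topic is an append-only list of messages."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)  # (topic, group) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self._topics[topic].append(message)

    def consume(self, topic, group):
        """Return the messages this consumer group has not read yet."""
        start = self._offsets[(topic, group)]
        messages = self._topics[topic][start:]
        self._offsets[(topic, group)] = len(self._topics[topic])
        return messages

broker = InMemoryBroker()
broker.publish("demo_topic", {"flow_id": 1})
broker.publish("demo_topic", {"flow_id": 2})

first_read = broker.consume("demo_topic", group="classifier")
second_read = broker.consume("demo_topic", group="classifier")
archiver_read = broker.consume("demo_topic", group="archiver")
```

Because offsets are kept per group, a second group (here `archiver`) still receives every message after the first group has read them.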

1.4.4 CICFlowMeter

CICFlowMeter [15] is an open-source network flow generator tool. It creates network flows in offline (from PCAP) and online (from network interfaces) mode. It derives 83 attributes per flow from network traffic and stores them in a CSV file. An example of CICFlowMeter collecting network packets from the network interface card and generating network flows from them is presented in Fig. 3.
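
As a rough illustration of what a flow generator does, the sketch below groups packets by their 5-tuple and computes three per-flow statistics. It is a simplified assumption-laden example, not CICFlowMeter's algorithm: it treats flows as unidirectional, ignores timeouts, and uses hypothetical field names (`ts`, `length`, and so on).

```python
from collections import defaultdict

def packets_to_flows(packets):
    """Group packets by 5-tuple and compute a few per-flow statistics."""
    groups = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        groups[key].append(pkt)
    flows = []
    for key, pkts in groups.items():
        times = [p["ts"] for p in pkts]
        sizes = [p["length"] for p in pkts]
        flows.append({
            "flow_key": key,
            "packet_count": len(pkts),
            "total_bytes": sum(sizes),
            "duration": max(times) - min(times),
        })
    return flows

# Three made-up packets; the first and third share a 5-tuple.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.9", "dst_port": 53, "proto": "UDP", "ts": 1.0, "length": 60},
    {"src_ip": "10.0.0.2", "src_port": 5001, "dst_ip": "10.0.0.9", "dst_port": 53, "proto": "UDP", "ts": 1.25, "length": 40},
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.9", "dst_port": 53, "proto": "UDP", "ts": 1.5, "length": 60},
]
flows = packets_to_flows(packets)
```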

1.5 Contributions

The significant contributions of this paper are as follows:

We propose a novel Spark Streaming and Kafka-based classification system for DDoS attacks called SSK-DDoS.
SSK-DDoS is a distributed, real-time classification approach built using distributed Spark MLlib machine learning algorithms on a Hadoop cluster and deployed on Spark Streaming clusters to classify network flows in real time.
It stores the formulated features of each network flow with the predicted class in HDFS so that the model can be retrained using a new set of samples.
The proposed SSK-DDoS classification system distributes the computational overhead, i.e., the preprocessing and classification tasks on network traffic, between multiple nodes of the Spark clusters.
The proposed SSK-DDoS runs in an automated manner: incoming network flows are published on Kafka topics, essential variables are selected, features are formulated from the selected variables, the classification job is performed, and predictions are published on a Kafka topic so that action can be taken in real time.
The proposed SSK-DDoS classification approach is designed and validated using the recent CICDDoS2019 dataset.
The proposed SSK-DDoS is a highly scalable approach and provides a loosely-coupled architecture.

The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 presents the novel distributed SSK-DDoS classification system for DDoS attacks. Section 4 describes the testbed of the classification approach. Results and analysis are presented in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related work

Numerous security approaches are available in the literature to protect victim systems from different DDoS attacks. Patil et al. [16] systematically classified DDoS attack detection approaches into two broad classes based on their deployment frameworks: traditional and DPF-based detection approaches. Several authors [17,18,19,20,21,22,23,24,25,26,27,28,29,30] have systematically summarized traditional framework-based approaches, and a few recent systems are presented in [31,32,33]. However, few authors [16] have specifically addressed DPF-based approaches. DPFs (batch processing) and DSPFs (real-time) have distributed designs for storing and analyzing a massive volume of data on a cluster of nodes. Some authors [34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54] have proposed DPF- and DSPF-based approaches, but most of them are deployed on a DPF. This type of detection approach efficiently analyzes a large number of packets and classifies them in a short time, but it cannot classify network flows in real time. It is useful for historical data analysis and for retraining the distributed model. Therefore, if the use case demands real-time classification of network flows, the detection approach needs to be deployed on a DSPF (such as the Spark Streaming platform).

We have drawn some inferences from the existing works related to DPF/DSPF. They are listed as follows:

Most of the systems are designed and tested in offline mode. Therefore, there is a need to deploy a classification model for DDoS attacks on a DSPF, such as Apache Spark Streaming, that analyzes network traffic in real time.
A few researchers designed their classification models using shallow and deep learning algorithms. These models performed exceptionally well when deployed on traditional frameworks. However, they face scalability issues when deployed on a DPF/DSPF. Therefore, there is a need to implement a distributed model using a distributed machine learning library that remains highly scalable when deployed on a DPF/DSPF.
Most of the DPF/DSPF-based DDoS approaches efficiently analyzed a huge number of network flows on a group of nodes by distributing the analysis task across multiple systems.
Most of the existing DPF/DSPF-based DDoS mechanisms employed a counter-based detection methodology for identifying high-volume attacks. Therefore, this type of system fails to recognize low-volume DDoS attacks.
Most of the DPF/DSPF and traditional framework-based DDoS mechanisms are validated using outdated datasets. Few authors [55] designed their system using a recent dataset. Therefore, there is a need for a new classification approach that can be validated using recent datasets, such as CICDDoS2019.

3 SSK-DDoS: Spark Streaming and Kafka based classification system for DDoS attacks

This section presents the functioning of the proposed SSK-DDoS classification system for DDoS attacks. The logical architecture of SSK-DDoS is given in Fig. 4.

The distributed SSK-DDoS classification system consists of three Spark Streaming clusters: ‘SC-1’, ‘SC-2’, and ‘SC-3’. The ‘SC-1’ and ‘SC-2’ clusters are deployed in the intermediate network, i.e., at ISP-1 and ISP-2, respectively. Their primary job is to preprocess the incoming network traffic and pass it on to ‘SC-3’. The ‘SC-3’ cluster is deployed in the victim network, and its job is to classify flows into seven classes. In the first step, producer agents (at ISP-1 and ISP-2) continuously publish network flows generated by CICFlowMeter onto the “ssk_ddos_flows” topic, from which ‘SC-1’ and ‘SC-2’ immediately consume them. In the second step, the clusters extract essential variables from the flows, formulate features from the extracted variables, and publish them on the “ssk_ddos_features” topic. The ‘SC-3’ cluster then immediately consumes the formulated features of each flow from “ssk_ddos_features”, classifies them into seven classes, and publishes the predicted class on the “ssk_ddos_prediction” topic so that action can be taken. Further, the system stores the formulated features of each flow with the predicted class in HDFS, which helps to retrain the distributed classification model of DDoS attacks using a new set of samples. Highlights of the proposed distributed SSK-DDoS classification system are as follows:

Loosely-coupled architecture as it uses distributed publish-subscribe messaging system for communication
Analyze network traffic flows in real-time using Spark Streaming API
Distributed computational overhead between three clusters
Stores formulated features of each flow with their predicted class into HDFS for retraining the existing classification model using a new set of samples

The detection approach of the proposed SSK-DDoS classification system is split into two parts: the preprocessing task and the classification task.

3.1 Preprocessing task

The role of the ‘SC-1’ and ‘SC-2’ clusters is to consume network traffic, generate network flows using CICFlowMeter, select significant variables, scale the selected variables, formulate features from the scaled variables, and finally publish them on the “ssk_ddos_features” topic. ‘SC-1’ and ‘SC-2’ each have a separate Kafka topic with the same name, “ssk_ddos_features”. We split this section into three sub-sections: creating network flows, scaling variables, and formulating features.

3.1.1 Create network flows using CICFlowMeter

CICFlowMeter generates network flows with 83 attributes from incoming traffic and writes them to a CSV file. We employ producer agents to pick up each entry from the CSV file immediately and publish the flows on the “ssk_ddos_flows” topic. The next task performed by the ‘SC-1’ and ‘SC-2’ clusters is to select 23 significant variables from each flow. In [56], 24 significant variables are used to classify flows into different classes. However, two of these 24 variables, Fwd_Header_Length and Fwd_Header_Length.1, appear to be duplicate columns. Further, the current version of CICFlowMeter no longer includes the Fwd_Header_Length.1 variable in the generated network flows. Therefore, we selected 23 variables from the variable list of each network flow.
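
The variable-selection step can be sketched as follows. This is a minimal example under stated assumptions: only `Fwd_Header_Length` and `Fwd_Header_Length.1` are names taken from the text, while `Flow_Duration`, `Total_Fwd_Packets`, and the sample rows are hypothetical. The sample CSV deliberately mimics the older layout that still carries the duplicate column, to show that it is dropped.

```python
import csv
import io
import json

# Hypothetical subset of columns; the full list of 23 variables is not reproduced here.
SELECTED_VARIABLES = ["Flow_Duration", "Total_Fwd_Packets", "Fwd_Header_Length"]

def select_variables(csv_text, selected=SELECTED_VARIABLES):
    """Yield one JSON message per flow row, keeping only the selected columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield json.dumps({name: float(row[name]) for name in selected})

# Made-up flow records, including the duplicate Fwd_Header_Length.1 column.
csv_text = (
    "Flow_Duration,Total_Fwd_Packets,Fwd_Header_Length,Fwd_Header_Length.1,Label\n"
    "1200,4,80,80,Benign\n"
    "15,2,40,40,DDoS-DNS\n"
)
messages = list(select_variables(csv_text))
```

Each element of `messages` is a JSON string of the kind a producer agent could publish to a topic.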

3.1.2 Scaling data values

The next job performed by both clusters is to bring the values of the twenty-three variables onto the same scale. The data points are scaled using the “MinMax” technique provided by “sklearn.preprocessing”. After scaling, all data point values lie between 0 and 1. The scaling formula is:

$$\begin{aligned} \textit{Norm}\_\textit{Data}_i = \frac{\textit{DataVal}_i - \textit{min} (\textit{DataVal})}{\textit{max}(\textit{DataVal}) - \textit{min}(\textit{DataVal})} \end{aligned} \tag{1}$$
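
Eq. (1) can be written directly in a few lines of plain Python. This is a sketch of the formula only, not the scikit-learn implementation; mapping a constant column to 0.0 is an assumption added here to avoid division by zero.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using Eq. (1); constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up values for a single variable.
scaled = min_max_scale([10.0, 20.0, 40.0])
```

In a streaming setting the minimum and maximum would presumably be fixed from the training data rather than recomputed per batch, since a single incoming flow has no range of its own.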

3.1.3 Features formulation

Both ‘SC-1’ and ‘SC-2’ formulate ten features from the 23 selected variables. This helps to enhance accuracy and speeds up the design of the classification model. A summary of each feature is given in Table 1. After formulation, the features produced by ‘SC-1’ and ‘SC-2’ are replicated to ‘SC-3’.

Table 1 Description of formulated features


3.2 Classification task

In this section, we present the distributed classification approach of the proposed SSK-DDoS for identifying various types of attacks: DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. The approach is designed using the CICDDoS2019 dataset and four distributed machine learning algorithms from the Spark MLlib library: DecisionTreeClassifier (DTC), Naive Bayes (NB), Multinomial Logistic Regression (MLR), and Random Forest (RF). Spark MLlib provides an RF classifier for both binary and multiclass classification, and it allows a model to be designed in a distributed manner with millions or even billions of samples. RF is an ensemble classifier that consists of multiple trees (classifiers), each of which is built on a different set of features. Gradient-Boosted Trees (GBT) is also an ensemble classifier and helps to improve accuracy. However, Spark MLlib provides this algorithm only for binary classification, whereas our classification approach has seven target classes, so GBT is not applicable to our use case. We deployed an RF-based classification approach on ‘SC-3’ for classifying flows into seven classes: Benign (One), DDoS-DNS (Two), DDoS-LDAP (Three), DDoS-MSSQL (Four), DDoS-NetBIOS (Five), DDoS-UDP (Six), and DDoS-SYN (Seven).
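
The ensemble idea behind RF, many trees that each look at different features and a combined decision, can be sketched with plain majority voting. This is the textbook formulation and not necessarily how Spark MLlib aggregates tree outputs internally; the three one-rule "trees", their feature names, and their thresholds are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among per-tree predictions (ties go to the first seen)."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Ask every tree for a label and combine the answers by majority vote."""
    return majority_vote([tree(sample) for tree in trees])

# Three hypothetical one-rule "trees", each looking at a different feature.
trees = [
    lambda s: "DDoS-UDP" if s["packet_rate"] > 0.8 else "Benign",
    lambda s: "DDoS-UDP" if s["mean_packet_size"] < 0.2 else "Benign",
    lambda s: "DDoS-SYN" if s["syn_ratio"] > 0.5 else "Benign",
]
sample = {"packet_rate": 0.9, "mean_packet_size": 0.1, "syn_ratio": 0.1}
label = forest_predict(trees, sample)
```

Here two of the three trees vote for the same class, so a single dissenting tree does not change the outcome.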

The primary objective of this classification approach is to classify network flows in real time. We split the proposed classification approach into two parts: (i) the design of a distributed classification model using the distributed Spark MLlib library on the Hadoop cluster and (ii) the operation of the classification model after deployment in the ‘SC-3’ Spark Streaming cluster to classify network flows in real time. The step-by-step workflow of the proposed classification model is presented in Figs. 5 (design process) and 6 (after deployment).

We divided this section into three sub-sections: details of the CICDDoS2019 dataset, the design process, and the post-deployment process of the classification model.

3.2.1 CICDDoS2019 dataset

The CICDDoS2019 [56] dataset is a joint project of the Canadian Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC). It includes both benign and various types of DDoS attack scenarios. This dataset is available as both PCAP and CSV files, i.e., raw packets and labeled network flows, respectively. However, the CSV files have several issues. Therefore, we generated network flows from the PCAP files for various scenarios such as DDoS-UDP, DDoS-LDAP, DDoS-DNS, DDoS-SYN, DDoS-MSSQL, DDoS-NetBIOS, and Benign using the CICFlowMeter flow generator tool. The newly generated network flows contain 83 variables and one label column, which we set according to the attack-wise schedule of the PCAP files given on the dataset portal.

3.2.2 SSK-DDoS: design process

The step-by-step process to implement a distributed classification model for DDoS attacks using MLlib library is shown in Fig. 5. For designing this model, we assembled PCAP files of DDoS-UDP, DDoS-LDAP, DDoS-DNS, DDoS-SYN, DDoS-MSSQL, DDoS-NetBIOS, and Benign. The number of flows in each class is Benign: 56863, DDoS-DNS: 5071011, DDoS-LDAP: 2179930, DDoS-MSSQL: 4522492, DDoS-NetBIOS: 4093279, DDoS-UDP: 3134645, and DDoS-SYN: 1582289.

However, the number of flows per class is highly imbalanced, which affects the accuracy of the classification model. We therefore up-sampled the smaller classes to 5071011 flows, so the sample contains 35 million+ flows, which are stored in HDFS. The next step is to implement a distributed classification model of DDoS attacks. We designed this classification model using Spark MLlib machine learning algorithms: DTC, MLR, NB, and RF. The model is then deployed on the Spark Streaming cluster. The next task is to calculate the performance evaluation metrics: precision, recall, and F1-score. The performance of these algorithms is discussed in Sect. 5. Finally, we save the model in persistent storage for deployment in the ‘SC-3’ Spark Streaming cluster to analyze flows in real time.
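
The text does not say how the up-sampling was done. The sketch below assumes simple random duplication up to the size of the largest class, shown on a tiny made-up sample, and then checks the arithmetic behind the "35 million+" figure under the reading that all seven classes end up at the DDoS-DNS count.

```python
import random

def upsample_to_majority(samples_by_class, seed=0):
    """Randomly duplicate samples so every class matches the largest class size."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in samples_by_class.values())
    balanced = {}
    for label, samples in samples_by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        balanced[label] = list(samples) + extra
    return balanced

# Toy data: integers stand in for flow records.
toy = {"Benign": [1, 2], "DDoS-DNS": [3, 4, 5, 6, 7], "DDoS-SYN": [8, 9, 10]}
balanced = upsample_to_majority(toy)

# Arithmetic check on the reported figures: seven classes at 5071011 flows each.
total_after_upsampling = 7 * 5071011
```

Under that reading, `total_after_upsampling` is 35497077, which is consistent with the stated 35 million+ flows.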

3.2.3 SSK-DDoS: classification process in real-time (after deployment)

The second part of the classification approach classifies incoming network traffic into seven classes. Figure 6 shows the step-by-step process of the proposed classification approach after deployment in ‘SC-3’. CICFlowMeter generates network flows from incoming network traffic. Producer agents then continuously publish the created flows on the “ssk_ddos_flows” topic. Both ‘SC-1’ and ‘SC-2’ immediately consume the published flows and select twenty-three variables from the list of eighty-three. Next, ‘SC-1’ and ‘SC-2’ scale the values of these variables, formulate features from the scaled variables, and publish them on the “ssk_ddos_features” topic. The distributed classification model then immediately consumes messages from “ssk_ddos_features” and classifies them into seven classes: DDoS-UDP, DDoS-LDAP, DDoS-DNS, DDoS-SYN, DDoS-MSSQL, DDoS-NetBIOS, and Benign. Finally, the proposed classification approach publishes the predicted class on the “ssk_ddos_prediction” topic so that immediate action can be taken on incoming network flows. Further, the distributed SSK-DDoS classification system combines the formulated features with the predicted result of each network flow and stores them in HDFS via the “ssk_ddos_retrain_data” topic.

4 Experimental setup

In this section, we describe the experimental setup of the proposed distributed SSK-DDoS classification system for DDoS attacks, which is shown in Fig. 7. For the design and validation of the proposed SSK-DDoS, we consider two source networks, two ISPs in the intermediate network, and one victim network. Each ISP receives the network traffic from the source network, generates network flows from it using CICFlowMeter, selects essential variables, scales the selected variables, formulates features from the scaled variables, and replicates the features to ‘SC-3’. The networks, clusters, and nodes are as follows:

Two source networks: Legitimate and DDoS attack traffic is directed towards the victim network via the ISPs.
Two ISP networks: In each ISP network, a two-node Spark Streaming cluster (‘SC-1’ or ‘SC-2’) is deployed to perform the preprocessing task on incoming network traffic.
Hadoop cluster: A two-node Hadoop cluster is deployed for storing the formulated features with the predicted class of each network flow and for retraining the existing model using a new set of samples.
Spark Streaming cluster (‘SC-3’): A two-node Spark Streaming cluster ‘SC-3’ is deployed in the victim network to classify network flows in real time.

Several Kafka topics have been created for publishing and consuming messages independently, based on the distributed publish-subscribe messaging system. In the ‘SC-1’ and ‘SC-2’ Spark Streaming clusters, two topics are created:

1. “ssk_ddos_flows”: for publishing network flows created by CICFlowMeter.
2. “ssk_ddos_features”: for publishing formulated features and replicating them to ‘SC-3’.

Further, in the ‘SC-3’ Spark Streaming cluster, three Kafka topics are created:

1. “ssk_ddos_features”: the classification model immediately consumes features from this topic to classify flows in real time.
2. “ssk_ddos_prediction”: for publishing the predicted class of the flows to take action.
3. “ssk_ddos_retrain_data”: for publishing formulated features with the predicted class of each flow to store in the HDFS.
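
How the topics connect the stages can be sketched with in-memory lists standing in for Kafka topics. This is a single-process illustration, not the deployed system: the `preprocess` and `classify` functions, the `bytes_per_packet` feature, and the threshold are hypothetical placeholders for the preprocessing clusters and the trained RF model.

```python
from collections import defaultdict

topics = defaultdict(list)  # in-memory stand-in for the Kafka topics named in the text

def preprocess(flow):
    """Stand-in for SC-1/SC-2: derive a single hypothetical feature from a flow."""
    return {"flow_id": flow["flow_id"], "bytes_per_packet": flow["bytes"] / flow["packets"]}

def classify(features):
    """Stand-in for the SC-3 model: a hypothetical threshold rule, not the trained RF."""
    return "DDoS-UDP" if features["bytes_per_packet"] > 1000 else "Benign"

def run_pipeline(flows):
    """Push flows through the three stages, one topic hand-off at a time."""
    for flow in flows:                                # producer agents
        topics["ssk_ddos_flows"].append(flow)
    for flow in topics["ssk_ddos_flows"]:             # SC-1 / SC-2: preprocessing
        topics["ssk_ddos_features"].append(preprocess(flow))
    for features in topics["ssk_ddos_features"]:      # SC-3: classification
        label = classify(features)
        topics["ssk_ddos_prediction"].append({"flow_id": features["flow_id"], "class": label})
        topics["ssk_ddos_retrain_data"].append({**features, "class": label})

# Two made-up flows.
run_pipeline([
    {"flow_id": 1, "bytes": 600, "packets": 10},
    {"flow_id": 2, "bytes": 30000, "packets": 20},
])
```

Each stage only reads from one topic and writes to another, which is what lets the preprocessing and classification clusters be deployed and scaled separately.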

5 Results and discussion

In this section, we evaluate the performance of our proposed SSK-DDoS classification system of DDoS attacks. The proposed SSK-DDoS classification system classifies network flows into seven classes.

We considered two cases for the performance evaluation of the proposed SSK-DDoS classification system: case (I), while designing the classification model of DDoS attacks, and case (II), after deployment of this classification model on a DSPF, i.e., Spark Streaming. For both, we measure three performance evaluation metrics for multi-class classification (in this use case, seven target classes): Precision ($P_{m\_class}$), Recall ($R_{m\_class}$), and F1-score ($F1S_{m\_class}$), defined as follows:

1. $P_{m\_class}=\frac{\sum _{i=1}^{n}\frac{TruePositive_i}{( TruePositive_i+FalsePositive_i)} }{n},$ where n = number of classes (in this use-case, seven classes)
2. $R_{m\_class}=\frac{\sum _{i=1}^{n}\frac{TruePositive_i}{( TruePositive_i+FalseNegative_i)} }{n}$
3. $F1S_{m\_class}= \frac{2*P_{m\_class}*R_{m\_class}}{(P_{m\_class}+R_{m\_class})}$
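
The three definitions above can be computed from a confusion matrix as in the following sketch. The orientation of the matrix (rows are true classes, columns are predicted classes) and the zero-division guards are assumptions made here; the three-class matrix contains made-up counts, not results from the paper.

```python
def macro_metrics(confusion):
    """Compute macro precision, recall and F1 from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    precisions, recalls = [], []
    for i in range(n):
        tp = confusion[i][i]
        fp = sum(confusion[r][i] for r in range(n)) - tp  # column total minus diagonal
        fn = sum(confusion[i]) - tp                       # row total minus diagonal
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / n
    r = sum(recalls) / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy three-class matrix (made-up counts).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
precision, recall, f1 = macro_metrics(confusion)
```

Note that, following the definition above, the F1-score is computed from the averaged precision and recall rather than by averaging per-class F1 values.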

We designed and validated the proposed classification model using the CICDDoS2019 dataset. For the evaluation of case-I, the description of class-wise network flows is given in Table 2. We designed this model using four Spark MLlib machine learning algorithms: DTC, MLR, NB, and RF. The multiclass confusion matrices are visualized in Fig. 8, and the evaluation metrics are given in Table 3. In terms of accuracy, RF (89.05%) performed better than the other three, i.e., MLR (43.28%), NB (69.39%), and DTC (87.61%). Further, we tuned the number of trees ($T=10,20,50$) for the RF algorithm and found that RF gives better accuracy for $T=50$ (89.05%) than for $T=10$ (87.89%) and $T=20$ (87.91%).

Table 2 Details of the CICDDoS2019 dataset for case-I


Table 3 Performance of SSK-DDoS for Case-I (while designing a distributed model using MLlib)


For the evaluation of case-II, we examined six scenarios with different combinations of the CICDDoS2019 dataset classes. The description of each scenario is presented in Table 4. After designing the classification model using various algorithms, the RF-based classification model ($T=50$) gave better classification accuracy than the MLR, NB, RF ($T=10$), RF ($T=20$), and DTC algorithms. Therefore, we deployed the RF-based classification model ($T=50$) on the ‘SC-3’ Spark Streaming cluster in the production environment. The performance evaluation of these six scenarios is given in Table 5, and their multi-class confusion matrices are visualized in Fig. 9.

Table 4 CICDDoS2019 dataset network flows details for Case-II (After deployment)


Table 5 Performance of SSK-DDoS for Case-II (After deployment)


In the performance evaluation of the proposed SSK-DDoS for case-II, the RF-based classification model ($T=50$) achieves the following accuracies: scenario-I: 99.44%, scenario-II: 87.09%, scenario-III: 91.04%, scenario-IV: 99.17%, scenario-V: 92.17%, and scenario-VI: 94.42%. From this, we conclude that the proposed classification model achieves 87%+ accuracy even when attackers launch different types of attacks concurrently on the victim system.

5.1 Complexity analysis

In traditional framework-based DDoS attack detection mechanisms, each network flow is analyzed at a single point. Therefore, the time complexity of the system is O(NNF), where NNF is the number of network flows analyzed by the system [63]. In a DPF/DSPF, however, the flow analysis task is distributed between n nodes, and hence the complexity is also distributed. To measure the complexity of the proposed system, we assume that each node examines an equal share of the network flows. Therefore, the complexity of the DPF/DSPF is $O(\frac{NNF}{n})$. In this case, we have to account for one more parameter, the intermediate communication cost between nodes, which we denote O(ICC). The combined complexity cost (CCC) of the DPF/DSPF is therefore $CCC = O(\frac{NNF}{n}) + O(ICC)$. However, a DPF/DSPF is specially designed to analyze a large amount of data, and O(ICC) is negligible compared with O(NNF). Therefore, the CCC of the DPF/DSPF-based DDoS attack detection system is $O(\frac{NNF}{n})$. This shows that the time complexity decreases as nodes are added to the cluster.
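
A quick numeric illustration of the combined cost follows, treating the asymptotic expression as a literal formula in arbitrary units. The communication cost of 10000 units is a hypothetical value chosen only to show that it stays small next to the per-node share, and the flow count is rounded from the training-set size mentioned in Sect. 3.2.2.

```python
def combined_cost(nnf, n_nodes, icc):
    """Per-node work under an even split, plus a communication term (arbitrary units)."""
    return nnf / n_nodes + icc

nnf = 35_000_000   # roughly the size of the up-sampled training set
icc = 10_000       # hypothetical fixed communication cost
costs = {n: combined_cost(nnf, n, icc) for n in (1, 2, 4, 8)}
```

With these assumed numbers the cost roughly halves each time the node count doubles, and the communication term only starts to matter once NNF/n shrinks toward ICC.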

5.2 Comparison with existing systems

In this section, we systematically compare the proposed SSK-DDoS classification system for DDoS attacks with existing DPF and traditional framework-based systems [34, 35, 37,38,39, 41,42,43,44,45, 47, 47,48,49, 57] in Tables 6 and 7.

Table 6 Comparison of SSK-DDoS with existing DPF/DSPF-based approaches


Table 7 Comparison of SSK-DDoS with the traditional framework-based approaches


Most of the DPF-based approaches [34, 35, 37,38,39, 44, 45, 47, 47, 48] for classifying DDoS attacks and legitimate traffic are deployed on the Apache Hadoop framework. This type of approach efficiently handles a large number of flows on a cluster of nodes. However, Apache Hadoop is mainly employed to examine large data in offline mode. Therefore, this type of classification approach cannot classify network packets in real time.

A few authors [41,42,43, 49, 57] have proposed Apache Spark-based classification approaches for DDoS attacks and legitimate traffic. These approaches examine network flows in near real-time, but they do not provide an automated way to take action on incoming traffic flows. In contrast, the proposed SSK-DDoS classification approach for DDoS attacks is not only designed on a DPF (using the Spark MLlib machine learning library on a Hadoop cluster) but also deployed on a DSPF (Spark Streaming), which makes the proposed system highly scalable. Further, we use Kafka’s distributed pub-sub messaging system, which makes the proposed SSK-DDoS classification system loosely coupled and automated.
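The loose coupling that a pub-sub topic provides can be sketched with a small in-memory stand-in. The example below does not use the Kafka or Spark Streaming APIs: `Topic`, `toy_classifier`, and `consume_in_micro_batches` are hypothetical names, the feature `packets_per_s` and its threshold are invented for illustration, and the rule-based classifier merely takes the place of the Spark MLlib model described in the paper. It only shows the pattern: a producer publishes flow records without knowing who reads them, and a consumer pulls them in micro-batches and attaches a predicted class.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Flow = Dict[str, float]


class Topic:
    """In-memory stand-in for a pub-sub topic (hypothetical, not the Kafka API)."""

    def __init__(self) -> None:
        self._messages: Deque[Flow] = deque()

    def publish(self, message: Flow) -> None:
        # The producer only appends; it has no reference to any consumer.
        self._messages.append(message)

    def poll(self, max_records: int) -> List[Flow]:
        # Return up to max_records messages in arrival order.
        batch: List[Flow] = []
        while self._messages and len(batch) < max_records:
            batch.append(self._messages.popleft())
        return batch


def toy_classifier(flow: Flow) -> str:
    # Placeholder rule with an assumed threshold; a trained model would go here.
    return "DDoS-UDP" if flow["packets_per_s"] > 1000 else "Benign"


def consume_in_micro_batches(
    topic: Topic, classify: Callable[[Flow], str], batch_size: int
) -> List[dict]:
    """Drain the topic batch by batch and label each flow with its predicted class."""
    labelled: List[dict] = []
    while True:
        batch = topic.poll(batch_size)
        if not batch:
            break
        for flow in batch:
            labelled.append({**flow, "predicted": classify(flow)})
    return labelled


if __name__ == "__main__":
    topic = Topic()
    for rate in (12.0, 5400.0, 80.0):
        topic.publish({"packets_per_s": rate})
    for record in consume_in_micro_batches(topic, toy_classifier, batch_size=2):
        print(record)
```

Because the producer and the consumer share only the topic, either side can be replaced or scaled without changing the other, which is the property the pub-sub layer contributes to the overall design.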

Sharafaldin et al. [56] generated a realistic dataset by considering various attack scenarios and proposed a detection approach to classify different types of DDoS attacks. According to their performance evaluation, the precision values for the ID3, RF, NB, and LR classifiers are 0.78, 0.77, 0.41, and 0.25, respectively, whereas our RF-based classification model achieves a higher precision (0.89).
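For readers comparing these figures, precision for a class is the number of flows correctly assigned to that class divided by the number of flows predicted as that class. The following sketch computes it per class from two label lists. The function names and the toy labels are illustrative only and are not taken from the evaluation; the text above does not state whether the reported values are macro-averaged or weighted, so the unweighted (macro) average shown here is an assumption.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_precision(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Precision per predicted class: true positives / all predictions of that class."""
    true_positive: Counter = Counter()
    predicted: Counter = Counter()
    for actual, guess in zip(y_true, y_pred):
        predicted[guess] += 1
        if actual == guess:
            true_positive[guess] += 1
    return {label: true_positive[label] / predicted[label] for label in predicted}


def macro_precision(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class precision values."""
    scores = per_class_precision(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only.
    actual = ["Benign", "Benign", "DDoS-UDP", "DDoS-SYN"]
    guessed = ["Benign", "DDoS-UDP", "DDoS-UDP", "DDoS-SYN"]
    print(per_class_precision(actual, guessed))
    print(macro_precision(actual, guessed))
```

In the toy data, one of the two flows predicted as DDoS-UDP is actually benign, so that class has a precision of 0.5 while the other two classes have a precision of 1.0.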

6 Conclusions

A distributed denial of service attack is one of the biggest threats to Internet-based services and their resources. It overwhelms the victim's resources in a short time by sending a large number of network packets. Traditional framework-based approaches can themselves become victims of attacks while classifying a massive amount of network flows. Further, most of the existing DPF-based classification systems for DDoS attacks were designed for offline mode and hence cannot classify network flows in real time.

This paper proposed a Spark Streaming and Kafka-based distributed classification system for DDoS attacks, named SSK-DDoS. This classification approach is designed using the distributed Spark MLlib machine learning library on a Hadoop cluster and deployed on the Spark Streaming platform to classify network traffic in real time into seven classes: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. Further, the system stores the formulated features together with the predicted class of each flow in HDFS so that the existing distributed classification model can be retrained using a new set of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that the proposed SSK-DDoS detection system efficiently (89.05%) classified network traffic into seven classes.

Data availability

The data are available in a public repository that issues datasets with DOIs (UNB Canadian Institute for Cybersecurity, CICDDoS2019): https://www.unb.ca/cic/datasets/ddos-2019.html

References

Arivudainambi, D., Varun Kumar, K.A., Chakkaravarthy, S.S.: Lion IDS: a meta-heuristics approach to detect DDOS attacks against software-defined networks. Neural Comput. Appl. 31(5), 1491–1501 (2019)
Gopi, R., Sathiyamoorthi, V., Selvakumar, S., Manikandan, R., Chatterjee, P., Jhanjhi, N., Luhach, A.K.: Enhanced method of ANN based model for detection of DDoS attacks on multimedia Internet of Things. Multimedia Tools Appl. (2021). https://doi.org/10.1007/s11042-021-10640-6
Behal, S., Kumar, K., Sachdeva, M.: D-FACE: an anomaly based distributed approach for early detection of DDoS attacks and flash events. J. Netw. Comput. Appl. 111, 49–63 (2018)
Bhandari, A., Kumar, K., Sangal, A., Behal, S.: An anomaly based distributed detection system for DDoS attacks in Tier-2 ISP networks. J. Ambient Intell. Human. Comput. (2020). https://doi.org/10.1007/s12652-020-02208-3
Kaspersky: DDoS attacks Q4-2020 (2021). https://securelist.com/ddos-attacks-in-q4-2020/100650/. Accessed 2 Mar 2021
Kaspersky: DDoS attacks Q3-2020 (2021). https://securelist.com/ddos-attacks-in-q3-2020/99171/. Accessed 2 Mar 2021
Apache Hadoop: https://hadoop.apache.org/. Accessed 10 Feb 2021
Bhardwaj, A., Singh, V.K., Narayan, Y.: Analyzing BigData with Hadoop cluster in HDInsight azure Cloud. In: Annual IEEE India Conference (INDICON), vol. 2015, pp. 1–5. IEEE (2015)
Apache Spark: https://spark.apache.org/. Accessed 10 Feb 2021
Chen, Y., He, F., Li, H., Zhang, D., Wu, Y.: A full migration BBO algorithm with enhanced population quality bounds for multimodal biomedical image registration. Appl. Soft Comput. 93, 106335 (2020)
Quan, Q., He, F., Li, H.: A multi-phase blending method with incremental intensity for training detection networks. Vis. Comput. 37(2), 245–259 (2021)
Zhang, S., He, F.: DRCDN: learning deep residual convolutional dehazing networks. Vis. Comput. 36(9), 1797–1808 (2020)
Li, H., He, F., Chen, Y., Pan, Y.: MLFS-CCDE: multi-objective large-scale feature selection by cooperative coevolutionary differential evolution. Memetic Comput. 13(1), 1–18 (2021)
Apache Kafka: https://kafka.apache.org/. Accessed 08 Feb 2021
Lashkari, A.H., Draper-Gil, G., Mamun, M.S.I., Ghorbani, A.A.: Characterization of Tor traffic using time based features. In: ICISSP, pp. 253–262 (2017)
Patil, N.V., RamaKrishna, C., Kumar, K.: Distributed frameworks for detecting distributed denial of service attacks: a comprehensive review, challenges and future directions. Concurr. Comput. Pract. Exp. 33(10), e6197 (2021)
Mirkovic, J., Reiher, P.: A taxonomy of DDoS attack and DDoS defense mechanisms. ACM SIGCOMM Comput. Commun. Rev. 34(2), 39–53 (2004)
Zargar, S.T., Joshi, J., Tipper, D.: A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks. IEEE Commun. Surveys Tutor. 15(4), 2046–2069 (2013)
Manavi, M.T.: Defense mechanisms against distributed denial of service attacks: a survey. Comput. Electr. Eng. 72, 26–38 (2018)
Peng, T., Leckie, C., Ramamohanarao, K.: Survey of network-based defense mechanisms countering the DoS or DDoS problems. ACM Comput. Surv. (CSUR) 39(1), 3 (2007)
Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K.: Network anomaly detection: methods, systems and tools. IEEE Commun. Surv. Tutor. 16(1), 303–336 (2014)
Douligeris, C., Mitrokotsa, A.: DDoS attacks and defense mechanisms: classification and state-of-the-art. Comput. Netw. 44(5), 643–666 (2004)
Hoque, N., Bhuyan, M.H., Baishya, R.C., Bhattacharyya, D.K., Kalita, J.K.: Network attacks: taxonomy, tools and systems. J. Netw. Comput. Appl. 40, 307–324 (2014)
Lee, S.: Distributed denial of service: taxonomies of attacks, tools and countermeasures. In: Proceedings of the International Workshop on Security in Parallel and Distributed Systems, pp. 543–550 (2004)
Bhatia, S., Behal, S., Ahmed, I.: Distributed denial of service attacks and defense mechanisms: current landscape and future directions. In: Versatile Cybersecurity, pp. 55–97. Springer, Cham (2018)
Mahjabin, T., Xiao, Y., Sun, G., Jiang, W.: A survey of distributed denial-of-service attack, prevention, and mitigation techniques. Int. J. Distrib. Sensor Netw. 13(12), 1550147717741463 (2017)
Behal, S., Kumar, K.: Characterization and comparison of DDoS attack tools and traffic generators: a review. IJ Netw. Security 19(3), 383–393 (2017)
Elejla, O.E., Anbar, M., Belaton, B.: ICMPv6-based DoS and DDoS attacks defense mechanisms. IETE Tech. Rev. 34(4), 390–407 (2017)
Fenil, E., Mohan Kumar, P.: Survey on DDoS defense mechanisms. Concurr. Comput. Pract. Exp. 32(6), e5114 (2019)
Singh, J., Behal, S.: Detection and mitigation of DDoS attacks in SDN: a comprehensive review, research challenges and future directions. Comput. Sci. Rev. 37, 100279 (2020)
Bouyeddou, B., Harrou, F., Kadri, B., Sun, Y.: Detecting network cyber-attacks using an integrated statistical approach. Clust. Comput. 24(2), 1435–1453 (2021)
Maharaja, R., Iyer, P., Ye, Z.: A hybrid fog-cloud approach for securing the Internet of Things. Clust. Comput. 23(2), 451–459 (2020)
Jyothsna, V., Prasad, K.M., Rajiv, K., Chandra, G.R.: Flow based anomaly intrusion detection system using ensemble classifier with feature impact scale. Clust. Comput. 24(4), 1–18 (2021)
Lee, Y., Lee, Y.: Detecting DDoS attacks with Hadoop. In: Proceedings of the ACM CoNEXT Student Workshop, p. 7. ACM, New York (2011)
Khattak, R., Bano, S., Hussain, S., Anwar, Z.: DOFUR: DDoS Forensics Using MapReduce. In: Frontiers of Information Technology (FIT), vol. 2011, pp. 117–120. IEEE (2011)
Zhao, T., Lo, D.C.-T., Qian, K.: A neural-network based DDoS detection system using Hadoop and HBase. In: High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), pp. 1326–1331. IEEE (2015)
Dayama, R., Bhandare, A., Ganji, B., Narayankar, V.: Secured network from distributed DoS through Hadoop. Int. J. Comput. Appl. 118(2), 20–22 (2015)
Hameed, S., Ali, U.: Efficacy of live DDoS detection with Hadoop. In: Network Operations and Management Symposium (NOMS), IEEE/IFIP, vol. 2016, pp. 488–494. IEEE (2016)
Hameed, S., Ali, U.: HADEC: a Hadoop based Live DDoS detection framework. EURASIP J. Inf. Security 2018(1), 1–19 (2018)
Hsieh, C.-J., Chan, T.-Y.: Detection DDoS attacks based on neural-network using Apache Spark. In: 2016 International Conference on Applied System Innovation (ICASI), pp. 1–4. IEEE (2016)
Alsirhani, A., Sampalli, S., Bodorik, P.: DDoS attack detection system: utilizing classification algorithms with Apache Spark. In: 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), pp. 1–7. IEEE (2018)
Alsirhani, S., Sampalli, A., Bodorik, P.: DDoS detection system: utilizing gradient boosting algorithm and Apache Spark. In: 2018 IEEE Canadian Conference on Electrical & Computer Engineering (CCECE), pp. 1–6. IEEE (2018)
Ahmad, S., Yasin, A., Shafi, Q.: DDoS attacks analysis in bigdata (Hadoop) environment. In: 2018 15th International Bhurban Conference on Applied Sciences and Technology (IBCAST), pp. 495–501. IEEE (2018)
Maheshwari, V., Bhatia, A., Kumar, K.: Faster detection and prediction of DDoS attacks using MapReduce and time series analysis. In: 2018 International Conference on Information Networking (ICOIN), pp. 556–561. IEEE (2018)
Chhabra, G.S., Singh, V., Singh, M.: Hadoop-based analytic framework for cyber forensics. Int. J. Commun. Syst. Wiley Online Library 31(15), e3772 (2018)
Patil, N.V., Krishna, C.R., Kumar, K., Behal, S.: E-had: a distributed and collaborative detection framework for early detection of DDoS attacks. J. King Saud Univ. Comput. Inf. Sci. (2019). https://doi.org/10.1016/j.jksuci.2019.06.016
Patil, N.V., Krishna, C.R., Kumar, K., Behal, S.: Apache hadoop based distributed denial of service detection framework. In: Information, Communication and Computing Technology, pp. 25–35. Springer, Singapore (2019)
Sharma, A., Agrawal, C., Singh, A., Kumar, K.: Real-time DDoS detection based on entropy using Hadoop framework. In: Computer Engineering and Technology, pp. 297–305. Springer (2019)
Patil, N.V., Rama-Krishna, C., Kumar, K.: S-DDoS: Apache Spark based real-time DDoS detection system. J. Intell. Fuzzy Syst. 38, 1–9 (2020)
Vani, Y.K., Ranjana, P.: Detection of distributed denial of service attack using DLMN algorithm in hadoop. J. Crit. Rev. 7(11), 1011–1017 (2020)
Chen, L., Zhang, Y., Zhao, Q., Geng, G., Yan, Z.: Detection of dns ddos attacks with random forest algorithm on spark. Procedia Comput. Sci. 134, 310–315 (2018)
Gumaste, S., Narayan, D., Shinde, S., Amit, K.: Detection of ddos attacks in openstack-based private cloud using apache spark. J. Telecommun. Inf. Technol. 4, 62–71 (2020)
Ahmed, A., Hameed, S., Rafi, M., Mirza, Q.K.A.: An intelligent and time-efficient DDoS identification framework for real-time enterprise networks SAD-F: spark based anomaly detection framework. IEEE Access 8, 219483–219502 (2020)
Jain, M., Kaur, G.: Distributed anomaly detection using concept drift detection based hybrid ensemble techniques in streamed network data. Clust. Comput. (2021). https://doi.org/10.1007/s10586-021-03249-9
Kshirsagar, D., Kumar, S.: A feature reduction based reflected and exploited DDoS attacks detection system. J. Ambient Intell. Human. Comput. (2021). https://doi.org/10.1007/s12652-021-02907-5
Sharafaldin, I., Lashkari, A.H., Hakak, S., Ghorbani, A.A.: Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy. In: 2019 International Carnahan Conference on Security Technology (ICCST), pp. 1–8. IEEE (2019)
Han, D., Bi, K., Liu, H., Jia, J.: A DDoS attack detection system based on spark framework. Comput. Sci. Inf. Syst. 14(3), 769–788 (2017)
Sree and Bhanu, S.M.S.: Detection of HTTP flooding attacks in cloud using fuzzy bat clustering. Neural Comput. Appl. (2019). https://doi.org/10.1007/S00521-019-04473-6
Behal, S., Kumar, K., Sachdeva, M.: D-FAC: a novel ϕ-divergence based distributed DDoS defense system. J. King Saud Univ. Comput. Inf. Sci. 33(3), 291–303 (2018)
de Lima Filho, F.S., Silveira, F.A., de Medeiros Brito Junior, A., Vargas-Solar, G., Silveira, L.F.: Smart detection: an online approach for DoS/DDoS attack detection using machine learning. Security Commun. Netw. 2019, 1574749 (2019)
Marvi, M., Arfeen, A., Uddin, R.: A generalized machine learning-based model for the detection of DDoS attacks. Int. J. Netw. Manag. 31(6), e2152 (2020)
Joldzic, O., Djuric, Z., Vuletic, P.: A transparent and scalable anomaly-based DoS detection method. Comput. Netw. 104, 27–42 (2016)
Brent, R.P., Zimmermann, P.: Modern Computer Arithmetic, vol. 18. Cambridge University Press, Cambridge (2010)

Author information

Authors and Affiliations

Computer Science & Engineering, National Institute of Technical Teachers Training & Research, Chandigarh, Panjab University, Chandigarh, India
Nilesh Vishwasrao Patil & C. Rama Krishna
University Institute of Engineering & Technology, Panjab University, Chandigarh, India
Krishan Kumar

Corresponding author

Correspondence to Nilesh Vishwasrao Patil.

Ethics declarations

Conflict of interest

The authors declared that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Patil, N.V., Krishna, C.R. & Kumar, K. SSK-DDoS: distributed stream processing framework based classification system for DDoS attacks. Cluster Comput 25, 1355–1372 (2022). https://doi.org/10.1007/s10586-022-03538-x

Received: 03 June 2021
Revised: 04 January 2022
Accepted: 05 January 2022
Published: 17 January 2022
Issue Date: April 2022
DOI: https://doi.org/10.1007/s10586-022-03538-x
