1 Introduction

Over the past decade, companies have moved their services online to grow revenue, making them accessible to users anywhere and at any time. In recent years, the number of Internet subscribers and connected devices has also grown rapidly. However, this growth has brought unsafe network routes and insecure connected devices. Attackers exploit this opportunity to compromise numerous nodes and form botnets that launch distributed denial of service (DDoS) attacks on victim systems.

1.1 DDoS attacks

A DDoS attack is one of the biggest threats to Internet-based applications and their resources [1, 2]. The motive of this attack is to overwhelm Internet-based services by transmitting a large amount of attack traffic [3, 4]. A typical example of a DDoS attack on a victim system is presented in Fig. 1. Here, a master takes control of various slaves with the help of handler programs. The handler is an intermediary program between the master and the slave nodes that helps to perform a large-scale DDoS attack on victim applications.

Fig. 1 A typical example of DDoS attack

1.2 Summary of DDoS attack events

Every country has been struggling with the COVID-19 pandemic since January 2020. During this pandemic, people have been working, shopping, and seeking entertainment online. Attackers have used this opportunity to compromise numerous nodes and form botnets. The Q4-2020 DDoS attack statistical report [5] is summarized as follows:

  1. Countries that experienced the most attacks: China (44%+), USA (23%+), and Hong Kong (7%+).

  2. The highest number of attacks was reported on Dec 31, 2020, i.e., 1349 incidents.

  3. After an exception in the last few quarters, Linux-based botnets were once again used to launch every DDoS attack.

  4. The majority of C&C servers were located in the USA (36%+), the Netherlands (19%+), and Germany (8%+).

  5. Once again, the largest number of incidents was observed on Thursdays, and the number dropped on Sundays.

The country-wise distribution of DDoS attack incidents for Q3-2020 and Q4-2020 is given in Fig. 2. From this, we can conclude that both the frequency and the strength of attacks are increasing year after year. Further, the attack strength has shifted from Gbps to Tbps. Therefore, researchers face one more challenge: systematically analyzing such a large volume of traffic.

Fig. 2 Comparison of Q3-2020 and Q4-2020 country-wise statistics distribution of DDoS attacks [5, 6]

1.3 Challenges

In this big data world, traditional framework-based DDoS attack detection approaches themselves become victims while examining a massive number of packets. Therefore, there is a need to deploy the detection approach on a distributed stream processing framework (DSPF). A DSPF can handle (store and analyze) a large volume of data in real-time by employing multiple nodes. Further, data transfer between nodes, secure communication protocols, and metadata are systematically managed by the DSPF. DDoS attack detection systems based on traditional and distributed processing frameworks (DPF) are designed to examine flows in offline mode. Therefore, these approaches fail to analyze incoming streams in real-time. Additionally, most of the approaches have been tested on outdated datasets. Therefore, there is a need to design a distributed classification model using a recent dataset and deploy it on a DSPF (such as the Spark Streaming platform).

1.4 Open-source technologies

In this section, we summarize the open-source technologies required to design the proposed SSK-DDoS classification system for DDoS attacks. We split this section into four sub-sections: Apache Hadoop, Apache Spark Streaming, Apache Kafka, and CICFlowMeter.

A good DSPF must have the following features:

  1. Analyze streaming data, such as network traffic flows, as it arrives and take immediate action based on the prediction.

  2. Support real-time applications with a loosely-coupled architecture, so that multiple publishers and consumers can independently access the application without delay.

  3. Provide distributed data analysis, extremely low latency, reliability, scalability, fault tolerance, etc.

1.4.1 Apache Hadoop

Apache Hadoop [7, 8] is a powerful DPF for storing and analyzing a large amount of data. It is designed to analyze data using batch processing on a cluster of nodes. It consists of three major modules:

  1. Hadoop Distributed File System (HDFS): It stores a large amount of data on a cluster of nodes called datanodes. The data is divided into multiple blocks and systematically stored on the datanodes. Further, metadata about each block is stored in the namenode.

  2. Yet Another Resource Negotiator (YARN): This module allocates resources for analyzing a large amount of data.

  3. MapReduce: It is a programming model for analyzing a large amount of data in a distributed manner.

1.4.2 Apache Spark streaming

Apache Spark [9] is a large-scale data analytics engine. It provides an API for processing large data. Spark Streaming is an extension of the core Spark API for developing real-time applications. The Apache Spark Streaming platform is commonly used:

  1. To design real-time applications that analyze a large amount of data as it arrives.

  2. To respond immediately to streaming data so that action can be taken without delay.

Apache Spark consists of four essential components: Spark SQL, MLlib, GraphX, and Spark Streaming. These four components can be combined to design a machine learning-based real-time application. The Spark Machine Learning Library (MLlib) is a distributed in-memory machine learning library. It provides:

  1. A way to design a model in a distributed manner.

  2. Robust APIs.

  3. High scalability for the machine learning model when deployed on DPF/DSPF.

  4. Support for various programming languages: Python, Java, Scala, etc.

Several tools/techniques, such as Python, Java, R, and WEKA, are available to design traditional and non-traditional machine learning models. Further, a few authors [10,11,12,13] have systematically discussed machine/deep learning methods and feature selection. However, a model designed using these techniques faces scalability issues when deployed on DPF/DSPF. The Spark MLlib machine learning library provides a way to design a distributed, in-memory machine learning model. This type of model is designed for deployment on DPF/DSPF (Hadoop, Kafka, Spark, etc.). Therefore, we implement a distributed classification approach for DDoS attacks using MLlib and deploy it on the Spark Streaming platform.

1.4.3 Apache Kafka

Apache Kafka [14] is an open-source, distributed, high-throughput publish-subscribe messaging system. It consists of six essential components: Brokers, ZooKeeper, Topics, Partitions, Publishers, and Subscribers. The publishing/consuming feature of Kafka helps to give real-time applications a loosely-coupled architecture.

1.4.4 CICFlowMeter

CICFlowMeter [15] is an open-source network flow generator tool. It creates network flows in offline (from PCAP files) and online (from network interfaces) modes. For each flow, it derives 83 attributes from the network traffic and stores them in a CSV file. An example of CICFlowMeter collecting network packets through the network interface card and generating network flows from them is presented in Fig. 3.

Fig. 3 CICFlowMeter: Capture incoming network traffic

1.5 Contributions

The significant contributions of this paper are as follows:

  • We propose a novel Spark Streaming and Kafka based classification system for DDoS attacks called SSK-DDoS.

  • SSK-DDoS is a distributed, real-time classification approach built using distributed Spark MLlib machine learning algorithms on the Hadoop cluster and deployed on the Spark Streaming clusters to classify network flows in real-time.

  • It stores the formulated features of each network flow with the predicted class in the HDFS so that the model can be retrained using a new set of samples.

  • The proposed SSK-DDoS classification system distributes the computational overhead, i.e., the preprocessing and classification tasks on network traffic, between multiple nodes of the Spark clusters.

  • The proposed distributed SSK-DDoS runs in an automated fashion: incoming network flows are published on Kafka topics, essential variables are selected, features are formulated from the selected variables, the classification job is performed, and finally the predictions are published on a Kafka topic so that action can be taken in real-time.

  • The proposed SSK-DDoS classification approach is designed and validated using the recent CICDDoS2019 dataset.

  • The proposed SSK-DDoS is a highly scalable approach and provides a loosely-coupled architecture.

The rest of the paper is organized as follows. A summary of related works is presented in Sect. 2. Section 3 presents a novel distributed SSK-DDoS classification system for DDoS attacks. Section 4 provides testbed information for the classification approach. Results and analysis are presented in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related work

Numerous security approaches are available in the literature to protect victim systems from different DDoS attacks. Patil et al. [16] systematically classified DDoS attack detection approaches into two broad classes based on their deployment frameworks: traditional and DPF-based detection approaches. Several authors [17,18,19,20,21,22,23,24,25,26,27,28,29,30] have systematically summarized traditional framework-based approaches, and a few recent systems are described in [31,32,33]. However, only a few authors [16] have specifically addressed DPF-based approaches. The DPF (batch processing) and DSPF (real-time) themselves have distributed designs to store and analyze a massive volume of data on a cluster of nodes. Some authors [34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54] have proposed DPF- and DSPF-based approaches. However, most of them are deployed on the DPF. These detection approaches efficiently analyze a large number of packets and classify them in a short time, but they are not capable of classifying network flows in real-time. They are useful for historical data analysis and for retraining the distributed model. Therefore, if the use-case demands classifying network flows in real-time, one needs to deploy the detection approach on a DSPF (such as the Spark Streaming platform).

We have drawn the following inferences from the existing works related to DPF/DSPF:

  • Most of the systems are designed and tested in offline mode. Therefore, there is a need to deploy a classification model for DDoS attacks on a DSPF, such as Apache Spark Streaming, that analyzes network traffic in real-time.

  • A few researchers designed their classification models using shallow and deep learning algorithms. These models performed exceptionally well when deployed on traditional frameworks. However, such models face scalability issues when deployed on DPF/DSPF. Therefore, there is a need to implement a distributed model using a distributed machine learning library that provides high scalability even when the model is deployed on DPF/DSPF.

  • Most of the DPF/DSPF-based DDoS approaches efficiently analyzed a huge number of network flows on a group of nodes by distributing the analysis task over multiple systems.

  • Most of the existing DPF/DSPF-based DDoS mechanisms employed a counter-based detection methodology for identifying high-volume attacks. Therefore, this type of system fails to recognize low-volume DDoS attacks.

  • Most of the DPF/DSPF and traditional framework-based DDoS mechanisms are validated using outdated datasets. Only a few authors [55] designed their system using a recent dataset. Therefore, there is a need for a new classification approach that is validated using recent datasets, such as CICDDoS2019.

3 SSK-DDoS: Spark Streaming and Kafka based classification system for DDoS attacks

This section presents the functioning of the proposed SSK-DDoS classification system for DDoS attacks. The logical architecture of SSK-DDoS is given in Fig. 4.

Fig. 4 Logical architecture of the proposed distributed SSK-DDoS classification system for DDoS attacks

The distributed SSK-DDoS classification system for DDoS attacks consists of three Spark Streaming clusters: ‘SC-1’, ‘SC-2’, and ‘SC-3’. The two Spark clusters ‘SC-1’ and ‘SC-2’ are deployed in the intermediate network, i.e., at ISP-1 and ISP-2, respectively. Their primary job is to preprocess the incoming network traffic and pass it on to ‘SC-3’. The ‘SC-3’ cluster is deployed in the victim network, and its job is to classify flows into seven classes. In the first step, producer agents (at ISP-1 and ISP-2) continuously publish network flows generated by CICFlowMeter onto the “ssk_ddos_flows” topic. Both ‘SC-1’ and ‘SC-2’ immediately consume flows from this topic. In the second step, they extract essential variables from the flows, formulate features using the extracted variables, and publish them on the “ssk_ddos_features” topic. The ‘SC-3’ cluster then immediately consumes the formulated features of each flow from “ssk_ddos_features”, classifies them into seven classes, and publishes the predicted class on the “ssk_ddos_prediction” topic so that action can be taken. Further, the system stores the formulated features of each flow with the predicted class in the HDFS, which helps to retrain the distributed classification model using a new set of samples. Highlights of the proposed distributed SSK-DDoS classification system are as follows:

  • Provides a loosely-coupled architecture, as it uses a distributed publish-subscribe messaging system for communication

  • Analyzes network traffic flows in real-time using the Spark Streaming API

  • Distributes the computational overhead between three clusters

  • Stores the formulated features of each flow with their predicted class in the HDFS for retraining the existing classification model using a new set of samples

The detection approach of the proposed SSK-DDoS classification system is split into two parts: the preprocessing task and the classification task.

Fig. 5 SSK-DDoS classification model: design flow

3.1 Preprocessing task

The role of the ‘SC-1’ and ‘SC-2’ clusters is to consume network traffic, generate network flows using CICFlowMeter, select significant variables, scale the selected variables, formulate features using the scaled variables, and finally publish them on the “ssk_ddos_features” topic. ‘SC-1’ and ‘SC-2’ each have a separate Kafka topic with the same name, “ssk_ddos_features”. We split this section into three sub-sections: creating network flows, scaling variables, and formulating features.

3.1.1 Create network flows using CICFlowMeter

CICFlowMeter generates network flows with 83 attributes from incoming traffic and writes them to a CSV file. We employ producer agents to pick up each entry from the CSV file immediately and publish the flows on the “ssk_ddos_flows” topic. The next task performed by the ‘SC-1’ and ‘SC-2’ clusters is to select 23 significant variables from each flow. In [56], 24 significant variables are used to classify flows into different classes. However, two of these 24 variables, Fwd_Header_Length and Fwd_Header_Length.1, appear to be duplicate columns. Further, the current version of CICFlowMeter no longer includes the Fwd_Header_Length.1 variable in the generated network flows. Therefore, we selected 23 variables from the variable list of each network flow.
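The variable selection step can be illustrated with a minimal Python sketch. It assumes the flows arrive as CSV rows with a header line; the column name Flow_Duration and the two-element selection list are hypothetical placeholders for the 23 selected variables, and only the two Fwd_Header_Length names come from the text above.

```python
import csv
import io

# Hypothetical stand-in for the 23 selected variables (assumption).
SELECTED_VARIABLES = ["Flow_Duration", "Fwd_Header_Length"]


def select_variables(row, selected=SELECTED_VARIABLES):
    """Keep only the selected variables of one flow and cast them to float."""
    return {name: float(row[name]) for name in selected}


# One toy flow; the duplicate column Fwd_Header_Length.1 is simply not selected.
csv_text = (
    "Flow_Duration,Fwd_Header_Length,Fwd_Header_Length.1,Label\n"
    "120,40,40,Benign\n"
)
rows = [select_variables(r) for r in csv.DictReader(io.StringIO(csv_text))]
print(rows)
```

Selecting columns by name rather than by position keeps the step robust when a tool version drops or reorders a column, as happened with Fwd_Header_Length.1.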

3.1.2 Scaling data values

The next job performed by both clusters is to scale the data values of the twenty-three variables to a common scale. The data points are scaled with the “MinMax” technique provided by “sklearn.preprocessing”. After scaling, the data point values lie between 0 and 1. The scaling formula is:

$$\begin{aligned} \textit{Norm}\_\textit{Data}_i = \frac{\textit{DataVal}_i - \textit{min} (\textit{DataVal})}{\textit{max}(\textit{DataVal}) - \textit{min}(\textit{DataVal})} \end{aligned}$$
(1)
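As a minimal sketch of Eq. (1), the following plain-Python function scales one variable to the range 0 to 1. The sample values are hypothetical, and the handling of a constant column (mapped to 0.0) is an assumption that Eq. (1) itself does not cover.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Scale the values of one variable to [0, 1] following Eq. (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column maps to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of a single flow variable.
data_val = [10.0, 20.0, 15.0, 30.0]
scaled = min_max_scale(data_val)
print(scaled)  # [0.0, 0.5, 0.25, 1.0]
```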

3.1.3 Features formulation

Both ‘SC-1’ and ‘SC-2’ formulate ten features from the 23 selected variables. This helps to enhance the accuracy and speed up the design process of the classification model. A summary of each feature is given in Table 1. The features formulated by ‘SC-1’ and ‘SC-2’ are then replicated to ‘SC-3’.

Table 1 Description of formulated features

3.2 Classification task

In this section, we present the distributed classification approach of the proposed SSK-DDoS for identifying various types of attacks: DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. The approach is designed using the CICDDoS2019 dataset and four distributed machine learning algorithms from the Spark MLlib library: DecisionTreeClassifier (DTC), Naive Bayes (NB), Multinomial Logistic Regression (MLR), and Random Forest (RF). The Spark MLlib library provides an RF classifier for both binary and multiclass classification. It allows a model to be designed in a distributed manner with millions or even billions of samples. RF is an ensemble classifier that consists of multiple trees (classifiers), and each tree is built on a different set of features. Gradient-Boosted Trees (GBT) is also an ensemble classifier and helps to improve accuracy. However, the Spark MLlib library provides this algorithm only for binary classification, whereas our classification approach has seven target classes. Therefore, this algorithm does not fit our use-case. We deployed an RF-based classification approach on ‘SC-3’ for classifying flows into seven classes: Benign (One), DDoS-DNS (Two), DDoS-LDAP (Three), DDoS-MSSQL (Four), DDoS-NetBIOS (Five), DDoS-UDP (Six), and DDoS-SYN (Seven).

The primary objective of this classification approach is to classify network flows in real-time. We split the proposed classification approach into two parts: (i) the design process of a distributed classification model using the Spark MLlib library on the Hadoop cluster and (ii) the after-deployment process, in which the classification model runs in the ‘SC-3’ Spark Streaming cluster to classify network flows in real-time. The step-by-step workflow of the proposed classification model is presented in Figs. 5 (design process) and 6 (after-deployment process).

Fig. 6 SSK-DDoS classification model: after deployment flow

We divide this section into three sub-sections: details of the CICDDoS2019 dataset, the design process, and the after-deployment process of the classification model.

3.2.1 CICDDoS2019 dataset

The CICDDoS2019 [56] dataset is a joint project of the “Canadian Communications Security Establishment (CSE) and Canadian Institute for Cybersecurity (CIC)”. It includes both benign traffic and various types of DDoS attack scenarios. The dataset is available as both PCAP and CSV files, i.e., raw packets and labeled network flows, respectively. However, the CSV files have several issues. Therefore, we generated network flows from the PCAP files for various scenarios such as DDoS-UDP, DDoS-LDAP, DDoS-DNS, DDoS-SYN, DDoS-MSSQL, DDoS-NetBIOS, and Benign using the CICFlowMeter flow generator tool. The newly generated network flows contain 83 variables and one label column, which has to be updated according to the attack-wise schedule of the PCAP files given on the dataset portal.

3.2.2 SSK-DDoS: design process

The step-by-step process to implement a distributed classification model for DDoS attacks using MLlib library is shown in Fig. 5. For designing this model, we assembled PCAP files of DDoS-UDP, DDoS-LDAP, DDoS-DNS, DDoS-SYN, DDoS-MSSQL, DDoS-NetBIOS, and Benign. The number of flows in each class is Benign: 56863, DDoS-DNS: 5071011, DDoS-LDAP: 2179930, DDoS-MSSQL: 4522492, DDoS-NetBIOS: 4093279, DDoS-UDP: 3134645, and DDoS-SYN: 1582289.

However, the number of flows in each class is highly imbalanced, which affects the accuracy of the classification model. We up-sampled some classes to 5071011 flows. As a result, the sample contains more than 35 million flows, which are stored in the HDFS. The next step is to implement a distributed classification model for DDoS attacks. We designed this classification model using Spark MLlib machine learning algorithms: DTC, MLR, NB, and RF. We then deploy this model on the Spark Streaming cluster. The next task is to calculate the performance evaluation metrics: precision, recall, and F1-score. The performance evaluation of these algorithms is discussed in Sect. 5. Finally, we save this model in persistent storage for deployment in the ‘SC-3’ Spark Streaming cluster to analyze flows in real-time.
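The balancing step can be sketched as random over-sampling with replacement. This is a minimal illustration under the assumption that every class is brought up to the size of the largest class; the paper does not state the exact sampling procedure, and the toy flows below are hypothetical.

```python
import random
from typing import Dict, List


def upsample(flows_by_class: Dict[str, List[int]], seed: int = 0) -> Dict[str, List[int]]:
    """Randomly repeat flows until every class reaches the largest class size."""
    rng = random.Random(seed)
    target = max(len(flows) for flows in flows_by_class.values())
    balanced = {}
    for label, flows in flows_by_class.items():
        # Draw the missing flows with replacement from the same class.
        extra = [rng.choice(flows) for _ in range(target - len(flows))]
        balanced[label] = flows + extra
    return balanced


# Toy data: integers stand in for flow records (assumption).
toy = {"Benign": [1, 2], "DDoS-DNS": [3, 4, 5, 6, 7]}
balanced = upsample(toy)
print({label: len(flows) for label, flows in balanced.items()})

# If all seven classes are raised to 5071011 flows, the total is:
print(7 * 5071011)  # 35497077
```

Under that assumption the total of 35497077 flows agrees with the figure of more than 35 million flows reported above.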

3.2.3 SSK-DDoS: classification process in real-time (after deployment)

The second part of the classification approach is to classify incoming network traffic into seven classes. Figure 6 shows the step-by-step process of the proposed classification approach after deployment in ‘SC-3’. CICFlowMeter generates network flows from incoming network traffic. Then, producer agents continuously publish the created flows on the “ssk_ddos_flows” topic. Both ‘SC-1’ and ‘SC-2’ immediately consume the published flows and select twenty-three variables from the list of eighty-three variables. In the next step, ‘SC-1’ and ‘SC-2’ scale the data values of the variables, formulate features using the scaled variables, and publish them on the “ssk_ddos_features” topic. The distributed classification model then immediately consumes the messages from “ssk_ddos_features” and classifies them into seven classes: DDoS-UDP, DDoS-LDAP, DDoS-DNS, DDoS-SYN, DDoS-MSSQL, DDoS-NetBIOS, and Benign. Finally, the proposed classification approach publishes the predicted class on the “ssk_ddos_prediction” topic so that immediate action can be taken on incoming network flows. Further, the distributed SSK-DDoS classification system combines the formulated features with the predicted result of each network flow and stores them in the HDFS with the help of the “ssk_ddos_retrain_data” topic.
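The message flow between topics can be illustrated with a small in-memory sketch. The broker class below is a toy stand-in and not the Kafka client API; the feature pkt_rate and the threshold rule in classify are hypothetical placeholders for the formulated features and the trained RF model. Only the topic names are taken from the text.

```python
from collections import defaultdict, deque
from typing import Dict, Iterator


class InMemoryBroker:
    """Toy stand-in for a publish-subscribe broker (NOT the Kafka client API)."""

    def __init__(self) -> None:
        self.topics: Dict[str, deque] = defaultdict(deque)

    def publish(self, topic: str, message: dict) -> None:
        self.topics[topic].append(message)

    def consume(self, topic: str) -> Iterator[dict]:
        # Drain every message currently queued on the topic, oldest first.
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()


def preprocess(flow: dict) -> dict:
    # Hypothetical feature: packets per second of the flow.
    return {"flow_id": flow["flow_id"], "pkt_rate": flow["packets"] / flow["duration_s"]}


def classify(features: dict) -> str:
    # Placeholder rule standing in for the deployed classification model.
    return "DDoS-UDP" if features["pkt_rate"] > 1000 else "Benign"


broker = InMemoryBroker()
broker.publish("ssk_ddos_flows", {"flow_id": 1, "packets": 10, "duration_s": 2})
broker.publish("ssk_ddos_flows", {"flow_id": 2, "packets": 50000, "duration_s": 10})

# Preprocessing stage (role of 'SC-1' and 'SC-2').
for flow in broker.consume("ssk_ddos_flows"):
    broker.publish("ssk_ddos_features", preprocess(flow))

# Classification stage (role of 'SC-3').
for feats in broker.consume("ssk_ddos_features"):
    label = classify(feats)
    broker.publish("ssk_ddos_prediction", {"flow_id": feats["flow_id"], "label": label})
    broker.publish("ssk_ddos_retrain_data", {**feats, "label": label})

predictions = list(broker.consume("ssk_ddos_prediction"))
print(predictions)
```

Because each stage only reads from and writes to named topics, a stage can be replaced or scaled out without changing the others, which is the loosely-coupled property the architecture relies on.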

4 Experimental setup

In this section, we describe the experimental setup of the proposed distributed SSK-DDoS classification system for DDoS attacks, shown in Fig. 7. For the design and validation of the proposed SSK-DDoS, we consider two source networks, two ISPs in the intermediate network, and one victim network. Each ISP receives network traffic from the source network, generates network flows from the incoming traffic using CICFlowMeter, selects essential variables, scales the selected variables, formulates features using the scaled variables, and replicates the features to ‘SC-3’. The networks, clusters, and nodes are as follows:

  • Two source networks: Legitimate and DDoS attack traffic is sent towards the victim network via the ISPs.

  • Two ISP networks: In each ISP network, a two-node Spark Streaming cluster (‘SC-1’ and ‘SC-2’) is deployed to perform the preprocessing task on incoming network traffic.

  • Hadoop cluster: A two-node Hadoop cluster is deployed for storing the formulated features with the predicted class of each network flow and for retraining the existing model using a new set of samples.

  • Spark Streaming cluster (‘SC-3’): A two-node Spark Streaming cluster ‘SC-3’ is deployed in the victim network to classify network flows in real-time.

Several Kafka topics have been created for publishing and consuming messages independently through the distributed publish-subscribe messaging system. In the ‘SC-1’ and ‘SC-2’ Spark Streaming clusters, two topics are created:

  1. “ssk_ddos_flows”: for publishing network flows created by CICFlowMeter.

  2. “ssk_ddos_features”: for publishing formulated features and replicating them to ‘SC-3’.

Further, in the ‘SC-3’ Spark Streaming cluster, three Kafka topics are created:

  1. “ssk_ddos_features”: the classification model immediately consumes features from this topic to classify flows in real-time.

  2. “ssk_ddos_prediction”: for publishing the predicted class of the flows so that action can be taken.

  3. “ssk_ddos_retrain_data”: for publishing formulated features with the predicted class of each flow to store in the HDFS.

Fig. 7 Testbed for the proposed SSK-DDoS classification system for DDoS attacks

5 Results and discussion

In this section, we evaluate the performance of our proposed SSK-DDoS classification system of DDoS attacks. The proposed SSK-DDoS classification system classifies network flows into seven classes.

We considered two cases for the performance evaluation of the proposed SSK-DDoS classification system: case (I), while designing the classification model of DDoS attacks, and case (II), after deployment of this classification model on a DSPF, i.e., Spark Streaming. For this, we measure three performance evaluation metrics for multi-class classification (in this use-case, seven target classes). The mathematical definitions of Precision (\(P_{m\_class}\)), Recall (\(R_{m\_class}\)), and F1-score (\(F1S_{m\_class}\)) are as follows:

  1. \(P_{m\_class}=\frac{\sum _{i=1}^{n}\frac{TruePositive_i}{( TruePositive_i+FalsePositive_i)} }{n},\) where n is the number of classes (in this use-case, seven classes)

  2. \(R_{m\_class}=\frac{\sum _{i=1}^{n}\frac{TruePositive_i}{( TruePositive_i+FalseNegative_i)} }{n}\)

  3. \(F1S_{m\_class}= \frac{2*P_{m\_class}*R_{m\_class}}{(P_{m\_class}+R_{m\_class})}\)
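The three definitions above can be sketched in plain Python from a multi-class confusion matrix. The 3-class matrix is a hypothetical toy example, and returning 0.0 for a class with no predictions or no samples is an assumption that the formulas do not specify.

```python
from typing import List, Tuple


def macro_metrics(cm: List[List[int]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) averaged over classes.

    cm[i][j] is the number of flows of true class i predicted as class j.
    """
    n = len(cm)
    precisions, recalls = [], []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, true class differs
        fn = sum(cm[i]) - tp                       # true i, predicted class differs
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / n
    r = sum(recalls) / n
    # F1 is computed from the averaged precision and recall, as defined above.
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical confusion matrix for three classes.
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
p, r, f1 = macro_metrics(cm)
print(round(p, 4), round(r, 4), round(f1, 4))
```

Note that this F1-score is derived from the class-averaged precision and recall, which can differ slightly from averaging per-class F1-scores.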

We designed and validated the proposed classification model using the CICDDoS2019 dataset. For the evaluation of case-I, the description of class-wise network flows is given in Table 2. We designed this model using four Spark MLlib machine learning algorithms: DTC, MLR, NB, and RF. The multi-class confusion matrices are shown in Fig. 8 and the evaluation metrics in Table 3. RF (89.05%) achieves better accuracy than the other three algorithms, i.e., MLR (43.28%), NB (69.39%), and DTC (87.61%). Further, we tuned the number of trees (\(T=10,20,50\)) for the RF algorithm. We found that RF gives better accuracy for \(T=50\) (89.05%) than for \(T=10\) (87.89%) and \(T=20\) (87.91%).

Table 2 Details of the CICDDoS2019 dataset for case-I
Table 3 Performance of SSK-DDoS for Case-I (while designing a distributed model using MLlib)
Fig. 8 Multi-class confusion matrices for Case-I (while designing a distributed model)

For the evaluation of case-II, we examined six scenarios with different combinations of the CICDDoS2019 dataset classes. The description of each scenario is presented in Table 4. Among the classification models designed using the various algorithms, the RF-based classification model (\(T=50\)) gave better classification accuracy than MLR, NB, RF (\(T=10\)), RF (\(T=20\)), and DTC. Therefore, we deployed the RF-based classification model (\(T=50\)) on the ‘SC-3’ Spark Streaming cluster in the production environment. The performance evaluation of these six scenarios is given in Table 5, and their multi-class confusion matrices are shown in Fig. 9.

Table 4 CICDDoS2019 dataset network flows details for Case-II (After deployment)
Table 5 Performance of SSK-DDoS for Case-II (After deployment)
Fig. 9 Multi-class confusion matrices for Case-II (After deployment)

In the performance evaluation of the proposed SSK-DDoS for case-II, the RF-based classification model (\(T=50\)) achieves the following accuracies: scenario-I: 99.44%, scenario-II: 87.09%, scenario-III: 91.04%, scenario-IV: 99.17%, scenario-V: 92.17%, and scenario-VI: 94.42%. From this, we conclude that the proposed classification model gives 87%+ accuracy even when attackers launch different types of attacks concurrently on the victim system.

5.1 Complexity analysis

In traditional framework-based DDoS attack detection mechanisms, each network flow is analyzed at a single point. Therefore, the time complexity of the system is O(NNF), where NNF is the number of network flows analyzed by the system [63]. However, in the case of DPF/DSPF, the network flow analysis task is distributed between n nodes (where n is the number of nodes in the cluster), and hence the complexity is also distributed. To measure the complexity of the proposed system, we assume that each node examines an equal share of the network flows. Therefore, the complexity of DPF/DSPF is \(O(\frac{NNF}{n})\). In this case, we have to consider one more parameter, the intermediate communication cost between nodes. Let us assume that the intermediate communication cost is O(ICC). Therefore, the combined complexity cost (CCC) of the DPF/DSPF is \(CCC = O(\frac{NNF}{n}) + O(ICC)\). However, DPF/DSPF is designed to analyze a large amount of data, and hence O(ICC) is negligible compared with O(NNF). Therefore, the CCC of the DPF/DSPF-based DDoS attack detection system is \(O(\frac{NNF}{n})\). This shows that the time complexity decreases as nodes are added to the cluster.
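As a rough numerical illustration of the \(O(\frac{NNF}{n})\) term, the following sketch computes the per-node workload under the equal-share assumption made above. The flow count of one million is hypothetical, and the communication cost O(ICC) is ignored, as in the simplification above.

```python
import math


def flows_per_node(nnf: int, n: int) -> int:
    """Upper bound on flows examined by one node when NNF flows are split over n nodes."""
    return math.ceil(nnf / n)


# Hypothetical workload of one million network flows.
nnf = 1_000_000
for n in (1, 2, 4):
    print(n, flows_per_node(nnf, n))
# 1 1000000
# 2 500000
# 4 250000
```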

5.2 Comparison with existing systems

In this section, we systematically compare the proposed SSK-DDoS classification system for DDoS attacks with existing DPF and traditional framework-based systems [34, 35, 37,38,39, 41,42,43,44,45, 47,48,49, 57] in Tables 6 and 7.

Table 6 Comparison of SSK-DDoS with existing DPF/DSPF-based approaches
Table 7 Comparison of SSK-DDoS with the traditional framework-based approaches

Most of the DPF-based approaches [34, 35, 37,38,39, 44, 45, 47, 48] for classifying DDoS attacks and legitimate traffic are deployed on the Apache Hadoop framework. This type of approach efficiently handles a large number of flows on a cluster of nodes. However, Apache Hadoop is primarily employed to examine large data in offline mode. Therefore, this type of classification approach is not capable of classifying network packets in real-time.

A few authors [41,42,43, 49, 57] have proposed Apache Spark-based classification approaches for DDoS attacks and legitimate traffic. This type of approach examines network flows in near real-time. However, these systems did not provide an automated way to take action on incoming traffic flows. In contrast, the proposed SSK-DDoS classification approach for DDoS attacks is not only designed on a DPF (using the Spark MLlib machine learning library on a Hadoop cluster) but also deployed on a DSPF (Spark Streaming). Therefore, the proposed system is highly scalable. Further, we used Kafka’s distributed pub-sub messaging system, which makes the proposed SSK-DDoS classification system loosely coupled and automated.

Sharafaldin et al. [56] generated a realistic dataset by considering various attack scenarios. Further, they proposed a detection approach to classify different types of DDoS attacks. According to their performance evaluation, the precision values for the classifiers ID3, RF, NB, and LR are 0.78, 0.77, 0.41, and 0.25, respectively. In comparison, our RF-based classification model gives a better precision value (0.89).

6 Conclusions

A distributed denial of service attack is one of the biggest threats to Internet-based services and their resources. It overwhelms victim resources in a short time by sending a large number of network packets. Traditional framework-based approaches themselves become victims of attacks while classifying a massive number of network flows. Further, most of the existing DPF-based classification systems for DDoS attacks were designed for offline mode and hence are not capable of classifying network flows in real-time.

This paper proposed a Spark Streaming and Kafka-based distributed classification system for DDoS attacks, named SSK-DDoS. This classification approach is designed using the distributed Spark MLlib machine learning library on a Hadoop cluster and deployed on the Spark Streaming platform to classify network traffic in real-time into seven classes: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. Further, this system stores the formulated features with the predicted class of each flow in the HDFS for retraining the existing distributed classification model using a new set of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that the proposed SSK-DDoS detection system classified network traffic into seven classes with an accuracy of 89.05%.