1 Introduction

Volumetric denial of service (DoS) attacks prevent users and applications from accessing services provided over the network, either by exhausting the bandwidth or the service itself. The Internet, designed as a best-effort delivery network, does not have sufficient built-in mechanisms to prevent DoS attacks, although many have been proposed and some, e.g. reverse path filtering, deployed to a certain extent. Therefore, volumetric DoS attacks are still present and evolve to achieve maximum effect and to render the defense difficult, for example, by exploiting IoT devices to launch massively distributed DoS attacks (Hummel and Hildebrand 2021). In the case of volumetric DoS attacks, the positions of an attacker and a defender are asymmetric. It is easy and cheap to deploy the attack, and there are even services providing DoS as a service (Douglas et al. 2017). The defender, on the other hand, must use a significantly over-provisioned distributed infrastructure to withstand the attack or utilize sophisticated mechanisms to detect and mitigate it.

Ideally, the detection and mitigation of DoS should be deployed at the level of a network operator, close to the source of the DoS rather than at the victim, since the victim does not usually have the power and the resources to counter the attack. On the other hand, detection at the operator level is rendered hard by the large amount of traffic aggregated from many services on the backbone and by the distributed nature of DoS attacks. If available, the detection and mitigation are deployed one step before the traffic is delivered to the victim (e.g. by a service hosting provider or an internet service provider).

In the past, when the same attack lasted for hours, it was possible to analyze the attack manually and come up with corresponding countermeasures after several minutes. Such a lengthy investigation is no longer affordable. Attacks have become frequent, multi-vector and short-lived (NETSCOUT Threat Intelligence Report Shows a Dramatic Increase in Multivector DDoS Attacks in First-Half 2020). The defense must respond fast, ideally automatically, countering the attack within seconds, which leaves little room for manual analysis.

We propose a method to automate the time-consuming analysis that a human performs when figuring out how to filter a particular attack. Our method automatically infers rules to filter volumetric DoS traffic based on the arriving network traffic. As its input, the method needs a sample of the traffic captured during a normal period and a sample of the traffic captured during the attack period. The method uses a machine learning algorithm, decision tree induction, to create a model of the current attack traffic. Subsequently, it converts the model into packet filtering rules. This is unlike traditional approaches in which a classifier is trained on an annotated dataset before the attack happens and the classifier itself is subsequently used to classify the traffic. For the details of our approach, see Section 3.

At a glance, the contributions of this paper are as follows:

  • An innovative use of machine learning to infer rules for filtering volumetric DoS attacks.

  • A thorough evaluation of the proposed method to assess its feasibility under conditions present in real deployments.

  • A demonstration of the method's output and an evaluation of its online deployment.

  • Publicly available datasets used during the evaluation.

The remainder of this article is structured as follows. Section 2 provides an overview of the state of the art in DoS detection and mitigation, including the use of machine learning in this domain. Section 3 presents the problem statement, the assumptions and the description of our method. In Section 4, we describe our datasets, the evaluation of hyperparameter settings and a demonstration of the outputs. Lastly, Section 5 concludes this article and outlines our plans for future work.

2 Related work

Although our approach is primarily related to mitigation techniques, we also overview detection techniques. They provide information about an arriving attack, namely its timing, which is a vital input for our method.

2.1 Detection

One of the early works is MULTOPS (Gil and Poletto 2001). It detects bandwidth attacks based on a deviation from a communication proportional symmetry using just the byte and packet counters from routers. Later methods are based on data obtained from sampled packets or IP flows for better observation detail, e.g. monitoring the number of new source IP addresses seen by the end host (Peng et al. 2002), since most source IP addresses are new to the victim during an attack (Jung et al. 2002). Du and Abe (2008) proposed an attack detection scheme based on the packet size entropy for each application (identified by a transport port number). The assumption is that the entropy of normal traffic is higher than the entropy during an attack: the attack traffic consists of similar packets, whereas legitimate packet sizes vary according to each application. The detection is based on the deviation of the entropy from its mean value. Other proposed schemes (e.g., Yu and Zhou 2008; No and Ra 2009; Zhang et al. 2010; Sardana et al. 2008) also utilize entropy to convert selected traffic characteristics, such as the randomness of flows at routers or the distribution of source IP addresses in dependence on destination port numbers in the flows, into time series. Deviations in the time series are detected by simple schemes such as EWMA and Holt-Winters, up to complex schemes such as wavelet analysis (Li and Lee 2003; Dainotti et al. 2006; Lu et al. 2010) or the Chi-square test (e.g. Feinstein et al. 2003).
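To make the entropy-based detection idea concrete, the following sketch computes the Shannon entropy of a packet-size distribution in the spirit of Du and Abe (2008); the packet sizes are invented for illustration.

```python
import math
from collections import Counter

def packet_size_entropy(sizes):
    """Shannon entropy (in bits) of a packet-size distribution."""
    counts = Counter(sizes)
    total = len(sizes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Varied legitimate packet sizes yield a high entropy ...
legitimate = [64, 1500, 576, 1400, 64, 980, 1500, 733]
# ... while a flood of near-identical packets yields a low one.
attack = [64] * 7 + [1500]

print(packet_size_entropy(legitimate) > packet_size_entropy(attack))  # True
```

A detector then flags periods in which the per-application entropy drops well below its long-term mean.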

Machine learning algorithms are utilized to detect DoS attacks applying both supervised and unsupervised schemes (Gavrilis and Dermatas 2005; Mukkamala et al. 2002; Akyazı and Uyar 2010; Lee et al. 2008; Burbeck and Nadjm-Tehrani 2007). From the more recent publications, we specifically select those that we deem the most related to our work.

The authors of Saini et al. (2020) experiment with several machine learning techniques to assess their performance when constructing a DDoS attack detector. They propose a set of 27 features and use an MLP network, a random forest and Naive Bayes to classify them. In comparison to our work, these features are not packet based, the classifiers are trained offline on annotated datasets, and they aim at recognizing a particular attack type rather than at telling which packets are legitimate and which belong to an attack. Similarly, Sangkatsanee et al. (2011) perform experiments with various machine learning techniques but with a different feature set and a larger set of attack types.

In Fachkha et al. (2015), the authors aim to predict features of DDoS attacks such as intensity and size. They create a feature set (various time series and their fluctuations) to recognize similar DDoS attacks, and propose a clustering approach to investigate DDoS campaigns and their similarities. We therefore consider this work orthogonal to our research. Our approach can benefit from such an analysis; e.g. if a similar campaign is recognized to take place again, the same rule set, previously inferred by our algorithm, can be used to drop the attacking packets.

The authors of Subbulakshmi et al. (2010) proposed a fuzzy inference classification system working on top of an IDS. The proposed system aggregates alerts that relate to the same detected cybersecurity event and classifies the aggregated meta-alert as a true positive or a false positive based on feedback from a human operator. The improved accuracy and precision of DoS detection improves the output of the analysis and forms relevant input for our algorithm. Moreover, our algorithm does not aim at inferring knowledge from a human operator about whether alerts are true positives. We use machine learning to infer and generate mitigation rules based on an on-the-fly observation of the mix of legitimate and attack traffic.

2.2 Mitigation

The previously discussed ML approaches are able to tell when an attack takes place, but they do not deal with the mitigation of the DoS traffic itself. Traditionally, Intrusion Prevention Systems (IPS) perform attack mitigation, but they fall short on volumetric DoS protection (Why Firewalls and Intrusion Prevention Systems (IPS) Fall Short on DDoS Protection 2013). In summary, IPS are designed to analyze each session in great detail; hence, they are vulnerable to DoS themselves. Our experience also shows that they are indeed targets of volumetric attacks, which results in a peculiar situation: the service is up and running, waiting for the users, but the users cannot access it as the IPS is overwhelmed.

Therefore, dedicated anti-DoS devices and cloud services are offered. Unfortunately, the vendors do not elaborate on their specific techniques and the descriptions remain vague, such as statistical anomaly detection, protocol anomaly detection, fingerprint matching and profiled anomaly detection. In our experience, automated mitigation is most often based on blacklisting IP addresses delivered by intelligence feeds and on the ability to pair the detection with a specific mitigation technique (e.g. a specific regular expression that is either known in advance or derived by a human).

Our survey of the research literature shows that mitigation has been studied mainly from the perspective of innovative strategies and measures to prevent the spoofing of IP addresses. A basic preventive method suggests ingress filtering (Ferguson and Senie 2000) in customer or source ISP networks where the pool of legitimate source IP addresses is well-known. In order to allow filtering in transit or destination networks, the information about legitimate source IP addresses must be passed from the source towards the destination networks. This is achieved either by the scheme proposed in Li et al. (2008) or by additional authentication (Bremler-barr and Levy 2005; Shen et al. 2008; Xie et al. 2007). TCP SYN cookies, improved in Zuquete (2002); Goldschmidt and Kučera (2002), may also be considered an IP-spoofing prevention mechanism, although they work only for TCP SYN flood attacks.

Savage et al. (2000) (Probabilistic Packet Marking, PPM) initiated research in the field of packet marking for tracing back the source of spoofed packets. Further extensions of packet marking can be found in Song and Perrig (2001); Peng et al. (2002); Belenky and Ansari (2003); Strayer et al. (2004); Dan et al. (2001).

Detection of spoofed packets has been researched in Jin et al. (2003); Wang et al. (2007). These methods are based on detecting variances in TTL (Time To Live). In Wang et al. (2007), the authors discuss TTL issues, namely the problematic estimation of the initial TTL (consider NAT, route changes, etc.) and the possibility of spoofing the TTL value. Xu et al. (2007) design a method to reveal spoofing of source IP addresses by a statistical analysis of their distribution. Xu assumes that an attacker spoofs IP addresses randomly with a uniform distribution, but the attacker may choose to spoof IP addresses from a given subnet or with various other types of distribution, hence violating the assumption of a uniform random distribution. In comparison to these spoofing detection methods, we consider all the network and transport header fields to be relevant for identifying attacking packets, and we let the machine learning algorithm decide which fields are relevant in the given circumstances (no matter the attack type, IP address spoofing or TTL issues).

3 Inference of filtering rules

The proposed method does not aim at the detection of DDoS attacks. We consider the detection to be a black box that works prior to our inference method. The detection methods are built to detect the attacks, their type and, in some cases, also the field that is most likely an indicator of an attack, but not the specific value of that field. The detection methods do not provide output in the form of "drop all packets with the TCP window size set to zero and the IP TTL of fifty because all attacking packets have this specific property". We propose a method to address this part.

3.1 Prerequisite

Let us consider two separate datasets of network traffic in the form of raw packet captures (e.g. pcap files). The first dataset contains legitimate traffic and corresponds to periods of a normal traffic mix. The second dataset contains volumetric DoS traffic as well as legitimate traffic and corresponds to a period when a service or an infrastructure is under attack.

3.2 Problem statement

We are interested in an algorithm capable of inferring mitigation rules to filter volumetric DoS attacks. The algorithm observes both datasets and is aware of which dataset is which, but it has no prior knowledge about the legitimacy of particular packets in the datasets. After observing both datasets, the algorithm generates a set of mitigation rules that are as specific to the offending packets as possible, so as not to block legitimate packets, yet general enough to describe the offending packets with a small number of rules. The small number of rules is crucial for saving mitigation resources; in other words, it is crucial to avoid trivial results such as each offending packet being identified by its own dedicated rule. At the same time, the inferred rules must cover nearly all the offending packets, while it is acceptable that the rules block a small portion (as small as achievable) of legitimate packets. Blocking a small portion of legitimate traffic is considered an acceptable price for preserving the availability of a service for the rest of the legitimate traffic during DoS attacks.

3.3 Assumptions

We make two assumptions about the characteristics of the network traffic in order to bring the problem closer to the real deployment as well as to allow the algorithm to find a reasonable solution.

Our first assumption concerns the proportions of legitimate and offending traffic in the datasets. The legitimate dataset contains a majority of legitimate traffic; it may contain a small portion of offending traffic such as scanning, brute-force attacks or residuals of DoS attacks called backscatter traffic. The offending dataset contains a majority of offending packets and also some portion of legitimate traffic. We argue that it is realistic to collect and identify such datasets on the fly even in real deployments, for example, by utilizing the outputs of detection methods such as network behavioral analysis (NBA) systems or the approaches presented in Section 2.1: when the NBA issues no alert, the dataset is considered legitimate, while if there is an alert about DoS traffic, the dataset is considered offending.

Our second assumption considers volumetric DoS traffic to exhibit a certain degree of self-similarity, i.e. the packets belonging to the attack are partially similar to each other. The similarity may appear at the network layer (e.g. the same specific packet size), at the transport layer (e.g. the same specific TCP window size) or at the application layer (e.g. the same specific HTTP User-Agent value or the same payload content). We do not consider the application layer in this article; we plan to include it in our future work.

3.4 Approach

Our approach to finding the mitigation rules is built upon machine learning. We consider a decision tree algorithm our first candidate. Decision trees have a good track record of being successfully utilized in network traffic analysis (Yuan and Wang 2016). But more importantly, it is possible to convert the trained models into filtering rules that follow a packet filter specification (e.g. a set of AND/OR expressions). Such rules are applicable to existing mitigation solutions and are familiar to network administrators, who can verify them and decide whether to apply them at all. Moreover, since machine learning does not have all the context the administrators have, the administrators can easily introduce additional modifications to the generated rules if necessary.

The selected machine learning algorithm represents a supervised approach that requires an annotated dataset. To this end, we utilize two datasets: one collected during normal operation and the second during an attack. As defined in the beginning of this section, there is no a priori knowledge about the legitimacy of particular packets, only about the whole datasets. Therefore, all packets in the offending dataset are considered positive samples and all packets in the legitimate dataset are considered negative samples. Clearly, such an approach to annotation introduces errors, as the legitimate traffic in the attack dataset will be marked as offending and vice versa. But due to the first assumption, the majority of packets will be correctly marked in each dataset. Due to the second assumption about self-similarity, more weight should fall on the truly positive samples, while the truly negative samples in the offending dataset will be outweighed by the negative samples of the legitimate dataset.

The machine learning pipeline consists of the well-known steps of feature extraction, training and classification. The feature extraction phase parses packets, one by one, and extracts the header fields. We do not further process the header fields (e.g. change their representation or normalize the values), since such processing would prohibit applying the generated rules in network filtering devices. On the other hand, changing the representation of certain values may improve the results; therefore, we consider it one of our tasks for future work. The list of currently utilized features is depicted in Table 1.

Table 1 Overview of protocol fields as features
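As an illustration of the feature extraction step, the following sketch parses a raw IPv4/TCP packet into a feature dictionary. The field subset and feature names are our illustrative choice, not the exact list from Table 1.

```python
import struct

def extract_features(ip_packet: bytes) -> dict:
    """Extract a subset of IPv4/TCP header fields as features
    (illustrative subset; the actual feature list is in Table 1)."""
    # IPv4: version/IHL, TOS, total length, ID, flags+fragment, TTL, protocol
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto = struct.unpack(
        "!BBHHHBB", ip_packet[:10])
    ihl = (ver_ihl & 0x0F) * 4  # header length in bytes
    features = {
        "ip_len": total_len, "ip_ttl": ttl,
        "ip_proto": proto, "ip_flags": flags_frag >> 13,
    }
    if proto == 6:  # TCP
        sport, dport, seq, ack, off_flags, win = struct.unpack(
            "!HHIIHH", ip_packet[ihl:ihl + 16])
        features.update({"src_port": sport, "dst_port": dport,
                         "tcp_flags": off_flags & 0x01FF, "tcp_win": win})
    return features
```

Each parsed packet thus becomes one feature vector for the decision tree, with values kept exactly as they appear on the wire so that inferred thresholds translate directly into filter conditions.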

We omit the description of how the decision tree is constructed as we consider it to be a well-known algorithm; moreover, we utilize the existing DecisionTreeClassifier implementation from the scikit-learn library (Decision Trees 2008). The setup of the decision tree hyperparameters is evaluated in Section 4.2.

The filtering rules correspond to all the paths from the root to the positive leaf nodes. These rules can be extracted from the decision tree by a recursive depth-first search, which creates the disjunctive form of a rule set (an example is provided in Section 4.4).

figure a

The algorithm constructs a rule (brule) corresponding to the given branch and depth of the recursion. It appends the condition of the current node to brule, delimited by logical AND. When the algorithm encounters a positive leaf node, it outputs brule delimited by logical OR. The description of the algorithm is deliberately simplified to support easy understanding.
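A minimal Python sketch of the described conversion, traversing the internal arrays of a trained scikit-learn DecisionTreeClassifier (labels are assumed to be 0 for legitimate and 1 for attack; the toy features are invented for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

def tree_to_rules(clf, feature_names):
    """Recursive depth-first traversal: one AND-rule per positive leaf,
    all rules OR-ed together into a disjunctive rule set."""
    tree = clf.tree_
    rules = []

    def recurse(node, brule):
        if tree.children_left[node] == -1:          # leaf node
            if tree.value[node][0].argmax() == 1:   # positive (attack) leaf
                rules.append(" AND ".join(brule) or "TRUE")
            return
        name = feature_names[tree.feature[node]]
        threshold = tree.threshold[node]
        # left child holds samples with feature <= threshold
        recurse(tree.children_left[node], brule + [f"{name} <= {threshold:g}"])
        recurse(tree.children_right[node], brule + [f"{name} > {threshold:g}"])

    recurse(0, [])
    return " OR ".join(f"({r})" for r in rules)

# Toy training data: hypothetical (packet length, TTL) features.
X = [[0, 64], [0, 128], [1500, 64], [1500, 64]]
y = [0, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree_to_rules(clf, ["ip_len", "ip_ttl"]))  # e.g. a single "ip_len > ..." rule
```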

4 Evaluation

The evaluation empirically establishes whether the proposed approach is capable of inferring the filtering rules by simulating the conditions the algorithm may encounter in an operational environment.

4.1 Dataset

We construct our datasets using data from three sources. The first source is the publicly available DDoS Evaluation Dataset (CIC-DDoS2019) (Sharafaldin et al. 2019) from the Canadian Institute for Cybersecurity. It provides variants of DDoS attacks, namely SYN flood, UDP flood, DNS amplification and NTP amplification. The second data source is a set of publicly available stress-test tools, namely LOIC, HULK and Torshammer. We use these tools to generate and capture additional attack samples. All the attack samples are listed in Table 2. In some cases, we modify the attack samples by randomly spoofing the source IP address of each packet so that our inference algorithm cannot simply use the source IP address as an identifier in a mitigation rule. We mix the individual attack samples together, creating four multi-vector attack samples, which are listed in Table 3. We use only these multi-vector attack samples (instead of single vectors) during our evaluation as they reflect today's DoS attack landscape more realistically. Moreover, if our inference algorithm works well with multi-vector attacks, then it will work with single-vector attacks as well. The third data source provides a legitimate traffic mix captured between the Austrian and Czech National Research and Education Networks (ACONET and CESNET, respectively). We make all the datasets used during our experiments publicly available (Zadnik 2021).

Table 2 Attack samples
Table 3 Multi-vector attack samples

We create a training dataset and a testing dataset in a specific way to simulate the operational deployment in the network as described in Section 3. We assemble the datasets out of three parts: legitimate traffic, legitimate traffic for confusion and DoS traffic. The training dataset consists of Legitimate traffic labeled as Legitimate (LaL), Legitimate traffic for Confusion labeled as DoS (LCaD) and DoS traffic labeled as DoS (DaD). The testing dataset consists of three parts as well: Legitimate traffic labeled as Legitimate (LaL), Legitimate traffic for Confusion labeled as Legitimate (LCaL) and DoS traffic labeled as DoS (DaD). In other words, the testing dataset is labeled correctly so that we are able to assess the inference algorithm, while part of the training dataset is deliberately mislabeled to simulate the operational environment. In fact, LCaD is equal to LCaL except for the labels. The situation is depicted in Fig. 1. The training and the testing datasets are of the same size.

Fig. 1
figure 1

Dataset mix of legitimate and attack traffic
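The labeling scheme described above can be sketched as follows; the dataset is laid out as [LaL | LC | DaD], and only the label of the confusion part differs between training and testing (the part sizes are hypothetical):

```python
import numpy as np

def label_datasets(n_legit, n_confusion, n_dos):
    """Label vectors for a dataset assembled as [LaL | LC | DaD]:
    in training, the confusion part is deliberately labeled 1 (LCaD);
    in testing, it is labeled correctly as 0 (LCaL)."""
    y_train = np.concatenate([np.zeros(n_legit),
                              np.ones(n_confusion),   # LCaD: wrong on purpose
                              np.ones(n_dos)])
    y_test = np.concatenate([np.zeros(n_legit),
                             np.zeros(n_confusion),   # LCaL: correct
                             np.ones(n_dos)])
    return y_train, y_test

# Hypothetical part sizes; LC set to 30% of the DoS traffic as in Section 4.2.
y_train, y_test = label_datasets(n_legit=1000, n_confusion=300, n_dos=1000)
```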

4.2 Experiments

In our experiments, we remove the destination IP address from the available feature set to simulate real deployment. Since the destination IP address of the victim in the attack samples is unique, the training algorithm would always infer the destination IP address and hence trivially reach a 100% true positive rate. Rather than selecting a victim IP address from the pool of the legitimate traffic, we decided to remove the destination IP address from the feature set. Removing the destination address explores the limits of our approach in the worst-case scenario and avoids the dilemma of which legitimate IP address to use as the victim so as not to bias the results.

Unless explicitly mentioned otherwise, we keep the legitimate and legitimate-for-confusion parts constant while we change the DoS attack types (using the multi-vector attacks listed in Table 3). The ratio of the legitimate traffic for confusion is set to 30% of the DoS traffic. This ratio between the LCaD and the DaD traffic is much higher than what we observe in cases when a DoS attack needs to be mitigated. Of course, there are plenty of small volumetric DoS attacks, but reasonably provisioned services and connectivity can withstand them.

During our experiments, we investigate how well our inference algorithm performs with respect to various setups of the decision tree hyperparameters, namely:

  • max_depth (default=None)

  • max_leaf_nodes (default=None)

  • min_samples_leaf (default=1)

  • min_samples_split (default=2)

The default values are those of the scikit-learn decision tree implementation as given in the respective documentation (Decision Trees 2008). Essentially, we want to find parameters that lead not only to well-performing results but also to reasonable results that are not trivial. An exaggerated example of a trivial result is a huge decision tree with a dedicated leaf for each distinct packet. Such a decision tree might score excellently, but it is useless from the perspective of deriving a small number of mitigation rules.

Therefore, we want to limit the size of the decision tree using some of the listed hyperparameters. We evaluate the number of true positives and false positives with respect to the values of each parameter.

Our first experiment evaluates the influence of the max_depth parameter on the performance; we keep the other parameters at their default values. Figure 2 shows four graphs, one per DoS traffic mix (described in Table 3). The deeper the tree, the higher the true positive rate; trees deeper than 10 levels consistently reach a true positive rate higher than 99%. But we can also observe that, counter-intuitively, the false positive rate increases with the growing depth. The increasing false positive rate is caused by overfitting to LCaD. Indeed, if the tree is shallow, it can select only the DaD traffic, which is prevalent in the training dataset. On the other hand, if we allow deeper trees (longer rules), the training algorithm can also fit the LCaD traffic, which is more complex to describe than the DaD; thus, allowing the tree to consider LCaD leads to an increase in false positives. A tree of depth six consistently reaches good results over all the datasets.

Fig. 2
figure 2

Graphs depicting TP and FP for max_depth parameter

Next, we evaluate the max_leaf_nodes parameter. This parameter limits the breadth of the tree, i.e. the number of paths from the root to the leaves (the number of resulting rules). Similarly to the previous experiment, we can observe in Fig. 3 that the true positive rate increases very fast even for a relatively small number of max_leaf_nodes (e.g. 15), but with a higher number of max_leaf_nodes the number of false positives starts to increase. Again, the number of false positives grows with a high number of leaf nodes due to the ability of a larger decision tree to classify LCaD as attacking traffic. We can see that in the case of the SYNDNS dataset the best performing tree has 10 leaf nodes, whereas in the other cases 15 leaf nodes perform better.

Fig. 3
figure 3

Graphs depicting TP and FP for max_leaf_nodes parameter

The min_samples_leaf hyperparameter does not allow the tree to split a node if the resulting leaves would contain fewer than a certain number of samples; similarly, min_samples_split does not allow a node to be split unless it contains at least a certain number of samples. In our case, we define the minimum number of samples as a fraction of the overall number of samples. We hope that by using a fraction rather than an absolute number, we obtain a setup that generalizes across various datasets.

Fig. 4
figure 4

Graphs depicting TP and FP for min_samples_leaf parameter

In the case of min_samples_leaf, displayed in Fig. 4, the graphs indicate that good results can be achieved when the fraction is low, e.g. 0.005, which corresponds to approximately 400 samples. The SYNDNS dataset behaves differently from the others: with an increasing number of samples per leaf, the number of false positives decreases. This suggests that the optimal setup of the parameters differs between the datasets, but hopefully we can find a combination of parameters that is suboptimal yet performs well for all of them.

Fig. 5
figure 5

Graphs depicting TP and FP for min_samples_split parameter

In the case of min_samples_split, the graphs (Fig. 5) show the same decreasing trend of false positives with the growth of the parameter. However, we can also see that the false positive rate remains high, unlike in the previous experiments. For example, when we compare Figs. 4 and 5 for SYNDNS, the minimum false positive rates are 0.15% and 2.9%, respectively. Therefore, we can assume that min_samples_split by itself cannot reach good results.

To find the best performing setup of hyperparameters, we tested all the hyperparameter combinations by brute force (a so-called grid search). The respective ranges and granularity of the parameters tested during the grid search are derived from the previous experiments and correspond to the graphs displayed in Figs. 2, 3, 4 and 5. The results of the grid search are depicted in Table 4. The table displays the best results found by the grid search for each dataset specifically, and the last row shows a setup that works satisfactorily across all the datasets. The best performing parameters differ for each dataset. In the case of SYNDNS, the grid search discovered a different setup than we would expect based on the experiments with min_samples_split in Fig. 5: while those experiments indicated setting min_samples_split to 0.06, it is better to keep it low when combined with the other parameters.

Generally, it holds that a tree depth between five and seven works for all the cases, the maximum number of leaf nodes should be between 10 and 15, and low values work well for min_samples_split and min_samples_leaf. With such a setup, a true positive rate higher than 97% and a false positive rate lower than 3% are achieved across the datasets.

Table 4 Grid-search results for the individual datasets
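The brute-force search can be sketched as follows. The grid ranges and the TPR-minus-FPR selection criterion are illustrative assumptions, not the exact procedure behind Table 4:

```python
from itertools import product
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

GRID = {  # illustrative ranges inspired by the per-parameter experiments
    "max_depth": [5, 6, 7],
    "max_leaf_nodes": [10, 15],
    "min_samples_leaf": [0.001, 0.005],   # fractions of all samples
    "min_samples_split": [0.001, 0.005],
}

def grid_search(X_train, y_train, X_test, y_test, grid=GRID):
    """Try every hyperparameter combination; keep the one maximizing
    TPR - FPR (a simple combined criterion, our assumption)."""
    best_params, best_score = None, -np.inf
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        clf = DecisionTreeClassifier(random_state=0, **params)
        clf.fit(X_train, y_train)
        tn, fp, fn, tp = confusion_matrix(
            y_test, clf.predict(X_test), labels=[0, 1]).ravel()
        score = tp / (tp + fn) - fp / (fp + tn)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```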

Our next experiment looks for the limits of our first assumption stated in Section 3.3. We gradually increase the percentage of LCaD (fixed at 30% during our previous experiments), i.e. we want to figure out how large an attack must be for our approach to generate successful mitigation rules. We aim to simulate a situation when the attack is detected and reported but constitutes only a small portion of the aggregated traffic forwarded to the victim; the majority of the traffic mix during the attack period is therefore composed of legitimate traffic.

We show the results using the general setup of hyperparameters presented in the last row (ALL) of Table 4. The graphs in Fig. 6 show the achieved true positive and false positive rates with respect to the percentage of LCaD. The higher the percentage, the more we confuse the training process with incorrectly labeled samples. In the worst case, only 20% of the traffic labeled as DoS is true DoS traffic, while the rest is legitimate.

Fig. 6
figure 6

Graphs depicting TP and FP for the increasing portion of LCaD

We can observe that when the LCaD share exceeds 45%, the false positive rate increases significantly. In the case of the ALL dataset, the worst case (a 70% share of LCaD) yields 90% false positives, whereas in the cases of SYNDNS and ALLUDP the worst case generates approximately 10% false positives. The relatively low number of false positives is explained by the simpler attack vectors as well as by the correctly labeled negative examples in LaL, which outweigh the wrong labels caused by LCaD.

Fig. 7
figure 7

Graphs depicting improvement when relabeling is applied

In order to further reduce the false positives, we consider the distribution of the positive (attack) and negative samples in the leaf nodes. Our proposal is to look for leaf nodes that contain a majority of positive samples but are not pure enough. Such nodes would normally be labeled positive, but we change their label to negative. The change of a label decreases the true positive rate, but at the same time the false positive rate decreases as well. The experimental evaluation of this relabeling operation (for gini = 0.25) is depicted in Fig. 7. We can see a dramatic decrease in false positives in the case of the most complex dataset, ALL: originally, the false positive rate was 90% in the worst case, whereas with relabeling it is only 7%. This improvement holds for all the datasets and all ratios of the traffic mix. Such a twist renders our method more robust with respect to various traffic mixes and extends its applicability even to smaller attacks.
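The relabeling operation can be sketched as follows. Instead of modifying the trained tree, this illustrative version recomputes the leaf labels at prediction time, flipping positive leaves whose gini impurity exceeds the threshold (labels assumed 0/1):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def relabeled_predict(clf, X, gini_threshold=0.25):
    """Predict with a trained tree, but treat positive leaves whose
    gini impurity exceeds the threshold as negative (relabeling)."""
    tree = clf.tree_
    leaf_label = {}
    for node in range(tree.node_count):
        if tree.children_left[node] == -1:           # leaf node
            counts = np.asarray(tree.value[node][0], dtype=float)
            p = counts / counts.sum()
            gini = 1.0 - float(np.sum(p ** 2))
            label = int(p.argmax())
            if label == 1 and gini > gini_threshold:
                label = 0    # impure positive leaf: relabel as negative
            leaf_label[node] = label
    # map each sample to its leaf and return the (possibly flipped) label
    return np.array([leaf_label[leaf] for leaf in clf.apply(X)])
```

For example, a single-leaf tree trained on a 60/40 mix of positive and negative samples (gini = 0.48) predicts the positive class, while relabeled_predict returns the negative class, which illustrates the trade of a little true positive rate for a large false positive reduction.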

4.3 Online deployment

The rules are inferred online, i.e. the legitimate traffic sample is already available, while the attack traffic sample is captured on the fly during the attack. The decision tree induction starts subsequently and should last as short a time as possible so that the inferred rules can be applied as soon as possible. During our experiments, we measured the time per round, which includes the data processing, the training and the conversion algorithm. Our experiments revealed that the data processing (loading and parsing of the captured pcaps) is the most time-consuming task of the whole process, while the training and the conversion consume only a small portion of the overall time. Please note that we perform our experiments in Python, which leaves a large space for future optimization. Figure 8 depicts the duration statistics of the data processing time and the rule generation time (training and conversion).

Fig. 8

Boxplots of times to infer a rule set (100 rounds per each dataset)

We can see that all the rounds finished in less than 7 seconds in total. We consider such a reaction time sufficient (especially in comparison with rules inferred by a human). Moreover, we are aware that there is room for optimization in the inference process as well if we implement a customized learning procedure rather than using the scikit-learn implementation. Such a procedure would process the legitimate traffic beforehand to compute indicators for the decision tree inference, and only update the indicators during the attack period.
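The proposed optimization could be sketched as follows; this is a hypothetical design of ours (class and method names are assumptions): per-feature value counts of the legitimate sample are computed once, only the attack-side counts are updated as packets arrive, and split impurities can then be evaluated without re-scanning the legitimate traffic.

```python
from collections import Counter

class IncrementalSplitStats:
    """Precompute legitimate-traffic indicators once; update only the
    attack-side counters online (illustrative sketch)."""

    def __init__(self, legit_packets, features):
        self.features = features
        self.legit = {f: Counter(p[f] for p in legit_packets)
                      for f in features}
        self.attack = {f: Counter() for f in features}

    def observe_attack(self, packet):
        # Called per captured attack packet; O(len(features)) per packet.
        for f in self.features:
            self.attack[f][packet[f]] += 1

    def gini_for_value(self, feature, value):
        # Gini impurity of the packet subset where feature == value.
        pos = self.attack[feature][value]
        neg = self.legit[feature][value]
        total = pos + neg
        if total == 0:
            return 0.0
        p = pos / total
        return 2 * p * (1 - p)
```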

4.4 Demonstration

We present two decision trees and their respective mitigation rule representations to demonstrate the outputs of our approach. The first decision tree blocks the DoS traffic contained in the ALLTCP dataset. Its graphical representation is depicted in Fig. 9. A blue node (solid border) indicates a prevalence of the positive (attack) class, an orange node (dashed border) a prevalence of the negative class, and a white node an equal share of both classes. The first row of a node states the condition used for the decision (if the condition is met, follow the left arrow), gini indicates the node's impurity (gini = 0 means pure), and the last row assigns a class to the node.

Fig. 9

Decision tree for ALLTCP dataset

We transform this tree into the respective filtering rule set using the algorithm proposed in Section 3.4. The generated rules are displayed below:

figure b
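The article's conversion algorithm from Section 3.4 is not reproduced in this excerpt; a generic sketch of extracting filter rules from a scikit-learn tree, under our own assumptions (function and feature names are ours), could look like this: each rule is the conjunction of conditions on the path from the root to a leaf predicting the attack class.

```python
from sklearn.tree import DecisionTreeClassifier

def tree_to_rules(clf, feature_names, positive_class=1):
    """Emit one filter rule per positive (attack) leaf: the AND of all
    threshold conditions on the root-to-leaf path (illustrative sketch,
    not necessarily the article's Section 3.4 algorithm)."""
    tree = clf.tree_
    pos_idx = list(clf.classes_).index(positive_class)
    rules = []

    def walk(node, conds):
        left = tree.children_left[node]
        if left == -1:  # leaf node
            counts = tree.value[node][0]
            if counts.argmax() == pos_idx:
                rules.append(" and ".join(conds) or "true")
            return
        name = feature_names[tree.feature[node]]
        thr = tree.threshold[node]
        walk(left, conds + [f"{name} <= {thr:g}"])
        walk(tree.children_right[node], conds + [f"{name} > {thr:g}"])

    walk(0, [])
    return rules
```

The resulting rule strings are then OR-ed: a packet matching any rule is considered part of the attack and dropped.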

The second decision tree blocks the DoS traffic contained in the ALLUDP dataset. Its graphical representation is depicted in Fig. 10. It is more complex than the previous one: the decision tree recognized that a certain pool of IP addresses is responsible for part of the attack. Therefore, the algorithm selected the source IP address as the first decision feature, although the pool also falls within the legitimate address space. The decision tree subsequently uses other features to differentiate the DoS traffic originating in this pool. The respective generated rule set is:

figure c
Fig. 10

Decision tree for ALLUDP dataset

The decision tree selects relevant descriptive features case by case to separate the DoS traffic from the legitimate traffic. The derived rule sets are not trivial and cover the attacking packets well while avoiding the legitimate ones. As a conservative next step, a member of the network operations center (NOC) can use the inferred rule sets directly in the Wireshark tool, where the operator can observe the effect of the rule set on a sample of traffic captured before and during the attack. This allows the operator to assess the impact of the rule set on the services in a dry run before the rule set is deployed in the network. From the operator's perspective, a low volume of false positives during attacks is tolerable, provided they do not include the traffic of a vital service (e.g. a management and monitoring connection to network devices).
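Such a dry run can also be performed offline against labeled samples. The sketch below is our own illustration (the helper name and the predicate representation of rules are assumptions): it reports which fraction of attack and legitimate packets a candidate rule set would drop, i.e. its true and false positive rates.

```python
def dry_run(rules, legit, attack):
    """Offline dry run of a rule set: rules are OR-ed predicates over
    packet dicts; a packet is dropped if any rule matches. Returns
    (true positive rate, false positive rate)."""
    def dropped(pkts):
        return sum(any(r(p) for r in rules) for p in pkts)
    tpr = dropped(attack) / len(attack) if attack else 0.0
    fpr = dropped(legit) / len(legit) if legit else 0.0
    return tpr, fpr
```

An operator would inspect the false-positive fraction (and which services it touches) before deploying the rule set in the network.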

As a part of our evaluation, we also asked our CESNET NOC to assess the proposed method. They retrospectively compared their workflow with and without the proposed method on cases they had faced in the past. We received positive feedback that the method speeds up their ability to identify the packets that are part of an attack, especially when they have not seen the attack before. They also identified scope for further research: they need to prioritize certain packet fields over others, for example, to force the training algorithm to use source IP prefixes so that the filtering rule set can be applied only to external prefixes (prefixes from abroad).

The method was also discussed with the DDoS Clearing House activity developed by SIDN (CONCORDIA 2020). They identified the need for a component in their infrastructure that would fingerprint DDoS attacks. Currently, the DDoS Clearing House derives the fingerprint from flow data only. SIDN considered the presented method relevant for fingerprinting DDoS attacks at the packet level.

5 Conclusion

In this article, we described our approach to automatically inferring packet filtering rules that mitigate DoS network traffic. We utilized decision tree induction to learn which packet fields are characteristic of a given attack vector and how to combine them. Subsequently, we applied the conversion algorithm to the inferred decision tree to transform it into a set of filtering rules. We prepared four multivector DoS datasets and thoroughly experimented with our approach to assess its behavior under various conditions simulating real deployment. The results showed that our approach is feasible and that it infers successful filtering rules for the attacking datasets. Although no single setup of hyperparameters fits all the datasets best, even the general setup resulted in successful rules, especially when the final relabeling of impure nodes was applied.

Besides the incremental improvements already mentioned in the article, we identified several topics for further research: how to support the automated application of the inferred rules in cases of high confidence, and how to automatically recognize a change of the attack vector in order to trigger retraining or stop the mitigation.