Learning to control a structured-prediction decoder for detection of HTTP-layer DDoS attackers
Abstract
We focus on the problem of detecting clients that attempt to exhaust server resources by flooding a service with protocol-compliant HTTP requests. Attacks are usually coordinated by an entity that controls many clients. Modeling the application as a structured-prediction problem allows the prediction model to jointly classify a multitude of clients based on their cohesion of otherwise inconspicuous features. Since the resulting output space is too vast to search exhaustively, we employ greedy search and techniques in which a parametric controller guides the search. We apply a known method that sequentially learns the controller and the structured-prediction model. We then derive an online policy-gradient method that finds the parameters of the controller and of the structured-prediction model in a joint optimization problem; we obtain a convergence guarantee for the latter method. We evaluate and compare the various methods based on a large collection of traffic data of a web-hosting service.
1 Introduction
Distributed denial-of-service (DDoS) flooding attacks (Zargar et al. 2013) intend to prevent legitimate users from using a web-based service by exhausting server or network resources. DDoS attacks can target the network level or the application level. One way for attackers to target the network level is to continuously request TCP connections and leave the connection in an incomplete state, which eventually exhausts the number of connections which the server can handle; this is called SYN flooding. Adaptive SYN-received timeouts, packet-filtering policies, and an increasing network capacity are making it more difficult to mount successful network-level attacks (Peng et al. 2007; Zargar et al. 2013). By comparison, server resources such as CPU, I/O bandwidth, database and disk throughput are becoming easier targets (Amza et al. 2002; Ranjan et al. 2006). Attackers turn towards HTTP-layer flooding attacks in which they flood services with protocol-compliant requests that require the execution of scripts, expensive database operations, or the transmission of large files.
HTTP-layer attacks are more difficult to detect, because the detection mechanism ultimately has to decide whether all connecting clients have a legitimate reason for requesting a service in a particular way. In protocol-compliant application-level attacks, attackers have to sign their TCP/IP packets with their real IP address, because they have to complete the TCP handshake. One can therefore defend against flooding attacks by blacklisting offending IP addresses at the network router, provided that attacking clients can be singled out.
In order to detect attacking clients, one can engineer features of individual clients, train a classifier on labeled traffic data to detect attacking clients, and blacklist detected attackers. We follow this approach and evaluate it empirically, but the following considerations already indicate that it might work less than perfectly in practice. An individual protocol-compliant request is rarely conspicuous by itself; after all, the service is there to be requested. Most individual clients only post a small number of requests to a domain after which their IP address is not seen again. This implies that classification of individual clients will be difficult, and that aggregating information over requests into longitudinal client features (Ranjan et al. 2006; Xie and Yu 2009; Liu and Chang 2011) will only provide limited additional information.
However, DDoS attacks are usually coordinated by an entity that controls the attacking clients. Their joint programming is likely to induce some behavioral coherence of all attacking clients. Features of individual clients cannot reflect this cohesion. But a joint feature function that is parametrized with all clients \(x_i\) that interact with a domain and conjectured class labels \(y_i\) for all clients can measure the behavioral variance of all clients that are labeled as attackers. Structured-prediction methods (Lafferty et al. 2001; Tsochantaridis et al. 2005) match this situation because they are based on joint feature functions of multiple dependent inputs \(x_i\) and their output values \(y_i\). At application time, structured-prediction models have to solve the decoding problem of maximizing the decision function over all combinations of class labels. If the dependencies in the feature function are sequential or tree-structured, this maximization can be carried out efficiently using, for instance, the Viterbi algorithm for sequential data. In general as well as in this particular case, however, exhaustive search of the output space is intractable. Moreover, in our application environment, the search has to terminate after a fixed but a priori unknown number of computational steps due to a real-time constraint.
Collective classification algorithms (Luke 2009) conduct a greedy search for the highest-scoring joint labeling of the nodes of a graph. They do so by iteratively relabeling individual nodes given the conjectured labels of all neighboring nodes. We will apply this principle, and explore the resulting algorithm empirically. More generally, when exhaustive search for a structured-prediction problem is infeasible, an undergenerating decoder can still search a constrained part of the output space (Finley and Joachims 2008). Explicit constraints that make the remaining output space exhaustively searchable may also exclude good solutions. One may instead resort to learning a search heuristic. \({\mathcal HC}\) search (Doppa et al. 2014a, b) first learns a heuristic that guides the search to the correct output for training instances, and then uses this heuristic to control the decoder during training and application of the structured-prediction model. We will apply this principle to our application, and study the resulting algorithm.
The search heuristic of the \({\mathcal HC}\)-search framework is optimized to guide the decoder from an initial labeling to the correct output for all training instances. It is subsequently applied to guiding the decoder to the output that maximizes the decision function of the structured-prediction model, while this model is being learned. But the decision function is an imperfect model of the input-output relationship in the training data, especially while the parameters of the decision function are still being optimized. One may argue that a heuristic that does well at guiding the search to the correct output (that is known for the training instances) may do poorly at guiding it to the output that maximizes some decision function. We will therefore derive a policy-gradient model in which the controller and the structured-prediction model that uses the controller are learned in a joint optimization problem; we will analyze convergence properties of this model.
Defense mechanisms against DDoS attacks have so far been evaluated using artificial or semi-artificial traffic data that have been generated under plausible model assumptions of benign and malicious traffic (Ranjan et al. 2006; Xie and Yu 2009; Liu and Chang 2011; Renuka and Yogesh 2012). By contrast, we will compare all models under investigation on a large data set of network traffic that we collect in a large shared web hosting environment and classify manually. It includes unusual high-volume network traffic for more than 1,546 domains over 22,645 time intervals of 10 s in which we observe several million connections of more than 450,000 unique clients.
The rest of the paper is structured as follows. Section 2 derives the problem setting from our motivating application. We model the application as an anomaly-detection problem in Sect. 3, as the problem of independently classifying clients in Sect. 4, as a collective classification problem in Sect. 5, and as a structured-prediction problem with a parametric decoder in Sect. 6. Section 7 discusses how all methods can be instantiated for the attacker-identification application. We present an empirical study in Sect. 8; Sect. 9 discusses our results against the background of related work. Section 10 concludes.
2 Problem setting, motivating application
This section first lays out the relevant details of the application and establishes a high-level problem setting that will be cast into various learning paradigms in the following sections.
We focus on HTTP-layer denial-of-service flooding attacks (Zargar et al. 2013), which we define to be any malicious attempts at denying the service to its legitimate users by posting protocol-compliant HTTP requests so as to exhaust any computational resource, such as CPU, bandwidth, or database throughput. Our application environment is a shared web hosting service in which a large number of domains are hosted in a large computing center. Each domain continuously receives requests from many legitimate or attacking clients. A domain is constituted by the top-level and second-level domain in the HOST field of the HTTP header (“example.com”); a client is identified by its IP address.
The effects of an attack can be mitigated when the IP addresses of the attacking clients can be identified: IP addresses of known attackers can be temporarily blacklisted at the router. Anomalous traffic events can extend for as little as a few minutes; attacks can run for several hours. The high-level view of the system consists of three parts: the web servers, the blacklisting mechanism, and the DDoS-attacker-detection mechanism that decides which clients should be blacklisted.
The blacklisting mechanism resides at the main routers. It maintains a blacklist of IP addresses, and filters incoming traffic by blocking any TCP/IP packets from clients on that list. Blacklisting client IP addresses is the only feasible mitigation mechanism in our case. If requests from attacking IP addresses were to be processed, inspected, and filtered based on the individual payload, the servers would not be relieved sufficiently under an attack.
The attacker-detection mechanism listens to all TCP traffic between the web servers and blacklisting entity. Since attackers usually target a specific domain, we split the overall attacker-detection problem into an independent subproblem for each domain. This allows us to distribute the attacker-detection mechanism over multiple computing nodes, each of which handles a subset of domains. As long as the number of connections to a domain per unit of time, the number of clients that interact with the domain, and the estimated CPU load used by a domain lie below safe lower bounds, the attacker-detection mechanism can rule out the possibility of a DDoS attack to that domain and excludes its traffic from further processing. If one of the thresholds is exceeded for some domain, then the attacker-detection mechanism processes the traffic to that domain in batches of 10 s. In each 10 s interval, the output is a list of IP addresses that should be blacklisted. This list is forwarded to the blacklisting mechanism which takes the actual blacklisting action.
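The threshold gate described above can be sketched as follows. This is only an illustration: the threshold values and the names `DomainStats` and `needs_inspection` are hypothetical, since the text does not state the actual bounds or data structures of the deployed system.

```python
from dataclasses import dataclass


@dataclass
class DomainStats:
    """Traffic summary of one domain (hypothetical structure)."""
    connections_per_s: float
    num_clients: int
    cpu_load: float


# Placeholder thresholds; the real bounds are not given in the text.
MAX_CONNECTIONS_PER_S = 50.0
MAX_CLIENTS = 100
MAX_CPU_LOAD = 2.0


def needs_inspection(stats: DomainStats) -> bool:
    """Return True if at least one threshold is exceeded, in which case
    the domain's traffic is processed in 10 s batches."""
    return (
        stats.connections_per_s > MAX_CONNECTIONS_PER_S
        or stats.num_clients > MAX_CLIENTS
        or stats.cpu_load > MAX_CPU_LOAD
    )
```

Domains for which `needs_inspection` returns False would be excluded from further processing in the current interval.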
Hence, for each domain, we arrive at an independent learning problem that can be described abstractly by an unknown distribution \(p(\mathbf{{x}},\mathbf{{y}})\) over sets \(\mathbf{{x}}\in \mathcal {X}\) of clients \(x_j\) that interact with the domain within a 10 s interval and output variables \(\mathbf{{y}}\in \mathcal {Y}(\mathbf{{x}})=\{-1,+1\}^{m}\) which label each individual client \(x_j\in \mathbf{{x}}\) as legitimate (\(y_j=-1\)) or attacker (\(y_j=+1\)). The number of observed clients \(x_j\in \mathbf{{x}}\) may be different in each time interval. In Sects. 3 and 4, we will pursue approaches in which each client \(x_j\) is individually represented by a vector \({\varvec{\Phi }}_{\mathbf{{x}}}(x_j)\) that may depend on absolute features of \(x_j\) as well as on features of \(x_j\) that are measured relatively to the set of all clients \(\mathbf{{x}}\) that currently interact with the domain. In Sects. 5 and 6, we will represent the entire set of clients \(\mathbf{{x}}\) and a candidate labeling \(\mathbf{{y}}\) in a single joint feature representation \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\) and thereby arrive at a structured-prediction problem.
The following example illustrates why the problem of labeling sets \(\mathbf{{x}}\in \mathcal {X}\) of clients \(x_j\) that interact with the same domain within a time interval can be modeled as a structured-prediction problem. Consider that an attacker controls a large network of client computers distributed around the world. The attacker tries to exhaust the database capacity of a domain by posting new-user registration requests. Each individual client posts only three such requests, which is inconspicuous. It would be virtually impossible for a classifier to identify the individual requests as being malicious, because each one of them is protocol-compliant and lacks any salient or unusual property.
A structured-prediction model, on the other hand, can take joint attributes \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\) of sets of clients into account. For instance, since all attacking clients post similar new-user registration requests, the inner-group standard deviation of the URL string length will be much smaller for the attacking clients than for mixed sets of attacking and legitimate clients. A structured-prediction model can assign a negative weight to a feature that measures the inner-group standard deviation of the URL string length for all clients that are labeled as attackers. It can therefore learn to label clients in such a way that groups with small inner-group standard deviation of certain traffic parameters tend to have the same class label. We will discuss the feature representation that we employ for independent classification and for structured-prediction models in Sect. 7.4.
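To make the idea of a joint attribute concrete, the following minimal sketch computes the inner-group standard deviation of the URL string length for the clients that a candidate labeling marks as attackers. The function name, the per-client representation (one mean URL length per client), and the toy numbers are assumptions made for illustration; the actual feature representation is the one discussed in Sect. 7.4.

```python
import statistics


def attacker_url_length_std(url_lengths, labels):
    """Standard deviation of URL string length over clients labeled +1.

    url_lengths: one (mean) URL length per client, an assumed representation.
    labels: candidate labeling, -1 = legitimate, +1 = attacker.
    Returns 0.0 when fewer than two clients are labeled as attackers.
    """
    group = [u for u, y in zip(url_lengths, labels) if y == +1]
    return statistics.pstdev(group) if len(group) >= 2 else 0.0


# Toy data: three clients with near-identical requests, two unrelated ones.
url_lengths = [31.0, 30.0, 31.0, 12.0, 87.0]
coherent = [+1, +1, +1, -1, -1]   # groups the three similar clients
mixed = [+1, -1, +1, -1, +1]      # mixes dissimilar clients
print(attacker_url_length_std(url_lengths, coherent))
print(attacker_url_length_std(url_lengths, mixed))
```

With a negative weight on this feature, the coherent labeling receives the higher score, because its attacker group has the smaller spread.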
The classification problem for each 10 s interval has to be solved within 10 s; otherwise, a backlog of decisions could build up, especially under an attack. The number of CPU cycles that are available within these 10 s is not known a priori because it depends on the overall server load. For the structured-prediction models, we encode this anytime constraint by limiting the number of search steps to a random number T that is governed by some distribution. We can disregard this anytime constraint for models that treat clients as independent (Sects. 3 and 4), because the resulting classifiers are sufficiently fast at calculating the predictions.
Misclassified legitimate requests can potentially result in lost business while misclassified abusive requests consume computational resources; when CPU capacity, bandwidth, or database throughput capacities are exhausted, the service becomes unavailable. The resulting costs will be reflected in the optimization criteria by cost terms of false-negative and false-positive decisions. When the true labels of the clients \(\mathbf{{x}}\) are \(\mathbf{{y}}\), a prediction of \({\hat{\mathbf{{y}}}}\) incurs costs \(c(\mathbf{{x}},\mathbf{{y}},{\hat{\mathbf{{y}}}}) \ge 0\). We will detail the exact cost function in Sect. 7.
With the exception of the anomaly-detection models that we will discuss in Sect. 3, training the attacker-detection model requires labeled training data. Section 8.1 describes the largely manual process in which we determine which client IP addresses are in fact attackers.
3 Anomaly detection
In our application, an abundance of network traffic can be observed. However, manually labeling clients as legitimate and attackers is an arduous effort (see Sect. 8.1). Therefore, our first take is to model attacker detection as an anomaly-detection problem.
3.1 Problem setting for anomaly detection
In this formulation of the problem settings, the set of clients \(\mathbf{{x}}=\{x_1,\dots ,x_m\}\) that are observed in each 10 s interval is decomposed into individual clients \(x_j\). At application time, clients are labeled independently based on the value of a parametric decision function \(f_{\varvec{\phi }}({\varvec{\Phi }}_{\mathbf{{x}}}(x_j))\) which is a function of feature vector \({\varvec{\Phi }}_{\mathbf{{x}}}(x_j)\). We will define feature vector \({\varvec{\Phi }}_{\mathbf{{x}}}(x_j)\) in Sect. 7.4.2; for instance, it includes the number of different resource paths that client \(x_j\) has accessed, the number of HTTP requests that have resulted in error codes, both in terms of absolute counts and in proportion to all clients that connect to the domain.
3.2 Support vector data description
4 Independent classification
This section models the application as a standard classification problem.
4.1 Problem setting for independent classification
Clients \(\mathbf{{x}}=\{x_1,\dots ,x_m\}\) of each 10 s interval are treated as independent observations, described by feature vectors \({\varvec{\Phi }}_{\mathbf{{x}}}(x_j)\). As in Sect. 3.1, these vector representations are classified independently, based on the value of a parametric decision function \(f_{\varvec{\phi }}({\varvec{\Phi }}_{\mathbf{{x}}}(x_j))\). However, features may be engineered to depend on properties of all clients that interact with the domain in the time interval.
4.2 Logistic regression
5 Structured prediction with approximate inference
In Sect. 4, the decision function has been evaluated independently for each client. This prevented the model from taking into account joint features of particular groups of clients that are formed based on their predicted labels.
5.1 Problem setting for structured prediction with approximate inference
In the structured-prediction paradigm, a classification model infers a collective assignment \(\mathbf{{y}}\) of labels to the entirety of clients \(\mathbf{{x}}\) that are observed in a time interval. In our application, all clients that interact with the domain in the time interval are dependent. The model therefore has to label the nodes of a fully connected graph. This problem setting is also referred to as collective classification (Luke 2009).
5.2 Iterative classification algorithm
6 Structured prediction with a parametric decoder
In this section, we allow for a guided search of the label space. Since the space is vast, we allow the search to be guided by a parametric model that itself is optimized on the training data.
6.1 Problem setting for structured prediction with parametric decoder
At application time, prediction \({\hat{\mathbf{{y}}}}\) is determined by solving the decoding problem of Eq. 7; decision function \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) depends on a feature vector \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\). The decoder is allowed T (plus a constant number of) evaluations of the decision function, where \(T\sim p(T|\tau )\) is governed by some distribution and its value is not known in advance. The decoder has parameters \({\varvec{\psi }}\) that control this choice of labelings.
In the available T time steps, the decoder has to create a set of candidate labelings \(Y_T(\mathbf{{x}})\) for which the decision function is evaluated. The decoding process starts in a state \(Y_0(\mathbf{{x}})\) that contains a constant number of labelings. In each time step \(t+1\), the decoder can choose an action \(a_{t+1}\) from the action space \(A_{Y_t}\); this space should be designed to be much smaller than the label space \({\mathcal Y}(\mathbf{{x}})\). Action \(a_{t+1}\) creates another labeling \(\mathbf{{y}}_{t+1}\); this additional labeling creates successor state \(Y_{t+1}(\mathbf{{x}})=a_{t+1}(Y_t(\mathbf{{x}}))=Y_t(\mathbf{{x}})\cup \{\mathbf{{y}}_{t+1}\}\).
In a basic definition, \(A_{Y_t}\) could consist of actions \(\alpha _{\mathbf{{y}}j}\) (for all \(\mathbf{{y}}\in Y_t\) and \(1\le j\le n_{\mathbf{{x}}}\), where \(n_{\mathbf{{x}}}\) is the number of clients in \(\mathbf{{x}}\)) that take output \(\mathbf{{y}}\in Y_t(\mathbf{{x}})\) and generate labeling \({\bar{\mathbf{{y}}}}\) by flipping the labeling of the jth client; output \(Y_{t+1}(\mathbf{{x}})=Y_t(\mathbf{{x}})\cup \{\bar{\mathbf{{y}}}\}\) is \(Y_t(\mathbf{{x}})\) plus this modified output. This definition would allow the entire space \(\mathcal{Y}(\mathbf{{x}})\) to be reached from any starting point. In our experiments, we will construct an action space that contains application-specific state transactions such as “flip the labels of the k addresses that have the most open connections” (see Sect. 7.3).
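The basic flip action and the resulting state update can be sketched as follows. Labelings are represented as tuples over \(\{-1,+1\}\) and a state \(Y_t\) as a frozen set of labelings; this representation and the function names are assumptions chosen for the illustration.

```python
def flip_action(y, j):
    """Return a copy of labeling y in which the label of client j is flipped."""
    y_new = list(y)
    y_new[j] = -y_new[j]
    return tuple(y_new)


def successor_state(Y_t, y, j):
    """Compute Y_{t+1} as Y_t plus the labeling that flips client j in y."""
    if y not in Y_t:
        raise ValueError("y must be a labeling contained in the current state")
    return Y_t | {flip_action(y, j)}


# Start from the all-legitimate labeling of three clients.
Y0 = frozenset({(-1, -1, -1)})
Y1 = successor_state(Y0, (-1, -1, -1), 1)
print(sorted(Y1))
```

Because every action only adds one labeling, the state grows by at most one element per time step, which keeps the number of decision-function evaluations bounded by T plus the size of \(Y_0\).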
The choice of action \(a_{t+1}\) is based on parameters \({\varvec{\psi }}\) of the decoder, and on a feature vector \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\); for instance, actions may be chosen by following a stochastic policy \(a_{t+1}\sim \pi _{\varvec{\psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}))\). We will define feature vector \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) in Sect. 7.4.4; for instance, it may contain the difference between the geographical distribution of clients whose label is changed by action \(a_{t+1}\) and the geographical distribution of all clients with that same label. Choosing an action \(a_{t+1}\) requires an evaluation of \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) for each possible action in \(A_{Y_t(\mathbf{{x}})}\). Our problem setting is most useful for applications in which evaluation of \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) takes less time than evaluation of \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}}_{t+1})\)—otherwise, it might be better to evaluate the decision function for a larger set of randomly drawn outputs than to spend time on selecting outputs for which the decision function should be evaluated. Feature vector \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) may contain a computationally inexpensive subset of \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}}_{t+1})\).
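One common way to obtain a stochastic policy from linear scores \({\varvec{\psi }}^\top {\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a)\) is a softmax over the available actions. The sketch below assumes this form purely for illustration; the text up to this point only requires that actions are drawn from some policy \(\pi _{\varvec{\psi }}\), and the function names and toy feature vectors are hypothetical.

```python
import math
import random


def softmax_policy(psi, action_features):
    """Action probabilities proportional to exp(psi . Psi(x, Y_t, a)).

    action_features holds one decoder feature vector per available action.
    """
    scores = [sum(p * f for p, f in zip(psi, feats)) for feats in action_features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(psi, action_features, rng=random):
    """Draw the index of the next action from the softmax policy."""
    probs = softmax_policy(psi, action_features)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


psi = [1.0, 0.0]
features = [[2.0, 0.0], [0.0, 5.0], [1.0, 1.0]]
print(softmax_policy(psi, features))
print(sample_action(psi, features, rng=random.Random(0)))
```

Note that the decoder feature vector has to be evaluated once per available action before an action can be drawn, which is why inexpensive decoder features matter.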
6.2 HC search
HC search (Doppa et al. 2013) is an approach to structured prediction that learns parameters \({\varvec{\psi }}\) of a search heuristic, and then uses a decoder with this search heuristic to learn parameters \({\varvec{\phi }}\) of a structured-prediction model (the decision function \(f_{\varvec{\phi }}\) is called the cost function in HC-search terminology). We apply this principle to our problem setting.
At application time, the decoder produces labeling \({\hat{\mathbf{{y}}}}\) that approximately maximizes \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) as follows. The starting point \(Y_0(\mathbf{{x}})\) of each decoding problem contains the labeling produced by the logistic regression classifier (see Sect. 4.2). Action \(a_{t+1}\in A_{Y_t}\) is chosen deterministically as the maximum of the search heuristic \(f_{\varvec{\psi }}({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1}))={\varvec{\psi }}^\top {\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\). After T steps, the argmax \({\hat{\mathbf{{y}}}}\) of \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) over all outputs in \(Y_T(\mathbf{{x}})=a_T(\ldots a_1(Y_0(\mathbf{{x}}))\ldots )\) (Eq. 7) is returned as prediction.
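The decoding procedure just described (deterministic, heuristic-guided expansion for T steps, followed by an argmax of the decision function over \(Y_T(\mathbf{{x}})\)) can be sketched as follows. The callables `actions`, `heuristic`, and `decision` are hypothetical stand-ins for the action space, \({\varvec{\psi }}^\top {\varvec{\Psi }}\), and \(f_{\varvec{\phi }}\); skipping actions that would only reproduce a labeling already in the state is a simplification of this sketch, not something stated in the text.

```python
def hc_decode(x, y0, T, actions, heuristic, decision):
    """Heuristic-guided search followed by an argmax over the final state.

    actions(x, Y)             -> iterable of (action_id, new_labeling) pairs
    heuristic(x, Y, action_id) -> float, stands in for psi^T Psi(x, Y, a)
    decision(x, y)            -> float, stands in for f_phi(x, y)
    """
    Y = {y0}  # Y_0 holds the initial labeling, e.g. from a baseline classifier
    for _ in range(T):
        # Materialize all candidates before the state is modified.
        candidates = [(a, y) for a, y in actions(x, Y) if y not in Y]
        if not candidates:
            break
        # Deterministic choice: take the action with the largest heuristic value.
        _, y_next = max(candidates, key=lambda c: heuristic(x, Y, c[0]))
        Y.add(y_next)
    # The prediction is the highest-scoring labeling among those visited.
    return max(Y, key=lambda y: decision(x, y))
```

The decision function is evaluated only on the labelings collected in the final state, so the cost of decoding is governed by T and by the cost of evaluating the heuristic for each available action.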
After parameters \({\varvec{\psi }}\) have been fixed, parameters \({\varvec{\phi }}\) of structured-prediction model \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})={\varvec{\phi }}^\top {\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\) are trained on the training data set of input-output pairs \((\mathbf{{x}}_i,\mathbf{{y}}_i)\) using SVM-struct with margin rescaling and using the search heuristic with parameters \({\varvec{\psi }}\) as decoder. Negative pseudo-labels are generated as follows. For each \((\mathbf{{x}}_i,\mathbf{{y}}_i)\in L\), heuristic \({\varvec{\psi }}\) is applied \({\bar{T}}\) times to produce a sequence of output sets \(Y_0(\mathbf{{x}}_i),\ldots ,Y_{\bar{T}}(\mathbf{{x}}_i)\). When \({\bar{\mathbf{{y}}}}={{\mathrm{ argmax\,}}}_{\mathbf{{y}}\in Y_{\bar{T}}(\mathbf{{x}}_i)}{\varvec{\phi }}^\top {\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\ne \mathbf{{y}}_i\) violates the cost-rescaled margin, then a new training constraint is added, and parameters \({\varvec{\phi }}\) are optimized to satisfy these constraints.
6.3 Online policygradient decoder
The decoder of HC search has been trained to locate the labeling \({\hat{\mathbf{{y}}}}\) that minimizes the costs \(c(\mathbf{{x}}_i,\mathbf{{y}}_i,{\hat{\mathbf{{y}}}})\) for given true labels. It is then applied, however, to finding candidate labelings for which \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) is evaluated with the goal of maximizing \(f_{\varvec{\phi }}\). Since the decision function \(f_{\varvec{\phi }}\) may be an imperfect approximation of the input-output relationship that is reflected in the training data, labelings that minimize the costs \(c(\mathbf{{x}}_i,\mathbf{{y}}_i,{\hat{\mathbf{{y}}}})\) might be different from outputs that maximize the decision function. We will now derive a closed optimization problem in which decoder and structured-prediction model are jointly optimized. We will study its convergence properties theoretically.
We now demand that during the decoding process, the decoder chooses action \(a_{t+1}\in A_{Y_t}\) which generates successor state \(Y_{t+1}(\mathbf{{x}})=a_{t+1}(Y_t(\mathbf{{x}}))\) according to a stochastic policy, \(a_{t+1}\sim \pi _{\varvec{\psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}))\), with parameter \({\varvec{\psi }}\in \mathbb {R}^{m_2}\) (where \(m_2\) is the dimensionality of the decoder feature space) and features \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\). At time T, the prediction is the highest-scoring output from \({Y}_T(\mathbf{{x}})\) according to Eq. 7.
Lemma 1
Proof
The gradient stacks the partial gradients of \({\varvec{\psi }}\) and \({\varvec{\phi }}\) above each other. The partial gradient \(\nabla _{{\varvec{\phi }}}V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}}) = \sum _{a_{1\dots {\bar{T}}}} p(a_{1\dots {\bar{T}}}|{\varvec{\psi }},Y_0(\mathbf{{x}})) D_{\ell ,\tau }(a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\phi }})\) follows from Equation 18. The partial gradient \(\nabla _{{\varvec{\psi }}}V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}}) = \sum _{a_{1\dots {\bar{T}}}} p(a_{1\dots {\bar{T}}}|{\varvec{\psi }},Y_0(\mathbf{{x}})) E_{\ell ,B,\tau } (a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }})\) is a direct application of the Policy Gradient Theorem (Sutton et al. 2000; Peters and Schaal 2008) for episodic processes.
Theorem 1
(Convergence of Algorithm 1) Let the stochastic policy \(\pi _{\varvec{\psi }}\) be twice differentiable, let both \(\pi _{\varvec{\psi }}\) and \(\nabla _{\varvec{\psi }}\pi _{\varvec{\psi }}\) be Lipschitz continuous, and let \(\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}\) be bounded. Let step size parameters \(\alpha (i)\) satisfy Eq. 27. Let loss function \(\ell \) be differentiable in \({\varvec{\phi }}\) and both \(\ell \) and \(\nabla _{\varvec{\phi }}\ell \) be Lipschitz continuous. Let \(\ell \) be bounded. Let B be differentiable and both B and \(\nabla _{{\varvec{\phi }},{\varvec{\psi }}}B\) be bounded. Let \({\varOmega }_{\varvec{\phi }}=\gamma _1\Vert {\varvec{\phi }}\Vert ^2\), \({\varOmega }_{\varvec{\psi }}=\gamma _2\Vert {\varvec{\psi }}\Vert ^2\). Then, Algorithm 1 converges with probability 1.
Proof
Due to space limitations and in order to improve readability, throughout the proof we omit dependencies on \(\mathbf{{x}}\) and \(Y_0(\mathbf{{x}})\) in the notations when dependence is clear from the context. For example, we use \(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\) instead of \(p(a_{1..{\bar{T}}}|{\varvec{\psi }},Y_0(\mathbf{{x}}))\). We use Theorem 2 from Chap. 2 and Theorem 7 from Chap. 3 of Borkar (2008) to prove convergence. We first show that the full negative gradient \(-\sum _{(\mathbf{{x}},\mathbf{{y}})}\nabla _{{\varvec{\psi }},{\varvec{\phi }}}V_{{\varvec{\psi }}_i,{\varvec{\phi }}_i,\tau }(\mathbf{{x}},\mathbf{{y}}) - [\gamma _2 {\varvec{\psi }}_i^\top , \gamma _1 {\varvec{\phi }}_i^\top ]^\top \) is Lipschitz continuous.
\(p(a_{1..{\bar{T}}}|{\varvec{\psi }}) D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }})\) is Lipschitz because \(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\) is Lipschitz and bounded and \(D_{\ell ,\tau }\) is a sum of bounded Lipschitz functions; the product of two bounded Lipschitz functions is again Lipschitz. \([\gamma _2{\varvec{\psi }}^\top ,\gamma _1{\varvec{\phi }}^\top ]^\top \) is obviously Lipschitz as well, which concludes the considerations regarding the full negative gradient.
Let \(M_{i+1}=[E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)^\top , D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }}_i)^\top ]^\top - \sum _{(\mathbf{{x}},\mathbf{{y}})}\nabla _{{\varvec{\psi }},{\varvec{\phi }}}V_{{\varvec{\psi }}_i,{\varvec{\phi }}_i,\tau }(\mathbf{{x}},\mathbf{{y}})\), where \(E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)\) and \(D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }}_i)\) are samples as computed by Algorithm 1. We show that \(\{M_i\}\) is a martingale difference sequence with respect to the increasing family of \(\sigma \)-fields \(\mathcal {F}_i=\sigma ([{\varvec{\phi }}_{0}^\top ,{\varvec{\psi }}_{0}^\top ]^\top ,M_1,...,M_i),i\ge 0\). That is, \(\forall i\in \mathbb {N}\), \(\mathbb {E}[M_{i+1}|\mathcal {F}_i]=0\) almost surely, and \(\{M_i\}\) are square-integrable with \(\mathbb {E}[\Vert M_{i+1}\Vert ^2|\mathcal {F}_i]\le K(1+\Vert [{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top \Vert ^2)\) almost surely, for some \(K>0\).
Regarding \(E_{\ell ,B,\tau }\), we assume that \(\Vert \nabla \log \pi _{\varvec{\psi }}\Vert ^2\) is bounded by some \(K''\) and it follows that \(\Vert \sum _{T=1}^{\bar{T}}\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}(a_T|\mathbf{{x}},a_{T-1}(..Y_0(\mathbf{{x}})..))\Vert ^2\) is also bounded by \({\bar{T}}^2 K''\). Furthermore, \(\Vert \ell (\mathbf{{x}},\mathbf{{y}},a_{T}(..Y_0(\mathbf{{x}})..); {\varvec{\phi }})\Vert ^2\le K'(1+\Vert {\varvec{\phi }}\Vert ^2)\) and B is bounded per assumption, and thus \(\sum _{t=T}^{\bar{T}}p(t) \ell (\mathbf{{x}},\mathbf{{y}},a_{t}(..Y_0(\mathbf{{x}})..);{\varvec{\phi }}) - B(a_{1..T-1};{\varvec{\phi }},{\varvec{\psi }}) \le {\bar{K}}'(1+\Vert {\varvec{\phi }}\Vert ^2)\) with some \({\bar{K}}'\). It follows that \(\Vert E_{\ell ,B,\tau }(a_{1..{\bar{T}}}; {\varvec{\psi }}_i,{\varvec{\phi }}_i)\Vert ^2\le {\bar{T}}^2K''{\bar{K}}'(1+\Vert {\varvec{\phi }}\Vert ^2)\). As \(\nabla _{\varvec{\phi }}\ell (\mathbf{{x}},\mathbf{{y}},a_{T}(..Y_0(\mathbf{{x}})..);{\varvec{\phi }})\) is bounded per assumption, \(\Vert D_{\ell ,\tau }\Vert ^2\le K'''\) for some \(K'''>0\). The claim follows: \(\Vert [E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)^\top , D_{\ell ,\tau }(a_{1..{\bar{T}}}; {\varvec{\phi }}_i)^\top ]^\top \Vert ^2 = \Vert E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)\Vert ^2 + \Vert D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }}_i)\Vert ^2 \le K''' + {\bar{T}}^2K''{\bar{K}}'(1+\Vert {\varvec{\phi }}\Vert ^2) \le K''' + {\bar{T}}^2K''{\bar{K}}'(1+\Vert [{\varvec{\psi }}^\top ,{\varvec{\phi }}^\top ]^\top \Vert ^2)\).
We can now use Theorem 2 from Chap. 2 of Borkar (2008) to prove convergence by identifying function \(h([{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top )\), as required by the assumptions of that theorem, with the full negative gradient \(-\sum _{(\mathbf{{x}},\mathbf{{y}})}\nabla _{{\varvec{\psi }},{\varvec{\phi }}}V_{{\varvec{\psi }}_i,{\varvec{\phi }}_i,\tau }(\mathbf{{x}},\mathbf{{y}}) - \left[ \gamma _2 {\varvec{\psi }}_i^\top , \gamma _1 {\varvec{\phi }}_i^\top \right] ^\top \). The theorem states that the algorithm converges with probability 1 if the iterates \([{\varvec{\phi }}_{i+1}^\top ,{\varvec{\psi }}_{i+1}^\top ]^\top \) stay bounded.
7 Identification of DDoS attackers
We will now implement a DDoS-attacker detection mechanism using the techniques that we derived in the previous sections. We engineer a cost function, suitable feature representations \({\varvec{\Phi }}\) and \({\varvec{\Psi }}\), policy \(\pi _\psi \), and loss function \(\ell \) that meet the demands of Theorem 1.
7.1 Cost function
False-positive decisions (legitimate clients that are mistaken for attackers) lead to the temporary blacklisting of a legitimate user. This will result in unserved requests, and potentially lost business. False-negative decisions (attackers that are not recognized as such) will result in a wasteful allocation of server resources, and possibly in a successful DDoS attack that leaves the service unavailable for legitimate users. We decompose cost function \(c(\mathbf{{x}},\mathbf{{y}},{\hat{\mathbf{{y}}}})\) for a set of clients \(\mathbf{{x}}\) into the following parts.
We measure two cost-inducing parameters of false-negative decisions: the number of connections opened by attacking clients and the CPU use triggered by clients’ requests. According to the experience of the data-providing web hosting service, the same damage is done by attacking clients that (a) collectively initiate 200 connections per 10 s interval t and (b) collectively initiate scripts that use 10 CPUs for 10 s. However, those costs are not linear in their respective attributes. Instead, only limited resources are available, such as a finite number of CPUs, and the rise in costs of two scripts that use 80 or 90 %, respectively, of all available CPUs is different from the rise in costs of two scripts that use 20 or 30 % of CPUs. We define costs incurred by connections initiated by attackers to be quadratic in the number of connections. Similarly, costs for CPU usage are also quadratic.
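The paragraph above fixes only two properties of the false-negative cost: 200 attacker connections per 10 s interval do the same damage as 10 CPUs used for 10 s, and both costs grow quadratically. As a rough illustration of these two properties, and not the cost function actually used in the study, one could write the following sketch. The function name, the unit normalization, and the additive combination of the two terms are assumptions.

```python
def false_negative_cost(attacker_connections: float, attacker_cpus: float) -> float:
    """Hypothetical cost of undetected attackers in one 10 s interval."""
    # Normalize so that 200 connections and 10 CPUs are one unit of damage each
    # (equivalence point taken from the text).
    connection_units = attacker_connections / 200.0
    cpu_units = attacker_cpus / 10.0
    # Quadratic growth in each attribute; adding the two terms is an assumption.
    return connection_units ** 2 + cpu_units ** 2


# Under a quadratic cost, the same absolute increase costs more at high load.
rise_at_high_load = false_negative_cost(180, 0) - false_negative_cost(160, 0)
rise_at_low_load = false_negative_cost(60, 0) - false_negative_cost(40, 0)
```

With this shape, the rise in cost near saturation exceeds the rise at low utilization, which is the non-linearity the paragraph argues for.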
7.2 Loss function
7.3 Action space and stochastic policy
This section defines the action space \(A_{Y_i}(\mathbf{{x}}_i)\) of HC search and the online policy-gradient method as well as the stochastic policy \(\pi _{\varvec{\psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}))\) of online policy gradient. The action space consists of the following four groups of rules.

Switch the labels of the 1, 2, 5, or 10 clients from \(-1\) to \(+1\) that have the highest number of connections, the highest score of the baseline classifier, or the highest CPU consumption. All combinations of these attributes yield 12 possible rules.

Switch the label of the client from \(-1\) to \(+1\) that has the second-highest number of connections, independent classifier score, or CPU consumption (3 rules).

Switch the label of the client from \(+1\) to \(-1\) that has the lowest or second-lowest number of connections, baseline classifier score, or CPU consumption (6 rules).

Switch all clients from \(-1\) to \(+1\) whose independent classifier score exceeds \(-1\), \(-0.5\), 0, 0.5, or 1 (5 rules).
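To make the size of this rule set concrete, the four groups can be enumerated as in the following sketch. The rule names, the tuple encoding, and the attribute identifiers are hypothetical; the thresholds of the last group are assumed to be symmetric around zero.

```python
from itertools import product

# Hypothetical identifiers for the three ranking attributes named in the text.
ATTRIBUTES = ("connections", "classifier_score", "cpu")


def build_rules():
    """Enumerate the control rules as (name, attribute, parameter) tuples."""
    rules = []
    # Group 1: label the k top-ranked clients as attackers (4 x 3 = 12 rules).
    for k, attribute in product((1, 2, 5, 10), ATTRIBUTES):
        rules.append(("top_k_to_attacker", attribute, k))
    # Group 2: label the second-highest client as attacker (3 rules).
    for attribute in ATTRIBUTES:
        rules.append(("second_highest_to_attacker", attribute, 2))
    # Group 3: label the lowest or second-lowest client as legitimate (6 rules).
    for rank, attribute in product((1, 2), ATTRIBUTES):
        rules.append(("low_rank_to_legitimate", attribute, rank))
    # Group 4: label all clients above a score threshold as attackers (5 rules).
    for threshold in (-1.0, -0.5, 0.0, 0.5, 1.0):
        rules.append(("above_threshold_to_attacker", "classifier_score", threshold))
    return rules


RULES = build_rules()
```

The enumeration yields 12 + 3 + 6 + 5 = 26 rules, so the decoder chooses among at most 26 candidate actions per step.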
7.4 Feature representations
We engineer features that refer to base traffic parameters that we explain in Sect. 7.4.1. From these base traffic parameters, we derive feature representations for all learning approaches that we study. Figure 1 gives an overview of all features.
7.4.1 Base traffic parameters
In each 10 s interval, we calculate base traffic parameters of each client that connects to the domain. For clients that connect to the domain over a longer duration, we calculate moving averages that are reset after two minutes of inactivity. On the TCP protocol level, we extract the absolute numbers of full connections, open connections, open and resent FIN packets, timeouts, RST packets, incoming and outgoing packets, open and resent SYN-ACK packets, empty connections, connections that are closed before the handshake is completed, and incoming and outgoing payload per connection. We determine the average durations until the first FIN packet is received and until the connection is closed, as well as the response time.
We count the number of different resource paths that a client accesses and also count how often each client requests the currently most common path on the domain. If a specific resource is directly accessed, we extract and categorize the file ending into plain, script, picture, download, media, other, or none, which can give a hint on the type of the requested resource. We measure the fractions of request types per connection (GET, POST, or OTHER). We extract the number of connections with a query string and the average length of each query in terms of number of fields per client. We count the number of connections in which the referrer is the domain itself. Geographic locations are encoded in terms of 21 parameters that represent a geographic region.
7.4.2 Input features for SVDD, logistic regression and ICA
Independent classification uses features \({\varvec{\Phi }}_\mathbf{{x}}(x_j)\) that refer to a particular client \(x_j\) and to the entirety of all clients \(\mathbf{{x}}\) that interact with the domain. For each of the count-style base traffic parameters, \({\varvec{\Phi }}_\mathbf{{x}}(x_j)\) contains the absolute value, globally normalized over all clients of all domains, a logarithmic absolute count, the globally normalized sums and log-sums over all clients that interact with the domain, and the absolute values, normalized by the values of all clients that interact with the domain. For HTTP response code, resource type, and header fields, we also determine the entropy and frequencies per client and for all clients on the domain. See also Fig. 1.
Feature vector \({\varvec{\Phi }}_{\mathbf{{x}},\mathbf{{y}}}(x_j)\) for ICA contains all features from \({\varvec{\Phi }}_\mathbf{{x}}(x_j)\) plus the numbers of clients that are assigned class \(+1\) and \(-1\), respectively, in \(\mathbf{{x}}, \mathbf{{y}}\).
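The derived per-client features can be illustrated for a single count-style parameter as follows. This is a minimal sketch under stated assumptions: the function and key names are hypothetical, `global_scale` stands in for whatever global normalization constant is used, `log1p` is one possible choice of logarithm, and the domain normalization divides by the domain sum.

```python
import math


def count_style_features(client_value, domain_values, global_scale):
    """Derived features for one count-style base parameter of one client."""
    domain_sum = sum(domain_values)
    return {
        # Absolute value, normalized by an assumed global constant.
        "abs_global_norm": client_value / global_scale,
        # Logarithmic absolute count (log1p avoids log(0); an assumption).
        "log_abs": math.log1p(client_value),
        # Sum and log-sum over all clients that interact with the domain.
        "domain_sum_global_norm": domain_sum / global_scale,
        "domain_log_sum": math.log1p(domain_sum),
        # Absolute value relative to all clients of the domain.
        "abs_domain_norm": client_value / domain_sum if domain_sum > 0 else 0.0,
    }


def entropy_bits(counts):
    """Shannon entropy (in bits) of a categorical field such as the request type."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probabilities = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probabilities)


example = count_style_features(5, [5, 15], 10.0)
```

A client that issues only one request type has entropy zero, whereas a client that mixes request types evenly has maximal entropy.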
7.4.3 Features for structured prediction
Feature vector \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\) contains as one feature the sum \(\sum _{j=1}^{|\mathbf{{x}}|} y_j f^{LR}_{\varvec{\phi }}({\varvec{\Phi }}_\mathbf{{x}}(x_j))\) of scores of a previously trained logistic regression classifier over all clients \(x_j\in \mathbf{{x}}\). In addition, we distinguish between the groups of clients that \(\mathbf{{y}}\) labels as \(-1\) and \(+1\) and determine the inner-group means, inner-group standard deviations, and inter-group differences of the base traffic parameters. This results in a total of 297 features.
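For a single base traffic parameter, the joint features of an input and a candidate labeling could be computed as in the following sketch. The function and key names are hypothetical, and the use of the population standard deviation and of zeros for empty groups are assumptions.

```python
from statistics import mean, pstdev


def structured_features(values, scores, labels):
    """Joint features of one base parameter for a labeling with entries +1 / -1.

    values: base traffic parameter per client
    scores: logistic-regression score per client
    labels: candidate label per client
    """
    # Agreement between the labeling and the logistic-regression scores.
    features = {"lr_agreement": sum(y * s for y, s in zip(labels, scores))}
    group_means = {}
    for cls in (+1, -1):
        group = [v for v, y in zip(values, labels) if y == cls]
        # Empty groups are mapped to zero (assumption).
        group_means[cls] = mean(group) if group else 0.0
        features[f"mean_{cls:+d}"] = group_means[cls]
        features[f"std_{cls:+d}"] = pstdev(group) if group else 0.0
    # Inter-group difference of the group means.
    features["mean_diff"] = group_means[+1] - group_means[-1]
    return features


example = structured_features(
    values=[10, 20, 1, 3],
    scores=[2.0, 1.0, -1.0, -3.0],
    labels=[+1, +1, -1, -1],
)
```

Because these features depend on the complete labeling, they have to be recomputed whenever an action changes the label of a client.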
7.4.4 Decoder features
For HC search and online policy gradient, the parametric decoders depend on a joint feature representation \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) of input \(\mathbf{{x}}\) and action \(a_{t+1}=(r,\mathbf{{y}})\). It contains 92 joint features of the clients whose label \(a_{t+1}\) changes and the group (clients of positive or negative class) that \(a_{t+1}\) assigns the clients to. Features include the clients’ distance to the group mean and the clients’ distance to the group minimum for the base traffic parameters. For the fourth group of control actions, the feature representation includes the mean values of these same base attributes for all clients above and below the cutoff value. In order to save computation time, the mean and minimal group values before reassigning the clients are copied from \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\) which must have been calculated previously.
7.4.5 Execution-time constraint
We model distribution \(p(T|\tau )\) that limits the number of time steps that are available for HC search and online policy gradient as a beta distribution with \(\alpha =5\) and \(\beta =3\) that is capped at a maximum value of \({\bar{T}}=10\). We allow ICA to iterate over all instances five times; the results do not improve after that. The execution time of logistic regression is negligible and therefore unconstrained.
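A time budget of this kind could be sampled as in the following sketch. How the continuous beta variate is mapped to an integer number of steps is not specified here; scaling by the cap and rounding up is an assumption, as are the function and constant names.

```python
import math
import random

T_MAX = 10  # cap on the number of decoding steps


def sample_time_budget(rng: random.Random, alpha: float = 5.0, beta: float = 3.0) -> int:
    """Draw a number of decoding steps between 1 and T_MAX."""
    u = rng.betavariate(alpha, beta)  # value in [0, 1]
    # Assumed discretization: scale to the cap, round up, clamp to 1..T_MAX.
    return min(T_MAX, max(1, math.ceil(u * T_MAX)))


rng = random.Random(0)
budgets = [sample_time_budget(rng) for _ in range(1000)]
```

With \(\alpha =5\) and \(\beta =3\), the beta distribution has mean 0.625, so most sampled budgets lie in the upper half of the admissible range.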
8 Experimental study
This section explores the practical benefit of all methods for attacker detection.
8.1 Data collection
In order to both train and evaluate the attacker-detection models, we collect a data set of TCP/IP traffic from the application environment. We focus our data collection on high-traffic events in which a domain might be under attack. When the number of connections to a domain per unit of time, the number of clients that interact with the domain, and the CPU capacity used by a domain lie below safe lower bounds, we can rule out the possibility of a DDoS attack. Throughout an observation period of several days, we store all TCP/IP traffic to any domain for which a traffic threshold is exceeded, starting 10 min before the threshold is exceeded and stopping 10 min after no threshold is exceeded any longer. During the 10 min before and after each event, around 80 % of the 10 s intervals are empty.
This data collection procedure creates a sample of positive instances (attacking clients) that reflects the exact distribution which the attacker-detection system is exposed to during regular operations, because the attacker-detection model is applied when a domain exceeds the same traffic-volume and CPU thresholds. It creates a sample of negative instances (legitimate clients) that covers the operational distribution and also includes additional legitimate clients observed within 10 minutes of an unusual traffic event. Our intuition is that including additional legitimate clients that interact with the domain immediately before or after an attack in the training and evaluation data should make the model more robust against false-positive classifications.
We will refer to the entirety of traffic to a particular domain that occurs during one of these episodes as an event. Over our observation period, we collect 1,546 events. We record all traffic parameters described in Sect. 7.4. All data of one domain that are recorded within a time slot of 10 s are stored as a block. The same threshold-based prefiltering is applied in the operational system, and therefore our data collection reflects the distribution which the attacker-detection system is exposed to in practice.
We then label all traffic events as attacks or legitimate traffic and all clients as attackers or legitimate clients in a largely manual process. In a joint effort with experienced administrators, we decide for each of the 1,546 unusual events whether it is in fact a flooding attack. For this, we employ several tools and information sources. We search for known vulnerabilities in the domain’s scripts, analyze the domain’s recent regular connection patterns, check for unusual geolocation patterns and analyze the query strings and HTTP header fields. This labeling task is inherently difficult. On one hand, repeated queries by several clients that lead to the execution of a CPU-heavy script with either identical or random parameters might very likely indicate an attack. On the other hand, when a resource is linked to by a high-traffic web site and that resource is delivered via a computationally expensive script, the resulting traffic may look very similar to traffic observed during an attack and one has to search for and check the referrer for plausibility to identify the traffic as legitimate.
After having labeled all events, we label individual clients that connect to a domain during an attack event. We use several heuristics to group clients with a nearly identical and potentially malicious behavior and label them jointly by hand. We subsequently label the remaining clients after individual inspection.
In total, 50 of the 1,546 events are actually attacks with 10,799 unique attackers. A total of 448,825 client IP addresses are labeled as legitimate. In order to reduce memory and storage usage, we use a sample from all 10 s intervals that were labeled. We draw 25 % of intervals per attack and 10 % of intervals (but at least 5 if the event is long enough) per non-attack event. Our final data set consists of 1,096,196 labeled data points; each data point is a client that interacts with a domain within one of the 22,645 non-empty intervals of 10 s.
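The interval subsampling described above can be sketched as follows. The function name, the rounding direction, and the uniform sampling without replacement are assumptions; only the rates of 25 % and 10 % and the minimum of 5 intervals are taken from the text.

```python
import random


def sample_intervals(intervals, is_attack, rng):
    """Subsample the labeled 10 s intervals of one event."""
    n = len(intervals)
    if is_attack:
        # 25 % of the intervals of an attack event, rounded up (assumption).
        k = (n + 3) // 4
    else:
        # 10 % of the intervals, but at least 5 if the event is long enough.
        k = min(n, max(5, (n + 9) // 10))
    return rng.sample(intervals, k)


rng = random.Random(0)
attack_sample = sample_intervals(list(range(100)), True, rng)
short_event_sample = sample_intervals(list(range(3)), False, rng)
```

Sampling attack events at a higher rate counteracts, to some degree, the small share of attack events among all recorded events.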
8.2 Experimental setting
Our data includes 50 attack events; we therefore run 50-fold stratified cross validation with one attack event per fold. Since the attack durations vary, the number of test instances varies between folds. We determine the costs of all methods as the average costs over the 50 folds. In each fold, we reserve 20 % of the training portion to tune the hyperparameters of all models by a grid search.
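A fold assignment with exactly one attack event per fold can be sketched as follows. The round-robin distribution of the remaining events is an assumption, and the event identifiers are placeholders.

```python
def make_folds(attack_events, other_events):
    """Build one fold per attack event and spread all other events evenly."""
    folds = [[event] for event in attack_events]
    for i, event in enumerate(other_events):
        # Round-robin assignment of non-attack events (assumption).
        folds[i % len(folds)].append(event)
    return folds


# Placeholder identifiers: 50 attack events and 1,496 non-attack events.
attacks = [("attack", i) for i in range(50)]
others = [("other", i) for i in range(1546 - 50)]
folds = make_folds(attacks, others)
```

Each fold then serves once as test data while the remaining 49 folds are split into a training part and a 20 % tuning part.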
8.3 Reference methods
All previous studies on detecting and mitigating application-layer DDoS flooding attacks are based on anomaly-detection methods (Zargar et al. 2013; Ranjan et al. 2006; Xie and Yu 2009; Renuka and Yogesh 2012; Liu and Chang 2011). A great variety of heuristic and principled approaches is used. In our study, we represent this family of approaches by SVDD, which has been used successfully for several related computer-security problems (Düssel et al. 2008; Görnitz et al. 2013). Prior work generally uses smaller feature sets. Since we have not been able to improve our anomaly-detection or classification results by feature subset selection, we refrain from conducting experiments with the specific feature subsets that are used in published prior work.
Table 1 Costs, true-positive rates, and false-positive rates of all attacker-detection models. Costs marked with “\(*\)” are significantly lower than the costs of logistic regression

| Classification method | Mean costs per fold | TPR | FPR (\(\times 10^{-4}\)) |
|---|---|---|---|
| No filtering | \(3.363 \pm 1.348\) | 0 | 0 |
| SVDD | \(2.826 \pm 1.049\) | \(0.121 \pm 0.036\) | \(149.8 \pm 89.5\) |
| Log. reg. w/o domain-dependent features | \(1.322 \pm 0.948\) | \(0.394 \pm 0.056\) | \(7.0 \pm 2.1\) |
| Logistic regression | \(1.045 \pm 0.715\) | \(0.372 \pm 0.056\) | \(2.1 \pm 0.6\) |
| ICA | \(0.946 \pm 0.662 *\) | \(0.369 \pm 0.056\) | \(3.2 \pm 1.0\) |
| HC search with average margin | \(1.042 \pm 0.715\) | \(0.406 \pm 0.056\) | \(9.1 \pm 4.2\) |
| HC search with max-margin | \(1.040 \pm 0.714 *\) | \(0.398 \pm 0.056\) | \(7.0 \pm 3.3\) |
| Policy gradient with baseline function | \(0.945 \pm 0.664 *\) | \(0.394 \pm 0.055\) | \(3.7 \pm 1.2\) |
| Policy gradient without baseline function | \(0.947 \pm 0.665 *\) | \(0.394 \pm 0.055\) | \(3.7 \pm 1.2\) |
8.4 Results
Table 1 shows the costs, true-positive rates, and false-positive rates of all methods under investigation. All methods reduce the costs that are incurred by DDoS attacks substantially at low false-positive rates. SVDD reduces the costs of DDoS attacks compared to not employing any attacker-detection mechanism (no filtering) by about 16 %. Logistic regression reduces the costs compared to no filtering by about \(69~\%\); online policy gradient reduces the costs by 72 %. Differences between no filtering, SVDD, and logistic regression are highly significant. Cost values marked with an asterisk (“\(*\)”) are significantly lower than those of logistic regression in a paired t-test at \(p<0.1\). While HC search is only marginally (insignificantly) better than logistic regression, all other structured-prediction models improve upon logistic regression. Policy gradient with baseline function incurs marginally lower costs than policy gradient without baseline function and ICA, but the differences are not significant.
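The relative cost reductions quoted above follow directly from the mean costs in Table 1. The following sketch recomputes them; the function and variable names are illustrative only.

```python
NO_FILTERING = 3.363  # mean costs per fold without any attacker detection

# Mean costs per fold, copied from Table 1.
mean_costs = {
    "SVDD": 2.826,
    "Logistic regression": 1.045,
    "ICA": 0.946,
    "Policy gradient with baseline function": 0.945,
}


def relative_reduction(cost, reference=NO_FILTERING):
    """Fraction of the reference costs that a method avoids."""
    return 1.0 - cost / reference


for name, cost in mean_costs.items():
    print(f"{name}: {100 * relative_reduction(cost):.1f} %")
```

Rounded to integers, this gives 16 % for SVDD, 69 % for logistic regression, and 72 % for ICA and policy gradient.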
Logistic regression w/o domain-dependent features does not get access to features that take into account all other clients of that domain and to the entropy features; its mean costs are 1.322, compared to 1.045 for logistic regression with all features. This shows that engineering context features into the feature representation of independent classification already leads to much of the benefit of structured prediction. From a practical point of view, all classification methods are useful: they reduce the costs associated with DDoS attacks by around 70 % while misclassifying only an acceptable proportion (below \(10^{-3}\)) of legitimate clients. We conclude that ICA and policy gradient achieve a small additional cost reduction over independent classification of clients.
8.5 Analysis
In this section, we quantitatively explore which factors contribute to the residual costs of structured prediction models. The overall costs incurred by policy gradient decompose into costs that are incurred because \(f_{\varvec{\phi }}\) fails to select the best labeling from the decoding set \(Y_T(\mathbf{{x}})\), and costs that are incurred because decoder \(\pi _\psi \) approximates an exhaustive search by a very narrow and directed search that is biased by \(\psi \).
We conduct an experiment in which decoder \(\pi _{\varvec{\psi }}\) is learned on training data, and a perfect decision function \(f_\phi ^*\) is passed down by way of divine inspiration. To this end, we learn \(\pi _{\varvec{\psi }}\) on training data, use it to construct decoding sets \(Y_T(\mathbf{{x}}_i)\) for the test instances, and identify the elements \({\hat{\mathbf{{y}}}}={{\mathrm{argmin\,}}}_{\mathbf{{y}}\in Y_T(\mathbf{{x}}_i)} c(\mathbf{{x}}_i,\mathbf{{y}}_i,\mathbf{{y}})\) that have the smallest true costs; note that this is only possible because the true label \(\mathbf{{y}}_i\) is known for the test instances. We observe costs of \(0.012 \pm 0.008\) for the perfect decision function, compared to costs of \(0.945 \pm 0.664\) when \({\varvec{\phi }}\) is learned on training data. The costs of a perfect decoder that exhaustively searches the space of all labelings, in combination with perfect decision function \(f_\phi ^*\), would be zero. This implies that the decoder with learned parameters \(\psi \) performs almost as well as an (intractable) exhaustive search; it contributes only 1.3 % of the total costs whereas 98.7 % of the costs are due to the imperfection of \(f_{\varvec{\phi }}\). Increasing the decoding time T does not change these results.
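The oracle experiment and the resulting cost shares can be illustrated as follows. The selection function mirrors the argmin over the decoding set; its name, the toy labelings, and the Hamming-style toy cost are assumptions, whereas the two cost values are the ones reported above.

```python
def oracle_choice(decoding_set, true_labels, cost):
    """Pick the labeling from the decoding set that has the smallest true cost."""
    return min(decoding_set, key=lambda candidate: cost(true_labels, candidate))


def hamming_cost(true_labels, candidate):
    """Toy stand-in for the cost function: number of mislabeled clients."""
    return sum(1 for y, y_hat in zip(true_labels, candidate) if y != y_hat)


best = oracle_choice([(1, 1), (1, -1), (-1, -1)], (1, -1), hamming_cost)

# Share of the total costs that remains with a perfect decision function.
decoder_share = 0.012 / 0.945
decision_function_share = 1.0 - decoder_share
```

The ratio evaluates to about 1.3 %, which leaves about 98.7 % of the costs attributable to the learned decision function.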
This leaves parameter uncertainty of \({\varvec{\phi }}\) caused by limited labeled training data and the definition of the model space as possible sources of the residual costs. We conduct a learning curve analysis to explore how decreasing parameter uncertainty decreases the costs. We determine costs for various fractions of training events using 10-fold cross validation in Fig. 2. We use 10-fold cross validation in order to make sure that each test fold contains at least one attack event when reducing the fraction of events to 0.2. Since Table 1 uses 50-fold cross validation (which results in a higher number of training events), the end points of Fig. 2 are not directly comparable to the values in Table 1. Fig. 2 shows that the costs of all classification methods continue to decrease with an increasing number of training events. A massively larger number of training events would be required to estimate the convergence point. We conclude that parameter uncertainty of \({\varvec{\phi }}\) is the dominating source of costs of all classification models. The anomaly-detection method SVDD only requires unlabeled data that can be recorded in abundance. Interestingly, SVDD does not appear to benefit from a larger sample. This matches our subjective perception of the data: HTTP traffic rarely follows a “natural” distribution; anomalies are ubiquitous, but most of the time they are not caused by attacks.
8.6 Feature relevance
Table 2 Most relevant features of \(f_{\varvec{\phi }}\)

| Weight | Description |
|---|---|
| 3.01 | Average length of query strings of client |
| \(-2.38\) | Number of different resource paths of client |
| 2.34 | Sum of incoming payload of all clients of domain |
| 2.27 | Fraction of connections of client that request the most frequent resource path |
| 2.25 | Sum of response times of all clients of domain |
| 2.05 | Sum of response times of client |
| 1.64 | Fraction of connections for domain that accepts any version of English (e.g., en-us) in Accept-Language |
| \(-1.46\) | Entropy of request type (GET/POST/OTHER) |
| \(-1.32\) | Sum of outgoing payload of all clients |
| 1.27 | Sum of number of open FINs of all clients at end of 10 s interval |
| 1.23 | Average length of query string per connection |
| \(-1.21\) | Fraction of connections for domain that accepts any language other than EN, DE, ES, PT, CN, RU in Accept-Language |
| 1.19 | Fraction of all connections of all clients that query most frequent path |
| 1.17 | Sum of durations of all connections of all clients of domain |
| 1.13 | Fraction of connections of client that accepts any version of English (e.g., en-us) in Accept-Language |
| \(-1.13\) | Fraction of combined connections of all clients that directly request a picture type |
| \(-1.11\) | Fraction of connections of client that specified HTTP header field Content-Type as any text variant |
| \(-1.09\) | Fraction of connections of client that accepts any language other than EN, DE, ES, PT, CN, RU in Accept-Language |
| 1.08 | Log-normalized combined outgoing payload of client |
| \(-1.07\) | Fraction of all connections of all clients that specified HTTP header field Content-Type as any text variant |
8.7 Execution time
In our implementation, the step of extracting features \({\varvec{\Phi }}\) takes on average 1 ms per domain for logistic regression and ICA. The additional calculations take about 0.03 ms for logistic regression and 0.04 ms for ICA with five iterations over the nodes which results in nearly identical total execution times of 1.03 and 1.04 ms, respectively.
HC search and online policy gradient start with an execution of logistic regression. For \(T=10\) decoding steps, repeated calculations of \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\) and \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a)\) lead to a total execution time of 3.1 ms per domain in a high-traffic event.
9 Discussion and related work
Mechanisms that merely detect DDoS attacks still leave it to an operator to take action. Methods for detecting malicious HTTP requests can potentially prevent SQL-injection and cross-site scripting attacks, but their potential to mitigate DDoS flooding attacks is limited, because all incoming HTTP requests still have to be accepted and processed. Defending against network-level DDoS attacks (Peng et al. 2007; Zargar et al. 2013) is a related problem; but since network-layer attacks are not protocol-compliant, better detection and mitigation mechanisms (e.g., adaptive timeout thresholds, ingress/egress filtering) are available.
Since known detection mechanisms against network-level DDoS attacks are fairly effective in practice, our study focuses on application-level attacks, specifically on HTTP-level flooding attacks. Prior work on defending against application-level DDoS attacks has focused on detecting anomalies in the behavior of clients over time (Ranjan et al. 2006; Xie and Yu 2009; Liu and Chang 2011; Renuka and Yogesh 2012). Clients that deviate from a model of legitimate traffic are trusted less and less, and the rate at which their requests are processed is throttled. Trust-based and throttling approaches leave it necessary to accept incoming HTTP requests, maintain records of all connecting clients, and process the requests, possibly by returning an error code instead of the requested result. In our application environment, this would not sufficiently relieve the servers. Prior work on defending against application-level DDoS attacks has so far been evaluated using artificial or semi-artificial traffic data that have been generated under model assumptions of benign and offending traffic. This paper presents the first large-scale empirical study based on over 1,500 high-traffic events that we detected while monitoring several hundred thousand domains over several days.
Detection of DDoS attacks and of malicious HTTP requests has been modeled as anomaly-detection and classification problems. Anomaly-detection mechanisms employ a model of legitimate network traffic (Xie and Yu 2009) and treat unlikely traffic patterns as attacks. For the detection of SQL-injection, cross-site scripting (XSS), and PHP file-inclusion (L/RFI), traffic can be modeled based on HTTP header and query string information using HMMs (Ariu et al. 2011), n-gram models (Wressnegger et al. 2013), general kernels (Düssel et al. 2008), or other models (Robertson and Maggi 2010). Anomaly-detection mechanisms were investigated, from centroid anomaly-detection models (Kloft and Laskov 2012) to setting hard thresholds on the likelihood of new HTTP requests given the model, to unsupervised learning of support-vector data description (SVDD) models (Düssel et al. 2008; Görnitz et al. 2013).
Classification-based models require traffic data to be labeled; this gives classification methods an information advantage over anomaly-detection models. In practice, network traffic rarely follows predictable patterns. Spikes in popularity, misconfigured scripts, and crawlers create traffic patterns that resemble those of attacks; this challenges anomaly-detection approaches. Also, in shared hosting environments domains appear and disappear on a regular basis, making the definition of normal traffic even more challenging. A binary SVM trained on labeled data has been observed to consistently outperform a one-class SVM using n-gram features (Wressnegger et al. 2013). Similarly, augmenting SVDDs with labeled data has been observed to greatly improve detection accuracy (Görnitz et al. 2013). Other work has studied SVMs (Khan et al. 2007; Li et al. 2012) and other classification methods (Koc et al. 2012; Peddabachigari et al. 2007; Gharibian and Ghorbani 2007).
Structured-prediction algorithms jointly predict the values of multiple dependent output variables (in this case, labels for all clients that interact with a domain) for a (structured) input (Lafferty et al. 2001; Taskar et al. 2004; Tsochantaridis et al. 2005). At application time, structured-prediction models have to find the highest-scoring output during the decoding step. For sequential and tree-structured data, the highest-scoring output can be identified by dynamic programming. For fully connected graphs, exact inference of the highest-scoring output is generally intractable. Many approaches to approximate inference have been developed; for instance, for CRFs (Hazan and Urtasun 2010), structured SVMs (Finley and Joachims 2008), and general graphical models (Taskar et al. 2002). Several algorithmic schemes are based on iterating over the nodes and changing individual class labels locally. The iterative classification algorithm (Neville and Jensen 2000) for collective classification simplistically classifies individual nodes, given the conjectured labels of all neighboring nodes, and reiterates until this process reaches a fixed point.
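The iterative classification scheme can be sketched for a fully connected set of clients as follows. This is a simplified illustration, not the implementation evaluated in this study: the local classifier is a hypothetical callable that receives a client and the numbers of other clients currently labeled \(+1\) and \(-1\), and the toy classifier at the end is purely illustrative.

```python
def iterative_classification(clients, local_classifier, max_iterations=5):
    """Relabel clients one at a time until a fixed point or the iteration limit."""
    # Bootstrap with a context-free prediction for every client.
    labels = [local_classifier(x, 0, 0) for x in clients]
    for _ in range(max_iterations):
        changed = False
        for j, x in enumerate(clients):
            # Context: conjectured labels of all other clients (assumption:
            # the client's own label is excluded from the counts).
            n_pos = sum(1 for k, y in enumerate(labels) if k != j and y == +1)
            n_neg = len(labels) - 1 - n_pos
            new_label = local_classifier(x, n_pos, n_neg)
            if new_label != labels[j]:
                labels[j] = new_label
                changed = True
        if not changed:
            break  # fixed point reached
    return labels


def toy_classifier(score, n_pos, n_neg):
    """Toy rule: a client looks more suspicious if other attackers are present."""
    return +1 if score + 0.2 * n_pos > 0.5 else -1


result = iterative_classification([2.0, 0.4, -3.0], toy_classifier)
```

In the toy run, the borderline client is relabeled as an attacker once the first client has been labeled \(+1\), after which no label changes any more.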
Online policy gradient is the first method that optimizes the parameters of the structured-prediction model and the decoder in a joint optimization problem. This allows us to prove its convergence for suitable loss functions. By contrast, HC search (Doppa et al. 2013, 2014a) first learns a search heuristic that guides the search to the correct labeling for the training data, and subsequently learns the decision function of a structured-prediction model using this search heuristic as a decoder. Shi et al. (2015) follow a complementary approach by first training a probabilistic structured model, and then using reinforcement learning to learn a decoder.
Wick et al. (2011) sample structured outputs using a predefined, hand-crafted proposer function that samples outputs sequentially. In other work (Weiss and Taskar 2010), a cascade of Markov models is learned that uses increasingly higher-order features and prunes unlikely local outputs per cascade level. This work assumes an ordering of such cliques into levels, which is not applicable for fully connected graphs.
10 Conclusion
We have engineered mechanisms for detection of DDoS attackers based on anomaly detection, independent classification of clients, collective classification of clients, and structured prediction with HC search. We have then developed the online policy-gradient method that learns a decision function and a stochastic policy which controls the decoding process in an integrated optimization problem. We have shown that this method is guaranteed to converge for appropriate loss functions. From our empirical study that is based on a large, manually labeled collection of HTTP traffic with 1,546 high-traffic events we can draw three main conclusions. (a) All classification approaches outperform the anomaly-detection method SVDD substantially. (b) From a practical point of view, even the most basic logistic regression model is useful and reduces the costs by 69 % at a false-positive rate of \(2.1\times 10^{-4}\). (c) ICA and online policy gradient reduce the costs just slightly further, to a total reduction of about 72 %.
Notes
Acknowledgments
This work was supported by Grant SCHE540/122 of the German Science Foundation DFG and by a Grant from STRATO AG.
References
 Amza, C., Cecchet, E., Chanda, A., Cox, A., Elnikety, S., Gil, R., Marguerite, J., Rajamani, K., & Zwaenepoel, W. (2002). Bottleneck characterization of dynamic web site benchmarks. Technical report TR02391, Rice University.Google Scholar
 Ariu, D., Tronci, R., & Giacinto, G. (2011). HMMPayl: An intrusion detection system based on hidden Markov models. Computers & Security, 30(4), 221–241.CrossRefGoogle Scholar
 Borka, V. S. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge: Cambridge University Press.Google Scholar
 Borkar, V. S., & Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2), 447–469.MathSciNetCrossRefMATHGoogle Scholar
 Doppa, J. R., Fern, A., & Tadepalli, P. (2013). HCsearch: Learning heuristics and cost functions for structured prediction. AAAI, 2, 4.Google Scholar
 Doppa, J. R., Fern, A., & Tadepalli, P. (2014a). HCsearch: A learning framework for searchbased structured prediction. Journal of Artificial Intelligence Research, 50(1), 369–407.Google Scholar
 Doppa, J. R., Fern, A., & Tadepalli, P. (2014b). Structured prediction via output space search. The Journal of Machine Learning Research, 15(1), 1317–1350.Google Scholar
 Düssel, P., Gehl, C., Laskov, P., & Rieck, K. (2008). Incorporation of application layer protocol syntax into anomaly detection. In International Conference on Information Systems Security, pages 188–202. Springer.Google Scholar
 Finley, T., & Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In Proceedings of the International Conference on Machine Learning.Google Scholar
 Gharibian, F., & Ghorbani, A. A., (2007). Comparative study of supervised machine learning techniques for intrusion detection. In IEEE Annual Conference on Communication Networks and Services Research, pages 350–358.Google Scholar
 Görnitz, N., Kloft, M., Rieck, K., & Brefeld, U. (2013). Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46, 235–262.MathSciNetMATHGoogle Scholar
 Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5, 1471–1530.MathSciNetMATHGoogle Scholar
 Hazan, T., & Urtasun. R. (2010). Approximated structured prediction for learning large scale graphical models. arxiv:1006.2899.
 Khan, L., Awad, M., & Thuraisingham, B. (2007). A new intrusion detection system using support vector machines and hierarchical clustering. International Journal on Very Large Databases, 16(4), 507–521.CrossRefGoogle Scholar
 Kloft, M., & Laskov, P. (2012). Security analysis of online centroid anomaly detection. Journal of Machine Learning Research, 13(1), 3681–3724.MathSciNetMATHGoogle Scholar
 Koc, L., Mazzuchi, T. A., & Sarkani, S. (2012). A network intrusion detection system based on a hidden naïve Bayes multiclass classifier. Expert Systems with Applications, 39(18), 13492–13500.CrossRefGoogle Scholar
 Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning.Google Scholar
 Liu, H., & Chang, K. (2011). Defending systems against tilt DDoS attacks. In Proceedings of the International Conference on Telecommunication Systems, Services, and Applications.Google Scholar
 Li, Y., Xia, J., Zhang, S., Yan, J., Ai, X., & Dai, K. (2012). An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Systems with Applications, 39(1), 424–430.CrossRefGoogle Scholar
 McDowell, L. K., Gupta, K. M., & Aha, D. W. (2009). Cautious collective classification. The Journal of Machine Learning Research, 10, 2777–2836.
 Neville, J., & Jensen, D. (2000). Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data.
 Peddabachigari, S., Abraham, A., Grosan, C., & Thomas, J. (2007). Modeling intrusion detection system using hybrid intelligent systems. Journal of Network and Computer Applications, 30(1), 114–132.
 Peng, T., Leckie, C., & Ramamohanarao, K. (2007). Survey of network-based defense mechanisms countering the DoS and DDoS problems. ACM Computing Surveys, 39(1), 3.
 Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682–697.
 Ranjan, S., Swaminathan, R., Uysal, M., & Knightly, E. (2006). DDoS-resilient scheduling to counter application layer attacks under imperfect detection. In Proceedings of IEEE INFOCOM.
 Renuka Devi, S., & Yogesh, P. (2012). Detection of application layer DDoS attacks using information theory based metrics. Department of Information Science and Technology, College of Engineering Guindy. doi: 10.5121/csit.2012.2223.
 Robertson, W. K., & Maggi, F. (2010). Effective anomaly detection with scarce training data. In Network and Distributed System Security Symposium.
 Shi, T., Steinhardt, J., & Liang, P. (2015). Learning where to sample in structured prediction. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 875–884.
 Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. Cambridge, Massachusetts: MIT Press.
 Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence.
 Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. In Advances in Neural Information Processing Systems 16. Cambridge, Massachusetts: MIT Press.
 Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
 Weiss, D., & Taskar, B. (2010). Structured prediction cascades. In International Conference on Artificial Intelligence and Statistics, pages 916–923.
 Wick, M., Rohanimanesh, K., Bellare, K., Culotta, A., & McCallum, A. (2011). SampleRank: Training factor graphs with atomic gradients. In Proceedings of the 28th International Conference on Machine Learning, pages 777–784.
 Wressnegger, C., Schwenk, G., Arp, D., & Rieck, K. (2013). A close look on n-grams in intrusion detection: Anomaly detection versus classification. In Proceedings of the ACM Workshop on Artificial Intelligence and Security, pages 67–76.
 Xie, Y., & Yu, S. Z. (2009). A large-scale hidden semi-Markov model for anomaly detection on user browsing behaviors. IEEE/ACM Transactions on Networking, 17(1), 54–65.
 Zargar, S. T., Joshi, J., & Tipper, D. (2013). A survey of defense mechanisms against distributed denial of service (DDoS) flooding attacks. IEEE Communications Surveys & Tutorials, 15(4), 2046–2069.