Waterfall Traffic Classification: A Quick Approach to Optimizing Cascade Classifiers
Heterogeneous wireless communication networks, like 4G LTE, transport diverse kinds of IP traffic: voice, video, Internet data, and more. In order to effectively manage such networks, administrators need adequate tools, of which traffic classification is the basis for visualizing, shaping, and filtering the broad streams of IP packets observed nowadays. In this paper, we describe a modular, cascading traffic classification system—the Waterfall architecture—and we extensively describe a novel technique for its optimization—in terms of CPU time, number of errors, and percentage of unrecognized flows. We show how to significantly accelerate the process of exhaustive search for the best performing cascade. We employ five datasets of real Internet transmissions and seven traffic analysis methods to demonstrate that our proposal yields valid results and outperforms a greedy optimizer.
Keywords: Network management, Convergent networks, Traffic classification, Machine learning
Internet traffic classification, or identification, is the act of matching IP packets with the computer program or communication protocol that generated them. It resembles an “Internet microscope”, which lets us look at a given network link, see the traffic flowing, and identify various types of IP flows. Another useful metaphor for Traffic Classification (TC) is listening to two foreigners talking nearby and recognizing their language. Quite often, we are able to identify an unfamiliar language or dialect even if we cannot fully understand it. Similarly, the TC problem is recognizing network protocols given their traffic, without interest in their full information content. Moreover, knowing the protocols behind IP flows makes networks easier to manage. For instance, TC is important for network monitoring: if we want to visualize the traffic flowing through a router, it is useful to know its components. TC also helps network security officers to reveal and track suspicious network activity. It is used for implementing Quality of Service (QoS) schemes, traffic shaping, and packet filtering. In convergent networks, TC is the mechanism that enables separate routing policies for voice, video, and data traffic.
A single IP packet alone is difficult to classify, as there is no application name in the packet headers. In the past, the service port number was used for discriminating the traffic class, but this became ineffective due to the rise of Peer-to-Peer (P2P) traffic in the early 2000s. A popular, de facto standard method used nowadays is Deep Packet Inspection (DPI): pattern matching on full packet contents. However, although it is more accurate than port-based classification, it requires more computing power and raises privacy concerns. Moreover, pervasive encryption and other issues make DPI increasingly irrelevant [5, 6]. Instead, modern classifiers investigate groups of packets, rather than single packets, to find distinguishing features of specific applications. Usually, a flow of packets is statistically summarized (for example, using the average packet size and inter-packet arrival time) and the resultant feature vector is classified using a Machine Learning (ML) algorithm. Such methods are more reliable: the overall behavior of a particular protocol or host is examined instead of seeking a strict match in a few packets.
The current challenge in TC is that it will have to deal with the increasing adoption of encryption, encapsulation, and multi-channel techniques, and with the tremendous growth of the Internet. Inevitably, the TC problem is becoming a very complex task that needs breaking into subproblems to keep it tractable. Recent papers proposed various interesting techniques tailored to subproblems in TC [9, 10, 11], but so far few authors have addressed the problem of combining these proposals to work together. Thus, in this paper, we describe our method for integrating different traffic classifiers, the Waterfall architecture, and we introduce a novel algorithm for optimizing such systems.
We describe how to implement our algorithm recursively, and we reflect on its time complexity (Sect. 3.3).
We extensively validate our proposal on a new dataset with reliable ground-truth information, and on 7 classification modules in total (Sect. 4).
We compare our proposal with a myopic optimizer (Sect. 4.4).
We release an open source implementation of our proposal as a publicly available module (Sect. 5).
2 Cascade Traffic Classification
The field of network traffic classification needs a method for integrating the results of various research activities. Many papers in this area describe classification methods that, in principle, propose a set of traffic features tailored to a set of network protocols [1, 9, 10, 11, 15, 16, 17]. Researchers promote their methods for classifying network traffic, which are usually quite effective, but none of them is able to exploit all observable phenomena in Internet traffic and identify all kinds of protocols.
The question arises: could we integrate these approaches into one system, so that we move forward, building on the achievements of our colleagues? How would this improve classification systems, in terms of accuracy, functionality, completeness, and speed? Answering these questions can open new perspectives for traffic classification. A robust method for combining classifiers can promote research that is more focused on new phenomena in the Internet, rather than addressing the same old issues.
In this Section we describe Waterfall: a modular architecture for traffic identification systems, which we introduced in [12]. Waterfall allows existing classification methods to complement each other, which makes the system as a whole capable of providing higher performance than could be achieved by any of the constituent modules.
A naïve approach to the integration problem would be to survey recent papers for traffic features and use them as long feature vectors, classified with a decent machine learning algorithm. Even with adequate techniques employed, this could quickly lead us to the curse of dimensionality: an exponential growth in the demand for training data as the feature space dimensionality increases. Besides, network flows differ in the set of available features, e.g. only some Internet flows trigger DNS queries. Some features need more packets to be computed: e.g. the port number is available after one packet, whereas payload statistics need several tens of packets. This means that different tools are needed for different protocols: some flows can be classified immediately using simple methods, while others need more sophisticated analysis. Finally, from the software engineering point of view, a big, monolithic system could be difficult to develop and maintain.
Instead, researchers adopt multi-classification, in particular the Behavior Knowledge Space (BKS) combination method, which fuses the outputs of many classifiers into one final decision. In principle, the idea behind BKS is to ask all classifiers for their answers on a particular problem x and then query a look-up table T for the final decision. The table T is constructed during training of the system, by learning the behavior of classifiers on a labeled dataset. For example, if an ensemble of 3 classifiers replies (A, B, A) for a sample with a ground-truth label of B, then the cell in T under index (A, B, A) is B (see , p. 128). This powerful technique can increase the performance of TC systems, as shown by Dainotti et al. [19], but compared with Waterfall, it inherently requires all modules to be run on each flow, with the drawback that the more modules are used, the more processing power is required.
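The BKS look-up described above can be illustrated with a minimal sketch. It assumes that a cell of T stores the ground-truth label seen most often for a given combination of answers during training; the function names and the toy data are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict


def train_bks(outputs, truths):
    """Build a BKS-style look-up table from labeled training samples.

    outputs: one tuple per sample with the answer of every classifier
    truths:  ground-truth labels aligned with outputs
    """
    counts = defaultdict(Counter)
    for answers, truth in zip(outputs, truths):
        counts[tuple(answers)][truth] += 1
    # Assumption: each cell keeps the most frequent ground-truth label
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def bks_decide(table, answers, default=None):
    """Query the table with the answers of all classifiers."""
    # Combinations never seen in training fall back to `default`
    return table.get(tuple(answers), default)


# Toy training set for three classifiers (illustrative values only)
train_outputs = [("A", "B", "A"), ("A", "B", "A"), ("A", "A", "A"), ("A", "B", "A")]
train_truths = ["B", "B", "A", "A"]

T = train_bks(train_outputs, train_truths)
print(bks_decide(T, ("A", "B", "A")))
print(bks_decide(T, ("B", "B", "B")))
```

Note that every classifier must be run before the table can be queried, which is the processing cost mentioned above.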
2.2 The Waterfall Architecture
Waterfall applies the idea of multi-classification, but queries the constituent classifiers sequentially instead of in parallel. It employs cascade classification, of which Kuncheva writes in her book on multi-classification: “cascade classifiers seem to be relatively neglected although they could be of primary importance for real-life applications.” (p. 106). We argue that cascade classification is a powerful and effective technique for combining algorithms that identify Internet traffic.
The selection criteria are designed to skip ineligible classifiers quickly. For example, in order to implement a module that identifies traffic by analyzing the packet payload sizes, the criterion could check if at least 5 packets with payload data were already sent in each direction. Only if this condition is true is a machine learning algorithm run to identify the protocol. Consequently, a large number of flows will probably be skipped, which saves computing resources and avoids classification with an inadequate method. On the other hand, if a flow satisfies this criterion, it will be analyzed with a method that does not need to support corner cases (that is, fewer than 5 payload packets). The selection criteria are optional, i.e. if a module does not have an associated criterion, the classification is always run.
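The interplay of selection criteria and sequential querying can be sketched as follows. This is a simplified illustration, not the Mutrics implementation: the `Module` type, the flow fields, and both toy modules are hypothetical, and a fixed rule stands in for the machine learning algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Flow = dict  # hypothetical flow record with a few precomputed fields


@dataclass
class Module:
    name: str
    classify: Callable[[Flow], Optional[str]]            # label, or None to reject
    criterion: Optional[Callable[[Flow], bool]] = None   # optional selection criterion


def waterfall(modules, flow):
    """Query modules in order; return (label, module name) of the first answer."""
    for m in modules:
        # A module with a criterion is skipped when the flow is ineligible
        if m.criterion is not None and not m.criterion(flow):
            continue
        label = m.classify(flow)
        if label is not None:
            return label, m.name
    return None, None  # the flow stays unrecognized


# Payload-size module: needs at least 5 payload packets in each direction
payload = Module(
    name="payload",
    criterion=lambda f: f["pkts_in"] >= 5 and f["pkts_out"] >= 5,
    classify=lambda f: "bulk" if f["mean_size"] > 1000 else None,  # toy rule
)
# Port module: no criterion, so it is always run
port = Module(name="port", classify=lambda f: {53: "dns", 80: "http"}.get(f["dst_port"]))

cascade = [payload, port]
short_flow = {"pkts_in": 2, "pkts_out": 2, "mean_size": 80, "dst_port": 53}
long_flow = {"pkts_in": 8, "pkts_out": 9, "mean_size": 1200, "dst_port": 80}
odd_flow = {"pkts_in": 6, "pkts_out": 6, "mean_size": 200, "dst_port": 9999}

for flow in (short_flow, long_flow, odd_flow):
    print(waterfall(cascade, flow))
```

In this sketch the short flow never reaches the payload classifier, the long flow is labeled before the port module is consulted, and the last flow is rejected by both modules.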
3 Waterfall Optimization
Cascade classification is a multi-classifier system implementing the classifier selection idea. Interestingly, although it was first introduced in 1998 by Alpaydin and Kaynak, so far few authors have considered the puzzle of an optimal cascade configuration that would match our problem. In a 2006 paper, Chellapilla et al. [21] propose a cascade optimization algorithm that updates the rejection thresholds of the constituent classifiers. The authors apply an optimized depth-first search to find the cascade that satisfies given constraints on time and accuracy. However, in contrast to our work, the system does not optimize the module order. In another paper on this topic, published in 2008 by Abdelazeem [22], the author proposes a greedy approach for building cascades: start with a generic solution and sequentially prepend a module that reduces CPU time. In contrast to our work, the approach does not evaluate all possible cascade configurations and thus can lead to suboptimal results. We will demonstrate this in Sect. 4 for an exemplary myopic optimizer.
Thus, we propose a new solution to the cascade classification problem, which is better suited for traffic classification than existing methods. Note that, unlike [21], we do not consider rejection thresholds as input values to the optimization problem. Instead, in the case of classifiers with tunable parameters, one could consider the same module parametrized with different values as separate modules, and apply our technique as well. For instance, a Bayes classifier with rejection thresholds on the posterior probability of 0.5, 0.75, and 0.90 would be considered as three separate modules.
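As a rough illustration of treating one parametrized classifier as several modules, the sketch below wraps a single posterior function with three rejection thresholds. The names `with_threshold` and `toy_posterior`, as well as the flow field `p_http`, are hypothetical; the posterior is a stand-in for a real Bayes classifier.

```python
def with_threshold(posterior, threshold):
    """Wrap a posterior function into a module that rejects uncertain answers."""
    def module(flow):
        probs = posterior(flow)
        label = max(probs, key=probs.get)
        # Reject (return None) when the best posterior is below the threshold
        return label if probs[label] >= threshold else None
    return module


def toy_posterior(flow):
    # Stand-in for a trained classifier: the posterior is read from the flow
    p = flow["p_http"]
    return {"http": p, "other": 1.0 - p}


# One classifier, three thresholds: three entries in the module pool
pool = {f"bayes@{t:.2f}": with_threshold(toy_posterior, t) for t in (0.5, 0.75, 0.90)}

flow = {"p_http": 0.8}
for name, module in pool.items():
    print(name, module(flow))
```

Each entry of `pool` can then be evaluated and placed in a cascade independently of the others.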
3.2 Proposed Solution
Static: classify all flows in F using each module in E, and
Dynamic: find the X sequence that minimizes C(X).
3.2.1 Static Evaluation
3.2.2 Dynamic Evaluation
Having all of the required experimental data, we can quickly estimate C(X) for arbitrary X. Because f, g, and h are used only for adjusting the cost function, and can be modified by the network administrator according to her needs (see Sect. 4.2), we focus only on their arguments, i.e. the cost factors \(T_X\), \(E_X\), and \(U_X\).
Note that the difference operator in Eq. 10 connects the static cost factors with the dynamic effects of a cascade. In stage A, our algorithm evaluates static performance of every module on the entire dataset F, but in stage B we want to simulate cascade operation, so we need to remove the flows that were classified in the previous steps. Thus, the operation in Eq. 10 is crucial.
Module performance depends on its position in the cascade, because preceding modules alter the distribution of traffic classes in the flows conveyed onward. For example, we can improve accuracy of a port-based classifier by putting a module designed for P2P in front of it, which should handle the flows that misuse the traditional port assignments.
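The role of the set difference, and the effect of module order, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a transcription of the equations: the `StaticResult` fields, the per-flow timing model, and the toy numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticResult:
    """Assumed outcome of evaluating one module on the whole dataset F."""
    selected: frozenset    # flows that pass the selection criterion
    classified: frozenset  # flows that received a label
    errors: frozenset      # flows that received a wrong label
    t_s: float             # assumed selection time per inspected flow
    t_c: float             # assumed classification time per selected flow


def simulate(static, cascade, all_flows):
    """Estimate (T, E, U) for a cascade without running any classifier."""
    remaining = set(all_flows)
    T, E = 0.0, 0
    for name in cascade:
        r = static[name]
        # Only flows that reached this module contribute to its cost
        T += r.t_s * len(remaining) + r.t_c * len(r.selected & remaining)
        E += len(r.errors & remaining)
        # Set difference: flows labeled here are not conveyed onward
        remaining -= r.classified
    return T, E, len(remaining)


flows = frozenset(range(1, 7))
static = {
    # Port module: labels flows 1-4, but flow 4 misuses a well-known port
    "port": StaticResult(flows, frozenset({1, 2, 3, 4}), frozenset({4}), 0.0, 1.0),
    # P2P module: slower, handles flows 4 and 5 correctly
    "p2p": StaticResult(frozenset({4, 5, 6}), frozenset({4, 5}), frozenset(), 0.5, 4.0),
}

print(simulate(static, ("port", "p2p"), flows))
print(simulate(static, ("p2p", "port"), flows))
```

With these toy numbers, placing the P2P module first removes the error made by the port module at the price of a longer total time, which mirrors the trade-off that the cost function has to resolve.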
Moreover, note that the results depend on F: the optimal cascade depends on the protocols present in the traffic, and on the ground-truth labels. The presented method cannot provide the ultimate solution that would match every network, but it can optimize a specific cascade system for a specific network. We further discuss this issue in Sect. 4.
We assume that the flows are independent of each other, i.e. labeling a particular flow does not require information on any other flow. If such information is needed, e.g. flow DNS names, it should be extracted before the classification process starts. Thus, traffic analysis and flow classification must be separated to uphold this assumption. We successfully implemented such systems for our DNS-Class  and Mutrics  classifiers.
In the next Section, we experimentally validate our method and show that it perfectly predicts \(E_X\) and \(U_X\), and approximates \(T_X\) properly. The simulated cost follows the real cost, so we claim our proposal is valid and can be used in practice. We also analyze the trade-offs between speed, accuracy, and ratio of unlabeled flows, to stress that the final choice of the cost function should depend on the purpose of the system.
4 Experimental Validation
analyzing the effect of cost function parameters on the result, which demonstrates optimization for different goals;
optimizing on one dataset and using the cascade on another dataset, which evaluates stability;
comparing our optimization method with myopic optimization, which shows that our work is meaningful.
Table 1: Datasets used for experimental validation (columns include: Dst. IP (K), Avg. util (Mbps), Avg. flows (/5 min))
For the first 3 datasets, we established ground-truth using light DPI. For Unibs1 and UPC1, we used the supplied ground-truth information, which sometimes was challenging: for example, a skype process generates some HTTP traffic apart from the Skype protocol. For each dataset, we trained the modules using a 60 % random sample of all flows, and used the remainder for testing. We considered only the first 10 s of each flow to resemble near-immediate traffic identification.
Table 2: Waterfall modules used for experimental validation (features listed: destination IP address; payload sizes of the first 4 packets in+out; destination port number; payload sizes of the first packet in+out; 4 basic statistics of packet sizes and inter-arrival times)
4.1 Experiment 1
In the first experiment, we compare simulated cost factors with real values for arbitrary cascade configurations. We randomly selected 100,000 flows from each of the first 4 datasets and ran static evaluation on them. Next, we generated 100 random cascades, and for each cascade we ran both real and simulated classification. As a result, we obtained corresponding pairs of real and estimated values of \(T_X,\, E_X\), and \(U_X\).
We conclude that in general our method properly estimates the cost factors and we can use it to simulate different cascade configurations. Note that accurate prediction of the CPU time is not necessary for optimization: it is enough for the simulated time to be roughly proportional to the real value. Moreover, even the real values will vary depending e.g. on the CPU load due to other tasks executed in the background, which is difficult to predict.
4.2 Experiment 2
In more detail, for time optimization, the optimal cascades are: (port) for Asnet1, (portsize) for Asnet2, and (dnsclass) for IITiS1. In the last case, dnsclass is preferred due to the high percentage of DNS traffic in IITiS1. In contrast, for accuracy optimization, the optimal cascades are: (portsize, dnsclass, npkts, port) for Asnet1; (dstip, dnsclass, portsize) for Asnet2; and (dnsclass, port, dstip, portsize, npkts) for IITiS1. Finally, optimizing for the minimum percentage of unrecognized flows yields a common result for all datasets: (dnsclass, dstip, npkts, port, portsize).
Note that the results depend on the cost function. We used a power function for presentation purposes, in order to easily show contrasting scenarios by small adjustments to the exponents. For specific purposes, a multi-linear function may be more appropriate, as it is often found in the literature, e.g. linear scalarization of multi-objective optimization problems. Moreover, more complex expressions, including thresholds on some parameters, can be used to find a classification system capable of real-time operation: given an expected number of flows per second, one could find a cascade that is fast enough to handle the traffic while keeping the other cost factors as low as possible.
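The alternatives mentioned above can be compared in a short sketch. The exact form of the cost function is an assumption here: the power variant simply sums the three cost factors raised to exponents a, b, c, and the function names, candidate cascades, and numbers are hypothetical.

```python
import math


def power_cost(T, E, U, a, b, c):
    # Assumed form: each cost factor raised to its own exponent, then summed
    return T ** a + E ** b + U ** c


def linear_cost(T, E, U, w_t, w_e, w_u):
    # Linear scalarization: a weighted sum of the cost factors
    return w_t * T + w_e * E + w_u * U


def budgeted_cost(T, E, U, max_T, base):
    # Threshold variant: cascades slower than the time budget are ruled out
    return math.inf if T > max_T else base(T, E, U)


def best(candidates, cost):
    """Return the name of the candidate cascade with the lowest cost."""
    return min(candidates, key=lambda name: cost(*candidates[name]))


# Hypothetical (T, E, U) factors of two simulated cascades, in arbitrary units
candidates = {"fast": (2.0, 4.0, 1.0), "accurate": (4.0, 2.0, 1.0)}

print(best(candidates, lambda T, E, U: power_cost(T, E, U, 3, 1, 1)))
print(best(candidates, lambda T, E, U: power_cost(T, E, U, 1, 3, 1)))
print(best(candidates, lambda T, E, U: budgeted_cost(
    T, E, U, 3.0, lambda T, E, U: linear_cost(T, E, U, 0.0, 1.0, 1.0))))
```

In this toy setting, raising the exponent on the time factor favors the faster cascade, raising the exponent on the error factor favors the more accurate one, and the time budget excludes the slower cascade regardless of its accuracy.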
We conclude that our proposal works and is adaptable, i.e. by varying the parameters we optimized the classification system for different goals.
4.3 Experiment 3
In the third experiment, we wanted to verify if the result of optimization is stable in time and space, i.e. if the optimal cascade stays optimal with time and changes of the network. We ran our optimization procedure for 4 datasets, obtaining a different cascade configuration for each dataset. Next, we evaluated these configurations on all datasets and measured the increase in the cost function C(X) compared with the original value. Note that we did not use the Unibs1 dataset for this experiment, as it lacks packet payloads and hence needs a different set of available modules.
Table 3 presents the results. We see that our proposal yielded results that are stable in time for the same network: the cascades found for Asnet1 and Asnet2, which are 8 months apart, are similar and can be exchanged with little decrease in performance. However, the cascades found for Asnet1 and Asnet2 gave 5–7 % worse performance compared with IITiS1, and 23–49 % worse performance on UPC1. We observed an extreme decrease in performance when we varied both the network and time, especially when classifying UPC1 with the cascade optimized for IITiS1.
Table 3: Experiment 3. Result stability: relative increase in the cost C(X), depending on the reference dataset used for determining the optimal cascade
We conclude that cascade optimization is specific to the network, but on the other hand our results suggest that an optimal cascade does not change significantly with time for a given network. Thus, the network administrator does not need to repeat the optimization procedure frequently.
4.4 Experiment 4
In the last experiment, we compared our proposal with a greedy optimizer, i.e. a situation in which we select all modules in order of increasing CPU time. This resembles the basic approach in the original paper on Waterfall [12]: start with a generic, heavy classifier, and prepend faster modules in front of it (see section 5 in [12]). Thus, for each module, we calculated the sum of \(t_s\) and \(t_c\) for each dataset separately, and ordered the modules from the fastest to the slowest. We used the results as cascade configurations, i.e. Waterfall systems configured with a conservative algorithm: “myopic” optimization.
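The myopic baseline amounts to a single sort. The sketch below assumes that each module comes with one measured pair of \(t_s\) and \(t_c\) for a given dataset; the timing values are made up for illustration and are not measurements from the experiments.

```python
def myopic_cascade(timings):
    """Order all modules from the fastest to the slowest by t_s + t_c."""
    return sorted(timings, key=lambda name: timings[name][0] + timings[name][1])


# Hypothetical (t_s, t_c) pairs for one dataset, in arbitrary time units
timings = {
    "stats": (0.2, 9.0),
    "port": (0.0, 0.5),
    "dnsclass": (0.1, 1.0),
    "npkts": (0.3, 4.0),
}

print(myopic_cascade(timings))
```

The ordering ignores accuracy and the interaction between modules, which is why it serves as a conservative point of comparison.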
On the other hand, we also optimized the system using our proposal, with the cost function given in Eq. 18, for a, b, c equal to 3.00, 1.75, 1.50, respectively. We chose these exponent values arbitrarily to show an example of time optimization: note that the a exponent (influencing the time cost factor) is the highest. Then, we used the results as cascade configurations, but optimized with an “optimal” algorithm.
Table 4 compares the results: in every case, our algorithm optimized the classification system to work faster and with fewer errors, usually with the same number of unclassified flows. This demonstrates the point of cascade optimization: it brings performance improvements. Recall that Unibs1 lacks packet payloads, hence we used 5 modules in general for this dataset instead of 7.
Table 4: Experiment 4. Average improvements compared to myopic cascade optimization
Portname, portsize, port, dstip, dnsclass, stats, npkts
Portsize, portname, dstip, dnsclass, npkts, port, stats
Portname, portsize, port, dstip, dnsclass, stats, npkts
Portsize, portname, dstip, dnsclass, npkts, port
Dnsclass, port, portname, portsize, dstip, stats, npkts
Port, portsize, npkts, stats
Portsize, port, dstip, stats, npkts
Dstip, portsize, port, npkts, stats
Portname, port, portsize, dstip, dnsclass, stats, npkts
Port, portname, dstip, portsize, dnsclass, npkts, stats
On average, the system worked 8 % faster compared with myopic time optimization, and reduced the error rate by 19 %. For Asnet2, it also resulted in a higher number of unrecognized flows, but the increase is insignificant given the dataset size, and this cost factor was not the goal of optimization. For instance, if one wants a real-time traffic visualization system, then some small portion of flows might remain unrecognized without negative effect on the whole system. Thus, we conclude that our work is meaningful and can help network administrators to tune cascade TC systems better than ad-hoc tools.
5 Conclusions
We showed that our Waterfall architecture, together with the new optimization technique, allows traffic classifiers to be combined effectively. We presented background on cascade classification (a multi-classifier variant) and employed it for identifying IP transmissions. Waterfall breaks the complex TC problem into smaller, independent modules, which are easier to manage. Moreover, we presented an optimization technique that automatically selects the set of best modules from a pool of available methods, and puts them in the right order for maximized performance. By means of experimental validation we demonstrated that our proposal works and can bring significant improvements to classification speed, accuracy, and number of recognized flows.
Our approach to optimizing Waterfall systems brings major improvements over ad-hoc methods. First, it reduces the time needed for optimization by orders of magnitude, by replacing experimentation on different cascades with simulation, which is much faster. Second, by performing an exhaustive search for the best solution, it finds better cascades than a greedy algorithm. However, due to the complex nature of the problem, it still requires a considerable amount of computation to check all possible cascade configurations, which in practice limits the maximum size of the module pool.
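To give a sense of how quickly the search space grows, the sketch below counts and enumerates candidate cascades, assuming that a cascade is any non-empty ordered selection of distinct modules from the pool. The helper names are hypothetical.

```python
from itertools import permutations
from math import perm


def count_cascades(n):
    """Number of non-empty ordered selections of distinct modules from a pool of n."""
    return sum(perm(n, k) for k in range(1, n + 1))


def all_cascades(modules):
    """Yield every candidate cascade as a tuple of module names."""
    for k in range(1, len(modules) + 1):
        yield from permutations(modules, k)


for n in (3, 5, 7, 10):
    print(n, count_cascades(n))

print(len(list(all_cascades(["port", "dstip", "dnsclass"]))))
```

For a pool of 7 modules this gives 13,699 candidates, and each additional module multiplies the count by more than the new pool size, so simulating a cascade has to be cheap for the exhaustive search to stay practical.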
We believe our contribution is important for managing convergent networks like LTE. Finally, in order to support further research in this area, we release an open source implementation of our proposal as an extension to the Mutrics classifier, available at https://github.com/iitis/mutrics.
- 1. Foremski, P. (2013). On different ways to classify Internet traffic: A short review of selected publications. Theoretical and Applied Informatics, 25(2), 147–164.
- 2. Keys, K., Moore, D., Koga, R., Lagache, E., Tesch, M., & Claffy, K. (2001). The architecture of CoralReef: An Internet traffic monitoring software suite. In PAM2001, Workshop on passive and active measurements, RIPE, Citeseer.
- 3. Karagiannis, T., Broido, A., Brownlee, N., Claffy, K. C., & Faloutsos, M. (2004). Is P2P dying or just hiding? In Global telecommunications conference, 2004. GLOBECOM’04. IEEE (Vol. 3, pp. 1532–1538). IEEE.
- 4. Sen, S., Spatscheck, O., & Wang, D. (2004). Accurate, scalable in-network identification of P2P traffic using application signatures. In Proceedings of the 13th international conference on World Wide Web (pp. 512–521). ACM.
- 6. Karagiannis, T., Papagiannaki, K., & Faloutsos, M. (2005). Blinc: Multilevel traffic classification in the dark. In ACM SIGCOMM computer communication review (Vol. 35, pp. 229–240). ACM.
- 7. Kim, H., Claffy, K. C., Fomenkov, M., Barman, D., Faloutsos, M., & Lee, K. (2008). Internet traffic classification demystified: Myths, caveats, and the best practices. In Proceedings of the 2008 ACM CoNEXT conference (p. 11). ACM.
- 12. Foremski, P., Callegari, C., & Pagano, M. (2014). Waterfall: Rapid identification of IP flows using cascade classification. In Computer networks (pp. 14–23). Springer.
- 14. Foremski, P., Callegari, C., & Pagano, M. (2015). Waterfall traffic identification: Optimizing classification cascades. In Computer networks (pp. 1–10). Springer.
- 15. Fiadino, P., Bär, A., & Casas, P. (2013). HTTPTag: A flexible on-line HTTP classification system for operational 3G networks. In International conference on computer communications, 2013. INFOCOM’13. IEEE.
- 17. Korczynski, M., & Duda, A. (2014). Markov chain fingerprinting to classify encrypted traffic. In INFOCOM, 2014 Proceedings IEEE. IEEE.
- 19. Dainotti, A., Pescapé, A., & Sansone, C. (2011). Early classification of network traffic through multi-classification. In Traffic monitoring and analysis (pp. 122–135).
- 21. Chellapilla, K., Shilman, M., & Simard, P. (2006). Optimally combining a cascade of classifiers. Proceedings of SPIE, 6067, 207–214.
- 22. Abdelazeem, S. (2008). A greedy approach for building classification cascades. In Seventh international conference on machine learning and applications, 2008. ICMLA’08 (pp. 115–120). IEEE.
- 25. Carela-Español, V., Bujlow, T., & Barlet-Ros, P. (2014). Is our ground-truth for traffic classification reliable? In Passive and active measurement (pp. 98–108). Springer.
- 27. Alcock, S., & Nelson, R. (2012). Libprotoident: Traffic classification using lightweight packet inspection. WAND Network Research Group, Technical Report.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.