Abstract
The era of big data brings new challenges to network traffic identification, an essential tool for network management and security. To deal with the dynamic ports and encrypted payloads that defeat traditional port-based and payload-based methods, state-of-the-art methods employ flow statistical features and machine learning techniques to identify network traffic. This chapter reviews the statistical-feature-based traffic classification methods proposed in the last decade. We also examine a new problem: unclean traffic in the training stage of machine learning, caused by labeling mistakes and the complex composition of big Internet data. The chapter further evaluates the performance of typical machine learning algorithms trained on unclean data. The review and the empirical study can guide researchers and practitioners in choosing suitable traffic classification methods for real-world scenarios.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61401371), the Fundamental Research Funds for the Central Universities (No. XDJK2015D029), the Science and Technology Foundation of Guizhou (No. LH20147386), and the Natural Science Foundation of the Education Department of Guizhou Province.
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Wang, B., Zhang, J., Zhang, Z., Luo, W., Xia, D. (2016). Traffic Identification in Big Internet Data. In: Yu, S., Guo, S. (eds) Big Data Concepts, Theories, and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-27763-9_3
DOI: https://doi.org/10.1007/978-3-319-27763-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27761-5
Online ISBN: 978-3-319-27763-9
eBook Packages: Computer Science (R0)