Traffic Identification in Big Internet Data

Wang, Binfeng; Zhang, Jun; Zhang, Zili; Luo, Wei; Xia, Dawen

doi:10.1007/978-3-319-27763-9_3

Abstract

The era of big data brings new challenges to network traffic classification, an essential tool for network management and security. To deal with the problems of dynamic ports and encrypted payloads that defeat traditional port-based and payload-based methods, state-of-the-art methods employ flow statistical features and machine learning techniques to identify network traffic. This chapter reviews the statistical-feature-based traffic classification methods that have been proposed in the last decade. We also examine a new problem: unclean traffic in the training stage of machine learning, caused by labeling mistakes and the complex composition of big Internet data. This chapter further evaluates the performance of typical machine learning algorithms with unclean training data. The review and the empirical study can guide academics and practitioners in choosing proper traffic classification methods in real-world scenarios.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61401371), the Fundamental Research Funds for the Central Universities (No. XDJK2015D029), the Science and Technology Foundation of Guizhou (No. LH20147386), and the Natural Science Foundation of the Education Department of Guizhou Province.

Author information

Authors and Affiliations

College of Computer and Information Science, Southwest University, Chongqing, 400715, China
Binfeng Wang, Jun Zhang, Zili Zhang & Dawen Xia
School of Information Technology, Deakin University, Geelong, VIC, 3216, Australia
Jun Zhang & Zili Zhang
School of Information Technology, Deakin University, Melbourne, VIC, 3215, Australia
Wei Luo
School of Information Engineering, Guizhou Minzu University, Guiyang, 550025, China
Dawen Xia

Corresponding author

Correspondence to Jun Zhang.

Editor information

Editors and Affiliations

School of Information Technology, Deakin University, Burwood, Victoria, Australia
Shui Yu
School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Japan
Song Guo

Cite this chapter

Wang, B., Zhang, J., Zhang, Z., Luo, W., Xia, D. (2016). Traffic Identification in Big Internet Data. In: Yu, S., Guo, S. (eds) Big Data Concepts, Theories, and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-27763-9_3

DOI: https://doi.org/10.1007/978-3-319-27763-9_3
Published: 04 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27761-5
Online ISBN: 978-3-319-27763-9
eBook Packages: Computer Science (R0)
