Traffic Identification in Big Internet Data

Chapter in: Big Data Concepts, Theories, and Applications

Abstract

The era of big data brings new challenges to network traffic identification, an essential tool for network management and security. Traditional port-based and payload-based methods struggle with dynamic ports and encrypted payloads; to address these problems, state-of-the-art methods use flow statistical features and machine learning techniques to identify network traffic. This chapter reviews the statistical-feature-based traffic classification methods that have been proposed in the last decade. We also examine a new problem: unclean traffic in the training stage of machine learning, caused by labeling mistakes and the complex composition of big Internet data. The chapter further evaluates the performance of typical machine learning algorithms trained on unclean data. The review and the empirical study can guide academics and practitioners in choosing suitable traffic classification methods for real-world scenarios.
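To make the idea of unclean training data concrete, the following is a minimal sketch and not the chapter's experimental setup. It assumes two made-up traffic classes ("web" and "p2p"), two hypothetical flow statistical features (mean packet size and mean inter-arrival time), synthetic Gaussian data, and a plain 1-nearest-neighbor classifier chosen only because it is easy to write without external libraries. A fraction of the training labels is flipped to simulate labeling mistakes, and accuracy is then measured on a clean test set.

```python
import random

def make_flows(n, rng):
    """Generate n synthetic flows as (features, label) pairs.

    Features are (mean packet size in bytes, mean inter-arrival time in ms).
    The classes and their parameters are invented for illustration only.
    """
    profiles = {"web": (500.0, 20.0), "p2p": (1100.0, 60.0)}
    flows = []
    for _ in range(n):
        label = rng.choice(sorted(profiles))
        size_mu, iat_mu = profiles[label]
        features = (rng.gauss(size_mu, 80.0), rng.gauss(iat_mu, 8.0))
        flows.append((features, label))
    return flows

def corrupt_labels(flows, noise_rate, rng):
    """Flip each label with probability noise_rate (simulated mislabeling)."""
    other = {"web": "p2p", "p2p": "web"}
    return [
        (x, other[y]) if rng.random() < noise_rate else (x, y)
        for x, y in flows
    ]

def predict_1nn(train, x):
    """Return the label of the training flow closest to x."""
    def dist(a, b):
        # Rescale each feature so neither dominates the distance.
        return ((a[0] - b[0]) / 100.0) ** 2 + ((a[1] - b[1]) / 10.0) ** 2
    return min(train, key=lambda item: dist(item[0], x))[1]

def accuracy(train, test):
    """Fraction of test flows whose predicted label matches the true label."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

def run_experiment(noise_rates=(0.0, 0.1, 0.3), seed=0):
    """Train on increasingly noisy labels; always test on clean labels."""
    rng = random.Random(seed)
    clean_train = make_flows(300, rng)
    test = make_flows(200, rng)
    results = {}
    for rate in noise_rates:
        noisy_train = corrupt_labels(clean_train, rate, rng)
        results[rate] = accuracy(noisy_train, test)
    return results

if __name__ == "__main__":
    for rate, acc in run_experiment().items():
        print(f"label noise {rate:.0%}: accuracy {acc:.3f}")
```

A nearest-neighbor classifier copies the label of a single training flow, so it tends to be sensitive to mislabeled examples; other algorithms may react quite differently to the same noise, which is the kind of comparison the abstract describes.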

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61401371), the Fundamental Research Funds for the Central Universities (No. XDJK2015D029), the Science and Technology Foundation of Guizhou (No. LH20147386), and the Natural Science Foundation of the Education Department of Guizhou Province.

Author information

Correspondence to Jun Zhang.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Wang, B., Zhang, J., Zhang, Z., Luo, W., Xia, D. (2016). Traffic Identification in Big Internet Data. In: Yu, S., Guo, S. (eds) Big Data Concepts, Theories, and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-27763-9_3

  • DOI: https://doi.org/10.1007/978-3-319-27763-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27761-5

  • Online ISBN: 978-3-319-27763-9

  • eBook Packages: Computer Science (R0)
