A machine learning approach for feature selection traffic classification using security analysis

The Journal of Supercomputing

Abstract

Class imbalance has become a major problem that leads to inaccurate traffic classification. Accurate classification of traffic flows supports security monitoring, IP management, intrusion detection, and related tasks. Machine learning (ML) approaches are widely used in the literature to address the traffic classification problem. In this paper, we propose an ML-based hybrid feature selection algorithm named WMI_AUC that uses two metrics: weighted mutual information (WMI) and the area under the ROC curve (AUC). These metrics select effective features from a traffic flow. To then select robust features from among the selected features, we propose a robust feature selection algorithm. The proposed approach increases the accuracy of ML classifiers and helps in detecting malicious traffic. We evaluate our work using 11 well-known ML classifiers on trace datasets from different network environments. Experimental results show that our algorithms achieve more than 95% flow accuracy.
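
To make the idea of combining a mutual-information metric with an AUC metric concrete, the following is a minimal Python sketch of a two-stage filter. It is not the WMI_AUC algorithm itself: the abstract does not define the weighting used in WMI, so the sketch assumes that each sample is reweighted so every class carries equal total mass. The function names, the `top_k` and `auc_min` parameters, and the toy feature columns are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def weighted_mutual_information(xs, ys):
    """I(X;Y) in nats for a discrete feature, with samples reweighted so
    every class contributes equal total mass (an ASSUMED weighting scheme,
    not necessarily the paper's WMI definition)."""
    class_counts = Counter(ys)
    k = len(class_counts)
    joint = defaultdict(float)
    for x, y in zip(xs, ys):
        # A sample of class c gets weight 1 / (k * n_c); all weights sum to 1.
        joint[(x, y)] += 1.0 / (k * class_counts[y])
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y])) for (x, y), p in joint.items())


def single_feature_auc(xs, ys, positive):
    """AUC obtained by using the raw feature value as the score for
    `positive` versus all other classes (pairwise-comparison form)."""
    pos = [x for x, y in zip(xs, ys) if y == positive]
    neg = [x for x, y in zip(xs, ys) if y != positive]
    # A tie between a positive and a negative sample counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1.0 - auc)  # the direction of the feature does not matter


def select_features(columns, ys, positive, top_k, auc_min):
    """Keep the top_k features by weighted MI, then drop those whose
    single-feature AUC is below auc_min."""
    ranked = sorted(
        columns,
        key=lambda name: weighted_mutual_information(columns[name], ys),
        reverse=True,
    )
    return [
        name
        for name in ranked[:top_k]
        if single_feature_auc(columns[name], ys, positive) >= auc_min
    ]


# Hypothetical, imbalanced toy data: six "web" flows and two "p2p" flows,
# each described by three already-discretised features.
labels = ["web"] * 6 + ["p2p"] * 2
features = {
    "pkt_size_bin": [0, 0, 0, 0, 0, 0, 1, 1],  # separates the classes exactly
    "port_bin":     [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the class
    "duration_bin": [0, 0, 0, 0, 0, 1, 1, 1],  # mostly informative
}

selected = select_features(features, labels, positive="p2p", top_k=2, auc_min=0.9)
print(selected)
```

In this sketch the mutual-information stage ranks features by how much they tell us about the class after the minority class has been up-weighted, and the AUC stage acts as a second, threshold-based check on each surviving feature. Raising `auc_min` makes the second stage stricter.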

Acknowledgements

This work was supported by National Natural Science Foundation of China under Grant No. 61571144.

Author information

Corresponding author

Correspondence to Muhammad Shafiq.

About this article

Cite this article

Shafiq, M., Yu, X., Bashir, A.K. et al. A machine learning approach for feature selection traffic classification using security analysis. J Supercomput 74, 4867–4892 (2018). https://doi.org/10.1007/s11227-018-2263-3
