Information Systems Frontiers, Volume 16, Issue 3, pp 509–521

A comparison of improving multi-class imbalance for internet traffic classification

  • Qiong Liu
  • Zhen Liu

Abstract

Most research on class imbalance to date has focused on the two-class problem. Multi-class imbalance is considerably more complicated, and there is little knowledge or experience of it in Internet traffic classification. In this paper we study the challenges that multi-class imbalanced data pose for Internet traffic classification using machine learning, and the ability of several adjustment methods to address them, including resampling (random under-sampling and random over-sampling) and cost-sensitive learning. We then empirically compare the effectiveness of these methods for Internet traffic classification and determine which produces the better overall classifier, and under what circumstances. The main contributions are as follows. (1) Cost-sensitive learning is derived with MetaCost, which incorporates misclassification costs into the learning algorithm, to improve multi-class imbalance based on flow ratio. (2) A new resampling model, covering both under-sampling and over-sampling, is presented to make the multi-class training data more balanced. (3) A scheme is presented for comparing the three methods with one another and with the original (unadjusted) case. Experimental results on sixteen datasets show that, when the three methods are compared with the original case, flow g-mean and byte g-mean increase statistically by 8.6 % and 3.7 %, 4.4 % and 2.8 %, and 11.1 % and 8.2 %, respectively. Cost-sensitive learning is the first choice when the sample size is sufficient; otherwise, resampling is more practical.
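As a rough illustration of two ideas named in the abstract, the sketch below shows random resampling of a multi-class training set and a g-mean score. It is a minimal sketch under stated assumptions, not the authors' model: g-mean is taken here to be the geometric mean of per-class recall computed over flows (the byte-weighted variant is not shown), and the function names `g_mean` and `resample`, the `target` parameter, and the class labels in the usage example are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition of flow g-mean)."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / total[c] for c in total]
    return math.prod(recalls) ** (1.0 / len(recalls))


def resample(samples, labels, target, seed=0):
    """Randomly under- or over-sample every class to exactly `target` items."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) >= target:
            # Majority class: random under-sampling without replacement.
            chosen = rng.sample(xs, target)
        else:
            # Minority class: keep all items, then duplicate random ones.
            chosen = xs + rng.choices(xs, k=target - len(xs))
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


# Illustrative usage with made-up class sizes (not data from the paper).
labels = ["WWW"] * 90 + ["MAIL"] * 8 + ["P2P"] * 2
flows = list(range(len(labels)))
balanced_flows, balanced_labels = resample(flows, labels, target=10)
print(Counter(balanced_labels))  # every class now has 10 flows
```

Because g-mean multiplies the per-class recalls, a classifier that ignores even one minority class scores zero, which is why it is a common choice for evaluating imbalanced multi-class problems.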

Keywords

Multi-class imbalance · Resampling · Cost-sensitive learning · Internet traffic classification

Notes

Acknowledgments

The authors would like to thank Andrew W. Moore et al. for publicly sharing their traffic flow datasets and Peng Xu et al. for processing the traffic traces. This work is supported by the National 973 Basic Research Program of China under grant Nos. 2007CB307100 and 2007CB307106.

References

  1. Alejo, R., Sotoca, J.M., & Casañ, G.A. (2008). An empirical study for the multi-class imbalance problem with neural networks. Proceedings of the 13th Iberoamerican Congress on Pattern Recognition, 479–486.
  2. Dainotti, A., Pescapé, A., & Claffy, K.C. (2012). Issues and future directions in traffic classification. IEEE Network, January/February 2012, 35–40.
  3. Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. Proceedings of the 5th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 155–164.
  4. He, H.T., Che, C.H., Ma, F.T., Luo, X.N., & Wang, J.M. (2008). Improve flow accuracy and byte accuracy in network traffic classification. Proceedings of the 4th International Conference on Intelligent Computing, 449–458.
  5. Jin, Y., Duffield, N., Erman, J., Haffner, P., Sen, S., & Zhang, Z. L. (2012). A modular machine learning system for flow-level traffic classification in large networks. ACM Transactions on Knowledge Discovery from Data, 6(1), 1–34. Article 4.
  6. Li, W., Canini, M., Moore, A. W., & Bolla, R. (2009). Efficient application identification and the temporal and spatial stability of classification schema. Computer Networks, 53(6), 790–809.
  7. Liu, Q., Liu, Z., & Huang, M. (2010). Study on Internet traffic classification using machine learning. Computer Science, 37(12), 35–39, 66 (in Chinese).
  8. Moore, A.W., & Zuev, D. (2005). Internet traffic classification using Bayesian analysis techniques. Proceedings of the ACM Sigmetrics, 50–60.
  9. Nguyen, T. T. T., & Armitage, G. (2008). A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys and Tutorials, 10(4), 56–76.
  10. Wang, S., & Yao, X. (2012). Multi-class imbalance problems: analysis and potential solutions. IEEE Transactions on Systems, Man and Cybernetics, Part B, PP(99), 1–13.
  11. Weiss, G.M., McCarthy, K., & Zabar, B. (2007). Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs? Proceedings of the International Conference on Data Mining, 35–41.
  12. Williams, N., Zander, S., & Armitage, G. (2006). A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification. ACM SIGCOMM Computer Communication Review, 36(5), 5–16.
  13. Witten, I.H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques, 2nd Edn, Elsevier Printing, 403–418.
  14. Zhong, W.C., Raahemi, B., & Liu, J. (2009). Learning on class imbalanced data to classify Peer-to-Peer applications in IP traffic using resampling techniques. Proceedings of the International Joint Conference on Neural Networks, 1573–1579.
  15. Zhou, Z. H., & Liu, X.Y. (2006). On multi-class cost-sensitive learning. Proceedings of the 21st National Conference on Artificial Intelligence, 567–572.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. School of Software Engineering, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
