
Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic

  • Special Issue Article
  • Published in: Computing

Abstract

Network traffic data typically comprise a large amount of normal traffic and a small amount of attack traffic. This imbalance between the two classes degrades prediction performance, for example through prediction bias against the minority class and the misclassification of normal data as outliers. Representative sampling methods for addressing the imbalance problem include various oversampling-based minority data synthesis models. However, because oversampling repeatedly learns from the same data, the classification model can overfit the training data. Meanwhile, undersampling methods proposed to address the imbalance problem can cause information loss because they remove data. To improve on these oversampling and undersampling approaches, we propose an oversampling ensemble method based on the slow-start algorithm. The proposed combined oversampling and undersampling method based on slow start (COUSS) is modeled on the congestion control algorithm of the Transmission Control Protocol: starting from a minimally undersampled dataset, the minority class is oversampled until overfitting occurs. Simulation results on the KDD99 dataset show that the proposed COUSS method improves the F1 score by 8.639%, 6.858%, 5.003%, and 4.074% compared to the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, adaptive synthetic sampling, and generative adversarial network oversampling algorithms, respectively. Therefore, the COUSS method can be regarded as a practical solution for data analysis applications.
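The slow-start analogy can be made concrete with a short sketch: after a light undersampling of the majority class, the minority class is oversampled in growing steps, doubling the target ratio each round in the manner of a TCP congestion window, until the validation F1 score stops improving (taken here as a proxy for the onset of overfitting). The sketch below is only an illustration of this idea, not the authors' implementation: it assumes a binary dataset with the minority (attack) class labeled 1, uses SMOTE from imbalanced-learn as the oversampler and a random forest as the classifier, and the majority_keep ratio and the doubling step are illustrative assumptions.

```python
# Illustrative sketch of a slow-start-style over/undersampling loop.
# Assumptions (not from the paper): binary labels {0, 1} with the minority
# (attack) class labeled 1, SMOTE as the oversampler, a random forest as the
# classifier, and a fixed 80% keep ratio for the initial undersampling.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler


def slow_start_resample(X, y, majority_keep=0.8, random_state=0):
    """Lightly undersample the majority class, then oversample the minority
    class in slow-start fashion until the validation F1 stops improving."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state)

    counts = np.bincount(y_tr)
    major_label, n_major = int(np.argmax(counts)), int(counts.max())
    n_minor = int(counts.min())

    # Step 1: minimal undersampling of the majority class.
    rus = RandomUnderSampler(
        sampling_strategy={major_label: int(n_major * majority_keep)},
        random_state=random_state)
    X_res, y_res = rus.fit_resample(X_tr, y_tr)

    # Step 2: slow start -- double the minority/majority target ratio each
    # round until the validation F1 score drops (overfitting proxy).
    ratio = n_minor / (n_major * majority_keep)
    best_f1, best_data = -1.0, (X_res, y_res)
    while ratio < 1.0:
        ratio = min(ratio * 2, 1.0)  # congestion-window-style doubling
        X_os, y_os = SMOTE(sampling_strategy=ratio,
                           random_state=random_state).fit_resample(X_res, y_res)
        clf = RandomForestClassifier(random_state=random_state).fit(X_os, y_os)
        f1 = f1_score(y_val, clf.predict(X_val))  # assumes minority label 1
        if f1 <= best_f1:  # performance stopped improving: stop growing
            break
        best_f1, best_data = f1, (X_os, y_os)
    return best_data
```

For a multi-class dataset such as KDD99, the same loop could be applied per attack class or driven by a macro-averaged F1 score; that generalization is omitted here for brevity.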





Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. 2019R1F1A1060742).

Author information


Corresponding author

Correspondence to Hyunhee Park.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Park, S., Park, H. Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic. Computing 103, 401–424 (2021). https://doi.org/10.1007/s00607-020-00854-1

