Abstract
Network traffic data typically comprise a large amount of normal traffic and only a small amount of attack data. This class imbalance degrades prediction performance, for example by biasing the classifier toward the majority class and causing minority samples to be treated as outliers. Representative approaches to the imbalance problem include oversampling-based models that synthesize minority data. However, because oversampling repeatedly learns from the same data, the classification model can overfit the training set. Conversely, undersampling methods can cause information loss because they discard data. To improve on both approaches, we propose a combined oversampling and undersampling method based on the slow-start (COUSS) algorithm, inspired by the congestion control mechanism of the transmission control protocol (TCP). Starting from a lightly undersampled dataset, COUSS incrementally oversamples the minority class until overfitting begins to occur. Simulation results on the KDD99 dataset show that COUSS improves the F1 score by 8.639%, 6.858%, 5.003%, and 4.074% over the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, adaptive synthetic sampling (ADASYN), and generative adversarial network (GAN) oversampling algorithms, respectively. COUSS can therefore serve as a practical solution in data analysis applications.
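The slow-start analogy described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the function name, the validation callback, and the duplicate-based synthesis step are all placeholders (a real system would use a SMOTE-like interpolation to generate synthetic minority samples). What it shows is the control loop: the number of synthetic samples added per round doubles, like the TCP congestion window during slow start, until the validation score stops improving, which is used here as a proxy for the onset of overfitting.

```python
import random

def slow_start_oversample(minority, evaluate, max_rounds=10):
    """Hypothetical sketch of slow-start-style oversampling.

    minority : list of minority-class samples
    evaluate : callback returning a validation score for a candidate
               minority set (higher is better)
    """
    augmented = list(minority)
    best_score = evaluate(augmented)
    step = 1  # "congestion window": synthetic samples to add this round
    for _ in range(max_rounds):
        # Placeholder synthesis: duplicate random minority samples.
        # A real pipeline would interpolate new samples (e.g. SMOTE-style).
        candidate = augmented + [random.choice(minority) for _ in range(step)]
        score = evaluate(candidate)
        if score <= best_score:
            # Validation score degraded: treat as overfitting and stop,
            # keeping the last dataset that still improved the score.
            break
        augmented, best_score = candidate, score
        step *= 2  # exponential growth, as in TCP slow start
    return augmented
```

Under this sketch, the undersampling half of COUSS would correspond to preparing the (lightly reduced) majority class before the loop runs; only the oversampling ramp-up is shown here.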
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. 2019R1F1A1060742).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Park, S., Park, H. Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic. Computing 103, 401–424 (2021). https://doi.org/10.1007/s00607-020-00854-1