Abstract
Network traffic data typically comprise a large amount of normal traffic and only a small amount of attack data. This class imbalance degrades prediction performance, for example by biasing the classifier toward the majority class and causing minority samples to be treated as outliers. Representative approaches to the imbalance problem include oversampling-based models that synthesize minority data. However, because oversampling repeatedly learns from the same data, the classification model can overfit the training set. Conversely, undersampling methods can cause information loss because they discard data. To improve on both approaches, we propose a combined oversampling and undersampling method based on the slow-start (COUSS) algorithm, inspired by the congestion control mechanism of the transmission control protocol (TCP). Starting from a lightly undersampled dataset, COUSS incrementally oversamples the minority class until overfitting begins to occur. Simulation results on the KDD99 dataset show that COUSS improves the F1 score by 8.639%, 6.858%, 5.003%, and 4.074% over the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, adaptive synthetic sampling (ADASYN), and generative adversarial network (GAN) oversampling algorithms, respectively. COUSS can therefore serve as a practical solution in data analysis applications.
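The slow-start analogy described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the function name, the validation callback, and the duplicate-based synthesis step are all placeholders (a real system would use a SMOTE-like interpolation to generate synthetic minority samples). What it shows is the control loop: the number of synthetic samples added per round doubles, like the TCP congestion window during slow start, until the validation score stops improving, which is used here as a proxy for the onset of overfitting.

```python
import random

def slow_start_oversample(minority, evaluate, max_rounds=10):
    """Hypothetical sketch of slow-start-style oversampling.

    minority : list of minority-class samples
    evaluate : callback returning a validation score for a candidate
               minority set (higher is better)
    """
    augmented = list(minority)
    best_score = evaluate(augmented)
    step = 1  # "congestion window": synthetic samples to add this round
    for _ in range(max_rounds):
        # Placeholder synthesis: duplicate random minority samples.
        # A real pipeline would interpolate new samples (e.g. SMOTE-style).
        candidate = augmented + [random.choice(minority) for _ in range(step)]
        score = evaluate(candidate)
        if score <= best_score:
            # Validation score degraded: treat as overfitting and stop,
            # keeping the last dataset that still improved the score.
            break
        augmented, best_score = candidate, score
        step *= 2  # exponential growth, as in TCP slow start
    return augmented
```

Under this sketch, the undersampling half of COUSS would correspond to preparing the (lightly reduced) majority class before the loop runs; only the oversampling ramp-up is shown here.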
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) Grant funded by the Korea government (MSIT) (No. 2019R1F1A1060742).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Park, S., Park, H. Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic. Computing 103, 401–424 (2021). https://doi.org/10.1007/s00607-020-00854-1