Abstract
Most standard classification algorithms struggle to learn effectively from imbalanced network traffic data, which usually leads to lower classification accuracy. To analyze the influence of imbalanced network traffic data on the performance of standard classification algorithms, an imbalanced data augmentation algorithm is first designed to generate network traffic data sets that share the same distribution but have gradually varying Imbalance Ratios (IR). Then, to obtain a more objective classification result and simplify the evaluation process, the composite metric AFG, built from the area under the receiver operating characteristic curve (AUC), F-measure, and G-mean, is used to evaluate the classification performance of standard classification algorithms. Finally, based on AFG and the coefficient of variation (CV), the performance stability of standard classification algorithms on imbalanced network traffic data is measured. Experiments with eight widely used standard classification algorithms on 25 imbalanced network traffic data sets demonstrate that the classification performance of GNB, RF, and DT is unstable, whereas BNB, KNN, LR, GBDT, and SVC are relatively stable and not susceptible to imbalanced data; in particular, KNN has the most stable classification performance. These results are statistically confirmed by Friedman and Nemenyi post hoc tests.
Data availability
Enquiries about data availability should be directed to the authors.
Notes
IR is defined as the ratio of the number of majority-class samples to the number of minority-class samples.
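Under that definition, IR can be computed directly from the class labels; a minimal sketch:

```python
from collections import Counter

def imbalance_ratio(labels):
    # IR = size of the majority class / size of the minority class
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

For example, a data set with 90 majority and 10 minority samples has IR = 9.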
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62306009, 62272006, 61972438); the Major Project of Natural Science Research in Colleges and Universities of Anhui Province (KJ2021ZD0007); the Natural Science Research Project for Universities in Anhui Province (2023AH040027); the Anhui Provincial Natural Science Foundation of China (2208085MF164); the Wuhu Science and Technology Bureau Project (2022jc11); and the 2021 cultivation project of Anhui Normal University (2021xjxm049).
Funding
Funding was received from the National Natural Science Foundation of China (Grant nos. 62306009, 62272006, and 61972438); the Anhui Provincial Natural Science Foundation of China (Grant no. 2208085MF164); the Major Project of Natural Science Research in Colleges and Universities of Anhui Province (Grant no. KJ2021ZD0007); the Natural Science Research Project for Universities in Anhui Province (Grant no. 2023AH040027); the Wuhu Science and Technology Bureau Project (Grant no. 2022jc11); and the 2021 cultivation project of Anhui Normal University (Grant no. 2021xjxm049).
Author information
Authors and Affiliations
Contributions
MZ: conceptualization, methodology, software, and writing of the original draft. KM: review, editing, and supervision. XH: review and editing. FW: data curation and experiments. QY: data curation. LG: supervision. FC: review, editing, and supervision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This work does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zheng, M., Ma, K., Wang, F. et al. Which standard classification algorithm has more stable performance for imbalanced network traffic data? Soft Comput 28, 217–234 (2024). https://doi.org/10.1007/s00500-023-09331-1