Abstract
Most standard classification algorithms struggle to learn effectively from imbalanced network traffic data, which usually leads to lower classification accuracy. To analyze the influence of imbalanced network traffic data on the performance of standard classification algorithms, an imbalanced data augmentation algorithm is first designed to generate network traffic data sets that share the same distribution but have gradually varying Imbalance Ratios (IR). Then, to obtain a more objective classification result and simplify the evaluation process, the composite metric AFG, built from the area under the receiver operating characteristic curve (AUC), F-measure, and G-mean, is used to evaluate the classification performance of standard classification algorithms. Finally, based on AFG and the coefficient of variation (CV), the performance stability of standard classification algorithms on imbalanced network traffic data is measured. Experiments with eight widely used standard classification algorithms on 25 imbalanced network traffic data sets demonstrate that the classification performance of GNB, RF, and DT is unstable, whereas BNB, KNN, LR, GBDT, and SVC are relatively stable and not susceptible to imbalanced data; in particular, KNN has the most stable classification performance. These results are statistically confirmed by Friedman and Nemenyi post hoc tests.
Data availability
Enquiries about data availability should be directed to the authors.
Notes
IR is defined as the ratio of the number of majority-class samples to the number of minority-class samples.
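Under that definition, IR can be computed directly from the class labels; a minimal sketch:

```python
from collections import Counter

def imbalance_ratio(labels):
    # IR = size of the majority class / size of the minority class
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```

For example, a data set with 90 majority and 10 minority samples has IR = 9.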
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62306009, 62272006, 61972438); the Major Project of Natural Science Research in Colleges and Universities of Anhui Province (KJ2021ZD0007); the Natural Science Research Project for Universities in Anhui Province (2023AH040027); the Anhui Provincial Natural Science Foundation of China (2208085MF164); the Wuhu Science and Technology Bureau Project (2022jc11); and the 2021 cultivation project of Anhui Normal University (2021xjxm049).
Funding
Funding was received from the National Natural Science Foundation of China (Grant nos. 62306009, 62272006, and 61972438); the Anhui Provincial Natural Science Foundation of China (Grant no. 2208085MF164); the Major Project of Natural Science Research in Colleges and Universities of Anhui Province (Grant no. KJ2021ZD0007); the Natural Science Research Project for Universities in Anhui Province (Grant no. 2023AH040027); the Wuhu Science and Technology Bureau Project (Grant no. 2022jc11); and the 2021 cultivation project of Anhui Normal University (Grant no. 2021xjxm049).
Author information
Authors and Affiliations
Contributions
MZ: conceptualization, methodology, software, and writing of the original draft. KM: review, editing, and supervision. XH: review and editing. FW: data curation and experiments. QY: data curation. LG: supervision. FC: review, editing, and supervision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This work does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zheng, M., Ma, K., Wang, F. et al. Which standard classification algorithm has more stable performance for imbalanced network traffic data? Soft Comput 28, 217–234 (2024). https://doi.org/10.1007/s00500-023-09331-1