
Which standard classification algorithm has more stable performance for imbalanced network traffic data?

Published in Soft Computing (topical collection: Data analytics and machine learning).

Abstract

Most standard classification algorithms struggle to learn and predict effectively from imbalanced network traffic data, which usually leads to lower classification accuracy. To analyze the influence of imbalanced network traffic data on the performance of standard classification algorithms, imbalanced data augmentation algorithms are first designed to obtain imbalanced network traffic datasets that share the same distribution but have a gradually varying Imbalance Ratio (IR). Then, to obtain more objective classification results and simplify the evaluation process, the evaluation metric AFG, which combines the area under the receiver operating characteristic curve (AUC), F-measure, and G-mean, is used to evaluate the classification performance of standard classification algorithms. Finally, the performance stability of standard classification algorithms on imbalanced network traffic data is assessed using AFG together with the coefficient of variation (CV). Experiments with eight widely used standard classification algorithms on 25 different imbalanced network traffic datasets demonstrate that the classification performance of GNB, RF, and DT is unstable, while BNB, KNN, LR, GBDT, and SVC are relatively stable and less susceptible to imbalanced data; KNN has the most stable classification performance of all. These results are statistically confirmed by Friedman and Nemenyi post hoc tests.
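The evaluation pipeline described in the abstract can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the paper's implementation: it assumes AFG is the arithmetic mean of AUC, F-measure, and G-mean (the paper defines the exact aggregation), uses synthetic data in place of the network traffic datasets, and takes CV as the usual ratio of standard deviation to mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def g_mean(y_true, y_pred):
    # Geometric mean of sensitivity (recall on class 1) and specificity (recall on class 0).
    sens = recall_score(y_true, y_pred, pos_label=1)
    spec = recall_score(y_true, y_pred, pos_label=0)
    return np.sqrt(sens * spec)

def afg(y_true, y_pred, y_score):
    # Assumed aggregation: arithmetic mean of AUC, F-measure, and G-mean.
    return (roc_auc_score(y_true, y_score)
            + f1_score(y_true, y_pred)
            + g_mean(y_true, y_pred)) / 3.0

afg_scores = []
for ir in [1, 5, 10, 20]:  # gradually varying imbalance ratios, as in the paper's setup
    # Synthetic stand-in for the augmented traffic datasets; majority-class
    # weight ir/(ir+1) yields an imbalance ratio of roughly ir.
    X, y = make_classification(n_samples=2000, weights=[ir / (ir + 1)],
                               flip_y=0.01, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    clf = KNeighborsClassifier().fit(X_tr, y_tr)
    afg_scores.append(afg(y_te, clf.predict(X_te),
                          clf.predict_proba(X_te)[:, 1]))

# Coefficient of variation of AFG across IRs: a lower CV means a more
# stable classifier under growing imbalance.
cv = np.std(afg_scores) / np.mean(afg_scores)
print(f"AFG per IR: {np.round(afg_scores, 3)}, CV = {cv:.3f}")
```

Repeating this loop for each of the eight classifiers and ranking their CVs reproduces, in miniature, the stability comparison the abstract describes.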


Data availability

Enquiries about data availability should be directed to the authors.

Notes

  1. IR is defined as the ratio of the number of majority-class samples to the number of minority-class samples.

  2. https://www.unb.ca/cic/datasets/ids-2017.html

  3. https://www.kdd.org/kdd-cup/view/kdd-cup-1999/Data

  4. https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/
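The definition of IR in note 1 is straightforward to apply to a vector of class labels; a minimal sketch (the `labels` example is hypothetical):

```python
from collections import Counter

def imbalance_ratio(labels):
    # IR = (# majority-class samples) / (# minority-class samples), per note 1.
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical traffic labels: 8 benign (majority) vs 2 attack (minority) flows.
labels = ["benign"] * 8 + ["attack"] * 2
print(imbalance_ratio(labels))  # → 4.0
```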


Acknowledgements

This work was supported by the National Natural Science Foundation of China (62306009, 62272006, 61972438); the Major Project of Natural Science Research in Colleges and Universities of Anhui Province (KJ2021ZD0007); the Natural Science Research Project for Universities in Anhui Province (2023AH040027); the Anhui Provincial Natural Science Foundation of China (2208085MF164); the Wuhu Science and Technology Bureau Project (2022jc11); and the 2021 cultivation project of Anhui Normal University (2021xjxm049).

Funding

Funding was received from the National Natural Science Foundation of China (Grants 62306009, 62272006, and 61972438); the Anhui Provincial Natural Science Foundation of China (Grant 2208085MF164); the Major Project of Natural Science Research in Colleges and Universities of Anhui Province (Grant KJ2021ZD0007); the Natural Science Research Project for Universities in Anhui Province (Grant 2023AH040027); the Wuhu Science and Technology Bureau Project (Grant 2022jc11); and the 2021 cultivation project of Anhui Normal University (Grant 2021xjxm049).

Author information

Authors and Affiliations

Authors

Contributions

MZ: conceptualization, methodology, software, and writing (original draft). KM: writing (review and editing) and supervision. XH: writing (review and editing). FW: data curation and experiments. QY: data curation. LG: supervision. FC: writing (review and editing) and supervision.

Corresponding author

Correspondence to Ming Zheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This work does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Additional experimental results of the chart

See Figs. 9, 10, 11, 12, 13, 14, 15.

Fig. 9: Relationship of varying performance in GNB and IR

Fig. 10: Relationship of varying performance in BNB and IR

Fig. 11: Relationship of varying performance in LR and IR

Fig. 12: Relationship of varying performance in RF and IR

Fig. 13: Relationship of varying performance in DT and IR

Fig. 14: Relationship of varying performance in SVC and IR

Fig. 15: Performance stability of eight standard classification algorithms on 12 imbalanced network traffic datasets

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zheng, M., Ma, K., Wang, F. et al. Which standard classification algorithm has more stable performance for imbalanced network traffic data?. Soft Comput 28, 217–234 (2024). https://doi.org/10.1007/s00500-023-09331-1

