
A novel oversampling and feature selection hybrid algorithm for imbalanced data classification

Published in Multimedia Tools and Applications

Abstract

Traditional approaches tend to produce classifier bias on imbalanced data sets, resulting in poor classification performance for minority classes. Imbalanced data are especially common in financial fraud detection, network intrusion, and fault detection, where the recognition rate of the minority class matters more than the classification performance on the majority class. There is therefore a pressing need for efficient algorithms that address the class imbalance problem. To this end, this article presents a novel hybrid algorithm, Negative Binary General (NBG), which improves the performance of imbalanced classification by combining oversampling with feature selection. A novel oversampling algorithm, Negative-positive Synthetic Minority Oversampling Technique (NPSMOTE), improves the practicability of sample generation, while the Binary Ant Lion Optimizer (BALO) extracts the most significant features to improve classification performance. Simulation experiments on seven benchmark imbalanced data sets demonstrate that the proposed NBG algorithm significantly outperforms nine existing and six recently published algorithms in classifying imbalanced small-sample data sets.
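
To make the overall workflow concrete, the sketch below illustrates a generic oversample-then-select pipeline of the kind the abstract describes. It is a minimal illustration under stated assumptions, not the authors' method: it substitutes standard SMOTE (from the imbalanced-learn package) and a univariate filter selector for the paper's NPSMOTE and BALO components, which are not reproduced here, and the data and parameters are hypothetical.

```python
# Hypothetical sketch of an oversample-then-select pipeline for imbalanced data.
# SMOTE and SelectKBest stand in for the paper's NPSMOTE and BALO components.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data set (roughly 10% minority class).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Step 1: oversample the minority class on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Step 2: select the most informative features (stand-in for BALO selection).
selector = SelectKBest(f_classif, k=10).fit(X_res, y_res)

# Step 3: train on the balanced, reduced feature set and score the minority class.
clf = RandomForestClassifier(random_state=0).fit(selector.transform(X_res), y_res)
print("minority-class F1:", f1_score(y_test, clf.predict(selector.transform(X_test))))
```

Keeping the oversampling step inside the training split, as above, avoids leaking synthetic samples into the evaluation data.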





Acknowledgements

This work was supported by the Plan Project for Guizhou Provincial Basic Research (No. QKH-Basic-ZK[2022] General 018) and a 2021 school-level project of the Guizhou University of Finance and Economics (No. 2021KYYB13).

Author information

Correspondence to Fang Feng.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Feng, F., Li, KC., Yang, E. et al. A novel oversampling and feature selection hybrid algorithm for imbalanced data classification. Multimed Tools Appl 82, 3231–3267 (2023). https://doi.org/10.1007/s11042-022-13240-0

