
A novel oversampling and feature selection hybrid algorithm for imbalanced data classification

Published in Multimedia Tools and Applications

Abstract

Traditional approaches tend to produce classifier bias on imbalanced data sets, resulting in poor classification performance for minority classes. Imbalanced data are especially common in financial fraud detection, network intrusion, and fault detection, where the recognition rate of the minority class matters more than the classification performance on the majority class. There is therefore a pressing need for efficient algorithms that address the class imbalance problem. To this end, this article presents a novel hybrid algorithm, Negative Binary General (NBG), which improves the performance of imbalanced classification by combining oversampling with feature selection. A novel oversampling algorithm, Negative-positive Synthetic Minority Oversampling Technique (NPSMOTE), improves the practicability of sample generation, while the Binary Ant Lion Optimizer (BALO) extracts the most significant features to improve classification performance. Simulation experiments on seven benchmark imbalanced data sets demonstrate that the proposed NBG algorithm significantly outperforms nine existing and six recently published algorithms in classifying imbalanced small-sample data sets.
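
To make the overall workflow concrete, the sketch below illustrates a generic oversample-then-select pipeline of the kind the abstract describes. It is a minimal illustration under stated assumptions, not the authors' method: it substitutes standard SMOTE (from the imbalanced-learn package) and a univariate filter selector for the paper's NPSMOTE and BALO components, which are not reproduced here, and the data and parameters are hypothetical.

```python
# Hypothetical sketch of an oversample-then-select pipeline for imbalanced data.
# SMOTE and SelectKBest stand in for the paper's NPSMOTE and BALO components.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data set (roughly 10% minority class).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Step 1: oversample the minority class on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Step 2: select the most informative features (stand-in for BALO selection).
selector = SelectKBest(f_classif, k=10).fit(X_res, y_res)

# Step 3: train on the balanced, reduced feature set and score the minority class.
clf = RandomForestClassifier(random_state=0).fit(selector.transform(X_res), y_res)
print("minority-class F1:", f1_score(y_test, clf.predict(selector.transform(X_test))))
```

Keeping the oversampling step inside the training split, as above, avoids leaking synthetic samples into the evaluation data.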





Acknowledgements

This work was supported by the Plan Project for Guizhou Provincial Basic Research (No. QKH-Basic-ZK[2022] General 018) and a 2021 school-level project of the Guizhou University of Finance and Economics (No. 2021KYYB13).

Author information

Correspondence to Fang Feng.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Feng, F., Li, KC., Yang, E. et al. A novel oversampling and feature selection hybrid algorithm for imbalanced data classification. Multimed Tools Appl 82, 3231–3267 (2023). https://doi.org/10.1007/s11042-022-13240-0

