Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study

Artificial Intelligence Review

Abstract

The skewed class distributions of many class-imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods, applied in different orders, can produce better results. However, each of these studies compares its hybrid only against either under-sampling or over-sampling methods alone when drawing its final conclusion. Therefore, the research objective of this paper is to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 datasets from different domains using three over-sampling algorithms (SMOTE, CTGAN, and TAN) and three under-sampling (i.e. instance selection) algorithms (IB3, DROP3, and GA). The results show that if the under-sampling algorithm is chosen carefully, namely IB3, adding a further over-sampling step yields no significant performance improvement. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second than to use the reverse order; this order allows the random forest classifier to provide the highest AUC.
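To make the two combination orders concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy two-dimensional dataset, and it swaps in simple stand-ins for the algorithms named in the abstract: a one-nearest-neighbour editing filter on the majority class in place of IB3, DROP3, or GA, and a SMOTE-style interpolation in place of SMOTE, CTGAN, or TAN. All function names (`select_instances`, `oversample`, `under_then_over`, `over_then_under`, `make_toy`) are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_instances(data, majority_label):
    """Stand-in under-sampler (NOT IB3, DROP3 or GA): drop each majority
    instance whose single nearest neighbour carries a different label."""
    kept = []
    for i, (x, y) in enumerate(data):
        if y == majority_label:
            others = (d for j, d in enumerate(data) if j != i)
            nearest = min(others, key=lambda d: dist2(x, d[0]))
            if nearest[1] != y:
                continue  # treated as noise or class overlap, so discarded
        kept.append((x, y))
    return kept

def smote_like(minority, n_new, rng, k=3):
    """SMOTE-style stand-in: interpolate between a minority point and one of
    its k nearest minority neighbours. Assumes at least two minority points."""
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: dist2(p, q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic

def oversample(data, minority_label, rng):
    """Add synthetic minority points until both classes have equal size."""
    minority = [x for x, y in data if y == minority_label]
    n_new = max(0, len(data) - 2 * len(minority))  # majority minus minority
    new_points = smote_like(minority, n_new, rng)
    return data + [(x, minority_label) for x in new_points]

def under_then_over(data, majority_label, minority_label, seed=0):
    """Order 1: instance selection first, over-sampling second."""
    reduced = select_instances(data, majority_label)
    return oversample(reduced, minority_label, random.Random(seed))

def over_then_under(data, majority_label, minority_label, seed=0):
    """Order 2: over-sampling first, instance selection second."""
    enlarged = oversample(data, minority_label, random.Random(seed))
    return select_instances(enlarged, majority_label)

def make_toy(seed=0):
    """Hypothetical toy data: 40 majority (label 0) and 8 minority (label 1)."""
    rng = random.Random(seed)
    maj = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)]
    mino = [((rng.gauss(1.5, 0.5), rng.gauss(1.5, 0.5)), 1) for _ in range(8)]
    return maj + mino

def class_counts(data):
    """Count instances per class label."""
    counts = {}
    for _, y in data:
        counts[y] = counts.get(y, 0) + 1
    return counts

if __name__ == "__main__":
    toy = make_toy()
    print("original       ", class_counts(toy))
    print("under then over", class_counts(under_then_over(toy, 0, 1)))
    print("over then under", class_counts(over_then_under(toy, 0, 1)))
```

The two orders generally yield different training sets. In the first, synthetic points are generated only after the majority class has been filtered. In the second, the synthetic points take part in the filtering step and can cause additional majority instances to be removed. Comparing the orders as in the abstract would additionally require training a classifier such as a random forest on each resampled set and measuring AUC on held-out data, which this sketch omits.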

Notes

  1. https://sci2s.ugr.es/keel/datasets.php.

  2. The computing environment is based on PC, Intel® Core™ i7-2600 CPU @ 3.40 GHz, 4 GB RAM.

Acknowledgements

This work was supported in part by the Ministry of Science and Technology of Taiwan under Grant MOST 110-2410-H-182-002 and in part by the Chang Gung Memorial Hospital at Linkou under Grants BMRPH13 and CMRPG3J0732.

Author information

Corresponding author

Correspondence to Wei-Chao Lin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lin, C., Tsai, CF. & Lin, WC. Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study. Artif Intell Rev 56, 845–863 (2023). https://doi.org/10.1007/s10462-022-10186-5

  • DOI: https://doi.org/10.1007/s10462-022-10186-5
