Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study

Lin, Cian; Tsai, Chih-Fong; Lin, Wei-Chao

doi:10.1007/s10462-022-10186-5

Published: 13 April 2022

Volume 56, pages 845–863 (2023)

Artificial Intelligence Review

1129 Accesses
13 Citations

Abstract

The skewed class distributions of many class-imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods, applied in different orders, can produce better results. However, each study compares its hybrid only with either under- or over-sampling methods alone when drawing its final conclusion. Therefore, the research objective of this paper is to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 datasets from different domains using three over-sampling algorithms (SMOTE, CTGAN, and TAN) and three under-sampling (i.e. instance selection) algorithms (IB3, DROP3, and GA). The results show that if the under-sampling algorithm is chosen carefully, i.e. IB3, no significant performance improvement is obtained by adding a further over-sampling step. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second than to use the reverse order; this order allows the random forest classifier to provide the highest AUC rate.
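The two combination orders compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: random removal of majority instances stands in for the instance selection algorithms (IB3, DROP3, GA), the over-sampler is a simplified SMOTE-style interpolation, and all function names and parameters (`smote_like`, `random_undersample`, `keep_ratio`) are hypothetical. It also assumes that under-sampling touches only the majority class, which the abstract does not specify.

```python
import random


def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style).
    Assumes the minority class holds at least two points."""
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


def random_undersample(majority, keep_ratio, rng):
    """Stand-in for instance selection: keep a random share of the majority."""
    n_keep = max(1, round(len(majority) * keep_ratio))
    return rng.sample(majority, n_keep)


def under_then_over(majority, minority, keep_ratio=0.5, seed=0):
    """Order 1: reduce the majority first, then grow the minority to match it."""
    rng = random.Random(seed)
    reduced = random_undersample(majority, keep_ratio, rng)
    n_new = max(0, len(reduced) - len(minority))
    return reduced, minority + smote_like(minority, n_new, rng=rng)


def over_then_under(majority, minority, keep_ratio=0.5, seed=0):
    """Order 2: grow the minority to the original majority size, then reduce."""
    rng = random.Random(seed)
    n_new = max(0, len(majority) - len(minority))
    grown = minority + smote_like(minority, n_new, rng=rng)
    reduced = random_undersample(majority, keep_ratio, rng)
    return reduced, grown


# Toy data: 100 majority points and 10 minority points in two dimensions.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(10)]

for name, combine in [("under -> over", under_then_over),
                      ("over -> under", over_then_under)]:
    maj_out, min_out = combine(majority, minority)
    print(name, "majority:", len(maj_out), "minority:", len(min_out))
```

Under these assumptions the two orders yield training sets of different sizes and class ratios, which is one reason the order can affect the downstream classifier. Reproducing the reported comparison would require the actual instance selection algorithms, a random forest classifier, and AUC evaluation on the KEEL datasets.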

Notes

https://sci2s.ugr.es/keel/datasets.php.
The computing environment is a PC with an Intel® Core™ i7-2600 CPU @ 3.40 GHz and 4 GB RAM.

Acknowledgements

This work was supported in part by the Ministry of Science and Technology of Taiwan under Grant MOST 110-2410-H-182-002 and in part by the Chang Gung Memorial Hospital at Linkou under Grants BMRPH13 and CMRPG3J0732.

Author information

Authors and Affiliations

Department of Information Management, National Central University, Taoyuan, Taiwan, ROC
Cian Lin & Chih-Fong Tsai
Department of Information Management, Chang Gung University, Taoyuan, Taiwan, ROC
Wei-Chao Lin
Department of Thoracic Surgery, Chang Gung Memorial Hospital at Linkou, Taoyuan, Taiwan, ROC
Wei-Chao Lin

Corresponding author

Correspondence to Wei-Chao Lin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Lin, C., Tsai, CF. & Lin, WC. Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study. Artif Intell Rev 56, 845–863 (2023). https://doi.org/10.1007/s10462-022-10186-5

Accepted: 04 April 2022
Published: 13 April 2022
Issue Date: February 2023
DOI: https://doi.org/10.1007/s10462-022-10186-5
