
An empirical study on the joint impact of feature selection and data resampling on imbalance classification

Applied Intelligence

A Correction to this article was published on 30 July 2022

Abstract

Many real-world datasets exhibit imbalanced distributions, in which the majority classes have abundant samples whereas the minority classes often have very few. Data resampling has proven effective in alleviating such class imbalance, while feature selection is a commonly used technique for improving classification performance. However, the joint impact of feature selection and data resampling on two-class imbalance classification has rarely been addressed before. This work investigates the performance of two opposite imbalanced-classification frameworks in which feature selection is applied either before or after data resampling. We conduct a large-scale empirical study with a total of 9225 experiments on 52 publicly available datasets. The results show that both frameworks should be considered when searching for the best-performing imbalanced classification model. We also study how the classifier, the ratio between the number of majority and minority samples (IR), and the ratio between the number of samples and features (SFR) affect the performance of imbalance classification. Overall, this work provides a valuable reference for researchers and practitioners in imbalance learning.
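To make the two compared workflows concrete, the sketch below contrasts the two orderings studied in the paper: feature selection applied before data resampling versus after it. It uses scikit-learn and imbalanced-learn with SMOTE, a univariate filter selector, a random forest, and a synthetic dataset purely as illustrative stand-ins (these are assumptions, not the authors' exact methods or benchmark data), and it also computes the IR and SFR characteristics mentioned above.

```python
# Minimal sketch of the two pipeline orderings compared in the study:
#   FS->DR: feature selection first, then data resampling
#   DR->FS: data resampling first, then feature selection
# SMOTE, SelectKBest, random forest, and the synthetic dataset are
# illustrative assumptions, not the paper's exact experimental setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, n_features=40,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Dataset characteristics analysed in the paper
ir = np.sum(y == 0) / np.sum(y == 1)   # imbalance ratio (majority/minority)
sfr = X.shape[0] / X.shape[1]          # sample-to-feature ratio

def fs_then_dr(X_tr, y_tr, X_te, k=10):
    """Feature selection first, then SMOTE resampling."""
    sel = SelectKBest(f_classif, k=k).fit(X_tr, y_tr)
    X_res, y_res = SMOTE(random_state=0).fit_resample(sel.transform(X_tr), y_tr)
    return X_res, y_res, sel.transform(X_te)

def dr_then_fs(X_tr, y_tr, X_te, k=10):
    """SMOTE resampling first, then feature selection."""
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
    sel = SelectKBest(f_classif, k=k).fit(X_res, y_res)
    return sel.transform(X_res), y_res, sel.transform(X_te)

for name, pipeline in [("FS->DR", fs_then_dr), ("DR->FS", dr_then_fs)]:
    X_res, y_res, X_te_sel = pipeline(X_tr, y_tr, X_te)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    bacc = balanced_accuracy_score(y_te, clf.predict(X_te_sel))
    print(f"{name}  IR={ir:.1f}  SFR={sfr:.1f}  balanced accuracy={bacc:.3f}")
```

As the abstract notes, neither ordering dominates across datasets, so in practice both pipelines are worth evaluating when searching for the best-performing model.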




Acknowledgments

Prof. Chongsheng Zhang was partially funded by the Laboratory of Yellow River Heritage, Henan University, and the Henan Laboratory of Yellow River (Henan University). Prof. Salvador Garcia was partially supported by the research projects TIN2017-89517-P and A-TIC-434-UGR20.

Author information

Corresponding author

Correspondence to Gaojuan Fan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Chongsheng Zhang and Paolo Soda contributed equally to this work.

The original online version of this article was revised: there was a mistake in the images and captions of Figures 10, 11, 12 and 13.


Cite this article

Zhang, C., Soda, P., Bi, J. et al. An empirical study on the joint impact of feature selection and data resampling on imbalance classification. Appl Intell 53, 5449–5461 (2023). https://doi.org/10.1007/s10489-022-03772-1
