Improvement of Random Undersampling to Avoid Excessive Removal of Points from a Given Area of the Majority Class

  • Conference paper
Computational Science – ICCS 2021 (ICCS 2021)

Abstract

In this paper we focus on the class imbalance problem, which often leads to sub-optimal classifier performance. Despite many attempts to solve this problem, there is still a need for better solutions that overcome the limitations of known methods. For this reason we developed a new algorithm that, in contrast to traditional random undersampling, removes at most k nearest neighbors of samples belonging to the majority class. In this way the majority set is reduced in size, while the excessive removal of points from any given area is prevented. Experiments conducted on eighteen imbalanced datasets confirm the usefulness of the proposed method for improving classification results compared with other undersampling methods. Non-parametric statistical tests show that these differences are usually statistically significant.
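
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' reference implementation: the abstract does not specify how samples are selected or when removal stops, so the function name `knn_random_undersample`, the parameters `target_size`, `k` and `seed`, the Euclidean distance, and the stopping rule are all assumptions made for illustration. The sketch repeatedly draws a random surviving majority sample and removes at most `k` of its nearest surviving majority neighbors, which bounds how many points can disappear from one neighborhood in a single step.

```python
import numpy as np


def knn_random_undersample(X, y, majority_label, target_size, k=3, seed=0):
    """Illustrative sketch only (assumed procedure, not the published algorithm).

    Repeatedly picks a random surviving majority sample and removes at most
    k of its nearest surviving majority neighbours, until the majority class
    has been reduced to target_size samples.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    keep = np.ones(len(y), dtype=bool)      # True for samples still retained
    is_majority = y == majority_label

    while True:
        majority_idx = np.flatnonzero(keep & is_majority)
        excess = len(majority_idx) - target_size
        if excess <= 0 or len(majority_idx) < 2:
            break
        # Randomly chosen majority sample; it is kept in this iteration.
        anchor = rng.choice(majority_idx)
        candidates = majority_idx[majority_idx != anchor]
        # Euclidean distances from the anchor to the other majority samples.
        dists = np.linalg.norm(X[candidates] - X[anchor], axis=1)
        # Remove at most k neighbours and never more than still needed.
        n_remove = min(k, excess, len(candidates))
        nearest = candidates[np.argsort(dists)[:n_remove]]
        keep[nearest] = False

    return X[keep], y[keep]
```

Calling `knn_random_undersample(X, y, majority_label=0, target_size=n_minority)` would balance a two-class dataset under these assumptions; a smaller `k` spreads the removals over more neighborhoods, at the cost of more iterations.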

This work was supported by Statutory Research funds of Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland.

Notes

  1. For the two-class classification problem \(C=2\).

  2. I.e. the layer size was calculated as: 1 + (number of attributes + number of classes)/2.

  3. We used 5-fold cross-validation instead of 10-fold cross-validation because one of the tested datasets (Glass5) had fewer than 10 examples of the minority class.

  4. http://www.keel.es/datasets.php.

  5. http://archive.ics.uci.edu/ml/index.html.

  6. https://gitlab.aei.polsl.pl/awerner/knn_ru.

  7. To facilitate visualization and enable the presentation of an exemplary dataset in a two-dimensional space, dimensionality reduction was performed via principal component analysis.

  8. This test is considered to be more powerful than the Bonferroni-Dunn one.

  9. The complete comparisons are available on GitLab: https://gitlab.aei.polsl.pl/awerner/knn_ru.

Correspondence to Małgorzata Bach.

© 2021 Springer Nature Switzerland AG

Cite this paper

Bach, M., Werner, A. (2021). Improvement of Random Undersampling to Avoid Excessive Removal of Points from a Given Area of the Majority Class. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2021. ICCS 2021. Lecture Notes in Computer Science, vol 12744. Springer, Cham. https://doi.org/10.1007/978-3-030-77967-2_15

  • DOI: https://doi.org/10.1007/978-3-030-77967-2_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77966-5

  • Online ISBN: 978-3-030-77967-2

  • eBook Packages: Computer Science, Computer Science (R0)
