The CURE for Class Imbalance

  • Colin Bellinger
  • Paula Branco
  • Luis Torgo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11828)

Abstract

Addressing the class imbalance problem is critical for several real-world applications. Applying pre-processing methods is a popular way of dealing with this problem. These methods add examples of the rare class and/or remove cases of the normal class. However, they typically take into account only the characteristics of each individual class, and this segmented view of the data can have a negative impact. We propose a new method that uses an integrated view of the data classes to generate new examples and remove cases. ClUstered REsampling (CURE) is a method based on a holistic view of the data that uses hierarchical clustering and a new distance measure to guide the sampling procedure. Clusters generated in this way take into account the structure of the data. This enables CURE to avoid common mistakes made by other resampling methods. In particular, CURE prevents the generation of synthetic examples in dangerous regions and undersamples safe, non-borderline regions of the majority class. We show the effectiveness of CURE in an extensive set of experiments with benchmark domains. We also show that CURE is a user-friendly method that does not require extensive fine-tuning of hyper-parameters.
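The abstract does not give the exact CURE algorithm, but the general idea it describes — clustering both classes together, then oversampling only inside safe minority regions and undersampling only inside pure majority clusters — can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, Ward linkage, jitter scale, and the keep-half undersampling ratio are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_guided_resample(X, y, minority_label, n_clusters=8, random_state=0):
    """Illustrative cluster-guided resampling (not the paper's exact method).

    Clusters the full dataset (both classes together) with Ward's
    hierarchical clustering, then:
      - oversamples the minority class only inside clusters where it
        dominates (safe regions), by duplicating points with small jitter;
      - undersamples the majority class only inside clusters containing
        no minority examples (safe, non-borderline regions);
      - leaves mixed (borderline) clusters untouched.
    """
    rng = np.random.default_rng(random_state)
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")

    keep_idx, synth_X, synth_y = [], [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        is_min = y[idx] == minority_label
        if is_min.mean() > 0.5:
            # Minority-dominated cluster: safe region, oversample here.
            for s in idx[is_min]:
                synth_X.append(X[s] + rng.normal(scale=0.05, size=X.shape[1]))
                synth_y.append(minority_label)
            keep_idx.extend(idx)
        elif not is_min.any():
            # Pure majority cluster: safe to undersample (keep half).
            kept = rng.choice(idx, size=max(1, len(idx) // 2), replace=False)
            keep_idx.extend(kept)
        else:
            # Mixed / borderline cluster: leave untouched.
            keep_idx.extend(idx)

    X_out = np.vstack([X[keep_idx]] + ([np.array(synth_X)] if synth_X else []))
    y_out = np.concatenate([y[keep_idx], np.array(synth_y)]) if synth_y \
        else y[keep_idx]
    return X_out, y_out
```

Run on two well-separated blobs, this grows the minority class inside its own cluster and shrinks the majority class in its pure cluster, while never placing synthetic points in a mixed region — the failure mode of neighbourhood-only methods such as plain SMOTE that the abstract highlights.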

Keywords

Imbalanced domains · Resampling · Clustering

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. National Research Council of Canada, Ottawa, Canada
  2. Dalhousie University, Halifax, Canada