The Imbalanced Training Sample Problem: Under or over Sampling?

  • Ricardo Barandela
  • Rosa M. Valdovinos
  • J. Salvador Sánchez
  • Francesc J. Ferri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3138)

Abstract

The problem of imbalanced training sets in supervised pattern recognition methods is receiving growing attention. An imbalanced training sample means that one class is represented by a large number of examples while the other is represented by only a few. It has been observed that this situation, which arises in several practical domains, may cause a significant deterioration of classification accuracy, in particular for patterns belonging to the less represented classes. In this paper we present a study of the relative merits of several re-sizing techniques for handling the imbalance issue. We also assess the convenience of combining some of these techniques.
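
The two re-sizing strategies contrasted in the title can be made concrete with a minimal sketch. This is purely illustrative and not the authors' specific procedures; the `resample` helper, its parameters, and the toy data below are hypothetical. Random under-sampling discards majority-class examples, while random over-sampling duplicates minority-class examples, until the two classes are the same size:

```python
import random

def resample(X, y, minority_label, strategy="under", seed=0):
    """Randomly re-size a binary training set.

    strategy="under": discard majority examples until class sizes match.
    strategy="over":  duplicate minority examples until class sizes match.
    """
    rng = random.Random(seed)
    minority = [(x, c) for x, c in zip(X, y) if c == minority_label]
    majority = [(x, c) for x, c in zip(X, y) if c != minority_label]

    if strategy == "under":
        # Keep only as many majority examples as there are minority examples.
        majority = rng.sample(majority, len(minority))
    elif strategy == "over":
        # Duplicate randomly chosen minority examples to close the gap.
        extra = rng.choices(minority, k=len(majority) - len(minority))
        minority = minority + extra
    else:
        raise ValueError("strategy must be 'under' or 'over'")

    balanced = minority + majority
    rng.shuffle(balanced)
    Xb, yb = zip(*balanced)
    return list(Xb), list(yb)

# Toy example: 2 minority (label 1) vs 6 majority (label 0) examples.
X = [[i] for i in range(8)]
y = [1, 1, 0, 0, 0, 0, 0, 0]
Xu, yu = resample(X, y, minority_label=1, strategy="under")
Xo, yo = resample(X, y, minority_label=1, strategy="over")
print(sum(yu), len(yu))  # 2 minority out of 4 after under-sampling
print(sum(yo), len(yo))  # 6 minority out of 12 after over-sampling
```

Either strategy balances the class priors, but under-sampling discards potentially useful majority examples while over-sampling merely replicates minority examples, which motivates the paper's study of the relative merits of these techniques and of combining them.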

Keywords

Majority Class · Weighted Distance · Under Sampling · Minority Class · Neighbor Rule

References

  1. Aha, D., Kibler, D.: Learning Representative Exemplars of Concepts: An Initial Case Study. In: Proceedings of the Fourth International Conference on Machine Learning, pp. 24–30 (1987)
  2. Barandela, R., Cortés, N., Palacios, A.: The Nearest Neighbor rule and the reduction of the training sample size. In: Proc. 9th Spanish Symposium on Pattern Recognition and Image Analysis, vol. 1, pp. 103–108 (2001)
  3. Barandela, R., Sánchez, J.S., García, V., Rangel, E.: Strategies for learning in class imbalance problems. Pattern Recognition 36(3), 849–851 (2003)
  4. Barandela, R., Sánchez, J.S., García, V., Ferri, F.J.: Learning from imbalanced sets through resampling and weighting. In: Perales, F.J., Campilho, A.C., Pérez, N., Sanfeliu, A. (eds.) IbPRIA 2003. LNCS, vol. 2652, pp. 80–88. Springer, Heidelberg (2003)
  5. Barandela, R., Valdovinos, R.M., Sánchez, J.S.: New applications of ensembles of classifiers. Pattern Analysis and Applications 6(3), 245–256 (2003)
  6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357 (2000)
  7. Dudani, S.A.: The distance-weighted k-nearest neighbor rule. IEEE Trans. on Systems, Man and Cybernetics 6, 325–327 (1976)
  8. Eavis, T., Japkowicz, N.: A Recognition-based Alternative to Discrimination-based Multi-Layer Perceptrons. In: Workshop on Learning from Imbalanced Data Sets, Technical Report WS-00-05, AAAI Press (2000)
  9. Ezawa, K.J., Singh, M., Norton, S.W.: Learning goal oriented Bayesian networks for telecommunications management. In: Proc. 13th Int. Conf. on Machine Learning, pp. 139–147 (1996)
  10. Fawcett, T., Provost, F.: Adaptive fraud detection. Data Mining and Knowledge Discovery 1, 291–316 (1996)
  11. Hart, P.E.: The Condensed Nearest Neighbor rule. IEEE Trans. on Information Theory 6(4), 515–516 (1968)
  12. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis Journal 6(5), 429–450 (2002)
  13. Kubat, M., Matwin, S.: Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In: Proceedings of the 14th International Conference on Machine Learning, pp. 179–186 (1997)
  14. Kubat, M., Holte, R., Matwin, S.: Detection of Oil-Spills in Radar Images of Sea Surface. Machine Learning 30, 195–215 (1998)
  15. Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and naïve Bayes. In: Proc. 16th Int. Conf. on Machine Learning, pp. 258–267 (1999)
  16. Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C.: Reducing misclassification costs. In: Proc. 11th Int. Conf. on Machine Learning, pp. 217–225 (1994)
  17. Ritter, G.I., Woodruff, H.B., Lowry, S.R., Isenhour, T.L.: An Algorithm for Selective Nearest Neighbor Decision Rule. IEEE Trans. on Information Theory 21(6), 665–669 (1975)
  18. Weiss, G.M., Provost, F.: Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19, 315–354 (2003)
  19. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data sets. IEEE Trans. on Systems, Man and Cybernetics 2, 408–421 (1972)
  20. Woods, K., Doss, C., Bowyer, K.W., Solka, J., Priebe, C., Kegelmeyer, W.P.: Comparative evaluation of pattern recognition techniques for detection of micro-calcifications in mammography. International Journal of Pattern Recognition and Artificial Intelligence 7, 1417–1436 (1993)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Ricardo Barandela (1, 2)
  • Rosa M. Valdovinos (1)
  • J. Salvador Sánchez (3)
  • Francesc J. Ferri (4)

  1. Instituto Tecnológico de Toluca, Metepec, México
  2. Instituto de Geografía Tropical, La Habana, Cuba
  3. Dept. Llenguatges i Sistemes Informàtics, U. Jaume I, Castelló, Spain
  4. Dept. d'Informàtica, U. Valencia, Burjassot (Valencia), Spain