Restricted Decontamination for the Imbalanced Training Sample Problem

  • R. Barandela
  • E. Rangel
  • J. S. Sánchez
  • F. J. Ferri
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2905)


The problem of imbalanced training data in supervised methods is receiving growing attention. A training sample is imbalanced when one class is much more heavily represented than the others. This situation, which arises in several practical domains, has been observed to cause a significant deterioration in classification accuracy, particularly for patterns belonging to the less represented classes. In this paper, we report experimental results that point to the benefit of appropriately downsizing the majority class while simultaneously increasing the size of the minority class in order to balance the two. This is achieved by applying a modification of the previously proposed Decontamination methodology. The combination of this proposal with a weighted distance function is also explored.
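The abstract does not spell out the modified procedure, so the following is only a rough sketch of one way an editing step restricted to the majority class could both shrink that class and enlarge the minority one. It assumes a rule in the spirit of generalized editing: a majority-class sample is relabelled when at least `k_prime` of its `k` nearest neighbours agree on a class, and is discarded otherwise. The function name `restricted_edit`, the parameters `k` and `k_prime`, the Euclidean distance, and the toy data are all illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def restricted_edit(X, y, majority, k=3, k_prime=3):
    """Edit only the samples labelled `majority`; keep all others untouched.

    Assumed rule (illustrative, not the authors' exact method):
    a majority sample takes the label shared by at least `k_prime` of its
    `k` nearest neighbours, and is dropped if no label reaches `k_prime`.
    Choose k_prime > k / 2 so that at most one label can qualify.
    """
    X_out, y_out = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            # Samples of the other classes are never removed or relabelled.
            X_out.append(xi)
            y_out.append(yi)
            continue
        # k nearest neighbours of sample i, using the original labels.
        neighbours = sorted(
            (math.dist(xi, X[j]), j) for j in range(len(X)) if j != i
        )[:k]
        votes = Counter(y[j] for _, j in neighbours)
        label, count = votes.most_common(1)[0]
        if count >= k_prime:
            X_out.append(xi)
            y_out.append(label)  # may stay majority or switch class
        # otherwise the sample is discarded
    return X_out, y_out


# Toy data: six "neg" (majority) samples, three "pos" (minority) samples.
X = [(0, 0), (0, 1), (1, 0), (1, 1),      # neg cluster
     (5, 5), (5, 6), (6, 5),              # pos cluster
     (5.4, 5.4),                          # neg sample inside the pos cluster
     (3.3, 3.3)]                          # neg sample between the clusters
y = ["neg"] * 4 + ["pos"] * 3 + ["neg", "neg"]

X_new, y_new = restricted_edit(X, y, majority="neg", k=3, k_prime=3)
print(Counter(y), "->", Counter(y_new))
```

In this toy run the majority sample lying inside the minority cluster is relabelled and the ambiguous one between the clusters is dropped, so the majority class shrinks while the minority class grows. Decisions are made in a single pass against the original labels; an iterative variant would be an equally plausible reading.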


Keywords: Majority Class; Weighted Distance; Minority Class; Imbalanced Data; Supervise Method



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • R. Barandela (1, 2)
  • E. Rangel (1)
  • J. S. Sánchez (3)
  • F. J. Ferri (4)

  1. Lab for Pattern Recognition, Inst. Tecnológico de Toluca, Metepec, México
  2. Instituto de Geografía Tropical, La Habana, Cuba
  3. Dept. Llenguatges i Sistemes Informàtics, U. Jaume I, Castelló, Spain
  4. Dept. Informàtica, Universitat de València, Burjassot, Spain
