Abstract
The problem of imbalanced training data in supervised methods is receiving growing attention. Data are imbalanced when one class is much more heavily represented than the others in the training sample. This situation, which arises in several practical domains, has been observed to degrade classification accuracy considerably, particularly for patterns belonging to the less represented classes. In the present paper, we report experimental results indicating the benefit of appropriately downsizing the majority class while simultaneously increasing the size of the minority class, so as to balance the two. This is achieved by applying a modification of the previously proposed Decontamination methodology. Combining this proposal with a weighted distance function is also explored.
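The balancing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not spell out the Decontamination procedure, so the code below assumes a simple two-class rule in which a majority-class pattern is relabeled as minority when most of its k nearest neighbours (Euclidean distance) belong to the minority class. The function name `restricted_relabel`, the parameter `k`, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def restricted_relabel(X, y, majority, minority, k=3):
    """Assumed rule, for illustration only (not the paper's exact method).

    Only majority-class patterns are examined. A majority pattern whose
    k nearest neighbours are mostly minority is relabeled as minority,
    which shrinks the majority class and grows the minority class.
    Minority patterns are never touched.
    """
    new_y = list(y)
    for i, label in enumerate(y):
        if label != majority:
            continue  # the procedure is restricted to the majority class
        others = [j for j in range(len(X)) if j != i]
        # Neighbours are ranked with the original labels, not the updated ones.
        nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        minority_votes = sum(1 for j in nearest if y[j] == minority)
        if minority_votes > k / 2:
            new_y[i] = minority
    return new_y


# Toy data: six majority patterns (class 0), one of which lies inside
# the region occupied by the three minority patterns (class 1).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5.2),
     (5, 5), (5.5, 5), (5, 5.5)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

balanced = restricted_relabel(X, y, majority=0, minority=1, k=3)
print(Counter(y), Counter(balanced))
```

In this toy example only the majority pattern at (5, 5.2) changes class, so the class counts move from 6 versus 3 to 5 versus 4. A weighted distance function, as mentioned in the abstract, could be substituted for `math.dist`, but its form is not given here and is left out of the sketch.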
Work partially supported by grants 32016-A (Mexican CONACyT), 744.99-P (Mexican Cosnet), TIC2000-1703-C03-03 (Spanish CICYT) and P1-1B2002-07 (Fundació Caixa Castelló-Bancaixa).
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Barandela, R., Rangel, E., Sánchez, J.S., Ferri, F.J. (2003). Restricted Decontamination for the Imbalanced Training Sample Problem. In: Sanfeliu, A., Ruiz-Shulcloper, J. (eds) Progress in Pattern Recognition, Speech and Image Analysis. CIARP 2003. Lecture Notes in Computer Science, vol 2905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24586-5_52
DOI: https://doi.org/10.1007/978-3-540-24586-5_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20590-6
Online ISBN: 978-3-540-24586-5
eBook Packages: Springer Book Archive