The Imbalanced Training Sample Problem: Under or Over Sampling?
The problem of imbalanced training sets in supervised pattern recognition is receiving growing attention. A training sample is imbalanced when one class is represented by a large number of examples while another is represented by only a few. This situation, which arises in several practical domains, has been observed to produce an important deterioration in classification accuracy, particularly for patterns belonging to the less represented classes. In this paper we present a study of the relative merits of several re-sizing techniques for handling the imbalance issue. We also assess the benefit of combining some of these techniques.
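The two basic re-sizing strategies contrasted in the title can be illustrated in a few lines. The sketch below is a generic, minimal version of random under-sampling and random over-sampling; it is not the paper's own procedure (which also considers weighted distances and editing rules), and the function names `undersample` and `oversample` are illustrative choices.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly discard majority-class examples until both classes
    have the same number of examples."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept + minority

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class examples until both classes
    have the same number of examples."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Toy data: 90 majority examples vs. 10 minority examples.
majority = [("x%d" % i, 0) for i in range(90)]
minority = [("y%d" % i, 1) for i in range(10)]

balanced_small = undersample(majority, minority)  # 10 + 10 examples
balanced_large = oversample(majority, minority)   # 90 + 90 examples
```

Under-sampling trades information loss in the majority class for a smaller training set, while over-sampling keeps all examples at the cost of duplicated minority patterns, which can encourage overfitting; the study's comparison weighs exactly this trade-off.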
Keywords: Majority Class · Weighted Distance · Under Sampling · Minority Class · Neighbor Rule