Selective Pre-processing of Imbalanced Data for Improving Classification Performance
In this paper we discuss problems of constructing classifiers from imbalanced data. We describe a new approach to selective pre-processing of imbalanced data that combines local over-sampling of the minority class with filtering of difficult examples from the majority classes. In experiments focused on rule-based and tree-based classifiers, we compare our approach with two related pre-processing methods: NCR and SMOTE. The results show that NCR is too strongly biased toward the minority class and leads to deteriorated specificity and overall accuracy, whereas SMOTE and our approach do not exhibit this behavior. An analysis of the degree to which the original class distribution is modified also reveals that our approach does not introduce changes as extensive as those made by SMOTE.
Keywords: Majority Class, Class Distribution, Minority Class, Class Imbalance, Imbalanced Data
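The general idea named in the abstract, filtering difficult majority examples while over-sampling the minority class locally, can be illustrated with a minimal sketch. This is not the authors' algorithm: the k-nearest-neighbour "safety" test, the amplification rule (one extra copy per opposite-class neighbour), and the names `k_nearest` and `selective_preprocess` are all assumptions made for illustration.

```python
import math


def k_nearest(idx, X, k):
    """Indices of the k nearest neighbours of X[idx] (Euclidean), excluding idx."""
    dists = sorted((math.dist(X[idx], X[j]), j) for j in range(len(X)) if j != idx)
    return [j for _, j in dists[:k]]


def selective_preprocess(X, y, minority, k=3):
    """Hypothetical sketch: filter difficult majority examples and
    locally amplify difficult minority examples.

    Assumption: an example is 'safe' when more than half of its k nearest
    neighbours share its class label; otherwise it is 'difficult'.
    """
    X_new, y_new = [], []
    for i in range(len(X)):
        neighbours = k_nearest(i, X, k)
        same = sum(1 for j in neighbours if y[j] == y[i])
        safe = 2 * same > k
        if y[i] != minority:
            # Filtering step: keep only safe majority examples.
            if safe:
                X_new.append(X[i])
                y_new.append(y[i])
        else:
            # Minority examples are always kept.
            X_new.append(X[i])
            y_new.append(y[i])
            if not safe:
                # Local over-sampling (assumed rule): one extra copy
                # for each neighbour that belongs to another class.
                for _ in range(k - same):
                    X_new.append(X[i])
                    y_new.append(y[i])
    return X_new, y_new
```

Because only examples flagged as difficult are removed or copied, the class distribution changes locally rather than globally, which is the contrast with SMOTE that the abstract draws.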