Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling
In classification there are problems in which the number of examples per class is not evenly distributed; these are known as imbalanced data sets. This situation is a handicap when trying to identify the minority classes, as learning algorithms are not usually adapted to such characteristics. A usual approach to dealing with imbalanced data sets is to apply a preprocessing step. In this paper we analyze the usefulness of data complexity measures for evaluating the behavior of undersampling and oversampling methods. Two classical learning methods, C4.5 and PART, are considered over a wide range of imbalanced data sets built from real data. Specifically, SMOTE-based oversampling techniques and an evolutionary undersampling method have been selected for the study. We extract behavior patterns from the results in the data complexity space defined by the measures, coding them as intervals. Then, we derive rules from the intervals that describe both good and bad behaviors of C4.5 and PART for the different preprocessing approaches, thus obtaining a complete characterization of the data sets and of the differences between the oversampling and undersampling results.
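The oversampling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard SMOTE interpolation scheme, in which each synthetic minority example is placed on the segment between a minority example and one of its nearest minority neighbours. The function names, the toy data, and the parameter values (`k`, `seed`) are hypothetical and are not taken from the paper; the evolutionary undersampling method is not shown.

```python
import math
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    example and one of its k nearest minority neighbours.

    Assumes at least two minority examples are available.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical): 8 majority points and 3 minority points.
majority = [(float(i), float(i % 3)) for i in range(8)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]
labels = [0] * len(majority) + [1] * len(minority)

# Generate enough synthetic minority points to balance the two classes.
new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_labels = labels + [1] * len(new_points)
```

In this sketch the imbalance ratio falls from 8/3 to 1 after oversampling. Because every synthetic point is an interpolation between two existing minority examples, all new points stay inside the region already occupied by the minority class.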
Keywords: Classification, Evolutionary algorithms, Data complexity, Imbalanced data sets, Oversampling, Undersampling, C4.5, PART
This work has been supported by the Spanish Ministry of Education and Science under Project TIN2008-06681-C06-(01 and 02). J. Luengo holds an FPU scholarship from the Spanish Ministry of Education.