SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory
Imbalanced data is a common problem in classification, and its importance is growing because it appears in most real-world domains. It is especially relevant for highly imbalanced data-sets, in which the ratio between classes is high. Many techniques have been developed to deal with imbalanced training sets in supervised learning, and they are usually divided into two large groups: those that act at the algorithm level and those that act at the data level. Among the data-level techniques, the most prominent are those that balance the training set either by reducing the larger class through the elimination of samples (undersampling) or by enlarging the smaller class through the construction of new samples (oversampling). This paper proposes a new hybrid method for preprocessing imbalanced data-sets through the construction of new samples: synthetic minority samples are generated with the Synthetic Minority Oversampling Technique (SMOTE) and then edited with a technique based on rough set theory, keeping only the samples that belong to the lower approximation of the minority class. The proposed method is validated by an experimental study showing good results with C4.5 as the learning algorithm.
Keywords: Imbalanced data-sets · Classification · Data preparation · Oversampling · Undersampling · Rough sets theory
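The two stages described in the abstract can be sketched in code. The following is a minimal illustration, not the paper's implementation: `smote` interpolates between a minority sample and one of its nearest minority neighbours (the core SMOTE idea), and `lower_approximation_filter` stands in for the rough-set editing step by keeping only synthetic samples that are dissimilar to every majority sample. The Euclidean distance threshold used as the similarity relation, and all function and parameter names, are assumptions for the sake of the sketch.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest minority-class neighbours (SMOTE)."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        # distances from x to every minority sample
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]        # skip the sample itself
        nb = minority[rng.choice(neighbours)]
        gap = rng.random()                         # random point on the segment
        synthetic.append(x + gap * (nb - x))
    return np.array(synthetic)

def lower_approximation_filter(synthetic, majority, threshold):
    """Keep only synthetic samples dissimilar to every majority sample,
    a simplified stand-in for membership in the lower approximation of
    the minority class (the paper uses a rough-set similarity relation;
    here a plain Euclidean threshold is assumed instead)."""
    keep = [s for s in synthetic
            if np.all(np.linalg.norm(majority - s, axis=1) > threshold)]
    return np.array(keep).reshape(-1, synthetic.shape[1])
```

Under this sketch, the balanced training set would be the union of the original data and the filtered synthetic samples; samples rejected by the filter are those too close to the majority class to be confidently labelled minority.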