A Pruning-Based Approach for Searching Precise and Generalized Region for Synthetic Minority Over-Sampling
One way to deal with class imbalance is to modify the class distribution of the data. Synthetic over-sampling is a well-known method that does this by generating new synthetic minority data. The Synthetic Minority Over-sampling TEchnique (SMOTE) is a state-of-the-art synthetic over-sampling algorithm that generates new synthetic data along the line between each minority instance and its selected nearest neighbors. The advantage of SMOTE is that it makes decision regions larger and less specific to the original data. Its drawback is the over-generalization problem, in which synthetic data is generated inside the majority class region. Over-generalization causes parts of the non-minority class region to be misclassified as minority class. To overcome the over-generalization problem, we propose an algorithm, called TRIM, that searches for a precise minority region while maintaining its generalization. TRIM iteratively filters out irrelevant majority data from the precise minority region. The output of the algorithm is multiple sets of seed minority data, and each individual set is used for generating new synthetic data. Compared with state-of-the-art over-sampling algorithms, the experimental results show significant performance improvements in terms of F-measure and AUC. This suggests that over-generalization has a significant impact on the performance of synthetic over-sampling methods.
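To make the interpolation step concrete, the following is a minimal sketch of SMOTE-style generation only. It is not TRIM and not the reference SMOTE implementation. The function names `k_nearest` and `smote_sketch`, the toy data, and the parameter defaults are assumptions chosen for illustration.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points on segments between minority points.

    Hypothetical helper for illustration; assumes at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbors are all other minority points.
        candidates = [p for p in minority if p is not base]
        neighbor = rng.choice(k_nearest(base, candidates, k))
        # Pick a random position on the segment from base to neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority set: three clustered points and one isolated point.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

In this toy set, whenever the isolated point `(5.0, 5.0)` is chosen as the base, its nearest minority neighbors are far away, so the synthetic point falls somewhere on a long segment across the space between them. If majority data occupied that space, the new point would land in the majority region, which is the over-generalization problem described above.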
Keywords: Synthetic Data, Class Distribution, Minority Class, Splitting Point, Imbalanced Data