Abstract
In this paper, we introduce KSMOTE, a new classification technique that combines k-means [7] with SMOTE [4]. KSMOTE improves the performance of multi-class learning from an imbalanced dataset. K-means is used to split the set of instances into two clusters. Within each cluster, two types of sampling are applied: oversampling and undersampling. A random forest learner [3] is then applied for class prediction within each cluster. Finally, the prediction is obtained by combining the results from both clusters through a majority vote. For our experiments, we used four multi-class datasets from the UCI machine learning repository [2] with varying levels of class imbalance. KSMOTE is compared with SMOTE and with two popular multi-class modeling approaches, one-against-all (OAA) and one-against-one (OAO). The experimental results show that our approach achieves high performance rates in learning from imbalanced multi-class problems.
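The clustering and combined-sampling steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the toy data, the choice of the mean class size as the per-cluster balancing target, and the neighbour count `k=5` are all assumptions made for illustration. The random forest learner is left out, and `majority_vote` only shows how labels collected from the per-cluster learners could be combined.

```python
import random
from collections import Counter, defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
            for p in points]

def kmeans2(points, iters=20, seed=0):
    """Plain Lloyd iterations with k = 2 and random initial centroids."""
    centroids = random.Random(seed).sample(points, 2)
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assign(points, centroids)

def smote_like(samples, n_new, rng, k=5):
    """n_new synthetic points, each on the segment between a sample and
    one of its k nearest same-class neighbours (the SMOTE idea)."""
    if len(samples) < 2:  # nothing to interpolate with: duplicate instead
        return [samples[0]] * n_new
    out = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        others = [s for j, s in enumerate(samples) if j != i]
        nearest = sorted(others, key=lambda s: dist2(base, s))[:k]
        neighbour = rng.choice(nearest)
        gap = rng.random()
        out.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return out

def combined_sampling(X, y, rng):
    """Balance one cluster (assumed target: mean class size). Smaller
    classes are oversampled, larger ones randomly undersampled."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    target = max(1, round(len(X) / len(by_class)))
    Xr, yr = [], []
    for label, samples in by_class.items():
        if len(samples) < target:
            samples = samples + smote_like(samples, target - len(samples), rng)
        else:
            samples = rng.sample(samples, target)
        Xr.extend(samples)
        yr.extend([label] * target)
    return Xr, yr

def majority_vote(predictions):
    """Most frequent label among the collected predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Toy imbalanced data (hypothetical): 40 / 8 / 4 instances of classes a / b / c.
rng = random.Random(0)
sizes = (("a", 0, 40), ("b", 3, 8), ("c", 6, 4))
X = [(rng.gauss(m, 1.0), rng.gauss(m, 1.0)) for _, m, n in sizes for _ in range(n)]
y = [label for label, _, n in sizes for _ in range(n)]
clusters = kmeans2(X)
for c in (0, 1):
    Xc = [x for x, l in zip(X, clusters) if l == c]
    yc = [t for t, l in zip(y, clusters) if l == c]
    if Xc:
        Xr, yr = combined_sampling(Xc, yc, rng)
        print(c, dict(Counter(yc)), "->", dict(Counter(yr)))
```

Each printed line shows the class counts of one cluster before and after resampling. In a fuller pipeline, a learner would be trained on each resampled cluster and the resulting labels passed to `majority_vote`. Libraries such as scikit-learn and imbalanced-learn provide production versions of k-means, random forests and SMOTE.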
References
Anand, R., Mehrotra, K., Mohan, C.K., Ranka, S.: Efficient classification for multiclass problems using modular neural networks. IEEE Transactions on Neural Networks 6(1), 117–124 (1995)
Asuncion, A., Newman, D.J.: UCI machine learning repository (2007), http://archive.ics.uci.edu/ml/datasets.html
Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (2002)
Chen, S., He, H., Garcia, E.A.: RAMOBoost: Ranked minority oversampling in boosting. IEEE Transactions on Neural Networks 21(10), 1624–1642 (2010)
Fernández, A., del Jesus, M.J., Herrera, F.: Multi-class Imbalanced Data-Sets with Linguistic Fuzzy Rule Based Classification Systems Based on Pairwise Learning. In: Hüllermeier, E., Kruse, R., Hoffmann, F. (eds.) IPMU 2010. LNCS, vol. 6178, pp. 89–98. Springer, Heidelberg (2010)
Forgy, E.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768–780 (1965)
Ghanem, A.S., Venkatesh, S., West, G.: Multi-class pattern classification in imbalanced data. In: Proceedings of the 2010 20th International Conference on Pattern Recognition (2010)
Hand, D.J., Till, R.J.: A simple generalisation of the Area Under the ROC Curve for multiple class classification problems. Machine Learning 45(2), 171–186 (2001)
Hastie, T., Tibshirani, R.: Classification by Pairwise Coupling. The Annals of Statistics 26(2), 451–471 (1998)
Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. on Knowledge and Data Eng. 17(3), 299–310 (2005)
Lorena, A., de Carvalho, A., Gama, J.: A review on the combination of binary classifiers in multiclass problems. Artificial Intelligence Review 30(1), 19–37 (2008)
Orriols-Puig, A., Bernadó-Mansilla, E.: Evolutionary rule-based systems for imbalanced data sets. Soft Computing - A Fusion of Foundations, Methodologies and Applications 13(3), 213–225 (2009)
Vural, V., Dy, J.G.: A hierarchical method for multi-class support vector machines. In: Proceedings of the Twenty-First International Conference on Machine Learning (2004)
Wasikowski, M., Chen, X.-W.: Combating the small sample class imbalance problem using feature selection. IEEE Transactions on Knowledge and Data Engineering 22, 1388–1400 (2010)
Witten, I.H., Frank, E., Hall, M.A.: Data mining: practical machine learning tools and techniques, 3rd edn. Morgan Kaufmann, San Francisco (2005)
Yen, S.-J., Lee, Y.-S.: Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 36(3), 5718–5727 (2009)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Prachuabsupakij, W., Soonthornphisaj, N. (2012). Clustering and Combined Sampling Approaches for Multi-class Imbalanced Data Classification. In: Zeng, D. (eds) Advances in Information Technology and Industry Applications. Lecture Notes in Electrical Engineering, vol 136. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-26001-8_91
DOI: https://doi.org/10.1007/978-3-642-26001-8_91
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-26000-1
Online ISBN: 978-3-642-26001-8
eBook Packages: Engineering, Engineering (R0)