Abstract
Synthetic minority over-sampling technique (SMOTE) is an effective over-sampling method designed specifically for learning from imbalanced data sets. However, SMOTE generates its synthetic samples somewhat blindly. This paper proposes a novel approach to the class imbalance problem that combines the Threshold SMOTE (TSMOTE) and Attribute Bagging (AB) algorithms. TSMOTE uses the majority samples to adjust the neighbor selection strategy of SMOTE and thereby control the quality of the new samples. Attribute Bagging, a well-known ensemble learning algorithm, is used to improve the predictive power of the classifier. A comprehensive suite of experiments is conducted on 7 imbalanced data sets from the UCI machine learning repository. The experimental results show that TSMOTE-AB outperforms SMOTE and other previously known algorithms.
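The two building blocks named in the abstract can be illustrated with a minimal sketch. It shows plain SMOTE interpolation between minority neighbors and the random feature subsets used by attribute bagging. It is not the authors' TSMOTE-AB: the abstract does not specify the threshold rule that uses majority samples, so that rule is left out here. The function names `smote`, `attribute_subsets`, and `project`, the parameter values, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Plain SMOTE (no threshold rule): interpolate between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def attribute_subsets(n_features, subset_size, n_subsets, rng=None):
    """Attribute bagging: one random feature subset (drawn without
    replacement) per ensemble member."""
    rng = rng or random.Random(0)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_subsets)]


def project(rows, features):
    """Keep only the selected feature columns of each row."""
    return [[row[f] for f in features] for row in rows]


# Hypothetical toy minority class with three features.
minority = [[1.0, 2.0, 0.5], [1.2, 1.8, 0.7], [0.9, 2.2, 0.4], [1.1, 2.1, 0.6]]
new_samples = smote(minority, n_new=6, k=2)
subsets = attribute_subsets(n_features=3, subset_size=2, n_subsets=4)
# One projected training view per ensemble member.
views = [project(minority + new_samples, s) for s in subsets]
```

In a full pipeline, each view would be combined with the correspondingly projected majority samples and used to train one base classifier, and the ensemble would combine their predictions, for example by voting.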
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, J., Yun, B., Huang, P., Liu, YA. (2013). Applying Threshold SMOTE Algorithm with Attribute Bagging to Imbalanced Datasets. In: Lingras, P., Wolski, M., Cornelis, C., Mitra, S., Wasilewski, P. (eds) Rough Sets and Knowledge Technology. RSKT 2013. Lecture Notes in Computer Science, vol 8171. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41299-8_21
DOI: https://doi.org/10.1007/978-3-642-41299-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41298-1
Online ISBN: 978-3-642-41299-8
eBook Packages: Computer Science (R0)