Abstract
Imbalanced data sets contain an unequal distribution of samples among the classes and pose a challenge to learning algorithms, which find it hard to learn the minority-class concepts. Synthetic oversampling techniques address this problem by creating synthetic minority samples to balance the data set. However, most of these techniques may create erroneous synthetic minority samples that fall inside majority regions. To address this, the paper presents a novel Cluster-Based Synthetic Oversampling (CBSO) algorithm. CBSO adopts its basic idea from existing synthetic oversampling techniques and incorporates unsupervised clustering into its synthetic data generation mechanism. CBSO ensures that the synthetic samples it creates always lie inside minority regions, thus avoiding the creation of wrong synthetic samples. Simulation analyses on several real-world data sets show the effectiveness of CBSO, with improvements in assessment metrics such as overall accuracy, F-measure, and G-mean.
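The core idea the abstract describes, clustering the minority class and then restricting SMOTE-style interpolation to pairs within the same cluster, can be sketched as follows. This is a minimal illustration, not the paper's exact method: the greedy centroid-threshold clustering, the `threshold` parameter, and the handling of singleton clusters are all assumptions made for this sketch.

```python
import numpy as np

def cluster_minority(X_min, threshold):
    """Greedy clustering of minority samples: a sample joins an
    existing cluster if it lies within `threshold` of that cluster's
    centroid, otherwise it starts a new cluster."""
    clusters, centroids = [], []
    for i, x in enumerate(X_min):
        if centroids:
            dists = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(dists))
            if dists[j] <= threshold:
                clusters[j].append(i)
                centroids[j] = X_min[clusters[j]].mean(axis=0)
                continue
        clusters.append([i])
        centroids.append(x.copy())
    return clusters

def cbso_oversample(X_min, n_synthetic, threshold=1.0, rng=None):
    """SMOTE-style interpolation restricted to same-cluster pairs,
    so every synthetic point stays inside a minority region."""
    rng = np.random.default_rng(rng)
    clusters = cluster_minority(X_min, threshold)
    synthetic = []
    for _ in range(n_synthetic):
        # pick a random minority sample and find its cluster
        i = int(rng.integers(len(X_min)))
        members = next(c for c in clusters if i in c)
        if len(members) == 1:
            # singleton cluster: duplicate rather than interpolate
            synthetic.append(X_min[i])
            continue
        # interpolate only toward a sample from the SAME cluster
        others = [m for m in members if m != i]
        j = others[int(rng.integers(len(others)))]
        gap = rng.random()
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because interpolation partners are drawn from the same minority cluster, no synthetic point can land on a line segment that crosses a majority region between two distant minority clusters, which is the failure mode of plain SMOTE the abstract points out.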
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Barua, S., Islam, M.M., Murase, K. (2011). A Novel Synthetic Minority Oversampling Technique for Imbalanced Data Set Learning. In: Lu, BL., Zhang, L., Kwok, J. (eds) Neural Information Processing. ICONIP 2011. Lecture Notes in Computer Science, vol 7063. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24958-7_85
DOI: https://doi.org/10.1007/978-3-642-24958-7_85
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24957-0
Online ISBN: 978-3-642-24958-7