DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique

Bunkhumpornpat, Chumphol; Sinapiromsaran, Krung; Lursinsap, Chidchanok

doi:10.1007/s10489-011-0287-y

Published: 14 April 2011

Applied Intelligence, volume 36, pages 664–684 (2012)

Abstract

A dataset exhibits the class imbalance problem when a target class has a very small number of instances relative to other classes. A trivial classifier typically fails to detect a minority class due to its extremely low incidence rate. In this paper, a new over-sampling technique called DBSMOTE is proposed. Our technique relies on a density-based notion of clusters and is designed to over-sample an arbitrarily shaped cluster discovered by DBSCAN. DBSMOTE generates synthetic instances along a shortest path from each positive instance to a pseudo-centroid of a minority-class cluster. Consequently, these synthetic instances are dense near this centroid and are sparse far from this centroid. Our experimental results show that DBSMOTE improves precision, F-value, and AUC more effectively than SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE for imbalanced datasets.
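The generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the minority-class cluster has already been found (for example by DBSCAN), takes the pseudo-centroid to be the cluster member nearest the arithmetic mean, links points that lie within a hypothetical radius `eps` of each other, and places each synthetic instance at a random position on a randomly chosen edge of the shortest path. All function and parameter names are illustrative.

```python
import heapq
import math
import random


def pseudo_centroid(cluster):
    """Index of the cluster member closest to the arithmetic mean (assumed definition)."""
    dim = len(cluster[0])
    mean = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
    return min(range(len(cluster)), key=lambda i: math.dist(cluster[i], mean))


def shortest_path_tree(cluster, source, eps):
    """Dijkstra over the graph that links points at most eps apart.

    Returns pred, where pred[i] is the next point on the shortest path from i
    towards source (None for the source and for unreachable points).
    """
    n = len(cluster)
    dist = [math.inf] * n
    pred = [None] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in range(n):
            w = math.dist(cluster[u], cluster[v])
            if v != u and w <= eps and d + w < dist[v]:
                dist[v] = d + w
                pred[v] = u
                heapq.heappush(heap, (d + w, v))
    return pred


def dbsmote_sketch(cluster, eps, rng=None):
    """One synthetic instance per point that can reach the pseudo-centroid."""
    if rng is None:
        rng = random.Random(0)
    centre = pseudo_centroid(cluster)
    pred = shortest_path_tree(cluster, centre, eps)
    synthetic = []
    for i in range(len(cluster)):
        # Walk from point i to the pseudo-centroid, collecting path edges.
        edges = []
        node = i
        while pred[node] is not None:
            edges.append((node, pred[node]))
            node = pred[node]
        if not edges:
            continue  # the pseudo-centroid itself, or an unreachable point
        a, b = rng.choice(edges)
        t = rng.random()
        # Linear interpolation between the two endpoints of the chosen edge.
        synthetic.append(
            tuple(x + t * (y - x) for x, y in zip(cluster[a], cluster[b]))
        )
    return synthetic
```

Under these assumptions, edges close to the pseudo-centroid lie on many shortest paths, so they are selected more often than edges at the cluster boundary. This gives the behaviour the abstract describes: synthetic instances are dense near the centroid and sparse far from it.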

References

Bai X, Yang X, Yu D, Latecki LJ (2008) Skeleton-based shape classification using path similarity. Int J Pattern Recognit Artif Intell 22(4):733–746
Batista GEAPA, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor 6(1):20–29
Blake CL, Merz CJ (2009) UCI Repository of machine learning databases. http://archive.ics.uci.edu/ml/. Department of Information and Computer Sciences, University of California, Irvine, California, USA
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(6):1145–1159
Buckland M, Gey F (1994) The relationship between recall and precision. J Am Soc Inf Sci 45(1):12–19
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho T-B (eds) 13th Pacific-Asia conference on knowledge discovery and data mining, Bangkok, Thailand. Lecture notes in artificial intelligence, vol 5476. Springer, Heidelberg, pp 475–482
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: The 7th European conference on principles and practice of knowledge discovery in databases, Cavtat-Dubrovnik, Croatia, pp 107–119
Chawla NV, Japkowicz N, Kolcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6
Chiang I-J, Shieh M-J, Hsu JY, Wong J-M (2005) Building a medical decision support system for colon polyp screening by using fuzzy classification trees. Appl Intell 22(1):61–75. Special Issue: Foundations and Advances in Data Mining
Cohen WW (1995) Fast effective rule induction. In: 12th international conference on machine learning, Lake Tahoe, California, USA, pp 115–123
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn. MIT Press, Cambridge
Cover T, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
Domingos P (1999) Metacost: a general method for making classifiers cost-sensitive. In: The 5th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, California, USA, pp 155–164
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: The 2nd international conference on knowledge discovery and data mining, Portland, Oregon, USA, pp 226–231
Han H, Wang W-Y, Mao B-H (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) The 2005 international conference on intelligent computing, Hefei, China. Lecture notes in computer science, vol 3644. Springer, Heidelberg, pp 878–887
Hu X (2005) A data mining approach for retailing bank customer attrition analysis. Appl Intell 22(1):47–60. Special Issue: Foundations and Advances in Data Mining
Japkowicz N (2000) The class imbalance problem: significance and strategies. In: 2000 international conference on artificial intelligence, Las Vegas, Nevada, USA, pp 111–117
Japkowicz N (2003) Class imbalance: are we focusing on the right issue? In: 20th international conference on machine learning, Washington, District of Columbia, USA, pp 17–23
Jungnickel D (2003) Graphs, networks and algorithms. Springer, Heidelberg
Kamber M, Han J (2000) Data mining: concepts and techniques, 2nd edn. Morgan Kaufmann, San Mateo
Khor K-C, Ting C-Y, Phon-Amnuaisuk S (2010) A cascaded classifier approach for improving detection rates on rare attack categories in network intrusion detection. Appl Intell. doi:10.1007/s10489-010-0263-y
Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: 14th international conference on machine learning, Nashville, Tennessee, USA, pp 179–186
Kubat M, Holte R, Matwin S (1997) Learning when negative examples abound. In: 9th European conference on machine learning, Prague, Czech Republic, pp 146–153
Lewis DD, Catlett J (1994) Heterogeneous uncertainty sampling for supervised learning. In: 11th international conference on machine learning, New Brunswick, New Jersey, USA, pp 148–156
Lu Y, Chen TQ, Hamilton B (1998) A fuzzy diagnostic model and its application in automotive engineering diagnosis. Appl Intell 9(3):231–243
Murphey YL, Chen ZH, Feldkamp LA (2008) An incremental neural learning framework and its application to vehicle diagnostics. Appl Intell 28(1):29–49
Prati RC, Batista GEAPA, Monard MC (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Monroy R, Arroyo G, Sucar LE, Sossa H (eds) 3rd Mexican international conference on artificial intelligence, Mexico City, Mexico. Lecture notes in artificial intelligence, vol 2972, pp 312–321
Quinlan JR (1992) C4.5: programs for machine learning. Morgan Kaufmann, San Mateo
Tetko IV, Livingstone DJ, Luik AI (1995) Neural network studies. 1. Comparison of overfitting and overtraining. J Chem Inf Comput Sci 35(5):826–833
Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6(11):769–772

Author information

Authors and Affiliations

Department of Mathematics, Faculty of Science, Chulalongkorn University, Bangkok, 10330, Thailand
Chumphol Bunkhumpornpat, Krung Sinapiromsaran & Chidchanok Lursinsap

Corresponding author

Correspondence to Chumphol Bunkhumpornpat.

About this article

Cite this article

Bunkhumpornpat, C., Sinapiromsaran, K. & Lursinsap, C. DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique. Appl Intell 36, 664–684 (2012). https://doi.org/10.1007/s10489-011-0287-y

Issue Date: April 2012