DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique

Applied Intelligence

Abstract

A dataset exhibits the class imbalance problem when a target class has a very small number of instances relative to other classes. A trivial classifier typically fails to detect a minority class due to its extremely low incidence rate. In this paper, a new over-sampling technique called DBSMOTE is proposed. Our technique relies on a density-based notion of clusters and is designed to over-sample an arbitrarily shaped cluster discovered by DBSCAN. DBSMOTE generates synthetic instances along a shortest path from each positive instance to a pseudo-centroid of a minority-class cluster. Consequently, these synthetic instances are dense near this centroid and are sparse far from this centroid. Our experimental results show that DBSMOTE improves precision, F-value, and AUC more effectively than SMOTE, Borderline-SMOTE, and Safe-Level-SMOTE for imbalanced datasets.
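The generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the minority-class cluster has already been found (for example by DBSCAN) and is passed in as a list of points; the pseudo-centroid is taken to be the cluster member nearest to the arithmetic mean; paths are computed with Dijkstra's algorithm over a simple eps-neighbourhood graph; and each synthetic instance is placed at a random position on a randomly chosen edge of the path. The names `pseudo_centroid`, `shortest_path_predecessors`, `oversample_cluster`, and the parameter `eps` are hypothetical.

```python
import heapq
import math
import random


def pseudo_centroid(points):
    """Index of the cluster member closest to the arithmetic mean (assumed definition)."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return min(range(len(points)), key=lambda i: math.dist(points[i], mean))


def shortest_path_predecessors(points, source, eps):
    """Dijkstra over the eps-neighbourhood graph; returns each node's predecessor."""
    n = len(points)
    dist = [math.inf] * n
    pred = [None] * n
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v in range(n):
            w = math.dist(points[u], points[v])
            if v != u and w <= eps and d + w < dist[v]:
                dist[v] = d + w
                pred[v] = u
                heapq.heappush(heap, (d + w, v))
    return pred


def oversample_cluster(points, eps, rng=None):
    """Generate one synthetic point per cluster member along its path to the pseudo-centroid."""
    rng = rng or random.Random(0)
    centre = pseudo_centroid(points)
    pred = shortest_path_predecessors(points, centre, eps)
    synthetic = []
    for i in range(len(points)):
        # Walk back from instance i to the pseudo-centroid, collecting path edges.
        edges = []
        node = i
        while pred[node] is not None:
            edges.append((node, pred[node]))
            node = pred[node]
        if not edges:
            continue  # the pseudo-centroid itself, or a point unreachable within eps
        a, b = rng.choice(edges)
        t = rng.random()  # interpolation position on the chosen edge
        synthetic.append(
            tuple(pa + t * (pb - pa) for pa, pb in zip(points[a], points[b]))
        )
    return synthetic


if __name__ == "__main__":
    # An L-shaped cluster: a straight line to the mean would leave the cluster,
    # while the graph path stays on it.
    cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
    for point in oversample_cluster(cluster, eps=1.1):
        print(point)
```

Edges close to the pseudo-centroid lie on the paths of many instances, so they are chosen more often. This is one way to obtain the behaviour the abstract describes, with synthetic instances dense near the centroid and sparse far from it.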

Correspondence to Chumphol Bunkhumpornpat.

About this article

Cite this article

Bunkhumpornpat, C., Sinapiromsaran, K. & Lursinsap, C. DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique. Appl Intell 36, 664–684 (2012). https://doi.org/10.1007/s10489-011-0287-y

  • DOI: https://doi.org/10.1007/s10489-011-0287-y
