Abstract
Class imbalance is a challenging problem in which a minority class exhibits unsatisfactory classification performance. A trivial classifier is biased toward the majority class because minority instances constitute only a tiny fraction of the data. In this paper, a density function is defined as the distance along the shortest path between each majority instance and a minority-cluster pseudo-centroid in an underlying cluster graph. A short path implies a dense, highly overlapping neighborhood of minority instances, whereas a long path indicates sparsity. A new under-sampling algorithm is proposed to eliminate majority instances with short path distances, because these instances are insignificant and obscure the classification boundary in the overlapping region. The results show predictive improvements on the minority class for various classifiers across several UCI datasets.
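The idea in the abstract can be sketched in a few lines: build a neighborhood graph over all instances, measure each majority instance's shortest-path distance to a minority pseudo-centroid, and discard the majority instances with the shortest paths. The following is a minimal illustrative sketch, not the authors' implementation; the ε-neighborhood graph, the choice of pseudo-centroid (the minority point nearest the minority mean), and the `keep_frac` parameter are all assumptions made for this example.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import cdist

def density_under_sample(X_maj, X_min, eps=1.0, keep_frac=0.5):
    """Illustrative sketch (not the authors' code): drop majority
    instances whose graph shortest-path distance to a minority
    pseudo-centroid is short, i.e. those deep in the overlap region."""
    # Pseudo-centroid: the minority instance closest to the minority
    # mean (an assumption for this sketch).
    mean = X_min.mean(axis=0)
    centroid_idx = np.argmin(np.linalg.norm(X_min - mean, axis=1))

    # Epsilon-neighborhood graph over all instances; a zero entry in a
    # dense matrix means "no edge" to scipy's csgraph routines.
    X = np.vstack([X_min, X_maj])
    D = cdist(X, X)
    W = np.where(D <= eps, D, 0.0)

    # Shortest-path distance from the pseudo-centroid to every node;
    # unreachable nodes get inf (treated as maximally distant).
    dist = dijkstra(W, directed=False, indices=centroid_idx)
    maj_dist = dist[len(X_min):]

    # Keep the majority instances with the LONGEST paths (sparse
    # region); drop the short-path (overlapping, dense) ones.
    order = np.argsort(-maj_dist)
    keep = order[: int(keep_frac * len(X_maj))]
    return X_maj[keep]
```

Under these assumptions, majority points entangled with the minority cluster receive short path distances and are removed first, while majority points in distant, well-separated regions are retained.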
Acknowledgments
The authors would like to acknowledge that this research was fully funded by a research grant for new scholars from the Thailand Research Fund (TRG5680082), Year 2013. In addition, we would like to thank the Research Administration Center, Chiang Mai University, for the proofreading service.
Cite this article
Bunkhumpornpat, C., Sinapiromsaran, K. DBMUTE: density-based majority under-sampling technique. Knowl Inf Syst 50, 827–850 (2017). https://doi.org/10.1007/s10115-016-0957-5