Benchmarking binary classification models on data sets with different degrees of imbalance

  • Research Article
  • Published in Frontiers of Computer Science in China

Abstract

In practice, many problems are binary classification problems, such as credit risk assessment and medical testing to determine whether a patient has a certain disease. Different problems, however, have different characteristics that affect how difficult they are to solve. One important characteristic is the degree of imbalance between the two classes in the data set. For data sets with different degrees of imbalance, are the commonly used binary classification methods still feasible? In this study, various binary classification models, including traditional statistical methods and more recent methods from artificial intelligence, such as linear regression, discriminant analysis, decision trees, neural networks, and support vector machines, are reviewed, and their performance in terms of classification accuracy and area under the Receiver Operating Characteristic (ROC) curve is tested and compared on fourteen data sets with different degrees of imbalance. The results help in selecting appropriate methods for problems with different degrees of imbalance.
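The paper's original experiments are not reproduced here; the following is a minimal sketch of the kind of evaluation the abstract describes, assuming Python with scikit-learn and a synthetic imbalanced data set. The specific classifiers, hyperparameters, and the make_classification setup are illustrative assumptions rather than the authors' actual configuration, and logistic regression stands in for the paper's linear regression model.

```python
# Illustrative sketch (not the paper's actual experiments): benchmark several
# common binary classifiers on a synthetic imbalanced data set and compare
# classification accuracy with the area under the ROC curve (AUC).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic data with roughly a 9:1 class imbalance (an assumed setup).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear discriminant": LinearDiscriminantAnalysis(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(20,),
                                    max_iter=2000, random_state=0),
    "SVM (RBF kernel)": SVC(probability=True, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    # AUC is computed from the predicted probability of the positive class.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:>20s}  accuracy={acc:.3f}  AUC={auc:.3f}")
```

On a heavily imbalanced data set, plain accuracy can look high even for a classifier that always predicts the majority class, which is why AUC is reported alongside it.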

Author information

Corresponding author

Correspondence to Ligang Zhou.

About this article

Cite this article

Zhou, L., Lai, K.K. Benchmarking binary classification models on data sets with different degrees of imbalance. Front. Comput. Sci. China 3, 205–216 (2009). https://doi.org/10.1007/s11704-009-0027-1
