Benchmarking binary classification models on data sets with different degrees of imbalance

  • Research Article
  • Published in Frontiers of Computer Science in China

Abstract

In practice, many problems are binary classification problems, such as credit risk assessment and medical testing to determine whether a patient has a certain disease. Different problems, however, have different characteristics that affect how difficult they are to solve. One important characteristic is the degree of imbalance between the two classes in the data set. For data sets with different degrees of imbalance, are the commonly used binary classification methods still feasible? In this study, various binary classification models, including traditional statistical methods and more recent methods from artificial intelligence, such as linear regression, discriminant analysis, decision trees, neural networks, and support vector machines, are reviewed, and their performance in terms of classification accuracy and area under the Receiver Operating Characteristic (ROC) curve is tested and compared on fourteen data sets with different degrees of imbalance. The results help in selecting appropriate methods for problems with different degrees of imbalance.
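The paper's original experiments are not reproduced here; the following is a minimal sketch of the kind of evaluation the abstract describes, assuming Python with scikit-learn and a synthetic imbalanced data set. The specific classifiers, hyperparameters, and the make_classification setup are illustrative assumptions rather than the authors' actual configuration, and logistic regression stands in for the paper's linear regression model.

```python
# Illustrative sketch (not the paper's actual experiments): benchmark several
# common binary classifiers on a synthetic imbalanced data set and compare
# classification accuracy with the area under the ROC curve (AUC).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic data with roughly a 9:1 class imbalance (an assumed setup).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear discriminant": LinearDiscriminantAnalysis(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(20,),
                                    max_iter=2000, random_state=0),
    "SVM (RBF kernel)": SVC(probability=True, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    # AUC is computed from the predicted probability of the positive class.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:>20s}  accuracy={acc:.3f}  AUC={auc:.3f}")
```

On a heavily imbalanced data set, plain accuracy can look high even for a classifier that always predicts the majority class, which is why AUC is reported alongside it.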

Author information

Corresponding author

Correspondence to Ligang Zhou.

About this article

Cite this article

Zhou, L., Lai, K.K. Benchmarking binary classification models on data sets with different degrees of imbalance. Front. Comput. Sci. China 3, 205–216 (2009). https://doi.org/10.1007/s11704-009-0027-1
