Imbalanced Classification Problems: Systematic Study, Issues and Best Practices

  • Camelia Lemnaru
  • Rodica Potolea
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 102)

Abstract

This paper provides a systematic study of the issues raised by the class imbalance problem and of possible solutions to it. A set of standard classification algorithms is considered and their performance on benchmark data is analyzed. Our experiments show that, in an imbalanced problem, the imbalance ratio (IR) can be used in conjunction with the instances-per-attribute ratio (IAR) to identify the classifier that best fits the situation. In addition, MLP and C4.5 are less affected by the imbalance, while SVM generally performs poorly in imbalanced problems. Possible solutions for overcoming these classifier issues are also presented. The overall view is that, when dealing with imbalanced problems, one should consider a wider context, taking into account several factors simultaneously: the imbalance, other data-related particularities, and the classification algorithms with their associated parameters.
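
The abstract names the two data-set indicators but does not define them. As a minimal sketch, assuming the common conventions that IR is the majority-class count divided by the minority-class count and that IAR is the number of instances divided by the number of attributes, they could be computed as follows. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def imbalance_ratio(labels):
    """IR, assumed here to be majority-class count / minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


def instances_per_attribute(n_instances, n_attributes):
    """IAR, assumed here to be number of instances / number of attributes."""
    if n_attributes <= 0:
        raise ValueError("n_attributes must be positive")
    return n_instances / n_attributes


# Hypothetical toy data set: 90 negative and 10 positive instances, 5 attributes.
labels = ["neg"] * 90 + ["pos"] * 10
ir = imbalance_ratio(labels)
iar = instances_per_attribute(len(labels), 5)
print(f"IR = {ir:.1f}, IAR = {iar:.1f}")
```

Under these assumed definitions, a larger IR indicates a more skewed class distribution and a smaller IAR indicates fewer instances available per attribute. The two values are meant to be read together rather than in isolation.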

Keywords

Class imbalance, Metrics, Classifiers, Comprehensive study

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Camelia Lemnaru (1)
  • Rodica Potolea (1)
  1. Computer Science Department, Technical University of Cluj-Napoca, Cluj-Napoca, Romania