Building Decision Trees for the Multi-class Imbalance Problem

  • T. Ryan Hoens
  • Qi Qian
  • Nitesh V. Chawla
  • Zhi-Hua Zhou

Part of the Lecture Notes in Computer Science book series (LNCS, volume 7301)

Abstract

Learning from imbalanced datasets is a pervasive problem in a wide variety of real-world applications. In an imbalanced dataset, the class of interest generally constitutes a small fraction of the total instances, yet misclassifying those instances is often costly. While there is a significant body of research on the class imbalance problem for binary-class datasets, multi-class datasets have received considerably less attention. This is partially because the multi-class imbalance problem is often much harder than its binary counterpart, as the relative frequency and cost of each class can vary widely from dataset to dataset. In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT) and develop a new multi-class splitting criterion. Our experiments show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.
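
HDDT's splitting criterion replaces entropy-based measures with the Hellinger distance between the class-conditional distributions induced by a candidate split, which makes it insensitive to class skew. As a rough illustration only, the sketch below implements the binary Hellinger criterion from the HDDT literature plus a simple one-vs-all average as one plausible multi-class surrogate; the function names and the averaging step are assumptions for illustration, not the multi-class criterion developed in this paper.

```python
import numpy as np

def hellinger_distance(y, branch, pos_label):
    """Hellinger distance of a candidate split for one positive class.

    y         : 1-D array of class labels, one per training instance
    branch    : 1-D array assigning each instance to a branch of the split
    pos_label : the label treated as the positive (minority) class
    """
    pos = (y == pos_label)
    neg = ~pos
    n_pos, n_neg = pos.sum(), neg.sum()
    total = 0.0
    for b in np.unique(branch):
        in_b = (branch == b)
        p = (pos & in_b).sum() / n_pos  # P(branch b | positive class)
        q = (neg & in_b).sum() / n_neg  # P(branch b | rest)
        total += (np.sqrt(p) - np.sqrt(q)) ** 2
    return np.sqrt(total)  # 0 (no separation) up to sqrt(2)

def ova_hellinger(y, branch):
    """One-vs-all surrogate for the multi-class case: treat each class as
    positive in turn and average the binary distances. (An illustrative
    decomposition, not necessarily the criterion proposed in the paper.)"""
    return np.mean([hellinger_distance(y, branch, c) for c in np.unique(y)])

# Toy usage: the split that isolates the minority class scores higher.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])        # 6 majority vs. 2 minority
split_a = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # mixes the classes
split_b = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # isolates the minority
print(ova_hellinger(y, split_a))  # ~0.92
print(ova_hellinger(y, split_b))  # ~1.41 (the maximum, sqrt(2))
```

Because the branch proportions are normalized within each class, the score does not favor the majority class the way information gain can under heavy skew.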

Keywords

Decision Tree, Binary Class, Minority Class, Class Imbalance, Imbalance Problem

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • T. Ryan Hoens¹
  • Qi Qian²
  • Nitesh V. Chawla¹
  • Zhi-Hua Zhou²

  1. Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, USA
  2. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
