Abstract
Learning in imbalanced datasets is a pervasive problem prevalent in a wide variety of real-world applications. In imbalanced datasets, the class of interest is generally a small fraction of the total instances, but misclassification of such instances is often expensive. While there is a significant body of research on the class imbalance problem for binary class datasets, multi-class datasets have received considerably less attention. This is partially due to the fact that the multi-class imbalance problem is often much harder than its related binary class problem, as the relative frequency and cost of each of the classes can vary widely from dataset to dataset. In this paper we study the multi-class imbalance problem as it relates to decision trees (specifically C4.4 and HDDT), and develop a new multi-class splitting criterion. From our experiments we show that multi-class Hellinger distance decision trees, when combined with decomposition techniques, outperform C4.4.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic minority over-sampling technique. JAIR 16, 321–357 (2002)
Chawla, N.V.: Data mining for imbalanced datasets: An overview. In: Data Mining and Knowledge Discovery Handbook, pp. 875–886 (2010)
Cieslak, D.A., Chawla, N.V.: Learning Decision Trees for Unbalanced Data. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008)
Cieslak, D.A., Chawla, N.V.: Start globally, optimize locally, predict globally: Improving performance on imbalanced data. In: ICDM, pp. 143–152 (2008)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. JMLR 7, 30 (2006)
Dietterich, T., Bakiri, G.: Error-correcting output codes: A general method for improving multiclass inductive learning programs. In: AAAI, pp. 395–395 (1994)
Domingos, P.: Metacost: A general method for making classifiers cost-sensitive. In: KDD, pp. 155–164 (1999)
Freund, Y., Schapire, R.: A Desicion-Theoretic Generalization of On-Line Learning and an Application to Boosting. In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Exp. News. 11(1), 10–18 (2009)
Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning. In: NIPS, pp. 231–238 (1995)
Maloof, M.A.: Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML WLIDS (2003)
Provost, F., Domingos, P.: Tree induction for probability-based ranking. Machine Learning 52(3), 199–215 (2003)
Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
Raeder, T., Hoens, T., Chawla, N.: Consequences of Variability in Classifier Performance Estimates. In: ICDM, pp. 421–430 (2010)
Rifkin, R., Klautau, A.: In defense of one-vs-all classification. JMLR 5, 101–141 (2004)
Ting, K.M.: An instance-weighting method to induce cost-sensitive trees. TKDE 14(3), 659–665 (2002)
Turney, P.D.: Types of cost in inductive concept learning. In: ICML, pp. 15–21 (2000)
Van Calster, B., Van Belle, V., Condous, G., Bourne, T., Timmerman, D., Van Huffel, S.: Multi-class auc metrics and weighted alternatives. In: IJCNN, pp. 1390–1396 (2008)
Zhou, Z.-H., Liu, X.-Y.: On multi-class cost-sensitive learning. In: AAAI, pp. 567–572 (2006)
Zhou, Z.-H., Liu, X.-Y.: Training cost-sensitive neural networks with methods addressing the class imbalance problem. TKDE 18(1), 63–77 (2006)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hoens, T.R., Qian, Q., Chawla, N.V., Zhou, ZH. (2012). Building Decision Trees for the Multi-class Imbalance Problem. In: Tan, PN., Chawla, S., Ho, C.K., Bailey, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2012. Lecture Notes in Computer Science(), vol 7301. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30217-6_11
Download citation
DOI: https://doi.org/10.1007/978-3-642-30217-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30216-9
Online ISBN: 978-3-642-30217-6
eBook Packages: Computer ScienceComputer Science (R0)