A Comparison of Different Off-Centered Entropies to Deal with Class Imbalance for Decision Trees

  • Philippe Lenca
  • Stéphane Lallich
  • Thanh-Nghi Do
  • Nguyen-Khang Pham
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5012)

Abstract

In data mining, large differences in prior class probabilities, known as the class imbalance problem, have been reported to hinder the performance of classifiers such as decision trees. Dealing with imbalanced and cost-sensitive data has been recognized as one of the 10 most challenging problems in data mining research. In decision tree learning, many splitting measures are based on Shannon's entropy. A major characteristic of these entropies is that they take their maximal value when the distribution of the class variable is uniform. To deal with the class imbalance problem, we proposed an off-centered entropy which takes its maximal value for a distribution fixed by the user. This distribution can be the a priori distribution of the class variable or a distribution taking into account the costs of misclassification. Other authors have proposed an asymmetric entropy. In this paper we present the concepts behind these three entropies and compare their effectiveness on 20 imbalanced data sets. All our experiments are based on the C4.5 decision tree algorithm, in which only the entropy function is modified. The results are promising and show the value of off-centered entropies for dealing with the class imbalance problem.
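
To make the construction concrete for the two-class case, the following is a minimal Python sketch, assuming the piecewise-linear recentering of the class frequency for the off-centered entropy and the rational form usually given for the asymmetric entropy of Marcellin et al.; the function names are ours and the code is an illustration under those assumptions, not the implementation used in the experiments. Both measures reach their maximum at a user-chosen frequency theta instead of at 0.5, which is the property exploited when they replace Shannon's entropy in C4.5.

  import math

  def shannon_entropy(p):
      """Shannon entropy (base 2) of a binary distribution (p, 1 - p); maximal at p = 0.5."""
      if p in (0.0, 1.0):
          return 0.0
      return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

  def off_centered_entropy(p, theta):
      """Off-centered entropy for the two-class case (sketch).
      The class frequency p is rescaled piecewise linearly so that p = theta
      is mapped to 0.5, then Shannon's entropy is applied to the rescaled value.
      The measure is therefore maximal at p = theta; theta is assumed to lie in (0, 1)."""
      if p <= theta:
          eta = p / (2 * theta)
      else:
          eta = (p + 1 - 2 * theta) / (2 * (1 - theta))
      return shannon_entropy(eta)

  def asymmetric_entropy(p, theta):
      """Asymmetric entropy for the two-class case (sketch of the rational form),
      also maximal (value 1) at p = theta."""
      return p * (1 - p) / ((1 - 2 * theta) * p + theta ** 2)

  if __name__ == "__main__":
      theta = 0.1  # e.g. the a priori frequency of the minority class
      for p in (0.05, 0.1, 0.5, 0.9):
          print(p,
                round(off_centered_entropy(p, theta), 3),
                round(asymmetric_entropy(p, theta), 3))

With theta = 0.1, both measures peak at a node frequency of 0.1 rather than 0.5, so a split that isolates the rare class is rewarded even when the resulting distributions are far from balanced.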

Keywords

Decision trees · Shannon entropy · Off-centered entropies · Class imbalance

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Philippe Lenca (1)
  • Stéphane Lallich (2)
  • Thanh-Nghi Do (3)
  • Nguyen-Khang Pham (4)

  1. Institut TELECOM, TELECOM Bretagne, Lab-STICC, Brest, France
  2. Laboratoire ERIC, Université Lyon 2, Lyon, France
  3. INRIA Futurs/LRI, Université de Paris-Sud, Orsay, France
  4. IRISA, Rennes, France