
Machine Learning, Volume 28, Issue 2–3, pp. 257–291

An Exact Probability Metric for Decision Tree Splitting and Stopping

  • J. Kent Martin

Abstract

ID3's information gain heuristic is well known to be biased towards multi-valued attributes. This bias is only partially compensated for by C4.5's gain ratio. Several alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). All of these metrics are biased towards splits with smaller branches, where low-entropy splits are likely to occur by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis that the class distribution is independent of the split. Both gain and the chi-squared tests arise in asymptotic approximations to the hypergeometric, with similar criteria for their admissibility. Previous failures of pre-pruning are traced in large part to coupling these biased approximations with one another or with arbitrary thresholds, problems which are overcome by the hypergeometric. The choice of split-selection metric typically has little effect on accuracy, but can profoundly affect complexity and the effectiveness and efficiency of pruning. Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.

Keywords

pre-pruning, post-pruning, hypergeometric, Fisher's Exact Test
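
To make the abstract's central quantity concrete, the short Python sketch below computes the (log) multiple hypergeometric probability of a class-by-branch contingency table with fixed margins, i.e. the exact probability that the observed class distribution across the branches of a split would arise under the independence null hypothesis. The function names are illustrative, and the use of the point probability (rather than a tail sum) is an assumption made for brevity, not the paper's exact splitting/stopping rule.

    from math import lgamma

    def log_factorial(n):
        """log(n!) via the log-gamma function, stable for large n."""
        return lgamma(n + 1)

    def log_hypergeom_prob(table):
        """Log of the multiple hypergeometric probability of a contingency
        table with fixed row and column margins:

            P = (prod_i R_i!) * (prod_j C_j!) / (N! * prod_ij n_ij!)

        Here rows are classes, columns are branches of a candidate split,
        and n_ij is the number of class-i examples sent down branch j.
        """
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        n = sum(row_totals)
        logp = sum(log_factorial(r) for r in row_totals)
        logp += sum(log_factorial(c) for c in col_totals)
        logp -= log_factorial(n)
        logp -= sum(log_factorial(cell) for row in table for cell in row)
        return logp

    # A hypothetical 2-class, 2-branch split: branch 1 receives 8 examples of
    # class A and 1 of class B, branch 2 receives 2 of class A and 9 of class B.
    # A very small probability suggests the split is unlikely to be a chance
    # artifact of small branches.
    split = [[8, 2],
             [1, 9]]
    print(log_hypergeom_prob(split))

For a 2x2 table this point probability is the term summed in Fisher's Exact Test, which is why that test appears among the keywords; a pre-pruning rule in this spirit would compare such a probability against a chosen significance level before accepting a split.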

References

  1. Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons.
  2. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Pacific Grove, CA: Wadsworth & Brooks.
  3. Buntine, W. L. (1990). A theory of learning classification rules. PhD thesis, University of Technology, Sydney.
  4. Buntine, W., & Niblett, T. (1992). A further comparison of splitting rules for decision-tree induction. Machine Learning, 8, 75–85.
  5. Cestnik, B., Kononenko, I., & Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. In Progress in Machine Learning, EWSL-87. Wilmslow: Sigma Press.
  6. Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. In Machine Learning, EWSL-91. Berlin: Springer-Verlag.
  7. Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics, 10, 417–451.
  8. Elder, J. F. (1995). Heuristic search for model structure. In D. Fisher & H.-J. Lenz (Eds.), Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics, v. 112 (pp. 131–142). New York: Springer.
  9. Fayyad, U. M., & Irani, K. B. (1992a). The attribute selection problem in decision tree generation. Proceedings of the 10th National Conference on Artificial Intelligence (AAAI-92) (pp. 104–110). Cambridge, MA: MIT Press.
  10. Fayyad, U. M., & Irani, K. B. (1992b). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87–102.
  11. Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93) (pp. 1022–1027). San Mateo, CA: Morgan Kaufmann.
  12. Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139–172.
  13. Fisher, D. H. (1992). Pessimistic and optimistic induction. Technical Report CS-92-22, Department of Computer Science, Vanderbilt University, Nashville, TN.
  14. Fisher, D. H., & Schlimmer, J. C. (1988). Concept simplification and prediction accuracy. Proceedings of the 5th International Conference on Machine Learning (ML-88) (pp. 22–28). San Mateo, CA: Morgan Kaufmann.
  15. Fulton, T., Kasif, S., & Salzberg, S. (1995). Efficient algorithms for finding multi-way splits for decision trees. Machine Learning: Proceedings of the 12th International Conference (ML-95) (pp. 244–251). San Francisco: Morgan Kaufmann.
  16. Gluck, M. A., & Corter, J. E. (1985). Information, uncertainty, and the utility of categories. Proceedings of the 7th Annual Conference of the Cognitive Science Society (pp. 283–287). Hillsdale, NJ: Lawrence Erlbaum.
  17. John, G. H. (1995). Robust linear discriminant trees. In D. Fisher & H.-J. Lenz (Eds.), Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics, v. 112 (pp. 375–386). New York: Springer.
  18. Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. Machine Learning: Proceedings of the 9th International Conference (ML-92) (pp. 249–256). San Mateo, CA: Morgan Kaufmann.
  19. Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. Proceedings of the European Conference on Machine Learning (ECML-94) (pp. 171–182). Berlin: Springer.
  20. Liu, W. Z., & White, A. P. (1994). The importance of attribute-selection measures in decision tree induction. Machine Learning, 15, 25–41.
  21. Lopez de Mantaras, R. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6, 81–92.
  22. Martin, J. K. (1995). An exact probability metric for decision tree splitting and stopping. Technical Report 95-16, Department of Information & Computer Science, University of California, Irvine, CA.
  23. Martin, J. K., & Hirschberg, D. S. (1996a). On the complexity of learning decision trees. Proceedings of the 4th International Symposium on Artificial Intelligence and Mathematics (AI/MATH-96) (pp. 112–115). Fort Lauderdale, FL.
  24. Martin, J. K., & Hirschberg, D. S. (1996b). Small sample statistics for classification error rates I: Error rate measurements. Technical Report 96-21, Department of Information & Computer Science, University of California, Irvine, CA.
  25. Martin, J. K., & Hirschberg, D. S. (1996c). Small sample statistics for classification error rates II: Confidence intervals and significance tests. Technical Report 96-22, Department of Information & Computer Science, University of California, Irvine, CA.
  26. Mingers, J. (1987). Expert systems — rule induction with statistical data. Journal of the Operational Research Society, 38, 39–47.
  27. Mingers, J. (1989a). An empirical comparison of pruning measures for decision tree induction. Machine Learning, 4, 227–243.
  28. Mingers, J. (1989b). An empirical comparison of selection measures for decision tree induction. Machine Learning, 3, 319–342.
  29. Murphy, P. M., & Aha, D. W. (1995). UCI Repository of Machine Learning Databases [machine-readable data repository]. Department of Information & Computer Science, University of California, Irvine, CA.
  30. Murphy, P. M., & Pazzani, M. J. (1991). ID2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees. Machine Learning: Proceedings of the 8th International Workshop (ML-91) (pp. 183–187). San Mateo, CA: Morgan Kaufmann.
  31. Murthy, S., Kasif, S., Salzberg, S., & Beigel, R. (1993). OC1: Randomized induction of oblique decision trees. Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93) (pp. 322–327). Menlo Park, CA: AAAI Press.
  32. Murthy, S., & Salzberg, S. (1995). Lookahead and pathology in decision tree induction. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95) (pp. 1025–1031). San Mateo, CA: Morgan Kaufmann.
  33. Niblett, T. (1987). Constructing decision trees in noisy domains. In Progress in Machine Learning, EWSL-87. Wilmslow: Sigma Press.
  34. Niblett, T., & Bratko, I. (1986). Learning decision rules in noisy domains. In Proceedings of Expert Systems 86. Cambridge: Cambridge University Press.
  35. Park, Y., & Sklansky, J. (1990). Automated design of linear tree classifiers. Pattern Recognition, 23, 1393–1412.
  36. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
  37. Quinlan, J. R. (1988). Simplifying decision trees. In B. R. Gaines & J. H. Boose (Eds.), Knowledge Acquisition for Knowledge-Based Systems. San Diego: Academic Press.
  38. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
  39. Quinlan, J. R., & Cameron-Jones, R. M. (1995). Oversearching and layered search in empirical learning. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95) (pp. 1019–1024). San Mateo, CA: Morgan Kaufmann.
  40. Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning, 10, 153–178.
  41. Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6, 111–143.
  42. Weiss, S. M., & Indurkhya, N. (1991). Reduced complexity rule induction. Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI-91) (pp. 678–684). San Mateo, CA: Morgan Kaufmann.
  43. Weiss, S. M., & Indurkhya, N. (1994). Small sample decision tree pruning. Proceedings of the 11th International Conference on Machine Learning (ML-94) (pp. 335–342). San Francisco: Morgan Kaufmann.
  44. White, A. P., & Liu, W. Z. (1994). Bias in information-based measures in decision tree induction. Machine Learning, 15, 321–329.

Copyright information

© Kluwer Academic Publishers 1997

Authors and Affiliations

  • J. Kent Martin
  1. Department of Information and Computer Science, University of California, Irvine, Irvine, CA
