# An Exact Probability Metric for Decision Tree Splitting and Stopping


## Abstract

ID3's information gain heuristic is well known to be biased towards multi-valued attributes. This bias is only partially compensated for by C4.5's gain ratio. Several alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). All of these metrics are biased towards splits with smaller branches, where low-entropy splits are likely to occur by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis that the class distribution is independent of the split. Both gain and the chi-squared tests arise in asymptotic approximations to the hypergeometric, with similar criteria for their admissibility. Previous failures of pre-pruning are traced in large part to coupling these biased approximations with one another or with arbitrary thresholds; these problems are overcome by the hypergeometric. The choice of split-selection metric typically has little effect on accuracy, but can profoundly affect complexity and the effectiveness and efficiency of pruning. Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.
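To make the central quantity concrete: under the null hypothesis that class labels are independent of the split, the probability of observing a particular split contingency table with its row and column totals fixed is given by the multiple hypergeometric distribution, P = (∏ rᵢ! ∏ cⱼ!) / (N! ∏ nᵢⱼ!). The sketch below (function name and example table are ours, not from the paper) computes this exact table probability; note it is the probability of one specific table, not a tail p-value, which would sum over tables at least as extreme:

```python
from math import factorial, prod

def hypergeom_table_prob(table):
    """Exact probability of a specific contingency table under the null
    of class/split independence, with all margins held fixed (the
    multiple hypergeometric distribution).

    table: list of rows, e.g. [[n11, n12], [n21, n22]] where rows are
    split branches and columns are classes (or vice versa; the formula
    is symmetric in rows and columns).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # P = (prod r_i! * prod c_j!) / (N! * prod n_ij!)
    numerator = (prod(factorial(r) for r in row_totals)
                 * prod(factorial(c) for c in col_totals))
    denominator = (factorial(n)
                   * prod(factorial(x) for row in table for x in row))
    return numerator / denominator

# A hypothetical 2-branch, 2-class split of 8 examples:
# branch 1 holds 3 of class A and 1 of class B, branch 2 the reverse.
p = hypergeom_table_prob([[3, 1], [1, 3]])
print(p)  # same as C(4,3)*C(4,1)/C(8,4) = 16/70
```

In the 2×2 case this reduces to the ordinary hypergeometric term used in Fisher's exact test; the multi-row, multi-column form is what makes the metric applicable to multi-valued attributes without the small-branch bias the abstract describes.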

## References

- Agresti, A. (1990). *Categorical data analysis*. New York: John Wiley & Sons.
- Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). *Classification and regression trees*. Pacific Grove, CA: Wadsworth & Brooks.
- Buntine, W. L. (1990). *A theory of learning classification rules*. PhD thesis, University of Technology, Sydney.
- Buntine, W. & Niblett, T. (1992). A further comparison of splitting rules for decision-tree induction. *Machine Learning, 8*, 75–85.
- Cestnik, B., Kononenko, I., & Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. In *Progress in Machine Learning, EWSL-87*. Wilmslow: Sigma Press.
- Cestnik, B. & Bratko, I. (1991). On estimating probabilities in tree pruning. In *Machine Learning, EWSL-91*. Berlin: Springer-Verlag.
- Cochran, W. G. (1952). Some methods of strengthening the common χ² tests. *Biometrics, 10*, 417–451.
- Elder, J. F. (1995). Heuristic search for model structure. In Fisher, D. & Lenz, H.-J. (Eds.), *Learning from Data: Artificial Intelligence and Statistics V*, *Lecture Notes in Statistics*, v. 112 (pp. 131–142). New York: Springer.
- Fayyad, U. M., & Irani, K. B. (1992a). The attribute selection problem in decision tree generation. *Proceedings of the 10th National Conference on Artificial Intelligence (AAAI-92)* (pp. 104–110). Cambridge, MA: MIT Press.
- Fayyad, U. M., & Irani, K. B. (1992b). On the handling of continuous-valued attributes in decision tree generation. *Machine Learning, 8*, 87–102.
- Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. *Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93)* (pp. 1022–1027). San Mateo, CA: Morgan Kaufmann.
- Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. *Machine Learning, 2*, 139–172.
- Fisher, D. H. (1992). Pessimistic and optimistic induction. Technical Report CS-92-22, Department of Computer Science, Vanderbilt University, Nashville, TN.
- Fisher, D. H., & Schlimmer, J. C. (1988). Concept simplification and prediction accuracy. *Proceedings of the 5th International Conference on Machine Learning (ML-88)* (pp. 22–28). San Mateo, CA: Morgan Kaufmann.
- Fulton, T., Kasif, S., & Salzberg, S. (1995). Efficient algorithms for finding multi-way splits for decision trees. *Machine Learning: Proceedings of the 12th International Conference (ML-95)* (pp. 244–251). San Francisco: Morgan Kaufmann.
- Gluck, M. A., & Corter, J. E. (1985). Information, uncertainty, and the utility of categories. *Proceedings of the 7th Annual Conference of the Cognitive Science Society* (pp. 283–287). Hillsdale, NJ: Lawrence Erlbaum.
- John, G. H. (1995). Robust linear discriminant trees. In Fisher, D. & Lenz, H.-J. (Eds.), *Learning from Data: Artificial Intelligence and Statistics V*, *Lecture Notes in Statistics*, v. 112 (pp. 375–386). New York: Springer.
- Kira, K. & Rendell, L. A. (1992). A practical approach to feature selection. *Machine Learning: Proceedings of the 9th International Conference (ML-92)* (pp. 249–256). San Mateo, CA: Morgan Kaufmann.
- Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. *Proceedings of the European Conference on Machine Learning (ECML-94)* (pp. 171–182). Berlin: Springer.
- Liu, W. Z., & White, A. P. (1994). The importance of attribute-selection measures in decision tree induction. *Machine Learning, 15*, 25–41.
- Lopez de Mantaras, R. (1991). A distance-based attribute selection measure for decision tree induction. *Machine Learning, 6*, 81–92.
- Martin, J. K. (1995). An exact probability metric for decision tree splitting and stopping. Technical Report 95-16, Department of Information & Computer Science, University of California, Irvine, CA.
- Martin, J. K. & Hirschberg, D. S. (1996a). On the complexity of learning decision trees. *Proceedings of the Fourth International Symposium on Artificial Intelligence and Mathematics (AI/MATH-96)* (pp. 112–115). Fort Lauderdale, FL.
- Martin, J. K. & Hirschberg, D. S. (1996b). Small sample statistics for classification error rates I: Error rate measurements. Technical Report 96-21, Department of Information & Computer Science, University of California, Irvine, CA.
- Martin, J. K. & Hirschberg, D. S. (1996c). Small sample statistics for classification error rates II: Confidence intervals and significance tests. Technical Report 96-22, Department of Information & Computer Science, University of California, Irvine, CA.
- Mingers, J. (1987). Expert systems — rule induction with statistical data. *Journal of the Operational Research Society, 38*, 39–47.
- Mingers, J. (1989a). An empirical comparison of pruning measures for decision tree induction. *Machine Learning, 4*, 227–243.
- Mingers, J. (1989b). An empirical comparison of selection measures for decision tree induction. *Machine Learning, 3*, 319–342.
- Murphy, P. M., & Aha, D. W. (1995). *UCI Repository of Machine Learning Databases* (machine-readable data repository). Department of Information & Computer Science, University of California, Irvine, CA.
- Murphy, P. M. & Pazzani, M. J. (1991). ID2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees. *Machine Learning: Proceedings of the 8th International Workshop (ML-91)* (pp. 183–187). San Mateo, CA: Morgan Kaufmann.
- Murthy, S., Kasif, S., Salzberg, S., & Beigel, R. (1993). OC-1: Randomized induction of oblique decision trees. *Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93)* (pp. 322–327). Menlo Park, CA: AAAI Press.
- Murthy, S. & Salzberg, S. (1995). Lookahead and pathology in decision tree induction. *Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95)* (pp. 1025–1031). San Mateo, CA: Morgan Kaufmann.
- Niblett, T. (1987). Constructing decision trees in noisy domains. In *Progress in Machine Learning, EWSL-87*. Wilmslow: Sigma Press.
- Niblett, T., & Bratko, I. (1986). Learning decision rules in noisy domains. In *Proceedings of Expert Systems 86*. Cambridge: Cambridge University Press.
- Park, Y. & Sklansky, J. (1990). Automated design of linear tree classifiers. *Pattern Recognition, 23*, 1393–1412.
- Quinlan, J. R. (1986). Induction of decision trees. *Machine Learning, 1*, 81–106.
- Quinlan, J. R. (1988). Simplifying decision trees. In B. R. Gaines & J. H. Boose (Eds.), *Knowledge Acquisition for Knowledge-Based Systems*. San Diego: Academic Press.
- Quinlan, J. R. (1993). *C4.5: Programs for Machine Learning*. San Mateo, CA: Morgan Kaufmann.
- Quinlan, J. R. & Cameron-Jones, R. M. (1995). Oversearching and layered search in empirical learning. *Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95)* (pp. 1019–1024). San Mateo, CA: Morgan Kaufmann.
- Schaffer, C. (1993). Overfitting avoidance as bias. *Machine Learning, 10*, 153–178.
- Shavlik, J. W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning algorithms: An experimental comparison. *Machine Learning, 6*, 111–143.
- Weiss, S. M. & Indurkhya, N. (1991). Reduced complexity rule induction. *Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI-91)* (pp. 678–684). San Mateo, CA: Morgan Kaufmann.
- Weiss, S. M. & Indurkhya, N. (1994). Small sample decision tree pruning. *Proceedings of the 11th International Conference on Machine Learning (ML-94)* (pp. 335–342). San Francisco: Morgan Kaufmann.
- White, A. P. & Liu, W. Z. (1994). Bias in information-based measures in decision tree induction. *Machine Learning, 15*, 321–329.