Abstract
ID3's information gain heuristic is well known to be biased towards multi-valued attributes. This bias is only partially compensated for by C4.5's gain ratio. Several alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). All of these metrics are biased towards splits with smaller branches, where low-entropy splits are likely to occur by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis that the class distribution is independent of the split. Both gain and the chi-squared tests arise in asymptotic approximations to the hypergeometric, with similar criteria for their admissibility. Previous failures of pre-pruning are traced in large part to coupling these biased approximations with one another or with arbitrary thresholds, problems that are overcome by the hypergeometric. The choice of split-selection metric typically has little effect on accuracy, but can profoundly affect complexity and the effectiveness and efficiency of pruning. Empirical results show that hypergeometric pre-pruning should be done in most cases, as trees pruned in this way are simpler and more efficient, and typically no less accurate than unpruned or post-pruned trees.
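The abstract's central quantity, the multiple hypergeometric probability of a split's class-by-branch contingency table under the null hypothesis that class is independent of the split, can be sketched in a few lines. The sketch below is illustrative only, not the paper's implementation: the example `split` table and the `info_gain` comparison function are assumptions added for clarity. With fixed row (branch) and column (class) totals, the probability of an observed table is (∏ row totals! · ∏ column totals!) / (N! · ∏ cell counts!).

```python
from math import factorial, log2

def table_probability(table):
    """Exact (multiple hypergeometric) probability of an observed
    class-by-branch contingency table, given its row and column totals,
    under the null hypothesis that class is independent of the split."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    numerator = 1
    for t in row_totals + col_totals:
        numerator *= factorial(t)
    denominator = factorial(n)
    for row in table:
        for cell in row:
            denominator *= factorial(cell)
    return numerator / denominator

def info_gain(table):
    """ID3-style information gain of the same split, for comparison."""
    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c)
    n = sum(sum(r) for r in table)
    class_totals = [sum(c) for c in zip(*table)]
    return entropy(class_totals) - sum(
        (sum(r) / n) * entropy(r) for r in table)

# Hypothetical split: rows are branches, columns are classes.
split = [[8, 2],   # branch 1: 8 of class A, 2 of class B
         [1, 9]]   # branch 2: 1 of class A, 9 of class B
p = table_probability(split)  # small p: unlikely under independence
g = info_gain(split)          # high gain for the same split
```

A very improbable table under independence (small `p`) corresponds to a split worth keeping; unlike gain, this exact probability does not systematically favor many-valued attributes or tiny branches.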
References
Agresti, A. (1990). Categorical data analysis. New York: John Wiley & Sons.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Pacific Grove, CA: Wadsworth & Brooks.
Buntine, W. L. (1990). A theory of learning classification rules. PhD thesis. University of Technology, Sydney.
Buntine, W. & Niblett, T. (1992). A further comparison of splitting rules for decision-tree induction. Machine Learning, 8, 75–85.
Cestnik, B., Kononenko, I. & Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. In Progress in Machine Learning, EWSL-87. Wilmslow: Sigma Press.
Cestnik, B. & Bratko, I. (1991). On estimating probabilities in tree pruning. In Machine Learning, EWSL-91. Berlin: Springer-Verlag.
Cochran, W. G. (1954). Some methods for strengthening the common χ² tests. Biometrics, 10, 417–451.
Elder, J. F. (1995). Heuristic search for model structure. In Fisher, D. & Lenz, H-J. (Eds.) Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics, v. 112 (pp. 131-142). New York: Springer.
Fayyad, U. M., & Irani, K. B. (1992a). The attribute selection problem in decision tree generation. Proceedings of the 10th National Conference on Artificial Intelligence, AAAI-92 (pp. 104–110). Cambridge, MA: MIT Press.
Fayyad, U. M., & Irani, K. B. (1992b). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87-102.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93) (pp. 1022-1027). San Mateo, CA: Morgan Kaufmann.
Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2, 139-172.
Fisher, D. H. (1992). Pessimistic and optimistic induction. Technical Report CS-92-22, Department of Computer Science, Vanderbilt University, Nashville, TN.
Fisher, D. H., & Schlimmer, J. C. (1988). Concept simplification and prediction accuracy. Proceedings of the 5th International Conference on Machine Learning (ML-88) (pp. 22–28). San Mateo, CA: Morgan Kaufmann.
Fulton, T., Kasif, S., & Salzberg, S. (1995). Efficient algorithms for finding multi-way splits for decision trees. Machine Learning: Proceedings of the 12th International Conference (ML-95) (pp. 244-251). San Francisco: Morgan Kaufmann.
Gluck, M. A., & Corter, J. E. (1985). Information, uncertainty, and the utility of categories. Proceedings of the 7th Annual Conference of the Cognitive Science Society (pp. 283–287). Hillsdale, NJ: Lawrence Erlbaum.
John, G. H. (1995). Robust linear discriminant trees. In Fisher, D. & Lenz, H-J. (Eds.) Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics, v. 112 (pp. 375-386). New York: Springer.
Kira, K. & Rendell, L. A. (1992). A practical approach to feature selection. Machine Learning: Proceedings of the 9th International Conference (ML-92) (pp. 249-256). San Mateo, CA: Morgan Kaufmann.
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. Proceedings of the European Conference on Machine Learning (ECML-94), (pp. 171-182). Berlin: Springer.
Liu, W. Z., & White, A. P. (1994). The importance of attribute-selection measures in decision tree induction. Machine Learning, 15, 25–41.
López de Mántaras, R. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6, 81–92.
Martin, J. K. (1995). An exact probability metric for decision tree splitting and stopping. Technical Report 95-16, Department of Information & Computer Science, University of California, Irvine, CA.
Martin, J. K. & Hirschberg, D. S. (1996a). On the complexity of learning decision trees. Proceedings Fourth International Symposium on Artificial Intelligence and Mathematics, AI/MATH-96 (pp. 112-115). Fort Lauderdale, FL.
Martin, J. K. & Hirschberg, D. S. (1996b). Small sample statistics for classification error rates I: Error rate measurements. Technical Report 96-21, Department of Information & Computer Science, University of California, Irvine, CA.
Martin, J. K. & Hirschberg, D. S. (1996c). Small sample statistics for classification error rates II: Confidence intervals and significance tests. Technical Report 96-22, Department of Information & Computer Science, University of California, Irvine, CA.
Mingers, J. (1987). Expert systems — rule induction with statistical data. Journal of the Operational Research Society, 38, 39–47.
Mingers, J. (1989a). An empirical comparison of pruning measures for decision tree induction. Machine Learning, 4, 227–243.
Mingers, J. (1989b). An empirical comparison of selection measures for decision tree induction. Machine Learning, 3, 319–342.
Murphy, P. M., & Aha, D. W. (1995). UCI Repository of Machine Learning Databases. (machine-readable data depository). Department of Information & Computer Science, University of California, Irvine, CA.
Murphy, P. M. & Pazzani, M. J. (1991). ID2-of-3: Constructive induction of M-of-N concepts for discriminators in decision trees. Machine Learning: Proceedings of the 8th International Workshop (ML-91) (pp. 183–187). San Mateo, CA: Morgan Kaufmann.
Murthy, S., Kasif, S., Salzberg, S., & Beigel, R. (1993). OC1: Randomized induction of oblique decision trees. Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93) (pp. 322–327). Menlo Park, CA: AAAI Press.
Murthy, S. & Salzberg, S. (1995). Lookahead and pathology in decision tree induction. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95) (pp. 1025–1031). San Mateo, CA: Morgan Kaufmann.
Niblett, T. (1987). Constructing decision trees in noisy domains. In Progress in Machine Learning, EWSL-87. Wilmslow: Sigma Press.
Niblett, T., & Bratko, I. (1986). Learning decision rules in noisy domains. In Proceedings of Expert Systems 86. Cambridge: Cambridge University Press.
Park, Y. & Sklansky, J. (1990). Automated design of linear tree classifiers. Pattern Recognition, 23, 1393-1412.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1988). Simplifying decision trees. In B. R. Gaines & J. H. Boose (Eds.). Knowledge Acquisition for Knowledge-Based Systems. San Diego: Academic Press.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. & Cameron-Jones, R.M. (1995). Oversearching and layered search in empirical learning. Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95) (pp. 1019-1024). San Mateo, CA: Morgan Kaufmann.
Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning, 10, 153–178.
Shavlik, J.W., Mooney, R. J., & Towell, G. G. (1991). Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6, 111-143.
Weiss, S. M. & Indurkhya, N. (1991). Reduced complexity rule induction. Proceedings of the 12th International Joint Conference on Artificial Intelligence, IJCAI-91 (pp. 678–684). San Mateo, CA: Morgan Kaufmann.
Weiss, S. M. & Indurkhya, N. (1994). Small sample decision tree pruning. Proceedings of the 11th International Conference on Machine Learning (ML-94) (pp. 335–342). San Francisco: Morgan Kaufmann.
White, A. P. & Liu, W. Z. (1994). Bias in information-based measures in decision tree induction. Machine Learning, 15, 321–329.
Martin, J.K. An Exact Probability Metric for Decision Tree Splitting and Stopping. Machine Learning 28, 257–291 (1997). https://doi.org/10.1023/A:1007367629006