Algorithms for learning classification trees have had successes in artificial intelligence and statistics over many years. This paper outlines how a tree learning algorithm can be derived using Bayesian statistics. This introduces Bayesian techniques for splitting, smoothing, and tree averaging. The splitting rule is similar to Quinlan's information gain, while smoothing and averaging replace pruning. Comparative experiments with reimplementations of a minimum encoding approach,c4 (Quinlanet al., 1987) andcart (Breimanet al., 1984), show that the full Bayesian algorithm can produce more accurate predictions than versions of these other approaches, though pays a computational price.
This is a preview of subscription content, access via your institution.
Buy single article
Instant access to the full article PDF.
Price excludes VAT (USA)
Tax calculation will be finalised during checkout.
Bahl, L., Brown, P., de Souza, P. and Mercer, R. (1989) A tree-based language model for natural language speech recognition.IEEE Transactions on Acoustics, Speech and Signal Processing,37, 1001–1008.
Berger, J. O. (1985)Statistical Decision Theory and Bayesian Analysis, Springer-Verlag, New York.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984)Classification and Regression Trees, Wadsworth, Belmont.
Buntine, W. (1991a) Some experiments with learning classification trees. Technical report, NASA Ames Research Center. In preparation.
Buntine, W. (1991b) A theory of learning classification rules. PhD thesis. University of Technology, Sydney.
Buntine, W. and Caruana, R. (1991) Introduction to IND and recursive partitioning. Technical Report FIA-91-28, RIACS and NASA Ames Research Center, Moffett Field, CA.
Buntine, W. and Weigend, A. (1991) Bayesian back-propagation.Complex Systems,5, 603–643.
Carter, C. and Catlett, J. (1987) Assessing credit card applications using machine learning.IEEE Expert,2, 71–79.
Catlett, J. (1991) Megainduction: machine learning on very large databases. PhD thesis, University of Sydney.
Cestnik, B., Kononeko, I. and Bratko, I. (1987) ASSISTANT86: A knowledge-elicitation tool for sophisticated users, inProgress in Machine Learning: Proceedings of EWSL-87, Bratko, I. and Lavrač, N. (eds), Sigma Press, Wilmslow, pp. 31–45.
Chou, P. (1991) Optimal partitioning for classification and regression trees.IEEE Transactions on Pattern Analysis and Machine Intelligence,13.
Clark, P. and Niblett, T. (1989) The CN2 induction algorithm.Machine Learning,3, 261–283.
Crawford, S. (1989) Extensions to the CART algorithm.International Journal of Man-Machine Studies,31, 197–217.
Henrion, M. (1990) Towards efficient inference in multiply connected belief networks, inInfluence Diagrams, Belief Nete and Decision Analysis, Oliver, R. and Smith, J. (eds), Wiley, New York, pp. 385–407.
Kwok, S. and Carter, C. (1990) Multiple decision trees, inUncertainty in Artificial Intelligence 4, Schachter, R., Levitt, T., Kanal, L. and Lemmer, J. (eds), North-Holland, Amsterdam.
Lee, P. (1989)Bayesian Statistics: An Introduction, Oxford University Press, New York.
Michie, D., Bain, M. and Hayes-Michie, J. (1990) Cognitive models from subcognitive skills, inKnowledge-based Systems for Industrial Control, McGhee, J., Grimble, M. and Mowforth, P. (eds), Stevenage: Peter Peregrinus.
Mingers, J. (1989a) An empirical comparison of pruning methods for decision-tree induction.Machine Learning,4, 227–243.
Mingers, J. (1989b) An empirical comparison of selection measures for decision-tree induction.Machine Learning,3, 319–342.
Pagallo, G. and Haussler, D. (1990) Boolean feature discovery in empirical learning.Machine Learning,5, 71–99.
Press, S. (1989)Bayesian Statistics, Wiley, New York.
Quinlan, J. (1986) Induction of decision trees.Machine Learning,1, 81–106.
Quinlan, J. (1988) Simplifying decision trees, inKnowledge Acquisition for Knowledge-Based Systems, Gaines, B. and Boose, J. (eds), Academic Press, London, pp. 239–252.
Quinlan, J., Compton, P., Horn, K. and Lazarus, L. (1987) Inductive knowledge acquisitions: A case study, inApplications of Expert Systems, Quinlan, J. (ed.). Addison-Wesley, London.
Quinlan, J. and Rivest, R. (1989) Inferring decision trees using the minimum description length principle.Information and Computation,80, 227–248.
Ripley, B. (1987) An introduction to statistical pattern recognition, inInteractions in Artificial Intelligence and Statistical Methods, Unicom, Gower Technical Press, Aldershot, pp. 176–187.
Rissanen, J. (1989)Stochastic Complexity in Statistical Enquiry, World Scientific, Section 7.2.
Rodriguez, C. (1990) Objective Bayesianism and geometry, inMaximum Entropy and Bayesian Methods, Fougère, P. (ed.), Kluwer, Dordrecht.
Stewart, L. (1987). Hierarchical Bayesian analysis using Monte Carlo integration: computing posterior distributions when there are many possible models.The Statistician,36, 211–219.
Utgoff, P. (1989). Incremental induction of decision trees.Machine Learning,4, 161–186.
Wallace, C. and Patrick, J. (1991). Coding decision trees. Technical Report 151, Monash University, Melbourne, submitted toMachine Learning.
Weiss, S., Galen, R. and Tadepalli, P. (1990) Maximizing the predictive value of production rules.Artificial Intelligence,45, 47–71.
Rights and permissions
About this article
Cite this article
Buntine, W. Learning classification trees. Stat Comput 2, 63–73 (1992). https://doi.org/10.1007/BF01889584
- Classification trees
- Bayesian statistics