Tree-Based Methods and Their Applications


The first part of this chapter introduces the basic structure of tree-based methods using two examples. In the first, a classification tree uses e-mail text characteristics to identify spam; in the second, a regression tree estimates the structural cost of seismic rehabilitation for various types of buildings. Our main focus in this section is the interpretive value of the resulting models.
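
As a concrete illustration of the kind of model the chapter builds, here is a minimal sketch that fits and prints a small classification tree with scikit-learn. The chapter's spam and seismic-rehabilitation data sets are not bundled with scikit-learn, so the built-in breast-cancer data stands in as a generic binary classification problem; the point is the readable if/else structure of the fitted tree, not the particular data.

```python
# Minimal sketch: fit a shallow classification tree and print it as rules.
# The bundled breast-cancer data stands in for the chapter's spam example,
# which is not shipped with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A shallow tree keeps the model easy to read, which is the point here.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The fitted tree as nested if/else rules -- its interpretive value --
# plus a holdout estimate of predictive accuracy.
print(export_text(tree, feature_names=list(X.columns)))
print("test accuracy:", round(tree.score(X_test, y_test), 3))
```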

This brief introduction is followed by a more detailed look at how these tree models are constructed. In the second section, we describe the algorithm employed by CART (classification and regression trees), a popular commercial software program for constructing trees for both classification and regression problems. In each case, we outline the processes of growing and pruning trees and discuss the available options. The section concludes with a discussion of practical issues, including estimating a tree's predictive ability, handling missing data, assessing variable importance, and considering the effects of changes to the learning sample.
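
A hedged sketch of the grow-then-prune workflow described above, using scikit-learn's tree implementation as a stand-in for the CART program: the full tree's cost-complexity (weakest-link) pruning sequence is recovered, and cross-validation estimates each pruned subtree's predictive ability so one can be chosen.

```python
# Sketch of CART-style cost-complexity ("weakest-link") pruning, with
# scikit-learn's DecisionTreeClassifier standing in for the CART program.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a large tree, then recover the nested sequence of pruned subtrees
# indexed by the complexity parameter alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate each subtree's predictive ability by ten-fold cross-validation
# and keep the alpha with the best estimated accuracy.
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=10
    ).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("chosen alpha:", best_alpha, "leaves:", pruned.get_n_leaves())
```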

The third section presents several alternatives to the algorithms used by CART. We begin with a class of algorithms, including QUEST, CRUISE, and GUIDE, designed to reduce the potential bias toward variables that offer large numbers of candidate splitting values. Next, we explore C4.5, another program popular in the artificial-intelligence and machine-learning communities. C4.5 offers the added functionality of converting any tree into a series of decision rules, providing an alternative means of viewing and interpreting its results. Finally, we discuss chi-square automatic interaction detection (CHAID), an early classification-tree construction algorithm for use with categorical predictors. The section concludes with a brief comparison of the characteristics of CART and each of these alternative algorithms.
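
The central idea behind the QUEST/CRUISE/GUIDE family is to separate variable selection from split-point selection: each candidate predictor is first scored with a statistical test of association with the response, and only the winning variable is then searched for a split point, which removes the bias toward variables with many distinct values. The toy sketch below illustrates that selection step for numeric predictors with a one-way ANOVA F-test; it illustrates the idea only and is not the published algorithms.

```python
# Toy illustration of test-based split-variable selection (the idea behind
# QUEST, CRUISE, and GUIDE): score each numeric predictor with a one-way
# ANOVA F-test against the class labels and choose the variable with the
# smallest p-value before searching for a split point on it.
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

def anova_p_value(column, labels):
    """P-value of a one-way ANOVA of one predictor across the class groups."""
    groups = [column[labels == c] for c in np.unique(labels)]
    return f_oneway(*groups).pvalue

p_values = {name: anova_p_value(X[name], y) for name in X.columns}
split_variable = min(p_values, key=p_values.get)
print("variable selected for splitting:", split_variable)
```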

In the fourth section, we discuss the use of ensemble methods for improving predictive ability. Ensemble methods generate collections of trees from different subsets of the training data, and final predictions are obtained by aggregating the predictions of the individual trees in the collection. The first ensemble method we consider is boosting, a sequential method that generates small trees, each of which specializes in predicting the cases its predecessors handle poorly. Next, we explore random forests, which generate collections of trees using bootstrap sampling procedures. We also comment on the tradeoff between the predictive power of ensemble methods and the interpretive value of their single-tree counterparts.
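
A hedged sketch of the two ensemble ideas, using scikit-learn's implementations as stand-ins: AdaBoost (whose default base learner is a one-split tree, or stump) builds its trees sequentially, reweighting the cases earlier trees handled poorly, while a random forest grows many trees on bootstrap samples and aggregates their votes. Cross-validated accuracies next to a single tree make the accuracy-versus-interpretability tradeoff visible.

```python
# Sketch comparing a single tree with the two ensembles discussed above.
# AdaBoostClassifier's default base learner is a depth-one tree (a stump);
# RandomForestClassifier grows trees on bootstrap samples and aggregates votes.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "boosted stumps": AdaBoostClassifier(n_estimators=200, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# The ensembles typically gain accuracy but lose the single tree's readability.
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=10).mean(), 3))
```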

The chapter concludes with a discussion of tree-based methods in the broader context of supervised learning techniques. In particular, we compare classification and regression trees to multivariate adaptive regression splines, neural networks, and support vector machines.
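
For a rough sense of how such a comparison might look in code, the sketch below cross-validates a single classification tree against a support vector machine and a small neural network on one data set; MARS is omitted because scikit-learn has no implementation of it. The SVM and the network are given standardized inputs, while the tree is not, since trees are invariant to monotone transformations of individual predictors.

```python
# Hedged sketch of a cross-validated comparison between a classification tree,
# a support vector machine, and a small neural network. MARS is omitted here
# because scikit-learn does not provide an implementation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "classification tree": DecisionTreeClassifier(random_state=0),
    # The SVM and the network need standardized inputs; the tree does not.
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
    "neural network": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=10).mean(), 3))
```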


Keywords: Random Forest, Regression Tree, Terminal Node, Classification Rule, Multivariate Adaptive Regression Spline



Abbreviations

CART: classification and regression tree
CRUISE: classification rule with unbiased interaction selection and estimation
CVP: critical value pruning
EBP: error-based pruning
GUIDE: generalized, unbiased interaction detection and estimation
LDA: linear discriminant analysis
MARS: multivariate adaptive regression splines
MART: multiple additive regression tree
MEP: minimum error pruning
MSE: mean square errors
PEP: pessimistic error pruning
QUEST: quick, unbiased and efficient statistical tree
REP: reduced error pruning
RF: random forest
SVM: support vector machine
iid: independent identically distributed



Copyright information

© Springer-Verlag 2006

Authors and Affiliations

1. Department of Mathematics, Washington University in Saint Louis, St. Louis, USA
2. Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, USA
3. Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, USA
