Classification and Regression Trees (CART)

Statistical Learning from a Regression Perspective

Part of the book series: Springer Texts in Statistics (STS)

Abstract

In this chapter, we turn to recursive partitioning, which comes in various forms including decision trees and classification and regression trees (CART). We will see that the algorithmic machinery successively subsets the data. Trees are just a visualization of the data subsetting processes. We will also see that although recursive partitioning has too many problems to be an effective, stand-alone data analysis procedure, it is a crucial component of more powerful algorithms discussed in later chapters. It is important, therefore, to get into the details.

Notes

  1. There are extensions of CART that go beyond the generalized linear model. A parallel to multinomial logistic regression is one example. But for didactic purposes these extensions are addressed later.

  2. A good multinomial regression procedure in R is multinom in the library nnet. Technically, multinomial logistic regression is not a GLM, which is a good reason why it is not handled by the glm function in R.
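
     A minimal sketch, not from the text, of how multinom is typically called; the data frame mydata, the outcome crimetype, and the predictors are hypothetical placeholders.

        library(nnet)
        # Multinomial logistic regression for a categorical outcome with
        # more than two classes (hypothetical data frame and variables).
        fit <- multinom(crimetype ~ age + priors + gender, data = mydata)
        summary(fit)                        # one set of coefficients per non-reference class
        head(predict(fit, type = "probs"))  # fitted class probabilities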

  3. Some orders for splits of the same predictor cannot work. For example, age might be split first at 30 and then later at 18. The later split works for the partition of individuals age 30 or younger because that subset is likely to contain some individuals above and below 18. But the partition of those over 30 includes no one age 18 or younger, so a further split at 18 is impossible. In short, only some orders are feasible. It can be important to appreciate both which splits are not chosen and which splits cannot be chosen.

  4. It is easy to confuse the regression equation representation of terminal nodes with a model. It is not a model. It is a summary of how the terminal nodes as predictors are related to the response; it is a synopsis of the output from an algorithm, nothing more (or less).

  5. Among the many problems is that the step functions used to grow a tree are discontinuous.

  6. Thanks go to Thomas Cason who updated and improved the existing Titanic data frame using the Encyclopedia Titanica.

  7. The procedure rpart is authored by Terry Therneau and Beth Atkinson. The procedure rpart.plot is authored by Stephen Milborrow. Both procedures are superb. One can embed rpart in procedures from the caret library to gain access to a wide variety of handy (if a bit overwhelming) enhancements. A risk is that data analysis decisions, which should be informed in part by subject-matter expertise, become dominated by code-automated determinations. Subject-matter nonsense can result. Users should not automatically surrender control to what appear to be labor-saving devices. The caret library was written by Max Kuhn.
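
     A minimal sketch, not the chapter's own analysis, of the basic rpart and rpart.plot workflow and of wrapping rpart in caret; the data frame mydata and its variables are hypothetical placeholders.

        library(rpart)
        library(rpart.plot)
        # Grow a classification tree and plot it (hypothetical data).
        tree <- rpart(fail ~ age + priors + gender, data = mydata, method = "class")
        rpart.plot(tree)

        library(caret)
        # caret::train with method = "rpart" cross-validates over the cp tuning
        # parameter; subject-matter judgment should still govern the final choices.
        cv <- train(fail ~ age + priors + gender, data = mydata, method = "rpart",
                    trControl = trainControl(method = "cv", number = 10))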

  8. Here, the use of the class labels “success” and “failure” is arbitrary, so which off-diagonal cells contain “false positives” or “false negatives” is arbitrary as well. What is called a “success” in one study may be called a “failure” in another study.

  9. The use of I in Eq. (3.2) for impurity should not be confused with the use of I to represent an indicator variable. The different meanings should be clear in context.

  10. In statistics, cross-entropy is called the deviance.
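
     A small numerical sketch, not from the text, of two impurity measures for a binary node with class proportions p and 1 - p; the value of p is chosen only for illustration.

        p <- 0.30
        gini <- 2 * p * (1 - p)                                 # Gini index: 0.42
        cross_entropy <- -(p * log(p) + (1 - p) * log(1 - p))   # about 0.611
        # The cross-entropy is what statisticians call the deviance.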

  11. There is often a preference for a relatively small number of very homogeneous terminal nodes rather than a much larger number of terminal nodes that are not especially homogeneous. This is especially likely if the kinds of cases in the very homogeneous nodes have a special subject-matter or policy importance. An instance is inmates being considered for release on parole who fall into a terminal node in which most of the parolees are re-arrested for a crime of violence. These inmates may be few in number, but their crimes can be particularly harmful.

  12. Misconduct can range from minor offenses, such as failing to report for a work assignment, to offenses that would be serious felonies if committed outside of prison (e.g., aggravated assault). Because of privacy concerns, the data may not be shared.

  13. The tree diagram is formatted differently from the tree diagram used for the Titanic data to emphasize the terminal nodes.

  14. In R, the default left-to-right ordering of the values of a character variable is alphabetical.
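
     A small sketch, not from the text, showing the default alphabetical ordering and how explicit factor levels override it; the values are hypothetical.

        x <- c("yes", "no", "no", "yes")
        sort(unique(x))                      # "no" "yes" -- alphabetical by default
        factor(x, levels = c("yes", "no"))   # explicit levels override the default order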

  15. The issues are actually tricky and beyond the scope of this discussion. At intake, how an inmate will be placed and supervised is unknown, so placement and supervision are not relevant predictors. Yet, the training data may need to take placement and supervision into account.

  16. The overfitting may not be serious in this case because the number of observations is large and a small tree was grown.

  17. As Therneau and Atkinson (2015: section 3.3.2) state, “When altered priors are used, they affect only the choice of split. The ordinary losses and priors are used to compute the risk of the node. The altered priors simply help the impurity rule choose splits that are likely to be good in terms of the risk.” By “risk” they mean the expected costs from false negatives and false positives.

  18. The argument parms = list(prior = c(0.65,0.35)) sets the marginal distribution of a binary outcome to 0.65 and 0.35. A coded example follows shortly.
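
     As a placeholder for that example, a minimal sketch of the call, assuming a hypothetical data frame mydata with a binary outcome fail; the prior probabilities are supplied in the order of the outcome's factor levels.

        library(rpart)
        tree_prior <- rpart(fail ~ age + priors + gender, data = mydata,
                            method = "class",
                            parms = list(prior = c(0.65, 0.35)))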

  19. Very similar results could be obtained using minbucket rather than cp, even though they are defined very differently.
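
     A sketch of the two tuning routes, assuming a hypothetical data frame mydata; cp and minbucket are set through rpart.control.

        library(rpart)
        # cp: a split is attempted only if it improves the fit by a factor of at least cp.
        tree_cp  <- rpart(fail ~ ., data = mydata, method = "class",
                          control = rpart.control(cp = 0.01))
        # minbucket: every terminal node must contain at least this many observations.
        tree_min <- rpart(fail ~ ., data = mydata, method = "class",
                          control = rpart.control(minbucket = 50))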

  20. These are not options in rpart.

  21. These data cannot be shared.

  22. The function defaults to black-and-white output only. If you want color (or something else), you need to get into the source code. It’s not a big deal.

  23. The requirement of independence would need to be very carefully considered because some of the applicants may be from the same high schools and may have been influenced in similar ways by a teacher or guidance counselor. Fortunately, the independence assumptions can be relaxed somewhat, as mentioned briefly earlier (Kuchibhotla et al. 2018b). The main price is that a larger number of observations are likely to be necessary.

  24. One might wonder again why CART does not use Eq. (3.14) from the start when a tree is grown instead of some measure of node impurity. \(R_{cp}(T)\) would seem to have built in all of the end-user needs very directly. As mentioned earlier, the rationale for not using a function of classification errors as a fitting criterion is discussed in Breiman et al. (1984: section 4.1). As a technical matter, at any given node there can be no single best split. But perhaps a more important reason is that less satisfactory trees can result.
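
     For reference, the cost-complexity measure as defined in Therneau and Atkinson's rpart documentation, which may differ in notation from Eq. (3.14): \(R_{cp}(T) = R(T) + cp \times |T| \times R(T_1)\), where \(R(T)\) is the risk of tree \(T\), \(|T|\) is the number of splits, and \(T_1\) is the tree with no splits.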

  25. In a least squares regression setting, this is generally not a good idea because the covariance matrix may no longer be positive definite. And for many algorithmic procedures, this is not a feasible approach in any case.

  26. Consider a conventional regression with a single predictor. Random measurement error (i.e., IID with mean 0) will increase the variance of the predictor, which sits in the denominator of the slope expression. Asymptotically, \(\hat{\beta} = \frac{\beta}{1 + \sigma^2_\epsilon / \sigma^2_x}\), where \(\sigma^2_\epsilon\) is the variance of the measurement error, and \(\sigma^2_x\) is the variance of the predictor. The result is a bias toward 0.0. When there is more than one predictor, each with random measurement error, the regression coefficients can be biased toward 0.0 or away from 0.0.
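
     A small simulation sketch, not from the text, illustrating the attenuation; the sample size, slope, and error variances are chosen only for illustration.

        set.seed(123)
        n <- 100000
        x <- rnorm(n, sd = 1)           # true predictor, variance 1
        y <- 2 * x + rnorm(n)           # true slope beta = 2
        x_err <- x + rnorm(n, sd = 1)   # measurement error with variance 1
        coef(lm(y ~ x))[2]              # close to 2
        coef(lm(y ~ x_err))[2]          # close to 2 / (1 + 1/1) = 1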

  27. Data analysts should not re-apply CART to the test data, even if the tuning parameter values have been determined. The algorithm is incurably adaptive.

  28. These estimates of fitting performance correspond to neither estimates of generalization error nor expected prediction error, although they are in that spirit. One would need another test dataset to implement either definition.

References

  • Balasubramanian, V. N., Ho, S.-S., & Vovk, V. (2014). Conformal prediction for reliable machine learning. Amsterdam: Elsevier.

  • Berk, R. A. (2018). Machine learning forecasts of risk in criminal justice settings. New York: Springer.

  • Bhat, H. S., Kumer, N., & Vaz, G. (2011). Quantile regression trees. Working paper, School of Natural Sciences, University of California, Merced, CA.

  • Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth Press.

  • Chaudhuri, P., & Loh, W.-Y. (2002). Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8(5), 561–576.

  • Chaudhuri, P., Lo, W.-D., Loh, W.-Y., & Yang, C.-C. (1995). Generalized regression trees. Statistica Sinica, 5, 641–666.

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935–948.

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (1999). Hierarchical priors for Bayesian CART shrinkage. Statistics and Computing, 10(1), 17–24.

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, 4(1), 266–298.

  • Choi, Y., Ahn, H., & Chen, J. J. (2005). Regression trees for analysis of count data with extra Poisson variation. Computational Statistics & Data Analysis, 49, 893–915.

  • Grubinger, T., Zeileis, A., & Pfeiffer, K.-P. (2014). evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61(1). http://www.jstatsoft.org/

  • He, Y. (2006). Missing data imputation for tree-based models. PhD dissertation, Department of Statistics, UCLA.

  • Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

  • Ishwaran, H. (2015). The effect of splitting on random forests. Machine Learning, 99, 75–118.

  • Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2), 119–127.

  • Kuchibhotla, A. K., Brown, L. D., & Buja, A. (2018a). Model-free study of ordinary least squares linear regression. arXiv:1809.10538v1 [math.ST].

  • Kuchibhotla, A. K., Brown, L. D., Buja, A., George, E. I., & Zhao, L. (2018b). A model free perspective for linear regression: Uniform-in-model bounds for post selection inference. arXiv:1802.05801.

  • Lee, S. K. (2005). On generalized multivariate decision tree by using GEE. Computational Statistics & Data Analysis, 49, 1105–1119.

  • Little, R., & Rubin, D. (2019). The statistical analysis of missing data (3rd ed.). New York: John Wiley.

  • Loh, W.-Y. (2014). Fifty years of classification and regression trees (with discussion). International Statistical Review, 82(3), 329–348.

  • Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.

  • Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

  • Therneau, T. M., & Atkinson, E. J. (2015). An introduction to recursive partitioning using the RPART routines. Technical report, Mayo Foundation.

  • Wu, Y., Tjelmeland, H., & West, M. (2007). Bayesian CART: Prior specification and posterior simulation. Journal of Computational and Graphical Statistics, 16(1), 44–66.

  • Xiaogang, S., Tianni, Z., Xin, Y., Juanjuan, F., & Song, Y. (2008). Interaction trees with censored survival data. The International Journal of Biostatistics, 4(1), Article 2.

  • Zeileis, A., Hothorn, T., & Hornik, K. (2008). Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2), 492–514.

  • Zhang, H., & Singer, B. (1999). Recursive partitioning in the health sciences. New York: Springer.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this chapter

Berk, R.A. (2020). Classification and Regression Trees (CART). In: Statistical Learning from a Regression Perspective. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-40189-4_3
