Classification and Regression Trees (CART)

Statistical Learning from a Regression Perspective

Part of the book series: Springer Texts in Statistics (STS)

Abstract

In this chapter, we turn to recursive partitioning, which comes in various forms including decision trees and classification and regression trees (CART). We will see that the algorithmic machinery successively subsets the data. Trees are just a visualization of the data subsetting processes. We will also see that although recursive partitioning has too many problems to be an effective, stand-alone data analysis procedure, it is a crucial component of more powerful algorithms discussed in later chapters. It is important, therefore, to get into the details.

Notes

  1. There are extensions of CART that go beyond the generalized linear model. A parallel to multinomial logistic regression is one example. But for didactic purposes these extensions are addressed later.

  2. A good multinomial regression procedure in R is multinom in the library nnet. Technically, multinomial logistic regression is not a GLM, which is a good reason why it is not handled by the glm function in R.
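
     A minimal sketch, not from the text, of how multinom is typically called; the data frame mydata, the outcome crimetype, and the predictors are hypothetical placeholders.

        library(nnet)
        # Multinomial logistic regression for a categorical outcome with
        # more than two classes (hypothetical data frame and variables).
        fit <- multinom(crimetype ~ age + priors + gender, data = mydata)
        summary(fit)                        # one set of coefficients per non-reference class
        head(predict(fit, type = "probs"))  # fitted class probabilities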

  3. Some orders for splits of the same predictor cannot work. For example, age might be split first at 30 and then later at 18. The later split works for the partition of individuals age 30 or younger because that subset is likely to contain some individuals above and below 18. But the partition of those over 30 includes no one age 18 or younger, so a further split at 18 is impossible. In short, only some orders are feasible. It can be important to appreciate both which splits are not chosen and which splits cannot be chosen.

  4. It is easy to confuse the regression equation representation of terminal nodes with a model. It is not a model. It is a summary of how the terminal nodes as predictors are related to the response; it is a synopsis of the output from an algorithm, nothing more (or less).

  5. Among the many problems is that the step functions used to grow a tree are discontinuous.

  6. Thanks go to Thomas Cason who updated and improved the existing Titanic data frame using the Encyclopedia Titanica.

  7. The procedure rpart is authored by Terry Therneau and Beth Atkinson. The procedure rpart.plot is authored by Stephen Milborrow. Both procedures are superb. One can embed rpart in procedures from the caret library to gain access to a wide variety of handy (if a bit overwhelming) enhancements. A risk is that data analysis decisions, which should be informed in part by subject-matter expertise, become dominated by code-automated determinations. Subject-matter nonsense can result. Users should not automatically surrender control to what appear to be labor-saving devices. The caret library was written by Max Kuhn.
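
     A minimal sketch, not the chapter's own analysis, of the basic rpart and rpart.plot workflow and of wrapping rpart in caret; the data frame mydata and its variables are hypothetical placeholders.

        library(rpart)
        library(rpart.plot)
        # Grow a classification tree and plot it (hypothetical data).
        tree <- rpart(fail ~ age + priors + gender, data = mydata, method = "class")
        rpart.plot(tree)

        library(caret)
        # caret::train with method = "rpart" cross-validates over the cp tuning
        # parameter; subject-matter judgment should still govern the final choices.
        cv <- train(fail ~ age + priors + gender, data = mydata, method = "rpart",
                    trControl = trainControl(method = "cv", number = 10))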

  8. Here, the use of the class labels “success” and “failure” is arbitrary, so which off-diagonal cells contain “false positives” or “false negatives” is arbitrary as well. What is called a “success” in one study may be called a “failure” in another study.

  9. The use of I in Eq. (3.2) for impurity should not be confused with the use of I to represent an indicator variable. The different meanings should be clear in context.

  10. In statistics, cross-entropy is called the deviance.
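
     A small numerical sketch, not from the text, of two impurity measures for a binary node with class proportions p and 1 - p; the value of p is chosen only for illustration.

        p <- 0.30
        gini <- 2 * p * (1 - p)                                 # Gini index: 0.42
        cross_entropy <- -(p * log(p) + (1 - p) * log(1 - p))   # about 0.611
        # The cross-entropy is what statisticians call the deviance.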

  11. There is often a preference for a relatively small number of very homogeneous terminal nodes rather than a much larger number of terminal nodes that are not especially homogeneous. This is especially likely if the kinds of cases in the very homogeneous nodes have a special subject-matter or policy importance. An instance is inmates being considered for release on parole who fall into a terminal node in which most of the parolees are re-arrested for a crime of violence. These inmates may be few in number, but their crimes can be particularly harmful.

  12. Misconduct can range from minor offenses, such as failing to report for a work assignment, to offenses that would be serious felonies if committed outside of prison (e.g., aggravated assault). Because of privacy concerns, the data may not be shared.

  13. The tree diagram is formatted differently from the tree diagram used for the Titanic data to emphasize the terminal nodes.

  14. In R, the default left-to-right ordering of the values of a character variable is alphabetical.
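
     A small sketch, not from the text, showing the default alphabetical ordering and how explicit factor levels override it; the values are hypothetical.

        x <- c("yes", "no", "no", "yes")
        sort(unique(x))                      # "no" "yes" -- alphabetical by default
        factor(x, levels = c("yes", "no"))   # explicit levels override the default order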

  15. The issues are actually tricky and beyond the scope of this discussion. At intake, how an inmate will be placed and supervised is unknown, so placement and supervision are not relevant predictors. Yet, the training data may need to take placement and supervision into account.

  16. The overfitting may not be serious in this case because the number of observations is large and a small tree was grown.

  17. As Therneau and Atkinson (2015: section 3.3.2) state, “When altered priors are used, they affect only the choice of split. The ordinary losses and priors are used to compute the risk of the node. The altered priors simply help the impurity rule choose splits that are likely to be good in terms of the risk.” By “risk” they mean the expected costs from false negatives and false positives.

  18. The argument parms = list(prior = c(0.65,0.35)) sets the marginal distribution of a binary outcome to 0.65 and 0.35. A coded example follows shortly.
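
     As a placeholder for that example, a minimal sketch of the call, assuming a hypothetical data frame mydata with a binary outcome fail; the prior probabilities are supplied in the order of the outcome's factor levels.

        library(rpart)
        tree_prior <- rpart(fail ~ age + priors + gender, data = mydata,
                            method = "class",
                            parms = list(prior = c(0.65, 0.35)))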

  19. Very similar results could be obtained using minbucket rather than cp, even though they are defined very differently.
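
     A sketch of the two tuning routes, assuming a hypothetical data frame mydata; cp and minbucket are set through rpart.control.

        library(rpart)
        # cp: a split is attempted only if it improves the fit by a factor of at least cp.
        tree_cp  <- rpart(fail ~ ., data = mydata, method = "class",
                          control = rpart.control(cp = 0.01))
        # minbucket: every terminal node must contain at least this many observations.
        tree_min <- rpart(fail ~ ., data = mydata, method = "class",
                          control = rpart.control(minbucket = 50))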

  20. These are not options in rpart.

  21. These data cannot be shared.

  22. The function defaults to black-and-white output only. If you want color (or something else), you need to get into the source code. It’s not a big deal.

  23. The requirement of independence would need to be very carefully considered because some of the applicants may be from the same high schools and may have been influenced in similar ways by a teacher or guidance counselor. Fortunately, the independence assumptions can be relaxed somewhat, as mentioned briefly earlier (Kuchibhotla et al. 2018b). The main price is that a larger number of observations are likely to be necessary.

  24. One might wonder again why CART does not use Eq. (3.14) from the start when a tree is grown instead of some measure of node impurity. \(R_{cp}(T)\) would seem to have built in all of the end-user needs very directly. As mentioned earlier, the rationale for not using a function of classification errors as a fitting criterion is discussed in Breiman et al. (1984: section 4.1). As a technical matter, at any given node there can be no single best split. But perhaps a more important reason is that less satisfactory trees can result.
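
     For reference, the cost-complexity measure as defined in Therneau and Atkinson's rpart documentation, which may differ in notation from Eq. (3.14): \(R_{cp}(T) = R(T) + cp \times |T| \times R(T_1)\), where \(R(T)\) is the risk of tree \(T\), \(|T|\) is the number of splits, and \(T_1\) is the tree with no splits.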

  25. In a least squares regression setting, this is generally not a good idea because the covariance matrix may no longer be positive definite. And for many algorithmic procedures, this is not a feasible approach in any case.

  26. Consider a conventional regression with a single predictor. Random measurement error (i.e., IID with mean 0) will increase the variance of the predictor, which sits in the denominator of the slope expression. Asymptotically, \(\hat{\beta} = \frac{\beta}{1 + \sigma^2_\epsilon / \sigma^2_x}\), where \(\sigma^2_\epsilon\) is the variance of the measurement error, and \(\sigma^2_x\) is the variance of the predictor. The result is a bias toward 0.0. When there is more than one predictor, each with random measurement error, the regression coefficients can be biased toward 0.0 or away from 0.0.
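
     A small simulation sketch, not from the text, illustrating the attenuation; the sample size, slope, and error variances are chosen only for illustration.

        set.seed(123)
        n <- 100000
        x <- rnorm(n, sd = 1)           # true predictor, variance 1
        y <- 2 * x + rnorm(n)           # true slope beta = 2
        x_err <- x + rnorm(n, sd = 1)   # measurement error with variance 1
        coef(lm(y ~ x))[2]              # close to 2
        coef(lm(y ~ x_err))[2]          # close to 2 / (1 + 1/1) = 1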

  27. Data analysts should not re-apply CART to the test data, even if the tuning parameter values have been determined. The algorithm is incurably adaptive.

  28. These estimates of fitting performance correspond to neither estimates of generalization error nor expected prediction error, although they are in that spirit. One would need another test dataset to implement either definition.

References

  • Balasubramanian, V. N., Ho, S.-S., & Vovk, V. (2014). Conformal prediction for reliable machine learning. Amsterdam: Elsevier.

  • Berk, R. A. (2018). Machine learning forecasts of risk in criminal justice settings. New York: Springer.

  • Bhat, H. S., Kumer, N., & Vaz, G. (2011). Quantile regression trees. Working paper, School of Natural Sciences, University of California, Merced, CA.

  • Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth Press.

  • Chaudhuri, P., & Loh, W.-Y. (2002). Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8(5), 561–576.

  • Chaudhuri, P., Lo, W.-D., Loh, W.-Y., & Yang, C.-C. (1995). Generalized regression trees. Statistica Sinica, 5, 641–666.

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443), 935–948.

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (1999). Hierarchical priors for Bayesian CART shrinkage. Statistics and Computing, 10(1), 17–24.

  • Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, 4(1), 266–298.

  • Choi, Y., Ahn, H., & Chen, J. J. (2005). Regression trees for analysis of count data with extra Poisson variation. Computational Statistics & Data Analysis, 49, 893–915.

  • Grubinger, T., Zeileis, A., & Pfeiffer, K.-P. (2014). evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61(1). http://www.jstatsoft.org/

  • He, Y. (2006). Missing data imputation for tree-based models. PhD dissertation, Department of Statistics, UCLA.

  • Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

  • Ishwaran, H. (2015). The effect of splitting on random forests. Machine Learning, 99, 75–118.

  • Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2), 119–127.

  • Kuchibhotla, A. K., Brown, L. D., & Buja, A. (2018a). Model-free study of ordinary least squares linear regression. arXiv:1809.10538v1 [math.ST].

  • Kuchibhotla, A. K., Brown, L. D., Buja, A., George, E. I., & Zhao, L. (2018b). A model free perspective for linear regression: Uniform-in-model bounds for post selection inference. arXiv:1802.05801.

  • Lee, S. K. (2005). On generalized multivariate decision tree by using GEE. Computational Statistics & Data Analysis, 49, 1105–1119.

  • Little, R., & Rubin, D. (2019). The statistical analysis of missing data (3rd ed.). New York: John Wiley.

  • Loh, W.-Y. (2014). Fifty years of classification and regression trees (with discussion). International Statistical Review, 82(3), 329–348.

  • Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.

  • Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

  • Therneau, T. M., & Atkinson, E. J. (2015). An introduction to recursive partitioning using the RPART routines. Technical report, Mayo Foundation.

  • Wu, Y., Tjelmeland, H., & West, M. (2007). Bayesian CART: Prior specification and posterior simulation. Journal of Computational and Graphical Statistics, 16(1), 44–66.

  • Xiaogang, S., Tianni, Z., Xin, Y., Juanjuan, F., & Song, Y. (2008). Interaction trees with censored survival data. The International Journal of Biostatistics, 4(1), Article 2.

  • Zeileis, A., Hothorn, T., & Hornik, K. (2008). Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2), 492–514.

  • Zhang, H., & Singer, B. (1999). Recursive partitioning in the health sciences. New York: Springer.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this chapter

Berk, R.A. (2020). Classification and Regression Trees (CART). In: Statistical Learning from a Regression Perspective. Springer Texts in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-030-40189-4_3
