Abstract
Tree-based analyses of statistical learning problems often yield multiple decision trees, which together constitute a forest. Reasons for this include examining tree instability, improving prediction accuracy, accounting for missingness in the data, and taking into account multiple outcome variables. A key disadvantage of forests, in contrast to individual decision trees, is their lack of transparency. An obvious question, therefore, is whether it is possible to recover some of the insightfulness of individual trees from a forest. In this paper, we propose a conceptual framework and methodology to do so by reducing a forest to one or a small number of summary trees, which may be used to gain insight into the central tendency as well as the heterogeneity of the forest. This is done by clustering the trees in the forest based on similarities between them. By means of simulated data, we demonstrate how and why different similarity types in the proposed methodology may lead to markedly different conclusions, and we explain when and why certain approaches may be recommended over others. We finally illustrate the methodology with an empirical data set on the prediction of cocaine use on the basis of personality characteristics.
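As a rough illustration of the workflow described in the abstract (grow a forest, compute pairwise similarities between its trees, cluster the trees, and extract one representative tree per cluster), the following minimal Python sketch uses scikit-learn (version 1.2 or later is assumed). It is not the C443 implementation: the similarity type used here, namely agreement of the trees' predictions on the observed data, as well as the clustering method and the number of clusters, are purely illustrative choices.

```python
# Minimal sketch: cluster the trees of a bagged forest on a pairwise
# similarity and keep the medoid tree of each cluster as a "summary tree".
# All choices below (similarity type, linkage, number of clusters) are
# illustrative, not those of the C443 methodology.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import AgglomerativeClustering

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Grow a forest of decision trees via bagging.
forest = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                           n_estimators=50, random_state=1).fit(X, y)
trees = forest.estimators_

# One possible similarity type: the proportion of observations on which
# two trees make the same prediction.
preds = np.array([t.predict(X) for t in trees])           # (n_trees, n_obs)
sim = (preds[:, None, :] == preds[None, :, :]).mean(-1)   # (n_trees, n_trees)
dist = 1.0 - sim

# Cluster the trees on the resulting dissimilarities and report the medoid
# (the tree with the smallest total distance to its cluster mates).
labels = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                 linkage="average").fit_predict(dist)
for k in range(3):
    idx = np.where(labels == k)[0]
    medoid = idx[np.argmin(dist[np.ix_(idx, idx)].sum(axis=1))]
    print(f"cluster {k}: {len(idx)} trees, summary tree = estimator {medoid}")
```

In practice the choice of similarity type matters: as the abstract stresses, different types can lead to markedly different clusterings and hence to different summary trees, and the number of clusters would typically be selected with a criterion such as the silhouette width (Rousseeuw 1987) rather than fixed in advance.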
Notes
1. For some types of similarity that do not take logical equivalence into account, it is possible to rectify this by performing a manipulation on the trees under study. In the Supplementary Materials, we explain how this can be done exactly and introduce a similarity measure that can be used for this purpose (see Equation 15 of Supplementary Materials Section 2.2).
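Purely as an illustration of a similarity type that does not take logical equivalence into account (and not the measure of Equation 15 in the Supplementary Materials), the sketch below compares two fitted scikit-learn trees node by node: two trees that induce the same partition of the predictor space but apply their splits in a different order would receive a low score, which is the kind of situation the rectifying manipulation mentioned above is meant to address.

```python
# Toy structural similarity between two fitted sklearn decision trees,
# obtained by comparing splitting variables at corresponding node positions.
# This measure ignores logical equivalence: reordering the splits of a tree
# changes its score even when the induced partition stays the same.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def node_agreement(tree_a, tree_b, a=0, b=0):
    """Return (#node positions compared, #positions that agree)."""
    ta, tb = tree_a.tree_, tree_b.tree_
    leaf_a, leaf_b = ta.feature[a] < 0, tb.feature[b] < 0   # leaves have feature -2
    if leaf_a or leaf_b:
        return 1, int(leaf_a == leaf_b)
    compared, agreed = 1, int(ta.feature[a] == tb.feature[b])
    for ca, cb in ((ta.children_left[a], tb.children_left[b]),
                   (ta.children_right[a], tb.children_right[b])):
        c, g = node_agreement(tree_a, tree_b, ca, cb)
        compared, agreed = compared + c, agreed + g
    return compared, agreed

def structural_similarity(tree_a, tree_b):
    compared, agreed = node_agreement(tree_a, tree_b)
    return agreed / compared

# Example: two trees grown on different bootstrap samples of the same data.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
i1, i2 = rng.integers(0, 200, 200), rng.integers(0, 200, 200)
t1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[i1], y[i1])
t2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[i2], y[i2])
print(structural_similarity(t1, t2))
```

A manipulation of the trees of the kind referred to in the note, applied before computing such a measure, would remove this dependence on the order of the splits.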
References
Banerjee, M., Ding, Y., Noone, A.-M. (2012). Identifying representative trees from ensembles. Statistics in Medicine, 31(15), 1601–1616. https://doi.org/10.1002/sim.4492.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, 36(1), 105–139. https://doi.org/10.1023/A:100751542.
Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/bf00058655.
Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6), 2350–2383. https://doi.org/10.1214/aos/1032181158.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and regression trees. Belmont: Wadsworth. https://doi.org/10.1201/9781315139470.
Briand, B., Ducharme, G.R., Parache, V., Mercat-Rommens, C. (2009). A similarity measure to assess the stability of classification trees. Computational Statistics & Data Analysis, 53(4), 1208–1217. https://doi.org/10.1016/j.csda.2008.10.033.
Chipman, H., George, E., McCulloch, R. (1998). Making sense of a forest of trees. In Weisberg, S. (Ed.) Proceedings of the 30th symposium on the interface (pp. 84–92). Fairfax: Interface Foundation of North America.
Dheeru, D., & Karra Taniskidou, E. (2017). UCI machine learning repository. Retrieved from http://archive.ics.uci.edu/ml.
Dietterich, T.G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2), 139–157. https://doi.org/10.1023/A:1007607513941.
Fehrman, E., Muhammad, A.K., Mirkes, E.M., Egan, V., Gorban, A.N. (2017). The five factor model of personality and evaluation of drug consumption risk. In Palumbo, F., Montanari, A., Vichi, M. (Eds.), Data science: innovative developments in data analysis and clustering (pp. 231–242). Springer. https://doi.org/10.1037/10140-001.
Fowlkes, E.B., & Mallows, C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383), 553–569. https://doi.org/10.2307/2288117.
Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 (1), 119–139. https://doi.org/10.1006/jcss.1997.1504.
Hastie, T.J., Tibshirani, R.J., Friedman, J.H. (2009). The elements of statistical learning: data mining, inference and prediction. New York: Springer. https://doi.org/10.1007/978-0-387-84858-7.
Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
Kaufman, L., & Rousseeuw, P.J. (2009). Finding groups in data: an introduction to cluster analysis. New York: Wiley. https://doi.org/10.1002/9780470316801.
Little, R.J., D’Agostino, R., Cohen, M.L., Dickersin, K., Emerson, S.S., Farrar, J.T., et al. (2012). The prevention and treatment of missing data in clinical trials. New England Journal of Medicine, 367(14), 1355–1360. https://doi.org/10.1056/NEJMsr1203730.
McCrae, R.R., & Costa, P.T. (2004). A contemplated revision of the NEO Five-Factor Inventory. Personality and Individual Differences, 36(3), 587–596. https://doi.org/10.1016/s0191-8869(03)00118-1.
Miglio, R., & Soffritti, G. (2004). The comparison between classification trees through proximity measures. Computational Statistics & Data Analysis, 45(3), 577–593. https://doi.org/10.1016/s0167-9473(03)00063-x.
Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159–179. https://doi.org/10.1007/bf02294245.
Patton, J.H., Stanford, M.S., Barratt, E.S. (1995). Factor structure of the Barratt Impulsiveness Scale. Journal of Clinical Psychology, 51(6), 768–774. https://doi.org/10.1002/1097-4679(199511)51:6<768::aid-jclp2270510607>3.0.co;2-1.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1007/bf00116251.
Rousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7.
Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley. https://doi.org/10.1002/9780470316696.
Schafer, J.L., & Graham, J.W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147. https://doi.org/10.1037//1082-989x.7.2.147.
Shannon, W.D., & Banks, D. (1999). Combining classification trees using MLE. Statistics in Medicine, 18(6), 727–740. https://doi.org/10.1002/(sici)1097-0258(19990330)18:6<727::aid-sim61>3.3.co;2-u.
Skurichina, M., & Duin, R.P. (2002). Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis & Applications, 5(2), 121–135. https://doi.org/10.1007/s100440200011.
Strobl, C., Malley, J., Tutz, G. (2009). An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323. https://doi.org/10.1037/a0016973.
Turney, P. (1995). Technical note: bias and the quantification of stability. Machine Learning, 20(1), 23–33. https://doi.org/10.1007/bf00993473.
Zuckerman, M., Kuhlman, D.M., Joireman, J., Teta, P., Kraft, M. (1993). A comparison of three structural models for personality: the big three, the big five, and the alternative five. Journal of Personality and Social Psychology, 65(4), 757. https://doi.org/10.1037//0022-3514.65.4.757.
Funding
The research reported in this paper was supported in part by the Research Foundation—Flanders (G080219N) and by the Research Fund of KU Leuven (C14/19/054). The data used in the real data examples were obtained from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017; Fehrman et al. 2017).
Cite this article
Sies, A., & Van Mechelen, I. (2020). C443: A methodology to see a forest for the trees. Journal of Classification. https://doi.org/10.1007/s00357-019-09350-4
Keywords
- Classification trees
- Statistical learning
- Bagging
- Ensemble methods
- Clustering