C443: a Methodology to See a Forest for the Trees

Abstract

Tree-based approaches to statistical learning problems often yield multiple decision trees, which together constitute a forest. Reasons for this include examining tree instability, improving prediction accuracy, accounting for missingness in the data, and taking multiple outcome variables into account. A key disadvantage of forests, in contrast to individual decision trees, is their lack of transparency. An obvious challenge, therefore, is whether some of the insightfulness of individual trees can be recovered from a forest. In this paper, we propose a conceptual framework and methodology to do so by reducing a forest to one or a small number of summary trees, which may be used to gain insight into the central tendency as well as the heterogeneity of the forest. This is done by clustering the trees in the forest based on similarities between them. By means of simulated data, we demonstrate how and why different similarity types in the proposed methodology may lead to markedly different conclusions, and we explain when and why certain approaches may be recommended over others. Finally, we illustrate the methodology with an empirical data set on the prediction of cocaine use from personality characteristics.

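To give a concrete sense of the pipeline sketched in the abstract, the following minimal Python sketch grows a bagged forest of CART trees, computes one simple type of pairwise tree similarity (Jaccard agreement between the sets of predictors each tree splits on), clusters the trees on the resulting distances, and returns the medoid tree of each cluster as a candidate summary tree. The particular similarity type, the average-linkage clustering, and all function names (grow_forest, summary_trees) are illustrative assumptions; the sketch does not reproduce the authors' C443 implementation or the specific similarity measures studied in the paper.

    # Illustrative sketch only: bagged forest -> pairwise tree similarity ->
    # clustering -> one medoid ("summary") tree per cluster.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    def grow_forest(X, y, n_trees=100, max_depth=3, seed=0):
        """Grow a forest by fitting one CART tree per bootstrap sample (bagging)."""
        rng = np.random.RandomState(seed)
        forest = []
        for _ in range(n_trees):
            Xb, yb = resample(X, y, random_state=rng)
            forest.append(DecisionTreeClassifier(max_depth=max_depth,
                                                 random_state=rng).fit(Xb, yb))
        return forest

    def predictors_used(tree):
        """Indices of predictors that appear in at least one split of the tree."""
        feats = tree.tree_.feature
        return set(feats[feats >= 0])

    def similarity_matrix(forest):
        """Pairwise Jaccard similarity between the predictor sets of the trees
        (only one of several conceivable similarity types)."""
        sets = [predictors_used(t) for t in forest]
        n = len(forest)
        S = np.eye(n)
        for i in range(n):
            for j in range(i + 1, n):
                union = sets[i] | sets[j]
                S[i, j] = S[j, i] = (len(sets[i] & sets[j]) / len(union)
                                     if union else 1.0)
        return S

    def summary_trees(forest, n_clusters=2):
        """Cluster the trees on 1 - similarity; return the medoid of each cluster."""
        D = 1.0 - similarity_matrix(forest)
        Z = linkage(squareform(D, checks=False), method="average")
        labels = fcluster(Z, n_clusters, criterion="maxclust")
        medoids = []
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            medoids.append(forest[idx[np.argmin(D[np.ix_(idx, idx)].sum(axis=1))]])
        return medoids

    X, y = load_breast_cancer(return_X_y=True)
    forest = grow_forest(X, y)
    for k, tree in enumerate(summary_trees(forest, n_clusters=2), start=1):
        print(f"summary tree {k} splits on predictors {sorted(predictors_used(tree))}")

Under these assumptions, each medoid can then be inspected as an ordinary decision tree, giving a view of the central tendency of its cluster, while differences between the medoids hint at the heterogeneity of the forest.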

Notes

  1. For some types of similarity that do not take logical equivalence into account, it is possible to rectify this by performing a manipulation on the trees under study. In the Supplementary Materials, we will explain how this can be done exactly, and introduce a similarity measure that can be used for this purpose (see Equation 15 of Supplementary Materials Section 2.2).

References

  1. Banerjee, M., Ding, Y., Noone, A.-M. (2012). Identifying representative trees from ensembles. Statistics in Medicine, 31(15), 1601–1616. https://doi.org/10.1002/sim.4492.

  2. Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, 36(1), 105–139. https://doi.org/10.1023/A:100751542.

  3. Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140. https://doi.org/10.1007/bf00058655.

  4. Breiman, L. (1996b). Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6), 2350–2383. https://doi.org/10.1214/aos/1032181158.

  5. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324.

  6. Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and regression trees. Belmont: Wadsworth. https://doi.org/10.1201/9781315139470.

  7. Briand, B., Ducharme, G.R., Parache, V., Mercat-Rommens, C. (2009). A similarity measure to assess the stability of classification trees. Computational Statistics & Data Analysis, 53(4), 1208–1217. https://doi.org/10.1016/j.csda.2008.10.033.

  8. Chipman, H., George, E., McCulloch, R. (1998). Making sense of a forest of trees. In Weisberg, S. (Ed.) Proceedings of the 30th symposium on the interface (pp. 84–92). Fairfax: Interface Foundation of North America.

  9. Dheeru, D., & Karra Taniskidou, E. (2017). UCI machine learning repository. Retrieved from http://archive.ics.uci.edu/ml.

  10. Dietterich, T.G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2), 139–157. https://doi.org/10.1023/A:1007607513941.

  11. Fehrman, E., Muhammad, A.K., Mirkes, E.M., Egan, V., Gorban, A.N. (2017). The five factor model of personality and evaluation of drug consumption risk. In Palumbo, F., Montanari, A., Vichi, M. (Eds.), Data science: innovative developments in data analysis and clustering (pp. 231–242). Springer. https://doi.org/10.1037/10140-001.

  12. Fowlkes, E.B., & Mallows, C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383), 553–569. https://doi.org/10.2307/2288117.

  13. Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55 (1), 119–139. https://doi.org/10.1006/jcss.1997.1504.

  14. Hastie, T.J., Tibshirani, R.J., Friedman, J.H. (2009). The elements of statistical learning: data mining, inference and prediction. New York: Springer. https://doi.org/10.1007/978-0-387-84858-7.

  15. Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.

  16. Kaufman, L., & Rousseeuw, P.J. (2009). Finding groups in data: an introduction to cluster analysis. New York: Wiley. https://doi.org/10.1002/9780470316801.

  17. Little, R.J., D’Agostino, R., Cohen, M.L., Dickersin, K., Emerson, S.S., Farrar, J.T., et al. (2012). The prevention and treatment of missing data in clinical trials. New England Journal of Medicine, 367(14), 1355–1360. https://doi.org/10.1056/NEJMsr1203730.

  18. McCrae, R.R., & Costa, P.T. (2004). A contemplated revision of the NEO Five-Factor Inventory. Personality and Individual Differences, 36(3), 587–596. https://doi.org/10.1016/s0191-8869(03)00118-1.

  19. Miglio, R., & Soffritti, G. (2004). The comparison between classification trees through proximity measures. Computational Statistics & Data Analysis, 45(3), 577–593. https://doi.org/10.1016/s0167-9473(03)00063-x.

  20. Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159–179. https://doi.org/10.1007/bf02294245.

  21. Patton, J.H., Stanford, M.S., Barratt, E.S. (1995). Factor structure of the Barratt Impulsiveness Scale. Journal of Clinical Psychology, 51(6), 768–774. https://doi.org/10.1002/1097-4679(199511)51:6<768::aid-jclp2270510607>3.0.co;2-1.

  22. Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1007/bf00116251.

  23. Rousseeuw, P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7.

  24. Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley. https://doi.org/10.1002/9780470316696.

  25. Schafer, J.L., & Graham, J.W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147. https://doi.org/10.1037//1082-989x.7.2.147.

  26. Shannon, W.D., & Banks, D. (1999). Combining classification trees using MLE. Statistics in Medicine, 18(6), 727–740. https://doi.org/10.1002/(sici)1097-0258(19990330)18:6<727::aid-sim61>3.3.co;2-u.

  27. Skurichina, M., & Duin, R.P. (2002). Bagging, boosting and the random subspace method for linear classifiers. Pattern Analysis & Applications, 5(2), 121–135. https://doi.org/10.1007/s100440200011.

  28. Strobl, C., Malley, J., Tutz, G. (2009). An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323. https://doi.org/10.1037/a0016973.

  29. Turney, P. (1995). Technical note: bias and the quantification of stability. Machine Learning, 20(1), 23–33. https://doi.org/10.1007/bf00993473.

  30. Zuckerman, M., Kuhlman, D.M., Joireman, J., Teta, P., Kraft, M. (1993). A comparison of three structural models for personality: the big three, the big five, and the alternative five. Journal of Personality and Social Psychology, 65(4), 757. https://doi.org/10.1037//0022-3514.65.4.757.

Funding

The research reported in this paper was supported in part by the Research Foundation—Flanders (G080219N) and by the Research Fund of KU Leuven (C14/19/054). The data used in the real data examples were obtained from the UCI Machine Learning Repository (Dheeru and Karra Taniskidou 2017; Fehrman et al. 2017).

Author information

Corresponding author

Correspondence to Aniek Sies.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary material is available with the online version of this article (PDF 1.38 MB).

About this article

Cite this article

Sies, A., Van Mechelen, I. C443: a Methodology to See a Forest for the Trees. J Classif (2020). https://doi.org/10.1007/s00357-019-09350-4

Keywords

  • Classification trees
  • Statistical learning
  • Bagging
  • Ensemble methods
  • Clustering