Tree-structured modelling of categorical predictors in generalized additive regression

Abstract

Generalized linear and additive models are efficient regression tools, but many parameters have to be estimated when categorical predictors with many categories are included. The method proposed here focuses on the main effects of categorical predictors and uses tree-type methods to obtain clusters of categories. When a predictor has many categories, one wants to know in particular which categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. A fitting algorithm is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on p-values that represent a conditional inference procedure. In addition, the stability of clusters is investigated, and the relevance of predictors is assessed by bootstrap methods. Several applications show the usefulness of the tree-structured approach, and small simulation studies demonstrate that the fitting procedure works well.
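
To make the idea of clustering categories by recursive splitting concrete, the following is a minimal sketch under simplifying assumptions; it is not the authors' algorithm and not the API of the structree package. It assumes a Gaussian response, one metric predictor with a linear effect, and one categorical predictor whose categories are first ordered by their estimated effects. A simple permutation test stands in for the conditional inference p-values mentioned above, and it ignores the selection over split points. The names `fit_rss` and `tree_clusters` are hypothetical.

```python
import random


def fit_rss(x, y, cluster_of):
    """Least-squares fit of y = alpha_cluster + beta * x.

    Uses within-cluster centring; returns (beta, residual sum of squares).
    """
    groups = {}
    for xi, yi, c in zip(x, y, cluster_of):
        groups.setdefault(c, []).append((xi, yi))
    sxx = sxy = 0.0
    centred = []
    for obs in groups.values():
        mx = sum(o[0] for o in obs) / len(obs)
        my = sum(o[1] for o in obs) / len(obs)
        for xi, yi in obs:
            centred.append((xi - mx, yi - my))
            sxx += (xi - mx) ** 2
            sxy += (xi - mx) * (yi - my)
    beta = sxy / sxx if sxx > 0 else 0.0
    rss = sum((dy - beta * dx) ** 2 for dx, dy in centred)
    return beta, rss


def tree_clusters(x, y, cat, alpha=0.05, n_perm=199, seed=1):
    """Recursively split ordered categories into clusters with a common effect."""
    rng = random.Random(seed)
    # Order categories by their estimated effect in the model with one
    # intercept per category (assumption: this ordering is a sketch device).
    beta, _ = fit_rss(x, y, cat)

    def effect(k):
        idx = [i for i, c in enumerate(cat) if c == k]
        return sum(y[i] - beta * x[i] for i in idx) / len(idx)

    clusters = [sorted(set(cat), key=effect)]  # start: one common effect
    while True:
        label = {k: j for j, cl in enumerate(clusters) for k in cl}
        current = [label[c] for c in cat]
        _, rss0 = fit_rss(x, y, current)
        # Search all clusters and all cut points for the largest RSS reduction.
        best = None
        for j, cl in enumerate(clusters):
            for cut in range(1, len(cl)):
                left = set(cl[:cut])
                trial = [-1 if c in left else lab for c, lab in zip(cat, current)]
                _, rss1 = fit_rss(x, y, trial)
                if best is None or rss0 - rss1 > best[0]:
                    best = (rss0 - rss1, j, cut)
        if best is None:
            break  # every cluster already consists of a single category
        gain, j, cut = best
        # Permutation p-value: shuffle the left/right assignment among the
        # observations of the cluster that would be split.
        members = [i for i, lab in enumerate(current) if lab == j]
        left = set(clusters[j][:cut])
        flags = [cat[i] in left for i in members]
        exceed = 0
        for _ in range(n_perm):
            rng.shuffle(flags)
            trial = list(current)
            for i, flag in zip(members, flags):
                if flag:
                    trial[i] = -1
            _, rss1 = fit_rss(x, y, trial)
            exceed += (rss0 - rss1) >= gain
        p_value = (1 + exceed) / (1 + n_perm)
        if p_value > alpha:
            break  # stopping criterion: best split is not significant
        clusters[j:j + 1] = [clusters[j][:cut], clusters[j][cut:]]
    label = {k: j for j, cl in enumerate(clusters) for k in cl}
    beta, _ = fit_rss(x, y, [label[c] for c in cat])
    return clusters, beta
```

Calling `tree_clusters(x, y, cat)` with three equal-length lists returns the list of category clusters and the estimated slope of the metric predictor. A non-Gaussian response or smooth effects would require replacing `fit_rss` with a suitable likelihood-based fit.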

Author information

Corresponding author

Correspondence to Moritz Berger.

About this article

Cite this article

Tutz, G., Berger, M. Tree-structured modelling of categorical predictors in generalized additive regression. Adv Data Anal Classif 12, 737–758 (2018). https://doi.org/10.1007/s11634-017-0298-6

Keywords

  • Categorical predictors
  • Tree-structured clustering
  • Recursive partitioning
  • Partially linear tree-based regression

Mathematics Subject Classification

  • 62J12
  • 62J02