Variable selection and model choice in structured survival models

Hofner, Benjamin; Hothorn, Torsten; Kneib, Thomas

doi:10.1007/s00180-012-0337-x

Variable selection and model choice in structured survival models

Original Paper
Published: 22 July 2012

Volume 28, pages 1079–1101, (2013)
Cite this article

Computational Statistics Aims and scope Submit manuscript

Benjamin Hofner¹,
Torsten Hothorn² &
Thomas Kneib³

539 Accesses
12 Citations
Explore all metrics

Abstract

We aim at modeling the survival time of intensive care patients suffering from severe sepsis. The nature of the problem requires a flexible model that allows to extend the classical Cox-model via the inclusion of time-varying and nonparametric effects. These structured survival models are very flexible but additional difficulties arise when model choice and variable selection are desired. In particular, it has to be decided which covariates should be assigned time-varying effects or whether linear modeling is sufficient for a given covariate. Component-wise boosting provides a means of likelihood-based model fitting that enables simultaneous variable selection and model choice. We introduce a component-wise, likelihood-based boosting algorithm for survival data that permits the inclusion of both parametric and nonparametric time-varying effects as well as nonparametric effects of continuous covariates utilizing penalized splines as the main modeling technique. An empirical evaluation of the methodology precedes the model building for the severe sepsis data. A software implementation is available to the interested reader.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Nonparametric estimation in the illness-death model using prevalent data

Article 28 June 2016

Flexible Modeling of Frailty Effects in Clustered Survival Data

Nonparametric change point estimation for survival distributions with a partially constant hazard rate

Article 05 April 2018

References

Abrahamowicz M, MacKenzie TA (2007) Joint estimation of time-dependent and non-linear effects of continuous covariates on survival. Stat Med 26:392–408
Article MathSciNet Google Scholar
Bender R, Augustin T, Blettner M (2005) Generating survival times to simulate Cox proportional hazards models. Stat Med 24:1713–1723
Google Scholar
Binder H, Schumacher M (2008) Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform 9:14
Article Google Scholar
Breiman L (1996) Heuristics of instability and stabilization in model selection. Ann Stat 24:2350–2383
Article MathSciNet MATH Google Scholar
Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:477–505
Article MATH Google Scholar
Bühlmann P, Yu B (2003) Boosting with the \(\text{ L}_2\) loss: regression and classification. J Am Stat Assoc 98:324–339
Google Scholar
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc Ser B 34:187–220
MATH Google Scholar
de Boor C (1978) A practical guide to splines. Springer, New York
Book MATH Google Scholar
Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties (with discussion). Stat Sci 11:89–121
Article MathSciNet MATH Google Scholar
Fahrmeir L, Kneib T, Lang S (2004) Penalized structured additive regression: a Bayesian perspective. Stat Sinica 14:731–761
MathSciNet MATH Google Scholar
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Article MATH Google Scholar
Gray RJ (1992) Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. J Am Stat Assoc 87:942–951
Article Google Scholar
Hartl WH, Wolf H, Schneider CP, Küchenhoff H, Jauch KW (2007) Secular trends in mortality associated with new therapeutic strategies in surgical critical illness. Am J Surg 194:535–541
Article Google Scholar
Hastie T (2007) Comment: Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:513–515
Google Scholar
Hastie T, Tibshirani R (1993) Varying-coefficient models. J R Stat Soc Ser B 55:757–796
MathSciNet MATH Google Scholar
Hofner B (2009) CoxFlexBoost: Boosting flexible Cox models (with time-varying effects). R package version 0.7-0, http://R-forge.R-project.org/projects/coxflexboost
Hofner B, Hothorn T, Kneib T, Schmid M (2011a) A framework for unbiased model selection based on boosting. J Comput Graph Stat 20:956–971
Article MathSciNet Google Scholar
Hofner B, Kneib T, Hartl W, Küchenhoff H (2011b) Building Cox-type structured hazard regression models with time-varying effects. Stat Modell Int J 11:3–24
Article Google Scholar
Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B (2010) Model-based boosting 2.0. J Mach Learn Res 11:2109–2113
MathSciNet MATH Google Scholar
Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B (2012) mboost: Model-Based Boosting. R package version 2.1-2, http://CRAN.R-project.org/package=mboost
Kneib T, Fahrmeir L (2007) A mixed model approach for geoadditive hazard regression. Scand J Stat 34:207–228
Article MathSciNet MATH Google Scholar
Kneib T, Hothorn T, Tutz G (2009) Variable selection and model choice in geoadditive regression models. Biometrics 65:626–634
Article MathSciNet MATH Google Scholar
Mayr A, Hofner B, Schmid M (2012) The importance of knowing when to stop—a sequential stopping rule for component-wise gradient boosting. Methods Inform Med 51:178–186
Article Google Scholar
Meinshausen N, Bühlmann P (2010) Stability selection (with discussion). J R Stat Soc Ser B 72:417–473
Article Google Scholar
Moubarak P, Zilker S, Wolf H, Hofner B, Kneib T, Küchenhoff H, Jauch K-W, Hartl WH (2008) Activity-guided antithrombin III therapy in severe surgical sepsis: efficacy and safety according to a retrospective data analysis. Shock 30:634–641
Article Google Scholar
Müller MH, Moubarak P, Wolf H, Küchenhoff H, Jauch KW, Hartl WH (2008) Independent determinants of early death in critically ill surgical patients. Shock 30:11–16
Article Google Scholar
Press WH, Teukolsky SA, Vetterling WT, Flannery B (1992) Numerical recipes in C: the art of scientific computing. Cambridge University Press, Cambridge
Google Scholar
R Development Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org
Rawlings JO, Pantula S, Dickey DA (1998) Applied regression analysis: a research tool. Springer, New York
Book MATH Google Scholar
Royston P, Altman DG (1994) Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Stat 43:429–453
Article Google Scholar
Rüttinger D, Wolf H, Küchenhoff H, Jauch KW, Hartl WH (2007) Red cell transfusion: an essential factor for patient prognosis in surgical critical illness? Shock 28:165–171
Article Google Scholar
Sauerbrei W, Royston P, Look M (2007) A new proposal for multivariable modelling of time-varying effects in survival data based on fractional polynomial time-transformation. Biometrica J 49:453–473
Article MathSciNet Google Scholar
Schmid M, Hothorn T (2008) Boosting additive models using component-wise P-splines. Comput Stat Data Anal 53:298–311
Article MathSciNet MATH Google Scholar
Therneau TM, Grambsch PM (2000) Modeling survival data: extending the Cox model. Springer, New York
Google Scholar
Tutz G, Binder H (2006) Generalized additive modelling with implicit variable selection by likelihood-based boosting. Biometrics 62:961–971
Article MathSciNet MATH Google Scholar
Zucker DM, Karr AF (1990) Non-parametric survival analysis with time-dependent covariate effects: a penalized likelihood approach. Ann Stat 18:329–352
Article MathSciNet MATH Google Scholar

Download references

Acknowledgments

The authors thank the associate editor and the anonymous referees for their helpful comments, W. H. Hartl from the Department of Surgery, Klinikum Großhadern for the data set and stimulating problems and D. Inthorn and H. Schneeberger for initiation and maintenance of the database of the surgical intensive care unit. B. Hofner and T. Hothorn were supported by Deutsche Forschungsgemeinschaft, grant HO 3242/1-3.

Author information

Authors and Affiliations

Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg, Waldstraße 6, 91054, Erlangen, Germany
Benjamin Hofner
Institut für Statistik, Ludwig-Maximilians-Universität, München, Germany
Torsten Hothorn
Institut für Statistik und Ökonometrie, Georg-August-Universität, Göttingen, Germany
Thomas Kneib

Authors

Benjamin Hofner
View author publications
You can also search for this author in PubMed Google Scholar
Torsten Hothorn
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Kneib
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Benjamin Hofner.

Appendices

Appendix A: Detailed simulation results

In this section, more details on the simulation results are given. To explore the accuracy of model choice and thus of variable selection we examine the relative frequency of selected base-learners. This means we count the models (simulation replicates) where the base-learner was included and ignore how often and in which boosting iteration(s) the base-learner was selected. Furthermore, the magnitude of the estimated effects is neglected.

Table 1 (left) shows the selection frequencies of the base-learners in the first simulation scheme. Only the variables \(x_1\) to \(x_6\) have an effect on the hazard rate and thus, on the survival time. These effective covariates are presented in the upper half of the table. We see that almost all effective base-learners have a selection frequency close to one or exactly one except for the linear base-learners \(\mathtt bols( x_2\mathtt ) \) and bols( \(x_6\) ). If we look at the true influence of \(x_2\) it shows that this is a good result, as we have a quadratic influence of this covariate and hence no linear effect is required. The low selection frequency of \(\mathtt bols( x_6\mathtt ) \) can be attributed to the size of the effect of \(x_6\), which is very small. It is \(20\) times smaller than the effect of the other categorical covariate \(x_5\). Hence, the low selection frequency seems very plausible. For \(x_4\) (which has in reality a linear effect) the algorithm selected in \(26~\%\) of the replicates a flexible deviation from linearity. Thus, in some models the (wrong) impression of an underlying nonlinear effect of \(x_4\) is given. However, compared to the selection frequencies of the other effects, this is only of minor importance. In addition, at further inspection we see that the departures from linearity are only very small.

For non-effective covariates we expect the selection frequency of a base-learner to be close to zero or at least substantially smaller than for effective covariates. Looking at the lower part of Table 1 (left) this can be confirmed.

Note that in our simulation the number of base-learners is not equal to the number of variables. A variable is selected if any of the base-learners of this variable is selected. Using this definition, we see that on average we selected 9.8 variables with 5.4 effective variables and 4.4 non-effective variables. Compared to a scheme where we assign only one base-learner for each variable (e.g., a flexible base-learner per covariate with 4 degrees of freedom) we realize that the model choice scheme tends to select more variables and to select more non-effective variables. Perhaps, this is due to an increased number of possible base-learners per covariate. This argument is backed by the finding that we selected on average 12.8 out of 25 base-learners. We selected (on average) about 5.3 non-effective base-learners, which corresponded on average to 4.4 non-effective variables. Thus, almost every non-effective base-learner is based on another variable.

See Figs. 4, 5, 6.

Appendix B: Model for surgical patients: further tables and graphics

See Table 2 and Fig. 7.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hofner, B., Hothorn, T. & Kneib, T. Variable selection and model choice in structured survival models. Comput Stat 28, 1079–1101 (2013). https://doi.org/10.1007/s00180-012-0337-x

Download citation

Received: 25 November 2009
Accepted: 26 May 2012
Published: 22 July 2012
Issue Date: June 2013
DOI: https://doi.org/10.1007/s00180-012-0337-x

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Variable selection and model choice in structured survival models

Abstract

Access this article

Similar content being viewed by others

Nonparametric estimation in the illness-death model using prevalent data

Flexible Modeling of Frailty Effects in Clustered Survival Data

Nonparametric change point estimation for survival distributions with a partially constant hazard rate

References

Acknowledgments