Skip to main content
Log in

Variable selection and model choice in structured survival models

  • Original Paper
  • Published:
Computational Statistics Aims and scope Submit manuscript

Abstract

We aim at modeling the survival time of intensive care patients suffering from severe sepsis. The nature of the problem requires a flexible model that allows to extend the classical Cox-model via the inclusion of time-varying and nonparametric effects. These structured survival models are very flexible but additional difficulties arise when model choice and variable selection are desired. In particular, it has to be decided which covariates should be assigned time-varying effects or whether linear modeling is sufficient for a given covariate. Component-wise boosting provides a means of likelihood-based model fitting that enables simultaneous variable selection and model choice. We introduce a component-wise, likelihood-based boosting algorithm for survival data that permits the inclusion of both parametric and nonparametric time-varying effects as well as nonparametric effects of continuous covariates utilizing penalized splines as the main modeling technique. An empirical evaluation of the methodology precedes the model building for the severe sepsis data. A software implementation is available to the interested reader.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

References

  • Abrahamowicz M, MacKenzie TA (2007) Joint estimation of time-dependent and non-linear effects of continuous covariates on survival. Stat Med 26:392–408

    Article  MathSciNet  Google Scholar 

  • Bender R, Augustin T, Blettner M (2005) Generating survival times to simulate Cox proportional hazards models. Stat Med 24:1713–1723

    Google Scholar 

  • Binder H, Schumacher M (2008) Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform 9:14

    Article  Google Scholar 

  • Breiman L (1996) Heuristics of instability and stabilization in model selection. Ann Stat 24:2350–2383

    Article  MathSciNet  MATH  Google Scholar 

  • Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:477–505

    Article  MATH  Google Scholar 

  • Bühlmann P, Yu B (2003) Boosting with the \(\text{ L}_2\) loss: regression and classification. J Am Stat Assoc 98:324–339

    Google Scholar 

  • Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc Ser B 34:187–220

    MATH  Google Scholar 

  • de Boor C (1978) A practical guide to splines. Springer, New York

    Book  MATH  Google Scholar 

  • Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties (with discussion). Stat Sci 11:89–121

    Article  MathSciNet  MATH  Google Scholar 

  • Fahrmeir L, Kneib T, Lang S (2004) Penalized structured additive regression: a Bayesian perspective. Stat Sinica 14:731–761

    MathSciNet  MATH  Google Scholar 

  • Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232

    Article  MATH  Google Scholar 

  • Gray RJ (1992) Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. J Am Stat Assoc 87:942–951

    Article  Google Scholar 

  • Hartl WH, Wolf H, Schneider CP, Küchenhoff H, Jauch KW (2007) Secular trends in mortality associated with new therapeutic strategies in surgical critical illness. Am J Surg 194:535–541

    Article  Google Scholar 

  • Hastie T (2007) Comment: Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:513–515

    Google Scholar 

  • Hastie T, Tibshirani R (1993) Varying-coefficient models. J R Stat Soc Ser B 55:757–796

    MathSciNet  MATH  Google Scholar 

  • Hofner B (2009) CoxFlexBoost: Boosting flexible Cox models (with time-varying effects). R package version 0.7-0, http://R-forge.R-project.org/projects/coxflexboost

  • Hofner B, Hothorn T, Kneib T, Schmid M (2011a) A framework for unbiased model selection based on boosting. J Comput Graph Stat 20:956–971

    Article  MathSciNet  Google Scholar 

  • Hofner B, Kneib T, Hartl W, Küchenhoff H (2011b) Building Cox-type structured hazard regression models with time-varying effects. Stat Modell Int J 11:3–24

    Article  Google Scholar 

  • Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B (2010) Model-based boosting 2.0. J Mach Learn Res 11:2109–2113

    MathSciNet  MATH  Google Scholar 

  • Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B (2012) mboost: Model-Based Boosting. R package version 2.1-2, http://CRAN.R-project.org/package=mboost

  • Kneib T, Fahrmeir L (2007) A mixed model approach for geoadditive hazard regression. Scand J Stat 34:207–228

    Article  MathSciNet  MATH  Google Scholar 

  • Kneib T, Hothorn T, Tutz G (2009) Variable selection and model choice in geoadditive regression models. Biometrics 65:626–634

    Article  MathSciNet  MATH  Google Scholar 

  • Mayr A, Hofner B, Schmid M (2012) The importance of knowing when to stop—a sequential stopping rule for component-wise gradient boosting. Methods Inform Med 51:178–186

    Article  Google Scholar 

  • Meinshausen N, Bühlmann P (2010) Stability selection (with discussion). J R Stat Soc Ser B 72:417–473

    Article  Google Scholar 

  • Moubarak P, Zilker S, Wolf H, Hofner B, Kneib T, Küchenhoff H, Jauch K-W, Hartl WH (2008) Activity-guided antithrombin III therapy in severe surgical sepsis: efficacy and safety according to a retrospective data analysis. Shock 30:634–641

    Article  Google Scholar 

  • Müller MH, Moubarak P, Wolf H, Küchenhoff H, Jauch KW, Hartl WH (2008) Independent determinants of early death in critically ill surgical patients. Shock 30:11–16

    Article  Google Scholar 

  • Press WH, Teukolsky SA, Vetterling WT, Flannery B (1992) Numerical recipes in C: the art of scientific computing. Cambridge University Press, Cambridge

    Google Scholar 

  • R Development Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org

  • Rawlings JO, Pantula S, Dickey DA (1998) Applied regression analysis: a research tool. Springer, New York

    Book  MATH  Google Scholar 

  • Royston P, Altman DG (1994) Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Stat 43:429–453

    Article  Google Scholar 

  • Rüttinger D, Wolf H, Küchenhoff H, Jauch KW, Hartl WH (2007) Red cell transfusion: an essential factor for patient prognosis in surgical critical illness? Shock 28:165–171

    Article  Google Scholar 

  • Sauerbrei W, Royston P, Look M (2007) A new proposal for multivariable modelling of time-varying effects in survival data based on fractional polynomial time-transformation. Biometrica J 49:453–473

    Article  MathSciNet  Google Scholar 

  • Schmid M, Hothorn T (2008) Boosting additive models using component-wise P-splines. Comput Stat Data Anal 53:298–311

    Article  MathSciNet  MATH  Google Scholar 

  • Therneau TM, Grambsch PM (2000) Modeling survival data: extending the Cox model. Springer, New York

    Google Scholar 

  • Tutz G, Binder H (2006) Generalized additive modelling with implicit variable selection by likelihood-based boosting. Biometrics 62:961–971

    Article  MathSciNet  MATH  Google Scholar 

  • Zucker DM, Karr AF (1990) Non-parametric survival analysis with time-dependent covariate effects: a penalized likelihood approach. Ann Stat 18:329–352

    Article  MathSciNet  MATH  Google Scholar 

Download references

Acknowledgments

The authors thank the associate editor and the anonymous referees for their helpful comments, W. H. Hartl from the Department of Surgery, Klinikum Großhadern for the data set and stimulating problems and D. Inthorn and H. Schneeberger for initiation and maintenance of the database of the surgical intensive care unit. B. Hofner and T. Hothorn were supported by Deutsche Forschungsgemeinschaft, grant HO 3242/1-3.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Benjamin Hofner.

Appendices

Appendix A: Detailed simulation results

In this section, more details on the simulation results are given. To explore the accuracy of model choice and thus of variable selection we examine the relative frequency of selected base-learners. This means we count the models (simulation replicates) where the base-learner was included and ignore how often and in which boosting iteration(s) the base-learner was selected. Furthermore, the magnitude of the estimated effects is neglected.

Table 1 (left) shows the selection frequencies of the base-learners in the first simulation scheme. Only the variables \(x_1\) to \(x_6\) have an effect on the hazard rate and thus, on the survival time. These effective covariates are presented in the upper half of the table. We see that almost all effective base-learners have a selection frequency close to one or exactly one except for the linear base-learners \(\mathtt bols( x_2\mathtt ) \) and bols( \(x_6\) ). If we look at the true influence of \(x_2\) it shows that this is a good result, as we have a quadratic influence of this covariate and hence no linear effect is required. The low selection frequency of \(\mathtt bols( x_6\mathtt ) \) can be attributed to the size of the effect of \(x_6\), which is very small. It is \(20\) times smaller than the effect of the other categorical covariate \(x_5\). Hence, the low selection frequency seems very plausible. For \(x_4\) (which has in reality a linear effect) the algorithm selected in \(26~\%\) of the replicates a flexible deviation from linearity. Thus, in some models the (wrong) impression of an underlying nonlinear effect of \(x_4\) is given. However, compared to the selection frequencies of the other effects, this is only of minor importance. In addition, at further inspection we see that the departures from linearity are only very small.

For non-effective covariates we expect the selection frequency of a base-learner to be close to zero or at least substantially smaller than for effective covariates. Looking at the lower part of Table 1 (left) this can be confirmed.

Note that in our simulation the number of base-learners is not equal to the number of variables. A variable is selected if any of the base-learners of this variable is selected. Using this definition, we see that on average we selected 9.8 variables with 5.4 effective variables and 4.4 non-effective variables. Compared to a scheme where we assign only one base-learner for each variable (e.g., a flexible base-learner per covariate with 4 degrees of freedom) we realize that the model choice scheme tends to select more variables and to select more non-effective variables. Perhaps, this is due to an increased number of possible base-learners per covariate. This argument is backed by the finding that we selected on average 12.8 out of 25 base-learners. We selected (on average) about 5.3 non-effective base-learners, which corresponded on average to 4.4 non-effective variables. Thus, almost every non-effective base-learner is based on another variable.

See Figs. 4, 5, 6.

Fig. 4
figure 4

Simulation scheme 1—remaining estimates of effects of influential covariates from 20 models (gray lines) and real effects (dashed lines). Effect estimates and real effects are centered

Fig. 5
figure 5

Simulation scheme 1—remaining estimates of covariate effects for non-effective covariates from 20 models (gray lines) and real effects (dashed lines). Effect estimates and real effects are centered

Fig. 6
figure 6

Simulation scheme 2—estimation of covariate effects from 20 models (gray lines) and real effects (dashed lines). Effect estimates and real effects are centered

Appendix B: Model for surgical patients: further tables and graphics

See Table 2 and Fig. 7.

 

Fig. 7
figure 7

Remaining estimated effects for the complete Großhadern data set together with pointwise 80 % variability intervals based on sub-samples. The upper three rows represent time-varying effects (for binary covariates except creatinine concentration). The fourth row shows the time-constant effects for continuous covariates and the last row the same for binary covariates

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hofner, B., Hothorn, T. & Kneib, T. Variable selection and model choice in structured survival models. Comput Stat 28, 1079–1101 (2013). https://doi.org/10.1007/s00180-012-0337-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00180-012-0337-x

Keywords

Navigation