Abstract
We aim at modeling the survival time of intensive care patients suffering from severe sepsis. The nature of the problem requires a flexible model that allows to extend the classical Cox-model via the inclusion of time-varying and nonparametric effects. These structured survival models are very flexible but additional difficulties arise when model choice and variable selection are desired. In particular, it has to be decided which covariates should be assigned time-varying effects or whether linear modeling is sufficient for a given covariate. Component-wise boosting provides a means of likelihood-based model fitting that enables simultaneous variable selection and model choice. We introduce a component-wise, likelihood-based boosting algorithm for survival data that permits the inclusion of both parametric and nonparametric time-varying effects as well as nonparametric effects of continuous covariates utilizing penalized splines as the main modeling technique. An empirical evaluation of the methodology precedes the model building for the severe sepsis data. A software implementation is available to the interested reader.
Similar content being viewed by others
References
Abrahamowicz M, MacKenzie TA (2007) Joint estimation of time-dependent and non-linear effects of continuous covariates on survival. Stat Med 26:392–408
Bender R, Augustin T, Blettner M (2005) Generating survival times to simulate Cox proportional hazards models. Stat Med 24:1713–1723
Binder H, Schumacher M (2008) Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform 9:14
Breiman L (1996) Heuristics of instability and stabilization in model selection. Ann Stat 24:2350–2383
Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:477–505
Bühlmann P, Yu B (2003) Boosting with the \(\text{ L}_2\) loss: regression and classification. J Am Stat Assoc 98:324–339
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc Ser B 34:187–220
de Boor C (1978) A practical guide to splines. Springer, New York
Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties (with discussion). Stat Sci 11:89–121
Fahrmeir L, Kneib T, Lang S (2004) Penalized structured additive regression: a Bayesian perspective. Stat Sinica 14:731–761
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Gray RJ (1992) Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. J Am Stat Assoc 87:942–951
Hartl WH, Wolf H, Schneider CP, Küchenhoff H, Jauch KW (2007) Secular trends in mortality associated with new therapeutic strategies in surgical critical illness. Am J Surg 194:535–541
Hastie T (2007) Comment: Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:513–515
Hastie T, Tibshirani R (1993) Varying-coefficient models. J R Stat Soc Ser B 55:757–796
Hofner B (2009) CoxFlexBoost: Boosting flexible Cox models (with time-varying effects). R package version 0.7-0, http://R-forge.R-project.org/projects/coxflexboost
Hofner B, Hothorn T, Kneib T, Schmid M (2011a) A framework for unbiased model selection based on boosting. J Comput Graph Stat 20:956–971
Hofner B, Kneib T, Hartl W, Küchenhoff H (2011b) Building Cox-type structured hazard regression models with time-varying effects. Stat Modell Int J 11:3–24
Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B (2010) Model-based boosting 2.0. J Mach Learn Res 11:2109–2113
Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B (2012) mboost: Model-Based Boosting. R package version 2.1-2, http://CRAN.R-project.org/package=mboost
Kneib T, Fahrmeir L (2007) A mixed model approach for geoadditive hazard regression. Scand J Stat 34:207–228
Kneib T, Hothorn T, Tutz G (2009) Variable selection and model choice in geoadditive regression models. Biometrics 65:626–634
Mayr A, Hofner B, Schmid M (2012) The importance of knowing when to stop—a sequential stopping rule for component-wise gradient boosting. Methods Inform Med 51:178–186
Meinshausen N, Bühlmann P (2010) Stability selection (with discussion). J R Stat Soc Ser B 72:417–473
Moubarak P, Zilker S, Wolf H, Hofner B, Kneib T, Küchenhoff H, Jauch K-W, Hartl WH (2008) Activity-guided antithrombin III therapy in severe surgical sepsis: efficacy and safety according to a retrospective data analysis. Shock 30:634–641
Müller MH, Moubarak P, Wolf H, Küchenhoff H, Jauch KW, Hartl WH (2008) Independent determinants of early death in critically ill surgical patients. Shock 30:11–16
Press WH, Teukolsky SA, Vetterling WT, Flannery B (1992) Numerical recipes in C: the art of scientific computing. Cambridge University Press, Cambridge
R Development Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org
Rawlings JO, Pantula S, Dickey DA (1998) Applied regression analysis: a research tool. Springer, New York
Royston P, Altman DG (1994) Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Stat 43:429–453
Rüttinger D, Wolf H, Küchenhoff H, Jauch KW, Hartl WH (2007) Red cell transfusion: an essential factor for patient prognosis in surgical critical illness? Shock 28:165–171
Sauerbrei W, Royston P, Look M (2007) A new proposal for multivariable modelling of time-varying effects in survival data based on fractional polynomial time-transformation. Biometrica J 49:453–473
Schmid M, Hothorn T (2008) Boosting additive models using component-wise P-splines. Comput Stat Data Anal 53:298–311
Therneau TM, Grambsch PM (2000) Modeling survival data: extending the Cox model. Springer, New York
Tutz G, Binder H (2006) Generalized additive modelling with implicit variable selection by likelihood-based boosting. Biometrics 62:961–971
Zucker DM, Karr AF (1990) Non-parametric survival analysis with time-dependent covariate effects: a penalized likelihood approach. Ann Stat 18:329–352
Acknowledgments
The authors thank the associate editor and the anonymous referees for their helpful comments, W. H. Hartl from the Department of Surgery, Klinikum Großhadern for the data set and stimulating problems and D. Inthorn and H. Schneeberger for initiation and maintenance of the database of the surgical intensive care unit. B. Hofner and T. Hothorn were supported by Deutsche Forschungsgemeinschaft, grant HO 3242/1-3.
Author information
Authors and Affiliations
Corresponding author
Appendices
Appendix A: Detailed simulation results
In this section, more details on the simulation results are given. To explore the accuracy of model choice and thus of variable selection we examine the relative frequency of selected base-learners. This means we count the models (simulation replicates) where the base-learner was included and ignore how often and in which boosting iteration(s) the base-learner was selected. Furthermore, the magnitude of the estimated effects is neglected.
Table 1 (left) shows the selection frequencies of the base-learners in the first simulation scheme. Only the variables \(x_1\) to \(x_6\) have an effect on the hazard rate and thus, on the survival time. These effective covariates are presented in the upper half of the table. We see that almost all effective base-learners have a selection frequency close to one or exactly one except for the linear base-learners \(\mathtt bols( x_2\mathtt ) \) and bols( \(x_6\) ). If we look at the true influence of \(x_2\) it shows that this is a good result, as we have a quadratic influence of this covariate and hence no linear effect is required. The low selection frequency of \(\mathtt bols( x_6\mathtt ) \) can be attributed to the size of the effect of \(x_6\), which is very small. It is \(20\) times smaller than the effect of the other categorical covariate \(x_5\). Hence, the low selection frequency seems very plausible. For \(x_4\) (which has in reality a linear effect) the algorithm selected in \(26~\%\) of the replicates a flexible deviation from linearity. Thus, in some models the (wrong) impression of an underlying nonlinear effect of \(x_4\) is given. However, compared to the selection frequencies of the other effects, this is only of minor importance. In addition, at further inspection we see that the departures from linearity are only very small.
For non-effective covariates we expect the selection frequency of a base-learner to be close to zero or at least substantially smaller than for effective covariates. Looking at the lower part of Table 1 (left) this can be confirmed.
Note that in our simulation the number of base-learners is not equal to the number of variables. A variable is selected if any of the base-learners of this variable is selected. Using this definition, we see that on average we selected 9.8 variables with 5.4 effective variables and 4.4 non-effective variables. Compared to a scheme where we assign only one base-learner for each variable (e.g., a flexible base-learner per covariate with 4 degrees of freedom) we realize that the model choice scheme tends to select more variables and to select more non-effective variables. Perhaps, this is due to an increased number of possible base-learners per covariate. This argument is backed by the finding that we selected on average 12.8 out of 25 base-learners. We selected (on average) about 5.3 non-effective base-learners, which corresponded on average to 4.4 non-effective variables. Thus, almost every non-effective base-learner is based on another variable.
Appendix B: Model for surgical patients: further tables and graphics
Rights and permissions
About this article
Cite this article
Hofner, B., Hothorn, T. & Kneib, T. Variable selection and model choice in structured survival models. Comput Stat 28, 1079–1101 (2013). https://doi.org/10.1007/s00180-012-0337-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180-012-0337-x