Abstract
Despite the limitations imposed by the proportional hazards assumption, the Cox model is probably the most popular statistical tool for analyzing survival data, thanks to its flexibility and ease of interpretation. For this reason, novel statistical/machine learning techniques are usually adapted to fit its requirements, including boosting. Boosting is an iterative technique originally developed in the machine learning community to handle classification problems, and later extended to the statistical field, where it is used in many situations, including regression and survival analysis. Its popularity has been further driven by the availability of user-friendly software such as the R packages mboost and CoxBoost, both of which allow the implementation of boosting in conjunction with the Cox model. Despite the common underlying boosting principles, the two packages use different techniques: the former is an adaptation of model-based boosting, while the latter adapts likelihood-based boosting. Here we contrast these two boosting techniques, as implemented in the R packages, from an analytic point of view, and we examine the solutions they adopt to handle mandatory variables, i.e. variables that, for various reasons, must be included in the model. We also explore the possibility of extending solutions currently implemented in only one package to the other. A simulation study and a real data example are added for illustration.
References
Binder H (2013a) CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks. R package version 1.4. http://CRAN.R-project.org/package=CoxBoost
Binder H (2013b) GAMBoost: generalized linear and additive models by likelihood based boosting. R package version 1.2-3. http://CRAN.R-project.org/package=GAMBoost
Binder H, Schumacher M (2008) Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform 9:14
Boulesteix AL, Hothorn T (2010) Testing the additional predictive value of high-dimensional molecular data. BMC Bioinform 11:78
Boulesteix AL, Sauerbrei W (2011) Added predictive value of high-throughput molecular data to clinical data and its validation. Brief Bioinform 12:215–229
Boulesteix AL, Richter A, Bernau C (2013) Complexity selection with cross-validation for lasso and sparse partial least squares using high-dimensional data. In: Lausen B, Van den Poel D, Ultsch A (eds) Algorithms from and for nature and life. Springer, Cham, Switzerland, pp 261–268
Breiman L (1998) Arcing classifier. Ann Stat 26:801–849
Bühlmann P (2006) Boosting for high-dimensional linear models. Ann Stat 34:559–583
Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:477–505
Bühlmann P, Yu B (2003) Boosting with the L\(_2\) loss: regression and classification. J Am Stat Assoc 98:324–339
Cox D (1972) Regression models and life-tables. J R Stat Soc Ser B (Methodological) 34:187–220
De Bin R, Sauerbrei W, Boulesteix AL (2014) Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med 33:5310–5329
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499
Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121:256–285
Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th international conference on machine learning. Morgan Kaufmann Publishers Inc., pp 148–156
Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28:337–407
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
Gerds T (2014) pec: Prediction error curves for risk prediction models in survival analysis. R package version 2.4-4. http://CRAN.R-project.org/package=pec
Graf E, Schmoor C, Sauerbrei W, Schumacher M (1999) Assessment and comparison of prognostic classification schemes for survival data. Stat Med 18:2529–2545
Hofner B, Hothorn T, Kneib T (2013) Variable selection and model choice in structured survival models. Comput Stat 28:1079–1101
Hofner B, Mayr A, Robinzonov N, Schmid M (2014) Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 29:3–35
Hothorn T, Bühlmann P, Dudoit S, Molinaro A, Van Der Laan MJ (2006) Survival ensembles. Biostatistics 7:355–373
Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B, Sobotka F, Scheipl F (2015) mboost: Model-based boosting. R package version 2.5-0. http://CRAN.R-project.org/package=mboost
Marisa L, de Reyniès A, Duval A, Selves J, Gaub MP, Vescovo L, Etienne-Grimaldi MC, Schiappa R, Guenot D, Ayadi M, Kirzin S, Chazal M, Fléjou JF, Benchimol D, Berger A, Lagarde A, Pencreach E, Piard F, Elias D, Parc Y, Olschwang S, Milano G, Laurent-Puig P, Boige V (2013) Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med 10:e1001453
Mayr A, Hofner B, Schmid M (2012) The importance of knowing when to stop. A sequential stopping rule for component-wise gradient boosting. Methods Inf Med 51:178–186
Mayr A, Binder H, Gefeller O, Schmid M (2014) The evolution of boosting algorithms. Methods Inf Med 53:419–427
McCullagh P, Nelder J (1989) Generalized linear models. Chapman and Hall, London
R Development Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
Ridgeway G (1999) Generalization of boosting algorithms and applications of Bayesian inference for massive datasets. Ph.D. thesis, University of Washington
Ridgeway G (2010) gbm: Generalized boosted regression models. R package version 1.6. http://CRAN.R-project.org/package=gbm
Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197–227
Schmid M, Hothorn T (2008) Flexible boosting of accelerated failure time models. BMC Bioinform 9:269
Truntzer C, Mostacci E, Jeannin A, Petit JM, Ducoroy P, Cardot H (2014) Comparison of classification methods that combine clinical data and high-dimensional mass spectrometry data. BMC Bioinform 15:385
Tutz G, Binder H (2006) Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics 62:961–971
Tutz G, Binder H (2007) Boosting ridge regression. Comput Stat Data Anal 51:6044–6059
Van der Laan MJ, Robins JM (2003) Unified methods for censored longitudinal data and causality. Springer, New York
Acknowledgments
RDB was financed by Grant BO3139/4-1 from the German Research Foundation (DFG). Special thanks go to Anne-Laure Boulesteix for her advice and suggestions, to Rory Wilson for his help with linguistic improvements, and to the two anonymous reviewers for their comments, which led to an improved version of the paper.
Appendix
In the paper we showed that, in the case of the linear Cox model, the algorithms used by the R packages mboost (through the function glmboost) and CoxBoost follow different learning paths, in contrast to the Gaussian linear regression case, in which the two approaches produce the same result provided \(\lambda = n(1-\nu)/\nu\) (Binder 2013b). Note that the equivalence in the linear regression case only holds for the component-wise version of boosting. In the non-component-wise version, in which all dimensions of \(\hat{\beta}\) are updated simultaneously, the two weak estimators have the form
\[
\hat{b}^{LB} = (X^\top X + \lambda P)^{-1} X^\top u \qquad \text{and} \qquad \hat{b}^{MB} = \nu\,(X^\top X)^{-1} X^\top u
\]
for the likelihood- and the model-based boosting, respectively, where u denotes the vector of current residuals. While the model-based penalty \(\nu\) affects all dimensions identically, the penalty \(\lambda\) penalizes the dimensions depending on the correlation structure of X. Consider \(P=I_p\), the identity matrix used as the default in CoxBoost. The weak estimator \(\hat{b}^{LB}\) is then a ridge estimator: when the response is projected onto the orthonormal basis of the explanatory variables (the columns of X), the penalty term shrinks the coordinates in inverse relation to the variance of the related principal components. This means, in particular, that \(\lambda\) penalizes (shrinks) the coefficients related to principal components with low variance more strongly. Looking at the predicted values obtained through the two algorithms, and denoting by \(B_j\) the columns of the orthonormal basis B of the column space of X, we obtain
\[
\hat{y}^{LB} = \sum_{j=1}^{p} B_j \left(1 - \Big(\frac{\lambda}{d_j+\lambda}\Big)^{m+1}\right) B_j^\top y \qquad \text{and} \qquad \hat{y}^{MB} = \sum_{j=1}^{p} B_j \left(1 - (1-\nu)^{m+1}\right) B_j^\top y,
\]
where \(d_j\), \(j=1,\ldots,p\), is the j-th eigenvalue of the matrix \(X^\top X\) (divided by n, it is the variance of the j-th principal component) and m indicates the number of iterations performed. Replacing m by 0 yields the formula for the single-step update. It is worth noting that, due to its stage-wise nature, the boosting ridge regression algorithm leads to a different penalization (and, therefore, to different estimates) than the usual ridge regression (Tutz and Binder 2007).
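As an illustration, the following minimal R sketch numerically checks the closed-form expression above for the likelihood-based estimator in the Gaussian case. It is our own example and does not use mboost or CoxBoost; the design matrix, penalty and number of iterations are arbitrary choices.

set.seed(1)
n <- 50; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))   # centered, standardized columns
y <- X %*% c(1, -0.5, 0) + rnorm(n)      # Gaussian response
lambda <- 10; m <- 5                     # penalty and number of iterations

# Non-component-wise likelihood-based boosting with P = I_p: each of the
# m + 1 steps fits a ridge estimator to the current residuals.
beta <- rep(0, p)
for (i in 0:m) {
  u <- y - X %*% beta                    # current residuals
  beta <- beta + solve(t(X) %*% X + lambda * diag(p), t(X) %*% u)
}

# Closed form: shrinkage factors 1 - (lambda / (d_j + lambda))^(m + 1)
# along the orthonormal basis B of the column space of X.
sv <- svd(X)
B <- sv$u                                # orthonormal basis of col(X)
d <- sv$d^2                              # eigenvalues of X^T X
shrink <- 1 - (lambda / (d + lambda))^(m + 1)
max(abs(X %*% beta - B %*% (shrink * crossprod(B, y))))   # ~ 0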
We can obtain a uniform penalization with the likelihood-based boosting by setting \(P = (1/n)X^\top X\), provided that the columns of X are centered at 0 and standardized: in this case, \((1/n)X^\top X\) is the correlation matrix of X.
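As a quick numerical check (again an illustrative sketch, not part of either package), with columns centered and scaled to unit variance under the 1/n convention, \((1/n)X^\top X\) coincides with the correlation matrix:

set.seed(2)
X <- matrix(rnorm(200), 50, 4)
Xs <- sweep(X, 2, colMeans(X))                  # center the columns
Xs <- sweep(Xs, 2, sqrt(colMeans(Xs^2)), "/")   # unit variance (1/n convention)
P <- crossprod(Xs) / nrow(Xs)                   # candidate penalty matrix
max(abs(P - cor(X)))                            # ~ 0: P equals cor(X)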
Note that it is possible to use a penalized least squares estimator within the model-based boosting algorithm as well. This solution seems to be gaining popularity (see, e.g., Hofner et al. 2014), because it allows better handling of categorical variables; in this case the overall penalization is a combination of the effects of \(\lambda\) and \(\nu\). Another issue related to categorical variables concerns their variance. In this paper we considered a standardized X, but, as the examples showed, in practical situations a standardization step is needed. Likelihood-based boosting, in particular, requires all \(X_j\) to have variance 1. For this reason, in the real data example we coded the dummy variables as \((-1, 1)\): in the case of completely balanced observations, the variance of the binary variables is then equal to 1. Unfortunately, such balance rarely occurs in practice. A possible solution which does not require the standardization of X is to replace P with the covariance matrix of X or, for the component-wise version, with \(\text{diag}((1/n)X^\top X)\): this standardizes the binary variables by their observed standard deviations as well.
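To illustrate this last point, here is a minimal sketch (our own construction, in the Gaussian case and outside both packages) of one component-wise penalized update with \(P = \text{diag}((1/n)X^\top X)\), applied to an unbalanced \((-1, 1)\) dummy; the effective penalty then scales with each covariate's observed variance:

set.seed(3)
n <- 100
x_bin <- ifelse(runif(n) < 0.8, 1, -1)             # unbalanced (-1, 1) dummy
X <- scale(cbind(rnorm(n), x_bin), scale = FALSE)  # center only, no scaling
y <- 0.5 * X[, 1] - 0.3 * X[, 2] + rnorm(n)
lambda <- 20

p_diag <- colSums(X^2) / n      # diag((1/n) X^T X): the observed variances
u <- y                          # residuals at the starting point (beta = 0)

# One penalized univariate update per covariate; a component-wise step
# would then select the best-performing one.
b_hat <- sapply(1:2, function(j)
  sum(X[, j] * u) / (sum(X[, j]^2) + lambda * p_diag[j]))
b_hat  # the dummy's update is penalized according to its observed variance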