Boosting in Cox regression: a comparison between the likelihood-based and the model-based approaches with focus on the R-packages CoxBoost and mboost

Abstract

Despite the limitations imposed by the proportional hazards assumption, the Cox model is probably the most popular statistical tool used to analyze survival data, thanks to its flexibility and ease of interpretation. For this reason, novel statistical/machine learning techniques are usually adapted to fit its requirements, including boosting. Boosting is an iterative technique originally developed in the machine learning community to handle classification problems, and later extended to the statistical field, where it is used in many situations, including regression and survival analysis. The popularity of boosting has been further driven by the availability of user-friendly software such as the R packages mboost and CoxBoost, both of which allow the implementation of boosting in conjunction with the Cox model. Despite the common underlying boosting principles, these two packages use different techniques: the former is an adaptation of model-based boosting, while the latter adapts likelihood-based boosting. Here we contrast these two boosting techniques as implemented in the R packages from an analytic point of view; we further examine solutions adopted within these packages to treat mandatory variables, i.e. variables that—for several reasons—must be included in the model. We explore the possibility of extending solutions currently only implemented in one package to the other. A simulation study and a real data example are added for illustration.

References

  • Binder H (2013a) CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks. R package version 1.4. http://CRAN.R-project.org/package=CoxBoost

  • Binder H (2013b) GAMBoost: generalized linear and additive models by likelihood based boosting. R package version 1.2-3. http://CRAN.R-project.org/package=GAMBoost

  • Binder H, Schumacher M (2008) Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform 9:14

  • Boulesteix AL, Hothorn T (2010) Testing the additional predictive value of high-dimensional molecular data. BMC Bioinform 11:78

  • Boulesteix AL, Sauerbrei W (2011) Added predictive value of high-throughput molecular data to clinical data and its validation. Brief Bioinform 12:215–229

  • Boulesteix AL, Richter A, Bernau C (2013) Complexity selection with cross-validation for lasso and sparse partial least squares using high-dimensional data. In: Lausen B, Van den Poel D, Ultsch A (eds) Algorithms from and for nature and life. Springer, Cham, Switzerland, pp 261–268

  • Breiman L (1998) Arcing classifiers. Ann Stat 26:801–849

  • Bühlmann P (2006) Boosting for high-dimensional linear models. Ann Stat 34:559–583

  • Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22:477–505

  • Bühlmann P, Yu B (2003) Boosting with the L\(_2\) loss: regression and classification. J Am Stat Assoc 98:324–339

  • Cox D (1972) Regression models and life-tables. J R Stat Soc Ser B (Methodological) 34:187–220

  • De Bin R, Sauerbrei W, Boulesteix AL (2014) Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med 33:5310–5329

  • Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499

  • Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121:256–285

  • Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Proceedings of the 13th international conference on machine learning. Morgan Kaufmann Publishers Inc., pp 148–156

  • Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28:337–407

  • Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232

  • Gerds T (2014) pec: Prediction error curves for risk prediction models in survival analysis. R package version 2.4-4. http://CRAN.R-project.org/package=pec

  • Graf E, Schmoor C, Sauerbrei W, Schumacher M (1999) Assessment and comparison of prognostic classification schemes for survival data. Stat Med 18:2529–2545

  • Hofner B, Hothorn T, Kneib T (2013) Variable selection and model choice in structured survival models. Comput Stat 28:1079–1101

  • Hofner B, Mayr A, Robinzonov N, Schmid M (2014) Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 29:3–35

  • Hothorn T, Bühlmann P, Dudoit S, Molinaro A, Van Der Laan MJ (2006) Survival ensembles. Biostatistics 7:355–373

  • Hothorn T, Buehlmann P, Kneib T, Schmid M, Hofner B, Sobotka F, Scheipl F (2015) mboost: Model-based boosting. R package version 2.5-0. http://CRAN.R-project.org/package=mboost

  • Marisa L, de Reyniès A, Duval A, Selves J, Gaub MP, Vescovo L, Etienne-Grimaldi MC, Schiappa R, Guenot D, Ayadi M, Kirzin S, Chazal M, Fléjou JF, Benchimol D, Berger A, Lagarde A, Pencreach E, Piard F, Elias D, Parc Y, Olschwang S, Milano G, Laurent-Puig P, Boige V (2013) Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med 10:e1001453

  • Mayr A, Hofner B, Schmid M (2012) The importance of knowing when to stop. A sequential stopping rule for component-wise gradient boosting. Methods Inf Med 51:178–186

  • Mayr A, Binder H, Gefeller O, Schmid M (2014) The evolution of boosting algorithms. Methods Inf Med 53:419–427

  • McCullagh P, Nelder J (1989) Generalized linear models. Chapman and Hall, London

  • R Development Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria

  • Ridgeway G (1999) Generalization of boosting algorithms and applications of Bayesian inference for massive datasets. Ph.D. thesis, University of Washington

  • Ridgeway G (2010) gbm: Generalized boosted regression models. R package version 1.6. http://CRAN.R-project.org/package=gbm

  • Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197–227

  • Schmid M, Hothorn T (2008) Flexible boosting of accelerated failure time models. BMC Bioinform 9:269

  • Truntzer C, Mostacci E, Jeannin A, Petit JM, Ducoroy P, Cardot H (2014) Comparison of classification methods that combine clinical data and high-dimensional mass spectrometry data. BMC Bioinform 15:385

  • Tutz G, Binder H (2006) Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics 62:961–971

  • Tutz G, Binder H (2007) Boosting ridge regression. Comput Stat Data Anal 51:6044–6059

  • Van der Laan MJ, Robins JM (2003) Unified methods for censored longitudinal data and causality. Springer, New York

Acknowledgments

RDB was financed by Grant BO3139/4-1 from the German Science Foundation (DFG). Special thanks are devoted to Anne-Laure Boulesteix for her advice and suggestions, to Rory Wilson for his help with linguistic improvements and to the two anonymous reviewers for their comments which led to an improved version of the paper.

Author information

Corresponding author

Correspondence to Riccardo De Bin.

Appendix

In the paper we showed that, in the case of the linear Cox model, the algorithms used by the R packages mboost (through the function glmboost) and CoxBoost follow different learning paths, in contrast to the Gaussian linear regression case, in which the same result is produced provided that \(\lambda =n(1-\nu )/\nu \) (Binder 2013b). Note that the equivalence in the linear regression case only holds for the component-wise version of boosting. In the non-component-wise version, in which all the dimensions of \(\hat{\beta }\) are updated simultaneously, the two weak estimators have the form

$$\begin{aligned} \hat{b}^{LB} = (X^\top X+\lambda P)^{-1}X^\top u \quad \quad \text{ and } \quad \quad \nu \hat{b}^{MB} = \nu (X^\top X)^{-1}X^\top u, \end{aligned}$$

for likelihood-based and model-based boosting, respectively. While the model-based penalty \(\nu \) affects all dimensions identically, the penalty \(\lambda \) penalizes the dimensions differently, depending on the correlation structure of X. Consider \(P=I_p\), the identity matrix used as the default in CoxBoost. The weak estimator \(\hat{b}^{LB}\) is then a ridge estimator: when the response is projected onto the orthonormal basis of the explanatory variables (the columns of X), the penalty term shrinks the coordinates in inverse proportion to the variance of the related principal components. This means, in particular, that \(\lambda \) penalizes (shrinks) more strongly the coefficients related to principal components with low variance. If we look at the predicted values obtained through the two algorithms, denoting by B the orthonormal basis of the columns of X, we obtain

$$\begin{aligned}&\hat{y}^{MB}=B \; \text{ diag }(1-(1-\nu )^{m+1}) B^\top y\\&\hat{y}^{LB}=B \; \text{ diag }(1-(1-\frac{d_j}{d_j+\lambda })^{m+1}) B^\top y, \end{aligned}$$

where \(d_j\), \(j=1,\ldots ,p\), is the j-th eigenvalue of the matrix \(X^\top X\) (which, divided by n, is the variance of the j-th principal component) and m indexes the iterations performed. Setting m to 0 yields the formula for a single-step update. It is worth noting that, due to its stage-wise nature, the boosting ridge regression algorithm leads to a different penalization (and, therefore, to different estimates) than the usual ridge regression (Tutz and Binder 2007).
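These closed-form expressions can be checked numerically in the Gaussian linear regression case. The following sketch uses only base R and simulated data (the values of n, p, \(\nu \), \(\lambda \) and m are arbitrary illustrative choices, not taken from the paper); it iterates the two non-component-wise boosting updates explicitly and compares the resulting fitted values with the formulas above.

## Base-R sketch: iterate the (non-component-wise) boosting updates in the
## Gaussian case and compare with the closed-form fitted values given above.
set.seed(1)
n <- 100; p <- 5; nu <- 0.1; lambda <- 50; m <- 20
X <- scale(matrix(rnorm(n * p), n, p))
y <- rnorm(n)

H_MB <- X %*% solve(crossprod(X)) %*% t(X)                     # least-squares hat matrix
H_LB <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)  # penalized (P = I_p) counterpart

yhat_MB <- yhat_LB <- rep(0, n)
for (k in 0:m) {                                  # m + 1 boosting iterations
  yhat_MB <- yhat_MB + nu * H_MB %*% (y - yhat_MB)
  yhat_LB <- yhat_LB + H_LB %*% (y - yhat_LB)
}

sv <- svd(X)
B <- sv$u                                         # orthonormal basis of the columns of X
d <- sv$d^2                                       # eigenvalues of X'X
yhat_MB_cf <- B %*% diag(rep(1 - (1 - nu)^(m + 1), p)) %*% t(B) %*% y
yhat_LB_cf <- B %*% diag(1 - (1 - d / (d + lambda))^(m + 1)) %*% t(B) %*% y
all.equal(c(yhat_MB), c(yhat_MB_cf))              # TRUE
all.equal(c(yhat_LB), c(yhat_LB_cf))              # TRUE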

We can obtain a uniform penalization with the likelihood-based boosting by setting \(P = (1/n)X^\top X\), provided that the columns of X are centered around 0 and standardized. In this case, \((1/n)X^\top X\) represents the correlation matrix of X.
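As a minimal numerical check of this point (again only base R and simulated data; n, p, \(\nu \) and the working response u are arbitrary), the sketch below shows that with P equal to the correlation matrix the likelihood-based weak estimator is a uniformly shrunken least-squares estimator, and that the choice \(\lambda =n(1-\nu )/\nu \) makes the shrinkage factor equal to \(\nu \), as in the model-based update.

## Base-R sketch: with P = (1/n) X'X and standardized columns, the penalized
## weak estimator shrinks the least-squares estimator uniformly by n/(n + lambda).
set.seed(1)
n <- 100; p <- 5; nu <- 0.1
X <- scale(matrix(rnorm(n * p), n, p)) * sqrt(n / (n - 1))  # mean 0, variance 1 (divisor n)
u <- rnorm(n)                                               # generic working response
P <- crossprod(X) / n                                       # correlation matrix of X
lambda <- n * (1 - nu) / nu

b_LB  <- solve(crossprod(X) + lambda * P, crossprod(X, u))  # likelihood-based weak estimator
b_OLS <- solve(crossprod(X), crossprod(X, u))               # unpenalized least squares
all.equal(c(b_LB), c(n / (n + lambda) * b_OLS))             # TRUE: uniform shrinkage
all.equal(n / (n + lambda), nu)                             # TRUE by construction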

Note that it is possible to use a penalized least squares estimator within the model-based boosting algorithm as well. This solution seems to be gaining popularity (see, e.g., Hofner et al. 2014), because it allows better handling of categorical variables. Note that in this case the overall penalization combines the effects of \(\lambda \) and \(\nu \). Another issue related to categorical variables concerns their variance. In this paper we considered a standardized X, but, as seen in the examples, in practical situations a standardization step is needed. Likelihood-based boosting, in particular, requires all \(X_j\) to have variance 1. For this reason, in the real data example we coded the dummy variables as \((-1,1)\). In the case of completely balanced observations, the variance of a binary variable is then equal to 1. Unfortunately, this balance rarely occurs in practice. A possible solution which does not require the standardization of X is to replace P by the covariance matrix of X or, for the component-wise version, by diag\(((1/n)X^\top X)\): this implicitly standardizes the binary variables as well, using their observed standard deviations.
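To illustrate this last point, the following base-R sketch (simulated data; \(\lambda \), the class imbalance and the working response u are arbitrary) compares, for an unbalanced binary covariate coded \((-1,1)\), the component-wise update computed with the covariate-specific penalty \(\lambda \cdot (1/n)x_j^\top x_j\) on the centered but unstandardized covariate with the update obtained after explicit standardization: the two fitted contributions coincide.

## Base-R sketch: a component-wise penalty proportional to the covariate's
## observed variance makes explicit standardization unnecessary.
set.seed(1)
n <- 100; lambda <- 900
x <- sample(c(-1, 1), n, replace = TRUE, prob = c(0.8, 0.2))  # unbalanced dummy, variance < 1
u <- rnorm(n)                                                 # generic working response
xc <- x - mean(x)                                             # centered covariate
v  <- mean(xc^2)                                              # (1/n) * xc'xc, its observed variance

b_raw <- sum(xc * u) / (sum(xc^2) + lambda * v)   # covariate-specific penalty, no standardization
xs    <- xc / sqrt(v)                             # explicitly standardized covariate
b_std <- sum(xs * u) / (sum(xs^2) + lambda)       # common penalty after standardization

all.equal(b_raw * xc, b_std * xs)                 # TRUE: identical fitted contributions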

Cite this article

De Bin, R. Boosting in Cox regression: a comparison between the likelihood-based and the model-based approaches with focus on the R-packages CoxBoost and mboost. Comput Stat 31, 513–531 (2016). https://doi.org/10.1007/s00180-015-0642-2
