\(L_1\) splitting rules in survival forests

Abstract

The log-rank test is used as the splitting rule in many commonly used survival tree and forest algorithms. However, the log-rank test can lose substantial power in some circumstances, especially when the hazard functions or the survival functions of the two compared groups cross. We investigate the use of the integrated absolute difference between the survival functions of the two child nodes as the splitting rule. Simulation studies and applications to real data sets show that forests built with this rule generally perform very well, and that they often outperform forests built with the log-rank splitting rule.

References

  • Ambler G, Benner A (2014) mfp: multivariable fractional polynomials. R package version 1.5.0. http://CRAN.R-project.org/package=mfp

  • Bache K, Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Bou-Hamad I, Larocque D, Ben-Ameur H (2011) A review of survival trees. Stat Surv 5:44–71

  • Boulesteix AL, Janitza S, Kruppa J, König IR (2012) Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip Rev Data Min Knowl Discov 2(6):493–507

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth & Brooks, Monterey

  • Breslow NE, Chatterjee N (1999) Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. J R Stat Soc Ser C (Appl Stat) 48(4):457–468

  • Chen X, Ishwaran H (2012) Random forests for genomic data analysis. Genomics 99(6):323–329

  • Chen X, Ishwaran H (2013) Pathway hunting by random survival forests. Bioinformatics 29(1):99–105

  • Ciampi A, Thiffault J, Nakache JP, Asselain B (1986) Stratification by stepwise regression, correspondence analysis and recursive partition: a comparison of three methods of analysis for survival data with covariates. Comput Stat Data Anal 4(3):185–204

  • Ciampi A, Hogg SA, McKinney S, Thiffault J (1988) RECPAM: a computer program for recursive partition and amalgamation for censored survival data and other situations frequently occurring in biostatistics. I. Methods and program features. Comput Methods Progr Biomed 26(3):239–256

  • Cutler A, Zhao G (2001) PERT: perfect random tree ensembles. Comput Sci Stat 33:490–497

  • De Bin R, Sauerbrei W, Boulesteix AL (2014) Investigating the prediction ability of survival models based on both clinical and omics data: two case studies. Stat Med 33(30):5310–5329

  • Fleming TR, Harrington DP (1991) Counting processes and survival analysis. Wiley, Hoboken

  • Gordon L, Olshen RA (1985) Tree-structured survival analysis. Cancer Treat Rep 69(10):1065

  • Graf E, Schmoor C, Sauerbrei W, Schumacher M (1999) Assessment and comparison of prognostic classification schemes for survival data. Stat Med 18(17–18):2529–2545

  • Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA (1982) Evaluating the yield of medical tests. JAMA 247(18):2543–2546

  • Hosmer DW Jr, Lemeshow S, May S (2011) Applied survival analysis: regression modeling of time to event data. Wiley, Chichester

  • Hothorn T, Lausen B (2003) On the exact distribution of maximally selected rank statistics. Comput Stat Data Anal 43(2):121–137

  • Hothorn T, Bühlmann P, Dudoit S, Molinaro A, Van Der Laan MJ (2006a) Survival ensembles. Biostatistics 7(3):355–373

  • Hothorn T, Hornik K, Zeileis A (2006b) Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat 15(3):651–674

  • Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80(13):1056–1064

  • Ishwaran H, Kogalur UB (2014) Random forests for survival, regression and classification (rf-src). R package version 1.5.5. http://cran.r-project.org/web/packages/randomForestSRC/

  • Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008) Random survival forests. Ann Appl Stat 2(3):841–860

  • Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS (2010) High-dimensional variable selection for survival data. J Am Stat Assoc 105(489):205–217

  • Ishwaran H, Kogalur UB, Chen X, Minn AJ (2011) Random survival forests for high-dimensional data. Stat Anal Data Min 4(1):115–132

  • Kalbfleisch JD, Prentice RL (1980) The statistical analysis of failure time data. Wiley series in probability and mathematical statistics. Wiley, New York

  • Leblanc M, Crowley J (1993) Survival trees by goodness of split. J Am Stat Assoc 88(422):457–467

  • Lin X, Wang H (2004) A new testing approach for comparing the overall homogeneity of survival curves. Biom J 46(5):489–496

  • Lin X, Xu Q (2010) A new method for the comparison of survival distributions. Pharm Stat 9(1):67–76

  • Lin Y, Jeon Y (2006) Random forests and adaptive nearest neighbors. J Am Stat Assoc 101(474):578–590

  • Loh WY (2002) Regression trees with unbiased variable selection and interaction detection. Stat Sin 12(2):361–386

  • Loh WY (2013) GUIDE classification and regression trees user manual for version 15

  • Mogensen UB, Ishwaran H, Gerds TA (2012) Evaluating random forests for survival analysis using prediction error curves. J Stat Softw 50(11):1

  • R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

  • Rokach L (2009) Taxonomy for characterizing ensemble methods in classification tasks: a review and annotated bibliography. Comput Stat Data Anal 53(12):4046–4072

  • Sauerbrei W, Royston P (1999) Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc Ser A (Stat Soc) 162(1):71–94

  • Scheike T, Martinussen T, Silver J (2009) timereg: timereg package for flexible regression models for survival data. R package version, pp 1–2

  • Schlichting P, Christensen E, Andersen PK, Fauerholdt L, Juhl E, Poulsen H, Tygstrup N (1983) Prognostic factors in cirrhosis identified by Cox’s regression model. Hepatology 3(6):889–895

  • Schumacher M, Bastert G, Bojar H, Huebner K, Olschewski M, Sauerbrei W, Schmoor C, Beyerle C, Neumann RL, Rauschecker HF (1994) Randomized 2 \(\times \) 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. J Clin Oncol 12(10):2086–2093

  • Segal MR (1988) Regression trees for censored data. Biometrics 44:35–47

  • Siroky DS (2009) Navigating random forests and related advances in algorithmic modeling. Stat Surv 3:147–163

  • Therneau TM (2014) A package for survival analysis in S. R package version 2.37-7. http://CRAN.R-project.org/package=survival

  • Verikas A, Gelzinis A, Bacauskiene M (2011) Mining data with random forests: a survey and results of new tests. Pattern Recogn 44(2):330–349

  • Zhu R, Kosorok MR (2012) Recursively imputed survival trees. J Am Stat Assoc 107(497):331–340

Acknowledgments

The authors would like to thank the Associate Editor and two reviewers whose comments helped in preparing an improved version of this article. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by Le Fonds québécois de la recherche sur la nature et les technologies (FQRNT).

Author information

Corresponding author

Correspondence to Denis Larocque.

Appendix

Hazard function formula for DGP 1

$$\begin{aligned} {\left\{ \begin{array}{ll} 0.27t &{} x_1\le 0.5, t\le 2 \\ 0.2(t-2)+5.4 &{} x_1\le 0.5, t>2 \\ 0.1t &{} x_1>0.5, t\le 6 \\ 5.5(t-6)+0.6 &{} x_1>0.5, t>6. \\ \end{array}\right. } \end{aligned}$$
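The piecewise hazard for DGP 1 can be transcribed directly into code. The following is a minimal Python sketch that copies the constants exactly as printed above and adds a cumulative hazard and a survival function obtained by integrating each linear piece. The function names are illustrative and do not come from the article.

```python
import math


def hazard_dgp1(t, x1):
    """Hazard for DGP 1, with the constants copied as printed."""
    if x1 <= 0.5:
        return 0.27 * t if t <= 2 else 0.2 * (t - 2) + 5.4
    return 0.1 * t if t <= 6 else 5.5 * (t - 6) + 0.6


def cumulative_hazard_dgp1(t, x1):
    """Integral of the hazard from 0 to t, computed piece by piece."""
    if x1 <= 0.5:
        if t <= 2:
            return 0.135 * t ** 2
        u = t - 2
        return 0.135 * 2 ** 2 + 5.4 * u + 0.1 * u ** 2
    if t <= 6:
        return 0.05 * t ** 2
    u = t - 6
    return 0.05 * 6 ** 2 + 0.6 * u + 2.75 * u ** 2


def survival_dgp1(t, x1):
    """S(t | x1) = exp(-H(t | x1))."""
    return math.exp(-cumulative_hazard_dgp1(t, x1))


if __name__ == "__main__":
    for t in (1.0, 2.0, 4.0, 6.0):
        print(t, survival_dgp1(t, 0.3), survival_dgp1(t, 0.8))
```

Printing the two survival curves side by side on a grid of times is a quick way to see how the two covariate groups differ under this hazard.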

Hazard function formula for DGP 2

$$\begin{aligned} {\left\{ \begin{array}{ll} 0.27t &{} x_1\le 0.5, x_2\le 0.5, t\le 2 \\ 0.2(t-2)+5.4 &{} x_1\le 0.5, x_2\le 0.5, t>2 \\ 0.27t &{} x_1\le 0.5, x_2>0.5, t\le 2 \\ 5.5(t-2)+5.4 &{} x_1\le 0.5, x_2>0.5, t>2 \\ 0.1t &{} x_1>0.5, x_2\le 0.5, t\le 6 \\ 0.2(t-6)+0.6 &{} x_1>0.5, x_2\le 0.5, t>6 \\ 0.1t &{} x_1>0.5,x_2>0.5, t\le 6 \\ 5.5(t-6)+0.6 &{} x_1>0.5, x_2>0.5, t>6. \\ \end{array}\right. } \end{aligned}$$
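Read together, the eight cases of DGP 2 have a simple structure: \(x_1\) sets the early slope, the change point and the hazard level just after it, while \(x_2\) sets the late slope. The Python sketch below encodes that reading and draws event times by inverting the cumulative hazard, \(T=H^{-1}(E)\) with \(E\) a unit exponential variable. This is one standard way to simulate from a known hazard and is an assumption here: the article's own simulation code, covariate distribution and censoring mechanism are not shown in this excerpt, and all function names are hypothetical.

```python
import math
import random


def dgp2_params(x1, x2):
    """Return (early slope a, change point c, level d after c, late slope b)."""
    a, c, d = (0.27, 2.0, 5.4) if x1 <= 0.5 else (0.1, 6.0, 0.6)
    b = 0.2 if x2 <= 0.5 else 5.5
    return a, c, d, b


def cumulative_hazard_dgp2(t, x1, x2):
    """H(t): integral of a*s up to c, then of b*(s - c) + d afterwards."""
    a, c, d, b = dgp2_params(x1, x2)
    if t <= c:
        return 0.5 * a * t ** 2
    u = t - c
    return 0.5 * a * c ** 2 + d * u + 0.5 * b * u ** 2


def invert_cumulative_hazard(e, x1, x2):
    """Solve H(t) = e for t >= 0."""
    a, c, d, b = dgp2_params(x1, x2)
    h_c = 0.5 * a * c ** 2
    if e <= h_c:
        return math.sqrt(2.0 * e / a)
    # Positive root of 0.5*b*u^2 + d*u - (e - h_c) = 0.
    u = (-d + math.sqrt(d ** 2 + 2.0 * b * (e - h_c))) / b
    return c + u


def sample_time_dgp2(x1, x2, rng):
    """Draw one uncensored event time by inverse-transform sampling."""
    return invert_cumulative_hazard(rng.expovariate(1.0), x1, x2)


if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_time_dgp2(0.2, 0.8, rng), 3) for _ in range(5)])
```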

Simplification of the Lin and Xu (2010) statistic leading to the \(L_1^*\) splitting rule

For \(i=L,R\), the left and right nodes, denote by \(\hat{\sigma }_i^2\) the estimated variance of \(\hat{S}_i\) from Greenwood’s formula. To perform a formal test of the equality of the survival functions in the left and right nodes, Lin and Xu (2010) propose the statistic

$$\begin{aligned} \Delta ^* = \frac{\Delta -\hat{E}(\Delta )}{\sqrt{\widehat{Var}(\Delta )}} \end{aligned}$$

where

$$\begin{aligned} \hat{E}(\Delta ) = \sum _{j|t_j<\tau } \{2/\pi (\hat{\sigma }_L^2(t_j) + \hat{\sigma }_R^2(t_j))\}^{1/2} (t_{j+1}-t_j) \end{aligned}$$

and

$$\begin{aligned} \widehat{Var}(\Delta )= & {} \sum _{j|t_j<\tau } (t_{j+1}-t_j)^2 (1-2/\pi ) (\hat{\sigma }_L^2(t_j) + \hat{\sigma }_R^2(t_j)) \\&+ \sum _{j<j'|t_j,t_{j'}<\tau } (t_{j+1}-t_j) (t_{j'+1}-t_{j'}) (1-2/\pi ) \\&\times \{(\hat{\sigma }_L^2(t_j) + \hat{\sigma }_R^2(t_j))(\hat{\sigma }_L^2(t_{j'}) + \hat{\sigma }_R^2(t_{j'}))\}^{1/2} \end{aligned}$$

are estimates of \(E(\Delta )\) and \(Var(\Delta )\). These estimates arise from a normal approximation for \(\hat{S}_L(t)-\hat{S}_R(t)\), and the test statistic \(\Delta ^*\) is asymptotically normally distributed under the null hypothesis of equality of the two survival functions. To simplify this statistic and speed up computations for tree building, assume that there is no censoring and that all observations come from the same population with survival function \(S(t)\), that is, that the null hypothesis holds. Then \(Var(\hat{S}_i(t))=S(t)(1-S(t))/n_i\) for \(i=L,R\), where \(n_i\) is the number of observations in node \(i\). In that case,

$$\begin{aligned} \hat{E}(\Delta )= & {} \sqrt{2/\pi } \sqrt{(n_L+n_R)/(n_L n_R)} \sum _{j|t_j<\tau } (S(t_j)(1-S(t_j)))^{1/2} (t_{j+1}-t_j)\\= & {} c_1/\sqrt{n_L n_R} \end{aligned}$$

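where \(c_1\) is the same constant for all candidate splits, because \(n_L+n_R\) is the size of the parent node and the sum over \(j\) does not depend on the split. Similarly, \(\widehat{Var}(\Delta )=c^2_2/(n_L n_R)\), where \(c_2>0\) is the same constant for all candidate splits. Hence,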

$$\begin{aligned} \frac{\Delta -\hat{E}(\Delta )}{\sqrt{\widehat{Var}(\Delta )}} = \frac{\sqrt{n_L n_R} \Delta }{c_2} - \frac{c_1}{c_2}. \end{aligned}$$
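The scaling argument can be checked numerically. The Python sketch below implements \(\hat{E}(\Delta )\) and \(\widehat{Var}(\Delta )\) term by term as printed above, then plugs in the no-censoring null variances \(S(t)(1-S(t))/n_i\) for several splits of the same parent node. The exponential survival function, the time grid and the node size of 100 are arbitrary illustrative assumptions, and treating the last grid point as \(\tau \) is also an assumption.

```python
import numpy as np


def lin_xu_moments(times, var_left, var_right):
    """Estimated mean and variance of Delta, following the printed formulas.

    times: increasing grid t_1 < ... < t_{m+1}; the last point plays the
        role of tau (assumption).
    var_left, var_right: variances of the two survival estimates at
        t_1, ..., t_m (length m).
    """
    widths = np.diff(np.asarray(times, dtype=float))  # t_{j+1} - t_j
    sd = np.sqrt(np.asarray(var_left, dtype=float)
                 + np.asarray(var_right, dtype=float))
    w = widths * sd
    expected = np.sqrt(2.0 / np.pi) * np.sum(w)
    diag = np.sum(w ** 2)                       # terms with j = j'
    cross = (np.sum(w) ** 2 - diag) / 2.0       # sum over j < j' of w_j * w_j'
    variance = (1.0 - 2.0 / np.pi) * (diag + cross)
    return expected, variance


# Hypothetical null setting: exponential survival, parent node of size 100.
times = np.linspace(0.0, 5.0, 51)
surv = np.exp(-0.4 * times[:-1])
n_parent = 100

constants = []
for n_left in (10, 30, 50):
    n_right = n_parent - n_left
    var_l = surv * (1.0 - surv) / n_left
    var_r = surv * (1.0 - surv) / n_right
    expected, variance = lin_xu_moments(times, var_l, var_r)
    scale = np.sqrt(n_left * n_right)
    # These products should not depend on the split: they are c_1 and c_2.
    constants.append((expected * scale, np.sqrt(variance) * scale))

if __name__ == "__main__":
    for c1, c2 in constants:
        print(c1, c2)
```

The three printed pairs should agree up to floating-point error, which is the numerical counterpart of the statement that \(c_1\) and \(c_2\) are the same for all candidate splits.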

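Because \(c_1\) and \(c_2>0\) do not depend on the split, this last expression is an increasing function of \(\sqrt{n_L n_R} \Delta \). Ranking candidate splits by it is therefore equivalent to using \(\sqrt{n_L n_R} \Delta \) as the splitting criterion.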
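As an illustration of the resulting criterion, the Python sketch below computes \(\sqrt{n_L n_R} \Delta \) for one candidate split, taking \(\Delta \) to be the integral of \(|\hat{S}_L(t)-\hat{S}_R(t)|\) over \([0,\tau ]\) with Kaplan-Meier estimates in each child node. The function names, the handling of \(\tau \) as a user-supplied argument, and the plain NumPy implementation are assumptions made for this example; they are not taken from the authors' software.

```python
import numpy as np


def kaplan_meier(time, event):
    """Distinct event times and the Kaplan-Meier estimate just after each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = np.empty(len(event_times))
    s = 1.0
    for k, t in enumerate(event_times):
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        s *= 1.0 - deaths / at_risk
        surv[k] = s
    return event_times, surv


def step_value(event_times, surv, t):
    """Evaluate the right-continuous Kaplan-Meier step function at times t."""
    idx = np.searchsorted(event_times, t, side="right")
    return np.concatenate(([1.0], surv))[idx]


def l1_star(time, event, go_left, tau):
    """sqrt(n_L * n_R) times the integrated absolute difference on [0, tau]."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    go_left = np.asarray(go_left, dtype=bool)
    km_left = kaplan_meier(time[go_left], event[go_left])
    km_right = kaplan_meier(time[~go_left], event[~go_left])
    # Both curves are constant between consecutive knots.
    knots = np.unique(np.concatenate(([0.0], km_left[0], km_right[0], [tau])))
    knots = knots[knots <= tau]
    starts = knots[:-1]
    diff = np.abs(step_value(*km_left, starts) - step_value(*km_right, starts))
    delta = np.sum(diff * np.diff(knots))
    n_left, n_right = int(go_left.sum()), int((~go_left).sum())
    return float(np.sqrt(n_left * n_right) * delta)


if __name__ == "__main__":
    # Toy node: four uncensored observations, split two against two.
    print(l1_star([1, 2, 3, 4], [1, 1, 1, 1], [True, True, False, False], 4.0))
```

In a tree-growing loop, a function like `l1_star` would be evaluated for each candidate split of a node and the split with the largest value retained.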

Cite this article

Moradian, H., Larocque, D. & Bellavance, F. \(L_1\) splitting rules in survival forests. Lifetime Data Anal 23, 671–691 (2017). https://doi.org/10.1007/s10985-016-9372-1

Keywords

  • Survival data
  • Right-censored data
  • Ensemble methods
  • Random forests
  • Survival forests