A random survival forest (RSF) is an ensemble-of-trees method for the analysis of right-censored time-to-event data and an extension of Breiman's random forest method [14, 20]. Survival trees and forests are popular non-parametric alternatives to (semi-)parametric models for time-to-event analysis. They offer great flexibility and can automatically detect certain types of interactions without the need to specify them beforehand [13]. A survival tree is built by recursively partitioning the covariate space to form groups of subjects who are similar with respect to the time-to-event outcome. Homogeneity at a node is achieved by minimizing a given impurity measure. The basic approach for building a survival tree uses a binary split on a single predictor. For a continuous covariate X, a split is defined as \(X \leq c\), where c is some constant. For a categorical covariate X with many split-points, the potential split is \(X \in \{c_{1},\ldots,c_{k}\}\), where \(c_{1},\ldots,c_{k}\) are potential split values of the predictor variable X. The goal in survival tree building is to identify prognostic factors that are predictive of the time-to-event outcome. In tree building, a binary split is chosen such that the two daughter nodes obtained from the parent node are dissimilar, and several split-rules (i.e., different impurity measures) for time-to-event data have been suggested over the years [13, 21].
The impurity measure or the split-rule of the algorithm is very important in survival tree building. In this article, we used the log-rank and the log-rank score split-rules [22–24].
The log-rank split-rule
Suppose a node h can be split into two daughter nodes α and β. The best split at node h on a covariate x at a split point \(s^{\star}\) is the one that gives the largest log-rank statistic between the two daughter nodes [22]. The algorithm for building a survival tree using the split-rule based on the log-rank statistic [13, 22, 25, 26] is given in Algorithm 1 below.
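To make the split search concrete, the steps above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the implementation used in the study; the function names are our own, and ties in event times are handled in the simplest possible way.

```python
from math import sqrt

def logrank_statistic(times, events, groups):
    """Standardised log-rank statistic comparing the two daughter nodes
    defined by `groups` (0 = left node, 1 = right node).
    times: observed times; events: 1 = event, 0 = censored."""
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    num = var = 0.0
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)                    # at risk, total
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 0)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 0)
        num += d1 - d * n1 / n                                   # observed - expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return abs(num) / sqrt(var) if var > 0 else 0.0

def best_logrank_split(times, events, x):
    """Exhaustive search for the split point with the largest statistic."""
    best_s, best_stat = None, 0.0
    for s in sorted(set(x))[:-1]:          # candidate split points x <= s
        groups = [0 if xi <= s else 1 for xi in x]
        stat = logrank_statistic(times, events, groups)
        if stat > best_stat:
            best_s, best_stat = s, stat
    return best_s, best_stat
```

On a toy sample in which early and late event times are separated by the covariate, the search returns a split point between the two groups with a large standardised statistic.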
The log-rank score split-rule
The log-rank score split-rule [23] is a modification of the log-rank split-rule mentioned above and uses the log-rank scores [24]. Let \(r=(r_{1},r_{2},\ldots,r_{N})\) denote the rank vector of the survival times with their event indicators \((T,\delta)=((T_{1},\delta_{1}),(T_{2},\delta_{2}),\ldots,(T_{N},\delta_{N}))\), and let \(a=a(T,\delta)=(a_{1}(r),a_{2}(r),\ldots,a_{N}(r))\) denote the score vector depending on the ranks in r. Assume that the ranks order the predictor variable in such a way that \(x_{1}<x_{2}<\ldots<x_{N}\). The log-rank score for the observation at \(T_{l}\) is given by:
$$ a_{l}=a_{l} \left(T,\delta\right)=\delta_{l} -\sum_{k=1}^{\gamma_{l}(T)}\frac{\delta_{k}}{N-\gamma_{k}(T)+1}\,{,} $$
(1)
where
$$\gamma_{k}(T)=\sum_{l=1}^{N} \chi \{T_{l} \leqslant T_{k} \} $$
is the number of individuals that have had the event of interest or were censored before or at time \(T_{k}\). The standardised log-rank score statistic for a split of the form \(x\leq s^{\star}\) is then given by:
$$ i\left(x,s^{\star}\right)=\frac{\sum_{x_{j}\leq s^{\star}}a_{j}-R_{1}\bar{a}}{\sqrt{R_{1}\left(1-\frac{R_{1}}{N}\right)S_{a}^{2}}}\,, $$
(2)
where \(\bar{a}\) and \(S_{a}^{2}\) are the mean and sample variance of the scores \(\{a_{j}: j=1,2,\ldots,N\}\), and \(R_{1}\) is the number of observations j with \(x_{j}\leq s^{\star}\). The best split is the one that maximizes \(|i(x,s^{\star})|\) over all covariates and all possible split points \(s^{\star}\).
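To make Eqs. (1) and (2) concrete, the scores and the standardised statistic can be computed as follows. This is an illustrative Python sketch under simplifying assumptions (subjects sorted by ascending survival time with no ties, so that \(\gamma_{k}(T)=k\)); the function names are ours, not from the paper.

```python
from math import sqrt

def logrank_scores(delta):
    """Log-rank scores a_l of Eq. (1), assuming subjects are already
    sorted by ascending survival time with no ties, so gamma_k(T) = k."""
    N = len(delta)
    scores, cum = [], 0.0
    for l in range(N):                 # 0-based: gamma_l = l + 1
        cum += delta[l] / (N - l)      # delta_k / (N - gamma_k + 1)
        scores.append(delta[l] - cum)
    return scores

def logrank_score_statistic(a, x, s):
    """Standardised log-rank score statistic i(x, s*) of Eq. (2)."""
    N = len(a)
    a_bar = sum(a) / N
    S2 = sum((ai - a_bar) ** 2 for ai in a) / (N - 1)   # sample variance
    R1 = sum(1 for xi in x if xi <= s)                  # size of left node
    num = sum(ai for ai, xi in zip(a, x) if xi <= s) - R1 * a_bar
    den = sqrt(R1 * (1 - R1 / N) * S2)
    return num / den if den > 0 else 0.0
```

With no censoring the scores sum to zero, and the best split maximises \(|i(x,s^{\star})|\) over the candidate split points.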
Single trees are generally unstable, and hence researchers have recommended growing a collection of trees [10, 27], commonly referred to as a random survival forest [20, 26].
Random survival forests algorithm
The random survival forests algorithm implementation is shown in Algorithm 2 [20, 26].
For this study, we used the log-rank and the log-rank score split-rules in Step 2 of the algorithm. Two random survival forest models were generated, denoted RSF1 and RSF2: RSF1 consists of survival trees built using the log-rank split-rule, whereas RSF2 consists of survival trees built using the log-rank score split-rule.
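Although Algorithm 2 is not reproduced here, its overall structure (bootstrap resampling, random candidate covariates at each node, recursive binary splitting, and terminal-node summaries) can be sketched as follows. This is a simplified, hypothetical illustration: for brevity it scores splits by the absolute difference in mean observed time between daughter nodes rather than by the log-rank or log-rank score statistics used for RSF1 and RSF2, and the terminal-node summary is a crude mean time rather than a cumulative-hazard estimate.

```python
import random
from statistics import mean

def grow_tree(data, mtry, min_node=3, depth=0, max_depth=3):
    """Grow one survival tree. data: list of (x_vector, time, event).
    At each node, `mtry` candidate covariates are drawn at random.
    Split score: |mean time left - mean time right| (a stand-in for
    the log-rank / log-rank score rules of the paper)."""
    if len(data) < 2 * min_node or depth == max_depth:
        return {"leaf": mean(t for _, t, _ in data)}   # terminal summary
    p = len(data[0][0])
    best = None
    for j in random.sample(range(p), min(mtry, p)):    # random covariates
        for s in sorted({row[0][j] for row in data})[:-1]:
            left = [r for r in data if r[0][j] <= s]
            right = [r for r in data if r[0][j] > s]
            if len(left) < min_node or len(right) < min_node:
                continue
            score = abs(mean(t for _, t, _ in left) -
                        mean(t for _, t, _ in right))
            if best is None or score > best[0]:
                best = (score, j, s, left, right)
    if best is None:
        return {"leaf": mean(t for _, t, _ in data)}
    _, j, s, left, right = best
    return {"var": j, "split": s,
            "left": grow_tree(left, mtry, min_node, depth + 1, max_depth),
            "right": grow_tree(right, mtry, min_node, depth + 1, max_depth)}

def grow_forest(data, n_trees=25, mtry=1):
    forest = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]     # bootstrap sample
        forest.append(grow_tree(boot, mtry))
    return forest

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["var"]] <= tree["split"] else tree["right"]
    return tree["leaf"]
```

The ensemble prediction is the average over trees; in the real algorithm the terminal-node summary is the Nelson-Aalen cumulative hazard and the ensemble averages those estimates.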
The random survival forests algorithm has been criticised for having a bias towards selecting variables with many split-points, and the conditional inference forest algorithm has been identified as a method to reduce this selection bias. Conditional inference forests are formulated in such a way that the algorithm for selecting the best splitting covariate is separated from the algorithm for selecting the best split-point [15–18]. To illustrate this, consider a dataset with a time-to-event outcome variable T and two explanatory variables \(x_{1}\) and \(x_{2}\) with \(k_{1}\) and \(k_{2}\) possible split-points, respectively. Furthermore, suppose that T is independent of \(x_{1}\) and \(x_{2}\), and that \(k_{1}<k_{2}\). In the random survival forests algorithm, the search for the best covariate to split on and the best split-point compares the effect of both covariates on T, which gives \(x_{2}\) the higher probability of being selected just by chance.
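This selection bias is easy to demonstrate with a small simulation. The sketch below is our own illustration, not taken from the paper: it uses a simple mean-difference impurity as a stand-in for the log-rank statistic (the argument is identical), with a binary covariate x1 (one possible split-point) and a continuous covariate x2 (about n-1 possible split-points), both independent of the outcome.

```python
import random

def best_split_gain(x, y):
    """Best absolute mean-difference over all split points of x
    (a stand-in impurity; the bias argument is the same for log-rank)."""
    best = 0.0
    for s in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        best = max(best, abs(sum(left) / len(left) - sum(right) / len(right)))
    return best

random.seed(1)
wins, trials, n = 0, 500, 30
for _ in range(trials):
    y = [random.gauss(0, 1) for _ in range(n)]     # outcome independent of both
    x1 = [random.randint(0, 1) for _ in range(n)]  # k1 = 1 possible split
    x2 = [random.random() for _ in range(n)]       # k2 ~ n - 1 possible splits
    if best_split_gain(x2, y) > best_split_gain(x1, y):
        wins += 1
print(wins / trials)   # typically well above 0.5: x2 wins just by chance
```

Because the exhaustive search maximises the impurity gain over many more candidate split-points for x2 than for x1, x2 is selected in the large majority of trials even though neither covariate carries any signal.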
Conditional inference trees and forests
Algorithm 3 outlines the general algorithm for building a conditional inference tree as presented by [28].
For time-to-event data, the optimal split-variable in Step 1 is obtained by testing the association of all the covariates with the time-to-event outcome using an appropriate linear rank test [28, 29]. The covariate with the strongest association with the time-to-event outcome, based on permutation tests [28], is selected for splitting. In covariate selection, a linear rank test based on the log-rank transformation (log-rank scores) is performed. Using the distribution of the resulting rank statistic, p-values are evaluated, and the covariate with the minimum p-value has the strongest association with the outcome [17, 30, 31]. Although the association test is done in the first step, a standard binary split is done in the second step. A single tree is considered unstable, and hence researchers have recommended growing an entire forest [9, 10, 20]. A forest of conditional inference trees results in a conditional inference forest (CIF) model. The CIF model algorithm for time-to-event data is implemented in the R package party.
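Step 1 of this procedure can be illustrated with a small Monte-Carlo permutation test. The following is a hypothetical sketch, not the party implementation: it approximates the permutation null of a linear rank statistic \(\sum_{i} x_{i} a_{i}\) by random shuffling, where the \(a_{i}\) play the role of log-rank scores of the outcome; the toy data and function names are invented for illustration.

```python
import random

def perm_pvalue(x, scores, n_perm=2000, seed=0):
    """Monte-Carlo permutation p-value for the linear rank statistic
    |sum_i x_i * a_i| linking covariate x to the outcome scores a."""
    rng = random.Random(seed)
    observed = abs(sum(xi * ai for xi, ai in zip(x, scores)))
    xs = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(xs)
        if abs(sum(xi * ai for xi, ai in zip(xs, scores))) >= observed:
            hits += 1
    return hits / n_perm

# invented toy data: the outcome scores track x1 closely but not x2
scores = [0.8, 0.5, 0.1, -0.2, -0.5, -0.7]   # stand-ins for log-rank scores
x1 = [1.0, 0.9, 0.6, 0.4, 0.2, 0.0]          # strongly associated covariate
x2 = [0.3, 0.9, 0.1, 0.8, 0.2, 0.7]          # unrelated covariate
p1 = perm_pvalue(x1, scores)
p2 = perm_pvalue(x2, scores)
# the covariate with the smaller p-value (here x1) is selected for splitting
```

Only after the splitting covariate has been chosen in this way does the algorithm search for the best split-point, which is what removes the bias towards covariates with many split-points.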
To compare the performance of the three models used in this study, we use integrated Brier scores [32], which are described in the section below.