Abstract
Random survival forests (RSF) are a powerful nonparametric method for building prediction models with a timetoevent outcome. RSF do not rely on the proportional hazards assumption and can be readily applied to both low and higherdimensional data. A remaining limitation of RSF, however, arises from the fact that the method is almost entirely focussed on continuously measured event times. This issue may become problematic in studies where time is measured on a discrete scale \(t = 1, 2, ...\), referring to time intervals \([0,a_1), [a_1,a_2), \ldots \). In this situation, the application of methods designed for continuous timetoevent data may lead to biased estimators and inaccurate predictions if discreteness is ignored. To address this issue, we develop a RSF algorithm that is specifically designed for the analysis of (possibly rightcensored) discrete event times. The algorithm is based on an ensemble of discretetime survival trees that operate on transformed versions of the original timetoevent data using tree methods for binary classification. As the outcome variable in these trees is typically highly imbalanced, our algorithm implements a node splitting strategy based on Hellinger’s distance, which is a skewinsensitive alternative to classical split criteria such as the Gini impurity. The new algorithm thus provides flexible nonparametric predictions of individualspecific discrete hazard and survival functions. Our numerical results suggest that node splitting by Hellinger’s distance improves predictive performance when compared to the Gini impurity. Furthermore, discretetime RSF improve prediction accuracy when compared to RSF approaches treating discrete event times as continuous in situations where the number of time intervals is small.
Introduction
Random survival forests (RSF, Ishwaran et al. 2008) have become an established tool to model rightcensored data in observational research. They provide a valuable nonparametric alternative to classical Cox regression in longitudinal studies, being able to deal with higherorder interactions between covariates, higherdimensional covariate spaces, and nonproportional hazards when configured appropriately (Korepanova et al. 2019). In recent years, RSF have been increasingly used in practice (e.g. Fantazzini and Figini 2009; Pan et al. 2017; Banerjee et al. 2018; Ingrisch et al. 2018; Verschut and Hambäck 2018), and there are numerous methodological advances and additions, for example regarding variable selection (Ishwaran et al. 2010, 2011) and improved split criteria (Schmid et al. 2016b; Moradian et al. 2017; Wright et al. 2017).
A remaining limitation of the RSF methodology arises from the fact that RSF are almost entirely focussed on continuously measured event times. This limitation is particularly relevant in observational studies with fixed followup intervals where it is only known that events have occurred between two consecutive points in time. In these cases, event times are grouped (constituting a special case of interval censoring), and time is measured on a discrete scale \(t = 1,2,\ldots , k\), referring to time intervals \([0,a_1), [a_1,a_2), \ldots , [a_{k1},\infty )\) with fixed boundaries \(a_1,\ldots , a_{k1}\). A similar situation occurs in studies where event times are “intrinsically” discrete, for example, when analyzing time to pregnancy, which is usually measured by the number of menstrual cycles (Scheike and Keiding 2006; Fehring et al. 2013). As argued by many authors (e.g. Tutz and Schmid 2016; Bogaerts et al. 2017; Berger et al. 2018), the application of statistical models designed for continuous timetoevent data is not appropriate when interval censoring and/or grouping effects are ignored.
The aim of this work is therefore to develop a random forest algorithm that is specifically designed for the analysis of (possibly rightcensored) discrete timetoevent data. Discretetime survival trees and RSF have been first considered in BouHamad et al. (2009) and BouHamad et al. (2011), respectively. The main building block of the algorithm proposed here is a discretetime survival tree method developed by Schmid et al. (2016a): Conditional on the values of a set of covariates \(X = (X_{(1)},\ldots , X_{(p)})^{\top }\), the method by Schmid et al. (2016a) results in nonparametric estimates of the discrete hazard function
which is the discretetime equivalent to the hazard function considered in continuoustime survival analysis. As with classical random forest approaches (Breiman 2001; Ishwaran et al. 2008), a discretetime RSF estimate of \(\lambda (tX)\) is obtained by averaging the discrete hazard estimates derived from a set of survival trees that have been applied to bootstrap samples of the available data.
The key features of the new RSF algorithm can be briefly summarized as follows:

(i)
Based on the definition of the discrete hazard function (1), the idea is to fit probability estimation trees (PETs, Provost and Domingos 2003; Schmid et al. 2016a) to bootstrap samples of reshaped data sets with a binary outcome \(Y \in \{0,1\}\). To generate these reshaped data, the original timetoevent data are converted to sets of augmented data, in which each individual is represented by multiple data lines. Each of these data lines refers to a specific time point \(t = 1, 2, \ldots \), and the outcome value \(Y =0\) indicates survival of the individual beyond the respective time point. Conversely, each data line with \(Y=1\) indicates an event at the respective time point. Details on data reshaping will be provided in Sect. 2.2.

(ii)
A major problem of tree building with reshaped data is that Y is typically highly imbalanced, as by definition each individual is represented by at most one data line with \(Y=1\) but multiple data lines with \(Y=0\). This problem is known to severely degrade tree performance and evaluation when not accounted for (Menardi and Torelli 2014; Fernandez et al. 2018). To address this issue, we propose to use Hellinger’s distance criterion (Cieslak et al. 2012) for tree building, which is insensitive to the skewness of the distribution of Y and has been shown to be highly effective when used for imbalanced classification tasks (“Hellinger Distance Decision Trees”, Cieslak et al. 2012). Details on split criteria and tree building will be given in Sect. 2.3.

(iii)
Unlike Schmid et al. (2016a), who proposed to fit discretetime survival trees using cardinality pruning in combination with the Gini impurity split criterion, we propose to build discretetime RSF using unpruned trees with a small minimum node size. This strategy follows the original approach to random forest classification for binary outcome variables proposed by Breiman (2001). Details on building the random forest ensemble, along with a formal definition of the RSF estimates of \(\lambda (tX)\), will be given in Sect. 2.4.
In Sects. 3 and 4 we will analyze the properties of the discretetime RSF algorithm using simulation studies and a data set on the duration of unemployment spells of U.S. citizens (Cameron and Trivedi 2005). Our numerical results suggest that node splitting by Hellinger’s distance improves predictive performance when compared to skewsensitive split criteria such as the Gini impurity. Furthermore, discretetime RSF tend to improve prediction accuracy when compared to continuoustime RSF algorithms ignoring discreteness in situations where the number of time intervals is small.
The findings of this paper will be summarized and discussed in Sect. 5. The supplementary materials contain the results of additional simulations as well as another realworld application on the analysis of the time to first childbirth in German women (German Family Panel, Huinink et al. 2011).
Methods
Throughout this paper, we denote by \(T \in \{ 1, 2,\ldots , k\}\) a discrete survival outcome and by \(C \in \{ 1, 2,\ldots , k\}\) a discrete censoring time independent of T. The aim of the discretetime RSF algorithm is to obtain a nonparametric estimate of the discrete hazard function \(\lambda (tX)\) defined in (1). It is assumed that the RSF algorithm is applied to a set of i.i.d. learning data with n independent observations and p covariates \((\tilde{T}_i, \varDelta _i, X_i)\), \(i=1,\ldots , n\), where \(\tilde{T}_i \) and \(\varDelta _i\) denote the sample variables of the (possibly rightcensored) observed survival time \(\tilde{T}_i:= \min (T_i, C_i)\) and the status indicator \(\varDelta _i := \mathrm{I}(T_i \le C_i)\), respectively. Figure 1 provides a schematic overview of the discretetime RSF algorithm. The various steps of the algorithm will be described in detail in the next subsections. For a general description of the rationale of the random forest methodology, see e.g. Breiman (2001) and Ishwaran et al. (2008).
Initialization
Before the algorithm starts, one needs to specify the number of trees of the RSF (termed ntree) and the number of covariates available for splitting at each node (termed mtry). Common choices are \(mtry = \lfloor \sqrt{p}\rfloor \) and \( ntree = 500\). These values have been prespecified in the R package ranger (Wright and Ziegler 2017) and will also be used here.
Data reshaping
Instead of fitting the discretetime RSF directly to the learning data \((\tilde{T}_i, \varDelta _i, X_i)\), \(i=1,\ldots , n\), we propose to define the input of the RSF algorithm by a set of reshaped (“augmented”) data with binary outcome variable. This approach is motivated by parametric and semiparametric discrete timetoevent modeling (e.g., Berger and Schmid 2018), which is based on the optimization of the loglikelihood of the datagenerating process described above. The main idea is to express the loglikelihood of a discretetime survival model in terms of the hazards \(\lambda (tX)\), which results in a binomial loglikelihood function
with the sequences \(y_i = (y_{i1}, \dots , y_{i\tilde{T}_i})\) of length \(\tilde{T}_i\) that are defined by \(y_i := (0, \dots , 0, 1)\) if \(\varDelta _i = 1\) and \(y_i := (0, \dots , 0, 0)\) if \(\varDelta _i = 0\). Note that each entry of \(y_i\) corresponds to one of the time points \(t = 1,2,\ldots , \tilde{T}_i\), where \(y_{it}=0\) indicates survival beyond t and \(y_{it}=1\) indicates an event at t.
Based on the result stated in (2), estimates of \(\lambda (tX)\) can generally be obtained by fitting a statistical model with binary outcome to the set of augmented data, which are defined for each individual by
if \(\varDelta _i = 1\) and
if \(\varDelta _i = 0\). The first columns of the matrices (3) and (4) contain the binary values \(y_{it}\) whereas the third columns contain copies of the covariate values. The values in the second columns of (3) and (4) refer to the time intervals \(1,\ldots ,\tilde{T}_i\); the role of these columns will be explained later. The augmented data matrix of the whole sample is obtained by concatenating the individual augmented data matrices \(M_i\), \(i=1,\ldots , n\), resulting in a matrix with \(m:=\sum _{i=1}^n \tilde{T}_i\) rows. Each row refers to one of the \(\sum _{i=1}^n \tilde{T}_i\) summands in the binomial loglikelihood function (2). For further details on data augmentation we refer to Tutz and Schmid (2016) and Schmid et al. (2016a).
An important consequence of the structure of the loglikelihood function in (2) is that it allows the model fitting algorithm to treat the values \(y_{it}\) as realizations of a binary outcome variable Y. The idea of the proposed RSF algorithm is therefore to augment the available learning data analogously to (3) and (4) (“Data Preparation” step in Fig. 1), to generate a set of ntree bootstrap samples from the augmented data, and to fit ntree probability estimation trees with binary outcome variable Y to the augmented data sets (“Tree Building” step in Fig. 1).
Tree building
Following Schmid et al. (2016a), we propose to base tree building on the classification and regression trees (CART) approach by Breiman et al. (1984). Starting with the root node (referring to the whole covariate space), the idea of CART is to recursively subdivide the support of X into a set of terminal nodes. Partitioning of the covariate space is done by recursively optimizing a split criterion, which is computed from the learning data and is chosen according to the type of outcome variable. For each of the splits a single variable \(X_{(j)}\), \(j \in \{1, \dots , p\}\) is selected. Then the support of \(X_{(j)}\) is subdivided into two mutually exclusive subsets termed children nodes. Generally, partitioning is done such that withinnode homogeneity with respect to the distribution of the outcome variable is maximized. The choice of the split criterion generally depends on the scale of the outcome variable. If the outcome is a categorical variable, a popular strategy is to construct a classification tree or a probability estimation tree (PET) by minimizing the Gini criterion or some other impurity measure in the children nodes. If the outcome is a continuous survival time, a survival tree can be constructed by maximizing the logrank statistic obtained from the survival times in the children nodes. The latter strategy is e.g. used in the continuoustime RSF method by Ishwaran et al. (2008). Node splitting in CART has been the topic of many articles and books; for details, we refer to Breiman et al. (1984).
Based on the definition of the loglikelihood in (2), Schmid et al. (2016a) developed a discretetime survival tree algorithm that applies the CART approach with binary outcome variable Y to the augmented data matrix defined in Sect. 2.2. For this, the rows of the matrices (3) and (4) are subject to recursive partitioning, implying that (i) each individual is represented by multiple rows in the augmented learning data, and that (ii) each terminal node of the discretetime survival tree contains a set of zeros and ones. Estimates of the discrete hazard \(\lambda (tX)\) are then obtained by the proportions of ones in the terminal nodes. Since the discrete hazard function does not only condition on the values of the covariates X but also on the event “\(T \ge t\)”, the idea is to use both X and the values \(\tilde{t}_i := (1,2,\ldots , \tilde{T}_i)\) contained in the second columns of (3) and (4) for node splitting. This strategy implies that the vector \((\tilde{t}_1, \ldots , \tilde{t}_n)^{\top }\) is treated like an ordinal covariate during tree building, so that the estimates of the discrete hazard function capture interactions between the covariates and time. Also, by definition of the augmented data matrix and the recursive partitioning procedure, each terminal node refers to a hazard estimate that is constant within a nodespecific time interval \({\mathcal {T}} \subset \{1, \ldots , k\}\) (cf. Schmid et al. 2016a). Estimates of the discrete hazard function \(\lambda (tX)\) over the whole time range \(t=1,\ldots , k\) are obtained for each individual by concatenating the hazard estimates (i.e., the proportions of ones) in the terminal nodes. For example, assume that an individual \(i \in \{1,\ldots , n\}\) has a covariate combination \(X_i\) and an observed survival time \(\tilde{T}_i\) resulting in \(\tilde{T}_i\) rows in the augmented data matrix. The values of these rows are then dropped down the discretetime survival tree and result in a set of hazard estimates that are defined within mutually exclusive time intervals \({\mathcal {T}}\). Concatenating these time intervals and their associated hazard estimates defines an estimate of the individual’s hazard function across the whole time range \(t=1,\ldots , k\). As illustrated in Schmid et al. (2016a) and Berger and Schmid (2018), this strategy results in a flexible nonparametric estimator of \(\lambda (tX)\) that is able to incorporate both higherorder interactions and timedependent covariate effects.
The discretetime RSF proposed in this paper is based on a similar treebuilding algorithm, which differs from the method in Schmid et al. (2016a) mainly by the choice of the split criterion. With regard to the latter, Schmid et al. (2016a) argued that the focus of a discretetime survival tree is not on the correct classification of the values \(y_{it} \in \{ 0, 1\}\) but on the accurate estimation of the probabilities\({\lambda } (tX)\) (“probability estimation tree”, PET, Provost and Domingos 2003). For this reason, Schmid et al. (2016a) proposed to use the Gini impurity (Breiman et al. 1984) for node splitting, as this criterion is asymptotically equivalent to the Brier score (Gneiting and Raftery 2007) for evaluating the accuracy of probabilistic binary forecasts. Schmid et al. (2016a) demonstrated that the Ginibased splitting approach works well in single trees, in particular when combined with a cardinality pruning strategy that guarantees sufficiently large numbers of observations in the terminal nodes (thereby controlling the variance of the discrete hazard estimates). Unlike Schmid et al. (2016a), however, we do not propose to use cardinality pruning in the development of the discretetime RSF algorithm, as the aim is to control the variance of the hazard estimates by averaging the predictions of an ensemble of unpruned trees with small terminal node sizes. Moreover, it is well known that the Gini split criterion, which was also considered in the original random forest algorithm for binary outcome variables by Breiman (2001), may heavily deteriorate the performance of CART when the distribution of the outcome variable is imbalanced (Cieslak and Chawla 2008). This issue is particularly problematic in discretetime survival analysis, which is based on the binary outcome sequences \(y_{i}\) containing multiple numbers of zeroes but at most one value with \(y_{it}=1\) each (see Sect. 2.2). For example, under the assumption of independent uniformly distributed discrete event and censoring times with \(k=10\) intervals (resulting in a moderately high censoring rate of \(45\%\)), data augmentation will yield only approximately \(14\%\) values with \(y_{it}=1\).
We therefore propose to replace the Gini impurity by Hellinger’s distance, which has been recommended by Cieslak et al. (2012) as a split criterion in decision trees with a highly imbalanced outcome. For any pair of children nodes \({\mathcal {M}}, {\mathcal {N}} \subset \{1, \ldots , m\}\), \({\mathcal {M}} \cap {\mathcal {N}}= \emptyset \), that result from splitting the support of a covariate \(X_{(j)}\), the idea is to consider all observations in \({\mathcal {M}}\) as “positive” instances and all observations in \({\mathcal {N}}\) as “negative” instances. Based on this concept, Hellinger’s distance is defined by
where tpr and fpr refer to the pair of true and false positive rates defined by
and \({\mathcal {O}} \subset ({\mathcal {M}} \cup {\mathcal {N}})\) is the set of elements with outcome value \(y_{it}=1\) in the joint set \({\mathcal {M}} \cup {\mathcal {N}}\). In each node of the ntree trees, splitting is done such that Hellinger’s distance in (5) is maximized over all possible binary partitions of the supports of the mtry covariates (“Hellinger Distance Decision Tree”, HDDT, Cieslak et al. 2012).
By definition, HDDTs capture deviations in the class conditionals, implying that they are skewinsensitive and work well even when the distribution of the binary outcome Y is highly imbalanced. For an indepth discussion and analysis of HDDTs, see Cieslak and Chawla (2008) and Cieslak et al. (2012).
Ensemble estimation
The discretetime RSF ensemble is obtained by collecting the ntree Hellinger Distance Decision Trees constructed from the bootstrap samples generated in the “Data Preparation” step in Fig. 1. Generally, there are two options for generating bootstrap samples: One could either draw bootstrap samples from the original timetoevent data and augment them afterwards (“bootstrapping before augmentation”), or one could augment the timetoevent data first and draw bootstrap samples from the concatenated data matrices (3) and (4) (“bootstrapping after augmentation”, as proposed in Fig. 1). Both options are explored in more detail in Sects. 3 and 4. Following the idea of Breiman (2001), we propose to build unpruned trees with a small terminal node size, i.e., tree building is continued until a prespecified minimum node size is reached. In the remainder of this paper, the minimum node size of all RSF algorithms will be set to 10, which is the default value for fitting PETs in the R package ranger.
After having built the ensemble, estimates of the discrete hazard \(\lambda (tX)\) are obtained as follows: For each new observation with covariate data \(X_{\mathrm{new}}\), one considers the augmented data matrix
covering the time range \(t = 1, \ldots , k1\). Next, the \(k1\) data lines are dropped down each of the ntree trees, resulting in sets of estimates \(\hat{\lambda }_b (1X_{\mathrm{new}})\), \(\hat{\lambda }_b (2X_{\mathrm{new}})\), \(\ldots \), \(\hat{\lambda }_b (k1X_{\mathrm{new}})\), \(b=1,\ldots , ntree\), which are given by the proportions of ones (computed from the learning data) in the respective terminal nodes. Note that it is not necessary to include a kth row in the augmented matrix (8), as \(\lambda (kX_{\mathrm{new}}) = 1\) by definition. Finally, the discretetime RSF estimate of \({\lambda }(tX_{\mathrm{new}})\) is computed by averaging over the ntree trees, i.e.,
When the aim is to measure the performance of the discretetime RSF using a single realvalued score, it is useful to aggregate the set of discrete hazard estimates in (9) over time. Analogous to continuoustime RSF (Ishwaran et al. 2008), this can be done by computing the sum of the cumulative discrete hazard estimates, which is given by
The estimate in (10), which can be computed separately for each individual in a set of independent test data, will be used as a predictive marker to evaluate the performance of discretetime RSF in Sect. 4.
Simulation study
Experimental design
To investigate the properties of the discretetime RSF method, we carried out a simulation study with 100 Monte Carlo replications. The aims of the study were (i) to compare discretetime RSF to alternative methods, in particular to random forests using the Gini split criterion, and (ii) to analyze the effects of the censoring rate and the number of intervals k on the performance of discretetime RSF.
The datagenerating process for the simulation study was defined as follows: In each Monte Carlo replication, we generated a learning data set with \(n=1,000\) observations and \(p=50\) independent standard uniformly distributed covariates. Event times were generated according to the logistic discrete hazard model
where \(\eta _{0t} \in {\mathbb {R}}\) was a set of baseline coefficients independent of X and \(\eta (X) \in {\mathbb {R}}\) was an additive predictor independent of t (“proportional continuation ratio model”, cf. Tutz and Schmid 2016). The baseline coefficients were defined by the linear trend function \((\eta _{01}, \ldots , \eta _{0,k1})^{\top } := ( 1, 1{1}/{(k2)}\), \(1{2}/{(k2)}, 1{3}/{(k2)}, \ldots , 2)^{\top } \), and the values of \(\eta (X)\) were obtained by multiplying the standardized sum of ten independently generated threeway interactions of the covariates \(X_1,\ldots , X_{25}\) (defined as \(X_{(j)}\cdot X_{(k)} \cdot X_{(l)}\), \(j \ne k \ne l \in \{1,\ldots , 25\}\), (j, k, l) drawn ten times with replacement) by the factor two. Depending on the value of k, this datagenerating process resulted in Spearman correlations between T and \(\eta (X)\) ranging from \(0.75\) to \(0.70\). The covariates \(X_{26},\ldots , X_{50}\) served as noise variables in the simulation study.
Censoring times were generated independently from a continuous exponential distribution with rightshifted support \((1, \infty )\). Based on the values of the continuous censoring times (denoted by \(C_{\mathrm{cont},i}\), \(i=1,\ldots , n\)), the observed discrete event times were calculated as \(\tilde{T}_i = \lfloor \min (T_i, C_{\mathrm{cont},i}) \rfloor \), \(i=1,\ldots , n\). For each value of k, the rate of the exponential distribution was adjusted such that the censoring rate (defined as \(\sum _{i=1}^n (1\varDelta _i) / \sum _{i=1}^n \mathrm{I}(T_i > 1) \)) became either \(30\%\) (“low censoring” scenario) or \(70\%\) (“high censoring” scenario).
In each Monte Carlo replication, we applied the discretetime RSF algorithm to the learning data using varying numbers of time points (\(k=5, 6, \ldots , 10\)). Prediction accuracy was evaluated by applying the 100 RSF fits to an independent test sample of size \(n_{\mathrm{test}} = 1000\) that followed the same datagenerating process as the learning data. In each Monte Carlo replication, we calculated the timedependent average squared difference between the values of the predicted survival function \(\hat{S}(tX)\) and the respective values of the true survival function S(tX). Based on these differences (which in the following will be denoted by err(t)), we computed a timeindependent measure of prediction error that was defined by \(Err := {\sum _{t=1}^{k1} \hat{\mathrm{P}} (\tilde{T} = t) \cdot err(t) }\).
For each value of k we compared the following modeling approaches:

(i)
discretetime RSF with splitting by Hellinger’s distance (HD),

(ii)
discretetime RSF with splitting by the Gini impurity (GI),

(iii)
discretetime RSF with splitting by the Gini impurity, combined with synthetic minority oversampling (“SMOTE”) in the data preparation step (Fernandez et al. 2018; GI_SMOTE),

(iv)
continuoustime RSF with splitting by the logrank statistic (LeBlanc and Crowley 1993; Ishwaran et al. 2008; RSF_cont),

(v)
RSF for intervalcensored continuous timetoevent data (Yao et al. 2019a, ICcforest), and

(vi)
the correctly specified logistic discrete hazard model (11).
For the discretetime RSF methods (i) and (ii), we additionally considered a modified version of the data preparation step in which the original timetoevent data were reshaped after bootstrapping (“Bootstrapping before Augmentation”, HD_BA and GI_BA). With this strategy, PETs were fitted to ntree augmented data sets that were generated from bootstrap samples of the original learning data. The SMOTE method in (iii) was applied to compare splitting by Hellinger’s distance with an alternative method for addressing class imbalance in treebased models. The ICcforest method in (v) is based on a likelihood function that represents discrete event times as intervalcensored continuous event times. The logistic discrete hazard model in (vi) was used as a benchmark model reflecting the true datagenerating process. All RSF models except ICcforest were fitted using the R addon package ranger, which is available on GitHub at https://github.com/imbshl/ranger. For the continuoustime RSF method in (iv) we set the minimum terminal node size to three observations, which is the default value for logrank splitting in ranger. RSF for intervalcensored continuous timetoevent data were fitted using the implementation in the R addon package ICcforest (Yao et al. 2019b). Synthetic minority oversampling was done using the ubSMOTE function of the R addon package unbalanced (Dal Pozzolo et al. 2015). To achieve class balance, we adjusted the SMOTE method such that the number of ones in the binary outcome of the reshaped data became approximately equal to \(\sum _{i=1}^n \sum _{t=1}^{\tilde{T}_i} (1y_{it})\).
Results
Figure 2 shows the average values of the timedependent squared prediction error err(t) obtained in the low censoring scenario with \(k=7\). It is seen that the HD method (discretetime RSF with splitting by Hellinger’s distance) resulted in smaller prediction errors than the GI method (discretetime RSF with splitting by the Gini criterion) at all time points. The same result was observed for the alternative methods HD_BA and GI_BA, which, however, performed worse than their respective counterparts HD and GI. Compared to the RSF_cont method (continuoustime RSF), the HD method resulted in smaller prediction errors at early time points and larger prediction errors at later time points. This finding may be attributed to the fact that the HD method considers the time points \(\tilde{t}_i = (1,2,\ldots , \tilde{T}_i)\) as an additional variable during node splitting, which results in an increased accuracy of the hazard estimates at (early) time points that are more frequently observed in the learning data. In contrast, the logrank criterion used by the RSF_cont method is based on the whole time range in each split, resulting in hazard estimates that show less variability (compared to the HD method) at later time points with smaller observed frequencies. The ICcforest method behaved similar to the RSF_cont method, but its prediction errors were smaller than the respective prediction errors of RSF_cont at almost all time points. As expected, the estimates obtained from the correctly specified logistic discrete hazard model resulted in the smallest prediction errors in Fig. 2. Similar results (not shown) were obtained in the high censoring scenario with \(k=7\).
Figure 3 presents the values of the summary measure Err obtained from the HD and GI methods in the low censoring scenarios. It is seen that the prediction error of the HD method was smaller than the respective prediction error of the GI method for all values of k (panel (a) of Fig. 3). Larger values of k (implying an increased class imbalance in the binary outcome variable Y) resulted in higher prediction errors of both methods. Panel (b) of Fig. 3, which depicts the differences in Err between the GI and HD methods, shows that the benefit obtained from splitting by Hellinger’s distance increased with the value of k. This finding confirms the results by Cieslak et al. (2012), who argued that using Hellinger’s distance for node splitting is particularly beneficial in situations where class imbalance in the binary outcome variable is high.
A comparison of the HD and RSF_cont methods (referring to discretetime and continuoustime RSF, respectively) is presented in Fig. 4. It is seen that the prediction error of the HD method was smaller than the respective prediction error of the RSF_cont method for all values of k (panel (a) of Fig. 4). However, the differences between the two methods became smaller with increasing value of k (panel (b) of Fig. 4). This result justifies the use of the continuoustime RSF approach in situations where k is large and where the discrete time scale may be well approximated by a continuous time scale. On the other hand, the discretetime RSF approach performed best when the value of k was small and the time scale was distinctly discrete. Similar results (not shown) were obtained in the high censoring scenarios.
Figure 5 presents a comparison of the differences in prediction error that were obtained in the high and low censoring scenarios. Panel (a) of Fig. 5 shows that the differences in Err between the GI and HD methods were essentially insensitive to variations in the censoring rate (although there appeared to be a very slight decrease in the Err difference at some values of k when the censoring rate was increased from \(30\%\) to \(70\%\)). In contrast, the differences in Err between the RSF_cont and HD methods became larger as the censoring rate increased (panel (b) of Fig. 5). This result confirms earlier findings by Ishwaran et al. (2008) who stated that “the performance of [logrankbased] RF regression depended strongly on the censoring rate”, with the prediction accuracy of continuoustime RSF being “poor” in highcensoring scenarios. The HD method appeared to be more robust against high censoring rates, which might be attributed to the definition and the properties of Hellinger’s distance (being insensitive to class imbalance).
Effect of the minimum node size on prediction accuracy
To analyze the sensitivity of the various RSF methods with regard to choice of the minimum node size, we repeated the simulation study with minimum node sizes ranging between 3 and 100. The prediction errors of the resulting ICcforest, HD and GI fits are summarized in Fig. 6 (low censoring scenario, \(k = 7\)). Obviously, the prediction accuracy of the methods could be improved by optimizing the minimum node sizes of the algorithms. On the other hand, the benefits of this additional tuning step appear to be rather small, especially when compared to the differences in prediction accuracy between the various RSF algorithms and split criteria. Similar results were obtained for other values of k (see Section A of the supplementary materials).
Application: duration of unemployment
To illustrate the use of the discretetime RSF approach, we analyzed a data set on the duration of unemployment spells of \(n = 3,343\) U.S. citizens. The data, which were originally analyzed by McCall (1996) and Cameron and Trivedi (2005), were collected in the years 1986, 1988, 1990 and 1992 as part of the January Current Population Survey’s Displaced workers supplements (DWS). Unemployment duration was measured in twoweek intervals. Analogous to the analysis by Cameron and Trivedi (2005), Example 17.11, and Schmid et al. (2018), we defined the outcome variable as the time to reemployment at either a parttime or a fulltime job. Furthermore, we set \(k = 14\) and summarized all unemployment spells \(> 26\) weeks into one category. This strategy resulted in a censoring rate of \(31.11\%\) (see Table 1).
For RSF analysis we used the publicly available version of the data, which is part of the R addon package Ecdat (Croissant 2016). The covariates considered for RSF analysis were (i) age at baseline (years), (ii) filing of an unemployment claim (yes/no), (iii) eligible replacement rate (defined as the weekly benefit amount divided by the amount of weekly earnings in the lost job), (iv) eligible disregard rate (defined as the disregard, i.e. the amount up to which recipients of unemployment insurance who accept parttime work can earn without any reduction in unemployment benefits, divided by the weekly earnings in the lost job), (v) log weekly earnings ($) in the lost job, and (vi) tenure in the lost job (years). A descriptive summary of the covariates is presented in Table 1. Ninety observations were excluded from statistical analysis because of missing values in at least one of the variables. This resulted in an analysis data set containing \(n=3253\) observations.
To investigate the performance of the discretetime RSF approach, we conducted a benchmark experiment that was based on 100 random partitions of the data. Each partition consisted of a learning data set of size \(n=2602\) and a test data set of size \(n_{\mathrm{test}}=651\). For model comparison we considered the same RSF approaches as in Sect. 3 and applied them to the 100 learning data sets. Furthermore, we considered a logistic discrete hazard model (Tutz and Schmid 2016) that was fitted to the 100 learning data sets using the default implementation of the elastic net method in the R package glmnet (Friedman et al. 2019), E_net). The penalty parameter \(\lambda \) of the elastic net method was determined using tenfold crossvalidation, as implemented in the cv.glmnet function of the glmnet package.
To evaluate the predictive performance of the RSF fits, we computed the aggregated hazard estimates defined in (10) and assessed the concordance between \(\hat{\varLambda }_{\mathrm{RSF}}\) and T in each of the 100 test samples. This was done by applying the estimator of the discrete concordance index (“Cindex”) proposed in Schmid et al. (2018). Generally, the Cindex is defined by the probability \(\mathrm{P}(\zeta _i > \zeta _s \, \, T_i < T_s)\), where \(\zeta \) is a continuous marker (here, \(\zeta \equiv \hat{\varLambda }_{\mathrm{RSF}}\)), and i and s refer to two independent individuals in the test data. By definition, the Cindex compares the rankings of the survival times and the marker values. It takes the value 1 in case of “perfect disagreement”, which, in case of discretetime RSF, implies that a larger value of the aggregated hazard is associated with a shorter event time. For the elastic net fits, we defined \(\zeta \) in the same way as the marker \(\hat{\varLambda }_{\mathrm{RSF}}\) in (10). In addition to the Cindex, we computed estimates of the integrated squared prediction error in each test data set. Generally, the integrated squared prediction error is defined by the timeintegrated squared deviation between the predicted survival functions and the observed survival functions (for each observation defined by a step function dropping from one to zero at time point \(T_i\)) in the test data. It is thus similar to the timeindependent measure Err used in Sect. 3, which is based on the true datagenerating process instead of the observed survival functions. Unlike the Cindex, which solely measures the discriminatory power of a timetoevent model, the integrated squared prediction error also accounts for how well a model is calibrated. For details, we refer to Tutz and Schmid (2016), Chapter 4, and Schmid et al. (2018). The evalCindex and evalIntPredErr functions of the R package discSurv (Welchowski and Schmid 2019) were used to compute the estimates of the Cindex and the integrated squared prediction error, respectively.
The estimates of the integrated squared prediction error are presented in panel (a) of Fig. 7. It is seen that, in contrast to the simulation study presented in Sect. 3, the ICcforest method outperformed the discretetime RSF approaches and resulted in the smallest values of the summary measure Err. Apart from this finding, the results of the simulation study were largely confirmed: Again, the discretetime RSF approaches with splitting by Hellinger’s distance (HD and HD_BA) performed better than the respective approaches with splitting by the Gini impurity (GI_BA and GI_BA). The median values of the integrated squared prediction error (as estimated from the 100 test data sets) were 0.203 (RSF_cont), 0.194 (ICcforest), 0.198 (HD), 0.203 (HD_BA), 0.201 (GI), 0.207 (GI_BA), 0.201 (E_net), and 0.285 (GI_SMOTE). Similar results were obtained from the estimates of the Cindex presented in panel (b) of Fig. 7: Again, the ICcforest method performed best, and the HD approach performed better than the GI approach, although, with regard to the Cindex, the alternative methods HD_BA and GI_BA showed a slightly higher discriminatory power than their respective counterparts HD and GI. The median values of the Cindex (as estimated from the 100 test data sets) were 0.664 (RSF_cont), 0.683 (ICcforest), 0.663 (HD), 0.665 (HD_BA), 0.660 (GI), 0.660 (GI_BA), 0.656 (E_net), and \(0.623\) (GI_SMOTE).
In the final step we applied the HD and GI methods to the whole data set and computed permutationbased variable importance values, as implemented in the importance function of the R package ranger. In case of the HD method (panel (a) of Fig. 8), filing an unemployment claim was estimated to be the most important covariate along with the amount of weekly earnings in the lost job. Conversely, in the case of the GI method (panel (b) of Fig. 8), the amount of weekly earnings in the lost job was estimated to be clearly the most important covariate, followed by the eligible replacement rate. While an indepth analysis of the underlying predictorresponse relationships is out of the scope of this paper, the results presented in Fig. 8 clearly show that differences in the choice of the split criterion (Hellinger’s distance vs. Gini impurity) do not only affect prediction accuracy but may also lead to differences in the interpretation of a RSF fit.
Application: time to first childbirth
In Section B of the supplementary materials, we present the results of another realworld application in which we analyzed the time to first childbirth in German women (German Family Panel, “pairfam”, Huinink et al. 2011). The results of this analysis largely confirmed the results obtained from the analysis of the UnempDur data: Again, the discretetime RSF approaches with splitting by Hellinger’s distance outperformed the respective approaches with splitting by the Gini impurity. In contrast to the analysis of the UnempDur data presented in Fig. 7, the ICcforest method did not perform better than the HD method with regard to the summary measure Err and the estimated Cindex (see Fig. 3 in Section B of the supplementary materials.)
Summary and discussion
The random forest method proposed in this paper provides a flexible approach to prediction modeling in situations where the outcome variable of interest is measured on a discrete time scale. By operating on an augmented data matrix with binary outcome, it directly relates to the CARTbased classification forest methodology by Breiman (2001). Furthermore, our method may be viewed as an alternative to an earlier discretetime RSF approach by BouHamad et al. (2011) that performs node splitting by maximizing the sum of the loglikelihood values of two covariatefree discrete hazard models fitted to the data in the children nodes.
Unlike parametric discrete hazard models (such as the complementary loglog model or the proportional continuation ratio model, cf. Tutz and Schmid 2016), discretetime RSF do not require the prespecification of a link function relating the discrete hazard to the covariates. By including the time variable as an additional candidate variable, our approach accounts for timevarying effects of the covariates on the discrete hazard rate. As demonstrated in Sect. 4, variable importance measures can be calculated in the same way as in “classical” random forest methods for classification and regression.
In our simulation study, the RSF approach with splitting by Hellinger’s distance (HD) performed consistently better in terms of prediction accuracy than the respective RSF approach with splitting by the Gini impurity (GI). This result clearly demonstrates the benefit of accounting for class imbalance in treebased modeling of discrete timetoevent data. Surprisingly, random oversampling of the minority class via the SMOTE method did not improve the performance of Ginibased RSF but resulted in a strong decrease in prediction accuracy. This finding might be explained by the fact that oversampling distorts the ratio of ones and zeros in the binary outcome variable Y which is inherent in the definition of the loglikelihood function (2).
Compared to continuoustime RSF, the discretetime RSF approach with splitting by Hellinger’s distance resulted in a better prediction accuracy when the number of unique time points was small and the time scale was “distinctly” discrete. As expected, the differences in prediction accuracy between the discretetime and the continuoustime methods vanished when the number of unique time points increased and when the discrete time scale could be well approximated by a continuous time scale. When compared to the ICcforest approach that accounts for discrete time measurements by representing them as intervalcensored continuous event times, the discretetime RSF approach resulted in a better prediction accuracy in the simulations and in the analysis of the pairfam data. On the other hand, ICcforest outperformed RSF with splitting by Hellinger’s distance in the analysis of the unemployment data. These results suggest that the appropriate use of the ICcforest and HD methods depends on the characteristics of the data at hand, and that a properly designed comparison study (using validation data or resampling procedures) is essential to decide on the application of the methods in practice.
A remaining limitation of the discretetime RSF approach arises from the size of the augmented data matrices (3) and (4), which may lead to storage and/or run time issues when the number of unique time points is large. Despite the fact that RSF are ideally suited for parallel computing, this problem might restrict the application of discretetime RSF to “big data” sets in some situations. On the other hand, it is likely that the increasing availability of largescale storage solutions and highperformance computing facilities will help to settle this problem in the near future.
We finally emphasize that all numerical results presented in this paper are based on the default values of the minimum node size and the mtry parameter specified in the ranger package. While these parameters are supposed to work well in practice, and while relying on the default values ensured the comparability of the methods in the simulations, one might expect additional tuning steps for the minimum node size and the mtry parameter to further improve the predictive performance of the discretetime RSF method (cf. Fig. 6).
References
Banerjee M, ReyesGastelum D, Haymart MR (2018) Treatmentfree survival in patients with differentiated thyroid cancer. J Clin Endocrinol Metab 103:2720–2727
Berger M, Schmid M (2018) Semiparametric regression for discrete timetoevent data. Stat Model 18:322–345
Berger M, Schmid M, Welchowski T, SchmitzValckenberg S, Beyersmann J (2018) Subdistribution hazard models for competing risks in discrete time. Biostatistics. https://doi.org/10.1093/biostatistics/kxy069
Bogaerts K, Komarek A, Lesaffre E (2017) Survival analysis with intervalcensored data: a practical approach with examples in R, SAS, and BUGS. Chapman & Hall/CRC, New York
BouHamad I, Larocque D, BenHameur H, Mâsse LC, Vitaro F, Tremblay RE (2009) Discretetime survival trees. Can J Stat 37:17–32
BouHamad I, Larocque D, BenAmeur H (2011) Discretetime survival trees and forests with timevarying covariates: application to bankruptcy data. Stat Model 11:429–446
Breiman L (2001) Random forests. Mach Learn 45:5–32
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
Cameron AC, Trivedi PK (2005) Microeconometrics: methods and applications. Cambridge University Press, Cambridge
Cieslak DA, Chawla NV (2008) Learning decision trees for unbalanced data. In: Daelemans W, Goethals B, Morik K (eds) Proceedings of the joint conference on machine learning and knowledge discovery in databases: ECML PKDD 2008, Antwerp, Belgium. Springer, Berlin, pp 241–256
Cieslak DA, Hoens TR, Chawla NV, Kegelmeyer WP (2012) Hellinger distance decision trees are robust and skewinsensitive. Data Min Knowl Discov 24:136–158
Croissant Y (2016) Ecdat: data sets for econometrics. R package version 0.31. http://cran.rproject.org/web/packages/Ecdat. Accessed 16 Nov 2019
Dal Pozzolo A, Caelen O, Bontempi G (2015) Unbalanced: racing for unbalanced methods selection. R package version 2.0. http://cran.rproject.org/web/packages/unbalanced. Accessed 16 Nov 2019
Fantazzini D, Figini S (2009) Random survival forests models for SME credit risk measurement. Methodol Comput Appl Probab 11:29–45
Fehring R, Schneider M, Raviele K, Rodriguez D, Pruszynski J (2013) Randomized comparison of two internetsupported fertilityawarenessbased methods of family planning. Contraception 88:24–30
Fernandez A, Garcia S, Herrera F, Chawla NV (2018) SMOTE for learning from imbalanced data: progress and challenges, marking the 15year anniversary. J Artif Intell Res 61:863–905
Friedman J, Hastie T, Tibshirani R, Narasimhan B, Simon N (2019) glmnet: lasso and elasticnet regularized generalized linear models. R package version 3.0. http://cran.rproject.org/web/packages/glmnet. Accessed 16 Nov 2019
Gneiting T, Raftery A (2007) Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 102:359–378
Huinink J, Brüderl J, Nauck B, Walper S, Castiglioni L, Feldhaus M (2011) Panel analysis of intimate relationships and family dynamics (pairfam): conceptual framework and design. J Fam Res 23:77–101
Ingrisch M, Schöppe F, Paprottka K, Fabritius M, Strobl FF, Toni END, Ilhan H, Todica A, Michl M, Paprottka PM (2018) Prediction of 90Y radioembolization outcome from pretherapeutic factors with random survival forests. J Nucl Med 59:769–773
Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008) Random survival forests. Ann Appl Stat 2:841–860
Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS (2010) Highdimensional variable selection for survival data. J Am Stat Assoc 105:205–217
Ishwaran H, Kogalur UB, Chen X, Minn AJ (2011) Random survival forests for highdimensional data. Stat Anal Data Min 4:115–132
Korepanova N, Seibold H, Steffen V, Hothorn T (2019) Survival forests under test: impact of the proportional hazards assumption on prognostic and predictive forests for amyotrophic lateral sclerosis survival. Stat Methods Med Res. https://doi.org/10.1177/0962280219862586
LeBlanc M, Crowley J (1993) Survival trees by goodness of split. J Am Stat Assoc 88:457–467
McCall BP (1996) Unemployment insurance rules, joblessness, and parttime work. Econometrica 64:647–682
Menardi G, Torelli N (2014) Training and assessing classification rules with imbalanced data. Data Min Knowl Discov 28:92–122
Moradian H, Larocque D, Bellavance F (2017) \({L}_1\) splitting rules in survival forests. Lifetime Data Anal 23:671–691
Pan Y, Zhang H, Zhang M, Zhu J, Yu J, Wang B, Qiu J, Zhang J (2017) A fivegene based risk score with high prognostic value in colorectal cancer. Oncol Lett 14:6724–6734
Provost F, Domingos P (2003) Tree induction for probabilitybased ranking. Mach Learn 52:199–215
Scheike TH, Keiding N (2006) Design and analysis of timetopregnancy. Stat Methods Med Res 15:127–140
Schmid M, Küchenhoff H, Hoerauf A, Tutz G (2016a) A survival tree method for the analysis of discrete event times in clinical and epidemiological studies. Stat Med 35:734–751
Schmid M, Wright MN, Ziegler A (2016b) On the use of Harrell’s C for clinical risk prediction via random survival forests. Expert Syst Appl 63:450–459
Schmid M, Tutz G, Welchowski T (2018) Discrimination measures for discrete timetoevent predictions. Econom Stat 7:153–164
Tutz G, Schmid M (2016) Modeling discrete timetoevent data. Springer, New York
Verschut TA, Hambäck PA (2018) A random survival forest illustrates the importance of natural enemies compared to host plant quality on leaf beetle survival rates. BMC Ecol 18:33
Welchowski T, Schmid M (2019) discSurv: discrete time survival analysis. R package version 1.4.0. http://cran.rproject.org/web/packages/discSurv. Accessed 16 Nov 2019
Wright MN, Ziegler A (2017) Ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw 77:1–17
Wright MN, Dankowski T, Ziegler A (2017) Unbiased split variable selection for random survival forests using maximally selected rank statistics. Stat Med 36:1272–1284
Yao W, Frydman H, Simonoff JS (2019a) An ensemble method for intervalcensored timetoevent data. Biostatistics. https://doi.org/10.1093/biostatistics/kxz025
Yao W, Frydman H, Simonoff JS (2019b) ICcforest: an ensemble method for intervalcensored survival data. R package version 0.5.0. http://cran.rproject.org/web/packages/ICcforest. Accessed 16 Nov 2019
Acknowledgements
Open Access funding provided by Projekt DEAL. The work of Matthias Schmid was supported by the German Research Foundation (DFG), Grant SCHM 2966/21.
Author information
Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Responsible editor: Hendrik Blockeel.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Schmid, M., Welchowski, T., Wright, M.N. et al. Discretetime survival forests with Hellinger distance decision trees. Data Min Knowl Disc 34, 812–832 (2020). https://doi.org/10.1007/s1061802000682z
Received:
Accepted:
Published:
Issue Date:
Keywords
 Class imbalance
 Discrete event times
 Hellinger’s distance
 Random survival forests
 Recursive partitioning
 Survival analysis