1 Introduction

Survival analysis is a cornerstone of healthcare research and is widely used in the analysis of clinical trials as well as large-scale medical datasets such as Electronic Health Records and insurance claims. Survival analysis methods are required for censored data in which the outcome of interest is generally the time until an event (onset of disease, death, etc.), but the exact time of the event is unknown (censored) for some individuals. When a lower bound for these missing values is known (for example, a patient is known to be alive until at least time t) the data is said to be right-censored.

A common survival analysis technique is Cox proportional hazards regression (Cox 1972), which models the hazard rate for an event as a linear combination of covariate effects. Although this model is widely used and easily interpreted, the linear form of its covariate effects makes it unable to identify non-linear effects or interactions between covariates (Bou-Hamad et al. 2011).

Recursive partitioning techniques (also referred to as trees) are a popular alternative to parametric models. When applied to survival data, survival tree algorithms partition the covariate space into smaller and smaller regions (nodes) containing observations with homogeneous survival outcomes. The survival distribution in the final partitions (leaves) can be analyzed using a variety of statistical techniques such as Kaplan–Meier curve estimates (Kaplan and Meier 1958). Several authors have proposed algorithms for building survival trees using censored datasets (Therneau et al. 1990; LeBlanc and Crowley 1993; Hothorn et al. 2006), many of which have been implemented within recursive partitioning software packages (Therneau et al. 2010; Hothorn et al. 2010).

Most recursive partitioning algorithms generate trees in a top-down, greedy manner, which means that each split is selected in isolation without considering its effect on subsequent splits in the tree (Breiman et al. 1984; Quinlan 1986, 2014). This approach can have a negative impact on the quality of the model, such as unnecessarily increasing complexity or decreasing accuracy, resulting in poor out-of-sample performance.

To address these issues, researchers have proposed constructing optimal decision trees by leveraging optimization techniques (Chou 1991; Nijssen and Fromont 2010; Scott and Nowak 2006; Verwer and Zhang 2019; Verhaeghe et al. 2020). Such approaches lead to higher-quality solutions while providing the flexibility to impose additional constraints on the trees. As the problem of tree construction is NP-complete (Laurent and Rivest 1976), recovering the optimal partition in high-dimensional datasets poses scalability issues. Bertsimas and Dunn (2017, 2019) have proposed an efficient algorithm which uses modern mixed-integer optimization (MIO) techniques to address this weakness. Similar to other optimization-based approaches, this Optimal Trees algorithm forms the entire decision tree in a single step, allowing each split to be determined with full knowledge of all other splits. It allows the construction of single decision trees for classification and regression that have performance comparable with state-of-the-art methods such as random forests and gradient boosted trees, without sacrificing the interpretability offered by a single-tree model.

The key contributions of this paper are:

  1.

    We present Optimal Survival Trees (OST), a new survival trees algorithm that utilizes the Optimal Trees framework to generate interpretable trees for censored data.

  2.

    We propose a new accuracy metric that evaluates the fit of Kaplan–Meier curve estimates relative to known survival distributions in simulated datasets. We also demonstrate that this metric is reasonably consistent with the Integrated Brier Score (Graf et al. 1999), which can be used to evaluate the fit of Kaplan–Meier curves when the true distributions are unknown.

  3.

    We evaluate the performance of our method in both simulated and real-world datasets and demonstrate improved accuracy relative to two existing algorithms.

  4.

    Finally, we provide examples of how the algorithm can be used in real-world settings with censored data. We apply the algorithm to predict the risk of adverse events associated with cardiovascular health in the Framingham Heart Study dataset, and to predict the risk of mortality in the Wisconsin Longitudinal Study and Health and Lifestyle Survey.

The structure of this paper is as follows. We review existing survival tree algorithms in Sect. 2 and discuss some of the technical challenges associated with building trees for censored data. In Sect. 3, we give an overview of the Optimal Trees algorithm proposed by Bertsimas and Dunn (2017) and we adapt this algorithm for Optimal Survival Trees in Sect. 4. Section 5 begins with a discussion of existing survival tree accuracy metrics, followed by the new accuracy metrics that we have introduced to evaluate survival tree models in simulated datasets. Simulation results are presented in Sect. 6 and results for real-world datasets are presented in Sect. 7. We conclude in Sect. 8 with a brief summary of our contributions.

2 Review of survival trees

Recursive partitioning methods have received a great deal of attention in the literature, the most prominent method being the Classification and Regression Tree (CART) algorithm (Breiman et al. 1984). Tree-based models are appealing due to their logical, interpretable structure as well as their ability to detect complex interactions between covariates. However, traditional tree algorithms require complete observations of the dependent variable in training data, making them unsuitable for censored data.

Tree algorithms incorporate a splitting rule that selects partitions to add to the tree, and a pruning rule that determines when to stop adding further partitions. Since the 1980s, many authors have proposed splitting and pruning rules for censored data. Splitting rules in survival trees are generally based on either (a) node distance measures that seek to maximize the difference between observations in separate nodes or (b) node purity measures that seek to group similar observations in a single node (Zhou and McArdle 2015; Molinaro et al. 2004).

Algorithms based on node distance measures compare the two child nodes generated when a parent node is split, retaining the split that produces the greatest difference between the child nodes. Proposed measures of node distance include the two-sample logrank test (Ciampi et al. 1986), the likelihood ratio statistic (Ciampi et al. 1987) and conditional inference permutation tests (Hothorn et al. 2006). We note that the score function used in Cox regression models also falls into the class of node distance measures, as the partial likelihood statistic is based on a comparison of the relative risk coefficients predicted for each observation.

Dissimilarity-based splitting rules are unsuitable for certain applications (such as the Optimal Trees algorithm) because they do not allow for the assessment of a single node in isolation. We therefore focus on node purity splitting rules for developing the OST algorithm.

Gordon and Olshen (1985) published the first survival tree algorithm with a node purity splitting rule based on Kaplan–Meier estimates. Davis and Anderson (1989) used a splitting rule based on the negative log-likelihood of an exponential model, while Therneau et al. (1990) proposed using martingale residuals as an estimate of node error. LeBlanc and Crowley (1992) suggested comparing the log-likelihood of a saturated model to the first step of a full likelihood estimation procedure for the proportional hazards model and showed that both the full likelihood and martingale residuals can be calculated efficiently from the Nelson–Aalen cumulative hazard estimator (Nelson 1972; Aalen 1978). More recently, Molinaro et al. (2004) proposed a new approach that adjusts loss functions designed for uncensored data using inverse probability of censoring weights (IPCW).

Most survival tree algorithms make use of cost-complexity pruning to determine the correct tree size, particularly when node purity splitting is used. Cost-complexity pruning selects a tree that minimizes a weighted combination of the total tree error (i.e., the sum of each leaf node error) and tree complexity (the number of leaf nodes), with relative weights determined by cross-validation. A similar split-complexity pruning method was suggested by LeBlanc and Crowley (1993) for node distance measures, using the sum of the split test statistics and the number of splits in the tree. Other proposals include using the Akaike Information Criterion (AIC) (Ciampi et al. 1986) or using a p-value stopping criterion to stop growing the tree when no further significant splits are found (Hothorn et al. 2006).

Survival analysis methods have been extended to include other non-linear learners, such as support vector machines, tree ensembles, and neural networks (Fouodo et al. 2018; Hothorn et al. 2005; Liestøl et al. 1994). Breiman (2002) adapted the CART-based random forest algorithm to survival data, while both Hothorn et al. (2004) and Ishwaran et al. (2008) proposed more general methods that generate survival forests from any survival tree algorithm. “Survival forest” algorithms aggregate the results of multiple trees and aim to produce more accurate predictions by avoiding the instability of single-tree models. In addition, the formulation of the SVM problem has been extended to the survival setting with the objective of maximizing the concordance index for comparable pairs of observations (Van Belle et al. 2011; Evers and Messow 2008). Neural network survival analysis includes various structures, such as feed-forward, deep, and recurrent neural networks (Biganzoli et al. 1998; Ripley and Ripley 2001; Fotso 2018; Giunchiglia et al. 2018).

Unlike decision trees, these approaches lead to “black-box” models which are not interpretable and provide little information about how they arrive at their predictions (Samek and Müller 2019; Castelvecchi 2016). The issue of interpretability has become central to the adoption and implementation of artificial intelligence models over the past several years (Gilpin et al. 2018), particularly in application areas like medicine where algorithmic decisions can directly impact patient lives (Rajkomar et al. 2019; Cabitza et al. 2017).

More interpretable survival analysis methods are often based on linear models such as Cox proportional hazards regression (Cox 1972). Various authors have adapted this approach using regularization techniques such as LASSO (Tibshirani 1997; Park and Hastie 2007), ridge regression (Verweij and Van Houwelingen 1994), and elastic net (Simon et al. 2011), which can be used to perform feature selection in large datasets and control the complexity of the models. Although linear models are relatively easy to interpret, their parametric structure can be a significant limitation if the underlying assumptions (for example, proportional hazards) are violated. These models are also unsuitable for identifying non-linear relationships and interactions in the data. Single-tree models provide a clear answer to this problem as they are able to capture intrinsic non-linear effects and interactions in the data while offering transparency to the user with the full characterization of potential risk profiles (Bertsimas and Dunn 2019).

Relatively few survival tree algorithms have been implemented in publicly available, well-documented software. Two user-friendly options are available in R (R Core Team 2017) packages: Therneau’s algorithm based on martingale residuals is implemented in the rpart package (Therneau et al. 2010) and Hothorn’s conditional inference (ctree) algorithm in the party package (Hothorn et al. 2010).

3 Review of optimal predictive trees

In this section, we briefly review approaches to constructing decision trees, and in particular, we outline the Optimal Trees algorithm. The purpose of this section is to provide a high-level overview of the Optimal Trees framework; interested readers are encouraged to refer to Bertsimas and Dunn (2019) and Dunn (2018) for more detailed technical information. The Optimal Trees algorithm is implemented in Julia (Bezanson et al. 2017) and is available to academic researchers under a free academic license.

Traditionally, decision trees are trained using a greedy heuristic that recursively partitions the feature space using a sequence of locally-optimal splits to construct a tree. This approach is used by methods like CART (Breiman et al. 1984) to find classification and regression trees. The greediness of this approach is also its main drawback: each split in the tree is determined independently, without considering the possible impact of future splits on the quality of the here-and-now decision. This can create difficulties in learning the true underlying patterns in the data and lead to trees that generalize poorly. The most natural way to address this limitation is to form the decision tree in a single step, where each split is decided with full knowledge of all other splits.

The first efforts in the direction of optimal decision tree construction involved the use of pattern mining techniques to construct a global model (Nijssen and Fromont 2007, 2010). Narodytska et al. (2018) proposed the use of a Boolean satisfiability model for computing small decision trees with optimality guarantees (\(n<10^3\) observations). Verwer and Zhang (2019) introduced an alternative binary formulation that employs Integer Linear Programming to render the model size largely independent of the training data size, achieving better scaling performance and shorter running times for datasets with thousands of observations. Verhaeghe et al. (2020) recently suggested an even more efficient way to decompose the learning problem with a constraint programming approach. Other attempts in the literature to construct globally optimal predictive trees include those of Bennett and Blue (1996), Son (1998), and Grubinger et al. (2014). However, these methods could not scale to datasets of the sizes required by practical applications, and therefore did not displace greedy heuristics as the approach used in practice. In contrast to the proposed algorithm, these frameworks are not able to efficiently partition datasets with sample size \(n > 20{,}000\) and number of features \(p > 100\).

Optimal Trees is a novel approach for decision tree construction that outperforms many existing decision tree methods (Bertsimas and Dunn 2019). It formulates the decision tree construction problem from the perspective of global optimality using mixed-integer optimization (MIO) and solves this problem with coordinate descent to find optimal or near-optimal solutions in practical run times. These Optimal Trees are often as powerful as state-of-the-art methods like random forests or boosted trees, yet they are just a single decision tree and hence are readily interpretable. This obviates the need to trade off between interpretability and state-of-the-art accuracy when choosing a predictive method.

The Optimal Trees framework is a generic approach that tractably and efficiently trains decision trees according to a loss function of the form

$$\begin{aligned} \min _T ~~\texttt {error}(T, D) + \alpha \cdot \texttt {complexity}(T), \end{aligned}$$
(1)

where T is the decision tree being optimized, D is the training data, \(\texttt {error}(T, D)\) is a function measuring how well the tree T fits the training data D, \(\texttt {complexity}(T)\) is a function penalizing the complexity of the tree (for a tree with axis-parallel splits, this is simply the number of splits in the tree), and \(\alpha\) is the complexity parameter that controls the tradeoff between the quality of the fit and the size of the tree. Cross-validation takes place as an internal component of the method.

Unlike these other methods, Optimal Trees is able to scale to large datasets (n in the millions, p in the thousands) by using coordinate descent to train the decision trees towards global optimality. When training a tree, the splits are repeatedly optimized one at a time, finding changes that improve the global objective value in Problem (1). To give a high-level overview, the nodes of the tree are visited in a random order and at each node we consider the following modifications:

  • If the node is not a leaf, delete the split at that node;

  • If the node is not a leaf, find the optimal split to use at that node and update the current split;

  • If the node is a leaf, create a new split at that node.

For each of these changes, we calculate the objective value of the modified tree with respect to Problem (1). If any change results in an improved objective value, the modification is accepted. When a modification is accepted or all potential modifications have been rejected, the algorithm continues visiting the nodes of the tree in a random order until no further improvements are found, meaning that the tree is locally optimal for Problem (1). Since the problem is non-convex, we repeat the coordinate descent process from various randomly-generated starting decision trees, before selecting the locally-optimal tree with the lowest overall objective value as the final solution. For a more comprehensive guide to the coordinate descent process, we refer the reader to Bertsimas and Dunn (2019).
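To make this local-search procedure concrete, the sketch below implements coordinate descent for a regression tree with axis-parallel splits and a squared-error node loss, standing in for the survival error derived in Sect. 4. The dictionary-based tree representation and function names are our own, and for brevity a re-split resets the subtree below a node to leaves, whereas the actual implementation re-optimizes it; treat this as an illustration of the search loop rather than the production algorithm.

```python
import random
import numpy as np

def leaf(idx):
    # a node is a dict; leaves have split=None and carry the indices of their rows
    return {"split": None, "idx": idx}

def leaf_error(y, idx):
    # within-leaf squared error; any error that is separable across leaves works here
    return float(((y[idx] - y[idx].mean()) ** 2).sum())

def tree_error(node, y):
    if node["split"] is None:
        return leaf_error(y, node["idx"])
    return tree_error(node["left"], y) + tree_error(node["right"], y)

def n_splits(node):
    if node["split"] is None:
        return 0
    return 1 + n_splits(node["left"]) + n_splits(node["right"])

def objective(root, y, alpha):
    # Problem (1): error(T, D) + alpha * complexity(T)
    return tree_error(root, y) + alpha * n_splits(root)

def nodes_of(node):
    if node["split"] is None:
        return [node]
    return [node] + nodes_of(node["left"]) + nodes_of(node["right"])

def best_split(X, y, idx):
    # exhaustive search for the best axis-parallel split of one node's data
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[idx, j])[:-1]:
            left, right = idx[X[idx, j] <= t], idx[X[idx, j] > t]
            err = leaf_error(y, left) + leaf_error(y, right)
            if best is None or err < best[0]:
                best = (err, j, t, left, right)
    return best

def moves(node, X, y):
    # the modifications considered at each node during coordinate descent
    out = []
    if node["split"] is not None:
        out.append(leaf(node["idx"]))                    # delete the split
    found = best_split(X, y, node["idx"])
    if found is not None:                                # create or replace a split
        _, j, t, left, right = found
        out.append({"split": (j, t), "idx": node["idx"],
                    "left": leaf(left), "right": leaf(right)})
    return out

def local_search(root, X, y, alpha, seed=0):
    rng = random.Random(seed)
    improved = True
    while improved:
        improved = False
        order = nodes_of(root)
        rng.shuffle(order)                               # visit nodes in random order
        for node in order:
            base, backup = objective(root, y, alpha), dict(node)
            for move in moves(node, X, y):
                node.clear(); node.update(move)
                if objective(root, y, alpha) < base - 1e-9:
                    improved = True                      # accept the modification
                    break
                node.clear(); node.update(backup)        # reject: restore the node
    return root  # locally optimal; rerun from other random starts in practice

# Example: recover a two-leaf structure from noisy data
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + rng.normal(0, 0.1, 200)
tree = local_search(leaf(np.arange(200)), X, y, alpha=1.0)
print(n_splits(tree), objective(tree, y, 1.0))
```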

Although only one tree model is ultimately selected, information from multiple trees generated during the training process is also used to improve the performance of the algorithm. For example, the Optimal Trees algorithm combines the results of multiple trees to automatically calibrate the complexity parameter (\(\alpha\)). Bertsimas and Dunn (2019) present a tailored approach for tuning continuous hyperparameters that discretizes the range of the parameter, identifying a unique mapping between intervals and the corresponding \(\texttt {complexity}(T)\). Thus, during the tuning process only a restricted set of values is tested, avoiding the exploration of values that result in overlapping solutions. Because only one of several correlated covariates may make it into a single tree, the Optimal Trees framework also calculates variable importance scores in the same way as random forests and boosted trees, measuring the importance of variables across the entire training process and not just in the final tree. More detailed explanations of these procedures can be found in Dunn (2018).

The coordinate descent approach used by Optimal Trees is generic and can be applied to optimize a decision tree under any objective function. For example, the Optimal Trees framework can train Optimal Classification Trees (OCT) by setting \(\texttt {error}(T, D)\) to be the misclassification error associated with the tree predictions made on the training data. We provide a comparison of performance between various classification methods from Bertsimas and Dunn (2019) in Fig. 1. This comparison shows the performance of two versions of Optimal Classification Trees: OCT with parallel splits (using one variable in each split); and OCT with hyperplane splits (using a linear combination of variables in each split). These results demonstrate that not only do the Optimal Tree methods significantly outperform CART in producing a single predictive tree, but also that these trees have performance comparable with some of the best classification methods.

Fig. 1: Performance of classification methods averaged across 60 real-world datasets. OCT and OCT-H refer to Optimal Classification Trees without and with hyperplane splits, respectively

4 Survival tree algorithm

In this section, we adapt the Optimal Trees algorithm described in Sect. 3 for the analysis of censored data. For simplicity, we will use terminology from survival analysis and assume that the outcome of interest is the time until death. We begin with a set of observations \((t_i,\delta _i)_{i=1}^n\) where \(t_i\) indicates the time of last observation and \(\delta _i\) indicates whether the observation was a death (\(\delta _i=1\)) or a censoring (\(\delta _i=0\)).

Like other tree algorithms, the OST model requires a target function that determines which splits should be added to the tree. Computational efficiency is an important factor in the choice of target function, since it must be re-evaluated for every potential change to the tree during the optimization procedures. A key requirement for the target function is that the “fit” or error of each node should be evaluated independently of the rest of the tree. In this case, changing a particular split in the tree will only require re-evaluation of the subtree directly below that split, rather than the entire tree.

Due to these computational constraints, splits in the OST model cannot be evaluated by any methods that require the comparison of two or more nodes within the tree. This requirement restricts the choice of target function to the node purity approaches described in Sect. 2.

The splitting rule implemented in the OST algorithm is based on the likelihood method proposed by LeBlanc and Crowley (1992). This splitting rule is derived from a proportional hazards model which assumes that the underlying survival distribution for each observation is given by

$$\begin{aligned} \mathrm {P}(S_i\le t) = 1-e^{-\theta _i\varLambda (t)}, \end{aligned}$$
(2)

where \(\varLambda (t)\) is the baseline cumulative hazard function and the coefficients \(\theta _i\) are the adjustments to the baseline cumulative hazard for each observation.

In a survival tree model we replace \(\varLambda (t)\) with an empirical estimate of the baseline cumulative hazard at each of the observed times, known as the Nelson–Aalen estimator (Nelson 1972; Aalen 1978),

$$\begin{aligned} \hat{\varLambda }(t) = \sum _{i:t_i\le t}\frac{\delta _i}{\sum _{j:t_j\ge t_i} 1}.\end{aligned}$$
(3)

Assuming this baseline hazard, the objective of the survival tree model is to optimize the hazard coefficients \(\theta _i\). We require the tree model to use the same coefficient for all observations contained in a given leaf node, i.e. \(\theta _i = \hat{\theta }_{T(i)}\). These coefficients are determined by maximizing the within-leaf sample likelihood

$$\begin{aligned} L= \prod \limits _{i=1}^n \left( \theta _i\frac{d}{dt}\varLambda (t_i)\right) ^{\delta _i}e^{-\theta _i\varLambda (t_i)}, \end{aligned}$$
(4)

to obtain the node coefficients

$$\begin{aligned} \hat{\theta }_{k} =\frac{\sum _{i}\delta _i I_{\{T_i = k\}}}{ \sum _{i}\hat{\varLambda }(t_i)I_{\{T_i = k\}}}. \end{aligned}$$
(5)

To evaluate how well different splits fit the available data we compare the current tree model to a tree with a single coefficient for each observation. We will refer to this as a fully saturated tree, since it has a unique parameter for every observation. The maximum likelihood estimates for these saturated model coefficients are

$$\begin{aligned} \hat{\theta }^{sat}_i = \frac{\delta _i}{\hat{\varLambda }(t_i)},\quad i=1,\dots , n.\end{aligned}$$
(6)

We calculate the prediction error at each node as the difference between the log-likelihood for the fitted node coefficient and the saturated model coefficients at that node:

$$\begin{aligned} \texttt {error}_k =\sum _{i:T(i) = k} \left( \delta _i \log \left( \dfrac{\delta _i}{ \hat{\varLambda }(t_i)}\right) - \delta _i \log (\hat{\theta }_{k})- \delta _i +\hat{\varLambda }(t_i)\hat{\theta }_{k}\right) . \end{aligned}$$
(7)

The overall error function used to optimize the tree is simply the sum of the errors across the leaf nodes of the tree T given the training data D:

$$\begin{aligned} \texttt {error}(T, D) = \sum _{k \in \mathrm {leaves}(T)} \texttt {error}_k(D). \end{aligned}$$
(8)

We can then apply the Optimal Trees approach to train a tree according to this error function by substituting this expression into the overall loss function (1). At each step of the coordinate descent process, we determine new estimates for \(\hat{\theta }_{k}\) for each leaf node k in the tree using (5). We then calculate and sum the errors at each node using (7) to obtain the total error of the current solution, which is used to guide the coordinate descent and generate trees that minimize the error (8).
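To illustrate, the following numpy sketch computes the Nelson–Aalen estimate (3), the node coefficient (5) and the node error (7). The function names are our own, the implementation is written for clarity rather than speed, and it assumes each evaluated node contains at least one death.

```python
import numpy as np

def nelson_aalen_at(t, d):
    # \hat{Lambda}(t_i), Eq. (3), evaluated at each observation time (O(n^2) for clarity)
    risk = (t[:, None] >= t[None, :]).sum(axis=0)       # risk[k] = #{j : t_j >= t_k}
    increments = d / risk                               # each death adds delta_k / risk_k
    return (t[None, :] <= t[:, None]).astype(float) @ increments

def node_coefficient(d, Lam):
    # \hat{theta}_k, Eq. (5): deaths over accumulated baseline hazard in the node
    return d.sum() / Lam.sum()

def node_error(d, Lam):
    # Eq. (7): log-likelihood gap between the saturated coefficients (6) and the
    # fitted node coefficient; assumes at least one death in the node (theta > 0)
    theta = node_coefficient(d, Lam)
    sat = np.where(d == 1, np.log(np.maximum(d, 1e-12) / Lam), 0.0)  # delta_i log(delta_i / Lam_i)
    return float((sat - d * np.log(theta) - d + Lam * theta).sum())

# Toy example: baseline hazard from the full sample, error for one candidate leaf
t = np.array([2.0, 3.0, 4.0, 5.0, 8.0, 9.0])
d = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])            # delta_i: 1 = death, 0 = censored
Lam = nelson_aalen_at(t, d)
in_leaf = np.array([True, True, False, True, False, False])
print(node_error(d[in_leaf], Lam[in_leaf]))
```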

5 Survival tree accuracy metrics

In order to assess the performance of the OST algorithm, we now introduce a number of accuracy metrics for survival tree models. We will use the notation \(T^{true}\) to represent a tree model, where \(T^{true}_i = T^{true}(X_i)\) is the leaf node classification of observation i with covariates \(X_i\) in the tree \(T^{true}\). We will use the notation \(T^0\) to represent a null model (a tree with no splits and a single node).

5.1 Review of survival model metrics

We begin by reviewing existing accuracy metrics for survival models that are commonly used in both the literature and practical applications. A short computational sketch of several of these metrics is given after the list.

  1.

    Cox Partial Likelihood Score

    The Cox proportional hazards model (Cox 1972) is a semi-parametric model that is widely used in survival analysis. The Cox hazard function estimate is

    $$\begin{aligned} \lambda (t|X_i) = \lambda _0(t)\exp {(\beta _1X_{i1} + \dots +\beta _pX_{ip})} = \lambda _0(t)\exp {(\beta ^TX_i)} ,\end{aligned}$$
    (9)

    where \(\lambda _0(t)\) is the baseline hazard function and \(\beta\) is a vector of fitted coefficients. This proportional hazards model does not make any assumptions about the form of \(\lambda _0(t)\), and its parameters can be estimated even when the baseline is completely unknown (Cox 1975). The coefficients \(\beta\) are estimated by maximizing the partial likelihood function,

    $$\begin{aligned} L(\beta ) = \prod _{i:\, t_i\ \text {uncensored}} \frac{\exp {(X_i\beta )}}{\sum _{t_j\ge t_i}\exp {(X_j\beta )}}= \prod _{i:\, t_i\ \text {uncensored}}\frac{\theta _i}{\sum _{t_j\ge t_i} \theta _j}. \end{aligned}$$
    (10)

    For computational convenience, the Cox model is generally implemented using the log partial likelihood,

    $$\begin{aligned} l(\beta ) = \log L(\beta ) =\sum _{i:\, t_i\ \text {uncensored}} \left( X_i\beta - \log \left( \sum _{t_j\ge t_i}\exp {(X_j\beta )}\right) \right) . \end{aligned}$$
    (11)

    In the context of survival trees, we can find the Cox hazard function associated with a particular tree model by assigning one coefficient to each leaf node in the tree, i.e.,

    $$\begin{aligned} \lambda _T(t|X_i) = \lambda _0(t)\exp { \left( \sum _{k \in T}\beta _k\mathbbm {1}(T_i=k)\right) } = \lambda _0(t)\exp {(\beta _{T_i})} . \end{aligned}$$
    (12)

    We define the Cox Score for a tree model as the maximized log partial likelihood for the associated Cox model, \(\max _{\beta }l(\beta |T)\). To assist with interpretation, we also define the Cox Score Ratio (CSR) as the percentage reduction in the Cox Score for tree T relative to a null model,

    $$\begin{aligned} CSR(T) = 1-\frac{\max _{\beta }l(\beta |T)}{\max _{\beta }l(\beta |T^0)}. \end{aligned}$$
    (13)

    Due to its widespread use in the context of Cox Regression, the Cox Score is a useful metric for assessing the fit of survival tree models and contrasting the structure of these models with more commonly used linear hazard functions. However, it is important to consider the implications of applying a metric designed for continuous hazard predictions in the context of decision trees, which produce a discrete hazard coefficient for every node. Each additional leaf node in the tree allows an additional degree of freedom in equation (12), and increasing the number of nodes in the tree may inflate the Cox score even if the overall quality of the model does not improve.

    Another significant drawback of the Cox score is its reliance on the proportional hazards assumption (2). Although this assumption is commonly used in survival analysis, it may not be appropriate in many applications. This metric should be interpreted with caution when comparing the results of survival tree algorithms that use the proportional hazards model in node splitting rules (such as the OST algorithm) to other algorithms that rely on non-parametric splitting rules.

  2.

    The Concordance Statistic

    Applying a ranking approach to survival analysis is an effective way to deal with the skewed distributions of survival times as well as the censored nature of the data. The Concordance Statistic, most familiar from logistic regression, is another popular metric that has been adapted to measure goodness-of-fit in survival models (Harrell et al. 1982). The concordance index is defined as the proportion of all comparable pairs of observations in which the model’s predictions are concordant with the observed outcomes.

    Two observations are comparable if it is known with certainty that one individual died before the other. This occurs when the actual time of death is observed for both individuals (neither is censored) or when one individual’s death is observed before the other is censored. A comparable pair is concordant if the predicted risk (\(\rho\)) is higher for the individual that died first, and discordant if the predicted risk is lower for the individual that died first. Thus, the number of concordant pairs in a sample is given by

    $$\begin{aligned} CC = \sum _{i,j} \mathbbm {1}(t_i > t_j)\mathbbm {1}(\rho _i < \rho _j)\delta _j,\end{aligned}$$
    (14)

    and the number of discordant pairs is

    $$\begin{aligned} DC = \sum _{i,j} \mathbbm {1}(t_i> t_j)\mathbbm {1}(\rho _i > \rho _j)\delta _j,\end{aligned}$$
    (15)

    where the indices i and j refer to pairs of observations in the sample. Multiplication by the factor \(\delta _j\) discards pairs of observations that are not comparable because the smaller survival time is censored, i.e., \(\delta _j = 0\). These definitions do not include comparable pairs with tied risk predictions, so we denote these pairs as

    $$\begin{aligned} TR = \sum _{i,j} \mathbbm {1}(t_i > t_j)\mathbbm {1}(\rho _i = \rho _j)\delta _j.\end{aligned}$$
    (16)

    The number of concordant and discordant pairs is commonly summarized using Harrell’s C-index (Harrell et al. 1982),

    $$\begin{aligned} H_C = \frac{CC+0.5\times TR}{CC+DC+TR}.\end{aligned}$$
    (17)

    Harrell’s C takes values between 0 and 1, with higher values indicating a better fit. Note that randomly assigned predictions have an expected score of \(H_C=0.5\).

    More recently, Uno et al. (2011) introduced a modified C-Statistic that weights comparable pairs of observations based on the distribution of censoring times,

    $$\begin{aligned} U_{C_t} = \frac{\sum _{i,j} (\hat{G}(t_j))^{-2}\mathbbm {1}(t_i> t_j,t_j<t)\mathbbm {1}(\rho _i< \rho _j)\delta _j}{\sum _{i,j}(\hat{G}(t_j))^{-2}( \mathbbm {1}(t_i> t_j,t_j<t)\mathbbm {1}(\rho _i> \rho _j)\delta _j+\mathbbm {1}(t_i > t_j,t_j<t)\mathbbm {1}(\rho _i < \rho _j)\delta _j)},\end{aligned}$$
    (18)

    where \(\hat{G}(\cdot )\) is the Kaplan–Meier estimate for the censoring distribution. Due to these weights, \(U_C\) converges to a quantity that is independent of the censoring distribution. \(U_C\) takes values between 0 and 1, with higher values indicating a better fit.

    The above definition of Uno’s C-statistic was intended for continuous models, and (18) may be very unstable in small trees due to the large number of observations with tied risks which are not counted in either the numerator or denominator. To avoid this, we include these pairs of observations in a similar manner to Harrell’s C-statistic, i.e., weighted by 0.5 in the numerator and 1 in the denominator. The resulting concordance statistic is

    $$\begin{aligned} U^{*}_{C_t} = \frac{\sum _{i,j} (\hat{G}(t_j))^{-2}\mathbbm {1}(t_i> t_j,t_j<t)\left( \mathbbm {1}(\rho _i< \rho _j) +0.5\times \mathbbm {1}(\rho _i = \rho _j)\right) \delta _j}{\sum _{i,j}(\hat{G}(t_j))^{-2}( \mathbbm {1}(t_i> t_j,t_j<t)\mathbbm {1}(\rho _i> \rho _j)\delta _j+\mathbbm {1}(t_i > t_j,t_j<t)\mathbbm {1}(\rho _i \le \rho _j)\delta _j)}.\end{aligned}$$
    (19)

    This modification improves the stability of the concordance statistics but also makes these metrics somewhat less informative in the context of discrete models, since a large number of tied pairs tend to dominate both the numerator and denominator. More generally, concordance statistics do not account for incomparable pairs of observations, which may be problematic when there is significant censoring. The binary definition of concordance fails to account for the magnitude of the difference in predicted risks for comparable observations. As a result, these metrics may be less informative in datasets with significant variations in risk.

    Unlike the Cox Score, concordance statistics do not explicitly rely on any parametric assumptions. For proportional hazards models it is natural to define the predicted risk in terms of the hazard coefficients in (2), i.e., \(\rho _i = \theta _i\). However, it is also possible to contrast the predicted risk of a comparable pair of observations via the predicted survival probabilities, the expected survival times, or any other comparable prediction extracted from the model. In our analysis we evaluate concordance based on the predicted survival probabilities extracted from the Kaplan–Meier curves at each node, i.e., \(\rho _i(\tau ) = 1-\hat{S}_i(\tau )\). When comparing the risks of a pair of observations, survival probabilities are evaluated at the time of the first event, \(\tau = \min \{t_i,t_j\}\).

  3.

    Integrated Brier score

    The Brier score metric is commonly used to evaluate classification trees (Brier 1950). It was originally developed to verify the accuracy of a probability forecast, primarily for weather forecasting. The most common formula calculates the mean squared prediction error:

    $$\begin{aligned} B = \frac{1}{n}\sum _{i=1}^n(\hat{p}(y_i) - y_i)^2, \end{aligned}$$
    (20)

    where n is the sample size, \(y_i \in \{0,1\}\) is the outcome of observation i, and \(\hat{p}(y_i)\) is the forecast probability of this observed outcome. In the context of survival analysis, the Brier score may be used to evaluate the accuracy of survival predictions at a particular point in time relative to the observed deaths at that time. We will refer to this as the Brier Point Score:

    $$\begin{aligned}&BP_{\tau } = \frac{1}{|\mathcal {I}_{\tau }|}\sum _{i \in \mathcal {I}_{\tau }}(\hat{S_i}(\tau ) - \mathbbm {1}(t_i >\tau ))^2, \nonumber \\ \text { where }&\mathcal {I}_{\tau } = \{i\in \{1,\dots , n\} \mid t_i \ge \tau \text { or } \delta _i=1\}. \end{aligned}$$
    (21)

    In this case, \(\hat{S_i}(\tau )\) is the predicted survival probability for observation i at time \(\tau\) and \(\mathcal {I}_{\tau }\) is the set of observations that are known to be alive/dead at time \(\tau\). Observations censored before time \(\tau\) are excluded from this score, as their survival status is unknown.

    Applying this version of the Brier score may be useful in applications where the main outcome of interest is survival at a particular time, such as the 1-year survival rates after the onset of a disease. In the experiments that follow, the point-wise Brier Score will be evaluated at the median observation time in each dataset. For easy interpretation, the reported scores are normalized relative to the score for a null model, i.e.

    $$\begin{aligned} BPR_{\tau }=1-\frac{ BP_{\tau }(T)}{ BP_{\tau }({T^0})}.\end{aligned}$$
    (22)

    The Brier Point score has two significant disadvantages in survival analysis. First, it assesses the predictive accuracy of survival models at a single point in time rather than over the entire observation period, which is not well-suited to applications where survival distributions are the outcome of interest. Second, it becomes less informative as the number of censored observations increases, because a greater number of observations are discarded when calculating the score.

    Graf et al. (1999) have addressed these challenges by proposing an adjusted version of the Brier Score for survival datasets with censored outcomes. Rather than measuring the accuracy of survival predictions at a single point, this measure aggregates the Brier score over the entire time interval observed in the data. This modified measure is commonly used in the survival literature and has been interchangeably called the Brier Score or the Integrated Brier Score by various authors (Reddy and Kronek 2008). In this paper, we will refer to the metric specific to survival analysis as the Integrated Brier score (IB), defined as

    $$\begin{aligned} IB =\frac{1}{t_{max}} \frac{1}{n}\sum _{i=1}^n\left[ \int _0^{t_i} \frac{(1-\hat{S}_{i}(t))^2}{\hat{G}(t)}\, dt + \delta _i\int _{t_i}^{t_{max}} \frac{(\hat{S}_{i}(t))^2}{\hat{G}(t_i)}\, dt\right] .\end{aligned}$$
    (23)

    The IB score uses Kaplan–Meier estimates for both the survival distribution, \(\hat{S}(t)\), and the censoring distribution, \(\hat{G}(t)\). In a survival tree model, these estimates are obtained by pooling the observations in each node of the tree, i.e., \(\hat{S}_i(t)=\hat{S}_{T(i)}(t)\). The IB score is a weighted version of the original Brier Score, with weight \(1/\hat{G}(t)\) applied at times before the observed time \(t_i\), and weight \(1/\hat{G}(t_i)\) applied after an observed death. This metric addresses many of the deficiencies identified in the Cox and concordance scores above: it is non-parametric, counts both censored and uncensored observations, and evaluates the accuracy of the predicted survival functions over the entire time horizon.

    In subsequent sections, we report a normalized version of this metric, the Integrated Brier score ratio (IBR), which compares the sum of the Integrated Brier scores in a given tree to the corresponding scores in a null tree:

    $$\begin{aligned} IBR=1-\frac{ IB(T)}{ IB({T^0})}.\end{aligned}$$
    (24)
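The computational sketch referenced at the start of this list is given below. It implements the log partial likelihood (11), Harrell's C-index (17) and the Brier point score (21) for arrays of observation times, event indicators and model outputs; the loops are written for clarity rather than speed, and the function names are our own.

```python
import numpy as np

def cox_log_partial_likelihood(t, d, eta):
    # Eq. (11): eta_i is the linear predictor X_i beta
    # (for a tree model, the log hazard coefficient of the leaf containing i)
    ll = 0.0
    for i in range(len(t)):
        if d[i] == 1:
            ll += eta[i] - np.log(np.exp(eta[t >= t[i]]).sum())
    return ll

def harrell_c(t, d, rho):
    # Eqs. (14)-(17): pairs (i, j) are comparable when t_i > t_j and delta_j = 1;
    # assumes the sample contains at least one comparable pair
    cc = dc = tr = 0
    for j in range(len(t)):
        if d[j] == 1:
            later = t > t[j]
            cc += int((rho[later] < rho[j]).sum())   # earlier death had higher risk
            dc += int((rho[later] > rho[j]).sum())   # earlier death had lower risk
            tr += int((rho[later] == rho[j]).sum())  # tied risk predictions
    return (cc + 0.5 * tr) / (cc + dc + tr)

def brier_point(t, d, S_tau, tau):
    # Eq. (21): compare predicted survival at tau with known status at tau,
    # excluding observations censored before tau
    known = (t >= tau) | (d == 1)
    alive = (t[known] > tau).astype(float)
    return float(((S_tau[known] - alive) ** 2).mean())
```

For a tree model, the inputs eta and rho are constant within each leaf, and S_tau holds each observation's leaf-level Kaplan–Meier survival estimate evaluated at \(\tau\).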

Aside from the limitations already discussed, we note that all of the above metrics are subject to noise and often provide contradictory assessments when comparing different tree models. For example, in our empirical experiments comparing three candidate models, only about 30% of instances produced a model that scored at least as well as both competitors on all metrics; in the remaining 70% of test cases, no candidate model dominated the others. These limitations make it difficult to obtain an unambiguous comparison between the performance of different survival tree algorithms. To address this challenge, we now introduce a simulation procedure and associated accuracy metrics that are specifically designed to assess survival tree models.

5.2 Simulation accuracy metrics

A key difficulty in selecting performance metrics for survival tree models is that the definition of “accuracy” can depend on the context in which the model will be used. For example, consider a survival tree that models the relationship between lifestyle factors and age of death. A medical researcher may use such a model to identify risk factors associated with early death, while an insurance firm may use this model to predict mortality risks for individual clients in order to estimate the volume of life insurance policy pay-outs in the coming years. The medical researcher is primarily interested in whether the model has identified important splits, while the insurer is more focused on whether the model can accurately estimate survival distributions.

In subsequent sections we refer to these two properties as tree recovery and prediction accuracy. We develop metrics to measure these outcomes in simulated datasets with the following structure:

Let \(i=1,\dots ,n\) be a set of observations with independent, identically distributed covariates \(\mathbf {X}_{i}=(X_{ij})_{j=1}^m\). Let \(T^{true}\) be a tree model that partitions observations based on these covariates such that \(T^{true}_i = T^{true}(\mathbf {X}_{i})\) is the index of the leaf node in \(T^{true}\) that contains individual i. Let \(S_i\) be a random variable representing the survival time of observation i, with distribution \(S_i\sim F_{T^{true}_i}(t)\). The survival distribution of each individual is entirely determined by its location in the tree \(T^{true}\), and so we refer to \(T^{true}\) as the “true” tree model.

This underlying tree structure provides an unambiguous target against which we can measure the performance of empirical survival tree models. In this context, an empirical survival tree model T has high accuracy if it achieves the following objectives:

  1.

    Tree recovery: the model recovers the structure of the true tree (i.e., \(T(\mathbf {X}_{i})=T^{true}(\mathbf {X}_{i})\)).

  2.

    Prediction accuracy: the model recovers the corresponding survival distributions of the true tree (i.e., \(\hat{F}_{T_i}(t)={F}_{T^{true}_i}(t)\)).

It is important to recognize that these two objectives are not necessarily consistent, particularly in small samples. For example, models with perfect tree recovery may have a small number of observations in each leaf node, leading to noisy survival estimates with low prediction accuracy.

5.2.1 Tree recovery metrics

We measure the tree recovery of an empirical tree model (T) relative to the true tree (\(T^{true}\)) using the following metrics:

  1.

    Node homogeneity: The node homogeneity statistic measures the proportion of the observations in each node \(k\in T\) that have the same true class in \(T^{true}\). This metric is equivalent to the misclassification error and cluster purity metrics commonly used to evaluate clustering and tree-based binary classification, respectively (Friedman et al. 2001; Rendón et al. 2011). Let \(p_{k,\ell }\) be the proportion of observations in node \(k \in T\) that came from class \(\ell \in T^{true}\), and let \(n_{k,\ell }\) be the total number of observations in node \(k \in T\) from class \(\ell \in T^{true}\). Then,

    $$\begin{aligned} NH = \frac{1}{n}\sum _{k \in T}\sum _{\ell \in T^{true}} n_{k,\ell }\,p_{k,\ell }.\end{aligned}$$
    (25)

    A score of \(NH = 1\) indicates that each node in the new tree model contains observations from a single class in \(T^{true}\). This does not necessarily mean that the structure of T is identical to \(T^{true}\): for example, a saturated tree with a single observation in each node would have a perfect node homogeneity score (see Fig. 2). The node homogeneity metric is therefore biased towards larger tree models with few observations in each node.

  2.

    Class recovery

    Class recovery is a measure of how well a new tree model is able to keep similar observations together in the same node, thereby avoiding unnecessary splits. Class recovery is calculated by counting the proportion of observations from a true class \(\ell \in T^{true}\) that are placed in the same node in T. Let \(q_{k,\ell }\) be the proportion of observations from class \(\ell \in T^{true}\) that are classified in node \(k \in T\), and let \(n_{k,\ell }\) be the total number of observations in node \(k \in T\) from class \(\ell \in T^{true}\). Then,

    $$\begin{aligned} CR = \frac{1}{n}\sum _{\ell \in T^{true}}\sum _{k \in T} n_{k,\ell }\,q_{k,\ell }. \end{aligned}$$
    (26)

    This metric is biased towards smaller trees, since a null tree with a single node would have a perfect class recovery score. It is therefore useful to consider both the class recovery and node homogeneity scores simultaneously in order to assess the performance of a tree model (see Fig. 2 for examples). When used together, these metrics indicate how well the model T reflects the structure of the true model \(T^{true}\).

Fig. 2: Tree recovery metrics for a survival tree with two classes of observations. The top left tree represents the true tree model

The node homogeneity and class recovery scores can also be used to compare any two tree models, \(T^a\) and \(T^b\). In this case, these metrics should be interpreted as a measure of structural similarity between the two tree models. Note that when \(T^a\) and \(T^b\) are applied to the same dataset, the node homogeneity for model \(T^a\) relative to \(T^b\) is equivalent to the class recovery for \(T^b\) relative to \(T^a\), and vice versa. The average node homogeneity score for \(T^a\) and \(T^b\) is therefore equal to the average class recovery score for \(T^a\) and \(T^b\). We will refer to this as the similarity score for models \(T^a\) and \(T^b\).
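A compact sketch of these scores, assuming each tree is represented simply by an array assigning a leaf (or class) label to each of the n observations:

```python
import numpy as np

def node_homogeneity(T_a, T_b):
    # Eq. (25): for each node k of T_a, sum n_{k,l} * p_{k,l} over classes l of T_b
    total = 0.0
    for k in np.unique(T_a):
        counts = np.unique(T_b[T_a == k], return_counts=True)[1]  # n_{k,l}
        total += (counts * counts / counts.sum()).sum()           # n_{k,l} * p_{k,l}
    return total / len(T_a)

def class_recovery(T_a, T_b):
    # Eq. (26): node homogeneity with the roles of the two trees swapped
    return node_homogeneity(T_b, T_a)

def similarity(T_a, T_b):
    # average of the two node homogeneity (equivalently class recovery) scores
    return 0.5 * (node_homogeneity(T_a, T_b) + node_homogeneity(T_b, T_a))

# Example: leaf labels of a fitted tree vs. true classes for six observations
T_fit = np.array([0, 0, 0, 1, 1, 1])
T_true = np.array([0, 0, 1, 1, 1, 1])
print(node_homogeneity(T_fit, T_true), class_recovery(T_fit, T_true))
```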

5.2.2 Prediction accuracy metric

Our prediction accuracy metric measures how well the non-parametric Kaplan–Meier curve at each leaf in T estimates the true survival distribution of each observation.

  1.

    Area between curves (ABC)

    For an observation i with true survival distribution \(F_{T^{true}_i}(t)\), suppose that \(\hat{S}_{T_i}(t)\) is the Kaplan–Meier estimate at the corresponding node in tree T (see Fig. 3). The area between the true survival curve and the tree estimate is given by

    $$\begin{aligned} ABC_i^T = \frac{1}{t_{max}}\int _{0}^{t_{max}} |1-F_{T^{true}_i}(t)-\hat{S}_{T_i}(t)|dt.\end{aligned}$$
    (27)

    To make this metric easier to interpret, we compare the area between curves in a given tree to the score of a null tree with a single node (\(T^0\)). The area ratio (AR) is given by

    $$\begin{aligned} AR=1-\frac{\sum _i ABC_i^T}{\sum _i ABC_i^{T^0}}.\end{aligned}$$
    (28)

    Similar to the popular \(R^2\) metric for regression models, the AR indicates how much accuracy is gained by using the Kaplan–Meier estimates generated by the tree relative to the baseline accuracy obtained by using a single estimate for the whole population.

    Both the ABC and IB metrics measure the fit of the survival distributions generated at leaf nodes, which are an important component of tree-based survival models. The most important conceptual difference between these metrics is that the IB score compares the estimated survival distributions to events observed in the sampled data (using weights to account for censoring), while the ABC measures accuracy relative to the true survival distributions, which are not affected by censoring or sample size. The ABC cannot be applied in real-world settings where the underlying distributions are unknown, but it provides a simple and intuitive measure of the fit of survival curves in simulation experiments.
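As a sketch of Eqs. (27) and (28), the following approximates the integral with the trapezoidal rule on a common time grid; the grid and the illustrative curves are assumptions of this example:

```python
import numpy as np

def abc(ts, S_true, S_hat):
    # Eq. (27): normalized area between the true survival curve and the
    # Kaplan-Meier estimate, both evaluated on a grid ts spanning [0, t_max]
    gap = np.abs(S_true - S_hat)
    integral = (0.5 * (gap[1:] + gap[:-1]) * np.diff(ts)).sum()  # trapezoidal rule
    return integral / (ts[-1] - ts[0])

def area_ratio(abc_tree, abc_null):
    # Eq. (28): accuracy gained relative to a single pooled (null-tree) estimate
    return 1.0 - np.sum(abc_tree) / np.sum(abc_null)

# Example: exponential true survival curve vs. a crude two-step estimate
ts = np.linspace(0.0, 10.0, 201)
S_true = np.exp(-0.3 * ts)
S_hat = np.where(ts < 2.0, 1.0, 0.4)
print(abc(ts, S_true, S_hat))
```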

Fig. 3: An illustration of the area between the true survival distribution and the Kaplan–Meier curve

6 Simulation results

In this section we evaluate the performance of the Optimal Survival Trees (OST) algorithm and compare it to two existing survival tree models available in the R packages rpart and ctree. Our tests are performed on simulated datasets with the structure described in Sect. 5.2.

6.1 Simulation procedure

The procedure for generating simulated datasets in these experiments is as follows (a minimal sketch in code is given after the list):

  1.

    Randomly generate a sample of 20,000 observations with six covariates. The first three covariates are uniformly distributed on the interval [0, 1] and the remaining three covariates are discrete uniform random variables with 2, 3 and 5 levels.

  2.

    Generate a random “ground truth” tree model, \(T^{true}\), that partitions the dataset based on these six covariates (see Algorithm 1 in Appendix 1).

  3.

    Assign a survival distribution to each leaf node in the tree \(T^{true}\) (see the Appendix for a list of distributions).

  4.

    Classify observations into node classes \(T^{true}_i = T^{true}(\mathbf {X}_i)\) according to the ground truth model. Generate a survival time, \(s_i\), for each observation based on the survival distribution of its node: \(S_i\sim F_{T^{true}_i}(t)\).

  5.

    Generate a censoring time for each observation, \(c_i = \kappa (1-u_i^2)\), where \(u_i\) follows a uniform distribution and \(\kappa\) is a non-negative parameter used to control the proportion of censored individuals.

  6.

    Assign observation times \(t_i=\min (s_i,c_i)\). Individuals are marked as censored (\(\delta _i=0\)) if \(t_i=c_i\).
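The sketch referenced above generates one such dataset. The ground-truth partition and the exponential leaf distributions are simple placeholders for Algorithm 1 and the distributions listed in the Appendix, and \(\kappa\) is tuned by hand here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Step 1: three continuous covariates on [0, 1] and three discrete covariates
# with 2, 3 and 5 levels
X = np.column_stack([
    rng.uniform(size=(n, 3)),
    rng.integers(0, 2, n), rng.integers(0, 3, n), rng.integers(0, 5, n),
])

# Steps 2-4: a placeholder ground-truth tree with four leaves and one
# exponential survival distribution per leaf (stand-ins for Algorithm 1)
node = (X[:, 0] > 0.5).astype(int) + 2 * (X[:, 3] > 0).astype(int)
scale = np.array([1.0, 2.0, 4.0, 8.0])[node]
s = rng.exponential(scale)

# Step 5: censoring times c_i = kappa * (1 - u_i^2); kappa tunes the censoring rate
kappa = 6.0
c = kappa * (1.0 - rng.uniform(size=n) ** 2)

# Step 6: observed times and censoring indicators
t = np.minimum(s, c)
delta = (s <= c).astype(int)   # delta = 0 marks a censored observation
print(f"censoring proportion: {1 - delta.mean():.2f}")
```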

We used this procedure to generate 1000 datasets based on ground truth trees with a minimum depth of 3 and a maximum depth of 4 (i.e., at most \(2^4=16\) leaf nodes). In each dataset, 10,000 observations were set aside for testing the tree models. Training datasets of n observations were sampled from the remaining data for \(n \in \{100, 200, 500, 1000, 2000, 5000, 10000\}\).

In addition to varying the size of the training dataset, we also varied the proportion of censored observations in the data by adjusting the parameter \(\kappa\). Censoring was applied at nine different levels to generate examples with low censoring (0%, 10%, 20%), moderate censoring (30%, 40%, 50%) and high censoring (60%, 70%, 80%). In total, 63 OST models were trained for each dataset to test each of the seven training sample sizes at each of the nine censoring levels.

We evaluated the performance of the OST algorithm relative to two existing survival tree algorithms available in the R packages rpart (Therneau et al. 2010) and ctree (Hothorn et al. 2010). Each of the three algorithms was trained and tested on exactly the same data in each dataset.

Each of the three algorithms tested requires two input parameters that control the model size: a maximum tree depth and a complexity/significance parameter that determines which splits are worth keeping in the tree (the interpretation of the ctree significance parameter differs from the complexity parameters in the OST and rpart algorithms, but it serves a similar function).

Since neither rpart nor ctree have built-in methods for selecting tree parameters, we used a similar 5-fold cross-validation procedure on the training data to select the parameters for each algorithm. We considered tree depths up to three levels greater than the true tree depth and complexity parameter/significance values between 0.001 and 0.1 for the rpart and ctree algorithms (the OST complexity parameter is automatically selected during training). Equation (7) was used as the scoring metric to evaluate out-of-sample performance during cross-validation, and the minimum node size for all algorithms was fixed at 5 observations.

6.2 Results

Fig. 4: The average tree size for models trained on various sample sizes

To demonstrate the effect of this cross-validation procedure, we summarize the average size of the models produced by each algorithm in Fig. 4. We see a clear link between tree size and the number of training observations, indicating that the cross-validation procedure selects more conservative depth/complexity parameters when relatively little data is available. In larger datasets, the OST models grow to approximately the same size as the true tree models (6 nodes, on average), while the rpart and ctree models are slightly larger.

6.2.1 Survival analysis metrics

Figure 5 summarizes the performance of each algorithm in our simulations using the four survival model metrics from Sect. 5. The values displayed in each chart are the average performance statistics across all test datasets.

As expected, the average performance of all three algorithms consistently improves as the size of the training dataset increases. The performance statistics also increase as the proportion of censored observations increases, which seems counter-intuitive (we would expect more censoring to lead to less accurate models). In the case of the Cox partial likelihood and C-statistics, this trend is directly linked to the number of observed deaths, since only observations with observed deaths contribute to the partial likelihood and concordance scores. Similarly, censored observations do not contribute to the Integrated Brier Score after their censoring time.

Each chart also indicates the performance of the true tree model, \(T^{true}\), as a point of comparison for the other algorithms. The true tree model performs significantly better than the empirical models trained on smaller datasets, but all three algorithms approach the performance of the true tree for very large sample sizes.

Based on these results, we conclude that the average performance of the OST algorithm in these simulations is consistently better than either of the other two algorithms. In order to understand why this algorithm is able to generate better models, we now analyse the results of the tree metrics introduced in Sect. 5.2.

Fig. 5: A summary of the survival model metrics from simulation experiments. The average test set outcomes for each algorithm are shown in color, while the performance of the true tree model, \(T^{true}\), is indicated in black. Shaded areas indicate 95% confidence intervals

6.2.2 Tree recovery

Fig. 6: A summary of the tree recovery metrics for survival tree algorithms

The test set tree recovery metrics for all three algorithms are summarized in Table 1 and Fig. 6. The average node homogeneity/class recovery scores are given side-by-side to allow for a comprehensive assessment of each algorithm’s performance. These results confirm that the OST models perform significantly better than the other two models across all censoring levels.

Table 1 A summary of the average node homogeneity/class recovery scores for synthetic experiments

The node homogeneity scores for all three algorithms increase with larger sample sizes, indicating that the availability of additional data leads to better detection of relevant splits. In large populations, the OST algorithm selects more efficient splits than the other models and is able to achieve better node homogeneity with fewer splits (recall Fig. 4—the OST models trained on large data sets have fewer leaf nodes than the other models, on average).

The relationship between tree size and class recovery rates is somewhat more complicated. In datasets smaller than 500 observations the class recovery rates seem to be closely linked to the tree size: the ctree models have the highest average class recovery for models trained on 100 and 200 observations, and also the smallest number of nodes (see Fig. 4). However, this trend does not hold in datasets with 500 observations, where OST models are larger than the ctree models on average, but also have slightly better class recovery. This suggests that tree size is no longer a dominant factor in larger datasets (\(n\ge 500\)).

In these larger datasets we observe distinct trends in class recovery scores. The OST class recovery rate increases consistently despite the increases in model size, which means that the OST models are able to produce more complex trees without overfitting in the training data. By contrast, both of the other algorithms have consistently worse class recovery rates as sample size increases and their models become larger. Based on this trend, neither of these algorithms will reliably converge to the true tree.

6.2.3 Prediction accuracy

Table 2 A summary of the average Kaplan–Meier area ratio (AR) scores for simulation experiments

The test set prediction accuracy metric for each of the three algorithms is summarized in Table 2 and Fig. 7. Overall, the results indicate that sample size plays the most significant role in test set accuracy across all three algorithms. There is also a small increase in accuracy when censoring is increased, which is due to the reduction in the maximum observed time, \(t_{max}\). The OST results are generally better than the other algorithms across all sample sizes, although the performance gap is relatively small in smaller datasets.

To illustrate the effect of sample size on the accuracy of the Kaplan–Meier estimates, Fig. 7 also shows the curve accuracy metrics for the true tree, \(T^{true}\). It is immediately apparent that even the true tree models produce poor survival curve estimates in small datasets. Based on these results, it may be necessary to increase the minimum node size to at least 50 observations in applications where Kaplan–Meier curves will be used to summarize survival tree nodes.

Fig. 7: A summary of the average Kaplan–Meier Area Ratio results for simulation experiments. The performance of the true tree model is indicated in black

6.2.4 Comparison of accuracy metrics

Table 3 shows the correlation between each pair of accuracy metrics used in the simulation experiments. All outcome metrics are positively correlated with the exception of class recovery, which has both weak positive and weak negative correlations with other metrics. These mixed results are due to the different trends in class recovery among the three algorithms – OST class recovery was highest for trees trained on larger datasets, while the other algorithms had lower class recovery in these instances (see Fig. 6). Node homogeneity was positively correlated with other metrics, but the correlations were somewhat weaker than average. This reflects the incomplete information captured by this metric – node homogeneity alone does not guarantee a good model, as discussed in Sect. 5.2.1.

Among the other metrics, the highest correlation was observed between the two concordance statistics (0.98), which also had the strongest correlation with most other metrics. There was also high correlation between the two Brier metrics (0.86). The Cox score was most strongly correlated with the concordance statistics (0.87), followed by the Brier statistics (0.77). The Kaplan–Meier area ratio had slightly lower average correlations and was most strongly correlated with the node homogeneity statistic. This is likely due to the fact that both of these metrics are based on the true tree structure, while other metrics reflect how well a model fits the available data.
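
A correlation table of this kind can be assembled directly from the per-run metric values. The following is a minimal sketch, assuming each simulation run's scores are stored as one row of a pandas DataFrame; the column names and the random placeholder values are purely illustrative, not the paper's actual data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One row per simulation run, one column per accuracy metric.
# Random placeholder values stand in for the real experiment scores.
metrics = ["class_recovery", "node_homogeneity", "harrell_c", "uno_c",
           "brier_point", "brier_integrated", "cox_pl_ratio", "km_area_ratio"]
results = pd.DataFrame(rng.uniform(0, 1, size=(300, len(metrics))),
                       columns=metrics)

# Pairwise Pearson correlations between metrics across all runs,
# analogous to the structure of Table 3.
print(results.corr(method="pearson").round(2))
```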

Table 3 Correlation between different accuracy metrics in simulation experiments

6.2.5 Stability

Fig. 8 A summary of the average similarity scores between pairs of trees trained on mutually exclusive sets of observations

Fig. 9 A summary of survival tree accuracy metrics for datasets with added noise

Fig. 10 A summary of simulation accuracy metrics for datasets with added noise

A frequent criticism of single-tree models is their sensitivity to small changes in the training data. This may be apparent when a tree algorithm produces very different models for different training datasets sampled from the same population. This type of instability is often an indication that the model will not perform well on unseen data.

Given the challenges associated with measuring the test set accuracy for survival tree algorithms, it may be tempting to use stability as a performance metric for these models. Stability is a necessary condition for accuracy in tree models (provided that a tree structure is suitable for the data) but stable models are not necessarily accurate. For example, greedy tree models with depth 1 may select the same split for all permutations of the training data, but these models will not be accurate if the data requires a tree of depth 3.

Although stability is not necessarily a good indicator of the quality of a model, it is nevertheless interesting to consider how the stability of globally optimized trees may differ from the stability of greedy trees. Globally optimized trees are theoretically capable of greater stability because they may include splits that are not necessarily locally optimal for a particular training dataset. However, globally optimized trees also consider a significantly larger number of possible tree configurations and therefore have many more opportunities to overfit features of a particular training dataset.

We ran two sets of experiments to investigate the stability of the survival tree models in our simulations. In the first set of experiments we used each algorithm to train two models, \(T^a\) and \(T^b\), on non-overlapping training datasets of equal size drawn from the same population. We then applied each model to the entire dataset (20,000 observations) and used the tree similarity score described in Sect. 5.2.1 to assess the structural similarity between the two models. The average similarity scores for each algorithm are illustrated in Fig. 8.
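
A minimal sketch of this pairwise protocol is shown below, assuming a generic `fit_tree` training routine and a `tree_similarity` function implementing the score from Sect. 5.2.1; both are hypothetical placeholders for the actual implementations:

```python
import numpy as np

def stability_score(data, n_train, fit_tree, tree_similarity, rng):
    """Train two trees on disjoint samples of size n_train drawn from
    `data` (a NumPy array of observations) and score their structural
    similarity on the full dataset."""
    idx = rng.permutation(len(data))
    sample_a = data[idx[:n_train]]             # first training set
    sample_b = data[idx[n_train:2 * n_train]]  # disjoint second set

    tree_a = fit_tree(sample_a)
    tree_b = fit_tree(sample_b)

    # Compare the partitions the two trees induce on all observations
    # (20,000 in the experiments described above).
    return tree_similarity(tree_a, tree_b, data)
```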

These results demonstrate that stability across different training datasets is not a sufficient condition for accuracy: models trained on 100 and 200 observations are both more stable and less accurate than models trained on 500 observations. The ctree algorithm produced the most stable results in smaller datasets due to the smaller model sizes selected during cross-validation. For example, 33.1% of ctree models trained on 100 observations had fewer than 2 splits, compared to 29.5% of the rpart models and 26.5% of the OST models.

The stability results for larger training datasets (\(n>1000\)) are reasonably consistent with the accuracy metrics discussed above, and both stability and accuracy increase with sample size across all three algorithms. The OST models have the highest average similarity scores in large datasets and the rpart models are slightly more stable than the ctree models.

In the second set of stability experiments we investigated how small perturbations to the covariate values in the training dataset affect the test set accuracy of each model. We added noise to the training data by replacing the original continuous covariate values, \(x_{ij}\), with “noisy” values \(\tilde{x}_{ij}=x_{ij} + \epsilon _{ij}\). The initial covariates were uniformly distributed between 0 and 1 and the added noise terms were generated from the following two distributions:

$$\begin{aligned} \epsilon_{ij} &\sim U(-0.05, 0.05) \qquad &&\text{(5\% noise), and}\\ \epsilon_{ij} &\sim U(-0.1, 0.1) &&\text{(10\% noise).} \end{aligned}$$

A similar approach was applied to the categorical variables, which were generated by rounding off continuous values (\(x_{ij}\) or \(\tilde{x}_{ij}\)) to the appropriate thresholds. Note that noise was added only to the observations used for training; the testing data was unchanged.
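
A minimal sketch of this perturbation scheme, assuming covariates uniformly distributed on \([0,1]\) as above; the two-level categorical example and its cutoff are illustrative:

```python
import numpy as np

def add_noise(X, noise_level, rng):
    """Perturb covariates with uniform noise, e.g. noise_level=0.05
    for 5% noise or 0.10 for 10% noise."""
    eps = rng.uniform(-noise_level, noise_level, size=X.shape)
    return X + eps

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 5))  # original covariates, U(0, 1)
X_noisy = add_noise(X, 0.05, rng)      # training copy with 5% noise

# Categorical covariates are re-derived by rounding the (noisy)
# continuous values to thresholds; an illustrative two-level example:
x_cat_noisy = (X_noisy[:, 0] > 0.5).astype(int)
```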

The results of these experiments are contrasted with the initial outcomes (without added noise) in Figs. 9 and 10. The effects of additional noise in the training data are visible in the results of all three algorithms and the drop in accuracy appears to be fairly consistent. Overall, the OST models maintain the highest scores regardless of noise.

These results indicate that perturbations in the training data affect the OST and greedy tree algorithms in similar ways. The OST algorithm’s performance is diminished by adding noise to the training data, but its ability to consider a wider range of split configurations does not make it more sensitive to these perturbations. In fact, the OST algorithm is generally slightly more stable than the greedy algorithms under perturbations of the training data because it tends to produce models that are consistently closer to the true tree.

6.3 Scaling performance

We now provide an overview of the computational performance of the OST algorithm on the synthetic censored datasets. We use the procedure described in Sect. 6.1 to create simulated data, varying the number of observations n, the number of features p, and the percentage of censoring. We consider datasets of size \(n \in \{5000, 10{,}000, 25{,}000, 50{,}000, 100{,}000\}\) and \(p \in \{10, 50, 100\}\), and three percentages of censoring, \(10\%\), \(50\%\), and \(80\%\), corresponding to low, moderate, and high censoring respectively. We repeat the experiment for each combination of these parameters on 100 randomized datasets and report the average scaling performance and the associated 95% confidence intervals. We perform cross-validation using grid search to select the best parameters for each model, and we report the computational time of the training procedure. Figure 11 illustrates our findings.
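
A minimal sketch of this benchmarking loop is given below, assuming a hypothetical `simulate_censored_data` generator implementing the procedure of Sect. 6.1 and a hypothetical `fit_ost_with_cv` routine wrapping the grid-search cross-validation and final fit:

```python
import time
from itertools import product
import numpy as np

N_VALUES = [5_000, 10_000, 25_000, 50_000, 100_000]
P_VALUES = [10, 50, 100]
CENSORING = [0.10, 0.50, 0.80]  # low, moderate, high censoring
N_REPEATS = 100

def benchmark(simulate_censored_data, fit_ost_with_cv):
    """Time cross-validated training over the full parameter grid.
    Both arguments are stand-ins for the paper's actual routines."""
    timings = {}
    for n, p, cens in product(N_VALUES, P_VALUES, CENSORING):
        runs = []
        for seed in range(N_REPEATS):
            data = simulate_censored_data(n=n, p=p, censoring=cens, seed=seed)
            start = time.perf_counter()
            fit_ost_with_cv(data)  # grid-search CV + final training
            runs.append(time.perf_counter() - start)
        # Mean and spread of training time for this parameter combination.
        timings[(n, p, cens)] = (np.mean(runs), np.std(runs))
    return timings
```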

Across all experiments, the algorithm was able to complete in less than an hour. There was no significant change in the average running time across the different levels of censoring. However, the number of features, p, did have a substantial impact on the computational performance. For \(p<100\), all instances solved within 40 minutes; in particular, when the number of covariates is restricted to 10, the average time to solve is less than 25 minutes even when the sample size is 100,000. Empirically, increasing the number of observations affects the running time in a roughly linear way, while increasing the number of features has an exponential effect.

Fig. 11 Average computational time for OST tree construction on synthetically generated datasets, with varying numbers of observations n and covariates p. The shaded region corresponds to the 95% confidence intervals

We present a comparative analysis of the computational performance of the OST, rpart, and ctree algorithms in “Appendix 1.3”. Due to its greedy nature, rpart is able to terminate in less than a minute across all instances. The ctree package requires significantly more time: although it runs faster than OST on these instances, its running time grows at an exponential rate as the number of observations increases.

7 Computational experiments with censored data from longitudinal studies and surveys

In this section, we focus on different aspects of algorithmic performance using three widely known longitudinal studies. In Sect. 7.1, we present results from the Wisconsin Longitudinal Study and highlight differences in performance as we vary the mix of categorical and numerical features. In Sect. 7.2, we use data from the Health and Lifestyle Survey to compare the algorithms on a large set of features. Finally, in Sect. 7.3, we showcase an application of the algorithm to heart disease using data from the landmark Framingham Heart Study.

The three datasets discussed in this section are typical real-world applications of survival analysis: the outcome of interest is the time to a particular event, and each dataset includes censored outcomes due to individuals lost to follow-up during longitudinal studies. In “Appendix 3”, we describe additional experiments in which we simulate different levels of censoring in datasets drawn from the UCI repository (Dua and Graff 2017). These supplementary results demonstrate the strong performance of the OST algorithm across a variety of datasets with a range of different sizes and features.

7.1 The Wisconsin longitudinal study

In 1957, the Wisconsin Longitudinal Study (WLS) randomly sampled 10,317 Wisconsin high school graduates (one-third of all graduates) for a decades-long study, observing them until 2011 (Herd et al. 2014). The aim of the study was to understand how factors such as social background, schooling, military service, labor market experiences, family characteristics and events, and social participation may affect mortality and morbidity, family functioning, and health. Our analysis includes data from all recorded participants for 518 variables that were collected either from the original respondents or from their parents.

We removed from our dataset all features for which more than 50% of the values were missing. We imputed the remaining missing values with the mean of each covariate for numerical features and with the mode for categorical and binary variables. In total, we retained 317 categorical, 103 numerical, and 77 binary covariates. In each randomized experiment, we sampled 10, 15, 20, 25, or 30 features from each category. Our goal was to observe the algorithms’ performance as we vary the combination of different types of covariates.
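
A minimal sketch of this preprocessing pipeline in pandas; the grouping of columns into types and the function name are generic, and the actual WLS variable names are not reproduced here:

```python
import pandas as pd

def preprocess_wls(df, feature_groups, rng, k=10):
    """feature_groups maps 'numerical', 'categorical', and 'binary' to
    lists of column names; k features are sampled from each group
    (k in {10, 15, 20, 25, 30} in the experiments above)."""
    # Remove features with more than 50% missing values.
    df = df.loc[:, df.isna().mean() <= 0.5].copy()

    # Mean-impute numerical features; mode-impute categorical/binary ones.
    for col in df.columns:
        if col in feature_groups["numerical"]:
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Sample k surviving covariates of each type for one experiment.
    keep = []
    for cols in feature_groups.values():
        available = [c for c in cols if c in df.columns]
        keep += list(rng.choice(available, size=min(k, len(available)),
                                replace=False))
    return df[keep]
```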

Our results show minimal variability in performance as we change the number of numerical and binary features (see “Appendix 2.1”). However, all three methods show trends in the average performance scores for different numbers of categorical features, as shown in Fig. 12. Specifically, both OST and rpart algorithms show slight decreases in performance with larger feature sets, likely due to overfitting, while the ctree algorithm performs slightly better on larger feature sets.

Overall, OST clearly outperforms the other methods in terms of the Integrated Brier Score and the Cox PL ratio, and is on par with rpart in both concordance statistics. The ctree algorithm performs poorly relative to the other algorithms across all metrics.

Fig. 12 Average performance of survival tree models on subsets of features from the WLS dataset with varying numbers of categorical variables. The shaded regions represent 95% confidence intervals across 100 randomized experiments

7.2 The health and lifestyle survey

The first Health and Lifestyle Survey (Cox et al. 1988) was carried out in 1984–1985 on a random sample of the population of England, Scotland and Wales. Its objective was to help researchers relate self-reported health, attitudes to health, and beliefs about the causes of disease to measurements of health and lifestyle in adults from different parts of Great Britain. In our numerical experiments, the outcome of interest is the age of death of study participants as observed by follow-up studies until 2009. Our dataset includes 9003 individuals and 112 binary features. We conducted 100 randomized experiments to train each tree algorithm.

Table 4 Average scores for OST, rpart, ctree models on the HALS dataset

Table 4 outlines the results of our analysis on the HALS dataset. The OST algorithm outperforms the other methods on all metrics other than Uno’s C. Specifically, OST achieves an average Integrated Brier Score of 0.6114, compared to 0.6056 and 0.6105 for ctree and rpart respectively. In terms of the Cox PL ratio, OST offers an 8% improvement over the next best method (rpart), with an average score of 0.0125. OST’s average Harrell’s C is 0.6211, while ctree and rpart score 0.6113 and 0.6185 respectively. For Uno’s C, in contrast to the other measures of performance, ctree achieves the best score with an average of 0.4098, a margin of 0.0111 over OST. Our findings from this study are in line with the results in Sect. 6 and the supplementary experiments in “Appendix 3”.

7.3 The Framingham heart study

In this section, we focus on the interpretation of the tree models using data from the Framingham Heart Study (FHS). Analysis of the FHS successfully identified the common factors or characteristics that contribute to Coronary Heart Disease (CHD) using the Cox regression model (Cox 1972). In our survival tree model, we include all participants in the study from the original cohort (1948–2014) and the offspring cohort (1971–2014) who were diagnosed with CHD. The event of interest in this model is the occurrence of a myocardial infarction or stroke. All 2296 patients were followed for a period of at least 10 years after their first diagnosis of CHD, and observations were marked as censored if no event was observed while the patient was under observation.

We applied our algorithm to the primary variables that have been used in the established 10-year Hard Coronary Heart Disease (HCHD) Risk Calculator and the Cardiovascular Risk Calculator (Expert Panel 2001; D’Agostino et al. 2008). For each participant who was diagnosed with CHD, we include the following covariates in our training dataset: gender, smoking status (smoke), Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), use of anti-hypertensive medication (AHT), Body Mass Index (BMI), and diabetic status (diabetes). We did not include cholesterol levels in our analysis because these variables are highly correlated with the use of lipid-lowering treatment, and a high proportion of the sample population did not have sufficient data to account for this interaction.

In Fig. 13 we illustrate the output of our algorithm on the FHS dataset. Every node of the tree provides the following information (a sketch of how the per-node survival curves can be computed follows the list):

  • The node number.

  • Number of observations classified into the node.

  • Proportion of the node population which has been censored.

  • A plot of survival probability vs. time. In this example, the x-axis represents age and the y-axis gives the Kaplan–Meier estimate for the probability of experiencing no adverse events.

  • Color-coded survival curves to describe the different sub-populations. In each node, the blue curves describe the individuals classified into that node.

  • In internal (parent) nodes, the orange/green curves describe the sub-populations that are split into the left/right child node. After each split, the sub-population with higher likelihood of survival goes into the left node.

  • In leaf nodes, the red curve shows the average survival curve for the entire tree. This facilitates easy comparisons between the survival of a specific node and the rest of the population.
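
The curves in each node are standard Kaplan–Meier estimates computed on the observations routed to that node. The following is a minimal sketch using the lifelines package (one common Python implementation of Kaplan–Meier estimation, not the paper's own plotting code); the `leaf_ids` array standing in for the fitted tree's routing is a hypothetical placeholder:

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

def plot_node_curves(durations, events, leaf_ids):
    """Plot one Kaplan-Meier curve per leaf node against the overall
    population curve. All arguments are NumPy arrays; leaf_ids gives
    each observation's leaf assignment from the fitted tree."""
    ax = plt.gca()

    # Reference curve for the entire population (red in Fig. 13).
    kmf_all = KaplanMeierFitter()
    kmf_all.fit(durations, event_observed=events, label="all patients")
    kmf_all.plot_survival_function(ax=ax, color="red")

    # One survival curve per leaf node.
    for leaf in sorted(set(leaf_ids)):
        mask = leaf_ids == leaf
        kmf = KaplanMeierFitter()
        kmf.fit(durations[mask], event_observed=events[mask],
                label=f"node {leaf}")
        kmf.plot_survival_function(ax=ax)

    ax.set_xlabel("age")
    ax.set_ylabel("estimated survival probability")
    plt.show()
```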

The splits illustrated in Fig. 13 include known risk factors for heart disease and are consistent with well-established medical guidelines. The algorithm identified a BMI threshold of 25 as the first split (node 1), which is in accordance with the NIH BMI ranges that classify an individual as overweight if his/her BMI is greater than or equal to 25. Multiple splits indicated a higher risk of heart attack or stroke in patients who smoke (nodes 2, 6). The group with the highest risk of an adverse event was overweight patients with diabetes (node 9).

Figures 14 and 15 illustrate the output of the ctree and rpart algorithms applied to the same FHS population. The rpart model has a single split (BMI), while the ctree model contains the same variables as the OST output. The Brier scores for each model are 0.0486 (OST), 0.0249 (rpart) and 0.0467 (ctree).

In this example we can reasonably conclude that the smaller size of the rpart tree limits its predictive performance. This highlights the important role of cross-validation procedures in selecting an appropriate complexity parameter. In “Appendix 2.2.2” we describe additional experiments which contrast the performance of tree models with uniform size and shape (thus eliminating the effects of parameter selection), and note that the average performance of OST models is generally better than that of rpart trees of the same size.

The discrepancy in the Brier scores for the OST and ctree models is due to slight differences in the threshold and position of certain splits. For example, both methods identify that BMI is the most appropriate variable for the first split, but the BMI threshold differs. The ctree model sets the splitting threshold to 24.117, which is the locally optimal value for the split when building the tree greedily (the same threshold is used in the rpart model). By contrast, the OST algorithm selects a threshold of 25.031. This example demonstrates how the OST algorithm’s efforts to find a globally optimal solution differ from the results of locally optimal splits.

A second difference between the tree models is the order of the smoking and diabetes splits within the overweight population. The ctree model splits on smoking first, since this split has the most significant p-value among the variables at node 5 of the ctree tree. The algorithm also recognizes that diabetes is a risk factor and incorporates this in the subsequent split. Since greedy approaches like ctree do not reevaluate splits once they have been decided, the algorithm does not recognize that the overall quality of the tree can be improved by reversing the order of these splits. This discrepancy in two otherwise similar trees highlights the advantages of the more sophisticated optimization conducted by OST.

Fig. 13 An illustration of optimal survival trees for CHD patients in the FHS

Fig. 14 Illustration of the rpart output for CHD patients in the FHS

Fig. 15 Illustration of the ctree output for CHD patients in the FHS

8 Conclusion

In this paper, we have extended the state-of-the-art Optimal Trees framework to generate interpretable models for censored data. We have also introduced a new accuracy metric, the Kaplan–Meier Area Ratio, which provides an effective way to measure the predictive power of survival tree models in simulations.

The Optimal Survival Trees algorithm improves on the performance of existing algorithms in terms of both classification and predictive accuracy. Our simulation results indicate that the OST models improve consistently with increasing sample size, whereas existing algorithms are prone to overfitting in larger datasets. This is particularly important given that the volume of medical data available for research is likely to increase significantly over the coming years.