1 Introduction

Survival analysis is a cornerstone of healthcare research and is widely used in the analysis of clinical trials as well as large-scale medical datasets such as Electronic Health Records and insurance claims. Survival analysis methods are required for censored data in which the outcome of interest is generally the time until an event (onset of disease, death, etc.), but the exact time of the event is unknown (censored) for some individuals. When a lower bound for these missing values is known (for example, a patient is known to be alive until at least time t) the data is said to be right-censored.

A common survival analysis technique is Cox proportional hazards regression (Cox 1972), which models the hazard rate for an event as a linear combination of covariate effects. Although this model is widely used and easily interpreted, the linear form of its covariate effects makes it unable to identify non-linear effects or interactions between covariates (Bou-Hamad et al. 2011).

Recursive partitioning techniques (also referred to as trees) are a popular alternative to parametric models. When applied to survival data, survival tree algorithms partition the covariate space into smaller and smaller regions (nodes) containing observations with homogeneous survival outcomes. The survival distribution in the final partitions (leaves) can be analyzed using a variety of statistical techniques such as Kaplan–Meier curve estimates (Kaplan and Meier 1958). Several authors have proposed algorithms for building survival trees using censored datasets (Therneau et al. 1990; LeBlanc and Crowley 1993; Hothorn et al. 2006), many of which have been implemented within recursive partitioning software packages (Therneau et al. 2010; Hothorn et al. 2010).

Most recursive partitioning algorithms generate trees in a top-down, greedy manner, which means that each split is selected in isolation without considering its effect on subsequent splits in the tree (Breiman et al. 1984; Quinlan 1986, 2014). This approach can have a negative impact on the quality of the model, such as unnecessarily increasing complexity or decreasing accuracy, resulting in poor out-of-sample performance.

To address these issues, researchers have proposed constructing optimal decision trees by leveraging optimization techniques (Chou 1991; Nijssen and Fromont 2010; Scott and Nowak 2006; Verwer and Zhang 2019; Verhaeghe et al. 2020). Such approaches lead to higher-quality solutions while providing the flexibility to impose additional constraints on the trees. As the problem of tree construction is NP-complete (Laurent and Rivest 1976), recovering the optimal partition in high-dimensional datasets poses scalability issues. Bertsimas and Dunn (2017, 2019) have proposed an efficient algorithm which uses modern mixed-integer optimization (MIO) techniques to address this weakness. Similar to other optimization-based approaches, this Optimal Trees algorithm forms the entire decision tree in a single step, allowing each split to be determined with full knowledge of all other splits. It allows the construction of single decision trees for classification and regression that have performance comparable with state-of-the-art methods such as random forests and gradient boosted trees, without sacrificing the interpretability offered by a single-tree model.

The key contributions of this paper are:

  1.

    We present Optimal Survival Trees (OST), a new survival trees algorithm that utilizes the Optimal Trees framework to generate interpretable trees for censored data.

  2.

    We propose a new accuracy metric that evaluates the fit of Kaplan–Meier curve estimates relative to known survival distributions in simulated datasets. We also demonstrate that this metric is reasonably consistent with the Integrated Brier Score (Graf et al. 1999), which can be used to evaluate the fit of Kaplan–Meier curves when the true distributions are unknown.

  3.

    We evaluate the performance of our method in both simulated and real-world datasets and demonstrate improved accuracy relative to two existing algorithms.

  4.

    Finally, we provide examples of how the algorithm can be used in real-world settings with censored data. We apply the algorithm to predict the risk of adverse events associated with cardiovascular health in the Framingham Heart Study dataset, and to predict the risk of mortality in the Wisconsin Longitudinal Study and Health and Lifestyle Survey.

The structure of this paper is as follows. We review existing survival tree algorithms in Sect. 2 and discuss some of the technical challenges associated with building trees for censored data. In Sect. 3, we give an overview of the Optimal Trees algorithm proposed by Bertsimas and Dunn (2017) and we adapt this algorithm for Optimal Survival Trees in Sect. 4. Section 5 begins with a discussion of existing survival tree accuracy metrics, followed by the new accuracy metrics that we have introduced to evaluate survival tree models in simulated datasets. Simulation results are presented in Sect. 6 and results for real-world datasets are presented in Sect. 7. We conclude in Sect. 8 with a brief summary of our contributions.

2 Review of survival trees

Recursive partitioning methods have received a great deal of attention in the literature, the most prominent method being the Classification and Regression Tree (CART) algorithm (Breiman et al. 1984). Tree-based models are appealing due to their logical, interpretable structure as well as their ability to detect complex interactions between covariates. However, traditional tree algorithms require complete observations of the dependent variable in training data, making them unsuitable for censored data.

Tree algorithms incorporate a splitting rule that selects partitions to add to the tree, and a pruning rule that determines when to stop adding further partitions. Since the 1980s, many authors have proposed splitting and pruning rules for censored data. Splitting rules in survival trees are generally based on either (a) node distance measures that seek to maximize the difference between observations in separate nodes or (b) node purity measures that seek to group similar observations in a single node (Zhou and McArdle 2015; Molinaro et al. 2004).

Algorithms based on node distance measures compare the two child nodes generated when a parent node is split, retaining the split that produces the greatest difference between the child nodes. Proposed measures of node distance include the two-sample logrank test (Ciampi et al. 1986), the likelihood ratio statistic (Ciampi et al. 1987) and conditional inference permutation tests (Hothorn et al. 2006). We note that the score function used in Cox regression models also falls into the class of node distance measures, as the partial likelihood statistic is based on a comparison of the relative risk coefficients predicted for each observation.

Dissimilarity-based splitting rules are unsuitable for certain applications (such as the Optimal Trees algorithm) because they do not allow for the assessment of a single node in isolation. We therefore focus on node purity splitting rules for developing the OST algorithm.

Gordon and Olshen (1985) published the first survival tree algorithm with a node purity splitting rule based on Kaplan–Meier estimates. Davis and Anderson (1989) used a splitting rule based on the negative log-likelihood of an exponential model, while Therneau et al. (1990) proposed using martingale residuals as an estimate of node error. LeBlanc and Crowley (1992) suggested comparing the log-likelihood of a saturated model to the first step of a full likelihood estimation procedure for the proportional hazards model and showed that both the full likelihood and martingale residuals can be calculated efficiently from the Nelson–Aalen cumulative hazard estimator (Nelson 1972; Aalen 1978). More recently, Molinaro et al. (2004) proposed a new approach that adjusts loss functions designed for uncensored data using inverse probability of censoring weights (IPCW).

Most survival tree algorithms make use of cost-complexity pruning to determine the correct tree size, particularly when node purity splitting is used. Cost-complexity pruning selects a tree that minimizes a weighted combination of the total tree error (i.e., the sum of each leaf node error) and tree complexity (the number of leaf nodes), with relative weights determined by cross-validation. A similar split-complexity pruning method was suggested by LeBlanc and Crowley (1993) for node distance measures, using the sum of the split test statistics and the number of splits in the tree. Other proposals include using the Akaike Information Criterion (AIC) (Ciampi et al. 1986) or using a p-value stopping criterion to stop growing the tree when no further significant splits are found (Hothorn et al. 2006).

Survival analysis methods have been extended to include other non-linear learners, such as support vector machines, tree ensembles, and neural networks (Fouodo et al. 2018; Hothorn et al. 2005; Liestøl et al. 1994). Breiman (2002) adapted the CART-based random forest algorithm to survival data, while both Hothorn et al. (2004) and Ishwaran et al. (2008) proposed more general methods that generate survival forests from any survival tree algorithm. “Survival forest” algorithms aggregate the results of multiple trees and aim to produce more accurate predictions by avoiding the instability of single-tree models. In addition, the formulation of the SVM problem has been extended to the survival setting with the objective of maximizing the concordance index for comparable pairs of observations (Van Belle et al. 2011; Evers and Messow 2008). Neural network survival analysis includes various structures, such as feed-forward, deep, and recurrent neural networks (Biganzoli et al. 1998; Ripley and Ripley 2001; Fotso 2018; Giunchiglia et al. 2018).

Unlike decision trees, these approaches lead to “black-box” models which are not interpretable and provide little information about how they arrive at their predictions (Samek and Müller 2019; Castelvecchi 2016). The issue of interpretability has become central to the adoption and implementation of artificial intelligence models over the past several years (Gilpin et al. 2018), particularly in application areas like medicine where algorithmic decisions can directly impact patient lives (Rajkomar et al. 2019; Cabitza et al. 2017).

More interpretable survival analysis methods are often based on linear models such as Cox proportional hazards regression (Cox 1972). Various authors have adapted this approach using regularization techniques such as LASSO (Tibshirani 1997; Park and Hastie 2007), ridge regression (Verweij and Van Houwelingen 1994), and elastic net (Simon et al. 2011), which can be used to perform feature selection in large datasets and control the complexity of the models. Although linear models are relatively easy to interpret, their parametric structure can be a significant limitation if the underlying assumptions (for example, proportional hazards) are violated. These models are also unsuitable for identifying non-linear relationships and interactions in the data. Single-tree models provide a clear answer to this problem as they are able to capture intrinsic non-linear effects and interactions in the data while offering transparency to the user with the full characterization of potential risk profiles (Bertsimas and Dunn 2019).

Relatively few survival tree algorithms have been implemented in publicly available, well-documented software. Two user-friendly options are available in R (R Core Team 2017) packages: Therneau’s algorithm based on martingale residuals is implemented in the rpart package (Therneau et al. 2010) and Hothorn’s conditional inference (ctree) algorithm in the party package (Hothorn et al. 2010).

3 Review of optimal predictive trees

In this section, we briefly review approaches to constructing decision trees, and in particular, we outline the Optimal Trees algorithm. The purpose of this section is to provide a high-level overview of the Optimal Trees framework; interested readers are encouraged to refer to Bertsimas and Dunn (2019) and Dunn (2018) for more detailed technical information. The Optimal Trees algorithm is implemented in Julia (Bezanson et al. 2017) and is available to academic researchers under a free academic license.

Traditionally, decision trees are trained using a greedy heuristic that recursively partitions the feature space using a sequence of locally-optimal splits to construct a tree. This approach is used by methods like CART (Breiman et al. 1984) to find classification and regression trees. The greediness of this approach is also its main drawback: each split in the tree is determined independently, without considering the possible impact of future splits on the quality of the here-and-now decision. This can create difficulties in learning the true underlying patterns in the data and lead to trees that generalize poorly. The most natural way to address this limitation is to form the decision tree in a single step, where each split is decided with full knowledge of all other splits.

The first efforts in the direction of optimal decision tree construction involved the use of pattern mining techniques to construct a global model (Nijssen and Fromont 2007, 2010). Narodytska et al. (2018) proposed the use of a Boolean satisfiability model for computing small decision trees with optimality guarantees (\(n<10^3\) observations). Verwer and Zhang (2019) introduced an alternative binary formulation that employs Integer Linear Programming to render the model size largely independent of the training data size, achieving better scaling performance and shorter running times for datasets with thousands of observations. Verhaeghe et al. (2020) recently suggested an even more efficient way to decompose the learning problem with a constraint programming approach. Other attempts in the literature to construct globally optimal predictive trees include those of Bennett and Blue (1996), Son (1998), and Grubinger et al. (2014). However, these methods could not scale to datasets of the sizes required by practical applications, and therefore did not displace greedy heuristics as the approach used in practice. In contrast to the proposed algorithm, these frameworks are not able to efficiently partition datasets with sample size \(n > 20{,}000\) and number of features \(p > 100\).

Optimal Trees is a novel approach for decision tree construction that outperforms many existing decision tree methods (Bertsimas and Dunn 2019). It formulates the decision tree construction problem from the perspective of global optimality using mixed-integer optimization (MIO) and solves this problem with coordinate descent to find optimal or near-optimal solutions in practical run times. These Optimal Trees are often as powerful as state-of-the-art methods like random forests or boosted trees, yet they are just a single decision tree and hence are readily interpretable. This obviates the need to trade off between interpretability and state-of-the-art accuracy when choosing a predictive method.

The Optimal Trees framework is a generic approach that tractably and efficiently trains decision trees according to a loss function of the form

$$\begin{aligned} \min _T ~~\texttt {error}(T, D) + \alpha \cdot \texttt {complexity}(T), \end{aligned}$$
(1)

where T is the decision tree being optimized, D is the training data, \(\texttt {error}(T, D)\) is a function measuring how well the tree T fits the training data D, \(\texttt {complexity}(T)\) is a function penalizing the complexity of the tree (for a tree with axis-parallel splits, this is simply the number of splits in the tree), and \(\alpha\) is the complexity parameter that controls the tradeoff between the quality of the fit and the size of the tree. Cross-validation takes place as an internal component of the method.

Unlike these other methods, Optimal Trees is able to scale to large datasets (n in the millions, p in the thousands) by using coordinate descent to train the decision trees towards global optimality. When training a tree, the splits are repeatedly optimized one at a time, finding changes that improve the global objective value in Problem (1). To give a high-level overview, the nodes of the tree are visited in a random order and at each node we consider the following modifications:

  • If the node is not a leaf, delete the split at that node;

  • If the node is not a leaf, find the optimal split to use at that node and update the current split;

  • If the node is a leaf, create a new split at that node.

For each of these changes, we calculate the objective value of the modified tree with respect to Problem (1). If any change results in an improved objective value, the modification is accepted. When a modification is accepted or all potential modifications have been rejected, the algorithm continues visiting the nodes of the tree in a random order until no further improvements are found, meaning that the tree is locally optimal for Problem (1). Since the problem is non-convex, we repeat the coordinate descent process from various randomly-generated starting decision trees, before selecting the locally-optimal tree with the lowest overall objective value as the final solution. For a more comprehensive guide to the coordinate descent process, we refer the reader to Bertsimas and Dunn (2019).
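To make this local-search procedure concrete, the sketch below implements coordinate descent for a regression tree with axis-parallel splits and a squared-error node loss, standing in for the survival error derived in Sect. 4. The dictionary-based tree representation and function names are our own, and for brevity a re-split resets the subtree below a node to leaves, whereas the actual implementation re-optimizes it; treat this as an illustration of the search loop rather than the production algorithm.

```python
import random
import numpy as np

def leaf(idx):
    # a node is a dict; leaves have split=None and carry the indices of their rows
    return {"split": None, "idx": idx}

def leaf_error(y, idx):
    # within-leaf squared error; any error that is separable across leaves works here
    return float(((y[idx] - y[idx].mean()) ** 2).sum())

def tree_error(node, y):
    if node["split"] is None:
        return leaf_error(y, node["idx"])
    return tree_error(node["left"], y) + tree_error(node["right"], y)

def n_splits(node):
    if node["split"] is None:
        return 0
    return 1 + n_splits(node["left"]) + n_splits(node["right"])

def objective(root, y, alpha):
    # Problem (1): error(T, D) + alpha * complexity(T)
    return tree_error(root, y) + alpha * n_splits(root)

def nodes_of(node):
    if node["split"] is None:
        return [node]
    return [node] + nodes_of(node["left"]) + nodes_of(node["right"])

def best_split(X, y, idx):
    # exhaustive search for the best axis-parallel split of one node's data
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[idx, j])[:-1]:
            left, right = idx[X[idx, j] <= t], idx[X[idx, j] > t]
            err = leaf_error(y, left) + leaf_error(y, right)
            if best is None or err < best[0]:
                best = (err, j, t, left, right)
    return best

def moves(node, X, y):
    # the modifications considered at each node during coordinate descent
    out = []
    if node["split"] is not None:
        out.append(leaf(node["idx"]))                    # delete the split
    found = best_split(X, y, node["idx"])
    if found is not None:                                # create or replace a split
        _, j, t, left, right = found
        out.append({"split": (j, t), "idx": node["idx"],
                    "left": leaf(left), "right": leaf(right)})
    return out

def local_search(root, X, y, alpha, seed=0):
    rng = random.Random(seed)
    improved = True
    while improved:
        improved = False
        order = nodes_of(root)
        rng.shuffle(order)                               # visit nodes in random order
        for node in order:
            base, backup = objective(root, y, alpha), dict(node)
            for move in moves(node, X, y):
                node.clear(); node.update(move)
                if objective(root, y, alpha) < base - 1e-9:
                    improved = True                      # accept the modification
                    break
                node.clear(); node.update(backup)        # reject: restore the node
    return root  # locally optimal; rerun from other random starts in practice

# Example: recover a two-leaf structure from noisy data
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = np.where(X[:, 0] > 0.5, 2.0, 0.0) + rng.normal(0, 0.1, 200)
tree = local_search(leaf(np.arange(200)), X, y, alpha=1.0)
print(n_splits(tree), objective(tree, y, 1.0))
```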

Although only one tree model is ultimately selected, information from multiple trees generated during the training process is also used to improve the performance of the algorithm. For example, the Optimal Trees algorithm combines the results of multiple trees to automatically calibrate the complexity parameter (\(\alpha\)). Bertsimas and Dunn (2019) present a tailored approach for tuning continuous hyperparameters that discretizes the range of the parameter, identifying a unique mapping between intervals and the corresponding \(\texttt {complexity}(T)\). Thus, during the tuning process only a restricted set of values is tested, avoiding the exploration of values that result in overlapping solutions. Because only one of several correlated covariates may make it into a single tree, the Optimal Trees framework also calculates variable importance scores in the same way as random forests and boosted trees, measuring the importance of variables across the entire training process and not just in the final tree. More detailed explanations of these procedures can be found in Dunn (2018).

The coordinate descent approach used by Optimal Trees is generic and can be applied to optimize a decision tree under any objective function. For example, the Optimal Trees framework can train Optimal Classification Trees (OCT) by setting \(\texttt {error}(T, D)\) to be the misclassification error associated with the tree predictions made on the training data. We provide a comparison of performance between various classification methods from Bertsimas and Dunn (2019) in Fig. 1. This comparison shows the performance of two versions of Optimal Classification Trees: OCT with parallel splits (using one variable in each split); and OCT with hyperplane splits (using a linear combination of variables in each split). These results demonstrate that not only do the Optimal Tree methods significantly outperform CART in producing a single predictive tree, but also that these trees have performance comparable with some of the best classification methods.

Fig. 1: Performance of classification methods averaged across 60 real-world datasets. OCT and OCT-H refer to Optimal Classification Trees without and with hyperplane splits, respectively

4 Survival tree algorithm

In this section, we adapt the Optimal Trees algorithm described in Sect. 3 for the analysis of censored data. For simplicity, we will use terminology from survival analysis and assume that the outcome of interest is the time until death. We begin with a set of observations \((t_i,\delta _i)_{i=1}^n\) where \(t_i\) indicates the time of last observation and \(\delta _i\) indicates whether the observation was a death (\(\delta _i=1\)) or a censoring (\(\delta _i=0\)).

Like other tree algorithms, the OST model requires a target function that determines which splits should be added to the tree. Computational efficiency is an important factor in the choice of target function, since it must be re-evaluated for every potential change to the tree during the optimization procedures. A key requirement for the target function is that the “fit” or error of each node should be evaluated independently of the rest of the tree. In this case, changing a particular split in the tree will only require re-evaluation of the subtree directly below that split, rather than the entire tree.

Due to these computational constraints, splits in the OST model cannot be evaluated by any methods that require the comparison of two or more nodes within the tree. This requirement restricts the choice of target function to the node purity approaches described in Sect. 2.

The splitting rule implemented in the OST algorithm is based on the likelihood method proposed by LeBlanc and Crowley (1992). This splitting rule is derived from a proportional hazards model which assumes that the underlying survival distribution for each observation is given by

$$\begin{aligned} \mathrm {P}(S_i\le t) = 1-e^{-\theta _i\varLambda (t)}, \end{aligned}$$
(2)

where \(\varLambda (t)\) is the baseline cumulative hazard function and the coefficients \(\theta _i\) are the adjustments to the baseline cumulative hazard for each observation.

In a survival tree model we replace \(\varLambda (t)\) with an empirical estimate of the baseline cumulative hazard at each of the observed times, known as the Nelson–Aalen estimator (Nelson 1972; Aalen 1978),

$$\begin{aligned} \hat{\varLambda }(t) = \sum _{i:t_i\le t}\frac{\delta _i}{\sum _{j:t_j\ge t_i} 1}.\end{aligned}$$
(3)

Assuming this baseline hazard, the objective of the survival tree model is to optimize the hazard coefficients \(\theta _i\). We require the tree model to use the same coefficient for all observations contained in a given leaf node, i.e. \(\theta _i = \hat{\theta }_{T(i)}\). These coefficients are determined by maximizing the within-leaf sample likelihood

$$\begin{aligned} L= \prod \limits _{i=1}^n \left( \theta _i\frac{d}{dt}\varLambda (t_i)\right) ^{\delta _i}e^{-\theta _i\varLambda (t_i)}, \end{aligned}$$
(4)

to obtain the node coefficients

$$\begin{aligned} \hat{\theta }_{k} =\frac{\sum _{i}\delta _i I_{\{T_i = k\}}}{ \sum _{i}\hat{\varLambda }(t_i)I_{\{T_i = k\}}}. \end{aligned}$$
(5)

To evaluate how well different splits fit the available data we compare the current tree model to a tree with a single coefficient for each observation. We will refer to this as a fully saturated tree, since it has a unique parameter for every observation. The maximum likelihood estimates for these saturated model coefficients are

$$\begin{aligned} \hat{\theta }^{sat}_i = \frac{\delta _i}{\hat{\varLambda }(t_i)},\quad i=1,\dots , n.\end{aligned}$$
(6)

We calculate the prediction error at each node as the difference between the log-likelihood for the fitted node coefficient and the saturated model coefficients at that node:

$$\begin{aligned} \texttt {error}_k =\sum _{i:T(i) = k} \left( \delta _i \log \left( \dfrac{\delta _i}{ \hat{\varLambda }(t_i)}\right) - \delta _i \log (\hat{\theta }_{k})- \delta _i +\hat{\varLambda }(t_i)\hat{\theta }_{k}\right) . \end{aligned}$$
(7)

The overall error function used to optimize the tree is simply the sum of the errors across the leaf nodes of the tree T given the training data D:

$$\begin{aligned} \texttt {error}(T, D) = \sum _{k \in \mathrm {leaves}(T)} \texttt {error}_k(D). \end{aligned}$$
(8)

We can then apply the Optimal Trees approach to train a tree according to this error function by substituting this expression into the overall loss function (1). At each step of the coordinate descent process, we determine new estimates for \(\hat{\theta }_{k}\) for each leaf node k in the tree using (5). We then calculate and sum the errors at each node using (7) to obtain the total error of the current solution, which is used to guide the coordinate descent and generate trees that minimize the error (8).
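To illustrate, the following numpy sketch computes the Nelson–Aalen estimate (3), the node coefficient (5) and the node error (7). The function names are our own, the implementation is written for clarity rather than speed, and it assumes each evaluated node contains at least one death.

```python
import numpy as np

def nelson_aalen_at(t, d):
    # \hat{Lambda}(t_i), Eq. (3), evaluated at each observation time (O(n^2) for clarity)
    risk = (t[:, None] >= t[None, :]).sum(axis=0)       # risk[k] = #{j : t_j >= t_k}
    increments = d / risk                               # each death adds delta_k / risk_k
    return (t[None, :] <= t[:, None]).astype(float) @ increments

def node_coefficient(d, Lam):
    # \hat{theta}_k, Eq. (5): deaths over accumulated baseline hazard in the node
    return d.sum() / Lam.sum()

def node_error(d, Lam):
    # Eq. (7): log-likelihood gap between the saturated coefficients (6) and the
    # fitted node coefficient; assumes at least one death in the node (theta > 0)
    theta = node_coefficient(d, Lam)
    sat = np.where(d == 1, np.log(np.maximum(d, 1e-12) / Lam), 0.0)  # delta_i log(delta_i / Lam_i)
    return float((sat - d * np.log(theta) - d + Lam * theta).sum())

# Toy example: baseline hazard from the full sample, error for one candidate leaf
t = np.array([2.0, 3.0, 4.0, 5.0, 8.0, 9.0])
d = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])            # delta_i: 1 = death, 0 = censored
Lam = nelson_aalen_at(t, d)
in_leaf = np.array([True, True, False, True, False, False])
print(node_error(d[in_leaf], Lam[in_leaf]))
```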

5 Survival tree accuracy metrics

In order to assess the performance of the OST algorithm, we now introduce a number of accuracy metrics for survival tree models. We will use the notation \(T^{true}\) to represent a tree model, where \(T^{true}_i = T^{true}(X_i)\) is the leaf node classification of observation i with covariates \(X_i\) in the tree \(T^{true}\). We will use the notation \(T^0\) to represent a null model (a tree with no splits and a single node).

5.1 Review of survival model metrics

We begin by reviewing existing accuracy metrics for survival models that are commonly used in both the literature and practical applications. A short computational sketch of several of these metrics is given after the list.

  1.

    Cox Partial Likelihood Score

    The Cox proportional hazards model (Cox 1972) is a semi-parametric model that is widely used in survival analysis. The Cox hazard function estimate is

    $$\begin{aligned} \lambda (t|X_i) = \lambda _0(t)\exp {(\beta _1X_{i1} + \dots +\beta _pX_{ip})} = \lambda _0(t)\exp {(\beta ^TX_i)} ,\end{aligned}$$
    (9)

    where \(\lambda _0(t)\) is the baseline hazard function and \(\beta\) is a vector of fitted coefficients. This proportional hazards model does not make any assumptions about the form of \(\lambda _0(t)\), and its parameters can be estimated even when the baseline is completely unknown (Cox 1975). The coefficients \(\beta\) are estimated by maximizing the partial likelihood function,

    $$\begin{aligned} L(\beta ) = \prod _{i:\, t_i\ \text {uncensored}} \frac{\exp {(X_i\beta )}}{\sum _{t_j\ge t_i}\exp {(X_j\beta )}}= \prod _{i:\, t_i\ \text {uncensored}}\frac{\theta _i}{\sum _{t_j\ge t_i} \theta _j}. \end{aligned}$$
    (10)

    For computational convenience, the Cox model is generally implemented using the log partial likelihood,

    $$\begin{aligned} l(\beta ) = \log L(\beta ) =\sum _{i:\, t_i\ \text {uncensored}} \left( X_i\beta - \log \left( \sum _{t_j\ge t_i}\exp {(X_j\beta )}\right) \right) . \end{aligned}$$
    (11)

    In the context of survival trees, we can find the Cox hazard function associated with a particular tree model by assigning one coefficient to each leaf node in the tree, i.e.,

    $$\begin{aligned} \lambda _T(t|X_i) = \lambda _0(t)\exp { \left( \sum _{k \in T}\beta _k\mathbbm {1}(T_i=k)\right) } = \lambda _0(t)\exp {(\beta _{T_i})} . \end{aligned}$$
    (12)

    We define the Cox Score for a tree model as the maximized log partial likelihood for the associated Cox model, \(\max _{\beta }l(\beta |T)\). To assist with interpretation, we also define the Cox Score Ratio (CSR) as the percentage reduction in the Cox Score for tree T relative to a null model,

    $$\begin{aligned} CSR(T) = 1-\frac{\max _{\beta }l(\beta |T)}{\max _{\beta }l(\beta |T^0)}. \end{aligned}$$
    (13)

    Due to its widespread use in the context of Cox Regression, the Cox Score is a useful metric for assessing the fit of survival tree models and contrasting the structure of these models with more commonly used linear hazard functions. However, it is important to consider the implications of applying a metric designed for continuous hazard predictions in the context of decision trees, which produce a discrete hazard coefficient for every node. Each additional leaf node in the tree allows an additional degree of freedom in equation (12), and increasing the number of nodes in the tree may inflate the Cox score even if the overall quality of the model does not improve.

    Another significant drawback of the Cox score is its reliance on the proportional hazards assumption (2). Although this assumption is commonly used in survival analysis, it may not be appropriate in many applications. This metric should be interpreted with caution when comparing the results of survival tree algorithms that use the proportional hazards model in node splitting rules (such as the OST algorithm) to other algorithms that rely on non-parametric splitting rules.

  2.

    The Concordance Statistic

    Applying a ranking approach to survival analysis is an effective way to deal with the skewed distributions of survival times as well as the censored nature of the data. The Concordance Statistic, most familiar from logistic regression, is another popular metric that has been adapted to measure goodness-of-fit in survival models (Harrell et al. 1982). The concordance index is defined as the proportion of all comparable pairs of observations in which the model’s predictions are concordant with the observed outcomes.

    Two observations are comparable if it is known with certainty that one individual died before the other. This occurs when the actual time of death is observed for both individuals (neither is censored) or when one individual’s death is observed before the other is censored. A comparable pair is concordant if the predicted risk (\(\rho\)) is higher for the individual that died first, and discordant if the predicted risk is lower for the individual that died first. Thus, the number of concordant pairs in a sample is given by

    $$\begin{aligned} CC = \sum _{i,j} \mathbbm {1}(t_i > t_j)\mathbbm {1}(\rho _i < \rho _j)\delta _j,\end{aligned}$$
    (14)

    and the number of discordant pairs is

    $$\begin{aligned} DC = \sum _{i,j} \mathbbm {1}(t_i> t_j)\mathbbm {1}(\rho _i > \rho _j)\delta _j,\end{aligned}$$
    (15)

    where the indices i and j refer to pairs of observations in the sample. Multiplication by the factor \(\delta _j\) discards pairs of observations that are not comparable because the smaller survival time is censored, i.e., \(\delta _j = 0\). These definitions do not include comparable pairs with tied risk predictions, so we denote these pairs as

    $$\begin{aligned} TR = \sum _{i,j} \mathbbm {1}(t_i > t_j)\mathbbm {1}(\rho _i = \rho _j)\delta _j.\end{aligned}$$
    (16)

    The number of concordant and discordant pairs is commonly summarized using Harrell’s C-index (Harrell et al. 1982),

    $$\begin{aligned} H_C = \frac{CC+0.5\times TR}{CC+DC+TR}.\end{aligned}$$
    (17)

    Harrell’s C takes values between 0 and 1, with higher values indicating a better fit. Note that randomly assigned predictions have an expected score of \(H_C=0.5\).

    More recently, Uno et al. (2011) introduced a modified C-Statistic that weights comparable pairs of observations based on the distribution of censoring times,

    $$\begin{aligned} U_{C_t} = \frac{\sum _{i,j} (\hat{G}(t_j))^{-2}\mathbbm {1}(t_i> t_j,t_j<t)\mathbbm {1}(\rho _i< \rho _j)\delta _j}{\sum _{i,j}(\hat{G}(t_j))^{-2}( \mathbbm {1}(t_i> t_j,t_j<t)\mathbbm {1}(\rho _i> \rho _j)\delta _j+\mathbbm {1}(t_i > t_j,t_j<t)\mathbbm {1}(\rho _i < \rho _j)\delta _j)},\end{aligned}$$
    (18)

    where \(\hat{G}(\cdot )\) is the Kaplan–Meier estimate for the censoring distribution. Due to these weights, \(U_C\) converges to a quantity that is independent of the censoring distribution. \(U_C\) takes values between 0 and 1, with higher values indicating a better fit.

    The above definition of Uno’s C-statistic was intended for continuous models, and (18) may be very unstable in small trees due to the large number of observations with tied risks which are not counted in either the numerator or denominator. To avoid this, we include these pairs of observations in a similar manner to Harrell’s C-statistic, i.e., weighted by 0.5 in the numerator and 1 in the denominator. The resulting concordance statistic is

    $$\begin{aligned} U^{*}_{C_t} = \frac{\sum _{i,j} (\hat{G}(t_j))^{-2}\mathbbm {1}(t_i> t_j,t_j<t)\left( \mathbbm {1}(\rho _i< \rho _j) +0.5\times \mathbbm {1}(\rho _i = \rho _j)\right) \delta _j}{\sum _{i,j}(\hat{G}(t_j))^{-2}( \mathbbm {1}(t_i> t_j,t_j<t)\mathbbm {1}(\rho _i> \rho _j)\delta _j+\mathbbm {1}(t_i > t_j,t_j<t)\mathbbm {1}(\rho _i \le \rho _j)\delta _j)}.\end{aligned}$$
    (19)

    This modification improves the stability of the concordance statistics but also makes these metrics somewhat less informative in the context of discrete models, since a large number of tied pairs tend to dominate both the numerator and denominator. More generally, concordance statistics do not account for incomparable pairs of observations, which may be problematic when there is significant censoring. The binary definition of concordance fails to account for the magnitude of the difference in predicted risks for comparable observations. As a result, these metrics may be less informative in datasets with significant variations in risk.

    Unlike the Cox Score, concordance statistics do not explicitly rely on any parametric assumptions. For proportional hazards models it is natural to define the predicted risk in terms of the hazard coefficients in (2), i.e., \(\rho _i = \theta _i\). However, it is also possible to contrast the predicted risk of a comparable pair of observations via the predicted survival probabilities, the expected survival times, or any other comparable prediction extracted from the model. In our analysis we evaluate concordance based on the predicted survival probabilities extracted from the Kaplan–Meier curves at each node, i.e., \(\rho _i(\tau ) = 1-\hat{S}_i(\tau )\). When comparing the risks of a pair of observations, survival probabilities are evaluated at the time of the first event, \(\tau = \min \{t_i,t_j\}\).

  3.

    Integrated Brier score

    The Brier score metric is commonly used to evaluate classification trees (Brier 1950). It was originally developed to verify the accuracy of a probability forecast, primarily for weather forecasting. The most common formula calculates the mean squared prediction error:

    $$\begin{aligned} B = \frac{1}{n}\sum _{i=1}^n(\hat{p}(y_i) - y_i)^2, \end{aligned}$$
    (20)

    where n is the sample size, \(y_i \in \{0,1\}\) is the outcome of observation i, and \(\hat{p}(y_i)\) is the forecast probability of this observed outcome. In the context of survival analysis, the Brier score may be used to evaluate the accuracy of survival predictions at a particular point in time relative to the observed deaths at that time. We will refer to this as the Brier Point Score:

    $$\begin{aligned}&BP_{\tau } = \frac{1}{|\mathcal {I}_{\tau }|}\sum _{i \in \mathcal {I}_{\tau }}(\hat{S_i}(\tau ) - \mathbbm {1}(t_i >\tau ))^2, \nonumber \\ \text { where }&\mathcal {I}_{\tau } = \{i\in \{1,\dots , n\} \mid t_i \ge \tau \text { or } \delta _i=1\}. \end{aligned}$$
    (21)

    In this case, \(\hat{S_i}(\tau )\) is the predicted survival probability for observation i at time \(\tau\) and \(\mathcal {I}_{\tau }\) is the set of observations that are known to be alive/dead at time \(\tau\). Observations censored before time \(\tau\) are excluded from this score, as their survival status is unknown.

    Applying this version of the Brier score may be useful in applications where the main outcome of interest is survival at a particular time, such as the 1-year survival rates after the onset of a disease. In the experiments that follow, the point-wise Brier Score will be evaluated at the median observation time in each dataset. For easy interpretation, the reported scores are normalized relative to the score for a null model, i.e.

    $$\begin{aligned} BPR_{\tau }=1-\frac{ BP_{\tau }(T)}{ BP_{\tau }({T^0})}.\end{aligned}$$
    (22)

    The Brier Point score has two significant disadvantages in survival analysis. First, it assesses the predictive accuracy of survival models at a single point in time rather than over the entire observation period, which is not well-suited to applications where survival distributions are the outcome of interest. Second, it becomes less informative as the number of censored observations increases, because a greater number of observations are discarded when calculating the score.

    Graf et al. (1999) have addressed these challenges by proposing an adjusted version of the Brier Score for survival datasets with censored outcomes. Rather than measuring the accuracy of survival predictions at a single point, this measure aggregates the Brier score over the entire time interval observed in the data. This modified measure is commonly used in the survival literature and has been interchangeably called the Brier Score or the Integrated Brier Score by various authors (Reddy and Kronek 2008). In this paper, we will refer to the metric specific to survival analysis as the Integrated Brier score (IB), defined as

    $$\begin{aligned} IB =\frac{1}{t_{max}} \frac{1}{n}\sum _{i=1}^n\left[ \int _0^{t_i} \frac{(1-\hat{S}_{i}(t))^2}{\hat{G}(t)}\, dt + \delta _i\int _{t_i}^{t_{max}} \frac{(\hat{S}_{i}(t))^2}{\hat{G}(t_i)}\, dt\right] .\end{aligned}$$
    (23)

    The IB score uses Kaplan–Meier estimates for both the survival distribution, \(\hat{S}(t)\), and the censoring distribution, \(\hat{G}(t)\). In a survival tree model, these estimates are obtained by pooling the observations in each node of the tree, i.e., \(\hat{S}_i(t)=\hat{S}_{T(i)}(t)\). The IB score is a weighted version of the original Brier Score, with weight \(1/\hat{G}(t)\) applied at times before the observed time \(t_i\), and weight \(1/\hat{G}(t_i)\) applied after an observed death. This metric addresses many of the deficiencies identified in the Cox and concordance scores above: it is non-parametric, counts both censored and uncensored observations, and evaluates the accuracy of the predicted survival functions over the entire time horizon.

    In subsequent sections, we report a normalized version of this metric, the Integrated Brier score ratio (IBR), which compares the sum of the Integrated Brier scores in a given tree to the corresponding scores in a null tree:

    $$\begin{aligned} IBR=1-\frac{ IB(T)}{ IB({T^0})}.\end{aligned}$$
    (24)
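The computational sketch referenced at the start of this list is given below. It implements the log partial likelihood (11), Harrell's C-index (17) and the Brier point score (21) for arrays of observation times, event indicators and model outputs; the loops are written for clarity rather than speed, and the function names are our own.

```python
import numpy as np

def cox_log_partial_likelihood(t, d, eta):
    # Eq. (11): eta_i is the linear predictor X_i beta
    # (for a tree model, the log hazard coefficient of the leaf containing i)
    ll = 0.0
    for i in range(len(t)):
        if d[i] == 1:
            ll += eta[i] - np.log(np.exp(eta[t >= t[i]]).sum())
    return ll

def harrell_c(t, d, rho):
    # Eqs. (14)-(17): pairs (i, j) are comparable when t_i > t_j and delta_j = 1;
    # assumes the sample contains at least one comparable pair
    cc = dc = tr = 0
    for j in range(len(t)):
        if d[j] == 1:
            later = t > t[j]
            cc += int((rho[later] < rho[j]).sum())   # earlier death had higher risk
            dc += int((rho[later] > rho[j]).sum())   # earlier death had lower risk
            tr += int((rho[later] == rho[j]).sum())  # tied risk predictions
    return (cc + 0.5 * tr) / (cc + dc + tr)

def brier_point(t, d, S_tau, tau):
    # Eq. (21): compare predicted survival at tau with known status at tau,
    # excluding observations censored before tau
    known = (t >= tau) | (d == 1)
    alive = (t[known] > tau).astype(float)
    return float(((S_tau[known] - alive) ** 2).mean())
```

For a tree model, the inputs eta and rho are constant within each leaf, and S_tau holds each observation's leaf-level Kaplan–Meier survival estimate evaluated at \(\tau\).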

Aside from the limitations already discussed, we note that all of the above metrics are subject to noise and often provide contradictory assessments when comparing different tree models. For example, in our empirical experiments comparing three candidate models, only about 30% of instances produced a model that scored at least as well as both competitors on all metrics; in the remaining 70% of test cases, no candidate model dominated the others. These limitations make it difficult to obtain an unambiguous comparison between the performance of different survival tree algorithms. To address this challenge, we now introduce a simulation procedure and associated accuracy metrics that are specifically designed to assess survival tree models.

5.2 Simulation accuracy metrics

A key difficulty in selecting performance metrics for survival tree models is that the definition of “accuracy” can depend on the context in which the model will be used. For example, consider a survival tree that models the relationship between lifestyle factors and age of death. A medical researcher may use such a model to identify risk factors associated with early death, while an insurance firm may use this model to predict mortality risks for individual clients in order to estimate the volume of life insurance policy pay-outs in the coming years. The medical researcher is primarily interested in whether the model has identified important splits, while the insurer is more focused on whether the model can accurately estimate survival distributions.

In subsequent sections we refer to these two properties as tree recovery and prediction accuracy. We develop metrics to measure these outcomes in simulated datasets with the following structure:

Let \(i=1,\dots ,n\) be a set of observations with independent, identically distributed covariates \(\mathbf {X}_{i}=(X_{ij})_{j=1}^m\). Let \(T^{true}\) be a tree model that partitions observations based on these covariates such that \(T^{true}_i = T^{true}(\mathbf {X}_{i})\) is the index of the leaf node in \(T^{true}\) that contains individual i. Let \(S_i\) be a random variable representing the survival time of observation i, with distribution \(S_i\sim F_{T^{true}_i}(t)\). The survival distribution of each individual is entirely determined by its location in the tree \(T^{true}\), and so we refer to \(T^{true}\) as the “true” tree model.

This underlying tree structure provides an unambiguous target against which we can measure the performance of empirical survival tree models. In this context, an empirical survival tree model T has high accuracy if it achieves the following objectives:

  1.

    Tree recovery: the model recovers the structure of the true tree (i.e., \(T(\mathbf {X}_{i})=T^{true}(\mathbf {X}_{i})\)).

  2.

    Prediction accuracy: the model recovers the corresponding survival distributions of the true tree (i.e., \(\hat{F}_{T_i}(t)={F}_{T^{true}_i}(t)\)).

It is important to recognize that these two objectives are not necessarily consistent, particularly in small samples. For example, models with perfect tree recovery may have a small number of observations in each leaf node, leading to noisy survival estimates with low prediction accuracy.

5.2.1 Tree recovery metrics

We measure the tree recovery of an empirical tree model (T) relative to the true tree (\(T^{true}\)) using the following metrics:

  1.

    Node homogeneity: The node homogeneity statistic measures the proportion of the observations in each node \(k\in T\) that have the same true class in \(T^{true}\). This metric is equivalent to the misclassification error and cluster purity metrics commonly used to evaluate clustering and tree-based binary classification, respectively (Friedman et al. 2001; Rendón et al. 2011). Let \(p_{k,\ell }\) be the proportion of observations in node \(k \in T\) that came from class \(\ell \in T^{true}\), and let \(n_{k,\ell }\) be the total number of observations in node \(k \in T\) from class \(\ell \in T^{true}\). Then,

    $$\begin{aligned} NH = \frac{1}{n}\sum _{k \in T}\sum _{\ell \in T^{true}} n_{k,\ell }\,p_{k,\ell }.\end{aligned}$$
    (25)

    A score of \(NH = 1\) indicates that each node in the new tree model contains observations from a single class in \(T^{true}\). This does not necessarily mean that the structure of T is identical to \(T^{true}\): for example, a saturated tree with a single observation in each node would have a perfect node homogeneity score (see Fig. 2). The node homogeneity metric is therefore biased towards larger tree models with few observations in each node.

  2.

    Class recovery

    Class recovery is a measure of how well a new tree model is able to keep similar observations together in the same node, thereby avoiding unnecessary splits. Class recovery is calculated by counting the proportion of observations from a true class \(\ell \in T^{true}\) that are placed in the same node in T. Let \(q_{k,\ell }\) be the proportion of observations from class \(\ell \in T^{true}\) that are classified in node \(k \in T\), and let \(n_{k,\ell }\) be the total number of observations in node \(k \in T\) from class \(\ell \in T^{true}\). Then,

    $$\begin{aligned} CR = \frac{1}{n}\sum _{\ell \in T^{true}}\sum _{k \in T} n_{k,\ell }\,q_{k,\ell }. \end{aligned}$$
    (26)

    This metric is biased towards smaller trees, since a null tree with a single node would have a perfect class recovery score. It is therefore useful to consider both the class recovery and node homogeneity scores simultaneously in order to assess the performance of a tree model (see Fig. 2 for examples). When used together, these metrics indicate how well the model T reflects the structure of the true model \(T^{true}\).

Fig. 2: Tree recovery metrics for a survival tree with two classes of observations. The top left tree represents the true tree model

The node homogeneity and class recovery scores can also be used to compare any two tree models, \(T^a\) and \(T^b\). In this case, these metrics should be interpreted as a measure of structural similarity between the two tree models. Note that when \(T^a\) and \(T^b\) are applied to the same dataset, the node homogeneity for model \(T^a\) relative to \(T^b\) is equivalent to the class recovery for \(T^b\) relative to \(T^a\), and vice versa. The average node homogeneity score for \(T^a\) and \(T^b\) is therefore equal to the average class recovery score for \(T^a\) and \(T^b\). We will refer to this as the similarity score for models \(T^a\) and \(T^b\).
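A compact sketch of these scores, assuming each tree is represented simply by an array assigning a leaf (or class) label to each of the n observations:

```python
import numpy as np

def node_homogeneity(T_a, T_b):
    # Eq. (25): for each node k of T_a, sum n_{k,l} * p_{k,l} over classes l of T_b
    total = 0.0
    for k in np.unique(T_a):
        counts = np.unique(T_b[T_a == k], return_counts=True)[1]  # n_{k,l}
        total += (counts * counts / counts.sum()).sum()           # n_{k,l} * p_{k,l}
    return total / len(T_a)

def class_recovery(T_a, T_b):
    # Eq. (26): node homogeneity with the roles of the two trees swapped
    return node_homogeneity(T_b, T_a)

def similarity(T_a, T_b):
    # average of the two node homogeneity (equivalently class recovery) scores
    return 0.5 * (node_homogeneity(T_a, T_b) + node_homogeneity(T_b, T_a))

# Example: leaf labels of a fitted tree vs. true classes for six observations
T_fit = np.array([0, 0, 0, 1, 1, 1])
T_true = np.array([0, 0, 1, 1, 1, 1])
print(node_homogeneity(T_fit, T_true), class_recovery(T_fit, T_true))
```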

5.2.2 Prediction accuracy metric

Our prediction accuracy metric measures how well the non-parametric Kaplan–Meier curve at each leaf in T estimates the true survival distribution of each observation.

  1.

    Area between curves (ABC)

    For an observation i with true survival distribution \(F_{T^{true}_i}(t)\), suppose that \(\hat{S}_{T_i}(t)\) is the Kaplan–Meier estimate at the corresponding node in tree T (see Fig. 3). The area between the true survival curve and the tree estimate is given by

    $$\begin{aligned} ABC_i^T = \frac{1}{t_{max}}\int _{0}^{t_{max}} |1-F_{T^{true}_i}(t)-\hat{S}_{T_i}(t)|dt.\end{aligned}$$
    (27)

    To make this metric easier to interpret, we compare the area between curves in a given tree to the score of a null tree with a single node (\(T^0\)). The area ratio (AR) is given by

    $$\begin{aligned} AR=1-\frac{\sum _i ABC_i^T}{\sum _i ABC_i^{T^0}}.\end{aligned}$$
    (28)

    Similar to the popular \(R^2\) metric for regression models, the AR indicates how much accuracy is gained by using the Kaplan–Meier estimates generated by the tree relative to the baseline accuracy obtained by using a single estimate for the whole population.

    Both the ABC and IB metrics measure the fit of the survival distributions generated at leaf nodes, which are an important component of tree-based survival models. The most important conceptual difference between these metrics is that the IB score compares the estimated survival distributions to events observed in the sampled data (using weights to account for censoring), while the ABC measures accuracy relative to the true survival distributions, which are not affected by censoring or sample size. The ABC cannot be applied in real-world settings where the underlying distributions are unknown, but it provides a simple and intuitive measure of the fit of survival curves in simulation experiments.
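As a sketch of Eqs. (27) and (28), the following approximates the integral with the trapezoidal rule on a common time grid; the grid and the illustrative curves are assumptions of this example:

```python
import numpy as np

def abc(ts, S_true, S_hat):
    # Eq. (27): normalized area between the true survival curve and the
    # Kaplan-Meier estimate, both evaluated on a grid ts spanning [0, t_max]
    gap = np.abs(S_true - S_hat)
    integral = (0.5 * (gap[1:] + gap[:-1]) * np.diff(ts)).sum()  # trapezoidal rule
    return integral / (ts[-1] - ts[0])

def area_ratio(abc_tree, abc_null):
    # Eq. (28): accuracy gained relative to a single pooled (null-tree) estimate
    return 1.0 - np.sum(abc_tree) / np.sum(abc_null)

# Example: exponential true survival curve vs. a crude two-step estimate
ts = np.linspace(0.0, 10.0, 201)
S_true = np.exp(-0.3 * ts)
S_hat = np.where(ts < 2.0, 1.0, 0.4)
print(abc(ts, S_true, S_hat))
```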

Fig. 3: An illustration of the area between the true survival distribution and the Kaplan–Meier curve

6 Simulation results

In this section we evaluate the performance of the Optimal Survival Trees (OST) algorithm and compare it to two existing survival tree models available in the R packages rpart and ctree. Our tests are performed on simulated datasets with the structure described in Sect. 5.2.

6.1 Simulation procedure

The procedure for generating simulated datasets in these experiments is as follows (a minimal sketch in code is given after the list):

  1.

    Randomly generate a sample of 20,000 observations with six covariates. The first three covariates are uniformly distributed on the interval [0, 1] and the remaining three covariates are discrete uniform random variables with 2, 3 and 5 levels.

  2.

    Generate a random “ground truth” tree model, \(T^{true}\), that partitions the dataset based on these six covariates (see Algorithm 1 in Appendix 1).

  3.

    Assign a survival distribution to each leaf node in the tree \(T^{true}\) (see the Appendix for a list of distributions).

  4.

    Classify observations into node classes \(T^{true}_i = T^{true}(\mathbf {X}_i)\) according to the ground truth model. Generate a survival time, \(s_i\), for each observation based on the survival distribution of its node: \(S_i\sim F_{T^{true}_i}(t)\).

  5.

    Generate a censoring time for each observation, \(c_i = \kappa (1-u_i^2)\), where \(u_i\) follows a uniform distribution and \(\kappa\) is a non-negative parameter used to control the proportion of censored individuals.

  6.

    Assign observation times \(t_i=\min (s_i,c_i)\). Individuals are marked as censored (\(\delta _i=0\)) if \(t_i=c_i\).
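The sketch referenced above generates one such dataset. The ground-truth partition and the exponential leaf distributions are simple placeholders for Algorithm 1 and the distributions listed in the Appendix, and \(\kappa\) is tuned by hand here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Step 1: three continuous covariates on [0, 1] and three discrete covariates
# with 2, 3 and 5 levels
X = np.column_stack([
    rng.uniform(size=(n, 3)),
    rng.integers(0, 2, n), rng.integers(0, 3, n), rng.integers(0, 5, n),
])

# Steps 2-4: a placeholder ground-truth tree with four leaves and one
# exponential survival distribution per leaf (stand-ins for Algorithm 1)
node = (X[:, 0] > 0.5).astype(int) + 2 * (X[:, 3] > 0).astype(int)
scale = np.array([1.0, 2.0, 4.0, 8.0])[node]
s = rng.exponential(scale)

# Step 5: censoring times c_i = kappa * (1 - u_i^2); kappa tunes the censoring rate
kappa = 6.0
c = kappa * (1.0 - rng.uniform(size=n) ** 2)

# Step 6: observed times and censoring indicators
t = np.minimum(s, c)
delta = (s <= c).astype(int)   # delta = 0 marks a censored observation
print(f"censoring proportion: {1 - delta.mean():.2f}")
```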

We used this procedure to generate 1000 datasets based on ground truth trees with a minimum depth of 3 and a maximum depth of 4 (i.e., at most \(2^4=16\) leaf nodes). In each dataset, 10,000 observations were set aside for testing the tree models. Training datasets of n observations were sampled from the remaining data for \(n \in \{100, 200, 500, 1000, 2000, 5000, 10000\}\).

In addition to varying the size of the training dataset, we also varied the proportion of censored observations in the data by adjusting the parameter \(\kappa\). Censoring was applied at nine different levels to generate examples with low censoring (0%, 10%, 20%), moderate censoring (30%, 40%, 50%) and high censoring (60%, 70%, 80%). In total, 63 OST models were trained for each dataset to test each of the seven training sample sizes at each of the nine censoring levels.

We evaluated the performance of the OST algorithm relative to two existing survival tree algorithms available in the R packages rpart (Therneau et al. 2010) and ctree (Hothorn et al. 2010). Each of the three algorithms was trained and tested on exactly the same data in each dataset.

Each of the three algorithms tested requires two input parameters that control the model size: a maximum tree depth and a complexity/significance parameter that determines which splits are worth keeping in the tree (the interpretation of the ctree significance parameter differs from the complexity parameters in the OST and rpart algorithms, but it serves a similar function).

Since neither rpart nor ctree have built-in methods for selecting tree parameters, we used a similar 5-fold cross-validation procedure on the training data to select the parameters for each algorithm. We considered tree depths up to three levels greater than the true tree depth and complexity parameter/significance values between 0.001 and 0.1 for the rpart and ctree algorithms (the OST complexity parameter is automatically selected during training). Equation (7) was used as the scoring metric to evaluate out-of-sample performance during cross-validation, and the minimum node size for all algorithms was fixed at 5 observations.

6.2 Results

Fig. 4: The average tree size for models trained on various sample sizes

To demonstrate the effect of this cross-validation procedure, we summarize the average size of the models produced by each algorithm in Fig. 4. We see a clear link between tree size and the number of training observations, indicating that the cross-validation procedure selects more conservative depth/complexity parameters when relatively little data is available. In larger datasets, the OST models grow to approximately the same size as the true tree models (6 nodes, on average), while the rpart and ctree models are slightly larger.

6.2.1 Survival analysis metrics

Figure 5 summarizes the performance of each algorithm in our simulations using the four survival model metrics from Sect. 5. The values displayed in each chart are the average performance statistics across all test datasets.

As expected, the average performance of all three algorithms consistently improves as the size of the training dataset increases. The performance statistics also increase as the proportion of censored observations increases, which seems counter-intuitive (we would expect more censoring to lead to less accurate models). In the case of the Cox partial likelihood and C-statistics, this trend is directly linked to the number of observed deaths, since only observations with observed deaths contribute to the partial likelihood and concordance scores. Similarly, censored observations do not contribute to the Integrated Brier Score after their censoring time.

Each chart also indicates the performance of the true tree model, \(T^{true}\), as a point of comparison for the other algorithms. The true tree model performs significantly better than the empirical models trained on smaller datasets, but all three algorithms approach the performance of the true tree for very large sample sizes.

Based on these results, we conclude that the average performance of the OST algorithm in these simulations is consistently better than either of the other two algorithms. In order to understand why this algorithm is able to generate better models, we now analyse the results of the tree metrics introduced in Sect. 5.2.

Fig. 5: A summary of the survival model metrics from simulation experiments. The average test set outcomes for each algorithm are shown in color, while the performance of the true tree model, \(T^{true}\), is indicated in black. Shaded areas indicate 95% confidence intervals

6.2.2 Tree recovery

Fig. 6: A summary of the tree recovery metrics for survival tree algorithms

The test set tree recovery metrics for all three algorithms are summarized in Table 1 and Fig. 6. The average node homogeneity/class recovery scores are given side-by-side to allow for a comprehensive assessment of each algorithm’s performance. These results confirm that the OST models perform significantly better than the other two models across all censoring levels.

Table 1 A summary of the average node homogeneity/class recovery scores for synthetic experiments

The node homogeneity scores for all three algorithms increase with larger sample sizes, indicating that the availability of additional data leads to better detection of relevant splits. In large populations, the OST algorithm selects more efficient splits than the other models and is able to achieve better node homogeneity with fewer splits (recall Fig. 4—the OST models trained on large data sets have fewer leaf nodes than the other models, on average).

The relationship between tree size and class recovery rates is somewhat more complicated. In datasets smaller than 500 observations the class recovery rates seem to be closely linked to the tree size: the ctree models have the highest average class recovery for models trained on 100 and 200 observations, and also the smallest number of nodes (see Fig. 4). However, this trend does not hold in datasets with 500 observations, where OST models are larger than the ctree models on average, but also have slightly better class recovery. This suggests that tree size is no longer a dominant factor in larger datasets (\(n\ge 500\)).

In these larger datasets we observe distinct trends in class recovery scores. The OST class recovery rate increases consistently despite the increases in model size, which means that the OST models are able to produce more complex trees without overfitting in the training data. By contrast, both of the other algorithms have consistently worse class recovery rates as sample size increases and their models become larger. Based on this trend, neither of these algorithms will reliably converge to the true tree.

6.2.3 Prediction accuracy

Table 2 A summary of the average Kaplan–Meier area ratio (AR) scores for simulation experiments

The test set prediction accuracy metric for each of the three algorithms is summarized in Table 2 and Fig. 7. Overall, the results indicate that sample size plays the most significant role in test set accuracy across all three algorithms. There is also a small increase in accuracy when censoring is increased, which is due to the reduction in the maximum observed time, \(t_{max}\). The OST results are generally better than the other algorithms across all sample sizes, although the performance gap is relatively small in smaller datasets.

To illustrate the effect of sample size on the accuracy of the Kaplan–Meier estimates, Fig. 7 also shows the curve accuracy metrics for the true tree, \(T^{true}\). It is immediately apparent that even the true tree models produce poor survival curve estimates in small datasets. Based on these results, it may be necessary to increase the minimum node size to at least 50 observations in applications where Kaplan–Meier curves will be used to summarize survival tree nodes.

Fig. 7: A summary of the average Kaplan–Meier Area Ratio results for simulation experiments. The performance of the true tree model is indicated in black

6.2.4 Comparison of accuracy metrics

Table 3 shows the correlation between each pair of accuracy metrics used in the simulation experiments. All outcome metrics are positively correlated with the exception of class recovery, which has both weak positive and weak negative correlations with other metrics. These mixed results are due to the different trends in class recovery among the three algorithms – OST class recovery was highest for trees trained on larger datasets, while the other algorithms had lower class recovery in these instances (see Fig. 6). Node homogeneity was positively correlated with other metrics, but the correlations were somewhat weaker than average. This reflects the incomplete information captured by this metric – node homogeneity alone does not guarantee a good model, as discussed in Sect. 5.2.1.

Among the other metrics, the highest correlation was observed between the two concordance statistics (0.98), which also had the strongest correlation with most other metrics. There was also high correlation between the two Brier metrics (0.86). The Cox score was most strongly correlated with the concordance statistics (0.87), followed by the Brier statistics (0.77). The Kaplan–Meier area ratio had slightly lower average correlations and was most strongly correlated with the node homogeneity statistic. This is likely due to the fact that both of these metrics are based on the true tree structure, while other metrics reflect how well a model fits the available data.
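
A correlation table of this kind can be assembled directly from the per-run metric values. The following is a minimal sketch, assuming each simulation run's scores are stored as one row of a pandas DataFrame; the column names and the random placeholder values are purely illustrative, not the paper's actual data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One row per simulation run, one column per accuracy metric.
# Random placeholder values stand in for the real experiment scores.
metrics = ["class_recovery", "node_homogeneity", "harrell_c", "uno_c",
           "brier_point", "brier_integrated", "cox_pl_ratio", "km_area_ratio"]
results = pd.DataFrame(rng.uniform(0, 1, size=(300, len(metrics))),
                       columns=metrics)

# Pairwise Pearson correlations between metrics across all runs,
# analogous to the structure of Table 3.
print(results.corr(method="pearson").round(2))
```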

Table 3 Correlation between different accuracy metrics in simulation experiments

6.2.5 Stability

Fig. 8 A summary of the average similarity scores between pairs of trees trained on mutually exclusive sets of observations

Fig. 9 A summary of survival tree accuracy metrics for datasets with added noise

Fig. 10 A summary of simulation accuracy metrics for datasets with added noise

A frequent criticism of single-tree models is their sensitivity to small changes in the training data. This may be apparent when a tree algorithm produces very different models for different training datasets sampled from the same population. This type of instability is often an indication that the model will not perform well on unseen data.

Given the challenges associated with measuring the test set accuracy for survival tree algorithms, it may be tempting to use stability as a performance metric for these models. Stability is a necessary condition for accuracy in tree models (provided that a tree structure is suitable for the data) but stable models are not necessarily accurate. For example, greedy tree models with depth 1 may select the same split for all permutations of the training data, but these models will not be accurate if the data requires a tree of depth 3.

Although stability is not necessarily a good indicator of the quality of a model, it is nevertheless interesting to consider how the stability of globally optimized trees may differ from the stability of greedy trees. Globally optimized trees are theoretically capable of greater stability because they may include splits that are not necessarily locally optimal for a particular training dataset. However, globally optimized trees also consider a significantly larger number of possible tree configurations and therefore have many more opportunities to overfit features of a particular training dataset.

We ran two sets of experiments to investigate the stability of the survival tree models in our simulations. In the first set of experiments we used each algorithm to train two models, \(T^a\) and \(T^b\), on non-overlapping training datasets of equal size drawn from the same population. We then applied each model to the entire dataset (20,000 observations) and used the tree similarity score described in Sect. 5.2.1 to assess the structural similarity between the two models. The average similarity scores for each algorithm are illustrated in Fig. 8.
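
A minimal sketch of this pairwise protocol is shown below, assuming a generic `fit_tree` training routine and a `tree_similarity` function implementing the score from Sect. 5.2.1; both are hypothetical placeholders for the actual implementations:

```python
import numpy as np

def stability_score(data, n_train, fit_tree, tree_similarity, rng):
    """Train two trees on disjoint samples of size n_train drawn from
    `data` (a NumPy array of observations) and score their structural
    similarity on the full dataset."""
    idx = rng.permutation(len(data))
    sample_a = data[idx[:n_train]]             # first training set
    sample_b = data[idx[n_train:2 * n_train]]  # disjoint second set

    tree_a = fit_tree(sample_a)
    tree_b = fit_tree(sample_b)

    # Compare the partitions the two trees induce on all observations
    # (20,000 in the experiments described above).
    return tree_similarity(tree_a, tree_b, data)
```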

These results demonstrate that stability across different training datasets is not a sufficient condition for accuracy: models trained on 100 and 200 observations are both more stable and less accurate than models trained on 500 observations. The ctree algorithm produced the most stable results in smaller datasets due to the smaller model sizes selected during cross-validation. For example, 33.1% of ctree models trained on 100 observations had fewer than 2 splits, compared to 29.5% of the rpart models and 26.5% of the OST models.

The stability results for larger training datasets (\(n>1000\)) are reasonably consistent with the accuracy metrics discussed above, and both stability and accuracy increase with sample size across all three algorithms. The OST models have the highest average similarity scores in large datasets and the rpart models are slightly more stable than the ctree models.

In the second set of stability experiments we investigated how small perturbations to the covariate values in the training dataset affect the test set accuracy of each model. We added noise to the training data by replacing the original continuous covariate values, \(x_{ij}\), with “noisy” values \(\tilde{x}_{ij}=x_{ij} + \epsilon _{ij}\). The initial covariates were uniformly distributed between 0 and 1 and the added noise terms were generated from the following two distributions:

$$\begin{aligned} \epsilon_{ij} &\sim U(-0.05, 0.05) \qquad &&\text{(5\% noise), and}\\ \epsilon_{ij} &\sim U(-0.1, 0.1) &&\text{(10\% noise).} \end{aligned}$$

A similar approach was applied to the categorical variables, which were generated by rounding off continuous values (\(x_{ij}\) or \(\tilde{x}_{ij}\)) to the appropriate thresholds. Note that noise was added only to the observations used for training; the testing data was unchanged.
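
A minimal sketch of this perturbation scheme, assuming covariates uniformly distributed on \([0,1]\) as above; the two-level categorical example and its cutoff are illustrative:

```python
import numpy as np

def add_noise(X, noise_level, rng):
    """Perturb covariates with uniform noise, e.g. noise_level=0.05
    for 5% noise or 0.10 for 10% noise."""
    eps = rng.uniform(-noise_level, noise_level, size=X.shape)
    return X + eps

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 5))  # original covariates, U(0, 1)
X_noisy = add_noise(X, 0.05, rng)      # training copy with 5% noise

# Categorical covariates are re-derived by rounding the (noisy)
# continuous values to thresholds; an illustrative two-level example:
x_cat_noisy = (X_noisy[:, 0] > 0.5).astype(int)
```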

The results of these experiments are contrasted with the initial outcomes (without added noise) in Figs. 9 and 10. The effects of additional noise in the training data are visible in the results of all three algorithms and the drop in accuracy appears to be fairly consistent. Overall, the OST models maintain the highest scores regardless of noise.

These results indicate that perturbations in the training data affect the OST and greedy tree algorithms in similar ways. The OST algorithm’s performance is diminished by adding noise to the training data, but its ability to consider a wider range of split configurations does not make it more sensitive to these perturbations. In fact, the OST algorithm is generally slightly more stable than the greedy algorithms under perturbations of the training data because it tends to produce models that are consistently closer to the true tree.

6.3 Scaling performance

We now provide an overview of the computational performance of the OST algorithm on the synthetic censored datasets. We use the procedure described in Sect. 6.1 to create simulated data, varying the number of observations n, the number of features p, and the percentage of censoring. We consider datasets of size \(n \in \{5000, 10{,}000, 25{,}000, 50{,}000, 100{,}000\}\) and \(p \in \{10, 50, 100\}\), and three percentages of censoring, \(10\%\), \(50\%\), and \(80\%\), corresponding to low, moderate, and high censoring respectively. We repeat the experiment for each combination of these parameters on 100 randomized datasets and report the average scaling performance and the associated 95% confidence intervals. We perform cross-validation using grid search to select the best parameters for each model, and we report the computational time of the training procedure. Figure 11 illustrates our findings.
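
A minimal sketch of this benchmarking loop is given below, assuming a hypothetical `simulate_censored_data` generator implementing the procedure of Sect. 6.1 and a hypothetical `fit_ost_with_cv` routine wrapping the grid-search cross-validation and final fit:

```python
import time
from itertools import product
import numpy as np

N_VALUES = [5_000, 10_000, 25_000, 50_000, 100_000]
P_VALUES = [10, 50, 100]
CENSORING = [0.10, 0.50, 0.80]  # low, moderate, high censoring
N_REPEATS = 100

def benchmark(simulate_censored_data, fit_ost_with_cv):
    """Time cross-validated training over the full parameter grid.
    Both arguments are stand-ins for the paper's actual routines."""
    timings = {}
    for n, p, cens in product(N_VALUES, P_VALUES, CENSORING):
        runs = []
        for seed in range(N_REPEATS):
            data = simulate_censored_data(n=n, p=p, censoring=cens, seed=seed)
            start = time.perf_counter()
            fit_ost_with_cv(data)  # grid-search CV + final training
            runs.append(time.perf_counter() - start)
        # Mean and spread of training time for this parameter combination.
        timings[(n, p, cens)] = (np.mean(runs), np.std(runs))
    return timings
```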

Across all experiments, the algorithm was able to complete in less than an hour. There was no significant change in the average running time across the different levels of censoring. However, the number of features, p, did have a substantial impact on the computational performance. For \(p<100\), all instances solved within 40 minutes; in particular, when the number of covariates is restricted to 10, the average time to solve is less than 25 minutes even when the sample size is 100,000. Empirically, increasing the number of observations affects the running time in a roughly linear way, while increasing the number of features has an exponential effect.

Fig. 11 Average computational time for OST tree construction on synthetically generated datasets, with varying numbers of observations n and covariates p. The shaded region corresponds to the 95% confidence intervals

We present a comparative analysis of the computational performance of the OST, rpart, and ctree algorithms in “Appendix 1.3”. Due to its greedy nature, rpart is able to terminate in less than a minute across all instances. The ctree package requires significantly more time: although it runs faster than OST on these instances, its running time grows at an exponential rate as the number of observations increases.

7 Computational experiments with censored data from longitudinal studies and surveys

In this section, we focus on different aspects of algorithmic performance using three widely known longitudinal studies. In Sect. 7.1, we present results from the Wisconsin Longitudinal Study and highlight differences in performance as we vary the mix of categorical and numerical features. In Sect. 7.2, we use data from the Health and Lifestyle Survey to compare the algorithms on a large set of features. Finally, in Sect. 7.3, we showcase an application of the algorithm to heart disease using data from the landmark Framingham Heart Study.

The three datasets discussed in this section are typical real-world applications of survival analysis: the outcome of interest is the time to a particular event, and each dataset includes censored outcomes due to individuals lost to follow-up during longitudinal studies. In “Appendix 3”, we describe additional experiments in which we simulate different levels of censoring in datasets drawn from the UCI repository (Dua and Graff 2017). These supplementary results demonstrate the strong performance of the OST algorithm across a variety of datasets with a range of different sizes and features.

7.1 The Wisconsin longitudinal study

In 1957, the Wisconsin Longitudinal Study (WLS) randomly sampled 10,317 Wisconsin high school graduates (one-third of all graduates) for a decades-long study, observing them until 2011 (Herd et al. 2014). The aim of the study was to understand how factors such as social background, schooling, military service, labor market experiences, family characteristics and events, and social participation may affect mortality and morbidity, family functioning, and health. Our analysis includes data from all recorded participants for 518 variables that were collected either from the original respondents or from their parents.

We removed from our dataset all features for which more than 50% of the values were missing. We imputed the remaining missing values with the mean of each covariate for numerical features and with the mode for categorical and binary variables. In total, we retained 317 categorical, 103 numerical, and 77 binary covariates. In each randomized experiment, we sampled 10, 15, 20, 25, or 30 features from each category. Our goal was to observe the algorithms’ performance as we vary the combination of different types of covariates.
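
A minimal sketch of this preprocessing pipeline in pandas; the grouping of columns into types and the function name are generic, and the actual WLS variable names are not reproduced here:

```python
import pandas as pd

def preprocess_wls(df, feature_groups, rng, k=10):
    """feature_groups maps 'numerical', 'categorical', and 'binary' to
    lists of column names; k features are sampled from each group
    (k in {10, 15, 20, 25, 30} in the experiments above)."""
    # Remove features with more than 50% missing values.
    df = df.loc[:, df.isna().mean() <= 0.5].copy()

    # Mean-impute numerical features; mode-impute categorical/binary ones.
    for col in df.columns:
        if col in feature_groups["numerical"]:
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Sample k surviving covariates of each type for one experiment.
    keep = []
    for cols in feature_groups.values():
        available = [c for c in cols if c in df.columns]
        keep += list(rng.choice(available, size=min(k, len(available)),
                                replace=False))
    return df[keep]
```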

Our results show minimal variability in performance as we change the number of numerical and binary features (see “Appendix 2.1”). However, all three methods show trends in the average performance scores for different numbers of categorical features, as shown in Fig. 12. Specifically, both OST and rpart algorithms show slight decreases in performance with larger feature sets, likely due to overfitting, while the ctree algorithm performs slightly better on larger feature sets.

Overall, OST clearly outperforms the other methods in terms of the Integrated Brier Score and the Cox PL ratio, and is on par with rpart in both concordance statistics. The ctree algorithm performs poorly relative to the other algorithms across all metrics.

Fig. 12 Average performance of survival tree models on subsets of features from the WLS dataset with varying numbers of categorical variables. The shaded regions represent 95% confidence intervals across 100 randomized experiments

7.2 The health and lifestyle survey

The first Health and Lifestyle Survey (Cox et al. 1988) was carried out in 1984–1985 on a random sample of the population of England, Scotland and Wales. Its objective was to help researchers relate self-reported health, attitudes to health, and beliefs about the causes of disease to measurements of health and lifestyle in adults from different parts of Great Britain. In our numerical experiments, the outcome of interest is the age of death of study participants as observed by follow-up studies until 2009. Our dataset includes 9003 individuals and 112 binary features. We conducted 100 randomized experiments to train each tree algorithm.

Table 4 Average scores for OST, rpart, ctree models on the HALS dataset

Table 4 outlines the results of our analysis on the HALS dataset. The OST algorithm outperforms the other methods on all metrics other than Uno’s C. Specifically, OST achieves an average Integrated Brier Score of 0.6114, compared to 0.6056 and 0.6105 for ctree and rpart respectively. In terms of the Cox PL ratio, OST offers an 8% improvement over the next best method (rpart), with an average score of 0.0125. OST’s average Harrell’s C is 0.6211, while ctree and rpart score 0.6113 and 0.6185 respectively. For Uno’s C, in contrast to the other measures of performance, ctree achieves the best score with an average of 0.4098, a margin of 0.0111 over OST. Our findings from this study are in line with the results in Sect. 6 and the supplementary experiments in “Appendix 3”.

7.3 The Framingham heart study

In this section, we focus on the interpretation of the tree models using data from the Framingham Heart Study (FHS). Analysis of the FHS successfully identified the common factors or characteristics that contribute to Coronary Heart Disease (CHD) using the Cox regression model (Cox 1972). In our survival tree model, we include all participants in the study from the original cohort (1948–2014) and the offspring cohort (1971–2014) who were diagnosed with CHD. The event of interest in this model is the occurrence of a myocardial infarction or stroke. All 2296 patients were followed for a period of at least 10 years after their first diagnosis of CHD, and observations were marked as censored if no event was observed while the patient was under observation.

We applied our algorithm to the primary variables that have been used in the established 10-year Hard Coronary Heart Disease (HCHD) Risk Calculator and the Cardiovascular Risk Calculator (Expert Panel 2001; D’Agostino et al. 2008). For each participant who was diagnosed with CHD, we include the following covariates in our training dataset: gender, smoking status (smoke), Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), use of anti-hypertensive medication (AHT), Body Mass Index (BMI), and diabetic status (diabetes). We did not include cholesterol levels in our analysis because these variables are highly correlated with the use of lipid-lowering treatment, and a high proportion of the sample population did not have sufficient data to account for this interaction.

In Fig. 13 we illustrate the output of our algorithm on the FHS dataset. Every node of the tree provides the following information (a sketch of how the per-node survival curves can be computed follows the list):

  • The node number.

  • Number of observations classified into the node.

  • Proportion of the node population which has been censored.

  • A plot of survival probability vs. time. In this example, the x-axis represents age and the y-axis gives the Kaplan–Meier estimate for the probability of experiencing no adverse events.

  • Color-coded survival curves to describe the different sub-populations. In each node, the blue curves describe the individuals classified into that node.

  • In internal (parent) nodes, the orange/green curves describe the sub-populations that are split into the left/right child node. After each split, the sub-population with higher likelihood of survival goes into the left node.

  • In leaf nodes, the red curve shows the average survival curve for the entire tree. This facilitates easy comparisons between the survival of a specific node and the rest of the population.
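
The curves in each node are standard Kaplan–Meier estimates computed on the observations routed to that node. The following is a minimal sketch using the lifelines package (one common Python implementation of Kaplan–Meier estimation, not the paper's own plotting code); the `leaf_ids` array standing in for the fitted tree's routing is a hypothetical placeholder:

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

def plot_node_curves(durations, events, leaf_ids):
    """Plot one Kaplan-Meier curve per leaf node against the overall
    population curve. All arguments are NumPy arrays; leaf_ids gives
    each observation's leaf assignment from the fitted tree."""
    ax = plt.gca()

    # Reference curve for the entire population (red in Fig. 13).
    kmf_all = KaplanMeierFitter()
    kmf_all.fit(durations, event_observed=events, label="all patients")
    kmf_all.plot_survival_function(ax=ax, color="red")

    # One survival curve per leaf node.
    for leaf in sorted(set(leaf_ids)):
        mask = leaf_ids == leaf
        kmf = KaplanMeierFitter()
        kmf.fit(durations[mask], event_observed=events[mask],
                label=f"node {leaf}")
        kmf.plot_survival_function(ax=ax)

    ax.set_xlabel("age")
    ax.set_ylabel("estimated survival probability")
    plt.show()
```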

The splits illustrated in Fig. 13 include known risk factors for heart disease and are consistent with well-established medical guidelines. The algorithm identified a BMI threshold of 25 as the first split (node 1), which is in accordance with the NIH BMI ranges that classify an individual as overweight if his/her BMI is greater than or equal to 25. Multiple splits indicated a higher risk of heart attack or stroke in patients who smoke (nodes 2, 6). The group with the highest risk of an adverse event was overweight patients with diabetes (node 9).

Figures 14 and 15 illustrate the output of the ctree and rpart algorithms applied to the same FHS population. The rpart model has a single split (BMI), while the ctree model contains the same variables as the OST output. The Brier scores for each model are 0.0486 (OST), 0.0249 (rpart) and 0.0467 (ctree).

In this example we can reasonably conclude that the smaller size of the rpart tree limits its predictive performance. This highlights the important role of cross-validation procedures in selecting an appropriate complexity parameter. In “Appendix 2.2.2” we describe additional experiments which contrast the performance of tree models with uniform size and shape (thus eliminating the effects of parameter selection), and note that the average performance of OST models is generally better than that of rpart trees of the same size.

The discrepancy in the Brier scores for the OST and ctree models is due to slight differences in the threshold and position of certain splits. For example, both methods identify that BMI is the most appropriate variable for the first split, but the BMI threshold differs. The ctree model sets the splitting threshold to 24.117, which is the locally optimal value for the split when building the tree greedily (the same threshold is used in the rpart model). By contrast, the OST algorithm selects a threshold of 25.031. This example demonstrates how the OST algorithm’s efforts to find a globally optimal solution differ from the results of locally optimal splits.

A second difference between the tree models is the order of the smoking and diabetes splits within the overweight population. The ctree model splits on smoking first, since this split has the most significant p-value among the variables at node 5 of the ctree tree. The algorithm also recognizes that diabetes is a risk factor and incorporates this in the subsequent split. Since greedy approaches like ctree do not reevaluate splits once they have been decided, the algorithm does not recognize that the overall quality of the tree can be improved by reversing the order of these splits. This discrepancy in two otherwise similar trees highlights the advantages of the more sophisticated optimization conducted by OST.

Fig. 13 An illustration of optimal survival trees for CHD patients in the FHS

Fig. 14 Illustration of the rpart output for CHD patients in the FHS

Fig. 15 Illustration of the ctree output for CHD patients in the FHS

8 Conclusion

In this paper, we have extended the state-of-the-art Optimal Trees framework to generate interpretable models for censored data. We have also introduced a new accuracy metric, the Kaplan–Meier Area Ratio, which provides an effective way to measure the predictive power of survival tree models in simulations.

The Optimal Survival Trees algorithm improves on the performance of existing algorithms in terms of both classification and predictive accuracy. Our simulation results indicate that the OST models improve consistently with increasing sample size, whereas existing algorithms are prone to overfitting in larger datasets. This is particularly important given that the volume of medical data available for research is likely to increase significantly over the coming years.