Introduction

Survival analysis has been widely used in biomedical research to investigate the effects of clinical risk factors on survival outcomes. The Cox proportional hazards model1,2 is a widely applied method for assessing survival outcomes during the follow-up period. Although one of the key benefits of the Cox model is that it does not assume the shape of the baseline hazard, it only considers single or first events. Meanwhile, we often encounter some cases in which patients experience multiple events over time. If the events preclude the occurrence of the event of interest, then they can be competing risks. For example, when the event of interest is a death caused by cardiovascular disease, another occurrence of death due to non-cardiovascular disease can be a competing risk3,4,5. Considering this situation, more extended statistical methods are needed to investigate the presence of competing risks.

The first approach to adapting competing risks data is the Cause-specific hazard (CS) model, which estimates the hazard of each event separately6. The second approach is the Fine and Gray (FG) model7, also known as the sub-distribution hazard model to estimate the hazards of the cumulative incidence function. Because the main difference between these two approaches is the presence or absence of competing risks in the risk set, they can yield different results4. Recently, Ishwaran et al.8 introduced the Random survival forests (RSF) which is a notable approach for application to competing risks data in a machine learning framework. The RSF is a tree-based estimation and prediction method for estimating event-specific risk factors and cumulative incidence functions non-parametrically.

Meanwhile, there may be cases in which events of interest or diseases are relatively rare as bio-medical science advances over the past decades. For instance, penile cancer is a rare disease with an incidence of only 0.1 to 0.9 per 100,000 males9. Because accurate and reliable results may not be obtained in the presence of rare events, more caution should be taken when considering the complexity of the data structure.

The remainder of this paper focuses on the predictive performance of three existing competing risks models with rare events. We conducted a nationwide prospective cohort study to investigate the association between clinical risk factors and adverse renal outcomes. Subsequently, we compared the predictive performance using three existing methods with real-world data. Additionally, we consider situations that exist only in linear or nonlinear effects in time-to-event outcomes and covariate terms. Subsequently, discrimination, predictive accuracy, and model calibration were evaluated using the C-index, IBS, and Calibration slope, respectively.

Materials and methods

Patients information and data collection

The KoreanN Cohort Study for Outcomes in Patients With Chronic Kidney Disease (KNOW-CKD) is a multi-center, patient-based cohort study launched in 2011. The objective of the KNOW-CKD is to explore the etiologic risk factors associated with the clinical course progression of CKD. A total of 2,238 eligible patients were all Koreans, aged between 20 and 75 years, with non-dialyzed CKD stages 1 from 5 based on the estimated glomerular filtration rate from serum creatinine.

Demographic characteristics and medical histories, such as smoking status, comorbidities, and cause of CKD, were obtained at baseline. Laboratory data were also collected, including hemoglobin, fasting blood sugar, uric acid, calcium, phosphorous, albumin, total cholesterol, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and high-sensitivity C-reactive protein. The major outcomes of CKD can be progression to end-stage kidney disease (ESKD), kidney failure, death, and complications of kidney dysfunction, including cardiovascular disease (CV), anemia, and bone disease. More detailed information regarding the KNOW-CKD cohort study has been published elsewhere10,11.

Risk factors

To identify possible risk factors for renal dysfunction with a competing risks model, we selected 12 covariates according to clinical consideration: age at enrollment (age), gender, baseline comorbid diseases such as coronary artery disease (CAD), diabetes mellitus (DM), cardiovascular disease (CVD), smoking status (smoking), CKD stage at the study entry (CKD_stage), body mass index (BMI), hemoglobin (HG), high-sensitivity C-reactive protein (CRP), mean arterial pressure (MAP), and low-density lipoprotein cholesterol (LDL) at baseline.

Outcomes

In the CKD study, a CV event was defined as any first event of the following: acute myocardial infarction, unstable angina, ischemic or hemorrhagic cerebral stroke, congestive heart failure, symptomatic arrhythmia, aggravated valvular heart, pericardial disease, abdominal aortic aneurysm, and severe peripheral arterial disease. In addition, an end-stage kidney disease (ESKD) can be defined as the initiation of renal replacement therapy, such as dialysis or renal transplantation12.

Methods for analyzing survival data with competing risks

Cause-specific hazard model

The CS model is an adaptation of the Cox model, with cause-specific hazard functions from different types of events. It treats failure from the cause of interest as events and failures from other causes as censored13. Let T and C denote failure and censoring times, respectively. The cause-specific hazard function at time t for cause \(k (k=1, \cdots ,K)\) is

$$\begin{aligned} h_{k}^{CS}(t) = \lim _{\Delta \rightarrow 0}\frac{P(t < T \le t + \Delta t, K=k |T > t)}{ \Delta t}. \end{aligned}$$

Similar to the Cox model, a separate proportional hazards model with p-dimensional covariates for cause k can be defined as

$$\begin{aligned} h_{k}^{CS}(t|{{\textbf {X}}}) = h_{k0}^{CS}(t)\exp ({\varvec{\beta }}^{'}_{{\textbf {k}}}{{\textbf {X}}}), \end{aligned}$$

where \(h^{CS}_{k0}(t)\) is the baseline of the cause-specific hazard function, and the exponential term illustrates the covariate effects on cause k. The regression coefficients \(\beta _{k}\) for cause k can be calculated using the maximum partial likelihood estimation method as follows:

$$\begin{aligned} L(\beta _{k}) =\prod _{i} \left( \frac{{\exp (\beta {'}_{{\textbf {k}}}{{\textbf {X}}}_{i})}}{\sum _{j \in R_{i}^{CS}}{\exp (\beta {'}_{{\textbf {k}}}{{\textbf {X}}}_{j})}}\right) ^{K=k}, \end{aligned}$$

where \(R_{i}^{CS}\) represents the i-th risk set which contains subjects who have not experienced any event and are not censored yet.

Fine and Gray model

The FG method7 is another extension of the Cox model that estimates the incidence of outcomes in a follow-up period with competing risks. Unlike the CS model, the FG model can be defined as the instantaneous rate of occurrence of each event type in subjects who are still under observation and those who have already experienced competing risks. The FG model describes the hazard function,

$$\begin{aligned} f_{k}^{FG}(t) = \lim _{\Delta \rightarrow 0}\frac{P(t< T \le t + \Delta t, K=k|T<t \cup (T<t \cap K \ne k))}{ \Delta t}. \end{aligned}$$

Analogous to the CS model, the FG model with covariates can be defined as follows:

$$\begin{aligned} h_{k}^{FG}(t|{{\textbf {X}}}) = h_{k0}^{FG}(t)\exp (\beta ^{'}_{{\textbf {k}}}{{\textbf {X}}}), \end{aligned}$$

where \(h^{FG}_{k0}(t)\) is the baseline of the subdistribution hazard of cause k. The regression coefficient \(\beta _{k}\) for cause k can also be calculated using the maximum partial likelihood estimation:

$$\begin{aligned} L(\beta _{k}) = \prod _{i}\left( {\frac{\exp (\beta ^{'}_{{\textbf {k}}}{{\textbf {X}}}_{i})}{\sum _{j \in R_{i}^{FG}} w_{ij} \exp (\beta ^{'}_{{\textbf {k}}}{{\textbf {X}}}_{j})}}\right) ^{K=k}, \end{aligned}$$

where \(R_{i}^{FG}\) represents the i-th risk set, which contains subjects who have experienced competing risks ahead and are still event-free. \(w_{ij}\) are subject-specific weights that reflect the incidence of competing risks.

Random survival forests

The RSF, first introduced by Ishwaran et al.14, is an extension of the machine-learning tree-based Random forests15. Subsequently, Ishwaran et al.8 proposed a fully nonparametric method for competing risks in the presence of right-censored data. As mentioned in Ishwaran et al.8, RSF has many useful properties. It calculates the cumulative incidence function directly in each node and provides an accurate prediction performance by aggregating individual trees in the ensemble. This allows the model to include not only linear effects but also nonlinear and interaction effects. Permutation importance was used as a measure of variable importance to identify the event-specific risk factors. A more detailed description of RSF is presented in Ishwaran et al.8.

RSF constructs each tree using a bootstrap sample of the original training data. We extracted .632 samples without replacement as training data in the bootstrapped sample and excluded the rest of them for out-of-bag data. Random feature selection can be applied to evaluate the split tree nodes at each node. For each tree, a cumulative hazard function was estimated using the Nelson-Aalen estimator. The survival forest is calculated by averaging the terminal node statistics in the ensemble.

Since identifying the most influential risk factors is the main interest in the medical field, it can be interesting to use Variable Importance as a means of finding a ranking of important risk factors. The RSF approach ranks covariates based on their predicted values and determines the important factors through the predicted accuracy. That is, the prediction accuracy of the test data in the current model was first calculated for each tree. Then, the prediction accuracy was also calculated on shuffled data using the same method. The differences between the original prediction accuracy and randomly permuted prediction accuracy were averaged over all trees, and they were normalized by the standard error. If a current model without the original values of a variable can provide a worse prediction, then the variables with large values are ranked as more important.

In addition to identifying important risk factors, there is one way to explain how each covariate affects the output of the model. The SHAP value is used to address each variable’s contribution to the model. The SHAP value is closely related to “Shapley values” which were first developed for the game theory method16. It has been widely adopted since the study by Lundberg and Lee17 was first published. The SHAP value translates to assigning an importance value to features, depending on their contribution to the prediction. In other words, it is calculated as the average marginal contribution of a variable value across all possible combinations18. Thus, the SHAP value can be the relative risk of the outcomes, meaning that high values contribute more to the predicted probability.

Performance measures

Concordance index

The C-index is commonly used to identify predicted probability in a prognostic model. This is a generalization of the area under the ROC curve that considers censored data19,20. In randomly selected patients, those with shorter time-to-event would have higher risk scores and lower predicted time-to-event outcomes. As presented in Brentnall and Cuzick21, the C-index is calculated from the Wilcoxon rank-sum statistic by

$$\begin{aligned} C = P( {\hat{T}}_1 \ge {\hat{T}}_2 | T_1 \ge T_2), \end{aligned}$$

where \(T_{1}\) and \(T_{2}\) are survival time of the two different subjects, \({\hat{T}}_1\) and \({\hat{T}}_2\) are their predictions from a fitting method, and \(I(\cdot )\) denotes the indicator function. Then, the C-index can be estimated by

$$\begin{aligned} {\hat{C}} = m^{-1}\sum _{i, j: T_i \le T_j} I[({\hat{S}}(t|X_i) \le {\hat{S}}(t|X_j)], \end{aligned}$$

where \(\hat{S}(t|X_{i})\) is the predicted survival function with covariate \(X_i\) and \(m = \sum _{i,j}I(T_i \le T_j)\).

The C-index value is a proportion between 0 and 1. Values near 1 indicate high model discrimination performance, whereas values near 0.5 show similarity to random prediction20. Note that there are various methods for calculating C-index for right-censored data. Harrell’s C-index depends on censoring distribution because they provided C-index censored information largely excluded22. Otherwise, Uno23 and Efron21’s approaches are censoring-independent.

Integrated brier score

Another criterion used to compute risk prediction is the time-dependent Brier Score (BS)24,25. The BS can be defined by the mean squared error between the predicted survival function \(\hat{S}(t|X_{i})\) and the predictor variable \(X_{i}\) and the true status \(Y_{i}(t)\) at a specified time-point t26. In the presence of right censored data, the BS can be estimated by

$$\begin{aligned} {\widehat{BS}}(t|\hat{S}) = \frac{1}{M}\sum _{i\in \tilde{D}_{M}}\hat{W}_{t}(t)\{\tilde{Y}_{i}(t)-\hat{S}(t|X_{i})\}^2, \end{aligned}$$

where \(\tilde{Y}_{i}(t)\) represents the observed status at time t. \(\tilde{D}_{M}\) is the test dataset with sample size M and \(\hat{W}_{i}(t)\) is the inverse probability of censoring weights24 defined by

$$\begin{aligned} \hat{W}_{i}(t) = \frac{(1-\tilde{Y}_{i}(t))\delta _{i}}{\hat{G}(T_{i}-|X_{i})}+\frac{\tilde{Y}_{i}(t)}{\hat{G}(t|X_{i})}. \end{aligned}$$

Here, \(\delta _{i}\) is the event indicator, \(\hat{G}(t|x)\approx P(C_{i} > t|X_{i}=x)\) is an estimate of the conditional survival function of censoring time C and \(\hat{G}(T_{i}-|X_{i})\) denotes the estimated survival function prior to \(T_{i}\) for the censoring time C. The estimated IBS is calculated by integration:

$$\begin{aligned} {\widehat{IBS}} = \frac{1}{max(t_{i})}\int _{0}^{max(t_{i})}{\widehat{BS}}(t|\hat{S})dt. \end{aligned}$$

Because IBS is based on the mean squared error, lower values are considered a better predictor.

Calibration slope

Calibration refers to the agreement between the predicted probabilities and observed number of events27. As described by Ambler et al.28, the calibration was assessed using a linear regression model to show the association between the observed and predicted values at time t. Thus, the regression model is defined as:

$$\begin{aligned} log\left( \frac{1-S_{i}(t|X_{i})}{S_{i}(t|X_{i})}\right) = intercept + slope\times log\left( \frac{1-\hat{S_{i}}(t|X_{i})}{\hat{S_{i}}(t|X_{i})}\right) + error, \end{aligned}$$

where \(S_{i}(t|X_{i})\) and \(\hat{S_{i}}(t|X_{i})\) are the true and estimated survival functions for the subject i at time t, respectively. When the model is perfectly calibrated, the estimated Calibration slope is 1. If the slope is higher than 1, the model is under-fitted. In contrast, overfitted models show a Calibration slope lower than 1, meaning that the model underestimates the probability of an event in the low-risk group and overestimates it in the high-risk group.

Simulation study

To compare the performances of the three methods, we consider the following simulation study. We first generated time-to-event, \(T_i\), based on the Cox-exponential hazards model with one event of interest and one competing risk (i = 1,2), considering the following form:

$$\begin{aligned} T_{i} = -\frac{\log (U_{i})}{\lambda _{i}\exp (g(X_i, \beta ))}, \end{aligned}$$

where \(U_{i}\) is generated from a uniform distribution and the baseline hazards are \(\lambda _{1} = 1\) and \(\lambda _{2} = 2\), respectively. For the ith state of events, five independent predictors \(X_{ij} (i=1,2; j=1,...,5)\) are generated from the standard normal distribution. For the parametric function \(g(X_i, \beta ))\), we consider linear and nonlinear terms (Table 1). The true regression coefficients are set as follows:

$$\begin{aligned}{} & {} (\beta _{11},\cdots ,\beta _{110}) = (1, 0, 1, -1, 1.5, 0.6, 0.6, 0.6, 0.6, 0.6),\\{} & {} (\beta _{21},\cdots ,\beta _{210}) = (0, 2, 1.5, 1, 1, 0.3, 0.3, 0.3, 0.3, 0.3). \end{aligned}$$
Table 1 Summary of simulation settings.

To investigate the performance of each method when the parametric function is correct or misspecified, we fitted two CS models when \(g(X_i; \beta )\) is linear (CS_l) and nonlinear (CS_nl). Similarly, the FG model when \(g(X_i; \beta )\) is linear (FG_l) and nonlinear (FG_nl), respectively. Note that RSF does not require the selection of the parametric function \(g(X_i; \beta )\). We implemented 500 and 1000 sample sizes and split the training and test sets with a 7:3 ratio in each simulation, which was repeated with 1000 independent observations. In addition, to examine the effect of censoring rates, we consider different censoring rates (30%, 60%, 80%) for each simulation.

Results

Baseline characteristics

Baseline characteristics of a total of 2238 patients are shown in Table 2. Here, we considered a CV event as the event of interest. Since a CV event that occurred after the development of ESKD was not followed for practical reasons, an ESKD event can be regarded as a competing risk. We focused on the 4-year incidence rate with clinical considerations, and the incidence rates of CV and ESKD events were 117 (5.23%) and 360 (16.09%) as shown in Fig. 1. The cumulative incidence function can be defined as the expected proportion of subjects with a specific event over time29 alongside with Kaplan-Meier estimator without any competing risks. The 4-year cumulative incidence of CV and ESKD events in all subjects is described in Fig. 2 showing that the cumulative incidence of ESKD events exceeded that of the CV events.

Table 2 Baseline covariates summary in KNOW-CKD data.
Figure 1
figure 1

The 4-year incidence rates for CV and ESKD events in CKD subjects.

Figure 2
figure 2

Cumulative incidence function curves for each event.

Analysis of CKD data

The results of the two competing risk models are similar (Table 3). According to the CS model, age, gender, DM, and CVD were significantly associated with CV events; HRs were 1.04 (95% CI 1.02,1.07), 0.44 (95% CI 0.20,0.97), 2.15 (95% CI 1.28, 3.60), and 3.15 (95% CI 1.73, 5.72) respectively. Likewise, age, DM, and CVD were significantly associated with CV events in the FG model; HRs were 1.05 (95% CI 1.02, 1.07), 2.11 (95% CI 1.21, 3.66), and 3.03 (95% CI 1.68, 5.48), respectively. In addition, we evaluated the model performance by calculating the predicted probabilities and compared the results. All three results are similar, except for the Calibration slope, which is 1.167, 0.928, and 0.971, respectively (Table 4).

Table 3 Hazard ratios (HR) with their confidence intervals, and p-values in the CS and FG results.
Figure 3
figure 3

SHAP value (left) and Variable Importance (right) in RSF result.

Table 4 Prediction performance.

Lastly, we calculated SHapley Additive exPlanations (SHAP) values and Variable Importance to identify the contribution of the risk factors. As shown in Fig. 3, the left panel displays SHAP values meaning that the first variable on the top is the most important and the last variable on the bottom is the least important. High values of CKD_stage, MAP, and LDL positively contributed to predicting a CV event. DM and CAD were also positively associated with the prediction of CV events. On the other hand, HG and BMI have a negative relationship with predicting a CV event, and males are associated with a higher incidence of a CV event than females. The variable Importance from RSF result is also represented in the right side of Fig. 3 showing that CVD is the most influential predictive factor for the incidence of a CV event. Then, age, DM, CAD, and HG were assessed. CVD, age, and DM were important clinical factors for predicting CV events when ESKD was considered a competing risk in three results. In the case of SHapley values, the CKD stage is the most important factor in predicting CV events. Age, DM, male sex, and advanced CKD stage have been reported as risk factors for adverse CV events in previous studies.30,31,32

Simulation results

The prediction performances were estimated using box plots through the C-index, IBS, and Calibration slope. Detailed descriptions of the three measures are provided in Methods section. Figures 4, 5, 6, 7 show the results of the prediction performance by varying the sample size (\(n=500\) and 1000) and scenarios (1 and 2). In Figs. 4 and 6 (when the true \(g(X_i; \beta )\) is linear), as the censoring rates increase, the three prediction measures have similar performance on average but become highly volatile. In particular, the Calibration slope revealed increased overfitting and underfitting. Because the true \(g(X_i; \beta )\) is linear, the correct models CS_l and FG_l perform similarly to the models CS_nl and FG_nl models. Interestingly, RSF was slightly worse than the correct models in terms of the C-index and IBS, and its Calibration slope is lower than 1 (overfitted) in all cases. Figures 5 and 7 (when the true \(g(X_i; \beta )\) is nonlinear), the misspecified models CS_l and FG_l are poorer than their correct models. Among the methods, the performance of the CS method was poorer than that of the other methods. Their C-index was much lower, the IBS was much larger, and the range of the estimated Calibration slope was too wide, indicating overfitting and underfitting. This showed that the CS model with linear and nonlinear effects in time-to-event was more susceptible as the censoring rates increased. In summary, the performance of each method deteriorated as the censoring rate increased. However, if the conditions are the same, the results with larger sample sizes show a better and more stable performance. The CS approach is more sensitive to censoring rates than the FG and RSF methods.

Figure 4
figure 4

Boxplots of predicted probabilities by each method with the existence of linear effects in time-to-event with sample size n\(=500\). The top, middle and bottom results represent 30%, 60%, and 80% censoring rates, respectively. The horizontal dashed line represents the optimal value in each plot.

Figure 5
figure 5

Boxplots of predicted probabilities by each method with the existence of nonlinear effects in time-to-event with sample size n\(=500\). The top, middle and bottom results represent 30%, 60%, and 80% censoring rates, respectively. The horizontal dashed line represents the optimal value in each plot.

Figure 6
figure 6

Boxplots of predicted probabilities by each method with the existence of linear effects in time-to-event with sample size n\(=1000\). The top, middle and bottom results represent 30%, 60%, and 80% censoring rates, respectively. The horizontal dashed line represents the optimal value in each plot.

Figure 7
figure 7

Boxplots of predicted probabilities by each method with the existence of nonlinear effects in time-to-event with sample size n\(=1000\). The top, middle and bottom results represent 30%, 60%, and 80% censoring rates, respectively. The horizontal dashed line represents the optimal value in each plot.

Discussion

In this study, we analyzed a multicentre cohort study of chronic kidney disease (KNOW-CKD) data based on the competing risks framework. We investigated the association between clinical risk factors and renal dysfunction and compared their estimates, predicted abilities, and feature importance. As the previous studies have shown that age, male sex, diabetes, and known vascular disease are risk factors for the 10-year development of CV diseases in the general population33, our results from the CKD population were also similar. In the CS result, age, gender, DM, and CVD variables were significant. In the FG result, age, DM, and CVD were significant but gender was marginally significant. Although RSF cannot directly provide the p-values for significance, the variable importance was calculated. Analysis from Chronic Renal Insufficiency Cohort (CRIC) with advanced CKD showed increased age, diabetes, elevated blood pressure, and any CVD history are significant risk factors for CV events34. Other studies with KNOW-CKD subjects also reported that male sex, diabetes, and increased CKD stage are risk factors for CV events31,35.

In real data analysis, two conventional methods (CS, FG) and the RSF method showed similar results in terms of the significance of risk factors and prediction. Although all methods select similar variables as the relevant features and have similar prediction performances, they all have good predictive power. Therefore, we believe that the current model with linear terms is reasonable to fit our data. Thus, we have conducted simulation studies by considering linear and non-linear terms. The performance of each method decreases as the censoring rate increase. These results show that more sophisticated methods must be considered when developing prediction models with few events. Based on the simulation results when the non-linear model is true, the RSF method was more robust. Therefore, the RSF method is recommended when the prediction performances are similar.

We only used the baseline covariates as the feature variables to analyze the competing risks data. Our current analysis provided ease of computation and straightforward prediction of survival time. Because the KNOW-CKD study is longitudinal, some relevant covariates related to CKD patients can be time-dependent. Extension to a model where the time-dependent covariates or time-varying coefficients are considered would be interesting future work.