Introduction

Prostate cancer (PCa) is a common cancer in men aged 65 or above, causing substantial mortality and morbidity worldwide. It is estimated that nearly 1.3 million people are newly diagnosed worldwide every year, and approximately 400,000 suffer from treatment-related morbidity (Global regional and national incidence prevalence 2017; Foreman et al. 2018). Although various therapies and management of both primary and metastatic PCa have advanced rapidly (Sandhu et al. 2021), it is still difficult to balance treatments and risks of progression with therapy-related health problems (Donovan et al. 2016). This may imply that we should focus on recognizing those who can benefit from specific therapies.

Radical prostatectomy (RP) has been considered a standard treatment for TNM stage I–III PCa patients (Sekhoacha et al. 2022). It could prevent further metastatic seeding and late complications of aggressive PCa (Costello 2020). RP was widely applied to patients with low-risk PCa rather than those with high-risk PCa in the 1980s and 1990s (Costello 2020). However, over the past 40 years, the role of RP in treating prostate cancer has changed considerably because of RP’s significant risk of overtreatment and accompanying adverse effects (Hamdy et al. 2023). For example, a shift in the application of RP toward PCa patients with high risks occurred, and the survival time was similar in low-risk PCa patients who received RP or other therapies (Wilt et al. 2017, 2020). A review also found that RP was related to decreased cancer-specific quality of life (Lardas et al. 2017) partly due to RP’s effect on biological functions (Litwin and Tan 2017), indicating that it might not always be appropriate to use RP. As such, identification of optimal candidates for RP is quite important to make PCa patients benefit more from therapies and avoid overtreatment.

Therefore, this study aims to establish a model to discern whether an individual patient can benefit more from RP and to identify heterogeneity in treatment responses among PCa patients.

Materials and methods

Study design

All patients were included from the Surveillance, Epidemiology, and End Results (SEER) database, which comprises data from 18 regions across the United States, accounting for approximately 30% of the national population (Islami et al. 2021). This study adhered to the Strengthening the Reporting of Observational Studies in Epidemiology reporting guidelines for observational research (Elm et al. 2007).

Men aged 18 or above who were diagnosed with PCa as a primary cancer and who received RP or did not undergo surgery between 2010 and 2017 were included. Anatomic site codes (C61.9) and histology subtypes (8140) were classified according to the International Classification of Disease for Oncology, 3rd edition. We excluded those falling under any of the following:

  1. 1)

    Age below 18;

  2. 2)

    Lack of clear data on Gleason scores, TNM stage, clinical prostate-specific antigen (PSA) level, or tumor size;

  3. 3)

    Unknown demographic information;

  4. 4)

    Unknown survival months;

  5. 5)

    Unknown metastasis status.

Figure 1A provides a comprehensive illustration of the participant inclusion process. The focal outcome under examination was overall survival (OS), a metric provided by SEER, denoting the time period between all-cause death and the initial PCa diagnosis. Patients who were still alive in December 2020 were considered censored data, so the minimum follow-up time was 3 years.

Fig. 1
figure 1

Flowchart of patient inclusion and schematic of the SNB architecture. A Flowchart of patient inclusion. B Schematic of the SNB architecture. PCa prostate cancer; PSA prostate-specific antigen; SNN Self-normalizing neural network; IPM integral probability metrics

Deep learning algorithms

T-learner is a commonly used model for inferring the individual treatment effect (ITE)(Künzel et al. 2019), which trains each base model on different intervention groups separately and obtains the conditional average treatment effect (CATE). However, T-learner has some drawbacks: (1) it relies heavily on the performance of the trained base models, whose performance can be affected by extremely imbalanced training numbers in two groups (Yao et al. 2020); (2) the ignorability of T-learner only eliminates confounding artifacts, while imbalances in generating distributions due to biased treatment allocation could still be present.

Benefitting from the development of deep learning (DL) and representation learning, balancing the generating distributions of the different treatment groups has been proven to be effective for both covariate space (Li et al. 2014) and latent representations (Johansson et al. 2020). Balanced Individual Treatment Effect for Survival data (BITES) (Schrod et al. 2022), a semiparametric survival regression DL model, contains a shared network and two risk networks. In a shared network, balanced generating distributions are obtained by using integral probability metrics (IPM) to maximize the p-Wasserstein distance of the latent representations of different treatment arms, and the smoothed optimal transport loss is calculated, which is called representation-based causal inference.

Self-normalizing neural networks (SNNs) enhanced feed-forward neural networks (FNNs) and significantly outperformed all competing FNN methods (Klambauer et al. 2017). The neuron activations of SNNs automatically converge toward zero mean and unit variance, which in turn avoids exploding and vanishing gradients. Thus, in this study, we introduce the Self-Normalizing Balanced individual treatment effect for Survival data (SNB). SNB inherits the architecture of BITES, while scaled exponential linear units (SELUs) are added in both shared networks and risk networks. Shared network contains a five-layered SNN with a dropout rate of 10%. The shared network receives input features and uses IPM to balance the latent representations between each treatment group. Risk networks are two identical four-layered SNNs, which, respectively, represent the regularized representations of Non-Surgery group and RP group. At time of inference, the SNB calculates the corresponding treatment-specific baseline hazards in each of the two risk networks. By varying the risk networks which an individual's features are entered and its baseline hazards, SNB can predict survival outcomes under the hypothesis of different treatments, thereby visualizing the advantages and disadvantages of different treatments. The architecture of SNB is presented in Fig. 1B.

Model development and treatment recommendation

Temporal validation(Cooray et al. 2023) was utilized to validate models. We allocated patients diagnosed from 2010 to 2015 (24,464 (69.4%) patients) to a training set that was used for building the models and a testing set which consists of patients diagnosed from 2016 to 2017 (10,772 (30.6%) patients) to evaluate the models’ performance and the effect of the models’ recommendation. During the training period, we used fivefold cross-validation to tune the model hyperparameters. The training process was terminated automatically if the validation loss did not decrease in 1,000 iterations. We trained SNB, BITES, Cox Mixtures with Heterogeneous Effects (CMHE) (Nagpal et al. 2022), DeepSurv (Katzman et al. 2016), Cox proportional hazards model (CPH), and random survival forest (RSF). CMHE, DeepSurv, CPH, and RSF were trained and used in the form of T-learner.

In estimating the individual treatment effect (ITE), only one fact can be observed per patient, and the outcome of the alternative scenario is unobservable. Thus, these outcomes need to be predicted by models. The individual survival distribution is obtained with the predicted log hazard ratios and treatment-specific baseline hazards, which describes the change in survival probability over time. We defined the outcome as the time it took for an individual patient to reach 90% mortality under the predicted individual survival distribution, called the time at risk (TaR). The TaR represents the time interval between PCa diagnosis and the time when his mortality rate reaches 90%. The ITE is therefore calculated as \(ITE={TaR}^{T=1}-{TaR}^{T=0}\), where \(T=1\) represents the situation in which the patient receives RP and \(T=0\) represents the situation in which he does not receive the procedure. In such cases, an individual patient was recommended for RP or non-surgery based on whether the ITE was greater than zero.

The ITE calculation methods of all models were identical. To explore the recommendation effect of the models, we divided the patients into the recommended (Consis.) and anti-recommended (Inconsis.) groups, based on whether the actual treatment they received was consistent with the model recommendations.

Statistical analyses

Statistical analyses were performed using R 4.1.3 and Python 3.8. Continuous variables are reported as medians and interquartile ranges (IQRs), and categorical variables are presented as numbers and percentages (%). Inverse probability treatment weighting (IPTW) was used to avoid treatment selection bias. The log-rank test was used to compare Kaplan–Meier (KM) curves.

Results

Study population

A total of 35,236 PCa patients with complete follow-up records who met the inclusion criteria were included in this study. The overall mortality rate was 6.3% (95% CI: 6.1%–6.6%) over a median (IQR) follow-up time of 76 (52–106) months. The median (IQR) age was 63 (57–68) years, and the median (IQR) tumor size was 17 (11–24) mm. A total of 6,265 (17.8%) patients were in the Non-Surgery group, while 28,971 (82.2%) underwent radical RP. The baseline clinical characteristics of all patients are presented in Table 1.

Table 1 Baseline demographic and pathological features

Model performance

CPH achieved the best discrimination (integrated Brier score in Non-Surgery group (IBSa): 0.09, 95% CI, 0.08–0.10; integrated Brier score in RP group (IBSb): 0.04, 95% CI, 0.03–0.04), followed by SNB (IBSa: 0.10, 95% CI, 0.09–0.12; IBSb: 0.04, 95% CI, 0.03–0.04).

We calculated the multivariate hazard ratio (HR), the difference in restricted mean survival time (dRMST) (month), and risk difference (RD) (%) to evaluate the protective effect of each model. HR describes the multiplicity of changes in mortality resulting from following model recommendations. dRMST describes the average additional survival time of patients in the Consis. compared to the Inconsis. group. RD describes absolute mortality reductions resulting from following model recommendations. To avoid imbalances of prognostic factors between the Consis. and Inconsis. groups, we used IPTW to correct for the above metrics, in which covariates, including age, tumor size, histological grades, TNM stages, metastatic sites, lymph node involvements, PSA level, and Gleason scores, were corrected. In such cases, these metrics were expected to reflect the debiased treatment recommendation performance. All metrics were calculated based on overall survival (OS) with a time horizon of 10 years. The detailed model performance is presented in Table 2.

Table 2 Detailed model performance and treatment recommendation effect

Among all models, only SNB achieved the best IPTW-adjusted HR (HRc) and IPTW-adjusted dRMST (dRMSTc) (HR: 0.76, 95% CI, 0.64–0.92; HRc: 0.77, 95% CI, 0.61–0.95; RD: 3.80, 95% CI, 2.48–5.11; IPTW-adjusted RD (RDc): 2.17, 95% CI, 0.92–3.35; dRMST: 3.81, 95% CI, 2.66–4.85; dRMSTc: 3.23, 95% CI, 2.06–4.45). CMHE had the best RD and dRMST (HR: 0.74, 95% CI, 0.60–0.88; HRc: 1.67, 95% CI, 0.91–3.10; RD: 8.62, 95% CI, 6.62–10.62 RDc: 6.48, 95% CI, 4.25–8.71; dRMST: 9.08, 95% CI, 7.22–10.94; dRMSTc: -12.38, 95% CI, -14.48–-10.68); BITES had the best HR (HR: 0.71, 95% CI, 0.58–0.84; HRc: 0.56, 95% CI, 0.23–1.34; RD: 8.15, 95% CI, 6.26–10.05; RDc: 6.33, 95% CI, 4.16–8.52; dRMST: 8.49, 95% CI, 6.74–10.24; dRMSTc: 2.98, 95% CI, 1.60–4.28); and RSF had the best RDc (HR: 0.70, 95% CI, 0.58–0.84; HRc: 0.57, 95% CI, 0.24–1.33; RD: 8.17, 95% CI, 6.27–10.05; RDc: 6.54, 95% CI, 4.43–8.65; dRMST: 8.47, 95% CI, 6.73–10.24; dRMSTc: 2.96, 95% CI, 1.34–4.27). However, no model, except for SNB, can achieve a statistically significant HRc.

We compared the 2023 National Comprehensive Cancer Network (NCCN) guideline(Schaeffer et al. 2022). Patients whose actual treatment was consistent with the NCCN recommendation were compared with those who were inconsistent. However, the protective effect of NCCN recommendation (HR: 0.87, 95% CI, 0.72–1.07; HRc: 1.06, 95% CI, 0.79–1.40; RD: 5.19, 95% CI, 3.49–4.83; RDc: 5.38, 95% CI, 2.98–7.78; dRMST: 6.88, 95% CI, 5.17–8.58; dRMSTc: 2.02, 95% CI, 0.67–3.64; P of log-rank test < 0.001; P of IPTW-adjusted log-rank test = 0.240) was inferior to our best model, SNB, particularly on multivariate and IPTW-adjusted metrics.

We present the KM curves of Consis. versus Inconsis. regarding OS and prostate cancer-specific survival (PCSS) in Fig. 2A and B, respectively. Better OS outcomes (P of log-rank test < 0.001; P of IPTW-adjusted log-rank test < 0.001) and PCSS outcomes (P of log-rank test < 0.001; P of IPTW-adjusted log-rank test = 0.044) were observed. Figure 2C and D shows the KM curves of RP versus Non-Surgery group for OS and PCSS. The OS (P of log-rank test < 0.001) and PCSS (P of log-rank test < 0.001) advantages of RP were observed; however, this advantage no longer existed after IPTW correction (P of IPTW-adjusted log-rank test of OS = 0.716; P of IPTW-adjusted log-rank test of PCSS = 0.754).

Fig. 2
figure 2

KM curves comparison. A KM curves of Consis. versus Inconsis. regarding overall survival. B KM curves of Consis. versus Inconsis. regarding prostate cancer-specific survival. C KM curves of radical prostatectomy versus Non-Surgery group regarding overall survival. D KM curves of radical prostatectomy versus Non-Surgery group regarding prostate cancer-specific survival. IPTW inverse probability treatment weighting; OS overall survival; PCSS, prostate cancer-specific survival

In addition, we presented the causal path of the protective effect of SNB in Fig. 3. RP was treated as a mediator variable, while all covariates were treated as potential confounders. Thus, the natural direct effect (NDE) and natural indirect effect (NIE) of SNB recommendation was calculated. These values were presented as the slope of a linear regression. After excluding the effect of RP, the protective effect of SNB remained statistically significant (NDE: – 0.04, 95% CI, – 0.04–-0.04).

Fig. 3
figure 3

Causal path of SNB recommendations. SNB Self-Normalizing Balanced individual treatment effect for survival data; RP, radical prostatectomy; X indicates the covariates of patients. Y indicates patients' mortality

The deep learning-based treatment guidelines

To explain the recommendation behavior of SNB, we derived a mixed effect linear regression that predicts ITE from the covariates. Household income and reporting region were set as random effects. Thus, the beta values obtained indicate the presence of this covariate or an increase of one that causes the difference in the time it took for the patient to reach 90% mortality of RP over no surgery to increase beta. This result is presented in Fig. 4A.

Fig. 4
figure 4

Model interpretation. A Interpretation of SNB recommendation behavior. B The non-linear relationship between tumor size and individual treatment effect. C SNB interpretation based on SurvSHAP(t). The individual treatment effect measures the time it took for a patient to reach 90% mortality receiving radical prostatectomy

RP was more effective in patients with higher PSA levels (1.77, 95% CI, 1.73–1.81), non-metastatic disease (30.64, 95% CI, 20.69–40.57), larger tumor size (0.27, 95% CI, 0.26–0.28), Gleason score 6 (3.52, 95% CI, 1.37–5.66), Gleason score 7 (2.57, 95% CI, 0.63–4.49), and Gleason score 8 (3.10, 95% CI, 1.01–5.19). Older ( – 0.14, 95% CI,  – 0.20–  – 0.08), white-raced ( – 1.58, 95% CI,  – 2.63– – 0.52), TNM stage I (-10.80, 95% CI,  – 17.08  – 4.54), and stage III ( – 7.60, 95% CI, – 13.69– – 1.54) patients were not optimal candidates for RP.

In addition, we used the restricted cubic spline model to assess the non-linear relationship between tumor size and ITE, which is presented in Fig. 4B. The optimal knot was tested between 3 and 5 using R2. Patients with tumors larger than 16 mm were found to benefit from RP (ITE > 0).

Model interpretation based on SurvSHAP(t)

We used SurvSHAP(t) to interpret the functional output of SNB, which is the first method introduced to date that can provide a time-dependent interpretation with a solid theoretical basis (Krzyzi’nski et al. 2022). Figure 4C visualizes the aggregation of the eight most important variables, sorted by aggregated Shapley values, rankings over 500 observations. The horizontal bars represent the number of observations where the importance of the variable is ranked first, second, and so on, indicated by the given color. It should be noted that RP in SNB is treated through different risk networks and using different baseline hazards rather than a routine variable. Histological grade was deemed the most important prognostic factor in 247 samples. In addition, Gleason scores and metastasis sites were both important.

To evaluate the importance of features, Table S1 visualizes the changes of overall IBS, IBSa, and IBSb of SNB after excluding the eight most important variables in the testing set, whose conclusions are essentially the same as the findings of SurvSHAP(t).

Discussion

PCa is the most common non-skin cancer in men and ranks second in cancer-related death in the United States, causing 76,234 deaths in all ages in 2018 (Siegel et al. 2021). Although advances in treatment and earlier stage diagnosis continue to emerge (Luh et al. 2018), these treatments are not yet widely used in clinical practice, which highlights the need of constructing an individualized RP recommendation system to extend life expectancy. Therefore, we introduced and carefully evaluated SNB in this study, which outperformed recently proposed or widely used models, real-world physician choices, and NCCN guideline. After adjusting for confounders, SNB led to 6% reduction in patient mortality. Treatment selection often needs to consider complex feature interactions rather than being based on fixed guidelines, and our study demonstrated that DL models are well suited to accomplish this, as clearly evidenced by the stronger protective effect of SNB than NCCN guideline. We hypothesized that there might be other factors that influence treatment choice, not found by existing studies. DL models can identify this complex, potential interaction, embodying the rationality and reliability of SNB.

We believed that the superiority of SNB is attributed to the better predictive power and stability of SNNs over FNNs and single-layer linear regression. On the drug discovery benchmark, SNNs have outperformed other FNNs with and without normalization techniques, such as batch, layer, and weight normalization, or specialized architectures, such as Highway (Zilly et al. 2016) or Residual networks (Klambauer et al. 2017; Xie et al. 2017). It has been proven that SNNs do not face vanishing and exploding gradient problems (Klambauer et al. 2017), which may explain the better performance of SNB compared to BITES.

The nature of artificial intelligence-guided intervention studies gives us the opportunity to obtain DL-based treatment guidelines by interpreting the treatment recommendation behavior of the model. We considered and excluded the influence of confounders on treatment recommendations by holding other parameters unchanged. Consistent with previous studies, we found that baseline features like age (Mottet et al. 2021) worked together with tumor characteristics including TNM stages (Miao et al. 2023), PSA level (Drobner et al. 2023), and Gleason scores (Lam et al. 2019a) significantly affect RP selection, owing to the fact that they are essential factors in life expectancy (Daskivich 2015). However, our models quantified those elements in detail. We found that patients with 1 ng/mL increase in PSA levels receiving RP caused 1.77 months increase in the time to 90% mortality, and the similar results could be found in age, Gleason Score, tumor size, TNM stages, and metastasis status.

Another crucial finding of our research is that 16 mm is the recommended value of tumor size for RP. Exact tumor size indicator for RP was not unified by available evidence (e.g., some chose 5 mm as the critical value for selection of RP (Zhou et al. 2021), while others recommend 10 mm (Sanguedolce et al. 2018; Lam et al. 2019b)), which was considered as the demonstration of disease stratification and prognosis. Our finding was generally inconsistent with Zhou et al.’s conclusion (Zhou et al. 2021). We hypothesized that this result may attribute to the improvement of modern imaging and treatment efficacy that gave opportunities to early intervene small-size tumors. By applying multiparametric magnetic resonance imaging, Park et al. pointed that tumor size ≥ 15 mm was significantly associated with adverse pathology (Baboudjian et al. 2023), which was similar with us and deserve to be further investigated. Based on this situation, DL might provide a new potential for the suggestion of exact tumor size.

Our model (SNB) may serve as a useful analytical tool for treatment recommendation in patients with PCa, given its evidence of the significant prognostic benefits of following the treatment recommendation, which clearly outweigh those associated with not following the recommendation. It is a surgeon’s duty to introduce clinical information to patients. To facilitate discussion of different potential surgical options, surgeons and patients need an informative tool that focuses on survival benefits. In real cases, the establishment of a treatment recommendation system based on a DL model will be key to effectively conveying results and illustrating complex analyses, including prognostic prediction, treatment recommendation to patients and family members, and improving the surgeons’ understanding of the treatment benefits (Wang et al. 2019; Simon et al. 2019; Zhu et al. 2023).

This study has several inevitable limitations. The SEER database did not include information about comorbidities and details of gene panels, which are important for RP selection. Second, although the OS outcome is critical for therapy decisions, the individual preferences of PCa patients and surgeons may reduce the applicability of the model. Third, since OS was the focal outcome, we did not analyze other outcomes, such as quality of life and progression-free survival. Finally, subsequent studies are encouraged to continue to validate the SNB in real-world cohorts to ensure its reliability in clinical practice. Therefore, it remains more various data to maximize the efficacy of models.

Conclusion

In conclusion, SNB successfully predicted which patients with PCa would benefit from receiving RP. The DL-based treatment guidelines were generally consistent with clinical knowledge and may provide priori evidence for subsequent studies. Subsequent studies are needed to further analyze more comprehensive clinical data. DL models have the potential to obtain information with complex heterogeneity of real-world practice and to recommend treatment precisely for individual PCa patients.