Background

With the high prevalence of neural networks and Deep Learning-based algorithms in the Computational Biology, it is clear that the advantages of optimization in a highly non-linear space are welcomed improvements in biomedicine [1,2,3,4,5,6,7]. In Bioinformatics, significant effort has been committed to harnessing transcriptomic data for multiple analyses [7,8,9,10,11,12,13] especially cancer survival prognosis [14, 15]. Faraggi and Simon [16] was the first study to use clinical information to predict prostate cancer survival through an artificial neural network model. Mobadersany et al. [17] integrated histological features, Convolutional Neural Networks (CNN), and genomics data to predict cancer prognosis via Cox regression. Despite of various existed applications on survival analysis such as [14, 15], the use of Deep-Learning Cox models was pioneered by Ching et al. [18], who applied Cox regression with neural networks (Cox-nnet) to predict survival using transcriptomic data became prevalent. Similarly, Katzman et al. [19] used DeepSurv with multi-layer neural networks for survival prognosis and developed a personalized treatment recommendation system.

As a new and effective dimensionality reduction technique, the Autoencoder (AE) framework can lead to efficient lower dimensional representations using unsupervised or supervised learning [20,21,22,23,24]. In addition, Chaudhary et al. [25] also applied AE for dimensionality reduction and then used the low-dimensional representation of data to perform prognosis prediction using traditional method. In this paper, besides two recently developed Deep Learning based methods, namely Cox-nnet and DeepSurv, we also attempted an Autoencoder-based approach (called AECOX) for cancer prognosis prediction with simultaneous learning of lower dimensional representation of inputs. This approach is similar to Cox-nnet [18] and DeepSurv [19], as it implements neural networks with Cox regression, though the network architectures differ. In AECOX (Fig. 1c), the code from AE will link to a Cox regression layer for the prognosis. Both losses from the AE networks and Cox regression layer will be counted to train the entire network weights through back-propagation. AECOX is with symmetric structure of the Autoencoder, and can accept any number of hidden layers. We refer readers to the Additional file 1 for more detailed settings of AECOX.

Fig. 1
figure 1

Neural network architectures of three Deep Learning-based models. a Cox-nnet with a single hidden layer; b DeepSurv with multiple hidden layers having consistent dimensions; c AECOX with multiple hidden layers in the both encoder and decoder part. Last hidden layers in all models were indicated in orange and were connect to a Cox regression neural networks with hazard ratios as the outputs

To evaluate the prediction performance, we adopt two metrics, namely the concordance index and p-value of log-rank test. These metrics are used in comparing two state-of-the-art Deep Learning-based prognosis models (i.e., Cox-nnet, DeepSurv) with AECOX, in a pan-cancer study covering 12 TCGA (The Cancer Genome Atlas) cancers. In addition, we use Partitioning Around Medoids (PAM) clustering algorithm [26] on the last hidden layer for each model to evaluates how well the models discriminate subgroups in the lower dimensional space. P-value of log-rank test based on K groups of Kaplan-Meier survival curve is the metric used for evaluation [27].

As we compared the prognosis prediction performance across 12 cancer types, we wonder whether the performance is related to tumor mutation burden and overall survival time. Tumor mutation burden (TMB) is a measurement of mutations in tumor [28, 29] and is an important genomic marker that is closely associate with immunotherapy and survival prognosis [30,31,32,33,34]. While incorporating TMB feature into input does not increase the prediction performances, we found that TMB is negatively correlated with overall survival time statistics, and both of them are correlated with the concordance index for all three models across cancer types, suggesting an association between TMB, overall survival time, and disease prognosis accuracy.

Overall, we observed comparative results across three different Deep Learning-based cancer survival prognosis models in terms of concordance index. We also investigated the lower dimensional representation that conveyed by Deep Learning algorithms. By inspecting the relationship between TMB, overall survival statistics, and concordance index across 12 cancer types, we confirmed an association among them, suggesting a future study direction of patient stratification and integrative analysis.

Method

Integrating Cox proportional hazards model with neural networks

The neural network architectures of all three Deep Learning-based approaches are provided in Fig. 1. Cox-nnet (Fig. 1a) is the most succinct model with only one hidden layer, while DeepSurv (Fig. 1b) uses multiple hidden layers of consistent dimensions and treats the number of hidden layers as a hyper-parameter. Similarly, AECOX also treats the number of hidden layers as a hyper-parameter, but the hidden layers lay symmetrically in the encoder and decoder (Fig. 1c). All three models employed the same Cox proportional hazards model. However, Cox-nnet and DeepSurv accept the output of the last hidden layer to the Cox model while AECOX uses the low-dimensional code as the input. The output hazard ratio was then compared to the ground truth and the details of evaluation metrics are provided earlier. The reason we introduced AECOX is to explore the feasibility of simultaneously generating a low-dimensional representation of the data while developing an effective model for prognosis.

The Cox proportional hazards model, also known as the Cox model, was developed to models the age specific failure rate or hazard function [35] at time t for patient i with covariate vector Xi by.

$$ h\left(t|{X}_i\right)={h}_0(t)\exp \left(\sum \limits_{k=1}^K{\beta}_k{X}_{ik}\right) $$
(1)

The partial likelihood Li for patient i, which is defined to be the probability of occurrence of a death event at time Yi for patient i, is found to be

$$ {L}_i\left(\beta \right)=\frac{h\left({Y}_i|{X}_i\right)}{\sum_{j:{Y}_j\ge {Y}_i}h\left({Y}_i|{X}_j\right)}=\frac{h_0\left({Y}_i\right){\theta}_i}{\sum_{j:{Y}_j\ge {Y}_i}{h}_0\left({Y}_i\right){\theta}_j}=\frac{\theta_i}{\sum_{j:{Y}_j\ge {Y}_i}{\theta}_j} $$
(2)

at time Yi for patient i. Where \( {\theta}_i=\exp \left(\sum \limits_{k=1}^K{\beta}_k{X}_{ik}\right) \). β = (β1, β2, …, βK) are the K parameters to be estimated. The summation in denominator is carried out over all patients j (including patient i) for which a death event did not occur before time Yi. The partial likelihood for all patients is then defined as

$$ L\left(\beta \right)=\prod \limits_{i:{C}_i=1}{L}_i\left(\beta \right) $$
(3)

where Ci = 1 indicates the occurrence of a death event. The log partial likelihood of Cox model is then obtained as

$$ \ell \left(\beta \right)=\sum \limits_{i:{C}_i=1}\left(\sum \limits_{k=1}^K{\beta}_k{X}_{ik}-\log \left({\sum}_{j:{Y}_j\ge {Y}_i}{\theta}_j\right)\right) $$
(4)

Values of the parameters β = (β1, β2, …, βK) are then obtained through maximum likelihood estimation (MLE), that is

$$ \hat{\beta}={\mathrm{argmax}}_{\beta}\left(\ell \left(\beta \right)\right) $$
(5)

Alternatively, since the Cox model utilizes a regression model that can be implemented as neural network with weights β = (β1, β2, …, βK), values of these weights were obtained through back-propagation. This approach was embedded in all the aforementioned models and was denoted by the blue line with caption “Cox-Regression Neural Network” in Fig. 1.

These models offer several advanced features: (1) a highly non-linear function is learned, (2) neural networks and Cox proportional hazards regression are integrated together enabling the entire weights of the models to be learned through back-propagation, (3) the number of hidden layers and hidden layer dimensions were treated as hyper-parameters that can be fine-tuned, and (4) dimensionality reduction in conjunction with supervised learning is achieved.

To demonstrate the advantages of Deep Learning-based prognosis models, we also compared three traditional machine learning based models for prognosis, they are: Cox proportional hazards model with R package “glmnet” [36], Random Survival Forest (RSF) [37], and Support Vector Machine (SVM) [25]. Particularly, in Chaudhary et al. [25], we implemented their SVM model according to the top 100 mRNA-seq features selected from ANOVA (Analysis of variance) [38].

Regularization, loss functions and hyper-parameters

Despite the fact that the aforementioned Deep Learning-based approaches shared the same Cox regression network and used the hazard ratio as the output (Table 1), yet certain differences existed among the models. Currently all three models used the L2 norm regularization in the final learning after hyper-parameters tuning as it gave the optimal validation accuracy. While all models attempted Dropout and the L2 norm regularization (Ridge Regularization [39]) to penalize the network weights, AECOX also included L1 norm regularization (Least Absolute Shrinkage and Selection Operator, LASSO in short [40]) and elastic net [41].

Table 1 Comparison of model architectures and settings across three Deep Learning-based cancer survival prognosis approaches

The structure of loss functions among models shared a common base formula (Table 1), but each approach used additional penalization. Specifically, both Cox-nnet and DeepSurv used the same objective (loss) function:

$$ \hat{\Theta}={\mathrm{argmin}}_{\Theta}\left\{{\sum}_{i:{C}_i=1}\left(\sum \limits_{k=1}^K{\beta}_k{X}_{ik}-\log \left({\sum}_{j:{Y}_j\ge {Y}_i}{\theta}_j\right)\right)+\lambda {\left\Vert \Theta \right\Vert}_2^2\right\} $$
(6)

whereas AECOX took into account the Autoencoder’s input-output difference:

$$ \hat{\Theta}={\mathrm{argmin}}_{\Theta}\left\{{\lambda}_1\mathrm{MSE}\left({X}_{input},{X}_{output}\right)\right.+\left(1-{\lambda}_1\right){\sum}_{i:{C}_i=1}\left(\sum \limits_{k=1}^K{\beta}_k{X}_{ik}-\log \left({\sum}_{j:{Y}_j\ge {Y}_i}{\theta}_j\right)\right)+\left.{\lambda}_2{\left\Vert \Theta \right\Vert}_1+{\lambda}_3{\left\Vert \Theta \right\Vert}_2^2\right\} $$
(7)

Here Θ denotes to the neural networks’ weights to be learned, including hidden layer weights and Cox regression neural network weights, Xinput and Xoutput are the input and output covariate vectors of Autoencoder, respectively. MSE(∙) is the mean squared error function. The hyper-parameter λ1 balances the loss between Autoencoder’s input-output difference which is a measure of dimensionality reduction and the Cox hazard, which is a measure of regression based supervised learning. The combination of λ2 and λ3 permits the utilization of Elastic Net regularization. Forcing λ2 = 0 results in L2 regularization, whereas forcing λ3 = 0 results in L1 regularization. To optimize the objective functions given above, Cox-nnet, DeepSurv, and AECOX use Nesterov accelerated gradient descent [42], stochastic gradient descent (SGD) [43], and adaptive moment estimation (Adam) optimizer [44], respectively. AECOX adopted Adam optimizer as it is more computationally efficient and require little tuning on hyper-parameters.

As shown in Table 1, Cox-nnet has one hyper-parameter to be fine-tuned, and thus a linear search technique was adopted, whereas DeepSurv and AECOX had multiple hyper-parameters in a high dimensional space. It is thus unrealistic to perform a linear search in each dimension of the hyper-parameter space as the computational complexity would be O(np) for p hyper-parameters. Instead, DeepSurv and AECOX utilize the Sobol solver [45] in the Optunity python package [46]. Given a search time q (e.g., q = 100), the Sobol solver samples q points assuming the hyper-parameters are uniformly distributed in p-dimensional space. This reduces the computational complexity to O(nq), regardless of how large the value of p is.

Data preprocessing and statistics

Genes with lowest 20% absolute expression values and lowest 10% variance across samples were removed. This denoising step was performed via the TSUNAMI package (https://apps.medgen.iupui.edu/rsc/tsunami/) [15], ensuring model robustness and reducing irrelevant noise.

The expression data were then rescaled with natural logarithm operation:

$$ {X}_{input}=\log \left({X}_{original}+1\right) $$
(8)

where Xoriginal was the original non-negative RNA sequencing expression values (Illumina Hi-Seq RNA-seq v2 RSEM normalized), and Xinput was the input covariate vector for the models. Subsequently each gene expression at row r in the input data was normalized as

$$ {X}_{input}^{(r)}=\frac{X_{input}^{(r)}-\min \left({X}_{input}^{(r)}\right)}{\max \left({X}_{input}^{(r)}\right)-\min \left({X}_{input}^{(r)}\right)\ } $$
(9)

This step ensured that each row of the gene expression contributed to the model on an equal scale.

Table 2 provides a summary of the median and range in terms of age and survival months for the TCGA data. Each dataset was split into training, validation, and testing sets in a proportion of 60, 20, and 20% respectively. Confounding effects [47] were minimized by randomly shuffling the data 1000 times and choosing the 5 pairs of training/validation/testing sets with lowest corresponding differences. The differences that were minimized is the summation of (1) standard deviation of male/female ratio on training/validation/testing sets, (2) standard deviation of overall survival time’s standard deviation on training/validation/testing sets, (3) standard deviation of overall survival time’s mean on training/validation/testing sets, (4) standard deviation of the ratio of deceased group to whole population on training/validation/testing sets, and (5) standard deviation of the ratio of tumor stages to whole population on training/validation/testing sets. Thus, survival prognosis was estimated for each cancer type 5 times.

Table 2 The Cancer Genome Atlas (TCGA) 12 cancers’ statistics. Cancers were sorted based on averaged concordance index in descending order according to Fig. 2

In this study, TCGA mutation annotation files (MAFs), containing subsets of the patients for prognosis tasks, were used to calculate TMB summary statistics, including mean, median, max, and 20, 10, 5% tail cut values. These characteristics were used for examining correlation between TMB and concordance index.

Evaluation metrics

We evaluated model performance with concordance index and the p-value of log-rank test. Concordance index had been widely used for evaluating survival prognosis models [48,49,50]. Its value ranges from 0 to 1 and it describes how well models differentiated groups (censored and uncensored groups, or living and deceased groups) [50,51,52,53]. A concordance index of 0.5 indicates that a model was ineffective and is viewed to have generated a random prediction with respect to ground truth. Values above 0.5 indicate improved prediction by a model, with increased performance being conveyed by a concordance index approaching 1. Values below 0.5 indicate that a model predicted values that are the opposite of the ground truth. Higher concordance index values indicate better capability of model to perform cancer survival prognosis.

P-values were derived by dichotomizing the hazard ratios through median value and performing log-rank tests [54,55,56] between the resulted high-risk and low-risk groups. Model performance was then assessed wherein a lower p-value represents an enhanced ability to distinguish two patient groups.

To evaluate the performances across cancer types and across model types, two-way ANOVA [38] is adopted. Pairwise paired t-test [57, 58] and the linear mixed-effects models test from the R package “nlme” [59, 60] are also used. The linear mixed-effects models test is to test between pairs of models while accounting for random effects. The mixed effect model assumed the data (performances) to be dependent within each cancer type and independent across cancer types.

Results

The performance comparison was conducted at pan-cancer level using 12 cancer from The Cancer Genome Atlas (TCGA). These 12 cancers were chosen due to their relatively large sample sizes and sufficient information about patient outcomes. The specific cancers analyzed in this paper were (1) Urothelial Bladder Carcinoma (BLCA); (2) Breast Invasive Carcinoma (BRCA); (3) Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC); (4) Head-Neck Squamous Cell Carcinoma (HNSC); (5) Kidney Renal Clear Cell Carcinoma (KIRC); (6) Kidney Renal Papillary Cell Carcinoma (KIRP); (7) Liver Hepatocellular Carcinoma (LIHC); (8) Lung Adenocarcinoma (LUAD); (9) Lung Squamous Cell Carcinoma (LUSC); (10) Ovarian Cancer (OV); (11) Pancreatic Adenocarcinoma (PAAD); and (12) Stomach Adenocarcinoma (STAD). In this paper, we used the expression data of Illumina Hi-Seq RNA-seq v2 RSEM normalized genes from TCGA.

Performance comparison

Figure 2a and b present concordance indices and p-values of log-rank tests among the different models and different cancer datasets, wherein the cancers in x-axis were sorted based on the averaged concordance index values among all models and experiments. It is observed that models for cancers like KIRP, BRCA, and LIHC yield median concordance indices of at least 0.7, whereas some cancers like STAD and LUSC yield median concordance indices of approximately 0.5. This led to our further investigation with tumor mutation burden (TMB) and overall survival time as described earlier. We also made a comparison between three traditional machine learning models (Fig. 2c, d). Specifically, we presented the results in Fig. 2 as two parts in order to directly visualize the comparison between Deep Learning-based models and traditional machine learning based models in Fig. 2c and d.

Fig. 2
figure 2

a, b: Performance comparisons between three Deep Learning-based models across 12 TCGA (The Cancer Genome Atlas) cancers. a concordance index; b p-value of log-rank test (in −log10 scale). c, d: Performance comparisons between three Deep Learning-based models and three traditional machine learning models across 12 TCGA (The Cancer Genome Atlas) cancers. c concordance index; d p-value of log-rank test (in −log10 scale). Cancers were sorted based on averaged concordance index across models and experiments. For detailed cancer names, please refer to the Additional file 1

Since five experiments were carried out for each cancer type and each model type, we compared the performances (via concordance index and p-value of log-rank test) for all 12 TCGA cancer types using pairwise paired t-test among all models (Table 3a) and the linear mixed-effects models test (Table 3b). In this case we considered a model to be better than another if a higher concordance index or a lower p-value of log-rank test was observed. Thus, a positive t-statistic in Table 3a or a positive coefficient in Table 3b was used to conclude that the model (distribution 1) was better than the other (distribution 2) with respect to the concordance index. In the case of the p-value of log-rank test, a negative t-statistic or coefficient was used to reach the same inference.

Table 3 Model-wised performances comparison at pan-cancer level (12 TCGA (The Cancer Genome Atlas) cancer types) by pairwise paired t-test (A) and linear mixed-effects models test (B), according to metrics concordance index and p-value of log-rank test. Note that for concordance index, larger t-statistic/coefficient indicated better performance at pan-cancer level, while the p-value of log-rank test was on the contrary

As can be observed from Table 3b, all models have a similar performance since most of the test results of their linear mixed-effects models are insignificant. Both Table 3a and Table 3b concluded that among Deep Learning-based approaches, Cox-nnet provided the overall optimal survival prognosis results at pan-cancer level, with respect to the concordance index and the p-value of log-rank test. This advantage of Cox-nnet is due to a simpler neural network architecture and reduced search space for hyper-parameters. Additional file 1: Table S10-S11 presented the same quantitative comparison of performances for Deep Learning-based and traditional machine learning models. All three Deep Learning models demonstrated superior performance than traditional machine learning models, suggesting the advantages of Deep Learning approaches on prognosis prediction.

Lower dimensional representation

The final hidden layer (or the code in AECOX), highlighted in orange in Fig. 1, produces a lower dimensional representation of the input and is one of the intrinsic properties in Deep Learning-based algorithms [18, 21, 61, 62]. By using the Partitioning Around Medoids (PAM) clustering algorithm [26] on the output of the last hidden layer after the network is trained, we can then inspect the original covariate vector in a lower dimensional space. The most suitable number of clusters (ranging from 2 to 10) was determined by maximizing the averaged silhouette score [63, 64]. As depicted in Table 4, Cox-nnet appeared to have overall better p-values of the log-rank test measured between different clusters, indicating a better capacity of dimensionality reduction for 9 cancers (KIRP, KIRC, LIHC, BRCA, CESC, LUAD, HNSC, OV, LUSC).

Table 4 P-value of the log-rank test of lower dimensional representation, generated by Partitioning Around Medoids (PAM) clustering algorithm on the last hidden layers of three Deep Learning-based approaches (testing set only). 12 TCGA (The Cancer Genome Atlas) cancers are being compared. Bolded values indicate the smallest p-value among three Deep Learning approaches, refer to better low dimensional representation

Relationship between prognosis prediction performances and tumor mutation burden

From the performances within a cancer type across models in Fig. 2 and results in Table 3, it appeared that all models achieve respectable performances measured by concordance index. We also found that performance (concordance index) was more significantly associated with cancer types than algorithms (two-way ANOVA: Cancer type p-value <2E-16, Model type p-value = 9.57E-02). This observation suggests that intrinsic characteristics of different cancer types have a large influence on the performance of prognosis models. One such characteristics is the tumor mutation burden (TMB), which is known to vary largely between different types of cancers.

TMB was increasingly used as a marker in predicting efficacy of immunotherapy [33] and was also shown to be a predictor of prognosis [34]. Since the ability to train a cancer survival prognosis model across cancer types varies significantly, we explored whether TMB can be associated with these changes. By inspecting the mutation information associated with different cancer types, we observed that the performance of survival prognosis models was associated with tumor mutation burden (TMB) characteristics. Specifically, we observed that all TMB characteristics were negatively correlated with concordance index especially the mean TMB (Mean TMB: Pearson ρ = − 0.45 (Fig. 3b); Median TMB: Pearson ρ = − 0.30; Maximum TMB: Pearson ρ = − 0.40; 20% tail TMB: Pearson ρ = − 0.32; 10% tail TMB: Pearson ρ = − 0.32; 5% tail TMB: Pearson ρ = − 0.30).

Fig. 3
figure 3

Cox-nnet performances with TMB feature input and without TMB feature input across 12 TCGA (The Cancer Genome Atlas) cancers. a concordance index; b p-value of log-rank test (in −log10 scale). Red diamonds and texts on the boxplot indicate the mean values. Cancers were ordered based on Fig. 2. Note that the performances are differ from Fig. 2 due to new patient cohorts (intersection of patients who has both RNA-seq data and TMB data). For detailed cancer names, please refer to the Additional file 1

One interesting question is then if the incorporating TMB in the model would enhance the model performance. To investigate this, we take the joint subset of patients who have both RNA-seq data and TMB data, performed survival prognosis with Cox-nnet model (the method which has the best performance) with and without TMB feature, respectively. As shown in Fig. 4, although there is a slight improvement on concordance index (average value = 0.003419) after TMB feature is incorporated, the correlation between improved concordance index (mean) and mean TMB values is 0.0688 across 12 TCGA cancers, suggesting that introducing TMB feature into a mRNA-seq based learning model does not substantially improve the performance for Cox-nnet.

Fig. 4
figure 4

a Box plot of log2 transformed tumor mutation burden (TMB) values from all available TCGA (The Cancer Genome Atlas) patients with respect to each cancer type, ordered according to Fig. 2. Texts and diamond symbols in red color indicated the mean values. b Mean TMB versus averaged concordance index results across 12 cancer types with three survival prognosis models. Pearson ρ =  − 0.45 (p-value = 5.79E-03). Individual model correlations are range from −0.46 to −0.44, described in Additional file 1: Table S4. Other results of TMB statistics versus concordance index were shown in Additional file 1: Figure S2 – Figure S6.

Next, we found the correlation between the mean of overall survival times and the mean of TMB values is − 0.6853 (Pearson) and − 0.7133 (Spearman) across 12 cancers, and the correlation between the variance of overall survival times and the variance of TMB values is − 0.6159 (Pearson) and − 0.2448 (Spearman), suggesting a strong correlation between higher TMB and shorter overall survival times statistics. Where the correlation between the mean of overall survival times and the mean of concordance index is 0.4271 (Pearson) and 0.4126 (Spearman).

Discussion

Overall our study demonstrated that the Deep Learning architecture can be effectively applied for cancer prognosis prediction with Cox-proportional hazard model incorporated. We found that Deep Learning-based model demonstrated superior performances comparing to traditional machine learning models. Among the three Deep Learning-based models tested, we observed that Cox-nnet, which has the most succinct neural network structure, resulted in better prognosis performances in the measurement of concordance index and p-value of log-rank test. We showed that integrating autoencoder with Cox regression network does not significantly improve the prognosis performances. These results highlight an important issue in Deep Learning approaches—namely simpler models often perform similar or better to more complex models in biological data.

From the associated fine-tuned hyper-parameters (Additional file 1: Table S5-S9) during the hyper-parameters tuning (with optimal validation accuracy), we found that Deep Learning-based algorithms and traditional machine learning based algorithms, especially with multiple hyper-parameters, tends to converge into different local minima with different hyper-parameter values. For example, the optimal parameter pairs of AECOX are not consistent in five different folds even when these experiments are from same cancer (e.g., TCGA BRCA cancer). This result can potentially be due to the curse of dimensionality [65]: with limited number of different training samples and large number of parameters (e.g., the hidden layer weights in Deep Learning-based models), the optimization may not guarantee to converge to same local minima. These observations lead us to rethink the robustness of training procedure – especially when higher performances are observed on Cox-nnet where it has the least hyper-parameter tuning effort.

We also noticed a negative correlation between TMB values and prognosis prediction performances. The relationship between TMB and prognosis have been examined in existing literatures in cancer biology for individual cancer types. For example, Owada-Ozaki et al. [66] examined the relationship between individual TMB and prognosis and concluded that high TMB is a poor prognostic factor in non-small cell lung cancer (NSCLC). A similar pattern occurs between TMB and prognosis specifically for lung adenocarcinomas (a subtype of NSCLC) (Naidoo et al. [67]). Our pan-cancer analyses are agreeing with these findings yet have a different conclusion. We observed that TMB is correlated with prognosis performances (concordance index), however, integrating TMB to the Cox-nnet model does not improve the performances at the pan-cancer level. By further examining the relationships behind these features and results, we found that TMB is highly correlated with overall survival times (both in mean and variance) across cancer types. Specifically, lower TMB value is associated with longer mean overall survival time, concluded that TMB is a marker for tumor malignancy. These findings lead us to speculate that TMB either affect or are affected by overall survival time, but may not directly contribute to prognosis prediction when gene expression data are used. However, with a strong correlation to TMB, shorter overall survival times leads to worse prognosis performance, suggesting a direct relationship between overall survival statistics and prognosis performances. These findings will guide us to design future experiments to further explain the detailed relationships especially the dependency among TMB, survival times, and prognosis performances at pan-cancer level.

Conclusion

Bringing artificial intelligence into clinical and cancer studies [6, 68,69,70] can unravel numerous interpretabilities behind the data. In this paper, we focused on three different Deep Learning-based cancer prognosis models. The survival predictions are conducted across 12 TCGA cancer types with sufficient number of patients and survival information. We found that Deep Learning based algorithms demonstrate superior performances than traditional machine learning based models. We also found that the cancer prognosis results measured in concordance index are indistinguishable across models while are highly variable across cancers by two-way ANOVA. The highest concordance index that models can predict is renal papillary cell carcinoma (KIRP), while the lowest concordance index is observed for lung squamous cell carcinoma (LUSC). We then examined the relationships between TMB statistics, overall survival statistics, and concordance indices across 12 cancers. We found that although TMB and overall survival times are negatively correlated with concordance indices across the cancer types, integrating TMB does not improve the prognosis prediction performance for individual cancers significantly, whereas TMB has a strong correlation with overall survival times. These findings will guide us to explore the relationships between patient characteristics and survival learnability in a pan-cancer level in the future work.