Background

Cancer is a loosely defined term that designates cells that have acquired pathological properties, mainly loss of cell cycle regulation, high proliferation rate, and loss of contact inhibition leading to invasion of surrounding tissues. In time, tumor cells disrupt the normal function of tissues where they are located and can metastasize to other tissues. Oncogenes contribute to cell transformation while tumor suppressor genes stop aberrant cell proliferation. Changes in the expression, activation, or function of these genes are expected to lead to cancer-like phenotype in various cell or tissue types and many such genes are commonly affected by genomic lesions in cancer. In addition to mutations to hallmark oncogenes and tumor suppressor genes, cancer driver mutations that contribute to disease onset and progression are found in subsets of cancer types [1]. While these genetic alterations are diverse, several genes that are altered in cancer converge on a few molecular mechanisms that are commonly involved in tumorigenesis [2]. These pathways have wide-ranging effects that span the cell cycle, inflammation, and apoptosis, among others. The mechanisms through which they operate in cancer are therefore highly diverse and molecularly heterogeneous, but they are also interrelated. In addition, a recent gene network analysis identified a relatively small number of regulatory modules on which a majority of somatic mutations in cancer converge [3]. Because changes in cellular pathways and biological activity ultimately impact gene expression and post-transcriptional regulation, this leaves the possibility that tumors that arise from the disruption of different pathways share common molecular signatures in the form of transcriptomic variations.

Previous studies have attempted to leverage these projected common signatures of cancer in order train computational models to distinguish tumors from normal samples or distinguish different tumor types. Typically, these studies rely on protein-coding gene expression data combined with deep neural networks or other machine learning algorithms that classify samples into two or more categories [410]. These studies showed that machine learning models can successfully distinguish between normal tissues and tumors given a certain set of conditions, including the pre-selection of biological features before model training. Several automatic feature selection methods exist to lower the number of genes used as input and thus facilitate the training of such machine learning models [1016]. However, pre-selecting genes on the basis of their functions or differential expression in cancer, or removing redundant genes identified by automatic selection prior to model training deprives the models from learning about potentially novel genes contributing to the transcriptomic signature of cancer. In addition, the application of such approaches has not been tested on large heterogeneous sets of tissues.

Recent methods for the interpretation of deep neural networks offer the opportunity to agnostically discover transcriptomic variations characterizing cancer biology from models that successfully predict biological classes [17, 18]. In particular, we recently described enhanced integrated gradients (EIG), a method for deep neural network interpretation [19] that generates attribution values as a measure of the weight or importance of each biological input feature in the model. For example, we used EIG to find splicing events that are differentially included in the brain compared to other tissues without prior knowledge of splicing variations [19].

Here, we aimed to draft a molecular profile of cancer that applies to most solid tumor types by leveraging the predictive power of deep neural networks along with the interpretation capability of enhanced integrated gradients to identify common transcriptomic signatures across a large array of tumor types. We trained feed-forward neural networks with protein-coding gene expression, lncRNA gene expression or splice junction usage data from several normal tissue and tumor types. We then derive attribution values from these models and establish a list of high-attribution features corresponding to a common signature of cancer, which could be causally involved in cancer or result from oncogenic transformation, or both. Finally, we assess the biological functions of these transcriptomic variations.

Results

A feed forward neural network trained with protein-coding gene expression distinguishes between normal and cancer tissues

We aimed to uncover the transcriptomic features that commonly define cancer state. Performing differential gene expression analysis on 11 normal tissue-tumor pairs from GTEx and TCGA and then looking at the overlap in the genes that are deregulated between these analyses show that few protein-coding genes are consistently up- or downregulated (abs(log2FC) > 2, adjusted p-value < 0.01) in six or more tumor types and that none is consistently deregulated in more than nine tumor types (Fig. 1A and Additional file 1: Fig. S1A). Instead, a large fraction of cancer-deregulated genes are specific to a single tumor type (Fig. 1A and Additional file 1: Fig. S1B). In addition, while hallmark oncogenes are expected to be disrupted in many cancer types, we observe in a sampling of 11 oncogenes that these are either not significantly differentially expressed in any of the 11 tumors analyzed (e.g., BCR, CTNNB1, DDX6, FUS, KRAS, MDM2, TPR) or only disrupted in certain tumors (e.g., EGFR, ETV4, JUN, MYC; Additional file 1: Fig. S1C). Such apparent inconsistency between the function of oncogenes and their lack of change in expression in many tumor types can be partially explained by alternative mechanisms of activation that are independent of changes in transcript levels. Nonetheless, these results demonstrate that using a simple differential gene expression analysis fails to capture the complexity and heterogeneity of the transcriptomic variations existing across various cancer types.

Fig. 1
figure 1

A Upset plot summarizing pairwise differential gene expression analyses performed on tumors and their corresponding normal tissue. No gene is significantly deregulated in more than 9 out of 11 cancer types tested. B, C Dataset assembled to train and test binary models to distinguish between normal and tumor samples shown by tissue type (B) or dataset (C). D Graphical representation of the computational framework used to train, test, and interpret the models. E Performance of models trained with protein-coding gene expression, lncRNA gene expression, or splicing variations evaluated by area under the precision-recall curve (AUPRC) and accuracy (sum of true positives and true negatives over the total population). F Accuracy of models trained with protein-coding gene expression, lncRNA gene expression, or splice junction usage across the 13 datasets used to assemble the training set. G Performance of models trained with protein-coding gene expression, lncRNA gene expression, or splice junction usage on an independent dataset consisting of normal and cancer lung samples. H Performance of the deep learning model, SVM, and random forest using protein-coding gene expression on unseen tissue types (blood cancers) with no batch correction. The training set consists of solid tumors only

In order to overcome the limitations of such naive searches for common cancer transcriptome signatures, we sought to train interpretable deep learning models capable of distinguishing between normal and cancer samples. We assembled a large RNA-Seq dataset comprising 13,461 samples from 19 normal tissue types and 18 tumor types and split the data into two classes reflecting cancer state: normal or tumor (Fig. 1B; 5622 total normal samples and 7839 total tumor samples). Samples were sourced from TCGA (https://www.cancer.gov/tcga), GTEx [20] and 12 other datasets (Fig. 1C). Because technical biases and batch effects are a major concern when using large-scale RNA-Seq datasets, especially when comparing perfectly confounded datasets like GTEx and TCGA, we included in our compendium 12 smaller datasets containing either only tumor samples or tumor and matched normal tissue samples from the same donors. These additional datasets allowed us to mitigate dataset-specific biases and focus on cancer-specific signals by performing a tissue/tumor-specific mean correction across the 14 datasets (see Additional file 1: Fig. S2A-B). We also considered alternatives to mean correction, such as the commonly used COMBAT method [21], but this approach severely limited the data and gene sets that could be used for model training (see the “Batch correction” section in the “Methods” section for details).

We first used the mean corrected expression data from 19,657 protein-coding genes to train an autoencoder for dimensionality reduction, followed by a supervised deep neural network to predict cancer state (normal tissue versus tumor, Fig. 1D; see the “Methods” section for details about dimensionality reduction and model training). We divided our dataset into training (8504 samples), validation (2127 samples), and test sets (2658 samples). We tuned model hyperparameters (learning rate, hidden layers, number of nodes, activation functions, and dropout probability) on the validation set and fixed our model architecture using the hyperparameters with the best performance on the validation set (see Additional file 2: Table S1 for the final model architecture of the deep neural network model for the protein-coding genes). To ensure that our model did not learn dataset-specific biases, we evaluated model performance on a previously unseen set of samples (test set with 2658 samples) extracted from the 13 datasets used for training as well as one independent dataset (PRJEB2784) that was not used during training but that comprises 172 samples of tissue types that were included in the training set (normal and tumor lung samples). Our protein-coding gene expression model accurately predicted whether an individual sample corresponds to a normal tissue or a tumor (accuracy 98.62% ± 0.20% and area under the precision-recall curve (AUPRC) 99.88% ± 0.01%, Fig. 1E and Additional file 1: Fig. S3), and this performance generalizes over all 13 datasets despite the dataset imbalance (Fig. 1F, see numbers in the bars for the number of samples in each dataset). The only datasets where the performance is variable have very few samples (PRJNA340880 has 1 sample and PRJNA288518 has 3 samples). Importantly, the model performs almost as well when applied on the independent dataset (Fig. 1G and Additional file 1: Fig. S4).

To evaluate how our model generalizes to cancer types not included in the training set, we assembled a group of normal (macrophages, monocytes, and lymphocytes) and malignant (acute myeloid leukemia and acute lymphoblastic leukemia) blood cells from three datasets (ArrayExpress E-MTAB-2319, Blueprint, and TARGET consortia, see Additional file 2: Table S2 for details on the samples). We omitted batch correction on these samples to assess how our model would fare on a dataset that considerably differs from the training set both at the biological (solid tissues and tumors in training set vs. hematologic cells and tumors in the test set) and technical levels (batch-corrected in training set vs. uncorrected test set). Strikingly, despite these significant differences between training and test sets, our deep neural network model successfully distinguishes normal and cancer samples from blood (Fig. 1H and Additional file 1: Fig. S5), although, as expected, we observe a reduction in accuracy.

Finally, in order to assess how our deep neural network model compares to other machine learning algorithms, we first trained support vector machine and random forest models using the same training set as for our deep neural network model and then tested them on the same independent dataset consisting of batch-corrected normal and cancer lung samples. Under these conditions, all three models perform similarly (Additional file 2: Table S3). However, in sharp contrast with the deep neural network model, both the support vector machine and random forest models completely fail when applied to the hematologic dataset (Fig. 1H). In summary, these results demonstrate that our deep neural network model can more accurately and robustly identify cancer samples compared to commonly used machine-learning methods and motivate the subsequent analysis of the associated features.

lncRNA expression or splice site usage profiles suffice to define cancer state

Other types of transcriptomic features, including lncRNA expression and RNA splicing, have been used as prognostic markers or to predict drug response in cancer [2224]. In addition, a small number of mutations located in lncRNA genes or disrupting splicing in protein-coding genes have been shown to drive cancer [25]. However, it is not known whether widespread changes in lncRNA expression or RNA splicing commonly characterize cancer state. We thus asked if these other types of transcriptomic features could be used to distinguish between normal and tumor samples, similar to what we found for protein-coding gene expression.

We used the same strategy as for the model trained with protein-coding gene expression above and trained models with expression data from 14,257 lncRNA genes or splice site usage data from 40,147 splice junctions. Similar to protein-coding genes, we tuned the hyperparameters for deep neural network models using lncRNA and splicing junctions on the validation set and evaluated model generalization performance on the test set (see Additional file 2: Tables S4 and S5 for the final architectures of the lncRNA and splice junction deep neural networks, respectively). Remarkably, these models achieved 98.57% ± 0.1% and 98.78% ± 0.09% accuracy, respectively, with high AUPRC (99.84% ± 0.06% for lncRNA expression and 99.82% ± 0.06% for splice junction usage, Fig. 1E). As observed with the protein-coding gene expression-trained model, the lncRNA gene expression and the splice junction usage-trained models perform consistently well across all of the test datasets on the task of predicting the cancer state, again despite the dataset imbalance (Fig. 1F), or when tested on an independent dataset (Fig. 1G), These results further support the robustness of our models as capable of identifying true biological signals rather than confounders.

As for the protein-coding genes model, we compared our deep neural network model for lncRNA gene expression with support vector machine and random forest models on the same two datasets. All three models performed well on the batch-corrected dataset consisting of normal and tumor lung samples (Additional file 2: Table S3B). However, while our deep neural network model also correctly predicted cancer state with the hematologic dataset (normal leukocyte and leukemia samples), both support vector machine and random forest models completely failed again on this dataset (Additional file 2: Table S6). On the splicing data, our deep neural network model outperforms both the support vector machine and random forest models on the independent batch-corrected normal and tumor lung dataset (Additional file 2: Table S3C). The strong performance of our lncRNA- and splicing-trained models indicates that tumor samples can be defined not only by their protein-coding gene expression profile, but also using exclusively their lncRNA gene expression or splice junction usage profile.

Interpretation of deep learning networks uncovers new transcriptomic features characterizing cancer state

Given the high performance of our models, we wanted to know what transcriptomic features are the most important in each of our models and whether these features consist mostly of the usual suspects, i.e., genes known to be genetically associated with cancer. To do this, we generated feature importance scores known as attribution values for tumor samples using enhanced integrated gradients (EIG) [19]. Briefly, EIG measures a feature’s contribution, either positive or negative, to the model label predictions (normal tissue versus cancer) when comparing a cancer sample to a baseline. Following our previous work [19], we used the median of normal samples as the baseline (see the “Methods” and “Interpretation of tumor classification models” sections for details).

We selected 1768 protein-coding genes, 1763 lncRNAs and 562 splice junctions that have high median attribution values across tumor types (Fig. 2A and Additional file 2: Tables S7-S9; see the “Methods” and “Selection of feature sets” sections for the selection criteria and Additional file 1: Fig. S6). We also defined “neutral” sets with a sample size equivalent to sets of high-attribution features using features that display attribution values close to zero (Additional file 2: Tables S10-S12). When looking at the cancer type-specific attribution values across 14 tumor types for the top 100 features with positive or negative attribution, we found that protein-coding genes, lncRNAs and splice junctions with the highest median attribution values across all tumor samples have consistently high attribution values in most if not all cancer types (Fig. 2B), highlighting that our models are not driven by outlier expression or splice junctions usage in cancer types with a large sample size, but rather rely on common transcriptomic features of cancer.

Fig. 2
figure 2

A Selection of high-attribution features from models trained with protein-coding gene expression, lncRNA gene expression or splice junction usage. Dotted lines show cutoffs used; purple points around coordinates (0,0) show features selected in neutral sets. B Median attribution values of 100 protein-coding genes, lncRNAs or splice junctions with the highest positive and negative attributions across tumor tissues. C Median attribution values of genes associated with cancer from the COSMIC database. D Overlap between COSMIC oncogenes and TSGs, and genes with high positive (top panel) or negative (bottom panel) attribution values, or between COSMIC tier 1 genes (high confidence for causal role in cancer) and tier 2 genes (some evidence of causal involvement in cancer), and all high-attribution genes (central panel). E Overlap between genes associated with cancer and genes harboring junctions with high attribution values. In both D and E, enrichment or depletion factors were calculated from the ratio of observed vs. expected overlapping genes between sets, and p-values were calculated using the hypergeometric test

In agreement with our differential gene expression analysis that showed that no gene is significantly deregulated in the same manner across all tumor types, we find that the sign of the attribution value for a given gene does not necessarily reflect the change in expression in cancer. In other words, a gene with a high positive attribution value would not necessarily be upregulated in all or most cancers, and conversely, a gene with a high negative attribution value would not necessarily be downregulated in all or most cancers. Thus, rather than highlighting genes and splicing variations that are similarly altered in many cancer types, the interpretation of our models exposes transcriptomic variations that consistently deviate from the norm in cancer. The transcriptomic variations that they identify could reflect changes that cause or are a consequence of cancer, or a mixture of both.

We next sought to assess the relation between model attributions and known oncogenes or tumor suppressor genes (TSGs). Strikingly, we found a clear separation between the latter two groups with oncogenes receiving positive attribution and TSGs receiving negative values (Fig. 2C). However, most of the known oncogenes and TSGs have lower attribution values relative to our top-scoring features, with many having neutral attribution values (close to 0). This result was observed with attribution values from both the expression of COSMIC genes or usage of splice junctions found in those genes. We only observed a small, although statistically significant, enrichment of COSMIC genes among our high negative attribution genes (Fig. 2D). Of note, well-known oncogenes and TSGs are depleted among genes that have splice junctions with high attributions, meaning that there are fewer oncogenes and TSGs with high-attribution splice junctions than would be expected by chance (Fig. 2E). These results show that our models rely on gene expression and splicing variations in genes that mostly differ from established oncogenes and TSGs to predict tumor samples and that the transcriptomic definition of cancer that we provide here largely differs from genes harboring hallmark mutations causally implicated in cancer.

Frequency of genetic alterations in transcriptomic features characterizing cancer state

Next, we wondered if previously unreported genetic alterations in our high-attribution genes might be driving the transcriptomic variations highlighted by our models. We first postulated that high-attribution genes would rarely carry driver mutations since these genes are not known to be genetically linked to cancer, which we confirmed by investigating TCGA samples and finding that less than 2% of high-attribution genes carry a driver mutation in at least one of any of the samples in TCGA (Fig. 3A). While high-attribution genes do not carry driver mutations, our analysis shows that genes with high negative attribution values by expression display a higher frequency of passenger mutations than their reference neutral set and that the frequency of passenger mutations in high negative attribution genes is as high as in COSMIC oncogenes (Fig. 3B). The frequency of structural variants, although higher in high-attribution genes than their reference neutral sets, is lower for all sets of high-attribution genes than for COSMIC genes (Fig. 3C). Similarly, the frequency at which high-attribution genes are impacted by amplification (Fig. 3D) or deletion events (Fig. 3E) is not significantly different from the neutral sets or the COSMIC genes. Overall, we conclude that the cancer transcriptomic features we identified are not frequently affected by genetic alterations, which suggests that the cancer expression and splicing patterns obtained from our models are not driven by genetic variations in these genes.

Fig. 3
figure 3

Fraction of genes with driver mutations (A), frequency of passenger mutations (B), structural variants (C), amplification events (D), or deletion events (E) in the TCGA cohort. The analysis was performed on the following sets of genes: COSMIC oncogenes (yellow), COSMIC tumor-suppressor genes (TSGs, green), genes with high positive (“protein-coding positive,” red), high negative (“protein-coding negative,” blue), or neutral attribution values by expression (“protein-coding neutral,” gray) or genes with splice junctions with high (“splicing high,” light red) or neutral (“splicing neutral,” light gray) attribution values in our models. p-values were calculated using a one-way ANOVA with Tukey post hoc tests

High evolutionary and selective constraints in transcriptomic features defining tumor state

After establishing a list of genes with high attribution values by expression or splice junction usage and discovering that most of these genes do not correspond to COSMIC oncogenes or TSGs, we wondered whether transcriptomic features that carry high attribution values in our models have properties that may indicate important roles in cells. We discovered that protein-coding genes, lncRNA genes and genes with splice junctions corresponding to high-attribution features in our models are highly evolutionarily conserved relative to the neutral sets (Fig. 4A). We noted that protein-coding genes that have high negative attributions as well as lncRNA genes that have high positive or negative attributions are in general significantly longer than the reference neutral sets, but that genes with splice junctions with high attributions are significantly shorter (Fig. 4B). We also observed that protein-coding genes and genes with splice junctions with high attributions display high selective pressure against loss of function mutations, as estimated by the gnomAD LOEUF score [26] (Fig. 4C).

Fig. 4
figure 4

A Evolutionary conservation of protein-coding genes (left panel), genes with variable splice junctions (middle panel) or lncRNA genes (right panel) across human, chimpanzee, mouse, cattle, xenopus, zebrafish, and chicken. B Gene length derived from the longest annotated transcript in Ensembl (hg38) for protein-coding genes (left panel), genes with variable splice junctions (middle panel), or lncRNA genes (right panel). C Selective pressure against loss-of-function mutations in the human population as assessed by gnomAD LOEUF score [26], showing score for protein-coding genes (left panel) or genes with splice junctions (right panel) with high attribution values. A low LOEUF score implies high selective pressure against loss of function. D Pyknon density in lncRNA genes with high attribution values. All p-values shown are calculated using an unpaired t-test

Finally, we inferred the functional impact of lncRNA genes with high attributions by examining the density of a class of DNA motifs termed pyknons. Pyknons are located in loci that were previously reported as often differentially transcribed between normal and colorectal cancer tissues and that can affect the oncogenic functions of lncRNAs [2729]. We found that high-attribution lncRNA genes carry a higher density of pyknons (Fig. 4D) than lncRNA genes from the neutral set. This was true for both positive- and negative-attribution lncRNAs, but it was particularly marked in negative-attribution lncRNAs, where average pyknon density is seven times higher than neutral-attribution lncRNAs. Together, these findings show that high attribution protein-coding and lncRNA genes by expression or splicing are under strong evolutionary and selective constraints and suggest that these protein-coding genes and lncRNAs with altered expression or abnormal splicing in cancer have essential functions in cells.

Characterization of splice junctions with high attributions

While it is easy to conceive how changes in the expression level of a gene can drive tumorigenesis, interpreting the impact of splicing changes in disease is not as straightforward. We thus wanted to assess how variable splice junctions with high attributions are predicted to impact protein sequence and function. We first noted that high-attribution junctions are predicted to disrupt the reading frame of the gene as often as our reference neutral junction set (Additional file 1: Fig. S7A). Previous studies have shown that alternative splicing can modulate protein-protein interactions by targeting disordered regions [3032]. Therefore, we looked at predicted disorderness of the peptide sequence corresponding to the two exons immediately upstream and downstream of variable splice junctions but found that the predicted peptide disorderness level is no different in high-attribution junctions from what we observe in the neutral set (Additional file 1: Fig. S7B).

We then assessed whether high-attribution splice junctions affect known protein domains by predicting the protein domains encoded from the two exons immediately upstream and downstream of high-attribution junctions using the NCBI Conserved Domain Database. Interestingly, we discovered that 11 splice junctions in 10 genes (CSNK2A2, MAPK9, RIOK1, PRKDC, TYK2, PAK1, IRAK1, CSNK2A1, VRK1, MARK3) affect a part of the transcript matching sequences of protein kinase C (PKC)-like superfamily domains (Fig. 5). Genes contributing to PKC signaling have been implicated in cancer as oncogenes or tumor suppressors [33], but little is known about the impact of splicing variations altering PKC-like superfamily domains in cancer. We also found additional high-attribution splice junctions that affect other domains that are linked to cancer signaling, in particular DEAD-like, RING, and C2 domains. Thus, it is possible that some of the high-attribution splice junctions that we uncovered regulate cancer through the alteration of cancer signaling protein domains.

Fig. 5
figure 5

Protein domains that are affected by at least two splice junctions with high attributions

Contrasting functions of genes with high positive or negative attributions by expression or splicing in cancer

Finally, given that a majority of the protein-coding genes or genes with splice junctions with high attribution values in our models were not previously associated with cancer, we sought to understand the functions of those genes. We first checked whether genes that have high attribution values by expression differ from the genes that have high attributions by splice junction usage and confirmed that a large majority of the genes with high attributions by expression differ from the genes with high attributions by splice junction usage (Fig. 6A). We performed a Gene Ontology analysis for protein-coding genes with high attribution values and found that protein-coding genes with high negative attribution values are enriched for functions related to transcription, mitosis, histone modification, chromatin regulation, and localization to the centrosome, in line with the traditional view of cancer (Additional file 1: Fig. S8A). In sharp contrast, protein-coding genes with high positive attribution values are enriched for post-transcriptional and post-translational modifications, in particular tRNA modification, RNA splicing and protein neddylation, as well as membrane-bound organelles (Additional file 1: Fig. S8B). Similar to protein-coding genes with high positive attribution values, genes with splice junctions with high attribution values are also enriched for functions related to RNA processing, in particular, splicing and transport, but also carry a component of terms related to chromatin (Additional file 1: Fig. S8C). Of note, the set of genes with neutral attribution values by expression are enriched for heterogeneous and unrelated terms (Additional file 1: Fig. S8D) and genes with neutral attribution values by splice junction usage failed to return any terms with an adjusted p-value < 0.05, indicating that our high-attribution genes consist of sets of biologically related and consistent functions.

Fig. 6
figure 6

A Overlap between genes with high positive (red set) or negative (blue set) attribution values and genes with splice junctions that have high attribution values (light red set) with the list of genes overlapping between sets. B Enrichment map showing GO terms related to biological processes that are enriched among protein-coding genes with high negative (protein-coding negative, blue) or positive (protein-coding positive, red) attributions by expression, genes with high-attribution splice junctions (splice junctions, light red), or enriched both in protein-coding positive and splice junction genes (protein-coding positive and splice junctions, dark red). Each node is a GO term and the color of the nodes corresponds to gene sets in which they are enriched. The thickness of edges corresponds to the number of genes in common between GO terms. C Heatmap of terms obtained from an Ingenuity Pathway Analysis (QIAGEN) for molecular and cellular functions. D Gene set enrichment analysis on high-attribution protein-coding genes by expression showing an enrichment in KRAS signaling among negative attribution genes. High-attribution genes were ranked on the x-axis from high positive (left, red) to high negative attribution (right, blue)

An enrichment map built from GO terms related to biological processes shows that high positive attribution genes by expression or splicing form a highly interconnected network whose core relates to functions associated with RNA biology (Fig. 6B). The network core from high positive attribution genes by expression or splicing differs from the highly interconnected network core drawn from genes with high negative attributions by expression, which are associated with gene expression and transcription (Fig. 6B). Ingenuity Pathway Analysis for molecular and cellular functions associated with high-attribution genes confirmed that functions of high negative attribution genes are distinct from those of high positive attribution genes, with transcription and RNA processing being predominant in each group, respectively (Fig. 6C). Pathway analysis also revealed how high negative attribution genes by expression and genes with high attribution by splice junction usage partially overlap in functions related to cell survival and how high attribution genes by splice junction usage are involved in the cell cycle. Finally, gene set enrichment analysis unveiled an enrichment of high negative attribution genes for KRAS signaling (Fig. 6D), while no significant enrichment was found for genes with high positive attributions by expression or splicing.

Thus, while genes that have high negative attribution values in cancer share functions of known oncogenes and TSGs, including in how they are implicated in genome maintenance and transcription, genes that have high positive attributions by expression or splicing have distinct functions, several of which are related to RNA regulation and RNA processing.

Discussion

Our results demonstrate that feed-forward neural networks can be used to distinguish between normal and tumor samples using transcriptomic features. Importantly, we show that models trained with lncRNA expression or splice junction usage perform as well as, if not better than, a model trained with protein-coding expression data. This observation highlights how various elements of the transcriptome can inform on disease state and emphasizes the importance of pursuing molecular markers beyond variations in protein-coding gene expression, in particular, by assessing variations in lncRNA gene expression and splice junction usage in cancer. Our approach uncovered common transcriptomic profiles consisting of a number of gene expression and splicing variation markers that are not altered in the same way across all solid tumor types, making up a novel molecular definition of cancer that would be impossible to establish using traditional approaches such as differential gene expression analysis.

The interpretation of our models revealed known and novel molecular features of cancer. Known cancer drivers were moderately enriched among genes that we find to have high attribution values, which can be expected for genes with driver mutations resulting in loss of function or lower protein expression. In addition, among genes with high negative attribution values were genes with functions typically associated with genome integrity maintenance, such as histone modification and chromatin regulation, as well as transcription, two long-known aspects of cancer development [34, 35]. On the other hand, many of the protein-coding genes that have high positive attribution values have roles in RNA regulation or RNA processing. Interestingly, RNA deregulation has become a recurrent theme in cancer research [36]. Driver mutations have been found in several RNA-binding proteins (e.g., SF3B1, U2AF1, SRSF2, HNRNPA2B1, SRRM2) in cancers ranging from blood malignancies to glioblastoma [3739], but several questions remain regarding how widely this group of proteins and their targets are involved in cancer. Our results suggest that RNA deregulation might be a central component of cancer, upon which many cellular pathways involved in cancer may converge. Indeed, network analysis shows that genes with high attribution values by expression or splice junction usage and that have functions related to RNA regulation are tightly connected to the canonical pathways of cancer (Fig. 7 and Additional file 2: Table S13).

Fig. 7
figure 7

Network analysis of common cancer pathways (PI3K, cell cycle, Myc, Rtk-Ras, Notch, Hippo, TP53, Hippo, TGF-beta) together with genes in GO terms related to RNA regulation or RNA processing that are enriched in our protein-coding or splicing models. Each node is a network as predicted with Ingenuity Pathway Analysis; the size of the nodes represents the number of molecules in each network and the thickness of the edges represents the number of molecules in common between two networks. Networks formed by high-attribution genes in our protein-coding and splicing models are highlighted with a thicker node border. Only networks comprising at least three molecules and connected by at least two shared molecules are shown. Numbers identify networks; Additional file 2: Table S13 lists the molecules found in each network (node)

Our transcriptomic definition of cancer includes several elements that were not previously genetically associated with cancer but that display strong sequence constraints, which suggests that these genes or splice junctions play essential roles in cells. Interestingly, several of these genes that are not listed as COSMIC oncogenes still display tumorigenic characteristics. A few examples include DYNC1H1 [40], WSB1 [4143], RUFY3 [44, 45], DOCK5 [46], MYSM1 [47], DSE [48], DCUN1D5 [49, 50], SARNP [51], and FNTA [52, 53], which can all promote cell proliferation or transformation, at least in some conditions. Likewise, we have identified splice junctions in cancer that deviate from normal tissues in functional domains implicated in cancer signaling, such as PKC-like [33], DEAD-like [54] and RING [55] domains, in genes associated with cancer, such as CSNK2A1, CSNK2A2, RIOK1, PRKDC, TYK2, PAK1, and IRAK1, for which gene expression and posttranslational modifications act as mechanisms for cancer progression [5661]. However, splicing variations related to cancer have not been reported for any of these genes except PAK1, for which a JMJD6-regulated exon inclusion event altering the PKC domain enhances MAPK signaling in melanoma [62]. While the splice junction we identified differs from the one reported before, it also affects the PKC domain of PAK1. IRAK1 has two well-characterized alternative splicing events [63], but there exists no evidence that these events are directly involved in tumorigenesis, and they also differ from our splice junctions with high attributions. Overall, keeping in mind that the models we developed were not designed for clinical application, their robustness across tumor types and the functional properties of their most informative features hint that our signatures could be leveraged to design markers for cancer detection.

Interestingly, our analysis of variant frequency shows that high negative attribution genes in our protein-coding gene expression model are more frequently mutated than the neutral set, and almost as frequently as COSMIC oncogenes and TSGs. In contrast, variant frequency is much lower for genes with high positive attributions by expression or splicing. This observation could explain why many of these transcriptomic features have previously been overlooked in genomic studies. In addition, while the directionality of the attribution value does not directly reflect the difference in expression across all cancer types (e.g., a gene with a high positive attribution would not necessarily have higher expression in all cancer types), we noticed that known oncogenes generally have positive attribution values and known tumor suppressors generally have negative attribution values (Fig. 2C). This observation suggests that positive attribution genes could be considered “oncogene-like” while negative attribution genes could be considered “tumor-suppressor-like” in the way they are altered in cancer and, perhaps, in how they contribute to cancer biology.

Conclusions

Altogether, our results show that alteration of RNA processing pathways is a hallmark of several types of cancer. Future work should be directed at assessing whether the transcriptomic features of cancer that we highlight here are causally involved in tumorigenesis or in tumor suppression, thereby highlighting a core transcriptomic component of cancer development, or whether they represent the downstream consequences of genetic alterations in core transcriptional circuitries of cancer [64].

Methods

Notation

We are interested in defining the transcriptomic signature of solid tumors. We achieve this by first predicting the cancer state of an RNA-seq sample using a deep learning model with gene expression (protein-coding or lncRNA) or splicing quantification as input. Subsequently, we interpret the prediction made by the deep learning model given our input observation by assigning attributions to each feature of the observation. Let \(\mathcal {X}\) be the input space and \(\mathcal {Y}\) be the output or label space. Input x is in a p-dimensional feature space \(\mathcal {X}=\mathbb {R}^{p}\). Since we only consider a binary classification task for defining the cancer state, \(\mathcal {Y} = \{0,1\}\), where 0 represents normal tissue and 1 represents tumor. Predictions are obtained by a prediction function on the feature space \(F:\mathcal {X}\rightarrow \mathcal {Y}\). The goal of the interpretation step is to obtain a p-dimensional vector of attributions called \({\mathbf {attr}}\in \mathbb {R}^{p}\), with each value representing how each of the p features contributes to the prediction F(x).

Datasets

In this work, we process RNA-seq samples from normal human tissues and tumors from 14 datasets (see Additional file 2: Table S2 for a list of all samples and their tissue/cancer identity and Fig. 1B, C for the number of samples representing each tissue or tumor type and source dataset, respectively). The two largest datasets among these are from the Genotype-Tissue Expression (GTEx) consortium and the Cancer Genome Atlas (TCGA) program. We processed 5622 samples from 19 normal tissues and 7839 samples from 18 cancer types. Since large datasets often suffer from batch effects, we included 12 other datasets in our analysis. These datasets included lung [65, 66], liver [67], stomach [68], breast [6971], brain [72], and colon [73] tumor samples with matched normal samples, as well as head and neck [7476], pancreatic [77], ovarian [78], and prostrate [79] tumor samples without matched normal samples (see Additional file 2: Table S2 for the number of samples and dataset labels). For testing on independent, unseen tissue and tumor types, we processed 16 macrophage samples, 8 monocyte samples and 9 CD4 + lymphocyte samples from the Blueprint project [80], 35 CD4 + and 14 CD8 + lymphocyte samples from dataset E-MTAB-2319 [81], and 40 B-cell acute lymphoblastic leukemia samples and 40 acute myeloid leukemia from the pediatric TARGET cohort [82].

RNA-Seq data processing

In order to minimize the introduction of technical biases, all RNA-Seq samples were processed from fastq files in the same manner. The raw reads from RNA-Seq experiments are passed through quality control using FastQC [83]. Sequencing adapters were trimmed with TrimGalore (v0.6.6) [84], reads were aligned with STAR (v2.5.2a) [85] against the hg38 human genome assembly [86], and mapped reads were sorted and indexed using samtools (v1.11). Gene expression quantification was carried out using Salmon (v0.14.0) [87] in quasi-mapping mode using an index generated with Ensembl GRCh38 transcriptome release 94. Splice junctions were quantified using MAJIQ (v2.1) [88]. MAJIQ defines alternative splicing in terms of local splicing variations (LSVs). LSVs can be binary or complex. Binary LSVs comprise only two junctions and complex LSVs have more than two junctions. We only use binary splicing variations that were quantified in at least 80% of samples of a given tissue or tumor type and select junction usage value randomly from one of the two junctions composing the splicing variation. The splicing quantification of a junction in a condition is called percent spliced-in (PSI). For binary LSVs, PSI measures the ratio of the number of reads supporting the inclusion of a junction in a condition over the total number of reads supporting its inclusion or exclusion. MAJIQ uses a beta-binomial distribution over the reads to quantify PSI. For more details on the statistical model, please refer to [88].

Batch correction

To mitigate batch effects from our RNA-seq data, we take two steps. First, in addition to GTEx and TCGA, we searched for other datasets with normal tissues and tumor samples. This step is necessary as our signal of normal tissues versus tumor is confounded with whether the sample came from GTEx or TCGA. GTEx consists of normal tissues and TCGA consists of mostly tumor samples. Therefore, we added 12 other small datasets to ensure that we learn cancer-specific signals and are not confounded by dataset-specific technical biases. We found additional data sources for 13/19 normal tissues and 8/18 tumor types. Next, we correct dataset bias for gene expression data by mean-centering each tissue/tumor-type separately across datasets. The batch correction step ensures our deep learning models using protein-coding and lncRNA gene expression generalize across multiple datasets.

We opted for tissue-specific mean correction for mitigating batch effects instead of standard batch correction methods such as COMBAT because our dataset did not meet the criteria required for traditional batch correction methods, such as the requirement for having at least two data sources for each tissue/tumor type [21]. However, to ensure the robustness of attributions generated by mean-corrected data, we compared it with attributions generated by COMBAT-corrected data. We ran COMBAT batch correction on 11/19 normal tissues and 7/18 tumor types, for which data was available from multiple sources. Two additional normal tissues and one tumor type with multiple data sources had high imbalance, which led COMBAT to filter approximately 14,000–15,000 genes, thus making the corrections unusable. As expected, the application of COMBAT on our gene expression data results in the loss of approximately 5500 genes. However, it is worth noting that despite these limitations we observe a very high correlation between the attribution values identified by both types of data pre-processing (mean correction vs. COMBAT correction, see Additional file 1: Fig. S2C, right panel; Pearson’s R 0.90, Spearman’s R 0.85), indicating the stability of our top feature attributions regardless of the batch correction method.

Tumor classification models

We train three deep learning models for defining the transcriptomic signature of solid tumor samples. Each of these models uses a different set of transcriptomic features to predict the cancer state of an RNA-seq sample: protein-coding gene expression, lncRNA gene expression and quantification of splice junctions. Since we have thousands of noisy transcriptomic features, we first train an autoencoder model for dimensionality reduction followed by a supervised feed-forward neural network for the cancer state prediction. In the next three sections, we describe these models in detail.

Protein-coding gene expression based model

In our first model, the input features are the expression values of 19,657 protein-coding genes. We extracted the protein-coding genes from Ensembl BioMart (see Additional file 2: Table S14 for the list of protein-coding genes). Using these features, we first train an autoencoder model and then, using the reduced feature set from the latent space of the autoencoder, we train a supervised feed-forward neural network that predicts normal tissue versus tumor for each RNA-seq sample. The encoder in our autoencoder model has two latent layers followed by a third latent layer that produces the feature set with reduced dimensionality. The decoder mirrors the encoder. It takes in the features from the third latent layer of the encoder and reconstructs the gene expression values of the protein-coding genes. Using the latent features from the encoder as input features, we then train a discriminator network with three latent layers followed by the output layer. Both the autoencoder and the discriminator networks use the ReLU activation function and are trained using the adam optimizer (see Additional file 2: Table S1 for the detailed architecture of both models).

LncRNA gene expression based model

For the next model, the input features are the gene expression of 14,257 lncRNA genes. We extract the lncRNA genes from Ensembl BioMart. See Additional file 2: Table S15 for the list of lncRNA genes. The model architecture and training process of the lncRNA-based model is similar to the protein-coding gene expression model described in the previous section (see Additional file 2: Table S4 for the detailed architecture of the lncRNA autoencoder and discriminator networks).

Splicing junctions based model

Finally, the input features for our third model are the splicing quantification for 40,147 alternative splice junctions from 11,219 genes. See Additional file 2: Table S16 for the list of genes. We generated the splicing quantification for these junctions in the normal tissues and tumors using MAJIQ (see the “RNA-Seq data processing” section for details on quantification of these splicing junctions from the RNA-seq data). As we have thousands of splicing junctions as features, we again train an autoencoder for dimensionality reduction followed by a supervised neural network for tumor classification. The model architecture and training process of this model is similar to the previous gene expression (protein-coding or lncRNA) based models (see Additional file 2: Table S5 for the detailed architecture of the splicing junctions based models).

Interpretation of tumor classification models

In order to find the features responsible for classifying an RNA-seq sample as tumor, we employ the enhanced integrated gradients (EIG) method for interpretation of a deep learning model [19]. Here, interpretation means attributing the prediction of a deep learning model to its input features. Briefly, EIG computes feature attribution by aggregating gradients along a linear/non-linear path between a sample and a class-agnostic/specific baseline. Sample here refers to an input sample to the deep learning model. Baseline refers to a model’s proxy to human counterfactual intuition. This implies that humans assign blame for the difference in two entities on attributes that are present in one entity but absent in the other. EIG offers multiple baselines and paths. In this work, since we want to find features that distinguish between tumor samples from normal tissue samples, we use normal tissues as the baseline class. Specifically, we use the median in the latent space over the normal tissue samples. Then, we compute the attributions for the given tumor samples by aggregating the gradients between the baseline and each tumor sample along a linear path in the original feature space. We assess the class-wide significance of each feature by computing p-values by comparing the attribution distribution of a feature for the tumor class versus a random mixture of normal tissues and tumors using a one-sided t-test with FDR correction for multiple hypothesis testing. For further details on enhanced integrated gradients, please refer to [19].

Selection of feature sets

High-attribution feature sets were selected from features with an adjusted p-value < 0.0001 (Benjamini-Hochberg FDR correction < 0.01) and ranking above the knee-point of the curve in the case of positive attribution values or below the knee-point of the curve in the case of negative attribution values. A neutral set of a size equivalent to the number of high-attribution features was selected among genes that had FDR-corrected p-value > 0.05 and ranking in the middle of the distribution of attribution values (attribution values close to 0). The list of COSMIC oncogenes and tumor suppressor genes (TSGs) was established from the COSMIC genes census that had a role in cancer, comprising the annotation “oncogene” but not “TSG” for oncogenes and comprising the annotation “TSG” but not “oncogene” for TSGs.

Gene ontology analysis, gene set enrichment analysis and Ingenuity Pathway Analysis

Gene ontology analysis was performed with EnrichR [89] v.1.0 using a 2018 release of the GO Consortium annotations and including terms from the molecular function, cellular component and biological process categories. Gene set enrichment analysis was performed using GSEA4.1.0 [90, 91] against the msigdb.v7.4 gene set library and filtered for enrichment corresponding to hallmark, reactome and GO gene sets with a normalized p-value < 0.01.

Splice junction characterization

Disorderness was predicted with IUPred2 [92] from the two exons immediately upstream and downstream of the most distant splice site corresponding to a variable junction. Protein domains were predicted from the same transcript region with the NCBI WEB CD-search tool [93].

Functional characterizations

The conservation score was calculated from the summation of BLAST bit scores from 6 species: human, chimpanzee, mouse, cattle, xenopus, zebrafish, and chicken (taxids: 9606, 9598, 10,090, 8364, 7955, 9031, 9913), normalized to the length of the human transcript. Loss of function mutation frequency is expressed as the gnomAD LOEUF score only for genes for which a LOEUF score was reported [26]. Pyknon density was calculated using the list of human pyknons available from the pyknon database [94] and shown as the number of pyknons found per 1000 nt in the longest RefSeq annotated transcript.