Introduction

During gene expression, the information encoded in a gene is used for the synthesis of a protein or of another functional gene product. In biological sciences, gene expression is considered as the activity of a gene: the higher its expression, the more active the gene.

The measurement of gene expression is called gene expression profiling, and can be performed through several techniques and technologies, including DNA microarrays. A microarray is a microscope slide printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene [1].

Since microarrays can be generated through multiple different techniques, each gene expression dataset is associated with the particular platform on which the gene expression was measured. Each microarray platform has its own gene expression coordinates for the positions of the genes in the genome. These coordinates are indicated by probesets, which are sets of fragments of DNA known as hybridization probes [2]. Each microarray platform therefore has its own probeset system, which is usually incompatible with the probeset system of other platforms. Only platforms of the same brand can have mutually compatible probesets, and this is the case of the Affymetrix platforms GPL96, GPL97, and GPL570, for example.

In most cases, a probeset corresponds to one specific gene symbol. A gene symbol, instead, can be related to multiple probesets. This aspect represents a problem in bioinformatics: given a gene symbol alone, it is impossible to know which probeset of a specific platform it refers to. Conversely, given a probeset and a platform, it is always possible to identify the related gene symbol.
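The asymmetry between the two lookup directions can be illustrated with a minimal Python sketch. The platform name and every probeset and gene identifier below are hypothetical placeholders, not real annotations, and the toy table assumes the common case in which a probeset maps to a single gene symbol.

```python
# Toy annotation table: (platform, probeset) -> gene symbol.
# Every identifier here is a hypothetical placeholder, not a real annotation.
ANNOTATION = {
    ("PLATFORM_X", "probe_0001_at"): "GENE_A",
    ("PLATFORM_X", "probe_0002_at"): "GENE_A",
    ("PLATFORM_X", "probe_0003_at"): "GENE_B",
}


def gene_for(platform, probeset):
    """In this toy table, a probeset plus a platform identifies one gene symbol."""
    return ANNOTATION[(platform, probeset)]


def probesets_for(platform, gene):
    """A gene symbol may map back to several probesets on the same platform."""
    return sorted(
        probeset
        for (pf, probeset), symbol in ANNOTATION.items()
        if pf == platform and symbol == gene
    )


print(gene_for("PLATFORM_X", "probe_0002_at"))
print(probesets_for("PLATFORM_X", "GENE_A"))
```

The forward lookup returns a single symbol, while the reverse lookup returns a list, which is why a gene symbol alone is ambiguous.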

To alleviate this problem, Qiyuan Li and colleagues [3] recently released Jetset, a bioinformatics tool that associates a probeset with its most likely gene symbols for some specific platforms. Although useful, this tool does not completely solve the probeset-gene association problem.

Even though most scientific studies still rely on gene symbols, an article by Li Li et al. [4] showed that using different probesets related to the same gene symbol would lead to different results, and advocated for the usage of probesets instead of gene symbols in bioinformatics analyses. We agree with that approach and decided to build our whole analysis on probesets rather than gene symbols.

Genetic signatures

Groups of particular genes can together have an important role in the characterization of diseases; these groups of genes are usually called genetic signatures. When a signature can be used to differentiate patients from healthy controls, it is called a diagnostic signature. When a signature can be employed to differentiate survived patients from deceased patients, instead, it is called a prognostic signature. Here we focus on the latter kind.

Cancer affects around 20 million people and causes approximately 10 million deaths globally each year [5], and the study of potential cancer signatures has been widespread in bioinformatics research worldwide. In the past, prognostic signatures have been used for specific cancer types, such as lung cancer [6] and breast cancer [7].

Here, instead, we propose a prognostic pan-cancer signature able to distinguish surviving patients from patients at risk of death in gene expression datasets of any cancer type. An analysis done on multiple cancer types is, in fact, called pan-cancer [8].

Several researchers have already proposed pan-cancer signatures and pan-cancer studies in the past. Jia and colleagues [9], for example, investigated the role of a gene signature related to the COL11A1 gene for the identification of pan-cancer associated fibroblasts. Xu et al. [10] proposed a 154-gene expression pan-cancer signature derived from a transcriptome data analysis.

In another study, de Almeida and coauthors [11] proposed a centrosome amplification-related signature for clinical outcome across different cancer types. Izzi and colleagues [12] analyzed matrisome data of the extracellular matrix (ECM) to propose 29 cancer types-specific signatures. Data from the ECM were used by Yu and colleagues [13] as well to propose a 5-gene pan-cancer signature for prognosis.

Luo et al. [14] analyzed telomerase reverse transcriptase (TERT) activation data from The Cancer Genome Atlas (TCGA) to propose a TERT\(^{high}\)-specific mRNA expression signature for multiple cancer types.

Yuanyuan Li and coauthors [15] analyzed RNA-Seq data of The Cancer Genome Atlas to detect a 20-gene pan-cancer signature for survival prediction. More recently, Nagy et al. [16] analyzed the same data to detect an 8-gene pan-cancer signature.

A list of prognostic genes for a specific disease can be found not only through gene expression, but also by integrating multi-omics data. Zhou et al. [17], for example, applied deep machine learning models to data of gene expression, copy number alterations (CNAs), and messenger RNA (mRNA) and detected 12 prognostic genes for breast cancer.

A genetic signature can be applied to a bioinformatics dataset mainly in two ways: through statistical survival models [13, 16] or supervised machine learning models [10, 11, 12, 14, 15]. Our approach belongs to the latter group: in our analysis, in fact, we employed the Random Forests [18] ensemble machine learning method. Random Forests proved effective in numerous computational biology studies [19] and on gene expression data in particular [20].

Our proposed pan-cancer prognostic signature

In this study, we propose a pan-cancer prognostic signature merged from 5 already-existing cancer type-specific prognostic signatures available in the literature (breast cancer, lung cancer, prostate cancer, colon cancer, and neuroblastoma).

Three aspects make our proposed pan-cancer signature an effective tool for prognosis on gene expression data: (i) the usage of probesets instead of gene symbols; (ii) the 207 probesets derived from 5 different signatures, each related to a different cancer type; (iii) the application of the signature with Random Forests.

We applied our proposed pan-cancer prognostic signature to 57 gene expression datasets publicly available on GEO, covering 12 different cancer types. Moreover, to better understand the roles and the functions of the genes of our proposed signature, we then employed a gene set enrichment tool and a protein-protein interaction analysis tool, and elaborated their results [21].

Our results confirm the predictive power of our proposed pan-cancer prognostic signature, and the functional validation task unveiled relevant information about the signature genes, which can pave the way for further studies on this topic.

This study

We organize the rest of this study as follows. After this Introduction, we describe the 5 original cancer type-specific signatures that we used to generate our pan-cancer signature and the 57 datasets we employed for testing (Section 2). We then describe the machine learning method we used to predict the survival of the patients and the network and pathway analysis techniques we employed for functional validation (Section 3), and the results obtained in these two steps (Section 4). Lastly, we outline some conclusions about this study and its potential future developments (Section 5).

Datasets

In this section, we first explain how we retrieved the gene expression cancer datasets we employed in our study (Section 2.1) and then we describe how we generated our proposed pan-cancer signature (Section 2.2).

Gene expression data of multiple cancer types

We collected gene expression datasets of the most common cancer types [5] from Gene Expression Omnibus (GEO) through Bioconductor [22, 23] packages such as GEOquery [24] and BioMart [25]. We selected only the prognostic datasets, that is, the ones that include a feature about the status of the patient: alive or deceased. We kept only the datasets derived from platforms compatible with our pan-cancer signature probesets, namely the Affymetrix Human Genome U133 platforms HG-U133A (GPL96), HG-U133B (GPL97), and HG-U133 Plus 2 (GPL570).

For this purpose, we developed a Perl script [26] that retrieved 57 different prognostic cancer datasets: 17 of breast cancer, 13 of lung cancer, 10 of colorectal cancer, 5 of lymphoma, 4 of leukemia, 2 of multiple myeloma, and 1 each of adrenocortical cancer, bladder cancer, neuroblastoma, ovarian cancer, skin cancer, and stomach cancer.

We included the 11 most common cancer types, plus a rare childhood cancer, neuroblastoma, to verify the effectiveness of our pan-cancer signature both on most cancer types and on one specific rare disease. We wanted to include a dataset of prostate cancer, but unfortunately we could not find any prognostic one compatible with the GPL96, GPL97, or GPL570 platforms.

We reported all the information and the quantitative characteristics of these datasets in Table 1.

Table 1 List of gene expression datasets employed in our analysis, sorted by cancer type

Our pan-cancer signature

To generate our proposed pan-cancer prognostic signature, we joined five different prognostic signatures available in the scientific literature. Each of these five signatures was proposed for a specific cancer type, and its probesets are compatible with the GPL96, GPL97, and GPL570 Affymetrix platforms.

In particular, the five known prognostic signatures contribute to our pan-cancer signature this way (Fig. S1):

  • The sigCangelosi2020 signature for neuroblastoma, with 9 probesets (Table S1) [27], contributes 4.33% of our pan-cancer signature;

  • The sigChen2012 signature for prostate cancer, with 7 probesets (Table S1) [28], contributes 3.37% of our pan-cancer signature;

  • The sigGyorffy2013 signature for lung cancer, with 15 probesets (Table S1) [29], contributes 7.21% of our pan-cancer signature;

  • The sigHallett2012 signature for breast cancer, with 14 probesets (Table S1) [30], contributes 6.73% of our pan-cancer signature;

  • The sigVanLaar2010 signature for colon cancer, with 163 probesets (Table S2, Table S3, Table S4, and Table S5) [31, 32], contributes 78.37% of our pan-cancer signature.
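The percentages in the list above follow directly from the probeset counts. The short Python sketch below recomputes them; the function and variable names are illustrative choices, while the counts are the ones given in the list.

```python
# Probeset counts of the five source signatures, as listed above.
SIGNATURE_SIZES = {
    "sigCangelosi2020": 9,
    "sigChen2012": 7,
    "sigGyorffy2013": 15,
    "sigHallett2012": 14,
    "sigVanLaar2010": 163,
}


def contribution_percentages(sizes):
    """Share of each source signature in the merged list, in percent."""
    total = sum(sizes.values())
    return {name: round(100 * count / total, 2) for name, count in sizes.items()}


print(sum(SIGNATURE_SIZES.values()))
for name, share in contribution_percentages(SIGNATURE_SIZES).items():
    print(name, share)
```

The denominator is 208 because the shares are computed on the merged list before duplicates are removed; one probeset is shared by two source signatures, which leaves 207 unique probesets.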

As one can notice, the sigVanLaar2010 colon cancer signature makes up a large part of our signature. We decided to include signatures of common cancer types (lung cancer, breast cancer, colon cancer, and prostate cancer) plus a signature of a rare cancer (neuroblastoma) because we wanted to create a prognostic signature that could work effectively both on common cancer types and on rare cancer types.

Our first step was to check the probesets and genes shared by multiple source signatures and therefore present multiple times in our aggregate pan-cancer signature. We used geneExpressionFromGEO [33] and BioGPS [34] for the probeset-gene annotations.

Our proposed pan-cancer signature contains the probeset 203072_at (MYO1E gene ENSG00000157483, myosin IE) [35, 36] that is present twice in our signature because it is located both in the sigVanLaar2010 signature for colorectal cancer and in the sigHallett2012 signature for breast cancer.

Our proposed signature contains 207 unique probesets related to 187 unique gene symbols in total. Some gene symbols occur multiple times:

  • 3 gene symbols appear four times (CTSB, FN1, and TM4SF1);

  • 7 gene symbols appear three times (ANXA2, CD55, DUSP6, KLF6, PLAUR, RPL3, and RPL3P4);

  • 17 gene symbols appear twice (APOE, BGN, C10orf99, CD59, CH507-513H43, CH507-513H44, CH507-513H46, DNAJA3, IGFBP3, IRS2, NNMT, PDK1, PGK1, PRDX5, TMBIM4, TNFRSF21, VCAN, and VEGFA);

  • All the other gene symbols appear only once.

We report our pan-cancer signature in the Supplementary information (Table S1, Table S2, Table S3, Table S4, and Table S5).

Methods

In this section, we first describe how we applied ensemble machine learning for the prediction of the survival (Section 3.1), and then we report the methods we used for the protein-protein network and pathway analysis of our pan-cancer signature genes (Section 3.2).

Survival prediction through machine learning

In our survival prediction, we first selected the probesets of a specific signature and the survived/deceased label on each gene expression dataset, and we then applied Random Forests [18] for binary classification. Random Forests is an ensemble machine learning method based on decision trees: at each execution, it selects random subsets of the training set (randomly picking some features and some data elements), and trains a decision tree on each of these subsets. At the end of the execution, Random Forests applies each of these decision trees, each of which generates a binary response. Random Forests eventually applies a majority vote to these responses: if most of these decision trees generated a true outcome, Random Forests will return a true outcome; if most of these decision trees produced a false outcome instead, Random Forests will return a false result too.
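The two ingredients described above, random subsets and majority voting, can be sketched in a few lines of Python. This is only an illustration, not the R implementation used in the study: the function names are hypothetical, rows are drawn with replacement and features without replacement as an assumption, and a tie is resolved as a false outcome by choice.

```python
import random


def draw_tree_subset(n_samples, n_features, n_picked_features, rng):
    """Pick the rows and the feature columns on which one tree would be trained."""
    # Assumption: rows are sampled with replacement (bootstrap).
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Assumption: features are sampled without replacement.
    columns = sorted(rng.sample(range(n_features), n_picked_features))
    return rows, columns


def majority_vote(tree_votes):
    """Return True when more than half of the trees voted True (ties give False)."""
    true_votes = sum(1 for vote in tree_votes if vote)
    return true_votes > len(tree_votes) / 2


rng = random.Random(0)
rows, columns = draw_tree_subset(n_samples=10, n_features=6, n_picked_features=2, rng=rng)
print(len(rows), len(columns))
print(majority_vote([True, True, False]))
```

In a full implementation, each tree would be fitted on its own subset and the votes passed to `majority_vote` would be the predictions of those trees for one patient.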

Since it is known that changes in the hyper-parameters of Random Forests do not significantly affect results when the method is applied to small datasets [37], we used the default values of the R method, with 500 trees to grow [38].

In this phase we employed traditional best practices for machine learning, by splitting the data into a training set (80% of the patients, randomly selected) and a test set (the remaining 20%) [39, 40]. For imbalanced datasets, with one of the two classes greater than 70%, we applied the ROSE oversampling technique [41]. We measured the results on the test set with several confusion matrix rates, focusing on the Matthews correlation coefficient (MCC) [42], since it is more informative than other scores [43,44,45,46,47]. To avoid results that depend on a particular configuration of the training set and of the test set, we repeated the execution of Random Forests 100 times, and reported the average results obtained for each statistic.
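The evaluation protocol above can be outlined with the following Python sketch. It is a simplified illustration under stated assumptions, not the code used in the study: all function names are hypothetical, the classifier is passed in as a generic callable, the ROSE oversampling step is not reproduced (only the 70% imbalance check is shown), and an MCC with a zero denominator is set to 0 by convention.

```python
import math
import random
from statistics import mean


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention chosen here when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


def is_imbalanced(labels, threshold=0.70):
    """True when one of the two classes exceeds the threshold share."""
    positive_share = sum(1 for y in labels if y) / len(labels)
    return max(positive_share, 1 - positive_share) > threshold


def split_indices(n, test_fraction, rng):
    """Shuffle the indices 0..n-1 and return (train, test) index lists."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_holdout_mcc(features, labels, fit_predict, repeats=100, seed=0):
    """Average test-set MCC over repeated random 80%/20% splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train, test = split_indices(len(labels), 0.2, rng)
        predictions = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        pairs = list(zip(predictions, [labels[i] for i in test]))
        tp = sum(1 for p, t in pairs if p and t)
        tn = sum(1 for p, t in pairs if not p and not t)
        fp = sum(1 for p, t in pairs if p and not t)
        fn = sum(1 for p, t in pairs if not p and t)
        scores.append(mcc(tp, tn, fp, fn))
    return mean(scores)


def sign_classifier(train_x, train_y, test_x):
    """Hypothetical stand-in model: predicts True when the single feature is positive."""
    return [row[0] > 0 for row in test_x]


# Hypothetical toy data: 50 patients, one feature, perfectly separable.
toy_features = [[1.0], [-1.0]] * 25
toy_labels = [True, False] * 25
print(is_imbalanced(toy_labels))
print(repeated_holdout_mcc(toy_features, toy_labels, sign_classifier, repeats=10))
```

Averaging over many random splits reduces the dependence of the reported score on any single partition of the patients, which is the motivation given above for the 100 repetitions.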

Moreover, we also applied several methods alternative to Random Forests: CatBoost [48], LightGBM [49], k-Nearest Neighbors [50], and Decision Tree [51]. Since Random Forests obtained better average MCC results than the other algorithms (Supplementary File S4), we decided to base our study on Random Forests.

Network and pathway analysis

To better understand the biological functions associated with our pan-cancer signature, we employed g:Profiler g:GOSt [52], an online web tool for functional enrichment analysis [29, 53]. g:Profiler g:GOSt reads in a list of genes and associates it with functions and pathways from several bioinformatics databases, such as the Gene Ontology (GO), WikiPathways (WP), and the Human Protein Atlas (HPA). g:Profiler g:GOSt assigns a p-value to each term annotated to the input gene list. We used its g:SCS significance algorithm with 0.005 as significance threshold, as suggested by Benjamin and colleagues [54].

Knowledge about the function and the behavior of the genes of our pan-cancer signature can come from their protein-protein interactions (PPIs), too. For this reason, we looked for the protein-protein interactions associated with our pan-cancer signature in the STRING [55] database. We decided to use only the physical interactions provided by STRING, with confidence threshold 0.4, and to discard the predicted interactions. This way, we can focus only on the real, existing protein-protein interactions, with a high level of confidence regarding our scientific discoveries.

For network analysis, we used experimentally detected physical protein-protein interactions (PPIs) obtained from the Integrated Interactions Database (IID, June 2021 version) [56]. For pathway enrichment analysis we used two pathway sets from pathDIP (version 4) [57], core and extended pathways (predictions based on experimentally detected physical connectivity of proteins with pathway members at an association-score 0.95 and higher).

Results

In this section, we first report and describe the results on the survival prediction obtained by our pan-cancer signature (Section 4.1), and then the results obtained through the functional validation of the genes of our pan-cancer signature (Section 4.2).

Survival prediction on all the datasets

Our prognostic pan-cancer signature

We applied our pan-cancer signature with several machine learning methods: Random Forests, CatBoost, LightGBM, k-Nearest Neighbors, and Decision Tree. Among them, Random Forests obtained the highest average Matthews correlation coefficient (MCC), and therefore we highlighted this method’s results. We list the results obtained with CatBoost, LightGBM, k-Nearest Neighbors, and Decision Tree in Supplementary File S4.

We report the results obtained by our prognostic signature with Random Forests on the 57 datasets in Table 2 and Fig. 1. Our pan-cancer signature achieved at least a sufficient score among the employed rates (MCC, F\(_1\) score, accuracy, sensitivity, specificity, precision, negative predictive value, PR AUC, and ROC AUC) on 55 out of 57 datasets (all except the dataMicke2011 and dataLeich2009 datasets).

As expected, our signature achieved its best results among the colon cancer datasets, with 6 datasets out of 10 where the MCC is above +0.2. Our proposed signature obtained good MCC results also on the single datasets of neuroblastoma, skin cancer, and stomach cancer. It was able to generate good predictions measured with MCC on 2 leukemia datasets out of 4. Overall, regarding the Matthews correlation coefficient, our pan-cancer signature obtained sufficient results on 19 datasets out of 57, corresponding to 33.33%.

Regarding sensitivity, our prognostic signature obtained sufficient results (TPR > 0.6) on 58.18% of the datasets, confirming its capability to recognize survived patients with cancer in the gene expression datasets. Our signature, however, obtained sufficient results for specificity only on 21.82% of the datasets, showing that it does not perform well when classifying deceased patients with cancer.

We also computed the precision-recall curve AUC and the ROC curve AUC to evaluate the performance when no confusion matrix threshold is provided. Our pan-cancer signature obtained sufficient scores for the PR AUC and the ROC AUC on almost 60% of the datasets, confirming its predictive power.

Among the rankings generated with all the employed rates (Fig. 1), four cancer types rank in the top four positions on average: neuroblastoma, stomach cancer, skin cancer, and colorectal cancer. Our prognostic signature obtained sufficient results on more rates on the datasets of these cancer types than on the others.

Other cancer type-specific signatures and pan-cancer signatures

To further verify the predictive efficacy of our prognostic pan-cancer signature, we applied each original cancer type-specific signature with Random Forests to each dataset of its cancer type, and compared its results with the results obtained by our pan-cancer signature. We measured the results with the Matthews correlation coefficient.

Table 2 Results obtained by our pan-cancer signature on 57 gene expression datasets
Fig. 1 Barcharts of the average results obtained by our pan-cancer signature on each cancer type. Adrenocortical cancer: results on the dataHeaton2011 dataset. Bladder cancer: results on the dataReister2012 dataset. Breast cancer: average results on 17 breast cancer datasets. Colorectal cancer: average results on 10 colorectal cancer datasets. Leukemia: average results on 4 leukemia datasets. Lung cancer: average results on 13 lung cancer datasets. Lymphoma: average results on 5 lymphoma datasets. Multiple myeloma: average results on 2 multiple myeloma datasets. Neuroblastoma: results on the dataHiyama2009 dataset. Ovarian cancer: results on the dataUehara2015 dataset. Skin cancer: results on the dataBogunovic2009 dataset. Stomach cancer: results on the dataPasini2021 dataset. We reported the complete survival prediction results in Table 2. normMCC: normalized Matthews correlation coefficient (\(normMCC = (MCC + 1) / 2\)). TPR: true positive rate, sensitivity, recall. TNR: true negative rate, specificity. PPV: positive predictive value, precision. NPV: negative predictive value. PR: precision-recall curve. ROC: receiver operating characteristic curve. AUC: area under the curve. normMCC, F\(_1\) score, accuracy, TPR, TNR, PPV, NPV, PR AUC, and ROC AUC have worst value 0 and best value 1. The formulas of MCC, F\(_1\) score, accuracy, TPR, TNR, PPV, NPV, PR AUC and ROC AUC can be found in the Supplementary information. We report additional information about these datasets in Table 1

Our pan-cancer signature outperformed the sigVanLaar2010 signature on 9 colon cancer datasets out of 10 (all except the dataSmith2009a dataset).

Our prognostic pan-cancer signature also outperformed the sigHallett2012 signature on 13 breast cancer datasets out of 17 (all except the dataSinn2019, dataKarn2011, dataLin2009, and dataMetzgerFilho2018 datasets). Our proposed pan-cancer signature outperformed the sigGyorffy2013 signature on 7 lung cancer datasets out of 13 (all except the dataPhilipsen2010, dataRousseaux2013, dataSon2007, dataTsao2010, dataXie2011, and dataZChen2020 datasets).

However, our prognostic pan-cancer signature was outperformed by the sigCangelosi2020 signature on the only neuroblastoma dataset. Unfortunately, we do not have prognostic datasets of prostate cancer, so we cannot test the sigChen2012 signature on its own.

Finally, we compared the results obtained by our proposed pan-cancer signature with the results obtained by other pan-cancer signatures found in the literature: the sigNagy2021 signature [16] (Table S6) and the sigYu2021 signature [13] (Table S7).

Our pan-cancer signature outperformed the sigNagy2021 signature on 71.93% of the datasets (Supplementary File S1). Moreover, our prognostic signature outperformed the sigYu2021 signature on 75.44% of the datasets (Supplementary File S2).
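A dataset-by-dataset comparison of this kind reduces to counting wins. The Python sketch below shows one way to compute such a percentage; the function name and the toy MCC values are hypothetical, and it assumes that "outperformed" means a strictly higher score. As a back-calculation that is not stated in the text, 41 and 43 wins out of 57 datasets would give exactly the two percentages reported above.

```python
def win_rate(scores_a, scores_b):
    """Percentage of datasets on which signature A scores strictly higher than B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both signatures must be scored on the same datasets")
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return round(100 * wins / len(scores_a), 2)


# Hypothetical MCC values on four datasets, for illustration only.
ours = [0.30, 0.10, 0.25, 0.05]
other = [0.10, 0.20, 0.20, 0.00]
print(win_rate(ours, other))

# Back-calculated counts (an inference, not a figure given in the text).
print(round(100 * 41 / 57, 2), round(100 * 43 / 57, 2))
```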

Analysis of associated pathways and protein-protein interactions

Pathway analysis

We input the gene symbols of the probesets of our signature to pathDIP [57], and found that 139 of these genes were present in core (literature-based) pathways and were enriched in 13 pathways (Table 3). These pathways are related to hypoxia-inducible factors 1 and 2 (HIF1A and HIF2A) and to cell-surface signaling (ECM and integrin signalling), both of which have been shown to be implicated in cancer [58,59,60,61,62]. The latter also suggests a potential role of the protein products of these genes in the interaction of cancer cells with other cells present in the tumour micro-environment. Enrichment analysis using extended pathways highlights immune system pathways (such as TLRs, interleukins, NFKB, and PDGF) as well as cell-death (apoptosis and autophagy) (Fig. S2 and Supplementary File S3).

Table 3 Pathways associated to our pan-cancer signature genes

However, although these findings are interesting, they are highly biased due to the imbalance in the sizes of the five source signatures. In order to reduce this bias, in the next step of pathway analysis we considered the genes in each of the five source signatures separately. Using PPIs available in IID [56], we identified proteins that have physical interactions with at least one protein in each source signature. Four proteins (FANCD2, EEF1A1, YWHAE, PGLS) have PPIs with at least one protein in all signatures, and one protein in the breast cancer signature (ALDOC) interacts with all other four signatures. Pathway enrichment analysis of these four genes (core pathDIP) returned a list of 88 pathways. At the top of this list there is “HSF1 activation”, whose importance in several cancer types has been demonstrated [64]. The most highlighted keywords in the titles of these 88 pathways are pentose phosphate, glycolysis, and Fanconi, all of which have been strongly linked to several cancer types [65,66,67,68,69].

Fig. 2 Network of integrated interactions of proteins associated with our pan-cancer signature genes. Membership of proteins that interact with protein products of genes that are members of more than three (out of five) signatures. Four proteins (FANCD2, EEF1A1, YWHAE, PGLS) have PPIs with at least one protein product of genes in all signatures, and one protein in the breast cancer signature (ALDOC) interacts with all other four signatures. These five genes are shown with orange labels. Genes in different signatures are shown with different outline colors: grey for colorectal cancer, red for lung cancer, carbon blue for neuroblastoma, orange for breast cancer, and green for prostate cancer. Nodes with pink outline show proteins interacting with protein products of genes of different signatures. We produced this network with IID [56]

Furthermore, we identified 42 proteins interacting with four out of five source signatures. One of these proteins (TRIM25) is a member of the colorectal cancer signature. Except for ALDOC and TRIM25, no other signature member interacts with more than three signatures. Figure 2 shows membership of proteins that interact with protein products of genes that are members of more than three (out of five) signatures.
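The selection of proteins that interact with several source signatures can be viewed as a set operation over the PPI network. The Python sketch below illustrates the idea on hypothetical toy data; the function names, signature names, and protein identifiers are placeholders and do not come from IID.

```python
def signatures_touched(ppi_edges, signatures):
    """Map each protein to the set of signatures containing at least one of its partners."""
    partners = {}
    for a, b in ppi_edges:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return {
        protein: {name for name, members in signatures.items() if neighbours & members}
        for protein, neighbours in partners.items()
    }


def proteins_touching_at_least(touched, minimum):
    """Proteins that interact with at least `minimum` different signatures."""
    return sorted(p for p, names in touched.items() if len(names) >= minimum)


# Hypothetical toy data: three signatures and two candidate interactors.
toy_signatures = {"sigA": {"A1", "A2"}, "sigB": {"B1"}, "sigC": {"C1"}}
toy_edges = [("X", "A1"), ("X", "B1"), ("X", "C1"), ("Y", "A2"), ("Y", "B1")]

touched = signatures_touched(toy_edges, toy_signatures)
print(proteins_touching_at_least(touched, 3))
print(proteins_touching_at_least(touched, 2))
```

With five real signatures, a threshold of 5 would correspond to the proteins interacting with all signatures and a threshold of 4 to those interacting with four out of five.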

Intriguingly, the pathway enrichment analysis of these genes returned pathways that belong to the main cancer hallmarks [70]. Examples of these pathways include metabolism (glycolysis, gluconeogenesis, pentose phosphate cycle, citrate-cycle), cell proliferation and maintenance (M2G, DNA-damage checkpoint, growth factors, WNT, PI3K-AKT-mTOR), cell-death (apoptosis, autophagy), immune system (TLRs, cytokine signaling, neutrophils), cell invasion (focal-adhesion, extracellular vesicle-mediated signaling, EMT), inflammation (fibroblast, integrins, TRAFs), and angiogenesis (VEGF, HIF). This coverage of cancer hallmarks can partly explain the reasonable performance of our combined signature on most cancer datasets (Fig. 3 and Supplementary File S3).

Fig. 3 Key-term enrichment analysis. Key-term enrichment analysis of proteins that interact with protein products of genes of at least four different signatures. The size of each key-term is proportional to the -log of the statistical significance of the appearance of that key-term in the titles of enriched pathways. We generated this image with pathDIP [57]

STRING protein-protein interaction networks

To better understand the relationships between the genes of our proposed pan-cancer signature, we inserted it into STRING [55] and generated a network of physical protein-protein interactions (Fig. 4).

The network produced by STRING showed some interesting relationships between proteins. PIK3R2 and FN1 were the proteins with the highest number of protein-protein interactions, and can therefore be considered pan-cancer gene hubs.

Fig. 4 Protein-protein physical interaction network of our proposed pan-cancer signature. We generated this network with STRING [55]: each node represents a protein generated by a protein-coding gene of our proposed pan-cancer signature, and each edge represents a physical interaction between two proteins. Some nodes contain the known or predicted 3D structure of their proteins. The colors of the edges can represent several types of interactions [55]. Confidence threshold: 0.4 medium

The PIK3R2 gene (ENSG00000105647, phosphoinositide-3-kinase regulatory subunit 2 [71, 72]) has 5 physical interactions in the protein-protein interaction network of STRING, which is the highest number of edges. PIK3R2 belongs to a family of genes known to be involved in pan-cancer [73]. The protein subnetwork of PIK3R2 could be used for further pan-cancer studies in the future: DUSP10, DUSP6, FHL2, IRS2, PIK3R2, and RIPK2.

The FN1 gene (ENSG00000115414, fibronectin 1 [74, 75]), that occurs 4 times in the signature (top occurrence), has 4 interactions in the STRING physical interaction network. FN1 has a key role in phosphaturic mesenchymal tumors [76]. The subnetwork of FN1 could be used for further pan-cancer studies in the future: CTGF, CYR61, DDIT4, DSTN, FN1, IGFBP3, LCP1, PAPSS1, PLAUR, SPP1, VCL, and VEGFA.

Additionally, in the STRING physical protein-protein interaction network there are 7 proteins with 3 physical interactions, 13 proteins with 2 physical interactions, and 44 proteins with 1 physical interaction.
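Counting the interactions of each protein amounts to computing node degrees on the interaction network. The Python sketch below shows this on the PIK3R2 subnetwork; the function names are illustrative, and the edge list assumes that PIK3R2 is directly linked to each of the other five proteins listed in its subnetwork, which is consistent with its 5 reported interactions but is not given edge by edge in the text.

```python
from collections import Counter


def node_degrees(edges):
    """Number of interactions (edges) incident to each protein."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


def hubs(edges):
    """Proteins with the highest number of interactions in the network."""
    degrees = node_degrees(edges)
    top = max(degrees.values())
    return sorted(node for node, degree in degrees.items() if degree == top)


# Assumed star-shaped subnetwork around PIK3R2 (see the hedge in the lead-in).
pik3r2_edges = [
    ("PIK3R2", partner)
    for partner in ("DUSP10", "DUSP6", "FHL2", "IRS2", "RIPK2")
]

degrees = node_degrees(pik3r2_edges)
print(degrees["PIK3R2"])
print(hubs(pik3r2_edges))
```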

Functional enrichment analysis

The functional enrichment tool g:Profiler g:GOSt associated several pathways related to pan-cancer with our prognostic pan-cancer signature (Fig. 5). Gene Ontology annotations related to cancer, such as response to hypoxia, apoptotic process, negative regulation of kinase activity, cellular response to hypoxia, extracellular matrix organization, extracellular structure organization, response to oxygen levels, and extracellular matrix, clearly confirm the relationship between our prognostic signature and pan-cancer. This tool also detected lung and adrenal gland as tissues from the Human Protein Atlas. g:Profiler g:GOSt associated several annotations related to the immune system with our pan-cancer signature, confirming the relevance of the genes of our pan-cancer signature in this context.

Fig. 5 Functional annotation analysis terms associated with the genes of our proposed pan-cancer signature. We generated this list of functional annotations using g:Profiler g:GOSt [52] with the following options and list of abbreviations. Statistical domain scope: only annotated genes. Significance threshold: 0.005, as suggested by Benjamin and colleagues [54]. Significance method: g:SCS algorithm. GO: Gene Ontology. BP: biological process. CC: cellular component. MF: molecular function. WP: WikiPathways. TF: Transcription Factors. HPA: Human Protein Atlas

To discover additional aspects about the functional annotations related to our signature, we applied Enrichr [77] to our signature gene list. Among the annotations found by Enrichr, we found two diseases from PheWeb [78] of interest for our analysis. PheWeb associated macular degeneration to our signature gene list. We know vascular endothelial growth factor (VEGF)-A can affect cancer treatment and age-related macular degeneration [79]. PheWeb also associated lipoma of skin and subcutaneous tissue to our signature genes; a lipoma is a benign tumor made of fat. Both g:Profiler g:GOSt and Enrichr confirmed the relationship between our prognostic signature gene list and pan-cancer.

Discussion and conclusions

In this study, we proposed a prognostic pan-cancer signature of probesets merged together from 5 different cancer type-specific signatures available in the scientific literature. Our prognostic pan-cancer signature is made of 207 unique probesets related to 187 unique gene symbols, and is based on the Affymetrix platforms GPL96, GPL97, and GPL570. We applied our proposed signature, with Random Forests and other machine learning methods, to 57 different gene expression datasets related to 12 different cancer types, and noticed that Random Forests outperformed the other algorithms with respect to the average MCC results. We analyzed the results obtained by Random Forests and our prognostic pan-cancer signature on these 57 datasets to verify its capability to classify deceased patients and survived patients. Our pan-cancer signature achieved a sufficient MCC on 33.33% of these datasets, at least one sufficient confusion matrix rate on 55 datasets out of 57, and sufficient ROC AUC and PR AUC on almost 60% of these 57 datasets.

We then compared these results with the results obtained by each specific cancer type signature on its corresponding cancer type datasets. Our signature outperformed the sigVanLaar2010 colon cancer signature on most colon cancer datasets, the sigHallett2012 breast cancer signature on most breast cancer datasets, the sigGyorffy2013 lung cancer signature on most lung cancer datasets, and was outperformed by the sigCangelosi2020 neuroblastoma signature on the only neuroblastoma dataset.

Afterwards, we compared the results attained by our pan-cancer signature with the results obtained by other pan-cancer signatures that we found in the literature on the same 57 datasets: the sigNagy2021 signature and the sigYu2021 signature. Our prognostic pan-cancer signature outperformed these two signatures on more than 70% of the datasets.

These results show that, although not perfect, the genes of our signature have a relevant role in pan-cancer prognosis and can serve as an effective starting point for future studies on this theme. In the future, researchers can explore the genes of our pan-cancer signature to derive new signatures from subgroups of the signature genes. A clear limitation of our signature is that it obtained sufficient MCC results only on 20 datasets out of 57. Our initial goal, however, was so ambitious that this outcome remains relevant: we initially wanted to create a pan-cancer signature made of a list of genes able to discriminate between survived patients and deceased patients for all possible cancer types. Given this ambitious aim, a prognostic signature that works well on 33.33% of the datasets already represents a sufficient and relevant result.

Additionally, as mentioned earlier, our prognostic pan-cancer signature outperformed two other pan-cancer signatures on most of the datasets, and almost every cancer type-specific signature on its corresponding cancer type-specific datasets. Our proposed pan-cancer signature was outperformed only by the sigCangelosi2009 neuroblastoma signature on the dataHiyama2009 neuroblastoma dataset. We believe this result is due to the orientation of our pan-cancer signature towards common cancer types, such as lung cancer, breast cancer, and colon cancer. Neuroblastoma is a rare, genetic, pediatric cancer, and its genetic specificity makes it different from the main cancer types such as colon cancer. We therefore believe our prognostic signature can be considered effective on common cancer types, but less effective than cancer type-specific signatures on datasets of rare pediatric cancers.

Our results also confirmed the efficacy of Random Forests, a relatively new ensemble machine learning method that has become widespread in biomedical informatics studies.

To better understand the pan-cancer role of our signature, we then investigated the pathways, the protein-protein interactions, and the functional annotations associated with our signature's gene list.

The pathway enrichment analysis carried out with pathDIP and g:Profiler g:GOSt suggested that the genes of our signature are related to the interaction of cancer cells with each other and with other cell types present in the tumour micro-environment, and to other fundamental biological aspects such as the immune system and cell death. Moreover, the analysis of the protein-protein interactions related to our pan-cancer signature carried out with IID highlighted the role of proteins known to be associated with several cancer types and with cancer hallmarks. The additional analysis of the protein-protein physical interactions found by STRING highlighted the proteins of the PIK3R2 (phosphoinositide-3-kinase regulatory subunit 2) and FN1 (fibronectin 1) genes as fundamental hubs in our signature, indicating an important role of these genes for pan-cancer.
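The notion of a hub in a protein-protein interaction network can be illustrated with a simple degree count: the proteins with the most distinct interaction partners are the hub candidates. The sketch below assumes an undirected edge list and uses placeholder protein names; the function name and the toy network are hypothetical, and degree is only one of several possible centrality measures, not necessarily the criterion applied in the STRING analysis.

```python
from collections import Counter

def degree_hubs(edges, top_n=1):
    """Return the top_n nodes of an undirected network ranked by degree."""
    degree = Counter()
    seen = set()
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        pair = frozenset((a, b))
        if pair in seen:
            continue  # count each undirected interaction only once
        seen.add(pair)
        degree[a] += 1
        degree[b] += 1
    return degree.most_common(top_n)

# Toy network with placeholder names (not real interaction data).
toy_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P3", "P1"), ("P5", "P1"),
]
print(degree_hubs(toy_edges))
```

In this toy network, P1 has four distinct partners and is therefore the only hub candidate; on a real network, the same count would be run on the interactions exported from the chosen database.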

Moreover, it is interesting to notice that the most relevant pathways found by pathDIP for our pan-cancer signature are known to be related to general aspects of cancer, and that these associations have already been shown through wet lab, non-computational techniques. Examples include photodynamic therapy-induced HIF-1 survival signaling [80, 81], androgen receptor signaling [82], direct p53 effectors [83], and the HIF-2-alpha transcription factor network [84].

Regarding limitations, we report that we employed here only microarray gene expression data, and did not use RNA-Seq data, which is a more modern data type. Additionally, we could not use the TCGA data [8], a resource now often employed for pan-cancer studies, because we based our study on Affymetrix probesets that are compatible among different GEO datasets but not directly compatible with the TCGA data. For the same reason, we decided to use no data from ArrayExpress [85], which is a large alternative repository of gene expression data.

In the future, we plan to use subgroups of genes indicated by the protein-protein interaction analysis as potential novel pan-cancer signatures.