Background

The Cancer Genome Atlas (TCGA) has generated comprehensive molecular profiles including somatic mutation, copy number variation, gene expression, DNA methylation, microRNA expression, and protein expression for more than 30 different human tumor types [1]. Those large datasets provided a great opportunity to examine the global landscape of aberrations at DNA, RNA and protein levels. Pan-cancer analyses have provided comprehensive landscapes of somatic mutations [2,3,4,5], somatic copy number alterations [6], mutations in chromatin regulatory factor genes [4], viral expression and host gene fusion [7] in those tumors. Integrated analysis of 12 tumor types using data from gene expression, microRNA expression, protein expression, copy number variation, and DNA methylation revealed genomic features that many tumor types had common as well as features unique particular tumor types [8].

Tumor classifications based on gene expression data have revealed distinct tumor subtypes and uncovered expression patterns that were associated with clinical outcomes [9,10,11,12,13,14]. Landmark studies like those demonstrated that gene expression data can provide valuable information about tumor characteristics which allow targeted options for treatment and for patient care and management. TCGA RNA-seq gene expression data provides a great opportunity to discover features that can distinguish different tumor types. Those features may serve as biomarkers for tumor diagnosis and/or potential targets for drug development.

Sex differences in cancer susceptibility are one of the most consistent, but least understood, findings in cancer epidemiology [15, 16]. Males are more prone to develop cancer and have worse overall survival than females with the same tumors [17, 18]. For instance, female patients with melanoma tend to exhibit longer survival than male patients [19]. Males have a threefold greater risk for developing bladder cancer than females [20]. Hepatocellular carcinoma, the most common liver cancer, occurs mainly in men. Sex differences in immune response [21] and hormones [22] may play a role. Although additional factors such as sex chromosomes and life style may also contribute, the mechanisms that influence sex differences in cancer susceptibility remain largely unknown. Thanks to TCGA, large scale analyses of differences between male and female patients become possible and start to emerge [22,23,24,25]. For a recent review on sexual dimorphism in cancer, see [26]. Knowing when features that distinguish tumor types differ between genders might enhance the utility of such features as biomarkers.

We undertook a comprehensive pan-cancer classification of 9096 tumor samples from 31 tumor types from TCGA using RNA-seq gene expression data. We aimed to identify a set of genes whose expression levels can classify all 31 TCGA pan-cancer tumor types. We also carried out the same pan-cancer classification on the gene expression data from 602 “normal” tissue samples taken adjacent to tumors for 17 tumor types. We compared the top-ranked discriminative genes from both tumor and “normal” samples and concluded that most discriminative genes that we identified reflected tumor-type differences rather than simply tissue-of-origin differences. Moreover, we sought to identify, separately in men and in women, analogous sets of genes that can distinguish the 23 sex non-specific tumor types. We hope to gain insight into sexual dimorphism in some tumors from those analyses.

Data

We downloaded all (March, 2015) UNC RNASeqV2 level 3 expression data from the TCGA data portal (https://tcga-data.nci.nih.gov) for 9096 patients representing 31 tumor types (Table 1) and for 602 “normal” samples taken adjacent to tumors representing 17 tumor types (Additional file 1: Table S1). We log2-transformed the TCGA normalized read counts (rsem.genes.normalized) for the RNA-seq data (Because read depths ≤1 Reads Per Kilobase per Million are mostly noise, we filtered them by assigning all values less than 1 the value 1 before transformation.).

Table 1 Tumor types and number of TCGA RNA-seq samples used in the analysis

For the sex non-specific tumor classification, we eliminated all tumor types that are sex-specific, namely, BRCA, CESC, OV, PRAD, TGCT, UCEC and UCS. For the remaining tumor types, we separated samples into two groups based on the patients’ gender. We then eliminated three additional tumor types (CHOL, DLBC and KICH) due to small gender-specific sample sizes. At the time of analyses, data for two new tumor types (ESCA and STAD) became available and were included in the analysis. This brought the total number of sex non-specific tumor types to 23 with 2638 females and 4081 males RNA-seq samples. The numbers of samples for each tumor type from each gender are listed in Table 1.

Methods

We used the GA/KNN method [27, 28] for pan-cancer classification. GA/KNN employs a genetic algorithm (GA) as the gene/feature selection engine and the k-nearest neighbors (KNN) algorithm as the classification tool. GA/KNN can identify gene signatures that not only can separate different classes of samples but also may uncover subtypes within a class. One valuable characteristic of GA/KNN is that, for each training/testing partition, it identifies many near-optimal feature sets and uses each feature set to predict the testing-set samples. Because the algorithm classifies each sample multiple times, one can calculate the proportion of times that each sample was predicted to be each of the 31 classes plus a category of unclassifiable due to ties (proportions sum to 1). Furthermore, one can also assess the relative importance of each gene for sample classification by counting how often that gene appears in those near-optimal feature sets.

In a genetic algorithm, the “chromosome” encodes the candidate solution - the gene signature in this case. A collection of “chromosomes” is referred to as a “population”. In this analysis, the chromosome length was set to 20 (a 20-gene set). The population size was set to 300 chromosomes. The maximal number of “generations” was set to be 300. For KNN classification, k was set to 5 with a majority “voting” rule. We selected these parameters based on an earlier comprehensive analysis of the effect of the choice of parameters on both gene selection and classification accuracy [27].

We randomly divided the data into a training (75% of the samples, e.g., ~6800 samples for pan-cancer classification) and a testing set (25% of the samples, ~2300 samples) with samples drawn proportionally from each tumor type without replacement. The genetic algorithm stopped either when the best “chromosome” in the current “population” classified at least 90% of the training samples correctly or when the search reached a predefined maximal number of “generations” (see below). We refer to the resulting gene set as a near-optimal classifier. The near-optimal classifier was subsequently used to predict the class membership of the samples in the testing set. The predicted and actual class memberships were then compared to calculate the testing-set prediction accuracy. Because the number of features (genes) is much larger than the number of samples (commonly referred to as small n large p), multiple equally discriminative feature sets may exist. We repeated the above GA/KNN procedure 1000 times with the training and testing partitioning unchanged, resulting in 1000 near-optimal classifiers (not necessarily distinct) and 1000 testing prediction accuracies.

The prediction accuracy may vary depending on which samples are assigned to the training set. Given the large size of the pan-cancer gene expression dataset and the high computational demand of the algorithm, we only repeated the above procedure twice, each with an independent training/testing partition to avoid idiosyncrasies from use of a single random assignment. For the sex non-specific pan-cancer classification, we were able to repeat the above procedures five times each for males and for females because of the sample size reduction. For each gender, we combined results from all five independent training/testing partitions. Specifically, if a sample appeared in more than one test set, we averaged the results (see below).

To assess whether the top-ranked discriminative genes that we identified from the tumor samples were specific to the tumors themselves or to the tissue type where the tumors originated, we carried out the same “pan-cancer” classification on the gene expression data from 602 “normal” RNA-seq samples representing 17 tissue types (Additional file 1: Table S1). In addition, we used these “normal” samples to compare performance between GA/KNN and a gradient boosting-based classifier named XGBoost [29]. Specifically, we randomly generated 10 different training/testing partitions with 75% of samples as training and 25% as testing; samples were draw proportionally to their class size.

For our GA/KNN analysis of the “normal” samples, we used the same parameter settings as for the tumor samples. To decide on parameter settings for XGBoost, we first carried out a grid search for the optimal hyperparameters over ranges that we believed were close to optimal from our previous experience with XGBoost on gene expression data. We used 10-fold cross-validation (repeated 10 times) on all RNA-seq samples and chose as optimal the hyperparameters that gave the best averaged cross-validation results (Additional file 2: Table S2). For our XGBoost analyses, we set the number of trees (boosts) to 200, with early stopping criteria when the minimum training error did not improve in 20 rounds. The average number of boosts needed was ~19 (minimum = 7 and maximum = 46). Since XGBoost is a stochastic classifier, we ran XGBoost with the optimal hyperparameters for 1000 times for each of the 10 training/testing partitions. We rank all genes based on the average of times a gene is selected to build the forest from all repeated runs. For each of the 10 testing datasets, we computed the classification accuracy.

All results presented in the remainder of the manuscript are based on samples from testing sets that were not involved in the training process.

Results

Pan-Cancer classification of all tumors ignoring gender

A sample may be unclassifiable by KNN due to the failure of any single tumor type to be in the majority among its nearest neighbors. Thus, given a test sample, it could be classified into one of the 31 tumor types or this unclassifiable category. When the GA/KNN algorithm is applied in many independent runs (here 2000), the proportion of times each sample is predicted to be each of the 32 classes can be obtained (Additional file 3: Table S3). Those 32 proportions sum to 1. One among them is the proportion of GA/KNN runs that a sample was predicted to be its own type, i.e., correctly classified (bolded in Additional file 3: Table S3). For simplicity, we referred to this proportion as proportion-times-correctly-classified (denoted πcc) throughout the manuscript. Summary statistics for πcc for each tumor type are shown in Table 2.

Table 2 Summary statistics for πcc values when classifying 31 tumor types and ignoring sex of the samples across 1000 GA/KNN runs for each of two training/testing partitions (2000 runs total)

The median value of πcc across samples from a given tumor type was in the range of 90–100% for most tumor types. Tumor types such as DLBC, BRCA, LAML, LGG, PCPG, OV, THCA, and UVM had among the highest median πcc values, suggesting that those tumor types could be easily distinguished from all others. For example, BRCA samples were overwhelmingly correctly predicted to be BRCA (Fig. 1). In contrast, the median πcc values for CHOL, READ and UCS were rather low (0.400, 0.136, and 0.255), indicating that those tumors were often classified to types other than themselves (Fig. 1). A close examination showed that the reasons for the low proportions among the four tumor types were not the same. For CHOL, the πcc were the largest among the 32 proportions for 11 of the 15 test samples, suggesting that those samples were still likely to be assigned to CHOL. Among the four misclassified samples, one (TGCA-W5-AA39) was consistently mis-assigned to LIHC (liver) and one (TCGA-3X-AAV9) to PAAD (pancreatic). No clear patterns were seen for the remaining two. For READ, all samples were most often mis-assigned to COAD. About half of the UCS samples were mis-assigned, most often to “unclassifiable” or to UCEC. Samples from three kidney tumor types (KICH, KIRC, and KIRP) were largely correctly classified; those misclassified were assigned to the other kidney tumor types, rather than to tumors in different organs. This mis-assignment within organ was also true for the two lung tumor types (LUAD and LUSC). In essence, the main cause for misclassification among the tumor types appeared to be similarity in their tissue of origin.

Fig. 1
figure 1

Proportion of test-set samples predicted to be each of the 31 tumor types. Y-axis lists the 31 actual tumor types; x-axis lists the 32 possible classification categories (31 tumor types plus “unclassified” [UC]). Each bar represents one of the 32 proportions that samples from the actual tumor type were predicted to be. The 32 plotted proportions represent means from the corresponding proportions for all samples of the actual tumor type

The above analysis was explicitly based on the πcc values. As an alternative way to assess accuracy, we can proceed as follows. For each test-set sample, we can determine its predicted tumor type for each of the 2000 GA/KNN runs and use that information to determine the modal prediction, the tumor type to which the sample was assigned most often. If the modal prediction matched the actual tumor type, we regarded the prediction as correct. The proportion of correct predictions across all the samples of a given tumor type measures what we call the modal prediction accuracy for that tumor type. These modal prediction accuracies (rightmost column in Table 2) are often higher than the corresponding median value of πcc. Averaging across all tumor types, the overall modal prediction accuracy was 95.6% (weighted by number of samples in each tumor type).

Top-ranked genes

From each of the two independent training/testing partitions, we obtained 1000 sets (a set consists of 20 genes) of near-optimal classifiers (2000 sets altogether); and we calculated the frequency with which each gene appeared in those sets (Fig. 2). We regard the frequencies as indicative of the importance of the corresponding genes for sample classification. Remarkably, there are 40 genes in the intersection of top 50s from both partitions and 60 genes in the union of top 50s, indicating that our results were largely reproducible. Those genes from one partition that did not appear in the top 50 from the other partition were all among the top 100 in the other partition, variation likely attributable to the stochastic nature of the algorithm. We combined the counts from the two independent runs. Gene ontology analysis of the top 200 genes in the combined list suggested that those genes are highly enriched in genes implicated in the biological process of development (Table 3).

Fig. 2
figure 2

Stem plot of gene selection frequency based on 2000 near optimal gene selection classifiers from 1000 GA/KNN runs for each of two training/testing partitions

Table 3 Enriched gene ontology (GO) terms for the top 200 genes from the pan-cancer classification of all 9096 samples ignoring the gender

The 20 most frequently selected genes were TCF21, TBX5, EMX20S, EMX2, PA2G4P4, HNF1B, ATP5EP2, NACA2, PTTG3P, FTH1P3, SFTPA1, HSPB1P1, GATA3, NAPSA, ANXA2P3, IGPB1P1, HOXA9, STFA3, RPL19P12, and SFTPA2. A heatmap representation of the relative expression levels of the top 50 genes across all 9096 tumor samples is shown in Fig. 3.

Fig. 3
figure 3

Heatmap representation of the expression patterns of the top 50 genes across all 9096 samples. Each row (gene) was centered by the median expression value across all samples. A hierarchical clustering analysis was carried out for both samples and genes using the Euclidean distance as the similarity metric

TCF21, the most frequently selected gene, encodes a transcription factor of the basic helix-loop-helix family. The TCF21 product is mesoderm specific and expressed in embryonic epicardium, mesenchyme-derived tissues of lung, gut, gonad, and both mesenchymal and glomerular epithelial cells in the kidney. It is required for normal heart development [30,31,32]. TBX5, a member of the T-box genes, encodes a transcription factor that is involved in the regulation of developmental processes.

Five surfactant genes (SFTA3, SFTPA1/A2, and SFTPB/C) were among the top 50. All five genes were highly expressed in LUAD and LUSC and low in all other tumors except that SFTPB and STFA3 were also highly expressed in THCA. Very few other genes showed such tumor specificity.

About one third of the top 50 genes encode transcription factors (TFs) and another one third encode proteins involved in cell adhesion, ion and small molecular transport, protein synthesis and folding, and lung function. Surprisingly, the final third contains 14 pseudogenes and two antisense non-coding genes (EMX2OS and GATA3-AS1).

Pseudogenes

Pseudogenes were significantly enriched among the top 50 genes (14 pseudogenes) (P = 1.2 × 10−19, hypergeometric test) as well as among the top 100 genes (27 pseudogenes) (P = 1.5 × 10−17).

The top-ranked pseudogene was PA2G4P4 (proliferation-associated 2G4 pseudogene 4). Its functional counterpart is PA2G4 (proliferation-associated 2G4). PA2G4 is an RNA-binding protein present in pre-ribosomal ribonucleoprotein complexes and is involved in growth regulation. PA2G4P4 was highly expressed in nearly all tumor samples with the overall highest expression in HNSC, OV, and SARC. Expression levels of PA2G4P4 were positively correlated with those of PA2G4 in all tumor types except KIRC, PRAD, and THCA (Additional file 4: Figure S1) – remarkable given that PA2G4 is on chromosome 12 and PA2G4P4 is on chromosome 3. The expression of PA2G4P4 was also correlated with that of NACA2 in about one third of the tumor types (data not shown). NACA2, the nascent polypeptide associated complex alpha subunit 2 gene, is, like PA2G4, involved in growth regulation and RNA processing.

Interestingly, none of the functional counterparts of the 14 pseudogenes was among the top 100 most frequently selected genes, indicating that the expression of the pseudogenes might better discriminate tumor types than expression of their functional counterparts.

Putative tumor subtypes

Tumors are heterogeneous [33]. For example, breast cancer is known to have several distinct subtypes [12,13,14, 34]. To see if the top-ranked genes for pan-cancer classification can further uncover tumor subtypes, for each tumor type we carried out k-means clustering using expression data for the 50 top-ranked pan-cancer genes that had non-zero interquartile ranges for the particular tumor type. Consequently, each tumor type had a slightly different 50-gene set. In this exploratory analysis, we set the k = 2 or k = 3, that is, we allowed two or three subgroups for each tumor type. We selected an optimal k using the silhouette method [35]. All tumor types can be divided into 3 sub-groups based on the minimum averaged silhouette scores, except SARC and THYM. In addition, we have excluded tumor types GBM, LAML, and OV that have incomplete survival data. Based on the survival analyses, the tumors which have putative subgroups that are correlated with survival after the Bonferroni correction (i.e., P < 0.001) (Additional file 5: Methods) included ACC, BLCA, BRCA, KIRC, KIRP, LGG, and PAAD.

The heatmaps of ACC, BLCA, BRCA, KIRC, KIRP, LGG, and PAAD from k-mean clustering analysis are shown in Additional file 6: Figure S2 and the subgroups associations to survival outcomes are highlighted in Additional file 7: Figure S3. BRCA tumors have distinct subtypes, and the subtypes are associated with survival outcomes [13]. Using the top-ranked genes, we largely recapitulated the ER+ and basal-like subtypes (Additional file 6: Figure S2c) although the association between patients’ survival and our subtypes is only marginally significant after multiple testing adjustment (P = 0.008, Additional file 7: Figure S3c). It is worth pointing out that the subset of genes used in subtype discovery was identified from pan-cancer classification only.

Classification of samples of “normal” tissue taken adjacent to tumors

Using GA/KNN, we correctly classified on average 87.6% test set “normal” samples. The heatmap representation of the relative expression levels of the top 50 genes across all 602 “normal” samples are shown in Additional file 8: Figure S4. Among the 100 top-ranked discriminative genes for “normal” samples using GA/KNN, only 18 were in common with the top 100 discriminative genes from tumor samples (C11orf9, EMX2, EMX20S, ESR1, FOXF1, FTHL3, GAL3ST1, HAND2, HOXA11, HOXA11AS, HOXA9, IRX5, NACA2, NBLA00301, PA2G4P4, SFTPD, TBX5, and TCF21). Restricting to the top 50 discriminative genes, the corresponding overlap was eight. This result suggests that most genes that we identified as distinguishing among the 31 tumor types are differences among the tumor types themselves and not simply reflecting differences among the tissues where the tumors originated.

Comparisons of classification accuracy for GA/KNN and XGBoost for 10 testing sets are shown in Additional file 9: Figure S5. The averaged accuracies are comparable between the two methods (87.6% for GA/KNN vs 90.2% for XGBoost). The heatmap representation of the relative expression levels of the top 50 genes selected by XGBoost across all 602 “normal” samples is shown in Additional file 10: Figure S6. Among the 100 top-ranked discriminative genes from GA/KNN on “normal” samples and corresponding top 100 -from XGBoost, only 16 were in common (C15orf21, C19orf20, CALML3, DSG3, FOXF1, FOXL2, FTHL3, KIF12, KRT6A, NACA2, PCSK1N, SCARNA9, SFTPD, TCF23, TSSK6, and ZMYND17).

Pan-Cancer classification of sex non-specific tumors

For the 23 sex non-specific tumor types, we had 4081 samples from males and 2638 from females. The sample imbalance favoring males persisted in most individual tumor types except for ACC, LUAD, PCPG, SARC, and THCA (Table 1). For each gender, we carried out 1000 independent GA/KNN runs for each of five independent training/testing partitions.

The quartiles for the πcc values for males and females over the 5000 total runs are listed in Table 4 (Additional file 11: Figure S7). Overall, the results recapitulated those from our pan-cancer analysis of 31 tumor types that ignored gender. Those tumor types with high prediction accuracy remained high and those with low accuracy stayed low regardless whether gender was considered or ignored. All READ samples were predicted to be COAD for both genders. The prediction accuracies for BLCA, ESCA, and MESO were relatively low compared to other tumor types regardless of gender.

Table 4 Quartiles for πcc values when classifying 23 non-sex-specific tumor types separately using male and female samples across 1000 GA/KNN runs for each of five training/testing partitions (5000 runs total)

To see the subtle differences between males and females, here we considered the top 100 genes from each gender. The union for the top 100 genes from males and from females contained 125 genes and the intersection contained 75 genes (Additional file 12: Table S4). Two heatmap representations of the relative expression levels of the top genes across all male and female tumor samples are shown in Additional file 13: Figure S8. Many genes had similar ranks in both genders; 21 differed by more than 100 in rank (Table 5). Rank sum tests showed that all 21 genes were differentially expressed in samples between males and females in at least one tumor type (data not shown). In the following paragraphs, we focus on genes that were largely differentially expressed between females and males in tumor samples and whose possible role in sexual dimorphism received support in existing literature.

Table 5 Gene ranks from full female dataset, full male dataset, and the eight “matched” male datasets

FOXA1 had rank 82 in females and 417 in males, suggesting that FOXA1 expression level might be more important for distinguishing sex non-specific tumors in females than in males. FOXA1 had significantly higher expression in LIHC in females than in males (P = 9.9 × 10−5, rank sum test, two-sided). Li et al. [22] elegantly showed that FOXA1/A2 transcription factors regulate estrogen signaling differently in liver and mammary gland, that this female hormone is protective for liver cancer in mice and that this protection requires FOXA1/A2. Upon exposure to hepatocarcinogens, the tumor load in mutant FOXA1/A2 female mice was dramatically increased whereas the tumor load in mutant FOXA1/A2 male mice was dramatically decreased [22]. Besides LIHC, FOXA1 had significantly higher expression in HNSC (P = 2.4 × 10−3, rank sum test, two-sided) and KIRP (P = 6.5 × 10−4, rank sum test, two-sided) in females than in males (Fig. 4). Whether FOXA1 might also play a role for sexual dimorphism in HNSC and KIRP remains unclear.

Fig. 4
figure 4

Boxplots of FOXA1 expression data in the 23 sex non-specific tumors from males (blue) and females (pink)

On the other hand, BNC1 ranked high in males (45th) and low in females (932nd). BNC1 (basonuclin 1) is a zinc finger protein that is thought to play a regulatory role in epithelial proliferation. BNC1 modulates TGF-β1-induced epithelial dedifferentiation of mammary epithelial cells [36]. BNC1 had significantly higher expression in females than in males (Additional file 14: Figure S9) in HNSC (P = 8.7 × 10−3, rank sum test, two-sided), LIHC (P = 5.4 × 10−5, rank sum test, two-sided), and THCA (P = 2.4 × 10−3, rank sum test, two-sided). Interestingly, BNC1 is also a putative ER receptor 1 (ESR1) target as ESR1 was bound in the proximal promoter of BNC1 in T-47D cell line (ENCODE data on UCSC genome browser), raising the possibility that BNC1 might also play a role in sexual dimorphism in liver cancer similar to FOXA1.

Among the 21 genes, FAT2 had larger differential expression between males and females in KIRP (P = 7.5 × 10−7, rank sum test, two-sided) than in any other sex non-specific tumor. FAT2 encodes a tumor suppressor essential for controlling cell proliferation during Drosophila development [37]. FAT2 is a member of the cadherin superfamily and most likely functions as a cell adhesion molecule [38]. FAT2 was frequently mutated in clear cell renal cell carcinoma [39, 40]. It is not clear whether FAT2 plays a role in sexual dimorphism in KIRP.

To see if the differences that we observed in both prediction accuracy and gene ranks between males and females were due to the imbalance of sample proportions among the tumor types, we generated eight male datasets that matched approximately both the total number of samples and tumor proportions as those in females by taking random samples from males without replacement. We repeated the same pan-cancer classification procedure on each of the eight “matched” male datasets as above. The mean and median πcc values from the full female dataset, the full male dataset and the eight “matched” male datasets are shown in Additional file 15: Table S5. Female-male differences in mean or median πcc values observed from original dataset were strongly reduced (Additional file 15: Table S5) when the sample proportions were balanced between the genders. The difference in gene ranks remained (Table 5), however. Basically, genes, such as BNC1 that ranked high from the full male dataset remained high from the matched male datasets, and those ranked low remained low, although a shrinking of rank differences is also apparent (Table 5).

Discussion

Gene expression data can be used to classify tumor types and uncover tumor subtypes that may suggest targeted treatment options. We carried out a pan-cancer classification of ~9100 TCGA tumors from 31 tumor types using RNA-seq gene expression data. We found that, among the 31 tumor types, BRCA, GBM, HNSC, KIRC, LAML, LGG, LIHC, OV, PCPG, PRAD, SKCM, THCT, THCA, THYM, and UVM were more easily distinguished from all other tumor types. Tumors from similar tissue origins (e.g., READ and STAD; UCS and UCEC) are usually more difficult to distinguish from each other than those from different lineages (e.g., READ vs LAML). In an extreme, nearly all READ samples were indistinguishable from COAD samples. Surprisingly, the three kidney tumors (KICH, KIRC, and KIRP) were distinguished from each other and from all other tumor types using gene expression data alone.

We were able to correctly classify more than 90% of the tumor samples overall using many different 20-gene sets, though some genes appeared repeatedly in the sets. Both sample prediction accuracies and gene rank (a measure of importance in classification) were largely reproducible.

We showed that the top ranked genes from the pan-cancer analysis were able not only to distinguish different types of tumor samples but also to uncover potential subtypes within some tumor types. The top 50 genes from our analysis largely captured the ER positive luminal A or luminal B and ER negative basal-like subgroups – subgroups that have distinct survival profiles. For BLCA, KIRC, KIRP, LGG, and PAAD, patients in the three putative subgroups had differential survival.

Our primary analysis tool was GA/KNN, a supervised classification method that carries out feature selection and classification simultaneously [27, 28]. To compare its performance with a more recent supervised method, XGBoost (gradient boosting machines), we ran both tools on ten training/testing partitions generated from the 602 “normal” RNA-seq samples. Test set classification accuracy (~90%) was comparable for both methods. Despite similar classification performance, the top-ranked discriminative genes derived from two methods showed little overlap. Clearly multiple sets of genes may give similar classification performance, but which tool provides a gene list with more biological relevance or utility beyond classification remains an open question – and one that is impossible to address by algorithmic methods alone.

Though unlikely by chance, one third of the top 50 genes were pseudogenes. Interestingly, none of their functional counterparts ranked among the top 100 genes, suggesting that those pseudogenes may better serve as features than their functional counterparts in distinguishing among tumor types. Pseudogenes share high sequence homology to their functional counterparts but in most cases contain deletions/insertions and frameshift mutations or harbor premature stop codons that make them unable to produce functional proteins [41, 42]. Only 10% of human genes have a pseudogene counterpart, and some have just one pseudogene whereas others have multiple pseudogenes [43]. Many pseudogenes have been implicated in tumor biology. Pseudogenes can regulate the expression of their functional counterparts and play a role in tumor development [44,45,46]. For example, PTENP1, a PTEN pseudogene, can regulate the level of PTEN in cells and exert a growth-suppressive role [46]. The positive correlation that we observed between expression of pseudogene PA2G4P4 and that of PA2G4 suggests that this pseudogene may also regulate the expression of its functional counterpart. Recent reviews describe the role of pseudogenes in normal cellular function and in diseases [42, 47]. A pan-cancer analysis of pseudogene expression in ~2800 patient samples showed that a significant number of pseudogenes are differentially expressed and their expression can classify the major histological subtypes of endometrial cancer [48]. Quantification of pseudogene expression in 13 cancer and normal tissue types found evidence of a wide-spread expression of pseudogenes in cancers and identified cancer/tissue-specific pseudogene expression patterns [49]. Seven (PA2G4P4, ATP5EP2, FTH1P3, ANXA2P3, ANXA2P1, HNRNPA1P33, and HSP90B3P) of the 14 pseudogenes that our analysis revealed as important for pan-cancer classification were previously found to be differentially expressed in various cancers.

Lastly, by comparing the top-ranked discriminative genes from “normal” samples to those from tumor samples, we provide evidence that the top-ranked discriminative genes from the tumor samples likely reflect tumor-specific expression differences rather than simply reflecting expression differences attributable to their underlying tissues of origin.

Sexual dimorphism in cancer prevalence and survival between males and females is well-documented but little understood [15, 16]. To see if gene importance in distinguishing the same tumor types differs between males and females, we also carried out pan-cancer classification on 23 TCGA sex non-specific tumor types separately using samples from males and from females. We found that similar prediction accuracies were obtained in 31 pan-cancer and 23 sex non-specific tumor types in both males and females. While most genes had similar ranking for their contribution to tumor type classification in both genders, 21 of the top 100 genes differed in rank by more than 100 between the genders, suggesting that those genes may differ in importance for distinguishing tumor types between males and females. FOXA1 is a known contributor to sexual dimorphism in liver cancer in mice [22]. Our analysis suggested that FOXA1 expression is more important for distinguishing sex non-specific tumors types in female tumor samples than in male samples. FOXA1 had significantly higher expression in HNSC, KIRP, and LIHC from females than from males. FOXA1 is transcriptionally regulated by ESR1 in liver. It is unclear whether FOXA1 is also regulated by ESR1 in head and neck and kidney; if it were, FOXA1 would likely also have a role in sexual dimorphism in those tumors. Our analysis also suggested that BNC1 expression is important for distinguishing sex non-specific tumors in males but not in females. BNC1 is also a putative ESR1 target as ESR1 was bound in the proximal promoter of BNC1 in T-47D cell line, raising the possibility that BNC1 may also have a role for sexual dimorphism in liver cancer.

Conclusion

In conclusion, using RNA-seq gene expression alone, we were able to identify many sets of 20 genes that could correctly classify more than 90% of the samples from 31 different tumor types in a validation set. This accuracy is remarkable given the number of the tumor types and the total number of samples involved. This result was largely replicated when we analyzed 23 non-sex-specific tumor types separately for males and females. Genes appearing in the sets of 20 largely overlapped among sets. We regard the frequency with which a gene appeared in those sets as measuring its importance for tumor classification. One third of the 50 most frequently appearing genes were pseudogenes; the degree of enrichment may be indicative of their importance in tumor classification. Lastly, we identified a few genes that might play a role in sexual dimorphism in certain cancers.