Background

Genetic alterations conferring selective advantages to cancer cells are the main drivers of cancer evolution and hunting for them has been at the core of international cancer genomic efforts [1,2,3]. Given the instability of the cancer genome, distinguishing driver alterations from the rest relies on analytical approaches that identify genes altered more frequently than expected or quantify the positive selection acting on them [4,5,6]. The results of these analyses have greatly expanded our understanding of the mechanisms driving cancer evolution, revealing high heterogeneity across and within cancers [7,8,9].

Recently, deep sequencing screens of non-cancer tissues have started to map positively selected genetic mutations in somatic cells that drive in situ formation of phenotypically normal clones [10, 11]. Many of these mutations hit cancer drivers, sometimes at a frequency higher than the corresponding cancer [12,13,14,15,16]. Yet, they do not drive malignant transformation. This conundrum poses fundamental questions on how genetic drivers of normal somatic evolution are related to and differ from those of cancer evolution. Addressing these questions will clarify the genetic relationship between tissue homeostasis and cancer initiation, with profound implications for cancer early detection.

To assess the extent of the current knowledge on cancer and non-cancer drivers, we undertook a systematic review of the literature and assembled a comprehensive repertoire of genes whose somatic alterations have been reported to drive cancer or non-cancer evolution. This allowed us to compare the current driver repertoire across and within cancer and non-cancer tissues and map their alterations in the large pancancer collection of samples from The Cancer Genome Atlas (TCGA). This revealed significant gaps and biases in our current knowledge of the driver landscape. We also computed an array of systems-level properties across driver groups, confirming the unique evolutionary path of driver genes and their central role in the cell.

We collected all cancer and non-cancer driver genes, together with a large set of their properties, in the Network of Cancer Genes and Healthy Drivers (NCGHD) open-access resource.

Results

More than 3300 genes are canonical or candidate drivers of cancer and non-cancer somatic evolution

We conducted a census of currently known drivers through a comprehensive literature review of 331 scientific articles published between 2008 and 2020 describing somatically altered genes with a proven or predicted role in cancer or non-cancer somatic evolution (Fig. 1a). These publications included three sources of experimentally validated (canonical) cancer drivers, 311 sequencing screens of cancer (293) and non-cancer (18) tissues, and 17 pancancer studies (Additional file 1, Table S1). Each paper was assessed by at least two independent experts (Additional file 2, Fig. S1A-C) returning a total of 3355 drivers, 3347 in 122 cancer types and 95 in 12 non-cancer tissues, respectively (Fig. 1a). We further computed the systems-level properties of drivers and annotated their function, somatic variation, and drug interactions (Fig. 1a).

Fig. 1
figure 1

Collection of a comprehensive repertoire of cancer and healthy drivers. a Literature review and driver annotation workflow. Expert literature curation of 331 publications led to a repertoire of cancer and healthy drivers in a variety of cancer and non-cancer tissues. Combining multiple data sources, a set of properties and annotations was computed for all these drivers. b Intersection of canonical drivers from three sources [17,18,19] that passed our manual curation. c Classification of canonical cancer drivers in tumor suppressors and oncogenes. Eighty-one cancer drivers had a dual role or could not be classified. d Intersection of canonical and candidate driver genes from 310 sequencing screens. Genes whose driver role had only statistical support were considered candidate cancer drivers. e Intersection between cancer drivers with coding and non-coding alterations. f Level of support for the driver role of 531 cancer genes with non-coding driver alterations only. Level 1 means that the gene was predicted as a driver only in one cancer sequencing screen; levels 2, 3, and 4 mean that it was predicted by two, three, or four screens or that it had experimental support. Experimental support was gathered from the 19 publications reporting non-coding cancer drivers (Additional file 1, Table S1) and from the CNCDatabase [20] and included in vitro and in vivo experiments, modification of gene expression, and survival association. g Proportion of healthy drivers that are also canonical or candidate cancer drivers, classified as canonical and candidate healthy drivers, respectively

We reviewed the three sources of canonical cancer drivers [17,18,19] to exclude false positives (Additional file 3, Table S2) and fusion genes whose properties could not be mapped. Only 11% of the resulting 591 canonical drivers (Additional file 4, Table S3) were common to all three sources (Fig. 1b), indicating poor consensus even in well-known cancer genes. We further annotated the genetic mode of action for > 86% of canonical drivers, finding comparable proportions of oncogenes or tumor suppressors (Fig. 1c). The rest had a dual role or could not be univocally classified.

We extracted additional cancer drivers from the curation of 310 sequencing screens that applied a variety of statistical approaches (Additional file 2, Fig. S1 D) to identify cancer drivers among all altered genes. After removing possible false positives (Additional file 3, Table S2), the final list included 3177 cancer drivers, 2756 of which relied only on statistical support (candidate cancer drivers) and 421 were canonical drivers (Fig. 1d, Additional file 4, Table S3). Therefore, 170 canonical drivers have never been detected by any method, suggesting that they may elicit their role through non-mutational mechanisms or may fall below the detection limits of current approaches. Given the prevalence of cancer coding screens (Fig. 1a), only coding driver alterations have been reported for most genes (Fig. 1e) while 16% of them (531) were identified as drivers uniquely in non-coding screens. Since the prediction of drivers with non-coding alterations remains challenging, we further investigated the type of support that these genes had for their driver activity. The overwhelming majority of them (467 genes, 87%) have been predicted as drivers in only one screen. The remaining 64 genes are canonical drivers, have been predicted as drivers in multiple screens, or have additional experimental support for their driver activity (Fig. 1f).

Applying a similar approach (Additional file 2, Fig. S1 A-C), we reviewed 18 sequencing screens of healthy or diseased (non-cancer) tissues. They collectively reported 95 genes whose somatic alterations could drive non-malignant clone formation (healthy drivers). Interestingly, only eight of them were not cancer drivers (Fig. 1g, Additional file 4, Table S3), suggesting a high overlap between genetic drivers of cancer and non-cancer evolution. However, since many non-cancer screens only re-sequenced cancer genes or applied methods developed for cancer genomics (Additional file 2, Fig. S1E), this overlap may be overestimated.

The ability to capture cancer but not healthy driver heterogeneity increases with the donor sample size

To compare cancer and healthy drivers across and within tissues, we grouped the 122 cancer types and 12 non-cancer tissues into 12 and seven organ systems, respectively (the “Methods” section).

Despite the high numbers of sequenced samples (Additional file 5, Table S4) and detected drivers (Fig. 1), several lines of evidence indicated that our knowledge of cancer drivers is still incomplete. First, we detected a strong positive correlation between cancer drivers and donors overall (Fig. 2a) and in individual organ systems (Additional file 2, Fig. S2). This suggests that the current ability to identify new drivers depends on the number of samples included in the analysis. Second, candidates outnumbered canonical drivers in all organ systems except those with a small sample size or low mutation rate such as pediatric cancers, where only the most recurrent canonical drivers could be identified (Fig. 2b). Third, large donor cohorts enabled the detection of a broader representation of canonical drivers than small cohorts (Fig. 2c). For example, pooling thousands of samples together led to > 60% of canonical drivers being detected in adult pancancer re-analyses. Therefore, the size of the cohort influences the level of completeness and heterogeneity of the cancer driver repertoire. This is not surprising since all current approaches act at the cohort level, searching for positively selected genes altered more frequently than expected (Additional file2, Fig. S1D).

Fig. 2
figure 2

Distribution of driver annotations by organ system. a Correlation between numbers of sequenced donors and identified cancer drivers across organ systems. Spearman correlation coefficient R and associated p-value are shown. b Number of canonical, candidate, and healthy drivers in each organ system. Horizontal lines indicate the median number of canonical (92), candidate (160), and healthy (17) drivers across organ systems. c Proportion of canonical drivers detected in each organ system over canonical drivers detected in all cancer screens (421). The horizontal line indicates the median across all organ systems (22%). d Proportion of genes with non-coding driver alterations over all cancer drivers in each organ system. The horizontal line indicates the median across all organ systems (4%). Number of canonical (e), candidate (f), and healthy (g) drivers across screens and organ systems. Representative genes with different recurrence between cancer and healthy tissues are indicated. h Organ system distribution of the top eight recurrent healthy drivers. The full list is provided as Additional file 6, Table S5. i Correlation between numbers of sequenced donors and identified healthy drivers across organ systems. Spearman correlation coefficient R and associated p-value are shown

Our analysis also showed that the contribution of non-coding driver alterations remains largely unappreciated and non-coding drivers have not yet been reported in several tumors, including all pediatric cancers (Fig. 2d). Owing to the re-analysis of large whole-genome collections [21,22,23,24,25,26], almost 40% of adult pancancer drivers were instead modified by non-coding alterations (Fig. 2d). Hematologic and skin tumors also had a high proportion of non-coding driver variants thanks to screens focused on non-coding mutations [27, 28]. Therefore, the re-analysis of already available whole-genome data and further sequencing screens of non-coding variants are needed to fully appreciate their driver contribution.

Compared to cancer, sequencing screens of non-cancer tissues are still in their infancy, as reflected by the lower numbers of screened tissues and detected drivers (Fig. 2b). Despite this, some similarities and differences with cancer drivers could already be observed. Like cancer drivers (Fig. 2e, f, Additional file 6, Table S5), also healthy drivers were mostly organ-specific (Fig. 2g) and the most recurrent healthy drivers were also cancer drivers in the same organ system (Fig. 2h, Additional file 6, Table S5). However, some recurrent cancer drivers (KRAS, PI3KCA, NRAS, NF1) were reported to drive non-cancer clonal expansion only in one or two organ systems (Fig. 2g). Therefore, differences start to emerge at the tissue level between drivers of cancer and non-cancer evolution. Moreover, unlike cancer drivers, no correlation existed between the numbers of drivers and donors (Fig. 2i). This is likely affected by the lower number of non-cancer sequencing studies available so far. If additional studies will confirm the absence of correlation, this may indicate that the healthy driver repertoire is easier to saturate since fewer drivers are needed to initiate and sustain non-cancer clonal expansion [10, 11].

Alteration pattern hints at driver mode of action and confirms the incompleteness of the driver repertoire

To gain further insights into their mode of action, we mapped the type of alterations acquired by cancer and healthy drivers in 34 cancer types from TCGA. After predicting the damaging alterations in 7953 TCGA samples with matched mutation, copy number, and gene expression data (the “Methods” section), we identified the drivers with loss-of-function (LoF) and gain-of-function (GoF) alterations in these samples, respectively (Fig. 3a).

Fig. 3
figure 3

Damaging alteration pattern of drivers in TCGA. a Identification of damaged drivers in 7953 TCGA samples. Mutations, gene deletions, and amplifications were annotated according to their predicted damaging effect. This allowed to distinguish drivers acquiring loss-of-function (LoF) or gain-of-function (GoF) alterations. b Number of TCGA samples with damaging alterations (all, LoF, GoF) in canonical drivers that were detected (421) or undetected (170) by cancer driver detection methods. c Proportion of TCGA samples with GoF and LoF alterations in tumor suppressors, oncogenes, and canonical drivers with a dual or unclassified role. Proportion of TCGA samples with GoF and LoF alterations in (d) canonical drivers and (e) candidate drivers. Genes mentioned in the text are highlighted. The two-dimensional Gaussian kernel density estimations were calculated for each driver group using the R density function. f Number of TCGA samples with damaging alterations (all, LoF, GoF) in drivers previously reported in coding and non-coding sequences. g Proportion of samples with variable numbers of all damaged drivers or only canonical drivers. h Proportion of TCGA samples with GoF and LoF alterations in healthy drivers. Canonical and candidate healthy drivers correspond to genes with a known or predicted cancer driver role. i Number of TCGA samples with damaged canonical, candidate, and remaining healthy drivers and the rest of human genes. All distributions were compared using a two-sided Wilcoxon rank-sum test

The comparison between canonical cancer drivers detected and undetected in sequencing screens (Fig. 1d) revealed that the latter were damaged in a significantly lower number of samples, due to fewer LoF alterations (Fig. 3b, Additional file 2, Fig. S3A). GoF alterations were instead comparable between the two groups, suggesting that current driver detection methods fail to identify drivers that undergo copy number gains but are rarely mutated.

We confirmed that the driver alteration patterns reflected their mode of action, with canonical tumor suppressors and oncogenes showing a prevalence of LoF and GoF alterations, respectively (Fig. 3c). Canonical drivers with a dual role resembled the alteration pattern of oncogenes while those still unclassified had a prevalence of LoF alterations, suggesting a putative tumor suppressor role (Fig. 3c). While all frequently altered (> 500 samples) oncogenes were overwhelmingly modified by GoF alterations (Additional file 7, Table S6), 16 of the 22 most frequently altered tumor suppressors had a prevalence of GoF alterations (Fig. 3d). In the majority of cases, this was due to different alteration patterns across organ systems (Additional file 2, Fig. S3B), and a possible oncogenic role has been documented for some others [29,30,31,32,33,34,35,36,37,38].

Since candidate drivers had no annotation of their mode of action, we reasoned that their alteration pattern could hint at their role as tumor suppressors or oncogenes. According to their prevalent pancancer alterations, 1318 candidates could be classified as putative tumor suppressors and 1405 as putative oncogenes (Additional file 7, Table S6). Interestingly, while candidates with predicted coding driver alterations showed similar distributions of LoF and GoF alterations (Fig. 3e), those with only non-coding driver alterations had a significantly lower occurrence of LoF alterations (Fig. 3f, Additional file 2, Fig. S3C). This may suggest an activating role for their non-coding alterations too. Almost all candidates damaged in ≥ 500 samples (111/115) were putative oncogenes (Fig. 3e, Additional file 7, Table S6). Of the four putative tumor suppressors, CSMD3 has a disputed cancer role [39,40,41] and a likely inflated mutation rate [42], while CDKN2B cooperates with its paralog CDKN2A to inhibit cell cycle [43], supporting its tumor suppressor role.

The number of damaged cancer drivers in individual TCGA samples confirmed that, despite all efforts, the current driver repertoire is still largely incomplete. The large majority of samples (71% and 87%, considering all drivers or only canonical drivers, respectively) had less than five damaged drivers, and ~ 15% of them had no damaged driver (Fig. 3g).

Given their high overlap with cancer drivers, most healthy drivers were recurrently damaged in cancer samples with no prevalence of GoF or LoF alterations (Fig. 3h, Additional file 7, Table S6). Interestingly, all healthy drivers, even the eight with no cancer involvement, were damaged in significantly more cancer samples than the rest of human genes (Fig. 3i). Moreover, 57% of TCGA samples had at least two altered drivers, one of which was a healthy driver, further supporting the hypothesis that more than one driver may be needed to promote the transformation of non-malignant clones into cancer [10, 11].

Properties of cancer and healthy drivers support their central role in the cell

A substantial body of work including our own [44,45,46,47,48,49,50,51,52,53] has shown that cancer drivers differ from the rest of the genes for an array of systems-level properties (Fig. 1a) that are a consequence of their unique evolutionary path and role in the cell. Using our granular annotation of drivers, we set out to check for similarities and differences across the driver groups.

We confirmed that cancer drivers, and in particular canonical drivers, were more conserved throughout evolution and less likely to retain gene duplicates than other human genes (Fig. 4a, Additional file 8, Table S7). They also showed broader tissue expression, engaged in a larger number of protein complexes, and occupied more central and highly connected positions in the protein-protein and miRNA-gene networks (Fig. 4a). We reported substantial differences between tumor suppressors and oncogenes, with the former enriched in old and single-copy genes showing broader tissue expression (Fig. 4b, Additional file 8, Table S7).

Fig. 4
figure 4

Systems-level properties of cancer and healthy drivers. Comparisons of systems-level properties between (a) canonical or candidate cancer drivers and the rest of human genes, (b) tumor suppressors and oncogenes, and (c) cancer genes with coding driver alterations and cancer genes with non-coding driver alterations. The normalized property score was calculated as the normalized difference between the median (continuous properties) or proportion (categorical properties) values in each driver group and the rest of human genes (the “Methods” section). Comparisons of systems-level properties between (d) candidate oncogenes with non-coding driver alterations (324) and canonical tumor suppressors, (e) candidate oncogenes (1405) and canonical tumor suppressors, and (f) candidate tumor suppressors (1318) and canonical oncogenes. g. Comparisons of systems-level properties between canonical healthy, candidate healthy, and remaining healthy drivers and the rest of human genes. Proportions of old (pre-metazoan), duplicated, essential genes, and proteins involved in the complexes were compared using a two-sided Fisher’s exact test. Distributions of gene and protein expression, protein-protein, miRNA-gene interactions, and germline variation were compared using a two-sided Wilcoxon rank-sum test. False discovery rate (FDR) was corrected for using Benjamini-Hochberg

We further expanded the systems-level properties of cancer drivers by exploring their tolerance towards germline variation, because this may indicate their essentiality. Using germline data from healthy individuals [54], we compared the loss-of-function observed/expected upper bound fraction (LOEUF) score, which quantifies selection towards LoF variation [54] as well as the number of damaging mutations and structural variants (SVs) per coding base pairs (bp) between drivers and the rest of genes (the “Methods” section). Cancer drivers, and in particular canonical drivers, had a significantly lower LOEUF score and retained fewer damaging germline mutations and SVs than the rest of the genes (Fig. 4a). This indicates that they are indispensable for cell survival in the germline. Selection against harmful variation was stronger in tumor suppressors than oncogenes (Fig. 4b). This was supported by a significantly higher proportion of cell lines where cancer drivers, and in particular tumor suppressors, were essential (Fig. 4a, b), as gathered from the integration of nine genome-wide essentiality screens [55,56,57,58,59,60,61,62,63] (the “Methods” section).

Genes with non-coding driver alterations had weaker systems-level properties than those with coding alterations (Fig. 4c, Additional file 8, Table S7) and the subset of them with > 50% GoF alterations resembled the property profile of oncogenes when compared to tumor suppressors (Fig. 4d, Additional file 8, Table S7). In general, all candidate drivers with a prevalence of GoF were similar to oncogenes, showing a higher proportion of duplicated genes, narrower tissue expression, and higher tolerance to germline variation than tumor suppressors (Fig. 4e, Additional file 8, Table S7). Conversely, candidate drivers with a prevalence of LoF were older, less duplicated, and less tolerant to germline variation than oncogenes (Fig. 4f, Additional file 8, Table S7).

Systems-level properties of healthy drivers varied according to the overlap with cancer drivers (Fig. 4g, Additional file 8, Table S7). Intriguingly, canonical healthy drivers showed stronger systems-level properties than any other group of drivers. In particular, they were enriched in evolutionarily conserved and broadly expressed genes encoding highly inter-connected proteins are regulated by many miRNAs. Moreover, these genes showed a strong selection against germline variation and high enrichment in essential genes (Fig. 4g). They therefore represent a core of genes with a very central role in the cell, whose modifications are not tolerated in the germline but are selected for in somatic cells because they confer selective growth advantages. Candidate healthy drivers and those not involved in cancer had a substantially different property profile (Fig. 4g). Although numbers are too low for any robust conclusion, it is tempting to speculate that genes able to initiate non-cancer clonal expansion but not tumorigenesis may follow a different evolutionary path.

The Network of Cancer Genes: an open-access repository of annotated drivers

We collected the whole repertoire of 3347 cancer and 95 healthy drivers, their literature support, and properties in the seventh release of the Network of Cancer Genes and Healthy Drivers (NCGHD) database. NCGHD is accessible through an open-access portal that enables interactive queries of drivers (Fig. 5a) as well as the bulk download of the database content.

Fig. 5
figure 5

NCGHD annotations of driver genes. a Example of the type of annotation provided in NCGHD for cancer and healthy drivers (in this case PTEN). Annotation boxes can be expanded for further details, with the possibility of intersecting data interactively (for example, in the case of protein-protein or miRNA-gene interactions) and downloading data for local use. b Proportion of Reactome levels 2–8 enriched pathways mapping to the respective level 1 in each driver group. Enrichment was measured comparing the proportion of drivers in each pathway to that of the rest of human genes with a one-sided Fisher’s exact test. FDR was calculated using Benjamini-Hochberg. The numbers of drivers and enriched Reactome pathways are reported for each group. Proportion of canonical and candidate cancer divers and rest of genes that are (c) targets of FDA-approved antineoplastic drugs or biomarkers of response or resistance to oncological drugs in (d) cancer cell lines and (e) clinical studies. The corresponding numbers for each group are also shown

In addition to the known or predicted mode of action and systems-level properties of cancer and healthy drivers, NCGHD 7.0 also annotates their function, alteration pattern, and gene expression profile in TCGA and cancer cell lines, reported interactions with antineoplastic drugs, and potential role as treatment biomarkers (Fig. 5b). Altogether, this constitutes an extensive compendium of annotation of driver genes, including information relevant for planning experiments involving them.

Functional gene set enrichment analysis showed that at least 60% of enriched pathways (FDR < 0.05) in any driver group converge to five broad functional processes (signal transduction, gene expression, immune system, cell cycle, and DNA repair, Fig. 5b, Additional file 9, Table S8). Within these, tumor suppressors showed a prevalence in cell cycle and DNA repair pathways, while oncogenes were enriched in the gene expression and immune system-related pathways (Additional file 9, Table S8). Healthy drivers closely resembled the functional profile of cancer drivers, given the high overlap (Fig. 5b). Because of the low number, it was not possible to assess the functional enrichment of healthy drivers not involved in cancer.

More than 9% of canonical cancer drivers are targets of anti-cancer drugs and cancer drivers constitute around 40% of their targets (Fig. 5c). Moreover, most of the genes used as biomarkers of resistance or response to treatment in cell lines (Fig. 5d) or clinical trials (Fig. 5e) are cancer drivers, with an overwhelming prevalence of canonical cancer drivers.

Discussion

The wealth of cancer genomic data and the availability of increasingly sophisticated analytical approaches for their interpretation have substantially improved the understanding of how cancer starts and develops. However, our in-depth analysis of the vast repertoire of drivers that have been collected so far shows clear limits in the current knowledge of the driver landscape.

The identification of drivers as genes under positive selection or with a higher than expected mutation frequency within a cohort of patients has biased the current cancer driver repertoire towards genes whose coding point mutations or small indels frequently recur across patients. This strongly impairs the ability to map the full extent of driver heterogeneity leading to an underappreciation of the driver contribution of rarely altered genes and those modified through non-coding or gene copy number alterations, particularly amplifications. It also results in a sizeable fraction of samples with very few or no cancer drivers. This gap can be solved by complementing cohort-level approaches with methods that account for all types of alterations and predict drivers in individual samples, for example identifying their network deregulations [64,65,66] or applying machine learning to identify driver alterations [67]. Alternatively, we have shown that systems-level properties capture the main features of cancer drivers, justifying their use for patient-level driver detection [68, 69].

Our comprehensive study has also shown that cancer sequencing screens have so far mostly focused on resequencing and analyzing the protein-coding portion of cancer genomes, leaving the contribution of non-coding drivers mostly uncovered. This bias may be addressed by performing additional cancer whole genome sequencing screens and improving analytical methods for the prediction of non-coding driver alterations.

Biases are starting to emerge also in the knowledge of healthy drivers. Many non-cancer sequencing screens only targeted cancer genes and healthy driver detection methods used so far were originally developed for cancer genomics. Both these factors may contribute at least in part to explain the high overlap between drivers of cancer and non-cancer evolution. An unbiased investigation of altered genes able to promote clonal expansion but not tumorigenesis could confirm whether their properties are indeed different from cancer drivers as suggested by our initial analysis on the few of them that have been identified so far. Additionally, the investigation of somatically mutated clones in non-cancer tissues has just started and new screens are continuously published. The integrated analysis of these new studies will broaden our understanding of non-cancer clonal expansion and further clarify its relationship with cancer transformation.

Our literature review did not cover driver genes deriving from chromosomal rearrangements or epigenetic changes because of their scattered annotations in the literature and difficulty in mapping their properties. Adding these genes to the repertoire when their knowledge will be mature will help close the gaps in the knowledge of the genetic drivers of tumorigenesis.

Conclusions

Our comprehensive analysis of cancer sequencing screens showed that the current repertoire of cancer driver genes is still incomplete and biased towards frequent mutations altering the gene coding sequence. This calls for the need for additional screens and methods to identify further coding and non-coding cancer drivers at single patient resolution. We confirmed the central role of cancer drivers within the cell, which is reflected in their evolutionary path and is shared by the majority of known healthy drivers. Further sequencing screens of healthy tissues are needed to clarify whether this is a feature of all genes whose mutations can driver non-cancer clonal expansion or there is a group of healthy drivers that underwent a different evolutionary path.

Methods

Literature curation

A literature search was carried out in PubMed, TCGA (https://www.cancer.gov/tcga) and ICGC (https://dcc.icgc.org/) to retrieve cancer screens published between 2018 and 2020 (Additional file 2, Fig. S1A). This resulted in 135 coding and 154 non-coding cancer screens. Of these, only 80 and 37 were retained after examining abstracts and full text, respectively. Criteria for removal were the absence of driver genes or driver detection methods and the impossibility to map non-coding driver alterations to genes. The 37 new cancer screens were added to 273 publications previously curated by our team [70], totaling 310 publications (Additional file 1, Table S1). A similar literature search retrieved 24 sequencing screens of non-cancer tissues publications, 18 of which were retained after the abstract and full-text examination (Additional file 2, Fig. S1A; Additional file 1, Table S1). Each paper was reviewed independently by two experts and further discussed if annotations differed to extract the list of driver genes, the number of donors, the type of screen (whole-genome, whole-exome, target gene re-sequencing), the cancer or non-cancer tissues, and the driver detection method (Additional file 2, Fig. S1B).

Canonical cancer drivers were extracted from two publications [17, 18] and the Cancer Gene Census [71] v.91. In the latter case, all tiers 1 and 2 genes were retained, except those from genomic rearrangements leading to gene fusion (Additional file 2, Fig. S1B). Collected genes were further classified as tumor suppressor, oncogene, or having a dual role according to the annotation in the majority of sources. Genes with conflicting or unavailable annotation were left unclassified.

Drivers from cancer screens and canonical sources underwent further filtering (Additional file 2, Fig. S1C). First, they were intersected with a list of 148 possible false positives [18, 42]. After a manual check of the supporting evidence, two drivers were retained as canonical, five were considered as candidates, and 41 were removed (Additional file 3, Table S2). The three resulting lists (canonical drivers, drivers from cancer screens, and healthy drivers) were intersected to annotate canonical drivers in cancer screens, remaining drivers in cancer screens (candidate cancer drivers), canonical healthy drivers, candidate healthy drivers, and remaining healthy drivers (Additional file 2, Fig. S1C; Additional file 4, Table S3).

Cancer types and non-cancer tissues were mapped to organ systems using previous classification [72]. Cancer types not included in this classification were mapped based on their histopathology (retinoblastoma to central nervous system, vascular and peripheral nervous system cancers to soft tissue, penile tumors to urologic system).

Pancancer TCGA data

A dataset of 7953 TCGA samples with quality-controlled mutation (SNVs and indels), copy number, and gene expression data in 34 cancer types was assembled from the Genomic Data Commons portal I [73] (https://portal.gdc.cancer.gov/). Mutations were annotated with ANNOVAR [74] (April 2018) and dbNSFP [75] v3.0 and only those identified as exonic or splicing were retained. Damaging mutations included (1) truncating (stopgain, stoploss, frameshift) mutations, (2) missense mutations predicted by at least seven out of 10 predictors (SIFT [76], PolyPhen-2 HDIV [77], PolyPhen-2 HVAR, MutationTaster [78], MutationAssessor [79], LRT [80], FATHMM [81], PhyloP [82], GERP++RS [83], and SiPhy [84]), (3) splicing mutations predicted by at least one of two splicing-specific methods (ADA [75] and RF [75]), and (4) hotspot mutations identified with OncodriveCLUST [85] v1.0.0.

Copy number variant (CNV) segments, sample ploidy, and sample purity values were obtained from TCGA SNP arrays using ASCAT [86] v.2.5.2. Segments were intersected with the exonic coordinates of 19,756 human genes in hg19 and genes were considered to have CNV if at least 25% of their transcribed length was covered by a CNV segment. RNA-Seq data were used to filter out false-positive CNVs. Putative gene gains were defined as copy number (CN) > 2 times sample ploidy and the levels of expression were compared between samples with and without each gene gain using a two-sided Wilcoxon rank-sum test and corrected for multiple testing using Benjamini-Hochberg. Only gene gains with a false discovery rate (FDR) < 0.05 were retained. Homozygous gene losses had CN = 0 and fragments per kilobase per million (FPKM) values < 1 over sample purity. Heterozygous gene losses had CN = 1 or CN = 0 but FPKM values > 1 over sample purity. This resulted in 2,192,832 redundant genes damaged in 7921 TCGA samples.

In total, 518,115 genes were considered to acquire LoF alterations because they underwent homozygous deletion or had truncating, missense damaging, splicing mutations, or double hits (CN = 1 and LoF damaging mutation), while 1,674,717 genes were considered to acquire GoF alterations because they had a hotspot mutation or underwent gene gain with increased expression (Fig. 3a).

Systems-level properties

Protein sequences from RefSeq [87] v.99 were aligned to hg38 using BLAT [88]. Unique genomic loci were identified for 19,756 genes based on gene coverage, span, score, and identity [89]. Genes sharing at least 60% of their protein sequence were considered as duplicates [46].

Evolutionary conservation was assessed for 18,922 human genes using their orthologs in EggNOG [90] v.5.0. Genes were considered to have a pre-metazoan origin (and therefore conserved in evolution) if they had orthologs in prokaryotes, eukaryotes, or opisthokonts [53].

Gene expression for 19,231 genes in 49 healthy tissues was derived from the union of Protein Atlas [91] v.19.3 and GTEx [92] v.8. Genes were considered to be expressed in a tissue if their expression value was ≥ 1 transcript per million (TPM). Protein expression for 13,229 proteins in 45 healthy tissues was derived from Protein Atlas [91] v.19.3 retaining the highest value when multiple expression values were available.

A total of 542,397 non-redundant binary interactions between 17,883 proteins were gathered from the integration of five sources (BioGRID [93] v.3.5.185, IntAct [94] v.4.2.14, DIP [95] (February 2018), HPRD [96] v.9 and Bioplex [97] v.3.0). Data on 9476 protein complexes involving 8504 proteins were derived from CORUM [98] v.3.0, HPRD [96] v.9 and Reactome [99] v.72. Experimentally supported interactions between 14,747 genes and 1758 miRNAs were acquired from miRTarBase [100] v.8.0 and miRecords [101] v.4.0. Degree, betweenness, and clustering coefficient were calculated for protein and miRNA networks using the igraph R package [102] v.1.2.6.

The loss-of-function observed/expected upper bound fraction (LOEUF) score for 18,392 genes was obtained from gnomAD [54] v.2.1.1. Germline mutations (SNVs and indels) were obtained from the union of 2504 samples from the 1000 Genomes Project Phase 3 [103] v.5a and 125,748 samples from gnomAD [54] v.2.1.1. Mutations were annotated with ANNOVAR [74] (October 2019), and 18,812 genes were considered as damaged using the same definitions as for TCGA samples. A total of 32,558 germline SVs for 14,158 genes were derived using 15,708 samples from gnomAD [54] v.2.1.1. The numbers of damaging mutations and SVs per base pairs (bp) were calculated for each gene.

Essentiality data for 19,013 genes in 1122 cell lines were obtained integrating three RNAi knockdown and six CRISPR Cas9 knockout screens [55,56,57,58,59,60,61,62,63]. Genes with CERES [57] or DEMETER [63] scores < − 1 or Bayes score [104] > 5 were considered as essential.

Proportions of duplicated, pre-metazoan, essential genes, and proteins engaging in complexes were compared between the gene groups using two-sided Fisher’s exact test. Distributions of tissues where genes or proteins were expressed, protein and miRNA network properties, LOEUF scores, damaging mutations, and SVs per bp were compared between the gene groups using a two-sided Wilcoxon test. Multiple comparisons within each property were corrected using Benjamini-Hochberg. For each systems-level property in each driver group (d), a normalized property score was calculated as:

$$ \mathrm{Normalised}\ \mathrm{property}\ \mathrm{score}=\operatorname{sgn}\left({\Delta}_d\right)\times \frac{\left|{\Delta}_d\right|-\underset{t}{\min}\left|{\Delta}_t\right|}{\underset{t}{\max}\left|{\Delta}_{\mathrm{t}}\right|-\underset{t}{\min}\left|{\Delta}_t\right|} $$

where t represents 11 gene groups (canonical drivers, candidate drivers, tumor suppressors, oncogenes, drivers with coding alterations, drivers with non-coding alterations, canonical healthy drivers, candidate healthy drivers, remaining healthy drivers, and the rest of human genes); sgn(Δd) is the sign of the difference; and Δd indicates the difference of medians (continuous properties) or proportions (categorical properties) between each driver group and the rest of human genes. Minima and maxima were taken over all 11 gene groups for each property.

Pancancer cell line data

Mutation, CNV and gene expression data for 1291 cell lines were obtained from DepMap [56, 105] v. 20Q3. Mutations were functionally annotated using ANNOVAR [74] and LoF mutations were identified as described for TCGA samples. Hotspot mutations were detected using hotspot positions derived from TCGA. Homozygous gene deletions were defined as CN < 0.25 times cell line ploidy and expression < 1 TPM; heterozygous gene deletions were defined as 0.25 < CN < 0.75 times cell line ploidy; gene gains were defined as CN > 2 times cell line ploidy and significantly higher expression relative to cell lines with no gene gains. Genes with LoF or GoF alterations were defined as for TCGA samples. To map cell lines to organ systems, they were first associated with the TCGA cancer types and then the same classification as for TCGA was used [72].

Driver functional annotation

Gene functions were collected for 11,778 proteins from Reactome [99] v.72 and KEGG [106] v.94.1 (levels 1 and 2). Driver enrichment in Reactome pathways (levels 2–8) compared to the rest of human genes was assessed using a one-sided Fisher’s exact test and corrected for multiple testing with Benjamini-Hochberg. Enriched pathways were then mapped to the corresponding Reactome level 1.

Drug interactions

A total of 247 FDA-approved, antineoplastic, and immunomodulating drugs targeting 212 human genes were downloaded from DrugBank [107] v.5.1.8. Genetic biomarkers of response and resistance to drugs in cancer cell lines were obtained from Genomics of Drug Sensitivity in Cancer (GDSC) [108] v.8.2. Of those, only 467 associations with FDR ≤ 0.25 involving 129 drugs and 106 genes were retained. Genetic biomarkers of response and resistance in clinical studies were obtained from the Variant Interpretation for Cancer Consortium Meta-Knowledgebase [109] v.1. A total of 868 associations between drugs and genomic features involving 64 anti-cancer drugs and drug combinations and 24 human genes were retained [109].

Database and website implementation

All annotations of driver genes were entered into a relational database based on MySQL [110] v.8.0.21 connected to a web interface enabling interactive retrieval of information through gene identifiers. The frontend was developed with PHP [111] v.7.4.15. The interactive displays of miRNA-gene and protein-protein interactions were implemented with the R packages Shiny [112] v.1.6.0 and igraph [102] v.1.2.6 and ran on Shiny Server v1.5.16.958.