Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations
While tumor genome sequencing has become widely available in clinical and research settings, the interpretation of tumor somatic variants remains an important bottleneck. Here we present the Cancer Genome Interpreter, a versatile platform that automates the interpretation of newly sequenced cancer genomes, annotating the potential of alterations detected in tumors to act as drivers and their possible effect on treatment response. The results are organized in different levels of evidence according to current knowledge, which we envision can support a broad range of oncology use cases. The resource is publicly available at http://www.cancergenomeinterpreter.org.
The accumulation of so-called “driver” genomic alterations confers tumorigenic capabilities on cells. Thousands of tumor genomes are sequenced around the world every year for both research and clinical purposes. In some cases the whole genome is sequenced while in others the focus is on the exome or a panel of selected genomic regions. It then becomes necessary to annotate which of the somatic mutations identified by the sequencing have a possible role in tumorigenesis and treatment response. This process, which we refer to as “the interpretation of cancer genomes”, is currently tedious and largely unsolved. One of its major bottlenecks is the identification of the alterations driving the tumor. A widely employed approach to solve this hurdle consists of focusing on the mutations affecting known cancer genes, i.e., tumor suppressors and oncogenes. These were initially identified through experimentation, giving rise over the past 40 years to a stable census of human cancer genes. More recently, projects re-sequencing large cohorts of tumors have provided the opportunity to systematically identify genes involved in tumorigenesis through the detection of signals of positive selection in their alteration patterns across tumors of some two dozen malignancies [3, 4, 5, 6]. However, many of the somatic variants detected in tumors, even those in cancer genes, still have uncertain significance and thus it is not clear whether or not they are relevant for tumorigenesis. Another hurdle in the interpretation of cancer genomes concerns one of its crucial aims: the identification of tumor alterations that may affect treatment options.
Unstructured information on the effectiveness of therapies targeting specific cancer drivers is continuously generated by clinical trials and pre-clinical experiments, and currently several resources are dedicated to gathering and curating these data, such as ClinVar, DoCM, OncoKB, My Cancer Genome (https://www.mycancergenome.org), PMKB, PCT (https://pct.mdanderson.org), CIViC, and JAX-CKB (https://ckb.jax.org). Nevertheless, none of these resources addresses the whole process of interpretation, and researchers and clinicians thus face a challenging task when annotating the variants detected in a newly sequenced cancer genome with their collective information.
Here, we describe the Cancer Genome Interpreter (CGI), a platform that systematizes the interpretation of cancer genomes, the main hallmark of which is the streamlining and automatization of the whole process (Additional file 1: Table S1). Specifically, the CGI addresses the two aforementioned challenges. On the one hand, it identifies all known and likely tumorigenic genomic alterations (point mutations, small insertions/deletions, copy number alterations and/or gene fusions) of a newly sequenced tumor, including the assessment of variants of unknown significance. On the other, it annotates all variants of the tumor that constitute state-of-the-art biomarkers of drug response organized using different clinical evidence. The CGI accepts several data formats and its output reports are provided in a user-friendly interactive framework that organizes the results according to distinct levels of clinical relevance, which may thus be used in a broad range of applications.
Construction and content
A comprehensive catalog of cancer genes across tumor types
Most mutations affecting cancer genes are of uncertain significance
The focus on cancer genes described above is necessary but not sufficient to identify the tumorigenic variants in a tumor, since not all variants observed in a cancer gene are necessarily capable of driving tumorigenesis. Therefore, the CGI next focuses on annotating and analyzing protein-affecting mutations (PAMs) that occur in genes of the Catalog of Cancer Genes. First, validated tumorigenic mutations may confidently be labeled as drivers when detected in a newly sequenced tumor. We compiled an inventory that currently contains 5314 such validated mutations, including cancer-predisposing variants, from dedicated resources [7, 8, 9, 12, 13, 16] and the literature (Fig. 2b; Additional file 2: Note III). This Catalog of Validated Oncogenic Mutations is available for download through the CGI website (https://www.cancergenomeinterpreter.org/mutations). Across a pan-cancer cohort of 6792 tumors sequenced at the whole-exome level (mostly at diagnosis), we observed that only 5360 (916 unique variants) of the 44,601 PAMs found in cancer genes appear in this catalog. In other words, 88% of all PAMs that affect cancer genes in this cohort are currently of uncertain significance for tumorigenesis, a proportion that varies widely per gene and tumor type (Fig. 2c; Additional file 2: Note VII). It is therefore crucial to assess the tumorigenic potential of these variants, especially when they affect genes that are, or may be, therapeutic targets. We reasoned that several features of each specific mutation, as well as of the genes they affect, could help address this question. Moreover, we propose that some of these features of interest can be extracted from the analyses of large sequenced cohorts of healthy and tumor tissue [4, 17].
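The 88% figure follows directly from the two counts reported above. As a quick check of the arithmetic (the variable names are illustrative only):

```python
# Counts reported in the text for the pan-cancer cohort.
pams_in_cancer_genes = 44_601  # protein-affecting mutations (PAMs) in cancer genes
pams_validated = 5_360         # of those, present in the validated catalog

validated_fraction = pams_validated / pams_in_cancer_genes
uncertain_fraction = 1 - validated_fraction

print(f"validated: {validated_fraction:.1%}")
print(f"uncertain: {uncertain_fraction:.1%}")
```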
Examples of relevant attributes include the following: i) the mode of action of the gene in the cancer type (oncogene or tumor suppressor); ii) the consequence type of the mutation (e.g., synonymous, missense, or truncating); iii) its position within the transcript; iv) whether it falls in a mutational hotspot or cluster; v) its predicted functional impact; vi) its frequency within the human population; and vii) whether it occurs in a domain of the protein that is depleted of germline variants. The CGI assesses the tumorigenic potential of the variants of unknown significance via OncodriveMUT, a newly developed rule-based approach that combines the values of these features (Fig. 2d; Additional file 2: Note IVa). We assessed the performance of OncodriveMUT in the task of classifying driver and passenger mutations, using the Catalog of Validated Oncogenic Mutations (n = 5314) and a collected set of likely neutral—i.e., non-tumorigenic—PAMs affecting cancer genes (n = 1676). We found that OncodriveMUT separated the variants of these two data sets with 86% accuracy (Matthews correlation coefficient, 0.64), out-performing other methods employed for this goal (Additional file 2: Note IVb). In addition, for several features, the variants classified as drivers by OncodriveMUT followed the trend expected for oncogenic mutations (e.g., they exhibited larger clonal fractions among all mutations in cancer genes), and OncodriveMUT’s predictions on a set of recently probed uncommon cancer mutations exhibited a high concordance with experimental evidence [18, 19, 20, 21] (Additional file 2: Note IVb). Of note, the attributes employed by OncodriveMUT to classify each variant are detailed in the CGI output, which facilitates the user’s review of the results. 
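For readers less familiar with the two performance metrics quoted above, the following minimal sketch shows how accuracy and the Matthews correlation coefficient (MCC) are obtained from a confusion matrix. The counts are a toy example chosen for easy arithmetic; they are not the confusion matrix of the OncodriveMUT benchmark, which is not given in the main text.

```python
import math

def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all variants that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """MCC: ranges from -1 to 1 and takes all four cells into account."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denom

# Toy counts (hypothetical): drivers are the positive class.
toy = dict(tp=90, fn=10, tn=40, fp=10)

toy_accuracy = accuracy(**toy)
toy_mcc = matthews_corrcoef(**toy)
print(f"accuracy = {toy_accuracy:.3f}, MCC = {toy_mcc:.3f}")
```

Reporting the MCC alongside accuracy is useful here because the two benchmark sets differ in size (5314 versus 1676 variants), and accuracy alone can look favorable on unbalanced classes.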
In summary, the CGI annotates the mutations affecting cancer genes with features relevant to their potential role in cancer, identifying validated oncogenic events and identifying the most likely drivers among the variants of unknown significance.
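To make the idea of a rule-based combination of features concrete, the sketch below applies a handful of simplified rules to the attributes listed above. The rules, the frequency threshold, and all names are hypothetical illustrations; they are not the actual OncodriveMUT rules, which are described in Additional file 2: Note IVa.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mutation:
    gene_role: str                     # "oncogene" or "tumor_suppressor"
    consequence: str                   # "synonymous", "missense", or "truncating"
    in_hotspot_or_cluster: bool
    high_functional_impact: bool
    population_frequency: float        # allele frequency in the general population
    in_germline_depleted_domain: bool

# Hypothetical cut-off above which a variant is treated as a common polymorphism.
COMMON_POLYMORPHISM_FREQ = 0.01

def classify(m: Mutation) -> str:
    """Toy rule cascade; earlier rules take precedence over later ones."""
    if m.consequence == "synonymous":
        return "predicted passenger"
    if m.population_frequency >= COMMON_POLYMORPHISM_FREQ:
        return "predicted passenger"
    if m.gene_role == "tumor_suppressor" and m.consequence == "truncating":
        return "predicted driver"
    if m.consequence == "missense" and (
        m.in_hotspot_or_cluster
        or (m.high_functional_impact and m.in_germline_depleted_domain)
    ):
        return "predicted driver"
    return "predicted passenger"

truncating_tsg = Mutation("tumor_suppressor", "truncating", False, True, 0.0, False)
common_missense = Mutation("oncogene", "missense", True, True, 0.05, True)
print(classify(truncating_tsg), "|", classify(common_missense))
```

An explicit cascade of this kind has the practical advantage noted in the text: the attributes that led to each call can be reported alongside the result for the user to review.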
A database of genomic determinants of anti-cancer drug response
The second resource is the Cancer Bioactivities database, which currently contains information on 20,243 chemical compound–protein product interactions that may support novel research applications. We built this database by compiling a catalog of available results from bioactivity assays of small molecules interacting with cancer proteins. This information was obtained by querying several external databases (Additional file 2: Note VI). The CGI matches the alterations observed in newly sequenced tumors to the biomarkers or target genes in these two databases. This process supports the identification of biomarkers at different levels of gene resolution, ranging from variants affecting a gene region to specific amino acid changes. Of note, the CGI also reports co-occurring alterations that affect the response to a given treatment as appropriate. This includes the co-existence of biomarkers of resistance and sensitivity to the same drug, and biomarkers of drug sensitivity that depend upon simultaneous genomic events. In summary, these two databases constitute comprehensive repositories of genome-guided therapeutic actionability in cancer according to current supporting evidence. Both resources are available for download through the CGI website (https://www.cancergenomeinterpreter.org/biomarkers, https://www.cancergenomeinterpreter.org/bioactivities).
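The matching step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data model, the function names, and the biomarker entries are hypothetical and are not taken from the CGI databases or code. `GENE_X` and `drug_X` are placeholders.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Alteration:
    gene: str
    protein_change: str  # e.g. "V600E"

@dataclass(frozen=True)
class Biomarker:
    gene: str
    drug: str
    effect: str                                      # "sensitivity" or "resistance"
    protein_change: Optional[str] = None             # exact amino acid change
    residue_range: Optional[Tuple[int, int]] = None  # gene region (inclusive)

def resolution_of_match(alt: Alteration, bm: Biomarker) -> Optional[str]:
    """Return the resolution at which alt matches bm, or None if it does not."""
    if alt.gene != bm.gene:
        return None
    if bm.protein_change is not None:
        return "amino acid change" if alt.protein_change == bm.protein_change else None
    if bm.residue_range is not None:
        pos = re.match(r"[A-Z*](\d+)", alt.protein_change)
        if pos is None:
            return None
        start, end = bm.residue_range
        return "gene region" if start <= int(pos.group(1)) <= end else None
    return "gene"  # biomarker defined for any alteration of the gene

def annotate(alterations, biomarkers):
    """List every (alteration, biomarker, resolution) match."""
    hits = []
    for alt in alterations:
        for bm in biomarkers:
            res = resolution_of_match(alt, bm)
            if res is not None:
                hits.append((alt, bm, res))
    return hits

def conflicting_drugs(hits):
    """Drugs with co-occurring sensitivity and resistance biomarkers."""
    effects = {}
    for _, bm, _ in hits:
        effects.setdefault(bm.drug, set()).add(bm.effect)
    return sorted(d for d, e in effects.items() if {"sensitivity", "resistance"} <= e)

# Toy biomarker table (illustrative entries only).
BIOMARKERS = [
    Biomarker("BRAF", "BRAF inhibitor", "sensitivity", protein_change="V600E"),
    Biomarker("EGFR", "first-generation EGFR inhibitor", "sensitivity", protein_change="L858R"),
    Biomarker("EGFR", "first-generation EGFR inhibitor", "resistance", protein_change="T790M"),
    Biomarker("GENE_X", "drug_X", "sensitivity", residue_range=(100, 200)),
]

tumor = [Alteration("EGFR", "L858R"), Alteration("EGFR", "T790M"), Alteration("GENE_X", "A150T")]
hits = annotate(tumor, BIOMARKERS)
print([res for _, _, res in hits], conflicting_drugs(hits))
```

Checking the most specific criterion first (exact change, then region, then gene) keeps each reported match tied to the resolution at which the biomarker was defined.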
Utility and discussion
The CGI and the databases created to support its implementation are distributed under an open license, and the resource can be accessed via its web site at https://www.cancergenomeinterpreter.org and through an Application Programming Interface (API; Additional file 2: Notes Ic and Id). The use of the CGI to automatically interpret cancer genomes has broad potential applications, ranging from basic cancer genomics to the translational research setting. One feature of the CGI that makes it particularly suitable for different types of applications is its usability. The user can input the tumor alterations to be analyzed by uploading files following different standards and/or by typing them in a free-text box. The system is prepared to automatically recognize and re-map as necessary different formats, such as genomic, transcript, or protein-based coordinates for mutations (Additional file 2: Note Ib). The use of the CGI can help address questions raised in different oncology research settings. A newly sequenced group of tumors may be readily interpreted, and systematic analyses of large sample sets are supported, as exemplified by the pan-cancer cohort of 6792 tumors presented in this article. The application of the CGI to the mutations profiled across the whole exomes of these tumors delivered a catalog of putative driver alterations across its 28 cancer types (made available through http://www.intogen.org; Additional file 2: Note VII). The potential of a comprehensive analysis of individual alterations is illustrated by the identification of uncommon events in a tumor cohort that may be exploited by drug re-purposing opportunities (Fig. 3b; Additional file 2: Note VII). Overall, the CGI identified genomic alterations that are biomarkers of drug response used in clinical practice (FDA-approved or international guidelines) or reported in late-phase (III–IV) clinical trials in 5.2 and 3.5% of the tumors, respectively.
When considering biomarkers supported by lower levels of clinical relevance, a total of 62% of the tumors exhibited at least one biomarker of increased response to an anti-cancer drug according to findings in early clinical trials, case reports, or pre-clinical assays. These numbers varied greatly across tumor types, partially explained by the relevance of cancer-recurrent alterations in shaping the response to drugs, such as inhibitors of the BRAF V600 mutated form in cutaneous melanoma (clinical guidelines), certain chemotherapies administered for DNMT3A or NPM1 mutant acute myeloid leukemias (clinical guidelines), PIK3CA mutation inhibitors in breast cancer (early clinical trial results), and WEE1 inhibitors in TP53-mutated ovarian tumors (early clinical trial results) (Fig. 3c; Additional file 2: Note VII). However, this cohort mostly includes samples sequenced at diagnosis and thus may not reflect the type of tumors that are evaluated by molecular oncology boards at present. We therefore also applied the CGI to the sequencing data of 17,642 tumors recently released by the GENIE project, which profiled more clinically advanced cancers using targeted panels. The CGI identified a larger percentage of tumors bearing potentially actionable genomic alterations in that cohort. Specifically, 8 and 6% of GENIE tumors exhibited biomarkers of drug response used in clinical practice or reported in late clinical trials, respectively, and overall 72% of these tumors exhibited at least one alteration reported as a biomarker of drug response supported by any level of clinical evidence (including pre-clinical data; Fig. 3d; Additional file 2: Note VII). These percentages exclude cases in which a tumor exhibits co-occurring alterations that confer resistance to a given drug; in such cases, the therapy was not prescribed in silico.
Of note, the GENIE cohort exhibited a larger number of genomic biomarkers of drug resistance (to both targeted therapies and immune checkpoint blockade agents), as expected of tumors with a higher proportion of recurrence/relapse patients (Additional file 2: Note VII). These analyses provide a comprehensive state-of-the-art snapshot of the putative genomic drivers of cancer and the landscape of genomic-guided therapies according to our current knowledge. In addition, the application of the CGI to analyze the results of drug responses observed in tumors with different genomic architecture can facilitate the discovery of novel genomic biomarkers of drug sensitivity or resistance. The distinction between driver and passenger events recently contributed to the development of better predictive models to identify novel genomic markers of drug response in cancer cell lines.
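Returning to the input handling described earlier in this section, the recognition of genomic, transcript, and protein-based coordinates can be illustrated with a small sketch. It assumes that mutations are written in an HGVS-like notation, where the prefixes `g.`, `c.`, and `p.` denote the three coordinate systems; the function name and the example strings are illustrative, and the CGI's own parsing and re-mapping logic (Additional file 2: Note Ib) is not reproduced here.

```python
import re

# HGVS-style prefixes and the coordinate system each one denotes.
_KINDS = {"g": "genomic", "c": "transcript", "p": "protein"}

# The prefix appears either at the start of the entry or after a "reference:" part.
_PREFIX = re.compile(r"(?:^|:)([gcp])\.")

def coordinate_kind(entry: str) -> str:
    """Guess which coordinate system a free-text mutation entry uses."""
    match = _PREFIX.search(entry.strip())
    return _KINDS[match.group(1)] if match else "unrecognized"

examples = [
    "chr7:g.140453136A>T",
    "NM_004333.4:c.1799T>A",
    "BRAF:p.V600E",
    "BRAF V600E",
]
kinds = [coordinate_kind(e) for e in examples]
for entry, kind in zip(examples, kinds):
    print(f"{entry:24s} {kind}")
```

A production system would go further and re-map each recognized entry to a common reference, which requires transcript and genome annotation that this sketch deliberately leaves out.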
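The cohort-level percentages above can be thought of as a tally of tumors by the strongest level of evidence among their biomarkers, after setting aside drugs for which a resistance biomarker co-occurs. The sketch below illustrates that bookkeeping on a toy cohort. Counting each tumor once at its strongest level is an assumption of this sketch, and the level labels, drug names, and tumors are hypothetical.

```python
from collections import Counter

# Evidence levels from strongest to weakest (labels are illustrative).
EVIDENCE_ORDER = ["guidelines", "late trials", "early trials", "case report", "pre-clinical"]

def best_level(hits):
    """Strongest evidence level among sensitivity biomarkers of a tumor.

    hits is a list of (drug, effect, level) tuples. Drugs that also have a
    resistance biomarker in the same tumor are not prescribed in silico.
    """
    resistant = {drug for drug, effect, _ in hits if effect == "resistance"}
    levels = [
        level
        for drug, effect, level in hits
        if effect == "sensitivity" and drug not in resistant
    ]
    return min(levels, key=EVIDENCE_ORDER.index) if levels else None

def cohort_summary(cohort):
    """Percentage of tumors whose strongest biomarker falls at each level."""
    counts = Counter(best_level(hits) for hits in cohort.values())
    return {level: 100 * counts[level] / len(cohort) for level in EVIDENCE_ORDER}

toy_cohort = {
    "T1": [("drug_A", "sensitivity", "guidelines")],
    "T2": [("drug_B", "sensitivity", "late trials"), ("drug_B", "resistance", "early trials")],
    "T3": [("drug_C", "sensitivity", "pre-clinical"), ("drug_D", "sensitivity", "early trials")],
    "T4": [],
}
summary = cohort_summary(toy_cohort)
print(summary)
```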
In previous examples, the systematic analysis of large datasets was facilitated by the automatic classification of cancer variants that CGI provides. However, the detailed review of these results is empowered by the inclusion in the output reports of all the annotations employed by the CGI. The ability to review these data is especially critical in the clinical research setting. In this case, the use of the CGI to analyze the list of alterations detected in a patient’s tumor could support decision-making in multiple scenarios, assessing variants of unknown significance that may have implications for response to therapy. Early clinical adopters of the CGI have used the resource to support final decisions about the most appropriate genomic-guided clinical trial to enroll cancer patients or explore potential drug re-purposing opportunities for pediatric tumors unresponsive to standard-of-care therapy (see these use cases in Additional file 2: Note VIII).
Crucial to the performance of the CGI are the maintenance and further development of its two distinct types of resources: the repositories of accumulated knowledge, which are continuously generated, and the bioinformatics methods to estimate the relevance of those events that are not yet validated. As new tumor cohorts are re-sequenced and analyzed, more comprehensive catalogs of cancer genes and oncogenic mutations will be obtained, including both new malignancies and new genomic elements. In particular, the possibility to identify non-coding cancer drivers from currently generated whole-genome sequencing data will open up the opportunity to explore the actionability of non-coding genomic alterations (https://dcc.icgc.org/pcawg). With respect to the aggregation, curation, and interpretation of the relevance of cancer variants, our team follows the standard operating procedures developed under the umbrella of the H2020 MedBioinformatics (http://www.medbioinformatics.eu/) project, thus ensuring the mid-term maintenance of these resources. Feedback from the community is also facilitated through the CGI web interface. Access to this type of cancer data is crucial for the advance of precision medicine, but is highly complex for a single institution to comprehensively manage and update. We envision that individual databases will continue to be maintained to fulfill specific needs, but our long-term impact will largely rely, first, on the establishment of international standards for the collection of data relevant to associations between cancer variants and clinical outcomes and, second, on our collective success in encouraging the community to share and harmonize such knowledge.
The CGI is a versatile platform that automates the steps proposed here for the interpretation of cancer genomes. It annotates the alterations detected in human tumors with features that may inform about their involvement in tumorigenesis. It also highlights the alterations of the tumor that constitute biomarkers of response to anti-cancer drugs, according to current levels of evidence. The CGI is easy to use, and will improve with new knowledge extracted from the study of thousands of tumors. We envision that it will become established as a useful tool in both the basic and translational cancer research settings.
We thank the clinical and scientific experts curating the Cancer Biomarkers database (https://www.cancergenomeinterpreter.org/biomarkers). We appreciate the support provided by Wanding Zhou for the use of the TransVar method and the work of Elaine Lilly in editing the text. We also thank Jianjiong Gao for providing support for the OncoKB content.
This project has received funding from Fundació La Marató de TV3, the European Union’s Horizon 2020 research and innovation programme 2014-2020 under grant agreement number 634143, and by the European Research Council (Consolidator Grant 682398). IRB Barcelona is a recipient of a Severo Ochoa Centre of Excellence Award from the Spanish Ministry of Economy and Competitiveness (MINECO; Government of Spain) and is supported by CERCA (Generalitat de Catalunya). DT is supported by project SAF2015-74072-JIN funded by the Agencia Estatal de Investigacion (AEI) and Fondo Europeo de Desarrollo Regional (FEDER). CR-P is funded by a FPI MINECO grant (BES-2013-063354). AG-P is supported by a Ramon y Cajal fellowship (RYC-2013-14554).
Availability of data and materials
The tool is freely available through an API or a web interface at https://www.cancergenomeinterpreter.org.
Databases used by the CGI pipeline are also available at the website.
DT participated in the design, development, and curation of CGI databases (including the Catalog of Cancer Genes and the Validated Oncogenic Mutations) and participated in the design and implementation of the driver analysis pipeline (including the development of OncodriveMUT), and co-wrote the manuscript. CR-P participated in the design, development, and curation of CGI databases (including the Cancer Bioactivities database and the Cancer Biomarkers database), and participated in the design and implementation of the in silico prescription pipeline. JDP participated in the implementation of CGI pipelines and the website and provided general technical support. MPS tested the performance of a machine-learning OncodriveMUT approach. AV, AR, IT, JA, JR, and JT provided expert feedback about the CGI design and contributed to the curation of the Cancer Biomarkers Database. CT analyzed the pediatric cancer cohort. RD provided expert feedback about the CGI design, participated in the design, development, and curation of the Cancer Biomarkers database, and analyzed the adult cancer cohort. AG-P oversaw the study and co-wrote the manuscript. NL-B conceived and oversaw the study and co-wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
JR has consulting or advisory roles in Novartis, Eli Lilly, Orion Pharma, SERVIER, MSD, and Peptomyc and receives research funding from Bayer and Novartis. All remaining authors declare that they have no competing interests.
- 1. Weinstein IB. Cancer. Addiction to oncogenes--the Achilles heal of cancer. Science. 2002;297:63–4.
- 9. Chakravarty D, et al. OncoKB: a precision oncology knowledge base. JCO Precis Oncol. 2017; https://doi.org/10.1200/PO.17.00011.
- 10. Huang L, et al. The cancer precision medicine knowledge base for structured clinical-grade mutations and interpretations. J Am Med Informatics Assoc. 2017;24:513–9.
- 15. Schroeder MP, et al. OncodriveROLE classifies cancer driver genes in loss of function and activating mode of action. Bioinformatics. 2014;30:549–55.
- 17. Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nat Publ Gr. 2014;536:1–26.
- 20. Kim E, et al. Systematic functional interrogation of rare cancer variants identifies oncogenic alleles. Cancer Discov. 2016;2641:617–32.
- 21. Berger AH, et al. High-throughput phenotyping of lung cancer somatic mutations. Cancer Cell. 2015; https://doi.org/10.1016/j.ccell.2016.06.022.
- 22. Global Alliance for Genomics and Health. A federated ecosystem for sharing genomic, clinical data. Science. 2016;352:1278–80.
- 24. Juric D, et al. Phase I dose escalation study of taselisib (GDC-0032), an oral PI3K inhibitor, in patients with advanced solid tumors. Cancer Discov. 2017; https://doi.org/10.1158/2159-8290.CD-16-1080.
- 26. AACR Project GENIE. Powering Precision Medicine Through An International Consortium. Cancer Discov. 2017;7:818–31.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.