Classifying cancer genome aberrations by their mutually exclusive effects on transcription
Malignant tumors are typically caused by a conglomeration of genomic aberrations—including point mutations, small insertions, small deletions, and large copy-number variations. In some cases, specific chemotherapies and targeted drug treatments are effective against tumors that harbor certain genomic aberrations. However, predictive aberrations (biomarkers) have not been identified for many tumor types and treatments. One way to address this problem is to examine the downstream, transcriptional effects of genomic aberrations and to identify characteristic patterns. Even though two tumors harbor different genomic aberrations, the transcriptional effects of those aberrations may be similar. These patterns could be used to inform treatment choices.
We used data from 9300 tumors across 25 cancer types from The Cancer Genome Atlas. We used supervised machine learning to evaluate our ability to distinguish between tumors that had mutually exclusive genomic aberrations in specific genes. An ability to accurately distinguish between tumors with aberrations in these genes suggested that the genes have a relatively different downstream effect on transcription, and vice versa. We compared these findings against prior knowledge about signaling networks and drug responses.
Our analysis recapitulates known relationships in cancer pathways and identifies gene pairs known to predict responses to the same treatments. For example, in lung adenocarcinomas, gene-expression profiles from tumors with somatic aberrations in EGFR or MET were negatively correlated with each other, in line with prior knowledge that MET amplification causes resistance to EGFR inhibition. In breast carcinomas, we observed high similarity between PTEN and PIK3CA, which play complementary roles in regulating cellular proliferation. In a pan-cancer analysis, we found that genomic aberrations in BRAF and VHL exhibit downstream effects that are clearly distinct from other genes.
We show that transcriptional data offer promise as a way to group genomic aberrations according to their downstream effects, and these groupings recapitulate known relationships. Our approach shows potential to help pharmacologists and clinical trialists narrow the search space for candidate gene/drug associations, including for rare mutations, and for identifying potential drug-repurposing opportunities.
Keywords: Cancer genomics, Integrative omics, Machine learning, Drug repurposing, Pan cancer
Typically, a single tumor contains anywhere from tens to millions of genomic aberrations—including point mutations, small insertions, small deletions, and large copy number variations—that differ from the patient’s normal cells [1, 2, 3, 4]. Knowledge of these aberrations may be useful in guiding therapeutic decisions. In some cases, a genomic aberration is the target of an existing therapy and thus may indicate that the therapy is a good match for that patient. For example, Trastuzumab is a targeted therapy for HER2-amplified breast cancers . In other cases, a genomic aberration may be a biomarker for an existing therapy, even though the therapy was not explicitly designed to target that aberration . Many such relationships have been identified for combinations of genomic aberration and therapy . However, in many cases, tumors contain no therapeutic biomarker. Furthermore, few gene/drug associations have been made for the “long tail” of genomic aberrations that occur infrequently at the population level . Although it may be economically infeasible to develop targeted therapies for every rare mutation, we may be able to repurpose existing cancer treatments by identifying similarities in tumor biology between tumors that harbor rare and common aberrations.
By disrupting signaling cascades—or pathways—within tumor cells, genomic aberrations can cause the tumor to grow, divide, or dedifferentiate in an uncontrolled manner . Genomic aberrations within tumors are highly variable across cancer patients—each tumor carries a unique panoply of genomic aberrations. However, a much smaller number of signaling cascades is affected. Even though different genes are mutated in two different tumors, these mutations may affect common signaling cascades (e.g., Ras ➔ Raf ➔ MEK ➔ ERK) . We may be able to better understand the effects of genomic aberrations by considering such downstream effects.
Although it is possible to place genomic aberrations in the context of biological pathways, it may be difficult to decipher whether two aberrations have a similar effect on tumor biology, even though they occur within the same signaling cascade. This observation may be especially true for rare mutations, because little is known about the roles they play in tumorigenesis or therapeutic responses, and sample sizes are small. An alternative approach for understanding the effects of these mutations and their potential as biomarkers is to evaluate the transcriptional effects of the mutations. Using existing, high-throughput technologies (e.g., microarrays and RNA-Sequencing), it is possible to quantify gene-expression levels across the entire transcriptome for a modest cost. Two tumors may have similar gene-expression profiles, even though they have no genomic aberrations in common, ostensibly because the aberrations in either tumor have led to similar downstream effects. Therefore, the tumors may respond similarly to drug treatments. When this approach is applied to many tumors, it may be possible to identify transcriptional patterns that can be used as biomarkers of treatment response, independent of the genomic aberrations that occur within these tumors.
We evaluated this idea using publicly available data from The Cancer Genome Atlas (TCGA). We acquired data representing mutations (SNVs, insertions, or deletions), copy-number variations (large amplifications or deletions), and transcription for 9300 tumors across 25 cancer types available in TCGA. We identified tumors that carried mutations in frequently mutated genes (e.g., KRAS, EGFR, and ERBB2) and in genes that are mutated relatively rarely. We made the simplifying assumption that mutations at different genomic loci within a given gene exert a similar effect on tumor biology. We also assumed that mutations in individual genes have a characteristic effect on tumor transcription, despite the presence of additional mutations within each of these tumors.
Initially focusing on lung adenocarcinomas and then extending our analysis to other tumor types, we identified relatively common mutations and filtered the data to include only tumors where these mutations occurred in a mutually exclusive manner—harboring only one of the mutations of interest. Having classified the tumors by mutation status, we used a supervised, machine-learning algorithm to predict mutation status based on transcriptional patterns observed in the tumors. In many cases, the transcriptional patterns were highly predictive of mutation status, thus indicating that individual genomic aberrations influence transcription in distinct ways. Finally, using lower-frequency genes that had been excluded from the initial analysis, we identified genes that, when mutated, resulted in transcriptional patterns that were similar to some of the genes from our initial set. In several cases, these similarities coincided with prior knowledge about cancer pathways as well as with known therapeutic biomarkers.
Our findings show promise as a way to identify pairs of genes that, when mutated, may serve as biomarkers for the same treatment. These observations promise to be useful in guiding drug-repurposing efforts.
Somatic mutation data
In July 2016, we downloaded all available TCGA somatic mutation data via the National Cancer Institute’s Genomic Data Commons . These data had been generated using high-throughput, exome-sequencing technologies. Using the MuTect tool , these data had been compared against germline variation using human reference genome GRCh38 to make somatic calls. Subsequently, the variants had been annotated with the Variant Effect Predictor (VEP) tool  to record population frequencies and to predict effects of the mutations on gene function. We used these annotations to filter the somatic variant data. We excluded variants that did not pass MuTect’s quality-control criteria or that had a minor allele frequency greater than 0.01 in the ExAC database. Any mutation predicted by VEP to have a “LOW” or “MODIFIER” impact on protein function was excluded. We retained variants that had been predicted by SIFT  to be “deleterious” or “deleterious (low confidence)” or by Polyphen-2  to be “probably damaging” or “possibly damaging.” Although some of these variants were likely false positives, we preferred to err on the side of inclusion rather than exclusion, to minimize the chances of false negatives. In addition, we retained variants that contained no predictions for SIFT or Polyphen-2.
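The filtering rules above can be expressed as a single predicate. The following is a minimal Python sketch, not the authors' code: the record layout and field names (`passed_qc`, `exac_maf`, `vep_impact`, `sift`, `polyphen`) are hypothetical, and it assumes that a variant is retained without a damaging call only when both SIFT and PolyPhen-2 predictions are missing.

```python
# Hypothetical field names; these are not the actual GDC or VEP column names.
DAMAGING_SIFT = {"deleterious", "deleterious (low confidence)"}
DAMAGING_POLYPHEN = {"probably damaging", "possibly damaging"}
EXCLUDED_IMPACT = {"LOW", "MODIFIER"}
MAX_EXAC_MAF = 0.01


def keep_variant(variant):
    """Return True if a somatic variant survives the filters described in the text."""
    # Drop calls that failed the variant caller's quality-control criteria.
    if not variant["passed_qc"]:
        return False
    # Drop common variants (minor allele frequency above 0.01 in ExAC).
    maf = variant.get("exac_maf")
    if maf is not None and maf > MAX_EXAC_MAF:
        return False
    # Drop variants with a predicted LOW or MODIFIER impact.
    if variant["vep_impact"] in EXCLUDED_IMPACT:
        return False
    sift = variant.get("sift")
    polyphen = variant.get("polyphen")
    # Assumption: "no predictions" means both tools are silent.
    if sift is None and polyphen is None:
        return True
    # Otherwise require a damaging call from at least one tool.
    return sift in DAMAGING_SIFT or polyphen in DAMAGING_POLYPHEN
```

Requiring only one of the two tools to flag a variant mirrors the stated preference for inclusion over exclusion.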
Copy-number variation data
We downloaded copy-number variation (CNV) data that had been preprocessed and stored in the University of California Santa Cruz Xena database . These data had been produced using whole-genome microarrays. The CNVs had been called using the GISTIC2.0 algorithm  and summarized to gene-level values. In addition, the gene-level values were thresholded to estimate whether each sample carried a homozygous deletion, a single-copy deletion, a low-level amplification, or a high-level amplification. We considered tumors with either a homozygous deletion or a high-level amplification to be “mutated.” We considered the remaining samples either to have been in a normal state or to have exerted only a modest effect, if any, on tumor transcription.
Finally, we aggregated the somatic mutation and CNV data to identify tumor samples that carried at least one genomic aberration in a given gene. We merged the somatic-mutation and CNV data to a single value per gene and tumor sample, using Boolean values to indicate whether the tumor harbored an aberration in a given gene.
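As an illustration of the thresholding and merging steps, here is a short Python sketch. It assumes the usual GISTIC2.0 thresholded coding (-2 homozygous deletion, -1 single-copy deletion, 0 normal, 1 low-level amplification, 2 high-level amplification), which the text does not state explicitly; the function names and input shapes are hypothetical.

```python
# Assumed GISTIC2.0 thresholded codes (not stated in the text).
HOMOZYGOUS_DELETION = -2
HIGH_LEVEL_AMPLIFICATION = 2


def cnv_is_aberrant(gistic_value):
    """Only homozygous deletions and high-level amplifications count as 'mutated'."""
    return gistic_value in (HOMOZYGOUS_DELETION, HIGH_LEVEL_AMPLIFICATION)


def merge_aberrations(somatic_hits, cnv_calls):
    """Collapse both data types to one Boolean per (sample, gene) pair.

    somatic_hits: set of (sample, gene) pairs with a retained somatic mutation.
    cnv_calls: dict mapping (sample, gene) to a thresholded GISTIC value.
    """
    keys = set(somatic_hits) | set(cnv_calls)
    return {
        key: (key in somatic_hits) or cnv_is_aberrant(cnv_calls.get(key, 0))
        for key in keys
    }
```

A pair is flagged when either data type reports an aberration, so a single-copy deletion alone leaves the flag False.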
We downloaded gene-level, RNA-Sequencing data that had been preprocessed and aligned using the Rsubread package [18, 19] and had been summarized using the transcripts per million (TPM) method. For tumor samples that had been sequenced multiple times, we averaged expression values across these samples. We log2-transformed the RNA-Sequencing values to mitigate the effects of extremely high expression values and to enable easier visualization.
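The averaging and transformation steps might look like the following Python sketch. The pseudocount of 1 added before the log2 transform is an assumption made here to keep zero TPM values finite; the text does not say how zeros were handled. The function name and input shapes are hypothetical.

```python
import math
from collections import defaultdict


def average_and_log2(tpm_by_aliquot, aliquot_to_tumor, pseudocount=1.0):
    """Average TPM values across repeat sequencing runs, then log2-transform.

    tpm_by_aliquot: dict mapping aliquot ID to {gene: TPM}.
    aliquot_to_tumor: dict mapping aliquot ID to tumor ID.
    pseudocount: assumed offset so that a TPM of 0 maps to 0 instead of -inf.
    """
    grouped = defaultdict(list)
    for aliquot, profile in tpm_by_aliquot.items():
        grouped[aliquot_to_tumor[aliquot]].append(profile)

    result = {}
    for tumor, profiles in grouped.items():
        result[tumor] = {
            gene: math.log2(sum(p[gene] for p in profiles) / len(profiles) + pseudocount)
            for gene in profiles[0]
        }
    return result
```

Averaging happens before the transform, matching the order in which the two steps are described.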
Disease-drug-gene mapping data
We downloaded disease-drug-gene mappings that had been curated via a crowdsourcing effort and had been made available via the CIViC database . We used the October 1, 2016 version of this database and focused solely on evidence where the relationship between gene and drug was “Supported.”
Our analysis focused exclusively on TCGA samples for which data were available for all three molecular types (somatic mutation, CNV, and RNA-Sequencing). To reduce noise and computational complexity, we limited the data to 325 genes considered to play a role in canonical cancer pathways, according to the Pathways in Cancer diagram in the Kyoto Encyclopedia of Genes and Genomes (KEGG) .
Using the genome-aberration data, we categorized each tumor according to cancer type and whether an aberration had been identified in a given gene. We generated a “training set” by identifying samples that had at least one mutation in one of the frequently mutated genes (frequency threshold varied; see Results). We then excluded tumor samples that had a mutation in more than one of these genes. We used the resulting data set for a classification analysis, using genes as class labels. The remaining samples were set aside as a “test set.” As a way to assess our therapeutic predictions, we limited genes in the test set to those that were described in the CIViC database.
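The mutual-exclusivity filter can be sketched in Python as follows. This is one reading of the procedure rather than the authors' implementation: it assumes that a tumor enters the training set only when exactly one of the frequently mutated genes is aberrant, and that all other tumors are set aside. The function name and input shape are hypothetical.

```python
def split_by_mutual_exclusivity(aberrations, frequent_genes):
    """Label tumors that carry exactly one of the frequently mutated genes.

    aberrations: dict mapping tumor ID to the set of aberrant genes.
    frequent_genes: iterable of genes that passed the frequency threshold.
    Returns (training, remaining): training maps tumor ID to its class label
    (the single frequent gene it carries); remaining lists all other tumors.
    """
    frequent = set(frequent_genes)
    training = {}
    remaining = []
    for tumor, genes in aberrations.items():
        hits = genes & frequent
        if len(hits) == 1:
            # Aberrations in genes outside the frequent set do not disqualify a tumor.
            training[tumor] = next(iter(hits))
        else:
            remaining.append(tumor)
    return training, remaining
```

Note that a tumor with one frequent aberration plus aberrations in rarer genes still counts as mutually exclusive under this reading.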
Using the training set, we performed 5-fold cross validation to evaluate our ability to predict mutation status for each gene. Later, we trained a model on the entire training set and made predictions for the test set. In both cases, we used the Random Forests classification algorithm  with default parameters, other than that we requested probabilistic predictions.
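To make the cross-validation step concrete, the sketch below assigns training samples to five folds in Python. Stratifying by class label is an assumption; the text does not say whether folds were stratified, and the analysis itself was run in R. Fitting the classifier on four folds and predicting the fifth is left out.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign samples to k folds so each class is spread evenly across folds.

    labels: dict mapping sample ID to class label (the mutated gene).
    Returns a list of k lists of sample IDs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, gene in labels.items():
        by_class[gene].append(sample)

    folds = [[] for _ in range(k)]
    for gene in sorted(by_class):
        members = sorted(by_class[gene])
        rng.shuffle(members)
        # Deal the shuffled samples of this class round-robin across folds.
        for i, sample in enumerate(members):
            folds[i % k].append(sample)
    return folds
```

Each fold serves once as the held-out set while the other four are used for fitting.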
Because mutation frequencies varied considerably across genes, we subsampled the data. For example, if the minimum number of mutated samples across all selected genes were 15, we would randomly select 15 samples for each gene. We repeated this process 10 times and averaged the results across the various subsampled results. This approach ensured that class imbalance would not bias our results.
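A minimal Python sketch of this balanced subsampling follows. The function name, the fixed seed, and sampling without replacement are assumptions; only the class-size rule and the ten repetitions come from the text.

```python
import random
from collections import defaultdict


def balanced_subsamples(labels, repeats=10, seed=0):
    """Draw `repeats` class-balanced subsets of the labeled samples.

    labels: dict mapping sample ID to class label (the mutated gene).
    Each draw keeps n samples per class, where n is the size of the rarest class.
    """
    by_class = defaultdict(list)
    for sample, gene in labels.items():
        by_class[gene].append(sample)
    n = min(len(members) for members in by_class.values())

    rng = random.Random(seed)
    draws = []
    for _ in range(repeats):
        draw = {}
        for gene in sorted(by_class):
            # Sample without replacement within each class.
            for sample in rng.sample(sorted(by_class[gene]), n):
                draw[sample] = gene
        draws.append(draw)
    return draws
```

Performance metrics would then be computed on each draw and averaged across the ten draws.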
We wrote scripts in the Python programming language  to parse, filter, and summarize the input data. To enable easier analyses in subsequent steps, we restructured the data into “tidy data” format . In performing the analysis steps and producing graphics, we used the R programming language . These steps were aided by the following packages: readr, dplyr, magrittr, ggplot2, RColorBrewer, randomForest, mlr, coin, and AUC [25, 26, 27, 28, 29, 30, 31, 32, 33]. The entire pipeline executes in 5–15 min on a laptop computer with 4 cores and 16 GB RAM. We placed all analysis code and the tidy data in an open-access repository at https://osf.io/ndjkg.
To evaluate differences in expression for individual genes, we calculated p-values using Student’s t-test and then performed a Bonferroni correction to account for all possible gene pairs (n = 52,650) in our data.
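The number of tests follows from the gene set: 325 genes yield 325 × 324 / 2 = 52,650 unordered pairs. The Python sketch below reproduces that count and applies a Bonferroni adjustment; the function name is hypothetical.

```python
from math import comb

N_GENES = 325
N_PAIRS = comb(N_GENES, 2)  # number of unordered gene pairs among 325 genes


def bonferroni(p_value, n_tests=N_PAIRS):
    """Multiply a nominal p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)
```

Under this correction, a nominal p-value must fall below roughly 9.5e-7 (0.05 / 52,650) to remain significant at the 0.05 level.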
We sought to identify genes whose transcriptional profiles were similar to each other when mutated in tumors. From TCGA, we obtained somatic-mutation data for 10,391 tumor samples. The filtering steps (see Methods) reduced the number of somatic mutations by 93.5%, mostly due to the removal of synonymous variants and common variants. Within our cancer-related genes of interest (n = 325), we observed a total of 45,950 somatic mutations (5.57 per sample). We observed 52,012 high-level amplifications (8.63 per sample) and 19,037 large-scale, homozygous deletions (4.13 per sample) within these genes. After we removed samples that lacked data for at least one type of aberration, data for 9300 patients across 25 distinct cancer types remained.
We repeated this process for other cancer types that had enough data (bladder carcinoma, head/neck squamous carcinoma, ovarian cystadenocarcinoma, metastatic skin melanoma, and stomach adenocarcinoma); we required that at least three genes be mutated at least ten times in a mutually exclusive manner. Additional files 2, 3, 4, 5, 6, 7, 8 and 9 indicate correlation coefficients, nominal p-values, and Bonferroni-adjusted p-values for each pairwise comparison. The strongest negative correlation (rho = −0.83; see Additional file 1: Fig. S2) was between FGFR1 and FGFR3 in bladder carcinoma. Experimental work has demonstrated that both genes are activated via mutations and that the genes play distinct roles in regulating bladder-tumor growth. The genes that showed the most significant differences in expression between FGFR1- and FGFR3-mutated tumors were SMAD3, FGFR3, FN1, LAMA1, and FGFR1 (Additional file 1: Figs. S3-S7). In addition to growth signaling, these genes play roles in regulating extracellular matrix adhesion and intracellular signaling.
The correlations we observed between pairs of mutated genes were often tissue specific; for example, a strong relationship between PTEN and PIK3CA was observed in breast carcinomas but not in head/neck squamous carcinomas, stomach adenocarcinomas, or ovarian cystadenocarcinomas.
For the last stage of our analysis, we trained the Random Forests classification algorithm on the full training set and made predictions for tumor samples in the test set. We interpreted that probabilistic predictions coinciding with the actual mutation status of genes in the test set would indicate that the expression profiles of genes in the training set were similar to the expression profiles of genes from the test set. We hoped this analysis would provide insights about genes that are mutated rarely. Using a minimum threshold of five mutated samples (test set) and focusing initially on breast cancer, we found that MMP2 predictions were strongly associated with predictions for PIK3CA (0.47), PTEN (0.49), and AKT1 (0.28), which operate via the same pathway. Like MMP2, PIK3CA and AKT1 play important roles in cancer cell migration and metastasis [48, 49, 50].
We also observed a strong positive correlation (0.33) between TP53 and AKT2 and a modest correlation between ERBB2 and AKT2 (0.20). AKT2 mediates TP53 activity via MDM2 . Trastuzumab was originally developed as a targeted therapy for ERBB2 (Her2) amplification; more recently, aberrant AKT2 expression has been associated with longer time to progression and overall patient survival in Her2-positive patients . We also observed a slightly negative correlation (−0.20) between AKT1 and AKT2 predictions; although the proteins encoded by these genes operate within the same signaling cascade, their roles in regulating cell migration and differentiation are distinct .
When we made test-set predictions for all 25 cancer types, we noted a few modestly positive correlations. The first was between VHL and MTOR. VHL inactivation leads to constitutive activation of HIF-2 and/or HIF-1. In clear-cell renal carcinomas, a downstream effect of HIF activation is inhibition of mTOR Complex 1. The second positive correlation was between NRAS and BRAF. These genes interact directly with each other and operate via the Ras ➔ Raf ➔ MEK ➔ ERK cascade. Indeed, the CIViC database indicates that several treatments, including Cetuximab, Selumetinib, and Vemurafenib, target tumors with mutations in either of these genes.
Although we used the ComBat software to correct for tissue-type effects and used a principal component analysis to visually verify the effects of this correction (Fig. 1), we conducted a follow-up evaluation to assess whether tissue specificity might still have influenced our results, because, in many cases, mutations occur in a tissue-specific manner . Initially, we used the Random Forest algorithm to predict tumor type based on the expression data that had not been adjusted using ComBat. Then we repeated this process using the ComBat-adjusted data. Using this approach, we could predict tumor type with >90% accuracy for both versions of the expression data. Even though ComBat had adjusted for tissue specificity, a subtle footprint remained, which the Random Forest algorithm was able to detect. In a second follow-up evaluation, we used the Random Forest algorithm to predict gene-mutation status using only tumor type (instead of gene-expression data). Although tumor type predicted mutation status less accurately than the gene-expression data (47% and 50%, respectively), these results confirm that tumor type is confounded with mutation status and thus that our pan-cancer results—and results from other pan-cancer studies that examine the relationship between genomic and transcriptomic variation—should be interpreted with caution. However, it is difficult to distinguish between correlation and causation; certain tissue environments (driven by gene expression) may select for somatic variants in certain genes, and/or certain somatic variants may strongly influence the tissue specificity of cancer.
We have developed a computational approach that uses publicly available, molecular-profiling data to identify genes that have similar (or different) effects on gene expression in human tumors. Our overarching goal was to develop a methodology that can be used to guide drug-repurposing efforts and, more generally, to help cancer researchers make sense of the vast complexity and heterogeneity of tumors. Using data from The Cancer Genome Atlas, we observed relationships that recapitulate what was previously known about canonical cancer pathways and treatment responses. Alternative methods have primarily evaluated molecular data in a low-throughput manner, have examined one type of molecule at a time, or have considered the expression of individual genes associated with mutations; in contrast, our method accounts for broad-ranging effects of genomic aberrations on gene expression within cells. Using a supervised-learning approach, we found that we could predict mutation status, often with high accuracy. This provides evidence that many mutations confer a clear and distinct effect on transcriptional responses within downstream genes.
To increase interpretability, we made simplifying assumptions and simplified our approach in several ways. We limited our analysis to genes known to play a role in cancer, as described in KEGG’s Pathways in Cancer diagram. We assumed that the effects of genomic aberrations are mutually exclusive. Even though this assumption may not hold in every case, it reflects current approaches that are used to prioritize targeted cancer therapies. For example, even though it is clearly understood that tumors with mutations in EGFR harbor many variants in genes other than EGFR, mutations in EGFR are used as a biomarker for treating lung adenocarcinomas with therapies such as Gefitinib and Erlotinib. In addition, evidence suggests that EGFR and KRAS mutations occur in a mutually exclusive manner and that tumors with KRAS mutations fail to respond to these drugs . Such findings suggest that mutations in individual genes can strongly influence treatment responses, despite a background of other mutations. To an extent, our goal was to identify such scenarios.
In addition, we ignored the potential impact of epigenomic factors, such as DNA methylation and miRNA expression, as well as gene fusions. Our approach could be refined in future studies to use such observations to indicate whether a given gene is “mutated.” Although genomic aberrations are often thought to be the main drivers of tumorigenesis, epigenomic factors often play a critical role in modulating tumor activity and/or interacting with genomic aberrations. For example, DNA hypermethylation of promoter regions can cause transcriptional silencing of tumor-suppressor genes; BRCA1 hypermethylation has been shown to alter responses to platinum-salt therapies. miRNAs can also play important roles in regulating tumor transcription; in chronic lymphocytic leukemia, two miRNA genes at chromosome 13q14 are frequently deleted and down-regulated [57] and likely play important roles in regulating tumorigenesis and treatment responses.
We treated all mutations equally within a given gene, regardless of genomic loci or mutation type; but in many cases, there may be considerable heterogeneity across genomic loci and mutation types. As sample sizes increase over time, it will be more feasible to sub-classify mutations in finer detail.
We focused primarily on the genome- and cancer-related aspects of this work rather than on algorithmic aspects. In future work, it would be valuable to compare (and perhaps combine) multiple algorithms as a way to optimize our approach. We used the Random Forests algorithm because it has been shown to deal effectively with high-throughput transcriptomic data, executes quickly, and is more amenable to post hoc interpretation than many other algorithms .
Finally, although we have demonstrated a potential to learn about the effects of rare variants, our analysis touched only briefly on such variants. To evaluate the reliability of our method, we focused on genes that had been affected by at least a modest number of mutations. However, we believe this methodology can be applied in cases where a given mutation has been observed in only a single tumor, potentially providing insights for “n-of-1” clinical trials.
We have used supervised machine learning to integrate genomic and transcriptomic data across 9300 tumors and 25 cancer types to aid in deciphering the downstream effects of genomic aberrations on tumor transcription. This approach has potential to guide development of treatment biomarkers and to understand similarities and differences among genes that play a role in specific types of cancer and across multiple cancer types. We hope this approach will be useful to pharmacologists and clinical trialists who seek to identify relationships between genomic aberrations and treatment responses. In particular, we hope this methodology will reduce barriers for drug-repurposing efforts so that existing treatments can be used on tumors with no current treatment biomarker. We believe such approaches will reduce the costs of developing new cancer drugs and increase the number of tumors that can be treated in a targeted manner.
We thank T. James Lee for providing insights on processing data from TCGA. We thank the Fulton Supercomputing Lab at Brigham Young University for providing computational resources that enabled us to complete this study. We thank the patients who donated tumor samples to TCGA Consortium and consented to share the resulting data with the research community.
S.R.P. acknowledges startup funds from Brigham Young University. J.B.D. expresses gratitude for a student mentoring grant from Brigham Young University. Publication costs for this article were funded by Brigham Young University.
Availability of data and materials
All tidy-data files, analysis code, and analysis outputs (including some figures not included in this paper) are publicly accessible at https://osf.io/ndjkg. We encourage others to extend and refine our methods.
About this supplement
This article has been published as part of BMC Medical Genomics Volume 10 Supplement 4, 2017: 16th International Conference on Bioinformatics (InCoB 2017): Medical Genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-10-supplement-4.
SRP conceived and designed the study. Both authors developed the analytical pipeline. JBD parsed, filtered, and tidied the data. SRP performed the machine-learning analysis. JBD prepared and organized the code and data. Both authors wrote the paper and prepared figures for the paper. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
The Brigham Young University Institutional Review Board granted an exemption for this study (#E14522). This study uses only publicly available data.
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1. Jones S, Zhang X, Parsons DW, Lin JC-H, Leary RJ, Angenendt P, Mankoo P, Carter H, Kamiyama H, Jimeno A, Hong S-M, Fu B, Lin M-T, Calhoun ES, Kamiyama M, Walter K, Nikolskaya T, Nikolsky Y, Hartigan J, Smith DR, Hidalgo M, Leach SD, Klein AP, Jaffee EM, Goggins M, Maitra A, Iacobuzio-Donahue C, Eshleman JR, Kern SE, Hruban RH, et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science. 2008;321:1801–6.
2. Parsons WW, JC-HC L, Jones S, I-MM S, Zhang X, Rasheed AA, SKNK M, Leary RJ, SMOM S, Angenendt P, Mankoo P, Carter H, Gallia GL, Olivi A, McLendon R, Keir S, Nikolskaya T, Nikolsky Y, Busam DA, Tekleab H, Diaz LA, Hartigan J, Smith DR, Strausberg RL, Yan H, Riggins GJ, Bigner DD, Karchin R, Papadopoulos N, Parmigiani G, et al. An integrated genomic analysis of human glioblastoma multiforme. Science. 2008;321:1807–12.
6. Garnett MJ, Edelman EJ, Heidorn SJ, Greenman CD, Dastur A, Lau KW, Greninger P, Thompson IR, Luo X, Soares J, Liu Q, Iorio F, Surdez D, Chen L, Milano RJ, Bignell GR, Tam AT, Davies H, Stevenson JA, Barthorpe S, Lutz SR, Kogera F, Lawrence K, McLaren-Douglas A, Mitropoulos X, Mironenko T, Thi H, Richardson L, Zhou W, Jewitt F, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature. 2012;483:570–5.
7. Griffith M, Spies NC, Krysiak K, Coffman AC, McMichael JF, Ainscough BJ, Rieke DT, Danos AM, Kujan L, Ramirez CA, Wagner AH, Skidmore ZL, Liu CJ, Jones MR, Bilski RL, Lesurf R, Barnell EK, Shah NM, Bonakdar M, Trani L, Matlock M, Ramu A, Campbell KM, Spies GC, Graubert AP, Gangavarapu K, Eldred JM, Larson DE, Walker JR, Good BM, et al. CIViC is a community knowledgebase for expert crowdsourcing the clinical interpretation of variants in cancer. Cancer. 2016;47(7):170–74.
11. Genomic Data Commons [https://gdc.cancer.gov]. Accessed 1 June 2016.
13. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, Cunningham F, Eisenstein M, Weil M, Chen A, Visscher P, Brown M, McCarthy M, Yang J, Pierre A, Saint GE, Zuk O, Schaffner S, Samocha K, Do R, Hechter E, Kathiresan S, Hindorff L, Sethupathy P, Junkins H, Ramos E, Mehta J, Collins F, Puente X, et al. The Ensembl variant effect predictor. Genome Biol. 2016;17:122.
15. Adzhubei I, Jordan DM, Sunyaev SR. Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet. 2013;Chapter 7:Unit7.20.
20. Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. KEGG as a reference resource for gene and protein annotation. Nuc Acids Res. 2016;44(D1):D457–62.
22. Python Software Foundation. Python language reference, version 2.7. In: Python software foundation; 2013.
23. Wickham H. Tidy Data. J Stat Softw. 2014;59(10):1–23.
24. R Core Team. R: A language and environment for statistical computing. 2016.
25. Wickham H, Hester J, Francois R. readr: Read tabular data. 2016.
26. Wickham H, Francois R. dplyr: A grammar of data manipulation. 2016.
27. Bache SM, Wickham H. magrittr: A forward-pipe operator for R. 2014.
29. Neuwirth E. RColorBrewer: ColorBrewer palettes. 2014.
30. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Jones Z, Casalicchio G. mlr: machine learning in R. 2016.
31. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
32. Ballings M, Van den Poel D. AUC: threshold independent performance measures for probabilistic classifiers. 2013.
37. Paez JG, Jänne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004;304:1497–500.
41. Engelman JA, Zejnullahu K, Mitsudomi T, Song Y, Hyland C, Park JO, Lindeman N, Gale CM, Zhao X, Christensen J, Kosaka T, Holmes AJ, Rogers AM, Cappuzzo F, Mok T, Lee C, Johnson BE, Cantley LC, Janne PA. MET amplification leads to gefitinib resistance in lung cancer by activating ERBB3 signaling. Science. 2007;316:1039–43.
47. Rodriguez-Otero P, Román-Gómez J, Vilas-Zornoza A, José-Eneriz ES, Martín-Palanco V, Rifón J, Torres A, Calasanz MJ, Agirre X, Prosper F. Deregulation of FGFR1 and CDK6 oncogenic pathways in acute lymphoblastic leukaemia harbouring epigenetic modifications of the MIR9 family. Br J Haematol. 2011;155:73–83.
52. Grell P, Fabian P, Khoylou M, Radova L, Slaby O, Hrstka R, Vyzula R, Hajduch M, Svoboda M. Akt expression and compartmentalization in prediction of clinical outcome in HER2-positive metastatic breast cancer patients treated with trastuzumab. Int J Oncol. 2012;41:1204–12.
54. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SAJR, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Børresen-Dale A-L, Boyault S, Burkhardt B, Butler AP, Caldas C, Davies HR, Desmedt C, Eils R, Eyfjörd JE, Foekens JA, Greaves M, Hosoda F, Hutter B, Ilicic T, Imbeaud S, Imielinski M, Imielinsk M, Jäger N, Jones DTW, Jones D, Knappskog S, et al. Signatures of mutational processes in human cancer. Nature. 2013;500:415–21.
57. Calin GA, Dumitru CD, Shimizu M, Bichi R, Zupo S, Noch E, Aldler H, Rattan S, Keating M, Rai K, Rassenti L, Kipps T, Negrini M, Bullrich F, Croce CM. Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci U S A. 2002;99:15524–9.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.