Predict drug sensitivity of cancer cells with pathway activity inference
Predicting cellular responses to drugs has been a major challenge for personalized drug therapy regimens. Recent pharmacogenomic studies measured the sensitivities of heterogeneous cell lines to numerous drugs, and provided valuable data resources to develop and validate computational approaches for the prediction of drug responses. Most current approaches predict drug sensitivity by building prediction models with individual genes, which suffer from low reproducibility due to biological variability and from difficulty in interpreting the biological relevance of novel gene-drug associations. As an alternative, pathway activity scores derived from gene expression could predict drug response of cancer cells.
In this study, pathway-based prediction models were built with four approaches that infer pathway activity in an unsupervised manner, including competitive scoring approaches (DiffRank and GSVA) and self-contained scoring approaches (PLAGE and Z-Score). These unsupervised pathway activity inference approaches were applied to predict drug responses of cancer cells using data from the Cancer Cell Line Encyclopedia (CCLE).
Our analysis of all 24 drugs from CCLE demonstrated that pathway-based models achieved better predictions for 14 of the 24 drugs, while taking fewer features as inputs. Further investigation indicated that pathway-based models indeed captured pathways involving drug-related genes (targets, transporters and metabolic enzymes) for the majority of drugs, whereas gene-level models failed to identify these drug-related genes in most cases. Among the four approaches, competitive scoring (DiffRank and GSVA) provided more accurate predictions and captured more pathways involving drug-related genes than self-contained scoring (PLAGE and Z-Score). Detailed interpretation of top pathways from the top method (DiffRank) highlights the merit of pathway-based approaches to predict drug response by identifying pathways relevant to drug mechanisms.
Taken together, pathway-based modeling with inferred pathway activity is a promising alternative to predict drug response, with the ability to easily interpret results and provide biological insights into the mechanisms of drug actions.
Keywords: Pathway activity; Drug sensitivity; Precision therapy; Machine learning; Cancer; Pharmacogenomics
CCLE: Cancer Cell Line Encyclopedia
GDSC: Genomics of Drug Sensitivity in Cancer
GSEA: Gene Set Enrichment Analysis
GSVA: Gene Set Variation Analysis
MACC: Median of Absolute Correlation Coefficients
MSE: Mean Square Error
Determining the responses of individual patients to drugs has become a critical task in the practice of personalized medicine. Experimental efforts have been undertaken to directly measure drug response of cells extracted from patients’ cancerous tissues, including in-vitro and in-vivo models. While such experimental approaches capture biological characteristics of patients’ tumors, their high cost and time-consuming operations make them hard to scale in practice.
With the advance of high-throughput genomic technologies, pharmacogenomics is becoming a powerful approach to determine individuals’ responses to drug therapies. Typically, studies generate molecular profiles (e.g. SNPs, gene or protein expression) from cell lines, measure cellular responses to drugs, and then develop computational models to predict drug responses. These computational models could be applied to identify molecular determinants of drug response and further stratify patient populations for given drug therapies, under the assumption that cell line models are clinically relevant. For example, earlier efforts on the NCI-60 panel have highlighted specific genetic aberrations as drug targets or biomarkers informative of drug response. For instance, BRAF and EGFR mutations are currently used to predict response to specific kinase inhibitors. Later, studies like the Cancer Cell Line Encyclopedia (CCLE), Genomics of Drug Sensitivity in Cancer (GDSC) and the GSK panel extended this work to large-scale collections of cell lines with drug responses and more molecular data types. These large cell line datasets provide a more comprehensive representation of the genomic variability observed in tumors, providing new means to identify novel drug targets or drug response biomarkers. These large datasets can also be used to develop computational models to predict drug responses. For instance, CCLE and GDSC have been used to evaluate the robustness of linear prediction models, develop novel computational approaches identifying combinatorial biomarkers of drug response, and validate prediction models with genomic and chemical features.
Exploring these data resources can help uncover new drug mechanisms and further personalize drug therapies. Currently, most of the computational models to predict drug sensitivity of cancer cell lines involve gene-level features like gene expression. However, gene-level features have been reported to have limited reproducibility across independent studies and to pose challenges for biological interpretation. There is growing evidence that drug responses could be modulated by the concerted behavior of multiple genes, instead of individual genes. Pathway (or gene-set) based approaches can help to take into account such coordination of genes, reduce model complexity and increase the explanatory power of prediction models. In fact, pathway approaches have been successfully applied in disease classification [16, 17] by aggregating gene expression into pathway-level activities used for prediction. In the context of drug sensitivity, such a pathway-based approach may also help improve predictions. While gene-level models have been validated and compared [10, 18], the performance of the pathway-based approach in this context is yet to be investigated and validated.
In this study, we investigate four representative approaches to score pathway activities based on gene expression data alone. Specifically, these four approaches were compared on 24 drugs from the CCLE dataset, in terms of their performance in predicting drug response and their ability to recapitulate target-related pathways. For each approach, sample-wise pathway activity scores were first calculated for cell lines, and then used as inputs to Elastic net models to predict drug responses.
Raw gene expression and drug response data (IC50) were collected from the CCLE for 24 drugs. Specifically, raw gene expression data (Affymetrix CEL files) were first extracted and normalized with the Bioconductor affy package (MAS5 algorithm) and then log-transformed. For genes with multiple probesets, the optimal probeset was then determined using the R package jetset. For each drug, IC50 values were log-transformed for downstream analysis. Only the cell lines with both gene expression and response data were used to build the prediction model for each drug. Note that the number of cell lines varies across drugs, because some cell lines do not have response data for all drugs.
Canonical pathways were collected from the MetaCore pathway knowledge database, including pathways defined for specific diseases, biological processes or certain stimuli. Our analysis was restricted to the 1410 pathways with between 5 and 200 member genes.
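As an illustration of the per-drug sample selection described above, the pairing of expression profiles with log-transformed IC50 values could be sketched as follows. This is a minimal sketch rather than the code used in the study: the dictionary layout, the function name prepare_drug_dataset and the use of the natural logarithm are assumptions (the text does not state the base of the log transform).

```python
import math


def prepare_drug_dataset(expression, ic50_by_cell_line):
    """Pair expression profiles with log-transformed IC50 values for one drug.

    expression: dict mapping cell line -> dict of gene -> expression value
                (hypothetical layout, chosen for illustration).
    ic50_by_cell_line: dict mapping cell line -> IC50 (> 0) for this drug;
                cell lines that were not tested are simply absent.
    """
    # Keep only cell lines that have both expression and response data.
    shared = sorted(set(expression) & set(ic50_by_cell_line))
    features = [expression[cell_line] for cell_line in shared]
    # Natural log is an assumption; math.log raises ValueError for IC50 <= 0.
    response = [math.log(ic50_by_cell_line[cell_line]) for cell_line in shared]
    return shared, features, response
```

Because the intersection is taken per drug, the number of returned cell lines can differ from one drug to the next, which matches the note above.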
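The size restriction on pathways can be expressed as a simple filter. The sketch below assumes pathways are held as a dictionary from pathway name to a collection of gene symbols; that layout and the function name are illustrative, not taken from the study.

```python
def filter_pathways_by_size(pathways, min_genes=5, max_genes=200):
    """Keep pathways whose number of distinct member genes lies in [min_genes, max_genes].

    pathways: dict mapping pathway name -> iterable of gene symbols
              (hypothetical layout for illustration).
    """
    kept = {}
    for name, genes in pathways.items():
        unique_genes = set(genes)  # count each gene once
        if min_genes <= len(unique_genes) <= max_genes:
            kept[name] = unique_genes
    return kept
```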
Pathway activity scoring approaches
Where n1 and n2 are the numbers of member and non-member genes of a given pathway, respectively. Likewise, ri and rj represent the rankings of individual member and non-member genes based on their expression levels in samples.
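For illustration only, a rank-difference score built from the quantities just defined (n1, n2, ri and rj) could be sketched as below. The sketch assumes that genes are ranked within each sample with rank 1 for the highest expression, and that the score is the mean rank of non-member genes minus the mean rank of member genes, so that a highly expressed pathway receives a positive score. Both the ranking direction and the sign convention are assumptions, and the function names are hypothetical.

```python
def rank_descending(sample_expression):
    """Rank genes in one sample: rank 1 = highest expression, ties share the average rank."""
    ordered = sorted(sample_expression, key=sample_expression.get, reverse=True)
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        # Extend the block while the next gene has the same expression value.
        while (end + 1 < len(ordered)
               and sample_expression[ordered[end + 1]] == sample_expression[ordered[start]]):
            end += 1
        average_rank = (start + 1 + end + 1) / 2
        for gene in ordered[start:end + 1]:
            ranks[gene] = average_rank
        start = end + 1
    return ranks


def rank_difference_score(sample_expression, member_genes):
    """Mean rank of non-member genes minus mean rank of member genes (assumed form)."""
    members = set(member_genes)
    ranks = rank_descending(sample_expression)
    member_ranks = [r for g, r in ranks.items() if g in members]      # the r_i, n1 of them
    other_ranks = [r for g, r in ranks.items() if g not in members]   # the r_j, n2 of them
    if not member_ranks or not other_ranks:
        raise ValueError("need at least one member and one non-member gene")
    return sum(other_ranks) / len(other_ranks) - sum(member_ranks) / len(member_ranks)
```

A score of this kind uses only the expression profile of the sample being scored, so it can be computed for a single new sample without reference to a cohort.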
Note that these four pathway scoring approaches can be grouped into two categories. Specifically, both DiffRank and GSVA score pathway activity as a function of genes inside and outside the pathway, analogous to competitive gene-set analysis. In contrast, PLAGE and Z-Score consider only the genes inside the pathway, analogous to self-contained gene-set analysis. DiffRank was implemented from scratch, and the other three approaches were taken from the GSVA package in Bioconductor.
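To make the self-contained category concrete, the combined z-score idea can be sketched as follows: each member gene is standardized across samples, and the standardized values of the member genes are summed and divided by the square root of the number of genes used. This is a simplified sketch of the general technique, not the Bioconductor implementation used in the study; the data layout, the function name and the handling of constant genes are assumptions.

```python
import math
from statistics import mean, stdev


def combined_zscore(expr_by_gene, member_genes):
    """Per-sample pathway score from member genes only.

    expr_by_gene: dict mapping gene -> list of expression values, one per
                  sample, in the same sample order for every gene.
    Returns one score per sample.
    """
    used = [g for g in member_genes if g in expr_by_gene]
    if not used:
        raise ValueError("no member gene has expression data")
    n_samples = len(expr_by_gene[used[0]])
    totals = [0.0] * n_samples
    n_scored = 0
    for gene in used:
        values = expr_by_gene[gene]
        sd = stdev(values)
        if sd == 0:
            continue  # assumption: a constant gene is skipped
        mu = mean(values)
        for s, value in enumerate(values):
            totals[s] += (value - mu) / sd
        n_scored += 1
    if n_scored == 0:
        raise ValueError("all member genes are constant across samples")
    return [total / math.sqrt(n_scored) for total in totals]
```

Unlike a rank-based competitive score, this score depends on the other samples through the per-gene mean and standard deviation.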
Building prediction model of drug response
Once pathway activity scores are generated for cell lines, various machine learning models could be applied to predict drug response. We noticed that most individual pathway-level or gene-level features were only modestly correlated with drug response for most drugs (data not shown). For such datasets, machine learning models with regularization (e.g. Elastic net) have proven promising for achieving better predictions, as demonstrated by model choices in previous studies [7, 8] and the recommendations from a recent effort assessing models for drug sensitivity prediction. As such, the Elastic net algorithm (from the R package “glmnet”) was used to build the prediction models, and other machine learning algorithms were not considered in this study. The optimal parameters of each predictive model were determined through 10-fold cross validation. In particular, a grid of 2500 settings of Elastic net parameters (α: 10 settings in [0.2, 1]; λ: 250 settings in [exp(−6), exp(5)]) was searched in cross validation.
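The size of the parameter grid follows from 10 × 250 = 2500 combinations. A minimal sketch of how such a grid could be enumerated is given below; it assumes that α is evenly spaced on [0.2, 1] and that λ is evenly spaced on the log scale between exp(−6) and exp(5). The spacing is not stated in the text, so both choices, as well as the function name, are assumptions.

```python
import math


def elastic_net_grid(n_alpha=10, n_lambda=250,
                     alpha_range=(0.2, 1.0), log_lambda_range=(-6.0, 5.0)):
    """Enumerate (alpha, lambda) pairs for a grid search.

    alpha is spaced linearly and lambda log-linearly; both spacings are assumptions.
    """
    a_low, a_high = alpha_range
    alphas = [a_low + i * (a_high - a_low) / (n_alpha - 1) for i in range(n_alpha)]
    l_low, l_high = log_lambda_range
    lambdas = [math.exp(l_low + j * (l_high - l_low) / (n_lambda - 1))
               for j in range(n_lambda)]
    # Every alpha is combined with every lambda.
    return [(alpha, lam) for alpha in alphas for lam in lambdas]
```

Each pair would then be scored by cross-validated error, and the pair with the lowest error retained for the final model.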
Overview of overlaps and correlations among pathway member genes
We further explored the correlations of member genes within individual pathways. For each pathway, Pearson correlation coefficients were calculated for all pairs of member genes. The median of absolute correlation coefficients (MACC) was taken as an overall measure of pathway member gene correlation. A permutation test was then performed to determine whether member genes within a pathway are more highly correlated than expected by chance. Specifically, gene expression values of cell lines were randomly shuffled 1000 times to generate a vector of random MACCs. For each pathway, the statistical significance of the real MACC was then determined by comparing it to the random MACCs. For example, the p-value would be zero if all random MACCs are smaller than the real MACC. The results (Additional file 1) show that ~40% of pathways (565 out of 1410) have a p-value less than 0.1, as shown in Fig. 2 (panel B). Interestingly, many of the most significant pathways are indeed relevant to cancer mechanisms, such as cell cycle, DNA damage, apoptosis, P53 activation, and translational process with CFTR. In contrast, many of the least significant pathways tend to be defined for other conditions (e.g. asthma, diabetes, cardiovascular disease) or biological processes (e.g. nicotine regulation, neurophysiological process). This observation is consistent with the notion that pathways are generally condition-specific, since only cancer cell lines were used to generate the CCLE dataset.
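The MACC statistic and its permutation test can be sketched as follows. This is an illustrative sketch, not the study's code: it assumes that the permutation shuffles each gene's values independently across cell lines, and that the p-value is the fraction of permutations whose MACC is at least as large as the observed one. The function names and the data layout are hypothetical.

```python
import math
import random
from itertools import combinations
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def macc(expr_by_gene, member_genes):
    """Median of absolute Pearson correlations over all pairs of member genes."""
    used = [g for g in member_genes if g in expr_by_gene]
    pairs = list(combinations(used, 2))
    if not pairs:
        raise ValueError("need at least two member genes with expression data")
    return median(abs(pearson(expr_by_gene[a], expr_by_gene[b])) for a, b in pairs)


def macc_permutation_pvalue(expr_by_gene, member_genes, n_perm=1000, seed=0):
    """Observed MACC and the fraction of shuffled datasets with MACC >= observed."""
    rng = random.Random(seed)
    used = [g for g in member_genes if g in expr_by_gene]
    observed = macc(expr_by_gene, used)
    at_least_as_large = 0
    for _ in range(n_perm):
        # Assumption: shuffle each gene's values independently across cell lines.
        shuffled = {g: rng.sample(expr_by_gene[g], len(expr_by_gene[g])) for g in used}
        if macc(shuffled, used) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_perm
```

With this definition the p-value is zero exactly when every random MACC is smaller than the real one, in line with the example in the text.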
Prediction performance of pathway-based models
As shown in Fig. 3, DiffRank performs the best for 9 drugs and the second best for 8 drugs, whereas GSVA gives the best prediction for 7 drugs and the second best for 5 drugs. Z-Score and PLAGE have the best prediction performance for the remaining 8 drugs, but the poorest performance for 16 drugs. The superior performance of competitive scoring over self-contained scoring suggests that incorporating both member and non-member genes may better capture the variation of pathway activities among individual samples. Compared with gene-level models, at least one pathway-based model performs better for 14 of the 24 drugs. DiffRank, for example, outperforms gene-level models for 11 drugs. Meanwhile, gene-level models perform the poorest for three drugs (Nutlin-3, PD-0332991 and ZD-6474). For these drugs, pathway-based models could be promising alternatives for predicting the sensitivity of cancer cells.
Identification of pathways involving drug-related genes
Elastic net identifies the features with non-zero weights as important features predictive of cellular response to drugs. In order to evaluate the biological relevance of important features identified from elastic net models, we have collected the drug-related genes (targets, transporters and metabolic enzymes) from commercial and public resources (i.e. MetaCore, DrugBank and original CCLE publication) for all drugs (Additional file 2). We further investigated whether pathways involving these drug-related genes could be captured by pathway-based models.
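The check described above can be sketched as a lookup over the features retained by the model. The sketch assumes that model coefficients are available as a dictionary from pathway name to weight, and that pathways are ranked by the absolute value of their weight; the ranking criterion, the data layout and the function name are assumptions made for illustration.

```python
def pathways_with_drug_genes(weights, pathways, drug_related_genes, top_n=None):
    """List selected pathways that contain at least one drug-related gene.

    weights: dict mapping pathway name -> model coefficient.
    pathways: dict mapping pathway name -> iterable of member genes.
    drug_related_genes: targets, transporters and metabolic enzymes of one drug.
    Returns (rank, pathway name, overlapping genes) tuples, best rank first.
    """
    # Features with non-zero weight are the ones the model selected.
    selected = [(name, w) for name, w in weights.items() if w != 0]
    # Assumption: importance is ordered by absolute coefficient.
    selected.sort(key=lambda item: abs(item[1]), reverse=True)
    if top_n is not None:
        selected = selected[:top_n]
    drug_genes = set(drug_related_genes)
    hits = []
    for rank, (name, _weight) in enumerate(selected, start=1):
        overlap = sorted(set(pathways[name]) & drug_genes)
        if overlap:
            hits.append((rank, name, overlap))
    return hits
```

The same function applies to gene-level models if each gene is treated as a single-member set.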
We also compared the genes identified by the gene-level models described earlier against the drug-related genes. These gene-level models identified only one target gene for each of three drugs (Lapatinib, RAF265 and TAE684) and one enzyme gene for Sorafenib, but could not capture any drug-related genes for any of the other 15 drugs. This indicates that gene expression alone can barely identify drug-related genes in the majority of cases, which corroborates the notion that the activities of many targeted proteins are not necessarily reflected by their gene expression.
Pathways recapitulating known drug mechanisms
As demonstrated, DiffRank identified top pathways involving drug-related genes (particularly drug targets) for several drugs, including 17-AAG, AEW541, Irinotecan, Topotecan, Lapatinib, Sorafenib, Paclitaxel and ZD6474. Because of space limitations, we do not discuss each pathway, but rather summarize and highlight a few advantages of pathway models with concrete examples. First, pathway models could identify pathways involving multiple targets. Taking Lapatinib as an example, this drug is a dual inhibitor of EGFR and ERBB2 (or HER2), and was initially approved for treating breast cancer with over-expression of HER2. The gene-level model identified only ERBB2 but not EGFR (see Lapatinib in Additional file 4). In contrast, pathway models trained with CCLE data successfully identified a few top pathways involving both ERBB2 and EGFR, including “anti-apoptotic action of ErbB2 in breast cancer” (see Additional file 5), “ERBB family signaling”, “mitogenic action of ErbB2 in breast cancer” and “EGFR signalling via small GTPase”.
Meanwhile, pathway models also captured relevant mechanisms for drugs with similar mechanisms. For example, both Irinotecan and Topotecan are cytotoxic chemotherapies and share the same mechanism of inhibiting topoisomerase I (TOP1). Pathway models identified one common pathway, “Cell Cycle- Chromosome Condensation”, involving TOP1 for both drugs. Specifically, this pathway ranked 2nd for Topotecan and 6th for Irinotecan (see these two drugs in Additional file 3). siRNA knockdown of one chromosome condensation regulator reduced cell proliferation, caused cell-cycle arrest, and increased apoptosis. Other studies also showed that drugs targeting topoisomerases inhibit chromosome condensation, suggesting that the inhibition of chromosome condensation is potentially part of the underlying mechanisms of Irinotecan and Topotecan.
Pathway models also identified some pathways that contain no drug targets but are known to be relevant to drug responses. For example, the pathway “Normal and pathological TGF-beta-mediated regulation of cell proliferation” ranked 2nd for PF2341066 (Crizotinib). Researchers have found that activation of TGF-beta receptor signaling confers resistance to PF2341066. Interestingly, this was confirmed by the elevated activity of this pathway in CCLE cell lines resistant to this drug (middle panel of Fig. 6). Another example is the pathway “Role of CDK5 in cell adhesion”, which ranked 7th for Sorafenib. The activity of this pathway is significantly lower in sensitive CCLE cell lines, as shown in the right panel of Fig. 6. A recent study discovered that knockdown of CDK5 can inhibit tumor growth in a mouse model. Indeed, a more recent study showed that inhibiting CDK5 improved the sensitivity to Sorafenib-induced tumor suppression in xenografts of hepatocellular carcinoma cells.
In this study, we evaluated different unsupervised pathway activity inference approaches for predicting drug sensitivity of cancer cell lines. Our study highlighted the ability of pathway-based models to reveal drug mechanisms, along with prediction performance comparable to gene-based models. Also, the pathway-based approach could help generate testable hypotheses by looking at the difference in pathway activity scores between sensitive and resistant cell lines, as demonstrated by the cases in Fig. 6.
A crucial step in pathway-based modelling is to convert gene expression profiles to pathway activity scores for individual samples. Our analysis showed that DiffRank and GSVA generally perform better than PLAGE and Z-Score. This suggests that incorporating expression of non-member genes could help characterize pathway activities better than approaches using member genes alone. In addition, both DiffRank and GSVA adopt a ranking-based strategy to calculate pathway activity for individual samples. Such ranking-based pathway activity is computable for a single sample with a gene expression profile, which makes it very straightforward to perform prediction on new samples, i.e. the N-of-1 situations in precision medicine. However, other approaches to compute pathway activity could be used as well. For example, pathway topology has been used to improve pathway enrichment analysis. In our context, pathway structures could also be utilized to help define the importance of genes and so improve pathway activity scoring.
In this study, Elastic net was used to build the predictive models of drug response. We recognize that other machine learning algorithms (e.g. random forest, neural networks) could also be tested in an attempt to improve the prediction for some of the drugs that display poor correlations with IC50 values (data not shown). Prediction performance could also be improved by including additional -omics data types, such as copy number and methylation. Finally, this study was based on canonical pathways, which involve only genes curated in pathway databases. More gene sets could be assembled or derived from molecular interaction networks, such as densely connected sub-networks or downstream target genes of regulators (e.g. transcription factors). Such molecular networks could cover more genes that are involved in drug responses and so improve the accuracy of the predictive models.
We developed a pathway-based modelling strategy to predict drug response of cancer cells. The results show that pathway-based models achieve comparable or even better drug response prediction than gene-based models. Moreover, we have shown that pathway-based models recapitulate known drug response mechanisms for the majority of drugs. Pathway-based models could serve as an effective alternative to gene-based models for predicting drug sensitivities of cancer cells.
We acknowledge the CCLE program for making the datasets publicly available.
The research work and the publication of this article were sponsored by the Center for Individualized Medicine (CIM) at Mayo Clinic and NIH Multiple Myeloma SPORE award (grant no. 5P50CA186781).
Availability of data and materials
Data were downloaded from the CCLE website (https://portals.broadinstitute.org/ccle/).
About this supplement
This article has been published as part of BMC Medical Genomics Volume 12 Supplement 1, 2019: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2018: medical genomics. The full contents of the supplement are available online at https://bmcmedgenomics.biomedcentral.com/articles/supplements/volume-12-supplement-1.
XWW designed the study, performed analyses, and wrote the paper. JPK designed and oversaw the study, analyzed results, and contributed critical review. ZFS analyzed results and contributed critical review. AB designed the study and analyzed results. MTZ consulted for analyses and contributed critical review. All authors read and approved the final manuscript.
Ethics approval and consent to participate
No human data is used in this study.
Consent for publication
Dr. Kocher is an Associate Editor for BMC Medical Genomics. All Authors including Dr. Kocher declare no competing interests.
- 18. Jang IS, et al. Systematic assessment of analytical methods for drug sensitivity prediction from cancer cell line data. Pacific Symposium on Biocomputing. 2014:63–74.
- 21. Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005;6.
- 30. Lin TF, W C, et al. Pooled shRNA screening using mouse xenografts of hepatocellular carcinoma cells identified CDK5 as a potential mechanism mediating Sorafenib resistance. In: AACR proceeding. 2017. p. 80.
- 31. Yang Q, et al. Pathway enrichment analysis approach based on topological structure and updated annotation of pathway. Brief Bioinform. 2017.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.