Computationally repurposing drugs for breast cancer subtypes using a network-based approach

Firoozbakht, Forough; Rezaeian, Iman; Rueda, Luis; Ngom, Alioune

doi:10.1186/s12859-022-04662-6

Computationally repurposing drugs for breast cancer subtypes using a network-based approach

Research
Open access
Published: 20 April 2022

Volume 23, article number 143, (2022)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

Computationally repurposing drugs for breast cancer subtypes using a network-based approach

Download PDF

Forough Firoozbakht¹,
Iman Rezaeian^1,2,
Luis Rueda ORCID: orcid.org/0000-0001-7988-2058¹ &
…
Alioune Ngom¹

3635 Accesses
10 Citations
3 Altmetric
Explore all metrics

Abstract

‘De novo’ drug discovery is costly, slow, and with high risk. Repurposing known drugs for treatment of other diseases offers a fast, low-cost/risk and highly-efficient method toward development of efficacious treatments. The emergence of large-scale heterogeneous biomolecular networks, molecular, chemical and bioactivity data, and genomic and phenotypic data of pharmacological compounds is enabling the development of new area of drug repurposing called ‘in silico’ drug repurposing, i.e., computational drug repurposing (CDR). The aim of CDR is to discover new indications for an existing drug (drug-centric) or to identify effective drugs for a disease (disease-centric). Both drug-centric and disease-centric approaches have the common challenge of either assessing the similarity or connections between drugs and diseases. However, traditional CDR is fraught with many challenges due to the underlying complex pharmacology and biology of diseases, genes, and drugs, as well as the complexity of their associations. As such, capturing highly non-linear associations among drugs, genes, diseases by most existing CDR methods has been challenging. We propose a network-based integration approach that can best capture knowledge (and complex relationships) contained within and between drugs, genes and disease data. A network-based machine learning approach is applied thereafter by using the extracted knowledge and relationships in order to identify single and pair of approved or experimental drugs with potential therapeutic effects on different breast cancer subtypes. Indeed, further clinical analysis is needed to confirm the therapeutic effects of identified drugs on each breast cancer subtype.

View this article's peer review reports

Network medicine for disease module identification and drug repurposing with the NeDRex platform

Article Open access 25 November 2021

In Silico Oncology Drug Repositioning and Polypharmacology

RepCOOL: computational drug repositioning via integrating heterogeneous biological networks

Article Open access 02 October 2020

Introduction

Discovery of a new drug, especially for cancer can be a very time-consuming and costly process. It normally takes between 10 to 15 years to develop a new drug [1] and can cost upward of tens of billion dollars [2,3,4]. On the other hand, the success rate of developing a new drug for cancer is very low [5], and the number of new FDA-approved drugs has been declining since past 25 years [6].

Drug repositioning and repurposing are effective alternative strategies to find new uses of existing drugs. Both drug repositioning and repurposing processes consist of using an existing drug for treatment of a disease other than its primary or initial purpose. If the drug is already FDA-approved, the process is called drug repurposing, while if the drug is in trial or experimental phase, the process is called drug repositioning. Since in this work we use the same methodology for all existing drugs, irrespective of their FDA-approval status, for simplicity, we refer to the methodology as “drug repurposing” in a general sense.

In case of repurposed drugs (and to a lesser degree for repositioned drugs, depending on their trial or experimental stage), the overall cost and time associated with using it for treatment of other diseases is significantly lower than developing a new drug [4].

In order to repurpose an existing drug for a new disease, the main challenge is to identify new relationships between drugs and diseases. To overcome this challenge, a variety of approaches have been introduced including computational, biological and experimental approaches, as well as hybrid schemes that combine both computational and biological techniques. Computational approaches for drug repurposing bear much lower cost and other barriers in comparison to biological experimental approaches, which makes it a more appealing strategy and a very good starting point for further clinical trials and biological validations [7].

The majority of existing computational methods for drug repurposing are based on the comparison between gene expression response of various cell lines before and after treatment or a combination of several types of data corresponding to various aspects of disease-drug relationships [8,9,10,11]. For example, Lotfi et al. grouped drug repurposing methods based on their principle source of biological data and core methodology, including gene regulatory networks, metabolic networks and molecular interaction networks [9], while Zou et al. categorized drug repurposing methods into two groups of data-driven and hypothesis-driven approaches [11]. Xue et al., on the other hand, focused on the underlying methodology used in drug repurposing, when it regards to categorizing those methods [12]. Luo et al. used Singular Value Thresholding (SVT) to predict scores for unknown drug-disease pairs based on known relationship between drugs and diseases [13]. Zhang et al. utilized a drug similarity network, a disease similarity network, and known drug-disease associations to explore the potential associations among unrelated pairs of drugs and diseases [14].

Generally speaking, we can group drug repurposing approaches into three distinct groups: text-mining approaches [15,16,17,18,19,20,21], semantics-based approaches [22,23,24], and finally network-based approaches [25,26,27,28,29,30,31,32,33,34,35,36]. The latter takes into the account the relationship and interactions between genes in their corresponding pathways. For example Bourdakou et al. used statistical co-expression networks to highlight and prioritize genes for breast cancer subtypes and leveraging them for drug repurposing [36]. One of the biggest difference between the proposed framework and the previous network-based methods is the ability of the proposed framework to identify not only single drugs, but also pairs of combined drugs (and theoretically unlimited number of drug combinations) for a given disease with a reasonable computational overhead, which enables it to find combinations of drugs that could far outreach the therapeutic effects of single drugs for a given breast cancer subtype (or any other disease in general).

This paper introduces a novel network-based approach to identify drugs with the highest repurposability with respect to each of ten breast cancer subtypes. This goal is achieved by first finding driver genes responsible for each subtype using genomic and transcriptomic data, which are then used along with pathway data in order to find those drugs that have the highest repurposing scores for each of ten breast cancer subtypes. The results show that the proposed method is able to identify potential effective known and experimental drugs developed for other diseases to be repurposed for various breast cancer subtypes. Indeed, further wet lab analysis is needed to determine the therapeutic level of identified drugs on each breast cancer subtype. For reference, what we refer to here as ten breast cancer subtypes are ten distinctive sub-groups identified in [37].

Moreover, we used the proposed method to identify single and pairs of drugs for Triple Negative (TN) breast cancer tumors. Between 10% to 15% of breast cancer cases are considered as TN, where they lack any hormone epidermal growth factor receptor 2 (HER-2), estrogen receptors (ER), and progesterone receptors (PR) in the tumor [38]. Thus, the traditional targeted (often hormone) therapy that targets one of these hormones are ineffective in these cases. This lack of targeted therapies has intensified the interest in this group of patients. Our results show that the proposed method were able to computationally identify single and paired repurposed drugs that could have therapeutic effect on this this group.

Materials and methods

For drug expression data, we used level-5 data of the LINCS dataset (from Gene Expression Omnibus with the reference number GSE70138), which consists of z-score values of more than 118,000 drug/ concentration/ treatment_time for more than 12,000 genes. In order to make the process more computationally manageable, we used only the lowest and highest dosages of each drug (generally 0.04 and 10.0 μmol, correspondingly) and a default 24-h time-point frame for the analysis, in case of having more than one time-point frame. If a drug does not have a 24-h time-point frame, we use the default time-point frame indicated in LINCS database.

Obtaining candidate genes

In the first step, we use CNA, CNV and GE data to find the most informative genes, separately for each subtype. To do so, we first use CNA/CNV information to find those genes that have very high genotypic aberration in each subtype based on their GISTIC score [39]. GISTIC identifies significant aberrations using two steps. In the first step, it calculates the G-score statistic, which involves both the frequency of occurrence and the amplitude of the aberration. In the second step, it assesses the significance of each aberration using Fisher’s Exact test [40]. these two steps take place in Fig. 1a. To make sure that we only target aberrations in the copy number and not common variations across different populations, we use the HapMap database [41] (shown in Fig. 1b. HapMap is a catalog of common genetic variants that occur in human. We only consider those genes for a significant test that have CNA but no CNV. We also use gene expression data to identify the top differentially expressed genes for each subtype. For this, we used Chi2 [42] to rank genes based on their ability to separate each subtype from the remaining subtypes. At the end, after obtaining the top genes using CNA/CNV and GE data separately, if CNA/CNV analysis determined N genes as significant in terms of their genomic aberrations, we select the top N genes from GE data; then out of these two gene sets, we take the intersection as candidate genes. These candidate genes are those genes that have both significant differences in terms of gene expression and copy number aberrations.

Obtaining gene scores

The measurement is a normalized z-score value for each replicate of a given gene treated with the same perturbation agent (i.e., perturbagen: either drugs or small molecule compounds or others) based on 95% confidence level [43]. Thus, for each pair of gene and drug agent, we consider a value between − 10 and 10. A value close to zero shows that the expression of a given gene will not be affected by the drug agent. In comparison, based on the concepts of gene expression inhibition and induction, a negative or positive score shows that the expression of the given gene decreases or increases, respectively, because of the effect of the drug agent.

We have used the METABRIC dataset [37], which contains the copy number values and gene expression levels of 2000 primary breast tumors with long-term clinical follow-up. It can be accessed from the European Genome-Phenome Archive using the accession number EGAS00000000083. In [37], the copy number aberrations and copy number variations generated using Affymetrix SNP 6.0 arrays and gene expression data were obtained using Illumina HT 12 technology. The dataset contains two sets of data, validation set and discovery set. Due to the lack of class labels in the validation set, in this paper we only use the discovery set, which contains 997 samples from ten subtypes of breast cancer. Each sample contains expression data for 48,803 probe IDs. The expression of all probes corresponding to the same gene have been merged based on the median expression of those probes, which maps all the probes to 24,351 unigenes. We calculate the same normalized z-score values for each of the ten breast cancer subtypes in the METABRIC dataset, (which can be accessed from European Genome-Phenome Archive with the study id EGAS00000000083), in such a way that the normalized z-score of each gene is a value between -10 and 10. A value close to zero shows that the expression of a given gene will not be affected by the disease, while a negative or positive score shows that the expression of the given gene decreases or increases, respectively, because of the effect of the disease.

Creating unified global human pathway

In the next step, we use the KEGG Pathway database to find all possible paths between genes [44]. A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in the cell. It can trigger the assembly of new molecules, such as a fat or protein, turn genes on and off, or spur a cell to move [45]. The version of the KEGG Pathway database we used contains 265 human pathways. So, by taking union of all genes and also all existing direct relations between each pair of genes, we create a unified global human pathway (UGHP). The UGHP contains interaction between 4985 genes in all 265 human pathways in KEGG as a matrix, where $UGHP_{ij}$ represents signaling interaction type between gene i and gene j. The values of the matrix could be -1, 1 or 0 representing activation, suppression, or no direct signal from gene i to gene j.

Calculating drug-disease repurposing score

At this point, for each drug $D_i$ and breast cancer subtype $S_j$, we perform the following steps:

1
Select the top 50 affected genes by the drug $D_i$ from the LINCS dataset by ranking the genes based on their absolute z-score values and call them drug genes. Note that at the end of the process, the pipeline focuses only on negative correlations between drug and disease.
2
Use the candidate genes corresponding to subtype $S_j$ that we identified in using copy number alteration (CNA), copy number variation (CNV) and gene expression (GE) data [46]. We call these candidate genes disease genes.
3
Map back these drug and disease genes to UGHP to create a drug-disease network, $D_i S_j$, which contains the shortest paths between each pair of drug-disease genes (shown in Fig. 1c). Thus, the maximum number of nodes in $D_i S_j$ network, N, is given as follows:
$$\begin{aligned} N = G_{dr}+G_{di}+G_i \end{aligned}$$
(1)
where $G_{dr}$ is the number of drug genes, $G_{di}$ is the number of disease genes, and $G_i$ is the number of intermediate genes in the shortest path between each pair of drug and disease genes.
4
Since for each gene in this drug-disease network we have two z-score values (one for the effect of drug and one for the effect of disease), we construct two arrays, one consists of drug z-score values while the other consists of disease z-score values with identical gene order.
5
We compute Pearson correlation [47, 48] between the above arrays using the following formula (shown in Fig. 1d):
$$\begin{aligned} r = \frac{ \sum _{i=1}^{n}(x_i-{\bar{x}})(y_i-{\bar{y}}) }{ \sqrt{\sum _{i=1}^{n}(x_i-{\bar{x}})^2}\sqrt{\sum _{i=1}^{n}(y_i-{\bar{y}})^2}} \end{aligned}$$
(2)
where $x_i$ and ${\overline{x}}$ are the z-score value of gene i and average of all z-score values for the drug group in drug-disease network, and $y_i$ and ${\overline{y}}$ are the z-score value of gene i and average of all z-score values for the disease group in the drug-disease network, respectively.

Figure 1 depicts the proposed framework for the identification of best repurposing drugs for each breast cancer subtype.

Obtaining a positive correlation between a given drug genes and the disease genes means that the drug and the disease have similar effect on the genes in the drug-disease network. In contrast, obtaining a negative correlation implies that the drug’s effect on the genes in the drug-disease network is opposite to the effect of the disease on the genes in that network.

Obtaining a negative correlation is a favorable case in this context, because we are looking for drugs that could have a potential reverse effect on the genes affected by the disease.

Identifying combinations of drugs for repurposing

In the previous section, we solely focused on effects of each individual perturbation agent on each subtype of breast cancer. In this section, we test the hypothesis that combination of two or more drugs might be more effective in terms of reversing the effect of the disease, i.e., by generating a more negative correlation with the disease than each drug independently. For simplicity, in this step, we assume that the z-score value of a given pair of drugs to be additive with respect to the z-score value of each of those drugs independently. In other words, if we assume that the z-score value of drug $D_i$ on gene G is $X_i$, and the z-score value of drug $D_j$ on the same gene G is $X_j$, we can then assume that the z-score value for the given pair of drugs [$D_i$,$D_j$] is X, where $X=X_i + X_j$.

Thus, in order to find the best repurposed pair of drugs for a given subtype of breast cancer, first, we calculate the combined z-score value of all genes for every pair of drugs, and then we pick the top genes with the highest absolute value of their combined z-score. Figure 2 depicts the proposed framework for identification of the best pair of repurposed drugs for each breast cancer subtype.

Calculating drug-disease repurposing score for a pair of drugs

For pair of drug $D_i$ and $D_j$ and breast cancer subtype $S_k$, we perform the following steps:

1
Calculate the combined z-score value of all genes for pair of drugs $D_{ij}$, and selecting the top 50 genes with the highest absolute value of their combined z-score values as paired drug genes.
2
Use the candidate genes corresponding to subtype $S_k$ that we identified in using copy number alteration (CNA), copy number variation (CNV) and gene expression (GE) data [46]. We call these candidate genes disease genes.
3
Map back these drug and disease genes to UGHP to create a drug-disease network, $D_{ij} S_k$, which contains the shortest paths between each pair of drug-disease genes (shown in Fig. 2c). Thus, the maximum number of nodes in $D_{ij} S_k$ network, N, is given as follows:
$$\begin{aligned} N = G_{dr}+G_{di}+G_{ij} \end{aligned}$$
(3)
where $G_{dr}$ is the number of drug genes, $G_{di}$ is the number of disease genes, and $G_{ij}$ is the number of intermediate genes in the shortest path between each pair of combined drug and disease gene.
4
Since for each gene in this drug-disease network we have two z-score values (one for the effect of drug and one for the effect of disease), we construct two arrays, one consists of drug z-score values while the other consists of disease z-score values with identical gene order.
5
We compute Pearson correlation [47, 48] between the above arrays using the following formula (shown in Fig. 2d):
$$\begin{aligned} r = \frac{ \sum _{i=1}^{n}(x_i-{\bar{x}})(y_i-{\bar{y}}) }{ \sqrt{\sum _{i=1}^{n}(x_i-{\bar{x}})^2}\sqrt{\sum _{i=1}^{n}(y_i-{\bar{y}})^2}} \end{aligned}$$
(4)
where $x_i$ and ${\overline{x}}$ are the z-score value of gene i and average of all z-score values for the drug group in drug-disease network, and $y_i$ and ${\overline{y}}$ are the z-score value of gene i and average of all z-score values for the disease group in the drug-disease network, respectively.
6
Finally, we perform a post-verification analysis on drug interference for the identified pairs of drugs using DrugBank’s Interaction Checker tool, in order to confirm if there is any known interference between any of those identified pairs. (shown in Fig. 2e).

A drug-drug interference is a situation in which one drug affects the activity of another. Drugs may interact with each other to cause side effects that are unexpected or unintended. If any pair of drugs have known drug-drug interference, we remove them from the analysis. For example, the combination of Tadalafil and palbociclib generated a negative correlation of -0.65 with subtype 3, which put them in the top 10 list of paired-drugs for this subtype. But given the fact that they have a known moderate interaction with each other, this pair has been removed from the analysis [49]. The reason for doing a post-verification analysis instead of checking it as a pre-process step, is that the post-verification approach gives us the flexibility of updating the results with newly discovered drug interference in the future without a need to rerun the analysis. Also, using a post-verification approach gives us the ability to deal with interferences between a given pair of drugs at different levels depending on the type or level of interference.

Extension to triple-negative breast cancer tumors

In this section, we leverage the proposed pipeline for the ten breast cancer subtypes considered in the previous section, to identify potential repurposable drugs specifically for triple negative beast cancer tumors. In order to do so for identifying candidate genes, we treat triple negative samples in METABRIC as one group and all the remaining samples as another group. By running the pipeline introduced in [46], we identify the most discriminative genes in terms of gene expression and copy number aberration between TN and non-TN groups. Then, using the pipelines depicted in Figs. 1 and 2, we identify the top repurposing single and paired drugs for the TNBC subtype.

Results and discussion

For reference, Table 1 shows the list of drugs that have been approved by FDA to date for breast cancer treatment [50]. The results show that the proposed model is able to identify highly negative correlated drugs corresponding to each of ten breast cancer subtypes, both when used in single drug mode or for identifying pairs of drugs. Some of the well-known and widely used breast cancer drugs have been identified among the top drugs, which again shows that the proposed approach was able to pick up current drugs with high accuracy. For example, Goserelin (Zoladex) is a well known and FDA approved hormone therapy drug for treatment of BC that showed up in top ten drugs for subtypes 2 and 8. Also, Palbociclib (Ibrance) is another well known and FDA approved chemo therapy drug for treatment of BC that showed up in top ten drugs for subtype 4. Moreover, Ruxolitinib, which showed up in top ten drugs for 9 out of 10 subtypes (Table 12) has been under several trials and studies regarding its potential inhibiting effects on BC [51, 52].

Table 1 List of FDA-approved drugs for breast cancer treatment

Computationally repurposing drugs for breast cancer subtypes using a network-based approach

Abstract

Similar content being viewed by others

Network medicine for disease module identification and drug repurposing with the NeDRex platform

In Silico Oncology Drug Repositioning and Polypharmacology

RepCOOL: computational drug repositioning via integrating heterogeneous biological networks

Introduction

Materials and methods

Obtaining candidate genes

Obtaining gene scores

Creating unified global human pathway

Calculating drug-disease repurposing score

Identifying combinations of drugs for repurposing

Calculating drug-disease repurposing score for a pair of drugs

Extension to triple-negative breast cancer tumors

Results and discussion

Single drug repurposing

Paired drug repurposing

Triple negative breast cancer subtype

Conclusion and future work

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation