Background

To reposition drugs [13] from one approved indication to a new indication, drug developers could significantly save associated development cost [4] and lower development risks[5]. With the rapid accumulation of genomics, functional genomics, and chemical informatics data in the past decade, several new systematic approaches to drug repositioning have been proposed. For example, one may study the drug-ligand structural binding relationships systematically for all approved drugs to discover their new targets implicated in other diseases using chemoinformatic tools [6]. If the drug-drug similarity relationships, disease-disease similarity relationships, or side-effect-to-side-effect similarity relationships [7] are characterized, one may populate indications from one drug to another among all drugs under study that are closely related through shared disease, shared side effect, or shared target relationship profiles. Machine learning [1] and biomedical literature text mining [8] approaches can also help uncover non-obvious relationships between approved drugs and potential new indications.

Recently, there has been surging interest to apply "connectivity map" (CMAP) techniques, which attempt to match a repositioned drug's effects by their shared disease perturbation gene expression profiles [2, 3, 911]. A major resource--CMAP--was developed by Lamb et al. [11] to assay genome-wide transcriptional expression data across a wide range of cell lines treated with small drug molecules. Based on the CMAP data, Iorio et al. [3] proposed a drug repositioning method by constructing drug-drug similarity networks. Hu and Agarwal[9] and Sirota et al. [2] also investigated how to pair drugs and disease indications based on negative correlation of drug perturbation and disease gene expression patterns identified from CMAP. The anti-correlation relationships between the drugs and diseases are demonstrated to suggest novel therapeutic indications for existing drugs. The primary advantage of CMAP is that it does not require prior knowledge of drug targets or a drug's detailed mechanism of actions to work. However, CMAP's limitation is also quite apparent: limited coverage of drugs, limited drug perturbation gene expression data, limited dosage-dependent conditions, and the dubious transferability of expression patterns from cell lines or animal models to human systems. Ultimately, it can be time-consuming and costly before a significant portion of current drugs in all safe dosage conditions can be tested in even a limited number of cell lines for CMAP according to the statistics in [12].

In this work, we describe our development of a new resource called DMAP that can help drug development researchers evaluate what effects a drug may have on disease-relevant genes or proteins. DMAP compiles each drug's stimulatory or inhibitory effects on genes or their protein products (Figure 1), based on the computational integration of such data from different databases. It covers 438,004 chemical-to-protein effect relationships between 24,121 PubChem compounds that cover 289,571 chemical entities with a synonymous name, and 5,196 distinct UniProt proteins. DMAP may be used wherever CMAP data coverage is poor for drug repositioning applications. To evaluate the DMAP performances, we calculated drug-to-drug similarity based on newly generated DMAP profiles [3] and obtained Kolmogorov-Smirnov test scores [2, 11]. We demonstrate that DMAP can successfully recall known drugs for examined disease indications. In addition, by applying DMAP, we propose novel indications for drugs currently in NCATS [13].

Figure 1
figure 1

Computational framework. Construct the Drug directionality Map (DMAP) dataset from STITCH and HAPPI and evaluate DMAP's utility for drug repositioning with drug similarity network and K-S algorithms.

Methods

Develop DMAP context from existing databases

In DMAP, we collected, integrated and ranked each pair of drug-to-protein/gene relationship. The primary data for drug-protein information comes from the STITCH [14] database, and may be expanded easily to include other sources such as CTD [15] data. STITCH is an aggregated Cheminformatics database of chemical-to-protein interactions connecting over 300,000 chemicals and 2.6 million proteins for many species mined from biomedical literature. We parsed out STICH chemical protein interactions for Homo sapiens with those chemical-protein "edge actions" being either "activation" (stimulatory interaction) or "inhibition" (inhibitory interactions). To eliminate the synonymous chemicals with the same chemical structure, we mapped 289,571 chemicals to the PubChem database in the result of 24,121 distinct PubChem Compound ID (CID).

Next, we calculated a probability-weighted summary of all the evidence to determine an overall mechanism of "edge action" for each specific chemical-protein interaction using conf(d,p).

conf ( d , p ) = i = 1 i = N ( p r o b i ( d , p ) * s i g n i )
(1)

where d and p are specific drugs and proteins, respectively. N is the number of evidence for the interaction between d and p. prob i (d,p) is the confidence of each evidence i with a value within the range of 0[1]. sign i has a value of 1 if the evidence i represents activation while has a value of -1 if the evidence i represents inhibition.

Then, to rank each interaction, we used HAPPI [16], an integrated protein interaction database that comprehensively integrated weighted human protein-protein interaction data from HPRD, BIND, MINT, STRING, and OPHID by assigning a weight weight(p) for each drug's interacting proteins using the following formula adapted from [17].

w e i g h t ( p ) = k × l n q N E T c o n f ( p , q ) - l n q N E T N ( p , q )
(2)

Here, p and q are proteins on the protein interaction network, k is an empirical constant (k=2 in this study), conf(p, q) is the confidence score assigned by HAPPI to each interaction between protein p and q, and N(p, q) holds the value of 1 if protein p interacts with q or the value of 0 if protein p does not interact with q.

Finally, we developed an intuitive pharmacology score (P-Score) to combine the probability for each interaction and the weight of the interacting proteins:

P  -  S c o r e ( d , p ) = c o n f ( d , p ) × w e i g h t ( p )
(3)

Here, P-Score contains both the information of each drug's action on their interacting proteins and the importance of the protein in the protein-protein interaction network. This is different than the expression level based ranking in CMAP, which may be more suitable for biomarker discovery instead of drug discovery. With P-Score for each drug-protein interaction, DMAP is thus in a compatible format with CMAP [11].

Integrate drug therapeutic indication data

To construct a golden standard of known drug indications to evaluate DMAP's drug repositioning performance, we integrated the Therapeutic Target Database (TTD) [18] and the dataset from the PREDICT [1] paper. TTD is a database that provides information about drugs' known therapeutic protein targets and their targeted diseases. The PREDICT paper provides a compiled list of drug indications. We integrated these two sources to get 2,912 drug indication associations corresponding to 1,180 drugs and 726 indications.

Prepare disease expression signatures and drug expression signatures

To apply the Kolmogorov-Smirnov algorithms with DMAP or CMAP for the drug repositioning, we need the disease expression dataset as one of the inputs. We thus retrieved the disease gene expression profiles from Pacini C et al. [19]'s paper. In total, 87 disease associated microarray experiments were compiled to represent 45 distinct diseases. According to Pacini C's paper, these datasets were obtained from the GEO microarray repository [20]. The raw CEL files were normalized with RMA [21]. For those gene expression profiles representing the same disease, they were combined with the median rank normalization by Warnat et al. [22].

The drug-gene expression datasets were obtained from Iorio et al.[3]'s paper instead of directly from CMAP [11] to reduce the batch effect. Iorio et al.[3] computed a single synthetic ranked list of genes, called Prototype Ranked List (PRL), by merging all the ranked list of the same compound in CMAP. Only consistently overexpressed/underexpressed genes are placed at the top/bottom of the RPL. This helped capture a consensus transcriptional response for each drug. We thus chose to use the PRL to represent the drug signatures from CMAP in this study.

Design drug similarity measurement

The hypothesis for the drug similarity network approach is: if two drugs were similar, the disease indication for one drug could be potentially assigned to the other drug. To measure the similarity among each drug pair, we computed SIM(d x, d y ) based on the Tanimoto Coefficient between their interacting proteins (4).

S I M ( d x , d y ) = | p x + p y + | - | p x - p y - | | p x p y |
(4)

Here, d x and d y represent the two specific drugs, p x represents the set of proteins interacting with d x , p y represents the set protein interacting with d y . |p x p y | is the number of total distinct proteins in p x and p y . |p x+ p y+ | is the number of overlapped proteins on which both drugs have identical interactions (i.e. both activate or inhibit the shared proteins). |p x- p y- | is the number of shared related proteins on which the two drugs have opposite interactions (i.e. one activates while the other inhibits the shared proteins). SIM(d x, d y ) lies in the range of [-1,1] with 1 representing that the two drugs share the same interacting proteins and the drugs' action on each protein is the same while -1 representing that the two drugs share the same proteins but the drugs' action on each protein is opposite.

Evaluate the prediction performance

To assess the prediction performance, we implemented the 'Guilt by Association' (GBA) concepts (Figure 2.) presented by Chiang et al[23] and conducted "Leave-One-Out" cross-validation. For each drug, we removed its known indications and attempted to recover them by considering the indications for its top N similar drugs found. We calculated overall sensitivity and specificity by varying N--the number of similar drugs--from 1 to 981. The area under the ROC curve (AUC) score was used to measure the performance.

Figure 2
figure 2

A schematic representation of the GBA method. Given two drugs × and y and their corresponding indication profiles Ix and Iy, respectively, the potential novel uses for drug × is Iy,x. Similarly, potential novel drug uses for drug y is Ix,y.

Implement Kolmogorov-Smirnov strategy

We implemented the nonparametric, rank-based strategy based on the algorithm originally introduced by Lamb et al.[11] to generate a ranked list of candidate drugs for each disease. For each disease signature, we computed an enrichment score separately for the up- or down- regulated genes: esup and esdown. In specific, we constructed a vector V of the position of each of the up- or down- regulated genes on the basis of the values from the reference drug dataset. The vector was then sorted in ascending order such that V(j) is the position of disease gene j. The computation of the enrichment score is based on Kolmogorov-Smirnov statistic and the details can be referred to in the supplementary material in Lamb et al. [11]. The drug score is set to zero, where esup and esdown have the same algebraic sign. Otherwise, we set the drug score to esup-esdown. To evaluate the statistical significance of the score, we applied a permutation approach by randomly selecting any drug signatures and re-calculated the score accordingly. We did the permutation 200 times for each drug-disease pair and computed the p-value by checking the actual score with the score distribution after randomization. We hypothesized that those drugs with a statistically significant negative score might be a possible treatment for the disease of interest.

Perform literature validation

To check whether the predicted drug-disease pairs have clinical literature evidence, we used the esearch API provided by NCBI. The query term we used is 'drug name AND disease name AND (Clinical Trial[ptyp] OR Clinical Trial, Phase I[ptyp] OR Clinical Trial, Phase II[ptyp] OR Clinical Trial, Phase III[ptyp] OR Clinical Trial, Phase IV[ptyp])'. We recorded the total number of clinical type PubMed articles for each association.

Results and discussion

Drug directionality Map (DMAP) Construction

We constructed DMAP containing 438,004 chemical protein interactions for 24,121 PubChem Compound. (Table 1). Compare to CMAP [11], DMAP shows a 14-fold increase of CID coverage. In addition, DMAP cover most of the Anatomical Therapeutic Chemical (ATC) categories: 100% at the first level, 94.3% at the second level and 92.% at the third level. This fact is significant if we compare CMAP coverage on ATC categories: 100% at the first level, 12.5% at the second level and 11.7% at the third level. Comparing to all the drugs from DrugBank, we have 71.5% in approved group and 12.7% in experimental group in DMAP that exceeding the 39.2% and 11.7% respectively in CMAP. The protein and drug-protein interaction coverage in DMAP does not match to CMAP due to CMAP is based on the whole protein screening. The DMAP most popular ATC category distribution is general balance with the DrugBank approved drug distribution except for infectious disease and cancer drugs in category J and L in Figure 3. The reason for the approved drugs in Drugbank little higher rate than DMAP in J and L is probably due to the favor of infectious disease and cancer drug discovery in Drugbank. We also normalize the score range [-1,1] between CMAP and DMAP to perform Pearson correlation and the result is 0.017, which indicates the difference exists between the cell line drug-protein result, and the protein-chemical association and predicted drug-downstream target association in human. Figure 4 shows the number of shared chemicals between DMAP, CMAP, and for drugs with known indications which we compiled from the TTD database[24] and literature [1] (Figure 4). CMAP contains 394 drugs with known indications. Meanwhile, DMAP contains 982 drugs with known indication. Among these, CMAP and DMAP share 380 drugs, which cover 96.5% of CMAP but only 38.70% of DMAP. CMAP only contains 14 drugs not covered in DMAP; meanwhile DMAP contains 602 drugs not covered in CMAP. Thus, we argue that DMAP provides a valuable resource for repositioning existing drug for new uses. To demonstrate this, in the following section we applied two representative drug-repositioning methods with DMAP dataset and proved its utility for computational drug repositioning.

Table 1 Database statistics comparing CMAP (build 02) and the new DMAP.
Figure 3
figure 3

ATCCODES distribution of DMAP database, the background is the Drugbank approved drug distribution.

Figure 4
figure 4

The Venn diagram of drugs from DMAP drug signatures, CMAP drug signatures and drugs with Indication (acquired from TTD database).

In Figure 5, we show the scale-free characteristics of the drug-protein interaction bipartite network in DMAP. Here, the drug degree of a drug is defined as the number of proteins interacting with the drug, and the protein degree of a protein is defined as the number of drugs interacting with the protein. The R-square for linear regression in drug degree and protein degree are 0.83 and 0.81 (in log scale), correspondingly.

Figure 5
figure 5

Distribution of protein degrees (A) and drug degrees (B) in DMAP bipartite network.

Figure 6 shows the gene ontology (GO) terminologies overrepresented (Figure 6A) and underrepresented (Figure 6B) by protein covered in DMAP. Here, we use the FDR calculated by DAVID functional annotation tool [25] on GO to sort the GO terms, and use protein not covered in DMAP to construct the underrepresented GO terms. We observe that GO 'respond' and 'regulation' terms are the most represented in DMAP.

Figure 6
figure 6

Top 10 gene ontology terms over-represented (A) and under-represented (B) covered by DMAP's proteins. The × axis is the -log(FDR) of gene ontology analysis acquired from DAVID functional annotation tool.

DMAP's utility for drug repositioning

To check DMAP's utility for drug repositioning, we applied the following two well-known drug-repositioning methods in literatures: (i) drug similarity approach [3], (ii) Kolmogorov-Smirnov algorithms [11].

DMAP outperforms CMAP in repurposing using drug similarity approach

We computed 481,671 pairwise drug similarities for the 982 drugs with known indications by calculating the Tanimoto Coefficient between their interacting proteins profiles and evaluate the prediction performance with "Leave-One-Out" cross-validation.

We observe that using the drug-protein interaction in DMAP, the repurposing performance significantly increases, compared to the performance using the same type of information in CMAP. The Overall AUC for the prediction based on DMAP achieved 0.82. Most importantly, early retrieval performed well, with a partial AUC of 0.72 for a specificity of 90% or above[26]. Since one could only test the limited number of drugs in experimental setting, the good performance in high specificity region, approximately corresponding to the top ten candidates of all the predictions, would make the proposed drug repositioning more meaningful in practice.

In comparison, we performed similar analysis based on CMAP transcriptome data and the overall AUC was 0.64. The early retrieval performance was only 0.55. Figure 7 showed that the ROC curve based on DMAP was above the curve from CMAP.

Figure 7
figure 7

ROC curves for the prediction performance based on DMAP (blue line) and CMAP (red line). Blue shade area provides a partial ROC area corresponding to specificity 90% above.

To rule out the possibility that the performance difference was purely due to the drug coverage difference between DMAP and CMAP, we conducted the ROC analysis with only the shared drugs between DMAP and CMAP. The DMAP achieved an AUC of 0.82 while CMAP only achieved an AUC of 0.64 (Figure 8).

Figure 8
figure 8

The ROC curves for DMAP and CMAP using the overlapped drugs.

Out of all the possible drug pairs, we identified 3,014 significant pairs by requiring the number of overlapped proteins to no less than two and the drug similarity score at the top 5% of the distribution. The resulting drug network showed a scale-free property (Figure 9), commonly observed in a biological network. Most of the drugs are well connected and formed communities. In fact, 451 drug pairs out of these 3,014 significant pairs have shared at least one known disease indication. For the remaining 2,563 pairs without overlapping indications, the novel drug-disease associations from 1,206 drug pairs were supported by at least one clinical type PubMed article. Table 2 lists the top 20 drug-disease pairs and could be a good starting point for further experimental validations.

Figure 9
figure 9

Power-law degree distribution of drug similarity network.

Table 2 Top 20 novel drug repositioning candidate identified and the count of PubMed publication support the proposed clinical indications

DMAP outperforms CMAP in repurposing using Kolmogorov-Smirnov approach

We compiled the gene expression profiles for 45 distinct diseases and then queried them against DMAP and CMAP, respectively, to generate a ranked list of potential treatments for each of the diseases of interest. By using DMAP drug-protein interaction data, we were able to correctly retrieve the drugs' indications, which were unable to be retrieved using CMAP drug-protein interaction data. We examined results for diseases that are the leading causes of death in the US [27]. For breast cancer, with the DMAP, we successfully retrieved Anastrozole, Capecitabine, Doxorubicin, Estradiol, Megestrol, Paclitaxel, Testosterone and Testolactone as possible therapeutic drugs for breast cancer. With the CMAP data, only Paclitaxel was retrieved as a potential therapeutic drug. For lung cancer, we retrieved Cisplatin and Etoposide by using the DMAP. However, when CMAP was used, we were not able to retrieve any drugs for lung cancer. Additional file 1 also contains the results for other diseases. To have statistical significance, we required a p-value of less than 0.05. CMAP did relatively better in the case for Alzheimer's disease and Leukemia. For these known relationships covered in CMAP but not DMAP, or vice-versa, some were due to having a borderline p-value while others were due to violating our hypothesis of negative correlation. Overall, DMAP and CMAP database were complimentary to each other.

Besides recalling the known drug-disease relationships, with DMAP, the Kolmogorov-Smirnov approach could also propose novel drug-disease associations. National Center for Advancing Translational Sciences (NCATS) [13] provides a list of drugs for translational medical research. We cross checked the novel predictions with their drug list. Here, we highlight three case studies for Vincristine, Nifedipine and Progestrone. Vincristine is a drug typically indicated for Leukemia and Wilm's tumor. A recent study performed by Indolfi et al.[28] revealed that there is a potentially higher rate of survival in patients with bilateral Wilm's tumor when patients are given a dosage of vincristine/actinomycin D. Nifedipine is indicated to treat high blood pressure and angina. The DMAP results suggest that Nifedipine can also be used to treat asthma. Since Nifidipine is a PKC inhibitor and PKC is a potential therapeutic target for asthma [29], it is a potential treatment for asthma. Cheng et al [30] demonstrated in their study that Nifedipine can help control the constriction involved in sensitized tissue in asthma. Furthermore, another study by Barnes et al[31] suggested that Nifidipine modifies exercise-induced asthma. Progesterone is a prescription drug used for women taking estrogens after menopause and is also used for treating amenorrhea. The DMAP results suggest that progesterone can be used to treat breast cancer. In the study by Groshong et al [32], it was determined that treatment with Progesterone can be used to regulate Breast Cancer cell growth.

Additional file 2 summarized all the novel drug repositioning predicted by both similarity approach and KS algorithms, which could be a starting point for further experimental validation.

Conclusions

Reliable measurements of how drugs affect disease proteins is critical to drug repositioning. In this work we presented a computational drug directionality resource called DMAP to address the challenges. We demonstrated that the resource can greatly facilitate the drug discovery process for the following reasons: access to disease gene-drug relationship data with high coverage and quality; incorporating prior knowledge about biological significance with protein interaction network.

This study differs from previous research in that it provides a comprehensive database of computationally derived drug-protein relationships. Previous efforts [2, 3, 9, 10] on pairing the expression of drugs and diseases mainly rely on experimental connectivity map. For example, Sirota et al.[2] performed a large-scale integration of expression signatures of human diseases from the public data with CMAP drug signatures. This work provides another alternative resource of directed drug-protein relationships. The drug similarity study proves the validity of the probabilistic-based directionality for each drug-protein relationship. The implementation of K-S algorithm proves the compatibility of the pharmacology score based ranking with the expression based ranking in CMAP for the drug repositioning research. With these two major drug repositioning approaches, the knowledge base from DMAP performed better than directly using the microarray data from CMAP. It can thus serve as a valuable resource for drug repositioning studies.

One limitation of DMAP lies in that the number of interacting proteins for each drug is not a constant number. For the gene expression based profiles in the CMAP database, each drug was measured against the same number of proteins in experiments while in DMAP the number of interacting proteins varies from drug to drug. In DMAP, 13,717 drugs have at least 10 activated and inhibited proteins. Despite this limitation, the database served its purpose for systematic drug repositioning as demonstrated in this work.

Another limitation of DMAP is the dependency of drug-protein interaction scoring on protein-protein interaction (PPI) databases. As mentioned in [33], disease gene ranking should be performed using PPI data not only with reasonable quality but also high data coverage. In this work, we only used the PPI data to calculate the protein weight. Therefore, we believe the conclusions above still hold. In other words, we expect time and PPI quality to affect primarily drug-protein data significantly if and only if the drug-protein relationship score is relatively low; when the drug-protein relationship is high - suggesting that there're lots of data coverage for the relationships across many literature reports - the time or PPI quality effect is expected to be relatively small.

Funding

Publication of this article was funded in part by the National Institute of Health to Dr Jake Chen (co-PI of R21CA173918).