Background

New drugs are developed when no existing drug cures a disease or alleviates its clinical symptoms, or when existing drugs have problematic side effects [1]. Most drugs developed so far have followed de novo drug design, which passes through many phases, from drug target discovery and screening to Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) profiling and lead optimization. The candidate then goes through three phases of clinical trials before it is approved and commercialized [2]. The whole de novo process takes 10 to 17 years and costs 300 to 600 million dollars, a sharp increase from 10 million dollars in 1970 and 100 million dollars in 2000 [3].

Drug repositioning has emerged to address the high cost and high failure rate of traditional drug discovery [4]. Drug repositioning is the process of finding new indications for already-approved drugs. Unlike the conventional de novo approach, its most significant benefit is that it can reduce the required time to 3 to 12 years through in vitro or in vivo methods [5]. Major success cases include the application of sildenafil to erectile dysfunction and of thalidomide to multiple myeloma [6, 7]. This approach, however, still relies on prior knowledge, manual methods and wet-bench trials, and its success stories are serendipitous and rare. In silico drug repositioning, which selects and predicts new targets for drugs computationally, is therefore attracting attention [8]. It uses data on drugs, diseases and other relevant information. With such data, a systematic algorithm estimates the probability that an existing drug succeeds in a new indication and predicts repositioning for the most promising candidates, and its performance is then evaluated in terms of accuracy [9]. The numerous studies on in silico drug repositioning so far can be divided into two main streams: the drug-based approach and the disease-based approach.

The drug-based approach attempts drug repositioning by focusing on the pharmaceutical characteristics of drugs. Most previous studies predicted new targets of drugs by calculating similarities from drug-related information. Lamb et al. (2006) used molecular activity information for the chemical compounds that make up drugs [10], Keiser et al. (2009) took advantage of the chemical structures and target protein information of drugs [11], and Chang et al. (2010) used tissue localization together with gene expression patterns [12]. However, information on the chemical structure and characteristics of drugs contains numerous errors, and it is hard to access because it is proprietary to drug manufacturers. Moreover, correct prediction is limited by the complicated metabolic and pharmacokinetic transformations that occur inside the human body.

The disease-based approach starts from the pathological side, identifying features of diseases at the gene or protein level together with their appropriate medication. Among previous studies, Chiang et al. (2009) approached drug repositioning through “guilt by association”, under the assumption that if two diseases share some similar therapies, then a drug used for one of them could also be used for the other [13], and Campillos et al. (2008) predicted new targets for drugs by calculating similarities based on the side effects that appear after a drug is administered [14]. This approach is also limited, however: many complex factors affect the pathology of diseases, and side-effect information must be well curated and sufficiently abundant.

Although in silico drug repositioning methods fall into these two major classes, most of them rely on an assumption of similarity. In the drug-based approach the assumption is that similar drugs have similar therapeutic effects on their targets, while in the disease-based approach it is that similar diseases require similar therapies and thus the same drugs. A computational method that takes these assumptions further is network-based modeling [15]. Drug repositioning with network-based modeling can consider the overall relations between diseases, both direct and indirect. In addition, it can extend the relation between drugs and targets from “one-to-one” to “many-to-many” [16]. Along these lines, Suthram et al. (2010) attempted drug repositioning by building a functional module network from molecular biological information and protein-protein interactions (PPI) [17].

In this paper, we propose an in silico methodology for drug repositioning that aims to maximize effectiveness in terms of time and cost. Starting from the disease-based approach, for which relatively abundant data are readily available, the proposed method combines network modeling, which readily captures relations between diseases, with a machine learning algorithm that operates on those relations. The method is built on the assumption that similar diseases can be treated by similar drugs. If two diseases are similar but are not treated with similar drugs, there may be an opportunity to reposition drugs between them. The proposed method is called Network Mirroring, and its overview is shown in Fig. 1. Figure 1 shows a toy example with 4 drugs (\(Dr_1\) to \(Dr_4\)), 5 proteins (\(Pr_1\) to \(Pr_5\)) and 6 diseases (\(D_A\) to \(D_F\)). The Protein-based Disease Network (PrDN) and the Drug-based Disease Network (DrDN) are disease networks constructed from protein and drug information, respectively. DrDN is mirrored against PrDN through network mirroring, and the relationships between disease nodes are identified. In the figure, disease nodes are prioritized by the difference in their edges between the two networks. Of the six diseases, \(D_A\) is selected with first priority. For the other five diseases, we apply a machine learning algorithm (Additional file 1) on PrDN to obtain scores. The highest-scoring disease is taken to be the most similar in terms of molecular biology; here \(D_D\) is selected as the disease most similar to \(D_A\). Finally, \(Dr_4\), which the disease-drug associations identify as a drug used for \(D_D\), is repositioned to \(D_A\).

Fig. 1 Network Mirroring. PrDN and DrDN are disease networks built from protein information and drug information, respectively. Mirroring the two networks against each other makes it easier to identify diseases whose connections differ. Different connections of a disease node in the two networks indicate that diseases which are similar in PrDN, i.e. which share protein information, actually have different drug profiles. Given that diseases with similar biomolecular characteristics can be treated by similar drugs, drug repositioning between such diseases is possible.

This paper is organized as follows: Section 2 explains the procedure of Network Mirroring, Section 3 presents the results of experiments that applied Network Mirroring to actual diseases, and Section 4 presents our conclusion.

Methods

Network mirroring for drug repositioning

In this paper, we propose Network Mirroring as a new method for drug repositioning. The proposed method is based on disease networks. A disease network expresses relations between diseases as a graph \(G = (D, W)\) of nodes and edges. The node set \(D\) represents diseases, and the edge set \(W\) is calculated from the similarity between diseases. The meaning of similarity varies with the information used to calculate the edges. Two disease networks are constructed from different information: the first is based on the proteins that diseases share, and the second on the drugs used for the diseases. We then compare the two networks. If drugs had been developed fully in line with the molecular biological similarity between diseases, the two networks would be similar. If the networks differ, however, there may be room for drug repositioning, because diseases with similar molecular biology are likely to be treatable with the same drugs. Network Mirroring, based on this intuition, consists of 4 steps. First, it builds two disease networks using protein and drug information, respectively. Second, candidate diseases are selected as those whose edges differ most between the two networks. Third, diseases similar to a candidate disease are selected with a machine learning algorithm, and the drugs used for those diseases are selected as candidate drugs. Lastly, the candidate drugs are repositioned to the candidate disease. A schematic description of the proposed method is shown in Fig. 2.

Fig. 2 Schematic description of the proposed method. The proposed method consists of 4 steps: (a) it builds two disease networks, PrDN and DrDN, using protein and drug information, respectively; (b) it selects candidate diseases by mirroring DrDN against PrDN and prioritizing the diseases whose difference in edges is largest; (c) it scores the other diseases against the candidate disease with a machine learning algorithm, selects the high-scoring ones as similar diseases, and takes the drugs used for them as candidate drugs; (d) lastly, it repositions the candidate drugs to the candidate disease.

Disease network construction

Among previous studies on building disease networks, Hidalgo et al. (2009) constructed a network of co-occurrence between diseases by calculating edges from patient records [18]. Other studies have constructed disease networks from various disease-related information such as genetic characteristics, phenotypes, protein interactions or metabolic pathways [19,20,21,22,23]. In this paper, we use tripartite protein-disease-drug information to construct disease networks. This tripartite relation reflects how diseases arise and are treated: a disease is caused by an abnormal protein and is treated by a drug that targets that protein. On this basis, we construct the Protein-based Disease Network (PrDN) and the Drug-based Disease Network (DrDN) by separating the tripartite information. Diseases in PrDN are connected to each other when they are related to the same proteins [24,25,26,27,28]. Such a connection indicates similarity in molecular biology [29]. Since similar diseases are likely to be targeted by the same drugs, PrDN indicates the potential for using the same drugs for similar diseases. In DrDN, on the other hand, diseases are connected according to the number of drugs they share in actual use [26, 28, 30]. DrDN therefore indicates the status quo of using the same drugs for similar diseases.

The disease networks are graphs, \(PrDN = (D, W_{Pr})\) and \(DrDN = (D, W_{Dr})\), that express connections between diseases with nodes and edges. Because the two networks contain the same diseases, their node sets are identical, but their edge sets differ. Edges between diseases are calculated as the Tanimoto similarity between the vectors that represent the diseases [31, 32]. Tanimoto similarity is a useful similarity measure for sparse binary or integer data. The edge set \(W_{Pr}\) uses protein vectors, while \(W_{Dr}\) uses drug vectors. Each disease has a protein vector and a drug vector, and all vector elements are binary. The weight of an edge increases with the number of proteins or drugs shared by the two diseases. Equation (1) gives the similarity \(w_{ij}\) between disease \(i\) and disease \(j\), where \(D_i\) and \(D_j\) are the vectors of the two diseases and \(D_{ik}\) and \(D_{jk}\) are the \(k\)-th components of their protein or drug vectors.

$$ w_{ij}=\frac{\sum_k D_{ik} D_{jk}}{\sum_k D_{ik}+\sum_k D_{jk}-\sum_k D_{ik} D_{jk}} $$
(1)
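As a minimal sketch of Eq. (1), assuming each disease is encoded as one row of a binary NumPy matrix, the edge weights of a disease network could be computed as follows. The function name `tanimoto_matrix` and the toy matrix are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def tanimoto_matrix(X):
    """Pairwise Tanimoto similarity between the rows of a binary matrix X."""
    X = np.asarray(X, dtype=float)
    shared = X @ X.T                # sum_k D_ik * D_jk for every pair (i, j)
    counts = X.sum(axis=1)          # sum_k D_ik for every disease i
    denom = counts[:, None] + counts[None, :] - shared
    # Assumption: two diseases with empty vectors get similarity 0.
    return np.divide(shared, denom, out=np.zeros_like(shared), where=denom > 0)

# Hypothetical toy data: 3 diseases (rows) x 5 proteins (columns).
disease_protein = [[1, 1, 0, 0, 0],
                   [1, 0, 1, 0, 0],
                   [0, 0, 0, 1, 1]]
W_pr = tanimoto_matrix(disease_protein)
```

In this toy matrix, diseases 0 and 1 share one of three distinct proteins, so their weight is 1/3, while diseases 0 and 2 share none and get weight 0.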

Candidate disease prioritization

In the candidate disease prioritization step, we select diseases for drug repositioning. To this end, we compare PrDN and DrDN, search for diseases whose edge distributions differ between them, and prioritize those diseases. We apply the Kullback-Leibler (KL) divergence to compare all diseases quantitatively. The KL divergence measures the difference between two probability distributions [33,34,35]. Its formula is shown in Eq. (2).

$$ KL\left(P\parallel Q\right)=\sum_{i=1}^{N} P_i \ln\frac{P_i}{Q_i} $$
(2)

where \(P_i\) and \(Q_i\) are the probabilities that the distributions \(P\) and \(Q\) assign to value \(i\) of the random variable.

\(KL(P \parallel Q)\) indicates the difference between the probability distributions \(P\) and \(Q\). Note that it is not symmetric: \(KL(Q \parallel P)\) generally takes a different value. KL is 0 if \(P\) and \(Q\) are identical and positive otherwise.

The proposed method considers the reflection of PrDN on DrDN, since PrDN is the network that provides information on potential drug repositioning. Therefore, the distribution \(P\) in Eq. (2) is taken from PrDN and the distribution \(Q\) from DrDN. Because the KL divergence is calculated on probabilities, pre-processing is required to convert \(w_{ij}\) into a probability. The edge weights are exponentiated to cope with the sparseness of the data (a zero weight then maps to \(e^0 = 1\) rather than 0), and the probability is calculated as shown in Eq. (3).

$$ p_{ij}=\frac{e^{w_{ij}}}{\sum_{k=1}^{N} e^{w_{ik}}} $$
(3)

where N denotes the number of diseases.
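The conversion in Eq. (3) is a row-wise normalization of exponentiated weights. The following sketch illustrates it under the assumption that the weights of one disease are stored as one row of a NumPy array and that the sum runs over every entry of that row; the function name `edges_to_probabilities` and the example weights are hypothetical.

```python
import numpy as np

def edges_to_probabilities(W):
    """Row-wise conversion of edge weights w_ij into probabilities p_ij (Eq. 3)."""
    W = np.asarray(W, dtype=float)
    E = np.exp(W)                              # e^{w_ij}; a zero weight becomes 1
    return E / E.sum(axis=1, keepdims=True)    # normalize each row over k = 1..N

# Hypothetical weights of one disease's edges to N = 4 diseases.
p = edges_to_probabilities([[0.0, 0.5, 0.0, 0.25]])
```

Every entry of `p` is strictly positive and each row sums to 1, so the logarithm in the KL divergence stays finite even for pairs of diseases that share nothing.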

\(p_{ij}\) can be read as the probability assigned to the weight of \(D_j\) among the diseases connected to \(D_i\) in PrDN. Likewise, \(q_{ij}\) is calculated from DrDN with the same equation. The KL divergence is calculated for each disease, and diseases with larger values receive higher priority. The KL divergence of the \(i\)-th disease is given by Eq. (4).

$$ KL_i\left(p_i\parallel q_i\right)=\sum_{j=1}^{N} p_{ij}\ln\frac{p_{ij}}{q_{ij}},\quad p_i,\ q_i\in \mathbb{R}^N $$
(4)

where \(p_i = (p_{i1}, \ldots, p_{iN})\) and \(q_i = (q_{i1}, \ldots, q_{iN})\) are the probability vectors of the \(i\)-th disease derived from PrDN and DrDN, respectively.

With this process, the top σ% of diseases are assigned as candidate diseases for drug repositioning, where σ is a user-specified parameter. Figure 2b gives an example of the candidate disease prioritization step. \(D_A\) is connected to \(D_B\) and \(D_D\) in PrDN but to \(D_C\) and \(D_E\) in DrDN, so it is connected to entirely different diseases in the two networks. In contrast, \(D_F\) is connected to \(D_D\) and \(D_E\) in both PrDN and DrDN. Under the assumption of the proposed method, it is intuitive that \(D_A\), whose connections are entirely different, is a more likely target for drug repositioning than \(D_F\), whose connections are identical in the two networks. This process and the quantitative comparison are shown in Fig. 3.
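The prioritization step can be sketched in a few lines. The edge lists below follow the connections described for \(D_A\) and \(D_F\); the uniform edge weight of 0.5, the zero diagonal, the rounding rule for the top σ% and all function names are assumptions made for illustration, so the resulting numbers are not meant to reproduce the values in the figures.

```python
import numpy as np

def row_softmax(W):
    """Eq. (3): exponentiate the weights and normalize each row."""
    E = np.exp(np.asarray(W, dtype=float))
    return E / E.sum(axis=1, keepdims=True)

def kl_per_disease(W_pr, W_dr):
    """Eq. (4): KL divergence per disease, with P from PrDN and Q from DrDN."""
    P, Q = row_softmax(W_pr), row_softmax(W_dr)
    return np.sum(P * np.log(P / Q), axis=1)

def build_network(n, edges, weight=0.5):
    """Symmetric weight matrix from an undirected edge list (hypothetical helper)."""
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = weight
    return W

# Diseases D_A..D_F are indexed 0..5.
A, B, C, D, E, F = range(6)
W_pr = build_network(6, [(A, B), (A, D), (F, D), (F, E)])
W_dr = build_network(6, [(A, C), (A, E), (F, D), (F, E)])

kl = kl_per_disease(W_pr, W_dr)
ranked = np.argsort(-kl)                      # most different disease first
sigma = 20                                    # user-specified percentage
n_top = max(1, int(np.ceil(len(kl) * sigma / 100)))   # assumed rounding rule
candidates = ranked[:n_top]
```

With these assumed weights, \(D_A\) receives the largest KL value and \(D_F\) a value of 0, which matches the intuition described above.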

Fig. 3 Toy example of candidate disease prioritization. The panels compare \(D_A\) and \(D_F\) step by step. (a) Similarity vectors of the edges that connect \(D_A\) and \(D_F\) to the other diseases in PrDN and DrDN. (b) Probabilities obtained from (a) by pre-processing. (c) KL values calculated for the two diseases according to the formula. \(KL_A\) is 0.2, larger than \(KL_F\), which is near 0, so the quantified priority agrees with the intuitive one. (d) and (e) Probability distributions of the two diseases in (b), sorted in descending order of value. Comparing (d) and (e) makes the reason for the large difference in KL values evident.

Candidate drug selection and drug repositioning

The candidate drug selection step selects the drugs to be repositioned to a candidate disease. We define candidate drugs as the drugs used for diseases that are similar to the candidate disease. Similar diseases are selected by scoring the relation between the candidate disease and every other disease on PrDN with a machine learning algorithm and taking the diseases with the highest scores. For the scoring, a graph-based Semi-Supervised Learning (SSL) algorithm is used [21]. SSL algorithms perform well especially when labeled data are scarce relative to the total amount of data, as is the case for biomolecular and drug data, and the graph-based variant is the one suited to network structures. Given a graph and labels, the SSL algorithm calculates predictive outputs, called f-scores, for the unlabeled nodes (see Appendix A). Stronger connections between nodes lead to higher f-scores, and a higher f-score for an unlabeled node indicates that it is more similar to the labeled nodes [22, 36, 37].
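The scoring step can be sketched as follows. Appendix A is not reproduced here, so the sketch assumes a commonly used closed-form solution of graph-based SSL, \(f = (I + \mu L)^{-1} y\), where \(L\) is the graph Laplacian (degree matrix minus weight matrix) and \(y\) is the label vector. The function name `ssl_f_scores`, the parameter `mu` and the toy graph are hypothetical, and the authors' exact formulation may differ.

```python
import numpy as np

def ssl_f_scores(W, labels, mu=1.0):
    """f-scores from an assumed closed form f = (I + mu * L)^(-1) y."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(labels, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W        # graph Laplacian
    return np.linalg.solve(np.eye(len(y)) + mu * laplacian, y)

# Hypothetical chain of three diseases, 0 - 1 - 2, with node 0 labeled '1'.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
f = ssl_f_scores(W, labels=[1, 0, 0])
```

In this toy chain, node 1 is directly connected to the labeled node and receives a higher f-score than node 2, which is reachable only indirectly.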

To find the diseases that are biologically most similar to the candidate disease, PrDN's edge set \(W_{Pr}\) is supplied to the algorithm. The candidate disease node is labeled ‘1’ and all other nodes ‘0’. The top δ% of all diseases by f-score are then selected as similar diseases, where δ is a user-specified parameter. Finally, all drugs used for the similar diseases are chosen as candidate drugs for the candidate disease. This procedure is formulated in Eq. (5).

$$ Dr_i^C=\bigcup_{j=1}^{n_s} Drug\left(D_j\right) $$
(5)

where \(n_s = |S(D_i)|\), \(D_j \in S(D_i)\) and the edge \(D_i D_j \in PrDN\).

In Eq. (5), \(S(\cdot)\) is a neighborhood function, so \(D_j\) is one of the diseases similar to \(D_i\). \(Drug(D_j)\) denotes the drugs used for disease \(j\), and \(Dr_i^C\) denotes the candidate drugs of disease \(i\).

A toy example of the candidate drug selection step is shown in Fig. 2c. \(D_A\) was selected as the candidate disease in the previous step, so the labels of the nodes are set to \(\{D_A, D_B, D_C, D_D, D_E, D_F\} = \{1, 0, 0, 0, 0, 0\}\). Running the algorithm with PrDN's edge set \(W_{Pr}\) gives f-scores of \(\{0.6, 0.2, 0.9, 0.3, 0.1\}\) for \(\{D_B, D_C, D_D, D_E, D_F\}\), respectively. Since the top 20% (δ = 20) of these diseases is taken, \(D_D\) is selected, and the drugs used for \(D_D\) become the candidate drugs. In the last step, drug repositioning, the candidate drugs are repositioned to the candidate disease. Figure 4 shows the pseudo code of Network Mirroring.
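The toy example above can be traced with a short sketch of Eq. (5). The f-scores are the ones given in the text, and the association of \(D_D\) with \(Dr_4\) follows the description of Fig. 1; the drug sets of the other diseases, the rounding rule for the top δ% and the function name `select_candidate_drugs` are assumptions made for illustration.

```python
def select_candidate_drugs(f_scores, disease_drugs, delta):
    """Pick the top delta% of diseases by f-score and pool their drugs (Eq. 5)."""
    # Assumption: the number of similar diseases is rounded, with at least one.
    n_similar = max(1, round(len(f_scores) * delta / 100))
    similar = sorted(f_scores, key=f_scores.get, reverse=True)[:n_similar]
    candidates = set()
    for disease in similar:
        candidates |= disease_drugs.get(disease, set())
    return similar, candidates

# f-scores of the toy example for the candidate disease D_A.
f_scores = {"D_B": 0.6, "D_C": 0.2, "D_D": 0.9, "D_E": 0.3, "D_F": 0.1}
# D_D -> Dr_4 follows Fig. 1; the remaining associations are hypothetical.
disease_drugs = {"D_B": {"Dr_1"}, "D_C": {"Dr_2"}, "D_D": {"Dr_4"},
                 "D_E": {"Dr_3"}, "D_F": {"Dr_3"}}

similar, candidate_drugs = select_candidate_drugs(f_scores, disease_drugs, delta=20)
```

With δ = 20 and five scored diseases, one similar disease is kept, so `similar` holds only \(D_D\) and `candidate_drugs` holds only \(Dr_4\), the drug that is then repositioned to \(D_A\).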

Fig. 4 Pseudo code of Network Mirroring

Results and discussion

Data

The proposed method was applied to all diseases that have associations with both proteins and drugs. We collected disease information from the Medical Subject Headings (MeSH) of the National Library of Medicine (NLM) [38]. The relational information comprises 161,223 disease-protein associations, 51,074 disease-drug associations and 91,450 drug-protein associations from multiple databases. From this information, we extracted only the diseases that have associations with both proteins and drugs. In the end, we used information on 2890 diseases, 23,499 proteins and 4603 drugs for PrDN and DrDN. We constructed PrDN from the 161,223 disease-protein associations. To construct DrDN, we computed new disease-drug associations by combining the existing disease-protein and drug-protein associations: a disease and a drug are related when they share a protein. The data used to construct both networks are accessible in [39]. Table 1 summarizes the sources and types of data used in the experiment.
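The rule that a disease and a drug are related when they share a protein amounts to a Boolean matrix product. The sketch below assumes both association sets are stored as binary matrices with a common protein ordering; the function name `derive_disease_drug` and the toy data are hypothetical.

```python
import numpy as np

def derive_disease_drug(disease_protein, drug_protein):
    """Relate a disease and a drug whenever they share at least one protein."""
    dp = np.asarray(disease_protein, dtype=int)   # diseases x proteins
    rp = np.asarray(drug_protein, dtype=int)      # drugs x proteins
    return (dp @ rp.T > 0).astype(int)            # diseases x drugs

# Hypothetical toy data: 2 diseases, 2 drugs, 3 proteins.
disease_protein = [[1, 1, 0],
                   [0, 0, 1]]
drug_protein = [[0, 1, 0],
                [0, 1, 1]]
disease_drug = derive_disease_drug(disease_protein, drug_protein)
```

The rows of `disease_drug` can then serve as the binary drug vectors from which the DrDN edges are computed with Eq. (1).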

Table 1 Data for diseases, proteins, drugs, and disease-protein associations, disease-drug associations, drug-protein associations

Results on validity of network mirroring

We verified how much better drug repositioning performs when it is carried out through Network Mirroring. To this end, we divided all diseases into 5 tiers of 20% each (σ = 20) according to the priority assigned by candidate disease prioritization, the second step of Network Mirroring. Figure 5 shows the Kullback-Leibler divergence values for all diseases and for each tier.

Fig. 5 Kullback-Leibler divergence values for all diseases. The red line shows the KL values of all diseases in descending order, and the bars show the average value of each tier. When PrDN and DrDN are compared, diseases with different connections show higher KL values, whereas diseases with similar connections show lower KL values.

For the next steps, candidate drug selection and drug repositioning, we examined the difference in performance between tiers. We compared the predicted drug repositioning results with those of a reference experiment, in which we carried out a greedy search over all diseases. The experiment was repeated 10 times using 10-fold cross validation on the disease-drug associations. Performance was measured on the drug repositioning results of the last step of Network Mirroring, using the F-measure. The process selects candidate drugs, which are all drugs used for similar diseases, and repositions them to the candidate disease, so the results are binary (0 or 1). For binary results, the F-measure is a suitable performance measure [40]. Equation (6) gives the formula of the F-measure.

$$ \text{F-measure}=\frac{2\left(precision\times recall\right)}{precision+recall} $$
(6)

where \( precision=\frac{TP}{TP+FP} \), \( recall=\frac{TP}{TP+FN} \), and TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively, in the confusion matrix of Table 2.

Table 2 Confusion Matrix

Precision is the ratio of correct positive results to all positive results returned. Recall is the ratio of correct positive results to all positive results that should have been returned. The F-measure is their harmonic mean, and Fig. 6 shows the F-measure for each tier.
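As a small worked instance of Eq. (6), the following sketch computes precision, recall and the F-measure from two binary vectors. The function name `f_measure` and the example vectors are hypothetical; returning 0 when a denominator vanishes is an assumption, not something stated in the paper.

```python
def f_measure(predicted, actual):
    """Precision, recall and F-measure for two equally long binary sequences."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    # Assumption: undefined ratios (empty denominators) are reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical repositioning result for one disease over five drugs.
predicted = [1, 1, 1, 0, 0]
actual = [1, 1, 0, 1, 0]
precision, recall, f = f_measure(predicted, actual)
```

Here TP = 2, FP = 1 and FN = 1, so precision, recall and the F-measure all equal 2/3.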

Fig. 6 F-measure of each tier and of all diseases. The graph shows the prediction results for each tier after candidate drug selection and drug repositioning. The most accurate tier is the 1st tier by KL value, with an F-measure of 0.75. In contrast, the 5th tier, which corresponds to the bottom 20% by KL value, shows the lowest value, 0.17. Across the 5 tiers, performance increases with tier rank. The reference experiment achieved 0.51. In summary, the proposed method appears to be a meaningful methodology for drug repositioning.

Results on utility of network mirroring

In this section, we demonstrate the utility of Network Mirroring using dementia as an example. The results are presented step by step, following the process as applied to dementia. Dementia is caused by brain damage from various factors. A person who develops dementia shows critical impairment of cognitive skills. As memory, language skills, decision making and abstract thinking deteriorate, it becomes impossible to live a normal life [41, 42].

First, we show the results of the candidate disease prioritization step. Dementia, with a KL value of 0.68, belongs to the top 8% of all diseases. For comparison, we selected urinary incontinence, which falls in the bottom 10% with a KL value of 0.04. Urinary incontinence is a condition in which a person urinates involuntarily because of impaired bladder control. It occurs along with overactive bladder, nocturia and other symptoms [43, 44]. Figure 7 shows the probability distributions in PrDN and DrDN for dementia and urinary incontinence, which differ greatly in KL value.

Fig. 7 Probability distributions in PrDN and DrDN for dementia and urinary incontinence. The graphs show the probability distributions in PrDN and DrDN that are used to calculate the KL value in the candidate disease prioritization step. (a) Dementia. (b) Urinary incontinence. Both graphs are sorted in descending order of probability value. For urinary incontinence (b), the two distributions differ slightly at both ends and overlap almost completely over a long interval. For dementia (a), in contrast, the PrDN and DrDN distributions differ significantly and have no overlapping interval.

Next, we performed candidate drug selection and drug repositioning for dementia. Three diseases similar to dementia were selected from the 2890 diseases, which corresponds to 0.1% (δ = 0.1) of all diseases. In descending order of f-score, these are lipid metabolism disorders, dyslipidemias and hypertriglyceridemia. A total of 1296 candidate drugs were selected from the similar diseases, and all of them were repositioned to dementia (note that these 1296 candidate drugs are the ones that target proteins related to the three diseases). Dementia was previously related to 1300 drugs, and 945 of the 1296 repositioned drugs coincide with these existing drugs. The other 351 drugs are newly predicted drugs that have not yet been identified for dementia. We checked their actual effects against the clinical literature, that is, reports that follow patients under medication in order to evaluate its effectiveness, using PubMed to search for clinical information. As a result, 25 of the 351 newly repositioned drugs were verified in the clinical literature to be effective for dementia. The verified drugs are listed with their PMIDs in Table 3. In summary, had no drugs been known for dementia, 970 drugs (945 + 25), or 75% of the candidates (970/1296), would have been verified as repositioned via Network Mirroring.
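The counts reported in this paragraph fit together as follows (a restatement of the figures above, not additional data):

$$ 1296-945=351,\qquad 945+25=970,\qquad \frac{970}{1296}\approx 0.748\approx 75\% $$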

Table 3 Validated Drugs via Literature Survey

We now look at vasopressin, tolfenamic acid and creatine as major examples of drugs verified through the clinical literature. When repositioned to dementia, these three drugs show particularly high efficacy compared with the other drugs.

Vasopressin

In several subtypes of frontotemporal dementia (FTD), damage to regions of the frontal and temporal lobes early in the disease course critically impairs emotional processing, social cognition and behavior. Vasopressin can affect social cognition and behavior, and it therefore has potential implications as a novel treatment for FTD [45].

Tolfenamic acid

Tolfenamic acid lowers the levels of tau, which forms pathological aggregates in Alzheimer’s disease and other tauopathies, by promoting the degradation of the transcription factor specificity protein 1 which regulates tau transcription [46].

Creatine

Sixty-four participants were able to keep their condition healthy and stable by taking 8 g of creatine during a 16-week clinical trial. In addition, the efficacy of creatine for treating dementia could be verified through serum 8-hydroxy-2'-deoxyguanosine (8OH2'dG) levels, which indicate oxidative injury to DNA. Although this value increases rapidly as a patient's condition worsens, creatine treatment could reduce it to a normal level. Creatine is therefore believed to be effective for treatment if repositioned to dementia [47].

Conclusion

In this paper, we proposed Network Mirroring for drug repositioning. The proposed method starts from the assumption that diseases with similar molecular biological characteristics are likely to be treatable with the same drugs. We constructed two disease networks, PrDN and DrDN, from protein information and drug information and mirrored them against each other. PrDN serves as the criterion for checking whether diseases with similar molecular biological characteristics use similar drugs. Where the two networks differ, there remains room for drug repositioning. We used the Kullback-Leibler divergence for the quantitative comparison. Through this process, we select candidate diseases by prioritizing the diseases suitable for drug repositioning. We then determine the diseases similar to a candidate disease with a graph-based SSL algorithm and select candidate drugs from them. Finally, Network Mirroring is completed by repositioning the candidate drugs to the candidate disease.

To verify the proposed method, we applied it to information on 2890 diseases, 23,499 proteins and 4603 drugs. The results show that the proposed method repositions drugs more effectively for the top 20% of diseases than when all diseases are searched. To observe its utility, we applied the method to dementia. The drugs selected by Network Mirroring largely coincide with drugs already in use. In addition, the method discovered drugs with high repositioning potential, which were verified through the clinical literature. The study is expected to provide insight into undiscovered possibilities for drug repositioning.

For future work, we will consider performance comparison with existing methods for validation and develop Network Mirroring into a more sophisticated algorithm. In terms of utility, we plan to complement PrDN by integrating various kinds of disease-related information and to extend Network Mirroring from dementia to various other diseases. In addition, we plan to carry out further studies to discover new repositioned drugs for candidate diseases by considering information on drug analogues used for treatment.