Background

Overcoming disease is an eternal goal of human beings, and current treatment strategies mainly depend on drugs, which act on target genes or proteins to alleviate symptoms or even prevent the onset of disease [1]. In the drug-target-disease mechanism, identifying disease-causing proteins is a crucial and fundamental problem, and it remains challenging [2]. Computational methods for predicting pathogenic targets have been widely applied prior to in vitro or in vivo biological experiments because of their high efficiency and low cost [3]. Over the past decades, various prediction methods with different performances have been presented.

Earlier studies mainly focused on the protein–protein interaction (PPI) network, whose topological structure was directly used to predict disease-gene associations [4, 5]. However, the large number of false positives in PPI networks from public databases made it difficult for these methods to achieve high prediction accuracy. Hence, later studies incorporated disease-related clinical data, based on GWAS [6,7,8] and gene expression [9,10,11,12,13], respectively. Although these methods achieved more accurate predictions than those using the PPI network alone, limitations remained. For example, even the comprehensive platform TCGA [14] could provide only limited data on uncommon cancers, let alone non-cancer diseases, which greatly restricted the performance of these methods. Unable to overcome these data-source limitations, researchers began to conduct in-depth research on algorithms, the most widely used of which involved machine learning. The GCN-MF model combined a graph convolutional network with matrix factorization for disease-gene association identification [15]. Natarajan et al. derived features of diseases and genes for inductive matrix completion [16]. CATAPULT trained a biased support vector machine with features derived from a heterogeneous network [17]. Zeng et al. treated the problem as a recommender system, presenting a probability-based collaborative filtering model to predict pathogenic human genes [18]. Luo et al. developed a method to predict disease-gene associations with multimodal deep learning [19]. Although these algorithmic efforts improved prediction results, most methods still extracted information only from gene and disease data. In such intricate biological networks, utilizing information beyond genes and diseases to solve the prediction problem is both essential and urgent.

The ultimate objective of predicting pathogenic genes or proteins is to find a breakthrough for disease treatment. When predicting over the whole gene (protein) set, even if a novel gene-disease (protein-disease) association is successfully predicted, treating the disease through this gene (protein) will still be a long process, for many reasons; for example, the research and development of new drugs usually takes a long time. Narrowing the scope from the whole protein set to the drug-targeted protein set is more conducive to disease treatment in clinical research, because for a novel predicted protein-disease association, the drugs that target this protein can be regarded as a candidate collection for treating the disease instead of developing new drugs. Hence, we proposed a method to predict drug-targeted pathogenic proteins, named M2PP. First, the target, disease and drug sets were collected to construct association networks and similarity networks. Then, features were constructed for each target-disease pair based on neighborhood similarity information, drug-inferred information and path information, respectively. Finally, a random forest regression model was trained to score unconfirmed target-disease pairs.

Method

Data collection

We collected single human protein targets of drugs from DrugBank [20], where the drugs were approved by the Food and Drug Administration (FDA) [21]. For these targets, we extracted diseases with curated associations to them from the Comparative Toxicogenomics Database (CTD) [22]. Then, three sets (a target set, a disease set and a drug set) were constructed. Next, we reduced these sets to ensure that every element in one set had associations with both of the other two sets (all associations were from DrugBank and CTD). Finally, we obtained 1002 targets, 1035 diseases and 1095 drugs (Fig. 1a). The target set, disease set and drug set were represented as \(T = \left\{ {t_{1} ,t_{2} , \ldots ,t_{nT} } \right\}\), \({\text{D}} = \left\{ {{\text{d}}_{1} ,{\text{d}}_{2} , \ldots ,{\text{d}}_{{{\text{nD}}}} } \right\}\) and \({\text{M}} = \left\{ {{\text{m}}_{1} ,{\text{m}}_{2} , \ldots ,{\text{m}}_{{{\text{nM}}}} } \right\}\), respectively.
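The reduction step can be sketched as an iterative filter (a minimal Python illustration; the function and variable names are ours, not from the original pipeline): elements are dropped repeatedly until every remaining target, disease and drug keeps at least one association with each of the other two sets.

```python
def reduce_sets(td_pairs, tm_pairs, dm_pairs):
    """Iteratively shrink the target/disease/drug sets so that every
    element has associations with both other sets.
    td_pairs: set of (target, disease); tm_pairs: set of (target, drug);
    dm_pairs: set of (disease, drug)."""
    changed = True
    while changed:
        # Keep only entities present in both relevant association sets.
        targets = {t for t, _ in td_pairs} & {t for t, _ in tm_pairs}
        diseases = {d for _, d in td_pairs} & {d for d, _ in dm_pairs}
        drugs = {m for _, m in tm_pairs} & {m for _, m in dm_pairs}
        new_td = {(t, d) for t, d in td_pairs if t in targets and d in diseases}
        new_tm = {(t, m) for t, m in tm_pairs if t in targets and m in drugs}
        new_dm = {(d, m) for d, m in dm_pairs if d in diseases and m in drugs}
        changed = (new_td, new_tm, new_dm) != (td_pairs, tm_pairs, dm_pairs)
        td_pairs, tm_pairs, dm_pairs = new_td, new_tm, new_dm
    return td_pairs, tm_pairs, dm_pairs
```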

Fig. 1
figure 1

The framework of M2PP. a Construct the target set, disease set and drug set; b Construct heterogeneous networks: the target-disease association network, target-drug interaction network, disease-drug association network, disease-disease similarity network, target-target similarity network and drug-drug topological structure similarity network; c Construct features for target-disease pairs; d Train the random forest model and predict association scores for unconfirmed target-disease pairs

Network construction

First, we constructed three association networks among the target, disease and drug sets: (1) the target-disease association network, including 7342 curated associations from CTD, whose adjacency matrix was represented as \({\text{TDA}}^{{{\text{nT}} \times {\text{nD}}}}\); (2) the target-drug interaction network, including 38,871 curated interactions from DrugBank and CTD, with adjacency matrix \({\text{TDI}}^{{{\text{nT}} \times {\text{nM}}}}\); (3) the disease-drug association network, including 35,319 curated associations from CTD, with adjacency matrix \({\text{DDA}}^{{{\text{nD}} \times {\text{nM}}}}\). For target \({\text{t}}_{{\text{i}}}\) \(\left( {1 \le {\text{i}} \le {\text{nT}}} \right)\) and disease \({\text{d}}_{{\text{j}}}\) \(\left( {1 \le {\text{j}} \le {\text{nD}}} \right)\), if a known association existed between them, \({\text{TDA}}_{{{\text{i}},{\text{j}}}} = 1\); otherwise, \({\text{TDA}}_{{{\text{i}},{\text{j}}}} = 0\). \({\text{TDI}}\) and \({\text{DDA}}\) were defined analogously.
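Building such a binary adjacency matrix from the curated pairs is straightforward; the helper below is an illustrative sketch of ours (names and signature are not from the paper):

```python
import numpy as np

def adjacency(pairs, row_index, col_index):
    """Binary adjacency matrix (e.g. TDA) from curated association pairs.
    row_index / col_index map entity identifiers to matrix positions."""
    A = np.zeros((len(row_index), len(col_index)), dtype=int)
    for r, c in pairs:
        A[row_index[r], col_index[c]] = 1  # 1 marks a known association
    return A
```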

Then, we constructed the similarity networks:

(1) The disease-disease similarity network. We calculated the disease semantic similarities based on the Medical Subject Headings (MESH) descriptors [23] by the IDSSIM algorithm [24] and based on Disease Ontology (DO) [25] by Wang et al.’s method [26], respectively. For a disease-disease pair, the mean value of the two similarities was computed to construct the semantic similarity matrix \({\text{DDS}}\_{\text{S}}^{{{\text{nD}} \times {\text{nD}}}}\). Then, we calculated diseases’ topological structure similarity [27], whose matrix was represented as \({\text{DDS}}\_{\text{T}}^{{{\text{nD}} \times {\text{nD}}}}\):

$${\text{DDS}}\_{\text{T}}_{i,j} = {\text{exp}}\left( { - \alpha ||TDA_{,i} - TDA_{,j}||^{2} } \right)$$
(1)
$$\alpha = \alpha ^{\prime } /\frac{1}{{nD}}\sum\limits_{{k = 1}}^{{nD}} {||TDA_{,k} ||} ^{2}$$

where \(1 \le {\text{i}},{\text{j}} \le {\text{nD}}\); \(TDA_{,i}\) was the ith column of \(TDA\); \(\alpha^{\prime}\) was set to 1 according to a previous study [28]. For the two similarity matrices \({\text{DDS}}\_{\text{S}}\) and \({\text{DDS}}\_{\text{T}}\), we proposed an entropy-based integration strategy to obtain the final disease similarity matrix \({\text{DDS}}^{{{\text{nD}} \times {\text{nD}}}}\). The entropy of row \(i\) in matrix \({\text{W}}^{x \times y}\) was represented as \({\text{E}}_{i}^{{\text{W }}}\):

$${\text{E}}_{i}^{{\text{W }}} = - \mathop \sum \limits_{j = 1}^{y} p_{i,j} {\text{log}}\left( {p_{i,j} } \right)$$
(2)
$$p_{{i,j}} = {{{\text{W}}_{{i,j}} } \mathord{\left/ {\vphantom {{{\text{W}}_{{i,j}} } {\sum\limits_{{k = 1}}^{y} {{\text{W}}_{{i,k}} } }}} \right. \kern-\nulldelimiterspace} {\sum\limits_{{k = 1}}^{y} {{\text{W}}_{{i,k}} } }}$$

According to the formula above, the entropies of disease \({\text{d}}_{{\text{i}}}\) in matrices \({\text{DDS}}\_{\text{S}}\) and \({\text{DDS}}\_{\text{T}}\) were calculated and represented as \({\text{E}}_{i}^{{{\text{DDS}}\_{\text{S }}}}\) and \({\text{E}}_{i}^{{{\text{DDS}}\_{\text{T }}}}\), respectively. All diseases could be divided into two subsets, \({\text{D}}\_{\text{A}}\) and \({\text{D}}\_{\text{B}}\):

$${\text{D}}\_{\text{A}} = \left\{ {{\text{d}}_{{\text{i}}} {\text{|E}}_{i}^{{{\text{DDS}}\_{\text{S }}}} \le {\text{E}}_{i}^{{{\text{DDS}}\_{\text{T }}}} ,1 \le {\text{i}} \le {\text{nD}}} \right\}$$
(3)
$${\text{D}}\_{\text{B}} = \left\{ {{\text{d}}_{{\text{j}}} {\text{|E}}_{j}^{{{\text{DDS}}\_{\text{T }}}} < {\text{E}}_{j}^{{{\text{DDS}}\_{\text{S }}}} ,1 \le {\text{j}} \le {\text{nD}}} \right\}$$
(4)

The similarity matrix \({\text{DDS}}\) could be divided into four parts by \({\text{D}}\_{\text{A}}\) and \({\text{D}}\_{\text{B}}\):

$${\text{DDS}} = \left[ {\begin{array}{*{20}c} {{\text{similarity matrix between D}}\_{\text{A and }} {\text{D}}\_{\text{A}}} & {{\text{similarity matrix between D}}\_{\text{A and }} {\text{D}}\_{\text{B}}} \\ {{\text{similarity matrix between D}}\_{\text{B and }} {\text{D}}\_{\text{A}}} & {{\text{similarity matrix between D}}\_{\text{B and }} {\text{D}}\_{\text{B}}} \\ \end{array} } \right]$$
(5)

A low entropy value indicated that the similarities contained little random information. Hence, the upper-left and lower-right parts of \({\text{DDS}}\) were defined as below:

$${\text{similarity matrix between D}}\_{\text{A and }} {\text{D}}\_{\text{A}} = {\text{DDS}}\_{\text{S}}_{{{\text{D}}\_{\text{A}},{\text{D}}\_{\text{A}}}}$$
(6)
$${\text{similarity matrix between D}}\_{\text{B and }} {\text{D}}\_{\text{B}} = {\text{DDS}}\_{\text{T}}_{{{\text{D}}\_{\text{B}},{\text{D}}\_{\text{B}}}}$$
(7)

The similarities between \({\text{D}}\_{\text{A}}\) and \({\text{D}}\_{\text{B}}\) were still integrated based on the entropy. \({\text{D}}\_{\text{A}}\) was divided into two subsets, \({\text{D}}\_{\text{A}}\_{\text{a}}\) and \({\text{D}}\_{\text{A}}\_{\text{b}}\):

$${\text{D}}\_{\text{A}}\_{\text{a}} = \left\{ {{\text{d}}_{{\text{i}}} {\text{|E}}_{i}^{{{\text{DDS}}\_{\text{S}}_{{{\text{D}}\_{\text{A}},{\text{D}}\_{\text{B}}}} { }}} \le {\text{E}}_{i}^{{{\text{DDS}}\_{\text{T}}_{{{\text{D}}\_{\text{A}},{\text{D}}\_{\text{B}}}} { }}} ,1 \le {\text{i}} \le \left| {{\text{D}}\_{\text{A}}} \right|} \right\}$$
(8)
$${\text{D}}\_{\text{A}}\_{\text{b}} = \left\{ {{\text{d}}_{{\text{j}}} {\text{|E}}_{j}^{{{\text{DDS}}\_{\text{T}}_{{{\text{D}}\_{\text{A}},{\text{D}}\_{\text{B}}}} { }}} < {\text{E}}_{j}^{{{\text{DDS}}\_{\text{S}}_{{{\text{D}}\_{\text{A}},{\text{D}}\_{\text{B}}}} { }}} ,1 \le {\text{j}} \le \left| {{\text{D}}\_{\text{A}}} \right|} \right\}$$
(9)

The \({\text{similarity matrix between D}}\_{\text{A and }} {\text{D}}\_{\text{B}}\) could be represented as below:

$${\text{Similarity}}\;{\text{matrix}}\;{\text{between}}\;{\text{D}}\_{\text{A}}\;{\text{and}}\;{\text{D}}\_{\text{B}} = \left[ {\begin{array}{*{20}c} {{\text{DDS}}\_{\text{S}}_{{{\text{D}}\_{\text{A}}\_{\text{a}},{\text{D}}\_{\text{B}}}} } \\ {{\text{DDS}}\_{\text{T}}_{{{\text{D}}\_{\text{A}}\_{\text{b}},{\text{D}}\_{\text{B}}}} } \\ \end{array} } \right]$$
(10)

To ensure the symmetry of \({\text{DDS}}\), the \({\text{similarity matrix between D}}\_{\text{B and }} {\text{D}}\_{\text{A}}\) was set as the transpose of \({\text{similarity}}\;{\text{matrix}}\;{\text{between}}\;{\text{D}}\_{\text{A}}\;{\text{and}}\;{\text{D}}\_{\text{B}}\). Finally, \({\text{DDS}}\) could be obtained as below:

$${\text{DDS}} = \left[ {\begin{array}{*{20}c} {{\text{DDS}}\_{\text{S}}_{{{\text{D}}\_{\text{A}},{\text{D}}\_{\text{A}}}} } & {\left[ {\begin{array}{*{20}c} {{\text{DDS}}\_{\text{S}}_{{{\text{D}}\_{\text{A}}\_{\text{a}},{\text{D}}\_{\text{B}}}} } \\ {{\text{DDS}}\_{\text{T}}_{{{\text{D}}\_{\text{A}}\_{\text{b}},{\text{D}}\_{\text{B}}}} } \\ \end{array} } \right]} \\ {\left[ {\begin{array}{*{20}c} {{\text{DDS}}\_{\text{S}}_{{{\text{D}}\_{\text{A}}\_{\text{a}},{\text{D}}\_{\text{B}}}} } \\ {{\text{DDS}}\_{\text{T}}_{{{\text{D}}\_{\text{A}}\_{\text{b}},{\text{D}}\_{\text{B}}}} } \\ \end{array} } \right]^{T} } & {{\text{DDS}}\_{\text{T}}_{{{\text{D}}\_{\text{B}},{\text{D}}\_{\text{B}}}} } \\ \end{array} } \right]$$
(11)
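The integration in Eqs. (2)-(11) can be sketched in Python as follows (a sketch under our own naming; `row_entropy` and `entropy_integrate` are illustrative helpers, not the authors' code):

```python
import numpy as np

def row_entropy(W):
    """Shannon entropy of each row after normalising it to a
    probability distribution (Eq. 2); 0*log(0) is treated as 0."""
    P = W / W.sum(axis=1, keepdims=True)
    logP = np.log(P, out=np.zeros_like(P), where=P > 0)
    return -(P * logP).sum(axis=1)

def entropy_integrate(S_sem, S_top):
    """Entropy-based integration of a semantic and a topological
    similarity matrix (Eqs. 3-11): the matrix with the lower row
    entropy (less random information) supplies that row's block,
    and the cross blocks are symmetrised."""
    E_s, E_t = row_entropy(S_sem), row_entropy(S_top)
    A = np.where(E_s <= E_t)[0]          # D_A: semantic similarity preferred
    B = np.where(E_t < E_s)[0]           # D_B: topological similarity preferred
    S = np.zeros_like(S_sem)
    S[np.ix_(A, A)] = S_sem[np.ix_(A, A)]
    S[np.ix_(B, B)] = S_top[np.ix_(B, B)]
    # Split D_A again by entropy restricted to the D_A x D_B block (Eqs. 8-9).
    E_s_ab = row_entropy(S_sem[np.ix_(A, B)])
    E_t_ab = row_entropy(S_top[np.ix_(A, B)])
    Aa, Ab = A[E_s_ab <= E_t_ab], A[E_t_ab < E_s_ab]
    S[np.ix_(Aa, B)] = S_sem[np.ix_(Aa, B)]
    S[np.ix_(Ab, B)] = S_top[np.ix_(Ab, B)]
    S[np.ix_(B, A)] = S[np.ix_(A, B)].T  # keep DDS symmetric (Eq. 11)
    return S
```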

(2) The target-target similarity network. We calculated the target proteins' amino acid sequence similarity from the KEGG database [29] by the Smith-Waterman algorithm [30] and the protein functional similarity by Chen et al.'s method [31], respectively. For a target-target pair, the mean value of the two similarities was calculated to construct the similarity matrix \({\text{TTS}}\_{\text{S}}^{{{\text{nT}} \times {\text{nT}}}}\). Then, the targets' topological structure similarity matrix \({\text{TTS}}\_{\text{T}}^{{{\text{nT}} \times {\text{nT}}}}\) was computed as below:

$${\text{TTS}}\_{\text{T}}_{i,j} = {\text{exp}}\left( { - \beta ||TDA_{i,} - TDA_{j,}||^{2} } \right)$$
(12)
$$\beta = {{\beta ^{\prime } } \mathord{\left/ {\vphantom {{\beta ^{\prime } } {\frac{1}{{nT}}\sum\limits_{{k = 1}}^{{nT}} | |TDA_{{k,}} ||^{2} }}} \right. \kern-\nulldelimiterspace} {\frac{1}{{nT}}\sum\limits_{{k = 1}}^{{nT}} | |TDA_{{k,}} ||^{2} }}$$

where \(1 \le {\text{i}},{\text{j}} \le {\text{nT}}\); \(TDA_{i,}\) was the ith row of \(TDA\); \(\beta^{\prime} = 1\).
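The Gaussian kernel of Eqs. (1) and (12) can be sketched as below (an illustrative helper of ours: the rows of the input are the interaction profiles, so passing \(TDA\) gives the target kernel and passing \(TDA^{T}\) gives the disease kernel):

```python
import numpy as np

def topological_similarity(profiles, bandwidth_prime=1.0):
    """Gaussian interaction-profile kernel as in Eqs. (1)/(12):
    each ROW of `profiles` is one entity's interaction profile
    (rows of TDA for targets, columns of TDA for diseases).
    The bandwidth is bandwidth_prime divided by the mean squared
    profile norm."""
    norms_sq = (profiles ** 2).sum(axis=1)
    bw = bandwidth_prime / norms_sq.mean()
    # Pairwise squared distances between all profile rows.
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.exp(-bw * (diff ** 2).sum(axis=2))
```

For example, `topological_similarity(TDA)` would yield \({\text{TTS}}\_{\text{T}}\) and `topological_similarity(TDA.T)` would yield \({\text{DDS}}\_{\text{T}}\).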

The target subsets \({\text{T}}\_{\text{A}}\), \({\text{T}}\_{\text{B}}\), \({\text{T}}\_{\text{A}}\_{\text{a}}\) and \({\text{T}}\_{\text{A}}\_{\text{b}}\) were defined as below:

$${\text{T}}\_{\text{A}} = \left\{ {{\text{t}}_{{\text{i}}} {\text{|E}}_{i}^{{{\text{TTS}}\_{\text{S }}}} \le {\text{E}}_{i}^{{{\text{TTS}}\_{\text{T }}}} ,1 \le {\text{i}} \le {\text{nT}}} \right\}$$
(13)
$${\text{T}}\_{\text{B}} = \left\{ {{\text{t}}_{{\text{j}}} {\text{|E}}_{j}^{{{\text{TTS}}\_{\text{T }}}} < {\text{E}}_{j}^{{{\text{TTS}}\_{\text{S }}}} ,1 \le {\text{j}} \le {\text{nT}}} \right\}$$
(14)
$${\text{T}}\_{\text{A}}\_{\text{a}} = \left\{ {{\text{t}}_{{\text{i}}} {\text{|E}}_{i}^{{{\text{TTS}}\_{\text{S}}_{{{\text{T}}\_{\text{A}},{\text{T}}\_{\text{B}}}} { }}} \le {\text{E}}_{i}^{{{\text{TTS}}\_{\text{T}}_{{{\text{T}}\_{\text{A}},{\text{T}}\_{\text{B}}}} { }}} ,1 \le {\text{i}} \le \left| {{\text{T}}\_{\text{A}}} \right|} \right\}$$
(15)
$${\text{T}}\_{\text{A}}\_{\text{b}} = \left\{ {{\text{t}}_{{\text{j}}} {\text{|E}}_{j}^{{{\text{TTS}}\_{\text{T}}_{{{\text{T}}\_{\text{A}},{\text{T}}\_{\text{B}}}} { }}} < {\text{E}}_{j}^{{{\text{TTS}}\_{\text{S}}_{{{\text{T}}\_{\text{A}},{\text{T}}\_{\text{B}}}} { }}} ,1 \le {\text{j}} \le \left| {{\text{T}}\_{\text{A}}} \right|} \right\}$$
(16)

Finally, \({\text{TTS}}\_{\text{S}}\) and \({\text{TTS}}\_{\text{T}}\) were integrated into the final target similarity matrix \({\text{TTS}}^{nT \times nT}\):

$${\text{TTS}} = \left[ {\begin{array}{*{20}c} {{\text{TTS}}\_{\text{S}}_{{{\text{T}}\_{\text{A}},{\text{T}}\_{\text{A}}}} } & {\left[ {\begin{array}{*{20}c} {{\text{TTS}}\_{\text{S}}_{{{\text{T}}\_{\text{A}}\_{\text{a}},{\text{T}}\_{\text{B}}}} } \\ {{\text{TTS}}\_{\text{T}}_{{{\text{T}}\_{\text{A}}\_{\text{b}},{\text{T}}\_{\text{B}}}} } \\ \end{array} } \right]} \\ {\left[ {\begin{array}{*{20}c} {{\text{TTS}}\_{\text{S}}_{{{\text{T}}\_{\text{A}}\_{\text{a}},{\text{T}}\_{\text{B}}}} } \\ {{\text{TTS}}\_{\text{T}}_{{{\text{T}}\_{\text{A}}\_{\text{b}},{\text{T}}\_{\text{B}}}} } \\ \end{array} } \right]^{T} } & {{\text{TTS}}\_{\text{T}}_{{{\text{T}}\_{\text{B}},{\text{T}}\_{\text{B}}}} } \\ \end{array} } \right]$$
(17)

(3) The drug-drug topological structure similarity networks. We calculated the drugs' topological structure similarities in the target-drug interaction network and the disease-drug association network, representing them as \({\text{MMS}}\_{\text{T}}^{{{\text{nM}} \times {\text{nM}}}}\) and \({\text{MMS}}\_{\text{D}}^{{{\text{nM}} \times {\text{nM}}}}\), respectively:

$${\text{MMS}}\_{\text{T}}_{{i,j}} = {\text{exp}}\left( { - \gamma ||TDI_{{,i}} - TDI_{{,j}} ||^{2} } \right)$$
(18)
$$\gamma = {{\gamma ^{\prime } } \mathord{\left/ {\vphantom {{\gamma ^{\prime } } {\frac{1}{{nM}}\sum\limits_{{k = 1}}^{{nM}} {||TDI_{{,k}} ||^{2} } }}} \right. \kern-\nulldelimiterspace} {\frac{1}{{nM}}\sum\limits_{{k = 1}}^{{nM}} {||TDI_{{,k}} ||^{2} } }}$$
$${\text{MMS}}\_{\text{D}}_{{i,j}} = {\text{exp}}\left( { - \delta ||DDA_{{,i}} - DDA_{{,j}} ||^{2} } \right)$$
(19)
$$\delta = {{\delta ^{\prime } } \mathord{\left/ {\vphantom {{\delta ^{\prime } } {\frac{1}{{nM}}\sum\limits_{{k = 1}}^{{nM}} {||DDA_{{,k}} ||^{2} } }}} \right. \kern-\nulldelimiterspace} {\frac{1}{{nM}}\sum\limits_{{k = 1}}^{{nM}} {||DDA_{{,k}} ||^{2} } }}$$

where \(1 \le {\text{i}},{\text{j}} \le {\text{nM}}\); \(TDI_{,i}\) and \(DDA_{,i}\) were the ith columns of \(TDI\) and \(DDA\), respectively; \(\gamma^{\prime} = 1\); \(\delta^{\prime} = 1\).

Finally, the heterogeneous network was constructed as shown in Fig. 1b. The characteristics of the data in these networks were summarized in Table 1, where the sparsity was the ratio of the number of edges to the network size. Obviously, our objective network (the target-disease association network) was the most imbalanced.

Table 1 Summary of the five networks' characteristics

Feature construction for model training to score unconfirmed target-disease pairs

For target-disease pair \({\text{t}}_{{\text{i}}}\)-\({\text{d}}_{{\text{j}}}\) (\(1 \le {\text{i}} \le {\text{nT}},{ }1 \le {\text{j}} \le {\text{nD}}\)), we constructed a 9-dimensional feature vector based on its neighborhood similarity information, drug-inferred information and path information (Fig. 1c), as shown in the following formulas:

$${\text{Fea}}1 = {\text{mean}}\left( {{\text{DDS}}_{{{\text{P}},{\text{j}}}} } \right)$$
(20)
$${\text{P}} = \left\{ {{\text{y| TDA}}_{{{\text{i}},{\text{y}}}} = 0,{ }1 \le {\text{y}} \le {\text{nD}}} \right\}$$
$${\text{Fea}}2 = {\text{mean}}\left( {{\text{TTS}}_{{{\text{i}},{\text{Q}}}} } \right)$$
(21)
$${\text{Q}} = \left\{ {{\text{x| TDA}}_{{{\text{x}},{\text{j}}}} = 0,{ }1 \le {\text{x}} \le {\text{nT}}} \right\}$$
$${\text{Fea}}3 = {\text{TTS}}_{{{\text{i}},{\text{a}}}} \times {\text{TDA}}_{{{\text{a}},{\text{j}}}} + {\text{TDA}}_{{{\text{i}},{\text{b}}}} \times {\text{DDS}}_{{{\text{b}},{\text{j}}}} + {\text{TTS}}_{{{\text{i}},{\text{a}}}} \times {\text{TDA}}_{{{\text{a}},{\text{b}}}} \times {\text{DDS}}_{{{\text{b}},{\text{j}}}}$$
(22)
$${\text{a}} = \mathop {\text{arg max}}\limits_{{{\text{x}} \in \left\{ {1,2, \ldots ,{\text{nT}}} \right\}\backslash \left\{ {\text{i}} \right\}}} {\text{TTS}}_{{{\text{i}},{\text{x}}}}$$
$${\text{b}} = \mathop {\text{arg max}}\limits_{{{\text{y}} \in \left\{ {1,2, \ldots ,{\text{nD}}} \right\}\backslash \left\{ {\text{j}} \right\}}} {\text{DDS}}_{{{\text{y}},{\text{j}}}}$$
$${\text{Fea}}4 = \mathop {\max }\limits_{{{\text{k}} \in {\text{K}}}} \left( {{\text{L}}_{{{\text{j}},{\text{k}}}} /{\text{H}}_{{{\text{i}},{\text{k}}}} } \right)$$
(23)
$${\text{K}} = \left\{ {{\text{z| TDI}}_{{{\text{i}},{\text{z}}}} = 1,{\text{DDA}}_{{{\text{j}},{\text{z}}}} = 1,{ }1 \le {\text{z}} \le {\text{nM}}} \right\}$$
$${\text{ H}}_{{{\text{i}},{\text{k}}}} = \left( {{\text{TDI}} \times {\text{MMS}}\_{\text{T}}} \right)_{{{\text{i}},{\text{k}}}} /\left| {\left\{ {{\text{x|TDI}}_{{{\text{i}},{\text{x}}}} = 1,{\text{ MMS}}\_{\text{T}}_{{{\text{x}},{\text{k}}}} \ne 0,{ }1 \le {\text{x}} \le {\text{nM}}} \right\}} \right|$$
$${\text{ L}}_{{{\text{j}},{\text{k}}}} = \left( {{\text{DDA}} \times {\text{MMS}}\_{\text{D}}} \right)_{{{\text{j}},{\text{k}}}} /\left| {\left\{ {{\text{y|DDA}}_{{{\text{j}},{\text{y}}}} = 1,{\text{ MMS}}\_{\text{D}}_{{{\text{y}},{\text{k}}}} \ne 0,{ }1 \le {\text{y}} \le {\text{nM}}} \right\}} \right|$$
$${\text{Fea}}5 = \left( {{\text{TTS}} \times {\text{TDA}}} \right)_{{{\text{i}},{\text{j}}}} /\left| {\left\{ {{\text{x|TTS}}_{{{\text{i}},{\text{x}}}} \ne 0,{\text{ TDA}}_{{{\text{x}},{\text{j}}}} = 1,{ }1 \le {\text{x}} \le {\text{nT}}} \right\}} \right|$$
(24)
$${\text{Fea}}6 = \left( {{\text{TDA}} \times {\text{DDS}}} \right)_{{{\text{i}},{\text{j}}}} /\left| {\left\{ {{\text{y|TDA}}_{{{\text{i}},{\text{y}}}} = 1,{\text{ DDS}}_{{{\text{y}},{\text{j}}}} \ne 0,{ }1 \le {\text{y}} \le {\text{nD}}} \right\}} \right|$$
(25)
$${\text{Fea}}7 = \frac{{\left( {{\text{TTS}} \times {\text{TTS}} \times {\text{TDA}}} \right)_{{{\text{i}},{\text{j}}}} }}{{\left| {\left\{ {\left( {{\text{x}},{\text{s}}} \right){\text{|TTS}}_{{{\text{i}},{\text{x}}}} \ne 0,{\text{ TTS}}_{{{\text{x}},{\text{s}}}} \ne 0,{\text{ TDA}}_{{{\text{s}},{\text{j}}}} = 1,{ }1 \le {\text{x}},{\text{s}} \le {\text{nT}}} \right\}} \right|}}$$
(26)
$${\text{Fea}}8 = \frac{{\left( {{\text{TTS}} \times {\text{TDA}} \times {\text{DDS}}} \right)_{{{\text{i}},{\text{j}}}} }}{{\left| {\left\{ {\left( {{\text{x}},{\text{y}}} \right){\text{|TTS}}_{{{\text{i}},{\text{x}}}} \ne 0,{\text{ TDA}}_{{{\text{x}},{\text{y}}}} = 1,{\text{ DDS}}_{{{\text{y}},{\text{j}}}} \ne 0,{ }1 \le {\text{x}} \le {\text{nT}},1 \le {\text{y}} \le {\text{nD}}} \right\}} \right|}}$$
(27)
$${\text{Fea}}9 = \frac{{\left( {{\text{TDA}} \times {\text{DDS}} \times {\text{DDS}}} \right)_{{{\text{i}},{\text{j}}}} }}{{\left| {\left\{ {\left( {{\text{y}},{\text{t}}} \right){\text{|TDA}}_{{{\text{i}},{\text{y}}}} = 1,{\text{ DDS}}_{{{\text{y}},{\text{t}}}} \ne 0,{\text{DDS}}_{{{\text{t}},{\text{j}}}} \ne 0,{ }1 \le {\text{y}},{\text{t}} \le {\text{nD}}} \right\}} \right|}}$$
(28)
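As a concrete instance, Fea1 (Eq. 20) and Fea5 (Eq. 24) might be computed as below (a sketch with our own function names; indices are 0-based here, unlike the 1-based formulas):

```python
import numpy as np

def fea1(DDS, TDA, i, j):
    """Fea1 (Eq. 20): mean similarity between disease j and the
    diseases that have no known association with target i."""
    P = np.where(TDA[i, :] == 0)[0]
    return DDS[P, j].mean()

def fea5(TTS, TDA, i, j):
    """Fea5 (Eq. 24): association signal reaching disease j through
    targets similar to target i, averaged over contributing paths."""
    count = np.sum((TTS[i, :] != 0) & (TDA[:, j] == 1))
    return (TTS @ TDA)[i, j] / count if count else 0.0
```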

The analysis of these features was summarized in Table 2, including each feature's type, description, content and information source. Considering each target-disease pair in the training set as a sample, a pair with a known association was regarded as a positive sample labelled 1, while a pair without a known association was regarded as a negative sample labelled 0. After constructing features for each sample, the training set was used to train the random forest regression model [32], and the trained model was then used to score the unconfirmed target-disease pairs (Fig. 1d). A higher score represented a larger possibility that the unconfirmed pair was associated. The parameters mtry and ntree in the random forest model were set to 3 (the number of features divided by 3) and 500, respectively, according to the default settings of the R package.
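An equivalent model setup might look like this in Python (a sketch only: scikit-learn's `RandomForestRegressor` stands in for the R package, and toy data replaces the real 9-dimensional feature matrix):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 9))                      # toy 9-dimensional features
y = (X[:, 0] + X[:, 4] > 1.0).astype(float)   # toy 0/1 association labels

# max_features=3 mirrors mtry = 9/3; n_estimators=500 mirrors ntree = 500.
model = RandomForestRegressor(n_estimators=500, max_features=3, random_state=0)
model.fit(X, y)
scores = model.predict(rng.random((5, 9)))    # association scores in [0, 1]
```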

Table 2 Information summary of the constructed features and their influence coefficients

Results

Evaluation metric

The fivefold cross-validation (CV) experiment was implemented to evaluate the performance of the prediction models. In the target-disease association network, there were 7342 known associations and 1,029,728 unconfirmed pairs. First, the 7342 target-disease associations and 7342 randomly selected unconfirmed pairs were considered as positive and negative samples, respectively. The remaining 1,022,386 unconfirmed pairs were unlabeled samples. Then, the positive and negative samples were evenly divided into 5 parts, each containing the same number of positive and negative samples. In each CV, four parts were taken in turn as the training set to train the model, while the remaining part and all unlabeled samples were taken as the test set. For each test sample, the model gave a score representing the possibility that the pair was associated. We calculated the true positive rate (TPR) and false positive rate (FPR) for these scores under different thresholds to obtain the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR). In fivefold CV, we obtained five AUROC/AUPR values and adopted their averages to evaluate the performance of the model. To make the results more reliable, we repeated fivefold CV 5 times and computed the mean and standard deviation (SD) of the five average AUROC/AUPR values as the final evaluation metrics for the prediction models.
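The AUROC can be computed directly from the scores via its rank-statistic interpretation (an illustrative helper of ours, equivalent to sweeping thresholds over the TPR-vs-FPR curve):

```python
import numpy as np

def auroc(labels, scores):
    """Probability that a randomly chosen positive sample outranks a
    randomly chosen negative one, with ties counting one half; this
    equals the area under the ROC curve over all thresholds."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```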

Feature analysis

M2PP acquired a mean AUROC of 0.986 and a mean AUPR of 0.417 under fivefold CV repeated 5 times. To detect the influence of each feature on the model's prediction performance, we removed each feature in turn and reran M2PP with the remaining features under the same fold settings. The more the prediction performance decreased after removing a feature, the more effective that feature was. The AUROC and AUPR values obtained by removing each feature are exhibited as boxplots in Fig. 2, where the point in each box represents the mean value. The mean AUROC/AUPR values using all features were better than those obtained after removing any feature. The paired t-test [33] was performed between the AUROC (AUPR) values using all features and those obtained after removing each feature to check whether the average difference in performance was significantly different from zero. All p-values were less than 0.05, as shown in Fig. 2, indicating that the performance using all features was significantly better than that after removing any feature. This result demonstrated that each feature was indispensable. To further explore the influence of different features on prediction performance, we defined an indicator named the influence coefficient as below:

$${\text{Influence coefficient of Fea}}i = mean\left( {{\text{DifferenceAUROC}}_{i} ,{\text{DifferenceAUPR}}_{i} } \right)$$
(29)
$${\text{DifferenceAUROC}}_{i} = 1/\left( {1 + e^{{ - sum\left( {AUROC_{all\, features} - AUROC_{all\, features\backslash Feai} } \right)}} } \right)$$
$${\text{DifferenceAUPR}}_{i} = 1/\left( {1 + e^{{ - sum\left( {AUPR_{all\,features} - AUPR_{all\, features\backslash Feai} } \right)}} } \right)$$

where \(1 \le {\text{i}} \le 9\); \(AUROC_{all\,features}\) and \(AUPR_{all\,features}\) represented the AUROC and AUPR values of five repetitions of fivefold CV using all features, respectively; \(AUROC_{all\,features\backslash Feai}\) and \(AUPR_{all\,features\backslash Feai}\) represented the corresponding values when removing feature \(Feai\). The larger the influence coefficient, the more effective the feature. The influence coefficient of each feature was shown in Table 2. Among the neighborhood similarity features, Fea3 achieved the largest influence coefficient, because it mainly utilized the nearest neighbors' similarity, the most valid information in the similarity networks. Among the path features, Fea5 and Fea6 obtained advantageous influence coefficients, because paths of length 2 provided more basic, direct and non-redundant information than paths of length 3. The drug-inferred feature, Fea4, also acquired a decent influence coefficient, indicating that drugs indeed play an effective role in predicting target-disease associations because of the drug-target-disease mechanism. Hence, our constructed features were effective, reasonable and indispensable for achieving excellent prediction performance.
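Equation (29) translates directly into code (a sketch; the inputs are the per-run metric arrays from the five repetitions of fivefold CV):

```python
import numpy as np

def influence_coefficient(auroc_all, auroc_wo, aupr_all, aupr_wo):
    """Eq. (29): sigmoid of the summed performance drop caused by
    removing a feature, averaged over the AUROC and AUPR sides."""
    d_roc = 1.0 / (1.0 + np.exp(-np.sum(auroc_all - auroc_wo)))
    d_pr = 1.0 / (1.0 + np.exp(-np.sum(aupr_all - aupr_wo)))
    return (d_roc + d_pr) / 2.0
```

A feature whose removal never changes performance scores exactly 0.5, and a removal that hurts performance pushes the coefficient above 0.5.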

Fig. 2
figure 2

Analysis of features in M2PP. a AUROC values using all features and after removing each feature, with p-values of the paired t-test; the point in each box represents the mean value; b AUPR values using all features and after removing each feature, with p-values of the paired t-test; the point in each box represents the mean value

Comparison with existing prediction models

M2PP was compared with six state-of-the-art models: RFLDA [34], DDR [35], NEDD [36], IRFMDA [37], GCRFLDA [38] and MFLDA [39]. The first four methods were based on the random forest algorithm, and the last two were based on graph convolutional matrix completion and matrix factorization, respectively. We performed fivefold CV five times on each model, exhibiting the mean and SD of AUROC/AUPR values in Fig. 3a. The AUROC values were 0.986 ± 0.001 (M2PP), 0.918 ± 0.002 (MFLDA), 0.922 ± 0.001 (IRFMDA), 0.936 ± 0.001 (GCRFLDA), 0.936 ± 0.001 (NEDD), 0.97 ± 0.001 (DDR) and 0.979 ± 0.001 (RFLDA); the AUPR values were 0.417 ± 0.016 (M2PP), 0.301 ± 0.018 (MFLDA), 0.336 ± 0.017 (IRFMDA), 0.341 ± 0.018 (GCRFLDA), 0.353 ± 0.015 (NEDD), 0.39 ± 0.014 (DDR) and 0.402 ± 0.015 (RFLDA). In terms of both AUROC and AUPR, M2PP achieved the best performance among all methods.

Fig. 3
figure 3

Model comparison. a Models' prediction performance under fivefold CV repeated five times, including the mean AUROC/AUPR values with SD marked at the top of each bar; b Statistics of disease categories, including a circular bar chart on the left exhibiting the number and proportion of diseases in each category and an UpSet chart on the right exhibiting the details of diseases in the top 5 categories; c Models' prediction performance for the top 5 categories, including the mean AUROC/AUPR values with SD marked at the top of each bar

Each disease belonged to at least one category provided by MESH; for example, the disease “Lymphoma” belonged to three categories: “C04: Neoplasms”, “C15: Hemic and Lymphatic Diseases” and “C20: Immune System Diseases”. In our network, diseases involved 24 categories, and the number and proportion of diseases in each category were shown in the left graph of Fig. 3b. The proportions of the top 5 categories, “C23: Pathological Conditions, Signs and Symptoms”, “C10: Nervous System Diseases”, “C04: Neoplasms”, “C14: Cardiovascular Diseases” and “C16: Congenital, Hereditary, and Neonatal Diseases and Abnormalities”, each exceeded 10%; the UpSet chart on the right side of Fig. 3b exhibits the details of the diseases in them. For these five categories, we examined each model's prediction performance on their diseases. First, we trained the model with a training sample set in which known target-disease associations (excluding diseases in the investigated category) were positive samples and randomly selected unconfirmed target-disease pairs (excluding diseases in the investigated category) were negative samples, with equal numbers of positive and negative samples. Second, the pairs between all targets and each disease in the investigated category were taken in turn as the test set to acquire scores from the model. Then, we computed the AUROC and AUPR values for each disease in the investigated category, and the average AUROC/AUPR value was considered the prediction performance for that category. The process was repeated 5 times to obtain reliable results. Each model's mean and SD of AUROC/AUPR values for the five categories were exhibited in Fig. 3c, where M2PP always achieved the best performance. These results indicated the excellent ability of our model.

Case studies

We predicted new pathogenic proteins for five common diseases: lung cancer, breast cancer, colon cancer, leukemia and lymphoma. For one investigated disease, M2PP was trained with a training sample set, where the known target-disease (excluded the investigated disease) associations was the positive samples and the randomly selected unconfirmed target-disease (excluded the investigated disease) pairs of the same size was the negative samples. Then, M2PP could predict for the pairs between all targets and the investigated disease to acquire prediction scores. We repeated the process for 5 times, so the pair between one target and the investigated disease had five scores, and finally the average score was considered as the prediction score of the pair. We sorted the prediction score of all unconfirmed pairs between targets and the investigated disease, and manually searched the top 10 pairs in public biomedical literature to find the supporting evidence. All top 10 targets were successfully predicted for lung cancer, breast cancer and colon cancer, nine targets for leukemia and seven targets for lymphoma, shown in Table 3. Here, we mainly introduced the top 1 predicted target for each disease. Researchers found that TNF played a key role in inducing resistance to epidermal growth factor receptor inhibition in lung cancer, and suggested that a concomitant inhibition of epidermal growth factor receptor and TNF maybe a potentially new treatment strategy for lung cancer patients [40]. IL2 inhibited the growth of breast cancer cells through improving the proliferation of natural killer cells [41]. Inhibiting or knocking MET down made colon cancer cells sensitive on cetuximab-mediated growth inhibition, implicating that targeting MET was a rational strategy for reversing cetuximab resistance in colon cancer [42]. VEGFA was observed to have additive effect in inflating the risk of leukemia [43]. 
CHKA possessed oncogenic activity and could be a potential therapeutic target in lymphoma [44]. We also predicted target-disease association scores on the whole network and sorted the scores of all unconfirmed pairs. Seven of the top 10 associations were successfully predicted, with public literature as evidence, as shown in Table 4. For example, researchers investigated the expression and functions of ALOX5 in breast cancer cells and demonstrated that inhibiting ALOX5 has therapeutic potential in breast cancer [45]. Beyond these literature evidences, we also found that, in both Table 3 and Table 4, the target and disease in every successful prediction shared co-associated drugs (CDs), that is, drugs simultaneously associated with both the target and the disease. This phenomenon further demonstrated that these high-rank predicted pairs were reasonable from the perspectives of both computational data and biomedical verification. Other drugs that interact with a predicted target might be candidate therapeutic strategies for the investigated disease and deserve exploration in future clinical trials. These results indicated the ability of M2PP to facilitate future biological research.

Table 3 Successfully predicted pathogenic targets in top 10 for common diseases
Table 4 Successfully predicted target-disease associations on the whole network in top 10

Conclusion

Predicting drug-targeted pathogenic proteins is crucial for understanding disease mechanisms and implementing disease treatment. In this study, we presented a novel model, M2PP, to predict drug-targeted pathogenic proteins. First, we constructed a heterogeneous network comprising the target-disease association network, target-drug interaction network, disease-drug association network, disease-disease similarity network, target-target similarity network and drug-drug topological structure similarity network. Then, we developed three types of features on the network, based on neighborhood similarity information, drug-inferred information and path information. Finally, we trained a random forest model with these features to score unconfirmed target-disease pairs. In the results section, we first analyzed the constructed features in detail. By removing each feature in turn and checking the change in prediction performance, we found that every feature was indispensable. The three feature types obtained average influence coefficients of 0.598 (neighborhood similarity information), 0.671 (drug-inferred information) and 0.677 (path information), respectively. The path information type achieved the highest value, mainly benefiting from paths of length 2, which provided more basic, direct and non-redundant information than paths of length 3. In addition, the drug-inferred information type also obtained a decent value, indicating that drugs are effective in predicting target-disease associations because of the drug-target-disease mechanism. We then compared M2PP with several state-of-the-art models and found that it achieved advantageous performance among them. According to the disease category, we extracted sub-networks from the whole target-disease association network for the top 5 categories to perform the prediction. Results showed that categories "C23", "C04" and "C14" achieved better performance.
This was because diseases in "C23", "C04" and "C14" had more associations with targets than those in the other two categories, "C10" and "C16". The average degrees of diseases in "C23", "C04" and "C14" were 6.84 (1670 associations/244 diseases), 12.03 (1985/165) and 7.16 (960/134), while in "C10" and "C16" the average degrees were 5.06 (1057/209) and 2.95 (348/118). Finally, we predicted new target-disease associations using M2PP, and several high-rank associations were successfully confirmed with public literature as evidence. These results demonstrated that M2PP is effective and accurate, and may facilitate biological research in the future.
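The average-degree figures above follow directly from the association and disease counts reported in the text; a quick check:

```python
# Average disease degree per category: known target-disease associations
# divided by the number of diseases in that category (counts from the text).
counts = {
    "C23": (1670, 244), "C04": (1985, 165), "C14": (960, 134),
    "C10": (1057, 209), "C16": (348, 118),
}
avg_degree = {c: round(a / d, 2) for c, (a, d) in counts.items()}
# avg_degree reproduces the reported values, e.g. 12.03 for "C04"
```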