Background

Chemogenomics [1, 2] and chemical systems biology [3, 4] aim to accelerate drug discovery inexpensively through in silico predictions, based on a network with enriched drug-target-disease relationships [5]. Integrated chemical and biological networks can be used to hypothesize new clinical indications for approved drugs with desired safety profiles, and to propose new combination therapy design [6, 7]. Drug-target interaction networks can also be utilized to interpret clinical side effects by revealing modes of drug actions [8]. Semantic standards and technologies facilitate seamless data integration across multiple domains, and enable the construction of a heterogeneous network consisting of various biological entities of different types, such as compounds, proteins, and genes [9]. Several semantically linked datasets, such as PubChemRDF [10], Chem2Bio2Rdf [11], Bio2RDF [12], Open PHACTS [13], and ChEMBL RDF [14], have been published to promote large-scale data mining in drug discovery. A statistical model, called Semantic Link Association Prediction (SLAP), has been applied to Chem2Bio2RDF to predict direct links between compounds and proteins based on their indirect links or paths with other biological objects, such as substructures, diseases, side effects, and pathways [15]. It has been demonstrated that SLAPas a novel and validated approach to predict drug-target interactions (DTIs) outperformed existing alternatives.

Predicting DTI is equivalent to link prediction, which is a fundamental problem and long-standing challenge in complex network analysis [16]. In social networks, topological proximity, measured based on observed network data, can be used to suggest future interactions between individuals [17]. In the context of drug discovery, biological networks can be similarly leveraged to identify potential associations between compounds and protein targets. Typical network-based DTI predictions are often based on similarity profiles calculated from common neighbors or direct connections, and are usually limited to bipartite networks [1821]. However, most similarity-based link prediction algorithms designed for homogeneous networks cannot take into account the heterogeneous types and relations defined in semantic networks; furthermore, it is fairly challenging to consider the long paths connecting two end nodes (indirect connections), which can significantly increase large volumes of randomness in the connectivity. Therefore, we incorporated meta-path topological features [22] for link prediction. A meta-path is a composite relation, denoting a sequence of adjacent links between any two objects in a heterogeneous network. Adjacent links are defined with distinct semantics, so different combinations of adjacent links in sequences contribute distinguishably for link prediction. It has been proven that meta-path-based similarity can improve the performance of information retrieval in heterogeneous information networks [23].

A meta-path defines a certain type of paths linking the starting and ending objects. The total number of paths belonging to a specific meta-path is animportant topological feature to evaluate the strength of associations between starting and ending objects, which is often called path count. For instance, a compound and a protein target can be connected through multiple paths of different types: (A) compound \( \overset{similar\ to}{\to } \) compound \( \overset{binds\ to}{\to } \) protein; (B) compound \( \overset{binds\ to}{\to } \) protein \( \overset{binds\ to}{\to } \) compound \( \overset{binds\ to}{\to } \) protein; and (C) compound \( \overset{has\ part}{\to } \) substructure \( \overset{part\ of}{\to } \) compound \( \overset{binds\ to}{\to } \) protein \( \overset{similar\ to}{\to } \) protein. Three meta-paths connect the starting compound to the ending protein: meta-path (A) indicates that the compound most likely binds to a protein to which another structurally similar compound binds; meta-path (B) shows that two compounds sharing an observed protein target may share another protein target as well; meta-path (C) specifies that two compounds sharing a common substructure may bind to two different protein targets that have similar protein sequences. SLAP employs a statistical model to evaluate the importance of each meta-path in link prediction, which is evaluated individually based on the distribution of its connectivity property over a set of randomly sampled drug-target pairs. Several meta-paths are selected according to their statistical significances, and the aggregated connectivity properties of the selected meta-paths are used to predict DTI.

The present work provides an alternative DTI approach to SLAP. Rather than using a statistical model to study the significance of meta-path topological features, we propose a framework to take advantage of machine learning algorithms, including Random Forest (RF) and Support Vector Machine (SVM), to construct binary classification models to predict DTI. A more complete drug-target connectivity map can be constructed using the predicted links. By using machine learning models, feature importance (i.e., the contributions of different meta-paths to the link prediction) can be calculated at the same time as the classification models are built. Additionally, SLAP only considers path counts as a topological feature; whereas our approach can apply different kinds of normalization processes to path counts, including random walk, normalized path count, and symmetric random walk [23] to further enrich the topological feature space. In order to compare our approach with SLAP, we have carried out link prediction experiments on a semantic network, called Chem2Bio2Rdf, which focuses on drug candidates and their biological annotations. Although the proposed approach was just used to construct a more complete drug-target connectivity map in the present study, it can be generalized as a framework to leverage machine learning algorithms to study the topological features of the heterogeneous network for link prediction. Structural similarity links between compounds and sequence similarity links between proteins were added to expand the semantic network. The usefulness of similarity neighboring links from PubChem resources [24] is examined in the context of semantic link prediction.

Methods

Semantic network

In the Chem2Bio2RDF semantic network, nine distinct semantic types are presented, including compounds, proteins, adverse side effects, Gene Ontology (GO) annotations, ChEBI types, substructures, tissues, biological pathways, and diseases; ten different semantic links are incorporated, including links from compounds to ChEBI types, from compounds to proteins, from compounds to substructures, from adverse side effects to compounds, from diseases to compounds, from proteins to proteins (referring to protein-protein interactions), from proteins to GO annotations, from diseases to proteins, from pathways to proteins, and from tissues to proteins. In order to enhance link prediction performance, we enriched the linked dataset by adding two more semantic links: compound neighboring links based on 2D structural similarity, and protein neighboring links, based on sequence similarity. The similarity neighboring links were obtained from PubChem databases [25, 26]. A total of twelve adjacency matrixes were computed based on the semantic links between any two objects. The elements of the adjacency matrixes have two values: ‘0,’ indicating unobserved links, and ‘1,’ indicating observed links. The semantics and statistics of adjacency matrixes were enumerated in Table 1; these were used to calculate the meta-path-based topological features. It is noteworthy that all the semantic links in the Chem2Bio2RDF dataset are reversible, and the adjacency matrix for the reverse semantic links can be obtained through a transpose of the original adjacency matrix.

Table 1 The semantics and statistics of adjacency matrixes

Meta-path-based topological features

The meta-path topological features were encoded in commuting matrixes, calculated by multiplying several adjacency matrixes. To predict the links from compounds to proteins, we exhaustively enumerate all the possible meta-paths, yielding a total of 51 meta-paths. Each commuting matrix represents a certain type of meta-path of a given length. The length of the meta-paths equals the number of multiplied adjacency matrixes. Out of 51 commuting matrixes, 4 meta-paths are of length 2; 11 meta-paths are of length 3; and 36 meta-paths are of length 4. The meta-paths with length greater than 4 are considered to be too long to make a significant contribution to link prediction. The elements in the commuting matrix indicate the number of path instances linking compounds to proteins, and have non-negative integer values. The semantics and statistics of commuting matrixes were enumerated in Table 2. For instance, the commuting matrix C15 represents a meta-path: compound \( \overset{similar\ to}{\to } \) compound \( \overset{binds\ to}{\to } \) protein \( \overset{similar\ to}{\to } \) protein, which was calculated by multiplying three adjacency matrixes: A2, A11, and A12 (Fig. 1). All of the matrix multiplications were carried out using the Armadillo C++ linear algebra library [27], and all of the adjacency and commuting matrixes were encoded as sparse matrixes to reduce memory consumption.

Table 2 The semantics and statistics of commuting matrixes
Fig. 1
figure 1

Schematic representation of calculations of commuting matrix C15 through multiplying A2, A11, and A12

Two measures of topological features were calculated. Path count (PC i,j ) measures the number of path instances between nodes i and j, which corresponds to the value of element in the commuting matrix. We also applied Random Walk (RW) as a normalization process to the number of path instances, based on the overall connectivity of the network. RW was calculated as \( \raisebox{1ex}{$P{C}_{i,j}$}\!\left/ \!\raisebox{-1ex}{$P{C}_{i,\bullet }$}\right. \), where PC i,• are row-wise summations.

Machine learning dataset

In order to build supervised learning models, both positive and negative labels are required. We treated observed links between compounds and protein targets as positive labels. A total of 5,387 positively labeled links from Drugbank were collected, which were used to evaluate the predictive performance of the SLAP algorithm [15]. The unobserved links in the dataset can be either spurious links or potential future links. In order to obtain experimental evidence for the negative labels, we surveyed the PubChem BioAssay database [28]: if the experimental bioactivity value is greater than 10 μM, the link of a compound protein pair is negatively labeled. Accordingly, we obtained 26,682 negative labels out of over 5.6 billion unobserved links between compounds and proteins in the Chem2Bio2RDF semantic network. In order to assess predictive performance without prior knowledge, the positively labeled links were removed from Chem2Bio2RDF when the meta-path-based topological features were calculated. The positively and negatively labeled links were combined and randomly split into training and test sets by a ratio of 2:1. In the training set, there are 3,591 positively labeled links and 17,788 negatively labeled links. In the test set, there are 1,796 positively labeled links and 8,894 negatively labeled links.

The network evolves as new links are identified over time. In order to further examine the ability of the proposed framework to identify the evolution of network connectivity, a much larger set of DTIs were collected from the PubChem BioAssay database. PubChem BioAssay categorizes depositor-provided bioactivities between compounds and protein targets into active, inactive, and unspecified groups, according to assay descriptions and activity values. If the interactions between compounds and protein targets are categorized as active in PubChem BioAssay, and the active interaction pairs have reported activity values of less than 1 μM, the links are positively labeled; if the interactions between compounds and proteins are categorized as inactive in PubChem BioAssay, and there are reported activities for the interactions, the links are negatively labeled. A set of 145,622 positively labeled links contained in the current Chem2Bio2RDF semantic network, plus 600,000 negatively labeled links, constitute a training set; another set of 43,159 positively labeled links that are not contained in the current Chem2Bio2RDF semantic network, but are true positive DTIs, identified through bioassay experiments, plus195,000 negatively labeled links, comprise the test set. Since the positive DTIs in the test set were obtained after construction of the network, this independent test set is used to examine the ability to predict the links in the future network based on the topological features of the current network.

Binary classification models

In order to demonstrate how well the similarity neighboring links obtained from PubChem databases can improve link prediction performance, we have constructed different machine learning models, based on two sets of path count topological features. Feature set I does not include any meta-paths involving similarity neighboring links, so it only contains 29 path count topological features. Feature set II includes all of the path counts encoded in 51 commuting matrixes. We also examined the improvement of predictive performance using an enriched topological feature space. RW normalization was applied to 51 path count topological features, and by combining the path counts and random walks, we obtained feature set III, which contains 102 topological features.

Two popular machine learning algorithms were investigated. Random forest (RF) represents a collection of decision trees, which are grown from bootstrap samples of the training data without pruning, and make predictions based on majority votes of the ensemble trees [29]. RF takes advantage of Out-of-Bag (OOB) error as an unbiased estimate of generalized test error, so there is no need to run cross-validation. RF can calculate the importance of features as well. The values for a given feature are permuted across all of the compound-protein pairs. Either classification accuracies or node impurities (Gini indexes) are measured before and after permutations, and the difference in the measures is used to evaluate feature importance. A default value for the number of trees was used (ntree = 500) in the present study, which has been proven to be satisfactory in most cases [30]. The optimal value for tuning parameter mtry was identified by a grid search.

In contrast to the tree-based model, Support Vector Machine (SVM) is based on a statistical learning theory derived from the structural risk minimization principle and Vapnik-Chervonenkis (VC) dimension [31]. A soft margin SVM with radial basis function (RBF) kernel in the Gaussian form was used in the present study. The optimal values for tuning parameters (C and λ) were determined by a grid search using 10-fold cross-validation.

The classification performances were evaluated using the F1-score [32], which is the harmonic mean of precision and recall.

$$ {\mathrm{F}}_1\;\mathrm{score}:\;\frac{2TP}{2TP+FP+FN} $$
(1)

F1-score can be used for statistical hypothesis testing, in particular, for imbalanced datasets. Both RF and SVM can calculate the probabilities of classifications, and rankings can be derived from the probability calculations. The predictive performance on rankings was evaluated according to Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves for all of the models. The area under the curve for ROC (AUCROC) and PR (AUCPR) were calculated using the natural spline interpolation encoded in the R package ‘Miscellaneous Esoteric Statistical Scripts’ (MESS). The early hit recognitions that are considered more important in virtual screening experiments were evaluated using Boltzmann-enhanced discrimination of ROC (BEDROC), which was calculated using the R package ‘enrichvs.’

Results and discussion

The optimal tuning parameters and the statistical results for all the binary classification models are summarized in Table 3. RF outperformed SVM across all three feature sets. Both RF and SVM yielded consistent rankings of the predictive performance for the different feature sets: feature set III > feature set II > feature set I. The similarity neighboring links improved the link prediction performance on test set by 5.5 % in RF models, and by around 4.4 % in SVM models. In combination with RW normalization, the predictive performance of RF models was improved by 2 %, and the predictive performance of SVM models were boosted by 3.5 %. The differences in predictive performance were consistently demonstrated by ROC and PR curves as well (see Fig. 2). The ROC space and PR space agreed on the rankings of different feature sets, in terms of predictive performance. We can see that feature set III dominated both ROC space and PR space for both RF and SVM models, and RF models slightly outperformed SVM models. Since we have imbalanced distributions for positive and negative labels, PR curves can provide better visual representations than ROC curves to identify the difference of predictive performance. As shown in Fig. 2, the ROC curves were closely clustered, and the PR curves for different models were separated to a larger extent. The differences among AUCPRs were larger than the differences among AUCROCs, as well (see Table 4). It is clear that similarity neighboring links are important for link prediction in the semantic network, and RW normalization can boost predictive performance by enriching feature space. It is noteworthy that all the machine learning models performed fairly well on both training and test sets without over-fitting. In addition, both feature set II and feature set III produced AUCROCs greater than 0.92, which was produced by SLAP [15]. Hence, meta-path-based topological features have been proven to be valuable for link prediction in complex semantic networks using machine learning models.

Table 3 Statistics of binary classification models built upon different feature sets and using different machine learning algorithms
Fig. 2
figure 2

Receiver operating characteristic curves (a) and precision/recall curves (b) for the six models using two machine learning algorithms to build binary classification models upon three topological feature spaces. RF means Random Forest, SVM means support vector machine, FI means feature set I, FII means feature set II, and FIII means feature set III

Table 4 Area under ROC curve (AUCROC) and area under PR curve (AUCPR) of random forest and support vector machine classification models using different feature sets

In order to further compare the proposed approached with SLAP, we carried out link predictions using both methods on a large set of unknown links of an evolving semantic network. The labels of those unknown links were derived from experimental evidence deposited in PubChem BioAssay databases after the Chem2Bio2RDF network was constructed. Hence, these positive labels can be viewed as experimental validations when assessing link prediction performance. The proposed framework, using RF to build a binary classification model upon feature set III, yielded much better BEDROC and AUCROC than SLAP (Table 5). BEDROC is mainly used to compare ranking systems in terms of early recognition [33]. Our approach yielded much better AUC of BEDROC using a default coefficient parameter (α = 20.0) (Table 5). The difference can be seen in Fig. 3 as well.

Table 5 Comparing the proposed framework (random forest classification model applied on feature set III) with existing algorithm SLAP using Area under ROC curve (AUCROC) and area under PR curve (AUCPR)
Fig. 3
figure 3

ROC curves for the Random Forest model built upon feature set III and SLAP. RF means Random Forest and FIII means feature set III

By applying the intrinsic feature ranking algorithm of the RF on feature set II, we can tell which meta-paths are important for link prediction. Feature importance can be visualized as a dot plot (Fig. 4). Two measures evaluated before and after permutations were used for feature ranking: decrease of classification accuracy and decrease of Gini index. Although two measures do not always agree on which features are important, we still can identify some significantly important meta-paths according to two measures. The top four important meta-paths were C1, C19, C16, and C39, and the network nodes connected by these important meta-paths are compounds, proteins, and GO annotations. It is noteworthy that the top three important meta-paths only contain semantic links between compounds and proteins, and the top two important meta-paths contain similarity neighboring links. Therefore, semantic links between compounds and proteins, including similarity neighboring links and interaction links, played a major role in predicting CPIs.

Fig. 4
figure 4

Variable importance for Random Forest model built with feature set II. The color code for feature importance according to mean decrease accuracy: red (>70), blue (>45 and <70), green (<45); the color code for feature importance according to mean decrease Gini index: red (>240), blue (>240 and <100), green (<100)

In contrast to SLAP, that pre-calculates feature importance before making predictions, the proposed framework can evaluate feature importance and build predictive models at the same time. The importance of a given topological feature may vary to some extent when different sets of training data are considered, or when new links are added into the network as a function of time. We carried out an experiment to demonstrate that feature importance may vary significantly when different sets of data are used to build predictive models. We constructed 1,000 RF models using randomly selected training sets with feature set II. Each training set was compiled by 100 positively labeled links from the DrugBank set, and 100 negatively labeled links from the PubChem BioAssay set with experimental bioactivity value greater than 10 μM. The changes of feature importance in different models can be seen in Fig. 5. It is clearly that feature importance varied a lot in different models. Feature C4 has the smallest standard deviation (0.828) and feature C39 has the largest standard deviation (5.537). It is noteworthy that all of the top four importance features in the aforementioned models (C1, C16, C19, and C39) have very large standard deviations. Even though their importance varied a lot in different models, their mean values were well above the average of others; in particular, the mean values of C1 and C39 were much larger than those of other topological features. The predictive performances of those 1,000 RF models tested against a randomly selected set of 50 positive labels and 50 negative labels (not included in any of those 1,000 training sets) varied a lot as well. The highest F1-score is 0.937 and the lowest F1-score is 0.667. Hence, the selection of training set is also very important to build highly predictive machine learning models.

Fig. 5
figure 5

Box plot for the variable importance varying in 1 000 Random Forest models

Conclusions

The semantic network integrating domain knowledge across chemical and biological space can be leveraged for large-scale data mining. Among the different kinds of semantic links, drug-target connectivity maps have drawn extensive attention, since they are beneficial for drug discovery and development, in particular, drug repositioning and polypharmacology research. In the present work, we have proposed a framework to construct state-of-the-art machine learning models using meta-path-based topological features for link prediction in complex semantic networks. Supervised classification models were shown to be powerful, based on their predictive performance in an independent test set containing links of an evolving network. In addition, the intrinsic feature ranking algorithm embedded in machine learning models can be used to select the most important topological features. Although the proposed framework was only applied to predict DTIs in the present work, it can definitely be used for other purposes, such as to predict associations between drugs and adverse side effects, as well as associations between proteins and diseases. In the future, we want to study how to select the most relevant training set for a given prediction task, and how much training set selection can improve predictive performance.

Availability of Data and Materials

The data sets supporting the results of this article are included within the article and its additional files (Additional files 1, 2, 3, 4 and 5).