TargetNet: a web service for predicting potential drug–target interaction profiling via multi-target SAR models
Drug–target interactions (DTIs) are central to current drug discovery processes and public health fields. Analyzing the DTI profiling of drugs helps to infer drug indications, adverse drug reactions, drug–drug interactions, and drug modes of action. It is therefore highly important to predict the DTI profiling of drugs reliably and quickly on a genome-scale level. Here, we develop the TargetNet server, which makes real-time DTI predictions based only on molecular structures, following the spirit of multi-target SAR methodology. Naïve Bayes models together with various molecular fingerprints were employed to construct the prediction models. Ensemble learning over these fingerprints is also provided to improve the prediction ability. When the user submits a molecule, the server predicts its activity across 623 human proteins using the established high-quality SAR models, thus generating a DTI profiling that can be used as a feature vector of chemicals for a wide range of applications. The 623 SAR models, one per human protein, were strictly evaluated by several model validation strategies, yielding AUC scores of 75–100 %. We applied the generated DTI profiling to predict potential targets, toxicity classes, drug–drug interactions, and drug mode of action, which demonstrates the broad application value of DTI profiling. The TargetNet webserver is built on the Django framework in Python and is freely accessible at http://targetnet.scbdd.com.
Keywords: Web server; SAR models; Drug–target interaction; Multi-target SAR; Naïve Bayes
Drug–target interactions (DTIs) are central to current drug discovery processes and public health fields [1, 2]. In the drug discovery process, one of the challenges is to identify the potential targets of drug-like compounds. Once a target has been identified, several receptor-based drug design methods can be used to optimize the structures of compounds and improve their biological activities. Considerable effort has been invested in studying various targets in both academic institutions and the pharmaceutical industry. However, determining whether a chemical and a target interact in a cellular network purely by experimental techniques is time-consuming and expensive. Although some computational methods have been developed for this purpose based on knowledge of the three-dimensional (3D) structure of the protein, their usage is quite limited because the 3D structures of most targets, such as many GPCRs, are still unknown. Furthermore, analyzing the DTI profiling of drugs helps to infer drug indications, adverse drug reactions [4, 5, 6], drug–drug interactions [7, 8], and drug modes of action [9, 10]. It is therefore highly important to predict the DTI profiling of chemicals reliably and quickly on a genome-scale level.
Currently, two computational approaches are generally used for studying drug–target relations. (1) The inverse- or reverse-docking approach predicts the interactome of chemicals toward a representative collection of proteins based on various molecular docking programs [11, 12]. For example, Li et al. developed a web server called TarFisDock to identify drug targets from 698 potential targets prepared in advance. Kharkar et al. reviewed the state of the art and future prospects of reverse docking for drug repositioning and drug rescue. Lee and Kim provided large-scale reverse docking profiles by expanding the target space to the set of all protein structures currently available, and developed several new applications such as predicting the druggability of protein targets and predicting protein function from docking profile similarity. However, a serious problem for docking is that it cannot be applied to proteins whose 3D structures are unknown. Additionally, a single DTI prediction by a docking program may take seconds or even several minutes, so docking a chemical against multiple proteins can take several hours, which seriously limits its wide application.
(2) Various chemogenomics methods simultaneously consider chemical information and protein information to infer chemical–protein associations [14, 15, 16, 17]. For example, Nagamine et al. built a statistical model for predicting DTIs based on 519 approved drugs and their 29 associated targets, using amino acid sequences, two-dimensional chemical structures and mass-spectrometry data. He et al. established classification models for predicting DTI networks using chemical functional groups and biological features. Yu et al. made a systematic prediction of multiple DTIs from chemical, genomic and pharmacological data using support vector machines and random forests. Xiao et al. developed a sequence-based classifier based on two-dimensional fingerprints of compounds and the pseudo amino acid composition of proteins to predict the interactions between GPCRs and drugs in cellular networking. However, these approaches usually have relatively low prediction accuracy when the number of proteins or the space of DTI data becomes very large. Recently, a variety of statistical methods have been developed to predict DTIs by integrating multiple evidence sources [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]. Yamanishi et al. proposed a bipartite graph learning method to predict true interacting pairs from the integration of chemical and genomic spaces. Bleakley et al. proposed a bipartite local model by transforming edge-prediction problems into binary classification problems. Xia et al. used a semi-supervised learning method to predict DTIs from heterogeneous biological spaces. Jacob et al. proposed a kernel-based learning framework that constructs a pairwise kernel to measure the similarity between drug–target pairs. However, the drawback of the pairwise kernel is that the number of samples to be classified is large (i.e., the number of drugs multiplied by the number of targets), which entails considerable computational complexity. To avoid this problem, van Laarhoven et al. more recently developed a Gaussian interaction profile kernel for predicting DTIs. Mizutani et al. related the DTI network to drug side effects using sparse canonical correlation analysis.
We developed an open web service called TargetNet to net (i.e., predict) the binding of multiple targets for any given molecule, following the spirit of multi-target SAR methodology. TargetNet simultaneously constructs a large number of SAR models based on current chemogenomics data to make future predictions. In total, 623 Naïve Bayes models together with various molecular fingerprints were employed to construct prediction models for 623 proteins. Ensemble learning over these fingerprints is also provided to improve their prediction ability. When the user submits a molecule, the server predicts its activity across the 623 proteins using the established high-quality SAR model for each protein, thus generating a DTI profiling that can be used as a feature vector for a wide range of applications. The 623 SAR models were strictly evaluated by several model validation strategies, yielding AUC scores of 75–100 %. We applied the generated DTI profiling to predict potential targets, toxicity classes, drug–drug interactions, and drug mode of action, which demonstrates the broad application value of DTI profiling. We recommend DTI profiling for analyzing and representing the various complex molecular data under investigation. Further, we hope that the web service will be helpful when exploring questions concerning target identification, candidate drug screening, drug effect evaluation, and poly-pharmacology or multi-target characterization of candidate chemicals.
Preparation of the library drugs and targets
We used the BindingDB database as the source of our training datasets. BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be candidate drug targets with ligands that are small, drug-like molecules. Activity data were filtered to keep only activity end-points reported as half-maximum inhibitory concentration (IC50), half-maximum effective concentration (EC50) or Ki values. To ensure that enough molecules were available for model building, we selected only those targets with more than 200 biological activity data points. Following this procedure, 109,061 compounds associated with 623 target proteins remained, with 115,257 activity end-points, which were used for model building. The proteins were divided into five classes: enzymes (276, including 85 kinases), ion channels (9), receptors (255), transporters (14) and others (69). Cross-link information for these targets can be found in the Targets section of the TargetNet website. The list of associated proteins is included in Supplementary Information S1.
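The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (compound ID, target ID, end-point type, value in μM) and the function name are hypothetical, and whether the 200-record threshold was applied before or after the end-point filter is not stated in the text (the sketch applies it after).

```python
from collections import Counter

# End-point types kept for model building, as listed in the text.
ALLOWED_ENDPOINTS = {"IC50", "EC50", "Ki"}
MIN_RECORDS = 200  # a target needs MORE than this many activity records


def filter_activity_records(records, min_records=MIN_RECORDS):
    """Keep IC50/EC50/Ki records whose target has more than `min_records` of them.

    `records` is an iterable of (compound_id, target_id, endpoint, value_uM)
    tuples; this layout is an assumption made for the example.
    """
    kept = [r for r in records if r[2] in ALLOWED_ENDPOINTS]
    # Count the remaining records per target (assumed order of operations).
    counts = Counter(target for _, target, _, _ in kept)
    return [r for r in kept if counts[r[1]] > min_records]
```

With a toy threshold of 2, a target with three IC50/EC50/Ki records is kept, while a target with a single record, or a record of another end-point type such as Kd, is dropped.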
Preparation of the positive and negative set
For compounds with more than one activity value, we took the mean of these values as the final activity value. A compound was considered active when its mean activity value was below 10 μM; all compounds with values higher than 10 μM were considered inactive. Following this split, some human proteins may have very few negative samples. To balance the number of positive and negative samples for each such protein, we randomly selected a certain number of compounds from other human proteins to serve as additional negative samples. That is to say, the negative samples consist of two parts: truly inactive samples and randomly selected unknown interactions. The number of these selected negative samples together with the inactive samples should be roughly equal to the number of active samples for the protein. The resulting positive and negative sets were used for the subsequent model building. The SMILES representations of the compounds in the positive and negative sets for each human protein can be downloaded from the TargetNet website.
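A minimal Python sketch of this labeling and balancing procedure is given below. The function names, the input layout and the fixed random seed are assumptions for illustration; the text also does not say how a mean of exactly 10 μM is treated, so the sketch counts it as inactive.

```python
import random
from collections import defaultdict
from statistics import mean

ACTIVE_THRESHOLD_UM = 10.0  # mean activity below this value counts as active


def label_compounds(measurements):
    """Split the compounds of one protein into (actives, inactives).

    `measurements` is an iterable of (compound_id, value_uM) pairs; repeated
    measurements of the same compound are averaged first.
    """
    values = defaultdict(list)
    for compound, value in measurements:
        values[compound].append(value)
    actives, inactives = set(), set()
    for compound, vals in values.items():
        if mean(vals) < ACTIVE_THRESHOLD_UM:
            actives.add(compound)
        else:  # a mean of exactly 10 uM lands here (assumption)
            inactives.add(compound)
    return actives, inactives


def balance_negatives(actives, inactives, other_compounds, seed=0):
    """Pad the negatives with randomly drawn compounds from other proteins.

    Compounds are drawn until the negative set is as large as the active set,
    or until the pool of candidates is exhausted.
    """
    negatives = set(inactives)
    shortfall = len(actives) - len(negatives)
    if shortfall > 0:
        # Never draw a compound that is already labeled for this protein.
        pool = sorted(set(other_compounds) - actives - negatives)
        rng = random.Random(seed)
        negatives.update(rng.sample(pool, min(shortfall, len(pool))))
    return negatives
```

The drawn compounds are "unknown" rather than confirmed inactive, which matches the two-part composition of the negative set described above.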
We used molecular substructure fingerprints to describe molecular structure instead of common molecular descriptors such as topological, constitutional, geometrical and quantum chemical properties. Substructure fingerprints directly encode molecular structure in a series of binary bits that represent the presence or absence of particular substructures in the molecule. This representation can retain the overall complexity of a molecule, although it divides the whole molecule into many fragments. It does not need a reasonable three-dimensional conformation of the molecule and thereby avoids error accumulation in the description of molecular structure. In addition, it gives a direct relationship between molecular structure and property. In this study, several commonly used molecular fingerprints are used to construct the substructure dictionaries, including FP2, Daylight-like, MACCS, E-state, ECFP2, ECFP4 and ECFP6. The FP2 fingerprint is a path-based fingerprint that indexes small molecule fragments based on linear segments of up to 7 atoms. Each remaining fragment is assigned a hash number, which is used to set a bit in a 1024-bit vector. The Daylight-like fingerprints are hashed fingerprints encoding each atom type, all augmented atoms and all paths of length 2–7 atoms, giving a total string of 1024 bits. The MACCS fingerprint uses a dictionary of MDL keys, which contains a set of 166 mostly common substructure features. These are referred to as the MDL public MACCS keys. There is a one-to-one correspondence between each SMARTS pattern and a bit in the MACCS fingerprint. For each SMARTS pattern, if its corresponding substructure is present in the given molecule, the corresponding bit in the fingerprint is set to 1; conversely, it is set to 0 if the substructure is absent from the molecule. Electrotopological State (E-state) fingerprints represent the presence or absence of 79 E-state substructures.
The ECFP2, ECFP4 and ECFP6 fingerprints belong to the family of circular fingerprints known as Morgan fingerprints, and are obtained by setting the diameter of the atom environment to 2, 4 and 6, respectively. The fingerprints are calculated with PyDPI, a Python package developed for calculating various molecular descriptors and fingerprints.
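To make the idea of a hashed, path-based fingerprint concrete, the following Python sketch enumerates linear atom paths of up to 7 atoms in a small molecular graph and folds them into a 1024-bit vector. It is a deliberately simplified illustration, not the FP2 or PyDPI implementation: bond orders, ring closures and fragment pruning are ignored, and the CRC32 hash and the function names are assumptions chosen for the example.

```python
import zlib

N_BITS = 1024   # length of the folded bit vector
MAX_ATOMS = 7   # longest linear segment that is indexed


def linear_paths(atoms, bonds, max_atoms=MAX_ATOMS):
    """Enumerate simple linear paths of 1..max_atoms atoms as symbol strings.

    `atoms` is a list of element symbols and `bonds` a list of (i, j) index
    pairs. A path and its reverse are treated as the same fragment.
    """
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)

    fragments = set()

    def extend(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        fragments.add(min(forward, backward))  # direction-independent key
        if len(path) < max_atoms:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # no atom is visited twice
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return fragments


def hashed_fingerprint(fragments, n_bits=N_BITS):
    """Set one bit per fragment, chosen by a stable hash of the fragment."""
    bits = [0] * n_bits
    for fragment in fragments:
        bits[zlib.crc32(fragment.encode("utf-8")) % n_bits] = 1
    return bits
```

For the heavy atoms of ethanol (`atoms = ["C", "C", "O"]`, `bonds = [(0, 1), (1, 2)]`) the enumerated fragments are `C`, `O`, `C-C`, `C-O` and `C-C-O`. Because different fragments can hash to the same bit, the number of bits set is at most the number of fragments.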
Naive Bayesian classifiers
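As an illustration of the kind of classifier named in this heading, a Naïve Bayes model over binary fingerprint bits can be sketched as below. The Bernoulli event model, the Laplace smoothing parameter `alpha` and the class interface are assumptions made for this example; they are not taken from the TargetNet implementation.

```python
import math


class BernoulliNaiveBayes:
    """Two-class Naive Bayes for binary fingerprints (1 = active, 0 = inactive)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing, avoids zero probabilities

    def fit(self, fingerprints, labels):
        # Both classes must be present in `labels`, otherwise the prior is log(0).
        n_bits = len(fingerprints[0])
        self.log_prior, self.log_p_on, self.log_p_off = {}, {}, {}
        for cls in (0, 1):
            rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
            self.log_prior[cls] = math.log(len(rows) / len(labels))
            on_counts = [sum(fp[b] for fp in rows) for b in range(n_bits)]
            # Smoothed estimate of P(bit = 1 | class).
            p_on = [(c + self.alpha) / (len(rows) + 2 * self.alpha) for c in on_counts]
            self.log_p_on[cls] = [math.log(p) for p in p_on]
            self.log_p_off[cls] = [math.log(1.0 - p) for p in p_on]
        return self

    def predict_proba(self, fingerprint):
        """Return the posterior probability that the molecule is active."""
        scores = {}
        for cls in (0, 1):
            score = self.log_prior[cls]
            for bit, on, off in zip(fingerprint, self.log_p_on[cls], self.log_p_off[cls]):
                score += on if bit else off
            scores[cls] = score
        top = max(scores.values())  # subtract the maximum for numerical stability
        weights = {cls: math.exp(s - top) for cls, s in scores.items()}
        return weights[1] / (weights[0] + weights[1])
```

Training one such model per protein and collecting the predicted probabilities for a query molecule would give one entry of the DTI profiling per protein.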
Application 1: predicting potential target proteins for the given molecule
As an example, we submitted the drug bromocriptine to TargetNet for a prediction test. The server predicts that bromocriptine might interact with a new protein, the D(1A) dopamine receptor (UniProt ID: Q95136). Bromocriptine mesylate is a semisynthetic ergot alkaloid derivative with potent dopaminergic activity. It is indicated for the management of the signs and symptoms of Parkinsonian syndrome. On checking the related literature, we found a reported binding affinity between bromocriptine and the D(1A) dopamine receptor (Ki = 1.444 μM). Moreover, most of the approved targets of bromocriptine are predicted within the top 30 associations. This case study demonstrates that our server can predict potential targets to a certain extent.
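The check described in this case study, namely whether known targets appear among the top 30 predicted associations, can be expressed as a short Python sketch. The dictionary layout of the profile and the function names are assumptions; the identifiers used in the usage note are placeholders, not real predictions.

```python
def top_targets(profile, k=30):
    """Return the k highest-scoring (protein_id, probability) pairs.

    `profile` maps a protein identifier to its predicted probability.
    """
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def recovered_known_targets(profile, known_targets, k=30):
    """List the known targets that fall within the top-k predictions."""
    top_ids = {protein_id for protein_id, _ in top_targets(profile, k)}
    return sorted(top_ids & set(known_targets))
```

For a placeholder profile `{"T1": 0.9, "T2": 0.2, "T3": 0.7, "T4": 0.4}` and `k=2`, the top associations are `T1` and `T3`, so of the known targets `["T3", "T4"]` only `T3` is recovered.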
Application 2: in silico toxicity prediction by DTI profiling
The prediction performance of three toxicity datasets based on DTI profiling and FP2 fingerprint (one table row of six values per line; row and column labels not preserved)
0.746 ± 0.003 | 0.775 ± 0.002 | 0.715 ± 0.003 | 0.608 ± 0.002 | 0.491 ± 0.003 | 0.815 ± 0.003
0.713 ± 0.004 | 0.728 ± 0.002 | 0.698 ± 0.003 | 0.593 ± 0.002 | 0.426 ± 0.003 | 0.782 ± 0.003
0.807 ± 0.002 | 0.819 ± 0.002 | 0.792 ± 0.002 | 0.621 ± 0.003 | 0.611 ± 0.002 | 0.877 ± 0.003
0.725 ± 0.004 | 0.754 ± 0.004 | 0.693 ± 0.003 | 0.601 ± 0.003 | 0.448 ± 0.004 | 0.818 ± 0.004
0.715 ± 0.004 | 0.686 ± 0.003 | 0.744 ± 0.004 | 0.579 ± 0.003 | 0.431 ± 0.003 | 0.782 ± 0.003
0.795 ± 0.002 | 0.814 ± 0.002 | 0.770 ± 0.001 | 0.619 ± 0.002 | 0.585 ± 0.002 | 0.881 ± 0.002
Application 3: in silico DDI prediction by DTI profiling
Prediction statistics of DDI data based on different validation levels (one table row of six values per line; row and column labels not preserved)
0.918 ± 0.001 | 0.928 ± 0.002 | 0.888 ± 0.001 | 0.650 ± 0.003 | 0.817 ± 0.002 | 0.969 ± 0.001
0.716 ± 0.001 | 0.745 ± 0.001 | 0.686 ± 0.002 | 0.598 ± 0.002 | 0.432 ± 0.003 | 0.776 ± 0.002
0.689 ± 0.005 | 0.629 ± 0.003 | 0.751 ± 0.005 | 0.557 ± 0.007 | 0.382 ± 0.005 | 0.761 ± 0.004
Application 4: identify network of drug mode of action by DTI profiling
TargetNet web service
Currently, the TargetNet web service has received more than 500 visits from 39 different countries since October 20, 2015. Over these 3 months, the TargetNet webserver has run stably and has not failed to serve user requests. For more details, readers can refer to the live statistics in the right corner of the TargetNet homepage.
DTI probabilities of the user’s molecule with the 623 proteins in the library. The result table includes the columns “Details”, “Uniprot_ID”, “Protein” and “Probability”. The user can also view the detailed information for each human protein as needed, and can type a keyword to find a particular item in the results through the ‘Search’ button. The final prediction results can be downloaded in different formats.
Lipinski’s rule of five for the user’s molecule, together with the molecular structure.
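For readers unfamiliar with Lipinski’s rule of five, the following Python sketch lists which of its four criteria a molecule violates, given precomputed properties. How TargetNet computes or displays these properties is not described here, so the data class, its field names and the example values in the usage note are hypothetical; only the four thresholds come from the rule itself.

```python
from dataclasses import dataclass


@dataclass
class MolecularProperties:
    """Precomputed properties of one molecule (field names are illustrative)."""
    molecular_weight: float   # in Da
    logp: float               # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(props):
    """Return the list of rule-of-five criteria that the molecule violates."""
    checks = [
        ("molecular weight > 500 Da", props.molecular_weight > 500),
        ("logP > 5", props.logp > 5),
        ("H-bond donors > 5", props.h_bond_donors > 5),
        ("H-bond acceptors > 10", props.h_bond_acceptors > 10),
    ]
    return [name for name, violated in checks if violated]
```

A hypothetical molecule with `MolecularProperties(350.4, 2.1, 2, 5)` has no violations, whereas `MolecularProperties(654.6, 6.2, 3, 11)` violates the molecular weight, logP and acceptor criteria.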
We compared TargetNet with two current methods that can yield DTI profiling. The first is the inverse- or reverse-docking approach, which predicts the interactome of drugs toward a representative collection of target proteins based on various molecular docking programs. TargetNet has two advantages over inverse- or reverse-docking. (1) TargetNet is faster: a single drug–target interaction prediction by a docking program may take seconds or even several minutes, so docking a drug against multiple proteins can take several hours, which seriously limits its wide application. (2) The proteins used in inverse- or reverse-docking must have explicit 3D structures and binding pockets, while the proteins in TargetNet can be arbitrary as long as they have enough experimentally confirmed interacting compounds. The second is the chemogenomics approach, which considers drug information and protein information to infer drug–target associations. Computational chemogenomics approaches, however, have relatively low prediction accuracy when the number of target proteins or the space of DTI data becomes large. In contrast, TargetNet uses only chemical structural information to differentiate active from inactive compounds for a given protein, based on the SAR principle. Therefore, TargetNet is more flexible and applicable. Furthermore, we compared TargetNet with several popular web servers that predict DTIs or potential drug targets using different methodologies. For instance, DINIES is a web server for predicting unknown drug–target interaction networks from various types of biological data in the framework of supervised network inference.
However, the DINIES method requires detailed side-effect information and/or protein amino acid sequences, so it is applicable only to marketed drugs for which side-effect information is available, while TargetNet needs only chemical structures and is therefore easier to use and more flexible. Additionally, DINIES integrates multiple heterogeneous data sources to calculate the interaction, which is usually time-consuming, while TargetNet needs only about 5–10 s to process one molecule and is therefore faster than DINIES. CDRUG is a web server for predicting anticancer activity from the chemical structures of compounds encoded by the Daylight fingerprint. TargetNet builds 623 SAR models for specific targets and calculates seven types of fingerprints, and thus covers defined targets and more fingerprints than CDRUG. Compared with the CPI-Predictor proposed by Tand et al., the datasets used by the two methods are very different, although both are based on SAR methodology. The CPI-Predictor mainly focuses on GPCRs from the GPCR SARfari database and kinases from the Kinase SARfari database, and constructs its SAR models based only on the MACCS fingerprint. TargetNet mainly relies on the BindingDB database and involves the five classes of targets mentioned in the Methods section. Furthermore, TargetNet systematically compared seven types of molecular fingerprints or substructure fragments, and found that the ECFP4 fingerprint is more predictive than the other fingerprints, including MACCS. SwissTargetPrediction is a web server that infers the targets of bioactive small molecules based on a combination of 2D and 3D similarity values with known ligands. In contrast to the similarity-based SwissTargetPrediction, TargetNet applies SAR methodology to infer DTIs. However, it is worth noting that the performance of TargetNet largely depends on the quality of the SAR model for each protein.
Factors that influence the quality of the SAR models, such as the size and diversity of the datasets, model quality and molecular structural representation, directly influence the prediction ability of TargetNet and thus the efficiency of DTI profiling. In building the SAR models, we took several of these factors into account to obtain high-quality models. For example, the size of each dataset is required to be no less than 200, and the diversity of each dataset is also visualized. Furthermore, a series of model validations and evaluations are performed to ensure the reliability of the models.
The TargetNet server can predict the DTI profiling of the user’s drug across the 623 proteins in the database, which is supported by the prediction statistics from cross validations, independent validations, and applications.
TargetNet can help to infer drug indications, adverse drug reactions, drug–drug interactions, and drug modes of action, and will have wide applications in the drug discovery process.
The DTI profiling produced by TargetNet can be considered a new molecular representation for various drug discovery studies.
We would like to thank the Django group for their great Django server. We would also like to thank Dr. Peter Ertl for his JME molecular editor, and we thank the developers of D3.js. We would also like to thank three anonymous referees and the editor for their constructive comments, which greatly helped improve upon the original version of the manuscript.
This work has been financially supported by grants from the Project of Innovation-driven Plan in Central South University, the National Natural Science Foundation of China (Grants No. 81402853), the National key basic research program (Grants No. 2015CB910700), and the Postdoctoral Science Foundation of Central South University, the Chinese Postdoctoral Science Foundation (2014T70794, 2014M562142). The studies meet with the approval of the university’s review board.
Compliance with ethical standards
Conflict of interest
- 2. Nunez S, Venhorst J, Kruse CG (2011) Drug Discov Today 17(1):10
- 4. Luo H, Chen J, Shi L, Mikailov M, Zhu H, Wang K, He L, Yang L (2011) Nucleic Acids Res 39(Suppl 2):W492
- 5. Cao DS, Xiao N, Li YJ, Zeng WB, Liang YZ, Lu AP, Xu QS, Chen A (2015) CPT: Pharmacometrics & Systems Pharmacology 4(9):498
- 7. Luo H, Zhang P, Huang H, Huang J, Kao E, Shi L, He L, Yang L (2014) Nucleic Acids Res 42(W1):W46
- 13. Lee M, Kim D (2012) BMC Bioinformatics 13(Suppl 17):S6
- 47. Pahikkala T, Airola A, Pietilä S, Shakyawar S, Szwajda A, Tang J, Aittokallio T (2015) Brief Bioinform 16(2):325