Introduction

Structure-based molecular docking is one of the key components in computer-aided drug design [1,2,3]. Docking is a two-step process: conformational sampling of ligands bound to their receptors, followed by assessment of binding free energy between them. Due to advances in computing power and numerical algorithms, the success of docking is no longer restricted by inadequacy of conformational sampling but limited instead by the lack of a precise and reliable scoring function to evaluate the free energy of interactions between proteins and ligands [4]. Developing an accurate scoring function is challenging because molecular interaction is contributed by a delicate balance between several different types of interactions including van der Waals and columbic interactions in between, and interactions with solvent environment in addition to the difficulty in capturing entropic contributions [5, 6].

A wide variety of scoring functions has been developed to approximate energy functions. Based on the derivation ways, scoring functions are usually classified into physics-based methods, empirical scoring functions, knowledge-based potentials, and descriptor-based scoring functions [7]. Physics-based methods, widely employed in molecular dynamics simulation studies, are obtained by combing quantum mechanical calculations of small molecular fragments and empirical fitting to known experimental data. Some examples are linear interaction energy (LIE) [8, 9], linear response approximation (LRA) [10] and MM-PBSA/GBSA [11,12,13]. Since this type of methods require intensive computing time to perform kinetic integration for entropic effects, they are limited to assess a small number of compounds. Differently, virtual screening usually docked millions of molecules into a protein receptor to locate active compounds. Thus, the requirement of fast computation leads to the dominance of computationally efficient empirical scoring functions in docking as shown in the score-function assessment [5, 6]. Empirical scoring functions are based on a linear combination of various energetic terms to approximate binding free energy. Notable examples are ChemScore [14, 15], X-Score [16], Glide-Score [17, 18], and etc. Typically, the weight factors for individual energetic terms in an empirical scoring function are obtained by regression to achieve the highest correlation to experimental binding affinities (scoring power). More recently, machine learning methods have been used to combine energetic terms and/or employ protein–ligand distances for training. Examples are RF-Score [19], ID-Score [20], SVM-SP [21], and DrugVQA [22]. However, these scoring functions are often sensitive to docking poses and don’t perform well to separate decoys from true binding ligands in actual docking experiments [23]. Knowledge-based potentials (or statistical potentials) are derived from statistical analysis of known protein structures. A typical knowledge-based potential considers only the distances between atom pairs that allow efficient calculations. Different knowledge-based functions differ in how protein–ligand atom pair potentials and their reference states are defined. Examples are SmoG [24, 25], DrugScore [26], IT-Score [27, 28], and ASP [29]. Knowledge-based scoring functions are also used in combination with solvation and entropic terms to improve performance. Examples are DSX [30], SmoG2016 [31] and ITScore/SE [32].

Previously, a knowledge-based scoring function called DLIGAND [9] was developed based on the distance-scaled finite ideal-gas reference (DFIRE) state [33, 34], which has successfully been used for protein interactions with DNA [35], RNA [36], and carbohydrate [37] molecules. DLIGAND was developed by representing both protein and ligand atoms by a few mol2 atom types, and trained on a small set of 200 protein complex structures. Here, we developed DLIGAND2 by substituting 13 mol2 atom types by 167 residue-specific atom types for protein atoms and using a large protein structural dataset for training. We showed that DLIGAND2 not only significantly improves over DLIGAND but also has superior performance in separating true ligands from decoys in Database of Useful Decoys-Enhanced (DUD-E).

Methods

Scoring function

DLIGAND2 potential

We have used the same approach as the DLIGAND [38] to derive the distance-dependent interaction energy function between atomic pairs based on the distance-scale finite ideal-gas reference (DFIRE) state [33] as

$$ \bar{\mu } \left( {i,j,r} \right) = \left\{ {\begin{array}{*{20}c} { - \eta RTln\frac{{N_{{obs}} \left( {i,j,r} \right)~~~}}{{\left( {\frac{r}{{r_{{cut}} }}} \right)^{\alpha } \left( {\frac{{\Delta r}}{{\Delta r_{{cut}} }}} \right)N_{{obs}} \left( {i,j,r_{{cut}} } \right)}},} & {r < r_{{cut}} } \\ 0, \qquad \quad\quad\quad\quad \quad\quad\quad\quad \quad & {r \ge r_{{cut}} } \\ \end{array} } \right. $$
(1)

where R is the gas constant, T = 300 K, \( \alpha \) = 1.61, rcut = 15 Å, \( \eta \) is a scaling factor simply set as 0.01/RT. Nobs(i,j,r) is the number of atomic pair (i,j) within the spherical shell of distance r observed in a given structure database, and ∆r(∆rcut) is the bin width at r(rcut). A constant value of 0.5 Å was used for ∆r at all bins and ∆rcut = ∆r. Here, we employed residue-specific atomic types for protein atoms that leads to 167 atomic types for protein atoms. This is different from DLIGAND, where both protein and ligand atoms were represented by mol2 atom types, and thus only 13 atom types were utilized for protein atoms.

We derived the protein–ligand interactions from protein structures because there is only a small number of non-redundant protein–ligand complex structures. From protein structures, we obtained the Nobs for the number of observed pairs between protein atoms, which are converted to protein–ligand interactions by mapping indices for protein atoms to 11 mol2 atom types (see Additional file 1: Table S1) and summing over all pairs that are mapped to the same mol2 atom type as

$$ N_{obs}^{{\prime }} \left( {i,k,r} \right) = \mathop \sum \limits_{j} N_{obs} \left( {i,j,r} \right)\delta \left( {map\left( j \right),k} \right) , $$
(2)

where i is protein atom type, \( \delta \left( {map\left( j \right),k} \right) \) is 1 only when the protein atom type j is mapped to mol2 atom type k, otherwise 0. Based on the \( N_{obs}^{{\prime }} \left( {i,k,r} \right) \), we can derive the potential function in the same manner as DFIRE. This design enables us to obtain the scoring function purely from protein atoms without requiring their binding partners, so we employed our recently collected 12,450 non-redundant protein monomer chains [39] to obtain a sufficient number of observations. This training set represents more than 60 times bigger than the dataset (195 complex structures) used for deriving DLIGAND. For ligand mol2 atom types not existed in proteins, they were mapped to the closest atom type, as detailed in Additional file 1: Table S1. We also adopted the low-count correction according to Bayesian statistics as the previous study [40].

Benchmark datasets

Four benchmark datasets were employed to evaluate DLIGAND2. The first dataset is CASF-2013 [5, 6], a widely used benchmark containing 195 representative protein–ligand complexes. This benchmark has been used to test the accuracy of binding affinity prediction by using experimentally determined protein–ligand complex structures. The second dataset is the PDBbind refined set (version 2016) [41] of 4057 protein–ligand interaction pairs with experimentally measured binding affinity data. We generated protein–ligand complex structures by docking ligands onto their corresponding receptors respectively with eight docking packages, including AudoDock (version 4.2.6) [42], AutoDock Vina (version 1.1.2) [43], rDock (version 2013.1) [44], LeDock (version 1.0) [45], UCSF DOCK (version 6.8) [46], iDock (version 2013.1) [47], GalaxyDock (with BP2 Score) [48, 49], and iGEMDOCK (version 2.1) [50]. Docking ligands are confined to a 10 Å box enclosing the centroid of co-crystalized ligand. The maximum number of docking poses for each ligand was set to 10. After removing complexes failing to yield any complex structures in our selected docking programs, a collection of 4044 complexes remained for evaluation. The full list of 4044 complexes can be found in Additional file 2: Table S2. The scoring ability of functions were evaluated by the Pearson correlation coefficient (PCC) between the predicted and experimental values, as well as the root mean squared error (RMSE) after linear regression.

The ability of DLIGAND2 to perform virtual screening was also evalued on the DUD-E dataset [51]. There are 22, 886 active ligands binding with 102 targets, with an average of 224 ligands per target. For each target, the DUD-E database provides an abundant number of decoys (50 decoys for each active) that have similar physical–chemical properties but dissimilar two-dimensional (2D) topology. We employed the 3D structure of a target protein with the highest resolution in the protein data bank for docking. This is different from original DUD-E test where the 3D structure of the best performance was selected for each target [51]. For each pair of protein target and ligand compound, we employed Autodock Vina with default options to generate one pose, which are re-scored by 5 scoring functions (ΔvinaRF20, ID-Score, X-Score, DLIGAND, and DLIGAND2).

The accuracy of each scoring function was evaluated by the LogAUC and enrichment factor (EF).

As described in DUD-E Ref. [51] and our previous studies [52, 53], LogAUC takes the logarithm of x-axis in area under curve (AUC) to show more information on enrichment at a low false positive rate. We chose three regions of EF in top x% of the DUD-E dataset, where x equals to 1, 5 and 10 respectively.

$$ EF^{x\% } = \frac{{{\raise0.7ex\hbox{${N_{True}^{x\% } }$} \!\mathord{\left/ {\vphantom {{N_{True}^{x\% } } {N_{Selected}^{x\% } }}}\right.\kern-0pt} \!\lower0.7ex\hbox{${N_{Selected}^{x\% } }$}}}}{{{\raise0.7ex\hbox{${N_{Active} }$} \!\mathord{\left/ {\vphantom {{N_{Active} } {N_{Total} }}}\right.\kern-0pt} \!\lower0.7ex\hbox{${N_{Total} }$}}}} $$
(3)

where \( N_{True}^{x\% } \), \( N_{Selected}^{x\% } \), \( N_{Selected}^{x\% } \) and \( N_{Total} \) are the number of true positives, the number of selected candidates at top x% screened candidates, the number of active compounds, and the total number of compounds in the screened library, respectively.

For a fair comparison with the machine-learning-based scoring function (RF-Score-VS [54]) trained on the DUD-E dataset, we selected protein targets from the DEKOIS 2.0 benchmark [55] if it has sequence identity less than 95% to any protein in the DUD-E according to the BLAST [56]. Finally, 55 targets were kept and sorted by their sequence identity, as detailed in Additional file 3: Table S3.

Results and discussion

The DLIGAND2 potential

Different from the united mol2 atom type used by DLIGAND, the improved version DLIGAND2 has employed residue-specific types for protein atoms, which expanded atom types from 12 types to 169 atom types. Sufficient statistics for this larger number of atom types is ensured by using 12,450 protein chains for training. Residue-specific atom types enable the discrimination of the properties (e.g. partial charge) and surrounding environments of atoms. As shown in Fig. 1a, the potential energy between ligand atom S.3 and the main-chain O atom of ASP is significantly lower than between the atom and the main-chain O atom of ARG likely because S.3 atom has a weak but negative partial charge, which is repulsive to the negative charged ASP but attractive to the positive charged ARG residue. By comparison, DLIGAND provides an average potential over 20 amino acids. Significant differences also exist for interactions involving non-polar atoms. As shown in Fig. 1b, the CB atom of GLU and the CE atom of LYS belong to C.3 as defined in mol2, despite their very different electrostatic and steric environment. Their interactions with the ligand type N.am are very different when derived independently (DLIGAND2), and enclose the average energy function from DLIGAND.

Fig. 1
figure 1

The atomic interaction potentials a between ligand type S.3 and main-chain O atom of ASP, or ARG in DLIGAND2, or their common mol2 atom type (O.2) by the DLIGAND, and b between ligand type N.am and atom CB of GLU, or CE of LYS, or their common mol2 type “C.3” by DLIGAND, as a function of distance

Evaluation results on CASF-2013 benchmark

Score power

Figure 2 compares DLIGAND and DLIGAND2 in term of their ability for predicting protein–ligand binding affinity using the CASF-2013 dataset. DLIGAND2 achieves a higher Pearson correlation coefficient (PCC) (0.572) than DLIGAND (0.526). Table 1 further compares PCC values given by 29 other scoring functions. DLIGAND2 ranks the 9th among 30 scoring functions. The improvement of DLIGAND2 over DLIGAND was made without additional training. Interestingly, the top five scores (RF-Score-v2, ID-Score, ΔvinaRF20, AutoDock-hybrid and X-ScoreHM) were all trained directly for binding affinity prediction. The scoring function of Autodock Vina achieves a PCC of 0.56, which is lower than DLIGAND2 but higher than DLIGAND. According to the root mean square error (RMSE), DLIGAND2 (RMSE of 1.85) ranks the 10th after ChemPLP@GOLD (RMSE of 1.84), which is the best in all knowledge-based potential functions. The improvement in correlation coefficients is encouraging as DLIGAND2 was trained on protein structures only.

Fig. 2
figure 2

Comparison between theoretically predicted and experimentally measured protein–ligand binding free energies for 195 complexes on the CASF-2013 testing set for a DLIGAND with a correlation coefficient of 0.526 and b DLIGAND2 with a correlation coefficient of 0.572. The solid line is from the regression fit

Table 1 Comparisons of 30 scoring functions on the CASF-2013 dataset

Docking power

The docking power refers to whether a scoring function can correctly identify the native ligand poses from the predicted poses. Table 2 shows the evaluation results of docking power compared to the results by Li et al. [5] using the same docking sets in the CASF-2013 benchmark. DLIGAND2 achieves 14% improvement in success rate over DLIGAND in detecting native poses as the first ranked pose. Among all methods compared, DLIGAND2 has a moderate performance in term of success rates in ranking the native pose within top 1, 2, and 3 (at 45.1%, 61% and 75.4%, respectively). Nevertheless, DLIGAND2 ranks the second best in all knowledge-based/statistical potential scoring functions, behind ASP@GOLD, but better than PMF@SYBYL, PMF04@DS and PMF@DS. However, ASP@GOLD is not a pure statistical energy function but an empirical mix of a statistical potential with physical-based energetic terms in ChemScore@GOLD. Thus, DLIGAND2 has the best performance for parameter-free statistical potentials.

Table 2 Success rates for the evaluation of docking power ranked by top three poses

Ranking power

The ranking power of a scoring function refers to its ability to correctly rank binders of a given target protein by their predicted binding affinities based on the poses from the crystal structures and optimized structures. Table 3 compares DLIGAND and DLIGAND2 to the evaluation results of other scoring functions of ranking power collected by Li et al. [5]. A high-level success rate indicates a completely correct ranking of all members within each ligand cluster whereas a low-level success rate denotes ranking of the best as top 1 within a cluster. Again, DLIGAND2 has a small improvement over DLIGAND in high level success rates (1.6% on crystal structures and 3% on optimized structures) but identical in low-level success rates. Compared to other statistical potentials (PMF@DF, ASP@GOLD, PMF@SYBYL), DLIGAND2 has the highest high-level success rate in crystal and optimized structures and the highest low-level success rate in optimized structures but not in crystal structures. This suggests that DLIGAND2 is less sensitive to structural changes, compared to ASP@GOLD that has the large drop in low-level success rate from crystal to optimized structures. Empirical scoring functions such as X-Score and ChemScore@SYBYL have the best performance in this test.

Table 3 Success rates (%) for the evaluation of ranking power ranked by high-level results on optimized structures

Evaluation results on PDBbind data set

The above benchmark study is based on experimentally determined, protein–ligand complex structures. We further tested DLIGAND2’s ability to predict protein–ligand binding affinities by using predicted complex structures from docking. To remove random fluctuations, we generated 10 poses for each pair of protein and ligand by each docking method, and the highest score among 10 poses by each scoring function was used to represent the predicted binding affinity, respectively. As shown in Table 4, when scored by docking methods’ own scoring functions, AutoDock Vina yields the best correlation and lowest error with experimental values (PCC of 0.501 and RMSE of 1.75), followed by GalaxyDock (PCC of 0.487 and RMSE of 1.75) and iDock (PCC of 0.485 and RMSE of 1.75). rDock and UCSF dock have PCC < 0.2 and RMSE > 1.95. Low performance by rDOCK and UCSF was consistent with a previous study [4].

Table 4 Pearson correlation coefficients and root mean squared error between experimental binding affinity and binding affinity predicted by DLIGAND, DLIGAND2, and X-Score using docking poses generated by eight docking programs along with the results from the docking programs

When re-assessed by DLIGAND2, the PCCs of predicted binding affinity consistently improve over all eight docking methods to the levels from 0.498 to 0.537 with an average of 0.523, and the RMSE from 1.69 to 1.76 with an average of 1.71. This indicates the main bottleneck of current docking method is the scoring function, as also disclosed in the previous study [4]. By comparison, DLIGAND can improve PCC values for five docking programs but decrease PCC values for 3 others with an average PCC of 0.455 and RMSE of 1.79, which are 13% lower and 4.7% higher than DLIGAND2, respectively. On the basis of average value, X-Score has a performance comparable to DLIGAND2 in PCC but a slightly higher error in RMSE. It should be noted that X-Score was trained on the complex structures homologous to the CASF-2013 benchmark dataset used here, whereas DLIGNAD2 was trained only by independent monomer structures. We also noted that DLIGAND2 is about 5 times faster than X-Score, which takes 2.7 and 13.3 h, respectively to complete this dataset (a total of 40,440 docking poses) by one CPU core of the Intel E5-2692V2 (2.2 GHz). Here, we did not compare to RF-Score (including RF-Score-v4 [59]), ΔvinaRF20 and ID-Score because they were trained on the PBDbind refined set.

Evaluation results on DUD-E data set

The DUD-E dataset is used to examine the ability to separate true ligands from decoys, a practically important problem in virtual screening. Here, we employed the DUD-E dataset to evaluate the screening power of scoring functions. The performance of DLIGAND and DLIGAND2 is compared to those of three top ranked scoring functions in the CASF-2013 benchmark (ID-Score, ΔvinaRF20, and X-Score) using the poses generated by AutoDock Vina.

As shown in the Table 5 (The detailed data can be found in Additional file 4: Table S4), DLIGAND2 achieved the best performance with an average logAUC of 10.14% and enrichment factors of 6.67 for EF1%. DLIGAND2 achieved an average EF1% of 30% higher than Autodock Vina, 52% and 64% higher than DLIGAND and X-Score, separately, and above 3 times higher than ID-score. The logAUC and enrichment factors of all targets are detailed in Additional file 4: Table S4. Notably, Autodock Vina ranks the 2nd by LogAUC and the first on EF5% and EF10%, with EF1% of 26% and 86% higher than those by X-Score, and ID-Score despite the fact that they can provide higher correlation coefficients than Autodock Vina to experimental binding affinities in the CASF-2013 dataset. This is likely because ID-Score and X-Score were all trained by the PDBbind dataset that are homologous to CASF-2013 dataset. The over-training issues in empirical or machining learning based scoring functions have also been observed in several previous studies [23]. The improvement of DLIGAND2 relative to Autodock Vina is more consistent in this independent test. As for RF-Score, the general version (RF-Score v3) for predicting binding affinity doesn’t achieve a good performance with 5.42 for EF1% [53], ranking even behind DLIGAND. Although RF-Score-VS version specifically trained based on DUD-E was reported to achieve EF1% values up to 38.96 [53], the per-target cross validation tends to have an over-estimate due to protein homologs between training and test sets [60]. We will employ an external DEKOIS 2.0 dataset to evaluate DLIGAND2 and RF-Score-VS separately below.

Table 5 The performance of six scoring functions on the DUD-E dataset

To further compare the performance of each scoring function for different protein categories, 102 targets of DUD-E dataset are separated into eight categories and evaluated by the average EF1% as shown in Table 6. DLIGAND2 has the highest values of EF1% in Cytochrome P450, GPCR, Kinase, and Protease. Especially in the category of GPCR and Kinase, DLIGAND2 has obvious advantages compared with other scoring functions by 2.03 times and 1.42 times better than the second ranked methods (ΔvinaRF20 and DLIGAND), respectively. By comparison, AutoDock Vina performs the best in the ion channel, and is far superior to DLIGAND. The scoring function ΔvinaRF20 performs the best in miscellaneous, nuclear receptors and other enzymes. DLIGAND2 doesn’t perform well in targets of kinesin-like protein 1 (KIFF11, miscellaneous) and poly (ADP-ribose) polymerase-1 (PARP1, other enzymes), likely because their binding ligand contains halogen and phosphate elements that don’t appear in training protein chains. Currently, DLIGAND2 simply treats phosphate elements equivalent to the sulfate atom type. This issue may be solved in future study by including additional ligand atoms from protein–ligand complex structures.

Table 6 Enrichment factor values (EF1%) by DLIGAND2, AutoDock Vina, ΔvinaRF20, DLIGAND, X-ScoreHM, ID-Score on eight protein categories

Among the best examples of DLIGAND2 performance, we plotted the receiver operating characteristic (ROC) for the case of PTN1 protein (protein-tyrosine phosphatase 1B). As shown in Fig. 3, DLIGAND2 has the highest area under the curve (AUC) of 0.769, followed by AutoDock Vina (0.75), X-Score (0.729) and DLIGAND (0.639). The differences are more significant at lower false positive rate, the most important region for virtual screening. Indeed, the EF1% are 28.89, 9.12, 18.32, 9.33 and 9.33 for DLIGAND2, Autodock Vina, ΔvinaRF20, DLIGAND and X-Score, respectively. The AUC of ID-Score is 0.553, close to 0.5 by the random selection.

Fig. 3
figure 3

Receiver operating characteristic (ROC) curves for the target PTN1 protein by different scoring methods

Evaluation results on DEKOIS 2.0 data set

To compare with the latest RF-Score-VS v2 (https://github.com/oddt/oddt) scoring function that was trained on the DUD-E, we have compiled a new dataset from the DEKOIS 2.0 benchmark, with all targets sorted according to their sequence identity to the DUD-E targets according to the blastpgp. Figure 4 plots the average enrichment factor (EF1%) as a function of the number of targets sorted according to sequence identity. The average EF1% for RF-Score-vs increases as the sequence identity increases, suggesting the performance of RF-Score-VS v2 is strongly depending on similarity to its training set. AutoDock Vina also has some dependence on similarity to DUD-E targets. By comparison, DLIGAND and DLIGAND2 have the least dependence except when the number of targets is low (< 10) likely due to natural fluctuations. DLIGAND2 has the highest performance when homologous targets are excluded for sequence identity less then 30% with an average EF1% at 5.72, compared to 2.34 by AutoDock Vina, 2.73 by RF-Score-VS and 3.25 by DLIGAND.

Fig. 4
figure 4

The average EF1% in the DEKOIS 2.0 benchmark over the number of targets sorted according to their increasing sequence identity (seqid) by blastpgp from the DUD-E targets

Conclusion

We have developed a new knowledge-based scoring function DLIGAND2 by extending to 167 atom types for protein atoms from 13 types in the original DLIGAND. Residue-specific atom types for proteins allow a more accurate description of the interaction of a ligand atom with different residues. To ensure sufficient statistics, DLIGAND2 is based on an updated non-redundant dataset of 12,450 protein chains, 62 times bigger than the dataset (195 structures) used in the original DLIGAND.

DLIGAND2 consistently improves over DLIGAND in binding affinity prediction using either native or docking-predicted complex structures. The improvement in Pearson correlation coefficient is 8.7% for the CASF-2013 dataset by using native complex structures and 15% for the PDBbind dataset by using predicted complex structures. In addition, DLIGAND2 has significantly higher enrichment than DLIGAND in discriminating true ligands from decoys using the DUD-E dataset according to re-ranking of docked structures. These results suggest the usefulness of expanding protein atomic types in generating the DLIGAND 2 statistical potential.

DLIGAND2 is the best knowledge-based energy score but not as accurate as a few empirical (X-Score) or machine-learning based (RF-Score-v2 and ID-Score) scores trained by CASF-2013 or PDBbind. The X-Score and ID-Score methods outperform Autodock vina in the CASF-2013 and PDBbind, but they all have lower performance in decoy discrimination, a practically more important problem. We have also shown that the performance of RF-score-vs strongly depends on the sequence identity of the target protein to the dataset for training the method. Though RF-score-vs was reported to perform well in the DUD-E that includes many homologous proteins to its training set, it doesn’t perform well on protein targets that are not homologous to its training set. By comparison, DLIGAND2 was derived from only protein monomer structures, ensuring a balanced performance for all targets. Considering the simplicity and fast computation, DLIGAND2 will be useful for re-scoring after docking, or being included as a term for other scoring functions.