Background

Protein-ligand docking is a computational tool that predicts how a ligand binds to a target protein and estimates their binding affinity. Docking is therefore useful for elucidating intermolecular interactions and for enhancing the potency and selectivity of binding in subsequent phases of computer-aided drug design. Docking has a wide variety of pragmatic and successful applications in structure-based virtual screening [1], drug repurposing [2], lead compound optimization [3], protein cavity identification [4] and protein function prediction [5].

Docking consists of two major operations: predicting the position, orientation and conformation of a ligand when docked into the protein's binding pocket, and predicting their binding strength. The former operation is known as pose generation, and the latter as scoring. State-of-the-art docking methods, such as AutoDock Vina [6] and idock [7], perform reasonably well at pose generation, with a redocking success rate of over 50% [8] on the PDBbind v2012 and v2011 benchmarks [9, 10] and the CSAR NRC HiQ Set 24 Sept 2010 [11, 12]. However, the single most critical limitation of docking is the traditionally low accuracy of the scoring functions.

Classical scoring functions are defined by the assumption of a fixed functional form for the relationship between the numerical features characterizing a protein-ligand complex and its predicted binding affinity. This functional form is composed of the energetic contributions of various intermolecular interactions and is often additive: the overall binding affinity is calculated as a weighted sum of several physically meaningful terms, with the coefficients typically derived by standard multivariate linear regression (MLR) on experimental data.

Cyscore [13], a recently published empirical scoring function, assumes that the overall protein-ligand binding free energy can be decomposed into four terms: hydrophobic free energy, van der Waals interaction energy, hydrogen bond interaction energy and the ligand's conformational entropy. Cyscore focuses on improving the prediction of the hydrophobic free energy by using a novel curvature-dependent surface-area model, which was claimed to be able to distinguish convex, planar and concave surfaces in the hydrophobic free energy calculation.

A recent study on a congeneric series of thrombin inhibitors concluded that free energy contributions to ligand binding at the molecular level are non-additive [14]; the additivity assumption of such models is therefore error prone. Recent years have seen a growing number of new machine-learning scoring functions, with RF-Score [15] being the first to introduce a large improvement over classical approaches. RF-Score, as its name suggests, uses Random Forest (RF) [16] to implicitly learn the functional form in an entirely data-driven manner, and thus circumvents the modelling assumptions imposed by previous scoring functions. RF-Score was shown to significantly outperform 16 classical scoring functions when evaluated on the common PDBbind v2007 benchmark [15]. Despite being a recent development, RF-Score has already been used successfully to discover a large number of innovative binders against antibacterial DHQase2 targets [17]. For the purpose of prospective virtual screening, RF-Score-v3 has now been incorporated into istar [8], our large-scale docking service available at http://istar.cse.cuhk.edu.hk/idock. A number of subsequent machine-learning scoring functions, including NNScore [18], SVR-KB and SVR-EP [19], CScore [20], B2Bscore [21], SFCscoreRF [22] and ID-Score [23], have also shown large improvements over classical approaches.

In this study we compare the prediction performance of two regression models, MLR and RF (to be exact, random forest regression rather than classification), and investigate their application conditions and interpretability in various contexts. The Methods section introduces MLR and RF, three sets of features, three benchmarks, two kinds of cross validation, and four performance metrics. The Results and discussion section analyzes the prediction performance of MLR and RF on the three benchmarks and discusses the conditions for applying MLR and RF. The Conclusions section emphasizes the importance of an abundance of features and samples for training RF.

Methods

Multiple linear regression (MLR) with Cyscore features

Cyscore is an empirical scoring function in an additive functional form of four energetic terms: hydrophobic free energy ΔG_hydrophobic, van der Waals interaction energy ΔG_vdw, hydrogen bond interaction energy ΔG_hbond and the ligand's conformational entropy ΔG_entropy (Eq. 1). Their coefficients k_h, k_v, k_b and k_e and the intercept C were obtained by MLR on 247 high-quality complexes carefully selected from the PDBbind v2012 refined set. The intercept value was not reported in the original publication, but it was included in this study, as is usual practice [24], in order to allow a quick estimation of the absolute binding affinity value, which is the ultimate goal in some real-world applications.

\Delta G_{\mathrm{bind}} = k_h \Delta G_{\mathrm{hydrophobic}} + k_v \Delta G_{\mathrm{vdw}} + k_b \Delta G_{\mathrm{hbond}} + k_e \Delta G_{\mathrm{entropy}} + C
(1)

We use MLR::Cyscore to denote the scoring function built with MLR and the 4 features from Cyscore. It is noteworthy that Cyscore is a pure MLR model, unlike AutoDock Vina [6], which is a quasi-MLR model because the number of rotatable bonds N_rot appears in the denominator so as to penalize ligand flexibility (see [8] for the exact equation); a corresponding MLR::Vina model would therefore require an additional grid search for the weight of the N_rot parameter. Hence this study allows a more direct comparison between MLR and RF.
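For illustration, the following is a minimal sketch of how the coefficients of Eq. 1 can be fitted by ordinary least squares. It uses Python with scikit-learn and random placeholder data, neither of which is part of the original Cyscore work; the column ordering of the feature matrix is an assumption of this sketch.

```python
# Minimal sketch of MLR::Cyscore: ordinary least squares on the four Cyscore
# features of Eq. 1. Feature ordering and placeholder data are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_mlr_cyscore(X_train, y_train):
    """Fit k_h, k_v, k_b, k_e and the intercept C by multiple linear regression.

    X_train: array of shape (n_samples, 4) holding the Cyscore terms
             [Hydrophobic, Vdw, HBond, Ent]; y_train: measured binding affinities.
    """
    model = LinearRegression()          # fits the coefficients and the intercept C
    model.fit(X_train, y_train)
    return model

# Example usage with random placeholder data (247 training complexes, 4 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(247, 4))
y = rng.normal(loc=6.0, scale=2.0, size=247)
mlr = fit_mlr_cyscore(X, y)
print("coefficients k_h, k_v, k_b, k_e:", mlr.coef_)
print("intercept C:", mlr.intercept_)
```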

Random forest (RF) with Cyscore, AutoDock Vina and RF-score features

A RF [16] is a consensus of a large number of different decision trees generated from random bootstrap sampling of the same training data. During tree construction, at each inner node RF chooses the best splitting feature, i.e. the one resulting in the highest purity gain, from a normally small number (mtry) of randomly selected features rather than from all input features. In regression problems, the final output is calculated as the arithmetic mean of the predictions of all individual trees in the RF. Further details on RF construction can be found in [8, 15].

In this study, multiple RFs of the default size of 500 trees were built using values of the mtry control parameter ranging from one to the total number of input features. The selected RF was the one with the lowest root mean square error (RMSE) on the out-of-bag (OOB) samples of the training set. Only a single random seed was used for training because the seed has little impact on prediction performance, and using fewer seeds has the additional advantage of a computationally faster training process.
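This mtry selection procedure can be sketched as follows with scikit-learn's RandomForestRegressor, where max_features plays the role of mtry. The sketch is an illustrative re-implementation under these assumptions, not the code used to produce the results reported here, and the training data are placeholders.

```python
# Sketch of the mtry selection described above: train one 500-tree forest per
# candidate mtry and keep the forest with the lowest RMSE on the OOB samples.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_rf_select_mtry(X_train, y_train, n_trees=500, seed=0):
    """Return (best forest, best mtry, best OOB RMSE) over mtry = 1..n_features."""
    best_rf, best_rmse, best_mtry = None, np.inf, None
    for mtry in range(1, X_train.shape[1] + 1):
        rf = RandomForestRegressor(n_estimators=n_trees, max_features=mtry,
                                   oob_score=True, random_state=seed, n_jobs=-1)
        rf.fit(X_train, y_train)
        # RMSE on the out-of-bag predictions of the training set
        oob_rmse = np.sqrt(np.mean((rf.oob_prediction_ - y_train) ** 2))
        if oob_rmse < best_rmse:
            best_rf, best_rmse, best_mtry = rf, oob_rmse, mtry
    return best_rf, best_mtry, best_rmse
```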

In our experiments we aimed at analyzing how RF responds to varying numbers of features, and hence we selected three sets of features: those of Cyscore [13], AutoDock Vina [6] and RF-Score [15]. Cyscore comprises four numerical features: ΔG_hydrophobic, ΔG_vdw, ΔG_hbond and ΔG_entropy. AutoDock Vina comprises six numerical features: Gauss1, Gauss2, Repulsion, Hydrophobic, HBonding and N_rot. RF-Score comprises 36 features, defined as the occurrence counts of intermolecular contacts between pairs of elemental atom types. Four atom types for proteins (C, N, O, S) and nine for ligands (C, N, O, S, P, F, Cl, Br, I) were selected so as to generate dense features while covering all the heavy atom types commonly observed in protein-ligand complexes. Table 1 summarizes the three combinations of these feature sets used to train RF models. Altogether four models (MLR::Cyscore, RF::Cyscore, RF::CyscoreVina and RF::CyscoreVinaElem) were evaluated in this study.
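The RF-Score contact counts can be illustrated with the sketch below, which counts protein-ligand elemental atom-type pairs within a distance cutoff (12 Å in the original RF-Score publication). The plain (element, x, y, z) atom tuples are an assumption of this sketch, and parsing of structure files is omitted.

```python
# Sketch of RF-Score-style features: occurrence counts of protein-ligand
# elemental atom-type pairs within a distance cutoff. Atom records are assumed
# to be (element, x, y, z) tuples; PDB/PDBQT parsing is not shown.
import itertools
import numpy as np

PROTEIN_ELEMENTS = ["C", "N", "O", "S"]
LIGAND_ELEMENTS = ["C", "N", "O", "S", "P", "F", "Cl", "Br", "I"]

def rf_score_features(protein_atoms, ligand_atoms, cutoff=12.0):
    """Return the 4 x 9 = 36 contact counts as a flat feature vector."""
    counts = {pair: 0 for pair in itertools.product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS)}
    for p_elem, px, py, pz in protein_atoms:
        for l_elem, lx, ly, lz in ligand_atoms:
            if (p_elem, l_elem) not in counts:
                continue  # skip heavy atom types outside the selected sets
            d2 = (px - lx) ** 2 + (py - ly) ** 2 + (pz - lz) ** 2
            if d2 <= cutoff ** 2:
                counts[(p_elem, l_elem)] += 1
    return np.array([counts[pair] for pair in
                     itertools.product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS)])
```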

Table 1 The three combinations of three different sets of features used to train RF models in this study

PDBbind v2007 and v2012 benchmarks

The PDBbind [9, 10] benchmark is arguably the most widely used for binding affinity prediction. It contains an especially diverse collection of experimentally resolved protein-ligand complexes, assembled through a systematic mining of the yearly releases of the entire PDB [25, 26]. For each complex, the experimentally measured binding affinity, either dissociation constant Kd or inhibition constant Ki, was manually collected from its primary literature reference. The complexes with a resolution of ≤2.5 Å and with the ligand comprising only nine common heavy atom types (C, N, O, F, P, S, Cl, Br, I) were filtered to constitute the refined set. These complexes were then clustered by protein sequence similarity with a cutoff of 90%, and for each of the resulting clusters with at least five complexes, the three complexes with the highest, median and lowest binding affinity were selected to constitute the core set. Because of the structural diversity of the core set, it is common practice to use the core set as a test set and the remaining complexes in the refined set as a training set.
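The core-set selection rule can be sketched as follows, assuming that clustering by 90% sequence identity has already been performed elsewhere and that each complex is given as a hypothetical (pdb_id, cluster_id, affinity) record.

```python
# Sketch of core-set selection: within each cluster of at least five complexes,
# keep the complexes with the highest, median and lowest binding affinity.
# The record format and cluster labels are hypothetical.
from collections import defaultdict

def select_core_set(records, min_cluster_size=5):
    clusters = defaultdict(list)
    for pdb_id, cluster_id, affinity in records:
        clusters[cluster_id].append((affinity, pdb_id))
    core = []
    for members in clusters.values():
        if len(members) < min_cluster_size:
            continue
        members.sort()                          # ascending binding affinity
        picks = (members[0], members[len(members) // 2], members[-1])
        core.extend(pdb_id for _, pdb_id in picks)
    return core
```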

Cyscore was tested on two independent sets: the PDBbind v2007 core set (N = 195) and the PDBbind v2012 core set (N = 201), whose experimental binding affinities span 12.56 and 9.85 pKd units, respectively. Cyscore was trained on a separate set of 247 complexes carefully selected from the PDBbind v2012 refined set using certain criteria [13] (e.g. structural resolution < 1.8 Å, binding affinity spanning 1 to 11 kcal/mol, protein sequence similarity and ligand chemical composition different from the test set), ensuring that the training complexes are of high quality and do not overlap with either of the two test sets. In this study we used exactly the same training set and the same test sets in order to make a fair comparison to Cyscore.

Furthermore, considering that 16 classical scoring functions have already been evaluated [24] on the PDBbind v2007 core set and that the top-performing ones (e.g. X-Score) were trained on the remaining 1105 complexes in the PDBbind v2007 refined set, we also used these 1105 complexes as another training set to permit a direct comparison. Using predefined training and test sets, on which other scoring functions have previously been trained and tested, has the advantage of reducing the risk of using a benchmark that favours one particular scoring function.

Likewise for the PDBbind v2012 benchmark, we used an additional training set comprising the complexes in PDBbind v2012 refined set excluding those in PDBbind v2012 core set. This led to a total of 2696 complexes. By construction, this training set does not overlap with the test set.

PDBbind v2013 round-robin benchmark

We propose a new benchmark to investigate how the prediction performance of the four models changes in cross validation and with varying numbers of training samples. We used the PDBbind v2013 refined set (N = 2959), which is the latest version and constitutes the most comprehensive publicly available structural dataset suitable for training scoring functions.

We used 5-fold cross validation, as was used by the recently published empirical scoring function ID-Score [23], to reduce overfitting and thus generalization errors. The entire PDBbind v2013 refined set (N = 2959) was divided into five equal partitions using uniform sampling on a round-robin basis: the entire 2959 complexes were first sorted in the ascending order of their measured binding affinity, and the complexes with the 1st, 6th, 11th, etc. lowest binding affinity belonged to the first partition, the complexes with the 2nd, 7th, 12th, etc. lowest binding affinity belonged to the second partition, and so on. This partitioning method, though not completely random, has two advantages: on one hand, each partition is guaranteed to span the largest range of binding affinities and incorporates the largest structural diversity of different protein families; on the other hand, each partition is composed of a deterministic list of complexes, permitting reproducibility and comparisons in future studies. Table 2 summarizes the statistics of the five partitions. The PDB IDs and measured binding affinities of the complexes in the five partitions are available in the Additional file 1.
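A sketch of this round-robin partitioning is given below; the (pdb_id, affinity) input records are hypothetical, and partitions are returned as deterministic lists of PDB IDs.

```python
# Sketch of the round-robin partitioning: sort complexes by measured binding
# affinity and deal them into the five partitions in turn, so that the complexes
# with the 1st, 6th, 11th, ... lowest affinities fall into partition 1, and so on.
def round_robin_partitions(records, n_partitions=5):
    """records: iterable of hypothetical (pdb_id, affinity) pairs."""
    ordered = sorted(records, key=lambda r: r[1])     # ascending binding affinity
    partitions = [[] for _ in range(n_partitions)]
    for i, (pdb_id, _) in enumerate(ordered):
        partitions[i % n_partitions].append(pdb_id)
    return partitions
```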

Table 2 The statistics of the five partitions of PDBbind v2013 refined set (N = 2959)

We then used the partition on which the best performance was obtained (this turned out to be partition 2 (N = 592); see the Results and discussion section) as the test set in the PDBbind v2013 round-robin benchmark, and used the remaining four partitions (1, 3, 4, 5) to construct four training sets of incremental sizes: the first training set comprises partition 1 (N = 592), the second comprises partitions 1 and 3 (N = 1184), the third comprises partitions 1, 3 and 4 (N = 1776), and the fourth comprises partitions 1, 3, 4 and 5 (N = 2367). This new benchmark therefore provides a way to study how prediction performance varies with training set size. Moreover, its test set has a significantly larger number of complexes (N = 592) than the PDBbind v2007 (N = 195) and v2012 (N = 201) benchmarks, so this new benchmark is not a redundant duplication of the previous two. Table 3 summarizes the numbers of test and training samples for the three benchmarks.
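Continuing the partitioning sketch above, the test set and the four nested training sets can be assembled as follows; partition indices are zero-based here, so partition 2 corresponds to index 1.

```python
# Sketch of building the test set and the four incremental training sets from
# the five round-robin partitions produced above.
def incremental_training_sets(partitions, test_index=1):
    """Return (test partition, list of cumulative training sets 1; 1+3; 1+3+4; 1+3+4+5)."""
    train_indices = [i for i in range(len(partitions)) if i != test_index]
    cumulative, training_sets = [], []
    for i in train_indices:
        cumulative = cumulative + partitions[i]
        training_sets.append(list(cumulative))
    return partitions[test_index], training_sets
```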

Table 3 The numbers of test samples and training samples for the PDBbind v2007, v2012 and v2013 benchmarks used in this study

Leave-cluster-out cross validation (LCOCV)

Leave-cluster-out cross validation (LCOCV) [27], in contrast to standard cross validation, divides the complete set of complexes into protein families instead of random subsets. Each protein family, or cluster, is typically defined by 90% protein sequence identity. Protein families with at least ten complexes are treated as individual clusters, labeled A to W. Protein families with four to nine complexes are combined into cluster X. Protein families with two to three complexes are combined into cluster Y. Singletons are combined into cluster Z. Each cluster is iteratively left out of the training set and used to evaluate the predictive performance of the scoring function. The performance on each cluster can be inspected individually, and the overall performance can be estimated by averaging over all clusters.
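A generic LCOCV loop can be sketched as follows; the cluster labels (A to W, X, Y, Z) are assumed to have been assigned beforehand, and a RandomForestRegressor stands in for whichever scoring function is being assessed.

```python
# Sketch of leave-cluster-out cross validation: each cluster is held out in turn,
# the model is trained on the remaining complexes and evaluated on the held-out
# cluster. X and y are assumed to be NumPy arrays; cluster_labels is per-complex.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def leave_cluster_out_cv(X, y, cluster_labels):
    """Yield (cluster, measured affinities, predicted affinities) per left-out cluster."""
    cluster_labels = np.asarray(cluster_labels)
    for cluster in np.unique(cluster_labels):
        test_mask = cluster_labels == cluster
        rf = RandomForestRegressor(n_estimators=500, random_state=0, n_jobs=-1)
        rf.fit(X[~test_mask], y[~test_mask])
        yield cluster, y[test_mask], rf.predict(X[test_mask])
```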

So far LCOCV has been applied to the assessment of six scoring functions, which are RF-Score [20, 21, 27], ddPLAT+MOE [28], CScore [20], B2Bscore [21], SFCscoreRF [22] and the work of Ross et al. [29].

For the purpose of comparison to other scoring functions, PDBbind v2009 refined set (N = 1741) was used in this study to perform LCOCV. The 1xr8 entry in cluster X was discarded because its ligand is far away from its protein, thereby leaving 1740 complexes. The PDB IDs and measured binding affinities of the complexes in the 23 protein families (A to W) and the 3 multi-family clusters (X to Z) are available in the Additional file 2.

Performance metrics

Prediction performance was quantified through the standard deviation SD in linear correlation, the Pearson correlation coefficient Rp and the Spearman correlation coefficient Rs between the measured and predicted binding affinities of the test set. These metrics are commonly used in the community [24], and the SD metric is essentially the residual standard error (RSE) metric used in some other studies [19]. The above three metrics are invariant under linear transformations (e.g. changing the intercept or coefficient values in Eq. 1 affects none of them), so they are mainly for comparative purposes. In some applications, however, the ultimate goal of a scoring function is to report an absolute binding affinity value as close to the measured value as possible. Hence we also use a more realistic metric, the root mean square error RMSE, computed directly between measured and predicted binding affinities without any linear fitting. Lower values of RMSE and SD and higher values of Rp and Rs indicate better prediction performance.

Mathematically, equations 2, 3, 4 and 5 give the expressions of the four metrics. Given a scoring function f and the features x^(n) describing the nth complex out of N complexes in the test set, p^(n) = f(x^(n)) is the predicted binding affinity, \hat{p}^(n) are the fitted values from the linear model between {y^(n)} and {p^(n)} on the test set, whereas y_r^(n) and p_r^(n) are the rankings of {y^(n)} and {p^(n)}, respectively.

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(p^{(n)} - y^{(n)}\right)^{2}}
(2)
\mathrm{SD} = \sqrt{\frac{1}{N-2}\sum_{n=1}^{N}\left(\hat{p}^{(n)} - y^{(n)}\right)^{2}}
(3)
R_{p} = \frac{N\sum_{n=1}^{N} p^{(n)} y^{(n)} - \sum_{n=1}^{N} p^{(n)} \sum_{n=1}^{N} y^{(n)}}{\sqrt{\left(N\sum_{n=1}^{N} \left(p^{(n)}\right)^{2} - \left(\sum_{n=1}^{N} p^{(n)}\right)^{2}\right)\left(N\sum_{n=1}^{N} \left(y^{(n)}\right)^{2} - \left(\sum_{n=1}^{N} y^{(n)}\right)^{2}\right)}}
(4)
R_{s} = \frac{N\sum_{n=1}^{N} p_{r}^{(n)} y_{r}^{(n)} - \sum_{n=1}^{N} p_{r}^{(n)} \sum_{n=1}^{N} y_{r}^{(n)}}{\sqrt{\left(N\sum_{n=1}^{N} \left(p_{r}^{(n)}\right)^{2} - \left(\sum_{n=1}^{N} p_{r}^{(n)}\right)^{2}\right)\left(N\sum_{n=1}^{N} \left(y_{r}^{(n)}\right)^{2} - \left(\sum_{n=1}^{N} y_{r}^{(n)}\right)^{2}\right)}}
(5)

Results and discussion

Figure 1 plots the prediction performance of MLR::Cyscore, RF::Cyscore, RF::CyscoreVina and RF::CyscoreVinaElem using different numbers of training samples on PDBbind v2007 benchmark (N = 195), PDBbind v2012 benchmark (N = 201) and PDBbind v2013 round-robin benchmark (N = 592). The raw values are available in the Additional file 3.

Figure 1

Prediction performance of MLR::Cyscore, RF::Cyscore, RF::CyscoreVina and RF::CyscoreVinaElem trained with varying numbers of samples. First row: root mean square error RMSE. Second row: standard deviation SD in linear correlation. Third row: Pearson correlation coefficient Rp. Fourth row: Spearman correlation coefficient Rs. Left column: PDBbind v2007 benchmark (N = 195). Center column: PDBbind v2012 benchmark (N = 201). Right column: PDBbind v2013 round-robin benchmark (N = 592).

MLR::Cyscore performance does not increase with more training samples

On both the PDBbind v2007 and v2012 benchmarks, MLR::Cyscore performed best when it was trained on the 247 carefully selected complexes used by Cyscore. Its performance dropped when more complexes were used for training. On the PDBbind v2013 round-robin benchmark, MLR::Cyscore performance stayed flat regardless of training set size.

These results show that MLR::Cyscore is unable to exploit large amounts of structural data when given only a small set of sophisticated features. Feeding more training samples to MLR::Cyscore actually makes it more difficult to regress the coefficients well. Generally it would be a good idea to select the training complexes that provide the best performance on a test set, as was the case for Cyscore. However, in real applications the binding affinities of the test set are not known, so the selection of training complexes must be performed blindly (i.e. without measuring performance on the test set), and unfortunately such tailored selection is not possible.

RF performance increases with more structural features and training samples

On all the three benchmarks, given the same set of features, the RF models trained with more samples resulted in higher prediction accuracy. Similarly, given the same training samples, the RF models trained with more features resulted in higher prediction accuracy.

These results suggest that RF is capable of effectively exploiting a comprehensive set of structural features and training samples. Generally, the more training samples there are, the more knowledge RF can learn so as to capture the non-linearity of the structural data. Likewise, the more appropriate features there are, the higher the probability of choosing a splitting feature that results in a high purity gain at non-leaf nodes during RF construction, and hence the higher the chance of boosted RF performance.

RF models perform consistently well in cross validation

Table 4 shows the results of 5-fold cross validation for all the four models. The best performance was obtained on partition 2. In terms of average performance, the relative performance ranking is consistent, where RF::CyscoreVinaElem (RMSE = 1.35, SD = 1.35, Rp = 0.738, Rs = 0.738) is better than RF::CyscoreVina (RMSE = 1.44, SD = 1.44, Rp = 0.693, Rs = 0.690), which is better than RF::Cyscore (RMSE = 1.59, SD = 1.59, Rp = 0.603, Rs = 0.587), which is better than MLR::Cyscore (RMSE = 1.66, SD = 1.66, Rp = 0.556, Rs = 0.559).

Table 4 Cross validation results of the four models on the five partitions of PDBbind v2013 refined set (N = 2959) in terms of root mean square error RMSE, standard deviation SD in linear correlation, Pearson correlation coefficient Rp and Spearman correlation coefficient Rs

Leave-cluster-out cross validation leads to unrealistically low performance

Table 5 shows the results of leave-cluster-out cross validation (LCOCV) for all four models. Not unexpectedly, the observed performance is very heterogeneous across the different protein families. These results agree with the LCOCV results of six other scoring functions from previous studies [20-22, 27-29]. By analyzing the LCOCV statistics of all these ten scoring functions, we found that they all performed well on certain clusters (e.g. trypsin and β-secretase I) and poorly on some other clusters (e.g. HIV protease and factor Xa). The reasons for the large spread of performance across the different clusters are manifold, and a comprehensive analysis for each protein family would be beyond the scope of this study. As pointed out in [22], eliminating all the HIV protease complexes leads to an imbalance between the training and test sets because HIV protease inhibitors are on average much larger than the ligands of the other targets. This illustrates that LCOCV results should not be directly interpreted as performance measures on particular protein families. Moreover, the limited size of many clusters and the small range of measured binding affinity values therein make a satisfactory prediction of the ranking rather challenging.

Table 5 Leave-cluster-out cross validation results of the four models on the 23 protein families (A to W) and 3 multi-family (X to Z) clusters of PDBbind v2009 refined set (N = 1740) in terms of root mean square error RMSE, standard deviation SD in linear correlation, Pearson correlation coefficient Rp and Spearman correlation coefficient Rs

While results on standard cross validation might be too optimistic, results on leave-cluster-out cross validation might be too pessimistic. We emphasize that LCOCV is only suitable for estimating the performance of a generic scoring function on a truly new target protein, i.e. one that does not belong to a cluster represented by any of the proteins in the training set. This constitutes a very uncommon scenario in real-life applications, because it is rare for a target protein to have no high sequence similarity to any protein in a large and diverse training set. In fact, such complexes should never be eliminated from a training set. Instead, the composition of the training set should reflect as closely as possible the actual complexes on which the scoring function is to be applied. Consequently, LCOCV is not appropriate for evaluating generic scoring functions, as previously argued [30].

Machine-learning scoring functions are significantly more accurate than classical scoring functions with fixed functional forms

Table 6 compares Cyscore, RF::Cyscore, RF::CyscoreVina and RF::CyscoreVinaElem against 21 other scoring functions on PDBbind v2007 core set (N = 195), with RF::CyscoreVinaElem performing best in terms of Rp, Rs and SD. It is worth noting that the top four scoring functions are all trained with RF.

Table 6 Prediction performance of 25 scoring functions evaluated on PDBbind v2007 core set (N = 195) in terms of Pearson correlation coefficient Rp, Spearman correlation coefficient Rs and standard deviation SD in linear correlation on the test set

Substituting RF for MLR and incorporating more features and training samples strongly improves Cyscore

Figure 2 compares the prediction performance of Cyscore and RF::CyscoreVinaElem. RF::CyscoreVinaElem reduces RMSE by 0.28 and SD by 0.37 and increases Rp by 0.143 and Rs by 0.111 on the PDBbind v2007 benchmark; it reduces RMSE by 0.14 and SD by 0.25 and increases Rp by 0.106 and Rs by 0.093 on the PDBbind v2012 benchmark; and it reduces RMSE by 0.40 and SD by 0.29 and increases Rp by 0.187 and Rs by 0.184 on the PDBbind v2013 round-robin benchmark.

Figure 2

Correlation plots of predicted binding affinities against measured ones. Top row: Cyscore. Bottom row: RF::CyscoreVinaElem. Left column: PDBbind v2007 benchmark (N = 195), with RF::CyscoreVinaElem trained on 1105 complexes. Center column: PDBbind v2012 benchmark (N = 201), with RF::CyscoreVinaElem trained on 2696 complexes. Right column: PDBbind v2013 round-robin benchmark (N = 592), with RF::CyscoreVinaElem trained on 2367 complexes.

These results show that RF::CyscoreVinaElem performed consistently better than Cyscore on all the three benchmarks. It is important to note that, in each benchmark, both scoring functions used the same non-overlapping training and test sets. Taken together, these results show that one can develop a much more accurate scoring function out of an existing one simply by changing the regression model from MLR to RF and incorporating more structural features and training samples.

Sensitivity analysis of the RF model can determine feature importance

Unlike classical scoring functions, RF-based scoring functions can hardly be expressed explicitly as a mathematical equation like Eq. 1. It is therefore useful to employ the variable importance tool of RF, which estimates the importance of each feature by randomly permuting its training values; the feature leading to the largest variation in predicted binding affinity on the OOB data can be regarded as the most important for a particular training set. Figure 3 plots the percentage increase in mean square error (%IncMSE) observed when each of the 4 Cyscore features used to train RF was noised up. All 4 features turned out to be important (%IncMSE > 20), with van der Waals interaction energy (Vdw) and hydrophobic free energy (Hydrophobic) being relatively more important (%IncMSE > 40). Correctly estimating variable importance can assist in feature selection and in understanding ligand binding.
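The %IncMSE measure can be approximated as sketched below by permuting each feature and recording the relative increase in mean square error. The R randomForest package computes this quantity on the OOB samples; for simplicity this sketch uses a held-out validation split instead, so its numbers are indicative rather than identical to those in Figure 3.

```python
# Sketch of a %IncMSE-style permutation importance: permute one feature at a
# time and report the percentage increase in mean square error, averaged over
# several random permutations. X_valid and y_valid are NumPy arrays.
import numpy as np

def percent_inc_mse(model, X_valid, y_valid, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    base_mse = np.mean((model.predict(X_valid) - y_valid) ** 2)
    importances = []
    for j in range(X_valid.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_valid.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # noise up feature j
            mse = np.mean((model.predict(X_perm) - y_valid) ** 2)
            increases.append(100.0 * (mse - base_mse) / base_mse)
        importances.append(np.mean(increases))
    return np.array(importances)   # one %IncMSE-style value per feature
```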

Figure 3

RF::Cyscore feature importance estimated on internal OOB data of the 1105 complexes from the PDBbind v2007 refined set. The four features are hydrophobic free energy (Hydrophobic), van der Waals interaction energy (Vdw), hydrogen bond interaction energy (HBond) and ligand's conformational entropy (Ent). The %IncMSE value of a particular feature was computed as the percentage increase in mean square error observed in OOB prediction when that feature was randomly permuted.

Conclusions

In this study we have demonstrated that, on one hand, the multiple linear regression (MLR) model used in many scoring functions such as Cyscore does not improve its performance in the presence of abundant training samples. This is a particularly significant drawback for MLR-based scoring functions because they cannot benefit from the future availability of more experimental data. On the other hand, RF-based scoring functions can comprehensively capture the non-linear nature of the data and thus assimilate data significantly better than MLR-based scoring functions. Most importantly, feeding more training samples to RF increases its prediction performance; such improvements with dataset size can only be gained with an appropriate regression model. Simply changing the regression model of Cyscore from MLR to RF and expanding the feature set and the sample set significantly increases the prediction accuracy. The performance gap between MLR-based and RF-based scoring functions will be further widened by the future availability of more and more X-ray crystal structures.

Moreover, classical empirical scoring functions usually rely on complicated energetic contributions that must be carefully devised from intermolecular interactions, whereas RF-based scoring functions can also effectively exploit features as simple as occurrence counts of intermolecular contacts. It has also been shown that functional group contributions in protein-ligand binding are non-additive, which means new features cannot be easily incorporated into an existing MLR model. In this study we have shown that appropriately using more structural features can also substantially enhance the prediction accuracy of RF, as can be seen in the comparison between RF::CyscoreVinaElem and RF::Cyscore. This further stresses the importance of substituting RF for MLR in scoring function development.