The prediction task examined here parallels that frequently seen by molecular modelers: given some new molecule that is structurally different from those seen before, identify the manner in which it binds the particular active site in question. Questions about binding mode frequently arise after activity in a biological assay against a target of interest has been verified following a high-throughput screen. They also arise when new ligand structures are reported within the scientific or patent literature. Understanding the geometric relationship between new ligands and ones that have been studied can be of great utility in shaping design decisions during lead optimization. By partitioning our set of protein-ligand complex data temporally, by target, we have tried to mirror the interesting case: one where there is sufficient uncertainty about the binding mode of a particular ligand (or its effect on a binding pocket) that it is worth the time and effort to make an experimental determination by X-ray crystallography.
Table 2 summarizes the prediction results for each target and overall. By combining substructural guidance during docking and similarity-based guidance in ranking pose families, we achieved a mean success rate for top-scoring pose families of 62 %. Considering the top two pose families, the overall success rate was 74 %. Figure 8 (left plot) shows the cumulative histograms of success rates aggregated for all targets under three protocols: (1) no knowledge-based guidance; (2) guidance only from substructural hints during docking; and (3) additional guidance from knowledge of bound ligands for pose family ranking (as with the results in Table 2). Each of the shifts in distribution of RMSD was statistically significant (p \(\ll \) 0.001 by Kolmogorov-Smirnov).
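As an illustration of the kind of comparison behind this significance statement, the following is a minimal sketch of a two-sample Kolmogorov-Smirnov test on two RMSD distributions. It is not the implementation used in the study, which is not specified here. The function names and the RMSD values are hypothetical, and the critical value uses the standard large-sample approximation, so it is only indicative for small samples.

```python
from bisect import bisect_right
import math


def ks_statistic(a, b):
    """Maximum vertical gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_critical_value(n, m, alpha=0.001):
    """Large-sample critical value for the two-sample KS statistic.

    Uses the usual approximation c(alpha) = sqrt(-0.5 * ln(alpha / 2)).
    """
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Hypothetical RMSD values (in Angstroms) under two docking protocols.
unguided = [0.7, 1.9, 2.4, 3.1, 3.7, 4.2, 5.0, 6.3, 7.9, 8.5]
guided = [0.5, 0.6, 0.8, 1.1, 1.4, 1.8, 2.3, 2.9, 3.6, 4.0]

d = ks_statistic(unguided, guided)
# The shift is called significant at level alpha if d exceeds the critical value.
significant = d > ks_critical_value(len(unguided), len(guided))
```

The statistic depends only on the largest gap between the two cumulative histograms, which is why it pairs naturally with plots of the kind shown in Fig. 8.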
Table 2 Summary of results in terms of success percentages at an RMSD threshold of 2.0 Å for the top pose family, top 2, 5, and 10
At right in Fig. 8, performance is shown for all pose families to illustrate the effects of using protein ensembles. The red and blue curves show the effect of substructural guidance during docking, which is marginal at the 2.0 Å threshold, but significant overall. The green curve shows the effect of using the individual protein exemplars singly for all targets, giving the overall performance for all targets and all protein exemplars. The improvement using protein ensembles was highly significant compared with using single protein variants. The per-target patterns of performance using different single variants compared with using the ensemble exhibited some diversity, and this will be discussed below.
Analogous plots for individual targets (all except for CA-II, which exhibited little relative novelty in the context of the other nine targets) are shown in Figs. 9 (overall docking performance) and 10 (effects of protein ensemble use). In Fig. 9, the blue and magenta curves correspond to the top scoring pose family and top two, respectively, with the yellow highlight bar showing the success rate for a threshold of 2.0 Å. When using both types of knowledge-based guidance, except for \(\hbox {PPAR}\gamma \), typical success rates for top-scoring pose family (the blue curves) ranged from 60 to 75 %. The gap between the red and green curves represents the value of using substructural hints during the docking process. The gap between the green and blue curves shows the value of using similarity to known bound poses of ligands in addition to the substructural hints during docking. Overall, as seen in Fig. 8, the value of substructural hints was less (often substantially so) compared to similarity-based information for pose family re-ranking.
The effects of using a protein ensemble versus using the individual exemplars within the ensemble were more varied. In all cases, it was possible to choose a particular protein from among each set of five that would produce substantially poorer performance than other variants or the ensemble produced. In a few cases, fortunate choice of a particular ensemble member yielded performance nearly as good as seen from the ensemble (see the thrombin and HSP90 examples, in particular). In two cases (MAPK14 and BACE1), performance using any single protein variant was substantially worse than that observed using the ensemble.
The three most challenging targets from our previous study (CDK2, thrombin, and MAPK14 [31]) yielded excellent performance in the current study, with mean performance of 71 % for top-scoring pose family and 80 % for the top two. In this study, the most challenging targets were \(\hbox {PPAR}\gamma \), PTP1b, and HIV-PR. These shared in common the highest proportion of test ligands whose structures were not only very different by 2D similarity to previously known ligands, but they were also very different in terms of their maximal 3D similarity (in their bound pose) to previous ligands. Other aspects such as ligand size/flexibility and binding site volume were less important.
In what follows, performance on each target will be discussed in some detail, with particular attention paid to whether the knowledge-utilization strategies had different impact on different targets. The order in which the discussion is organized is from best to worst performance for top-scoring pose family (using both types of structural guidance): CDK2, MAPK14, Thrombin, HIV-RT, CA-II, BACE1, \(\hbox {HSP90}\alpha \), HIV-PR, PTP1b, and \(\hbox {PPAR}\gamma \). This corresponds to the order seen in Figs. 9 and 10. Following that, the effects of our protein selection procedure will be discussed.
CDK2
CDK2 was used as an example throughout the description of the methodology, as it represents a typical case in terms of performance within the overall benchmark. As seen in Fig. 9, the use of substructural information during docking produced a mild performance improvement (roughly 4 percentage points), but the use of similarity-based knowledge of prior known ligand binding modes produced a large improvement (roughly 17 points).
The reasons are two-fold. First, except for a few target cases, the search algorithms within the Surflex-Dock optimization procedure are generally adequate for identifying some close-to-correct poses within the set of 100 produced, even without using any information about previously identified binding motifs (the red curves in Fig. 10). Second, the difference in energy between native-like solutions (e.g. Fig. 7a) and incorrect ones (Fig. 7b) is small, often (as in this case) less than 1 kcal/mol. So the problem, generally speaking, is not one of discovering a good solution, but ranking it as such.
These observations paralleled our previous study, using a much smaller data set from Sutherland et al. [29]. In that work, CDK2 was among eight different targets, with 211 ligands used for testing overall. Pose ranking was the key problem, with less than half of the cases in which a correct pose was produced being correctly identified as such. In the previous work, protein pocket adaptation (Cartesian-space all-atom optimization) was used to influence pose family rankings. In the most difficult cases, of which CDK2 was one, such optimization was able to improve success rates for top-ranked pose family by a few percentage points. However, in cases where two variant methods for pocket adaptation agreed on the top-ranked pose family, there was a very substantial improvement in success rate.
In the present work, rather than identifying cases in which alternate methods agree on a particular ligand (which may happen infrequently), the current approach looks for agreement, measured by similarity, to the known configurations of bound ligands whose structure was determined at an earlier time point. For CDK2, this resulted in a success rate of 75 % for top-ranked pose family and over 85 % for the top two. Failures to identify any correct solutions happened less than 5 % of the time.
The behavior of the CDK2 case with respect to protein variant choice was also typical (see Fig. 10, top left). Two particular variants, when used alone (the green and teal curves), yielded poor performance. That is, even when selecting from among the five variants, each of which was itself the center of a cluster of variants, it was possible to obtain one that could not be used to identify reasonable poses for many ligands. Three other variants (the magenta, brown, and dark blue curves) performed nearly as well as the ensemble (red curve), except at lower thresholds of RMSD, where a significant advantage for the ensemble approach emerged. Importantly, it is not clear that one can know in advance whether a single variant will be generally successful or, if so, which one will perform well. So, making use of the ensemble is an effective strategy. In some cases (discussed more below), it is essential.
One additional point is illustrated in Fig. 11. In the case of the 2XNB complex, the deposited structure contained two alternative poses for the ligand, the first of which was used for the deviation calculations. The bottom of Fig. 11 shows the density corresponding to the ligand (red mesh) along with the two modeled alternative poses, which appear to represent a good explanation of the observed density. The top of the figure shows the full set of poses for the top-ranked pose family, along with an imputed density surface (in transparent blue), contoured to provide a comparison with the experimental density. It seems probable that numerous solutions exist which simultaneously respect the internal energetics of the ligand, the observed density, and sensible interactions with the protein. While the positional variation seen at the left-hand-side of the ligand in the full predicted pose family may extend beyond what is supportable by experiment, more variation than is represented in the modeled structure may exist. We believe that the perspectives of both the structural biologist and the molecular modeler can benefit from broader consideration of pose variants than has been historically done.
MAPK14
Overall performance on MAPK14 was very similar to that seen for CDK2, with the exception that the benefit from making use of substructural knowledge was slightly higher. There was a roughly 9-point difference in success rates at the 2.0 Å threshold when comparing the red and green cumulative histograms of Fig. 9. The success rate for the top-scoring pose family using both types of knowledge-based guidance was 70 %, and the top two pose families yielded 80 % correct. Note that, as with CDK2, MAPK14 was among the three most challenging targets in our previous work, but the combination of knowledge-guidance from substructural hints and similarity-based pose family re-ranking made it the second-best in this work.
As with CDK2, there was significant performance variation among the five chosen protein pocket variants, when used singly (see Fig. 10, top middle plot). However, in sharp contrast, the very best of these performed 26 percentage points worse than the ensemble. In this case, joint use of the five proteins was crucial to uncovering correct solutions for many ligands. The only other target where this pattern emerged was BACE1. These two cases were used for a systematic test of different strategies for choosing protein ensembles, the results of which will be discussed after the individual protein targets.
Thrombin
Thrombin was the third of the three targets shared with our previous cross-docking study, and as with the previous two discussed, was one of the most challenging. In that work, top-scoring pose family success was roughly 50 %, with the top two pose families achieving roughly 60 %. Here, the comparable numbers were 68 and 76 %, with the difference being essentially entirely attributable to the use of similarity-based re-ranking of poses. Figure 12 shows the results of docking for the test ligand from PDB structure 1ZGV, an example of a thrombin inhibitor with a non-basic S1 binding pocket element. The result obtained from an agnostic docking protocol yielded an RMSD of 3.7 Å (top), getting the placement of the S1-pocket element correct, but flipping the remainder of the molecule out of the correct pose. The unguided protocol contained an excellent solution (0.7 Å), but the docking score for that solution did not result in the top-scoring pose family. Use of substructural guidance produced more numerous and better solutions, which, with the inclusion of similarity-based re-ranking, yielded a solution with just 0.6 Å deviation (bottom of Fig. 12). This case represented a thrombin inhibitor of limited flexibility (just 6 rotatable bonds), which presented little challenge with respect to search adequacy but was difficult in terms of the precise ranking among the poses produced.
A more challenging example, with 11 rotatable bonds, is shown in Fig. 13. This structure was deposited in the PDB in April 2011, nearly ten years after the most recent structures from the “known” pool. The top-scoring pose family in the agnostic protocol contained a single pose, which was flipped completely around the central proline, resulting in a deviation of 8.5 Å. Under the guided protocol, the top-scoring family (bottom left), achieved a degree of congruence with the experimental solution, correctly placing the sulfonamide substituent and obtaining grossly correct positions for the proline linker and the chloro-benzylamine (2.3 Å RMSD, 51 % probability score). The second pose family (46 % probability) was correct, deviating 0.5 Å from the experimental pose.
The nominal docking scores of the various solutions represented in Fig. 13 were within 1.0 kcal/mol of one another. Cases such as this, with flexible peptide-like ligands having low ligand-efficiency, are among the most challenging in pose prediction. Approaches that seek to disambiguate such pose variants using purely energetic estimation approaches face a high bar. Note that the correct position of the primary amine is away from the aspartic acid residue within the S1 pocket of thrombin, instead being apparently stabilized through intramolecular contacts. Note also that the early complexes used to inform the docking and pose-ranking protocol were dominated by basic groups at the S1 position, with no examples of chloro-phenyl or similar groups. The binding motif seen in the linker from the S1 binding element (including the sulfonamide) to the hydrophobic substituent was of use in identifying the correct configuration.
HIV-RT
The HIV-RT ligands were bound in the non-nucleoside binding site, which was the smallest site, by far, among the ten targets studied (the next larger site, that of CDK2, was slightly more than twice the volume). This, coupled with limited ligand flexibility (an average of 5 rotatable bonds), mooted the issue of search adequacy. No improvement was observed using substructural hints during docking (see Fig. 9, left side, middle plot). In fact, over 95 % of test ligands yielded a predicted pose with deviation less than 1.5 Å from experimental when considering the full set of pose families produced (see Fig. 10). As was typical, particular choices of protein variant could yield poor results. However, in this case, there was a single pocket variant (protein 4) that performed indistinguishably from the ensemble. HIV-RT was the only example where this was clearly the case.
Despite the small volume, pose ranking for the small, hydrophobic ligands was a challenge. Top-scoring pose family performance was 63 %, with performance improving to 78 % when considering the top two families. Even when considering the top ten families, performance was 91 %, still less than the 98 % success attainable (all but 1 of the 46 test ligands) when considering all pose families that were generated.
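The top-N success rates quoted here and in Table 2 can be expressed compactly. The sketch below is illustrative only: the function name, the data layout (one list per ligand holding the best RMSD of each pose family in rank order), and the choice of "less than or equal to" at the threshold are assumptions, and the RMSD values are invented.

```python
def top_n_success(rmsd_by_rank, n=None, threshold=2.0):
    """Fraction of ligands with a pose family within `threshold` among the top n.

    rmsd_by_rank: one list per ligand, giving the best RMSD (in Angstroms) of
    each pose family in ranked order. n=None considers every pose family.
    """
    hits = sum(
        1 for families in rmsd_by_rank
        if any(rmsd <= threshold for rmsd in families[:n])
    )
    return hits / len(rmsd_by_rank)


# Hypothetical ranked pose-family deviations for four test ligands.
example = [
    [0.5, 3.0],        # correct at rank 1
    [4.0, 0.5, 1.0],   # correct only from rank 2 onward
    [3.7, 2.3, 0.6],   # correct only at rank 3
    [8.5, 9.0],        # no correct pose family produced
]

top1 = top_n_success(example, n=1)
top2 = top_n_success(example, n=2)
any_family = top_n_success(example)
```

The gap between the top-N rate and the all-families rate separates ranking failures from search failures: a ligand like the third one above was sampled correctly but ranked too low.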
Figure 14 shows the pocket volume along with docking results that illustrate the challenge within this small pocket. The top-scoring pose family (bottom left, cyan) deviated by 4.0 Å from experimental. The top-scoring family from the unguided protocol was worse still (not shown). Clearly, the second-ranked pose family (pink) matches the experimental pose better (0.5 Å RMSD), but the nominally large difference in deviation between the two alternatives stems from two reasonable “flips.” The pyrimidine-dione is flipped in the top-ranked configuration (“G-Fam-1”), placing the N-ethyl at right, but the core scaffold is nearly symmetric. The methyl and nitrile substituents are also reversed, again not unreasonably. This is a case where nominal RMSD gives an incomplete picture of how informative a geometric prediction may be.
BACE1
Apart from being a more challenging case, BACE1 exhibited the same overall pattern as MAPK14 in terms of the performance benefits of knowledge-based guidance and sensitivity to use of pocket variants compared with the full ensemble. Substructural guidance during docking provided a roughly 12-point improvement in performance (Fig. 9), and similarity-based re-ranking produced roughly 17 points on top of that, resulting in top-scoring pose family performance of 57 %. Performance improved to 71 and 85 % when considering the top two and five pose families, respectively. The ensemble approach produced a more than 25-point improvement over the next best single protein variant (Fig. 10). The effect of protein variant selection strategy for BACE1 will be discussed below.
Note that BACE1 had the second-most flexible set of test ligands (next to HIV-PR), with mean flexibility of 10 rotatable bonds (\(\pm 7\)). It was also the second largest site by volume (next to PTP1b), with the site enveloping 2360 Å\(^3\). BACE1 is generally considered to be a challenging target, in part because of the size and flexibility considerations, so the pose-prediction performance we observed was striking. Here, top-ranked pose family performance on diverse and highly flexible ligands in a temporally segregated cross-docking test matched that observed on challenging cognate docking benchmarks [13, 14, 23] for multiple docking methods (including methods such as Glide, ICM, GOLD, and Surflex-Dock).
HSP90
The pattern of performance improvements for HSP90 most closely paralleled that of CDK2, albeit at lower levels of overall success. Substructural guidance during docking provided a roughly 10-point improvement in performance (Fig. 9), and similarity-based re-ranking produced roughly 13 points on top of that, resulting in top-scoring pose family performance of 57 %. Performance improved to 72 and 91 % when considering the top two and five pose families, respectively. The ensemble approach produced a 5-point improvement over the best single protein variant (Fig. 10), and it was roughly 30 points better than the worst variant.
HIV-PR
HIV-PR was, by a significant margin, the target with the most flexible ligands (an average of \(16\pm 5\) rotatable bonds). That fact, coupled with an active site volume of nearly 2000 Å\(^3\), and a reasonably flexible protein, created an a priori expectation of high difficulty. It was atypical in that it was the only target for which substructural guidance during the docking process yielded a larger improvement (15 points) than similarity-based re-ranking (10 additional points). This was likely due to the extreme flexibility of the test ligands. Top-scoring pose family performance was 55 %, increasing to 69 % for two, and 75 % for top five. The value of substructural hints is clearly seen in Fig. 10 (bottom left plot), where the difference between the red and blue curves is only the use of substructural guidance. At the 2.0 Å threshold, such guidance yields a six-point advantage (for an overall success rate for all pose families of 84 %). At the 1.5 Å threshold, the improvement was 20 points; clearly a very significant impact. Similar to BACE1 and MAPK14, but to a lesser degree, the use of a protein ensemble produced better results than any single protein variant.
Figure 15 shows a typical peptide-like inhibitor (the ligand of 1ZSR), having over 20 rotatable bonds. Two views are shown of the top-scoring knowledge-guided pose family, with the crystallographic pose shown in tan and with a transparent surface. The RMSD of the closest pose within the family was 0.8 Å, and as the inhibitor meets solvent (bottom left and front right in the figure) it exhibits a greater degree of mobility in the docking result. We believe that this picture of a binding-mode prediction is more informative, and likely more accurate, than one where a single pose is displayed.
Especially given the demanding characteristics of HIV-PR, we view the performance of the knowledge-guided protocol as being a success. More broadly, to summarize thus far, performance using the knowledge-guided protocol for all but two of the ten targets reported here (PTP1b and \(\hbox {PPAR}\gamma \), discussed below) met or exceeded 55 % for top-ranked pose family and all but one (\(\hbox {PPAR}\gamma \)) met or exceeded 67 % for the top two. Except for perhaps two targets, the level of performance seen here for cross-docking matches that of challenging cognate docking benchmarks. It significantly exceeds that previously reported on substantial cross-docking benchmarks such as those described in the Introduction, where success rates of 20–30 % were common in cross-docking with single protein variants [28–31]. Note also that this is the only temporally segregated benchmark of which we are aware, and it is also one of the largest and most diverse in terms of both target types and ligand structural variety.
PTP1b
PTP1b had the largest active site (over 3000 Å\(^3\)), and its ligands were quite flexible (an average of \(10\pm 5\) rotatable bonds). The unusual aspect of performance for this target was that, at the 2.0 Å threshold, no real improvement resulted from use of either method for making use of prior knowledge, at least for top-scoring pose family. At larger deviations, there was a clear benefit for using substructural guidance (see Fig. 9), but not for similarity-based re-ranking. PTP1b also had the largest increase in success in moving from a single pose family to two (17 percentage points). Figure 16 illustrates the challenge of this binding site with an example of the improvement seen between the top and next best scoring pose family under the guided protocol. The top scoring pose family, despite having placed the buried substituent correctly, places the rigid “arm” of the inhibitor in an incorrect position along the surface of the protein (7.9 Å RMSD). The second-ranked pose family (0.5 Å RMSD) was correct. The nature of binding for large inhibitors in this class is mainly on the protein surface, where far fewer physical constraints limit potential docking solutions.
\(\hbox {PPAR}\gamma \)
\(\hbox {PPAR}\gamma \) was an outlier in terms of performance under all circumstances: with and without knowledge-based guidance and using any number of top-scoring pose families. For the other nine targets, using the knowledge-guided protocol, the success rate at the 2.0 Å threshold was \(0.62\pm 0.08\) for the top-scoring pose family. For the top two, it was \(0.75\pm 0.06\), and for the top five, it was \(0.85\pm 0.06\). Performance for \(\hbox {PPAR}\gamma \) was, respectively, 31, 34, and 39 % under the same docking protocol, representing decreases in performance of \(4\sigma \) or greater.
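The size of this gap can be checked with simple arithmetic. The sketch below assumes that \(\sigma \) refers to the standard deviation across the other nine targets and uses only the rounded values quoted above. The function name is illustrative.

```python
def sigmas_below(mean, sd, value):
    """Number of standard deviations by which `value` falls below `mean`."""
    return (mean - value) / sd


# (nine-target mean, nine-target SD, PPAR-gamma success rate), as quoted in the text.
reported = {
    "top 1": (0.62, 0.08, 0.31),
    "top 2": (0.75, 0.06, 0.34),
    "top 5": (0.85, 0.06, 0.39),
}

gaps = {label: sigmas_below(*values) for label, values in reported.items()}
```

With these rounded inputs, the gap for the top-scoring pose family comes out at about 3.9 standard deviations, and the gaps for the top two and top five are roughly 6.8 and 7.7, so the first figure is sensitive to rounding of the quoted values.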
The active site of \(\hbox {PPAR}\gamma \) was of moderate volume (1400 Å\(^3\)) and the ligands were of moderate flexibility (an average of \(8\pm 4\) rotatable bonds) for this set of targets. The reasons for the difficulty appear to stem from two primary drivers: (1) diversity in protein active site configurations when bound to the test ligands; and (2) the fraction of novel binding modes of test ligands compared with training ligands.
For \(\hbox {PPAR}\gamma \), protein binding site diversity was the highest among all proteins. We computed the maximal binding pocket similarities for each cognate pocket for each test ligand against the pockets for all ligands within the known early pool. We assessed the fraction of such similarities that fell below a threshold set based upon a global analysis of computations for all proteins (see Methods), calling those sites with lower similarity novel. For \(\hbox {PPAR}\gamma \), the fraction of novel protein active sites was 49 %. Interestingly, the two targets that benefited the most from making use of a protein ensemble (MAPK14 and BACE1) also had high novelty fractions (33 and 18 % respectively). Among the remaining targets, active site novelty fractions were all below 10 %, except for HIV-RT (22 %) whose limited volume appears to ameliorate that effect.
To assess novelty with respect to test ligand binding mode, we performed a similar computation using 3D similarity of the bound configurations of test ligands compared with those of the known early pool. For each test ligand, the maximum similarity to the knowns was computed. Novel binding modes accounted for fully 50 % of the ligands for \(\hbox {PPAR}\gamma \). This appears to explain the relative challenge of PTP1b as well, with 40 % of test ligands exhibiting novel binding modes. The remaining targets exhibited novel binding modes less than 10 % of the time, except for HIV-PR (28 %) which was also a relatively challenging target.
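The novelty computation used for both pockets and ligand binding modes follows the same pattern: take the maximum similarity of each test item to the known pool, then count how many maxima fall below a threshold. The sketch below illustrates that pattern only. The similarity values and the threshold of 0.5 are invented, since the actual similarity functions and threshold are defined in the Methods, and the function name is hypothetical.

```python
def novelty_fraction(similarity, threshold):
    """Fraction of test items whose best match in the known pool is below threshold.

    similarity[i][j] is the similarity of test item i to known item j.
    """
    novel = sum(1 for row in similarity if max(row) < threshold)
    return novel / len(similarity)


# Hypothetical similarities of four test items to two known items.
sims = [
    [0.90, 0.20],  # close to a known item
    [0.30, 0.40],  # novel
    [0.10, 0.05],  # novel
    [0.80, 0.70],  # close to a known item
]

fraction = novelty_fraction(sims, threshold=0.5)
```

Using the maximum rather than the mean similarity means a test item counts as familiar as soon as a single precedent exists, which matches the way prior knowledge is exploited in the re-ranking step.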
To put this issue of binding-mode novelty in perspective, recall Fig. 3. The marked locations “2” and “3” correspond to those marked in Fig. 3: location 2 is the canonical binding location for acids, and location 3 is helix 12, the canonical helix around which many ligands are known to bind [42]. Location 2 is surrounded by four donor protons (two from histidine residues, one from serine, and one from tyrosine). Figure 17 shows 11 canonical ligands (tan) from within the early pool of 21 complexes that contained carboxylates in favorable contact with this part of the protein. Also shown are the 9 worst failures (cyan, based on best RMSD among the top ten pose families). All of the latter placed carboxylates in a completely different place than that seen in the canonical binding mode. In addition, a critical arginine residue moves several Angstroms in order to complement the binding mode seen for these difficult test ligands.
The extreme difficulty of this target was well documented in a structural sense by Itoh et al. [42], where hydroxyoctadecadienoic acid variants were shown to be capable of binding in three divergent modes to \(\hbox {PPAR}\gamma \). One mode was in the canonical position, another in the alternative mode that was common among the docking failures, and a third in which a second ligand could bind at the same time as one binding in the canonical mode (where there was also a critical contact made between the ligands). The \(\hbox {PPAR}\gamma \) case represents a true limitation for the methods described here. Binding modes that are completely unlike those seen earlier will not be recovered through use of knowledge from previous ligands, either in terms of substructural matching for configurational search or similarity-based pose re-ranking. Further, careful automated choice of protein pocket variants from among a set that does not contain a crucial rearrangement cannot help to identify novel binding modes as are seen with this target.
Effect of protein variant selection
As seen in Fig. 10, selection of protein variants matters in all cases, at least to the extent that a poor choice of a single variant could lead to significantly worse results than the choice of an optimal variant. This was also true for CA-II (whose plot is not shown), especially at more stringent levels of RMS deviation. In all cases, performance of the ensemble (red curves) was much better than the worst single variant, and in no case was the ensemble worse than the best of the single variants. In two cases, MAPK14 and BACE1, the ensemble was more than 25 points better than that seen with the best single protein variant. This was, in part, explained by the analysis of protein pocket novelty, with these two cases showing a relatively high fraction of novel pocket variations among the test complexes.
In order to assess the degree to which the protein variant selection strategy was successful, two additional methods for selecting five variants were tried for MAPK14 and BACE1. The first was a maximally diverse selection strategy. Recall that the strategy described earlier made use of K-means clustering (with K of 5) along with selection of a particular variant for each cluster whose average similarity to the other members was highest. For the maximally diverse choice, the set of 5 proteins that were maximally dissimilar to one another were chosen (using a greedy algorithm beginning with the single protein pocket most dissimilar from all others). The second was a purely random strategy, in which five different sets of five were randomly chosen.
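The three selection strategies can be sketched from a pairwise pocket-similarity matrix. This is an illustration under stated assumptions, not the code used in the study: cluster labels are taken as given (the K-means step itself is omitted), the greedy rule after the seed is assumed to be a max-min criterion, which the text does not specify, and all function names and matrix values are hypothetical.

```python
import random


def cluster_representatives(sim, labels):
    """Per cluster, pick the member with the highest mean similarity to the others."""
    reps = []
    for cluster in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cluster]

        def mean_sim(i):
            others = [sim[i][j] for j in members if j != i]
            return sum(others) / len(others) if others else 0.0

        reps.append(max(members, key=mean_sim))
    return reps


def greedy_diverse(sim, k):
    """Greedy selection of k mutually dissimilar items."""
    n = len(sim)
    # Seed with the item that is least similar, in total, to all others.
    seed = min(range(n), key=lambda i: sum(sim[i][j] for j in range(n) if j != i))
    chosen = [seed]
    while len(chosen) < k:
        rest = [i for i in range(n) if i not in chosen]
        # Assumed rule: add the item whose nearest chosen neighbor is least similar.
        chosen.append(min(rest, key=lambda i: max(sim[i][j] for j in chosen)))
    return chosen


def random_sets(n, k, replicates, seed=0):
    """Draw `replicates` random subsets of size k from n items."""
    rng = random.Random(seed)
    return [rng.sample(range(n), k) for _ in range(replicates)]


# Hypothetical similarity matrix: pockets 0-2 are alike, 3 and 4 are outliers.
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.1, 0.2],
    [0.8, 0.7, 1.0, 0.1, 0.2],
    [0.1, 0.1, 0.1, 1.0, 0.3],
    [0.2, 0.2, 0.2, 0.3, 1.0],
]
reps = cluster_representatives(sim, labels=[0, 0, 0, 1, 1])
diverse = greedy_diverse(sim, k=3)
```

In this toy matrix the diverse strategy picks both outliers before it returns to the well-populated cluster, whereas the cluster-based strategy allots one representative per cluster and chooses the most central member.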
Performance was assessed using the protocol without any knowledge guidance when considering all pose families generated from docking. For the K-means strategy used throughout the paper, the success rates were 95 % for MAPK14 and 87 % for BACE1. The “diverse” strategy yielded success rates of 66 and 82 %, respectively. For MAPK14, the drop of 29 points was highly statistically significant (p \(< 10^{-6}\) by exact binomial). For BACE1, the drop of 5 points was just significant at the p = 0.05 level.
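For readers unfamiliar with the exact binomial test, the sketch below shows one way such a comparison could be set up. It is an assumption-laden illustration: the null hypothesis (that the alternative strategy shares the K-means success rate), the one-sided form, and the sample size of 100 ligands are all hypothetical, because the test configuration and per-target ligand counts are not restated in this section.

```python
from math import comb


def binom_lower_tail(k, n, p0):
    """Exact P(X <= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))


# Hypothetical setup: 100 test ligands, a reference success rate of 95 %,
# and 66 successes observed under the alternative selection strategy.
n_ligands = 100
reference_rate = 0.95
observed_successes = 66

p_value = binom_lower_tail(observed_successes, n_ligands, reference_rate)
```

Because the tail is summed term by term from the binomial probability mass function, no normal approximation is involved, which matters when the reference rate is close to 1 as it is here.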
Using random selections, the average performance for MAPK14 was \(82\,\%\pm 14\). Two of the five random selections performed as well as the K-means strategy (success rates of 96 and 95 %), but three were significantly worse. For BACE1, the average performance of the random selections was \(85\,\%\pm 2\), matching that of the K-means approach.
The K-means approach was at least as good as the best of any alternative selection method, and it was clearly superior to the “diverse” approach. The latter essentially identifies outliers in protein pocket conformational space, which probably do not represent the bulk of relevant configurations for predicting the binding modes of new ligands. Perhaps surprisingly, choosing random sets of 5 proteins each was a better approach than choosing maximally different variants. In fact, for MAPK14, a fortunate choice of variants was as good as the careful K-means approach in 2/5 replications. For BACE1, the random approach never performed better than the K-means approach, but it did not perform worse either. Overall, the evidence suggests that making use of the K-means approach will result in the best performance, but testing such strategies on additional targets appears warranted.