Each of the data sets studied in the course of this work addressed different aspects of ligand-based binding affinity prediction. Results will be presented first on two sets of particular importance in the QSAR field [1, 31], then on multiple series that formed a validation benchmark for a recent implementation of a free-energy perturbation approach [18], then on four sets from a challenging QSAR benchmark [44]. Last, a series of muscarinic ligands is modeled, where the analysis is focused on the relevance of QuanSA to lead optimization [21] and on the concordance of the induced model to the recently solved crystal structure [45]. QuanSA models were generated purely from SAR data for SHBG, 5-HT1aR, the FEP data set, BZR, COX2, and the muscarinic receptor. To demonstrate the feature of using protein structure information during initial ligand alignment, the initial poses for AchE and thrombin were derived from docking.
Steroid binding globulin and 5-HT1aR examples
One very common approach to model quality assessment in QSAR has been cross-validation (often leave-one-out). With the QuanSA approach, cross-validation is offered as one of two means to adjudicate between alternative models, typically models derived from different initial alignments. The other model selection approach uses the model’s quantitative parsimony along with the rank correlation and mean absolute error of the training ligands after having been re-fit to the pocket-field. For the CBG and SHBG cases, the latter approach identified the top-ranked alignments produced by the initialization procedure as being the preferred models. Figure 8 shows the results of a leave-one-out cross-validation for CBG and SHBG using the top-scoring initial alignment. In both cases, Pearson’s \(r^2\) was very high (0.77 and 0.81, respectively). For context, the original CoMFA study, which introduced this benchmark, reported cross-validated \(r^2\) values of 0.66 and 0.55, respectively [46], and a very early Compass study reported 0.89 and 0.88 [32]. It is important to understand that the 95% confidence intervals for the QuanSA \(r^2\) results, owing to the small set of 21 molecules, were 0.44–0.92 (CBG) and 0.66–0.92 (SHBG). So, none of these results is likely to represent a substantial difference in predictive quality.
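The leave-one-out procedure can be sketched as follows; the predictor here is a deliberately simple nearest-neighbor stand-in (not the QuanSA pocket-field refit), so only the cross-validation scaffolding is meaningful:

```python
import numpy as np

def loo_r2(X, y, fit_predict):
    """Leave-one-out cross-validation: hold out each molecule in turn,
    fit on the rest, and collect the held-out prediction."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        preds[i] = fit_predict(X[mask], y[mask], X[i])
    # Pearson r^2 between held-out predictions and experiment
    r = np.corrcoef(preds, y)[0, 1]
    return r ** 2

# Toy stand-in predictor: activity of the nearest training neighbor.
def nn_predict(X_train, y_train, x):
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(d)]

rng = np.random.default_rng(0)
X = rng.normal(size=(21, 3))             # 21 "molecules", as in the CBG/SHBG sets
y = X[:, 0] + 0.1 * rng.normal(size=21)  # synthetic pK_d values
print(round(loo_r2(X, y, nn_predict), 2))
```

With only 21 molecules, the confidence interval on any such \(r^2\) is wide, which is the point made in the text.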
Of course, cross-validation results are always less interesting than results on blind predictions (for any application of any machine-learning method). This is especially true for methods to be applied in lead optimization, where we expect to identify or design molecules that are different enough in character from training molecules to have different population characteristics [47]. Figure 9 depicts performance of the two QuanSA models on blind test sets consisting of 10 molecules for CBG (from the original CoMFA study) and 61 molecules from a much more recent study from Cherkasov et al. [29].
Based on prior work with QMOD, using the measures of prediction confidence described above, we adopted thresholds of \(\le\) 0.85 for novelty, \(\ge\) 0.35 for confidence, and \(\le\) 0.95 for exclusion violations [27]. The conjunction of novelty and exclusion criteria forms the broadest set of predictions that should generally be considered as likely to be accurate (termed “in-model”), and the conjunction of confidence and exclusion criteria usually produces a narrower set (termed “high-confidence”) with more accurate predictions. Note, however, that for predictions near the highest range of the training data (or higher), error magnitudes are typically quite low, independent of the quality measurements.
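The two prediction tiers reduce to simple threshold conjunctions; the sketch below encodes the stated cutoffs, with function and variable names invented for illustration (the actual quality measures are computed internally by QuanSA):

```python
# Thresholds adopted from prior QMOD work (see text).
NOVELTY_MAX = 0.85      # novelty value must not exceed this
CONFIDENCE_MIN = 0.35   # confidence must be at least this
EXCLUSION_MAX = 0.95    # exclusion violations must not exceed this

def in_model(novelty, confidence, exclusion):
    """Broad tier: predictions generally likely to be accurate."""
    return novelty <= NOVELTY_MAX and exclusion <= EXCLUSION_MAX

def high_confidence(novelty, confidence, exclusion):
    """Narrow tier: typically more accurate than the in-model set."""
    return confidence >= CONFIDENCE_MIN and exclusion <= EXCLUSION_MAX

print(in_model(0.80, 0.10, 0.90))        # meets novelty and exclusion criteria
print(high_confidence(0.80, 0.10, 0.90)) # fails: confidence below 0.35
```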
The highlighted points (filled circles) in the plot correspond to the set of high-confidence predictions. For CBG, the mean error for this set (just five molecules) was 0.2 log units, Kendall’s Tau (a non-parametric rank-correlation measure) was 1.0 (p < 0.05), and \(r^2\) was 0.82. The ten-molecule CBG blind prediction set has been the subject of innumerable analyses, including particular discussion of an infamous fluoro-substituted outlier, where an H to F substitution produces a 2.0 log unit decrease in pK\(_d\) (molecules 30 and 31 of the blind set). We believe that it should be incumbent on the prediction method itself to provide measures that allow for unbiased identification of the subset of molecules on which predictions should be believed. Further, prediction sets with just a handful of molecules are unhelpful in assessing model quality.
For the SHBG case, the blind test included 61 molecules, of which 27 were high-confidence (filled green circles in Fig. 9). For this set, the mean error was 0.5 log units, Kendall’s Tau was 0.74 (p < 0.00001), and \(r^2\) was 0.67. The full set of 61 was sufficiently large, and related enough to the training set (the mean raw similarity value was 0.90), that it is not unreasonable to consider prediction statistics overall: mean error 1.0 log units, Tau 0.60 (p < 0.00001), and \(r^2\) 0.47. While quantitative performance clearly was diminished, the results were still highly statistically significant.
Figure 9 shows two blind predictions from the set of 61 diverse test ligands. The first (ChEMBL261407) met the confidence criterion but neither the exclusion nor novelty criteria. However, it was predicted to be among the highest activity values from the training set. The second (ChEMBL265940) met both the confidence and exclusion criteria despite being a substituted benzofuran rather than a steroid.
Figure 10 shows the results for constructing a QuanSA pocket-field using 5-HT1a ligands. This set of 20 training ligands and 35 blind testing ligands was originally used in a validation exercise for Compass [33]. The training set is exemplified by molecules m4a and m8b, and the test set consisted of variations in the angular and linear tricyclic structures (including changes to the ring fusion chirality) as well as more substantially different examples such as molecules m45 and m46. The benchmark is particularly challenging because the primary driver of potency variation is the detailed shape of hydrophobic parts of the ligands, with small changes in chirality or changes of a few atoms resulting in significant effects on pK\(_d\). For example, the enantiomer of m4a has a pK\(_d\) reduced by 1.5 log units, and the enantiomer of m8b has pK\(_d\) < 6.0.
As seen in the plot, predictive performance was quite accurate, with most molecules (whether nominally in-model or not) being predicted within 1 kcal/mol and just a handful at or slightly beyond the 2 kcal/mol error level. For the 22 in-model molecules, Tau was 0.74 (95% CI 0.52–0.91, \(p < 0.00001\)) and MAE was 0.64 log units. For the full set of 35 molecules, Tau was 0.58 (0.34–0.76, \(p < 0.00001\)) and MAE was 0.66. QMOD predictions for the full set of 35 resulted in a Tau of 0.34 (\(p < 0.01\)) and MAE of 0.8 [23]. Compass predictions produced a Tau of 0.36 (\(p < 0.01\)) and MAE of 0.8 [33]. The QuanSA predictions were clearly superior to those from QMOD and appear to be significantly better than those from Compass.
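Errors above are quoted both in pK\(_d\) log units and in kcal/mol; the two are related by the fixed factor \(RT\ln 10\), roughly 1.36 kcal/mol at room temperature. A quick check, assuming \(T = 298.15\) K:

```python
import math

R = 1.98720425864083e-3  # gas constant, kcal/(mol*K)
T = 298.15               # assumed room temperature, K

def log_units_to_kcal(d_pkd, temp=T):
    """Convert a pK_d difference (log units) to free energy (kcal/mol)."""
    return d_pkd * R * temp * math.log(10)

print(round(log_units_to_kcal(1.0), 2))  # 1.36
```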
The particular predictions shown in Fig. 10 on the novel scaffolds are not only quite accurate, but they appear to be correct for the right reasons. The pose optimization procedure placed the protonated amines in close correspondence to the training ligands, fit the alkyl nitrogen substituents into the pocket established by the training examples, and allowed the carbonyl of the urea to mimic the acceptor interaction made by hydroxyl substituents in various training molecules.
Table 2 Results on the FEP test set of 199 molecules under two prediction regimes for FEP and QuanSA (units for MAE are pK\(_d\))
The steroid and 5-HT1a benchmarks do not represent a comprehensive validation for any 3D-QSAR method. However, they represent a necessary condition. If a method cannot yield predictive models in these cases, where ligands are relatively small and rigid, and where molecular alignments are not enormously difficult, it is unlikely that more challenging cases of pharmaceutical relevance will prove to be tractable. Here, using fully automatic computational procedures, highly predictive models were produced with no requirement for manual ligand alignment.
Physics versus machine-learning: comparison to FEP
As mentioned in the Introduction, there has been a resurgence in interest in practically applicable physics-based estimation of binding free energy, exemplified by a recent study [18], where data presented for 199 compounds covering eight pharmaceutically relevant targets enables a comparison here. The FEP approach typically requires a single ligand with known free-energy of binding along with a corresponding experimental structure of the ligand bound to the protein of interest. From this reference ligand, a set of molecular transformations can be made and arranged into a connected graph such that connected pairs of molecules have relatively high similarity. For each such connected pair, a calculation of the \({\Delta }{\Delta }G_{ij}\) is carried out. To obtain a prediction for a particular molecule k, one begins from a molecule with known \({\Delta }G^{exp}\) and traverses a path of connected molecules, each time adding the calculated difference in energy. Cycle-closure constraints are enforced such that traversal of different paths to a particular molecule will yield the same value of \({\Delta }G^{pred}\).
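The path-summation step can be sketched as a breadth-first traversal of the perturbation graph; the toy graph and \({\Delta }{\Delta }G\) values below are invented purely for illustration (chosen to satisfy cycle closure) and carry no physical meaning:

```python
from collections import deque

def predict_dg(graph, ddg, ref, dg_ref):
    """Breadth-first traversal from the reference molecule: each molecule's
    predicted dG is the reference dG plus the sum of the calculated ddG
    values along the connecting path.
    graph: adjacency dict; ddg[(i, j)] = dG_j - dG_i (antisymmetric)."""
    pred = {ref: dg_ref}
    queue = deque([ref])
    while queue:
        i = queue.popleft()
        for j in graph[i]:
            if j not in pred:
                pred[j] = pred[i] + ddg[(i, j)]
                queue.append(j)
    return pred

# Toy 4-molecule perturbation graph; edge values are consistent around
# the A-B-D-C cycle, mimicking enforced cycle closure.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
ddg = {("A", "B"): -0.5, ("B", "A"): 0.5,
       ("A", "C"): 0.3,  ("C", "A"): -0.3,
       ("B", "D"): 0.8,  ("D", "B"): -0.8,
       ("C", "D"): 0.0,  ("D", "C"): 0.0}
pred = predict_dg(graph, ddg, ref="A", dg_ref=-10.0)
print(pred)
```

Because the edge values close consistently around the cycle, molecule D receives the same prediction whether reached via B or via C.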
Because of the construction of the connected graph of molecular mutations, one may initiate the calculation of any molecule’s \({\Delta }G^{pred}\) from any molecule whose experimental free-energy of binding is known. Figure 11 (left) shows the reported results for the FEP calculation in pK\(_d\) units (top) for all eight targets. These results are for the case where the \({\Delta }G^{pred}\) values are computed from the single reference compound for each of the eight targets (a realistic application scenario).
The middle plot shows the results for QuanSA, where, for each target, five models were constructed, each with a non-overlapping 20% of the molecules reserved for blind scoring. No crystal structure information was used in any fashion. The pocket-field models were induced using information only from the ligand structures and their corresponding activities. The collection of the five sets of blind predictions was used to assess performance. The FEP and QuanSA methods are essentially orthogonal in strategy, and their prediction errors are uncorrelated (data not shown). Consequently, the combination of the two, calculated by averaging each molecule’s two \({\Delta }G^{pred}\) values, results in an improved overall set of predictions (rightmost plot).
The three molecular pairs are typical of the molecular changes for which FEP \({\Delta }{\Delta }G\) values are computed and also give an idea of the variation within each target data set. The prediction conditions in Fig. 11 represent realistic scenarios for each method. For FEP, a connected set of alternatives are predicted from a compound whose bound structure and free energy is known. For QuanSA, information from a collection of compounds is used to predict on a set a quarter as large as that used for training.
Both methods also have best-case scenarios, each of which is worth analyzing. In the original paper, the reported FEP \({\Delta }G^{pred}\) values were adjusted from what is shown in Fig. 11. Rather than calculating the nominal \({\Delta }G^{pred}\) values based only on the known reference ligand’s \({\Delta }G^{exp}\), the predicted free-energy values were re-centered such that their mean would match the mean \({\Delta }G^{exp}\) (the procedure is described in detail in the Supplemental Information for [18]). From a machine-learning perspective, this is similar in spirit to a leave-one-out validation experiment. For FEP, a cross-validation without any information contamination would have required making use of \((N-1)\) true \({\Delta }G^{exp}\) values to predict each “hold-out.” However, the mean estimated from \((N-1)\) ligands rather than N would not differ by much. The information leak from using \({\Delta }G^{exp}_{i}\) for a molecule in this fashion to calculate \({\Delta }G^{pred}_{i}\) is negligible. Note also that this correction has no effect at all on within-target correlation statistics.
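The re-centering adjustment amounts to a uniform mean shift, which can change MAE but cannot change rank order within a target; a minimal sketch with made-up values:

```python
import numpy as np

def recenter(dg_pred, dg_exp):
    """Shift predictions so their mean matches the experimental mean.
    Affects absolute errors but leaves within-target rank order (and
    hence correlation statistics) unchanged."""
    return dg_pred + (np.mean(dg_exp) - np.mean(dg_pred))

# Illustrative values only (kcal/mol).
dg_pred = np.array([-9.0, -8.0, -7.5, -10.0])
dg_exp = np.array([-9.5, -8.4, -8.1, -10.4])
shifted = recenter(dg_pred, dg_exp)
print(np.mean(shifted), np.mean(dg_exp))  # means now agree
```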
For QuanSA, the best-case calculation is a leave-one-out cross-validation that makes use of the reference ligand’s bound structure to guide the induced molecular alignments. By leaving a single compound out, maximal information is used to derive the pocket-field, and ensuring that the ligand conformations are reasonably close to the correct absolute configuration provides a minor bias in favor of physically correct models (though these ligands and variations are simple enough that this makes little difference).
Table 2 shows summary statistics for rank correlation and MAE for both the best-case and realistic application scenarios for both methods. Here, Kendall’s Tau was used, with resampling-based calculation of confidence intervals (a value of 0.2 pK\(_d\) units defined tied experimental activity values). For FEP, the prediction re-centering procedure improved the overall MAE by just over 25%, but in individual cases, the reduction in MAE was nearly 50%. For QuanSA, the MAE values are not substantially different, either per-target or overall. However, while the per-target correlation values have wide confidence intervals due to small sample sizes, the overall rank correlation performance begins to show an edge for the best-case approach. Due to small sample sizes, it is difficult to make a strong case, but it appears that the FEP approach is likely better in the thrombin case and that the QuanSA approach is better in the CDK2 case.
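A minimal sketch of a tie-thresholded Kendall’s Tau and a resampling confidence interval is shown below; the 0.2 pK\(_d\) tie window follows the text, while the percentile bootstrap is one ordinary choice of resampling scheme and may differ in detail from what was actually used:

```python
import numpy as np

def kendall_tau(x, y, tie=0.2):
    """Kendall's Tau with experimental values (x) within `tie` units
    treated as tied; tied pairs contribute zero."""
    n = len(x)
    s, pairs = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            pairs += 1
            if abs(dx) <= tie:      # tied experimental pair
                continue
            s += np.sign(dx) * np.sign(dy)
    return s / pairs

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Tau by resampling molecules."""
    rng = np.random.default_rng(seed)
    n = len(x)
    taus = [kendall_tau(x[idx], y[idx])
            for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    lo, hi = np.quantile(taus, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Illustrative experimental and predicted pK_d values.
exp = np.array([5.0, 6.1, 7.2, 8.0, 9.1])
pred = np.array([5.3, 6.0, 7.5, 7.9, 9.4])
print(kendall_tau(exp, pred))  # 1.0: all non-tied pairs concordant
```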
Table 3 Results from combining the FEP (uncorrected reference compound \(\Delta {G}\)) predictions with the QuanSA 80/20 pure ligand-based predictions
The striking aspect of this comparison is not that one method is clearly better or worse than the other. Rather, it is that one method relies on direct simulation-based energetic modeling of protein-ligand interactions, that another infers the protein-ligand interaction energy landscape from structure-activity data, and that both methods produce very similar results across eight diverse targets. Further, the cancellation effect of orthogonal errors is substantial. Table 3 shows the rank correlation and MAE data derived by combining the realistic-scenario FEP and QuanSA predictions. In all eight cases, the combined result is either better than, or extremely close to, the best result from either method alone.
Using a single standard Intel i7 computing core, the QuanSA scoring process, in its default thorough search mode, typically required 20–40 seconds for molecules of the complexity represented by the FEP test set. By contrast, the FEP approach required approximately six hours per individual perturbation calculation using eight NVIDIA GTX-780 GPUs. As will be shown in what follows, QuanSA is suitable for large-scale calculations of thousands of synthetic alternatives, and it can be used to identify potent molecules with novel scaffolds. Where feasible, both computationally and in terms of theoretical applicability, methods such as FEP could be profitably employed to provide an orthogonal estimate of binding affinity.
More challenging predictions and application to ChEMBL data
The FEP set includes eight diverse protein targets, but the ligand diversity is limited. As seen in Fig. 11, relatively small modifications at discrete positions on a single scaffold represented the typical range of structural variation. The 3D-QSAR benchmark reported by Sutherland et al. [44] was designed to present a significant extrapolative challenge for methods, with designed test sets for each of eight targets. For each target, approximately one-third of the molecules were selected by optimization using a maximum dissimilarity algorithm and were assigned to the test set, with the remaining compounds assigned to the training set. For four of the eight targets (BZR, COX2, AchE, and thrombin), substantial ChEMBL data was available, and this was used to further test extrapolative ability for the QMOD approach [27]. Here, we report results for the QuanSA approach on these four targets using both the original blind test data as well as the ChEMBL data.
Table 4 Test results for the complete Sutherland benchmark
Table 4 summarizes results for these four targets on the Sutherland test set, with results for QuanSA included on the in-model subset of compounds as well as the full test set. In all four cases, QuanSA yielded statistically significant predictions for both the in-model subset and the full blind test (\(p < 0.001\) for BZR, COX2, and AchE, and \(p < 0.01\) for thrombin). The mean absolute error was in the range seen with FEP and QuanSA on the FEP Set under realistic prediction scenarios (with QuanSA error values being slightly lower and FEP error values being slightly higher). However, the Kendall’s Tau values were slightly lower for these four targets from the Sutherland Set. This primarily reflects the greater jumps from knowns to unknowns in the data underlying Table 4.
Ideally, one could directly compare Tau and MAE values from different methods on various data sets. The problem with such comparisons is that some data sets are exceptionally challenging relative to others. The relative challenge of each data set for prediction is quantified in Fig. 12. The median nearest-neighbor similarity for test molecules within the original Sutherland benchmark (orange line) was 0.92, making it substantially more challenging than the FEP Set (yellow line). For the 80/20 blind QuanSA validation on the FEP Set, 80% of the test molecules had nearest-neighbor 3D similarities to training molecules of 0.93 or greater. For the Sutherland Set, fewer than 40% of the test molecules had nearest-neighbor similarities of that magnitude. Just 1% of test molecules within the FEP Set had 3D nearest-neighbors with 0.85 3D similarity or less, but the corresponding value for the Sutherland Set was 13%.
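The challenge metric underlying Fig. 12 can be computed from a test-by-training similarity matrix; the small matrix below is invented for illustration (it happens to reproduce a 0.92 median, matching the Sutherland figure quoted above):

```python
import numpy as np

def nn_similarity_profile(sim):
    """For each test molecule (row), take its best similarity to any
    training molecule (column); the distribution of these values
    quantifies how far the test set extrapolates from the training set."""
    nn = sim.max(axis=1)
    return np.median(nn), nn

# Illustrative 4 (test) x 3 (train) 3D similarity matrix.
sim = np.array([[0.95, 0.60, 0.40],
                [0.88, 0.91, 0.30],
                [0.70, 0.85, 0.82],
                [0.50, 0.45, 0.93]])
med, nn = nn_similarity_profile(sim)
print(med, nn)
```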
Table 5 Results for QuanSA models on diverse ChEMBL compounds, with N being the total number of tested compounds, “N i-m” being the number of in-model predictions, and the statistical performance assessed by Kendall’s Tau and mean absolute error
Table 5 shows summary statistics for the four Sutherland targets on ChEMBL data (and for the muscarinic acetylcholine receptor, discussed in the next section). The total number of test compounds ranged from 1000 to 3000, and the coverage with respect to in-model predictions varied considerably. For the thrombin case, the training set was so small (59 compounds) and so narrow (all inhibitors were meta-substituted benzamidines with minor variations at two positions) that zero ChEMBL molecules met the criteria for being in-model. This case will not be discussed further.
For the other three cases, coverage ranged from roughly 10% for BZR and AchE to 25% for COX2, which had the largest training set (188 molecules, more than twice as many as the other cases). In all three cases, Tau was very similar (roughly 0.25), corresponding to very small p-values (\(< 0.0001\)). MAE was roughly 1 log unit for BZR and COX2 (less than 1.5 kcal/mol on average) and was higher for AchE (just over 2 kcal/mol).
Referring to Fig. 12, for ChEMBL predictions, the most similar in-model prediction set was for COX2 (blue line), with BZR being the most challenging (purple) and AchE falling in the middle (green). The AchE case presented additional difficulty due to the generally large and flexible ligands, which increase uncertainty with respect to pose and ligand energetics.
Figure 13 illustrates two in-model ChEMBL predictions with their nearest training neighbors, along with one prediction whose novelty value exceeded the threshold of 0.85. The leftmost molecule was very accurately predicted, showing the introduction of a methyl-oxadiazole in place of the alkyl-ester that was well represented in the training data. From a medicinal chemistry point of view, this is an interesting leap, though likely not terribly surprising for those experienced with common bioisosteric substitutions [48]. For reference, the nearest-neighbor similarity in this case was 0.92, matching the typical test molecule from within the Sutherland Set.
The canonical BZR ligands used to construct the QuanSA model were known from the early 1960s (e.g. diazepam [49]), the early 1970s (e.g. alprazolam [50]), and the early 1980s (e.g. compounds such as RO 15-3505 [51]). The middle molecule, representative of several well-predicted pyrazolo-pyridine esters, is a genuinely significant scaffold leap. The scaffold was first disclosed in 1989 [52] and was designed on the basis of a similar triazolopyridazine known from about a decade earlier. The rightmost molecule, a carboline derivative that is out-of-model, is representative of a class that was known contemporaneously with some of the training molecules, but none were included in the training set.
These QuanSA predictions are significant for four reasons. First, the GABA\(_A\) receptor is a complex hetero-multimeric ligand-gated ion channel, part of a large group of important pharmaceutical targets that are very challenging for biophysical characterization down to atomic resolution. The pocket-field was constructed automatically, using only ligand structure and activity information. Second, its prediction quality rivaled that seen for the FEP Set, where protein structures were known and where structural jumps were smaller. Third, because the method is relatively fast, application to thousands of candidates (using modest computer hardware) is possible, allowing for evaluation of a large chemical design space, with built-in calculations for the applicable model prediction domain. Fourth, the method produced predictions of both pK\(_d\) and bound pose, with the latter offering support for predictions that otherwise might be difficult to justify for experimental follow-up.
Table 6 True-positive recovery rates and two estimates of false-positive rates for QuanSA models on ChEMBL and decoy molecules
Figure 14 illustrates the AchE QuanSA pocket-field. The AchE enzyme pocket is a 20Å gorge, with key interactions being made by the labeled residues. In particular, Trp286, Tyr72, Asp74, and Tyr341 of the peripheral anionic site play a key role in the electrostatic attraction to cationic ligands. Tyr337 and Trp86 near the bottom of the gorge interact with the quaternary ammonium group of the substrate acetylcholine. The transparent surface shows the induced pocket-field, and the amine of training molecule 1-19 can be seen to have a strong interaction in the direction of the face of Trp-286 (long blue stick). Lower down, the dual carbonyl atoms of the phthalimide linker make favorable interactions (red sticks) with both sides of the pocket (blue shaded area).
ChEMBL-95020 presents a convincing superimposition of the isoxazole-containing tricycle [53] onto the phthalimide of the training molecule. In a 2D sense, this prediction seems more surprising than the nominal 3D similarity value would suggest, but the actual disposition of chemical functionality and its mimicry of known ligands makes this not only an in-model prediction but a confident one that is also accurate. The right-hand prediction [54] in Fig. 14 is also in-model, but it is of lower confidence and is more typical of the level of prediction error in the AchE ChEMBL set. Rather than making the same polar interactions as the phthalimide, it fills the lower-left hydrophobic cavity of the pocket. However, the areas of correspondence between the novel ligand and the pocket-field, in terms of both complementarity and relationship to other AchE ligands, provide confidence in the prediction.
Many more examples of such extrapolative predictions exist, but rather than enumerate them, Table 6 provides summary statistics on true-positive recovery rates and two different estimates of false-positive rates. Here, we have defined true positives as those molecules for which pK\(_i \ge 7.5\). For those molecules with measured activity, false positives were defined as those molecules for which pK\(_i \le 6.5\). This is a strict test of the ability to distinguish relatively active molecules from those that are slightly less active, and it is relevant in contemplating the effectiveness of a prediction method for synthetic prioritization. We have also made use of a decoy set of 1000 drug/lead-like ZINC molecules, with the entire set presumptively defined as false positives. The false-positive rate on the decoy set reflects the value of a predictive method when evaluating very large numbers of candidates, possibly from a virtual screening library or a computational de novo design procedure.
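Under these definitions, the rates and the active-to-inactive enrichment reduce to simple counting; a sketch with invented predictions (the `in_model` flags and pK\(_i\) values below are illustrative only):

```python
def rates(preds, truths, active=7.5, inactive=6.5):
    """preds: iterable of (predicted_pki, in_model) pairs; truths: measured
    pK_i values. A molecule is called positive when it is in-model and
    predicted at or above the active threshold."""
    called = [t for (p, im), t in zip(preds, truths) if im and p >= active]
    n_act = sum(t >= active for t in truths)      # known actives
    n_inact = sum(t <= inactive for t in truths)  # known inactives
    tp = sum(t >= active for t in called)
    fp = sum(t <= inactive for t in called)
    tp_rate = tp / n_act if n_act else 0.0
    fp_rate = fp / n_inact if n_inact else 0.0
    enrichment = tp_rate / fp_rate if fp_rate else float("inf")
    return tp_rate, fp_rate, enrichment

# Five hypothetical molecules: (predicted pK_i, in-model flag) and measured pK_i.
preds = [(8.0, True), (7.9, True), (7.0, False), (8.1, True), (6.0, True)]
truths = [8.2, 6.0, 7.8, 7.6, 5.5]
print(rates(preds, truths))
```

With a decoy set, every called decoy counts as a false positive, which is how the screening-oriented rates in the text are derived.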
For COX2, where we have not shown specific examples, the coverage of the ChEMBL molecules was highest (about 25%), and the relative similarity of the in-model predictions to the training set was also high (see Fig. 12). This led to a true-positive rate from the full ChEMBL set of 26%, so a model constructed from fewer than 200 COX2 inhibitors (utilizing no protein structure) was able to identify a quarter of all of the potent COX2 inhibitors curated within ChEMBL, representing many years of discovery efforts. The enrichment rate was 8-fold (known actives to known inactives). From the 1000 ZINC decoys, not a single molecule was identified as an in-model predicted true positive, suggesting an enrichment rate in a screening scenario of well over 2000-fold. The situation for AchE was very similar, albeit with a lower true-positive rate of 5%. Again, however, the model was very specific, with a 5-fold TP/FP enrichment between known actives and known inactives and not a single false-positive decoy identified.
The BZR case showed true-positive identification between that of AchE and COX2 (12%), but the false positive rate on the known inactive molecules was higher, leading to an enrichment of 2-fold. Similarly, application of the pocket-field to the decoy set identified 11 decoys, suggesting a screening-oriented false-positive rate of 1% and an enrichment of just over 10-fold. It is not clear whether the BZR model is truly less accurate or less specific than the others, though this seems somewhat unlikely based on assessment of its quantitative predictive accuracy. It may be the case that the decoy set inadvertently contains a number of bona fide BZR ligands. The nearest-neighbor similarities of the 11 nominal false-positives ranged from 0.67–0.90, which is within the range observed for the in-model ChEMBL predictions, but this is not definitive. It may also be the case that the BZR binding site is truly significantly more promiscuous than the other binding sites, but we have no direct evidence of this.
Summary of comparative performance
Two sets of comparisons to other methods have been carried out. The first set was to other QSAR methods on benchmarks of wide use or of particular significance to related methods. The second set was to FEP, a sophisticated physics-based simulation approach.
The QSAR comparisons included two steroid binding globulins [1, 32], the 5-HT1a receptor, and four targets from a more recent benchmark [4, 44]. Considering the limitations of assay quality and of data set size, QuanSA performed as well as or better than any reported method (including CoMFA and related methods and Compass) on the classic steroid cases, including accurate extrapolation to structurally novel molecules published 20 years after the introduction of the sets [29]. The 5-HT1a receptor case was the central validation example for the Compass method and was also used for validation of the initial QMOD approach [23, 33], and QuanSA performance was superior to that of either method. In the most challenging of the Sutherland cases (BZR and COX2), QuanSA results were substantially better than those seen with QMOD, whose performance exceeded that of CMF, CoMFA, and numerous 2D methods [4, 44].
With respect to the FEP comparison, individual chemical mutations across the entire set of eight targets resulted in errors of 0.9 kcal/mol for FEP [18]. For QuanSA, the comparable situation was one in which 80% of the data was used to induce a purely ligand-based model that then predicted the remaining 20% of the data, and the result was a mean error that was slightly lower (0.7 kcal/mol). More important, though, was the observation that the errors made by the two methods were uncorrelated, so that combined predictions achieved robust predictive performance across all eight targets.
Explanatory power and correspondence with future crystallography
A particularly interesting aspect of drug design for older targets that lacked biophysical characterization for many years is that chemical exploration was done agnostically, with the synthesized chemicals themselves being used to elucidate binding pockets. Consequently, exploration of positions on a particular scaffold was often driven by considerations of systematicity and synthetic feasibility.
The Compass method, an antecedent to QuanSA, was developed at Arris Pharmaceutical and was refined during a period of collaboration between Arris and Pharmacia in the early to middle 1990s. This was a period during which Pharmacia was also pursuing muscarinic antagonists, resulting ultimately in the approval of tolterodine by the US FDA in 1998 (see Fig. 15). At the time, the potent anti-muscarinic QNB was commonly used as a radioligand for displacement assays, and oxybutynin was a competing muscarinic antagonist. Of course, atropine as a medicinal compound had been known for many decades, and it was established as a potent muscarinic antagonist in modern pharmacological assays by the 1950s [55]. During this time period, two series of quinuclidinene anti-muscarinics were pursued by Pharmacia [42, 43].
Figure 15 depicts a QuanSA model-building and refinement process that made use of a total of 43 training molecules (the four named molecules above plus 39 from two quinuclidinene series, with various substitutions on the furan and benzofuran heterocycles and including variations such as thiophene and benzothiophene analogs). The process was completely automatic, using default parameters for ligand preparation, initial alignment generation, and model building. The model selection procedure identified the second-ranked alignment clique of the top five as being likely to be most predictive. The resulting pocket-field was used to score 1019 ChEMBL molecules, of which just 26 were nominally in-model. Given the poor coverage, these 26 molecules (actual pK\(_i\) ranging from 5.3–10.5) were used to refine the model.
Apart from statistics of quantitative accuracy, the value of a physical QSAR model derives in part from the degree to which it is making predictions for the right reasons. As with the steroid globulin case presented first, the decades have produced critical crystallographic information. In particular, PDB structure 3UON revealed the configuration of QNB bound to the human muscarinic acetylcholine receptor [45], the biological target that was being investigated heavily roughly 20 years earlier. Figure 16 shows the QNB-bound protein binding pocket along with another antagonist-bound variant (rat M3, PDB Code 4U15). The crystallographic data was aligned to the muscarinic pocket-field using the ligand QNB to minimize automorph-corrected RMSD.
The predicted conformation of QNB (magenta, top middle of Fig. 16) was just 0.5Å RMSD from the bound form (green). The orientation of the amine was slightly off, and the orientation of the hydroxyl was significantly rotated away from the clear preference of the bound form. When bound, the hydroxyl proton interacts with the carbonyl of Asn-386. The predicted orientation, turned toward QNB’s ester carbonyl, is driven by the internal energetic preferences of the ligand. This is similar to the problem of hydroxyl rotamers seen in the steroid globulin case (Fig. 4). Very often, within a collection of SAR data, there will be no information that allows for an unambiguous determination of the absolute conformations of the molecules. In such cases, a problem arises when a molecule to be predicted would resolve the ambiguity but the model has guessed incorrectly. Nonetheless, the 0.5Å deviation represents excellent agreement, given that QNB is a reasonably flexible molecule.
The two quinuclidinene series shown in Fig. 16 were being optimized for potency, with the benzofuran scaffold yielding no significant improvement over molecule b1 (pK\(_d = 7.2\)), despite extensive synthetic effort [42]. However, exploration of the furan scaffold yielded significant improvement (e.g. a49, with pK\(_d = 7.9\), essentially equipotent to tolterodine). Variations such as b29 (the analogous phenyl variant of the benzofuran scaffold to a49) represent particularly challenging data points to explain. Considering the 2D SAR, and assuming additivity, b29 should have been a significant improvement over b1 and a49, but it was essentially equipotent with b1.
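The additivity expectation for b29 can be made concrete with a back-of-the-envelope calculation. This sketch is illustrative, not from the paper: the potency of the unsubstituted furan parent is not given in the text, so it is assumed here to be roughly equipotent with b1.

```python
def additive_prediction(pk_parent, delta_pk_substituent):
    """Group-additivity estimate: parent potency plus the substituent
    contribution measured on a sister scaffold."""
    return pk_parent + delta_pk_substituent

pk_b1 = 7.2            # unsubstituted benzofuran (from the text)
pk_a49 = 7.9           # 3-phenyl furan a49 (from the text)
pk_furan_parent = 7.2  # ASSUMPTION: furan parent taken as roughly equipotent with b1

delta_phenyl = pk_a49 - pk_furan_parent          # ~+0.7 log units for the phenyl
pk_b29_additive = additive_prediction(pk_b1, delta_phenyl)
print(round(pk_b29_additive, 1))  # 7.9 expected under additivity; observed b29 is ~7.2
```

The roughly 0.7 log-unit shortfall relative to the additive expectation is the anomaly that the conflicting pose preferences discussed below account for.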
To understand the reasons for the puzzling SAR, and to see whether the QuanSA approach would help explain it, we compared the results of docking (green sticks) to predicted pocket-field poses (magenta for a49 and cyan for b1 and b29). For a49 (lower left), there was good agreement between the QuanSA-predicted binding mode and that from docking, though due to the rigidity of the molecule, the orientation of the quinuclidinene was non-optimal in the docking. The 3-phenyl substituent fills the back of the pocket, and the furan oxygen points forward, making a favorable pocket-field interaction. In the actual binding pocket, it appears to be a mobile tyrosine hydroxyl (clipped from the front of the protein pocket depiction) that interacts with the furan oxygen.
Molecule b1, the unsubstituted benzofuran (cyan), orients its oxygen downward in order to optimize its pocket-field interaction, in direct conflict with the required orientation for a49 (magenta). Docking of b1 (not shown) resulted in many possible fits, all of which were inconsistent with the docked pose of a49, but some of which were close to the preferred QuanSA pose. For molecule b29, the anomalously less active phenyl benzofuran, the QuanSA-predicted pose (magenta, lower right), shifted the aromatic ring down from the preferred position of a49. The docking of b29 shifted the ring downward further, but in qualitative agreement with the model’s prediction. Here, there are clearly conflicting preferences for the phenyl substituent, the furan/benzofuran oxygen, and the hydrophobic portion of the benzofuran. The QuanSA model was able to induce a model of the binding pocket that quantitatively explained the activities and qualitatively explained the reason for the non-additive behavior.
Non-additive SAR is actually quite common, as pointed out by Klebe’s group [22]. Using detailed thermodynamic and crystallographic data involving thrombin inhibitors, they explicitly showed that a particular functional group change (-H to -NH\(_2\)) exhibited context-dependent \({\Delta }{\Delta }G\) effects. In cases where addition of the amino group created a “conflict-of-interest” with respect to the preferred binding mode of the sister molecule, the relative improvement of binding affinity was decreased. Strict additivity of functional group contributions to binding should be expected only when the functional group modifications do not affect the conformation of the rest of the molecule or its overall alignment (and even in those cases may be affected by more subtle effects on free-energy such as differential enthalpy/entropy compensation).
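The \({\Delta }{\Delta }G\) scale of such effects follows directly from the standard relation \(\Delta G = -RT\ln K_a = -RT\ln(10)\,\mathrm{p}K\): one log unit of affinity near 298 K corresponds to about 1.36 kcal/mol. The conversion is sketched below; the numbers are textbook values, not from the paper.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def ddg_from_dpk(delta_pk, temperature=298.15):
    """Free-energy difference (kcal/mol) corresponding to a shift in pKd/pKi.

    From dG = -RT ln(Ka) = -RT ln(10) * pK, a difference of delta_pk log
    units maps to ddG = -RT ln(10) * delta_pk (negative = more favorable).
    """
    return -R_KCAL * temperature * math.log(10) * delta_pk

# One log unit of affinity at ~298 K:
print(round(ddg_from_dpk(1.0), 2))  # -1.36
# A 0.7 log-unit improvement (the b1-to-a49 scale of effect):
print(round(ddg_from_dpk(0.7), 2))  # -0.95
```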
The initial, unrefined pocket-field had a mean absolute error of 1.4 log units on the 26 in-model molecules, and the Kendall’s Tau of 0.19 fell short of a reasonable cutoff for statistical significance (\(p = 0.12\)). However, the refined model (constructed from a total of 69 muscarinic ligands) covered 291 (29%) of the 993 remaining ChEMBL muscarinic molecules. The Kendall’s Tau rank correlation was 0.34 (\(p < 10^{-6}\)) and MAE was 1.1 log units (see Table 5). The model was also extremely specific in terms of decoy rejection (see Table 6), and it was comparable in quality for rejection of known inactives to the COX2 model.
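For reference, the two accuracy statistics quoted above can be computed as follows. This is a minimal pure-Python sketch (tau-a, without tie correction; library implementations such as SciPy's typically report tau-b), and the predicted/experimental values shown are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a rank correlation (no tie correction), O(n^2)."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def mae(pred, actual):
    """Mean absolute error in log units."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

# Hypothetical predicted vs. experimental pKi values:
pred   = [6.5, 7.8, 8.4, 7.0, 9.1]
actual = [5.9, 7.2, 9.0, 7.5, 8.8]
print(round(kendall_tau(pred, actual), 2))  # 0.6
print(round(mae(pred, actual), 2))          # 0.52
```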
The overall true-positive identification rate was 9%. This appears modest (a random selection of 66/993 would produce approximately double that number), but in the context of the false positive rates, the predictive value is clear. The QuanSA-driven procedure identified twice as many actives as inactives from the ChEMBL set, despite the a priori probability being in the other direction. For the ZINC decoys, the random selection process would have yielded roughly 70 false positives, but the QuanSA calculation produced zero. It would require much larger decoy sets to establish an accurate false positive rate in a screening sense, but based on the distributions of scores, novelty values, and exclusion values, it is likely to be less than 1/10,000, which suggests an enrichment rate of 1,000-fold or greater.
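The enrichment arithmetic implied above can be sketched as the ratio of the true-positive rate to the false-positive rate, one common screening-enrichment measure. The counts below are illustrative only (a 9% recovery rate against a bounding false-positive rate of 1 in 10,000), not the actual screen tallies.

```python
def enrichment(tp, fp, n_actives, n_decoys):
    """Screening enrichment as the ratio of the true-positive rate
    (tp / n_actives) to the false-positive rate (fp / n_decoys)."""
    tpr = tp / n_actives
    fpr = fp / n_decoys
    return tpr / fpr

# Hypothetical illustration of the bound in the text: recovering ~9% of
# actives while fewer than 1 in 10,000 decoys pass implies an enrichment
# on the order of a thousand-fold.
print(round(enrichment(tp=9, fp=1, n_actives=100, n_decoys=10_000)))  # 900
```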
Figure 17 shows four examples of the predicted in-model potent ChEMBL molecules. The first compound (ChEMBL-287868) was the most potent, but least interesting, as it was an obvious analog of QNB, atropine, and related older antimuscarinics [56]. The second (ChEMBL-264414) was published in 2008 [57]. The muscarinic activity of alkyne-linked quinuclidines was a surprise to researchers working toward novel bladder agents, and the selectivity profile suggested potential toward a COPD indication. This scaffold was different from any within the expanded training set for model refinement. The third compound, ChEMBL-583027, was reported in 2010 [58]. It was one of a group of pyrrolidinylfurans, of which two potent examples were identified by the unrefined model (ChEMBL compounds 571121 and 569760, predicted at 8.1 and 9.5, respectively, with actual pK\(_i\) values of 8.3 and 8.0). The refined model, in addition to identifying ChEMBL-583027, also identified four other actives of this scaffold correctly.
The last compound (lower right) was identified through pharmacophoric modeling and virtual screening, reported in 2013 [59]. Of the 28 compounds in that study, 12 were represented within the ChEMBL data set with reported \(pK_i \ge 7.5\). The initial unrefined QuanSA model identified one-third of these (ChEMBL molecules 37372 and 517712 with experimental \(pK_i> 9.0\), and 2377261 and 2377269, both with \(pK_i\) very close to 8.0).
The refined model identified three more with activity predictions \(\ge 7.5\). This included ChEMBL molecule 3085495 shown in Fig. 17. ChEMBL molecules 2106570 and 2377268 were predicted to have pK\(_i\) values of 8.5 and 7.5 and had experimental activities of 8.8 and 7.5, respectively. The remaining five ChEMBL compounds (1231, 2377387, 1490, 1123, 2377267, all with experimental pK\(_i \ge\) 7.5) were predicted with pK\(_i =\) 6.5–6.9. The QuanSA model would have identified all of the potent compounds as in-model winners in a screen with the threshold set as pK\(^{pred}_i \ge 6.5\), with an extremely low false positive rate (3/1000 ZINC decoys are identified by this procedure at this predicted activity threshold).
In all cases of predicted and confirmed active molecules, even for those with divergent scaffolds from all training examples, the pose predictions were consistent across multiple variants and were convincing in the light of the SAR available.