Dynamic clustering threshold reduces conformer ensemble size while maintaining a biologically relevant ensemble

Yongye, Austin B.; Bender, Andreas; Martínez-Mayorga, Karina

doi:10.1007/s10822-010-9365-1


Open access
Published: 25 May 2010

Volume 24, pages 675–686, (2010)

Journal of Computer-Aided Molecular Design


Austin B. Yongye¹,
Andreas Bender² &
Karina Martínez-Mayorga¹


Abstract

Representing the 3D structures of ligands in virtual screenings via multi-conformer ensembles can be computationally intensive, especially for compounds with a large number of rotatable bonds. Thus, reducing the size of multi-conformer databases and the number of query conformers, while simultaneously reproducing the bioactive conformer with good accuracy, is of crucial interest. While clustering and RMSD filtering methods are employed in existing conformer generators, the novelty of this work is the inclusion of a clustering scheme (NMRCLUST) that does not require a user-defined cut-off value. This algorithm simultaneously optimizes the number and the average spread of the clusters. Here we describe and test four inter-dependent approaches for selecting computer-generated conformers, namely: OMEGA, NMRCLUST, RMS filtering and averaged-RMS filtering. The bioactive conformations of 65 selected ligands were extracted from the corresponding protein:ligand complexes from the Protein Data Bank, including eight ligands that adopted dissimilar bound conformations within different receptors. We show that NMRCLUST can be employed to further filter OMEGA-generated conformers while maintaining biological relevance of the ensemble. It was observed that NMRCLUST (containing on average 10 times fewer conformers per compound) performed nearly as well as OMEGA, and both outperformed RMS filtering and averaged-RMS filtering in terms of identifying the bioactive conformations with excellent and good matches (RMSD < 1.0 Å). Furthermore, we propose thresholds for OMEGA root-mean-square filtering depending on the number of rotors in a compound: 0.8, 1.0 and 1.4 Å for structures with low (1–4), medium (5–9) and high (10–15) numbers of rotatable bonds, respectively. The protocol employed is general and can be applied to reduce the number of conformers in multi-conformer compound collections and alleviate the complexity of downstream data processing in virtual screening experiments.

Introduction

Ligand-based drug design (LBDD) approaches, such as 3D-similarity searches [1, 2], pharmacophore modeling [3, 4], and 3D-QSAR development [5–7], involve predicting the bioactive conformations of drugs in the absence of a structural model for the receptor. Typically, multiple conformations of potential drug molecules are generated via random or systematic conformational searches in vacuum, distance-dependent dielectric or implicit solvent [8, 9], and employed to search for bioactive conformations. However, it has been established that flexible ligands undergo conformational changes upon binding to receptors, do not bind in their lowest energy solution- or gas-phase free states [10–12] and are more extended in the bound state [13, 14]. Therefore, the success of LBDD relies heavily on the ability of conformer generators to produce conformers whose conformational space includes the bioactive state (except in cases where methods are largely independent of the precise ligand conformation) [15]. Because of the difference between bound and unbound states, another consideration is how many unbound conformations should be employed to contain a representative of the bioactive conformation(s). The level of interest in addressing this question is reflected in the number of studies that have been undertaken either by testing existing conformer generators with different protocols [9, 11–13, 16–22] or by developing more efficient conformational search algorithms [23–26].

The contribution of internal energy to the thermodynamics of binding necessitates a good 3D representation of the conformers. For example, a 1.4-kcal/mol increase in conformational energy results in an approximately 10-fold decrease in affinity [27]. Furthermore, intramolecular hydrogen bonding (due to its directionality) and electrostatics (due to its sensitivity to distance), among other interactions, increase the complexity of the conformational search space. Therefore, the treatment of these interactions is crucial for conformer generators, which include CAESAR [24], OMEGA [28], ConfGen [29], CatConf [30], and stochastic proximity embedding [31, 32], to cite a few. Having generated the conformers, the next step involves a judicious selection of structures to be employed in further work (ligand- or structure-based investigations). The most common strategies employed by established conformer generators are RMS filtering [28, 29] and poling [30]. In addition, RMS filtering can be applied as a post-generation step.
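The relationship between a conformational energy penalty and binding affinity quoted above can be checked with a Boltzmann-type factor, exp(ΔE/RT). The following is a minimal sketch; the temperature of 298.15 K and the function name are assumptions made for illustration and are not stated in the text.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def affinity_fold_decrease(delta_e_kcal, temperature_k=298.15):
    """Fold decrease in affinity for an energy penalty, estimated as exp(dE / RT).

    The default temperature (298.15 K) is an assumption for illustration.
    """
    return math.exp(delta_e_kcal / (R_KCAL * temperature_k))

fold = affinity_fold_decrease(1.4)
print(f"1.4 kcal/mol corresponds to a roughly {fold:.0f}-fold decrease in affinity")
```

Under this assumption the penalty of 1.4 kcal/mol gives a factor slightly above ten, in line with the approximate 10-fold decrease cited.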

Studies to determine what metrics to employ in selecting unbound conformers to represent a bioactive conformation typically rely on energy- and geometry-based methods. Alternatively, structural similarity and activity similarity have been used together to derive putative bound conformations [33]. Energy-based methods involve comparing the internal strain energies of the global or local minima of computer-generated models with those of the bound conformer or a generated conformer that is geometrically very similar to the bound state [9–11, 16, 34]. The energies are computed employing quantum-mechanical methods or empirical force fields in different environments. The energy-based methods are outside the scope of this work; however, it suffices to note that these methods have resulted in cut-offs ranging from 0.5 [16] to 41.6 kcal/mol [10], which illustrates the diversity present in the energetics of ligand binding events. In geometric approaches [10, 13, 18, 21], root-mean-squared deviations (RMSD) are computed between the heavy atom positions of computer-generated low energy conformers of a ligand and those of its bioactive conformer. Previously, conformational differences were observed between bound and unbound structures; however, in nine out of ten cases the bound and free conformations displayed similarities in the positions of key atoms involved in ligand recognition [21]. Utilizing 100 low energy conformers per ligand, Günther et al. [18] reproduced the bound states of ligands for 70% of the entire dataset and for 90% of average-sized (5.6 rotatable bonds) molecules with a similarity threshold of 1.0 Å. At a 1.0-Å cutoff, Auer et al. identified bioactive structures in 75% of the ligands studied [13], while the RMSDs of at least 86% of the ligands investigated by Kirchmair et al. were within 2.0 Å of the bioactive conformer [12].

These RMSD and energy ranges indicate that they cannot be applied broadly, but only by simultaneously considering the number of rotatable bonds of the ligand and the functional form of the force field or energy-based method. Given that bioactive conformers span a range of internal energies it is reasonable to select conformers that cover this range, while at the same time employing a cutoff that does not result in an unmanageable number of conformers in a multi-conformer database. Quantum mechanics methods are the most rigorous; however, their computationally intensive nature prohibits their application to a multi-conformer database. Hence, the use of faster but less accurate empirical methods remains.

Our broad goal is to integrate objective ensemble clustering strategies with established conformer generators in order to select as few computer-generated conformers as possible to represent the bioactive conformer(s) in a search database, in an effort to minimize the complexity of downstream analyses of results. Ensemble clustering has been implemented before using principal component analysis to determine unique conformers from a collection of conformers [35]. Also, it was demonstrated earlier that conformational diversity could be achieved by including a poling penalty function in the search algorithm, which penalizes current conformers that are closely related to pre-existing conformers [36]. The conformer generator employed in this study, OMEGA [28], utilizes RMSD filtering to reduce conformer redundancy during the filtration step; nonetheless, the number of conformers generated can potentially be reduced further. Additional downstream or on-the-fly enhancements can be achieved with clustering methods that do not require user-defined cut-off values to pare down the number of conformers in large databases, for instance, the hundreds of thousands of compounds typically found in combinatorial libraries. In the current work, we employed the NMRCLUST algorithm implemented in the Chimera version 1.4.1 command line interface [37] as the extra step. NMRCLUST is less subjective than other clustering schemes because it avoids the requirement for the user to specify a maximum intra-cluster RMSD cutoff that is directly linked to the number of clusters generated. In practice this is convenient since the full conformational landscape of a compound is generally not known a priori.
In this work we show that further filtering of OMEGA-generated conformers with the NMRCLUST algorithm produces a smaller number of structures that can be employed to represent the conformational space of a drug-like molecule, while at the same time maintaining biological relevance of the ensemble.

Methods

The workflow employed in this study is illustrated in Fig. 1. The bound conformations of the ligands were obtained from their co-crystallized complexes in the Protein Data Bank (PDB). These structures had been employed previously to investigate the relative energies of the bound conformations of drug-like molecules [16]. Two structures (1RO9 and 3CPA) were removed from the list because of questionable B-factors [16]. In addition, eight pairs of identical ligands that crystallized in dissimilar conformations in different receptors were included. A subset of 65 ligands was selected to represent the entire range of rotatable bonds reported for some drug-like compounds (Tables 1, 4) [12]. Hydrogen atoms were added to the crystal structures of the ligands employing the AddH tool of Chimera version 1.4.1 [37], and each ligand was inspected visually for structural consistency. The positions of hydrogen atoms were optimized, while the heavy atoms were fixed, utilizing the default minimization criteria in Chimera version 1.4.1 (100 steps, stepsize 0.02 Å, update interval 10, Gasteiger charges). Finally, each structure was minimized in the Molecular Operating Environment version 2008.10 (MOE) program [38] employing heavy-atom positional constraints that are related to atomic B-factors and the temperatures at which the crystals were solved. This treatment of the dataset was performed to take into consideration high B-factors that can lead to inaccurate fitting of ligand atoms. Details of this approach have been published elsewhere [16]. Briefly, this method takes advantage of the notion that atoms with low B-factors have well-resolved electron densities; therefore, their positions are well-defined by the experimental coordinates and may not require further adjustments. In contrast, high B-factors indicate high atomic mobility and positional uncertainties. Thus, in minimizing the bioactive structures the positional constraints are higher on atoms with well-defined atomic coordinates compared to those with poorly-defined coordinates. As a result, atoms with low B-factors remain relatively stationary, while atoms with high B-factors move, presumably to their optimal positions. (In this study superposing pre- and post-minimized bioactive conformations did not lead to any significant changes in compound geometry; see the Results section and Table S1.) Additional factors, such as protein environment and explicit solvent effects, are not considered in this process.

Table 1 The protein data bank IDs from which the ligands were extracted, the number of rotatable bonds, the number of conformers

Next, the computational 3D models were built from scratch, and minimized employing the MMFF94x force field and default parameters in MOE. Four conformer sets were generated from these initial conformers. The first set of conformers (omega) was generated utilizing default OMEGA version 2.3.1 parameters except for the following: rms = 0.4; ewindow = 25.0 kcal/mol, maxconfs = 500; searchff = mmff94s_noestat. The initial number of conformers generated was 50,000, specified via the maxconfgen parameter. The rms parameter sets a lower limit for filtering similar conformers; maxconfs determines the final number of conformers to be retained from the initial ensemble requested via maxconfgen; the searchff specifies the force field employed to compute internal energies during conformer search; and ewindow sets an upper limit for retaining the generated conformers. The incomplete force field, mmff94s_noestat, was employed in order to neglect intramolecular gas-phase interactions that could lead to collapsed conformations, given that bound conformers are generally more extended than unbound conformers [13, 14]. The OMEGA parameters were employed because they have been shown to be optimal in terms of reproducing the bound conformations of ligands [12]. In addition, the maxconfs limit of 500 was set because the clustering algorithm scales as O(n²), see below. The OMEGA-generated conformers were translated to the same coordinate frame of reference as the bioactive conformer employing a rigid-body superposition with the ROCS version 2.3.1 program [39]. The second set (nmrclust) was generated by clustering the OMEGA-generated conformers of each molecule using the NMRCLUST algorithm in the Chimera command line interface, which employs the Kelley penalty function [40] to determine an optimal number of clusters. 
Utilizing the NMRCLUST algorithm avoids subjective inputs of pre-defined intra-cluster cut-offs or spreads, by selecting the number of clusters that minimizes a penalty function during hierarchical clustering of an RMS distance matrix, D(i, j), employing the average-linkage method. The average-linkage method performed best for this type of study compared to single or complete linkage [40]. For each hierarchy a penalty function is determined using the number of clusters and the average spread of the clusters. The hierarchy that gives the minimum value of the penalty function is selected to represent the optimum number of clusters for the conformer ensemble. Briefly, a distance matrix consisting of heavy-atom pairwise RMSDs for an ensemble of structures is generated. Next, hierarchical clustering is performed with the matrix using the average-linkage method:

$$ \text{dist}(m,n) = \frac{\sum_{i=1}^{X} \sum_{j=1}^{Y} \text{dist}(i,j)}{XY} $$

for clusters m and n with X and Y members, respectively, and dist(i, j) the RMS between the superimposed i and j from m and n, respectively [40].
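As an illustration of the average-linkage distance defined above, the following minimal sketch evaluates dist(m, n) for two clusters given as lists of conformer indices into a pairwise RMSD matrix. The function name and the toy matrix values are hypothetical and serve only to show the arithmetic.

```python
def average_linkage(dist, cluster_m, cluster_n):
    """Mean of all pairwise distances between members of two clusters."""
    total = sum(dist[i][j] for i in cluster_m for j in cluster_n)
    return total / (len(cluster_m) * len(cluster_n))

# Toy symmetric RMSD matrix (in Å) for four conformers; values are illustrative only.
rmsd_matrix = [
    [0.0, 0.2, 1.0, 1.2],
    [0.2, 0.0, 0.8, 1.0],
    [1.0, 0.8, 0.0, 0.4],
    [1.2, 1.0, 0.4, 0.0],
]

# Cluster m holds conformers 0 and 1 (X = 2); cluster n holds conformers 2 and 3 (Y = 2).
d_mn = average_linkage(rmsd_matrix, [0, 1], [2, 3])
print(d_mn)  # (1.0 + 1.2 + 0.8 + 1.0) / 4
```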

In the course of the clustering, the average spread is determined at each stage using the spreads determined by: [40]

$$ \text{spread}_{m} = \frac{\sum_{k=1}^{N} \sum_{i=1,\, i<k}^{N} \text{dist}(i,k)}{N(N-1)/2} $$

for cluster m containing N members, with conformers i and k; by definition, clusters that contain only one member (singletons or N = 1) are excluded in the calculation of the spread. The average spread is computed by: [40]

$$ \text{AvSp}_{i} = \frac{\sum_{m=1}^{\text{cnum}_{i}} \text{spread}_{m}}{\text{cnum}_{i}} $$

where i is a given hierarchy, and cnum_i the number of clusters at that hierarchy. The average spreads are then normalized to values between one and (N_T − 1), where N_T is the total number of structures in the ensemble, as follows: [40]

$$ \text{AvSp(norm)}_{i} = \left( \frac{N_{T} - 2}{\text{Max(AvSp)} - \text{Min(AvSp)}} \right) \left( \text{AvSp}_{i} - \text{Min(AvSp)} \right) + 1 $$

Max(AvSp) and Min(AvSp) denote the maximum and minimum average spreads, respectively, across all the stages of the clustering. This normalization gives equal weight to the average spread and the number of clusters in a penalty function that is computed as the sum of the normalized average spread at a given hierarchy and the corresponding number of clusters (including singletons). The penalty scores are then stored as a function of the number of clusters and the average normalized spreads: [40]

$$ P_{i} = \text{AvSp(norm)}_{i} + \text{nclust}_{i} $$

The number of clusters that corresponds to the minimum penalty score defines the cut-off for the ensemble. This cut-off represents the stage wherein the clusters are as highly populated as possible, while concurrently minimizing the spread. After this analysis, a structure closest to the centroid of each cluster is selected as the representative structure. The third set of structures (rms) was generated by altering the value of the rms parameter in OMEGA in order to obtain a number of conformers comparable to the number of representative structures identified by NMRCLUST. Finally, the fourth set of structures (rms_avg) was generated by partitioning the dataset in terms of number of rotors: low, having between one and four rotatable bonds; medium, possessing between five and nine rotatable bonds; and high, with 10 to 15 rotatable bonds. The rms-filtering cutoffs employed in set three for the compounds in each category were averaged and employed to generate conformers for each molecule in the rms_avg set.
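The penalty-function procedure described in this section (average-linkage merging, cluster spreads, normalized average spread, and selection of the minimum penalty) can be sketched in a few lines. This is a minimal pure-Python illustration written from the equations above, not the Chimera implementation. Several choices are assumptions: the average spread is taken over non-singleton clusters only, the stage with all singletons is skipped because it has no defined spread, the cluster representative is chosen as the medoid (the member with the smallest summed distance to the other members) as a stand-in for "closest to the centroid", and the 1D toy positions merely stand in for a real pairwise RMSD matrix.

```python
from itertools import combinations

def average_linkage(dist, a, b):
    """Average-linkage distance between clusters a and b (lists of indices)."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

def spread(dist, members):
    """Mean pairwise distance within one cluster (needs at least two members)."""
    pairs = list(combinations(members, 2))
    return sum(dist[i][k] for i, k in pairs) / len(pairs)

def kelley_clusters(dist):
    """Return the partition that minimizes P_i = AvSp(norm)_i + nclust_i."""
    n_total = len(dist)
    clusters = [[i] for i in range(n_total)]
    if n_total < 2:
        return clusters
    stages = []  # one (average spread, partition) entry per merge
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(dist, clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        # Singletons are excluded from the spread calculation (assumed denominator).
        spreads = [spread(dist, c) for c in clusters if len(c) > 1]
        stages.append((sum(spreads) / len(spreads), [list(c) for c in clusters]))
    av = [s for s, _ in stages]
    lo, hi = min(av), max(av)
    scale = (n_total - 2) / (hi - lo) if hi > lo else 0.0

    def penalty(stage):
        av_sp, partition = stage
        return scale * (av_sp - lo) + 1 + len(partition)  # nclust counts singletons

    return min(stages, key=penalty)[1]

def representative(dist, members):
    """Medoid of a cluster, used here as a proxy for the structure nearest the centroid."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))

# Toy data: 1D positions standing in for conformers; |p - q| stands in for RMSD.
positions = [0.0, 0.1, 0.2, 2.0, 2.1, 5.0]
toy_dist = [[abs(p - q) for q in positions] for p in positions]

best = kelley_clusters(toy_dist)
reps = {tuple(sorted(c)): representative(toy_dist, c) for c in best}
print(sorted(sorted(c) for c in best))
```

For this toy input the minimum penalty is reached at three clusters, which separates the two tight groups from the isolated point; on real ensembles the outcome depends entirely on the RMSD matrix supplied.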

The RMSDs between the computer-generated structures and their bioactive conformations were computed for each multi-model file, utilizing the g_rms module of GROMACS [41], and the RMSD statistics (average, standard deviation, minimum and maximum values) were extracted (see Table 2). Perl scripts were written for the automation of the conformer generation, ROCS overlays, and RMSD analyses procedures.
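The per-ligand RMSD statistics described above can be illustrated with a short sketch. It is not the GROMACS g_rms module: it assumes that the conformers have already been superposed onto the bioactive conformer, that heavy atoms are listed in the same order in every structure, and that no further fitting is performed. The function names and the toy coordinates are hypothetical.

```python
import math
import statistics

def rmsd(coords_a, coords_b):
    """RMSD between two pre-superposed structures with matching atom order."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))

def rmsd_summary(reference, conformers):
    """Average, standard deviation, minimum and maximum RMSD to the reference."""
    values = [rmsd(reference, conf) for conf in conformers]
    return {
        "average": statistics.mean(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
        "minimum": min(values),
        "maximum": max(values),
    }

# Toy two-atom "ligand" (coordinates in Å); values are illustrative only.
bioactive = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],  # identical to the reference
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],  # shifted by 1 Å along z
    [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)],  # shifted by 2 Å along y
]
summary = rmsd_summary(bioactive, ensemble)
print(summary)
```

The minimum value is the quantity of interest for deciding whether an ensemble contains a conformer close to the bioactive one.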

Table 2 The minimum RMSDs between the bioactive conformations and the structures from the different computer-generated datasets

Results and discussions

The metric employed to assess deviations between the computer-generated and bioactive conformers was the RMSD between each pair of computed and experimental structures. To improve the quality of the structures, the bioactive conformers were refined via energy minimizations taking into account positional uncertainties in the experimental atomic coordinates via atomic B-factors. Details are provided in the Methods section. It is conceivable that these minimizations may significantly alter the conformations of the bioactive structures, though minimizations of experimental structures in energy and structural comparisons are not uncommon [16, 22]. In this study superposing the pre- and post-minimized bioactive conformations of each compound did not reveal any significant changes, Table S1.

For the computer-generated ensembles, the first set (omega) was employed as a performance reference as well as the input file for subsequent clustering steps. This output also showed whether the conformers generated with our OMEGA parameters actually included the bioactive conformation in the first place. The second set (nmrclust) served to represent the conformational space of each molecule employing a smaller number of conformers by clustering, with the aim of retaining the bioactive conformation. The clustering approach employed here does not require a priori knowledge of the desired number of conformers, nor of the maximum spread or distance cut-off to include structures in a cluster [40]. In the third set of structures, the rms parameter of OMEGA was adjusted for each compound to give a similar number of conformers as obtained with the clustering method. This set is intended to determine whether the clustering can be avoided by simply modifying the rms filtering value to generate the desired number of conformers per molecule. Lastly, a fourth set was constructed, named rms_avg. Structures in this set were generated to determine whether specific values could be employed during conformational sampling depending on the number of rotors in a compound. It is recognized that the last two sets of structures (rms and rms_avg) include information derived from the clustering dataset. As such it is presumed that the NMRCLUST algorithm is an efficient clustering approach.

The ligands employed in this work, the number of rotatable bonds and the initial numbers of conformers generated by OMEGA are presented in Table 1. The number of clusters identified employing the NMRCLUST algorithm, and the number of conformers generated by the rms and rms_avg filtering schemes are also shown. As expected [42], the number of conformers generally increased with the number of rotors for each method. For instance, for two rotors the average number of conformers was 10.67, 4.67, 6 and 5.33 for omega, nmrclust, rms and rms_avg, respectively, compared to 500, 47.5, 48 and 106, respectively, for fifteen rotors.

As a way to account for the molecular size, the ratio between the number of rotatable bonds and the total number of bonds between the heavy atoms for each ligand was determined. This ratio is an indication of the flexibility of the molecule. Lower values of this ratio indicate that the compound is generally less flexible, more unsaturated, with cyclic substructures that may or may not be fused. Consequently, its bioactive conformer can be determined with relative ease. The reverse is true for higher values. There was an overall increase in this ratio, visualized in Fig. 2 as the black line-open circles, reflecting some of the challenges encountered when utilizing conformer generators to obtain a conformer that closely resembles the bioactive conformation of highly flexible compounds, in agreement with previous studies [10, 12]. Also shown in Fig. 2, are the average RMSDs computed between the conformers in each computer-generated multi-conformer file and their bioactive conformer. The average RMSDs were statistically similar across all four methods, and did not provide specific details about the similarities between individual computer-generated conformers and their bioactive conformer.

To obtain a better indication of the occurrence of the bioactive conformer among the computer-generated conformers, the range of RMSD values was determined for each ligand for the different schemes. The bins and populations of the minimum RMSD values between each ligand and its bioactive conformer for the four methods employed are shown in Fig. 3. A tabular format of these data is given in supplementary material Table S2. A classification of RMSD values (in Å) between computer-generated and bound conformers has been suggested before [12]: RMSD < 0.5 indicates an excellent match; 0.5 ≤ RMSD < 1.0 signifies a good match; 1.0 ≤ RMSD < 1.5 suggests an acceptable match; 1.5 ≤ RMSD < 2.0 is still acceptable; and RMSD ≥ 2.0 is unacceptable. The population distributions are coded with black, spotted and gray representing the low, medium and high number of rotor categories, respectively. Overall, the RMSD distributions covered the entire range from excellent to unacceptable, although the majority of the values occupied the good to acceptable limits (from 0.5 to 1.5 Å). It is worth pointing out that for ligands with high numbers of rotatable bonds (10–15) none of the datasets contained a conformer that was in excellent agreement with the bioactive conformer. This is most likely a reflection of insufficient numbers of conformers because of the difficulty in exhaustively sampling the conformational space of highly flexible molecules [42].
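The RMSD classification quoted above maps directly onto a small lookup function. The sketch below uses the bin boundaries given in the text; the function name and the label strings are chosen here for illustration.

```python
def classify_rmsd(rmsd_angstrom):
    """Assign a minimum RMSD (in Å) to the match categories quoted in the text."""
    if rmsd_angstrom < 0.5:
        return "excellent"
    if rmsd_angstrom < 1.0:
        return "good"
    if rmsd_angstrom < 1.5:
        return "acceptable"
    if rmsd_angstrom < 2.0:
        return "still acceptable"
    return "unacceptable"

# Illustrative values only, not taken from Table 2.
labels = [classify_rmsd(value) for value in (0.3, 0.5, 1.2, 1.9, 2.0)]
print(labels)
```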

Comparing the four different sets of generated conformers in terms of reproducing the bound ligand structure revealed some notable trends (Fig. 3). For RMSDs < 0.5 Å the rankings were as follows: omega > nmrclust > rms > rms_avg. The trend was similar for good reproduction, except that rms = rms_avg. Given the observed trends for excellent and good reproduction, the order is expected to be reversed for acceptable and still-acceptable fits, with the rankings being rms_avg > rms > nmrclust > omega and rms > rms_avg > nmrclust > omega, respectively. That the trends were reversed for the latter RMSD ranges simply reflects the greater number of compounds distributed in the “excellent” and “good” categories of the RMSD fits for omega and nmrclust, compared to rms and rms_avg. It is interesting to note that nmrclust was better than rms in terms of “excellent” and “good” fits, given that rms filtering had at least as many structures as nmrclust.

Since the rms_avg set was derived from the rms set, it is expected that the number of conformers generated in the rms_avg set would differ from the number of conformers from the rms set; in fact, only in a few cases, such as 1CIM and 1H9U, were the numbers of conformers from these two sets identical (Table 1). It was hypothesized that ligands with more conformers in the rms_avg set than in the rms set would be more likely to capture the bioactive conformation, while the reverse would be true for ligands with a smaller number of conformers. The overall comparison of the number of conformers and the differences between minimum RMSDs to the bioactive conformation for the rms_avg and rms sets are shown in Fig. 4. The horizontal axis represents conformer differences (rms − rms_avg), while the vertical axis represents differences in minimum RMSD (rms_avg − rms). The quadrants depict the dataset as follows: lower-left, rms has fewer conformers and worse representation of the bioactive conformers; upper-left, rms has fewer conformers and better representation of the bioactive conformers; lower-right, rms has more conformers and worse representation of the bioactive conformers; upper-right, rms has more conformers and better representation of the bioactive conformers. It is expected that no data points populate the upper-left and lower-right quadrants of this plot. The few cases falling into these quadrants have either a small difference in the number of conformers or a small difference in the minimum RMSD (rms_avg − rms). Black circles (10 data points) represent ligands that were classified into different categories in the qualitative classification of Fig. 3. The relatively small number of these highlighted entries may explain why the rms_avg and rms filtering methods performed equally well in the classification presented in Fig. 3. Interestingly, increasing the number of conformers did not ensure better fits; in fact some cases (far left) provided at best the same performance (the difference in minimum RMSD is close to zero).
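The sign conventions of the Fig. 4 quadrants can be made explicit with a short helper. This is a sketch written from the axis definitions in the text (horizontal: conformers in rms minus rms_avg; vertical: minimum RMSD of rms_avg minus rms); the function name, argument names and the handling of points lying on an axis are assumptions.

```python
def fig4_quadrant(n_rms, n_rms_avg, min_rmsd_rms, min_rmsd_rms_avg):
    """Locate a ligand in the Fig. 4 plane from its rms and rms_avg results."""
    dx = n_rms - n_rms_avg                # > 0: rms has more conformers
    dy = min_rmsd_rms_avg - min_rmsd_rms  # > 0: rms represents the bioactive conformer better
    if dx == 0 or dy == 0:
        return "on an axis"  # assumed label for ties, not defined in the text
    horizontal = "right" if dx > 0 else "left"
    vertical = "upper" if dy > 0 else "lower"
    return f"{vertical}-{horizontal}"

# Illustrative numbers only: rms has fewer conformers and a larger minimum RMSD.
quadrant = fig4_quadrant(n_rms=10, n_rms_avg=20, min_rmsd_rms=1.2, min_rmsd_rms_avg=0.8)
print(quadrant)
```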

The statistics for the number of conformers and minimum RMSD values generated by each method are presented in Table 3. Employing the smaller numbers of conformers in nmrclust, rms or rms_avg, we obtained equal overall performance (56/57 acceptable RMSDs) from these methods. However, whenever possible nmrclust would be the recommended approach, given that rms_avg was derived from rms and would involve cumbersome adjustments of individual rms cutoffs for compounds in a large database. The rms_avg values of 0.76, 1.01 and 1.39 Å for low, medium and high numbers of rotors, respectively, derived in this study may serve as guidelines in OMEGA for these categories of compounds.
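The rotor-dependent guideline values can be expressed as a simple lookup. The sketch below uses the category boundaries and rms_avg values reported in this study; the function and dictionary names are hypothetical, and rejecting rotor counts outside 1–15 is an assumption, since the study does not cover them.

```python
# rms_avg guideline values (Å) derived in this study, keyed by rotor category.
RMS_AVG_BY_CATEGORY = {"low": 0.76, "medium": 1.01, "high": 1.39}

def rotor_category(n_rotors):
    """Map a rotatable-bond count to the low/medium/high categories of the text."""
    if 1 <= n_rotors <= 4:
        return "low"
    if 5 <= n_rotors <= 9:
        return "medium"
    if 10 <= n_rotors <= 15:
        return "high"
    # Counts outside 1-15 were not studied; refusing them is an assumption.
    raise ValueError(f"no guideline value for {n_rotors} rotatable bonds")

def suggested_rms_threshold(n_rotors):
    """Guideline value for the OMEGA rms filtering parameter."""
    return RMS_AVG_BY_CATEGORY[rotor_category(n_rotors)]

thresholds = [suggested_rms_threshold(n) for n in (2, 7, 15)]
print(thresholds)
```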

Table 3 Statistics for the number of conformers and minimum RMSD obtained by the four methods utilized

In addition, eight ligands adopting different bioactive conformations in different complexes were included (Table 4). The aim was to test whether the methods could sample multiple bioactive conformations for the same ligand. Generally, except for the kanamycin (KAN) and FAD ligands, the omega, nmrclust, rms and rms_avg methods sampled both bioactive conformers within acceptable limits (RMSD < 2.0 Å). For the KAN case, the methods sampled the bound conformation reasonably in at least one complex. In the final case (FAD in 1A8P and 1B2R), while rms did not capture the bound conformation of the ligand, it was sampled at least once by omega, nmrclust and rms_avg filtering. It is worth pointing out that when the rms_avg value was employed in OMEGA for ligand FAD the number of conformers retained was 500, the same as the number generated by omega (Table 4). For FAD in 1A8P, rms_avg identified the same closest conformer as omega. However, for 1B2R rms_avg sampled a conformer that better reproduced the bioactive conformer compared to omega. Also worth mentioning are cases wherein a smaller number of conformers exhibited a better representation of the bioactive conformer, comparing omega and rms_avg. These include: ADP in 1HW8; kanamycin in 1L8T; FAD in 1B2R; IM1 in 1TCW (see also 1H1P and 2CSN in Table 2). Although small in number, these cases indicate that different conformers are being sampled and that more than 500 conformers should be considered, especially for highly flexible molecules. The overall performances of the methods were omega (88%), nmrclust (88%), rms (81%) and rms_avg (94%). The statistics of the number of conformers and minimum RMSDs indicate, once more, that smaller ensembles may be utilized to capture the bioactive conformer for this set of ligands.

Table 4 The PDB structures employed for ligands present in dissimilar bound conformations in different receptors

Employing a predictive model it has been hypothesized that for small RMSD filtering values and large numbers of rotors the number of conformers required to exhaustively cover the conformational space ranges from the hundreds to hundreds of thousands [42]. This suggests that increasing the likelihood of incorporating a bioactive conformer during the conformer generation stage in ligand-based methods could result in huge computational costs during the screening stage. It is worth noting that there are some instances wherein it has been stated [43] and demonstrated [22] that the determination of a bioactive conformation or number of query conformers employed does not improve the performance of a 3D shape-based method such as ROCS in recovering active compounds during virtual screenings. This attests to the conformer generating strengths of OMEGA, and the ability of ROCS to score the compounds correctly even though the conformation may not represent the bound state. However, in cases such as pharmacophore modeling [18] and molecular-field-based similarity analysis [35] where the description of ligand features complementary to an active site is crucial, an accurate representation of the bound conformation is still of utmost importance. Therefore, it would be computationally efficient to reduce the number of conformers per compound in a database, while still retaining the bioactive conformer.

In a previous study, 10 conformers were recommended [18] for average-sized molecules, while 50 conformers have also been proposed for screening databases containing several million compounds [12]. The goal of the current work was to produce the smallest number of computer-generated structures, while still including the bioactive conformer. Our results reflected this possibility, demonstrated by the four conformer sets exhibiting acceptable representations (RMSD < 2.0 Å) of their bound conformations in 56/57 (98%) instances.

The conformational overlap between the bound ligands and the computer-generated conformers is shown in Fig. 5, using the ligand with PDB ID 1MMB as an example. This representation provides a qualitative view of how well the methods sample the bioactive conformer. Several structures dissimilar to the bound conformer are generated. More importantly, the bioactive conformer is captured using a smaller number of computer-generated structures.

Conclusions

Ensemble conformer clustering, implemented using the NMRCLUST algorithm, has been employed here to determine the extent to which clustering of computer-generated conformers reduces ensemble size while still retaining the bioactive conformation. This approach relies on the ability of the conformer generator, in this case OMEGA 2.3.1, to generate the bioactive conformation in the first place. Analysis of the minimum RMSD values between the bioactive and the computer-generated ligands indicated that the presence of more conformers in the ensemble increased the probability of including the bound conformation. Even though downsizing the number of generated conformers by clustering may result in a loss of bioactive conformers, we showed that this approach reproduced acceptable bound ligand conformations in 56 out of 57 cases. In addition, OMEGA 2.3.1 satisfactorily sampled different bound conformations of the same ligand in different receptors. In terms of “excellent” and “good” representations, ensemble clustering performed closer to the reference method (OMEGA) than either of the two RMSD filtering methods employed here. Therefore, using this clustering method, we showed that a smaller number of conformers was sufficient to capture the bioactive conformers of the ligands. It remains to be determined how multiple conformers derived from other conformer generators will perform. For combinatorial libraries that range from hundreds of thousands to millions of compounds, such an approach may be applied to reduce the number of conformers per ligand by performing on-the-fly clustering, thus allowing less intensive virtual screening campaigns.
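The idea of clustering without a user-defined cut-off can be sketched as follows. This is a simplified illustration based on our reading of the published description of NMRCLUST [40], not the implementation used in this work: it assumes a precomputed symmetric matrix of pairwise RMSDs, uses average-linkage merging, and selects the stage that minimizes the sum of the normalized average spread and the number of clusters. Picking the medoid as the cluster representative is a further assumption made here for brevity, and all function names are hypothetical.

```python
from itertools import combinations


def average_linkage_stages(dist):
    """Yield the partition after each average-linkage merge (n-1 clusters down to 1)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            mean = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or mean < best[0]:
                best = (mean, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        yield [list(c) for c in clusters]


def spread(cluster, dist):
    """Mean pairwise distance inside a cluster with at least two members."""
    pairs = list(combinations(cluster, 2))
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def cluster_without_cutoff(dist):
    """Return the partition minimizing normalized average spread + cluster count."""
    n = len(dist)
    if n < 2:
        return [[i] for i in range(n)]
    stages = list(average_linkage_stages(dist))
    avg_spreads = []
    for partition in stages:
        spreads = [spread(c, dist) for c in partition if len(c) > 1]
        avg_spreads.append(sum(spreads) / len(spreads))
    low, high = min(avg_spreads), max(avg_spreads)
    # Rescale the average spread to the range 1..n-1 so that it is
    # commensurate with the number of clusters.
    scale = (n - 2) / (high - low) if high > low else 0.0
    penalties = [
        scale * (s - low) + 1 + len(partition)
        for s, partition in zip(avg_spreads, stages)
    ]
    return stages[penalties.index(min(penalties))]


def representatives(partition, dist):
    """Pick the medoid (smallest summed distance to cluster mates) of each cluster."""
    return [min(c, key=lambda i: sum(dist[i][j] for j in c)) for c in partition]
```

For a ligand with a few hundred conformers, the retained ensemble would be the list returned by `representatives`; the quadratic cost of this naive merge loop would need attention before applying it to very large ensembles.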

Supporting information available

The RMSDs between pre- and post-minimized bioactive conformers and the distribution of the minimum RMSDs relative to the bioactive structures. The experimental and computer-generated coordinates of the ligands employed in this work.

References

1. Bender A, Glen RC (2004) Org Biomol Chem 2:3204
2. Johnson MA, Maggiora GM (1990) Concepts and applications of molecular similarity. Wiley, New York
3. van Drie JH (2003) Curr Pharm Des 9:1649
4. Alvarez J, Shoichet B (2005) Virtual screening in drug discovery. Taylor and Francis Group, LLC, CRC Press, Boca Raton
5. Tong W, Welsh WJ, Shi LM, Fang H, Perkins R (2003) Environ Toxicol Chem 22:1680
6. Dixon SL, Smondyrev AM, Knoll EH, Rao SN, Shaw DE, Friesner RA (2006) J Comput Aided Mol Des 20:647
7. Kubinyi H (1997) Drug Discov Today 2:457
8. Leach AR, Gillet VJ (2003) An introduction to chemoinformatics. Kluwer Academic Publishers, Dordrecht
9. Perola E, Charifson PS (2004) J Med Chem 47:2499
10. Nicklaus MC, Wang SM, Driscoll JS, Milne GWA (1995) Bioorg Med Chem 3:411
11. Boström J, Norrby PO, Liljefors T (1998) J Comput Aided Mol Des 12:383
12. Kirchmair J, Wolber G, Laggner C, Langer T (2006) J Chem Inf Model 46:1848
13. Auer J, Bajorath J (2008) J Chem Inf Model 48:1747
14. Diller DJ, Merz KM Jr (2002) J Comput Aided Mol Des 16:105
15. Bender A, Mussa HY, Gill GS, Glen RC (2004) J Med Chem 47:6569
16. Butler KT, Luque FJ, Barril X (2009) J Comput Chem 30:601
17. Chen IJ, Foloppe N (2008) J Chem Inf Model 48:1773
18. Günther S, Senger C, Michalsky E, Goede A, Preissner R (2006) BMC Bioinformatics 7
19. Hao MH, Haq O, Muegge I (2007) J Chem Inf Model 47:2242
20. Kirchmair J, Laggner C, Wolber G, Langer T (2005) J Chem Inf Model 45:422
21. Vieth M, Hirst JD, Brooks CL III (1998) J Comput Aided Mol Des 12:563
22. Kirchmair J, Distinto S, Markt P, Schuster D, Spitzer GM, Liedl KR, Wolber G (2009) J Chem Inf Model 49:678
23. Dorfman RJ, Smith KM, Masek BB, Clark RD (2008) J Comput Aided Mol Des 22:681
24. Li J, Ehlers T, Sutter J, Varma-O’Brien S, Kirchmair J (2007) J Chem Inf Model 47:1923
25. Izrailev S, Zhu FQ, Agrafiotis DK (2006) J Comput Chem 27:1962
26. Pavlov T, Todorov M, Stoyanova G, Schmieder P, Aladjov H, Serafimova R, Mekenyan O (2007) J Chem Inf Model 47:851
27. Liljefors T, Petterson I (1996) In: Krogsgaard-Larsen P, Liljefors T, Madsen U (eds) A textbook of drug design and development. Overseas Publishers Association, Amsterdam, pp 60–93
28. OMEGA: version 2.2.1 OpenEye Scientific Software: Santa Fe, NM, USA, www.eyesopen.com
29. Schrodinger, LLC, New York, NY 2008
30. Accelrys, Burlington, MA
31. Agrafiotis DK, Gibbs AC, Zhu FQ, Izrailev S, Martin E (2007) J Chem Inf Model 47:1067
32. Agrafiotis DK, Xu HF (2002) Proc Natl Acad Sci USA 99:15869
33. Martinez-Mayorga K, Medina-Franco JL, Giulianotti MA, Pinilla C, Dooley CT, Appel JR, Houghten RA (2008) Bioorg Med Chem 16:5932
34. Tirado-Rives J, Jorgensen WL (2006) J Med Chem 49:5880
35. Mestres J, Rohrer DC, Maggiora GM (2000) J Comput Aided Mol Des 14:39
36. Smellie A, Teig SL, Towbin P (1995) J Comput Chem 16:171
37. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) J Comput Chem 25:1605
38. Version 2007; Chemical Computing Group Inc.: Montreal, Quebec, Canada
39. ROCS: version 2.3.1 OpenEye Scientific Software: Santa Fe, NM, USA, www.eyesopen.com
40. Kelley LA, Gardner SP, Sutcliffe MJ (1996) Protein Eng 9:1063
41. Hess B, Kutzner C, van der Spoel D, Lindahl E (2008) J Chem Theory Comput 4:435
42. Borodina YV, Bolton E, Fontaine F, Bryant SH (2007) J Chem Inf Model 47:1428
43. Hawkins PCD, Skillman GA, Nicholls A (2007) J Med Chem 50:74

Acknowledgments

This work was supported by the State of Florida, Executive Office of the Governor’s Office of Tourism, Trade and Economic Development, and was partially performed within the framework of the Dutch Top Institute Pharma, project number D1-105 (A.B.). We thank Dr. Xavier Barril for providing both the SVL scripts and instructions on how to implement the B-factor and temperature-dependent constraints in MOE; Dr. Gerald M. Maggiora for insightful discussions; and Dr. Conrad C. Huang for the NMRCLUST algorithm. We thank the referees for helpful suggestions. Molecular graphics images were produced using the UCSF Chimera package from the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco (supported by NIH P41 RR-01081). We thank OpenEye Scientific Software for providing the OMEGA, ROCS, and VIDA programs.

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Andreas Bender
Present address: Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK

Authors and Affiliations

Torrey Pines Institute for Molecular Studies, 11350 SW Village Parkway, Room 132, Port St. Lucie, FL, 34987, USA
Austin B. Yongye & Karina Martínez-Mayorga
Medicinal Chemistry Division and Pharma-IT Platform, Leiden/Amsterdam Center for Drug Research, Leiden University, 2333 CC, Leiden, The Netherlands
Andreas Bender

Corresponding author

Correspondence to Karina Martínez-Mayorga.

Electronic supplementary material

Below is the link to the electronic supplementary material.

(DOC 85 kb)

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Yongye, A.B., Bender, A. & Martínez-Mayorga, K. Dynamic clustering threshold reduces conformer ensemble size while maintaining a biologically relevant ensemble. J Comput Aided Mol Des 24, 675–686 (2010). https://doi.org/10.1007/s10822-010-9365-1

Received: 11 February 2010
Accepted: 05 May 2010
Published: 25 May 2010
Issue Date: August 2010
DOI: https://doi.org/10.1007/s10822-010-9365-1
