Where possible, data have been collected that allow for fair and direct comparisons between the methods reported here and widely used alternatives. This is challenging for three reasons: (1) some high-quality data sources prohibit redistribution of molecular structural data, necessitating re-acquisition; (2) many methodological developers and evaluators choose to provide only PDB and ligand HET codes, or only PDB codes with no indication of ligand identity, necessitating inferences as to ligand bond orders, tautomer states, formal charges, and even which ligand might be meant; and (3) molecular file format and conversion utilities may introduce noise into the data, most commonly by producing incorrect annotations of chiral atoms and configurations of double bonds. Every effort has been made here to ensure that the curated data fairly represent the structural data underpinning other published reports, and great care has been taken to remove all memory of 3D coordinates prior to generating initial 3D structural models and proceeding with conformational elaboration.
Molecular data sets
The results in this work were derived from the data summarized in Table 1. The CSD Set was provided as CCDC Reference Codes in the primary validation study of OMEGA from Hawkins et al. [3]. These compounds were downloaded directly from the Cambridge Structural Database [14] as SYBYL mol2 files; the largest connected molecular graph was detected (the first one if multiple of maximal size existed) and taken to be the structure of interest; and protons were added automatically as needed (few compounds had full explicit hydrogen atoms). Because the CSD Set contained explicit bond order information, fidelity of results to other reports is expected to be high.
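The component-selection rule used for the CSD Set can be sketched as follows. This is a minimal illustration only: it assumes a molecule is given as an atom count plus a list of bonded index pairs, and it interprets "the first one" as the component containing the lowest-numbered atom. The function name `largest_component` is hypothetical and is not part of Surflex-Tools.

```python
from collections import deque


def largest_component(n_atoms, bonds):
    """Return the sorted atom indices of the largest connected component.

    Ties are broken in favor of the component encountered first, which is
    assumed here to be the one containing the lowest-numbered atom.
    """
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    seen = set()
    best = []
    for start in range(n_atoms):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            atom = queue.popleft()
            component.append(atom)
            for nbr in adjacency[atom]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        # Strict '>' keeps the earlier component when sizes are equal.
        if len(component) > len(best):
            best = component
    return sorted(best)


# Two three-atom fragments plus a lone atom: the first fragment wins the tie.
print(largest_component(7, [(0, 1), (1, 2), (3, 4), (4, 5)]))
```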
The remaining four data sets were curated directly from the RCSB PDB, using an automated process. Given a PDB code and a specific HET code, the procedure is as follows:
1. The PDB biological assembly is downloaded using wget.
2. The Surflex-Tools grindpdb command is used to heuristically infer components (protein, water, cofactors, and ligands), bond orders, and protonation/tautomer states. SYBYL mol2 files are produced for all components.
3. Quality measurements are calculated:
   (a) Ligand strain by movement: A ligand is minimized under a quadratic positional constraint on its heavy atoms. It is retained if its RMSD from the original coordinates does not exceed 0.55Å. The ligand is then freed from positional restraint.
   (b) Ligand strain by energy: The ligand’s pose is optimized in Cartesian space with both the internal force field and the Surflex-Dock scoring function. Its internal energy is calculated and it is then minimized outside of the protein. If the difference in internal energy (per heavy atom) between the optimal pose and the local minimum does not exceed 0.50 kcal/mol/atom, it is retained.
   (c) Structure quality by movement: If the RMSD between the experimental coordinates and the optimal scoring pose from local optimization is less than 1.25Å, the ligand is retained.
   (d) Structure quality by ligand efficiency: If the optimal docking score (nominally in units of pK\(_d\)) divided by the number of heavy atoms is at least 0.10 pK\(_d\)/atom, the ligand is retained.
   (e) Structure match to alternate curation: Graph matching is done between the final ligand and the corresponding SMILES-based molecular structure (and tautomeric variants) from the RCSB Ligand Expo. If there is a match, the ligand is retained.
4. In a few cases where a ligand from a reported benchmark failed this process, manual adjustment of bond orders was done (this was needed for several ligands from the Foloppe Set).
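The retention criteria (a) through (e) amount to a conjunction of threshold tests, which can be sketched as below. This is an illustration under stated assumptions: the per-ligand measurements are taken as already computed, and the record type `LigandQC`, its field names, and the function `passes_quality_checks` are hypothetical, not part of Surflex-Tools.

```python
from dataclasses import dataclass


@dataclass
class LigandQC:
    """Hypothetical container for the five per-ligand quality measurements."""
    rmsd_restrained: float        # (a) RMSD after restrained minimization, in Å
    strain_per_heavy_atom: float  # (b) energy difference, kcal/mol/atom
    rmsd_optimized_pose: float    # (c) RMSD of the optimal scoring pose, in Å
    docking_score_pkd: float      # (d) optimal docking score, nominal pKd
    n_heavy_atoms: int
    matches_ligand_expo: bool     # (e) graph match to the curated SMILES


def passes_quality_checks(q):
    """Apply the thresholds quoted in the text; all five must hold."""
    ligand_efficiency = q.docking_score_pkd / q.n_heavy_atoms
    return (
        q.rmsd_restrained <= 0.55            # "does not exceed 0.55 Å"
        and q.strain_per_heavy_atom <= 0.50  # "does not exceed 0.50"
        and q.rmsd_optimized_pose < 1.25     # "less than 1.25 Å"
        and ligand_efficiency >= 0.10        # "at least 0.10 pKd/atom"
        and q.matches_ligand_expo
    )


# Illustrative values only, not measurements from any real ligand.
example = LigandQC(0.30, 0.20, 0.80, 7.5, 25, True)
print(passes_quality_checks(example))
```

Note the mix of inclusive and strict bounds, which follows the wording of each criterion.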
For the PINC dataset, the original report made use of 1261 ligands. Here, with a fully automated workflow, 1062 ligands emerged passing all quality criteria (84%). In cases where ligands automatically parsed from PDB coordinates fail to match the curated SMILES structures, one cannot assume that either structure is correct. In such cases, we have observed all three possible outcomes: the grindpdb procedure yields a ligand structure that matches the published report; the PDB curated SMILES (or SDF) structure matches the report; or neither the grindpdb ligand nor the PDB curated structure matches the report. Further, in all cases where the independent structural inference from the Surflex-Tools grindpdb command agrees with the PDB curated SMILES structure, such ligand structures appear correct.
For the ConfGen set, the original publication listed 667 PDB codes (with no indication of ligand HET names) in supplementary material [15]. Using the automated procedure just described, the PDB structures were processed. After eliminating ligands that failed quality criteria, duplicate ligands from individual structures were removed. Cases where multiple ligands were still present were manually checked against the PDB monographs to identify the ligand of interest (typically this involved selecting an inhibitor rather than some type of cofactor). Cases where a single ligand was present were manually checked to ensure that the ligand was the compound under study, and cases where this was not true were discarded. This process resulted in 520 ligands from the original full set of 667 (78%). The remaining 147 ligands were manually curated in order to ensure that the correct structures were used. In all but a single case, the small molecule SDF files were used unmodified. The single case requiring adjustment was PDB Code 1QBV, where the coordinates for a single carbon of a phenyl ring were clearly wrong (they were on top of the carbon para to the correct position and the problem was fixed manually).
For the Macrocycle Set, the entire set of roughly 25,000 SMILES strings associated with different ligands in the PDB was analyzed. Those that were possible to parse, for which force field parameters were assignable, and where bonds existed for which the smallest enclosing ring size was nine or greater were identified. The PDB codes for that set were extracted, and these were subjected to the procedure described above. Those ligands with 60 or fewer heavy atoms which passed all quality tests formed a large set with substantial redundancy. For each protein structure, the surviving macrocyclic exemplar with the highest ligand efficiency was retained. In cases where a single HET code was represented in more than one protein structure, at most five examples were retained (again based on ligand efficiency). There were 3 ligands with 5 exemplars each, 1 with 4, 8 with 3, 13 with 2, and 113 represented as singletons.
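The two-stage redundancy reduction just described (best exemplar per protein structure, then at most five per HET code) can be sketched as follows. This is an illustration under assumptions: each surviving ligand is represented as a `(pdb_code, het_code, ligand_efficiency)` tuple, and the function name `select_exemplars` is hypothetical.

```python
from collections import defaultdict


def select_exemplars(records, max_per_het=5):
    """records: iterable of (pdb_code, het_code, ligand_efficiency) tuples."""
    # Stage 1: keep the exemplar with the highest ligand efficiency
    # for each protein structure.
    best_per_pdb = {}
    for pdb, het, le in records:
        if pdb not in best_per_pdb or le > best_per_pdb[pdb][2]:
            best_per_pdb[pdb] = (pdb, het, le)

    # Stage 2: for each HET code, keep at most max_per_het exemplars,
    # again ranked by ligand efficiency.
    by_het = defaultdict(list)
    for rec in best_per_pdb.values():
        by_het[rec[1]].append(rec)
    kept = []
    for het in sorted(by_het):
        ranked = sorted(by_het[het], key=lambda r: r[2], reverse=True)
        kept.extend(ranked[:max_per_het])
    return kept


# Tally of the reported multiplicities: exemplars per ligand -> ligand count.
multiplicity = {5: 3, 4: 1, 3: 8, 2: 13, 1: 113}
n_exemplars = sum(k * v for k, v in multiplicity.items())
n_unique = sum(multiplicity.values())
print(n_exemplars, n_unique)  # 182 exemplars over 138 distinct ligands
```

The tally at the end is simple arithmetic on the counts given in the text.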
Overall, there were 138 unique macrocyclic ligands. Multiple examples of ligands were included for two reasons. First, ligands from different protein structures often exhibit different conformations. Second, the algorithms we report here may be dependent on atom order: the sequence of bends and twists that are made can vary depending on atom ordering. Therefore, it seemed wise to consider the effect of differences in input even in cases where the bioactive conformations might be quite similar. We believe this to be the largest set of macrocyclic ligands curated for the purpose of assessing 3D structure and conformer generation. While there are sure to be additional macrocycles of 60 or fewer heavy atoms of high quality in the PDB, we believe that this set is both diverse and large enough to begin to tease apart statistically significant differences between the performance of different methods.
The Macrocycle Set contained 14 of the 30 molecules from the Foloppe Set. The remaining 16 were curated as just described, although several ligands required careful manual correction of bond orders.
In all cases, the coordinates of non-hydrogen atoms were not changed in building the reference ligand poses. In order to assess either 3D structure generation or conformer generation, we believe that it is important to fully erase any memory of the target coordinates. Here, we have taken two approaches. For all five data sets, we used a procedure to mark tetrahedral chirality and carbon-carbon double-bond configurations and then zeroed all coordinates. This computational molecular construct consists only of atomic elements, bond connectivity, formal charges, and the two types of configurational notations. It contains the same information as an isomeric SMILES representation. For the CSD Set, in addition, we repeated our experiments using isomeric SMILES as input.
Algorithmic details
There were four major additions to the Surflex Platform for the work reported here: a more sophisticated force field, a partial charge assignment method, a method for 3D structure generation, and a novel ring search method. They are detailed as follows, along with a brief description of the torsional sampling approach, which has not been substantially altered.
Force field: MMFF94sf
Our variant of MMFF94 and MMFF94s [16–21] is called “MMFF94sf.” The implementation began directly with MMFF94 (including analytical gradients), followed by the parameter changes introduced in MMFF94s [21] that increased the planarity of unstrained delocalized trigonal nitrogen centers. Extensive validation was conducted against the two available suites of small molecules with assigned atom types, energetic term values, and total energies.
The validation tests were done using a dielectric value of 1.0 and the partial charges given. We can technically call our implementation of MMFF94s a “partial” one, because on fewer than 3% of the molecules in the validation suite, our atom type assignments differ slightly. These differences typically occur in the treatment of nitrogen atoms where there are multiple logical assignments for the atom types, generally in aromatic or conjugated systems that also include a formally charged nitrogen. The differences were very small in terms of the locations of the minima between our implementation and a fully compliant one (hundredths of an Ångström).
Of more significance, we have further modified the force field to increase the planarity of unstrained delocalized trigonal nitrogen centers. For each out-of-plane term within MMFF94s that differed from one in MMFF94, we multiplicatively alter the force constant by a fixed value, whose default is 6.67. Changing this value to 1.0 brings the parameters back to those in MMFF94s. No adjustments were made to the torsional terms of MMFF94s.
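The scaling rule can be stated compactly in code. The sketch below is illustrative only: the function name and the numerical force constants are hypothetical, and only the default factor of 6.67 and the rule itself come from the text.

```python
def mmff94sf_oop_constant(k_mmff94, k_mmff94s, scale=6.67):
    """Out-of-plane force constant under the MMFF94sf rule described above.

    Only terms whose MMFF94s value differs from the MMFF94 value are scaled.
    Setting scale=1.0 recovers the MMFF94s parameters.
    """
    if k_mmff94s != k_mmff94:
        return scale * k_mmff94s
    return k_mmff94s


# Hypothetical parameter values, chosen only to exercise both branches.
print(mmff94sf_oop_constant(0.02, 0.05))             # altered in MMFF94s: scaled
print(mmff94sf_oop_constant(0.04, 0.04))             # unchanged term: left as is
print(mmff94sf_oop_constant(0.02, 0.05, scale=1.0))  # back to MMFF94s
```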
We call our variant force field MMFF94sf to distinguish it from other variants. In all of the work reported here, we used a dielectric constant of 80.0 to match aqueous conditions. This has the beneficial effect of preventing intramolecular electrostatic interactions from dominating the energetic minima and allowing molecules to explore a wider range of conformational configurations. Partial charges were assigned as follows.
Partial charges: electronegativity equalization
Rather than using the bond-charge-increment scheme of MMFF94 [17], we have implemented an electronegativity equalization approach, similar to that reported by Gilson et al. [22]. Electronegativity equalization is a general method for partial charge assignment that avoids complex atom typing schemes [23–28].
The Gilson method was parameterized with 39 atom types, each associated with an electronegativity and a hardness (which quantifies an atom’s resistance to change in preferred charge). The atomic electronegativity values are modified based on local environment. Then, atomic charges are assigned by minimizing a simple function E that depends on the electronegativity and hardness of the atoms in the molecule. The function E is defined as follows, where for each atom i, \(e_i\) is the modified electronegativity, \(s_i^o\) is the hardness, and \(q_i\) is the partial charge:
$$E = \sum_{i=1}^{n}\left( e_i q_i + \frac{1}{2} s_i^o q_i^2 \right) \tag{1}$$
Minimization of E is subject to two constraints. First, formal charge within local groups of atoms is prevented from bleeding outside of each group by more than a fixed amount. Second, the sum of the individual partial charges must equal the total formal charge of the molecule. One last aspect of the method was a novel approach to ensure that atomic equivalence over different resonance forms resulted in equivalent charges for equivalent atoms. The parameters of the method were chosen to reproduce ab initio molecular electrostatic potentials for a set of 284 molecules.
Our method differs in two key respects. First, to address the issue of symmetry across different forms of a molecule, instead of enumerating resonance forms, we make use of a general graph matching algorithm that identifies atom-atom equivalencies. This is done through straightforward topological comparison of a molecule to itself (the same approach is used to identify chiral atoms). The partial charges of any atom sets that are identical to one another have their initial charges replaced by the mean across the symmetry group. For example, in propane, the methyl groups on the ends form groups of six identical hydrogen atoms and two identical carbon atoms, with the middle carbon being unique but having two identical hydrogen atoms.
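The averaging step for the propane example can be sketched as follows. The equivalence classes are supplied by hand here rather than derived by graph matching, the atom numbering and initial charges are illustrative assumptions, and the function name `symmetrize_charges` is hypothetical.

```python
def symmetrize_charges(charges, equivalence_classes):
    """Replace each atom's charge by the mean over its symmetry class."""
    out = list(charges)
    for group in equivalence_classes:
        mean = sum(charges[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out


# Propane, with an assumed atom numbering:
#   0, 2      terminal carbons        1      central carbon
#   3-5       hydrogens on atom 0     6-7    hydrogens on atom 1
#   8-10      hydrogens on atom 2
propane_classes = [
    [0, 2],                # two identical methyl carbons
    [1],                   # unique central carbon
    [3, 4, 5, 8, 9, 10],   # six identical methyl hydrogens
    [6, 7],                # two identical methylene hydrogens
]

# Illustrative starting charges only; these are not computed values.
initial = [-0.20, -0.10, -0.22, 0.05, 0.06, 0.07, 0.04, 0.06, 0.05, 0.08, 0.05]
symmetric = symmetrize_charges(initial, propane_classes)
print(symmetric)
```

Averaging within classes leaves the total charge unchanged, so the step does not disturb the total-charge constraint.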
Second, our method differs in the manner in which the constraints on the total formal charge and on local formal charge containment are enforced. The Gilson method makes use of Lagrange multipliers over \(3^N\) different conditions, where N is the total number of charge groups, and the charge distribution with the lowest value of E is taken. Instead, we perform a series of minimizations of E using Powell’s method [29], with a quadratic penalty on the magnitude of violation of the total charge and local charge constraints.
The quadratic penalty is weighted by a factor of 2 in the first iteration (e.g. if the formal charge of a molecule is 0.0 and the total partial charge is 2.5, then the penalty applied to E is 12.5). Beginning with the optimized partial charges of each successive minimization, the penalty on deviations is doubled, and the process is iterated until the penalty exceeds \(10^8\). Using this method, deviations for either total charge or local charge are smaller than \(10^{-6}\).
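The penalty schedule can be sketched as below, under several stated simplifications. Only the total-charge constraint is modeled (the local charge-group constraint is omitted), and each penalized minimization is solved in closed form instead of with Powell's method, which is possible because the penalized objective is quadratic. The electronegativity and hardness values are illustrative, and all function names are hypothetical.

```python
def objective(q, e, s):
    """E from Eq. 1: sum of e_i*q_i + 0.5*s_i*q_i**2."""
    return sum(ei * qi + 0.5 * si * qi * qi for ei, si, qi in zip(e, s, q))


def penalty(weight, total_charge, formal_charge):
    """Quadratic penalty on the violation of the total-charge constraint."""
    return weight * (total_charge - formal_charge) ** 2


def minimize_penalized(e, s, formal_charge, weight):
    """Closed-form minimizer of objective + penalty (stand-in for Powell).

    Setting each partial derivative to zero gives
    q_i = -(e_i + 2*w*(S - Q)) / s_i, where S is the total partial charge.
    """
    a = sum(ei / si for ei, si in zip(e, s))
    b = sum(1.0 / si for si in s)
    total = (2.0 * weight * b * formal_charge - a) / (1.0 + 2.0 * weight * b)
    shift = 2.0 * weight * (total - formal_charge)
    return [-(ei + shift) / si for ei, si in zip(e, s)]


def assign_charges(e, s, formal_charge=0.0, start_weight=2.0, max_weight=1e8):
    """Double the penalty weight each round until it exceeds max_weight."""
    weight = start_weight
    q = [0.0] * len(e)
    while weight <= max_weight:
        # An iterative minimizer would start from the previous q here;
        # the closed-form stand-in does not need a starting point.
        q = minimize_penalized(e, s, formal_charge, weight)
        weight *= 2.0
    return q


print(penalty(2.0, 2.5, 0.0))  # the worked example in the text: 12.5
charges = assign_charges([2.0, -1.0, 0.5], [4.0, 5.0, 8.0])
print(charges, sum(charges))
```

Starting with a small weight and doubling it keeps the early minimizations well conditioned, which matters for a derivative-free method such as Powell's even though the closed-form stand-in used here is indifferent to it.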
For the current work, parameterization followed that of Gilson et al. [22], but additional refinement will be the subject of future work. In particular, we plan to make use of MMFF94sf atom types, which form a more diverse and complete description of chemical behavior, on a much larger set of molecules for parameterization. Also, rather than using a hard boundary on local charge bleed, it is appealing to contemplate a softer penalty and see whether the fit to high-quality ab initio charges can be improved. Note that because we have made use of a high dielectric constant here, the effects of small differences in computed partial charges are expected to make little impact on the character and quality of the conformational ensembles produced by ForceGen.
3D structure generation
The Introduction described the overall algorithm for 3D structure generation. Important details in the implementation are as follows:
1. Bond lengths: For initial atomic position assignment, bonds between hydrogen and any other atom are given the standard alkane C–H bond length from MMFF94. All other bonds are assigned the standard alkane C–C bond length.
2. Minimization: The process of structure refinement requires repeated minimization calculations, each potentially turning off or on various classes of terms in the force field. All of the minimizations are done in Cartesian space, using the quasi-Newton Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS). In the initial refinement of the structure (Steps 1–5), the termination conditions are gradient \(\le 0.1\), atom position change \(\le 0.001\), and energy change \(\le 0.1\). For final refinement, these threshold values become, respectively, \(10^{-3}\), \(10^{-5}\), and \(10^{-5}\).
3. Iteration: A refined molecular structure whose final energy is \(\le 7.0\) kcal/mol/atom and whose specified chirality and double-bond configurations are correct is called a success. The process of initial atomic position assignment and refinement is repeated until either six successes occur or until a maximum number of tries is exceeded, defined by default to be five times the number of atoms in the molecule.
In the event that the procedure fails to find any successful structures (as defined above), if a structure has been produced that matches the specified configurations of chiral centers and double bonds, that structure is returned, despite having relatively high energy. Note that for certain highly-strained molecules (e.g. cubane), the global minimum conformer has high per atom energy. In the event that no structure is found that matches the given specification, no solution is returned.
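The iteration and fallback logic can be sketched as a simple driver loop. This is a sketch under assumptions: coordinate assignment and refinement are abstracted into a caller-supplied `build_and_refine` callable, configuration checking into `config_ok`, and the choice to return the lowest-energy fallback when several exist is a guess that the text does not confirm. All names are hypothetical.

```python
def generate_structures(n_atoms, build_and_refine, config_ok,
                        energy_cutoff=7.0, wanted=6, tries_per_atom=5):
    """Repeat build/refine until `wanted` successes or the try limit.

    build_and_refine() -> (structure, total_energy_in_kcal_per_mol)
    config_ok(structure) -> True if chirality and double bonds are correct
    """
    successes = []
    fallback = None  # best high-energy structure with correct configuration
    for _ in range(tries_per_atom * n_atoms):
        structure, energy = build_and_refine()
        if not config_ok(structure):
            continue
        per_atom = energy / n_atoms
        if per_atom <= energy_cutoff:
            successes.append((per_atom, structure))
            if len(successes) == wanted:
                break
        elif fallback is None or per_atom < fallback[0]:
            # Assumption: keep the lowest-energy non-success as the fallback.
            fallback = (per_atom, structure)

    if successes:
        ranked = sorted(successes, key=lambda pair: pair[0])
        return [structure for _, structure in ranked]
    # No success: return the fallback if one exists, otherwise nothing.
    return [fallback[1]] if fallback is not None else []
```

A caller would wrap the actual embedding and minimization steps in `build_and_refine`; the loop itself only enforces the success definition and the stopping rules.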
Note that it is possible, for example, to construct a SMILES string where the specified chirality is physically impossible to obtain, and this sometimes happens in error. In extensive testing, the procedure very rarely fails to find a reasonable 3D structure for a well-formed input molecule that has defined atom types within the force field. Approximately 95% of final structures have energies of 2.0 kcal/mol/atom or less, with 99.5% having energies of 2.5 kcal/mol/atom or less. Typical times for structure generation on the molecules under study here ranged from 1–3 s, including the atomic partial charge calculation.
With respect to protonation, the default approach is to make use of the specified formal charges from the input representation and to fill out the standard valences of the heavy atoms with hydrogen atoms. Optionally, a heuristic method may be employed to infer charges such that addition of hydrogen atoms will yield structures likely to be physiologically relevant (e.g. with carboxylic acids deprotonated).
Ring search
The Introduction described the overall algorithm for ring search, involving manipulation of ring systems through a series of bends and twists. Important details in the implementation are as follows (for the standard search protocol):
1. Ring redundancy: For non-macrocyclic ring systems, ring redundancy depends on ring system size. For systems of fewer than ten atoms, the RMSD threshold is 0.1Å. For larger systems but with fewer than 35 atoms, the threshold is 0.2Å. For still larger systems, the threshold is 0.3Å. For macrocyclic systems, the entire molecule is considered to be part of the nominal ring system because pendant groups often have a strong influence on ring geometry. For these, the RMSD threshold is 0.5Å. As new conformations for ring systems are produced, they are compared to existing ones in a pool that is initialized with the original ring system configuration. In cases where a new variant is non-redundant and falls within an energy window of the current minimum, it is added to the pool. In cases where a new variant is redundant with an existing one, the lower-energy of the two is retained in the pool.
2. Atom pinning: In the process of ring bends and twists, certain atoms are held in place during minimization calculations by means of a quadratic penalty. For bends, after the physical bend is made, atoms of the ring system are pinned to their positions with a force constant of 100.0 kcal/mol per Å\({^2}\). So, a deviation of 0.1Å from the pinned position produces a penalty of 1.0 kcal/mol. The pinning process allows the remainder of the molecule to adapt to the new ring conformation. Without this step, ring bends will often revert to their original position due to the influence of pendant atoms. For the twists in macrocyclic systems, the penalty is the same for all four atoms involved in the twist. However, the three atoms that are fixed are allowed the freedom to wiggle by 0.1Å without incurring a penalty (this results in a square-bottomed quadratic penalty). Allowing some motion for these atoms helps in accommodating the extensive physical ring-closure constraints while rotating the fourth atom around the bond axis of the twist. As in the process of structure generation, minimization is done in Cartesian space.
3. Bends: As described in the Introduction, bends are made in flexible ring systems by identifying appropriate pairs of ring atoms whose axis will be used to perform a bend. The bend amount is determined by considering the torsion angle between the centroids of the LHS and RHS sides. The bend that is made is symmetric, so if the angle of deflection from a plane through the LHS for the RHS side is 20 degrees, then the bend will result in 20 degrees in the other direction. After each bend, a minimization is carried out with the ring system atoms pinned using the more lenient of the termination cutoffs above. Then the pins are released and a minimization is carried out with the more stringent cutoffs. This process reliably produces sensible variations of ring systems composed of small rings while preventing reversion to the original ring conformation unless the new one is inappropriate.
4. Twists: The process of twisting is similar to that of bending, but for each twist, multiple rotations are employed, of 60, 120, 180, and 240 degrees. For each rotation of the fourth atom, the other three atoms are pinned in their parent conformer’s position. The fourth is pinned in the desired new positions sequentially. Minimization with a lenient cutoff is carried out initially for each such position, pins are released, and minimization with more stringent cutoffs is done. This process produces low-energy variations of macrocyclic systems that overcome high-energy barriers in a direct manner. Stochastic approaches are often stymied by these barriers, instead relying, fundamentally, on the luck of the draw to identify new ring system variations across high-energy barriers.
5. Bounds on ring pool variants: Limits are respected both regarding the number and maximal energy of ring variants in the pool. The number depends on the search mode. For standard search, the limit for non-macrocyclic systems is 20 and for macrocyclic systems is 36. The energetic limit on ring variants in the pool, measured from the current minimum-energy ring variant, is double the overall energy window for the whole molecule. By default, the overall window is 10.0 kcal/mol, so ring variations may include conformations up to 20.0 kcal/mol higher than the particular minimum in a pool that is to be augmented. At the end of each round of twists and bends, the ring variant pool is pruned to ensure that the number and energy constraints are maintained.
6. Iteration: When any particular ring variant is elaborated through bends and twists, it is marked as done. If it is replaced by a lower-energy alternative within the ring redundancy threshold, the new ring variation must be re-elaborated. The overall process of ring elaboration ends when no ring variants have been elaborated or when five rounds of ring elaboration have occurred.
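The pool bookkeeping described under ring redundancy and pool bounds can be sketched as follows. Several details are assumptions: redundancy is taken to mean an RMSD at or below the threshold, pruning is assumed to keep the lowest-energy variants (the text does not say how the surviving variants are chosen), and the RMSD function is supplied by the caller. All names are hypothetical.

```python
def redundancy_threshold(n_ring_atoms, macrocyclic=False):
    """RMSD threshold in Å for the standard search protocol."""
    if macrocyclic:
        return 0.5
    if n_ring_atoms < 10:
        return 0.1
    if n_ring_atoms < 35:
        return 0.2
    return 0.3


def update_pool(pool, candidate, rmsd, threshold, window=20.0):
    """pool: list of (energy, conformer) pairs. Returns True if pool changed."""
    e_new, conf_new = candidate
    for i, (e_old, conf_old) in enumerate(pool):
        if rmsd(conf_new, conf_old) <= threshold:
            # Redundant: keep whichever of the two has lower energy.
            if e_new < e_old:
                pool[i] = candidate
                return True
            return False
    # Non-redundant: accept only inside the energy window above the minimum.
    e_min = min(e for e, _ in pool)
    if e_new <= e_min + window:
        pool.append(candidate)
        return True
    return False


def prune_pool(pool, max_variants=20, window=20.0):
    """Enforce the energy and count limits (lowest-energy first: assumed)."""
    e_min = min(e for e, _ in pool)
    inside = sorted((p for p in pool if p[0] <= e_min + window),
                    key=lambda p: p[0])
    return inside[:max_variants]
```

For a macrocyclic system in standard mode, `max_variants` would be 36 and the threshold 0.5Å. A replaced variant would also need to be flagged for re-elaboration, which this sketch leaves to the caller.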
This procedure has not been optimized heavily, except for small systems such as cyclohexane. Numerous opportunities exist for speed improvements and for different types of physical manipulations of ring systems.
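For reference, the two pinning penalties described under atom pinning above can be written as a single function. This is a sketch: it assumes that, for the square-bottomed form, the quadratic penalty is measured from the edge of the penalty-free region, which the text does not state explicitly. The function name is hypothetical.

```python
def pin_penalty(displacement, k=100.0, free_radius=0.0):
    """Pinning penalty in kcal/mol for an atom displaced by `displacement` Å.

    k is the force constant in kcal/mol per Å^2. With free_radius=0.0 this is
    the plain quadratic pin used for bends; with free_radius=0.1 it is the
    square-bottomed form used for the three fixed atoms of a twist.
    """
    excess = max(0.0, displacement - free_radius)
    return k * excess * excess


print(pin_penalty(0.1))                   # plain pin, about 1.0 kcal/mol
print(pin_penalty(0.1, free_radius=0.1))  # inside the flat bottom: 0.0
```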
Torsional elaboration
The procedure for torsional elaboration has not changed appreciably since the introduction of the Surflex Docking method [4].
Briefly, the non-ring bonds within a molecule that require sampling are identified and assigned types that, for example, differentiate sp3–sp3 linkages from sp3–sp2. Different types of bonds are assigned different sampling levels (e.g. sp3–sp3 are assigned three total rotations including the existing one and sp2–sp3 bonds are assigned 6). The limits on energy and numbers of conformers in the following refer to the standard search mode. The following describes the search for a single ring variant, which is used to initialize the conformer pool. Multiple ring variants can be searched serially or as a pool. When done as a pool, the pool sizes are increased accordingly.
Groups of such bonds are made such that they form molecular fragments where each such fragment contains bonds of a single group. The groups are limited such that exhaustive sampling will not exceed 200 variants. The combination of bond rotations required to exhaustively sample the bonds within the group is applied to each conformer in the current pool, and these variants are collected and then added to the pool. From these, the most diverse subset is chosen up to 400 (the maximal pool size is larger than the maximal variant sampling because of iteration). The process is repeated through the different bond groups, with iterative steps of pool expansion and then selection based on diversity.
At this point, the conformers in the pool are rapidly relaxed using internal coordinates to minimize energy. Redundant conformers and those with excessively high energy are discarded. These are then minimized in Cartesian space using lenient cutoffs. Those that meet an energy threshold of 20.0 kcal/mol above the current minimum are then minimized using stringent cutoffs. All conformers with energy greater than 10.0 kcal/mol above the discovered minimum are discarded. If more conformers exist than desired (200 in standard mode), the most diverse of the set are chosen and returned.
In cases where the nominal calculation of the total number of conformers exceeds \(10^6\), a different approach is taken. Exhaustive sampling across the identified rotatable bonds can be done in a specific order that is analogous to counting, where each digit in the count is the index of the rotation of the rotatable bond in question. For highly flexible molecules, a modulus M is selected such that sampling every \(M^{th}\) conformer in this sequence will produce five times the number of desired final conformations. These are then put through the same process as just described to yield a diverse pool of low energy conformational variants.
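The counting analogy corresponds to a mixed-radix number in which each digit is the rotation index of one bond. A minimal sketch follows, assuming the sampling level of each rotatable bond is given as a list of integers and that M is chosen by integer division; the function names are hypothetical and the actual rule used to pick M is not specified in the text.

```python
from math import prod


def rotation_indices(k, levels):
    """Map position k in the counting order to one rotation index per bond."""
    digits = []
    for n in levels:
        k, d = divmod(k, n)
        digits.append(d)
    return digits


def choose_modulus(levels, n_desired, oversample=5):
    """Pick M so that every M-th conformer gives about oversample * n_desired."""
    total = prod(levels)
    return max(1, total // (oversample * n_desired))


def sampled_conformers(levels, n_desired):
    """Rotation-index tuples for every M-th conformer in counting order."""
    m = choose_modulus(levels, n_desired)
    return [rotation_indices(k, levels) for k in range(0, prod(levels), m)]


# Ten rotatable bonds alternating three and six sampling levels.
levels = [3, 6] * 5
print(prod(levels), choose_modulus(levels, 200))
print(len(sampled_conformers(levels, 200)))
```

One caveat with this simple choice: a modulus that shares a factor with the sampling levels of the fastest-varying bonds would revisit the same rotations for those bonds, so a practical implementation may need to adjust M.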
Computational procedures and statistical analysis
The results reported here were generated using Surflex-Tools version 4.057. The bulk of the results were generated through zeroed-coordinate conformer randomization in standard search mode, as follows (shown for the PINC Set):
RMS deviations were computed for each resulting conformer pool by identifying all molecular symmetries, then applying the rigid body alignment transform to each conformer so as to minimize the RMSD against the crystallographic one under all identified symmetric self matches. The minimum such RMSD value (for non-hydrogen atoms) is the value reported for each ligand. RMSD of heavy atoms corrected for molecular automorphism is standard in evaluations of docking calculations and for conformer generation.
An alternative control used for the CSD Set involved beginning the process from SMILES representations of the molecules, as follows:
Here, SMILES strings were generated from the original SYBYL mol2 archive file using the BIOVIA Discovery Studio Viewer Version 4.0. Note that this procedure yielded structures with very different atom orderings than in the original ligand structures, providing data to assess whether the nominal order dependencies within the ForceGen algorithms produce significant variations in conformer quality.
Standard search depth is specified by the -pgeom argument. This selects ring search (maximum ring variants 20 or macrocyclic variants 36) and produces a maximum of 200 conformers per ligand, with a limit of 10.0 kcal/mol for the highest energy conformer above that with the minimum energy. The thorough protocol is specified by -pquant, which increases the allowable ring variations for macrocycles to 72 and the maximum number of conformations to 1000 (ring redundancy for macrocycles is also decreased to 0.3Å).
The screening protocol is specified by -pscreen, which disables macrocycle searching, decreases ring variations to 4, decreases the energy window to 5.0 kcal/mol, and limits the number of conformations to 20 for molecules with 5 or fewer rotatable bonds, 60 for 6–11 rotatable bonds, and 200 for 12 or greater. The fast search protocol (-pfast) is similar to the screening one except that it disables ring search altogether, using the ring configurations in the given conformer, and limits all molecules to 20 conformers regardless of the number of rotatable bonds that each contains.
Under all search protocols, conformers whose RMSD from another conformer in the pool is 0.25Å or less are eliminated, though the pressure to identify maximally diverse conformer pools generally prevents this level of redundancy from occurring. Numerous user-settable parameters allow control over the procedures beyond the four basic protocol choices. However, in our experience, these four protocols cover nearly all needs.
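The stated protocol settings can be collected in one place, as in the sketch below. Only values given explicitly in the text are recorded; settings that the text leaves unstated for a protocol are marked `None` and are not guessed. The dictionary layout, key names, and function name are illustrative and do not reflect any Surflex-Tools configuration format.

```python
# Values as stated in the text; None means "not stated for this protocol".
PROTOCOLS = {
    "-pgeom": {   # standard
        "ring_variants": 20, "macrocycle_variants": 36,
        "max_conformers": 200, "energy_window": 10.0,
    },
    "-pquant": {  # thorough
        "ring_variants": None, "macrocycle_variants": 72,
        "max_conformers": 1000, "energy_window": None,
    },
    "-pscreen": {  # screening: macrocycle search disabled
        "ring_variants": 4, "macrocycle_variants": 0,
        "max_conformers": None,  # depends on rotatable bonds, see below
        "energy_window": 5.0,
    },
    "-pfast": {   # fast: ring search disabled altogether
        "ring_variants": 0, "macrocycle_variants": 0,
        "max_conformers": 20, "energy_window": None,
    },
}

REDUNDANCY_RMSD = 0.25  # Å, applied under all protocols


def screen_conformer_limit(n_rotatable_bonds):
    """Conformer limit under -pscreen as a function of rotatable bonds."""
    if n_rotatable_bonds <= 5:
        return 20
    if n_rotatable_bonds <= 11:
        return 60
    return 200


for n in (3, 8, 15):
    print(n, screen_conformer_limit(n))
```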
Where possible, we have made direct comparisons between the methods introduced here and other widely used methods. This involves comparing success rates across different thresholds of RMSD for generated conformers against the experimental structures of the ligands. We favor providing full cumulative histograms of such data, which allows for detailed inspections of performance across all thresholds of interest, and it also allows for comparison of distributions using the Kolmogorov–Smirnov (KS) statistical test. The KS test finds the maximal gap between two cumulative histograms, and, given the numbers of data points for each cumulative histogram, it is possible to relate the magnitude of that gap to the probability that a gap of such magnitude would be observed by chance. Note that this test is non-parametric, requiring no assumptions about the characteristics of the underlying distributions.
Technically the test tells only whether the two distributions are different, not whether one is better than the other. However, with non-perverse distributions, it is easy to see by inspection which of two cumulative histograms is better than the other given that the two are different. Here, comparisons between methods or method variations are made with data set sizes of 30, 182, 480, 667, and 1062. Respectively, the critical values for maximal percentage point differences at \(p = 0.05\) are: 35.1, 14.3, 8.8, 7.4, and 5.9. So, on 480 data points, if the largest difference between Method X and Method Y is at least 8.8 percentage points in favor of Method X, then Method X is likely to be statistically significantly better than Method Y at the \(p = 0.05\) level. In the results that follow, an indication that a difference between methods or method variations is statistically significant refers to an approximate KS test, as just described, unless otherwise noted. Care has been taken to ensure that misleading conclusions are not suggested by clever selection of statistics or thresholds.
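The critical values quoted above are consistent with the usual large-sample approximation for the two-sample KS test, \(D_{crit} = c\,\sqrt{(n+m)/(nm)}\) with \(c \approx 1.36\) at \(p = 0.05\), applied to two samples of equal size. The text does not state the formula, so the sketch below is an inference that happens to reproduce the listed numbers; the function names are hypothetical.

```python
from math import sqrt


def ks_critical_gap(n, m=None, coefficient=1.36):
    """Approximate two-sample KS critical gap (as a fraction) at p = 0.05.

    Both samples are assumed to have the same size unless m is given.
    """
    m = n if m is None else m
    return coefficient * sqrt((n + m) / (n * m))


def max_cumulative_gap(rmsd_a, rmsd_b):
    """Largest vertical gap between two empirical cumulative histograms."""
    gap = 0.0
    for t in sorted(set(rmsd_a) | set(rmsd_b)):
        frac_a = sum(x <= t for x in rmsd_a) / len(rmsd_a)
        frac_b = sum(x <= t for x in rmsd_b) / len(rmsd_b)
        gap = max(gap, abs(frac_a - frac_b))
    return gap


# Critical gaps in percentage points for the five data set sizes.
for n in (30, 182, 480, 667, 1062):
    print(n, round(100 * ks_critical_gap(n), 1))
```

A method comparison then reduces to checking whether `max_cumulative_gap` for the two sets of per-ligand RMSD values reaches the critical gap for that data set size.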
Additional details about the data set and computational procedures are available at www.jainlab.org. Details about obtaining the Surflex software are available at www.biopharmics.com.