Application of the PM6 method to modeling proteins
- Cite this article as:
- Stewart, J.J.P. J Mol Model (2009) 15: 765. doi:10.1007/s00894-008-0420-y
The applicability of the newly developed PM6 method for modeling proteins is investigated. In order to allow the geometries of such large systems to be optimized rapidly, three modifications were made to the conventional semiempirical procedure: the matrix algebra method for solving the self-consistent field (SCF) equations was replaced with a localized molecular orbital method (MOZYME), Baker’s Eigenfollowing technique for geometry optimization was replaced with the L-BFGS function minimizer, and some of the integrals used in the NDDO set of approximations were replaced with point-charge and polarization functions. The resulting method was used in the unconstrained geometry optimization of 45 proteins ranging in size from a simple nonapeptide of 244 atoms to an importin consisting of 14,566 atoms. For most systems, PM6 gave structures in good agreement with the reported X-ray structures. Some derived properties, such as pKa and bulk elastic modulus, were also calculated. The applicability of PM6 to model transition states was investigated by simulating a hypothetical reaction step in the chymotrypsin-catalyzed hydrolysis of a peptide bond. A proposed technique for generating accurate protein geometries, starting with X-ray structures, was examined.
Keywords: PM6, Hydrogen bonding, Metalloenzymes, MOZYME, Proteins, Salt bridge, Young’s modulus
Semiempirical methods, such as MNDO [1, 2], AM1, and PM3 [4, 5], have been used for a long time for modeling small organic and inorganic systems. They have not, however, enjoyed much success when used for modeling proteins, primarily because of their poor accuracy in reproducing the geometries of large organic systems such as oligopeptides, and because of the large computational effort involved.
The recently developed semiempirical method PM6 has been shown to reproduce the heats of formation and geometries of small molecules, simple organic and inorganic crystals, and a hormone (the nonapeptide oxytocin) with good accuracy. Because of these encouraging results, determining the applicability of PM6 to modeling larger biochemical molecules, particularly proteins, was of obvious interest.
Despite their large size, proteins can be regarded as simple organic compounds, being composed mainly or even entirely of residues of the 20 common amino acids. Some, for example, cytochrome-P450 and hemoglobin, are more complicated, in that they contain organometallic structures, e.g., the metalloporphyrin heme ring system, while others, such as the structural protein collagen, contain modified residues. But with the exception of photoactive sites of the type that occur in chlorophyll and in retinal-containing proteins, most of the subtle electronic phenomena frequently encountered in transition metal chemistry are absent. At the primary structural level, therefore, proteins can be described in terms of simple covalent bonds and weak hydrogen bonds. As such, the primary structure is easily and accurately reproduced by NDDO methods, in particular PM6. It is only when secondary, tertiary, and quaternary structures are involved that the wide range of properties of proteins becomes apparent, among the more important of which is the ability to catalyze reactions.
Several successful approaches have been developed to reduce the computational effort required for modeling proteins using semiempirical quantum chemical methods. One such technique is the divide-and-conquer method, in which a large molecule is divided into fragments, each of which is relatively easy to manipulate using quantum chemistry methods. This approach has been applied extensively to biochemical systems [10, 11, 12, 13].
Another, hybrid, approach has been to model the interesting parts of a protein using quantum mechanics (QM), and to model the rest of the system using molecular mechanics (MM). This approach, QM/MM, although complicated because of problems associated with the boundary between the QM and MM regions, has enjoyed considerable success.
In this work, the entire protein is modeled using semiempirical QM methods in which the self-consistent field (SCF) equations are solved using the localized molecular orbital method MOZYME instead of the more conventional matrix algebra methods. Minor changes to the basic MOZYME approach were made in an attempt to reduce the computational requirements. These changes were incorporated into the program MOPAC2007. All calculations were performed using 3.6 GHz PCs, each of which had either 1 or 2 GB of RAM.
Solving the SCF equations
A consequence of the way that semiempirical methods developed is that the SCF equations are usually solved using matrix algebra techniques, and because the computational effort required for such operations scales as the third power of the number of atomic orbitals involved, conventional methods of solving the SCF equations are impractical when applied to large systems.
One way to accelerate the solution of the SCF equations for a large system is to divide it into smaller pieces and then solve the SCF equations for each piece. Thus, if a large system is divided into n approximately equal fragments, then the CPU time required to solve the SCF equations for each of the n fragments would be 1/n³ of that for the entire system. The increase in speed as a result of using this divide-and-conquer method would therefore be approximately n². The divide-and-conquer approach has been used successfully in modeling biological systems, including proteins of the type discussed here.
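The scaling argument above can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes ideal cubic scaling and ignores any cost of coupling the fragments back together; the function name is illustrative only.

```python
def fragment_speedup(n_fragments: int) -> float:
    """Estimated speedup when a cubic-scaling job is split into equal fragments."""
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    cost_whole = 1.0                                   # whole system, arbitrary units
    cost_per_fragment = cost_whole / n_fragments ** 3  # each piece is 1/n of the size
    cost_all_fragments = n_fragments * cost_per_fragment
    return cost_whole / cost_all_fragments             # equals n**2 in this idealization


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, fragment_speedup(n))
```

Under these assumptions the speedup grows as the square of the number of fragments, which is why finer fragmentation is attractive.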
Obviously, the greatest increase in speed is obtained when the system is divided into the maximum number of fragments. If this concept is taken to its logical limit, the smallest fragment of an electronic structure of a molecule is the individual chemical bond or lone pair. This is the principle behind the localized molecular orbital (LMO) method, MOZYME.
The MOZYME technique begins with generating the Lewis structure for the system. By their nature, Lewis structures are a better approximation to the correct electronic structure than is the starting point in conventional SCF methods. Given a starting Lewis structure consisting of monatomic (lone pairs) and diatomic (bonds) LMOs, an improved electronic structure can then be obtained by annihilating the energy terms connecting the nearby occupied and virtual LMOs by performing a series of 2×2 Euler rotations. This results in a lower energy, and, by implication, an improvement in the electronic structure. When this annihilation process is performed repeatedly, the energy of the system is reduced to a minimum, and therefore, by implication, the SCF equations are solved. As with the SCF procedure in conventional methods, this is necessarily an iterative procedure but, because LMOs are used, the computational effort required scales almost linearly with the size of the system, instead of the N³ dependency of conventional methods.
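The idea of a 2×2 rotation that annihilates the term coupling one occupied and one virtual orbital can be illustrated with a generic Jacobi-style rotation. This is a sketch of the underlying linear algebra only, not the MOZYME implementation; the function name and the numerical values are hypothetical.

```python
import math


def annihilate_coupling(e_occ, e_virt, coupling):
    """Rotate a 2x2 symmetric block [[e_occ, coupling], [coupling, e_virt]]
    so that the off-diagonal (coupling) term vanishes.

    Returns (new_e_occ, new_e_virt, new_coupling).
    """
    gap = e_occ - e_virt
    if gap == 0.0:
        theta = math.pi / 4 if coupling != 0.0 else 0.0
    else:
        # Small-angle branch: keeps the "occupied" orbital mostly itself.
        theta = 0.5 * math.atan(2.0 * coupling / gap)
    c, s = math.cos(theta), math.sin(theta)
    new_occ = c * c * e_occ + 2.0 * c * s * coupling + s * s * e_virt
    new_virt = s * s * e_occ - 2.0 * c * s * coupling + c * c * e_virt
    new_coupling = (c * c - s * s) * coupling + c * s * (e_virt - e_occ)
    return new_occ, new_virt, new_coupling


if __name__ == "__main__":
    # Hypothetical energies in arbitrary units.
    print(annihilate_coupling(-10.0, 2.0, 1.5))
```

When the occupied level starts below the virtual level, the rotation lowers the occupied energy and raises the virtual energy by the same amount, which is the sense in which each rotation "results in a lower energy".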
Use of point charges
A further, large, increase in computational efficiency can be obtained by replacing the NDDO set of approximations with simpler terms for pairs of atoms that are separated by large distances. The NDDO approximation is only necessary for pairs of atoms that have a finite electronic interaction, that is, for atom pairs whose common density matrix terms are significantly non-zero. Without significant loss of precision, the NDDO approximation for all other atom pairs can be replaced by either a point charge or a point charge plus polarization term, to represent the effect of lone pairs. To allow the transition from exact NDDO to point charge plus polarization to be varied, a parameter, CUTOFF = n.nn, was added to the set of keywords used by MOPAC.
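As a rough illustration of what a distance threshold does, the sketch below sorts atom pairs into those that would keep the full treatment and those that would be simplified. It assumes a hard switch at the cutoff distance and uses hypothetical coordinates; the text describes a "transition", so the actual behavior in MOPAC may be smoother or use additional criteria.

```python
import math
from itertools import combinations


def classify_pairs(coords, cutoff):
    """Split atom pairs into (near, far) lists of index pairs.

    near: separation <= cutoff (Å), candidates for the full NDDO treatment.
    far:  separation  > cutoff (Å), candidates for the simplified treatment.
    """
    near, far = [], []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if math.dist(a, b) <= cutoff:
            near.append((i, j))
        else:
            far.append((i, j))
    return near, far


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]  # hypothetical
    print(classify_pairs(atoms, cutoff=9.0))
```

Because the number of "near" pairs per atom is roughly constant in a large system while the number of "far" pairs keeps growing, simplifying the far pairs is where the saving comes from.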
Table 1 Effect on geometry optimization of chymotrypsin of varying CUTOFF. Times are averages over six cycles. Times for cycles of less than 5 Å varied too much to be useful. RMS: root mean square
Column headings: CUTOFF (Å); Average time per cycle (s); Gradient norm of “exact” PM6 geometry (kcal mol−1 Å−1); RMS difference from “exact” PM6 geometry (Å); ΔHf of optimized geometry (kcal mol−1); ΔHf of “exact” PM6 geometry (kcal mol−1)
Changes in forces acting on the fully optimized PM6 structure as a result of varying the value of CUTOFF were calculated using the structure obtained from CUTOFF = 18 Å as the reference PM6 geometry. When single-point calculations were performed on the reference PM6 structure, the gradient norm increased rapidly as the transition distance decreased, reflecting the close relationship of forces acting on the geometry (Table 1, column 3), and the degree of distortion of that geometry (Table 1, column 4). These geometric changes were also reflected in the heat of formation, where, as expected, the change in ΔHf on going from the reference PM6 structure to the fully optimized structure increased as CUTOFF decreased. Interestingly, the changes in ΔHf for the optimized geometry and for the reference geometry did not follow any obvious pattern as the value of the CUTOFF decreased.
Internal coordinates are normally used when the geometries of small molecules are being optimized, because of their increased efficiency compared to using Cartesian coordinates. This is true even for large systems, provided that the topology of the system does not include any large rings. In this context, hydrogen bonds, salt bridges, and other weak interactions can be regarded as bonding interactions, and therefore by implication they contribute to the topology. As proteins invariably contain many such weak interactions, their topologies also invariably contain many large rings. If internal coordinates were used in defining systems that contain large rings, then any small change in internal coordinate angles would result in large changes in interatomic separations involving bonded atoms, atoms often far away from those used in defining the angle. This would result in large fluctuations in the heat of formation and a consequent failure of the optimization procedure. To avoid this problem, Cartesian coordinates were used in all optimizations reported here.
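The sensitivity described above can be made concrete with a simple lever-arm estimate. This is a simplified picture that treats everything beyond the angle as a rigid arm; the distances used are hypothetical.

```python
import math


def displacement_from_angle_change(distance, delta_degrees):
    """Chord length (Å) moved by an atom `distance` Å from the pivot
    when an internal-coordinate angle changes by `delta_degrees`."""
    return 2.0 * distance * math.sin(math.radians(delta_degrees) / 2.0)


if __name__ == "__main__":
    for r in (1.5, 10.0, 30.0):  # hypothetical distances from the pivot atom
        print(r, round(displacement_from_angle_change(r, 1.0), 3))
```

A one-degree change moves a directly bonded atom by only a few hundredths of an ångström, but an atom 30 Å away by about half an ångström, which is large compared with the length of a hydrogen bond that closes a ring.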
Two conventional geometry optimization methods have been developed for use in semiempirical packages such as MOPAC. Both, however, were found to be unsuitable for optimizing the geometries of large systems. The more efficient method, Baker’s Eigenfollowing (EF) technique, involves matrix operations which, while very rapid for small numbers of geometric parameters, soon become impractical when applied to large systems because the number of arithmetic operations required by such operations scales as the third power of the number of parameters. To a lesser degree, the same problem precludes the use of the older BFGS [19, 20, 21, 22] procedure. In addition, since both methods require the construction and manipulation of matrices whose size is proportional to the square of the number of parameters, as the size of system increases, the memory requirements of these methods eventually become prohibitive.
The problem of optimizing large numbers of parameters to minimize the value of a function occurs often in computational methods, and has been a focus of interest to the Optimization Technology Center. This group developed the L-BFGS method [23, 24], a limited-memory quasi-Newton code for unconstrained function optimization. The L-BFGS optimization technique is, as its name suggests, a modification of the BFGS method that is specifically designed for use with systems involving large numbers of parameters. Because it is based on the BFGS method, and because it does not make full use of the partial Hessian constructed in the EF procedure, the L-BFGS method is less efficient than EF for optimizing the structures of systems of fewer than about 2,000 parameters. However, for systems with over 2,000 variables, the L-BFGS method was found to be significantly more efficient, and as a result was made the default for large systems.
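The memory argument can be illustrated with a back-of-the-envelope estimate for the largest system mentioned in the abstract (14,566 atoms, three Cartesian coordinates each). The sketch assumes 8-byte floating-point numbers and an L-BFGS history of 10 correction pairs; the history length is an assumed, typical value and is not taken from the text.

```python
BYTES_PER_DOUBLE = 8


def dense_hessian_bytes(n_params: int) -> int:
    """Storage for a full n x n matrix, as needed by EF or BFGS."""
    return n_params ** 2 * BYTES_PER_DOUBLE


def lbfgs_history_bytes(n_params: int, n_pairs: int) -> int:
    """Storage for n_pairs (step, gradient-change) vector pairs kept by L-BFGS."""
    return 2 * n_pairs * n_params * BYTES_PER_DOUBLE


if __name__ == "__main__":
    n_params = 3 * 14566   # Cartesian coordinates of the largest protein
    history = 10           # assumed number of stored correction pairs
    print("dense matrix :", dense_hessian_bytes(n_params) / 1e9, "GB")
    print("L-BFGS memory:", lbfgs_history_bytes(n_params, history) / 1e6, "MB")
```

Under these assumptions the full matrix needs roughly 15 GB, far beyond the 1 or 2 GB machines used in this work, while the limited-memory history needs only a few megabytes.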
Geometry optimization was typically performed in three stages. First, the positions of all hydrogen atoms were optimized. In those cases where the hydrogen atoms were not reported in the starting X-ray structure, hydrogen atoms were added, but obviously their initial positions were only estimates. Where hydrogen atoms were present, the hydrogen bond lengths were usually too short by 0.05–0.15 Å. Errors of this type became immediately apparent when the positions of the hydrogen atoms were optimized, the heat of formation of the resulting structure being invariably much less than that of the starting geometry, typically by hundreds of kcal mol−1. CUTOFF was set to 5 Å during this process. In the second stage, the positions of all atoms were optimized, again using a CUTOFF of 5 Å. This operation was terminated when the gradient dropped to 20 kcal mol−1 Å−1. Finally, CUTOFF was set to 9 Å, and the geometry re-optimized. For small proteins, optimization was terminated when the gradient dropped below 1 kcal mol−1 Å−1. For many of the larger proteins this criterion was too severe, and, instead, optimization was terminated when the heat of formation did not decrease significantly over 20 cycles of optimization. With the exception of the largest proteins, this typically corresponded to a gradient of 10 kcal mol−1 Å−1.
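The three-stage protocol and its two termination rules can be summarized as a small data structure plus a stopping test. This is a sketch only: the class and function names are illustrative, and `min_drop` is a hypothetical stand-in for "did not decrease significantly", which the text does not quantify.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    atoms: str                        # "hydrogens" or "all"
    cutoff: float                     # CUTOFF in Å
    gradient_target: Optional[float]  # kcal mol^-1 Å^-1; None where the text gives none


STAGES = (
    Stage("hydrogen positions", "hydrogens", 5.0, None),
    Stage("all atoms, coarse", "all", 5.0, 20.0),
    Stage("all atoms, final", "all", 9.0, 1.0),   # 1.0 is the small-protein criterion
)


def should_stop(gradient_norm: float,
                heats: Sequence[float],
                gradient_target: Optional[float],
                window: int = 20,
                min_drop: float = 0.1) -> bool:
    """Stop on a small gradient, or when the heat of formation has fallen by
    less than `min_drop` (assumed threshold, kcal/mol) over the last `window` cycles."""
    if gradient_target is not None and gradient_norm < gradient_target:
        return True
    if len(heats) > window:
        return heats[-window - 1] - heats[-1] < min_drop
    return False
```

The second rule corresponds to the fallback used for the larger proteins, for which the 1 kcal mol−1 Å−1 gradient criterion was too severe.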
Starting geometries of all the systems reported in this work were obtained from the Protein Data Bank (PDB). Because of the way X-ray structures are determined, modifications had to be made to all the structures obtained from the PDB before they could be used in meaningful quantum mechanical calculations. The most common preconditioning operations were:
Where positional or structural disorder existed, the disorder was resolved to yield a single structure. Disorder of this type occurs naturally, but because semiempirical simulations require a well-defined system, only one of the various structures reported could be used. Serendipitously, disorder of this type invariably occurred only where its presence was not important. That is, in all cases where disorder exists, all candidate structures were equally suitable, and the choice could be made arbitrarily without significantly affecting any important sites in the system.
Where atoms in a residue or even entire residues were missing, valency requirements were satisfied by adding hydrogen atoms. As with positional and structural disorder, when groups or residues were missing, the absences again invariably occurred in sections of the chain far away from any interesting features of the protein such as active sites.
The main structural deficiency, arising from the way X-ray structures are generated, was the frequent and complete absence of any hydrogen atoms. These were added as needed. Several residues, such as Asp and Asn, contain ionizable groups; all such groups were represented by their initially neutral forms, this being regarded as the most easily definable ionization state for the entire system.
Where the positions of hetero molecules, such as water, sulfate, phosphate, ethanol, or other small organic species, were given or indicated, such hetero molecules were used in the model.
Preconditioning was completed with a preliminary calculation to optimize the positions of all hydrogen atoms.
The suitability of PM6 for modeling proteins was investigated by modeling several properties of proteins obtained from the PDB. Because the focus of this work was to determine the applicability of PM6 to modeling proteins, systems were selected that illustrated specific properties. Many of the proteins examined here have been the subject of intense and extensive study because of some important biological property, such as the actinic response of bacteriorhodopsin. However, for the purposes of this work, such properties were considered to be of secondary importance.
Before any computational method can be used for modeling proteins, the ability of the method to accurately reproduce known structures must be determined. This will be demonstrated for PM6, after which the applicability of PM6 to the study of other properties, including both chemical, such as the ability to catalyze reactions and prediction of pKa, and physical, such as biomechanical behavior, can then be examined.
Proteins can be very large molecules. Even the smallest of the systems used in this work is much larger than the largest molecule that was used during the development and testing of PM6. Because of this, analyses of the type used previously in reporting PM6 geometric results were considered to be inappropriate, and, instead, the analysis of the accuracy of prediction of protein geometries given here will be split into the various levels of structural complexity normally found in proteins. In order of increasing complexity, these are: primary, which is the structures of individual amino acid residues and their order in the polypeptide chain; secondary, which deals with common local structures, such as α helices, β sheets, and turns, in which the individual structural motifs are held together by hydrogen bonds; tertiary, which deals with packing motifs involving secondary structures; and, finally, quaternary, which is the packing together of entire protein subunits, i.e., two or more chains.
Root mean square differences in structures of the 20 amino acids calculated using B3LYP and PM6 (column heading: RMS error, Å)
The peptide linkage can be represented by N-methyl acetamide, where PM6 predicts the C–N distance to be 1.396 Å, slightly longer than the B3LYP value of 1.367 Å.
Unlike the primary structure, there is a wealth of accurate X-ray data for secondary structures. Three structural elements occur frequently in secondary structures: the α helix, the β sheet, and turns. As these involve very different structural features, they will be treated separately.
Alpha helices consist of a right-handed helical arrangement of the polypeptide backbone in which the pitch, that is, one complete turn of the helix, involves 3.6 residues and results in a translation of about 5.4 Å along the axis. Helices are stabilized by the N–H of the amide group on residue n forming a hydrogen bond with the C=O of the amide group on residue n − 4. A good example of helix structure is provided by the halobacteria protein bacteriorhodopsin (bR). Bacteriorhodopsin is a trans-membrane protein consisting of seven α-helices in the center of which is an extended conjugated π system, retinal, that forms a Schiff base with one of the residues, Lys216.
β antiparallel sheet
Another important secondary structure in proteins is the β sheet, in which individual β strands form hydrogen bonds with adjacent β strands, resulting in either parallel or antiparallel β sheets, depending on the relative direction of the strands. A classic example of the stronger of these two arrangements, the β antiparallel sheet, is provided by silk. Silk is a strong fibrous protein produced by arachnids and some insects. Because of its wide availability, one of the most studied forms of silk is Silk I from the mulberry silkworm, Bombyx mori. A putative structure for the crystalline domain of Silk I is represented in the PDB by a simulation, entry 1SLK, where, for computational simplicity, the naturally occurring structure was replaced by the model high polymer, poly(L-Ala-Gly). In that work, the authors modeled the crystalline structure using a large discrete cluster consisting of 15 heptapeptides. They postulated a novel structure that was consistent with reported X-ray diffraction patterns, in which the packing of the antiparallel β-sheets gave rise to a highly symmetric structure. The unit cell was reported to be orthorhombic, with dimensions a = 8.94 Å, b (fiber axis) = 6.46 Å, and c = 11.26 Å. Subsequent X-ray work on B. mori Silk I, that is, using the naturally occurring polymer, confirmed this structure, and gave the unit cell dimensions as a = 9.38 Å, b = 9.49 Å, and c (fiber axis) = 6.98 Å.
β parallel sheets
Protein structures are often stabilized by salt bridges, which, although much weaker than covalent bonds, are significantly stronger than hydrogen bonds, principally due to the electrostatic interaction arising from the large, almost unit, charges on the ionized residues.
Some salt bridges have been identified in X-ray structures, such as the two salt bridges reported in the high-resolution structure of crambin (1CBN) between Arg10 and the carboxylate terminus Asn46, and between Arg17 and Glu23. Positional disorder was indicated in 1CBN by the letters “A” and “B”. By default, set “A” was used. Crambin is a small globular protein found in Abyssinian cabbage seeds, contains only 46 residues, and, although it has no obvious biochemical activity, is a popular test of modeling methods. In part, this can be attributed to its unusual rigidity arising from three disulfide bridges and to the presence of two distinct α helices, a set of features normally only found in larger proteins. Optimization of the X-ray structure of 1CBN using PM6 proceeded smoothly: the salt bridges were preserved in the optimized structure, and the RMS difference between the PM6 and X-ray structures was ∼0.9 Å. When the salt bridges were neutralized and the structure re-optimized, the salt bridges reformed. That is, during the optimization procedure, the protons on the two carboxylate groups spontaneously migrated to the nearby arginine residues, the implication being that crambin with salt bridges is considerably more stable than crambin in the neutral form.
Crambin is unusual in that the locations of the hydrogen atoms were reported. In general, only the positions of the heavy atoms are given, so that, for most proteins, no inferences can be made regarding the locations of salt bridges. However, as with crambin, when the geometry of the neutral form of a protein is optimized, salt bridges might form spontaneously, strongly suggesting the existence of salt bridges in the natural form. An example of such a phenomenon is provided by barnase, a ribonuclease produced by Bacillus amyloliquefaciens. PDB entry 1RNB reports the structure of the complex formed between barnase and the dinucleoside monophosphate d(GpC). One residue, Ala1, and several atoms in Gln2 were missing, but as these were distant from the sites of interest their absence is unlikely to affect the analysis significantly. Barnase is interesting in that two pairs of ionizable residues, Asp93 and Arg69, and Asp75 and Arg83, are in close proximity and are therefore candidates for the formation of salt bridges.
In accordance with the standard preconditioning protocol, the initial geometry used in the optimization had all potentially ionized sites represented by their neutral form. However, during the optimization procedure, a proton that was initially on Asp93 spontaneously migrated to Arg69 to form a salt bridge. The optimized distance between Cγ of Asp93 and the Cζ of Arg69 was 3.97 Å; in 1RNB, the distance is 4.22 Å, i.e., the optimized PM6 distance was 0.25 Å shorter than that found experimentally. The other putative salt bridge, between Asp75 and Arg83, was not present in the optimized PM6 structure, but when the proton was moved from the aspartic acid to the arginine and the optimization re-run, the resulting heat of formation dropped by 17.5 kcal mol−1. This strongly suggested that the salt bridge was indeed the stable form, and that a possible potential energy barrier to proton migration existed between the neutral and zwitterionic forms. Additional evidence for the salt bridge is provided by the X-ray structure: in 1RNB the distance between Cζ of Arg83 and Cγ of Asp75 is 4.21 Å. In the PM6 salt-bridge form, the equivalent distance was 4.00 Å, and in the neutral form the distance increased to 5.08 Å.
Reliability of prediction of secondary structures
One consequence of the fact that all geometry optimizations reported here start with the preconditioned native structure is that, of necessity, the structures predicted by PM6 calculations are biased towards the reference structure. An estimate of the degree to which this compromises the validity of the inferences being presented here was obtained by examining the effect of optimizing the ten reference structures in 1CAG, a nonapeptide from Drosophila melanogaster, all of which represent the same system.
This fragmentary system contained 121 atoms, and had the sequence PFCNAFTGC, with the two Cys residues being connected by a disulfide bond. No obvious hydrogen bonds were present. All ten NMR structures reported for 1CAG were used and, as the locations of the hydrogen atoms were given, the only preconditioning required was the removal of H8 from Pro1 and the addition of a hydroxyl group to Cys9, to make a terminal carboxylate group. All optimizations proceeded rapidly, each taking only about 13 CPU minutes. The RMS difference between the PM6 optimized structure and that reported in 1CAG averaged about 1.1 Å, a value much larger than expected for a biochemical of this size. Examination of the optimized structures showed that each of them corresponded to a different minimum on the potential energy surface (PES). That is, there were potential energy barriers between all the conformers.
Comparison of two of the structures, those corresponding to the highest and the lowest energy minima, showed that the higher energy conformer differed from the lower energy conformer by two distinct conformational differences, both of which involved a moiety of form –CH2R with the “R” group being rotated by ∼120° in one conformer relative to the other. The higher energy structure was 18 kcal mol−1 above the lower energy structure, and the RMS difference in geometries was 1.81 Å. When the “R” groups of the higher energy conformer were rotated to match the lower energy conformer, and the structure re-optimized, the heat of formation dropped to within 1 kcal mol−1 of the lower energy conformer. Despite this, the geometries were still very different, with the RMS difference dropping only slightly, to 1.78 Å. When the higher energy system was continuously distorted so as to reduce the RMS difference between it and the lowest energy structure, and the geometry re-optimized at each point, the heat of formation did not rise. Instead, it decreased monotonically, indicating that no barrier existed between these structures. The obvious inference was that the two structures were in the same minimum, and that the PES near the bottom of that minimum was extremely flat. This condition—a very flat PES in the vicinity of the minimum—was also initially anticipated to occur with the larger proteins investigated. However, with few exceptions, this was not the case; subsequent work showed that 1CAG was unusual among the systems investigated in that all the other systems had better-defined minima. One possible reason for this phenomenon is that, because the nonapeptides in 1CAG were not complete biological molecules, they did not have the requisite hydrogen bonds and other structures that would normally confer rigidity to naturally occurring proteins. 
To investigate this conjecture, two naturally occurring proteins, bR and crambin, both of which have multiple structures in the PDB, were examined.
Root mean square differences for X-ray and PM6 structures of bacteriorhodopsin (bR), heavy atoms for residues 8:151 and 168:224 only. PM6 structures were obtained by starting the optimization from the appropriate X-ray structure
Surprisingly, the errors did not decrease. Instead, in all cases the RMS difference between the optimized PM6 structures was larger than that between the equivalent reference structures. This unexpected result prompted further examination of the reference data. The two most similar structures were 1BRX and 1C3W, with the RMS difference being only 0.78 Å, and, since the distance between them on the PES was a minimum, they constitute the pair of structures that would be the least likely to have a barrier between them. Nevertheless, on examining the X-ray structures, the N–C–C–N torsion angle, i.e., the ψ angle, for Leu62, was very different in the two structures, being −8.9° in 1C3W and 149.6° in 1BRX, i.e., cis in 1C3W and trans in 1BRX. Conversion between cis and trans does not occur readily because of the presence of large steric barriers, so these two structures obviously represented points on the sides of different minima. As all other pairs of structures are separated by even greater distances, the likelihood is that each such pair is in a different minimum, and, by extension, all six bR structures are in different minima. The failure of the energy minimization procedure to show that any pair of structures was related by continuous deformation is thus rationalized.
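A torsion angle such as the N–C–C–N (ψ) angle discussed above can be computed from four Cartesian positions. The following is a minimal sketch using a common atan2-based formula; the coordinates in the usage lines are hypothetical and are not taken from any PDB entry.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_degrees(p0, p1, p2, p3):
    """Signed torsion angle (degrees, range -180..180) about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    cis = torsion_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
    trans = torsion_degrees((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))
    print(cis, trans)
```

Values near 0° correspond to the cis arrangement and values near ±180° to the trans arrangement, so angles of −8.9° and 149.6° fall clearly into different classes.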
In the middle of bR is a highly conserved biochemical structure, an extended (> 18 Å long) conjugated π system, retinal, that forms a Schiff base with Lys216. This structure is central to the actinic behavior of bR. Comparison of the 1C3W and 1BRX optimized structures for retinal gave an RMS difference of 0.31 Å, that is, they were almost identical. This was the expected result, given that the retinal group is critical to the functioning of bR and therefore its geometry would be expected to be highly conserved. Retinal is normally represented by the ionized form, bR(+), with the ionized Lewis structure site being the nitrogen atom. Interestingly, the results of a PM6 calculation indicate that there was almost no charge on the nitrogen, and instead that the charge was delocalized over the extended π system. This delocalization can be rationalized in that there are 11 valid Lewis structures for the Schiff base, each of which puts the positive charge on 1 of the 12 atoms in the conjugated system—no Lewis structure can put the charge on the penultimate carbon, that is, the first carbon of the cyclohexene. As all of these are equally valid, there is no a priori reason to select the nitrogen atom.
The issue of bias arising from the initial choice of geometry was investigated further by examining two high-accuracy structures, 1CBN and 1EJG, of crambin. Like 1CBN, 1EJG had positional disorder, and again set “A” was used. An unexpectedly large RMS difference of 0.83 Å was found when the two X-ray structures were compared. Most of this difference was traced to different sequences of atoms in the two files, and after re-sequencing the RMS difference dropped to 0.24 Å. Further examination of the two PDB files showed that the structures corresponded to different conformers: where the structure derived from 1CBN had a methyl group on Ile7, the equivalent methyl group in the 1EJG PM6 structure was rotated about Cβ–Cγ by about 120°. This positional disorder was noted in both PDB files. Additionally, and not reported in the PDB files, the structures of Asn46 were different, in that the side-chain O and NH2 groups were interchanged.
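The effect of atom ordering on an RMS comparison is easy to reproduce. The sketch below assumes that the two structures are already superimposed and that atoms are matched by position in the list; the text does not state how the superposition was performed, and the coordinates are hypothetical.

```python
import math


def rms_difference(coords_a, coords_b):
    """Root mean square distance (Å) between atoms paired by list position."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and the same length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
    same_atoms_reordered = [reference[1], reference[0], reference[2]]
    print(rms_difference(reference, reference))             # identical ordering
    print(rms_difference(reference, same_atoms_reordered))  # two atoms swapped
```

Two files that describe exactly the same geometry can therefore show a sizeable apparent RMS difference if their atoms are listed in a different order, which is why re-sequencing is needed before the comparison is meaningful.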
Optimization of both structures resulted in the side chain of Asn46 in 1CBN rotating by almost half a circle. The resulting ΔHf and optimized geometry were similar to those of the optimized PM6 structure of 1EJG. When the torsion angle of the Asn46 side chain in 1CBN was constrained at the X-ray value and the geometry again optimized, the resulting ΔHf was several kcal mol−1 higher. Based on these results, the obvious deduction can be made that the orientation of Asn46 in 1EJG is correct, and that the orientation in 1CBN is wrong. Additionally, the effect of bias in favor of the original PDB structure, although present, is seen to be quite small—the side chain of Asn46 in 1CBN spontaneously rotated to match 1EJG.
In single-chain proteins, the tertiary structure is the three-dimensional geometry of the entire polypeptide. For multi-chain proteins, tertiary structure refers to the structure of each chain. Two main types are of interest here: (1) globular proteins, composed of α helices, β sheets, and turns, with the overall structure being stabilized by weak intra-chain interactions, such as salt bridges, hydrogen bonds and π-bonding, and (2) long, thin, rigid proteins, composed mainly of α helix, and usually consisting of two or more chains.
Root mean square errors for PM6 optimized structures of simple proteins (PDB: Protein Data Bank). Columns: No. of residues; RMS error (Å); No. of atoms.
Transcription factor IIIA
(C359H516N88O118) (Ac)6 (H2O)85
(C569 H870 N159 O194 PS) (H2O)94
Escherichia coli rho
Turkey egg white lysozyme
Hen egg white lysozyme
(C613H951N193O185S10) (H2O)142 (NaCl)
Bacteria cytochrome c
Bacteria azurin Iso-2
(C763H1213N209O227S3) (C34H34N4O4Fe…CO) (H2O2) (H2O)272
Mealworm thermal hysteresis protein isoform YL-1
Bovine γ-B crystallin
Drosophila melanogaster spectrin
E. coli green fluorescent protein (GFP)
(C1096H1730N295O345S12) (C22H28N3O2F)2 (SO4) (H2O)124
E. coli outer surface protein
(C1112H1887N299O411S) (C3N2H4) (C8H18O5) (C5H12O3) (H2O)388
Yeast triose phosphate isomerase
UDP N-acetylglucose O-acetyltransferase
Bacteriochlorophyll (Fenna-Matthews-Olson protein)
(C2008H3228N572O586S13) (C62H96N13O14PCo…CO) (Cl)3 (H2O)174
Rabbit cytochrome-P450 2C5
(C2309H3631N603O655S22Fe) (C34H34N4O4) (SO4)3
Human triose phosphate isomerase
Dimethylsulfoxide (DMSO) reductase
(C3744H5737N998O1120S28) (MoO2) (C20H24N10O13P2S2)2 (H2O)355
For most proteins, the RMS errors in the PM6 structure are on the order of one Ångstrom, suggesting that PM6 can be used for accurately modeling protein structures and, by implication, for investigating protein properties such as catalysis by enzymes. No obvious dependence of RMS error on the size of the system was noted. This was unexpected: small systematic errors in bond lengths were expected to exist and, if present, would increase the RMS error in proportion to the size of the system.
Root mean square errors for PM6 optimized structures of oligomeric proteins. Columns: No. of residues; No. of chains; RMS error (Å); No. of atoms.
(C359H516N88O118) (Ac)6 (H2O)85
Mealworm thermal hysteresis protein isoform YL-1
Voltage-gated potassium channel
K3((C462H719N115O120S2) (C470H732N118O123S2))2 (C8H20N)(H2O)
E. coli nuclear distribution protein nudE-line 1
Triose phosphate isomerase
(C2683H4175N733O838S17) (C12H22O11)2 (H2O)123
(C2818H4390N764O795S12) (C34H32N4O4Fe….O2)4 (H2O)200
(C4288H6853N1145O1330S47) (C225H396N84O64S2) (H2O)44
Potassium selectivity filter
The potassium channel homotetrameric membrane protein 1JVM forms a filter for migration of potassium ions. In it, four residues from each chain, Thr75, Val76, Gly77, and Tyr78, form part of a channel or tube through which alkali ions can migrate. Because a large fraction of each chain consists of α helix, almost all of the hydrogen bonds are intra-chain, and there are very few available for inter-chain hydrogen bonding. The reason for the stability of the tetrameric structure is therefore of interest.
The superpoison ricin is a heterodimeric protein composed of a 32 kDa N-glycoside hydrolase ribosome-inactivating enzyme, the “A” chain, joined by a disulfide bond to a 34 kDa lectin, the “B” chain. At 529 residues, ricin is one of the larger proteins; nonetheless, like most of these proteins, it is a relatively simple system, in that, with the exception of two residues on chain B, Asn95 and Asn135, which are glycosylated, it is composed only of residues of the 20 common amino acids. From a computational perspective, one of the interesting features of ricin is the nature of the interface between the two chains. The most suitable entry in the PDB for studying this system was 2AAI, which contained both the A and B chains, residues of 14 saccharide moieties, and 123 water molecules.
As a result of its size, geometry optimization was slower than for most other proteins. Each geometry optimization cycle required only about 15 CPU minutes; however, an unusually large number of optimization cycles were required in order to reduce the gradient to 15 kcal mol−1 Å−1, and this resulted in the entire optimization requiring almost 10 CPU days. To verify that the minimum had, in fact, been reached, the optimization was continued, and the gradient reduced further to 9 kcal mol−1 Å−1. This required an additional 3 CPU days of effort, and resulted in the ΔHf decreasing by 26.8 kcal mol−1 to −51071.5 kcal mol−1, that is, a drop of 0.003 kcal mol−1 per atom. The two geometries differed by an RMS of 0.15 Å, and the RMS error between the calculated and X-ray geometries increased from 1.18 Å to 1.23 Å. Based on these results, the decision was made that further work was not justified.
Human oxy-hemoglobin was one of the larger proteins studied. It consists of two α and two β subunits, each of which contains a heme unit, giving the tetramer α2β2. A consequence of the quaternary structure is the allosteric behavior of hemoglobin: as each heme becomes oxygenated, the remaining heme rings increase their affinity for oxygen.
An attempt was made to model the allosteric properties of hemoglobin, but the results were inconclusive.
In addition to the common main-group elements, many proteins also contain metal atoms. These metals can be either covalently or ionically bound to side chains, to non-protein moieties, such as porphyrin ring systems, or they might be ionized and as a result would be highly mobile. Biochemical activity, such as enzymatic catalysis or charge transfer across cell membranes, often depends on the immediate environment of the metal atom. Modeling metalloproteins using earlier semiempirical methods has not been very successful, so the degree to which PM6 can reproduce the environment of the various metal atoms is of obvious interest.
The photosynthetic system of the pigment bacteriochlorophyll-A, 4BCL, contains seven chlorophyll molecules, and thus seven magnesium atoms. This protein is unusual in that a large fraction of its surface is composed of β antiparallel sheet, underneath which are the chlorophyll molecules in close proximity, i.e., they form a compact cluster.
Minor problems were encountered during preconditioning, in that the Lewis structure generated by the MOZYME procedure for the chlorophyll molecules was different to the conventional structure, but as the computed structure was nevertheless a valid Lewis structure, and as there was no overriding reason to use the conventional one, the computed Lewis structure was used to start the SCF procedure. This resulted in no complications. Preconditioning was completed with the optimization of the positions of the added hydrogen atoms.
The metal atom in potassium-containing proteins is invariably almost completely ionized, and therefore does not form covalent bonds. Instead, it is free to migrate within cavities in the protein. One such structure consists of potassium ions in a pore or channel in a transmembrane protein; such proteins are usually involved in regulating the electrical potential across the cell wall.
An example of such a potassium channel membrane protein is the 1JVM protein mentioned earlier; its four chains form a channel in the center of the protein that contains three metal ions, a tetrabutylammonium ion, and a water molecule. In the PDB file, all three K+ ions were replaced by Rb+ ions, that is, by ions of similar type and size, but as the purpose of this study is to investigate systems that can occur in vivo, and because naturally occurring proteins do not normally contain significant amounts of Rb+, before any work was done, the Rb+ ions were replaced once again by K+ ions.
Calcium also occurs in the endoprotease 1C7K found in Streptomyces caespitosus, where it binds to the backbone through Asp76 and Thr78. The environment of calcium in this protein is reproduced with reasonable accuracy: the distance from calcium to the Thr78 side-chain oxygen is 2.50 Å, only slightly larger than the X-ray value of 2.45 Å, but the optimized distance from the Asp76 carboxylate oxygen to calcium is 2.26 Å, substantially smaller than the reported 2.53 Å. Calcium also bonds to four water molecules, with PM6 predicting the Ca–O distance to be 2.41 Å, in good agreement with the reported 2.35–2.51 Å. Although the calcium ion is near to only two backbone residues, the carbonyl oxygen on Thr113 forms a hydrogen bond with one of the water molecules in the first coordination shell of calcium, so the calcium atom could be regarded as also bonding to a third residue.
Iron is one of the most important metals in the biochemistry of the animal kingdom; iron-containing proteins are almost ubiquitous in oxygen-breathing creatures. In its most common form, iron is the central atom in a porphyrin heterocyclic ring system, the heme molecule. A wide range of proteins containing heme molecules are known, and the applicability of PM6 to model such systems is of obvious interest. A potential problem in modeling heme-containing proteins was anticipated as a consequence of the restriction of the current MOZYME technique to closed-shell systems. Iron atoms in the heme system have formal oxidation states Fe(II) or Fe(III), and therefore are almost certainly open shell. Nevertheless, structures of various proteins containing heme were optimized without problems.
Bond lengths to iron in heme in cytochrome c, 1CPQ
The environment of the iron atom in the slightly larger protein carboxy-myoglobin, 1M6C, is similar to that of cytochrome c, except that in addition to the five nitrogen atoms coordinating to the iron, the sixth coordination site is occupied by a carbonyl ligand. In the optimized PM6 structure, the iron remained octahedrally coordinated, and the Fe–CO bond length, at 1.75 Å, was essentially unchanged from that in the X-ray structure, 1.78 Å.
Cytochrome-P450 forms a very large and diverse set of hemoproteins. A typical example of such a system is the rabbit cytochrome-P450, 1DT6. It is structurally similar to cytochrome c and myoglobin, the main difference being that, instead of a histidine nitrogen, the fifth coordination site consists of a sulfur atom from Cys432 forming an axial thiolate bond to the iron atom. At 1.82 Å, PM6 significantly underestimates the Fe–S distance, the X-ray structural value being 2.46 Å. This large error might be attributable to the closed-shell method used, in that the Restricted Hartree-Fock (RHF) calculation requires a covalent single bond to exist between the sulfur and the iron atoms, but this conjecture cannot be tested until either a configuration interaction or unrestricted Hartree-Fock method is available.
With four heme molecules, per-oxy-hemoglobin was the largest iron-containing protein studied. In 1GZX each iron atom is octahedrally coordinated: four bonds to the nitrogen atoms of the porphyrin ring as usual, one bond to a nitrogen in the imidazole ring of His87, and one bond to an oxygen atom of molecular oxygen. As with the other heme systems, PM6 reproduces the Fe–N distances of the porphyrin ring with good accuracy, but it underestimates the Fe–N distance to His87, at 1.98 Å compared to the X-ray 2.26 Å. Conversely, the iron–molecular oxygen distance is slightly overestimated, at 1.71 Å compared to the X-ray 1.61 Å.
Two zinc fingers, from Methanobacterium thermoautotrophicum, 1EF4, and from the human enhancer binding protein, 3ZNF, had unusually large RMS errors, 2.35 Å and 2.00 Å, respectively. These large errors can be attributed in part to the large positive charges on the proteins, and in part to the lack of weak bonds connecting distant parts of the polypeptide chain. In order to allow a realistic calculation to be performed, protons were removed from ionized nitrogen atoms until the systems were neutral. Despite the large RMS distortion, the immediate environment of the zinc atom was accurately reproduced, PM6 predicting the Zn–S distances in 1EF4 to range from 2.27 to 2.48 Å versus the reported NMR values of 2.22 to 2.32 Å, while in 3ZNF the Zn–S distances were predicted to be 2.24 Å versus the NMR structural value of 2.30 Å, and the Zn–N distances averaged 1.98 Å versus the NMR 2.00 Å.
Although selenium is not a true metal, it is convenient to consider it here. Selenium-containing proteins are relatively rare, with one of the smaller proteins being the transcriptional terminator protein rho, 1A62. Rho contains three selenomethionine, –CH2–CH2–Se–CH3, groups, in which selenium forms normal covalent single bonds with two adjacent aliphatic carbon atoms: that is, selenium is in a common chemical environment. After preconditioning, geometry optimization proceeded without problem, and resulted in an RMS difference between the X-ray and PM6 structures of 0.82 Å.
A comparison of the selenomethionine geometries showed a wide range of Se–C distances in the X-ray structure, from 1.78 to 2.06 Å, where PM6 gave a narrower range, from 1.95 to 1.97 Å, close to the 1.93 ± 0.02 Å expected for organoselenium compounds. The C–Se–C angles were similar, averaging 100.2° for the X-ray structure versus 98.1° for the PM6 structure.
Molybdenum is the only second-row transition metal that is required by most living organisms. An example of a molybdenum-containing molecule is the enzyme cofactor molybdopterin, in which it exists as MoIV or MoVI, depending on the nature of its ligands. In molybdopterin, the molybdenum atom forms strong bonds to two oxygen atoms, to form the very stable MoO2 moiety, and weaker bonds to two sulfur atoms. An example of a molybdopterin-containing protein is the enzyme dimethyl sulfoxide (DMSO) reductase, 1DMS, from Rhodobacter capsulatus.
In the PDB entry for 1DMS, positions of heavy atoms in 766 of the 781 residues were given, and all of these were used in the PM6 calculation. The stated formal oxidation state of molybdenum in 1DMS is four, therefore during preconditioning hydrogen atoms were added to both sulfur atoms, so that they would form dative rather than simple covalent bonds to the molybdenum, and thus not contribute to the formal oxidation state. In the native structure, an ionizable oxygen atom on Ser147 is positioned near to the molybdenum atom. As the Mo–O distance is only 1.96 Å, the assumption was made that this oxygen atom was ionized, i.e., unprotonated.
Because 1DMS is larger than most enzymes, the memory requirements were also unusually large. As a result, in order to allow the geometry optimization to be run, it was necessary to reduce the value of CUTOFF from 9 Å to 8 Å, this resulting in a significant reduction in the size of arrays used. With this one change, optimization of 1DMS proceeded without difficulty.
Enzymes catalyze reactions by lowering the activation barrier, that is, the heat of formation of the transition state relative to that of the reactants. Although this is a very important field of biochemistry, computational chemistry tools have hitherto enjoyed only limited success, partly because of limitations in available computational chemistry modeling tools. Reactions cannot be modeled by molecular mechanics methods because such processes involve electronic changes; conversely, while ab initio methods can correctly represent such phenomena, they are impractical because of the prohibitive amount of calculation involved. Although earlier semiempirical methods were much faster, and could model the electronic processes involved, the computational effort was still large, and, even if the calculations were done, the accuracy of the results was insufficient to allow much confidence to be placed in them.
In this work, the ability of PM6 to model structural features of proteins has been demonstrated, and, by using the MOZYME localized molecular orbital method, the computational effort has been significantly reduced, so determining the applicability of PM6 to the study of enzyme reaction mechanisms is appropriate.
A suitable, simple, test case for this work is the formation of a tetrahedral intermediate in the active site of chymotrypsin. Chymotrypsin is a proteolytic enzyme that catalyzes the hydrolysis of peptide linkages, in particular those involving aromatic residues. The generally accepted reaction mechanism involves a triad of residues, His57, Asp102, and Ser195, in which a positive charge, a proton, is shuttled from the serine via the histidine to the aspartate, with the result that the energy required for ionizing the serine is reduced. In turn, this directly lowers the activation barrier for hydrolysis, thus giving rise to the catalytic activity. For the mechanism of the catalytic triad to work, the three residues involved must be positioned precisely in that, if they were too far apart, the barrier to proton migration would be insurmountable.
An approximate structure of the tetrahedral intermediate was constructed by moving the hydroxyl hydrogen atom to the region of the amide nitrogen, and reducing the oxygen–carbon separation to that of a C–O single bond. The expected product from optimization of the geometry of the intermediate was the ester and amine, but the actual structure obtained as a result of energy minimization was the tetrahedral intermediate, i.e., the tetrahedral intermediate was a stationary point on the PES. Geometry optimization of much simpler systems containing the same structural feature resulted in decomposition to ester plus amine, from which it can be inferred that the tetrahedral intermediate was being stabilized by its environment. PM6 has a known fault in that it over-stabilizes Zwitterions, which reduces the likelihood of the tetrahedral intermediate being a minimum on the PES. However, for the purpose of this work, the presence of a stationary point on the PES immediately after the transition state was serendipitous in that it yielded well-defined points on both sides of the transition state.
The prospect of attempting to locate even an approximate structure for the transition state appeared to be a formidable task; conventional methods, such as the SADDLE technique, synchronous transit, assuming a reaction path, etc., would all require an excessive computational effort. In an attempt to simplify this task, a modified synchronous transit method was used.
In the synchronous transit method, the assumption is made that the geometry of a transition state is intermediate between those of the reactant(s) and product(s). If, as is the case here, the reactants and products are very different, in that all the atoms involved in the system are moving, the fact that the assumption might be true still does not provide a reliable guide as to how to locate the transition state. However, by modifying the geometry in such a manner as to reduce the difference in the geometries, while at the same time ensuring that they are on opposite sides of the transition state, the size of the domain in which the transition state is located can be reduced. This result can be achieved by a constrained optimization of the type described below for improving the geometries of X-ray structures. That is, the reactant geometry is re-optimized using a penalty function whose value depends on the difference between the current geometry and that of the product. The same procedure can then be performed on the product geometry. The resulting geometries are then used in a second, and sometimes a third, optimization, culminating in two geometries that are near to the points of inflection on the reaction PES. At that point the two heats of formation are similar, and a good approximation to the transition state can then be obtained by averaging the two geometries. A good approximation to the transition state for the chymotrypsin reaction was obtained using this method.
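The bracketing procedure described above can be written compactly. The following is a sketch under stated assumptions: the quadratic form of the penalty and the symbol c for its weight are assumed here (the text says only that the penalty depends on the difference between the two geometries), and whether the product is re-optimized against the original or the updated reactant geometry is likewise an assumption. R_n and P_n denote the reactant-side and product-side geometries after n rounds, and each minimization is a local one started from the previous geometry on the same side.

```latex
\begin{align}
  R_{n+1} &= \operatorname*{arg\,min}_{X \text{ near } R_n}
             \Bigl[\Delta H_f(X) + c\,\lVert X - P_n \rVert^{2}\Bigr], \\
  P_{n+1} &= \operatorname*{arg\,min}_{X \text{ near } P_n}
             \Bigl[\Delta H_f(X) + c\,\lVert X - R_{n+1} \rVert^{2}\Bigr], \\
  X_{\mathrm{TS}} &\approx \tfrac{1}{2}\,\bigl(R_N + P_N\bigr)
             \quad \text{once } \Delta H_f(R_N) \approx \Delta H_f(P_N).
\end{align}
```

Each round shrinks the distance between R and P while keeping them on opposite sides of the barrier, which is what makes the final average a usable starting point for refinement.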
Conventional procedures are unsuitable for refining transition states of large systems. Thus, in one of the more efficient methods, Baker’s Eigenfollowing technique, the gradient norm is minimized using a quasi-Newton minimizer that requires evaluation of the associated Hessian. In the current chymotrypsin system this would entail evaluation of gradients for 11,901 separate geometries.
There is no obvious choice of the number of atoms to be considered as core atoms, so the normal modes of vibration for the transition state were evaluated using three different sets of atoms: one set of four atoms, consisting of the carbon and nitrogen atoms of the peptide bond being hydrolyzed plus the hydroxyl of Ser195; a second set, of nine atoms, consisting of the first set plus adjacent atoms; and a third set of 18 atoms consisting of the second set plus adjacent atoms. For these three sets, the lowest vibrational frequencies were, in order, i925, i934 and i940 cm−1, and the next lowest vibrational frequency in each set was real and over 100 cm−1. These results provided unequivocal justification for the steps used in this process. The narrow spread of imaginary frequencies implied that even the smallest set of core atoms was large enough to verify that the system was at a transition state. The second and higher vibrational modes, although positive, were significantly different in the various calculations. Again, this was expected, in that, although the character of the transition mode should be conserved between sets of core atoms, this requirement did not extend to the other modes.
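The cost argument can be made concrete with a small sketch of a Hessian restricted to a set of core atoms, built by forward differences of the gradient. This is an illustration, not the procedure actually used: the function name `partial_hessian`, the step size, and the uncoupled harmonic toy gradient are all hypothetical. It assumes one displaced geometry per Cartesian coordinate, under which the 11,901 geometries quoted above would correspond to 11,901 / 3 = 3,967 atoms (an inference, not a figure given in the text), whereas core sets of 4, 9, and 18 atoms need only 12, 27, and 54.

```python
def partial_hessian(gradient, coords, core_atoms, step=1.0e-3):
    """Forward-difference Hessian over the Cartesian coordinates of core_atoms.

    gradient: callable mapping a flat list of 3N coordinates to 3N derivatives.
    Returns (hessian, n_displaced), where n_displaced counts the displaced
    geometries; one extra gradient is evaluated at the reference geometry.
    """
    idx = [3 * atom + axis for atom in core_atoms for axis in range(3)]
    g0 = gradient(coords)
    rows = []
    n_displaced = 0
    for i in idx:
        displaced = list(coords)
        displaced[i] += step
        g1 = gradient(displaced)
        n_displaced += 1
        rows.append([(g1[j] - g0[j]) / step for j in idx])
    n = len(idx)
    # Symmetrize to remove finite-difference noise.
    hessian = [[0.5 * (rows[r][c] + rows[c][r]) for c in range(n)]
               for r in range(n)]
    return hessian, n_displaced

def toy_gradient(q):
    """Hypothetical uncoupled harmonic surface: E = sum of 0.5 * (i + 1) * q_i**2."""
    return [(i + 1) * qi for i, qi in enumerate(q)]

coords = [0.1 * i for i in range(18)]          # six atoms, 18 coordinates
hessian, n_displaced = partial_hessian(toy_gradient, coords, core_atoms=[1, 2, 3, 4])
evaluations = {n_atoms: 3 * n_atoms for n_atoms in (4, 9, 18)}
print(n_displaced, evaluations)
```

Diagonalizing the mass-weighted form of such a block would give the core-atom vibrational modes; that step is omitted here to keep the sketch free of external libraries.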
As expected, the product side of the IRC terminated in the region of the tetrahedral intermediate. Exhaustive optimization of the intermediate gave a ΔHf of −26,165.3 kcal mol−1, only 0.6 kcal mol−1 below the termination of the IRC, at −26,164.7 kcal mol−1. Conversely, and initially unanticipated, the geometry of the starting structure on the reactant side of the IRC was distinctly different from that expected from the X-ray structure. Where the IRC had the ΔHf of the reactants equal to −26,168.5 kcal mol−1, the ΔHf of the optimized X-ray structure was considerably more negative, at −26,190.1 kcal mol−1. The geometries were also very different: although the optimized X-ray structure to transition state geometry represented a motion of over 800 Å, in the IRC the transition state to reactant distance was only 55 Å.
A conjecture to explain the large difference between the reactant geometry from the IRC and that from the optimized X-ray structure can be derived based on a consideration of the nature of the two structures. First, there is overwhelming evidence for the existence of myriad minima on the PES of real proteins, and since the X-ray structure presumably corresponded to the lowest energy conformation of the docked substrate in the active site of chymotrypsin, optimization of the X-ray geometry would naturally result in a very low energy structure. This need not be true when the starting geometry is a transition state. In order to reach the transition state, various geometric changes had to be made, and because the energies involved are relatively large, these motions almost certainly involved large conformational changes, amounting to a RMS motion of 0.35 Å per atom. On descending from the transition state in the direction of the reactant, the IRC would be expected to terminate at the first minimum, a minimum that in this case was 22 kcal mol−1 above the optimized X-ray minimum. This conjecture implied that further examination of the PES between the IRC reactant minimum and the optimized X-ray minimum should reveal many small intermediate minima separated by transition states corresponding to conformational changes, but not any that involve high energy transition states of the type encountered in covalent bond making or bond breaking.
This conjecture was tested by selecting the terminal geometry from the IRC and performing an exhaustive optimization. Although the terminal geometry was optimized within the criteria of the IRC calculation, when the criterion was tightened, the geometry continued to change, accompanied by a steady decrease in heat of formation, until, after 688 further cycles of optimization, it converged to a geometry similar to that resulting from optimization of the X-ray structure. As a result of the monotonic decrease in energy on going from the terminus of the IRC calculation to the fully optimized geometry, the conclusion can be made that the conjecture was incorrect—the hypothetical small local minima that were postulated to impede motion from the transition state to the true minimum did not in fact exist, and that the failure of the IRC to converge on the true minimum can be attributed simply to the different termination criteria.
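The role of the termination criterion can be illustrated with a toy calculation. It is a sketch only: the one-dimensional quartic surface, the step size, and the two gradient thresholds are hypothetical and bear no relation to the protein PES beyond being very flat near the minimum. The same steepest-descent run is stopped at a loose and at a tight gradient threshold.

```python
def minimize(x0, grad_tol, step=0.01, max_cycles=200000):
    """Steepest descent on the toy surface E(x) = x**4, stopping when |dE/dx| < grad_tol.

    Returns (final x, final energy, number of cycles used).
    """
    x = x0
    cycles = 0
    while abs(4.0 * x ** 3) >= grad_tol and cycles < max_cycles:
        x -= step * 4.0 * x ** 3
        cycles += 1
    return x, x ** 4, cycles

loose = minimize(2.0, grad_tol=1.0e-1)   # stands in for a loose termination criterion
tight = minimize(2.0, grad_tol=1.0e-4)   # stands in for an exhaustive optimization
print("loose: x = %.4f, E = %.2e, cycles = %d" % loose)
print("tight: x = %.4f, E = %.2e, cycles = %d" % tight)
```

On a flat surface the energy keeps falling monotonically for many cycles after the loose criterion is met, with no intermediate minima involved, which mirrors the behavior reported above.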
Mechanical properties: Young’s modulus
Some proteins have distinct biomechanical properties, often forming long elastic fibers that have great tensile strength. The behavior of such systems when stretched is therefore of interest. As no work has thus far been reported on the applicability of PM6 to modeling mechanical properties, a preliminary examination of a simple and well-characterized high polymer is warranted.
Description of method and application to polyethylene
The most important characteristic of silk protein is its ability to form strong elastic fibers, a property that can be directly attributed to its antiparallel β sheet structure. Based on X-ray analyses, values of Young’s modulus for the crystalline domains in silk range from 20 to 28 GPa, depending on the degree of crystallinity, and calculations give 13 or 16 GPa, depending on the method of analysis.
Unlike polyethylene, the stress-strain curve for solid poly(l-Ala-Gly) was only very approximately parabolic. At low strains, the force constant was low: K = ∼70 N m−1, and E = ∼29 GPa. In this domain, the increase in the translation vector is achieved by changes in torsion angles. When the strain increased, the modulus also increased, with E eventually rising to ∼60 GPa. This increase can be attributed to the transition from changes in torsion angles to changes in bond angles and bond lengths, so that the nature of the strains becomes similar to that for polyethylene.
An interesting result was obtained when the crystal was compressed. If silk behaved like a classical solid, then, for small deformations, the effect of compression would be essentially the same as that of tension, i.e., the potential well should be approximately parabolic. Instead, with increased compression, the ΔHf varied erratically until, at a compression of about 1.3 Å, a new parabolic surface appeared, one unrelated to that obtained by stretching. The obvious interpretation is that as the uniaxial compression of the crystal is steadily increased, the internal structure changes from one polymorph to the other, with these polymorphs being those described earlier. After all the inter-sheet hydrogen bonds formed, any further compression resulted in stress which manifested itself as an increased ΔHf.
A metastable structure was also found in the compression region. The structure of this polymorph was similar to that found in the tension region, where inter-sheet hydrogen bonds did not form, and its stress-strain curve was apparently a continuation of the tension curve.
The structure of 1A3J was optimized with PM6, using periodic boundary conditions (PBC) to simulate the infinite polymer. Because of the large size of the unit cell, only one unit cell was needed, so the system used to represent collagen was correspondingly small; only 21 residues, that is, only one-third of the repeat distance of a single chain was used. The result of translating one of the tropocollagen sections was to place it at the end of the adjacent tropocollagen section, i.e., the end of each of the three chains joined on to the other end of one of the other two chains, thus preserving the three-fold symmetry. Optimization proceeded rapidly and smoothly, and yielded a translation distance of 20.39 Å, in excellent agreement with that reported.
An estimate of the Young’s modulus for collagen was obtained by stretching the collagen polymer, which involved increasing the translation distance in several steps and allowing the geometry to relax between each step. If the one-third unit cell used in the geometry optimization was used, then increasing the translational distance would result in one chain being pulled apart from the two other chains. To avoid this catastrophe the stretching operation was performed using the entire unit cell, i.e., (Pro-Pro-Gly)21⋅(H2O)120. In other words, one complete repeat unit for each of the three chains was used in the modulus calculation.
For strains up to 10%, the heat of formation increased in proportion to the square of the strain, as expected, giving ΔΔHf = 1.27⋅Δx² + 3.69⋅Δx + 0.48, with an R² of 0.9987. This implied a force constant of 1.76 N m−1, and a Young’s modulus of 6.18 GPa, in good agreement with other theoretical studies, which give values of 4.8 GPa and ≈7 GPa.
Above a 10% strain, collagen became significantly stiffer, with the result that, for a 36% strain, the best quadratic fit of the stress-strain curve was ΔΔHf = 5.28⋅Δx² − 44.64⋅Δx + 100.22 with an R² of 0.9931.
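The step from the fitted quadratic coefficient to the force constant is a unit conversion, sketched below. It assumes that the fits are in kcal mol−1 with Δx in Å, and that the force constant is the second derivative of the fit, k = 2a. The function names are hypothetical. The relation E = kL/A for Young’s modulus is the standard one for uniaxial stress; the repeat length L and cross-sectional area A used for collagen are not given in this section, so they are left as parameters.

```python
AVOGADRO = 6.02214076e23       # mol^-1
JOULE_PER_KCAL = 4184.0        # thermochemical kilocalorie
METRE_PER_ANGSTROM = 1.0e-10

def force_constant_n_per_m(a_kcal_mol_a2):
    """Force constant k = 2a (N/m) for a fit E = a*x**2 + b*x + c,
    with a in kcal mol^-1 A^-2, expressed per repeat unit."""
    a_joule_per_m2 = a_kcal_mol_a2 * JOULE_PER_KCAL / AVOGADRO / METRE_PER_ANGSTROM ** 2
    return 2.0 * a_joule_per_m2

def youngs_modulus_gpa(k_n_per_m, length_m, area_m2):
    """Young's modulus E = k * L / A in GPa; L and A must be supplied by the user."""
    return k_n_per_m * length_m / area_m2 / 1.0e9

k_low = force_constant_n_per_m(1.27)    # fit for strains up to 10%
k_high = force_constant_n_per_m(5.28)   # fit extending to 36% strain
print(f"k (low strain)  = {k_low:.2f} N/m")
print(f"k (high strain) = {k_high:.2f} N/m")
```

The low-strain coefficient reproduces the force constant of 1.76 N m−1 quoted above; the same conversion applied to the high-strain fit gives a value roughly four times larger, consistent with the stiffening described.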
Optimized PM6 structures for proteins are, in general, in good agreement with the starting X-ray structures, albeit thus far PM6 has not been shown to be able to predict such structures de novo. Indeed, there is strong evidence that the myriad local minima on the PES would militate against PM6 being a suitable computational method for making such predictions. Nevertheless, since PM6 predicts very short-scale geometric quantities, such as the primary structure of proteins, e.g., bond lengths and angles, with good accuracy, and since X-ray analyses are ideal for generating secondary, tertiary, and quaternary structures, a combination of the two methods would most likely be of much higher accuracy than either method in isolation. This idea, to refine protein crystal structures using semiempirical quantum mechanically derived energy constraints, was first proposed by Yu, Yennawar, and Merz in 2005, and subsequently critically assessed in 2006.
That primary structures predicted by X-ray analyses are of limited accuracy can readily be demonstrated by observing the precipitous decrease in heat of formation of the first few cycles of a PM6 geometry optimization. For the larger proteins, changes in calculated heats of formation are often in the order of several thousand kcal mol−1. During these early cycles of geometry optimization, errors in X-ray bond lengths and angles—geometric quantities with large force constants—are corrected, and, since large scale motions are not occurring, the RMS error of the optimizing structure remains small.
The rapid drop in energy is then followed by a slow decrease in energy, frequently lasting many hundreds of cycles. During this phase, large changes occur in the secondary, tertiary, and quaternary structures of the protein. As these geometric changes are associated with small or very small force constants, the resulting structure becomes less and less accurate. If a small force constant constraint, of the type proposed by Yu et al., were to be added to the overall system, then errors in the secondary and higher order structures could be minimized, and the optimized geometry resulting from the combined method, PM6 plus X-ray structure, would be of unprecedented accuracy.
The region of interest is where the RMS difference is small, and, at the same time, the strain is also relatively small; this occurs when c is in the domain of 1 to 10. When c = 10, the strain, at 75 kcal mol−1 is 80% less than that in the X-ray structure (354 kcal mol−1), and the RMS error is 0.07 Å. When c is reduced to 1, the strain decreases to 38 kcal mol−1, i.e., 90% less than that in the X-ray structure, but this is offset by the RMS error increasing to 0.19 Å.
From these results, it is apparent that an improved, i.e., nearer to the true structure, geometry can be obtained by using a constrained optimization. It is not obvious whether it is preferable to use a large value of c and only allow about 80% of the strain from the X-ray structure to be relieved, or to use a smaller value, thus relieving more strain, but, at the same time, increasing the risk that significant errors due to faults in PM6 might be introduced; such a decision would depend on the purpose for which the resulting geometry would be used.
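The trade-off between strain relief and deviation from the X-ray geometry can be reproduced qualitatively with a harmonic toy model. This is a sketch under stated assumptions, not the restraint actually used: the penalty is assumed to be quadratic, c times the squared displacement from the X-ray value, and the two coordinates (one stiff, bond-like; one soft, torsion-like), their force constants, and their values are all hypothetical. For such a model the restrained minimum has a closed form.

```python
import math

def restrained_minimum(k, model, xray, c):
    """Minimum of sum(k_i*(x_i - model_i)**2) + c*sum((x_i - xray_i)**2)."""
    return [(ki * mi + c * ri) / (ki + c) for ki, mi, ri in zip(k, model, xray)]

def strain(k, model, x):
    """Model energy of geometry x relative to the unrestrained model minimum."""
    return sum(ki * (xi - mi) ** 2 for ki, mi, xi in zip(k, model, x))

def rms_difference(x, ref):
    """RMS difference between two sets of coordinate values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, ref)) / len(x))

# Hypothetical coordinates: a stiff one whose X-ray value is in error,
# and a soft one for which the model's preferred value is less trustworthy.
k = [300.0, 0.5]
model = [1.53, 2.00]
xray = [1.60, 3.00]

xray_strain = strain(k, model, xray)
results = {}
for c in (10.0, 1.0):
    x = restrained_minimum(k, model, xray, c)
    results[c] = (strain(k, model, x), rms_difference(x, xray))
    print(f"c = {c:4.1f}: strain = {results[c][0]:.3f}, RMS from X-ray = {results[c][1]:.3f}")
print(f"X-ray geometry: strain = {xray_strain:.3f}")
```

In this model the stiff coordinate relaxes almost completely for either value of c, while the soft coordinate stays close to the X-ray value only when c is large. Lowering c therefore relieves more strain at the price of a larger RMS difference, the same pattern as in the results above.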
In all the earlier geometry optimizations reported here, the cutoff for NDDO to point charge plus polarization terms was assigned a large value in order to more precisely reproduce the PM6 method. When a constrained optimization is done, the resulting optimized geometry cannot be described as a PM6 structure, and therefore there is no a priori reason to try to reproduce the PM6 method. On the other hand, a large increase in computational efficiency can be obtained by reducing the cutoff. A cursory test of reducing the cutoff to 6 Å resulted in a negligible distortion of the geometry, but was accompanied by a large reduction in computational effort. The implication of this is that by using a small value for CUTOFF, the method described here provides a simple, very rapid, and general procedure for improving X-ray structures of proteins.
Parameters used in generating charges for pKa calculation
PM6 predicted values of pKa for ionizable hydrogen atoms in proteins
Some pKa values predicted using PM6 were unexpected in that they were large and negative and implied an impossibly strong acid. Examination of the environment of the proton involved revealed several interesting structures. In the simplest of these, a hydroxyl proton had formed a strong hydrogen bond to a water molecule, the resulting structure being intermediate between the neutral system and a hydronium ion in close proximity to an oxygen anionic site. A somewhat more complicated system involved a hydroxyl proton strongly hydrogen bonding to the ionizable nitrogen of a neutral Arg residue: this formed a nascent salt bridge. Still another structure involved two ions, a cation, invariably Arg(+), and two to four water molecules having a net charge of −1, e.g., [H7O4]−, positioned near to the hydroxyl. The close proximity of the large poly-water anion to the hydroxyl unit resulted in an unusually strong hydrogen bond to the hydroxyl proton.
Related semiempirical work
The primary structure of hen egg white lysozyme, represented by 193L, has been optimized with AM1 using a divide-and-conquer method in which the fundamental unit was the residue. In that work, the authors did not optimize the secondary or tertiary structure “because traditional semiempirical Hamiltonians (AM1 and PM3) have serious shortcomings in the description of peptide backbones.” This assertion was investigated here by performing a global optimization of 193L using AM1 and PM3. Somewhat surprisingly, these optimizations yielded structures of accuracy comparable to that of PM6, the RMS error for AM1 being 0.97 Å, for PM3 0.68 Å, and for PM6 0.88 Å.
193L is interesting in that its size, 129 residues, corresponds to the maximum in the size distribution of proteins in the PDB, that is, 193L represents a typical PDB entry. As such, and because it had already been modeled using AM1, the decision was made to optimize the structure of 193L using PM6. The starting structure was the PDB entry. This was complete in that all heavy atoms in all residues were located and, in addition, the PDB file also contained 142 water molecules, and sodium and chloride ions. All these moieties were used in the calculation. This meant that the only preconditioning necessary was the addition of hydrogen atoms to satisfy valency requirements; as usual, all residues were represented by their neutral forms. Optimization proceeded without complication. After geometry optimization was complete, the active site in egg white lysozyme was examined. This active site catalyzes the hydrolysis of a polysaccharide, and, in 193L, is composed of the residues Glu35, Asp52, Trp62 and Trp63. No significant distortions were observed. That is, the active site was accurately reproduced, the RMS error for the four residues involved being 0.53 Å. An exact comparison of the computational effort required for the PM6 calculation with the AM1 work reported by Wada and Sakurai was not possible, but the general impression was that the PM6 calculation using MOZYME ran significantly faster than the combined MOZYME and divide-and-conquer method used in the AM1 calculation.
Given that the starting points for all optimizations were geometries derived from files in the PDB, the influence of the correct answer, i.e., the initial X-ray or NMR structure, must be considered. In all systems examined, there was no ambiguity: the starting structure had a large and obvious influence on the final optimized structure. This is a natural consequence of the presumed existence of a very large number of local minima in even the smallest system studied, and it is undoubtedly the presence of these minima that determines the final structure. No attempt was made to locate the global minimum; this operation is well known to be extraordinarily difficult, and, even if it were possible to locate the global minimum on the PM6 PES, there is no reason to believe that this would be the true minimum. Therefore, all structures reported here should be regarded as being derived from the reference data, and should not be considered as de novo structures. What can be addressed, however, is the level of accuracy of prediction of some of the structures that exist in proteins. These are treated in the following sections, in order of increasing complexity.
Primary and secondary structure
Because of its high accuracy, PM6 would be useful in detecting large errors in original X-ray structures. After preconditioning a structure from the PDB, the gradients or forces acting on the heavy atoms can be calculated. Consistent with the premise that the PDB structure is of good accuracy, most of these should be small. Any large gradients would then be indicative of significant differences between the PDB and optimized PM6 structures. In a survey of several proteins, most large differences were found to occur in arginine residues and in residues containing carboxylate, i.e., Asp and Glu. Carboxylate X-ray structures have C–O distances in a single range, from 1.24 to 1.26 Å, but PM6 predicts three distinct sets of C–O distances: 1.20–1.23 Å, 1.34–1.39 Å, and 1.25–1.27 Å, corresponding to the C=O and C–O–H of the neutral carboxylic acid, and the C–O bonds of an ionized carboxylate (formal charge of −1/2 on each oxygen), respectively. These differences arise from the fact that an ionizable hydrogen atom in a carboxylate group can be in one of two different locations, and presumably in proteins both locations are fractionally occupied, but in the quantum chemical calculation any ambiguity of this type must be resolved. When appropriate averaging was done, the differences in observed and predicted C–O bond lengths decreased considerably. A similar condition exists for arginine, and, when averaging was done, the large difference in C–N bond lengths again vanished. After discounting atoms with large forces attributable to the resolution of fractional populations or ambiguities, the remaining forces can be interpreted as indicators of possible errors in the PDB structures. Several types of such putative errors were found. Representative examples of these are:
Some covalent bonds are of unexpected lengths. Thus, in the X-ray structure of 1CBN, the Cγ–Nδ distance in Asn46 is reported to be 1.25 Å, much less than the expected 1.33–1.38 Å for a bond of this type. A PM6 calculation showed that the forces acting on these two atoms were very large, over 200 kcal mol−1 Å−1, whereas the median force on an atom in 1CBN was only 4 kcal mol−1 Å−1. Two other Asn residues are present in 1CBN, and for both of these the Cγ–Nδ distances, 1.312 and 1.347 Å, were nearer to the values expected. Many errors of this type were found in X-ray structures, and their occurrence was apparently unpredictable.
Some geometric quantities had systematic errors. In the PDB structure for ricin, 2AAI, for example, the reported angles for Cγ–Cδ–Nɛ in tryptophan residues averaged about 104°. This is significantly less than the typical 109–111° angles found in other proteins, less than that in crystalline tryptophan from CSD entry LIHLIX (110.6°), less than that found in the structure of ricin predicted by PM6 (109.6°), and less than that predicted for tryptophan by B3LYP (110.2°), and by PM6 (109.3°).
Some non-bonded distances were unexpectedly short. In the X-ray structure of hemoglobin, 1GZX, an oxygen atom on Phe188 was positioned only 2.55 Å from a nitrogen atom of Asn200. During preconditioning, hydrogen atoms were added and after their positions were optimized one hydrogen atom was positioned between the oxygen and the nitrogen atoms, implying that the hydrogen bond distance to the carbonyl oxygen was only ∼1.5 Å. This would represent a highly unusual, extremely short, and therefore strong, hydrogen bond. On optimizing the structure, the N–O distance increased to the expected 2.9 Å.
Some groups were incorrectly assigned. In crambin, the side-chain, –CH2–CONH2, of residue Asn46 has the locations of the oxygen and amine groups exchanged in PDB entries 1CBN and 1EJG. Obviously only one of these structures can be correct. Errors of this type can occur in X-ray structures when the numbers of electrons in two groups are similar. Only minor motion of Asn46 occurred when the structure of 1EJG was optimized, but a large motion occurred when 1CBN was optimized, with the side chain rotating by almost half a circle, essentially converting to the orientation in 1EJG, which strongly suggests that the orientation of the side chain in 1EJG was correct.
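The screening idea behind these examples, comparing the force on each atom with the typical force in the same structure, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name, the atom labels, the force values, and the threshold factor of 10 are all hypothetical and are not taken from this work.

```python
from statistics import median

def flag_large_forces(forces, factor=10.0):
    """Return labels of atoms whose force exceeds factor * median force.

    forces: mapping of atom label -> gradient magnitude (kcal mol^-1 A^-1).
    factor: hypothetical screening threshold, not a value from the paper.
    """
    typical = median(forces.values())
    return sorted(label for label, f in forces.items() if f > factor * typical)

# Made-up magnitudes, loosely echoing the 1CBN example in which two atoms
# carry forces above 200 while the typical force is a few kcal mol^-1 A^-1.
example_forces = {
    "Asn46:CG": 210.0,
    "Asn46:ND2": 205.0,
    "Thr1:CA": 3.0,
    "Thr2:CB": 4.0,
    "Cys3:SG": 5.0,
    "Cys4:N": 4.0,
    "Pro5:C": 6.0,
}
flagged = flag_large_forces(example_forces)
print(flagged)
```

Using the median rather than the mean keeps the reference value insensitive to the very outliers being sought. Atoms flagged in this way would still need to be checked for fractional-occupancy effects of the kind described for carboxylate and arginine before being treated as possible errors.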
Some differences between X-ray and PM6 structures can be attributed to the lack of environmental effects in the computational model. These are most obvious in surface residues, where solvent and other external effects would be largest. An example of this is provided by the surfactant protein hydrophobin HFBII found in Trichoderma reesei, for which a very high resolution structure has been reported, 2B97. Hydrophobin is a globular protein strongly stabilized by four sets of disulfide bridges. Its surfactant behavior arises from the presence of an unusually large fraction of hydrophobic residues on its surface, and as a result the protein is, as its name suggests, highly hydrophobic.
Geometry optimization resulted in only minor motion of the heavy atoms of the hydrophobic residues, the RMS difference being 0.82 Å, but in a much larger motion of the ionized residues, the RMS error for these being 1.43 Å. All the ionized sites lie on the surface of the protein, and, in both the crystal and in vivo, these sites would be either involved in salt bridges or be solvated. As the PM6 calculation was performed using only the isolated molecule, and therefore solvation effects were not included, the large distortions of the ionized sites can be rationalized.
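RMS differences of this kind, evaluated separately for different classes of residue, can be computed as in the sketch below. It assumes that the two coordinate sets are already superimposed and list the same atoms in the same order; the function name, group names, and coordinates are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def rms_difference(coords_a, coords_b):
    """RMS distance between paired atoms; assumes the sets are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

# Hypothetical heavy-atom coordinates (angstroms), grouped by residue class.
xray = {
    "hydrophobic": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "ionized": [(3.0, 0.0, 0.0), (4.5, 0.0, 0.0)],
}
optimized = {
    "hydrophobic": [(0.3, 0.0, 0.0), (1.5, 0.4, 0.0)],
    "ionized": [(3.0, 1.2, 0.0), (4.5, 0.0, 0.9)],
}

rms_by_group = {name: rms_difference(xray[name], optimized[name]) for name in xray}
for name, value in rms_by_group.items():
    print(f"{name:12s} RMS difference = {value:.3f} A")
```

Reporting the value per group, rather than one number for the whole protein, makes it easier to see when a large overall difference is concentrated in surface or ionized residues.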
Most of the proteins considered in this work are globular, with the secondary structures cross-linking using disulfide bonds, salt bridges, and bridging and normal hydrogen bonds. As bonds of this type are strong relative to the other interactions that determine the shape of the backbone, tertiary structures are reproduced with an accuracy similar to that of the secondary structure. Only when the structures become very large, as in hemoglobin, do the RMS errors approach 2 Å. In some oligopeptides such as the nonapeptides in 1V46, stabilizing inter- and intra-chain bonds are absent; in those cases the PES is very flat, and the structures predicted by PM6 are severely in error.
The agreement between the predicted and reported structures for the zinc finger proteins 1EF4 and 3ZNF was unexpectedly poor, the RMS difference being very large. These systems, and the nonapeptides in 1V46, were among the few structures examined that were derived from NMR analyses rather than from X-ray. While there is no obvious reason for these large differences, the speculation can be made that since the NMR structures were derived from solvent studies and the PM6 calculations modeled the isolated, gas-phase system, the distortions are due to the neglect of solvent effects.
Several proteins with quaternary structure were examined. Among these were ricin, with two sub-units connected by a disulfide bond, and per-oxy-hemoglobin, with four sub-units joined by salt bridges and hydrogen bonds. The quaternary structure is determined mainly by the same types of inter-chain interactions as those found in tertiary structures. No problems specific to quaternary structures were identified in the PM6 optimizations or optimized geometries.
As a result of various modifications, transition states for closed-shell biochemical reactions can now be modeled easily and rapidly. Thus for the hypothetical reaction described above, involving formation of a tetrahedral intermediate in chymotrypsin, the transition state was obtained using a straightforward procedure. The first step in this procedure consists of evaluating stationary points on the PM6 PES corresponding to the reactant and product. There are several ways of preparing these starting points. Using molecular mechanics methods, a substrate can be docked into the active site of an enzyme, and, after preprocessing if necessary, the resulting geometry optimized using PM6. Alternatively, the starting point could be an X-ray structure of the docked substrate. Again, after preprocessing as necessary, geometry optimization would yield the stationary point. Generating the stationary point corresponding to the product requires a knowledge of the reaction, or purported reaction, and a likely first step would consist of modifying the stationary point corresponding to the reactant, followed by energy minimization.
Once both stationary points are available, the systems would then be moved in the direction of the transition state by re-optimizing each geometry after adding to the Hamiltonian a perturbative potential corresponding to the distance to the other geometry. This process is iterative: first, the stationary point with the lower energy is reoptimized in the field of the other geometry, then the geometries are swapped around and the second geometry is optimized in the field of the first, perturbed, geometry. At each stage, the geometry with the lower energy is moved in the direction of the geometry with the higher energy. The process is terminated when the distance between the two geometries is small, e.g., less than 2 Å. In the case of chymotrypsin, this required three complete iterations.
An approximation to the transition state is then readily obtained by averaging the structures of the two strained geometries: this is similar to that used in the synchronous transit method. Given the approximate transition state, refining it to obtain the stationary point, i.e., the minimization of the gradient norm, using traditional methods such as EF, would require evaluation of the Hessian. For large systems such as enzymes, construction of the entire Hessian would be extremely computationally intensive, but, by using the unique property of the reaction eigenvector, that its associated eigenvalue is irreducibly negative, the process can be simplified. Because it is irreducibly negative, the reaction eigenvector has significant intensity on only a few atoms, and if only these atoms are used in the gradient minimization, location of the stationary point can be performed rapidly and efficiently. Of course, such an operation necessarily introduces perturbative forces in the rest of the system, so, as with the previous process, a tandem or two-step iterative procedure is necessary: in the first step, the gradient norm of the atoms likely to contribute significantly to the reaction eigenvector is minimized, and in the second step, the heat of formation of the rest of the system is minimized. The stationary point for the whole system is then rapidly achieved; three iterations were sufficient to refine the transition state for the chymotrypsin system. A further increase in computational efficiency was obtained by eliminating the need to construct the small Hessian used in the gradient minimization step. This was achieved by using Bartels’ non-linear least squares method for obtaining the Chebyshev solution, which, when used in gradient minimization for refining transition state stationary points, turned out to be much more efficient than EF.
Validation of the transition state was performed using two methods. First, the vibrational frequency of the normal mode corresponding to reaction was determined. As with earlier steps, only those atoms that were likely to contribute to the transition eigenvector were included in the Hessian. That a transition state stationary point existed was confirmed by the presence of one, and only one, large imaginary frequency. A second, definitive, test was to map out the IRC. Starting with the stationary point, the geometry was perturbed using the reaction eigenvector from the normal mode calculation. The IRC was then followed until a new stationary point was reached. On reversing the phase of the normal mode, and repeating the IRC calculation, a different stationary point was reached. Examination of these two new points showed that they corresponded to reactant and product.
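The first validation criterion, exactly one negative Hessian eigenvalue at the stationary point, can be illustrated on a toy two-dimensional surface. The sketch below is not the procedure applied to chymotrypsin: the energy function is hypothetical, the Hessian is obtained by simple central finite differences, and no mass weighting is applied, so the eigenvalues are curvatures rather than vibrational frequencies. A negative curvature corresponds to an imaginary frequency.

```python
import math

def energy(x, y):
    """Toy two-dimensional surface: minima at (+/-1, 0), saddle point at (0, 0)."""
    return (x * x - 1.0) ** 2 + y * y

def hessian(f, x, y, h=1e-3):
    """Central finite-difference Hessian of f at (x, y); returns (fxx, fxy, fyy)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)
    return fxx, fxy, fyy

def eigenvalues_2x2(fxx, fxy, fyy):
    """Eigenvalues of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    mean = 0.5 * (fxx + fyy)
    spread = math.hypot(0.5 * (fxx - fyy), fxy)
    return mean - spread, mean + spread

def count_negative_modes(f, x, y):
    """Number of negative Hessian eigenvalues of f at (x, y)."""
    return sum(1 for ev in eigenvalues_2x2(*hessian(f, x, y)) if ev < 0.0)

saddle_modes = count_negative_modes(energy, 0.0, 0.0)
minimum_modes = count_negative_modes(energy, 1.0, 0.0)
print("negative modes at saddle point:", saddle_modes)
print("negative modes at minimum:", minimum_modes)
```

On this surface the point (0, 0) has one negative curvature, along the coordinate connecting the two minima, while the minimum at (1, 0) has none. For a protein, restricting the Hessian to the few atoms that contribute to the reaction eigenvector, as described above, keeps the corresponding matrix small.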
Implications of size
Both the computational effort required for solving the SCF equations and the number of steps required for geometric operations increase with the size of the protein. Because of this, computational experiments involving systems of 9,000 or more atoms, while possible, involve a significant computational effort, often on the order of CPU weeks. As a result, the technique described here should not be regarded as being practical for routine work on systems of more than 9,000 atoms. On the other hand, operations involving relatively small systems, up to 5,000 atoms, can be performed readily, each taking only about 1 CPU day of computational effort. Thus, optimizing the geometry of a substrate docked in an active site of an enzyme, optimizing the product structure, locating the transition state, refining it, and characterizing it would each require about 1 CPU day for systems of a few thousand atoms. Generating the plot of the IRC would involve more effort, typically on the order of 5–10 CPU days.
As most of the proteins in the PDB are smaller than 5,000 atoms, this implies that the bulk of the systems in the PDB are amenable to modeling with the techniques described here using readily available desktop computers.
The PM6 method has been used successfully for modeling a large number of properties of proteins, including metalloproteins, ranging from generating optimized geometries of enzymes, including the primary, secondary, tertiary, quaternary, and active site structures, to comparison of various candidate structures, to predicting Young’s modulus for the stretching of silk and collagen, and, by implication, any regular polymeric system. When the course of a PM6 optimization is biased using experimentally determined geometries of the type found in the PDB, the resulting geometry is likely to be more accurate than either the experimentally derived geometry or that predicted by PM6 alone.
Some active sites were successfully modeled, and, in the case of chymotrypsin, one of the simplest reaction steps in the charge relay mechanism was mapped and verified by determining that the force constant for reaction was negative, and by mapping the intrinsic reaction coordinate. In principle, no problems are anticipated in modeling other reaction mechanisms.
All salt bridges reported to exist in the various enzymes examined were reproduced, but as PM6 is known to favor the formation of salt bridges over the neutral equivalent, this success should be tempered by the concern that some neutral systems might incorrectly be predicted by PM6 to exist as salt bridges. All calculations reported here involved neutral or singly ionized proteins. A significant lowering of energy would likely occur if solvated, more highly ionized species were considered. This would be an obvious field of application of the method described here.
Support for this work was provided by the National Institutes of Health grant No. 1R43GM083178-01. The author also gratefully recognizes the generous contribution of Fujitsu for giving permission to use the MOZYME method and for providing the relevant source code.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.