SHIFTX2: significantly improved protein chemical shift prediction
A new computer program, called SHIFTX2, is described which is capable of rapidly and accurately calculating diamagnetic 1H, 13C and 15N chemical shifts from protein coordinate data. Compared to its predecessor (SHIFTX) and to other existing protein chemical shift prediction programs, SHIFTX2 is substantially more accurate (up to 26% better by correlation coefficient with an RMS error that is up to 3.3× smaller) than the next best performing program. It also provides significantly more coverage (up to 10% more), is significantly faster (up to 8.5×) and capable of calculating a wider variety of backbone and side chain chemical shifts (up to 6×) than many other shift predictors. In particular, SHIFTX2 is able to attain correlation coefficients between experimentally observed and predicted backbone chemical shifts of 0.9800 (15N), 0.9959 (13Cα), 0.9992 (13Cβ), 0.9676 (13C′), 0.9714 (1HN), 0.9744 (1Hα) and RMS errors of 1.1169, 0.4412, 0.5163, 0.5330, 0.1711, and 0.1231 ppm, respectively. The correlation between SHIFTX2’s predicted and observed side chain chemical shifts is 0.9787 (13C) and 0.9482 (1H) with RMS errors of 0.9754 and 0.1723 ppm, respectively. SHIFTX2 is able to achieve such a high level of accuracy by using a large, high quality database of training proteins (>190), by utilizing advanced machine learning techniques, by incorporating many more features (χ2 and χ3 angles, solvent accessibility, H-bond geometry, pH, temperature), and by combining sequence-based with structure-based chemical shift prediction techniques. With this substantial improvement in accuracy we believe that SHIFTX2 will open the door to many long-anticipated applications of chemical shift prediction to protein structure determination, refinement and validation. SHIFTX2 is available both as a standalone program and as a web server (http://www.shiftx2.ca).
Keywords: NMR; Protein; Chemical shift; Machine learning
Chemical shifts are often called the mileposts of NMR spectroscopy. They are easily measured, highly reproducible spectroscopic parameters that can be readily used to identify, annotate or locate individual atoms. Chemical shifts also contain a considerable amount of information pertaining to a molecule’s covalent and non-covalent structure. Indeed, their sensitivity to the type and character of neighbouring atoms has long made chemical shifts a favourite tool of organic synthetic chemists to help decipher the structure of small molecules. Likewise, their sensitivity to a variety of important protein structural features has made chemical shifts equally valuable to protein chemists and biomolecular NMR spectroscopists. In fact, protein chemical shifts can be used to identify secondary structures (Pastore and Saudek 1990; Williamson 1990; Wishart et al. 1991), estimate backbone torsion angles (Spera and Bax 1991; Wishart and Nip 1998), determine the location of aromatic rings (Perkins and Dwek 1980; Osapay and Case 1991), assess cysteine oxidation states (Sharma and Rajarathnam 2000), estimate solvent exposure (Vranken and Rieping 2009) or measure backbone flexibility (Berjanskii and Wishart 2005).
While the extraction of approximate structural features from protein chemical shifts has become almost routine, the extraction of precise structural features is not. In fact, the inherently complex geometric, dynamic and electronic dependencies of protein chemical shifts have made the calculation of precise chemical shifts from protein structures or the calculation of precise structures from chemical shifts a significant challenge for more than 40 years (Sternlicht and Wilson 1967). For the specific task of calculating chemical shifts from structure (i.e. protein chemical shift prediction), at least two different routes have emerged. One is based on using sequence/structure alignment against chemical shift databases (i.e. sequence-based methods) and the other is based on directly calculating chemical shifts from atomic coordinates (i.e. structure-based methods).
Sequence-based methods take advantage of the continuous growth of today’s protein chemical shift databases. The idea behind predicting shifts via sequence homology lies in the simple observation that similar protein sequences share similar structures, which in turn, share similar chemical shifts (Gronwald et al. 1998; Potts and Chazin 1998; Wishart et al. 1997). The first implementation of this concept appeared in 1997 in a program called SHIFTY (Wishart et al. 1997). This relatively simple program takes an input sequence and uses sequence alignment against the BMRB (Seavey et al. 1991) or other chemical shift databases (Zhang et al. 2003) to identify a matching homologue. Once found, the complete set of homologous shifts of the matching protein is “assigned” to the query protein using a set of empirically defined rules. Chemical shifts predicted via sequence homology can be very accurate if a good homologue is found (Wishart and Nip 1998; Wishart et al. 1997). A key advantage of sequence-based methods is that as the chemical shift database (e.g. BMRB) expands, the predictions tend to improve because the odds of finding a suitable sequence homologue tend to increase. A key disadvantage of sequence-based approaches is that no predictions will be performed if no sequence homologue can be found.
A more recent extension to standard sequence-based shift prediction methods is SPARTA (Shen and Bax 2007). Rather than looking for global similarity, as is done with SHIFTY, SPARTA assesses similarity over a much smaller sequence range (just three residues). To predict chemical shifts for a given query protein, each tripeptide in the query structure is searched against the SPARTA tripeptide database and scored on the basis of its sequence and torsion angle (ϕ, ψ, and χ1) similarity. This information is combined with additional structural information (H-bond effects and ring current effects) to calculate a final set of chemical shifts. SPARTA and its successor SPARTA+ (Shen and Bax 2010), have proven to be remarkably accurate, especially for predicting 13C and 15N backbone shifts.
In addition to these sequence-based methods, a substantial number of structure-based methods have emerged over the past 10 years. These include SHIFTCALC (Iwadate et al. 1999), SHIFTS (Moon and Case 2007; Xu and Case 2001), CheShift (Vila et al. 2009), SHIFTX (Neal et al. 2003), PROSHIFT (Meiler 2003) and CamShift (Kohlhoff et al. 2009). All of these programs calculate chemical shifts using only protein coordinates as input. Some methods, such as SHIFTCALC and SHIFTX use empirically derived chemical shift hypersurfaces or related structure/shift tables to translate coordinate data into chemical shifts. Others, such as CheShift and SHIFTS use quantum mechanical models to generate their atom-specific chemical shift hypersurfaces. Still others, such as PROSHIFT, use neural network methods (i.e. machine learning) to predict protein chemical shifts from coordinate data. CamShift employs an ingenious approach to calculate chemical shifts using a set of parameterized distance equations. This makes CamShift’s chemical shift functions both rapid to calculate and easily differentiable. Having a differentiable function is particularly useful for chemical shift refinement via conjugate gradient minimization or molecular dynamics.
All the aforementioned methods are capable of predicting protein chemical shifts with reasonably high accuracy. As a rule, SHIFTX, SHIFTY, CamShift and SPARTA generally perform better than PROSHIFT, SHIFTS, SHIFTCALC and CheShift. Nevertheless, it appears that sequence-based approaches, under certain circumstances, perform better than structure-based approaches, and vice versa. This suggests that by combining the strengths of both approaches, it may be possible to produce a hybrid method that exceeds the performance of any single sequence-based or structure-based method. Here we describe just such a hybrid method, called SHIFTX2. In particular, SHIFTX2 combines many of the structure-based concepts originally introduced in SHIFTX (Neal et al. 2003) with the sequence-based concepts introduced with SHIFTY (Wishart et al. 1997). By making use of a much larger and higher quality training set in combination with a number of other enhancements (using advanced machine learning techniques, employing more structural parameters) the performance of the structure-based component (now called SHIFTX+) was substantially improved. Likewise by using an improved sequence/shift database and by making use of local, instead of global, sequence alignment techniques we were also able to make substantial improvements to the performance of the sequence-based component (now called SHIFTY+). By carefully combining the algorithms for SHIFTX+ and SHIFTY+ we were able to create the hybrid program called SHIFTX2.
As shown below, SHIFTX2 is substantially more accurate (up to 26% better by correlation coefficient and an RMS error that is up to 3.3× smaller) than the next best performing program. It also provides significantly more coverage (up to 10% more), is significantly faster (up to 8.5×) and capable of calculating a wider variety of backbone and side chain chemical shifts (up to 6×) than many other shift predictors. In particular, SHIFTX2 is able to attain correlation coefficients between experimentally observed and predicted backbone chemical shifts of 0.9800 (15N), 0.9959 (13Cα), 0.9992 (13Cβ), 0.9676 (13C′), 0.9714 (1HN), 0.9744 (1Hα) and RMS errors of 1.1169, 0.4412, 0.5163, 0.5330, 0.1711, and 0.1231 ppm, respectively. The correlation coefficients between SHIFTX2’s predicted and observed side chain chemical shifts are 0.9787 (13C) and 0.9482 (1H) with RMS errors of 0.9754 and 0.1723 ppm, respectively. Additional details about SHIFTX2’s algorithms, its training process, its testing protocols and its potential applications are provided in the following pages.
Key to the development of accurate chemical shift predictors is the creation of high quality chemical shift databases. For sequence-based methods it is necessary to develop a large and accurate database of protein sequences and properly referenced protein assignments. For structure-based methods it is critical to develop a large and accurate database of protein structures with correspondingly accurate and comprehensive chemical shift assignments. In developing the database for our sequence-based method (SHIFTY+) we used the chemical shift assignments from RefDB (Zhang et al. 2003). RefDB, which is updated weekly, currently contains 1903 re-referenced protein assignments that are automatically extracted and processed from the BioMagResBank (Seavey et al. 1991).
In constructing the database for our structure-based method (SHIFTX+) we compiled a preliminary collection of ~300 candidate proteins from a number of sources, including RefDB (Zhang et al. 2003), the SPARTA training set (Shen and Bax 2007) and the SHIFTX training set (Neal et al. 2003). This dataset was filtered by selecting only those proteins that had X-ray structures with a resolution <2.1 Å, that were largely monomeric, that were free of bound DNA, RNA or large cofactors and that had mostly (>90%) sequentially complete 1H, 13C and/or 15N assignments. Note that in compiling this database, X-ray structures were given preference over NMR structures. This is because it is widely acknowledged that most NMR structures do not achieve the coordinate accuracy or precision of high quality X-ray structures (Andrec et al. 2007; Berjanskii et al. 2010; Laskowski et al. 1996; Shen and Bax 2007). This collection of ~250 high resolution X-ray structures was then analyzed for structural defects using a number of structure validation programs including VADAR (Willard et al. 2003), PROSA (Wiederstein and Sippl 2007), and WHAT_CHECK (Hooft et al. 1996). A separate program called RefDens (Ginzinger et al. 2010) was used to assess the quality of the protein side chains in each model. Several dozen structures were subsequently excluded due to their poor coordinate geometry or obvious structural defects.
For the remaining structures, we manually matched each structure with its observed chemical shift record from the BioMagResBank (Seavey et al. 1991). SHIFTCOR (Zhang et al. 2003) was used to identify potential chemical shift referencing problems and to re-reference all observed chemical shifts to the IUPAC standard, DSS (2,2-dimethyl-2-silapentane-5-sulfonic acid) (Wishart et al. 1995b). PANAV (Wang et al. 2010), CheckShift (Ginzinger et al. 2009) and SHIFTX were also used to check the quality of the protein chemical shift assignments and to identify certain types of gross assignment errors (e.g. “flipped” assignments from folded spectra). Within the accepted set of structures and assignments we further excluded certain chemical shifts that appeared to be extreme outliers (beyond four standard deviations) based on the expected shifts of their atom type, residue type or observed secondary structure. These outliers were identified by CheckShift (Ginzinger et al. 2009) and PANAV (Wang et al. 2010). Finally all of the X-ray structures were “protonated” (i.e. H atoms added) using the program called REDUCE (Word et al. 1999). The final training dataset consisted of 197 high resolution and high-quality protein structures (with computationally added hydrogen atoms) which had a total of 140,518 re-referenced backbone chemical shifts and 66,385 re-referenced side chain chemical shifts. A list of the training set’s protein names along with their BMRB accession numbers and PDB identifiers is provided in Table S1. The complete training data set (coordinates and assignments) is downloadable from the SHIFTX2 website.
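The four-standard-deviation exclusion rule can be illustrated with a short Python sketch. This is not the CheckShift or PANAV procedure; it only assumes that an expected mean and standard deviation are available for each (residue type, atom type) pair. The record layout, the `stats` lookup table and the function names are hypothetical.

```python
def is_outlier(shift, expected_mean, expected_sd, n_sd=4.0):
    """True if a shift lies more than n_sd standard deviations from its expected value."""
    return abs(shift - expected_mean) > n_sd * expected_sd


def drop_outliers(records, stats, n_sd=4.0):
    """Keep only the shift records that are not extreme outliers.

    records: list of dicts with keys "residue", "atom" and "shift" (hypothetical layout)
    stats:   dict mapping (residue, atom) -> (expected_mean, expected_sd),
             supplied by the caller from a reference shift database
    """
    kept = []
    for rec in records:
        mean, sd = stats[(rec["residue"], rec["atom"])]
        if not is_outlier(rec["shift"], mean, sd, n_sd):
            kept.append(rec)
    return kept
```

In practice the expected values would also be conditioned on secondary structure, as described above; the sketch keys only on residue and atom type to stay short.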
In addition to this large training set, a separate “testing” dataset was assembled to assess the performance of both SHIFTX2 and other chemical shift prediction programs. This test set was constructed using the same criteria described above, but with the requirement that the proteins could not already be in the training or testing sets used by other programs (SHIFTX2, SHIFTX, SPARTA, CamShift). This was done to reduce any potential performance bias towards a single prediction program. The final testing dataset consisted of 61 high resolution protein structures corresponding to 47,514 re-referenced backbone chemical shifts and 24,933 re-referenced side chain chemical shifts. A list of the test set’s protein names along with their BMRB accession numbers and PDB identifiers is provided in Table S2. The complete testing data set (coordinates and assignments) is also downloadable from the SHIFTX2 website.
To develop our structure-based shift prediction algorithm (SHIFTX+) each protein structure in the training data set was further processed by VADAR (Willard et al. 2003), SHIFTX (Neal et al. 2003), PROSESS (Berjanskii et al. 2010) and other in-house programs. These programs calculate dozens of structural features from protein coordinate data, including backbone torsion angles, side chain torsion angles, hydrogen bond energies, hydrogen bond angles, hydrogen bond lengths, solvent exposure, secondary structure, etc. In addition to these structural features, other features pertaining to the pH, temperature and solvents were extracted from each protein’s BioMagResBank file. Likewise, experimentally derived random coil chemical shifts (Wishart et al. 1995a) and nearest-neighbour sequence information were also used as input features. In total, 97 atom-specific, residue-specific and protein-specific data features were compiled.
Once the optimal machine learning method was identified, SHIFTX+ was then further refined through a process known as feature selection. In machine learning, a high quality feature set is particularly important for improving the accuracy of any given predictor. Generally speaking, optimal accuracy may only be obtained by retaining the most important features. To select the best input features, we initially used as many features as possible to train our predictor. We then progressively examined each feature and retained it only if the exclusion of such a feature decreased the prediction accuracy of the model. This feature selection process was repeated several times using different orderings of input features. From this initial set of 97 features, our feature selection process reduced this list to a final set of 63 “useful” features. These features are listed in Table S3 (which is also available on the SHIFTX2 website). The performance of the final version of SHIFTX+ was assessed against both its training set (via tenfold cross-validation) and the testing dataset. This was done to determine the robustness of the predictor and to check if any over-training had occurred.
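The elimination loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the SHIFTX+ implementation: an ordinary least-squares fit stands in for the actual regression model, accuracy is measured by the RMS error on a simple held-out split, and the names `holdout_rmse` and `backward_select` are hypothetical.

```python
import numpy as np


def holdout_rmse(X, y, train_fraction=0.7):
    """RMS error on held-out rows of a least-squares fit (stand-in for the real model)."""
    n_train = int(len(y) * train_fraction)
    A = np.column_stack([X[:n_train], np.ones(n_train)])
    coef, *_ = np.linalg.lstsq(A, y[:n_train], rcond=None)
    B = np.column_stack([X[n_train:], np.ones(len(y) - n_train)])
    return float(np.sqrt(np.mean((B @ coef - y[n_train:]) ** 2)))


def backward_select(X, y, order):
    """Start from all features; visit them in `order` and drop each one whose
    exclusion does not make the held-out error worse."""
    kept = list(range(X.shape[1]))
    best = holdout_rmse(X[:, kept], y)
    for f in order:
        if f not in kept or len(kept) == 1:
            continue
        trial = [k for k in kept if k != f]
        err = holdout_rmse(X[:, trial], y)
        if err <= best:  # accuracy did not decrease, so the feature is not needed
            kept, best = trial, err
    return kept
```

Calling `backward_select` several times with different values of `order` mirrors the repeated passes over different feature orderings mentioned in the text.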
As noted earlier, SHIFTX2 is composed of two components, a structure-based component (SHIFTX+) and a sequence-based component (SHIFTY+). SHIFTY+ is essentially an enhanced version of SHIFTY (Wishart et al. 1997). Both SHIFTY and SHIFTY+ predict 1H, 13C and 15N chemical shifts based on sequence matching and alignment of a query protein against a database of previously assigned proteins (RefDB or BMRB). Sufficiently high scoring matches (>40% sequence identity) are aligned together and the chemical shifts of the database protein(s) are transferred to the query protein using appropriate residue-specific corrections. In developing SHIFTY+, a number of improvements were made including the use of BLAST (Altschul et al. 1990) to identify sequence matches instead of the slower Needleman-Wunsch algorithm, the expansion of the chemical shift database by 27.5% to include 1,903 assigned proteins, the correction of numerous chemical shift referencing errors in the database (via CheckShift and SHIFTCOR), and the elimination of erroneous or questionable assignments among the reference database’s collection of shifts (via PANAV and CheckShift).
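The transfer step can be sketched as follows. This is an illustration under stated assumptions rather than the SHIFTY+ rule set: the alignment is taken as given (two equal-length strings with `-` for gaps), sequence identity is computed over columns where both sequences have a residue, and the residue-specific correction for a substituted position is assumed to be a simple exchange of random coil values. All function and parameter names are hypothetical.

```python
def percent_identity(query_aln, hit_aln):
    """Percent identity over alignment columns in which neither sequence has a gap."""
    pairs = [(q, h) for q, h in zip(query_aln, hit_aln) if q != "-" and h != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(q == h for q, h in pairs) / len(pairs)


def transfer_shifts(query_aln, hit_aln, hit_shifts, random_coil, min_identity=40.0):
    """Map shifts from an aligned database protein onto the query protein.

    hit_shifts:  dict {hit residue index: shift} for one atom type
    random_coil: dict {one-letter residue code: random coil shift} for that atom type
    Returns {query residue index: predicted shift}; empty if identity is too low.
    """
    if percent_identity(query_aln, hit_aln) <= min_identity:
        return {}
    predicted = {}
    qi = hi = 0  # running residue indices in the ungapped query and hit sequences
    for q, h in zip(query_aln, hit_aln):
        if q != "-" and h != "-" and hi in hit_shifts:
            if q == h:
                predicted[qi] = hit_shifts[hi]
            else:
                # assumed correction: keep the secondary shift, swap the random coil value
                predicted[qi] = hit_shifts[hi] - random_coil[h] + random_coil[q]
        qi += q != "-"
        hi += h != "-"
    return predicted
```

Query residues that align to a gap, or to a database residue without an assignment, simply receive no prediction, which is one source of the incomplete coverage discussed later.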
Sequence-based methods tend to outperform structure-based methods, especially when a good homologue is found (Wishart and Nip 1998; Wishart et al. 1997). However, if no suitable homologue exists sequence-based methods will obviously do much worse than structure-based methods. Even when homologues are found, sequence-based methods make a potentially dangerous assumption that the structure of the matching homologue is always similar to the query protein. This is not always true. In NMR it is certainly possible to have identical sequences but completely different chemical shifts (e.g. folded and unfolded versions of the same protein). In these (rare) situations sequence-based methods cannot distinguish whether the folded or unfolded form is correct. Likewise, sequence-based methods are not sensitive to subtle conformational changes arising from mutations, deletions, structure refinement or the existence of “excited” states that are conformationally different from the database’s homologues. On the other hand, structure-based methods are not limited by these kinds of constraints. Therefore, by intelligently combining structure-based methods (SHIFTX+) with sequence-based methods (SHIFTY+) we should be able to combine the high prediction accuracy of sequence-based methods with the broad prediction coverage of structure-based methods.
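The text here does not spell out how the two sets of predictions are merged, so the following Python sketch assumes the simplest possible rule: use the sequence-based value wherever one exists and fall back to the structure-based value otherwise. The function name and the `(residue number, atom name)` key layout are hypothetical, and the actual SHIFTX2 combination logic may well be more elaborate.

```python
def combine_predictions(structure_based, sequence_based):
    """Merge two shift predictions keyed by (residue_number, atom_name).

    Assumed rule: a sequence-based value, when present, overrides the
    structure-based value; everything else comes from the structure-based set.
    """
    combined = dict(structure_based)
    for key, shift in sequence_based.items():
        if shift is not None:
            combined[key] = shift
    return combined
```

Under this rule the hybrid covers at least every atom that the structure-based predictor covers, which is the coverage argument made in the paragraph above.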
SHIFTX2 was written in C, Java and Python and is available as a standalone program, as an online web server and as a VMWare version. All of these versions are available at http://www.shiftx2.ca. SHIFTX2 has been compiled and tested on Ubuntu Linux 10.04LTS; however, if properly configured, the SHIFTX2 program should run under most UNIX-like environments including Debian/GNU and Mandriva Linux, openSUSE, OpenSolaris, OpenBSD and Mac OS X. Although SHIFTX2 has many more computationally intensive components than the original SHIFTY or SHIFTX programs, a number of code optimizations were implemented to make it sufficiently fast that it could be used in chemical shift refinement or incorporated into chemical-shift-based structure generation programs such as CS23D (Wishart et al. 2008), CSRosetta (Shen et al. 2008) or GeNMR (Berjanskii et al. 2009) without any loss in speed.
Results and discussion
To fully assess SHIFTX2, we initially studied the performance of each of its component programs (SHIFTX+ and SHIFTY+). First, we evaluated SHIFTX+ on its training (197 proteins) dataset using tenfold cross validation. This was done to test the general robustness of the predictor. Second, we evaluated SHIFTX+ on a separate testing (61 proteins) dataset. This was done to obtain an independent measure of SHIFTX+’s performance. Third we evaluated SHIFTY+ on the combined training/testing dataset (235 unique proteins) by excluding any exact database matches from the SHIFTY+ predictions. The exclusion of exact database matches was done to avoid predicting chemical shifts for proteins that had already been assigned and to simulate more realistic prediction scenarios. These results were used to assess SHIFTY+’s performance relative to SHIFTX+ and to get an estimate of its coverage (i.e. rate of prediction). Fourth, we assessed SHIFTY+ on the full set of proteins in RefDB (1,903 proteins) to obtain a more precise estimate of SHIFTY+’s expected coverage or probability of prediction for any new query protein.
After obtaining estimates of the performance and coverage of the component programs we then evaluated the performance of the combined program, SHIFTX2. This assessment involved comparing the performance of SHIFTX2 to its component parts (SHIFTX+ and SHIFTY+) and to other state-of-the-art protein chemical shift predictors (SHIFTX, CamShift, SPARTA, PROSHIFT, SHIFTS, SPARTA+) using our independent test set of 61 proteins. All seven programs were evaluated on the basis of: (1) their correlation coefficients (between observed and predicted shifts); (2) their root mean square deviation (RMSD) or RMS error; (3) their coverage (proportion of proteins or residues in the test set that were predicted); (4) their comprehensiveness (number of atoms or atom types predicted); and (5) their speed (CPU seconds or processing time to return an answer).
Table S4 (also available on the SHIFTX2 website) shows the correlation coefficients and RMSDs of the backbone chemical shifts achieved for SHIFTX+ both for the training dataset and the testing dataset. As noted earlier, the training dataset performance was assessed using cross-validation. Cross-validation is a standard method in machine learning for evaluating almost any prediction model. In tenfold cross-validation, a randomly selected 10% of the training data is set aside as a test set, the algorithm is trained on the remaining 90% of the data and then evaluated on the held-out 10%. This process is repeated ten times and the results are averaged. If the algorithm has not been over-trained, the performance for the tenfold cross-validation should match closely with performance on the independent test set. As seen in Table S4, this is indeed the case. This result gives us a high level of confidence that the SHIFTX+ algorithm is robust and that the regression model has not been over-trained. Overall, SHIFTX+ is able to attain correlation coefficients (R) of 0.9149, 0.9842, 0.9970, 0.8939, 0.8103, and 0.9226 for 15N, 13Cα, 13Cβ, 13C′, 1HN, 1Hα shifts with corresponding RMS errors of 2.2878, 0.8743, 1.0099, 0.9945, 0.4356, and 0.2152 ppm, respectively. Table S5 (also available on the SHIFTX2 website) shows SHIFTX+’s prediction accuracy for side chain atoms. The correlation coefficients between SHIFTX+’s predicted and observed side chain chemical shifts are 0.9769 (13C) and 0.9321 (1H) with RMS errors of 0.9903 and 0.2238 ppm, respectively.
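The tenfold cross-validation procedure can be written out as a short Python sketch. It is a generic illustration rather than the evaluation code used for SHIFTX+: the model is passed in as a pair of `fit` and `predict` callables, a least-squares fit is supplied only as a stand-in, and all function names are hypothetical.

```python
import numpy as np


def tenfold_indices(n_samples, n_folds=10, seed=0):
    """Randomly partition sample indices into n_folds disjoint test folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)


def cross_validate(X, y, fit, predict, n_folds=10, seed=0):
    """Train on all folds but one, test on the held-out fold, average the RMS errors."""
    folds = tenfold_indices(len(y), n_folds, seed)
    rms_errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        residual = predict(model, X[test]) - y[test]
        rms_errors.append(np.sqrt(np.mean(residual ** 2)))
    return float(np.mean(rms_errors))


def fit_linear(X, y):
    """Stand-in model: least-squares fit with an intercept term."""
    A = np.column_stack([X, np.ones(len(X))])
    return np.linalg.lstsq(A, y, rcond=None)[0]


def predict_linear(coef, X):
    """Apply the stand-in linear model to new rows."""
    return np.column_stack([X, np.ones(len(X))]) @ coef
```

Comparing the averaged cross-validation error with the error on an independent test set is the over-training check described above: a large gap between the two would indicate that the model has memorized its training data.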
Because SHIFTY+ is not based on machine learning techniques but on sequence alignment, its performance can be assessed much more simply. Table S6 (see the SHIFTX2 website) provides the prediction accuracy data for SHIFTY+ for the 235 non-redundant proteins in the training and testing datasets. As noted before, exact matches of the database proteins to the query protein were excluded from the performance calculations to simulate more realistic prediction scenarios. This “forced” SHIFTY+ to predict shifts using only homologous proteins or protein fragments. Using a sequence identity cutoff of 40%, we found that up to 74.5% (175/235) of the proteins could have at least one class of chemical shifts predicted by SHIFTY+. Because there is considerable variability in the type and number of protein assignments deposited in chemical shift databases (some report only 1H shifts, others report only 13C shifts and still others report all shifts), there will naturally be some variability in the chemical shift coverage that SHIFTY+ can achieve. In particular, SHIFTY+’s coverage ranged from a low of 38% (for HE3) to a high of 74% (for 1HN), with an average of 57% over all atom types. This means that SHIFTY+ was able to generate nearly complete assignments for about 57% of the query proteins or, alternately, that SHIFTY+ predicted shifts for 57% of the residues it processed. For those chemical shifts that SHIFTY+ did predict in the 235 protein testing/training set, it achieved correlation coefficients between predicted and observed backbone chemical shifts of 0.9800, 0.9925, 0.9991, 0.9638, 0.9610, and 0.9677 for 15N, 13Cα, 13Cβ, 13C′, 1HN, 1Hα atoms with corresponding RMS errors of 1.1352, 0.6127, 0.5562, 0.5784, 0.2097, and 0.1411 ppm, respectively. The correlation coefficient between SHIFTY+’s predicted and observed side chain proton chemical shifts was 0.9628 (1H) with an RMS error of 0.1393 ppm.
The performance for SHIFTY+ was slightly better for the 61 protein testing dataset (for which it predicted shifts for 46 proteins). In particular, SHIFTY+ achieved correlation coefficients between predicted and observed backbone chemical shifts of 0.9974, 0.9991, 0.9999, 0.9961, 0.9964, and 0.9882 for 15N, 13Cα, 13Cβ, 13C′, 1HN, 1Hα atoms with corresponding RMS errors of 0.4115, 0.2087, 0.2136, 0.1847, 0.0630, and 0.0845 ppm, respectively. While the coverage of SHIFTY+ is certainly not as comprehensive as SHIFTX+, it is clear that for the ~57% of residues it could predict, SHIFTY+ is somewhat more accurate.
Expanding the SHIFTY+ testing dataset to include all 1,903 proteins in the RefDB/BMRB database revealed that very similar levels of coverage and accuracy could be obtained. In particular a total of 1,270 out of 1,903 proteins (66.7%) could have at least one class of backbone and/or side chain chemical shifts predicted by SHIFTY+. Averaged over all atom types, SHIFTY+ achieved a residue coverage of 55%. In terms of protein coverage (74.5% vs. 66.7%) or residue coverage (57% vs. 55%) these numbers are similar to those found with the smaller (235 protein) testing/training set. Likewise, as seen in Table S7, the correlation coefficients and RMS errors for the backbone and side-chain shifts are essentially identical to those seen in Table S6. These data suggest that sequence-based methods should routinely work about 70% of the time for any new query protein. Assessing SHIFTY+’s performance with different sizes of the RefDB showed a clear correlation between the size of the reference database and the level of coverage as well as the quality of the predictions (see Table S8 and the SHIFTX2 website for more details). Based on the size and current growth rate of the BMRB and RefDB (about 300 proteins/year) we expect that the proportion of proteins predictable by SHIFTY+ should climb at a rate of about 3–5% per year. This coverage projection was calculated by fitting the data in Table S8 to the following equation: $\mathrm{Coverage} = 0.84 - 390/N_{\mathrm{RefDB}} + 45{,}000/N_{\mathrm{RefDB}}^{2}$, where $N_{\mathrm{RefDB}}$ is the number of proteins in RefDB.
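For readers who wish to evaluate the fitted expression at other database sizes, it can be coded directly. The sketch below only transcribes the equation given above; the function names and the `growth` parameter are hypothetical, and the fit should be assumed valid only near the range of database sizes covered by Table S8.

```python
def shifty_coverage(n_refdb):
    """Fraction of query proteins predictable by SHIFTY+ for a RefDB of n_refdb proteins."""
    return 0.84 - 390.0 / n_refdb + 45000.0 / n_refdb ** 2


def projected_gain(n_refdb, growth=300):
    """Change in coverage after the database grows by `growth` proteins."""
    return shifty_coverage(n_refdb + growth) - shifty_coverage(n_refdb)
```

Because both correction terms vanish as the database grows, the fitted coverage approaches 0.84 rather than 1.0, so the fit itself implies that some proteins remain unpredictable by sequence-based methods even with a very large database.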
Comparative performance of SHIFTX2
Relative to SHIFTX+, SHIFTX2 routinely performs about 6% better (as measured by correlation coefficients), with the highest performance gain being seen for amide 1HN shifts (17.8%). For those proteins (~46) where SHIFTY+ is able to make predictions, the performance of SHIFTY+ and SHIFTX2 is identical. However, when the performance of SHIFTX2 for the complete set of 61 proteins is compared to the performance of SHIFTY+ for its partial set of 46 proteins, SHIFTX2 performs only slightly worse (~1.3% as measured by average correlation coefficient). On the other hand, SHIFTX2’s coverage (percentage of proteins or residues predicted) is more than 24% greater than SHIFTY+’s coverage. These data clearly show that SHIFTX2 is superior to both SHIFTX+ and SHIFTY+.
To compare the performance of SHIFTX2 with other state-of-the-art shift predictors, we ran our test dataset of 61 proteins on six publicly available chemical shift prediction programs or web servers, including SHIFTS, SHIFTX, PROSHIFT, CamShift, SPARTA and SPARTA+. All seven programs were evaluated on the basis of: (1) their correlation coefficients (between observed and predicted shifts); (2) their root mean square deviation (RMSD); (3) their coverage (proportion of proteins or residues in the test set that were predicted); (4) their comprehensiveness (number of atoms or atom types predicted); and (5) their speed (CPU seconds or processing time to return an answer).
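The first three criteria can be computed from paired lists of observed and predicted shifts, as in the Python sketch below. It assumes the Pearson correlation coefficient (the text does not name the type of correlation), represents a missing prediction or observation as NaN, and defines coverage as the fraction of observed shifts that received a prediction. The function name is hypothetical.

```python
import numpy as np


def shift_agreement(observed, predicted):
    """Return (correlation, RMSD, coverage) for one atom type.

    observed, predicted: equal-length sequences of shifts in ppm, NaN where missing.
    """
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    has_obs = ~np.isnan(obs)
    both = has_obs & ~np.isnan(pred)  # shifts that were both observed and predicted
    r = float(np.corrcoef(obs[both], pred[both])[0, 1])
    rmsd = float(np.sqrt(np.mean((pred[both] - obs[both]) ** 2)))
    coverage = float(both.sum() / has_obs.sum())
    return r, rmsd, coverage
```

Reporting correlation and RMSD together with coverage matters because a program that predicts only the easy cases can post excellent accuracy figures while leaving many shifts unpredicted.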
Table: Summary of the performance (correlation coefficients and RMSD) for predicted backbone shifts for seven different chemical shift predictors using a test set of 61 proteins. Columns: correlation (RMSD) for 15N, 13Cα, 13Cβ, 13C′, 1HN and 1Hα.
While most state-of-the-art protein chemical shift predictors predict backbone chemical shifts, only four attempt to predict a subset of side chain shifts (SHIFTX, SHIFTX2, SHIFTS and PROSHIFT) and only two attempt to predict all possible side chain shifts (SHIFTX2 and PROSHIFT). Given the enormous amount of structural information contained in side chain chemical shifts (especially with respect to the influence of ring current effects and other long-range effects) it is surprising that more effort is not directed towards studying this class of chemical shifts. Indeed, ignoring side chain chemical shifts for proteins is a bit like ignoring side chain NOEs. Certainly most protein structures could not be solved or at least solved accurately without the inclusion of side chain NOEs. Similarly any effort directed at refining or solving protein structures using only backbone chemical shifts would no doubt lead to somewhat middling or ambiguous results.
Table: Correlation coefficients and RMSDs between observed and predicted side chain chemical shifts (29 different atom types) for four different chemical shift prediction programs as measured for a test set of 61 proteins. Columns include: No. of shifts.
Table: Level of backbone chemical shift coverage for seven different chemical shift prediction programs using the standard test set of 61 proteins consisting of 55,493 predictable shifts. The HA2 and HA3 shifts for glycine were reduced to a single average shift to permit comparison between all programs. Columns: No. of expected shifts; Coverage rate (for 55,493 atoms) (%); No. of PDB; No. of shifts.
Table: Comprehensiveness (number of atom types predicted) and the computational speed (limited to backbone shifts) of the seven different chemical shift predictors. Columns: No. of atom types predicted; Speed (seconds/100 residues).
In terms of computational speed, there is obviously considerable variability among the seven programs. SPARTA appears to be the slowest program, with an average speed of 17.92 s per 100 residues. PROSHIFT is the next slowest (12.87 s per 100 residues) while SHIFTS is approximately four times faster with an average speed of 3.66 s per 100 residues. The fastest program is SHIFTX, which averages 0.59 s per 100 residues. Of the seven programs, SHIFTX2 appears to be the third fastest program with an average speed of 2.10 s per 100 residues. All of the computational speed tests for SPARTA, SPARTA+, SHIFTS, CamShift, SHIFTX and SHIFTX2 were performed on the same computer (an Intel Core™2 Duo CPU 1.83 GHz processor with 1.6 GB RAM) using the same set of proteins. The calculation speed reported for PROSHIFT is based on the response rate of its web server. Without knowing the architecture of the PROSHIFT server it is difficult to know whether PROSHIFT numbers are comparable to the values generated on our test CPU processor.
Influence of different parameters on different chemical shifts
[Table: Relative (%) influence of the top 20 features or atomic property descriptors for the SHIFTX+ prediction module. Row label recovered: R. coil shift.]
Assessing SHIFTX2 for chemical shift refinement
Caveats and limitations
While we have presented a substantial body of data showing that SHIFTX2 has achieved a significant improvement in protein chemical shift prediction accuracy, it is important to be aware of its limitations. In particular, it is essential to remember that the high correlation coefficients and low RMS errors reported here will typically be better (1–2%) than what one will get using an “average” protein. This is because the test set of 61 proteins used to assess SHIFTX2’s (and all of the other predictors’) performance was specially selected for their exceptionally high resolution and high quality. If one were to choose lower quality structures (low resolution X-ray or NMR) then the agreement between observed and predicted shifts would obviously be lower—regardless of which program is chosen. Chemical shifts are exquisitely sensitive to small coordinate errors or small coordinate displacements (Iwadate et al. 1999; Kohlhoff et al. 2009; Meiler 2003; Moon and Case 2007; Neal et al. 2003; Shen and Bax 2007, 2010; Vila et al. 2009; Xu and Case 2001). Therefore any errors or lack of precision in coordinate data will always be reflected in any set of predicted chemical shifts. In other words, “garbage in = garbage out”. For instance, if one were to use a low resolution or a poor quality structure to attempt to predict chemical shifts for assignment purposes, then it is likely that a number of assignment errors will ensue. On the other hand, if one finds that the calculated shifts for a given structure disagree with the observed shifts by more than what is quoted in Tables 1 or 2, then this is likely a good indication that the structure is in need of further refinement. As shown in Fig. 4, and as advocated in many other recent publications (Kohlhoff et al. 2009; Meiler 2003; Neal et al. 2003), using chemical shifts to assist with the structure refinement process would certainly help improve the quality of many NMR-generated structures.
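The comparison suggested above (checking whether a structure's calculated shifts disagree with the observed shifts by more than the reported errors) can be sketched in a few lines of Python. This is a minimal illustration rather than part of SHIFTX2: the function names, the atom-type keys and the `tolerance` parameter are assumptions introduced here, and the reference values are the backbone RMS errors quoted in the abstract for the 61-protein test set.

```python
import math

# Backbone RMS errors (ppm) quoted in the abstract for SHIFTX2.
# The keys are illustrative labels chosen for this sketch.
REPORTED_RMS_PPM = {
    "N": 1.1169,   # 15N
    "CA": 0.4412,  # 13C-alpha
    "CB": 0.5163,  # 13C-beta
    "C": 0.5330,   # 13C'
    "HN": 0.1711,  # 1HN
    "HA": 0.1231,  # 1H-alpha
}


def rms_error(observed, predicted):
    """Root-mean-square difference between paired shift lists (ppm)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    total = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(total / len(observed))


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between paired shift lists."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two lists of equal length with >= 2 values")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_o * var_p)


def exceeds_reported_error(atom_type, observed, predicted, tolerance=1.0):
    """Hypothetical screen: True if the RMS error for this atom type is
    larger than `tolerance` times the reported test-set RMS error."""
    return rms_error(observed, predicted) > tolerance * REPORTED_RMS_PPM[atom_type]
```

A result of `True` from `exceeds_reported_error` would only be a prompt to look more closely at the structure or the assignments; the choice of `tolerance` is left to the user and is not a threshold taken from the paper.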
It is also important to remember that most protein chemical shift predictors are designed to predict chemical shifts of diamagnetic proteins in aqueous conditions at moderate temperatures and at moderate pH values. Therefore, attempting to use SHIFTX2 (or most other programs) on paramagnetic proteins or on proteins dissolved in non-aqueous buffers or at extreme temperatures or at extremes of pH will likely lead to poor results. While SHIFTX2 can be used to calculate chemical shifts of protein–protein complexes, it is not capable of accurately predicting shifts of amino acid residues in close proximity to DNA, RNA or certain small molecule co-factors (heme rings, NAD, FAD, etc.). This is because the characteristic ring current and charge models for these non-proteinaceous molecules are not included in the current SHIFTX2 model.
In this report we have described SHIFTX2, a novel, hybrid chemical shift predictor that is capable of rapidly and accurately calculating diamagnetic 1H, 13C and 15N chemical shifts from protein coordinate data. Comparisons of SHIFTX2 against many state-of-the-art predictors clearly show that the program is substantially more accurate (up to 26% better by correlation coefficient with an RMS error that is up to 3.3× smaller) than the next best performing program. It also provides significantly more coverage (up to 10% more), is significantly faster (up to 8.5×) and is capable of calculating a wider variety of backbone and side chain chemical shifts (up to 6×) than many other shift predictors. We were able to achieve this high level of performance by carefully training and testing each of SHIFTX2’s component programs (SHIFTY+ and SHIFTX+) on a set of large and very accurate databases. By utilizing advanced machine learning techniques and by incorporating many more features in our machine learning model, we were able to substantially improve SHIFTX2’s structure-based predictor (SHIFTX+). By carefully preparing a large reference sequence/shift database (RefDB) and enhancing the sequence alignment algorithm, we were also able to substantially improve SHIFTX2’s sequence-based predictor (SHIFTY+). By combining the results of these two programs using an automated differential weighting scheme, we were able to get the benefits of both shift prediction techniques.
While the results we have obtained with SHIFTX2 are impressive and the improvements over existing methods are significant, it is likely that the predictive performance of protein chemical shift predictors is now nearing its limit. As databases continue to expand and as more methods are intelligently combined, it may be possible to improve shift prediction accuracy by another 1 or 2%. However, once this level is reached, the inherent imprecision of atomic coordinates and the inherent conformational differences between proteins in the solid state (crystals) versus those in solution will probably become the largest contributors to any observed chemical shift discrepancies. In other words, it will be impossible to get perfect chemical shift predictions. Perhaps the only way around this “atomic precision” barrier is to start including conformational ensembles determined from molecular dynamics simulations or generated via chemical shift refinement (Lehtivarjo et al. 2009; Markwick et al. 2010). Certainly a number of recent studies have suggested that chemical shifts calculated over carefully weighted ensembles of protein structures give better agreement with observed shifts than those generated from just a single protein structure.
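To make the idea of ensemble-averaged shifts concrete, the Python sketch below computes a weighted average of per-conformer predicted shifts. It is only an illustration of weighted averaging in general: the function name, the input layout and the weights are assumptions introduced here, and the sketch does not reproduce the weighting schemes used in the studies cited above.

```python
def ensemble_average_shifts(shifts_per_conformer, weights=None):
    """Weighted average of predicted shifts over an ensemble.

    shifts_per_conformer: list of lists, one inner list per conformer,
        each holding the predicted shift (ppm) for the same atoms in
        the same order.
    weights: optional non-negative weights, one per conformer. They are
        normalized internally; equal weights are used if omitted.
    """
    if not shifts_per_conformer:
        raise ValueError("ensemble is empty")
    n_atoms = len(shifts_per_conformer[0])
    if any(len(shifts) != n_atoms for shifts in shifts_per_conformer):
        raise ValueError("every conformer must list the same atoms")

    if weights is None:
        weights = [1.0] * len(shifts_per_conformer)
    if len(weights) != len(shifts_per_conformer):
        raise ValueError("need exactly one weight per conformer")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Average atom by atom across conformers using the normalized weights.
    return [
        sum(w * shifts[i] for w, shifts in zip(weights, shifts_per_conformer)) / total
        for i in range(n_atoms)
    ]
```

For example, two conformers with predicted shifts `[4.0, 120.0]` and `[5.0, 122.0]` and weights `[3, 1]` give averaged shifts of 4.25 and 120.5 ppm. The averaged list can then be compared with the observed shifts in the same way as shifts predicted from a single structure.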
Despite these caveats, we believe that SHIFTX2, with its high level of accuracy and broad chemical shift coverage, should open the door to many long-anticipated applications of chemical shift prediction. Indeed, SHIFTX2 should be particularly useful in refining and assessing protein structures, validating and adjusting chemical shift assignments and, ultimately, generating protein structures using chemical shift data alone.
The authors would like to thank the Alberta Prion Research Institute (APRI), PrioNet, the Natural Sciences and Engineering Research Council (NSERC), Genome Canada and the Austrian Science Fund (FWF; grant P21294-B12) for financial support.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
- Kotsiantis SB (2007) Combining bagging and additive regression. Int J Comput Math Sci 1:61–67
- Pastore A, Saudek V (1990) The relationship between chemical shift and secondary structure in proteins. J Magn Reson 90:165–176
- Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197–227
- Shen Y, Lange O, Delaglio F, Rossi P, Aramini JM, Liu G, Eletsky A, Wu Y, Singarapu KK, Lemak A, Ignatchenko A, Arrowsmith CH, Szyperski T, Montelione GT, Baker D, Bax A (2008) Consistent blind protein structure generation from NMR chemical shift data. Proc Natl Acad Sci U S A 105:4685–4690