Database proton NMR chemical shifts for RNA signal assignment and validation

The Biological Magnetic Resonance Data Bank contains NMR chemical shift depositions for 132 RNAs and RNA-containing complexes. We have analyzed the 1H NMR chemical shifts reported for non-exchangeable protons of residues that reside within A-form helical regions of these RNAs. The analysis focused on the central base pair within a stretch of three adjacent base pairs (BP triplets), and included both Watson–Crick (WC; G:C, A:U) and G:U wobble pairs. Chemical shift values were included for all 4^3 (64) possible WC-BP triplets, as well as 137 additional triplets that contain one or more G:U wobbles. Sequence-dependent chemical shift correlations were identified, including correlations involving terminating base pairs within the triplets and canonical and non-canonical structures adjacent to the BP triplets (i.e. bulges, loops, WC and non-WC BPs), despite the fact that the NMR data were obtained under different conditions of pH, buffer, ionic strength, and temperature. A computer program (RNAShifts) was developed that enables convenient comparison of RNA 1H NMR assignments with database predictions, which should facilitate future signal assignment/validation efforts and enable rapid identification of non-canonical RNA structures and RNA-ligand/protein interaction sites.
Electronic supplementary material: The online version of this article (doi:10.1007/s10858-012-9683-9) contains supplementary material, which is available to authorized users.


Introduction
RNA molecules participate in a large and expanding array of known biological functions including gene regulation, maintenance of sub-cellular and viral structure, intracellular trafficking, antiviral restriction, catalysis, and, of course, propagation of genetic information (Korostelev and Noller 2007; Steitz 2008; Bessonov et al. 2008; Boisvert et al. 2007; Wakeman et al. 2007; Edwards et al. 2007; Bartel 2004; Kim 2005; Hassouna et al. 1984; Brodersen and Voinnet 2006; Doudna and Rath 2002; Ponting et al. 2009). Like proteins, the functional activities of most RNAs are intrinsically linked to their structures. Unfortunately, although a wealth of structural information is currently available for functionally active proteins and protein domains, structural information for functionally relevant RNAs remains relatively limited. Thus, the Protein Data Bank (PDB; http://www.rcsb.org/pdb/home/home.do) currently contains more than 55,000 protein structure depositions, whereas the Nucleic Acid Database (NDB; http://ndbserver.rutgers.edu/) contains atomic coordinate depositions for fewer than ~2,100 RNAs and protein/ligand-RNA complexes, of which ~1,600 were determined by X-ray crystallography and ~500 by NMR spectroscopy. Conformational heterogeneity and the presence of a relatively uniform, negative surface charge can hinder structural studies by X-ray crystallography, and as discussed below, difficulties associated primarily with limited chemical shift dispersion have generally limited NMR applications to relatively small RNAs. For these reasons, much of what is known about the structures of biologically functional RNAs (primarily secondary structure information) has been obtained by chemical and enzymatic accessibility mapping experiments, coupled with phylogenetic and free energy calculations. Although RNA probing methodologies are potentially very powerful and have been widely applied (Peattie and Gilbert 1980; Ehresmann et al. 1987; Stern et al. 1988; Forconi and Herschlag 2009; Weeks 2010), interpretation of the data can be problematic, particularly for RNAs that exist as equilibrium mixtures of multiple conformational species (see for example, Kladwang et al. 2011; Houck-Loomis et al. 2011; Lu et al. 2011a, b; Miyazaki et al. 2010).
NMR is a potentially powerful tool for probing RNA structure (Wüthrich 1986; Allain and Varani 1997; Lukavsky and Puglisi 2005), but its application to larger RNAs can be complicated by a number of factors. Inter-residue scalar couplings are generally weak, limiting the utility of "through-bond" inter-residue connectivity experiments for signal assignment. The most commonly used assignment approach involves identification of sequential inter-residue NOE connectivities (Wüthrich 1986), but even this approach can be problematic for modest-sized RNAs (ca. 25-60 nucleotides). Although resolution can be increased by 1H-13C heteronuclear spectral editing (Peterson et al. 2004; Davis et al. 2005; Batey et al. 1995; Batey et al. 1992; Michnicka et al. 1993; Kim et al. 1995; Xu et al. 1996; Kim et al. 2002; Lukavsky et al. 2003; Lu et al. 2009), chemical shift dispersion is relatively limited (Allain and Varani 1997; Lukavsky and Puglisi 2005), and severe dipolar broadening of the aromatic 1H-13C signals that are critical for structural analysis can preclude detection of 1H-13C correlation NMR signals in larger RNAs (Lu et al. 2011a). In addition, interproton distances between elements of secondary structure in larger RNAs typically exceed those required for NOE detection (Lu et al. 2009; Tolbert et al. 2010). Thus, high-resolution NMR-based structural studies have been applied mainly to relatively small RNAs: of the 496 RNA NMR structures that have been deposited in the NDB, only 19 contain 60 or more nucleotides; the largest is a symmetrical dimer of 132 nucleotides (two 66-nucleotide subunits), and the average size is ~27 nucleotides.
One approach for addressing issues of signal degeneracy involves the application of traditional 2D NOESY experiments to RNA samples that are site- and/or nucleotide-specifically labeled with deuterium (Davis et al. 2005; Kim et al. 1995; Lu et al. 2009; Zhou et al. 2006; Nelissen et al. 2008; Heng et al. 2012; Duss et al. 2012). 2H-isotope edited 2D NMR has enabled nearly complete assignment of the aromatic, H1′, H2′, and H3′ ribose signals of RNAs containing up to 132 nucleotides, and has also enabled assignment of selected residues within a 720-nucleotide RNA (Lu et al. 2011a; Heng et al. 2012). This approach, which involves comparison of high-resolution 2D NOESY spectra obtained for multiple, differentially 2H-labeled samples, avoids relaxation problems associated with aromatic 1H-13C spectral editing and enables observation of signals in 2D 1H-1H NOESY spectra for protons with T2 values as short as 8 ms (Lu et al. 2011a). Although resolution and sensitivity can be improved dramatically by nucleotide-specific deuteration, signal overlap can still hinder the assignment process for RNAs comprising more than 150 nucleotides (Summers and coworkers, unpublished).
NMR chemical shifts have been widely utilized for NMR signal assignment and structural studies of proteins (for examples see: Grzesiek and Bax 1993; Wishart and Sykes 1994; Wishart et al. 1991, 1992; Cavalli et al. 2007; Shen et al. 2008; Wishart et al. 2008). Although relationships between 13C chemical shifts and RNA structure have been identified (Ebrahimi et al. 2001; Fares et al. 2007; Ohlenschlager et al. 2008), and 15N NMR chemical shifts have been incorporated into a probabilistic approach for automated assignment of RNA imino groups (Bahrami et al. 2012), heteronuclear NMR chemical shifts have not been widely exploited for RNA studies (Lam and Chi 2010; Aeschbacher et al. 2012). On the other hand, Wijmenga and co-workers showed that non-exchangeable 1H NMR chemical shifts for A-form helical residues could be back-calculated from a given 3D RNA structure (Cromsigt et al. 2001). For 28 examples tested, the back-calculated shifts were in good agreement with shifts reported in the Biological Magnetic Resonance Data Bank (BMRB; www.bmrb.wisc.edu), and some general 1H NMR chemical shift trends were identified (Cromsigt et al. 2001).
Here we report a detailed analysis of the H8, H2, H6, H5, H1′, H2′, and H3′ proton NMR chemical shifts that have been deposited in the BMRB. After correcting for differences in chemical shift referencing and sample conditions, excellent correlations were observed, despite the fact that the data were obtained over a wide range of sample conditions. Our findings confirm and quantify previously identified trends and identify new sequence- and structure-dependent chemical shift correlations that can be used for assignment and/or validation of non-exchangeable 1H NMR chemical shifts and for the identification of non-canonical RNA structural features and intermolecular interaction sites.

Methods
NMR data were analyzed using "RNAShifts", a program designed to download and analyze RNA 1H NMR chemical shifts that have been deposited in the BMRB. (Locally derived shifts that have yet to be deposited can also be analyzed.) All 131 depositions available in the BMRB were used in the current analysis except BMRB IDs 5170, 6814, 4816, 15697, 15915, 5023, 4253, 4894, and 15257, which could not be reliably used either because the BMRB assignments did not match the published PDB assignments, or because there was no associated publication or PDB file that could be used to identify RNA secondary structure. As additional input, files were manually generated for each deposition, based on published structural studies, that identify for each residue (1) whether or not the residue is base-paired, (2) the nature of the base-pairing partner, (3) any long-range intra- and/or inter-molecular interactions (e.g., sites of protein binding or participation in A-minor or other RNA-RNA contacts), and (4) participation in structured (e.g., GNRA; G/g = guanosine, N/n = any nucleotide, R/r = purine, A/a = adenosine) or unstructured loops. A representative input file is shown in Supplementary Table S1.
Fig. 1 (caption, partial) The chemical shifts of the N(i) residue are analyzed in this work, and this strand may be preceded by a base-paired (WC or GU wobble) nucleotide (pre_n) or a non-base-paired residue (5loop), or followed by a base-paired residue (suc_n) or non-base-paired residue (3loop). b Plot of the database chemical shift (automatically re-referenced as described in the text) (δ) versus calculated chemical shift (δpred) for the 3758 assignment depositions utilized in the present study (rms deviation = 0.056). c Plot of δ versus mean chemical shift (⟨δ⟩) for residues in canonical triplets (triplets that contain only GC and/or AU base pairs and are both preceded and followed by a GC and/or AU base pair) (rms deviation = 0.043)
J Biomol NMR (2013) 55: 33-46
We chose a relatively conservative approach in modeling the effect of the neighborhood of each central base pair. This was done because there are still, especially in comparison to proteins, relatively few chemical shift assignment sets for RNA deposited at the BMRB. Rather than using any nonlinear or neural network approach, we used an approach similar to the chemical shift increment method of Pretsch as used in predicting spectra of small organic molecules (Pretsch et al. 2009). Thus, for the central residue of each WC-BP triplet, we defined the attributes describing the neighborhood of the central nucleotide as described above, and calculated the contribution that each attribute makes to the predicted chemical shift. The predicted chemical shift is then a base chemical shift plus the linear contribution of the value corresponding to each attribute present in that nucleotide's environment. The contribution of each attribute was calculated by linear regression of the chemical shifts in our database of RNA chemical shifts with the set of explanatory variables represented by the neighborhood attributes.
The constant term of our regression model corresponds to a nucleotide embedded in a triplet of Watson-Crick base pairs with a U (uridine) flanking it on both the 5′ and 3′ sides and Watson-Crick base-paired nucleotides at the 5′ and 3′ ends of the triplet. Our analysis included a total of 15 potential variables (Table 1), of which only some might potentially contribute significantly to the shift of a specific atom in a given central nucleotide. Because the approach includes a large number of independent variables relative to the chemical shift datasets, there was a significant danger of over-fitting using a conventional linear regression algorithm. Over-fitting can lead to excellent prediction of the training set, but poor predictive capability on novel datasets. To minimize the risk of over-fitting we chose an algorithm, Pace Regression (Projection Adjustment by Contribution Estimation), that is capable of assessing the importance of each of the parameters. Calculations were performed using the Weka Machine Learning and Data Mining Library system, which allowed us to perform a statistical analysis of the prediction model (Witten et al. 2011). Pace Regression is a linear regression system that uses various information criteria to assess the degree of importance of the regression variables (Wang and Witten 2002). Thus it provides one solution to the subset selection problem (which subset of a set of potential regressors is the appropriate set to explain the data), and thereby minimizes the risk of over-fitting and maximizes the predictive capability on previously unseen data.
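The increment model described above can be sketched as follows. This is a minimal illustration, not the RNAShifts implementation: the attribute list is taken from the column labels described for Table 1 and may not be the full set of 15 variables, and the coefficients in the usage note are hypothetical.

```python
# Attribute names follow the column labels described for Table 1.
# Treating this as the complete variable set is an assumption.
ATTRIBUTES = [
    "pre_a", "pre_c", "pre_g", "pre_gu",
    "suc_a", "suc_c", "suc_g", "suc_gu",
    "GU", "5ter", "3ter", "5loop", "3loop",
]


def encode(present):
    """Return a 0/1 indicator row marking which attributes are present."""
    unknown = set(present) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [1 if name in present else 0 for name in ATTRIBUTES]


def predict(const, contributions, row):
    """Base (uXu) shift plus the contribution of every attribute present."""
    return const + sum(c * x for c, x in zip(contributions, row))
```

For the reference uXu environment `encode(set())` is all zeros, so `predict` returns the constant term unchanged; with hypothetical coefficients of 0.1 ppm for every attribute, `predict(7.0, [0.1] * 13, encode({"pre_g", "5loop"}))` gives approximately 7.2.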
Use of Weka provided not only access to Pace Regression, but also various assessments of the quality of the predictions. In particular, we used 10-fold stratified cross-validation during our analysis. Rather than providing correlation coefficients and root mean squared (rms) deviations of the predictions using all the data in the prediction, this technique trains the model on 90% of the data and then assesses the results of predicting the remaining 10% of the data. The process is repeated 10 times, using a different subset of the data each time, and derives the correlation coefficients and rms deviations based on the whole process. Pace Regression was used independently on each atom type present in each of the four central nucleotides for a total of 19 regression calculations.
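The cross-validation procedure can be illustrated with a short sketch. It is not Weka's implementation: the folds here are made by simple shuffling rather than stratification, and the toy `fit_group_means` model (a hypothetical stand-in for Pace Regression) just predicts the mean training shift for each attribute pattern.

```python
import math
import random


def k_fold_rms(data, fit, k=10, seed=0):
    """Cross-validated rms error over (x, y) pairs.

    `fit(train)` must return a function mapping x to a predicted y.
    Each point is predicted exactly once, by a model that never saw it.
    """
    if not 2 <= k <= len(data):
        raise ValueError("need 2 <= k <= len(data)")
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    sq_err = 0.0
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = fit(train)
        sq_err += sum((model(x) - y) ** 2 for x, y in held_out)
    return math.sqrt(sq_err / len(data))


def fit_group_means(train):
    """Toy model: mean training shift per attribute pattern x."""
    sums = {}
    for x, y in train:
        total, n = sums.get(x, (0.0, 0))
        sums[x] = (total + y, n + 1)
    overall = sum(y for _, y in train) / len(train)
    return lambda x: sums[x][0] / sums[x][1] if x in sums else overall
```

Because the error is accumulated only over held-out points, the resulting rms plays the role of the cross-validated rms (xrms) rather than the fit rms.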
We were unable in our analysis to adequately identify and control for sample conditions (pH, temperature, ionic strength, etc.) and unusual molecular conformation, and there is a significant possibility of misassignment, especially of some atom types. Therefore, after dropping a single obvious major outlier, we minimized these effects by automatically trimming outliers and automatically adjusting the reference for the chemical shift sets. Automated outlier elimination was performed by running two passes of the Pace Regression for each atom/central nucleotide. In the first pass, the rms deviations between the experimental and predicted values were calculated using all of the data. Any data values that deviated from the predicted values by more than three times the rms deviation value were dropped, and a second pass of the Pace Regression was performed on the now trimmed dataset. Automatic re-referencing was achieved by performing the above analysis (including outlier detection) twice. In the first of these passes, the mean error of prediction was calculated for all the shifts from each BMRB file. Prior to the second pass, each shift was corrected by the mean deviation calculated for the corresponding BMRB file. The chemical shift corrections determined by this approach are listed in Table S3.
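The trimming and re-referencing steps can be sketched as below. This is an illustration under stated assumptions: `predict` stands for whatever fitted model is available (the procedure described above refits the Pace Regression after each step, which is omitted here), the record keys `shift` and `file` are hypothetical, and the sign convention (error = observed minus predicted) is assumed.

```python
import math


def rms(values):
    """Square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def trim_outliers(records, predict, n_rms=3.0):
    """Keep records whose |observed - predicted| is within n_rms * rms."""
    residuals = [r["shift"] - predict(r) for r in records]
    cutoff = n_rms * rms(residuals)
    return [r for r, res in zip(records, residuals) if abs(res) <= cutoff]


def rereference(records, predict):
    """Subtract each source file's mean prediction error from its shifts."""
    errors = {}
    for r in records:
        errors.setdefault(r["file"], []).append(r["shift"] - predict(r))
    offset = {f: sum(e) / len(e) for f, e in errors.items()}
    return [dict(r, shift=r["shift"] - offset[r["file"]]) for r in records]
```

In a full pipeline the model would be refitted on the output of `trim_outliers`, and the trim-and-fit cycle repeated after `rereference`, mirroring the two passes described in the text.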
The RNAShifts program was written using JTcl (http://jtcl.kenai.com) and Swank (http://swank.kenai.com), which are the Java implementations of the Tcl programming language and Tk graphical user interface toolkit (Ousterhout and Jones 2010). The analysis mode is run in three stages. The first loads BMRB files (fetching them from http://bmrb.wisc.edu if necessary), extracts chemical shifts, and then uses the input template to assign attributes to each shift. The second stage reads the output of the first stage and generates input files in the format used by Weka. The third executes Weka multiple times for each proton type, manages the two passes used for outlier detection, and generates various statistical output files. The graphical interface module allows plotting predicted and experimental data subject to various criteria for choosing subsets of the data and attributes for plotting. The RNAShifts program is available upon request from the author (BAJ).

Results and discussion
Table 1 (caption) Output from the Pace Regression analysis. Each row represents an individual atom type in the specified nucleotide (e.g., A-H2 is the H2 proton of adenine). The column labeled const represents the chemical shift of that atom in the triplet uXu when none of the additional attributes represented in subsequent columns are present. Contributions with values equal to 0 represent attributes that the Pace Regression algorithm found could not be supported by the data and were thereby automatically excluded from the regression analysis. The contributions from columns labeled pre_x and suc_x, where x is a, c, g, or gu, are used where the preceding or succeeding nucleotide is not a u. A GU attribute represents the case where the nucleotide is in a GU, rather than GC, base pair, and can apply to the i - 1 (pre_gu), i + 1 (suc_gu) or central (GU) position of the triplet (with the same approach used for UG wobbles). The 5ter attribute indicates the triplet is at the 5′ end (so there is no i - 2 nucleotide), and 3ter indicates the triplet is at the 3′ end (so there is no i + 2 nucleotide). The loop attributes indicate that the i - 2 (5loop) nucleotide or i + 2 (3loop) nucleotide is in a loop or mismatched base pair. The columns labeled corr and rms represent the correlation coefficient (corr) and the square root of the mean of squared deviations between predicted and experimental values (rms) for all the data in the fit. The columns labeled xcorr and xrms represent the same values, but calculated with 10-fold stratified cross-validation. The column labeled nobs represents the number of observations available and ntrim the number that were automatically eliminated as outliers
Outlier chemical shifts
The statistical analysis described above identified 65 chemical shift assignments from the full BMRB database that, after automated re-referencing, deviated from expected values by more than 3 standard deviations. Seven of these assignments were associated with earlier publications from the M.F.S. laboratory, and inspection of the original NMR spectra revealed that these signals had been erroneously assigned (corrections to BMRB files 15113 and 17083 have now been made). We also discovered relatively large systematic chemical shift variations for one of our earlier depositions (BMRB ID 6094) that were associated with improper chemical shift referencing (the residual water signal at 35°C was erroneously assigned a chemical shift of 4.792 ppm). We therefore updated the BMRB with the modified values, which were used in the present analysis. Based on examination of published NMR spectra, we were able to correct 19 additional assignments in the BMRB; in many cases, the signals had been properly assigned in the published spectra but improperly recorded in the BMRB files. In all cases, the re-assigned (or typo-corrected) shifts were well within the 3-standard deviation cutoff. We were unable to determine the nature of the deviations observed for the remaining 38 outliers because relevant regions of the NMR spectra were not provided in the original publications, and these 38 assignments were not used in subsequent analyses. The majority of these outliers were associated with ribose protons, of which 17 were for highly overlapping H2′ and H3′ proton signals. Thus, of the 3,796 available chemical shifts, 3,758 were retained for analysis and 38 (1%; mostly ribose assignments) were excluded.
Chemical shifts that were either re-assigned or excluded are summarized in Supplementary Table S2, and referencing corrections employed for all of the utilized depositions are summarized in Supplementary Table S3. The final dataset included values for the central base pairs of all of the 4^3 (64) possible combinations of WC-BP triplets, with as few as one, and as many as 23, assignments for each of the possible combinations. A total of 137 additional triplets that contain G:U base pairs were also included in the analysis. As shown in Fig. 1b, the retained and re-referenced BMRB shifts (δ) were in good agreement with predicted shifts (δpred) (rms deviation for the entire dataset = 0.056). Good agreement was also obtained when training was performed using a twofold cross-validation analysis, in which half of the data were used for training and half for validation (rms = 0.069 ppm), and when training was performed with 60% of randomly ordered BMRB entries and validation assessed with the remaining 40% of the data (rms = 0.063, averaged over all atom types).

Chemical shift trends for canonical triplets
The re-referenced NMR chemical shifts (δ) were generally in good agreement with the mean shifts calculated for each unique sequence/atom type (⟨δ⟩). For example, excellent correlations were observed in a plot of δ versus ⟨δ⟩ for the central residues of "canonical triplets," defined here as triplets that contain only GC and/or AU base pairs and are both preceded and followed by canonical GC or AU base pairs (rms deviation = 0.043; Fig. 1c). The database utilized does not contain chemical shift values for aAa and uCa canonical triplets, nor for the H2′ and/or H3′ protons of the following canonical triplets: aAu (H2′, H3′), uGa (H2′, H3′), aUu (H2′, H3′), gGu (H3′), aCc (H3′). (Note that data were available for non-canonical forms of these triplets and were included in the analysis.) There were no significant differences in correlation coefficients obtained upon fitting δ versus ⟨δ⟩ for the A, G, C and U nucleotides, but as [...] Fig. 2, and data for the n-G-n, n-C-n and n-U-n canonical triplets are plotted in Fig. 3. The contributions of the attributes calculated by Pace Regression are plotted in Fig. 4. The observed trends are consistent with several generalized correlations identified by Wijmenga and coworkers (Cromsigt et al. 2001). For example, δ values for purine-H8 protons in canonical triplets are highly sensitive to the nature of the 5′-residue within the triplet, with 5′-purines associated with more upfield chemical shifts. We further observe that 5′-uridines induce a larger downfield H8 shift than 5′-cytidines (Figs. 2c, 3a), and that the H8 chemical shift is also sensitive to the nature of the 3′-residue (Figs. 2c and 3a). For example, the A-H8 ⟨δ⟩can values observed for n-A-a canonical triplets are consistently downfield relative to those observed for n-A-g canonical triplets (Fig. 2c), and a similar trend is observed for n-G-a versus n-G-g triplets (Fig. 3a).
Fig. 3 Plots of re-referenced 1H NMR chemical shifts (δ) reported for the central guanosine (a), cytosine (b) and uracil (c) residues within canonical triplets (as defined in text and Fig. 1)
The adenosine-H2 proton is sensitive to the nature of both the 5′- and 3′-nucleotides (Cromsigt et al. 2001) and exhibits a large chemical shift range of ~6.4-8.0 ppm. Importantly, the simultaneous presence of a 5′-pyrimidine and 3′-purine is associated with a significant upfield A-H2 NMR chemical shift, to a less crowded region of the RNA NMR spectrum (6.4-7.1 ppm, Fig. 2b) where the signals are potentially useful for structural characterization of large RNAs (Lu et al. 2011). In contrast, significant downfield shifts are observed for the H2 protons of adenosines that are preceded by a purine and followed by a pyrimidine (Fig. 2b). The H5 protons of C and U are sensitive to the nature of the preceding residue of the triplet but exhibit almost no detectable sensitivity to the nature of the following residue (Fig. 3c, d). The pyrimidine H6 protons are also more sensitive to the nature of the 5′ residue, but exhibit some sensitivity to the 3′ residue as well (Fig. 3c, e). The ribose protons appear to be sensitive to the nature of both the 5′ and 3′ residues, although the limited chemical shift dispersion and uncertainties regarding some of the H2′ and H3′ assignments make it more difficult to identify clear chemical shift trends.
Influence of 5′- and 3′-terminal base pairs within the WC-BP triplet
The presence of 5′- and/or 3′-terminating base pairs within the WC-BP triplet has a significant influence on the chemical shifts of the central residue. As shown in Fig. 5a, the aromatic, H1′, H2′ and H3′ protons of the central residue exhibit small but significant downfield shifts relative to ⟨δ⟩can values when adjacent to a 5′-terminating base-paired residue (the single H3′ outlier is most likely due to a misassignment or typo). The most significant perturbations are observed for the aromatic protons, which exhibit deviations in the range of 0.15-0.45 ppm. In contrast, most signals for residues that reside next to a 3′-terminal WC-BP exhibit smaller but nevertheless consistent upfield shifts relative to the ⟨δ⟩can values (Fig. 5b). The most significant shifts are observed for H2′ protons, which have a mean upfield shift of 0.2 ppm.
Influence of non-canonical elements adjacent to the WC-BP triplets
Our analysis assessed the influence of non-canonical structural elements that reside immediately upstream (5loop) or downstream (3loop) of the WC-BP triplets. We defined these elements to include internally stacked residues that are not involved in Watson-Crick base pairing, looped or bulge residues believed to be flexible or structured (e.g., K-turns), and residues involved in base-triples or long-distance RNA-RNA interactions. As shown in Fig. 5d, the presence of non-canonical RNA structures at the 3′-end of the WC-BP triplet does not appear to significantly influence any of the proton shifts associated with the central residue of the triplet. On the other hand, the presence of non-canonical structure on the 5′-side of the WC-BP triplet results in small but significant upfield shifts relative to ⟨δ⟩can values for the aromatic and H1′ protons (Fig. 5c).
Fig. 4 Plot of the chemical shift contributions (δcontrib) of each attribute relative to a canonical uNu triplet as obtained via Pace Regression for aromatic (a) and ribose (b) proton assignments (positive values denote downfield shifts). Data in these plots are derived from Table 1. For simplification, data for aromatic protons with similar trends in their response to the attributes were combined, and within each group of proton type, the largest absolute value is plotted. Because this procedure can mask the details of individual proton types, one should use this plot for observing general trends and refer to the specific contributions in Table 1
Influence of G:U base pairing within the triplet
Because GU base pairs are both prevalent and functionally important (Varani and McClain 2000), we also assessed the influence of this class of base pairing on 1H NMR chemical shifts. Systematic variations are apparent for some protons of the central U of triplets when they are base paired with G. Considering only canonical triplets in which the central U:A base pair is substituted by U:G, the H5 protons exhibit a downfield shift and the H1′ and H2′ protons exhibit small upfield shifts, whereas the H6 and H3′ chemical shifts appear to be relatively unperturbed (Fig. 6a). If the central residue of the canonical triplet is a G, base pairing with U results in a small downfield shift of the H2′ NMR signal and upfield shift of H3′ (relative to base pairing with C) but does not significantly affect the shifts of the other G protons (Fig. 6b).
The presence of GU (or UG) base pairs at the n(i-1) or n(i+1) positions can significantly influence the signals of the central residue, and data for otherwise canonical triplets are shown in Fig. 6c-f. For triplets in which the central residue is a pyrimidine, the H1′ and H3′ protons are relatively unaffected by the presence of a preceding GU wobble (Fig. 6c). However, the H6, H5 and H2′ protons are systematically perturbed, with the u(wob)-U/C-n H6 signal shifted downfield, the g(wob)-U-n H6 signal shifted upfield, and the g(wob)-C-n C-H6 signal shifted downfield relative to the average canonical shifts (Fig. 6c). Interestingly, the u(wob)-C/U-n H5 shifts are relatively unperturbed relative to canonical shifts, whereas g(wob)-C/U-n H5 shifts are generally shifted downfield relative to the signals observed for the canonical triplets (Fig. 6c). Also, H2′ shifts of the central pyrimidine are shifted downfield when preceded by a UG wobble, but are shifted upfield when preceded by a GU wobble (Fig. 6c). When the central residue is a purine, the H1′ and H3′ proton shifts are relatively unaffected by a preceding wobble, but the H8, H2, and H2′ protons generally exhibit systematic downfield shifts, with the magnitude of the shift being somewhat greater for a preceding U(wob) compared to a preceding G(wob) (Fig. 6d).
The presence of a subsequent GU wobble can also result in systematic chemical shift perturbations. For triplets in which the central residue is a pyrimidine followed by a U(wob) mismatch, the H6 and H3′ signals exhibit small upfield shifts but the remaining signals do not appear to be significantly perturbed (Fig. 6e). In contrast, the presence of a subsequent G(wob) mismatch does not appear to lead to any detectable perturbations (Fig. 6e). For triplets in which the central residue is a purine, a subsequent G(wob) leads to a small systematic downfield shift of the H1′ proton but does not significantly perturb the other NMR signals, whereas a subsequent U(wob) pair results in small upfield shifts of the H6 and H5 signals and a small downfield shift of the H2′ signal (Fig. 6f).

Chemical shift predictions
The Pace Regression approach described above provided predicted chemical shift values for all possible combinations of WC-BP triplet parameters used in the present study (Table 1). 1H NMR chemical shifts observed for the canonical triplets are in good agreement with the predicted shifts (δpred) (Fig. 7a; rms deviation = 0.050). Excellent agreement was also observed for triplets that contained only a single modifying element (e.g., only a 5ter but no other non-canonical elements), with the greatest deviations observed for a few of the H2′ and H3′ assignments (Fig. 7b-h; rms deviation in the range 0.057-0.057). Good fits were also observed for triplets that contained more than one modifying element (rms deviation for all canonical and non-canonical triplets = 0.056; Fig. 7i). As observed in other fits, the largest deviations are observed for the H2′ and H3′ proton assignments.
The data in Table 1 can be used in computer programs such as NMRView (Johnson 2004; Johnson and Blevins 1994) or ad hoc calculations to predict chemical shifts. The constant term represents the value of the given atom in nucleotide i, when the i - 1 and i + 1 nucleotides are both U, and all nucleotides from i - 2 through i + 2 are present and in canonical Watson-Crick base pairs. For example, an A-H2 proton in a canonical uAu triplet would be at 7.0299 ppm. Calculating the shift of the A-H2 proton in a different environment is done by adding to the constant term the contributions from any applicable columns in the A-H2 row of Table 1. For example, the chemical shift of an A-H2 proton in a gAc triplet, in which the i - 2 residue is in a loop, would be: 7.8469 ppm (7.0299 + 0.6899 + 0.0658 + 0.0622). If the i - 1 G is in a GU (rather than GC) base pair, the A-H2 proton chemical shift would be: 7.9217 ppm (7.0299 + 0.7637 + 0.0658 + 0.0622).
Fig. 6 (caption, partial) c-f Influence of GU base pairs at the n(i-1) and n(i+1) positions of the n(i-1)-N(i)-n(i+1) triplet on the NMR chemical shift of the central canonical base pair. Symbols are defined in the panel insets
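The ad hoc calculation for the A-H2 proton can be reproduced with a few lines of code. The sketch below uses only the values quoted in the text for the A-H2 row of Table 1; the dictionary layout and function name are illustrative, and the assignment of the two smaller terms to suc_c and 5loop (rather than the reverse) is an assumption that does not affect the sums.

```python
# A-H2 contributions (ppm) as quoted in the text. Which of the two smaller
# terms is suc_c and which is 5loop is an assumption.
A_H2 = {
    "const": 7.0299,
    "pre_g": 0.6899,
    "pre_gu": 0.7637,
    "suc_c": 0.0658,
    "5loop": 0.0622,
}


def predicted_shift(row, attributes):
    """Constant term plus the contributions of the applicable attributes."""
    return row["const"] + sum(row[name] for name in attributes)


uAu = predicted_shift(A_H2, [])
gAc_loop = predicted_shift(A_H2, ["pre_g", "suc_c", "5loop"])
gu_variant = predicted_shift(A_H2, ["pre_gu", "suc_c", "5loop"])
print(f"{uAu:.4f} {gAc_loop:.4f} {gu_variant:.4f}")
```

Summing the quoted terms gives 7.8478 and 7.9216 ppm, which differ from the quoted totals (7.8469 and 7.9217 ppm) in the third or fourth decimal place; this may reflect rounding of the tabulated contributions.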

Conclusions
The present studies provide the first quantitative analysis of the RNA non-exchangeable 1H NMR chemical shifts in the BMRB. Our studies identify sequence-dependent chemical shift correlations and establish the influence of terminating base pairs within the triplets and canonical and non-canonical structures adjacent to the BP triplets (i.e. bulges, loops, WC and non-WC BPs). Excellent correlations were observed despite the fact that the NMR data were obtained under different conditions of pH, buffer, ionic strength, and temperature. A relatively small number of outliers that were not utilized in the analysis, mainly ribose H2′ and H3′ assignments, are likely due to assignment or typographical errors and should be re-examined. Assignments for some triplet combinations were either limited or lacking; for example, the database does not include assignments for two of the 64 possible "canonical triplets." Although shifts for these triplets could be predicted from assignments made for non-canonical triplets (e.g., WC-BP triplets adjacent to non-canonical structures or that contain terminal or GU base pairs), future studies of oligonucleotides with the missing sequences are clearly in order.
The statistics indicate that the protocol employed for chemical shift predictions (assigning attributes to different triplet environments and then conducting attribute selection and linear model fitting with Pace Regression) performed very well for the data used in this study. However, as this research moves forward and the number of attributes is expanded, alternative fitting methods such as Neural Networks, which allow attributes to contribute in non-linear modes, may be required. The protocol used here can, of course, also be applied to nitrogen and carbon nuclei, and it will be interesting to determine whether these nuclei exhibit similar environment- and structure-dependent sensitivities.
The 1H NMR shifts observed for residues that participate directly in long-range RNA-RNA interactions or in interactions with ligands or proteins, as identified in the associated publications and/or the structure coordinate (PDB) files, generally deviated from the A-form helical triplet shifts. For example, the H6 and H5 NMR chemical shifts observed for residue U5 of the ScYLV P1-P2 frameshifting pseudoknot (7.93 and 5.25 ppm, respectively) (Cornish et al. 2005) deviate by 0.24 and 0.29 ppm from the expected values (7.69 and 5.54 ppm, respectively) and are well outside the rms range calculated for canonical gUg triplets (rms = 0.06 and 0.03 ppm, respectively). Significant deviations were also observed for otherwise canonical A-form helical residues that interact with protein elements. In future studies of RNAs with unknown structures, outlier chemical shifts may serve as useful indicators of potential long-range RNA:RNA or RNA:protein interactions. In addition, the trends identified in the present studies should facilitate the refinement of algorithms used to calculate 1H NMR chemical shifts on the basis of RNA structural coordinates alone (Cromsigt et al. 2001; Case 1995, 2002; Dejaegere et al. 1999; Case et al. 2005), thereby making the 1H NMR chemical shift a more useful parameter for RNA structure refinement. Because the variations in chemical shifts observed for atoms of a given triplet are small, variations in the 3D structures of the triplets should also be small. This observation lends support to refinement approaches that utilize residual dipolar couplings and/or small angle X-ray scattering (SAXS) data to orient idealized A-form helices (Funari et al. 2000; Walsh et al. 2004; Zuo et al. 2008; Grishaev et al. 2008; Wang et al. 2009, 2010; Burke et al. 2012).
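The comparison made above for residue U5 can be sketched as a simple outlier test in Python. This is only an illustration, not the procedure implemented in RNAShifts: the function name is hypothetical, and the cutoff of three times the triplet rms is an assumed threshold that the text does not specify. The observed, expected, and rms values are those quoted above.

```python
def flag_outliers(observed, expected, rms, n_rms=3.0):
    """Return {atom: deviation in ppm} for atoms whose |observed - expected|
    exceeds n_rms times the rms of the triplet class.

    n_rms = 3.0 is an assumed cutoff chosen for illustration only.
    """
    flagged = {}
    for atom, obs in observed.items():
        deviation = obs - expected[atom]
        if abs(deviation) > n_rms * rms[atom]:
            flagged[atom] = round(deviation, 2)
    return flagged


# Values quoted in the text for U5 of the ScYLV pseudoknot (ppm).
u5_observed = {"H6": 7.93, "H5": 5.25}
u5_expected = {"H6": 7.69, "H5": 5.54}  # database prediction for a gUg triplet
gug_rms = {"H6": 0.06, "H5": 0.03}      # rms for canonical gUg triplets

u5_flagged = flag_outliers(u5_observed, u5_expected, gug_rms)
for atom, deviation in u5_flagged.items():
    print(f"U5 {atom}: deviates by {deviation:+.2f} ppm from the database value")
```

With these inputs both protons are flagged, which matches the observation that U5 lies well outside the range expected for an A-form helical residue.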
In the course of these studies, chemical shift trends were tentatively identified for a number of non-A-form helical structures that are well represented in the BMRB, particularly those of conserved tetraloops (e.g., GNRA). Future studies that include parameterizations for tetraloops, base triples, and other conserved and well-defined RNA substructures will likely lead to the identification of additional trends useful for 1H NMR assignment and verification. In addition, it should now be possible to incorporate the approach into software programs to enable semi-automated assignment of RNA, including large RNAs with different combinations of 2H-labeled or segmentally labeled nucleotides (underway).