Database proton NMR chemical shifts for RNA signal assignment and validation
- Cite this article as:
- Barton, S., Heng, X., Johnson, B.A. et al. J Biomol NMR (2013) 55: 33. doi:10.1007/s10858-012-9683-9
The Biological Magnetic Resonance Data Bank contains NMR chemical shift depositions for 132 RNAs and RNA-containing complexes. We have analyzed the 1H NMR chemical shifts reported for non-exchangeable protons of residues that reside within A-form helical regions of these RNAs. The analysis focused on the central base pair within a stretch of three adjacent base pairs (BP triplets), and included both Watson–Crick (WC; G:C, A:U) and G:U wobble pairs. Chemical shift values were included for all 4³ (64) possible WC-BP triplets, as well as 137 additional triplets that contain one or more G:U wobbles. Sequence-dependent chemical shift correlations were identified, including correlations involving terminating base pairs within the triplets and canonical and non-canonical structures adjacent to the BP triplets (i.e. bulges, loops, WC and non-WC BPs), despite the fact that the NMR data were obtained under different conditions of pH, buffer, ionic strength, and temperature. A computer program (RNAShifts) was developed that enables convenient comparison of RNA 1H NMR assignments with database predictions, which should facilitate future signal assignment/validation efforts and enable rapid identification of non-canonical RNA structures and RNA-ligand/protein interaction sites.
Keywords: RNA; Chemical shift; A-form helices; NMR signal assignment and validation
BMRB: Biological Magnetic Resonance Data Bank
NDB: Nucleic Acid Database
PDB: Protein Data Bank
WC-BP: Watson–Crick base pair
〈δ〉: Mean NMR chemical shift
〈δ〉can: Mean NMR chemical shift determined for a canonical triplet, defined here as a stretch of three sequential canonical base pairs that is both preceded and followed by at least one canonical base pair
δpred: Predicted NMR chemical shift
RNA molecules participate in a large and expanding array of known biological functions including gene regulation, maintenance of sub-cellular and viral structure, intracellular trafficking, antiviral restriction, catalysis, and, of course, propagation of genetic information (Korostelev and Noller 2007; Steitz 2008; Bessonov et al. 2008; Boisvert et al. 2007; Wakeman et al. 2007; Edwards et al. 2007; Bartel 2004; Kim 2005; Hassouna et al. 1984; Brodersen and Voinnet 2006; Doudna and Rath 2002; Ponting et al. 2009). Like proteins, the functional activities of most RNAs are intrinsically linked to their structures. Unfortunately, although a wealth of structural information is currently available for functionally active proteins and protein domains, structural information for functionally relevant RNAs remains relatively limited. Thus, the Protein Data Bank (PDB; http://www.rcsb.org/pdb/home/home.do) currently contains more than 55,000 protein structure depositions, whereas the Nucleic Acid Database (NDB; http://ndbserver.rutgers.edu/) contains atomic coordinate depositions for approximately 2,100 RNAs and protein/ligand-RNA complexes, of which ~1,600 were determined by X-ray crystallography and ~500 by NMR spectroscopy. Conformational heterogeneity and the presence of a relatively uniform, negative surface charge can hinder structural studies by X-ray crystallography, and as discussed below, difficulties associated primarily with limited chemical shift dispersion have generally limited NMR applications to relatively small RNAs. For these reasons, much of what is known about the structures of biologically functional RNAs (primarily secondary structure information) has been obtained by chemical and enzymatic accessibility mapping experiments, coupled with phylogenetic and free energy calculations. Although RNA probing methodologies are potentially very powerful and have been widely applied (Peattie and Gilbert 1980; Ehresmann et al. 1987; Stern et al. 1988; Forconi and Herschlag 2009; Weeks 2010), interpretation of the data can be problematic, particularly for RNAs that exist as equilibrium mixtures of multiple conformational species (see for example, Kladwang et al. 2011; Houck-Loomis et al. 2011; Lu et al. 2011a,b; Miyazaki et al. 2010).
NMR is a potentially powerful tool for probing RNA structure (Wüthrich 1986; Allain and Varani 1997; Lukavsky and Puglisi 2005), but its application to larger RNAs can be complicated by a number of factors. Inter-residue scalar couplings are generally weak, limiting the utility of “through bond” inter-residue connectivity experiments for signal assignment. The most commonly used assignment approach involves identification of sequential inter-residue NOE connectivities (Wüthrich 1986), but even this approach can be problematic for modest sized RNAs (ca. 25–60 nucleotides). Although resolution can be increased by 1H–13C heteronuclear spectral editing (Peterson et al. 2004; D’Souza et al. 2004; D’Souza and Summers 2004; Davis et al. 2005; Batey et al. 1995; Batey et al. 1992; Nikonowicz and Pardi 1992; Nikonowicz et al. 1992; Michnicka et al. 1993; Kim et al. 1995; Xu et al. 1996; Kim et al. 2002; Lukavsky et al. 2003; Lu et al. 2009), chemical shift dispersion is relatively limited (Allain and Varani 1997; Lukavsky and Puglisi 2005), and severe dipolar broadening of the aromatic 1H–13C signals that are critical for structural analysis can preclude detection of 1H–13C correlation NMR signals in larger RNAs (Lu et al. 2011a). In addition, interproton distances between elements of secondary structure in larger RNAs typically exceed those required for NOE detection (Lu et al. 2009; Tolbert et al. 2010). Thus, high-resolution NMR-based structural studies have been applied mainly to relatively small RNAs: Of the 496 RNA NMR structures that have been deposited in the NDB, only 19 contain 60 or more nucleotides; the largest is a symmetrical dimer of 132 nucleotides (two 66 nucleotide subunits), and the average size is ~27 nucleotides.
One approach for addressing issues of signal degeneracy involves the application of traditional 2D NOESY experiments to RNA samples that are site- and/or nucleotide-specifically labeled with deuterium (Miyazaki et al. 2010; D’Souza et al. 2004; Davis et al. 2005; Kim et al. 1995; Lu et al. 2009; Zhou et al. 2006; Nelissen et al. 2008; Heng et al. 2012; Duss et al. 2012). 2H-isotope edited 2D NMR has enabled nearly complete assignment of the aromatic, H1′, H2′, and H3′ ribose signals of RNAs containing up to 132 nucleotides (Miyazaki et al. 2010), and has also enabled assignment of selected residues within a 720 nucleotide RNA (Lu et al. 2011a; Heng et al. 2012). This approach, which involves comparison of high resolution 2D NOESY spectra obtained for multiple, differentially 2H-labeled samples, avoids relaxation problems associated with aromatic 1H–13C spectral editing and enables observation of signals in 2D 1H–1H NOESY spectra for protons with T2 values as short as 8 ms (Lu et al. 2011a). Although resolution and sensitivity can be improved dramatically by nucleotide-specific deuteration, signal overlap can still hinder the assignment process for RNAs comprising more than 150 nucleotides (Summers and coworkers, unpublished).
NMR chemical shifts have been widely utilized for NMR signal assignment and structural studies of proteins (for examples see: Grzesiek and Bax 1993; Wishart and Sykes 1994; Wishart et al. 1991, 1992; Cavalli et al. 2007; Shen et al. 2008; Wishart et al. 2008). Although relationships between 13C chemical shifts and RNA structure have been identified (Ebrahimi et al. 2001; Fares et al. 2007; Ohlenschlager et al. 2008), and 15N NMR chemical shifts have been incorporated into a probabilistic approach for automated assignment of RNA imino groups (Bahrami et al. 2012), heteronuclear NMR chemical shifts have not been widely exploited for RNA studies (Lam and Chi 2010; Aeschbacher et al. 2012). On the other hand, Wijmenga and co-workers showed that non-exchangeable 1H NMR chemical shifts for A-form helical residues could be back-calculated from a given 3D RNA structure (Cromsigt et al. 2001). For 28 examples tested, the back-calculated shifts were in good agreement with shifts reported in the Biological Magnetic Resonance Data Bank (BMRB; www.bmrb.wisc.edu), and some general 1H NMR chemical shift trends were identified (Cromsigt et al. 2001). Here we report a detailed analysis of the H8, H2, H6, H5, H1′, H2′, and H3′ proton NMR chemical shifts that have been deposited in the BMRB. After correcting for differences in chemical shift referencing, excellent correlations were observed, despite the fact that the data were obtained over a wide range of sample conditions. Our findings confirm and quantify previously identified trends and identify new sequence- and structure-dependent chemical shift correlations that can be used for assignment and/or validation of non-exchangeable 1H NMR chemical shifts and for the identification of non-canonical RNA structural features and intermolecular interaction sites.
NMR data were analyzed using “RNAShifts”, a program designed to download and analyze RNA 1H NMR chemical shifts that have been deposited in the BMRB. (Locally derived shifts that have yet to be deposited can also be analyzed.) All 132 depositions available in the BMRB were used in the current analysis except BMRB ID 5170, 6814, 4816, 15697, 15915, 5023, 4253, 4894, and 15257, which could not be reliably used either because the BMRB assignments did not match the published PDB assignments, or because there was no associated publication or PDB file that could be used to identify RNA secondary structure. As additional input, files were manually generated for each deposition, based on published structural studies, that identify for each residue (1) whether or not the residue is base-paired, (2) the nature of the base-pairing partner, (3) any long-range intra- and/or inter-molecular interactions (e.g., sites of protein binding or participation in A-minor or other RNA–RNA contacts), and (4) participation in structured (e.g., GNRA; G/g = guanosine, N/n = any nucleotide; R/r = purine; A/a = adenosine) or unstructured loops. A representative input file is shown in Supplementary Table S1.
We chose a relatively conservative approach in modeling the effect of the neighborhood of each central base pair. This was done because there are still, especially in comparison to proteins, relatively few chemical shift assignment sets for RNA deposited at the BMRB. Rather than using any non-linear or neural network approach we used an approach similar to the chemical shift increment method of Pretsch as used in predicting spectra of small organic molecules (Pretsch et al. 2009). Thus, for the central residue of each WC-BP triplet, we defined the attributes describing the neighborhood of the central nucleotide as described above, and calculated the contribution that each attribute makes to the predicted chemical shift. The predicted chemical shift is then a base chemical shift plus the linear contribution of the value corresponding to each attribute present in that nucleotide’s environment. The contribution of each attribute was calculated by linear regression of the chemical shifts in our database of RNA chemical shifts with the set of explanatory variables represented by the neighborhood attributes. The constant term of our regression model corresponds to a nucleotide embedded in a triplet of Watson–Crick base pairs with a U (uridine) flanking it on both the 5′ and 3′ sides and Watson–Crick base-paired nucleotides at the 5′ and 3′ ends of the triplet.
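The increment model described above can be written compactly. The following is a sketch in our own notation (the symbols c_k, x_k, and K are not taken from the original text): each attribute k of the neighborhood is represented by an indicator variable x_k, and the regression coefficient c_k is its contribution to the shift.

```latex
% Predicted shift = constant term + sum of increments for the attributes present.
% \delta_0 : constant term (central nucleotide flanked by U on both sides,
%            all positions in Watson--Crick base pairs)
% x_k      : 1 if attribute k is present in the nucleotide's environment, else 0
% c_k      : fitted contribution (increment) of attribute k
\delta_{\mathrm{pred}} = \delta_{0} + \sum_{k=1}^{K} c_{k}\, x_{k},
\qquad x_{k} \in \{0, 1\}

% Plain least-squares form of the fit over N database shifts \delta_n
% (shown only as an illustration of linear regression on indicator variables):
\min_{\delta_0,\, c_1, \dots, c_K} \;
\sum_{n=1}^{N} \Bigl( \delta_{n} - \delta_{0} - \sum_{k=1}^{K} c_{k}\, x_{nk} \Bigr)^{2}
```

Because the reference environment has all x_k = 0, its predicted shift is simply the constant term, which is why the constant is interpreted as the shift of a nucleotide in that environment.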
Table 1 Sequence variables and chemical shift corrections calculated by Pace regression
Use of Weka provided not only access to Pace Regression, but also various assessments of the quality of the predictions. In particular, we used 10-fold stratified cross-validation during our analysis. Rather than providing correlation coefficients and root mean squared (rms) deviations of the predictions using all the data in the prediction, this technique trains the model on 90 % of the data and then assesses the results of predicting the remaining 10 % of the data. The process is repeated 10 times, using a different subset of the data each time and derives the correlation coefficients and rms deviations based on the whole process. Pace regression was used independently on each atom type present in each of the four central nucleotides for a total of 19 regression calculations.
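The cross-validation procedure can be illustrated with a short Python sketch. This is not the Weka implementation: the function names, the record layout, and the stand-in model (a mean shift per environment key) are assumptions made for illustration, and the folds here are plain shuffled folds rather than stratified ones.

```python
import math
import random


def fit_group_means(training):
    """Stand-in model (an assumption, not Pace regression): mean shift per key."""
    sums = {}
    for key, shift in training:
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + shift, count + 1)
    overall = sum(shift for _, shift in training) / len(training)
    means = {key: total / count for key, (total, count) in sums.items()}
    return {"means": means, "overall": overall}


def predict_group_mean(model, key):
    """Fall back to the overall mean for keys absent from the training data."""
    return model["means"].get(key, model["overall"])


def k_fold_rms(records, fit, predict, k=10, seed=0):
    """Train on k-1 folds, predict the held-out fold, repeat k times,
    and pool all held-out errors into a single rms deviation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    squared_errors = []
    for i, held_out in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit(training)
        for key, shift in held_out:
            squared_errors.append((predict(model, key) - shift) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic records: (environment key, shift in ppm); values are illustrative.
records = [("uAu", 7.03)] * 20 + [("gAc", 7.85)] * 20
rms = k_fold_rms(records, fit_group_means, predict_group_mean)
```

Any model exposing the same `fit`/`predict` pair could be substituted for the stand-in; the point of the sketch is only that every prediction entering the rms is made for data the model did not see during training.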
We were unable in our analysis to adequately identify and control for sample conditions (pH, temperature, ionic strength, etc.) and unusual molecular conformation, and there is a significant possibility of misassignment, especially of some atom types. Therefore, after dropping a single obvious major outlier, we minimized these effects by automatically trimming outliers and automatically adjusting the reference for the chemical shift sets. Automated outlier elimination was performed by running two passes of the Pace Regression for each atom/central nucleotide. In the first pass, the rms deviations between the experimental and predicted values were calculated using all of the data. Any data values that deviated from the predicted values by more than three times the rms deviation value were dropped, and a second pass of the Pace Regression was performed on the now trimmed dataset. Automatic re-referencing was achieved by performing the above analysis (including outlier detection) twice. In the first of these passes, the mean error of prediction was calculated for all the shifts from each BMRB file. Prior to the second pass, each shift was corrected by the mean deviation calculated for the corresponding BMRB file. The chemical shift corrections determined by this approach are listed in Table S3.
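The two-pass trimming and re-referencing steps can be sketched as follows. The row layout (`entry`, `env`, `shift`), the function names, and the stand-in model (mean shift per environment) are hypothetical; the actual analysis used Pace regression in Weka for the fitting step.

```python
import math
from collections import defaultdict


def fit_means(rows):
    """Stand-in model (an assumption): mean shift per environment key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["env"]].append(row["shift"])
    return {env: sum(values) / len(values) for env, values in groups.items()}


def trim_outliers(rows, cutoff=3.0):
    """Pass 1 fits all rows and computes the rms residual; rows deviating by
    more than cutoff * rms are dropped; pass 2 refits the trimmed rows."""
    model = fit_means(rows)
    residuals = [row["shift"] - model[row["env"]] for row in rows]
    rms = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    kept = [row for row, r in zip(rows, residuals) if abs(r) <= cutoff * rms]
    return kept, fit_means(kept)


def rereference(rows, cutoff=3.0):
    """Run the trimmed fit, compute the mean prediction error per BMRB entry,
    subtract it from every shift of that entry, then repeat the trimmed fit."""
    kept, model = trim_outliers(rows, cutoff)
    errors = defaultdict(list)
    for row in kept:
        errors[row["entry"]].append(row["shift"] - model[row["env"]])
    offsets = {entry: sum(v) / len(v) for entry, v in errors.items()}
    corrected = [
        dict(row, shift=row["shift"] - offsets.get(row["entry"], 0.0))
        for row in rows
    ]
    kept2, model2 = trim_outliers(corrected, cutoff)
    return kept2, model2, offsets


# Illustrative data: entry "B" is referenced 0.10 ppm higher than entry "A".
rows = [{"entry": "A", "env": "gUg", "shift": 5.50} for _ in range(10)]
rows += [{"entry": "B", "env": "gUg", "shift": 5.60} for _ in range(10)]
kept, model, offsets = rereference(rows)
```

With this stand-in model only relative referencing offsets between entries are recovered; the sketch is meant to show the order of operations, not to reproduce the corrections listed in Table S3.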
The RNAShifts program was written using JTcl (http://jtcl.kenai.com) and Swank (http://swank.kenai.com), which are the Java implementations of the Tcl programming language and Tk graphical user interface toolkit (Ousterhout and Jones 2010). The analysis mode is run in three stages. The first loads BMRB files (fetching them from http://bmrb.wisc.edu if necessary), extracts chemical shifts, and then uses the input template to assign attributes to each shift. The second stage reads the output of the first stage and generates input files in the format used by Weka. The third executes Weka multiple times for each proton type, manages the two passes used for outlier detection and generates various statistical output files. The graphical interface module allows plotting predicted and experimental data subject to various criteria for choosing subsets of the data and attributes for plotting. The RNAShifts program is available upon request from the author (BAJ).
Results and discussion
Outlier chemical shifts
The statistical analysis described above identified 64 chemical shift assignments from the full BMRB database that, after automated re-referencing, deviated from expected values by more than 3 standard deviations. Seven of these assignments were associated with earlier publications from the M.F.S. laboratory, and inspection of the original NMR spectra revealed that these signals had been erroneously assigned (corrections to BMRB files 15113 and 17083 have now been made). We also discovered relatively large systematic chemical shift variations for one of our earlier depositions (BMRB ID 6094) that were associated with improper chemical shift referencing (the residual water signal at 35 °C was erroneously assigned a chemical shift of 4.792 ppm). We therefore updated the BMRB with the modified values, which were used in the present analysis. Based on examination of published NMR spectra, we were able to correct 19 additional assignments in the BMRB; in many cases, the signals had been properly assigned in the published spectra but improperly recorded in the BMRB files. In all cases, the re-assigned (or typo-corrected) shifts were well within the 3-standard deviation cutoff. We were unable to determine the nature of the deviations observed for the remaining 38 outliers because relevant regions of the NMR spectra were not provided in the original publications, and these 38 assignments were not used in subsequent analyses. The majority of these outliers were associated with ribose protons, of which 17 were for highly overlapping H2′ and H3′ proton signals. Thus, of the 3,796 available chemical shifts, 3,758 were retained for analysis and 38 (1 %; mostly ribose assignments) were excluded.
Chemical shifts that were either re-assigned or excluded are summarized in Supplementary Table S2, and referencing corrections employed for all of the utilized depositions are summarized in Supplementary Table S3. The final dataset included values for the central base pairs of all of the 4³ (64) possible combinations of WC-BP triplets, with as few as one, and as many as 23, assignments for each of the possible combinations. A total of 137 additional triplets that contain G:U base pairs were also included in the analysis. As shown in Fig. 1b, the retained and re-referenced BMRB shifts (δ) were in good agreement with predicted shifts (δpred) (rms deviation for the entire dataset = 0.056 ppm). Good agreement was also obtained when training was performed using a two-fold cross-validation analysis, in which half of the data were used for training and half for validation (rms = 0.069 ppm), and when training was performed with 60 % of randomly-ordered BMRB entries and validation assessed with the remaining 40 % of the data (rms = 0.063 ppm, averaged over all atom types).
Chemical shift trends for canonical triplets
The adenosine-H2 proton is sensitive to the nature of both the 5′- and 3′-nucleotides (Cromsigt et al. 2001) and exhibits a large chemical shift range of ~6.4–8.0 ppm. Importantly, the simultaneous presence of a 5′-pyrimidine and 3′-purine is associated with a significant upfield A-H2 NMR chemical shift, to a less crowded region of the RNA NMR spectrum (6.4–7.1 ppm, Fig. 2b), where these signals are potentially useful for structural characterization of large RNAs (Lu et al. 2011). In contrast, significant downfield shifts are observed for the H2 protons of adenosines that are preceded by a purine and followed by a pyrimidine (Fig. 2b). The H5 protons of C and U are sensitive to the nature of the preceding residue of the triplet but exhibit almost no detectable sensitivity to the nature of the following residue (Fig. 3c, d). The pyrimidine H6 protons are also more sensitive to the nature of the 5′ residue, but exhibit some sensitivity to the 3′ residue as well (Fig. 3c, e). The ribose protons appear to be sensitive to the nature of both the 5′ and 3′ residues, although the limited chemical shift dispersion and uncertainties regarding some of the H2′ and H3′ assignments make it more difficult to identify clear chemical shift trends.
Influence of 5′- and 3′-terminal base pairs within the WC-BP triplet
Influence of non-canonical elements adjacent to the WC-BP triplets
Our analysis assessed the influence of non-canonical structural elements that reside immediately upstream (5loop) or downstream (3loop) of the WC-BP triplets. We defined these elements to include internally stacked residues that are not involved in Watson–Crick base pairing, looped or bulge residues believed to be flexible or structured (e.g., K-turns), and residues involved in base-triples or long-distance RNA–RNA interactions. As shown in Fig. 5d, the presence of non-canonical RNA structures at the 3′-end of the WC-BP triplet does not appear to significantly influence any of the proton shifts associated with the central residue of the triplet. On the other hand, the presence of non-canonical structure on the 5′-side of the WC-BP triplet results in small but significant upfield shifts relative to 〈δ〉can values for the aromatic and H1′ protons, Fig. 5c.
Influence of G:U base pairing within the triplet
The presence of GU (or UG) base pairs at the n(i−1) or n(i+1) positions can significantly influence the signals of the central residue, and data for otherwise canonical triplets are shown in Fig. 6c–f. For triplets in which the central residue is a pyrimidine, the H1′ and H3′ signals are relatively unaffected by the presence of a preceding GU wobble (Fig. 6c). However, the H6, H5 and H2′ protons are systematically perturbed, with the u(wob)-U/C-n H6 signal shifted downfield, the g(wob)-U-n H6 signal shifted upfield, and the g(wob)-C-n C-H6 signal shifted downfield relative to the average canonical shifts (Fig. 6c). Interestingly, the u(wob)-C/U-n H5 shifts are relatively unperturbed relative to canonical shifts, whereas g(wob)-C/U-n H5 shifts are generally shifted downfield relative to the signals observed for the canonical triplets (Fig. 6c). Also, H2′ shifts of the central pyrimidine are shifted downfield when preceded by a UG wobble, but are shifted upfield when preceded by a GU wobble (Fig. 6c). When the central residue is a purine, the H1′ and H3′ proton shifts are relatively unaffected by a preceding wobble, but the H8, H2, and H2′ protons generally exhibit systematic downfield shifts, with the magnitude of the shift being somewhat greater for a preceding U(wob) compared to a preceding G(wob) (Fig. 6d).
The presence of a subsequent GU wobble can also result in systematic chemical shift perturbations. For triplets in which the central residue is a pyrimidine followed by a U(wob) mismatch, the H6 and H3′ signals exhibit small upfield shifts but the remaining signals do not appear to be significantly perturbed (Fig. 6e). In contrast, the presence of a subsequent G(wob) mismatch does not appear to lead to any detectable perturbations (Fig. 6e). For triplets in which the central residue is a purine, a subsequent G(wob) leads to a small systematic downfield shift of the H1′ proton but does not significantly perturb the other NMR signals, whereas a subsequent U(wob) pair results in small upfield shifts of the H8 and H2 signals and a small downfield shift of the H2′ signal (Fig. 6f).
Chemical shift predictions
The data in Table 1 can be used in computer programs such as NMRView (Johnson 2004; Johnson and Blevins 1994) or in ad hoc calculations to predict chemical shifts. The constant term represents the value of the given atom in nucleotide i, when the i − 1 and i + 1 nucleotides are both U, and all nucleotides from i − 2 through i + 2 are present and in canonical Watson–Crick base pairs. For example, an A-H2 proton in a canonical uAu triplet would be at 7.0299 ppm. Calculating the shift of the A-H2 proton in a different environment is done by adding to the constant term the contributions from any applicable columns in the A-H2 row of Table 1. For example, the chemical shift of an A-H2 proton in a gAc triplet, in which the i − 2 residue is in a loop, would be: 7.8469 ppm (7.0299 + 0.6890 + 0.0658 + 0.0622). If the i − 1 G is in a GU (rather than GC) base pair, the A-H2 proton chemical shift would be: 7.9217 ppm (7.0299 + 0.7637 + 0.0658 + 0.0622).
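The ad hoc calculation can be reproduced in a few lines of Python. This sketch uses only the numbers quoted above for the G:U example; the function name and the term labels are our own, and because the text does not state which of the two smaller increments belongs to the 3′-C and which to the i − 2 loop, they are kept as unlabeled terms.

```python
CONSTANT_A_H2 = 7.0299  # ppm; A-H2 in the canonical uAu reference environment

# Increments quoted in the text for an A-H2 proton in a gAc triplet whose
# i-1 G is in a G:U pair and whose i-2 residue is in a loop.
# The assignment of the two smaller terms to specific attributes is not
# given in the text, so the labels below are placeholders.
terms = [
    ("5'-G in a G:U wobble pair", 0.7637),
    ("second quoted term", 0.0658),
    ("third quoted term", 0.0622),
]


def predict_shift(constant, increments):
    """Constant term plus the sum of all applicable increments (ppm)."""
    return constant + sum(value for _, value in increments)


predicted = predict_shift(CONSTANT_A_H2, terms)
print(f"{predicted:.4f} ppm")  # prints "7.9216 ppm"
```

The four-decimal terms as quoted sum to 7.9216 ppm, which agrees with the stated 7.9217 ppm to within 0.0001 ppm; a difference of that size would be expected if the published sum was formed from unrounded coefficients.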
The present studies provide the first quantitative analysis of the RNA non-exchangeable 1H NMR chemical shifts in the BMRB. Our studies identify sequence-dependent chemical shift correlations and establish the influence of terminating base pairs within the triplets and canonical and non-canonical structures adjacent to the BP triplets (i.e. bulges, loops, WC and non-WC BPs). Excellent correlations were observed despite the fact that the NMR data were obtained under different conditions of pH, buffer, ionic strength, and temperature. A relatively small number of outliers that were not utilized in the analysis, mainly ribose H2′ and H3′ assignments, are likely due to assignment or typographical errors and should be re-examined. Assignments for some triplet combinations were either limited or lacking; for example, the database does not include assignments for two of the 64 possible “canonical triplets.” Although shifts for these triplets could be predicted from assignments made for non-canonical triplets (e.g., WC-BP triplets adjacent to non-canonical structures or that contain terminal or GU base pairs), future studies of oligonucleotides with the missing sequences are clearly in order.
The statistics indicate that the protocol employed for chemical shift predictions, assigning attributes to different triplet environments and then conducting selection and linear model fitting with Pace Regression, performed very well for the data used in this study. However, as we move forward with this research and the number of attributes is expanded, alternative fitting methods such as Neural Networks and allowing attributes to contribute in non-linear modes may be required. The protocol used here can, of course, also be applied to nitrogen and carbon nuclei, and it will be interesting to determine if these nuclei exhibit similar environment- and structure-dependent sensitivities.
The 1H NMR shifts observed for residues that participate directly in long-range RNA–RNA interactions or interactions with ligands or proteins, as identified in the associated publications and/or the structure coordinate (PDB) files, generally deviated from the A-form helical triplet shifts. For example, the H6 and H5 NMR chemical shifts observed for residue U5 of the ScYLV P1–P2 frameshifting pseudoknot (7.93 and 5.25 ppm, respectively) (Cornish et al. 2005) deviate by 0.24 and 0.29 ppm from the expected values (7.69 and 5.54 ppm, respectively) and are well outside the rms range calculated for canonical gUg triplets (rms = 0.06 and 0.03 ppm, respectively). Significant deviations were also observed for otherwise canonical A-form helical residues that interact with protein elements. In future studies of RNAs with unknown structures, the observation of outlier chemical shifts may serve as a useful indicator of potential long-range RNA:RNA or RNA:protein interactions. In addition, the trends identified in the present studies should facilitate the refinement of algorithms used to calculate 1H NMR chemical shifts on the basis of RNA structural coordinates alone (Cromsigt et al. 2001; Case 1995, 2002; Dejaegere et al. 1999; Case et al. 2005), thereby making the 1H NMR chemical shift a more useful parameter for RNA structure refinement. Because the variations in chemical shifts observed for atoms of a given triplet are small, variations in the 3D structures of the triplets should also be small. This observation lends support to refinement approaches that utilize residual dipolar couplings and/or small angle X-ray scattering (SAXS) data to orient idealized A-form helices (Funari et al. 2000; Walsh et al. 2004; Zuo et al. 2008; Grishaev et al. 2008; Wang et al. 2009, 2010; Burke et al. 2012).
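The U5 example can be turned into a simple screening calculation. In this Python sketch the function name and the 3 × rms threshold are assumptions (the threshold mirrors the cutoff used for outlier trimming, not a criterion stated for interaction sites); the observed, expected, and rms values are those quoted above.

```python
def flag_deviations(observed, expected, rms, cutoff=3.0):
    """Return {atom: (deviation in ppm, flagged)} where a shift is flagged
    when it lies more than cutoff * rms from the expected value."""
    flags = {}
    for atom, value in observed.items():
        deviation = value - expected[atom]
        flags[atom] = (round(deviation, 2), abs(deviation) > cutoff * rms[atom])
    return flags


# Values quoted in the text for U5 of the ScYLV pseudoknot (gUg triplet).
observed = {"H6": 7.93, "H5": 5.25}
expected = {"H6": 7.69, "H5": 5.54}
rms = {"H6": 0.06, "H5": 0.03}

flags = flag_deviations(observed, expected, rms)
for atom, (deviation, flagged) in flags.items():
    print(atom, deviation, "outlier" if flagged else "within range")
```

Both protons are flagged under this threshold: H6 lies 0.24 ppm downfield and H5 0.29 ppm upfield of the expected values, compared with 3 × rms limits of 0.18 and 0.09 ppm.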
In the course of these studies, chemical shift trends were tentatively identified for a number of non-A-form helical structures that are well represented in the BMRB, particularly those of conserved tetraloops (e.g., GNRA). Future studies that include parameterizations for tetraloops, base triples, and other conserved and well-defined RNA sub-structures will likely lead to the identification of additional trends useful for 1H NMR assignment and verification. In addition, it should now be possible to incorporate the approach into software programs to enable semi-automated assignment of RNA, including large RNAs with different combinations of 2H-labeled or segmentally-labeled nucleotides (underway).
Support from the National Institute of General Medical Sciences (GM42561 to M.F.S. and GM103297 to B.J.) is gratefully acknowledged. S.B. was supported by grants from the National Institute of General Medical Sciences for enhancing minority access to research careers (MARC U*STAR 2T34 GM008663) and the Howard Hughes Medical Institute for enhancing undergraduate research training.
Open Access: This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.