Fully automated high-quality NMR structure determination of small 2H-enriched proteins
Determination of high-quality small protein structures by nuclear magnetic resonance (NMR) methods generally requires acquisition and analysis of an extensive set of structural constraints. The process generally demands extensive backbone and sidechain resonance assignments, and weeks or even months of data collection and interpretation. Here we demonstrate rapid and high-quality protein NMR structure generation using CS-Rosetta with a perdeuterated protein sample made at a significantly reduced cost using new bacterial culture condensation methods. Our strategy provides the basis for a high-throughput approach for routine, rapid, high-quality structure determination of small proteins. As an example, we demonstrate the determination of a high-quality 3D structure of a small 8 kDa protein, E. coli cold shock protein A (CspA), using <4 days of data collection and fully automated data analysis methods together with CS-Rosetta. The resulting CspA structure is highly converged and in excellent agreement with the published crystal structure, with a backbone RMSD value of 0.5 Å, an all atom RMSD value of 1.2 Å to the crystal structure for well-defined regions, and RMSD value of 1.1 Å to crystal structure for core, non-solvent exposed sidechain atoms. Cross validation of the structure with 15N- and 13C-edited NOESY data obtained with a perdeuterated 15N, 13C-enriched 13CH3 methyl protonated CspA sample confirms that essentially all of these independently-interpreted NOE-based constraints are already satisfied in each of the 10 CS-Rosetta structures. By these criteria, the CS-Rosetta structure generated by fully automated analysis of data for a perdeuterated sample provides an accurate structure of CspA. This represents a general approach for rapid, automated structure determination of small proteins by NMR.
Keywords: Cold shock protein; CS-Rosetta; Rapid NMR structure determination; Protein perdeuteration; cSPP expression system
cSPP: Condensed-phase single protein production
CspA: Cold shock protein A
PDB: Protein Data Bank
NMR spectroscopy is well suited for rapid and, in favorable cases, largely automated structure determination of small (<125 residues) proteins [22, 36]. While backbone assignments for such proteins are routinely obtained in a largely automated fashion [1, 17, 23, 39], assignment of sidechain resonances is often a bottleneck in structure determination. Automated sidechain assignment methods are, however, evolving and beginning to have an important impact on the field [1, 13]. Recently, we described an approach for solving protein NMR structures using Rosetta conformational energy calculations together with only a limited amount of experimental NMR data, including backbone resonance assignments, residual dipolar coupling data, and some manually assigned long-range backbone-backbone NOEs. This approach was demonstrated to provide accurate backbone structures (chain folds) for proteins of up to 25 kDa, with reasonably accurate core sidechain packing.
Several years ago, we described a strategy for rapid automatic determination of small (<100 residue) protein structures using only the sparse constraints that can be obtained using a perdeuterated protein. Our strategy for rapid fold determination derives from ideas that were originally introduced for determining NMR structures of larger proteins [9, 10, 12], using [2H,13C,15N]-enriched protein samples with protonated sidechain methyl groups (13CH3). Data collection includes acquiring NMR spectra for determining assignments of backbone and sidechain 15N, HN resonances, and sidechain 13CH3 methyl resonances. Backbone resonance assignments and NOESY cross peaks are then determined automatically, and 3D structures are generated using CNS [5, 20]. This strategy provides reliable backbone chain folds for small (<100 residue) proteins, which are useful for certain applications and are good starting points for further refinement to high precision and accuracy using additional NMR data.
This “sparse constraint” approach exploits the fact that perdeuteration generally improves spectral quality and interpretability, even for smaller proteins. Although protein deuteration is not generally required for small protein structure determination, it is valuable for improving the sensitivity of many amide or methyl proton-detected heteronuclear NMR experiments [2, 14], even for proteins in the 7–12 kDa range. As the gyromagnetic ratio of 2H is ~6.5-fold smaller than that of 1H, the dipolar interaction between 13C or 15N and the directly bound hydrogen spin is greatly reduced upon deuteration. The transverse relaxation times T2 of the 13C and 15N nuclei are therefore increased, providing narrower linewidths and higher signal-to-noise ratios (S/N). Constant-time NMR experiments, which may have poor S/N with fully protonated proteins, can be recorded with higher sensitivity due to the reduced transverse relaxation rates of 13C and 15N in perdeuterated proteins. We also observe better performance of automated resonance assignment software for backbone resonance assignments (e.g. AutoAssign) because of the improved resolution and sensitivity of amide HN-detected triple resonance experiments on the perdeuterated protein samples. Another advantage of longer transverse relaxation times and fewer spin-diffusion pathways is that longer NOESY mixing times can be used, permitting the detection of weaker NOEs that might not otherwise be observed. Some poor NMR signals resulting from exchange broadening and limited protein solubility can also be improved by perdeuteration. These advantages of deuterium incorporation are well known for studies of larger (15–50 kDa) proteins, but they also provide improved performance and improved S/N for smaller (<70–100 residue) proteins.
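The size of the dipolar effect described above can be estimated with a short calculation. The sketch below assumes that the dipolar relaxation contribution of an attached spin S scales as γ_S² · S(S+1), with the bond length and spectral densities treated as unchanged; this is a textbook simplification rather than a full relaxation treatment, and the gyromagnetic ratios are standard literature values that do not appear in the text.

```python
# Gyromagnetic ratios in rad s^-1 T^-1 (standard literature values, not from the text).
GAMMA_1H = 267.522e6
GAMMA_2H = 41.066e6


def dipolar_scaling(gamma_new, spin_new, gamma_old, spin_old):
    """Ratio of dipolar relaxation contributions when the attached spin is swapped.

    Assumes the contribution scales as gamma_S**2 * S * (S + 1), with the
    internuclear distance and spectral densities held fixed.
    """
    new = gamma_new ** 2 * spin_new * (spin_new + 1)
    old = gamma_old ** 2 * spin_old * (spin_old + 1)
    return new / old


gamma_ratio = GAMMA_1H / GAMMA_2H                        # about 6.5
scaling = dipolar_scaling(GAMMA_2H, 1.0, GAMMA_1H, 0.5)  # 2H is spin 1, 1H is spin 1/2

print(f"gamma(1H)/gamma(2H) = {gamma_ratio:.2f}")
print(f"2H/1H dipolar contribution = {scaling:.3f} "
      f"(about {1 / scaling:.0f}-fold smaller)")
```

Under these assumptions the attached-deuteron contribution is roughly 16-fold smaller than that of a proton. Other relaxation mechanisms, such as chemical shift anisotropy, are not affected by this scaling.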
While the idea of rapid, fully automated structure determination of small perdeuterated proteins is attractive and innovative, two drawbacks have hindered the routine application of this method for high-throughput NMR protein structure determination. First, producing perdeuterated proteins by conventional expression methods is expensive. Second, only backbone chain folds are reliably determined using sparse constraints and CNS refinement; the resulting structures lack accurate atomic detail.
Here, we combine the fully automated sparse constraint approach for small proteins, first outlined by Zheng et al., with two recent innovations. First, we have adopted recently developed condensed-phase single protein production (cSPP) methods [29, 33, 34, 35] to allow bacterial expression in 10 to 40-fold condensed-phase fermentations without reduction in protein expression per cell, allowing significantly less expensive production of 2H, 13C, 15N-enriched proteins. In the cSPP system, MazF, an mRNA interferase functioning as an ACA-specific endoribonuclease, is co-expressed with the target protein. The expression of MazF eliminates almost all cellular mRNAs containing ACA sequences. The target gene is selectively expressed by engineering it to contain no ACA sequences, without altering the amino acid sequence of the protein encoded by the resulting mRNA. Second, we replace CNS refinement with the recently introduced CS-Rosetta method for small protein structure analysis. The CS-Rosetta program provides a powerful approach for NMR structure determination of small proteins using only 1H, 13C, and 15N backbone and 13Cβ resonance assignments. Exploiting these recent innovations, we have extended the approach originally described by Zheng et al. and demonstrate it using a 2H, 15N, 13C-enriched sample of the 86-residue E. coli cold shock protein A (CspA), as an example of a general process for determining accurate small protein structures that requires only a few days of NMR data collection, a specific data collection protocol, and largely automated data analysis.
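The ACA-less gene design described above can be illustrated with a small sketch. It is not the procedure used to construct the cspA gene in this work; it is one simple greedy scheme, written here as an assumption, that swaps synonymous codons until no ACA triplet remains in any reading position while leaving the encoded protein unchanged. The function names and the example sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_aca(seq):
    """Count ACA triplets, including overlapping ones (e.g. ACACA has two)."""
    return sum(seq.startswith("ACA", i) for i in range(len(seq) - 2))


def translate(dna):
    """Translate an in-frame coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def remove_aca(dna):
    """Greedy sketch: swap synonymous codons until no ACA triplet remains."""
    if len(dna) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    while True:
        seq = "".join(codons)
        pos = seq.find("ACA")
        if pos < 0:
            return seq
        # An ACA triplet can sit inside one codon or span two adjacent codons.
        for idx in sorted({pos // 3, (pos + 2) // 3}):
            lo, hi = max(0, idx - 1), idx + 2
            before = count_aca("".join(codons[lo:hi]))
            # Accept a synonymous codon only if it lowers the local ACA count.
            swap = next(
                (alt for alt, aa in CODON_TABLE.items()
                 if aa == CODON_TABLE[codons[idx]]
                 and count_aca("".join(codons[lo:idx] + [alt] + codons[idx + 1:hi])) < before),
                None,
            )
            if swap is not None:
                codons[idx] = swap
                break
        else:
            raise ValueError(f"no synonymous swap removes ACA at position {pos}")


dna = "ATGACAGACAAAGGCTAA"   # hypothetical test sequence: Met-Thr-Asp-Lys-Gly-stop
cleaned = remove_aca(dna)
print(cleaned, translate(cleaned))
```

Each accepted swap strictly lowers the number of ACA triplets, so the loop terminates. A practical design would also weigh codon usage in the expression host, which this sketch ignores.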
Methods and materials
Preparation of [1H-13C]-I(δ1)LV, 13C, 15N, 2H—CspA for structural studies
Competent E. coli BL21(DE3) cells containing the pACYCmazF plasmid were transformed with the pColdI(SP-4) plasmid (Takara Bioscience, Inc) containing the ACA-less cspA gene. The resulting construct includes a 16-residue N-terminal tag, consisting of a translation enhancing element (TEE), a His6 tag, and a Factor Xa cleavage site. Protein expression was performed essentially as described by Schneider et al., with the following details. Single colonies were selected and used to inoculate 2.5 ml of LB medium, which was incubated at 37°C for 6 h. 2 ml of the LB culture was inoculated into 100 ml of MJ9 minimal medium and grown at 37°C overnight. When OD600 reached 1.8–2.0 units, the culture was centrifuged at 3,000 × g for 15 min at 4°C. The cell pellet was resuspended in 1 l of fresh MJ9 medium and cells were grown at 37°C until OD600 reached 0.5. At this point the culture was chilled on ice for 5 min and shifted to 15°C for 45 min to acclimate the cells to cold shock conditions. Expression of the target protein (CspA) together with MazF was then induced for 1.5 h by addition of 1 mM isopropyl-β-d-thiogalactoside (IPTG), prior to expression in isotope-enriched medium. Cultures were then centrifuged at 3,000 × g for 15 min at 4°C, resuspended at 2.5% of the original volume (40× condensed) in deuterated (2H2O) wash solution [7.0 g/l Na2HPO4; 3.0 g/l KH2PO4; 0.5 g/l NaCl; pH 7.4], centrifuged again, and resuspended in 25 ml of deuterated MJ9 minimal medium containing 1 g/l 15NH4Cl; 4 g/l 13C, 2H-glucose; 50 mg/l α-13C-ketobutyric acid; 100 mg/l α-13C-ketoisovaleric acid; and 1 mM IPTG. Protein expression continued at 15°C for 24 h. Cells were harvested by centrifugation as described above and cell pellets were stored at −80°C. All isotopes were purchased from Cambridge Isotope Laboratories.
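The saving in labeled reagents that follows from culture condensation can be made explicit with simple arithmetic. The sketch below uses only the volumes and concentrations quoted in the protocol above; the comparison against an uncondensed 1 l labeled culture is an illustrative assumption, not a measurement from this work.

```python
CULTURE_VOLUME_L = 1.0       # growth volume before condensation
CONDENSED_FRACTION = 0.025   # cells resuspended at 2.5% of the original volume

# Labeled components of the deuterated MJ9 medium, in g/l (from the protocol).
LABELED_COMPONENTS_G_PER_L = {
    "15NH4Cl": 1.0,
    "13C,2H-glucose": 4.0,
    "alpha-13C-ketobutyric acid": 0.050,
    "alpha-13C-ketoisovaleric acid": 0.100,
}


def reagent_mg(volume_l):
    """Mass of each labeled component (mg) needed for a given medium volume."""
    return {name: conc * volume_l * 1000.0
            for name, conc in LABELED_COMPONENTS_G_PER_L.items()}


condensed_volume_l = CULTURE_VOLUME_L * CONDENSED_FRACTION   # 0.025 l = 25 ml
fold_condensation = CULTURE_VOLUME_L / condensed_volume_l    # 40-fold

condensed = reagent_mg(condensed_volume_l)
uncondensed = reagent_mg(CULTURE_VOLUME_L)   # hypothetical uncondensed comparison

for name in LABELED_COMPONENTS_G_PER_L:
    print(f"{name}: {condensed[name]:.2f} mg condensed "
          f"vs {uncondensed[name]:.0f} mg uncondensed")
```

For the 25 ml condensed culture this gives 25 mg of 15NH4Cl and 100 mg of labeled glucose, 40-fold less than a 1 l labeled culture at the same concentrations would require.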
CspA purification and concentration
Cell pellets were resuspended in 40 ml of lysis buffer [50 mM Na2HPO4-NaH2PO4; 300 mM NaCl; 5 mM imidazole; 5 mM 2-mercaptoethanol; with 1 EDTA-free protease inhibitor tablet (Roche Cat. # 11 873 580 001) per 50 ml at pH 8.0] and sonicated to lyse the cells. Lysed cells were then centrifuged at 4°C for 1 h at 16,000 rpm in a Sorvall SS-34 rotor. The protein was purified from the soluble extract by binding to Ni–NTA agarose, using 1 ml of bed resin per 40 ml of soluble extract, at 4°C overnight. The resin was washed twice with 25 ml of Wash Buffer [50 mM Na2HPO4-NaH2PO4; 300 mM NaCl; 25 mM imidazole; 5 mM 2-mercaptoethanol, pH 8.0], and protein was eluted in 8 ml of Elution Buffer [50 mM Na2HPO4-NaH2PO4; 300 mM NaCl; 300 mM imidazole; 5 mM 2-mercaptoethanol, pH 8.0]. The protein solution was then dialyzed overnight at 4°C into NMR Buffer [50 mM KH2PO4; 1 mM NaN3; 10% 2H2O, pH 6.0], and concentrated to a final concentration of ~0.2 mM.
800 MHz triple resonance data used for determining backbone resonance assignments
Each row lists one entry per experiment (six experiments), separated by semicolons.
No. of points: 1024, 40, 50; 1024, 40, 40; 1024, 40, 50; 1024, 64, 100; 1024, 64, 100; 1024, 40, 40
  1024, 72, 82; 1024, 72, 72; 1024, 72, 82; 1024, 96, 164; 1024, 96, 164; 1024, 72, 72
After zero filling: 1024, 128, 128; 1024, 128, 128; 1024, 128, 128; 1024, 128, 256; 1024, 128, 256; 1024, 128, 128
No. of scans
Spectral width (ω1, ω2, ω3; ppm): 14, 23, 32; 14, 23, 24; 14, 23, 32; 14, 28, 72; 14, 28, 72; 14, 23, 24
Recycle delay (s)
Collection time (h)
In addition, 3D 13C-edited NOESY (mixing time of 350 ms) and 15N-edited NOESY (mixing time of 175 ms) spectra were collected on a 600 MHz Bruker spectrometer with a TXI probe. The matrix sizes of these spectra were 1,024 × 32 × 220 total data points for the 13C-edited NOESY, and 1,024 × 64 × 256 total data points for the 15N-edited NOESY. For the 13C-edited NOESY, the spectral widths in the 1H, 13C and indirectly detected 1H dimensions were set to 14, 16 and 12 ppm, respectively, and the carrier positions were set to 4.7 ppm for the 1H and 16 ppm for the 13C dimensions. For the 15N-edited NOESY, the spectral widths in the 1H, 15N and indirectly detected 1H dimensions were set to 14, 28 and 11.5 ppm, respectively, and the carrier positions were set to 4.7 ppm for the 1H and 118 ppm for the 15N dimensions. The total data collection time for the 13C-edited and 15N-edited NOESY spectra was approximately 2.5 days.
In all NMR experiments, FIDs were processed with linear prediction and zero filling, and weighted by a sine bell function in all directly and indirectly detected dimensions. All NMR spectra were processed and examined with the NMRPipe and NMRDraw software packages. The program SPARKY was used for data visualization and analysis. Proton chemical shifts were referenced to external DSS. 13C and 15N chemical shifts were referenced indirectly based on the proton referencing.
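Indirect referencing of this kind is commonly done with fixed frequency ratios (Ξ) relative to the 1H resonance of DSS. The sketch below assumes the ratios widely recommended for biomolecular NMR (13C relative to DSS, 15N relative to liquid NH3); neither these values nor the example spectrometer frequency are stated in this paper, so both should be read as assumptions.

```python
# Frequency ratios Xi relative to the 1H signal of DSS (assumed recommended values).
XI = {
    "13C": 0.251449530,   # 13C, DSS scale
    "15N": 0.101329118,   # 15N, liquid NH3 scale
}


def zero_ppm_frequency_mhz(proton_zero_mhz, nucleus):
    """Absolute frequency of 0 ppm for a heteronucleus, from the 1H DSS frequency."""
    return proton_zero_mhz * XI[nucleus]


def shift_ppm(observed_mhz, zero_mhz):
    """Chemical shift (ppm) of a resonance given the 0 ppm reference frequency."""
    return (observed_mhz - zero_mhz) / zero_mhz * 1e6


proton_zero_mhz = 600.13   # hypothetical 1H frequency of DSS on a 600 MHz instrument
c13_zero = zero_ppm_frequency_mhz(proton_zero_mhz, "13C")
n15_zero = zero_ppm_frequency_mhz(proton_zero_mhz, "15N")

print(f"13C 0 ppm at {c13_zero:.4f} MHz, 15N 0 ppm at {n15_zero:.4f} MHz")
```

Once the 1H frequency of DSS is measured, the 13C and 15N axes follow without separate reference compounds.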
Analysis of resonance assignments
AutoAssign software was used for automated analysis of backbone and sidechain 13Cβ resonance assignments for CspA. Peak lists for the [15N-1HN]-HSQC and for the triple resonance experiments, including 3D HNCO, HN(ca)CO, HNCA, HN(co)CA, HNCACB and HN(co)CACB, were generated automatically using the ‘restrictive peak picking’ function of the SPARKY software. To improve the performance of AutoAssign for backbone assignments, these peak lists were manually refined prior to input into AutoAssign [23, 39] for automated analysis of backbone resonance assignments. Cleaning up the peak lists required only 2–3 h. Sidechain 13C and 1H methyl resonances of Leu, Val and Ile (δ1) were determined subsequently by interactive spectral analysis using [13C–1H]-HSQC, 3D 13C-edited NOESY, and 3D 15N-edited NOESY spectra. These methyl sidechain assignments were used in the “conventional 3D structure calculations”, but not in the CS-Rosetta calculations.
Sparse-constraint 3D structure calculations
Sparse-constraint 3D structure calculations were performed using the AutoStructure [15, 16] software ver. 2.2.1-CND for automated analysis of NOESY cross peak assignments, implemented together with the program CYANA ver. 2.1 for structure generation. The input for AutoStructure analysis consisted of (1) a list of backbone and 13C-1H methyl sidechain assignments; (2) manually edited NOESY peak lists, including chemical shifts and peak heights, generated from 13C-edited and 15N-edited NOESY spectra; (3) sites of slowly exchanging amide hydrogens based on published amide 1H/2H exchange data for CspA [8, 24]; and (4) broad ϕ, ψ angle constraints (±40° and ±50°, respectively) derived from chemical shift data (after correction for the 2H isotope-shift effect) using the program TALOS. The best 10 of 56 structures (lowest energy) from the final cycle of AutoStructure were refined by restrained molecular dynamics in an explicit water bath using CNS 1.1 [5, 20].
Chemical-shift based protein structure prediction by ROSETTA (CS-ROSETTA)
Chemical shift information, including backbone 13Cα, 15N, 13C’, 1HN and sidechain 13Cβ assignments, was used as input for CS-ROSETTA. Details of the process of generating the CS-ROSETTA protein structure are described in Shen et al. Three key steps are involved. First, based on the chemical shift values (which did not include backbone 1Hα shifts) and the protein sequence, peptide fragments were selected from a protein structure database using the MFR module [7, 18] of the NMRPipe software package. All proteins with a PSI-BLAST E-value <0.05 relative to E. coli CspA were removed from the database. Second, a standard ROSETTA protocol was used for de novo structure generation. Third, the resulting ROSETTA all-atom models were evaluated based on how well backbone chemical shifts predicted for the models using SPARTA agree with the experimental chemical shifts. If the lowest-energy models cluster within ~2 Å of the model with the lowest energy, the structure prediction is considered successful and the lowest-energy models are considered converged. A total of 10,000 all-atom Rosetta models were generated from the MFR-selected peptide fragments, using a cluster of 20 CPUs. These CS-Rosetta runs required approximately 3 days. The 1,000 lowest-energy models were chosen and their all-atom ROSETTA energies were rescored according to their agreement with the experimental chemical shift values. The lowest-energy models were judged to be converged because their Cα RMSD values are less than ~2 Å relative to the lowest-energy model. The 10 lowest-energy models were selected to represent the 3D structure of CspA. The CS-ROSETTA package used in this work may be downloaded from http://spin.niddk.nih.gov/bax/software/CSROSETTA/indes.html.
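The selection and convergence test described above can be sketched as follows. This is not the CS-ROSETTA implementation: the additive rescoring term and its weight are assumptions, the `Model` class and function names are hypothetical, and the RMSD function here operates on a one-dimensional stand-in for coordinates so that the example is self-contained. In practice the RMSD would be a Cα RMSD after optimal superposition.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    rosetta_energy: float   # all-atom energy of the model
    cs_chi2: float          # mismatch between predicted and experimental shifts
    toy_coord: float        # 1D stand-in for 3D coordinates (toy example only)


def rescored_energy(model, weight=0.25):
    """Energy plus a chemical-shift penalty; the weight is a hypothetical value."""
    return model.rosetta_energy + weight * model.cs_chi2


def select_models(models, rmsd, n_keep=10, cutoff=2.0, weight=0.25):
    """Keep the n_keep lowest rescored models and test convergence.

    Converged means every kept model lies within `cutoff` (Angstrom) of the
    lowest-energy model, as in the ~2 Angstrom criterion described in the text.
    """
    ranked = sorted(models, key=lambda m: rescored_energy(m, weight))
    kept = ranked[:n_keep]
    converged = all(rmsd(kept[0], m) < cutoff for m in kept)
    return kept, converged


def toy_rmsd(a, b):
    """Placeholder for a C-alpha RMSD after superposition."""
    return abs(a.toy_coord - b.toy_coord)


# Synthetic models: energy, shift mismatch, and "coordinates" all grow with i.
models = [Model(f"m{i}", -200.0 + i, 10.0 + i, 0.1 * i) for i in range(50)]
kept, converged = select_models(models, toy_rmsd)
print([m.name for m in kept], converged)
```

If `converged` is false, the chemical shift data alone have not defined a single fold and additional data would be needed.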
Structure quality assessment
Global structure quality factors for the ensemble of CspA structures generated using sparse NMR constraints with conventional data analysis methods or by CS-Rosetta were determined using the Protein Structure Validation Suite (PSVS) software package. The output of PSVS includes raw scores and normalized statistical Z-scores for metrics assessed by the Verify 3D, Prosa II, PROCHECK, and MolProbity software packages.
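A normalized Z-score of this kind expresses a raw quality score relative to the distribution of the same score over a reference set of structures. The sketch below shows the arithmetic only; the reference mean and standard deviation are hypothetical numbers chosen for illustration, not the values used by PSVS, and the Z = −5 cutoff is the acceptability threshold quoted later in this paper.

```python
def z_score(raw, ref_mean, ref_sd, higher_is_better=True):
    """Z-score of a raw quality score against a reference distribution.

    For metrics where lower raw values are better (e.g. a clash score), the
    sign is flipped so that more negative Z always means lower quality.
    """
    z = (raw - ref_mean) / ref_sd
    return z if higher_is_better else -z


def is_acceptable(z, threshold=-5.0):
    """Apply the Z = -5 threshold quoted in the text."""
    return z >= threshold


# Hypothetical raw score and reference statistics, for illustration only.
example_z = z_score(raw=0.30, ref_mean=0.44, ref_sd=0.16)
print(f"Z = {example_z:.3f}, acceptable: {is_acceptable(example_z)}")
```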
Rapid resonance assignments with perdeuterated CspA sample prepared by cSPP
Protein structure determination using sparse NMR constraints
Summary of structural statistics for E. coli CspA NMR structures
Sparse-constraint NMR structure (a)
Sparse-constraint CS-Rosetta structure (c)
Intra-residue (i = j)
Sequential (|i − j| = 1)
Medium range (1 < |i − j| ≤ 5)
Long range (|i − j| > 5)
Distance constraints per residue
Dihedral angle constraints
Hydrogen bond constraints
Long range (|i − j| > 5)
Number of constraints per residue
Number of long range constraints per residue
Residual constraint violations (d)
Average number of distance violations per structure
Average RMS distance violation/constraint (Å)
Maximum distance violation (Å)
Average number of dihedral angle violations per residue
Average RMS dihedral angle violation/constraint (°)
Maximum dihedral angle violation (°)
RMSD from average coordinates (Å) (d, e)
1.2 ± 0.2; 0.5 ± 0.1; 0.8 ± 0.2
1.7 ± 0.2; 1.1 ± 0.1; 1.2 ± 0.2
RMSD from X-ray structure (Å) (d, f)
1.58 ± 0.38; 0.95 ± 0.11; 0.52 ± 0.12
2.24 ± 0.34; 1.63 ± 0.16; 1.17 ± 0.11
Sidechain RMSD from X-ray structure (Å) (d, g)
1.75 ± 0.20; 1.59 ± 0.15; 0.86 ± 0.11
Heavy sidechain atoms
1.81 ± 0.23; 1.93 ± 0.22; 1.14 ± 0.12
Most favored regions (%)
Additional allowed regions (%)
Generously allowed (%)
Disallowed regions (%)
Global quality scores (d)
Procheck (all dihedrals)
Molprobity clash score
CS-Rosetta structure generation for perdeuterated CspA
A comparison of the CS-Rosetta structure of Fig. 3 with the NOESY constraint list used to generate the sparse-constraint NMR structure of Fig. 2 (i.e. the data obtained for 2H, 15N, 13C-enriched 13CH3 methyl protonated CspA) is also summarized statistically in Table 2. This analysis reveals only a few distance violations >0.5 Å (the largest being 1.7 Å) across the ensemble of 10 CS-Rosetta structures, cross-validating the high accuracy of the CS-Rosetta structure. Comparison with the more extensive NOESY constraint list used to determine the NMR structure 3mef reveals some additional constraint violations by the CS-Rosetta CspA structure; however, that work was performed using a different CspA construct, and the overall structure quality scores (Table 2) for this published “full blown” NMR structure 3mef are much poorer than those of either the CS-Rosetta structure or the X-ray structure 1mjc. Indeed, structure quality scores for the published NMR structure (Table 2), particularly the Procheck (all dihedral) and Molprobity clash scores, are well below the threshold (Z = −5) considered acceptable for a good quality NMR structure. Based on its closer agreement with 1mjc, particularly for core sidechain atom positions, and its better overall structure quality scores, it appears that the CS-Rosetta NMR structure of CspA (Fig. 3) is in fact more accurate than the previously published “full blown” NMR structure 3mef.
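The cross-validation described above amounts to counting, for each model in the ensemble, the NOE-derived upper-bound distance constraints that are exceeded by more than 0.5 Å. A minimal sketch of that bookkeeping is given below; the atom-pair labels, bounds, and distances are invented toy values, not CspA data, and the function names are hypothetical.

```python
def violations(upper_bounds, distances, tolerance=0.5):
    """Return {pair: excess} for constraints exceeded by more than `tolerance`.

    upper_bounds: {atom_pair: upper distance bound in Angstrom}
    distances:    {atom_pair: distance measured in one model, in Angstrom}
    """
    excess = {pair: distances[pair] - bound for pair, bound in upper_bounds.items()}
    return {pair: e for pair, e in excess.items() if e > tolerance}


def ensemble_summary(upper_bounds, ensemble, tolerance=0.5):
    """Violation count per model and the largest violation in the ensemble."""
    per_model = [violations(upper_bounds, d, tolerance) for d in ensemble]
    counts = [len(v) for v in per_model]
    largest = max((e for v in per_model for e in v.values()), default=0.0)
    return counts, largest


# Toy constraints and two toy models (values are illustrative only).
pair_a = ("res9.HN", "res21.HD1")
pair_b = ("res45.HD1", "res51.HG1")
bounds = {pair_a: 5.0, pair_b: 4.5}
toy_ensemble = [
    {pair_a: 4.8, pair_b: 4.4},   # satisfies both constraints
    {pair_a: 6.7, pair_b: 4.9},   # pair_a exceeded by 1.7; pair_b within tolerance
]

counts, largest = ensemble_summary(bounds, toy_ensemble)
print(counts, round(largest, 2))
```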
Our results demonstrate a general, rapid, and simple approach for determining high quality 3D structures of small (<10 kDa) proteins, in fully automated fashion, with accuracies rivaling structures determined using more extensive NMR methods. In particular, the core sidechain packing, determined by the Rosetta potential energy function, is quite accurate based on comparison with the crystal structure, despite the fact that no sidechain constraints are used in these calculations. Similar results were observed in CS-RDC-Rosetta calculations with larger proteins.
The time required for CS-Rosetta runs depends on the number of Rosetta models generated and the number of CPUs used. In our study, we generated 10,000 models using 20 CPUs, which took about 3 days. Relative to conventional methods, our approach saves the time required to collect the spectra needed for sidechain assignments and NOESY analysis, the time required to process and analyze these spectra, and the time required for structure calculation and refinement, which are the time-limiting steps in NMR structure determination. Our approach requires only triple resonance NMR experiments for backbone assignments, followed by automated analysis of backbone resonance assignments. Once most of the backbone resonance assignments are determined, the chemical shift data are submitted to CS-Rosetta. This largely automated approach not only saves time in data collection and analysis, but can also generate a high-quality protein structure.
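The timing figures quoted above imply a per-model cost that can be used to estimate run times for other cluster sizes. The sketch below assumes perfect parallel scaling and a constant cost per model, which is an idealization; only the 10,000 models, 20 CPUs, and 3 days come from the text.

```python
N_MODELS = 10_000   # models generated in this study
N_CPUS = 20         # CPUs used
WALL_DAYS = 3.0     # approximate wall-clock time reported

# CPU time per model implied by the reported run (hours).
cpu_hours_per_model = WALL_DAYS * 24 * N_CPUS / N_MODELS


def wall_days(n_models, n_cpus):
    """Estimated wall-clock days, assuming ideal scaling across CPUs."""
    return n_models * cpu_hours_per_model / n_cpus / 24


print(f"{cpu_hours_per_model * 60:.2f} CPU-minutes per model")
print(f"10,000 models on 100 CPUs: {wall_days(10_000, 100):.1f} days")
```

Under these assumptions each model costs about 8.6 CPU-minutes, so the same calculation on 100 CPUs would take well under a day.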
NOESY data and protein ILV methyl protonation are not required in the strategy proposed in this paper for small protein structure determination. NOESY data on the ILV-labeled sample was only used for cross validation of the CS-Rosetta structure. However, CS-Rosetta calculations do not always converge, even for small protein structures, and NOESY data for the perdeuterated ILV methyl protonated protein sample can be used if necessary together with CS-Rosetta if the chemical shift data alone do not provide a converged structure.
Our work further demonstrates that 2H, 13C, 15N-enriched protein samples, made by the cSPP system at a drastically reduced cost and purified with single-step Ni–NTA affinity chromatography, allow data collection and automated analysis of backbone 1HN, 15N, 13Cα, 13C′, as well as sidechain 13Cβ, assignments in only a few days. In related work, we have demonstrated the combined use of CS-Rosetta and automated NOESY analysis to provide more accurate NOESY cross peak assignments, beginning with extensive backbone and sidechain assignments, and the use of CS-RDC-Rosetta with manually assigned NOESY-based constraints to generate good quality structures of larger (10–25 kDa) proteins. The present study is the first example of applying CS-Rosetta for rapid, fully automated NMR structure determination of small proteins, an application that provides a new and general approach for obtaining 3D structures of small proteins. The CspA structure obtained rivals the best NMR structures available to date for CspA using conventional methods, even those utilizing extensive sidechain proton assignments. This approach is valuable for preparing protein samples and generating assignments and structural information for small molecule screening studies, as well as for high-throughput structural and functional genomics studies.
We thank Drs. Ad Bax, Rajeswari Mani, and Rong Xiao for helpful discussions and comments on the manuscript. The CS-Rosetta structure of CspA has been submitted to the Protein Data Bank as entry 2L15. This work was supported by the National Institute of General Medical Sciences Protein Structure Initiative program, grants U54 GM074958 (to G.T.M. and M.I.) and U54 GM75026 (to G.T.M. and M.I.), and by grants RO1 GM070837 and GM088808 (to M.J.R.). W.M.S. was supported by NIH training grants T32 GM08360 and T32 A1007403. This work was also supported by the Intramural Research Program of the National Institutes of Health National Institute of Diabetes and Digestive and Kidney Diseases (to Y.S.).
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- 8. Feng W, Tejero R, Zimmerman DE, Inouye M, Montelione GT (1998) Solution NMR structure and backbone dynamics of the major cold-shock protein (CspA) from Escherichia coli: evidence for conformational dynamics in the single-stranded RNA-binding site. Biochemistry 37:10881–10896
- 11. Goddard TD, Kneller DG (2008) SPARKY 3. University of California, San Francisco
- 22. Montelione GT, Zheng D, Huang YJ, Gunsalus KC, Szyperski T (2000) Protein NMR spectroscopy in structural genomics. Nat Struct Biol 7 Suppl:982–985