Introduction

Magic-angle spinning solid-state NMR (MAS SSNMR) is a rapidly developing experimental method with great potential to provide structural and dynamics information for proteins not amenable to solution NMR or X-ray crystallography. Many technical aspects of MAS SSNMR are advancing quickly, among them: (i) improvements in nano/microcrystalline and membrane protein sample preparation (Frericks et al. 2006; Li et al. 2007; Lorch et al. 2005); (ii) improvements in commercially available hardware; and (iii) development of pulse sequences for new and improved experiments (Sun et al. 1997; Li et al. 2007; Franks et al. 2007; Zhong et al. 2007; Hong 1999; Bockmann et al. 2003; Rienstra et al. 2000; Pauli et al. 2001; Igumenova et al. 2004; Astrof et al. 2001). In many cases, adaptation of tools and techniques from solution NMR has fueled this rapid development. However, the development of analysis software for MAS SSNMR lags far behind. In particular, the more sophisticated automated protein resonance assignment programs for solution NMR cannot be directly used on SSNMR data lacking hydrogen resonances. This is because leading protein resonance assignment programs (Zimmerman et al. 1997; Leutner et al. 1998; Atreya et al. 2000; Bartels et al. 1996, 1997, 2004; Moseley et al. 2001; Moseley and Montelione 1999; Moseley et al. 2004; Huang et al. 2005; Coggins and Zhou 2003; Jung and Zweckstetter 2004; Eghbalnia et al. 2005; Hyberts and Wagner 2003) are hard-wired with an amide 15N-1H double resonance spin system root definition (Fig. 1) and require hydrogen-based experiments. To address this deficiency, we present a methodology for automating protein resonance assignments of MAS SSNMR spectral data and its practical application to an experimental peak list dataset of the β1 immunoglobulin binding domain of protein G (GB1) as a proof of concept.
Our goals are: (i) to eventually provide the software tools needed to automate the MAS SSNMR protein resonance assignment process; (ii) to improve the quality of this analysis; and (iii) to make this analysis more objective and reproducible.

Fig. 1

Standard dipeptide spin system definitions for sequential protein resonance assignments in solution and solid state NMR. Spin system root resonances are in red. The solid red box indicates that the root resonances are found in all standard experiments used in dipeptide spin system assembly. The dashed red boxes indicate that pairs of root resonances are found in only a subset of the experiments used in dipeptide spin system assembly

Figure 2 shows the protein resonance assignment problem represented as a bipartite graph. This assignment problem is essentially the same for both solution and solid-state NMR (Tycko 1996; Hong 1999) and is solved in seven basic steps (Table 1). One critical difference between solution and solid-state NMR, however, is the set of root resonances used to group peaks into spin systems. These resonances are dictated by the set of NMR experiments (i.e., the experimental strategy) used to solve this assignment problem. As shown in Fig. 1, common MAS SSNMR protein resonance assignment strategies use a partial triple resonance spin system root definition (Pauli et al. 2001; Igumenova et al. 2004; Franks et al. 2005; Balayssac et al. 2007; Hong 1999; Sperling et al. 2010), since not all three resonances may be present within each experiment in a given strategy. MAS SSNMR experimental strategies naturally fall into three categories (Table 2). In category I, two sets of experiments containing either Ni-C′i-1 or Ni-Cαi root resonances are combined into complete dipeptide spin systems using the single common amide nitrogen root resonance. In categories IIa and IIb, experiments containing either Ni-C′i-1 or Ni-Cαi root resonances are combined into complete dipeptide spin systems using two common root resonances. In category III, the listed 4D experiments contain all three root resonances, which represent a complete triple resonance spin system root definition. Labs have published assignment results using category I strategies, but only on small proteins (Hong 1999; Pauli et al. 2001; Igumenova et al. 2004; Franks et al. 2005; Balayssac et al. 2007). Labs are starting to use category II strategies for larger proteins (Frericks et al. 2006; Li et al. 2007; Li et al. 2008). In the future, labs will likely explore category III strategies using newer G-matrix Fourier transformation (GFT) experiments (Szyperski et al. 1993a; Szyperski et al. 1993b; Kim and Szyperski 2003; Kim and Szyperski 2004; Astrof et al. 2001; Luca and Baldus 2002). Moreover, category II and III strategies have strengths that could make them better suited to automation than even solution NMR strategies. First, the chemical shift dispersion in Euclidean space of Ni-Cαi, and especially C′i-1-Ni-Cαi, root resonance tuples is significantly greater than that of Ni-Hi root resonance tuples. In other words, if Ni-Cαi pairs of chemical shifts for a folded protein are plotted on a 2D graph as small circles whose radii represent the uncertainty in their chemical shift values, they form less dense clumps (i.e., fewer overlapping circles) than Ni-Hi pairs plotted in the same way. This helps prevent the non-unique grouping of peaks into spin systems, which severely complicates resonance assignments. Second, category IIa and IIb strategies can be combined into a single strategy represented as a merged double bipartite graph. This representation may lead to the development of superior grouping and linking algorithms.
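The overlapping-circles picture of dispersion can be illustrated with a short Python sketch. It scales each dimension by its uncertainty so that every root pair becomes a circle of radius 1, then counts how many circles overlap. The function name `count_overlaps`, the shift values, and the tolerances are all invented for illustration; they are not GB1 data or values from this work.

```python
import math
from itertools import combinations

def count_overlaps(pairs, tol_x, tol_y):
    """Count pairs of root tuples whose uncertainty circles overlap.

    Each dimension is divided by its tolerance, so every point becomes a
    circle of radius 1 and two circles overlap when their centers are
    closer than 2.
    """
    scaled = [(x / tol_x, y / tol_y) for x, y in pairs]
    return sum(1 for a, b in combinations(scaled, 2) if math.dist(a, b) < 2.0)

# Toy (N, HN) and (N, CA) root pairs in ppm for the same four residues.
n_h = [(120.1, 8.20), (120.3, 8.25), (118.9, 8.22), (121.0, 7.90)]
n_ca = [(120.1, 56.3), (120.3, 61.8), (118.9, 52.4), (121.0, 58.9)]

# Assumed tolerances: 0.4 ppm for N and CA, 0.04 ppm for HN.
overlaps_nh = count_overlaps(n_h, 0.4, 0.04)
overlaps_nca = count_overlaps(n_ca, 0.4, 0.4)
print(overlaps_nh, overlaps_nca)
```

In this toy case the first two residues have nearly identical N and HN shifts, so their N-H circles overlap, while their CA shifts differ by several ppm and the N-CA circles stay separate.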

Fig. 2

Bipartite graph representing the protein resonance assignment problem. Amino acid typing limits the edges present. Red highlights represent spin system linking into a uniquely mapped segment

Table 1 Protein resonance assignment process
Table 2 MAS SSNMR experimental strategies for protein resonance assignment
Fig. 3

Automated resonance assignments of β1 immunoglobulin binding domain of protein G. Resonances derived from intra experiments are indicated in red. Resonances derived from sequential experiments are indicated in blue

However, MAS SSNMR spectra, especially of membrane proteins, often lack significant numbers of resonances at a given experimental condition (Andronesi et al. 2005; Li et al. 2007), which can confuse both global optimization and exhaustive search mapping algorithms. But spectroscopists are finding clever ways to optimize their experiments for higher sensitivity. For instance, dropping the temperature below 0 °C can improve signal intensity several-fold (Kloepper et al. 2007). Moreover, experiments can be collected under multiple conditions to improve detection of all resonances. Another historical problem in SSNMR experiments is large spectral line widths, which increase spectral crowding and peak overlap. However, improvements in magic-angle spinning techniques, pulse sequences, and micro/nanocrystalline sample preparations are greatly reducing observed line widths into the sub-ppm range (Franks et al. 2005; Pauli et al. 2000; McDermott et al. 2000; Martin and Zilm 2003). For example, a recent MAS SSNMR resonance assignment of the 20 kDa membrane protein DsbB had average 15N and 13C line widths of 0.7 and 0.5 ppm, respectively (Li et al. 2007, 2008). Furthermore, several labs have recently developed and used 3D and 4D experiments to reduce peak overlap in spectra of membrane proteins (Zhong et al. 2007; Kijac et al. 2007; Li et al. 2007, 2008; Frericks et al. 2006; Franks et al. 2007).

Materials and methods

We have implemented a prototype of alignment, grouping, and typing algorithms and combined them with the linking and mapping algorithms from the solution NMR assignment package AutoAssign (Moseley et al. 2001; Moseley and Montelione 1999; Moseley et al. 2004; Baran et al. 2004; Huang et al. 2005; Zimmerman et al. 1997) to provide a proof of concept. The alignment algorithm constructs and compares Euclidean distance matrices for “input” and “root” peak lists and is similar to the point pattern matching algorithm pioneered by Ranade and Rosenfeld (Ranade and Rosenfeld 1980) and later improved for use in Landsat image registration (Ton and Jain 1989). We make three improvements to their algorithm: (i) the use of the Jaccard coefficient (i.e., set intersection divided by set union) in place of a simple support list count as the robustness score; (ii) the multiplication of the Jaccard coefficient by the probability of a support pair’s registration; and (iii) the use of a weighted standard deviation of registration in deriving support tolerances. The latter two improvements convert the algorithm into a stationary iterative method. The algorithm is optimized to a computational complexity of O(mn² log n), where m and n represent the lengths of the root and input peak lists, respectively, and we see a clear path to improving the complexity to O(mn²). This alignment algorithm provides: (i) the best mapping of peaks from an “input” peak list to peaks in a “root” peak list for their comparable spectral dimensions; (ii) the registration needed to translate the input peak list to the root peak list in their comparable dimensions; and (iii) the standard deviation of this registration, which is needed to calculate match tolerances. While the alignment step is the most computationally intensive step, it only has to be performed once and provides the first set of major quality control measures for the given dataset.
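To make the three outputs of the alignment step concrete (mapping, registration, and registration standard deviation), the following Python sketch scores candidate translations with a Jaccard coefficient. It is not the algorithm described above: a brute-force translation search stands in for the distance-matrix comparison, and the probability weighting and weighted standard deviation are omitted. The names `jaccard` and `align`, the tolerance, and the toy peak lists are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(input_peaks, root_peaks, tol):
    """Brute-force translation search over the comparable dimensions.

    Returns (score, registration, registration_sd, mapping), where mapping
    sends input peak indices to root peak indices.
    """
    best = (0.0, None, None, {})
    dims = range(len(root_peaks[0]))
    for p in input_peaks:
        for r in root_peaks:
            # Candidate registration: the shift that puts p exactly on r.
            offset = [r[d] - p[d] for d in dims]
            mapping = {}
            for k, q in enumerate(input_peaks):
                shifted = [q[d] + offset[d] for d in dims]
                dist, m = min((math.dist(shifted, s), m)
                              for m, s in enumerate(root_peaks))
                if dist <= tol:
                    mapping[k] = m
            # Matched input peaks take the identity of their root partner, so
            # the Jaccard coefficient penalizes unmatched peaks on both sides.
            input_ids = {("root", m) for m in mapping.values()}
            input_ids |= {("input", k) for k in range(len(input_peaks))
                          if k not in mapping}
            root_ids = {("root", m) for m in range(len(root_peaks))}
            score = jaccard(input_ids, root_ids)
            if score > best[0]:
                diffs = [[root_peaks[m][d] - input_peaks[k][d]
                          for k, m in mapping.items()] for d in dims]
                best = (score, [mean(c) for c in diffs],
                        [pstdev(c) for c in diffs], mapping)
    return best

# Toy (N, CA) peaks in ppm: the input list is offset from the root list
# and contains one extra peak with no partner.
root = [(120.0, 55.0), (115.0, 60.0), (125.0, 50.0)]
observed = [(120.5, 54.7), (115.5, 59.7), (125.5, 49.7), (110.0, 70.0)]
score, registration, sd, mapping = align(observed, root, tol=0.2)
print(score, registration, mapping)
```

Here three of the four input peaks find a root partner, so the Jaccard score is 3/4, and the recovered registration is about (-0.5, +0.3) ppm. A simple support count would also report three matches, but unlike the Jaccard score it would not be lowered by the unmatched extra peak.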

The next step involves grouping of peaks into dipeptide spin systems using root resonances that all the peaks in the spin system have in common. Each dipeptide spin system is composed of intra-residue resonances and sequential-residue resonances organized as ladders. Our grouping algorithm uses a new bottom-up approach to dipeptide spin system grouping in contrast to the common top-down algorithms that use a single root spectrum as seeds for spin system creation. In this grouping algorithm, peak list-based and ladder-based groupings are done first before building the dipeptide spin systems. Peaks from a single spectrum are more self-consistent in their values than peaks between spectra. The new algorithm can use narrower tolerances to group peaks within a spectrum first and then average the root resonances of these intra-spectra peaks to improve their standard error. The same logic is applied to groups of peaks in the same ladder. The number of complete spin systems derived from the grouping algorithm provides the second major quality control measure for the given dataset.
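A minimal Python sketch of the bottom-up idea follows: peaks are first grouped within each peak list using a narrow tolerance, their root resonances are averaged, and only then are the averaged roots grouped across peak lists with a wider tolerance. The greedy clustering, the function names `group_by_root` and `bottom_up`, the tolerances, and the toy shifts are all assumptions for illustration; ladder-based grouping is not modeled.

```python
import math
from statistics import mean

def group_by_root(roots, tol):
    """Greedy clustering of root tuples; returns the mean root of each group.

    A root joins the first group whose current mean lies within tol,
    otherwise it starts a new group.
    """
    groups = []
    for root in roots:
        for group in groups:
            center = [mean(c) for c in zip(*group)]
            if math.dist(root, center) <= tol:
                group.append(root)
                break
        else:
            groups.append([root])
    return [tuple(mean(c) for c in zip(*group)) for group in groups]

def bottom_up(peak_lists, intra_tol, inter_tol):
    """Group within each peak list first, then across peak lists."""
    averaged = [center
                for roots in peak_lists.values()
                for center in group_by_root(roots, intra_tol)]
    return group_by_root(averaged, inter_tol)

# Toy (N, CA) root resonances in ppm for peaks from two peak lists.
peak_lists = {
    "NCACX": [(120.0, 55.0), (120.05, 55.02), (115.0, 60.0)],
    "CANcoCA": [(120.2, 55.1), (115.1, 59.9)],
}
spin_system_roots = bottom_up(peak_lists, intra_tol=0.1, inter_tol=0.4)
print(spin_system_roots)
```

The two NCACX peaks near (120.0, 55.0) are merged under the narrow intra-spectrum tolerance before they are compared with the CANcoCA peak, which is the ordering that the bottom-up approach relies on.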

For the typing algorithm, we introduce the concept of a chemical shift tuple or ordered list of chemical shifts that have some support for being in the same ladder or dipeptide spin system. Using a heuristic, the algorithm constructs a set of possible carbon chemical shift tuples to calculate Bayesian typing probabilities. Doing so minimizes the deleterious effects of resonance misclassification, which can arise from a multitude of situations including overlapped spin systems, noise peaks, and missing peaks. Furthermore, we can constrain tuple creation using 4D information from category III experiments (Table 2) and bottom-up grouping. However, the probability densities are no longer comparable in this Bayesian statistical framework because the probability density function changes with the number of carbon chemical shifts or independent variables used. This variation in the number of independent variables across the 20 residue types requires the use of chi-square probabilities, or p-values of a chi-square statistic, instead of probability densities. In the future, we can use the tuple concept to improve the linking and mapping algorithms.
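The need for chi-square probabilities can be sketched in Python. For a carbon chemical shift tuple, the sketch computes a chi-square statistic against per-residue reference means and standard deviations, then converts it to a p-value with degrees of freedom equal to the number of shifts used, so that scores based on one shift (e.g., glycine, which has no CB) remain comparable with scores based on two. The reference statistics below are rough illustrative placeholders, not the values used in this work, and `chi2_sf` and `type_tuple` are hypothetical names. The Bayesian priors and the tuple-construction heuristic are not modeled.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability, 1 - P(df/2, x/2).

    Uses the power series for the regularized lower incomplete gamma
    function, which is adequate for the small examples here.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Placeholder reference statistics: residue -> atom -> (mean, sd) in ppm.
STATS = {
    "ALA": {"CA": (53.1, 2.0), "CB": (19.0, 1.8)},
    "GLY": {"CA": (45.3, 1.3)},
    "SER": {"CA": (58.7, 2.1), "CB": (63.8, 1.5)},
}

def type_tuple(shifts, stats=STATS):
    """Return residue -> p-value for a dict of observed carbon shifts."""
    result = {}
    for residue, ref in stats.items():
        if set(shifts) - set(ref):
            # The residue type lacks an observed atom (e.g., CB for Gly).
            result[residue] = 0.0
            continue
        chi2 = sum(((shifts[atom] - ref[atom][0]) / ref[atom][1]) ** 2
                   for atom in shifts)
        result[residue] = chi2_sf(chi2, len(shifts))
    return result

two_shift = type_tuple({"CA": 53.0, "CB": 18.5})
one_shift = type_tuple({"CA": 45.0})
print(max(two_shift, key=two_shift.get), max(one_shift, key=one_shift.get))
```

Because each score is a tail probability rather than a density, the one-shift and two-shift results lie on the same 0 to 1 scale and can be thresholded together when building a list of possible residue types.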

Results and discussion

Currently, our implementation handles only a limited set of experimental peak lists, which includes: (i) NCACX 3D (with 35 ms DARR mixing); (ii) CANcoCA 3D; and (iii) CANCOCX 4D (Franks et al. 2005; Franks et al. 2007). These peak lists represent a category IIb assignment strategy (Table 2), which uses a Ni-Cαi root to create dipeptide spin systems. The implementation takes these peak lists, aligns them, groups peaks into dipeptide spin systems in a bottom-up strategy, and then types each ladder to probable amino acids using the carbon shift tuples. The implementation then simulates a set of Ni-Hi rooted peak lists for AutoAssign with an artificial HN shift equal to the observed CA shift divided by 6 (HN = CA/6). This creation of artificial HN shifts is necessary because AutoAssign requires Ni-Hi rooted peak lists. We then use AutoAssign to perform the linking and mapping steps. From this, we obtain an overall 84.1% assignment of the N, CO, CA, and CB resonances with no errors (Fig. 3), as compared to manually determined and verified assignments (BMRB entry 15156). These results demonstrate the feasibility of automating protein resonance assignments of MAS SSNMR spectral data. They are easily reproduced by the software and lack significant human subjectivity in the grouping and typing of spin systems. The input peak lists are not perfect; they are the realistic peak lists that a spectroscopist used for manual assignment. There are only enough matching peaks to form 52 out of 56 dipeptide spin systems, and some CB peaks are simply missing. Since the CANCOCX experiment is a 4D experiment, the resolution of the CA dimension is very low, causing a matching standard deviation of ~0.5 ppm when aligned to the other two peak lists. Nevertheless, our implementation handled the missing information and resolution issues and assigned 43 out of 52 dipeptide spin systems.
There are three main reasons for these results: (i) better dispersion with a Ni-Cαi root; (ii) an improved bottom-up grouping algorithm that especially allows CANCOCX peaks to group around a common C′i-1-Ni-Cαi root before grouping with peaks from other peak lists; and (iii) improved amino acid typing algorithms that shrank the average “possible residue type list” to 5.7 residues with 0.9999 confidence (normally ~8 residues with Cα/Cβ typing). We expect even better results once improved linking and mapping algorithms are implemented, allowing the development of software that will improve the quality of analysis over manual assignment alone. This software is available at http://bioinformatics.chem.louisville.edu.
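The artificial HN shift used to feed AutoAssign (HN = CA/6) amounts to a one-line transformation of each spin system root, sketched below in Python. The dictionary layout and the function name `simulate_hn_roots` are assumptions for illustration; AutoAssign's actual peak list format is not reproduced here. By simple arithmetic, dividing by 6 maps a CA shift of 42 to 66 ppm onto 7 to 11 ppm.

```python
def simulate_hn_roots(spin_systems):
    """Add an artificial HN shift (CA / 6) to each N-CA rooted spin system."""
    return [{**system, "HN": system["CA"] / 6.0} for system in spin_systems]

# Toy N-CA roots in ppm (not GB1 data).
roots = [{"N": 120.0, "CA": 54.0}, {"N": 108.5, "CA": 45.0}]
rooted = simulate_hn_roots(roots)
print(rooted)
```

Because the artificial HN value is a fixed function of CA, the simulated Ni-Hi root preserves the dispersion of the original Ni-Cαi root.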