Journal of Biomolecular NMR, Volume 40, Issue 4, pp 263–276

Structure-based protein NMR assignments using native structural ensembles

Authors

  • Mehmet Serkan Apaydın
    • Department of Computer Science, Duke University
  • Vincent Conitzer
    • Department of Computer Science, Duke University
  • B. R. Donald
    • Department of Computer Science, Duke University
    • Department of Biochemistry, Duke University Medical Center
Article

DOI: 10.1007/s10858-008-9230-x

Cite this article as:
Apaydın, M.S., Conitzer, V. & Donald, B.R. J Biomol NMR (2008) 40: 263. doi:10.1007/s10858-008-9230-x

Abstract

An important step in NMR protein structure determination is the assignment of resonances and NOEs to corresponding nuclei. Structure-based assignment (SBA) uses a model structure (“template”) for the target protein to expedite this process. Nuclear vector replacement (NVR) is an SBA framework that combines multiple sources of NMR data (chemical shifts, RDCs, sparse NOEs, amide exchange rates, TOCSY) and has high accuracy when the template is close to the target protein’s structure (less than 2 Å backbone RMSD). However, a close template may not always be available. We extend the circle of convergence of NVR for distant templates by using an ensemble of structures. This ensemble corresponds to the low-frequency perturbations of the given template and is obtained using normal mode analysis (NMA). Our algorithm assigns resonances and sparse NOEs using each of the structures in the ensemble separately, and aggregates the results using a voting scheme based on maximum bipartite matching. Experimental results on human ubiquitin, using four distant template structures show an increase in the assignment accuracy. Our algorithm also improves the robustness of NVR with respect to structural noise. We provide a confidence measure for each assignment using the percentage of the structures that agree on that assignment. We use this measure to assign a subset of the peaks with even higher accuracy. We further validate our algorithm on data for two additional proteins with NVR. We then show the general applicability of our approach by applying our NMA ensemble-based voting scheme to another SBA tool, MARS. For three test proteins with corresponding templates, including the 370-residue maltose binding protein, we increase the number of reliable assignments made by MARS. Finally, we show that our voting scheme is sound and optimal, by proving that it is a maximum likelihood estimator of the correct assignments.

Keywords

Automated NMR assignments; Normal mode analysis; NMR structural biology; Protein flexibility via structural ensembles; Structural bioinformatics

Abbreviations

bb RMSD: Backbone root mean square distance
BPG: Bipartite graph
CS: Chemical shift
EIN: N-terminal domain of enzyme I
EM: Expectation-maximization
GαIP: G-α interacting protein
HD: Homology detection
MBM: Maximum bipartite matching
MBP: Maltose-binding protein
MLE: Maximum likelihood estimator
MR: Molecular replacement
NMA: Normal mode analysis
NMR: Nuclear magnetic resonance
NOE: Nuclear Overhauser effect
NVR: Nuclear vector replacement
PR: Pseudoresidue
RDC: Residual dipolar coupling
SBA: Structure-based assignment
SPG: Streptococcal protein G

Introduction

One of the key steps in NMR protein structure determination is resonance and NOE assignments. The assignment problem requires mapping spectral peaks to tuples of interacting atoms in a protein. In this paper, we report a new algorithm for automated structure-based NMR assignments by exploiting an ensemble of structural templates.

Structure-based assignment (SBA) denotes automated assignment given prior information in the form of the putative structure (“template”) of the protein. By analogy, in X-ray crystallography, the molecular replacement (MR) technique allows solution of the crystallographic phase problem when a “close” or homologous structural model is known, thereby facilitating rapid structure determination (Rossman and Blow 1962). An automated procedure for rapidly determining NMR assignments given a homologous structure will similarly accelerate structure determination. Furthermore, even when the structure has already been determined by crystallography or homology modeling, NMR assignments are valuable to probe protein–protein interactions and protein–ligand binding (via chemical shift mapping or line-broadening). Previous SBA algorithms include CAP (Al-Hashimi and Patel 2002; Hus et al. 2002), NVR (Langmead et al. 2003; Langmead and Donald 2004a), the method of Meiler and Baker (2003), and MARS (Jung and Zweckstetter 2004b). The idea of correlating unassigned, experimentally measured residual dipolar couplings (RDCs) with bond vector orientations from a known structure was first proposed by Al-Hashimi and Patel (2002) and subsequently demonstrated by Al-Hashimi et al. (2002), who considered permutations of assignments for RNA. In Hus et al. (2002), RDC-based maximum bipartite matching (MBM) was successfully applied to SBA. Similarly, MARS (Jung and Zweckstetter 2004b) matches RDCs to those calculated from a known structure. An SBA algorithm should be robust with respect to structural noise and should handle distant structural templates: a small change in the putative structure should not change the assignments drastically, and the algorithm should work even when a close structural template is not available.

NVR (Langmead et al. 2003; Langmead and Donald 2004a) is an MR-like approach for SBA of resonances and sparse NOEs. NVR computes assignments that correlate experimentally-measured HN15N HSQC, HN15N RDCs (in two media), 3D NOESY-15N-HSQC spectra (dNN’s) and amide exchange rates, to a given backbone structural model. The algorithm requires only uniform 15N-labeling of the protein. The NMR data used by NVR can be acquired relatively rapidly compared to the traditional suite of experiments used to perform assignments. NVR runs in minutes and assigns with high accuracy the (HN,15N) backbone resonances as well as the sparse dNN’s from the 3D 15N-NOESY spectrum. NVR works well only when the structure of the protein is known or for close templates (less than 2 Å backbone (bb) RMSD). SBA in general and NVR in particular have had an impact on algorithms for NMR methodology (Bailey-Kellogg et al. 2004; Vitek et al. 2005), and SBA has been important in the determination of protein structures (Potluri et al. 2006, 2007).

We introduce an algorithm that extends the circle of convergence of NVR such that distant templates can be used to obtain high assignment accuracies. We also improve NVR’s robustness with respect to structural noise. In addition, we provide a measure of confidence for individual assignments.

As in NVR, our procedure takes as input NMR data plus a single structure P (Fig. 1). P is called the “template” and is obtained from a putative (remote) structural homolog of the protein that originated the NMR data. We then generate an ensemble of structures from P by considering its flexibility, and then make the assignments for each structure in the ensemble separately. We then aggregate the assignments of each of the models using MBM as a voting scheme, which we show is a maximum likelihood estimator (MLE). In our study, we find that this scheme generally improves the assignment accuracy and improves the robustness of assignments with respect to structural noise. The percentage of models that agree on a given assignment provides an intuitive confidence measure for the assignment. We demonstrate our algorithm on four different structural models of human ubiquitin, using HD (for homology detection) (Langmead and Donald 2004b), a variant of NVR. In contrast to the original NVR, where the structural template must be less than 2 Å bb RMSD from the target, we use templates ranging in bb RMSD between 3.2 and 7.7 Å. The assignment accuracy of NVR for these distant structural templates ranges between 47–57% for human ubiquitin. With our new algorithm, the range of the assignment accuracy improves to 69–74%. Furthermore, combining the models from all ensembles raises the accuracy to 86%. Similarly, for G-α interacting protein (GαIP), the assignment accuracy increases from 65% to 77%. For streptococcal protein G (SPG), our results are mixed. However, combining the models for SPG raises the assignment accuracy.
https://static-content.springer.com/image/art%3A10.1007%2Fs10858-008-9230-x/MediaObjects/10858_2008_9230_Fig1_HTML.gif
Fig. 1

Overview of our methodology. We start from a template P. We use normal mode analysis (NMA) to obtain an ensemble of perturbed models around this template. Each model is in turn used as an input to a structure based assignment (SBA) algorithm (such as NVR or MARS) along with measured NMR data to compute the assignments. Each assignment is then combined using our voting scheme to obtain the “consensus” assignments

We demonstrate the generality of our approach (of using NMA ensembles around a given template with our voting scheme) with MARS, which is a significantly different SBA tool from NVR (in terms of its algorithm and its input data). MARS can use both 13C- and 15N-labeled data and takes as input the observed intra- and inter-residual chemical shifts grouped into pseudoresidues (PR). Depending on the type of available spectra, MARS uses chemical shifts of H^N_i, N_i, C′_{i−1}, C^α_i, C^α_{i−1}, C^β_i, and C^β_{i−1}, grouped into a PR with H^N_i and N_i, obtained from a 15N-1H HSQC spectrum, serving as an anchor. In addition, when a template structure is available, MARS can use arbitrary RDCs from triple-resonance experiments to help the assignments. MARS is a hybrid assignment framework that optimizes local and global quality of fit of the amino acid sequence to the pseudoresidues. In the linking stage, it links pseudoresidues using sequential connectivity information to obtain PR segments of lengths five down to two. In the matching stage, it maps these segments to the amino acid sequence to obtain the assignments. It compares these assignments with one obtained using a global energy function and retains the consistent assignments. MARS follows an iterative procedure in which the experimental data are perturbed by adding noise to extract robust assignments. MARS computes reliability information for each assignment, labeling it as low, medium, or high confidence. It also lists all possible assignments for a given PR, along with their probabilities. We demonstrate our algorithm on three proteins that come with the MARS software distribution (Jung and Zweckstetter 2004a), and corresponding templates. The templates are close structural homologs of the corresponding target proteins, with 100% sequence identity to the target proteins. The target proteins are: 76-residue human ubiquitin, 259-residue amino terminal domain of enzyme I from E. coli (EIN), and 370-residue maltose-binding protein (MBP). With our new technique, we show that the number of correct and reliable (high confidence) assignments increases in all test cases. As in Jung and Zweckstetter (2004b), we apply our framework to MARS with varying amounts of data, such as with and without sequential connectivity information, and up to three RDCs per residue. Depending on the amount of data used as input, the number of correct and reliable assignments increases by up to 23 at the expense of introducing three incorrect assignments (corresponding to a 3-fold increase in the number of correct assignments). Furthermore, the number of incorrect assignments generally does not increase.

Using an ensemble of structures in SBA is reasonable, since the structures of proteins in the PDB presumably correspond only to the ground state of these proteins (Kay 1998). The NMR data acquired from a protein in solution corresponds to a time- and ensemble-average over the many conformations assumed during data acquisition. We use NMA to perturb the template to obtain an ensemble of structures. NMA is a technique commonly used to study the low-frequency motion of proteins. It represents the energy landscape around a given energy minimum with a harmonic approximation and solves for the equations of motion within that well analytically. It has been shown that over half of the known protein movements can be modeled by displacing the protein along at most two low frequency normal modes (Krebs et al. 2002). Furthermore, NMA has been shown to reproduce the deformations in the core of homologous proteins caused by sequence differences in 35 large, diverse, and well studied superfamilies (Leo-Macias et al. 2005). Therefore, it seems reasonable to expect that the conformational differences between the template and the target protein can be modeled by NMA. In contrast to classical molecular motion simulation techniques such as molecular dynamics, NMA can very rapidly compute an ensemble of structures that correspond to the likely conformations assumed by the molecule around its energy minimum. NMA has been successful in predicting experimental quantities such as temperature factors of proteins (Bahar et al. 1997). We use coarse-grained NMA where several amino acids are grouped into a single super-residue which effectively removes the small scale fluctuations of a protein such as sidechain motions to model the slow, large-scale motions (such as backbone rearrangements) (Suhre and Sanejouand 2004a).

To the best of our knowledge, ours is the first approach that uses ensembles for structure-based resonance assignments. Note that previously ensembles have been used successfully in structure determination (given assignments) (Best and Vendruscolo 2004), and for NOE assignments (given resonance assignments) (Mumenthaler et al. 1997; Güntert 2004). Our results show that ensemble-based approaches are also useful for structure-based resonance assignments.

NMA was analogously used by Suhre and Sanejouand (2004b) for protein structure determination by MR, using X-ray diffraction data. The authors observed that although the original template did not help solve the crystallographic phase problem, there existed a structure in the NMA ensemble that enabled the refinement of the target structure. This structure was chosen from the ensemble using a scoring function.

Our contributions in this paper are:
  1. The use of NMA structural ensembles in structure-based NMR assignments,
  2. “Robust” NMR assignments with respect to structural noise, by which we mean there is only a small change in assignment accuracy when the input structure changes slightly (note that this is not the case in general for maximum bipartite matching based assignment algorithms, including Langmead and Donald 2004a),
  3. Increased radius of convergence of NVR with respect to the target–template structural similarity,
  4. Improved assignment accuracy of NVR for distant templates (by up to 22%),
  5. A confidence measure for each assignment,
  6. A demonstration of the generality of our framework by improving the assignment accuracy of MARS on three test proteins (by up to 3-fold), and
  7. A proof that our voting rule, which aggregates the assignments corresponding to individual models, is a maximum likelihood estimator.

Preliminaries

NMR data used by NVR

An assignment algorithm must determine the mapping of the resonances and NOEs to the corresponding nuclei of the protein. We can define the assignment problem as the mapping of the peaks to the corresponding residues, due to the specific set of NMR data used by our framework.

We use the following NMR data: HN-15N HSQC, NOESY-15N-HSQC (yielding sparse dNN’s, observed between nearby pairs of amide protons), NH RDCs in two media (which provide global orientational restraints on NH amide bond vectors), 15N TOCSY (for the sidechain chemical shifts), and amide exchange HSQC (to identify, probabilistically, solvent-exposed amide protons).

RDCs provide global information on the orientation of internuclear vectors. For each RDC r, we have the following RDC equation (Tolman et al. 1995; Tjandra and Bax 1997):
$$ r = D_{max}\, \mathbf{v}^{T} \mathbf{S}\, \mathbf{v}. \tag{1} $$

Here D_max is the dipolar interaction constant, v is the internuclear bond vector orientation relative to an arbitrary molecular frame, and S is the 3 × 3 Saupe order matrix which describes the average substructure alignment in the weakly-aligned anisotropic phase. Equation 1 shows the quadratic dependence of r on v, thus explaining the sensitivity of RDCs (and hence, SBA algorithms that use RDCs, such as NVR) with respect to structural noise.
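As an illustration of Eq. 1, the following minimal sketch evaluates the quadratic form for a unit bond vector. The Saupe matrix entries and the value of the dipolar constant are hypothetical numbers chosen only for the example; they are not taken from the paper's data.

```python
import math

def unit(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def rdc(d_max, v, S):
    """Evaluate r = D_max * v^T S v for a unit bond vector v and a 3x3 matrix S."""
    Sv = [sum(S[i][j] * v[j] for j in range(3)) for i in range(3)]
    return d_max * sum(v[i] * Sv[i] for i in range(3))

# Hypothetical inputs (assumptions for illustration only):
# a symmetric, traceless Saupe matrix written in its principal frame,
# and an arbitrary dipolar interaction constant in Hz.
S = [[-0.0004, 0.0, 0.0],
     [0.0, -0.0006, 0.0],
     [0.0, 0.0, 0.0010]]
D_MAX = 24350.0

v_along_z = [0.0, 0.0, 1.0]
tilt = math.radians(10.0)                      # a small structural perturbation
v_tilted = unit([math.sin(tilt), 0.0, math.cos(tilt)])

r_z = rdc(D_MAX, v_along_z, S)
r_tilted = rdc(D_MAX, v_tilted, S)
r_flipped = rdc(D_MAX, [-c for c in v_tilted], S)  # quadratic form: sign of v does not matter
print(r_z, r_tilted, r_flipped)
```

Because the form is quadratic in v, reversing the vector leaves r unchanged, while a modest tilt of the bond vector already shifts the predicted coupling.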

Only unambiguous dNN’s are used in NVR. Typically only a few unambiguous dNN’s (e.g., 43 for ubiquitin) can be obtained from the 3D-NOESY. These dNN’s are automatically-assigned as a byproduct of NVR’s resonance assignments (Langmead and Donald 2004a).

NVR

NVR is an automated SBA algorithm for proteins of known structure or with a known close structural homolog. NVR uses MBM in an expectation maximization (EM) framework to compute the assignments. Each peak p and residue r form the nodes of a bipartite graph (BPG), where one set of vertices is the set of peaks, the other set of vertices is the set of residues, and the edges correspond to the likelihood of assigning p to r in the bipartite graph. The EM framework is used to iteratively select the most likely (peak, residue) assignment. More details can be found in (Langmead and Donald 2004a).

NVR integrates various NMR data as a means to increase the signal-to-noise ratio. The signal is the computed likelihood of the assignment between a peak and the (correct) residue. The noise is the uncertainty in the data, where the probability mass is distributed among multiple residues. Each line of evidence (i.e., experiment) has noise, but the noise tends to be random and thus cancels when the lines of evidence are combined. Conversely, the signals embedded in each line of evidence tend to reinforce one another, resulting in relatively unambiguous assignments.

NVR has the advantage that it only needs 15N-labeled data, which is cheaper to obtain than 13C-labeling, which is required by many automated assignment algorithms. NVR only uses unassigned data.

Methods

An overview of our methodology is presented in Fig. 1. Our algorithm starts with a structural model. We apply NMA to this model to obtain an ensemble of structures. Then, for each member of the ensemble, we predict the backbone chemical shifts, and we also extract the NH amide bond vectors as well as proton coordinates of the amide bonds. NVR requires these, as well as the experimental NMR data. We predict the chemical shifts using the BMRB (Seavey et al. 1991), SHIFTS (Xu and Case 2001), and SHIFTX (Neal et al. 2003), following the protocol in Langmead and Donald (2004a, b). We then run NVR for each of the structural models. We combine the resulting assignments using MBM (Fig. 2). The MBM is done on a BPG in which one set of nodes represents peaks and the other represents residues. The edge weights are simply the number of models in the ensemble that vote for the corresponding assignment. We used the Hungarian (Kuhn–Munkres) algorithm (Kuhn 1955), as implemented by N. Borlin, to solve MBM. While MBM has been used previously for NMR assignments (Hus et al. 2002; Xu et al. 2002; Langmead and Donald 2004a), edge-weights based on votes by a structural ensemble are novel.
https://static-content.springer.com/image/art%3A10.1007%2Fs10858-008-9230-x/MediaObjects/10858_2008_9230_Fig2_HTML.gif
Fig. 2

Our ensemble-based voting algorithm (maximum bipartite matching) combines the assignments for each model. The aggregated bipartite graph (BPG) combines the BPGs corresponding to each of the individual models. In the aggregate bipartite graph, the edge weight is one for the continuous lines, two for the dashed edges, and three for the dotted edge
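The voting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it counts, for every (peak, residue) pair, how many models vote for it, and then finds the matching of maximum total vote weight with a textbook Hungarian (Kuhn–Munkres) routine applied to the negated weights. The function names and the four toy model assignments are hypothetical.

```python
def hungarian_min_cost(cost):
    """Minimum-cost assignment for a rows x cols matrix with rows <= cols.

    Returns match, where match[i] is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials
    v = [0] * (m + 1)    # column potentials
    p = [0] * (m + 1)    # p[j] = row (1-based) matched to column j, 0 if free
    way = [0] * (m + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip matched/unmatched edges along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match

def consensus_assignment(model_assignments, n_residues):
    """Aggregate per-model assignments (lists of peak -> residue) by vote count.

    Assumes the number of peaks does not exceed the number of residues.
    """
    n_peaks = len(model_assignments[0])
    votes = [[0] * n_residues for _ in range(n_peaks)]
    for assignment in model_assignments:
        for peak, residue in enumerate(assignment):
            votes[peak][residue] += 1
    # Maximizing total votes is the same as minimizing the negated votes.
    cost = [[-w for w in row] for row in votes]
    return hungarian_min_cost(cost), votes

# Hypothetical toy ensemble: four models, three peaks, three residues.
models = [[0, 1, 2], [0, 1, 2], [1, 0, 2], [0, 2, 1]]
consensus, votes = consensus_assignment(models, n_residues=3)
print(consensus, votes)
```

In this toy case the identity assignment collects 3 + 2 + 3 = 8 votes, more than any other one-to-one matching, so it is returned as the consensus.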

We tested our algorithm on NVR with three proteins, and a total of seven distant templates, previously studied in Langmead and Donald (2004b). They are listed in Table 1. The three proteins are the 76-residue human ubiquitin (PDB ID 1D3Z, (Cornilescu et al. 1998)), the 56-residue streptococcal protein G (SPG) (PDB ID 3GB1, (Kuszewski et al. 1999)), and the 128-residue GαIP (PDB ID 1CMZ, (De Alba et al. 1999)). For these proteins, the NMR data (but not the actual structures) were used by our algorithm. For ubiquitin, the NH residual dipolar couplings recorded in two separate media (bicelle and phage) (Cornilescu et al. 1998), and HN15N HSQC and NOESY-15N-HSQC spectra (Harris 2002) were used. For SPG and GαIP, the chemical shifts deposited into BMRB (Seavey et al. 1991) and amide-bond RDC data (Kuszewski et al. 1999; De Alba et al. 1999, resp.) were used. A set of sparse, unassigned dNN’s were simulated for SPG and GαIP using the target structure and BMRB shifts as in Langmead and Donald (2004b). For all three proteins, amide-exchange data and TOCSY data were simulated using the target structure and BMRB shifts as previously described (Langmead and Donald 2004b). The template structures were obtained from the structural homologs of the target protein by homology modeling, as previously described (Langmead and Donald 2004b), using MODELLER (Sali and Blundell 1993). MODELLER was used to construct a backbone model for the target using template’s backbone structure. Next, the sidechains for the model were constructed using MAXSPROUT (Holm and Sander 1991). MAXSPROUT considers rotamers for each sidechain and avoids steric clashes. Hydrogen atom coordinates were added to the template structures and these structures were energy-minimized using the PROTONATE and SANDER modules of AMBER (Pearlman et al. 1995), respectively. There is less than 30% sequence identity between each target protein and its structural homologs. 
We report in Table 1 the backbone RMSD as well as the CE RMSD of these distant templates. The CE RMSD refers to the RMSD of the aligned (homologous) regions between the template and the target, as computed by CE (Shindyalov and Bourne 1998). CE performs a combinatorial search to find the optimal structural alignment, and matches the vectors between Cα atoms to obtain aligned fragment pairs, which it optimizes using dynamic programming. The aligned regions measure the degree of homology between the target and the template.
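Table 1

Assignment accuracy (% correct assignments) of NVR with distant templates for corresponding target proteins. The last five columns (Original through Confident) report accuracy (%).

| Target protein (a) | Homolog (b) | bb RMSD (c) (Å) | CE RMSD (d) (Å) | Sequence identity (%) (e) | Original (f) | Lowest–highest (g) | HD Score (h) | Ensemble (i) | Confident (j) |
|---|---|---|---|---|---|---|---|---|---|
| Human ubiquitin | 1RFA | 8.0 (7.4–9.5) | 2.2 (89) | 12 | 51 | 19–67 | 57 | 73 | 97 |
| | 1EF1:A[4–84] | 6.3 (6.0–9.0) | 1.7 (38) | 10 | 57 | 17–69 | 63 | 74 | 100 |
| | 1H8C:A | 6.7 (6.4–8.4) | 1.9 (89) | 16 | 47 | 21–76 | 51 | 69 | 89 |
| | 1VCB:B | 3.5 (3.4–5.1) | 3.8 (44) | 13 | 53 | 13–73 | 64 | 74 | 95 |
| | All templates (k) | | | | | 13–76 | 64 | 86 | 100 |
| GαIP | 1DK8:A | 2.7 (2.6–2.7) | 1.9 (82) | 29 | 65 | 34–78 | 46 | 77 | 96 |
| SPG | 1HEZ:E | 5.0 (5.0–7.0) | 2.0 (94) | 13 | 60 | 25–76 | 71 | 62 | 79 |
| | 1JML:A | 8.6 (7.2–11.0) | 1.9 (75) | 13 | 65 | 33–80 | 64 | 60 | 83 |
| | All templates (k) | | | | | 25–80 | 71 | 69 | 87 |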

(a) The target protein (source of the NMR data). This structure was not used by our algorithm. Instead, a template structure was used, obtained using homology modeling and energy minimization, starting from the structural homolog

(b) PDB ID for the structural homologs

(c) Overall backbone (bb) RMSD between the template structure and the target protein’s structure. The range of the RMSD distance of the ensemble to the target is provided in parentheses

(d) bb RMSD of the structural alignment as computed by CE (Shindyalov and Bourne 1998). The percentage of the residues in the CE alignment is shown in parentheses

(e) The sequence identity between the sequences of the target protein and the structural homolog, as computed by CE

(f) Assignment accuracy obtained using the template

(g) The range of assignment accuracy over the NMA ensemble: the minimum and the maximum

(h) The accuracy of the structure in the NMA ensemble with the highest HD score

(i) The accuracy with our NMA ensemble-based voting scheme

(j) Assignment accuracy for the ‘confident’ peaks (where the confidence threshold is 0.5)

(k) Obtained by combining all corresponding NMA ensembles

We further tested our algorithm on three more proteins and a set of three structurally close templates, previously studied by Jung and Zweckstetter (2004b) and distributed with the MARS software (Jung and Zweckstetter 2004a). The target proteins are human ubiquitin (PDB ID 1D3Z), the 259-residue amino terminal domain of enzyme I from E. coli (EIN) (PDB ID 3EZA), and the 370-residue maltose-binding protein (MBP) (PDB ID 1EZP). The set of NMR data used for these proteins, as well as the template information, is given in Table 3. Unlike our tests with NVR, in which we used a more distant ensemble of structures, the templates here are structurally closer to the target structure (the bb RMSD ranges between 0.4 and 3.7 Å). Hence this study tests our algorithm both with a significantly different SBA tool and with structurally similar templates.

For both NVR and MARS, we used an NMA webserver, elNémo (Suhre and Sanejouand 2004a) to obtain an ensemble of structures around the template. We computed the five lowest-frequency normal mode displacements, with default parameters. Each of the low frequency normal modes returned 11 structures, corresponding to the motion of the template along that normal mode. We thus obtain 55 structures. We also displaced the template structure bidirectionally along its two lowest frequency normal modes, resulting in a total of 176 structures per template. The bb RMSD of the most distant structure to the starting model is less than 3 Å.
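The ensemble size can be checked with a short count. The sketch below assumes, as one reading consistent with the stated totals, that the bidirectional displacement along the two lowest-frequency modes forms an 11 × 11 grid of amplitude pairs; the paper does not spell out this grid, so treat it as an assumption.

```python
from itertools import product

N_MODES = 5    # lowest-frequency normal modes displaced one at a time
N_STEPS = 11   # structures returned per mode

# One-dimensional scans: each mode contributes N_STEPS structures.
single_mode = [(mode, step) for mode in range(N_MODES) for step in range(N_STEPS)]

# Assumed two-dimensional scan: every pair of amplitudes along modes 1 and 2.
two_mode_grid = list(product(range(N_STEPS), repeat=2))

total = len(single_mode) + len(two_mode_grid)
print(len(single_mode), len(two_mode_grid), total)
```

Under this assumption the counts are 55 + 121 = 176, matching the number of structures per template reported above.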

Our algorithm runs in O(mn + mn^{2.5} log(cn)) time, where m is the number of models in the ensemble, n is the number of residues in the target protein, and c is the maximum edge weight in an integer-weighted bipartite graph. In comparison, NVR runs in time O(n^{2.5} log(cn)), whereas HD has a time complexity of O(pn^{2.5} log(cn) + p log ppn), where p is the number of proteins in a database of structural models. For a discussion of the complexity of NVR and HD, see Langmead and Donald (2004a, b) respectively. For reference, c is a constant and is dictated by the resolution of the NMR data. NVR runs in minutes on a desktop PC to assign a protein with about 56–128 residues using one template.

Results

We ran NVR for all three target proteins, with the corresponding templates obtained from structural homologs, for each of the ensemble of models obtained by NMA. We report the assignment accuracy for the template structure, as well as the range of accuracies in the NMA ensemble in Table 1. It can be seen that if we could choose the right template from this ensemble, we would improve the assignment accuracy of NVR. However this requires a scoring function that correlates strongly with the assignment accuracies.

Using a scoring function to choose a model from the ensemble

Suhre and Sanejouand (2004b) used NMA to perturb the structural model, and then chose a perturbed template structure with a scoring function (“free R factor”) in MR in X-ray crystallography, which allowed them to solve the target protein structure. We hypothesized that we could follow a similar methodology to choose a template from the NMA ensemble as input to NVR in NMR structure determination.

The HD score function combines the “preference list” of all the seven “voters” of NVR. These “voters” correspond to the NMR data used by NVR. They are: RDCs in two media, chemical shifts predicted using three different protocols (Langmead and Donald 2004a), amide-exchange and TOCSY data. Each “voter” has a ranked list of probabilities (“preference list”) for each peak, corresponding to the likelihood of matching that peak with each residue (e.g., according to RDCs). The HD scoring function (Langmead and Donald 2004b) simply multiplies and normalizes these probabilities to obtain an overall matrix representing the aggregated preference of all the voters for each peak. Given an assignment, the set of probabilities corresponding to individual (peak, residue) assignments are combined and returned as the HD score.
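The multiply-and-normalize step can be sketched as below. This is an illustration under stated assumptions, not the HD implementation: the two toy voters and their probabilities are hypothetical, and because the text does not say how the per-assignment probabilities are combined into a single score, the sketch assumes a sum of log-probabilities.

```python
import math

def combine_voters(voters):
    """Multiply the voters' probabilities entrywise, then normalize each peak's row.

    voters[k][p][r] is the probability voter k gives to matching peak p with residue r.
    """
    n_peaks, n_res = len(voters[0]), len(voters[0][0])
    combined = []
    for p in range(n_peaks):
        row = [1.0] * n_res
        for voter in voters:
            for r in range(n_res):
                row[r] *= voter[p][r]
        total = sum(row)
        combined.append([x / total for x in row] if total > 0 else row)
    return combined

def assignment_score(combined, assignment):
    """Assumed combination rule: sum of log-probabilities over (peak, residue) pairs."""
    return sum(math.log(combined[p][r]) for p, r in enumerate(assignment))

# Hypothetical voters (e.g., one RDC-based, one chemical-shift-based).
voter_a = [[0.6, 0.4], [0.5, 0.5]]
voter_b = [[0.7, 0.3], [0.2, 0.8]]
combined = combine_voters([voter_a, voter_b])

score_identity = assignment_score(combined, [0, 1])
score_swapped = assignment_score(combined, [1, 0])
print(combined, score_identity, score_swapped)
```

Here both voters favor the identity assignment, so it receives the higher score.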

We used the HD score to choose the model with the highest score from the NMA ensemble. The assignment accuracy for this structure is in column entitled ‘HD score accuracy’ of Table 1. The correlation of the HD score with the assignment accuracies is shown in Fig. 3. Each point in the scatter plot corresponds to one of the structural models. The x-axis corresponds to the HD score and the y-axis to the assignment accuracy. The correlation coefficient between the HD score and the assignment accuracy is 0.44. This correlation is too weak for the HD score to reliably select, from the ensemble, a model with a higher assignment accuracy than the starting template.
https://static-content.springer.com/image/art%3A10.1007%2Fs10858-008-9230-x/MediaObjects/10858_2008_9230_Fig3_HTML.gif
Fig. 3

HD score vs. assignment accuracy. Each point corresponds to a template in the normal mode analysis (NMA) ensemble. A representative set of templates for all three target proteins is shown. The x-axis is the HD score of the template structure, whereas the y-axis is the assignment accuracy (%). The correlation coefficient is 0.44
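For reference, a correlation coefficient of this kind can be computed as in the sketch below. It assumes the Pearson coefficient (the text does not name the estimator), and the score and accuracy values are made-up toy numbers, not the data behind Fig. 3.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical (HD score, accuracy %) pairs, one per structural model.
hd_scores = [1.0, 2.0, 3.0, 4.0]
accuracies = [50.0, 70.0, 60.0, 80.0]
r = pearson(hd_scores, accuracies)
print(r)
```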

MBM voting over the NMA ensemble

We used MBM to aggregate the assignments corresponding to all of the models in the NMA ensemble (see Methods). The results of this scheme are in column entitled ‘Ensemble accuracy’ of Table 1. For all three proteins with the corresponding templates, the assignment accuracy improves in all but one of the seven protein–template pairs, with respect to the starting structural model, by up to 22%. We also combined the assignments of all the models corresponding to all four templates for human ubiquitin and both templates for SPG, obtaining an even higher accuracy (86% and 69%, respectively, shown in column entitled ‘Ensemble accuracy’ and row entitled ‘All templates’ of Table 1). For SPG using the template obtained from pdb ID 1JML (which is 8.7 Å bb RMSD from SPG), the assignment accuracy actually decreases with the consensus scheme. Note that the template obtained from 1JML is the farthest structure from its corresponding target structure in our test cases, and it may be that this starting template is outside the radius of convergence of NVR.

Confidence measure

Given the assignments for each structural model in the NMA ensemble and the consensus assignments computed using MBM voting, we can compute the fraction of models that agree on a given resonance assignment. This ratio can be used as a ‘confidence’ measure for that assignment. Intuitively, the larger the number of models that agree on a (peak, residue) assignment, the less likely it is that that assignment is due to noise.
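The confidence measure can be sketched in a few lines. The per-model assignments and the consensus below are hypothetical toy inputs, and the function names are illustrative; the strict "more than" comparison follows the 50% rule used for ‘confident’ assignments in this section.

```python
def confidences(model_assignments, consensus):
    """Fraction of models that agree with the consensus residue for each peak."""
    n_models = len(model_assignments)
    return [
        sum(1 for a in model_assignments if a[peak] == residue) / n_models
        for peak, residue in enumerate(consensus)
    ]

def confident_subset(consensus, conf, threshold=0.5):
    """Keep only assignments on which more than `threshold` of the models agree."""
    return {
        peak: residue
        for peak, (residue, c) in enumerate(zip(consensus, conf))
        if c > threshold
    }

# Hypothetical toy ensemble: four models, three peaks (peak index -> residue index).
models = [[0, 1, 2], [0, 1, 2], [1, 0, 2], [0, 2, 1]]
consensus = [0, 1, 2]

conf = confidences(models, consensus)
confident = confident_subset(consensus, conf)
print(conf, confident)
```

In this toy case peak 1 is supported by exactly half of the models, so it is reported in the consensus but not in the ‘confident’ subset.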

In Fig. 4, we show the ratio of the models that agree on a particular assignment (the ‘confidence’, shown on the y-axis) for each individual peak (the x-axis), for human ubiquitin with template obtained from 1RFA (which is 7.7 Å bb RMSD from human ubiquitin). Each blue ‘circle’ (resp., red ‘cross’) corresponds to a correct (resp., incorrect) assignment. The ratio of blue ‘circles’ to all markers determines the accuracy of the consensus assignments reported in column entitled ‘Ensemble accuracy’ of Table 1. The blue ‘circles’ have a higher confidence value than the red ‘crosses’ in general, suggesting that the confidence values indeed correlate with the assignment accuracy. The higher the ‘confidence’ for an assignment, the more likely it is to be correct.
https://static-content.springer.com/image/art%3A10.1007%2Fs10858-008-9230-x/MediaObjects/10858_2008_9230_Fig4_HTML.gif
Fig. 4

Assignments and confidences for human ubiquitin with pdb ID 1RFA as template and NVR. The diagram shows in blue (‘o’) the correct resonance assignments, and in red (‘x’) the incorrect ones. The x-axis corresponds to the individual peaks from the ubiquitin spectra, and the y-axis shows the “confidence”, which is the fraction of the models that agree on the corresponding assignment for that peak over all models. The peaks are sorted in the order of ascending confidences. The assignment accuracy increases to 71% with our algorithm (compared to 51% with the single-structure based NVR)

In Fig. 4, there are very few incorrect assignments for which more than half of the models agree. Therefore, we selected a threshold of 50%, and called an individual assignment ‘confident’ if more than 50% of the models agree on that assignment. The assignment accuracy of the confident assignments is in the last column of Table 1. We also combined all the models and report the corresponding ‘confident’ assignment accuracy. The ‘confident’ assignment accuracy is higher than consensus assignment accuracy in all cases.
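As a rough sketch of the thresholding step, the snippet below keeps only the assignments supported by more than 50% of the models and compares the accuracy of that subset with the accuracy of all consensus assignments. The helper names `confident_subset` and `accuracy`, and all of the toy values, are assumptions for illustration only.

```python
def confident_subset(confidence, threshold=0.5):
    """Peaks whose confidence is strictly greater than the threshold."""
    return {peak for peak, c in confidence.items() if c > threshold}

def accuracy(assignment, truth, peaks=None):
    """Fraction of the given peaks (default: all) assigned to the true residue."""
    peaks = list(assignment) if peaks is None else list(peaks)
    if not peaks:
        return None  # accuracy is undefined for an empty subset
    correct = sum(1 for p in peaks if assignment[p] == truth.get(p))
    return correct / len(peaks)

# Made-up example: four consensus assignments, two of them wrong.
consensus = {"p1": "A", "p2": "B", "p3": "C", "p4": "D"}
truth = {"p1": "A", "p2": "B", "p3": "D", "p4": "C"}
confidence = {"p1": 0.9, "p2": 0.6, "p3": 0.5, "p4": 0.2}

confident = confident_subset(confidence, threshold=0.5)
print(sorted(confident))                       # ['p1', 'p2']
print(accuracy(consensus, truth))              # 0.5
print(accuracy(consensus, truth, confident))   # 1.0
```

Note that p3, with a confidence of exactly 0.5, is excluded because the rule in the text requires more than 50% agreement.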

If we select a confidence threshold lower than 0.5, we can include more of the correct individual (peak, residue) assignments, at the expense of introducing some of the incorrect ones. This trade-off can be seen with a receiver operating characteristic (ROC) curve. For each threshold, one can compute the sensitivity and the specificity and plot these points as in Fig. 5. For instance, for the target protein SPG with pdb ID 1HEZ as template (which is at 5.1 Å from SPG), a confidence threshold of 0.9 seems more suitable to correctly assign more than 40% of the peaks (corresponding to 13 peaks) without introducing any incorrect assignments. On the other hand, for the target protein ubiquitin with 1EF1 as template (which is at 6.2 Å from ubiquitin), a confidence threshold of 0.5 results in 25 correctly assigned peaks with no incorrect assignment among them. The trade-off between choosing a confidence threshold of 0.9 and 0.5 can also be seen in Table 2, where the absolute numbers of correct and incorrect peaks found using both thresholds are provided. One can select a confidence threshold that returns a high number of correct assignments while minimizing the number of incorrect assignments.
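The ROC points described above can be computed as in the following sketch. It follows the axis definitions given in the caption of Fig. 5 (reported incorrect over all incorrect, reported correct over all correct). The function name `roc_points`, the strict "greater than" inclusion rule, and the toy values are assumptions; the correctness labels would in practice come from a reference assignment.

```python
def roc_points(confidence, is_correct, thresholds):
    """Return (threshold, 1 - specificity, sensitivity) for each threshold.

    confidence: dict peak -> fraction of models agreeing with the consensus.
    is_correct: dict peak -> True if the consensus assignment is correct.
    """
    n_correct = sum(1 for p in confidence if is_correct[p])
    n_incorrect = len(confidence) - n_correct
    points = []
    for t in thresholds:
        # Assumption: an assignment is reported if its confidence exceeds t.
        reported = [p for p, c in confidence.items() if c > t]
        tp = sum(1 for p in reported if is_correct[p])
        fp = len(reported) - tp
        sensitivity = tp / n_correct if n_correct else 0.0
        false_positive_rate = fp / n_incorrect if n_incorrect else 0.0
        points.append((t, false_positive_rate, sensitivity))
    return points

# Made-up example with two correct and two incorrect consensus assignments.
confidence = {"p1": 0.95, "p2": 0.7, "p3": 0.6, "p4": 0.3}
is_correct = {"p1": True, "p2": True, "p3": False, "p4": False}
points = roc_points(confidence, is_correct, [0.9, 0.5, 0.0])
for t, fpr, sens in points:
    print(t, fpr, sens)
```

Lowering the threshold moves the operating point up and to the right: more correct assignments are reported, but incorrect ones start to appear as well.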
https://static-content.springer.com/image/art%3A10.1007%2Fs10858-008-9230-x/MediaObjects/10858_2008_9230_Fig5_HTML.gif
Fig. 5

Receiver operating characteristic (ROC) curve for varying confidence thresholds: (Left) for SPG with the template obtained from 1HEZ; (Right) for ubiquitin with the template obtained from 1EF1. The confidence threshold is the ratio of models that must agree on a particular (peak, residue) assignment in order to include that pairing in the reported subset of assignments. For a given threshold, the x-axis is the ratio of reported incorrect assignments over all incorrect assignments (1-specificity). The y-axis is the ratio of reported correct assignments over all correct assignments (sensitivity). An ideal confidence threshold would be such that the returned assignments would include a maximum subset of the correct assignments and a minimum subset of the incorrect assignments. (Left) For this case, a confidence threshold of 0.9 would return about 40% of the correct assignments with no incorrect assignments. (Right) For this case, a confidence threshold of 0.5 would return about 50% of the correct assignments with no incorrect assignments

Table 2

Effect of varying the confidence threshold: Number of correct and incorrect peaks with NVR with distant templates for corresponding target proteins, with varying confidence thresholds

| Target protein (a) | Homolog (b) | # Correct (# incorrect), threshold 0.5 (c) | # Correct (# incorrect), threshold 0.9 (d) |
|---|---|---|---|
| Human ubiquitin | 1RFA | 32 (1) | 7 (0) |
| | 1EF1 | 25 (0) | 6 (0) |
| | 1H8C | 25 (3) | 9 (0) |
| | 1VCB | 35 (2) | 8 (1) |
| | All templates (e) | 29 (0) | 5 (0) |
| GαIP | 1DK8 | 75 (3) | 38 (2) |
| SPG | 1HEZ | 27 (7) | 13 (0) |
| | 1JML | 29 (6) | 16 (0) |
| | All templates (e) | 26 (4) | 11 (0) |

(a) The target protein (source of the NMR data). This structure was not used by our algorithm. Instead, a template structure was used, obtained using homology modeling and energy minimization, starting from the structural homolog

(b) PDB ID for the structural homologs

(c) Number of correct (resp., incorrect) confident peaks with a confidence threshold of 0.5

(d) Number of correct (resp., incorrect) confident peaks with a confidence threshold of 0.9

(e) Obtained by combining all corresponding NMA ensembles

Robustness with respect to structural noise

We call a structure-based assignment algorithm “robust” if its result does not change significantly when the input structure changes slightly. This is a reasonable definition of robustness because structural noise may introduce small perturbations in the input structure. To demonstrate the improved robustness of the assignment accuracies with the consensus and ‘confident’ assignment schemes, we chose 11 structurally similar starting models for human ubiquitin from the NMA ensemble computed around the template obtained from pdb ID 1H8C (which is 6.7 Å bb RMSD from human ubiquitin), and computed the assignment accuracies using the original NVR (Langmead and Donald 2004a) and our ensemble-based voting algorithm. For our approach, we report both the accuracy of the resulting assignments after voting, and the accuracy of the ‘confident’ assignments selected with a confidence threshold of 0.5. The ‘confident’ assignments comprise a significant (more than 35%) subset of all (peak, residue) assignments. Note that our approach requires constructing an ensemble around each of these 11 structural models.

The results are in Fig. 6. The x-axis corresponds to the method used: the first column shows the assignments using a single structural model, the second column the consensus assignment scheme, and the third column the ‘confident’ assignment scheme. The plot shows the range of the assignment accuracies and the lower and upper quartiles; the red line in the middle of each box is the median of the assignment accuracies. The whiskers show the extent of the data and the outliers are shown with ‘+’s. As can be seen, our approach not only improves the assignment accuracies, but also reduces the variance. Therefore, our ensemble-based assignment scheme improves the robustness of NVR with respect to structural noise.
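The summary statistics behind such a boxplot comparison (median, quartiles, variance over the starting models) can be computed with Python's standard library, as in the sketch below. The accuracy values are invented purely to exercise the code and are not the values plotted in Fig. 6; the function name `summarize` and the choice of the "inclusive" quartile method are likewise assumptions.

```python
import statistics

def summarize(accuracies):
    """Median, quartiles, and population variance of a list of accuracies (in %)."""
    q1, median, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "variance": statistics.pvariance(accuracies),
    }

# Hypothetical per-model accuracies (percent); NOT data from the paper.
single_structure = [40, 50, 60, 70, 80]
ensemble_voting = [68, 70, 72, 74, 76]

single_stats = summarize(single_structure)
voting_stats = summarize(ensemble_voting)
print(single_stats)
print(voting_stats)
# A method is more robust in the sense used above if its accuracies
# vary less across slightly different starting structures.
print(voting_stats["variance"] < single_stats["variance"])
```

Under the definition of robustness given above, a smaller variance across the perturbed starting models, together with a higher median, is what the comparison is meant to reveal.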
https://static-content.springer.com/image/art%3A10.1007%2Fs10858-008-9230-x/MediaObjects/10858_2008_9230_Fig6_HTML.gif
Fig. 6

Our ensemble-based voting scheme improves the robustness of NVR against structural noise, and increases the assignment accuracies. We show the distribution of the assignment accuracies with the single-structure based NVR and our ensemble-based voting scheme with NVR, for 11 starting structures (obtained along an individual normal mode) that are structurally similar, for human ubiquitin with 1H8C as template. The first column shows the distribution of the assignment accuracies with single-structure NVR. The second column shows the accuracy with our ensemble-based voting scheme, and the third column shows the assignment accuracy of the subset of ‘confident’ assignments (with a confidence threshold of 0.5). The boxplot shows the lower and upper quartile, and the median in red. The whiskers show the extent of the accuracy results and the outliers are shown with ‘+’ sign. There is only one outlier. The variance in assignment accuracies decreases with our algorithm, while the assignment accuracy increases

Application of our framework to MARS

We used our NMA ensemble-based voting scheme with MARS, an SBA tool that is significantly different from NVR in terms of the data it uses, as well as its algorithm. We tested the scheme on three target proteins with corresponding close structural templates (Table 3). We only considered the subset of assignments that are labeled as ‘reliable’ (‘H’igh and ‘M’edium reliability) by MARS. MARS calls an assignment ‘H’ighly reliable if it is consistent across all solutions obtained by MARS. According to Jung and Zweckstetter (2004c), an assignment of ‘M’edium reliability does not fulfill all criteria for ‘H’, and the criterion is adjusted automatically according to the completeness of the input data. We report the number of correct and incorrect reliable assignments with each template in Table 4. We find and report the confident subset of the assignments with a confidence threshold of 0.05 in order to automatically discard incorrect assignments made by individual models in the ensemble. As in Jung and Zweckstetter (2004b), we tested our framework with MARS on different sets of data, such as with and without sequential connectivity information, and by varying the number of RDCs per residue. We report the number of correct and incorrect assignments for the original template, for the template in the NMA ensemble that has the highest number of reliable assignments, and for the result of our voting scheme. Depending on the amount of data used as input, the number of correct and reliable assignments increases by 8 for EIN and by up to 6 for MBP, while the number of incorrect assignments decreases or stays constant in most cases. For human ubiquitin, we compute the assignments using pdb ID 1UBQ as template, both with and without sequential connectivity information, and using up to three RDCs per residue. The number of correct reliable assignments increases by up to 23 at the expense of introducing three incorrect assignments (corresponding to a 3-fold increase in the number of correct assignments). Note that for MARS, unlike NVR, the best model in the NMA ensemble also leads to improved assignment accuracies. For instance, for the ubiquitin target without sequential connectivity information and with 2 RDCs per residue, the number of correct assignments increases by 15 while the number of incorrect assignments increases by one, for the best model in the NMA ensemble. This corresponds to 96% assignment accuracy with more than twice the original number of assignments.
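As a quick check of the arithmetic quoted above, the figures can be reproduced from the ubiquitin row of Table 4 without sequential connectivity information and with two RDCs per residue, where the original model gives 11 (0), the best model 26 (1), and the NMA ensemble 34 (3) correct (incorrect) reliable assignments. The symbol names below are introduced only for this worked example.

```latex
\[
\text{accuracy}_{\text{best}} = \frac{26}{26 + 1} = \frac{26}{27} \approx 0.96,
\qquad
\frac{N_{\text{best}}}{N_{\text{orig}}} = \frac{26 + 1}{11 + 0} = \frac{27}{11} \approx 2.5,
\qquad
\frac{\text{correct}_{\text{ensemble}}}{\text{correct}_{\text{orig}}} = \frac{34}{11} \approx 3.1 .
\]
```

The first two ratios correspond to the 96% accuracy and the "more than twice" statement for the best model; the third corresponds to the 3-fold increase in correct assignments with the voting scheme.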
Table 3

Proteins used with MARS

| Target protein (a) | Template (crystal) structure (b) (PDB ID) | Sequence identity (c) (%) | # of residues | # of residues with data | BMRB code | RDCs (PDB ID) | bb (d) RMSD (Å) | CE (e) RMSD (Å) |
|---|---|---|---|---|---|---|---|---|
| Ubiquitin | 1UBQ | 100 | 76 | 72 | | 1D3Z | 0.7 | 0.5 (100) |
| EIN | 1ZYM | 100 | 259 | 248 | 4106 | 3EZA | 3.7 | 1.2 (94) |
| MBP | 1DMB | 100 | 370 | 335 | 4354 | 1EZP | 3.4 | 3.3 (99) |

(a) The target protein (source of the NMR data). This structure was not used by MARS. Instead, a template structure was used

(b) PDB ID for the template structure. Unlike NVR, these templates were not used in homology modeling, but directly used with MARS

(c) The sequence identity between the sequences of the target and template protein, as computed by CE (Shindyalov and Bourne 1998)

(d) Backbone (bb) RMSD between the template and the target protein structure

(e) Backbone (bb) RMSD of the structural alignment as computed by CE (Shindyalov and Bourne 1998). The percentage of the residues involved in the CE alignment is shown in parentheses

Table 4

MARS assignment accuracy improves with our NMA ensemble-based voting algorithm

The last three columns give the reliable assignments as # correct (# incorrect).

| Protein name | Template | RDCs | Chemical shifts for linking | Chemical shifts for matching | Original model (a) | Best model (b) | NMA ensemble (c) |
|---|---|---|---|---|---|---|---|
| Without sequential connectivity information | | | | | | | |
| Human ubiquitin | 1UBQ | 1DNH | | C′(i−1), Cα(i−1), Cβ(i−1) | 11 (0) | 16 (0) | 18 (0) |
| | | 1DNH, 1DNC′ | | C′(i−1), Cα(i−1), Cβ(i−1) | 11 (0) | 26 (1) | 34 (3) |
| | | 1DNH, 1DNC′, 1DCaC′ | | C′(i−1), Cα(i−1), Cβ(i−1) | 51 (3) | 51 (3) | 57 (1) |
| With sequential connectivity information | | | | | | | |
| Human ubiquitin | 1UBQ | 1DNH | Cα | C′(i−1), Cα(i−1), Cα(i) | 51 (0) | 67 (2) | 66 (3) |
| | | 1DNH, 1DNC′ | Cα | C′(i−1), Cα(i−1), Cα(i) | 70 (0) | 72 (0) | 70 (0) |
| | | 1DNH, 1DNC′, 1DCaC′ | Cα | C′(i−1), Cα(i−1), Cα(i) | 72 (0) | 72 (0) | 72 (0) |
| EIN | 1ZYM | 1DNH | Cα, Cβ | C(i−1), Cα(i−1), Cα(i), Cβ(i−1), Cβ(i) | 238 (2) | 244 (0) | 246 (2) |
| MBP | 1DMB | 1DNH | Cα, Cβ | C′(i−1), Cα(i−1), Cα(i), Cβ(i−1), Cβ(i) | 323 (2) | 328 (0) | 329 (1) |
| | | 1DNH, 1DNC′ | Cα, Cβ | C′(i−1), Cα(i−1), Cα(i), Cβ(i−1), Cβ(i) | 328 (0) | 331 (0) | 331 (0) |
| | | 1DNH, 1DNC′, 1DCaC′ | Cα, Cβ | C(i−1), Cα(i−1), Cα(i), Cβ(i−1), Cβ(i) | 327 (0) | 331 (0) | 331 (0) |

MARS links fragments of pseudoresidues together (in the “linking” stage) and then maps them to the amino acid sequence (in the “matching” stage). The chemical shifts used for linking and matching are listed

(a) The number of correct (resp., incorrect) reliable (denoted as ‘M’edium and ‘H’igh reliability in MARS) assignments returned by MARS for the original template

(b) The results for the structure in the NMA ensemble that has the highest number of reliable assignments, as returned by MARS

(c) The number of reliable and confident assignments with our ensemble-based voting scheme. We used a confidence threshold of 0.05

Discussion and conclusions

In this paper, we improved the assignment accuracy of NVR for distant structural models, and made it robust with respect to structural noise. On three different proteins with distant structural homologs, we obtained an increased assignment accuracy compared to the initial structural model in all cases but one, which used the template farthest from the target structure in our test set. Even in this case, combining the ensembles from both templates still increased the assignment accuracy. We also calculated a measure of confidence in the individual assignments, and used this measure to assign a subset of the peaks with even higher assignment accuracy. We further demonstrated the general applicability of our approach to SBA by improving the assignment accuracy of MARS, an SBA algorithm significantly different from NVR.

Given a distant structural homolog, our methodology used NMA to obtain a set of structural models, which were then provided as input to NVR. We combined the NVR assignments for each of these structural models by maximum bipartite matching. The percentage of structural models that agreed on a given assignment provided the confidence measure. We also showed (see Appendix) that MBM is a maximum likelihood estimator of the correct assignments.
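The aggregation step summarized above can be illustrated with a small sketch. It assumes that each (peak, residue) pair is weighted by the number of models that propose it and that the consensus is the one-to-one mapping with the largest total weight; the precise construction is given in the Methods, so this weighting is an assumption here. For brevity the sketch searches all mappings exhaustively, which is only feasible for toy inputs and is not how a practical implementation would solve the matching. The name `mbm_vote` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations

def mbm_vote(per_model, peaks, residues):
    """Consensus one-to-one assignment maximizing the total number of votes.

    per_model: list of dicts (one per structural model) mapping peak -> residue.
    Assumes len(residues) >= len(peaks). Exhaustive search: toy sizes only.
    """
    # Edge weight of (peak, residue) = number of models proposing that pair.
    votes = Counter()
    for assignment in per_model:
        for peak, residue in assignment.items():
            votes[(peak, residue)] += 1

    best_score, best = -1, None
    for chosen in permutations(residues, len(peaks)):
        score = sum(votes[(p, r)] for p, r in zip(peaks, chosen))
        if score > best_score:
            best_score, best = score, dict(zip(peaks, chosen))
    return best, best_score

# Toy ensemble of three models over three peaks and three residues.
models = [
    {"p1": "A", "p2": "B", "p3": "C"},
    {"p1": "A", "p2": "C", "p3": "B"},
    {"p1": "B", "p2": "A", "p3": "C"},
]
consensus, score = mbm_vote(models, ["p1", "p2", "p3"], ["A", "B", "C"])
print(consensus, score)  # {'p1': 'A', 'p2': 'B', 'p3': 'C'} 5
```

For realistic numbers of peaks, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the exhaustive loop; the consensus it returns is the same object, a one-to-one mapping of maximum total vote weight.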

The greatest improvement with our ensemble-based assignments comes when we do not have sequential connectivity information. Nevertheless, modest improvements are seen even with sequential connectivities. Even these modest improvements are potentially useful, and our results represent a significant improvement over all previous structure-based assignment algorithms (e.g., Hus et al. 2002; Meiler and Baker 2003) for distant structural homologs (as opposed to the exact crystal structure). No former SBA algorithm performs well using even slightly distant homologs. For instance, the algorithm of Hus et al. (2002) was tested only on the crystal structure. In Meiler and Baker (2003), assignment accuracies in the range of only 5% to 40% were obtained using ROSETTA models with 3–6 Å RMSD from the native structure. Our approach tests whether ensembles for assignments can begin to overcome this bottleneck, and forms a basis for SBA that can be improved in the future.

Our approach demonstrates that an ensemble of structures simulating the fluctuations of a protein in its native state improves the accuracy and robustness of SBA. Furthermore, our voting scheme reinforces the signal (the correct assignments), whereas the noise (the incorrect assignments) cancels out. This is supported by the fact that we obtain high assignment accuracies despite the large fluctuations in assignment accuracies across the ensembles. Therefore, NMA is useful for both molecular replacement (MR) in X-ray crystallography (Suhre and Sanejouand 2004b) and SBA in NMR (this paper). Note that our results with MARS show that the best structure in the NMA ensemble helps improve the assignment accuracy with respect to the starting template, analogous to Suhre and Sanejouand (2004b). However, unlike Suhre and Sanejouand (2004b), we also show that an entire ensemble is useful to improve the assignment accuracy with NVR.

It is interesting that our voting scheme obtains an assignment accuracy that is greater than or equal to the maximum assignment accuracy achieved by any individual structure in the NMA ensembles, both with MARS and NVR, for most target protein–template pairs. This suggests that our voting scheme is more likely to improve the assignment accuracy than any single-structure scoring function.

An analysis of our assignments reveals that the confident assignments (with a confidence threshold of 0.9), which have 95% or higher assignment accuracy, mostly fall into regular secondary structure elements. For ubiquitin, GαIP and SPG, 1/5, 3/40 and 1/11 of the confident assignments fall into loop regions, respectively. Furthermore, for GαIP, the sixth helix contains most of the correct assignments; similarly, most of the confident assignments of SPG are in its alpha helix. The secondary structure elements are the regions that are similar between the target and the template protein, and it is therefore expected that most of the correct assignments are found in those regions.

We envision three scenarios where our ensemble approach is useful. The first is for medium-sized proteins. One can perform a suite of triple-resonance experiments and use MARS with our ensemble method in order to improve MARS assignments, as was shown in this paper. Thus, we tested the hypothesis that SBA can be improved using ensembles, for medium-sized proteins. The second scenario is also for medium-sized proteins, but our NVR protocol requires only 15N-labeling and reduced spectrometer time. While RDCs must be measured, recent progress made by Tolman and co-workers (Ruan and Tolman 2005) makes it more convenient to find multiple alignment media for the proposed RDC measurements. Measurement of RDCs for small- to medium-sized proteins usually only needs 2D IPAP experiments, and thus can be done in less time. The third scenario is for large proteins, where one can hopefully collect chemical shifts, dNN’s, and RDCs (but other data might be hard to collect), and then use NVR with our ensemble-based technique. Since our algorithm requires only sparse data, this could make it less susceptible to the overlap problems that can occur with large proteins. Finally, since NVR requires only 15N-labeling, the cost of sample preparation is less for the last two scenarios.

Our approach should be valuable in pharmacology and drug design (Ferentz and Wagner 2000) by helping assign proteins for which there is no close structural homolog available. One could use our scheme to assign a subset of peaks with high confidence, and then do a few more disambiguating NMR experiments (e.g., using selective labeling) in order to assign the remaining peaks. Furthermore, it is possible to run the algorithm iteratively, holding the confident assignments found in the previous iteration fixed, to boost the number of peaks reported with high confidence. Our method is simple and general, and can be used with other SBA algorithms, such as MARS, to improve their accuracy and robustness.

Our approach has some similarities to previous work such as Jung and Zweckstetter’s (2004b) MARS and Meiler and Baker (2003). Both of these works obtain multiple assignments for a protein, and retain the subset of peak-residue assignments that are consistent across those assignments. The difference is in how the assignments are computed. Jung and Zweckstetter (2004b) modulate the predicted chemical shifts by adding Gaussian noise and run MARS on perturbed data to obtain new assignments. Meiler and Baker (2003) start from random assignments and then use Monte Carlo search to optimize them. In contrast, we compute an ensemble of structures using NMA, and then use each structure to calculate a new assignment. Of these three approaches, ours is the only one that simulates the likely equilibrium conformations assumed by the template protein. It also has an intuitive correspondence with the NMR ensemble that generated the experimental data. As shown in section "Application of our framework to MARS", our approach can be used with MARS; it is likely that it can also be used with other SBA algorithms, such as that of Meiler and Baker (2003).

As future work, we are interested in developing a single-structure scoring function that takes into account the dependencies between various sources of NMR data. This would allow us to choose the model from the ensemble that has the highest assignment accuracy. Secondly, other techniques that characterize the flexibility of protein structures, such as FRODA (Wells et al. 2005) or the protein ensemble method (Shehu et al. 2006), could be used and compared with NMA through the lens of SBA. Finally, NVR currently returns a single assignment for each template, even though there may be many assignments consistent with the structural model. Incorporating backtracking into the assignments as in Vitek et al. (2005) to obtain all consistent assignments could improve the accuracy and robustness.

Availability

The NVR software, as well as our scripts to run NVR on an ensemble of proteins and aggregate the results, are available upon request. The code is written in Matlab and Perl and is approximately 10K lines.

Our scripts to run MARS on an ensemble of templates and aggregate the results are less than 1K lines of code and are similarly available upon request.

Acknowledgments

We thank Drs. C. Bailey-Kellogg, P. Zhou, Mr. D. Keedy, Mr. J. MacMaster, Mr. C. Tripathy, Mr. A. Yan, Mr. M. Zeng and all members of the Donald Lab for discussions and comments. This work is supported by a grant to B.R.D. from the National Institutes of Health (R01 GM-65982).

Copyright information

© US Government 2008