CING: an integrated residue-based structure validation program suite
- 1.4k Downloads
We present a suite of programs, named CING for Common Interface for NMR Structure Generation that provides for a residue-based, integrated validation of the structural NMR ensemble in conjunction with the experimental restraints and other input data. External validation programs and new internal validation routines compare the NMR-derived models with empirical data, measured chemical shifts, distance- and dihedral restraints and the results are visualized in a dynamic Web 2.0 report. A red–orange–green score is used for residues and restraints to direct the user to those critiques that warrant further investigation. Overall green scores below ~20 % accompanied by red scores over ~50 % are strongly indicative of poorly modelled structures. The publically accessible, secure iCing webserver (https://nmr.le.ac.uk) allows individual users to upload the NMR data and run a CING validation analysis.
KeywordsNMR Structure validation PDB Errors Quality Protein structure
Nuclear Magnetic Resonance (NMR) Spectroscopy is the second most important tool for the structure determination of biomolecules at the atomic level. Approximately 12 % of all ~82,000 deposited biomolecular structures in the Worldwide Protein Databank, wwPDB (Berman et al. 2003) have been solved by NMR. This percentage increases to ~25 % if only unique folds are considered (Laskowski 2003). The steady increase in the number of biomolecular structures solved by NMR in the wwPDB originates both from the increased world-wide capacity, e.g. as result of the efforts of the structural genomics consortia, as well as from improved technology in several, often automated stages of the structure determination process itself (Güntert 2009; Rieping et al. 2007). For NMR, it has been reported that the data acquisition and data analysis, and the subsequent structure determination of a moderately-sized protein by NMR could take as little as 1–9 days (Liu et al. 2005).
The detection of a series of fraudulent X-ray structures at the end of 2009 brought the topic of validation back to the forefront (Baker et al. 2010). The PDB NMR data has only recently been fully remediated and validated (Doreleijers et al. 2009; Henrick et al. 2008) including the corrections to the hydrogen atom nomenclature and geometry (Doreleijers et al. 1999). CING presents the tools for authors, referees, and end-users to validate NMR structures in a comprehensive and integrated way.
For NMR, as with any other experimental technique, it is imperative that the transformation of experimental data into resulting structures occurs according to well-described, reproducible procedures. For high-resolution NMR, this transformation typically involves three steps: raw NMR data are first processed by Fourier transformation and peak picked. Next, the resulting spectral data, such as resonance frequencies, peaks, and fine-structure are converted into structural restraints. In a third step, a computational algorithm transforms these restraints into an ensemble of conformers. This latter step often involves a simulated annealing molecular dynamics calculation. NMR structure calculation programs such as AMBER (Case et al. 2005), CYANA (Güntert 2004), ARIA (Habeck et al. 2004) and Xplor-NIH (Schwieters et al. 2006) typically provide information regarding the agreement between the NMR ensemble and the experimental restraints, as well as some rudimentary information regarding the quality of the structure ensemble. A more detailed quality analysis is typically performed using external programs such as PROCHECK_NMR/AQUA (Laskowski et al. 1996), Molprobity (Lovell et al. 2003), WHAT_CHECK (Hooft et al. 1996b), and sometimes Model Quality Assessment Programs (MQAPs) (McGuffin 2007). More recently, a visual validation web server, called NMR Constraints Analyser, has been presented (Heller and Giorgetti 2010) that focuses on the validation of the distance restraints. Other programs such as PSVS (Bhattacharya et al. 2007), GLM (Bagaria et al. 2012) and ResProx (http://www.resprox.ca) have combined several of the common tools with their own specific checks.
Surveys by ourselves (Doreleijers et al. 1998; Hooft et al. 1996b; Nabuurs et al. 2005, 2006), and others e.g. (Bhattacharya et al. 2007; Snyder et al. 2005) indicated that the commonly accepted protocols in NMR for validation of the structure ensemble do not always detect misfolded structures or other serious problems. Our analyses even showed that a wrongly folded structure (PDB code 1tgq, now redeposited as 2b95) can be refined to such an extent that commonly reported NMR parameters for structural quality, such as restraint violations and Ramachandran plot scores, will not flag it as having serious problems (Nabuurs et al. 2006). Likewise, structure refinement against a single set of experimental residual dipolar couplings (RDCs) can also yield seemingly good structures without violations, even if the experimental data were completely randomized (Bax and Grishaev 2005). These examples illustrate the need for a more sophisticated NMR structure quality validation approach.
Our analysis of the obsolete PDB entry 1tgq indicated that an underlying cause of the problems was the neglect of the specific nature of the NMR data, which for the most part are highly local. As a result, the NMR ensemble can contain both well-resolved and problematic areas. Hence, the structural quality cannot be captured in a single parameter that describes an overall structure property, because such a value will be the average over the good and the bad parts. Instead, we advocate a residue-oriented approach to properly account for this variability. Here, we present a suite of programs based upon this philosophy, named CING (pronounced as ‘king’) for Common Interface for NMR Structure Generation that provides for a residue-based, integrated validation of the structure ensemble together with the experimental restraints and other data. The program is optimized for, but not limited to, NMR-derived biomolecular structures.
The CING validation will implement and follow the forthcoming wwPDB NMR validation task force recommendations. The iCing Virtual Machine web server provides for easy, anonymous access to the CING validation suite for both human users and automated submission by external programs. The server is available from the WeNMR Virtual Research Community’s web portal to the grid at http://wenmr.eu (Wassenaar et al. 2011) and via https://nmr.cmbi.ru.nl/icing/ or https://nmr.le.ac.uk.
CING program philosophy and design
The information derived from existing structure validation programs was studied to see how and to what extent this information was typically used and reported on by the NMR community (Markley et al. 1998). This analysis showed that existing programs suffered from a number of fundamental and practical drawbacks that strongly limited their usage. The practical drawbacks included (a) programs being out-dated and no longer actively developed, (b) not being suitable for NMR often because of their inability to adequately handle the multiple models contained in the NMR ensemble and (c) programs being difficult to run and install. At the more fundamental level, the knowledge obtained from the existing programs was typically hard to analyse as well as difficult to automatically integrate with the rest of the data; hence a combination of validation knowledge from different sources was generally not used. In addition, the direct relation to the experimental data often would be absent and essential functionalities for proper structure evaluation were missing.
External programs interfaced by CING
Vranken et al. (2005)
Laskowski et al. (1996)
Vranken et al. (2005)
Hooft et al. (1996a)
Koradi et al. (1996)
No longer supported
Laskowski et al. (1996)
DeLano and Bromberg (2004)
Nabuurs et al. (2003) (original)
Zhang et al. (2003)
Shen et al. (2009)
Rieping and Vranken (2010)
Doreleijers et al. (2005)
Hooft et al. (1996b)
Schwieters et al. (2006)
Joosten et al. (2011)
The CING software development uses the Google Code repository at http://code.google.com/p/cing. Elements of Extreme programming (Beck and Andres 2004) such as code review, and daily commits (1,185 revisions to date) are an essential part of the design and the daily maintenance. CING is mostly implemented in Python, an open-source, high-level, object-oriented, interpreted language (Lutz 2001; Millman and Aivazis 2011). A small fraction of the code is implemented in C using the cython Python-to-C interface because of speed reasons (Behnel et al. 2011). Two-dimensional graphics were implemented using the matplotlib library (http://matplotlib.sourceforge.net). Table 1 lists the software tools and external programs that have thus far been incorporated in CING code. CING is freely available at Google under a GNU Lesser General Public License. Virtual Machine images are available upon request from the authors.
The ability to accommodate a wide array of data types and formats is unfortunately still crucial for any NMR structure validation program. Relevant formats, such as those of CCPN and CYANA, can be handled by CING internally through the use of program-specific plugin converters (see Fig. 1 and described above). The CCPN data format captures an enormous variety of data and has been well-used and tested in many laboratories around the world. The on-line tool based on the CCPN FormatConverter for conversion to and from the CCPN format is actively maintained (Vranken et al. 2005) (available at: http://webapps.ccpn.ac.uk/fcweb). Hence, data in the CCPN format has our preference for interaction with CING.
Reference data organization
The CING program is inherently ‘NMR-aware’. The data related to the molecular topology of residues, reference chemical shifts of atoms, the notion of pseudoatoms, etc. resides in so-called database per-residue SML files. This reference database includes all common amino and nucleic acids, protonation variants, and several special entities such as water and ions. The reference data includes definitions for dihedral angles and atomic properties such as type, spin, and BioMagResBank (BMRB) derived average and standard deviations of chemical shift values (Markley et al. 2008). A new entry into this database is created on the fly when the CING importing routines encounter a non-canonical residue. This entry is then stored within the CING project. The nomenclature of CING atoms and residues follows IUPAC standards (Markley et al. 1998) with extensive conversion facilities for other nomenclatures, called ‘conventions’, such as CCPN, CYANA, X-PLOR, and their dialects. Alternative atom and residue names can be used as aliases.
ROG scoring criteria
%G ≤ 20 + %R
%G ≤ %R − 20
Propagated from residue entitiesb
Omega deviation (°)c
D1D2 plot (Z)d
Janin plot (Z)d
Propagated from CS entitiesb
Propagated from peak entitiesb
CS assignment (Z)g
Propagated from CS entitiesb
CS assignment (Z)h
Leucine side chaini
Propagated from DR entitiesb
Max. violation (Å)k,l
RMS violation (Å)l
Propagated from DH entitiesb
Max. violation (°)l
RMS violation (°)l
CING is integrated with the molecular graphics programs: YASARA (Joosten et al. 2011), PyMol (DeLano and Bromberg 2004), MOLMOL (Koradi et al. 1996) and JMol (Herráez 2006). CING can instruct each of these programs via macros to render properties such as the per-residue ROG score colouring onto the backbone.
iCing is a secure web portal (https://nmr.cmbi.ru.nl/icing/) to the CING server that allows users to validate their own data. The iCing web portal currently touts three input formats that can be used to submit complete projects (coordinates, experimental data, and restraints): CING, CCPN, and CYANA as well as the PDB format for importing a structure ensemble without additional data. The API to the CING-formatted data is described in the Google-code repository. iCing also serves as a interface to CING for third-party applications. An iCing robot allows for automated upload of project data and the return of XML formatted validation results. The iCing front-end is implemented using the Google Web Toolkit technology.
CING tools for experimental data analysis
Various potential assignment issues are evaluated, such as the presence of multiple assignments, missing assignments, and inconsistent pseudo-atom and/or stereo-specific assignments. In addition, the chemical shifts are compared to the BMRB-derived distributions and compared to the back-calculated values using the SHIFTX program (Zhang et al. 2003).
Peaks typically represent an abstracted stage of the experimental data and typically are neither deposited with the structure ensemble nor retained otherwise. CING stores and analyses peak entities for consistency with valid assignments.
The distance and dihedral angle restraints are validated to show the (RMS) violations in the ensemble and counts of models in which a violation occurs above the thresholds commonly used (lower-bound violations and 0.1, 0.3, and 0.5 Å for upper-bound violations of the distance restraints and 1, 3, and 5° for dihedral angle restraints). The distance restraints are checked for duplicates and are clustered into the following classes: intra-residual, sequential, medium-range (between 2 and 4 residues apart), long-range, or ambiguous.
A simplified and faster Python-based QUEEN (Nabuurs et al. 2003) implementation (called QUEENY) for residue-restraint information calculation was integrated within CING. It finishes well within a computer core minute for a 56 amino acid protein on regular hardware resources. The total per-residue restraint information is calculated and archived in the CING data structure.
CING tools for structure analyses
The CING package implements several tools that aid the analysis of the structural results.
2D dihedral angle combinations plots
Visualization of statistical preferences of dihedral angles provides information that can aid the assessment of the conformations in the structure ensemble. In CING, a large set of high-resolution X-ray structures were used to derive residue-specific statistical preferences for the dihedral angle combinations of Φ/Ψ (Ramachandran plot), χ1/χ2 [so-called Janin plot (Janin et al. 1978)] and the plot of the virtual dihedrals D1 and D2 (D1D2 plots) (see below). The reference dataset for the Ramachandran and Janin plots was based on the PDBSELECT database (v.2009-02-28) containing a set of 4,906 entries (5,135 chains and 88,540 residues) for which the R factor is <0.19 and the X-ray resolution better then 1.3 Å (Joosten et al. 2011). Background colouring in the Ramachandran and Janin plots was based on the DSSP classification (Hooft et al. 1996a) into helix (blue), sheet (yellow) and other (green). The colour changes linearly from white to e.g. yellow for densities from 2 to 20 %.
NMR typically generates ensembles of 20+ structures, so ‘outliers’ or spurious data present a serious problem for any measure that relies on averaging over the data points. CING uses an automated method based on Peirce’s criterion (Ross 2003) for outlier analysis and annotates the distributions of parameters, such as the dihedral angle distributions of individual residues. The models that fall outside the derived distribution are indicated in the text of the report and are colour coded in the corresponding plot. The original Peirce criterion code was extended to allow for analysis of larger data sets and limiting the number of outliers to be less than half of the complete set.
Range definition and superposition
For many validation criteria it is important to include only those residues that are in well-defined regions and many range definitions have been proposed in software tools such as: ARIA (Nilges et al. 1987), AQUA (Doreleijers et al. 1998), FindCore (Snyder and Montelione 2005), PDBstat (Bhattacharya et al. 2007), and UNIO (Guerry and Herrmann 2012). In CING, the range is assumed to be the full sequence when only one model is available or when the molecule contains no amino acids. The range includes those residues that have a Φ/Ψ dihedral angle circular variance (cv) of no more than 0.2. Consecutively, where the above selection caused a gap of four residues or less, those residues are reintroduced into the range. Short fragments of four residues or less are then omitted from the range. When this procedure results in an empty range, then the range is reset to all residues. Upon user request, CING can use range definitions that are based on chemical-shift derived order parameters (Berjanskii and Wishart 2008).
Using either user-specified or automatically determined ranges, the different conformers can be superimposed using backbone only or all heavy-atom selections. The RMSD to the average structure is then determined and the conformer closest to this average is reported.
Salt bridges cannot directly be inferred from experimental NMR data and therefore are established indirectly from the analysis of the coordinate data. Due to the sparse density of protons surrounding a typical salt bridge, the exact geometry often is not observed in every model of the ensemble. CING employs a classification proposed by (Kumar and Nussinov 2002) and reports on all combinations of potential salt bridge forming residue pairs (any Arg/Lys with any Glu/Asp).
CING integration with external programs
By means of plugins (vide supra), CING integrates the analysis results of the external programs (cf. Table 1), with the most important ones detailed below.
Even though the software package PROCHECK_NMR (Laskowski et al. 1996), like PROCHECK (PC), is no longer actively maintained, it has for a long time been the de facto validation standard for NMR spectroscopists. Even today, most papers quote the percentages of residues in the various regions of the PC Ramachandran plot. CING reports these numbers in its Summary page. The PC residue-specific G-factor is used in the CING ROG residues scores (cf. Table 2).
The software package WHAT IF contains an extensive subsection dedicated for structure validation which is available free of charge for academia under the name WHAT_CHECK (WC). A good introduction to the different WC checks is available at: http://swift.cmbi.ru.nl/gv/whatcheck. We have adapted the WC per-model and per-residue analyses results to ensemble properties suitable for validating an NMR ensemble.
CING reports on the overall WC scores on its Summary page and the following residue-specific WC properties on the Molecule page: (a) packing quality, Ramachandran, backbone normality, χ1, χ2 rotamers (Janin), χ1 rotamer (WC codes: QUA, RAM, BBC, C12, ROT, respectively), (b) bond lengths, bond angles, 2nd generation packing quality, protein side chain planarities, connections to aromatic rings, side chain planarity with hydrogens attached (BND, ANG, NQA, PLN, PL2, PL3, respectively), and (c) bumps, relative accessibility, accessibility, flip HIS GLN ASN hydrogen-bonds, torsion angle (BMP, ACCLST, INO, FLP, CHI, respectively). Only the Ramachandran and Janin WC Z-scores are used for ROG scoring in CING (cf. Table 2).
The secondary structure elements were identified in each model of the ensemble by DSSP (Kabsch and Sander 1983) currently maintained in our laboratory (Joosten et al. 2011). The DSSP codes are collapsed in CING to three states of helix (DSSP: 3/H), sheet (DSSP: B/E), and other (all other DSSP codes). The state having the largest fraction of the three states in the ensemble is used as the overall consensus state.
The inter-residue NOE distance restraint completeness up to 4 Å is analysed using Wattos (Doreleijers et al. 2005). The counts of observable atoms per residue and the expected, observed, and matched distance restraints are plotted for reference. The overall NOE completeness statistics are presented in the CING summary.
The two key concepts of the CING approach are the notions of a ‘Project’ and a residue-based analysis philosophy encompassing both experimental data and structure results. The project constitutes the complete collection of peaks, assignments, a molecule with chains, residues and atoms, all restraints, the coordinates of the structure ensemble and all the results of the validation routines and programs. These elements are linked according to the logical relationships that exist between them. For example, the project links to lists of restraints; each of these restraints links to its validation results as well as to the specific atoms involved (e.g. for a NOE), which in turn links to a residue, which links to a molecule. The reverse links are also modelled: project to molecule to chain to residue to atom etc. It is the presence of such linkages that makes the implementation of specific tasks or tests much simpler when compared to implementing them from scratch. The concept of a Project also allows for an easy connection between information originating from different programs.
The CING report consists of a collection of HTML/Web 2.0 pages that reflect the hierarchy of the project and the links between the entities. Thus, the pages provide for easy navigation between structure and data elements and all entities within the CING project are coloured according to their three-way ROG-assessment.
NMR structures often contain unstructured regions. A crucial aspect of any validation therefore concerns the decision which residues of the biomolecule to include in the assessments. To test the range-selection criterion, we selected from 9,300 NMR NRG-CING entries (Doreleijers et al. 2011) those entries of at least 30 amino acids and 10 structure models. This yielded 6,460 entries encompassing 7,735 chains, 10,088 segments and 624,958 residues. Using the CING circular variance-based range selection analysis (described above), overall only 13 % of the residues are excluded. The mean ordered segment was 61.9 residues long and there were on average 1.3 segments per polypeptide chain. The average number of segments per chain is considerably reduced from the 3.2 that we previously found using a simple window averaging scheme (Doreleijers et al. 1998). This new procedure produces similar results as the consensus procedure of PSVS (Bhattacharya et al. 2007) used in the 2010 CASD-NMR assessment (Rosato et al. 2012). The CING-derived ranges of the CASD-NMR targets include a total of 921 residues of which 75 residues were excluded by PSVS (data not shown). Conversely, there are only 5 residues that are not in the CING derived ranges. For example, the CING range for PDB entry 2kpm (unpublished) is [10–98]. It includes 29 amino acids at the termini that were excluded in the CASD-NMR range [23–82]. The excluded residues 83–98 display low backbone flexibility (c.v. of 0.009) and 44.8 distance restraints per residue on average, suggesting that they could and perhaps should have been validated.
Example CING report
The recently submitted PDB entry 2kq3 comprises the NMR-derived structure ensemble (20 models of 140 amino acids) of the monomeric and very well-studied staphylococcal nuclease protein. The NMR ensemble was obtained using 2,089 distance restraints, 64 hydrogen-bond restraints and 147 dihedral restraints (Wang et al. 2010). The ensemble and its experimental dataset are typical for NMR-derived proteins of this size; it was chosen at random from a set of entries with similar properties.
Figure 2 shows parts of the CING report for 2kq3. The full CING analysis of 2kq3, including many more figures than the ones displayed in this manuscript, can be obtained from http://nmr.cmbi.ru.nl/NRG-CING/data/kq/2kq3/2kq3.cing. The Project page (cf. Fig. 2a) is the starting point for the report and shows a first impression of the monomeric protein with the beta sheet and alpha helices as well as the colour-coded entities to the left.
The CING Summary page reports on the overall CING, WHAT_CHECK and PROCHECK scores as well as the structural variation analysis (data not shown). Of the 140 residues of the polypeptide, the CING analysis identified 122 residues to be structured and excluded the four disordered N-terminal residues and an unstructured loop for residues 42–55 (top panel of Fig. 2c). The structured residues have an RMSD to the mean of 0.96 ± 0.21 Å for backbone atoms only. The values reported by Wang et al. on the basis of secondary structure are considerable less, i.e. 0.32 ± 0.07 Å (Wang et al., Supplementary Table 1), suggesting a much tighter bundle. The CING Summary page also lists overall WHAT_CHECK and PROCHECK results; Wang et al. also reported the latter (Supplementary Table 1), and comparison between the CING (78/19/3/0%) and original (72/23/3/2%) analysis show that the overall PROCHECK scores are similar. The average WHAT_CHECK Ramachandran and rotamer normality scores (−5.1 and −7.1) reported by CING are however strongly indicative of poor conformations. The Summary page reports that 55 out of the 122 structured residues (45 %) have been flagged ‘red’ for the CING ROG score and 46 residues (38 %) orange. When the ROG scores are mapped onto the structure (Fig. 2b), they indicate that the problems and the warnings encompass nearly the whole protein. The RECOORD protocol for recalculation and subsequent refinement in water results in a much improved ensemble for many PDB entries (Nederveen et al. 2005). After applying the protocol to this entry, the number of residues marked red dropped from 55 to 28 % and the WHAT_CHECK Ramachandran and rotamer normality scores improved from −5.1 and −7.1 to −2.6 and −4.2, respectively. The core backbone atom pairwise RMSDs between the original and recalculated ensembles is 1.5 ± 0.2 which is higher than the variance within the recalculated ensemble (1.2 ± 0.3), indicating a significant structural adjustments. Wang et al. were not available for comments on these findings.
Oddly, the Ramachandran plot of Leu14 (Fig. 3b) shows all conformers of the ensemble to cluster near a single ψ value of ~109°, which appears to be dictated by the dihedral restraint for this angle (indicated by the transparent orange box). The experimental Φ,Ψ values of the conformers are also close to the edge of the commonly observed conformations for Leucine, resulting in relatively poor WHAT_CHECK score and PROCHECK G-factor. The Janin plot (Fig. 3c) shows the side chain of Leu14 to be tightly restricted in a staggered χ1 conformation, which also appears unlikely from the database reference distribution. Finally, the D1D2 plot shows the conformation of Leu14 with respect to its previous and next residue to be in the extended conformation, albeit at the edge of what is commonly observed. Hence, the ‘red’ ROG score for Leu14 originates from this set of poor conformations.
Leucine 14 is part of a stretch of poorly modelled residues in this first β-strand of staphylococcal Nuclease, and only Ala17 in this strand has a green ROG score. For example, the backbone dihedral angles of Asp19 cluster in a very unfavourable region of the Ramachandran plot (Supplementary Figure S1). This conformation also conflicts with the Φ,Ψ dihedral restraint region derived by the CING Talos+ analysis (shown as an open box in Supplementary Figure S1), on the basis of the experimental chemical shift data. Potentially, the poor conformations of Leu14, and the directly following troublesome β-bulge residues Leu15 and Lys16, are the result of a set of smaller but propagating and reinforcing errors. The analysis of the conformation of residues and identification of potential problems can conveniently be done using the ‘Dihedral plots per residue’ page, which displays the relevant plot of all residues sequentially, in one scrollable interface (Supplementary Figure S2).
The agreement between the restraints and the resulting structure ensemble is a commonly calculated parameter for NMR structures. The numbers of restraints are tabulated and their RMSD and other violation statistics are typically reported. CING also reports the results of such a full restraint analysis.
Chemical shift validation
The Atom page reports on CING’s comparisons of the experimental chemical shifts, after VASCO re-referencing (Rieping and Vranken 2010), with both the BMRB database values and the back-calculated values from the coordinates using the SHIFTX software (Zhang et al. 2003). A report for PDB entry 2kq3 and corresponding BMRB entry 16585 is shown in Fig. 4b. A small 0.07 ppm offset was determined by VASCO for the 1H nuclei and applied automatically. 13C and 15N corrections were minor, i.e. −0.03 and 0.1, respectively and not applied because the corrections were less than three times the uncertainty. No assignment consistency issues were detected by CING.
The comparison of the experimental values with the database values is the basis for detecting outliers for ROG scoring. For 2kq3 CING identifies nine instances in which the experimental values deviate by more than 3 standard deviations from their database averages (cf. Fig. 4b). As an example, one of the most extreme cases involves the Hδ22 side chain atom of Asn100, which is listed to resonate at 9.65, 2.5 ppm (4.5 standard deviations) higher then its average value. The experimental value is also 1.9 ppm higher than its SHIFTX predicted value. Whereas the Hδ12 atom is involved in two long-range restraints to the amide protons of Ile92 and Leu37, no restraints involving the Hδ22 atom are included in the restraint set. Intra-residue or sequential restraints involving these atoms are also not present in the dataset. The χ1/χ2 side chain angle value-pairs of Asn100 cluster in two, non-ideal regions of the Janin plot and the Hδ21 and Hδ22 are packed in a hydrophobic environment. Many NOEs, also involving the Hδ22 atom, should have been observable in such a conformation and thus should have resulted in restraints. In all, the extreme Hδ22 experimental chemical shift value and the combined pattern of distance restraints and structural conformation, strongly suggests an erroneous assignment for one or more of the atoms of Asn100.
As an example of the salt bridge detecting functionality of CING, a water-refined ensemble of structures of the second domain (CBD2) of NCX under Ca2+-free conditions (PDB entry 2kls) was examined (Hilge et al. 2009). In the apo structure, three basic residues in CBD2, Arg547, Lys583, and Lys585, are well positioned to form salt bridges with Asp552, Asp578, Glu580, and Glu582. For example, the CING analysis showed Lys585 to engage in an interaction with Asp552, classified as salt bridge in 18 out of 20 models. In addition, electrostatic interactions of Lys585 to Asp578 and Glu582 were identified as salt bridges (7/8 out of 20 models, respectively) or ionic interactions (10/11 out of 20 models). These now unambiguously identified electrostatic interactions were previously shown to be crucial, as they partly stabilize some of the negative charges resulting from the release of Ca2+ and therefore prevent unfolding (Hilge et al. 2006).
Comparison between CING and PROCHECK_NMR
Analysis of the data indicates that a significant number of entries combine high PROCHECK_NMR scores (>75 %) with low ROG green scores, which flags these as highly suspect. The entry 2kq3 (critiqued above) also combines a 78 % PROCHECK_NMR score with a very low 17 % ROG green score. Conversely, there are almost no entries in the top left corner of the plot that would be entries found to be problematic by PROCHECK_NMR but not by CING (top left corner of the plot), which is to be expected as the CING ROG score incorporates the residue-specific PROCHECK_NMR G-factor.
A residue is a natural concept for NMR-based structure determination. NMR assignment strategies are almost exclusively residue-based. NMR related parameters are residue type dependent and the local nature of the NMR-derived restraints also correlates well with a residue-based approach. Structural properties can also be conveniently summarized at the residue level. We previously showed that structurally bad regions are masked when using overall validation parameters (Nabuurs et al. 2006), which can be circumvented by a residue-based approach.
The 2kq3 structure was chosen as a typical representative of an NMR structure in terms of molecular size and experimental restraints. Distance and dihedral restraints still constitute the input data for the majority of the NMR ensembles, including entry 2kq3, while RDC restraints have only been deposited for only 499 of the >9,000 NMR-derived structures (Doreleijers et al. 2011). The 2kq3 entry has a high percentage of residues that CING flags as ‘red’ or ‘orange’ (Fig. 2b), as a result of many uncommon backbone or side chain dihedral angles (Fig. 3; Supplementary Figures S1, S2). The ROG scores combines the analysis of several tools (cf. Table 2) and is effective in flagging problematic regions. Comparison of the dihedral restraints deposited by the authors and the Φ/Ψ dihedral restraint region derived by the CING Talos+ analysis, shows the former to be significantly more restricted and often to result in suspect Φ/Ψ angle distributions. In addition, the overall information content appears low on the basis of the Wattos NOE completeness criterion and we identified several suspect assignments (Fig. 4). These problems warrant a careful inspection of the original data and derived restraints and we established that a refinement of the structure ensemble using an extended force field that included electrostatics and water yields better results.
The CING analysis is sensitive to local problems and ill-refined structures rapidly result in large numbers of red or orange flagged residues. However, in absence of any gross restraint errors and using a proper water-refinement protocol (Spronk et al. 2002), the local errors can be remedied readily and the green scores improved. For example, the pre-water refined NCX3-CBD2-B ensemble of structures obtained from CYANA calculations yielded ROG scores of 53/30/17%, but improved to 23/31/46% after refinement in explicit solvent using the YASARA YAMER force field (Breukels et al. 2012). Similar improvements in structure quality were obtained in the DRESS (Nabuurs et al. 2004) and RECOORD (Nederveen et al. 2005) databases of recalculated and refined NMR structures.
In our experience, based on spot checking dozens of entries in NRG-CING, a properly refined ensemble of structures that has low green ROG (<~20 %) and high red ROG (>~50 %) scores can be labelled as highly troublesome. Entry 2kq3 showed numerous issues, in spite of its relatively high PROCHECK_NMR scores, which illustrates the latter to be a less reliable indicator of structure quality. Multiple parameters, as implemented in the CING ROG scores, appear more sensitive to identification of potentially problematic structures. We have not manually examined all entries that combine high PROCHECK_NMR scores with low ROG green scores (cf. Fig. 5), but for those entries that were examined clear problems could typically be identified.
It is important to keep in mind that a substantial part of the CING analysis and resulting ROG scores is based upon comparisons with database-derived properties. Particular features of a structure not present in the reference databases will therefore be flagged as red or orange, in spite of these potentially being correct. However, given the now extensive nature of the structure database, such occurrences are very rare and should be treated with extreme caution. Examples of these are the inclusion of unusual amino acids or chemical modification of residues.
In the case of the now retracted entry 1tgq, which prompted the development of the CING suite, the analysis clearly shows major problems (ROG scores 54/30/16%) (Nabuurs et al. 2006). In addition, back-calculated chemical shifts (data not shown) readily identify the troublesome kinked α-helical region and the errors for β-strand 3. In contrast, the correct 1y4o structure (Song et al. 2005) displays normal ROG scores (16/27/57%).
One category of ill-folded structures that are not recognised by CING are those derived using the chemical-shift based ROSETTA protocol (Shen et al. 2008) or its variants. As the CS-ROSETTA protocol samples from a structure database and uses chemical shift matching for fragment selection, it optimises the two main criteria that CING uses for identification of problems. It is therefore not surprising that analysis of the automated structure calculation efforts of CASD-NMR (Rosato et al. 2009) showed the CING routines unable to identify the incorrectly folded CS-ROSETTA derived structures (Rosato et al. 2012).
In general, a proper validation assessment should be based on information that is not used to calculate the structure ensemble. Such cross-validation procedures are now feasible for most high-resolution NMR derived structures because the information content of the restraints used for typical structure calculations has greatly increased over the past decade, and part of them could be left out to validate the results.
This paper describes an integrated residue-based approach for NMR structure validation, yielding validation reports for authors, referees, and end-users. The intuitive red-orange-green set of residue-based critiques directs the attention to specific parts of the structure in need of manual verification. The iCing server and the CCPN analysis program allow for straightforward upload and initial visualization of the validation results, enabling individual users to test their structure ensemble prior to submission to the PDB and its reporting in a manuscript. The CING validation suite was developed for, but is not limited to, NMR-derived protein structures. Oligonucleotide and X-ray structures can also easily be examined, albeit that no experimental X-ray data can be validated for the latter. The CING suite will continue to evolve, also in response to recommendations put forward by the wwPDB NMR validation taskforce.
Finally, the Journal of Biomolecular NMR like many journals, requires authors of new structure papers to deposit the coordinate and experimental data in accordance to the IUPAC guidelines (Markley et al. 1998). However, referees are not usually provided with any external validation report on those coordinates. It would be of great value to authors and referees to have the CING reports available as part of their submissions.
We enjoyed the advice and constructive criticism from Karen Berntsen, Vincent Breukels, and Wouter Touw. We thank the authors of all external programs and algorithms used through CING for their important efforts and willingness to let us use and sometimes modify their code. We acknowledge Wilmar Teunissen for system administration. E. K. would like to thank the users of the YASARA molecular modeling program for financing his work. This work was supported by the Netherlands Organization for Scientific Research (NWO) grants 700.55.443 and 036.001.914 and BBSRC grant BB/K002015/1 to GWV, Netherlands Bioinformatics Centre (NBIC); EU FP6 grants STREP Extend-NMR LSHG-CT-2005-018988, and FP7 WeNMR grant 261572 to JFD, GV, and GWV; BBSRC grant BB/D006384/1 to AWSS; NWO grant 700.58.410 to SBN; BBSRC grant BB/H004130/1 to TJS; Brussels Institute for Research and Innovation (Innoviris) grant BB2B 2010-1-12 to WFV.
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
- Beck K, Andres C (2004) Extreme programming explained: embrace change, 2nd edn. Addison-Wesley, BostonGoogle Scholar
- Breukels V, Touw W, Vuister GW (2012) NMR structure note: solution structure of Ca2+ binding domain 2B of the third isoform of the Na+/Ca2+ exchanger. J Biomol NMR 54:115–121Google Scholar
- DeLano W, Bromberg S (2004) PyMOL user guide. DeLano Scientific, San Carlos. http://pymol.sourceforge.net/newman/userman.pdf
- Doreleijers JF, Vriend G, Raves ML, Kaptein R (1999) Validation of nuclear magnetic resonance structures of proteins and nucleic acids: hydrogen geometry and nomenclature. Proteins Struct Funct Bioinformatics 37(3):404–416. doi:10.1002/(SICI)1097-0134(19991115)37:3<404:AID-PROT8>3.0.CO;2-2 CrossRefGoogle Scholar
- Doreleijers JF, Nederveen AJ, Vranken W, Lin J, Bonvin AMJJ, Kaptein R, Markley JL et al (2005) BioMagResBank databases DOCR and FRED containing converted and filtered sets of experimental NMR restraints and coordinates from over 500 protein PDB structures. J Biomol NMR 32(1):1–12. doi:10.1007/s10858-005-2195-0 CrossRefGoogle Scholar
- Kumar S, Nussinov R (2002) Close-range electrostatic interactions in proteins. Chembiochem: Eur J Chem Biol 3(7):604–617. doi:10.1002/1439-7633(20020703)3:7<604:AID-CBIC604>3.0.CO;2-X CrossRefGoogle Scholar
- Nederveen AJ, Doreleijers JF, Vranken W, Miller Z, Spronk CAEM, Nabuurs SB, Güntert P et al (2005) RECOORD: a recalculated coordinate database of 500+ proteins from the PDB using restraints from the BioMagResBank. Proteins Struct Funct Bioinformatics 59(4):662–672. doi:10.1002/prot.20408 CrossRefGoogle Scholar
- Ross S (2003) Peirce’s criterion for the elimination of suspect experimental data. J Eng Technol 20(2):38–41Google Scholar
- Wassenaar TA, van Dijk M, Loureiro-Ferreira N, van der Schot G, de Vries SJ, Schmitz C, van der Zwan J, Boelens R, Giachetti A, Ferella L, Rosato A, Bertini I, Herrmann T, Jonker HRA, Bagaria A, Jaravine V, Güntert P, Schwalbe H, Vranken WF, Doreleijers JF, Vriend G, Vuister GW, Franke D, Kikhney A, Svergun DI, Fogh R, Ionides J, Laue ED, Spronk CAE, Verlato A, Badoer S, Dal Pra S, Mazzucato M, Frizziero E, Bonvin AMJJ (2011) WeNMR: structural biology on the grid. IWSG proceedings, urn:nbn:de:0074-819-5Google Scholar