Straightforward and complete deposition of NMR data to the PDBe
We present a suite of software for the complete and easy deposition of NMR data to the PDB and BMRB. This suite uses the CCPN framework and introduces a freely downloadable, graphical desktop application called CcpNmr Entry Completion Interface (ECI) for the secure editing of experimental information and associated datasets through the lifetime of an NMR project. CCPN projects can be created within the CcpNmr Analysis software or by importing existing NMR data files using the CcpNmr FormatConverter. After further data entry and checking with the ECI, the project can then be rapidly deposited to the PDBe using AutoDep, or exported as a complete deposition NMR-STAR file. In full CCPN projects created with ECI, it is straightforward to select chemical shift lists, restraint data sets, structural ensembles and all relevant associated experimental collection details, all of which are or will become mandatory when depositing to the PDB. Instructions and download information for the ECI are available from the PDBe web site at http://www.ebi.ac.uk/pdbe/nmr/deposition/eci.html.
Keywords: Database deposition; CCPN; wwPDB; Structure calculation; Structure validation; NMR-STAR
Public databases that archive scientific data hold a crucial record of experimental information, especially in relation to associated publications. In the field of structural biology, the Protein Data Bank (PDB; Berman et al. 2000) has been storing three-dimensional structural data of mainly proteins, DNA and RNA since 1971. The predominant experimental techniques to determine these structures are X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy. In recent years, the worldwide PDB (wwPDB; Berman et al. 2007), the organisation that manages the PDB, has begun to require the mandatory deposition of an increasing amount of experimental data, associated parameters and meta-data. For structures determined by X-ray crystallography, the deposition of structure factors has been mandatory since 2008. In NMR, where data management is typically more complex because of the variety of data that can be obtained and the general lack of consistent data formats, deposition of the restraints used to calculate the structure has been mandatory since 2008 as well (Markley et al. 2008). The deposition of chemical shifts and associated referencing information will become mandatory during 2010. Furthermore, it is not inconceivable that additional types of data and information will become mandatory for deposition at the PDB in the future.
Whilst this additional NMR data is of increasing value to a varied group of researchers, for example in chemical shift-based structure calculation (Cavalli et al. 2007; Wishart et al. 2008; Shen et al. 2008), structure recalculation efforts (Nederveen et al. 2005) and large-scale data analyses (Vranken 2007; Vranken and Rieping 2009), it has also made the deposition of NMR data more complicated. For deposition of macromolecular NMR data to the PDB, there are currently two web-based options. The first is AutoDep (Sen et al. 2007), which is hosted by the Protein Data Bank in Europe (PDBe; Velankar et al. 2010) group at the European Bioinformatics Institute (EBI). This allows users to submit coordinates, NMR-derived structure restraints and other NMR data items such as chemical shifts or peak lists along with associated information such as authors, citation references, molecule/sequence data, sample and experimental information. The second option is ADIT-NMR at the BioMagResBank (BMRB, University of Wisconsin and PDBj-BMRB, Osaka, Japan; Ulrich et al. 2008), which allows submitters to deposit similar information, but with a different web-based input tool. The main disadvantage of web submission is that it can be slow and cumbersome to fill in the forms and to locate and upload all the necessary data and files.
These developments ensure higher quality of the deposited data because the depositor either uses the internally consistent CCPN framework while analysing data, or is interactively involved in resolving issues with ambiguous data when creating a CCPN project for deposition only from existing NMR data formats. They also greatly reduce the need for annotator intervention during deposition, because the depositor ensures the consistency of the incoming data pre-deposition. This is important for the future of the wwPDB, as annotation staff numbers remain constant while the number of structure depositions increases and the deposition of additional experimental data becomes mandatory. The tools for deposition of NMR structures and related data described here thus help to maintain the quality of the PDB archive and uphold current response times to depositors.
All the code is written in Python and uses the CCPN API (Application Programming Interface) libraries to read CCPN XML (eXtensible Markup Language) project files into Python objects (Fogh et al. 2010). Reading and writing of external data files (for example, coordinates from PDB files, or NMR restraints files from various formats) are performed using FormatConverter libraries (Vranken et al. 2005). Reading of NMR-STAR files is also done using FormatConverter libraries, which have been extended as part of this work, to import all data found in NMR-STAR 3.1 files and the header information from PDB files. For export of NMR-STAR files from completed CCPN projects, we have developed new code that uses a Python-based dictionary (Ccpn_To_NmrStar.py) and associated parser (NmrStarExport.py). For each NMR-STAR data value (tag), the dictionary specifies a CCPN data value (attribute) or in more complex cases a Python subroutine that will provide the relevant data. The dictionary controls looping over CCPN data objects where there are multiple values (e.g., chemical shifts) to be written out in an NMR-STAR data values table, and can handle further complications in the export process; for example, when data for a single NMR-STAR data category (saveframe) must be extracted from several different types of CCPN objects. It is possible to define mappings between specific CCPN framework releases and NMR-STAR versions in the NMR-STAR export framework; it can thus maintain point-to-point compatibility between previously mapped CCPN releases and NMR-STAR versions, and this setup is essential for compatibility testing between new CCPN and NMR-STAR versions before they are released.
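The dictionary-driven export described above can be illustrated with a minimal sketch. This is not the code in Ccpn_To_NmrStar.py or NmrStarExport.py: the `Shift` and `ShiftList` classes are hypothetical stand-ins for CCPN objects, the function names are invented for illustration, and the tag names merely follow NMR-STAR 3.1 naming. The sketch only shows the idea that each tag maps either to an attribute name or to a subroutine, and that the mapping drives the loop over objects with multiple values.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for CCPN data objects; the real CCPN API differs.
@dataclass
class Shift:
    atom_name: str
    value: float
    error: float = 0.0


@dataclass
class ShiftList:
    name: str
    shifts: list = field(default_factory=list)


# Each NMR-STAR tag maps to an attribute name (str) or, in more complex
# cases, to a function that derives the value from the object.
SHIFT_LOOP_MAP = {
    "_Atom_chem_shift.Atom_ID": "atom_name",
    "_Atom_chem_shift.Val": lambda s: f"{s.value:.3f}",
    "_Atom_chem_shift.Val_err": lambda s: f"{s.error:.3f}",
}


def resolve(obj, source):
    """Return the value for one tag: call a function or read an attribute."""
    return source(obj) if callable(source) else getattr(obj, source)


def export_loop(objects, tag_map):
    """Build one table of values: one row per object, one column per tag."""
    tags = list(tag_map)
    rows = [[str(resolve(obj, tag_map[tag])) for tag in tags] for obj in objects]
    return tags, rows


shift_list = ShiftList("shifts_1", [Shift("HA", 4.321, 0.01), Shift("CA", 56.2)])
tags, rows = export_loop(shift_list.shifts, SHIFT_LOOP_MAP)
for row in rows:
    print("  ".join(row))
```

Keeping the mapping as data rather than as hard-coded export logic is what would allow a separate mapping to be kept for each pairing of CCPN release and NMR-STAR version.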
Two main Python scripts handle CCPN projects as part of the AutoDep web interface. The first script (Ccpn2Autodep.py) converts the CCPN data into AutoDep XML files. It also automatically exports structures in PDB format and distance restraints in CNS format when the CCPN project is uploaded. After the data has been deposited and curated, a second script (Autodep2Ccpn.py) reads the final AutoDep XML file and the curated PDB file to identify any new data that was added or information that was updated during deposition or curation, which can then be updated in the original CCPN project.
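The update step performed after curation can be pictured as a comparison between the originally deposited values and the curated ones. The sketch below is an assumption-laden illustration, not the logic of Autodep2Ccpn.py: it treats both records as flat dictionaries with hypothetical keys, and the accession value is a placeholder. It reports values that were added or changed; values removed during curation are not handled.

```python
def find_updates(original, curated):
    """Return {key: (old, new)} for values added or changed during curation."""
    updates = {}
    for key, new_value in curated.items():
        old_value = original.get(key)  # None if the key was not deposited
        if old_value != new_value:
            updates[key] = (old_value, new_value)
    return updates


# Hypothetical flat records; real AutoDep XML and CCPN projects are nested.
original = {"title": "Solution structure of protein X", "ph": "6.5"}
curated = {
    "title": "Solution structure of protein X",
    "ph": "6.8",
    "accession": "XXXX",  # placeholder, not a real accession code
}

updates = find_updates(original, curated)

# Apply the curated values back to the original record.
original.update({key: new for key, (old, new) in updates.items()})
```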
The deposition pipeline described here has been tested on 102 real projects received at the PDBe over the last five years, including projects with complicated molecular systems (Table S1), and covering a wide variety of NMR data (Table S2). Note that the FormatConverter covers a much wider range of software (http://www.ebi.ac.uk/pdbe/nmr/software/formatConverterIOTable.html).
CcpNmr software is released in two versions, both of which include ECI; they can be downloaded from: http://www.ebi.ac.uk/pdbe/nmr/deposition/eci.getting_started.downloads.html. The releases have different advantages from a deposition point of view: the full release includes CcpNmr Analysis, providing greatly enhanced visualisation and analysis capabilities, and is fully supported on 32- and 64-bit Linux, Mac OSX Intel/PPC and Windows. The FormatConverter-only release is essentially platform-independent as it only requires the widely available Python and Tcl/Tk packages.
Figure 1 shows an overview of the deposition system described here. The individual components are described below. Tutorials and detailed help for each component can be found on the PDBe web site (http://www.ebi.ac.uk/pdbe/nmr/deposition/).
Creating CCPN projects
The investigator typically begins with a CCPN project created whilst working with the suite of CCPN-framework integrated software. Currently, the main tools available are CcpNmr FormatConverter and CcpNmr Analysis (Vranken et al. 2005). CcpNmr Analysis is a spectrum visualisation, resonance assignment and NMR data analysis application. For users who have custom pipelines to calculate NMR structures, FormatConverter allows for the import of most types of NMR data from the most common other NMR assignment, peak picking and structure calculation programs. It is available both as a desktop tool and in a web-based version (see: http://www.ebi.ac.uk/pdbe/nmr/software/formatConverterUsage.html). Users of CcpNmr Analysis, or the CcpNmr Extend-NMR software, will already have all their data in a CCPN project. Whatever the starting point, the end result is a CCPN project that contains derived NMR data such as chemical shifts, restraint sets and structure ensembles (see http://www.ebi.ac.uk/pdbe/nmr/deposition/overview.html) (Fig. 1).
For users wishing to start with data from previous submissions to the wwPDB, it is possible to import PDB header information and NMR-STAR v 3.1 files into ECI. This data can then be edited and modified to suit the new submission, with very similar projects requiring little user input. Furthermore, it is possible to use the compatible CcpNmr DataShifter (which is available as part of the CCPN releases) to copy data from other CCPN projects quickly into the new project. It is worth noting that archived NMR data (Ulrich et al. 2008) and “cleaned up” restraints (Doreleijers et al. 2009), which are available from the BMRB as NMR-STAR files, can be imported into CCPN and analysed further using CcpNmr and Extend-NMR software.
The curated CCPN project is also exported to NMR-STAR format and this file is forwarded to the BMRB, together with the original and curated data, after the coordinate annotation is finished (typically within two days after the AutoDep deposition). The BMRB will then initiate a new ADIT-NMR deposition for the NMR data only and e-mail the depositor a web link where the BMRB data submission procedure can be completed. In our experience, for data sets that are “all green” in ECI (i.e. all data necessary for deposition are available), more than 90% of the fields in ADIT-NMR will be populated, and little user input will be required to finish the deposition. At this point, annotators at the BMRB will curate the NMR data in the submission (for more details about completing ADIT-NMR pages, see: http://www.ebi.ac.uk/pdbe/nmr/deposition/adit-nmr.html).
Discussion and conclusions
The basic philosophies behind the web tools for depositing NMR data (AutoDep and ADIT) are very different. AutoDep is designed to provide context-dependent forms based on the experimental method and refinement software used, and stores data temporarily in an internal XML format that can later be transformed into proper archive formats such as PDB, mmCIF (Bourne et al. 1997), or NMR-STAR. In contrast, ADIT (hosted at RCSB, Rutgers University) and ADIT-NMR store the data from their sessions in mmCIF or NMR-STAR files. In spite of these differences, both AutoDep and ADIT(-NMR) require the user to save each completed page separately. This can make the deposition process slow, and it is also error-prone if the user fails to notice, for example, typographical errors before moving to a new deposition section. A desktop-based solution like the one presented here is much more user-friendly. It is easier to navigate, will flag missing or incorrect data and has the advantage that all the meta-data that has been entered is stored locally and remains available. This solution does require installation of a software package, but since the software used (Python and Tk) is platform-independent and the CCPN installation scripts are now well developed and tested, this does not present a major obstacle.
For NMR data storage in the PDB, the primary archive format is NMR-STAR (Ulrich et al. 2008). It is a text-based format, similar to STAR and CIF used for crystallographic data, and uses identifiers in save frames and tables to uniquely tag data items that can then be referenced elsewhere in the same file. Because of its text-based and human-readable nature, it is a good format for long-term archival of data. For software development, it is more important that data can be transformed directly and unambiguously between files and data structures in memory, and that the data consistency and validity can be assured. This is where the UML-based CCPN data model excels (Fogh et al. 2010). CCPN projects consist of many XML files, which are less intelligible to humans, but are read and written directly by the subroutine libraries that come with the CCPN implementation. Once data is in memory, the subroutine libraries (available in Python, Java and C) ensure data access and maintain data integrity, while application programs perform the actual calculations on chemical shifts, peak lists, NMR restraints, atomic coordinates and other available data.
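To make the description of the text-based format more concrete, the sketch below reads the tag-value pairs of a small, hand-written NMR-STAR-like fragment into Python dictionaries. It is an illustration only: the fragment is invented, `read_saveframes` is a hypothetical helper rather than part of the FormatConverter libraries, and using `shlex` for quoted values is an approximation of the STAR quoting rules. Data tables (loops) and multi-line values are deliberately not handled.

```python
import shlex

# Invented fragment; tag names follow NMR-STAR 3.1 naming for illustration.
EXAMPLE = """\
data_example
save_sample_conditions_1
   _Sample_condition_list.Sf_category  sample_conditions
   _Sample_condition_list.Sf_framecode sample_conditions_1
   _Sample_condition_list.Details      'pH 6.5, 298 K'
save_
"""


def read_saveframes(text):
    """Collect simple tag-value pairs per save frame, keyed by frame name."""
    frames = {}
    current = None
    for line in text.splitlines():
        tokens = shlex.split(line)  # approximates STAR quoting for simple cases
        if not tokens:
            continue
        head = tokens[0]
        if head == "save_":
            current = None  # end of the current save frame
        elif head.startswith("save_"):
            current = frames.setdefault(head[len("save_"):], {})
        elif head.startswith("_") and current is not None and len(tokens) == 2:
            current[head] = tokens[1]
    return frames


frames = read_saveframes(EXAMPLE)
print(frames["sample_conditions_1"]["_Sample_condition_list.Details"])
```

The save frame name (here `sample_conditions_1`) is the identifier that other parts of the same file can use to refer back to this block of data.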
Although it is also possible to write out an NMR-STAR file from ECI and send it to the BMRB directly, we strongly encourage that, for structure-based depositions, the whole CCPN project is uploaded first into AutoDep at the PDBe, especially if CcpNmr Analysis or Extend-NMR software were used (Fig. 1). A CCPN project contains a more complete record of the information gathered during the spectral analysis (for example, incomplete assignments or peaks that were observed but not used) and structure calculation process (e.g., the restraint lists that were used for the first iteration of a structure calculation); data that would be lost if only the final data were archived. This ability of the CCPN data model to describe all aspects of the process of macromolecular structure determination using NMR, combined with complete and faithful inter-conversion with the archive/deposition NMR-STAR file format as described in this paper, makes CCPN projects an ideal medium for NMR groups to deposit their NMR data with the PDB, as well as allowing for longer term, internal archival of data in a compact and consistent format. Finally, there is the benefit of a simpler and faster deposition process both for the users and the data curators. The user only needs to make one CCPN project that can be used for all aspects of deposition and particularly the collation of all mandatory data and associated information in a simple, secure and organised fashion. As increasing amounts of NMR data gradually become mandatory for journal authors to obtain PDB and BMRB accession codes, authors will find that this unified approach saves them large amounts of time when depositing NMR models and data, and annotators will not have to spend time dealing with unnecessary data consistency issues.
A wide variety of NMR software programs can now create or use CCPN projects. One good example is the iCing server (currently in beta form at http://nmr.cmbi.ru.nl/icing/), which allows NMR spectroscopists to validate their own CCPN projects and identify potential problems with respect to structural ensemble geometry (WHATIF; Hooft et al. 1996 and PROCHECK-NMR; Laskowski et al. 1996), NOE violations (Doreleijers et al. 2005) and chemical shift data (SHIFTX; Neal et al. 2003).
In conclusion, we have presented a simple, secure and complete deposition system for NMR depositors that allows local editing and storage of all information related to an NMR structure determination project. Since this system is based on a framework that consistently stores all types of NMR data, it can also readily be extended to include new mandatory data types for deposition in the future.
The authors thank Brian Smith, Yinan Fu, Vitaliy Gorbatyuk, Marie Phelan and Nicole Cheung for testing the tools with their CCPN NMR projects made using Analysis. We also acknowledge Wayne Boucher for programming help and Chris Spronk for help with the documentation pages. This project was funded by the UK Biotechnology and Biological Sciences Research Council (BBSRC) grant BBE0075111, with equipment support from The Wellcome Trust grant 088944. BMRB is supported by grant LM05799 from the US National Library of Medicine.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Doreleijers JF, Nederveen AJ, Vranken W, Lin J, Bonvin AM, Kaptein R, Markley JL, Ulrich EL (2005) BioMagResBank databases DOCR and FRED containing converted and filtered sets of experimental NMR restraints and coordinates from over 500 protein PDB structures. J Biomol NMR 32:1–12
- Fogh R, Ionides J, Ulrich E, Boucher W, Vranken W, Linge JP, Habeck M, Rieping W, Bhat TN, Westbrook J, Henrick K, Gilliland G, Berman H, Thornton J, Nilges M, Markley J, Laue E (2002) The CCPN project: an interim report on a data model for the NMR community. Nat Struct Biol 9:416–418
- Nederveen AJ, Doreleijers JF, Vranken W, Miller Z, Spronk CA, Nabuurs SB, Güntert P, Livny M, Markley JL, Nilges M, Ulrich EL, Kaptein R, Bonvin AM (2005) RECOORD: a recalculated coordinate database of 500+ proteins from the PDB using restraints from the BioMagResBank. Proteins 59:662–672
- Sen S, Van Ginkel G, Kapopoulou A, Sahni G, Swaminathan GJ, Newman RH, Velankar S, Henrick K (2007) Autodep 4.1: a web-based deposition and archival system. Acta Cryst A63:s141
- Shen Y, Lange O, Delaglio F, Rossi P, Aramini JM, Liu G, Eletsky A, Wu Y, Singarapu KK, Lemak A, Ignatchenko A, Arrowsmith CH, Szyperski T, Montelione GT, Baker D, Bax A (2008) Consistent blind protein structure generation from NMR chemical shift data. Proc Natl Acad Sci USA 105:4685–4690
- Velankar S, Best C, Beuth B, Boutselakis CH, Cobley N, Sousa Da Silva AW, Dimitropoulos D, Golovin A, Hirshberg M, John M, Krissinel EB, Newman R, Oldfield T, Pajon A, Penkett CJ, Pineda-Castillo J, Sahni G, Sen S, Slowley R, Suarez-Uruena A, Swaminathan J, van Ginkel G, Vranken WF, Henrick K, Kleywegt GJ (2010) PDBe: Protein Data Bank in Europe. Nucleic Acids Res 38:D308–D317