NMR-STAR is the archival format used by the Biological Magnetic Resonance Data Bank (BMRB), the international repository of biomolecular NMR data (Ulrich et al. 2008) and an archive of the Worldwide Protein Data Bank (wwPDB 2018). NMR-STAR is supported as an input and/or output format by several software packages that harvest and process biomolecular data [CCPN (Vranken et al. 2005), NMRView (Johnson 2004), TALOS (Cornilescu et al. 1999), NMRFAM-SPARKY (Lee et al. 2015), PINE (Bahrami et al. 2009), ARECA (Dashti et al. 2016), PONDEROSA (Lee et al. 2011), Integrative NMR (Lee et al. 2016), CSI (Hafsa et al. 2015), NMRFx (footnote 1), RCI (Berjanskii and Wishart 2007), ABACUS (Grishaev et al. 2005), relax (d’Auvergne et al. 2008; d’Auvergne and Gooley 2008), and PDBstat (Tejero et al. 2013)] and by packages for chemical shift prediction [SHIFTX2 (Han et al. 2011) and SHIFTS (Xu and Case 2002)]. NMR-STAR is also used as a data exchange format by the NMRbox project (Maciejewski et al. 2017).
We describe here the NMR-STAR ontology and associated software tools that facilitate its use. The NMR-STAR v3.2 ontology (footnote 2) provides an extensive controlled vocabulary for the description of NMR spectroscopic studies of biological systems. The ontology includes the description of experiments, the data generated, and the derived results such as molecular structures, dynamics, and functional properties. NMR-STAR v3.2 is constructed along the lines of an object/relational model using a subset of the Self-defining Text Archival and Retrieval (STAR) specification (Hall and Spadaccini 1994). Full documentation on the rules and conventions for constructing valid NMR-STAR formatted files is available from the BMRB website (footnotes 3 and 4). The data in a BMRB entry as defined by the NMR-STAR ontology are organized in natural objects such as citations, molecular entities, samples, software applications, NMR experiments, and experimental data sets of various kinds (footnote 5; see Fig. 1). Within each object, the data are also organized in tables that are referentially linked using primary and foreign keys, so the ontology forms an entity/relational model as well. The NMR-STAR ontology can therefore be used to create a relational schema and database, and the data from multiple entries can be organized as tables for loading into a relational database constructed from the ontology.
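The object/table organization described above can be illustrated with a minimal Python sketch. It is not the BMRB's loading pipeline: the toy parser handles only unquoted, whitespace-delimited values and a single loop, the two-table schema is a hand-written assumption rather than one generated from the NMR-STAR dictionary, and the values are made up. It shows a saveframe (object) whose free tags become one row of a parent table and whose loop (table) becomes child rows linked to it by a foreign key.

```python
import sqlite3

# Toy NMR-STAR fragment: one saveframe (an "object") holding free tags and
# one loop (a "table"). Values are illustrative, not taken from a BMRB entry.
STAR_TEXT = """
save_entity_1
   _Entity.Sf_category   entity
   _Entity.ID            1
   _Entity.Name          ubiquitin

   loop_
      _Entity_comp_index.ID
      _Entity_comp_index.Comp_ID
      _Entity_comp_index.Entity_ID

      1 MET 1
      2 GLN 1
      3 ILE 1
   stop_
save_
"""


def parse_saveframe(text):
    """Parse one simple saveframe into (free_tags, loop_tags, loop_rows).

    Assumption: values are unquoted and whitespace-delimited, and the
    saveframe contains at most one loop. Real files need a full parser.
    """
    free_tags, loop_tags, loop_values = {}, [], []
    tokens = text.split()
    in_loop = False
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("save_"):      # opening "save_<name>" or closing "save_"
            i += 1
        elif tok == "loop_":
            in_loop = True
            i += 1
        elif tok == "stop_":
            in_loop = False
            i += 1
        elif in_loop and tok.startswith("_") and not loop_values:
            loop_tags.append(tok)        # column names precede any values
            i += 1
        elif in_loop:
            loop_values.append(tok)
            i += 1
        else:
            free_tags[tok] = tokens[i + 1]   # a tag followed by its single value
            i += 2
    width = len(loop_tags)
    rows = [tuple(loop_values[j:j + width])
            for j in range(0, len(loop_values), width)] if width else []
    return free_tags, loop_tags, rows


free_tags, loop_tags, rows = parse_saveframe(STAR_TEXT)

# Hand-written two-table schema (an assumption for this sketch):
# Entity.ID is the primary key; Entity_comp_index.Entity_ID is a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Entity (ID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE Entity_comp_index ("
    " ID INTEGER, Comp_ID TEXT,"
    " Entity_ID INTEGER REFERENCES Entity(ID),"
    " PRIMARY KEY (Entity_ID, ID))"
)
conn.execute("INSERT INTO Entity VALUES (?, ?)",
             (int(free_tags["_Entity.ID"]), free_tags["_Entity.Name"]))
conn.executemany(
    "INSERT INTO Entity_comp_index VALUES (?, ?, ?)",
    [(int(idx), comp, int(entity)) for idx, comp, entity in rows],
)

# Follow the foreign key from each residue row back to its parent entity.
residues = conn.execute(
    "SELECT e.Name, c.ID, c.Comp_ID"
    " FROM Entity_comp_index AS c JOIN Entity AS e ON e.ID = c.Entity_ID"
    " ORDER BY c.ID"
).fetchall()
print(residues)
```

Because each row carries its parent key, rows parsed from many entries could be appended to the same two tables, which is the sense in which data from multiple entries can be loaded into one relational database.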
NMR-STAR is intended to cover the vast majority of biomolecular NMR information. However, data that are massive and more appropriately stored in binary or other formats (for example, time-domain and processed multidimensional NMR spectral data) or that are instrument specific (pulse sequence files) are modeled as external files. Archiving and exchanging these kinds of data is very important; such files can be referenced by and associated with an NMR-STAR file without being incorporated into that file.
The NMR-STAR ontology was first released for use as a deposition, archival, and data exchange format in 1996 (Ulrich et al. 1996). At that time, the STAR format was chosen over alternatives such as Abstract Syntax Notation One (ASN.1), Standard Generalized Markup Language (SGML), and Extensible Markup Language (XML) because STAR met the criteria for extensibility while also addressing the needs of a developing and expanding scientific field: combined machine and human readability, efficient editing by existing tools, and ease of mapping to relational database technology. The original NMR-STAR version 1.0 accommodated a limited number of experimental data types (assigned chemical shifts, coupling constants, peak lists, and relaxation parameters). With input from many scientists in the NMR, X-ray crystallography, and computer science communities, the ontology has expanded to include over 90 data category groups containing a total of more than 300 data categories and over 6500 data item tags. Because the Crystallographic Information File (CIF) format used by the small molecule crystallography community and the mmCIF format used by the Protein Data Bank (PDB) are subsets of STAR, NMR-STAR communicates easily with these repositories. XML and Resource Description Framework (RDF) versions of NMR-STAR have been developed and are available from PDBj-BMRB (Yokochi et al. 2016).
The ontology has evolved over time with input from the user community. The entire BMRB archive, including legacy data, has been upgraded to NMR-STAR v3.2 for consistency. NMR-STAR v3.2 incorporates a variety of experimental data (e.g., coupling constants, heteronuclear NOEs, T1/R1, T2/R2, autorelaxation, and dipole–dipole cross-correlation parameters). In NMR-STAR v3.2, the ‘molecular interactions’ tags have been redefined as ‘chemical shift perturbation’ to match the common terminology used by the NMR community. The ontology has also been expanded to capture additional parameters derived from experiments as well as data processing workflows.
The NMR Exchange Format (NEF) (Gutmanas et al. 2015) uses the STAR format and defines an ontology simpler than NMR-STAR for the purpose of facilitating data exchange between structural NMR software packages. The wwPDB has agreed to accept both NEF and NMR-STAR as deposition formats for structural restraints. The BMRB has adopted NEF as a subset of NMR-STAR and has developed software to convert restraints in NEF to the archival NMR-STAR format. NMR-STAR serves as a more comprehensive exchange format for restraints, in that it handles restraint types not covered by NEF (including ambiguous restraints between subunits, restraints to ligands, residual dipolar couplings, paramagnetic relaxation enhancement, and restraints derived from cross-linking or SAXS). Because NEF uses non-unique atom designators rather than standard IUPAC nomenclature, NEF can be converted to NMR-STAR, but NMR-STAR cannot be converted uniquely back to NEF. For that reason, the wwPDB stores the original NEF from depositions and requires NEF depositions to include coordinates in mmCIF format with a clear mapping from the atom names used in the NEF files to the IUPAC names used by mmCIF/NMR-STAR. In response to suggestions from the NEF team, NMR-STAR v3.2 supports views of spectral peak list information that merge data from four tables into one more readable table. The BMRB retains the four-table format for purposes of database management.
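One way to see why the conversion is one-directional is the following Python sketch. It is a simplified illustration, not the BMRB converter: the atom table is a hypothetical, abbreviated lookup for two residue types, and the wildcard rules (`%` for a run of digits, a lowercase `x` or `y` for one member of a non-stereospecifically assigned pair) are a reduced reading of the NEF naming convention.

```python
import re

# Hypothetical, abbreviated table of IUPAC proton names for two residue types.
# A real converter would consult a complete chemical component dictionary.
IUPAC_ATOMS = {
    "ALA": ["H", "HA", "HB1", "HB2", "HB3"],
    "SER": ["H", "HA", "HB2", "HB3", "HG"],
}


def nef_name_to_regex(nef_name):
    """Translate a NEF-style atom designator into a regular expression.

    Simplifying assumptions for this sketch:
      '%'        stands for one or more digits,
      'x' / 'y'  (lowercase) stand for a single digit whose value is unknown.
    """
    parts = []
    for ch in nef_name:
        if ch == "%":
            parts.append(r"\d+")
        elif ch in "xy":
            parts.append(r"\d")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def expand_nef_atom(residue, nef_name):
    """Return the IUPAC atom names that a NEF designator could denote."""
    pattern = nef_name_to_regex(nef_name)
    return [atom for atom in IUPAC_ATOMS[residue] if pattern.fullmatch(atom)]


# Forward direction: every NEF designator expands to a definite IUPAC set.
print(expand_nef_atom("ALA", "HB%"))   # methyl protons
print(expand_nef_atom("SER", "HBx"))   # one of a non-stereospecific pair
print(expand_nef_atom("SER", "HBy"))   # the other member of that pair
print(expand_nef_atom("SER", "HB%"))   # both beta protons as a group
```

In this sketch the designators `HBx`, `HBy`, and `HB%` on serine all expand to the same set of IUPAC names, so the set alone does not determine which designator the depositor used. Retaining the original NEF file, as the wwPDB does, avoids having to guess.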
Users can retrieve data from the NMR-STAR archive by means of a variety of query interfaces available on the BMRB website (http://www.bmrb.wisc.edu/). Queries can be based either on the NMR-STAR data model or on information extracted from BMRB entries by means of the NMR-STAR dictionary. PyNMRSTAR is a Python library for interacting with NMR-STAR files: reading them, modifying them, and writing them out (footnote 6). The PyNMRSTAR library is used by a variety of tools developed by the BMRB or by external groups for working with BMRB files. The RBMRB software package provides access to BMRB data in the R environment (footnote 7). One can use built-in functions in RBMRB to visualize data mined from BMRB and to simulate spectra. Third-party software developers have provided a Perl script for converting NMR-STAR to NMRPipe (footnote 8) and Python parsers for NMR-STAR that support interfacing with BMRB (Smelter et al. 2017) (footnote 9).