NMR-STAR: comprehensive ontology for representing, archiving and exchanging data from nuclear magnetic resonance spectroscopic experiments

The growth of the biological nuclear magnetic resonance (NMR) field and the development of new experimental technology have mandated the revision and enlargement of the NMR-STAR ontology used to represent experiments, spectral and derived data, and supporting metadata. We present here a brief description of the NMR-STAR ontology and software tools for manipulating NMR-STAR data files, editing the files, extracting selected data, and creating data visualizations. Detailed information on these is accessible from the links provided.

We describe here the NMR-STAR ontology and associated software tools that facilitate its use. The NMR-STAR v3.2 ontology 2 provides an extensive controlled vocabulary for the description of NMR spectroscopic studies of biological systems. The ontology includes the description of experiments, the data generated, and the derived results such as molecular structures, dynamics, and functional properties. NMR-STAR v3.2 is constructed along the lines of an object/relational model using a subset of the Self-defining Text Archival and Retrieval (STAR) specification (Hall and Spadaccini 1994). Full documentation on the rules and conventions for constructing valid NMR-STAR formatted files is available from the BMRB website. 3,4 The data in a BMRB entry as defined by the NMR-STAR ontology are organized in natural objects such as citations, molecular entities, samples, software applications, NMR experiments, and experimental data sets of various kinds 5 (see Fig. 1). However, the data also are organized in tables within the objects and are referentially linked using primary and foreign keys forming an entity/relational model, as well. The NMR-STAR ontology, therefore, can be used to create a relational schema and database. The data from multiple entries can be easily organized as tables for loading into a relational database constructed from the NMR-STAR ontology.
While NMR-STAR is intended to cover the vast majority of biomolecular NMR information, data that are massive and more appropriately stored in binary or other formats (for example, time-domain and processed multidimensional NMR spectral data) or instrument specific (pulse sequence files) are modeled as external files. Archiving and exchanging these kinds of data is very important, but files of these kinds can be referenced and associated with an NMR-STAR file without being incorporated into that file.
The NMR-STAR ontology was first released for use as a deposition, archival, and data exchange format in 1996 (Ulrich et al. 1996). At that time, the STAR format was chosen over alternatives, such as Abstract Syntax Notation One (ASN.1), standard generalized markup language (SGML), and extensible markup language (XML), because STAR met the criteria for extensibility while at the same time the format addressed the needs for a developing and expanding scientific field: combined machine and human readability, efficient editing by existing tools, and ease of mapping to relational database technology. The original NMR-STAR version 1.0 accommodated a limited number of experimental data (assigned chemical shifts, coupling constants, peak lists, and relaxation parameters). With input from many scientists in the NMR, X-ray crystallography, and computer science communities, the ontology has expanded to include over 90 data category groups containing a total of more than 300 data categories and over 6500 data item tags. Because the Crystallography Information File (CIF) format used by the small molecule crystallography community and the mmCIF format used by the Protein Data Bank (PDB) are subsets of STAR, NMR-STAR communicates easily with these repositories. Extensible markup language (XML) and resource description framework (RDF) versions of NMR-STAR have been developed and are available from PDBj-BMRB (Yokochi et al. 2016).
The ontology has evolved over time with input from the user community. The entire BMRB archive has been upgraded to NMR-STAR v3.2 for consistency with legacy data. NMR-STAR v3.2 incorporates a variety of experimental data (e.g., coupling constants, heteronuclear NOEs, Fig. 1 First two tiers in the NMR-STAR ontology for BMRB. The STAR dictionary tree is an interactive tool for inspecting the NMR-STAR ontology and assessing the most appropriate construct for the user's application. The NMR-STAR ontology is under constant evolution with community feedback to reflect advancements in NMR techniques. An interactive version is available from the BMRB website with tools that enable searching by category or tag T1/R1, T2/R2, and autorelaxation and dipole-dipole cross correlation parameters). In NMR-STAR v3.2 'molecular interactions' tags have been redefined as 'chemical shift perturbation' to match the common terminology used by the NMR community. The ontology has been expanded to capture additional parameters derived from experiments as well as data processing workflows.
The NMR exchange format (NEF) (Gutmanas et al. 2015) utilizes the STAR format and defines an ontology simpler than NMR-STAR for the purpose of facilitating data exchange between structural NMR software packages. The wwPDB has agreed to accept both NEF and NMR-STAR as deposition formats for structural restraints. The BMRB has adopted NEF as a subset of NMR-STAR and has developed software to convert restraints in NEF to the archival NMR-STAR format that serves as a more comprehensive exchange format for restraints, in that it handles restraint types not covered by NEF (including ambiguous restraints between subunits, restraints to ligands, residual dipolar couplings, paramagnetic relaxation enhancement, and restraints derived from cross-linking or SAXS). Because NEF utilizes non-unique atom designators rather than standard IUPAC, NEF can be converted to NMR-STAR, but NMR-STAR cannot be converted uniquely back to NEF. For that reason, the wwPDB stores the original NEF from depositions and requires NEF depositions to include coordinates in mmCIF format with a clear mapping from the atom names used in the NEF files to the IUPAC format used by mmCIF/NMR-STAR. In response to suggestions from the NEF team, NMR-STAR V3.2 supports views of spectral peak list information that merge data from four tables into one morereadable table. The BMRB retains the four-table format for purposes of database management.
Users can retrieve data from the NMR-STAR archive by means of a variety of query interfaces available on the BMRB website (http://www.bmrb.wisc.edu/). Queries can be based either on the NMR-STAR data model or on information extracted from BMRB entries by means of the NMR-STAR dictionary. PyNMRSTAR is a library for interacting with NMR-STAR files in the Python language by reading files, modifying them, and writing them out. 6 The PyN-MRSTAR library is used by a variety of tools developed by the BMRB or external groups for operating with BMRB files. The RBMRB software package provides access to BMRB data in the R environment. 7 One can use built-in functions in RBMRB to visualize data mined from BMRB and to simulate spectra. Third-party software developers have provided Perl script for converting NMR-STAR to NMRPipe. 8 and Python parsers for NMR-STAR that support interfacing with BMRB (Smelter et al. 2017). 9

Discussion
Conceptually, a BMRB entry can be viewed as a set of linked objects (saveframes) connected by the STAR saveframe pointer method (Hall and Cook 1995), by the use of integer object identifiers, and by relational primary and foreign keys. The NMR-STAR data model, as a relational schema, is highly denormalized with redundant information present across many tag categories or tables. There also exist multiple paths for linking experimental data with the experiments, samples, sample conditions, and other metadata. This has allowed data derived from the literature to be represented in the relational database, even when the information required to link metadata to experimental data is not always complete. While NMR-STAR is feature-rich, the format is flexible and allows for lightweight use. For example, subsets of the full NMR-STAR ontology are utilized in workflows involving data exchange. The combination of rich representation and lightweight application provides application developers an evolutionary path to data exchange while at the same time it mitigates the N 2 format-to-format translation problems. For ease of on-line searching, saveframes and tags are organized into categories and sub-categories. 10 Each tag is fully described with data type and mandatory status defined.
Information encoded in the NMR-STAR ontology, supplemented by interface specification files, drives the deposition interface at BMRB. Templates are available from the BMRB website for generating and validating NMR-STAR files for a variety of NMR data, including assigned chemical shifts, coupling constants, hydrogen exchange data, and a variety of relaxation data. A number of third party software applications read and write NMR-STAR compliant files that are ready for deposition. Software tools are available for converting data from various software packages to NMR-STAR or NEF. The most popular are PDBStat (Tejero et al. 2013) and STARch (STAR conversion handler). 11 STARch converts data tables in NMR-STAR 2.0 to 3.0. Depositors of NMR data to wwPDB archives are encouraged to use these 1 3 or similar tools to convert their data into NMR-STAR (or NEF) prior to submission.