CONNJUR spectrum translator: an open source application for reformatting NMR spectral data
NMR spectroscopists are hindered by the lack of standardization for spectral data among the file formats used by the various NMR data processing tools. This lack of standardization is cumbersome, as researchers must perform their own file conversion in order to switch between processing tools, and it restricts the combination of tools that can be employed when no conversion option is available. The CONNJUR Spectrum Translator introduces a new, extensible architecture for spectrum translation along with two key algorithmic improvements. The first is translation of NMR spectral data (time and frequency domain) to a single in-memory data model, which allows a new file format to be added with two converter modules, a reader and a writer, instead of a separate converter to each existing format. The second is the use of layout descriptors, which allows a single FID data translation engine to be used for all formats. For the end user, sophisticated metadata readers allow conversion of the majority of files with minimal user configuration. The open source code is freely available at http://connjur.sourceforge.net for inspection and extension.
Keywords: FID, Spectrum, Software, Conversion, CONNJUR, Reconstruction
Over the past several decades NMR has proven itself to be a powerful and versatile tool for measuring many biophysical aspects of bio-molecules. It is an accepted technique for determining the three-dimensional structure of macromolecules and macromolecular complexes in solution (Williamson et al. 1985; Clore and Gronenborn 1998; Wuthrich 2003). It is also often relied on for measuring the internal dynamics of molecules, from picosecond to nanosecond motions by monitoring 15N relaxation (Barbato et al. 1992; Palmer 2004), microsecond to millisecond motions by measuring relaxation dispersion (Korzhnev et al. 2004; Neudecker et al. 2009), and millisecond-timescale and longer motions through observations of hydrogen exchange (Englander and Mayne 1992; Gryk et al. 1995) or other direct measurements made in real time. In the absence of a high-resolution structure, chemical shifts and chemical shift changes have been used for determining secondary (Wishart et al. 1992; Cornilescu et al. 1999) and tertiary structural elements (Cavalli et al. 2007; Shen et al. 2008) as well as the binding interfaces for ligand binding and macromolecular interactions (Shuker et al. 1996; Gryk et al. 2002). In addition, measurement of chemical shift changes upon titration of ligands or pH has proven effective for measuring binding constants and local pKa’s in solution (Malthouse 1999).
In contrast to these diverse applications of NMR, the primary measurement recorded on the spectrometer is essentially the same for all experiments: that of the precessing nuclear spin magnetization within the probe coil, termed the free induction decay (FID) (Bloch 1946). The diversity of information obtainable from such a uniform measurement type is gained through the large number of ways in which the physical sample can be perturbed, as well as the ways in which the nuclear spin populations within an otherwise static sample can be perturbed (Cavanagh et al. 1996). This diversity of information content comes at a cost, however: even though the primary data recording is essentially the same regardless of the design of the experiment, both the acquisition and processing paradigms for the data differ depending on the goal of the individual experiment or set of experiments. Thus, the diversity in information content is achieved by adding complexity in the experimental setup and design, as well as a concomitant complexity in subsequent data analysis.
It is perhaps not surprising that with such a diversity in NMR applications, there has also grown a diversity in NMR data processing tools, spanning the arenas of spectral reconstruction (converting time domain data to the frequency domain), spectral analysis (peak identification, assignment and characterization), and biophysical characterization (converting NMR-specific measurements to biophysical insights) (Ellis et al. 2006b). These tools differ widely in their architecture, their underlying algorithms, the operating systems under which they run, and their internal and external data formats. It is not the case that one software tool is optimal for all given tasks and the others inferior. Rather, each of the various tools has pros and cons depending on the processing task to be performed as well as other competing concerns (speed, accuracy, cost of use, etc.). Thus, it is desirable to the NMR community that the diversity of tools be maintained and not eliminated. Unfortunately, the diversity in design and use of the tools, and principally of their respective file formats, creates an obstacle to tool interoperability.
The goal of the CONNJUR project is to alleviate the strain on the NMR user by increasing the interoperability of existing NMR processing tools (http://www.connjur.org). The overall strategy is (a) to provide a common data repository for all types of NMR data required and provided by the various third-party software tools; (b) provide software wrappers for the existing tools so they can make use of the common data repository; and (c) provide a user interface from which to drive all NMR processing tasks using any subset of existing NMR tools. Portions of the common data model under development have been published previously (Fox-Erlich et al. 2004; Ellis et al. 2006a; Gryk et al. 2010), and prototypes of the CONNJUR software integration environment have been demonstrated at recent Experimental NMR Conferences (http://www.enc-conference.org) and the 2008 NANUC NMR Bootcamp (http://www.nanuc.ca/resources/workshop2008.php). A similar but distinct effort has been undertaken by the Collaborative Computing Project for NMR (CCPN) (Fogh et al. 2002; Vranken et al. 2005). While sharing a similar goal of data integration, recent efforts of CCPN have been to redesign a powerful spectral analysis tool called CcpNmr Analysis. To date, CCPN does not support spectrum translation; however, CCPN does allow spectral data to be imported into the CCPN data model.
While detailing the functional requirements of CONNJUR, it was recognized that the CONNJUR integration platform would require the functionality to interconvert data between the different file formats utilized by the third-party tools we intend to wrap. The requirements for such a universal spectrum translator are fourfold: (1) the spectrum translator should be able to translate between any two arbitrary file formats (as within the context of CONNJUR, no directionality of the processing pipeline should be assumed); (2) the translator should use a common data model such that the data and metadata provide a portion of the fundamental data model used by CONNJUR; (3) the spectrum translator must be easily extensible to support the addition of new file formats as new tools emerge and additional tools are wrapped by CONNJUR; and (4) as CONNJUR is intended to be an open source project, the spectrum translator should also be open source.
The CONNJUR Spectrum Translator (CONNJUR-ST) represents our first official software release. CONNJUR-ST tackles a small but critical component of the overall NMR software integration problem: interconversion of NMR spectral data between the data formats used by the various software processing tools. The current release supports six vastly different data formats: four used by the spectrometer vendors Varian (http://www.varianinc.com/) and Bruker (http://www.bruker-biospin.com/), as well as two used by NIH-supported spectral reconstruction suites, NMRPipe (Delaglio et al. 1995) and the Rowland NMR Toolkit (Hoch and Stern 2010). It is hoped that the establishment of CONNJUR-ST as an open source project will provide both the incentive and opportunity for the addition of conversion functionality for other useful time-domain data formats. The modular design of CONNJUR-ST should assist in such support of additional file formats.
Materials and methods
CONNJUR-ST (version 1.1) is written in Java and requires that the user has preinstalled the Java Virtual Machine (JVM v. 1.5) (http://www.Java.com) in order to execute. The JVM is multiplatform and typically comes preinstalled on most desktop computers. CONNJUR-ST has been developed using Eclipse (http://www.eclipse.org) as an integrated development environment. A source code control system (Concurrent Versions System, or CVS) (http://www.nongnu.org/cvs/) is used to track changes to the software and allow changes to be applied in a controlled manner. In the event that testing finds a change has introduced errors, previous versions may be restored. A fully automated extraction and build occurs daily. Regression test conversions against both existing tools and previous CONNJUR-ST versions are repeated to ensure data fidelity.
Validation of CONNJUR-ST conversions
CONNJUR-ST conversion was compared against existing conversion tools. Output was compared using one of three methods. First, visual examination of the converted spectra in display tools such as NmrDraw (Delaglio et al. 1995) was performed. Second, portions of output files containing NMR data were compared on a binary basis. (Some embedded non-NMR data, such as internal timestamps, vary on a binary basis. Therefore, a complete binary comparison as done with the Unix command ‘diff’ is not sufficient for testing.) Third, CONNJUR-ST has been alpha-tested by members of the CONNJUR team as well as collaborators at the UCONN Health Center.
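The second comparison method, a binary comparison restricted to the NMR data, can be sketched as follows. This is a minimal illustration, not the project's actual test code: the class name `DataRegionCompare`, the method `sameData`, and the assumption that the non-NMR bytes sit in a leading header of known length are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of CONNJUR-ST: compares only the data region of two files. */
final class DataRegionCompare {

    /**
     * Returns true when the bytes following each header are identical.
     * The header lengths (in bytes) are assumed to be known from the format documentation.
     */
    static boolean sameData(byte[] fileA, int headerA, byte[] fileB, int headerB) {
        byte[] dataA = Arrays.copyOfRange(fileA, headerA, fileA.length);
        byte[] dataB = Arrays.copyOfRange(fileB, headerB, fileB.length);
        return Arrays.equals(dataA, dataB);
    }

    public static void main(String[] args) {
        // Two in-memory "files": the headers differ (as timestamps would), the data bytes agree.
        byte[] reference = {9, 9, 1, 2, 3};
        byte[] converted = {7, 1, 2, 3};
        System.out.println(sameData(reference, 2, converted, 1));
    }
}
```

Skipping the headers is what lets files with differing embedded timestamps still count as equivalent.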
File format documentation
Varian data file format information was primarily obtained from “VNMR User Programming” (Varian 1998). Kirk Marat provided supplementary information regarding use of binary bits (personal communication). NMRPipe format information was obtained from the “fdatap.html” and “fdatap.h” files distributed with NMRPipe (Delaglio et al. 1995). Rowland Toolkit information was found in the online manual (Hoch and Stern 2010). Bruker file information was obtained from “XWIN-NMR Software Manual” (Bruker 1995).
Initial datasets included spectra previously collected by the principal investigator, Dr. Michael Gryk, for Varian data, and data collected at the University of Connecticut chemistry department for Bruker data. Additionally, a set of spectra of small file size was collected with various combinations of phasing and real/complex data. While the small datasets do not provide spectra usable for analysis, they provide excellent test cases for possible variations in file formats. Synthetic datasets generated by the code allow tracking of specific values from one data file format to another.
The CONNJUR Spectrum Translator is available for download at http://www.connjur.org. The software and documentation, including assumptions, configuration parameter descriptions and quickstart instructions, as well as all test data can be found in the same location. The source code and manifesto for open source contributions will be released shortly following this publication. The CONNJUR spectral object is documented using standard Javadoc technology (http://www.Java.com). All Javadocs on the CONNJUR spectral object will be accessible through the CONNJUR website.
A major benefit of this modular approach is in the support of multiple, third-party file formats. If one simply supported one-to-one file converters, the number of converters necessary would be n(n−1), where n is the number of supported file formats. With our approach, however, the total number of translators required is only 2n as all translation is mediated through a common model, decoupling the input format from the output format. In addition, as existing file formats are altered, software bugs detected, or new features added, the maintenance cost amounts to only two converters needing to be updated rather than 2(n−1). For this release of CONNJUR-ST, this amounts to a twofold savings in implementation and a fourfold savings in support. This benefit will increase drastically as additional file formats are supported by CONNJUR-ST.
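The counting argument above can be checked with a few lines of arithmetic. The sketch below simply evaluates the three expressions from the text; the class and method names are illustrative only. The twofold and fourfold figures quoted above correspond to n = 5 in these formulas (20 versus 10 converters, and 8 versus 2 converters to maintain).

```java
/** Illustrative arithmetic only: converter counts for n supported file formats. */
final class ConverterCount {

    /** One dedicated converter for every ordered pair of distinct formats: n(n-1). */
    static int pairwise(int n) {
        return n * (n - 1);
    }

    /** One reader and one writer per format, mediated by a common model: 2n. */
    static int viaCommonModel(int n) {
        return 2 * n;
    }

    /** Pairwise converters that must be revisited when one format changes: 2(n-1). */
    static int pairwiseTouchedByOneFormat(int n) {
        return 2 * (n - 1);
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 10; n++) {
            System.out.printf("n=%d pairwise=%d common=%d maintenance=%d vs 2%n",
                    n, pairwise(n), viaCommonModel(n), pairwiseTouchedByOneFormat(n));
        }
    }
}
```

Because the pairwise count grows quadratically while the common-model count grows linearly, the gap widens with each added format.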
It is worth noting that this same computational savings would be achieved if the NMR community were to settle on any one file format as a standard, with a suite of importers and exporters built around that standard format. (See, for example, the solution offered by Olivia, which provides a suite of importers/exporters (http://fermi.pharm.hokudai.ac.jp/olivia/api/index.php/Xyza2pipe_src) for the NMRPipe file format.) It is our estimation that each of the file formats supported by CONNJUR-ST has sufficient pros and cons associated with it that none would be ideal for such a standard. It is also true that, as a matter of computational efficiency, all translation should be done in memory rather than by writing to disk, further discouraging such an approach.
An additional benefit arose in the CONNJUR-ST design, inspired by the manner in which the Rowland NMR Toolkit reads various third-party file formats. As there are a limited number of simple ways in which arrays of numbers may be ordered, the reading and writing portions of the translator are further broken down into three subtasks: reading (writing) metadata, reading (writing) data, and interpreting the binary data in terms of a layout descriptor. The layout descriptor—unique for each file format—dictates the ordering of the numbers in the binary file and can be thought of as a correspondence between the metadata in the file and the layout of the file as specified for each third party tool. Therefore, the maintenance tasks are simplified to the two main tasks of handling metadata and handling the layout descriptor—a vast savings in effort from supporting a stand-alone format converter for a given third-party format.
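To make the idea of a layout descriptor concrete, the sketch below shows one possible shape for it: a small, format-specific description of header and block sizes, consumed by a single format-independent routine that locates a value in the file. This is an assumption-laden illustration, not the CONNJUR-ST implementation; the class `LayoutSketch`, its fields, and the numbers in the usage are all hypothetical.

```java
/** Hypothetical sketch of a layout descriptor driving one shared translation engine. */
final class LayoutSketch {

    /** Describes where the numeric values sit in a binary file (assumed fields). */
    static final class Layout {
        final int fileHeaderBytes;   // bytes before the first block
        final int blockHeaderBytes;  // bytes preceding the data of every block (0 if none)
        final int valuesPerBlock;    // numeric values stored in one block (must be > 0)
        final int bytesPerValue;     // e.g. 2 for 16-bit values, 4 for 32-bit values

        Layout(int fileHeaderBytes, int blockHeaderBytes, int valuesPerBlock, int bytesPerValue) {
            this.fileHeaderBytes = fileHeaderBytes;
            this.blockHeaderBytes = blockHeaderBytes;
            this.valuesPerBlock = valuesPerBlock;
            this.bytesPerValue = bytesPerValue;
        }
    }

    /** Shared engine: byte offset of the value with the given zero-based index. */
    static long offsetOf(Layout layout, long index) {
        long block = index / layout.valuesPerBlock;   // block containing the value
        return layout.fileHeaderBytes
                + (block + 1) * layout.blockHeaderBytes  // headers of this and all earlier blocks
                + index * layout.bytesPerValue;          // all values stored before this one
    }

    public static void main(String[] args) {
        Layout blocked = new Layout(32, 28, 4, 4);  // illustrative numbers only
        System.out.println(offsetOf(blocked, 5));
    }
}
```

Under this design, supporting another format means supplying a new `Layout` instance and leaving `offsetOf` unchanged.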
Model for time-domain NMR spectra
The NMR data model consists of the following three key components: (a) a set of attributes (metadata) related to the spectrum as a whole including the N dimensions for the experiment, where N is a positive integer, (b) a set of metadata related to each dimension of the experiment, and (c) the N-dimensional matrix of complex data points, representing the actual data collected in the experiment (that is, the magnetic induction measured within the spectrometer probe coil).
The core metadata for each dimension consist of the sweep width (which is inversely proportional to the spacing between collected time increments), the total number of points (either real or complex (see footnote 1), depending on data collection), and a Boolean descriptor as to whether the data are real or complex. A Boolean flag defining time domain versus frequency domain is also added for future support of frequency domain spectral data.
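The three components of the data model and the per-dimension metadata could be represented along the following lines. This is a hedged sketch rather than the actual CONNJUR spectral object: the class and field names are invented, and storing the N-dimensional matrix as one flattened `double[]` (two values per complex point) is an assumption made for brevity.

```java
/** Hypothetical sketch of the in-memory spectrum model; not the CONNJUR spectral object. */
final class SpectrumModelSketch {

    /** (b) Metadata for one dimension of the experiment. */
    static final class Dimension {
        final double sweepWidthHz;  // sweep width
        final int points;           // number of points, counted as real or complex points
        final boolean complex;      // true if the points in this dimension are complex
        final boolean timeDomain;   // true for time domain, false for frequency domain

        Dimension(double sweepWidthHz, int points, boolean complex, boolean timeDomain) {
            this.sweepWidthHz = sweepWidthHz;
            this.points = points;
            this.complex = complex;
            this.timeDomain = timeDomain;
        }
    }

    /** (a) Whole-spectrum metadata plus (c) the data matrix, flattened here by assumption. */
    static final class Spectrum {
        final String comment;
        final Dimension[] dimensions;
        final double[] values;

        Spectrum(String comment, Dimension[] dimensions) {
            this.comment = comment;
            this.dimensions = dimensions;
            this.values = new double[valueCount(dimensions)];
        }
    }

    /** Number of stored values: each complex dimension doubles the count. */
    static int valueCount(Dimension[] dimensions) {
        int count = 1;
        for (Dimension d : dimensions) {
            count *= d.points * (d.complex ? 2 : 1);
        }
        return count;
    }

    public static void main(String[] args) {
        Dimension direct = new Dimension(8000.0, 512, true, true);
        Dimension indirect = new Dimension(2000.0, 64, true, true);
        Spectrum s = new Spectrum("example", new Dimension[] {direct, indirect});
        System.out.println(s.values.length);
    }
}
```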
The model for the physical arrangement of NMR files includes indications of the data encoding (float versus integer, 32-bit versus 16-bit, and little endian versus big endian) and the ordering of the complex data points (stored as real-imaginary pairs versus all reals followed by all imaginaries). The numerical spectral data are assumed to be contained in one or more binary files, each binary file having optional headers, footers, and block headers. The metadata are assumed to be contained either within the binary data file or within optional ASCII parameter files. The ordering of the data points within the file, and within multiple files, as well as the formatting of metadata in the parameter files, are all included in our layout model.
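The encoding choices listed above (float versus integer, 32-bit versus 16-bit, byte order, and paired versus sequential complex ordering) map naturally onto `java.nio.ByteBuffer`. The following is a minimal sketch under stated assumptions: it handles one block of complex points with no headers, and the class and parameter names are hypothetical rather than taken from CONNJUR-ST.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical decoder for one header-free block of complex points. */
final class DecodeSketch {

    /**
     * Decodes nComplex complex points and returns them as {re0, im0, re1, im1, ...}.
     * isFloat selects 32-bit floats; otherwise bytesPerValue (2 or 4) selects the integer width.
     * paired is true for real-imaginary pairs, false for all reals followed by all imaginaries.
     */
    static double[] decode(byte[] raw, ByteOrder order, boolean isFloat,
                           int bytesPerValue, boolean paired, int nComplex) {
        ByteBuffer buffer = ByteBuffer.wrap(raw).order(order);
        double[] stored = new double[2 * nComplex];
        for (int i = 0; i < stored.length; i++) {
            if (isFloat) {
                stored[i] = buffer.getFloat();
            } else if (bytesPerValue == 2) {
                stored[i] = buffer.getShort();
            } else {
                stored[i] = buffer.getInt();
            }
        }
        if (paired) {
            return stored;  // already in re/im pair order
        }
        // Sequential storage: interleave the real half with the imaginary half.
        double[] pairs = new double[2 * nComplex];
        for (int k = 0; k < nComplex; k++) {
            pairs[2 * k] = stored[k];
            pairs[2 * k + 1] = stored[nComplex + k];
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Big-endian 16-bit integers 1, -2, 3, 4 stored as reals {1, -2} then imaginaries {3, 4}.
        byte[] raw = {0, 1, (byte) 0xFF, (byte) 0xFE, 0, 3, 0, 4};
        double[] pairs = decode(raw, ByteOrder.BIG_ENDIAN, false, 2, false, 2);
        System.out.println(java.util.Arrays.toString(pairs));
    }
}
```

Normalizing every input to a single in-memory ordering keeps the writers independent of how the source file happened to be laid out.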
The current implementation of CONNJUR-ST expects the numerical spectral data to be stored in binary, as is the case for each of the five file formats currently supported. If a third-party tool represents the data in ASCII rather than binary, the nature of the character encoding and numerical delimiters could be included as metadata to support such file formats. CONNJUR-ST does currently support output to ASCII, but this has thus far only been used for debugging purposes. It should also be noted that CONNJUR-ST expects the spectral data to be stored as files on the file system. However, as used within the CONNJUR project, CONNJUR-ST is also designed to accept input/output streams as well as files. This release of CONNJUR-ST (v 1.1) supports input/output data streams for the NMRPipe format, the only supported format consisting of a single binary file.
CONNJUR-ST converts time- and frequency-domain spectral data from one file format to another while requiring minimal user input. Secondarily, it provides semantic transformation operations on the data, including negating imaginary values on some or all dimensions, performing the Rance-Kay (Echo/Anti-Echo) sensitivity enhancement (Kay et al. 1992) correction, and removing leading data introduced by the Bruker digital filtering. Additionally, it allows some metadata manipulation, e.g. of data set comments and dimension labels.
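The simplest of the semantic transformations named above, negating imaginary values, can be illustrated for a single one-dimensional vector. The sketch assumes the points are held as real-imaginary pairs in one array; the class and method names are hypothetical, and the multidimensional case (choosing which dimensions to negate) is deliberately left out.

```java
/** Hypothetical sketch: negate the imaginary part of a 1D vector stored as re/im pairs. */
final class NegateImaginarySketch {

    /** Flips the sign of every imaginary value (odd indices) in place. */
    static void negateImaginary(double[] pairs) {
        for (int i = 1; i < pairs.length; i += 2) {
            pairs[i] = -pairs[i];
        }
    }

    public static void main(String[] args) {
        double[] fid = {1.0, 2.0, 3.0, -4.0};  // two complex points: 1+2i and 3-4i
        negateImaginary(fid);
        System.out.println(java.util.Arrays.toString(fid));
    }
}
```

Negating the imaginary part is a complex conjugation of the time-domain signal, which mirrors the frequency axis after Fourier transformation; this is why such an option matters when tools disagree on sign conventions.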
Bruker acquisition format (FID for 1D, SER for higher dimensions);
Bruker processing format (e.g., rr file);
NMRPipe: both the plane and cube representation;
Rowland NMR Toolkit standard layout only. RNMRTK is capable of reading and writing a limitless number of representations. By necessity, our translator is limited to reading/writing only the standard, default representation used by RNMRTK;
Varian acquisition format. The Varian format requires complex data in the first incremented dimension. Otherwise, arbitrary quadrature settings are supported.
Varian processed format (limited support). Varian software is not designed with the intention that frequency-domain data should be stored to or loaded from disk (George Gray, personal communication). Rather, the software expects the user to store/load only the time-domain data, and reprocess using VNMRJ if necessary. That said, many NMR users have learned to extract and use frequency-domain data, stored as the “phasefile” within the experiment directory. CONNJUR provides limited support for this type of data, allowing translation from the phasefile, but not to the phasefile (i.e. allowing import but not export). It should be noted that the data stored within the phasefile is not guaranteed to be complete, and so this type of translation should be done with caution.
In order to ensure that CONNJUR-ST is capable of handling all operations with minimal user input, it has been designed to self-configure based on some assumptions about the input and output layouts and metadata. These assumptions vary in confidence from the likely conjecture that the input metadata regarding sweep widths is correct to less certain assumptions such as that the input metadata regarding PPM referencing is correct. There are some configurations for which the translator makes no assumptions such as the output file format desired or the semantic conversions which the user may wish to implement.
The business logic that underlies the ease-of-use assumptions can be overridden by the user in cases where the user knows the metadata values and wishes to set them directly. For these cases, the user may either set metadata and configuration settings through command line flags, or may configure an XML file to be read by the spectrum translator. The latter is desirable for cases in which many different pieces of metadata need to be under user control, or for automated direction of CONNJUR-ST either through external programs or through batch-processing scripts.
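The override behaviour described above amounts to layering user-supplied values on top of self-configured ones. The sketch below shows that layering with plain maps; it does not reproduce CONNJUR-ST's actual flags or XML schema, and the class name and the metadata keys used in the example (`sweepWidth1`, `label1`) are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: user-supplied metadata takes precedence over inferred metadata. */
final class MetadataOverrideSketch {

    /** Starts from the values inferred from the input file, then applies user overrides. */
    static Map<String, String> resolve(Map<String, String> inferred,
                                       Map<String, String> userSupplied) {
        Map<String, String> resolved = new LinkedHashMap<>(inferred);
        resolved.putAll(userSupplied);  // any key set by the user replaces the inferred value
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> inferred = new LinkedHashMap<>();
        inferred.put("sweepWidth1", "8000");  // hypothetical key names
        inferred.put("label1", "H1");
        Map<String, String> user = new LinkedHashMap<>();
        user.put("label1", "HN");
        System.out.println(resolve(inferred, user));
    }
}
```

Whether the overrides arrive from command line flags or from an XML file, they could be reduced to the same key-value form before this step.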
Some flags control multiple metadata settings. In the case of the Varian file format, there are groups of metadata associated with acquisition dimensions of the experiment (e.g., number of points, sweep width), while other groups are associated with amplifier channels on the spectrometer (e.g., amplifier frequency, observation nucleus). While the association between channels and dimensions may be ambiguous, all channel-specific parameters are expected to be associated with the same dimension. Thus, the user can specify a channel-dimension link, in which case all parameters associated with a specific channel will be linked to the desired dimension. The CONNJUR website provides a dynamic webpage for generating the appropriate XML file given desired channel-dimension associations. A more sophisticated graphical interface to serve this purpose is currently under development.
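A channel-dimension link of the kind described above can be pictured as a small mapping step: given parameters grouped by channel and a user-specified channel-to-dimension assignment, every parameter of a channel is moved to its linked dimension together. The sketch is illustrative only; the class name, the map-based representation, and the parameter names in the example are assumptions, not the Varian or CONNJUR-ST vocabulary.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: regroup channel-specific parameters by their linked dimension. */
final class ChannelLinkSketch {

    /**
     * byChannel maps a channel number to its parameters; links maps channel number to
     * dimension number. All parameters of a linked channel go to the same dimension.
     */
    static Map<Integer, Map<String, String>> assign(Map<Integer, Map<String, String>> byChannel,
                                                    Map<Integer, Integer> links) {
        Map<Integer, Map<String, String>> byDimension = new TreeMap<>();
        for (Map.Entry<Integer, Integer> link : links.entrySet()) {
            Map<String, String> params = byChannel.get(link.getKey());
            if (params != null) {
                byDimension.computeIfAbsent(link.getValue(), d -> new TreeMap<>()).putAll(params);
            }
        }
        return byDimension;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, String>> byChannel = new TreeMap<>();
        byChannel.put(1, new TreeMap<>(Map.of("nucleus", "H1", "frequency", "600.13")));
        byChannel.put(2, new TreeMap<>(Map.of("nucleus", "N15", "frequency", "60.81")));
        Map<Integer, Integer> links = new TreeMap<>(Map.of(1, 1, 2, 2));
        System.out.println(assign(byChannel, links));
    }
}
```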
There are two obvious use cases that CONNJUR-ST does not currently support. One is pseudo-multidimensional data; the other is non-uniformly spaced data. Both are of importance to the NMR community and will be prioritized for future releases.
Discussion and conclusions
CONNJUR-ST represents an important step forward in filling in for the lack of standardization in NMR file formats. This lack of standardization creates a constant hindrance to NMR spectroscopists, who must continually rely on multiple, third-party tools for reformatting data for input into NMR data processing and analysis tools. Some processing choices, such as reprocessing NMRPipe data within VNMR, have not been available because no tool existed for performing the required translation.
CONNJUR-ST currently supports the translation of NMR spectral data. CONNJUR-ST allows conversion to and from six third-party formats: the two Varian file formats used in VNMR, the two Bruker formats used by TopSpin, the NMRPipe file format, and the standard file format for the Rowland NMR Toolkit (RNMRTK). We believe CONNJUR-ST will be primarily useful for back-converting data (for instance, converting NMRPipe datasets back to Bruker or Varian format for subsequent processing with the spectrometer software), cross-converting data between spectral reconstruction tools such as NMRPipe and RNMRTK, and for cases in which the traditional translators (var2pipe, RNMRTK’s loader) are too tedious or cumbersome to configure.
CONNJUR-ST is implemented in Java, and therefore is multi-platform irrespective of the platform specificity of the underlying tools whose format it supports. This provides for the use case of translating spectra on computer systems which are incapable of supporting the underlying software tools and their built-in translators. CONNJUR-ST is provided free and open source, and contributions from the NMR and software development communities are encouraged. It is with the hope of such help that we envision CONNJUR-ST eventually supporting translation of all available data formats and additional semantic conversions.
Future enhancements to CONNJUR-ST will include the addition of other tool file formats such as XEASY, Sparky and nmrViewJ. Other improvements would be the support of non-uniform datasets (datasets in which the time delay between collected datapoints is not uniform) and pseudo-multidimensional datasets, that is, datasets in which some of the dimensions do not represent frequency axes but rather the variation of other parameters, such as relaxation delays.
Additional use cases will also be explored. Currently, CONNJUR-ST is a command-line driven application for converting individual spectra. CONNJUR-ST can translate multiple spectra if driven by a script, but CONNJUR-ST does not accept multiple spectra as direct input. Providing for this capability will be explored. Also, a GUI is planned to ease the configuration of the advanced XML options used to customize the translation of data. Of course, we will continue to explore how best to utilize CONNJUR-ST within the main CONNJUR framework.
To foster the open source development of CONNJUR-ST, all code will be stored on SourceForge under the project name CONNJUR. The CONNJUR spectral object is documented using standard Javadoc technology (http://www.Java.com) with Javadocs being available through the CONNJUR website. The code is free and open for the NMR community to download, share and use. We intend to provide a manifesto detailing our expectations of how developer contributions will be invited, screened, tested and eventually accepted into the CONNJUR-ST project. While adoption into the CONNJUR-ST project is encouraged, additions/improvements may be distributed independently, in accordance with the GNU General Public License which we have adopted for CONNJUR-ST. We encourage community engagement and feedback on all CONNJUR software as well as the project goals and direction.
Footnote 1: In this context, complex refers to two orthogonal data collection channels, termed quadrature in signal processing. The two orthogonal channels may both be real data collected on independent probe coils, or can be virtual, representing alteration of pulse phases during the experiment.
This research was funded by United States National Institutes of Health grants EB-001496 and GM-083072. The authors wish to thank Drs. Frank Delaglio, Jeffrey Hoch, Mark W. Maciejewski and Alan Stern for useful conversations and assistance with NMRPipe and the Rowland NMR Toolkit file formats. The authors also wish to thank Agilent Technologies and Bruker Biospin for assistance with understanding the Varian and Bruker file formats, with particular thanks to Drs. George Gray and Clemens Anklin.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.