Data Collection and Curation
Key to the development of this software package was the creation of an extensive 2D spectral library containing TOCSY and HSQC spectra of pure metabolites. We used several publicly available sources in constructing this library. The majority of the raw 1H-1H TOCSY spectra were collected from the standard compound spectral library available at the BMRB database [24]. A few additional compound spectra were obtained from the MRMD [26]. The 1H-13C HSQC spectral library was downloaded from the HMDB [23]. These raw spectra contained a number of spectral artefacts (noise, water bands, asymmetries, peaks from TSP or DSS, contaminants, etc.). Consequently, it was necessary to convert these raw spectra into "synthetic" or "simplified" spectra containing only the peaks specific to the pure compounds of interest. This conversion was done manually, and each simplified, noise-free spectrum was checked for inconsistencies by comparing it to the original raw spectrum and to the compound's known resonance assignments. In total, the MetaboMiner TOCSY reference library includes spectra from 223 common metabolites and the MetaboMiner HSQC library contains spectra from 502 metabolites. The compounds in both libraries were further catalogued into three sub-libraries corresponding to three common human biofluids: cerebrospinal fluid (CSF), plasma and urine. The classification was based on their respective metabolic compositions listed in the HMDB [23]. Since the presence of these biofluid-specific metabolites was determined by a variety of technologies not limited to NMR, we further investigated the appearance of these metabolites in a large number of 1D 1H spectra collected from CSF, plasma and urine. The combined collection of compounds (and spectra) was used to create corresponding "common biofluid" 2D NMR spectral libraries that effectively represent a generic biofluid or cell extract.
The "CSF", "plasma", "urine", "biofluid" and "total" spectral libraries are stored as XML files and are editable via MetaboMiner's graphical user interface (GUI).
After the spectral libraries were constructed, each peak for each compound in each library was assigned a series of uniqueness values that are specific to that reference library. A unique peak in MetaboMiner is defined as a relatively isolated peak around which no peak from any other compound is observed, based on the spectral library of the given biofluid. For any given peak, its uniqueness value is calculated as the total number of surrounding peaks from other compounds within a given chemical shift "distance". Five distance levels were used to measure peak uniqueness. For 1H chemical shifts, the distance thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05 ppm. For 13C chemical shifts, the distance thresholds are 0.05, 0.10, 0.15, 0.20, and 0.25 ppm. For instance, an HSQC peak with a series of assigned uniqueness values of 0-0-0-1-2 indicates that no peak from any other compound in the reference library is observed within 0.03 ppm (1H dimension) and 0.15 ppm (13C dimension) of that peak. It also indicates that one peak from another compound in the spectral library was observed between 0.03 and 0.04 ppm (1H dimension) and between 0.15 and 0.20 ppm (13C dimension), and that a second peak from another compound was observed between 0.04 and 0.05 ppm (1H dimension) and between 0.20 and 0.25 ppm (13C dimension). The counts are therefore cumulative across the five levels. See Figure 1 for a more complete description of the uniqueness value concept. These uniqueness values are automatically updated after any spectral library change made through MetaboMiner's graphical user interface.
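The uniqueness calculation can be illustrated with a short Python sketch. This is not MetaboMiner's source code: the `Peak` tuple, the function name, and the choice of an inclusive rectangular window applied in both dimensions at once are assumptions made for illustration. Only the ten distance thresholds come from the text.

```python
from typing import List, Tuple

# Distance thresholds in ppm, one per uniqueness level (values from the text).
H_LEVELS = [0.01, 0.02, 0.03, 0.04, 0.05]  # 1H dimension
C_LEVELS = [0.05, 0.10, 0.15, 0.20, 0.25]  # 13C dimension

# Hypothetical peak record: (compound name, 1H shift, 13C shift).
Peak = Tuple[str, float, float]


def uniqueness_values(peak: Peak, library: List[Peak]) -> List[int]:
    """Return one count per level: peaks of other compounds inside the window."""
    name, h, c = peak
    values = []
    for dh, dc in zip(H_LEVELS, C_LEVELS):
        # Assumption: a neighbour counts only if it lies inside the window in
        # BOTH dimensions, and the window edges are inclusive.
        count = sum(
            1
            for other, oh, oc in library
            if other != name and abs(oh - h) <= dh and abs(oc - c) <= dc
        )
        values.append(count)
    return values


library = [
    ("A", 3.000, 50.00),
    ("B", 3.035, 50.17),  # first falls inside the level-4 window of A
    ("C", 3.045, 50.22),  # first falls inside the level-5 window of A
]
print("-".join(str(v) for v in uniqueness_values(library[0], library)))
# prints 0-0-0-1-2
```

The toy library reproduces the 0-0-0-1-2 example from the text: the counts are cumulative, so the peak of compound B is counted again at the fifth level.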
Peak Processing, Peak Matching and Compound Identification
As part of its input, MetaboMiner requires peak lists corresponding to the peaks that were identified in either the TOCSY or HSQC spectra collected from the biofluid(s) of interest. While it is possible for users to provide manually picked peak lists, MetaboMiner also supports processing of multidimensional NMR peak lists obtained from automatic peak-picking programs. Multidimensional NMR spectra typically contain substantial numbers of spectral artefacts such as baseline distortions, intense solvent lines, ridges, sinc wiggles, etc. Automatic peak-picking programs tend to mistake these artefacts for real resonances. Therefore, any raw 2D peak lists obtained from biofluid spectra must be processed appropriately before attempting to match them to MetaboMiner's reference spectral library. Two automated procedures were found to be very effective in cleaning up raw 2D peak lists: 1) streak removal and 2) symmetrical editing. Note that the latter processing technique is only applicable to TOCSY spectra. Spectral streaks are usually caused by residual solvent signals (i.e. water) or the presence of other compounds at extremely high concentrations. Streaks can be recognized by their specific locations and prominent shapes in the NMR spectra. Streak removal was implemented by searching for groups of peaks at these common locations and eliminating them from the peak list. Symmetrical editing exploits the fact that real TOCSY peak signals form a symmetrical square pattern about the diagonal. Off-diagonal peaks without any corresponding symmetrical peaks can be considered to be artefacts. Both the positions and the intensities (if provided by the user) of the corresponding peaks are examined for symmetry. Since TOCSY cross peaks are frequently not of equal intensity, we require that the intensity ratio between the upper- and lower-diagonal peaks be within a range of 0.8 to 2.5 for the pair to be considered symmetrical.
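A minimal Python sketch of symmetrical editing is given below for illustration; it is not MetaboMiner's implementation. The peak tuple layout, the positional tolerance of 0.02 ppm, the convention that the "upper" peak is the one with y > x, and the assumption that intensities are positive are all choices made here. Only the intensity ratio range is taken from the text.

```python
from typing import List, Optional, Tuple

# Hypothetical peak record: (x shift, y shift, intensity or None), shifts in ppm.
Peak = Tuple[float, float, Optional[float]]


def symmetrical_edit(
    peaks: List[Peak],
    position_tol: float = 0.02,  # assumed positional tolerance in ppm
    ratio_range: Tuple[float, float] = (0.8, 2.5),  # range quoted in the text
) -> List[Peak]:
    """Keep diagonal peaks and off-diagonal peaks that have a mirror partner."""
    low, high = ratio_range
    kept = []
    for x, y, inten in peaks:
        if abs(x - y) <= position_tol:
            kept.append((x, y, inten))  # diagonal peaks have no partner to check
            continue
        for mx, my, m_inten in peaks:
            # A mirror partner sits at the transposed position (y, x).
            if abs(mx - y) > position_tol or abs(my - x) > position_tol:
                continue
            if inten is None or m_inten is None:
                kept.append((x, y, inten))  # no intensities: position check only
                break
            # Assumption: "upper" means above the diagonal (y > x).
            upper, lower = (inten, m_inten) if y > x else (m_inten, inten)
            if low <= upper / lower <= high:
                kept.append((x, y, inten))
                break
    return kept


peaks = [
    (1.0, 1.0, 100.0),  # diagonal
    (1.0, 3.0, 10.0),   # cross peak, upper
    (3.0, 1.0, 8.0),    # its mirror, lower (ratio 10/8 = 1.25)
    (3.0, 3.0, 90.0),   # diagonal
    (2.0, 4.5, 5.0),    # no mirror partner: treated as an artefact
]
print(len(symmetrical_edit(peaks)))  # prints 4
```

Because the ratio is always computed as upper over lower, both members of a pair are kept or rejected together.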
In order to accommodate small chemical shift differences between the observed NMR spectra and the reference NMR spectra, an adaptive threshold method was implemented based on the uniqueness values (described above) of each reference peak. During the peak searching/matching process, the search threshold varies automatically according to the uniqueness values of the current peak: it is set to the largest distance level at which the peak is still unique. For instance, when searching for potential matches for a TOCSY peak with uniqueness values of 0-0-0-0-1, MetaboMiner will automatically set its threshold to 0.04 ppm. The peak matching and adaptive thresholding employ two processes: a reverse search strategy and a forward search strategy. In the reverse search strategy, the library peaks are searched and matched against the query peaks. Typically, most query peaks find their potential matches during this reverse search step. However, there are usually some peaks left without any matches. In order to assign these unmatched peaks, a forward search is performed in which the unmatched query peaks are searched against the reference library with expanded thresholds: 0.08 ppm for TOCSY spectra, and 0.12 ppm (1H) and 0.4 ppm (13C) for HSQC spectra. A match is accepted only if exactly one reference peak is found within this range.
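The two-stage search can be sketched in Python as follows, for the TOCSY case where both dimensions use 1H thresholds. This is an illustrative reading of the description, not MetaboMiner's code. The record layouts, the function names, and the fallback to the tightest level for a peak that is not unique even at 0.01 ppm are assumptions; the 1H levels and the 0.08 ppm expanded threshold come from the text.

```python
from typing import Dict, List, Sequence, Tuple

LEVELS = [0.01, 0.02, 0.03, 0.04, 0.05]  # 1H distance levels in ppm (from the text)

# Hypothetical records: a reference peak is (compound, x, y, uniqueness values),
# a query peak is (x, y); all shifts in ppm.
RefPeak = Tuple[str, float, float, Sequence[int]]
QueryPeak = Tuple[float, float]


def adaptive_threshold(uniqueness: Sequence[int]) -> float:
    """Largest level at which the reference peak still has no neighbours."""
    threshold = LEVELS[0]  # assumption: fall back to the tightest level
    for value, level in zip(uniqueness, LEVELS):
        if value != 0:
            break
        threshold = level
    return threshold


def near(q: QueryPeak, x: float, y: float, tol: float) -> bool:
    return abs(q[0] - x) <= tol and abs(q[1] - y) <= tol


def reverse_search(library: List[RefPeak], query: List[QueryPeak]) -> Dict[int, List[str]]:
    """Search each library peak against the query, using its own threshold."""
    matches: Dict[int, List[str]] = {}
    for compound, x, y, uniq in library:
        tol = adaptive_threshold(uniq)
        for i, q in enumerate(query):
            if near(q, x, y, tol):
                matches.setdefault(i, []).append(compound)
    return matches


def forward_search(library, query, matched, expanded: float = 0.08) -> Dict[int, str]:
    """Search unmatched query peaks against the library with a wider window."""
    extra: Dict[int, str] = {}
    for i, q in enumerate(query):
        if i in matched:
            continue
        hits = [c for c, x, y, _ in library if near(q, x, y, expanded)]
        if len(hits) == 1:  # accept only an unambiguous match
            extra[i] = hits[0]
    return extra


library = [("alanine", 1.47, 3.77, [0, 0, 0, 0, 1]),
           ("lactate", 1.32, 4.10, [0, 0, 0, 0, 0])]
query = [(1.50, 3.78), (1.38, 4.12), (7.00, 7.50)]
matched = reverse_search(library, query)
print(matched)                                  # {0: ['alanine']}
print(forward_search(library, query, matched))  # {1: 'lactate'}
```

The chemical shifts in the toy library are illustrative values, not entries copied from the MetaboMiner library.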
In MetaboMiner a compound is considered to be present only if its matched pattern satisfies the requirements of what we call "minimal signatures". A minimal signature is defined as the minimum peak set that can uniquely identify a compound from all others in a given spectral library. Based on the complete peak set of the reference spectral library, many minimal signatures can be derived through different combinations of unique peaks. A single peak match may be considered a minimal signature if it is completely unique. More peaks are required to define a minimal signature for less unique ones. For instance, in our current implementation, the presence of a single peak with uniqueness values 0-0-0-0-x (x >= 0) will determine the presence of the corresponding compound (subject to authenticity checks as discussed later); while at least two peaks with uniqueness values 0-0-0-x-x are required to reach the decision.
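The two rules quoted above can be written as a small Python check. This is a sketch under stated assumptions rather than MetaboMiner's implementation: the function names are hypothetical, and because the text gives no rules for peaks that are less unique than 0-0-0-x-x, such peaks simply do not contribute here.

```python
from typing import List, Sequence


def leading_zero_levels(uniqueness: Sequence[int]) -> int:
    """Number of consecutive distance levels, from the tightest, with no neighbours."""
    depth = 0
    for value in uniqueness:
        if value != 0:
            break
        depth += 1
    return depth


def satisfies_minimal_signature(matched: List[Sequence[int]]) -> bool:
    """`matched` holds the uniqueness values of a compound's matched peaks."""
    depths = [leading_zero_levels(u) for u in matched]
    # Rule 1 (from the text): one matched peak of type 0-0-0-0-x is enough.
    if any(d >= 4 for d in depths):
        return True
    # Rule 2 (from the text): otherwise at least two peaks of type 0-0-0-x-x.
    # Assumption: less unique peaks are ignored, since their rules are not given.
    return sum(1 for d in depths if d >= 3) >= 2


print(satisfies_minimal_signature([[0, 0, 0, 0, 3]]))                   # True
print(satisfies_minimal_signature([[0, 0, 0, 1, 2]]))                   # False
print(satisfies_minimal_signature([[0, 0, 0, 1, 2], [0, 0, 0, 2, 2]]))  # True
```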
Since query spectra (i.e. real spectra from biofluids) usually contain substantial levels of spectral noise, even after pre-processing, we found that we could reduce MetaboMiner's false positive rate even further by implementing several authenticity checks. These include: 1) having a minimum number of matched peaks (3 for TOCSY spectra and 1 for HSQC spectra), 2) having a minimum matched fraction of peaks (1/2 for TOCSY spectra and 1/6 for HSQC spectra), 3) ensuring the presence of certain peaks for certain compounds (determined by manual testing and validation for each compound), and 4) ensuring that the identified compounds were known to be in a given biofluid.
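As an illustration, the four checks can be combined into one predicate. The sketch below is hypothetical: the function signature, the use of peak identifiers for the compound-specific required peaks, and the treatment of the minimum values as inclusive bounds are assumptions. The numerical limits are those listed above.

```python
from typing import FrozenSet

# Limits from the text, keyed by spectrum type.
MIN_MATCHED = {"TOCSY": 3, "HSQC": 1}
MIN_FRACTION = {"TOCSY": 1 / 2, "HSQC": 1 / 6}


def passes_authenticity(
    spectrum_type: str,
    n_matched: int,
    n_reference: int,
    matched_ids: FrozenSet[str] = frozenset(),
    required_ids: FrozenSet[str] = frozenset(),  # compound-specific, hypothetical
    in_biofluid: bool = True,
) -> bool:
    """Apply checks 1 to 4 in order; any failure rejects the compound."""
    if n_matched < MIN_MATCHED[spectrum_type]:
        return False  # check 1: too few matched peaks
    if n_matched / n_reference < MIN_FRACTION[spectrum_type]:
        return False  # check 2: matched fraction too small
    if not required_ids <= matched_ids:
        return False  # check 3: a mandatory peak is missing
    return in_biofluid  # check 4: compound must be known in this biofluid


print(passes_authenticity("TOCSY", 3, 6))  # True: 3 peaks, fraction 1/2
print(passes_authenticity("TOCSY", 2, 3))  # False: fewer than 3 matched peaks
print(passes_authenticity("HSQC", 1, 7))   # False: fraction below 1/6
```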
User Interface Description
MetaboMiner's graphical user interface was implemented using Java Swing technology. The spectral visualization and manipulation tools were built using the JGraph library (Java open source graph visualization library, http://www.jgraph.com). Figure 2 illustrates a flowchart describing the MetaboMiner GUI. There are four main functional views, 1) a Processing View, 2) a Search View, 3) an Annotation View, and 4) a Library View. All these views share the same component arrangement, with panels on the right side being used for visualizing and manipulating peaks, and the panels on the left being used for displaying parameters, compound lists, structure images, etc. Navigation to each view is readily accessible by clicking an appropriate menu item.
When the program launches, the default view is the "Processing View", where users can copy and paste the automatically picked peak list. The input format must be either a two- or three-column list, with numbers separated by a space or a semicolon. The first two columns must be the x and y chemical shift coordinates of each peak in the 2D spectrum, and the optional third column must be the peak height or peak intensity. After processing the raw peaks, both the original and the processed spectra will be displayed on MetaboMiner's spectral viewing panel (located on the right). With this viewing panel, users can directly edit peaks on the spectrum if necessary. For manually picked peaks, this step can be skipped by turning the processing options off. By clicking the "Search" button, MetaboMiner's "Search View" will be displayed with its initial, automated compound identification results. Users can adjust the search threshold or switch the reference library to further refine the result. A compound is marked as identified if the matched pattern passes the authenticity checks and satisfies the minimal signature requirement. The raw match scores are also displayed. MetaboMiner's interface allows users to visually inspect the matched peaks of any metabolite against the corresponding reference spectrum. By right-clicking any peak displayed on the spectrum, users can search the library for this particular peak. The identified compound list can be saved in three different formats by clicking the "Export" button. A screenshot of MetaboMiner's "Search View" is shown in Figure 3.
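For readers preparing input files, the two- or three-column peak list format accepted by the Processing View could be parsed along the following lines. This is a hypothetical helper written for illustration, not part of MetaboMiner; skipping blank lines and raising an error on malformed rows are assumptions.

```python
import re
from typing import List, Optional, Tuple

# (x shift, y shift, intensity or None); shifts in ppm.
Peak = Tuple[float, float, Optional[float]]


def parse_peak_list(text: str) -> List[Peak]:
    """Parse rows of 2 or 3 numbers separated by spaces or semicolons."""
    peaks: List[Peak] = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        fields = [f for f in re.split(r"[;\s]+", line) if f]
        if len(fields) not in (2, 3):
            raise ValueError(f"line {number}: expected 2 or 3 columns, got {len(fields)}")
        x, y = float(fields[0]), float(fields[1])
        intensity = float(fields[2]) if len(fields) == 3 else None
        peaks.append((x, y, intensity))
    return peaks


print(parse_peak_list("1.47 3.77\n1.32;4.10;1500\n"))
# [(1.47, 3.77, None), (1.32, 4.1, 1500.0)]
```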
Users can further refine the automated search results by manually annotating the raw 2D spectrum. By clicking the "Refine" button in the "Search View", the "Annotation View" will be launched with the identified compounds being transferred as the starting point. Users can also directly enter the "Annotation View" mode by clicking the "Annotate" button from the "Console" menu. In order to perform manual annotation, users first need to load a high resolution spectral image in PNG format and set up the spectral axes properly. Peak searching is performed by right clicking the peak position on the spectrum to search the reference library as shown in Figure 4. All compounds that generate peaks within the search threshold will be checked. The compound with the closest peak match will be highlighted with its database reference spectrum displayed on the uploaded "raw" spectrum. Users can perform peak annotation for any currently displayed compound. Double clicking any database peak will open a small text editor where users can enter the peak assignment or a comment. The peak pattern of the identified compounds can also be edited to match the experimental spectrum. For example, users can insert, delete, or drag a database peak to match the observed peak in the raw spectrum. These changes will be valid only for the current session. To make permanent changes, MetaboMiner's "Library View" must be used.
The "Library View" is intended for browsing and managing MetaboMiner's spectral libraries. To view all the available reference spectra in MetaboMiner's libraries, users must click the "Browse" button in the "Library" menu. Double clicking any compound in the compound list will open a popup window for peak editing. Any changes will be reflected on the spectrum in real time. New compounds can be introduced by clicking the "New" button at the bottom of the compound list. A new compound can either be imported from another library or be created from scratch through the wizard dialog. Both editing peaks and adding new compounds will trigger an update of the uniqueness values of the affected peaks. Researchers who study other types of biological samples (e.g. plant or microbial extracts) may either use MetaboMiner's generic spectral reference library or create a new library customized for that particular type of sample. Library creation or deletion can be easily accomplished by clicking the appropriate menu items in the "Library" menu. The compounds in the default reference library are linked to PubChem, the HMDB [23], and the BMRB [24] via the hyperlink under their structure icon. The "Graphics" menu enables users to change the size, shape, or color of the synthetic peaks to suit their preferences.
It is important to note that MetaboMiner does not support spectral processing such as phasing, baseline correction or chemical shift referencing. Many high-quality NMR processing packages are available for this task, including NMRPipe [27], Felix (Molecular Simulations, Inc., San Diego, CA), VNMR (Varian, Inc., Palo Alto, CA), and XWinNMR (Bruker Analytik GmbH, Karlsruhe, Germany), to name a few. These tools should be used prior to loading spectral images into MetaboMiner. In other words, MetaboMiner is not a spectral processing tool, but an NMR-based metabolomics tool that provides automatic peak processing, rapid compound identification, and facile spectrum annotation through an intuitive graphical interface. MetaboMiner is available at: http://wishart.biology.ualberta.ca/metabominer/
Evaluation protocol
MetaboMiner was assessed in a variety of ways using both synthetic and experimental NMR spectra. The synthetic spectra were generated from the 162 compounds that have both TOCSY and HSQC spectra in the reference library. The experimental spectra were collected from three defined compound mixtures (totalling 72 compounds) and a biofluid sample of known composition (plasma). These evaluations allowed a complete and comprehensive assessment of MetaboMiner's performance as well as its potential strengths and limitations.
The effects of different types of spectral noise on compound identification
The performance of the minimal signature method and the adaptive threshold method was evaluated under two common types of spectral noise: missing peaks and "drifting" peaks (i.e. peaks that have drifted from their canonical positions due to temperature, pH or solvent effects). The missing peaks were simulated by deleting peaks of each compound at random with probabilities of 0%, 10%, 20%, 30%, 40%, and 50%. The chemical shift drift effects were simulated by adding random values of ± 0.01, ± 0.02, ± 0.03, ± 0.04, ± 0.05 ppm to each 1H chemical shift, and ± 0.05, ± 0.10, ± 0.15, ± 0.20, ± 0.25 ppm to each 13C chemical shift. The spectra of each synthetic query mixture were generated by first pooling the peaks from 50 compounds that were randomly selected from the MetaboMiner reference spectral library (162 compounds). After introducing this artificial spectral noise, the query mixtures were searched against the reference spectral library with and without using the adaptive threshold method. Two compound identification strategies were compared: the minimal signature method (MS) and the percentage match method (PM) with 75% as the cut-off value. The F-measure was used for performance evaluation, where F = 2 × (precision × recall)/(precision + recall). Here, precision is the proportion of returned compounds that are truly present (precision = TP/(TP+FP)), and recall is the proportion of truly present compounds that are returned (recall = TP/(TP+FN)). The values were obtained as the averages of TOCSY and HSQC search results over 50 iterations. Figure 5A summarizes MetaboMiner's performance using data with different fractions of missing peaks. Figure 5B shows the results using data with increasing chemical shift drift effects.
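For concreteness, the F-measure of a single search result can be computed from the set of compounds returned and the set of compounds actually in the mixture. The sketch below uses the formulas precision = TP/(TP+FP) and recall = TP/(TP+FN) given above. The function name and the convention of returning 0 when there are no true positives are assumptions.

```python
from typing import Set


def f_measure(identified: Set[str], present: Set[str]) -> float:
    """F = 2 * precision * recall / (precision + recall) for one search result."""
    tp = len(identified & present)  # correctly identified compounds
    fp = len(identified - present)  # identified but not in the mixture
    fn = len(present - identified)  # in the mixture but missed
    if tp == 0:
        return 0.0  # assumption: define F as 0 when nothing correct is returned
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


present = {"alanine", "lactate", "glucose", "citrate"}
identified = {"alanine", "lactate", "glucose", "valine"}
print(f_measure(identified, present))  # 0.75 (precision = recall = 3/4)
```

In the evaluation described above, such values would be averaged over the TOCSY and HSQC searches and over the 50 iterations.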
The effects of different spectral data types on compound identification
We further investigated the usefulness of different NMR data types for compound identification based on our concept of a minimal spectral signature. Four NMR data types were compared – 1D 1H, 1D 13C, 1H TOCSY, and 1H-13C HSQC spectra. For this particular evaluation, reference 1D 1H and 1D 13C spectra were obtained from the corresponding 1H and 13C chemical shifts of MetaboMiner's HSQC spectral library. For a small number of compounds, these artificial 1D spectra lacked some of the expected 1H or 13C signals that might be seen in a real 1D NMR spectrum, but their absence also helped to simulate the fact that some peaks in 1D NMR spectra are broadened or washed out due to signal overlap or solvent suppression.
Synthetic 2D NMR spectra (query spectra) representing different biofluids of increasing molecular complexity were generated by pooling peaks of 20, 30, 40, 50, 60, 70, and 80 compounds randomly selected from MetaboMiner's reference spectral library. To further simulate noise or pH/salt effects, 10% of the peaks from the query spectra were deleted at random, followed by the introduction of random chemical shift changes (± 0.01 ppm for 1H and ± 0.05 ppm for 13C) to the remaining peaks. The resulting peaks were subsequently searched against MetaboMiner's reference spectral library using the adaptive threshold method. The F-measures were averaged over 50 iterations. The results are summarized in Figure 6.
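The construction of a synthetic HSQC query mixture can be sketched as follows. This is an illustration under assumptions, not the script used in the study: the library is represented as a hypothetical dictionary of compound names to (1H, 13C) peak lists, each peak is deleted independently with probability 0.10 (rather than removing exactly 10% of the peaks), and the shift changes are drawn uniformly from the stated ± ranges.

```python
import random
from typing import Dict, List, Optional, Tuple

Shift = Tuple[float, float]  # (1H shift, 13C shift) in ppm


def make_query_mixture(
    library: Dict[str, List[Shift]],
    n_compounds: int,
    p_missing: float = 0.10,  # deletion probability per peak
    drift_h: float = 0.01,    # maximum 1H drift in ppm
    drift_c: float = 0.05,    # maximum 13C drift in ppm
    seed: Optional[int] = None,
) -> Tuple[List[str], List[Shift]]:
    """Pool the peaks of randomly chosen compounds, then delete and shift them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(library), n_compounds)
    query: List[Shift] = []
    for name in chosen:
        for h, c in library[name]:
            if rng.random() < p_missing:
                continue  # simulate a missing peak
            # Assumption: drift is uniform within the stated range.
            query.append((h + rng.uniform(-drift_h, drift_h),
                          c + rng.uniform(-drift_c, drift_c)))
    return chosen, query


toy_library = {
    "A": [(1.0, 20.0), (2.0, 30.0)],
    "B": [(3.0, 40.0)],
    "C": [(4.0, 50.0), (5.0, 60.0)],
    "D": [(6.0, 70.0)],
}
chosen, query = make_query_mixture(toy_library, n_compounds=2, seed=1)
print(chosen, len(query))
```

The list of chosen compounds serves as the ground truth against which the identification result is scored.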
Compound identification using experimental spectra
Twelve 2D NMR experiments (six TOCSY and six HSQC) were collected under different pH conditions using three synthetic mixtures and a plasma sample. The three synthetic mixtures were composed of 27, 21, and 24 common metabolites, respectively, with concentrations ranging from 40 to 60 mM. The plasma sample contained 35 identifiable metabolites (ranging in concentration from 0.1 to 10 mM) as determined by independent profiling of its 1D 1H NMR spectra by several experienced individuals using Chenomx's NMR Suite software [12]. These results were further confirmed by spiking/doping authentic standards into the plasma sample and by GC-MS analysis. The plasma sample was prepared by filtering the sample through a 3 kDa filter (to remove proteins), then lyophilizing and finally dissolving the remaining solids in distilled water to 1/5 of the original volume. Deuterium oxide (D2O) was added to make a final concentration of 90% H2O and 10% D2O. All spectra were acquired at 25°C. Six spectra were collected on a Varian INOVA 800 MHz spectrometer equipped with a 5 mm triple axis gradient cryoprobe. The other six spectra were collected on a Varian INOVA 500 MHz spectrometer with a 5 mm triple-resonance z-gradient probe. The TOCSY experiments were performed using the wgtocsy pulse sequence, and the HSQC experiments were performed using the gChsqc pulse sequence, both provided by Varian's BioPack. For the TOCSY experiments, the spectral width was set to 11990 Hz and the mixing time to 50 milliseconds. Sixteen transients were collected for each of 128 t1 intervals using an acquisition time of 0.085 seconds and a relaxation delay of 2.0 seconds. The total acquisition time for the TOCSY spectra was 2.5 hours. For the 13C-HSQC experiments, the spectral widths of the proton and carbon dimensions were 11990 Hz and 28160 Hz, respectively. Sixty-four transients were acquired for each t1 interval using an acquisition time of 0.085 seconds and a relaxation delay of 1.0 seconds. The spectra were collected with 2048 × 256 complex points for the 1H and 13C dimensions, respectively. The total spectral acquisition time for the HSQC spectra was 5 hours. Sample TOCSY and HSQC spectra are available [see Additional File 1].
The raw NMR spectra were first processed using NMRPipe [27] and the peaks were subsequently picked using Sparky's [28] automatic peak picking program. The resulting "raw" peak lists were copied and pasted into the Processing View of MetaboMiner. Both peak processing and compound identification were performed using MetaboMiner's default parameter sets. The reference library used for the synthetic mixtures was the biofluid (common) library. For plasma data, the plasma (common) library was used. To assess the degradation in performance assuming no prior knowledge of the sample source (urine, plasma, cell extract or generic biofluid), the complete spectral reference library (223 compounds for TOCSY, 502 compounds for HSQC) was also used to identify compounds. To assess the performance of the web servers that support 2D NMR mixture analysis, namely the HMDB [23], the MMCD [25], the BMRB [24], and SpinAssign [29] of PRIMe (http://prime.psc.riken.jp/), the same set of peak lists was submitted to each. For PRIMe, the default search parameters were used. For the other web services, the search threshold was set to 0.03 ppm for 1H and 0.10 ppm for 13C. The results are summarized in Tables 1 and 2.
Table 1 Performance evaluation using HSQC data collected at pH ~7.2.
Table 2 Performance evaluation using TOCSY data collected at pH ~7.2.