Background

Metabolomics, the study of the population of small molecules in a cell, has drawn intense interest in fields from medicine to synthetic biology because it can provide a fine-grained representation of cellular state and activity [1–4]. Of particular interest is untargeted metabolomics, which seeks to measure as much of the metabolome as possible by limiting methodological detection bias. The dominant analysis technique for untargeted metabolomics is chromatography coupled with mass spectrometry (MS), but this method is hindered by a large number of unknown peaks [5] and the limited number of reference spectra available to identify the peaks [6]. A number of tools have been developed to propose structural matches for unannotated peaks [7–11], but in practice these tools either return too many candidates when drawing from large chemical databases such as PubChem [12] or miss compounds not yet present in curated biochemical databases [13, 14]. This has the effect of locking untargeted metabolomics in an unfortunate paradox: compounds that are not present in biochemical databases are not identified and, in the absence of experimental identification, new compounds cannot be added to databases [15].

There is a growing consensus that many enzymes mediate undocumented side-reactions (known as promiscuous activities) as a result of exposure to diverse cellular metabolites [16, 17]. These activities may explain unannotated peaks in metabolomics datasets [18, 19] but are difficult to detect as they may be overshadowed by a known function [20] or be dependent on intracellular conditions [21]. Predicting novel chemical reactions based on broad enzyme specificity has been utilized by a number of tools for the prediction of new biochemical pathways [22–24]. Recently, this technique has also been used to expand structure databases for metabolomics by the MyCompoundID tool [25], the In Vivo/In Silico Metabolites Database (IIMDB) [15], LipidHome [26] and others [27, 28].

Here we present Metabolic In silico Network Expansions (MINEs) that utilize the Biochemical Network Integrated Computational Explorer (BNICE) [29, 30] to expand on general biochemical databases as well as organism-specific databases for Escherichia coli and yeast. The focus on endogenously present and organism-specific metabolites has been cited as critical to improving the confidence of compound matches [5], and thus we complement existing resources that focus on human metabolism. In principle, these predictions could also be made using Reaction Difference Matching (RDM) [23], machine learning methods [31, 32], or other rule-based methods such as ChemAxon’s Metabolizer. Each of these approaches has its benefits; the output depends chiefly on the quality and coverage of the reaction rules used in the analysis. We selected BNICE because we have a set of BNICE reaction rules that have been demonstrated to reproduce a large fraction of known biochemical reactions [24], as well as to predict enzyme reactions that were subsequently verified experimentally [33]. Importantly, we also have the right to redistribute BNICE output. No license is required for academic users to access the website or APIs, and all BNICE-predicted compounds are available for download in SDF format from the website.

Construction and content

Construction of MINE databases follows the steps depicted in Fig. 1: BNICE expansion, Standardization and Annotation. The standardization and annotation procedure was guided by previous databases that combine reaction and compound data from various sources [34, 35].

Fig. 1

MINE database construction and access methods. The process of constructing a MINE database from the curated source databases is depicted on the left. The methods for accessing the database are shown on the right.

Compound information was obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Release 68.0) [36], the Yeast Metabolome Database (YMDB) (Version 1.0) [37] and EcoCyc (Version 17.0) [38]. Generalized compounds (those containing R groups), inorganic compounds, and disconnected fragments were removed using the Pybel toolkit [39]. Generalized structures are of very limited utility, as they cannot be assigned an accurate mass or represented in a canonical form. Where possible, we encourage developers to avoid ambiguity by enumerating all possible structures in their databases. Additionally, biochemical databases often contain numerous duplicate compounds [40]; these were identified by Standard InChIKey [41] comparison and removed for computational efficiency.
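The filtering and de-duplication step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Pybel-based procedure used to build the MINEs: it assumes each source record already carries a SMILES string and a Standard InChIKey, treats a `*` wildcard atom as the marker of a generalized structure and a `.` as the marker of disconnected fragments, and omits the inorganic-compound filter. The record layout and the `src_*` identifiers are hypothetical.

```python
def is_usable(smiles):
    """Reject generalized ('*' wildcard atom) and disconnected ('.') structures."""
    return "*" not in smiles and "." not in smiles


def deduplicate(compounds):
    """Keep the first usable record seen for each Standard InChIKey."""
    seen = set()
    unique = []
    for compound in compounds:
        if not is_usable(compound["smiles"]):
            continue
        key = compound["inchikey"]
        if key in seen:
            continue  # same structure already kept from an earlier record
        seen.add(key)
        unique.append(compound)
    return unique


# Illustrative records only; the field names are assumptions.
compounds = [
    {"id": "src_A", "smiles": "CCO", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"id": "src_B", "smiles": "OCC", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"id": "src_C", "smiles": "*CO", "inchikey": None},
    {"id": "src_D", "smiles": "CC(=O)[O-].[Na+]", "inchikey": "VMHLLURERBWHNL-UHFFFAOYSA-M"},
    {"id": "src_E", "smiles": "CC(O)=O", "inchikey": "QTBSBXVTEAMEQO-UHFFFAOYSA-N"},
]

kept = deduplicate(compounds)
print([c["id"] for c in kept])
```

Here the two ethanol records collapse to one entry because they share an InChIKey even though their SMILES strings differ, which is the reason for comparing keys rather than raw structure strings.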

The BNICE framework has previously been used to explore alternate biosynthetic and xenodegradation pathways through the iterative application of generalized reaction rules. Unlike some approaches that model only a specific class of chemistry (e.g. cytochrome P450 metabolism), these reaction rules span the breadth of the Enzyme Commission (EC) classification system and have been hand-curated by examining reactions at the third level of EC specificity. Figure 2 demonstrates the process of encoding the common reactive site motifs as well as the bonds that are broken or formed. A total of 198 of these generalized chemical reaction rules were applied to all compounds in a given source database, resulting in a MINE database of predicted products and chemical reactions.

Fig. 2

Generalizing a BNICE reaction rule from known biochemical reactions. The common motif of the hydrolysis of the 1,3-diketone is shaded for emphasis.

BNICE products may take a variety of tautomeric forms depending on the source structure and the nature of the operator applied. Therefore, products were processed with ChemAxon’s Standardizer & Structure Checker (JChem 6.0.4, 2013) to ensure canonical valences and placement of charge. Natural Product Likeness scores [42] and estimated logP values were calculated with a standalone Java ARchive (JAR) package and ChemAxon’s Calculator Plugins (JChem 6.0.4, 2013) respectively. Estimated Kováts Retention Indices were calculated using the NIST RI algorithm [43].

Compounds were matched to the PubChem [44] and KEGG COMPOUND databases using the connectivity block of their InChIKeys for annotation. Generated compounds are assigned identifiers based on a hash of the canonical SMILES [45] for internal use and a numeric MINE ID for human readability. Finally, the exact mass and chemical fingerprints of structures were calculated with Pybel.
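To make the matching and identifier scheme concrete, the sketch below compares two InChIKeys on their connectivity block (the first 14 characters, before the first hyphen) and derives an identifier from a canonical SMILES string. The text does not state which hash function or prefix is used for compound identifiers, so the choice of SHA1 and the `C` prefix here are assumptions made for illustration, as are the function names.

```python
import hashlib


def connectivity_block(inchikey):
    """Return the first InChIKey block, which encodes the molecular skeleton."""
    return inchikey.split("-")[0]


def same_skeleton(key_a, key_b):
    """True when two InChIKeys agree on connectivity, ignoring stereo and charge."""
    return connectivity_block(key_a) == connectivity_block(key_b)


def compound_id(canonical_smiles):
    """Hypothetical identifier: 'C' plus the SHA1 hex digest of the SMILES."""
    digest = hashlib.sha1(canonical_smiles.encode("utf-8")).hexdigest()
    return "C" + digest


# L-alanine versus alanine without stereochemistry: same skeleton, different keys.
l_alanine = "QNAYBMKLOCPYGJ-REOHAHCFSA-N"
alanine_flat = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
print(same_skeleton(l_alanine, alanine_flat))
print(compound_id("CCO"))
```

Matching on the connectivity block alone means that stereoisomers of a known compound are annotated as that compound, which suits accurate mass data where stereochemistry cannot be resolved.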

Compound and reaction data are stored as collections in a Mongo Database (v2.6.2). A compound entry contains the chemical formula, exact mass, InChIKey, canonical SMILES [45], FP2 and FP4 fingerprints and lists of reactions in which the compound is predicted to participate as a reactant or product. A compound may also be annotated with additional information such as common names or database links if it matches a KEGG or PubChem entry. Reactions are uniquely identified by an ‘R’ followed by the SHA1 hash of the sorted chemical reaction. Reaction entries contain arrays of reactants and products as tuples of the stoichiometric coefficient and the compound ID, as well as a list of the operators that predicted the reaction.
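The reaction identifier described above can be illustrated as follows. The exact text serialization that is hashed is not specified here, so the format used in `_side` (terms sorted by compound ID and joined into a string) is an assumption, as are the field names of the returned dictionary, the placeholder compound IDs and the operator label.

```python
import hashlib


def _side(terms):
    """Serialize one side of a reaction with terms in a fixed, sorted order."""
    ordered = sorted(terms, key=lambda term: (term[1], term[0]))
    return " + ".join(f"({coeff}) {cid}" for coeff, cid in ordered)


def reaction_id(reactants, products):
    """'R' plus the SHA1 hex digest of the sorted reaction string."""
    text = _side(reactants) + " => " + _side(products)
    return "R" + hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_reaction(reactants, products, operators):
    """Build a reaction entry; terms are (stoichiometric coefficient, compound ID)."""
    return {
        "_id": reaction_id(reactants, products),
        "Reactants": sorted(reactants, key=lambda term: (term[1], term[0])),
        "Products": sorted(products, key=lambda term: (term[1], term[0])),
        "Operators": sorted(operators),
    }


# Placeholder compound IDs and operator name, for illustration only.
rxn_a = make_reaction([(1, "C_aaa"), (1, "C_bbb")], [(2, "C_ccc")], ["op_example"])
rxn_b = make_reaction([(1, "C_bbb"), (1, "C_aaa")], [(2, "C_ccc")], ["op_example"])
print(rxn_a["_id"] == rxn_b["_id"])
```

Sorting before hashing is what makes the identifier unique per reaction: the same reaction predicted by two operators, or written with its reactants in a different order, maps to a single entry whose operator list can then be extended.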

Utility and discussion

Database validation

Table 1 summarizes a few key statistics to compare MINEs to other commonly used databases. The most conservative metabolite-prediction database is IIMDB [15], which utilizes a combination of absolute and relative reasoning rules [46] based on human xenometabolism to constrain the size of the database. Two other methods using computationally predicted metabolites, MyCompoundID [25] and Ridder et al.’s green tea metabolites [27], begin with much smaller metabolite starting sets than KEGG COMPOUND but utilize broader reaction rules and permit more sequential transformations. MINE operators specify reactant substructures but involve no relative likeliness calculations and therefore generate more compounds than IIMDB, but fewer than MyCompoundID. The relative increase between the starting metabolite set and the resulting MINE is dependent on the specific compounds present in the starting database. For example, YMDB contains more high-molecular-weight compounds than EcoCyc and thus contains more reaction sites and generates more derivatives. As with the IIMDB, the majority of compounds in MINE databases are not found in PubChem (when searching with the InChIKey connectivity block), which indicates that MINEs are largely composed of novel structures. An analysis of the overlap in compounds represented in IIMDB was not performed due to licensing restrictions.

Table 1 Comparison of MINEs generated from various source databases and other databases containing computationally predicted metabolites

Figure 3 displays the Natural Product (NP) Likeness scores [42] for 500,000 randomly sampled PubChem compounds and for the entirety of the KEGG COMPOUND and KEGG MINE databases. NP Likeness is calculated by scoring characteristic atomic signatures that are present in the query molecule. Scores range from −3 to 3, with higher scores indicating a compound that contains more natural than synthetic structural features. Despite being a common source of candidate structures for annotating metabolomics data, the PubChem sample is clearly skewed towards synthetic compounds. In contrast, KEGG consists primarily of natural product-like compounds, and the average KEGG MINE compound is even more so. This shift is due to the action of reaction rules in BNICE that mimic detoxification metabolism acting on the least natural compounds in KEGG and additional reactivity of operators with high NP likeness (see Additional file 1). This bias toward NP-like compounds makes the KEGG MINE a preferable source of candidate structures for unknown pathway intermediates and peaks in untargeted experiments.

Fig. 3

Histogram of Natural Product Likeness. This plot shows the distribution of Natural Product Likeness Scores for the KEGG Database (mean score 0.77), the KEGG MINE (mean score 0.98) and a random sample of 500,000 PubChem compounds (mean score −0.52). A more positive score indicates more natural atomic features.

Web interface description

The web interface for the MINE databases has been designed for a range of user needs such as (a) investigation of potential enzymatic transformations, (b) annotation of accurate masses and (c) chemical structure search. Users may access a compound of interest with a variety of identifiers such as InChIKeys, database IDs or common names, or with structure-based tools like substructure and structural similarity searching. Compound pages display a set of name, pathway and enzyme annotations inferred from KEGG as well as the in silico predicted reactions that a compound may take part in as a reactant or product. Additionally, we provide a web interface for the annotation of accurate mass LC–MS data, as shown in Fig. 4. This utility provides users a way to search for potential matches for a large number of mass-to-charge ratios and a color-coded interface that enables users to rapidly focus on the most probable putative identifications.

Fig. 4

Screenshot of Metabolomics search results. This screenshot displays features of the metabolomics results, including filtering by attributes and highlighting (blue) of compounds present in a specified KEGG genome reconstruction.

Use case: annotation of accurate mass datasets

As a demonstration of the potential of MINEs for annotation of accurate mass data, a diverse test set of 667 unique compounds was compiled from MassBank [47]. The databases were searched by exact precursor mass-to-charge (m/z) ratio with 2 mDa precision and with [M+]+, [M+H]+, [M+Na]+, [M−H]− and [M+CH3COO]− adducts. The results of this validation are displayed in Table 2. Using KEGG as the source database, structures were suggested for 84.5% of the m/z values. The KEGG MINE database annotated an additional 14% of compounds while maintaining a similar accuracy to the KEGG annotations. PubChem annotates a comparable number of these known compounds to the KEGG MINE but does so at the expense of returning a bin of candidates that is two orders of magnitude larger than that of the MINE. While the MINE database has a higher median number of structures per peak than the KEGG database, the number remains feasible to examine manually. The web interface facilitates this process by distinguishing compounds that are present in user-specified KEGG genome reconstructions from those generated by computational means, hence allowing users to consider the most probable isomers first. Additionally, users may restrict structures to a range of partition coefficients or Kováts retention index values. Candidate structures can then be downloaded as a Microsoft Excel-compatible CSV file for further review.
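The adduct-based mass search can be sketched as follows. This is a simplified illustration and not the code behind the web interface: it assumes singly charged ions, uses standard monoisotopic mass shifts for the five adducts named above (rounded to six decimal places), and searches a small, hypothetical in-memory candidate list sorted by exact mass. The function and variable names are invented for this example.

```python
from bisect import bisect_left, bisect_right

# Mass shift (Da) from neutral monoisotopic mass M to the observed ion, z = 1.
ADDUCT_SHIFTS = {
    "[M+]+": -0.000549,       # loss of one electron
    "[M+H]+": 1.007276,       # gain of a proton
    "[M+Na]+": 22.989221,     # gain of a sodium cation
    "[M-H]-": -1.007276,      # loss of a proton
    "[M+CH3COO]-": 59.013853, # gain of an acetate anion
}
POSITIVE = ["[M+]+", "[M+H]+", "[M+Na]+"]
NEGATIVE = ["[M-H]-", "[M+CH3COO]-"]


def annotate(mz, candidates, adducts, tol=0.002):
    """Return (adduct, name) pairs whose neutral mass lies within tol Da.

    candidates must be a list of (exact_mass, name) tuples sorted by mass.
    """
    masses = [mass for mass, _ in candidates]
    hits = []
    for adduct in adducts:
        neutral = mz - ADDUCT_SHIFTS[adduct]
        lo = bisect_left(masses, neutral - tol)
        hi = bisect_right(masses, neutral + tol)
        hits.extend((adduct, name) for _, name in candidates[lo:hi])
    return hits


# Hypothetical candidate table (monoisotopic masses in Da).
candidates = sorted([
    (180.063388, "glucose"),
    (180.063388, "fructose"),
    (89.047678, "alanine"),
])

print(annotate(203.0526, candidates, POSITIVE))
```

Because isomers share an exact mass, a single m/z value typically returns several candidates, as with glucose and fructose here. This is why the size of the candidate bin, and tools for ranking within it, matter as much as whether a match is found.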

Table 2 Annotation of MassBank data

Finally, to demonstrate the practical utility of MINE databases, we utilized the EcoCyc MINE to annotate untargeted metabolomics data from an E. coli knockout study analyzed by LC–MS. The protocols for sample extraction, data acquisition and post-processing are available in the supplementary information. A total of 493 distinct exact MS features were extracted, 30 of which were identified following a traditional annotation workflow using NIST MSPepsearch (see Additional file 2); in contrast, the EcoCyc MINE database proposed candidates for 132 of the accurate masses when searching with 5 mDa precision and with [M+]+, [M+H]+ and [M+Na]+ adducts. The resulting MINE candidates were consistent with 93% of the NIST MSPepsearch results.

Of these 132 features, 79 matched at least one of the metabolites proposed in the MINEs by the BNICE method. We selected one of these features, which also exhibited statistically significant variation in peak height across our experimental samples, for further study. The EcoCyc MINE database returned one potential hit for this metabolite, a phosphoethanolamine (PE) lipid that we were not able to identify with our traditional workflow. LipidBlast [11] was used to confirm that the MS–MS fragmentation pattern, presented in Fig. 5, is consistent with PE (32:1), more specifically, PE (16:0/16:1), which is also present as a predicted but unidentified lipid in the LipidHome database [26]. Detection and verification of novel metabolites is ongoing but beyond the scope of this article.

Fig. 5

Positive MS spectrum (a), positive MS/MS spectrum (b) and negative MS/MS spectrum (c). The positive MS spectrum provides the mass of the precursor ion [M+H]+ = 690.5099 Da and its isotopic abundance pattern. The prominent ion in the positive MS/MS spectrum corresponds to the neutral loss of the phosphoethanolamine head group. The negative MS/MS spectrum shows the molecular ion [M−H]− as well as a pair of ions corresponding to the (16:0) and (16:1) side chains.
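As a worked check of the precursor mass, the theoretical [M+H]+ value for PE (16:0/16:1) can be computed from its elemental composition and compared with the observed 690.5099 Da. The molecular formula C37H72NO8P is not stated in the text and is assumed here as the standard composition of a PE (32:1) lipid; the monoisotopic atomic masses are standard values, rounded.

```python
# Monoisotopic atomic masses (Da), rounded; the proton mass excludes the electron.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "P": 30.97376200,
}
PROTON = 1.00727645


def monoisotopic_mass(formula):
    """Sum atomic masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


# Assumed composition of PE (16:0/16:1): C37H72NO8P.
pe_32_1 = {"C": 37, "H": 72, "N": 1, "O": 8, "P": 1}

theoretical_mz = monoisotopic_mass(pe_32_1) + PROTON
observed_mz = 690.5099
error_mda = (observed_mz - theoretical_mz) * 1000

print(f"theoretical [M+H]+ = {theoretical_mz:.4f} Da")
print(f"mass error = {error_mda:.1f} mDa")
```

Under these assumptions the theoretical value is about 690.507 Da, so the observed ion deviates by roughly 3 mDa, inside the 5 mDa search window used for the EcoCyc MINE query.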

Further development

In addition to the existing web tools, the underlying MINE databases are accessible through free, developer-friendly APIs. Clients are available for integration into Python, Perl and JavaScript frameworks at https://github.com/JamesJeffryes/MINE-API. This API allows the databases to be integrated into existing candidate ranking algorithms and pipelines. Future versions of these databases will incorporate transformation rules for spontaneous chemical reactions of metabolites, and improved filtering and prioritization of candidate structures.

In addition to expanding the scope of the metabolome, the MINE framework also offers a pipeline for illuminating the synthesis and degradation of poorly annotated secondary metabolites. While applied very broadly to nearly all of metabolism in this study, BNICE expansions may be focused on a region of interest in the metabolic network by adjusting the starting compounds and permissible transformations in a manner similar to that recently demonstrated by Ridder et al. [27]. These targeted MINEs will integrate the generation of plausible pathways by BNICE with the tools to detect the presence of predicted pathway intermediates with accurate mass spectrometry, thereby accelerating the process of proposing and evaluating hypothetical enzymatic synthesis routes for a number of compounds of interest.

Conclusions

Here we have presented Metabolic In silico Network Expansions (MINEs) that utilize generalized biochemical transformations to propose structures for use in untargeted metabolomics. The resulting compounds are rarely found in PubChem but are structurally similar to natural products. We have demonstrated the utility of these databases for proposing correct metabolite structures that stymied a standard annotation workflow. MINE data are accessible without licensing restrictions for non-commercial users through a user-friendly web interface and through an API for developers in several common scripting languages.

Availability and requirements

MINE databases are freely accessible at http://minedatabase.mcs.anl.gov and API clients are available at https://github.com/JamesJeffryes/MINE-API. There are no restrictions for academic use. Commercial users must obtain a license from Pathway Solutions Inc. (www.pathway.jp) and explicit permission from the authors.