Integration of untargeted metabolomics with transcriptomics reveals active metabolic pathways

While recent advances in metabolomic measurement technologies have been dramatic, extracting biological insight from complex metabolite profiles remains a challenge. We present an analytical strategy that uses data obtained from high resolution liquid chromatography–mass spectrometry and a bioinformatics toolset for detecting actively changing metabolic pathways upon external perturbation. We begin with untargeted metabolite profiling to nominate altered metabolites and identify pathway candidates, followed by validation of those pathways with transcriptomics. Using the model organisms Rhodospirillum rubrum and Bacillus subtilis, our results reveal metabolic pathways that are interconnected with methionine salvage. The rubrum-type methionine salvage pathway is interconnected with the active methyl cycle in which re-methylation, a key reaction for recycling methionine from homocysteine, is unexpectedly suppressed; instead, homocysteine is catabolized by the trans-sulfuration pathway. Notably, the non-mevalonate pathway is repressed, whereas the rubrum-type methionine salvage pathway contributes to isoprenoid biosynthesis upon 5′-methylthioadenosine feeding. In this process, glutathione functions as a coenzyme in vivo when 1-methylthio-d-xylulose 5-phosphate (MTXu 5-P) methylsulfurylase catalyzes dethiomethylation of MTXu 5-P. These results clearly show that our analytical approach enables unexpected metabolic pathways to be uncovered. Electronic supplementary material The online version of this article (doi:10.1007/s11306-014-0713-3) contains supplementary material, which is available to authorized users.


Electronic Supplementary Material
1.2 Growth conditions and feeding experiment of R. rubrum.
1.5 LC-MS data analysis platform.
1.6 References. Tables   Table S1. Performance comparison of putative peak annotation approaches for R. rubrum. Table S2. Performance comparison of putative peak annotation approaches for B. subtilis. Table S3. Actively changing metabolic pathways and their constituent metabolites in B. subtilis detected by the enrichment analysis. Table S4. Abundance comparison of detected metabolites in the isoprenoid biosynthesis pathway between R. rubrum and B. subtilis. Table S5. Actively changing metabolic pathways and their constituent metabolites of R. rubrum detected by the enrichment analysis. Table S6. Genes and primers list used by qRT-PCR experiments. Table S7. Quantitative real time polymerase chain reaction (qRT-PCR) of B. subtilis. Table S8. Quantitative real time polymerase chain reaction (qRT-PCR) of R. rubrum. Table S9. Genes and their CPMs of control (CPM_Ctrl) and treatment (CPM_Trt) in the RNAseq experiment (separate Excel file). Figure S1. The LC-MS analysis platform system design.

MTA feeding experiment:
For three biological samples prepared as described above, metabolite depletion of the concentrated cells was immediately carried out at 37 °C in 5 mL minimal medium without sulfur for 10 min. The cells were then transferred back onto ice. The OD 600 (ca. 24) was measured for each in triplicate by 50-fold dilution of a small aliquot into minimal medium in a cuvette. The averaged OD value of the concentrated cells was then used to dilute the cells to aliquots of OD 600 = 6 at a final volume of 1 mL. Sulfur feeding was initiated by addition of concentrated MTA or sulfate to 1 mM before transferring to 37 °C for 0, 2, 5, or 15 min. Immediately after incubation, the cells were transferred to a 4 °C centrifuge and spun at 16,100 × g for 2 min, transferred to ice, and the supernatant was quickly removed by pipetting. Immediately following removal of the supernatant, the cells were frozen with liquid nitrogen and stored at -80 °C prior to extraction.

Growth conditions and feeding experiment of R. rubrum
Cell growth: R. rubrum (DSM 467, ATCC 11170) and its MTXu 5-P methylsulfurylase mutant (a gift from R. Tabita and Jaya Singh) were grown aerobically in LB medium or in the dark at 30° C on 20 -2,000 mL minimal medium with sulfate or MTA as the sole sulfur source as described previously. (Erb et al. 2012 cultures were harvested at OD 578 = 0.6 (Genesys 20 spectrophotometer) by centrifugation, and the cells were kept on ice unless otherwise indicated. The cells were then washed with 50 mL minimal media without sulfur.

In vitro reactions with 1-methylthio-D-xylulose-5-phosphate methylsulfurylase
Unless otherwise stated, chemicals used for synthesis were purchased from Sigma-Aldrich, and buffers and reaction mixture components were purchased from Fisher Scientific. Production of glutathione-methylthiol disulfide and DXP from MTXu 5-P was monitored by 1 H NMR with solvent suppression and LC-FTMS. The reactions (700 μL) in 100% H 2 O contained: 60 mM Tris-HCl (pH 7.9), 15 mM NaHCO 3 , 1 mM MgCl 2 , 6 mM glutathione (or without glutathione), 4 mM MTXu/Ru 5-P, and 5 µM R. rubrum RLP. MTXu/Ru 5-P was enzymatically prepared from MTRu 1-P using the R. rubrum RLP. Immediately before inserting samples into the spectrometer, 100 μL of D 2 O was added. The spectrum for MTXu/Ru 5-P was acquired and then the R. rubrum 1-methylthio-D-xylulose 5-phosphate methylsulfurylase was added to a final concentration of 10 μM.

Liquid chromatography (LC)-Fourier transform (FT) mass spectrometry (MS) metabolomics
Metabolite extraction: Metabolite extractions were performed at room temperature, with solvent B (10 mM ammonium bicarbonate buffer (pH 9.2) containing 90% (v/v) acetonitrile) used for the ZIC-pHILIC column (Merck SeQuant AB, S-907 19, Umeå, Sweden) on the LC-FTMS instrument described below. For each cell pellet originated from 1 mL of cells of OD 600 = 6,375 µL, solvent B was added, vortexed for 15 min, and centrifuged two times at 16,100 × g for 5 min before adding to a fresh 96-well plate in the HPLC autosampler.
LC-FTMS analysis: LC-FTMS was performed using an 11T LTQ-FT Ultra mass spectrometer (Thermo Fisher Scientific, Waltham, MA) equipped with an Agilent 1200 HPLC system (Agilent Technologies, Santa Clara, CA). 100 µL of extracted metabolites were injected onto a 2.1 × 150 mm ZIC-HILIC column equilibrated with solvent A, 10 mM ammonium bicarbonate (pH 9.2). A linear gradient of 100% solvent B to 40% solvent B over 35 min, and then 40% solvent B to 100% solvent B in 10 min, followed by a 15 min re-equilibration at a flow rate 200 of µL/min was used for metabolomics experiments. For in vitro reaction monitoring, a shorter gradient was used: 100% B for 5 min followed by a linear gradient down to 40% B over 10 min, then a return to 100% B over 5 min followed by a 15 min re-equilibration for 15 min. In the case of B. subtilis, for each time point and condition (sulfate versus MTA feeding), three injections (technical replicates) were analyzed for each of three extractions (biological replicates). For R. rubrum analysis, four biological samples were each analyzed in triplicate for each time point and condition. All data were collected in negative profile mode at resolution of 50,000 with full scan set to m/z 100-1000.

LC-MS data analysis platform
Data processing: We used XCMS (Smith et al. 2006), a freely available and popular software package for LC-MS data pre-processing, including peak detection, peak matching, and retention time alignment. The raw LC-MS files were converted into XML format using the program MM File Conversion, one of the MassMatrix Mass spec Data File Conversion Tools (Xu and Freitas 2009). These XML-formatted files were used as input for XCMS. We used a binning method for fast data processing at the peak detection step, but it has some drawbacks. Apart from an optimal bin size issue, XCMS combines maximum signal intensities from adjacent slices into an overlapping combined chromatogram, thereby mixing potentially distinct peaks into a single peak. This peak mixing phenomenon reduces the benefits of using high resolution LC-MS instruments, even though XCMS uses an intensity-weighted averaging strategy to reduce this effect. In addition, we might not detect close-by and partially overlapping features due to a post-processing step (implemented by the 'rectUnique()' function in the source code) that eliminates any peaks in the vicinity of higher intense peaks. This can be controlled by setting the 'mzdiff' option as a minus value, but in this case, we can get the same peak identification for several different peaks, which might be problematic for automatic data processing. XCMS provides another algorithm, centWave (Tautenhahn et al. 2008), that circumvents these issues by collecting the region of interest and using continuous wavelet transformation, but the longer processing time required by this method to search the entire list of regions of interest was prohibitive to our use. To properly address the issues with the binning method, we changed the source code in XCMS to filter minor mixing peaks in the major peaks using a 2 ppm mass tolerance in each bin allowing for more precise mass and area calculations. Also, we changed the rounding function to provide a precision of 5 decimal points for mass and to include retention times for peak identifications, allowing automatic processing in the following procedures. We set mzdiff = -1.001 to detect close-by and partially overlapping peaks. To get as many peaks as possible, we set the signal-to-noise threshold at 2 or to the lowest value, which does not produce peak alignment errors (e.g., normally, snthresh = 2 or snthresh = 4). The parameters for xcmsSet(), group(), and rector() were set to the default values.
Data pre-processing: Multiple data pre-processing steps were introduced for data quality control (retention time filtering, non-biological peaks elimination, missing value imputation, and normalization), data redundancy elimination (adducts, isotopologues, and multimers), and automatic isotope pattern analysis for the experimental data. We eliminated peaks with retention times of less than 120 s or greater than 2,030 s, since salts and chemical impurities were frequently detected in these regions in our solvent system. Non-biological peaks were further filtered based on the mass distribution (e.g., integral versus non-integral) of metabolites in the KEGG and PubChem databases ( Figure 1 and Figure S2). The portion of peaks in between high density areas was just below 0.3%, even in the PubChem database, containing large numbers of synthetic molecules. Peaks in this area were eliminated by the following criteria: i) if integral ≤ 100, then nonintegral ≥ 0.90 or ≤ 0.10; ii) if integral > 100 and ≤ 200, then non-integral ≥ 0.82 or ≤ 0.20; iii) if integral > 200 and ≤ 300, then non-integral ≥ 0.82 or ≤ 0.30; iv) if integral > 300 and ≤ 400, then non-integral ≥ 0.82 or ≤ 0.40; and v) if integral > 400 and ≤ 500, then non-integral ≥ 0.82 or ≤ 0.50). A non-biological filter was not applied when masses were greater than 500 m/z according to the mass distribution analysis. Missing intensity values were set to the integer values of the minimum intensity in each experimental data set (i.e., missing value imputation). A matrix containing all-by-all mass differences was constructed to check adducts, multimers (e.g., mostly dimers and trimers) and isotopologues (e.g., C, N, S, and O) based on a 2 ppm mass tolerance and a 60-s retention time tolerance for all detected peaks. For every peak, if there were mass differences corresponding to adducts or multimers, those peaks were eliminated, and isotopologues were also removed after isotope patterns (i.e., relative abundance and masses) were stored as a matrix for later use in molecular formula determination. For the remaining peaks, we calculated the mean intensities of biological and technical replicates and the relative ratios of mean intensities between control samples and treatment samples at each time point. Peaks were classified into the monoisotopic group or the isotopic group according to the presence of isotopologues. The isotopic S4 group was subdivided into the primary group and the secondary group based on the abundance change. The primary group included peaks with isotopic patterns showing > 20% of abundance change at each time point.

Molecular formula determination:
For peaks within an isotopic group, molecular formulas were determined using six atoms (e.g., C, N, O, P, S, and H), because most biological molecules are mainly composed of these atoms.
Halogenated molecules were excluded at this point, but they can be easily included if necessary. Molecular formula determination was carried out using three functional modules: a theoretical molecular formula modeling module, theoretical isotopic pattern modeling module, and matching probability calculation module between theoretically predicted isotopic patterns and experimentally observed isotopic patterns. Theoretical molecular formulas were predicted based only on monoisotopic masses using previously published algorithms (Liptak 2007;Bocker et al. 2009;Bocker et al. 2008). Briefly, the algorithm constructs a data structure called an extended residue table to compute the smallest decomposable integer over given integer masses, and generates all possible decompositions of a query mass, recursively.
For real-valued decompositions, the algorithm transforms the integer knapsack problem (Kellerer et al. 2004) with realvalued coefficients into a problem instance with integer coefficients by introducing a blowup factor (i.e., scaling factor).
This blowup factor inevitably causes a rounding error. To avoid this rounding error, previous studies decomposed additional integers by extending the upper bound using the maximum relative rounding error based on hydrogen atoms. A blowup factor can affect search space in the residue table due to the condition of coprimality between masses. In this study, we used 25,000 as a blowup factor. All of the possible molecular formulas were filtered by applying the seven golden rules (Kind and Fiehn 2007), except the trimethylsilylated compounds rule, which is only applicable to gas chromatography MS data, to select the most likely and chemically correct molecular formulas. For each predicted formula, we simulated theoretical isotopic patterns based on a dynamic programming approach in the context of the Markov process as described in a prior study (Snider 2007). We compared simulated isotopic patterns with the experimentally measured isotopic patterns based on Bayesian statistics, as described previously (W. Zhang and Chait 2000;N. Zhang et al. 2002;Bocker et al. 2009;J. Zhang et al. 2005). Only the top three ranked molecular formulas were used for searching against the KEGG database.
Seed metabolites: Peaks in the primary group were searched against the publicly available KEGG database, using nominal masses and the top three predicted molecular formulas with 5 ppm mass tolerance. If there were any hits, we considered those as seed metabolites, referring to the metabolites showing: higher abundance changes upon perturbation, clear isotope patterns in experimental data, and search hits in the database. Metabolic pathway information was extracted from these seed metabolites and used to construct initial pathway clusters. Since the same metabolite can participate in many different pathways, multiple pathway information can be extracted from a single seed metabolite. Initial pathway clusters were used to further annotate peaks in the secondary group and the monoisotopic group. Since it is difficult to annotate monoisotopic peaks with high confidence, even if there are hits in the database, additional information is needed to increase confidence. Retention time is starting to be used, but there are still some hurdles to overcome, such as different solvent systems. Pathway information and chemical reaction information can be used as alternatives to confer high confidence upon putative annotation.
Dynamic construction of metabolite sets (i.e. implicated pathways): After detecting seed metabolites and their respective pathways, peaks in the secondary group and the monoisotopic group were searched against the KEGG database with 5 ppm mass tolerance. Peaks in the secondary group can provide enhanced molecular formula information and increase the reliability of the search hits. For hits in the database, we compared the pathways and chemical reactions of those hits with those of the seed metabolites in the initial pathway clusters. If they matched with each other, we added those metabolites into the pathway clusters. This pathway comparison makes it possible to annotate monoisotopic peaks with high confidence which are usually difficult to annotate based solely on single mass information. A running sum across all N metabolites was computed, and the maximum enrichment score (MES) was considered as the maximum observed positive deviation of the running sum for every metabolite set.
For a null distribution, we permuted the class labels for 1,000 times and recorded the maximum enrichment score for every permutation. The maximum enrichment score from the actual data was then compared to this null distribution, providing a nominal P value to indicate statistical significance. Every metabolite set with p < 0.05 was considered as active metabolic pathway.  Seed metabolites and clustering by pathways can reduce false positives and increase coverage by using monoisotopic peaks without isotopic information. a UID: Number of unique peak identifications having search hits in KEGG database. b TSH: Total number of search hits in KEGG database. c HRPP: Hit ratio per peak calculated by TSH/UID. The performance of putative peak annotation is represented by the hit ratio per peak (HRPP). Note that our approach shows the lowest HRPP values but the higher number of unique peaks in comparison with the conventional approach (i.e., search with mass and molecular formulas), implying better coverage and lower false positives. Search based only on mass cannot annotate peaks with confidence without any biological knowledge and manual curation. Seed metabolites and clustering by pathways can reduce false positives, and increase coverage by using monoisotopic peaks without isotopic information. a UID: Number of unique peak ID having search hits in KEGG database b TSH: Total number of search hits in KEGG database c HRPP: Hit ratio per peak calculated by TSH/UID A similar pattern of performance was observed in the B. subtilis metabolomics data. Our approach shows better coverage and lower false positives. 1.33 × 10 -3 6.56 × 10 -4 met salvage pathway a 3'-Methylthiopropionate b 4-Methylthio-2-oxobutanoate c 1,2-Dihydroxy-3-keto-5-methylthiopentene S10 d 2,3-Diketo-5-methylthiopentyl-1-phosphate/2-Hydroxy-3-keto-5-methylthiopentenyl-1-phosphate e S-Methyl-5-thio-D-ribose/S-Methyl-5-thio-D-ribose-1-phosphate/S-Methyl-5-thio-D-ribulose-1-phosphate/S-Methyl-5-thio-D-xylulose-1-phosphate f 5'-Methylthioadenosine

S15
1 Candidate genes for qRT-PCR are manually selected based on metabolomics analysis. The purine metabolism is divided into the purine salvage pathway and the de novo purine biosynthesis pathway. Primer sequences read 5' to 3', and expression patterns are monitored at selected time points (0 min, 2 min, 5 min, and 15 min). Candidate genes for qRT-PCR are selected manually based on metabolomics analysis. Detected active pathways are further categorized based on previous studies and plausible hypothesis. Primer sequences read 5' to 3', and expression patterns are monitored according to the time points (i.e., 0 min, 10 min, 20 min, 0.6 OD).  (Table S9). Figures   Fig. S1. The LC-MS analysis platform system design. Our LC-MS analysis platform was composed of four major functional modules, including data processing, peak grouping, molecular formula determination and pathway activity profiling modules. Each module consisted of sub-modules to deal with arising issues in the metabolomics data analysis. Non-biological signals and redundant peaks were filtered by a data pre-processing sub-module. Missing values were imputed at the data quality control stage. Remaining peaks were categorized into the isotopic group (primary and secondary groups) or the tertiary group. Molecular formulas were predicted in the molecular formula prediction module based on previous published algorithms. Based on predicted molecular formulas and nominal masses, peaks were putatively annotated and active pathways were assigned by MSEA. Notably, the concept of seed metabolites and dynamic build-up of metabolite sets to be evaluated enabled the streamlined process to detect active pathways from raw LC-MS data. These detected active pathways were further validated by transcriptomics, and metabolites highly perturbed but not annotated were listed for further experiments. The distribution of the integral parts and the non-integral parts of the masses in the PubChem database was investigated as a proof-of-concept of the mass filter. As with the KEGG database, there was a region not occupied by molecules in PubChem, even though many synthetic molecules are listed in the PubChem database. Only > 0.3% of molecules appeared in-between the high density areas, indicating that this region can be used as a mass filter for non-biological signals at the data processing step without much information loss. By eliminating spurious peaks in this region, we enhanced processing time and concentrated on the major peaks in the subsequent steps.   Figure 5S. B) Two minute R. rubrum MTA feeding reaction cell extract. C) Control R. rubrum cell extract that had not been fed exogenous MTA. Only one peak is present in the spiked sample, thus supporting the assignment of the peak at m/z 352.0636 as glutathione-methanthiol disulfide in the MTA feeding reactions. The presence of glutathionemethanethiol disulfide in the unfed cell extract supports glutathione as the in vivo cosubstrate for 1-methylthio-D-xyulose 5-phosphate methylsulfurylase in R. rubrum.

Fig. S5. Pearson correlation between metabolites in implicated pathways (R. rubrum).
Metabolites-metabolites correlation analysis was carried out based on Pearson correlation analysis. Pearson correlation coefficients of percentage changes in metabolite abundances were clustered by agglomerative hierarchical clustering. Euclidean distance was used as a distance measure. In each pathway, there was a clear correlation between constituent metabolites. In addition, there was a strong correlation between intertwined active pathways. The non-mevalonate isoprenoid biosynthesis showed a moderate correlation, and sulfur metabolism showed an anti-correlated pattern. This is because metabolites in sulfur metabolism were down-regulated, whereas metabolites in other pathways were up-regulated.

Fig. S6. Pearson correlation between metabolites in implicated pathways (B. subtilis).
In B. subtilis, metabolites in each active pathway showed strong correlation, but there was a weak correlation between the subtilis-type methionine salvage pathway and the purine salvage pathway. In addition, there was an anti-correlation between de novo purine biosynthesis and other pathways.