1 Introduction

Untargeted metabolomics aims to describe living systems by the set of metabolites present in a cell at a certain moment in time and under specific environmental constraints (Fiehn 2002; Dettmer et al. 2007; Oldiges et al. 2007). Since metabolites are the final link between gene expression and the phenotype exhibited by the cell, metabolomics represents a valuable tool for achieving a better understanding of an organism’s phenotype (Fiehn 2002; Oldiges et al. 2007). The study of the metabolome is complementary to the other “omics” sciences (genomics, transcriptomics, proteomics, fluxomics…) and fits well with the general approach of systems biology (Arita 2009).

Important advances have been made in recent years in untargeted metabolite profiling in different research fields, from human health to nutrition (Scalbert et al. 2009; Kamleh et al. 2008). However, metabolomics is still an emerging field in the post-genomic arena. For example, due to the chemical diversity of cellular metabolites and the complexity of the cell extracts, there is no single method that can separate, detect and identify all small molecules present in a cell extract. Furthermore, the Achilles’ heel of metabolomics remains the identification and structure elucidation of metabolites (Kind and Fiehn 2010). Sometimes, fragmentation patterns of the molecules can be used for identification. For metabolomics data, the detected fragment patterns can, e.g., be matched to online databases, like Metlin (Smith et al. 2005), and assigned a quality score. In our experiments, however, we have observed that the scan time of the LTQ-Orbitrap is considerably affected by the inclusion of fragmentation steps, making the normal LC-MS data stream fragmentary and difficult to analyze automatically. As a more convenient alternative, the Orbitrap Exactive platform (without the linear ion trap but with faster scan speeds) can be used to capture more data points using the positive–negative polarity switch mode (Lu et al. 2010). Thus, matching on mass alone to databases is currently the most commonly used method. Unfortunately, this approach to metabolite identification is very seriously hampered by the fact that the vast majority of the signals in the data set can be caused by contaminants in the sample or LC-MS system (Keller et al. 2008), technical artefacts and so-called “derivative peaks” (Scheltema et al. 2009). In many cases, several peaks or signals share the same identification, even if signals are detected with an accuracy of better than 2 ppm, as is routinely possible using, e.g., modern Fourier Transform mass spectrometers, like the Orbitrap (Scheltema et al. 2008). Such spurious peaks need to be checked manually and assigned to their real identification or discarded if the signal shows typical artefacts.

Our goal was to develop an analytical method that would be able to eliminate a substantial part of the spurious signals from the data set. This required the development of new approaches and the collection of an unusual type of data on biological samples and mixtures of analytical standards, to distinguish real effects from spurious fluctuations in LC-MS analyses and peak detection algorithms. The strategies developed here will be generally useful for metabolomics.

2 Materials and methods

2.1 Amino acid standard mixture samples

A mixture of 38 physiological amino acid standards (Product No. A9906, Sigma) was used. In the stock solution, amino acids and related compounds are contained at a final concentration of 0.5 μmol/ml ± 4% in 0.2 N lithium citrate buffer, pH 2.20, containing thiodiglycol (2% w/v) and phenol (0.1% w/v) as antioxidant and preservative, respectively. The concentration in the injected diluted samples is described in Table 1.

Table 1 Dilution factor and concentrations of the analysed samples

2.2 Biological samples

Analytical samples were obtained from Streptomyces coelicolor wild-type M145 strain (Bentley et al. 2002). Bacteria were grown in 50 ml liquid minimum medium (Nieselt et al. 2010) as described (Takano et al. 2001).

Cells from 25 ml of culture were collected on a 0.45 μm filter by vacuum filtration and washed twice with 25 ml of 2.63% NaCl solution. For cell quenching, the filter with the collected cells was quickly moved into 60% methanol solution (HPLC-grade, Boom, The Netherlands) pre-chilled at −20°C and frozen in liquid nitrogen. Samples were stored at –80°C until metabolite extraction was performed.

Metabolites were extracted by three freeze–thaw cycles. In each cycle, cells were thawed in an ethanol bath at −20°C (~15 min), vortexed vigorously for 1 min and then immediately frozen in liquid nitrogen for 5 min. After the third cycle, the samples were centrifuged at 4500 rpm for 10 min at −9°C. The supernatant (cell extract) was collected and stored at −80°C until LC-MS analysis. Before analysis, the extracts were diluted with the same dilution factors as the analytical standards mixture, resulting in eight samples with different metabolite concentrations.

2.3 LC-Orbitrap MS analysis

The analytical mixtures and cell extracts were analyzed by liquid chromatography coupled to a high-accuracy LTQ Orbitrap XL mass spectrometer (Thermo Fisher Scientific, Germany).

Two chromatographic columns were used: a reversed-phase Shim-pack XR-ODS C18 column (Achrom, Belgium) (3.0 × 75 mm, 2.2 μm, Shimadzu Corp.) and a ZIC-HILIC column (Achrom, Belgium) (150 × 2.1 mm, 3.5 μm, Merck Sequant AB) fitted with a ZIC-HILIC PEEK guard column (Achrom, Belgium) (15 × 1.0 mm; 5 μm, Merck Sequant AB).

For the C18 column, the flow rate was set to 0.6 ml/min; the mobile phase consisted of (A) 0.1% formic acid in water and (B) 0.1% formic acid in acetonitrile. A gradient of 18 min was used. The elution of solvent B started at 2% for the first 2 min and was increased to 95% within 8 min. This composition was maintained for 2 min, after which the elution of B was decreased to 2% within 1 min. To re-equilibrate the system, the elution of B was held at 2% for 5 min.

For the ZIC-HILIC column, the flow rate was set to 0.1 ml/min; as buffers, (A) 0.1% formic acid in acetonitrile and (B) 0.1% formic acid in water were used. A gradient of 40 min was applied. Solvent A was set to 80% as starting condition. The elution fraction of solvent B was increased to 40% within 6 min and maintained at 40% for 12 min, after which solvent B was increased to 90% in a 4 min-interval. This composition was held for 2 min after which B was decreased to 20% in 2.5 min. The gradient was held at 20% B for 13.5 min to re-equilibrate the system.
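As a cross-check of the two gradient programs described above, the segments can be written down as data and their durations summed. The following Python sketch is illustrative only: the tuple layout, the helper names and the assumption that each ramp is linear are ours and are not taken from the instrument method.

```python
# Gradient segments as (duration_min, start_percent_B, end_percent_B),
# transcribed from the text. Linear ramps are an assumption of this sketch.
C18_GRADIENT = [
    (2.0, 2.0, 2.0),    # hold at 2% B
    (8.0, 2.0, 95.0),   # ramp to 95% B
    (2.0, 95.0, 95.0),  # hold at 95% B
    (1.0, 95.0, 2.0),   # return to 2% B
    (5.0, 2.0, 2.0),    # re-equilibration
]
HILIC_GRADIENT = [
    (6.0, 20.0, 40.0),   # start at 80% A (20% B), ramp to 40% B
    (12.0, 40.0, 40.0),  # hold at 40% B
    (4.0, 40.0, 90.0),   # ramp to 90% B
    (2.0, 90.0, 90.0),   # hold at 90% B
    (2.5, 90.0, 20.0),   # return to 20% B
    (13.5, 20.0, 20.0),  # re-equilibration
]


def total_minutes(segments):
    """Total run time of a gradient program in minutes."""
    return sum(duration for duration, _, _ in segments)


def percent_b(segments, t):
    """Percentage of solvent B at time t (minutes), interpolating linearly."""
    if t < 0:
        raise ValueError("time must be non-negative")
    elapsed = 0.0
    for duration, start, end in segments:
        if t <= elapsed + duration:
            return start + (end - start) * (t - elapsed) / duration
        elapsed += duration
    raise ValueError("time is beyond the end of the gradient")
```

The segment durations add up to the stated run times of 18 min (C18) and 40 min (HILIC).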

The injected sample volume was 5 μl for both columns. Two technical replicates were recorded for the C18 analysis and three for the HILIC analysis.

The system was operated with the electrospray ionization source in positive mode. Full-scan spectra were obtained over an m/z range of 50–1000.

ULC-grade acetonitrile, formic acid and water were purchased from Biosolve (Netherlands).

2.4 Data processing

Raw data files from the mass spectrometer were converted into the mzXML format by the ReAdW.exe utility (a tool of the Trans-Proteomic Pipeline software collection, downloaded from http://tools.proteomecenter.org/wiki/index.php?title=Software:ReAdW).

The CentWave (Tautenhahn et al. 2008) feature detection algorithm from the XCMS (Smith et al. 2006) package was applied to each individual data file. Further processing was handled by the flexible data processing pipeline mzMatch (Scheltema et al. 2011), which performed noise removal (Windig 2004) and several steps of signal filtering and peak matching. The first matching step aligned the chromatographic features between technical replicates of a single sample. Peaks that were not detected in all technical replicates were discarded from further analysis. In the second matching step, the chromatographic peaks, which had been combined into single files containing the technical replicates in the previous step, were aligned to each other across all eight dilutions. After combining the eight measurements in a single file, there were still peak sets that did not include peaks from every sample. Such gaps were filled by extracting ion chromatograms within the retention time and mass window of the given peak set directly from the raw data files.
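The first matching step (keep only peaks found in every technical replicate) can be sketched in a few lines of Python. This is a deliberately simplified greedy matcher and not the mzMatch algorithm; the peak tuple layout, the function names and the tolerances `ppm_tol` and `rt_tol` are hypothetical values chosen for illustration.

```python
def ppm_diff(mz_a, mz_b):
    """Absolute mass difference between two m/z values in parts per million."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def match_across_replicates(replicates, ppm_tol=5.0, rt_tol=30.0):
    """Return peak sets that are present in every technical replicate.

    `replicates` is a list of peak lists; each peak is a tuple
    (mz, retention_time_s, intensity). The first replicate is used as
    the reference (a simplification of this sketch).
    """
    reference, *others = replicates
    kept = []
    for ref_peak in reference:
        peak_set = [ref_peak]
        for replicate in others:
            candidates = [
                p for p in replicate
                if ppm_diff(p[0], ref_peak[0]) <= ppm_tol
                and abs(p[1] - ref_peak[1]) <= rt_tol
            ]
            if not candidates:
                break  # missing in one replicate: discard the peak
            # take the candidate closest in retention time
            peak_set.append(min(candidates, key=lambda p: abs(p[1] - ref_peak[1])))
        else:
            kept.append(peak_set)
    return kept


# Made-up example: the second peak is absent from replicate 2 and is dropped.
replicate_1 = [(118.0863, 60.0, 1.0e6), (200.1000, 120.0, 5.0e4)]
replicate_2 = [(118.0864, 61.5, 9.0e5)]
matched = match_across_replicates([replicate_1, replicate_2])
```

Requiring detection in all replicates is the strict criterion described above; the tolerances would have to be tuned to the mass accuracy and retention time stability of the actual platform.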

Derivative signals (isotopes, adducts, dimers and fragments) were automatically annotated by correlation analysis on both signal shape and intensity pattern, as described (Scheltema et al. 2009). These peaks were not discarded and their assigned annotations were taken into account in the subsequent analysis.

Putative identifications were made by matching the detected masses to a database of Streptomyces coelicolor (ScoCyc) metabolites, a contaminants database (Keller et al. 2008), and the list of analytical standards in the standard mixture. The metabolite database was obtained from a genome annotation file created by Jonathan Moore as part of the SysMO STREAM project (https://www.wsbc.warwick.ac.uk/groups/sysmopublic/), which is also available for download from the BioCyc project page (Karp et al. 2009) as a flat-file in Pathway Tools format (Karp et al. 2002).
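To illustrate what matching on mass alone involves, the Python sketch below computes theoretical m/z values from elemental formulas and reports all database entries within a ppm window. It is not the mzMatch implementation: the tiny database, the function names and the restriction to [M+H]+ ions are assumptions made for this example; the 5 ppm default mirrors the window quoted in the Results.

```python
# Monoisotopic masses of the most abundant isotopes (u) and the proton mass.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727646

# Illustrative mini-database: compound name -> elemental composition.
DATABASE = {
    "alanine": {"C": 3, "H": 7, "N": 1, "O": 2},
    "sarcosine": {"C": 3, "H": 7, "N": 1, "O": 2},
    "beta-alanine": {"C": 3, "H": 7, "N": 1, "O": 2},
    "ectoine": {"C": 6, "H": 10, "N": 2, "O": 2},
    "hypoxanthine": {"C": 5, "H": 4, "N": 4, "O": 1},
}


def neutral_mass(formula):
    """Monoisotopic mass of a neutral molecule from its composition."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def protonated_mz(formula):
    """Theoretical m/z of the singly charged [M+H]+ ion (assumed ion type)."""
    return neutral_mass(formula) + PROTON


def match_mass(observed_mz, database, ppm_tol=5.0):
    """Return (name, ppm_error) for every entry within ppm_tol of observed_mz."""
    hits = []
    for name, formula in database.items():
        theoretical = protonated_mz(formula)
        error = (observed_mz - theoretical) / theoretical * 1e6
        if abs(error) <= ppm_tol:
            hits.append((name, error))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# A signal at the [M+H]+ mass of alanine matches all three C3H7NO2 isomers.
isomer_hits = match_mass(protonated_mz(DATABASE["alanine"]), DATABASE)
```

The isomer example shows why accurate mass alone cannot give unique identifications: compounds with the same elemental formula have identical theoretical masses, whatever the mass accuracy of the instrument.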

Pearson’s correlation between the binary logarithm of the peak intensities and the sample number was used to evaluate dilution trends in the obtained data set. Samples for the 8 dilution points were ordered from highest to lowest concentration, so that metabolites matching the sample dilution trend would show strongly negative correlation values between intensity and sample number. Correlation values smaller than −0.85 were considered to indicate a reproducible dilution trend.

For low-abundance peaks, where signals for the highest dilutions were below the limit of detection, correlation values were calculated for the detectable consecutive measurements (at least 3 dilution points were required).
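The trend filter described above can be sketched as follows. The original analysis was carried out in R; this Python version is only a minimal illustration. The function names, the `lod` parameter and the choice to take the run of detectable values starting at the most concentrated sample are assumptions of the sketch, while the −0.85 threshold and the minimum of 3 points come from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant signal: treat as no trend
    return sxy / math.sqrt(sxx * syy)


def dilution_trend(intensities, lod=0.0, min_points=3):
    """Correlation of log2 intensity with sample number, or None.

    `intensities` are ordered from the most to the least concentrated
    sample. Only the leading run of values above `lod` is used.
    """
    log_values = []
    for value in intensities:
        if value is None or value <= lod:
            break
        log_values.append(math.log2(value))
    if len(log_values) < min_points:
        return None
    sample_numbers = list(range(1, len(log_values) + 1))
    return pearson(sample_numbers, log_values)


def passes_trend_filter(intensities, threshold=-0.85, lod=0.0, min_points=3):
    """True if the peak shows a dilution trend stronger than the threshold."""
    r = dilution_trend(intensities, lod=lod, min_points=min_points)
    return r is not None and r < threshold


# Made-up intensities: a twofold dilution series and an erratic signal.
real_peak = [1024, 512, 256, 128, 64, 32, 16, 8]
artefact = [100, 400, 90, 350, 120, 380, 95, 410]
```

With these example values, `passes_trend_filter(real_peak)` is true and `passes_trend_filter(artefact)` is false. The twofold series is purely illustrative; the dilution factors actually used are those listed in Table 1.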

All statistical analyses and graphical routines were handled in R (R Development Core Team, R: A Language and Environment for Statistical Computing, Austria: 2011; http://www.R-project.org).

Raw data files in mzXML format, R code containing the complete data processing pipeline, as well as final peak tables are available for download at http://mzmatch.sourceforge.net/metabolomics.html.

3 Results and discussion

Our study was carried out in two steps. First, we validated our filtering method by applying it to the data sets of the mixtures of analytical standards. The resulting numbers of detected peaks are shown in Table 2. Data for both chromatographic columns are shown: even for relatively simple samples (39 compounds in the mix of standards) a huge number of peaks was detected (2831 peak sets for the C18 data, and 11169 for HILIC). Only about 20–30% of these signals can be identified in chemical databases or assigned to known contaminants. A substantial proportion of the uninformative signals could be removed by applying the dilution trend filter. For example, in the unfiltered HILIC data set, 409 features matched 28 unique standard compounds within a 5 ppm mass accuracy window. After application of the dilution trend filter, this number decreased to 91 features matching 26 unique standard compounds. In other words, the number of detected compounds does not change significantly, while the total number of peaks in the data set decreases almost fivefold and the number of unambiguous matches increases substantially (Fig. 1a). Manual inspection showed that the two putative standard compounds removed by the filter were artefacts, i.e. these two compounds were not really detectable. Also, a very large number of the signals matching the ScoCyc database (which should not be present in samples of analytical standards) were removed by the trend filter, as were most of the unidentifiable compounds, which also do not match the expected composition of the samples. Overall, the fraction of correctly identifiable compounds is dramatically increased.

Table 2 Comparison of number of the peaks extracted for the standard mixtures samples
Fig. 1 Proportional relationship between identified compounds before and after filtering on dilution trend. Compounds labelled as base peaks by the mzMatch software are shown. For the standards mixture (a), where only matches to the standard compounds are expected, a clear increase in the fraction of identified peaks can be seen after filtering. Importantly, the fraction of uniquely identified compounds (lighter shade of the color) also increases strongly. In other words, after filtering more compounds with unambiguous, unique identifications are retained. The same trend can also be seen in the data for the biological samples (b), where matches to the standard compounds and the ScoCyc database are expected. Matches to the contaminant compounds decrease in the filtered data, and the number of unique identifications increases substantially (Color figure online)

A list of the standard compounds detected on both C18 and HILIC columns is shown in Table 3. The following structural isomers could not be distinguished: l-alanine, l-sarcosine and β-alanine; γ-amino-n-butyric acid, d,l-β-aminoisobutyric acid and l-α-amino-n-butyric acid. For l-isoleucine/l-leucine and 1-methyl-l-histidine/3-methyl-l-histidine, two peaks eluting close to each other were observed. Ammonium chloride was not detected on either column (because of its low molecular weight), and l-ornithine was not detected on the HILIC column. Almost no separation was achieved on the C18 column (most of the signals eluted within the first minute of the analytical run). Surprisingly high quantification accuracy (correlation values close to −1, i.e. a near-linear relationship between log intensity and sample number) can be observed for almost all analytical standards on both chromatographic columns.

Table 3 Identified compounds in the analytical mixture

The resulting numbers of detected peaks after processing of the biological samples are shown in Table 4. Surprisingly, the number of detected peaks is comparable to the numbers seen for the analytical standards, both in the filtered and unfiltered data sets. For the HILIC data set, 639 features were putatively identified in the ScoCyc database (78 unique compounds), but only 28 peaks (24 unique identifiers) were retained after application of the dilution trend filter. Clear trends in improvement of the data set quality are shown in Fig. 1b. Interesting compounds that were identified (and expected) only in the biological samples on both chromatographic columns are the osmoregulator ectoine and hypoxanthine. Fig. 2 gives an example of dilution trends and chromatographic peaks for the biological sample (Fig. 2a) and the standard mixture (Fig. 2b). In both data sets, a peak was identified as matching the mass of ectoine with an apparent mass error of less than 1 ppm, but in the standard mixture (which does not contain ectoine), this peak was successfully discarded by the trend filter, as the signal intensity patterns (shown in the left panel of the plot) do not follow the sample dilution trend.

Table 4 Comparison of the number of the peaks extracted for the biological samples before and after trend filtering
Fig. 2 Example of the dilution trends (on the left) and extracted mass chromatograms (on the right) for a metabolite putatively identified as ectoine. For the biological samples, which are expected to contain ectoine (Kol et al. 2010), three technical replicates show a clearly identifiable dilution trend (trend correlation value −0.97). For the standard mixture, which does not contain ectoine, a random trend is seen in all replicates for the signal putatively identified as ectoine (mass error 0.86 ppm); this putative technical artefact can thus be removed by the trend filtering (Color figure online)

The biological samples used in this illustrative example are particularly challenging, due to a large number of peaks with low signal intensities. Our results show that even for such difficult data, the dilution trend filter can be applied with no real danger of losing information of interest. Clearly, sample dilution factors should be adjusted according to the expected overall metabolite levels in the analysed samples, to avoid over-dilution and loss of signals of interest. To avoid the problem of large correlations occurring by chance when the number of observations is low, the statistical significance of the observed correlation can be examined and the obtained p-values can be used to determine the threshold for peak selection. This method can also be integrated with a quality control sample approach (Sangster et al. 2006), where repeated injection of a pooled randomized sample throughout the analysis serves as a reference for quality control; this approach is commonly used in large population studies (Zelena et al. 2009). This control sample can be replaced with injections of pooled dilution samples in randomized order. Thereby, without increasing the number of injections for a typical analytical sample batch, it will be possible to simultaneously assess machine stability (as the dilution trend should stay constant) and filter the data set for highly reproducible signals.
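The remark on chance correlations can be made concrete with a small calculation. The text does not specify which significance test to use; the Python sketch below uses an exact one-sided permutation test, chosen here because it needs no distributional assumptions and is feasible for at most eight dilution points. The function names are hypothetical.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant signal: treat as no trend
    return sxy / math.sqrt(sxx * syy)


def exact_permutation_p(log_intensities):
    """Observed r and the one-sided exact permutation p-value.

    The p-value is the fraction of all orderings of the intensities whose
    correlation with the sample number is at least as negative as observed.
    """
    sample_numbers = list(range(1, len(log_intensities) + 1))
    observed = pearson(sample_numbers, log_intensities)
    as_extreme = total = 0
    for ordering in itertools.permutations(log_intensities):
        total += 1
        # small tolerance guards against floating-point rounding
        if pearson(sample_numbers, ordering) <= observed + 1e-12:
            as_extreme += 1
    return observed, as_extreme / total


# A perfect trend over 3 points versus a perfect trend over 6 points.
r_three, p_three = exact_permutation_p([3.0, 2.0, 1.0])
r_six, p_six = exact_permutation_p([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
```

Even a perfect correlation of −1 over three points has a permutation p-value of 1/6, because only six orderings exist, whereas the same correlation over six points gives 1/720. A p-value threshold therefore penalizes short series automatically, which a fixed correlation cut-off does not.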

The method suggested here is therefore a useful complement to the commonly used relative standard deviation (RSD) filters (Shah et al. 2000; Scheltema et al. 2008) and the CoDA-DW filters (Windig 2004), allowing automatic retrieval of signals of interest, reducing the complexity of the data and consequently speeding up the interpretation process.

The dilution filtering approach can be easily integrated in a complete data processing pipeline (based on mzMatch and XCMS software tools) and used in a semi-automated manner. This is illustrated in the R script provided as supplementary material for this study (http://mzmatch.sourceforge.net/metabolomics.html).

4 Concluding remarks

We have demonstrated the effectiveness and reliability of a relatively simple data filtering strategy. The proposed trend correlation filter significantly decreases the number of non-informative signals in the data sets and makes metabolite identification much easier. We could show that even very stringent filtering of the data does not cause a loss of informative signals.

Our illustrative application to biological samples demonstrates that our approach can also be applied to assess the performance of metabolite extraction from the samples. This allows a more reliable estimate of the true metabolomic complexity observed in a particular experiment.