Mixture model normalization for non-targeted gas chromatography/mass spectrometry metabolomics data
Metabolomics offers a unique integrative perspective for health research, reflecting genetic and environmental contributions to disease-related phenotypes. Identifying robust associations in population-based or large-scale clinical studies demands large numbers of subjects and therefore sample batching for gas-chromatography/mass spectrometry (GC/MS) non-targeted assays. When run over weeks or months, technical noise due to batch and run-order threatens data interpretability. Application of existing normalization methods to metabolomics is challenged by unsatisfied modeling assumptions and, notably, failure to address batch-specific truncation of low abundance compounds.
To curtail technical noise and make GC/MS metabolomics data amenable to analyses describing biologically relevant variability, we propose mixture model normalization (mixnorm) that accommodates truncated data and estimates per-metabolite batch and run-order effects using quality control samples. Mixnorm outperforms other approaches across many metrics, including improved correlation of non-targeted and targeted measurements and superior performance when metabolite detectability varies according to batch. For some metrics, particularly when truncation is less frequent for a metabolite, mean centering and median scaling demonstrate comparable performance to mixnorm.
When quality control samples are systematically included in batches, mixnorm is uniquely suited to normalizing non-targeted GC/MS metabolomics data due to explicit accommodation of batch effects, run order and varying thresholds of detectability. Especially in large-scale studies, normalization is crucial for drawing accurate conclusions from non-targeted GC/MS metabolomics data.
Keywords: Metabolomics, Non-targeted, Gas chromatography/mass spectrometry, GC/MS, Normalization, Batch effects
Abbreviations
GC/MS: Gas chromatography/mass spectrometry
HAPO: Hyperglycemia and Adverse Pregnancy Outcome
OGTT: Oral glucose tolerance test
RSD: Relative standard deviation
VSN: Variance stabilizing normalization
Non-targeted metabolomics technologies are unique tools in high-throughput ‘omics’ that provide an integrative measure of genetic and environmental factors contributing to metabolism and related phenotypes. Techniques such as gas chromatography/mass spectrometry (GC/MS), liquid chromatography/mass spectrometry and nuclear magnetic resonance have their own strengths for varying applications, but all work toward the same goal: to comprehensively characterize metabolite levels in samples of interest. These approaches are frequently accompanied by targeted technologies for which levels of specific metabolites are assayed and calibrated, for example by using stable isotope-labeled internal standards with an external series of unlabeled calibrants. When used for large-scale studies, non-targeted platforms generally require batching of samples over many days. Meaningful data analyses from large-scale studies demand careful application of quality control protocols for sample collection and storage, compound derivatization, metabolite extraction and reproducible annotation for all sample batches [2, 3, 4, 5]. Even with precise monitoring of quality control procedures, large variations in metabolite abundance attributable to batch and run order within batch are well documented, particularly for GC/MS [1, 6]. In this manuscript, we propose a statistical approach to metabolomics data normalization that controls technical variability attributable to batch and run order in large-scale metabolomics experiments. Data normalization is just one component of a carefully crafted data quality pipeline that should be rigorously applied to minimize technical variability in large-scale metabolomics studies.
Many approaches for statistical control of batch- and run order-related technical variability, i.e. normalization, have been described [7, 8, 9]. Straightforward approaches calculate scaling factors, often based on total sample intensity or a relevant physiological variable, to be applied uniformly to all metabolites measured in a sample [10, 11]. While easy to use, these approaches do not account for the chemical diversity of all compounds and the differential batch and run order effects often evident for different metabolites [1, 12]. Normalization approaches borrowed from gene expression microarray studies, including loess-based normalization, quantile normalization, surrogate variable analysis, empirical Bayes batch effect correction (ComBat) and variance stabilizing normalization (VSN), generally assume that few metabolites change across samples, that roughly equal numbers of metabolites are increased and decreased across samples, and/or that batch affects metabolites in similar ways. Any of these assumptions can be easily violated for metabolomics data. Other approaches rely on addition of single or multiple internal standard compounds or a priori identification of a set of metabolites expected not to change in the experimental conditions. Selection of these standards and non-changing sets could vary substantially depending on sample type and metabolite classes of interest. Furthermore, in sample types that are poorly understood, selected internal standards or non-changing compounds may not correspond well in terms of retention time and mass spectrometry peak alignment with metabolites observed in samples of analytical interest.
Noting chemical diversity of batch and run-order effects and the difficulty of a priori selection of internal standards, repeated assay of quality control (QC) samples from a consistent control pool is increasingly applied in large-scale metabolomics studies [1, 12, 19, 20]. While QC-based normalization methods are gaining favor, current approaches do not formally model well-known variation in thresholds of detectability across batches for GC/MS data [21, 22] and instead require elimination or imputation of abundance levels for undetected metabolites. This is particularly problematic for methods that rely on total compound abundance since low abundance compounds can be systematically missed in batches with higher detectability thresholds.
We describe a mixture model approach for non-targeted GC/MS metabolomics data normalization (mixnorm) that is compound-specific, avoids a priori selection of internal standard compounds, and formally models not only batch and run order effects, but also varying thresholds of detectability across batches. The estimated parameters for batch and run order effects account for truncation of undetected abundance levels in QC samples and are easily interpretable given their regression-based derivation. Mixture modeling has been used for downstream data analyses to investigate biological associations between phenotypes and metabolites; in this application, we discuss an alternative use of mixture modeling for normalization purposes. A large-scale simulation study confirms the accuracy of mixnorm over other methods for controlling technical variability and for detecting true associations with a simulated phenotype variable over a range of batch-specific detectability thresholds and undetected metabolites. Improved performance of mixnorm is also demonstrated using GC/MS data from 162 metabolites with reliable annotation in a reproducible AMDIS-based pipeline for 300 QC and 1200 analytical serum samples processed with highly standardized quality control procedures in the ongoing Hyperglycemia and Adverse Pregnancy Outcome (HAPO) Metabolomics study. When evaluated according to variability of individual metabolites across QC and analytical samples, pairwise Spearman correlation of QC samples, and Spearman correlation with targeted assays of the same compounds, mixnorm demonstrates superior performance to other approaches evaluated here.
Mixture model normalization (mixnorm)
Mixture model normalization (mixnorm) uses data from QC samples drawn from a common pool and included at multiple run order positions within all GC/MS batches. Given their common source, observed systematic variation in abundance levels for a given metabolite for QC samples is attributable to batch and/or run order within batch. If multiple QC types are used, it can also be assumed that batch- and run-order effects are equal across QC types for detectable metabolites, even if the actual abundance levels vary. Mixnorm uses QC data to estimate batch- and run order effects, and then applies these corrections to samples of analytical interest to prepare data for downstream analysis.
The first component of the likelihood, [(1 − p_i) + p_i Φ((T_i − μ_i)/σ)]^(1 − δ_i), contributes when δ_i = 0, i.e. when a metabolite is not detected in the ith QC sample. A metabolite may not be detected either because it is truly absent from the sample or because it is present below the detectability threshold. Mixnorm specifies a logistic model for p_i as log(p_i/(1 − p_i)) = x_i′β, where x_i and β are covariate and parameter vectors, respectively. Including (1 − p_i) in this component of the likelihood allows for the small probability that a metabolite would degrade over the course of running a batch and would therefore be undetectable due to true absence from the sample. The remainder of the first component of the likelihood models the probability that the metabolite is present in the sample but below the detectability threshold T_i using a normal cdf Φ. T_i is specified in mixnorm as the minimum observed metabolite abundance for the batch that included sample i. Mixnorm specifies a linear model for the mean of the metabolite level, μ_i, with μ_i = z_i′α, where z_i and α are covariate and parameter vectors, respectively. Mixnorm assumes that, conditional on technical covariates relevant for normalization, the variance σ² of metabolite levels in QCs is the same across batches.
The second component of the likelihood, [p_i φ(y_i; μ_i, σ²)]^(δ_i), contributes when δ_i = 1, i.e. when a metabolite is detected in the ith QC sample and its abundance is quantified as log2-transformed MS peak area. This component models the probability p_i that the metabolite is present and specifies a normal distribution with mean μ_i and variance σ² for the observed value y_i, where φ denotes the normal density. The logistic and linear regression models described above for p_i and μ_i link the two components of the likelihood.
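Assembled over all QC samples, the two components above give the per-metabolite mixture likelihood (a reconstruction from the definitions in the text, using the notation of the surrounding description):

```latex
L(\beta, \alpha, \sigma^2) \;=\; \prod_{i}
  \Big[ (1 - p_i) + p_i \, \Phi\!\Big( \tfrac{T_i - \mu_i}{\sigma} \Big) \Big]^{1 - \delta_i}
  \Big[ p_i \, \varphi(y_i; \mu_i, \sigma^2) \Big]^{\delta_i},
\qquad
\log\frac{p_i}{1 - p_i} = x_i'\beta, \quad \mu_i = z_i'\alpha,
```

where Φ is the standard normal cdf and φ(·; μ, σ²) is the normal density.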
Importantly, for normalization purposes, the covariates used to model variation in QC data should reflect technical factors, for example batch, run order within batch, or different types of QC pools. While covariate vectors x_i and z_i can be specified to include the same covariates, mixnorm does not require that they be identical. A more limited set of variables could be appropriate for x_i depending on the number of QC samples and the frequency of undetected metabolites.
Maximum likelihood parameters are estimated in mixnorm using BFGS optimization over all QCs. After estimating model parameters β and α, location shift corrections are applied to observed metabolite levels for all QCs and to samples of analytical interest, according to effect estimates and covariates for each sample. In experiments that include multiple QC types, mixnorm will estimate the mean difference in metabolite levels for different types of QCs. If different QC pools are reflective of different types of analytical samples of interest, these location shifts can be applied to analytical data if desired. Mixnorm functionality is available in the metabomxtr R package (devel) at http://www.bioconductor.org/.
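To make the likelihood concrete, the sketch below evaluates the per-metabolite negative log-likelihood for intercept-only logistic and linear components. This is an illustration only: the published implementation is the mixnorm function in the metabomxtr R package, and the function and variable names here are ours, not the package's.

```python
import math

def normal_cdf(x, mu, sigma):
    # Normal CDF via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_logpdf(x, mu, sigma):
    # Log density of N(mu, sigma^2) at x
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def neg_log_likelihood(params, qc_obs):
    """Mixture likelihood for one metabolite over QC samples.

    params: (beta0, alpha0, log_sigma) -- intercept-only models for
            illustration; real use would include batch and run-order covariates.
    qc_obs: list of (y, delta, T) tuples, where delta = 1 if detected,
            y is the log2 abundance (ignored when delta = 0) and T is the
            batch-specific detection threshold.
    """
    beta0, alpha0, log_sigma = params
    sigma = math.exp(log_sigma)           # parameterized on the log scale to stay positive
    p = 1.0 / (1.0 + math.exp(-beta0))    # inverse-logit: probability the metabolite is present
    ll = 0.0
    for y, delta, T in qc_obs:
        if delta == 1:
            # Detected: present (probability p) with normal density at y
            ll += math.log(p) + normal_logpdf(y, alpha0, sigma)
        else:
            # Undetected: truly absent, or present but below threshold T
            ll += math.log((1.0 - p) + p * normal_cdf(T, alpha0, sigma))
    return -ll
```

In practice this objective would be minimized over all QCs (e.g. with a BFGS optimizer, as mixnorm does) and the fitted batch and run-order coefficients used as location shifts.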
Other normalization methods
Normalization methods compared to mixnorm in this study are described briefly below, with more lengthy descriptions and a table comparing features in Additional file 1.
Mean centering
For each metabolite, the difference between the batch-specific mean and the mean across all samples in all batches for that metabolite is subtracted from the observed metabolite level.
Median scaling
For each metabolite, the abundance level in a given batch is divided by the ratio of its batch-specific median to the median for that metabolite across all samples in all batches.
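Both corrections are straightforward per metabolite. An illustrative Python sketch follows (the paper's implementations are R functions in Additional file 2; the function names here are ours):

```python
from statistics import mean, median

def mean_center(values, batches):
    # Subtract the (batch mean - grand mean) offset from each observation
    grand = mean(values)
    bmean = {b: mean(v for v, bb in zip(values, batches) if bb == b)
             for b in set(batches)}
    return [v - (bmean[b] - grand) for v, b in zip(values, batches)]

def median_scale(values, batches):
    # Divide each observation by the (batch median / grand median) ratio
    grand = median(values)
    bmed = {b: median(v for v, bb in zip(values, batches) if bb == b)
            for b in set(batches)}
    return [v / (bmed[b] / grand) for v, b in zip(values, batches)]
```

Here `values` holds one metabolite's levels across all samples and `batches` the corresponding batch labels; undetected values would need to be dropped or imputed before either correction, which is exactly the limitation mixnorm addresses.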
Quantile normalization uses the means of ranked values within samples to match the distribution of abundance levels across all samples.
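As a sketch of the rank-matching idea (ignoring ties, which the preprocessCore implementation handles by averaging; names are ours):

```python
def quantile_normalize(samples):
    """samples: list of equal-length abundance lists, one per sample."""
    n = len(samples[0])
    # The mean of the k-th smallest value across samples defines the target distribution
    sorted_cols = [sorted(s) for s in samples]
    target = [sum(col[k] for col in sorted_cols) / len(samples) for k in range(n)]
    out = []
    for s in samples:
        # Rank each value within the sample, then substitute the target value of that rank
        order = sorted(range(n), key=lambda i: s[i])
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = target[rank]
        out.append(normalized)
    return out
```

After this step every sample has an identical distribution of values, which is why the method can fail when batches genuinely differ in which metabolites are detectable.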
Quantile + ComBat
Quantile normalization is followed by ComBat, an empirical Bayes method using metabolite-specific estimates of mean and variance to correct for batch, while maintaining phenotype effects. ComBat requires complete data; therefore, missing values are imputed using Bayesian principal components (PC) analysis, with half-minimum value substitutions for negative imputed values.
EigenMS first requires estimation of a categorical ‘treatment’ effect via ANOVA. Singular value decomposition is then applied to the matrix of residuals and additional bias trends are removed from the data.
Batch Normalizer is a regression-based algorithm that relies on QC samples and incorporates total abundance of each sample when estimating corrections for batch and run order effects.
Variance Stabilizing Normalization (VSN)
VSN applies a smooth transformation to all metabolites that mimics a log transformation for high intensity values and linear scaling for low intensity values, rendering variance approximately constant across the full range of intensities.
Mean centering, median scaling and Batch Normalizer were implemented using R; functions are available in Additional file 2. Quantile normalization was implemented using the preprocessCore (version 1.36.0) R package. Quantile + ComBat used the preprocessCore (version 1.36.0) and sva (version 3.22.0) R packages, with Bayesian PC imputation from the pcaMethods (version 1.66.0) R package. VSN was implemented using the vsn (version 3.42.3) R package. The preprocessCore, sva, pcaMethods and vsn packages are all available at http://www.bioconductor.org/. EigenMS R functions are available at http://www.sourceforge.net/.
We conducted a simulation study to assess mixnorm’s performance relative to other normalization approaches. We simulated a GC/MS experiment with 150 metabolites for 20 batches of 24 analytical and 3 QC samples each, totaling 480 analytical and 60 QC samples. Each metabolite was assigned a mean ‘intercept’ α_m (m = 1,…,150) according to a random draw from a normal distribution with mean 18 and standard deviation 2, α_m ~ N(18, 2²), placing the simulated metabolite means within the range of 13.5–23.5, consistent with typical GC/MS log2-transformed peak areas. Each analytical sample was next assigned a ‘phenotype’ v_i (i = 1,…,480) according to a random draw from a standard normal distribution, v_i ~ N(0, 1). Phenotype associations varied according to β_m (m = 1,…,150) for each metabolite, with β_m sampled from a standard normal distribution, β_m ~ N(0, 1). Values of α_m, v_i and β_m were held constant for 1000 simulation rounds.
Table 1. Summary statistics for metabolite variability according to RSD for QC and analytical samples prior to and following normalization. Entries are RSD % of individual metabolites across samples: mean (min, max).

Method               QC samples             Analytical samples
Truth                2.99 (1.77, 4.32)      5.82 (2.65, 16.15)
Not normalized       10.03 (2.33, 18.81)    10.88 (2.88, 19.04)
Mean centering       3.08 (1.09, 6.08)      5.24 (1.81, 14.72)
Median scaling       3.08 (0.97, 6.30)      5.32 (1.84, 14.97)
Quantile             10.21 (1.02, 18.69)    10.66 (2.29, 18.27)
Quantile + ComBat    3.94 (1.10, 11.70)     5.65 (1.62, 19.01)
EigenMS              6.82 (1.75, 15.27)     7.05 (1.76, 16.05)
VSN                  9.99 (2.27, 17.51)     10.84 (2.81, 19.27)
Batch Normalizer     1.73 (0.19, 3.09)      6.26 (2.41, 16.71)
Mixnorm              2.42 (0.73, 4.59)      5.81 (1.84, 19.62)
In GC/MS, batch effects vary depending on chemical class and do not necessarily follow monotonic trends over all batches. In each simulation round, we therefore randomly sampled batch effects for metabolite m in batch k, b_mk ~ N(0, 2²). These batch effects were then added to QC and analytical sample levels such that if QC sample j was in batch k, z_jm = z_jm + b_mk, and if analytical sample i was in batch k, y_im = y_im + b_mk for metabolite m. After generating simulated abundance levels, to mirror detection threshold variability across batches, we applied detection thresholds ranging incrementally from 12.5 to 15, randomly assigned across batches 1 to 20. Simulated values for all metabolites that fell below batch-specific detection thresholds were treated as undetected.
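The data-generating steps described above can be sketched as follows. This is a deliberately scaled-down illustration (the actual simulation used 150 metabolites, 20 batches and 1000 rounds), and the residual noise standard deviation of 1 is our assumption for the sketch, not a value stated in the text.

```python
import random

random.seed(42)
M, B = 5, 4                 # metabolites, batches (scaled down from 150 and 20)
N_PER_BATCH = 6             # analytical + QC samples per batch (27 in the simulation)

alpha = [random.gauss(18, 2) for _ in range(M)]   # per-metabolite mean intercepts
beta = [random.gauss(0, 1) for _ in range(M)]     # phenotype effect per metabolite

# Detection thresholds spaced incrementally from 12.5 to 15, randomly assigned to batches
thresholds = [12.5 + 2.5 * k / (B - 1) for k in range(B)]
random.shuffle(thresholds)

data = {}                   # (batch, sample, metabolite) -> level, or None if undetected
for b in range(B):
    batch_effect = [random.gauss(0, 2) for _ in range(M)]   # b_mk ~ N(0, 2^2)
    for s in range(N_PER_BATCH):
        v = random.gauss(0, 1)                              # sample phenotype
        for m in range(M):
            y = alpha[m] + beta[m] * v + random.gauss(0, 1) + batch_effect[m]
            # Values below the batch-specific threshold are treated as undetected
            data[(b, s, m)] = y if y >= thresholds[b] else None
```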
This simulation approach included batch effects with equal means for a given metabolite for QC and analytical samples assigned to the same batch, and the same detection threshold for all metabolites for a given batch. Once batch effects and batch-specific detection thresholds were included, RSD summarized over all QC and analytical samples increased as expected (Table 1). The increased variability caused by batch effects is precisely the technical noise that normalization seeks to control; i.e. RSD for correctly normalized data should equal RSD for simulated data prior to including batch effects. All simulated data, including unnormalized data and data after normalization using mixnorm and the other approaches described here, are publicly available at https://dataverse.harvard.edu/dataverse/gcmsmetab.
Simulation data normalization metrics
All described normalization algorithms were applied to the simulated data. For mixnorm, covariates for batch were included in both the logistic and linear components of the model. Normalization results for simulated data were evaluated using RSD and associations with the simulated phenotype.
Relative Standard Deviation (RSD)
RSD was calculated for each metabolite prior to and after normalization. To assess consistency of RSDs after normalization with true RSDs prior to introducing batch effects and truncation in the simulation, we used simple linear regression with intercept term set to 0 for all metabolites in simulated analytical samples, treating estimated RSD as the outcome and true RSD as the predictor. Beta = 1 from this simple no-intercept linear regression model indicates perfect agreement of true and estimated RSD following normalization, with beta values lower (higher) than 1 indicating under- (over-) estimation of RSD after normalization. These linear regression analyses were examined for metabolites with varying proportions of undetected values. Metabolites were grouped by 0, 0–5%, 5–10%,…, 75–80% undetected values. Metabolites with >80% undetected values were omitted from analysis.
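Both summary computations above are simple to state explicitly; a sketch (function names are ours):

```python
from statistics import mean, stdev

def rsd_percent(values):
    # Relative standard deviation: 100 * SD / mean
    return 100.0 * stdev(values) / mean(values)

def no_intercept_beta(true_rsd, est_rsd):
    # Least-squares slope for a regression through the origin:
    # beta = sum(x * y) / sum(x^2); beta == 1 indicates perfect agreement
    # of estimated and true RSD after normalization
    return (sum(t * e for t, e in zip(true_rsd, est_rsd))
            / sum(t * t for t in true_rsd))
```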
Detectable associations with simulated phenotype
Detectability of metabolite associations with the simulated phenotype variable v_i was summarized prior to and following normalization. The frequencies of true positive and false positive associations were calculated for a range of values of β_m specified in the simulation.
HAPO Metabolomics study
The original HAPO Study was an international population-based study conducted 2000–2006, designed to examine associations between maternal glucose levels during pregnancy and newborn outcomes. HAPO Study methods were described previously [31, 32]. The HAPO Study protocol was approved by the institutional review board at each HAPO field center and all participants provided informed consent. Over 23,000 eligible women at 15 international field centers underwent a 75-g oral glucose tolerance test (OGTT) between 24 and 32 weeks’ gestation. Fasting and 1-hour plasma glucose were measured and additional serum samples collected and stored using highly standardized protocols after rigorous training at all HAPO field centers [31, 32]. Immediately following collection, maternal and offspring cord serum samples were processed, stored at −20 °C or −80 °C for 1–6 weeks, shipped on dry ice to the HAPO Central Laboratory, and remained frozen at −80 °C until the present assays.
HAPO Metabolomics experimental design
HAPO Metabolomics was designed to study maternal and newborn metabolic profile associations with maternal glucose levels during pregnancy and newborn outcomes [24, 33]. Fasting and 1-hour maternal and newborn cord serum triples for 400 European ancestry mothers and their newborns were sampled for HAPO Metabolomics to reflect the distributions of characteristics observed in the original HAPO Study. Maternal serum samples at fasting and 1-hour following Trutol consumption during the OGTT and cord serum from their newborns were analyzed using conventional, targeted amino acid and non-targeted GC/MS metabolomics.
Conventional metabolite and targeted amino acid assays
Conventional metabolites were measured on a Beckman-Coulter DxC600 autoanalyzer using reagents from Beckman (Brea, CA; lactate) and Wako USA (Richmond, VA; beta-hydroxybutyrate). For free glycerol, Roche (Indianapolis, IN) reagents for glycerol-blanked triglycerides were modified. To 84 mL of the Roche R1 reagent, 6.0 mg 4-aminoantipyrine dye (Sigma, St. Louis, MO) was added. This assay was run by combining 250 μL reagent with 20 μL sample volume, calibrated against a glycerol standard (2.29 mM) with detection at 520 nm after 5 min. Targeted assays of amino acids using stable-isotope-labeled internal standards were performed on an Acquity TQD Triple Quadrupole system (Waters Corporation, Milford, MA). Absolute metabolite concentrations were calculated based on the linear relationship between concentration and peak area, and calibrated against internal standards with known concentrations, as previously described [34, 35]. Conventional metabolite and targeted amino acid data are used here to evaluate whether each normalization method can improve correlation of non-targeted GC/MS data with targeted measurements that are quantified using internal standards and not subject to batch effects.
Non-targeted GC/MS assay sample preparation and quality control
For non-targeted assays, serum was uniformly prepared for each batch using a modification of previously described protocols [36, 37]. Methanol, the extraction solvent, was spiked with a retention-time-lock (RTL) internal standard of perdeuterated myristic acid. Extracts were dried, then prepared for non-targeted GC/MS by methoximation and trimethylsilylation, and run on a 6890N GC-5975 Inert MS (Agilent Technologies, Santa Clara, CA). Programmed-temperature vaporization in the inlet and post-run, mid-column, hot backflushing of the GC column minimized analyte decomposition, carryover, and fouling of GC and MS.
Non-targeted GC/MS peak deconvolution and annotation
GC/MS data were deconvoluted with AMDIS freeware, courtesy of the National Institute of Standards and Technology, Gaithersburg, MD, and parsed against peaks annotated using the Fiehn RTL spectral library with additions from our laboratory. Detected peak areas were log2-transformed for abundance quantification. Manual curation included re-annotating features that matched multiple metabolites from our library (often co-eluting isomers such as aldohexoses), and favoring those with higher AMDIS Reverse scores. Annotation was performed simultaneously for the full data set, so there were no inconsistencies across HAPO Metabolomics samples. Reliably annotated peaks for 162 unique metabolites with detected abundance levels in at least 20% of all samples were used in this analysis. All targeted and non-targeted HAPO Metabolomics data used in this manuscript are included in Additional file 3.
HAPO metabolomics GC/MS data normalization parameters
For HAPO Metabolomics GC/MS data normalization using mixnorm, in the logistic model for p i , we included indicator variables for batch (using the batch with median abundance level for the metabolite being normalized as the referent) and QC sample type (newborn v. referent maternal). For the linear regression component for μ i , we included batch and QC sample type, as well as log-transformed run order (the best fit to the data after exploring linear, quadratic and log-transformed effects). All other normalization methods were implemented as described in Additional file 1.
HAPO metabolomics GC/MS data normalization metrics
Individual metabolite variability across QC samples
To view metabolite variability across QC samples, scatterplots were created with metabolite levels on the y-axis versus batch number on the x-axis with different plotting characters for maternal and newborn QCs and batch position. Undetected metabolites were indicated by a point below a dashed black line set below the minimum observed level for the metabolite of interest. Variability of detected metabolites was summarized as RSD across all maternal and newborn QC samples.
Individual metabolite variability across analytical samples
RSDs of metabolites across maternal fasting and 1-hour and newborn cord serum analytical samples were calculated prior to and following normalization. It is expected that RSD across all analytical samples of a given type would be higher than RSD across QC samples since analytical samples include biologically relevant variability.
Pairwise correlations of QC samples
To evaluate comparability of all metabolites in QC samples, pairwise Spearman correlations of maternal and newborn QC metabolites were calculated prior to and following normalization.
Correlations of non-targeted data with conventional and targeted amino acid data
Spearman correlation coefficients were calculated for non-targeted metabolites and their conventional metabolite and targeted amino acid counterparts on all HAPO Metabolomics analytical samples prior to and following normalization.
Associations with HAPO phenotypes
Associations with maternal fasting plasma glucose (FPG) in HAPO were modeled using metabolomics data from maternal fasting samples. Associations were modeled using two approaches. The first used linear regression, dropping unobserved metabolite levels from analysis or using imputed data as indicated by the normalization method. The second analysis approach used mixture modeling for downstream analysis. While similar in concept to the mixture model proposed here for normalization purposes, downstream mixture modeling is applied subsequent to normalization and the covariates used for downstream analysis include phenotypic predictors of interest. In both the linear regression and mixture model analyses, the primary covariate of interest was maternal FPG, but all analyses additionally included adjustment for HAPO Study field center (Belfast, UK; Brisbane and Newcastle, Australia), maternal BMI, mean arterial pressure, maternal age and gestational age at OGTT, and sample storage time. The number of statistically significant associations with nominal p < 0.05 was summarized for analyses of HAPO Metabolomics GC/MS data after application of each normalization method. Pathway analyses using MetaboAnalyst 3.0 (http://www.metaboanalyst.ca/) were also conducted using hypergeometric tests to evaluate pathway enrichment of metabolites significantly associated with FPG.
Relative Standard Deviation (RSD)
RSDs were calculated for simulated QC and analytical sample data prior to including batch effects (‘truth’), after including batch effects and truncation according to batch-specific thresholds (‘not normalized’) and after application of each normalization method (Table 1). While mixnorm slightly underestimates true RSD in QCs, the mean RSD for analytical samples of 5.81% is remarkably consistent with the true analytical sample mean RSD of 5.82%. Mean centering, median scaling and quantile + ComBat also yield RSDs that are similar to true RSDs, although the means are somewhat smaller and may indicate underestimation of true variability. Batch Normalizer RSDs for QC samples are quite low (mean 1.73%), although the mean for analytical samples is fairly consistent with the truth (mean 6.26%). Summary statistics for RSDs after quantile normalization, EigenMS and VSN suggested poorer correction of batch effects than the other methods.
False positive probabilities were calculated by determining the frequency with which each method led to detection of associations for beta values approaching 0 (Additional file 5). These probabilities were comparable for mixnorm, mean centering and median scaling, with substantial increases in false positive probabilities observed for all other methods.
HAPO Metabolomics results
Individual metabolite variability in QC samples
Pairwise correlations of QC samples
Correlations with conventional and targeted metabolites in analytical samples
Summary statistics for Spearman correlation coefficients between non-targeted and targeted assays. Entries are Spearman correlation estimate summary statistics: mean (min, max).

Method               Maternal fasting serum   Maternal 1-hour serum   Newborn cord serum
Not normalized       0.33 (0.09, 0.75)        0.33 (0.09, 0.62)       0.41 (0.19, 0.87)
Mean centering       0.49 (0.24, 0.89)        0.51 (0.26, 0.80)       0.59 (0.35, 0.94)
Median scaling       0.49 (0.25, 0.87)        0.51 (0.27, 0.79)       0.59 (0.34, 0.93)
Quantile             0.36 (0.13, 0.79)        0.37 (0.15, 0.64)       0.45 (0.20, 0.93)
Quantile + ComBat    0.46 (0.16, 0.87)        0.46 (0.18, 0.80)       0.51 (0.24, 0.95)
EigenMS              0.30 (−0.07, 0.75)       0.33 (0.08, 0.61)       0.42 (0.17, 0.87)
VSN                  0.38 (0.08, 0.85)        0.37 (0.16, 0.76)       0.47 (0.16, 0.93)
Batch Normalizer     0.27 (0.11, 0.70)        0.24 (0.10, 0.59)       0.25 (0.00, 0.78)
Mixnorm              0.52 (0.25, 0.91)        0.54 (0.28, 0.82)       0.59 (0.31, 0.95)
Associations with HAPO phenotypes
We propose a mixture-model normalization approach for GC/MS non-targeted metabolomics data called mixnorm that estimates metabolite-specific batch and run order effects based on QC samples. Mixnorm easily accommodates multiple QC sample types, an important feature for experiments that include samples from different sources, different types of individuals, etc. Additionally, rather than ignoring undetected metabolites or relying on imputation of their values, mixnorm formally models detectability, or lack thereof, for low abundance metabolites and accommodates batch-specific detectability thresholds. This is a more precise handling of truncated data than simple imputation of a low-valued constant or reliance on algorithmic approaches that often impute values in the range of observed values, inconsistent with the notion of low abundance. Given the specific corrections estimated for each metabolite, mixnorm is applicable to the full set of mass spectrometry peaks following deconvolution or to a desirable subset, for example, in our case, the peaks that were reliably annotated in the AMDIS-based pipeline we applied.
In simulations, when compared to other methods, mixnorm demonstrated the most accurate recovery of true RSDs for both QC and analytical sample data, even in the presence of substantial proportions of undetected values due to data truncation. Mixnorm also yielded a very high true positive probability for detecting associations with the simulated phenotype, with only mean centering and median scaling showing comparable performance on this particular metric. In analyses of HAPO Metabolomics data, mixnorm demonstrated reduction in RSD with patterns that most reasonably reflect expected lower RSD values for QC data compared to RSD values for analytical data. Mixnorm also demonstrated consistent improvement in pairwise Spearman correlation coefficients among QC samples. Importantly, when compared to targeted measurements of the same metabolites in HAPO Metabolomics samples, mixnorm yielded the highest and most consistent improvement in Spearman correlation coefficients across all methods. Phenotype association analyses and pathway analyses using HAPO Metabolomics data also confirm the ability of mixnorm to detect meaningful associations of biological relevance.
Normalization is just one component of carefully crafted pipelines that should be applied to perform high quality metabolomics experiments. Rigorous protocols for sample collection and storage, compound derivatization and metabolite extraction, and reproducible compound annotation pipelines are paramount to successful study conduct [2, 3, 4, 5]. Normalization procedures that rely on QC data should take note of potential outlying observations. Summaries of HAPO Metabolomics data indicated that none of the QC observations fell outside 3 standard deviations within a given metabolite; hence, we determined that all QC observations could be used for parameter estimation. We do recommend that investigators identify potential outlying QCs; the mixnorm function in the metabomxtr R package supports outlier filtering. It is also recommended that investigators take note of potential outliers in analytical samples that may influence observed phenotype associations even after data normalization is performed.
Effective normalization using QC controls also requires careful attention to experimental design. The HAPO Metabolomics study was designed to examine metabolic profiles in mothers at fasting and 1-hour into an OGTT and in their newborns’ cord serum. We randomly sampled mother/baby sample triples for placement into batches to balance continuous traits of interest across batches to the extent possible. Given expected differences in maternal and newborn metabolic profiles, we created separate QC pools to resemble the full retention time distribution of each sample type, and strategically placed QC samples from each pool at the beginning, middle and end of each batch to capture run order effects. We do note that the HAPO Metabolomics Study utilized stored samples, so common QC pools could be created at the outset of the experiment for all batches. In ongoing studies for which samples accumulate over time, investigators may need to utilize QC pools from external sources or build adequately sized pools from initial samples that can be utilized over the anticipated duration of the study. HAPO Metabolomics also included two types of QC pools to mirror the maternal and newborn sample types involved in the study. We note that mixnorm was applied to all QC data simultaneously, with a covariate for QC sample type included in the mixture model along with batch and run order covariates. Hence, batch and run order effects were estimated using data from all QCs, and location shifts were uniformly applied to all maternal and newborn analytical samples. This is an important, albeit subtle, point, especially for studies in which sample classification is the goal. Analytical sample types may not be known a priori, but if QC pools can be obtained that are similar in nature to the anticipated classes, batch and run order effects can be estimated using multiple QC types and location shifts applied to all analytical samples in the same way we used mixnorm for HAPO Metabolomics data.
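The general strategy of estimating per-metabolite batch and run-order location shifts from QC data and then subtracting them from all analytical samples can be sketched as follows. This is a deliberately simplified ordinary-least-squares version for one metabolite, with illustrative function names; the published mixnorm method instead fits a mixture model so that values undetected due to truncation also inform the estimates:

```python
import numpy as np

def estimate_shifts(qc_log_abund, batch, run_pos):
    """Estimate batch and run-order effects for one metabolite from QC
    samples by ordinary least squares on log abundances.

    Simplified sketch only: mixnorm fits a mixture model that additionally
    accommodates undetected (truncated) values.
    """
    batch = np.asarray(batch)
    batches = np.unique(batch)
    # Design matrix: intercept, batch indicators (first batch as the
    # reference level), and a linear run-order term.
    X = np.column_stack(
        [np.ones(len(qc_log_abund))]
        + [(batch == b).astype(float) for b in batches[1:]]
        + [np.asarray(run_pos, dtype=float)]
    )
    coef, *_ = np.linalg.lstsq(X, np.asarray(qc_log_abund, dtype=float),
                               rcond=None)
    return batches, coef

def normalize(sample_log_abund, batch, run_pos, batches, coef):
    """Subtract the estimated batch and run-order location shifts from
    analytical samples, leaving the overall level (intercept) intact."""
    batch = np.asarray(batch)
    shift = np.zeros(len(sample_log_abund))
    for i, b in enumerate(batches[1:], start=1):
        shift += (batch == b) * coef[i]
    shift += np.asarray(run_pos, dtype=float) * coef[-1]
    return np.asarray(sample_log_abund, dtype=float) - shift
```

Because the shifts are estimated from QCs alone and applied uniformly, biological differences between analytical sample types (here, maternal versus newborn) are preserved while batch and run-order artifacts are removed.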
Additionally, we note that three QC samples of a given type placed evenly within each batch produced reasonably stable parameter estimates for batches including 24 analytical samples in both the simulation studies and the HAPO Metabolomics study. Fewer QC samples would likely lead to less precise effect estimates. The ability to invest in QCs is likely to vary substantially from study to study; if possible, it may be useful to conduct preliminary studies including QC samples to estimate reasonable QC sample sizes for stable parameter estimation. Exact experimental design specifications will depend on the study, but classic principles of covariate balance, sample matching, and thoughtful QC sample creation and placement should be strongly considered when designing batches for large-scale metabolomics experiments.
This investigation describes the use of mixture modeling for normalization purposes. As discussed in phenotypic association analyses for both the simulated data and the HAPO Metabolomics GC/MS data, mixture modeling can also be applied for downstream analyses with covariates specified to represent variables of biological, epidemiological and/or clinical interest. The main emphasis of this manuscript is the utility of the mixture model for control of technical noise related to batch and run order in large-scale GC/MS studies, but the general modeling strategy has other uses as well.
In summary, we propose mixnorm for normalization of data from large-scale non-targeted GC/MS metabolomics studies. While application of the method requires use of multiple QC samples from one or more control pools over the course of the experiment, these control pools can typically be generated using small extractions from the samples of analytical interest without compromising the integrity of analytical samples for non-targeted profiling. Simulation studies confirm that mixnorm accommodates a far higher proportion of undetected metabolite values while maintaining more accurate estimates of RSD than the other methods evaluated here. This is crucial for accurately modeling and analyzing low abundance compounds that may be subject to batch-specific truncation. Across global metrics including metabolite RSDs and pairwise correlations for QC samples, mixnorm showed consistent and marked improvement in the HAPO Metabolomics case study. When evaluated with reference to conventional and targeted assays of a subset of metabolites reflecting a range of phenotypes in HAPO analytical samples of interest, mixnorm, along with mean centering and median scaling, achieved the greatest increases in Spearman correlations compared to the other methods. Both simulation results and the HAPO Metabolomics case study also indicate reliable detection of phenotypic associations when GC/MS data are normalized using mixnorm, with comparable performance by mean centering and median scaling. It is possible that results may vary depending on phenotypes of interest in other metabolomics studies. Mixnorm can be implemented using functionality in the metabomxtr R package (devel version) available at http://www.bioconductor.org/.
This study was funded by grants R01DK095963 from the National Institute of Diabetes and Digestive and Kidney Diseases and R01-HD34242 and R01-HD34243 from the National Institute of Child Health and Human Development.
Availability of data and materials
The HAPO Metabolomics data supporting the conclusions of this article are included in Additional file 3. The simulated data sets, including data prior to and following normalization using all methods described in this manuscript, are publicly available as R data frames at https://dataverse.harvard.edu/dataverse/gcmsmetab. All mixnorm analyses can be performed using the freely available metabomxtr R package (devel version) available at http://www.bioconductor.org/. All other normalization methods can be performed using the freely available R packages documented in this manuscript or R functions included in Additional file 2.
ACR performed programming and statistical analyses and contributed to primary drafting of the manuscript. MJM and JRB performed non-targeted GC/MS assays and advised on subtleties of the technology and data interpretation. MN contributed to statistical analyses and programming, including programming of mixnorm functionality in the metabomxtr package. RDS and OI performed targeted metabolomics assays. BEM, CBN, WLL and DMS designed the HAPO Metabolomics Study and contributed to interpretation of normalization results. DMS conceived the study hypotheses, performed statistical analyses and contributed to primary drafting of the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent to publish
Ethics approval and consent to participate
The HAPO Study was an international population-based study conducted 2000–2006. HAPO study methods were described previously [31, 32]. Data used for analysis in this manuscript were obtained for European ancestry participants from the Belfast, UK and Brisbane and Newcastle, Australia field centers. The original HAPO Study protocol was approved by the institutional review boards at these participating sites, specifically Royal Jubilee Maternity Hospital, Belfast, Northern Ireland; Mater Misericordiae Mothers’ Hospital-University of Queensland, Brisbane, Australia; and John Hunter Hospital, Newcastle, Australia. All participants provided informed consent, including permission to use stored samples for future research, with consent documents approved by IRBs at these institutions.
- 1. Dunn WR, Broadhurst D, Begley P, Zelena E, Francis-McIntyre S, Anderson N, Brown M, Knowles JD, Halsall A, Haselden JN, et al. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat Protoc. 2011;6(7):1060–83.
- 10. Veselkov KA, Vingara LK, Masson P, Robinette SL, Want E, Li JV, Barton RH, Boursier-Neyret C, Walter B, Ebbels TM, et al. Optimized preprocessing of ultra-performance liquid chromatography/mass spectrometry urinary metabolic profiles for improved information recovery. Anal Chem. 2011;83:5864–72.
- 13. Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin. 2002;12:111–39.
- 20. Wang S-Y, Kuo C-H, Tseng YJ. Batch Normalizer: a fast total abundance regression calibration method to simultaneously adjust batch and injection order effects in liquid chromatography/time-of-flight mass spectrometry-based metabolomics data and comparison with current calibration methods. Anal Chem. 2013;85:1037–46.
- 24. Scholtens DM, Bain JR, Reisetter AC, Muehlbauer MJ, Nodzenski M, Stevens RD, Ilkayeva O, Lowe LP, Metzger BE, Newgard CB, et al. Metabolic networks and metabolites underlie associations between maternal glucose during pregnancy and newborn size at birth. Diabetes. 2016;65(7):2039–50.
- 29. Bolstad BM. preprocessCore: a collection of pre-processing functions. R package version 1.30.0. 2016.
- 30. Leek JT, Johnson WE, Parker HS, Fertig EJ, Jaffe AE, Storey JD. sva: Surrogate Variable Analysis. R package version 3.14.0. 2016.
- 34. Lien LF, Haqq AM, Arlotto M, Slentz CA, Muehlbauer MJ, McMahon RL, Rochon J, Gallup D, Bain JR, Ilkayeva O, et al. The STEDMAN project: biophysical, biochemical and metabolic effects of a behavioral weight loss intervention during weight loss, maintenance, and regain. OMICS. 2009;13:21–35.
- 38. Halket JM, Przyborowska A, Stein SE, Mallard WG, Down S, Chalmers RA. Deconvolution gas chromatography/mass spectrometry of urinary organic acids: potential for pattern recognition and automated identification of metabolic disorders. Rapid Commun Mass Spectrom. 1999;13:279–84.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.