Abstract
Metabolomics is a proven tool for obtaining information about differences between foodstuffs and for selecting biochemical markers of the sensory quality of food products. A valuable application of untargeted metabolomics is the selection of metabolites that are (highly) predictive of sensory or phenotypical traits, for use as (bio)markers. This chapter demonstrates how to robustly select key metabolites and evaluate their predictive properties. The proposed approach constrains the number of selected metabolites and searches for the optimal number of predictive metabolites by cross-validation. This mitigates the problem of selecting spurious metabolites and enables straightforward use of linear regression. In the present implementation, simple forward selection is used. Together with a second cross-validation that assesses the predictive power of the selected set of metabolites, the proposed method involves two leave-one-out cross-validations and is referred to as LOO2CV. The second leave-one-out cross-validation generates a multitude of regression models, which offers additional information that is potentially useful for selecting key metabolites in the spirit of stability selection. The LOO2CV approach is illustrated with sensory and large-scale metabolomics data from a set of 76 different cocoa liquors. It is compared with conventional stepwise regression, and with stepwise regression combined with cross-validation to evaluate the predictive power of the model.
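The double leave-one-out idea summarised above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ols_predict`, `forward_path`, `choose_size`, `loo2cv`), the `max_k` cap on model size, the use of the residual sum of squares as the forward-selection criterion, and the Q² summary are all assumptions made for illustration. The inner leave-one-out loop picks the number of predictors, the outer loop predicts each held-out sample, and the per-predictor selection counts across the outer models give the stability-style information mentioned in the abstract.

```python
import numpy as np


def ols_predict(X_train, y_train, X_new, cols):
    """Fit ordinary least squares (with intercept) on the chosen columns, predict X_new."""
    A = np.column_stack([np.ones(len(y_train)), X_train[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new[:, cols]]) @ coef


def forward_path(X, y, max_k):
    """Greedy forward selection: repeatedly add the column that most lowers the RSS."""
    selected = []
    for _ in range(min(max_k, X.shape[1])):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            resid = y - ols_predict(X, y, X, selected + [j])
            rss = float(resid @ resid)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected


def choose_size(X, y, max_k):
    """Inner leave-one-out CV: return the model size with the lowest PRESS."""
    n = len(y)
    press = np.zeros(max_k)
    for i in range(n):
        keep = np.arange(n) != i
        # Forward selection is nested, so the first k entries form the size-k model.
        path = forward_path(X[keep], y[keep], max_k)
        for k in range(1, max_k + 1):
            pred = ols_predict(X[keep], y[keep], X[i:i + 1], path[:k])[0]
            press[k - 1] += (y[i] - pred) ** 2
    return int(np.argmin(press)) + 1


def loo2cv(X, y, max_k=3):
    """Outer leave-one-out CV around the inner size search (hypothetical sketch)."""
    n, p = X.shape
    preds = np.empty(n)
    counts = np.zeros(p, dtype=int)  # how often each predictor enters an outer model
    for i in range(n):
        keep = np.arange(n) != i
        k = choose_size(X[keep], y[keep], max_k)
        cols = forward_path(X[keep], y[keep], k)
        counts[cols] += 1
        preds[i] = ols_predict(X[keep], y[keep], X[i:i + 1], cols)[0]
    q2 = 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
    return preds, counts, float(q2)
```

Calling `loo2cv(X, y, max_k=3)` on an n × p data matrix `X` and a response vector `y` returns the held-out predictions, the selection count of each column, and a cross-validated Q². Because the held-out sample never takes part in choosing either the model size or the predictors, the Q² is not inflated by the selection step; predictors with counts close to n are the stable candidates.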
Funding
This project was financed by the Netherlands Metabolomics Centre (project NMC-BS2a “Power analysis of metabolomics studies”) which is part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Hageman, J.A. et al. (2017). Robust and Confident Predictor Selection in Metabolomics. In: Datta, S., Mertens, B. (eds) Statistical Analysis of Proteomics, Metabolomics, and Lipidomics Data Using Mass Spectrometry. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-45809-0_13
DOI: https://doi.org/10.1007/978-3-319-45809-0_13
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45807-6
Online ISBN: 978-3-319-45809-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)