Robust and Confident Predictor Selection in Metabolomics

Hageman, J. A.; Engel, B.; de Vos, Ric C. H.; Mumm, Roland; Hall, Robert D.; Jwanro, H.; Crouzillat, D.; Spadone, J. C.; van Eeuwijk, F. A.

doi:10.1007/978-3-319-45809-0_13

Part of the book series: Frontiers in Probability and the Statistical Sciences (FROPROSTAS)

Abstract

Metabolomics is a proven tool for characterizing differences between foodstuffs and for selecting biochemical markers of the sensory quality of food products. A valuable application of untargeted metabolomics is the selection of metabolites that are highly predictive of sensory or phenotypic traits, for use as (bio)markers. This chapter demonstrates how to robustly select key metabolites and evaluate their predictive properties. The proposed approach constrains the number of selected metabolites and uses cross-validation to search for the optimal number of predictive metabolites. This mitigates the selection of spurious metabolites and enables straightforward use of linear regression. The present implementation uses simple forward selection. A second cross-validation assesses the predictive power of the selected set of metabolites, so the proposed method involves two leave-one-out cross-validations and is referred to as LOO2CV. The second leave-one-out cross-validation generates a multitude of regression models, which offers additional information that is potentially useful for selecting key metabolites in the spirit of stability selection. The LOO2CV approach is illustrated with sensory and large-scale metabolomics data from a set of 76 different cocoa liquors. It is compared with conventional stepwise regression, both on its own and combined with cross-validation to evaluate the predictive power of the model.
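
The nesting of the two leave-one-out loops can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the abstract does not give the selection criterion, the upper bound on model size, or the performance measure, so the residual-sum-of-squares criterion for forward selection, the cap `max_k`, the Q2 statistic, and all function names below are assumptions made for illustration only.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test, cols):
    """Ordinary least squares with an intercept on the chosen columns."""
    A = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    B = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
    return B @ beta

def forward_path(X, y, max_k):
    """Greedy forward selection (assumed criterion: lowest residual sum of squares)."""
    chosen = []
    for _ in range(max_k):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            resid = y - fit_predict(X, y, X, chosen + [j])
            rss = float(resid @ resid)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
    return chosen

def inner_loo_size(X, y, max_k):
    """Inner LOO: choose the model size with the smallest summed squared prediction error."""
    n = len(y)
    press = np.zeros(max_k)
    for i in range(n):
        keep = np.arange(n) != i
        path = forward_path(X[keep], y[keep], max_k)
        for k in range(1, max_k + 1):
            pred = fit_predict(X[keep], y[keep], X[i:i + 1], path[:k])[0]
            press[k - 1] += (y[i] - pred) ** 2
    return int(np.argmin(press)) + 1

def loo2cv(X, y, max_k=3):
    """Outer LOO: predict each held-out sample with a model selected without it."""
    n, p = X.shape
    preds = np.zeros(n)
    counts = np.zeros(p, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i
        k = inner_loo_size(X[keep], y[keep], max_k)   # size tuned on the remaining samples only
        cols = forward_path(X[keep], y[keep], k)
        counts[cols] += 1                             # one selected model per outer fold
        preds[i] = fit_predict(X[keep], y[keep], X[i:i + 1], cols)[0]
    q2 = 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
    return q2, counts / n                             # predictive power, selection frequencies
```

Calling `loo2cv(X, y)` on a samples-by-metabolites matrix `X` and a trait vector `y` returns the cross-validated Q2 and, for each column, the fraction of outer folds in which it was selected. Those frequencies are one possible reading of the "multitude of regression models" used in the spirit of stability selection.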

Funding

This project was financed by the Netherlands Metabolomics Centre (project NMC-BS2a “Power analysis of metabolomics studies”) which is part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.

Author information

Authors and Affiliations

Biometris-Applied Statistics, Wageningen University, 16, 6700 AA, Wageningen, The Netherlands
J. A. Hageman, B. Engel & F. A. van Eeuwijk
Centre for BioSystems Genomics, 98, 6700 AB, Wageningen, The Netherlands
J. A. Hageman, B. Engel, Ric C. H. de Vos, Roland Mumm, Robert D. Hall & F. A. van Eeuwijk
Netherlands Metabolomics Centre, Einsteinweg 55, 2333 CC, Leiden, The Netherlands
J. A. Hageman, B. Engel, Ric C. H. de Vos, Robert D. Hall & F. A. van Eeuwijk
Plant Research International, 619, 6700 AP, Wageningen, The Netherlands
Ric C. H. de Vos, Roland Mumm & Robert D. Hall
Netherlands Consortium for Systems Biology, 94215, 1090 GE, Amsterdam, The Netherlands
Robert D. Hall & F. A. van Eeuwijk
Laboratory of Plant Physiology, Wageningen University, 658, 6700 AR, Wageningen, The Netherlands
H. Jwanro & D. Crouzillat
Nestle Research Center, Vers-chez-les-Blanc, CH-1000, Lausanne 26, Switzerland
J. C. Spadone

Corresponding author

Correspondence to J. A. Hageman.

Editor information

Editors and Affiliations

Department of Biostatistics, University of Florida, Gainesville, Florida, USA
Susmita Datta
Department of Medical Statistics and Bioinformatics, Leiden University Medical Centre, RC Leiden, The Netherlands
Bart J. A. Mertens

About this chapter

Cite this chapter

Hageman, J.A. et al. (2017). Robust and Confident Predictor Selection in Metabolomics. In: Datta, S., Mertens, B. (eds) Statistical Analysis of Proteomics, Metabolomics, and Lipidomics Data Using Mass Spectrometry. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-45809-0_13

DOI: https://doi.org/10.1007/978-3-319-45809-0_13
Published: 16 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45807-6
Online ISBN: 978-3-319-45809-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
