Abstract

Metabolomics is a proven tool for obtaining information about differences between foodstuffs and for selecting biochemical markers of the sensory quality of food products. A valuable application of untargeted metabolomics is the selection of metabolites that are (highly) predictive of sensory or phenotypical traits, for use as (bio)markers. This chapter demonstrates how to robustly select key metabolites and evaluate their predictive properties. The proposed approach constrains the number of selected metabolites and searches for the optimal number of predictive metabolites by cross-validation. This mitigates the problem of selecting spurious metabolites and also enables straightforward use of linear regression. In the present implementation, simple forward selection is used. A second cross-validation assesses the predictive power of the selected set of metabolites; the proposed method therefore involves two leave-one-out cross-validations and is referred to as LOO2CV. In the second leave-one-out cross-validation a multitude of regression models is generated, which offers additional information that is potentially useful for selecting key metabolites in the spirit of stability selection. The LOO2CV approach is illustrated with sensory and large-scale metabolomics data from a set of 76 different cocoa liquors. It is compared with conventional stepwise regression and with stepwise regression combined with cross-validation for evaluating the predictive power of the model.
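
The double leave-one-out idea described in the abstract can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the chapter's implementation: the function names (`fit_predict`, `forward_path`, `select_by_inner_loo`, `loo2cv`), the cap `max_k` on the number of predictors, the use of residual sum of squares as the forward-selection criterion, and the Q2 summary statistic are all hypothetical choices made for the example.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept; returns predictions for X_test."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def forward_path(X, y, max_k):
    """Greedy forward selection: return column indices in the order they are added."""
    chosen = []
    for _ in range(max_k):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = chosen + [j]
            resid = y - fit_predict(X[:, cols], y, X[:, cols])
            rss = float(resid @ resid)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
    return chosen

def select_by_inner_loo(X, y, max_k):
    """Inner LOO-CV: choose the number of predictors k that minimises the
    summed squared prediction error, then rerun forward selection on all rows."""
    n = len(y)
    press = np.zeros(max_k)
    for i in range(n):
        keep = np.arange(n) != i
        path = forward_path(X[keep], y[keep], max_k)
        for k in range(1, max_k + 1):
            cols = path[:k]
            pred = fit_predict(X[keep][:, cols], y[keep], X[i:i + 1][:, cols])[0]
            press[k - 1] += (y[i] - pred) ** 2
    best_k = int(np.argmin(press)) + 1
    return forward_path(X, y, best_k)

def loo2cv(X, y, max_k=3):
    """Outer LOO-CV: for each held-out sample, select predictors on the rest
    (using the inner LOO-CV), predict the held-out sample, and count how often
    each column is selected across the outer folds."""
    n, p = X.shape
    preds = np.zeros(n)
    counts = np.zeros(p, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i
        cols = select_by_inner_loo(X[keep], y[keep], max_k)
        counts[cols] += 1
        preds[i] = fit_predict(X[keep][:, cols], y[keep], X[i:i + 1][:, cols])[0]
    # Assumed summary: Q2 = 1 - PRESS / total sum of squares around the mean.
    q2 = 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
    return q2, counts / n
```

Calling `loo2cv(X, y, max_k=3)` on a samples-by-metabolites array `X` and a trait vector `y` returns the cross-validated Q2 and, for each column, the fraction of outer folds in which it was selected. Columns selected in nearly every fold are candidates for key metabolites, which is the stability-selection flavour the abstract mentions. The nested loops make the cost grow quickly with the number of samples and columns, so this sketch is only practical for small examples.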

Funding

This project was financed by the Netherlands Metabolomics Centre (project NMC-BS2a “Power analysis of metabolomics studies”) which is part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.

Author information

Corresponding author

Correspondence to J. A. Hageman.

Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Hageman, J.A. et al. (2017). Robust and Confident Predictor Selection in Metabolomics. In: Datta, S., Mertens, B. (eds) Statistical Analysis of Proteomics, Metabolomics, and Lipidomics Data Using Mass Spectrometry. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-45809-0_13
