Reflections on univariate and multivariate analysis of metabolomics data

Abstract

Metabolomics experiments usually produce large quantities of data. Univariate and multivariate analysis techniques are routinely used to extract relevant information from these data, with the aim of providing biological knowledge about the problem studied. Although statistical tools such as the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis form the statistical backbone of the vast majority of metabolomics papers, many basic but fundamental questions are still often asked: Why do the results of univariate and multivariate analyses differ? Why apply univariate methods if a multivariate method has already been applied? Why can an effect that is not visible univariately become visible multivariately? In the present paper we address some aspects of univariate and multivariate analysis, with the aim of clarifying in simple terms the main differences between the two approaches. Applications of the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis are shown on both real and simulated metabolomics data to provide an overview of fundamental aspects of univariate and multivariate methods.
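
The question of why univariate and multivariate results can differ may be easier to picture with a small simulation. The sketch below is purely illustrative and is not taken from the paper: the two simulated "metabolites", their correlation (0.95), the group size (30), and the mean shifts are all assumptions, and Hotelling's T² is used here as a simple stand-in for a multivariate test rather than the PCA or PLS-DA analyses named in the abstract. It compares pooled-variance t statistics computed one variable at a time with a statistic that also uses the correlation between the variables.

```python
import numpy as np

def pooled_cov(a, b):
    """Pooled within-group covariance of two samples (rows = samples)."""
    na, nb = len(a), len(b)
    ca = np.cov(a, rowvar=False)
    cb = np.cov(b, rowvar=False)
    return ((na - 1) * ca + (nb - 1) * cb) / (na + nb - 2)

def univariate_t(a, b):
    """Pooled-variance two-sample t statistic, one value per variable."""
    na, nb = len(a), len(b)
    s2 = np.diag(pooled_cov(a, b))  # per-variable pooled variances
    diff = a.mean(axis=0) - b.mean(axis=0)
    return diff / np.sqrt(s2 * (1 / na + 1 / nb))

def hotelling_t2(a, b):
    """Two-sample Hotelling T^2, which uses the full covariance matrix."""
    na, nb = len(a), len(b)
    diff = a.mean(axis=0) - b.mean(axis=0)
    return (na * nb / (na + nb)) * diff @ np.linalg.solve(pooled_cov(a, b), diff)

# Assumed toy setting: two strongly correlated variables whose group
# means shift in opposite directions, i.e. against the correlation.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
group_a = rng.multivariate_normal([0.0, 0.0], cov, size=30)
group_b = rng.multivariate_normal([0.3, -0.3], cov, size=30)

t = univariate_t(group_a, group_b)
t2 = hotelling_t2(group_a, group_b)
print("univariate t statistics:", np.round(t, 2))
print("Hotelling T^2:", round(float(t2), 2))
```

With pooled variances, T² is never smaller than the largest squared univariate t statistic, because it corresponds to the best-separating linear combination of the variables. When the group difference runs against the correlation structure, as assumed here, the gap between the two can be large, which is one way a difference can be weak for each variable separately yet clear when the variables are considered jointly.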

Acknowledgments

This project was financed by The Netherlands Metabolomics Centre (NMC), which is part of The Netherlands Genomics Initiative (NGI)/Netherlands Organization for Scientific Research. The authors wish to thank Claudio Luchinat and Renger Jellema for fruitful comments on the manuscript.

Author information

Corresponding author

Correspondence to Edoardo Saccenti.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 528 kb)

About this article

Cite this article

Saccenti, E., Hoefsloot, H.C.J., Smilde, A.K. et al. Reflections on univariate and multivariate analysis of metabolomics data. Metabolomics 10, 361–374 (2014). https://doi.org/10.1007/s11306-013-0598-6

Keywords

  • Univariate analysis
  • Multivariate analysis
  • Hypothesis testing
  • Multiple test correction
  • Overfitting
  • Consistency at large