Volume 10, Issue 3, pp 361–374

Reflections on univariate and multivariate analysis of metabolomics data

  • Edoardo Saccenti
  • Huub C. J. Hoefsloot
  • Age K. Smilde
  • Johan A. Westerhuis
  • Margriet M. W. B. Hendriks
Review Article


Metabolomics experiments usually produce large quantities of data. Univariate and multivariate analysis techniques are routinely used to extract relevant information from these data, with the aim of providing biological knowledge about the problem studied. Statistical tools such as the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis form the statistical backbone of the vast majority of metabolomics papers. Even so, many basic but fundamental questions are still often asked: Why do the results of univariate and multivariate analyses differ? Why apply univariate methods if a multivariate method has already been applied? Why do I see something multivariately when I see nothing univariately? In the present paper we address some aspects of univariate and multivariate analysis, with the aim of clarifying in simple terms the main differences between the two approaches. Applications of the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis are shown on both real and simulated metabolomics data examples to provide an overview of fundamental aspects of univariate and multivariate methods.
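One of the questions raised above, seeing nothing univariately yet something multivariately, can be illustrated with a minimal sketch. The data below are a hand-constructed toy set, not taken from the paper: two hypothetical metabolites share a large common source of variation, and the two groups differ only by a small shift in the second metabolite. The helper names (`pooled_cov`, `t_statistic`, `hotelling_t2`) are illustrative, and the two-sample Hotelling T² is used here as a simple multivariate counterpart of the t test; it is an assumption of this sketch rather than one of the methods named in the abstract.

```python
# Toy, hand-constructed data (an assumption for illustration, not from the paper):
# two metabolites sharing a large common source of variation.
n = 10
common = [i - 4.5 for i in range(n)]          # shared variation, mean 0
wiggle = [0.1 * (-1) ** i for i in range(n)]  # small residual, mean 0

# Each sample is (metabolite_1, metabolite_2); groups differ by a small shift
# in metabolite_2 only.
group_a = [(z, z + 0.3 + e) for z, e in zip(common, wiggle)]
group_b = [(z, z - 0.3 + e) for z, e in zip(common, wiggle)]


def mean(values):
    return sum(values) / len(values)


def pooled_cov(a, b, j, k):
    """Pooled within-group covariance between variables j and k."""
    total = 0.0
    for group in (a, b):
        mj = mean([s[j] for s in group])
        mk = mean([s[k] for s in group])
        total += sum((s[j] - mj) * (s[k] - mk) for s in group)
    return total / (len(a) + len(b) - 2)


def t_statistic(a, b, j):
    """Two-sample Student t statistic (equal variances) for variable j."""
    diff = mean([s[j] for s in a]) - mean([s[j] for s in b])
    se = (pooled_cov(a, b, j, j) * (1 / len(a) + 1 / len(b))) ** 0.5
    return diff / se


def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 for two variables (2x2 inverse written out)."""
    d0 = mean([s[0] for s in a]) - mean([s[0] for s in b])
    d1 = mean([s[1] for s in a]) - mean([s[1] for s in b])
    s00 = pooled_cov(a, b, 0, 0)
    s01 = pooled_cov(a, b, 0, 1)
    s11 = pooled_cov(a, b, 1, 1)
    det = s00 * s11 - s01 * s01
    # d' S^{-1} d for a symmetric 2x2 matrix S
    quad = (s11 * d0 * d0 - 2 * s01 * d0 * d1 + s00 * d1 * d1) / det
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * quad


t1 = t_statistic(group_a, group_b, 0)
t2 = t_statistic(group_a, group_b, 1)
T2 = hotelling_t2(group_a, group_b)
print(f"t (metabolite 1) = {t1:.2f}")
print(f"t (metabolite 2) = {t2:.2f}")
print(f"Hotelling T^2    = {T2:.1f}")
```

In this toy set both univariate t statistics stay well below 1, because the shift is small relative to each metabolite's own spread, while T² comes out at roughly 167: once the strong correlation between the two metabolites is taken into account, the small shift stands out clearly. Real data will rarely be this clean, so the sketch only shows the mechanism, not a recommended workflow.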


Keywords: Univariate analysis; Multivariate analysis; Hypothesis testing; Multiple test correction; Overfitting; Consistency at large



This project was financed by The Netherlands Metabolomics Centre (NMC), which is part of The Netherlands Genomics Initiative (NGI)/Netherlands Organization for Scientific Research. The authors wish to thank Claudio Luchinat and Renger Jellema for fruitful comments on the manuscript.

Supplementary material

11306_2013_598_MOESM1_ESM.doc (528 kb)
Supplementary material 1 (DOC 528 kb)



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Edoardo Saccenti (1, 3), corresponding author
  • Huub C. J. Hoefsloot (1, 3)
  • Age K. Smilde (1, 3)
  • Johan A. Westerhuis (1, 3)
  • Margriet M. W. B. Hendriks (2, 3)

  1. Biosystems Data Analysis Group, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands
  2. Analytical Biosciences, Leiden Academic Centre for Drug Research, Leiden, The Netherlands
  3. Netherlands Metabolomics Centre, Leiden, The Netherlands
