Reflections on univariate and multivariate analysis of metabolomics data

  • Review Article
  • Published in: Metabolomics

Abstract

Metabolomics experiments usually produce large quantities of data. Univariate and multivariate analysis techniques are routinely used to extract relevant information from these data, with the aim of providing biological knowledge about the problem studied. Although statistical tools such as the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis form the statistical backbone of the vast majority of metabolomics papers, many basic but fundamental questions are still often asked, such as: Why do the results of univariate and multivariate analyses differ? Why apply univariate methods if a multivariate method has already been applied? Why do I see something multivariately that I do not see univariately? In the present paper we address some aspects of univariate and multivariate analysis, with the aim of clarifying in simple terms the main differences between the two approaches. Applications of the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis are shown on both real and simulated metabolomics data examples to provide an overview of fundamental aspects of univariate and multivariate methods.
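One of the questions raised in the abstract, why a group difference can be visible multivariately but not univariately, can be illustrated with a small constructed example. The sketch below is not taken from the paper: the data are a hypothetical, fully deterministic toy set of two strongly correlated variables, and Fisher's linear discriminant direction is used as a simple stand-in for a multivariate method (the paper itself discusses principal component analysis and partial least squares discriminant analysis). All names (`t_statistic`, `group_a`, `group_b`, `d`, `e`) are assumptions made for illustration.

```python
import numpy as np


def t_statistic(a, b):
    """Two-sample t statistic using the pooled variance of both groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))


# Hypothetical toy data: two variables share a large common variation z,
# while the groups differ only by a small shift d in opposite directions.
n = 20
z = np.linspace(-2.0, 2.0, n)        # large variation shared by both variables
e = 0.05 * (-1.0) ** np.arange(n)    # small deterministic stand-in for noise
d = 0.1                              # small group effect

group_a = np.column_stack([z + d + e, z - d - e])
group_b = np.column_stack([z - d + e, z + d - e])

# Univariate view: one t statistic per variable.
t_x1 = t_statistic(group_a[:, 0], group_b[:, 0])
t_x2 = t_statistic(group_a[:, 1], group_b[:, 1])

# Multivariate view: project both variables onto Fisher's discriminant
# direction w = S^-1 (mean_a - mean_b), with S the pooled covariance matrix.
pooled_cov = 0.5 * (np.cov(group_a, rowvar=False) + np.cov(group_b, rowvar=False))
w = np.linalg.solve(pooled_cov, group_a.mean(axis=0) - group_b.mean(axis=0))
t_multi = t_statistic(group_a @ w, group_b @ w)

print(f"t statistic, variable 1 alone:      {t_x1:.2f}")
print(f"t statistic, variable 2 alone:      {t_x2:.2f}")
print(f"t statistic, discriminant scores:   {t_multi:.2f}")
```

In this toy set each variable on its own is dominated by the shared variation, so the small group shift is hidden, whereas the combination of the two variables cancels the shared variation and exposes the shift. Note that the t statistic on the discriminant scores is descriptive only here, because the direction was fitted on the same data.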


Acknowledgments

This project was financed by The Netherlands Metabolomics Centre (NMC), which is part of The Netherlands Genomics Initiative (NGI)/Netherlands Organization for Scientific Research. The authors wish to thank Claudio Luchinat and Renger Jellema for fruitful comments on the manuscript.

Author information

Corresponding author

Correspondence to Edoardo Saccenti.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 528 kb)

About this article

Cite this article

Saccenti, E., Hoefsloot, H.C.J., Smilde, A.K. et al. Reflections on univariate and multivariate analysis of metabolomics data. Metabolomics 10, 361–374 (2014). https://doi.org/10.1007/s11306-013-0598-6


  • DOI: https://doi.org/10.1007/s11306-013-0598-6
