Journal of Chemical Ecology, Volume 44, Issue 3, pp 215–234

Multivariate Analysis of Multiple Datasets: a Practical Guide for Chemical Ecology

  • Maxime R. Hervé
  • Florence Nicolè
  • Kim-Anh Lê Cao


Chemical ecology has strong links with metabolomics, the large-scale study of all metabolites detectable in a biological sample. Consequently, chemical ecologists are often challenged by the statistical analysis of such large datasets. This holds especially true when the purpose is to integrate multiple datasets to obtain a holistic view and a better understanding of the biological system under study. The present article provides a comprehensive resource for analyzing such complex datasets using multivariate methods. It proceeds from the necessary pre-treatment of data, including data transformations and distance calculations, to the application of both gold-standard and novel multivariate methods for the integration of different omics data. We illustrate the analysis process, with detailed interpretation of results, for six issues representative of the different types of biological questions encountered by chemical ecologists. We provide the necessary knowledge and tools, with reproducible R code and chemical-ecological datasets, to practice and teach multivariate methods.


Keywords

Discriminant analyses; Distance-based analyses; Integrative analyses; Metabolomics; Multi-block methods; Ordination methods



Acknowledgments

We are very grateful to Bernard Banaigs, Lucie Conchou, Laurent Dormont, Stéphane Greff, Maria Cristina Lorenzi, Thierry Pérez, Bertrand Schatz, Oriol Sacristán-Soriano and Olivier Thomas, who kindly provided their data to illustrate the examples; to Stéphane Dray and Denis Poinsot for their insightful comments on the manuscript; and to Zoe Welham for proofreading the manuscript.

Compliance with Ethical Standards

Conflict of Interest

The authors declare that they have no conflict of interest.

Supplementary material

ESM 1 (ZIP, 1.86 MB)
ESM 2: 10886_2018_932_MOESM2_ESM.pdf (PDF, 297 KB)



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Maxime R. Hervé (1, corresponding author)
  • Florence Nicolè (2)
  • Kim-Anh Lê Cao (3)

  1. University of Rennes, Inra, Agrocampus Ouest, IGEPP - UMR-A 1349, Rennes, France
  2. University of Lyon, UJM-Saint-Etienne, CNRS, LBVpam FRE 3727, EA 3061, Saint-Etienne, France
  3. Melbourne Integrative Genomics and School of Mathematics and Statistics, University of Melbourne, Parkville, Australia
