Multivariate Analysis of Multiple Datasets: a Practical Guide for Chemical Ecology

Journal of Chemical Ecology

Abstract

Chemical ecology has strong links with metabolomics, the large-scale study of all metabolites detectable in a biological sample. Consequently, chemical ecologists are often challenged by the statistical analysis of such large datasets. This holds especially true when the purpose is to integrate multiple datasets to obtain a holistic view and a better understanding of the biological system under study. The present article provides a comprehensive resource for analyzing such complex datasets using multivariate methods. It proceeds from the necessary pre-treatment of data, including data transformations and distance calculations, to the application of both gold-standard and novel multivariate methods for the integration of different omics data. We illustrate the process of analysis, along with detailed interpretation of the results, for six issues representative of the different types of biological questions encountered by chemical ecologists. We provide the necessary knowledge and tools, with reproducible R code and chemical-ecological datasets, to practice and teach multivariate methods.

Acknowledgments

We are very grateful to Bernard Banaigs, Lucie Conchou, Laurent Dormont, Stéphane Greff, Maria Cristina Lorenzi, Thierry Pérez, Bertrand Schatz, Oriol Sacristán-Soriano and Olivier Thomas, who kindly provided their data to illustrate the examples; to Stéphane Dray and Denis Poinsot for their insightful comments on the manuscript; and to Zoe Welham for proofreading the manuscript.

Author information

Corresponding author

Correspondence to Maxime R. Hervé.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Electronic Supplementary Material

  • ESM 1 (zip, 1.86 MB)

  • ESM 2 (PDF, 297 KB)

About this article

Cite this article

Hervé, M.R., Nicolè, F. & Lê Cao, KA. Multivariate Analysis of Multiple Datasets: a Practical Guide for Chemical Ecology. J Chem Ecol 44, 215–234 (2018). https://doi.org/10.1007/s10886-018-0932-6

  • DOI: https://doi.org/10.1007/s10886-018-0932-6
