Abstract
Chemical ecology has strong links with metabolomics, the large-scale study of all metabolites detectable in a biological sample. Consequently, chemical ecologists are often challenged by the statistical analysis of such large datasets. This is especially true when the aim is to integrate multiple datasets to obtain a holistic view and a better understanding of the biological system under study. The present article provides a comprehensive resource for analyzing such complex datasets with multivariate methods. It proceeds from the necessary pre-treatment of data, including data transformations and distance calculations, to the application of both gold-standard and novel multivariate methods for the integration of different omics data. We illustrate the analysis process, with detailed interpretation of the results, for six issues representative of the different types of biological questions encountered by chemical ecologists. We provide the necessary knowledge and tools, with reproducible R code and chemical-ecological datasets, to practice and teach multivariate methods.
Acknowledgments
We are very grateful to Bernard Banaigs, Lucie Conchou, Laurent Dormont, Stéphane Greff, Maria Cristina Lorenzi, Thierry Pérez, Bertrand Schatz, Oriol Sacristán-Soriano and Olivier Thomas, who kindly provided their data to illustrate the examples; to Stéphane Dray and Denis Poinsot for their insightful comments on the manuscript; and to Zoe Welham for proofreading the manuscript.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Hervé, M.R., Nicolè, F. & Lê Cao, K.-A. Multivariate Analysis of Multiple Datasets: a Practical Guide for Chemical Ecology. J Chem Ecol 44, 215–234 (2018). https://doi.org/10.1007/s10886-018-0932-6
DOI: https://doi.org/10.1007/s10886-018-0932-6