Abstract
Ecologists often collect data with the aim of determining which of many variables are associated with a particular cause or consequence. Unsupervised analyses (e.g., principal components analysis, PCA) summarize variation in the data, without regard to the response. Supervised analyses (e.g., partial least squares, PLS) evaluate the variables to find the combination that best explains a causal relationship. These approaches are not interchangeable, especially when the variables most responsible for a causal relationship are not the greatest source of overall variation in the data, a situation that ecologists are likely to encounter. To illustrate the differences between unsupervised and supervised techniques, we analyze a published dataset using both PCA and PLS and compare the questions and answers associated with each method. We also use simulated datasets representing situations that further illustrate differences between unsupervised and supervised analyses. For simulated data with many correlated variables that were unrelated to the response, PLS was better than PCA at identifying which variables were associated with the response. There are many applications for both unsupervised and supervised approaches in ecology. However, PCA is currently overused, at least in part because supervised approaches, such as PLS, are less familiar.
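The simulated scenario described above can be sketched in a few lines. The following Python example is an illustration only and is not the authors' archived R code: the sample size, the number of correlated variables, the noise levels, and all function names are assumptions chosen for the sketch. It builds ten strongly correlated variables that are unrelated to the response plus one variable that drives the response, then compares the first PCA axis (found by power iteration on the covariance matrix) with the first PLS weight vector (proportional to the covariance of each predictor with the response).

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def standardize(columns):
    """Center each column and scale it to unit variance."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out.append([(v - mean) / sd for v in col])
    return out


def pca_first_loadings(cols, iterations=200):
    """First principal axis via power iteration on the covariance matrix."""
    n = len(cols[0])
    cov = [[dot(a, b) / (n - 1) for b in cols] for a in cols]
    v = unit([1.0] * len(cols))
    for _ in range(iterations):
        v = unit([dot(row, v) for row in cov])
    return v


def pls_first_weights(cols, y):
    """First PLS weight vector for one response: X'y scaled to unit length."""
    return unit([dot(col, y) for col in cols])


def simulate(n=200, n_noise=10, seed=1):
    """Hypothetical data: correlated nuisance variables plus one signal variable."""
    rng = random.Random(seed)
    latent = [rng.gauss(0, 1) for _ in range(n)]   # shared source of variation
    signal = [rng.gauss(0, 1) for _ in range(n)]   # the only driver of y
    noise_cols = [[z + 0.3 * rng.gauss(0, 1) for z in latent]
                  for _ in range(n_noise)]
    y = [s + 0.5 * rng.gauss(0, 1) for s in signal]
    return noise_cols + [signal], y                # signal is the last column


cols, y = simulate()
X = standardize(cols)
y_std = standardize([y])[0]

pca_loadings = pca_first_loadings(X)
pls_weights = pls_first_weights(X, y_std)

print("PCA axis 1, loading on signal variable:", round(abs(pca_loadings[-1]), 3))
print("PLS axis 1, weight on signal variable: ", round(abs(pls_weights[-1]), 3))
```

In this setup the first PCA axis is taken up by the correlated block, so the signal variable receives a loading near zero, whereas the first PLS axis puts most of its weight on the signal variable. A full PLS analysis would also extract further components and use cross-validation, which this sketch omits.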
Data accessibility
Reproducible R code for generating and analyzing data is archived in Zenodo (https://doi.org/10.5281/zenodo.3568392). Complete data from Muir et al. (2017b) are available on Data Dryad.
References
Aguilera AM, Escabias M, Valderrama MJ (2006) Using principal components for estimating logistic regression with high-dimensional multicollinear data. Comput Stat Data Anal 50:1905–1924. https://doi.org/10.1016/J.CSDA.2005.03.011
Anderson MJ (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecol 26:32–46. https://doi.org/10.1111/j.1442-9993.2001.01070.pp.x
Aplin P (2005) Remote sensing: ecology. Prog Phys Geogr 29:104–113. https://doi.org/10.1191/030913305pp437pr
Berger B, Parent B, Tester M (2010) High-throughput shoot imaging to study drought responses. J Exp Bot 61:3519–3528. https://doi.org/10.1093/jxb/erq201
Bonney R, Cooper CB, Dickinson J, Kelling S, Phillips T, Rosenberg KV, Shirk J (2009) Citizen science: a developing tool for expanding science knowledge and scientific literacy. Bioscience 59:977–984. https://doi.org/10.1525/bio.2009.59.11.9
Borcard D, Gillet F, Legendre P (2018) Numerical ecology with R. Springer International Publishing, Cham
Cardini A, O’Higgins P, Rohlf FJ (2019) Seeing distinct groups where there are none: spurious patterns from between-group PCA. Evol Biol 46:303–316. https://doi.org/10.1007/s11692-019-09487-5
Carrascal LM, Galván I, Gordo O (2009) Partial least squares regression as an alternative to current regression methods used in ecology. Oikos 118:681–690. https://doi.org/10.1111/j.1600-0706.2008.16881.x
Cooke SJ, Hinch SG, Wikelski M, Andrews RD, Kuchel LJ, Wolcott TG, Butler PJ (2004) Biotelemetry: a mechanistic approach to ecology. Trends Ecol Evol 19:334–343. https://doi.org/10.1016/J.TREE.2004.04.003
Dickinson JL, Shirk J, Bonter D, Bonney R, Crain RL, Martin J, Phillips T, Purcell K (2012) The current state of citizen science as a tool for ecological research and public engagement. Front Ecol Environ 10:291–297. https://doi.org/10.1890/110236
Dray S, Chessel D, Thioulouse J (2003) Co-inertia analysis and the linking of ecological data tables. Ecology 84:3078–3089. https://doi.org/10.1890/03-0178
Dray S, Pélissier R, Couteron P, Fortin MJ, Legendre P, Peres-Neto PR, Bellier E, Bivand R, Blanchet FG, De Cáceres M, Dufour AB, Heegaard E, Jombart T, Munoz F, Oksanen J, Thioulouse J, Wagner HH (2012) Community ecology in the age of multivariate multiscale spatial analysis. Ecol Monogr 82:257–275. https://doi.org/10.1890/11-1183.1
Eriksson L, Johansson E, Kettaneh-Wold N, Trygg J, Wikström C, Wold S (2006) Multi- and megavariate data analysis part 1: basic principles and applications. Umetrics AB, Umeå, Sweden
Fahlgren N, Gehan MA, Baxter I (2015) Lights, camera, action: high-throughput plant phenotyping is ready for a close-up. Curr Opin Plant Biol 24:93–99. https://doi.org/10.1016/J.PBI.2015.02.006
Fick SE, Hijmans RJ (2017) WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. Int J Climatol. https://doi.org/10.1002/joc.5086
Gotelli NJ, Ellison AM (2013) A primer of ecological statistics, 2nd edn. Sinauer Associates Inc Publishers, Sunderland
Hervé MR, Nicolè F, Lê Cao K-A (2018) Multivariate analysis of multiple datasets: a practical guide for chemical ecology. J Chem Ecol 44:215–234. https://doi.org/10.1007/s10886-018-0932-6
Jolliffe IT (1982) A note on the use of principal components in regression. Appl Stat 31:300. https://doi.org/10.2307/2348005
Kallenbach M, Oh Y, Eilers EJ, Veit D, Baldwin IT, Schuman MC (2014) A robust, simple, high-throughput technique for time-resolved plant volatile analysis in field experiments. Plant J 78:1060–1072. https://doi.org/10.1111/tpj.12523
Kfoury N, Scott E, Orians C, Robbat A (2017) Direct contact sorptive extraction: a robust method for sampling plant volatiles in the field. J Agric Food Chem 65:8501–8509. https://doi.org/10.1021/acs.jafc.7b02847
Kjeldahl K, Bro R (2010) Some common misunderstandings in chemometrics. J Chemom 24:558–564. https://doi.org/10.1002/cem.1346
Kuhn M, Wickham H (2019) Rsample: general resampling infrastructure. https://cran.r-project.org/package=rsample
Legendre P, Legendre L (1998) Numerical ecology. Elsevier, Amsterdam
Muir CD, Conesa MÀ, Roldán EJ, Molins A, Galmés J (2017a) Weak coordination between leaf structure and function among closely related tomato species. New Phytol 213:1642–1653. https://doi.org/10.1111/nph.14285
Muir CD, Conesa MÀ, Roldán EJ, Molins A, Galmés J (2017b) Data from: weak coordination between leaf structure and function among closely related tomato species. Dryad Digit Repos. https://doi.org/10.5061/dryad.1r8c2
Orloci L (1966) Geometric models in ecology: I. The theory and application of some ordination methods. J Ecol 54:193. https://doi.org/10.2307/2257667
Porter J, Arzberger P, Braun H-W, Bryant P, Gage S, Hansen T, Hanson P, Lin C-C, Lin F-P, Kratz T, Michener W, Shapiro S, Williams T (2005) Wireless sensor networks for ecology. Bioscience 55:561–572. https://doi.org/10.1641/0006-3568(2005)055[0561:WSNFE]2.0.CO;2
R Core Team (2018) R: a language and environment for statistical computing
Reuter JA, Spacek DV, Snyder MP (2015) High-throughput sequencing technologies. Mol Cell 58:586–597. https://doi.org/10.1016/J.MOLCEL.2015.05.004
Roughgarden J, Running SW, Matson PA (1991) What does remote sensing do for ecology? Ecology 72:1918–1922. https://doi.org/10.2307/1941546
Scott ER (2019a) Cupcakes vs muffins: Round 2. www.ericrscott.com/post/cupcakes-vs-muffins-round-2/. Accessed 17 Sep 2020
Scott ER (2019b) Holodeck: a tidy interface for simulating multivariate data. https://cran.r-project.org/package=holodeck
Silvertown J (2009) A new dawn for citizen science. Trends Ecol Evol 24:467–471. https://doi.org/10.1016/J.TREE.2009.03.017
Simpson RK, McGraw KJ (2018) It’s not just what you have, but how you use it: solar-positional and behavioural effects on hummingbird colour appearance during courtship. Ecol Lett 21:1413–1422. https://doi.org/10.1111/ele.13125
Thévenot EA, Roux A, Xu Y, Ezan E, Junot C (2015) Analysis of the human adult urinary metabolome variations with age, body mass index, and gender by implementing a comprehensive workflow for univariate and OPLS statistical analyses. J Proteome Res 14:3322–3335. https://doi.org/10.1021/acs.jproteome.5b00354
Tiede Y, Hemp C, Schmidt A, Nauss T, Farwig N, Brandl R (2018) Beyond body size: consistent decrease of traits within orthopteran assemblages with elevation. Ecology 99:2090–2102. https://doi.org/10.1002/ecy.2436
Tjur T (2009) Coefficients of determination in logistic regression models—a new proposal: the coefficient of discrimination. Am Stat 63:366–372. https://doi.org/10.1198/tast.2009.08210
Valverde-Barrantes OJ, Smemo KA, Feinstein LM, Kershner MW, Blackwood CB (2018) Patterns in spatial distribution and root trait syndromes for ecto and arbuscular mycorrhizal temperate trees in a mixed broadleaf forest. Oecologia 186:731–741. https://doi.org/10.1007/s00442-017-4044-8
Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ, van Duijnhoven JPM, van Dorsten FA (2008) Assessment of PLSDA cross validation. Metabolomics 4:81–89. https://doi.org/10.1007/s11306-007-0099-6
Wiggins WD, Wilder SM (2018) Mismatch between dietary requirements for lipid by a predator and availability of lipid in prey. Oikos 127:1024–1032. https://doi.org/10.1111/oik.04766
Wold H (1975) Soft modelling by latent variables: the non-linear iterative partial least squares (NIPALS) approach. J Appl Probab 12:117–142. https://doi.org/10.1017/S0021900200047604
Worley B, Powers R (2016) PCA as a practical indicator of OPLS-DA model reliability. Curr Metab 4:97–103. https://doi.org/10.2174/2213235X04666160613122429
Worley B, Halouska S, Powers R (2013) Utilities for quantifying separation in PCA/PLS-DA scores plots. Anal Biochem 433:102–104. https://doi.org/10.1016/J.AB.2012.10.011
Wright IJ, Reich PB, Westoby M, Ackerly DD, Baruch Z, Bongers F, Cavender-Bares J, Chapin T, Cornelissen JHC, Diemer M, Flexas J, Garnier E, Groom PK, Gulias J, Hikosaka K, Lamont BB, Lee T, Lee W, Lusk C, Midgley JJ, Navas M-L, Niinemets Ü, Oleksyn J, Osada N, Poorter H, Poot P, Prior L, Pyankov VI, Roumet C, Thomas SC, Tjoelker MG, Veneklaas EJ, Villar R (2004) The worldwide leaf economics spectrum. Nature 428:821–827. https://doi.org/10.1038/nature02403
Acknowledgements
We thank Christopher Muir and Colin M. Orians for comments on a draft of this manuscript.
Author information
Contributions
ERS and EEC conceived and designed the study. ERS analyzed data and led the writing of the manuscript. Both authors contributed significantly to drafts and approve the final version for publication.
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Additional information
Communicated by Casey P. terHorst.
Supervised multivariate analyses are underutilized in ecology. These analyses give different results than unsupervised approaches (e.g., PCA), which find the main axes of variation without respect to a response. Here, we show that unsupervised and supervised approaches are not interchangeable and require different interpretation. In particular, unsupervised approaches are likely to miss significant relationships with variables that are not part of a main axis of variation, a situation that may be common in ecological datasets.
About this article
Cite this article
Scott, E.R., Crone, E.E. Using the right tool for the job: the difference between unsupervised and supervised analyses of multivariate ecological data. Oecologia 196, 13–25 (2021). https://doi.org/10.1007/s00442-020-04848-w
DOI: https://doi.org/10.1007/s00442-020-04848-w