Plant Ecology, Volume 216, Issue 5, pp 657–667

Principal component analysis with missing values: a comparative survey of methods

Abstract

Principal component analysis (PCA) is a standard technique for summarizing the main structures of a data table containing measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and review methods that adapt PCA to missing data. In plant ecology, this statistical challenge relates to the current effort to compile global plant functional trait databases, which produces matrices with a large number of missing values. We present several techniques to account for or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them to a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, the potential pitfalls, and the challenges that remain to be addressed.
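
The abstract does not name the individual methods, so the following is only an illustration of what imputing missing values within PCA can look like, not a reproduction of any method the article evaluates. It is a minimal sketch of one common idea, iterative PCA imputation: start from column means, fit a low-rank PCA to the filled table, replace the missing cells with the fitted values, and repeat until the imputations stabilise. The function name `iterative_pca_impute`, its parameters, and the toy data are all hypothetical choices made for this sketch.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, max_iter=500, tol=1e-8):
    """Fill NaN cells of X by alternating a rank-k PCA fit and re-imputation.

    Hypothetical helper written for illustration; not taken from the article.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means computed from the observed values only.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        # Centre the current filled table and take its truncated SVD (PCA).
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k] + mean
        # Observed cells keep their values; missing cells take the PCA fit.
        updated = np.where(missing, low_rank, X)
        change = np.sqrt(np.mean((updated - filled) ** 2))
        filled = updated
        if change < tol:
            break
    return filled


# Toy table (assumed data): three perfectly correlated variables, two cells removed.
t = np.arange(10.0)
truth = np.column_stack([t, 2 * t + 1, -t + 3])
X = truth.copy()
X[2, 1] = np.nan
X[7, 0] = np.nan

filled = iterative_pca_impute(X, n_components=1)
print(filled[2, 1], filled[7, 0])
```

In this toy table the variables are exactly collinear, so a single component is enough to recover the removed cells; with real trait data the number of components has to be chosen, and plain iterative imputation of this kind can overfit when many values are missing.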

Keywords

Imputation, Ordination, PCA, Traits

Supplementary material

11258_2014_406_MOESM1_ESM.pdf (74 kb)
Electronic supplementary material 1 (PDF 74 kb)
11258_2014_406_MOESM2_ESM.tex (59 kb)
Electronic supplementary material 1 (TEX 59 kb)

Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Université de Lyon, Lyon, France
  2. Université Lyon 1, CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
  3. Applied Mathematics Department, Agrocampus Ouest, Rennes, France
