Principal component analysis with missing values: a comparative survey of methods

Abstract

Principal component analysis (PCA) is a standard technique for summarizing the main structures of a data table containing measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and review methods that adapt PCA to missing data. In plant ecology, this statistical challenge relates to the current effort to compile global plant functional trait databases, which yields matrices with a large proportion of missing values. We present several techniques to handle or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them to a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, their potential pitfalls, and the challenges that remain to be addressed.

Acknowledgments

We thank Peter Minchin and Jari Oksanen for the invitation to participate in this special issue, and Gavin Simpson and an anonymous reviewer for comments on an earlier draft of the manuscript. We also warmly thank Ian Wright for freely distributing the GLOPNET data set.

Author information

Corresponding author

Correspondence to Stéphane Dray.

Additional information

Communicated by P. R. Minchin and J. Oksanen.

Cite this article

Dray, S., Josse, J. Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216, 657–667 (2015). https://doi.org/10.1007/s11258-014-0406-z

Keywords

  • Imputation
  • Ordination
  • PCA
  • Traits