Principal component analysis with missing values: a comparative survey of methods

Plant Ecology

Abstract

Principal component analysis (PCA) is a standard technique for summarizing the main structures of a data table containing measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and review methods that adapt PCA to missing data. In plant ecology, this statistical challenge relates to the current effort to compile global plant functional trait databases, which produces matrices with a large number of missing values. We present several techniques to consider or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them to a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, the potential pitfalls, and the challenges that remain to be addressed.
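
To make the idea of estimating (imputing) missing values within PCA concrete, the following is a minimal sketch of one common family of approaches, iterative PCA imputation, written in Python with NumPy. It is an illustration under stated assumptions, not necessarily the procedure evaluated in the article: the function name `iterative_pca_impute`, its parameters, and the stopping rule are hypothetical choices made for this example, and no regularization or variable scaling is applied.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, max_iter=500, tol=1e-8):
    """Fill NaN entries of X with a low-rank PCA reconstruction (illustrative sketch).

    Hypothetical helper: the name, defaults, and stopping rule are assumptions.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: replace each missing value by the mean of its column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(max_iter):
        # PCA of the completed table via SVD of the column-centered data.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)

        # Rank-limited reconstruction from the first n_components axes.
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean

        # Observed cells keep their values; only missing cells are updated.
        updated = np.where(missing, recon, X)
        change = np.sum((updated - filled) ** 2)
        filled = updated
        if change < tol:
            break

    return filled
```

Calling `iterative_pca_impute(X, n_components=1)` on a table with `NaN` entries returns a completed table of the same shape in which the observed cells are unchanged. In practice the number of components has to be chosen by the analyst, and the result can be sensitive to that choice.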

Acknowledgments

We would like to thank Peter Minchin and Jari Oksanen for the invitation to participate in this special issue, and Gavin Simpson and an anonymous reviewer for comments on an earlier draft of the manuscript. We also warmly thank Ian Wright for freely distributing the GLOPNET data set.

Author information

Corresponding author

Correspondence to Stéphane Dray.

Additional information

Communicated by P. R. Minchin and J. Oksanen.

Cite this article

Dray, S., Josse, J. Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216, 657–667 (2015). https://doi.org/10.1007/s11258-014-0406-z

  • DOI: https://doi.org/10.1007/s11258-014-0406-z
