Abstract
Principal component analysis (PCA) is a standard technique for summarizing the main structures of a data table containing measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and review methods that adapt PCA to missing data. In plant ecology, this statistical challenge relates to the current effort to compile global plant functional trait databases, which produces matrices with a large number of missing values. We present several techniques to account for or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them to a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, the potential pitfalls, and the challenges that remain to be addressed.
Acknowledgments
We would like to thank Peter Minchin and Jari Oksanen for the invitation to participate in this special issue, and Gavin Simpson and an anonymous reviewer for comments on an earlier draft of the manuscript. We also warmly thank Ian Wright for freely distributing the GLOPNET data set.
Additional information
Communicated by P. R. Minchin and J. Oksanen.
About this article
Cite this article
Dray, S., Josse, J. Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216, 657–667 (2015). https://doi.org/10.1007/s11258-014-0406-z
DOI: https://doi.org/10.1007/s11258-014-0406-z