Abstract
The aim of this chapter is to provide an overview of recent developments in principal component analysis (PCA) methods when the data are incomplete. Missing data bring uncertainty into the analysis, and their treatment requires statistical approaches tailored to specific missing data processes (i.e., ignorable and nonignorable mechanisms). Since the publication of the classic textbook by Jolliffe, which includes a short section of the same title on the missing data problem in PCA, there have been a few methodological contributions that hinge upon a probabilistic approach to PCA. In this chapter, we unify methods for ignorable and nonignorable missing data in a general likelihood framework. We also provide real data examples to illustrate the application of these methods using the R language and environment for statistical computing and graphics.
References
Bartolucci, F., Farcomeni, A.: A discrete time event-history approach to informative drop-out in mixed latent Markov models with covariates. Biometrics 71(1), 80–89 (2015)
Booth, J.G., Hobert, J.P.: Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society B 61(1), 265–285 (1999)
Creemers, A., Hens, N., Aerts, M., Molenberghs, G., Verbeke, G., Kenward, M.G.: A sensitivity analysis for shared-parameter models for incomplete longitudinal outcomes. Biometrical Journal 52(1), 111–125 (2010)
de Brevern, A., Hazout, S., Malpertuy, A.: Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinformatics 5(1), 114 (2004)
de Souto, M.C., Jaskowiak, P.A., Costa, I.G.: Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinformatics 16(1), 64 (2015)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39(1), 1–38 (1977)
Ding, C., Zhou, D., He, X., Zha, H.: \(L_{1}\)-PCA: rotational invariant \(L_{1}\)-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 281–288. ACM (2006)
Farcomeni, A., Greco, L.: Robust methods for data reduction. CRC Press, Boca Raton, FL (2015)
Geraci, M.: Estimation of regression quantiles in complex surveys with data missing at random: An application to birthweight determinants. Statistical Methods in Medical Research 25(4), 1393–1421 (2016)
Geraci, M., Bottai, M.: Use of auxiliary data in semi-parametric spatial regression with nonignorable missing responses. Statistical Modelling 6(4), 321–336 (2006)
Geraci, M., Farcomeni, A.: Probabilistic principal component analysis to identify profiles of physical activity behaviours in the presence of nonignorable missing data. Journal of the Royal Statistical Society C 65(1), 51–75 (2016)
Gilks, W.R., Wild, P.: Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society C 41(2), 337–348 (1992)
Griffiths, L.J., Cortina-Borja, M., Sera, F., Pouliou, T., Geraci, M., Rich, C., Cole, T.J., Law, C., Joshi, H., Ness, A.R., Jebb, S.A., Dezateux, C.: How active are our children? Findings from the Millennium Cohort Study. BMJ Open 3(8), e002893 (2013)
Heitjan, D.F., Basu, S.: Distinguishing “missing at random” and “missing completely at random”. The American Statistician 50(3), 207–213 (1996)
Houseago-Stokes, R.E., Challenor, P.G.: Using PPCA to estimate EOFs in the presence of missing values. Journal of Atmospheric and Oceanic Technology 21(9), 1471–1480 (2004)
Husson, F., Josse, J.: missMDA: Handling missing values with/in multivariate data analysis (principal component methods) (2013). https://CRAN.R-project.org/package=missMDA. R package version 1.7.2
Ibrahim, J.G., Chen, M.H., Lipsitz, S.R.: Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika 88(2), 551–564 (2001)
Ibrahim, J.G., Molenberghs, G.: Missing data methods in longitudinal studies: A review. Test 18(1), 1–43 (2009)
Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the presence of missing values. Journal of Machine Learning Research 11(Jul), 1957–2000 (2010)
Jolliffe, I.T.: Principal component analysis, 2nd edn. Springer-Verlag, New York, NY (2002)
Josse, J., Husson, F.: Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153(2), 79–99 (2012)
Josse, J., Husson, F.: Selecting the number of components in principal component analysis using cross-validation approximations. Computational Statistics and Data Analysis 56(6), 1869–1879 (2012)
Josse, J., Pagès, J., Husson, F.: Multiple imputation in principal component analysis. Advances in Data Analysis and Classification 5(3), 231–246 (2011)
Laird, N.M.: Missing data in longitudinal studies. Statistics in Medicine 7(1–2), 305–315 (1988)
Lê, S., Josse, J., Husson, F.: FactoMineR: A package for multivariate analysis. Journal of Statistical Software 25(1), 1–18 (2008)
Little, R.J.A., Rubin, D.B.: Statistical analysis with missing data. Wiley, New York, NY (1987)
Little, R.J.A., Rubin, D.B.: Statistical analysis with missing data, 2nd edn. Wiley, Hoboken, NJ (2002)
Mehrotra, D.V.: Robust elementwise estimation of a dispersion matrix. Biometrics 51(4), 1344–1351 (1995)
Melgani, F., Mercier, G., Lorenzi, L., Pasolli, E.: Recent methods for reconstructing missing data in multispectral satellite imagery. In: R.S. Anderssen, P. Broadbridge, Y. Fukumoto, K. Kajiwara, T. Takagi, E. Verbitskiy, M. Wakayama (eds.) Applications + Practical Conceptualization + Mathematics = fruitful Innovation: Proceedings of the Forum of Mathematics for Industry 2014, pp. 221–234. Springer Japan, Tokyo (2016)
Molenberghs, G., Beunckens, C., Sotto, C., Kenward, M.G.: Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society B 70(2), 371–388 (2008)
Morelli, M.S., Giannoni, A., Passino, C., Landini, L., Emdin, M., Vanello, N.: A cross-correlational analysis between electroencephalographic and end-tidal carbon dioxide signals: Methodological issues in the presence of missing data and real data results. Sensors (Basel, Switzerland) 16(11), e1828 (2016)
Oh, S., Kang, D.D., Brock, G.N., Tseng, G.C.: Biological impact of missing-value imputation on downstream analyses of gene expression profiles. Bioinformatics 27(1), 78–86 (2011)
Orchard, T., Woodbury, M.A.: A missing information principle: Theory and applications. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, pp. 697–715. University of California Press, Berkeley, CA (1972)
Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572 (1901)
Petris, G., Tardella, L.: HI: Simulation from distributions supported by nested hyperplanes (2013). https://CRAN.R-project.org/package=HI. R package version 0.4
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2016). https://www.R-project.org/
Rich, C., Cortina-Borja, M., Dezateux, C., Geraci, M., Sera, F., Calderwood, L., Joshi, H., Griffiths, L.J.: Predictors of non-response in a UK-wide cohort study of children’s accelerometer-determined physical activity using postal methods. BMJ Open 3(3), e002290 (2013)
Roweis, S.: EM algorithms for PCA and SPCA. In: M.I. Jordan, M.J. Kearns, S.A. Solla (eds.) Advances in neural information processing systems 10: Proceedings of the 1997 conference, vol. 10, pp. 626–632. MIT Press, Cambridge, MA (1998)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Sattari, M.T., Rezazadeh-Joudi, A., Kusiak, A.: Assessment of different methods for estimation of missing data in precipitation studies. Hydrology Research (2016). https://doi.org/10.2166/nh.2016.364
Schneider, T.: Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate 14(5), 853–871 (2001)
Tipping, M.E., Bishop, C.M.: Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61(3), 611–622 (1999)
Appendix – EM Algorithm for PPCA with MNAR Values
In this appendix, we provide additional details on the Monte Carlo EM algorithm introduced in Sect. 3.2 and derive a simplified E-step in which the random effects are integrated out of the complete data log-likelihood.
The Monte Carlo E-step requires sampling from \(f\left( \mathbf {z}_{i},\mathbf {u}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \). This task can be carried out efficiently via ARMS [12] using the full conditionals
An implementation of ARMS is available in the R package HI [35].
A sample \(\varvec{\xi }_{i1},\ldots ,\varvec{\xi }_{iK}\) for \(i=1,\ldots ,n\) is obtained at each EM iteration t, where the \((s_{i} + q) \times 1\) vector \(\varvec{\xi }_{ik} = \left( \tilde{\mathbf {z}}_{ik},\tilde{\mathbf {u}}_{ik}\right) \), \(k = 1, \ldots , K\), contains ‘imputed’ values for \(\mathbf {z}_{i}\) and \(\mathbf {u}_{i}\) (with the understanding that \(\varvec{\xi }_{ik} = \tilde{\mathbf {u}}_{ik}\) if \(s_{i}=0\)). Here the Monte Carlo sample size K is kept constant throughout. Alternative strategies with varying \(K^{(t)}\) that may increase the speed or the accuracy of the EM algorithm can be considered [2, 17]. The E-step (11) is approximated by
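To fix ideas, the Monte Carlo average over the K draws can be sketched as follows. This is a minimal Python/NumPy illustration rather than the chapter's R code, and `complete_loglik` is a simplified stand-in that treats \(\varvec{\xi }_{ik}\) as latent PPCA scores under \(\mathbf {y} = \varvec{\mu } + \mathbf {W}\varvec{\xi } + \mathbf {e}\), with the missingness model omitted.

```python
import numpy as np

def complete_loglik(y, xi, W, mu, psi):
    """log f(y, xi; lambda) for one individual under a simplified
    stand-in model: y = mu + W xi + e, e ~ N(0, psi I), xi ~ N(0, I)."""
    resid = y - mu - W @ xi
    p, q = y.shape[0], xi.shape[0]
    return (-0.5 * (resid @ resid) / psi
            - 0.5 * p * np.log(2.0 * np.pi * psi)
            - 0.5 * (xi @ xi)
            - 0.5 * q * np.log(2.0 * np.pi))

def mc_e_step(Y, Xi, W, mu, psi):
    """Approximate Q(lambda | lambda^(t)) by averaging the complete-data
    log-likelihood over the K Monte Carlo draws Xi[i, k]."""
    n, K, _ = Xi.shape
    return sum(complete_loglik(Y[i], Xi[i, k], W, mu, psi)
               for i in range(n) for k in range(K)) / K
```

When all K draws for an individual coincide, the average reduces to the complete-data log-likelihood itself, which provides a quick sanity check.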
The maximization of (15) with respect to \(\varvec{\lambda }\) is straightforward. Define \(\tilde{\mathbf {y}}_{ik} = \left( \tilde{\mathbf {z}}_{ik},\mathbf {x}_{i}\right) \) if \(s_{i}>0\) or \(\tilde{\mathbf {y}}_{ik} = \mathbf {y}_{i}\) if \(s_{i}=0\), \(i = 1,\ldots ,n\), \(k = 1,\ldots ,K\). The maximum likelihood solution of the M-step at the \((t+1)\)th iteration is given by
Analogously, the MLE of \(\varvec{\eta }\) can be easily obtained using standard results for generalized linear models.
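As an illustration of that GLM step, the MLE of the coefficients of a logistic missingness model can be computed by Newton-Raphson (equivalently, IRLS). The sketch below is in Python/NumPy with a hypothetical design matrix `X`; the chapter's actual model for \(\mathbf {m}_{i}\) defines its own predictors.

```python
import numpy as np

def fit_logistic(X, m, n_iter=25):
    """MLE of a logistic regression by Newton-Raphson (IRLS).

    X is a hypothetical (n, d) design matrix for the missingness model,
    m the (n,) vector of binary missingness indicators, and eta collects
    the regression coefficients.
    """
    eta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pr = 1.0 / (1.0 + np.exp(-X @ eta))   # fitted probabilities
        grad = X.T @ (m - pr)                 # score vector
        w = pr * (1.0 - pr)                   # IRLS weights
        H = X.T @ (X * w[:, None])            # observed information
        eta = eta + np.linalg.solve(H, grad)  # Newton-Raphson update
    return eta
```

In R, the same fit would of course be obtained with `glm(m ~ ., family = binomial)`.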
Note that the computational burden can be alleviated by first integrating out the random effects in (11) and then sampling from \(f\left( \mathbf {z}_{i}|\mathbf {x}_{i},\mathbf {m}_{i},\varvec{\lambda }^{(t)}\right) \) during the Monte Carlo E-step. We obtain what we call a simplified E-step
where \(\mathbf {v}_{i}^{(t)} = \mathbf {B}^{(t)}\mathbf {W}^{(t)^{\top }}\left( \mathbf {y}_{i} - \varvec{\mu }^{(t)}\right) /\psi ^{(t)}\) and \(\mathbf {B}^{(t)} = \left\{ \mathbf {W}^{(t)^{\top }} \mathbf {W}^{(t)}/\psi ^{(t)} + \mathbf {I}_{q}\right\} ^{-1}\). Note that by assumption \(\mathbf {m}_{i}\) is independent of \(\mathbf {u}_i\). The expectation above is now taken with respect to
where \(\mathbf {C}^{(t)} = \mathbf {W}^{(t)}\mathbf {W}^{(t)^{\top }} + \varvec{\varPsi }^{(t)}\).
Again, we obtain a sample \(\tilde{\mathbf {z}}_{ik}\), \(i = 1,\ldots ,n\), \(k = 1,\ldots ,K\) and calculate the approximate E-step
The MLE equations of the M-step, which follow from maximizing the log-likelihood in (20), are similar to Equations (27) and (28) in [42] and do not require explicit computation of the covariance matrix. We omit them for the sake of brevity.
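To convey the flavour of such Roweis-style updates, the following Python/NumPy sketch performs one EM iteration for complete-data PPCA using only the \(q \times q\) matrix \(\mathbf {B}\) and the conditional means \(\mathbf {v}_i\); it is an illustration under the standard PPCA model, not the chapter's MNAR algorithm, and the function name is ours.

```python
import numpy as np

def ppca_em_step(Y, W, psi):
    """One EM iteration for complete-data probabilistic PCA.

    Only the q x q matrix B and the conditional means v_i are used; the
    p x p covariance matrix of Y is never formed explicitly.
    """
    n, p = Y.shape
    q = W.shape[1]
    mu = Y.mean(axis=0)                            # MLE of the mean
    Yc = Y - mu
    # E-step: posterior moments of the latent scores u_i
    B = np.linalg.inv(W.T @ W / psi + np.eye(q))   # Cov(u_i | y_i)
    V = Yc @ W @ B / psi                           # row i holds v_i = E(u_i | y_i)
    # M-step: closed-form updates of W and psi
    Suu = n * B + V.T @ V                          # sum_i E(u_i u_i^T | y_i)
    W_new = Yc.T @ V @ np.linalg.inv(Suu)
    psi_new = (np.sum(Yc ** 2)
               - 2.0 * np.sum((Yc @ W_new) * V)
               + np.trace(Suu @ W_new.T @ W_new)) / (n * p)
    return W_new, psi_new
```

Each iteration increases the marginal likelihood of the fitted model with covariance \(\mathbf {W}\mathbf {W}^{\top } + \psi \mathbf {I}_{p}\), the usual EM monotonicity guarantee.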
Finally, we note that, based on the linear predictions
where \(\tilde{\mathbf {y}}_{ik} = \left( \tilde{\mathbf {z}}_{ik},\mathbf {x}_{i}\right) \) is the complete data vector at convergence, we can calculate the element-wise variances of \(\frac{1}{K}\sum _{k=1}^{K} \hat{\mathbf {u}}_{ik}\) across individuals as estimates of \(\delta _{1},\ldots ,\delta _{q}\). The quantity \((p-q)\cdot \hat{\psi }\) provides the portion of the total variability associated with the ‘discarded’ components.
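For concreteness, this variance decomposition at convergence can be computed as below. The Python/NumPy sketch uses simulated predicted scores; `U_hat` and `psi_hat` are placeholder values, not results from the chapter.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, p, q = 100, 50, 6, 2
psi_hat = 0.4                          # placeholder estimate at convergence
# U_hat[i, k] plays the role of the linear prediction u_hat_ik (simulated here)
U_hat = rng.standard_normal((n, K, q)) + np.array([3.0, -1.0])

u_bar = U_hat.mean(axis=1)             # (1/K) sum_k u_hat_ik, one row per individual
delta_hat = u_bar.var(axis=0, ddof=1)  # element-wise variances across individuals
resid_var = (p - q) * psi_hat          # variability of the 'discarded' components
```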
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
Cite this chapter
Geraci, M., Farcomeni, A. (2018). Principal Component Analysis in the Presence of Missing Data. In: Naik, G. (eds) Advances in Principal Component Analysis. Springer, Singapore. https://doi.org/10.1007/978-981-10-6704-4_3
Print ISBN: 978-981-10-6703-7
Online ISBN: 978-981-10-6704-4