Abstract
Principal Component Analysis (PCA) cannot be applied directly to incomplete data sets, and practitioners often remove or ignore observations that contain at least one missing value. Three main strategies can be distinguished for applying PCA to a data set with missing entries: (i) impute the missing values before applying PCA; (ii) obtain the PCA solution while ignoring the missing values; and (iii) obtain the PCA solution while dealing explicitly with the missing values. Methods implementing the last strategy have been reviewed and, among them, the iterative PCA (iPCA) approach has been shown to be preferable. This paper proposes a chunk-wise implementation of iPCA suitable for tall data sets, that is, data sets with many observations. In the proposed approach, each data chunk is imputed according to the data analyzed so far. The proposed procedure is compared to batch iPCA and to a naive implementation that imputes each data chunk independently. In a series of experiments, we consider different data sets and missing data mechanisms.
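To make the two baselines mentioned above concrete, the following is a minimal sketch of batch iPCA and of the naive chunk-wise variant, not the authors' implementation. It assumes the common formulation of iPCA (initialize missing cells with column means, then alternate a rank-k SVD reconstruction with re-imputation of the missing cells until the imputed values stabilize). The function names `ipca_impute` and `naive_chunkwise_impute`, the parameters `n_components`, `max_iter`, `tol` and `chunk_size`, and the stopping rule are illustrative assumptions. The proposed chunk-wise method, which imputes each chunk using the data analyzed so far, is not sketched because the abstract does not specify its update step.

```python
import numpy as np


def ipca_impute(X, n_components=2, max_iter=200, tol=1e-6):
    """Single imputation by iterative PCA (batch sketch, assumed formulation).

    Assumes every column has at least one observed value and that
    n_components <= min(X.shape).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialize missing cells with the column means of the observed entries.
    X_hat = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        # Center with the current column means and reconstruct at rank k.
        mu = X_hat.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_hat - mu, full_matrices=False)
        k = n_components
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        # Overwrite only the missing cells; observed values stay fixed.
        X_new = np.where(missing, low_rank, X)
        # Mean squared change of the imputed cells (illustrative stopping rule).
        change = np.sum((X_new - X_hat) ** 2) / max(missing.sum(), 1)
        X_hat = X_new
        if change < tol:
            break
    return X_hat


def naive_chunkwise_impute(X, chunk_size, n_components=2):
    """Naive baseline: impute each chunk of rows independently of the others."""
    X = np.asarray(X, dtype=float)
    chunks = [
        ipca_impute(X[start:start + chunk_size], n_components=n_components)
        for start in range(0, X.shape[0], chunk_size)
    ]
    return np.vstack(chunks)
```

A call such as `naive_chunkwise_impute(X, chunk_size=50)` treats every block of 50 rows as a separate data set, so no information is shared across chunks; that independence is what distinguishes the naive baseline from the chunk-wise procedure the paper proposes.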
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Iodice D’Enza, A., Palumbo, F., Markos, A. (2021). Single Imputation Via Chunk-Wise PCA. In: Chadjipadelis, T., Lausen, B., Markos, A., Lee, T.R., Montanari, A., Nugent, R. (eds) Data Analysis and Rationality in a Complex World. IFCS 2019. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-030-60104-1_9
DOI: https://doi.org/10.1007/978-3-030-60104-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60103-4
Online ISBN: 978-3-030-60104-1
eBook Packages: Mathematics and Statistics (R0)