Comparisons among several methods for handling missing data in principal component analysis (PCA)

Abstract

Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. This paper investigates the performance of five methods for handling missing data in PCA: the missing data passive (MDP) method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression (TSR) method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created both randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, with the DA method a close second in terms of parameter recovery. However, the DA method, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solution was not excessive.
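The recovery measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the congruence coefficient is Tucker's coefficient (the cosine between two loading vectors), that component columns are already matched in order, and that signs can be ignored. The function names are hypothetical, and column-mean imputation is used only as a simple stand-in; it is not one of the five methods compared in the paper.

```python
import numpy as np


def congruence(a, b):
    """Tucker's congruence coefficient: cosine between two loading vectors."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))


def mean_congruence(full, missing):
    """Mean absolute congruence over the columns of two loading matrices.

    Assumption: column k of `full` corresponds to column k of `missing`.
    The absolute value is taken because component signs are arbitrary.
    """
    n_comp = full.shape[1]
    return float(np.mean([abs(congruence(full[:, k], missing[:, k]))
                          for k in range(n_comp)]))


def censor_at_random(x, rate, rng):
    """Copy of x with each entry independently set to NaN with probability `rate`."""
    out = x.astype(float).copy()
    out[rng.random(x.shape) < rate] = np.nan
    return out


def pca_loadings(x, n_comp):
    """Loadings of the first n_comp principal components of column-centered x."""
    xc = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[:n_comp].T * s[:n_comp] / np.sqrt(x.shape[0])


def demo(rate=0.1, n_comp=2, seed=0):
    """Recovery of loadings after random censoring and mean imputation (stand-in)."""
    rng = np.random.default_rng(seed)
    # Synthetic data with a low-rank structure plus noise (illustrative only).
    scores = rng.normal(size=(200, n_comp)) * np.array([3.0, 1.5])[:n_comp]
    x = scores @ rng.normal(size=(n_comp, 8)) + 0.3 * rng.normal(size=(200, 8))
    x_missing = censor_at_random(x, rate, rng)
    # Stand-in for a missing-data method: fill each NaN with its column mean.
    col_means = np.nanmean(x_missing, axis=0)
    x_filled = np.where(np.isnan(x_missing), col_means, x_missing)
    return mean_congruence(pca_loadings(x, n_comp), pca_loadings(x_filled, n_comp))
```

Calling `demo()` with increasing `rate` or `n_comp` gives a rough feel for how recovery changes with censor rate and dimensionality, though the values depend on the synthetic data and should not be read as results from the paper.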

References

  • Bergami M, Bagozzi RP (2000) Self-categorization, affective commitment and group-esteem as distinct aspects of social identity in the organization. Brit J Soc Psychol 39:555–577
  • Bernaards CA, Sijtsma K (2000) Influence of imputation and EM methods on factor analysis when item nonresponse in questionnaire data is nonignorable. Multivar Behav Res 35:321–364
  • Dray S, Josse J (2015) Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216:657–667
  • Folch-Fortuny A, Arteaga F, Ferrer A (2015) PCA model building with missing data. Chemom Intell Lab 146:77–88
  • Folch-Fortuny A, Arteaga F, Ferrer A (2016) Missing data imputation toolbox for MATLAB. Chemom Intell Lab 154:93–100
  • Gabriel KR, Zamir S (1979) Lower rank approximation of matrices by least squares with any choice of weights. Technometrics 22:489–498
  • Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester
  • Grung B, Manne R (1998) Missing values in principal component analysis. Chemom Intell Lab 42:125–139
  • Hwang H, Takane Y (2014) Generalized structured component analysis: a component-based approach to structural equation modeling. Chapman and Hall/CRC Press, Boca Raton
  • Ilin A, Raiko T (2010) Practical approaches to principal component analysis in the presence of missing values. J Mach Learn Res 11:1957–2000
  • Josse J, Husson F, Pagès J (2009) Gestion des données manquantes en analyse en composantes principales. J de la Société Française de Statistique 150:28–51
  • Josse J, Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. J de la Société Française de Statistique 153:79–99
  • Josse J, Timmerman ME, Kiers HAL (2013) Missing values in multi-level simultaneous component analysis. Chemom Intell Lab 129:21–32
  • Kiers HAL (1997) Weighted least squares fitting using iterative ordinary least squares algorithms. Psychometrika 62:251–266
  • Little RJA, Rubin DB (1987) Statistical analysis with missing data. Wiley, New York
  • McDonald RP, Burr EJ (1967) A comparison of four methods of constructing factor scores. Psychometrika 32:381–401
  • Meulman JJ (1982) Homogeneity analysis of incomplete data. DSWO Press, Leiden
  • Mezzich JE (1978) Evaluating clustering methods for psychiatric diagnosis. Biol Psychol 13:265–281
  • Mori Y, Iizuka M, Tarumi T, Tanaka Y (2007) Variable selection in principal component analysis. In: Härdle W, Mori Y, Vieu P (eds) Statistical methods for biostatistics and related fields. Springer, Berlin, pp 265–283
  • Overall JE, Gorham DR (1962) The brief psychiatric rating scale. Psychol Rep 10:799–812
  • Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
  • Schafer JL (1997) Analysis of incomplete multivariate data. Wiley, New York
  • Segi M (1979) Age-adjusted death rates for cancer for selected sites (A-classification) in 51 countries in 1974. Segi Institute of Cancer Epidemiology, Nagoya
  • Serneels S, Verdonck T (2008) Principal component analysis for data containing outliers and missing elements. Comput Stat Data Anal 52:1712–1727
  • Shibayama T (1995) A linear composite method for test scores with missing values. Mem Faculty Educ Niigata Univ 36:445–455
  • Stanimirova I, Daszykowski M, Walczak B (2008) Dealing with missing values and outliers in principal component analysis. Talanta 72:172–178
  • Takane Y (2013) Constrained principal component analysis and related techniques. Chapman and Hall/CRC Press, Boca Raton
  • Takane Y, Oshima-Takane Y (2003) Relationships between two methods for dealing with missing data in principal component analysis. Behaviometrika 30:145–154
  • Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation (with discussion). J Am Stat Assoc 82:528–550
  • Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc B 61:611–622
  • Tucker LR (1951) A method of synthesis of factor analysis studies. Personnel Research Section Report No. 984, U.S. Department of Army, Washington, DC
  • Van Ginkel JR, Kroonenberg PM (2014) Using generalized procrustes analysis for multiple imputation in principal component analysis. J Classif 31:242–269
  • Van Ginkel JR, Kroonenberg PM, Kiers HAL (2014) Missing data in principal component analysis of questionnaire data. J Stat Comput Sim 84:2298–2315
  • Walczak B, Massart DL (2001) Dealing with missing data, part 1. Chemom Intell Lab 58:15–27
  • Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR (1997) Maximum likelihood principal component analysis. J Chemom 11:339–366

Acknowledgements

The work reported in this paper has been supported by a research grant (Discovery Grant: 10630) from the Natural Sciences and Engineering Research Council of Canada to the second author. We thank Aida Eslami for providing the reference to Josse and Husson (2012) on RPCA.

Author information

Corresponding author

Correspondence to Yoshio Takane.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 56 KB)

About this article

Cite this article

Loisel, S., Takane, Y. Comparisons among several methods for handling missing data in principal component analysis (PCA). Adv Data Anal Classif 13, 495–518 (2019). https://doi.org/10.1007/s11634-018-0310-9

Keywords

  • Homogeneity criterion
  • Missing data passive (MDP) method
  • Alternating least squares (ALS) algorithm
  • Weighted low rank approximation (WLRA) method
  • Regularized PCA (RPCA) method
  • Trimmed scores regression (TSR) method
  • Data augmentation (DA) method
  • Congruence coefficient

Mathematics Subject Classification

  • 15A03
  • 15A09