A principal component method to impute missing values for mixed data

  • Regular Article
  • Advances in Data Analysis and Classification

Abstract

We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, the factorial analysis for mixed data, which balances the influence of all the variables, continuous and categorical, in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based on the similarity between individuals and on the relationships between variables. The properties of the method are illustrated via simulations, and the quality of the imputation is assessed using real data sets. The method is compared to a recent method based on random forests (Stekhoven and Bühlmann, Bioinformatics 28:113–118, 2011) and shows better performance, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.
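The general idea of imputing from principal axes and components can be illustrated with a minimal sketch. The code below is not the method of the article: it handles continuous variables only, uses a plain (unregularized) truncated SVD, and omits the coding of categorical variables that the factorial analysis for mixed data relies on. The function name `iterative_pca_impute` and its parameters are hypothetical and chosen for illustration.

```python
import numpy as np

def iterative_pca_impute(X, n_components=2, n_iter=200, tol=1e-10):
    """Fill NaN cells of a numeric matrix by iterating a low-rank PCA fit.

    Illustrative sketch only: continuous columns, no regularization,
    no handling of categorical variables.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Start from the mean of the observed values in each column.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(n_iter):
        # Fit PCA on the current completed matrix (centered SVD).
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)

        # Reconstruct the data from the first n_components dimensions.
        k = n_components
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k] + mean

        # Overwrite only the missing cells; observed cells stay fixed.
        updated = np.where(missing, low_rank, X)

        # Stop when the imputed values no longer move.
        change = np.sum((updated - filled) ** 2)
        filled = updated
        if change < tol:
            break

    return filled
```

In this sketch, each missing cell is predicted from the individual's coordinates on the components (similarity between individuals) and from the loadings of the variables (relationships between variables), which is the intuition stated in the abstract. The number of components is an input here; how to choose it is outside the scope of the sketch.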

References

  • Benzécri JP (1973) L’analyse des données. L’analyse des correspondances. Dunod, Tome II

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Bro R, Kjeldahl K, Smilde AK, Kiers HAL (2008) Cross-validation of component model: a critical look at current methods. Anal Bioanal Chem 390:1241–1251

  • Cornillon PA, Guyader A, Husson F, Jégou N, Josse J, Kloareg M, Matzner-Løber E, Rouvière L (2012) R for Statistics. Chapman and Hall/CRC, Boca Raton

  • de Leeuw J, Mair P (2009) Gifi methods for optimal scaling in R: The package homals. J Statist Software 31(4):1–20, URL http://www.jstatsoft.org/v31/i04/

  • Escofier B (1979) Traitement simultané de variables quantitatives et qualitatives en analyse factorielle. Les cahiers de l’analyse des données 4(2):137–146

  • Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester

  • Greenacre M, Blasius J (2006) Multiple correspondence analysis and related methods. Chapman and Hall/CRC

  • Husson F, Josse J (2012) missMDA: Handling missing values with/in multivariate data analysis (principal component methods). URL http://www.agrocampus-ouest.fr/math/husson, R package version 1.4

  • Ilin A, Raiko T (2010) Practical approaches to principal component analysis in the presence of missing values. J Mach Learn Res 11:1957–2000, URL http://dl.acm.org/citation.cfm?id=1859890.1859917

  • Josse J, Husson F (2011) Selecting the number of components in PCA using cross-validation approximations. Comput Statist Data Anal 56(6):1869–1879

  • Josse J, Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153(2):1–21

  • Josse J, Pagès J, Husson F (2009) Gestion des données manquantes en analyse en composantes principales. Journal de la Société Française de Statistique 150:28–51

  • Josse J, Chavent M, Liquet B, Husson F (2012) Handling missing values with regularized iterative multiple correspondence analysis. J Classif 29:91–116

  • Kiers HAL (1991) Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika 56:197–212

  • Kiers HAL (1997) Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 62:251–266

  • Lafaye de Micheaux P, Drouilhet R, Liquet B (2011) Le logiciel R. Springer, Paris

  • Lang DT, Swayne D, Wickham H, Lawrence M (2012) rggobi: Interface between R and GGobi. URL http://CRAN.R-project.org/package=rggobi, R package version 2.1.19

  • Lebart L, Morineau A, Warwick KM (1984) Multivariate descriptive statistical analysis. Wiley, New York

  • Little RJA, Rubin DB (1987, 2002) Statistical analysis with missing data. Wiley series in probability and statistics, New York

  • Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11:2287–2322

  • Michailidis G, de Leeuw J (1998) The Gifi system of descriptive multivariate analysis. Statist Sci 13(4):307–336

  • Peters A, Hothorn T (2012) ipred: Improved Predictors. URL http://CRAN.R-project.org/package=ipred, R package version 0.9-1

  • R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, URL http://www.R-project.org/, ISBN 3-900051-07-0

  • Rubin DB (1976) Inference and missing data. Biometrika 63:581–592

  • Schafer JL (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC, London

  • Stekhoven D, Bühlmann P (2011) Missforest - nonparametric missing value imputation for mixed-type data. Bioinformatics 28:113–118

  • Tenenhaus M, Young FW (1985) An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika 50:91–119

  • Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525

  • van Buuren S (2007) Multiple imputation of discrete and continuous data by fully conditional specification. Statist Method Med Res 16:219–242

  • van Buuren S, Boshuizen H, Knook D (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statist Med 18:681–694

  • van der Heijden P, Escofier B (2003) Multiple correspondence analysis with missing data. In: Analyse des correspondances, Presse universitaire de Rennes, pp 153–170

  • Vermunt JK, van Ginkel JR, van der Ark LA, Sijtsma K (2008) Multiple imputation of incomplete categorical data using latent class analysis. Sociol Methodol 33:369–397

Author information

Corresponding author

Correspondence to François Husson.

About this article

Cite this article

Audigier, V., Husson, F. & Josse, J. A principal component method to impute missing values for mixed data. Adv Data Anal Classif 10, 5–26 (2016). https://doi.org/10.1007/s11634-014-0195-1

  • DOI: https://doi.org/10.1007/s11634-014-0195-1
