A principal component method to impute missing values for mixed data

Audigier, Vincent; Husson, François; Josse, Julie

doi:10.1007/s11634-014-0195-1

Regular Article
Published: 24 December 2014

Advances in Data Analysis and Classification, Volume 10, pages 5–26 (2016)

Abstract

We propose a new method to impute missing values in mixed data sets. It is based on a principal component method, the factorial analysis for mixed data, which balances the influence of the continuous and categorical variables in the construction of the principal components. Because the imputation uses the principal axes and components, the prediction of the missing values is based both on the similarity between individuals and on the relationships between variables. The properties of the method are illustrated via simulations, and the quality of the imputation is assessed using real data sets. The method is compared to a recent method based on random forests (Stekhoven and Bühlmann, Bioinformatics 28:113–118, 2011) and shows better performance, especially for the imputation of categorical variables and in situations with highly linear relationships between continuous variables.
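
To make the idea in the abstract concrete, here is a minimal sketch of iterative low-rank imputation on a mixed table. It is not the authors' implementation (the missMDA R package listed in the references is the published software), and it rests on several assumptions made for illustration: the function name `famd_impute_sketch` is hypothetical, missing continuous values are coded as NaN, missing categories as -1, and categorical variables are integer-coded from 0. Indicator columns are scaled by the square root of the category proportion, the usual weighting in factorial analysis for mixed data, and continuous columns by their standard deviation. Any regularisation and the choice of the number of components are left out.

```python
import numpy as np


def famd_impute_sketch(X_num, X_cat, n_components=2, n_iter=200, tol=1e-8):
    """Illustrative iterative imputation for mixed data (a sketch only).

    X_num: (n, p) floats, NaN marks a missing value (assumed coding).
    X_cat: (n, q) integer codes 0..K-1, -1 marks a missing value (assumed coding).
    """
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat, dtype=int)
    n, p = X_num.shape

    # One indicator block per categorical variable; a row with a missing
    # category is NaN across its whole block.
    blocks, sizes = [], []
    for j in range(X_cat.shape[1]):
        codes = X_cat[:, j]
        k = int(codes.max()) + 1
        Z = np.full((n, k), np.nan)
        obs = codes >= 0
        Z[obs] = np.eye(k)[codes[obs]]
        blocks.append(Z)
        sizes.append(k)

    W = np.hstack([X_num] + blocks)
    missing = np.isnan(W)
    is_dummy = np.arange(W.shape[1]) >= p

    # Initial fill: column means (category proportions for indicator columns).
    W = np.where(missing, np.nanmean(W, axis=0), W)

    for _ in range(n_iter):
        mu = W.mean(axis=0)
        # Continuous columns: standard deviation. Indicator columns:
        # square root of the current category proportion.
        scale = np.where(is_dummy, np.sqrt(np.clip(mu, 1e-12, None)), W.std(axis=0))
        scale = np.maximum(scale, 1e-12)
        A = (W - mu) / scale

        # Rank-limited reconstruction from the leading principal axes.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A_hat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        W_hat = A_hat * scale + mu

        # Only the missing cells are overwritten; observed cells are kept.
        W_new = np.where(missing, W_hat, W)
        change = np.sum((W_new - W) ** 2)
        W = W_new
        if change < tol:
            break

    num_imputed = W[:, :p]
    cat_imputed = X_cat.copy()
    start = p
    for j, k in enumerate(sizes):
        miss = X_cat[:, j] < 0
        if miss.any():
            # Pick the category with the largest reconstructed indicator value.
            cat_imputed[miss, j] = W[miss, start:start + k].argmax(axis=1)
        start += k
    return num_imputed, cat_imputed


# Toy data: two linearly related continuous variables and one category.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=50)
X_num = np.column_stack([x1, x2])
X_cat = (x1 > 0).astype(int).reshape(-1, 1).copy()
X_num[[3, 10], 1] = np.nan
X_cat[[5, 20], 0] = -1

num_imp, cat_imp = famd_impute_sketch(X_num, X_cat, n_components=1)
```

Because each missing cell is filled from a reconstruction built on the principal axes and components, the imputed value reflects both which individuals resemble each other and how the variables are related, which is the property the abstract highlights.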

References

Benzécri JP (1973) L’analyse des données. L’analyse des correspondances. Dunod, Tome II
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Bro R, Kjeldahl K, Smilde AK, Kiers HAL (2008) Cross-validation of component model: a critical look at current methods. Anal Bioanal Chem 390:1241–1251
Cornillon PA, Guyader A, Husson F, Jégou N, Josse J, Kloareg M, Matzner-Løber E, Rouvière L (2012) R for Statistics. Chapman and Hall/CRC, Boca Raton
de Leeuw J, Mair P (2009) Gifi methods for optimal scaling in R: The package homals. J Statist Software 31(4):1–20, URL http://www.jstatsoft.org/v31/i04/
Escofier B (1979) Traitement simultané de variables quantitatives et qualitatives en analyse factorielle. Les cahiers de l’analyse des données 4(2):137–146
Gifi A (1990) Nonlinear multivariate analysis. Wiley, Chichester
Greenacre M, Blasius J (2006) Multiple correspondence analysis and related methods. Chapman and Hall/CRC
Husson F, Josse J (2012) missMDA: Handling missing values with/in multivariate data analysis (principal component methods). URL http://www.agrocampus-ouest.fr/math/husson, R package version 1.4
Ilin A, Raiko T (2010) Practical approaches to principal component analysis in the presence of missing values. J Mach Learn Res 11:1957–2000, URL http://dl.acm.org/citation.cfm?id=1859890.1859917
Josse J, Husson F (2011) Selecting the number of components in PCA using cross-validation approximations. Comput Statist Data Anal 56(6):1869–1879
Josse J, Husson F (2012) Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique 153(2):1–21
Josse J, Pagès J, Husson F (2009) Gestion des données manquantes en analyse en composantes principales. Journal de la Société Française de Statistique 150:28–51
Josse J, Chavent M, Liquet B, Husson F (2012) Handling missing values with regularized iterative multiple correspondence analysis. J Classif 29:91–116
Kiers HAL (1991) Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika 56:197–212
Kiers HAL (1997) Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 62:251–266
Lafaye de Micheaux P, Drouilhet R, Liquet B (2011) Le logiciel R. Springer, Paris
Lang DT, Swayne D, Wickham H, Lawrence M (2012) rggobi: Interface between R and GGobi. URL http://CRAN.R-project.org/package=rggobi, R package version 2.1.19
Lebart L, Morineau A, Warwick KM (1984) Multivariate descriptive statistical analysis. Wiley, New York
Little RJA, Rubin DB (1987, 2002) Statistical analysis with missing data. Wiley series in probability and statistics, New York
Mazumder R, Hastie T, Tibshirani R (2010) Spectral regularization algorithms for learning large incomplete matrices. J Mach Learn Res 11:2287–2322
Michailidis G, de Leeuw J (1998) The Gifi system of descriptive multivariate analysis. Statist Sci 13(4):307–336
Peters A, Hothorn T (2012) ipred: Improved Predictors. URL http://CRAN.R-project.org/package=ipred, R package version 0.9-1
R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, URL http://www.R-project.org/, ISBN 3-900051-07-0
Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
Schafer JL (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC, London
Stekhoven D, Bühlmann P (2011) MissForest: nonparametric missing value imputation for mixed-type data. Bioinformatics 28:113–118
Tenenhaus M, Young FW (1985) An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika 50:91–119
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525
van Buuren S (2007) Multiple imputation of discrete and continuous data by fully conditional specification. Statist Method Med Res 16:219–242
van Buuren S, Boshuizen H, Knook D (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statist Med 18:681–694
van der Heijden P, Escofier B (2003) Multiple correspondence analysis with missing data. In: Analyse des correspondances, Presse universitaire de Rennes, pp 153–170
Vermunt JK, van Ginkel JR, van der Ark LA, Sijtsma K (2008) Multiple imputation of incomplete categorical data using latent class analysis. Sociol Methodol 33:369–397

Author information

Authors and Affiliations

Agrocampus Ouest, 65 rue de St-Brieuc, 35042, Rennes, France
Vincent Audigier, François Husson & Julie Josse

Corresponding author

Correspondence to François Husson.

About this article

Cite this article

Audigier, V., Husson, F. & Josse, J. A principal component method to impute missing values for mixed data. Adv Data Anal Classif 10, 5–26 (2016). https://doi.org/10.1007/s11634-014-0195-1

Received: 06 February 2013
Revised: 15 December 2014
Accepted: 16 December 2014
Published: 24 December 2014
Issue Date: March 2016
DOI: https://doi.org/10.1007/s11634-014-0195-1

Keywords

Mathematics Subject Classification

62H25
