Statistics and Computing

, Volume 27, Issue 2, pp 501–518 | Cite as

MIMCA: multiple imputation for categorical variables with multiple correspondence analysis

Article

Abstract

We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal component method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Multiple imputation using MCA (MIMCA) requires estimating a small number of parameters due to the dimensionality reduction property of MCA. It allows the user to impute a large range of data sets. In particular, a high number of categories per variable, a high number of variables or a small number of individuals are not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well to the latest works on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method provides a good point estimate of the parameters of the analysis model considered, such as the coefficients of a main effects logistic regression model, and a reliable estimate of the variability of the estimators. In addition, MIMCA has the great advantage that it is substantially less time consuming on data sets of high dimensions than the other multiple imputation methods.

Keywords

Missing values Categorical data Multiple imputation Multiple correspondence analysis Bootstrap 

Mathematics Subject Classification

62H25 6207 62F40 

References

  1. Agresti, A.: Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley, New York (2013)Google Scholar
  2. Agresti, A., Coull, B.A.: Approximate is better than ‘exact” for interval estimation of binomial proportions. Am. Stat. 52(2), 119–126 (1998). doi:10.2307/2685469 MathSciNetGoogle Scholar
  3. Albert, A., Anderson, J.A.: On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71(1), 1–10 (1984). doi:10.2307/2336390 MathSciNetCrossRefMATHGoogle Scholar
  4. Allison, P.D.: Handling missing data by maximum likelihood. In: SAS global forum, pp 1–21 (2012)Google Scholar
  5. Allison, P.D.: Missing Data. Sage, Thousand Oaks (2002)CrossRefMATHGoogle Scholar
  6. Applied Mathematics Department, Agrocampus O, France (2010) galetas data set. http://math.agrocampus-ouest.fr/infoglueDeliverLive/digitalAssets/74258_galetas.txt
  7. Audigier, V., Husson, F., Josse, J.: Multiple imputation for continuous variables using a Bayesian principal component analysis. J. Stat. Comput. Simul. (2014). doi:10.1080/00949655.2015.1104683
  8. Audigier, V., Husson, F., Josse, J.: A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 7, 1–22 (2014)Google Scholar
  9. Barnard, J., Rubin, D.B.: Small sample degrees of freedom with multiple imputation. Biometrika 86, 948–955 (1999)MathSciNetCrossRefMATHGoogle Scholar
  10. Bartlett, J.W., Seaman, S.R., White, I.R., Carpenter, J.R.: Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat. Methods. Med. Res. 24, 462 (2014)MathSciNetCrossRefGoogle Scholar
  11. Benzécri, J.P.: L’analyse des données. L’analyse des données.Tome II: L’analyse des correspondances. Dunod (1973)Google Scholar
  12. Bernaards, C.A., Belin, T.R., Schafer, J.L.: Robustness of a multivariate normal approximation for imputation of incomplete binary data. Stat. Med. 26(6), 1368–1382 (2007)MathSciNetCrossRefGoogle Scholar
  13. Besag, J.: Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B (Methodological) 36(2), 192 (1974)MathSciNetMATHGoogle Scholar
  14. Brand, J.P.L., van Buuren, S., Groothuis-Oudshoorn, K., Gelsema, E.S.: A toolkit in sas for the evaluation of multiple imputation methods. Stat. Neerl. 57(1), 36–45 (2003). doi:10.1111/1467-9574.00219 MathSciNetCrossRefGoogle Scholar
  15. Candès, E.J., Tao, T.: The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theory 56(5), 2053–2080 (2009). doi:10.1109/TIT.2010.2044061 MathSciNetCrossRefGoogle Scholar
  16. Carpenter, J.R., Goldstein, H., Kenward, M.G.: REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. J. Stat. Softw. 45(5), 1–14 (2011), http://www.jstatsoft.org/v45/i05
  17. Carpenter, J., Kenward, M.: Multiple Imputation and its Application, 1st edn. Wiley, Chichester (2013)CrossRefMATHGoogle Scholar
  18. Dawson, R.J.M.: The ‘unusual episode’ data revisited. Journal of Statistics Education 3, 1–7, http://www.amstat.org/publications/jse/v3n3/datasets.dawson.html (1995)
  19. Demirtas, H.: Rounding strategies for multiply imputed binary data. Biom. J. 51(4), 677–688 (2009)MathSciNetCrossRefGoogle Scholar
  20. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc. B 39, 1–38 (1977)MathSciNetMATHGoogle Scholar
  21. Doove, L.L., Van Buuren, S., Dusseldorp, E.: Recursive partitioning for missing data imputation in the presence of interaction effects. Comput. Stat. Data Anal. 72, 92–104 (2014). doi:10.1016/j.csda.2013.10.025 MathSciNetCrossRefGoogle Scholar
  22. Dunson, D.B., Xing, C.: Nonparametric Bayes modeling of multivariate categorical data. J. Am. Stat. Assoc. 104(487), 1042–1051 (2009)MathSciNetCrossRefMATHGoogle Scholar
  23. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1(3), 211–218 (1936)CrossRefMATHGoogle Scholar
  24. Gavish, M., Donoho, D.: Optimal shrinkageof singular values. arXiv:1405.7511 e-prints (214)
  25. Gelman, A., Hill, J., Su, Y., Yajima, M., Grazia Pittau, M., Goodrich, B., Si, Y.: mi: Missing data imputation and model checking. R package version 0.9-93 (2013)Google Scholar
  26. Gifi, A.: Nonlinear Multivariate Analysis. D.S.W.O. Press, Leiden (1981)MATHGoogle Scholar
  27. GlaxoSmithKline, Toronto, Ontario, Canada: Blood pressure data set. http://www.math.yorku.ca/Who/Faculty/Ng/ssc2003/BPMainF.htm (2003)
  28. Greenacre, M.J.: Theory and Applications of Correspondence Analysis. Academic Press, London (1984)MATHGoogle Scholar
  29. Greenacre, M.J., Blasius, J.: Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, Boca Raton (2006)CrossRefMATHGoogle Scholar
  30. Harding, T., Tusell, F., Schafer, J.L.: cat: Analysis of categorical-variable datasets with missing values. http://CRAN.R-project.org/package=cat, r package version 0.0-6.5 (2012)
  31. Honaker, J., King, G., Blackwell, M.: Amelia II: A program for missing data. R package version 1.7.2 (2014)Google Scholar
  32. Honaker, J., King, G., Blackwell, M.: Amelia II: A program for missing data. J. Stat. Softw. 45(7), 1–47 (2011)CrossRefGoogle Scholar
  33. Husson, F., Josse, J.: missMDA: Handling missing values with multivariate data analysis. http://CRAN.R-project.org/package=missMDA, r package version 1.9 (2015)
  34. Ishwaran, H., James, L.: Gibbs sampling methods for stick-breaking priors. J. Am. Stat. Assoc. 96(453), 161–173 (2001)MathSciNetCrossRefMATHGoogle Scholar
  35. Josse, J., Chavent, M., Liquet, B., Husson, F.: Handling missing values with regularized iterative multiple correspondence analysis. J. Classif. 29, 91–116 (2012)MathSciNetCrossRefMATHGoogle Scholar
  36. Josse, J., Husson, F.: Selecting the number of components in PCA using cross-validation approximations. Comput. Stat. Data Anal. 56(6), 1869–1879 (2011)CrossRefMATHGoogle Scholar
  37. Josse, J., Husson, F.: missmda a package to handle missing values in and with multivariate data analysis methods. J. Stat. Softw. 25, 1 (2015)Google Scholar
  38. Josse, J., Sardy, S.: Adaptive shrinkage of singular values. Stat. Comput. 71, 1–10 (2015)MATHGoogle Scholar
  39. Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab—an S4 package for kernel methods in R. J. Stat. Softw. 11(9):1–20, http://www.jstatsoft.org/v11/i09/ (2004)
  40. King, G., Honaker, J., Joseph, A., Scheve, K.: Analyzing incomplete political science data: An alternative algorithm for multiple imputation. Am. Polit. Sci. Rev. 95(1), 49–69 (2001)Google Scholar
  41. Lebart, L., Morineau, A., Werwick, K.M.: Multivariate Descriptive Statistical Analysis. Wiley, New-York (1984)Google Scholar
  42. Lichman, M.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2013)
  43. Little, R.J.A., Rubin, D.B.: Statistical analysis with missing data. Wiley series in probability and statistics, Wiley, New-York (1987, 2002)Google Scholar
  44. Meinfelder, F., Schnapp, T.: BaBooN: Bayesian bootstrap predictive mean matching—multiple and single imputation for discrete data. https://CRAN.R-project.org/package=BaBooN, r package version 0.2-0 (2015)
  45. Meng, X.L., Rubin, D.B.: Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. J. Am. Stat. Assoc. 86(416), 899–909 (1991)CrossRefGoogle Scholar
  46. Nishisato, S.: Analysis of Categorical Data: Dual Scaling and its Applications. University of Toronto Press, Toronto (1980)MATHGoogle Scholar
  47. Quartagno, M., Carpenter, J.: jomo: A package for multilevel joint modelling multiple imputation. http://CRAN.R-project.org/package=jomo (2015)
  48. R Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, http://www.R-project.org/ (2014)
  49. Rousseauw, J., du Plessis, J., Benade, A., Jordann, P., Kotze, J., Jooste, P., Ferreira, J.: Coronary risk factor screening in three rural communities. S. Afr. Med. J. 64, 430–436 (1983)Google Scholar
  50. Rubin, D.B.: Multiple Imputation for Non-Response in Survey. Wiley, New York (1987)CrossRefGoogle Scholar
  51. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London (1997)CrossRefMATHGoogle Scholar
  52. Schafer, J.L.: Multiple imputation in multivariate problems when the imputation and analysis models differ. Stat. Neerl. 57(1), 19–35 (2003)MathSciNetCrossRefGoogle Scholar
  53. Seaman, S.R., Bartlett, J.W., White, I.R.: Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Med. Res. Methodol. 12(1), 46 (2012). doi:10.1186/1471-2288-12-46 CrossRefGoogle Scholar
  54. Shabalin, A., Nobel, B.: Reconstruction of a low-rank matrix in the presence of gaussian noise. J. Multivar. Anal. 118, 67–76 (2013)MathSciNetCrossRefMATHGoogle Scholar
  55. Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O., Hemingway, H.: Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study. Am. J. Epidemiol. 179(6), 764–774 (2014). doi:10.1093/aje/kwt312 CrossRefGoogle Scholar
  56. Si, Y., Reiter, J.: Nonparametric bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. J. Educ. Behav. Stat. 38, 499–521 (2013)CrossRefGoogle Scholar
  57. Stekhoven, D.J., Bühlmann, P.: Missforest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012)CrossRefGoogle Scholar
  58. Tenenhaus, M., Young, F.W.: An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika 50, 91–119 (1985)MathSciNetCrossRefMATHGoogle Scholar
  59. Van Buuren, S., Groothuis-Oudshoorn, K.: mice. R package version 2.22 (2014)Google Scholar
  60. Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, C.G.M., Rubin, D.B.: Fully conditional specification in multivariate imputation. J. Stat. Comput. Simul. 76, 1049–1064 (2006)MathSciNetCrossRefMATHGoogle Scholar
  61. Van Buuren, S.: Flexible Imputation of Missing Data (Chapman & Hall/CRC Interdisciplinary Statistics), 1st edn. Chapman and Hall/CRC, Boca Raton (2012)CrossRefMATHGoogle Scholar
  62. Van Buuren, S., Groothuis-Oudshoorn, C.G.M.: mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45(3), 1–67 (2011)CrossRefGoogle Scholar
  63. Van der Heijden, P., Escofier, B.: Analyse des correspondances: recherches au coeur de l’analyse des données, Presses universitaires de Rennes, Rennes, France, chap Multiple correspondence analysis with missing data, pp 152–170 (2003)Google Scholar
  64. van der Palm, D., van der Ark, L., Vermunt, J.: A comparison of incomplete-data methods for categorical data. Stat. Methods Med. Res. 17, 33 (2014)Google Scholar
  65. Verbanck, M., Josse, J., Husson, F.: Regularised PCA to denoise and visualise data. Stat. Comput. 25(2), 471–486 (2013). doi:10.1007/s11222-013-9444-y MathSciNetCrossRefMATHGoogle Scholar
  66. Vermunt, J.K., van Ginkel, J.R., van der Ark, L.A., Sijtsma, K.: Multiple imputation of incomplete categorical data using latent class analysis. Sociol. Methodol. 38(38), 369–397 (2008)CrossRefGoogle Scholar
  67. Vidotto, D., Kapteijn, M.C., Vermunt, J.: Multiple imputation of missing categorical data using latent class models: State of art. Psychol. Test Assess. Model. 57, 542 (2014)Google Scholar
  68. Yucel, R.M., He, Y., Zaslavsky, A.M.: Using calibration to improve rounding in imputation. Am. Stat. 62, 125–129 (2008)MathSciNetCrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Vincent Audigier
    • 1
  • François Husson
    • 1
  • Julie Josse
    • 1
  1. 1.Applied Mathematics DepartmentAgrocampus OuestRennes CedexFrance

Personalised recommendations