Multiple Imputation: Application

Applied Multiple Imputation

Part of the book series: Statistics for Social and Behavioral Sciences (SSBS)

Abstract

In this chapter, we discuss the most important and most commonly used multiple imputation (MI) tools in R for both multivariate and clustered data sets (Table 5.1 gives an overview of the download frequencies of various MI packages in R). These include the packages mice, norm2, Amelia, mi, and pan, as well as the function aregImpute() from package Hmisc. We show practical applications and give hands-on, step-by-step tutorials on how to carry out MI in practice.

Notes

  1. Table 5.1 gives an overview of the download frequencies of various MI packages in R.

  2. Note that classical zero-inflation models allow two sources of zeros: they can stem either from the zero process or from the count process. A highly similar model class, the so-called hurdle models, allows only one source of zeros. Hurdle models fit a zero-truncated Poisson or negative binomial model to the count process.

  3. From today’s perspective, a more appropriate model would have been the zero-inflated negative binomial model, which allows a dispersion parameter to be estimated (Hilbe, 2011). However, this model was not available in standard SEM software like Mplus (Muthén & Muthén, 2017) at that time.

  4. To keep computation times of the examples in this book low, we only estimate and report the ‘fixed’ part of the model, i.e., we assume variances and covariances of the latent growth factors to be zero. Hence, η_ij = α_j for all i, j.

  5. The file crim4.dat can be found in the subdirectory ‘data’ in the supplementary material of the book. To be able to reproduce the examples in the book, we ask readers to set the pathname of the supplementary material as the working directory using the setwd() function.

  6. The output of the md.pairs() function is a list of four components named rr, rm, mr, and mm. These components can be accessed using the $ operator. See help("md.pairs") for details.

  7. Reinecke and Weins (2013) split the zero-inflated count variables into zero vs. non-zero indicators and zero-truncated count variables. They imputed both variables jointly under the multivariate normal model.

  8. To reduce simulation error due to small numbers of imputations, we will create m = 100 sets of imputations in all applications throughout this book (see, for example, the recommendations by Graham et al., 2007, or Bodner, 2008). Even with modern computers, this can take a considerable amount of time, especially when imputation models are more complex. To reproduce the examples within a reasonable amount of time, readers could set the number of imputations to 5. Additionally, all sets of imputations are available from the supplementary material of the book.

  9. Sometimes the term ‘plausibility’ refers to imputations that lead to valid statistical inferences. In our context, the true parameters are unknown. Here we regard imputations as plausible if they meet our expectations about these values and their distributions, based on what we know about the data set and based on the underlying theory. For the CrimoC data, for example, we can expect boys and students from Hauptschule to receive higher delinquency scores on average.

  10. At the time of writing, the most recent version of the norm package, 1.0-9.5, was published in February 2013.

  11. Panel data, however, could be imputed in wide format, which would implicitly take the longitudinal information into account (for a discussion, see Sect. 5.6.3.1).

  12. We would once again like to caution applied researchers against using normal model MI after transformation on highly non-normal data. Transformations and roundings were applied in the past, when more appropriate strategies and models were not yet available.

  13. Note that emNorm() produces maximum likelihood estimates of the mean vector and covariance matrix of the variables in the model, which can also be directly used as input for further statistical analyses (e.g., structural equation models).

  14. https://cran.r-project.org/package=Amelia. Descriptions of the algorithm and detailed documentation of the software may be found at the project website https://gking.harvard.edu/amelia.

  15. https://cran.r-project.org/package=mi.

  16. https://cran.r-project.org/web/packages/mi/vignettes/mi_vignette.pdf.

  17. https://cran.r-project.org/package=Hmisc.

  18. See Sect. 5.2.2 for details about the growth curve ZIP model. Note that the intercept of the zero model has been fixed to zero (cf. Muthén & Muthén, 2017, Example 6.7). See Sects. 5.5.1–5.5.4 for details about the respective imputation methods and models that are used to create the multiple imputations.

  19. Shortcomings of some imputation techniques or consequences of misspecifications in simple data sets are discussed in de Jong et al. (2016) or He and Raghunathan (2009).

  20. Graham (2009) also recommends keeping the total number of variables in the imputation model below 100.

  21. An exception is the statistical package Stata. See the FAQ document ‘How can I account for clustering when creating imputations with mi impute?’, https://www.stata.com/support/faqs/statistics/clustering-and-mi-impute/.

  22. Additional R packages like miceadds or countimp provide further two-level imputation functions for the mice framework.

  23. In the supplementary material of the book, we provide R code to extract the imputed data sets from the mids object and to write the data to disk for further analysis in Mplus.

  24. An introduction to the lavaan syntax may be found at Yves Rosseel’s website http://lavaan.ugent.be/.

  25. Another option would be to use functions from package semTools, which we introduce in Sect. 5.7.7.5.

  26. https://cran.r-project.org/package=glmmTMB.

  27. More precisely, we average the Fisher-Z-transformed correlations and then, in a second step, bring the mean of these transformed correlations back to the original scale (see, for example, Schafer, 1997, for details on how to pool correlation coefficients). The functions cor2fisher() and fisher2cor(), which we use for the transformation and back-transformation, are available from the supplementary material of the book.

  28. For further details, see the help files of functions abind() and apply().

  29. https://github.com/glmmTMB/glmmTMB/blob/master/glmmTMB/R/family.R

  30. https://cran.r-project.org/package=miceadds.

  31. For further information, see option mimic="Mplus" in lavaan (https://cran.r-project.org/web/packages/lavaan/lavaan.pdf).

  32. The transformation of imputed data lists that were created by other packages to an object of class mids is described in Sect. 5.7.

  33. Instead of a simple MCAR mechanism, applied researchers could also simulate more complex and more realistic MAR or MNAR mechanisms that are believed to be present in the empirical data.
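
Note 27 describes pooling correlations by averaging them on the Fisher-Z scale and then transforming the mean back. The following base-R sketch illustrates that two-step idea. The helper names r_to_z(), z_to_r(), and pool_cor() and the example correlations are hypothetical; they are not the cor2fisher() and fisher2cor() functions from the supplementary material of the book, which may differ in detail.

```r
# Hypothetical helpers for illustration only (not the book's functions).
r_to_z <- function(r) atanh(r)  # Fisher Z: 0.5 * log((1 + r) / (1 - r))
z_to_r <- function(z) tanh(z)   # back-transformation to the correlation scale

# Pool one correlation coefficient across m imputed data sets:
# step 1: transform each estimate, step 2: average, step 3: back-transform.
pool_cor <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) < 1))
  z_to_r(mean(r_to_z(r)))
}

# Made-up correlation estimates from m = 5 imputed data sets
r_imp <- c(0.42, 0.47, 0.39, 0.45, 0.44)
pooled <- pool_cor(r_imp)
print(pooled)
```

Averaging on the transformed scale rather than averaging the raw correlations is the design choice here: the sampling distribution of the Fisher-Z-transformed correlation is closer to normal, which is what the pooling rules assume.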

References

  • Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage.

  • Andridge, R. R. (2011). Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials. Biometrical Journal, 53(1), 57–74.

  • Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948–955.

  • Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15(4), 651–675.

  • Boers, K., Reinecke, J., Seddig, D., & Mariotti, L. (2010). Explaining the development of adolescent violent delinquency. European Journal of Criminology, 7(6), 499–520.

  • Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. New York, NY: Wiley.

  • Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park, CA: Sage.

  • Carpenter, J. R., Goldstein, H., & Kenward, M. G. (2011). REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. Journal of Statistical Software, 45(5), 1–14.

  • Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351.

  • Dahl, F. A. (2007). Convergence of random k-nearest-neighbour imputation. Computational Statistics & Data Analysis, 51(12), 5913–5917.

  • de Jong, R., van Buuren, S., & Spiess, M. (2016). Multiple imputation of predictor variables using generalized additive models. Communications in Statistics – Simulation and Computation, 45(3), 968–985.

  • Drechsler, J. (2015). Multiple imputation of multilevel missing data—Rigor versus simplicity. Journal of Educational and Behavioral Statistics, 40(1), 69–95.

  • Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford.

  • Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychological Methods, 21(2), 222–240.

  • Gaffert, P., Meinfelder, F., & Bosch, V. (2016). Towards an MI-proper predictive mean matching (Discussion Paper). https://www.uni-bamberg.de/fileadmin/uni/fakultaeten/sowi_lehrstuehle/statistik/Personen/Dateien_Florian/properPMM.pdf.

  • Gałecki, A., & Burzykowski, T. (2013). Linear mixed-effects models using R: A step-by-step approach. Heidelberg/New York, NY: Springer.

  • Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576.

  • Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8(3), 206–213.

  • Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of missing covariate values in multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48(2), 640–649.

  • Harel, O. (2009). The estimation of R² and adjusted R² in incomplete data sets using multiple imputation. Journal of Applied Statistics, 36(10), 1109–1118.

  • He, Y., & Raghunathan, T. E. (2009). On the performance of sequential regression multiple imputation methods with non normal error distributions. Communications in Statistics – Simulation and Computation, 38(4), 856–883.

  • Hilbe, J. M. (2011). Negative binomial regression (2nd ed.). Cambridge, UK: Cambridge University Press.

  • Hill, M. (1997). SPSS missing value analysis 7.5. Chicago, IL: SPSS.

  • Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57(4), 229–232.

  • Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.

  • Hu, L.-t., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55.

  • Kleinke, K. (2017). Multiple imputation under violated distributional assumptions—a systematic evaluation of the assumed robustness of predictive mean matching. Journal of Educational and Behavioral Statistics, 42(4), 371–404.

  • Kleinke, K. (2018). Multiple imputation by predictive mean matching when sample size is small. Methodology, 14(1), 3–15.

  • Kleinke, K., Stemmler, M., Reinecke, J., & Lösel, F. (2011). Efficient ways to impute incomplete panel data. AStA Advances in Statistical Analysis, 95(4), 351–373.

  • Lally, J. R., Mangione, P. L., & Honig, A. S. (1988). The Syracuse University Family Development Research Program: Long-range impact of an early intervention with low-income children and their families. In D. R. Powell (Ed.), Parent education as early childhood intervention: Emerging directions in theory, research and practice (pp. 79–104). Norwood, NJ: Ablex.

  • Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14.

  • Lüdtke, O., Robitzsch, A., & Grund, S. (2017). Multiple imputation of missing data in multilevel designs: A comparison of different strategies. Psychological Methods, 22(1), 141–165.

  • Mariotti, L., & Reinecke, J. (2010). Wachstums- und Mischverteilungsmodelle unter Berücksichtigung unbeobachteter Heterogenität: Empirische Analysen zum delinquenten Verhalten Jugendlicher in Duisburg [Growth models, mixture models, and unobserved heterogeneity: Empirical analyses of juvenile delinquent behaviors in Duisburg.] Münster, Germany: Institut für sozialwissenschaftliche Forschung e.V.

  • McCord, J. (1978). A thirty-year follow-up of treatment effects. American Psychologist, 33(3), 284–289.

  • Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9(4), 538–558.

  • Meng, X.-L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79(1), 103–111.

  • Moffitt, T. E. (1993). Adolescence-limited and life-course-persistent antisocial behavior: A developmental taxonomy. Psychological Review, 100(4), 674–701.

  • Morris, T. P., White, I. R., & Royston, P. (2014). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology, 14(1), 75–87.

  • Muthén, L. K., & Muthén, B. O. (2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.

  • Nguyen, C. D., Carlin, J. B., & Lee, K. J. (2017). Model checking in multiple imputation: An overview and case study. Emerging Themes in Epidemiology, 14(8), 1–12.

  • Pöge, A. (2005). Persönliche Codes bei Längsschnittstudien. Ein Erfahrungsbericht. ZA-Nachrichten, 56, 50–69.

  • Pöge, A. (2008). Persönliche Codes ‘reloaded’. Methoden—Daten—Analysen, 2(1), 59–70.

  • Reinecke, J., & Seddig, D. (2011). Growth mixture models in longitudinal research. AStA Advances in Statistical Analysis, 95(4), 415–434.

  • Reinecke, J., & Weins, C. (2013). The development of delinquency during adolescence: A comparison of missing data techniques. Quality & Quantity, 47(6), 3319–3334.

  • Rosseel, Y. (2012). lavaan: An R Package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.

  • Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.

  • Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

  • Schafer, J. L. (1997). Analysis of incomplete multivariate data. London, UK: Chapman & Hall.

  • Schafer, J. L. (1999b). NORM users guide (version 2) [Computer software manual]. University Park, PA: The Methodology Center, The Pennsylvania State University. https://www.methodology.psu.edu/training/missing-data/.

  • Schafer, J. L. (2016). norm2: Analysis of incomplete multivariate data under a normal model [Computer software manual]. https://CRAN.R-project.org/package=norm2 (R Package Version 2.0.1).

  • Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.

  • Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research, 33(4), 545–571.

  • Schafer, J. L., & Olsen, M. K. (1999). Modeling and imputation of semicontinuous survey variables. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.7891

  • Schafer, J. L., & Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics, 11(2), 437–457.

  • Schenker, N., & Taylor, J. M. G. (1996). Partially parametric techniques for multiple imputation. Computational Statistics & Data Analysis, 22(4), 425–446.

  • Siddique, J., & Belin, T. R. (2008). Multiple imputation using an iterative hot-deck with distance-based donor selection. Statistics in Medicine, 27(1), 83–102.

  • Speidel, M., Drechsler, J., & Sakshaug, J. W. (2018). Biases in multilevel analyses caused by cluster-specific fixed-effects imputation. Behavior Research Methods, 50(5), 1824–1840.

  • Spiess, M., Kleinke, K., & Reinecke, J. (in press). Proper multiple imputation of clustered or panel data. In P. Lynn (Ed.), Advances in longitudinal survey methodology. New York, NY: Wiley. https://www.wiley.com/en-us/Advances+in+Longitudinal+Survey+Methodology-p-9781119376934

  • Su, Y.-S., Gelman, A., Hill, J., & Yajima, M. (2011). Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software, 45(2), 1–31.

  • van Buuren, S. (2011). Multiple imputation of multilevel data. In J. J. Hox & J. K. Roberts (Eds.), Handbook of advanced multilevel analysis (pp. 173–196). New York, NY: Taylor & Francis.

  • van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC.

  • van Buuren, S. (2013). Multiple imputation of multilevel data. Paper presented at the Conference on Recent Advances in Multiple Imputation, with Emphasis on Dealing with Deviations from MAR or Exchangeability, Utrecht, the Netherlands.

  • van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.

  • Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York, NY: Springer.

  • Vink, G., Lazendic, G., & van Buuren, S. (2015). Partitioned predictive mean matching as a large data multilevel imputation technique. Psychological Test and Assessment Modeling, 57(4), 577–594.

  • Weins, C., & Reinecke, J. (2007). Delinquenzverläufe im Jugendalter: Eine methodologische Analyse zur Auswirkung von fehlenden Werten im Längsschnitt [Development of Juvenile Delinquency: An analysis of the effects of missing data]. Monatsschrift für Kriminologie und Strafrechtsreform, 90(5), 418–437.

  • Yu, L. M., Burton, A., & Rivero-Arias, O. (2007). Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research, 16(3), 243–258.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kleinke, K., Reinecke, J., Salfrán, D., Spiess, M. (2020). Multiple Imputation: Application. In: Applied Multiple Imputation. Statistics for Social and Behavioral Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-38164-6_5
