Multiple Imputation: Application

Kleinke, Kristian; Reinecke, Jost; Salfrán, Daniel; Spiess, Martin

doi:10.1007/978-3-030-38164-6_5

Part of the book series: Statistics for Social and Behavioral Sciences (SSBS)

Abstract

In this chapter, we discuss the most important and most commonly used multiple imputation (MI) tools in R (see Note 1) for both multivariate and clustered data sets, including the packages mice, norm2, Amelia, mi, and pan, as well as the function aregImpute() from package Hmisc, and show practical applications. We give hands-on, step-by-step tutorials on how to carry out MI in practice.

Notes

1.
Table 5.1 gives an overview of the download frequencies of various MI packages in R.
2.
Note that classical zero-inflation models allow two sources of zeros. They can stem either from the zero-process or from the count process. A highly similar model class (so-called hurdle models) allows only one source of zeros. These models fit a zero-truncated Poisson or negative binomial model for the count process.
3.
From today’s perspective, a more appropriate model would have been the zero-inflated negative binomial model, which allows a dispersion parameter to be estimated (Hilbe, 2011). However, this model was not available in standard SEM software like Mplus (Muthén & Muthén, 2017) at that time.
4.
To keep computation times of the examples in this book low, we only estimate and report the ‘fixed’ part of the model, i.e., we assume variances and covariances of the latent growth factors to be zero. Hence, η_ij = α_j for all i, j.
5.
The file crim4.dat can be found in subdirectory ‘data’ in the supplementary material of the book. To be able to reproduce the examples in the book, we ask readers to set the pathname of the supplementary material as working directory using the setwd() function.
6.
The output of the md.pairs() function is a list of four components named rr, rm, mr, and mm. These components can be accessed using the $ operator. See help("md.pairs") for details.
7.
Reinecke and Weins (2013) have split the zero-inflated count variables into zero vs. non-zero indicators and zero-truncated count variables. They imputed both variables jointly under the multivariate normal model.
8.
To reduce simulation error due to small numbers of imputations, we will create m = 100 sets of imputations in all applications throughout this book (see, for example, the recommendations by Graham et al., 2007 or Bodner, 2008). Even with modern computers, this can take a considerable amount of time, especially when imputation models are more complex. To reproduce the examples within a reasonable amount of time, readers could set the number of imputations to 5. Additionally, all sets of imputations are available from the supplementary material of the book.
9.
Sometimes the term ‘plausibility’ refers to imputations that lead to valid statistical inferences. In our context, the true parameters are unknown. Here we regard imputations as plausible if they meet our expectations about these values and their distributions, based on what we know about the data set and based on the underlying theory. For the CrimoC data, for example, we can expect boys and students from Hauptschule to receive higher delinquency scores on average.
10.
At the time of writing, the most recent version 1.0-9.5 of the norm package was published in February 2013.
11.
Panel data, however, could be imputed in wide format, which would implicitly take the longitudinal information into account (for a discussion, see Sect. 5.6.3.1).
12.
We would once again like to caution applied researchers against using normal model MI after transformation on highly non-normal data. Transformations and rounding were applied in the past, when more appropriate strategies and models were not yet available.
13.
Note that emNorm() produces maximum likelihood estimates of the mean vector and covariance matrix of the variables in the model, which can also be directly used as input for further statistical analyses (e.g., structural equation models).
14.
https://cran.r-project.org/package=Amelia. Descriptions of the algorithm and detailed documentation of the software may be found at the project website https://gking.harvard.edu/amelia.
15.
https://cran.r-project.org/package=mi.
16.
https://cran.r-project.org/web/packages/mi/vignettes/mi_vignette.pdf.
17.
https://cran.r-project.org/package=Hmisc.
18.
See Sect. 5.2.2 for details about the growth curve ZIP model. Note that the intercept of the zero model has been fixed to zero (cf. Muthén & Muthén, 2017, Example 6.7). See Sects. 5.5.1–5.5.4 for details about the respective imputation methods and models that are used to create the multiple imputations.
19.
Shortcomings of some imputation techniques or consequences of misspecifications in simple data sets are discussed in de Jong et al. (2016) or He and Raghunathan (2009).
20.
Graham (2009) also recommends keeping the total number of variables in the imputation model below 100.
21.
An exception is the statistical package Stata. See the FAQ document ‘How can I account for clustering when creating imputations with mi impute?’, https://www.stata.com/support/faqs/statistics/clustering-and-mi-impute/.
22.
Additional R packages like miceadds or countimp provide further two-level imputation functions for the mice framework.
23.
In the supplementary material of the book, we provide R code to extract the imputed data sets from the mids object and to write the data to disk for further analysis in Mplus.
24.
An introduction to the lavaan syntax may be found at Yves Rosseel’s website http://lavaan.ugent.be/.
25.
Another option would be to use functions from package semTools, which we introduce in Sect. 5.7.7.5.
26.
https://cran.r-project.org/package=glmmTMB.
27.
More precisely, we average the Fisher-Z-transformed correlations and then, in a second step, bring the mean of these transformed correlations back to the original scale (see, for example, Schafer, 1997, for details on how to pool correlation coefficients). The functions cor2fisher() and fisher2cor(), which we use for the transformation and back-transformation, are available from the supplementary material of the book.
28.
For further details, see the help files of functions abind() and apply().
29.
https://github.com/glmmTMB/glmmTMB/blob/master/glmmTMB/R/family.R
30.
https://cran.r-project.org/package=miceadds.
31.
For further information, see option mimic="Mplus" in lavaan (https://cran.r-project.org/web/packages/lavaan/lavaan.pdf).
32.
The transformation of imputed data lists that were created by other packages to an object of class mids is described in Sect. 5.7.
33.
Instead of a simple MCAR mechanism, applied researchers could also simulate more complex and more realistic MAR or MNAR mechanisms that are believed to be present in the empirical data.
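
The pooling of correlations described in Note 27 (average the Fisher-Z-transformed correlations, then back-transform the mean) can be illustrated with a minimal sketch. It is written in Python for self-containment and is not the R code from the book's supplementary material; the function names cor_to_fisher, fisher_to_cor, and pool_correlations, as well as the five example correlations, are hypothetical and chosen only for illustration.

```python
import math


def cor_to_fisher(r):
    """Fisher Z-transformation: z = atanh(r) = 0.5 * log((1 + r) / (1 - r))."""
    return math.atanh(r)


def fisher_to_cor(z):
    """Back-transformation from the Fisher Z scale to the correlation scale."""
    return math.tanh(z)


def pool_correlations(correlations):
    """Pool correlation estimates from m imputed data sets.

    Step 1: transform each estimate to the Fisher Z scale.
    Step 2: average the transformed values.
    Step 3: back-transform the mean to the correlation scale.
    """
    if not correlations:
        raise ValueError("at least one correlation estimate is required")
    z_values = [cor_to_fisher(r) for r in correlations]
    z_mean = sum(z_values) / len(z_values)
    return fisher_to_cor(z_mean)


# Hypothetical correlation estimates from m = 5 imputed data sets
# (illustrative values, not taken from the book).
estimates = [0.42, 0.45, 0.39, 0.47, 0.44]
pooled = pool_correlations(estimates)
print(round(pooled, 3))
```

Because the transformation is monotone, the pooled value always lies between the smallest and the largest of the individual estimates; it differs slightly from the plain arithmetic mean of the correlations, and the difference grows as the estimates approach -1 or 1.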

References

Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage.
Andridge, R. R. (2011). Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials. Biometrical Journal, 53(1), 57–74.
Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948–955.
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15(4), 651–675.
Boers, K., Reinecke, J., Seddig, D., & Mariotti, L. (2010). Explaining the development of adolescent violent delinquency. European Journal of Criminology, 7(6), 499–520.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. New York, NY: Wiley.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park, CA: Sage.
Carpenter, J. R., Goldstein, H., & Kenward, M. G. (2011). REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. Journal of Statistical Software, 45(5), 1–14.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351.
Dahl, F. A. (2007). Convergence of random k-nearest-neighbour imputation. Computational Statistics & Data Analysis, 51(12), 5913–5917.
de Jong, R., van Buuren, S., & Spiess, M. (2016). Multiple imputation of predictor variables using generalized additive models. Communications in Statistics – Simulation and Computation, 45(3), 968–985.
Drechsler, J. (2015). Multiple imputation of multilevel missing data—Rigor versus simplicity. Journal of Educational and Behavioral Statistics, 40(1), 69–95.
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford.
Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychological Methods, 21(2), 222–240.
Gaffert, P., Meinfelder, F., & Bosch, V. (2016). Towards an MI-proper predictive mean matching (Discussion Paper). https://www.uni-bamberg.de/fileadmin/uni/fakultaeten/sowi_lehrstuehle/statistik/Personen/Dateien_Florian/properPMM.pdf.
Gałecki, A., & Burzykowski, T. (2013). Linear mixed-effects models using R: A step-by-step approach. Heidelberg/New York, NY: Springer.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8(3), 206–213.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of missing covariate values in multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48(2), 640–649.
Harel, O. (2009). The estimation of R² and adjusted R² in incomplete data sets using multiple imputation. Journal of Applied Statistics, 36(10), 1109–1118.
He, Y., & Raghunathan, T. E. (2009). On the performance of sequential regression multiple imputation methods with non normal error distributions. Communications in Statistics – Simulation and Computation, 38(4), 856–883.
Hilbe, J. M. (2011). Negative binomial regression (2nd ed.). Cambridge, UK: Cambridge University Press.
Hill, M. (1997). SPSS missing value analysis 7.5. Chicago, IL: SPSS.
Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57(4), 229–232.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Hu, L.-t., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55.
Kleinke, K. (2017). Multiple imputation under violated distributional assumptions—a systematic evaluation of the assumed robustness of predictive mean matching. Journal of Educational and Behavioral Statistics, 42(4), 371–404.
Kleinke, K. (2018). Multiple imputation by predictive mean matching when sample size is small. Methodology, 14(1), 3–15.
Kleinke, K., Stemmler, M., Reinecke, J., & Lösel, F. (2011). Efficient ways to impute incomplete panel data. AStA Advances in Statistical Analysis, 95(4), 351–373.
Lally, J. R., Mangione, P. L., & Honig, A. S. (1988). The Syracuse University Family Development Research Program: Long-range impact of an early intervention with low-income children and their families. In D. R. Powell (Ed.), Parent education as early childhood intervention: Emerging directions in theory, research and practice (pp. 79–104). Norwood, NJ: Ablex.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14.
Lüdtke, O., Robitzsch, A., & Grund, S. (2017). Multiple imputation of missing data in multilevel designs: A comparison of different strategies. Psychological Methods, 22(1), 141–165.
Mariotti, L., & Reinecke, J. (2010). Wachstums- und Mischverteilungsmodelle unter Berücksichtigung unbeobachteter Heterogenität: Empirische Analysen zum delinquenten Verhalten Jugendlicher in Duisburg [Growth models, mixture models, and unobserved heterogeneity: Empirical analyses of juvenile delinquent behaviors in Duisburg]. Münster, Germany: Institut für sozialwissenschaftliche Forschung e.V.
McCord, J. (1978). A thirty-year follow-up of treatment effects. American Psychologist, 33(3), 284–289.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9(4), 538–558.
Meng, X.-L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79(1), 103–111.
Moffitt, T. E. (1993). Adolescence-limited and life-course-persistent antisocial behavior: A developmental taxonomy. Psychological Review, 100(4), 674–701.
Morris, T. P., White, I. R., & Royston, P. (2014). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology, 14(1), 75–87.
Muthén, L. K., & Muthén, B. O. (2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
Nguyen, C. D., Carlin, J. B., & Lee, K. J. (2017). Model checking in multiple imputation: An overview and case study. Emerging Themes in Epidemiology, 14(8), 1–12.
Pöge, A. (2005). Persönliche Codes bei Längsschnittstudien. Ein Erfahrungsbericht. ZA-Nachrichten, 56, 50–69.
Pöge, A. (2008). Persönliche Codes ‘reloaded’. Methoden—Daten—Analysen, 2(1), 59–70.
Reinecke, J., & Seddig, D. (2011). Growth mixture models in longitudinal research. AStA Advances in Statistical Analysis, 95(4), 415–434.
Reinecke, J., & Weins, C. (2013). The development of delinquency during adolescence: A comparison of missing data techniques. Quality & Quantity, 47(6), 3319–3334.
Rosseel, Y. (2012). lavaan: An R Package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London, UK: Chapman & Hall.
Schafer, J. L. (1999b). NORM users guide (version 2) [Computer software manual]. University Park, PA: The Methodology Center, The Pennsylvania State University. https://www.methodology.psu.edu/training/missing-data/.
Schafer, J. L. (2016). norm2: Analysis of incomplete multivariate data under a normal model [Computer software manual]. https://CRAN.R-project.org/package=norm2 (R Package Version 2.0.1).
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research, 33(4), 545–571.
Schafer, J. L., & Olsen, M. K. (1999). Modeling and imputation of semicontinuous survey variables. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.7891
Schafer, J. L., & Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics, 11(2), 437–457.
Schenker, N., & Taylor, J. M. G. (1996). Partially parametric techniques for multiple imputation. Computational Statistics & Data Analysis, 22(4), 425–446.
Siddique, J., & Belin, T. R. (2008). Multiple imputation using an iterative hot-deck with distance-based donor selection. Statistics in Medicine, 27(1), 83–102.
Speidel, M., Drechsler, J., & Sakshaug, J. W. (2018). Biases in multilevel analyses caused by cluster-specific fixed-effects imputation. Behavior Research Methods, 50(5), 1824–1840.
Spiess, M., Kleinke, K., & Reinecke, J. (in press). Proper multiple imputation of clustered or panel data. In P. Lynn (Ed.), Advances in longitudinal survey methodology. New York, NY: Wiley. https://www.wiley.com/en-us/Advances+in+Longitudinal+Survey+Methodology-p-9781119376934
Su, Y.-S., Gelman, A., Hill, J., & Yajima, M. (2011). Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software, 45(2), 1–31.
van Buuren, S. (2011). Multiple imputation of multilevel data. In J. J. Hox & J. K. Roberts (Eds.), Handbook of advanced multilevel analysis (pp. 173–196). New York, NY: Taylor & Francis.
van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC.
van Buuren, S. (2013). Multiple imputation of multilevel data. Paper presented at the Conference on Recent Advances in Multiple Imputation, with Emphasis on Dealing with Deviations from MAR or Exchangeability, Utrecht, the Netherlands.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York, NY: Springer.
Vink, G., Lazendic, G., & van Buuren, S. (2015). Partitioned predictive mean matching as a large data multilevel imputation technique. Psychological Test and Assessment Modeling, 57(4), 577–594.
Weins, C., & Reinecke, J. (2007). Delinquenzverläufe im Jugendalter: Eine methodologische Analyse zur Auswirkung von fehlenden Werten im Längsschnitt [Development of Juvenile Delinquency: An analysis of the effects of missing data]. Monatsschrift für Kriminologie und Strafrechtsreform, 90(5), 418–437.
Yu, L. M., Burton, A., & Rivero-Arias, O. (2007). Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research, 16(3), 243–258.

Author information

Authors and Affiliations

Department of Education Studies and Psychology, University of Siegen, Siegen, Germany
Kristian Kleinke
Faculty of Sociology, University of Bielefeld, Bielefeld, Germany
Jost Reinecke
University of Hamburg, Institute of Psychology, Hamburg, Germany
Daniel Salfrán & Martin Spiess

About this chapter

Cite this chapter

Kleinke, K., Reinecke, J., Salfrán, D., Spiess, M. (2020). Multiple Imputation: Application. In: Applied Multiple Imputation. Statistics for Social and Behavioral Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-38164-6_5

DOI: https://doi.org/10.1007/978-3-030-38164-6_5
Published: 01 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38163-9
Online ISBN: 978-3-030-38164-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)