Abstract
In this chapter, we discuss the most important and most commonly used multiple imputation (MI) tools in R for both multivariate and clustered data sets, including the packages mice, norm2, Amelia, mi, and pan, as well as the function aregImpute( ) from package Hmisc, and show practical applications. We give hands-on, step-by-step tutorials on how to carry out MI in practice.
Notes
- 1.
Table 5.1 gives an overview of the download frequencies of various MI packages in R.
- 2.
Note that classical zero-inflation models allow two sources of zeros. They can stem either from the zero-process or from the count process. A highly similar model class (so-called hurdle models) allows only one source of zeros. These models fit a zero-truncated Poisson or negative binomial model for the count process.
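The difference between the two model classes in this note can be made concrete with their probability mass functions. The following is a minimal base R sketch for the Poisson case only; the function names dzip and dhurdle and the argument pi_zero are hypothetical and are not taken from the book or from any package.

```r
# Zero-inflated Poisson (ZIP): a zero can come from the zero process
# (probability pi_zero) or from the Poisson count process itself.
dzip <- function(y, lambda, pi_zero) {
  ifelse(y == 0,
         pi_zero + (1 - pi_zero) * dpois(0, lambda),
         (1 - pi_zero) * dpois(y, lambda))
}

# Hurdle Poisson: every zero comes from the zero process; positive counts
# follow a zero-truncated Poisson distribution.
dhurdle <- function(y, lambda, pi_zero) {
  ifelse(y == 0,
         pi_zero,
         (1 - pi_zero) * dpois(y, lambda) / (1 - dpois(0, lambda)))
}

# Compare both distributions for the same (arbitrary) parameter values
y <- 0:5
round(cbind(y, zip = dzip(y, 2, 0.3), hurdle = dhurdle(y, 2, 0.3)), 4)
```

With identical parameter values, the ZIP model assigns more probability to zero than the hurdle model, because it adds the zeros of the count process to those of the zero process.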
- 3.
- 4.
To keep computation times of the examples in this book low, we only estimate and report the ‘fixed’ part of the model, i.e., we assume variances and covariances of the latent growth factors to be zero. Hence, $\eta_{ij} = \alpha_j$ for all $i, j$.
- 5.
File crim4.dat can be found in subdirectory ‘data’ in the supplementary material of the book. To be able to reproduce the examples in the book, we ask readers to set the pathname of the supplementary material as working directory using the setwd( ) function.
- 6.
The output of the md.pairs( ) function is a list of four components named rr, rm, mr, and mm. These components can be accessed using the $ operator. See help( "md.pairs") for details.
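To illustrate what the four components count, the sketch below rebuilds them in base R for a small made-up data frame. It is an illustration under assumptions (toy data, hypothetical object name pairs_sketch), not the code of md.pairs( ) itself; see the help page for the authoritative definition.

```r
# Toy data with missing values (hypothetical; not the book's data)
dat <- data.frame(
  x = c(1, NA, 3, 4, NA),
  y = c(NA, 2, 3, NA, NA),
  z = c(1, 2, 3, 4, 5)
)

mis <- is.na(dat) * 1   # 1 where a value is missing, 0 otherwise
obs <- 1 - mis          # 1 where a value is observed, 0 otherwise

# Pairwise counts: entry [a, b] refers to row variable a and column variable b
pairs_sketch <- list(
  rr = t(obs) %*% obs,  # a observed, b observed
  rm = t(obs) %*% mis,  # a observed, b missing
  mr = t(mis) %*% obs,  # a missing,  b observed
  mm = t(mis) %*% mis   # a missing,  b missing
)

pairs_sketch$rr             # components are accessed with the $ operator
pairs_sketch$rm["x", "y"]   # cases with x observed and y missing
```

For every pair of variables, the four counts add up to the number of rows in the data.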
- 7.
Reinecke and Weins (2013) have split the zero-inflated count variables into zero vs. non-zero indicators and zero-truncated count variables. They imputed both variables jointly under the multivariate normal model.
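The split described in this note can be sketched in a few lines of base R. The variable names and values are made up for illustration, and the sketch covers only the splitting step, not the imputation model used by Reinecke and Weins (2013).

```r
# Hypothetical zero-inflated count variable with missing values
delinq <- c(0, 0, 3, NA, 1, 0, 7, NA, 0, 2)

# Part 1: zero vs. non-zero indicator (missing values stay missing)
nonzero <- as.integer(delinq > 0)

# Part 2: zero-truncated count, defined only for the non-zero cases
count_pos <- ifelse(delinq > 0, delinq, NA)

data.frame(delinq, nonzero, count_pos)

# The split loses no information: both parts together reproduce the original
recombined <- ifelse(nonzero == 1, count_pos, 0)
```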
- 8.
To reduce simulation error due to small numbers of imputations, we will create m = 100 sets of imputations in all applications throughout this book (see, for example, the recommendations by Graham et al., 2007 or Bodner, 2008). Even with modern computers, this can take a considerable amount of time, especially when imputation models are more complex. To reproduce the examples within a reasonable amount of time, readers could set the number of imputations to 5. Additionally, all sets of imputations are available from the supplementary material of the book.
- 9.
Sometimes the term ‘plausibility’ refers to imputations that lead to valid statistical inferences. In our context, the true parameters are unknown. Here we regard imputations as plausible if they meet our expectations about these values and their distributions, based on what we know about the data set and based on the underlying theory. For the CrimoC data, for example, we can expect boys and students from Hauptschule to receive higher delinquency scores on average.
- 10.
At the time of writing, the most recent version 1.0-9.5 of the norm package was published in February 2013.
- 11.
Panel data, however, could be imputed in wide format, which would implicitly consider the longitudinal information (for a discussion, see Sect. 5.6.3.1).
- 12.
We would once again like to caution applied researchers against using normal model MI after transformation on highly non-normal data. Transformations and rounding were applied in the past, when more appropriate strategies and models were not yet available.
- 13.
Note that emNorm( ) produces maximum likelihood estimates of the mean vector and covariance matrix of the variables in the model, which can also be directly used as input for further statistical analyses (e.g., structural equation models).
- 14.
https://cran.r-project.org/package=Amelia. Descriptions of the algorithm and detailed documentation of the software may be found at the project website https://gking.harvard.edu/amelia.
- 15.
- 16.
- 17.
- 18.
See Sect. 5.2.2 for details about the growth curve ZIP model. Note that the intercept of the zero model has been fixed to zero (cf. Muthén & Muthén, 2017, Example 6.7). See Sects. 5.5.1–5.5.4 for details about the respective imputation methods and models that are used to create the multiple imputations.
- 19.
- 20.
Graham (2009) also recommends keeping the total number of variables in the imputation model below 100.
- 21.
An exception is the statistical package Stata. See the FAQ document ‘How can I account for clustering when creating imputations with mi impute?’, https://www.stata.com/support/faqs/statistics/clustering-and-mi-impute/.
- 22.
Additional R packages like miceadds or countimp provide further two-level imputation functions for the mice framework.
- 23.
In the supplementary material of the book, we provide R code to extract the imputed data sets from the mids object and to write the data to disk for further analysis in Mplus.
- 24.
An introduction to the lavaan syntax may be found at Yves Rosseel’s website http://lavaan.ugent.be/.
- 25.
Another option would be to use functions from package semTools, which we introduce in Sect. 5.7.7.5.
- 26.
- 27.
More precisely, we average the Fisher-Z-transformed correlations and then, in a second step, bring the mean of these transformed correlations back to the original scale (see, for example, Schafer, 1997, for details on how to pool correlation coefficients). Functions cor2fisher( ) and fisher2cor( ), which we use for the transformation and back-transformation, are available from the supplementary material of the book.
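The two pooling steps can be sketched in base R as follows. The helpers cor_to_z and z_to_cor are hypothetical stand-ins written for this illustration, since the code of the book's own functions is not shown here, and the correlation values are invented.

```r
# Stand-ins for the transformation and back-transformation (assumption:
# Fisher's z is the inverse hyperbolic tangent of the correlation)
cor_to_z <- function(r) atanh(r)
z_to_cor <- function(z) tanh(z)

# Hypothetical correlation estimates from m = 5 imputed data sets
r_imp <- c(0.42, 0.38, 0.45, 0.40, 0.36)

z_bar    <- mean(cor_to_z(r_imp))  # step 1: average on the z scale
r_pooled <- z_to_cor(z_bar)        # step 2: back-transform the mean
r_pooled
```

The pooled value always lies between the smallest and the largest of the individual estimates, but it generally differs slightly from their simple arithmetic mean.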
- 28.
For further details, see the help files of functions abind( ) and apply( ).
- 29.
- 30.
- 31.
For further information, see option mimic="Mplus" in lavaan (https://cran.r-project.org/web/packages/lavaan/lavaan.pdf).
- 32.
The transformation of imputed data lists that were created by other packages to an object of class mids is described in Sect. 5.7.
- 33.
Instead of a simple MCAR mechanism, applied researchers could also simulate more complex and more realistic MAR or MNAR mechanisms that are believed to be present in the empirical data.
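The difference between these mechanisms can be sketched with simulated data in base R. Sample size, coefficients, and missingness rates below are arbitrary choices for illustration and do not come from the book.

```r
set.seed(123)                    # arbitrary seed for reproducibility
n <- 1000
x <- rnorm(n)                    # fully observed covariate
y <- 0.5 * x + rnorm(n)          # variable that will receive missing values

# MCAR: every value of y has the same 30% chance of being deleted
y_mcar <- ifelse(runif(n) < 0.30, NA, y)

# MAR: the deletion probability depends only on the observed covariate x
p_mar <- plogis(-1 + 1.5 * x)
y_mar <- ifelse(runif(n) < p_mar, NA, y)

# MNAR: the deletion probability depends on the value of y itself
p_mnar <- plogis(-1 + 1.5 * y)
y_mnar <- ifelse(runif(n) < p_mnar, NA, y)

# Missingness rates for cases with x <= 0 (FALSE) and x > 0 (TRUE)
rate_mcar <- tapply(is.na(y_mcar), x > 0, mean)
rate_mar  <- tapply(is.na(y_mar),  x > 0, mean)
rbind(rate_mcar, rate_mar)
```

Under MCAR the two rates should be roughly equal, whereas under MAR cases with larger x lose their y value more often.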
References
Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage.
Andridge, R. R. (2011). Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials. Biometrical Journal, 53(1), 57–74.
Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948–955.
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15(4), 651–675.
Boers, K., Reinecke, J., Seddig, D., & Mariotti, L. (2010). Explaining the development of adolescent violent delinquency. European Journal of Criminology, 7(6), 499–520.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. New York, NY: Wiley.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models. Newbury Park, CA: Sage.
Carpenter, J. R., Goldstein, H., & Kenward, M. G. (2011). REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. Journal of Statistical Software, 45(5), 1–14.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330–351.
Dahl, F. A. (2007). Convergence of random k-nearest-neighbour imputation. Computational Statistics & Data Analysis, 51(12), 5913–5917.
de Jong, R., van Buuren, S., & Spiess, M. (2016). Multiple imputation of predictor variables using generalized additive models. Communications in Statistics – Simulation and Computation, 45(3), 968–985.
Drechsler, J. (2015). Multiple imputation of multilevel missing data—Rigor versus simplicity. Journal of Educational and Behavioral Statistics, 40(1), 69–95.
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford.
Enders, C. K., Mistler, S. A., & Keller, B. T. (2016). Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychological Methods, 21(2), 222–240.
Gaffert, P., Meinfelder, F., & Bosch, V. (2016). Towards an MI-proper predictive mean matching (Discussion Paper). https://www.uni-bamberg.de/fileadmin/uni/fakultaeten/sowi_lehrstuehle/statistik/Personen/Dateien_Florian/properPMM.pdf.
Gałecki, A., & Burzykowski, T. (2013). Linear mixed-effects models using R: A step-by-step approach. Heidelberg/New York, NY: Springer.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8(3), 206–213.
Grund, S., Lüdtke, O., & Robitzsch, A. (2016). Multiple imputation of missing covariate values in multilevel models with random slopes: A cautionary note. Behavior Research Methods, 48(2), 640–649.
Harel, O. (2009). The estimation of R² and adjusted R² in incomplete data sets using multiple imputation. Journal of Applied Statistics, 36(10), 1109–1118.
He, Y., & Raghunathan, T. E. (2009). On the performance of sequential regression multiple imputation methods with non normal error distributions. Communications in Statistics – Simulation and Computation, 38(4), 856–883.
Hilbe, J. M. (2011). Negative binomial regression (2nd ed.). Cambridge, UK: Cambridge University Press.
Hill, M. (1997). SPSS missing value analysis 7.5. Chicago, IL: SPSS.
Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57(4), 229–232.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Hu, L.-t., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55.
Kleinke, K. (2017). Multiple imputation under violated distributional assumptions—a systematic evaluation of the assumed robustness of predictive mean matching. Journal of Educational and Behavioral Statistics, 42(4), 371–404.
Kleinke, K. (2018). Multiple imputation by predictive mean matching when sample size is small. Methodology, 14(1), 3–15.
Kleinke, K., Stemmler, M., Reinecke, J., & Lösel, F. (2011). Efficient ways to impute incomplete panel data. Advances in Statistical Analysis, 95(4), 351–373.
Lally, J. R., Mangione, P. L., & Honig, A. S. (1988). The Syracuse University Family Development Research Program: Long-range impact of an early intervention with low-income children and their families. In D. R. Powell (Ed.), Parent education as early childhood intervention: Emerging directions in theory, research and practice (pp. 79–104). Norwood, NJ: Ablex.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14.
Lüdtke, O., Robitzsch, A., & Grund, S. (2017). Multiple imputation of missing data in multilevel designs: A comparison of different strategies. Psychological Methods, 22(1), 141–165.
Mariotti, L., & Reinecke, J. (2010). Wachstums- und Mischverteilungsmodelle unter Berücksichtigung unbeobachteter Heterogenität: Empirische Analysen zum delinquenten Verhalten Jugendlicher in Duisburg [Growth models, mixture models, and unobserved heterogeneity: Empirical analyses of juvenile delinquent behaviors in Duisburg.] Münster, Germany: Institut für sozialwissenschaftliche Forschung e.V.
McCord, J. (1978). A thirty-year follow-up of treatment effects. American Psychologist, 33(3), 284–289.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9(4), 538–558.
Meng, X.-L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79(1), 103–111.
Moffitt, T. E. (1993). Adolescence-limited and life-course-persistent antisocial behavior: A developmental taxonomy. Psychological Review, 100(4), 674–701.
Morris, T. P., White, I. R., & Royston, P. (2014). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology, 14(1), 75–87.
Muthén, L. K., & Muthén, B. O. (2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
Nguyen, C. D., Carlin, J. B., & Lee, K. J. (2017). Model checking in multiple imputation: An overview and case study. Emerging Themes in Epidemiology, 14(8), 1–12.
Pöge, A. (2005). Persönliche Codes bei Längsschnittstudien. Ein Erfahrungsbericht. ZA-Nachrichten, 56, 50–69.
Pöge, A. (2008). Persönliche Codes ‘reloaded’. Methoden—Daten—Analysen, 2(1), 59–70.
Reinecke, J., & Seddig, D. (2011). Growth mixture models in longitudinal research. AStA Advances in Statistical Analysis, 95(4), 415–434.
Reinecke, J., & Weins, C. (2013). The development of delinquency during adolescence: A comparison of missing data techniques. Quality & Quantity, 47(6), 3319–3334.
Rosseel, Y. (2012). lavaan: An R Package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York, NY: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London, UK: Chapman & Hall.
Schafer, J. L. (1999b). NORM users guide (version 2) [Computer software manual]. University Park, PA: The Methodology Center, The Pennsylvania State University. https://www.methodology.psu.edu/training/missing-data/.
Schafer, J. L. (2016). norm2: Analysis of incomplete multivariate data under a normal model [Computer software manual]. https://CRAN.R-project.org/package=norm2 (R Package Version 2.0.1).
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research, 33(4), 545–571.
Schafer, J. L., & Olsen, M. K. (1999). Modeling and imputation of semicontinuous survey variables. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.7891
Schafer, J. L., & Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics, 11(2), 437–457.
Schenker, N., & Taylor, J. M. G. (1996). Partially parametric techniques for multiple imputation. Computational Statistics & Data Analysis, 22(4), 425–446.
Siddique, J., & Belin, T. R. (2008). Multiple imputation using an iterative hot-deck with distance-based donor selection. Statistics in Medicine, 27(1), 83–102.
Speidel, M., Drechsler, J., & Sakshaug, J. W. (2018). Biases in multilevel analyses caused by cluster-specific fixed-effects imputation. Behavior Research Methods, 50(5), 1824–1840.
Spiess, M., Kleinke, K., & Reinecke, J. (in press). Proper multiple imputation of clustered or panel data. In P. Lynn (Ed.), Advances in longitudinal survey methodology. New York, NY: Wiley. https://www.wiley.com/en-us/Advances+in+Longitudinal+Survey+Methodology-p-9781119376934
Su, Y.-S., Gelman, A., Hill, J., & Yajima, M. (2011). Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software, 45(2), 1–31.
van Buuren, S. (2011). Multiple imputation of multilevel data. In J. J. Hox & J. K. Roberts (Eds.), Handbook of advanced multilevel analysis (pp. 173–196). New York, NY: Taylor & Francis.
van Buuren, S. (2012). Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC.
van Buuren, S. (2013). Multiple imputation of multilevel data. Paper presented at the Conference on Recent Advances in Multiple Imputation, with Emphasis on Dealing with Deviations from MAR or Exchangeability, Utrecht, the Netherlands.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). New York, NY: Springer.
Vink, G., Lazendic, G., & van Buuren, S. (2015). Partitioned predictive mean matching as a large data multilevel imputation technique. Psychological Test and Assessment Modeling, 57(4), 577–594.
Weins, C., & Reinecke, J. (2007). Delinquenzverläufe im Jugendalter: Eine methodologische Analyse zur Auswirkung von fehlenden Werten im Längsschnitt [Development of Juvenile Delinquency: An analysis of the effects of missing data]. Monatsschrift für Kriminologie und Strafrechtsreform, 90(5), 418–437.
Yu, L. M., Burton, A., & Rivero-Arias, O. (2007). Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research, 16(3), 243–258.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kleinke, K., Reinecke, J., Salfrán, D., Spiess, M. (2020). Multiple Imputation: Application. In: Applied Multiple Imputation. Statistics for Social and Behavioral Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-38164-6_5
DOI: https://doi.org/10.1007/978-3-030-38164-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-38163-9
Online ISBN: 978-3-030-38164-6
eBook Packages: Mathematics and Statistics (R0)