A comparison of multiple imputation strategies to deal with missing nonnormal data in structural equation modeling

Abstract

Missing data and nonnormality are two common factors that can affect analysis results from structural equation modeling (SEM). The current study addresses a challenging situation in which the two factors coexist (i.e., missing nonnormal data). Using Monte Carlo simulation, we evaluated the performance of four multiple imputation (MI) strategies with respect to parameter and standard error estimation. These strategies include MI with a normality-based model (MI-NORM), predictive mean matching (MI-PMM), classification and regression trees (MI-CART), and random forest (MI-RF). We also compared these MI strategies with robust full information maximum likelihood (RFIML), a popular (non-imputation) method for dealing with missing nonnormal data in SEM. The results suggest that MI-NORM performed similarly to RFIML. MI-PMM outperformed the other methods when data were not missing on the heavy tail of a skewed distribution. Although MI-CART and MI-RF do not require any distributional assumption, they did not perform well compared with the other methods. Based on the results, practical guidance is provided.

Author information

Corresponding author

Correspondence to Fan Jia.

Additional information

Open practices statement

This paper is based on a simulation study. No data is available, and no experiment was preregistered.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

This additional empirical example demonstrates the differences among the examined methods for missing nonnormal data. Inspired by Fan et al. (2010), we used data from the Education Longitudinal Study of 2002 (National Center for Education Statistics, 2002) to examine the covariance between two latent constructs (students’ motivation and parent-school communication concerning poor performance) and the effects of two observed variables (socioeconomic status [SES] and gender) on them. In this Multiple Indicators Multiple Causes (MIMIC) model, students’ motivation was measured by three composite scores: math self-efficacy, English self-efficacy, and general effort and persistence. Parent-school communication concerning poor performance had two indicators: the frequency with which the school contacted the parent about poor performance and the frequency with which the parent contacted the school about poor performance.

Fig. A1. MIMIC Model

We chose a complete subsample (N = 1287) from the original data and imposed 15% missing data on all three indicators of students’ motivation under each of three missing data mechanisms: MCAR, MAR-Head, and MAR-Tail. In both MAR conditions, SES determined the probabilities of missingness on those indicators.
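The missingness step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `impose_missing`, the rank-based probability rule, and the mechanism labels `"MAR-high"` and `"MAR-low"` are all hypothetical. Under the two MAR variants the probability of missingness rises or falls linearly with the rank of SES and is scaled so that the average rate equals the target (15% here). Which of the two corresponds to MAR-Head or MAR-Tail depends on the direction of skew in the indicators and on how SES relates to them, which this sketch does not assume.

```python
import numpy as np


def impose_missing(y, ses, rate=0.15, mechanism="MCAR", seed=None):
    """Return a copy of y (n rows, k indicators) with some cells set to NaN.

    Hypothetical helper for illustration only.
    mechanism: "MCAR", "MAR-high" (more missing at high SES),
               or "MAR-low" (more missing at low SES).
    """
    rng = np.random.default_rng(seed)
    y = np.array(y, dtype=float)  # np.array copies, so the input is untouched
    ses = np.asarray(ses, dtype=float)
    n = y.shape[0]

    if mechanism == "MCAR":
        # Every case has the same probability of missingness.
        p = np.full(n, rate)
    elif mechanism in ("MAR-high", "MAR-low"):
        # Ranks of SES scaled into (0, 1); their mean is exactly 0.5,
        # so 2 * rate * r averages to the target rate.
        r = (np.argsort(np.argsort(ses)) + 1) / (n + 1)
        if mechanism == "MAR-low":
            r = 1.0 - r
        p = 2.0 * rate * r
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    # Draw missingness independently for each indicator, given SES.
    mask = rng.random(y.shape) < p[:, None]
    y[mask] = np.nan
    return y
```

Because missingness in the MAR variants depends only on the fully observed SES variable, the resulting data are missing at random by construction. The rule requires a rate of at most 0.5 so that all probabilities stay below 1.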

Table A1 Point and standard error estimates of selected parameters

Table A1 shows the parameter and standard error estimates obtained from the five missing data methods alongside the complete data results. Under MCAR, all missing data methods yielded results comparable to those from the complete data. Under MAR-Head and MAR-Tail, the largest differences were associated with the effect of SES on students’ motivation (γ11). Specifically, for the point estimate, MI-PMM performed best under MAR-Head but underestimated γ11 under MAR-Tail. RFIML and MI-NORM yielded smaller γ11 estimates under MAR-Head and overestimated γ11 under MAR-Tail. The estimates obtained from MI-CART and MI-RF in both MAR conditions were drastically smaller than the complete data results. All methods yielded standard errors of γ11 that were inflated to some degree in both MAR conditions.
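For readers less familiar with how a single point estimate and standard error arise from multiply imputed data, the sketch below applies Rubin's rules, the standard pooling formulas for MI. That the appendix estimates were pooled in exactly this way is an assumption; the text does not describe the pooling step. The function name `pool_rubin` is hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter across M imputed data sets using Rubin's rules.

    estimates:  the M point estimates of the parameter
    std_errors: the M corresponding standard errors
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)                       # average point estimate
    within = mean([se ** 2 for se in std_errors])  # within-imputation variance
    between = variance(estimates)                  # between-imputation variance (M - 1 denominator)
    total = within + (1 + 1 / m) * between         # total variance
    return pooled, math.sqrt(total)
```

The between-imputation term reflects the extra uncertainty due to missing data. When the imputed data sets disagree more about a parameter, the pooled standard error grows even if each per-data-set standard error stays the same.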

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jia, F., Wu, W. A comparison of multiple imputation strategies to deal with missing nonnormal data in structural equation modeling. Behav Res (2022). https://doi.org/10.3758/s13428-022-01936-y

  • DOI: https://doi.org/10.3758/s13428-022-01936-y

Keywords

  • Missing data
  • Nonnormality
  • Multiple imputation
  • Full information maximum likelihood
  • Predictive mean matching
  • Classification and regression trees
  • Random forest