Missing Data Theory

Graham, John W.

doi:10.1007/978-1-4614-4018-5_1


Part of the book series: Statistics for Social and Behavioral Sciences (SSBS)


Abstract

In this first chapter, I accomplish several goals. First, building on my 20+ years of work on missing data analysis, I outline a nomenclature or system for talking about the theory underlying the modern analysis of missing data. I intend for this nomenclature to be in plain English, but nevertheless to be an accurate representation of statistical theory relating to missing data analysis. Second, I describe many of the main components of missing data theory, including the causes or mechanisms of missingness. Two general methods for handling missing data, multiple imputation (MI) and maximum-likelihood (ML) methods, have developed out of the missing data theory I describe here. And as will be clear from reading this book, I fully endorse these methods. For the remainder of this chapter, I challenge some of the commonly held beliefs relating to missing data theory and missing data analysis, and make a case that the MI and ML procedures, which have started to become mainstream in statistical analysis with missing data, are applicable in a much larger range of contexts than is typically believed.


Notes

1.
Schafer and colleagues (Collins et al. 2001; Schafer and Graham 2002) have referred to this variable as R; Little and Rubin (2002; and Rubin 1976) refer to the same variable as M.
2.
Little and Rubin (2002) refer to this as Not Missing At Random (NMAR). But Schafer and colleagues (Collins et al. 2001; Schafer and Graham 2002) refer to this same mechanism as Missing Not At Random (MNAR). I have decided to use NMAR here, because it makes sense that missingness should either be MAR or not (i.e., Not MAR). However, there are good arguments for using MNAR as well. I view the two terms to be interchangeable.
3.
As I demonstrate in a later section of this chapter, however, the amount of bias depends on many factors, and may often be tolerably low.
4.
One common variant of Y, for example, could be Z, a 4-level, uniformly distributed variable where the four levels represent the quartiles of the original Y variable, which was continuous and normally distributed. In this example, the two variables are highly correlated (r_YZ = .925), but the correlation is not a perfect r = 1.0.
5.
At the heart of all methods for analysis of NMAR missingness is a guess or assumption about the missing data creation model. Because all such methods must make these assumptions, methods for NMAR missingness are only as good as their assumptions. Please see the discussion in the next section.
6.
I describe the range quantity in more detail in Chap. 10. One important point about this quantity is that for any given level of missingness, r_ZR is a linear transformation of the range of probabilities in the MAR-linear IF statements. During our simulation work (Graham et al. 2008), Lori Palen discovered that r_ZR was the product of a constant (0.7453559925 for 50% missingness and Z as a uniformly distributed variable with four levels) and the range between the highest and lowest probabilities for the IF statements. I refer to this constant as the Palen proportion.
7.
Note that the quartilized version of Smoke₁₀ (Z₁₀) had only three levels in the data used in this example (0, 2, 3). Despite this, the results shown in this section are representative of what will commonly be found with these analyses.
8.
SPSS and other statistical packages can certainly be used for this assessment. The EM covariance matrix is used here mainly as a convenience. If you are making use of SPSS, please see Chaps. 3 and 5 for details of performing comparable analyses in SPSS.
9.
Note that everything I describe in this section can also be applied to the situation in which the predictor variable is a measured variable and not a manipulated program intervention variable.
10.
Note that the plots shown in Table 1.1 could also be based on more than two levels of a measured independent variable.
11.
Of course, the distinction between main measure and auxiliary variable becomes blurred when the methods used for collecting data on the follow-up sample are the same as, or very similar to, the methods used for the main measure of the DV, and when the follow-up measure occurs at a time not too far removed from the main measure.
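
The two numerical values quoted in Notes 4 and 6 can be checked with a short calculation. The following is a minimal sketch in Python, not the author's code: the function names and the example probabilities (0.2, 0.4, 0.6, 0.8) are hypothetical, chosen only for illustration. It assumes Y is standard normal, Z is scored 1 to 4 with equally likely levels, and the probability of missingness (R = 1) is assigned separately for each level of Z, in the spirit of the MAR-linear IF statements.

```python
import math
from statistics import NormalDist


def quartile_correlation() -> float:
    """Correlation between a standard-normal Y and its quartile score Z (1..4)."""
    nd = NormalDist()
    # Quartile cut points of Y; the infinities close off the two outer bins.
    cuts = [-math.inf, nd.inv_cdf(0.25), 0.0, nd.inv_cdf(0.75), math.inf]

    def pdf(x: float) -> float:
        return 0.0 if math.isinf(x) else nd.pdf(x)

    # For standard-normal Y, E[Y * 1{a < Y < b}] = pdf(a) - pdf(b).
    # Because E[Y] = 0 and sd(Y) = 1, Cov(Y, Z) equals E[Y * Z].
    cov_yz = sum(
        level * (pdf(a) - pdf(b))
        for level, (a, b) in enumerate(zip(cuts, cuts[1:]), start=1)
    )
    var_z = sum(k * k for k in (1, 2, 3, 4)) / 4 - 2.5 ** 2  # 1.25
    return cov_yz / math.sqrt(var_z)


def r_zr(probs: list[float]) -> float:
    """Correlation between Z (levels 1..n, equally likely) and the binary
    missingness indicator R, where probs[k-1] = P(R = 1 | Z = k)."""
    n = len(probs)
    levels = range(1, n + 1)
    mean_z = sum(levels) / n
    var_z = sum((k - mean_z) ** 2 for k in levels) / n
    p_missing = sum(probs) / n  # overall missingness rate
    cov_zr = sum((k - mean_z) * pk for k, pk in zip(levels, probs)) / n
    return cov_zr / math.sqrt(var_z * p_missing * (1 - p_missing))


if __name__ == "__main__":
    # Hypothetical MAR-linear probabilities: 50% missing overall, range 0.6.
    probs = [0.2, 0.4, 0.6, 0.8]
    prob_range = max(probs) - min(probs)
    print("r_YZ:", round(quartile_correlation(), 3))
    print("r_ZR / range:", r_zr(probs) / prob_range)
    print("sqrt(5) / 3:", math.sqrt(5) / 3)
```

Under these assumptions, r_YZ rounds to the .925 given in Note 4, and the ratio of r_ZR to the range of probabilities equals √5/3, which agrees with the constant reported in Note 6 for 50% missingness and four equally likely levels of Z. Changing the overall missingness rate or the number of levels changes the constant, consistent with the note's qualification.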

References

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606.
Bose, J. (2001). Nonresponse bias analyses at the National Center for Education Statistics. Proceedings of the Statistics Canada Symposium, 2001, Achieving Data Quality in a Statistical Agency: A Methodological Perspective.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.) Testing structural equation models. Newbury Park, CA: Sage, pp. 136–162.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Colby, M., Hecht, M. L., Miller-Day, M., Krieger, J. R., Syvertsen, A. K., Graham, J. W., and Pettigrew, J. (in press). Adapting School-based Substance Use Prevention Curriculum through Cultural Grounding: A Review and Exemplar of Adaptation Processes for Rural Schools. American Journal of Community Psychology.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330–351.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally.
Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture models for nonignorable dropout. Statistics in Medicine, 21, 1–23.
Enders, C. K. (2008). A note on the use of missing auxiliary variables in full information maximum likelihood-based structural equation models. Structural Equation Modeling, 15, 434–448.
Enders, C. K. (2011). Missing not at random models for latent growth curve analysis. Psychological Methods, 16, 1–16.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with followups. Journal of the American Statistical Association, 88, 984–993.
Graham, J. W. (2009). Missing data analysis: making it work in the real world. Annual Review of Psychology, 60, 549–576.
Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119–128.
Graham, J. W., Hofer, S. M., and Piccinin, A. M. (1994). Analysis with missing data in drug prevention research. In L. M. Collins and L. Seitz (eds.), Advances in data analysis for prevention intervention research. National Institute on Drug Abuse Research Monograph Series #142, pp. 13–63, Washington DC: National Institute on Drug Abuse.
Graham, J. W., Palen, L. A., Smith, E. A., and Caldwell, L. L. (2008). Attrition: MAR and MNAR Missingness, and Estimation Bias. Poster presented at the 16th Annual Meetings of the Society for Prevention Research, San Francisco, CA, May 2008.
Graham, J. W., Hofer, S. M., Donaldson, S. I., MacKinnon, D. P., and Schafer, J. L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: methodological advances from alcohol and substance abuse research. (pp. 325–366). Washington, D.C.: American Psychological Association.
Hansen, W. B., & Graham, J. W. (1991). Preventing alcohol, marijuana, and cigarette use among adolescents: Peer pressure resistance training versus establishing conservative norms. Preventive Medicine, 20, 414–430.
Hedeker, D., and Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64–78.
Hu, L. T., and Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
Leon, A. C., Demirtas, H., and Hedeker, D. (2007). Bias reduction with an adjustment for participants’ intent to dropout of a randomized controlled clinical trial. Clinical Trials, 4, 540–547.
Little, R. J. A. (1993). Pattern-Mixture Models for Multivariate Incomplete Data. Journal of the American Statistical Association, 88, 125–134.
Little, R. J. A. (1994). A Class of Pattern-Mixture Models for Normal Incomplete Data. Biometrika, 81, 471–483.
Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–1121.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data: Second Edition. New York: Wiley.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147–177.
Steiger, J. H., and Lind, J. M. (1980). Statistically based tests for the number of common factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10.
Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York: Springer.


Author information

Authors and Affiliations

Department of Biobehavioral Health, The Pennsylvania State University, Health & Human Development Bldg. East, University Park, PA, USA
John W. Graham


About this chapter

Cite this chapter

Graham, J.W. (2012). Missing Data Theory. In: Missing Data. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4018-5_1


DOI: https://doi.org/10.1007/978-1-4614-4018-5_1
Published: 10 May 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-4017-8
Online ISBN: 978-1-4614-4018-5
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
