Missing Data Theory

Missing Data

Part of the book series: Statistics for Social and Behavioral Sciences (SSBS)

Abstract

In this first chapter, I accomplish several goals. First, building on my 20+ years of work on missing data analysis, I outline a nomenclature, or system, for talking about the theory underlying the modern analysis of missing data. I intend for this nomenclature to be in plain English, but nevertheless to be an accurate representation of the statistical theory relating to missing data analysis. Second, I describe many of the main components of missing data theory, including the causes or mechanisms of missingness. Two general methods for handling missing data, multiple imputation (MI) and maximum-likelihood (ML) methods, have developed out of the missing data theory I describe here, and as will be clear from reading this book, I fully endorse these methods. In the remainder of this chapter, I challenge some of the commonly held beliefs relating to missing data theory and missing data analysis, and make a case that the MI and ML procedures, which have started to become mainstream in statistical analysis with missing data, are applicable in a much larger range of contexts than is typically believed.

Notes

  1.

    Schafer and colleagues (Collins et al. 2001; Schafer and Graham 2002) have referred to this variable as R; Little and Rubin (2002; and Rubin 1976) refer to the same variable as M.

  2.

    Little and Rubin (2002) refer to this as Not Missing At Random (NMAR). But Schafer and colleagues (Collins et al. 2001; Schafer and Graham 2002) refer to this same mechanism as Missing Not At Random (MNAR). I have decided to use NMAR here, because it makes sense that missingness should either be MAR or not (i.e., Not MAR). However, there are good arguments for using MNAR as well. I view the two terms as interchangeable.

  3.

    Although as I demonstrate in a later section of this chapter, the amount of bias depends on many factors, and may often be tolerably low.

  4.

    One common variant of Y, for example, could be Z, a 4-level, uniformly distributed variable where the four levels represent the quartiles of the original Y variable, which was continuous and normally distributed. In this example, the two variables are highly correlated (r_YZ = .925), but their correlation is not r = 1.0.

  5.

    At the heart of all methods for analysis of NMAR missingness is a guess or assumption about the missing data creation model. Because all such methods must make these assumptions, methods for NMAR missingness are only as good as their assumptions. Please see the discussion in the next section.

  6.

    I describe the range quantity in more detail in Chap. 10. One important point about this quantity is that for any given level of missingness, r_ZR is a linear transformation of the range of probabilities in the MAR-linear IF statements. During our simulation work (Graham et al. 2008), Lori Palen discovered that r_ZR was the product of a constant (0.7453559925 for 50% missingness and Z as a uniformly distributed variable with four levels) and the range between the highest and lowest probabilities for the IF statements. I refer to this constant as the Palen proportion.

  7.

    Note that the quartilized version of Smoke10 (Z10) had only three levels in the data used in this example (0, 2, 3). Despite this, the results shown in this section are representative of what will commonly be found with these analyses.

  8.

    SPSS and other statistical packages can certainly be used for this assessment. The EM covariance matrix is used here mainly as a convenience. If you are making use of SPSS, please see Chaps. 3 and 5 for details of performing comparable analyses in SPSS.

  9.

    Note that everything I describe in this section can also be applied to the situation in which the predictor variable is a measured variable and not a manipulated program intervention variable.

  10.

    Note that the plots shown in Table 1.1 could also be based on more than two levels of a measured independent variable.

  11.

    Of course, the distinction between main measure and auxiliary variable becomes blurred when the methods used for collecting data on the follow-up sample are the same as, or very similar to, the methods used for the main measure of the DV, and when the follow-up measure occurs at a time not too far removed from the main measure.
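
The two constants quoted in Notes 4 and 6 (r_YZ = .925 and the Palen proportion, 0.7453559925) can be checked numerically. What follows is only an illustrative sketch, not code from the chapter. It assumes population (not sample) quantities, a standard normal Y, a Z with four equally likely levels coded 1 to 4, missingness probabilities that rise in equal steps across the levels of Z (my reading of "MAR-linear"), and a 0/1 missingness indicator R. The function names quartile_correlation and r_zr are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Sequence


def quartile_correlation() -> float:
    """Population correlation between a standard normal Y and its quartile code Z (1..4)."""
    nd = NormalDist()
    cuts = [nd.inv_cdf(p) for p in (0.25, 0.50, 0.75)]
    # Normal density at each bin edge; it is 0 at -infinity and +infinity.
    dens = [0.0] + [nd.pdf(c) for c in cuts] + [0.0]
    levels = [1, 2, 3, 4]
    mean_z = sum(levels) / 4
    # For a standard normal, E[Y * 1{a < Y <= b}] = pdf(a) - pdf(b).
    cov_yz = sum((z - mean_z) * (dens[i] - dens[i + 1]) for i, z in enumerate(levels))
    var_z = sum((z - mean_z) ** 2 for z in levels) / 4
    return cov_yz / sqrt(var_z)  # sd(Y) is 1 by assumption


def r_zr(probs: Sequence[float]) -> float:
    """Population correlation between Z (equally likely levels 1..k) and a 0/1
    indicator R, where probs[i] is the assumed P(R = 1 | Z = i + 1)."""
    k = len(probs)
    levels = range(1, k + 1)
    mean_z = sum(levels) / k
    mean_r = sum(probs) / k
    cov_zr = sum((z - mean_z) * p for z, p in zip(levels, probs)) / k
    var_z = sum((z - mean_z) ** 2 for z in levels) / k
    var_r = mean_r * (1 - mean_r)  # variance of a Bernoulli variable
    return cov_zr / sqrt(var_z * var_r)


if __name__ == "__main__":
    print(round(quartile_correlation(), 3))  # about .925, as in Note 4

    # Illustrative probabilities: 50% missingness overall, range 0.8 - 0.2 = 0.6.
    probs = [0.2, 0.4, 0.6, 0.8]
    print(r_zr(probs) / 0.6)  # about 0.745356, the constant in Note 6
    print(sqrt(5) / 3)        # closed form under these assumptions
```

Under these assumptions the constant reduces to sqrt(5)/3: the covariance of Z and R is (5/12) times the range, the standard deviation of Z is sqrt(5)/2, and the standard deviation of R is 0.5. The sign of r_ZR depends only on whether R = 1 is taken to mean "missing" or "observed".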

References

  • Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.

  • Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588–606.

  • Bose, J. (2001). Nonresponse bias analyses at the National Center for Education Statistics. Proceedings of the Statistics Canada Symposium, 2001, Achieving Data Quality in a Statistical Agency: A Methodological Perspective.

  • Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.) Testing structural equation models. Newbury Park, CA: Sage, pp. 136–162.

  • Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.

  • Colby, M., Hecht, M. L., Miller-Day, M., Krieger, J. R., Syvertsen, A. K., Graham, J. W., & Pettigrew, J. (in press). Adapting school-based substance use prevention curriculum through cultural grounding: A review and exemplar of adaptation processes for rural schools. American Journal of Community Psychology.

  • Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330–351.

  • Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Chicago: Rand McNally.

  • Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.

  • Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture models for nonignorable dropout. Statistics in Medicine, 21, 1–23.

  • Enders, C. K. (2008). A note on the use of missing auxiliary variables in full information maximum likelihood-based structural equation models. Structural Equation Modeling, 15, 434–448.

  • Enders, C. K. (2011). Missing not at random models for latent growth curve analysis. Psychological Methods, 16, 1–16.

  • Glynn, R. J., Laird, N. M., & Rubin, D. B. (1993). Multiple imputation in mixture models for nonignorable nonresponse with followups. Journal of the American Statistical Association, 88, 984–993.

  • Graham, J. W. (2009). Missing data analysis: making it work in the real world. Annual Review of Psychology, 60, 549–576.

  • Graham, J. W., & Donaldson, S. I. (1993). Evaluating interventions with differential attrition: The importance of nonresponse mechanisms and use of followup data. Journal of Applied Psychology, 78, 119–128.

  • Graham, J. W., Hofer, S. M., & Piccinin, A. M. (1994). Analysis with missing data in drug prevention research. In L. M. Collins & L. Seitz (Eds.), Advances in data analysis for prevention intervention research. National Institute on Drug Abuse Research Monograph Series #142, pp. 13–63. Washington, DC: National Institute on Drug Abuse.

  • Graham, J. W., Palen, L. A., Smith, E. A., & Caldwell, L. L. (2008). Attrition: MAR and MNAR Missingness, and Estimation Bias. Poster presented at the 16th Annual Meetings of the Society for Prevention Research, San Francisco, CA, May 2008.

  • Graham, J. W., Hofer, S. M., Donaldson, S. I., MacKinnon, D. P., & Schafer, J. L. (1997). Analysis with missing data in prevention research. In K. Bryant, M. Windle, & S. West (Eds.), The science of prevention: methodological advances from alcohol and substance abuse research (pp. 325–366). Washington, DC: American Psychological Association.

  • Hansen, W. B., & Graham, J. W. (1991). Preventing alcohol, marijuana, and cigarette use among adolescents: Peer pressure resistance training versus establishing conservative norms. Preventive Medicine, 20, 414–430.

  • Hedeker, D., & Gibbons, R. D. (1997). Application of random-effects pattern-mixture models for missing data in longitudinal studies. Psychological Methods, 2, 64–78.

  • Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.

  • Leon, A. C., Demirtas, H., & Hedeker, D. (2007). Bias reduction with an adjustment for participants’ intent to dropout of a randomized controlled clinical trial. Clinical Trials, 4, 540–547.

  • Little, R. J. A. (1993). Pattern-Mixture Models for Multivariate Incomplete Data. Journal of the American Statistical Association, 88, 125–134.

  • Little, R. J. A. (1994). A Class of Pattern-Mixture Models for Normal Incomplete Data. Biometrika, 81, 471–483.

  • Little, R. J. A. (1995). Modeling the drop-out mechanism in repeated-measures studies. Journal of the American Statistical Association, 90, 1112–1121.

  • Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

  • Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data: Second Edition. New York: Wiley.

  • Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

  • Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

  • Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7, 147–177.

  • Steiger, J. H., & Lind, J. M. (1980). Statistically based tests for the number of common factors. Paper presented at the annual meeting of the Psychometric Society, Iowa City, IA.

  • Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10.

  • Verbeke, G., & Molenberghs, G. (2000). Linear mixed models for longitudinal data. New York: Springer.

Copyright information

© 2012 Springer Science+Business Media New York

About this chapter

Cite this chapter

Graham, J.W. (2012). Missing Data Theory. In: Missing Data. Statistics for Social and Behavioral Sciences. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4018-5_1
