Missing Data

  • Living reference work entry
Handbook of Epidemiology

Abstract

A general framework for describing and handling missing data is presented. Methodology is categorized according to its validity under various assumptions about the missing data mechanism. Considerable attention is given to direct-likelihood approaches, weighted generalized estimating equations, and multiple imputation. The value of sensitivity analysis to examine the stability of inferences against untestable assumptions is discussed. A running example is used to illustrate methodology.

Acknowledgments

We gratefully acknowledge support from FWO-Vlaanderen Research Project G.0002.98: “Sensitivity Analysis for Incomplete and Coarse Data” and from Belgian IUAP/PAI network “Statistical Techniques and Modeling for Complex Substantive Questions with Complex Data.”

Author information

Corresponding author

Correspondence to Geert Molenberghs.

Copyright information

© 2023 Springer Science+Business Media, LLC, part of Springer Nature

About this entry

Cite this entry

Molenberghs, G., Beunckens, C., Jansen, I., Thijs, H., Verbeke, G., Kenward, M.G. (2023). Missing Data. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6625-3_20-1

  • DOI: https://doi.org/10.1007/978-1-4614-6625-3_20-1

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-6625-3

  • Online ISBN: 978-1-4614-6625-3

  • eBook Packages: Springer Reference Medicine, Reference Module Medicine
