Skip to main content

Missing Data

  • Reference work entry
Handbook of Epidemiology

Abstract

The problem of dealing with missing values is common throughout statistical work and is present whenever human subjects are enrolled. Respondents may refuse participation or may be unreachable. Patients in clinical and epidemiological studies may withdraw their initial consent without further explanation. Early work on missing values was largely concerned with algorithmic and computational solutions to the induced lack of balance or deviations from the intended study design (Afifi and Elashoff 1966; Hartley and Hocking 1971). More recently, general algorithms such as the Expectation–Maximization (EM) (Dempster et al. 1977) and data imputation and augmentation procedures (Rubin 1987; Tanner and Wong 1987), combined with powerful computing resources, have largely provided a solution to this aspect of the problem. There remains the very difficult and important question of assessing the impact of missing data on subsequent statistical inference. Conditions can be formulated, under which an analysis that proceeds as if the missing data are missing by design, that is, ignoring the missing value process, can provide valid answers to study questions. While such an approach is attractive from a pragmatic point of view, the difficulty is that such conditions can rarely be assumed to hold with full certainty. Indeed, assumptions will be required that cannot be assessed from the data under analysis. Hence in this setting there cannot be anything that could be termed a definitive analysis, and hence any analysis of preference is ideally to be supplemented with a so-called sensitivity analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 999.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Hardcover Book
USD 1,399.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  • Aerts M, Geys H, Molenberghs G, Ryan LM (2002) Topics in modelling of clustered binary data. Chapman & Hall, London

    Book  Google Scholar 

  • Afifi A, Elashoff R (1966) Missing observations in multivariate statistics I: review of the literature. J Am Stat Assoc 61:595–604

    Google Scholar 

  • Baker SG, Rosenberger WF, DerSimonian R (1992) Closed-form estimates for missing counts in two-way contingency tables. Stat Med 11:643–657

    Article  CAS  PubMed  Google Scholar 

  • Beckman RJ, Nachtsheim CJ, Cook RD (1987) Diagnostics for mixed-model analysis of variance. Technometrics 29:413–426

    Google Scholar 

  • Beunckens C, Sotto C, Molenberghs G (2008) A simulation study comparing weighted estimating equations with multiple imputation based estimating equations for longitudinal binary data. Comput Stat Data Anal 52:1533–1548

    Article  Google Scholar 

  • Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. J Am Stat Assoc 88:9–25

    Google Scholar 

  • Chatterjee S, Hadi AS (1988) Sensitivity analysis in linear regression. Wiley, New York

    Book  Google Scholar 

  • Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19: 15–18

    Article  Google Scholar 

  • Cook RD (1979) Influential observations in linear regression. J Am Stat Assoc 74:169–174

    Article  Google Scholar 

  • Cook RD (1986) Assessment of local influence. J R Stat Soc Ser B 48:133–169

    Google Scholar 

  • Cook RD, Weisberg S (1982) Residuals and influence in regression. Chapman & Hall, London

    Google Scholar 

  • Dempster AP, Rubin DB (1983) Overview. In: Madow WG, Olkin I, Rubin DB (eds) Incomplete data in sample surveys. Theory and annotated bibliography, vol II. Academic, New York, pp 3–10

    Google Scholar 

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38

    Google Scholar 

  • Diggle PJ, Kenward MG (1994) Informative drop-out in longitudinal data analysis (with discussion). Appl Stat 43:49–93

    Article  Google Scholar 

  • Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002) Analysis of longitudinal data. Oxford University Press, New York

    Google Scholar 

  • Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995) Regression models for longitudinal binary responses with informative dropouts. J R Stat Soc Ser B 57:691–704

    Google Scholar 

  • Glynn RJ, Laird NM, Rubin DB (1986) Selection modeling versus mixture modeling with nonignorable nonresponse. In: Wainer H (ed) Drawing inferences from self-selected samples. Springer, New York, pp 115–142

    Chapter  Google Scholar 

  • Hartley HO, Hocking R (1971) The analysis of incomplete data. Biometrics 27:7783–808

    Article  Google Scholar 

  • Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Ann Econ Soc Meas 5:475–492

    Google Scholar 

  • Hogan JW, Laird NM (1997) Mixture models for the joint distribution of repeated measures and event times. Stat Med 16:239–258

    Article  CAS  PubMed  Google Scholar 

  • Ibrahim JG, Molenberghs G (2009) Missing data methods in longitudinal studies: a review (with discussion and rejoinder). Test 18, 68–80

    Article  Google Scholar 

  • Jansen I, Molenberghs G (2008) A flexible marginal modeling strategy for non-monotone missing data. J R Stat Soc Ser A 171:347–373

    Article  Google Scholar 

  • Jansen I, Hens N, Molenberghs G, Aerts M, Verbeke G, Kenward MG (2006) The nature of sensitivity in monotone missing not at random models. Comput Stat Data Anal 50:830–858

    Article  Google Scholar 

  • Kenward MG, Molenberghs G (1998) Likelihood based frequentist inference when data are missing at random. Stat Sci 12:236–247

    Google Scholar 

  • Kenward MG, Molenberghs G (2009) Last observation carried forward: a crystal ball? J Biopharm Stat 19(5):872–888

    Article  PubMed  Google Scholar 

  • Kenward MG, Molenberghs G, Thijs H (2003) Pattern-mixture models with proper time dependence. Biometrika 90:53–71

    Article  Google Scholar 

  • Laird NM (1994) Discussion to Diggle PJ, Kenward MG: informative dropout in longitudinal data analysis. Appl Stat 43:84

    Google Scholar 

  • Laird NM, Ware JH (1982) Random effects models for longitudinal data. Biometrics 38:963–974

    Article  CAS  PubMed  Google Scholar 

  • Lesaffre E, Verbeke G (1998) Local influence in linear mixed models. Biometrics 54:570–582

    Article  CAS  PubMed  Google Scholar 

  • Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22

    Article  Google Scholar 

  • Little RJA (1993) Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc 88:125–134

    Google Scholar 

  • Little RJA (1994) A class of pattern-mixture models for normal incomplete data. Biometrika 81:471–483

    Article  Google Scholar 

  • Little RJA (1995) Modeling the drop-out mechanism in repeated measures studies. J Am Stat Assoc 90:1112–1121

    Article  Google Scholar 

  • Little RJA, Rubin DB (1987) Statistical analysis with missing data. Wiley, New York

    Google Scholar 

  • Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York

    Google Scholar 

  • Mallinckrodt CH, Clark WS, David SR (2001a) Type I error rates from mixed-effects model repeated measures versus fixed effects analysis of variance with missing values imputed via last observation carried forward. Drug Inform J 35:1215–1225

    Article  Google Scholar 

  • Mallinckrodt CH, Clark WS, David SR (2001b) Accounting for dropout bias using mixed-effects models. J Biopharm Stat Ser 11(1 & 2):9–21

    Article  CAS  Google Scholar 

  • Mallinckrodt CH, Clark WS, Carroll RJ, Molenberghs G (2003a) Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. J Biopharm Stat 13:179–190

    Article  PubMed  Google Scholar 

  • Mallinckrodt CH, Sanger TM, Dube S, Debrota DJ, Molenberghs G, Carroll RJ, Zeigler Potter WM, Tollefson, GD (2003b) Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biol Psychiatry Ser 53:754–760

    Article  Google Scholar 

  • Meng XL (1994) Multiple-imputation inferences with uncongenial sources of input (with discussion). Stat Sci 9:538–573

    Google Scholar 

  • Michiels B, Molenberghs G, Lipsitz SR (1999) A pattern-mixture odds ratio model for incomplete categorical data. Commun Stat Theory Methods 28:2843–2869

    Article  Google Scholar 

  • Michiels B, Molenberghs G, Bijnens L, Vangeneugden T, Thijs H (2002) Selection models and pattern-mixture models to analyze longitudinal quality of life data subject to dropout. Stat Med 21:1023–1041

    Article  PubMed  Google Scholar 

  • Molenberghs G, Kenward MG (2007) Missing data in clinical studies. Wiley, New York

    Book  Google Scholar 

  • Molenberghs G, Verbeke G (2005) Models for discrete longitudinal data. Wiley, New York

    Google Scholar 

  • Molenberghs G, Kenward MG, Lesaffre E (1997) The analysis of longitudinal ordinal data with non-random dropout. Biometrika 84:33–44

    Article  Google Scholar 

  • Molenberghs G, Michiels B, Kenward MG, Diggle PJ (1998) Missing data mechanisms and pattern-mixture models. Stat Neerl 52:153–161

    Article  Google Scholar 

  • Molenberghs G, Michiels B, Lipsitz SR (1999) A pattern-mixture odds ratio model for incomplete categorical data. Commun Stat Theory Methods 28:2843–2869

    Article  Google Scholar 

  • Molenberghs G, Verbeke G, Thijs H, Lesaffre E, Kenward MG (2001) Mastitis in dairy cattle: local influence to assess sensitivity of the dropout process. Comput Stat Data Anal 37:93–113

    Article  Google Scholar 

  • Murray GD, Findlay JG (1988) Correcting for the bias caused by drop-outs in hypertension trials. Stat Med 7:941–946

    Article  CAS  PubMed  Google Scholar 

  • Nelder JA, Mead R (1965) A simplex method for function minimisation. Comput J 7:303–313

    Article  Google Scholar 

  • Neuhaus JM (1992) Statistical methods for longitudinal and clustered designs with binary responses. Stat Methods Med Res 1:249–273

    Article  CAS  PubMed  Google Scholar 

  • Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. Int Stat Rev 59:25–35

    Article  Google Scholar 

  • Pharmacological Therapy for Macular Degeneration Study Group (1997) Interferon α-IIA is ineffective for patients with choroiadal neovascularization secondary to age-related macular degeneration. Arch Ophthalmol 115:865–872

    Article  Google Scholar 

  • Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90:106–121

    Article  Google Scholar 

  • Robins JM, Rotnitzky A, Scharfstein DO (1998) Semiparametric regression for repeated outcomes with non-ignorable non-response. J Am Stat Assoc 93:1321–1339

    Article  Google Scholar 

  • Rotnitzky A, Cox DR, Bottai M, Robins J (2000) Likelihood-based inference with singular information matrix. Bernouilli 6:243–284

    Article  Google Scholar 

  • Rubin DB (1976) Inference and missing data. Biometrika 63:581–592

    Article  Google Scholar 

  • Rubin DB (1978) Multiple imputation in sample surveys – a phenomenological Bayesian approach to nonresponse. In: Imputation and editing of faulty or missing survey data. U.S. Department of Commerce, Washington, DC, pp 1–23

    Google Scholar 

  • Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York

    Book  Google Scholar 

  • Rubin DB (1994) Discussion to Diggle PJ, Kenward MG: informative dropout in longitudinal data analysis. Appl Stat 43:80–82

    Google Scholar 

  • Schafer JL (1997) Analysis of incomplete multivariate data. Chapman & Hall, London

    Book  Google Scholar 

  • Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8:3–15

    Article  CAS  PubMed  Google Scholar 

  • Sheiner LB, Beal SL, Dunne A (1997) Analysis of nonrandomly censored ordered categorical longitudinal data from analgesic trials. J Am Stat Assoc 92:1235–1244

    Article  Google Scholar 

  • Siddiqui O, Ali MW (1998) A comparison of the random-effects pattern-mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. J Biopharm Stat 8:545–563

    Article  CAS  PubMed  Google Scholar 

  • Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. J Am Stat Assoc 82:528–550

    Article  Google Scholar 

  • TenHave TR, Kunselman AR, Pulkstenis EP, Landis JR (1998) Mixed effects logistic regression models for longitudinal binary response data with informative dropout. Biometrics 54:367–383

    Article  CAS  Google Scholar 

  • Thijs H, Molenberghs G, Michiels B, Verbeke G, Curran D (2002) Strategies to fit pattern-mixture models. Biostatistics 3:245–265

    Article  PubMed  Google Scholar 

  • Verbeke G, Molenberghs G (1997) Linear mixed models in practice: a SAS-oriented approach. Lecture notes in statistics 126. Springer, New York

    Book  Google Scholar 

  • Verbeke G, Molenberghs G (2000) Linear mixed models for longitudinal data. Springer, New York

    Google Scholar 

  • Verbeke G, Lesaffre E, Spiessens B (2001a) The practical use of different strategies to handle dropout in longitudinal studies. Drug Inform J 35:419–439

    Google Scholar 

  • Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG (2001b) Sensitivity analysis for non-random dropout: a local influence approach. Biometrics 57:7–14

    Article  CAS  PubMed  Google Scholar 

  • Wu MC, Bailey KR (1988) Analysing changes in the presence of informative right censoring caused by death and withdrawal. Stat Med 7:337–346

    Article  CAS  PubMed  Google Scholar 

  • Wu MC, Bailey KR (1989) Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 45:939–955

    Article  CAS  PubMed  Google Scholar 

  • Wu MC, Carroll RJ (1988) Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics 44:175–188

    Article  Google Scholar 

Download references

Acknowledgements

We gratefully acknowledge support from FWO-Vlaanderen Research Project G.0002.98: “Sensitivity Analysis for Incomplete and Coarse Data” and from Belgian IUAP ∕ PAI network “Statistical Techniques and Modeling for Complex Substantive Questions with Complex Data”.

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer Science+Business Media New York

About this entry

Cite this entry

Molenberghs, G., Beunckens, C., Jansen, I., Thijs, H., Verbeke, G., Kenward, M.G. (2014). Missing Data. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, New York, NY. https://doi.org/10.1007/978-0-387-09834-0_20

Download citation

  • DOI: https://doi.org/10.1007/978-0-387-09834-0_20

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-0-387-09833-3

  • Online ISBN: 978-0-387-09834-0

  • eBook Packages: MedicineReference Module Medicine

Publish with us

Policies and ethics