Missing Data

Molenberghs, Geert; Beunckens, Caroline; Jansen, Ivy; Thijs, Herbert; Verbeke, Geert; Kenward, Michael G.

doi:10.1007/978-0-387-09834-0_20

Abstract

The problem of dealing with missing values is common throughout statistical work and arises whenever human subjects are enrolled. Respondents may refuse to participate or may be unreachable. Patients in clinical and epidemiological studies may withdraw their initial consent without further explanation. Early work on missing values was largely concerned with algorithmic and computational solutions to the induced lack of balance or deviations from the intended study design (Afifi and Elashoff 1966; Hartley and Hocking 1971). More recently, general algorithms such as the Expectation–Maximization (EM) algorithm (Dempster et al. 1977) and data imputation and augmentation procedures (Rubin 1987; Tanner and Wong 1987), combined with powerful computing resources, have largely solved this aspect of the problem. There remains the very difficult and important question of assessing the impact of missing data on subsequent statistical inference. Conditions can be formulated under which an analysis that proceeds as if the data were missing by design, that is, one that ignores the missing value process, can provide valid answers to study questions. While such an approach is attractive from a pragmatic point of view, the difficulty is that these conditions can rarely be assumed to hold with full certainty. Indeed, assumptions will be required that cannot be assessed from the data under analysis. In this setting no analysis can therefore be termed definitive, and any preferred analysis should ideally be supplemented with a so-called sensitivity analysis.
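
The point that ignoring the missing value process is valid only under certain conditions can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the chapter: the labels MCAR (missing completely at random) and MNAR (missing not at random) follow the standard terminology of Rubin (1976), and the function name `simulate`, the 30% missingness rate, and the logistic missingness rule are all illustrative assumptions.

```python
import math
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (full-data mean, complete-case mean) for one simulated sample.

    Hypothetical illustration: outcomes are standard normal, and the
    missingness rules below are assumptions chosen for clarity only.
    """
    rng = random.Random(seed)
    outcomes = [rng.gauss(0.0, 1.0) for _ in range(n)]
    observed = []
    for value in outcomes:
        if mechanism == "MCAR":
            # Missingness does not depend on the outcome at all.
            p_missing = 0.3
        elif mechanism == "MNAR":
            # Missingness depends on the (possibly unobserved) outcome itself:
            # larger values are more likely to be missing.
            p_missing = 1.0 / (1.0 + math.exp(-2.0 * value))
        else:
            raise ValueError(f"unknown mechanism: {mechanism!r}")
        if rng.random() >= p_missing:
            observed.append(value)
    return statistics.fmean(outcomes), statistics.fmean(observed)


if __name__ == "__main__":
    for mechanism in ("MCAR", "MNAR"):
        full_mean, complete_case_mean = simulate(mechanism)
        print(mechanism, round(full_mean, 3), round(complete_case_mean, 3))
```

Under the assumed MCAR rule the complete-case mean stays close to the full-data mean, whereas under the assumed MNAR rule it falls well below it. In a real study the full-data mean is unknown, so the two situations cannot be told apart from the observed data alone, which is why a sensitivity analysis is recommended.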

References

Aerts M, Geys H, Molenberghs G, Ryan LM (2002) Topics in modelling of clustered binary data. Chapman & Hall, London
Afifi A, Elashoff R (1966) Missing observations in multivariate statistics I: review of the literature. J Am Stat Assoc 61:595–604
Baker SG, Rosenberger WF, DerSimonian R (1992) Closed-form estimates for missing counts in two-way contingency tables. Stat Med 11:643–657
Beckman RJ, Nachtsheim CJ, Cook RD (1987) Diagnostics for mixed-model analysis of variance. Technometrics 29:413–426
Beunckens C, Sotto C, Molenberghs G (2008) A simulation study comparing weighted estimating equations with multiple imputation based estimating equations for longitudinal binary data. Comput Stat Data Anal 52:1533–1548
Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. J Am Stat Assoc 88:9–25
Chatterjee S, Hadi AS (1988) Sensitivity analysis in linear regression. Wiley, New York
Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18
Cook RD (1979) Influential observations in linear regression. J Am Stat Assoc 74:169–174
Cook RD (1986) Assessment of local influence. J R Stat Soc Ser B 48:133–169
Cook RD, Weisberg S (1982) Residuals and influence in regression. Chapman & Hall, London
Dempster AP, Rubin DB (1983) Overview. In: Madow WG, Olkin I, Rubin DB (eds) Incomplete data in sample surveys. Theory and annotated bibliography, vol II. Academic, New York, pp 3–10
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc Ser B 39:1–38
Diggle PJ, Kenward MG (1994) Informative drop-out in longitudinal data analysis (with discussion). Appl Stat 43:49–93
Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002) Analysis of longitudinal data. Oxford University Press, New York
Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995) Regression models for longitudinal binary responses with informative dropouts. J R Stat Soc Ser B 57:691–704
Glynn RJ, Laird NM, Rubin DB (1986) Selection modeling versus mixture modeling with nonignorable nonresponse. In: Wainer H (ed) Drawing inferences from self-selected samples. Springer, New York, pp 115–142
Hartley HO, Hocking R (1971) The analysis of incomplete data. Biometrics 27:783–808
Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Ann Econ Soc Meas 5:475–492
Hogan JW, Laird NM (1997) Mixture models for the joint distribution of repeated measures and event times. Stat Med 16:239–258
Ibrahim JG, Molenberghs G (2009) Missing data methods in longitudinal studies: a review (with discussion and rejoinder). Test 18:68–80
Jansen I, Molenberghs G (2008) A flexible marginal modeling strategy for non-monotone missing data. J R Stat Soc Ser A 171:347–373
Jansen I, Hens N, Molenberghs G, Aerts M, Verbeke G, Kenward MG (2006) The nature of sensitivity in monotone missing not at random models. Comput Stat Data Anal 50:830–858
Kenward MG, Molenberghs G (1998) Likelihood based frequentist inference when data are missing at random. Stat Sci 13:236–247
Kenward MG, Molenberghs G (2009) Last observation carried forward: a crystal ball? J Biopharm Stat 19(5):872–888
Kenward MG, Molenberghs G, Thijs H (2003) Pattern-mixture models with proper time dependence. Biometrika 90:53–71
Laird NM (1994) Discussion to Diggle PJ, Kenward MG: informative dropout in longitudinal data analysis. Appl Stat 43:84
Laird NM, Ware JH (1982) Random effects models for longitudinal data. Biometrics 38:963–974
Lesaffre E, Verbeke G (1998) Local influence in linear mixed models. Biometrics 54:570–582
Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22
Little RJA (1993) Pattern-mixture models for multivariate incomplete data. J Am Stat Assoc 88:125–134
Little RJA (1994) A class of pattern-mixture models for normal incomplete data. Biometrika 81:471–483
Little RJA (1995) Modeling the drop-out mechanism in repeated measures studies. J Am Stat Assoc 90:1112–1121
Little RJA, Rubin DB (1987) Statistical analysis with missing data. Wiley, New York
Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Mallinckrodt CH, Clark WS, David SR (2001a) Type I error rates from mixed-effects model repeated measures versus fixed effects analysis of variance with missing values imputed via last observation carried forward. Drug Inform J 35:1215–1225
Mallinckrodt CH, Clark WS, David SR (2001b) Accounting for dropout bias using mixed-effects models. J Biopharm Stat 11(1–2):9–21
Mallinckrodt CH, Clark WS, Carroll RJ, Molenberghs G (2003a) Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. J Biopharm Stat 13:179–190
Mallinckrodt CH, Sanger TM, Dube S, Debrota DJ, Molenberghs G, Carroll RJ, Zeigler Potter WM, Tollefson GD (2003b) Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biol Psychiatry 53:754–760
Meng XL (1994) Multiple-imputation inferences with uncongenial sources of input (with discussion). Stat Sci 9:538–573
Michiels B, Molenberghs G, Lipsitz SR (1999) A pattern-mixture odds ratio model for incomplete categorical data. Commun Stat Theory Methods 28:2843–2869
Michiels B, Molenberghs G, Bijnens L, Vangeneugden T, Thijs H (2002) Selection models and pattern-mixture models to analyze longitudinal quality of life data subject to dropout. Stat Med 21:1023–1041
Molenberghs G, Kenward MG (2007) Missing data in clinical studies. Wiley, New York
Molenberghs G, Verbeke G (2005) Models for discrete longitudinal data. Wiley, New York
Molenberghs G, Kenward MG, Lesaffre E (1997) The analysis of longitudinal ordinal data with non-random dropout. Biometrika 84:33–44
Molenberghs G, Michiels B, Kenward MG, Diggle PJ (1998) Missing data mechanisms and pattern-mixture models. Stat Neerl 52:153–161
Molenberghs G, Michiels B, Lipsitz SR (1999) A pattern-mixture odds ratio model for incomplete categorical data. Commun Stat Theory Methods 28:2843–2869
Molenberghs G, Verbeke G, Thijs H, Lesaffre E, Kenward MG (2001) Mastitis in dairy cattle: local influence to assess sensitivity of the dropout process. Comput Stat Data Anal 37:93–113
Murray GD, Findlay JG (1988) Correcting for the bias caused by drop-outs in hypertension trials. Stat Med 7:941–946
Nelder JA, Mead R (1965) A simplex method for function minimisation. Comput J 7:303–313
Neuhaus JM (1992) Statistical methods for longitudinal and clustered designs with binary responses. Stat Methods Med Res 1:249–273
Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. Int Stat Rev 59:25–35
Pharmacological Therapy for Macular Degeneration Study Group (1997) Interferon α-IIA is ineffective for patients with choroidal neovascularization secondary to age-related macular degeneration. Arch Ophthalmol 115:865–872
Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90:106–121
Robins JM, Rotnitzky A, Scharfstein DO (1998) Semiparametric regression for repeated outcomes with non-ignorable non-response. J Am Stat Assoc 93:1321–1339
Rotnitzky A, Cox DR, Bottai M, Robins J (2000) Likelihood-based inference with singular information matrix. Bernoulli 6:243–284
Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
Rubin DB (1978) Multiple imputation in sample surveys – a phenomenological Bayesian approach to nonresponse. In: Imputation and editing of faulty or missing survey data. U.S. Department of Commerce, Washington, DC, pp 1–23
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Rubin DB (1994) Discussion to Diggle PJ, Kenward MG: informative dropout in longitudinal data analysis. Appl Stat 43:80–82
Schafer JL (1997) Analysis of incomplete multivariate data. Chapman & Hall, London
Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8:3–15
Sheiner LB, Beal SL, Dunne A (1997) Analysis of nonrandomly censored ordered categorical longitudinal data from analgesic trials. J Am Stat Assoc 92:1235–1244
Siddiqui O, Ali MW (1998) A comparison of the random-effects pattern-mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. J Biopharm Stat 8:545–563
Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. J Am Stat Assoc 82:528–550
TenHave TR, Kunselman AR, Pulkstenis EP, Landis JR (1998) Mixed effects logistic regression models for longitudinal binary response data with informative dropout. Biometrics 54:367–383
Thijs H, Molenberghs G, Michiels B, Verbeke G, Curran D (2002) Strategies to fit pattern-mixture models. Biostatistics 3:245–265
Verbeke G, Molenberghs G (1997) Linear mixed models in practice: a SAS-oriented approach. Lecture notes in statistics 126. Springer, New York
Verbeke G, Molenberghs G (2000) Linear mixed models for longitudinal data. Springer, New York
Verbeke G, Lesaffre E, Spiessens B (2001a) The practical use of different strategies to handle dropout in longitudinal studies. Drug Inform J 35:419–439
Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG (2001b) Sensitivity analysis for non-random dropout: a local influence approach. Biometrics 57:7–14
Wu MC, Bailey KR (1988) Analysing changes in the presence of informative right censoring caused by death and withdrawal. Stat Med 7:337–346
Wu MC, Bailey KR (1989) Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 45:939–955
Wu MC, Carroll RJ (1988) Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics 44:175–188

Acknowledgements

We gratefully acknowledge support from FWO-Vlaanderen Research Project G.0002.98: “Sensitivity Analysis for Incomplete and Coarse Data” and from Belgian IUAP/PAI network “Statistical Techniques and Modeling for Complex Substantive Questions with Complex Data”.

Editor information

Wolfgang Ahrens, Department of Epidemiological Methods and Etiologic Research, Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany
Iris Pigeot, Department of Biometry and Data Management, Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany

Cite this entry

Molenberghs, G., Beunckens, C., Jansen, I., Thijs, H., Verbeke, G., Kenward, M.G. (2014). Missing Data. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, New York, NY. https://doi.org/10.1007/978-0-387-09834-0_20

DOI: https://doi.org/10.1007/978-0-387-09834-0_20
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-09833-3
Online ISBN: 978-0-387-09834-0
eBook Packages: Medicine; Reference Module Medicine