Missing Data

  • Geert Molenberghs
  • Caroline Beunckens
  • Ivy Jansen
  • Herbert Thijs
  • Geert Verbeke
  • Michael G. Kenward

Abstract

The problem of dealing with missing values is common throughout statistical work and is present whenever human subjects are enrolled. Respondents may refuse participation or may be unreachable. Patients in clinical and epidemiological studies may withdraw their initial consent without further explanation. Early work on missing values was largely concerned with algorithmic and computational solutions to the induced lack of balance or deviations from the intended study design (Afifi and Elashoff 1966; Hartley and Hocking 1971). More recently, general algorithms such as Expectation-Maximization (EM) (Dempster et al. 1977) and data imputation and augmentation procedures (Rubin 1987; Tanner and Wong 1987), combined with powerful computing resources, have largely provided a solution to this aspect of the problem. There remains the very difficult and important question of assessing the impact of missing data on subsequent statistical inference. Conditions can be formulated under which an analysis that proceeds as if the missing data are missing by design, that is, ignoring the missing value process, can provide valid answers to study questions. While such an approach is attractive from a pragmatic point of view, the difficulty is that such conditions can rarely be assumed to hold with full certainty. Indeed, assumptions will be required that cannot be assessed from the data under analysis. Hence in this setting there cannot be anything that could be termed a definitive analysis, and any preferred analysis should ideally be supplemented with a so-called sensitivity analysis.
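The point about ignoring the missing value process can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the chapter: the sample size, the missingness probabilities, and the function names are all assumptions. It compares the complete-case mean when missingness is unrelated to the outcome (missing completely at random, in the terminology of Rubin 1976) with the complete-case mean when missingness depends on the unobserved value itself.

```python
import random
import statistics


def complete_case_mean(values, observed):
    """Mean of the values that were actually observed (complete-case analysis)."""
    return statistics.fmean(v for v, o in zip(values, observed) if o)


def simulate(n=20000, seed=1):
    """Hypothetical toy study: outcome y ~ N(0, 1), so the true mean is 0."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Assumed mechanism 1: every value is observed with probability 0.5,
    # regardless of the value itself (missing completely at random).
    mcar = [rng.random() < 0.5 for _ in y]

    # Assumed mechanism 2: negative values are observed with probability 0.9,
    # non-negative values with probability 0.3, so missingness depends on
    # the outcome that may go unobserved (missing not at random).
    mnar = [rng.random() < (0.9 if v < 0 else 0.3) for v in y]

    return {
        "full": statistics.fmean(y),
        "mcar": complete_case_mean(y, mcar),
        "mnar": complete_case_mean(y, mnar),
    }


result = simulate()
for name, value in result.items():
    print(f"{name:>5}: {value: .3f}")
```

Under the first mechanism the complete-case mean stays close to the full-data mean, while under the second it is pulled well below zero. In a real study only the observed values are available, so the two mechanisms cannot be told apart from the data alone, which is why a sensitivity analysis over plausible mechanisms is recommended.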

Keywords

Generalized Linear Mixed Model; Generalized Estimating Equation; American Statistical Association; Last Observation Carried Forward; Royal Statistical Society Series

References

  1. Aerts M, Geys H, Molenberghs G, Ryan LM (2002) Topics in Modelling of Clustered Binary Data. Chapman & Hall, London
  2. Afifi A, Elashoff R (1966) Missing observations in multivariate statistics I: Review of the literature. Journal of the American Statistical Association 61:595–604
  3. Amemiya T (1984) Tobit models: a survey. Journal of Econometrics 24:3–61
  4. Ashford JR, Sowden RR (1970) Multi-variate probit analysis. Biometrics 26:535–546
  5. Baker SG (1995) Marginal regression for repeated binary data with outcome subject to non-ignorable non-response. Biometrics 51:1042–1052
  6. Bahadur RR (1961) A representation of the joint distribution of responses to n dichotomous items. In: Solomon H (ed) Studies in Item Analysis and Prediction. Stanford Mathematical Studies in the Social Sciences VI. Stanford University Press, Stanford CA
  7. Beckman RJ, Nachtsheim CJ, Cook RD (1987) Diagnostics for mixed-model analysis of variance. Technometrics 29:413–426
  8. Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88:9–25
  9. Buck SF (1960) A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society Series B 22:302–306
  10. Chatterjee S, Hadi AS (1988) Sensitivity Analysis in Linear Regression. John Wiley & Sons, New York
  11. Cook RD (1977) Detection of influential observations in linear regression. Technometrics 19:15–18
  12. Cook RD (1979) Influential observations in linear regression. Journal of the American Statistical Association 74:169–174
  13. Cook RD (1986) Assessment of local influence. Journal of the Royal Statistical Society Series B 48:133–169
  14. Cook RD, Weisberg S (1982) Residuals and Influence in Regression. Chapman & Hall, London
  15. Dale JR (1986) Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics 42:909–917
  16. Dempster AP, Rubin DB (1983) Overview. In: Madow WG, Olkin I, Rubin DB (eds) Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography. Academic Press, New York, pp 3–10
  17. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B 39:1–38
  18. Diggle PJ, Kenward MG (1994) Informative drop-out in longitudinal data analysis (with discussion). Applied Statistics 43:49–93
  19. Diggle PJ, Heagerty P, Liang K-Y, Zeger SL (2002) Analysis of Longitudinal Data. Oxford University Press, New York
  20. Draper D (1995) Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society Series B 57:45–97
  21. Ekholm A (1991) Algorithms versus models for analyzing data that contain misclassification errors. Biometrics 47:1171–1182
  22. Fahrmeir L, Tutz G (2001) Multivariate Statistical Modelling Based on Generalized Linear Models. Springer-Verlag, Heidelberg
  23. Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995) Regression models for longitudinal binary responses with informative dropouts. Journal of the Royal Statistical Society Series B 57:691–704
  24. Fitzmaurice GM, Heath G, Clifford P (1996a) Logistic regression models for binary panel data with attrition. Journal of the Royal Statistical Society Series A 159:249–264
  25. Fitzmaurice GM, Laird NM, Zahner GEP (1996b) Multivariate logistic models for incomplete binary response. Journal of the American Statistical Association 91:99–108
  26. George EO, Bowman D (1995) A saturated model for analyzing exchangeable binary data: Applications to clinical and developmental toxicity studies. Journal of the American Statistical Association 90:871–879
  27. Geys H, Molenberghs G, Lipsitz SR (1998) A note on the comparison of pseudolikelihood and generalized estimating equations for marginal odds ratio models. Journal of Statistical Computation and Simulation 62:45–72
  28. Glonek GFV, McCullagh P (1995) Multivariate logistic models. Journal of the Royal Statistical Society Series B 81:477–482
  29. Goss PE, Winer EP, Tannock IF, Schwartz LH, Kremer AB (1999) Breast cancer: randomized phase III trial comparing the new potent and selective third-generation aromatase inhibitor vorozole with megestrol acetate in postmenopausal advanced breast cancer patients. Journal of Clinical Oncology 17:52–63
  30. Hartley HO, Hocking R (1971) The analysis of incomplete data. Biometrics 27:783–808
  31. Heckman JJ (1976) The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models. Annals of Economic and Social Measurement 5:475–492
  32. Hogan JW, Laird NM (1997) Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine 16:239–258
  33. Kenward MG, Molenberghs G (1998) Likelihood based frequentist inference when data are missing at random. Statistical Science 12:236–247
  34. Kenward MG, Molenberghs G, Thijs H (2003) Pattern-mixture models with proper time dependence. Biometrika 90:53–71
  35. Laird NM (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:84
  36. Lang JB, Agresti A (1994) Simultaneously modeling joint and marginal distributions of multivariate categorical responses. Journal of the American Statistical Association 89:625–632
  37. le Cessie S, van Houwelingen JC (1994) Logistic regression for correlated binary data. Applied Statistics 43:95–108
  38. Lesaffre E, Verbeke G (1998) Local influence in linear mixed models. Biometrics 54:570–582
  39. Liang K-Y, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22
  40. Liang K-Y, Zeger SL, Qaqish B (1992) Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society Series B 54:3–40
  41. Lipsitz SR, Laird NM, Harrington DP (1991) Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78:153–160
  42. Little RJA (1986) A note about models for selectivity bias. Econometrica 53:1469–1474
  43. Little RJA (1993) Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association 88:125–134
  44. Little RJA (1994) A class of pattern-mixture models for normal incomplete data. Biometrika 81:471–483
  45. Little RJA (1995) Modeling the drop-out mechanism in repeated measures studies. Journal of the American Statistical Association 90:1112–1121
  46. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, New York
  47. Mallinckrodt CH, Clark WS, Stacy RD (2001a) Type I error rates from mixed-effects model repeated measures versus fixed effects analysis of variance with missing values imputed via last observation carried forward. Drug Information Journal 35:1215–1225
  48. Mallinckrodt CH, Clark WS, Stacy RD (2001b) Accounting for dropout bias using mixed-effects models. Journal of Biopharmaceutical Statistics 11(1 & 2):9–21
  49. Mallinckrodt CH, Clark WS, Carroll RJ, Molenberghs G (2003a) Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. Journal of Biopharmaceutical Statistics 13:179–190
  50. Mallinckrodt CH, Sanger TM, Dube S, Debrota DJ, Molenberghs G, Carroll RJ, Zeigler Potter WM, Tollefson GD (2003b) Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biological Psychiatry 53:754–760
  51. McCullagh P, Nelder JA (1989) Generalized Linear Models. Chapman & Hall, London
  52. Michiels B, Molenberghs G, Bijnens L, Vangeneugden T, Thijs H (2002) Selection models and pattern-mixture models to analyze longitudinal quality of life data subject to dropout. Statistics in Medicine 21:1023–1041
  53. Molenberghs G, Lesaffre E (1994) Marginal modelling of correlated ordinal data using a multivariate Plackett distribution. Journal of the American Statistical Association 89:633–644
  54. Molenberghs G, Lesaffre E (1999) Marginal modelling of multivariate categorical data. Statistics in Medicine 18:2237–2255
  55. Molenberghs G, Kenward MG, Lesaffre E (1997) The analysis of longitudinal ordinal data with non-random dropout. Biometrika 84:33–44
  56. Molenberghs G, Michiels B, Kenward MG, Diggle PJ (1998) Missing data mechanisms and pattern-mixture models. Statistica Neerlandica 52:153–161
  57. Murray GD, Findlay JG (1988) Correcting for the bias caused by drop-outs in hypertension trials. Statistics in Medicine 7:941–946
  58. Nelder JA, Mead R (1965) A simplex method for function minimisation. The Computer Journal 7:303–313
  59. Neuhaus JM (1992) Statistical methods for longitudinal and clustered designs with binary responses. Statistical Methods in Medical Research 1:249–273
  60. Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59:25–35
  61. Plackett RL (1965) A class of bivariate distributions. Journal of the American Statistical Association 60:516–522
  62. Prentice RL (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics 44:1033–1048
  63. Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90:106–121
  64. Robins JM, Rotnitzky A, Scharfstein DO (1998) Semiparametric regression for repeated outcomes with non-ignorable non-response. Journal of the American Statistical Association 93:1321–1339
  65. Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
  66. Rubin DB (1987) Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York
  67. Rubin DB (1994) Discussion to Diggle PJ, Kenward MG: Informative dropout in longitudinal data analysis. Applied Statistics 43:80–82
  68. Schafer JL (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall, London
  69. Schipper H, Clinch J, McMurray A (1984) Measuring the quality of life of cancer patients: the Functional-Living Index-Cancer: development and validation. Journal of Clinical Oncology 2:472–483
  70. Sheiner LB, Beal SL, Dunne A (1997) Analysis of nonrandomly censored ordered categorical longitudinal data from analgesic trials. Journal of the American Statistical Association 92:1235–1244
  71. Siddiqui O, Ali MW (1998) A comparison of the random-effects pattern-mixture model with last-observation-carried-forward (LOCF) analysis in longitudinal clinical trials with dropouts. Journal of Biopharmaceutical Statistics 8:545–563
  72. Skellam JG (1948) A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. Journal of the Royal Statistical Society Series B 10:257–261
  73. Smith DM, Robertson B, Diggle PJ (1996) Object-oriented Software for the Analysis of Longitudinal Data in S. Technical Report MA 96/192. Department of Mathematics and Statistics, University of Lancaster, LA1 4YF, United Kingdom
  74. Stiratelli R, Laird N, Ware J (1984) Random effects models for serial observations with dichotomous response. Biometrics 40:961–972
  75. Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82:528–550
  76. Thijs H, Molenberghs G, Michiels B, Verbeke G, Curran D (2002) Strategies to fit pattern-mixture models. Biostatistics 3:245–265
  77. Verbeke G, Molenberghs G (1997) Linear Mixed Models in Practice: A SAS-Oriented Approach. Lecture Notes in Statistics 126. Springer-Verlag, New York
  78. Verbeke G, Molenberghs G (2000) Linear Mixed Models for Longitudinal Data. Springer-Verlag, New York
  79. Verbeke G, Molenberghs G, Thijs H, Lesaffre E, Kenward MG (2001) Sensitivity analysis for non-random dropout: a local influence approach. Biometrics 57:7–14
  80. Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61:439–447
  81. Wu MC, Bailey KR (1989) Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 45:939–955

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Geert Molenberghs (1)
  • Caroline Beunckens (1)
  • Ivy Jansen (1)
  • Herbert Thijs (1)
  • Geert Verbeke (1)
  • Michael G. Kenward (2)

  1. Biostatistics, Centre for Statistics, Limburg University Centrum, Diepenbeek, Belgium
  2. Medical Statistics Unit, London School of Hygiene & Tropical Medicine, London, UK
