Missing data methods in longitudinal studies: a review

Abstract

Incomplete data are common in biomedical and other types of research, especially in longitudinal studies. During the last three decades, a vast amount of work has been done in the area. This has led, on the one hand, to a rich taxonomy of missing-data concepts, issues, and methods and, on the other hand, to a variety of data-analytic tools. Elements of the taxonomy include missing-data patterns, mechanisms, and modeling frameworks; inferential paradigms; and sensitivity analysis frameworks. These are described in detail, and a variety of concrete modeling devices is presented. To illustrate, two case studies are considered: the first concerns quality of life among breast cancer patients, and the second examines data from the Muscatine children’s obesity study.

Author information

Correspondence to Joseph G. Ibrahim.

Additional information

This invited paper is discussed in the comments available at: http://dx.doi.org/10.1007/s11749-009-0139-9, http://dx.doi.org/10.1007/s11749-009-0140-3, http://dx.doi.org/10.1007/s11749-009-0141-2, http://dx.doi.org/10.1007/s11749-009-0142-1, http://dx.doi.org/10.1007/s11749-009-0143-0.

About this article

Cite this article

Ibrahim, J.G., Molenberghs, G. Missing data methods in longitudinal studies: a review. TEST 18, 1–43 (2009). https://doi.org/10.1007/s11749-009-0138-x

Keywords

  • Expectation-maximization algorithm
  • Incomplete data
  • Missing completely at random
  • Missing at random
  • Missing not at random
  • Pattern-mixture model
  • Selection model
  • Sensitivity analyses
  • Shared-parameter model

Mathematics Subject Classification (2000)

  • 62J05
  • 62J12
  • 62P10