Optimal design for epidemiological studies subject to designed missingness

Lifetime Data Analysis

Abstract

In large epidemiological studies, budgetary or logistical constraints will typically preclude study investigators from measuring all exposures, covariates and outcomes of interest on all study subjects. We develop a flexible theoretical framework that incorporates a number of familiar designs, such as case-control and cohort studies, as well as multistage sampling designs. Our framework also allows for designed missingness and includes the option for outcome-dependent designs. Our formulation is based on maximum likelihood and generalizes well-known results for inference with missing data to the multistage setting. A variety of techniques are applied to streamline the computation of the Hessian matrix for these designs, facilitating the development of an efficient software tool to implement a wide variety of designs.

Author information

Corresponding author

Correspondence to Louise Ryan.

About this article

Cite this article

Morara, M., Ryan, L., Houseman, A. et al. Optimal design for epidemiological studies subject to designed missingness. Lifetime Data Anal 13, 583–605 (2007). https://doi.org/10.1007/s10985-007-9068-7

  • DOI: https://doi.org/10.1007/s10985-007-9068-7
