Abstract
In large epidemiological studies, budgetary or logistical constraints typically preclude study investigators from measuring all exposures, covariates and outcomes of interest on all study subjects. We develop a flexible theoretical framework that incorporates a number of familiar designs, such as case-control and cohort studies, as well as multistage sampling designs. Our framework also allows for designed missingness and includes the option for outcome-dependent designs. Our formulation is based on maximum likelihood and generalizes well-known results for inference with missing data to the multistage setting. Several techniques are applied to streamline the computation of the Hessian matrix for these designs, facilitating the development of an efficient software tool to implement a wide range of designs.
Cite this article
Morara, M., Ryan, L., Houseman, A. et al. Optimal design for epidemiological studies subject to designed missingness. Lifetime Data Anal 13, 583–605 (2007). https://doi.org/10.1007/s10985-007-9068-7
DOI: https://doi.org/10.1007/s10985-007-9068-7