Abstract
I consider the design of multistage sampling schemes for epidemiologic studies involving latent variable models, with surrogate measurements of the latent variables on a subset of subjects. Such models arise in various situations: when detailed exposure measurements are combined with variables that can be used to assign exposures to unmeasured subjects; when biomarkers are obtained to assess an unobserved pathophysiologic process; or when additional information is to be obtained on confounding or modifying variables. In such situations, it may be possible to stratify the subsample on data available for all subjects in the main study, such as outcomes, exposure predictors, or geographic locations. I consider three circumstances in which the optimal design can be calculated analytically: (i) when all variables are binary; (ii) when all are normally distributed; and (iii) when the latent variable and its measurement are normally distributed, but the outcome is binary. In each of these cases, appropriate selection of the sampling fractions can often improve the cost efficiency of the design considerably. More complex situations arise when the data are spatially distributed. The spatial correlation can then be exploited to improve exposure assignment for unmeasured locations using available measurements at neighboring locations, and I consider some approaches for informative selection of the measurement sample using location and/or exposure predictor data.
Cite this article
Thomas, D.C. Multistage sampling for latent variable models. Lifetime Data Anal 13, 565–581 (2007). https://doi.org/10.1007/s10985-007-9061-1