Multistage sampling for latent variable models

Lifetime Data Analysis

Abstract

I consider the design of multistage sampling schemes for epidemiologic studies involving latent variable models, with surrogate measurements of the latent variables on a subset of subjects. Such models arise in various situations: when detailed exposure measurements are combined with variables that can be used to assign exposures to unmeasured subjects; when biomarkers are obtained to assess an unobserved pathophysiologic process; or when additional information is to be obtained on confounding or modifying variables. In such situations, it may be possible to stratify the subsample on data available for all subjects in the main study, such as outcomes, exposure predictors, or geographic locations. Three circumstances where analytic calculations of the optimal design are possible are considered: (i) when all variables are binary; (ii) when all are normally distributed; and (iii) when the latent variable and its measurement are normally distributed, but the outcome is binary. In each of these cases, it is often possible to considerably improve the cost efficiency of the design by appropriate selection of the sampling fractions. More complex situations arise when the data are spatially distributed: the spatial correlation can be exploited to improve exposure assignment for unmeasured locations using available measurements on neighboring locations; some approaches for informative selection of the measurement sample using location and/or exposure predictor data are considered.
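The idea of stratifying the measurement subsample on data available for all subjects can be illustrated with a minimal sketch. It is not taken from the article and does not compute an optimal design: it only shows how stratum-specific sampling fractions select a subsample and how inverse-probability weights (1 / fraction) undo the resulting selection. The function names, the simulated data, and the fractions (1.0 for cases, 0.1 for controls) are all illustrative assumptions.

```python
import random


def stratified_subsample(strata, fractions, rng):
    """Select a second-stage subsample with stratum-specific sampling fractions.

    strata: one stratum label per main-study subject (e.g. case/control status).
    fractions: dict mapping stratum label -> sampling fraction in (0, 1].
    Returns a list of (subject_index, weight) pairs with weight = 1 / fraction.
    """
    selected = []
    for i, s in enumerate(strata):
        if rng.random() < fractions[s]:
            selected.append((i, 1.0 / fractions[s]))
    return selected


def weighted_mean(values, selected):
    """Inverse-probability-weighted mean of `values` over the subsample."""
    total_weight = sum(w for _, w in selected)
    return sum(values[i] * w for i, w in selected) / total_weight


if __name__ == "__main__":
    rng = random.Random(2007)  # fixed seed so the simulation is repeatable

    # Hypothetical main study: a binary outcome known for everyone, and a
    # surrogate measurement that will only be collected on the subsample.
    n = 5000
    outcome = [1 if rng.random() < 0.1 else 0 for _ in range(n)]
    latent = [rng.gauss(0.5 * y, 1.0) for y in outcome]      # unobserved
    surrogate = [x + rng.gauss(0.0, 0.5) for x in latent]    # noisy measurement

    # Assumed fractions: measure all cases, one in ten controls.
    sample = stratified_subsample(outcome, {1: 1.0, 0: 0.1}, rng)

    naive = sum(surrogate[i] for i, _ in sample) / len(sample)
    print("subjects measured:", len(sample), "of", n)
    print("unweighted subsample mean:", round(naive, 3))
    print("weighted subsample mean:  ", round(weighted_mean(surrogate, sample), 3))
    print("full-cohort mean:         ", round(sum(surrogate) / n, 3))
```

Because cases are oversampled, the unweighted subsample mean is pulled toward the case distribution, while the weighted mean targets the full-cohort value. Choosing the fractions to minimize variance for a fixed cost is the design question the abstract addresses, and it is not attempted here.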

Author information

Correspondence to Duncan C. Thomas.

About this article

Cite this article

Thomas, D.C. Multistage sampling for latent variable models. Lifetime Data Anal 13, 565–581 (2007). https://doi.org/10.1007/s10985-007-9061-1

  • DOI: https://doi.org/10.1007/s10985-007-9061-1
