Lifetime Data Analysis, Volume 13, Issue 4, pp 565–581

Multistage sampling for latent variable models



I consider the design of multistage sampling schemes for epidemiologic studies involving latent variable models, with surrogate measurements of the latent variables on a subset of subjects. Such models arise in various situations: when detailed exposure measurements are combined with variables that can be used to assign exposures to unmeasured subjects; when biomarkers are obtained to assess an unobserved pathophysiologic process; or when additional information is to be obtained on confounding or modifying variables. In such situations, it may be possible to stratify the subsample on data available for all subjects in the main study, such as outcomes, exposure predictors, or geographic locations. Three circumstances where analytic calculations of the optimal design are possible are considered: (i) when all variables are binary; (ii) when all are normally distributed; and (iii) when the latent variable and its measurement are normally distributed, but the outcome is binary. In each of these cases, it is often possible to considerably improve the cost efficiency of the design by appropriate selection of the sampling fractions. More complex situations arise when the data are spatially distributed: the spatial correlation can be exploited to improve exposure assignment for unmeasured locations using available measurements on neighboring locations; some approaches for informative selection of the measurement sample using location and/or exposure predictor data are considered.
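As a rough illustration of how the choice of sampling fractions affects cost efficiency, the sketch below uses the classical cost-optimal stratified allocation from survey sampling for estimating a mean. It is not the latent-variable design calculation described in the abstract. The function names, stratum sizes, standard deviations, unit costs, and budget are all hypothetical, and the finite-population correction is ignored.

```python
import math

def optimal_allocation(sizes, sds, costs, budget):
    """Classical cost-optimal stratum sample sizes for estimating a mean.

    n_h is proportional to N_h * S_h / sqrt(c_h), scaled so that the total
    cost sum(c_h * n_h) equals the budget. Sizes may be fractional and are
    not capped at N_h; both are simplifying assumptions of this sketch.
    """
    denom = sum(N * s * math.sqrt(c) for N, s, c in zip(sizes, sds, costs))
    return [budget * N * s / math.sqrt(c) / denom
            for N, s, c in zip(sizes, sds, costs)]

def proportional_allocation(sizes, costs, budget):
    """Same sampling fraction f in every stratum, with f set by the budget."""
    f = budget / sum(c * N for N, c in zip(sizes, costs))
    return [f * N for N in sizes]

def stratified_variance(sizes, sds, n):
    """Variance of the stratified mean, ignoring finite-population correction."""
    total = sum(sizes)
    return sum((N / total) ** 2 * s ** 2 / m
               for N, s, m in zip(sizes, sds, n))

# Hypothetical two-stratum main study: a large low-variance stratum and a
# small high-variance stratum, with equal unit measurement costs.
sizes = [900, 100]
sds = [1.0, 3.0]
costs = [1.0, 1.0]
budget = 100.0

n_opt = optimal_allocation(sizes, sds, costs, budget)
n_prop = proportional_allocation(sizes, costs, budget)
print("optimal sizes:", n_opt, "fractions:", [n / N for n, N in zip(n_opt, sizes)])
print("variance, optimal:     ", stratified_variance(sizes, sds, n_opt))
print("variance, proportional:", stratified_variance(sizes, sds, n_prop))
```

In this toy setting the optimal allocation oversamples the small, more variable stratum and gives a lower variance than equal sampling fractions at the same total cost.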


Keywords: Study design; Latent variable models; Multistage sampling; Spatial correlation; Biomarkers; Exposure measurement error





Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Department of Preventive Medicine, University of Southern California, Los Angeles, USA
