Statistics in Biosciences, Volume 5, Issue 2, pp 232–249

Using the Whole Cohort in the Analysis of Case-Control Data

Application to the Women’s Health Initiative
  • Norman E. Breslow
  • Gustavo Amorim
  • Mary B. Pettinger
  • Jacques Rossouw
Case Studies and Practice Articles


Standard analyses of data from case-control studies that are nested in a large cohort ignore information available for cohort members not sampled for the sub-study. This paper reviews several methods designed to increase estimation efficiency by using more of the data, treating the case-control sample as a two- or three-phase stratified sample. When these methods were applied to a study of coronary heart disease among women in the hormone trials of the Women’s Health Initiative, modest gains in the precision of regression coefficients were observed, and the gains increased with the amount of cohort information used in the analysis. The gains were particularly evident for pseudo- or maximum likelihood estimates, whose validity depends on the assumed model being correct. Larger standard errors were obtained for coefficients estimated by inverse probability weighted methods, which are more robust to model misspecification. Such misspecification may have been responsible for an important difference in one key regression coefficient between the weighted estimates and those from the more efficient methods.


Logistic regression; Maximum likelihood; Pseudolikelihood; Calibration of sampling weights; Model misspecification and survey sampling



Copyright information

© International Chinese Statistical Association 2013

Authors and Affiliations

  • Norman E. Breslow (1, corresponding author)
  • Gustavo Amorim (2)
  • Mary B. Pettinger (3)
  • Jacques Rossouw (4)

  1. Department of Biostatistics, University of Washington, Seattle, USA
  2. Department of Statistics, University of Auckland, Auckland, New Zealand
  3. WHI Clinical Coordinating Center, Fred Hutchinson Cancer Research Center, Seattle, USA
  4. Division of Cardiovascular Sciences, National Heart, Lung and Blood Institute, Bethesda, USA
