Abstract
The case-cohort study involves two-phase sampling: simple random sampling from an infinite superpopulation at phase one and stratified random sampling from a finite cohort at phase two. Standard analyses of case-cohort data solve inverse probability weighted (IPW) estimating equations, with weights determined by the known phase two sampling fractions. The variance of parameter estimates in (semi)parametric models, including the Cox model, is the sum of two terms: (i) the model-based variance of the usual estimates that would be calculated if full data were available for the entire cohort; and (ii) the design-based variance from IPW estimation of the unknown cohort total of the efficient influence function (IF) contributions. This second variance component may be reduced by adjusting the sampling weights, either by calibrating them to known cohort totals of auxiliary variables correlated with the IF contributions or by estimating them using these same auxiliary variables. Both adjustment methods are implemented in the R survey package. We derive the limit laws of coefficients estimated using adjusted weights. The asymptotic results suggest practical methods for constructing auxiliary variables, which are evaluated by simulation of case-cohort samples from the National Wilms Tumor Study and by log-linear modeling of case-cohort data from the Atherosclerosis Risk in Communities Study. Although not semiparametric efficient, estimators based on adjusted weights may come close to achieving full efficiency within the class of augmented IPW estimators.
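As a rough illustration of the calibration idea, the following Python sketch adjusts phase two sampling weights so that the weighted totals of an auxiliary variable match the known cohort totals. It is a minimal sketch under stated assumptions, not the authors' method and not the R survey package's implementation: the names `ht_total` and `calibrate_linear` are hypothetical, the data are simulated, phase two uses simple rather than stratified random sampling for brevity, and `y` merely stands in for the IF contributions.

```python
import random


def ht_total(weights, values):
    """Horvitz-Thompson estimate of a cohort total: sum of weight * value."""
    return sum(w * v for w, v in zip(weights, values))


def calibrate_linear(design_weights, x, total_n, total_x):
    """Linear calibration to the known cohort totals of (1, x).

    Returns weights w_i = d_i * (1 + lam0 + lam1 * x_i) chosen so that
    sum(w_i) equals total_n and sum(w_i * x_i) equals total_x.
    """
    # Weighted cross-product matrix of z_i = (1, x_i): [[a, b], [b, c]].
    a = sum(design_weights)
    b = sum(d * xi for d, xi in zip(design_weights, x))
    c = sum(d * xi * xi for d, xi in zip(design_weights, x))
    # Gap between the known totals and their Horvitz-Thompson estimates.
    r0 = total_n - a
    r1 = total_x - b
    # Solve the 2x2 linear system for (lam0, lam1) by Cramer's rule.
    det = a * c - b * b
    lam0 = (c * r0 - b * r1) / det
    lam1 = (a * r1 - b * r0) / det
    return [d * (1.0 + lam0 + lam1 * xi) for d, xi in zip(design_weights, x)]


# Simulated cohort (assumption): x is known for everyone, y only at phase two.
rng = random.Random(1)
N, n = 2000, 200
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]  # y correlated with x

sample = rng.sample(range(N), n)          # simple random sample at phase two
d = [N / n] * n                           # design weights = 1 / sampling fraction
xs = [x[i] for i in sample]
ys = [y[i] for i in sample]
w = calibrate_linear(d, xs, N, sum(x))    # calibrated weights

print("true total of y:      ", sum(y))
print("HT estimate:          ", ht_total(d, ys))
print("calibrated estimate:  ", ht_total(w, ys))
```

By construction, the calibrated weights reproduce the cohort size and the cohort total of `x` exactly. When the target variable is strongly correlated with the auxiliary variable, as in this simulated example, the calibrated estimate of its total will typically vary less across samples than the unadjusted Horvitz-Thompson estimate.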
Cite this article
Breslow, N.E., Lumley, T., Ballantyne, C.M. et al. Improved Horvitz–Thompson Estimation of Model Parameters from Two-phase Stratified Samples: Applications in Epidemiology. Stat Biosci 1, 32–49 (2009). https://doi.org/10.1007/s12561-009-9001-6
DOI: https://doi.org/10.1007/s12561-009-9001-6