Population-based absolute risk estimation with survey data


Absolute risk is the probability that a cause-specific event occurs in a given time interval in the presence of competing events. We present methods to estimate population-based absolute risk from a complex survey cohort that can accommodate multiple exposure-specific competing risks. The hazard function for each event type consists of an individualized relative risk multiplied by a baseline hazard function, which is modeled nonparametrically or parametrically with a piecewise exponential model. An influence method is used to derive a Taylor-linearized variance estimate for the absolute risk estimates. We introduce novel measures of the cause-specific influences that can guide modeling choices for the competing event components of the model. To illustrate our methodology, we build and validate cause-specific absolute risk models for cardiovascular and cancer deaths using data from the National Health and Nutrition Examination Survey. Our applications demonstrate the usefulness of survey-based risk prediction models for predicting health outcomes and quantifying the potential impact of disease prevention programs at the population level.





We thank the reviewers for their helpful comments. We are grateful to Dr. Barry Graubard for suggestions he provided to us during the writing of this paper. This research was supported by the intramural research program of the National Cancer Institute.

Author information



Corresponding author

Correspondence to Ruth M. Pfeiffer.



Derivatives and Taylor deviates for the piecewise exponential hazard function

To simplify the notation, we express the absolute risk estimate of (4) in a more compact form,

$$\begin{aligned} \hat{\pi }(\tau _{0{n_0}},\tau _{1n_1}; \varvec{x}) = \sum _{q={n_0}}^{n_1} \hat{S}(\tau _q) A_q (1-B_q) \end{aligned}$$


where

$$\begin{aligned} A_q&= \frac{\hat{\lambda }_0^{(1)} (\tau _q) \exp ( \hat{\varvec{\beta }}^{(1)^{\prime }}\varvec{x}^{(1)})}{\sum _m \hat{\lambda }_0^{(m)}(\tau _q) \exp (\hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)} )},\\ B_q&= \exp \left\{ - \sum _m \hat{\lambda }_0^{(m)} (\tau _q) \exp (\hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)}) (\tau _{1q} - \tau _{0q} )\right\} , \end{aligned}$$

and \(\hat{S}(\tau _q)\) is as defined in Eq. (5). Then the deviates for \(\hat{\pi }(\tau _{0{n_0}},\tau _{1n_1}; \varvec{x})\) are

$$\begin{aligned} \begin{array}{ll} \Delta _{ijk} \lbrace \hat{\pi }(\tau _{0{n_0}},\tau _{1n_1}; \varvec{x})\rbrace =&{} \sum \limits _{q={n_0}}^{n_1} \left[ \hat{S}(\tau _q) (1-B_q) \Delta _{ijk} \lbrace A_q \rbrace \right. \\ &{}\left. + A_q (1-B_q) \Delta _{ijk} \lbrace \hat{S}(\tau _q) \rbrace - A_q \hat{S}(\tau _q) \Delta _{ijk} \lbrace B_q \rbrace \right] . \end{array} \end{aligned}$$

Taking each component in turn, the deviates for \(A_q\) are

$$\begin{aligned} \Delta _{ijk} \lbrace A_q \rbrace = T_{q}^{-1} \Delta _{ijk} \lbrace T^{(1)}_{q} \rbrace - \frac{T^{(1)}_{q}}{T_{q}^2} \sum _{m=1}^M \Delta _{ijk} \lbrace T^{(m)}_{q} \rbrace \end{aligned}$$

where \(T^{(m)}_{q} = \hat{\lambda }_0^{(m)} (\tau _q) \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)})\) and \(T_{q} = \sum _{m=1}^M T^{(m)}_{q}\), with deviates

$$\begin{aligned} \Delta _{ijk} \lbrace T^{(m)}_{q} \rbrace = \exp (\hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)})\Delta _{ijk} \lbrace \hat{\lambda }_0^{(m)} (\tau _q) \rbrace + \varvec{x}^{(m)} T^{(m)}_{q} \Delta _{ijk} \lbrace \hat{\varvec{\beta }}^{(m)} \rbrace . \end{aligned}$$

The deviates for \(B_q\) are

$$\begin{aligned} \Delta _{ijk} \lbrace B_q \rbrace = - \sum _{{m}=1}^M \left[ \exp (\hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)}) (\tau _{1q} - \tau _{0q} ) B_q\left( \varvec{x}^{(m)} \hat{\lambda }_0^{(m)} (\tau _q) \Delta _{ijk} \lbrace \hat{\varvec{\beta }}^{(m)^{\prime }} \rbrace + \Delta _{ijk} \lbrace \hat{\lambda }_0^{(m)} (\tau _q) \rbrace \right) \right] . \end{aligned}$$

For \(q>n_0\), we note that \(\hat{S}(\tau _q) = \prod _{l=n_0}^{q-1} B_l\) so that

$$\begin{aligned} \Delta _{ijk} \lbrace \hat{S}(\tau _q) \rbrace = \hat{S}(\tau _q) \sum _{l=n_0}^{q-1} B_l^{-1} \Delta _{ijk} \lbrace B_l \rbrace , \end{aligned}$$

and \(\Delta _{ijk} \lbrace \hat{S}(\tau _q) \rbrace \) is zero when \(q=n_0\).
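The chain of deviates for \(A_q\), \(B_q\) and \(\hat{S}(\tau _q)\) can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes a single scalar covariate per cause, stores cause 1 at index 0, and relabels the intervals \(n_0,\ldots ,n_1\) as \(0,\ldots ,Q-1\). The function names and the inputs `d_lam` and `d_beta`, which stand for one observation's deviates of the baseline hazards and coefficients, are hypothetical.

```python
import math

def components(lam, beta, x, width):
    """lam[m][q]: baseline hazard of cause m in interval q (assumed layout).
    beta[m], x[m]: scalar coefficient and covariate for cause m.
    width[q]: interval length tau_1q - tau_0q."""
    M, Q = len(lam), len(width)
    T = [[lam[m][q] * math.exp(beta[m] * x[m]) for q in range(Q)]
         for m in range(M)]
    Tsum = [sum(T[m][q] for m in range(M)) for q in range(Q)]
    A = [T[0][q] / Tsum[q] for q in range(Q)]
    B = [math.exp(-Tsum[q] * width[q]) for q in range(Q)]
    S = [1.0] * Q                      # S at the first interval is 1
    for q in range(1, Q):
        S[q] = S[q - 1] * B[q - 1]     # S(tau_q) = product of earlier B_l
    return T, Tsum, A, B, S

def absolute_risk(lam, beta, x, width):
    _, _, A, B, S = components(lam, beta, x, width)
    return sum(S[q] * A[q] * (1.0 - B[q]) for q in range(len(width)))

def risk_deviate(lam, beta, x, width, d_lam, d_beta):
    """Linearized change in the absolute risk for given deviates of the
    baseline hazards (d_lam[m][q]) and coefficients (d_beta[m])."""
    T, Tsum, A, B, S = components(lam, beta, x, width)
    M, Q = len(lam), len(width)
    dT = [[math.exp(beta[m] * x[m]) * d_lam[m][q] + x[m] * T[m][q] * d_beta[m]
           for q in range(Q)] for m in range(M)]
    dTsum = [sum(dT[m][q] for m in range(M)) for q in range(Q)]
    dA = [dT[0][q] / Tsum[q] - T[0][q] / Tsum[q] ** 2 * dTsum[q]
          for q in range(Q)]
    dB = [-B[q] * width[q] * dTsum[q] for q in range(Q)]
    # Empty sum for q = 0, so the deviate of S is zero at the first interval.
    dS = [S[q] * sum(dB[l] / B[l] for l in range(q)) for q in range(Q)]
    return sum(S[q] * (1.0 - B[q]) * dA[q]
               + A[q] * (1.0 - B[q]) * dS[q]
               - A[q] * S[q] * dB[q] for q in range(Q))
```

One way to use the sketch is to compare `risk_deviate` with a central finite difference of `absolute_risk` along the same direction; the two should agree up to numerical error.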

The Taylor deviates for \(A_q\), \(B_q\) and \(\hat{S}(\tau _q)\) are each functions of \(\hat{\lambda }_0^{(m)}\) and \(\hat{\varvec{\beta }}^{(m)}\). For \(\hat{\lambda }_0^{(m)}\), we have

$$\begin{aligned} \Delta _{ijk} \lbrace \hat{\lambda }_0^{(m)} (\tau _q) \rbrace = D^{(m)}(\tau _q)^{-1} \left[ w_{ijk} \delta ^{(m)}_{ijk} (t_{ijk}) I(\tau _{0q} \le t_{ijk} < \tau _{1q}) - \hat{\lambda }_0^{(m)} (\tau _q) \Delta _{ijk} \lbrace D^{(m)}(\tau _q) \rbrace \right] \end{aligned}$$


where

$$\begin{aligned} \Delta _{ijk} \lbrace D^{(m)}(\tau _q) \rbrace&= \mathcal A _{ijk}(\tau _q) w_{ijk} \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }} \varvec{x}^{(m)}_{ijk} )\\&\quad + \left[ \sum _{i,j,k} \varvec{x}^{(m)}_{ijk} \mathcal A _{ijk}(\tau _q) w_{ijk} \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)}_{ijk} ) \right] \Delta _{ijk} \lbrace \hat{\varvec{\beta }}^{(m)} \rbrace , \end{aligned}$$

with \(\mathcal A _{ijk}(\tau _q)\) defined in Eq. (6). The Taylor deviates for each \(\hat{\varvec{\beta }}^{(m)}\) are

$$\begin{aligned} \Delta _{ijk} \lbrace \hat{\varvec{\beta }}^{(m)} \rbrace = \mathcal H (\hat{\varvec{\beta }}^{(m)} )^{-1} w_{ijk} \delta ^{(m)}_{ijk} (t_{ijk}) \lbrace \varvec{x}^{(m)}_{ijk} - \bar{\varvec{H}}(\hat{\varvec{\beta }}^{(m)},t_{ijk}) \rbrace \end{aligned}$$

where \(\mathcal H (\hat{\varvec{\beta }}^{(m)})\) is the matrix of second partial derivatives of the log pseudo-likelihood,

$$\begin{aligned} \mathcal H (\hat{\varvec{\beta }}^{(m)})&= - \left[ \sum _{i,j,k} \varvec{x}^{(m)}_{ijk}{\varvec{x}^{(m)}_{ijk}}^{\prime } h^{(m)}_{ijk} \right] + \bar{\varvec{H}}(\hat{\varvec{\beta }}^{(m)},t_{ijk}) \bar{\varvec{H}}(\hat{\varvec{\beta }}^{(m)},t_{ijk})^{\prime } ,\\ h^{(m)}_{ijk}&= w_{ijk} y_{ijk}(t) \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }} \varvec{x}^{(m)}_{ijk} ) / \sum _{i,j,k} w_{ijk} y_{ijk}(t) \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }} \varvec{x}^{(m)}_{ijk} ), \end{aligned}$$

and \(\bar{\varvec{H}}(\hat{\varvec{\beta }}^{(m)},t_{ijk})\) is defined in Eq. (3). Thus, the deviates for \(\hat{\varvec{\beta }}^{(m)}\) are equivalent to the per-observation update in a Newton–Raphson optimization algorithm where the objective function is the weighted pseudo-likelihood of the Cox regression model.

Derivatives and Taylor deviates for the semiparametric hazard function

Denote the \(N^{(m)}\) ordered observed event times occurring within \([t_0,t_1)\) for the \(m\)th cause as \(u^{(m)}_1 < u^{(m)}_2 < \ldots < u^{(m)}_{N^{(m)}}\). In terms of these event times, Eq. (1) becomes

$$\begin{aligned} \hat{\pi }(t_0,t_1; \varvec{x})&= \sum \limits _{i=1}^{N^{(1)}} \exp (\hat{\varvec{\beta }}^{(1)^{\prime }}\varvec{x}^{(1)} )\hat{\lambda }_0^{(1)}(u^{(1)}_i) \prod _{m=1}^M \left( \frac{\hat{S}_0^{(m)}(u^{(1)}_i)}{\hat{S}_0^{(m)}(u^{(1)}_1)} \right) ^{\exp ( \hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)} )} \nonumber \\&= \sum \limits _{i=1}^{N^{(1)}} \hat{p}(u^{(1)}_i). \end{aligned}$$

As with the piecewise model, we determine the derivative and deviates for each component of Eq. (14). For \(\hat{\varvec{\beta }}^{(m)}\), the derivative is

$$\begin{aligned} \frac{\partial \hat{\pi }(t_0,t_1;\varvec{x}) }{\partial \hat{\varvec{\beta }}^{(m)}} = \varvec{x}^{(m)} \left[ \hat{\pi }(t_0,t_1;\varvec{x}) + \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)} ) \sum _{i=1}^{N^{(1)}} \log \left( \hat{S}_0^{(m)}(u^{(1)}_i)/\hat{S}_0^{(m)}(u^{(1)}_1) \right) \hat{p}(u^{(1)}_i) \right] , \end{aligned}$$

when \(m=1\) and

$$\begin{aligned} \frac{\partial \hat{\pi }(t_0,t_1;\varvec{x}) }{\partial \hat{\varvec{\beta }}^{(m)}} = \varvec{x}^{(m)} \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)} ) \sum _{i=1}^{N^{(1)}} \log \left( \hat{S}_0^{(m)}(u^{(1)}_i)/\hat{S}_0^{(m)}(u^{(1)}_1) \right) \hat{p}(u^{(1)}_i) \end{aligned}$$

for competing causes. The Taylor deviates for each \(\hat{\varvec{\beta }}^{(m)}\) are the same as given by Eq. (13) of the piecewise model.
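The two derivative formulas can be verified against finite differences. The sketch below is illustrative only. It assumes one scalar covariate per cause, holds the baseline hazards and survival functions fixed, and stores cause 1 at index 0. The names `p_terms` and `d_risk_d_beta` and the input layout are hypothetical.

```python
import math

def p_terms(lam1, S0, beta, x):
    """Summands p(u_i) of the semiparametric absolute risk.
    lam1[i]: baseline hazard of cause 1 at its i-th event time.
    S0[m][i]: baseline survival of cause m at that same event time.
    beta[m], x[m]: scalar coefficient and covariate for cause m."""
    rr = [math.exp(b * xi) for b, xi in zip(beta, x)]
    terms = []
    for i in range(len(lam1)):
        surv = 1.0
        for m in range(len(S0)):
            surv *= (S0[m][i] / S0[m][0]) ** rr[m]
        terms.append(rr[0] * lam1[i] * surv)
    return terms

def d_risk_d_beta(lam1, S0, beta, x, m):
    """Partial derivative of the absolute risk with respect to beta[m]."""
    p = p_terms(lam1, S0, beta, x)
    rr_m = math.exp(beta[m] * x[m])
    log_sum = sum(math.log(S0[m][i] / S0[m][0]) * p[i] for i in range(len(p)))
    own = sum(p) if m == 0 else 0.0   # extra term only for the cause of interest
    return x[m] * (own + rr_m * log_sum)
```

The extra term `own` is the only difference between the cause of interest and the competing causes, which mirrors the two displayed expressions.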

The derivatives for the baseline hazard components are

$$\begin{aligned} \frac{\partial \hat{\pi }(t_0,t_1; \varvec{x})}{\partial \hat{\lambda }_0^{(1)}(u^{(1)}_i)} = \hat{\lambda }_0^{(1)}(u^{(1)}_i)^{-1} \hat{p}(u^{(1)}_i). \end{aligned}$$

The Taylor deviates for the baseline hazard of cause \(m\) at observed event time \(t\) are

$$\begin{aligned} \Delta _{ijk} \lbrace \hat{\lambda }_0^{(m)} (t) \rbrace = \frac{\partial \hat{\lambda }_0^{(m)} (t)}{\partial N^{(m)}(t)} \Delta _{ijk} \lbrace N^{(m)}(t) \rbrace + \frac{\partial \hat{\lambda }_0^{(m)} (t)}{\partial Y^{(m)}(t)} \Delta _{ijk} \lbrace Y^{(m)}(t) \rbrace , \end{aligned}$$


where

$$\begin{aligned} N^{(m)}(t) = \sum _{i,j,k} w_{ijk} y_{ijk}(t) \delta ^{(m)}_{ijk}(t) \end{aligned}$$

and

$$\begin{aligned} Y^{(m)}(t) = \sum _{i,j,k} w_{ijk} y_{ijk}(t) \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }} \varvec{x}^{(m)}_{ijk} ). \end{aligned}$$

In terms of these quantities, the Taylor deviates are

$$\begin{aligned} \Delta _{ijk} \lbrace \hat{\lambda }_0^{(m)} (t) \rbrace = Y^{(m)}(t)^{-1} (w_{ijk} y_{ijk}(t) \delta ^{(m)}_{ijk}(t) - \hat{\lambda }_0^{(m)}(t) \Delta _{ijk} \lbrace Y^{(m)}(t) \rbrace ) \end{aligned}$$


where

$$\begin{aligned} \begin{array}{ll} \Delta _{ijk} \lbrace Y^{(m)}(t) \rbrace =&{} w_{ijk} y_{ijk}(t) \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }} \varvec{x}^{(m)}_{ijk} ) \\ &{}+ \left[ \sum \limits _{i,j,k} \varvec{x}^{(m)}_{ijk} w_{ijk} y_{ijk}(t) \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }} \varvec{x}^{(m)}_{ijk} ) \right] \Delta _{ijk} \lbrace \hat{\varvec{\beta }}^{(m)} \rbrace . \end{array} \end{aligned}$$

We note that the hazard deviates of the semiparametric model are equivalent to those of the piecewise model in Eq. (12) when each interval of the piecewise model contains exactly one observed event time.
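The baseline hazard deviates at a single event time \(t\) for one cause can be sketched as follows. This is a hedged illustration rather than the authors' code. It flattens the index \(ijk\) to a single observation index, uses one scalar covariate, and takes the coefficient deviates `d_beta` as given. The function name and argument layout are assumptions.

```python
import math

def hazard_deviates(w, y, delta, beta, x, d_beta):
    """Weighted baseline hazard at one event time t and its deviates.
    w[k]: survey weight; y[k]: 1 if observation k is at risk at t, else 0;
    delta[k]: 1 if observation k has an event of this cause at t, else 0;
    beta, x[k]: scalar coefficient and covariate; d_beta[k]: deviate of beta."""
    n = len(w)
    rr = [math.exp(beta * x[k]) for k in range(n)]
    N = sum(w[k] * y[k] * delta[k] for k in range(n))        # N(t)
    Y = sum(w[k] * y[k] * rr[k] for k in range(n))           # Y(t)
    lam = N / Y
    xwy = sum(x[k] * w[k] * y[k] * rr[k] for k in range(n))  # bracketed sum
    dY = [w[k] * y[k] * rr[k] + xwy * d_beta[k] for k in range(n)]
    dev = [(w[k] * y[k] * delta[k] - lam * dY[k]) / Y for k in range(n)]
    return lam, dev
```

A convenient sanity check is that the deviates sum to zero when every entry of `d_beta` is zero, because the first terms sum to \(N(t)\) and the second terms to \(\hat{\lambda }_0(t) Y(t)\).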

The final components are the survival functions. The derivatives for each \(\hat{S}_0^{(m)}(u^{(1)}_j)\) are

$$\begin{aligned} \frac{\partial \hat{\pi }(t_0,t_1;\varvec{x})}{\partial \hat{S}_0^{(m)}(u^{(1)}_j)} = \exp ( \hat{\varvec{\beta }}^{(m)^{\prime }}\varvec{x}^{(m)} ) \hat{S}_0^{(m)}(u^{(1)}_j)^{-1} \hat{p}(u^{(1)}_j). \end{aligned}$$

From the semiparametric estimate of Eq. (7), the Taylor deviates for the baseline survival up to time \(u^{(1)}_j\) for the \(m\)th risk type are

$$\begin{aligned} \Delta _{ijk} \lbrace \hat{S}_0^{(m)} (u^{(1)}_{j}) \rbrace = -\hat{S}_0^{(m)} (u^{(1)}_{j}) \sum _{u^{(m)}_{q}\le u^{(1)}_j} \Delta _{ijk} \lbrace \hat{\lambda }_0^{(m)}(u^{(m)}_{q}) \rbrace . \end{aligned}$$

Combining these results, the Taylor deviates of \(\hat{\pi }(t_0,t_1;\varvec{x})\) are

$$\begin{aligned} \Delta _{ijk} \lbrace \hat{\pi }(t_0,t_1; \varvec{x}) \rbrace&= \sum _{m=1}^M \frac{\partial \hat{\pi }(t_0,t_1; \varvec{x})}{\partial \hat{\varvec{\beta }}^{(m)}} \Delta _{ijk} \lbrace \hat{\varvec{\beta }}^{(m)} \rbrace + \sum _{{q}=1}^{N^{(1)}} \frac{\partial \hat{\pi }(t_0,t_1; \varvec{x})}{\partial \hat{\lambda }_0^{(1)} (u^{(1)}_{q}) } \Delta _{ijk} \lbrace \hat{\lambda }_0^{(1)} (u^{(1)}_{q}) \rbrace \\&+ \sum _{{q}=1}^{N^{(1)}} \sum _{m=1}^M \frac{\partial \hat{\pi }(t_0,t_1; \varvec{x})}{\partial \hat{S}_0^{(m)} (u^{(1)}_{q})} \Delta _{ijk} \lbrace \hat{S}_0^{(m)} (u^{(1)}_{q}) \rbrace . \end{aligned}$$
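Once the deviates are available, a Taylor-linearized variance can be formed from them. The variance formula itself is not given in this appendix, so the sketch below rests on two assumptions: that \(i\), \(j\) and \(k\) index strata, primary sampling units (PSUs) and individuals, and that the usual with-replacement between-PSU estimator applies. The function name is hypothetical, and the deviates are taken to include the survey weights already, as in the expressions above.

```python
from collections import defaultdict

def linearized_variance(deviates, strata, psus):
    """Between-PSU, within-stratum variance of summed Taylor deviates.
    deviates[k], strata[k], psus[k]: deviate, stratum label and PSU label
    of observation k (assumed flat layout)."""
    # Sum the deviates within each PSU.
    totals = defaultdict(float)
    for d, h, c in zip(deviates, strata, psus):
        totals[(h, c)] += d
    # Group the PSU totals by stratum.
    by_stratum = defaultdict(list)
    for (h, _), z in totals.items():
        by_stratum[h].append(z)
    variance = 0.0
    for z in by_stratum.values():
        n = len(z)
        if n < 2:
            raise ValueError("each stratum needs at least two PSUs")
        mean = sum(z) / n
        variance += n / (n - 1) * sum((v - mean) ** 2 for v in z)
    return variance
```

For example, deviates `[1, 2, 3, 4]` in one stratum with PSU labels `[1, 1, 2, 2]` give PSU totals 3 and 7 and a variance of 16.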


About this article

Cite this article

Kovalchik, S.A., Pfeiffer, R.M. Population-based absolute risk estimation with survey data. Lifetime Data Anal 20, 252–275 (2014). https://doi.org/10.1007/s10985-013-9258-4



  • Absolute risk
  • Censored data
  • Crude risk
  • Cumulative incidence
  • Survey cohort