In epidemiology, it is important not only to assess the association between one exposure and the occurrence of health events, but also to quantify the impact of this exposure on the occurrence of these events. This is done by estimating the attributable risk (AR) or the proportion of cases associated with this exposure in the population. This estimation takes into account not only the strength of the link between exposure and disease but also the importance (prevalence) of exposure in the population [1]. It expresses the proportion of disease cases that can be attributed to exposure [2], that is to say, under certain conditions, the proportion of potentially preventable cases by eliminating exposure. The AR is defined as:

$$ AR=\frac{\mathbf{P}(D) - {\mathbf{P}}(D|\bar{E})}{\mathbf{P}(D)}, $$

where P(D) is the probability of disease (incidence) in the population, which includes exposed E and unexposed \(\bar {E}\) subjects, and \(\mathbf {P}(D | \bar {E})\) is the hypothetical probability of disease in the same population but with all exposure eliminated.

The AR can be estimated from different types of studies including case-control studies for which many estimation methods exist (as reviewed in [3]), but it is rarely estimated from cohort studies. In the context of cohort studies and time-to-event outcomes, AR measures can be defined as functions of time [4-9] although a single AR estimate has been proposed alternatively [10].

Recent developments for estimating AR as a function of time from cohort studies in the survival analysis context have not so far led to a consensus definition. Several definitions have been proposed depending on whether authors interpret disease incidences P(D) and \(\mathbf {P}(D|\bar {E})\) in Eq. (1) as cumulative distribution functions (CDFs) [6-9] or as instantaneous hazard functions [4, 5]. The two definitions converge only for rare diseases or low exposure prevalence [4]. Here we focus on the first definition of AR based on CDFs which looks more consistent with the standard AR definition and appears to be the most used in the literature. Several methods of estimation have been proposed for the AR defined in this case, including nonparametric approaches based on Kaplan-Meier’s estimator of the survival function [7], a semiparametric approach based on Cox’s proportional hazards model [7] and a fully parametric approach assuming a piecewise constant hazards model [8]. Some evaluations were made for the nonparametric and semiparametric approaches [7] but, to the best of our knowledge, the performances of these various approaches have not been systematically compared.

The aim of this paper was to compare available methods for estimating AR when defined using CDFs. In the sections to follow, we first review the corresponding estimation methods so far published in the statistical literature. Simulations were conducted to assess the performance of the proposed AR estimators. The methods were then applied to data on menopausal hormone therapy (MHT) and breast cancer from the E3N women cohort (Étude Épidémiologique auprès de Femmes de la Mutuelle Générale de l’Éducation Nationale) [11]. For the purpose of our illustration, we considered 38,359 participants who were postmenopausal and free of cancer when they completed a self-administered questionnaire on their past use of any MHT in January 1992. In total, 17,185 (44.8%) women had ever used MHT at baseline and were considered exposed thereafter. By June 2008 (for a maximal 16.4 years and mean 14.0 years of follow-up), 2,228 invasive breast cancers had been diagnosed (1, 106 in unexposed women). A recent work on the E3N cohort estimated a 14.5% postmenopausal breast cancer risk attributable to MHT use after 15 years of follow-up [12]. We estimated AR as a CDF-based function of time at four time points using nonparametric, semiparametric and parametric approaches, as well as the single overall AR measure proposed by Spiegelman et al. [10].


Review of estimation methods

When interpreting the incidence of disease P(D) as the event probability until some time t, the AR is defined as follows [4, 6, 7]:

$$ A(t) = \frac{\mathbf{P}(T \leq t) - \mathbf{P}(T \leq t | Z=z^{*})}{\mathbf{P}(T \leq t)} $$

where T denotes the time to disease or event time, Z a p-vector of risk factors and z the p-vector of their chosen target values in order to quantify the potential impact of modifying the current distribution of Z to z . Since, in most applications, z is defined by setting one of the components of Z to its baseline (unexposed) level, we use notation Z=0 instead of Z=z in the following. Using the survival functions S(t)=P(T>t) and S 0(t)=S(T>t|Z=0), the AR for time-to-event outcomes can be written as follows [7, 9]:

$$ A(t)=\frac{S_{0}(t)-S(t)}{1-S(t)}=1-\frac{1-S_{0}(t)}{1-S(t)}. $$

A natural estimate of A(t) is obtained by replacing the survival functions S 0(.) and S(.) by their respective estimators \(\hat {S}_{0}(.)\) and \(\hat {S}(.)\). Various estimators \(\hat {S}_{0}(.)\) and \(\hat {S}(.)\) have been proposed, as detailed in the following subsubsections.

Nonparametric approaches

Chen et al. [7] considered several approaches for estimating survival functions S 0(.) and S(.) depending on covariate type: nonparametric when all p covariates are categorical and independent of time, otherwise semiparametric. The former case applies to a single categorical covariate or several covariates forming K+1 categories.

When all p covariates are categorical and independent of time and under the assumption that censoring is independent of the covariates, Chen et al. [7] suggested estimating both S 0(.) and S(.) by the Kaplan-Meier method [13].

When all p covariates are categorical and independent of time but the assumption of covariate-independent censoring does not hold, Chen et al. [7] suggested estimating S(.) by the weighed Kaplan-Meier (WKM) estimator [14] and S 0(.) by the Kaplan-Meier method. For a p-vector Z of covariates with K+1 categories, the WKM estimator is defined as:

$$ \hat{S}(t)=\frac{1}{n}\sum\limits_{k=0}^{K} n_{k} \hat{S}_{k}(t) $$

where \(\hat {S}_{k}(t)\) is the Kaplan-Meier estimator among those with covariate profile k=0,1,2,…,K and n k is the number of subjects with covariate profile k so that \(\sum _{k=0}^{K} n_{k} \) equals n, the total number of subjects.

In all cases, the estimation of the variance of \(\hat {A}(t)\) is based on the expression of \(\{\hat {A}(t) - A(t)\}\) as a linear combination of \(\{\hat {S}_{0}(t) - S_{0}(t)\}\) and \(\{\hat {S}(t) - S(t)\}\) and relies on counting process results [7].

Semiparametric approach

For a more general type of covariates Z, i.e., when covariates are continuous, time-dependent or with too large a number of profile categories for nonparametric approaches, Chen et al. [7] considered using semiparametric instead of nonparametric methods to estimate S 0(.) and S(.). Of these, the Cox proportional hazards model [15] is one of the most familiar. It assumes that, at any time t, the hazard function λ(t|Z) is the product of a nonparametric baseline hazard λ 0(t) and a parametric function of the p-vector of covariates Z (or Z(t) in the case of time-dependent covariates) and the p-vector of corresponding parameters β. The parametric function is usually taken to be the exponential function, such that λ(t|Z)=λ 0(t) exp(β T Z). In this case,

$$\begin{aligned} \hat{S}_{0}(t)&=\exp\!\left[-\hat{\Lambda}_{0}(t)\right] \text{~and~}\\ \hat{S}(t)&=\frac{1}{n}\sum\limits_{i=1}^{n}\exp\!\left[-{\int\limits_{0}^{t}}\exp\{\hat{\beta}^{T}z_{i}(u)\}d\hat{\Lambda}_{0}(u)\right] \end{aligned} $$

where \(\hat {\Lambda }_{0}(.)\) is the Breslow estimator [16] of the baseline cumulative risk \(\Lambda _{0}(t) = {\int \limits _{0}^{t}} \lambda _{0}(u) du\) and \(\hat {\beta }\) is the maximum partial likelihood estimator.

The expression of the variance of \(\hat {A}(t)\) follows the same general principles as for the nonparametric approaches above [7].

Parametric approach

Laaksonen et al. [8] proposed a parametric estimator based on a proportional hazards model with piecewise constant hazards (PCH). In this approach, follow-up time is partitioned into J prespecified intervals (0=a 0,a 1],(a 1,a 2],…,(a j−1,a j ],…,(a J−1,a J ], and the survival function at time t is estimated assuming a constant baseline hazard \(\hat {\lambda }_{0j} = \exp (\hat {\alpha }_{j})\) in each j-th interval (a j−1,a j ], j=1,2,…,J as follows:

$$ \hat{S}_{PCH}(t|Z_{i}) = \exp\!\left\{-\sum_{j=1}^{J} \exp(\hat{\alpha}_{j}+\hat{\beta}^{T}Z_{i})\delta_{j}(t)\right\} $$

where δ j (t) defines the length of follow-up in the j-th interval:

$$ \delta_{j}(t)=\left\{\begin{array}{ll} 0 & \text{if~}\, t\leq a_{j-1},\\ t-a_{j-1} & \text{if~}\, a_{j-1}<t\leq a_{j}, \\ a_{j}-a_{j-1} & \text{if~}\, t>a_{j}. \\ \end{array} \right. $$

The so-called population attributable fraction (PAF) estimator [8] is then defined using the following parametric estimators:

$$\begin{aligned} \hat{S}_{0}(t)&=\frac{1}{n}\sum\limits_{i=1}^{n}\hat{S}_{PCH}(t|Z_{i}=0) \text{~and~}\\ \hat{S}(t)&=\frac{1}{n}\sum\limits_{i=1}^{n}\hat{S}_{PCH}(t|Z_{i}= z_{i}). \end{aligned} $$

The model parameter estimates \(\hat {\alpha }=(\hat {\alpha }_{1}, \ldots,\hat {\alpha }_{J})\) and \(\hat {\beta }\) are obtained by maximum likelihood estimation. The variance of \(\hat {A}(t)\) is estimated using the delta method [8].

Global approaches over the whole follow-up period

Alternatively to the definition of the AR as a function of time, Spiegelman et al. [10] proposed to estimate a single value in cohort studies:

$$ AR = \frac{\sum\limits_{k=0}^{K} q_{k} (RR_{k} - 1)}{1 + \sum\limits_{k=0}^{K} q_{k} (RR_{k} - 1)} $$

where RR k and q k , k=0,…,K, are the relative risk and prevalence in the target population for the kth combination of risk factors.

Upon using Cox’s proportional hazards model, the overall AR can be estimated using estimated hazard ratio (HR) for relative risk and person-years for exposure prevalence in the cohort. The asymptotic variance is estimated using the multivariate delta method [10].

In the case of an unadjusted, binary exposure variable, the formula by Spiegelman et al. [10] simplifies into

$$ AR = \frac{q (RR - 1)}{1 + q (RR - 1)} $$

where q denotes the exposure prevalence and RR the relative risk of exposed relative to nonexposed subjects. This formula resembles the well-known formula used by epidemiologists [1, 2] where q is estimated by the proportion of exposed subjects at baseline (instead of exposed person-years over the whole follow-up).


In this work, we considered a single, binary covariate Z representing exposure with Z=0 and 1 for unexposed and exposed subjects respectively, simulated as a Bernoulli random variable with probability of exposure (q) set to 0.25, 0.50 and 0.75. To compare the different approaches for estimating AR, we considered either proportional or nonproportional hazards between the exposed and the unexposed.

For proportional hazards, we used instantaneous hazard functions of the form λ(t|Z)=λ 0(t) exp(β Z) where β denotes the regression parameter set to ln(2) or 0, and λ 0(t) the baseline hazard function taken from a Weibull distribution with shape parameter γ and scale parameter θ: λ 0(t)=γ θ γ t γ−1, and generated event times from (1/θ)[− ln(U)/ exp(β Z)]1/γ with U uniform on (0,1). We explored situations where the baseline hazard was constant (γ=1) or dependent on time, either increasing (γ=4/3) or decreasing (γ=3/4) with time. The scale parameter θ was chosen as a function of the shape parameter γ so as to obtain median survival time equal to 15 years for unexposed subjects in all scenarios. We calculated survival functions S 0(.) and S(.) as exp{−(t/θ)γ} and (1−q) exp[−(t/θ)γ]+q exp[−(t/θ)γ exp(β)] respectively and derived theoretical values of AR as a function of time from Eq. (2). For the global AR derived from the simpler approach, theoretical values were obtained as q[ exp(β)−1]/{1+q[ exp(β)−1]}.

For nonproportional hazards, we generated event times from G −1[− ln(U)]/[λ 0 exp(β Z)] assuming a cumulative hazard function of the form Λ(t|Z)=G[λ 0 t exp(β Z)] where G is the logarithmic transformation G(t)= ln(1+2t)/2 [7]. Setting λ 0=0.1 yielded a median survival time for unexposed subjects equal to 15 years as in the proportional hazards case. Setting the regression coefficient β to ln(2), the HR between the exposed and the unexposed decreased from 2 toward 1 over time. We calculated survival functions S 0(.) and S(.) as exp{− ln(1+2λ 0 t)/2} and (1−q) exp{− ln(1+2λ 0 t)/2}+q exp{− ln(1+2λ 0 t exp(β))/2} respectively and derived theoretical values of AR as a function of time from Eq. (2).

We generated censoring times independent of the covariate Z and event times T from a uniform distribution on [0,τ], with τ the maximal follow-up time of the study set at 20 years. Depending on scenarios, we obtained censoring percentages around 47-68% (ranges across simulations from 42% to 73%).

We generated 1, 000 data sets of n=1, 000 or 10, 000 independent observations and calculated estimators \(\hat {A}(.)\) of the AR as a function of time and their associated variances using the four approaches: two non-parametric approaches corresponding to the case where S 0(.) and S(.) are both estimated by the Kaplan-Meier method (KM) and to the case where S 0(.) and S(.) are estimated by the Kaplan-Meier and the weighted Kaplan-Meier methods, respectively (WKM) [7], one semiparametric approach using Cox’s proportional hazards model (COX) [7], and one parametric approach corresponding to the case where survival functions are estimated assuming piecewise constant hazards (PCH) [8] considering four intervals of 5-year width. In the case where no event was generated in any five-year interval, the simulated dataset was discarded and replaced by a new one. We also considered the simpler approach based on Eq. (3) to estimate a global AR.

Results of the time-dependent approaches are presented at times t=τ/4, τ/2, 3τ/4 and τ (respectively, 5, 10, 15 and 20 years). For the nonparametric and semiparametric approaches, estimates were obtained at times actually observed in the dataset so we considered values taken at the closest preceding time point. While nonparametric estimations are based on data available until the time of interest, semiparametric and parametric methods use data available over the whole follow-up period. To allow for a fairer comparison under the proportional hazards assumption, we also computed semiparametric and parametric estimators after censoring observation times at either τ/4 or τ/2. The parametric approach was then based on one or two interval(s) of 5-year width respectively.

For all five approaches, results displayed are the average absolute bias relative to the theoretical value A(.), the Sampling Standard Deviation of \(\hat {A}(.)\) (SSD), the average Standard Error Estimator of A(.) (SEE) and the coverage probability (CP) of the 95% confidence interval (CI) of A(.). Although authors [7, 8] have suggested to use the complementary logarithmic transformation ln {1−A(.)black} to improve coverage probabilities in case of small sample size, this did not notably improve coverage probabilities in our results (data not shown) so results presented are for the untransformed A(.).

Simulations were performed using R release 3.0.1. We coded the nonparametric methods using R software and tested the validity of our code by comparing our simulation results with those of the authors using the same parameters [7]. For the semiparametric method [7], we used the R package paf developed by Chen [17]. For the parametric method [8], we used SAS release 9.3 and a set of macros developed by Laaksonen et al. [18]. For the global approach by Spiegelman et al. [10], we used the %par SAS macro developed by the authors.



We first considered the case of proportional hazards between the exposed and the unexposed with β= ln(2) and probability of exposure equal to 0.50, starting with a constant baseline hazard. With a sample size of 1, 000 observations and for the four time-dependent approaches (Table 1, left-hand side), there was more upward bias at the end of follow-up τ, especially with the KM method and the WKM method (to a lesser extent), but AR estimators for all methods and time points were virtually unbiased (relative bias <2.5%). Variance estimators accurately reflected the true variation and the 95% CIs had proper coverage probabilities, except in τ for the two nonparametric methods, where the variance was somewhat underestimated, yielding lower than nominal coverage. Parametric and semiparametric estimators were more precise than nonparametric estimators, particularly at times τ/4 and τ. Estimators of parameter β were unbiased for the semiparametric and parametric approaches (relative bias <0.7%, data not shown).

Table 1 Simulation results for the estimation of attributable risk A(.) under proportional hazards, constant baseline hazard (γ=1) with regression parameter β= ln(2) and probability of exposure q=0.5

When considering samples of size 10, 000 (Table 1, right-hand side), bias decreased in magnitude compared to a sample size of 1, 000 observations (relative bias < 0.7% for AR in all methods and <0.04% for β in the semiparametric and parametric approaches). As expected, precision increased markedly for all methods, by a factor of about \(\sqrt {10}\). Moreover, SEEs and SSDs were in closer agreement even at time τ with nonparametric methods and all coverage probabilities fell within the 0.940 to 0.960 range.

Similar observations held when considering a decreasing baseline hazard (Table 2). When γ=3/4, biases were close to those observed with γ = 1 except for a moderate increase for the parametric approach and both sample sizes (relative bias <2.4%). Nevertheless coverage probabilities remained satisfactory for this method and the others, except again at the end of follow-up τ for the two nonparametric methods and n=1, 000 (0.915 and 0.906 for KM and WKM respectively).

Table 2 Simulation results for the estimation of attributable risk A(.) under proportional hazards, decreasing baseline hazard (γ=3/4) with regression parameter β= ln(2) and probability of exposure q=0.5

Under an increasing baseline hazard (Table 3), coverage probabilities at τ of the two nonparametric estimators worsened with n=1, 000 (0.891 and 0.898 for KM and WKM approaches respectively) as a result of increased biases compared to constant and decreasing baseline hazards. Results were otherwise satisfactory and biases for the parametric method were comparable with those obtained under constant baseline hazard.

Table 3 Simulation results for the estimation of attributable risk A(.) under proportional hazards, increasing baseline hazard (γ=4/3) with regression parameter β= ln(2) and probability of exposure q=0.5

With a lower or greater prevalence of exposure (25% or 75% exposed), coverage probabilities in τ for the nonparametric approaches improved but sometimes remained lower than the nominal value despite a sample size of 10, 000 (Additional file 1: Table S1 and Additional file 2: Table S2 for γ=1 and β= ln(2)). The same general picture held with other values of γ (data not shown), except for the parametric approach which showed slightly insufficient (93%) coverage at times < τ for γ=3/4 and both exposure probabilities 0.25 and 0.75.

Under the same parameters but β=0 (Additional file 3: Table S3 for γ=1 and 50% exposed), results were similar to those with β= ln(2) except for slightly improved coverage probabilities in τ for the nonparametric approaches and a sample size of 1, 000.

Under all scenarios with proportional hazards (Tables 1, 2 and 3, Additional file 1: Table S1, Additional file 2: Table S2 and Additional file 3: Table S3), estimators of global AR from the simpler approach were virtually unbiased with satisfactory coverage probabilities. The estimated single values were generally greater than those of time-dependent approaches at any point in time.

When follow-up was stopped at τ/4 or τ/2, under proportional hazards (data not shown), estimates for the two nonparametric methods were of course identical to those obtained at the same time points with a complete follow-up. SEEs for the semiparametric and parametric methods increased, getting closer to those of nonparametric methods with censoring at τ/2 and even closer with censoring at τ/4. Coverage probabilities remained satisfactory except for the parametric method under decreasing baseline hazard (γ=3/4) where they tended to be lower than the nominal value e.g., 0.935 and 0.918 at τ/4 for censoring at τ/4, β= ln(2) and 50% exposed, and for n=1, 000 and n=10, 000 respectively.

Finally, when considering nonproportional hazards between the exposed and the unexposed (Table 4, for β= ln(2) and 50% exposed), nonparametric methods yielded similar results to those under proportional hazards. However, the semiparametric and parametric approaches that both relied on the proportional hazards assumption performed poorly. With a sample size of 1, 000 observations (Table 4, left-hand side), estimates using the semiparametric approach were biased (relative bias between 7.9 and 32.6%) with poor coverage probabilities except at τ/2. The parametric approach resulted in even more severe biases (relative bias between 14.6 and 81.6%) and poorer coverage probabilities. With n=10, 000, bias remained high and became similar with the semiparametric and parametric approaches (between 7.1 and 31.2% and between 8.3 and 32% respectively), and coverage deteriorated further as a result of tighter 95% CIs (Table 4, right-hand side). With a lower or greater prevalence of exposure, coverage probabilities with the parametric approach improved at all times but generally remained less than 93% (data not shown).

Table 4 Simulation results for the estimation of attributable risk A(.) under nonproportional hazards with regression parameter β= ln(2) and probability of exposure q=0.5

Data example

As in our simulations, we used time-on-study rather than attained age as the time-scale after checking that both yielded similar results. Fitting a Weibull distribution to the observed survival data and considering incident invasive breast cancer as the event of interest (i.e., considering time to breast cancer occurrence), the shape (γ) and scale (θ) parameters were estimated as 1.2 and 178.2 respectively and the corresponding estimated Weibull survival function almost coincided with nonparametric Kaplan-Meier estimate (data not shown). The assumption of proportional hazards between women ever-exposed and those never-exposed to any MHT at baseline seemed appropriate (Schoenfeld residual test, p=0.7), with an estimated HR at 1.22 (95% CI, 1.13 to 1.33) for MHT exposure from the Cox model.

The AR estimates from nonparametric approaches KM and WKM were almost identical (Fig. 1). They tended to increase until 12 years of follow-up (e.g., for the KM approach, from 5.5% (95% CI, −2.7 to 13.6%) after four years to 12.0% (95% CI, 7.8 to 16.2%) after 12 years of follow-up), then to decrease and converge to semiparametric and parametric estimates at the end of follow-up with an estimated 9.2% AR (95% CI, 5.4 to 13.0%) after 16 years. In comparison, estimates using the semiparametric and parametric approaches slightly decreased monotonically over time from 9.0% (95% CI, 5.3 to 12.8%) to 8.8% (95% CI, 5.1 to 12.4%) and from 8.9% (95% CI, 5.2 to 12.6%) to 8.7% (95% CI, 5.0 to 12.3%) respectively. Thus, after 16 years of follow-up, the proportion of invasive breast cancer cases attributable to MHT exposure was close to 9% whatever the method used. Estimates using nonparametric approaches were far less precise at earlier times and displayed wider 95% CIs (even including 0 at time 4 years) than semiparametric and parametric approaches in the first half of the follow-up: e.g., at time 8 years, AR was estimated as 8.9% (95% CI, 3.5 to 14.4%) and 9.0% (95% CI, 5.2 to 12.7%) from the KM and Cox approaches, respectively. Adjusting for age at baseline, either as a continuous covariate in the semiparametric approach or as a dichotomous covariate in all approaches, hardly modified these estimates (data not shown).

Fig. 1
figure 1

Estimation of the risk of invasive breast cancer attributable to ever use of menopausal hormone therapy at baseline as a time function, E3N cohort, 1992-2008. The dark solid and dark dashed curves pertain to the point estimates by KM and COX, respectively, the dark circles to the point estimates by the 4-year interval PCH; the light solid and light dashed curves, as well as the light circles, show the corresponding 95% confidence intervals. The WKM curves are not displayed because they almost coincided with the KM curves at the chosen scale

Finally, using the method proposed by Spiegelman et al. [10], we found that 9.2% (95% CI, 5.4 to 13.0%) of cases who developed invasive breast cancer at various times in the cohort follow-up were attributable to MHT exposure. Using the simpler approach with the proportion of exposed subjects at inclusion, we obtained a close, slightly smaller estimate at 9.1% (95% CI, 5.3 to 12.8%).


Comparing different methods of AR estimation when disease incidence is interpreted as a CDF [7, 8], we observed that AR estimators were essentially unbiased for all approaches when we generated event times from a proportional hazards model. Empirical and estimated variances were close, with proper coverage probabilities except at the end of follow-up for the nonparametric methods and a smaller sample size. When considering a non-constant baseline hazard, estimates using the parametric approach were robust despite misspecification of the baseline hazard. For nonparametric approaches, biases tended to increase at the end of follow-up (time τ) when the baseline hazard increased with time. With the simpler approach, results were satisfactory. However, under nonproportional hazards, estimates using the semiparametric and parametric approaches were biased with poor coverage probabilities.

To our knowledge, this is the first simulation study comparing nonparametric, semiparametric and parametric methods of AR estimation as a function of time as well as a simpler, global approach for a diversity of scenarios (proportional or nonproportional hazards, constant or nonconstant baseline hazard, varying exposure probabilities, strengths of association and sample sizes) in the survival analysis context. Chen et al. [7] reported simulations for the Kaplan-Meier, weighted Kaplan-Meier and transformation models when event times were generated from proportional or nonproportional hazards models with regression parameter β=1, 40% probability of exposure and a sample size of 1, 000 observations. Like them, we found that, under the assumption of independent censoring, results with KM and WKM approaches were very close. Differences between the two nonparametric approaches were apparent when censoring was dependent on covariates [7], which we did not evaluate in this study.

Also in line with Chen et al. [7], when we generated event times from a proportional hazards model, we found that nonparametric and semiparametric estimates were all unbiased, nonparametric estimates had larger variances than semiparametric estimates and estimated variances accurately reflected the true variance except in τ for the nonparametric approaches and a sample size of 1, 000 observations. Nonparametric approaches tended to perform better (respectively worse) when exposure prevalence was lower (respectively higher) which could be expected from the possibly unstable and inefficient Kaplan-Meier estimator of survival among the unexposed when the proportion of those is small [7]. This general picture held in our simulations whether event times were generated with constant, decreasing or increasing baseline hazard. We note, however, that, when we considered a larger sample size, the discrepancies between estimated and empirical variances tended to diminish, with most often satisfactory coverage probabilities in τ.

For nonproportional hazards, we generated event times using a transformation model considered by Chen et al. [7] and found consistent results for the nonparametric approaches, similar to those in the case of proportional hazards. However, while Chen et al. [7] applied the same nonproportional hazards model for both data generation and analysis (AR estimation), we generated data under nonproportional hazards and estimated AR from (misspecified) Cox’s proportional hazards model. This explains the impaired performance we observed when the proportional hazards assumption was violated in contrast with the satisfactory results obtained by Chen et al. [7]. Sjölander and Vansteelandt [9] recently proposed an alternative semiparametric estimator of AR also based on Cox’s proportional hazards model that proved robust to various model misspecifications. However these authors did not evaluate deviations from the proportional hazards assumption.

Like Chen et al. [7] in their simulation and example analysis, we observed greater imprecision of the nonparametric estimators at the start of follow-up, which could explain possible early negative AR values. This imprecision could be expected because the estimation of the survival function relies upon the information available until the time of interest and not many events have yet occurred by then. This differs from the semiparametric and parametric methods which take advantage of the estimation of parameter β being performed over the entire follow-up. Consistently, we found larger variances for the semiparametric and parametric approaches with shorter lengths of follow-up.

Another novelty of this work was the evaluation of the parametric approach to AR estimation proposed by Laaksonen et al. [8] using simulations and its comparison with nonparametric and semiparametric approaches. Generally under proportional hazards, we found close agreement between the semiparametric and parametric approaches, in our simulations as well as in the example. Of note, the parametric approach seemed robust despite misspecification of baseline hazard, i.e., when we considered decreasing or increasing (instead of constant) baseline hazard and proportional hazards. However, like the semiparametric approach based on Cox’s model, the parametric approach was sensitive to the proportional hazards assumption and performed poorly in our simulations when this assumption was violated. We also evaluated the simpler, global approach and our results were satisfactory under proportional hazards.

As noted by several authors [4, 7], simpler approaches based on Eq. (1) or equivalent formulas [1, 2] are generally defined for binary outcomes with time-independent risk factors. Consequently, they prove to be inadequate for cohort studies with censored time-to-event outcomes and possibly time-dependent covariates. In contrast, the nonparametric, semiparametric and parametric approaches we considered here have been specifically developped for censored time-to-event outcomes and produce AR estimate as a function of time, thus allowing the AR to be time-varying. A major limitation of the simpler approach in the context of cohort studies is that it only takes account of the proportion of exposed subjects at the beginning of follow-up. The proportion of exposed subjects indeed decreases as follow-up time increases (because exposed subjects fail earlier than nonexposed subjects) [6]. This explains why our AR estimates from the simpler approach were generally greater than those from time-dependent approaches and further underlines why approaches estimating AR as a function of time are an improvement on the simpler approach in the context of survival analysis.

In our study, we used the definition of AR based on CDFs because it is a natural extension of the standard AR definition (Eq. (1)) for time-to-event outcomes [6–9] and it is equivalent to the standard definition when time t is the end of follow-up in cohort studies [4]. In addition several estimation methods have been proposed for the CDF-based AR definition in cohort studies and the survival analysis context in contrast to the alternative definition based on instantaneous hazard functions [4, 5] for which only one method of estimation based on Cox’s proportional hazards model has been published [4].

In cohort studies where exposed individuals are over-sampled relative to the exposure prevalence in the population, AR will correctly reflect the impact of exposure in the cohort, but the impact at the population level will be overestimated. The marginal survival function S(t) should be corrected in order to alleviate this upward bias on AR estimates. The AR (and its estimates) being a function of time, various representations of AR estimates can be used. We used a graphical representation of the whole time function in our example and produced estimates at four equally spaced times in our simulations. Alternatively, a single overall estimate could be obtained by averaging out the time function of AR estimates or by using the alternative approach by Spiegelman et al. [10] as in our example.

We chose our simulation parameters to resemble real epidemiologic cohorts. These often include a few thousands participants followed for several years. For a smaller sample size (n=500), whether we used the logarithmic transformation or not, we observed findings generally similar to those presented with a sample size of 1,000 observations. This was true with the notable exception of the less than nominal coverage probability for the semiparametric approach at time τ for constant (γ=1) and decreasing (γ=3/4) baseline hazards, and at times τ/4 and τ for increasing (γ=4/3) baseline hazard (data not shown).

In our application, the proportional hazards assumption seemed appropriate, as well as a Weibull distribution for event times with an increasing baseline hazard and shape parameter halfway between the values γ=1 and 4/3 considered in our simulation study. Exposure frequency was also close to our simulated 0.5 probability of exposure. However, as in many epidemiologic cohorts, the censoring rate was much greater in our example (94.2%) than in our simulations. The resulting imprecision may explain the nonparametric AR estimates apparently increasing until three quarters of total follow-up but compatible with the more expected decreasing trend. Chen et al. [7] observed the same finding in their application on a shorter length of follow-up.

Using the approach described by Spiegelman et al. [10], the overall AR estimate for ever use of MHT at baseline was 9.2% in the E3N cohort. This estimate was close to those obtained at the end of follow-up with the nonparametric methods and at the start of follow-up with the parametric and semiparametric approaches. In a recent publication, Dartois et al. [12] reported a higher AR estimate of 14.5% (95% CI, 9.2 to 19.6%) for recent MHT use and postmenopausal invasive breast cancer from the E3N cohort data, using the approach proposed by Spiegelman et al. [10] and a more refined, adjusted analysis with MHT exposure as a time-dependent covariate.

This study has some limitations. First, we did not evaluate AR estimates adjusted for covariates. Adjustment for multiple variables is common practice in epidemiology, especially age which can also be used as the underlying time-variable [19]. In our example, using analyses unadjusted or parametrically adjusted for age, there was a statistically significant association between baseline MHT ever use and breast cancer risk, in line with findings from more complex models with age as the timescale and adjustment for other covariates in the original study [11]. Although adjustment for covariates is available in packages for semiparametric and parametric approaches [7, 8], there are constraints in nonparametric approaches as the number of covariates must be limited and adjustment for continuous variables is not possible. Moreover, available packages for estimating the AR would need to be adapted to allow left truncation resulting from using age as the timescale. Second, in our example, we only considered women who had ever received MHT at baseline as exposed whereas exposure can vary during follow-up. Other methodological studies are needed to take into account the exposure time dependency for estimating AR as a function of time [9]. Finally, we have ignored the competing risk of death and cancers of other sites (11.2% of our 94.2% censored observations) which might also bias our estimate of breast cancer risk attributable to MHT [20].


The AR estimators from the four time-dependent methods had satisfactory performance under the proportional hazards assumption. Estimators using semiparametric and parametric approaches were not robust in case of nonproportional hazards. Lack of precision could be an issue for nonparametric methods at the beginning of the follow-up time in cohorts of relatively low sample size. In practice, if the proportional hazards assumption seems appropriate, the semiparametric or parametric approaches should be used.