1 Introduction

Causal inference provides tools to compare treatment strategies in studies that do not permit random allocation of subjects to therapy groups, e.g., for ethical reasons or simply because it is not feasible. Special analysis methods are necessary because in non-randomized trials, risk factors are likely to be distributed unequally across treatment groups and as a consequence, side-by-side comparisons will lead to biased estimation of the direct treatment effect (Yang et al. 2010; Nørgaard et al. 2017). Randomized trials benefit from causal analysis tools, too, for instance when dealing with non-compliance or selection bias. In this manuscript, we focus on the control of confounding bias. The idea of the counterfactual approach to causal inference is to model the mean outcome in a hypothetical world where all participants of the study are exposed to the same intervention—possibly ‘counter to the fact’, i.e., contrary to the treatment they actually received. Causal conclusions can then be drawn by contrasting the obtained estimates for the treatment levels of interest (Rubin 1974; Hernán and Robins 2020 Sect. I.1).

In the case of time-to-event endpoints, statisticians need to take additional difficulties into account, as the analysis of right-censored data requires particular techniques. The hazard ratio, the common measure of the treatment effect for time-to-event data, raises several issues when the aim is to draw causal inferences: In the first place, it is non-collapsible. Thus, the causal effect estimate in the entire population may differ fundamentally from the average of the causal effect estimates across subgroups, even if the variable defining these subgroups is not a confounder (Martinussen and Vansteelandt 2013; FDA 2023). Another drawback is selection bias, described, e.g., by Aalen et al. (2015): the hazard function only takes survivors into account, and if treatment does indeed affect survival, the distribution of the risk factors will diverge between the survivors of the two treatment groups as time progresses. Apart from that, the hazard ratio, as a single value, fails to convey potentially time-varying effects and also depends on the duration of the study (Hernán 2010). We therefore consider the risk difference as effect measure instead. Our target estimand is based on the cumulative incidence function, which quantifies the risk of experiencing a specific event type out of one or more possible causes by a given time point. In this way, a competing risks framework is accommodated as well, which covers the standard survival setting as a special case. Examples of observational studies that compare treatment effects using the cumulative incidence function include Philipps et al. (2020), Butt et al. (2021) and Chauhan et al. (2022).

Besides the estimated average treatment effect, researchers are often also interested in further statistical inference. However, the stochastic process associated with the estimated cumulative incidence function is rather complex, making it difficult to derive exact confidence intervals and bands. A commonly applied remedy is the classical nonparametric bootstrap proposed by Efron (1981) (cf. Neumann and Billionnet 2016; Stensrud et al. 2020; Stensrud et al. 2016), even though this resampling method is not optimal in several situations, e.g., when dealing with dependent data (Singh 1981; Friedrich et al. 2017). Ozenne et al. (2020) presented an alternative approach based on the influence function, and as counting processes are inherent to time-to-event analysis, resampling methods relying on martingale theory suggest themselves as a further option.

In this paper, we illustrate that apart from the method proposed by Ozenne et al. (2020), the classical bootstrap as well as the martingale-based wild bootstrap also accurately approximate the distribution of the stochastic process at hand. We compare the performance of these resampling approaches in terms of the resulting confidence intervals and bands by means of simulations as well as an applied data example recording the long-term outcomes of early-stage Hodgkin’s disease patients.

The remainder of this manuscript is organized as follows: Sect. 2 establishes the setting and notation as well as the causal estimator for the average treatment effect. In Sect. 3, we introduce the three mentioned resampling approaches. The simulation study and the analysis of the Hodgkin’s disease data are presented in Sects. 4 and 5. Finally, the paper concludes with a discussion.

2 Average treatment effect for right-censored data with competing risks

We consider a competing risks setting with K failure types. Let the absolutely continuous random variables T and C denote an individual’s event and censoring time, respectively. The observed data include \({T \wedge C}\), the minimum of T and C, as well as an indicator \({D \in \{0, 1, \dots , K\}}\), which represents the type of failure. W.l.o.g., let \({D = 1}\) imply that a subject experienced the event of interest. If \({D = 0}\), the event time is censored, i.e., \({C < T}\). Besides, we observe a binary treatment indicator A and a bounded, p-dimensional vector \(\varvec{Z}\) of baseline covariates. Throughout this paper, suppose that the data sample \(\smash {\{(T_i \!\wedge \! C_i, D_i, A_i, \varvec{Z_i})\}_{i \in \{1, \dots , n\}}}\) is independent and identically distributed (i.i.d.), and does not include any tied event times. It is further assumed that \(\smash {T_i}\) and \(\smash {C_i}\) are conditionally independent given \(\smash {(A_i, \varvec{Z_i})}\).

In the presence of competing events, one may be interested in either the direct or the total effect of treatment on the event of interest (Young et al. 2020). The direct effect reflects the impact of the studied therapy in a hypothetical setting where all competing events have been eliminated, whereas the total effect additionally takes the impact of the therapy mediated by competing events into account. Neither of these characterizations is generally preferable over the other: While the direct effect may help to better understand the mechanisms by which the treatment affects the outcome, interventions that eradicate competing events are rare, and thus, the total effect is typically more relevant in practice. We will focus on the estimation of total effects hereafter.

For a fixed time point t within the study time interval \({[0, \tau ]}\), we define the average treatment effect of interest in the entire population as \(\smash {ATE(t) = \mathbb {E}\left( F_1^1(t) - F_1^0(t)\right) }\). The expression \(\smash {F_1^a(t) = P(T^a \le t, \, D^a = 1)}\) refers to the potential cumulative incidence function for cause 1 under treatment \({a \in \{0,1\}}\), applying the counterfactual notation as in Hernán and Robins (2020). Accordingly, \(\smash {F_1^a(t)}\) describes the probability of observing the event of interest until time t, had all study participants received treatment a.

In order to ensure identifiability of ATE, the subsequent assumptions need to be fulfilled (see e.g., Hernán and Robins 2020, Sect. I.3 for a thorough description): Conditional exchangeability holds if there are no unmeasured confounders. For given covariate values, the risk among the treated subjects then equals the risk the untreated subjects would have had, had they been treated, and vice versa. A formal definition of conditional exchangeability requires independence between \(\smash {\mathbbm {1}\{T^a \le \tau , \, D^a = 1\}}\) and A, conditional on \(\varvec{Z}\), for \({a \in \{0,1\}}\). (We use \({\mathbbm {1}\{\cdot \}}\) here and in the following to denote the indicator function.) Furthermore, the positivity assumption applies if the conditional treatment probability \({P(A = 1 \mid \varvec{z})}\) is bounded away from 0 and 1 for covariate values \(\varvec{z}\) on the support of \(\smash {f_{\varvec{Z}}(\varvec{z})}\), so that both therapies \({A = 0}\) and \({A = 1}\) are possible. Lastly, the interventions \({A = 0}\) and \({A = 1}\) need to be well-defined, with \({\mathbbm {1}\{T \le \tau , \, D = 1\}} = \smash {\mathbbm {1}\{T^A \le \tau , D^A = 1\}}\). This condition is referred to as consistency, and it ensures that the observed and potential risks are equal if the actual and counterfactual therapy coincide.

Assuming that exchangeability, positivity and consistency apply and there is no interference between the potential outcomes of distinct individuals, the g-formula yields an estimate of the average treatment effect (Ozenne et al. 2020):

$$\begin{aligned} \widehat{ATE}(t) = \frac{1}{n} \sum _{i=1}^n \left( \hat{F}_1(t \mid A = 1, \varvec{Z_i}) - \hat{F}_1(t \mid A = 0, \varvec{Z_i})\right) . \end{aligned}$$

Here, any assumptions made when modelling \(\smash {\hat{F}_1}\) need to be fulfilled in order to obtain a meaningful estimator. Despite the issues pointed out by Aalen et al. (2015), it is reasonable to derive the cumulative incidence function, and hence \(\smash {\widehat{ATE}}\), from hazard rates; the key point is that the causal interpretation of the effect estimate relies on \(\smash {\hat{F}_1}\) rather than on the hazards themselves. Let \(\smash {\hat{\Lambda }_k(t \mid a, \varvec{z})}\), \({k \in \{1, \dots , K\}}\), be the estimator of the cause-specific, conditional cumulative hazard, and define

$$\begin{aligned} \hat{F}_1(t \mid a, \varvec{z}) = \int _0^t \exp \left( -\sum _{k=1}^K \hat{\Lambda }_k(s \mid a, \varvec{z})\right) \, \text {d}\hat{\Lambda }_1(s \mid a, \varvec{z}), \end{aligned}$$

in line with the characterization proposed by Benichou and Gail (1990). One possibility to obtain \(\smash {\hat{\Lambda }_k(t \mid a, \varvec{z})}\) is to fit a cause-k specific Cox model with covariates A and \(\varvec{Z}\), i.e.,

$$\begin{aligned} \hat{\Lambda }_k(t \mid a, \varvec{z}) = \hat{\Lambda }_{0k}(t) \exp (\hat{\beta }_{kA} a + \hat{\varvec{\beta }}_{\varvec{kZ}}^T \varvec{z}), \end{aligned}$$

with \(\smash {\hat{\varvec{\beta }}_{\varvec{k}} = (\hat{\beta }_{kA}, \hat{\varvec{\beta }}_{\varvec{kZ}}^T)^T}\) representing the estimated vector of regression coefficients. The covariates may in fact vary across the individual causes, since different event types are possibly associated with distinct risk factors, provided that A is included in the model for the cause of interest. The Breslow estimator then yields the following approximation of the cumulative baseline hazard (Breslow 1972):

$$\begin{aligned} \hat{\Lambda }_{0k}(t) = \int _0^t \frac{\text {d}N_k(s)}{\sum _{i=1}^n Y_i(s) \exp (\hat{\beta }_{kA} A_i + \hat{\varvec{\beta }}_{\varvec{kZ}}^T \varvec{Z_i})}{.} \end{aligned}$$

We define the counting process \(\smash {N_k(t)}\) as \(\smash {\sum _{i=1}^n N_{ki}(t)}\) with \(\smash {N_{ki}(t) = \mathbbm {1}\{T_i \!\wedge \! C_i \le t, \, D_i = k\}}\), such that \(\smash {\text {d}N_k(t)}\) represents the increment of \(\smash {N_k(t)}\) over the infinitesimal time interval \({[t, t+dt)}\). The at-risk indicator \(\smash {Y_i(t) = \mathbbm {1}\{T_i \!\wedge \! C_i \ge t\}}\) further specifies whether subject i is part of the risk set just prior to time t.
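To make the estimator concrete, the following minimal R sketch computes \(\smash {\widehat{ATE}(t)}\) from cause-specific Cox models via the g-formula. It assumes a data frame d with columns time, event (0 = censored, 1 = event of interest, 2 = competing event), A and covariates Z1, Z2; the riskRegression functions CSC() and predictRisk() are used as we understand their interface, so argument names may differ across package versions.

library(survival)        # coxph() backend used by CSC()
library(prodlim)         # Hist()
library(riskRegression)  # CSC(), predictRisk()

# cause-specific Cox models for both event types (shared covariates here)
csc <- CSC(Hist(time, event) ~ A + Z1 + Z2, data = d)

# g-formula: predict F_1(t | A = a, Z_i) for every subject under both treatments
ate_hat <- function(t) {
  d1 <- d; d1$A <- 1   # everyone treated
  d0 <- d; d0$A <- 0   # everyone untreated
  F1_1 <- predictRisk(csc, newdata = d1, times = t, cause = 1)
  F1_0 <- predictRisk(csc, newdata = d0, times = t, cause = 1)
  mean(F1_1 - F1_0)    # average over the empirical covariate distribution
}

ate_hat(t = 9)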

3 Confidence intervals and bands

Pointwise confidence intervals and time-simultaneous confidence bands are routinely reported in clinical trials as they help to assess the (un)certainty of an estimate. In a series of studies with underlying average treatment effect ATE, it is expected that \({(1-\alpha ) \cdot 100}\)% of the confidence intervals for ATE(t) at level \({(1-\alpha )}\) include the true average treatment effect at a given time t. Confidence bands extend this concept to time intervals, meaning that \({(1-\alpha ) \cdot 100}\)% of the confidence bands for ATE at level \({(1-\alpha )}\) will cover the true average treatment effect over the entire interval of interest. It is not straightforward to define such confidence regions for ATE, however, due to the complexity of the stochastic process \(\smash {U_n(t) = \sqrt{n} \, (\widehat{ATE}(t) - ATE(t))}\). As a workaround, we aim to approximate the limiting distribution of \(\smash {U_n}\) by means of different resampling approaches.

3.1 Efron’s bootstrap

The most common way to derive confidence intervals for ATE is the classical nonparametric bootstrap (Efron 1981), which does not require knowledge of the true underlying distribution. By repeatedly drawing with replacement from the data and calculating the statistical functional of interest in each of the drawn samples, one approximates the distribution of the functional in the target population. In the given context, we obtain the estimates \(\smash {\{\smash {\widehat{ATE}}^*_b(t)\}_{b \in \{1, \dots , B\}}}\) from B bootstrap samples of the original data, each of size n. An asymptotic confidence interval at level \({(1-\alpha )}\) can, for instance, be determined by setting the empirical \(\smash {\tfrac{\alpha }{2}}\) and \(\smash {(1 - \tfrac{\alpha }{2})}\) quantiles of the bootstrap estimates as limits. Furthermore, we construct an asymptotic simultaneous confidence band over the time interval \(\smash {[t_1, t_2]}\) as

$$\begin{aligned} \left[ \widehat{ATE}(t) - q_{1-\alpha }^{E\!B} \sqrt{\hat{\nu }^{E\!B}(t)}, \ \widehat{ATE}(t) + q_{1-\alpha }^{E\!B} \sqrt{\hat{\nu }^{E\!B}(t)}\right] {,} \end{aligned}$$

with \(\smash {\hat{\nu }^{E\!B}(t)}\) referring to the empirical variance of the bootstrap estimates and \(\smash {q_{1-\alpha }^{E\!B}}\) denoting the \({(1-\alpha )}\) quantile of

$$\begin{aligned} \left\{ \sup _{t \in [t_1, t_2]} \left| \frac{\smash {\widehat{ATE}}^*_b(t) - \tfrac{1}{B} \sum _{\tilde{b}=1}^B \smash {\widehat{ATE}}^*_{\tilde{b}}(t)}{\sqrt{\hat{\nu }^{E\!B}(t)}}\right| \right\} _{b \in \{1, \dots , B\}}. \end{aligned}$$

Note that the absolute value is considered here and in the following to increase the stability of the empirical quantiles.
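The construction above can be sketched in a few lines of R. The code below assumes a user-supplied estimator function ate_hat(data, tgrid) that returns \(\smash {\widehat{ATE}}\) on a grid of time points spanning \(\smash {[t_1, t_2]}\) (for instance the g-formula sketch in Sect. 2); it illustrates the percentile interval and the sup-based band and is not the implementation used for the simulations.

# d: observed data set, tgrid: evaluation times, B: number of bootstrap samples
eb_ci_band <- function(d, tgrid, ate_hat, B = 1000, alpha = 0.05) {
  est  <- ate_hat(d, tgrid)                         # original estimate on tgrid
  boot <- replicate(B, {
    idx <- sample(nrow(d), replace = TRUE)          # resample individuals
    ate_hat(d[idx, ], tgrid)
  })                                                # matrix: length(tgrid) x B
  # pointwise percentile confidence intervals
  ci <- apply(boot, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2))
  # sup-statistic for the simultaneous band
  nu   <- apply(boot, 1, var)                       # empirical variance per time point
  sups <- apply(abs(boot - rowMeans(boot)) / sqrt(nu), 2, max)
  q    <- quantile(sups, probs = 1 - alpha)
  band <- cbind(lower = est - q * sqrt(nu), upper = est + q * sqrt(nu))
  list(est = est, ci = t(ci), band = band)
}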

The classical bootstrap yields asymptotically correct results in many less intricate settings (as long as the considered data are i.i.d.), and its theoretical validity in the given context has been proven by Rühl and Friedrich (2023) based on martingale arguments. While the implementation of Efron's bootstrap is rather simple, the computation time can become excessive for large sample sizes and many bootstrap iterations.

3.2 Influence function

Another method to obtain confidence intervals for ATE has been described by Ozenne et al. (2020). Supposing that the underlying model is correct, the functional delta method yields an approximation of the asymptotic distribution of \(\smash {U_n}\) at a given time point in terms of the influence function of the average treatment effect. More specifically,

$$\begin{aligned} U_n(t) = \frac{1}{\sqrt{n}} \sum _{i=1}^n I\!F(t; \, T_i \!\wedge \! C_i, \! D_i, \! A_i, \! \varvec{Z_i}) + o_P(1) \\ \overset{\mathscr {D}}{\longrightarrow } \ {\mathcal {N}}\left( 0, \int \left( I\!F(t; \, s, \! d, \! a, \! \varvec{z})\right) ^2 \, \text{ d }P(s, \! d, \! a, \! \varvec{z})\right) {,} \end{aligned}$$

as n tends to infinity. Here, P denotes the joint probability distribution of the data \({(T \wedge C, D, A, \varvec{Z})}\). The definition of the influence function \({I\!F}\) according to Ozenne et al. (2017, 2020) can be found in Sect. 1.1 of the supplementary material. Besides, we use \({\mathcal {N}}\) throughout this paper to symbolize the normal distribution. It follows that the plug-in estimator \(\smash {\hat{\nu }^{I\!F}(t) = \tfrac{1}{n} \sum _{i=1}^n \left( \smash {\widehat{I\!F}(t; \, T_i \!\wedge \! C_i, \! D_i, \! A_i, \! \varvec{Z_i}})\right) ^2}\) is consistent for the asymptotic variance of \(\smash {U_n(t)}\), and thus, asymptotic confidence intervals are easy to calculate without the need for resampling. The construction of confidence bands, on the other hand, is more involved, because the dependence between the increments of the process \(\smash {U_n}\) must be taken into account when making inferences concerning multiple time points. It can be shown that \(\smash {U_n}\) converges weakly to a zero-mean Gaussian process on the Skorokhod space \({{\mathcal {D}}[0, \tau ]}\) (Rühl and Friedrich 2023), and thus, we can derive an asymptotic \({(1-\alpha )}\) confidence band for ATE over the interval \(\smash {[t_1, t_2]}\) in line with the resampling approach described by Scheike and Zhang (2008):

$$\begin{aligned} \left[ \widehat{ATE}(t) - q_{1-\alpha }^{I\!F} \sqrt{\hat{\nu }^{I\!F}(t)}, \ \widehat{ATE}(t) + q_{1-\alpha }^{I\!F} \sqrt{\hat{\nu }^{I\!F}(t)}\right] {.} \end{aligned}$$

Here, \(\smash {q_{1-\alpha }^{I\!F}}\) denotes the \({(1-\alpha )}\) quantile of

$$\begin{aligned} \left\{ \sup _{t \in [t_1, t_2]} \left| \sum _{i=1}^n \frac{\widehat{I\!F}(t; \, T_i \!\wedge \! C_i, \! D_i, \! A_i, \! \varvec{Z_i})}{\sqrt{\hat{\nu }^{I\!F}(t)}} \cdot \, G_i^{I\!F; (b)} \right| \right\} _{b \in \{1, \dots , B\}}{,} \end{aligned}$$

for B independent standard normal vectors \(\smash {\{(G_1^{I\!F; (b)}, \dots ,} G_n^{I\!F; (b)})^T\}_{b \in \{1, \dots , B\}}\).

As compared to the classical bootstrap, the influence function approach significantly reduces the computation time, considering that the resampling step builds upon the repeated generation of random multipliers rather than the recalculation of the functional on numerous resampled data sets.
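In practice, the influence function approach is available through the ate() function of the R package riskRegression (cf. Sect. 4). The following minimal sketch reuses the objects from the sketch in Sect. 2 and reflects our understanding of the package interface; argument names may differ between versions.

library(prodlim)
library(riskRegression)

csc <- CSC(Hist(time, event) ~ A + Z1 + Z2, data = d)

# influence function-based standard errors (se = TRUE) and a simultaneous
# confidence band (band = TRUE) from simulated Gaussian multipliers;
# B = 0 means that no classical bootstrap is carried out
fit <- ate(csc, treatment = "A", data = d, times = c(1, 3, 5, 7, 9),
           cause = 1, se = TRUE, band = TRUE, B = 0)

summary(fit)   # risk differences with confidence intervals and bands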

3.3 Wild bootstrap

A third resampling method arises from the fact that the limiting distribution of \(\smash {U_n}\) may be represented in terms of martingales: It can be shown that

$$\begin{aligned} U_n(t) = \sum _{k=1}^K \sum _{i=1}^n&\left( \int _0^t H_{k1i}(s,t) \, \text{ d }M_{ki}(s) \right. \\ {}&\left. + \int _0^\tau H_{k2i}(s,t) \, \text{ d }M_{ki}(s)\right) + o_p(1) {,} \end{aligned}$$

for functions \(\smash {H_{k1i}}\) and \(\smash {H_{k2i}}\) as defined in Sect. 1.2 of the supplementary material and \(\smash {M_{ki}(t) = N_{ki}(t) - \int _0^t Y_i(s) \, \text {d}\Lambda _k}{(s \mid A_i, \varvec{Z_i})}\), \({k \in \{1, \dots , K\}}\), \({i \in \{1, \dots , n\}}\) (Rühl and Friedrich 2023). Note that \(\smash {M_{ki}}\) is a martingale relative to the history \(\smash {\left( \mathscr {F}_t\right) _{t \ge 0}}\) that is generated by the data observed until a given time, i.e., \(\smash {\mathbb {E}\left( \text {d}M_{ki}(t) \mid \mathscr {F}_{t-}\right) = 0}\) and

$$\begin{aligned} \text {Var}\left( \text {d}M_{ki}(t) \mid \mathscr {F}_{t-}\right) = Y_i(t) \, \text {d}\Lambda _k(t \mid A_i, \varvec{Z_i}). \end{aligned}$$

Provided that Aalen’s multiplicative intensity model (Aalen 1978) applies, this expression for the variance equals the conditional expectation of \(\smash {\text {d}N_{ki}(t)}\) given the past \(\smash {\mathscr {F}_{t-}}\). This motivates the general idea of the wild bootstrap: by replacing \(\smash {\text {d}M_{ki}(t)}\) with the product of \(\smash {\text {d}N_{ki}(t)}\) and suitable random multipliers \(\smash {G_i^{W\!B}}\), \({k \in \{1, \dots , K\}}\), \({i \in \{1, \dots , n\}}\), we can approximate the asymptotic distribution of \(\smash {U_n}\). The initial method described by Lin et al. (1993) only covers standard normal multipliers, but was later extended to more general resampling schemes (cf. Beyersmann et al. 2013; Dobler et al. 2017). In Rühl and Friedrich (2023), we followed ideas of Cheng et al. (1998), Beyersmann et al. (2013) and Dobler et al. (2017) to formally prove that, conditional on the data, the wild bootstrap estimator of \(\smash {U_n}\),

$$\begin{aligned} \hat{U}_n(t) = \sum _{k=1}^K \sum _{i=1}^n&\left( \hat{H}_{k1i}(T_i \!\wedge \! C_i, t) \, N_{ki}(t) G_i^{W\!B} \right. \\ {}&\left. + \hat{H}_{k2i}(T_i \!\wedge \! C_i, t) \, N_{ki}(\tau ) G_i^{W\!B} \right) {,} \end{aligned}$$

converges weakly to the same process as \(\smash {U_n}\) on \({{\mathcal {D}}[0, \tau ]}\). (Here, the estimates \(\smash {\hat{H}_{k1i}}\) and \(\smash {\hat{H}_{k2i}}\) are calculated by plugging appropriate sample estimates into the definition of \(\smash {H_{k1i}}\) and \(\smash {H_{k2i}}\).)

Remark 1

The following choices of multipliers \(\smash {G_i^{W\!B}}\) fulfill the necessary conditions for the wild bootstrap (cf. Dobler et al., 2017):

  • \(\smash {G_i^{W\!B} \overset{\text {i.i.d.}}{\sim } {\mathcal {N}}(0,1)}\), i.e., independent standard normal multipliers (according to the original resampling approach by Lin et al., 1993);

  • \(\smash {G_i^{W\!B} \overset{\text {i.i.d.}}{\sim } {\mathcal {P}}ois(1) - 1}\), that is, independent and centered unit Poisson multipliers (in line with the proposition of Beyersmann et al., 2013);

  • \(\smash {G_i^{W\!B} \sim {\mathcal {B}}in\left( Y(T_i \!\wedge \! C_i), \frac{1}{Y(T_i \wedge C_i)}\right) - 1}\) with \(Y(t) =\) \(\smash { \sum _{i=1}^n Y_i(t)}\) and \(\smash {(G_{i_1}^{W\!B} \perp \!\!\! \perp G_{i_2}^{W\!B}) \mid \mathscr {F}_{\tau }}\) for \(\smash {i_1 \ne i_2}\), i.e., conditionally independent, centered binomial multipliers. This version of the wild bootstrap is equivalent to the so-called weird bootstrap described in Andersen, Borgan, Gill, and Keiding (1993, Subsect. IV.1.4), as Dobler et al. (2017) illustrate.
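For illustration, the three multiplier choices could be generated in R as in the following minimal sketch, where Y_obs denotes the vector of risk-set sizes \(\smash {Y(T_i \!\wedge \! C_i)}\) evaluated at the subjects' observed times:

# n: sample size; Y_obs: risk-set sizes Y(T_i ^ C_i) at the observed times
wb_multipliers <- function(n, Y_obs, type = c("normal", "poisson", "weird")) {
  type <- match.arg(type)
  switch(type,
    normal  = rnorm(n),                                      # Lin et al. (1993)
    poisson = rpois(n, lambda = 1) - 1,                      # centered Poisson(1)
    weird   = rbinom(n, size = Y_obs, prob = 1 / Y_obs) - 1  # weird bootstrap
  )
}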

For a set of multiplier realizations \(\smash {\{(G_1^{W\!B; (b)}, \dots ,} {G_n^{W\!B; (b)})^T\}_{b \in \{1, \dots , B\}}}\), one obtains the asymptotic \({(1 - \alpha )}\) confidence interval

$$\begin{aligned} \left[ \widehat{ATE}(t) - \frac{1}{\sqrt{n}} \, q_{1 - \alpha }^{W\!B}(t), \ \widehat{ATE}(t) + \frac{1}{\sqrt{n}} \, q_{1 - \alpha }^{W\!B}(t)\right] {,} \end{aligned}$$

with \({(1 - \alpha )}\) quantile \(\smash {q_{1 - \alpha }^{W\!B}(t)}\) of \(\smash {\{\left| \smash {\hat{U}_n^{(b)}(t)}\right| \}_{b \in \{1, \dots , B\}}}\). Similarly, an asymptotic simultaneous \({(1 - \alpha )}\) confidence band over the interval \(\smash {[t_1, t_2]}\) is specified by considering the empirical variance estimator \(\smash {\hat{\nu }^{W\!B}(t)}\) of \(\smash {\{\hat{U}_n^{(b)}(t)\}_{b \in \{1, \dots , B\}}}\) and the \({(1-\alpha )}\) quantile \(\smash {q_{1-\alpha }^{W\!B}}\) of

$$\begin{aligned} \left\{ \sup _{t \in [t_1, t_2]} \left| \frac{\hat{U}_n^{(b)}(t)}{\sqrt{\hat{\nu }^{W\!B}(t)}}\right| \right\} _{b \in \{1, \dots , B\}}. \end{aligned}$$
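Given a matrix of wild bootstrap replicates \(\smash {\hat{U}_n^{(b)}(t)}\) evaluated on a time grid, the interval and band can be assembled as sketched below. The explicit band form \(\smash {\widehat{ATE}(t) \pm q_{1-\alpha }^{W\!B} \sqrt{\hat{\nu }^{W\!B}(t)} / \sqrt{n}}\) used in the code is our reading of the construction above (paralleling Sects. 3.1 and 3.2 and using that \(\smash {\hat{U}_n}\) approximates \(\smash {\sqrt{n}\,(\widehat{ATE} - ATE)}\)).

# est:   vector of ATE estimates on the time grid
# U_hat: matrix of wild bootstrap replicates, length(tgrid) x B
# n:     sample size
wb_ci_band <- function(est, U_hat, n, alpha = 0.05) {
  # pointwise intervals: quantile of |U_hat^(b)(t)| at each time point
  q_t <- apply(abs(U_hat), 1, quantile, probs = 1 - alpha)
  ci  <- cbind(lower = est - q_t / sqrt(n), upper = est + q_t / sqrt(n))
  # simultaneous band: standardized sup-statistic across the grid
  nu   <- apply(U_hat, 1, var)
  sups <- apply(abs(U_hat) / sqrt(nu), 2, max)
  q    <- quantile(sups, probs = 1 - alpha)
  band <- cbind(lower = est - q * sqrt(nu) / sqrt(n),
                upper = est + q * sqrt(nu) / sqrt(n))
  list(ci = ci, band = band)
}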

The described bootstrap, just like the approach based on the influence function, takes only a fraction of the time required by the classical bootstrap. In addition, martingale-based analysis approaches for time-to-event data are built upon the condition of independent right-censoring and do not rely on a strict i.i.d. setup (Andersen et al., 1993, Subsect. III.2.2). Therefore, they are less sensitive to dependencies inherent in the data, where Efron’s approach is known to fail (Rühl et al. 2022; see also Singh, 1981; Friedrich et al., 2017).

4 Simulation study

In order to compare the performance of the resampling approaches described in Sect. 3, we simulated competing risks data following the same scheme as in Ozenne et al. (2020), and constructed confidence intervals and bands using the proposed methods.

4.1 Data generation

The generated data comprised twelve independent covariates, namely, \(\smash {Z_1, \dots , Z_6}\) following a mean-zero normal distribution and \(\smash {Z_7, \dots , Z_{12}}\) being Bernoulli distributed with parameter 0.5. Each covariate affected the treatment probability, the event time distributions of two competing failure causes and a conditionally independent censoring time in an individual manner (see Table 1 and Fig. S1 in the supplementary material for a directed acyclic graph). The treatment indicator A, for instance, was derived from a logistic regression model with linear predictor \(\smash {\alpha _0 + \log (2) \cdot \left( Z_1 - Z_2 + Z_6 + Z_7 - Z_8 + Z_{12}\right) }\). Here, the intercept \(\smash {\alpha _0}\) controls the overall frequency of treatment. Apart from that, we simulated the event time based on a multi-state model with Weibull hazards \(\smash {\lambda _d(t) = 0.02 \, t \exp \left( \beta _{dA} A + \varvec{\beta }^T_{\varvec{dZ}} \varvec{Z}\right) }\) for \(\smash {\varvec{Z} = (Z_1, \dots , Z_{12})}\) and corresponding parameters \(\smash {\beta _{dA}}\) and \(\smash {\varvec{\beta _{dZ}}}\), \({d \in \{1, 2\}}\) (cf. Beyersmann, Latouche et al. 2009). The censoring time was generated independently with hazard \(\smash {\lambda _C(t) = \tfrac{2}{\gamma } \, t \exp \left( \varvec{\beta }^T_{\varvec{CZ}} \varvec{Z}\right) }\), where \(\gamma \) determines the intensity of censoring.

Table 1 Effects of the covariates on the treatment probability, event and censoring times
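For concreteness, a sketch of this data-generating mechanism in R is given below. The coefficient values of Table 1 and the censoring parameter \(\gamma \) are not reproduced here, so the vectors beta1Z, beta2Z, betaCZ and the default gamma are placeholders; event times are drawn by inverting the all-cause cumulative hazard and assigning the cause with probability proportional to the cause-specific hazards (cf. Beyersmann, Latouche et al. 2009).

simulate_crisk <- function(n, alpha0 = 0, beta1A = 0, beta2A = 0, gamma = 100) {
  ## covariates: Z1-Z6 standard normal, Z7-Z12 Bernoulli(0.5)
  Z <- cbind(matrix(rnorm(n * 6), n, 6), matrix(rbinom(n * 6, 1, 0.5), n, 6))
  colnames(Z) <- paste0("Z", 1:12)

  ## treatment from the logistic model given in the text
  lp_A <- alpha0 + log(2) * (Z[, 1] - Z[, 2] + Z[, 6] + Z[, 7] - Z[, 8] + Z[, 12])
  A <- rbinom(n, 1, plogis(lp_A))

  ## placeholder covariate effects (the actual values of Table 1 are not shown here)
  beta1Z <- c(rep(log(2), 3), rep(0, 9))
  beta2Z <- c(rep(0, 3), rep(log(2), 3), rep(0, 6))
  betaCZ <- rep(0, 12)

  eta1 <- exp(beta1A * A + drop(Z %*% beta1Z))  # cause-1 relative hazard
  eta2 <- exp(beta2A * A + drop(Z %*% beta2Z))  # cause-2 relative hazard

  ## event time: invert the all-cause cumulative hazard 0.01 * t^2 * (eta1 + eta2)
  T_ev <- sqrt(-log(runif(n)) / (0.01 * (eta1 + eta2)))
  ## cause 1 with probability eta1 / (eta1 + eta2) (both hazards share the factor 0.02 * t)
  cause <- ifelse(runif(n) < eta1 / (eta1 + eta2), 1, 2)

  ## censoring: hazard (2 / gamma) * t * exp(betaCZ' Z), cumulative (1 / gamma) * t^2 * exp(.)
  C_cens <- sqrt(-gamma * log(runif(n)) / exp(drop(Z %*% betaCZ)))

  data.frame(time = pmin(T_ev, C_cens),
             event = ifelse(T_ev <= C_cens, cause, 0),
             A = A, Z)
}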

This general simulation scheme served as a basis for a variety of scenarios, each implemented with sample sizes of \({n \in \{50, 75, 100, 200, 300\}}\) and treatment effects according to parameter \(\smash {\beta _{1A} \in \{-2,0,2\}}\). By default, about half of the observations were assigned to be treated, and the event of interest was observed in a third, half or two thirds of the subjects until time \({t = 9}\), corresponding to the case where \(\smash {\beta _{1A} = -2, \, 0, \, 2}\), respectively. The frequency of censoring amounted to 17%, 14% or 11% by \({t = 9}\), whereas the competing event affected 41%, 31% or 21% of the subjects.

Among the examined scenarios were settings with varying degrees of censoring (namely, 0%, 14% and 30% in the case without treatment effect, i.e., \(\smash {\beta _{1A}=0}\)), treatment frequencies of 22% as well as 86%, and non-unit variances (0.25 and 4, respectively) of the normally distributed covariates \(\smash {Z_1, \ldots , Z_6}\). Besides, we considered a standard survival scenario without competing events that involved type II censoring with staggered entry in order to investigate a setting with independent, but not random censoring (Rühl et al. 2022). For an overview of the different scenarios, see Table 2.

Table 2 Overview of the simulation scenarios

Confidence intervals (at time points \({t \in \{1,3,5,7,9\}}\)) and bands (over the time interval [0, 9]) for the average treatment effect were derived by applying Efron’s bootstrap (EBS), the influence function approach (IF) and the wild bootstrap (WBS) to each generated data set, using 1000 resampling replications for each method. The WBS was realized with standard normal, Poisson and binomial multipliers according to Remark 1. We then assessed the performance of the distinct methods by means of the empirical coverage probabilities of the nominal 95% confidence regions and the widths of the confidence ranges. The simulations were repeated 5000 times for each scenario to keep the Monte Carlo standard error for the coverage below 0.75%.

We approximated the true average treatment effect in the mentioned scenarios empirically, as the analytic form of ATE(t) is hard to evaluate in the presence of multiple covariates. For that purpose, we simulated 1000 data sets with sample size \({n = 100,000}\) as previously described, but with random treatment assignment independent of the covariates and no censoring. For each of these data sets, the difference \(\smash {\hat{F}_1(t \mid A = 1)} - \smash {\hat{F}_1(t \mid A = 0)}\) was determined, and our final estimate of the true average treatment effect is the median of the 1000 resulting values. Because of the large sample sizes considered, this approximation should be fairly close to the true value. Figure 1 depicts the approximated average treatment effect except for the scenarios with non-unit variance of the covariates \(\smash {Z_1, \dots , Z_6}\) and those with type II censoring.
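As a sketch of this approximation step, assume a hypothetical helper generate_uncensored(n) that implements the scheme of Sect. 4.1 with A drawn independently of the covariates and the censoring step removed; since there is no censoring, the cumulative incidences can then be computed as simple empirical proportions, which coincide with the Aalen-Johansen estimates in this case.

approx_true_ate <- function(tgrid, n = 1e5) {
  d <- generate_uncensored(n)   # hypothetical: randomized treatment, no censoring
  sapply(tgrid, function(t) {
    ## empirical cause-1 cumulative incidence per arm, evaluated at time t
    mean(d$time[d$A == 1] <= t & d$event[d$A == 1] == 1) -
      mean(d$time[d$A == 0] <= t & d$event[d$A == 0] == 1)
  })
}

## median over 1000 repetitions, as described above
## true_ate <- apply(replicate(1000, approx_true_ate(c(1, 3, 5, 7, 9))), 1, median)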

Fig. 1 Approximation of the true average treatment effect

4.2 Results

The WBS attained coverage probabilities of the pointwise confidence intervals that were, overall, the closest to the target level of 95%. The mean absolute deviation across all scenarios, sample sizes and time points was 2.42% for the WBS vs. 2.49% and 2.61% for the IF and the EBS, respectively. (See Sects. 2.2 and 2.4 in the supplementary material for the coverage probabilities in the scenarios not presented here as well as the corresponding Monte Carlo standard errors.) Throughout nearly all settings, the confidence intervals obtained by the EBS yielded coverages above those derived from the different WBS versions, whereas the IF intervals included the true average treatment effect the least frequently. Figure 2 illustrates this ranking in the case with light censoring and a positive average treatment effect (i.e., \(\smash {\beta _{1A}=2}\), referring here and in the following to the sign of the causal risk difference; that is, a positive average treatment effect indicates that the potential cumulative incidence under treatment is higher than that under no treatment). We observed similar outcomes in the other scenarios that involved treatment effects according to \(\smash {\beta _{1A} \in \{0,2\}}\) (see Figures S3, S5, S7, S8, S10, S13, S15, and S16 in the supplementary material), even though the performance of the resampling methods varied for early analysis time points (see e.g., Fig. 3).

Fig. 2 Coverage of the confidence intervals in the scenario with light censoring (11% censored observations) and a positive average treatment effect (\(\smash {\beta _{1A} = 2}\))

Fig. 3 Coverage of the confidence intervals in the scenario with high treatment probability (86% treated observations) and no treatment effect (\(\smash {\beta _{1A} = 0}\))

An exception was the setting with widely dispersed covariates: Here, all methods provided rather conservative confidence intervals, and as a consequence, the IF approach achieved the most accurate coverages (see Figs. S18 and S19 in the supplementary material). The same effect was also encountered in the scenarios with a negative treatment effect (\(\smash {\beta _{1A} = -2}\), see e.g., Fig. 4), once again excluding the setting with high variance of the covariates (where the EBS performed best for larger sample sizes, see Fig. S17 in the supplementary material). A common feature of all the schemes that yielded coverages along the lines of Fig. 4 is that the proportion of observed type 1 events was lower than in the scenarios with \(\smash {\beta _{1A} \in \{0,2\}}\). This is due to the prevalence of the competing event, and the IF approach seems to be slightly better suited to cope with this condition than the bootstrap methods.

Fig. 4 Coverage of the confidence intervals in the scenario with no censoring (0% censored observations) and a negative average treatment effect (\(\smash {\beta _{1A} = -2}\))

On the other hand, the IF yielded fairly low coverage probabilities in several settings without treatment effect (see Fig. 3 and Figs. S2, S5, S7, S10, S15, and S21 in the supplementary material). This issue persists with increasing sample sizes. Ozenne et al. (2020) encountered a similar pattern and considered a non-robust version of the influence function-based variance, which performed somewhat better.

The WBS generally reached its full potential towards later time points, when a sufficient amount of data was available. This became apparent in the scenario with type II censoring and a positive average treatment effect: Because of the absence of any competing events, we evaluated the confidence intervals at earlier times \({t \in \{0.5, 1, 1.5, 2, 2.5\}}\), and the WBS did not reach coverages as close to 95% as those obtained by the IF and the EBS until \({t = 2}\) (see Fig. S22 in the supplementary material). To explain this observation, note that the wild bootstrap process \(\smash {\hat{U}_n(t)}\) is based on the products \(\smash {N_{ki}(t) \, G_i^{WB}}\), for \({i \in \{1, \dots , n\}}\), \({k \in \{1, \dots , K\}}\). At early time points, the counting processes \(\smash {N_{ki}}\) have rarely jumped yet, and chances are that the few corresponding multipliers \(\smash {G_i^{WB}}\) do not reflect the target distribution very well. Towards later times, however, a higher number of multipliers is taken into account, so the distribution of \(\smash {\text {d}N_{ki}(\cdot ) \, G_i^{WB}}\) will be closer to that of the martingale increments.

Contrary to our expectations, the simulations revealed no clear superiority of the martingale-based methods in the case of type II censoring with staggered entry, despite the non-random censoring. It appears that the dependence within the data was too weak for the sample sizes considered (cf. Rühl et al., 2022).

The coverage probabilities of the time-simultaneous confidence bands followed a similar trend as observed for the pointwise intervals (see Sects. 2.3 and 2.4 in the supplementary material): While the highest and lowest coverages in almost all scenarios with positive or no average treatment effect were attained by the EBS and the IF, respectively, there were only small differences in most of the settings with \(\smash {\beta _{1A} = -2}\). The EBS bands were especially accurate given positive average treatment effects (\(\smash {\beta _{1A} = 2}\), see e.g., Fig. 5): the mean absolute discrepancy between the simulated coverages and the nominal level of 95% amounted to 4.75% for the EBS, in comparison to 5.53% and 5.70% for the WBS and the IF approach, respectively.

Fig. 5 Coverage of the confidence bands in the scenario with high treatment probability (86% treated observations) and a positive average treatment effect (\(\smash {\beta _{1A} = 2}\))

Our results further imply that the choice of the multipliers for the WBS does not have any substantial impact. Since the confidence intervals derived using the approaches of Lin et al. (1993) and Beyersmann et al. (2013) were occasionally wider than those resulting from the weird bootstrap, the latter method yielded lower coverages. Which of the multipliers provided the most accurate outcomes varied depending on the situation, however.

Other than that, the IF produced narrower intervals than any of the WBS versions, and in the case of a negative average treatment effect, either approach led to considerably greater variation in the interval width compared with the scenarios where \(\smash {\beta _{1A} \in \{0, 2\}}\). Interestingly, this effect did not apply to the EBS. The widths of the EBS-based intervals ranged between or above the remaining widths, apart from the settings with \(\smash {\beta _{1A} = -2}\). As the sample sizes increased, however, all resampling methods led to nearly equally wide confidence intervals (cf. Fig. 6).

Fig. 6 Width of the confidence intervals at time \({t = 5}\) in the scenario with no censoring (0% censored observations) and a positive average treatment effect (\(\smash {\beta _{1A} = 2}\)); note the spacing of the x-axis

The widths of the confidence bands furthermore related to one another in the same way as their pointwise counterparts.

Due to the small sample sizes we considered, the number of observed events occasionally did not suffice to achieve convergence when the cause-specific Cox models were fitted. This is why some of the coverage probabilities are based on fewer than 5000 iterations for the influence function approach as well as the wild bootstrap, and fewer than 1000 bootstrap samples when using Efron’s approach. (The frequency of the convergence issues is shown in Table S10 in the supplementary material.) Results for the settings with \({n = 50}\) and \(\smash {\beta _{1A} = 2}\) should hence be interpreted with care.

Finally, a note is in order about the computation times of the distinct methods: The IF and EBS approaches have been implemented in the function ‘ate’ of the R (R Core Team, 2021) package riskRegression by Gerds and Kattan (2021) (see Sect. 2.1.2 in the supplementary material for more information on the software we used). The calculations are sped up considerably by interfacing C++ code for the IF method and by parallelizing the computation of the bootstrap replicates for the EBS. We extracted and adapted the parts of the code that were relevant for our purposes. In addition, C++ code was also integrated to implement the WBS. The simulations were run on a high-performance computing cluster that operates on 2.4 GHz Intel® processors with 128 GB RAM, where we used 16 cores for parallel computations. Figure 7 summarizes the resulting execution times for each resampling method.

Fig. 7 Computation times in the scenario with no censoring (0% censored observations) and a positive average treatment effect (\(\smash {\beta _{1A} = 2}\)). The height of the bars illustrates the mean computation time; note the spacing of the x-axis

Clearly, the EBS is several times slower than the multiplier-based methods. In practice, the IF approach and the WBS can therefore be run with a larger number of resampling repetitions, so that the accuracy of the resulting confidence regions is expected to be higher.

Table 3 Summary of the Hodgkin’s disease data

5 Real data application

To illustrate the performance of the resampling approaches when applied to real-world study data, we considered records of the long-term disease progression among patients with early-stage Hodgkin’s lymphoma (i.e., stage I or II) (Pintilie 2006). These data are available within the R package randomForestSRC (data ’hd’, Ishwaran and Kogalur 2022) and comprise information on 865 subjects who were treated at the Princess Margaret Hospital in Toronto between 1968 and 1986, either with radiation alone (\({n = 616}\)) or a combination of radiation and chemotherapy (\({n = 249}\)). We studied the time (in years) from diagnosis until the competing events of relapse and death, respectively. Random values of very small magnitude (i.e., normally distributed variables with mean zero and variance \(\smash {10^{-6}}\)) were added to the event times in order to break any ties in the data that emerged due to rounding. The recorded covariates include age, sex, clinical stage of the lymphoma, size of mediastinum involvement and whether the disease was extranodal (see Table 3 for a summary of the data). For our analysis, we assume that these variables are sufficient for confounding adjustment, and that the positivity and consistency conditions are met w.r.t. the two therapies. Moreover, tests on the scaled Schoenfeld residuals of the Cox models for both causes did not suggest any violations of the proportional hazards assumption, apart from the variable age in the relapse model (Grambsch and Therneau 1994; see Figs. S46 and S47 in the supplementary material); the estimated coefficient in a corresponding model with a time-dependent covariate is nearly constant over time, though. We thus use simple Cox models (with time-constant covariates) to derive the average treatment effect.
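The steps of this analysis can be sketched in R as follows. The variable names of the ‘hd’ data (e.g., time, status, trtgiven, clinstg, medwidsi, extranod) are taken from Pintilie (2006) and should be checked against the package documentation, and the CSC()/ate() calls again reflect our understanding of the riskRegression interface.

library(survival)        # cox.zph()
library(prodlim)         # Hist()
library(riskRegression)  # CSC(), ate()

data(hd, package = "randomForestSRC")

## break ties that emerged due to rounding (variance 1e-6, i.e., sd 1e-3)
set.seed(1)
hd$time <- hd$time + rnorm(nrow(hd), mean = 0, sd = 1e-3)

## cause-specific Cox models for relapse (status 1) and death (status 2);
## trtgiven is assumed to be a two-level factor (radiation vs. radiation + chemotherapy)
csc <- CSC(Hist(time, status) ~ trtgiven + age + sex + clinstg + medwidsi + extranod,
           data = hd)

## proportional hazards checks via scaled Schoenfeld residuals, one model per cause
## (the $models component is assumed to hold the fitted coxph objects)
lapply(csc$models, cox.zph)

## average treatment effect on the risk of relapse (cause 1) over 30 years,
## with influence function-based confidence intervals and bands
ate(csc, treatment = "trtgiven", data = hd, times = c(5, 10, 15, 20, 25, 30),
    cause = 1, se = TRUE, band = TRUE)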

Our analysis suggests that after 30 years, the risk of relapse would be reduced by 17.89 percentage points in a hypothetical setting where every subject had been treated with both radiation and chemotherapy, as compared to the case where everyone had received radiation therapy only (see Fig. S48 in the supplementary material). At the same time, the risk of death would be 9.49 percentage points higher under the combined therapy (see Fig. 8). Note how the ATE concerning relapse drops rather sharply within the first 5 years, whereas the ATE w.r.t. death increases gradually over the entire 30-year interval. In conclusion, treatment with the combined therapy seems to effectively prevent relapse in the studied population, but since we consider competing causes, a decrease in relapse events will leave more subjects who die without prior relapse.

In Fig. 8, it can be seen that all resampling methods lead to fairly similar confidence intervals concerning the effect on death. Yet the EBS confidence bands are notably wider than those derived from the remaining approaches.

Fig. 8 Confidence intervals (left) and bands (right) for the average treatment effect on the risk of death

On the other hand, relapse events are observed more than twice as often as deaths, which is why the corresponding confidence intervals and bands obtained by the different methods agree more closely.

6 Discussion

The article at hand compares three resampling methods for the derivation of confidence intervals and bands for the average treatment effect in competing risks settings (although the influence function-based confidence intervals, strictly speaking, do not rely on resampling). As our simulations show, the wild bootstrap yields correct coverage levels for pointwise confidence intervals even for rather small data sets, provided that sufficiently many events have been observed until the considered time point. This applies regardless of the type of multiplier that is implemented (i.e., standard normal, centered Poisson, or weird bootstrap multipliers). The theory behind the wild bootstrap relies on martingales and therefore accommodates counting processes, which are naturally used to represent time-to-event data. As a consequence, it is straightforward to tackle common issues in survival analysis, such as left-truncation. (Note the controversy about left-truncation in causal contexts, though, cf. Hernán, 2015; Vandenbroucke and Pearce 2015.) If competing events prevail (as was the case in the scenarios with \(\smash {\beta _{1A} = -2}\) in our simulation study), one may prefer the influence function approach (or a non-robust version, as proposed by Ozenne et al. 2020, if the treatment is unlikely to have any effect), and if earlier time points are examined, the classical bootstrap seems to be a reasonable choice. The latter also achieves very accurate coverages with respect to time-simultaneous confidence bands. As the amount of available data increases, the differences between the distinct resampling approaches fade. Efron’s simple bootstrap, which is most commonly used in practice, requires considerable computation time, however. What is more, dependencies within the data might cause issues with this resampling method (Singh 1981; Friedrich et al. 2017; Rühl et al. 2022), even though our simulations did not disclose any major bias in this context.

The three covered approaches were additionally compared given real data on the long-term risk of relapse and death among patients with early-stage Hodgkin’s disease (Pintilie 2006). While the outcomes are generally quite similar, Efron’s bootstrap generated somewhat wider confidence bands for the average treatment effect on the risk of death.

It should be noted that for consistent estimation of the average treatment effect, the model for the cumulative incidence function must be correctly specified. Instead of the cause-specific Cox model used here, one might employ alternatives such as the nonparametric additive hazards model proposed by Aalen (1980) (cf. Ryalen et al. 2018), or the Fine-Gray regression model for \(\smash {F_1(t \mid a, z)}\) adopting the subdistribution approach (see Rudolph et al. 2020 or the more technical discourse by Young et al., 2020 for a discussion on cause-specific vs. subdistribution measures in causal frameworks). In the latter case, however, additional considerations on the associated stochastic process are necessary to make inferences on \(\smash {\widehat{ATE}}\).

We did not address estimators based on inverse probability of treatment weighting (IPTW, which requires correct specification of a treatment model rather than the outcome model) or the doubly robust version combining the g-formula and IPTW. This is because one would need to derive the asymptotic distributions of the corresponding processes in order to justify the application of any resampling methods, which is beyond the scope of this work. So far, only the representation of these processes in terms of the influence function has been derived; see Ozenne et al. (2020) for more details.

In order to handle the complex conditions that are often observed in real-world trials with time-varying treatments, a possible subject of future work is the extension of the investigated resampling methods to settings that involve time-dependent confounding. The standard time-dependent Cox analysis has been shown to yield incorrect results in such settings (Hernán et al. 2000), which is why it is important to incorporate appropriate models (see e.g., Keogh et al. 2023).