Resampling-based confidence intervals and bands for the average treatment effect in observational studies with competing risks

The g-formula can be used to estimate the treatment effect while accounting for confounding bias in observational studies. With regard to time-to-event endpoints, possibly subject to competing risks, the construction of valid pointwise confidence intervals and time-simultaneous confidence bands for the causal risk difference is complicated, however. A convenient solution is to approximate the asymptotic distribution of the corresponding stochastic process by means of resampling approaches. In this paper, we consider three different resampling methods, namely the classical nonparametric bootstrap, the influence function equipped with a resampling approach as well as a martingale-based bootstrap version, the so-called wild bootstrap. For the latter, three sub-versions based on differing distributions of the underlying random multipliers are examined. We set up a simulation study to compare the accuracy of the different techniques, which reveals that the wild bootstrap should in general be preferred if the sample size is moderate and sufficient data on the event of interest have been accrued. For illustration, the resampling methods are further applied to data on the long-term survival in patients with early-stage Hodgkin’s disease.


INTRODUCTION
Causal inference provides tools to compare treatment strategies in studies that do not permit random allocation of subjects to therapy groups, e.g. for ethical reasons or simply because it is not feasible.Special analysis methods are necessary because in non-randomized trials, risk factors are likely to be distributed unequally across treatment groups and as a consequence, direct comparisons will lead to bias (Yang et al., 2010;Nørgaard et al., 2017).The idea of the counterfactual approach to causal inference is to eliminate this bias by modeling the mean outcome in a hypothetical world where all participants of the study are exposed to the same intervention -possibly 'counter to the fact', i.e. contrary to the treatment they actually received.Causal conclusions can then be drawn by contrasting the obtained estimates for the treatment levels of interest (Rubin, 1974;Hernán and Robins, 2020, section I.1).In case of time-to-event endpoints, statisticians need to take additional difficulties into account, however, as the analysis of right-censored data requires particular techniques.The hazard ratio, which is the common measure of the treatment effect for time-to-event data, comes along with several issues when the aim is to draw causal inferences: In the first place, it is non-collapsible.Thus, the causal effect estimate may differ from a conditional estimate that is adjusted for further variables even if these variables are no confounders (Martinussen and Vansteelandt, 2013).A related drawback is selection bias, which has e.g.been described by Aalen et al. (2015).Apart from that, the hazard ratio -as a single value -fails to convey potentially time-varying effects and also depends on the duration of the study (Hernán, 2010).
We therefore consider the risk difference in terms of the cumulative incidence function as effect measure instead.This way, a competing risks framework is accommodated on top, which covers the standard survival setting as a special case.Examples of observational studies that compare treatment effects using the cumulative incidence function include Philipps et al. (2020); Butt et al. (2021); Chauhan et al. (2022).Besides the estimated average treatment effect, researchers are often also interested in further statistical inference.The stochastic process associated with the estimated cumulative incidence function is rather complex, making it difficult to derive exact confidence intervals and bands, though.A commonly applied remedy is the classical nonparametric bootstrap proposed by Efron (1981) (cf. Stensrud et al., 2020;Ryalen et al., 2020;Neumann and Billionnet, 2016), even though this resampling method is not optimal in several situations, e.g. when the data lack independence (Singh, 1981;Friedrich et al., 2017).Ozenne et al. (2020) presented an alternative approach based on the influence function, and as counting processes are inherent to time-to-event analysis, resampling methods relying on martingale theory further suggest themselves.
In this paper, we illustrate that apart from the method proposed by Ozenne et al. (2020), the classical bootstrap as well as the martingale-based wild bootstrap also accurately approximate the distribution of the stochastic process at hand.We compare the performance of these resampling approaches in terms of the resulting confidence intervals and bands by means of simulations as well as an applied data example recording the long-term outcomes of early-stage Hodgkin's disease patients.The remainder of this manuscript is organized as follows.Section 2 establishes the setting and notation as well as the causal estimator for the average treatment effect.In Section 3, we introduce the three mentioned resampling approaches.The simulation study and the analysis of the Hodgkin's disease data are presented in Sections 4 and 5. Finally, the paper concludes with a discussion.

AVERAGE TREATMENT EFFECT FOR RIGHT-CENSORED DATA WITH COMPETING RISKS
We consider a competing risks setting with K failure types.Let the absolutely continuous random variables T and C denote an individual's event and censoring time, respectively.The observed data include T ∧ C, the minimum of T and C, as well as an indicator D ∈ {0, 1, . . ., K}, which represents the type of failure.W.l.o.g., let D = 1 imply that a subject experienced the event of interest.If D = 0, the event time is censored, i.e.C < T .Besides, we observe a binary treatment indicator A and a bounded, p-dimensional vector Z of baseline covariates.Throughout this paper, suppose that the data sample {(T i ∧C i , D i , A i , Z i )} i∈{1,...,n} is independent and identically distributed (i.i.d.), and does not include any tied event times.It is further assumed that T i and C i are conditionally independent given (A i , Z i ).
For a fixed time point t within the study time interval [0, τ ], we define the average treatment effect of interest as The expression F a 1 (t) = P (T a ≤ t, D a = 1) refers to the potential cumulative incidence function for cause 1 under treatment a, applying the counterfactual notation as in Hernán and Robins (2020).If the conditions of exchangeability, positivity and consistency along with a well-defined intervention and no interference are fulfilled (see e.g.Hernán and Robins, 2020, section I.3 for a thorough description of these assumptions), the g-formula yields an estimate of the average treatment effect (Ozenne et al., 2020): Despite the issues pointed out by Aalen et al. (2015), it is reasonable to derive the cumulative incidence function -and hence AT E -from hazard rates; the key point is that the causal interpretation of the effect estimate relies on F1 .Let therefore Λk (t | a, z), k ∈ {1, . . ., K} be the estimator of the cause-specific, conditional cumulative hazard, and define in line with the characterization proposed by Benichou and Gail (1990).One possibility to obtain Λk (t | a, z) is to fit a cause-k specific Cox model with covariates A and Z, i.e.
Λk (t | a, z) = Λ0k (t) exp( βkA a + βT kZ z), with βk = ( βkA , βT kZ ) T representing the estimated vector of regression coefficients.In fact, the covariates may vary for different causes as long as A is included in the model for the cause of interest.The Breslow estimator eventually yields the approximation of the cumulative baseline hazard (Breslow, 1972).Letting 1{•} denote the indicator function, the counting process N k (t) is defined as whether subject i is part of the risk set just prior to time t.

CONFIDENCE INTERVALS AND BANDS
Pointwise confidence intervals and time-simultaneous confidence bands are routinely reported in clinical trials as they help to assess the (un)certainty of an estimate.It is not straightforward to define such intervals for the average treatment effect, however, due to the complexity of the stochastic process As a workaround, we aim to approximate the limiting distribution of U n by means of different resampling approaches.

Efron's bootstrap
The most common way to derive confidence intervals for AT E is the use of the classical nonparametric bootstrap (Efron, 1981), which does not require knowledge of the true underlying distribution.By repeatedly drawing with replacement from the data and calculating a statistical functional of interest in each of the drawn samples, one tries to approach the distribution of the functional in the target population.In the given context, we obtain the estimates { AT E * b (t)} b∈{1,...,B} from B bootstrap samples of the original data, each having size n.A confidence interval at level (1 − α) can, for instance, be determined by setting the empirical α 2 and (1 − α 2 ) quantiles of the bootstrap estimates as limits.Furthermore, we construct a simultaneous confidence band over the time interval with νEB (t) referring to the empirical variance of the bootstrap estimates and q EB 1−α denoting the (1 − α) quantile of sup This approach yields asymptotically correct results in many less intricate settings (as long as the considered data are i.i.d.), and its theoretical validity is proven in Rühl and Friedrich (2023) based on martingale arguments.While the implementation of Efron's bootstrap is rather simple, the computation time can become excessive with large sample sizes and multiple bootstrap iterations.

Influence function
Another method to obtain confidence intervals for AT E has been described by Ozenne et al. (2020).Provided that the underlying model is correct, the functional delta method yields an approximation of the asymptotic distribution of U n at a given time point w.r.t. the influence function of the average treatment effect.More specifically, as n tends to infinity.The definition of the influence function IF according to Ozenne et al. (2020Ozenne et al. ( , 2017) ) can be found in the supplementary material.Besides, we use N throughout this paper to symbolize the normal distribution.It follows that the plug-in estimator νIF 2 is consistent for the asymptotic variance of U n (t) and thus, confidence intervals are easy to calculate.The construction of confidence bands, on the other hand, is more involved.This is because the dependence between the increments of the process U n must be taken into account when making inferences concerning multiple time points.It can be shown that U n converges weakly to a zero-mean Gaussian process on the Skorokhod space D[0, τ ] ( Rühl and Friedrich, 2023), and thus, we can derive a (1 − α) confidence band for AT E over the interval [t 1 , t 2 ] in line with the resampling approach described by Scheike and Zhang (2008): Here, ) T } b∈{1,...,B} .As compared to the classical bootstrap, the influence function approach significantly reduces the computation time, considering that the resampling step builds upon repeated generation of random variables rather than the recalculation of functionals based on various individual data sets.

Wild bootstrap
A third resampling method arises from the fact that the limiting distribution of U n may be represented in terms of martingales: It can be shown that for functions H k1i and H k2i as defined in the supplementary material and (Rühl and Friedrich, 2023).Note that M ki is a martingale relative to the history (F t ) t≥0 that is generated by the data observed until a given time, i.e.E (dM ki (t) | F t− ) = 0 and Provided that Aalen's multiplicative intensity model (Aalen, 1978) applies, the characterization of the variance equals the conditional expectation of dN ki (t) given the past F t− .This motivates the general idea of the wild bootstrap: By replacing dM ki (t) with the product of dN ki (t) and suitable random multipliers G WB i , k ∈ {1, . . ., K}, i ∈ {1, . . ., n}, we can approximate the asymptotic distribution of U n .The initial method described by Lin et al. (1993) only covered standard normal multipliers, but was later extended to more general resampling schemes (cf.Beyersmann et al., 2013;Dobler et al., 2017).In Rühl and Friedrich (2023), we followed ideas of Cheng et al. (1998); Beyersmann et al. (2013) and Dobler et al. (2017) to formally prove that, conditional on the data, ∼ N (0, 1), i.e. independent standard normal multipliers (according to the original resampling approach by Lin et al., 1993); ∼ Pois(1) − 1, that is, independent and centered unit Poisson multipliers (in line with the proposition of Beyersmann et al., 2013); i.e. conditionally independent, centered binomial multipliers.This version of the wild bootstrap is equivalent to the so-called weird bootstrap described in Andersen et al. (1993, subsection IV.1.4), as Dobler et al. (2017) illustrate.

For multiple multiplier realizations {(G
..,B} . The described bootstrap, just like the influence function, takes only a fraction of the time required by the classical bootstrap.In addition, martingale-based resampling approaches are built upon the condition of independent rightcensoring (Andersen et al., 1993, subsection III.2.2, cf. Rühl et al., 2022) and do not rely on a strict i.i.d.setup.Therefore, they are less sensitive to deviations from standard assumptions where Efron's approach is known to fail, including dependencies inherent to the data (Singh, 1981;Friedrich et al., 2017).

SIMULATION STUDY
In order to compare the performance of the resampling approaches described in Section 3, we simulated competing risks data following the same scheme as in Ozenne et al. (2020), and constructed confidence intervals and bands using the proposed methods.

Data generation
The generated data comprised twelve independent covariates, namely Z 1 , . . ., Z 6 following a mean-zero normal distribution and Z 7 , . . ., Z 12 being Bernoulli distributed with parameter 0.5.Each covariate affected the treatment probability, the event time distributions of two competing failure causes and a conditionally independent censoring time in an individual manner (see Table 1).The treatment indicator A was for instance derived from a logistic regression model with linear predictor α 0 + log( 2 Here, the intercept α 0 controls the overall frequency of treatment.Apart from that, we simulated censoring and event times according to a Weibull distribution with hazard λ(t) = 0.02 t exp (β dA A + β T dZ Z) for corresponding parameters β dA and β dZ , d ∈ {0, 1, 2}, respectively.The minimum of the three resulting (latent) times determined the type of observation (i.e.censored, type 1 or type 2 event).
This general simulation scheme served as a basis for a variety of scenarios, each implemented with sample sizes of n ∈ {50, 75, 100, 200, 300} and treatment effects according to parameter β 1A ∈ {−2, 0, 2}.By default, about half of the observations were assigned to be treated, and the event of interest was observed in a third, half or two thirds of the subjects until time t = 9, corresponding to the case where β 1A = −2, 0, 2, respectively.The frequency of censoring amounted to 17%, 14% or 11% by t = 9, whereas the competing event affected 41%, 31% or 21% of the subjects.(If necessary, the data generation step was repeated until at least 10 events of both causes were observed, such that meaningless regression outcomes could be averted.)Among the examined scenarios were settings with varying degrees of censoring (namely 0%, 14% and 30% in the case without treatment effect, i.e. β 1A = 0), treatment frequencies of 22% as well as 86% and non-unit variances (0.25 and 4) of the normally distributed covariates.Besides, we considered a standard survival scenario without competing events that involved type II censoring with staggered entry in order to investigate a setting with independent, but not random censoring (Rühl et al., 2022).For an overview of the different scenarios, see Table 2. *: For the scenario with type II censoring, the percentages of censoring and type 1 events are determined at t = 10, t = 5, and t = 2.5, for β 1A ∈ {−2, 0, 2}, respectively.
Confidence intervals (at time points t ∈ {1, 3, 5, 7, 9}) and bands (over the time interval [0, 9]) for the average treatment effect were derived by applying Efron's bootstrap (EBS), the influence function approach (IF) and the wild bootstrap (WBS) to each generated data set.While the EBS was implemented with 1, 000 repetitions to maintain reasonable runtimes, we considered 10, 000 random multipliers for the two remaining methods.The WBS was realized with standard normal, Poisson and binomial multipliers according to Remark 1.We then assessed the performance of the distinct methods by means of the associated 95% coverage probabilities and the widths of the confidence ranges.The maximum Monte Carlo standard errors amounted to 0.77% and 0.013 for the coverage and the width, respectively, as all simulation scenarios were repeated 5, 000 times.
In order to determine the true average treatment effect for each of the mentioned scenarios, we considered 1, 000 simulated data sets with sample size n = 100, 000, random treatment assignment independent of the covariates and no censoring.Figure 1 depicts AT E(t) except for the scenarios with non-unit variance of the covariates Z 1 , . . ., Z 6 or type II censoring.

Results
The WBS attained coverage probabilities that were, in total, the closest to the target level of 95%.The mean absolute deviation across all scenarios, sample sizes and time points was 2.40% for the WBS vs. 2.43% and 2.63% for the IF and the EBS, respectively.Throughout nearly all settings, the confidence intervals obtained by the EBS yielded coverages above those derived from the different WBS versions, whereas the IF intervals included the true average treatment effect the least frequently.Figure 2 illustrates this ranking in the case with low-level censoring and a positive average treatment effect (i.e.β 1A = 2, referring here and in the following to the sign of the causal risk difference, i.e. a positive average treatment effect indicates that the potential cumulative incidence under treatment is higher than that under no treatment).We observed similar outcomes in the other scenarios that involved treatment effects according to β 1A ∈ {0, 2} (see supplementary material), even though the performance of the resampling methods varied for small sample sizes up to 75 (e.g. with treatment probabilities larger than 0.5).An exception was the setting with widely dispersed covariates: Here, all methods provided rather conservative confidence intervals, and as a consequence, the IF t AT E(t) approach achieved the most accurate coverages.The same effect was also encountered in the scenarios with negative treatment effect.It should be noted, however, that the differences between the distinct resampling techniques were negligible in most cases with β 1A = −2, apart from very early time points or sample sizes below 100 (see Figure 3).Greater dissimilarities were only present in the setting with high covariance of the covariates (where the EBS performed best for larger sample sizes).A common feature of all the schemes that yielded coverages along the lines of Figure 3 is that the proportion of observed type 1 events was lower than in the scenarios with β 1A ∈ {0, 2}.This is due to the prevalence of the competing event, and the IF approach seems to be slightly more suitable to cope with that condition than the bootstrap methods.
The WBS, in contrast, generally reached its full potential towards later time points, when a sufficient amount of data was available.This became apparent in the setting with type II censoring and a positive average treatment effect: Because of the absence of any competing events, we evaluated the confidence intervals at earlier times t ∈ {0.5, 1, 1.5, 2, 2.5}, and the WBS did not attain coverages as close to 95% as those obtained by the IF and the EBS until t = 2. Against our expectations, the simulations revealed no considerable superiority of the martingale-based methods in case of type II censoring with staggered entry, despite non-random censoring.It appears as if the dependence within the data was too weak for the sample sizes considered and the lack of additional pressure (e.g. by internal left-truncation, cf.Rühl et al., 2022).The coverage probabilities of the time-simultaneous confidence bands followed a similar trend as was observed for the pointwise intervals: While almost all scenarios with positive or no average treatment effect had the highest and lowest coverages for the EBS and IF, respectively, there were virtually no differences in most of the settings with β 1A = −2 (see supplementary material).However, the EBS bands were especially accurate given positive average treatment effects (see Figure 4).On average, the mean absolute discrepancy between the simulated coverages and the nominal level of 95% was 4.44% in comparison to 5.31% and 5.46% for the WBS and the IF approach, respectively.Our results imply further that the choice of the multiplier for the WBS does not have any significant impact.Since the confidence intervals derived using the approaches of Lin et al. (1993) and Beyersmann et al. (2013) were occasionally wider than those resulting from the weird bootstrap, the latter method attained lower coverages.Which of the multipliers provided the most accurate outcomes varied depending on the situation, however.Other than that, the IF produced narrower intervals than any of the WBS versions, and in case of a negative average treatment effect, either approach lead to considerably greater variation in the interval width by comparison with the situations where β 1A ∈ {0, 2}.Interestingly, this effect did not apply to the EBS.The extent of the EBS-based intervals ranged between or above the remaining widths, apart from the settings with β 1A = −2.As the sample sizes increased, however, all resampling methods lead to nearly equally wide confidence intervals (cf. Figure 5).The widths of the confidence bands furthermore related to one another in the same way as their pointwise counterparts.
Eventually, a last note is in order about the computation times of the distinct methods: The IF and EBS approaches have been implemented in the function 'ate' of the R (R Core Team, 2021) package riskRegression by Gerds and Kattan (2021).The calculations are sped up significantly by interfacing C++ code for the IF method and parallelizing the computation of the bootstrap replicates for the EBS.We extracted and adapted the parts of the code that were relevant for our simulations.In addition, C++ was also integrated to implement the WBS.The resulting computation times are summarized in Figure 6.Clearly, the EBS is several times slower than the multiplier-based methods.However, while the computation time of the IF approach is somewhat lower than that of the WBS, one should note that three different multipliers have been regarded for the latter.

REAL DATA APPLICATION
To illustrate the performance of the resampling approaches when applied to real-world study data, we considered records of the long-term disease progression among patients with early-stage Hodgkin's lymphoma (i.e.stage I or II) (Pintilie, 2006).These data are available within the R package randomForestSRC (data 'hd ', Ishwaran and Kogalur, 2022) and comprise information on 865 subjects who were treated at the Princess Margaret Hospital in Toronto between 1968 and 1986, either with radiation alone (n = 616) or a combination of radiation and chemotherapy (n = 249).We considered the time from diagnosis until death (in years), with prior relapse regarded as competing event.Covariates recorded include age, sex, clinical stage of the lymphoma, size of mediastinum involvement and whether the disease is extranodal (see Table 3 for a summary of the data).
The resulting estimate of the average treatment effect (evaluating the combination of radiation and chemotherapy as opposed to radiation alone) on the risk of death as well as the corresponding confidence intervals and bands are depicted in Figure 7.As it can be seen, the EBS confidence bands are notably wider than those derived from the remaining resampling methods.For the confidence intervals, the difference is less pronounced, but still visible.

DISCUSSION
The article at hand compares three resampling methods for the derivation of confidence intervals and bands for the average treatment effect in competing risks settings.As our simulations show, the wild bootstrap yields correct coverage levels for pointwise confidence intervals in the presence of rather small data sets, provided that sufficient events have been observed until the considered time point.This applies regardless of the type of multiplier that is implemented (i.e. standard normal, centered Poisson, or weird bootstrap multipliers).The theory behind the wild bootstrap relies on martingales and therefore accommodates counting processes, which are naturally used to represent time-to-event data.As a consequence, it is straightforward to tackle common issues in survival analysis, such as e.g.left-truncation.(Note the controversy about left-truncation in causal contexts, though, cf.Vandenbroucke and Pearce, 2015;Hernán, 2015.)In case that competing events prevail, one may prefer the influence function approach, however, and if earlier time points are examined, the classical bootstrap seems to be a reasonable choice.The latter also achieves very accurate coverages with respect to time-simultaneous confidence bands.As the amount of available data increases, the differences between the distinct resampling approaches fade away.Efron's simple bootstrap, which is most commonly used in practice, requires considerable computation time, however.What is more, dependencies might cause issues with this resampling method (Singh, 1981;Friedrich et al., 2017;Rühl et al., 2022), even though our simulations did not disclose any major bias in this context.
The three covered approaches were additionally compared given real data on the long-term survival among patients with early-stage Hodgkin's disease (Pintilie, 2006).While the outcomes resulting from the influence function approach and the wild bootstrap are fairly similar, Efron's bootstrap generated wider intervals and, in particular, bands.
It should be noted that for the estimated average treatment effect to be consistent, the model for the cumulative incidence function must be correctly specified.Instead of the cause-specific Cox model used here, one might employ alternatives such as the nonparametric additive hazards model proposed by Aalen (1980) (cf. Ryalen et al., 2018), or the Fine-Gray regression model for F 1 (t | a, z) adopting the subdistribution approach (see Rudolph et al., 2020 or the more technical discourse by Young et al., 2020 for a discussion on cause-specific vs. subdistribution measures in causal frameworks).In the latter case, however, additional considerations on the associated stochastic process are necessary to make inferences on AT E. For the same reason, we did not address estimators based on inverse probability of treatment weighting (IPTW, which requires correct specification of a treatment model rather than the outcome model) or the doubly-robust version combining both the g-formula and IPTW.More details on these estimators and appropriate resampling techniques based on the influence function are given by Ozenne et al. (2020).In order to handle complex conditions that are often observed in real-world trials with time-varying treatments, a possible subject of future work is the extension of the investigated resampling methods to settings that involve timedependent confounding.The standard time-dependent Cox analysis has been shown to yield incorrect results in such settings (Hernán et al., 2000), which is why it is important to incorporate appropriate models (see e.g.Keogh et al., 2023).

Influence function of the average treatment effect
. ., n}, the influence function IF AT E (t; O i ) of the average treatment effect, as described in (Ozenne et al., 2020) and (Ozenne et al., 2017), is defined as 1.2.Functions appearing in the martingale representation of U n Furthermore, the functions H k1i and H k2i , k ∈ {1, . . ., K}, i ∈ {1, . . ., n}, are given by

i
converges weakly to the same process as U n on D[0, τ ]. (Here, the estimates Ĥk1i and Ĥk2i are calculated by plugging appropriate sample estimates into the definition of H k1i and H k2i .)Remark 1.The following choices of multipliers G WB i fulfill the necessary conditions for the wild bootstrap (cf.Dobler et al., 2017):

Figure 2 .
Figure 2. Coverage of the confidence intervals in the scenario with light censoring and a positive average treatment effect.

Figure 3 .
Figure 3. Coverage of the confidence intervals in the scenario without censoring and a negative average treatment effect.

Figure 4 .
Figure 4. Coverage of the confidence bands in the scenario with high treatment probability and a positive average treatment effect.

Figure 5 .
Figure 5. Width of the confidence intervals at time t = 5 in the scenario with light censoring and a positive average treatment effect.

Figure 6 .
Figure 6.Computation times in the scenario with light censoring and a positive average treatment effect.(The height of the bars illustrates the mean computation time.)

Figure 7 .
Figure 7. Confidence intervals (left) and bands (rights) for the average treatment effect on the risk of death.

Table 1 .
Effects of the covariates on the treatment probability, event and censoring times.

Table 2 .
Overview of the simulation scenarios.

Table 3 .
Summary of the Hodgkin's disease data.