Introduction

In orphan and rare diseases, single-arm trials are common given the impracticability, if not impossibility, of randomized controlled trials [1,2,3]. In these settings, an external control arm (ECA) can be employed as a comparator for the single-arm trial to estimate treatment effects, although minimizing potential biases is critical for interpreting those effects. Regulatory bodies have issued guidelines for the application of ECAs in drug development [4,5,6,7]. In practice, the ECA, its data management, and the assessment of outcomes and confounders should match the prospective trial population and procedures as closely as possible. Even when these criteria are met, the ECA design may inherently introduce further uncertainty into such comparisons due to measured and unmeasured confounders [8]. Therefore, adjustment for baseline characteristics and potential confounders is one of the key steps when an external control is used as a comparator [9].

Numerous methods have been proposed to handle potential confounders in ECAs. When individual-level data are available, frequentist approaches can estimate a causal effect using g-computation, propensity score (PS)-based weighting, PS-based stratification, PS-based matching, targeted maximum likelihood estimation (TMLE), and related techniques [10,11,12,13,14]. When only summary data are available, matching-adjusted indirect comparison, simulated treatment comparison, and network meta-analysis may be used to assess the treatment effect [13]. Additionally, Bayesian approaches (e.g., the meta-analytic-predictive approach) can be applied when combining results from multiple external data sources with either individual or summary data. These methods adjust for measured confounders incorporated in the analysis. However, their relative performance in the presence of unmeasured or unobserved confounders is unclear.

By definition, a confounding variable must be correlated with both the outcome and the treatment. Randomization minimizes these correlations and hence the risk of bias due to confounding. However, when real-world data (RWD) are used to construct an ECA, the benefits of randomization are lost, and unmeasured or unobserved predictive variables may act as unobserved confounders. Furthermore, confounder misclassification may introduce another layer of bias when combining trial data and RWD [15]. Even when a similar population is selected from RWD with a valid assessment of both outcomes and confounders, this process may not minimize the impact of confounding variables in the way that randomization does. Since the confounding effect of unmeasured variables cannot be known in real-world cases, we investigated such effects in a simulation study where the effects of the confounder are known. We aimed to compare different methods for estimating average treatment effects (ATEs) in a simulated external control study with individual-level data, which included both measured (included in the model) and unmeasured (excluded from the model) confounders.

Materials and methods

Study design

We generated a treatment arm (e.g., a single-arm trial) and an ECA (e.g., standard of care) to simulate an externally controlled study. Three types of outcomes (continuous, binary, and time-to-event endpoints) were compared between the two arms. To emulate a biased ECA, both measured and unmeasured confounders were added to the data. For each outcome, we explored three scenarios for the relationship between unmeasured confounders and observed variables (Fig. 1).

Fig. 1

Diagram of measured and unmeasured confounders

In scenarios 1 and 2, as shown in Fig. 1 (a and b), the baseline covariate C was a measured confounder of the relationship between the treatment group assignment A and the outcome Y. The measured confounder L was associated with A and with the unmeasured confounder U. Thus, the unmeasured factor U might play an indirect confounding role between the treatment group A and the outcome Y through the covariate L. In scenario 1, we weakened the correlations of U with L and Y so that the confounding effect of U was smaller than in scenario 2.

In scenario 3 (Fig. 1c), all relationships in scenario 2 remained the same, but a direct association between U and A was added. Therefore, the unmeasured factor U might play a stronger confounding role between the treatment A and the outcome Y. While U was technically an unmeasured confounder in all three scenarios, we expected that it would not confound the treatment effect conditional on the measured confounder L in scenarios 1 and 2. In contrast, the direct association between U and the treatment group assignment A added in scenario 3 could lead to confounded estimates even after adjusting for L, because an open path from the treatment to the outcome remained through U.

A set of random samples was drawn from a simulated target population for each scenario. For each generated sample, treatment effects were estimated from the observed outcomes and the measured confounders incorporated in the analytic model, using g-computation; PS-based weighting, namely inverse probability of treatment weighting (IPTW), standardized mortality/morbidity ratio (SMR) weighting, and overlap weighting (OW); and targeted maximum likelihood estimation (TMLE).

Simulated dataset

We generated a data set of 20,000 individuals, which was considered the target population. In this data set, variable U (the unmeasured confounder) was drawn from a normal distribution (mean 0.5, variance 1); variable C was a baseline characteristic drawn from a normal distribution (mean 50, variance 50); variable L was a binary covariate generated as a logit function of U; variable A was the treatment assignment (0 = ECA, 1 = trial arm), generated as a logit function of L and C in scenarios 1 and 2, and of L, C, and U in scenario 3; variables Ya, Yb, and Yc were continuous, binary, and time-to-event outcomes, respectively, each generated as a corresponding function of A, C, and U. In addition, variable D indicated the event status (e.g., 0 = alive, 1 = dead) for the time-to-event outcome Yc.

From this target population, we randomly drew 3000 samples, each containing 200 observations (i.e., 'individuals'). We also drew another 3000 samples of 100 individuals each in the same fashion, in order to further validate our results at a different sample size.
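The data-generating process described above can be sketched as follows. The paper's analyses were run in R, so this is an illustrative Python translation; all regression coefficients below are hypothetical placeholders (the paper does not report the exact values used), and only the continuous outcome Ya is shown.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000  # size of the simulated target population

def expit(x):
    return 1 / (1 + np.exp(-x))

# Unmeasured confounder U and baseline covariate C (distributions from the paper)
U = rng.normal(0.5, 1.0, n)           # mean 0.5, variance 1
C = rng.normal(50, np.sqrt(50), n)    # mean 50, variance 50

# L is a binary logit function of U; A depends on L and C (scenarios 1-2).
# All coefficients here are illustrative, not the paper's values.
L = rng.binomial(1, expit(-0.5 + 0.8 * U))
A = rng.binomial(1, expit(-5.0 + 0.1 * C + 0.5 * L))  # 0 = ECA, 1 = trial arm

# Continuous outcome Ya as a function of A, C, and U (illustrative coefficients)
Ya = 4.0 * A + 0.05 * C - 1.0 * U + rng.normal(0, 1, n)

# One simulation sample of 200 'individuals' drawn from the target population
sample_idx = rng.choice(n, size=200, replace=False)
```

The binary and time-to-event outcomes (Yb, Yc) would be generated analogously from A, C, and U.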

True effect

To assess the various methods, which obtained estimates from samples, the true treatment effect in the population is needed. Population-level ATEs were obtained by applying regression models to estimate marginal effects in the full data set (n = 20,000), including both the measured and the unmeasured confounders in the analyses. The ATEs for Ya, Yb, and Yc were expressed as the mean difference, log odds ratio, and log hazard ratio, respectively.

Methods for minimizing potential biases

This study explores g-computation, PS-based weighting, and TMLE, methods that can handle individual-level data to reduce potential biases in externally controlled studies.

G-computation

G-computation is one of Robins' g-methods [16]. It models the joint density of the observed data to generate potential outcomes under different exposure scenarios [10]. First, we used the observed data (variables A, C, L) to build outcome regression models: linear regression, logistic regression, and Cox proportional hazards regression for \({\widehat{Y}}_{a}\), \({\widehat{Y}}_{b}\), and \({\widehat{Y}}_{c}\), respectively. Then, counterfactual outcomes for each individual were estimated from the outcome regression models, once assuming all individuals received the trial treatment and once assuming all received the standard of care. Finally, the estimated ATE for each outcome was calculated as the difference between the two averages of counterfactual outcomes. The R package “RISCA” was used to implement g-computation for estimating the ATE for the binary and time-to-event outcomes [12].
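For the continuous outcome, the three steps above (fit an outcome model, predict both counterfactuals, average the difference) can be sketched in a few lines. The paper used R ("RISCA"), so this Python version with a plain linear outcome model is only a minimal illustration.

```python
import numpy as np

def g_computation_ate(A, C, L, Y):
    """G-computation ATE for a continuous outcome with a linear outcome model."""
    X = np.column_stack([np.ones_like(Y), A, C, L])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)    # fit E(Y | A, C, L)
    X1, X0 = X.copy(), X.copy()
    X1[:, 1] = 1                                    # everyone on the trial treatment
    X0[:, 1] = 0                                    # everyone on standard of care
    return np.mean(X1 @ beta) - np.mean(X0 @ beta)  # difference of counterfactual means
```

With no treatment-covariate interactions this reduces to the regression coefficient of A; with interactions or nonlinear outcome models, the two prediction-and-averaging steps become essential.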

PS-based weighting

One of the most common PS-based weighting methods is IPTW [17]. The SMR and OW methods can also be used to create weights that minimize potential biases in causal effect estimation [14, 18, 19]. We used logistic regression to predict the probability of treatment assignment (pi, the propensity score for each individual) given the observed predictors C and L. A weight for each individual was then calculated according to the chosen weighting method. Under IPTW, SMR, and OW, respectively, the weights were 1/pi, 1, and 1 − pi in the trial treatment group, and 1/(1 − pi), pi/(1 − pi), and pi in the ECA group. The PS-based weights were then applied in linear regression, logistic regression, and Cox proportional hazards regression to estimate the ATEs for \({\widehat{Y}}_{a}\), \({\widehat{Y}}_{b}\), and \({\widehat{Y}}_{c}\), respectively.
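The three weighting schemes can be written directly from the formulas above; a small Python helper (illustrative only, since the paper's analyses were done in R) makes the correspondence explicit.

```python
import numpy as np

def ps_weights(ps, A, method):
    """Per-individual weights given propensity scores ps and treatment indicator A.

    IPTW targets the whole population, SMR the treated, and OW the overlap.
    """
    if method == "IPTW":
        return np.where(A == 1, 1 / ps, 1 / (1 - ps))
    if method == "SMR":
        return np.where(A == 1, 1.0, ps / (1 - ps))
    if method == "OW":
        return np.where(A == 1, 1 - ps, ps)
    raise ValueError(f"unknown method: {method}")
```

The resulting weights then enter the weighted outcome regressions exactly as described above.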

TMLE

The TMLE method combines g-computation and PS-based weighting and is therefore considered a doubly robust estimator [11, 20, 21]. In brief, ATE estimation with TMLE begins with estimating the conditional expectation of the outcome given the exposure and covariates, E(Y|A, C, L). This estimate of E(Y|A, C, L) is used to generate predicted outcomes under both exposure levels (i.e., the pair of potential outcomes), yielding initial counterfactual outcomes for each individual from an outcome regression model, similar to the corresponding steps of g-computation. Next, the “targeting” step involves estimating the exposure mechanism, P(A = 1|C, L), which is then used to update the initial estimate of E(Y|A, C, L) through a predefined working regression model. In this step, the PS-based clever covariate \(H\left(A, C,L\right)=\frac{I\left(A=1\right)}{P(A=1|C,L)}-\frac{I\left(A=0\right)}{P(A=0|C,L)}\) is included in the equation \(logit\left({\widehat{Y}}^{*}\right)=logit\left(\widehat{Y}\right) + \epsilon H\) in order to estimate the fluctuation parameter (\(\epsilon\)), which indicates how much to change, or fluctuate, the initial outcome estimate.

Last, the estimated ATE for each outcome can be calculated as follows:

$$\widehat{ATE}=\frac{1}{n}{\sum }_{i=1}^{n}({\widehat{Y}}_{1}^{*}-{\widehat{Y}}_{0}^{*})$$
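For a binary outcome, the targeting step and the final averaging can be sketched as follows. This Python fragment is illustrative (the paper used the R package "ltmle"): it solves for the fluctuation parameter with a simple Newton iteration and assumes the initial outcome predictions Q1, Q0 and the propensity scores g1 have already been estimated by some earlier modeling step.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

def tmle_targeting(Y, A, Q1, Q0, g1):
    """One TMLE targeting step for a binary outcome.

    Y: observed binary outcome; A: treatment indicator (1 = trial arm);
    Q1, Q0: initial predictions of E(Y|A=1, C, L) and E(Y|A=0, C, L);
    g1: estimated propensity score P(A=1|C, L).
    """
    QA = np.where(A == 1, Q1, Q0)
    H = A / g1 - (1 - A) / (1 - g1)        # clever covariate H(A, C, L)
    # Solve the logistic fluctuation score equation for epsilon (Newton's method)
    eps = 0.0
    for _ in range(50):
        p = expit(logit(QA) + eps * H)
        score = np.sum(H * (Y - p))
        info = np.sum(H**2 * p * (1 - p))
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break
    # Updated (targeted) counterfactual predictions, then average their difference
    Q1_star = expit(logit(Q1) + eps / g1)
    Q0_star = expit(logit(Q0) - eps / (1 - g1))
    return np.mean(Q1_star - Q0_star)
```

The returned value is the \(\widehat{ATE}\) of the formula above, computed from the fluctuated predictions \({\widehat{Y}}_{1}^{*}\) and \({\widehat{Y}}_{0}^{*}\).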

The TMLE method was implemented with the R package “ltmle” using its default machine learning algorithms. Since “ltmle” cannot directly provide a hazard ratio (HR) for the time-to-event outcome (\({\widehat{Y}}_{c}\)), the relative risk (RR) at the primary time point (e.g., the median survival of all individuals) was estimated first and then converted into an HR using the following equation:

$$\mathrm{HR}=\frac{\mathrm{log}(1-\mathrm{d}*\mathrm{RR})}{\mathrm{log}(1-\mathrm{d})}$$

where RR is the relative risk, HR is the hazard ratio, and d is the death rate in the reference group (e.g., d = 0.4) [22].
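The conversion is a one-line function; here is a small Python version (illustrative, since the paper's computation was done in R):

```python
import math

def rr_to_hr(rr, d):
    """Convert a relative risk to a hazard ratio, given the reference-group
    event risk d (here, the death rate at the chosen time point)."""
    return math.log(1 - d * rr) / math.log(1 - d)
```

For example, with d = 0.4 an RR of 1 maps to an HR of 1, as expected; the formula requires d * RR < 1.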

Model comparison

All statistical analyses were conducted in RStudio (version 1.4.1717) and R (version 4.0.4). In addition to the aforementioned methods, we also ran a raw model, which estimated the treatment effects directly without any adjustment. To assess the methods, we compared the true effect with the results (point estimate and 95% confidence interval) from each model and calculated the bias and root mean squared error (RMSE), along with the coverage and width of the 95% confidence interval, for the treatment effects. Bias was defined as the average difference between the true value (simulated) and its estimate across the simulation replicates, on the original scale for the continuous outcome and on the log-transformed scale, such as log(HR), for the binary and time-to-event outcomes. RMSE was the square root of the mean squared error (MSE), the average squared difference between the true value and its estimate across the simulation replicates. Coverage was the proportion of replicates in which the 95% confidence interval of the estimate contained the true value. Width was the average difference between the upper and lower bounds of the 95% confidence interval.
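The four performance measures defined above can be computed with a short helper. This Python sketch follows those definitions directly; the bias follows the paper's wording (true value minus estimate, averaged across replicates).

```python
import numpy as np

def performance(estimates, lowers, uppers, truth):
    """Bias, RMSE, 95% CI coverage, and average CI width across replicates."""
    est = np.asarray(estimates, dtype=float)
    lo = np.asarray(lowers, dtype=float)
    hi = np.asarray(uppers, dtype=float)
    bias = np.mean(truth - est)                  # true value minus estimate, averaged
    rmse = np.sqrt(np.mean((truth - est) ** 2))  # square root of the MSE
    coverage = np.mean((lo <= truth) & (truth <= hi))
    width = np.mean(hi - lo)
    return bias, rmse, coverage, width
```

In the simulation, `estimates`, `lowers`, and `uppers` would each hold one value per replicate for a given method and outcome, and `truth` would be the population-level ATE.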

Results

Summary of potential confounders

As shown in Table 1, the distributions of the three potential confounders (U, C, and L) differed considerably between the two treatment arms, with standardized mean differences ranging from 0.08 to 0.88.

Table 1 Summary of potential confounders between treatment groups

The unmeasured factor U was associated with the outcomes in all three scenarios. The Pearson correlations (r) of the outcomes (Ya, Yb, and Yc) with U were -0.15, -0.12, and -0.07 in scenario 1, respectively. These correlations increased in scenario 2, most strongly for Ya (-0.44), followed by Yb (-0.23) and Yc (-0.12); the values in scenario 3 were similar.

The correlation (r) between the unmeasured factor U and the treatment assignment A was 0.06 and 0.11 in scenarios 1 and 2, respectively, and became much stronger (r = 0.24) in scenario 3 after the direct association between them was added. The treatment assignment A was also influenced by the other baseline factors C (r = 0.51, 0.53, and 0.51, respectively) and L (r = 0.18, 0.22, and 0.28, respectively) in the three scenarios.

The odds ratio between the unmeasured confounder U and the treatment assignment A, adjusting for the measured factor L, was 0.99, 0.99, and 1.34 in scenarios 1, 2, and 3, respectively. This indicated that the measured factor L, as a proxy for the unmeasured factor U, could not block the confounding path (A ← U → Y) in scenario 3.

Estimated ATE for continuous endpoint

As depicted in Fig. 2, the center of the distribution of estimated ATEs (mean ATE = 3.19) for the continuous endpoint from the raw model clearly deviated from the true effect (mean difference = 3.99) in scenario 1. The poor performance of the raw model was also reflected in its larger RMSE (0.86) and lower coverage (0.26) compared with the other methods (Table 2—Outcome Ya). In contrast, the mean estimated ATEs for the continuous endpoint from g-computation, IPTW, OW, SMR, and TMLE were 3.99, 3.92, 3.98, 3.92, and 3.95, respectively, all close to the true effect. Among them, g-computation had the smallest RMSE (0.351), followed by OW (RMSE = 0.354), and g-computation had the highest coverage, at 0.95.

Fig. 2

Density distribution of treatment effect estimates by different methods for a continuous endpoint (n = 200). GC, g-computation; PS-, propensity score-based; IPTW, inverse probability of treatment weighting; SMR, standardized mortality or morbidity ratio; OW, overlap weighting; TMLE, targeted maximum likelihood estimation

Table 2 Performance of different methods for estimating the treatment effects (n = 200)

In scenario 2, the mean estimated ATE (2.95) for the continuous endpoint from the raw model shifted further away from the true effect (4.04). The mean estimates from g-computation, IPTW, OW, SMR, and TMLE were 4.06, 3.90, 4.06, 3.93, and 3.99, respectively, still acceptably close to the true effect. However, the mean estimates from these adjusted methods tended to be more biased in scenario 3 than in scenario 2 (e.g., the RMSE for g-computation was 0.60 vs 0.41).

Estimated ATE for binary endpoint

Similar to the results for the continuous endpoint, the mean estimated ATE (0.94) for the binary endpoint from the raw model clearly deviated from the true effect (log odds ratio = 1.21) in scenario 1, and the deviation was even larger in scenarios 2 and 3. As shown in Fig. 3, the mean estimated ATEs for the binary endpoint from g-computation, IPTW, OW, SMR, and TMLE approximated the true effect in scenario 1 (1.22, 1.24, 1.24, 1.21, and 1.24 versus 1.21, respectively) and in scenario 2 (1.19, 1.17, 1.21, 1.14, and 1.19 versus 1.17, respectively), but deviated somewhat from the true effect in scenario 3 (0.98, 0.93, 0.99, 0.89, and 0.95 versus 1.11, respectively). Compared with the other methods (Table 2—Outcome Yb), g-computation again had the smallest RMSE (e.g., 0.35 in scenario 2), followed by OW (e.g., 0.37 in scenario 2), along with reasonable coverage (e.g., 0.93 in scenario 2).

Fig. 3

Density distribution of treatment effect estimates by different methods for a binary endpoint (n = 200). GC, g-computation; PS-, propensity score-based; IPTW, inverse probability of treatment weighting; SMR, standardized mortality or morbidity ratio; OW, overlap weighting; TMLE, targeted maximum likelihood estimation

Estimated ATE for time-to-event endpoint

The mean estimated ATEs for the time-to-event endpoint from the raw model also differed considerably from the true effects in all scenarios (e.g., log hazard ratio = -0.65 versus -0.83 in scenario 1). The mean estimated ATEs from g-computation, IPTW, OW, SMR, and TMLE in scenario 1 were -0.83, -0.81, -0.86, -0.85, and -0.68, respectively, which closely approximated the true effect, except that the TMLE estimate, obtained via a conversion from RR to HR, deviated noticeably (-0.68 versus -0.83). As shown in Fig. 4, similar results were observed in scenarios 2 and 3. The superior performance of g-computation, IPTW, OW, and SMR was also demonstrated in Table 2 (Outcome Yc), and g-computation had the smallest RMSE in all three scenarios (e.g., 0.19 in scenario 1) compared with the other methods.

Fig. 4

Density distribution of treatment effect estimates by different methods for a time-to-event endpoint (n = 200). GC, g-computation; PS-, propensity score-based; IPTW, inverse probability of treatment weighting; SMR, standardized mortality or morbidity ratio; OW, overlap weighting; TMLE, targeted maximum likelihood estimation

Sensitivity analysis

When the sample size was reduced from 200 to 100, we found similar results in terms of the bias reduction associated with each method, but the variance increased, leading to larger RMSEs and wider confidence intervals (supplementary file S1).

Discussion

By design, both the unmeasured confounder U and the measured covariates (C and L) acted as potential confounders when we investigated the treatment effect in the simulated data. The raw or unadjusted (observed) model, which ignored all the potential confounders, produced results that were clearly biased, deviating greatly from the known treatment effect. When only the measured covariates were included in the model, g-computation, PS-based weighting (IPTW, OW, and SMR), and TMLE produced results that were less biased than the raw analysis in all three scenarios. Theoretically, the bias due to an unmeasured confounder cannot be completely corrected unless an alternative (proxy) variable is measured and included in the analysis.

In scenarios 1 and 2, the unmeasured confounder U indirectly influenced the treatment assignment through the covariate L, and it had a different strength of association with each outcome (Ya > Yb > Yc). Accordingly, we found that the unmeasured confounder had the strongest confounding effect on outcome Ya (Figs. 2, 3, 4). Under these circumstances, g-computation, PS-based weighting, and TMLE successfully corrected most of the bias when we estimated the treatment effect from the measured data. Here, the covariate L acted as a proxy for the unmeasured confounder U, so the methods that incorporated L worked well.

After the additional relationship between U and the treatment assignment was added, the unmeasured confounder showed a stronger confounding effect in scenario 3 than in scenario 2, based on the standardized mean difference (0.36 vs 0.16). This is mainly because the covariate L in scenario 3 was no longer a good proxy for the unmeasured confounder U, since U was also directly associated with the treatment assignment. Thus, it is not surprising to see more bias in scenario 3 than in the other two scenarios. Furthermore, the unmeasured confounder had a relatively strong correlation with outcome Ya, which may explain why the results for outcome Ya from all methods tended to be more biased in scenario 3 than in scenario 2 (Fig. 2). The extent of the bias was comparatively small for outcome Yb and even negligible for outcome Yc, because their associations with the unmeasured confounder were not as strong as that of outcome Ya.

Although the same predictors were used for modeling, g-computation had the smallest RMSEs in most settings of this simulation study (Table 2), consistent with another recent simulation study [12]. Currently, the PS-based approach remains more widely used than g-computation for a handful of pragmatic reasons. First, the PS-based approach is easy to understand, although it requires the same assumptions as g-computation, such as counterfactual consistency, exchangeability, and positivity [10]. Second, an explicit variance formula exists for the PS-based approach, whereas g-computation usually needs bootstrapping or simulation to obtain the variance [12, 23]. Third, several tutorials on the PS-based approach and its successful applications in research and regulatory approvals are available [24,25,26,27,28]. However, these reasons should not stop g-computation from becoming another popular approach for minimizing potential biases in future drug development.

TMLE is a doubly robust, maximum-likelihood-based approach that includes a secondary "targeting" step to optimize the bias-variance tradeoff for the target parameter [11]. Unsurprisingly, the RMSEs for TMLE fell between those for g-computation and IPTW in our simulation study, since TMLE can be considered a combination of g-computation and PS-based weighting. TMLE has the merit that its doubly robust property protects against significant model misspecification arising from an omitted confounder in either the exposure or the outcome regression [11]. However, TMLE could not demonstrate this advantage in our simulation study, because the unmeasured confounder was omitted from both the exposure and the outcome regressions, which is not uncommon in real-world data. In addition, we observed more bias in the TMLE estimates of the treatment effect for the time-to-event outcome compared with the other adjusted methods. One explanation for this deviation might be the conversion from RR to HR in our study, as the current R packages (e.g., "MOSS", "survtmle", "ltmle") cannot provide an HR directly. The lack of HR statistics in current statistical packages might be one of the hurdles for applying TMLE in practice. On the other hand, the HR is not indispensable in survival analysis, and survival curves are more informative than HRs, even though the HR is the main, and often the only, effect measure reported in many epidemiological studies [29].

Among the three PS-based weighting methods, OW had the best performance in this simulation. Both IPTW and SMR use the propensity score in the denominator when calculating the weight, whereas OW does not. A weight based on the reciprocal of the propensity score can be greatly amplified when the propensity score is very small or very large (e.g., 0.01 or 0.99). OW is therefore less sensitive to extreme propensity scores, which might explain its smaller RMSE for estimating the treatment effect compared with IPTW and SMR [19]. Furthermore, the target populations of the three PS-based weighting methods are interpreted differently: the estimates from IPTW, SMR, and OW can be considered the average treatment effect in the population, in the treated, and in the overlap population, respectively. In a sensitivity analysis with some heterogeneity of treatment effect (supplementary file S2), we further verified that IPTW, SMR, and OW are not interchangeable except under constant treatment effects. Both g-computation and TMLE can also be implemented to estimate the treatment effect in different target populations, such as the average treatment effect in the entire population (ATE), in the treated (ATT), and in the untreated (ATU).

One merit of this study is that we explored scenarios that previous studies had not investigated, such as three types of outcomes along with different confounding effects (small and blocked confounding paths, medium and blocked confounding paths, and one large unblocked confounding path) caused by an unmeasured confounder [11, 12]. Nevertheless, it would be worth exploring more scenarios in future studies; for instance, what if ECA studies have longitudinal data with time-varying measured and unmeasured confounders? In addition, since we targeted the marginal treatment effect, our simulation did not provide the conditional treatment effect, which is usually estimated by a multivariable regression model adjusting for confounders [30]. Finally, our simulation did not include other doubly robust methods (e.g., augmented IPTW) because their doubly robust property is similar to that of TMLE [11, 31,32,33].

Conclusion

In externally controlled studies, g-computation, PS-based weighting (IPTW, OW, and SMR), and TMLE can be used to minimize the biases due to measured and unmeasured confounders. However, the extent of bias reduction depends on the technique used and on whether an unblockable confounding path from an unmeasured confounder exists. Our findings in this simulation study suggest that g-computation and OW are preferable approaches, producing relatively unbiased estimates, especially when there is no large, unblocked confounding path from an unmeasured confounder.