Regression modelling
Typically, adjustment for pre-specified baseline covariates in the analysis of RCTs is performed using standard regression methods. Consider a two arm randomised trial with a total of n subjects where for participant i, Zi = 0 or 1 represents treatment allocation (0 = control, 1 = treatment), Yi denotes the outcome of interest and Xi = (Xi1, …, Xip)′, a (p × 1) vector of baseline covariates. For a continuous outcome Yi, a linear regression model with the following structure may be used to estimate the baseline adjusted treatment effect,
$$ {Y}_i=\alpha +\theta {Z}_i+{\boldsymbol{\beta}}^{\top }{\boldsymbol{X}}_i+{e}_i, $$
$$ {e}_i\sim N\left(0,{\sigma}^2\right). $$
Here θ represents the treatment effect after adjustment for Xi, i.e. conditional on having particular baseline covariate values Xi. This analysis is often referred to as an analysis of covariance (ANCOVA). For other types of outcomes alternative models can be used, such as a logistic regression for binary outcomes or a Cox proportional hazards model for time to event outcomes [4].
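The analyses in this paper were run in Stata; purely as an illustration (not the authors' code), the ANCOVA model above can be fitted by ordinary least squares. The sketch below uses a single baseline covariate and hypothetical parameter values (intercept 1, covariate effect 2, treatment effect 5, residual SD 5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-arm trial: one baseline covariate x, 1:1 treatment z, outcome y.
n = 200
x = rng.normal(size=n)
z = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 5.0 * z + 2.0 * x + rng.normal(scale=5.0, size=n)

# ANCOVA: linear regression of y on an intercept, the treatment indicator
# and the baseline covariate; the coefficient on z estimates theta.
D = np.column_stack([np.ones(n), z, x])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
theta_hat = coef[1]  # baseline-adjusted estimate of the treatment effect
```

The estimate `theta_hat` is conditional on the covariates in the model, mirroring the interpretation of θ above.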
Previous research has explored the properties of linear regression estimators with a varying number of subjects. Various rules-of-thumb for the number of subjects required in linear regression analyses have been debated, which include specifying either a fixed sample size, regardless of the number of predictors, or a minimum number of subjects per variable (SPV) ranging from 2 to 20 [15,16,17]. In this study, we will compare the performance of linear regression modelling for covariate adjustment in smaller sample RCT settings against IPTW using the propensity score.
IPTW propensity score approach
The propensity score is defined as the conditional probability of being exposed to a particular treatment given the values of measured covariates. For example, continuing in the above two arm RCT setting, where Z denotes treatment allocation, Y the continuous outcome and X = (X1, …, Xp) the baseline covariates, the propensity score is defined as:
$$ e\left(\boldsymbol{X}\right)=\mathbb{P}\left(Z=1\mid \boldsymbol{X}\right). $$
In a simple two arm RCT allocating individuals in a 1:1 ratio the propensity score is known to be 0.5. However, previous work has shown that estimating the propensity score from the observed data, and using the estimate as if the true score were unknown, provides increased precision without introducing bias in large samples [14]. The most popular model of choice for estimating the propensity score is a logistic regression [18]. As the treatment indicator Z is binary, suppose the logistic regression is parametrised by α = (α0, α1, …, αp)⊤, with X augmented by a constant term for the intercept, so that:
$$ \log \left\{e\left(\boldsymbol{X}\right)/\left(1-e\left(\boldsymbol{X}\right)\right)\right\}={\boldsymbol{X}}^{\top}\boldsymbol{\alpha} . $$
For each participant indexed by the subscript i, a probability of being either in the treatment or control arm, given the baseline characteristics, can be estimated from the fitted propensity score model as:
$$ {\hat{e}}_i=\hat{e}\left({\boldsymbol{X}}_i\right)=\frac{\exp \left({\boldsymbol{X}}_i^{\top}\hat{\boldsymbol{\alpha}}\right)}{1+\exp \left({\boldsymbol{X}}_i^{\top}\hat{\boldsymbol{\alpha}}\right)}. $$
As described in [18], other methods can be used to estimate the propensity score, such as neural networks, recursive partitioning and boosting; however, we focus on estimation via a logistic model throughout.
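The paper's propensity models were fitted in Stata; as an illustrative sketch only, the logistic fit above can be reproduced with a few Newton-Raphson steps in Python/NumPy (randomisation probability 0.5 and the sample size here are assumed for the example):

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Fit P(Z=1|X) = expit(X @ alpha) by Newton-Raphson.
    X must include a leading column of ones for the intercept alpha0."""
    alpha = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ alpha))   # current fitted probabilities
        W = p * (1.0 - p)                      # IRLS weights
        # Newton step: alpha += (X' W X)^{-1} X' (z - p)
        alpha += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (z - p))
    return alpha

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 covariate
z = rng.integers(0, 2, size=n).astype(float)           # 1:1 randomisation

alpha_hat = fit_logistic(X, z)
e_hat = 1.0 / (1.0 + np.exp(-X @ alpha_hat))           # estimated propensity scores
```

Because treatment is randomised independently of the covariate, the estimated scores cluster around the true allocation probability of 0.5.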
The propensity score was originally introduced in 1983 by Rosenbaum and Rubin [19] as a tool to adjust for confounding in the observational study setting. Rosenbaum and Rubin showed that, under certain assumptions, at each value of the propensity score the difference between treatment arms is an unbiased estimate of the treatment effect at that value. At each value of the propensity score, individuals will on average have the same distribution of the covariates included in the propensity score model. Consequently, matching on the propensity score, stratification on the propensity score or covariate adjustment using the propensity score can provide an unbiased estimate of the treatment effect. Alternatively, Inverse Probability of Treatment Weighting (IPTW) using the propensity score [20] may be used. That is, participants in the treatment arm are assigned a weight of \( {w}_i=1/{\hat{e}}_i \), while participants in the control arm are assigned a weight of \( {w}_i=1/\left(1-{\hat{e}}_i\right) \). For a continuous outcome, the adjusted mean treatment group difference can then be obtained by fitting a linear regression model on treatment only, weighted by the inverse probability of receiving treatment.
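The weighting-then-regression steps just described can be sketched end-to-end in Python/NumPy (again an illustration, not the authors' Stata implementation; the covariate effect, treatment effect and residual SD are taken from this paper's simulation design):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.normal(size=n)                        # baseline covariate
z = rng.integers(0, 2, size=n).astype(float)  # 1:1 randomised treatment
y = 2.0 * x + 5.0 * z + rng.normal(scale=5.0, size=n)

# Step 1: estimate the propensity score by logistic regression of z on x
# (Newton-Raphson iterations; D includes an intercept column).
D = np.column_stack([np.ones(n), x])
alpha = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-D @ alpha))
    alpha += np.linalg.solve(D.T @ (D * (p * (1.0 - p))[:, None]), D.T @ (z - p))
e_hat = 1.0 / (1.0 + np.exp(-D @ alpha))

# Step 2: IPT weights: 1/e for treated, 1/(1-e) for controls.
w = np.where(z == 1.0, 1.0 / e_hat, 1.0 / (1.0 - e_hat))

# Step 3: weighted linear regression of outcome on treatment only (WLS).
T = np.column_stack([np.ones(n), z])
coef = np.linalg.solve(T.T @ (T * w[:, None]), T.T @ (w * y))
theta_iptw = coef[1]  # IPTW-adjusted treatment effect
```

Note that a model-based standard error from this weighted regression would ignore the estimation of \( \hat{e} \); the variance estimation issue is discussed next.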
Unlike in the observational setting, issues of confounding do not arise in the RCT setting. However, Williamson et al. [14] recently introduced the propensity score approach, specifically via IPTW, as a useful method of covariate adjustment to obtain variance reduction in RCT settings. Crucially, the variance estimator needs to take into account the estimation of the propensity score. Using a variance estimator based on the theory of M-estimation [21] and on Lunceford and Davidian [20], Williamson et al. showed consistent estimation of the treatment effect and large sample equivalence with the variance estimated via ANCOVA. This is the full sandwich variance estimator, which takes into account all the estimating equations, including the components estimating the propensity score. We henceforth refer to this as the IPTW-W variance estimator (see eq. 1, Additional file 1). There are already examples where trialists have used such methods to obtain precise estimates in the RCT setting [22].
Simulation study
To assess the performance of the two methods of baseline covariate adjustment in small population RCT settings we conducted a simulation study. We also explored the performance of the non-parametric bootstrap variance for the IPTW treatment estimator since the IPTW-W variance estimate involves a number of computational steps (see eq. 1, Appendix A, Additional file 1). Data generation and all analyses were conducted using Stata [23].
Data generation
First, we considered RCT scenarios with continuous covariates. We generated a set of 6 continuous covariates (C1-C6) as independent standard normal variables with mean 0 and variance 1. A treatment arm indicator (Z) was generated from a Bernoulli distribution with a probability of 0.5 to obtain approximately equally sized treatment groups. A normally distributed outcome Y, with mean E[Y] = 2*C1 + 2*C2 + 2*C3 + 2*C4 + 2*C5 + 2*C6 + 5*Z and variance 5² = 25, was then simulated. Covariates were therefore moderately associated with the outcome, with a one standard deviation increase in a covariate associated with a 2 unit increase in outcome. Treatment was simulated to have a stronger association with outcome, with a true difference of θ = 5 in outcome between treatment arms. Fixed sample sizes of 40–150 (in multiples of 10) and 200 were drawn. The parameters chosen ensured all trial scenarios had at least 80% power. For each sample size scenario we randomly generated a total of 2000 datasets. Second, we repeated the above steps but included a mix of binary and continuous covariates (see Additional file 1 for these additional simulation methods).
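The data generation was done in Stata [23]; as a language-agnostic check of the set-up, one simulated dataset under the continuous-covariate design can be sketched in Python/NumPy (parameter values as stated above):

```python
import numpy as np

def generate_trial(n, theta=5.0, beta=2.0, sigma=5.0, rng=None):
    """One simulated RCT dataset: 6 independent standard normal covariates,
    Bernoulli(0.5) treatment allocation, and a normal outcome with
    mean beta*(C1+...+C6) + theta*Z and variance sigma**2 (5**2 = 25)."""
    rng = np.random.default_rng() if rng is None else rng
    C = rng.normal(size=(n, 6))                    # C1-C6 ~ N(0, 1)
    Z = rng.integers(0, 2, size=n).astype(float)   # treatment indicator
    Y = beta * C.sum(axis=1) + theta * Z + rng.normal(scale=sigma, size=n)
    return C, Z, Y

# Smallest sample size scenario considered in the study.
C, Z, Y = generate_trial(40, rng=np.random.default_rng(3))
```

Repeating this 2000 times per sample size scenario reproduces the simulation grid described above.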
Statistical analysis
A linear regression model containing the treatment covariate only was fitted to estimate the unadjusted treatment effect for each simulated data set. Subsequently we conducted four different adjusted analyses, adjusting for (i) C1 only, (ii) C1-C2, (iii) C1-C4 and (iv) C1-C6. Adjusted analyses were performed using multiple linear regression and IPTW using the propensity score estimated via a logistic regression. The outcome model used in the IPTW analysis was a linear regression of outcome on treatment, weighted by the inverse probability of treatment weights derived from the estimated propensity score. For each analysis we extracted the estimated treatment effect, \( \hat{\theta} \), and its estimated standard error, \( \hat{SE} \). For the unadjusted and adjusted linear regression analyses the model based estimated standard error was used, and for IPTW we estimated the variance using the formula provided by Williamson et al. that takes into account the uncertainty in the estimated propensity score (IPTW-W, eq. 1 in Appendix A, Additional file 1). For each analysis we also estimated the non-parametric bootstrap variance of the treatment effect, \( {\hat{SE}}_{boot} \), using 1000 replicates drawn with replacement [24], to compare with the model based and IPTW-W variance estimators. For IPTW, the bootstrap included re-estimation of the propensity score for each bootstrap sample. 95% confidence intervals were calculated using the t-distribution for the linear regression analyses. For IPTW we calculated 95% confidence intervals (CI's) using the normal distribution, following the approach taken by Williamson et al.
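The bootstrap procedure for IPTW, with the propensity score re-estimated in every replicate, can be sketched as follows (illustrative Python/NumPy, not the study's Stata code; the data and the reduced replicate count here are assumptions for the example):

```python
import numpy as np

def iptw_estimate(x, z, y):
    """IPTW treatment effect: logistic propensity model of z on x
    (Newton-Raphson), IPT weights, then weighted regression of y on z."""
    n = len(y)
    D = np.column_stack([np.ones(n), x])
    alpha = np.zeros(D.shape[1])
    for _ in range(25):                          # Newton-Raphson iterations
        p = 1.0 / (1.0 + np.exp(-D @ alpha))
        alpha += np.linalg.solve(D.T @ (D * (p * (1.0 - p))[:, None]),
                                 D.T @ (z - p))
    e = 1.0 / (1.0 + np.exp(-D @ alpha))
    w = np.where(z == 1.0, 1.0 / e, 1.0 / (1.0 - e))
    T = np.column_stack([np.ones(n), z])
    return np.linalg.solve(T.T @ (T * w[:, None]), T.T @ (w * y))[1]

def bootstrap_se(x, z, y, n_boot=1000, rng=None):
    """Non-parametric bootstrap SE of the IPTW estimate; the propensity
    score is re-estimated within every bootstrap replicate."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    est = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # resample subjects with replacement
        est[b] = iptw_estimate(x[idx], z[idx], y[idx])
    return est.std(ddof=1)

# Illustrative data (effect 5 and SD 5 mirror the simulation design).
rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
z = rng.integers(0, 2, size=n).astype(float)
y = 2.0 * x + 5.0 * z + rng.normal(scale=5.0, size=n)
se_boot = bootstrap_se(x, z, y, n_boot=200, rng=rng)
```

Re-fitting the propensity model inside each replicate is what lets the bootstrap capture the uncertainty that the IPTW-W sandwich estimator accounts for analytically.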
For each scenario and analysis method we calculated the mean treatment effect and mean \( \hat{SE} \) over the 2000 replicated data sets. Mean percentage bias was computed as \( \left(\hat{\theta}-\theta \right)/\theta \times 100 \). We compared the mean estimated standard error, \( \hat{SE} \), to the empirical SE of the observed treatment effect over the 2000 simulations, \( SE\left(\hat{\theta}\right) \), and computed the ratio of the mean estimated SE to the empirical SE. We also calculated the coverage of the 95% CI's as the percentage of estimated 95% CI's that included the true value of the treatment effect (θ = 5). We used 2000 simulations for each scenario so that an empirical percentage coverage greater than 96% or less than 94% could be considered significantly different from the desired 95%.
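The performance measures above are straightforward to compute once the per-replicate estimates are collected; a minimal Python/NumPy sketch (using normal-based CIs for simplicity, whereas the study used the t-distribution for the regression analyses):

```python
import numpy as np

def performance(theta_hats, se_hats, theta=5.0, crit=1.96):
    """Summarise a simulation scenario: mean % bias, ratio of mean
    estimated SE to empirical SE, and 95% CI coverage (%)."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    se_hats = np.asarray(se_hats, dtype=float)
    pct_bias = (theta_hats.mean() - theta) / theta * 100
    emp_se = theta_hats.std(ddof=1)          # empirical SE over replicates
    se_ratio = se_hats.mean() / emp_se       # ~1 when SEs are estimated well
    lo = theta_hats - crit * se_hats
    hi = theta_hats + crit * se_hats
    coverage = np.mean((lo <= theta) & (theta <= hi)) * 100
    return pct_bias, se_ratio, coverage

# Hypothetical well-behaved estimator: unbiased, correctly estimated SEs.
rng = np.random.default_rng(5)
theta_hats = 5.0 + rng.normal(scale=0.5, size=2000)
se_hats = np.full(2000, 0.5)
bias, ratio, cover = performance(theta_hats, se_hats)
```

For such an estimator the bias is near 0%, the SE ratio near 1 and the coverage near the nominal 95%, which is the benchmark the simulation results are judged against.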
Case study: the ADAPT trial
The Atopic Dermatitis Anti-IgE Paediatric Trial (ADAPT), conducted by Chan et al. (2018), was a double-blind placebo-controlled trial of omalizumab (anti-IgE therapy) amongst children with severe atopic eczema. A total of 62 participants were randomised to receive treatment for 24 weeks (30 omalizumab: 32 placebo) at a single specialist centre, stratified by age (< 10, ≥10 yrs) and IgE (≤1500, > 1500 IU/ml). The primary objective was to establish whether omalizumab was superior to placebo with respect to disease severity. Outcomes of eczema severity included the total SCORing Atopic Dermatitis (SCORAD) index and the Eczema Area and Severity Index (EASI). Quality of life was assessed using the (Children's) Dermatological Life Quality Index ((C)DLQI). Full details of the trial protocol, statistical analysis plan and results have been published elsewhere [25,26,27]. Analysis followed the intention-to-treat principle, including all individuals who received treatment as randomised and had available follow-up. Two participants who were missing week 24 follow-up were not included in our analyses, since the focus of these evaluations is not on missing data.
For each outcome (the total SCORAD, EASI and (C)DLQI) a linear regression model adjusted for the baseline outcome, IgE and age was fitted. We then implemented IPTW using the propensity score, estimated via a logistic regression including baseline outcome, IgE and age as covariates. The outcome model used in the IPTW analysis was a linear regression of the outcome on treatment, weighted by the inverse probability of treatment weights derived from the estimated propensity score. The variance of the IPTW treatment estimate was computed using the IPTW-W variance estimator that incorporates the uncertainty in the estimated propensity score (see Eq. 1 in Additional file 1 and Stata code in Additional file 2). Following Williamson et al., we calculated 95% CI's and p-values for the IPTW treatment estimate using the normal distribution. For the adjusted regression analyses we used the t-distribution. For both methods we also calculated the bootstrap variance using 10,000 bootstrap samples drawn with replacement; for IPTW, the propensity score was re-estimated in each replicate. For comparison we also performed an unadjusted analysis for each outcome, using a linear regression of outcome on treatment group only. Our focus is on evaluating the performance of IPTW against linear regression, rather than the clinical interpretation of the trial results, which has been discussed elsewhere. All statistical analysis was conducted using Stata [23]. Stata code for the analysis can be found in Additional file 2.