Abstract
This paper studies model-based and design-based approaches for the analysis of data arising from a stepped wedge randomized design. Specifically, for different scenarios we compare robustness, efficiency, Type I error rate under the null hypothesis, and power under the alternative hypothesis for the leading analytical options, including generalized estimating equations (GEE) and linear mixed model (LMM)-based approaches. We find that GEE models with exchangeable correlation structures are more efficient than GEE models with independent correlation structures under all scenarios considered. The model-based GEE Type I error rate can be inflated when applied with a small number of clusters, but this problem can be solved using a design-based approach. As expected, correct model specification is more important for LMM (compared to GEE) since the model is assumed correct when standard errors are calculated. However, in contrast to the model-based results, the design-based tests for LMM models under scenarios with a random treatment effect show Type I error inflation even though the fitted models perfectly match the corresponding data-generating scenarios. Therefore, greater robustness can be realized by combining GEE and permutation testing strategies.
Introduction
A stepped wedge cluster randomized trial design is a type of one-way crossover design in which each cluster starts under a reference or control condition and then crosses over to a treatment condition at a randomly determined time point [6]. Eventually, at the last time point, all clusters receive treatment during the final study time period. The unique control-to-treatment crossover patterns are referred to as “sequences” (e.g., in Fig. 1, the stepped wedge design has 4 sequences). In contrast, in a parallel cluster randomized design, half of the clusters are (usually) randomly assigned to the intervention and half to the control at the beginning of the trial with no planned crossover. A stepped wedge design is also different from a cluster randomized crossover design in which each cluster is randomly assigned to cross over from control to treatment or treatment to control (possibly more than once). In both crossover and stepped wedge trials, a washout time period may be included between intervention and control periods in order to make sure one condition does not affect the other or to allow individuals enrolled under one condition to complete their intervention before their cluster changes conditions. Figure 1 illustrates the settings for a traditional crossover design, a parallel design, and a stepped wedge design [6].
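The crossover pattern just described can be made concrete in code. The sketch below (our own illustration, not from the paper; the function name and layout are assumptions) builds the treatment indicator matrix for a complete stepped wedge like the 4-sequence design in Fig. 1:

```python
import numpy as np

def stepped_wedge_indicators(n_seq, clusters_per_seq):
    """Treatment indicators X[i, t] (0 = control, 1 = treatment) for a
    complete stepped wedge: sequence s crosses over at time s + 1, so
    with T = n_seq + 1 periods every cluster starts under control and
    all clusters are treated in the final period."""
    T = n_seq + 1
    rows = []
    for s in range(n_seq):
        rows += [[0] * (s + 1) + [1] * (T - s - 1)] * clusters_per_seq
    return np.array(rows)

X = stepped_wedge_indicators(n_seq=4, clusters_per_seq=5)  # 20 clusters, 5 periods
print(X.shape)        # (20, 5)
print(X[0], X[-1])    # [0 1 1 1 1] [0 0 0 0 1]
```

Note that the first column is all zeros and the last all ones, which is why a separate time effect is needed to distinguish secular trends from the intervention.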
Stepped wedge cluster randomized trials have become increasingly popular in recent years for a number of reasons [13]. For example, in the field of HIV prevention and treatment, as governments and public health agencies have begun to focus on effective implementation of proven interventions, stepped wedge studies are often used during program rollout to assess real-world effectiveness. In Killiam et al. [10], for instance, a stepped wedge design was used to evaluate whether integrating antiretroviral therapy (ART) into antenatal care clinics increased the proportion of HIV-infected pregnant women initiating ART during pregnancy, compared to the standard approach of referral for ART.
Another reason to consider a stepped wedge design is that it may be logistically or financially impossible to provide the intervention to all participants at once due to resource limitations or geographical constraints [1, 9, 13]. In this case, the stepped wedge design is feasible because only a small fraction of the clusters are required to initiate the intervention at each time point. Also, the stepped wedge design is useful when it is not ethical or practical to withhold or withdraw treatment, but logistical constraints prevent immediate provision of the intervention, since all participants are able to receive the intervention eventually [14]. For example, in the field of sexually transmitted infections (STI) prevention, Golden et al. [3] used a stepped wedge design to assess the impact of an intervention to reduce STI burden as the program was implemented across Washington state. Finally, the longitudinal nature of the stepped wedge design allows one to study changes in the effectiveness of the intervention over time by modeling the effects of time [7, 19].
Despite the increased adoption of stepped wedge designs, there are a number of important analytical issues that need additional careful study in order to provide practical recommendations. Key issues include the impact of a small number of clusters, and robustness to model assumptions such as additional sources of variation due to heterogeneity in time effects or in treatment effects. We are interested in the performance of marginal and random effect models for evaluating the treatment effect in the stepped wedge design from both a model-based (inference based on distributional assumptions) and design-based (inference based on reference to the permutation distribution implied by the study design) perspective (see Sect. 2.3 for further discussion). Ji et al. [9] found that model-based inference on the treatment effect in stepped wedge designs using linear mixed models (LMM) is sensitive to model misspecification, such as failing to account for cluster-by-time interactions in the data. Therefore, there is a real practical risk that simple model-based inference may provide inaccurate standard errors and invalid Type I error rates. Ji et al. [9] also considered permutation tests and found that the permutation test provided tight control of Type I error rates under the scenarios they investigated. Thompson et al. [17] compared cluster-level parametric and nonparametric within-period estimates of treatment effects to the standard mixed-effect model-based inference. Ultimately, the parametric within-period model was not recommended due to its below-nominal coverage levels under some scenarios. The nonparametric within-period estimator was less efficient than the mixed-effect model approach when period effects were common to all clusters or the number of clusters varied. Furthermore, the estimate of the treatment effect from cluster-level methods was consistently larger than that from the mixed-effect model.
Therefore, important gaps exist in terms of what conditions are required for the validity and efficiency of common alternative analysis methods.
In the current study, we conduct both model-based (asymptotic) and design-based analyses at the individual level using linear mixed models (LMM) and generalized estimating equations (GEE) approaches under both null and alternative conditions for a variety of data-generating scenarios. We include scenarios with random treatment effects where the impact of the treatment depends on the specific cluster to which it is applied, a situation that was not investigated by Ji et al. [9]. We also consider a varying number of clusters. We specifically seek to characterize treatment effect estimation bias, standard error accuracy, and Type I error rates under null conditions, and power under alternative conditions for the different analysis strategies. In Sect. 2, we describe the simulation scenarios and models we use for our study. In Sect. 3, we present results comparing efficiency, robustness, and power among the analysis methods. In Sect. 4, we summarize our findings and discuss future steps.
Methods
Data-Generating Model
We generated normally distributed data with an identity link corresponding to a balanced, complete, cross-sectional stepped wedge design with 5 time points (T = 5) and either twenty (I = 20) or forty clusters (I = 40). The design structure is shown in the third panel of Fig. 1. One hundred observations (n = 100) were generated for each cluster at each time point for a total sample size of N = 10,000 (I = 20) or 20,000 (I = 40). We are envisioning common public health studies or clinical delivery investigations within health care systems where a moderate number of clusters (i.e., villages, hospitals) are available but relatively large population sizes are under study. Let Y_{ijt} be the response for individual j in cluster i at time t (i = 1, …, 20 or 40; j = 1, …, 100; t = 1, …, 5). We generate data from the model

\( Y_{ijt} = \mu + a_{i} + \beta_{t} + X_{it} (\theta + c_{i}) + e_{ijt}, \)
where μ is the overall mean, a_{i} is a random effect for cluster i where a_{i} ~ N(0, τ^{2}), β_{t} is the (categorical) fixed effect of time point t, X_{it} is the treatment indicator (0 = control; 1 = treatment) for cluster i at time t, θ is the fixed or average treatment effect, c_{i} is a random cluster-specific treatment effect where c_{i} ~ N(0, ν^{2}), and e_{ijt} is a random error where e_{ijt} ~ N(0, σ^{2}). We assume that Corr(a_{i}, c_{i}) = ρ (possibly 0) and e_{ijt} is independent of a_{i} and c_{i}.
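A minimal simulation sketch of this data-generating model follows (our own Python rendering under the paper's parameterization; the paper's simulations were run in R, so function and argument names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(I=20, T=5, n=100, mu=0.0, beta=(0.0, 0.2, 0.3, 0.4, 0.5),
             theta=0.0, tau2=4.0, nu2=4.0, rho=0.0, sigma2=1.0):
    """One realization of Y_ijt = mu + a_i + beta_t + X_it*(theta + c_i) + e_ijt
    for a complete stepped wedge with T - 1 sequences of I/(T-1) clusters."""
    # (a_i, c_i) jointly normal with Corr(a_i, c_i) = rho
    cov = [[tau2, rho * (tau2 * nu2) ** 0.5],
           [rho * (tau2 * nu2) ** 0.5, nu2]]
    a, c = rng.multivariate_normal([0.0, 0.0], cov, size=I).T
    # treatment indicators: sequence s crosses over at time s + 1
    X = np.array([[0] * (s + 1) + [1] * (T - s - 1)
                  for s in range(T - 1) for _ in range(I // (T - 1))])
    e = rng.normal(0.0, sigma2 ** 0.5, size=(I, T, n))
    Y = (mu + a[:, None, None] + np.asarray(beta)[None, :, None]
         + X[:, :, None] * (theta + c[:, None, None]) + e)
    return Y, X

Y, X = simulate()
print(Y.shape, X.shape)   # (20, 5, 100) (20, 5)
```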
Simulation Scenarios
Table 1 shows the nine data-generating scenarios used for our simulation studies. We investigated scenarios with different numbers of clusters (20 vs. 40) under the null condition to understand the effect of the number of clusters on the Type I error rate [11]. All scenarios contain a fixed treatment effect (θ = 0 under the null condition; under alternative conditions the value of θ was varied by scenario to achieve power between 10 and 90%), a time effect (β_{1} = 0, β_{2} = 0.2, β_{3} = 0.3, β_{4} = 0.4, β_{5} = 0.5 for t = 1, …, 5 under all conditions), and a random cluster effect (τ^{2} = 4). The error variance (σ^{2}) is equal to 1 in all simulations. For these variance components the intraclass correlation coefficient (ICC), defined as \( \tau^{2} /(\sigma^{2} + \tau^{2} ) \), is equal to 0.8. A random treatment effect (ν^{2} = 4) is also included in some scenarios. When a random treatment effect is included, we allow it to be uncorrelated (ρ = 0) or correlated (ρ = 0.3) with the random cluster effect.
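As a quick arithmetic check of the stated ICC under these variance components:

```python
# ICC = tau^2 / (sigma^2 + tau^2) with tau^2 = 4 and sigma^2 = 1
tau2, sigma2 = 4.0, 1.0
icc = tau2 / (sigma2 + tau2)
print(icc)  # 0.8
```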
We generate 500 realizations under each scenario, allowing Type I error rate estimates to be accurate to within ±0.02 due to Monte Carlo variation.
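The stated ±0.02 accuracy follows from the usual binomial Monte Carlo standard error for a rejection rate near the nominal 0.05 level; a quick check:

```python
import math

# 95% margin of error for an estimated rejection rate near 0.05,
# based on 500 independent simulated realizations
margin = 1.96 * math.sqrt(0.05 * (1 - 0.05) / 500)
print(round(margin, 3))  # 0.019, i.e., roughly +/- 0.02
```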
Approaches to Analysis
We fit each simulated dataset in R using two GEE models (R package gee) and four LMM models (R package lme4) using both standard model-based inference and design-based inference. All models were fit to individual-level data. All inferences using GEE are based on robust (sandwich) variances [2]. We estimate bias, variance, and Type I error rate under the null hypothesis, and power under alternative hypotheses. Power was only investigated in the 40-cluster cases to focus on scenarios where the Type I error rates were (generally) close to nominal levels.
GEE Approaches
We investigate models with independent (G1) and exchangeable (G2) working correlation structures. The exchangeable working correlation structure, in which the correlation between observations within a cluster is assumed constant [11], is often chosen in the analysis of stepped wedge trials since it captures a common source of correlation. However, GEE is known to be asymptotically robust to misspecification of the working correlation structure because GEE uses robust (sandwich) variance estimates, which are widely valid provided there are a large number of clusters [2].
We conduct both model-based and design-based tests using GEE. For the model-based tests, we compare the robust z-score (estimate divided by its robust standard error) of the intervention effect from the GEE analysis to the standard normal distribution. GEE tends to inflate Type I error rates when the number of clusters is small [15], so we expect that performance may be better for 40 clusters compared to 20 clusters.
For the design-based analyses, we permute the stepped wedge sequences among clusters and investigate the use of both the estimated intervention effect and the robust z-score as a test statistic. We reject the null when the test statistic from the observed dataset is smaller than the 2.5th percentile or larger than the 97.5th percentile of the permutation distribution.
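The re-randomization scheme just described can be sketched generically as follows. This is our own illustration, not the paper's code: `stat_fn` stands in for either the estimated intervention effect or its robust z-score, and the toy statistic below is a simple difference of cluster-period means rather than a GEE fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_reject(stat_fn, data, sequences, n_perm=1000):
    """Design-based test: permute the stepped wedge sequence labels across
    clusters, rebuild the permutation distribution of the test statistic,
    and reject when the observed value lies outside its central 95%."""
    observed = stat_fn(data, sequences)
    null = np.array([stat_fn(data, rng.permutation(sequences))
                     for _ in range(n_perm)])
    lo, hi = np.percentile(null, [2.5, 97.5])
    return observed < lo or observed > hi

# Toy statistic: treated-minus-control difference of cluster-period means,
# where y is an I x 5 array of cluster-period means and seqs[i] gives the
# sequence (0-3) of cluster i.
def diff_means(y, seqs):
    X = np.array([[0] * (s + 1) + [1] * (5 - s - 1) for s in seqs])
    return y[X == 1].mean() - y[X == 0].mean()

seqs = np.repeat(np.arange(4), 5)                            # 20 clusters, 4 sequences
X_true = np.array([[0] * (s + 1) + [1] * (5 - s - 1) for s in seqs])
y = 10.0 * X_true                                            # a very large treatment effect
print(bool(permutation_reject(diff_means, y, seqs)))         # True
```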
LMM Approaches
Table 2 shows four LMM models fit to simulation scenarios S1–S9. Tests are again conducted using both model-based and design-based tests. We expect that LMM will be less robust to misspecification of the variance structure than GEE since standard errors are computed under the assumed covariance for the outcomes. If the random effect model structure is misspecified, then the model-based variance in LMM will be invalid and an inflated Type I error rate may result.
For the design-based tests, similar to the approach outlined for GEE, we investigate both unstandardized and standardized intervention effect estimates as a test statistic and reject the null when the observed test statistic is smaller than the 2.5th percentile or larger than the 97.5th percentile of the permutation distribution.
Results
GEE Asymptotic Inference
In the scenarios that we studied, the GEE estimators with different correlation structures (G1 and G2) both give unbiased estimates of the treatment effect (Table 3). GEE with an exchangeable correlation structure (G2) leads to a smaller empirical variance compared to G1, indicating higher efficiency due to the fact that the exchangeable structure corresponds more closely to the true correlation structure and therefore provides an optimal weighted estimator. The efficiency advantage remained true even in scenarios with a random treatment effect (e.g., S2, S3, S5, S6), which does not correspond to a simple exchangeable correlation structure. For 20 clusters, the Type I error rate is inflated to approximately 0.10. As we change the number of clusters from 20 to 40, the estimated sandwich variance more closely approximates the true sampling variance, so the Type I error rate approaches 0.05. The presence of correlation between the random cluster and random treatment effects (e.g., S2 vs. S3 or S5 vs. S6) does not meaningfully affect the results.
Under alternative conditions (S7–S9), we choose an effect size of 0.08 for S7 and an effect size of 1.00 for S8 and S9 to investigate power. The GEE model with an exchangeable correlation structure (G2) is much more efficient than the GEE model with independent correlation structure (G1) due to the large ICC (see Sect. 2.2) in these data.
LMM Asymptotic Inference
Table 4 shows results from fitting LMM models L1–L4 to the nine scenarios. All models give unbiased treatment effect estimates under all scenarios. Also, the Type I error rates for models with a random treatment effect are all close to the nominal 0.05 level. Not surprisingly, the Type I error rate is significantly inflated for the model that assumes independent data (L1) under all scenarios. For analysis models without a random treatment effect (L1, L2), the Type I error rates are also far above the nominal level under simulation scenarios that include a random treatment effect (e.g., S2, S3, S5, S6). Interestingly, whether the random cluster effect and random treatment effect are modeled as correlated or not does not seem to have a significant effect on the statistical Type I error rates (compare L3 vs. L4). In addition, the cross-simulation variance of the intervention effect estimate is not noticeably different between models L2–L4, suggesting that it is preferable to overfit than to underfit a model. Based on the results in Tables 3 and 4, our simulations validate theoretical predictions that correct model specification is more important for LMM compared to GEE.
Under alternative conditions, we choose an effect size of 0.08 for S7 and an effect size of 0.80 for S8 and S9 to evaluate power. The power for L3 and L4 is similar; therefore, there is not much difference in their efficiency.
GEE Permutation Test
In addition to the GEE model-based analysis shown in Table 3, we also conducted GEE design-based analyses. We provide results based on both the permutation distribution of the estimated treatment effect parameter (Table 5) as well as the permutation distribution of the robust z-statistic (Table 6).
Table 5 illustrates several interesting findings. In scenarios with no random treatment effect (S1, S4), both G1 and G2 maintain the nominal Type I error rate and do not show the Type I error inflation with smaller numbers of clusters that was observed in the model-based analysis. Table 5 shows some evidence of a small Type I error inflation for G1 under scenarios that include random treatment effects (S2, S3, S5, and S6). However, the Type I error rate for the GEE model with exchangeable correlation structure (G2) is significantly inflated under scenarios with a random treatment effect.
Interestingly, when the permutation test is based on the robust z-statistic (Table 6), the Type I error rate inflation largely disappears. The Type I error rates for G1 are all close to 0.05. Model G2 now shows only slight Type I error inflation under scenarios that include a random treatment effect.
Under the alternative condition, we generated data with θ = 0.21 for S7 and θ = 0.80 for S8 and S9. Similar to the asymptotic results (Sect. 3.1), the model with exchangeable correlation structure (G2) has more power than the model with independent correlation structure (G1), although in Table 5 this is partly due to the inflated Type I error rate previously noted for G2.
LMM Permutation Test
We also conducted design-based tests using the permutation distribution of the LMM-based estimated treatment effects and z-statistics (Tables 7 and 8, respectively). We note that the treatment effect permutation distributions (but not the z-statistic permutation distributions) for models L1 and L2 are identical to those for models G1 and G2, respectively, and so the results in these simulations are similar.
Table 7 shows permutation results using the treatment effect coefficients. Interestingly, the model that assumes independence (L1) shows little to no Type I error inflation, while all the non-independence models (L2–L4) show significant Type I error inflation under scenarios with random treatment effects. For this reason, power comparisons are difficult, except under scenario S7 (no random treatment effect) where we find that models that include a correlation structure similar to the data-generating mechanism have higher power than the independence model.
In Table 8, we see that design-based tests using z-statistics give similar results to the design-based tests based on coefficients (Table 7) for L1–L2 over all scenarios. In contrast, the Type I error inflation noted for models L3 and L4 under scenarios with a random treatment effect (S2, S3, S5, S6) in Table 7 is reduced, but not eliminated, using a design-based test based on z-statistics (Table 8). In addition, unlike Table 7, there is some suggestion in Table 8 that increasing the number of clusters from 20 to 40 reduces the Type I error inflation for L2–L4 under scenarios with a random treatment effect. It is notable that, in contrast to the model-based LMM results, design-based tests using LMM models L3 and L4 give inflated Type I error rates under scenarios with a random treatment effect (S2, S3, S5, S6) even though the models perfectly match the corresponding scenarios.
Since L1–L4 all have nominal Type I error rates only under scenarios without a random treatment effect, we only compare power under scenario S7. The power for L2, L3, and L4 is similar under S7, while the power for L1 is much lower.
Discussion
We conducted both model-based and design-based analyses of data from stepped wedge study designs to compare the robustness, efficiency, and Type I error rate under null conditions and power under alternative conditions among GEE and LMM models for each of nine data-generating scenarios. In general, for model-based analyses correct model specification is more important for LMM compared to GEE, and overspecification of LMM models performs better than underfitting. Specifically, the model-based results show that LMM models with random cluster and treatment effects produce similar levels of bias, efficiency, and Type I error rate as the correctly specified LMM model, even if there is no random treatment effect in the correctly specified model. In contrast, if a random treatment effect truly exists, the model-based results for an LMM without a random treatment effect show an inflated Type I error rate.
In model-based analyses, the number of clusters has a greater effect on Type I error rates in GEE than LMM. As we increase the number of clusters from 20 to 40, the model-based GEE simulations provide Type I error rates closer to the nominal level. In contrast, the Type I error rates for model-based LMM simulations are close to nominal levels even for 20 clusters when the analysis model matches the data-generating scenario. Westgate et al. [11, 12, 18] have investigated the effect of various corrections to GEE when the number of clusters is small in the context of parallel design cluster randomized trials. The application of these methods to stepped wedge trials has not been investigated, although Taljaard et al. [16] noted some of the risks associated with too few clusters in stepped wedge trials. Additional research on finite sample size corrections, the effect of the number of clusters, and the effect of (possibly variable) cluster size in the context of stepped wedge trials (with both linear and nonlinear links) is needed.
Using permutation tests, the primary quantities of interest are the Type I error rate under null conditions and the power under alternative conditions. Permutation tests do not naturally provide estimates of the treatment effect or confidence intervals (although Hughes et al. [8] describe a design-based procedure for stepped wedge models that gives estimates, confidence intervals, and valid tests). Using a permutation procedure, GEE models show similar Type I error rates under all the scenarios investigated when the permutation test is based on robust z-scores. In addition, Type I error rates are not sensitive to the number of clusters when the permutation distribution is used for testing. GEE models with an exchangeable working correlation structure show greater power than models with an independence working correlation structure. However, in scenarios with a random treatment effect, permutation tests from a GEE model with exchangeable correlation structure show significantly inflated Type I error rates when based on the estimated treatment effect coefficient but only minor inflation when based on robust z-statistics.
Design-based tests using LMM models produced some surprising findings. When a random treatment effect is included in the data generation, the design-based tests using LMM models show inflated Type I error rates even if the underlying model is correctly specified. This is likely due to the fact that the inclusion of a random treatment effect leads to a different covariance matrix for each sequence [8] and thereby violates the assumption of exchangeability under the null hypothesis, which is required for permutation tests [4]. The magnitude of the effect of this violation on the Type I error rate will depend on the relative magnitude of the variance components.
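The exchangeability argument can be made concrete: under the Sect. 2.1 variance components, the covariance across periods of a cluster's random contribution a_{i} + X_{it} c_{i} depends on the cluster's treatment path. A small check of this (our own illustration, with the covariance written out from the model's random effects):

```python
import numpy as np

def cluster_cov(x, tau2=4.0, nu2=4.0, rho=0.0):
    """Covariance across periods of a_i + x_t * c_i for one cluster whose
    treatment path is x, with Var(a_i) = tau2, Var(c_i) = nu2, and
    Corr(a_i, c_i) = rho: Cov_{t,t'} = tau2 + nu2*x_t*x_t'
    + rho*sqrt(tau2*nu2)*(x_t + x_t')."""
    x = np.asarray(x, dtype=float)
    ones = np.ones_like(x)
    cross = rho * (tau2 * nu2) ** 0.5
    return (tau2 * np.outer(ones, ones) + nu2 * np.outer(x, x)
            + cross * (np.outer(x, ones) + np.outer(ones, x)))

first = cluster_cov([0, 1, 1, 1, 1])   # earliest-crossing sequence
last = cluster_cov([0, 0, 0, 0, 1])    # latest-crossing sequence
print(np.array_equal(first, last))     # False: sequences differ, so clusters
                                       # are not exchangeable under the null
print(np.array_equal(cluster_cov([0, 1, 1, 1, 1], nu2=0.0),
                     cluster_cov([0, 0, 0, 0, 1], nu2=0.0)))
                                       # True once the random treatment
                                       # effect is removed
```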
In this study, we used large random effect variances relative to the error variance (e.g., ICC equal to 0.8). This approach has allowed us to identify scenarios that lead to Type I error inflation with a relatively limited number of simulations. However, this also suggests that in applying our results to applications with smaller (relative) random effect variances, researchers should be most concerned about the scenarios where we find moderate to large Type I error inflation. In addition, all the simulations presented here used a linear link and normal errors and are all based on individual-level analyses of the data. Further research is needed using models with nonlinear links such as binary data or that use cluster-level methods, as noted by Thompson et al. [17].
We have shown areas of strength and weakness of model-based and design-based analyses of stepped wedge designs. We believe these results will help guide practitioners in choosing approaches to the analysis of data from stepped wedge designs.
References
1. Brown CA, Lilford RJ (2006) The stepped wedge trial design: a systematic review. BMC Med Res Methodol 6:54
2. Diggle P, Heagerty P, Liang K, Zeger S (2002) Analysis of longitudinal data, 2nd edn. Oxford University Press, Oxford
3. Golden MR, Kerani RP, Stenger M, Hughes JP, Aubin M, Malinski C, Holmes KK (2015) Uptake and population-level impact of expedited partner therapy (EPT) on Chlamydia trachomatis and Neisseria gonorrhoeae: the Washington State Community-level Randomized Trial of EPT. PLoS Med 12(1):e1001777
4. Good P (2005) Permutation, parametric and bootstrap tests of hypotheses, 3rd edn. Springer, New York
5. Hooper R, Teerenstra S, de Hoop E, Eldridge S (2016) Sample size calculation for stepped wedge and other longitudinal cluster randomized trials. Stat Med 35:4718–4728
6. Hussey MA, Hughes JP (2007) Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 28:182–191
7. Hughes JP, Granston TS, Heagerty PJ (2015) Current issues in the design and analysis of stepped wedge trials. Contemp Clin Trials 45:55–60
8. Hughes JP, Heagerty PJ, Xia F, Ren Y (2019) Robust inference in the stepped wedge design. Biometrics. https://doi.org/10.1111/biom.13106
9. Ji X, Fink G, Robyn PJ, Small DS (2017) Randomization inference for stepped-wedge cluster-randomized trials: an application to community-based health insurance. Ann Appl Stat 11:1–20
10. Killiam WP, Tambatamba BC, Chintu N, Rouose D, Stringer E, Bweupe M, Yu Y, Stringer JSA (2010) Antiretroviral therapy in antenatal care to increase treatment initiation in HIV-infected pregnant women: a stepped-wedge evaluation. AIDS 24:85–91
11. Leyrat C, Morgan EK, Leurent B, Kahan CB (2018) Cluster randomized trials with a small number of clusters: which analyses should be used? Int J Epidemiol 47:321–331
12. Li P, Redden DT (2015) Comparing denominator degrees of freedom approximations for the generalized linear mixed model in analyzing binary outcome in small sample cluster-randomized trials. BMC Med Res Methodol 15:38
13. Mdege ND, Man MS, Taylor CA, Torgerson DJ (2011) Systematic review of stepped wedge cluster randomized trials shows that design is particularly used to evaluate interventions during routine implementation. J Clin Epidemiol 64:936–948
14. Rhoda DA, Murray DM, Andridge RR, Pennell ML, Hade EM (2011) Studies with staggered starts: multiple baseline designs and group-randomized trials. Am J Publ Health 101:2164–2169
15. Sharples K, Breslow N (1992) Regression analysis of correlated binary data: some small sample results for the estimating equation approach. J Stat Comput Simul 42:1–20
16. Taljaard M, Teerenstra S, Ivers NM, Fergusson DA (2016) Substantial risks associated with few clusters in cluster randomized and stepped wedge designs. Clin Trials 13:459–463
17. Thompson JA, Davey C, Fielding K, Hargreaves JR, Hayes RJ (2018) Robust analysis of stepped wedge trials using cluster-level summaries within periods. Stat Med 37:2487–2500
18. Westgate PM (2013) On small-sample inference in group randomized trials with binary outcomes and cluster-level covariates. Biom J 5:789–806
19. Woertman W, de Hoop E, Moerbeek M, Zuidema SU, Gerritsen DL, Teerenstra S (2013) Stepped wedge designs could reduce the required sample size in cluster randomized trials. J Clin Epidemiol 66:752–758
Funding
This research was supported by the National Institute of Allergy and Infectious Diseases Grant AI29168 and PCORI contract ME150731750.
Appendix: R, Stata, and SAS code
Here we present basic R, Stata, and SAS code for fitting common models for stepped wedge designs with cross-sectional data collection at each time point. See [5,6,7].

I. Linear mixed models

(1) Random cluster effect: \( Y_{ijt} = \mu + a_{i} + \beta_{t} + X_{it} \theta + e_{ijt} \)

(2) Random cluster and cluster × time effect: \( Y_{ijt} = \mu + a_{i} + \beta_{t} + b_{it} + X_{it} \theta + e_{ijt} \)

(3) Random cluster, cluster × time, and treatment effect (corr(a_{i}, c_{i}) = 0): \( Y_{ijt} = \mu + a_{i} + \beta_{t} + b_{it} + X_{it} (\theta + c_{i}) + e_{ijt} \)

(4) Random cluster, cluster × time, and treatment effect (corr(a_{i}, c_{i}) = ρ): \( Y_{ijt} = \mu + a_{i} + \beta_{t} + b_{it} + X_{it} (\theta + c_{i}) + e_{ijt}, \)

where a_{i} ~ N(0, τ^{2}), b_{it} ~ N(0, γ^{2}), c_{i} ~ N(0, ν^{2}), and e_{ijt} ~ N(0, σ^{2}).

II. Generalized estimating equation models

(5) Independent working correlation

(6) Exchangeable working correlation
Cite this article
Ren, Y., Hughes, J.P. & Heagerty, P.J. A Simulation Study of Statistical Approaches to Data Analysis in the Stepped Wedge Design. Stat Biosci 12, 399–415 (2020). https://doi.org/10.1007/s12561-019-09259-x
Keywords
 Stepped wedge design
 GEE
 LMM
 Permutation test
 Simulation