The random assignment of study conditions to individuals or groups makes the potential outcomes of treated and control groups equivalent in expectation and thereby provides a strong basis for causal inference about the treatment effect (Hong, 2015). In the social sciences, assigning clusters to conditions is generally more feasible than assigning individuals. Furthermore, randomizing individuals might be inappropriate when assessing the impact of interventions that naturally occur in clusters (Barbui & Cipriani, 2011; Donner & Klar, 2004). The use of cluster-randomized trials (CRTs), and especially of multisite cluster-randomized trials (MS-CRTs), has been increasing (Bloom & Spybrook, 2017). In a CRT, clusters of participants are assigned to a condition. An MS-CRT is a CRT in which lower-level clusters are randomly assigned to conditions within each higher-level cluster, so that every higher-level cluster contains lower-level clusters in at least two conditions. For example, an MS-CRT could have teachers within schools assigned to either treatment or control groups, with all students of a teacher in the same condition. By contrast, when the highest-level clusters are assigned to conditions, the design is not an MS-CRT and is typically referred to simply as a CRT. For example, a CRT could have several schools randomly assigned to treatment or control, resulting in a study in which all teachers in a school, and therefore all children in the school, are assigned to the same condition.

In the simplest MS-CRT there are three levels. In educational research, within-school random assignment can yield more efficient studies than random assignment of schools. For a fixed sample size, assigning teachers within a school to different study conditions offers greater statistical power to detect a treatment effect than does random assignment of entire schools; for a fixed target power, random assignment of schools might require up to twice as many schools (Bloom & Spybrook, 2017; Wijekumar, Hitchcock, Turner, Lei, & Peck, 2009). For an extensive discussion of MS-CRTs, readers are referred to Kelcey, Spybrook, Phelps, Jones, and Zhang (2017), Kraemer (2000), Raudenbush and Liu (2000), and Ruud et al. (2013). In this study we focused on analyses of the simplest MS-CRT, in which the second-level units are randomly assigned to treatment conditions from within a third level.

In any variation of cluster-randomized designs, participants’ scores within a cluster are dependent. For example, in an MS-CRT with children nested in classes nested in schools, scores for participants in a class are dependent, as are scores for participants in a school. Three main frameworks address this dependency when investigating a treatment effect: model-based, design-based, and permutation (Feng, Diehr, Peterson, & McLerran, 2001; Gardiner, Luo, & Roman, 2009; Ghisletta & Spini, 2004; Huang, 2016; Hubbard et al., 2010; Murray et al., 2006; Nevalainen, Oja, & Datta, 2017). However, for an MS-CRT that targets detection of a covariate by treatment interaction, the choices are limited. Permutation tests, a partially model-free approach, do not allow for examining moderation effects and are used infrequently in educational research. A relatively recent approach, cluster bootstrapping, has been shown to produce results similar to those of multilevel modeling (MLM), but it is computationally demanding, especially within a Monte Carlo simulation study (Huang, 2016). In brief, MLM, as a completely model-based approach, and generalized estimating equations (GEE) and cluster-robust standard errors (CRSE), as design-based approaches, are possible alternatives for analyzing MS-CRT data (McNeish, 2014; McNeish & Harring, 2017; McNeish & Wentzel, 2017). Among these three methods, MLM is currently the predominant method in the social sciences (Bauer & Sterba, 2011; McNeish, Stapleton, & Silverman, 2017). However, CRSE offers a convenient alternative, especially for applied researchers familiar with single-level models, and it makes fewer assumptions about random effects than MLM does. Furthermore, comparison studies have supported the use of design-based methods over model-based methods (Gardiner et al., 2009; McNeish et al., 2017; Sterba, 2009; Wu, Wang, & Pei, 2012).
GEE has fewer advantages for MS-CRT data with a continuous outcome than does CRSE; furthermore, if GEE uses an independent working correlation matrix, its results are expected to be identical to those of CRSE (McNeish et al., 2017). In this study, therefore, we investigated the performance of MLM and CRSE using Mplus 7.4 (Asparouhov & Muthén, 2006; Muthén & Muthén, 2015) to detect both a treatment effect and Covariate × Treatment interactions in an MS-CRT setup with a continuous outcome.

Investigating moderation effects in an MLM setting is of relatively recent interest, as compared to single-level models. The importance of the subject was emphasized by Bauer and Curran (2005) and by Preacher, Curran, and Bauer (2006). Mathieu, Aguinis, Culpepper, and Chen (2012) and Aguinis, Gottfredson, and Culpepper (2013) investigated estimation procedures for cross-level interactions. Preacher, Zhang, and Zyphur (2016) advocated examining level-specific moderation and explained the problems with commonly applied procedures in which moderation tests are completed without separating the lower- and higher-level effects into their orthogonal components. Studies addressing level-specific moderation in a multilevel setting have been limited. Ryu (2015), focusing on Level 1 (L1) variables, investigated the effect of centering in a multilevel structural equation framework based on an orthogonal partitioning. One of her two simulation studies is relevant to an MS-CRT design, in which the interaction between a Level 2 (L2) variable and the between-level component of an L1 variable is investigated. Ryu’s approach of orthogonal partitioning does not allow for an interaction between the within-level component of the L1 variable and the L2 variable, due to a homogeneous L1 covariance structure across clusters; hence, she studied the moderation effect at L2 and reported biased estimates due to the use of observed rather than latent means (Lüdtke et al., 2008) with cluster mean centering. A similar insight was provided by Preacher, Zhang, and Zyphur (2016), who suggested using latent decomposition when investigating moderation effects. Latent decomposition refers to decomposing an L1 independent variable around its expected values at higher levels. For example, in a two-level design, the independent variable Xij can be decomposed as Xij = μ + (μj − μ) + (Xij − μj), where μ is the grand expected value and μj is the expected value of Xij for the jth cluster.
The mean μj is referred to as a latent mean (Lüdtke et al., 2008). However, these authors did not examine in detail the bias due to the use of observed means as covariates, and they reported that latent decomposition for a three-level model is not easily feasible. Recent work by Brincks et al. (2017), who investigated the effect of centering in three-level models, also noted the infeasibility of latent decomposition in three-level models. The necessity of using latent decomposition to study the main effect of an L1 reflective variable at L2 was also shown by earlier studies (Croon & van Veldhoven, 2007; Lüdtke et al., 2008; Shin & Raudenbush, 2010). However, an investigation of the bias of the treatment effect estimator in a two-level CRT revealed no bias whether the L2 covariate comprised observed or latent means; furthermore, the statistical power to detect the treatment effect was slightly lower when the latent means were used as the covariate and cluster sizes were small (Aydin, Leite, & Algina, 2016). Given that testing treatment effects is a primary purpose of CRTs, we addressed two questions:

  • Research Question 1: In an MS-CRT, what are the effects of using the observed L2 means as a covariate on (a) the bias of estimation of the treatment effect and the level-specific treatment by covariate (T×C) interactions, and (b) the Type I error rate and power of the test of the treatment effect and level-specific T×C interactions?

  • Research Question 2: Do the effects of using L2 observed means as a covariate differ between design-based and model-based approaches?

To answer these questions, we considered the simplest MS-CRT, in which k = 1, . . ., K Level 3 (L3) units are randomly selected from a population; j = 1, . . ., J L2 units are randomly selected within each L3 unit and randomly assigned to a treatment or control group with equal probabilities; and i = 1, . . ., n L1 units are selected in each L2 unit. At L1, an outcome (Yijk) and two covariates (X1ijk, X2ijk), all continuous, are assessed. The treatment indicator (Z.jk) is an L2 binary variable. We use the following terms for this three-level structure: A site mean refers to the arithmetic mean of all observations within the kth site, and a cluster mean refers to the arithmetic mean of all observations within the jth L2 cluster within the kth site. The article is structured as follows: We (a) briefly discuss decomposing interactions in an MS-CRT design, (b) introduce the competing approaches to analyze MS-CRT data, (c) explain our simulation study design, (d) report the results, and (e) provide an empirical example and a Discussion section.

Decomposing interactions for an MS-CRT setup

For empirical studies in which the main interest is in the treatment effect itself, investigating the interaction between a treatment indicator and a relevant variable is generally stated as an additional research question. One way to include an interaction term in a multilevel model is to multiply the treatment indicator by the relevant variable and, if necessary, decompose it into different levels by centering the product. This approach was considered by Josephy, Vansteelandt, Vanderhasselt, and Loeys (2015), but its use was criticized by Preacher, Zhang, and Zyphur (2016) because the results are uninterpretable. In this study, we investigated the treatment main effect and Covariate × Treatment interaction due to an L1 moderator (1 × (2→1) design) by decomposing the L1 predictor into between- and within-factor components and then multiplying the components by the treatment indicator. The decomposition of the covariate is \( {X}_{1 ijk}=\left({X}_{1 ijk}-{\overline{X}}_{1. jk}\right)+\left({\overline{X}}_{1. jk}-{\overline{\overline{X}}}_{\mathrm{1..}k}\right)+{\overline{\overline{X}}}_{\mathrm{1..}k} \), and the product terms are \( \left({X}_{1 ijk}-{\overline{X}}_{1. jk}\right){Z}_{. jk} \) at L1 and \( \left({\overline{X}}_{1. jk}-{\overline{\overline{X}}}_{\mathrm{1..}k}\right){Z}_{. jk} \) at L2, where \( {\overline{X}}_{1. jk} \) and \( {\overline{\overline{X}}}_{\mathrm{1..}k} \) represent a cluster mean and a site mean, respectively.
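As a concrete illustration, the decomposition and product terms can be computed from observed means. The following is a toy numpy sketch (sizes, seeds, and variable names are ours, not from the study):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy MS-CRT layout: 2 sites, 4 clusters per site, 5 students per cluster
site = np.repeat(np.arange(2), 20)        # k index for each student
clus = np.repeat(np.arange(8), 5)         # j index for each student
z = (clus % 2).astype(float)              # treatment assigned at the cluster level
x1 = rng.normal(size=40)                  # L1 covariate

# observed cluster and site means, broadcast back to the student level
cluster_mean = np.array([x1[clus == j].mean() for j in np.unique(clus)])[clus]
site_mean = np.array([x1[site == k].mean() for k in np.unique(site)])[site]

x1_within = x1 - cluster_mean             # L1 component
x1_between = cluster_mean - site_mean     # L2 component
tx_l1 = x1_within * z                     # L1 product term
tx_l2 = x1_between * z                    # L2 product term

# the two components plus the site mean reconstruct the raw covariate
assert np.allclose(x1_within + x1_between + site_mean, x1)
```

Note that the product terms are built from the already-decomposed components, rather than by centering a raw product, in line with the approach described above.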

Three approaches to analyze MS-CRT data

In this article, three approaches to address dependency due to the nested structure of an MS-CRT are compared. We utilized MLM, CRSE, and a combination of these two approaches. These approaches were implemented with the Mplus 7.4 software (Muthén & Muthén, 2015). As we noted earlier, multilevel modeling is currently the predominant method among social scientists for analyzing clustered data. For extensive details on three-level models, readers are referred to Moerbeek and Teerenstra (2015), Raudenbush and Bryk (2002), and Snijders and Bosker (2012). These models can be estimated using several different software programs (e.g., SAS, HLM, and the R package nlme or lme4). Both regression coefficients and variance components can be estimated with a multilevel model. A three-level model without covariates for an MS-CRT can be written as

$$ {\displaystyle \begin{array}{l}{Y}_{ijk}={\beta}_{0 jk}+{e}_{ijk}\\ {}{\beta}_{0 jk}={\gamma}_{00k}+{\gamma}_{01k}{Z}_{. jk}+{u}_{0 jk}\\ {}{\gamma}_{00k}={\pi}_{000}+{u}_{00k}\\ {}{\gamma}_{01k}={\pi}_{010}+{u}_{01k}\end{array}} $$
(1)

where Yijk is a continuous L1 outcome and Z.jk is the binary treatment indicator at L2. Fixed effects are represented by π; specifically, π000 is the intercept and π010 is the treatment main effect. The L1 random effect (eijk), L2 random effect (u0jk), and L3 random effects for the intercept (u00k) and the treatment (u01k) are assumed to be normally distributed with a mean of 0 and a covariance matrix T:

$$ T=\left[\begin{array}{llll}{\sigma}^2& & & \\ {}0& {\tau}_{\beta_{0 jk}}& & \\ {}0& 0& {\tau}_{\gamma_{00k}}& \\ {}0& 0& {\tau}_{\gamma_{00k},{\gamma}_{01k}}& {\tau}_{\gamma_{01k}}\end{array}\right] $$
(2)

where σ2 is the within-cluster variance component, \( {\tau}_{\beta_{0 jk}} \) is the variance component due to clusters within sites, \( {\tau}_{\gamma_{00k}} \) represents the variance due to sites, \( {\tau}_{\gamma_{01k}} \) is the treatment effect variance between sites, and \( {\tau}_{\gamma_{00k},{\gamma}_{01k}} \) is the covariance between the site-specific means and the site-specific treatment effects. The statistical power to detect the treatment effect in Eq. 1 varies as a function of the magnitude of π010, the sample size, and the variance at each level including \( {\tau}_{\gamma_{01k}} \) (Bloom & Spybrook, 2017; Spybrook et al., 2011, p. 86). When covariates are added to Eq. 1, the statistical power changes due to the adjustment in π010 and the variance components. The effect of the covariates on the conditional variance depends on the strength of the correlation between the covariates and the outcome, as well as on the correlation between the covariates.

CRSEs can account for dependency due to clusters. As an example of calculating CRSEs, consider a residual-based estimator for standard errors of a single-level model estimated from two-level data. The standard errors can be computed using a sandwich estimator (see, e.g., Raudenbush & Bryk, 2002, p. 277) to produce robust standard errors with large samples. According to Raudenbush and Bryk, the procedure allows for approximately correct tests and confidence intervals, even when the residual for the single-level model is not normally distributed. For relatively small sample sizes, further modifications might be needed (McNeish & Stapleton, 2016).

Several variations of CRSEs are available, at least for a single-level model (McNeish, 2014; McNeish & Harring, 2017; Raudenbush & Bryk, 2002). Mplus provides a CRSE procedure referred to as type = complex, which entails fitting a single-level model to data and correcting the standard errors for clustering at a higher level. Asparouhov (2005) describes the procedure for complex sampling with stratification, clustering, and sampling weights. In the following discussion, we adapt Asparouhov’s description to a design without stratification or sampling weights, the type of design we investigated. Estimates are obtained by maximizing a likelihood defined assuming independence of the observations. Thus, when there are no sampling weights and no stratification, the estimates are ML estimates of a single-level model. Let l be the likelihood based on the independence assumption, lij the contribution to the likelihood by the ith individual in cluster j, \( {z}_j=\sum_i\partial \log \left({l}_{ij}\right)/\partial \theta \), \( \overline{z} \) the average of the zj, and L″ the matrix of the second derivatives of log l. The asymptotic covariance matrix of the estimates is given by \( \left(J/\left(J-1\right)\right){\left({L}^{{\prime\prime}}\right)}^{-1}\sum_j\left(\overline{z}-{z}_j\right){\left(\overline{z}-{z}_j\right)}^{T}{\left({L}^{{\prime\prime}}\right)}^{-1} \), where J is the number of clusters and T denotes the transpose.
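To make the sandwich formula concrete, the sketch below applies the same score-based logic to ordinary least squares on simulated two-level data (all sizes and names are ours; the J/(J − 1) small-sample factor follows the formula above, and because OLS/ML scores sum to zero over the whole sample, the centering term drops out):

```python
import numpy as np

rng = np.random.default_rng(7)
J, n = 30, 10                        # hypothetical: J clusters of n observations
cluster = np.repeat(np.arange(J), n)
u = rng.normal(size=J)[cluster]      # cluster random effect induces dependency
x = rng.normal(size=J * n)
y = 0.5 * x + u + rng.normal(size=J * n)

# single-level (independence) OLS fit, as in the type = complex procedure
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# sandwich: bread = (X'X)^{-1}; meat = sum over clusters of score outer products
bread = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for j in range(J):
    m = cluster == j
    s_j = X[m].T @ resid[m]          # score contribution of cluster j
    meat += np.outer(s_j, s_j)
vcov = (J / (J - 1)) * bread @ meat @ bread   # cluster-robust covariance
crse = np.sqrt(np.diag(vcov))                 # cluster-robust standard errors
```

The point estimates are the ordinary single-level estimates; only the standard errors are corrected for the within-cluster dependency.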

A combination of CRSE and MLM can be employed to analyze a three-level data structure by accommodating the first two levels with MLM and the third level with CRSE (McNeish & Wentzel, 2017; Rabe-Hesketh & Skrondal, 2006). Hence, instead of extensive model building at L3, researchers can focus on a less complex model while accounting for the third-level clustering. Mplus provides a combination procedure that is referred to as type = complex twolevel. Asparouhov and Muthén (2006) have described the procedure to compute standard errors in a multilevel setup that allows for stratification and sampling weights. Assuming no stratification and sampling weights equal to 1 at both L1 and L2, as in this study, the procedure can be described as follows, adapted from Asparouhov and Muthén. Let ljk be the likelihood of the observed data for the jth L2 unit nested in the kth L3 unit, \( l=\prod_{j,k}{l}_{jk} \), L = log(l), and Ljk = log(ljk). CRSEs are computed using \( {\left({L}^{{\prime\prime}}\right)}^{-1}\operatorname{Var}\left({L}^{\prime}\right){\left({L}^{{\prime\prime}}\right)}^{-1} \), where the ′ and ″ refer to the first and second derivatives, respectively, of the log likelihood, and \( \operatorname{Var}\left({L}^{\prime}\right)=\operatorname{Var}\left(\sum_{k,j}{L}_{jk}^{\prime}\right) \). According to Asparouhov and Muthén, the last term “is computed according to the formulas for the variance of the weighted estimate of the total described in Cochran, Chapter 11 (1977) taking the appropriate design into account” (p. 2719).

Monte Carlo simulation study

The data generation process for the MS-CRT data was completed using R (R Core Team, 2016). The simulated datasets were analyzed using Mplus 7.4 (Muthén & Muthén, 2015). The results from Mplus outputs were investigated in a mixed analysis of variance (ANOVA) framework. These three main steps of the Monte Carlo simulation study are presented in this section. The data generation model was a three-level model, presented in Eq. 3:

$$ {\displaystyle \begin{array}{l}{Y}_{ijk}={\beta}_{0 jk}+{\beta}_{1 jk}\left({X}_{1 ijk}-{\mu}_{1 jk}\right)+{\beta}_{2 jk}\left({X}_{1 ijk}-{\mu}_{1 jk}\right){Z}_{. jk}+{\beta}_{3 jk}\left({X}_{2 ijk}-{\mu}_{2 jk}\right)+{e}_{ijk}\\ {}{\beta}_{0 jk}={\gamma}_{00k}+{\gamma}_{01k}{Z}_{. jk}+{\gamma}_{02k}\left({\mu}_{1 jk}-{\mu}_{1k}\right)+{\gamma}_{03k}\left({\mu}_{1 jk}-{\mu}_{1k}\right){Z}_{. jk}+{\gamma}_{04k}\left({\mu}_{2 jk}-{\mu}_{2k}\right)+{u}_{0 jk}\\ {}{\beta}_{1 jk}={\gamma}_{10k}\\ {}{\beta}_{2 jk}={\gamma}_{20k}\\ {}{\beta}_{3 jk}={\gamma}_{30k}\\ {}{\gamma}_{00k}={\pi}_{000}+{\pi}_{001}\left({\mu}_{1k}-{\mu}_1\right)+{\pi}_{002}\left({\mu}_{2k}-{\mu}_2\right)+{u}_{00k}\\ {}{\gamma}_{10k}={\pi}_{100}\\ {}{\gamma}_{20k}={\pi}_{200}\\ {}{\gamma}_{30k}={\pi}_{300}\\ {}{\gamma}_{01k}={\pi}_{010}+{u}_{01k}\\ {}{\gamma}_{02k}={\pi}_{020}\\ {}{\gamma}_{03k}={\pi}_{030}\\ {}{\gamma}_{04k}={\pi}_{040}\end{array}} $$
(3)

where for each of the two L1 covariates s = 1, 2, μs is the grand mean; \( {\omega}_{sk}={\mu}_{sk}-{\mu}_s \) is the random effect for school k, with variance component \( {\tau}_{\omega_s} \); \( {\xi}_{sjk}={\mu}_{sjk}-{\mu}_{sk} \) is the random effect for class j in school k, with variance component \( {\tau}_{\xi_s} \); and \( {R}_{sijk}={X}_{sijk}-{\mu}_{sjk} \) is the random effect for individual i in class j in school k, with variance component \( {\sigma}_{R_s}^2 \). Each of the random effects has a mean of 0. The grand means for the covariates were set equal to 0. The decomposition of the variance of a covariate is \( {\sigma}_{X_{sijk}}^2={\tau}_{\omega_s}+{\tau}_{\xi_s}+{\sigma}_{R_s}^2 \). The variance \( {\sigma}_{X_{sijk}}^2 \) was set to 1, so that \( {\tau}_{\omega_s}={ICC}_{X_s-L3} \), \( {\tau}_{\xi_s}={ICC}_{X_s-L2} \), and \( {\sigma}_{R_s}^2=1-\left({ICC}_{X_s-L2}+{ICC}_{X_s-L3}\right) \), where \( {ICC}_{X_s-L3} \) and \( {ICC}_{X_s-L2} \) are the intraclass correlation coefficients for Xs at L3 and L2, respectively. The variance components were equal for the two covariates, and the correlation coefficient for each component of the covariates is provided in Table 1.

Table 1 Simulation symbols and population parameters

The data generation process was completed through the following steps: Simulate (a) the L3 components of the covariates \( \left(\begin{array}{c}{\omega}_{1k}\\ {}{\omega}_{2k}\end{array}\right) \) from \( N\left(\left(\begin{array}{c}0\\ {}0\end{array}\right),\left[\begin{array}{cc}{ICC}_{X1-L3}& \\ {}\operatorname{Cov}\left({\omega}_{1k},{\omega}_{2k}\right)& {ICC}_{X2-L3}\end{array}\right]\right) \); (b) the L2 components of the covariates \( \left(\begin{array}{c}{\xi}_{1 jk}\\ {}{\xi}_{2 jk}\end{array}\right) \) from \( N\left(\left(\begin{array}{c}0\\ {}0\end{array}\right),\left[\begin{array}{cc}{ICC}_{X1-L2}& \\ {}\operatorname{Cov}\left({\xi}_{1 jk},{\xi}_{2 jk}\right)& {ICC}_{X2-L2}\end{array}\right]\right) \); (c) the L1 components of the covariates \( \left(\begin{array}{c}{R}_{1 ijk}\\ {}{R}_{2 ijk}\end{array}\right) \) from \( N\left(\left(\begin{array}{c}0\\ {}0\end{array}\right),\left[\begin{array}{cc}1-\left({ICC}_{X1-L2}+{ICC}_{X1-L3}\right)& \\ {}\operatorname{Cov}\left({R}_{1 ijk},{R}_{2 ijk}\right)& 1-\left({ICC}_{X2-L2}+{ICC}_{X2-L3}\right)\end{array}\right]\right) \); (d) the binary treatment indicator, randomly assigning half of the L2 units within each L3 unit to treatment and half to control; (e) the L3 random components of the outcome variable \( \left(\begin{array}{c}{u}_{00k}\\ {}{u}_{01k}\end{array}\right) \) from \( N\left(\left(\begin{array}{c}0\\ {}0\end{array}\right),\left[\begin{array}{cc}{\tau}_{\pi 00k}& \\ {}{\tau}_{\pi 00k,\pi 01k}& {\tau}_{\pi 01k}\end{array}\right]\right) \), where τπ00k = ICCY − L3 and ICCY–L3 is the L3 ICC for Y, conditional on the treatment indicator, the covariates, and the products of the treatment indicator and covariates; (f) for each L3 unit, the L2 random component u0jk from N(0, ICCY − L2); and finally (g) for each L2 unit, the L1 random component eijk from N(0, 1 − (ICCY − L3 + ICCY − L2)).
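Steps (a) through (g) and the mixed model in Eq. 4 can be sketched as follows. This is a minimal numpy illustration under one hypothetical condition: the sample sizes, ICCs, correlation, and fixed effects are illustrative stand-ins, the grand intercept is 0, and the L3 intercept-slope covariance is set to 0 for simplicity:

```python
import numpy as np

rng = np.random.default_rng(42)

# one hypothetical condition; all values below are illustrative
K, J, n = 20, 6, 10                  # sites, clusters per site, students per cluster
icc_x3, icc_x2 = .09, .08            # covariate ICCs at L3 and L2 (equal for X1, X2)
icc_y3, icc_y2 = .09, .08            # conditional outcome ICCs
rho = .30                            # total correlation between X1 and X2
tau_t = .05                          # treatment-effect variance across sites
pT, pTX1, pTX2 = .15, .20, .20       # treatment and L1/L2 interaction effects
pX = .20                             # covariate fixed effects (all set equal here)

def mvn2(v1, v2, cov, size):
    """Bivariate normal draws with variances v1, v2 and covariance cov."""
    return rng.multivariate_normal([0, 0], [[v1, cov], [cov, v2]], size=size)

site = np.repeat(np.arange(K), J * n)
clus = np.repeat(np.arange(K * J), n)

# (a)-(c): level-specific covariate components; with covariance = rho * level
# variance at every level, the total correlation between X1 and X2 equals rho
v_l1 = 1 - (icc_x2 + icc_x3)
w = mvn2(icc_x3, icc_x3, rho * icc_x3, K)          # L3 components
xi = mvn2(icc_x2, icc_x2, rho * icc_x2, K * J)     # L2 components
R = mvn2(v_l1, v_l1, rho * v_l1, K * J * n)        # L1 components
X1 = w[site, 0] + xi[clus, 0] + R[:, 0]            # observed covariates, as the
X2 = w[site, 1] + xi[clus, 1] + R[:, 1]            # analysis models would see them

# (d): within each site, assign half of the clusters to treatment
z_clus = np.concatenate([rng.permutation(np.repeat([0., 1.], J // 2))
                         for _ in range(K)])
z = z_clus[clus]

# (e)-(g): outcome random components (L3 intercept-slope covariance 0 here)
u = mvn2(icc_y3, tau_t, 0.0, K)
u0 = rng.normal(0, np.sqrt(icc_y2), K * J)
e = rng.normal(0, np.sqrt(1 - (icc_y3 + icc_y2)), K * J * n)

# mixed model of Eq. 4, built from the latent components (grand intercept = 0)
y = (pT * z
     + pX * w[site, 0] + pX * w[site, 1]                             # L3 terms
     + pX * xi[clus, 0] + pTX2 * xi[clus, 0] * z + pX * xi[clus, 1]  # L2 terms
     + pX * R[:, 0] + pTX1 * R[:, 0] * z + pX * R[:, 1]              # L1 terms
     + u[site, 0] + u0[clus] + u[site, 1] * z + e)
```

The outcome is generated from the latent components, whereas the competing models described below are fit to the observed X1 and X2, which is precisely what produces the observed-versus-latent-mean issue under study.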

The mixed model for generating the data is

$$ {\displaystyle \begin{array}{c}{Y}_{ijk}={\pi}_{000}+{\pi}_{001}\left({\mu}_{1k}-{\mu}_1\right)+{\pi}_{002}\left({\mu}_{2k}-{\mu}_2\right)+{\pi}_{010}{Z}_{. jk}+{\pi}_{020}\left({\mu}_{1 jk}-{\mu}_{1k}\right)+{\pi}_{030}\left({\mu}_{1 jk}-{\mu}_{1k}\right){Z}_{. jk}\\ {}+{\pi}_{040}\left({\mu}_{2 jk}-{\mu}_{2k}\right)+{\pi}_{100}\left({X}_{1 ijk}-{\mu}_{1 jk}\right)+{\pi}_{200}\left({X}_{1 ijk}-{\mu}_{1 jk}\right){Z}_{. jk}+{\pi}_{300}\left({X}_{2 ijk}-{\mu}_{2 jk}\right)\\ {}+{u}_{00k}+{u}_{0 jk}+{u}_{01k}{Z}_{. jk}+{e}_{ijk}\end{array}} $$
(4)

Conditions and population parameters

Estimating a three-level model for 1,000 replications of each combination of factors (i.e., a condition) was computationally demanding; hence, we conducted four separate simulation studies: one main and three additional simulations. The parameters for these simulations are summarized in Table 1. The first and second additional simulations explored whether the main-study findings held with larger sample sizes and with factors that were not manipulated in the main simulation. The third additional simulation aimed to provide additional insight into model comparisons under relatively complex covariance structures.

For the main simulation study, the sample size combinations were selected on the basis of a review of 357 projects funded by the Institute of Education Sciences (Aydin et al., 2016); the L1 and L2 sample sizes were 6 and 10, and the L3 sample sizes were 20 and 40. The ICC values for L2 (.08 and .16) and L3 (.09) were chosen in light of a meta-analysis on variance decomposition of academic achievement data due to schools and districts (Hedges & Hedberg, 2013). The correlations between the components of the two L1 covariates were equal across levels and were set so that the total correlation was either .30 or .70, to represent relatively small and large relationship strengths, respectively. The magnitudes of the fixed effects for the treatment (.15), its interactions with the covariates (.20 and .30 at both L1 and L2), and the covariates (.20 at both L1 and L2) were chosen on the basis of the representative application studies from the decade of the 2000s summarized by Mathieu, Aguinis, Culpepper, and Chen (2012). Given that we fixed the variance of X at 1 in our simulation studies, these magnitudes can be considered standardized effect sizes. The magnitude of the treatment effect variability, τπ01k = .05, was chosen on the basis of the Optimal Design software (Spybrook et al., 2011, p. 98).

Competing models

We examined the performance of three competing models. The default estimator for each model was maximum likelihood estimation with robust standard errors (MLRFootnote 1). Syntax for fitting the competing models using Mplus 7.4 is provided in the Appendix. Our first model, referred to as M3L and presented in Eq. 5, is a three-level model, like Eq. 4, but employs an observed-mean decomposition of the covariates: the continuous independent variables are centered at the cluster mean, \( \left({X}_{ijk}-{\overline{X}}_{. jk}\right) \), at L1, and at the site mean, \( \left({\overline{X}}_{. jk}-{\overline{\overline{X}}}_{..k}\right) \), at L2.Footnote 2 Fixed effects are represented by π; specifically, π001 and π002 are the L3 effects, π020 and π040 are the L2 effects, and π100 and π300 are the L1 effects of the covariates X1 and X2, respectively. Given that the treatment effect and the covariate by treatment interaction are of greater interest in a CRT than are the coefficients for the covariates, the fixed effect of treatment (π010, referred to below as πT), the L2 interaction component (π030 or πTX-L2), and the L1 interaction component (π200 or πTX-L1) are investigated in detail.

$$ {\displaystyle \begin{array}{c}{Y}_{ijk}={\pi}_{000}+{\pi}_{001}{\overline{\overline{X}}}_{\mathrm{1..}k}+{\pi}_{002}{\overline{\overline{X}}}_{\mathrm{2..}k}+{\pi}_{010}{Z}_{. jk}+{\pi}_{020}\left({\overline{X}}_{1. jk}-{\overline{\overline{X}}}_{\mathrm{1..}k}\right)+{\pi}_{030}\left({\overline{X}}_{1. jk}-{\overline{\overline{X}}}_{\mathrm{1..}k}\right){Z}_{. jk}+\\ {}{\pi}_{040}\left({\overline{X}}_{2. jk}-{\overline{\overline{X}}}_{\mathrm{2..}k}\right)+{\pi}_{100}\left({X}_{1 ijk}-{\overline{X}}_{1. jk}\right)+{\pi}_{200}\left({X}_{1 ijk}-{\overline{X}}_{1. jk}\right)\left({Z}_{. jk}\right)+{\pi}_{300}\left({X}_{2 ijk}-{\overline{X}}_{2. jk}\right)\\ {}+{u}_{00k}+{u}_{0 jk}+{u}_{01k}{Z}_{. jk}+{e}_{ijk}\end{array}} $$
(5)

Model 2, referred to as M2L-C, is a two-level model and thus includes variables at L1 and L2. The two-level complex procedure, which corrects the standard errors for clustering at L3, was used to estimate the parameters and to carry out hypothesis testing; the M2L-C model can be obtained from Eq. 5 by deleting u00k and u01k. Model 3, referred to as M1L-C, is a single-level model estimated with the complex procedure, which likewise corrects the standard errors for clustering at L3, and can be obtained from Eq. 5 by deleting u00k, u01k, and u0jk. All three models included the same fixed effects; in particular, Model 3, like Models 1 and 2, includes the L1 interaction component πTX-L1 and the L2 interaction component πTX-L2. Model 1 includes variance components at L1, L2, and L3; Model 2 includes variance components at L1 and L2; and Model 3 includes only an L1 variance.

Analysis of the simulation results

We focused on three coefficients and their standard errors: \( {\widehat{\pi}}_T \), \( {\widehat{\pi}}_{TX-L1} \), and \( {\widehat{\pi}}_{TX-L2} \). The convergence rate, the ratio of normally terminated estimations to the total number of replications, was 100% for each condition. We examined coefficient bias, coverage rates, relative bias of the standard errors, Type I error rates, and power. We used a mixed-design ANOVA model in which the factors of the simulation design were treated as between-subjects factors and the analysis method was treated as the within-subjects factor. We conducted 15 separate analyses of variance (ANOVAs), one for each combination of the three coefficients and the five outcome measures (coefficient bias, coverage rate, relative standard error bias, Type I error rate, and power). The dependent variables in these analyses were:

  • Coefficient bias—The dependent variable was \( \widehat{\theta}-\theta \); coefficient bias was calculated as the average of \( \widehat{\theta}-\theta \) over the replications of a condition.

  • Coverage rate—The dependent variable was an indicator variable for whether the 95% confidence interval (CI) in a replication of a condition, \( \widehat{\theta}\pm \left({z}_{.975}\right) SE\left(\widehat{\theta}\right) \), contained θ, where θ was πT, πTX − L1, or πTX − L2. The coverage rate was calculated as the percentage of intervals that contained θ, and rates between .925 and .975 were considered acceptable (Bradley, 1978).

  • Relative standard error bias—The dependent variable was \( \left[ SE\left(\widehat{\theta}\right)- SD\left(\widehat{\theta}\right)\right]/ SD\left(\widehat{\theta}\right) \), where \( SD\left(\widehat{\theta}\right) \) is the standard deviation of the parameter estimate across all replications of a condition (Bandalos & Leite, 2013). The relative standard error bias was calculated as the average of this quantity over the replications of a condition. Following Hoogland and Boomsma (1998), we considered the relative bias of the standard errors acceptable if the average over replications was between – 0.1 and 0.1.

  • Type I error rate—For conditions in which θ = 0, the dependent variable was an indicator variable for whether \( z=\widehat{\theta}/ SE\left(\widehat{\theta}\right) \) resulted in rejection of H0 : θ = 0, with ± z.975 as the critical values. The Type I error rate was calculated as the proportion of replications in which H0 : θ = 0 was rejected, and rates between .025 and .075 were considered acceptable (Bradley, 1978).

  • Power—For conditions in which θ ≠ 0, the dependent variable was an indicator variable for whether \( z=\widehat{\theta}/ SE\left(\widehat{\theta}\right) \) resulted in rejection of H0 : θ = 0, with ± z.975 as the critical values. Power was calculated as the proportion of replications in which H0 : θ = 0 was rejected.
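Each of these outcome measures can be computed directly from the per-replication estimates and standard errors. The sketch below uses synthetic stand-ins for one condition's replications (the values are ours, not results from the study):

```python
import numpy as np

rng = np.random.default_rng(3)
z975 = 1.959964                                  # z_.975 critical value

# synthetic stand-ins for one condition's 1,000 replications (not study results)
theta = .20                                      # true coefficient
est = rng.normal(theta, .05, size=1000)          # per-replication estimates
se = np.full(1000, .05) + rng.normal(0, .004, 1000)   # their standard errors

bias = np.mean(est - theta)                           # coefficient bias
lo, hi = est - z975 * se, est + z975 * se
coverage = np.mean((lo <= theta) & (theta <= hi))     # 95% CI coverage rate
sd = est.std(ddof=1)                                  # empirical SD of estimates
rel_se_bias = np.mean((se - sd) / sd)                 # relative SE bias
power = np.mean(np.abs(est / se) > z975)              # rejection rate: power here,
                                                      # Type I error when theta == 0
```

The same rejection-rate calculation serves as power when θ ≠ 0 and as the Type I error rate when θ = 0.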

Generalized η2 (Olejnik & Algina, 2003) was calculated as the effect size measure, and effects with generalized η2 < .001 were not interpreted.

Results

Main simulation study

The main simulation study was completed in approximately 2,088 hours, divided across six computers, each of which had 16 GB RAM and a 3.70-GHz central processing unit. As is reported in Table 1, a total of ten between-subjects factors were manipulated, and each factor had only two levels, resulting in 210 = 1,024 conditions. Estimation converged for all 1,024 × 1,000 replications.

Coefficient bias

The average coefficient bias across replications for \( {\widehat{\pi}}_T \) ranged between – .010 and .009, with a mean of 0; these values did not change across the different models. The η2 values were all smaller than .001. Similarly, the coefficient bias for \( {\widehat{\pi}}_{TX-L1} \) was acceptable, ranging between – .006 and .006.

The coefficient bias for \( {\widehat{\pi}}_{TX-L2} \) included large values and ranged between – .14 and .14, with a mean of 0; these values did not change across different models. The mixed-design ANOVA revealed substantial effects of \( {\pi}_{TX-L1} \) (η2 = .060) and \( {\pi}_{TX-L2} \) (η2 = .059), and relatively weak effects for (a) the πTX − L1 by ICCX1–L2 interaction (η2 = .003), (b) the πTX − L2 by ICCX1–L2 interaction (η2 = .003), (c) the πTX − L1 by n interaction (η2 = .001), and (d) the πTX − L2 by n interaction (η2 = .001). The results in Table 2 indicate the sources of these effects. Bias occurred when πTX − L1 ≠ πTX − L2. The magnitude of the bias differed after the third decimal place across the different models, and it decreased as n and ICCX1–L2 increased. The direction of the bias was positive when πTX − L2 > πTX − L1, and negative otherwise.

Table 2 Average biases of \( {\widehat{\pi}}_{TX-L2} \) for the main study

Coverage rates

The mean coverage rates across replications for M2L-C and M1L-C were similar to each other and generally differed only after the third decimal place, but M3L had slightly different rates. The ranges of coverage rates for πT were [.906, .960] for M3L and [.910, .963] for M2L-C and M1L-C. The mixed-design ANOVA did not result in any values of η2 > .001. For M3L, 21% of all 1,024 simulated conditions resulted in mean coverage rates lower than .925; for the other two models, low coverage rates were observed in approximately 9% of the conditions. Among all low-coverage conditions, 91% occurred with K = 20.

A similar pattern was observed for πTX − L1: The ranges of coverage rates were [.898, .965] for M3L, and [.904, .966] for the other two models. For M3L, 21% of all 1,024 simulated conditions had mean coverage rates lower than .925, and for the other two models this percentage was 8%. Again, low-coverage conditions occurred largely (91%) when K = 20.

The coverage rates were more problematic for \( {\pi}_{TX-L2} \), with ranges in coverage rates equal to [.751, .953], [.782, .957], and [.782, .958] for M3L, M2L-C, and M1L-C, respectively. Only 23% of the simulated conditions resulted in acceptable mean coverage rates. The mixed-design ANOVA resulted in η2 = .006 for the πTX − L1 by πTX − L2 interaction. Table 3 shows the ranges in coverage rates for \( {\pi}_{TX-L2} \) as a function of model. The coverage ranges when πTX − L1 and πTX − L2 were equal were [.893, .953], [.900, .957], and [.900, .958], respectively. These rates are similar to those reported for πT and πTX − L1, and rates lower than .925 occurred largely (81%) when K = 20.
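The coverage rates above count how often a nominal 95% confidence interval contains the population value. A minimal sketch under that standard Wald-interval definition (our code, with toy inputs, not the authors' implementation):

```python
import numpy as np

def coverage_rate(estimates, standard_errors, true_value, z=1.959964):
    """Proportion of replications whose Wald interval
    [estimate - z*SE, estimate + z*SE] contains the population value."""
    estimates = np.asarray(estimates, dtype=float)
    standard_errors = np.asarray(standard_errors, dtype=float)
    covered = np.abs(estimates - true_value) <= z * standard_errors
    return float(np.mean(covered))

# Toy check: three of the four intervals cover the true value .20
print(coverage_rate([0.18, 0.22, 0.35, 0.20], [0.02] * 4, 0.20))  # -> 0.75
```

A rate of .925 or higher is the acceptability threshold used throughout this section.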

Table 3 Coverage rates for πTX − L2 for the main study

Relative bias of the standard errors

Sixteen of the 1,024 × 1,000 replications produced standard errors larger than 10; all of these outlying standard error estimates came from M3L (Footnote 3). For \( {\widehat{\pi}}_T \), the median relative bias of the standard errors across the 1,000 replications of the 1,024 conditions was in the range [– .113, .047] for M3L and [– .094, .059] for M2L-C and M1L-C. The mixed-design ANOVA did not reveal any interpretable effects, and – .113 was observed when K = 20 and J = 6. A similar pattern was observed for \( {\widehat{\pi}}_{TX-L1} \): [– .115, .045] with M3L, and [– .092, .060] with the other two models; – .11 was observed when K = 20 and J = 6. All three models performed similarly when K was larger. Table 4 reports the mean relative biases of the standard errors after removing the 16 outliers.
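Relative standard error bias compares a summary of the estimated standard errors with the empirical standard deviation of the estimates across replications. The sketch below assumes that standard definition (it is not the authors' code); passing the median as the summary guards against the outliers just described.

```python
import numpy as np

def relative_se_bias(estimates, standard_errors, summary=np.mean):
    """(summary of estimated SEs - empirical SD of estimates) / empirical SD.
    Negative values mean the model understates sampling variability."""
    empirical_sd = float(np.std(estimates, ddof=1))
    return (float(summary(standard_errors)) - empirical_sd) / empirical_sd
```

Calling `relative_se_bias(estimates, ses, summary=np.median)` reproduces the median-based summaries used where the 16 outlying standard errors would otherwise distort the mean.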

Table 4 Mean relative biases of the standard errors for the main study

For \( {\widehat{\pi}}_{TX-L2} \), the ranges for the median relative standard error bias were [– .141, .041], [– .151, .022], and [– .150, .028] for Models 1–3, respectively. The mixed-design ANOVA did not reveal interpretable effects. Out of the 1,024 manipulated conditions, median values lower than – .10 occurred in 98, 120, and 119 conditions for Models 1–3, respectively, mainly when K = 20 and J = 6. Table 5 reports the mean values after removing the 16 outliers. These results indicated that on average, standard errors were slightly underestimated when K was smaller and that J had a larger effect when K was smaller.

Table 5 Mean relative biases of the standard errors for \( {\widehat{\pi}}_{TX-L2} \) for the main study
Table 6 Empirical power to detect πT = .15 for the main study

Power and Type I error rate

The empirical power to detect the treatment effect, averaged across conditions, was similar across the different models: .533 for M3L and .523 for M2L-C and M1L-C. The sample-size factors at each of the three levels and the ICCY–L2 factor had η2 > .001. The largest effect size was for K (η2 = .061); the mean power was .402 for K = 20 and .65 for K = 40. The effects of sample size at L1 and L2 were smaller: η2 = .004 for n, with mean powers of .49 and .56 for n = 6 and n = 10, respectively, and η2 = .017 for J, with mean powers of .46 and .59 for J = 6 and J = 10, respectively. The effect size for ICCY–L2 was η2 = .006, with mean powers of .56 and .49 for ICCY–L2 = .08 and .16, respectively. The larger value of ICCY–L2 implies a larger conditional variance for Y, which accounts for the reduction in power. The average Type I error rates for the treatment effect were in the ranges [.040, .094], [.039, .090], and [.039, .090], with mean values of .067, .062, and .063 for Models 1–3, respectively. No effect had η2 > .001; however, 20% of the conditions resulted in Type I error rates larger than .075 for M3L, versus 8% for the other two models. Among the conditions with large Type I error rates, 90% occurred when K = 20.
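Empirical power and the empirical Type I error rate are both rejection rates of the Wald z test across replications; the label depends only on whether the population effect is zero. A sketch under that assumption (our illustration, with hypothetical inputs):

```python
import numpy as np

def rejection_rate(estimates, standard_errors, z_crit=1.959964):
    """Proportion of replications with |estimate / SE| exceeding the critical
    value: empirical power if the true effect is nonzero, and the empirical
    Type I error rate if the true effect is zero."""
    z = np.asarray(estimates, dtype=float) / np.asarray(standard_errors, dtype=float)
    return float(np.mean(np.abs(z) > z_crit))
```

Under this definition, a Type I error rate between roughly .025 and .075 brackets the nominal .05 level, matching the acceptability range used in this section.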

The empirical powers to detect πTX − L1 were similar across the models: .926 for M3L, and .920 for the other models. The mixed-design ANOVA revealed six effects with η2 > .001, each involving sample size: (a) η2 = .051 for K, (b) η2 = .037 for J, (c) η2 = .028 for n, (d) η2 = .015 for the K by n interaction, (e) η2 = .010 for the K by J interaction, and (f) η2 = .006 for the J by n interaction. Table 7 reports the empirical power to detect πTX − L1 as a function of sample size. In addition, η2 = .001 for ICCY–L2, with means equal to .913 for .08 and .930 for .16, and also for ICCX1–L2, with means equal to .930 for .08 and .914 for .16. Averaged across iterations, the Type I error rates when testing πTX − L1 = 0 were in the ranges [.035, .102] and [.034, .096], and on average they were .067 for M3L and .062 for the other models. No effect had η2 > .001; however, 17% of the conditions resulted in Type I error rates larger than .075 for M3L, versus 7% for the other two models. Among these conditions with large Type I error rates, 92% occurred when K = 20.

Table 7 Empirical power to detect πTX − L1 for the main study

The L2 interaction effect, \( {\widehat{\pi}}_{TX-L2} \), was estimated without bias only when πTX − L2 = πTX − L1, and therefore we studied empirical power for the 216 conditions in which πTX − L2 = πTX − L1 = .20. The effect size η2 was at least .001 for sample size at L2 (η2 = .017), sample size at L3 (η2 = .015), and the L2 by L3 sample-size interaction (η2 = .001). The model factor had η2 = .005. The empirical power rates in Table 8 show that power increased as either the L2 or the L3 sample size increased, and that the impact of the L2 sample size was larger when the L3 sample size was larger. In addition, M3L had relatively higher power than the other models, which had very similar powers. An increase in the covariate’s ICC at L2 resulted in higher power (η2 = .005), with mean empirical power rates of .27 for M3L and .22 for the other models when ICCX–L2 = .08, and .35 for M3L and .28 for the other models when ICCX–L2 = .16. An increase in the conditional variance of Y at L2 resulted in lower empirical power (η2 = .003), with mean empirical power rates of .34 for M3L and .27 for the other models when ICCY–L2 = .08, and .28 for M3L and .23 for the other models when ICCY–L2 = .16. The Type I error rates for testing πTX − L2 = 0 were in the ranges [.047, .101] and [.043, .100], and on average they were .074 for M3L and .072 for the other models. No effect had η2 > .001; however, 45% of the conditions resulted in Type I error rates larger than .075 for M3L, versus 33% for the other two models. Among these conditions with large Type I error rates, 80% occurred when K = 20.

Table 8 Empirical power to detect πTX − L2 for the main study when πTX − L2 = πTX − L1

Additional simulation studies

The estimation for M3L was computationally demanding and slow, which prevented us from manipulating more than two levels of each factor. On the basis of the main study findings, we conducted three additional small-scale simulation studies to further explore (a) \( {\widehat{\pi}}_{TX-L2} \) bias, (b) the effects of the parameters that were not manipulated in the main study, and (c) model comparisons with relatively more complex variance structures.

Additional Simulation Study 1

Table 2 indicated that \( {\pi}_{TX-L2} \) was estimated with bias under some conditions. The amount of bias varied mainly as a function of the interaction magnitudes at the two levels (i.e., πTX − L1 and πTX − L2) and of the covariate’s ICC at L2. Thus, in our first additional study we manipulated the interaction magnitude to be 0, .2, or .3 at both levels, and the ICC to be .08 or .16. We also increased the sample sizes at all levels: n = 20; J = 10 or 20; and K = 40 or 60. Consistent with the main study results, substantial bias was detected when πTX − L1 ≠ πTX − L2, even with the larger sample sizes; the results are reported in Tables 9 and 10.

Table 9 The average biases of \( {\widehat{\pi}}_{TX-L2} \) for Additional Study 1
Table 10 The average biases of \( {\widehat{\pi}}_{TX-L2} \) by sample size when πTX − L1 ≠ πTX − L2 for Additional Study 1

Additional Simulation Study 2

The main study did not reveal substantial differences due to model choice, except for the difference in empirical power to detect the L2 interaction. Our second additional simulation study, focused on the parameter estimates, aimed to explore the effects of the factors that were not manipulated in the main study. A total of 1,024 conditions were generated (see Table 1) and analyzed only with M2L-C. Consistent with the main study results, the coefficient estimates were unbiased for \( {\widehat{\pi}}_T\;\mathrm{and}\;{\widehat{\pi}}_{TX-L1} \); the coefficient bias for \( {\widehat{\pi}}_{TX-L2} \) varied as a function of πTX − L2 (η2 = .075), ICCX1–L2 (η2 = .004), and the πTX − L2 by ICCX1–L2 interaction (η2 = .004), but not as a function of the newly added factors. The relative biases of the standard errors for \( {\widehat{\pi}}_T,{\widehat{\pi}}_{TX-L1} \), and \( {\widehat{\pi}}_{TX-L2} \) were all acceptable: – .01, – .01, and – .03 on average, respectively.

Additional Simulation Study 3

We designed our final additional simulation study to investigate the model effect under a wider range of L3 variances of the treatment random effect (τπ01k) and of L3 covariances between the L3 random effect and the L3 treatment random effect (τπ00k, π01k). We expected these new conditions to affect the results for πT, but we also report results for πTX − L1 and πTX − L2. The coverage rates for πT, πTX − L1, and πTX − L2 were all 94%, and the estimates were unbiased. The relative biases of the standard errors for these three parameter estimates were within the range [– .075, .054] and were acceptable for each of the 36 manipulated conditions.

The empirical power to detect the treatment effect varied as a function of τπ01k (η2 = .174), ICCY–L2 (η2 = .007), and the τπ01k by ICCY–L2 interaction (η2 = .001). Under the manipulated conditions, these results were expected (Bloom & Spybrook, 2017; Spybrook et al., 2011, p. 86). Table 11 reports the average power for these factors.

Table 11 Empirical power to detect πT for Additional Study 3

The empirical power to detect πTX − L1 was high; hence, we examined the effects of the conditions on the z values. The model effect was small (η2 = .008); the z values varied as a function of ICCY–L2 (η2 = .016) and ICCY–L3 (η2 = .009). Table 12 reports the mean z values. M2L-C and M1L-C produced average z values that were equal to each other and slightly smaller than those for M3L. An increase in the conditional variance at L2 and L3, which implies a decreased L1 variance, was associated with larger z values.

Table 12 Average of z values (estimate/SE) for \( {\widehat{\pi}}_{TX-L1} \) for Additional Study 3

The empirical power to detect \( {\widehat{\pi}}_{TX-L2} \) varied as a function of model choice (η2 = .018), ICCY–L2 (η2 = .007), τπ01k (η2 = .002), and ICCY–L3 (η2 = .001). Moreover, three two-way interactions affected the results: model by ICCY–L3, model by ICCY–L2, and model by τπ01k, each with η2 = .001. Table 13 reports the average power for these factors. M2L-C and M1L-C performed similarly under the manipulated conditions, and all three models performed similarly when the variance components at L3 were smaller. The power difference between M3L and the other models reached its maximum with larger L3 variance but smaller L2 variance.

Table 13 Empirical power to detect πTX − L2 for Additional Study 3

Illustration

We selected a subsample of 22 schools from the data of an ongoing early childhood education project. Each school had three control and three intervention classrooms, and each classroom had three children who met the criteria for risk of developing emotional/behavioral disorders, for a total sample size of 396. The outcome measure was the postintervention School Readiness Composite (SRC-post) score. The preintervention SRC score and the Social Awareness Composite (SAC) score at L1 and the binary treatment indicator at L2 served as independent variables. We standardized the SRC and SAC scores to have a grand mean of 0 and a standard deviation of 1. The classroom and school arithmetic means of SRC and SAC were added to the models, along with the SRC L1 and L2 deviation by treatment interactions, so that the models investigated in the simulations could be estimated.

The correlation between the treatment indicator and the preintervention scores was 0, as we would expect from an MS-CRT design. The correlation between the SRC-post and SRC-pre scores was .57; that between SRC-pre and SAC was .38. Three-level empty models revealed that the L1 variance components were .81, .87, and .82; the L2 variance components were .13, .10, and .12; and the L3 variance components were .06, .04, and .07, for SRC-post, SRC-pre, and SAC, respectively. Table 14 reports the results from the models addressed in this study; all were estimated using MLR in Mplus. Consistent with the simulation studies, all three models provided similar results for detecting the treatment effect and the L1 interaction component. The estimates of the L2 interaction coefficient and its standard error were the same for M2L-C and M1L-C, and slightly smaller than those for M3L. We also tested a two-level model, M2L, in which we ignored L2, listed all covariates at L1, and declared L3 as the second level. As expected, with M2L the standard error estimates were different, owing to the distribution of the L2 variances over the bottom and top levels (Moerbeek, 2004).
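The variance components from the three-level empty models translate directly into intraclass correlations. A quick check with our own arithmetic, using the SRC-post components reported above:

```python
def iccs(var_l1, var_l2, var_l3):
    """L2 and L3 intraclass correlations from three-level empty-model
    variance components: each level's share of the total variance."""
    total = var_l1 + var_l2 + var_l3
    return var_l2 / total, var_l3 / total

# SRC-post components reported above: .81 (L1), .13 (L2), .06 (L3)
icc_l2, icc_l3 = iccs(0.81, 0.13, 0.06)  # roughly .13 and .06
```

Because the SRC-post components happen to sum to 1.00 (the scores were standardized), the ICCs here equal the raw L2 and L3 components.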

Table 14 Illustration results

Discussion and conclusion

Motivated by Bloom and Spybrook’s (2017) observation that the use of MS-CRTs has been increasing, we compared three methods of estimating and testing the treatment effect and the covariate-by-treatment interaction components, due to L1 covariate moderation, at both L1 and L2 with respect to convergence, Type I error rates, power, coverage, and bias of the estimates. In this section we discuss our findings in three parts: (a) the L2 interaction, (b) the treatment main effect and the L1 interaction component, and (c) the comparison of competing models. We then present the limitations of the present study.

The results showed that, regardless of the model choice, the L2 interaction estimates were biased unless the magnitudes of the L1 and L2 interaction components were equal. The bias was upward when πTX − L2 > πTX − L1 and downward otherwise, and it was larger when the difference between πTX − L2 and πTX − L1 was larger. The magnitude of the bias was as large as .13 for a population value of .20. This is clearly unacceptable, and the problem of bias persisted even with the larger sample sizes studied in this article. For example, in Additional Simulation Study 1 (see Table 10), the magnitude of the bias was .051 for a population value of .20 even when K = 60 and J = 20; it was .050 when K = 60 and J = 10, .052 when K = 40 and J = 20, and .055 when K = 40 and J = 10. As was reported by Ryu (2015) and mentioned by Preacher, Zhang, and Zyphur (2016), this bias is due to the unreliability (or sampling error) of the aggregated L1 covariate. Furthermore, the amount of bias parallels the results calculated using the bias derivation formula for two-level models given by Lüdtke et al. (2008). A possible solution is to use a latent decomposition; however, in a three-level model this is not easily computable (Brincks et al., 2017; Preacher et al., 2016), and to date it is not possible with the Mplus software (Footnote 4). We included conditions with πTX − L2 = πTX − L1 in the simulations and found that the estimates of the L2 interaction were not biased. The coverage rates, Type I error rates, and relative biases of the standard errors were mainly acceptable when πTX − L2 = πTX − L1 and K = 40 for all three competing models under the manipulated conditions in the main and additional studies; however, when K = 20, slightly unacceptable rates and relative bias values occurred, especially with M3L. The empirical power to detect an unbiased L2 interaction increased with larger sample sizes and larger ICCX–L2 values. The results for M3L are consistent with Dong, Kelcey, and Spybrook (2017, Eq. 28).
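The attribution of the bias to the unreliability of the aggregated L1 covariate can be made concrete with the two-level reliability term from Lüdtke et al. (2008). The sketch below is our illustration, not the authors' derivation, and the exact bias expression for a three-level model differs; still, the unreliability of an observed group mean, 1 − λ, falls as n and the ICC rise, mirroring the pattern of bias in Table 2.

```python
def group_mean_reliability(icc, n):
    """Reliability (lambda) of the observed mean of n L1 observations
    (two-level case): lambda = ICC / (ICC + (1 - ICC) / n)."""
    return icc / (icc + (1 - icc) / n)

# Unreliability (1 - lambda) shrinks as n or ICC_X1-L2 grows
# (main study factor levels shown)
for icc in (0.08, 0.16):
    for n in (6, 10):
        print(icc, n, round(1 - group_mean_reliability(icc, n), 3))
```

At n = 6 and ICC = .08, well under half of the variance in the observed group mean reflects the true group mean, which is why the aggregated covariate's coefficient can be badly distorted at these sample sizes.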
In terms of model comparison, M3L was slightly more powerful than the other two models at detecting an unbiased L2 interaction, but this advantage coincided with slightly underestimated standard errors. The difference between M3L and the other models reached its maximum with larger L3 variance but smaller L2 variance. McNeish and Wentzel (2017) reported a comparison between an M3L with a small-sample adjustment and M2L-C, under a covariance matrix relatively simpler than the one included in the present study and with small L3 sample sizes (four, seven, and ten). The authors could not compare the power of these two models because of the poor performance of M2L-C in terms of biased variance estimates, but they also noted that the poor performance was less severe with small L3 variance. Furthermore, our results also emphasize the importance of small-sample adjustments, given that slightly underestimated standard errors occurred more often for M3L when K = 20.

When K = 40, our results indicate that coefficient bias for \( {\widehat{\pi}}_T\;\mathrm{and}\;{\widehat{\pi}}_{TX-L1} \) was absent for all three models, and the Type I error rates were also acceptable. The coverage rates for the population parameters πT and πTX − L1 were between .925 and .975 under 79% of the conditions in the main study; the remaining 21% had coverage rates between .898 and .925. The poor performance in terms of coverage rates occurred mainly (91%) when K = 20 and was due to a slight downward bias in the standard errors. This finding is also consistent with McNeish and Wentzel’s (2017) study, in which they utilized ML estimation; in our study we were limited to MLR, given that Mplus does not offer ML with CRSE. The poorer performance of MLR relative to ML in a multilevel structural equation framework with no assumption violations was reported by Hox, Maas, and Brinkhuis (2010). The downward bias in the standard errors was slightly larger for M3L than for the other models, but the difference among the competing models vanished with a larger L3 sample size. Using REML, or REML with the Kenward–Roger correction (Kenward & Roger, 2009), could be a possible remedy for M3L’s poor performance with small L3 sample sizes (McNeish, 2017; McNeish & Wentzel, 2017). These alternatives, however, are not available in Mplus. The empirical power to detect a nonzero πT or πTX − L1 did not change across the competing models and generally increased as a function of sample size at L3, and then at L2. These findings were expected for M3L, as shown by Dong, Kelcey, and Spybrook (2017, Eq. 50), Spybrook et al. (2011, p. 86), and Bloom and Spybrook (2017).

In addition to the model comparisons above, one interesting outcome of this study is that M1L-C, at least with the Mplus software, provided roughly the same results as M2L-C under the conditions of this study and the specifications of the two models. The specifications included correctly including the L1 and L2 components of the interaction. If the model used in the program implementing M1L-C had included only a total product term (that is, X1ijkZ.jk and its coefficient), estimates of that coefficient would likely not have been equal to either \( {\widehat{\pi}}_{TX-L1}\;\mathrm{or}\;{\widehat{\pi}}_{TX-L2} \) obtained from the program we used to implement M1L-C in the simulations. If, alternatively, the model had included only the product term \( \left({X}_{1 ijk}-{\overline{X}}_{1 jk}\right)\left({Z}_{. jk}\right) \) and its coefficient, estimates of that coefficient would likely have been equal to \( {\widehat{\pi}}_{TX-L1} \) obtained from the program we used to implement M1L-C in the simulations. Similar to the results reported by McNeish, Stapleton, and Silverman (2017), our results also support the use of design-based methods (M1L-C), or the combination of design-based and model-based methods (M2L-C), as an alternative to completely model-based methods, even with a complex multilevel design, in our case an MS-CRT. Our illustration is consistent with this conclusion. A note of caution is due here: Raudenbush and Bloom (2015) likened the MS-CRT to “a fleet of experiments” that is particularly useful for studying effect heterogeneity. According to Raudenbush and Bloom, investigating both the mean program impact and impact heterogeneity is a necessary step when moving from a field study toward public policy, program theory, or professional practice. One dimension of impact heterogeneity can be studied by examining the L3 random effects, and design-based analyses remove the possibility of detecting this type of heterogeneity.
Another point is that M2L in the illustration ignored the intermediate level, instead treating all covariates as L1 variables and adjusting only for L3 clustering, and thus produced results different from those of the other models, since ignoring a level is known to be problematic (Moerbeek, 2004). Therefore, M2L-C should be used rather than M2L.

As is true of all simulations and illustrations, our study had some limitations. Our results were restricted to a balanced MS-CRT without any missing data and with all assumptions satisfied. We focused on only three parameters, \( {\widehat{\pi}}_T,{\widehat{\pi}}_{TX-L1},\mathrm{and}\;{\widehat{\pi}}_{TX-L2} \). We were also limited to a single estimator implemented in Mplus. Furthermore, in order to study the model comparison with unbiased estimates of the L2 interaction, we set πTX − L2 = πTX − L1 for a substantial portion of the conditions in each of the four simulations. Another limitation is that M3L was computationally demanding, and therefore we refrained from combining the main study and additional simulation study conditions. Even though they occurred for only 16 of the 1,024 × 1,000 main study replications, we observed outlying standard error estimates with M3L for the L2 interaction. To address this limitation, we reported both the mean and the median relative standard error biases.