In experimental research, it is not uncommon to assign clusters rather than individuals to conditions. For example, in educational research, schools or classes may be assigned to either the treatment or control condition; in organisational psychology, it may be companies that are assigned to conditions; in medical research, it may be hospitals or general practices; and in health psychology, it may be therapy groups. Experimental designs such as these are referred to as cluster-randomized trials (see, e.g., Murray, 1998). In cluster-randomized trials, the assumption of independent observations is violated. Due to shared features, shared leadership, and mutual influences, subjects within the same cluster are likely to respond more alike than subjects from different clusters. When analysing the data of cluster-randomized trials, the dependency of the subjects within the same cluster is taken into account by applying a multilevel analysis. Moreover, multilevel analysis makes it possible to include covariates on the individual level as well as on the cluster level. When cluster-level covariates (that is, contextual variables) are taken into account in an ordinary regression or ANOVA model, these covariates have to be disaggregated. The consequence of disaggregation is that an artificial homogeneity is introduced, so that standard errors are biased downwards and the Type I error rate is inflated. On the other hand, in contextual analyses, individual-level covariates have to be aggregated, often resulting in a loss of information and, hence, of power (see, e.g., Greenland, 2002). That aggregation does not always result in a loss of power is shown by, for instance, Hedges (2007).
He presented an adjusted t test that accounted for clustering and obtained “reasonably accurate significance levels.” However, he recommended that this test and other tests on cluster means (i.e., Barcikowski’s test) should only be used when raw data are not available and argued that, in other cases, multilevel statistical methods are more appropriate. With multilevel analysis, the covariates are analysed without aggregation or disaggregation (e.g., Snijders & Bosker, 1999).

It is well known that by taking into account influential covariates, power can be increased (see, e.g., Moerbeek, 2006). In cluster-randomized trials, covariates on various levels may need to be included in the study. In the present study, we focused on including a covariate on the lowest level. Since in most cluster-randomized trials the lowest-level units concern individual subjects, from here on we will refer to this level as the subject level.

When including a subject-level covariate, it seems obvious that one wants to assess the influence of this covariate at the level on which it is measured—that is, the subject level. For example, when taking into account pupils’ IQs on math performance, the researcher expects the math performance to be influenced by the subject’s IQ score and, hence, in the multilevel model will treat IQ score as a subject-level covariate. The effect of a subject-level covariate at the subject level is referred to as the within-cluster effect. However, in addition to the within-cluster effect, it is not uncommon that a subject-level covariate has an effect at the cluster level as well. For example, in addition to the effect of the pupil’s individual IQ score on math performance, the mean IQ score in a class may also influence the individual math performance. The effect of a subject-level covariate at the cluster level is in the literature referred to as a between-cluster or contextual effect. Although in the remainder of this article we will make a distinction between these phrases, throughout this introduction we will simply use the phrase contextual effect.

The within-cluster effect and the contextual effect of a subject-level covariate may differ, and they may even have different signs (see, e.g., Kreft & de Leeuw, 1998; Snijders & Bosker, 1999). This phenomenon is also discussed by Begg and Parides (2003), Greenland (2002), Neuhaus and Kalbfleisch (1998), and Palta and Seplaki (2002). These authors have emphasized that when a multilevel analysis is applied, the researcher should check whether the within-cluster and contextual effects differ, and if they do, both effects should be modelled explicitly. If the separate effects are not explicitly modelled, the within-cluster and contextual effects are implicitly assumed to be equal.

However, most substantive researchers seem unaware of the possibly different within-cluster and contextual effects of a first-level covariate and do not explicitly model both effects, implicitly assuming them to be equal. That the assumption of equal within-cluster and contextual effects may be violated has been shown by, for instance, Mann, De Stravol, and Leon (2004) and Dwyer and Blizzard (2005).

The studies cited above have shown not only that the within-cluster effect may differ from the contextual effect of a subject-level covariate, but also that different analysis models—that is, assuming equal effects or modelling both effects explicitly in different ways—can affect the estimation of the parameters of the covariate itself. In an extended study, Shen, Shao, Park, and Palta (2008) discussed the effect of misspecifying an inequality of the covariate effects, illustrating the effect on real data. However, they used data from observational studies and did not discuss the main parameter in cluster-randomized trials—that is, the treatment effect and its standard error. The present study fills this gap, by focusing on the robustness of the treatment parameter and its standard error against ignoring inequality of the within-cluster and contextual effects of the covariate within the framework of a cluster-randomized trial.

In the present study, we apply two different models to simulated data—one of them explicitly modelling within-cluster and contextual effects of the subject-level covariate, and the other implicitly assuming equal effects. Note that the latter model also can be seen as omitting a second-level covariate, the contextual effect. In a randomized design, it is assumed that all covariates—the observed as well as the omitted ones—are balanced over the conditions, and hence that the expected correlation between the omitted covariate and the treatment effect is zero. However, when the covariate is not balanced, the variability may actually increase, which will be visible in larger standard error estimates.

Simulation has the advantage of known true parameter values, as well as presenting the possibility of manipulating the total effect of the covariate and the magnitude of the difference between the within-cluster and contextual effects. Though the main interest of this study is the robustness of the estimated treatment effect and its standard errors, for completeness we will also present the other fixed parameters (i.e., the within-cluster effect, the contextual effect—if estimated—and the constant), the estimated variance components, and their standard errors.

In the next section, two models are described. The first takes into account the different within-cluster and contextual effects. From here on, we will refer to this model as the covariate different-effects multilevel model. This model, though not the label, is proposed by, for instance, Neuhaus and Kalbfleisch (1998) and Snijders and Bosker (1999). The second model ignores the different within-cluster and contextual effects of the covariate, hence implicitly assuming equal effects. Since the latter model is the one usually used for nested data, from here on we will refer to it as the ordinary multilevel model. In the third section, the simulation and an analysis of the parameter estimates are described. The results are given in the fourth section, and the article ends with a summary and discussion.

The models

The covariate different-effects multilevel model

Assume that we have data from a cluster-randomized trial with a continuous outcome, one variable at the cluster level—that is, the treatment indicator—and one continuous subject-level covariate. As stated in the introduction, in data from a cluster-randomized trial and in other hierarchical data, a first-level covariate may have different within-cluster and contextual effects. This is taken into account by explicitly modelling two parameters associated with the subject-level covariate:

$$ {Y_{{ij}}} = {\gamma_{\text{const}}} + {\gamma_{\text{treat}}}{T_j} + {\gamma_{\text{W}}}\,{X_{{ij}}} + {\gamma_{\text{B}}}{\bar{X}_j} + {u_j} + {e_{{ij}}}. $$
(1)

In this equation, \( Y_{ij} \) is a continuous outcome for subject i in cluster j, and \( \gamma_{\text{const}} \) is a constant reflecting the mean outcome when \( T_j \) and \( X_{ij} \) both equal zero. \( T_j \) is the treatment indicator, coded 1 for the treatment condition and 0 for the control condition. Coded as such, \( \gamma_{\text{treat}} \) represents the treatment effect, which is assumed to be the same for all clusters in the treatment condition. The subject-level covariate X is separated into a within-cluster part and a contextual part. The within-cluster part is given by \( X_{ij} \), which is the score on covariate X for subject i in cluster j, and by the regression of \( Y_{ij} \) on \( X_{ij} \) (i.e., the within-cluster effect, given by \( \gamma_{\text{W}} \)). The contextual part is given by the cluster mean \( {\bar{X}_j} \) and the regression of \( Y_{ij} \) on \( {\bar{X}_j} \) (i.e., the contextual effect \( \gamma_{\text{B}} \)). Finally, \( u_j \) is the cluster-level residual, and \( e_{ij} \) is the subject-level residual. These residuals are independently distributed with zero mean and the variances \( \sigma_u^2 \), for the cluster-level residuals, and \( \sigma_e^2 \), for the subject-level residuals. Since the residuals are independent, the total residual variance is \( {\sigma^2} = \sigma_u^2 + \sigma_e^2 \).

It should be noted that the separation of the within-cluster and contextual effects of a first-level covariate can also be accomplished by using \( {X_{{ij}}} - {\bar{X}_j} \) instead of X ij , so that Eq. 1 changes into

$$ {Y_{{ij}}} = {\tilde{\gamma }_{\text{const}}} + {\tilde{\gamma }_{\text{treat}}}{T_j} + {\tilde{\gamma }_{\text{W}}}({X_{{ij}}} - {\bar{X}_j}) + {\tilde{\gamma }_{\text{B}}}{\bar{X}_j} + {u_j} + {e_{{ij}}}. $$
(2)

In Eq. 2, the symbol \( \tilde{\gamma } \) is used to distinguish these parameters from the corresponding parameters in Eq. 1. In Eq. 2, \( {\tilde{\gamma }_{\text{W}}} \) is the regression of \( Y_{ij} \) on \( {X_{{ij}}} - {\bar{X}_j} \), that is, the within-cluster effect. This parameter is equal to the corresponding parameter in Model (1): \( {\tilde{\gamma }_{\text{W}}} = {\gamma_{\text{W}}} \). However, the contextual effect \( {\tilde{\gamma }_{\text{B}}} \) in Model (2) is the regression of \( {\bar{Y}_j} \) on \( {\bar{X}_j} \) and, hence, reflects the actual contextual effect. For clarity, from here on, we will refer to \( {\tilde{\gamma }_{\text{B}}} \) as the contextual effect and \( \gamma_{\text{B}} \) [i.e., the parameter used in Model (1)] as the between-cluster effect. The relation between \( {\tilde{\gamma }_{\text{B}}} \) and \( \gamma_{\text{B}} \) is given by \( {\tilde{\gamma }_{\text{B}}} = {\gamma_{\text{W}}} + {\gamma_{\text{B}}} \) (see, e.g., Snijders & Bosker, 1999). From this, it becomes obvious that when the between-cluster effect \( \gamma_{\text{B}} = 0 \), the contextual effect \( {\tilde{\gamma }_{\text{B}}} \) is equal to the within-cluster effect \( \gamma_{\text{W}} \). Though Models (1) and (2) are equivalent, we choose to use Model (1) in the present study, because in this model both parameters (that is, the between-cluster parameter \( \gamma_{\text{B}} \) and the within-cluster parameter \( \gamma_{\text{W}} \)) reflect the regression of \( Y_{ij} \).
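The relation between the two parameterisations follows from a single substitution. As a short worked step, using only the symbols of Eqs. 1 and 2, the covariate score can be written as the sum of its within-cluster deviation and the cluster mean, so that the covariate terms of Eq. 1 become

$$ {\gamma_{\text{W}}}{X_{ij}} + {\gamma_{\text{B}}}{\bar{X}_j} = {\gamma_{\text{W}}}\left( {X_{ij}} - {\bar{X}_j} \right) + \left( {\gamma_{\text{W}}} + {\gamma_{\text{B}}} \right){\bar{X}_j}. $$

Matching terms with Eq. 2 gives \( {\tilde{\gamma }_{\text{W}}} = {\gamma_{\text{W}}} \) and \( {\tilde{\gamma }_{\text{B}}} = {\gamma_{\text{W}}} + {\gamma_{\text{B}}} \), while the constant, the treatment effect, and the residual terms are unchanged.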

The true total effect of the subject-level covariate is a weighted sum of the within-cluster and contextual effects (see, e.g., Snijders & Bosker, 1999, p. 30):

$$ {\gamma_{\text{total}}} = {\eta^2}\,{\tilde{\gamma }_{\text{B}}} + \left( {1 - {\eta^2}} \right){\tilde{\gamma }_{\text{W}}}. $$
(3)

In this equation, \( \eta^2 \) is the correlation ratio, which is defined by the intracluster correlation coefficient (ICC) and the reliability of the cluster mean \( {\bar{X}_j} \). It should be noted that the reliability of the cluster mean depends on both the magnitude of the ICC and the group size (e.g., Bliese, 2000). That is, the larger the ICC, the more likely a single score is to be a reliable estimate of the cluster mean. And by the law of large numbers, the larger the cluster size, the more likely it is that the cluster mean is a reliable estimate of the population cluster mean. Expressed in terms of Model (1), Eq. 3 changes into

$$ {\gamma_{\text{total}}} = {\eta^2}\left( {{\gamma_{\text{B}}} + {\gamma_{\text{W}}}} \right) + \left( {1 - {\eta^2}} \right){\gamma_{\text{W}}}. $$
(4)
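As a short worked step (not part of the derivation cited above), Eq. 4 can be simplified by collecting the terms in the within-cluster effect:

$$ {\gamma_{\text{total}}} = {\eta^2}{\gamma_{\text{B}}} + {\eta^2}{\gamma_{\text{W}}} + \left( {1 - {\eta^2}} \right){\gamma_{\text{W}}} = {\gamma_{\text{W}}} + {\eta^2}{\gamma_{\text{B}}}. $$

In this form, the total effect equals the within-cluster effect whenever the between-cluster effect of Model (1) is zero, and otherwise it is shifted by \( {\eta^2}{\gamma_{\text{B}}} \).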

The ordinary multilevel model

In the ordinary multilevel model, the possibility of different within-cluster and contextual effects of the covariate is ignored. In other words, it is assumed that the within-cluster and contextual effects are equal, or in terms of Model (1), that the between-cluster effect is zero. If this assumption holds, it is sufficient to estimate a single parameter for the covariate X:

$$ {Y_{{ij}}} = \gamma_{\text{const}}^{\prime} + \gamma_{\text{treat}}^{\prime}{T_j} + \gamma_{\text{W}}^{\prime}{X_{{ij}}} + {u_j} + {e_{{ij}}}. $$
(5)

In this equation, the symbol \( \gamma^{\prime } \) is used to distinguish the parameters from the corresponding parameters used in Eqs. 1 and 2. Let \( Y_{ij} \) be a continuous outcome for subject i in cluster j and \( \gamma_{\text{const}}^{\prime } \) be a constant, reflecting the mean outcome when \( T_j \) and \( X_{ij} \) both equal zero. \( T_j \) is the treatment indicator for cluster j (coded 0 for the control condition and 1 for the experimental condition), and \( \gamma_{\text{treat}}^{\prime } \) is the treatment effect. \( \gamma_{\text{W}}^{\prime } \) is the within-cluster effect of the first-level covariate X. It should be noted that this parameter usually is not subscripted W, since it is the only estimated parameter for the subject-level covariate. However, in the context of the present study, the subscript is used to emphasize that this parameter is estimated under the assumption that \( \gamma_{\text{B}}^{\prime } \) is zero. As stated in the introduction section, in the ordinary multilevel model the term associated with \( \gamma_{\text{B}}^{\prime } \) can also be viewed as an omitted covariate.

In Model (5), as in Model (1), the residual at the cluster level is captured by the term \( u_j \), while the residual at the subject level is captured by \( e_{ij} \). The residuals are assumed to be independent of each other and normally distributed, with zero mean and variances \( \sigma_u^2 \) and \( \sigma_e^2 \), respectively. The total variance of the model is given by \( {\sigma^2} = \sigma_u^2 + \sigma_e^2 \).

Simulation and evaluation of the estimated parameters

Simulation

Data were generated and analysed with the MLwiN software, version 2.0 (Rasbash, Steele, Browne, & Prosser, 2004). The data were generated according to Model (1), that is, the covariate different-effects multilevel model (CDEMM). Data were then analysed with the covariate different-effects multilevel model (Eq. 1) and with the ordinary multilevel model (OMM; Eq. 5). Restricted maximum likelihood estimation (REML) was used, as is advised for small sample sizes at the cluster level in most literature, in order to get good estimates for the cluster-level variance \( \sigma_u^2 \) (see, e.g., Hox, 2010; Raudenbush, Bryk, Cheong, Congdon, & Du Toit, 2004; Snijders & Bosker, 1999).

In all conditions, the cluster size m was fixed at five. This is not an unusual number in, for example, group therapy. Furthermore, when all other requirements are met, a small sample size at the first level hardly affects the estimation of the parameters and standard errors. In order to get unbiased parameter estimates, the second-level sample size is of main importance (Maas & Hox, 2005). In the present study, we chose to have a relatively small second-level sample size of 20 clusters per condition. Such small numbers of clusters are quite common in practice, due to limited financial resources and other practical limitations. Furthermore, it was expected that a large number of clusters would not result in a more accurate estimation of the treatment effect, if it is affected by ignoring the nonequivalence of the within-cluster and contextual effects.

The intercept was fixed at 1, and the treatment effect was fixed at 0.3, which is a medium-sized effect (Cohen, 1988). The subject-level residual variance \( \sigma_e^2 \) was fixed at 1. The value of the cluster-level variance \( \sigma_u^2 \) follows from the ICC and the subject-level variance \( \sigma_e^2 \). Although the ICC affects the sampling variance, it does not affect the estimation of fixed effects. Multilevel software incorporates this effect of the ICC, so the magnitude of the ICC does not affect the bias of estimators. In earlier simulation studies (e.g., Korendijk, Maas, Moerbeek, & Van der Heijden, 2008; Maas & Hox, 2005), the ICC indeed was an insignificant predictor of the biases of estimated parameters and their standard errors. For these reasons, in the present study the ICC was not manipulated but fixed at .10, which is a value that is often encountered in practice.
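Under the usual definition of the ICC as the share of the total residual variance that is located at the cluster level (an assumption here, since the definition is not spelled out above), the implied value of the cluster-level variance can be worked out as follows:

$$ \text{ICC} = \frac{{\sigma_u^2}}{{\sigma_u^2 + \sigma_e^2}}\quad \Rightarrow \quad \sigma_u^2 = \frac{\text{ICC}}{{1 - \text{ICC}}}\,\sigma_e^2 = \frac{{.10}}{{.90}} \times 1 \approx 0.111. $$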

It was expected that the more the contextual effect differed from the within-cluster effect, the more biased the parameter estimate would be. Since substantive researchers rarely use the CDEMM, little information is available on empirical values for the magnitude of the difference between the within-cluster and contextual effects. Based on the studies by Mann, De Stravol, and Leon (2004), Neuhaus and Kalbfleisch (1998), and Palta and Seplaki (2002), we chose the within-cluster effect to be ten times as small, three times as small, three times as large, and ten times as large as the contextual effect. These different conditions in the tables are labelled in terms of the magnitude of the within-cluster effect as compared to the contextual effect; for instance, 0.1 refers to the condition in which the within-cluster effect is ten times as small as the contextual effect, and 10 refers to the condition in which the within-cluster effect is ten times as large as the contextual effect. From here on we will refer to this variable as the inequality of covariate effects, with the values 0.1, 0.333, 3, and 10. Finally, in order to compare the performance of the CDEMM and the OMM when the within-cluster effect is equal to the contextual effect, data with equal effects were generated, thus adding a condition labelled “1” to the inequality-of-covariate-effects variable. In this case, it was expected that both models would perform equally well, since both models fit the data when the covariate effect is the same at both levels.

Furthermore, it was expected that the stronger the total effect of the first-level covariate, the more the treatment effect and variance estimates would be affected. The total effect of the first-level covariate—that is, the weighted sum of the within-cluster and contextual effects (Eq. 3)—was chosen to be small (.1), medium (.3) and large (.5) (Cohen, 1988). From here on, we will use the term total covariate effect when referring to the total effect of the first-level covariate.
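To illustrate how a condition of the design could translate into parameter values, the following minimal sketch solves Eq. 3 for the within-cluster and contextual effects, given a total covariate effect and a value of the inequality of covariate effects. The function name `covariate_effects` and the value 0.28 for the correlation ratio are hypothetical and serve only as an illustration; the value of the correlation ratio used in the simulation is not reported in this section.

```python
def covariate_effects(total, ratio, eta2):
    """Split a total covariate effect into its parts using Eq. 3.

    total : weighted sum of the contextual and within-cluster effects
    ratio : within-cluster effect divided by contextual effect (e.g., 0.1, 3, 10)
    eta2  : correlation ratio of the covariate (hypothetical input, see lead-in)
    """
    if not 0.0 <= eta2 <= 1.0:
        raise ValueError("eta2 must lie in [0, 1]")
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # total = eta2 * contextual + (1 - eta2) * ratio * contextual
    contextual = total / (eta2 + (1.0 - eta2) * ratio)
    within = ratio * contextual
    # Between-cluster effect of Model (1): contextual minus within-cluster effect
    between = contextual - within
    return within, contextual, between


if __name__ == "__main__":
    for ratio in (0.1, 1 / 3, 1, 3, 10):
        w, c, b = covariate_effects(total=0.3, ratio=ratio, eta2=0.28)
        print(f"ratio={ratio:.3f} within={w:.3f} contextual={c:.3f} between={b:.3f}")
```

When the ratio equals 1, the sketch returns equal within-cluster and contextual effects and a between-cluster effect of zero, which corresponds to the condition in which the OMM is correctly specified.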

The above-described design consists of (5 × 3 =) 15 conditions, and in each condition 3,000 data sets were generated. The large number of replications was chosen to increase the precision of the estimates (Skrondal, 2000).
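The data-generating step can be sketched in a few lines. The sketch below is not the MLwiN procedure used in the study; it is a minimal illustration of drawing one data set from Model (1) with the design values given above (20 clusters per condition, cluster size 5, subject-level variance 1, ICC .10). The function name, the default covariate effects, and the distribution of the covariate (standard normal with its own intracluster correlation `x_icc`) are assumptions made for the illustration.

```python
import random


def simulate_dataset(gamma_const=1.0, gamma_treat=0.3, gamma_w=0.3, gamma_b=0.0,
                     clusters_per_condition=20, cluster_size=5,
                     sigma2_u=0.1 / 0.9, sigma2_e=1.0, x_icc=0.1, seed=None):
    """Draw one data set from Model (1).

    Assumption (not from the article): X has unit variance, of which a
    proportion x_icc is located at the cluster level.
    Returns rows of (cluster, treatment, x, cluster mean of x, y).
    """
    rng = random.Random(seed)
    rows = []
    for j in range(2 * clusters_per_condition):
        # First half of the clusters is control (0), second half treatment (1)
        treat = 1 if j >= clusters_per_condition else 0
        u_j = rng.gauss(0.0, sigma2_u ** 0.5)        # cluster-level residual
        x_cluster = rng.gauss(0.0, x_icc ** 0.5)     # cluster part of X
        xs = [x_cluster + rng.gauss(0.0, (1.0 - x_icc) ** 0.5)
              for _ in range(cluster_size)]
        x_bar = sum(xs) / cluster_size               # observed cluster mean
        for x in xs:
            e_ij = rng.gauss(0.0, sigma2_e ** 0.5)   # subject-level residual
            y = (gamma_const + gamma_treat * treat
                 + gamma_w * x + gamma_b * x_bar + u_j + e_ij)
            rows.append((j, treat, x, x_bar, y))
    return rows
```

Calling `simulate_dataset(seed=1)` repeatedly with different seeds would yield the replications of one condition; each data set could then be analysed with both models in any multilevel software.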

Evaluation of the parameter and standard error estimates

Estimated parameters were evaluated by means of the relative bias, that is, the estimated parameter divided by the true parameter value: \( RB = \frac{{\hat{\theta }}}{\theta } \). When estimated without bias, the parameter estimate equals the true parameter, and hence the RB will be 1. When the parameter is overestimated, the RB will be larger than 1, and underestimation will result in an RB smaller than 1. However, when the within-cluster and the contextual effects are equal, the true between-cluster effect is zero, \( {\gamma_{\text{B}}} = 0 \), and the relative bias is undefined. In this situation, the bias of the estimated between-cluster effect is determined by the deviation from the true value, which is the estimate itself \( \left( {{{\hat{\gamma }}_{\text{B}}} - {\gamma_{\text{B}}} = {{\hat{\gamma }}_{\text{B}}} - 0 = {{\hat{\gamma }}_{\text{B}}}} \right) \). As Skrondal (2000) recommends, estimated parameters will be tested. The parameter of main interest for a researcher conducting a trial is the treatment effect. Hence, the focus of this study is on the estimated treatment effect and its standard error. Investigation of estimated covariate effects is more appropriately done by means of an observational study. For this reason, the results with respect to the estimated within- and between-cluster effects will not be elaborated on, although for completeness they will be presented in the tables, giving an overview of the simulation results in terms of relative bias. In these tables, biased parameters are indicated; however, the estimated covariate effects will not be discussed or further tested. For a discussion of the effects of misspecification on the estimated within-cluster and contextual effects, we refer the reader to the study of Shen, Shao, Park, and Palta (2008).

A number of t tests were performed to test whether the relative bias equalled one. When two or more parameter estimates within a design factor (i.e., the inequality of covariate effects or the total covariate effect) were biased, two-sample t tests or ANOVAs were performed to test whether the biases differed between conditions. Significant ANOVA results were evaluated by post hoc t tests.
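The evaluation of the relative bias can be illustrated with a minimal sketch. It computes the relative bias per replication and the one-sample t statistic for the hypothesis that the mean relative bias equals one. The function names are hypothetical, and the sketch is an illustration rather than the software used in the study.

```python
from math import sqrt
from statistics import mean, stdev


def relative_biases(estimates, true_value):
    """Relative bias per replication: estimate divided by the true value."""
    if true_value == 0:
        # For a true value of zero the estimate itself is the bias
        raise ValueError("relative bias is undefined for a true value of zero")
    return [estimate / true_value for estimate in estimates]


def t_statistic_against_one(rel_biases):
    """One-sample t statistic for H0: mean relative bias equals 1."""
    n = len(rel_biases)
    return (mean(rel_biases) - 1.0) / (stdev(rel_biases) / sqrt(n))
```

For example, `t_statistic_against_one(relative_biases(estimates, 0.3))` would be evaluated against a t distribution with `len(estimates) - 1` degrees of freedom for the estimated treatment effects of one condition.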

Since imbalance of an omitted covariate, which in the present study is the between-cluster effect of the covariate \( \gamma_{\text{B}} \) in the ordinary multilevel model, can result in increased standard errors of the treatment effect, the distributions of the estimated standard errors of this parameter were inspected. The standard errors are expected to increase when the OMM is applied. The distribution of the estimated standard errors of the CDEMM (i.e., the appropriate model when within-cluster and contextual effects differ) was used to determine a cutoff point. This cutoff was the estimated value at the 97.5th percentile of the distribution. The estimated standard errors of the OMM exceeding this cutoff point were considered to be inflated, and percentages of inflated standard errors were determined per condition. The estimated standard errors of all parameters were evaluated by means of the coverage. For each estimate, the 95% confidence interval was established, and an indicator was coded 1 if the true parameter value fell inside this interval and 0 otherwise. The mean of this indicator is the coverage, and ideally this proportion should equal .95 (1 – α). Underestimation of the standard errors is reflected by coverages below .95, and overestimation by coverages over .95. Coverages were evaluated by establishing a 99.9% (α = .001) confidence interval around .95 (see, e.g., Newcombe, 1998), corrected for the number of parameters tested simultaneously, and determining whether the coverage was within or outside this interval. The small α was chosen because of the huge number of simulated data sets. When two or more coverages of a parameter were significant within inequality of covariate effects or the total covariate effect, \( \chi^2 \) tests were performed to test whether the coverages differed between the conditions. Post hoc Fisher exact tests were performed when a \( \chi^2 \) test turned out to be significant.
In accordance with the treatment of the parameter estimates, no tests were conducted on biased standard error estimates associated with the covariate parameters.
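The coverage criterion can be sketched as follows. The sketch assumes normal-based (Wald) confidence intervals for the parameter estimates, a simple normal-approximation interval around the nominal coverage, and a Bonferroni-type correction for the number of simultaneous tests; the interval method discussed by Newcombe (1998) may differ from this approximation. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def coverage(estimates, std_errors, true_value, level=0.95):
    """Proportion of Wald confidence intervals that contain the true value."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    hits = [1 if abs(est - true_value) <= z * se else 0
            for est, se in zip(estimates, std_errors)]
    return sum(hits) / len(hits)


def acceptance_interval(n_replications, nominal=0.95, alpha=0.001, n_tests=1):
    """Normal-approximation interval around the nominal coverage.

    alpha is divided by n_tests (an assumed Bonferroni-type correction).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))
    half_width = z * sqrt(nominal * (1.0 - nominal) / n_replications)
    return nominal - half_width, nominal + half_width
```

Under these assumptions, an observed coverage would be flagged when it falls outside the interval returned by `acceptance_interval` for the number of replications on which it is based.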

When the influence of the magnitude of the covariate effect was evaluated, this was done separately for data in which the within-cluster and contextual effects were different and data in which these effects were equal. It is obvious that in the former situation the CDEMM is the appropriate model and, hence, was expected to outperform the OMM, while in the latter situation both models were expected to perform equally well.

Results

This section is divided into three subsections. In the first subsection, issues concerning convergence and inadmissible solutions of the estimation process are described. In the second subsection, the results with respect to the parameter estimates are given. As stated before, our main interest is the estimate of the treatment effect. However, we present and discuss the estimates of the constant and variance components as well. In the third subsection, the standard error estimates are presented. With respect to the presentation and discussion of the standard errors, we will follow the same line as with respect to the estimated parameters.

Convergence and inadmissible solutions

Convergence was reached in all conditions. In MLwiN (Rasbash et al., 2004), it is possible that the estimation procedure will result in negative variance estimates, especially when the true value is close to zero, as is the case in the present study. In practice, such negative variance estimates are usually set to zero. However, by doing so in a simulation study, bias would be introduced. Therefore, the negative values for the second-level variance were retained. Table 1 shows the percentages of negative second-level variance estimates found per model by condition. When the data are analysed with the covariate different-effects multilevel model, the percentage of negative second-level variance estimates is approximately 5% in all conditions [F(4, 44995) = 1.395, p = .233, for inequality of the covariate effects and F(2, 44997) = 0.114, p = .892, for the total covariate effect]. A smaller percentage is found when the data are analysed with the ordinary multilevel model, and it varies by condition [F(4, 44995) = 29.294, p < .001, for inequality of the covariate effects, and F(2, 44997) = 27.193, p < .001, for the total covariate effect; for pairwise comparisons, see Table 2]. The percentage of negative variance is largest when the within-cluster and contextual effects are equal, and it decreases when the inequality becomes more extreme. When the total effect of the covariate increases, the percentage of negative second-level variance estimates decreases (Table 2).

Table 1 Percentages of negative variance and standard deviations per model, by inequality of the covariate effects and by total covariate effect
Table 2 Pairwise comparisons of percentages of negative second-level variance for the ordinary multilevel model, by inequality of the covariate effects (top panel) and total covariate effect (bottom panel)

Parameter estimates

Tables 3 and 4 show all estimated parameters for various magnitudes of the disparity between the covariate effects and of the total covariate effect, respectively. All fixed parameters appear to be estimated without bias when the CDEMM is applied, irrespective of the magnitudes of the inequality and the total covariate effect. When the OMM is applied, the treatment effect \( \gamma_{\text{treat}}^{\prime } \) and the constant \( \gamma_{\text{const}}^{\prime } \) are estimated without bias in all conditions as well.

Table 3 Relative biases (and standard deviations) of the estimated parameters per model, by inequality of the covariate effects
Table 4 Relative biases of the estimated parameters (and standard deviations) per model, by total covariate effect

The random parameters in the CDEMM (that is, the residual cluster-level variance \( \sigma_u^2 \) and the residual subject-level variance \( \sigma_e^2 \)) are estimated without bias. The random effects in the OMM, \( \sigma_u^{{\prime \,2}} \) and \( \sigma_e^{{\prime \,2}} \), are estimated without bias when the within-cluster effect equals the contextual effect. This is as expected, since in this condition the OMM is not a misspecification of the data. However, when the within-cluster and contextual effects differ, the OMM misspecifies the data, and in this situation both variances are biased. The second-level variance \( \sigma_u^{{\prime \,2}} \) is overestimated in all of these conditions, and this overestimation is affected by the magnitudes of the inequality of the covariate effect and the total covariate effect [F(3, 35996) = 1,419.396, p < .0001, and F(2, 35997) = 2,628.895, p < .0001]. Table 5 shows that the overestimation is more severe when the magnitude of the inequality increases [t(15123.858) = 44.321, p < .00017, when the within-cluster effect is smaller than the contextual effect, and t(11901.675) = −13.382, p < .00017, when the within-cluster effect is larger than the contextual effect]. Comparisons between the conditions with, on the one hand, a small within-cluster effect relative to a larger contextual effect and, on the other, the reverse situation are all significant as well (see Table 5). Comparisons between the biases for the different conditions of the total covariate effect reveal that the overestimation of the second-level residual variance \( \sigma_u^{{\prime \,2}} \) increases when the covariate effect increases; that is, the overestimation is smallest when the covariate effect is small, and it is largest when the covariate effect is large [t(29562.393) = −23.338, p < .00033, and t(26691.307) = −40.695, p < .00033, respectively; see Table 5].

Table 5 Primary and post hoc t tests on the biased parameter estimates, by inequality of the covariate effects (top panel) and by total covariate effect (bottom panel)

As we have said, the subject-level residual variance \( \sigma_e^{{\prime \,2}} \) is also biased in conditions with unequal covariate effects. It is overestimated when the within-cluster effect is smaller than the contextual effect and when the within-cluster effect is ten times as large (Table 3). The severity of the overestimation is the same in those three conditions [F(2, 26997) = 3.021, p = .049]. Furthermore, the subject-level variance \( \sigma_e^{{\prime \,2}} \) is overestimated when the total effect of the subject-level covariate is medium or large (Table 4), and the overestimation is more severe when the total effect is large [t(11998) = −3.326, p < .001; Table 5].

To sum up, using the CDEMM results in an unbiasedly estimated constant and treatment effect as well as unbiasedly estimated random parameters, whether the within-cluster effect differs from the contextual effect or not. When the OMM is applied to data with unequal covariate effects, the residual variance on both levels is biased. However, although the model is misspecified, the parameter of main interest, the treatment effect, is unbiased in all conditions.

Standard error estimates

Since the treatment effect is the parameter of main interest in trials, we start with inspection of the distribution of the standard errors of this estimated parameter. Graphs of the standard error (not presented here) showed nearly perfect normal distributions, regardless of the magnitude of the inequality, the magnitude of the total covariate effect, or the model. Furthermore, we found consistent means (0.18), standard deviations (0.019), and hence consistent values for the 97.5th percentiles (0.216), over the conditions for the CDEMM. Using the 97.5th percentile value as a cutoff score, the percentages that exceed this value in the distributions of the estimated standard errors for the OMM were determined. When the within-cluster effect was ten times as small as the contextual effect, we found 27.9% of the estimated standard errors to be larger than the cutoff score. In the other conditions of unequal covariate effects, the percentages were minor (5.7% when the within-cluster effect was three times as small, 3.1% when it was three times as large, and 6.3% when it was ten times as large as the contextual effect). In other words, only when the within-cluster effect is very small compared to the contextual effect are the standard errors associated with the treatment effect seriously inflated.
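The cutoff procedure described above can be sketched as follows. For a nearly normal distribution with the reported mean of 0.18 and standard deviation of 0.019, the 97.5th percentile is roughly 0.18 + 1.96 × 0.019 ≈ 0.217, which is in line with the reported value of 0.216. The function names and the linear-interpolation rule for the percentile are assumptions made for this illustration.

```python
def percentile(values, q):
    """Percentile (q in [0, 100]) of a non-empty list, by linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def percent_inflated(se_reference, se_candidate, q=97.5):
    """Cutoff from the reference model and percentage of candidate SEs above it.

    se_reference : standard errors from the correctly specified model (CDEMM)
    se_candidate : standard errors from the model under study (OMM)
    """
    cutoff = percentile(se_reference, q)
    n_above = sum(1 for se in se_candidate if se > cutoff)
    return cutoff, 100.0 * n_above / len(se_candidate)
```

If both models yielded the same distribution of standard errors, about 2.5% of the candidate standard errors would be expected to exceed the cutoff, which gives a baseline against which the reported percentages can be read.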

Although comparing the distributions of the standard errors revealed considerable inflation in one of the OMM conditions, inspection of the coverages gave no reason to be concerned about inflated Type II errors with respect to the treatment effect. That is, the coverages show that, when the OMM is applied, the standard errors of the treatment effect \( \gamma_{\text{treat}}^{\prime } \) are estimated without bias (lower panels of Tables 6 and 7). The standard errors associated with the constant \( \gamma_{\text{const}}^{\prime } \), however, are biased. In the OMM, they are overestimated in all conditions, even when the within-cluster and contextual effects are equal. The overestimation is affected by the inequality of the covariate effects [χ²(4) = 30.368, p < .0001; Table 8]. Post hoc Fisher’s exact tests reveal that the overestimation only differs when the condition with equal within-cluster and contextual effects is compared to the condition in which the within-cluster effect is ten times as large as the contextual effect (Fisher exact p < .0001; Table 9). When the within-cluster and contextual effects differ, the overestimation is also affected by the total covariate effect [χ²(2) = 26.453, p < .0001; Table 8]. The overestimation differs between the small and large total covariate effects (Fisher exact p < .00033; Table 9).

The standard error of the cluster-level variance \( \sigma_u^2 \) in the CDEMM is slightly underestimated when the within-cluster effect is three times as large as the contextual effect, and when the effects are equal (Table 6). The underestimation is the same in both conditions (Fisher exact p = .171; Table 9). The standard error of the cluster-level variance \( \sigma_u^2 \) is also underestimated when the covariate effect is small or medium and the within-cluster and contextual effects of the covariate are equal (right-hand panel of Table 7). The underestimation is the same for both conditions (Fisher exact p = .406; Table 9). The standard error of the subject-level variance \( \sigma_e^2 \), however, is overestimated in all conditions (Tables 6 and 7). The overestimation is not affected by the inequality of the covariate effects [χ²(4) = 3.431, p = .488] or by the total covariate effect [χ²(2) = 0.667, p = .716, and χ²(2) = 1.001, p = .606, for different and equal covariate effects, respectively; Table 8].
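For readers who want to check how a χ² statistic translates into a p value, the upper-tail probability has a simple closed form when the degrees of freedom are even, as they are for the tests on the subject-level variance above. The sketch below uses a hypothetical helper name and is not the software used for the reported analyses; it only applies that closed form to the three statistics reported for \( \sigma_e^2 \).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    # P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


# Statistics and degrees of freedom as reported in the text.
for stat, df in [(3.431, 4), (0.667, 2), (1.001, 2)]:
    print(f"chi2({df}) = {stat}: p = {chi2_sf_even_df(stat, df):.3f}")
```

Rounded to three decimals, the computed values agree with the reported p values of .488, .716, and .606.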

Table 6 Coverages evaluating the estimated standard errors per model, by inequality of the covariate effects
Table 7 Coverages evaluating the estimated standard errors by total covariate effect, by unequal or equal within-cluster and contextual effects
Table 8 Results of χ² tests on significant coverages, by inequality of the covariate effects (top panel) and by total covariate effect (bottom panel)
Table 9 Primary and post hoc Fisher’s exact tests on the significant coverages, by inequality of the covariate effects (top panel) and by total covariate effect (bottom panel)

In the OMM, standard errors associated with the residual variances are often biased as well. The standard errors of the cluster-level variance \( \sigma_u^{{\prime \,2}} \) are underestimated in two situations: when the within-cluster effect is ten times as small as the contextual effect (Table 6) and when the total covariate effect is small and the within-cluster and contextual effects differ (Table 7). In all other situations, the standard errors associated with the cluster-level variance \( \sigma_u^{{\prime \,2}} \) are accurately estimated. The standard error estimates associated with the subject-level variance \( \sigma_e^{{\prime \,2}} \) are overestimated in all conditions (Tables 6 and 7). Table 8 shows that the overestimation does not depend on the inequality of the covariate effects [χ²(4) = 3.669, p = .453], nor on the magnitude of the total covariate effect [χ²(2) = 0.348, p = .840, when the within-cluster and contextual effects differ, and χ²(2) = 2.803, p = .368, when they are equal].
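The logic of evaluating standard errors through coverages can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: we assume symmetric Wald-type 95% intervals (estimate ± 1.96 × standard error), a normal approximation for the range of coverages compatible with the nominal level, and a hypothetical number of 1,000 replications. The function names are invented for the example.

```python
import math


def coverage(estimates, std_errors, true_value, z=1.96):
    """Share of replications whose Wald-type interval contains true_value."""
    hits = sum(
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, std_errors)
    )
    return hits / len(estimates)


def coverage_bounds(n_replications, nominal=0.95, z=1.96):
    """Approximate range of coverages compatible with the nominal level."""
    half_width = z * math.sqrt(nominal * (1 - nominal) / n_replications)
    return nominal - half_width, nominal + half_width


# Hypothetical number of replications (assumption, not taken from the study).
lower, upper = coverage_bounds(1000)
print(f"coverages between {lower:.3f} and {upper:.3f} are compatible with 95%")
```

Under these assumptions, a coverage above the upper bound indicates intervals that are too wide, that is, overestimated standard errors, whereas a coverage below the lower bound indicates underestimated standard errors.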

To sum up, the standard errors of the constant and of the first-level variance are biased in both models in all conditions. The standard errors of the cluster-level variance are biased in both models in some conditions. However, the standard errors of the treatment effect are unbiased, whether the different within- and between-cluster effects are ignored or not.

Summary and discussion

The parameter of main interest, that is, the treatment effect (\( \gamma_{\text{treat}} \) in the covariate different-effects multilevel model and \( \gamma_{\text{treat}}^{\prime } \) in the ordinary multilevel model), and its standard errors are estimated without bias in all conditions by both models. If a researcher’s only interest is in the treatment effect, it is sufficient to apply an ordinary multilevel model and ignore possible unequal within-cluster and contextual effects of a subject-level covariate. However, researchers who are interested in other parameters as well should take notice of the differences between the two models with respect to parameter and standard error estimates.

When the CDEMM is applied—that is, when the model takes into account the possibility of different within-cluster and contextual effects of a first-level covariate—all parameter estimates are unbiased. This does not hold when a model is applied that assumes the within-cluster effect to be equal to the contextual effect when this assumption is violated—that is, when the OMM is applied to data with unequal within-cluster and contextual effects. In this situation, the random parameters—that is, the cluster-level variance \( \sigma_u^{{\prime \,2}} \) and the subject-level variance \( \sigma_e^{{\prime \,2}} \)—are biased in almost all conditions. This may be tolerable, since the random parameters in cluster-randomized trials are usually viewed as nuisance parameters. However, in, for example, school effectiveness research, the random parameters are of particular interest. In these situations, biased random parameters are not tolerable.
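To make the distinction between the two kinds of covariate effects concrete, the following sketch splits a subject-level covariate into its cluster mean and the within-cluster deviation, which is the usual preparatory step for giving the two parts separate coefficients. It is only an illustration: the toy data are invented, the slopes are plain least-squares slopes rather than multilevel estimates, and we do not assume that the slope on the cluster mean coincides with the contextual effect as parameterized in the CDEMM.

```python
from collections import defaultdict


def split_covariate(clusters, x):
    """Return (cluster_mean, deviation) for every subject-level value in x."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, value in zip(clusters, x):
        sums[c] += value
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    cluster_mean = [means[c] for c in clusters]
    deviation = [value - means[c] for c, value in zip(clusters, x)]
    return cluster_mean, deviation


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Toy data (invented): y = 2 * cluster mean + 0.5 * within-cluster deviation.
clusters = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_mean, x_dev = split_covariate(clusters, x)
y = [2.0 * m + 0.5 * d for m, d in zip(x_mean, x_dev)]

within = slope(x_dev, y)    # slope on the within-cluster deviation
between = slope(x_mean, y)  # slope on the cluster mean
pooled = slope(x, y)        # single slope, as if both effects were equal
print(within, between, pooled)
```

In this toy example the single pooled slope falls between the two separate slopes, which illustrates why a model with one coefficient for the covariate cannot represent both effects when they differ.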

With respect to the standard errors, we have shown that the standard errors of the constant \( \gamma_{\text{const}} \) and of the variance components \( \sigma_u^2 \) and \( \sigma_e^2 \) are biased in both the OMM and the CDEMM when the models are applied to data with different covariate effects. Again, these biases may be tolerable, since they are associated with either a nuisance parameter (i.e., the constant \( \gamma_{\text{const}}^{\prime } \)) or with nonnormally distributed parameters (i.e., the residual variances \( \sigma_u^2 \) and \( \sigma_e^2 \)). Even if variance parameters are of interest, it is not advisable to use the Wald test when evaluating variance components.

In general, it is unknown whether the effect of the first-level covariate differs from or equals the contextual effect. When the effects are equal, both models perform equally well, and when the effects differ, the CDEMM gives better estimates of the variance components. When a researcher is only interested in the treatment parameter, ignoring a possible difference between the within-cluster and contextual effects will do no harm. However, when a researcher is also interested in the variance components of the model, we advise that the covariate different-effects model be used.