# Estimating effect size when there is clustering in one treatment group

## Abstract

Some experimental designs involve clustering within only one treatment group. Such designs may involve group tutoring, therapy administered by multiple therapists, or interventions administered by clinics for the treatment group, whereas the control group receives no treatment. In such cases, the data analysis often proceeds as if there were no clustering within the treatment group. A consequence is that the actual significance level of the treatment effects is larger (i.e., actual *p* values are larger) than nominal. Additionally, biases will be introduced in estimates of the effect sizes and their variances, leading to inflated effects and underestimated variances when clustering in the treatment group is not taken into account. These consequences of clustering can seriously compromise the interpretation of study results. This article shows how information on the intraclass correlation can be used to obtain a correction for biases in the effect sizes and their variances, and also to obtain an adjustment to the significance test for the effects of clustering.

## Keywords

Effect size · Partial clustering · Significance tests · Meta-analysis

Randomized field experiments sometimes assign entire intact groups (such as schools, classes, wards, or hospitals) to treatments. Such designs are called *nested* or *cluster-randomized* designs, because the intact groups are nested within treatment conditions, and the groups may act as statistical clusters in the sense of cluster sampling (see, e.g., Donner & Klar, 2000). Such designs may be used for reasons of convenience (when it is practically or politically difficult to assign individuals within the same cluster to different treatments) or to minimize contamination between different treatments within the same cluster. The purpose of such designs is typically to estimate the average effect of the treatment across groups (e.g., across schools). Although analyses could (and often do) attempt to model the variation across clusters, the cluster effects are taken to be random, and the purpose is to generalize treatment effects to a population of clusters from which the observed clusters are a sample.

Sometimes clustering arises within one treatment group (e.g., the active treatment), but not the other (e.g., a waitlist control). For example, when the treatment is administered to subgroups by agents such as teachers, tutors, or therapists and it is expected that the agents may not produce identical effects, then each subgroup may acquire a component of variance due to the agent with whom they worked. In such cases, the subgroups are statistical clusters whose effects need to be taken into account in the statistical analysis and in computing effect size estimates and their sampling variance. When such treatments are compared to a no-treatment control, the situation is one in which clustering occurs in the treatment group, but not in the control group. This is sometimes called a *partially clustered* (or *partially nested*) design. Designs such as that described above can be found in numerous fields, including education, psychology, and medicine.

Several authors have developed multilevel random- and mixed-effects models that take one-group clustering into account in order to test for individual-study significance (e.g., Bauer, Sterba, & Hallfors, 2008; Hoover, 2002; Lee & Thompson, 2005; Lohr, Schochet, & Sanders, 2014; Roberts & Roberts, 2005). However, too often the clustering is ignored, and the data are analyzed as if the observations are independent of one another. Pals et al. (2008) reviewed articles in six public health and behavioral health journals spanning the years 2002 to 2006, in order to examine how individually randomized trials with treatments administered in groups were analyzed. They found that 32 out of 34 articles ignored the group level entirely, analyzing the data only at the individual level. Analyses that incorrectly assume that the observations are independent may lead to wrong conclusions, because ignoring the extra between-cluster variation underestimates the error term, inflates the magnitude of the effect size, and can lead to inflated Type I errors in tests of hypotheses about treatment effects (see, e.g., Wampold & Serlin, 2000).

The effects of clustering in all treatment groups on effect size estimation were considered by Hedges (2007a, b, 2011). In 2007, he derived adjusted statistics that allow for the calculation of effect sizes (and their sampling variances) when the summary data come from a two-level cluster-randomized design (e.g., when students are clustered within classrooms). In 2011, Hedges extended his analysis to allow for correction in three-level designs (e.g., when students are clustered within classrooms, which are then clustered within schools). Both studies take clustering into account when it exists in both the treatment and control groups; however, we know of no development of effect size methods when there is clustering in only a single group. This problem was called to our attention by a colleague conducting a meta-analysis.

Meta-analyses involve a synthesis of statistical estimates from various studies being used to summarize a particular topic (Lipsey & Wilson, 2001). Summary statistics in each study are used to estimate effect sizes, which are then combined to compute tests of significance for the combined results and to describe the pattern of results on the given topic (Hedges & Olkin, 1985). However, if the magnitudes of the effect sizes included in the meta-analysis are inflated or their variances are underestimated, the overall combined mean effect and its variance will also be affected, which may lead to incorrect inferences. Although the multilevel random- and mixed-effects models that take one-group clustering into account lead to valid tests of individual-study significance, to our knowledge no account has described how to compute effect sizes or their sampling variances in partially clustered designs. The purpose of this report is to describe methods for computing effect size estimates and their variances when there is clustering in only one group and the analysis has not taken clustering into account.

The article is organized into several parts. First, we define the model. Then we provide equations for the appropriate effect size computations and show the calculations via example. Finally, we provide an adjustment for the significance test. In the Appendix, we provide the derivations for the statistics.

## Model

Denote the *j*th observation in the *i*th cluster in the treatment by *Y* _{ ij } ^{ T } (*i* =1, . . . , *m*; *j* =1, . . . , *n*), so that there are *m* clusters of size *n* in the treatment group. Denote the *i*th observation in the control group by *Y* _{ i } ^{ C } (*i* =1, . . . , *N* ^{ C }). Notice that there are *no* clusters in the control group. Thus, the sample size is *N* ^{ T } = *nm* in the treatment group and *N* ^{ C } in the control group, and the total sample size is *N* = *N* ^{ T } + *N* ^{ C } = *nm* + *N* ^{ C }.

Denote the mean of the *i*th cluster in the treatment group by \( {\overline{Y}}_{i\bullet}^T \) (*i* = 1, . . . , *m*), and denote the overall (grand) means in the treatment and control groups by \( {\overline{Y}}_{\bullet \bullet}^T \) and \( {\overline{Y}}_{\bullet}^C \), respectively. We also define the total pooled within-group variance *S*_{T}^{2} via

$$ S_T^2 = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left(Y_{ij}^T - \overline{Y}_{\bullet\bullet}^T\right)^2 + \sum_{i=1}^{N^C}\left(Y_i^C - \overline{Y}_{\bullet}^C\right)^2}{N - 2}. $$

Assume that the observations within the *i*th cluster in the treatment group are normally distributed with mean *μ*_{i}^{T} and within-cluster variance *σ*_{W}^{2}. Assume that the observations are also normally distributed in the control group, with mean *μ*^{C} and variance *σ*_{W}^{2}. That is,

$$ Y_{ij}^T \sim N\!\left(\mu_i^T,\ \sigma_W^2\right) \quad\text{and}\quad Y_i^C \sim N\!\left(\mu^C,\ \sigma_W^2\right). $$

Assume further that the cluster means *μ*_{i}^{T} are themselves normally distributed with mean *μ*_{●}^{T} and variance *σ*_{B}^{2}. That is,

$$ \mu_i^T \sim N\!\left(\mu_{\bullet}^T,\ \sigma_B^2\right). $$

Thus, the standard deviation in the control group is simply *σ*_{W}, whereas the standard deviation in the treatment group is a combination of *σ*_{W} and *σ*_{B}, or *σ*_{T}, defined by

$$ \sigma_T^2 = \sigma_W^2 + \sigma_B^2. $$

Note that the variance *σ*_{B}^{2} of the treatment clusters can be interpreted either as variation due to the groups themselves or as variation (heterogeneity) in treatment effects. If it is due to variation in treatment effects, then some investigators might wish to model that variation as a function of explanatory variables (such as characteristics of therapists). Alternatively, they might be primarily concerned with the average treatment effect across clusters. This article deals with situations like those found in over 90 % of the studies cited by Pals et al. (2008), in which the investigator has chosen not to model between-treatment-cluster variation, so that the focus is implicitly on the average treatment effect across clusters.

The degree of clustering can be characterized by the *intraclass correlation coefficient ρ*, which is defined by

$$ \rho = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_W^2} = \frac{\sigma_B^2}{\sigma_T^2}. $$

The intraclass correlation *ρ* can be used to obtain *σ* _{ W } from *σ* _{ T }, as *σ* _{ W } ^{2} *=* (1 – *ρ*)*σ* _{ T } ^{2}.
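These identities are easy to check numerically. The following sketch (the variance-component values are arbitrary illustrations, not data from the article) computes *σ*_{T} and *ρ* from assumed components and recovers *σ*_{W}^{2}:

```python
import math

# Assumed (illustrative) variance components, not values from the article:
sigma_w, sigma_b = 3.0, 1.0

var_t = sigma_w ** 2 + sigma_b ** 2      # sigma_T^2 = sigma_W^2 + sigma_B^2
rho = sigma_b ** 2 / var_t               # intraclass correlation
sigma_t = math.sqrt(var_t)

# sigma_W^2 is recovered from sigma_T^2 and rho:
assert abs((1 - rho) * var_t - sigma_w ** 2) < 1e-12
```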

Because intraclass correlations are used in planning cluster-randomized experiments, there have been considerable efforts to develop empirical evidence about intraclass correlations. One form of evidence is based on intraclass correlation estimates from experiments that have already been conducted (see, e.g., Baldwin et al., 2011; Schnurr et al., 2007). Another form of evidence is based on secondary analyses of sample surveys that use cluster-sampling designs (see, e.g., Hedges & Hedberg, 2007). In order to produce a reference guide for planning cluster-randomized trials in education, Hedges and Hedberg (2007) compiled a collection of intraclass correlation coefficients of academic achievement and provided extensive tables of *ρ* values in the Variance Almanac, available on the website for the University of Chicago’s Center for Advancing Research and Communication. A third form of evidence is based on census data collections from administrative units, such as school districts or states (see, e.g., Bloom, Richburg-Hayes, & Black, 2007; Hedges & Hedberg, 2013).

## Effect sizes

The most commonly used effect sizes in experimental educational and psychological research are standardized mean differences, defined as the difference between the treatment and control group means divided by (i.e., standardized by) a standard deviation. In designs in which there is statistical clustering in at least one of the treatment groups, there is more than one standard deviation, leading to more than one possible definition of the standardized mean difference (see, e.g., Hedges, 2007b).

Here we focus on designs in which there is clustering in only one group (the treatment group). Two effect sizes are possible, depending on whether or not one wants to include between-cluster–within-treatment group variance in the standard deviation.

If the within-cluster standard deviation is used (provided that *σ*_{W} ≠ 0, and hence *ρ* ≠ 1), one effect size parameter has the form

$$ \delta_W = \frac{\mu_{\bullet}^T - \mu^C}{\sigma_W}. $$

If the total standard deviation *σ*_{T} in the treatment group is used instead, the effect size parameter has the form

$$ \delta_T = \frac{\mu_{\bullet}^T - \mu^C}{\sigma_T}. $$

If *ρ* is known, it is possible to obtain *δ*_{W} from *δ*_{T} (and vice versa), because

$$ \delta_W = \frac{\delta_T}{\sqrt{1 - \rho}} \quad\text{and}\quad \delta_T = \delta_W\sqrt{1 - \rho}. $$

## Estimating the effect sizes

It is our experience that the information reported in studies will often involve *S* _{ T } (e.g., when the clustering within the treatment group was ignored in the statistical analysis). The actual data might be a standard deviation or a *t* or *F* statistic from an analysis that ignored clustering. Because we believe that the information most likely to be reported will involve *S* _{ T }, we focus first on estimating *δ* _{ T } using *S* _{ T } and converting this estimate to an estimate of *δ* _{ W }, if that estimate is desired. However, occasionally, only *S* _{ W } is reported and not *S* _{ T } (see, e.g., Chaplin & Capizzano, 2006). Consequently, we also address estimation of *δ* _{ W } and *δ* _{ T } when only *S* _{ W } is reported.

### Estimating δ_{T} when S_{T} is reported

The natural estimator computes the standardized mean difference from the observed means and the pooled standard deviation *S*_{T},

$$ d_{Naive} = \frac{\overline{Y}_{\bullet\bullet}^T - \overline{Y}_{\bullet}^C}{S_T}, $$

treating it as an estimate of *δ*_{T}. However, this *d*_{Naive} has a small but systematic bias (which does not disappear even in large samples), due to the fact that *S*_{T}^{2} consistently underestimates *σ*_{T}^{2} when there is clustering in the treatment group.

Knowledge of the intraclass correlation *ρ* may be used to obtain an estimate of *δ*_{T} that takes clustering in the treatment group into account. A direct argument shows that a consistent estimator of *δ*_{T} is

$$ d_T = \left(\frac{\overline{Y}_{\bullet\bullet}^T - \overline{Y}_{\bullet}^C}{S_T}\right)\sqrt{1 - \frac{\left(N^C + n - 2\right)\rho}{N - 2}} = d_{Naive}\sqrt{1 - \frac{\left(N^C + n - 2\right)\rho}{N - 2}}. \tag{5} $$

The second factor (under the radical sign) arises as a correction for the bias in *S*_{T}^{2} as an estimate of *σ*_{T}^{2}. Note that when *ρ* = 0, so that there is no clustering, the correction factor is 1 and *d*_{T} = *d*_{Naive}. The expression for *d*_{T} makes it clear that |*d*_{T}| ≤ |*d*_{Naive}| and that the difference between *d*_{T} and *d*_{Naive} is an increasing function of *ρ*, but also that the value of *d*_{T} is often very similar to that of *d*_{Naive}. For example, if *N*^{C} = 10, *m* = 2, and *n* = 5, so that *N*^{C} = *N*^{T} = 10, the ratio *d*_{T}/*d*_{Naive} ranges from .98 for *ρ* = .05 to .91 for *ρ* = .25. Note that if there is no clustering, so that *ρ* = 0, *d*_{T} given in Eq. 5 reduces to the usual standardized mean difference.
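The estimator in Eq. 5 is straightforward to compute from summary statistics. A minimal Python sketch (the function names are ours, not from the article) that also reproduces the *d*_{T}/*d*_{Naive} ratios quoted above:

```python
import math

def d_naive(mean_t, mean_c, s_t):
    """Standardized mean difference ignoring clustering: (mean_T - mean_C) / S_T."""
    return (mean_t - mean_c) / s_t

def d_t(mean_t, mean_c, s_t, n, m, n_c, rho):
    """Consistent estimator of delta_T under one-group clustering (Eq. 5)."""
    n_total = n * m + n_c                                  # N = N^T + N^C
    correction = 1 - (n_c + n - 2) * rho / (n_total - 2)   # bias correction for S_T^2
    return d_naive(mean_t, mean_c, s_t) * math.sqrt(correction)

# Ratio d_T / d_Naive for N^C = N^T = 10 (m = 2 clusters of n = 5):
ratio_05 = d_t(1.0, 0.0, 1.0, n=5, m=2, n_c=10, rho=0.05)  # d_Naive = 1 here
ratio_25 = d_t(1.0, 0.0, 1.0, n=5, m=2, n_c=10, rho=0.25)
# round(ratio_05, 2) -> 0.98 and round(ratio_25, 2) -> 0.91, matching the text
```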

The estimator *d*_{T} is approximately normally distributed with variance

$$ v_T = \left(\frac{N^T + N^C}{N^T N^C}\right)\left[1 + \left(\frac{nN^C}{N} - 1\right)\rho\right] + \frac{d_T^2}{2h}, \tag{6} $$

where *h* is the effective degrees of freedom of *S*_{T}^{2}, given by

$$ h = \frac{\left[\left(N - 2\right) - \left(N^C + n - 2\right)\rho\right]^2}{\left(N - m - 1\right)\left(1 - \rho\right)^2 + \left(m - 1\right)\left[1 + \left(n - 1\right)\rho\right]^2}. \tag{7} $$

The factor [1 + (*nN*^{C}/*N* – 1)*ρ*] in Expression 6 is a design effect or variance inflation factor that describes the ratio of the variance of the mean difference when one group is clustered to that when neither group is clustered. Moreover, observe that in a balanced design in which *N*^{T} = *N*^{C}, the design effect becomes [1 + (*n*/2 – 1)*ρ*]. Comparing the design effect just given for clustering in one group with that when there is clustering in both groups, [1 + (*n* – 1)*ρ*], we see that clustering in one group has the same impact on the variance of the mean difference as clustering in both groups with cluster sizes half as large (i.e., clusters of size *n*/2 vs. *n*). When clustering is ignored, the variance is computed as

$$ v_{Naive} = \frac{N^T + N^C}{N^T N^C} + \frac{d_{Naive}^2}{2\left(N - 2\right)}. \tag{8} $$

Although the value of *d*_{T} is often quite similar to the value of *d*_{Naive}, the variance of *d*_{T} is typically considerably larger than the variance that would be computed if clustering were ignored. Note that if *ρ* = 0, so that there is no clustering, then *h* = *N* – 2 and Expression 6 reduces to Expression 8, so that *v*_{T} = *v*_{Naive}. For example, if *N*^{C} = 10, *m* = 2, and *n* = 5, so that *N*^{C} = *N*^{T} = 10, the ratio *v*_{T}/*v*_{Naive} ranges from 1.07 for *ρ* = .05 to 1.36 for *ρ* = .25. In the latter case, *v*_{Naive} underestimates the variance of *d*_{T} by more than a third.
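Expressions 6–8 can be sketched as follows (Python; the helper names are ours). The *ρ* = 0 check at the end confirms that the clustered formulas collapse to the naive ones:

```python
def effective_df(n, m, n_c, rho):
    """Effective degrees of freedom h of S_T^2 (Eq. 7)."""
    n_total = n * m + n_c
    num = ((n_total - 2) - (n_c + n - 2) * rho) ** 2
    den = ((n_total - m - 1) * (1 - rho) ** 2
           + (m - 1) * (1 + (n - 1) * rho) ** 2)
    return num / den

def v_t(d, n, m, n_c, rho):
    """Sampling variance of d_T (Eq. 6)."""
    n_t = n * m
    n_total = n_t + n_c
    design_effect = 1 + (n * n_c / n_total - 1) * rho
    return ((n_t + n_c) / (n_t * n_c)) * design_effect \
        + d ** 2 / (2 * effective_df(n, m, n_c, rho))

def v_naive(d, n_t, n_c):
    """Variance of d_Naive when clustering is ignored (Eq. 8)."""
    return (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c - 2))

# With rho = 0 there is no clustering, and the formulas agree exactly:
assert v_t(0.5, n=5, m=2, n_c=10, rho=0.0) == v_naive(0.5, n_t=10, n_c=10)
```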

An approximately unbiased estimator of *δ*_{T} (analogous to that given by Hedges, 1981) can be obtained by multiplying *d*_{T} by

$$ J\left(h\right) = 1 - \frac{3}{4h - 1}, $$

where *h* is the effective degrees of freedom of *S*_{T}^{2} given in Expression 7. Thus, the approximately unbiased estimator of *δ*_{T} is

$$ g_T = J\left(h\right) d_T, $$

which is an analogue of Hedges’s *g* in designs without clustering, and which has variance [*J*(*h*)]^{2} *v* _{ T }.
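The small-sample correction is a one-liner (Python; function names ours):

```python
def small_sample_j(h):
    """Hedges' small-sample correction factor J(h) = 1 - 3 / (4h - 1)."""
    return 1 - 3 / (4 * h - 1)

def g_from_d(d, h):
    """Approximately unbiased estimator g = J(h) * d, with variance [J(h)]^2 * v."""
    return small_sample_j(h) * d
```

For the effective degrees of freedom used later in the article's example (*h* = 79.475), `small_sample_j` returns about .9905, so the correction is very small.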

### Estimating δ_{W} when S_{T} is reported

If *ρ* ≠ 1 (so that *σ*_{W} ≠ 0) and an estimate of *δ*_{W} is desired, then the estimate *d*_{T} can be converted into an estimate of *δ*_{W} by using the intraclass correlation; namely,

$$ d_W = \frac{d_T}{\sqrt{1 - \rho}}. $$

This estimator is a consistent estimator of *δ*_{W} and has variance *v*_{W} = *v*_{T}/(1 – *ρ*). Similarly, an approximately unbiased estimator of *δ*_{W} is

$$ g_W = \frac{g_T}{\sqrt{1 - \rho}} = J\left(h\right)\frac{d_T}{\sqrt{1 - \rho}}, $$

and has variance [*J*(*h*)]^{2} *v* _{ W } = [*J*(*h*)]^{2} *v* _{ T } /(1 – *ρ*).

### Estimating δ_{W} when S_{W} is reported

When *ρ* ≠ 1 (so that *σ*_{W} ≠ 0), the effect size estimator

$$ d_W = \frac{\overline{Y}_{\bullet\bullet}^T - \overline{Y}_{\bullet}^C}{S_W} \tag{13} $$

is a consistent estimator of *δ*_{W}, where *S*_{W} is the standard deviation of the control group. It is approximately normally distributed with variance

$$ v_W = \frac{N^T + N^C}{N^T N^C} + \frac{n\rho}{\left(1 - \rho\right)N^T} + \frac{d_W^2}{2\left(N^C - 1\right)}. \tag{14} $$

Note that when *ρ* = 0, the variance given in Eq. 14 reduces to the variance of an effect size standardized by the control group standard deviation when clustering is ignored. An approximately unbiased estimator of *δ*_{W} is

$$ g_W = J\left(N^C - 1\right) d_W, $$

which is an analogue of Hedges’s *g*, and which has variance [*J*(*N* ^{ C } – 1)]^{2} *v* _{ W }.
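A sketch of Expressions 13 and 14 (Python; function names ours). The assertion checks the *ρ* = 0 reduction noted above:

```python
def d_w_from_sw(mean_t, mean_c, s_w):
    """Effect size standardized by the control-group SD (Eq. 13)."""
    return (mean_t - mean_c) / s_w

def v_w_from_sw(d_w, n, m, n_c, rho):
    """Sampling variance of d_W when S_W is reported (Eq. 14)."""
    n_t = n * m
    return ((n_t + n_c) / (n_t * n_c)          # variance from the two means
            + n * rho / ((1 - rho) * n_t)      # extra between-cluster variation
            + d_w ** 2 / (2 * (n_c - 1)))      # uncertainty in S_W (df = N^C - 1)

# With rho = 0 this is the usual control-standardized variance:
assert v_w_from_sw(0.5, n=5, m=2, n_c=10, rho=0.0) == 20 / 100 + 0.25 / 18
```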

### Estimating δ_{T} when S_{W} is reported

If *ρ* ≠ 1 (so that *σ*_{W} ≠ 0) and an estimate of *δ*_{T} is desired, then the estimate *d*_{W} can be converted into an estimate of *δ*_{T} by using the intraclass correlation; namely,

$$ d_T = d_W\sqrt{1 - \rho}. $$

This estimator is a consistent estimator of *δ*_{T} and has variance *v*_{T} = *v*_{W}(1 – *ρ*). Similarly, an approximately unbiased estimator of *δ*_{T} is

$$ g_T = J\left(N^C - 1\right) d_W\sqrt{1 - \rho}, $$

and has variance [*J*(*N*^{C} – 1)]^{2} *v*_{W}(1 – *ρ*).

### Example

Kubany et al. (2004) conducted a study on cognitive trauma therapy for battered women (CTT-BW) with posttraumatic stress disorder (PTSD). They administered cognitive therapy to women assigned to one of two groups: the immediate therapy group, who received cognitive trauma therapy right away, and the delayed therapy group, who received therapy only six weeks after the initial assessment. The women were assessed on various measures at four different time points in order to assess treatment effects. We focused on the results of the Clinician-Administered PTSD Scale (CAPS) at the second time point, which compares the women two weeks after the immediate therapy group completed CTT-BW and directly before the delayed therapy group received CTT-BW. (The results that we use to conduct our calculations for this example can be found in columns 3 and 8 of Table 3 of that article: “Posttherapy for the immediate therapy group and Pretherapy 2 for the delayed therapy group.”) That time point is particularly useful for illustrating the method because the immediate therapy group contains *m* = 7 therapist clusters, whereas the delayed therapy group does not contain any clustering. The authors did not take this one-group clustering into account; thus, their results may be inflated due to the omission of between-cluster treatment variance. Note that our goal is not to criticize the authors for not accounting for clustering in the treatment group, since the method presented here was not previously available, but rather to use the article to illustrate the new method and describe the difference in results.

The control (delayed therapy) group contained a total of *N*^{C} = 40 women, whereas the treatment (immediate therapy) group consisted of *N*^{T} = 45 women. With *m* = 7 therapists, the average cluster size in the treatment group was *N*^{T}/*m* = 45/7 = 6.4. The exact sizes of the clusters were not reported, so here we treat the clusters as equal. Moreover, we chose a slightly conservative sample size by rounding down to *n* = 6. That resulted in a total treatment group sample size of *N*^{T} = *nm* = 6 × 7 = 42 and a total sample size of *N* = *N*^{T} + *N*^{C} = 42 + 40 = 82. The means (and standard deviations) for the treatment and control groups reported in Table 3 of Kubany et al. (2004) are 15.8 (14.4) and 71.9 (23.8), respectively. Finally, in order to adjust the results for clustering, an intraclass correlation coefficient is required. Schnurr et al. (2007) reported a *ρ* of .05 for therapists in the context of cognitive behavioral therapy for PTSD in women.

We first illustrate the calculation of *d*_{T} when information to compute *S*_{T} is reported (i.e., the section “Estimating *δ*_{T} when *S*_{T} is reported” above). With the mean difference between the treatment and control groups \( {\overline{Y}}_{\bullet \bullet}^T-{\overline{Y}}_{\bullet}^C=15.8-71.9=-56.1 \) and the pooled within-group standard deviation *S*_{T} = 19.555, the standardized mean difference (SMD) effect size that does not account for clustering is *d*_{Naive} = –56.1/19.555 = –2.869. Given an intraclass correlation of *ρ* = .05, the SMD that accounts for treatment group clustering may be calculated using Expression 5:

$$ d_T = \left(\frac{-56.1}{19.555}\right)\sqrt{1 - \frac{\left(40 + 6 - 2\right)\left(.05\right)}{82 - 2}} = -2.869\sqrt{.9725} = -2.829. $$

The variance of *d*_{T} requires the effective degrees of freedom *h* = 79.475, calculated using Expression 7; that is,

$$ h = \frac{\left[80 - \left(44\right)\left(.05\right)\right]^2}{\left(74\right)\left(.95\right)^2 + \left(6\right)\left[1 + \left(5\right)\left(.05\right)\right]^2} = \frac{\left(77.8\right)^2}{76.16} = 79.475. $$

Using Expression 6, the variance of *d*_{T} is then

$$ v_T = \left[\frac{82}{\left(42\right)\left(40\right)}\right]\left[1 + \left(\frac{\left(6\right)\left(40\right)}{82} - 1\right)\left(.05\right)\right] + \frac{\left(-2.829\right)^2}{2\left(79.475\right)} = .0535 + .0504 = .1039, $$

yielding a 95 % confidence interval for *δ*_{T} of [–3.461, –2.197].

With a confidence interval of [–3.489, –2.248] for the population effect ignoring clustering, the difference in lengths is minimal (0.022, or 1.8 %).
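The computations above can be reproduced in a few lines, using only the published summary statistics (a Python sketch):

```python
import math

# Summary data from Table 3 of Kubany et al. (2004): CAPS at the second time point
mean_t, sd_t, n_treat = 15.8, 14.4, 42   # immediate therapy (m = 7 clusters of n = 6)
mean_c, sd_c, n_ctrl = 71.9, 23.8, 40    # delayed therapy (no clustering)
n, m, rho = 6, 7, 0.05
n_total = n_treat + n_ctrl               # N = 82

# Pooled within-group SD, S_T (computed as if there were no clustering)
s_t = math.sqrt(((n_treat - 1) * sd_t ** 2 + (n_ctrl - 1) * sd_c ** 2) / (n_total - 2))

d_naive = (mean_t - mean_c) / s_t                                      # -> -2.869
d_t = d_naive * math.sqrt(1 - (n_ctrl + n - 2) * rho / (n_total - 2))  # -> -2.829

# Effective degrees of freedom h (Expression 7): -> 79.475
h = (((n_total - 2) - (n_ctrl + n - 2) * rho) ** 2
     / ((n_total - m - 1) * (1 - rho) ** 2 + (m - 1) * (1 + (n - 1) * rho) ** 2))

# Variance of d_T (Expression 6) and 95 % confidence interval
v_t = ((n_treat + n_ctrl) / (n_treat * n_ctrl)
       * (1 + (n * n_ctrl / n_total - 1) * rho)
       + d_t ** 2 / (2 * h))
ci = (d_t - 1.96 * math.sqrt(v_t), d_t + 1.96 * math.sqrt(v_t))
```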

If one desires an approximately unbiased estimate of *δ*_{T}, then one would multiply *d*_{T} by *J*(*h*), resulting in an analogue of Hedges’s *g*:

$$ g_T = J\left(79.475\right) d_T = \left(.9905\right)\left(-2.829\right) = -2.802. $$

The estimate *g*_{T} may also be converted into an estimate of *g*_{W} by using the intraclass correlation:

$$ g_W = \frac{g_T}{\sqrt{1 - .05}} = \frac{-2.802}{\sqrt{.95}} = -2.875. $$

Similarly, *d*_{W} may be calculated as

$$ d_W = \frac{d_T}{\sqrt{1 - .05}} = \frac{-2.829}{\sqrt{.95}} = -2.902. $$

### How large are the effects of clustering in one group?

The effects of clustering in our example were quite modest, which might suggest that the results are not worth the extra effort of correcting the estimates for treatment group clustering. However, even though the correction is modest in this example, it need not be in other (plausible) situations. The magnitude of the corrections for clustering on both *d*_{T} and its variance depends on two things: the size of the intraclass correlation *ρ* and the size of the clusters *n*. If either of these is large, then the effects of clustering will be large. In some areas of research, either of them can be large.

For example, in education, the intraclass correlation of academic achievement test scores at the school level is .22 (on average across all school grades; Hedges & Hedberg, 2007). That is much larger than the *ρ* of .05 for therapists presented in the example. It is also easy to imagine large cluster sizes: for example, when the treatment is administered by clinics or social service centers, there may be few clusters but large sample sizes associated with each of them.

Suppose that the actual intraclass correlation in the example above was *ρ* = .22 (a representative value from education), rather than *ρ* = .05. The effect size *d* _{ T } decreases by 6.24 % in absolute magnitude (from –2.869 to –2.690), the variance *v* _{ T } increases by 21.5 % (from .100 to .122), the degrees of freedom *h* decrease by 13.5 % (from 80 to 69.18), and the confidence interval length increases by 10.2 % (from 1.241 to 1.368).

A similar difference would result if the clusters had been larger. In our example, suppose that, instead of *m* =7 clusters of size *n* =6, there were *m* =2 clusters of size *n* =21, for the same total sample size of *N* ^{ T } =42 and *N* ^{ C } =40 (and *ρ* = .05 once again). In this case, the adjustment for clustering still produces a negligible impact on the estimate, decreasing *d* _{ T } by 1.9 % in absolute magnitude (from –2.869 to –2.815), increasing the variance *v* _{ T } by 21.3 % (from .1002 to .1216), decreasing the effective degrees of freedom by 1.5 % (from 80 to 78.84), and increasing the length of the confidence interval by 10.2 % (from 1.241 to 1.367). One might still argue that the change in effect size estimate is modest, but the changes in the variance and length of the confidence intervals certainly are not. To explore these potential changes further, let us examine the effects of clustering for various potential values of *n* and *ρ*.

Figure 1 displays the changes in *d*_{T}, *v*_{T}, and *h* as *ρ* and *n* increase. The results display the effect ratio, defined as the adjusted estimate divided by the naive estimate, plotted on the *y*-axis, and either *ρ* (Fig. 1a) or *n* (Fig. 1b) plotted on the *x*-axis. Note that we examine only a balanced design here, in which *m* = 4 and *N*^{C} = *N*^{T} = *nm*. In Fig. 1a, *n* = 10, and in Fig. 1b, *ρ* = .20, an intraclass correlation often found in education research.

First, notice that, in Fig. 1a, when *ρ* = 0, the effect ratio is 1 for all three estimates. That is because, in that case, no excess between-group variability exists (just as when clustering does not exist in the study), so the estimates that account for clustering are identical to those that do not. However, the ratio diverges from 1 as *ρ* increases. The changes in the estimates are small for the effect size and moderate for the degrees of freedom, with *d*_{T} decreasing by only 13 % and *h* decreasing by 51 % when *ρ* = .40. The variance *v*_{T}, on the other hand, increases 2.5-fold by the time *ρ* = .40. Figure 1b displays a similar pattern. The changes in the effect size and degrees of freedom are small compared to the changes in the variance as *n* increases (although the decrease in degrees of freedom is by no means negligible); by the time *n* = 100, there is a more than nine-fold increase in the variance, whereas *d*_{T} and *h* decrease by 6 % and 66 %, respectively. Nonetheless, keep in mind that even small decreases in the effect size may affect the results of meta-analyses.

Recall that, in meta-analyses, effects are combined in order to summarize a particular set of research findings. If a portion of the effect sizes are adjusted toward zero in order to account for treatment group clustering, the combined mean effect will also decrease in size. Moreover, with substantially larger variances about the effects, the effect sizes will be less precise and more variable about the overall mean effect. Depending on the magnitudes of both, the meta-analytic results may change considerably.

## Adjusting the significance test for clustering in only one group

Research studies that involve clustering in one treatment group often ignore that clustering in their statistical analyses (Pals et al., 2008). This is important because it leads to computed levels of statistical significance that are smaller (and can be *much* smaller) than the actual level of statistical significance (see, e.g., Hedges, 2007a). That is, the *p* values computed for tests of the null hypothesis of no treatment effects are too small when clustering is ignored. In reviewing such research, it is sometimes useful to determine the approximate actual level of statistical significance for such studies (to more accurately state the findings of individual studies). Such an approach is used by the US Institute of Education Sciences (IES) What Works Clearinghouse in its reviews.

To illustrate in the context of the two-sample *t* test: when clustering is ignored, the Student’s *t* would be computed as

$$ t_{Naive} = \frac{\overline{Y}_{\bullet\bullet}^T - \overline{Y}_{\bullet}^C}{S_T\sqrt{\frac{1}{N^T} + \frac{1}{N^C}}}, $$

which is compared with critical values of the *t* distribution with *N*^{T} + *N*^{C} – 2 degrees of freedom. Taking clustering into account, the *t* value must be adjusted by multiplying *t*_{Naive} by the square root of the factor *f*,

$$ f = \frac{\left(N - 2\right) - \left(N^C + n - 2\right)\rho}{\left(N - 2\right)\left[1 + \left(\frac{nN^C}{N} - 1\right)\rho\right]}, $$

so that the adjusted *t* statistic is

$$ t_A = t_{Naive}\sqrt{f}. $$

Note that *t* _{ A } has a *t* distribution with *h* degrees of freedom when the null hypothesis is true, where *h* is defined in Expression 7. If there is no clustering so that *ρ* =0, then *f* =1, *h = N* – 2, and *t* _{ A } reduces to the usual test statistic ignoring clustering.
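The adjustment can be sketched as follows (Python; function names are ours). The assertion checks the no-clustering case *f* = 1:

```python
import math

def adjustment_factor(n, m, n_c, rho):
    """Factor f such that t_A = t_Naive * sqrt(f) has h degrees of freedom."""
    n_t = n * m
    n_total = n_t + n_c
    num = (n_total - 2) - (n_c + n - 2) * rho                  # bias correction for S_T^2
    den = (n_total - 2) * (1 + (n * n_c / n_total - 1) * rho)  # design effect
    return num / den

def t_adjusted(t_naive, n, m, n_c, rho):
    """Adjusted t statistic t_A = t_Naive * sqrt(f)."""
    return t_naive * math.sqrt(adjustment_factor(n, m, n_c, rho))

# No clustering means no adjustment:
assert adjustment_factor(6, 7, 40, 0.0) == 1.0
```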

The distribution of *t*_{A} also provides insight into the behavior of the unadjusted *t* test when there is clustering. It is well known that the unadjusted *t* test has a rejection rate that is often much higher than nominal. The sampling distribution of *t*_{A} provides an analytic expression for the rejection rate of the unadjusted *t* test under the cluster-sampling model. Let *t*(*ν*, *α*) denote the level-*α* two-sided critical value for the *t* distribution with *ν* degrees of freedom. The usual unadjusted *t* test rejects if |*t*| > *t*(*N* – 2, *α*). Because \( {t}_A = {t}_{Naive}\sqrt{f} \) has a *t* distribution with *h* degrees of freedom under the null hypothesis, the rejection rate of the unadjusted test is

$$ 1 - F\left[t\left(N - 2, \alpha\right)\sqrt{f},\ h\right] + F\left[-t\left(N - 2, \alpha\right)\sqrt{f},\ h\right], $$

where *F*[*x*, *ν*] is the cumulative distribution function of the *t* distribution with *ν* degrees of freedom. Table 1 provides the actual rejection rates of the *α* = .05 level two-sided significance test for various values of *N*^{T} = *N*^{C}, *n*, *m*, and *ρ*.

Table 1 Actual significance level of a nominal .05 significance test when there is clustering in the treatment group

| *ρ* | *m* | *n* | *N*^{T} = *N*^{C} | *N* – 2 | *h* | \( \sqrt{f} \) | Actual *α* |
|---|---|---|---|---|---|---|---|
| .05 | 2 | 5 | 10 | 18 | 17.9 | .947 | .06 |
| .05 | 2 | 10 | 20 | 38 | 37.7 | .896 | .08 |
| .05 | 2 | 20 | 40 | 78 | 76.0 | .815 | .11 |
| .05 | 2 | 25 | 50 | 98 | 96.4 | .782 | .12 |
| .05 | 2 | 50 | 100 | 198 | 191.5 | .661 | .19 |
| .05 | 2 | 75 | 150 | 298 | 283.6 | .584 | .25 |
| .05 | 2 | 100 | 200 | 398 | 372.8 | .528 | .30 |
| .05 | 2 | 200 | 400 | 798 | 703.0 | .402 | .43 |
| .05 | 2 | 500 | 1,000 | 1,998 | 1,493.9 | .268 | .60 |
| .10 | 5 | 5 | 25 | 48 | 47.0 | .905 | .08 |
| .10 | 5 | 10 | 50 | 98 | 93.9 | .820 | .11 |
| .10 | 5 | 20 | 100 | 198 | 181.7 | .704 | .17 |
| .10 | 5 | 25 | 125 | 248 | 223.0 | .661 | .19 |
| .10 | 5 | 50 | 250 | 498 | 406.4 | .526 | .30 |
| .10 | 5 | 75 | 375 | 748 | 558.8 | .450 | .38 |
| .10 | 5 | 100 | 500 | 998 | 687.5 | .399 | .43 |
| .10 | 5 | 200 | 1,000 | 1,998 | 1,049.2 | .294 | .56 |
| .10 | 5 | 500 | 2,500 | 4,998 | 1,532.0 | .191 | .71 |
| .15 | 5 | 5 | 25 | 48 | 45.6 | .863 | .09 |
| .15 | 5 | 10 | 50 | 98 | 88.6 | .755 | .14 |
| .15 | 5 | 20 | 100 | 198 | 163.0 | .622 | .22 |
| .15 | 5 | 25 | 125 | 248 | 195.4 | .578 | .26 |
| .15 | 5 | 50 | 250 | 498 | 323.2 | .445 | .38 |
| .15 | 5 | 75 | 375 | 748 | 412.7 | .375 | .46 |
| .15 | 5 | 100 | 500 | 998 | 478.8 | .330 | .52 |
| .15 | 5 | 200 | 1,000 | 1,998 | 630.1 | .240 | .64 |
| .15 | 5 | 500 | 2,500 | 4,998 | 777.1 | .154 | .76 |

This table illustrates that the actual significance level of the nominal 5 % test is higher than 5 % when *ρ* >0. Moreover, when either *ρ* or *n* is large, the actual significance level can be *much* higher than nominal. For example, when the cluster size *n* =100, the actual rejection rate of the unadjusted test when the null hypothesis is true is .30 if *ρ* = .05, .43 when *ρ* = .10, and .52 when *ρ* = .15. Therefore, when *ρ* = .15 and *n* =100, the unadjusted test with a nominal 5 % significance level rejects the (true) null hypothesis more than 50 % of the time. Even with rather small values of *ρ* (like *ρ* = .05), the behavior of the unadjusted test is unacceptable for all but the smallest cluster sizes.
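The √f and *h* columns of Table 1 can be reproduced with a short function (Python; the rejection-rate column additionally requires a *t* CDF, e.g., `scipy.stats.t`, which we only indicate in a comment):

```python
def sqrt_f_and_h(rho, m, n, n_c):
    """Return (sqrt(f), h) for a design with N^T = n * m and no control clustering."""
    n_t = n * m
    n_total = n_t + n_c
    bias = (n_total - 2) - (n_c + n - 2) * rho
    f = bias / ((n_total - 2) * (1 + (n * n_c / n_total - 1) * rho))
    h = bias ** 2 / ((n_total - m - 1) * (1 - rho) ** 2
                     + (m - 1) * (1 + (n - 1) * rho) ** 2)
    return f ** 0.5, h

# First row of Table 1 (rho = .05, m = 2, n = 5, N^T = N^C = 10):
sqrt_f, h = sqrt_f_and_h(0.05, m=2, n=5, n_c=10)   # -> (.947, 17.9)
# Actual rejection rate, given a t CDF F and critical value t(N - 2, .05):
#   1 - F(t_crit * sqrt_f, h) + F(-t_crit * sqrt_f, h)
# e.g., with scipy.stats.t: t.ppf(0.975, N - 2) for t_crit and t.cdf for F
```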

### Example

Returning to the Kubany et al. (2004) example, the hypothesis being tested is *H*_{0}: *μ*_{•}^{Immediate} = *μ*^{Delayed}. Ignoring clustering, the value of Student’s *t* is

$$ t_{Naive} = \frac{15.8 - 71.9}{19.555\sqrt{\frac{1}{42} + \frac{1}{40}}} = -12.985, $$

with *df* = 80 and a highly significant *p* value of *p* < .0001. Adjusting the naive *t* value for clustering in the treatment group, *t*_{A} is

$$ t_A = t_{Naive}\sqrt{f} = \left(-12.985\right)\left(.942\right) = -12.232, $$

with *h* = 79.475 degrees of freedom from above. Because the square root of the adjustment factor *f* is so close to 1 (.942), the absolute magnitude of *t* decreases by only 5.8 %. The degrees of freedom are essentially unchanged as well (from 80 to 79.475), leading again to a highly significant *p* value of *p* < .0001.

Remember that these minimal changes in the estimates are due to the small intraclass correlation of .05 for therapists. Once again, imagine that *ρ* = .22, the average intraclass correlation for schools. In this case, the absolute magnitude of *t* _{ A } decreases by 21.4 % (from –12.985 to –10.202), and *h* decreases by 13.5 % (from 80 to 69.177). Although this still yields a highly significant *p* value, one can see that changing just one of the data characteristics can make a substantial impact on the estimates. Let us examine how the *t* value, degrees of freedom, and resulting *p* value change depending on how some of the other data characteristics change.

Figure 2 displays the changes in *t*_{A} and *h* as *ρ* and *n* increase. As in Fig. 1, the effect ratio is plotted on the *y*-axis, and either *ρ* (Fig. 2a) or *n* (Fig. 2b) is plotted on the *x*-axis. We again consider a balanced design with *m* = 4 and *N*^{C} = *N*^{T} = *nm*; *n* = 10 in Fig. 2a, and *ρ* = .20 in Fig. 2b.

Figure 2a and b show that the *t* statistic and the degrees of freedom follow similar patterns of adjustment (i.e., a decrease in the effect ratio) as *ρ* and *n* increase, although *t*_{A} is consistently adjusted more than *h* (except for the crossover that occurs at a *ρ* of about .35 in Fig. 2a). The results suggest that increasing *ρ* can easily make a significant *t* value nonsignificant, because when *ρ* > 0, one requires a much larger sample size to obtain significance than if there were no clustering in the treatment group. This is partially due to the fact that *t*_{A} is adjusted more than *d*_{T} (see Fig. 1), with the absolute-magnitude adjustment ranging from 0 % to 46 % in *t*_{A} and from 0 % to 13 % in *d*_{T} as *ρ* increases from 0 to .40. It is also due to the substantial decrease in degrees of freedom (51 % at *ρ* = .40), since fewer degrees of freedom mean less power to detect an effect. For example, in Fig. 2a, when *ρ* = .05, the *p* value is about .05, but once *ρ* = .10, the *p* value increases to .07 and is no longer significant. Figure 2b shows that, in addition to having *ρ* > 0, increasing *n* adds to the changes in the estimates, with *t*_{A} and *h* decreasing by 72 % and 66 %, respectively, by the time *n* = 100.

## Unequal cluster sizes

Previous sections of this article have involved the assumption that each cluster has the same number *n* of individuals. Although this will often be true (at least approximately), it need not always be true. The same principles used to derive the effect size estimate, its variance, and the adjustment to the significance test when cluster sizes are equal can be used to derive the corresponding statistics when the cluster sizes are unequal. Suppose that *n* _{ i } individuals are in the *i*th cluster in the treatment group, and that the *n* _{ i } values need not be equal.

### Estimation of δ_{T} from S_{T}

When cluster sizes are unequal, the estimate *d* _{ T } of *δ* _{ T } is given by Expression 22, where *ñ* is an “average” cluster size, given by Expression 23.

Note that when all of the *n* _{ i }s are equal to *n*, Expression 23 reduces to *n*, and Expression 22 reduces to Expression 5.
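Expression 23 is not reproduced here, but the consistency check just stated (Expression 23 reduces to *n* when every cluster has size *n*) is satisfied by the cluster-size-weighted average Σ *n* _{ i } ^{2}/*N* ^{ T } that is standard in the clustering-adjustment literature. The sketch below uses that definition as a plausible stand-in; it should be checked against the article's actual formula.

```python
def n_tilde(cluster_sizes):
    """'Average' cluster size weighted by cluster size: sum(n_i^2) / N^T.

    A common definition in the clustering-adjustment literature; it reduces
    to n when every cluster has size n, as Expression 23 must. Offered as a
    plausible stand-in, not as the article's exact Expression 23.
    """
    total = sum(cluster_sizes)
    return sum(n * n for n in cluster_sizes) / total

print(n_tilde([10, 10, 10, 10]))  # → 10.0  (equal clusters: reduces to n)
print(n_tilde([5, 15]))           # → 12.5  (weighted toward the larger cluster)
```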

As before, *d* _{ T } is approximately normally distributed, but now the variance is given by Expression 24, where *h* is the effective degrees of freedom of *S* _{ T } ^{2}, given by Expression 25, and *A* is an auxiliary constant.

Note that when the *n* _{ i }s are all equal to *n*, the constant *A* reduces to (*N* ^{ T } – *n*)*n*, Expression 24 reduces to Expression 6, and Expression 25 reduces to Expression 7.

The *t* statistic for the impact of clustering and the *t* statistic adjusted for clustering are likewise modified for unequal cluster sizes; the adjusted statistic has *h* degrees of freedom, given in Expression 25.

### Estimation of δ_{W} from S_{W}

The estimate *d* _{ W } is not affected; that is, it remains as in Expression 13. However, when cluster sizes are unequal, the variance of the estimator involves *ñ*, which is given in Expression 23 above.

## Conclusions

Experiments sometimes involve cluster sampling in one or both treatment groups. When there is clustering in only one group, such experiments are often improperly analyzed by ignoring the potential impact of clustering on significance tests or effect size calculation. Reanalysis using more appropriate methods (such as multilevel statistical methods) is obviously desirable. However, when conclusions must be drawn from published reports (using *t* or *F* tests that ignore clustering), we demonstrate how significance levels and confidence intervals adjusted for the impact of clustering can be obtained if the intraclass correlation is known or plausible values can be imputed. One might argue that the intraclass correlations that are imputed may themselves be incorrect. Although this is certainly true, making no adjustment is equivalent to imputing zero for the intraclass correlation, and that is likely to be farther from the truth than a careful imputation or, better still, an imputation of a range of values for a sensitivity analysis. Such procedures provide more accurate significance levels and are suitable for setting bounds on the results.

Clustering in a single treatment group raises the conceptual issue of what effect size parameter is of interest, because more than one definition of effect size is possible. Such clustering is likely to have a modest impact on the magnitude of the effect size estimates, but often will have a nonnegligible impact on the variance of the effect size estimate. The potentially large impact on the variance can have an equally large impact on the weights given to the effect size in a meta-analysis. The methods given in this article provide a way to obtain more accurate estimates of both the effect size estimate and its variance.

## Notes

### Author note

This article is based in part on work supported by the US Institute of Education Sciences (IES) under Grant Nos. R305D11032 and R305B1000027, and by the National Science Foundation (NSF) under Grant No. 0815295. Any opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily represent the views of the IES or the NSF.

## References

- Baldwin, S. A., Murray, D. M., Shadish, W. R., Pals, S. L., Holland, J. M., Abramowitz, J. S., & Watson, J. (2011). Intraclass correlation associated with therapists: Estimates and applications in planning psychotherapy research. *Cognitive Behaviour Therapy, 40,* 15–33. doi:10.1080/16506073.2010.520731
- Bauer, D. J., Sterba, S. K., & Hallfors, D. D. (2008). Evaluating group-based interventions when control participants are ungrouped. *Multivariate Behavioral Research, 43,* 210–236. doi:10.1080/00273170802034810
- Bloom, H. S., Richburg-Hayes, L., & Black, A. R. (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. *Educational Evaluation and Policy Analysis, 29,* 30–59. doi:10.3102/0162373707299550
- Chaplin, D., & Capizzano, J. (2006). *Impacts of a summer learning program: A random assignment study of Building Educated Leaders for Life (BELL)*. Washington, DC: Urban Institute.
- Donner, A., & Klar, N. (2000). *Design and analysis of cluster randomization trials in health research*. London, UK: Arnold.
- Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. *Journal of Educational Statistics, 6,* 107–128.
- Hedges, L. V. (2007a). Correcting a significance test for clustering. *Journal of Educational and Behavioral Statistics, 32,* 151–179. doi:10.3102/1076998606298040
- Hedges, L. V. (2007b). Effect sizes in cluster-randomized designs. *Journal of Educational and Behavioral Statistics, 32,* 341–370. doi:10.3102/1076998606298043
- Hedges, L. V. (2011). Effect sizes in three-level cluster-randomized experiments. *Journal of Educational and Behavioral Statistics, 36,* 346–380. doi:10.3102/1076998610376617
- Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. *Educational Evaluation and Policy Analysis, 29,* 60–87. doi:10.3102/0162373707299706
- Hedges, L. V., & Hedberg, E. C. (2013). Intraclass correlations and covariate outcome correlations for planning 2- and 3-level cluster randomized experiments in education. *Evaluation Review, 37,* 13–57.
- Hedges, L. V., & Olkin, I. (1985). *Statistical methods for meta-analysis*. New York, NY: Academic Press.
- Hoover, D. R. (2002). Clinical trials of behavioral interventions with heterogeneous teaching subgroup effects. *Statistics in Medicine, 21,* 1351–1364. doi:10.1002/sim.1139
- Kubany, E. S., Hill, E. E., Owens, J. A., Iannce-Spencer, C., McCaig, M. A., Tremayne, K. J., & Williams, P. L. (2004). Cognitive trauma therapy for battered women with PTSD (CTT-BW). *Journal of Consulting and Clinical Psychology, 72,* 3–18. doi:10.1037/0022-006X.72.1.3
- Lee, K. J., & Thompson, S. G. (2005). The use of random effects models to allow for clustering in individually randomized trials. *Clinical Trials, 2,* 163–173. doi:10.1191/1740774505cn082oa
- Lipsey, M. W., & Wilson, D. B. (2001). *Practical meta-analysis*. Thousand Oaks, CA: Sage.
- Lohr, S., Schochet, P. Z., & Sanders, E. (2014). *Partially nested randomized controlled trials in education research: A guide to design and analysis* (NCER 2014-2000). Washington, DC: US Department of Education, National Center for Education Research, Institute of Education Sciences. Retrieved from http://ies.ed.gov/
- Pals, S. L., Murray, D. M., Alfano, C. M., Shadish, W. R., Hannan, P. J., & Baker, W. L. (2008). Individually randomized group treatment trials: A critical appraisal of frequently used design and analytical approaches. *American Journal of Public Health, 98,* 1418–1424.
- Roberts, C., & Roberts, S. A. (2005). Design and analysis of clinical trials with clustering effects due to treatment. *Clinical Trials, 2,* 152–162. doi:10.1191/1740774505cn076oa
- Schnurr, P. P., Friedman, M. J., Engel, C. C., Foa, E. B., Shea, M. T., Chow, B. K., & Bernardy, N. (2007). Cognitive behavioral therapy for posttraumatic stress disorder in women: A randomized controlled trial. *Journal of the American Medical Association, 297,* 820–830.
- Searle, S. R., Casella, G., & McCulloch, C. E. (1992). *Variance components*. New York, NY: Wiley.
- Wampold, B. E., & Serlin, R. C. (2000). The consequence of ignoring a nested factor on measures of effect size in analysis of variance. *Psychological Methods, 5,* 425–433. doi:10.1037/1082-989X.5.4.425