A hierarchical linear model (HLM) is a regression model for hierarchical data sets. Hierarchical data sets result from nesting the data for lower units (e.g., individuals such as students, clients, and citizens) within higher units (e.g., groups such as classes, hospitals, and regions). Repeated measures data and data collected through paired designs are also hierarchical data (Raudenbush & Bryk, 2002; Singer & Willett, 2003). The main advantages of using HLMs are attaining improved estimates of parameters and improved information on the residuals at different hierarchical levels. Software packages for hierarchical data analysis include HLM, MLwiN, Mplus, R, SAS, and SPSS.

In data collection for experimental research, estimating the required sample size beforehand is fundamental to obtaining sufficient statistical power and precision of the focused parameters, and sample size determination problems are often closely related to human resource and budget requirements (Chow, Shao, & Wang, 2003; Raudenbush, 1997; Usami, 2011a). Estimating the expected statistical power before beginning research by power analysis is sometimes crucial to avoiding wrong conclusions (Cohen, 1988). However, actual psychological research is often underpowered (Bezeau & Graves, 2001; Cohen, 1962; Maxwell, 2004; Maxwell, Kelley, & Rausch, 2008). Although in simpler data collection designs there are multiple ways of conveniently conducting power analysis for both nonhierarchical data (e.g., Dupont & Plummer, 1998, for PS; Faul, Erdfelder, Lang, & Buchner, 2007, for G*Power 3) and hierarchical data (e.g., Donner & Klar, 2000, for ACluster; Fosgate, 2007; Raudenbush, Spybrook, Congdon, Liu, & Martinez, 2011, for OD), a more intuitive method is strongly desired for estimating the required sample size for more general data collection designs.

The present research provides closed-form formulas that generalize sample size requirements for testing effects in experimental research with hierarchical data, focusing on both multisite-randomized trials (MRTs), in which individuals are randomized (Raudenbush & Liu, 2000), and cluster-randomized trials (CRTs), in which clusters are randomized (Heo & Leon, 2008). MRTs are generally preferable to CRTs, since in CRTs the dependency of individual response data within clusters (i.e., intraclass correlation) inflates the standard errors of estimates of the experimental effects (see the Comparing CRTs and MRTs section for details). However, in many cases CRTs have to be chosen because of the research purposes (e.g., when differences among doctors are a focus of the intervention).

These formulas are derived through considering both statistical power and the width of the confidence interval for a standardized effect size, on the basis of estimates from random-intercept models for three-level data that consider both balanced and unbalanced designs. As was summarized in Usami (2011a), although several methods have been developed for sample size determination in hierarchical data (e.g., Heo & Leon, 2008; Okumura, 2007; Raudenbush, 1997; Raudenbush & Liu, 2000; Roy, Bhaumik, Aryal, & Gibbons, 2007; Usami, 2011b), these methods were developed under designs featuring restricted data collection and numbers of levels. For example, Heo and Leon derived a closed-form power function and a formula for determining the sample size required to detect a single experimental effect in three-level hierarchical CRTs. The derived formulas were restricted to CRTs under a balanced design, however, and formulas to obtain desired confidence intervals were not considered. Roy et al. devised a general method for sample size determination in a three-level HLM for longitudinal data, but it was not in a closed form and the experimental design was not directly considered. In the Japanese literature, Usami (2011b) derived formulas for MRTs and CRTs, but the derived formulas were restricted to two-level hierarchical data under a balanced design. Three-level hierarchies arise frequently in both cross-sectional studies (e.g., students are nested within classes within schools) and longitudinal studies (e.g., longitudinally obtained data are nested within patients within hospitals). In the present research, formulas were derived in a unified way, using generalized least squares estimators for experimental effects, in order to overcome the restrictions of the former research. These formulas also address additional results not derived in previous research, such as lower bounds on the number of required units in the highest (third) level and cases involving more than three levels.

This article is organized into five sections. The following section introduces a three-level HLM. The one after gives derivations of the generalized formulas and examples of estimating the required sample size on the basis of programs provided in the Appendix. Next, additional results obtained from the derived formulas are addressed, and the final section discusses prospects for the proposed method and related problems.

Statistical model

This section introduces a three-level random-intercept model that considers both balanced and unbalanced designs, referring to Heo and Leon (2008). Experimental and control groups are sometimes unbalanced due to practical considerations. For example, producing experimental drugs for clinical trials is expensive, so the experimental and control groups may be of unequal size (Ogungbenro & Aarons, 2009). For brevity, the discussion here will be confined to the case in which the numbers of Level 1 units (e.g., students) and Level 2 units (e.g., classes) are equal within the Level 3 unit (e.g., schools), and in which no attrition occurs during trials (on this point, some conventional alternatives are addressed in the Discussion section).

Let \( Y_{ijk} \) be the outcome for the ith (i = 1, 2, . . . , I) Level 1 unit nested within the jth (j = 1, 2, . . . , J) Level 2 unit, which is in turn nested within the kth (k = 1, 2, . . . , K) Level 3 unit. The following Level 1 model is assumed for \( Y_{ijk} \):

$$ {Y}_{ijk}={\beta}_{0jk}+\delta {X}_{ijk}+{e}_{ijk}. $$
(1)

Here, \( X_{ijk} \) is the corresponding assignment indicator variable, set to 1 when the unit is assigned to the experimental group and to 0 when it is assigned to the control group. Let the proportion of units assigned to the experimental group be P (0 < P < 1). The balanced condition is satisfied only when P = .5. In CRTs, essentially \( X_{ijk} = X_k \), since clusters are randomized, so the number of Level 1 units assigned to the experimental group per Level 2 unit is P × I for MRTs, whereas it is 0 or I for CRTs. \( \beta_{0jk} \) is a random intercept denoting the overall control group mean in the jth Level 2 unit nested within the kth Level 3 unit. \( e_{ijk} \) is the corresponding residual, assumed to be independent of \( X_{ijk} \). Additionally, \( e_{ijk} \) is assumed to be distributed as \( e_{ijk} \sim N(0, \sigma_1^2) \). Here, \( \sigma_1^2 \) is the residual variance for the respective groups in each Level 1 unit.

The Level 2 model decomposes \( \beta_{0jk} \) as

$$ {\beta}_{0jk}={\beta}_{0k}+{e}_{jk}. $$
(2)

Here, \( \beta_{0k} \) is a random intercept denoting the overall control group mean in the kth Level 3 unit. \( e_{jk} \) is the corresponding residual and is assumed to be independent of \( X_{ijk} \) and of the other residuals. Additionally, \( e_{jk} \) is assumed to be distributed as \( e_{jk} \sim N(0, \sigma_2^2) \), where \( \sigma_2^2 \) is the residual variance for the respective groups in each Level 2 unit.

\( \beta_{0k} \) can be further decomposed in order to obtain the following Level 3 model:

$$ {\beta}_{0k}={\beta}_0+{e}_k. $$
(3)

Here, \( \beta_0 \) is the overall control group mean. \( e_k \) is the corresponding residual, assumed to be independent of \( X_{ijk} \) and of the other residuals. Additionally, \( e_k \) is assumed to be distributed as \( e_k \sim N(0, \sigma_3^2) \), where \( \sigma_3^2 \) is the residual variance for the respective groups in each Level 3 unit. From Eqs. 2 and 3, Eq. 1 can now be written as

$$ {Y}_{ijk}=\left({\beta}_0+\delta {X}_{ijk}\right)+\left({e}_k+{e}_{jk}+{e}_{ijk}\right). $$
(4)

This combined form clarifies the fixed and residual parts of the three-level model. From Eq. 4, it is evident that the mean of \( Y_{ijk} \), given \( X_{ijk} \), is

$$ E\left({Y}_{ijk}\ |{X}_{ijk}\right)={\beta}_0+\delta {X}_{ijk}, $$
(5)

where E() denotes the mean. Additionally, the covariance of \( Y_{ijk} \) and \( Y_{i'j'k'} \) can be generally expressed as

$$ \mathrm{cov}\left({Y}_{ijk},{Y}_{i'j'k'}\ |{X}_{ijk},{X}_{i'j'k'}\right)=1\left(i=i'\ \&\ j=j'\ \&\ k=k'\right){\sigma}_1^2+1\left(j=j'\ \&\ k=k'\right){\sigma}_2^2+1\left(k=k'\right){\sigma}_3^2, $$
(6)

where cov() denotes the covariance, and 1() is an indicator function, which has a value of 1 if the conditions in parentheses are satisfied and 0 if they are not. From Eq. 6, the variance of \( Y_{ijk} \) (namely, the case in which i = i', j = j', and k = k') can be expressed as

$$ \mathrm{Var}\left(\left.{Y}_{ijk}\ \right|{X}_{ijk}\right)={\sigma}_1^2+{\sigma}_2^2+{\sigma}_3^2={\sigma}^2. $$
(7)

Here, Var() denotes the variance. The standardized effect size ∆ of an experimental effect δ is defined according to Cohen (1988) by using the pooled standard deviation σ. Namely,

$$ \varDelta =\frac{\delta }{\sigma }. $$
(8)

Therefore, the intraclass correlation coefficient (ICC) among the Level 2 data can now be expressed as

$$ {\rho}_2=\mathrm{Corr}\left({Y}_{ijk},{Y}_{i'j'k}\right)=\frac{\sigma_3^2}{\sigma_1^2+{\sigma}_2^2+{\sigma}_3^2}=\frac{\sigma_3^2}{\sigma^2}, $$
(9)

and the ICC among the Level 1 data can be expressed as

$$ {\rho}_1=\mathrm{Corr}\left({Y}_{ijk},{Y}_{i'jk}\right)=\frac{\sigma_2^2+{\sigma}_3^2}{\sigma_1^2+{\sigma}_2^2+{\sigma}_3^2}=\frac{\sigma_2^2+{\sigma}_3^2}{\sigma^2}. $$
(10)

Here, Corr() denotes the correlation.
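As a small numerical illustration of Eqs. 9 and 10, the following Python sketch converts three variance components into the two ICCs. The function name `iccs` and the variance values (340, 48, and 12, summing to 400) are hypothetical and chosen only for illustration; they are not taken from the author's programs.

```python
def iccs(var1, var2, var3):
    """Return (rho1, rho2) for a three-level random-intercept model.

    var1, var2, var3 are the residual variances at Levels 1, 2, and 3.
    """
    total = var1 + var2 + var3          # sigma^2 in Eq. 7
    rho2 = var3 / total                 # Eq. 9: same Level 3 unit, different Level 2 units
    rho1 = (var2 + var3) / total        # Eq. 10: same Level 2 unit
    return rho1, rho2


if __name__ == "__main__":
    # Hypothetical variance components (assumption, for illustration only).
    rho1, rho2 = iccs(var1=340.0, var2=48.0, var3=12.0)
    print(round(rho1, 4), round(rho2, 4))
```

Because the Level 3 variance appears in both numerators, the ICC among Level 1 data can never be smaller than the ICC among Level 2 data when the variances are nonnegative.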

Derivation of generalized formulas

Standard errors of δ

Without loss of generality, we can set \( \sigma^2 = 1 \), and then from Eqs. 8–10, Δ = δ, \( \rho_2 = \sigma_3^2 \), and \( \rho_1 = \sigma_2^2 + \sigma_3^2 \). If the residual variances \( \sigma_1^2 \), \( \sigma_2^2 \), and \( \sigma_3^2 \) are known, the test statistic Z for the null hypothesis \( H_0 \): δ = 0 can be constructed as

$$ Z=\frac{\widehat{\delta}}{ se\left(\widehat{\delta}\right)}, $$
(11)

where \( se\left(\widehat{\delta}\right) \) denotes the standard error of the estimate \( \widehat{\delta} \). Z is normally distributed with unit variance, \( Z\sim N\left(\delta / se\left(\widehat{\delta}\right),1\right) \). To derive formulas under a clear and unified procedure, consider a matrix notation of Eq. 4:

$$ \mathbf{Y}=\tilde{\mathbf{X}}\boldsymbol{\upbeta} +\tilde{\boldsymbol{\upvarepsilon}}. $$
(12)

Here, \( \boldsymbol{\upbeta} = \left({\beta}_0, \delta \right)' \) and Y is an (I × J × K) × 1 vector whose elements are arranged as \( \mathbf{Y} = \left({\mathbf{Y}}_1, \dots, {\mathbf{Y}}_k, \dots, {\mathbf{Y}}_K\right)' \), where \( {\mathbf{Y}}_k = \left({\mathbf{Y}}_{1k}, \dots, {\mathbf{Y}}_{jk}, \dots, {\mathbf{Y}}_{Jk}\right)' \) and \( {\mathbf{Y}}_{jk} = \left({Y}_{1jk}, \dots, {Y}_{ijk}, \dots, {Y}_{Ijk}\right)' \). \( \tilde{\mathbf{X}}=\left({\mathbf{1}}_{IJK},\mathbf{X}\right) \) is a corresponding (I × J × K) × 2 matrix, and X is a corresponding (I × J × K) × 1 vector including information about \( X_{ijk} \). \( \tilde{\boldsymbol{\upvarepsilon}} \) is also a corresponding (I × J × K) × 1 vector whose elements are \( {\tilde{e}}_{ijk}={e}_k+{e}_{jk}+{e}_{ijk} \). From Eq. 6, it can be shown that \( \tilde{\boldsymbol{\upvarepsilon}} \) is distributed as \( \tilde{\boldsymbol{\upvarepsilon}}\sim N\left(\mathbf{0},\tilde{\boldsymbol{\Sigma}}\right) \), where

$$ \tilde{\boldsymbol{\Sigma}}={\mathbf{I}}_K\otimes \boldsymbol{\Sigma}, \quad \boldsymbol{\Sigma} ={\sigma}_3^2{\mathbf{1}}_{IJ}{\mathbf{1}}_{IJ}^{\prime }+{\mathbf{I}}_J\otimes \left({\sigma}_2^2{\mathbf{1}}_I{\mathbf{1}}_I^{\prime}\right)+{\sigma}_1^2{\mathbf{I}}_{IJ} $$
(13)
$$ ={\rho}_2{\mathbf{1}}_{IJ}{\mathbf{1}}_{IJ}^{\prime }+{\mathbf{I}}_J\otimes \left[\left({\rho}_1-{\rho}_2\right){\mathbf{1}}_I{\mathbf{1}}_I^{\prime}\right]+\left(1-{\rho}_1\right){\mathbf{I}}_{IJ}. $$
(14)

Here we assume that \( \sigma_1^2 \ge 0 \), \( \sigma_2^2 \ge 0 \), and \( \sigma_3^2 \ge 0 \), and that the inverse matrix of Σ (denoted as \( {\boldsymbol{\Sigma}}^{-1} \)) exists. Let the diagonal elements of \( {\boldsymbol{\Sigma}}^{-1} \) be a, the off-diagonal elements corresponding to the same Level 2 and Level 3 units be b, and the off-block-diagonal elements corresponding to the same Level 3 unit be c. Comparing the left and right sides of the identity \( \boldsymbol{\Sigma}{\boldsymbol{\Sigma}}^{-1} = \mathbf{I} \), the following equations are obtained:

$$ \begin{array}{l}a+\left(I-1\right){\rho}_1b+I\left(J-1\right){\rho}_2c=1,\hfill \\ {}b+{\rho}_1a+\left(I-2\right){\rho}_1b+I\left(J-1\right){\rho}_2c=0,\hfill \\ {}\left[1+\left(I-1\right){\rho}_1\right]c+{\rho}_2\left[a+\left(I-1\right)b\right]+\left(J-2\right)I{\rho}_2c=0.\hfill \end{array} $$
(15)

These equations can be rewritten as

$$ \begin{array}{l}a=b+\frac{1}{1-{\rho}_1},\hfill \\ {}\hfill b=\frac{\left(f-I{\rho}_2\right){\rho}_1-I\left(J-1\right){\rho}_2^2}{I^2\left(J-1\right){\rho}_2^2+\left(I{\rho}_2-f\right)\left[\left(I-1\right){\rho}_1+1\right]}\left[\frac{1}{1-{\rho}_1}\right],\hfill \\ {}\hfill c=\frac{\rho_2}{I{\rho}_2-f}\left[ Ib+\frac{1}{1-{\rho}_1}\right],\hfill \end{array} $$
(16)

where \( f = 1 + I(J - 1){\rho}_2 + (I - 1){\rho}_1 \) is a variance inflation factor or design effect (Heo & Leon, 2008). Simple calculation shows that

$$ \frac{1}{a+\left(I-1\right)b+I\left(J-1\right)c}=f. $$
(17)
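The algebra in Eqs. 15–17 can be checked numerically. The following Python sketch computes a, b, and c from Eq. 16 and returns the residuals of the three identities in Eq. 15, so that Eq. 17 can also be confirmed for any chosen I, J, ρ1, and ρ2. The function names are hypothetical, and this is only a verification aid, not part of the author's programs.

```python
def inverse_elements(I, J, rho1, rho2):
    """Return (a, b, c, f): elements of Sigma^{-1} from Eq. 16 and the design effect f."""
    f = 1 + I * (J - 1) * rho2 + (I - 1) * rho1
    u = 1.0 / (1.0 - rho1)
    numerator = (f - I * rho2) * rho1 - I * (J - 1) * rho2 ** 2
    denominator = I ** 2 * (J - 1) * rho2 ** 2 + (I * rho2 - f) * ((I - 1) * rho1 + 1)
    b = numerator / denominator * u
    a = b + u
    c = rho2 / (I * rho2 - f) * (I * b + u)
    return a, b, c, f


def eq15_residuals(I, J, rho1, rho2):
    """Left side minus right side for each of the three identities in Eq. 15."""
    a, b, c, _ = inverse_elements(I, J, rho1, rho2)
    r1 = a + (I - 1) * rho1 * b + I * (J - 1) * rho2 * c - 1
    r2 = b + rho1 * a + (I - 2) * rho1 * b + I * (J - 1) * rho2 * c
    r3 = ((1 + (I - 1) * rho1) * c + rho2 * (a + (I - 1) * b)
          + (J - 2) * I * rho2 * c)
    return r1, r2, r3


if __name__ == "__main__":
    a, b, c, f = inverse_elements(I=4, J=3, rho1=0.15, rho2=0.03)
    print(eq15_residuals(4, 3, 0.15, 0.03))        # all close to zero
    print(1 / (a + 3 * b + 4 * 2 * c), f)          # the two sides of Eq. 17
```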

Using the generalized least squares estimator \( \widehat{\boldsymbol{\upbeta}}={\left(\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\tilde{\mathbf{X}}\right)}^{-1}\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\mathbf{Y} \), the sampling distribution of \( \widehat{\boldsymbol{\upbeta}} \) can be expressed as \( \widehat{\boldsymbol{\upbeta}}\sim N\left[\boldsymbol{\upbeta},{\left(\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\tilde{\mathbf{X}}\right)}^{-1}\right] \), and then \( se\left(\widehat{\delta}\right) \) can be evaluated as the square root of the (2, 2) element of \( {\left(\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\tilde{\mathbf{X}}\right)}^{-1}={\left[\tilde{\mathbf{X}}\prime \left({\mathbf{I}}_K \otimes {\boldsymbol{\Sigma}}^{-1}\right)\tilde{\mathbf{X}}\right]}^{-1} \). Let \( {\boldsymbol{x}}_m \) and \( {\boldsymbol{x}}_c \) be \( {\boldsymbol{x}}_m = \left({\mathbf{1}}_{PI}, {\mathbf{0}}_{(1-P)I}\right)' \) and \( {\boldsymbol{x}}_c = \left({\mathbf{1}}_{PK}, {\mathbf{0}}_{(1-P)K}\right)' \), respectively. Now X can be expressed as

$$ \boldsymbol{X}=\left\{\begin{array}{c}{\mathbf{1}}_{\boldsymbol{JK}}\otimes {\boldsymbol{x}}_{\boldsymbol{m}}\kern.9em \left(\mathrm{MRTs}\right)\\ {}{\boldsymbol{x}}_{\boldsymbol{c}}\otimes {\mathbf{1}}_{\boldsymbol{IJ}}\kern.9em \left(\mathrm{CRTs}\right).\end{array}\right. $$
(18)

for the respective randomized trials. Then, from Eqs. 17 and 18, \( se\left(\widehat{\delta}\right) \) can be calculated as

$$ se\left(\widehat{\delta}\right)=\left\{\begin{array}{c} se\left({\hat{\delta}}_M\right)=\sqrt{\frac{1}{ IJKP\left(1-P\right)\left(a-b\right)}}=\sqrt{\frac{1-{\rho}_1}{ IJKP\left(1-P\right)}},(MRTs)\hfill \\ {}\hfill se\left({{\displaystyle \hat{\delta}}}_C\right)=\sqrt{\frac{1}{ IJKP\left(1-P\right)\left[a+\left(I-1\right)b+I\left(J-1\right)c\right]}}=\sqrt{\frac{f}{ IJKP\left(1-P\right)}},(CRTs).\hfill \end{array}\right. $$
(19)

for the respective randomized trials. As for \( se\left({\widehat{\delta}}_C\right) \) in Eq. 19, this corresponds exactly to the result of Heo and Leon (2008) when a balanced design is used (i.e., P = 1/2). From Eq. 19, it is evident that a larger \( {\rho}_1 \) leads to a smaller \( se\left(\widehat{\delta}\right) \) in MRTs, whereas larger \( {\rho}_2 \) and \( {\rho}_1 \) lead to a larger \( se\left(\widehat{\delta}\right) \) in CRTs, and \( se\left({\widehat{\delta}}_M\right)= se\left({\widehat{\delta}}_C\right) \) when \( {\rho}_2 = {\rho}_1 = 0 \).
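The two standard errors in Eq. 19 are simple enough to evaluate directly. The following Python sketch does so under the convention σ² = 1; the function names `design_effect`, `se_mrt`, and `se_crt` are hypothetical and are not taken from the author's R programs.

```python
import math


def design_effect(I, J, rho1, rho2):
    """Variance inflation factor f = 1 + I(J - 1)rho2 + (I - 1)rho1."""
    return 1 + I * (J - 1) * rho2 + (I - 1) * rho1


def se_mrt(I, J, K, rho1, P=0.5):
    """Standard error of the effect estimate in an MRT (Eq. 19, first line)."""
    return math.sqrt((1 - rho1) / (I * J * K * P * (1 - P)))


def se_crt(I, J, K, rho1, rho2, P=0.5):
    """Standard error of the effect estimate in a CRT (Eq. 19, second line)."""
    f = design_effect(I, J, rho1, rho2)
    return math.sqrt(f / (I * J * K * P * (1 - P)))


if __name__ == "__main__":
    # Illustrative design sizes and ICCs (assumed values).
    print(se_mrt(3, 3, 10, rho1=0.15))
    print(se_crt(3, 3, 10, rho1=0.15, rho2=0.03))
```

With nonnegative ICCs the CRT value is never smaller than the MRT value, and the two coincide when both ICCs are zero.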

Generalized formulas for desired statistical power

Let α be a two-sided significance level for the test of the null hypothesis \( H_0 \): δ = 0. Under test statistic Z in Eq. 11, statistical power ϕ can be evaluated as

$$ \phi =\varPhi \left[{z}_{\alpha /2}-E(Z)\right]+\varPhi \left[E(Z)-{z}_{1-\alpha /2}\right]. $$
(20)

Here, Φ is the cumulative distribution function of the standard normal distribution, and \( z_{\alpha} \) denotes the 100α% point of the standard normal distribution.

Without loss of generality, a positive experimental effect (i.e., δ ≥ 0) is assumed here. Then, the probability that Z falls below \( z_{\alpha /2} \) (i.e., \( \varPhi \left[{z}_{\alpha /2}-E(Z)\right] \)) is generally very low, and the first term in Eq. 20 can be pragmatically ignored, unless the sample size or effect size is too small (e.g., Usami, 2011b). The above equation therefore becomes

$$ \phi \approx \varPhi \left[E(Z)-{z}_{1-\alpha /2}\right]. $$
(21)
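To see how little is lost by dropping the first term, the following Python sketch evaluates both the two-term power in Eq. 20 and the one-term approximation in Eq. 21 for a given E(Z). The function name `power_two_sided` is hypothetical; only the standard library's `statistics.NormalDist` is used.

```python
from statistics import NormalDist


def power_two_sided(ez, alpha=0.05):
    """Return (exact, approx): power from Eq. 20 and its approximation in Eq. 21.

    ez is E(Z), the expected value of the test statistic.
    """
    nd = NormalDist()
    z_lo = nd.inv_cdf(alpha / 2)        # z_{alpha/2}
    z_hi = nd.inv_cdf(1 - alpha / 2)    # z_{1-alpha/2}
    exact = nd.cdf(z_lo - ez) + nd.cdf(ez - z_hi)
    approx = nd.cdf(ez - z_hi)
    return exact, approx


if __name__ == "__main__":
    for ez in (0.0, 1.0, 2.8):
        exact, approx = power_two_sided(ez)
        print(ez, round(exact, 6), round(approx, 6))
```

When E(Z) = 0 the two-term expression returns the significance level itself, and the neglected term shrinks rapidly as E(Z) grows.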

For a desired statistical power ψ, we can obtain the following relations (Usami, 2011b):

$$ \begin{array}{l}\varPhi \left[E(Z)-{z}_{1-\alpha /2}\right]\ge \psi, \hfill \\ {}\leftrightarrow E(Z)-{z}_{1-\alpha /2}\ge {z}_{\psi },\hfill \\ {}\leftrightarrow E(Z)\ge {z}_{1-\alpha /2}+{z}_{\psi }.\hfill \end{array} $$
(22)

Additionally, when \( \sigma_1^2 \), \( \sigma_2^2 \), and \( \sigma_3^2 \) (namely, \( {\rho}_2 \) and \( {\rho}_1 \)) are known, \( E\left(\widehat{\delta}\right)=\delta \), since \( E\left(\widehat{\boldsymbol{\upbeta}}\right)=E\left[{\left(\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\tilde{\mathbf{X}}\right)}^{-1}\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\mathbf{Y}\right]={\left(\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\tilde{\mathbf{X}}\right)}^{-1}\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}E\left(\mathbf{Y}\right)={\left(\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\tilde{\mathbf{X}}\right)}^{-1}\tilde{\mathbf{X}}\prime {\tilde{\boldsymbol{\Sigma}}}^{-1}\left(\tilde{\mathbf{X}}\boldsymbol{\upbeta} \right)=\boldsymbol{\upbeta} \). Then, this relation and Eqs. 11, 19, and 22 give the following required sample size (IJK) for an MRT design:

$$ IJK\ge \frac{{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2\left(1-{\rho}_1\right)}{P\left(1-P\right){\varDelta}^2}. $$
(23)

Note that δ = ∆ because \( \sigma^2 \) is assumed to be 1. From this formula, under fixed α and P, larger \( {\rho}_1 \) and ∆ lead to a smaller sample size requirement for a desired statistical power ψ. Additionally, a balanced design in which P = 1/2 provides the least demands on sample size. Likewise, in a CRT design, the formula for required sample size can be obtained from Eqs. 11, 19, and 22. Since \( se\left({\widehat{\delta}}_C\right) \) includes f in its numerator, we get the following sample size determination formulas for the respective units:

$$ I\ge \frac{\left(1-{\rho}_1\right){\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2}{ JKP\left(1-P\right){\varDelta}^2-{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2\left[\left(J-1\right){\rho}_2+{\rho}_1\right]} $$
(24)
$$ J\ge \frac{\left[I\left({\rho}_1-{\rho}_2\right)+1-{\rho}_1\right]{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2}{ IKP\left(1-P\right){\varDelta}^2-{\rho}_2I{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2} $$
(25)
$$ K\ge \frac{f{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2}{ IJP\left(1-P\right){\varDelta}^2}=\frac{\left[1+I\left(J-1\right){\rho}_2+\left(I-1\right){\rho}_1\right]{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2}{ IJP\left(1-P\right){\varDelta}^2}. $$
(26)

Note that Eqs. 24–26 reduce to the formulas obtained in Heo and Leon (2008) when a balanced design is used (i.e., P = 1/2). In each equation, for a fixed α and P, a larger \( {\rho}_2 \) and \( {\rho}_1 \) and a smaller ∆ lead to larger sample size requirements for a desired statistical power ψ. Additionally, as in an MRT design setting, a balanced design (P = 1/2) provides the least demands on sample size.
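The power-based formulas in Eqs. 23–26 can be sketched in Python as follows. This is an independent illustration rather than the author's R code: the function names are hypothetical, results are rounded up to the next integer, and returning `None` when a denominator is not positive (so that no finite value of I or J can reach the desired power) is a design choice made here.

```python
import math
from statistics import NormalDist


def _z_sum_sq(alpha, psi):
    """(z_{1-alpha/2} + z_psi)^2, the factor shared by Eqs. 23-26."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(psi)) ** 2


def mrt_power_n(alpha, psi, rho1, Delta, P=0.5):
    """Eq. 23: minimum total sample size IJK in an MRT."""
    q = _z_sum_sq(alpha, psi)
    return math.ceil(q * (1 - rho1) / (P * (1 - P) * Delta ** 2))


def crt_power_I(J, K, alpha, psi, rho1, rho2, Delta, P=0.5):
    """Eq. 24: minimum number of Level 1 units per Level 2 unit in a CRT."""
    q = _z_sum_sq(alpha, psi)
    den = J * K * P * (1 - P) * Delta ** 2 - q * ((J - 1) * rho2 + rho1)
    return math.ceil((1 - rho1) * q / den) if den > 0 else None


def crt_power_J(I, K, alpha, psi, rho1, rho2, Delta, P=0.5):
    """Eq. 25: minimum number of Level 2 units per Level 3 unit in a CRT."""
    q = _z_sum_sq(alpha, psi)
    den = I * K * P * (1 - P) * Delta ** 2 - rho2 * I * q
    num = (I * (rho1 - rho2) + 1 - rho1) * q
    return math.ceil(num / den) if den > 0 else None


def crt_power_K(I, J, alpha, psi, rho1, rho2, Delta, P=0.5):
    """Eq. 26: minimum number of Level 3 units in a CRT."""
    q = _z_sum_sq(alpha, psi)
    f = 1 + I * (J - 1) * rho2 + (I - 1) * rho1
    return math.ceil(f * q / (I * J * P * (1 - P) * Delta ** 2))


if __name__ == "__main__":
    print(mrt_power_n(alpha=0.05, psi=0.80, rho1=0.15, Delta=0.80, P=0.5))
    print(crt_power_I(J=3, K=10, alpha=0.05, psi=0.80,
                      rho1=0.15, rho2=0.03, Delta=0.80, P=0.5))
```

The inputs in the usage lines are the ones used in the Examples section of this article (ρ1 = .15, ρ2 = .03, ∆ = .80), so the printed values can be compared with the figures reported there.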

Generalized formulas for confidence intervals

A 100(1 – α)% confidence interval for δ is expressed as

$$ \widehat{\delta}-{z}_{1-\alpha /2} se\left(\widehat{\delta}\right)\le \delta \le \widehat{\delta}+{z}_{1-\alpha /2} se\left(\widehat{\delta}\right), $$
(27)

so the width of a confidence interval L can be evaluated as

$$ L=2{z}_{1-\alpha /2} se\left(\widehat{\delta}\right). $$
(28)

Note that this is also the width of the confidence interval for ∆ because \( \sigma^2 \) is assumed to be 1. When a desired width of the confidence interval is specified as L', using Eqs. 19 and 28, the relation L ≤ L' can be reexpressed in MRTs as follows:

$$ IJK\ge \frac{4{z}_{1-\alpha /2}^2\left(1-{\rho}_1\right)}{P\left(1-P\right)L{\prime}^2}. $$
(29)

Naturally, as \( {\rho}_1 \) and L' become smaller, the required sample size becomes larger. In CRTs, as with the formulas for desired statistical power (Eqs. 24–26), determination formulas can be derived for the respective units as

$$ I\ge \frac{4{z}_{1-\alpha /2}^2\left(1-{\rho}_1\right)}{ JKP\left(1-P\right)L{\prime}^2-4{z}_{1-\alpha /2}^2\left[\left(J-1\right){\rho}_2+{\rho}_1\right]} $$
(30)
$$ J\ge \frac{4{z}_{1-\alpha /2}^2\left[1-I{\rho}_2+\left(I-1\right){\rho}_1\right]}{ IKP\left(1-P\right)L{\prime}^2-4{z}_{1-\alpha /2}^2I{\rho}_2} $$
(31)
$$ K\ge \frac{4{z}_{1-\alpha /2}^2f}{ IJP\left(1-P\right){{\displaystyle {L}^{\prime}}}^2}=\frac{4{z}_{1-\alpha /2}^2\left[1+I\left(J-1\right){\rho}_2+\left(I-1\right){\rho}_1\right]}{ IJP\left(1-P\right){{\displaystyle {L}^{\prime}}}^2} $$
(32)

using Eqs. 19 and 28. Naturally, these equations reduce to the results obtained in Usami (2011b) when the number of levels and P are restricted to two (K = 1 and \( {\rho}_2 = 0 \)) and 1/2 (balanced design), respectively.
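The confidence-interval formulas in Eqs. 29–32 can be sketched in the same way. As before, this Python code is only an illustration with hypothetical function names, not the author's R programs; `L` denotes the desired width L', results are rounded up, and `None` is returned when a denominator is not positive.

```python
import math
from statistics import NormalDist


def _four_z_sq(alpha):
    """4 * z_{1-alpha/2}^2, the factor shared by Eqs. 29-32."""
    return 4 * NormalDist().inv_cdf(1 - alpha / 2) ** 2


def mrt_ci_n(alpha, rho1, L, P=0.5):
    """Eq. 29: minimum total sample size IJK in an MRT."""
    return math.ceil(_four_z_sq(alpha) * (1 - rho1) / (P * (1 - P) * L ** 2))


def crt_ci_I(J, K, alpha, rho1, rho2, L, P=0.5):
    """Eq. 30: minimum number of Level 1 units per Level 2 unit in a CRT."""
    w = _four_z_sq(alpha)
    den = J * K * P * (1 - P) * L ** 2 - w * ((J - 1) * rho2 + rho1)
    return math.ceil(w * (1 - rho1) / den) if den > 0 else None


def crt_ci_J(I, K, alpha, rho1, rho2, L, P=0.5):
    """Eq. 31: minimum number of Level 2 units per Level 3 unit in a CRT."""
    w = _four_z_sq(alpha)
    den = I * K * P * (1 - P) * L ** 2 - w * I * rho2
    num = w * (1 - I * rho2 + (I - 1) * rho1)
    return math.ceil(num / den) if den > 0 else None


def crt_ci_K(I, J, alpha, rho1, rho2, L, P=0.5):
    """Eq. 32: minimum number of Level 3 units in a CRT."""
    f = 1 + I * (J - 1) * rho2 + (I - 1) * rho1
    return math.ceil(_four_z_sq(alpha) * f / (I * J * P * (1 - P) * L ** 2))


if __name__ == "__main__":
    print(mrt_ci_n(alpha=0.05, rho1=0.15, L=0.30, P=0.5))
    print(crt_ci_I(J=3, K=10, alpha=0.05, rho1=0.15, rho2=0.03, L=0.70, P=0.5))
```

The usage lines take their inputs from the Examples section of this article, so the printed values can be compared with the figures reported there.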

Examples

To facilitate the use of the derived formulas, R programs are provided in the Appendices to estimate the minimum required sample size in MRTs and CRTs for statistical power (Eqs. 23–26) and for the width of the confidence intervals for different standardized effect sizes of the experimental effects (Eqs. 29–32). Here, we consider a hypothetical situation in which students from different classes and schools are assigned to either an experimental or a control group in order to evaluate the experimental effect on test scores of new learning programs for English conversation. From previous research results, the variance of test scores and the size of the experimental effect δ are assumed to be \( \sigma^2 = 20^2 \) and δ = 16, respectively, so that ∆ = 16/20 = .80. As for the ICCs, the variance of the means of English conversation ability is assumed to be small among schools, but large among classes in each school, so that \( \sigma_3^2 \) is small and \( \sigma_2^2 \) is large. Therefore, \( {\rho}_1 \) and \( {\rho}_2 \) are set as .15 and .03, respectively.

The desired statistical power and the two-sided significance level for testing the null hypothesis \( H_0 \): δ = 0 are set as ψ = .80 and α = .05, respectively. If MRTs are conducted, Eq. 23 indicates that the required minimum sample sizes IJK to achieve ϕ ≥ .80 are 42 and 50 for the proportions of experimental group size P = .5 and .7, respectively. When using the provided programs, the same results can be obtained:

$$ \begin{array}{c}\hfill \mathrm{MRTpower}\left(\mathrm{alpha}=0.05,\kern0.5em \mathrm{psi}=0.80,\kern0.5em \mathrm{rho}1=0.15,\kern0.5em \mathrm{Delta}=0.80,\kern0.5em \mathrm{P}=0.50\right)\hfill \\ {}42\hfill \\ {}\hfill \mathrm{MRTpower}\left(\mathrm{alpha}=0.05,\kern0.5em \mathrm{psi}=0.80,\kern0.5em \mathrm{rho}1=0.15,\kern0.5em \mathrm{Delta}=0.80,\kern0.5em \mathrm{P}=0.70\right)\hfill \\ {}50\hfill \end{array} $$

When the required sample size is determined on the basis of a desired width of the confidence interval of L' = .30, from Eq. 29 the required minimum sample sizes IJK to achieve L ≤ .30 are 581 and 692 for P = .5 and .7, respectively. When using the provided programs, the same results can be obtained:

$$ \begin{array}{c}\hfill \mathrm{MRTconfidenceinterval}\kern0.5em \left(\mathrm{alpha}=0.05,\kern0.5em \mathrm{rho}1=0.15,\kern0.5em \mathrm{L}=0.30,\kern0.5em \mathrm{P}=0.50\right)\hfill \\ {}581\hfill \\ {}\hfill \mathrm{MRTconfidenceinterval}\kern0.5em \left(\mathrm{alpha}=0.05,\kern0.5em \mathrm{rho}1=0.15,\kern0.5em \mathrm{L}=0.30,\kern0.5em \mathrm{P}=0.70\right)\hfill \\ {}692\hfill \end{array} $$

If CRTs are conducted and J and K are fixed at J = 3 and K = 10, from Eq. 24 the required minimum number of units I to achieve ϕ ≥ .80 is calculated as 3 for P = .5. When using the provided programs, the same result can be obtained:

$$ \begin{array}{l}\mathrm{CRTpowerI}\kern0.5em \left(\mathrm{J}=3,\kern0.5em \mathrm{K}=10,\kern0.5em \mathrm{alpha}=0.05,\kern0.5em \mathrm{psi}=0.80,\kern0.5em \mathrm{rho}1=0.15,\kern0.5em \mathrm{rho}2=0.03,\kern0.5em \mathrm{Delta}=0.80,\kern0.5em \mathrm{P}=0.5\right)\hfill \\ {}3\hfill \end{array} $$

When the required sample size is determined on the basis of a desired width of the confidence interval of L' = .70, from Eq. 30 the required minimum number of units I to achieve L ≤ .70 is 30 for P = .5. When using the provided programs, the same result can be obtained:

$$ \begin{array}{c}\hfill \mathrm{CRTconfidenceintervalI}\kern0.5em \left(\mathrm{J}=3,\kern0.5em \mathrm{K}=10,\kern0.5em \mathrm{alpha}=0.05,\kern0.5em \mathrm{rho}1=0.15,\kern0.5em \mathrm{rho}2=0.03,\kern0.5em \mathrm{L}=0.70,\mathrm{P}=0.5\right)\hfill \\ {}\hfill 30\hfill \end{array} $$

Some results related to the derived formulas

This section addresses several useful and important results related to the formulas derived above.

Comparing CRTs and MRTs

Comparing the numerators in the square roots of \( se\left({\widehat{\delta}}_C\right) \) and \( se\left({\widehat{\delta}}_M\right) \) in Eq. 19 shows that \( f - (1 - {\rho}_1) = 1 + I(J - 1){\rho}_2 + (I - 1){\rho}_1 - (1 - {\rho}_1) = I(J - 1){\rho}_2 + I{\rho}_1 = I\left[(J - 1){\rho}_2 + {\rho}_1\right] \ge 0 \), since I ≥ 1, J ≥ 1, \( {\rho}_2 \ge 0 \), and \( {\rho}_1 \ge 0 \). Therefore, \( se\left({\widehat{\delta}}_C\right) \ge se\left({\widehat{\delta}}_M\right) \), and MRTs are always preferable to CRTs. This relation indirectly indicates that \( {\phi}_c \le {\phi}_m \) and \( L_c \ge L_m \) for any combination of I, J, K, \( {\rho}_2 \), and \( {\rho}_1 \). Here, \( {\phi}_c \) (or \( {\phi}_m \)) and \( L_c \) (or \( L_m \)) are the statistical power and the width of the confidence interval for CRTs (or MRTs). From Eq. 19, increasing I, J, and K and conducting a balanced design (P = 1/2) both lead to a smaller \( se\left(\widehat{\delta}\right) \), since ∂se(δ)/∂I < 0, ∂se(δ)/∂J < 0, ∂se(δ)/∂K < 0, and ∂se(δ)/∂P < 0 (P < 1/2) in MRTs and CRTs. However, the strengths of the effects of improving I, J, K, and P on se(δ) differ between MRTs and CRTs. For example, it can be shown that

$$ \partial s{e}^2\left({\delta}_m\right)/\partial J-\partial s{e}^2\left({\delta}_c\right)/\partial J={I}^2 KP\left(1-P\right)\left({\rho}_1-{\rho}_2\right)/W\ge 0, $$
(33)
$$ \partial s{e}^2\left({\delta}_m\right)/\partial K-\partial s{e}^2\left({\delta}_c\right)/\partial K={I}^2 JP\left(1-P\right)\left[\left(J-1\right){\rho}_2+{\rho}_1\right]/W\ge 0, $$
(34)
$$ \partial s{e}^2\left({\delta}_m\right)/\partial P-\partial s{e}^2\left({\delta}_c\right)/\partial P={I}^2 JK\left(1-2P\right)\left[\left(J-1\right){\rho}_2+{\rho}_1\right]/W\ge 0, $$
(35)

where \( W = {\left[IJKP\left(1 - P\right)\right]}^2 \ge 0 \). Namely, the effects of improving the values of J, K, and P are always more dominant in MRTs than in CRTs. Interestingly, the corresponding result for I is \( \partial s{e}^2\left({\delta}_m\right)/\partial I-\partial s{e}^2\left({\delta}_c\right)/\partial I=0 \). Namely, the strength of the effect of improving I is the same in MRTs and CRTs.

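As for the influence of \( {\rho}_1 \), an opposite relation holds between MRTs and CRTs, since \( \partial s{e}^2\left({\delta}_m\right)/\partial {\rho}_1 = -1/\left[IJKP\left(1 - P\right)\right] < 0 \) and \( \partial s{e}^2\left({\delta}_c\right)/\partial {\rho}_1 = \left(I - 1\right)/\left[IJKP\left(1 - P\right)\right] \ge 0 \). Namely, a larger \( {\rho}_1 \) always leads to a smaller \( se\left({\widehat{\delta}}_M\right) \) and a larger \( se\left({\widehat{\delta}}_C\right) \), and when I ≥ 2, the absolute strength of the influence of \( {\rho}_1 \) is at least as large in CRTs as in MRTs.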

Relative influences of ρ2 and ρ1 in CRTs

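In CRTs, both \( {\rho}_2 \) and \( {\rho}_1 \) are included in \( se\left({\widehat{\delta}}_C\right) \), and the influences of these ICCs on \( se\left({\widehat{\delta}}_C\right) \) differ. Namely, as Heo and Leon (2008) briefly derived in the case of a balanced design, although larger \( {\rho}_2 \) and \( {\rho}_1 \) lead to a larger se(δ), the influence of \( {\rho}_2 \) is greater than that of \( {\rho}_1 \), because \( \partial f/\partial {\rho}_2 = I(J - 1) > \partial f/\partial {\rho}_1 = (I - 1) \ge 0 \) when J ≥ 2, indicating the dominance of \( \sigma_3^2 \) over \( \sigma_2^2 \).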

Asymptotic power and confidence intervals and minimum requirement for K in CRTs

Since \( se\left({\widehat{\delta}}_C\right) \) includes I and J in both its numerator and its denominator, it does not approach 0 but rather has a lower limit even when I and J become infinite under a fixed number of highest-level units (K). This lower limit of \( se\left({\widehat{\delta}}_C\right) \) when I → ∞ and J → ∞ under a fixed K can be derived as follows:

$$ \begin{array}{l}\underset{J\to \infty }{ \lim}\underset{I\to \infty }{ \lim}\sqrt{\frac{f}{ IJKP\left(1-P\right)}}\hfill \\ {}\hfill =\underset{J\to \infty }{ \lim}\left[\underset{I\to \infty }{ \lim}\sqrt{\frac{1+I\left(J-1\right){\rho}_2+\left(I-1\right){\rho}_1}{ IJKP\left(1-P\right)}}\right]\hfill \\ {}=\underset{J\to \infty }{ \lim}\sqrt{\frac{\left(J-1\right){\rho}_2+{\rho}_1}{ JKP\left(1-P\right)}}\hfill \\ {}=\sqrt{\frac{\rho_2}{ KP\left(1-P\right)}}>0.\hfill \end{array} $$
(36)

Combining this result and the relation of Eq. 11, a limit value of E(Z) can be evaluated as \( \left(\sqrt{ KP\left(1-P\right)}\varDelta \right)/\sqrt{\rho_2} \). Then, from Eq. 22, the following relation is obtained, indicating the minimum requirement for K:

$$ K\ge \frac{\rho_2{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2}{P\left(1-P\right){\varDelta}^2}. $$
(37)

Namely, if K does not satisfy the relation above, the actual statistical power ϕ does not exceed the desired statistical power ψ, even when I and J become infinite. Additionally, from the right side of Eq. 37, this minimum required K becomes trivial when \( {\rho}_2 = 0 \) or ∆ is sufficiently large.

A similar equation can be derived for a desired width of the confidence interval, using Eq. 28:

$$ K\ge \frac{4{z}_{1-\alpha /2}^2{\rho}_2}{P\left(1-P\right){{\displaystyle {L}^{\prime}}}^2}. $$
(38)

Naturally, Eqs. 37 and 38 reduce to the results obtained in Usami (2011b) when the number of levels and P are restricted to two (K = 1 and ρ 2 = 0) and 1/2 (balanced design). Minimum integer values of K under α = .05 for both a desired statistical power and a desired width of the confidence interval are summarized in Table 1.

Table 1 Minimum required values of Level 3 units (K) at two-sided significance level of α=.05 in CRTs

Interestingly, these results can also be derived from Eqs. 25 and 31. Namely, the denominators on the right sides of these equations should be positive [\( IKP\left(1-P\right){\varDelta}^2-{\rho}_2I{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2\ge 0 \) and \( IKP\left(1-P\right)L{\prime}^2-4{z}_{1-\alpha /2}^2I{\rho}_2\ge 0 \)], since both numerators are always positive and the restriction J ≥ 1 should be satisfied. Then, Eqs. 37 and 38 can be directly derived from these relations.
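The lower bounds on K in Eqs. 37 and 38 depend only on ρ2, P, and either ∆ or L', so they are easy to tabulate. The following Python sketch computes them with the bounds rounded up to the next integer; the function names are hypothetical, and the sketch is not claimed to reproduce the layout or entries of Table 1.

```python
import math
from statistics import NormalDist


def min_K_power(rho2, Delta, alpha=0.05, psi=0.80, P=0.5):
    """Eq. 37: smallest integer K for which the desired power is attainable in a CRT."""
    nd = NormalDist()
    q = (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(psi)) ** 2
    return math.ceil(rho2 * q / (P * (1 - P) * Delta ** 2))


def min_K_ci(rho2, L, alpha=0.05, P=0.5):
    """Eq. 38: smallest integer K for which the desired interval width L' is attainable."""
    z2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    return math.ceil(4 * z2 * rho2 / (P * (1 - P) * L ** 2))


if __name__ == "__main__":
    # ICC, effect size, and width taken from the Examples section of this article.
    print(min_K_power(rho2=0.03, Delta=0.80))
    print(min_K_ci(rho2=0.03, L=0.70))
```

If the planned K is below either bound, increasing I or J alone cannot reach the target, so additional Level 3 units are needed.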

Cases with more than three levels

Through the same procedure discussed in the previous section, more generalized formulas can be derived for more than three levels. Let D and \( N_1, N_2, \dots, N_D \) be the number of levels and the numbers of units at each level, respectively. Let \( \sigma_d^2 \) (d = 1, 2, …, D) and \( {\rho}_d = \left({\sum}_{d'=d+1}^{D}{\sigma}_{d'}^2\right)/\left({\sigma}_1^2 + {\sigma}_2^2 + \dots + {\sigma}_D^2\right) \) (d = 1, 2, …, D − 1) be the residual variances and the ICCs among the Level d units under D-level models analogous to those discussed in the Statistical Model section. Then, the more generalized form of the standard errors of \( \widehat{\delta} \) can be derived as

$$ se\left({\widehat{\delta}}_G\right)=\left\{\begin{array}{l} se\left({\hat{\delta}}_{GM}\right)=\sqrt{\frac{1-{\rho}_1}{\left({\prod}_{d=1}^D{N}_d\right)P\left(1-P\right)}},(MRTs)\hfill \\ {}\hfill se\left({\hat{\delta}}_{GC}\right)=\sqrt{\frac{1+{\displaystyle {\sum}_{q=1}^{D-2}}\left[\left({\prod}_{d=1}^{D-q-1}{N}_d\right)\left({N}_{D-q}-1\right){\rho}_{D-q}\right]+\left({N}_1-1\right){\rho}_1}{\left({\prod}_{d=1}^D{N}_d\right)P\left(1-P\right)}},(CRTs)\hfill \end{array}\right. $$
(39)

for the respective trials when D ≥ 3. Naturally, when D = 3 (i.e., \( N_1 = I \), \( N_2 = J \), and \( N_3 = K \)), this formula corresponds to Eq. 19. In MRTs, more generalized formulas for the required sample size \( {\prod}_{d=1}^D{N}_d \) to achieve a desired statistical power and width of the confidence interval can be obtained through the same equations, namely, Eqs. 23 and 29. In CRTs, sample size determination formulas can be obtained for the respective units as well; the details are omitted here.
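The generalized standard errors in Eq. 39 can be evaluated with a short loop over the levels. In the following Python sketch, the function name `se_general` and its argument layout (a list `N` of unit counts from Level 1 to Level D and a list `rho` of ICCs from Level 1 to Level D − 1) are assumptions made for illustration; `math.prod` requires Python 3.8 or later.

```python
import math


def se_general(N, rho, P=0.5, design="MRT"):
    """Standard error of the effect estimate for a D-level design (Eq. 39).

    N   : [N_1, ..., N_D], numbers of units at each level
    rho : [rho_1, ..., rho_{D-1}], ICCs among the Level d units
    """
    D = len(N)
    if D < 3 or len(rho) != D - 1:
        raise ValueError("need D >= 3 levels and D - 1 ICCs")
    total = math.prod(N)
    if design == "MRT":
        numerator = 1 - rho[0]
    elif design == "CRT":
        numerator = 1 + (N[0] - 1) * rho[0]
        # Levels m = 2, ..., D - 1 (m = D - q in Eq. 39).
        for m in range(2, D):
            numerator += math.prod(N[:m - 1]) * (N[m - 1] - 1) * rho[m - 1]
    else:
        raise ValueError("design must be 'MRT' or 'CRT'")
    return math.sqrt(numerator / (total * P * (1 - P)))


if __name__ == "__main__":
    # With D = 3 this should agree with Eq. 19 for I = 3, J = 3, K = 10.
    print(se_general([3, 3, 10], [0.15, 0.03], design="MRT"))
    print(se_general([3, 3, 10], [0.15, 0.03], design="CRT"))
```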

The results discussed in the previous subsections also hold when D ≥ 3. For example, the strengths of the effects of improving the values \( N_2, N_3, \dots, N_D \) on se(δ) are always more dominant in MRTs than in CRTs, whereas the strength of the effect of improving \( N_1 \) is the same in MRTs and CRTs. Additionally, the minimum required number of the highest-level units \( N_D \) to achieve a desired statistical power and width of the confidence interval can be expressed through equations similar to Eqs. 37 and 38, as \( {N}_D\ge {\rho}_{D-1}{\left({z}_{1-\alpha /2}+{z}_{\psi}\right)}^2/\left[P\left(1-P\right){\varDelta}^2\right] \) and \( {N}_D\ge 4{z}_{1-\alpha /2}^2{\rho}_{D-1}/\left[P\left(1-P\right)L{\prime}^2\right] \), respectively.

Discussion

The present research provides closed-form generalized sample size determination formulas for testing effects in experimental research with hierarchical data, focusing on MRTs and CRTs. These formulas are derived considering both statistical power and the width of the confidence interval of a standardized effect size, on the basis of estimates from a random-intercept model for three-level data that considers both balanced and unbalanced designs. In the present research, as in Usami (2011b), formulas have been derived in a unified way that uses generalized least squares estimators for an experimental effect to overcome the restrictions of previous research. Some additional useful results not derived in the previous research, such as lower bounds on the needed units in the highest (third) level and equations for cases of more than three levels, are also addressed by these formulas. As was noted in the introduction, repeated measures data, paired data, and pre–post data can be analyzed through HLM, and these data are also within the scope of applying the formulas derived here. Additionally, R programs for calculating needed sample sizes are provided in the Appendices to facilitate the use of the derived formulas. Developing a more flexible program that also provides various outputs, including numerical tables, is an important topic for future research. The present and improved programs will be available on the author’s website (http://satoshiusami.com/).

As Usami (2011b) noted, almost no previous research focusing on sample size determination for hierarchical data has provided closed-form formulas and numerical tables that consider the desired width of the confidence intervals and that would be usable by applied researchers. Null hypothesis significance testing has been criticized, in that rejection of the null hypothesis itself does not provide useful information because, strictly speaking, the null hypothesis is rarely true in reality (Balluerka, Gómez, & Hidalgo, 2005; Cohen, 1994; Sedlmeier, 2009). The American Psychological Association has therefore recommended that researchers report confidence intervals (American Psychological Association, 2009). The derived Formulas 29–32 are simple and have closed forms, and thus seem to be effective tools to encourage applied researchers to collect data and interpret the obtained results on the basis of confidence intervals.

For simplicity, the explicit development of the method proposed here has been confined to a single factor with two levels. However, it will be straightforward to extend the proposed formulas to an arbitrary number of levels and factors. On this point, Usami (2011a) illustrated a simple, unified method of estimating the statistical power of various types of contrasts to be evaluated regarding main effects and interactions for two-factor between-subjects designs, using multiparameter tests based on Wald statistics.

Cases in which the outcome is binary or ordered are also intriguing topics for future research. An important disadvantage of the derived formulas comes from the assumption that no units will be missing at any level, although attrition does often occur, especially in Level 1 and 2 units. However, as Heo and Leon (2008) discussed, if variation in the numbers of respective units is completely random, in the sense of missing data, then the fixed sample sizes J and I may be replaced by \( \tilde{J}=\left(1/K\right){\displaystyle {\sum}_{k=1}^K{n}_k} \) and \( \tilde{I}=\left(1/(\tilde{J}K)\right){\displaystyle {\sum}_{k=1}^K}\;{\displaystyle {\sum}_{j=1}^{n_k}}\;{n}_{jk} \), respectively. Here, \( n_k \) denotes the number of Level 2 units in the kth Level 3 unit, and \( n_{jk} \) denotes the number of Level 1 units in the jth Level 2 unit nested within the kth Level 3 unit.
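The replacement values \( \tilde{J} \) and \( \tilde{I} \) are simple averages and can be computed from the observed unit counts. The following Python sketch assumes a nested-list layout in which entry `[k][j]` holds \( n_{jk} \); the function name `mean_unit_sizes` and this data layout are hypothetical choices made here for illustration.

```python
def mean_unit_sizes(level1_counts):
    """Return (J_tilde, I_tilde) for an unbalanced three-level design.

    level1_counts[k][j] is n_jk, the number of Level 1 units in the jth
    Level 2 unit of the kth Level 3 unit, so len(level1_counts[k]) is n_k.
    """
    K = len(level1_counts)
    n_k = [len(level2_units) for level2_units in level1_counts]
    j_tilde = sum(n_k) / K  # mean number of Level 2 units per Level 3 unit
    total_level1 = sum(sum(level2_units) for level2_units in level1_counts)
    i_tilde = total_level1 / (j_tilde * K)  # mean Level 1 units per Level 2 unit
    return j_tilde, i_tilde


# Three Level 3 units containing 3, 2, and 4 Level 2 units, respectively
counts = [[10, 12, 8], [9, 11], [10, 10, 10, 10]]
print(mean_unit_sizes(counts))  # (3.0, 10.0)
```

Because \( \tilde{J}K \) equals the total number of Level 2 units, \( \tilde{I} \) is the total number of Level 1 units divided by the total number of Level 2 units. The resulting values need not be integers.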

One possible major limitation of the present research is that the formulas were derived on the basis of the random-intercept model. There are merits to considering the random-intercept model, since this model provides direct information about intraclass correlations, which are helpful in determining whether multilevel models are required in the first place. However, the values of slopes can vary significantly among clusters, and the random-intercept model may not be realistic for actual data. Several researchers (Maas & Hox, 2005; Raudenbush & Liu, 2000; Usami, 2011a) have loosened this assumption and discussed ways of evaluating the needed sample size on the basis of a random-intercepts-and-slopes model. Although the relevant parameters and indices for evaluating statistical power would be more complex, the formulas proposed here could be directly extended to the case of a random-intercepts-and-slopes model, and this could be an intriguing topic for future research.

Another limitation is the unrealistic assumption that the residual variances are already known, which ignores asymptotic features of the sampling distribution and leads to the use of a normal distribution to conduct the statistical test of Eq. 11. Therefore, sample sizes calculated from the derived formulas will generally be optimistic and negatively biased. As Heo and Leon (2008) noted, more accurate formulas could be evaluated under noncentral t distributions. However, as simulations performed by Heo and Leon (2008) and Usami (2011a) showed, differences between these distributions are trivial, because when the degrees of freedom are more than 20, the t distribution approaches a normal distribution. This point seems to be important only when the estimated required sample size becomes small (i.e., less than 20), for example, when a large effect size is assumed.

In estimating the required sample size for hierarchical data, one major problem facing all researchers designing CRTs is the need to specify ICCs (Smeeth & Ng, 2002). In CRTs, the specification of ICCs becomes problematic in behavioral research (for clinical trials, see Hedges & Hedberg, 2007; Murray, Varnell, & Blitstein, 2004; Shoukri, Asyali, & Donner, 2004), since slight misspecifications of the ICCs may cause seriously biased estimation of required sample sizes. As Smeeth and Ng (2002) and Usami (2011b) pointed out, the ideal solution would be to have ICCs available from previous studies that were large enough and that had sufficient clusters to generate reasonably accurate estimates of the ICC for the variable of interest, although this is generally impossible in practice. As Usami (2011b) noted, although the conventional criteria provided in the literature, such as Raudenbush and Bryk (2002, where ICCs of .05, .10, and .15 are small, medium, and large, respectively) and Hox (2010, where ICCs of .10, .20, and .30 are small, medium, and large, respectively), are useful when no informative data are available, actual ICCs depend heavily on the features of the variables of interest and the units. Presenting estimated ICCs for a range of outcomes through a review, as Smeeth and Ng have done for clinical trial research, will be a very useful aid for the specification of ICCs, and such reviews will be strongly desired for various research areas. As another strategy, constructing models that include covariates to explain the variance of outcomes Y would also be a useful approach to excluding the influence of ICCs (Hedges & Hedberg, 2007; Murray & Blitstein, 2003). However, note that when such covariates correlate highly not only with the outcomes Y but also with the assignment indicator variable X, estimates for an experimental effect δ may be strongly biased and more difficult to interpret (Usami, 2011b).
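The sensitivity of the required sample size to the assumed ICCs can be illustrated numerically. The Python sketch below rearranges the CRT line of Eq. 39 for D = 3 under the normal-approximation power criterion \( \delta / se(\widehat{\delta}) \ge z_{1-\alpha/2} + z_{\psi} \); this rearrangement, the function name `required_k`, and the chosen design values are assumptions made here for illustration and are not necessarily identical to the sample size equations of the paper.

```python
from math import ceil
from statistics import NormalDist


def required_k(i, j, rho1, rho2, delta, p=0.5, alpha=0.05, power=0.80):
    """Number of Level 3 units needed in a three-level CRT (assumed rearrangement).

    i, j: Level 1 units per Level 2 unit and Level 2 units per Level 3 unit;
    rho1, rho2: ICCs as defined for Eq. 39 (rho1 >= rho2 by construction).
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    design = 1 + i * (j - 1) * rho2 + (i - 1) * rho1  # numerator of Eq. 39, D = 3
    crit = (z(1 - alpha / 2) + z(power)) ** 2
    return ceil(design * crit / (i * j * p * (1 - p) * delta ** 2))


# Vary rho_2 over .05, .10, .15 while holding rho_1 - rho_2 = .10 fixed
for rho2 in (0.05, 0.10, 0.15):
    print(rho2, required_k(i=10, j=5, rho1=rho2 + 0.10, rho2=rho2, delta=0.3))
# prints 31, 48, and 65 Level 3 units, respectively
```

Under these illustrative settings, moving the assumed ρ2 from .05 to .15 roughly doubles the required number of Level 3 units, which is why a misspecified ICC can seriously bias the planned sample size.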

Although many issues remain to be investigated in future research, the formulas and related results derived here will be of great help to researchers who design experiments with hierarchical data, based on either MRTs or CRTs, in estimating the sample size required to achieve a desired statistical power and width of the confidence intervals when evaluating an experimental effect.