Longitudinal studies are increasingly common in educational and psychological research settings. In some cases, subjects are measured repeatedly over time in order to examine their individual growth and the potential differences among them. In other cases, subjects assigned to different experimental conditions are treated for a specific period of time and when the study is finished they are compared with respect to their average growth rates. Whatever the purpose of the study, it is usual and reasonable to model the change in the response of interest assuming linear growth (Willett, 1988) and to express the effect of the intervention in terms of the difference in mean slopes or rates of change among groups over time.

A wide variety of methods based on classical linear models can be applied to the analysis of longitudinal data. However, imbalance due to missing responses from some subjects, together with the fact that observations from the same subject are generally correlated, can lead to erroneous conclusions regarding the hypotheses of interest. Among other reasons, this is why multilevel hierarchical linear models have become the method of choice for modeling change in the response over time and the factors influencing that change.

Modeling longitudinal data using a hierarchical system of regression equations requires sufficient experimental units in order to detect the effects of interest at the desired power level. Hence, it is advisable to determine the sample size when planning a longitudinal study. Numerous publications have explained how to calculate the sample size in this type of study (e.g., Heo, Xue, & Kim, 2013; Muthén & Curran, 1997; Raudenbush & Liu, 2001; Usami, 2014; Wänström, 2009). There are also many software packages (e.g., ACluster, nQuery, OptimalDesign, PASS, PinT, or RMASS2) that can be used to perform sample size/power calculations with multilevel data. However, very few publications have dealt with informing researchers on this topic about errors due to heterogeneous variances across treatment groups and/or when it is expected that some subjects will leave the study prematurely (Hedeker, Gibbons, & Waternaux, 1999; Heo, 2014; Roy, Bhaumik, Aryal & Gibbons, 2007; Vallejo, Ato, Fernández, Livacic-Rojas, & Tuero-Herrero, 2016).

Loss of subjects invariably occurs in longitudinal studies, potentially leading to inefficient analyses and invalid conclusions. The existence of heterogeneity has been found in several reviews of studies published in psychology journals (cf. Erceg-Hurn & Mirosevich, 2008). This phenomenon is not only likely to occur in nonrandomized intervention studies, but it can also occur in completely randomized experiments. Some common causes of heterogeneity in real data are problems related to measurement validity, research design, and analysis (e.g., unclear randomization, high dropout rates, small sample sizes, presence of floor or ceiling effects in treatment outcome measures, differential treatment effects across subjects, or bad data). Regardless of the potential sources of heterogeneity, neglecting heterogeneity when it is present can lead to inefficient and potentially misleading inferences about fixed effects. For more detailed information about why heterogeneity occurs in intervention studies, see Grissom and Kim (2012). Also, Keselman, Algina, Lix, Wilcox, and Deering (2008) discuss the impact that heterogeneous variances have on error probabilities.

For the derivation of the power function, it is generally assumed that all variance components included in the multilevel models are known. When suitable prior information is not available, specification of these random components is sometimes a difficult task. In these cases, a possible solution is to simplify the procedure of power analysis by assuming that some effects vary randomly between subjects or clusters, whereas others are constrained to be fixed effects (e.g., a model with nonrandomly varying slopes). These restrictions are sometimes specified in applied research (e.g., Heo & Leon, 2008, 2009). When a source of variation is completely ignored, however, this can lead to overly optimistic sample size and power calculations. For instance, if random-intercept models are used inappropriately, given that both random-intercept and -slope models need to be considered, there is a considerable risk of finding high apparent power, because the so-called random-intercept model generally has a poor control of the Type I error rate (Vallejo, Ato, & Valdés, 2008).

Usami (2014) has developed a procedure that can be applied in order to examine the statistical power to detect a significant group-by-time interaction in a two-level random-coefficient regression model, especially when no informative variance components are available. However, this author confined the development of the proposed method for investigating sample size requirement to detecting an intervention effect based on two groups for situations that assume a linear growth pattern of the outcomes over time, complete data for every subject, and homogeneous errors at both Levels 1 and 2. Subsequently, Vallejo et al. (2016) extended the procedure proposed by Usami (2014) to situations in which the presence of between-subjects heterogeneity can be reasonably predicted and the influence of attrition taken into consideration. However, the formulas derived by Vallejo et al. (2016) are restricted to models that assume a linear change in responses over time. Furthermore, the adequacy of the sample size determination formulas for heterogeneous and incomplete data has not been investigated.

The present study extended the work of Vallejo et al. (2016) so as to overcome the aforementioned limitations and, therefore, can be viewed as a generalization of the corresponding results of these authors. Specifically, our objective in this article is threefold: first, to extend the method originally proposed by Usami (2014) and later updated by Vallejo et al. (2016) to more complex growth models for power and sample size determinations; second, to carry out a Monte Carlo study to verify the statistical power achieved with the estimated sample sizes; and third, to check whether the theoretical statistical power based on estimates by ordinary least squares (OLS) differs from the empirical statistical power based on maximum likelihood (ML) estimates, by means of Monte Carlo simulations. In this study, we used restricted ML (REML) as the estimation method because, in multilevel modeling, REML estimates of variance components tend to be less biased than unrestricted ML estimates (Browne & Draper, 2000).

Formulation of a statistical model

Suppose we are interested in comparing the longitudinal trends of two groups, experimental versus control, in a numeric dependent variable. Considering that measures taken over time are nested in subjects, such data can be analyzed using a hierarchical regression model with two levels. At the first level, we represent the change we expect each subject of the population to experience during a specific period of time, whereas at the second level we describe the conjectured relationship between the parameters of individual growth and the explanatory variables that are assumed stable for the whole duration of the study.

Adopting an individual growth model in which change is a linear function of time, the Level 1 model can be formulated as follows:

$$ {Y}_{it}={b}_{0i}+{b}_{1i}{X}_{it}+{e}_{it}, $$
(1)

where Yit denotes the response of the ith subject (i = 1, . . . , N) at the tth measurement occasion (t = 1, . . . , T), Xit defines the specific time (e.g., days) at which this subject is observed, and the random parameters b0i (intercept), b1i (slope or rate of change), and eit (error term) represent, respectively, the true value of the subject’s response at baseline, the rate of change during the period of data collection, and the measurement error caused by deviation from linearity. In the absence of missing data, we assume that Xit = Xt for all i, and that measurements of the response from baseline (X1 = 0) to the last time point increase at time intervals of unit length, so that D = T − 1. It is important to observe that starting the time coding with X1 = 1 instead of X1 = 0 would be equivalent, but more difficult to interpret, because the value zero would then lie outside the range of observed measurement occasions.

At the second level, the parameters resulting from modeling the trajectories of individual change over time, are related to the explanatory variables that describe the differences between subjects in intercepts and slopes. If we have only one explanatory variable (e.g., a behavioral intervention to improve the language of autistic children), the Level 2 model becomes

$$ {b}_{0i}={\beta}_{00}+{u}_{0i}, $$
(2)
$$ {b}_{1i}={\upbeta}_{10}+{\upbeta}_{11}{W}_i+{u}_{1i}, $$
(3)

where the indicator variable of the intervention program is Wi = 0 if the ith Level 2 unit is assigned to the control group, and Wi = 1 if it is assigned to the experimental group. Because of the randomization of subjects to the two treatment groups, the Level 2 model for the intercept does not contain the group-level variable Wi, and we assume a common mean response at time t = 0. In this model, β00 is the mean baseline response in both the treatment and control groups, because no treatment main effect is assumed; β10 is the average rate of change of the control group; and β11 is the difference between the groups’ average rates of change. As a result, the average rate of change of the experimental group corresponds to the sum β10 + β11. The random variables u0i and u1i are independent of eit, and it is assumed that they follow a bivariate normal distribution with mean zero, variances τ00 and τ11, respectively, and covariance τ01.

Note that Eq. 2 specifies no predictors for b0i. Suppose, however, that this intercept depends on Wi. One might then formulate another form of the random-intercept model. Specifically, b0i = β00 + β01Wi + u0i, where β01 is the main effect of the treatment W on b0i. In this case, residual variance components τ00 and τ11, represent the variability that remains in parameters b0i and b1i after controlling the effect due to the program.

By substituting Eqs. 2 and 3 into Eq. 1, the mixed or combined model can be expressed as follows:

$$ {Y}_{it}={\beta}_{00}+{\beta}_{10}{X}_{it}+{\beta}_{11}{W}_i{X}_{it}+\left({u}_{1i}{X}_{it}+{u}_{0i}+{e}_{it}\right). $$
(4)

With no assumptions about group differences at baseline, Eq. 4 should also include Wi as a predictor. It is often assumed that the errors eit, conditional on u1i and u0i, are normally and independently distributed with mean zero and constant variance σ2. In this study we also considered the presence of heterogeneous variances across treatment groups, although we maintain the assumption that the errors are normally distributed.

Under the combined model of Eq. 4, the expected value, variance, and covariance of the measurements Yit, conditional on the explanatory variables, are given by

$$ E\left({Y}_{it}\right)={\beta}_{00}+\left({\beta}_{10}+{\beta}_{11}{W}_i\right){X}_{it}, $$
(5)
$$ Var\left({Y}_{it}\right)={\tau}_{00}+2{X}_{it}{\tau}_{01}+{X}_{it}^2{\tau}_{11}+{\sigma}^2, $$
(6)
$$ Cov\left({Y}_{it},{Y}_{i{t}^{\prime }}\right)={\tau}_{00}+\left({\mathrm{X}}_{it}+{\mathrm{X}}_{i{t}^{\prime }}\right){\tau}_{01}+{\mathrm{X}}_{it}{\mathrm{X}}_{i{t}^{\prime }}{\tau}_{11}. $$
(7)

If baseline values differ across groups, then Eq. 5 should also include the term β01Wi. (For more details on these equations, see Appendix 1.)
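As a concrete illustration, the moments in Eqs. 5–7 can be evaluated directly. The parameter values below are illustrative placeholders chosen for the example, not values taken from this article:

```python
# Sketch of Eqs. 5-7 for the linear growth model. The default parameter
# values (beta_00, beta_10, beta_11, tau_00, tau_01, tau_11, sigma2) are
# arbitrary illustrations.

def mean_Y(X, W, beta_00=0.0, beta_10=0.2, beta_11=0.3):
    """Eq. 5: expected response at time X for group W (0 = control)."""
    return beta_00 + (beta_10 + beta_11 * W) * X

def var_Y(X, tau_00=0.5, tau_01=0.1, tau_11=0.2, sigma2=0.5):
    """Eq. 6: marginal variance of Y at time X."""
    return tau_00 + 2 * X * tau_01 + X**2 * tau_11 + sigma2

def cov_Y(X, Xp, tau_00=0.5, tau_01=0.1, tau_11=0.2):
    """Eq. 7: covariance between measurements taken at times X and X'."""
    return tau_00 + (X + Xp) * tau_01 + X * Xp * tau_11
```

Note that at baseline (X = 0) the variance in Eq. 6 reduces to τ00 + σ2, the quantity that is later standardized to 1.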

If there are reasons to suspect that changes in the expected value of the outcome will deviate from linearity over the duration of the study, more complex models of growth can be considered. For example, if the average outcome increases monotonically with time until the improvement stabilizes, then we might consider the following curvilinear growth model:

$$ {\displaystyle \begin{array}{l}{Y}_{it}={\beta}_{00}+\left({\beta}_{10}+{\beta}_{11}{W}_i\right){X}_{it}+\left({\beta}_{20}+{\beta}_{21}{W}_i\right){X}_{it}^2+\\ {}\kern.5em \left({u}_{0i}+{u}_{1i}{X}_{it}+{u}_{2i}{X}_{it}^2+{e}_{it}\right).\end{array}} $$
(8)

Again, we can accept the groups as equivalent enough at the beginning of the study and omit a main effect of treatment from the model. To allow the intercepts (baselines) to differ by group, we add the dummy treatment variable W to the model of Eq. 8.

In the model of Eq. 8, the expected value, variance, and covariance of the measurements Yit, conditional on the explanatory variables, are now given by

$$ E\left({Y}_{it}\right)={\beta}_{00}+\left({\beta}_{10}+{\beta}_{11}{W}_i\right){X}_{it}+\left({\beta}_{20}+{\beta}_{21}{W}_i\right){X}_{it}^2, $$
(9)
$$ Var\left({Y}_{it}\right)={\tau}_{00}+2{X}_{it}{\tau}_{01}+{X}_{it}^2{\tau}_{11}+2{X}_{it}^2{\tau}_{02}+2{X}_{it}^3{\tau}_{12}+{X}_{it}^4{\tau}_{22}+{\sigma}^2, $$
(10)
$$ {\displaystyle \begin{array}{l} Cov\left({Y}_{it},{Y}_{i{t}^{\prime }}\right)={\tau}_{00}+\left({X}_{it}+{X}_{i{t}^{\prime }}\right){\tau}_{01}+{X}_{it}{X}_{i{t}^{\prime }}{\tau}_{11}+\\ {}\kern10em \left({X}_{it}^2+{X}_{i{t}^{\prime}}^2\right){\tau}_{02}+\left({X}_{it}{X}_{i{t}^{\prime}}^2+{X}_{it}^2{X}_{i{t}^{\prime }}\right){\tau}_{12}+{X}_{it}^2{X}_{i{t}^{\prime}}^2{\tau}_{22}.\end{array}} $$
(11)

Equation 9 should also include the term β01Wi when the baseline mean responses are not assumed equal.

To simplify the calculations further, it is useful to re-express Eqs. 4 and 8 of the multilevel model in terms of matrices and vectors, as follows:

$$ {\mathbf{y}}_i={\mathbf{X}}_i\beta +{\mathbf{Z}}_i{\mathbf{u}}_i+{\mathbf{e}}_i, $$
(12)

where yi is a T × 1 vector of repeated observations for the ith subject, Xi(=ZiAi) is a (T × P) design matrix for the fixed effects, β is a (P × 1) vector of fixed effects, Zi is a (T × Q) design matrix for the random effects, ui is a (Q × 1) vector of random effects, and ei is a (T × 1) vector of errors. Here, Zi is a within-subjects design matrix that characterizes how each subject’s mean response changes over time, and Ai is a (Q × P) between-subjects design matrix that contains time-invariant explanatory variables.

With respect to the errors and random effects, it is assumed that the vectors ei and ui are normally distributed with mean 0 and variance-covariance matrices Ri and T, respectively. Matrix Ri may take various forms; however, it is common to assume a model of conditional independence, that is, Ri = σ2IT, where IT is a T × T identity matrix. These assumptions imply that, marginally, \( {\mathbf{y}}_i\sim N\left({\mathbf{X}}_i\upbeta, {\mathbf{V}}_i={\mathbf{Z}}_i{\mathbf{TZ}}_i^{\prime }+{\mathbf{R}}_i\right) \). When Vi is known, the generalized least squares estimator of the vector β is given by \( \hat{\upbeta}={\left({\sum}_{i=1}^N{\mathbf{X}}_i^{\prime }{\mathbf{V}}_i^{-1}{\mathbf{X}}_i\right)}^{-1}{\sum}_{i=1}^N{\mathbf{X}}_i^{\prime }{\mathbf{V}}_i^{-1}{\mathbf{y}}_i \) and its variance by \( {\left({\sum}_{i=1}^N{\mathbf{X}}_i^{\prime }{\mathbf{V}}_i^{-1}{\mathbf{X}}_i\right)}^{-1} \). In the usual case in which Vi is unknown, an approximation to the true covariance is obtained by replacing Vi with its estimator \( \hat{{\mathbf{V}}_i} \).

Equations 5–7 and 9–11 are essential for planning a longitudinal study properly since, as we shall see later, they provide the machinery that allows us to carry out a correct power analysis. To estimate the sample size required to detect a statistically significant group-by-time interaction effect, it is necessary to specify the values of the parameters included in Eqs. 1–3 of the model. However, such a task is neither easy nor straightforward, given that in many cases it is impossible to surmise the values of the parameters without running the experiment. Hence, in practice, the use of existing methods for calculating the sample size is limited to situations in which researchers are able to anticipate a range of probable values of the parameters of interest from the results obtained in previous studies.

In an attempt to focus the power analysis in studies in which linear growth is assumed, Usami (2014) suggested transforming the variance components associated with the model of Eq. 4 and the parameter related to the treatment (i.e., β11) into statistical indices whose values could reasonably be specified in advance. These are the reliability of the measure at baseline (ρ1), the standardized effect size at the last time point (dL), the Level 2 residual correlation (r1), and the ratio of the within-group variance of the outcome at the end of the study to that at the beginning (k1). Formally,

$$ {\uprho}_1=\frac{Var\left({u}_{0i}\right)}{Var\left({u}_{0i}+{e}_{it}\right)}=\frac{\tau_{00}}{\tau_{00}+{\sigma}^2}, $$
(13)
$$ {d}_L=\frac{E\left({Y}_{iT}\left|{W}_i=1\right.\right)-E\left({Y}_{iT}\left|{W}_i=0\right.\right)}{\sqrt{Var\left({Y}_{iT}\right)}}=\frac{D{\upbeta}_{11}}{\sqrt{\tau_{00}+2D{\tau}_{01}+{D}^2{\tau}_{11}+{\sigma}^2}}, $$
(14)
$$ {r}_1=\frac{Cov\left({u}_{0i},{u}_{1i}\right)}{\sqrt{Var\left({u}_{0i}\right) Var\left({u}_{1i}\right)}}=\frac{\tau_{01}}{\sqrt{\tau_{00}{\tau}_{11}}}, $$
(15)

and

$$ {k}_1=\frac{Var\left({Y}_{iT}\right)}{Var\left({Y}_{i1}\right)}=\frac{\tau_{00}+2D{\tau}_{01}+{D}^2{\tau}_{11}+{\sigma}^2}{\tau_{00}+{\sigma}^2}. $$
(16)

It is important to note that the effect size parameter of Eq. 14 depends on the sum β01 + Dβ11, rather than on β11 alone, when β01 ≠ 0.

By solving Eqs. 15 and 16 simultaneously, the following components of variance and covariance are obtained (see Appendix 2):

$$ {\tau}_{01}=\frac{-{r}_1^2{\tau}_{00}+{r}_1\sqrt{r_1^2{\tau}_{00}^2+{\tau}_{00}\left({k}_1-1\right)\left({\tau}_{00}+{\sigma}^2\right)}}{D}, $$
(17)
$$ {\tau}_{11}=\frac{2{r}_1^2{\tau}_{00}+\left({k}_1-1\right)\left({\tau}_{00}+{\sigma}^2\right)-2{r}_1\sqrt{r_1^2{\tau}_{00}^2+{\tau}_{00}\left({k}_1-1\right)\left({\tau}_{00}+{\sigma}^2\right)}}{D^2}. $$
(18)

At the same time, by replacing Var(YiT) in Eq. 14 with the value found for it in Eq. 16, the coefficient associated with the effect of linear treatment can be written as:

$$ {\upbeta}_{11}=\frac{d_L\sqrt{k_1\left({\tau}_{00}+{\sigma}^2\right)}}{D}. $$
(19)

Please note that if β01 ≠ 0, then \( {\upbeta}_{11}=\left(-{\beta}_{01}+{d}_L\sqrt{k_1\left({\tau}_{00}+{\sigma}^2\right)}\right)/D. \)

Without loss of generality, we can assume that the variance of the initial outcome is equal to 1 (i.e., τ00 + σ2 = 1). In this case, Eqs. 13–19 reduce to those given by Usami (2014). The restriction above makes it possible to calculate the parameters of the model by specifying the values of ρ1, dL, r1, and k1. It should be noted that these indices can be specified intuitively, which largely circumvents the difficulty, common in exploratory studies, of defining the values of the parameters before running the experiment. In addition, Usami found that the indices ρ1, r1, and k1 have less influence on the sample size calculation than does dL, in particular when dL > 0.4.
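Under the standardization τ00 + σ2 = 1, the mapping from the four indices to the model parameters in Eqs. 17–19 can be sketched as follows. The function name and the index values in the example are ours, chosen arbitrarily for illustration:

```python
import math

def components_from_indices(rho1, dL, r1, k1, D):
    """Recover tau_01, tau_11 (Eqs. 17-18) and beta_11 (Eq. 19) from the
    indices rho1, dL, r1, k1, assuming tau_00 + sigma^2 = 1."""
    tau_00 = rho1                     # Eq. 13 under the standardization
    sigma2 = 1.0 - rho1
    # Common square-root term of Eqs. 17-18 with tau_00 + sigma^2 = 1:
    s = math.sqrt(r1**2 * tau_00**2 + tau_00 * (k1 - 1.0))
    tau_01 = (-r1**2 * tau_00 + r1 * s) / D                           # Eq. 17
    tau_11 = (2 * r1**2 * tau_00 + (k1 - 1.0) - 2 * r1 * s) / D**2    # Eq. 18
    beta_11 = dL * math.sqrt(k1) / D                                  # Eq. 19
    return tau_00, sigma2, tau_01, tau_11, beta_11
```

Substituting the recovered components back into Eqs. 15 and 16 reproduces r1 and k1 exactly, which provides a quick sanity check on the algebra.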

So far, we have focused on a series of formulas derived in order to run a prospective analysis of power in models that assume linear growth. However, this approach can be extended to more complex curvilinear growth models, including polynomial and piecewise growth models. For instance, the outcome may follow a quadratic trend that would require the inclusion of the second-order treatment effect in the model (see Eq. 8).

The calculation of an appropriate sample size for detecting curvature in growth rates relies on transforming the model parameters (i.e., τ02, τ12, τ22, and β21) into indices that can be specified from a literature review and conjecture. In addition to those specified in Eqs. 13–16, this new situation requires four additional indices. Using the results of Eqs. 9–11, these are defined as follows:

$$ {\displaystyle \begin{array}{c}{d}_Q=\frac{E\left({Y}_{iT}|{W}_i=1\right)-E\left({Y}_{iT}|{W}_i=0\right)}{\sqrt{Var\left({Y}_{iT}\right)}}\\ {}=\frac{D{\beta}_{11}+{D}^2{\beta}_{21}}{\sqrt{\tau_{00}+2D{\tau}_{01}+{D}^2{\tau}_{11}+2{D}^2{\tau}_{02}+2{D}^3{\tau}_{12}+{D}^4{\tau}_{22}+{\sigma}^2}},\end{array}} $$
(20)
$$ {r}_2=\frac{Cov\left({u}_{0i},{u}_{2i}\right)}{\sqrt{Var\left({u}_{0i}\right) Var\left({u}_{2i}\right)}}=\frac{\tau_{02}}{\sqrt{\tau_{00}{\tau}_{22}}}, $$
(21)
$$ {r}_{12}=\frac{Cov\left({u}_{1i},{u}_{2i}\right)}{\sqrt{Var\left({u}_{1i}\right) Var\left({u}_{2i}\right)}}=\frac{\tau_{12}}{\sqrt{\tau_{11}{\tau}_{22}}}, $$
(22)

and

$$ {k}_2=\frac{Var\left({Y}_{iT}\right)}{Var\left({Y}_{i1}\right)}=\frac{\tau_{00}+2D{\tau}_{01}+{D}^2{\tau}_{11}+2{D}^2{\tau}_{02}+2{D}^3{\tau}_{12}+{D}^4{\tau}_{22}+{\sigma}^2}{\tau_{00}+{\sigma}^2}. $$
(23)

Again, it is important to note that the effect size parameter of Eq. 20 depends on the sum β01 + Dβ11 + D2β21, rather than on the sum Dβ11 + D2β21, when β01 ≠ 0.

By solving Equations 21–23 simultaneously, a series of equations of the form ax2 + bx + c = 0 is obtained (see Appendix 2). The solutions, or roots, which correspond to the variance components we seek, can be obtained by solving each quadratic equation using the familiar formula of Bhaskara (cf. Puttaswamy, 2012):

$$ {\tau}_{02}=\frac{-{B}_{02}\pm \sqrt{B_{02}^2-4{A}_{02}{C}_{02}}}{2{A}_{02}}, $$
(24)

where

$$ {\displaystyle \begin{array}{l}{A}_{02}={D}^4;\kern0.5em {B}_{02}=2{D}^2{r}_2^2{\tau}_{00}+2{D}^3{r}_{12}\sqrt{\left({\tau}_{11}/{\tau}_{00}\right)}{r}_2{\tau}_{00};\kern0.5em {C}_{02}=2D{\tau}_{01}{r}_2^2{\tau}_{00}+{D}^2{\tau}_{11}{r}_2^2{\tau}_{00}-\\ {}\left({k}_2-1\right)\left({\tau}_{00}+{\sigma}^2\right){r}_2^2{\tau}_{00};\\ {}{\tau}_{12}=\frac{-{B}_{12}\pm \sqrt{B_{12}^2-4{A}_{12}{C}_{12}}}{2{A}_{12}},\end{array}} $$
(25)

where

$$ {\displaystyle \begin{array}{l}{A}_{12}={D}^4;\kern0.5em {B}_{12}=2{D}^2{r}_2\sqrt{\left({\tau}_{00}/{\tau}_{11}\right)}{r}_{12}{\tau}_{11}+2{D}^3{r}_{12}^2{\tau}_{11};\kern0.5em {C}_{12}=2D{\tau}_{01}{r}_{12}^2{\tau}_{11}+{D}^2{\tau}_{11}{r}_{12}^2{\tau}_{11}-\\ {}\left({k}_2-1\right)\left({\tau}_{00}+{\sigma}^2\right){r}_{12}^2{\tau}_{11};\ \mathrm{and}\\ {}{\tau}_{22}=\frac{-{B}_{22}\pm \sqrt{B_{22}^2-4{A}_{22}{C}_{22}}}{2{A}_{22}},\end{array}} $$
(26)

where

$$ {\displaystyle \begin{array}{l}{A}_{22}={D}^8;{B}_{22}=-{\left(2{D}^2{r}_2\sqrt{\tau_{00}}+{D}^3{r}_{12}\sqrt{\tau_{11}}\right)}^2+2{D}^6{\tau}_{11}+4{D}^5{\tau}_{01}-2{D}^4\left({k}_2-1\right)\left({\tau}_{00}+{\sigma}^2\right);\\ {}{C}_{22}=4{D}^2{\tau}_{01}^2+{D}^4{\tau}_{11}^2+{\left({k}_2-1\right)}^2{\left({\tau}_{00}+{\sigma}^2\right)}^2+4{D}^3{\tau}_{01}{\tau}_{11}-2\left({k}_2-1\right)\left({\tau}_{00}+{\sigma}^2\right)\left(2D{\tau}_{01}+{D}^2{\tau}_{11}\right).\end{array}} $$
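Each of Eqs. 24–26 is solved with the same quadratic (Bhaskara) formula once the corresponding A, B, C coefficients are in hand; a generic helper makes this explicit (the function name is ours):

```python
import math

def quadratic_roots(a, b, c):
    """Both roots of a*x^2 + b*x + c = 0, as used for tau_02, tau_12,
    and tau_22 in Eqs. 24-26 after computing the A, B, C coefficients."""
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("no real roots: check the index values supplied")
    r = math.sqrt(disc)
    return (-b + r) / (2 * a), (-b - r) / (2 * a)
```

Of the two roots, the admissible one is that yielding a positive-definite Level 2 covariance matrix.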

Finally, by substituting in Eq. 20 the value found for Var(YiT) in Eq. 23, the coefficient for the quadratic treatment effect can be written as

$$ {\upbeta}_{21}=\frac{d_Q\sqrt{k_2\left({\tau}_{00}+{\sigma}^2\right)}-{d}_L\sqrt{k_1\left({\tau}_{00}+{\sigma}^2\right)}}{D^2}. $$
(27)

In the presence of a main effect of the treatment W, the slope formula would have the same form as that provided in Eq. 27, because both dL and dQ contain information about β01.
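Equation 27 is straightforward to evaluate once the indices are fixed. The sketch below uses arbitrary illustrative index values; by construction, Dβ11 + D2β21 recovers the quadratic effect size dQ scaled by √(k2(τ00 + σ2)):

```python
import math

def beta_21(dL, dQ, k1, k2, tau_00, sigma2, D):
    """Eq. 27: quadratic treatment coefficient implied by the linear (dL)
    and quadratic (dQ) standardized effect sizes."""
    v1 = tau_00 + sigma2          # baseline variance of the outcome
    return (dQ * math.sqrt(k2 * v1) - dL * math.sqrt(k1 * v1)) / D**2
```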

Appendix 3 provides the machinery that allows us to carry out a correct power analysis using piecewise models. Because the data from many longitudinal studies can be well approximated using simple piecewise linear models with at most one or two knots located at judiciously chosen time points (Fitzmaurice, Laird, & Ware, 2011, p. 151), we present only a random two-slope piecewise model in which the entire growth period of the outcome under study is split into two parts: (1) linear growth from the baseline to the breakpoint, and (2) linear growth from the breakpoint to the last time point in the study. Obviously, when determining the sample size, the location of the breakpoint must be known ahead of time.

Estimation of the treatment effect and its variance

The goal of a longitudinal intervention study is to test whether there are differences between treatment conditions with respect to their average growth rates. If the change is conceptualized as a sustained linear process, then we must verify whether β11 ≠ 0. With two groups (e.g., experimental versus control), the OLS estimator of \( {\upbeta}_{11} \) can be expressed as:

$$ {\hat{\upbeta}}_{11}=\frac{\sum \limits_{i=1}^{N_E}\sum \limits_{t=1}^T\left({X}_{it}-\overline{X_i}\right){Y}_{it}}{\sum \limits_{i=1}^{N_E}\sum \limits_{t=1}^T{\left({X}_{it}-\overline{X_i}\right)}^2}-\frac{\sum \limits_{i=1}^{N_C}\sum \limits_{t=1}^T\left({X}_{it}-\overline{X_i}\right){Y}_{it}}{\sum \limits_{i=1}^{N_C}\sum \limits_{t=1}^T{\left({X}_{it}-\overline{X_i}\right)}^2}, $$
(28)

where NE and NC are the treatment and control group sample sizes, respectively. The generalization of Eq. 28 to more than one active treatment is not immediate, but it is straightforward to derive (see Appendix 4).

To test the interaction between the Level 1 variable (time) and the Level 2 variable (treatment), the variance of the estimator of β11 is required. Using Eqs. 6 and 7, and noting that the variance of a difference between independent groups reduces to the sum of their variances, ordinary algebra shows that (see Appendix 5):

$$ Var\left({\hat{\upbeta}}_{11}\right)=\frac{4}{N}\left(\frac{\sigma^2}{\sum_{t=1}^T{\left({X}_{it}-{\overline{X}}_i\right)}^2}+{\tau}_{11}\right), $$
(29)

where N (= NE + NC) denotes the total number of second-level units included in the study, with N/2 subjects in each group. The quantity 4/N on the right side of Eq. 29 should be replaced with 1/(Np1p2) to allow for groups of unequal size, where p1 = NC /N and p2 = NE /N.

If the T measures between X1 = 0 and XT = D are equally spaced, Eq. 29 can be reformulated as follows (see Fitzmaurice et al., 2011):

$$ Var\left({\hat{\upbeta}}_{11}\right)=\frac{4}{N}\left(\frac{12{\sigma}^2\left(T-1\right)}{D^2T\left(T+1\right)}+{\tau}_{11}\right), $$
(30)

where D = f−1(T − 1) and f is the frequency of observation per unit of time, whereas V1 (the first term in parentheses in Eq. 30) and τ11 denote the variability in growth rates within and across subjects, respectively. The sum V1 + τ11, denoted \( {\sigma}_{b1}^2 \) from here onward, is a measure of the variability of the OLS estimate of the slope of the model in Eq. 1.

When growth is assumed to be linear and f = 1 (i.e., Xt = 0, 1, 2, …, T − 1; T = D + 1), the sampling variance of the rate of change simplifies to V1 = 12σ2/(T3 − T). For more complex growth functions (e.g., a quadratic function) and f ≠ 1 (e.g., Xt = 0, 2, 4, …, 2T − 2; T = fD + 1), Raudenbush and Liu (2001) showed that the sampling variance of the polynomial slope takes the following form:

$$ {V}_p=\frac{\sigma^2{f}^{2p}\left(T-p-1\right)!}{l_p\left(T+p\right)!}, $$
(31)

where p denotes the polynomial order of the change in the outcome and lp is a constant whose value depends on the way of coding the time variable (sequential, centered, or orthogonal). To model nonlinear relations across time, it is beneficial to use orthogonal polynomials, since this reduces the collinearity that can result from using powers of t as regressors.
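For equally spaced occasions with f = 1, the closed form V1 = 12σ2/(T3 − T) agrees with the direct computation σ2/Σt(Xt − X̄)2, as this small sketch verifies (function names are ours):

```python
def slope_sampling_variance(sigma2, T):
    """V1 = sigma^2 / sum_t (X_t - Xbar)^2 for X_t = 0, 1, ..., T-1."""
    X = list(range(T))
    xbar = sum(X) / T
    return sigma2 / sum((x - xbar) ** 2 for x in X)

def slope_sampling_variance_closed(sigma2, T):
    """Closed form from the text: 12*sigma^2 / (T^3 - T)."""
    return 12 * sigma2 / (T**3 - T)
```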

Alternatively, the variance of any trend of interest (e.g., linear, quadratic, or cubic), regardless of the form assumed to characterize the covariance structure of measurement error, can be more easily obtained from the appropriate diagonal element of

$$ Cov\left({\hat{\mathbf{b}}}_i\right)={\left({\mathbf{Z}}_i^{\prime }{\mathbf{V}}_i^{-1}{\mathbf{Z}}_i\right)}^{-1}, $$
(32)

where Zi is a design matrix that specifies the change of outcome of any subject across the study (i.e., a constant, linear, quadratic, etc., function), \( {\mathbf{V}}_i\left(={\mathbf{Z}}_i{\mathbf{T}}_i{\mathbf{Z}}_i^{\prime }+{\mathbf{R}}_i\right) \) is the covariance matrix of repeated measurements, Ti is the dispersion matrix of Level 2 random effects, and Ri is the covariance structure of Level 1 errors.

Additionally, a quick and easy way to assess the effects that D and f will have on the power using the matrix formulation of the model is to divide the linear trend component of matrix Zi by f, the quadratic trend component by f2, the cubic trend component by f3, and so on. Very often f = 1, but depending on the value of D, other values are possible (e.g., f = 0.5 or f = 2).
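Equation 32 is easy to evaluate numerically. The sketch below assumes a linear growth model with T = 5 equally spaced occasions and illustrative variance components of our choosing; for this case the resulting slope variance coincides with V1 + τ11, the term in parentheses in Eq. 30:

```python
import numpy as np

# Illustrative setup for Eq. 32: linear growth, T = 5, placeholder components.
T = 5
tau_00, tau_01, tau_11, sigma2 = 0.6, 0.05, 0.04, 0.4

X = np.arange(T, dtype=float)            # occasions 0, 1, ..., T-1
Z = np.column_stack([np.ones(T), X])     # intercept and linear trend columns
Tmat = np.array([[tau_00, tau_01],
                 [tau_01, tau_11]])      # Level 2 covariance matrix T
R = sigma2 * np.eye(T)                   # conditional independence: R = sigma2 * I
V = Z @ Tmat @ Z.T + R                   # covariance of the repeated measures

cov_b = np.linalg.inv(Z.T @ np.linalg.inv(V) @ Z)   # Eq. 32
var_slope = cov_b[1, 1]                  # variance of the linear trend estimate
```

With these values, var_slope equals 12σ2/(T3 − T) + τ11 = 0.08, so the matrix route and the scalar formula agree.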

Statistical power analysis

The power to detect a specified treatment difference is defined as the probability of rejecting the null hypothesis of no treatment-by-linear-trend interaction H0 : β11 = 0, given that it is in fact false (β11 ≠ 0). Using Eqs. 28 and 30, this hypothesis can be tested with:

$$ {F}_0=\frac{{\hat{\upbeta}}_{11}^2}{Var\left({\hat{\upbeta}}_{11}\right)}, $$
(33)

where \( Var\left({\hat{\upbeta}}_{11}\right)={\sigma}_{b1}^2/\left({Np}_1{p}_2\right) \), p1 = NC /N, p2 = NE/N, and N = NC + NE. The F0 statistic follows the central F distribution when H0 is true; when H0 is false, it follows the noncentral F distribution with df1 degrees of freedom in the numerator, df2 degrees of freedom in the denominator, and noncentrality parameter λ, which is defined as

$$ \lambda =\frac{Np_1{p}_2{\beta}_{11}^2}{\sigma_{b1}^2}. $$
(34)

This strategy is both feasible and straightforward for studies in which there is good reason to assume that the groups have equal variances. However, as we previously indicated, it is possible that the assumption of Level 1 and/or Level 2 homogeneity of variances will be violated (see the example described in Vallejo, Fernández, Cuesta, & Livacic-Rojas, 2015, for details). Under the most general scenario, the noncentrality parameter is given by

$$ {\lambda}^{\ast }=\frac{N{p}_1{p}_2{\beta}_{11}^2}{\sigma_{b1(C)}^2+{\sigma}_{b1(E)}^2}, $$
(35)

where \( {\upsigma}_{b1(C)}^2={p}_2\left[12{\upsigma}_{(C)}^2/\left({T}^3-T\right)+{\uptau}_{11}^{(C)}\right] \) and \( {\upsigma}_{b1(E)}^2={p}_1\left[12{\upsigma}_{(E)}^2/\left({T}^3-T\right)+{\uptau}_{11}^{(E)}\right]. \)
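A sketch of Eq. 35; when the two groups share the same variance components, it reduces to the homogeneous case of Eq. 34, which provides a useful check. The argument names are ours, not notation from the article:

```python
def lambda_het(N, p1, p2, beta_11, sigma2_C, sigma2_E, tau11_C, tau11_E, T):
    """Eq. 35: noncentrality parameter allowing the Level 1 (sigma^2) and
    Level 2 (tau_11) variance components to differ across groups."""
    sb1_C = p2 * (12 * sigma2_C / (T**3 - T) + tau11_C)
    sb1_E = p1 * (12 * sigma2_E / (T**3 - T) + tau11_E)
    return N * p1 * p2 * beta_11**2 / (sb1_C + sb1_E)
```

As expected, inflating one group's Level 1 variance lowers the noncentrality parameter and hence the power.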

Regardless of the values of f and D and of the number of groups to be compared, and even in the possible presence of heterogeneity, λ can also be computed using a method similar to the one that Shieh (2003) suggested under the multivariate general linear model. Specifically,

$$ \lambda = tr\left[{\left({\mathbf{AVA}}^{\prime}\right)}^{-1}{\left({\mathbf{C}\mathbf{BA}}^{\prime}\right)}^{\prime }{\left({\mathbf{C}\mathbf{M}}^{-1}{\mathbf{C}}^{\prime}\right)}^{-1}\left({\mathbf{C}\mathbf{BA}}^{\prime}\right)\right], $$
(36)

where tr denotes the trace of the matrix [⋅], A = (1NG| − 1NG) and C = (1NG − 1| − ING − 1) are between-subjects contrast matrices of full row rank, 1NG is a column vector of ones, I is an identity matrix, and the symbol | represents the augmented matrix resulting from appending the columns of the matrices. The matrix of expected values across the T measurements, B = [μ(C)0…μ(C)T − 1; μ(E)0…μ(E)T − 1], can easily be obtained from Eq. 5 by fixing β00 = β10 = 0; M is a diagonal matrix whose elements are the numbers of subjects in each group [in our case, M = diag(NC, NE)]; and the V matrix is constructed using Eqs. 6 and 7. If the group variance components are heterogeneous, then V = p2V(C) + p1V(E). The method described for computing λ is limited to the model of Eq. 4; however, nothing prevents it from being extended to other contexts. For example, under the model of Eq. 8, one would proceed in a similar way, but using Eqs. 9–11.

That said, the procedure used here to calculate the power of the statistical test F0 to compare groups in terms of linear rates of change involves the following steps:

  1.

    Define the significance level α and sample sizes of the control and experimental groups—that is, NC and NE. Without loss of generality, we can establish that β00 = β10 = 0 (or, alternatively, β00 = β10 = β20 = 0, in the case of the quadratic growth model).

  2.

    Set the values of the indices ρ1, dL, r1, and k1 (or, alternatively, ρ1, dL, dQ, r1, r2, r12, k1, and k2, in the case of the quadratic growth model), thereby determining the values of the parameters σ2, τ00, τ01, τ11, and β11 (or σ2, τ00, τ01, τ11, τ02, τ12, τ22, β11, and β21, in the case of the quadratic growth model), and calculate the λ parameter defined in Eqs. 34–36.

  3.

    Specify the critical value using the inverse of the central F distribution function, namely:

    $$ {F}_c=\mathrm{FINV}\left(1-\alpha, {df}_1,{df}_2\right). $$
  4.

    Calculate the probability that the F0 ratio exceeds the critical value Fc when H0 is false. Under the alternative hypothesis (H1), the power function associated with the F0 test is given by 1 − β = P[F(df1, df2, λ) > Fc], where F(df1, df2, λ) denotes a noncentral F random variable with degrees of freedom (df1, df2) and noncentrality parameter λ, and β denotes the probability of a Type II error.
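The four steps above condense into a few lines of code. The sketch below (function name is ours) uses SciPy's central F distribution for the critical value and its noncentral F distribution for the power:

```python
from scipy.stats import f, ncf

def power_f_test(alpha, df1, df2, lam):
    """Steps 3-4: critical value from the central F, power from the noncentral F."""
    f_c = f.ppf(1 - alpha, df1, df2)    # F_c = FINV(1 - alpha, df1, df2)
    return ncf.sf(f_c, df1, df2, lam)   # 1 - beta = P[F(df1, df2, lam) > F_c]
```

For df1 = 1 and large df2, λ = (Z1 − (α/2) + Z1 − β)² ≈ 7.85 recovers the familiar 80% power at α = .05, which ties this computation to Eq. 39 below.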

Determination of sample size

There are several approaches to determining the sample size, including Bayesian and frequentist methods that focus on estimation instead of hypothesis testing. However, the most popular approach involves calculating the power of a statistical test, that is, the probability of rejecting H0 when H1 is true.

Required sample size for two groups

Let us assume that we want to determine the sample size needed to detect differences between two groups. The hypothesis H0 : β11 = 0 is rejected if the estimator of β11 exceeds the critical value \( \left({\hat{\beta}}_{11}>c\right) \). In accordance with Amatya, Bhaumik, and Gibbons (2013), this value defines the boundary between the acceptance and rejection regions and is set under the following two conditions:

$$ P\left({\hat{\beta}}_{11}>c=0+{Z}_{1-\left(\alpha /2\right)}\sqrt{{\left({Np}_1{p}_2\right)}^{-1}{\sigma}_{b1}^2}\;|\;{\mathrm{H}}_0\;\mathrm{true}\right)=\alpha, $$
(37)
$$ P\left({\hat{\beta}}_{11}>c={\beta}_{11}-{Z}_{1-\beta}\sqrt{{\left({Np}_1{p}_2\right)}^{-1}{\sigma}_{b1}^2}\;|\;{\mathrm{H}}_1\;\mathrm{true}\right)=1-\beta . $$
(38)

Equating Eqs. 37 and 38, since the critical value c is assumed identical under both statistical hypotheses, and solving for N, we obtain the formula that informs us of the sample size required in order to achieve the desired power (see Appendix 6). Specifically,

$$ N=\frac{{\left({Z}_{1-\left(\alpha /2\right)}+{Z}_{1-\beta}\right)}^2{\sigma}_{b1}^2}{\beta_{11}^2{p}_1{p}_2}, $$
(39)

where Z1 − (α/2) and Z1 − β are the 100(1 − α/2) and 100(1 − β) percentiles of the standard normal distribution for a two-sided test.
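Eq. 39 is straightforward to evaluate directly. The helper below (our own, using SciPy for the normal quantiles) rounds the result up to the next whole subject:

```python
import math
from scipy.stats import norm

def total_n(beta11, sigma2_b1, p1=0.5, p2=0.5, alpha=0.05, power=0.80):
    """Eq. 39: total sample size to detect a slope difference of beta11."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)   # Z_{1-(alpha/2)} + Z_{1-beta}
    return math.ceil(z**2 * sigma2_b1 / (beta11**2 * p1 * p2))
```

With the slope-variance and effect estimates reported in the first empirical example below (σ²b1 = .0223, β11 = .0804, equal group proportions), this returns the total of 109 subjects quoted there.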

Required sample size for multiple groups

Determining the sample size needed to compare the trends of an arbitrary number of groups is a relatively simple procedure, but one that is seldom documented in longitudinal studies. For this purpose, Eq. 39 can be rewritten as

$$ N=\frac{{\left({Z}_{1-\left(\alpha /2\right)}+{Z}_{1-\beta}\right)}^2}{1/\mathrm{tr}\left[{\left({\mathbf{AVA}}^{\prime}\right)}^{-1}{\left({\mathbf{C}\mathbf{BA}}^{\prime}\right)}^{\prime }{\left({\mathbf{C}\mathbf{P}}^{-1}{\mathbf{C}}^{\prime}\right)}^{-1}\left({\mathbf{C}\mathbf{BA}}^{\prime}\right)\right]}, $$
(40)

where P = diag(p1, p2, …, pJ). The remaining terms have been defined previously.

Required sample size for two or more groups with unequal variances

The sample size calculation specified in Eq. 39 assumes homogeneous errors at both Levels 1 and 2. When it is suspected that the variance components may differ depending on the subjects' participation in the training program, the required sample size becomes

$$ {N}^{\ast }=\left(\frac{{\left[{Z}_{1-\left(\alpha /2\right)}+{Z}_{1-\beta}\right]}^2}{\beta_{11}^2{p}_1{p}_2}\right)\left({p}_2{\sigma}_{b1}^{2(C)}+{p}_1{\sigma}_{b1}^{2(E)}\right). $$
(41)

As was the case for the homogeneous model, determining the sample size in models with heterogeneous variances and an arbitrary number of groups also requires modifying Eq. 41. For example, the value of N* needed to detect differences among the trends of three groups can be obtained as

$$ {N}^{\ast }=\frac{{\left({Z}_{1-\left(\alpha /2\right)}+{Z}_{1-\beta}\right)}^2}{1/\mathrm{tr}\left[{\left({\mathbf{A}{\mathbf{V}}^{\ast }{\mathbf{A}}^{\prime}}\right)}^{-1}{\left({\mathbf{C}\mathbf{BA}}^{\prime}\right)}^{\prime }{\left({\mathbf{C}\mathbf{P}}^{-1}{\mathbf{C}}^{\prime}\right)}^{-1}\left({\mathbf{C}\mathbf{BA}}^{\prime}\right)\right]}, $$
(42)

where V* = p1V1 + p2V2 + p3V3. If it is suspected that the treatment groups are unbalanced, then V* = [(p2p3)/p*]V1 + [(p1p3)/p*]V2 + [(p1p2)/p*]V3, with p* = p1p2 + p1p3 + p2p3.
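The two weighting schemes just given can be sketched as follows (a hypothetical helper; Vs holds the three per-group covariance matrices and ps the group proportions):

```python
import numpy as np

def pooled_v(Vs, ps, balanced=True):
    """Pooled V* for three groups, per the weights given in the text."""
    (V1, V2, V3), (p1, p2, p3) = Vs, ps
    if balanced:
        return p1 * V1 + p2 * V2 + p3 * V3
    p_star = p1 * p2 + p1 * p3 + p2 * p3           # p* = p1p2 + p1p3 + p2p3
    return (p2 * p3 * V1 + p1 * p3 * V2 + p1 * p2 * V3) / p_star
```

A useful check on the weights is that the two versions coincide when the three proportions are equal.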

Required sample size for missing data

So far we have focused on how to determine the sample size assuming complete cases. However, dropout (also called attrition) is an inevitable problem in most longitudinal studies. The occurrence of missing values can produce biased estimates and can reduce statistical power, leading to inefficient analyses and invalid conclusions. When the rate of attrition is anticipated, a required sample size may be calculated on the basis of the final number of subjects that are expected to complete the study.

In the case of missing data, the formula described above to calculate the variance in the slopes of the subjects, \( {\sigma}_b^2= Var\left(\hat{b_i}\right) \), may no longer be applicable or may not be realistic (Fitzmaurice et al., 2011). For this reason, we need a solution that mitigates the negative impact exerted by the attrition of the sample on the validity of the inferences and of the conclusions reached.

A reasonable way to model early departure from a study is to divide, element by element, the Vi matrix of Eq. 32 by a matrix L that identifies the missing-data pattern. In this regard, O'Kelly and Ratitch (2014) clarified that in health-related studies it is more common for subjects to leave a study permanently than temporarily. In this situation (attrition, or definitive dropout), the variance of the estimated rate of change can be obtained from the appropriate diagonal element of

$$ Cov\left({b}_i^{\ast}\right)={\left[{\mathbf{Z}}_i^{\prime}\left({\mathbf{V}}_i^{-1}\oslash \mathbf{L}\right){\mathbf{Z}}_i\right]}^{-1}, $$
(43)

where ⊘ denotes the Hadamard (element-wise) division operator.

The choice of the L matrix depends on the dropout model we wish to emphasize. However, if we are interested in modeling the pattern of missingness found most frequently in applied research, the monotone pattern, a reasonable choice of L is one in which each element of the main diagonal gives the proportion of subjects who remain in the study over time (i.e., 1, r, r2, . . . , rT − 1) and the remaining elements equal the assumed survival rate (i.e., r). For the homogeneous model, the suggested procedure provides results similar to those obtained using the method described by Hedeker, Gibbons, and Waternaux (1999).
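A minimal sketch of this adjustment is given below. The function name is ours, and we read the Hadamard step as inflating Vi element-wise by the retention matrix L before inverting, in line with the narrative description of dividing Vi by L; with r = 1 (no attrition), it reduces to the complete-data covariance [Z′V⁻¹Z]⁻¹.

```python
import numpy as np

def attrition_cov(Z, V, r):
    """Attrition-adjusted Cov(b*) with the monotone-dropout L described above."""
    T = V.shape[0]
    L = np.full((T, T), r)                       # off-diagonal: survival rate r
    L[np.diag_indices(T)] = r ** np.arange(T)    # diagonal: 1, r, r^2, ..., r^{T-1}
    W = np.linalg.inv(V / L)                     # Hadamard division of V by L
    return np.linalg.inv(Z.T @ W @ Z)            # slope variance: element [1, 1]
```

The appropriate diagonal element (here [1, 1] for a column of ones followed by the time scores in Z) then plays the role of σ²b1 in the sample size formulas.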

Method

Theoretical and Monte Carlo studies were conducted in order to determine the optimal sample size (N) for a study that ensures adequate statistical power for rejecting the null hypothesis of β11 = 0, as well as the accuracy of the estimates, assuming homogeneous (V2 = V1) or heterogeneous (V2 = 2V1) group variances at each of the levels of the model and missing data due to subjects dropping out after baseline but before completing the study. For this purpose, we proceeded as follows. Initially, using the formulas derived in Eqs. 39 and 41, we carried out a theoretical study to examine the effects of heterogeneity and attrition on determining the appropriate N when the significance level was α = 0.05 and the nominal statistical power was 1 − β = 0.80. Five factors were manipulated and completely crossed in the study, for a total of 108 investigated conditions: reliability of measurement at the first time point (ρ1 = 0.1, 0.5), Level 2 residual correlation (r1 = −0.5, 0, 0.5), number of repeated measurements (T = 4, 8), proportion of imbalance between the group sizes (Δ = 0.5, 0.35, 0.2), and standardized effect size at the last time point (dL = 0.4, 0.5, 0.6). According to Cohen (1988), standardized mean differences of 0.2, 0.5, and 0.8 correspond to small, medium, and large effects, respectively. The ratio between the variances of the outcomes at the end and at the beginning of the study remained constant (k1 = 25) under each of the conditions. Later, a Monte Carlo study was carried out to verify the statistical power achieved with the estimated sample sizes.

Data generation

Datasets were simulated on the basis of the two-level model shown in Eqs. 1–3. At the first level, a continuous outcome was generated as a linear function of time. The intercept and one Level 1 variable were simulated to vary randomly as a function of treatment at the second level. The explanatory variables X and W were each generated as standard normal. Later, we dichotomized the W variable at an arbitrary threshold (i.e., the mean of all observed data). The error terms were generated as independent normal random variables with means of zero and the variances obtained from the values specified above for the manipulated factors. We used SAS version 9.4 (SAS, 2016) for the simulations.

For each of the 108 investigated conditions, 1,000 sets of raw data were generated and analyzed during the simulation process. In our simulation study, two different situations were considered: with no missing data at each of the time points and time-related dropout with cumulative missing data rates of 27% at the fourth occasion and 52% at the eighth occasion. Both with complete and with missing data, the analyses were carried out twice by REML methods using SAS PROC MIXED, once assuming homogeneity and once modeling the variances, in order to investigate the results of incorporating heterogeneity into the models.

Here we focus on sample size determination in the presence of a monotone missing-data pattern generated by a missing-at-random (MAR) mechanism. Under our MAR dropout mechanism, the data point for subject i was missing at time t and at all subsequent times if Uit < Φ[λt + Yi(t − 1)], where Uit is a uniform random variable and Φ is the standard normal cumulative distribution function. The values of λt were chosen to yield time-related dropout rates of 0%, 10%, 19%, and 27% for the four respective occasions, and of 0%, 10%, 19%, 27%, 34%, 41%, 47%, and 52% for the eight respective occasions.
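The dropout mechanism just described can be sketched as follows. The function name and the λt values passed to it are hypothetical; in practice the λt must be calibrated against the marginal distribution of Y to hit target cumulative rates such as the 27% and 52% above.

```python
import numpy as np
from scipy.stats import norm

def apply_dropout(Y, lam_t, rng):
    """Monotone MAR dropout: Y[i, t] and all later values become missing
    when U_it < Phi(lam_t[t] + Y_i(t-1)); lam_t[0] is unused (no baseline dropout)."""
    Y = Y.astype(float).copy()
    n, T = Y.shape
    for i in range(n):
        for t in range(1, T):
            if rng.uniform() < norm.cdf(lam_t[t] + Y[i, t - 1]):
                Y[i, t:] = np.nan          # dropout is permanent (monotone)
                break
    return Y
```

Because missingness depends only on the previously observed response, the mechanism is MAR rather than NMAR, matching the assumption discussed in the Limitations section.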

Evaluation criteria

To determine the accuracy and precision of the strategies being compared (i.e., sample size calculations using the derived formulas based on OLS estimates, and simulations based on REML estimates), we examined their performance in terms of the following quantities:

  1.

    Relative bias. To determine whether a parameter tends to be over- or underestimated, we used the relative bias index. If the parameter of interest was φ = (1 − β), the percentage relative bias was \( 100\times \left[\left(E\left(\hat{\phi}\right)-\phi \right)/\phi \right] \), where \( E\left(\hat{\phi}\right) \) was computed as the average parameter estimate across valid replications. We were unable to find any formal criteria in the literature for when relative bias is too large, so in this article a relative bias of less than 10% was considered acceptable.

  2.

    Approximate 95% coverage rates. This refers to the number of times that the absolute difference between the theoretical and empirical power across the examined conditions falls outside approximately two standard errors (SEs). The SEs for the empirical estimates of power were computed as \( \sqrt{pq/m} \), where p is the theoretical power, q equals 1 − p, and m is the number of simulations carried out in the numerical experiment.
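Both criteria are one-line computations; the sketch below (names are ours) makes them concrete:

```python
import math

def relative_bias_pct(mean_estimate, true_value):
    """Percentage relative bias: 100 * (E(phi_hat) - phi) / phi."""
    return 100.0 * (mean_estimate - true_value) / true_value

def empirical_power_se(p, m):
    """SE of an empirical power estimate over m replications: sqrt(p*q/m)."""
    return math.sqrt(p * (1.0 - p) / m)
```

For example, with p = .80 and m = 1,000 replications the SE is about .0126, so a two-SE band around .80 spans roughly .775 to .825, the interval quoted in the Results.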

Results

Tables 1, 2, 3, and 4 show the sample sizes obtained by the proposed method to achieve theoretical power of at least 80% and the simulation-based empirical power estimates. Table 1 gives the results for complete data with homogeneous variances, Table 2 gives the results for complete data with heterogeneous variances, Table 3 gives the results for incomplete data with homogeneous variances, and Table 4 gives the results for incomplete data with heterogeneous variances. Hereafter, these are known as Scenarios A, B, C, and D, respectively.

Table 1 Sample sizes to obtain theoretical power of at least 80% and the empirical power, with complete data and homogeneous Level 1 and 2 variances
Table 2 Sample sizes to obtain theoretical power of at least 80% and the empirical power, with complete data and heterogeneous Level 1 and 2 variances
Table 3 Sample sizes to obtain theoretical power of at least 80% and the empirical power, with incomplete data and homogeneous Level 1 and 2 variances
Table 4 Sample sizes to obtain theoretical power of at least 80% and the empirical power, with incomplete data and heterogeneous Level 1 and 2 variances

As can be seen in Table 1, the sample size needed to achieve 80% power with a two-sided Type I error rate of 5% decreases substantially with small increases in the effect size at the last time point (dL), whereas the influences of the number of repeated measurements (T), the Level 2 residual correlation (r1), and the reliability of measurement at the first time point (ρ1) are less obvious. Although the effects of T, r1, and ρ1 on statistical power are relatively small, larger values of these factors have a positive effect on power. Table 1 also shows that the sample size increases with an increasing degree of imbalance between the group sizes. In fact, high levels of imbalance (i.e., Δ = .2) cause a notable increase in the sample size needed to maintain a statistical power of 80%. A similar tendency is observed for the same conditions under the remaining scenarios (i.e., B, C, and D).

Table 2 presents the results for complete data in the presence of heterogeneity of variances (Scenario B). When the sample size estimates of Table 1 are compared to those of Table 2, we find that the mere presence of a small degree of heterogeneity in the Level 1 and 2 random effects (V2 = 2V1) leads to a noticeable increase in the sample size necessary to achieve at least 80% power, even when the group sizes are equal. Table 3 lists the sample sizes necessary to reach the preset value of power when the assumption of the homogeneity of the Level 1 and 2 variances is satisfied but attrition is present (Scenario C). As we stated previously, in this study we assumed a dropout rate of 10% at each successive time point in each group. Compared to the case of equal variances and complete data (Scenario A), dropout rates of 10% over time require the sample size to increase by 20%–25% in order to reach a similar power. Finally, the sample sizes required to accommodate the dropout rate in the presence of heterogeneity of variances (Scenario D) are given in Table 4. All the results displayed in this table agree qualitatively with the previous findings; however, as one would expect, a larger sample is required under this scenario to reach the same level of power.

Table 5 shows the percentages of relative bias by ρ1, dL, and T, collapsed across Level 2 residual correlations (r1). The results yielded negligible levels of bias (on average, between ±0.05% and ±1.5% of the true population parameter) in the vast majority of the 108 conditions examined. The bias of the predicted theoretical power was always less than 1%, regardless of the investigated conditions, whereas the mean relative bias for the empirical estimates of power remained under 3.6% in all cells and exceeded 3% in only five cases. In fact, there were no statistically significant differences in bias for the power estimates in any of the simulated conditions.

Table 5 Percentages of relative bias for predicted theoretical and empirical powers

The empirical estimates of power can also be compared to the theoretical values stated in Tables 1, 2, 3 and 4. The highest absolute difference was .024 among the 108 conditions displayed in Table 1, .026 among the 108 conditions displayed in Table 2, .039 among the 108 conditions displayed in Table 3, and .038 among the 108 conditions displayed in Table 4. Under Scenarios A and B, the discrepancies between theoretical prediction and empirical results are negligible, since 99% of the power estimates fall within two standard deviation limits (i.e., between .775 and .825). On the other hand, our results also indicate that, for Scenarios C and D, about 85% of power estimates fall within the confidence intervals when T = 4, while only 5% of absolute differences were beyond two standard deviations when T = 8. Therefore, the derived formulas allow the user to rigorously determine the sample size required to yield a certain power for both complete and incomplete data, both assuming homogeneity and when incorporating heterogeneity into the multilevel model.

Empirical illustration using two real longitudinal data examples

To illustrate how the derived formulas for sample size calculation can be used to ensure that a study has adequate power to detect statistical significance under different models and conditions (e.g., linear and quadratic growth, homogeneous and heterogeneous variances, or complete and missing data), we rely on the data of two longitudinal studies carried out by Núñez, Rosário, Vallejo, and González-Pienda (2013) and Rosário et al. (2017). In the first study, a linear change model was a reasonable assumption, whereas in the second study a quadratic model provided a more suitable representation of the shape of change. Consistent with common practice in empirical applications of growth curve models, the Level 1 predictors (i.e., Time and/or Time2) are assumed to be free of measurement error; if errors do exist, they would generally attenuate the estimates of the regression coefficients relative to their population values.

The first example (Núñez et al., 2013) examined the effectiveness of a school-based mentoring program on students' self-regulated learning strategies. In this study, program effects were tested in 94 sixth-grade students randomly assigned to two experimental conditions and evaluated at the beginning of the study and after 3, 6, and 9 months. Thus, if we measure the passage of time quarterly, this design involves f = 1 (one observation per unit of time), D = 3 (the study lasts three quarters), and T = fD + 1 (four measurement occasions).

After reanalyzing the data of Núñez et al. (2013) using SAS PROC MIXED, without assuming that the groups' average responses were equal at baseline, the following estimates were obtained: \( {\hat{\tau}}_{00}=.0708 \), \( {\hat{\tau}}_{01}=.0048 \), \( {\hat{\tau}}_{11}=.0050 \), \( {\hat{\sigma}}^2=0.865 \), \( {\hat{\beta}}_{01}=.1169 \), and \( {\hat{\beta}}_{11}=.0804 \). Here, time was treated as a continuous variable centered on its overall mean, rather than as a classification variable, as in the original study. Substituting these estimates into Eqs. 13–16 yields estimates of the reliability of measurement at the first time point \( \left({\hat{\rho}}_1=.45\right) \), the standardized effect size at the last time point \( \left({\hat{d}}_L=.75\right) \), the proportion of variance of the outcomes between the first and last time points \( \left({\hat{k}}_1=1.47\right) \), and the slope-intercept correlation \( \left({\hat{r}}_1=.25\right) \). In turn, using Eqs. 30 and 34, the variance of the slope \( \left({\hat{\sigma}}_{b1}^2=.0223\right) \) and the noncentrality parameter \( \left(\hat{\lambda}=6.81\right) \) are estimated. Inspection of a table of the noncentral F distribution (see, e.g., Ato & Vallejo, 2015) at the .05 significance level with \( \hat{\lambda}=6.81 \) and (1, 280) degrees of freedom yields a power of \( \hat{\varphi}\cong .74 \). Standard software (e.g., SAS PROC IML) can also be used to estimate this value. Next, we removed 28 data points to yield approximate dropout rates of 0%, 5%, 9%, and 13% for the four time points. In this particular application, the variance of the slope, \( {\hat{\sigma}}_{b1}^2 \), was .0246 and the noncentrality parameter, \( \hat{\lambda} \), was 6.18. Using these results and tables of the noncentral F distribution, the power is found to be approximately .70. The corresponding estimates of \( {\sigma}_{b1}^2 \), λ, and φ with heterogeneous errors (ratio 1:3) were .0446, 3.4, and .45, respectively.

Given that a power below the often-mentioned benchmark of .80 (Cohen, 1988) was obtained in all three cases described, it was necessary to determine the new sample size that would have allowed us to replicate the differences between treatment conditions, with respect to their average linear growth rates, under each of the situations described. From Eq. 39, with Z1 − (α/2) = 1.96 and Z1 − β = .84, we see that the total sample sizes needed to achieve 80% power at a 5% significance level were 109, 120, and 217, respectively. So far, we have only considered power for comparing groups on linear rates of change; yet the rate of change can also be nonlinear.
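The three totals just reported can be reproduced from Eq. 39 with the estimates given above. The sketch below uses SciPy's normal quantiles rather than the rounded 1.96 and .84, which yields the same results after rounding up:

```python
import math
from scipy.stats import norm

z = norm.ppf(0.975) + norm.ppf(0.80)        # Z_{1-(alpha/2)} + Z_{1-beta}
beta11, p1, p2 = 0.0804, 0.5, 0.5           # slope difference and group proportions
for s2_b1 in (0.0223, 0.0246, 0.0446):      # complete, missing, heterogeneous cases
    n = math.ceil(z**2 * s2_b1 / (beta11**2 * p1 * p2))
    print(n)                                # prints 109, 120, 217 in turn
```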

Next we considered data from the longitudinal randomized design conducted by Rosário et al. (2017) with 182 fourth-grade students, which examined whether students' writing quality differed when they wrote journals on a weekly basis, as compared with a control group. In the study, the subjects were measured at baseline and weekly for up to 12 weeks. With regard to the quality of the writing compositions, Rosário et al. found that providing extra writing opportunities (i.e., writing journals) had a statistically significant impact on the instantaneous rate of change at a specific moment and on the curvature. Suppose that our interest lay in replicating the difference in the average acceleration rates between the two groups. We will first check whether there is sufficient statistical power to detect the described effects.

As in the previous example, we briefly considered three cases: a complete set of data with homogeneous errors, an incomplete set of data with homogeneous errors, and a complete set of data with heterogeneous errors. After analyzing the data using PROC MIXED, the following estimates were obtained: \( {\hat{\tau}}_{00}=45.0677 \), \( {\hat{\tau}}_{01}=1.0519 \), \( {\hat{\tau}}_{11}=.3254 \), \( {\hat{\tau}}_{02}=.2867 \), \( {\hat{\tau}}_{12}=.0081 \), \( {\hat{\tau}}_{22}=.0081 \), \( {\hat{\sigma}}^2=21.1842 \), \( {\hat{\beta}}_{11}=.2238 \), and \( {\hat{\beta}}_{21}=-.0446 \). Substituting these estimates into Eqs. 13, 20–23, 32, and 36, the indices and parameter estimates can be calculated as \( {\hat{\rho}}_1=.6802 \), \( {\hat{d}}_Q=-.3106 \), \( {\hat{k}}_1=1.3262 \), \( {\hat{k}}_2=2.1824 \), \( {\hat{r}}_1=-.2747 \), \( {\hat{r}}_2=-.4756 \), \( {\hat{r}}_{12}=-.1574 \), \( {\hat{\sigma}}_{b2}^2=.0186 \), and \( \hat{\lambda}=4.8533 \). Inspection of noncentral F tables at the .05 significance level with \( \hat{\lambda}=4.8533 \) and (1, 2178) degrees of freedom yields a power of \( \hat{\varphi}\cong .60 \). Removing 594 data points from the original study according to a monotone dropout pattern, representing 5% dropout, we obtained \( {\hat{\sigma}}_{b2}^2=.0257 \), \( \hat{\lambda}=3.5096 \), and \( \hat{\varphi}\cong .47 \). In the presence of heterogeneity of variances (ratio 1:3), however, we obtained \( {\hat{\sigma}}_{b2}^2=.0341 \), \( \hat{\lambda}=2.4267 \), and \( \hat{\varphi}=.34 \). According to the convention suggested by Cohen (1988), an unsatisfactory level of statistical power was obtained in all three cases. Thus, it was necessary to calculate the sample size that would have allowed us to replicate the differences between treatment conditions, with respect to their average acceleration rates, under each of the situations described. From Eq. 39, with Z1 − (α/2) = 1.96 and Z1 − β = .84, we established that the total sample sizes needed to ensure adequate power were 295, 408, and 589, respectively.

Although we have omitted the original data due to limitations of space, the databases for the two examples are available from the first author upon request, and Appendix 7 provides the SAS codes used to perform the sample size and power calculations for Examples 1 and 2.

Discussion and conclusion

Sample size calculations to provide specified power levels were performed in four different scenarios, each involving 108 treatment combinations, through the use of mathematical formulas and numerical simulations. Our results indicate that both the analytic and the empirical methods provide virtually identical estimates of power across all examined conditions. The empirical estimates were below the theoretical estimates in 124 of the 432 cells of the design (28.7%), but the differences were practically insignificant. As we mentioned above, the mean relative bias for the empirical estimates of power remained under 3.6% in all cells, and, with few exceptions, the estimates of power fell inside the boundaries of a 95% confidence interval for the theoretical values, suggesting that the trend described above is due to chance. Consistent with the results of Heo et al. (2013), the data indicate that the derived power formulas are well validated by the simulation studies, which show that the values of the theoretical power are very close to those of the empirical power.

In Scenario A, in which complete data across time and homogeneous variances were available, our results revealed that the effect size and a large degree of imbalance between the group sizes had decisive impacts on the sample size determination. For instance, when the groups had markedly different sizes (i.e., one group was four times the size of the other), the sample size needed to increase by approximately 50% in order to achieve the same power as in the balanced case; likewise, achieving with an effect size of .40 a power comparable to that for an effect size of .60 required an increase in sample size of close to 100%. Therefore, careful attention should be paid to the choice among possible population effect sizes and to unequal randomization when planning a study. A conservative approach would be to consider the most plausible effect sizes and choose the smallest among them. On the other hand, the effects of the correlation of the Level 2 residuals and of the reliability of measurement at the first time point were not trivial, but their consequences were much less severe. These results match, to a large degree, the numerical results reported by Usami (2014) using a method proposed by Satorra and Saris (1985) in the context of structural equation modeling.

In the remaining scenarios, our two main findings can be summarized as follows. Firstly, in the presence of heterogeneity in the Level 1 and 2 random effects, larger sample sizes are required in order to obtain the desired nominal power, even for complete and balanced data. One important caveat is that the results were only obtained by the proposed method under positively paired conditions. A positive pairing implies that the treatment condition that has the smallest number of subjects is associated with the smallest variance, whereas the opposite occurs for a negative pairing. Unfortunately, with an unbalanced design similar to that employed in our work (Livacic-Rojas, Vallejo, Fernández, & Tuero, 2017; Vallejo et al., 2008), the tendency to be conservative is worse under negatively paired conditions. The second finding is that, when there is attrition, sample size requirements can be quite large. As one can easily imagine, however, it is not clear what is sufficiently large with regard to sample size in order to make valid inferences about the parameter of interest. In many cases an increase of 5% or 10% may be sufficient, but depending on the expected rate of attrition, the appropriate percentage could vary. In the present study we observed that with dropout rates of 10% at every time point (e.g., a condition with eight time points would retain approximately 50% of the original sample at the last time point), the sample size would be required to increase by 20%–25% in order to reach a power that was equivalent to the case of complete data. In any case, when attrition is anticipated, the formulas we derived allow the power to be calculated on the basis of the final number of subjects that are expected to complete the study.

Although the numerical results may change slightly depending on the statistical package and the number of iterations or the algorithm used to estimate the parameters, the simulations presented in this article strongly suggest that on the whole the empirical power based on REML estimates is in fairly good agreement with the theoretical power based on OLS estimates. However, it has also become clear from the present study that, with complex statistical models, sample size estimation using simulations may be needed. One reason why the Monte Carlo power method may be preferred over a theoretical method in some cases is because of its great flexibility to be applied to almost any kind of data, regardless of whether all the model assumptions are satisfied, the type of covariates present, and the attrition rate expected. In fact, the sample size calculation through simulation can easily be extended to more complex linear mixed models or generalized linear mixed models, both univariate and multivariate.

Recommendations

As we noted earlier, when performing a prospective power analysis and no information is available regarding the growth model parameters, researchers may explicitly specify parameters by indirectly setting four types of indices (ρ1, k1, dL, r1) for a linear trend. In some cases, this is a reasonable approximation, but in other cases it may become a tricky task. Hence, a range of values often need to be considered.

  1.

    Reliability (ρ1) depends on what measure is being used. The reader should note, however, that questionnaire measures, which represent one of the most important tools available for data collection in the educational and social sciences, appear to have relatively low reliability. Hence, reliabilities in the .4–.7 range would provide a reasonable starting point when planning research.

  2.

    Empirical studies have indicated that, under most situations likely to be encountered by behavioral science researchers, the ratio between the variances of the outcomes at the end and at the beginning of a study (k1) could be more than five times smaller than the value we have examined (cf. Hertzog, Lindenberger, Ghisletta, & von Oertzen, 2008). Thus, the sample size requirements will be less demanding than those shown in the tables.

  3.

    The average effect size (dL) found in published meta-analyses in psychology is around dL = 0.50 (see Bakker, van Dijk, & Wicherts, 2012). An effect size in the range of 0.4–0.6 is regarded as typical. We have not been able to find any guidelines on how to select these effect sizes for a quadratic growth model. Although this issue is an open question and should be investigated, provisionally we have assumed an effect size of one-half of a standard deviation unit for the rate of acceleration (i.e., dQ = 0.50).

  4.

    Although the correlation between the starting point and the rate of change over time (r1) is not known precisely, different authors (Hertzog et al., 2008; Hox, 2010) have suggested that it is unlikely that this correlation would take values close to zero in a given population. Hence, correlations in the .25–.50 range are reasonable choices when planning a longitudinal study.

Finally, for completeness, three caveats are included. First, it should be clear that the sample size required to detect an intervention effect is study-specific. Second, although longitudinal studies often involve small samples, it is very important to emphasize that large sample sizes make small effect sizes detectable. Therefore, researchers interested in carrying out studies with sufficient power to reject the null hypothesis should avoid using small sample sizes whenever possible. This is especially the case when they are unable to specify a minimum effect size that would have either practical or theoretical significance. Third, it should be noted that the reliabilities studied (i.e., .1 and .5) are on the low side. Since unreliability reduces statistical power, more positive results should be obtained with higher initial reliabilities. If reliability were improved to .80, for example, the potential reduction in sample size would be approximately 20%. Hence, researchers should make an effort to reduce the effects of measurement error.

Limitations of this study

In our simulation study we saw that the theoretical power values based on the sample size formulas derived using the OLS estimates were nearly identical to the empirical power based on the ML estimates, even with a combination of heterogeneous variances and missing data. However, readers should note that the generalization of our results is limited to situations in which the mechanism for missing data is MAR. When missing data due to attrition are driven by an MAR mechanism, the standard likelihood-based method provides valid inferences about differences in growth rates between groups. In contrast, when the missing data are not MAR (NMAR), the likelihood-based method yields erroneous inferences (failure to control the Type I error rate and distorted power). Thus, caution should be exercised if the missingness is thought to be NMAR. To improve the validity of estimates, it is recommended that researchers determine why data are missing and build models that include covariates that may be predictive of dropping out.

An additional limitation of our study is that the results and recommendations are based on assuming normality for the continuous outcome variable. The effect of nonnormality on the power would not be of much consequence in the case of near-normal populations. However, the presence of a fair degree of skewness and/or kurtosis, as is not uncommon in educational and psychological studies (see, e.g., Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013; Cain, Zhang, & Yuan, 2017; Micceri, 1989), would lead to a more conservative alpha level and, thus, to more demanding sample size requirements.

Finally, for computational simplicity, we assumed that the model included only one categorical predictor (e.g., the program studied). However, it is possible to increase precision in the estimation of treatment effects if effective covariates are used in the design. In fact, continuous variables are sometimes included in longitudinal studies as predictors or baseline covariates. In general, as long as the covariates are independent of the group assignments and do not modify the group effects, making an adjustment for baseline response will increase statistical power, because it can be expected that the adjustment will reduce the between- and within-subjects variability.
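The gain from covariate adjustment can be sketched with a textbook approximation: if a baseline covariate correlates rho with the outcome, adjustment shrinks the residual standard deviation by sqrt(1 - rho²), which inflates the effective standardized effect size and hence the power. The two-sample z-test power formula below is a simplified illustration (it ignores the lower rejection region and degrees-of-freedom corrections), not the models used in this study:

```python
import math
from statistics import NormalDist

def power_two_group(d, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with
    standardized effect size d and n subjects per group."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = d * math.sqrt(n / 2)  # noncentrality of the test statistic
    return 1 - NormalDist().cdf(z_a - ncp)

# Adjusting for a covariate with correlation rho to the outcome reduces
# the residual SD by sqrt(1 - rho**2), so the effective d grows.
d, n, rho = 0.50, 40, 0.6
d_adj = d / math.sqrt(1 - rho ** 2)
print(round(power_two_group(d, n), 2))      # unadjusted power
print(round(power_two_group(d_adj, n), 2))  # covariate-adjusted power
```

With the illustrative values above (rho = .6), adjustment raises power noticeably at the same sample size, consistent with the point that effective baseline covariates reduce residual variability.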

Author note

We are grateful to the Editor, Associate Editor Wei Wu, and reviewers for their constructive suggestions on a draft of this manuscript.

This work has been funded by the Spanish Ministry of Science and Innovation (Ref.: PSI-2015-67630-P) and the Chilean National Fund for Scientific and Technological Development (FONDECYT, Ref.: 1170642).