A tutorial on using the paired t test for power calculations in repeated measures ANOVA with interactions

The a priori calculation of statistical power has become common practice in behavioral and social sciences to calculate the necessary sample size for detecting an expected effect size with a certain probability (i.e., power). In multi-factorial repeated measures ANOVA, these calculations can sometimes be cumbersome, especially for higher-order interactions. For designs that only involve factors with two levels each, the paired t test can be used for power calculations, but some pitfalls need to be avoided. In this tutorial, we provide practical advice on how to express main and interaction effects in repeated measures ANOVA as single difference variables. In particular, we demonstrate how to calculate the effect size Cohen’s d of this difference variable either based on means, variances, and covariances of conditions or by transforming \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\eta _{p}^{2}}$$\end{document}ηp2 or \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\omega _{p}^{2}}$$\end{document}ωp2 from the ANOVA framework into d. With the effect size correctly specified, we then show how to use the t test for sample size considerations by means of an empirical example. The relevant R code is provided in an online repository for all example calculations covered in this article.

Issues of power analyses have received even more attention in the realm of the often-discussed "replication crisis", that is, the observation that many published results cannot be replicated (e.g., Open Science Collaboration, 2015). Among other problems with underpowered studies (summarized in, e.g., Brysbaert, 2019;Fraley & Vazire, 2014), the probability of replicating a result increases with the power of the original study (Ioannidis, 2005). Some funding agencies and journals explicitly require authors to include power considerations in their submissions, related questions are posed by reviewers, and it was even suggested to base quality judgments of journals (among other criteria) on the (mean) power of the studies published within them (Fraley & Vazire, 2014).
Yet, power analyses come with some obstacles once going beyond situations where two groups or conditions can be compared via t tests, and this seems in particular to be true for within-subject designs, where participants provide data for more than one condition (and often on multiple trials per condition, as is typical for experiments in cognitive This work was supported by the German Research Foundation under Grants MA 7702/1-2 (awarded to Axel Mayer) and JA 2307/6-1 (awarded to Markus Janczyk). psychology). 1 Of course, there exists a wide range of articles and books on power analysis (e.g., Cohen, 1988;Keselman et al., 1998;Maxwell, Kelley, & Rausch, 2008;Olejnik & Algina, 2000;Perugini, Gallucci, & Costantini, 2018;Steiger, 2004), as well as software packages like G*Power (Faul, Erdfelder, Lang, & Buchner, 2007), Superpower (Lakens & Caldwell, 2021), or pwr (Champely et al., 2020). Still, the correct specification of arguments is often unclear and this leaves researchers unsure whether their calculations are actually valid or not.
In this tutorial, we focus on the power analysis for a certain use case that often occurs in experimental cognitive psychology: Calculating the power for interactions or main effects in a repeated-measures analysis of variance (RM-ANOVA), where each involved factor has only two levels. Despite the narrow focus, such designs are very common, for instance, in the field of cognitive control from where we also draw the example introduced below. Although our main motivation for writing this tutorial aims at power calculation for interactions, power calculation for main effects follows a highly related logic. This tutorial thus provides practical advice, software code, and formula to perform the necessary calculations for both main and interaction effects.

The focus of this tutorial
According to our experience in methodological consulting, there exists uncertainty about whether power calculations (for interactions) in within-subject designs can be performed with standard software packages such as G*Power (Faul et al., 2007) or only achieved via simulations. For the special case of only two levels on each factor (i.e., a 2 × 2 × 2 … design), the main and interaction effects can be conceived of as differences or "differences of differences". These effects thus boil down to a simple difference variable for which the power analysis can then be done in the framework of a paired-samples t test. The advantage of this approach is clear: Power calculations for the t test are relatively straightforward and solely require the specification of an expected effect size in terms of Cohen's d, instead of 2 p that is often used in the context of RM-ANOVA. Although it has been described in numerous standard texts how main and interaction effects can be expressed as differences (of differences; e.g., Aiken & West, 1991;Cohen, Cohen, West, & Aiken, 2013;Judd, McClelland, & Ryan, 2017;Maxwell & Delaney, 2004), we find that researchers rarely make use of this fact and the paired-samples t test for power calculations. This tutorial provides step-by-step instructions on how to express main and interaction effects as a difference (of differences) variable and how the effect size of such variables can be used to perform a power analysis.
Researchers can generally select one of three strategies to derive effect sizes. First, they can formulate a specific expectation about the means and (co)variances of the dependent variables in a (multi-)factorial experimental design. Such an expectation can be informed by expertise or by a re-analysis of (multiple) data sets, for instance, in case one considers replicating an experiment or extending previous observations in a follow-up study. Second, they can have knowledge about previously reported effect sizes, for instance, because of an available meta-analysis or a review of the relevant literature. Third, they can rely on conventions prevalent in a certain field of research. Most prominently for psychology are the suggestions of Cohen (1988). In particular, Cohen proposed that the effect-size measures d = 0.2, d = 0.5, and d = 0.8 can be considered as small, medium, and large, respectively, and identical labels were assigned to the effect-size measures 2 p = .01 , 2 p = .06 , and 2 p = .14 in the context of ANOVAs. It is important to note -and this will be elaborated on below -that it is not correct (and not even close to correct) to just use the semantic labels and conduct a power analysis for a t test with a large effect size d = 0.8 to compute the power for a large interaction effect with 2 p = .14 . In addition, although Cohen's conventions are frequently used, they were meant to be a "last resort" strategy in case there is limited knowledge about the expected data pattern or effect size (see Correll, Mellinger, McClelland, & Judd, 2020, for a recent elaboration on this issue). Interestingly, meta-analyses have shown that typical effect sizes in psychological research are around d = 0.4, thus smaller as a medium effect according to Cohen's labels (Camerer et al., 2018;Open Science Collaboration, 2015; see also Brysbaert, 2019). Furthermore, a study that replicated numerous original studies has reported even smaller effect sizes in the order of d = 0.15 (Klein et al., 2018). These findings align with the work by Schäfer and Schwarz (2019), who also found that effects from replications of studies were considerably smaller as compared to the original studies (see Janczyk et al., 2022, as a an example). Thus, we urge researchers to make an educated decision regarding the expected effect size when performing power analyses.
Strategies 1 and 2 are based on estimates from previous studies, which then serve to formulate effects at the population level. For Strategy 3, an effect size is directly 1 3 specified at the population level. Importantly, Strategies 2 and 3 require converting 2 p /̂ 2 p into the effect measure d/d for a power analysis in the t test framework. 2 Of course, the various effect size measures can be converted into each other (as also implied by the conventions of Cohen), but this is (a) not trivial and (b) bears the potential of using the wrong transformation formula (i.e., to confuse the transformation for within-and between-subject designs). Below we demonstrate how to determine d for the three strategies and point out common pitfalls. We will also include the effect size measure ̂ 2 p , which is less common, but often recommended due to its smaller bias (e.g., Carroll & Nordholm, 1975;Keselman, 1975). 3 The tutorial is structured as follows: (1) In the upcoming section, we introduce an example for a 2 × 2 × 2 within-subject design that we use throughout this article to demonstrate calculations. We do not provide an extensive review on power analyses for every possible design, but focus on interactions and main effects in within-subject designs with two-level factors.
(2) Afterward, we provide a step-by-step tutorial on how to express main and interaction effects in different designs as a single difference variable based on the dependent variables in a multi-factorial, repeated measures experimental design (Strategy 1). We show how to calculate the mean, variance, and effect size Cohen's d for the difference variable using the means, variances, and covariances of the dependent variables. We explain how the effect size of the difference variable can then be used to perform a power analysis for the main and interaction effects. (3) In the section thereafter, we describe how to correctly convert 2 p /̂2 p (or ̂ 2 p ) to Cohen's d/d in order to use the t test for power analyses (Strategies 2 and 3). We further describe challenges and pitfalls in the calculation and highlight that general rules of thumbs should be avoided (e.g., a medium 2 p = .06 does not correspond to a medium Cohen's d = 0.50 in case of within-subject designs). Along with the present tutorial, we provide software code in an online repository (https:// osf. io/ 87j5m/) and the R package powerANOVA (Langenberg, 2022) that comes with a graphical user interface and implements the calculations covered in this tutorial. The user interface is easy to use and will not be explained in this tutorial. Installation instructions can be found on the corresponding GitHub page (https:// github. com/ lange nberg/ power ANOVA).

A motivating example
Conflict tasks are often used in cognitive psychology to investigate how human performance is affected by taskirrelevant stimuli or stimulus features. For example, in the Eriksen flanker task (Eriksen & Eriksen, 1974), participants respond to a centrally presented target stimulus (e.g., the identity of a letter S vs. H) with a left or right key press. The critical manipulation is that the target is surrounded by other letters, the flankers, that either signal the same response on congruent trials (e.g., SSSSS) or the other response on incongruent trials (e.g., HHSHH). Response times (RTs) are typically longer (and often error rates are higher) in the incongruent compared to congruent condition -the congruency effect (CE). Another example is the Simon task (Simon & Rudell, 1967; for a review, see, Hommel, 2011). Participants have to respond, for example, to the identity of a letter, but this target stimulus is presented at a left or right location. If the (task-irrelevant) location is the same as the required response, this would be a congruent trial (e.g., the letter H requires a left response and the stimulus is presented on the left side). However, if the (task-irrelevant) location is different than the required response, this would be an incongruent trial (e.g., the letter H requires a left response and the stimulus is presented on the right side). Here, a CE is observed as well.
The size of the CE depends on several factors. One of the most often investigated factors is recent trial history, starting with work by Gratton, Coles, and Donchin (1992). In this line of research, the congruency of the preceding trial n − 1 is considered in addition to the congruency of the current trial n. The typical observation is that the CE is larger if trial n − 1 was congruent, compared to when it was incongruent (see the left panel of Fig. 1 for an illustration). This observation is known as the congruency sequence effect (CSE) and has been replicated many times (e.g., Praamstra, Kleine, & Schnitzler, 1999;Schmidt & Weissman, 2014;Stürmer, Leuthold, Soetens, Schröter, & Sommer, 2002;Wühr, 2004; for a review, see, Egner, 2007). The standard analysis approach would now be a 2 × 2 RM-ANOVA with trial n congruency and trial n − 1 congruency as repeated measures, with a particular interest on the two-way interaction.
In principle, another independent variable could of course be added. Janczyk and Leuthold (2018), for example, signaled N = 36 participants on each trial whether they were to respond manually or with their feet. Of interest was whether the effector system repeated or switched from trial n − 1 to trial n (see Fig. 1). In this case, a 2 × 2 × 2 RM-ANOVA with trial n congruency, trial n − 1 congruency, and effector system repetition (vs. switch) as repeated measures would be required. 2 In the following, "hats" will indicate estimators from a sample and symbols without a "hat" will indicate population parameters. 3 ̂ 2 p and ̂ 2 p are not to be confused with (generalized) ̂ 2 G and ̂ 2 G . The latter two measures have been proposed to make effect sizes comparable across studies with different designs (e.g., Fleiss, 1969;Olejnik & Algina, 2003). These examples are typical experiments from cognitive psychology. The design often includes two (or three or even more) repeated measures factors with two levels each. In the following sections, we use data from Experiment 2 (using a Simon task) of Janczyk and Leuthold (2018). This experiment followed a 2 × 2 × 2 design with the three factors congruency in trial n (factor A: A1 = incongruent, A2 = congruent), congruency in trial n − 1 (factor B: B1 = incongruent, B2 = congruent), and effector system (factor C: C1 = repetition, C2 = switch), but we show how the calculations generalize to even more complex designs (i.e., 2 × 2 × 2 × … ). The means of each of the cells and the covariances can be found in Table 1.

Strategy 1: Means and covariances approach
In this section, we show how to express main and interaction effects in a multi-factorial, repeated-measures design as difference variables and how power calculations relate Trial n incongruent congruent Fig. 1 Illustration of the motivating example: a 2 × 2 × 2-interaction reflecting the difference in the congruency sequence effect (CSE). Mean response times (RT) in milliseconds (ms) are depicted as a function of congruency in trial n − 1, congruency in trial n, and effector system repetition (vs. switch). When the effector system was repeated, mean RTs were 492 (SD = 82) and 483 ms (SD = 85, incongruent vs. congruent) when the previous trial was incongruent and 511 (SD = 75) and 444 ms (SD = 88) when the previous trial was congruent. When the effector system was switched, mean RTs were 564 (SD = 107) and 533 ms (SD = 101) when the previous trial was incongruent and 566 (SD = 102) and 521 ms (SD = 107) when the previous trial was congruent. Error bars are confidence intervals based on the t tests comparing the congruent and incongruent conditions of trial n to the mean and the variance of these difference variables.
Using the mean and variance, the effect size of the difference variable can be expressed in terms of Cohen's d, which can, in turn, be used for sample size and power calculations. With the following subsections, the complexity of the considered effects gradually increases, starting with a simple comparison of two means and ending with the interaction effects in a 2 × 2 × 2 design. Each subsection consists of two steps: (1) The main and interaction effects are expressed as differences between conditions and the mean and variance of this difference variable is calculated.
(2) Based on these values, the effect size measure Cohen's d is calculated and plugged into a procedure for performing the sample size and power analysis using R. We provide detailed R code for all of the examples covered in this tutorial in the accompanying online repository. For a more comprehensive introduction on RM-ANOVA and contrast coding, we would like to refer the reader to standard texts, such as Aiken and West (1991), Cohen et al. (2013), Judd et al. (2017), and Maxwell and Delaney (2004).
In the following subsections, we use the data from Experiment 2 of Janczyk and Leuthold (2018). The left part of Table 1 provides the means of each experimental condition. The right part of the table provides the variances and covariances of the original data. The values are organized as a matrix (thus a covariance matrix) of the pairwise covariances between the dependent variables. For instance, the covariance between the RT when trial n was congruent, trial n − 1 was congruent and the effector was repeated and RT when trial n was congruent, trial n − 1 was congruent and the effector switched was 7058 (last row, second last column). Covariances between experimental conditions are important, because they affect the power of hypothesis tests (as will be shown below). In fact, correlations between dependent variables is the key difference between repeated measures ANOVA and between-subject ANOVA (e.g., Liesefeld & Janczyk, 2022). The function in R provides one way of calculating the covariance matrix of a data set. The file in the online repository provides example code to calculate this matrix for the original (already preprocessed) data set. In what follows, we use the means, variances, and covariances of the study by Janczyk and Leuthold (2018), with the simplifying assumptions of (1) an equal variance for all conditions σ 2 = 9000 (i.e., approximately the mean variance across the variables in the study, see Table 1) and (2) a covariance between the variables of σ cov = 7200 and thus a correlation of ρ = 0.8 (i.e., approximately the mean correlation and covariance from the original data). The assumption of equal variances and covariances is also referred to as compound symmetry. The provided R code in this article and the online repository will also use equal variances and covariances. However, the code is very generic and can easily be altered to allow any arbitrary covariance matrix.
We want to highlight that the means and covariances (and thus effect sizes) calculated in this section are based on sample estimates from the study by Janczyk and Leuthold (2018). For power calculations, we have to assume that we know the population parameters and we here do so to illustrate the calculations. In practice, we should not rely on an estimate from a single study, but we should rather collect multiple estimates (e.g., from the literature) to obtain a clearer picture. Sometimes, however, there may not be more information available. In this case, we need to use the few information available. Additionally, one should be aware that Cohen's d overestimates the true effect size. By equating estimators with the true population parameters, we might thus slightly underestimate the required sample size to achieve a desired level of power. Bias corrections have been developed by, for instance, Hedges (1981) and can be used as an alternative (see also Goulet-Pelletier & Cousineau, 2018).
Lastly, we exclusively focus on a significance level of α = .05, as this convention is most often used in the field of psychology. We would like, however, to point out that this convention has been criticized and researchers have advocated lower significance levels, such as α = .005 (Benjamin et al., 2017;Miller & Ulrich, 2019). Lakens et al. (2018) further proposed not to use a default significance level at all and that researchers should make a sound decision based on the individual study. A similar argument applies to the specification of the desired power, which we set to 1−β = .8 throughout this article. We encourage researchers to think about and justify their choices of the significance level and the power in their studies.

Comparing two means
Step 1: Calculating the mean and the variance. We will start with a very simple example on how to calculate the required sample size when comparing two means and how this can be expressed in terms of the mean and the variance of the difference between both conditions. Imagine we want to perform a simple test to determine whether RTs in the motivating example differ when trial n was incongruent, trial n − 1 was incongruent as well, and the effector system was repeated (A1B1C1) versus when trial n was congruent, trial n − 1 trial was incongruent, and the effector system was repeated (A2B1C1). The difference variable can then be defined as where X is the RT difference, and Y A1B1C1 and Y A2B1C1 are the RTs in the corresponding conditions. The mean of Y A1B1C1 is μ A1B1C1 = 492 and the mean of Y A2B1C1 is μ A2B1C1 = 483 (see Table 1). The variance in both conditions is σ 2 = 9000 and the covariance is σ cov = 7200. The mean and variance of the difference variable X are then: Step 2: Performing the power analysis. It is now possible to calculate the effect-size measure Cohen's d from the mean and the variance, which can then be used for sample size and power calculations: The size of the effect is thus even less than small, following the conventions of Cohen (1988).
This effect size is calculated from a sample and probably does not match the effect size in the population. For the following analysis, however, we assume that this is the population effect size. We can then use this effect size and the t test to calculate the required sample size to achieve a power of 1−β = .8. In particular, we use a two-sided t test with α = .05 and obtain that we would need a sample size of at least N = 351 participants to detect an effect of d X = 0.15 if this was indeed a true (but rather small) effect.
Many software packages offer the possibility to calculate the required sample size for a t test and so does R (R Core Team, 2021) with the function . The following command can be used to perform the above calculations: The argument takes the effect size d, is the desired power level, is the desired significance level, indicates if the t test is one-or two-sided, and is the type of the t test. (1) As a side note, the function also provides an argument , which can be used if is the mean of the difference variable before dividing by the standard deviation. Hence, the above command is equivalent to the following command:

× design: Main effects
Step 1: Calculating the mean and the variance. In the previous example, we used RTs from only two conditions. For the next example, we increase the complexity of the difference and consider a subset of the factors, that is, we use the conditions where the effector system was repeated. This leaves us with a 2 × 2 design consisting of the factors A and B.
Although not as obvious as in the previous example, we can still use the t test for power analysis in this case.
Assume we want to investigate the main effect of trial n (factor A). The main effect of factor A can be expressed as the sum of all conditions where trial n trial was incongruent (A1) minus the sum of all conditions where trial n was congruent (A2). Thus, the difference variable can be defined as and the value of the difference variable in the example is then: Calculating the variance is slightly more difficult. Yet, the assumption that the variances and covariances are equal across conditions simplifies the formula. With k denoting the number of factors in the design (i.e., k = 2 in the present case), the variance of the difference variable can be calculated as: The formula for the variance becomes more complicated when we want to use different variances for and covariances between conditions. We provide R scripts in the online repository that can be used as a template and the user only has to insert the correct values in the covariance matrix. The following chunk of R code shows how the mean, the covariance matrix, and a contrast vector are specified in order to calculate the mean and the variance of the difference variable as defined above: In Line 3, the means of the dependent variables are defined, which are equivalent to the means from Eq. 7. In Line 7, the covariance matrix of the dependent variables is defined, which conforms to the compound symmetry assumption. We can easily change the variances and covariances to any other value if we do not want to assume compound symmetry (with the constraint that the covariance matrix must be positive definite). Line 15 defines the contrast vector based on Eq. 6. The first two elements are 1, because the means μ A1B1 and μ A1B2 enter with a positive sign into that equation. The third and fourth elements are -1, because the means μ A2B1 and μ A2B2 enter with a negative sign. In Line 18, the mean of the difference variable is calculated by multiplying the contrast vector with the mean vector (i.e., the cross-product) and the result is printed in Line 20. In Line 23, the variance of the difference variable is calculated by pre-and post-multiplying the covariance matrix with the contrast vector and the result is printed in Line 25.
Step 2: Performing the power analysis. We again use the mean and variance of the difference variable to calculate Cohen's d: Assuming that the effect size is the population effect size, we can calculate the sample size required to achieve a power of 1−β = .8. We find that we would need a sample size of at least N = 12 to detect the effect with a probability of 1−β = .8. The R command for this analysis is very similar as for the previous example. Only the argument must be replaced by d X A = 0.9 (in case is set to its default value of 1).

× design: Interaction effect
Step 1: Calculating the mean and the variance. The interaction effect in a 2 × 2 design can be expressed in terms of a difference variable as well. In this case, the difference is, in fact, a difference of differences. It is the RT difference of the differences where trial n − 1 was incongruent (B1) and where trial n − 1 was congruent (B2) between the two levels of trial n (A1 minus A2): For clarification, X B | A1 indicates the RT difference between the condition where trial n − 1 trial was incongruent and the condition where trial n − 1 was congruent while trial n was incongruent. The mean of the difference variable for our example is: The variance is again a bit more difficult. However, under compound symmetry, it turns out that the same formula as for the main effect can be used here (without compound symmetry, one would have to specify the exact covariance matrix, e.g., by adapting the R code of the supplemental material in the online repository): Step 2: Performing the power analysis. The effect size Cohen's d is then calculated as and the result can be considered a medium to large effect according to Cohen (1988).
Assuming that this value is the population effect size, the power calculation yields a required sample size of N = 19 to observe an effect of this magnitude with a power of 1−β = .8.
(10) Step 1: Calculating the mean and the variance. We now use the full 2 × 2 × 2 design of the example. Again, a main effect in this design with three factors with two levels each can be expressed as a single difference variable. Assume we want to investigate the main effect of trial n (factor A) as before, but we use the full design now. The procedure is just as for a 2 × 2 design. We compare the sum of all conditions where trial n is incongruent (A1) against the sum of all conditions where trial n is congruent (A2). We thus define the difference variable as: The mean of the difference variable for our example can then be calculated as and the variance can be calculated following Eq. 8 with k = 3: In fact, the formula can be used for any number of factors as long as (1) the variances and covariances are equal and (2) the factors have only two levels each.
Step 2: Performing the power analysis. Using the mean and variance of the difference variable, we can calculate Cohen's d as which is even larger than large following the conventions of Cohen (1988).
Using the effect size as the population effect size, we calculate that we would need a sample size of N = 8 to achieve a power of 1−β = .8. This number is very low, due to the large effect size.

× × design: Two-way interactions
Step 1: Calculating the mean and the variance. The required calculations for the two-way interaction of, for example, factor (14) A and B in the 2 × 2 × 2 design are very much the same as for the simpler 2 × 2 design. Only the number of involved variables is larger. The interaction effect is the difference of the differences between the incongruent (B1) and the congruent (B2) condition in trial n − 1 (factor B) between the incongruent (A1) and the congruent (A2) condition in trial n (factor A) while summing across factor C. Expressed formally, this yields and the expected value of the difference of differences variable for the example can be calculated as: For calculating the variance, we use the same formula as before. It does not matter whether we are dealing with a main effect or an interaction effect -the formula is the same under compound symmetry. Only the number of involved factors matters: Step 2: Performing the power analysis. With the mean and the variance, Cohen's d is calculated as before: Assuming this is the true effect size, we would need a sample size of N = 24 to achieve a power of 1−β = .8.

× × design: Three-way interactions
Step 1: Calculating the mean and the variance. Finally, we consider how to express a three-way interaction in terms of a difference. In fact, this interaction is a difference of differences of differences. In other words, the three-way interaction expresses whether the interaction B × C is different for the two levels of factor A. This difference variable can be written as: (18) We see that we calculate the difference between C1 and C2 in the innermost parentheses. We then calculate the difference of this difference between the two levels of B1 and B2. Finally, we calculate the differences of this difference between the two levels A1 and A2. The value of this difference for our example can be calculated as and we use the very same formula as before to calculate the variance for this three-way interaction: Step 2: Performing the power analysis. Using these values, Cohen's d for the three-way interaction is calculated as and we find that a sample size of N = 61 is needed to find an effect of this magnitude with a power of 1−β = .8.

Excursus: The role of the correlation and the order of the interaction
Correlation among conditions. We can also express the variance of the difference variable in terms of the variances of the dependent variables and their correlation ρ (instead of their covariance): This equation directly shows how the variance of the difference variable depends on the correlation between the dependent variables. In particular, the variance of the difference variable becomes smaller if the correlation is larger (i.e., 1 − ρ will decrease when ρ increases) and vice versa. This is especially important when performing a power analysis, because the effect size used for the power analysis depends on the variance of the difference variable. Looking back to the previous example in the Section "2 × 2 design: Interaction effect", the variance of the difference variable was 2 X A:B = 7200 and the corresponding effect size was d X A:B ≈ −0.68 . Recall that the correlation among the dependent variables is ρ = 0.8. However, if the correlation were only ρ = 0.2, the variance would be four times as large, that is, 2 The required sample size to achieve a power of 1−β = .8 for this case would dramatically increase from N = 19 to N = 70.

Order of effects. Another interesting fact when considering
Eq. 26 is that the variance of the difference variable increases (and thus the effect size and power decrease) with the size of the design. The variance for the main and interaction effects in a 2 × 2 design is (with values taken from our example) and is for a 2 × 2 × 2 design. The consequence is that the effect size decreases by the order of √ 2 , which in turn has implications for statistical power and sample size calculations.
Implicit correlation between dependent variables. Finally, we would like to raise awareness about a possible mistake that researchers might commit. In particular, researchers might use the variance of the dependent variables' variance σ 2 when calculating the variance of the difference variable, thereby ignoring the correlation between the dependent variables. The reason for this could be that researchers do not have information about the covariance of the variables or they do not know how to perform the calculations properly. However, when calculating Cohen's d of a difference variable, we must not assume that the variance of the difference variable is equal to the variance of the dependent variables. Instead, we have to calculate the variance based on the variances and covariances of the dependent variables. Failing to do so can have a dramatic impact on power calculations. This is because we implicitly make an assumption about the correlation when equating both variances: That is, if we assume that 2 X (the variance of the difference variable) and σ 2 (the variance of the dependent variables) are equal when comparing two means, we implicitly assume a correlation of, e.g., ρ = 0.5 if k = 1. The correlation even increases for more complex designs. For difference variables of main and interaction effects in a 2 × 2 design, the correlation is ρ = 0.75, and for a 2 × 2 × 2 design, the correlation is ρ = 0.875. This might have a huge impact on sample size calculations, because -as we have seen earlier -the power is also a function of the correlation. As a consequence, the required sample size may be underestimated (or overestimated) if the true correlation is in fact lower (or higher) than the implied correlation. For instance, consider the comparison of two means from the beginning of this section. The effect size was d = 0.15 and N = 351 were needed to detect this effect with a probability of 1−β = .8. Without taking into account the covariance between the RTs in the two conditions, we would assume the effect is d = 9 √ 9000 ≈ 0.09 , thus requiring a sample size of N = 875. The problem also occurs with higherorder interactions. In the 2 × 2 × 2 example right before this excursus, the effect size was d = − 0.37 and we would require N = 61 subjects to detect the effect with a probability of 1−β = .8. If we neglected the covariance, we would assume the effect is d = −44 √ 9000 ≈ −0.46 , thus requiring a sample size of N = 39.

Strategy 2 and 3: Effect size approach
The critical aspect when conducting power analyses via the t test is the correct specification of Cohen's d. In the previous section, we have described a strategy that requires specifying the exact mean and covariance structure or knowing the correct d at a population level. In the present section, we will consider an "effect size approach": A researcher might have an idea about the effect size of an interaction or a main effect (for an overview and a review of common effect size measures, see e.g., Bakeman, 2005;Carroll & Nordholm, 1975;Cohen, 1973;Keselman et al., 1998;Lakens, 2013;Levine & Hullett, 2002;Olejnik & Algina, 2000, 2003Richardson, 2011;Steiger, 2004), and is now confronted with transforming these values to d. Two cases can be distinguished. First, it could be that researchers have knowledge about an observed effect size (e.g., from a previous experiment or from a meta-analysis). In this case, the observed ̂ 2 p or ̂ 2 p value needs to be transformed into d . Second, one might formulate the expectations on 2 p at the population level (e.g., as a minimum effect size of interest), and then transform this value to d. 4 Although both transformations are very similar, the calculations differ slightly. In the former, the transformation is done at the level of observed values, while in the latter the transformation is done at the level of population parameters. If sample sizes are large, both lead to approximately equal results though.
We begin by introducing the conversion on both levels for the simple case of comparing only two means of the whole design (similar to what we have done in Section "Comparing two means" of the previous section). Although it might not be very common to express an effect size in terms of 2 p in this case, this is of course possible and we can also use an F-test to compare the two means (i.e., a one-way ANOVA with one factor that has two levels). Importantly, the relation holds for main and interaction effects in multi-factorial repeated measures designs, which will be considered thereafter. The section will be finished by considering a tempting, but wrong, approach based on the semantic labels of effect sizes as suggested by Cohen (1988).

Comparing two means
We first consider the sample level. In this case, the relation between ̂ 2 p /̂ 2 p and Cohen's d for the within-subject case is (for more details on this, see Appendix A), where N indicates the sample size. For the example used in Section "Comparing two means" of the previous section, the effect size of that comparison can also be expressed as ̂ 2 p = 0.023 or ̂ 2 p = 0. 5 Using Eq. 32, Cohen's d is matching exactly the previously calculated value. This conversion could be done with a simple R function as well: Calling this function as yields d = 0.15. Thus, having obtained some typical effect sizes in terms of ̂ 2 p from, for example, a literature review, we can transform those values to Cohen's d , and use the result for sample size calculations and a power analysis using the t test. This can be done in the exact same way as described in Step 2 in the subsections of "Strategy 1: Means and covariances approach": A single sample estimate for the effect size may be used when we only have limited knowledge about the true effect size, for instance, when there is just a single study at hand. Ideally, we should pool multiple estimates, of course. Anyhow, we should be aware that ̂ 2 p , ̂ 2 p , and Cohen's d overestimate the true effect size and we thus likely underestimate the sample size required to achieve a desired level of power (e.g., Mordkoff, 2019;Goulet-Pelletier & Cousineau, 2018).
In some instances, we might "know" the effect size at the population level to perform a power analysis (e.g., by considering a minimum effect size of interest). Then, 2 p and 2 p are identical, and the transformation to Cohen's d slightly changes to (for more details on this, see Appendix B): Note that this transformation is no longer dependent on the sample size, because it is based on the population effect size. If we "know" the true population effect size 2 p , we should use Eq. 35 and perform the power analysis on its result. A simple R function to perform the transformation could be: Calling the function with yields d = 0.153. This value slightly differs from the previous transformation, because we assume that 2 p = .023 matches the true effect size at the population level (i.e., we ignore the bias of the estimator).

Main and interaction effects
The transformations shown in the previous section also hold true for main effects and interactions as long as the effect of interest can be expressed in terms of a single difference variable (i.e., the test has one numerator degree of freedom).
In the previous section, we have already stated that all main and interaction effects in 2 × 2 ×... × 2 designs can indeed be expressed in terms of a single difference variable.
As an example, we use the three-way interaction based on Experiment 2 by Janczyk and Leuthold (2018), similar to what we have done in Section "2 × 2 × 2 design: Threeway interactions" of the previous section. For this effect, the effect size is ̂ 2 p = .118 . Applying Eq. 32, we obtain which again matches the value from Section "2 × 2 × 2 design: Three-way interactions" of the previous section. In the next step, we can then use this effect size and the t test to estimate the required sample size and perform a power analysis, just as it was done in the previous section. Table 2 provides an excerpt of sample sizes required for achieving a power level of 1−β = .8 for different values of 2 p and 2 p , respectively. It shall provide a quick way for researchers to conduct a priori power considerations for a main or interaction effect of interest.
It is also noteworthy that this transformation holds for designs with factors with more than two levels -as long as the effect of interest consists of a subset of factors that only have two levels. Consider, for example, a 3 × 2 × 2 design (i.e., factor A has three levels, B has two levels, C has two levels). The main effects of factor B and C and the interaction effect B:C have only one numerator degree of freedom or, stated differently, the involved factors have only two levels each. For such effects, the relations outlined above hold true as well (for a tutorial on contrast coding in complex multi-factorial designs, see Schad, Vasishth, Hohenstein, & Kliegl, 2020).

Converting effect sizes via Cohen's semantic labels
Against the background from the previous sections, we finally consider a possible mistake that an incautious researcher may commit when converting 2 p to d: Wrongly applying the formulas for within-subject versus betweensubject cases (see also Brysbaert, 2019).
Remember that, according to the suggestions of Cohen (1988), d = 0.2 is considered a small, d = 0.5 a medium, and d = 0.8 a large effect, while 2 p = .01 is considered small, 2 p = .06 medium, and 2 p = .14 large. It is well known that small, medium, and large effect sizes indeed correspond to each other in the between-subject case with two groups, as this conversion was derived from Cohen's f (see also Appendix C, and Cohen, 1988): It is tempting to calculate the power for a "small", "medium", or "large" interaction in terms of 2 p by simply applying the semantic labels and choose the corresponding convention in terms of Cohen's d. However, as should have become clear from this section (see in particular Eq. 35), a conversion according to Eq. 37 is not valid in the withinsubject case, where the relation between d and 2 p is instead (see Appendix D for the proof): Consequently, a researcher would have to divide the expected d by two, and as a result, the required sample sizes often differ considerably. For instance, a researcher may expect a (large) effect size of 2 p = .14 according to a recent meta-analysis in the field. Using the (wrongly assumed) corresponding large value of d = 0.8, results in a required sample size of N = 15 to achieve a power of at least 1−β = .8. The formula above, however, suggests to use d = 0.4 instead, which yields a sample size of N = 52.

Discussion
Running well-powered experiments helps increasing the reproducibility of psychological research (see Ioannidis, 2005;Fraley & Vazire, 2014). Calculating the power, however, can be complicated as soon as one goes beyond the designs of simple t tests. Because of this, it is useful to conceive main and interaction effects of factors as "differences of differences", which eventually can be treated in the framework of a t test.
In this tutorial article, we focused on exactly this situation and discussed how to use the t test to perform power analyses in multi-factorial repeated measures designs that involve factors with two levels only (i.e., hypothesis tests with one numerator degree of freedom). Two cases can be distinguished. In the first case, means, variances, and covariances of the dependent variable are available for each experimental condition and combinations thereof. In the second case, previously reported or expected effect sizes are available in terms of 2 p /̂ 2 p or ̂ 2 p . In either case, a translation into d or d is desired, and we demonstrated how to do so. To aid with the required calculations, we provide example R code in an online repository. Moreover, we developed an R package called powerANOVA (Langenberg, 2022) for this article that comes with a graphical user interface and implements the presented procedures.

Key message
The good news is that functions calculating the power in the framework of t tests can indeed be used to calculate the power in 2 × 2 ×... within-subject designs. Yet, some measures of precaution are required to correctly perform the calculations.
Based on the means of the dependent variables, it is easy to establish the correct numerator of d when conceiving the effect of interest as a difference variable (e.g., as a "differences of differences" in case of interactions). The critical part is how to calculate the correct standard deviation for the denominator. Importantly, the correct value is not simply the (averaged) standard deviation within the conditions. Rather its correct calculation includes all variances within and the correlations/covariances between the conditions. Nowadays, most scientific publications present means and some form of variability in figures and/or tables. However, converting standard errors back into variances is not straightforward when they are, for example, based on the error term from an ANOVA (Loftus & Masson, 1994). Furthermore, correlations/covariances of the dependent measures between the conditions are almost never reported. Thus, researchers Table 2 Required sample size to achieve a statistical power of 1−β = .8 given α = .05 and the effect size 2 p or 2 p in RM-ANOVA for effects with one numerator degree of freedom Row names indicate the first decimal place of 2 p ∕ 2 p , column names indicate the second decimal place. For instance, N = 26 subjects are required to achieve a power of 1−β = .8 for an effect size of 2 p ∕ 2 p = .25 (third row, sixth column) likely have to set their (co-)variances/correlations based on a reasonable expectation or by calculating them from a raw data set. For this reason, we consider it helpful if researchers also report correlations/covariances (and perhaps even variances).
With this information at hand, power analyses for interactions in 2 × 2 ×... × 2 designs can be performed straightforward. Power analysis can also be performed based on typical effect sizes ̂ 2 p , 2 p , or ̂ 2 p for the experimental context in which their study is embedded. These effect sizes may be derived from previous research, a meta-analysis, or simply be assumed. In this case, a relationship between those measures and Cohen's d or d exists. Yet, it would be wrong to simply equate them based on their semantic meaning as proposed by Cohen (1988) (i.e., as small, medium, or large effects). More precisely, if, for instance, a value 2 p = .14 is assumed for the 2 × 2 interaction (thus a "large" effect), the correct value entered into functions calculating the power for t tests is not d = 0.8, but rather half of it (see Eq. 35).

Limitations and further directions
The clearest limitation of this article is its scope on withinsubject designs with factors of two levels each. Importantly, different factorial designs that include the same experimental manipulation may produce different effect size estimates if they include additional manipulations or covariates. For instance, imagine an experiment that investigates the CE in a Simon task and additionally includes the covariate age. In this case, the experiment may be able to explain a larger amount of variance as another experiment that uses the exact same design, but does not account for age. As a result, partial effect size estimates can differ. To resolve this issue, generalized effect size estimators have been developed, such as generalized η 2 ( 2 G ). The present article does not cover generalized effect size estimators, but integrating our considerations in the context of generalized effect size estimators could be an interesting research question.
Power analysis is also a topic in multilevel models (MLM; also referred to as hierarchical models or random effects models; Fitzmaurice, Laird, & Ware, 2011;Laird & Ware, 1982), which have become increasingly popular in experimental cognitive psychology. Typically, replications in each condition are averaged within participants in order to be able to use RM-ANOVA (i.e., a single value per participant and condition). MLM models are able to include multiple replications per participant and to account for heterogeneity in main and interaction effects. This can ultimately increase statistical power as all available information can be used. There are a number of articles available that cover power analysis in multilevel models. For instance, Brysbaert and Stevens (2018) and Kumle, Võ, and Draschkow (2021) wrote two helpful tutorials (see also, Arend & Schäfer, 2019;DeBruine & Barr, 2021;Lafit et al., 2021).
Replications across experimental conditions can also be incorporated using latent variable models. For instance, in a Simon task, replications in the congruent condition in trial n can be used as indicators to measure the RT in that condition more reliably. This way, measurement error can explicitly be modeled and main and interaction effects can then be tested on the latent variable level. Langenberg, Helm, and Mayer (2022) showed how to test main and interaction effects in RM-ANOVA using latent variables. It could also be an interesting research question how the calculations in this tutorial article generalize to latent variable models.

Conclusions
In summary, this tutorial article aimed at deepening the understanding of main and interaction effects and how to express them in terms of difference variables, in an attempt to clarify whether or not the framework of a t test can be used to calculate power for interaction effects in within-subject designs. The required calculations to do so are covered in an extensive online repository.
Appendix A: From ̂ 2 p and ̂ 2 p to Cohen's d Within RM-ANOVA, hypothesis tests are usually based on the sums of squares. The sums of squares are a measure for the part of the variance that can be attributed to the involved factors and interactions, and the part that can be attributed to error associated with the factors and interactions. The F-statistic is the ratio of mean sums of squares (Nesselroade & Cattell, 1988), that is, where MS effect is the mean sum of squares from the main or interaction effect of interest (e.g., an interaction between factor A and factor B), and MS error is the mean error sum of squares. Both are defined as the respective sums of squares divided by the corresponding degrees of freedom df 1 (effect) and df 2 (error): The effect sizes 2 p and 2 p (e.g., Bakeman, 2005;Carroll & Nordholm, 1975;Cohen, 1973;Keselman et al., 1998;Lakens, 2013;Levine & Hullett, 2002;Olejnik & Algina, 2000, 2003Richardson, 2011;Steiger, 2004) are estimated as: The equations slightly differ as both estimators are biased to a different extent. ̂ 2 p is known to be less biased, ̂ 2 p is, however, used more often in practice (Okada, 2013;Richardson, 2011). For study designs that only involve factors with two levels each, the main and interaction effects have one numerator degree of freedom (i.e., df 1 = 1). In this case, a simple relationship between F and t holds, as the empirical F-statistic is identical to the square of the empirical t-statistic: Using these definitions and a relation between t and d , it is possible to translate the RM-ANOVA effect size measures into Cohen's d . The reasoning is that, from Eqs. 39, 42, and 43, it can be seen that both the F-statistic as well as the effect size measures are a function of the sums of squares. By rearranging, the effect size measures can thus be expressed as a function of the F-statistic (Friedman, 1968;Kennedy, 1970), and with Eq. 44 they can be expressed as a function of the t-statistic (see also, Mordkoff, 2019): In case df 1 = 1, df 2 equals N − 1 and thus the denominators of Eqs. 45 and 46 are equivalent. However, the numerators slightly differ. We can now use the relation and plug it into Eqs. 45 and 46. Solving for d yields: Although the equations differ, the calculations will lead to the same Cohen's d .
designs, that is, f 2 = 2 p 1− 2 p . However, the relation between f and Cohen's d is different. 2 effect is the squared mean of the difference variable and σ 2 is the variance of the difference variable. For instance, for a one-way design with two levels, 2 effect is given by: and σ 2 is given by: where 2 1 and 2 2 are the variances within each condition, and σ 1,2 is the covariance between conditions. Consequently, f can be expressed in terms of d: It follows that, in the within-subject case, a small effect accounting for 1% of the variance yields d = Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.