What works to advance student learning? There is a growing social and political call to answer this fundamental question on the basis of sound evidence (Slavin, 2020). This has put a spotlight on randomized trials (RTs), which allow for causal inferences about the actual effects of educational interventions (Whitehurst, 2012). Individually randomized trials (IRTs), which randomly assign individual students to experimental conditions, are imperative for conceiving and testing deliberate programs (e.g., Kelly et al., 2013). Validating a program’s benefit in real-life schooling then requires upscaling and implementation in ecologically valid settings (Campbell, 1957; e.g., Gersten et al., 2015). Nowadays, over half of educational RTs are cluster-randomized trials (CRTs), which randomly allocate groups of students, such as whole schools (Connolly et al., 2018). CRT designs not only reflect the fact that educational interventions ought to reach a broader student body and/or by definition operate at the group level (Bloom, 2005), but also map the natural nesting of students within classrooms and schools in institutional contexts (Konstantopoulos, 2012).

Irrespective of whether individual or intact groups of students form the unit of randomization, one constitutive feature of a methodologically high-quality RT is an adequate design sensitivity (Lipsey, 1990), meaning sufficient statistical power 1−β to detect a treatment effect at significance level α with a high level of statistical precision (e.g., Dong & Maynard, 2013; Hedges & Rhoads, 2010; Raudenbush et al., 2007). This poses a key challenge—not exclusively, but especially—when planning CRTs: their inherent multilevel data structure often dramatically restricts power and precision, and CRTs thus often require large sample sizes (Schochet, 2008). For instance, Stallasch et al. (2021, p. 193) show that a CRT requires 4080 students (68 schools, each with 3 classrooms of 20 students) to detect an effect of d = 0.25 on fourth graders’ mathematics achievement (α = 0.05, 1−β = 0.80). An IRT, in stark contrast, requires only 504 students to detect the same effect. In other words, everything else being equal, CRTs are much more resource-intensive than IRTs.

A promising technique to raise sensitivity in RT designs without inflating the sample size is to statistically control for pre-treatment covariates (e.g., Bloom et al., 2007; Kahan et al., 2014; Porter & Raudenbush, 1987; Raudenbush, 1997; Raudenbush et al., 2007).Footnote 1 In the example above, a mathematics pretest that explained 40%/35%/76% of the variance between students/classrooms/schools could reduce the CRT’s sample size requirements by almost two thirds to 1440 students (24 schools; Stallasch et al., 2021, p. 193). This scenario underscores that “well-chosen covariates do wonders for power” (Aberson, 2019, p. 135); yet, the effective value of a covariate is dictated by its prognostic performance. Scholars and agencies hence stress the importance of grounding the ideally preregistered decisions about covariate inclusion in a priori theoretical and empirical considerations that are tied to the specific research field in question (e.g., European Medicines Agency [EMA], 2015; Maxwell et al., 2017; Murray, 1998). Meanwhile, firm guidance on covariate choice is scarce (Pocock et al., 2002; Tafti & Shmueli, 2020), often not going beyond general recommendations for correlational thresholds (e.g., Bausell & Li, 2002, pp. 114–115; Cox & McCullagh, 1982; but see Bloom et al., 2007).

The overall aim of this two-part study is to build thorough empirical guidance on covariate selection to optimize design sensitivity in IRTs and CRTs on student achievement. To this end, we analyze single- and multilevel design parameters for a broad array of outcomes in grades 1–12 by capitalizing on large-scale assessment data from six German samples. Specifically, we quantify impacts of varying (a) covariate types (pretests in the outcome domain, a different domain, and fluid intelligence, as well as sociodemographic measures), their (b) combinations, and (c) time lags to the outcome (1–7 years), alongside the three psychometric heuristics of bandwidth-fidelity (Cronbach & Gleser, 1957), incremental validity (Sechrest, 1963), and validity degradation (Ghiselli, 1956; Humphreys, 1960). Our paper is organized as follows. The Introduction contains a quantitative research review in which we meta-analytically integrate the respective previous empirical evidence. In Part I, we estimate and meta-analytically integrate design parameters, and demonstrate their use in sample size and power computations. In Part II, we use the estimated design parameters in precision simulations to assess the actual covariate returns for the design sensitivity in RTs. This study is accompanied by an extensive OSF repository at https://osf.io/nhx4w. In addition to the full R code, it includes OSM A-G, with (A) expressions of single- and multilevel models; (B) methodology and results related to the quantitative research review; (C) methodology, further results, and manifold application scenarios of study planning related to Part I; (D) methodology and further results related to Part II, as well as interactive Excel workbooks compiling all (E) empirical, (F) meta-analytic, and (G) simulation results.

Statistical Underpinnings

Sufficient design sensitivity is a vital methodological quality criterion of rigorous research (American Psychological Association, 2020; Wilkinson & Task Force on Statistical Inference, 1999). It includes both statistical power and statistical precision (Zhang et al., 2023). Any RT should have an appropriate probability (commonly 80%, i.e., 1−β = 0.80; Cohen, 1988) to detect a true treatment effect.Footnote 2 The precision of an RT can be quantified by its minimum detectable effect size (MDES; Bloom, 1995, 2005), which depicts the smallest true standardized effect size that can be detected as statistically significant (at α) with power 1−β, given the sample size. Thus, a small MDES indicates high design sensitivity. The approximate MDES can be written as (Bloom, 2005, pp. 158–160; Dong & Maynard, 2013, pp. 31–32):

$$\text{MDES}={M}_{df}SE\left({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\right)/{\upsigma }_{\text{T}}$$
(1)

Mdf reflects the t-distributions specific to α and 1−β, with df degrees of freedom. For a two-tailed test, Mdf = tα/2 + t1−β, which converges to 2.8 when df ≥ 20, given α = 0.05 and 1−β = 0.80 (Bloom, 2006). The term \(SE\left({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\right)/{\upsigma }_{\text{T}}\) represents the standard error of the treatment effect \({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\), standardized by the (pooled) standard deviation σT of the achievement outcome Y in the total student population, with TG and CG referring to the treatment and control group, respectively. For instance, MDES = 0.25 means that a standardized treatment effect of at least one quarter of a student-level SD in the applied achievement test would be significant under sufficient power (Bloom et al., 2007).Footnote 3

As we show below, \(SE\left({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\right)/{\upsigma }_{\text{T}}\) is a function of three factorsFootnote 4: (a) the sample size, (b) the allocation of the sample to the experimental conditions, and (c) so-called (multilevel) design parameters that quantify the unconditional (i.e., unadjusted) and conditional (i.e., covariate-adjusted) variance (components) in Y. IRT and CRT designs differ in their assumptions about the (in)dependence of the underlying student sample, a distinction with important implications for the MDES.

A single-level IRT randomizes individual students, so that students are sampled independently of each other (i.e., regardless of e.g., school affiliation). Eq. (1) then transforms to (Bloom, 2006, Eq. 14; Dong & Maynard, 2013, p. 45):

$${\text{MDES}}_{\text{IRT}}={M}_{df}\sqrt{\frac{1-{R}_{\text{T}}^{2}}{{P}_{\text{T}}\left(1-{P}_{\text{T}}\right)N}}$$
(2)

N is the total number of students (i.e., the sum of students n in TG and CG; N = nTG + nCG). Everything else being equal, the larger N, the smaller the MDES. PT denotes the proportion of students assigned to TG (i.e., PT = nTG / N), where PT = 0.50 (i.e., a balanced design; nTG = nCG) minimizes the MDES. The design parameter \({R}_{\textrm{T}}^2\) is of special interest in this study because it quantifies the amount of the total variance \({\upsigma}_{\textrm{T}}^2\) in Y that can be explained by covariates CT:

$${R}_{\text{T}}^{2}=\left({\upsigma }_{\text{T}}^{2}-{\upsigma }_{\text{T}|{\text{C}}_{\text{T}}}^{2}\right)/{\upsigma }_{\text{T}}^{2}$$
(3)

\({\upsigma }_{\text{T}|\mathrm{C}_{\text{T}}}^{2}\) symbolizes the conditional total student population’s variance of Y. df = N − QT − 2, where QT is the number of covariates CT.
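To make Eqs. (2) and (3) concrete, the following base-R sketch computes MDESIRT from N, PT, QT, and \({R}_{\text{T}}^{2}\); the function name, arguments, and defaults are ours for illustration and are not taken from the study’s code. With \({R}_{\text{T}}^{2}\) = 0 and N = 504, it approximately reproduces the introductory IRT example (MDES ≈ 0.25).

```r
# Minimal sketch of Eq. (2): MDES of an IRT for a two-tailed test.
# All names (mdes_irt and its arguments) are illustrative, not from the paper.
mdes_irt <- function(N, R2 = 0, P = 0.50, Q = 0, alpha = 0.05, power = 0.80) {
  df <- N - Q - 2                               # degrees of freedom
  M  <- qt(1 - alpha / 2, df) + qt(power, df)   # multiplier M_df = t_(alpha/2) + t_(1-beta)
  M * sqrt((1 - R2) / (P * (1 - P) * N))
}

mdes_irt(N = 504)                      # unconditional design
mdes_irt(N = 504, R2 = 0.40, Q = 1)    # one covariate with R^2_T = .40 (made-up value)
```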

Unlike an IRT, a multilevel CRT randomizes groups of students (e.g., whole schools). Consider a two-level CRT (2L-CRT) with students at level (L) 1 nested within schools at L3, and a three-level CRT (3L-CRT) with students at L1 nested within classrooms at L2 nested within schools at L3 (the school level is labeled L3 in both designs to keep the notation consistent). This clustering implies dependencies among selected subjects—students within the same classroom or school tend to be (often much) more similar than students from distinct ones (Schochet, 2008; Stallasch et al., 2021). The degree of within-cluster similarity is typically expressed by the multilevel design parameters ρL2 and ρL3 (i.e., the intraclass correlation coefficients at L2 and L3), that is, the proportions of \({\upsigma}_{\textrm{T}}^2\) in Y that lie between classrooms within schools and between schools, respectively:

$${\uprho }_{\text{L}2} ={\upsigma}_{\text{L}2}^{2}/{\upsigma}_{\text{T}}^{2}$$
(4)
$${\uprho }_{\text{L}3}={\upsigma}_{\text{L}3}^{2}/{\upsigma}_{\text{T}}^{2}$$
(5)

For a 2L-CRT, \({\upsigma}_{\textrm{T}}^2={\upsigma}_{\textrm{L}1}^2+{\upsigma}_{\textrm{L}3}^2\), and for a 3L-CRT, \({\upsigma}_{\textrm{T}}^2={\upsigma}_{\textrm{L}1}^2+{\upsigma}_{\textrm{L}2}^2+{\upsigma}_{\textrm{L}3}^2\), where \({\upsigma}_{\textrm{L}1}^2\), \({\upsigma}_{\textrm{L}2}^2\), and \({\upsigma}_{\textrm{L}3}^2\) are the unconditional variances in Y between students within classrooms in schools, between classrooms within schools, and between schools, respectively.

For a 2L-CRT with randomization at L3, Eq. (1) then transforms to (Bloom, 2006, Eq. 21; Dong & Maynard, 2013, p. 33):

$${\text{MDES}}_{2\text{L}-\text{CRT}}={M}_{df}\sqrt{\frac{{\uprho }_{\text{L}3}(1-{R}_{\text{L}3}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K}+\frac{(1-{\uprho }_{\text{L}3})(1-{R}_{\text{L}1}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K{n}_{\text{L}3}}}$$
(6)

For a 3L-CRT with randomization at L3, Eq. (1) transforms to (Bloom et al., 2008, Eq. 3; Dong & Maynard, 2013, p. 52):

$${\text{MDES}}_{3\text{L}-\text{CRT}}={M}_{df}\sqrt{\frac{{\uprho }_{\text{L}3}(1-{R}_{\text{L}3}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K}+\frac{{\uprho }_{\text{L}2}(1-{R}_{\text{L}2}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K{J}_{\text{L}3}}+\frac{(1-{\uprho }_{\text{L}3}-{\uprho }_{\text{L}2})(1-{R}_{\text{L}1}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K{J}_{\text{L}3}{n}_{\text{L}2}}}$$
(7)

nL2 and nL3 are the average numbers of students within classrooms and schools, respectively, JL3 is the average number of classrooms within schools, and K is the number of schools (i.e., the sum of the numbers of schools in TG and CG; K = KTG + KCG). Generally, K exerts greater impact on the MDES than nL2 or nL3 and JL3: everything else being equal, the larger K, the smaller the MDES. PL3 is the proportion of schools assigned to the treatment condition (i.e., PL3 = KTG / K) with PL3 = 0.50 minimizing the MDES. Further, everything else held constant, the larger ρL2 and/or ρL3, the larger the MDES. Since ρL2 and ρL3 are fixed (i.e., they are properties of the clustered population that cannot be altered by the researcher), the multilevel design parameters \({R}_{\textrm{L}1}^2\), \({R}_{\textrm{L}2}^2\), and \({R}_{\textrm{L}3}^2\) are of particular importance in this study because they quantify the amounts of \({\upsigma}_{\textrm{L}1}^2\), \({\upsigma}_{\textrm{L}2}^2\), and \({\upsigma}_{\textrm{L}3}^2\) in Y that can be explained by covariates CL1 at the student, CL2 at the classroom, and CL3 at the school level, respectivelyFootnote 5:

$${R}_{\text{L}1}^{2}=\left({\upsigma }_{\text{L}1}^{2}-{\upsigma }_{\text{L}1|{\text{C}}_{\text{L}1}}^{2}\right)/{\upsigma }_{\text{L}1}^{2}$$
(8)
$${R}_{\text{L}2}^{2}=\left({\upsigma }_{\text{L}2}^{2}-{\upsigma }_{\text{L}2|{\text{C}}_{\text{L}2}}^{2}\right)/{\upsigma }_{\text{L}2}^{2}$$
(9)
$${R}_{\text{L}3}^{2}=\left({\upsigma }_{\text{L}3}^{2}-{\upsigma }_{\text{L}3|{\text{C}}_{\text{L}3}}^{2}\right)/{\upsigma }_{\text{L}3}^{2}$$
(10)

\({\upsigma }_{\text{L}1|{\text{C}}_{\text{L}1}}^{2}\), \({\upsigma }_{\text{L}2|{\text{C}}_{\text{L}2}}^{2}\), and \({\upsigma }_{\text{L}3|{\text{C}}_{\text{L}3}}^{2}\) signify the conditional between-student, -classroom, and -school variances, respectively. df = K − QL3 − 2, where QL3 is the number of covariates CL3.
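The following base-R sketch translates Eqs. (6) and (7) into functions for 2L- and 3L-CRTs with randomization at L3; the function and argument names are ours, and the ρ values in the usage example are hypothetical, not estimates from this study.

```r
# Minimal sketch of Eqs. (6) and (7): MDES of 2L- and 3L-CRTs randomized at L3.
mdes_crt2 <- function(K, n_L3, rho_L3, R2_L1 = 0, R2_L3 = 0,
                      P = 0.50, Q_L3 = 0, alpha = 0.05, power = 0.80) {
  df <- K - Q_L3 - 2
  M  <- qt(1 - alpha / 2, df) + qt(power, df)
  M * sqrt(rho_L3 * (1 - R2_L3) / (P * (1 - P) * K) +
           (1 - rho_L3) * (1 - R2_L1) / (P * (1 - P) * K * n_L3))
}

mdes_crt3 <- function(K, J_L3, n_L2, rho_L2, rho_L3,
                      R2_L1 = 0, R2_L2 = 0, R2_L3 = 0,
                      P = 0.50, Q_L3 = 0, alpha = 0.05, power = 0.80) {
  df <- K - Q_L3 - 2
  M  <- qt(1 - alpha / 2, df) + qt(power, df)
  M * sqrt(rho_L3 * (1 - R2_L3) / (P * (1 - P) * K) +
           rho_L2 * (1 - R2_L2) / (P * (1 - P) * K * J_L3) +
           (1 - rho_L3 - rho_L2) * (1 - R2_L1) / (P * (1 - P) * K * J_L3 * n_L2))
}

# Example: 68 schools, 3 classrooms per school, 20 students per classroom (hypothetical rho values)
mdes_crt3(K = 68, J_L3 = 3, n_L2 = 20, rho_L2 = 0.05, rho_L3 = 0.20)
```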

Estimates of σ2 can be obtained through (multilevel) regression (see OSM A). For both IRTs and CRTs, larger R2 values generally result in smaller MDES values. Adjusting for highly prognostic baseline covariates is thus widely recognized and explicitly recommended to improve design sensitivity and to address chance covariate imbalance (e.g., Coens et al., 2020; EMA, 2015; Moerbeek & Teerenstra, 2016; Porter & Raudenbush, 1987; Raudenbush et al., 2007). Omitting factors which are strongly predictive but not equated between experimental groups may severely bias treatment effect estimates, impair power, and inflate type I error rates (Ciolino et al., 2019; Yang et al., 2020). At the same time, if not done correctly, covariate adjustment has some pitfalls in special cases: First, adjustment is worthless when the loss in df incurred by each covariate (in CRTs, at the top hierarchical level) outweighs the gain in precision (Kahan et al., 2014; Moerbeek & Teerenstra, 2016). This situation, however, is very rare; the loss in df is most often without (practical) consequence unless the sample size is very small (Konstantopoulos, 2012; Maxwell et al., 2017, p. 501). Second, adjustment might be detrimental when the assumption of covariate-treatment orthogonalityFootnote 6 is (severely) violated. This risk is amplified with covariates measured after randomization, which could therefore be affected by the treatment, as well as in (very) small-sized RTs, (highly) unbalanced designs (i.e., nTG ≠ nCG and/or in CRTs, unequal cluster sizes), and with (much) missing data on covariates (Kahan et al., 2014; Lin, 2013; Moerbeek, 2006; J. Wang, 2020). Violations of covariate-treatment orthogonality may be compensated for in the RT design stage by imposing further balancing methods (e.g., matching, minimization, stratification; Moerbeek & Teerenstra, 2016), and in the RT analysis stage by modeling covariate-treatment interactions, optimally using a robust SE estimator (Lin, 2013; J. Wang, 2020). Either way, it is of utmost importance to exclusively control for carefully a priori selected pre-treatment covariates. Non-prognostic, poorly chosen, or post-treatment covariates likely act as “bad controls” that pose a threat to the validity of results (Cinelli et al., 2022; Kahan et al., 2014; Moerbeek, 2006; Montgomery et al., 2018; Porter & Raudenbush, 1987).
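As an illustration of the analysis-stage remedy just mentioned (covariate-treatment interactions with a robust SE estimator in the spirit of Lin, 2013), the following sketch uses the sandwich and lmtest packages on simulated data; the data, variable names, and the HC2 choice are our assumptions and do not reproduce the study’s analysis code.

```r
# Minimal sketch of interaction-adjusted treatment effect estimation with robust (HC2) SEs.
# Data and variable names are made up for illustration.
library(sandwich)
library(lmtest)

set.seed(1)
d <- data.frame(
  y       = rnorm(200),
  treat   = rep(0:1, each = 100),
  pretest = rnorm(200)
)
d$pretest_c <- d$pretest - mean(d$pretest)          # center the covariate

fit <- lm(y ~ treat * pretest_c, data = d)          # main effects plus covariate-treatment interaction
coeftest(fit, vcov. = vcovHC(fit, type = "HC2"))    # treatment coefficient with a robust SE
```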

Theoretical and Empirical Considerations on Covariate Selection

Well-founded decisions on pre-treatment covariates are key to designing strong RTs. Scholars and agencies agree that these ideally preregistered decisions should be justified by both substantive theory and empirical results (Committee for Proprietary Medicinal Products, 2004; Cook, 2005; EMA, 1998, 2015; Food and Drug Administration, 2021; Maxwell et al., 2017; Moerbeek & Teerenstra, 2016; Murray, 1998; Raab et al., 2000; Tafti & Shmueli, 2020; Wright et al., 2015). Following this recommendation, the present paper draws on prominent models of school learning (Haertel et al., 1983; M. C. Wang et al., 1993) and connects to and expands upon previous empirical studies that examine the impact of covariates on design sensitivity in RTs with student achievement as the target outcome (for an overview, see Stallasch et al., 2021): specifically, student achievement is a multifaceted, complex construct influenced by a myriad of cognitive and non-cognitive (e.g., motivational or sociodemographic) factors (see also Steinmayr et al., 2014; Winne & Nesbit, 2010), of which the following were highlighted as the most important. First, a measure of prior knowledge in the same domain as the outcome (e.g., previous mathematics skills predicting future mathematics skills), which we refer to as a domain-identical pretest (IP), is known to shape performance trajectories (e.g., Ausubel, 1968; Dochy et al., 1999). This view is rooted in the assumption that one’s pre-existing knowledge base fundamentally molds input integration during knowledge acquisition (Brod, 2021; Woolfolk, 2020). Second, a measure of cognitive prerequisites in a certain domain may also explain achievement differences in another domain (e.g., previous language or reading skills predicting future mathematics skills; Peng et al., 2020; Ünal et al., 2023), which we refer to as a cross-domain pretest (CP). This idea is supported by the fact that scores from distinct achievement tests are often highly correlated (Baumert et al., 2009), reflecting the operation of a common cognitive capacity (often described as the g factor; Jensen, 1993) or the relevance of a specific ability to tasks in other domains (e.g., reading comprehension is needed to create a mental representation of mathematical problems; Kintsch, 1998). Third, there is broad consensus that fluid intelligence (Gf) is a powerful predictor of achievement in various domains (e.g., Cattell, 1987; Jensen, 1993; Neisser et al., 1996). Finally, sociodemographic characteristics (SC) such as gender, migration background, and socioeconomic status are also widely acknowledged as persistent precursors of academic success (e.g., Bradley & Corwyn, 2002; Stanat & Christensen, 2006).

Importantly, educational RTs often address outcomes in multiple domains (Lortie-Forgues & Inglis, 2019; Morrison, 2020) that might need to be adapted or expanded during implementation (e.g., due to logistic or financial reasons, or political decisions; see Bloom et al., 2007), and often span several years (Connolly et al., 2018; Rickles et al., 2018). Moreover, apart from the rule that RTs should always be designed as parsimoniously as possible, they are usually subject to limited resources. Thus, in practice, researchers planning RTs often face the challenge of weighing the potential trade-offs between the different covariate types, their combinations, and time lags for design sensitivity. Three influential, albeit debated, psychometric heuristics may help to derive predictions on the unique, relative, and incremental impacts of IP, CP, Gf, and SC: (a) the bandwidth-fidelity dilemma, (b) the incremental validity concept, and (c) the validity degradation principle.

In the following, we elaborate on each heuristic under both a theoretical and empirical lens. First, we briefly introduce the respective underlying conception. Figure 1 visualizes the implications for R2 in student achievement. Second, we systematically review previous evidence on the links between standardized achievement tests and the covariate sets germane to each heuristic. For this purpose, we meta-analytically integrated R2 as derived from (a) relevant studies providing single-level correlations rT (i.e., not hierarchically decomposed between students, classrooms, and schools), which are informative for planning IRTs, and (b) available studies compiling multilevel design parameters, which are informative for planning CRTs. For each covariate type, combination, and time lag, we fitted a (multivariate) fixed-effect model with the R package metafor (Viechtbauer, 2010) to summarize the available effect sizes.Footnote 7 Figure 2 portrays the Pooled R2 values discussed below (see OSM B for the listing of studies included per covariate set, and details on the methodology and results).
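For readers who wish to retrace this integration step, a minimal metafor sketch is given below; the three R2 values and their sampling variances are invented solely to show the fixed-effect specification and are not results from the review.

```r
# Minimal sketch of a fixed-effect meta-analysis of R^2 values with metafor.
library(metafor)

r2 <- c(0.45, 0.58, 0.51)      # hypothetical R^2_T|IP estimates from three studies
v  <- c(0.004, 0.006, 0.005)   # their hypothetical sampling variances

rma(yi = r2, vi = v, method = "FE")   # Pooled R^2 with SE and 95% CI
```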

Fig. 1 Schematic visualization of the theoretical predictions implied by psychometric heuristics for covariate impacts on R2 in student achievement

Fig. 2 Previous research on covariate impacts: meta-analytic Pooled R2 in student achievement for single- and multilevel designs

Covariate Types: Bandwidth-Fidelity

Theoretical Conception

The bandwidth-fidelity dilemma as originally introduced in psychometrics by Cronbach and Gleser (1957) describes an inherent compromise between the complexity (i.e., bandwidth) and the specificity (i.e., fidelity) of a covariate with respect to its predictive validity for an outcome (Hogan & Roberts, 1996; Salgado, 2017). The core idea is that maximal explanatory power requires the alignment of both the conceptual breadths and peculiarities between predictor and outcome (Hogan & Roberts, 1996; Salgado, 2017). Although the heuristic has primarily been discussed for cognitive and personality measures (see Cronbach & Gleser, 1957; Salgado, 2017), it is conceptually not limited to these constructs. Following the underlying rationale, when predicting a domain-specific achievement outcome, IP is expected to be superior to CP because the former matches the outcome domain; yet, as domain-specific cognitive measures, both should be covariates of high fidelity. CP is expected to outperform Gf, as Gf is a domain-general cognitive measure and should be a covariate of lower fidelity/broader bandwidth. Gf is expected to surpass SC, as SC are non-cognitive measures and should be covariates of even broader bandwidth.

Previous Empirical Evidence

Single-Level Perspective

The studies in our review demonstrated the high predictive power of IP for student achievement, explaining on average 56% of variance. Of note, while some found that these associations remained fairly stable across grades (e.g., Cole et al., 2011), others showed that IP gains relevance among older students (e.g., McCoach et al., 2017). CP was half as effective as IP (\(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}\right.}^2=0.28\)). Gf turned out to be a significant predictor, with \(Pooled\ {R}_{\textrm{T}\left|\textrm{Gf}\right.}^2\) equaling 0.19. SC explained a meaningful but—relative to the cognitive covariates—small proportion of variance of about 4%. Importantly, for all covariate types, there was substantial between-study variation. For example, \({R}_{\textrm{T}\left|\textrm{IP}\right.}^2\) ranged broadly from 0.17 to 0.73, due to variation across grade levels and/or domains and pre-posttest time lags. To conclude, the reviewed single-level evidence generally supports the theoretical predictions on the differential impacts of covariate types with varying bandwidth-fidelity.

Multilevel Perspective

Across studies, IP appeared to be the most powerful covariate type, explaining an astonishing 73%/81% of achievement differences at L2/L3, and 48% at L1. In Hedges and Hedberg (2013), the prognostic value of IP at L3 strengthened throughout the school career, a trend that was replicated repeatedly (see Stallasch et al., 2021, Fig. 1). Despite domain mismatch, CP proved a highly robust predictor, particularly at L3: \(Pooled\ {R}_{\textrm{L}3\left|\textrm{CP}\right.}^2\) amounted to 0.74, whereas \(Pooled\ {R}_{\textrm{L}1\left|\textrm{CP}\right.}^2\) was 0.30. As far as we are aware, the predictive capacity of Gf has not yet been partitioned into its hierarchical variance components. SC exerted substantial predictive power at L3 (\(Pooled\ {R}_{\textrm{L}3\left|\textrm{SC}\right.}^2=0.64\)), but rather limited predictive properties at L1 (\(Pooled\ {R}_{\textrm{L}1\left|\textrm{SC}\right.}^2=0.10\)) and L2 (\(Pooled\ {R}_{\textrm{L}2\left|\textrm{SC}\right.}^2=0.21\)). For every covariate and at each hierarchical level, we recorded notable cross-study heterogeneity (e.g., 0.23 ≤ \({R}_{\textrm{L1}\mid \textrm{IP}}^2\) ≤ 0.58, 0.49 ≤ \({R}_{\textrm{L2}\mid \textrm{IP}}^2\) ≤ 0.70, and 0.54 ≤ \({R}_{\textrm{L3}\mid \textrm{IP}}^2\) ≤ 0.83). In sum, the available multilevel evidence fit the assumptions about the differential effects of covariate types of varying bandwidth-fidelity quite well. Yet, compared to the single-level findings, the respective differences in R2 seemed far less pronounced, especially at the group levels.

Covariate Combinations: Incremental Validity

Theoretical Conception

Incremental validity (Sechrest, 1963) refers to a measure’s capacity to explain additional variance in an outcome beyond what is explained by other prognostic factors (Haynes & Lench, 2003; Hunsley & Meyer, 2003); it is assessed by contrasting a covariate combination with a subset of it (Haynes & Lench, 2003). As outlined above, IP is the strongest known predictor of domain-specific student achievement. When planning RTs, an important question is therefore whether IP plus CP, Gf, and/or SC jointly explain more variance than IP alone.

Previous Empirical Evidence

Single-Level Perspective

Averaged across the reviewed studies, CP contributed to the prediction of student achievement beyond IP, albeit to a small degree; the joint effect amounted to \(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{CP}\right.}^2=0.57\). Overall, Gf showed no additional benefits over and above IP (\(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{Gf}\right.}^2=0.48\)). Combining IP and SC also did not lead to a general improvement over controlling for IP alone (\(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{SC}\right.}^{2}=0.55\)). Taken together, the full covariate battery did not raise the amount of explained variance beyond IP (\(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{CP}+\textrm{Gf}+\textrm{SC}\right.}^2=0.52\)). At the same time, for all covariate combinations, incremental returns occasionally reached meaningful thresholds, peaking at +15% of variance explanation over and above IP for the complete covariate array (Chu et al., 2018). The largest increments occurred consistently with elementary school samples, potentially implying that the incremental validities of CP, Gf, and/or SC might be stronger in younger than older students.

Multilevel Perspective

We found no multilevel study quantifying incremental validities of CP or Gf, or their combination with SC over and above IP. Much more is known about the set of SC, which incrementally predicted student achievement after IP had been taken into account, although only at the group levels. Pooled across studies, the joint amounts of explained variance equaled 83% at L3, 77% at L2, and 46% at L1. Stallasch et al.’s (2021) analyses revealed that SC contributed around +21%/+13% to the prediction of L2/L3 achievement differences beyond IP, where additional returns appeared to be more pronounced in elementary than secondary school.

Covariate Time Lags: Validity Degradation

Theoretical Conception

The validity degradation principle (Ghiselli, 1956; Humphreys, 1960) implies that the amount of variance explained by a cognitive predictor steadily decreases with growing time lags to the outcome (Hulin et al., 1990; Keil & Cortina, 2001; Reeve & Bonaccio, 2011). The developmental dynamics underlying validity degradation can be described as a simplex time series pattern (Humphreys, 1960). Accordingly, for domain-specific student achievement as outcome, the explanatory power of IP, CP, and Gf assessed 1 year before the outcome should be higher than the explanatory power of IP, CP, and Gf assessed, for example, 7 years before the outcome.Footnote 8

Previous Empirical Evidence

Single-Level Perspective

The vast majority of reviewed investigations suggest that \({R}_{\textrm{T}\left|\textrm{IP}\right.}^2\) in student achievement decreases with greater pre-posttest time lags: values considerably dropped from \(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}-1\right.}^2=0.63\) to \(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}-7\right.}^2=0.36\). Of note, this trend holds true for all grade levels (e.g., McCoach et al., 2017). Analogous results—though far less striking—were reported for the predictive properties of CP: \(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}-1\right.}^2=0.24\) declined to \(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}-7\right.}^2=0.10\). In some studies, however, \({R}_{\textrm{T}\left|\textrm{CP}\right.}^2\) barely diminished (e.g., Erbeli et al., 2021) or even increased with growing time lags (e.g., Träff et al., 2020). The few available studies on the potential validity degradation of Gf indicate fairly robust long-term impacts: pooled across studies, Gf−1 explained 13% and Gf−7 explained 33% of achievement differences. In their review, Reeve and Bonaccio (2011) concluded that the decay of Gf’s predictive property is subtle at best, even across numerous years.

Multilevel Perspective

The few existing studies on multilevel design parameters addressing the temporal validity degradation of covariates attested to a notable decrement in the explanatory power of IP at L1; \(Pooled\ {R}_{\textrm{L}1\left|\textrm{IP}-1\right.}^2=0.50\) declined to \(Pooled\ {R}_{\textrm{L}1\left|\textrm{IP}-3\right.}^2=0.35\). Meanwhile, amounts of explained variance at L3 were far less prone to time effects (\(Pooled\ {R}_{\textrm{L}3\left|\textrm{IP}-1\right.}^2=0.86\); \(Pooled\ {R}_{\textrm{L}3\left|\textrm{IP}-3\right.}^2=0.76\)). Only Xu and Nichols (2010) studied temporal declines in IP’s predictive power at L2: proportions of explained variance remained at a high level of 70% across two subsequent years. Of note, deteriorations in R2 seem to be generally more prevalent in elementary than secondary school, especially at L3. To our knowledge, multilevel studies focusing on cross-time validity decay of CP and Gf are lacking to date.

The Present Study

Strong RTs unite cost-efficiency and sophisticated methodology to ensure appropriate design sensitivity. Given that well-selected covariates substantially raise statistical power and precision, evaluation researchers need reliable evidence that substantiates covariate choices by quantifying unique, relative, and incremental yields of the target outcome’s most important predictors. We aim to significantly expand the available guidance for IRTs and CRTs on student achievement through a comprehensive compilation of reliable single- and multilevel design parameters that we meta-analyze and apply in precision simulations.Footnote 9

First, both IRTs and CRTs are in their own right cornerstones of evidence-based education. Both designs are frequently implemented (Connolly et al., 2018). However, single-level design parameters on student achievement have not yet been systematically compiled. Indeed, our quantitative research review may be considered a first major step towards this endeavor. Moreover, extant multilevel design parameters remain mostly restricted to two hierarchical levels. To address these gaps, we cover RTs of three different designs: IRTs (with students assumed to be independently sampled), 2L-CRTs (with students nested within schools), and 3L-CRTs (with students nested within classrooms nested within schools).

Second, researchers rely on knowledge about the potential sensitivity-raising effects of specific covariate types, combinations, and time lags. The above research review pointed out that an IP assessed as recently as possible is most likely the best among the covariates. Yet, sometimes the inclusion of IP is not feasible, such as when there are multiple outcome domains (e.g., Lortie-Forgues & Inglis, 2019) while testing time is limited, when the outcome changes after the RT has started (e.g., due to political decisions; Bloom et al., 2007, p. 32), when the outcome is subject to strong developmental dynamics and/or presupposes intensive instruction (e.g., reading skills during elementary school), or when individual pretest differences are unlikely to be observed ahead of the intervention (e.g., integral calculus prior to its introduction; Shadish et al., 2002, p. 118). In such situations, CP, Gf, or SC may be meaningful alternatives to IP. However, only a few multilevel studies provide information on the impacts of CP and SC, and none on the impacts of Gf. Beyond that, the combination of IP with CP, Gf, and/or SC may further boost design sensitivity. Past multilevel studies solely assessed the incremental validity of SC over and above IP. Further, RTs often span multiple years (e.g., Rickles et al., 2018), especially when long-term intervention effects are of interest. Although the explanatory power of IP, CP, and Gf may be susceptible to temporal decay, prior multilevel studies addressed rather short pre-posttest time lags of 1–3 years to test validity degradation in IP, but not in CP or Gf. To address these gaps, we systematically vary and combine IP, CP, and Gf with 1- to 7-year-lagged data, as well as SC within 11 different covariate sets (in addition to a set 0 without any covariates).

Third, contemporary educational standards refer to a plethora of skills in various domains (National Research Council, 2011; Organisation for Economic Co-operation and Development [OECD], 2018), as do educational RTs (e.g., Morrison, 2020). Past works on multilevel design parameters dealt with a limited number of outcome domains, namely mathematics, science, and reading. To address this gap, we investigate a wide array of eight commonly targeted outcomes from STEMFootnote 10 and verbal domains.

Fourth, educational RTs are conducted all around the globe (Connolly et al., 2018), but existing collections of multilevel design parameters primarily stem from US samples. Estimates for countries whose school system characteristics markedly deviate from those of the United States, such as an (often much) earlier onset of ability-based school-type-tracking as is the case in Germany, are scarce. To address this gap, we capitalize on longitudinal large-scale assessment data from six German probability samples that represent the total student population in elementary (grades 1–4), lower secondary (grades 5–10), and upper secondary school (grades 11–12), as well as the student populations in lower and upper secondary school belonging to the academic and non-academic track.Footnote 11

Finally, many past educational large-scale RTs lacked design sensitivity (Lortie-Forgues & Inglis, 2019). It is therefore essential to reliably judge how the varying covariate types, combinations, and time lags actually affect precision (given the typical desired 80% power). To this end, power analyses contextualizing the respective R2 values within predefined designs are indispensable: as becomes clear from Eqs. (2), (6), and (7), the MDES is shaped by the interplay of several quantities beyond power and R2, such as sample size and allocation, and in the multilevel case also values of ρ. Furthermore, since empirical design parameters are tainted with sampling error that may (dramatically) distort power analysis outcomes, properly allowing for uncertainty is best practice (e.g., Jacob et al., 2010; Turner et al., 2004). We consequently ran precision simulations that incorporate uncertainty in ρ and R2 via a Bayesian rationale to calculate plausible MDES ranges for IRTs and CRTs.

Part I: Two-Stage Individual Participant Data Meta-Analysis—Estimating and Integrating Design Parameters

Method

We briefly sketch the applied methods here (see OSM C for details). We used R 4.2.2 (R Core Team, 2022); package versions are noted in the R scripts.

Large-Scale Assessment Data

Systematic Search

To identify German large-scale assessment datasets suitable for analyzing covariate impacts on design sensitivity in RTs on student achievement, we carried out a systematic search in three electronic data repositories (see also Brunner, Stallasch, et al., 2023). Datasets had to meet the following eligibility criteria: (a) representativeness for the German student population, (b) longitudinal design, and (c) assessment of student achievement via standardized tests. We found three large-scale assessments providing data of six independent national probability samples.

National Educational Panel Study (NEPS)

NEPS (Blossfeld & Roßbach, 2019) has been tracking multiple cohorts’ educational trajectories throughout their lifespan from 2010 to today. We used the dataFootnote 12 of students from three NEPS starting cohorts: 4-year-olds (in kindergarten) tested through grade 4 (NSC2; NEPS Network, 2020); grade 5 students tested through grade 12 (NSC3; NEPS Network, 2019a); and grade 9 students tested through grade 12 (NSC4; NEPS Network, 2019b). Achievement tests were administered every 1–3 years.

Programme for International Student Assessment (PISA)

The PISA cycles 2003 and 2012 were extended as national longitudinal follow-ups in grades 9–10 in Germany (PISA-Konsortium Deutschland, 2006; Reiss et al., 2017). We used the dataFootnote 13 from PISA-I-Plus 2003, 2004 (PP03; Prenzel et al., 2013), which focuses on students’ mathematics and science achievement development, and from PISA-Plus 2012–2013 (PP12; Reiss et al., 2019), which additionally incorporates a follow-up assessment of reading achievement.

Assessment of Student Achievements in German and English as a Foreign Language (DESI)

DESI (DESI-Konsortium, 2008) studied students’ verbal achievement during grade 9. We used the DESI data11 (Klieme, 2012) on verbal skills in German.

Sampling Process and Sample Selection

Except for NSC2, all samples were drawn using a multistage (i.e., multilevel) sampling process where schools were first randomly drawn, followed by at least two intact classrooms per school (Aßmann et al., 2011; Beck et al., 2008; Heine et al., 2017; Prenzel et al., 2006). NSC2 involved sampling kindergarten children and students of the schools that those children entered to ensure representativeness for children entering elementary school (Aßmann et al., 2011).

When studying covariate types and combinations, we drew on the full spectrum of samples. When studying covariate time lags, we drew only on NSC2 and NSC3, as these samples provided longitudinal achievement data across at least three measurement points. As listed in Table 1, we analyzed data from a total of N = 68,502 students, where sample sizes ranged within 1868 (NSC3, grade 12) ≤ N ≤ 10,543 (DESI, grade 9), with median cluster sizes of 4 ≤ nL2 ≤ 25, 14 ≤ nL3 ≤ 50, and 2 ≤ JL3 ≤ 3. Note that in grades 11–12, information at L2 did not exist because, in German upper secondary school, the affiliation of students to intact classrooms is usually replaced by a course grouping system catering to students’ ability level in a certain school subject (e.g., basic vs. advanced courses).

Table 1 Numbers of students N, classrooms J, and schools K, and median numbers of students per classroom nL2, students per school nL3, and classrooms per school JL3

Measures

Achievement Outcomes

We analyzed outcomes in three STEM domains, namely mathematics, science, and information and communication technology (ICT), as well as in five verbal domains in German, namely reading, grammar, spelling, vocabulary, and writing.

Covariates

We examined four covariate categories: IP, CP, Gf, and SC. We employed reading as CP for STEM outcomes and mathematics as CP for verbal outcomes. Gf was assessed in terms of figural reasoning. IP, CP, and Gf were available with a 1- to 7-year time lag to the outcome, where the smallest pre-posttest gap ranged from 1 to 4 years. SC comprised 4 variables, namely students’ gender (0 = male, 1 = female) and migration background (0 = no, 1 = yes) as well as two indicators of socioeconomic status: (1) parents’ highest educational attainment was assessed by the greatest number of years of schooling completed (range 9–18) in all studies except the DESI, where the highest school leaving certificate was used; and (2) parents’ highest International Socio-Economic Index of Occupational Status (HISEI; Ganzeboom & Treiman, 1996; range: 11–89).

Missing Data

Virtually all measures used in this study contained some missing values. The percentage of missing values across the datasets varied from 11% (PP03, grade 10) to 42% (NSC2, grade 1). The greatest missing rates occurred in pretests measured in the first two waves of NSC2, as only a small share of kindergarten children continued participating in NEPS after entering elementary school. We performed (groupwise) multilevel multiple imputation and generated 50 multiply-imputed datasets for each sample and grade using the mice (van Buuren & Groothuis-Oudshoorn, 2011) and miceadds (Robitzsch et al., 2021) packages.

Procedure

We applied a two-stage approach to meta-analysis of individual participant data (Brunner, Keller, et al., 2023; see also Brunner, Stallasch, et al., 2023). We estimated and meta-analyzed design parameters for three RT designs, namely single- (individual students), two- (students within schools), and three-level designs (students within classrooms within schools), as well as for three target populations, namely the total, academic track, and non-academic track student populations. Notably, in upper secondary school, only single- and two-level designs were considered due to the lack of L2 information.

Stage 1: Single- and Multilevel Modeling—Estimating Design Parameters

We performed single- and multilevel modeling to empirically estimate ρ and R2. As shown in Table 2, we systematically included and excluded 1- to 7-year-lagged IP, CP, and Gf, as well as SC within a total of 12 covariate sets, with the number of covariates Q per set ranging between 0 ≤ Q ≤ 7. This resulted in up to 363 distinct models per design and population.

Table 2 Covariate sets analyzed in the present study with numbers of covariates Q, ρ/R2 effect sizes G, and samples H by design

Model Fitting

For all outcomes, we fitted two model classes separately for each imputation. The first model class consisted of unconditional models without any covariates (set 0). Specifically, for single-level designs, we obtained \({\upsigma}_{\textrm{T}}^2\) by taking the outcomes’ variances. For multilevel designs, we obtained \({\upsigma}_{\textrm{L}1}^2\), \({\upsigma}_{\textrm{L}2}^2\), and \({\upsigma}_{\textrm{L}3}^2\) by specifying two- and three-level random-intercept-only models. The second model class consisted of conditional models with varying covariate types (sets 1–4), combinations (sets 5–8), and time lags (sets 9–11). Specifically, for single-level designs, we obtained \({\upsigma }_{\text{T}|{\text{C}}_{\text{T}}}^{2}\) by specifying single-level regression models. For multilevel designs, we obtained \({\upsigma }_{\text{L}1|{\text{C}}_{\text{L}1}}^{2}\), \({\upsigma }_{\text{L}2|{\text{C}}_{\text{L}2}}^{2}\), and \({\upsigma }_{\text{L}3|{\text{C}}_{\text{L}3}}^{2}\) by specifying two- and three-level random-intercept models. Note that all covariates were assessed at L1. In two-level models, we entered school averages at L3. In three-level models, we entered classroom averages at L2 and school averages at L3. In single-level models, we centered all covariates around their respective total population’s means whereas in multilevel models, we applied group-mean centering: L1 covariates were centered around their respective school/classroom means in two-/three-level models and L2 covariate means were centered around their respective school means in three-level models. Single-level modeling was performed using the stats package implemented in base R. Multilevel modeling was performed using the lme4 package (Bates et al., 2015) applying restricted maximum likelihood (REML) estimation.
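A minimal sketch of this model-fitting step is given below, assuming a data frame students with (made-up) columns math, pretest, classroom, and school; it is meant to illustrate the unconditional and conditional three-level specifications with group-mean centering, not to reproduce the study’s scripts.

```r
# Minimal sketch of the unconditional (set 0) and a conditional three-level model (REML via lme4).
library(lme4)

# Unconditional random-intercept-only model: variance components at L1, L2, and L3
m0 <- lmer(math ~ 1 + (1 | school) + (1 | school:classroom),
           data = students, REML = TRUE)

# Group-mean centering: L1 pretest centered at classroom means, classroom means centered at school means
students$pre_cm <- ave(students$pretest, students$school, students$classroom)   # classroom means (L2)
students$pre_sm <- ave(students$pretest, students$school)                       # school means (L3)
students$pre_L1 <- students$pretest - students$pre_cm
students$pre_L2 <- students$pre_cm - students$pre_sm

# Conditional random-intercept model with the pretest entered at all three levels
m1 <- lmer(math ~ pre_L1 + pre_L2 + pre_sm +
             (1 | school) + (1 | school:classroom),
           data = students, REML = TRUE)

as.data.frame(VarCorr(m0))   # unconditional sigma^2_L1, sigma^2_L2, sigma^2_L3
as.data.frame(VarCorr(m1))   # conditional (covariate-adjusted) variance components
```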

Calculating Design Parameters and Standard Errors

We calculated ρ and R2 by inserting the variance (component) estimates from the model fits into Eqs. (3)–(5) and (8)–(10). SEs of ρ were computed with the formulas for the large sample variances in unbalanced (i.e., with unequal cluster sizes) two-level designs derived in Donner and Koval (1980, Eq. 3) and three-level designs in Hedges et al. (2012, Eqs. 7–9). The latter involves the sampling variances of \({\upsigma}_{\textrm{L}2}^2\) and \({\upsigma}_{\textrm{L}3}^2\), which we obtained by applying the “cases bootstrap” from the lmeresampler package (Loy & Korobova, 2023). We drew 1000 samples (Huang, 2018, p. 303; Schomaker & Heumann, 2018). SEs of R2 were computed with the formula for the large sample variances given in Hedges and Hedberg (2013, p. 451).
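The first step described here (plugging the variance components into Eqs. (4)–(5) and (8)–(10)) boils down to a few lines of arithmetic, sketched below with made-up variance components; the SE formulas of Donner and Koval (1980) and Hedges et al. (2012) are not reproduced in this sketch.

```r
# Sketch of Eqs. (4)-(5) and (8)-(10) with made-up variance components (not the study's estimates).
v0 <- c(L1 = 0.60, L2 = 0.05, L3 = 0.35)   # unconditional components from a fit like m0
v1 <- c(L1 = 0.40, L2 = 0.02, L3 = 0.08)   # conditional components from a fit like m1

rho_L2 <- v0["L2"] / sum(v0)   # Eq. (4)
rho_L3 <- v0["L3"] / sum(v0)   # Eq. (5)
R2     <- (v0 - v1) / v0       # Eqs. (8)-(10): R^2_L1, R^2_L2, R^2_L3
```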

Pooling

ρ and R2 with corresponding SEs were pooled across the 50 imputations. We used the mitml package (Grund et al., 2021) that employs Rubin’s (1987) rules to take into account within- and between-imputation variance.
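The pooling arithmetic behind Rubin’s (1987) rules, as implemented in mitml, can be sketched in a few lines of base R; the function name and example inputs below are ours, for illustration only.

```r
# Minimal sketch of Rubin's (1987) rules for pooling an estimate and its SE across m imputations.
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)              # pooled point estimate
  W    <- mean(se^2)             # within-imputation variance
  B    <- var(est)               # between-imputation variance
  Tvar <- W + (1 + 1 / m) * B    # total variance
  c(estimate = qbar, se = sqrt(Tvar))
}

# Example with 50 made-up imputation-specific R^2 estimates and SEs
pool_rubin(est = rnorm(50, mean = 0.45, sd = 0.01), se = rep(0.02, 50))
```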

Stage 2: Meta-Analysis—Integrating Design Parameters

We performed meta-analysis to integrate ρ and R2 for covariate types and combinations, and meta-regression with outcome-covariate time lag as moderator to integrate R2 for covariate time lags (both across domains and samples, but within hierarchical and grade levels, designs, and populations).Footnote 14

Model Fitting

Using the metafor package (Viechtbauer, 2010), we fitted two meta-analytic/meta-regression model classes, conditional on the number of R2 effect sizes G per covariate set: either (multivariate) fixed-effect models if G < 10 or (multivariate multilevel) random-effects models via REML if G ≥ 10 (see Langan et al., 2019, p. 95). Both methods yield an average (true) effect size Pooled R2, with SE(Pooled R2). However, the “real” (i.e., not due to sampling error) heterogeneity among true R2 values within samples (τ2Effect sizes) and between samples (τ2Samples) can only be captured by random-effects models (Borenstein et al., 2021, pp. 61–80). We deployed two weighting schemes, conditional on the number of samples H per covariate set: If H > 1, we addressed within-sample dependencies among R2 effect sizes (Hedges, 2019) by multivariate (multilevel) meta-analyses and imputed working variance–covariance matrices using the clubSandwich package (Pustejovsky, 2022). We assumed a within-sample intercorrelation of r = 0.90 as a reasonable upper-bound guess (see Brunner, Stallasch, et al., 2023). If H = 1, we used the sampling variances of R2 for standard meta-analytic inverse-variance weighting.
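The following sketch illustrates the multivariate random-effects specification with an imputed working variance–covariance matrix at r = 0.90; the tiny data frame, its column names, and the nested random-effects formula are illustrative assumptions of ours, not the study’s code.

```r
# Minimal sketch of multivariate (multilevel) random-effects integration of R^2 effect sizes.
library(metafor)
library(clubSandwich)

dat <- data.frame(
  es_id  = 1:6,
  sample = rep(c("A", "B", "C"), each = 2),
  r2     = c(0.48, 0.52, 0.61, 0.58, 0.44, 0.47),   # made-up R^2 effect sizes
  vi     = rep(0.004, 6)                            # made-up sampling variances
)

# Working variance-covariance matrix assuming r = .90 within samples
V <- impute_covariance_matrix(vi = dat$vi, cluster = dat$sample, r = 0.90)

res <- rma.mv(yi = r2, V = V, random = ~ 1 | sample / es_id,
              data = dat, method = "REML")
predict(res)   # Pooled R^2 with 95% CI and, given the random effects, a 95% PI
```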

Depicting Heterogeneity

With random-effects modeling, we calculated—in addition to the 95% confidence interval (95% CI)—the 95% prediction interval (95% PI). The 95% PI provides a plausible range of R2; it quantifies the total dispersion (sampling variance plus τ2Effect sizes, and if applicable, plus τ2Samples) of R2 around Pooled R2 and defines the range in which an R2 estimated based on data of a new sample randomly drawn from a population of samples will likely (i.e., in 95% of cases) fall (Borenstein et al., 2021, pp. 119–126; Riley et al., 2011). We also calculated (multilevel) I2 (Higgins & Thompson, 2002), the ratio of real heterogeneity to the total variation across observed R2 values (Borenstein et al., 2017).

Gauging Sensitivity and Model Convergence

For the imputed working variance–covariance matrices, we ran sensitivity analyses over r ∈ {0.00, 0.05, …, 0.95} (Hedges, 2019) to preclude a misspecification of R2 dependencies. With random-effects modeling, we profiled log-likelihoods of τ2 values to evaluate their identifiability (see Viechtbauer, 2022).Footnote 15

Key Results

We present major patterns in meta-analytic single- and multilevel (i.e., three-level in grades 1–10 and two-level in grades 11–12) design parameters for the total student population, illustrated in Figs. 3 and 4 (which we refer to in this section, unless otherwise stated; see OSM C for result plots of two-level designs in grades 1–10 and school tracks, and OSM E/F for the full compilation of the empirical/meta-analyzed design parameters).

Fig. 3 Meta-analytic integrations of single- and multilevel R2 in student achievement: covariate types and covariate combinations

Fig. 4 Meta-analytic integrations of single- and multilevel R2 in student achievement: covariate time lags

Covariate Types: Bandwidth-Fidelity

Single-Level Perspective

IP was consistently the most powerful among all covariate types: IP explained over one third of achievement differences between individual students in elementary and upper secondary school, and almost one half in lower secondary school. Here, the remaining cognitive covariates were also valuable predictors, with \(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}\right.}^2=0.28\) and \(Pooled\ {R}_{\textrm{T}\left|\textrm{Gf}\right.}^2=0.20\). In elementary and upper secondary school, CP and in particular Gf contributed comparatively little to the prediction. In contrast, SC performed best as predictors in elementary school (\(Pooled\ {R}_{\textrm{T}\left|\textrm{SC}\right.}^2=0.16\)). We registered substantial \({R}_{\textrm{T}}^2\) heterogeneities, with the broadest 95% PIs for IP and the narrowest for SC. For example, in elementary school, the 95% PIs were [0.13, 0.58] and [0.09, 0.24], respectively (Table F1).

Multilevel Perspective

IP was generally of paramount relevance when predicting student achievement. From grade 5 on, IP was the strongest of all covariate types and showed exceptional prognostic properties at L3, where 98/78% of variance was explained in lower/upper secondary school. Across the entire school career, however, IP turned out to be a weaker predictor at both L1 and L2. CP as well as Gf were very useful to explain differences between schools, in particular in lower secondary school (\(Pooled\ {R}_{\textrm{L}3\left|\textrm{CP}\right.}^2=0.91\); \(Pooled\ {R}_{\textrm{L}3\left|\textrm{Gf}\right.}^2=0.86\)), but less so between classrooms and students—in all grade levels. Gf was moreover consistently the weakest covariate both in elementary and upper secondary school, irrespective of the hierarchical level. Although SC were the poorest predictors in lower secondary school, they still explained over three quarters of between-school variance. Notably, with first–fourth graders, SC outweighed IP at both L2 and L3 (\(Pooled\ {R}_{\textrm{L}2\left|\textrm{SC}\right.}^2=0.35\); \(Pooled\ {R}_{\textrm{L}3\left|\textrm{SC}\right.}^2=0.52\)). Degrees of heterogeneity in multilevel R2 were often considerable, depending not only on the covariate type but also on the grade and hierarchical level. For instance, the 95% PI for \({R}_{\textrm{L}3\left|\textrm{IP}\right.}^2\) was very wide in elementary school with [0.07, 0.81], but considerably narrower in secondary school with [0.95, 1.00] (Tables F1 and F2).

Covariate Combinations: Incremental Validity

Single-Level Perspective

In all grade levels, CP explained additional variance in student achievement over and above IP. On average, incremental gains were largest in lower secondary school (+5%) and smallest in upper secondary school (+2%). When controlling for IP, benefits through Gf were noteworthy only in lower secondary school (\(\Delta Pooled\ {R}_{\textrm{T}\left|+\textrm{Gf}\right.}^{2}=+0.04\)). In contrast, SC particularly contributed to the prediction in elementary/upper secondary school, with about +4%/+3%. Joint effects through the full battery of covariates were always largest, with +0.06 ≤ \(\Delta Pooled\ {R}_{\textrm{T}\left|\textrm{CP}+\textrm{Gf}+\textrm{SC}\right.}^{2}\le +0.08\). We found notable heterogeneities in all \({R}_{\textrm{T}}^2\). For example, the 95% PI for \({R}_{\textrm{T}\left|\textrm{IP}+\textrm{Gf}\right.}^2\) was [0.15, 0.59] (Table F1).

Multilevel Perspective

At the school level, CP clearly added to the prediction of achievement differences beyond IP in elementary (\(\Delta Pooled\ {R}_{\textrm{L}3\left|+\textrm{CP}\right.}^{2}=+0.09\)), but not in secondary (\(\Delta Pooled\ {R}_{\textrm{L}3\left|+\textrm{CP}\right.}^{2} \le+0.01\)) school. At L1/L2, CP provided some additional explanatory power over the entire school career; benefits were largest in lower secondary school (+7%/+5%). In general, we found Gf to be a rather poor additional covariate across grade and hierarchical levels. As an exception, contributions in \(Pooled\ {R}_{\textrm{L}1}^2\) over and above IP amounted to +6% in lower secondary school. SC were of notable incremental relevance in elementary school, especially at L3, adding on average +16% of explained variance. In higher grades, though, additional gains were often negligible (note, however, that \(Pooled\ {R}_{\textrm{L}3\left|+\textrm{SC}\right.}^2\) reached 0.99 in lower secondary school). Except for L3 in lower secondary school, the complete set of covariates consistently outweighed all other combinations—at L1, average gains were the highest in lower secondary, and at L2 and L3 in elementary school, with \(\Delta Pooled\ {R}_{\left|+\textrm{CP}+\textrm{Gf}+\textrm{SC}\right.}^{2}\) = +0.09/+0.13/+0.20 at L1/L2/L3. Multilevel R2 heterogeneities largely mirrored those of IP, and were substantial (except \({R}_{\textrm{L}3}^2\) in lower secondary school). For example, the 95% PI of \({R}_{\textrm{L}2\left|\textrm{IP}+\textrm{CP}\right.}^2\) was [0.21, 1.00] in grades 5–10 (Table F2).

Covariate Time Lags: Validity Degradation

Single-Level Perspective

In all grade levels, the predictive power of IP clearly diminished with growing pre-posttest time lags. Validity degradation was most prevalent in elementary school: the meta-regression coefficient blag = −0.06 shows that with each additional year between IP and outcome, \({R}_{\textrm{T}\mid \textrm{IP}}^2\) is predicted to decrease by 6%. In lower/upper secondary school, temporal declines in the proportions of explained variance were also noticeable (blag = −0.04/−0.03). In contrast, CP emerged as far less prone to cross-time decay: until grade 10, prognostic properties remained stable both in elementary and secondary school. In upper secondary school, predicted amounts of explained variance decreased only slightly, by 1% per year. Gf turned out to be an extraordinarily time-robust predictor throughout the entire school career.

Multilevel Perspective

Validity degradation in \({R}_{\mid \textrm{IP}}^2\) was almost always substantial, except for lower secondary school at L3. Here, we recorded remarkable temporal stabilities in the amounts of explained variance (blag = −0.01). In all other cases, the explanatory power of IP is likely to drop by about 3% up to 8% per year. CP appeared to be a relatively time-robust covariate; however, decrements in prognostic capacity hinged on both the grade and hierarchical level: in elementary school, only the predicted \({R}_{\textrm{L}3\mid \textrm{CP}}^2\) declined slightly (blag = −0.01); in lower secondary school, only the predicted proportions of explained L2 variance dropped, but they did so strikingly (blag = −0.09); and in upper secondary school, predicted \({R}_{\textrm{L}1\mid \textrm{CP}}^2\) (blag = −0.01) and \({R}_{\textrm{L}3\mid \textrm{CP}}^2\) (blag = −0.02) showed small reductions over time. While Gf consistently emerged as highly time-stable at L1, the predicted validity decay at both L2 and L3 was practically significant in all grade levels (−5% ≤ blag ≤ −1%).

Application

Researchers designing RTs may profit from the flow chart in Fig. 5. It facilitates the choice of single- and multilevel design parameters that are optimally tailored to the specific application context. To showcase the estimates’ use in study planning, we developed manifold scenarios to determine the (a) sample size and (b) statistical power of IRTs and CRTs via power analysis. We present one in the following (see OSM C for the remaining scenarios).

Fig. 5 Flow chart to choose design parameters from our compilation in OSM E and F

An Illustrative Scenario

A research team has programmed an app. It functions as a multidisciplinary digital learning environment which can be used throughout lower secondary school in Germany.

Single-Level Perspective

As a first step, the researchers aim to test the general efficacy of the underlying didactic approach. They plan a small-scale pilot IRT involving exclusively mathematical topics from grade 7. A standardized treatment effect of d = 0.15 is considered meaningful, representing around one half of the expected annual growth in mathematics for grades 6–7 in the German student population (Brunner, Stallasch et al., 2023, Table 1). The team’s objective, therefore, is to sample enough seventh graders to detect MDES = 0.15 at α = 0.05 (two-tailed) with 1−β = 0.80, where PT = 0.50. The minimum required sample size (MRSS) to achieve this in an unconditional IRT design is N = 1397 students. Striving for parsimony and being aware of the potential virtue of covariate adjustment, the researchers plan to statistically control for IP. Before power analysis, they consult our flow chart (Fig. 5): since the IRT addresses the total population in lower secondary school and a specific grade and domain analyzed in our study, the team is guided to Table E2 (panel a) that lists the suitable empirically estimated single-level design parameters. Inserting \({R}_{\textrm{T}\mid \textrm{IP}}^2\) = 0.53, the researchers find that the MRSS more than halves to N = 654 when adjusting for IP. They then think about optimizing the design by additionally including either a reading CP or SC, where \({R}_{\textrm{T}\mid \textrm{IP}+\textrm{CP}}^2\) = 0.56 and \({R}_{\textrm{T}\mid \textrm{IP}+\textrm{SC}}^2\) = 0.55. The MRSS further reduces to N = 630 when combining IP with SC and to N = 622 when combining IP with CP. They decide to administer both a mathematics and reading test. The team wants to account for uncertainty in \({R}_{\textrm{T}\mid \textrm{IP}+\textrm{CP}}^2\). To this end, they determine the 95% CI by means of SE (\({R}_{\textrm{T}\mid \textrm{IP}+\textrm{CP}}^2\)) = 0.01: the lower bound is calculated as 0.56 − 1.96 × 0.01 = 0.54 and the upper bound as 0.56 + 1.96 × 0.01 = 0.57, which leads to an MRSS range of 649 ≥ N ≥ 596. Consequently, when opting for a conservative approach and sampling N = 649 students, it is fairly certain that the IRT will be sensitive to uncover a (truly existing) treatment effect of d = 0.15 with IP and CP as covariates.
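The MRSS values in this scenario follow from solving Eq. (2) for N; a simple search over N, sketched below, approximately reproduces the reported numbers (small deviations can arise from rounding of the design parameters). The function name and search strategy are ours, not the scenario’s actual tooling.

```r
# Minimal sketch: smallest N for which the IRT's MDES (Eq. 2) falls below the targeted effect size.
mrss_irt <- function(target_mdes, R2 = 0, P = 0.50, Q = 0,
                     alpha = 0.05, power = 0.80, N_max = 100000) {
  for (N in (Q + 4):N_max) {
    df <- N - Q - 2
    M  <- qt(1 - alpha / 2, df) + qt(power, df)
    if (M * sqrt((1 - R2) / (P * (1 - P) * N)) <= target_mdes) return(N)
  }
  NA
}

mrss_irt(target_mdes = 0.15)                      # unconditional design
mrss_irt(target_mdes = 0.15, R2 = 0.53, Q = 1)    # adjusting for IP
mrss_irt(target_mdes = 0.15, R2 = 0.56, Q = 2)    # adjusting for IP and CP
```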

Multilevel Perspective

As a second step, the researchers aim to scrutinize the effectiveness of the full app in students' usual school routine. They plan a large-scale 3L-CRT involving the complete spectrum of domains for grades 5–10. An effect of \(d\) = 0.11 is considered reasonable, approximating half of the average academic year-to-year growth observed across lower secondary school in Germany (Brunner, Stallasch, et al., 2023, Table 1). For logistical reasons, the total sample is restricted to a maximum of K = 400 schools, with nL2 = 20 and JL3 = 3. The team's primary concern is thus to achieve sufficient power (i.e., 1−β ≥ 0.80) to detect MDES = 0.11 at α = 0.05 (two-tailed), where PL3 = 0.50. Since the 3L-CRT addresses the total population in lower secondary school but neither a specific grade nor domain, our flow chart (Fig. 5) directs them to Table F2 (panel c), which lists the suitable meta-analytically integrated three-level design parameters. Entering Pooled ρ values at L2/L3 of 0.05/0.35 into the power analysis, the researchers learn that an unconditional 3L-CRT clearly undercuts the desired power rate (1−β = 0.43). They wonder which covariates to use: given the limited testing time, assessing multiple IPs is not a viable option. Instead, controlling for either Gf or SC seems most feasible, with Pooled R2 values at L1/L2/L3 of 0.07/0.16/0.86 for Gf and 0.04/0.12/0.77 for SC. Either Gf (1−β = 0.98) or SC (1−β = 0.92) would yield adequate power. However, when incorporating total design parameter heterogeneities (i.e., sampling error plus true variation) and adopting a (very) conservative approach by using the upper bounds of the 95% PIs of ρL2 = 0.07 and ρL3 = 0.50 and the lower bounds of the 95% PIs of \({R}_{\textrm{L}1\mid \textrm{Gf}}^2\) = 0.00, \({R}_{\textrm{L}2\mid \textrm{Gf}}^2\) = 0.00, \({R}_{\textrm{L}3\mid \textrm{Gf}}^2\) = 0.76, \({R}_{\textrm{L}1\mid \textrm{SC}}^2\) = 0.00, \({R}_{\textrm{L}2\mid \textrm{SC}}^2\) = 0.00, and \({R}_{\textrm{L}3\mid \textrm{SC}}^2\) = 0.56, only Gf (1−β = 0.81) likely guarantees enough power, as opposed to SC (1−β = 0.59). The team decides to collect students' Gf scores. Finally, the researchers wish to evaluate the long-term effects of the app; thus, a possible follow-up 3L-CRT with the same sample should still demonstrate adequate power. The suitable design parameters are Pooled ρL2 = 0.04, Pooled ρL3 = 0.38, and predicted values of \({R}_{\textrm{L}1\mid \textrm{Gf}-2}^2\) = 0.01, \({R}_{\textrm{L}2\mid \textrm{Gf}-2}^2\) = 0.22, \({R}_{\textrm{L}3\mid \textrm{Gf}-2}^2\) = 0.81, as well as \({R}_{\textrm{L}1\mid \textrm{Gf}-4}^2\) = 0.01, \({R}_{\textrm{L}2\mid \textrm{Gf}-4}^2\) = 0.12, \({R}_{\textrm{L}3\mid \textrm{Gf}-4}^2\) = 0.79. Assuming no attrition over time, the team calculates 1−β = 0.95/0.94 for a 2-/4-year-lagged Gf. Consequently, even when reevaluating the app's impact 4 years later, the 3L-CRT with Gf as a covariate will likely be adequately powered.
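The corresponding three-level power values can be retraced with the base-R sketch below. The helper power_cra3 and its argument layout are ours for illustration and mirror, rather than reproduce, the PowerUpR routines; the formula is the standard expression for a 3L-CRT with K schools, J classrooms per school, and n students per classroom.

```r
# Illustrative base-R stand-in for the three-level CRT power computation
# (the reported analyses use PowerUpR). K = schools, J = classrooms per school,
# n = students per classroom, P = proportion of schools treated,
# g3 = number of school-level covariates.
power_cra3 <- function(es, K, J, n, rho2, rho3,
                       R2_1 = 0, R2_2 = 0, R2_3 = 0,
                       alpha = .05, P = .50, g3 = 0) {
  se <- sqrt((rho3 * (1 - R2_3)) / (P * (1 - P) * K) +
             (rho2 * (1 - R2_2)) / (P * (1 - P) * K * J) +
             ((1 - rho2 - rho3) * (1 - R2_1)) / (P * (1 - P) * K * J * n))
  df     <- K - g3 - 2
  tcrit  <- qt(1 - alpha / 2, df)
  lambda <- es / se                        # noncentrality parameter
  1 - pt(tcrit, df, ncp = lambda) + pt(-tcrit, df, ncp = lambda)
}

# Unconditional 3L-CRT (K = 400, J = 3, n = 20): power of about 0.43
power_cra3(es = .11, K = 400, J = 3, n = 20, rho2 = .05, rho3 = .35)
# Adjusting for Gf with R2 at L1/L2/L3 = .07/.16/.86: power of about 0.98
power_cra3(es = .11, K = 400, J = 3, n = 20, rho2 = .05, rho3 = .35,
           R2_1 = .07, R2_2 = .16, R2_3 = .86, g3 = 1)
```

The follow-up power values for the 2- and 4-year-lagged Gf are obtained by swapping in the respective ρ and R2 values.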

Part II: Precision Simulations—Assessing Design Sensitivity via the MDES

Method

We briefly sketch the applied methods here (see OSM D for details). We used R 4.2.2 (R Core Team, 2022); package versions are noted in the R scripts.

Procedure

We adopted a hybrid Bayesian-classical approach to power analysis (Spiegelhalter et al., 2004; see also Pek & Park, 2019). To this end, we took advantage of the (joint) empirical distribution of single- and multilevel design parameters estimated in stage 1 of Part I to simulate MDES distributions for small, medium, and large IRTs and CRTs.

Simulation Conditions

We established typical sample sizes of educational RTs by drawing on data from Lortie-Forgues and Inglis's (2019)Footnote 16 review. We computed normative distributions (i.e., percentiles P) across K and categorized P10(K) = 14 as small, P50(K) = 46 as medium, and P90(K) = 100 as large, where nL3 = 46 students per school. Sample sizes at L2 were not available; we assumed JL3 = 2, resulting in nL2 = 23. It followed that N = 644/2116/4600 for small/medium/large RTs.Footnote 17 We assumed α = 0.05 (two-tailed), 1−β = 0.80, and PT = PL3 = 0.50.
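For transparency, the assumed sample size grid translates into total N as follows (a trivial base-R sketch with the values stated above):

```r
# Assumed sample size grid: K schools from the percentile categorization above,
# J_L3 = 2 classrooms per school, n_L2 = 23 students per classroom.
sizes   <- data.frame(design = c("small", "medium", "large"),
                      K = c(14, 46, 100), J = 2, n = 23)
sizes$N <- sizes$K * sizes$J * sizes$n     # 644, 2116, 4600 students
sizes
```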

Expressing Uncertainty in Design Parameters

Random noise in ρ and R2 can be incorporated into power analysis in a number of ways. One method is to enter the bounds of their (meta-analytic) 95% CIs/PIs, as illustrated above in Part I (section "Application"). Here, we apply a hybrid technique that models uncertainty implicitly, following a Bayesian rationale: ρ and R2, together with their SEs, are treated as (informative) prior distributions, which are then used to perform Monte Carlo simulations within the frequentist framework (see e.g., Moerbeek & Teerenstra, 2016, pp. 211–213). Specifically, for each set of connected design parameters (e.g., in three-level designs, ρL2, ρL3, \({R}_{\textrm{L}1}^2\), \({R}_{\textrm{L}2}^2\), and \({R}_{\textrm{L}3}^2\) for a certain outcome are interrelated), we specified a multivariate normal distribution. The mean vector comprised the point estimates of ρ and R2, the variances their squared SEs, and the covariances were derived assuming an intercorrelation of r = 0.90, as a conservative upper-bound guess of the dependencies. Using the SimDesign package (Chalmers & Adkins, 2020), we then generated 100 draws from each multivariate design parameter prior distribution.
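A minimal sketch of this step is shown below, using placeholder point estimates and SEs (the empirical values are compiled in OSM E and F) and MASS::mvrnorm in place of the SimDesign machinery used for the reported simulations.

```r
# Minimal sketch of the design parameter prior draws (placeholder inputs).
library(MASS)

mu <- c(rho2 = .05, rho3 = .35, R2_1 = .07, R2_2 = .16, R2_3 = .86)  # point estimates
se <- c(.01, .04, .01, .03, .03)                                     # their SEs (illustrative)
r  <- 0.90                                  # assumed intercorrelation of the parameters

Sigma <- r * tcrossprod(se)                 # covariances: r * SE_i * SE_j
diag(Sigma) <- se^2                         # variances: squared SEs

set.seed(1)
draws <- mvrnorm(n = 100, mu = mu, Sigma = Sigma)   # 100 draws from the prior
draws <- pmin(pmax(draws, 0), 1)            # clamp to [0, 1]; an optional safeguard,
                                            # not part of the published procedure
head(draws)
```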

Calculating the MDES

For each draw, we computed the MDES based on Eqs. (2), (6), and (7) employing the PowerUpR package (Bulus et al., 2021).
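A base-R companion sketch of this step for a three-level design is given below; the reported computations rely on PowerUpR, whereas the helper mdes_cra3 is ours and follows the standard 3L-CRT MDES formula. It reuses the draws object from the previous sketch.

```r
# Companion sketch of the MDES computation for a three-level CRT (illustrative).
mdes_cra3 <- function(K, J, n, rho2, rho3, R2_1 = 0, R2_2 = 0, R2_3 = 0,
                      alpha = .05, power = .80, P = .50, g3 = 0) {
  df <- K - g3 - 2
  M  <- qt(1 - alpha / 2, df) + qt(power, df)          # precision multiplier
  M * sqrt((rho3 * (1 - R2_3)) / (P * (1 - P) * K) +
           (rho2 * (1 - R2_2)) / (P * (1 - P) * K * J) +
           ((1 - rho2 - rho3) * (1 - R2_1)) / (P * (1 - P) * K * J * n))
}

# Medium unconditional CRT (K = 46, J = 2, n = 23) with the pooled lower
# secondary school rho values from the Application section: MDES of about 0.53
mdes_cra3(K = 46, J = 2, n = 23, rho2 = .05, rho3 = .35)

# Applied to every prior draw from the previous sketch:
mdes_draws <- apply(draws, 1, function(d)
  mdes_cra3(K = 46, J = 2, n = 23, rho2 = d["rho2"], rho3 = d["rho3"],
            R2_1 = d["R2_1"], R2_2 = d["R2_2"], R2_3 = d["R2_3"], g3 = 1))
summary(mdes_draws)
```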

Gauging Sensitivity

For the variance–covariance matrices defined for the multivariate normal distributions, we ran sensitivity analyses over r ∈ {0.00, 0.05, …, 0.95} to guard against misspecification of the dependencies among ρ and R2.
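A compact sketch of this sensitivity loop, continuing the sketches above (it reuses mu, se, and mdes_cra3 defined there), might look as follows:

```r
# Sensitivity loop over the assumed parameter intercorrelation r.
r_grid <- seq(0, 0.95, by = 0.05)
mdes_by_r <- lapply(r_grid, function(r) {
  Sigma <- r * tcrossprod(se)
  diag(Sigma) <- se^2
  d <- MASS::mvrnorm(100, mu = mu, Sigma = Sigma)
  apply(d, 1, function(x)
    mdes_cra3(K = 46, J = 2, n = 23, rho2 = x["rho2"], rho3 = x["rho3"],
              R2_1 = x["R2_1"], R2_2 = x["R2_2"], R2_3 = x["R2_3"], g3 = 1))
})
sapply(mdes_by_r, median)   # compare the MDES summaries across the r grid
```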

Key Results

We present major patterns in MDES distributions for small, medium, and large IRTs and CRTs (i.e., 3L-CRTs in grades 1–10 and 2L-CRTs in grades 11–12) for the total student population, illustrated in Figs. 6 and 7 (which we refer to in this section; see OSM D for result plots of school tracks and 2L-CRTs in grades 1–10, and OSM G for the full data table of simulated design parameters along with their MDES statistics). Generally, in all simulation conditions, we observed substantive variation in the MDES—between and within outcomes. Further, MDES distributions for small RTs tended to be more sensitive to design parameter uncertainties, and therefore appeared more broadly dispersed than those for large RTs.

Fig. 6 MDES distributions for small, medium, and large IRTs and CRTs: covariate types and covariate combinations

Fig. 7 MDES distributions for small, medium, and large IRTs and CRTs: covariate time lags

Covariate Types: Bandwidth-Fidelity

Single-Level Perspective

In a medium IRT, MDESIRT was 0.12 without covariates. Precision was then moderately improved by covariate adjustment, most strongly by IP, with the median MDESIRT|IP equaling 0.10 in elementary and upper secondary school and 0.09 in lower secondary school. Notably, the percentage MDES reduction for a given covariate type remained constant across IRT sizes. Since precision is a positive function of sample size, absolute MDES gains were stronger in small than in large IRTs; moreover, covariate adjustment reached a point of diminishing returns as sample size increased. For instance, in elementary school, SC somewhat raised precision in an IRT with N = 644 (MDESIRT = 0.22 vs. Mdn(MDESIRT|SC) = 0.20) but not with N = 4600 (MDESIRT = Mdn(MDESIRT|SC) = 0.08).

Multilevel Perspective

In a medium CRT, the median MDESCRT was 0.35/0.53/0.32 without covariates in elementary/lower secondary/upper secondary school. In lower secondary school, all covariate types strongly boosted median precision, first and foremost IP (MDES3L-CRT|IP = 0.15), followed by CP, Gf, and SC (in that order). Likewise, in upper secondary school, IP markedly reduced the MDES2L-CRT to around 0.19, about twice the reduction achieved by the remaining covariates. In elementary school, SC in particular yielded reasonable average precision improvements (MDES3L-CRT|SC = 0.26), while Gf performed the poorest (MDES3L-CRT|Gf = 0.33). Proportionally, the impact of the covariates strengthened somewhat with CRT size: for example, in elementary school, SC reduced the MDES by about 24% in small CRTs (Mdn(MDES3L-CRT) = 0.68 vs. Mdn(MDES3L-CRT|SC) = 0.52) and by about 27% in large CRTs (Mdn(MDES3L-CRT) = 0.24 vs. Mdn(MDES3L-CRT|SC) = 0.17). Meanwhile, as with the IRTs, absolute MDES reductions were still (far) more pronounced with K = 14 than with K = 100.

Covariate Combinations: Incremental Validity

Single-Level Perspective

In a medium IRT, the additional inclusion of CP over and above IP led to notable MDES drops, but only in elementary/lower secondary school (Mdn(MDESIRT|IP+CP) = 0.09/0.08). In these grade levels, no other combination resulted in further improvements. In upper secondary school, only the complete covariate battery yielded genuine precision benefits; the median MDESIRT|IP+CP+Gf+SC was around 0.09.

Multilevel Perspective

In a medium CRT targeted at first to fourth graders, adding SC to IP raised precision the most (Mdn(MDES3L-CRT|IP+SC) = 0.22), with no further gains through the full covariate array. From grade 5 onward, we did not detect any MDES improvements from pairing IP with CP or Gf. Similarly, the addition of SC, alone or together with CP and Gf, returned only minuscule additional MDES declines.

Covariate Time Lags: Validity Degradation

Single-Level Perspective

In a medium IRT, precision was only slightly affected by temporal validity decrement in IP: we observed the maximum decrement in elementary school, with ∆MDESIRT|IP equaling +0.02 (from the shortest to the longest time lag). For CP, precision diminished only in upper secondary school, and only late, after 7 years (∆Mdn(MDESIRT|CP−7) = +0.01). Of note, precision was more prone to validity deterioration in IP and CP in small than in large IRTs (e.g., Mdn(MDESIRT|IP−2) = 0.17/0.07 and Mdn(MDESIRT|IP−7) = 0.19/0.07 with N = 644/4600 in upper secondary school). By contrast, MDESIRT|Gf consistently remained highly stable.

Multilevel Perspective

In a medium CRT, the median MDESCRT was 0.35/0.54/0.32 without covariates in elementary/lower secondary/upper secondary school. The MDES fluctuated somewhat with growing pre-posttest time lags: subtracting the median values for the shortest from those for the longest time gaps yielded ∆MDESCRT|IP = +0.02/+0.03/+0.01, ∆MDESCRT|CP = −0.06/+0.01/±0.00, and ∆MDESCRT|Gf = −0.05/+0.02/±0.00. As with the IRTs, cross-time precision decay appeared more pronounced in small than in large CRTs (e.g., Mdn(MDES2L-CRT|IP−2) = 0.45/0.16 and Mdn(MDES2L-CRT|IP−7) = 0.48/0.16 for K = 14/100 in upper secondary school).

Discussion

Worldwide, the prevalence of educational RTs has been growing sharply (Connolly et al., 2018; Raudenbush & Schwartz, 2020). Reliable knowledge on the effectiveness of programs and innovations to bolster student learning—the foundation of evidence-based policies and practices in education (Hedges, 2018)—requires well-designed IRTs and CRTs that are sensitive enough to detect true intervention effects. Highly prognostic covariates are key elements of strong designs; yet, choosing them can be challenging and involves both theoretical and empirical considerations. Our study sought to expand substantive guidance to support informed covariate selection and power analysis for IRTs and CRTs on student achievement: inspired by three psychometric heuristics (the bandwidth-fidelity dilemma, the incremental validity concept, and the validity degradation principle) and using longitudinal large-scale assessments from Germany, we analyzed unique, relative, and incremental covariate impacts on design sensitivity. Part I provided a wealth of (meta-analytically integrated) single- and multilevel design parameters; Part II presented a simulation study generating plausible MDES distributions for educational RTs.

Expanding the Range of Designs

We scrutinized covariates in IRTs as well as in 2L- and 3L-CRTs. In doing so, our study is unique in covering a large array of the experimental designs implemented to determine the effectiveness of educational interventions (Connolly et al., 2018; Spybrook et al., 2016).

The first central message from our analyses is as follows: In IRTs, the covariates' effects on design sensitivity largely confirmed the psychometric heuristics; in CRTs, usually all of the covariates noticeably boosted design sensitivity, even in the long term. From a single-level perspective, the higher the fidelity, the lower the bandwidth, and the shorter the pre-posttest time lag of a covariate, the more variance it explained between individual students, and the greater the returns in design sensitivity. Thus, the psychometric heuristics are indeed useful to inform covariate choices in IRTs. From a multilevel perspective, however, the relations are not always as straightforward. Fortunately, researchers have much more flexibility when choosing covariates for CRTs: all covariates under investigation, regardless of their degree of bandwidth/fidelity and time gap to the outcome, markedly raised design sensitivity. This holds especially true throughout secondary school, where large proportions of between-school differences could be captured by any covariate. This phenomenon, in which aggregated measures tend to correlate much more strongly than their individual-level equivalents, has been described by scholars before (e.g., Bloom et al., 2007; Härnqvist et al., 1994; Robinson, 1950; Snijders & Bosker, 2012).

Expanding the Range of Covariate Types, Combinations, and Time Lags

Previous studies on covariate effects on design sensitivity have systematically analyzed 1- to 3-year-lagged IP, the most recent CP, as well as SC; the latter have been examined both on their own and over and above IP. We added Gf to the spectrum of covariate types, combined IP with CP or Gf as well as with CP plus Gf plus SC, and covered long pre-posttest time lags of up to 7 years. In doing so, we involve the most relevant precursors of students' learning trajectories (e.g., M. C. Wang et al., 1993) and respond to the needs arising from the features of RTs implemented in education (e.g., Connolly et al., 2018; Lortie-Forgues & Inglis, 2019).

The second central message from our analyses is as follows: Using the latest IP as the sole covariate demonstrated outstanding capacity to improve design sensitivity in both IRTs and CRTs. IP clearly outweighed all remaining covariate types, although its prognostic value was indeed often affected by temporal deterioration. This pattern replicated the one we identified in our meta-analytic research review. However, as noted above, there may be scenarios that necessitate switching to CP, Gf, or SC, even when assessed long before the target outcome, or that justify their additional inclusion. On a side note, the present values of \({R}_{\mid \textrm{CP}/\textrm{Gf}/\textrm{SC}}^2\) may also serve as lower-bound estimates when pre-posttest content alignment is less than perfect (Bloom et al., 2007, p. 41). The effectiveness of CP, Gf, and SC in tweaking design sensitivity depended on several factors, first and foremost the grade level. Controlling for CP or Gf was a reasonable (alternative) strategy for RTs implemented in lower secondary school. Of importance, Gf appeared to be an exceptionally time-stable predictor, even across numerous years and irrespective of the design. Thus, the idea that Gf may serve as a robust covariate in RTs spanning several years—supported by existing single-level evidence—was generalized to multilevel settings in the present study for the first time. SC, in contrast, performed well as covariates, particularly in elementary school and occasionally also in upper secondary school. Incremental returns of CP, Gf, and/or SC over and above IP were often negligible, largely consonant with previous studies. As an exception, additionally taking SC into account in CRTs with first to fourth graders seems to be a relatively safe option to boost design sensitivity. Consequently, researchers should always consider the cost-effectiveness of covariates beyond IP with regard to the specific application context.

Expanding the Range of Outcome Domains

The bulk of available design parameter resources to guide covariate choices focuses on core domains, namely mathematics and science as STEM outcomes and reading as a verbal outcome. We complemented the STEM outcomes with ICT and the verbal outcomes with grammar, spelling, vocabulary, and writing. In doing so, we acknowledge that RTs often seek to enhance skills in domains beyond the core domains (Lortie-Forgues & Inglis, 2019; Morrison, 2020).

The third central message from our analyses is as follows: Impacts of the covariates on design sensitivity varied widely between achievement outcomes. For almost all covariates, we observed large heterogeneities in the amounts of explained variance across domains (and, if applicable, samples). Heterogeneity was mostly due to true variation at the level of effect sizes. This observation coincides with the findings of past studies (see also Brunner et al., 2018; Stallasch et al., 2021). Likewise, our simulations emphasize that MDES distributions were considerably dispersed; benefits in precision also strongly hinged on the outcome. Hence, researchers should always strive for an ideal fit between design parameters and the intervention’s target outcome. Yet, circumstances may limit this endeavor, such as the unavailability of suitable estimates for a specific domain. Here, our meta-analytic results may inform researchers of possible design parameter ranges and can be used in power analysis to determine expected lower and upper bounds of sample sizes, power rates, or MDES values.

Expanding the Range of National Scopes

Most evidence on sensitivity-enhancing covariate effects is restricted to the United States. We accumulated design parameters drawing on longitudinal large-scale assessment data from six German samples covering the entire school career (i.e., grades 1–12) of the total student population, as well as the student populations in the academic and non-academic tracks. In doing so, we meet the demands of a vast number of RTs that are conducted in countries where the school system more closely resembles the German system (e.g., with respect to the onset of school type tracking; Connolly et al., 2018).

The fourth central message from our analyses is as follows: The covariates' capabilities to raise design sensitivity cannot be universally generalized across national education contexts. We found notable differences in multilevel design parameters based on data from German versus US samples. In the German data, explained variances at L3 often appeared more pronounced throughout secondary school, whereas the reverse held at L2. This might be due to the fact that in the tracked German secondary school system, ρL3 tends to be larger and ρL2 smaller than previously reported for the United States (see Stallasch et al., 2021). Similar country-specific patterns in multilevel design parameters have also been documented in cross-national works (Brunner et al., 2018; Kelcey et al., 2016). It is therefore of utmost importance that researchers rely on variance estimates that best depict the characteristics of the intervention's target population.

Essentials of Covariate Adjustment in RTs on Student Achievement at a Glance

Our analyses imply the following general recommendations on pre-treatment covariate inclusion in IRTs and CRTs on student achievement in the German (and similar) school context.

1. A pretest should substantively match the RT's target outcome as closely as possible. In particular, when the outcome is well-aligned with the content of the intervention (e.g., high curricular validity or sensitivity to depict instructional effects), a pretest in the outcome domain may be favorable over one in another domain.

2. A pretest should have high fidelity/low bandwidth rather than low fidelity/high bandwidth. Thus, a domain-specific pretest may be preferable to a domain-general one.

3. A pretest in fluid intelligence may be considered in—especially long-term—RTs in lower secondary school (grades 5–10).

4. Sociodemographic measures may be used in elementary school RTs (grades 1–4).

5. If a pretest in the outcome domain is available, precision gains through additional covariates are often negligible, except for point 4.

6. In IRTs, a pretest in the outcome domain should be granted priority, despite its potential temporal validity degradation. In CRTs—especially in secondary school (grades 5–12)—cost issues should be brought to the fore, as any covariate may be beneficial.

7. Uncertainty in the design parameters should be taken into account, for example via (meta-analytic) 95% CIs/PIs or simulations based on empirical prior distributions.

In addition, we urge researchers planning RTs to keep the following factors in mind:

8. Measurement error in the covariate(s) and/or outcome typically attenuates single- and multilevel R2 (Cohen et al., 2003; Raudenbush & Bryk, 2002).Footnote 18 Reliable measures are expedient for power and precision in RTs (Maxwell et al., 1991). To handle reliability issues, one may (a) add items when newly developing measures, which should reduce measurement error, (b) use a plausible values approach when estimating test scores (Blackwell et al., 2017), or (c) apply latent variable models when analyzing treatment effects in IRTs (Bollen, 2002; Mayer et al., 2016) and CRTs (Lüdtke et al., 2008; Raudenbush & Bryk, 2002), which partial out measurement error. If none of these options is feasible, adjusting for (even error-prone) covariates is—as a rule—still advisable to raise design sensitivity in RTs (Maxwell et al., 2017, p. 481); a minimal attenuation sketch follows this list.

9. In CRTs, aggregated L1 covariates that demonstrate large ρ values at the implementation level of the intervention are advantageous. When developing new measures for the CRT, it may be worthwhile to use pilot data of ρ estimates for each item to construct multi-item scales that optimize between-group differentiation (Bliese et al., 2019).

10. Participant attrition during RT implementation hampers the prognostic properties of covariates (Rickles et al., 2018). Hence, planning RTs as conservatively as possible given available financial and personnel resources may be reasonable.
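As referenced in point 8, a back-of-the-envelope illustration of classical attenuation (assuming classical test theory with an error-free outcome and a single error-prone covariate; not part of the reported analyses) might read as follows:

```r
# Classical attenuation: with an error-free outcome and a covariate of
# reliability rtt, the observed R2 shrinks to roughly R2_true * rtt.
attenuated_R2 <- function(R2_true, rtt) R2_true * rtt
attenuated_R2(R2_true = .53, rtt = .77)   # .53 shrinks to about .41
```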

Limitations

Our work has several shortcomings. First, this study's results may generalize well to RT target populations, measures, and design characteristics mimicking those addressed here. More precisely: (a) The ideal application context is the German school system; yet, our design parameters may still serve as valuable benchmarks to plan RTs in countries with similar performance-based school tracking, such as Austria, the Czech Republic, Hungary, the Slovak Republic, and Turkey (Salchegger, 2016). Relatedly, we drew on rather heterogeneous samples. Higher homogeneity (e.g., with convenience samples, as is typical in educational RTs; Tipton & Olsen, 2018) may lead to smaller ρ and R2 due to range restrictions (Miciak et al., 2016). Further, since we chose not to apply sampling weights, our findings are quasi-representative, and the estimates as well as their SEs are—albeit presumably only slightly (see e.g., Wenger et al., 2018)—less accurate than weighted ones. (b) The present design parameters optimally match measures resembling those used in NEPS, PISA, or DESI. Caution is warranted when designing RTs relying on (substantively) divergent outcome or covariate measures. Importantly, in view of the observed temporal validity degradation, somewhat larger/smaller R2 values are expected for shorter/longer pre-posttest time intervals. (c) The MDES formulae assume homoscedasticity (i.e., that the treatment and control groups share a common outcome variance). For our simulation conditions with (fully) balanced designs, statistical tests for the treatment effect resting on this assumption have been shown to be robust, even under heteroscedasticity (Blanca et al., 2018; Korendijk et al., 2008). In unbalanced designs, however, the MDES values would probably be too optimistic (see e.g., Gail et al., 1996).

Second, our selection of covariates was theoretically and empirically oriented. Yet, as noteworthy amounts of variance remained unexplained for many outcomes, other individual- or group-level attributes might also function as profitable covariates. For example, socioemotional characteristics—representing important curricular targets (OECD, 2015)—such as domain-specific motivation (Levy et al., 2023), self-concept (Wu et al., 2021), or anxiety (Aldrup et al., 2020) have been shown to predict student achievement over and above IP. Of note, the fully documented R code and our R package multides (Stallasch, 2024) enable researchers to produce R2 values for additional covariate sets specifically relevant for their prospective RT, drawing on (publicly) available datasets.

Third, we used reasoning ability as assessed by standard figural matrices as our measure of fluid intelligence. Fluid intelligence, however, is a multifaceted construct that encompasses—besides reasoning as one integral component—various further abilities such as perceptual speed, accuracy, and problem solving (e.g., Baltes et al., 1999; Cattell, 1987; see also Brunner et al., 2014). Therefore, \({R}_{\mid \textrm{Gf}}^2\) values should be interpreted as lower-bound estimates and would possibly have been larger with a broader spectrum of subtests.

Fourth, virtually no measure in the social sciences is free from measurement error. Ours are no exception: reliabilities of the test scores were 0.51 ≤ rtt ≤ 0.96 (Mdn(rtt) = 0.77; Table C11). These are fairly typical for applied (experimental) research, where rtt ≥ 0.70 is desirable, but even smaller values may suffice (Schmitt, 1996). Hence, although fallible measures may lower R2, our estimates may generalize well to realistic planning scenarios.

Conclusion

Inspired by psychometric heuristics and capitalizing on rich data from several German longitudinal large-scale assessments, we substantively expanded the body of knowledge on covariate impacts on design sensitivity in IRTs and CRTs on student achievement. Our study bundles an extensive compilation of (meta-analytic) single- and multilevel design parameters with a precision simulation study that implicitly incorporates uncertainty by adopting a Bayesian rationale. Our work is enriched by illustrative, empirically supported application guidance and comprehensive OSMs. We hope that these resources support evaluation researchers in making wise covariate selections when planning educational experiments to gather sound evidence on what works to advance student learning.