What works to advance student learning? There is a growing social and political call to answer this fundamental question on the basis of sound evidence (Slavin, 2020). This has put a spotlight on randomized trials (RTs), which allow for causal inferences about the actual effects of educational interventions (Whitehurst, 2012). Individually randomized trials (IRTs), which randomly assign individual students to experimental conditions, are imperative for conceiving and testing deliberate programs (e.g., Kelly et al., 2013). Validating a program’s benefit in real-life schooling then requires upscaling and implementation in ecologically valid settings (Campbell, 1957; e.g., Gersten et al., 2015). Nowadays, over half of educational RTs are cluster-randomized trials (CRTs), which randomly allocate groups of students, such as whole schools (Connolly et al., 2018). CRT designs not only reflect the fact that educational interventions ought to reach a broader student body and/or by definition operate at the group level (Bloom, 2005), but also map the natural nesting of students within classrooms and schools in institutional contexts (Konstantopoulos, 2012).

Irrespective of whether individual or intact groups of students form the unit of randomization, one constitutive feature of a methodologically high-quality RT is an adequate design sensitivity (Lipsey, 1990), meaning sufficient statistical power 1−β to detect a treatment effect at significance level α with a high level of statistical precision (e.g., Dong & Maynard, 2013; Hedges & Rhoads, 2010; Raudenbush et al., 2007). This poses a key challenge—not exclusively, but especially—when planning CRTs: their inherent multilevel data structure often dramatically restricts power and precision, and CRTs thus often require large sample sizes (Schochet, 2008). For instance, Stallasch et al. (2021, p. 193) show that a CRT requires 4080 students (68 schools, each with 3 classrooms of 20 students) to detect an effect of d = 0.25 on fourth graders’ mathematics achievement (α = 0.05, 1−β = 0.80). An IRT, in stark contrast, requires only 504 students to detect the same effect. In other words, everything else being equal, CRTs are much more resource-intensive than IRTs.

A promising technique to raise sensitivity in RT designs without inflating the sample size is to statistically control for pre-treatment covariates (e.g., Bloom et al., 2007; Kahan et al., 2014; Porter & Raudenbush, 1987; Raudenbush, 1997; Raudenbush et al., 2007).Footnote 1 In the example above, a mathematics pretest that explained 40%/35%/76% of the variance between students/classrooms/schools could reduce the CRT’s sample size requirements by almost two thirds to 1440 students (24 schools; Stallasch et al., 2021, p. 193). This scenario underscores that “well-chosen covariates do wonders for power” (Aberson, 2019, p. 135); yet, the effective value of a covariate is dictated by its prognostic performance. Scholars and agencies hence stress the importance of grounding the ideally preregistered decisions about covariate inclusion in a priori theoretical and empirical considerations that are tied to the specific research field in question (e.g., European Medicines Agency [EMA], 2015; Maxwell et al., 2017; Murray, 1998). Meanwhile, firm guidance on covariate choice is scarce (Pocock et al., 2002; Tafti & Shmueli, 2020), often not going beyond general recommendations for correlational thresholds (e.g., Bausell & Li, 2002, pp. 114–115; Cox & McCullagh, 1982; but see Bloom et al., 2007).

The overall aim of this two-part study is to build thorough empirical guidance on covariate selection to optimize design sensitivity in IRTs and CRTs on student achievement. To this end, we analyze single- and multilevel design parameters for a broad array of outcomes in grades 1–12 by capitalizing on large-scale assessment data from six German samples. Specifically, we quantify impacts of varying (a) covariate types (pretests in the outcome domain, a different domain, and fluid intelligence, as well as sociodemographic measures), their (b) combinations, and (c) time lags to the outcome (1–7 years), alongside the three psychometric heuristics of bandwidth-fidelity (Cronbach & Gleser, 1957), incremental validity (Sechrest, 1963), and validity degradation (Ghiselli, 1956; Humphreys, 1960). Our paper is organized as follows. The Introduction contains a quantitative research review in which we meta-analytically integrate the respective previous empirical evidence. In Part I, we estimate and meta-analytically integrate design parameters, and demonstrate their use in sample size and power computations. In Part II, we use the estimated design parameters in precision simulations to assess the actual covariate returns for the design sensitivity in RTs. This study is accompanied by an extensive OSF repository at https://osf.io/nhx4w. In addition to the full R code, it includes OSM A-G, with (A) expressions of single- and multilevel models; (B) methodology and results related to the quantitative research review; (C) methodology, further results, and manifold application scenarios of study planning related to Part I; (D) methodology and further results related to Part II, as well as interactive Excel workbooks compiling all (E) empirical, (F) meta-analytic, and (G) simulation results.

Statistical Underpinnings

Sufficient design sensitivity is a vital methodological quality criterion of rigorous research (American Psychological Association, 2020; Wilkinson & Task Force on Statistical Inference, 1999). It includes both statistical power and statistical precision (Zhang et al., 2023). Any RT should have an appropriate probability (commonly 80%, i.e., 1−β = 0.80; Cohen, 1988) to detect a true treatment effect.Footnote 2 The precision of an RT can be quantified by its minimum detectable effect size (MDES; Bloom, 1995, 2005), which depicts the smallest true standardized effect size that can be detected as statistically significant (at α) with power 1−β, given the sample size. Thus, a small MDES indicates high design sensitivity. The approximate MDES can be written as (Bloom, 2005, pp. 158–160; Dong & Maynard, 2013, pp. 31–32):

$$\text{MDES}={M}_{df}SE\left({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\right)/{\upsigma }_{\text{T}}$$
(1)

Mdf reflects the t-distributions specific to α and 1−β, with df degrees of freedom. For a two-tailed test, Mdf = tα/2 + t1−β, which converges to 2.8 when df ≥ 20, given α = 0.05 and 1−β = 0.80 (Bloom, 2006). The term \(SE\left({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\right)/{\upsigma }_{\text{T}}\) represents the standard error of the treatment effect \({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\), standardized by the (pooled) standard deviation σT of the achievement outcome Y in the total student population, with TG and CG referring to the treatment and control group, respectively. For instance, MDES = 0.25 means that a standardized treatment effect of at least one quarter of a student-level SD in the applied achievement test would be significant under sufficient power (Bloom et al., 2007).Footnote 3

As we show below, \(SE\left({\overline{Y} }_{\text{TG}}-{\overline{Y} }_{\text{CG}}\right)/{\upsigma }_{\text{T}}\) is a function of three factorsFootnote 4: (a) the sample size, (b) the allocation of the sample to the experimental conditions, and (c) so-called (multilevel) design parameters that quantify the unconditional (i.e., unadjusted) and conditional (i.e., covariate-adjusted) variance (components) in Y. IRT and CRT designs differ in their assumptions about the (in)dependence of the underlying student sample, a distinction with important implications for the MDES.

A single-level IRT randomizes individual students, so that students are sampled independently of each other (i.e., regardless of e.g., school affiliation). Eq. (1) then transforms to (Bloom, 2006, Eq. 14; Dong & Maynard, 2013, p. 45):

$${\text{MDES}}_{\text{IRT}}={M}_{df}\sqrt{\frac{1-{R}_{\text{T}}^{2}}{{P}_{\text{T}}\left(1-{P}_{\text{T}}\right)N}}$$
(2)

N is the total number of students (i.e., the sum of students n in TG and CG; N = nTG + nCG). Everything else being equal, the larger N, the smaller the MDES. PT denotes the proportion of students assigned to TG (i.e., PT = nTG / N), where PT = 0.50 (i.e., a balanced design; nTG = nCG) minimizes the MDES. The design parameter \({R}_{\textrm{T}}^2\) is of special interest in this study because it quantifies the amount of the total variance \({\upsigma}_{\textrm{T}}^2\) in Y that can be explained by covariates CT:

$${R}_{\text{T}}^{2}=\left({\upsigma }_{\text{T}}^{2}-{\upsigma }_{\text{T}|{\text{C}}_{\text{T}}}^{2}\right)/{\upsigma }_{\text{T}}^{2}$$
(3)

\({\upsigma }_{\text{T}|\mathrm{C}_{\text{T}}}^{2}\) symbolizes the conditional total student population’s variance of Y. df = N − QT − 2, where QT is the number of covariates CT.
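To make Eqs. (2) and (3) concrete, the following base-R sketch computes MDESIRT from N, PT, QT, and \({R}_{\text{T}}^{2}\); the function name, arguments, and defaults are ours for illustration and are not taken from the study’s code. With \({R}_{\text{T}}^{2}\) = 0 and N = 504, it approximately reproduces the introductory IRT example (MDES ≈ 0.25).

```r
# Minimal sketch of Eq. (2): MDES of an IRT for a two-tailed test.
# All names (mdes_irt and its arguments) are illustrative, not from the paper.
mdes_irt <- function(N, R2 = 0, P = 0.50, Q = 0, alpha = 0.05, power = 0.80) {
  df <- N - Q - 2                               # degrees of freedom
  M  <- qt(1 - alpha / 2, df) + qt(power, df)   # multiplier M_df = t_(alpha/2) + t_(1-beta)
  M * sqrt((1 - R2) / (P * (1 - P) * N))
}

mdes_irt(N = 504)                      # unconditional design
mdes_irt(N = 504, R2 = 0.40, Q = 1)    # one covariate with R^2_T = .40 (made-up value)
```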

Unlike an IRT, a multilevel CRT randomizes groups of students (e.g., whole schools). Consider a two-level CRT (2L-CRT) with students at level (L) 1 nested within schools at L3, and a three-level CRT (3L-CRT) with students at L1 nested within classrooms at L2 nested within schools at L3 (the school level is labeled L3 in both designs to keep the notation consistent). This clustering implies dependencies among selected subjects—students within the same classroom or school tend to be (often much) more similar than students from distinct ones (Schochet, 2008; Stallasch et al., 2021). The degree of within-cluster similarity is typically expressed by the multilevel design parameters ρL2 and ρL3 (i.e., the intraclass correlation coefficients at L2 and L3), that is, the proportions of \({\upsigma}_{\textrm{T}}^2\) in Y that lie between classrooms within schools and between schools, respectively:

$${\uprho }_{\text{L}2} ={\upsigma}_{\text{L}2}^{2}/{\upsigma}_{\text{T}}^{2}$$
(4)
$${\uprho }_{\text{L}3}={\upsigma}_{\text{L}3}^{2}/{\upsigma}_{\text{T}}^{2}$$
(5)

For a 2L-CRT, \({\upsigma}_{\textrm{T}}^2={\upsigma}_{\textrm{L}1}^2+{\upsigma}_{\textrm{L}3}^2\), and for a 3L-CRT, \({\upsigma}_{\textrm{T}}^2={\upsigma}_{\textrm{L}1}^2+{\upsigma}_{\textrm{L}2}^2+{\upsigma}_{\textrm{L}3}^2\), where \({\upsigma}_{\textrm{L}1}^2\), \({\upsigma}_{\textrm{L}2}^2\), and \({\upsigma}_{\textrm{L}3}^2\) are the unconditional variances in Y between students within classrooms in schools, between classrooms within schools, and between schools, respectively.

For a 2L-CRT with randomization at L3, Eq. (1) then transforms to (Bloom, 2006, Eq. 21; Dong & Maynard, 2013, p. 33):

$${\text{MDES}}_{2\text{L}-\text{CRT}}={M}_{df}\sqrt{\frac{{\uprho }_{\text{L}3}(1-{R}_{\text{L}3}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K}+\frac{(1-{\uprho }_{\text{L}3})(1-{R}_{\text{L}1}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K{n}_{\text{L}3}}}$$
(6)

For a 3L-CRT with randomization at L3, Eq. (1) transforms to (Bloom et al., 2008, Eq. 3; Dong & Maynard, 2013, p. 52):

$${\text{MDES}}_{3\text{L}-\text{CRT}}={M}_{df}\sqrt{\frac{{\uprho }_{\text{L}3}(1-{R}_{\text{L}3}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K}+\frac{{\uprho }_{\text{L}2}(1-{R}_{\text{L}2}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K{J}_{\text{L}3}}+\frac{(1-{\uprho }_{\text{L}3}-{\uprho }_{\text{L}2})(1-{R}_{\text{L}1}^{2})}{{P}_{\text{L}3}\left(1-{P}_{\text{L}3}\right)K{J}_{\text{L}3}{n}_{\text{L}2}}}$$
(7)

nL2 and nL3 are the average numbers of students within classrooms and schools, respectively, JL3 is the average number of classrooms within schools, and K is the number of schools (i.e., the sum of the numbers of schools in TG and CG; K = KTG + KCG). Generally, K exerts greater impact on the MDES than nL2 or nL3 and JL3: everything else being equal, the larger K, the smaller the MDES. PL3 is the proportion of schools assigned to the treatment condition (i.e., PL3 = KTG / K) with PL3 = 0.50 minimizing the MDES. Further, everything else held constant, the larger ρL2 and/or ρL3, the larger the MDES. Since ρL2 and ρL3 are fixed (i.e., they are properties of the clustered population that cannot be altered by the researcher), the multilevel design parameters \({R}_{\textrm{L}1}^2\), \({R}_{\textrm{L}2}^2\), and \({R}_{\textrm{L}3}^2\) are of particular importance in this study because they quantify the amounts of \({\upsigma}_{\textrm{L}1}^2\), \({\upsigma}_{\textrm{L}2}^2\), and \({\upsigma}_{\textrm{L}3}^2\) in Y that can be explained by covariates CL1 at the student, CL2 at the classroom, and CL3 at the school level, respectivelyFootnote 5:

$${R}_{\text{L}1}^{2}=\left({\upsigma }_{\text{L}1}^{2}-{\upsigma }_{\text{L}1|{\text{C}}_{\text{L}1}}^{2}\right)/{\upsigma }_{\text{L}1}^{2}$$
(8)
$${R}_{\text{L}2}^{2}=\left({\upsigma }_{\text{L}2}^{2}-{\upsigma }_{\text{L}2|{\text{C}}_{\text{L}2}}^{2}\right)/{\upsigma }_{\text{L}2}^{2}$$
(9)
$${R}_{\text{L}3}^{2}=\left({\upsigma }_{\text{L}3}^{2}-{\upsigma }_{\text{L}3|{\text{C}}_{\text{L}3}}^{2}\right)/{\upsigma }_{\text{L}3}^{2}$$
(10)

\({\upsigma }_{\text{L}1|{\text{C}}_{\text{L}1}}^{2}\), \({\upsigma }_{\text{L}2|{\text{C}}_{\text{L}2}}^{2}\), and \({\upsigma }_{\text{L}3|{\text{C}}_{\text{L}3}}^{2}\) signify the conditional between-student, -classroom, and -school variances, respectively. df = K − QL3 − 2, where QL3 is the number of covariates CL3.
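The following base-R sketch translates Eqs. (6) and (7) into functions for 2L- and 3L-CRTs with randomization at L3; the function and argument names are ours, and the ρ values in the usage example are hypothetical, not estimates from this study.

```r
# Minimal sketch of Eqs. (6) and (7): MDES of 2L- and 3L-CRTs randomized at L3.
mdes_crt2 <- function(K, n_L3, rho_L3, R2_L1 = 0, R2_L3 = 0,
                      P = 0.50, Q_L3 = 0, alpha = 0.05, power = 0.80) {
  df <- K - Q_L3 - 2
  M  <- qt(1 - alpha / 2, df) + qt(power, df)
  M * sqrt(rho_L3 * (1 - R2_L3) / (P * (1 - P) * K) +
           (1 - rho_L3) * (1 - R2_L1) / (P * (1 - P) * K * n_L3))
}

mdes_crt3 <- function(K, J_L3, n_L2, rho_L2, rho_L3,
                      R2_L1 = 0, R2_L2 = 0, R2_L3 = 0,
                      P = 0.50, Q_L3 = 0, alpha = 0.05, power = 0.80) {
  df <- K - Q_L3 - 2
  M  <- qt(1 - alpha / 2, df) + qt(power, df)
  M * sqrt(rho_L3 * (1 - R2_L3) / (P * (1 - P) * K) +
           rho_L2 * (1 - R2_L2) / (P * (1 - P) * K * J_L3) +
           (1 - rho_L3 - rho_L2) * (1 - R2_L1) / (P * (1 - P) * K * J_L3 * n_L2))
}

# Example: 68 schools, 3 classrooms per school, 20 students per classroom (hypothetical rho values)
mdes_crt3(K = 68, J_L3 = 3, n_L2 = 20, rho_L2 = 0.05, rho_L3 = 0.20)
```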

Estimates of σ2 can be obtained through (multilevel) regression (see OSM A). For both IRTs and CRTs, larger R2 values generally result in smaller MDES values. Adjusting for highly prognostic baseline covariates is thus widely recognized and explicitly recommended to improve design sensitivity and to address chance covariate imbalance (e.g., Coens et al., 2020; EMA, 2015; Moerbeek & Teerenstra, 2016; Porter & Raudenbush, 1987; Raudenbush et al., 2007). Omitting factors which are strongly predictive but not equated between experimental groups may severely bias treatment effect estimates, impair power, and inflate type I error rates (Ciolino et al., 2019; Yang et al., 2020). At the same time, if not done correctly, covariate adjustment has some pitfalls in special cases: First, adjustment is worthless when the loss in df incurred by each covariate (in CRTs, at the top hierarchical level) outweighs the gain in precision (Kahan et al., 2014; Moerbeek & Teerenstra, 2016). This situation, however, is very rare; the loss in df is most often without (practical) consequence unless the sample size is very small (Konstantopoulos, 2012; Maxwell et al., 2017, p. 501). Second, adjustment might be detrimental when the assumption of covariate-treatment orthogonalityFootnote 6 is (severely) violated. This risk is amplified with covariates measured after randomization, which could therefore be affected by the treatment, as well as in (very) small-sized RTs, (highly) unbalanced designs (i.e., nTG ≠ nCG and/or in CRTs, unequal cluster sizes), and with (much) missing data on covariates (Kahan et al., 2014; Lin, 2013; Moerbeek, 2006; J. Wang, 2020). Violations of covariate-treatment orthogonality may be compensated for in the RT design stage by imposing further balancing methods (e.g., matching, minimization, stratification; Moerbeek & Teerenstra, 2016), and in the RT analysis stage by modeling covariate-treatment interactions, optimally using a robust SE estimator (Lin, 2013; J. Wang, 2020). Either way, it is of utmost importance to exclusively control for carefully a priori selected pre-treatment covariates. Non-prognostic, poorly chosen, or post-treatment covariates likely act as “bad controls” that pose a threat to the validity of results (Cinelli et al., 2022; Kahan et al., 2014; Moerbeek, 2006; Montgomery et al., 2018; Porter & Raudenbush, 1987).
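As an illustration of the analysis-stage remedy just mentioned (covariate-treatment interactions with a robust SE estimator in the spirit of Lin, 2013), the following sketch uses the sandwich and lmtest packages on simulated data; the data, variable names, and the HC2 choice are our assumptions and do not reproduce the study’s analysis code.

```r
# Minimal sketch of interaction-adjusted treatment effect estimation with robust (HC2) SEs.
# Data and variable names are made up for illustration.
library(sandwich)
library(lmtest)

set.seed(1)
d <- data.frame(
  y       = rnorm(200),
  treat   = rep(0:1, each = 100),
  pretest = rnorm(200)
)
d$pretest_c <- d$pretest - mean(d$pretest)          # center the covariate

fit <- lm(y ~ treat * pretest_c, data = d)          # main effects plus covariate-treatment interaction
coeftest(fit, vcov. = vcovHC(fit, type = "HC2"))    # treatment coefficient with a robust SE
```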

Theoretical and Empirical Considerations on Covariate Selection

Well-founded decisions on pre-treatment covariates are key to designing strong RTs. Scholars and agencies agree that these ideally preregistered decisions should be justified by both substantive theory and empirical results (Committee for Proprietary Medicinal Products, 2004; Cook, 2005; EMA, 1998, 2015; Food and Drug Administration, 2021; Maxwell et al., 2017; Moerbeek & Teerenstra, 2016; Murray, 1998; Raab et al., 2000; Tafti & Shmueli, 2020; Wright et al., 2015). Following this recommendation, the present paper draws on prominent models of school learning (Haertel et al., 1983; M. C. Wang et al., 1993) and connects to and expands upon previous empirical studies that examine the impact of covariates on design sensitivity in RTs with student achievement as the target outcome (for an overview, see Stallasch et al., 2021): specifically, student achievement is a multifaceted, complex construct influenced by a myriad of cognitive and non-cognitive (e.g., motivational or sociodemographic) factors (see also Steinmayr et al., 2014; Winne & Nesbit, 2010), of which the following were highlighted as the most important. First, a measure of prior knowledge in the same domain as the outcome (e.g., previous mathematics skills predicting future mathematics skills), which we refer to as a domain-identical pretest (IP), is known to shape performance trajectories (e.g., Ausubel, 1968; Dochy et al., 1999). This view is rooted in the assumption that one’s pre-existing knowledge base fundamentally molds input integration during knowledge acquisition (Brod, 2021; Woolfolk, 2020). Second, a measure of cognitive prerequisites in a certain domain may also explain achievement differences in another domain (e.g., previous language or reading skills predicting future mathematics skills; Peng et al., 2020; Ünal et al., 2023), which we refer to as a cross-domain pretest (CP). This idea is supported by the fact that scores from distinct achievement tests are often highly correlated (Baumert et al., 2009), reflecting the operation of a common cognitive capacity (often described as the g factor; Jensen, 1993) or the relevance of a specific ability to tasks in other domains (e.g., reading comprehension is needed to create a mental representation of mathematical problems; Kintsch, 1998). Third, there is broad consensus that fluid intelligence (Gf) is a powerful predictor of achievement in various domains (e.g., Cattell, 1987; Jensen, 1993; Neisser et al., 1996). Finally, sociodemographic characteristics (SC) such as gender, migration background, and socioeconomic status are also widely acknowledged as persistent precursors of academic success (e.g., Bradley & Corwyn, 2002; Stanat & Christensen, 2006).

Importantly, educational RTs often address outcomes in multiple domains (Lortie-Forgues & Inglis, 2019; Morrison, 2020) that might need to be adapted or expanded during implementation (e.g., due to logistic or financial reasons, or political decisions; see Bloom et al., 2007), and often span several years (Connolly et al., 2018; Rickles et al., 2018). Moreover, apart from the rule that RTs should always be designed as parsimoniously as possible, they are usually subject to limited resources. Thus, in practice, researchers planning RTs often face the challenge of weighing the potential trade-offs between the different covariate types, their combinations, and time lags for design sensitivity. Three influential, albeit debated, psychometric heuristics may help to derive predictions on the unique, relative, and incremental impacts of IP, CP, Gf, and SC: (a) the bandwidth-fidelity dilemma, (b) the incremental validity concept, and (c) the validity degradation principle.

In the following, we elaborate on each heuristic under both a theoretical and empirical lens. First, we briefly introduce the respective underlying conception. Figure 1 visualizes the implications for R2 in student achievement. Second, we systematically review previous evidence on the links between standardized achievement tests and the covariate sets germane to each heuristic. For this purpose, we meta-analytically integrated R2 as derived from (a) relevant studies providing single-level correlations rT (i.e., not hierarchically decomposed between students, classrooms, and schools), which are informative for planning IRTs, and (b) available studies compiling multilevel design parameters, which are informative for planning CRTs. For each covariate type, combination, and time lag, we fitted a (multivariate) fixed-effect model with the R package metafor (Viechtbauer, 2010) to summarize the available effect sizes.Footnote 7 Figure 2 portrays the Pooled R2 values discussed below (see OSM B for the listing of studies included per covariate set, and details on the methodology and results).
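For readers who wish to retrace this integration step, a minimal metafor sketch is given below; the three R2 values and their sampling variances are invented solely to show the fixed-effect specification and are not results from the review.

```r
# Minimal sketch of a fixed-effect meta-analysis of R^2 values with metafor.
library(metafor)

r2 <- c(0.45, 0.58, 0.51)      # hypothetical R^2_T|IP estimates from three studies
v  <- c(0.004, 0.006, 0.005)   # their hypothetical sampling variances

rma(yi = r2, vi = v, method = "FE")   # Pooled R^2 with SE and 95% CI
```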

Fig. 1 Schematic visualization of the theoretical predictions implied by psychometric heuristics for covariate impacts on R2 in student achievement

Fig. 2 Previous research on covariate impacts: meta-analytic Pooled R2 in student achievement for single- and multilevel designs

Covariate Types: Bandwidth-Fidelity

Theoretical Conception

The bandwidth-fidelity dilemma as originally introduced in psychometrics by Cronbach and Gleser (1957) describes an inherent compromise between the complexity (i.e., bandwidth) and the specificity (i.e., fidelity) of a covariate with respect to its predictive validity for an outcome (Hogan & Roberts, 1996; Salgado, 2017). The core idea is that maximal explanatory power requires the alignment of both the conceptual breadths and peculiarities between predictor and outcome (Hogan & Roberts, 1996; Salgado, 2017). Although the heuristic has primarily been discussed for cognitive and personality measures (see Cronbach & Gleser, 1957; Salgado, 2017), it is conceptually not limited to these constructs. Following the underlying rationale, when predicting a domain-specific achievement outcome, IP is expected to be superior to CP because the former matches the outcome domain; yet, as domain-specific cognitive measures, both should be covariates of high fidelity. CP is expected to outperform Gf, as Gf is a domain-general cognitive measure and should be a covariate of lower fidelity/broader bandwidth. Gf is expected to surpass SC, as SC are non-cognitive measures and should be covariates of even broader bandwidth.

Previous Empirical Evidence

Single-Level Perspective

The studies in our review demonstrated the high predictive power of IP for student achievement, explaining on average 56% of variance. Of note, while some found that these associations remained fairly stable across grades (e.g., Cole et al., 2011), others showed that IP gains relevance among older students (e.g., McCoach et al., 2017). CP was half as effective as IP (\(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}\right.}^2=0.28\)). Gf turned out to be a significant predictor, with \(Pooled\ {R}_{\textrm{T}\left|\textrm{Gf}\right.}^2\) equaling 0.19. SC explained a meaningful but—relative to the cognitive covariates—small proportion of variance of about 4%. Importantly, for all covariate types, there was substantial between-study variation. For example, \({R}_{\textrm{T}\left|\textrm{IP}\right.}^2\) ranged broadly from 0.17 to 0.73, due to variation across grade levels and/or domains and pre-posttest time lags. To conclude, the reviewed single-level evidence generally supports the theoretical predictions on the differential impacts of covariate types with varying bandwidth-fidelity.

Multilevel Perspective

Across studies, IP appeared to be the most powerful covariate type, explaining an astonishing 73%/81% of achievement differences at L2/L3, and 48% at L1. In Hedges and Hedberg (2013), the prognostic value of IP at L3 strengthened throughout the school career, a trend that was replicated repeatedly (see Stallasch et al., 2021, Fig. 1). Despite domain mismatch, CP proved a highly robust predictor, particularly at L3: \(Pooled\ {R}_{\textrm{L}3\left|\textrm{CP}\right.}^2\) amounted to 0.74, whereas \(Pooled\ {R}_{\textrm{L}1\left|\textrm{CP}\right.}^2\) was 0.30. As far as we are aware, the predictive capacity of Gf has not yet been partitioned into its hierarchical variance components. SC exerted substantial predictive power at L3 (\(Pooled\ {R}_{\textrm{L}3\left|\textrm{SC}\right.}^2=0.64\)), but rather limited predictive properties at L1 (\(Pooled\ {R}_{\textrm{L}1\left|\textrm{SC}\right.}^2=0.10\)) and L2 (\(Pooled\ {R}_{\textrm{L}2\left|\textrm{SC}\right.}^2=0.21\)). For every covariate and at each hierarchical level, we recorded notable cross-study heterogeneity (e.g., 0.23 ≤ \({R}_{\textrm{L1}\mid \textrm{IP}}^2\) ≤ 0.58, 0.49 ≤ \({R}_{\textrm{L2}\mid \textrm{IP}}^2\) ≤ 0.70, and 0.54 ≤ \({R}_{\textrm{L3}\mid \textrm{IP}}^2\) ≤ 0.83). In sum, the available multilevel evidence fit the assumptions about the differential effects of covariate types of varying bandwidth-fidelity quite well. Yet, compared to the single-level findings, the respective differences in R2 seemed far less pronounced, especially at the group levels.

Covariate Combinations: Incremental Validity

Theoretical Conception

Incremental validity (Sechrest, 1963) refers to a measure’s capacity to explain additional variance in an outcome beyond what is explained by other prognostic factors (Haynes & Lench, 2003; Hunsley & Meyer, 2003); it is assessed by contrasting a covariate combination with a subset of it (Haynes & Lench, 2003). As outlined above, IP is the strongest known predictor of domain-specific student achievement. When planning RTs, an important question is therefore whether IP plus CP, Gf, and/or SC jointly explain more variance than IP alone.

Previous Empirical Evidence

Single-Level Perspective

Averaged across the reviewed studies, CP contributed to the prediction of student achievement beyond IP, albeit to a small degree; the joint effect amounted to \(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{CP}\right.}^2=0.57\). Overall, Gf showed no additional benefits over and above IP (\(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{Gf}\right.}^2=0.48\)). Combining IP and SC also did not lead to a general improvement over controlling for IP alone (\(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{SC}\right.}^{2}=0.55\)). Taken together, the full covariate battery did not raise the amount of explained variance beyond IP (\(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}+\textrm{CP}+\textrm{Gf}+\textrm{SC}\right.}^2=0.52\)). At the same time, for all covariate combinations, incremental returns occasionally reached meaningful thresholds, peaking at +15% of variance explanation over and above IP for the complete covariate array (Chu et al., 2018). The largest increments occurred consistently with elementary school samples, potentially implying that the incremental validities of CP, Gf, and/or SC might be stronger in younger than older students.

Multilevel Perspective

We found no multilevel study quantifying incremental validities of CP or Gf, or their combination with SC over and above IP. Much more is known about the set of SC, which incrementally predicted student achievement after IP had been taken into account, although only at the group levels. Pooled across studies, the joint amounts of explained variance equaled 83% at L3, 77% at L2, and 46% at L1. Stallasch et al.’s (2021) analyses revealed that SC contributed around +21%/+13% to the prediction of L2/L3 achievement differences beyond IP, where additional returns appeared to be more pronounced in elementary than secondary school.

Covariate Time Lags: Validity Degradation

Theoretical Conception

The validity degradation principle (Ghiselli, 1956; Humphreys, 1960) implies that the amount of variance explained by a cognitive predictor steadily decreases with growing time lags to the outcome (Hulin et al., 1990; Keil & Cortina, 2001; Reeve & Bonaccio, 2011). The developmental dynamics underlying validity degradation can be described as a simplex time series pattern (Humphreys, 1960). Accordingly, for domain-specific student achievement as outcome, the explanatory power of IP, CP, and Gf assessed 1 year before the outcome should be higher than the explanatory power of IP, CP, and Gf assessed, for example, 7 years before the outcome.Footnote 8

Previous Empirical Evidence

Single-Level Perspective

The vast majority of reviewed investigations suggest that \({R}_{\textrm{T}\left|\textrm{IP}\right.}^2\) in student achievement decreases with greater pre-posttest time lags: values considerably dropped from \(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}-1\right.}^2=0.63\) to \(Pooled\ {R}_{\textrm{T}\left|\textrm{IP}-7\right.}^2=0.36\). Of note, this trend holds true for all grade levels (e.g., McCoach et al., 2017). Analogous results—though far less striking—were reported for the predictive properties of CP: \(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}-1\right.}^2=0.24\) declined to \(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}-7\right.}^2=0.10\). In some studies, however, \({R}_{\textrm{T}\left|\textrm{CP}\right.}^2\) barely diminished (e.g., Erbeli et al., 2021) or even increased with growing time lags (e.g., Träff et al., 2020). The few available studies on the potential validity degradation of Gf indicate fairly robust long-term impacts: pooled across studies, Gf−1 explained 13% and Gf−7 explained 33% of achievement differences. In their review, Reeve and Bonaccio (2011) concluded that the decay of Gf’s predictive property is subtle at best, even across numerous years.

Multilevel Perspective

The few existing studies on multilevel design parameters addressing the temporal validity degradation of covariates attested to a notable decrement in the explanatory power of IP at L1; \(Pooled\ {R}_{\textrm{L}1\left|\textrm{IP}-1\right.}^2=0.50\) declined to \(Pooled\ {R}_{\textrm{L}1\left|\textrm{IP}-3\right.}^2=0.35\). Meanwhile, amounts of explained variance at L3 were far less prone to time effects (\(Pooled\ {R}_{\textrm{L}3\left|\textrm{IP}-1\right.}^2=0.86\); \(Pooled\ {R}_{\textrm{L}3\left|\textrm{IP}-3\right.}^2=0.76\)). Only Xu and Nichols (2010) studied temporal declines in IP’s predictive power at L2: proportions of explained variance remained at a high level of 70% across two subsequent years. Of note, deteriorations in R2 seem to be generally more prevalent in elementary than secondary school, especially at L3. To our knowledge, multilevel studies focusing on cross-time validity decay of CP and Gf are lacking to date.

The Present Study

Strong RTs unite cost-efficiency and sophisticated methodology to ensure appropriate design sensitivity. Given that well-selected covariates substantially raise statistical power and precision, evaluation researchers need reliable evidence that substantiates covariate choices by quantifying unique, relative, and incremental yields of the target outcome’s most important predictors. We aim to significantly expand the available guidance for IRTs and CRTs on student achievement through a comprehensive compilation of reliable single- and multilevel design parameters that we meta-analyze and apply in precision simulations.Footnote 9

First, both IRTs and CRTs are in their own right cornerstones of evidence-based education. Both designs are frequently implemented (Connolly et al., 2018). However, single-level design parameters on student achievement have not yet been systematically compiled. Indeed, our quantitative research review may be considered a first major step towards this endeavor. Moreover, extant multilevel design parameters remain mostly restricted to two hierarchical levels. To address these gaps, we cover RTs of three different designs: IRTs (with students assumed to be independently sampled), 2L-CRTs (with students nested within schools), and 3L-CRTs (with students nested within classrooms nested within schools).

Second, researchers rely on knowledge about the potential sensitivity-raising effects of specific covariate types, combinations, and time lags. The above research review pointed out that an IP assessed as recently as possible is most likely the best among the covariates. Yet, sometimes the inclusion of IP is not feasible, such as when there are multiple outcome domains (e.g., Lortie-Forgues & Inglis, 2019) while testing time is limited, when the outcome changes after the RT has started (e.g., due to political decisions; Bloom et al., 2007, p. 32), when the outcome is subject to strong developmental dynamics and/or presupposes intensive instruction (e.g., reading skills during elementary school), or when individual pretest differences are unlikely to be observed ahead of the intervention (e.g., integral calculus prior to its introduction; Shadish et al., 2002, p. 118). In such situations, CP, Gf, or SC may be meaningful alternatives to IP. However, only a few multilevel studies provide information on the impacts of CP and SC, and none on the impacts of Gf. Beyond that, the combination of IP with CP, Gf, and/or SC may further boost design sensitivity. Past multilevel studies solely assessed the incremental validity of SC over and above IP. Further, RTs often span multiple years (e.g., Rickles et al., 2018), especially when long-term intervention effects are of interest. Although the explanatory power of IP, CP, and Gf may be susceptible to temporal decay, prior multilevel studies addressed rather short pre-posttest time lags of 1–3 years to test validity degradation in IP, but not in CP or Gf. To address these gaps, we systematically vary and combine IP, CP, and Gf with 1- to 7-year-lagged data, as well as SC within 11 different covariate sets (in addition to a set 0 without any covariates).

Third, contemporary educational standards refer to a plethora of skills in various domains (National Research Council, 2011; Organisation for Economic Co-operation and Development [OECD], 2018), as do educational RTs (e.g., Morrison, 2020). Past works on multilevel design parameters dealt with a limited number of outcome domains, namely mathematics, science, and reading. To address this gap, we investigate a wide array of eight commonly targeted outcomes from STEMFootnote 10 and verbal domains.

Fourth, educational RTs are conducted all around the globe (Connolly et al., 2018), but existing collections of multilevel design parameters primarily stem from US samples. Estimates for countries whose school system characteristics markedly deviate from those of the United States, such as an (often much) earlier onset of ability-based school-type-tracking as is the case in Germany, are scarce. To address this gap, we capitalize on longitudinal large-scale assessment data from six German probability samples that represent the total student population in elementary (grades 1–4), lower secondary (grades 5–10), and upper secondary school (grades 11–12), as well as the student populations in lower and upper secondary school belonging to the academic and non-academic track.Footnote 11

Finally, many past educational large-scale RTs lacked design sensitivity (Lortie-Forgues & Inglis, 2019). It is therefore essential to reliably judge how the varying covariate types, combinations, and time lags actually affect precision (given the typical desired 80% power). To this end, power analyses contextualizing the respective R2 values within predefined designs are indispensable: as becomes clear from Eqs. (2), (6), and (7), the MDES is shaped by the interplay of several quantities beyond power and R2, such as sample size and allocation, and in the multilevel case also values of ρ. Furthermore, since empirical design parameters are tainted with sampling error that may (dramatically) distort power analysis outcomes, properly allowing for uncertainty is best practice (e.g., Jacob et al., 2010; Turner et al., 2004). We consequently ran precision simulations that incorporate uncertainty in ρ and R2 via a Bayesian rationale to calculate plausible MDES ranges for IRTs and CRTs.

Part I: Two-Stage Individual Participant Data Meta-Analysis—Estimating and Integrating Design Parameters

Method

We briefly sketch the applied methods here (see OSM C for details). We used R 4.2.2 (R Core Team, 2022); package versions are noted in the R scripts.

Large-Scale Assessment Data

Systematic Search

To identify German large-scale assessment datasets suitable for analyzing covariate impacts on design sensitivity in RTs on student achievement, we carried out a systematic search in three electronic data repositories (see also Brunner, Stallasch, et al., 2023). Datasets had to meet the following eligibility criteria: (a) representativeness for the German student population, (b) longitudinal design, and (c) assessment of student achievement via standardized tests. We found three large-scale assessments providing data of six independent national probability samples.

National Educational Panel Study (NEPS)

NEPS (Blossfeld & Roßbach, 2019) has been tracking multiple cohorts’ educational trajectories throughout their lifespan from 2010 to today. We used the dataFootnote 12 of students from three NEPS starting cohorts: 4-year-olds (in kindergarten) tested through grade 4 (NSC2; NEPS Network, 2020); grade 5 students tested through grade 12 (NSC3; NEPS Network, 2019a); and grade 9 students tested through grade 12 (NSC4; NEPS Network, 2019b). Achievement tests were administered every 1–3 years.

Programme for International Student Assessment (PISA)

The PISA cycles 2003 and 2012 were extended as national longitudinal follow-ups in grades 9–10 in Germany (PISA-Konsortium Deutschland, 2006; Reiss et al., 2017). We used the dataFootnote 13 from PISA-I-Plus 2003, 2004 (PP03; Prenzel et al., 2013), which focuses on students’ mathematics and science achievement development, and from PISA-Plus 2012–2013 (PP12; Reiss et al., 2019), which additionally incorporates a follow-up assessment of reading achievement.

Assessment of Student Achievements in German and English as a Foreign Language (DESI)

DESI (DESI-Konsortium, 2008) studied students’ verbal achievement during grade 9. We used the DESI data11 (Klieme, 2012) on verbal skills in German.

Sampling Process and Sample Selection

Except for NSC2, all samples were drawn using a multistage (i.e., multilevel) sampling process where schools were first randomly drawn, followed by at least two intact classrooms per school (Aßmann et al., 2011; Beck et al., 2008; Heine et al., 2017; Prenzel et al., 2006). NSC2 involved sampling kindergarten children and students of the schools that those children entered to ensure representativeness for children entering elementary school (Aßmann et al., 2011).

When studying covariate types and combinations, we drew on the full spectrum of samples. When studying covariate time lags, we drew only on NSC2 and NSC3, as these samples provided longitudinal achievement data across at least three measurement points. As listed in Table 1, we analyzed data from a total of N = 68,502 students, where sample sizes ranged within 1868 (NSC3, grade 12) ≤ N ≤ 10,543 (DESI, grade 9), with median cluster sizes of 4 ≤ nL2 ≤ 25, 14 ≤ nL3 ≤ 50, and 2 ≤ JL3 ≤ 3. Note that in grades 11–12, information at L2 did not exist because, in German upper secondary school, the affiliation of students to intact classrooms is usually replaced by a course grouping system catering to students’ ability level in a certain school subject (e.g., basic vs. advanced courses).

Table 1 Numbers of students N, classrooms J, and schools K, and median numbers of students per classroom nL2, students per school nL3, and classrooms per school JL3

Measures

Achievement Outcomes

We analyzed outcomes in three STEM domains, namely mathematics, science, and information and communication technology (ICT), as well as in five verbal domains in German, namely reading, grammar, spelling, vocabulary, and writing.

Covariates

We examined four covariate categories: IP, CP, Gf, and SC. We employed reading as CP for STEM outcomes and mathematics as CP for verbal outcomes. Gf was assessed in terms of figural reasoning. IP, CP, and Gf were available with a 1- to 7-year time lag to the outcome, where the smallest pre-posttest gap ranged from 1 to 4 years. SC comprised 4 variables, namely students’ gender (0 = male, 1 = female) and migration background (0 = no, 1 = yes) as well as two indicators of socioeconomic status: (1) parents’ highest educational attainment was assessed by the greatest number of years of schooling completed (range 9–18) in all studies except the DESI, where the highest school leaving certificate was used; and (2) parents’ highest International Socio-Economic Index of Occupational Status (HISEI; Ganzeboom & Treiman, 1996; range: 11–89).

Missing Data

Virtually all measures used in this study contained some missing values. The percentage of missing values across the datasets varied from 11% (PP03, grade 10) to 42% (NSC2, grade 1). The greatest missing rates occurred in pretests measured in the first two waves of NSC2, as only a small share of kindergarten children continued participating in NEPS after entering elementary school. We performed (groupwise) multilevel multiple imputation and generated 50 multiply-imputed datasets for each sample and grade using the mice (van Buuren & Groothuis-Oudshoorn, 2011) and miceadds (Robitzsch et al., 2021) packages.

Procedure

We applied a two-stage approach to meta-analysis of individual participant data (Brunner, Keller, et al., 2023; see also Brunner, Stallasch, et al., 2023). We estimated and meta-analyzed design parameters for three RT designs, namely single- (individual students), two- (students within schools), and three-level designs (students within classrooms within schools), as well as for three target populations, namely the total, academic track, and non-academic track student populations. Notably, in upper secondary school, only single- and two-level designs were considered due to the lack of L2 information.

Stage 1: Single- and Multilevel Modeling—Estimating Design Parameters

We performed single- and multilevel modeling to empirically estimate ρ and R2. As shown in Table 2, we systematically included and excluded 1- to 7-year-lagged IP, CP, and Gf, as well as SC within a total of 12 covariate sets, with the number of covariates Q per set ranging between 0 ≤ Q ≤ 7. This resulted in up to 363 distinct models per design and population.

Table 2 Covariate sets analyzed in the present study with numbers of covariates Q, ρ/R2 effect sizes G, and samples H by design

Model Fitting

For all outcomes, we fitted two model classes separately for each imputation. The first model class consisted of unconditional models without any covariates (set 0). Specifically, for single-level designs, we obtained \({\upsigma}_{\textrm{T}}^2\) by taking the outcomes’ variances. For multilevel designs, we obtained \({\upsigma}_{\textrm{L}1}^2\), \({\upsigma}_{\textrm{L}2}^2\), and \({\upsigma}_{\textrm{L}3}^2\) by specifying two- and three-level random-intercept-only models. The second model class consisted of conditional models with varying covariate types (sets 1–4), combinations (sets 5–8), and time lags (sets 9–11). Specifically, for single-level designs, we obtained \({\upsigma }_{\text{T}|{\text{C}}_{\text{T}}}^{2}\) by specifying single-level regression models. For multilevel designs, we obtained \({\upsigma }_{\text{L}1|{\text{C}}_{\text{L}1}}^{2}\), \({\upsigma }_{\text{L}2|{\text{C}}_{\text{L}2}}^{2}\), and \({\upsigma }_{\text{L}3|{\text{C}}_{\text{L}3}}^{2}\) by specifying two- and three-level random-intercept models. Note that all covariates were assessed at L1. In two-level models, we entered school averages at L3. In three-level models, we entered classroom averages at L2 and school averages at L3. In single-level models, we centered all covariates around their respective total population’s means whereas in multilevel models, we applied group-mean centering: L1 covariates were centered around their respective school/classroom means in two-/three-level models and L2 covariate means were centered around their respective school means in three-level models. Single-level modeling was performed using the stats package implemented in base R. Multilevel modeling was performed using the lme4 package (Bates et al., 2015) applying restricted maximum likelihood (REML) estimation.
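A minimal sketch of this model-fitting step is given below, assuming a data frame students with (made-up) columns math, pretest, classroom, and school; it is meant to illustrate the unconditional and conditional three-level specifications with group-mean centering, not to reproduce the study’s scripts.

```r
# Minimal sketch of the unconditional (set 0) and a conditional three-level model (REML via lme4).
library(lme4)

# Unconditional random-intercept-only model: variance components at L1, L2, and L3
m0 <- lmer(math ~ 1 + (1 | school) + (1 | school:classroom),
           data = students, REML = TRUE)

# Group-mean centering: L1 pretest centered at classroom means, classroom means centered at school means
students$pre_cm <- ave(students$pretest, students$school, students$classroom)   # classroom means (L2)
students$pre_sm <- ave(students$pretest, students$school)                       # school means (L3)
students$pre_L1 <- students$pretest - students$pre_cm
students$pre_L2 <- students$pre_cm - students$pre_sm

# Conditional random-intercept model with the pretest entered at all three levels
m1 <- lmer(math ~ pre_L1 + pre_L2 + pre_sm +
             (1 | school) + (1 | school:classroom),
           data = students, REML = TRUE)

as.data.frame(VarCorr(m0))   # unconditional sigma^2_L1, sigma^2_L2, sigma^2_L3
as.data.frame(VarCorr(m1))   # conditional (covariate-adjusted) variance components
```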

Calculating Design Parameters and Standard Errors

We calculated ρ and R2 by inserting the variance (component) estimates from the model fits into Eqs. (3)–(5) and (8)–(10). SEs of ρ were computed with the formulas for the large sample variances in unbalanced (i.e., with unequal cluster sizes) two-level designs derived in Donner and Koval (1980, Eq. 3) and three-level designs in Hedges et al. (2012, Eqs. 7–9). The latter involves the sampling variances of \({\upsigma}_{\textrm{L}2}^2\) and \({\upsigma}_{\textrm{L}3}^2\), which we obtained by applying the “cases bootstrap” from the lmeresampler package (Loy & Korobova, 2023). We drew 1000 samples (Huang, 2018, p. 303; Schomaker & Heumann, 2018). SEs of R2 were computed with the formula for the large sample variances given in Hedges and Hedberg (2013, p. 451).
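The first step described here (plugging the variance components into Eqs. (4)–(5) and (8)–(10)) boils down to a few lines of arithmetic, sketched below with made-up variance components; the SE formulas of Donner and Koval (1980) and Hedges et al. (2012) are not reproduced in this sketch.

```r
# Sketch of Eqs. (4)-(5) and (8)-(10) with made-up variance components (not the study's estimates).
v0 <- c(L1 = 0.60, L2 = 0.05, L3 = 0.35)   # unconditional components from a fit like m0
v1 <- c(L1 = 0.40, L2 = 0.02, L3 = 0.08)   # conditional components from a fit like m1

rho_L2 <- v0["L2"] / sum(v0)   # Eq. (4)
rho_L3 <- v0["L3"] / sum(v0)   # Eq. (5)
R2     <- (v0 - v1) / v0       # Eqs. (8)-(10): R^2_L1, R^2_L2, R^2_L3
```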

Pooling

ρ and R2 with corresponding SEs were pooled across the 50 imputations. We used the mitml package (Grund et al., 2021) that employs Rubin’s (1987) rules to take into account within- and between-imputation variance.
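The pooling arithmetic behind Rubin’s (1987) rules, as implemented in mitml, can be sketched in a few lines of base R; the function name and example inputs below are ours, for illustration only.

```r
# Minimal sketch of Rubin's (1987) rules for pooling an estimate and its SE across m imputations.
pool_rubin <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)              # pooled point estimate
  W    <- mean(se^2)             # within-imputation variance
  B    <- var(est)               # between-imputation variance
  Tvar <- W + (1 + 1 / m) * B    # total variance
  c(estimate = qbar, se = sqrt(Tvar))
}

# Example with 50 made-up imputation-specific R^2 estimates and SEs
pool_rubin(est = rnorm(50, mean = 0.45, sd = 0.01), se = rep(0.02, 50))
```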

Stage 2: Meta-Analysis—Integrating Design Parameters

We performed meta-analysis to integrate ρ and R2 for covariate types and combinations, and meta-regression with outcome-covariate time lag as moderator to integrate R2 for covariate time lags (both across domains and samples, but within hierarchical and grade levels, designs, and populations).Footnote 14

Model Fitting

Using the metafor package (Viechtbauer, 2010), we fitted two meta-analytic/meta-regression model classes, conditional on the number of R2 effect sizes G per covariate set: either (multivariate) fixed-effect models if G < 10 or (multivariate multilevel) random-effects models via REML if G ≥ 10 (see Langan et al., 2019, p. 95). Both methods yield an average (true) effect size Pooled R2, with SE(Pooled R2). However, the “real” (i.e., not due to sampling error) heterogeneity among true R2 values within samples (τ2Effect sizes) and between samples (τ2Samples) can only be captured by random-effects models (Borenstein et al., 2021, pp. 61–80). We deployed two weighting schemes, conditional on the number of samples H per covariate set: If H > 1, we addressed within-sample dependencies among R2 effect sizes (Hedges, 2019) by multivariate (multilevel) meta-analyses and imputed working variance–covariance matrices using the clubSandwich package (Pustejovsky, 2022). We assumed a within-sample intercorrelation of r = 0.90 as a reasonable upper-bound guess (see Brunner, Stallasch, et al., 2023). If H = 1, we used the sampling variances of R2 for standard meta-analytic inverse-variance weighting.
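The following sketch illustrates the multivariate random-effects specification with an imputed working variance–covariance matrix at r = 0.90; the tiny data frame, its column names, and the nested random-effects formula are illustrative assumptions of ours, not the study’s code.

```r
# Minimal sketch of multivariate (multilevel) random-effects integration of R^2 effect sizes.
library(metafor)
library(clubSandwich)

dat <- data.frame(
  es_id  = 1:6,
  sample = rep(c("A", "B", "C"), each = 2),
  r2     = c(0.48, 0.52, 0.61, 0.58, 0.44, 0.47),   # made-up R^2 effect sizes
  vi     = rep(0.004, 6)                            # made-up sampling variances
)

# Working variance-covariance matrix assuming r = .90 within samples
V <- impute_covariance_matrix(vi = dat$vi, cluster = dat$sample, r = 0.90)

res <- rma.mv(yi = r2, V = V, random = ~ 1 | sample / es_id,
              data = dat, method = "REML")
predict(res)   # Pooled R^2 with 95% CI and, given the random effects, a 95% PI
```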

Depicting Heterogeneity

With random-effects modeling, we calculated—in addition to the 95% confidence interval (95% CI)—the 95% prediction interval (95% PI). The 95% PI provides a plausible range of R2; it quantifies the total dispersion (sampling variance plus τ2Effect sizes, and if applicable, plus τ2Samples) of R2 around Pooled R2 and defines the range in which an R2 estimated based on data of a new sample randomly drawn from a population of samples will likely (i.e., in 95% of cases) fall (Borenstein et al., 2021, pp. 119–126; Riley et al., 2011). We also calculated (multilevel) I2 (Higgins & Thompson, 2002), the ratio of real heterogeneity to the total variation across observed R2 values (Borenstein et al., 2017).

Gauging Sensitivity and Model Convergence

For the imputed working variance–covariance matrices, we ran sensitivity analyses over r ∈ {0.00, 0.05, …, 0.95} (Hedges, 2019) to preclude a misspecification of R2 dependencies. With random-effects modeling, we profiled log-likelihoods of τ2 values to evaluate their identifiability (see Viechtbauer, 2022).Footnote 15

Key Results

We present major patterns in meta-analytic single- and multilevel (i.e., three-level in grades 1–10 and two-level in grades 11–12) design parameters for the total student population, illustrated in Figs. 3 and 4 (which we refer to in this section, unless otherwise stated; see OSM C for result plots of two-level designs in grades 1–10 and school tracks, and OSM E/F for the full compilation of the empirical/meta-analyzed design parameters).

Fig. 3 Meta-analytic integrations of single- and multilevel R2 in student achievement: covariate types and covariate combinations

Fig. 4 Meta-analytic integrations of single- and multilevel R2 in student achievement: covariate time lags

Covariate Types: Bandwidth-Fidelity

Single-Level Perspective

IP was consistently the most powerful among all covariate types: IP explained over one third of achievement differences between individual students in elementary and upper secondary school, and almost one half in lower secondary school. Here, the remaining cognitive covariates were also valuable predictors, with \(Pooled\ {R}_{\textrm{T}\left|\textrm{CP}\right.}^2=0.28\) and \(Pooled\ {R}_{\textrm{T}\left|\textrm{Gf}\right.}^2=0.20\). In elementary and upper secondary school, CP and in particular Gf contributed comparatively little to the prediction. In contrast, SC performed best as predictors in elementary school (\(Pooled\ {R}_{\textrm{T}\left|\textrm{SC}\right.}^2=0.16\)). We registered substantial \({R}_{\textrm{T}}^2\) heterogeneities, with the broadest 95% PIs for IP and the narrowest for SC. For example, in elementary school, the 95% PIs were [0.13, 0.58] and [0.09, 0.24], respectively (Table F1).

Multilevel Perspective

IP was generally of paramount relevance when predicting student achievement. From grade 5 on, IP was the strongest of all covariate types and showed exceptional prognostic properties at L3, where 98/78% of variance was explained in lower/upper secondary school. Across the entire school career, however, IP turned out to be a weaker predictor at both L1 and L2. CP as well as Gf were very useful to explain differences between schools, in particular in lower secondary school (\(Pooled\ {R}_{\textrm{L}3\left|\textrm{CP}\right.}^2=0.91\); \(Pooled\ {R}_{\textrm{L}3\left|\textrm{Gf}\right.}^2=0.86\)), but less so between classrooms and students—in all grade levels. Gf was moreover consistently the weakest covariate both in elementary and upper secondary school, irrespective of the hierarchical level. Although SC were the poorest predictors in lower secondary school, they still explained over three quarters of between-school variance. Notably, with first–fourth graders, SC outweighed IP at both L2 and L3 (\(Pooled\ {R}_{\textrm{L}2\left|\textrm{SC}\right.}^2=0.35\); \(Pooled\ {R}_{\textrm{L}3\left|\textrm{SC}\right.}^2=0.52\)). Degrees of heterogeneity in multilevel R2 were often considerable, depending not only on the covariate type but also on the grade and hierarchical level. For instance, the 95% PI for \({R}_{\textrm{L}3\left|\textrm{IP}\right.}^2\) was very wide in elementary school with [0.07, 0.81], but considerably narrower in secondary school with [0.95, 1.00] (Tables F1 and F2).

Covariate Combinations: Incremental Validity

Single-Level Perspective

In all grade levels, CP explained additional variance in student achievement over and above IP. On average, incremental gains were largest in lower secondary school (+5%) and smallest in upper secondary school (+2%). When controlling for IP, benefits through Gf were noteworthy only in lower secondary school (\(\Delta Pooled\ {R}_{\textrm{T}\left|+\textrm{Gf}\right.}^{2}=+0.04\)). In contrast, SC particularly contributed to the prediction in elementary/upper secondary school, with about +4%/+3%. Joint effects through the full battery of covariates were always largest, with +0.06 ≤ \(\Delta Pooled\ {R}_{\textrm{T}\left|\textrm{CP}+\textrm{Gf}+\textrm{SC}\right.}^{2}\le +0.08\). We found notable heterogeneities in all \({R}_{\textrm{T}}^2\). For example, the 95% PI for \({R}_{\textrm{T}\left|\textrm{IP}+\textrm{Gf}\right.}^2\) was [0.15, 0.59] (Table F1).

Multilevel Perspective

At the school level, CP clearly added to the prediction of achievement differences beyond IP in elementary (\(\Delta Pooled\ {R}_{\textrm{L}3\left|+\textrm{CP}\right.}^{2}=+0.09\)), but not in secondary (\(\Delta Pooled\ {R}_{\textrm{L}3\left|+\textrm{CP}\right.}^{2} \le+0.01\)) school. At L1/L2, CP provided some additional explanatory power over the entire school career; benefits were largest in lower secondary school (+7%/+5%). In general, we found Gf to be a rather poor additional covariate across grade and hierarchical levels. As an exception, contributions in \(Pooled\ {R}_{\textrm{L}1}^2\) over and above IP amounted to +6% in lower secondary school. SC were of notable incremental relevance in elementary school, especially at L3, adding on average +16% of explained variance. In higher grades, though, additional gains were often negligible (note, however, that \(Pooled\ {R}_{\textrm{L}3\left|+\textrm{SC}\right.}^2\) reached 0.99 in lower secondary school). Except for L3 in lower secondary school, the complete set of covariates consistently outweighed all other combinations—at L1, average gains were the highest in lower secondary, and at L2 and L3 in elementary school, with \(\Delta Pooled\ {R}_{\left|+\textrm{CP}+\textrm{Gf}+\textrm{SC}\right.}^{2}\) = +0.09/+0.13/+0.20 at L1/L2/L3. Multilevel R2 heterogeneities largely mirrored those of IP, and were substantial (except \({R}_{\textrm{L}3}^2\) in lower secondary school). For example, the 95% PI of \({R}_{\textrm{L}2\left|\textrm{IP}+\textrm{CP}\right.}^2\) was [0.21, 1.00] in grades 5–10 (Table F2).

Covariate Time Lags: Validity Degradation

Single-Level Perspective

In all grade levels, the predictive power of IP clearly diminished with growing pre-posttest time lags. Validity degradation was most prevalent in elementary school: the meta-regression coefficient blag = −0.06 shows that with each additional year between IP and outcome, \({R}_{\textrm{T}\mid \textrm{IP}}^2\) is predicted to decrease by 6%. In lower/upper secondary school, temporal declines in the proportions of explained variance were also noticeable (blag = −0.04/−0.03). In contrast, CP emerged as far less prone to cross-time decay: until grade 10, prognostic properties remained stable both in elementary and secondary school. In upper secondary school, predicted amounts of explained variance decreased only slightly, by 1% per year. Gf turned out to be an extraordinarily time-robust predictor throughout the entire school career.

Multilevel Perspective

Validity degradation in \({R}_{\mid \textrm{IP}}^2\) was almost always substantial, except for lower secondary school at L3. Here, we recorded remarkable temporal stabilities in the amounts of explained variance (blag = −0.01). In all other cases, the explanatory power of IP is likely to drop by about 3% up to 8% per year. CP appeared to be a relatively time-robust covariate; however, decrements in prognostic capacity hinged on both the grade and hierarchical level: in elementary school, only the predicted \({R}_{\textrm{L}3\mid \textrm{CP}}^2\) declined slightly (blag = −0.01); in lower secondary school, only the predicted proportions of explained L2 variance dropped, but they did so strikingly (blag = −0.09); and in upper secondary school, predicted \({R}_{\textrm{L}1\mid \textrm{CP}}^2\) (blag = −0.01) and \({R}_{\textrm{L}3\mid \textrm{CP}}^2\) (blag = −0.02) showed small reductions over time. While Gf consistently emerged as highly time-stable at L1, the predicted validity decay at both L2 and L3 was practically significant in all grade levels (−5% ≤ blag ≤ −1%).

Application

Researchers designing RTs may profit from the flow chart in Fig. 5. It facilitates the choice of single- and multilevel design parameters that are optimally tailored to the specific application context. To showcase the estimates’ use in study planning, we developed manifold scenarios to determine the (a) sample size and (b) statistical power of IRTs and CRTs via power analysis. We present one in the following (see OSM C for the remaining scenarios).

Fig. 5 Flow chart to choose design parameters from our compilation in OSM E and F

An Illustrative Scenario

A research team has programmed an app. It functions as a multidisciplinary digital learning environment which can be used throughout lower secondary school in Germany.

Single-Level Perspective

As a first step, the researchers aim to test the general efficacy of the underlying didactic approach. They plan a small-scale pilot IRT involving exclusively mathematical topics from grade 7. A standardized treatment effect of d = 0.15 is considered meaningful, representing around one half of the expected annual growth in mathematics for grades 6–7 in the German student population (Brunner, Stallasch et al., 2023, Table 1). The team’s objective, therefore, is to sample enough seventh graders to detect MDES = 0.15 at α = 0.05 (two-tailed) with 1−β = 0.80, where PT = 0.50. The minimum required sample size (MRSS) to achieve this in an unconditional IRT design is N = 1397 students. Striving for parsimony and being aware of the potential virtue of covariate adjustment, the researchers plan to statistically control for IP. Before power analysis, they consult our flow chart (Fig. 5): since the IRT addresses the total population in lower secondary school and a specific grade and domain analyzed in our study, the team is guided to Table E2 (panel a) that lists the suitable empirically estimated single-level design parameters. Inserting \({R}_{\textrm{T}\mid \textrm{IP}}^2\) = 0.53, the researchers find that the MRSS more than halves to N = 654 when adjusting for IP. They then think about optimizing the design by additionally including either a reading CP or SC, where \({R}_{\textrm{T}\mid \textrm{IP}+\textrm{CP}}^2\) = 0.56 and \({R}_{\textrm{T}\mid \textrm{IP}+\textrm{SC}}^2\) = 0.55. The MRSS further reduces to N = 630 when combining IP with SC and to N = 622 when combining IP with CP. They decide to administer both a mathematics and reading test. The team wants to account for uncertainty in \({R}_{\textrm{T}\mid \textrm{IP}+\textrm{CP}}^2\). To this end, they determine the 95% CI by means of SE (\({R}_{\textrm{T}\mid \textrm{IP}+\textrm{CP}}^2\)) = 0.01: the lower bound is calculated as 0.56 − 1.96 × 0.01 = 0.54 and the upper bound as 0.56 + 1.96 × 0.01 = 0.57, which leads to an MRSS range of 649 ≥ N ≥ 596. Consequently, when opting for a conservative approach and sampling N = 649 students, it is fairly certain that the IRT will be sensitive to uncover a (truly existing) treatment effect of d = 0.15 with IP and CP as covariates.
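The MRSS values in this scenario follow from solving Eq. (2) for N; a simple search over N, sketched below, approximately reproduces the reported numbers (small deviations can arise from rounding of the design parameters). The function name and search strategy are ours, not the scenario’s actual tooling.

```r
# Minimal sketch: smallest N for which the IRT's MDES (Eq. 2) falls below the targeted effect size.
mrss_irt <- function(target_mdes, R2 = 0, P = 0.50, Q = 0,
                     alpha = 0.05, power = 0.80, N_max = 100000) {
  for (N in (Q + 4):N_max) {
    df <- N - Q - 2
    M  <- qt(1 - alpha / 2, df) + qt(power, df)
    if (M * sqrt((1 - R2) / (P * (1 - P) * N)) <= target_mdes) return(N)
  }
  NA
}

mrss_irt(target_mdes = 0.15)                      # unconditional design
mrss_irt(target_mdes = 0.15, R2 = 0.53, Q = 1)    # adjusting for IP
mrss_irt(target_mdes = 0.15, R2 = 0.56, Q = 2)    # adjusting for IP and CP
```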

Multilevel Perspective

As a second step, the researchers aim to scrutinize the effectiveness of the full app in students' usual school routine. They plan a large-scale 3L-CRT involving the complete spectrum of domains for grades 5–10. An effect of \(d\) = 0.11 is considered reasonable, approximating half of the average academic year-to-year growth observed across lower secondary school in Germany (Brunner, Stallasch, et al., 2023, Table 1). For logistical reasons, the total sample is restricted to a maximum of K = 400 schools, with nL2 = 20 and JL3 = 3. The team's primary concern is thus to achieve sufficient power (i.e., 1−β ≥ 0.80) to detect MDES = 0.11 at α = 0.05 (two-tailed), where PL3 = 0.50. Since the 3L-CRT addresses the total population in lower secondary school but neither a specific grade nor domain, our flow chart (Fig. 5) directs them to Table F2 (panel c), which lists the suitable meta-analytically integrated three-level design parameters. Entering Pooled ρ values at L2/L3 of 0.05/0.35 into the power analysis, the researchers learn that an unconditional 3L-CRT clearly undercuts the desired power rate (1−β = 0.43). They wonder which covariates to use: given the limited testing time, assessing multiple IPs is not a viable option. Instead, controlling for either Gf or SC seems most feasible, with Pooled R2 values at L1/L2/L3 of 0.07/0.16/0.86 for Gf and 0.04/0.12/0.77 for SC. Either Gf (1−β = 0.98) or SC (1−β = 0.92) would yield adequate power. However, when incorporating total design parameter heterogeneities (i.e., sampling error plus true variation) and adopting a (very) conservative approach by using the upper bounds of the 95% PIs of ρL2 = 0.07 and ρL3 = 0.50 and the lower bounds of the 95% PIs of \({R}_{\textrm{L}1\mid \textrm{Gf}}^2\) = 0.00, \({R}_{\textrm{L}2\mid \textrm{Gf}}^2\) = 0.00, \({R}_{\textrm{L}3\mid \textrm{Gf}}^2\) = 0.76, \({R}_{\textrm{L}1\mid \textrm{SC}}^2\) = 0.00, \({R}_{\textrm{L}2\mid \textrm{SC}}^2\) = 0.00, and \({R}_{\textrm{L}3\mid \textrm{SC}}^2\) = 0.56, only Gf (1−β = 0.81) likely guarantees enough power, as opposed to SC (1−β = 0.59). The team decides to collect students' Gf scores. Finally, the researchers wish to evaluate the long-term effects of the app; thus, a possible follow-up 3L-CRT with the same sample should still demonstrate adequate power. The suitable design parameters are Pooled ρL2 = 0.04, Pooled ρL3 = 0.38, and predicted values of \({R}_{\textrm{L}1\mid \textrm{Gf}-2}^2\) = 0.01, \({R}_{\textrm{L}2\mid \textrm{Gf}-2}^2\) = 0.22, \({R}_{\textrm{L}3\mid \textrm{Gf}-2}^2\) = 0.81, as well as \({R}_{\textrm{L}1\mid \textrm{Gf}-4}^2\) = 0.01, \({R}_{\textrm{L}2\mid \textrm{Gf}-4}^2\) = 0.12, \({R}_{\textrm{L}3\mid \textrm{Gf}-4}^2\) = 0.79. Assuming no attrition over time, the team calculates 1−β = 0.95/0.94 for a 2-/4-year-lagged Gf. Consequently, even when reevaluating the app's impact 4 years later, the 3L-CRT with Gf as a covariate will likely be adequately powered.
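The corresponding three-level power values can be retraced with the base-R sketch below. The helper power_cra3 and its argument layout are ours for illustration and mirror, rather than reproduce, the PowerUpR routines; the formula is the standard expression for a 3L-CRT with K schools, J classrooms per school, and n students per classroom.

```r
# Illustrative base-R stand-in for the three-level CRT power computation
# (the reported analyses use PowerUpR). K = schools, J = classrooms per school,
# n = students per classroom, P = proportion of schools treated,
# g3 = number of school-level covariates.
power_cra3 <- function(es, K, J, n, rho2, rho3,
                       R2_1 = 0, R2_2 = 0, R2_3 = 0,
                       alpha = .05, P = .50, g3 = 0) {
  se <- sqrt((rho3 * (1 - R2_3)) / (P * (1 - P) * K) +
             (rho2 * (1 - R2_2)) / (P * (1 - P) * K * J) +
             ((1 - rho2 - rho3) * (1 - R2_1)) / (P * (1 - P) * K * J * n))
  df     <- K - g3 - 2
  tcrit  <- qt(1 - alpha / 2, df)
  lambda <- es / se                        # noncentrality parameter
  1 - pt(tcrit, df, ncp = lambda) + pt(-tcrit, df, ncp = lambda)
}

# Unconditional 3L-CRT (K = 400, J = 3, n = 20): power of about 0.43
power_cra3(es = .11, K = 400, J = 3, n = 20, rho2 = .05, rho3 = .35)
# Adjusting for Gf with R2 at L1/L2/L3 = .07/.16/.86: power of about 0.98
power_cra3(es = .11, K = 400, J = 3, n = 20, rho2 = .05, rho3 = .35,
           R2_1 = .07, R2_2 = .16, R2_3 = .86, g3 = 1)
```

The follow-up power values for the 2- and 4-year-lagged Gf are obtained by swapping in the respective ρ and R2 values.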

Part II: Precision Simulations—Assessing Design Sensitivity via the MDES

Method

We briefly sketch the applied methods here (see OSM D for details). We used R 4.2.2 (R Core Team, 2022); package versions are noted in the R scripts.

Procedure

We adopted a hybrid Bayesian-classical approach to power analysis (Spiegelhalter et al., 2004; see also Pek & Park, 2019). To this end, we took advantage of the (joint) empirical distribution of single- and multilevel design parameters estimated in stage 1 of Part I to simulate MDES distributions for small, medium, and large IRTs and CRTs.

Simulation Conditions

We established typical sample sizes of educational RTs by drawing on data from Lortie-Forgues and Inglis's (2019)Footnote 16 review. We computed normative distributions (i.e., percentiles P) across K and categorized P10(K) = 14 as small, P50(K) = 46 as medium, and P90(K) = 100 as large, where nL3 = 46 students per school. Sample sizes at L2 were not available; we assumed JL3 = 2, resulting in nL2 = 23. It followed that N = 644/2116/4600 for small/medium/large RTs.Footnote 17 We assumed α = 0.05 (two-tailed), 1−β = 0.80, and PT = PL3 = 0.50.
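For transparency, the assumed sample size grid translates into total N as follows (a trivial base-R sketch with the values stated above):

```r
# Assumed sample size grid: K schools from the percentile categorization above,
# J_L3 = 2 classrooms per school, n_L2 = 23 students per classroom.
sizes   <- data.frame(design = c("small", "medium", "large"),
                      K = c(14, 46, 100), J = 2, n = 23)
sizes$N <- sizes$K * sizes$J * sizes$n     # 644, 2116, 4600 students
sizes
```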

Expressing Uncertainty in Design Parameters

Random noise in ρ and R2 can be incorporated into power analysis in a number of ways. One method is to enter the bounds of their (meta-analytic) 95% CIs/PIs, as illustrated above in Part I (section "Application"). Here, we apply a hybrid technique that models uncertainty implicitly, following a Bayesian rationale: ρ and R2, together with their SEs, are treated as (informative) prior distributions, which are then used to perform Monte Carlo simulations within the frequentist framework (see e.g., Moerbeek & Teerenstra, 2016, pp. 211–213). Specifically, for each set of connected design parameters (e.g., in three-level designs, ρL2, ρL3, \({R}_{\textrm{L}1}^2\), \({R}_{\textrm{L}2}^2\), and \({R}_{\textrm{L}3}^2\) for a certain outcome are interrelated), we specified a multivariate normal distribution. The mean vector comprised the point estimates of ρ and R2, the variances their squared SEs, and the covariances were derived assuming an intercorrelation of r = 0.90, as a conservative upper-bound guess of the dependencies. Using the SimDesign package (Chalmers & Adkins, 2020), we then generated 100 draws from each multivariate design parameter prior distribution.
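A minimal sketch of this step is shown below, using placeholder point estimates and SEs (the empirical values are compiled in OSM E and F) and MASS::mvrnorm in place of the SimDesign machinery used for the reported simulations.

```r
# Minimal sketch of the design parameter prior draws (placeholder inputs).
library(MASS)

mu <- c(rho2 = .05, rho3 = .35, R2_1 = .07, R2_2 = .16, R2_3 = .86)  # point estimates
se <- c(.01, .04, .01, .03, .03)                                     # their SEs (illustrative)
r  <- 0.90                                  # assumed intercorrelation of the parameters

Sigma <- r * tcrossprod(se)                 # covariances: r * SE_i * SE_j
diag(Sigma) <- se^2                         # variances: squared SEs

set.seed(1)
draws <- mvrnorm(n = 100, mu = mu, Sigma = Sigma)   # 100 draws from the prior
draws <- pmin(pmax(draws, 0), 1)            # clamp to [0, 1]; an optional safeguard,
                                            # not part of the published procedure
head(draws)
```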

Calculating the MDES

For each draw, we computed the MDES based on Eqs. (2), (6), and (7) employing the PowerUpR package (Bulus et al., 2021).
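A base-R companion sketch of this step for a three-level design is given below; the reported computations rely on PowerUpR, whereas the helper mdes_cra3 is ours and follows the standard 3L-CRT MDES formula. It reuses the draws object from the previous sketch.

```r
# Companion sketch of the MDES computation for a three-level CRT (illustrative).
mdes_cra3 <- function(K, J, n, rho2, rho3, R2_1 = 0, R2_2 = 0, R2_3 = 0,
                      alpha = .05, power = .80, P = .50, g3 = 0) {
  df <- K - g3 - 2
  M  <- qt(1 - alpha / 2, df) + qt(power, df)          # precision multiplier
  M * sqrt((rho3 * (1 - R2_3)) / (P * (1 - P) * K) +
           (rho2 * (1 - R2_2)) / (P * (1 - P) * K * J) +
           ((1 - rho2 - rho3) * (1 - R2_1)) / (P * (1 - P) * K * J * n))
}

# Medium unconditional CRT (K = 46, J = 2, n = 23) with the pooled lower
# secondary school rho values from the Application section: MDES of about 0.53
mdes_cra3(K = 46, J = 2, n = 23, rho2 = .05, rho3 = .35)

# Applied to every prior draw from the previous sketch:
mdes_draws <- apply(draws, 1, function(d)
  mdes_cra3(K = 46, J = 2, n = 23, rho2 = d["rho2"], rho3 = d["rho3"],
            R2_1 = d["R2_1"], R2_2 = d["R2_2"], R2_3 = d["R2_3"], g3 = 1))
summary(mdes_draws)
```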

Gauging Sensitivity

For the variance–covariance matrices defined for the multivariate normal distributions, we ran sensitivity analyses over r ∈ {0.00, 0.05, …, 0.95} to guard against misspecification of the dependencies among ρ and R2.
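A compact sketch of this sensitivity loop, continuing the sketches above (it reuses mu, se, and mdes_cra3 defined there), might look as follows:

```r
# Sensitivity loop over the assumed parameter intercorrelation r.
r_grid <- seq(0, 0.95, by = 0.05)
mdes_by_r <- lapply(r_grid, function(r) {
  Sigma <- r * tcrossprod(se)
  diag(Sigma) <- se^2
  d <- MASS::mvrnorm(100, mu = mu, Sigma = Sigma)
  apply(d, 1, function(x)
    mdes_cra3(K = 46, J = 2, n = 23, rho2 = x["rho2"], rho3 = x["rho3"],
              R2_1 = x["R2_1"], R2_2 = x["R2_2"], R2_3 = x["R2_3"], g3 = 1))
})
sapply(mdes_by_r, median)   # compare the MDES summaries across the r grid
```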

Key Results

We present major patterns in MDES distributions for small, medium, and large IRTs and CRTs (i.e., 3L-CRTs in grades 1–10 and 2L-CRTs in grades 11–12) for the total student population, illustrated in Figs. 6 and 7 (which we refer to in this section; see OSM D for result plots of school tracks and 2L-CRTs in grades 1–10, and OSM G for the full data table of simulated design parameters along with their MDES statistics). Generally, in all simulation conditions, we observed substantive variation in the MDES—between and within outcomes. Further, MDES distributions for small RTs tended to be more sensitive to design parameter uncertainties, and therefore appeared more broadly dispersed than those for large RTs.

Fig. 6 MDES distributions for small, medium, and large IRTs and CRTs: covariate types and covariate combinations

Fig. 7 MDES distributions for small, medium, and large IRTs and CRTs: covariate time lags

Covariate Types: Bandwidth-Fidelity

Single-Level Perspective

In a medium IRT, MDESIRT was 0.12 without covariates. Precision was then moderately improved by covariate adjustment, most strongly by IP, with the median MDESIRT|IP equaling 0.10 in elementary and upper secondary school and 0.09 in lower secondary school. Notably, the percentage MDES reduction for a given covariate type remained constant across IRT sizes. Since precision is a positive function of sample size, absolute MDES gains were stronger in small than in large IRTs; moreover, covariate adjustment reached a point of diminishing returns as sample size increased. For instance, in elementary school, SC somewhat raised precision in an IRT with N = 644 (MDESIRT = 0.22 vs. Mdn(MDESIRT|SC) = 0.20) but not with N = 4600 (MDESIRT = Mdn(MDESIRT|SC) = 0.08).

Multilevel Perspective

In a medium CRT, the median MDESCRT was 0.35/0.53/0.32 without covariates in elementary/lower secondary/upper secondary school. In lower secondary school, all covariate types strongly boosted median precision, first and foremost IP (MDES3L-CRT|IP = 0.15), followed by CP, Gf, and SC (in that order). Likewise, in upper secondary school, IP markedly reduced the MDES2L-CRT to around 0.19, about twice the reduction achieved by the remaining covariates. In elementary school, SC in particular yielded reasonable average precision improvements (MDES3L-CRT|SC = 0.26), while Gf performed the poorest (MDES3L-CRT|Gf = 0.33). Proportionally, the impact of the covariates strengthened somewhat with CRT size: for example, in elementary school, SC reduced the MDES by about 24% in small CRTs (Mdn(MDES3L-CRT) = 0.68 vs. Mdn(MDES3L-CRT|SC) = 0.52) and by about 27% in large CRTs (Mdn(MDES3L-CRT) = 0.24 vs. Mdn(MDES3L-CRT|SC) = 0.17). Meanwhile, as with the IRTs, absolute MDES reductions were still (far) more pronounced with K = 14 than with K = 100.

Covariate Combinations: Incremental Validity

Single-Level Perspective

In a medium IRT, the additional inclusion of CP over and above IP led to notable MDES drops, but only in elementary/lower secondary school (Mdn(MDESIRT|IP+CP) = 0.09/0.08). In these grade levels, no other combination resulted in further improvements. In upper secondary school, only the complete covariate battery yielded genuine precision benefits; the median MDESIRT|IP+CP+Gf+SC was around 0.09.

Multilevel Perspective

In a medium CRT targeted at first to fourth graders, adding SC to IP raised precision the most (Mdn(MDES3L-CRT|IP+SC) = 0.22), with no further gains through the full covariate array. From grade 5 onward, we did not detect any MDES improvements from pairing IP with CP or Gf. Similarly, the addition of SC, alone or together with CP and Gf, returned only minuscule additional MDES declines.

Covariate Time Lags: Validity Degradation

Single-Level Perspective

In a medium IRT, precision was only slightly affected by temporal validity decrement in IP: we observed the maximum decrement in elementary school, with ∆MDESIRT|IP equaling +0.02 (from the shortest to the longest time lag). For CP, precision diminished only in upper secondary school, and only late, after 7 years (∆Mdn(MDESIRT|CP−7) = +0.01). Of note, precision was more prone to validity deterioration in IP and CP in small than in large IRTs (e.g., Mdn(MDESIRT|IP−2) = 0.17/0.07 and Mdn(MDESIRT|IP−7) = 0.19/0.07 with N = 644/4600 in upper secondary school). By contrast, MDESIRT|Gf consistently remained highly stable.

Multilevel Perspective

In a medium CRT, the median MDESCRT was 0.35/0.54/0.32 without covariates in elementary/lower secondary/upper secondary school. The MDES fluctuated somewhat with growing pre-posttest time lags: subtracting the median values for the shortest from those for the longest time gaps yielded ∆MDESCRT|IP = +0.02/+0.03/+0.01, ∆MDESCRT|CP = −0.06/+0.01/±0.00, and ∆MDESCRT|Gf = −0.05/+0.02/±0.00. As with the IRTs, cross-time precision decay appeared more pronounced in small than in large CRTs (e.g., Mdn(MDES2L-CRT|IP−2) = 0.45/0.16 and Mdn(MDES2L-CRT|IP−7) = 0.48/0.16 for K = 14/100 in upper secondary school).

Discussion

Worldwide, the prevalence of educational RTs has been growing sharply (Connolly et al., 2018; Raudenbush & Schwartz, 2020). Reliable knowledge on the effectiveness of programs and innovations to bolster student learning—the foundation of evidence-based policies and practices in education (Hedges, 2018)—requires well-designed IRTs and CRTs that are sensitive enough to detect true intervention effects. Highly prognostic covariates are key elements of strong designs; yet, choosing them can be challenging and involves both theoretical and empirical considerations. Our study sought to expand substantive guidance to support informed covariate selection and power analysis for IRTs and CRTs on student achievement: inspired by three psychometric heuristics (the bandwidth-fidelity dilemma, the incremental validity concept, and the validity degradation principle) and using longitudinal large-scale assessments from Germany, we analyzed unique, relative, and incremental covariate impacts on design sensitivity. Part I provided a wealth of (meta-analytically integrated) single- and multilevel design parameters; Part II presented a simulation study generating plausible MDES distributions for educational RTs.

Expanding the Range of Designs

We scrutinized covariates in IRTs as well as in 2L- and 3L-CRTs. In doing so, our study is unique in covering a large array of the experimental designs implemented to determine the effectiveness of educational interventions (Connolly et al., 2018; Spybrook et al., 2016).

The first central message from our analyses is as follows: In IRTs, the covariates' effects on design sensitivity largely confirmed the psychometric heuristics; in CRTs, usually all of the covariates noticeably boosted design sensitivity, even in the long term. From a single-level perspective, the higher the fidelity, the lower the bandwidth, and the shorter the pre-posttest time lag of a covariate, the more variance it explained between individual students, and the greater the returns in design sensitivity. Thus, the psychometric heuristics are indeed useful to inform covariate choices in IRTs. From a multilevel perspective, however, the relations are not always as straightforward. Fortunately, researchers have much more flexibility when choosing covariates for CRTs: all covariates under investigation, regardless of their degree of bandwidth/fidelity and time gap to the outcome, markedly raised design sensitivity. This holds especially true throughout secondary school, where large proportions of between-school differences could be captured by any covariate. This phenomenon, in which aggregated measures tend to correlate much more strongly than their individual-level equivalents, has been described by scholars before (e.g., Bloom et al., 2007; Härnqvist et al., 1994; Robinson, 1950; Snijders & Bosker, 2012).

Expanding the Range of Covariate Types, Combinations, and Time Lags

Previous studies on covariate effects on design sensitivity have systematically analyzed 1- to 3-year-lagged IP, the most recent CP, as well as SC; the latter have been examined both on their own and over and above IP. We added Gf to the spectrum of covariate types, combined IP with CP or Gf as well as with CP plus Gf plus SC, and covered long pre-posttest time lags of up to 7 years. In doing so, we involve the most relevant precursors of students' learning trajectories (e.g., M. C. Wang et al., 1993) and respond to the needs arising from the features of RTs implemented in education (e.g., Connolly et al., 2018; Lortie-Forgues & Inglis, 2019).

The second central message from our analyses is as follows: Using the latest IP as the sole covariate demonstrated outstanding capacity to improve design sensitivity in both IRTs and CRTs. IP clearly outweighed all remaining covariate types, although its prognostic value was indeed often affected by temporal deterioration. This pattern replicated the one we identified in our meta-analytic research review. However, as noted above, there may be scenarios that necessitate switching to CP, Gf, or SC, even when assessed long before the target outcome, or that justify their additional inclusion. On a side note, the present values of \({R}_{\mid \textrm{CP}/\textrm{Gf}/\textrm{SC}}^2\) may also serve as lower-bound estimates when pre-posttest content alignment is less than perfect (Bloom et al., 2007, p. 41). The effectiveness of CP, Gf, and SC in tweaking design sensitivity depended on several factors, first and foremost the grade level. Controlling for CP or Gf was a reasonable (alternative) strategy for RTs implemented in lower secondary school. Of importance, Gf appeared to be an exceptionally time-stable predictor, even across numerous years and irrespective of the design. Thus, the idea that Gf may serve as a robust covariate in RTs spanning several years—supported by existing single-level evidence—was generalized to multilevel settings in the present study for the first time. SC, in contrast, performed well as covariates, particularly in elementary school and occasionally also in upper secondary school. Incremental returns of CP, Gf, and/or SC over and above IP were often negligible, largely consonant with previous studies. As an exception, additionally taking SC into account in CRTs with first to fourth graders seems to be a relatively safe option to boost design sensitivity. Consequently, researchers should always consider the cost-effectiveness of covariates beyond IP with regard to the specific application context.

Expanding the Range of Outcome Domains

The bulk of available design parameter resources to guide covariate choices focuses on core domains, namely mathematics and science as STEM outcomes and reading as a verbal outcome. We complemented the STEM outcomes with ICT and the verbal outcomes with grammar, spelling, vocabulary, and writing. In doing so, we acknowledge that RTs often seek to enhance skills in domains beyond the core domains (Lortie-Forgues & Inglis, 2019; Morrison, 2020).

The third central message from our analyses is as follows: Impacts of the covariates on design sensitivity varied widely between achievement outcomes. For almost all covariates, we observed large heterogeneities in the amounts of explained variance across domains (and, if applicable, samples). Heterogeneity was mostly due to true variation at the level of effect sizes. This observation coincides with the findings of past studies (see also Brunner et al., 2018; Stallasch et al., 2021). Likewise, our simulations emphasize that MDES distributions were considerably dispersed; benefits in precision also strongly hinged on the outcome. Hence, researchers should always strive for an ideal fit between design parameters and the intervention’s target outcome. Yet, circumstances may limit this endeavor, such as the unavailability of suitable estimates for a specific domain. Here, our meta-analytic results may inform researchers of possible design parameter ranges and can be used in power analysis to determine expected lower and upper bounds of sample sizes, power rates, or MDES values.

Expanding the Range of National Scopes

Most evidence on sensitivity-enhancing covariate effects is restricted to the United States. We accumulated design parameters drawing on longitudinal large-scale assessment data from six German samples covering the entire school career (i.e., grades 1–12) of the total student population, as well as the student populations in the academic and non-academic tracks. In doing so, we meet the demands of a vast number of RTs that are conducted in countries where the school system more closely resembles the German system (e.g., with respect to the onset of school type tracking; Connolly et al., 2018).

The fourth central message from our analyses is as follows: The covariates' capabilities to raise design sensitivity cannot be universally generalized across national education contexts. We found notable differences in multilevel design parameters based on data from German versus US samples. In the German data, explained variances at L3 often appeared more pronounced throughout secondary school, whereas the reverse held at L2. This might be due to the fact that in the tracked German secondary school system, ρL3 tends to be larger and ρL2 smaller than previously reported for the United States (see Stallasch et al., 2021). Similar country-specific patterns in multilevel design parameters have also been documented in cross-national works (Brunner et al., 2018; Kelcey et al., 2016). It is therefore of utmost importance that researchers rely on variance estimates that best depict the characteristics of the intervention's target population.

Essentials of Covariate Adjustment in RTs on Student Achievement at a Glance

Our analyses imply the following general recommendations on pre-treatment covariate inclusion in IRTs and CRTs on student achievement in the German (and similar) school context.

1. A pretest should substantively match the RT's target outcome as closely as possible. In particular, when the outcome is well-aligned with the content of the intervention (e.g., high curricular validity or sensitivity to depict instructional effects), a pretest in the outcome domain may be favorable over one in another domain.

2. A pretest should have high fidelity/low bandwidth rather than low fidelity/high bandwidth. Thus, a domain-specific pretest may be preferable to a domain-general one.

3. A pretest in fluid intelligence may be considered in—especially long-term—RTs in lower secondary school (grades 5–10).

4. Sociodemographic measures may be used in elementary school RTs (grades 1–4).

5. If a pretest in the outcome domain is available, precision gains through additional covariates are often negligible, except for point 4.

6. In IRTs, a pretest in the outcome domain should be granted priority, despite its potential temporal validity degradation. In CRTs—especially in secondary school (grades 5–12)—cost issues should be brought to the fore, as any covariate may be beneficial.

7. Uncertainty in the design parameters should be taken into account, for example via (meta-analytic) 95% CIs/PIs or simulations based on empirical prior distributions.

In addition, we urge researchers planning RTs to keep the following factors in mind:

8. Measurement error in the covariate(s) and/or outcome typically attenuates single- and multilevel R2 (Cohen et al., 2003; Raudenbush & Bryk, 2002).Footnote 18 Reliable measures are expedient for power and precision in RTs (Maxwell et al., 1991). To handle reliability issues, one may (a) add items when newly developing measures, which should reduce measurement error, (b) use a plausible values approach when estimating test scores (Blackwell et al., 2017), or (c) apply latent variable models when analyzing treatment effects in IRTs (Bollen, 2002; Mayer et al., 2016) and CRTs (Lüdtke et al., 2008; Raudenbush & Bryk, 2002), which partial out measurement error. If none of these options is feasible, adjusting for (even error-prone) covariates is—as a rule—still advisable to raise design sensitivity in RTs (Maxwell et al., 2017, p. 481); a minimal attenuation sketch follows this list.

9. In CRTs, aggregated L1 covariates that demonstrate large ρ values at the implementation level of the intervention are advantageous. When developing new measures for the CRT, it may be worthwhile to use pilot data of ρ estimates for each item to construct multi-item scales that optimize between-group differentiation (Bliese et al., 2019).

10. Participant attrition during RT implementation hampers the prognostic properties of covariates (Rickles et al., 2018). Hence, planning RTs as conservatively as possible given available financial and personnel resources may be reasonable.
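As referenced in point 8, a back-of-the-envelope illustration of classical attenuation (assuming classical test theory with an error-free outcome and a single error-prone covariate; not part of the reported analyses) might read as follows:

```r
# Classical attenuation: with an error-free outcome and a covariate of
# reliability rtt, the observed R2 shrinks to roughly R2_true * rtt.
attenuated_R2 <- function(R2_true, rtt) R2_true * rtt
attenuated_R2(R2_true = .53, rtt = .77)   # .53 shrinks to about .41
```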

Limitations

Our work has several shortcomings. First, this study's results may generalize well to RT target populations, measures, and design characteristics mimicking those addressed here. More precisely: (a) The ideal application context is the German school system; yet, our design parameters may still serve as valuable benchmarks to plan RTs in countries with similar performance-based school tracking, such as Austria, the Czech Republic, Hungary, the Slovak Republic, and Turkey (Salchegger, 2016). Relatedly, we drew on rather heterogeneous samples. Higher homogeneity (e.g., with convenience samples, as is typical in educational RTs; Tipton & Olsen, 2018) may lead to smaller ρ and R2 due to range restrictions (Miciak et al., 2016). Further, since we chose not to apply sampling weights, our findings are quasi-representative, and the estimates as well as their SEs are—albeit presumably only slightly (see e.g., Wenger et al., 2018)—less accurate than weighted ones. (b) The present design parameters optimally match measures resembling those used in NEPS, PISA, or DESI. Caution is warranted when designing RTs relying on (substantively) divergent outcome or covariate measures. Importantly, in view of the observed temporal validity degradation, somewhat larger/smaller R2 values are expected for shorter/longer pre-posttest time intervals. (c) The MDES formulae assume homoscedasticity (i.e., that the treatment and control groups share a common outcome variance). For our simulation conditions with (fully) balanced designs, statistical tests for the treatment effect resting on this assumption have been shown to be robust, even under heteroscedasticity (Blanca et al., 2018; Korendijk et al., 2008). In unbalanced designs, however, the MDES values would probably be too optimistic (see e.g., Gail et al., 1996).

Second, our selection of covariates was theoretically and empirically oriented. Yet, as noteworthy amounts of variance remained unexplained for many outcomes, other individual- or group-level attributes might also function as profitable covariates. For example, socioemotional characteristics—representing important curricular targets (OECD, 2015)—such as domain-specific motivation (Levy et al., 2023), self-concept (Wu et al., 2021), or anxiety (Aldrup et al., 2020) have been shown to predict student achievement over and above IP. Of note, the fully documented R code and our R package multides (Stallasch, 2024) enable researchers to produce R2 values for additional covariate sets specifically relevant for their prospective RT, drawing on (publicly) available datasets.

Third, we used reasoning ability as assessed by standard figural matrices as our measure of fluid intelligence. Fluid intelligence, however, is a multifaceted construct that encompasses—besides reasoning as one integral component—various further abilities such as perceptual speed, accuracy, and problem solving (e.g., Baltes et al., 1999; Cattell, 1987; see also Brunner et al., 2014). Therefore, \({R}_{\mid \textrm{Gf}}^2\) values should be interpreted as lower-bound estimates and would possibly have been larger with a broader spectrum of subtests.

Fourth, virtually no measure in the social sciences is free from measurement error. Ours are no exception: reliabilities of the test scores were 0.51 ≤ rtt ≤ 0.96 (Mdn(rtt) = 0.77; Table C11). These are fairly typical for applied (experimental) research, where rtt ≥ 0.70 is desirable, but even smaller values may suffice (Schmitt, 1996). Hence, although fallible measures may lower R2, our estimates may generalize well to realistic planning scenarios.

Conclusion

Inspired by psychometric heuristics and capitalizing on rich data from several German longitudinal large-scale assessments, we substantively expanded the body of knowledge on covariate impacts on design sensitivity in IRTs and CRTs on student achievement. Our study bundles an extensive compilation of (meta-analytic) single- and multilevel design parameters with a precision simulation study that implicitly incorporates uncertainty by adopting a Bayesian rationale. Our work is enriched by illustrative, empirically supported application guidance and comprehensive OSMs. We hope that these resources support evaluation researchers in making wise covariate selections when planning educational experiments to gather sound evidence on what works to advance student learning.