Design-based approaches to replication depend on well-specified research questions regarding the causal estimand of interest. This means that researchers need to specify the units, treatments, outcomes, and settings of interest, and which of these factors, if any, the researcher wishes to vary. Given the “innumerable” potential variations in causal estimands, Nosek and Errington advise researchers to ask, “which ones matter?” (Nosek & Errington, 2020, p. 4). The selection of factors for systematic testing, and the extent to which these factors should be varied, necessarily depends on the researcher’s deep subject matter theory of the intervention, as well as their expert knowledge of the factors most likely to produce effect variation (Simons et al., 2017). These are the conditions that the researcher believes are both necessary and sufficient for replicating a causal claim. Explicating these factors is needed for understanding how an intervention’s effect is meant to generalize, as well as the limits of the effect under investigation (Nosek & Errington, 2020).
In design-based approaches to replication, applying subject matter theory to the selection of a replication design is operationalized through the researcher’s question about the causal estimand of interest. For example, direct replications evaluate whether two or more studies with the same well-defined causal estimand yield the same effect. Although the most stringent forms of direct replication seek to meet all replication and individual study assumptions, the most informative direct replications deliberately test one or more individual study assumptions (S1-S3) as potential sources of replication failure. High-quality direct replications require that CRF assumptions R1 and R2 are met, because these assumptions ensure that the studies compare the same causal estimand while introducing systematic sources of variation that test individual study assumptions (S1-S3). Examples include within-study comparison designs (Fraker & Maynard, 1987; LaLonde, 1986), which compare effect estimates from an observational study with those from an RCT benchmark with the same target population (S1); robustness checks (Duncan et al., 2014), which compare effect estimates for the same target population using different estimation procedures (S2); and reproducibility analyses (Chang & Li, 2015), which compare study results produced by independent investigators using the same data and analysis code (S3). In all of these approaches, the researcher concludes that an individual study effect is biased or incorrectly reported (that is, that an individual study assumption S1-S3 is violated) if replication failure is observed.
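To make this kind of test concrete, consider a simple correspondence check for a within-study comparison. The minimal Python sketch below treats the two effect estimates as independent and uses invented numbers throughout; in practice, within-study comparisons may share data across the two estimates, which requires an adjusted standard error for the difference (and equivalence tests offer a more stringent criterion than a null-difference test).

```python
# Correspondence test: does an observational estimate replicate an RCT
# benchmark? All estimates and standard errors below are hypothetical.
from scipy import stats

b_rct, se_rct = 0.25, 0.06    # RCT benchmark effect and SE (hypothetical)
b_obs, se_obs = 0.18, 0.07    # observational effect and SE (hypothetical)

# For independent samples, the variance of the difference in estimates is
# the sum of the squared standard errors.
diff = b_obs - b_rct
se_diff = (se_rct ** 2 + se_obs ** 2) ** 0.5
z = diff / se_diff
p = 2 * stats.norm.sf(abs(z))                 # two-sided p-value

print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p:.3f}")
```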
Conceptual replications, however, seek to examine whether two or more studies with potentially different causal estimands produce the same effect. To implement this approach, the researcher selects and introduces variations in units, treatments, outcomes, and settings (R1-R2) while attempting to ensure that all individual study assumptions (S1-S3) are met. The goal is to test and identify potential sources of effect variation based on subject matter theory, often for the purpose of generalizing effects to broader target populations (Clemens, 2017; Schmidt, 2009; Simons et al., 2017).
Definitions of conceptual and direct replications under the CRF complement existing, more heuristic approaches to replication (Brandt et al., 2014; LeBel et al., 2018). An advantage of the CRF, however, is that it provides a formal method for deriving replication designs that systematically test sources of effect heterogeneity, as well as for evaluating the quality of the replication design for making inferences. The remainder of this section focuses on research designs for conceptual replication. Although the designs we discuss are widely implemented in field settings, they are not currently recognized as replication designs. Understanding these approaches as replication designs demonstrates that it is both feasible and desirable to conduct high-quality replication studies in field settings, as well as to make inferences about why replication failure occurred.
Multi-Arm RCT Designs
Multi-arm RCTs are designed to evaluate the impact of two or more intervention components in a single study. Participants are randomly assigned to one of multiple intervention arms with differing treatment components, or to a control group, allowing researchers to compare a series of pairwise treatment contrasts. For example, in a study evaluating the effectiveness of personalized feedback interventions for reducing alcohol-related risky sexual behavior, researchers randomly assigned participants to one of three arms: one arm received personalized information on alcohol use and personalized information on sexual behavior (“additive approach”); a second arm received personalized information on the relationship between alcohol and risky sexual behavior (“integrated approach”); and a third control arm received unrelated information on nutrition and exercise (Lewis et al., 2019).
This multi-arm RCT may be understood as a replication design that purposefully relaxes the assumption of treatment stability (R1) to test whether the effect of a personalized feedback intervention replicates across variations in feedback content relative to the same control condition (an additive versus an integrated approach). Because systematic variation is introduced within a single study, all CRF assumptions other than treatment stability (R1) may plausibly be met: the same instruments are used for assessing outcomes at the same time and in the same settings for all comparisons (R1); the control condition for evaluating intervention effects is the same for each comparison (R1); and random assignment of participants to conditions ensures identical distributions of participant characteristics across groups in expectation (R2) and identifies the causal estimand (S1). The researchers may also examine whether each pairwise contrast is robust to different model specifications, providing assurance of unbiased estimation of effects (S2). If all other CRF assumptions are met, and pairwise contrasts yield meaningful and statistically significant differences in effect estimates, then the researcher may conclude with confidence that variation in intervention conditions caused the observed effect heterogeneity.
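A minimal sketch of this pairwise-contrast logic appears below. The data are simulated, and all variable names and effect sizes are hypothetical (they are not from Lewis et al., 2019); the arm labels merely echo the example above. Each regression coefficient is a contrast against the shared control arm, and the equality test on the two arm coefficients is the replication test across the R1 variation.

```python
# Simulated three-arm RCT with pairwise contrasts against a shared control.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 900
arm = rng.choice(["control", "additive", "integrated"], size=n)
true_effect = {"control": 0.00, "additive": 0.30, "integrated": 0.45}
y = np.array([true_effect[a] for a in arm]) + rng.normal(0, 1, size=n)

df = pd.DataFrame({"y": y,
                   "additive": (arm == "additive").astype(int),
                   "integrated": (arm == "integrated").astype(int)})

# Each coefficient is a pairwise contrast against the shared control arm.
fit = smf.ols("y ~ additive + integrated", data=df).fit()
print(fit.params)

# Test whether the two intervention arms produce the same effect, i.e.,
# whether the planned R1 (treatment) variation yields effect heterogeneity.
print(fit.t_test("additive = integrated"))
```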
RCTs with Multiple Cohorts
RCTs with multiple cohorts allow researchers to test the stability of their findings over time. In this design, successive cohorts of participants are recruited within a single institution or a set of institutions, and participants within each cohort are randomly assigned to intervention or control conditions. As a concrete example, in an evaluation of a comprehensive teen dating violence prevention program, 46 schools were randomly assigned to participate in Dating Matters over two successive cohorts of 6th graders or to a business-as-usual control condition (DeGue et al., 2020). Experimental intervention effects for each cohort were compared to evaluate whether the same result replicated over time. The multiple-cohort design also facilitates recruitment by allowing researchers to deliver intervention services and collect data over multiple waves of participants, which may be useful when resources are limited.
RCTs with multiple cohorts may be considered conceptual replications designed to test for effect heterogeneity across cohorts at different time points. To address CRF assumptions, the researcher would implement a series of diagnostic checks to ensure that replication and individual study assumptions are met. For example, the researcher may check that the same instruments are used to measure outcomes, and that they are administered in similar settings and timeframes across cohorts (R1). The researcher may also implement fidelity measures to evaluate whether intervention and control conditions are carried out in the same way over time (R2), check that there are no spillover effects across cohorts (R2), and assess whether the distribution of participant characteristics remains the same (R2). Finally, to address individual study assumptions (S1-S3), the researcher should ensure that a valid research design and estimation approach are used to produce results for each cohort and that the results are verified by an independent analyst.
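If these checks hold, the cross-cohort replication test itself is straightforward. The sketch below simulates a two-cohort RCT in which the treatment effect weakens slightly in the second cohort; all names and magnitudes are hypothetical. The treatment-by-cohort interaction estimates the cross-cohort difference in effects, so rejecting its null signals replication failure over time.

```python
# Simulated two-cohort RCT with a cohort-by-treatment interaction test.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1200
cohort2 = rng.integers(0, 2, size=n)     # 0 = first cohort, 1 = second
treat = rng.integers(0, 2, size=n)       # randomized within each cohort
# Simulate a treatment effect that weakens slightly in the second cohort.
y = 0.40 * treat - 0.10 * treat * cohort2 + rng.normal(0, 1, size=n)

df = pd.DataFrame({"y": y, "treat": treat, "cohort2": cohort2})

# The treat:cohort2 coefficient estimates the cross-cohort difference in
# effects; a non-null interaction indicates replication failure over time.
fit = smf.ols("y ~ treat * cohort2", data=df).fit()
print(fit.summary().tables[1])
```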
Because RCTs with multiple cohorts are often implemented in the same institutions under similar conditions, many characteristics related to the intervention, setting, participants, and measurement of outcomes will remain (almost) constant over time. However, some replication assumptions (R1, R2) may be at risk of violation. For instance, intervention conditions often change as interventionists become more comfortable delivering protocols and/or as researchers seek to improve the intervention components or their data collection efforts. Moreover, intervention results may change if there are maturational effects among participants that interact with the treatment, or if there are changes in settings that moderate the effect. The validity of multiple-cohort designs may also degrade over time as participants in entering cohorts become aware of the study from prior years. When participants have strong preferences for one condition over another, they may respond differently to their intervention assignments, which may challenge the interpretation of the RCT. Replication designs with multiple cohorts provide useful tests for examining treatment effect variation over time, but the design is most informative when the researcher is able to document the extent to which replication assumptions are violated over time in ways that may produce replication failure.
Switching Replication Designs
Switching replications allow researchers to test the stability of a causal effect over changes in a setting or context. In this approach, two or more groups are randomly assigned to receive an intervention at different time intervals, in an alternating sequence: when one group receives the treatment, the other group serves as the control, and when the control group later receives the treatment, the original treatment group serves as the control (Shadish et al., 2002). Replication success is examined by comparing the treatment effect from the first interval with the treatment effect from the second interval. The design also gives every participant an opportunity to receive the intervention, which is useful when the intervention is highly desired by participants or when it is unethical or infeasible to withhold it.
Though switching replications are relatively rare in prevention science, opportunities for their use are commonplace; many evaluations incorporate a waitlist control group design in which the control group receives the intervention after the treatment group. Waitlist control groups have recently been used to evaluate the impact of parenting interventions (Keown et al., 2018; Roddy et al., 2020), mental health interventions (Maalouf et al., 2020; Terry et al., 2020), and healthy lifestyle interventions (Wennehorst et al., 2016). As a concrete example, in an evaluation of the Complete Health Improvement Program (CHIP) Germany, treatment participants met twice a week for 8 weeks, receiving lessons aimed at preventing Type 2 diabetes and cardiovascular disease. The control group was provided access to the same program after the 12-month follow-up period (Wennehorst et al., 2016). If the study researchers were additionally interested in the relative effectiveness of an online version of the program, this study could easily be adapted into a switching replication. In this design, the waitlist control group would serve as a control for the first treatment group, but after the 12-month follow-up, the waitlist group would participate in virtual CHIP meetings while the first group served as the control. Health outcomes would be measured at the beginning of the study, after the first group receives CHIP, and after the second group receives the online version of CHIP.
In the switching replication design, the RCT in the second interval serves as a conceptual replication of the RCT conducted in the first interval. The primary difference across the two studies is the setting in which the healthy lifestyle intervention is delivered (an in-person class versus an online class). This allows the researcher to address multiple assumptions under the CRF. Because participants are shared across both studies, the same causal estimand is compared (R2); because participants are randomly assigned to conditions, treatment effects are identified for each study (S1). Reports of results from multiple estimation approaches and independent analysts can provide assurances that assumptions S2 and S3 were met. If replication failure is observed, the researcher may conclude that the change in how the intervention protocol was delivered caused the effect variation.
However, results from the switching replication design are most interpretable when the intervention effect is assumed to be a causally transient process: once the intervention is removed, there should be no residual impact on participants’ health (R1). This assumption may be checked by extending the length of time between the first and second intervals, and by measuring health immediately before the intervention is introduced to the second group. The design also requires that the same outcome measure is used for assessing impacts and comparing results across study intervals (R1), that there are no history or maturation effects that violate CRF assumptions (R2), and that there are no compositional differences in groups across the two study intervals (R2).
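The sketch below simulates the adapted CHIP design under these assumptions, including transience (group A carries no residual effect into the second interval). All group labels and effect sizes are hypothetical. The interval-specific effects are estimated separately and then compared with the same correspondence test sketched earlier; because participants are shared across intervals, the independence assumption behind that comparison is only an approximation.

```python
# Simulated switching replication: group A receives the in-person program
# in interval 1 while group B waits; group B receives the online version in
# interval 2 while group A serves as control. Effects are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400                                   # participants per group
frames = []
for interval, treated_group, effect in [(1, "A", 0.35), (2, "B", 0.25)]:
    for group in ("A", "B"):
        treated = int(group == treated_group)
        # Transience assumed: no residual effect carries into interval 2.
        y = treated * effect + rng.normal(0, 1, size=n)
        frames.append(pd.DataFrame(
            {"y": y, "treated": treated, "interval": interval}))
df = pd.concat(frames, ignore_index=True)

# Estimate the treatment effect separately within each interval.
est = {}
for k in (1, 2):
    fit = smf.ols("y ~ treated", data=df[df["interval"] == k]).fit()
    est[k] = (fit.params["treated"], fit.bse["treated"])
    print(f"interval {k}: effect = {est[k][0]:.3f} (SE = {est[k][1]:.3f})")

# Compare interval effects (treating the estimates as independent).
diff = est[1][0] - est[2][0]
se_diff = (est[1][1] ** 2 + est[2][1] ** 2) ** 0.5
print(f"difference = {diff:.3f}, z = {diff / se_diff:.2f}")
```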
Combining Replication Designs for Multiple Causal Systematic Replications
On its own, a well-implemented research design for replication is often limited to testing a single source of effect heterogeneity. However, it is often desirable for the researcher to investigate and identify multiple sources of effect variation. To achieve this goal, a series of planned systematic replications may be combined in a single study effort. Each replication may use a different research design (as described above) to test a specific source of effect variation or to address a different validity threat. The researcher then examines the pattern of results over the multiple replication designs to evaluate the replicability and robustness of effects.
As an example, Cohen et al. (2020) developed a coaching protocol to improve teacher candidates’ pedagogical practice in simulation settings. The simulation provides opportunities for teacher candidates to practice discrete pedagogical tasks such as “setting classroom norms” or “offering students feedback on text-based discussions.” To improve candidates’ learning in the simulation setting, the research team developed a coaching protocol in which a master educator observes a candidate practice in the simulation session and then provides feedback on the candidate’s performance based on a standardized coaching protocol. The teacher candidate then practices the pedagogical task again in the simulation setting. To assess the overall efficacy of the coaching protocol (the treatment condition), the research team randomly assigned teacher candidates to participate in a standardized coaching session or a “self-reflect” control condition, and compared candidates’ pedagogical performance in the simulation session afterwards. Outcomes were assessed using standardized observational rubrics of the quality of candidates’ instructional practice in the simulation setting (Cohen et al., 2020).
To examine the robustness of effects across systematically controlled sources of variation, the research team began by hypothesizing three important sources of effect variation: differences (a) in the timing of when the study was conducted, (b) in the pedagogical tasks practiced in the simulator, and (c) in the target population and study setting. To test these sources of variation, the research team implemented three replication designs: a multiple-cohort design, a switching replication design, and a conceptual replication that varied the target population and setting under which the coaching intervention was introduced. This set of replication designs was constructed from four individual RCTs conducted from Spring 2018 to Spring 2020. The RCTs took place within the same teacher training program but were conducted over two cohorts of teacher candidates (2017–2018, 2018–2019) and an undergraduate sample of participants (Fall 2019).
Table 2 provides an overview of the schedule of the four RCTs. Here, each individual RCT is indexed by Sij, where i denotes the sample (teacher candidate cohort 1 or 2, or undergraduate sample 3) and j denotes the pedagogical task for which the coaching or self-reflection protocol was delivered (1 if the pedagogical task involved a text-based discussion; 2 if the pedagogical task involved a conversation about setting classroom norms). Table 3 demonstrates how each replication design was constructed from the four individual studies. Here, the research team designated S22 as the benchmark study for comparing results from the three other RCTs. For example, to assess the replicability of coaching effects over time, the research team examined whether coaching effects were similar across the two cohorts of teacher candidates (S22 versus S12). To examine the replicability of effects across different pedagogical tasks, the research team implemented a modified switching replication design (S22 versus S21): candidates were randomly assigned in Fall 2018 to receive the coaching or the self-reflection protocol in the “text-based discussion” simulation scenario, and their intervention conditions were switched in Spring 2019 while they practiced the “setting classroom norms” simulation scenario. Coaching effects for the fall and spring intervals were compared to assess the replicability of effects across the two different pedagogical tasks. Finally, to examine the replicability of effects over a different target population and setting, the research team compared the impact of coaching in the benchmark study to RCT results from a sample of participants who had interest in entering the teaching profession but had yet to enroll in a teacher preparation program (S22 versus S32). This sample included undergraduate students at the same institution who were enrolled in a “teaching as a profession” class but had not received any formal methods training in pedagogical instruction. Participants were invited to engage in pedagogical tasks for “setting classroom norms” and were randomly assigned to receive coaching from a master educator or to engage in the self-reflection protocol. Table 4 summarizes the sources of planned variation under investigation for each replication design. Anticipated sources of variation are indicated by ❌; assumptions that are expected to be held constant across studies are indicated by ✓.
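The comparisons in Table 3 all share the same structure: each replication study is contrasted against the S22 benchmark. The sketch below applies the correspondence test from earlier to each contrast; the effect estimates and standard errors are invented for illustration and are not the results reported by Cohen et al. (2020) or Krishnamachari (2021).

```python
# Benchmark comparisons for a combined causal systematic replication.
# All effect estimates and SEs are hypothetical placeholders.
from scipy import stats

benchmark = ("S22", 0.50, 0.10)              # (study, effect, SE)
comparisons = [
    ("S12 (cohort variation)",     0.46, 0.11),
    ("S21 (task variation)",       0.44, 0.12),
    ("S32 (population variation)", 0.22, 0.12),
]

b0, se0 = benchmark[1], benchmark[2]
for name, b, se in comparisons:
    # Same difference test as before, applied to each planned contrast.
    z = (b - b0) / (se ** 2 + se0 ** 2) ** 0.5
    p = 2 * stats.norm.sf(abs(z))
    print(f"{benchmark[0]} vs {name}: diff = {b - b0:+.2f}, "
          f"z = {z:.2f}, p = {p:.3f}")
```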
Combined, the causal systematic replication approach allowed the research team to formulate a theory about the replicability of coaching effects in the context of the simulation setting. The research team found large, positive, and statistically significant impacts of coaching on participants’ pedagogical practice in the simulation setting. Moreover, coaching effects were robust across multiple cohorts of teacher candidates and for different pedagogical tasks. The magnitude of effects, however, was smaller for participants who were exploring teaching as a profession but had yet to enroll in the training program. These results suggest that differences in participant characteristics and background experiences in teaching led these participants to benefit less from coaching in the simulation setting (Krishnamachari, 2021).