Educational Technology and Instructional Systems research, like most other empirical sciences, is faced with the challenge of determining cause and effect to establish the effectiveness of interventions, determine ways to develop them further, and construct theories that capture the underlying principles of using technology for educational purposes. Determining cause and effect means attributing changes in a variable of interest, e.g., learning outcomes, unequivocally to a preceding variable, e.g., an instructional innovation.

As the experimental method is considered by many to be the gold standard for identifying cause and effect, Educational Technology and Instructional Systems researchers rely heavily on experimentation to investigate their research questions (Bulfin et al., 2014; Honebein & Reigeluth, 2021; Thomas & Lin Lin, 2020). However, not all research contexts are amenable to rigorously controlled experimentation. This is often the case when an educational technology cannot be implemented without coming into conflict with the features of the system in which learning takes place (Lewis, 2015). Thus, while causal inferences may be desired, clean experimental comparisons are not always feasible, for example, when an instructional innovation entails substantial curricular modification or when a technological innovation cannot be meaningfully implemented without also changing the instructional method. Much of educational technology research is situated at precisely this tension (Salomon, 1991).

This theoretical paper introduces a framework for causal reasoning based on causal graphs, so-called Directed Acyclic Graphs (DAGs). We will outline the basic formalism underlying such graphs, distinguishing them from another form of graph well known to educational technology researchers: structural equation models. Reasoning with causal graphs requires researchers to make their causal assumptions explicit, which is, as we will show, a value in and of itself. Finally, we will present examples of the utility of DAG-based causal reasoning in different stages of the educational technology research process.

Causal inference within varied research approaches

Today, comparative research or “research to prove” is (again) under scrutiny, as presented by Honebein and Reigeluth (2021). In their classification, they distinguish five different types of research, some of which they find less equipped than others to deliver meaningful knowledge to the research field. Research-to-prove entails a confirmatory approach of testing hypotheses that are either confirmed or rejected in controlled settings, usually associated with the experimental method, the vehicle of choice for delivering causal inferences. Although this makes up a majority of educational technology research, as Honebein and Reigeluth (2021) posit, other approaches may be better equipped to deliver results in line with the goals of the field. Research-to-prove emphasizes rigor over relevance, with the result that the knowledge gained from it offers little guidance for improving student learning in practical contexts. They also point to the well-known problem of confounding, where the experimental comparison varies more than only the variable of interest. The result is that changes in the dependent variable cannot be unequivocally attributed to the treatment. Instead, they advocate for an increased focus on research-to-improve. This notion stems from improvement science, a more practical, cyclical approach to solving real-world problems (LeMahieu et al., 2015). Research methods in this paradigm include action research, design-based research, evaluation research, and others, which are seen as more in line with the cyclical nature of designing educational technology (Phillips et al., 2012).

However, while one may be tempted to believe improvement science does not rely on causal inference (LeMahieu et al., 2015; Lewis, 2015), it should be highlighted that causality is fundamental to understanding both the most basic processes and the most complex systemic dependencies: our understanding of the world rests on causal assumptions. For example, Buchner and Kerres (2022) systematically review the literature on augmented reality in education and find a preponderance of media comparison studies. As more productive alternatives, they propose Learner-Treatment Interaction designs and value-added designs, which focus on the question of when and how technology works well. This ties in with the idea of improvement science because questions about when and how a technology works well are the foundation for improvement, and, crucially, improvement is an inherently causal concept. This means that these research goals can only be achieved by finding ways to causally attribute learning or learning-relevant processes to some feature of the educational technology, features which themselves interact with the educational context and the learner’s individual differences. Choosing and testing a feature candidate for improvement is based on a frequently implicit causal model that describes how variables influence each other. As LeMahieu et al. (2015) state: “We need to understand how some system produces current results to intervene in the right places and with the right changes. We also need this understanding if we are to implement complex practices across contexts” (p. 447).

In other words, freeing ourselves from the shackles of confounded comparative research does not free us from the causal foundation on which all empirical questions rest: “What works and why?” followed by “what can be changed and how?” In fact, within the research classification of Honebein and Reigeluth (2021), only research-to-describe would be satisfied with remaining on an associational level, where no attempts at causal reasoning are made, and no practical implications are intended (see Table 1). For all other research approaches, reasoning about cause and effect is imperative.

Table 1 Causal inference needs within the research classification of Honebein and Reigeluth (2021)

Causal inference and directed acyclic graphs

Given the developments in causal inference over the past decades, tools are now available that enable researchers to improve their causal inferences, whatever the specific research approach at hand. One central approach is a graphical method using DAGs, which allows researchers to reason about causal assumptions and to make research design decisions as well as statistical analysis decisions. This approach, spearheaded by Judea Pearl since the 1990s, has gained momentum over the years such that it is now a comprehensive framework for causal inference (Hernan & Robins, 2020; Morgan & Winship, 2015; Pearl, 1995, 2009). While most prominently employed in epidemiology (Munafo et al., 2018), there have been numerous recent efforts to introduce and apply these principles to psychology (Lee, 2012; Rohrer, 2018), sociology (Elwert & Winship, 2014), as well as many other fields (Griffiths et al., 2020). The power of this approach is that, with only a few key principles, researchers can begin to improve their causal inferences in varied research contexts and at every stage of the research process (see Section “Using DAGs in different phases of educational technology research”). To our knowledge, no such introduction has been explicitly presented to the field of Educational Technology.

DAGs are visual representations of causal assumptions. In essence, using variables, boxes, and arrows, the goal is to build a causal model of the relevant variables of interest. As such, they may look like structural equation models (see Section “The relation between structural equation models and DAGs”). However, in contrast to structural equation models, DAGs are non-parametric. While an arrow pointing from one variable to another indicates a causal effect, it is agnostic about whether the functional form is linear, quadratic, or exponential. In other words, while working with DAGs, researchers are constrained neither by statistical and methodological considerations nor by data availability. This is because DAGs are conceptual tools. In the first instance, it is only imperative to consult expert knowledge (i.e., theory) to construct a DAG that is substantively plausible and as comprehensive as possible, given the current state of knowledge.

Using the DAG, researchers can then assess whether the causal effect of interest is identified. This means inspecting the graph carefully, paying attention to the directionality of arrows, and tracing possible paths of the causal effect. In essence, through this careful inspection researchers ask: “Is it possible to estimate the causal effect of interest (given that my causal assumptions are correct)?” If the answer is yes, the causal effect can be validly estimated. Crucially, this is irrespective of the research design or approach, meaning that a valid causal estimate can also be retrieved from non-experimental research. On the other hand, if the answer is no, the causal effect is not identified; this means a source of bias prevents valid estimation of the causal effect. In this case, several steps may be taken to improve the situation: adding statistical controls, ceasing to stratify on a variable, or changing the research design (see Section “Using DAGs in different phases of educational technology research”). This distinction between identification and estimation highlights the function of DAGs as conceptual tools, which should ideally be used in the planning phase of empirical studies. However, we will also show how DAGs can be used in other phases of research, even if they were absent in the planning phase.

Using DAGs to identify confounders

Consider the following example: If a researcher wants to learn about the effectiveness of an intervention, this can be depicted in the simplest of graphical forms, an arrow pointing from the hypothesized independent variable (X) to a dependent variable (Y). This indicates the hypothesized causal effect. However, if the investigator cannot randomize the intervention, extending the causal graph is necessary. The most basic and common extension is a third variable (Z) that influences both the independent and dependent variable (Fig. 1a).

Fig. 1

Example of a DAG with confounding variable. The term intervention here and in the following figures denotes some (technological) innovation in educational contexts and is not restricted to treatments that can be manipulated and randomized. a Confounding variable present. b Confounding variable controlled for

This kind of variable is frequently referred to as a confounding variable. In this example, student engagement could play this role because more engaged students (a) may make more or better use of the intervention and (b) are more likely to learn successfully, independent of the intervention (Fig. 1b). Of course, the plausibility of confounders depends on the variables that are at the heart of the investigation. Expert knowledge and theory may guide investigators in reasoning about potential confounders in the causal model. Given this confounding situation (also called a “fork” in the DAG literature), the researcher will conclude that the causal effect is not identified. This means additional steps must be taken because aside from the causal path of interest (intervention → student learning), there is an additional, non-causal path (intervention ← student engagement → student learning). This path is non-causal because, tracing it from X to Y, we encounter arrows pointing in the ‘wrong’ direction. This indicates a source of bias, meaning that an estimate of the causal effect would be an uninterpretable mix of the true causal effect and additional non-causal association due to confounding. To arrive at this conclusion, all paths connecting X and Y must be considered, irrespective of the direction of arrows. In this case, the inspection yields one open back-door path (Pearl et al., 2016), which, in the language of DAGs, should be blocked. A standard way of blocking paths is by measuring the confounder and controlling for it statistically. This is often also referred to as conditioning, adjusting, or stratifying. Graphically, this is depicted by a box around the variable controlled for (Fig. 1b). The most straightforward way of doing this is by group-specific analyses, that is, looking at different levels of the confounder individually. This would mean analyzing highly engaged and less engaged students separately instead of considering them simultaneously.
If the confounder is adequately controlled, X and Y are d-separated (Pearl, 1995), meaning that no open back-door paths remain. Then, the causal effect is identified under the assumption that the DAG is complete, that is, that no other confounding is present. The statistical estimate of X on Y would then yield the true causal effect. Aside from group-specific analyses, there are many other approaches, of which we will mention only a few. Researchers may instead opt for the covariate-in-regression approach, including one or more third variables (i.e., confounders) in a regression model. Conceptually, this is similar to group-specific analyses; variables included in the regression model can be considered controlled because the coefficients are estimated while holding the covariates constant. Matching is a third popular approach to statistical control; it works by constructing groups post hoc that are similar with respect to some hypothesized confounding variables. A widely used matching method is propensity score matching. All these approaches aim to approximate the leveling effect of randomization by creating groups that do not differ on the bias-introducing variable.
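The logic of the fork and its remedy can be illustrated with a small simulation. All variable names and effect sizes here are hypothetical, and we assume linear relationships purely for convenience; DAGs themselves make no such parametric commitment:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Fork: engagement confounds intervention use and learning.
engagement = rng.normal(size=n)
intervention = 0.8 * engagement + rng.normal(size=n)
learning = 0.5 * intervention + 0.7 * engagement + rng.normal(size=n)

# Naive estimate: simple regression of learning on intervention.
naive = np.polyfit(intervention, learning, 1)[0]

# Adjusted estimate: covariate-in-regression, holding engagement constant.
X = np.column_stack([intervention, engagement, np.ones(n)])
adjusted = np.linalg.lstsq(X, learning, rcond=None)[0][0]

print(f"naive:    {naive:.2f}")     # biased upward by the open back-door path
print(f"adjusted: {adjusted:.2f}")  # close to the true effect of 0.5
```

With the back-door path open, the naive estimate mixes the true effect (here 0.5) with the confounded association; conditioning on the confounder via the covariate-in-regression approach recovers the true effect.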

Using DAGs to distinguish confounders from mediators

A mediating effect, or chain in the language of DAGs, consists of three variables in succession: independent variable → mediator → dependent variable. For example, suppose our intervention is a tool (X) aimed at supporting self-reflection and self-explanation. In that case, these meta-cognitive processes (Z) may be mediators connecting the intervention to enhanced student learning (Y, see Fig. 2a). In DAG language, the mediator transmits the causal effect. The chain is not a non-causal path because all arrows along it point in the direction of the causal effect, that is, no arrows run in the opposite direction of the causal effect of interest. Whereas in the case of confounding it is imperative to control for the third variable, in the case of a mediating effect, controlling for Z introduces bias. This is because controlling for metacognitive processes blocks this path (intervention → [metacognitive processes] → student learning), even though changes in metacognitive processes may be a crucial reason for changes in student learning. By comparing only those students with identical values of the mediator, we exclude the very mechanism responsible for the effect. In practice, this overcontrol bias (Elwert & Winship, 2014) frequently attenuates the true causal effect. The latter is represented in Fig. 2b, where controlling for the mediator blocks the mediating path, such that only a direct effect intervention → student learning remains if the mediation is partial. In the case of full mediation, a direct effect would be absent, yielding a false negative. The remedy is simply not to control for mediating variables. To distinguish between confounders and mediators, researchers need to draw causal graphs based on substantive reasoning.

Fig. 2

Example of a DAG with mediating variable. a Mediating variable present. b Mediating variable controlled for
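A hypothetical linear simulation (variable names and coefficients invented for illustration) shows how controlling for the mediator attenuates, and under full mediation erases, the effect of interest:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Chain: intervention -> metacognition -> learning (full mediation assumed).
intervention = rng.normal(size=n)
metacognition = 0.6 * intervention + rng.normal(size=n)
learning = 0.8 * metacognition + rng.normal(size=n)

# Total effect: regress learning on the intervention alone.
total = np.polyfit(intervention, learning, 1)[0]

# Overcontrol: also "controlling" for the mediator blocks the causal path.
X = np.column_stack([intervention, metacognition, np.ones(n)])
overcontrolled = np.linalg.lstsq(X, learning, rcond=None)[0][0]

print(f"total effect:   {total:.2f}")          # approx. 0.6 * 0.8 = 0.48
print(f"overcontrolled: {overcontrolled:.2f}") # approx. 0 -- a false negative
```

Because the simulated mediation is full, holding the mediator constant leaves no direct effect to estimate, which is exactly the false negative described above.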

Using DAGs to identify colliders

Finally, we have inverted forks, a somewhat underappreciated configuration with counterintuitive implications: independent variable → collider ← dependent variable. If our causal effect of interest is still unknown, but both the intervention and student learning exert an effect on student retention, we can depict this as an inverted fork (Fig. 3a). In inverted forks, the Z variable is called a collider because it has more than one incoming arrow, i.e., the arrows collide in this variable.

Fig. 3

Example of a DAG with collider variable. a Collider variable present. b Collider variable controlled for

Colliders, or common effects, have the unique property of naturally blocking spurious associations. In the hypothetical example where we are trying to identify the effect of our intervention on learning (Fig. 3a), inspecting our DAG to identify all possible paths, we do not encounter a back-door path. This is because, unlike a confounder, a collider has natural blocking properties. This means that an uncontrolled collider does not introduce bias. However, if we are unaware that Z is a collider, we may control for Z, either statistically or by study design, consciously or unwittingly. This introduces bias because any control on a collider opens back-door paths going through this variable, thereby opening a non-causal path. In Fig. 3b, we now have two open paths, intervention → student learning and intervention → [student retention] ← student learning, the latter of which introduces a spurious association and, thus, biases the causal effect of interest. Conditioning on a collider, as it is called, or endogenous selection bias, is an increasingly recognized pitfall that can have devastating effects on the validity of causal claims (Elwert & Winship, 2014; Munafo et al., 2018; Richardson et al., 2019). Notably, the issue is not (only) about the representativity of the sample, where findings may not generalize to other populations. Endogenous selection bias is more pervasive in that it also biases the estimate itself, thus limiting internal validity. As with confounding, a frequent result of conditioning on a collider is a false positive: finding an effect or association where there is none. Illustrative examples of how this bias works exactly can be found in Rohrer (2018) and Griffith et al. (2020). Collider bias completes the three fundamental causal configurations found in a DAG. A summary of these configurations and their immediate implications can be found in Table 2.

Table 2 Summary of elemental configurations and recommendations to reduce bias
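To make the counterintuitive behavior of colliders concrete, consider a hypothetical simulation (names and magnitudes invented) in which the intervention has no effect on learning at all:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Inverted fork: intervention and learning independently affect retention.
intervention = rng.normal(size=n)
learning = rng.normal(size=n)            # truly unrelated to the intervention
retention = intervention + learning + rng.normal(size=n)

# Unconditional association: the collider blocks the path, as it should.
r_all = np.corrcoef(intervention, learning)[0, 1]

# Conditioning on the collider (e.g., analyzing only retained students)
# opens the path and manufactures a spurious negative association.
retained = retention > 0
r_selected = np.corrcoef(intervention[retained], learning[retained])[0, 1]

print(f"full sample:       r = {r_all:+.2f}")      # approx. 0
print(f"retained students: r = {r_selected:+.2f}") # clearly negative
```

Restricting the analysis to retained students is a design decision that conditions on the collider; it produces an association between two variables that are causally unrelated, the false positive described above.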

The relation between structural equation models and DAGs

As structural equation modeling (SEM) is a popular tool for educational technology research that is also frequently used for non-experimental research designs (e.g., Lai et al., 2016; Lee et al., 2010; Makransky & Lilleholt, 2018; Rahman et al., 2021), researchers may wonder about the differences and similarities between DAGs and SEM (or path) models, in particular concerning causal inference. Because many quantitatively-oriented researchers are well acquainted with SEM, we will start with some similarities.

First, DAGs and SEMs look very similar. This is because on a visual level they are both based on the usage of variables (variable names and sometimes circles or boxes) and arrows that connect the variables (or are deliberately absent). This similarity is not due to chance but because DAGs are generalizations of simple path models (Pearl, 2009). Like SEMs, DAGs can become rather complex, with many variables pictured and arrows interconnected. In fact, for most applied cases (as opposed to simplified examples), particularly in the social sciences, complex configurations are expected, which should also be reflected in the models. Usually, there is a tension between the desire for a parsimonious model and the necessary complexity to capture the nomological net of variables.

On the one hand, parsimonious models are easier to analyze and provide a more pleasing portrayal of the research object. In addition, overly complex models are rarely sufficiently theoretically justified and can bring problems in the estimation stage (Goodboy & Kline, 2017). On the other hand, the absence of arrows also encodes strong assumptions and must equally be defended on substantive grounds (VanderWeele, 2012). It is a result of the complexity of the social sciences that by adding arrows to our model, we are being, perhaps paradoxically, conservative. Also, both approaches may yield competing models with different causal assumptions, which may be subjected to testing to arrive at the model that best approximates the data. Another similarity is that neither DAGs nor SEMs distinguish between types of causes. Whereas there have been voices insisting on the manipulability of a variable as a criterion for causality (Holland, 1986), attributes, causes, events, conditions, or any other type of cause for that matter (Freese & Kevern, 2013) are represented identically, as a variable with incoming and/or outgoing arrows.

This brings us to the critical differences between SEMs and DAGs. If SEMs are heavyweights in terms of necessary assumptions, DAGs are comparative featherweights, allowing for a streamlined approach to thinking causally. Where the function of SEM is to specify an assumption-laden statistical model that is then estimated with respect to its approximation of the data, DAGs are conceptual tools used to apply principled reasoning about causality and bias reduction. While the primary goal of SEM is estimation, DAGs are used for identification purposes only. Here, estimation comes later and is methodologically decoupled. This brings the benefit that DAGs can be used across several stages of the empirical research process (see Section “Using DAGs in different phases of educational technology research”) and are, thus, not limited to the stages concerned with statistical estimation. As they are non-parametric, many considerations associated with SEM, e.g., linearity, measurement level, multivariate normality, measurement invariance, etc., do not apply to DAGs. This emphasizes conceptual reasoning on the causal structures among variables of interest, irrespective of practical constraints. Further, although SEM can incorporate latent variables, their underlying indicators must be measurable, as declared in the measurement model (Teo et al., 2013). As conceptual tools, DAGs do not have this limitation because they can include hypothesized unmeasured or generally unavailable variables to incorporate them into causal reasoning. This ability to reason about hypothesized yet unobserved variables and their relationships is a central hallmark of the DAG approach.

For example, suppose a researcher knows their sample cannot be randomly selected from the population. They might want to represent this graphically by adding an unmeasured variable U (time zone differences) that points to X (intervention, see Fig. 4). In an online education context, the researcher may have a theory about the nature of U, e.g., that (a) time zone differences affect selection into the sample but also that (b) time zone differences do not affect the outcome Y directly because, for example, the learning design emphasized asynchronous communication. In an SEM, this hypothesis could not be represented, whereas these assumptions are easily represented in a DAG even though U remains unmeasured. Inspecting the graph, our graphical causal reasoning suggests that U does not lead to confounding because there is no back-door path (Fig. 4a). Thus, U need not be controlled. If, for whatever reason, researchers should be able to and decide to measure and control U, this would not systematically introduce bias, an example of neutral control (Cinelli et al., 2021). However, if the researcher encounters a situation where the learning design necessitates synchronous peer interaction (Z), unmeasured time zone differences U will not only affect selection into the sample but also how peer interaction unfolds, as time zone differences may hamper learners’ ability to connect in a timely and productive manner. In this case, U is a confounder, opening up the back-door path intervention ← time zone differences → peer interaction → student learning (Fig. 4b). Because time zone differences may be unavailable for control due to being unmeasured, researchers would need to find a way to control for peer interaction, as this also blocks the back-door path from transmitting non-causal association: intervention ← time zone differences → [peer interaction] → student learning.
In this case, a causal effect of the intervention on student learning could be obtained by finding a valid measure of peer interaction and statistically controlling for it.
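A small hypothetical simulation (all coefficients invented, linearity assumed for convenience) shows that even though U is unmeasured, controlling a measured variable sitting on the back-door path suffices:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Unmeasured time zone differences (U) affect selection into the
# intervention and, via synchronous peer interaction (Z), learning (Y).
u = rng.normal(size=n)                       # not available for control
intervention = -0.7 * u + rng.normal(size=n)
peer_interaction = -0.6 * u + rng.normal(size=n)
learning = 0.4 * intervention + 0.5 * peer_interaction + rng.normal(size=n)

# Naive estimate: the back-door path through U is open.
naive = np.polyfit(intervention, learning, 1)[0]

# U itself is unmeasured, but controlling the measured variable on the
# back-door path (peer interaction) blocks it just as well.
X = np.column_stack([intervention, peer_interaction, np.ones(n)])
blocked = np.linalg.lstsq(X, learning, rcond=None)[0][0]

print(f"naive:   {naive:.2f}")   # biased by the open back-door path
print(f"blocked: {blocked:.2f}") # close to the true effect of 0.4
```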

Fig. 4

Example of DAG with unmeasured variable. a No bias due to unmeasured selection variable time zone difference. b Confounding bias due to unmeasured selection variable time zone difference

The most significant difference between DAGs and SEMs, however, is the premium placed on explicit causal inference, although, as we will show, this difference somewhat dissolves upon closer inspection. Because many SEM studies rely on observational data, researchers are hesitant to claim causality from their estimates, despite the array of directional arrows populating the model and the causal foundations of SEM (Bollen & Pearl, 2013; Pearl, 2012). This speaks to the broader hesitancy of quantitative researchers to make causal claims from non-experimental data (Hernán, 2018).

For example, many technology acceptance studies working with SEM use terms like “predict” to interpret estimates between what are clearly treated as independent variables (i.e., expectancies and conditions) and the intention to use a technology (e.g., Islamoglu et al., 2021; Leow et al., 2021; Rahman et al., 2021). Arguably, these become valuable insights if, and only if, some causal implications can be derived from these predictions, in these cases, e.g., by changing conditions to increase usage intention. In other words, in most cases, interpreting associations will be uninteresting unless the estimates also hold under the assumption of causality; that is, unless practical implications can be generated from the findings. Of course, researchers are right to be hesitant, cognizant of the truism that correlation does not mean causation, lest their causal claims become obstacles in peer review. This tension between clearly directional arrows in SEMs and researchers’ hesitancy to claim causality has a long history in the SEM literature and may stem from the misguided claim that SEM aims to establish causality from associations alone (Bollen & Pearl, 2013). However, what applies to SEM applies to DAGs as well: there are no causal conclusions without causal assumptions. In both cases, these are encoded in the presence or absence of variables and the directional arrows connecting them.

For SEM, there are two types of input: causal assumptions arising from domain knowledge and empirical data that may substantiate or disconfirm these assumptions. This second part usually receives much attention in the SEM literature, to the detriment of careful explication of the causal assumptions and their implications (Pearl, 2012). DAGs, on the other hand, deal only with the first part, the causal assumptions based on theory or domain knowledge, which make their causal content crystal clear and much harder to ignore.

Using DAGs in different phases of educational technology research

Finally, we will provide an overview of how DAGs, as a tool for causal reasoning, can be used in the phases of educational technology research. Fundamentally, principled causal reasoning with DAGs is helpful at every stage of the empirical research process. Here, we will highlight the information DAGs can provide and the decisions they can facilitate in study planning, data analysis, and appraisal of the literature (see Fig. 5), illustrating this as much as possible with examples from the literature. We depict these research phases as circular because principled appraisal of the literature concerning causal inference will (likely) again reveal a need for future research.

Fig. 5

Potential uses for DAG-based causal reasoning along stages of research

DAGs in study planning

As a first step, we suggest that researchers formulate a clear causal research question, irrespective of the feasibility of different research designs. In other words, if what researchers are interested in is causal (almost always the case), this should be communicated clearly, even if an experiment may not be possible. Then, the researchers may use their expert knowledge (based on their training, the available literature, or with the help of additional content experts) to construct a causal graph containing the causal effect of interest and additional variables hypothesized to causally affect the main variables. Again, the inclusion of variables should not depend on whether they can be measured or controlled. Using the construction rules laid out in Section “Causal inference and directed acyclic graphs” (e.g., no cycles, no arrow implies no hypothesized relationship, etc.), a DAG is constructed. As stated before, parametric considerations do not apply at this point, as identification and estimation are two different processes. In many educational technology research contexts, this will result in DAGs that are complex.
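As a sketch of this construction step, a DAG can be written down as a simple edge list and checked against the construction rules. Here we use a hypothetical planning DAG (variable names invented) and the third-party networkx library, assuming it is available:

```python
import networkx as nx

# Hypothetical DAG from a planning session: the edge list encodes the
# causal assumptions (an absent edge is itself a strong assumption).
dag = nx.DiGraph([
    ("engagement", "intervention"),
    ("engagement", "learning"),
    ("intervention", "metacognition"),
    ("metacognition", "learning"),
])

# Construction rule: the graph must contain no cycles.
assert nx.is_directed_acyclic_graph(dag)

# Enumerate all paths between X and Y, irrespective of arrow direction,
# as a starting point for checking identification by hand.
undirected = dag.to_undirected()
for path in nx.all_simple_paths(undirected, "intervention", "learning"):
    print(" - ".join(path))
```

The enumeration surfaces both the causal chain through metacognition and the back-door path through engagement; inspecting each path's arrow directions in the original DAG then tells the researcher which paths need blocking.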

As a substantive example of using a DAG for study planning, we turn to the student retention scenario described in Hicks et al. (2022). Researchers interested in an outreach intervention’s efficacy in supporting at-risk students may construct a DAG to plan their research study (see Fig. 6).

Fig. 6

DAG representing an at-risk student intervention, adapted from Hicks et al. (2022)

Although researchers are only interested in the effects of an outreach intervention on student retention, in consulting their domain knowledge, they quickly realize that their study would be biased if it simply measured whether students were more likely to remain in the study program if they came in contact with an outreach intervention. At the very least, they reason, there will be a mediating variable through which the intervention may exert its hypothetical effect. They suggest this may be learning regulation, a variable they anticipate to be unobservable in the context of their study. Further, outreach interventions do not happen in a vacuum. Instead, students will be contacted if, and only if, they are deemed at-risk, as indicated by a lack of study progress. Their at-risk status itself depends on student characteristics like academic capital as well as some aspect of teaching efficacy. The researchers expect teacher engagement to be the primary driver of teaching efficacy but also concede that this will likely be unmeasurable. In consulting with colleagues, they agree that this DAG captures the main variables at play. The central question now is: is the causal effect of interest identified? If not, how can the causal effect be identified via an identification strategy?

The researchers in this example conclude that the effect is not identified. In devising an identification strategy, they note that at-risk status is an important variable, as it is both a mediator and a potential collider. As a mediator (academic capital → at-risk status → outreach intervention), it provides a path for the confounding effect of academic capital (outreach intervention ← at-risk status ← academic capital → learning regulation → retention). As part of an identification strategy, this path could be blocked by adjusting for learning regulation, but because the researchers expect this variable to be unobservable, another strategy is needed. If the researchers decide to adjust for at-risk status, for example, by only looking at students who fall into the at-risk category, leaving out well-performing students, they have adjusted for a collider, opening another non-causal path: outreach intervention ← teacher engagement → [at-risk status] ← academic capital → learning regulation → retention. For this reason, the researchers would be advised to sample the whole student population or a randomized subsample instead of only at-risk students. To solve the confounding problem, academic capital could be controlled for.

As practical implications, this means planning data collection to ensure a sample that includes the whole student population or a representative subset. In addition, researchers would need to search for well-validated measures of academic capital and include such a measure in the research design. The extent to which the instrument is valid and reliable for this student population will determine the reduction of confounding bias in estimating the effect of interest. Crucially, under the assumption that the constructed DAG captures the actual constellation of variables to a high degree and that the identification strategy can be practically implemented, the researchers can move forward with a non-experimental research design while still being relatively confident in drawing causal inferences from their data. While the absence of bias cannot be proven conclusively, the goal is a principled approach to reducing bias. To the extent that the identification strategy convincingly eliminates sources of bias, researchers' claims of causality, albeit preliminary and subject to further probing, are warranted.

DAGs for analysis

Whether or not DAGs were used in the planning stage, when confronted with the data gleaned from a given quantitative study, DAGs can be used to make principled analytical decisions that avoid bias. Of course, the space of potential decisions is limited by the decisions made in the earlier study planning phase. For example, if a variable was not measured, it cannot be adjusted for in the analysis stage. On the other hand, even if study planning was done without DAGs, using them to reason about bias in the analysis stage can still be valuable for improving causal inferences and reducing bias.

When deciding which variables to include in a statistical model (that is, which variable to control for), there is a frequent appeal to include more rather than fewer third variables (Spector & Brannick, 2010). This applies in particular to observational research and is due to the—occasionally mistaken—assumption that with each covariate included, any bias potentially stemming from this variable is, in a way, cleaned out from the resulting estimate. Implicitly, this further assumes that controlling cannot do harm but only good. This unprincipled, overeager approach, sometimes called garbage-can regression (Achen, 2005), was already criticized by Meehl (1970), and its fallibility is apparent given the graphical causal inference framework presented here (Lee, 2012; Pearl, 2009; Rohrer, 2018). As shown, covariates can indeed eliminate confounding but can also bring trouble if they lead to overcontrol (blocking a mediator) or endogenous selection bias (controlling for a collider).

Put succinctly, there are good controls and bad controls, as well as some more ambiguous ones (Cinelli et al., 2021), and all can be directly derived from the graphical formalism of DAGs. As research has shown, the choice of covariates affects the results of a study more than the choice of analysis procedures (e.g., ANCOVA vs. Propensity score matching), and the right set of covariates can lead to estimates with near-total bias reduction compared to experimental effects (Steiner et al., 2011).
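To see how good and bad controls play out numerically, consider a small simulation under an assumed data-generating process (all coefficients are invented for illustration; the true total effect of x on y is fixed at 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Invented data-generating process with a known true total effect of 0.5.
c = rng.normal(size=n)                        # confounder: c -> x, c -> y
x = 0.8 * c + rng.normal(size=n)              # exposure
m = 0.5 * x + rng.normal(size=n)              # mediator: x -> m -> y
y = 1.0 * m + 0.6 * c + rng.normal(size=n)    # outcome (total effect of x = 0.5)
k = 0.7 * x + 0.7 * y + rng.normal(size=n)    # collider: x -> k <- y

def coef_of_x(*covariates):
    """OLS coefficient on x when regressing y on x plus covariates."""
    X = np.column_stack((np.ones(n), x) + covariates)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1]

print(round(coef_of_x(), 2))      # no controls: inflated by confounding
print(round(coef_of_x(c), 2))     # good control: close to the true 0.5
print(round(coef_of_x(c, m), 2))  # overcontrol: near 0, mediator blocks x -> m -> y
print(round(coef_of_x(c, k), 2))  # bad control: collider opens a non-causal path
```

With the same data, the estimate is recovered only under the good control: controlling for the mediator erases the effect, and controlling for the collider distorts it, here even flipping its sign.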

As an example of using DAGs in the analysis phase, consider Boerebach et al. (2013), who looked at faculty teaching performance in medical education with a focus on resultant role model types. Outcomes were medical students' perceptions of faculty as either teacher-supervisor, physician, or person role models. Teaching performance incorporated elements of feedback, learning climate, professional attitude, etc. (see also Boerebach et al., 2012). Instead of positing one definitive DAG, the authors explored the impact of different causal assumptions on estimates of the relationship between teaching performance and role model perceptions. Figure 7 shows one of these DAGs (visually adapted slightly to fit this paper's style).

Fig. 7

DAG 1 from Boerebach et al. (2013) depicts Z, which included faculty's sex, experience, residency training, etc., as a confounder. The remaining seven DAGs from this paper increase complexity by assuming various causal effects between role model types. RM = role model

Boerebach et al. (2013) use eight DAGs of varying complexity to explore the potential causal relationships between teacher performance and different role model perceptions. To reduce visual clutter, they subsume an array of confounders under variable Z, a practice that should be avoided in actual research contexts but does not pose a problem for our purposes. Crucially, Boerebach et al. (2013) know from previous research that all variables subsumed under Z are, in fact, confounders in that they affect teaching performance as well as role model types. For this reason, statistically adjusting for these variables is necessary. Controlling for all known confounders can lead researchers to interpret their estimate causally if the further assumption of relative completeness and correctness of the DAG is supported. Given the scarcity of empirically or theoretically supported causal assumptions in this line of research, Boerebach et al. (2013) opt to explore these relationships statistically instead. Depending on the underlying DAG and its analytical implications, they find a large variance in the estimated effects. From this, they argue for further research to limit the number of plausible models to better estimate causal effects from non-experimental research.

The issue of underdeveloped theoretical models and insufficient empirical support for definitive DAGs also applies to Educational Technology. Ironically, defending the causal assumptions of a DAG in turn calls for a robust body of causal knowledge. As of now, educational technology research is not yet at the point of providing strong defenses along these lines. For any plausible DAG, realistically, there will be competing, similarly plausible DAGs. One intermediate solution that can be used in the analysis phase is sensitivity analysis. To this end, VanderWeele and Ding (2017) proposed the E-Value: the minimum strength of association an unmeasured confounder would need to have with both treatment and outcome to fully explain away the effect of interest. For educational technology research, the E-Value could be used to assess the plausibility of competing DAGs with regard to being free from confounding. For example, a high E-Value for a DAG implies that the associations of the confounders with X and Y would need to be very high to reduce the causal effect to zero. Given typical effect sizes in the educational technology literature, an E-Value corresponding to d > 2 makes it implausible that the effect of interest is fully explained by confounding bias. Here, too, substantive knowledge of theory and evidence regarding the specific variables of interest guides the evaluation of plausibility. Researchers may want to assess the E-Value for competing DAGs to arrive at an intuition as to the likely degree of unmeasured confounding.
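As a sketch of how such a sensitivity analysis could look, the following computes E-Values with VanderWeele and Ding's (2017) formula for risk ratios, using their approximate conversion RR ≈ exp(0.91·d) for standardized mean differences (the specific d values below are illustrative only):

```python
from math import exp, sqrt

def e_value(rr):
    """E-Value for an observed risk ratio (VanderWeele & Ding, 2017)."""
    rr = max(rr, 1 / rr)                  # orient so that RR >= 1
    return rr + sqrt(rr * (rr - 1))

def e_value_from_d(d):
    """Approximate E-Value for a standardized mean difference d,
    via the conversion RR ~ exp(0.91 * d)."""
    return e_value(exp(0.91 * abs(d)))

for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d:.1f} -> E-Value = {e_value_from_d(d):.2f}")
```

An effect of d = 2 yields an E-Value above 11 on the risk-ratio scale, illustrating why an effect of that size is unlikely to be fully attributable to unmeasured confounding.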

DAGs for literature appraisal

Causal graphs can also be used to critically appraise the available literature and assess evidence for a particular research question. Researchers can use their expert knowledge and the information from published research studies to construct DAGs post-hoc. For example, Weidlich et al. (2022) analyzed published studies from the Learning Analytics literature. By creating plausible DAGs derived from information presented in the papers and substantive reasoning, they found likely instances of all three causal inference pitfalls, i.e., confounding bias, overcontrol bias, and collider bias. Reasoning about these sources of bias allowed for alternative interpretations of puzzling findings and, in some cases, led to simple proposals to decrease bias in future studies. For example, in line with other commentators, they identified Arnold and Pistilli (2012) as an instance of confounding bias. Course Signals, an analytics-based early warning system, appeared to have striking effects on retention rates. Overlooked, however, was the complication that students further along in their educational trajectory were more likely to encounter Course Signals, and that the number of classes taken directly affects student retention (Fig. 8a). In this case, error-free measurement and, thus, complete adjustment for the confounding variable would have been possible.

In other instances, easy fixes are not available. This is the case when, for example, studies control for a collider by design, e.g., by conducting their analyses only on students who completed a MOOC. The identification of such biases can lead to alternative interpretations of the results. For example, Zhu et al. (2016) looked longitudinally at how students' social connectedness in a MOOC forum was associated with learning engagement. Due to the high drop-out rate characteristic of MOOCs, they conducted their analyses on an increasingly truncated sample of students. Plausibly, both variables of interest, social connectedness and learning engagement, affect MOOC attrition, making attrition a collider (Fig. 8b). The severity of collider bias would then increase with each passing week, as students successively drop out and the remaining sample becomes increasingly non-representative with respect to the variables of interest.
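The mechanics of such attrition-driven collider bias can be illustrated with a simulation. The data-generating process below is invented for illustration: connectedness and engagement are truly independent, but both reduce the weekly probability of dropping out.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two truly independent student attributes
connect = rng.normal(size=n)   # social connectedness
engage = rng.normal(size=n)    # learning engagement

def corr_among_remaining(weeks):
    """Correlation of the two attributes among students still enrolled."""
    stay = np.ones(n, dtype=bool)
    for _ in range(weeks):
        # each week, dropout is less likely for connected, engaged students
        p_drop = 1 / (1 + np.exp(1.5 + connect + engage))
        stay &= rng.random(n) > p_drop
    return np.corrcoef(connect[stay], engage[stay])[0, 1]

for week in (0, 2, 4, 8):
    print(f"week {week}: r = {corr_among_remaining(week):+.3f}")
```

Although the population correlation is zero, conditioning on survival (the collider) drives the correlation among the remaining students increasingly negative with each passing week, mirroring the truncation problem described above.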

Fig. 8

Two DAGs presented in Weidlich and Drachsler (2022) suggesting a potential confounding and b potential collider bias

Importantly, post-hoc reasoning about bias in the available literature can contribute to the identification of further research gaps. It is possible that an extensive literature has been produced on the effects of, for example, an innovative educational technology on learning outcomes, while its causal foundations remain shaky. Critically, this is more than just a question of whether experimental data are available. By reasoning about causality with the help of DAGs, it is possible to conclude that an experiment may have yielded biased causal inferences due to, for example, blocking a mediator, while observational research may have provided stronger evidence for causal claims due to an appropriate choice of statistical controls.

When approaching real-world problems and the big questions of educational technology research (Reeves & Lin, 2020), it is imperative to appraise the available literature from a causal inference perspective to gauge the progress made specifically with respect to causal knowledge. Systematic efforts to this effect can be found in Haber et al. (2018), where the causal language of academic and media articles is contrasted with the actual strength of inference. Similarly, systematic reviews that incorporate the strength and validity of causal inference using DAGs may also be necessary for emerging and established literatures on the effects of educational-technological innovations, like Flipped Classroom (Cheng et al., 2019), Learning Analytics systems (Bodily & Verbert, 2017), Classroom Technologies (Eutsler et al., 2020), and VR/AR (Di & Zheng, 2022; Buchner & Kerres, 2022).

Discussion and conclusion

Aside from the specific use cases outlined above, the most fundamental asset DAGs bring to the research process is the value of explicit causal thinking. Researchers in the field of Educational Technology will inevitably be faced with the tension between the inferences they would like to make and the reality of their data. The taboo against explicit causal inference (Grosz et al., 2020), which refers to the hesitancy of researchers to think causally and use causal language when dealing with observational data, can also be found in Educational Technology. In this case, researchers may resort to ‘scientific euphemisms’ (Hernán, 2018), using associational language instead of causal language, despite being expected to formulate practical implications from their findings. And because there are no implications without causality, papers then display a problematic disconnect between what the researchers wanted to know (causal knowledge), what they ended up doing (producing knowledge about associations), and the implications they draw from their findings (causal again, but reported via euphemistic language). Using DAGs in different phases of the research process, as outlined in Section “Using DAGs in different phases of educational technology research”, helps resolve this tension by necessitating explicit causal claims and by making transparent the ways in which bias may be present and the remedies available to prevent it.

Educational technology research has been criticized for its limited methodological ambition and capacity (Bulfin et al., 2014). At the same time, widespread (quasi-) experimental research-to-prove is under scrutiny (Honebein & Reigeluth, 2021). This may lead educational technology researchers to adopt a binary view in which causal questions may only be approached if the research context allows for randomized experimentation. On the contrary, the past two decades have seen tremendous progress in the development of quantitative methods for drawing causal inferences from observational research. Where randomized experiments remove confounders from the picture entirely, research with observational data must account for potential biases by deliberately using state-of-the-art methods and employing substantive knowledge to reason causally. With the explicit goal of studying causal mechanisms (with or without experimental data), researchers may be encouraged to use, for example, propensity score matching, instrumental variable estimation, regression discontinuity designs, and much more (Antonakis et al., 2010). The conceptual basis for making decisions in light of this methodological pluralism can be found in causal reasoning with DAGs.

Spector (2020) lamented the lack of progress in educational technology, a phenomenon, we argue, that can be traced back to ineffective knowledge cumulation. Following West et al. (2020), we agree that robust educational technology research and practice is undergirded by good theory. We further posit that good theory, in turn, is undergirded by a robust body of causal knowledge, cumulated over time. Because no single study can test all hypotheses and rule out all alternative explanations, it is imperative to build on the work of others. Without clarifying causal claims and the specific assumptions under which they may hold, it is challenging to appraise the literature in terms of what is known, what is unknown, and what lies in between. In other words, cumulation and theory-building do not work without explicit causal inference. This may be one of the reasons why authors have claimed the field to be under-theorized and may explain why theories are mainly imported from other disciplines (McDonald & Yanchar, 2020). DAGs require theory and domain knowledge as a basis for their construction. At the same time, the causal assumptions in DAGs increase transparency and clarity while allowing prospective theories to be tested and falsified. In line with rapid developments of causal inference in other fields of research, we suggest that educational technology, too, should put explicit causal questions at the forefront—no matter the research paradigm at hand—in order to provide practical implications and address real-world problems (Reeves & Lin, 2020).

This theoretical paper has outlined a methodology for causal reasoning in educational technology research. Central to it is the construction and interpretation of causal graphs, or DAGs. Causal reasoning with DAGs provides fruitful ground for causal inferences that do not rely solely on experimental evidence, a feature crucial to educational technology research, where rigorously controlled experiments may prove difficult. Given the recent advocacy for research-to-improve approaches, this contribution has illustrated the benefits of DAG-based causal reasoning, how it relates to other common research methods, and how it can be used in all stages of research.