1 Introduction

Many important scientific, sociological, and technological questions are fundamentally causal: “which genes regulate one another?,” “how would raising the minimum wage affect the unemployment rate?,” “which treatment most effectively prolongs the lifespan of breast cancer patients?” In each case, answering the question requires predicting how a system, e.g., a cell, an economy, or a human body, will react to external manipulation. Structural causal models can be used to formalize such questions, to create algorithms that determine whether such questions can be answered from available data sources, and to develop general-purpose methods for learning the answers to such questions. In the framework of structural causal models, a directed graph is used to reflect how the variables in these models depend causally on one another. Each node i of the directed graph is associated with a variable \(X_i\), and an edge \(i \rightarrow j\) indicates that the variable \(X_i\) is a direct cause of the variable \(X_j\). In some special, well-studied settings, background knowledge and human reasoning can be used to propose plausible directed graph models. However, in large systems such as gene regulatory networks, the directed graph is not known a priori, making it necessary to develop methods for learning the graph from data. Once this graph is learned, it can be used to predict the effects of interventions or distributional shifts, in contrast to traditional machine learning methods, which can only make predictions on inputs that come from the same distribution as the training data.

The problem of learning such a causal graph from data, known as causal structure learning (or causal discovery), has been the focus of much recent work in computer science, statistics, and bioinformatics, covered in a number of recent reviews [40, 41, 64, 74, 113]. Compared to these reviews, we here emphasize the combinatorial aspects of causal structure learning, including characterizations of equivalence classes of graphs, computing the size and number of these equivalence classes, and how the characterization and properties are influenced by the presence of latent variables or interventional data. After discussing these topics, we will cover methods for causal structure learning which are based heavily on the combinatorial structure over the space of directed graphs. Focusing on this combinatorial structure has three significant advantages:

  1. Causal structure learning can be dramatically simplified when fixing some combinatorial aspect of the problem, such as the ordering of the variables.

  2. Understanding the combinatorial aspects of structure learning allows a number of different methods to be synthesized into a single framework and eases future methodological development.

  3. Insights into the combinatorial aspects of structure learning are also useful for other tasks, such as experimental design.

The framework provided by the combinatorial viewpoint encompasses methods for learning causal models with unobserved variables, as well as methods for learning from a combination of observational and interventional data. The second point is especially important, since interventional data are often crucial for identifying the true causal model and subsequently using the causal model for predicting the effects of interventions or distributional shifts.

2 Structural Causal Models

A structural causal model defines causal relationships over a set of random variables \(\{ X_i \}_{i=1}^p\). These relationships are summarized by a directed acyclic graph (DAG) \(\mathcal {G}\) over nodes \(i = 1, \ldots , p\), where the node i in \(\mathcal {G}\) is associated with the variable \(X_i\). Given a DAG \(\mathcal {G}\), we let \({{\,\mathrm{pa}\,}}_\mathcal {G}(i)\) denote the parents of the node i, i.e., \({{\,\mathrm{pa}\,}}_\mathcal {G}(i) = \{ j \mid j \rightarrow i ~\text {in}~\mathcal {G}\}\). Then, a (Markovian) structural causal model (SCM) [124] with causal graph \(\mathcal {G}\) consists of a set of endogenous variables \(\{ X_i \}_{i=1}^p\), a set of exogenous variables \(\{ \epsilon _i \}_{i=1}^p\), a product distribution \(\mathbb {P}_\epsilon \) over the exogenous variables, and a set of structural assignments \(\{ f_i \}_{i=1}^p\). In particular, the structural assignment \(f_i\) asserts the relation \(X_i = f_i(X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}, \epsilon _i)\). Via these structural assignments, the distribution \(\mathbb {P}_\epsilon \) over the exogenous variables induces a distribution \(\mathbb {P}_X\) over the endogenous variables, called the entailed distribution [124]. In particular, we have \(\mathbb {P}_X(X_i \mid X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}) = \mathbb {E}_{\epsilon _i} [\mathbbm {1}_{X_i = f_i(X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}, \epsilon _i)} \mid X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}]\) and

$$\begin{aligned} \mathbb {P}_X(X) = \prod _{i=1}^p \mathbb {P}_X(X_i \mid X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}). \end{aligned}$$
(1)

Example 1

(A simple structural causal model of genetic inheritance) As a running example, we will consider a simplified model of genetic inheritance of weight among a family of mice. Let \(X_2\) and \(X_3\) represent the weights, in grams, of an unrelated male and female mouse, respectively. Let \(X_4\) represent the weight of their offspring, and \(X_5\) represent the weight of the offspring’s offspring. Finally, let \(X_1\) be a binary variable representing whether the two parent mice are genetically modified for increased weight. Assume that these variables are related via the following set of assignments:

$$\begin{aligned} X_1&= \epsilon _1&\epsilon _1 \sim \mathsf {Ber}(0.5)\\ X_2&= \epsilon _2 + 2 X_1&\epsilon _2 \sim \mathcal {N}(25, 1)\\ X_3&= \epsilon _3 + 2 X_1&\epsilon _3 \sim \mathcal {N}(20, 1)\\ X_4&= \frac{1}{2} \left( X_2 + X_3 \right) + \epsilon _4&\epsilon _4 \sim \mathcal {N}(0, 1)\\ X_5&= X_4 + \epsilon _5&\epsilon _5 \sim \mathcal {N}(0, 1) \end{aligned}$$

where the \(\epsilon _i\) are mutually independent. The parent sets are \({{\,\mathrm{pa}\,}}_\mathcal {G}(1) = \varnothing \), \({{\,\mathrm{pa}\,}}_\mathcal {G}(2) = \{ 1 \}\), \({{\,\mathrm{pa}\,}}_\mathcal {G}(3) = \{ 1 \}\), \({{\,\mathrm{pa}\,}}_\mathcal {G}(4) = \{ 2, 3 \}\), and \({{\,\mathrm{pa}\,}}_\mathcal {G}(5) = \{ 4 \}\). The causal graph is given in Fig. 1, and

$$\begin{aligned} \mathbb {P}_X(X) = \mathsf {Ber}(X_1; 0.5)&\times \mathcal {N}(X_2; 25 + 2 X_1, 1) \times \mathcal {N}(X_3; 20 + 2 X_1, 1) \\&\times \mathcal {N}(X_4; \frac{1}{2} (X_2 + X_3), 1) \times \mathcal {N}(X_5; X_4, 1) \end{aligned}$$

is the entailed distribution. \(\square \)
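The entailed distribution can be explored directly by ancestral sampling, i.e., sampling each variable after its parents. Below is a minimal sketch in Python (the seed and sample size are illustrative) that simulates the SCM of Example 1; empirical moments of the samples approximate those of \(\mathbb {P}_X\).

```python
# A minimal simulation of the SCM in Example 1; sampling proceeds in
# topological order, so the empirical distribution approximates P_X.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x1 = rng.binomial(1, 0.5, n)                 # X1 = eps1,  eps1 ~ Ber(0.5)
x2 = rng.normal(25, 1, n) + 2 * x1           # X2 = eps2 + 2 X1
x3 = rng.normal(20, 1, n) + 2 * x1           # X3 = eps3 + 2 X1
x4 = 0.5 * (x2 + x3) + rng.normal(0, 1, n)   # X4 = (X2 + X3)/2 + eps4
x5 = x4 + rng.normal(0, 1, n)                # X5 = X4 + eps5

print(x4.mean())  # approx. 0.5 * (26 + 21) = 23.5
```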

The above definition of structural causal models can be generalized in at least two ways. First, one may remove the assumption that the distribution over the exogenous variables is a product distribution, i.e., one may allow dependence between \(\epsilon _i\) and \(\epsilon _j\) for \(i \ne j\). Such SCMs are called semi-Markovian and are taken as the basic definition of SCMs by some authors [122]. Instead of allowing for dependencies between exogenous variables, we use Markovian SCMs as the basic definition and assume that any unmodeled dependence between endogenous variables is due to some other unobserved endogenous variables, which we will cover in Sect. 2.3. Second, one may remove the assumption that \(\mathcal {G}\) is acyclic. The assumption of acyclicity is natural when considering endogenous variables which are defined at certain time points, since the intuitive notion of causality dictates that a cause precedes any of its effects. However, if the endogenous variables are not well defined in time, e.g., if they represent the average state of a system in equilibrium, then feedback loops may occur. We will briefly discuss recent progress on causal structure learning for cyclic causal models in Sect. 5.

2.1 Markov Properties and Markov Equivalence in DAGs

Given a DAG \(\mathcal {G}\), the distributions \(\mathbb {P}_X\) that factorize according to (1) are said to follow the Markov factorization property with respect to \(\mathcal {G}\). Depending on assumptions on the structural equations \(\{ f_i \}_{i=1}^p\) and the exogenous variables \(\{ \epsilon _i \}_{i=1}^p\), the Markov factorization property implies many other testable properties of the distribution \(\mathbb {P}_X\). For instance, the entire set of conditional independence statements entailed by the Markov factorization property can be characterized simply in terms of a graphical criterion, known as d-separation, that can be read off from the DAG \(\mathcal {G}\). The definition of d-separation relies on the notion of a collider along a path from i to j. Given a path \(\gamma = \langle \gamma _1 = i, \gamma _2, \ldots , \gamma _M = j \rangle \) from i to j, the node \(\gamma _m\) is a collider if \(\gamma _{m-1} \rightarrow \gamma _m \leftarrow \gamma _{m+1}\), i.e., two arrowheads “collide” at \(\gamma _m\). Then, a path \(\gamma \) d-connects i and j given the set \(C \subseteq [p]{\setminus }\{ i, j \}\) if:

  1. No non-collider on the path belongs to C.

  2. Every collider on the path either belongs to C or has a descendant which belongs to C.

Finally, i and j are d-connected given C if there exists any d-connecting path given C; otherwise, they are d-separated. We denote that i and j are d-separated in \(\mathcal {G}\) given C by \(i \perp _\mathcal {G}j \mid C\), and we denote the complete set of d-separation statements in a DAG \(\mathcal {G}\) as \(\mathrm {CI}(\mathcal {G})\); i.e., \(\mathrm {CI}(\mathcal {G}) = \{ (i, j \mid C) : i \perp _\mathcal {G}j \mid C \}\).

Example 2

(d-connection and d-separation) In \(\mathcal {G}^*\) from Fig. 1a, there are two paths between 2 and 3, the path \(\gamma _1 = 2 \leftarrow 1 \rightarrow 3\), and the path \(\gamma _2 = 2 \rightarrow 4 \leftarrow 3\). For \(C = \varnothing \), \(\gamma _1\) is a d-connecting path between 2 and 3, since 1 is a non-collider and does not belong to C, while \(\gamma _2\) is not a d-connecting path, since 4 is a collider but neither 4 nor 5 is in C. Thus, 2 and 3 are d-connected given \(C = \varnothing \). For \(C = \{ 1 \}\), neither \(\gamma _1\) nor \(\gamma _2\) are d-connecting paths, so 2 and 3 are d-separated given \(C = \{ 1 \}\). Finally, for any C containing 4 or 5, \(\gamma _2\) is a d-connecting path between 2 and 3. Thus, 2 and 3 are d-connected given \(C = \{ 4 \}\), \(C = \{ 5 \}\), \(C = \{ 1, 4 \}\), etc. \(\square \)
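The d-separation relation can also be checked algorithmically. The following self-contained sketch (an illustrative implementation, not an optimized one) uses the classic reduction to graph separation: i and j are d-separated given C if and only if C separates i and j in the moralization of the subgraph induced by the ancestors of \(\{i, j\} \cup C\). It reproduces the statements of Example 2.

```python
# d-separation via the moralized ancestral graph (a sketch).
from itertools import combinations

def ancestors(dag, nodes):
    """dag: dict mapping node -> set of parents. Returns nodes plus ancestors."""
    result, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in result:
            result.add(v)
            stack.extend(dag[v])
    return result

def d_separated(dag, i, j, C):
    """True iff i and j are d-separated given C in the DAG."""
    anc = ancestors(dag, {i, j} | set(C))
    # Moralize the ancestral subgraph: marry co-parents, drop directions.
    undirected = {v: set() for v in anc}
    for v in anc:
        pa = dag[v] & anc
        for p in pa:
            undirected[v].add(p)
            undirected[p].add(v)
        for p, q in combinations(pa, 2):
            undirected[p].add(q)
            undirected[q].add(p)
    # d-separated iff every undirected path from i to j passes through C.
    stack, seen = [i], {i}
    while stack:
        v = stack.pop()
        if v == j:
            return False
        for w in undirected[v] - seen - set(C):
            seen.add(w)
            stack.append(w)
    return True

# The DAG of Examples 1 and 2: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 4 -> 5.
G = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {4}}
print(d_separated(G, 2, 3, set()))   # False: d-connected via 2 <- 1 -> 3
print(d_separated(G, 2, 3, {1}))     # True
print(d_separated(G, 2, 3, {1, 5}))  # False: 5 is a descendant of collider 4
```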

Given a distribution \(\mathbb {P}_X\), we call \(X_i\) and \(X_j\) conditionally independent given \(X_C\) if \(\mathbb {P}_X(X_i, X_j \mid X_C) = \mathbb {P}_X(X_i \mid X_C) \mathbb {P}_X(X_j \mid X_C)\). This is denoted by \(X_i \perp \!\!\!\perp X_j \mid X_C\). We denote the set of all conditional independence statements in \(\mathbb {P}_X\) as \(\mathrm {CI}(\mathbb {P}_X) = \{ (i, j \mid C) : X_i \perp \!\!\!\perp X_j \mid X_C \}\).

If all d-separation statements in the DAG \(\mathcal {G}\) hold as conditional independence statements in \(\mathbb {P}_X\), i.e., \(\mathrm {CI}(\mathcal {G}) \subseteq \mathrm {CI}(\mathbb {P}_X)\), then \(\mathbb {P}_X\) is said to satisfy the global Markov property with respect to \(\mathcal {G}\). Suppose that \(\mathbb {P}_X\) has a density with respect to some product measure. Then, without any additional assumptions on the structural equations or the distributions of exogenous variables, the Markov factorization property and the global Markov property are equivalent [109].

Fig. 1 a Causal graph \(\mathcal {G}^*\) for the structural causal model in Example 1. b A minimal I-MAP \(\mathcal {G}_2\) for \(\mathcal {G}^*\), described in Example 3

Conversely, a given distribution \(\mathbb {P}_X\) may satisfy the global Markov property with respect to many different DAGs. These DAGs are called independence maps (I-MAPs) of the distribution \(\mathbb {P}_X\). As an extreme example, the complete graph implies no conditional independencies in \(\mathbb {P}_X\), so it is an I-MAP of all distributions. However, the complete graph does not capture any of the independence structure in \(\mathbb {P}_X\). For a variety of purposes, including computational and statistical efficiency in inference and estimation, it is preferable to find a DAG \(\mathcal {G}\) that captures as many of the independences of \(\mathbb {P}_X\) as possible. This intuition is captured in the definition of a minimal I-MAP for \(\mathbb {P}_X\), which is an I-MAP \(\mathcal {G}\) of \(\mathbb {P}_X\), such that the deletion of any edge will result in a new DAG \(\mathcal {G}'\) which is no longer an I-MAP for \(\mathbb {P}_X\). The following example shows that a distribution \(\mathbb {P}_X\) can have several minimal I-MAPs.

Example 3

(A distribution \(\mathbb {P}_X\) can have multiple minimal I-MAPs) Let \(\mathbb {P}_X\) be the distribution in Example 1. Then the DAG \(\mathcal {G}^*\) in Fig. 1a is a minimal I-MAP for \(\mathbb {P}_X\). To see this, we consider the deletion of each edge. Deleting \(1 \rightarrow 2\) or \(1 \rightarrow 3\) implies that \(X_1 \perp \!\!\!\perp X_2\), or \(X_1 \perp \!\!\!\perp X_3\), respectively, both of which are false. Similarly, deleting \(2 \rightarrow 4\) or \(3 \rightarrow 4\) implies that \(X_2 \perp \!\!\!\perp X_4 \mid X_3\), or \(X_3 \perp \!\!\!\perp X_4 \mid X_2\), respectively, but both are false. Finally, deleting \(4 \rightarrow 5\) implies that \(X_4 \perp \!\!\!\perp X_5\), which is again false.

\(\mathbb {P}_X\) has other minimal I-MAPs, including the DAG \(\mathcal {G}_2\) in Fig. 1b. Deleting \(2 \rightarrow 1\) and \(3 \rightarrow 1\) implies \(X_1 \perp \!\!\!\perp X_2 \mid X_3\) and \(X_1 \perp \!\!\!\perp X_3 \mid X_2\), respectively, both of which are false. Deleting \(2 \rightarrow 3\) implies that \(X_2 \perp \!\!\!\perp X_3 \mid X_4\), deleting \(4 \rightarrow 2\) implies \(X_2 \perp \!\!\!\perp X_4\), deleting \(4 \rightarrow 3\) implies \(X_3 \perp \!\!\!\perp X_4 \mid X_2\), and deleting \(4 \rightarrow 5\) implies \(X_4 \perp \!\!\!\perp X_5\); all of these are false, showing that \(\mathcal {G}_2\) is indeed minimal. \(\square \)

Suppose \(\mathbb {P}_X\) is entailed by an SCM with causal graph \(\mathcal {G}^*\). Since \(\mathbb {P}_X\) may have multiple minimal I-MAPs, it is natural to ask, under some set of assumptions, whether \(\mathcal {G}^*\) can be distinguished from the other minimal I-MAPs, and if not, whether a small subset of the minimal I-MAPs can be distinguished as candidates for \(\mathcal {G}^*\). As we will discuss in Sect. 4, without assumptions on the functional forms of the structural assignments \(f_i\), one cannot in general distinguish \(\mathcal {G}^*\) from all other graphs using only \(\mathbb {P}_X\). In particular, two DAGs \(\mathcal {G}\) and \(\mathcal {G}'\) with the same set of d-separation statements (i.e., \(\mathrm {CI}(\mathcal {G}) = \mathrm {CI}(\mathcal {G}')\)) are called Markov equivalent, and we denote this by \(\mathcal {G}\approx _\mathcal {M}\mathcal {G}'\). The set of all DAGs that are Markov equivalent to \(\mathcal {G}^*\) is called the Markov equivalence class (MEC) of \(\mathcal {G}^*\), denoted \(\mathcal {M}(\mathcal {G}^*)\), and \(\mathbb {P}_X\) can in general only identify \(\mathcal {G}^*\) up to \(\mathcal {M}(\mathcal {G}^*)\).

Fig. 2 a–c Three Markov equivalent graphs from Example 4. d A fourth graph that is not Markov equivalent to the other three

Example 4

(Markov equivalence) The three DAGs in Fig. 2a–c are all Markov equivalent to one another, since for all three graphs, the only d-separation statement is that 1 and 3 are d-separated given 2. However, the DAG \(\mathcal {G}_4\) in Fig. 2d is not a member of the same MEC, since in \(\mathcal {G}_4\), 1 and 3 are (unconditionally) d-separated, but are d-connected given 2. \(\square \)

However, under certain assumptions, it is possible to distinguish the set \(\mathcal {M}(\mathcal {G}^*)\) from all other minimal I-MAPs of \(\mathbb {P}_X\). This is the case under the sparsest Markov representation (SMR) assumption [131], which states that, for any minimal I-MAP \(\mathcal {G}'\) of \(\mathbb {P}_X\) such that \(\mathcal {G}' \not \in \mathcal {M}(\mathcal {G}^*)\), we have \(|\mathcal {G}'| > |\mathcal {G}^*|\), where \(|\mathcal {G}|\) denotes the number of edges in \(\mathcal {G}\). Under this assumption, \(\mathcal {M}(\mathcal {G}^*)\) can be identified by enumerating over minimal I-MAPs of \(\mathbb {P}_X\) and picking the sparsest minimal I-MAP.

More generally, to identify \(\mathcal {M}(\mathcal {G}^*)\), structure learning algorithms require some form of faithfulness assumption. The strongest such assumption, referred to simply as the faithfulness assumption, is exactly the converse of the global Markov property: all conditional independence statements in \(\mathbb {P}_X\) must hold as d-separation statements in \(\mathcal {G}^*\), i.e., \(\mathrm {CI}(\mathbb {P}_X) \subseteq \mathrm {CI}(\mathcal {G}^*)\). The faithfulness assumption is a “genericity” assumption in the sense that for parametric models, such as linear Gaussian models, the set of parameters which violate the faithfulness assumption is of Lebesgue measure zero [149]. This is demonstrated by the following example.

Example 5

Consider the distribution \(\mathbb {P}_X\) entailed by the following SCM:

$$\begin{aligned} X_1&= \epsilon _1&\epsilon _1 \sim \mathcal {N}(0, 1) \\ X_2&= \epsilon _2 + \beta _{12} X_1&\epsilon _2 \sim \mathcal {N}(0, 1) \\ X_3&= \epsilon _3 + \beta _{13} X_1&\epsilon _3 \sim \mathcal {N}(0, 1) \\ X_4&= \beta _{24} X_2 + \beta _{34} X_3 + \epsilon _4&\epsilon _4 \sim \mathcal {N}(0, 1) \end{aligned}$$

Denoting the corresponding causal graph by \(\mathcal {G}\), the d-separation statements are \(2 \perp _\mathcal {G}3 \mid 1\) and \(1 \perp _\mathcal {G}4 \mid \{ 2, 3 \}\). However, if \(\beta _{12} \beta _{24} + \beta _{13} \beta _{34} = 0\), then \(\text {Cov}(X_1, X_4) = 0\), so by Gaussianity, we have that \(X_1 \perp \!\!\!\perp X_4\), i.e., \((1, 4 \mid \varnothing ) \in \mathrm {CI}(\mathbb {P}_X)\) but \((1, 4 \mid \varnothing ) \not \in \mathrm {CI}(\mathcal {G})\). The set of parameters \((\beta _{12}, \beta _{13}, \beta _{24}, \beta _{34})\) satisfying this equality is of Lebesgue measure zero. \(\square \)
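The cancelation in Example 5 is easy to verify numerically. The following sketch (coefficients chosen for illustration) simulates the SCM with \(\beta _{12}\beta _{24} + \beta _{13}\beta _{34} = 0\) and confirms that the sample covariance of \(X_1\) and \(X_4\) is approximately zero, even though 1 and 4 are d-connected in the graph.

```python
# A numeric illustration of the faithfulness violation in Example 5.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
b12, b13, b24, b34 = 1.0, 1.0, 1.0, -1.0   # b12*b24 + b13*b34 = 0

x1 = rng.normal(0, 1, n)
x2 = b12 * x1 + rng.normal(0, 1, n)
x3 = b13 * x1 + rng.normal(0, 1, n)
x4 = b24 * x2 + b34 * x3 + rng.normal(0, 1, n)

# Approx. 0, despite the directed paths 1 -> 2 -> 4 and 1 -> 3 -> 4.
print(np.cov(x1, x4)[0, 1])
```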

In this example, the effect of \(X_1\) on \(X_4\) along the paths \(1 \rightarrow 2 \rightarrow 4\) and \(1 \rightarrow 3 \rightarrow 4\) perfectly “cancels out.” While perfect cancelation may only occur for very specific parameters, structure learning algorithms do not have direct access to \(\mathbb {P}_X\), and must test for conditional independence using samples from \(\mathbb {P}_X\). Thus, near cancelations, e.g., if \(\beta _{12}\beta _{24} + \beta _{13} \beta _{34} = 0.0015\), may be indistinguishable from cancelations at small sample sizes. To overcome noise and provide finite sample or high-dimensional guarantees for structure learning algorithms, it is necessary to make stronger assumptions, such as strong faithfulness [181], which assumes that the (conditional) mutual information between d-connected variables is bounded away from zero. However, the set of parameters which violate the strong faithfulness assumption can have large Lebesgue measure [168]. This has motivated the development of structure learning algorithms under assumptions that only require some subset of the d-separation statements missing from \(\mathrm {CI}(\mathcal {G}^*)\) to hold “strongly” in \(\mathbb {P}_X\), thus reducing the size of the set of violating parameters. Such assumptions, including a strong version of the SMR assumption, are reviewed and compared in [131, 183].

Since in general \(\mathcal {G}^*\) can only be identified up to its MEC, the natural search space for causal structure learning algorithms is over MECs, rather than DAGs. Consequently, characterizing the structure within and between MECs has been an important problem for developing structure learning algorithms. We will discuss useful characterizations of the MEC in Sect. 3. One way to overcome the limitations on learning from observational data is by using data from interventions, which we now formalize.

2.2 Interventions and Interventional Markov Equivalence

To formalize the effect of an intervention I in an SCM, we consider a new interventional SCM where we modify some subset of the structural assignments and/or the distributions of exogenous noise variables, without introducing new nodes into any of the parent sets. If a node i has either its structural assignment \(f_i\) or the distribution of its exogenous noise \(\epsilon _i\) modified by intervention I, it is called a target of the intervention, and we write \(i \in I\). The new SCM induces a different distribution \(\mathbb {P}_X^I\) on X, called the interventional distribution, which takes the form

$$\begin{aligned} \mathbb {P}_X^I(X) = \prod _{i \not \in I} \mathbb {P}_X(X_i \mid X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}) \prod _{i \in I} \mathbb {P}_X^I(X_i \mid X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}). \end{aligned}$$
(2)

In general, an intervention consists of any modification of the structural assignment or exogenous noise. To distinguish this most general form of intervention from more stringent definitions of intervention, we will follow [124] and call these soft interventions (also referred to as mechanism changes in [163]). Particular subclasses of interventions have generated special interest. Most significantly, a hard intervention, also called a perfect, surgical [26], or structural [44] intervention, is one which completely removes the dependence of a target \(X_i\) on its parents. However, perfect interventions allow for the target to depend on \(\epsilon _i\), so that the target’s value may still be random, i.e., the interventional distribution is

$$\begin{aligned} \mathbb {P}_X^I(X) = \prod _{i \not \in I} \mathbb {P}_X(X_i \mid X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}) \prod _{i \in I} \mathbb {P}_X^I(X_i). \end{aligned}$$
(3)

More extremely, if the structural assignment of \(X_i\) is changed to a constant \(a_i\), then there is no randomness left in \(X_i\). Such a perfect intervention is called a do-intervention [113]. In this case, the interventional distribution is

$$\begin{aligned} \mathbb {P}_X^I(X) = \prod _{i \not \in I} \mathbb {P}_X(X_i \mid X_{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)}) \prod _{i \in I} \mathbbm {1}_{X_i = a_i}. \end{aligned}$$
(4)
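The difference between Eqs. (2)–(4) can be illustrated by sampling. The sketch below (a hypothetical two-node SCM, not one of the models from the running example) contrasts a soft intervention, which changes the mechanism of \(X_2\) while keeping its dependence on \(X_1\), with a do-intervention, which severs that dependence entirely.

```python
# Observational vs. soft-interventional vs. do-interventional sampling
# for the illustrative SCM X1 = eps1, X2 = 2*X1 + eps2.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(intervention=None):
    x1 = rng.normal(0, 1, n)
    if intervention == "soft":        # change the mechanism, keep the parent
        x2 = 2 * x1 + rng.normal(-5, 1, n)
    elif intervention == "do":        # sever X2 from its parent: do(X2 = 3)
        x2 = np.full(n, 3.0)
    else:                             # observational
        x2 = 2 * x1 + rng.normal(0, 1, n)
    return x1, x2

for kind in (None, "soft", "do"):
    x1, x2 = sample(kind)
    print(kind, round(np.cov(x1, x2)[0, 1], 2))
# Under the soft intervention X2 still covaries with X1 (cov approx. 2);
# under do(X2 = 3) the covariance is exactly 0.
```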

Example 6

(The interventional SCM for mouse genetic modification) Suppose we implement an intervention on the model in Example 1, where we edit the genome of the offspring mouse to reduce its weight. In particular, the effect of this intervention is to change the distribution of \(\epsilon _4\) to \(\mathcal {N}(-10, 1)\). The interventional distribution is

$$\begin{aligned} \mathbb {P}_X^I(X) = \mathsf {Ber}(X_1; .5)&\times \mathcal {N}(X_2; 25 + 2 X_1, 1) \times \mathcal {N}(X_3; 20 + 2 X_1, 1) \\&\times \mathcal {N}(X_4; \frac{1}{2} (X_2 + X_3) - 10, 1) \times \mathcal {N}(X_5; X_4, 1). \end{aligned}$$

This intervention is not a perfect intervention, since \(X_4\) still depends on its parents \(X_2\) and \(X_3\). If instead the genetic modification perfectly ensures that the offspring weighs 15 grams, i.e., \(X_4 = 15\) always, then the intervention would be a perfect intervention—in particular, a do-intervention. In this case, the interventional distribution becomes

$$\begin{aligned} \mathbb {P}_X^I(X) = \mathsf {Ber}(X_1; .5)&\times \mathcal {N}(X_2; 25 + 2 X_1, 1) \times \mathcal {N}(X_3; 20 + 2 X_1, 1) \\&\times \mathbbm {1}_{X_4 = 15} \times \mathcal {N}(X_5; X_4, 1), \end{aligned}$$

where \(X_4\) does not depend on its parents anymore. \(\square \)

The causal DAG also implies relationships between the observational and interventional distributions. The simplest approach to deriving these relationships is to extend the DAG to include variables which represent different interventions, as proposed in [177] and used by [153]. This approach can be seen as an important special case of the Joint Causal Inference (JCI) framework [115]. For a single intervention I with targets T, this can be achieved by adding a node \(\zeta \) with children T. The node \(\zeta \) represents a binary variable, where \(\zeta = 1\) denotes that a sample comes from the intervention I, and \(\zeta = 0\) denotes otherwise.

Example 7

(Binary encoding of an intervention) Consider the intervention \(I_1\) in Example 6, where the intervention is applied with probability 0.5. Then the joint distribution over X and \(\zeta \) is

$$\begin{aligned} \mathbb {P}_{X,\zeta }(X,\zeta ) = \mathsf {Ber}(\zeta ; .5)&\quad \times \mathsf {Ber}(X_1; .5) \times \mathcal {N}(X_2; 25 + 2 X_1, 1) \times \mathcal {N}(X_3; 20 + 2 X_1, 1) \\&\quad \times \mathcal {N}(X_4; \frac{1}{2} (X_2 + X_3) - 10 \zeta , 1) \times \mathcal {N}(X_5; X_4, 1) . \end{aligned}$$

The causal DAG for \(\zeta , X_1, X_2, X_3, X_4, X_5\) is shown in Fig. 3a. The node 5 is d-separated from \(\zeta \) given 4. Therefore, \(\mathbb {P}(X_5 \mid X_4, \zeta = 1) = \mathbb {P}(X_5 \mid X_4, \zeta = 0)\), i.e., \(\mathbb {P}_X(X_5 \mid X_4) = \mathbb {P}_X^{I_1}(X_5 \mid X_4)\). \(\square \)

To generalize to multiple interventions, we add a node for each intervention. In particular, consider a set of interventions \(\mathcal {I}= \{ I_1, \ldots , I_M \}\). For the intervention \(I_m\) with targets \(T_m\), we introduce a node \(\zeta _m\) with children \(T_m\). Again, \(\zeta _m = 1\) denotes that the sample comes from the intervention \(I_m\), and \(\zeta _m = 0\) otherwise. However, each sample can only be generated from a single intervention, i.e., \(\zeta _m = 1\) for at most one m. To reflect this constraint, we include a final node \(\zeta ^*\), which takes values in \(0, 1, \ldots , M\), to indicate which intervention the sample comes from, i.e., \(\zeta _m = 1\) if and only if \(\zeta ^* = m\). Thus, if \(\zeta ^* = 0\), the sample comes from the observational distribution. The resulting DAG is called the interventional DAG (\(\mathcal {I}\)-DAG) [177].

Fig. 3 a \(\mathcal {I}\)-DAG from Example 7. b \(\mathcal {I}\)-DAG from Example 8

Example 8

(Binary encoding of a set of interventions) Let \(I_1\) be the intervention in Example 6, and let \(I_2\) be an intervention which changes the distribution of \(\epsilon _4\) to \(\mathcal {N}(-5, 1)\). Suppose each intervention has a 40% chance of being applied. In the remaining 20% of the time, no intervention takes place. Then the joint distribution over \(X, \zeta _1, \zeta _2\), and \(\zeta ^*\) is

$$\begin{aligned} \mathbb {P}_{X,\zeta }(X,\zeta ) =&\mathsf {Cat}(\zeta ^*; (0, 1, 2), (0.2, 0.4, 0.4)) \times \left( 1 - \mathbbm {1}_{\zeta _1 = 1, \zeta ^* \ne 1} \right) \times \left( 1 - \mathbbm {1}_{\zeta _2 = 1, \zeta ^* \ne 2} \right) \\ \times&\mathsf {Ber}(X_1; .5) \times \mathcal {N}(X_2; 25 + 2 X_1, 1) \times \mathcal {N}(X_3; 20 + 2 X_1, 1) \\ \times&\mathcal {N}(X_4; \frac{1}{2} (X_2 + X_3) - 10 \zeta _1 - 5 \zeta _2, 1) \times \mathcal {N}(X_5; X_4, 1). \end{aligned}$$

The causal DAG for \(\zeta _1, \zeta _2, \zeta ^*, X_1, X_2, X_3, X_4, X_5\) is shown in Fig. 3b. \(\square \)

Following [177], we define a conditional invariance statement to be a conditional independence statement where the conditioning set includes intervention variables, e.g., \(\mathbb {P}_{X,\zeta }(X_i \mid X_C, \zeta ^* = m) = \mathbb {P}_{X,\zeta }(X_i \mid X_C, \zeta ^* = 0)\). This statement posits that a conditional distribution in the mth interventional setting is the same as it is in the observational setting, i.e., the conditional distribution is invariant under the intervention. A set of observational and interventional distributions satisfies the \(\mathcal {I}\)-Markov property with respect to a DAG \(\mathcal {G}\) and a set of interventions \(\mathcal {I}\) if it satisfies the global Markov property with respect to \(\mathcal {G}\), and satisfies all conditional invariance statements entailed by the \(\mathcal {I}\)-DAG. Similarly to the observational case, given a set \(\mathcal {I}\) of interventions, if two DAGs \(\mathcal {G}\) and \(\mathcal {G}'\) entail the same set of conditional independence and conditional invariance statements, we call them \(\mathcal {I}\)-Markov equivalent, denoted \(\mathcal {G}\approx _{\mathcal {M}_\mathcal {I}} \mathcal {G}'\). The resulting \(\mathcal {I}\)-Markov equivalence class (\(\mathcal {I}\)-MEC) is thus a (not necessarily strict) subset of the MEC, as demonstrated by the following example.

Example 9

(Interventional Markov equivalence) Given the intervention set \(\mathcal {I}= \{ I_1 \}\) for \(I_1\) with target 1, the graphs \(\mathcal {G}_2\) and \(\mathcal {G}_3\) in Fig. 2 are \(\mathcal {I}\)-Markov equivalent, since they both entail the invariance statements \(\mathbb {P}^{I_1}(X_2) = \mathbb {P}(X_2)\) and \(\mathbb {P}^{I_1}(X_3) = \mathbb {P}(X_3)\). However, \(\mathcal {G}_1\) does not entail these invariance statements, so it is not \(\mathcal {I}\)-Markov equivalent to \(\mathcal {G}_2\) and \(\mathcal {G}_3\). \(\square \)

2.3 Graphical Representations for Latent Confounding

Thus far, we have discussed how a structural causal model defines a data generating process for a particular system and interventions on that system. In the simplest case, called the causally sufficient setting, one directly observes the generated data. However, it is often the case that observations are subject to additional processing, in which case we call the setting causally insufficient. Two forms of causal insufficiency are commonly considered. First, under latent confounding, some of the endogenous variables are simply unobserved, and we call these variables latent confounders. Thus, instead of observing samples from the distribution \(\mathbb {P}_X\), one observes samples from a marginal distribution \(\mathbb {P}_{X'}\) for \(X' \subset X\). For instance, suppose that in Example 1, the experimentalist does not record the variable \(X_1\) indicating whether the mice were genetically modified. Then, an observer looking at their data would see samples from the distribution \(\mathbb {P}_{X_2, X_3, X_4, X_5}\). Second, under selection bias, the probability that a sample is observed may depend on the values of some of the variables in the sample. Thus, if we introduce a binary variable S to indicate whether a sample is observed, and let \(\mathbb {P}(S = 1 \mid X)\) describe the selection process, then one observes samples from the conditional distribution \(\mathbb {P}(X \mid S = 1)\). For instance, suppose that in Example 1, the experimentalist only records those experiments for which the mouse in the final generation weighs more than 20 grams. Then, someone looking at their data would see samples from the distribution \(\mathbb {P}_X(\cdot \mid X_5 \ge 20)\).

In this section, we will focus on the first type of causal insufficiency, latent confounding. We postpone discussion of selection bias to Sect. 5. Without causal sufficiency, one must somehow account for latent confounders to perform accurate causal structure learning. When the latent confounders have special structure, it may be possible to explicitly recover the relationship of the latent confounders and the observed variables. One such case is when each latent confounder is a parent of a large portion of the observed variables, which is termed pervasive confounding. In such settings, the observed data may be “deconfounded” by removing its top principal components [51, 141], even when the causal relations are nonlinear [3]. A large range of assumptions on the structure between the unobserved and observed variables may be suitable for different applications. A thorough summary of methods using such assumptions is outside the scope of the current review. Instead, we focus on a different approach for accounting for latent confounders, which acknowledges their presence but does not attempt to explicitly recover their relationships with the observed variables.

Structural assumptions on latent confounders can leave a wide range of signatures on the distribution of the observed variables. These signatures include not only conditional independence constraints, which can be expressed in the form \(\mathbb {P}_X(X_i, X_j \mid X_C) = \mathbb {P}_X(X_i \mid X_C) \mathbb {P}_X(X_j \mid X_C)\), but also more complex constraints. This includes both equality constraints on the distribution \(\mathbb {P}_X\), commonly called Verma constraints, as well as inequality constraints. The full set of constraints is referred to as a marginal DAG model [46], and can be graphically modeled using a hypergraph. Indeed, [46] show that ordinary mixed graphs are incapable of representing marginal DAG models. Nevertheless, ordinary mixed graphs are capable of encoding a rich subset of the constraints implied by a marginal DAG model. For example, an acyclic directed mixed graph (ADMG) encodes a subset of the equality constraints of the marginal DAG model via the associated nested Markov model [133, 145]; in fact, the nested Markov model is known to encode all equality constraints in the case of discrete variables [47]. It is outside the scope of this review to provide a full overview of the different types of graphs used to capture the constraints of marginal DAG models; see [46] and [103] for more thorough overviews.

In our review, we focus on (directed) ancestral graphs, which encode only conditional independencies, are closed under marginalization, and have at most one edge between each pair of vertices. Directed ancestral graphs are mixed graphs, consisting of both directed and bidirected edges. A bidirected edge between two nodes indicates the possibility that they are both children of the same unobserved variable(s). Similarly to directed graphs in the causally sufficient setting, the mixed graphs in the causally insufficient case are required to obey a form of acyclicity condition. In particular, a mixed graph with directed and bidirected edges is called “ancestral” if there are no directed cycles, and if any two nodes that are connected by a bidirected edge (called spouses) are not ancestors of one another [132].

Similarly to DAG models, there is a notion of separation in directed ancestral graphs, called m-separation. The definition is the same as for d-separation: two nodes are m-connected by a path \(\gamma \) given a set of nodes C if (1) every non-collider on the path is not in C, and (2) every collider on the path is either in C or has a descendant in C. Unfortunately, this notion of separation has the property that two non-adjacent nodes may have no m-separating set. Fortunately, adding a bidirected edge between two such nodes does not affect the set of m-separation statements which hold in the directed ancestral graph ([132], Theorem 5.1). The operation of adding bidirected edges between all such nodes is called taking the maximal completion of a directed ancestral graph, and a directed ancestral graph is called maximal if it is its own maximal completion. It is natural in structure learning to restrict the search space to directed maximal ancestral graphs (DMAGs), so that each adjacency between nodes corresponds exactly to the lack of an m-separating set.

Example 10

(Maximal completion) Fig. 4 shows a graph (left) which is not maximal, since 1 and 4 are m-connected given any of the sets \(\varnothing , \{ 2 \}, \{ 3 \}, \{ 2, 3 \}\), but they are not adjacent. The graph on the right is its maximal completion. \(\square \)

Fig. 4 a An ancestral graph that is not maximal; b its maximal completion

3 Identifiability

As alluded to in the previous section, two Markov equivalent DAGs cannot be distinguished from observational data alone. In particular, given a DAG \(\mathcal {G}\), consider the collection of distributions \(\mathbb {M}(\mathcal {G})\) which factorize according to \(\mathcal {G}\), i.e., can be written in the form (1). This collection depends on the allowed set of conditional distributions \(\mathbb {P}_X(X_i \mid X_{{{\,\mathrm{pa}\,}}(i)})\). If the set of conditional distributions is unrestricted, then we have that \(\mathbb {M}(\mathcal {G}) = \mathbb {M}(\mathcal {G}')\) if and only if \(\mathrm {CI}(\mathcal {G}) = \mathrm {CI}(\mathcal {G}')\), i.e., Markov equivalent DAGs give rise to the exact same set of distributions. If the conditional distributions are restricted to specific classes, such as Gaussians or discrete measures, then this equivalence remains [109, 159].

Broadly speaking, there are two approaches to distinguishing between Markov equivalent DAGs. The first approach, which we call the functional form approach, considers restricting the class of conditional distributions in such a way that identifiability is possible from only observational data. The second approach, which we call the equivalence class approach, does not restrict the class of conditional distributions, but instead uses interventional data to refine the level of identifiability from the MEC to the \(\mathcal {I}\)-MEC. Given enough interventions, the equivalence class approach is sufficient for completely identifying a DAG or an ADMG [43].

3.1 Functional Form Approaches to Identifiability

Suppose the true causal graph is \(X_1 \rightarrow X_2\). The core idea in this class of approaches is to find asymmetries between models learned in the “causal” (\(X_1 \rightarrow X_2\)) and “anticausal” (\(X_2 \rightarrow X_1\)) directions. Asymmetries identified in this bivariate case often extend readily to the multivariate case.

As a canonical example, assume that noise is additive, i.e., \(X_2 = f_2(X_1) + \epsilon _2\), with \(\epsilon _2 \perp \!\!\!\perp X_1\). By making assumptions about the functional form of \(f_2\) and the distribution of \(\epsilon _2\), it is often possible to show that the induced distribution \(\mathbb {P}_X\) cannot be induced by a model of the form \(X_1 = f_1(X_2) + \epsilon _1\), with \(\epsilon _1 \perp \!\!\!\perp X_2\), under the same assumptions on \(f_1\) and \(\epsilon _1\). For example, [87, 143, 144] assume that each function \(f_i\) is linear, and each \(\epsilon _i\) is non-Gaussian. Indeed, [76] shows that in linear models, symmetry is only possible in the Gaussian case, and gives more general results for the case where \(f_i\) is nonlinear, which form the basis for structure learning methods such as the Causal Additive Model (CAM) algorithm [23]. Even in the linear Gaussian case, it is possible to achieve identifiability by imposing additional assumptions, such as equal error variances for each \(\epsilon _i\) [123]. It is also possible to move beyond the additive noise case, e.g., by allowing for further nonlinearities after the addition of noise [185].

Thus far, we have discussed identification strategies designed for continuous random variables. Similar results are achievable in the discrete case, e.g., by assuming that the exogenous noise terms have low entropy [93] or by assuming the existence of a (hidden) low cardinality representation of the cause variable that mediates its effects [24].

3.2 The Equivalence Class Approach to Identifiability

When no assumptions are made on the functional form, and only observational data is available, the true graph \(\mathcal {G}^*\) can only be identified up to the MEC, i.e., the set of DAGs \(\mathcal {G}'\) such that \(\mathcal {G}' \approx _\mathcal {M}\mathcal {G}^*\). Thus, for the purposes of algorithm design, it becomes interesting to characterize when two DAGs are Markov equivalent.

3.2.1 Characterizations of Markov Equivalence Classes

Fig. 5 Transformational characterization of equivalence in a DAG. Starting with the DAG on the left, we proceed to the right by performing covered edge reversals on the red edges (Color figure online)

Characterizations of Markov equivalence in DAGs There are numerous ways to characterize Markov equivalence in DAGs, and we will cover three main characterizations: a graphical characterization, a transformational characterization, and a geometric characterization. The graphical characterization is based on two notions. The skeleton of a DAG \(\mathcal {G}\) is defined as the set \({{\,\mathrm{skel}\,}}(\mathcal {G}) = \{ (i, j) \mid i \rightarrow j ~\text {or}~ j \rightarrow i ~\text {in}~\mathcal {G}\}\). The v-structures (also called immoralities) are defined as \({{\,\mathrm{vstruct}\,}}(\mathcal {G}) = \{ (i, j, k) \mid i \rightarrow j \leftarrow k ~\text {in}~\mathcal {G}, (i, k) \not \in {{\,\mathrm{skel}\,}}(\mathcal {G}) \}\). Verma and Pearl [170] show that two DAGs \(\mathcal {G}\) and \(\mathcal {G}'\) are Markov equivalent if and only if they have the same skeleton and v-structures, i.e., \(\mathcal {G}\approx _\mathcal {M}\mathcal {G}'\) if and only if \({{\,\mathrm{skel}\,}}(\mathcal {G}) = {{\,\mathrm{skel}\,}}(\mathcal {G}')\) and \({{\,\mathrm{vstruct}\,}}(\mathcal {G}) = {{\,\mathrm{vstruct}\,}}(\mathcal {G}')\). Given this graphical notion, it is natural to represent an MEC via an essential graph, which is a mixed graph with the same adjacencies as all DAGs in the equivalence class, and with the edge \(i \rightarrow j\) directed only if \(i \rightarrow j\) in all DAGs in the equivalence class.

Meanwhile, the transformational characterization is based on a single notion: a covered edge is an edge \(i \rightarrow j\) in \(\mathcal {G}\) such that \({{\,\mathrm{pa}\,}}_\mathcal {G}(i) = {{\,\mathrm{pa}\,}}_\mathcal {G}(j){\setminus }\{ i \}\). \(\mathcal {G}'\) and \(\mathcal {G}\) are related by a covered edge flip if \(\mathcal {G}'\) has all of the same edges as \(\mathcal {G}\), except that the covered edge \(i \rightarrow j\) in \(\mathcal {G}\) is oriented as \(j \rightarrow i\) in \(\mathcal {G}'\). From the graphical characterization, one can deduce that if \(\mathcal {G}\) and \(\mathcal {G}'\) are related by a series of covered edge flips, then \(\mathcal {G}\approx _\mathcal {M}\mathcal {G}'\). The transformational characterization states that the converse is also true: if \(\mathcal {G}\approx _\mathcal {M}\mathcal {G}'\), then \(\mathcal {G}\) can be transformed into \(\mathcal {G}'\) by a series of covered edge flips [29]. This transformation is illustrated in Fig. 5.

Finally, the geometric characterization encodes each graph as an integer-valued vector in the space \(\mathbb {Z}^{2^{[p]}}\). First, we introduce a set of basis vectors \(\delta _A\) for all subsets \(A \subseteq [p]\). Then, the standard imset for a DAG \(\mathcal {G}\) is given by \(u_\mathcal {G}= \delta _{[p]} - \delta _{\varnothing } + \sum _{i=1}^p \left( \delta _{{{\,\mathrm{pa}\,}}_\mathcal {G}(i)} - \delta _{\{ i \} \cup {{\,\mathrm{pa}\,}}_\mathcal {G}(i)} \right) \). Alternatively, [160] introduces the characteristic imset \(c_\mathcal {G}\), with \(c_\mathcal {G}(A) = 1\) if and only if there exists some \(i \in A\) such that \(A{\setminus }\{ i \} \subseteq {{\,\mathrm{pa}\,}}_\mathcal {G}(i)\). Two DAGs \(\mathcal {G}\) and \(\mathcal {G}'\) are Markov equivalent if and only if \(u_\mathcal {G}= u_{\mathcal {G}'}\), or equivalently, \(c_\mathcal {G}= c_{\mathcal {G}'}\).
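The graphical characterization translates directly into a few lines of code. The sketch below (DAGs represented as parent dictionaries, an illustrative encoding) tests Markov equivalence by comparing skeletons and v-structures, reproducing Example 4.

```python
# Markov equivalence via the Verma-Pearl characterization (a sketch).
from itertools import combinations

def skeleton(dag):
    """dag: dict mapping node -> set of parents."""
    return {frozenset((p, v)) for v in dag for p in dag[v]}

def v_structures(dag):
    skel = skeleton(dag)
    return {(min(i, k), j, max(i, k))
            for j in dag
            for i, k in combinations(dag[j], 2)
            if frozenset((i, k)) not in skel}   # endpoints non-adjacent

def markov_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# The graphs of Example 4: the chains 1 -> 2 -> 3 and 1 <- 2 <- 3 are
# equivalent, but the collider 1 -> 2 <- 3 lies in a different MEC.
chain     = {1: set(), 2: {1}, 3: {2}}
rev_chain = {1: {2}, 2: {3}, 3: set()}
collider  = {1: set(), 2: {1, 3}, 3: set()}
print(markov_equivalent(chain, rev_chain))  # True
print(markov_equivalent(chain, collider))   # False
```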

Characterizations of interventional Markov equivalence in DAGs As discussed in Sect. 2.2, the effect of an intervention can be formalized by introducing new binary variables to represent each intervention [115]. Therefore, the same characterizations of Markov equivalence that apply in the observational case just discussed also apply in the interventional case. However, it is still instructive to directly characterize the interventional Markov equivalence class. Consider a set of interventions \(\mathcal {I}\) such that \(\varnothing \in \mathcal {I}\) (i.e., observational data is available). Extending a result for perfect interventions [70], the authors of [177] show that two DAGs \(\mathcal {G}\) and \(\mathcal {G}'\) are \(\mathcal {I}\)-Markov equivalent if and only if they (1) have the same skeleton and v-structures, as in the observational case, and (2) for all \(I \in \mathcal {I}\) and \(i \in I\), \(j \not \in I\), we have \(j \rightarrow i\) in \(\mathcal {G}\) if and only if \(j \rightarrow i\) in \(\mathcal {G}'\). Note that this is equivalent to stating that the two \(\mathcal {I}\)-DAGs do not differ in v-structures of the form \(\zeta _I \rightarrow i \leftarrow j\), confirming the equivalence with the observational characterization applied to \(\mathcal {I}\)-DAGs. As an example, under the set of interventions \(\mathcal {I}= \{ \varnothing , \{ 1 \} \}\), the graphs \(\mathcal {G}_2\) and \(\mathcal {G}_3\) in Fig. 2 are \(\mathcal {I}\)-Markov equivalent, but \(\mathcal {G}_1\) is not, since its \(\mathcal {I}\)-DAG would not have the v-structure \(\zeta _{\{1\}} \rightarrow 1 \leftarrow 2\).

Characterizations of Markov equivalence in DMAGs As in the case of DAGs, equivalence between DMAG models can be characterized in multiple ways, and we will cover the graphical and transformational characterizations. For both characterizations, we must define the notion of a discriminating path for a vertex k. A path \(\gamma = \langle i, \ldots , k, j \rangle \) is a discriminating path for k if (i) there is at least one node on the path between i and k, (ii) every node between i and k is a collider on the path, and (iii) every node between i and k is a parent of j. We denote the set of discriminating paths for node k in \(\mathcal {G}\) as \({{\,\mathrm{discr}\,}}_k(\mathcal {G})\). A fundamental result [151] states that two DMAGs \(\mathcal {G}\) and \(\mathcal {G}'\) are Markov equivalent if and only if (i) they have the same skeleton and v-structures, and (ii) for all k, for all \(\gamma \in {{\,\mathrm{discr}\,}}_k(\mathcal {G}) \cap {{\,\mathrm{discr}\,}}_k(\mathcal {G}')\), k is a collider on \(\gamma \) in \(\mathcal {G}\) if and only if k is a collider on \(\gamma \) in \(\mathcal {G}'\). Checking this graphical condition for Markov equivalence can be computationally expensive, motivating recent work [77] which provides a new graphical characterization of Markov equivalence in DMAGs that can be checked more efficiently.

We next describe the transformational characterization of Markov equivalence in DMAGs. As in the case of DAGs, the transformational characterization requires us to define a local structural modification. In particular, the modification of the edge \(i \rightarrow j\) in \(\mathcal {G}\) to the edge \(i \leftrightarrow j\) in \(\mathcal {G}'\), or vice versa, is called a legitimate mark change [182] if (i) \({{\,\mathrm{pa}\,}}_\mathcal {G}(i) \subseteq {{\,\mathrm{pa}\,}}_\mathcal {G}(j)\), (ii) \({{\,\mathrm{sp}\,}}_\mathcal {G}(i){\setminus }\{ j \} \subseteq {{\,\mathrm{sp}\,}}_\mathcal {G}(j) \cup {{\,\mathrm{pa}\,}}_\mathcal {G}(j)\), and (iii) there is no \(\gamma \in {{\,\mathrm{discr}\,}}_i(\mathcal {G})\) for which j is the endpoint adjacent to i. The authors in [182] show that \(\mathcal {G}\approx _\mathcal {M}\mathcal {G}'\) if and only if \(\mathcal {G}\) and \(\mathcal {G}'\) are connected by a series of legitimate mark changes. This transformation is illustrated in Fig. 6.

Fig. 6 Transformational characterization of equivalence in a DMAG. Starting with the DMAG on the left, we proceed to the right by performing legitimate mark changes on the red edges (Color figure online)

Table 1 Number of MECs (first row), the ratio of the number of MECs to the number of DAGs (second row), and the ratio of the number of MECs of size 1 compared to the total number of MECs (third row), up to 10 nodes

3.2.2 Combinatorial Aspects of Markov Equivalence

Since DAGs in general can only be identified up to (\(\mathcal {I}\)-)Markov equivalence, it has been of significant interest to study the size of a given MEC, the number of MECs over a given number of variables, and the minimum number of interventions required to identify a DAG (i.e., obtain an \(\mathcal {I}\)-MEC of size 1).

The first problem—computing the number of DAGs within a given MEC, or computationally equivalently, sampling uniformly from the MEC—is important for a number of experimental design algorithms [56], which use Monte Carlo approximations to compute expectations over the MEC and pick interventions with good average-case behavior. A recent advance [175] provides a polynomial-time algorithm for this task based on a representation of the equivalence class via clique trees, improving over previous algorithms with exponential worst-case runtime [6, 14, 52, 57, 72, 162].

To address the second problem, [61] develops a program for enumerating all MECs on graphs with a given number of nodes, and obtains results for graphs of up to 10 nodes, shown in Table 1. Further theoretical works [126, 127] study the problem of enumerating all MECs for a fixed skeleton using the idea of generating functions from combinatorics. The computational results in [61] suggest that, asymptotically, the average MEC contains approximately 4 DAGs, and that roughly one quarter of all MECs are comprised of only a single DAG, in which case no interventional data is needed to identify the causal DAG. However, proving these conjectured limits, as well as efficiently enumerating the number of MECs on a given number of nodes, remain open combinatorial and computational problems. Less work has been done to characterize the average number of interventions required to identify a DAG. For a given DAG, [152] characterizes the minimum-size set of single-node interventions needed to identify the underlying causal DAG, using a representation based on clique trees. However, this work does not address the average of this quantity over all DAGs on a given number of nodes. Meanwhile, [88] conducts a computational study of the average number of greedily selected interventions needed to identify a graph, where the average is with respect to a directed Erdős–Rényi graph model. In this model, the results suggest that the number of interventions necessary is typically less than 4, but further work is necessary to characterize the average with respect to the uniform distribution over graphs and to address the case where interventions are picked optimally.
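For very small graphs, the counts in the first row of Table 1 can be reproduced by brute force. The sketch below (an illustrative computation) enumerates all DAGs on three labeled nodes and groups them into MECs using the skeleton/v-structure characterization from Sect. 3.2.1, recovering the known counts of 25 DAGs and 11 MECs.

```python
# Brute-force enumeration of MECs on p = 3 labeled nodes (a sketch).
from itertools import combinations, product

nodes = (0, 1, 2)
pairs = list(combinations(nodes, 2))       # the three unordered pairs

def all_dags():
    # Each pair is absent, oriented forward, or oriented backward:
    # 3^3 = 27 choices; removing the two directed 3-cycles leaves 25 DAGs.
    for choice in product((None, "fwd", "bwd"), repeat=len(pairs)):
        edges = set()
        for (u, v), c in zip(pairs, choice):
            if c == "fwd":
                edges.add((u, v))
            elif c == "bwd":
                edges.add((v, u))
        if {(0, 1), (1, 2), (2, 0)} <= edges or {(1, 0), (2, 1), (0, 2)} <= edges:
            continue                        # directed cycle
        yield frozenset(edges)

def mec_key(edges):
    skel = frozenset(frozenset(e) for e in edges)
    parents = {j: {u for u, v in edges if v == j} for j in nodes}
    vstructs = frozenset(
        (min(i, k), j, max(i, k))
        for j in nodes
        for i, k in combinations(parents[j], 2)
        if frozenset((i, k)) not in skel)
    return skel, vstructs

dags = set(all_dags())
mecs = {mec_key(d) for d in dags}
print(len(dags), len(mecs))                # 25 11
```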

4 Methods for Causal Structure Learning

Thus far, we have discussed what is in principle identifiable about the underlying causal DAG with observational and interventional data. Now, we present algorithms which carry these principles of identifiability into practice. In particular, we will discuss a number of algorithms which are consistent, i.e., in the limit of infinite data, they provably learn all identifiable causal structures. We will also highlight some heuristic algorithms, which do not have consistency guarantees but often perform well in practice. We begin with a broad overview of the different paradigms for causal structure learning, before diving into methods which explicitly leverage the combinatorial structures already discussed. At the highest level, methods for estimating causal models from data fall into two broad categories: constraint-based methods and score-based methods. Constraint-based methods are natural when viewing causal structure learning as a constraint satisfaction problem, where conditional independences or other constraints that can be inferred from data are used to iteratively prune the space of possible graphs. In contrast, score-based methods arise from viewing causal structure learning as a combinatorial optimization problem. These methods assign a score to each graph (or equivalence class) which quantifies how well it fits the data, then search the space of graphs (or equivalence classes) to find a model which optimizes the score. To highlight the general principles of these two paradigms, we will first concentrate on the causally sufficient case with only observational data. Then, in Sect. 4.3, we discuss algorithms that can make use of interventional data, and in Sect. 4.4, we briefly discuss algorithms for learning in the presence of latent confounding.

Constraint-based approaches The most prominent constraint-based approach to causal structure learning is the PC algorithm [86, 149]. The PC algorithm begins with a complete undirected graph and iteratively deletes edges by testing conditional independences involving conditioning sets of increasing cardinality. Then, the second phase of the PC algorithm orients v-structures by reusing the conditional independences found in the first phase. Additional orientations can be inferred via the Meek orientation rules [111].

The method for testing conditional independence (CI) depends on modeling assumptions as well as practical considerations such as computational complexity. For example, in a multivariate Gaussian distribution, two variables \(X_i\) and \(X_j\) are conditionally independent given the variables \(X_C\) if and only if the partial correlation \(\rho _{ij \mid C}\) is zero. Since the distribution of sample partial correlation coefficients is well known (see, e.g., [86]), hypothesis testing for CI in the Gaussian setting is straightforward and computationally efficient. On the other hand, in nonparametric settings, hypothesis tests for conditional independence can often be performed based on more complicated test statistics [75, 158, 186]. Unfortunately, impossibility results [142] state that any uniformly valid conditional independence test (i.e., one whose false positive rate tends to at most the significance level \(\alpha \), over all possible distributions \(\mathbb {P}\) where \(X \perp \!\!\!\perp Y \mid Z\)) will have no statistical power (i.e., the probability of a true positive will also be at most \(\alpha \)). Thus, testing conditional independence requires additional assumptions on the set of possible distributions, such as complexity restrictions on the function space of \(\mathbb {E}_\mathbb {P}[X \mid Z]\).
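As an illustration of the Gaussian case, the following sketch implements the standard Fisher-z partial correlation test (data and threshold are illustrative); PC-style implementations use a test of this form as their core primitive.

```python
# Fisher-z test of the partial correlation rho_{ij|C} (a sketch).
import numpy as np
from scipy.stats import norm

def gauss_ci_test(X, i, j, C, alpha=0.05):
    """X: (n, p) data array. Returns True if H0: rho_{ij|C} = 0 is retained."""
    idx = [i, j] + list(C)
    # Partial correlation from the inverse covariance of the submatrix.
    prec = np.linalg.inv(np.cov(X[:, idx], rowvar=False))
    rho = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n = X.shape[0]
    z = 0.5 * np.log((1 + rho) / (1 - rho)) * np.sqrt(n - len(C) - 3)
    pval = 2 * (1 - norm.cdf(abs(z)))
    return pval > alpha          # True: conditional independence not rejected

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
x3 = x2 + rng.normal(size=n)     # chain 1 -> 2 -> 3
X = np.column_stack([x1, x2, x3])
print(gauss_ci_test(X, 0, 2, []))    # False: X1 and X3 are dependent
print(gauss_ci_test(X, 0, 2, [1]))   # True:  X1 indep. X3 given X2
```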

Under such complexity assumptions, conditional independence tests allow constraint-based approaches to directly be applied to nonparametric settings, even permitting high-dimensional consistency bounds in these settings [69]. Furthermore, because conditional independences also characterize DMAG models, constraint-based approaches can be easily extended to settings with latent variables [35]. Pushing further, one may encode conditional independences as logical constraints, allowing them to be used in answer set programming (ASP) solvers. These solvers can search over more general model classes and easily incorporate background knowledge [80, 180]. However, ASP-based causal structure learning methods are widely viewed as being difficult to scale for many practical applications.

Score-based approaches Score-based methods for causal structure learning originated in parametric settings, such as in discrete or linear Gaussian models. In parametric settings, the score \(S(\mathcal {G})\) of a graph \(\mathcal {G}\) is often based on the marginal likelihood \(\mathbb {P}(\mathbb {X}\mid \mathcal {G})\) of the data \(\mathbb {X}\) given the graph \(\mathcal {G}\), with respect to some prior \(\mathbb {P}(\theta )\) over the parameters \(\theta \). In some cases, e.g., when choosing a conjugate prior for the likelihood function, \(\mathbb {P}(\mathbb {X}\mid \mathcal {G})\) can be computed in closed form [55]. Alternatively, it is common to use a consistent approximation of the marginal likelihood, in the form of the Bayesian information criterion (BIC) score [31, 32]. Such likelihood-based scores can be extended to nonparametric settings, e.g., by using Gaussian process priors [50] or non-paranormal distributions [117]. The BIC score and related scores are also a natural starting point from which to develop more sophisticated scores with better statistical and computational properties, see, e.g., [21].
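For linear Gaussian models, a BIC-type score is straightforward to compute, since it decomposes over the nodes of the DAG. The sketch below (an illustrative implementation; practical systems add caching and more careful numerics) scores a DAG by regressing each node on its parents and penalizing the number of parameters.

```python
# A decomposable BIC-style score for a linear Gaussian DAG (a sketch).
import numpy as np

def bic_score(X, dag):
    """X: (n, p) data; dag: dict node -> list of parents. Higher is better."""
    n, _ = X.shape
    score = 0.0
    for j, pa in dag.items():
        y = X[:, j]
        if pa:
            A = np.column_stack([X[:, pa], np.ones(n)])   # regressors + intercept
            resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
        else:
            resid = y - y.mean()
        sigma2 = resid @ resid / n                         # MLE of noise variance
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        score += loglik - 0.5 * np.log(n) * (len(pa) + 2)  # BIC penalty per node
    return score

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])
# The DAG with the edge 0 -> 1 outscores the empty DAG on this data.
print(bic_score(X, {0: [], 1: [0]}) > bic_score(X, {0: [], 1: []}))  # True
```

Note that Markov equivalent DAGs (here, \(0 \rightarrow 1\) and \(1 \rightarrow 0\)) receive the same BIC score in this model class, consistent with the score equivalence exploited by equivalence-class search methods such as GES below.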

Finding the highest scoring DAG model is generally NP-hard [30], imposing a trade-off between computational efficiency and algorithmic consistency guarantees. Score-based methods can generally be subdivided into three categories based on how they address this trade-off. On the one end of the spectrum, exact score-based approaches find some \(\widehat{\mathcal {G}}\) that exactly optimizes the score S. Exact approaches address computational issues using a variety of combinatorial optimization techniques and heuristics, e.g., dynamic programming [95, 121], A*-style state-space search [179], or methods from integer linear programming [11, 38, 39, 82]. For example, the GOBNILP algorithm [39] uses the geometric characterization of Markov equivalence classes to reduce structure learning to an integer linear programming problem. This reduction allows the use of techniques such as cutting planes and pricing to handle the exponential number of decision variables and constraints.

Greedy score-based approaches achieve better computational efficiency than exact approaches by relaxing the requirement that \(\widehat{\mathcal {G}}\) optimizes S. Most prominently, greedy equivalence search (GES) [31] and its variants [32] perform a search over equivalence classes of graphs that greedily optimizes S. While greedy algorithms are not exact, they are still consistent, placing them in a middle ground on the computational–statistical trade-off. Notably, [106] shows that GES and a number of other greedy approaches can also be viewed geometrically. In particular, these methods can be seen as edge walks between vertices of the characteristic imset polytope, i.e., the convex hull of all characteristic imsets on p variables. Finally, at the other extreme of this trade-off, gradient-based methods [101, 178, 187, 190] relax the discrete search space over DAGs to a continuous search space, allowing gradient descent and other techniques from continuous optimization to be applied to causal structure learning. However, the search space of these problems is highly non-convex, so the optimization procedure may become stuck in a local minimum. Thus, consistency guarantees for these methods will depend on theoretical advances in global minimization of such non-convex optimization problems.

4.1 Learning DAGs Using Permutation-Based Algorithms

Beyond the constraint-based and score-based paradigms for causal structure learning already discussed, there are a variety of hybrid methods [7, 117, 138, 140, 166], which generally use constraints to reduce the search space, and scores to optimize over this reduced search space. In this section, we discuss the greedy sparsest permutation (GSP) algorithm, a hybrid method that constrains the search space to the set of (estimated) minimal I-MAPs of \(\mathbb {P}_X\). By focusing on this method, we highlight the combinatorial nature of the problem of causal structure learning.

As discussed in Sect. 2, a distribution \(\mathbb {P}_X\) may permit several different minimal I-MAPs. Since the minimal I-MAPs of \(\mathbb {P}_X\) are the (locally) sparsest DAGs that can correctly model \(\mathbb {P}_X\), they form a natural space over which to search for the true DAG \(\mathcal {G}^*\). Furthermore, the space of minimal I-MAPs of \(\mathbb {P}_X\) can be described as the image of a \(\mathbb {P}_X\)-dependent map whose domain, the set \(S_p\) of permutations of \([p]\), does not depend on \(\mathbb {P}_X\). We denote by \(i <_\pi j\) that i appears earlier in the permutation \(\pi \) than j, and we call a graph \(\mathcal {G}\) consistent with a permutation \(\pi \) if \(i <_\pi j\) implies that \(j \not \rightarrow i\) in \(\mathcal {G}\). The following result establishes the existence of a unique map from permutations to minimal I-MAPs.

Theorem 1

(from [169]) Given a permutation \(\pi \) and a distribution \(\mathbb {P}_X\), there exists a unique graph \(\mathcal {G}_{\mathbb {P}_X}(\pi )\) that is consistent with \(\pi \) and is a minimal I-MAP of \(\mathbb {P}_X\). This graph has an edge \(i \rightarrow j\) exactly for those pairs with \(i <_\pi j\) such that \(X_i\) and \(X_j\) are dependent given \(X_{{{\,\mathrm{pre}\,}}_\pi (j) \setminus \{i\}}\), where \({{\,\mathrm{pre}\,}}_\pi (j) = \{ k \mid k <_\pi j \}\) denotes the set of predecessors of j in \(\pi \).

Fig. 7 A greedy step over minimal I-MAPs performed by GSP. a True graph \(\mathcal {G}^*\), to which the distribution \(\mathbb {P}_X\) is faithful. b Minimal I-MAP associated with the permutation \(\pi ^{(0)} = [2,4,1,3]\), with covered edges shown in red. c Minimal I-MAP associated with the permutation \(\pi ^{(1)} = [2,4,3,1]\), obtained after flipping the covered edge \(1 \rightarrow 3\)

Given a graph \(\mathcal {G}\), let \(|\mathcal {G}|\) be the number of edges in the graph. The sparsest I-MAP theorem [131] establishes that, under a mild condition, the sparsest minimal I-MAPs of \(\mathbb {P}_X\)—i.e., those for which \(|\mathcal {G}_{\mathbb {P}_X}(\pi )|\) is minimized—are Markov equivalent to the underlying causal graph \(\mathcal {G}^*\). In particular, the required condition is strictly weaker than the restricted faithfulness assumption [129], which only requires that \(\mathbb {P}_X\) and \(\mathcal {G}^*\) agree on conditional independences/d-separations involving nodes connected by paths of length one or two. The sparsest I-MAP theorem directly suggests the sparsest permutation (SP) algorithm: enumerate all permutations \(\pi \in S_p\), estimate the minimal I-MAP \(\mathcal {G}_{\mathbb {P}_X}(\pi )\) for each permutation using conditional independence tests, and return the sparsest graphs found.
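Theorem 1 makes the SP algorithm only a few lines long. The sketch below assumes a conditional independence oracle `ci_test(i, j, S)` that returns True when independence is not rejected (e.g., the Gaussian test sketched earlier); the helper names are ours.

```python
from itertools import permutations

def minimal_imap(pi, ci_test):
    """Minimal I-MAP of a permutation pi, per Theorem 1: include the edge
    i -> j, for i before j in pi, iff X_i and X_j are dependent given the
    remaining predecessors of j."""
    edges = set()
    for b, j in enumerate(pi):
        for a in range(b):
            i = pi[a]
            S = [pi[k] for k in range(b) if k != a]
            if not ci_test(i, j, S):  # dependence: the edge cannot be removed
                edges.add((i, j))
    return edges

def sparsest_permutation(p, ci_test):
    """SP algorithm: enumerate all p! permutations, return a sparsest minimal
    I-MAP. Feasible only for very small p; GSP replaces the enumeration with
    a greedy walk."""
    return min((minimal_imap(pi, ci_test) for pi in permutations(range(p))),
               key=len)
```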

However, the SP algorithm is clearly computationally prohibitive, since the size of \(S_p\) is super-exponential in p. To address this issue, [148] proposed the greedy sparsest permutation (GSP) algorithm. GSP searches greedily over the space of permutations and, hence, over minimal I-MAPs. In particular, at each step i of the algorithm, GSP maintains a permutation \(\pi ^{(i)}\) and its corresponding minimal I-MAP \(\mathcal {G}_{\mathbb {P}_X}(\pi ^{(i)})\). At this step, GSP searches over the Markov equivalence class of \(\mathcal {G}_{\mathbb {P}_X}(\pi ^{(i)})\) for some DAG \(\mathcal {G}'\) that is not a minimal I-MAP of \(\mathbb {P}_X\); this search can be executed by repeatedly flipping covered edges to generate new permutations. Upon finding such a \(\mathcal {G}'\), there must be some strict sub-DAG \(\mathcal {G}''\) of \(\mathcal {G}'\) that is a minimal I-MAP of \(\mathbb {P}_X\). GSP then takes a topological ordering of this sub-DAG as the new permutation \(\pi ^{(i + 1)}\), with \(\mathcal {G}''\) as its corresponding minimal I-MAP \(\mathcal {G}_{\mathbb {P}_X}(\pi ^{(i + 1)})\). One greedy step of GSP is demonstrated in Fig. 7.
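The covered edge flips that drive this search are easy to state: an edge \(i \rightarrow j\) is covered in a DAG exactly when \({{\,\mathrm{pa}\,}}(j) = {{\,\mathrm{pa}\,}}(i) \cup \{i\}\), and reversing a covered edge always yields a Markov equivalent DAG. A minimal sketch, with the DAG represented as a set of edge pairs:

```python
def is_covered(edges, i, j):
    """True iff the edge i -> j is covered, i.e., pa(j) = pa(i) | {i}."""
    assert (i, j) in edges
    pa = lambda v: {u for (u, w) in edges if w == v}
    return pa(j) == pa(i) | {i}

def flip(edges, i, j):
    """Reverse a covered edge, moving to another member of the same MEC."""
    assert is_covered(edges, i, j)
    return (edges - {(i, j)}) | {(j, i)}
```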

As in the case for other greedy approaches, GSP has an interpretation as an edge walk over a convex polytope. In particular, starting from the permutahedron, i.e., the convex hull of all permutations on p nodes, we may define the DAG associahedron by contracting all edges \(\pi ^{(i)} - \pi ^{(j)}\) of the permutahedron for which \(\mathcal {G}_{\mathbb {P}_X}(\pi ^{(i)}) = \mathcal {G}_{\mathbb {P}_X}(\pi ^{(j)})\). As shown in [148], this contraction results in a convex polytope, GSP is equivalent to an edge walk along this polytope, and, under conditions that are strictly weaker than the faithfulness assumption, this edge walk terminates in the Markov equivalence class of the causal graph \(\mathcal {G}^*\) underlying \(\mathbb {P}_X\). The central technical ingredient in this proof is the existence of Chickering sequences. In particular, [31] proves the Meek conjecture for DAGs [112]: if \(\mathcal {G}_M\) is an I-MAP of \(\mathcal {G}_0 = \mathcal {G}^*\), then there exists a sequence \((\mathcal {G}_0, \mathcal {G}_1, \ldots , \mathcal {G}_M)\) in which each graph is obtained from the previous one by an edge addition or a covered edge reversal. Such a sequence is called a Chickering sequence [148], and its existence guarantees the consistency of GSP.

In addition to the consistency of GSP and of the algorithms discussed previously, which provides guarantees as the sample size goes to infinity, it is important to understand the performance of different algorithms at finite sample sizes. Simulation results suggest that score-based and hybrid approaches perform better for fixed sample sizes [8, 74, 117]. However, a theoretical characterization of the trade-offs between these algorithms on finite samples is still lacking and is an important area for future research, as also briefly described in Sect. 5.

4.2 Bayesian Methods for Causal Structure Learning

Thus far, we have only discussed causal structure learning methods which return a point estimate—i.e., a single DAG that (approximately or locally) maximizes a score, and/or satisfies inferred conditional independences. However, when the amount of data is small, there may be substantial uncertainty about the underlying graph (or equivalence class). A common framework for quantifying this uncertainty is Bayesian inference. Given some dataset \(\mathbb {D}\), instead of returning a point estimate, Bayesian methods return (an approximation to) the posterior \(\mathbb {P}(\mathcal {G}\mid \mathbb {D})\) over graphs. This posterior allows one to compute marginal probabilities of any feature of interest, such as the posterior probability of some edge \(i \rightarrow j\).

Bayesian methods for causal structure learning can be divided into exact approaches (e.g., [42]) and two types of approximate approaches: variational and sampling-based. Similarly to the gradient-based approaches discussed before, variational approaches do not necessarily return a consistent estimate of the posterior; rather, they project the posterior onto a variational family \(\{ Q(\cdot \mid \theta ) \}_{\theta \in \Theta }\) that is more computationally convenient. However, traditional variational families, such as multivariate Gaussians, are continuous and thus do not apply to the discrete space of DAGs. Thus, until recently, variational methods for Bayesian causal structure learning were not widely studied. For a recent work in this space, see [107], which uses relaxations of DAGs to a continuous search space and neural networks to parameterize a flexible variational family.

On the other hand, sampling-based approaches to Bayesian causal structure learning have been much more widely studied. Markov chain Monte Carlo (MCMC) methods have been especially popular, beginning with the structure MCMC algorithm [110], which runs a Metropolis–Hastings algorithm over the space of DAG models, using edge additions and deletions to move in this space. However, this approach suffers from slow mixing times due to regions of high-probability DAG models being separated by large regions of low-probability DAG models, i.e., if structure MCMC finds some high-probability DAG \(\mathcal {G}_0\), some other high-probability DAG \(\mathcal {G}_M\) may only be reachable from \(\mathcal {G}_0\) by a sequence of DAGs \(\mathcal {G}_1, \ldots , \mathcal {G}_{M-1}\) which have very low probability. Thus, the probability that structure MCMC traverses this path becomes vanishingly small, so that \(\mathcal {G}_M\) will not be sampled unless the algorithm is run for a very large number of steps.
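The following is a minimal sketch of this scheme, assuming an unnormalized log posterior `log_score` over edge sets (e.g., a BIC-type score plus a log prior); a faithful implementation would also include edge-reversal moves and correct the acceptance ratio for the differing number of legal moves at each state.

```python
import math
import random

def creates_cycle(edges, i, j):
    """Would adding i -> j create a directed cycle, i.e., does j reach i?"""
    stack, seen = [j], set()
    while stack:
        v = stack.pop()
        if v == i:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(w for (u, w) in edges if u == v)
    return False

def structure_mcmc(p, log_score, n_steps, rng=random.Random(0)):
    """Metropolis-Hastings over DAGs with single-edge addition/deletion moves."""
    G, samples = set(), []
    for _ in range(n_steps):
        i, j = rng.sample(range(p), 2)
        if (i, j) in G:
            G_new = G - {(i, j)}
        elif not creates_cycle(G, i, j):
            G_new = G | {(i, j)}
        else:  # invalid proposal: stay put, which counts as a rejection
            samples.append(frozenset(G))
            continue
        if math.log(rng.random()) < log_score(G_new) - log_score(G):
            G = G_new
        samples.append(frozenset(G))
    return samples  # P(i -> j | data) ~ fraction of samples containing (i, j)
```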

This difficulty has motivated a search for “smoother” sampling spaces, either by adding moves to structure MCMC [62, 67], or by changing the search space, as was done in order MCMC [45, 49], partial order MCMC [118], and partition MCMC [99]. These methods run a Markov chain over some “coarser” space (permutations, partial orders, or ordered partitions), then sample DAGs conditionally based on their consistency with the coarser structure. The minimal I-MAP MCMC algorithm [5] also runs a Markov chain over the coarser space of permutations. However, instead of conditionally sampling a DAG based on each permutation, it estimates the minimal I-MAP associated with each sampled permutation.

Since the space of permutations is much smaller than the space of DAGs or MECs, the minimal I-MAP MCMC algorithm can mix more quickly than previous algorithms. But this comes at a price: minimal I-MAP MCMC does not sample from the entire posterior distribution over DAG models, but only from a restricted subset. Luckily, this price is small: intuitively, conditional on an order, the minimal I-MAP asymptotically has the highest posterior probability, so a point mass on the minimal I-MAP is a good approximation of the true conditional distribution. Indeed, [5] shows that the posterior approximation error for any bounded function of the graph decreases exponentially with the number of samples. By highlighting this algorithm, we once again see the computational benefits that are possible when considering the combinatorial nature of the causal structure learning problem.
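A minimal sketch of this idea, reusing `minimal_imap` from the SP sketch above; the simple transposition proposal and the lack of a proposal correction are simplifications of the actual algorithm in [5].

```python
import math
import random

def minimal_imap_mcmc(p, ci_test, log_score, n_steps, rng=random.Random(0)):
    """Metropolis-Hastings over permutations; each permutation is mapped
    deterministically to its minimal I-MAP, so the chain never visits
    non-minimal DAGs."""
    pi = list(range(p))
    G = minimal_imap(tuple(pi), ci_test)
    samples = []
    for _ in range(n_steps):
        a, b = rng.sample(range(p), 2)  # propose swapping two positions
        pi_new = pi[:]
        pi_new[a], pi_new[b] = pi_new[b], pi_new[a]
        G_new = minimal_imap(tuple(pi_new), ci_test)
        if math.log(rng.random()) < log_score(G_new) - log_score(G):
            pi, G = pi_new, G_new
        samples.append(frozenset(G))
    return samples
```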

4.3 Causal Structure Learning Using Interventional Data

As discussed in Sect. 3, interventional data can significantly improve the identifiability of causal models. Several approaches have been proposed for learning from a combination of observational and experimental data, going back at least to the Bayesian approaches of [36] and [42]. As in the case of learning from purely observational data, these approaches can be divided into constraint-based approaches, such as the COmbINE algorithm [165], and score-based approaches. Score-based approaches include greedy algorithms, such as Greedy Interventional Equivalence Search (GIES) [70], and gradient-based algorithms, such as meta-learning approaches [89] and DCDI [22]. Note that, unlike GES in the observational setting, GIES is known not to be consistent for interventional data [171].

The Joint Causal Inference framework [115] discussed in Sect. 2.2 suggests a natural way to extend causal structure learning algorithms for observational data to settings with interventional data. In particular, an algorithm for the observational setting can be used to learn the \(\mathcal {I}\)-DAG by appending indicator variables to the dataset for each intervention \(I \in \mathcal {I}\), as long as the algorithm can incorporate appropriate forms of background knowledge. This background knowledge includes exogeneity (the intervention variables are not caused by the original “system” variables), randomized context (there is no confounding between the intervention and system variables), and generic context (the intervention variables are deterministically related to one another). As an example, [153] shows that the GSP algorithm can be adapted to include these assumptions, along with any assumptions about known targets of each intervention, while maintaining consistency of the algorithm. They call the resulting algorithm the Unknown Target Intervention GSP (UT-IGSP) algorithm to emphasize its ability to handle interventions with unknown targets, extending previous works where targets were assumed to be known [171, 177]. Finally, it is also natural to develop Bayesian variants of causal structure learning algorithms for interventional data; e.g., [27] shows how to compute posteriors over DAGs in the setting where the data are multivariate Gaussian.
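The data augmentation at the heart of this reduction is straightforward. The sketch below, with illustrative names of ours, pools datasets from several contexts and appends one binary indicator column per interventional context; an observational algorithm is then run on the pooled matrix, with the background knowledge above imposed, e.g., by forbidding edges from system variables into the indicator columns.

```python
import numpy as np

def jci_augment(datasets):
    """Pool datasets from K + 1 contexts (index 0 observational) and append one
    binary indicator variable per interventional context, so that an algorithm
    for observational data can learn the I-DAG from the pooled matrix."""
    K = len(datasets) - 1
    blocks = []
    for k, X in enumerate(datasets):
        ind = np.zeros((X.shape[0], K))
        if k > 0:
            ind[:, k - 1] = 1.0  # rows from context k get indicator k set to 1
        blocks.append(np.hstack([X, ind]))
    return np.vstack(blocks)
```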

4.4 Causal Structure Learning in the Presence of Latent Confounding

The approaches to causal structure learning in the causally insufficient setting follow the same broad categorization as approaches in the causally sufficient setting. In particular, the Fast Causal Inference (FCI) algorithm [150] is a constraint-based algorithm for learning DMAGs, similar in spirit to the PC algorithm. The FCI algorithm has inspired several variants, including Really Fast Causal Inference (RFCI) [35] and FCI+ [34]. Score-based methods include greedy search strategies, such as Greedy FCI (GFCI) [119], MAG Max–Min Hill Climbing \((M^{3}HC)\) [167], and Conservative rule and Causal effect Hill Climbing (CCHM) [33]; exact score-based approaches, such as AGIP [28]; and gradient-based approaches [16].

As in the case of learning DAGs, we will discuss a hybrid method for learning DMAGs that combines elements of both score-based and constraint-based approaches and elucidates the combinatorial aspects of learning DMAGs. This method, called the Greedy Sparsest Poset (GSPo) method, restricts the search space of DMAGs to the minimal I-MAPs of the distribution \(\mathbb {P}_X\). This space can be realized as the image of a map \(\mathcal {G}_{\mathbb {P}_X}\) from partially ordered sets (posets) to graphs. A poset \(\pi \) defines a relation \(\preceq _\pi \) that captures the notion of an ordering via three requirements: reflexivity (\(i \preceq _\pi i\) for all i), antisymmetry (\(i \preceq _\pi j\) and \(j \preceq _\pi i\) implies \(i = j\)), and transitivity (\(i \preceq _\pi j\) and \(j \preceq _\pi k\) implies \(i \preceq _\pi k\)). Because of the ancestrality condition, the set of complete DMAGs can be put in bijection with the set of posets, so that posets form a natural domain for the map \(\mathcal {G}_{\mathbb {P}_X}\).

The authors in [13] show that \(\mathcal {G}_{\mathbb {P}_X}(\pi )\) can be constructed using a procedure similar to that defined for DAGs in Theorem 1, although the construction requires two rounds of conditional independence testing between pairs of variables instead of one. They also provide a version of the sparsest I-MAP theorem for DMAGs: under a restricted faithfulness assumption, the sparsest minimal I-MAPs of \(\mathbb {P}_X\) are all Markov equivalent to the underlying DMAG \(\mathcal {G}^*\). Motivated by the GSP algorithm for learning DAGs, [13] introduce the GSPo algorithm for learning DMAGs, which uses legitimate mark changes to search over posets and iteratively find sparser I-MAPs. Results on over 100,000 synthetic examples suggest that the GSPo algorithm is consistent, but a proof of its consistency remains an important open problem, closely tied to the open problem of generalizing Meek’s conjecture [31, 112] to DMAGs.

5 Discussion and Open Problems

In this review article, we sought to cover both classical and recent approaches to causal structure learning, emphasizing the combinatorial nature of this problem. We end by discussing several related areas of work that were not covered in depth and remain under active development.

Learning with both interventions and latent confounding While we separately discussed learning with interventional data and learning under confounding, it is natural to combine these two settings. Recent work [83] considers this combination for DMAGs, introducing the new notion of \(\Psi \)-Markov equivalence to capture pairs of graphs and interventions that induce the same set of conditional independences and conditional invariances. This work allows for both soft and unknown-target interventions. Furthermore, [83] provides a graphical characterization of \(\Psi \)-Markov equivalence, and introduces a constraint-based algorithm, called \(\Psi \)-FCI, for learning the \(\Psi \)-Markov equivalence class from data. As a next step, it is natural to consider score-based algorithms, both exact and greedy, for learning DMAGs, ADMGs, and other subclasses of marginal DAG models, using a combination of observational and interventional data.

Learning with assumptions on the latent structure As indicated in Sect. 2.3, in some cases with unobserved confounding, it is desirable to recover the unobserved variables and their relationships to the observed variables. Naturally, such recovery requires assumptions on the latent structure. A common assumption, called the exogeneity or measurement assumption, is that all unobserved variables are upstream of the observed variables, i.e., none of the unobserved variables is caused by any of the observed variables.

With the exogeneity assumption as a starting point, additional assumptions may be made to (approximately) recover the latent variables and, possibly, the structure between them. For example, several works [51, 141] consider recovering the unobserved variables in settings with pervasive confounding, i.e., when each unobserved variable has a direct effect on a large number of observed variables. As an important special case of this setting, some works have considered recovering a mixture of DAG models [65, 137, 156, 157], where there is a single unobserved variable that is a parent of all variables in the graph. Alternatively, many works [25, 68, 84, 91, 100, 136, 176, 184] consider recovering unobserved variables under the measurement assumption and a form of purity or anchor assumption, where each unobserved variable must have some number of observed children that have no other parents. Few works consider recovering unobserved variables without the exogeneity assumption, with [154] being a recent exception.

Learning in the presence of selection bias As suggested in Sect. 2.3, considerable effort has gone into characterizing the distributional constraints imposed by marginalization of DAG models. However, in many applications, the observed distribution is the result of both marginalization and conditioning of an underlying distribution. In particular, such observed distributions are induced by selection bias, where the probability that a sample is observed depends on the values of some of the variables in the sample. General maximal ancestral graphs (see Sect. 2.3), which allow for undirected edges in addition to directed and bidirected edges, are conditional independence models that are closed under marginalization and conditioning. As in the case of marginalization, several graphical representations, including MC graphs [96] and summary graphs [174], have been introduced to capture constraints induced by such conditional models. However, to the best of our knowledge, there is no graphical representation that exactly captures all equality and inequality constraints induced by conditioning a DAG model, in contrast to the case for marginal models [46]. Thus, important next steps include (1) developing a graphical representation that fully captures both marginalization and conditioning, (2) developing notions of Markov equivalence in this setting, including with interventional data, and (3) developing structure learning algorithms in this general setting.

Learning cyclic causal models As indicated in Sect. 2, a widespread assumption in causal modeling and causal structure learning is that the structural causal model (SCM) induces an acyclic graph. However, this may not be the case if the SCM models a system that involves feedback loops. While the underlying dynamics of such a system are necessarily acyclic over time, feedback loops can arise when modeling the equilibrium states of the system [19]. For example, in gene regulatory networks, gene A may regulate gene B while gene B also regulates gene A, so that intervening on either gene affects the value of the other. Recent work [20] has investigated the semantics of cyclic causal models, showing that Markov properties and other desirable properties hold under certain solvability conditions. Despite the technical difficulties associated with cyclic models, several approaches have been proposed for learning their structure from data. These approaches include many algorithms designed for the linear case, including LLC [78], score-based approaches [58], and BackShift [134]. Algorithms for the general case include SAT-based approaches [81], exact score-based approaches [130], and constraint-based approaches [48, 114, 155].

Statistical and computational complexity of causal structure learning In conjunction with methodological developments for settings with cycles, latent confounding, selection bias, and interventional data, it is important to understand the fundamental statistical and computational limits of causal structure learning, and any trade-offs between these. The analysis of existing causal structure learning algorithms gives upper bounds on what is statistically and computationally achievable. Recent work derives upper bounds for a wide range of settings, including the linear equal-variance setting [60], the linear non-Gaussian setting [173], other parametric settings [120, 128], as well as nonparametric settings [53]. On the other hand, it is important to understand the fundamental lower bounds on the sample complexity needed by any causal structure learning algorithm. Such lower bounds have been established for the exponential family setting [59] and the linear equal-variance setting [54], but the lower bounds for a wide range of settings and assumptions remain uncharacterized.

Furthermore, since consistency of causal structure learning algorithms always requires some form of “faithfulness” or genericity assumption (see Sect. 2), there are likely trade-offs between the strength of the faithfulness assumption imposed and the computational and statistical complexity. Indeed, an interesting open question is to characterize the weakest assumption needed for causal structure learning, with the sparsest Markov representation assumption [131] being one candidate. Finally, the works discussed above all concern causally sufficient settings with only observational data. Incorporating interventional data into these analyses would open the possibility of a reduction in overall sample complexity and may introduce a landscape of trade-offs between interventional and observational sample complexities. Indeed, interventional data has been considered in recent works [1, 17] on the statistical and computational complexity of causal inference tasks, where the causal graph is assumed to be known and the task is to estimate interventional distributions. An interesting future direction is to explore the effect of interventional data on the complexity of causal structure learning as well.

Experimental design for causal structure learning In this review article, we have focused on causal structure learning in a passive setting, where we are given a dataset, or possibly several datasets from different interventions or contexts. However, in many scientific settings, such as biology, where interventions such as genetic or chemical perturbations can readily be performed, an important component of causal discovery is the choice of what data to gather [63]. This leads us to consider experimental design approaches for causal structure learning, where an experimenter picks interventions (and their values) in an effort to identify the underlying causal structure. Several approaches have been proposed for a variety of settings. In the non-adaptive setting, the experimenter picks all interventions at once. In [43], it is shown that, in the absence of any preexisting observational data, \(p-1\) interventions are sufficient and, in the worst case, necessary for identifying the underlying causal structure over p variables. Other work in the non-adaptive setting considers the presence of background knowledge (e.g., from observational data) [79], differences in costs between interventions [92, 105], and a fixed-budget setting [56].

Alternatively, the adaptive setting allows the experimenter to observe the outcome of each intervention before picking the next one. He and Geng [73] and Hauser and Bühlmann [71] propose greedy approaches for the adaptive setting, picking new interventions based on some measure of either expected or worst-case information gain. While these approaches are designed for the noiseless setting, in which an infinite amount of data is gathered from each intervention, more recent works [98, 164] explore greedy approaches in the noisy setting. [66] shows that strategies that maximize expected information gain can be exponentially suboptimal in the number of interventions they use, and proposes the Central Node algorithm for settings where the essential graph is a tree; this algorithm is shown to be a 2-approximation to the optimal adaptive strategy. Follow-up work [152] adapts this algorithm to a more general class of essential graphs, provides a characterization of the number of single-node interventions needed by an oracle to identify a causal graph, and shows that the resulting algorithm uses within a logarithmic factor of this number of interventions.

In between the non-adaptive and adaptive settings, [4] considers the active batched setting, in which the experimenter observes the outcome of a batch of interventions before picking the next batch. Recent work [161] establishes novel submodularity properties for greedy objectives in this setting, allowing for efficient optimization over the choice of interventions in each batch. Taken together, these recent advances suggest several future directions, including (1) characterizing the number of multi-target interventions needed by an oracle in the adaptive case [125], (2) approximation guarantees for experimental design, compared to either oracles or optimal strategies, and (3) experimental design in settings with latent confounding [2, 94], selection bias, and cycles.

Targeted causal structure learning Thus far, we have focused on the problem of causal structure learning as an end in itself; i.e., in both the passive and active settings discussed, the desired output was a causal graph (or equivalence class). However, ultimately, a major motivation for causal structure learning is to use the learned causal model in downstream tasks. A task of considerable importance is policy evaluation, i.e., predicting the effect of an action. The overall goal of this task can be phrased as estimating a specific functional of an interventional distribution defined by a structural causal model M. Two principal subtasks are then (1) determining whether this functional is identifiable by transforming it into a functional of the available distributions and (2) estimating the resulting functional from samples. When the only available distribution is the observational distribution defined by M, possibly with some variables unobserved, the first subtask is covered by the ID algorithm [146] and its variants [147].

More generally, data might be available from some set of interventional distributions defined by M, or from observational and interventional distributions associated with related structural causal models \(M'_1, \ldots , M'_K\). The relations between these structural causal models are encoded using a selection diagram, and the task of using the selection diagram to identify the functional is covered by a rich literature on transportability [9, 10, 37, 104]. Once the target functional is transformed into a functional of the available distributions, it becomes essential to estimate the functional in a sample-efficient way. This has been extensively studied in the literature on semiparametric efficiency [15, 135], double machine learning [85], and targeted machine learning [139], also covered in a recent review [90]. Thus far, causal structure learning and policy evaluation have been studied as separate tasks: The output of causal structure learning is a causal graph, while the input to policy evaluation is a causal graph or selection diagram. Therefore, the current approach to policy evaluation tasks when the graph is unknown would be to first perform causal structure learning, then to use the methods discussed for policy evaluation. It is likely that this approach is not optimally sample-efficient—the two steps should be “aware” of each other, i.e., causal structure learning should be performed in a way that is targeted toward the downstream task.

The problem of targeted causal structure learning remains mostly unexplored, with a few notable exceptions. In the adaptive experimental design setting, [4] considers targeted learning of any property of the underlying graph, and [188] considers targeted learning of a “matching” intervention, which affects the system in some desired way. In the batched data setting, [12, 172, 189] consider targeted learning of the difference between two DAG models, instead of the DAG models themselves. All of these works demonstrate computational and statistical benefits to targeted learning over untargeted structure learning, indicating that this is an important and promising direction.

Causal structure in reinforcement learning Policy evaluation is also an important task in reinforcement learning, where the policy is a sequence of actions that can depend on the state of the environment. The overlap between reinforcement learning and causality has been recently explored in the simple setting of multi-armed bandits, where an agent’s actions do not affect the state of the environment. By assuming that actions correspond to interventions in a known causal graph, the effects of different actions become related, allowing for better regret bounds [102, 116]. If the causal graph is not assumed to be known, there is an additional exploration–exploitation trade-off that needs to be taken into account, which has been considered in recent work [18, 97, 108]. Since certain parts of the causal graph might not be relevant to predicting the effect of an action on some reward, the reinforcement learning setting is another case in which targeted structure learning may be more efficient.