In this section we introduce basic concepts related to graphical models based on directed acyclic graphs and to causal inference. For further information on these topics the reader can refer to Pearl (2009) and Lauritzen (1996).
Graphical models
We briefly introduce the graph notation adopted hereinafter. Let \(\mathcal {G}= (V,E)\) be a graph, where \(V = \{1, \dots , q \}\) is a set of nodes (or vertices) and \(E \subseteq V\times V\) a set of edges. In what follows, if \((u,v)\in E\) and \((v,u)\notin E\), \(\mathcal {G}\) contains the directed edge \(u\rightarrow v\), while if both \((u,v)\in E\) and \((v,u)\in E\), then \(\mathcal {G}\) contains the undirected edge \(u - v\). A graph is called directed if it contains only directed edges. A Directed Acyclic Graph (DAG) \(\mathcal {D}\) is a directed graph which contains no cycles, that is, no sequences of nodes \((u_1,u_2,\dots ,u_k)\) with \(u_1=u_k\) such that \(u_1\rightarrow u_2\rightarrow \cdots \rightarrow u_k\). If \((u,v)\in E\) we say that u is a parent of v and denote the set of all parents of v in \(\mathcal {D}\) by \(\text {pa}_{\mathcal {D}}(v)\). Also, if there exists a directed path from u to v we say that v is a descendant of u, and we let \(\text {de}_{\mathcal {D}}(u)\) be the set of all descendants of u in \(\mathcal {D}\). Accordingly, the non-descendants of u are the nodes in the set \(\text {nd}_{\mathcal {D}}(u)=V\setminus \text {de}_{\mathcal {D}}(u)\).
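The graph notions above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the DAG is stored as a hypothetical dict-of-parents structure, and the function names `pa`, `de` and `nd` mirror the notation in the text.

```python
def pa(dag, v):
    """Parents of v: nodes u with an edge u -> v (dag maps node -> parent set)."""
    return set(dag[v])

def de(dag, u):
    """Descendants of u: nodes reachable from u via a directed path."""
    # Invert the parent map to obtain each node's children, then run a DFS.
    children = {v: {w for w in dag if v in dag[w]} for v in dag}
    seen, stack = set(), [u]
    while stack:
        for w in children[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def nd(dag, u):
    """Non-descendants of u: V \\ de(u), as defined in the text."""
    return set(dag) - de(dag, u)
```

For the chain \(1\rightarrow 2\rightarrow 3\), encoded as `{1: set(), 2: {1}, 3: {2}}`, `de(dag, 1)` returns `{2, 3}`.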
Let now \((X_1, \dots , X_q)\) be a random vector. The connection between a graph and a probabilistic model \(f(x_1,\dots ,x_q)\) for the random vector arises as we associate each variable \(X_j\) with a node in the graph. The latter introduces a set of conditional independencies among \(X_1,\dots ,X_q\) via the so-called Markov property of the graph. As different types of dependence patterns exist, different types of graphs are in general equipped with different Markov properties. A DAG encodes a set of conditional independencies between variables which can be read off from the DAG using graphical criteria such as d-separation (Pearl, 2009). We then denote by \(I(\mathcal {D})\) the set of all conditional independencies implied by \(\mathcal {D}\). Let \(\mathcal {D}\) be a DAG and \((X_1, \dots , X_q)\) a collection of random variables. A distribution \(f(x_1,\dots ,x_q)\) is said to be compatible with the DAG \(\mathcal {D}\), or Markov relative to \(\mathcal {D}\), if it admits the factorization
$$\begin{aligned} f(x_1,\dots ,x_q)=\prod _{j=1}^{q}f(x_j\,\vert \,\varvec{x}_{\text {pa}_{\mathcal {D}}(j)}). \end{aligned}$$
(1)
As many distributions may admit the factorization (1), it is possible to define a family of distributions \(M(\mathcal {D})\) that are Markov relative to \(\mathcal {D}\). For a given \(f(x_1,\dots ,x_q)\equiv f\), if we let I(f) be the set of conditional independencies in f, then \(f \in M(\mathcal {D})\) if and only if \(I(\mathcal {D}) \subseteq I(f)\). Moreover, if \(I(\mathcal {D}) = I(f)\), then f is said to be faithful to DAG \(\mathcal {D}\). This means that the conditional independencies in \(\mathcal {D}\) are all and only those embodied in the joint distribution f.
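As a toy numerical check of factorization (1), consider the chain DAG \(1\rightarrow 2\rightarrow 3\) with binary variables; the numbers below are made up for illustration. The product \(f(x_1)f(x_2\,\vert \,x_1)f(x_3\,\vert \,x_2)\) defines a valid joint distribution, i.e. it sums to one:

```python
import itertools

# Conditional tables for a binary chain 1 -> 2 -> 3 (illustrative values).
f1 = {0: 0.6, 1: 0.4}                               # f(x1)
f2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}     # f(x2 | x1)
f3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}     # f(x3 | x2)

def joint(x1, x2, x3):
    """Markov factorization (1) for the chain DAG."""
    return f1[x1] * f2[x1][x2] * f3[x2][x3]

# The factorized joint is a proper distribution: it sums to 1.
total = sum(joint(*x) for x in itertools.product([0, 1], repeat=3))
```

Here `joint(0, 1, 1)` evaluates to \(0.6\times 0.3\times 0.5 = 0.09\), and `total` equals 1.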
A further important property of Bayesian networks is Markov equivalence. In particular, two DAGs \(\mathcal {D}_1\) and \(\mathcal {D}_2\) are Markov equivalent if and only if \(I(\mathcal {D}_1)=I(\mathcal {D}_2)\). It follows that a given set of conditional independencies can be described by several DAGs, which are collected into an equivalence class. The latter can be uniquely represented through a Completed Partially Directed Acyclic Graph (CPDAG) (Chickering 2002), also known as the Essential Graph (EG) (Andersson et al. 1997), which is obtained as the union (over the edge sets) of all DAGs within the equivalence class.
Structural causal models and causal diagrams
DAGs are not necessarily carriers of causal information, and their common extension to probabilistic graphical models, namely Bayesian networks, only allows one to make conditional independence statements. Causal concepts instead involve relationships that cannot be deduced from the distribution alone (Pearl, 2009) and accordingly require additional assumptions on the generating mechanism. One possibility is to assume that each parent-child relationship in the network represents a stable and autonomous physical mechanism, meaning that it is possible to change one relationship without affecting the others. This assumption leads to the construction of Structural Causal Models (SCMs) and the corresponding graphical tools, named causal diagrams. See also Pearl (2009, Sects. 1.3.1–1.3.2) for an in-depth discussion and illustrative examples.
Traditionally, causal concepts are handled in econometrics and the social sciences through linear structural causal models, that is, SCMs in which the relationships between variables are assumed to be linear. In general, an SCM can be represented through a system of relations of the form
$$\begin{aligned} X_j \leftarrow f_j(\text {pa}_\mathcal {D}(j), U_j) \ \ \ j = 1, \dots , q, \end{aligned}$$
(2)
where \(\text {pa}_\mathcal {D}(j)\) is now to be interpreted as the set of variables which directly determine the level of \(X_j\). Moreover, \(U_j\) is an error term and the left-pointing arrow indicates a structural relation (as opposed to an algebraic one); see also Pearl (2009, Sect. 1.4).
Given a causal model in the form of (2), drawing an arrow from each variable in \(\text {pa}_\mathcal {D}(j)\) towards \(X_j\) results in a DAG \(\mathcal {D}\), called the causal diagram, which is said to be Markovian if it is acyclic and the error terms are jointly independent. It can be proved that every Markovian structural causal model M induces a distribution f which admits the same recursive decomposition (1) that characterizes Bayesian networks. However, causal models in the form (2) are more powerful, as the assumptions of stability and autonomy of the mechanisms allow one to compute the effect of hypothetical interventions from non-experimental (observational) data.
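A minimal sketch of a linear Markovian SCM of the form (2) can be simulated as follows; the chain structure \(1\rightarrow 2\rightarrow 3\), the coefficients and the Gaussian errors are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Jointly independent error terms U_1, U_2, U_3 (Markovian assumption).
u1, u2, u3 = rng.standard_normal((3, n))

# Structural assignments X_j <- f_j(pa(j), U_j), here linear:
x1 = u1                  # X1 <- U1
x2 = 0.8 * x1 + u2       # X2 <- 0.8 X1 + U2
x3 = -0.5 * x2 + u3      # X3 <- -0.5 X2 + U3
```

Because the mechanisms are autonomous, the implied covariance between \(X_1\) and \(X_3\) is the product of the path coefficients, \(0.8 \times (-0.5) = -0.4\), which the simulated sample reproduces up to Monte Carlo error.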
We now introduce the notion of intervention. A hard (or deterministic) intervention on the set of variables \(\{X_j, j\in I\}\), \(I\subseteq V\), is denoted by \(\text {do}\{X_j = \tilde{x}_j\}_{j\in I}\) and is defined as the action of fixing each \(X_j\), \(j \in I\), to some chosen value \(\tilde{x}_j\). A hard intervention modifies the SCM by replacing each equation \(X_j \leftarrow f_j(\text {pa}_\mathcal {D}(j), U_j)\), \(j \in I\), with a point mass at \(\tilde{x}_j\). From a graphical perspective, the effect of a hard intervention can be represented through the so-called intervention DAG, denoted by \(\mathcal {D}^I\), which is obtained from the original DAG \(\mathcal {D}\) by removing all edges \((u, j)\) such that \(j\in I\); see also the example in Fig. 2.
A hard intervention \(\text {do}\{X_j = \tilde{x}_j\}_{j\in I}\) leads to the definition of the post-intervention distribution, which can be written using the truncated factorization
$$\begin{aligned} f(x_1, \dots , x_q\,\vert \,\text {do}\{X_j = \tilde{x}_j\}_{j \in I}) = {\left\{ \begin{array}{ll} \prod _{i \notin I} f(x_i\,\vert \,\varvec{x}_{\text {pa}_{\mathcal {D}}(i)})\big|_{\{x_j=\tilde{x}_j\}_{j \in I}} &{} \text {if } x_j = \tilde{x}_j \quad \forall \, j \in I, \quad \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)
Importantly, the conditional densities in (3) are the same appearing in (1): this means that the post-intervention distribution can be expressed in terms of observational densities.
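The truncated factorization (3) can be illustrated on a toy discrete model (all numbers made up): take the DAG \(Z\rightarrow X\), \(Z\rightarrow Y\), \(X\rightarrow Y\) with binary variables. Intervening with \(\text {do}\{X=\tilde{x}\}\) drops the factor \(f(x\,\vert \,z)\) from the product, leaving \(f(z, y\,\vert \,\text {do}\{X=\tilde{x}\}) = f(z)\,f(y\,\vert \,\tilde{x}, z)\):

```python
# Illustrative conditional tables for the DAG Z -> X, Z -> Y, X -> Y.
fz = {0: 0.5, 1: 0.5}                                # f(z)
fx = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}      # f(x | z)
fy = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.6, 1: 0.4},
      (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.2, 1: 0.8}}  # f(y | x, z), keys (x, z)

def post_int_y(y, x_tilde):
    """f(y | do(X = x~)) via the truncated factorization (3):
    the factor f(x | z) is removed and x is fixed at x~."""
    return sum(fz[z] * fy[(x_tilde, z)][y] for z in (0, 1))
```

Note that `fx` plays no role in `post_int_y`: exactly as (3) states, the post-intervention distribution is built only from the observational densities of the non-intervened variables, evaluated at \(x = \tilde{x}\).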
Moreover, Nandy et al. (2017) define the total joint effect of an intervention \(\text {do}\{X_j = \tilde{x}_j\}_{j \in I}\) on \(X_1 \equiv Y\) as
$$\begin{aligned} \varvec{\theta }_{Y}^{I} := (\theta _{h,Y}^{I})_{h \in I}^\top , \end{aligned}$$
(4)
where for each \(h \in I\)
$$\begin{aligned} \theta _{h,Y}^I := \frac{\partial }{\partial {x_h}} \mathbb {E}(Y\,\vert \,\text {do}\{X_j = \tilde{x}_j\}_{j \in I}) \end{aligned}$$
(5)
is the causal effect on Y associated with variable \(X_h\) in the joint intervention.
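In a linear Gaussian SCM the derivative in (5) does not depend on the intervention level, and for a single intervention the causal effect of \(X_h\) on Y coincides with the coefficient of \(X_h\) in the regression of Y on \(X_h\) and its parents (this is the device used by Maathuis et al. 2009). The following sketch checks this on simulated data; the DAG, coefficients and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Linear SCM: Z -> X, Z -> Y, X -> Y; the true causal effect of X on Y is 2.
z = rng.standard_normal(n)                        # Z <- U_z
x = 0.7 * z + rng.standard_normal(n)              # X <- 0.7 Z + U_x
y = 2.0 * x + 1.5 * z + rng.standard_normal(n)    # Y <- 2 X + 1.5 Z + U_y

# Regress Y on X and pa(X) = {Z} (plus an intercept); the X coefficient
# estimates the interventional derivative in (5).
A = np.column_stack([x, z, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_hat = coef[0]   # close to the true effect 2.0
```

Omitting Z from the regression would instead recover a biased, purely associational coefficient, since Z confounds X and Y.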
Causal discovery and causal effect estimation
Causal effects can be estimated whenever a causal diagram representing the causal structure of the problem is available. However, this is often not the case and the causal structure must be inferred from the data. Causal discovery methods, that is, procedures whose aim is to learn causal DAGs from the data, are traditionally divided into three main classes: constraint-based methods, which estimate equivalence classes of DAGs by testing for conditional independencies between variables; score-based methods, which score DAGs through penalized likelihoods; and hybrid methods, which combine features of the first two approaches.
The PC algorithm (Spirtes et al. 2000) is one of the most popular algorithms for causal discovery. It is a constraint-based method that assumes acyclicity, causal faithfulness and causal sufficiency, where the latter refers to the absence of hidden (latent) variables. The PC algorithm provides an estimate of the CPDAG representing the true causal DAG. Specifically, it first estimates the CPDAG skeleton (that is, the undirected graph that would be obtained by removing all the edge orientations from the DAG) and then orients as many edges as possible using various orientation rules; see also Kalisch and Bühlmann (2007). For a complete review of causal discovery algorithms the reader can refer to Heinze-Deml et al. (2018).
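To convey the flavour of the skeleton phase, the rough sketch below tests each edge for (partial) correlation using Fisher's z-transform, with conditioning sets of size 0 and 1 only. It is a simplification, not the PC algorithm of Spirtes et al.: a full implementation grows the conditioning-set size, restricts conditioning sets to current adjacencies, records separating sets, and then applies the orientation rules.

```python
import itertools
import math
import numpy as np

def fisher_z_indep(corr, i, j, k, n, alpha=0.01):
    """Test X_i independent of X_j given X_k (k=None: marginal) via Fisher's z."""
    if k is None:
        r = corr[i, j]
    else:
        # First-order partial correlation from the marginal correlations.
        r = ((corr[i, j] - corr[i, k] * corr[j, k])
             / math.sqrt((1 - corr[i, k] ** 2) * (1 - corr[j, k] ** 2)))
    # z-statistic: atanh(r) * sqrt(n - |S| - 3), |S| = conditioning-set size.
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - (3 if k is None else 4))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p > alpha   # True = independence not rejected

def skeleton(data, alpha=0.01):
    """Order-0/1 skeleton sketch: drop an edge once a separating node is found."""
    n, q = data.shape
    corr = np.corrcoef(data, rowvar=False)
    edges = {frozenset(e) for e in itertools.combinations(range(q), 2)}
    for i, j in [tuple(e) for e in list(edges)]:
        if fisher_z_indep(corr, i, j, None, n, alpha):
            edges.discard(frozenset((i, j)))
            continue
        for k in set(range(q)) - {i, j}:
            if fisher_z_indep(corr, i, j, k, n, alpha):
                edges.discard(frozenset((i, j)))
                break
    return edges
```

On data simulated from a chain \(X_0\rightarrow X_1\rightarrow X_2\), the sketch keeps the edges \(X_0 - X_1\) and \(X_1 - X_2\) and removes \(X_0 - X_2\), since \(X_0\) and \(X_2\) are d-separated by \(X_1\).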
A slightly different approach has been adopted by Maathuis et al. (2009), who propose a methodology for causal effect estimation from single-node hard interventions in Gaussian models when the DAG is not available. The resulting algorithm is called IDA (Identification when DAG is Absent). In its basic version, IDA first estimates an equivalence class using the PC algorithm (alternatively, any other score-based method can be adopted). Next, for each DAG within the input class, the causal effect of \(X_h\) on Y is computed using multiple linear regression models. This basic version is slightly modified for computational reasons: the authors propose a faster alternative which only returns the distinct causal effects compatible with the input equivalence class, thus avoiding a full enumeration of the DAGs. Their methodology is further extended to joint (simultaneous) hard interventions by Nandy et al. (2017), leading to their joint-IDA method.
As in the case of single interventions, joint-IDA relies on a CPDAG which is estimated up-front, e.g. by using the PC algorithm. Next, three alternative methods for causal estimation from joint interventions are proposed, namely the path method, the recursive regression for causal effects method (RRC) and the modified Cholesky decomposition method (MCD); see the original paper for details.