Reasoning with alternative acyclic directed mixed graphs
Abstract
Acyclic directed mixed graphs (ADMGs) are the graphs used by Pearl (Causality: models, reasoning, and inference. Cambridge University Press, Cambridge, 2009) for causal effect identification. Recently, alternative acyclic directed mixed graphs (aADMGs) have been proposed by Peña (Proceedings of the 32nd conference on uncertainty in artificial intelligence, 577–586, 2016) for causal effect identification in domains with additive noise. Since the ADMG and the aADMG of the domain at hand may encode different model assumptions, it may be that the causal effect of interest is identifiable in one but not in the other. Causal effect identification in ADMGs is well understood. In this paper, we introduce a sound algorithm for identifying arbitrary causal effects from aADMGs. We show that the algorithm follows from a calculus similar to Pearl’s do-calculus. Then, we turn our attention to Andersson–Madigan–Perlman chain graphs, which are a subclass of aADMGs, and propose a factorization for the positive discrete probability distributions that are Markovian with respect to these chain graphs. We also develop an algorithm to perform maximum likelihood estimation of the factors in the factorization.
Keywords
Causality, Causal effect identification, Acyclic directed mixed graphs, Factorization, Maximum likelihood estimation
1 Introduction
Causal effect identification in ADMGs is well understood (Pearl 2009; Shpitser and Pearl 2006; Tian and Pearl 2002a, b). The same is not true for aADMGs. As mentioned, aADMGs were proposed by Peña (2016), who mainly studied them as a representation of statistical independence models. In particular, their global, local, and pairwise Markov properties were studied. Later, Peña and Bendtsen (2017) considered aADMGs for causal effect identification. Specifically, they presented a calculus similar to Pearl’s do-calculus (Pearl 2009; Shpitser and Pearl 2006), and a decomposition of the density function represented by an aADMG that is similar to the Q-decomposition by Tian and Pearl (2002a, b). In this paper, we extend the decomposition to identify further causal effects. The result is a sound algorithm for causal effect identification in aADMGs. We also show that the algorithm follows from the calculus of interventions in Peña and Bendtsen (2017).
Then, we turn our attention to the use of aADMGs as a representation of independence models. As mentioned, Peña (2016) describes Markov properties for aADMGs but no factorization property. We present a first attempt to fill this gap by developing a factorization for the positive discrete probability distributions that are Markovian with respect to AMP CGs, which, as mentioned, are a subclass of aADMGs. We also develop an algorithm to perform maximum likelihood estimation of the factors in the factorization. It is worth mentioning that a method for maximum likelihood estimation for Gaussian AMP CGs exists (Drton and Eichler 2006). It should also be mentioned that similar results exist for LWF and MVR CGs. Specifically, Lauritzen (1996) describes a factorization for the positive discrete probability distributions that are Markovian with respect to LWF CGs, and makes use of the celebrated iterative proportional fitting (IPF) algorithm for maximum likelihood estimation of the factors in the factorization. The IPF algorithm guarantees convergence to a global maximum of the likelihood function under mild assumptions. Drton (2008, 2009) describes a factorization for the positive discrete probability distributions that are Markovian with respect to MVR CGs, as well as an algorithm for maximum likelihood estimation of the factors in the factorization. The algorithm, named the iterative conditional fitting (ICF) algorithm, can be seen as dual to the IPF algorithm. However, unlike the IPF algorithm, the ICF algorithm only guarantees convergence to a local maximum or saddle point of the likelihood function, although it has proven to perform well in practice.
The rest of the paper is organized as follows. Section 2 introduces some preliminaries, including a detailed account of aADMGs for causal modeling. Section 3 presents our novel algorithm for causal effect identification. It also proves that the algorithm is sound and that it follows from a calculus of interventions. Section 4 presents our factorization for AMP CGs and the algorithm for maximum likelihood estimation. The paper ends with a discussion of follow-up questions worth investigating.
2 Preliminaries
Unless otherwise stated, all the graphs and density functions in this paper are defined over a finite set of continuous random variables V. We use uppercase letters to denote random variables and lowercase letters to denote their states. For the sake of readability, we use the elements of V to represent singletons, and sometimes we use juxtaposition to represent set union. An aADMG G is a graph with possibly directed and undirected edges but without directed cycles, i.e., \(A \rightarrow \cdots \rightarrow A\) is forbidden. There may be up to two edges between any pair of nodes, but in that case the edges must be different and one of them must be undirected to avoid directed cycles. Edges between a node and itself are not allowed. A topological ordering of V with respect to G is an ordering such that if \(A \rightarrow B\) is in G then \(A < B\).
Given an aADMG G, the parents of a set \(X \subseteq V\) in G are \(Pa_G(X) = \{A | A \rightarrow B\) is in G with \(B \in X \}\). The children of X in G are \(Ch_G(X) = \{A | A \leftarrow B\) is in G with \(B \in X \}\). The neighbours of X in G are \(Ne_G(X) = \{A | A - B\) is in G with \(B \in X \}\). The ancestors of X in G are \(An_G(X) = \{A | A \rightarrow \cdots \rightarrow B\) is in G with \(B \in X\) or \(A \in X \}\). Moreover, X is called an ancestral set if \(X = An_G(X)\). The descendants of X in G are \(De_G(X) = \{A | A \leftarrow \cdots \leftarrow B\) is in G with \(B \in X\) or \(A \in X \}\). A route between two nodes \(V_1\) and \(V_n\) on G is a sequence of (not necessarily distinct) edges \(E_1, \ldots , E_{n-1}\) such that \(E_i\) links the nodes \(V_i\) and \(V_{i+1}\). We do not distinguish between the sequences \(E_{1}, \ldots , E_{n-1}\) and \(E_{n-1}, \ldots , E_{1}\), i.e., they represent the same route. The route is called undirected if it only contains undirected edges. A component of G is a maximal set of nodes such that there is an undirected route in G between any pair of nodes in the set. The components of G are denoted as \({\mathcal C}(G)\), whereas \(Co_G(X)\) denotes the components to which the nodes in \(X \subseteq V\) belong.^{1} A set of nodes of G is complete if there exists an undirected edge between every pair of nodes in the set. The complete sets of nodes of G are denoted as \({\mathcal K}(G)\). A clique of G is a maximal complete set of nodes. The cliques of G are denoted as \({\mathcal Q}(G)\). Given a set \(W \subseteq V\), let \(G_W\) denote the subgraph of G induced by W, i.e., the aADMG over W that has all and only the edges in G whose both ends are in W.
Similarly, let \(G^W\) denote the marginal aADMG over W, i.e., \(A \rightarrow B\) is in \(G^W\) if and only if \(A \rightarrow B\) is in G, whereas \(A - B\) is in \(G^W\) if and only if \(A - B\) is in G or \(A - V_1 - \cdots - V_n - B\) is in G with \(V_1, \ldots , V_n \notin W\).
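As an illustration of the set definitions above, the following minimal sketch encodes a small aADMG and computes parents, neighbours, ancestors, descendants, and components. The example graph, the edge encoding (ordered pairs for directed edges, two-element frozensets for undirected edges), and all function names are assumptions made for illustration only.

```python
# Hypothetical aADMG, chosen only for illustration: A -> B, B -> C, A - C, C - D.
nodes = {"A", "B", "C", "D"}
directed = {("A", "B"), ("B", "C")}                      # (tail, head) pairs
undirected = {frozenset(("A", "C")), frozenset(("C", "D"))}

def parents(X):
    """Pa_G(X): tails of directed edges whose head lies in X."""
    return {a for (a, b) in directed if b in X}

def children(X):
    """Ch_G(X): heads of directed edges whose tail lies in X."""
    return {b for (a, b) in directed if a in X}

def neighbours(X):
    """Ne_G(X): nodes joined by an undirected edge to some node in X."""
    return {a for e in undirected for a in e if (e - {a}) & set(X)}

def descendants(X):
    """De_G(X): X together with every node reachable along directed edges."""
    found, frontier = set(X), set(X)
    while frontier:
        frontier = children(frontier) - found
        found |= frontier
    return found

def ancestors(X):
    """An_G(X): X together with every node that reaches X along directed edges."""
    found, frontier = set(X), set(X)
    while frontier:
        frontier = parents(frontier) - found
        found |= frontier
    return found

def components():
    """C(G): maximal sets of nodes connected by undirected routes."""
    remaining, comps = set(nodes), []
    while remaining:
        comp = {remaining.pop()}
        frontier = set(comp)
        while frontier:
            frontier = neighbours(frontier) - comp
            comp |= frontier
        remaining -= comp
        comps.append(comp)
    return comps
```

In this hypothetical graph, the components are {A, C, D} and {B}, and {A, B} is an ancestral set because it coincides with its own set of ancestors.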
A node C on a route in an aADMG G is said to be a collider on the route if \(A \rightarrow C \leftarrow B\) or \(A \rightarrow C - B\) is a subroute. Note that maybe \(A = B\). Moreover, the route is said to be connecting given \(Z \subseteq V\) when every collider on the route is in Z, and every noncollider on the route is outside Z. Let X, Y, and Z denote three disjoint subsets of V. When there is no route in G connecting a node in X and a node in Y given Z, we say that X is separated from Y given Z in G and denote it as \(X \!\perp \!_G Y | Z\). The independence model represented by G is the set of separations \(X \!\perp \!_G Y | Z\). Likewise, we denote by \(X \!\perp \!_f Y | Z\) that X is independent of Y given Z in a density function f. We say that f satisfies the global Markov property or simply that it is Markovian with respect to G if \(X \!\perp \!_G Y | Z\) implies that \(X \!\perp \!_f Y | Z\) for all X, Y, and Z disjoint subsets of V.
Finally, we mention some properties that density functions satisfy, as shown by, for instance, Studený (2005, Chapter 2). For all X, Y, W, and Z disjoint subsets of V, every density function f satisfies the following four properties: symmetry \(X \!\perp \!_f Y | Z \Rightarrow Y \!\perp \!_f X | Z\), decomposition \(X \!\perp \!_f Y \cup W | Z \Rightarrow X \!\perp \!_f Y | Z\), weak union \(X \!\perp \!_f Y \cup W | Z \Rightarrow X \!\perp \!_f Y | Z \cup W\), and contraction \(X \!\perp \!_f Y | Z \cup W \wedge X \!\perp \!_f W | Z \Rightarrow X \!\perp \!_f Y \cup W | Z\). If f is positive, then it also satisfies the intersection property \(X \!\perp \!_f Y | Z \cup W \wedge X \!\perp \!_f W | Z \cup Y \Rightarrow X \!\perp \!_f Y \cup W | Z\). Some (not yet characterized) probability distributions also satisfy the composition property \(X \!\perp \!_f Y | Z \wedge X \!\perp \!_f W | Z \Rightarrow X \!\perp \!_f Y \cup W | Z\).
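The separation criterion above can be checked mechanically. The sketch below is one possible way to do so, under the assumption that it suffices to search over pairs (node, mark of the edge through which the node was entered), since whether a node is a collider depends only on the two edges adjacent to it on the route. The function name and the edge encoding are hypothetical.

```python
def separated(directed, undirected, X, Y, Z=frozenset()):
    """Return True if no route connects a node in X and a node in Y given Z.

    directed: set of (tail, head) pairs; undirected: set of 2-element frozensets.
    """
    Z = set(Z)
    steps = {}  # steps[v] = list of (w, mark of the edge at v, mark of the edge at w)

    def add(v, w, mark_v, mark_w):
        steps.setdefault(v, []).append((w, mark_v, mark_w))

    for a, b in directed:                 # a -> b: tail at a, arrowhead at b
        add(a, b, "tail", "head")
        add(b, a, "head", "tail")
    for e in undirected:                  # a - b: no arrowhead at either end
        a, b = tuple(e)
        add(a, b, "und", "und")
        add(b, a, "und", "und")

    def passes(v, m_in, m_out):
        # v is a collider for A -> v <- B and A -> v - B (in either direction).
        collider = "head" in (m_in, m_out) and "tail" not in (m_in, m_out)
        return (v in Z) if collider else (v not in Z)

    # The first edge of a route may leave its start node without any condition.
    frontier = [(w, mw) for x in X for (w, _, mw) in steps.get(x, [])]
    seen = set(frontier)
    while frontier:
        v, m_in = frontier.pop()
        if v in Y:
            return False
        for (w, m_out, mw) in steps.get(v, []):
            if passes(v, m_in, m_out) and (w, mw) not in seen:
                seen.add((w, mw))
                frontier.append((w, mw))
    return True
```

Because routes may repeat nodes, conditioning on a descendant of a collider is handled without special treatment: in A → C → D with B → C, the route A → C → D ← C ← B is connecting given D.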
2.1 Causal interpretation of aADMGs
Table 1 Algorithm for magnifying an aADMG
Input: An aADMG G.
Output: The magnified aADMG \(G'\).
1  Set \(G'=G\)
2  For each node A in G
3      Add the node \(U_A\) and the edge \(U_A \rightarrow A\) to \(G'\)
4  For each edge \(A - B\) in G
5      Replace \(A - B\) with the edge \(U_A - U_B\) in \(G'\)
6  Return \(G'\)
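The magnification steps listed above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name, the `U_` naming of error nodes, and the edge encoding (ordered pairs for directed edges, frozensets for undirected edges) are assumptions.

```python
def magnify(nodes, directed, undirected):
    """Return (nodes', directed', undirected') of the magnified graph G'."""
    error = {a: f"U_{a}" for a in nodes}                 # one error node per node
    new_nodes = set(nodes) | set(error.values())
    # Lines 2-3: add U_A and the edge U_A -> A for every node A.
    new_directed = set(directed) | {(error[a], a) for a in nodes}
    # Lines 4-5: every undirected edge A - B is replaced with U_A - U_B.
    new_undirected = {frozenset(error[a] for a in e) for e in undirected}
    return new_nodes, new_directed, new_undirected


# Hypothetical input graph: V1 -> V2 and V1 - V2.
n, d, u = magnify({"V1", "V2"}, {("V1", "V2")}, {frozenset(("V1", "V2"))})
```

In the result, undirected edges only link error nodes, so the undirected part of the magnified graph is the graph that represents the correlation structure of the error nodes.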
Table 2 Algorithm for intervening on an aADMG
Input: An aADMG G and a set \(X \subseteq V\).
Output: The aADMG after intervening on X in G.
1  Delete from G all the edges \(A \rightarrow B\) with \(B \in X\)
2  For each path \(A - V_1 - \cdots - V_n - B\) in G with \(A, B \notin X\) and \(V_1, \ldots , V_n \in X\)
3      Add the edge \(A - B\) to G
4  Delete from G all the edges \(A - B\) with \(B \in X\)
5  Return G
A less formal but more intuitive interpretation of aADMGs is as follows. We can interpret the parents of each node in an aADMG as its observed causes. Its unobserved causes are summarized by an error node that is represented implicitly in the aADMG. We can interpret the undirected edges in the aADMG as the correlation relationships between the different error nodes. The causal structure is constrained to be a DAG, but the correlation structure can be any UG. This causal interpretation of aADMGs parallels that of the original ADMGs. There are, however, two main differences. First, the noise in ADMGs is not necessarily additive normal. Second, the correlation structure of the error nodes in ADMGs is represented by a covariance or bidirected graph. Therefore, whereas a missing edge between two error nodes in ADMGs represents marginal independence, in aADMGs it represents conditional independence given the rest of the error nodes. This means that the ADMG and the aADMG of the domain at hand may encode different assumptions, which may make a difference for causal effect identification, i.e., the effect may be identifiable in one model but not in the other. An example was provided in Sect. 1.
Given the above causal interpretation of an aADMG G, intervening on a set \(X \subseteq V\) so as to change the natural causal mechanism of X amounts to modifying the right-hand side of the equations for the random variables in X. For simplicity, we only consider interventions that set variables to fixed values. Graphically, an intervention amounts to modifying G as shown in Table 2. Line 1 is shared with an intervention on an original ADMG. Lines 2–4 are best understood in terms of the magnified aADMG \(G'\): they correspond to marginalizing the error nodes associated with the nodes in X out of \(G'_U\), the UG that represents the correlation structure of the error nodes. In other words, lines 2–4 replace \(G'_U\) with \((G'_U)^{U{\setminus }U_X}\), the marginal graph of \(G'_U\) over \(U {\setminus } U_X\). This makes sense since \(U_X\) is no longer associated with X due to the intervention and, thus, we may want to marginalize it out because it is unobserved. This is exactly what lines 2–4 imply. See Fig. 3 for an example. Note that the aADMG after the intervention and the magnified aADMG after the intervention represent the same separations over V (Peña 2016, Theorem 9). It can be proven that this definition of intervention works as intended: if f(v) is specified by Eqs. 3 and 4, then \(f(v {\setminus } x | \widehat{x})\) is Markovian with respect to the aADMG resulting from intervening on X in G (Peña and Bendtsen 2017, Corollary 5).
It is worth mentioning that Eqs. 3 and 4 specify each node as a linear function of its parents with additive normal noise. The equations can be generalized to nonlinear or nonparametric functions as long as the noise remains additive, i.e., \(A = g(Pa_G(A)) + U_A\) for all \(A \in V\). The density function f(u) can be any that is Markovian with respect to \(G'_U\). That the noise is additive ensures that \(U_A\) is determined by \(A \cup Pa_G(A)\), which is needed for Theorems 9 and 11 by Peña (2016) and Corollary 5 by Peña and Bendtsen (2017) to remain valid.^{2} Hereinafter, we assume that f(u) is positive. This is a rather common assumption in causal discovery (Peters et al. 2017, Definition 7.3). For instance, it is justified when U is affected by measurement noise such that any value of U is possible. Moreover, note that if f(u) is positive then f(v) is also positive, which is desirable: it would be impossible to identify the effect of an intervention \(\hat{x}\) if X never attains the value x in the observational regime (Pearl 2009, p. 78).
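The graphical effect of an intervention described above can be sketched as follows. This is a minimal illustration of the steps of the intervention algorithm; the function name, the edge encoding, and the three-node example graph are assumptions.

```python
def intervene(nodes, directed, undirected, X):
    """Return (directed', undirected') after intervening on the set X."""
    X = set(X)
    # Line 1: delete every directed edge that points into X.
    new_directed = {(a, b) for (a, b) in directed if b not in X}

    adj = {v: set() for v in nodes}
    for e in undirected:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)

    # Lines 2-3: join A and B whenever an undirected path links them
    # through intermediate nodes that all lie in X.
    new_undirected = set(undirected)
    for a in set(nodes) - X:
        seen, frontier = set(), adj[a] & X
        while frontier:
            seen |= frontier
            frontier = {w for v in frontier for w in adj[v] & X} - seen
        for v in seen:
            for b in adj[v] - X - {a}:
                new_undirected.add(frozenset((a, b)))

    # Line 4: delete every undirected edge with an end in X.
    new_undirected = {e for e in new_undirected if not (e & X)}
    return new_directed, new_undirected


# Hypothetical graph: A -> X -> B with A - X and X - B; intervene on X.
d, u = intervene({"A", "B", "X"},
                 {("A", "X"), ("X", "B")},
                 {frozenset(("A", "X")), frozenset(("X", "B"))},
                 {"X"})
```

In this hypothetical example, the edge A - B appears after the intervention because the error node of X has been marginalized out of the correlation structure, whereas the edge X → B is kept.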
3 Causal effect identification in aADMGs
In this section, we present a novel sound algorithm for identifying arbitrary causal effects from aADMGs. The algorithm is based on a decomposition of f(v). We also show that the algorithm follows from a calculus of interventions.
3.1 Identification by decomposition
Lemma 1
The following two lemmas show how certain factors are related. They will be instrumental later.
Lemma 2
Lemma 3
Lemma 4
The following three lemmas can be proven in much the same way as Lemmas 1–3.
Lemma 5
Lemma 6
Lemma 7
Theorem 1
Given an aADMG G and two disjoint sets \(X, Y \subseteq V\), if the algorithm in Table 3 returns an expression for \(f(y | \widehat{x})\), then it is correct.
Example 1
We run the algorithm in Table 3 to identify \(f(v_2 | \widehat{v_1})\) from the aADMG in Fig. 1. Then, \(X = V_1\) and \(Y = V_2\). Thus, \(B = \{V_1, V_2\}\) and \(A = V_3\) in line 1, and \(Y_1 = \emptyset\) and \(Y_2=V_2\) in line 2. Then, \(S_1 = V_1\) and \(S_2 = V_2\) in line 3. Then, \(C = V_2\) in line 4 and, thus, \(C_1 = V_2\) in line 5. Note that \(C_1 \subseteq S_2\) and, thus, \(q(v_2 | v_3) = f(v_2 | v_1, v_3)\) by lines 6–9. Therefore, the algorithm returns \(\int f(v_2 | v_1, v_3) f(v_3) \,\mathrm{d}v_3\), which is the correct answer.
Table 3 Algorithm for causal effect identification from aADMGs
Input: An aADMG G and two disjoint sets \(X, Y \subseteq V\).
Output: An expression to compute \(f(y | \widehat{x})\) from f(v) or FAIL.
1  Let \(B = De_G(X)\) and \(A = V {\setminus } B\)
2  Let \(Y_1 = Y \cap A\) and \(Y_2 = Y \cap B\)
3  Let \(S_1, \ldots , S_k\) be a partition of B into components in \(G_B\)
4  Let \(C = An_{(G_B)^{B {\setminus } X}}(Y_2)\)
5  Let \(C_1, \ldots , C_l\) be a partition of C into components in \((G_B)^C\)
6  For each \(C_j\) such that \(C_j \subseteq S_i\) do
7      Compute \(q(s_i | a)\) by Lemma 5
8      If \(C_j\) is an ancestral set in \((G_B)^{S_i}\) then
9          Compute \(q(c_j | a)\) from \(q(s_i | a)\) by Lemma 6
10     Else return FAIL
11  Return \(\int [ \int \prod _j q(c_j | a) \,d(c {\setminus } y_2)] f(a) \,d(a {\setminus } y_1)\) by Lemma 7
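The purely graphical part of the algorithm above (the sets computed in lines 1–5 and the ancestrality check of line 8) can be sketched as follows. The sketch does not build the expressions of Lemmas 5–7. The graph used in the usage example, \(V_1 \rightarrow V_2\) with \(V_1 - V_3\) and \(V_3 - V_2\), is an assumption chosen to be consistent with the sets reported in Example 1. The function names and the encoding of a graph as a triple (nodes, directed edges, undirected edges) are also assumptions.

```python
def descendants(G, X):
    V, D, U = G
    found, frontier = set(X), set(X)
    while frontier:
        frontier = {b for (a, b) in D if a in frontier} - found
        found |= frontier
    return found

def ancestors(G, X):
    V, D, U = G
    return descendants((V, {(b, a) for (a, b) in D}, U), X)

def induced(G, W):
    """G_W: keep the edges of G whose both ends are in W."""
    V, D, U = G
    W = set(W)
    return W, {(a, b) for (a, b) in D if a in W and b in W}, {e for e in U if e <= W}

def marginal(G, W):
    """G^W: A - B is added if an undirected path links A and B outside W."""
    V, D, U = G
    W, und = set(W), set()
    for a in W:
        seen, stack = set(), [a]
        while stack:
            v = stack.pop()
            for e in U:
                if v in e:
                    (w,) = e - {v}
                    if w in W and w != a:
                        und.add(frozenset((a, w)))
                    elif w not in W and w not in seen:
                        seen.add(w)
                        stack.append(w)
    return W, {(a, b) for (a, b) in D if a in W and b in W}, und

def components(G):
    V, D, U = G
    comps, remaining = [], set(V)
    while remaining:
        comp = {remaining.pop()}
        frontier = set(comp)
        while frontier:
            frontier = {w for e in U if e & frontier for w in e} - comp
            comp |= frontier
        remaining -= comp
        comps.append(comp)
    return comps

def identification_sets(G, X, Y):
    """Return the sets of lines 1-5, or "FAIL" if the check of line 8 fails."""
    V = G[0]
    B = descendants(G, X)
    A = V - B                                        # line 1
    Y1, Y2 = Y & A, Y & B                            # line 2
    GB = induced(G, B)
    S = components(GB)                               # line 3
    C = ancestors(marginal(GB, B - X), Y2)           # line 4
    Cs = components(marginal(GB, C))                 # line 5
    for Cj in Cs:                                    # lines 6-10
        Si = next(s for s in S if Cj <= s)
        if ancestors(marginal(GB, Si), Cj) != Cj:
            return "FAIL"
    return {"A": A, "B": B, "Y1": Y1, "Y2": Y2, "S": S, "C": C, "Cs": Cs}


# Assumed graph, consistent with the sets of Example 1: V1 -> V2, V1 - V3, V3 - V2.
G = ({"V1", "V2", "V3"}, {("V1", "V2")},
     {frozenset(("V1", "V3")), frozenset(("V3", "V2"))})
result = identification_sets(G, {"V1"}, {"V2"})
```

For a graph with \(V_1 \rightarrow V_2\) and \(V_1 - V_2\) only, the same sketch returns FAIL for X = \(V_1\) and Y = \(V_2\), since \(V_2\) is not an ancestral set within its component.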
3.2 Identification by calculus
 Rule 1 (insertion/deletion of observations):
$$\begin{aligned} f(y | \widehat{x}, z, w) = f(y | \widehat{x}, w) \text { if } Y \!\perp \!_{G_{\overrightarrow{\underrightarrow{X}}}} Z | W, \end{aligned}$$
where \(G_{\overrightarrow{\underrightarrow{X}}}\) denotes the graph obtained from G by deleting all directed edges in and out of X.
 Rule 2 (intervention/observation exchange):
$$\begin{aligned} f(y | \widehat{x}, \widehat{z}, w) = f(y | \widehat{x}, z, w) \text { if } Y \!\perp \!_{G_{\overrightarrow{\underrightarrow{X}} \underrightarrow{Z}}} Z | W, \end{aligned}$$
where \(G_{\overrightarrow{\underrightarrow{X}} \underrightarrow{Z}}\) denotes the graph obtained from G by deleting all directed edges in and out of X and out of Z.
 Rule 3 (insertion/deletion of interventions):
$$\begin{aligned} f(y | \widehat{x}, \widehat{z}, w) = f(y | \widehat{x}, w) \text { if } Y \!\perp \!_{G_{\overrightarrow{\underrightarrow{X}} \overline{\overrightarrow{Z(W)}}}} Z | W, \end{aligned}$$
where Z(W) denotes the nodes in Z that are not ancestors of W in \(G_{\overrightarrow{\underrightarrow{X}}}\), and \(G_{\overrightarrow{\underrightarrow{X}} \overline{\overrightarrow{Z(W)}}}\) denotes the graph obtained from G by deleting all directed edges in and out of X and all undirected and directed edges into Z(W).
Lemma 8
Rule 1 follows from rules 2 and 3.
The following theorem summarizes the lemmas above.
Theorem 2
Given an aADMG G and two disjoint sets \(X, Y \subseteq V\), if the algorithm in Table 3 returns an expression for \(f(y | \widehat{x})\), then it is correct. Moreover, the expression can also be obtained by repeated application of rules 2 and 3.
4 Factorization property for discrete AMP CGs
In this section, we turn our attention to the use of aADMGs as a representation of independence models. As mentioned, Peña (2016) describes Markov properties for aADMGs but no factorization property. We present a first attempt to fill this gap by developing a factorization for the positive discrete probability distributions that are Markovian with respect to AMP CGs. As mentioned before, AMP CGs are a subclass of aADMGs with at most one edge between any pair of nodes and without semidirected cycles, i.e., \(A \rightarrow B \multimap \cdots \multimap A\) is forbidden, where \(\multimap\) stands for \(\rightarrow\) or −. We show that the factorization property is equivalent to the existing Markov properties for AMP CGs. We also present an algorithm to perform maximum likelihood estimation of the factors in the factorization. Therefore, unless otherwise stated, all the graphs and probability distributions in this section are defined over a finite set of discrete random variables V. To be consistent with previous works on AMP CGs, we need to modify our definition of the set \(De_G(X)\) in Sect. 2. In this section, the descendants of \(X \subseteq V\) are \(De_G(X) = \{A | B \multimap \cdots \multimap A\) is in G with \(B \in X\) or \(A \in X \}\). The nondescendants of X are \(Nd_G(X)=V {\setminus } De_G(X)\).
4.1 Markov properties

C1: \(C \!\perp \!_p Nd_G(C) {\setminus } Co_G(Pa_G(C)) | Co_G(Pa_G(C))\).

C2: \(p(c | co_G(Pa_G(C)))\) is Markovian with respect to \(G_C\).

C3\(^*\): \(D \!\perp \!_p Co_G(Pa_G(C)) {\setminus } Pa_G(D) | Pa_G(D)\) for all \(D \subseteq C\).

C1\(^*\): \(D \!\perp \!_p Nd_G(D) {\setminus } Pa_G(D) | Pa_G(D)\) for all \(D \subseteq C\).

C2\(^*\): \(p(c | pa_G(C))\) is Markovian with respect to \(G_C\).

L1: \(A \!\perp \!_p C {\setminus } (A \cup Ne_G(A)) | Nd_G(C) \cup Ne_G(A)\) for all \(A \in C\).

L2: \(A \!\perp \!_p Nd_G(C) {\setminus } Pa_G(A) | Pa_G(A)\) for all \(A \in C\).

L1: \(A \!\perp \!_p C {\setminus } (A \cup Ne_G(A)) | Nd_G(C) \cup Ne_G(A)\) for all \(A \in C\).

L2\(^*\): \(A \!\perp \!_p Nd_G(C) {\setminus } Pa_G(A \cup S) | S \cup Pa_G(A \cup S)\) for all \(A \in C\) and \(S \subseteq C {\setminus } A\).

P1: \(A \!\perp \!_p B | Nd_G(C) \cup C {\setminus } (A \cup B)\) for all \(A \in C\) and \(B \in C {\setminus } (A \cup Ne_G(A))\).

P2: \(A \!\perp \!_p B | Nd_G(C) {\setminus } B\) for all \(A \in C\) and \(B \in Nd_G(C) {\setminus } Pa_G(A)\).

P1: \(A \!\perp \!_p B | Nd_G(C) \cup C {\setminus } (A \cup B)\) for all \(A \in C\) and \(B \in C {\setminus } (A \cup Ne_G(A))\).

P2\(^*\): \(A \!\perp \!_p B | S \cup Nd_G(C) {\setminus } B\) for all \(A \in C\), \(S \subseteq C {\setminus } A\) and \(B \in Nd_G(C) {\setminus } Pa_G(A \cup S)\).
4.2 Factorization
Lemma 11
Remark 1
It is customary to think of the factors \(\psi _D(k, pa_G(K))\) in Eq. 10 as arbitrary positive functions, whose product needs to be normalized to result in a probability distribution. Note, however, that Eq. 10 does not include any normalization constant. The reason is that the so-called canonical parameterization in Eq. 20 in Appendix B permits us to write any positive probability distribution as a product of factors that does not need subsequent normalization. One might think that this must be an advantage. However, the cost of computing the normalization constant has merely been replaced by the cost of computing a large number of factors in Eq. 10. To see this, note that the size of \({\mathcal K}(G^D)\) is exponential in the size of the largest clique in \(G^D\).
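The exponential growth mentioned in the remark can be checked on small graphs by brute-force enumeration. The sketch below counts the non-empty complete sets of an undirected graph. Whether the empty set is counted as a complete set is not settled here, so excluding it is an assumption, as are the function name and the edge encoding.

```python
from itertools import combinations

def complete_sets(nodes, undirected):
    """Enumerate the non-empty sets of nodes that are pairwise adjacent."""
    ordered = sorted(nodes)
    return [set(s)
            for r in range(1, len(ordered) + 1)
            for s in combinations(ordered, r)
            if all(frozenset(pair) in undirected for pair in combinations(s, 2))]


# A clique on n nodes has 2**n - 1 non-empty complete sets.
clique_nodes = ["A", "B", "C", "D"]
clique_edges = {frozenset(pair) for pair in combinations(clique_nodes, 2)}
n_clique = len(complete_sets(clique_nodes, clique_edges))

# A path A - B - C has only its three nodes and two edges as complete sets.
path_edges = {frozenset(("A", "B")), frozenset(("B", "C"))}
n_path = len(complete_sets(["A", "B", "C"], path_edges))
```

Every subset of a clique is complete, so a single clique of size n already contributes \(2^n - 1\) non-empty complete sets, and hence that many factors.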
We can now introduce our necessary and sufficient factorization.
Theorem 3
Example 2
Remark 2
It follows from Theorem 3 and the proof of Lemma 11 that the positive probability distributions that are Markovian with respect to G can be parameterized by probabilities of the form \(p(b, \overline{b}^* | pa_G(B), \overline{pa_G(B)}^*)\) for all \(B \subseteq D\), \(D \subseteq C\), and \(C \in {\mathcal C}(G)\), where \(\overline{b}^*\) and \(\overline{pa_G(B)}^*\) denote the states that the variables in \(D {\setminus } B\) and \(Pa_G(D) {\setminus } Pa_G(B)\) take in \(d^*\) and \(pa_G(D)^*\), which are arbitrary but fixed states of D and \(Pa_G(D)\). Alternatively, we can parameterize the probability distributions by factors of the form \(\psi _D(k, pa_G(K))\) for all \(K \in {\mathcal K}(G^D)\), \(D \subseteq C\) and \(C \in {\mathcal C}(G)\). Note that these parameters may be variation dependent or even functionally related, due to Eq. 12. In Example 2, for instance, the parameters \(\psi _{ZW}(z,w,x,y)\) and \(\psi _Z(z,x)\) are functionally related: setting the values for the former determines the values for the latter. That is why we avoid using the term “parameter” in the rest of this section, as some reserve this term for variation independent parameters. Instead, we use the term “factor”. Although our factorization does not lead to a parameterization, it does bring some benefits: it induces a space-efficient representation of the distribution at hand, and allows time-efficient reasoning as well as data-efficient estimation of the distribution.
In some cases, the following necessary and sufficient factorization may be more convenient.
Theorem 4
Table 4 ICF algorithm for AMP CGs
Input: A sample from a positive probability distribution, and an AMP CG G.
Output: Estimates of the factors in the factorization induced by G.
1  For each \(C \in {\mathcal C}(G)\) do
2      Set \(\varphi _C(k, pa_G(K))\) to arbitrary values for all \(K \in {\mathcal Q}(G_C)\)
3      Repeat until convergence
4          For each \(K \in {\mathcal Q}(G_C)\) do
5              Solve a convex optimization problem to update \(\varphi _C(k, pa_G(K))\) holding the rest of the factors fixed
6  Return \(\varphi _C(k, pa_G(K))\) for all \(C \in {\mathcal C}(G)\) and \(K \in {\mathcal Q}(G_C)\)
4.3 Factor estimation
Example 3
Remark 3
As seen in Example 3, the factors corresponding to single-node components can be estimated in closed form. Therefore, instead of estimating the factors for the given AMP CG G, we may prefer to estimate the factors for an AMP CG \(G'\) that is Markov equivalent to G (i.e., it represents the same independence model) and has as many single-node components as possible. Luckily, \(G'\) can be obtained from G by repeatedly applying a so-called feasible split operation that, as the name suggests, splits a component of G in two (Sonntag and Peña 2015, Theorems 4 and 5).
5 Discussion
Table 5 IPF algorithm for initializing the ICF algorithm
Input: A sample from a positive probability distribution, and an AMP CG G.
Output: Initial estimates of the factors in the factorization induced by G.
1  For each \(C \in {\mathcal C}(G)\) do
2      Set \(\varphi _C(k, pa_G(K))\) to arbitrary values for all \(K \in {\mathcal Q}(G_C)\)
3      Repeat until convergence
4          Set \(\varphi _C(k, pa_G(K)) = \varphi _C(k, pa_G(K)) \frac{p_\mathrm{e}(k | pa_G(K))}{p(k | pa_G(K))}\) for all \(K \in {\mathcal Q}(G_C)\)
5  Return \(\varphi _C(k, pa_G(K))\) for all \(C \in {\mathcal C}(G)\) and \(K \in {\mathcal Q}(G_C)\)
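To illustrate the multiplicative update in line 4 of the table above, the following sketch runs it in the special case of a single component without parents, where the conditional probabilities reduce to clique marginals and the update coincides with classical IPF. The three binary variables, the chain A - B - C, the empirical distribution, and the brute-force computation of the model marginals are all assumptions made to keep the example self-contained. In the general case, the model probabilities require inference.

```python
from itertools import product

variables = ["A", "B", "C"]
cliques = [("A", "B"), ("B", "C")]            # cliques of the chain A - B - C
states = list(product([0, 1], repeat=len(variables)))

def marginal(joint, clique):
    """Marginal distribution of the variables in `clique` under `joint`."""
    idx = [variables.index(v) for v in clique]
    out = {}
    for s, p in joint.items():
        key = tuple(s[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def model_joint(factors):
    """Normalized product of the clique factors."""
    unnorm = {s: 1.0 for s in states}
    for clique, phi in factors.items():
        idx = [variables.index(v) for v in clique]
        for s in states:
            unnorm[s] *= phi[tuple(s[i] for i in idx)]
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

def ipf(empirical, sweeps=50):
    # Arbitrary initial values: all factors equal to one.
    factors = {c: {k: 1.0 for k in product([0, 1], repeat=len(c))} for c in cliques}
    for _ in range(sweeps):
        for c in cliques:
            p_e = marginal(empirical, c)                 # empirical p_e(k)
            p = marginal(model_joint(factors), c)        # current model p(k)
            for k in factors[c]:
                factors[c][k] *= p_e[k] / p[k]           # multiplicative update
    return factors


# Hypothetical positive empirical distribution over the eight joint states.
weights = {s: 1.0 + s[0] + 2 * s[1] + 3 * s[2] + s[0] * s[1] for s in states}
total = sum(weights.values())
empirical = {s: w / total for s, w in weights.items()}
fitted = model_joint(ipf(empirical))
```

After the updates, the clique marginals of the fitted distribution agree with the empirical ones, which is the fixed-point condition of the update.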
Table 6 Moralization for the AMP CG in the IPF algorithm
Input: An AMP CG G.
Output: The moral graph of G.
1  Set \(G^m=G\)
2  For each \(C \in {\mathcal C}(G)\) do
3      For each \(K \in {\mathcal Q}(G_C)\) do
4          For each \(X \in Pa_G(K)\) and \(Y \in K\) do
5              Add the edge \(X \rightarrow Y\) to \(G^m\)
6          For each \(X, Y \in Pa_G(K)\) with \(X \ne Y\) do
7              Add the edge \(X - Y\) to \(G^m\)
8  Make all the edges in \(G^m\) undirected
9  Return \(G^m\)
Note also that computing \(p(k | pa_G(K))\) in line 4 of Table 5 requires inference. Fortunately, this can efficiently be performed by adapting the algorithm for inference in Bayesian and Markov networks developed by Lauritzen and Spiegelhalter (1988), and upon which most other inference algorithms build. Actually, the only step of the algorithm that needs to be adapted is the moralization step. Table 6 shows how to moralize the AMP CG G into an undirected graph \(G^m\). Note that \(K \cup Pa_G(K)\) is a complete subset of \(G^m\). This guarantees that for every \(K \in {\mathcal Q}(G_C)\) with \(C \in {\mathcal C}(G)\), there will be some clique in the triangulation of \(G^m\) that contains the set \(K \cup Pa_G(K)\) and to which the factor \(\varphi _C(k, pa_G(K))\) can be assigned. This is important in the subsequent steps of the inference algorithm. We plan to implement the ICF algorithm with both initializations, i.e., arbitrary values and the IPF algorithm, and report experimental results in a follow-up paper.
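The moralization procedure of Table 6 can be sketched as follows. The cliques of each component are found by brute force, which is only reasonable for small graphs. The function names, the edge encoding, and the example AMP CG (X → Z, Y → W, Z - W) are assumptions made for illustration.

```python
from itertools import combinations

def components(nodes, undirected):
    """Maximal sets of nodes connected by undirected routes."""
    comps, remaining = [], set(nodes)
    while remaining:
        comp = {remaining.pop()}
        frontier = set(comp)
        while frontier:
            frontier = {w for e in undirected if e & frontier for w in e} - comp
            comp |= frontier
        remaining -= comp
        comps.append(comp)
    return comps

def cliques(comp, undirected):
    """Maximal complete sets of nodes within one component."""
    ordered = sorted(comp)
    complete = [set(s)
                for r in range(1, len(ordered) + 1)
                for s in combinations(ordered, r)
                if all(frozenset(p) in undirected for p in combinations(s, 2))]
    return [k for k in complete if not any(k < other for other in complete)]

def moralize(nodes, directed, undirected):
    """Return the edge set of the moral graph as 2-element frozensets."""
    # Lines 1 and 8: start from G and forget the direction of every edge.
    moral = set(undirected) | {frozenset(e) for e in directed}
    for comp in components(nodes, undirected):
        for K in cliques(comp, undirected):
            pa = {a for (a, b) in directed if b in K}
            # Lines 4-5: link every parent of K with every node of K.
            moral |= {frozenset((x, y)) for x in pa for y in K}
            # Lines 6-7: link every pair of distinct parents of K.
            moral |= {frozenset(p) for p in combinations(sorted(pa), 2)}
    return moral


# Hypothetical AMP CG: X -> Z, Y -> W, Z - W.
moral = moralize({"X", "Y", "Z", "W"},
                 {("X", "Z"), ("Y", "W")},
                 {frozenset(("Z", "W"))})
```

In this example, the only clique with parents is {Z, W}, whose parents are X and Y, so the moral graph is complete over the four nodes and \(K \cup Pa_G(K)\) is a complete set, as required.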
Footnotes
References
Andersson SA, Madigan D, Perlman MD (2001) Alternative Markov properties for chain graphs. Scand J Stat 28:33–85
Bühlmann P, Peters J, Ernest J (2014) CAM: causal additive models, high-dimensional order search and penalized regression. Ann Stat 42:2526–2556
Cox DR, Wermuth N (1996) Multivariate dependencies—models, analysis and interpretation. Chapman & Hall, London
Drton M (2008) Iterative conditional fitting for discrete chain graph models. In: Proceedings in computational statistics, pp 93–104
Drton M (2009) Discrete chain graph models. Bernoulli 15:736–753
Drton M, Eichler M (2006) Maximum likelihood estimation in Gaussian chain graph models under the alternative Markov property. Scand J Stat 33:247–257
Drton M, Richardson TS (2008) Binary models for marginal independence. J R Stat Soc B 70:287–309
Hoyer PO, Janzing D, Mooij J, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. Adv Neural Inf Process Syst 21:689–696
Huang Y, Valtorta M (2006) Pearl’s calculus of intervention is complete. In: Proceedings of the 22nd conference on uncertainty in artificial intelligence, pp 217–224
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. The MIT Press, USA
Koster JTA (2002) Marginalizing and conditioning in graphical models. Bernoulli 8:817–840
Lauritzen SL (1996) Graphical models. Oxford University Press, Oxford
Lauritzen SL, Spiegelhalter DJ (1988) Local computations with probabilities on graphical structures and their application to expert systems. J R Stat Soc B 50:157–224
Levitz M, Perlman MD, Madigan D (2001) Separation and completeness properties for AMP chain graph Markov models. Ann Stat 29:1751–1784
Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B (2016) Distinguishing cause from effect using observational data: methods and benchmarks. J Mach Learn Res 17:1–102
Peña JM (2016) Alternative Markov and causal properties for acyclic directed mixed graphs. In: Proceedings of the 32nd conference on uncertainty in artificial intelligence, pp 577–586
Peña JM, Bendtsen M (2017) Causal effect identification in acyclic directed mixed graphs and gated models. Int J Approx Reason 90:56–75
Pearl J (2009) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge
Peters J, Mooij JM, Janzing D, Schölkopf B (2014) Causal discovery with continuous additive noise models. J Mach Learn Res 15:2009–2053
Peters J, Janzing D, Schölkopf B (2017) Elements of causal inference: foundations and learning algorithms. MIT Press, USA
Richardson T (2003) Markov properties for acyclic directed mixed graphs. Scand J Stat 30:145–157
Richardson T, Spirtes P (2002) Ancestral graph Markov models. Ann Stat 30:962–1030
Sadeghi K, Lauritzen SL (2014) Markov properties for mixed graphs. Bernoulli 20:676–696
Shpitser I, Pearl J (2006) Identification of conditional interventional distributions. In: Proceedings of the 22nd conference on uncertainty in artificial intelligence, pp 437–444
Sonntag D, Peña JM (2015) Chain graph interpretations and their relations revisited. Int J Approx Reason 58:39–56
Studený M (2005) Probabilistic conditional independence structures. Springer, New York
Tian J, Pearl J (2002a) A general identification condition for causal effects. In: Proceedings of the 18th national conference on artificial intelligence, pp 567–573
Tian J, Pearl J (2002b) On the identification of causal effects. Technical report R-290-L, Department of Computer Science, University of California, Los Angeles
Wainwright MJ, Jordan MI (2008) Graphical models, exponential families, and variational inference. Found Trends Mach Learn 1:1–305
Zhang K, Hyvärinen A (2009) On the identifiability of the post-nonlinear causal model. In: Proceedings of the 25th conference on uncertainty in artificial intelligence, pp 647–655
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.