Mutual conditional independence and its applications to model selection in Markov networks

The fundamental concepts underlying Markov networks are conditional independence and the set of rules, called Markov properties, that translate conditional independence constraints into graphs. We introduce the concept of mutual conditional independence in an independent set of a Markov network, and we prove its equivalence to the Markov properties under certain regularity conditions. This extends the notion of similarity between separation in a graph and conditional independence in probability to similarity between mutual separation in a graph and mutual conditional independence in probability. Model selection in graphical models remains a challenging task due to the large search space. We show that the mutual conditional independence property can be exploited to reduce the search space. We present a new forward model selection algorithm for graphical log-linear models using mutual conditional independence. We illustrate our algorithm with a real data set example. We show that for sparse models the size of the search space can be reduced from O(n^3) to O(n^2) using our proposed forward selection method rather than the classical forward selection method. We also envision that this property can be leveraged for model selection and inference in different types of graphical models.


Introduction
A Markov network is a way of specifying conditional independence constraints between components of a multivariate distribution. Markov properties are the set of rules that determine how conditional independence constraints are translated into a graph (see [11] and [10]). The three Markov properties usually considered for Markov networks are the pairwise, local and global Markov properties. These Markov properties are equivalent to one another for positive probability distributions, see [12]. We introduce the concept of mutual conditional independence for an independent set of a Markov network. We derive an alternative formulation of the three Markov properties (local, pairwise and global) using mutual conditional independence. This alternative formulation is then used to prove the equivalence between the mutual conditional independence property (MCIP) and the Markov properties, under the assumption of a positive probability distribution. This extends the notion of similarity between separation in a graph and conditional independence in probability to similarity between mutual separation in a graph and mutual conditional independence in probability.
We study the applicability of MCIP for learning and inference in graphical models. Model selection in graphical models remains a difficult problem due to the large search space; we show that the mutual conditional independence property can be used to reduce this space. We propose a new forward model selection algorithm for graphical log-linear models that exploits a mutual conditional independence check to prune the search space, and we illustrate the algorithm with a real data set example. For some sparse graphs, we show that the size of the search space can be reduced drastically, from O(n^3) to O(n^2), using the proposed forward selection algorithm rather than the classical forward selection method. We also discuss that, in general, for any sparse and unknown underlying graph structure, the reduction in the size of the search space depends on its maximum-sized independent sets. Finally, we envision that the MCIP can be leveraged for learning and inference in different types of graphical models. Our contributions are summarized as follows:
• We introduce the concept of mutual conditional independence in an independent set of a Markov network, and we prove its equivalence to the Markov properties.
• We extend the concept of similarity between separation in a graph and conditional independence in probability to similarity between mutual separation in a graph and mutual conditional independence in probability.
• We propose a new forward model selection algorithm for graphical log-linear models, where mutual conditional independence is employed to reduce the search space.
• We show that, for sparse models, the size of the search space can be reduced from O(n^3) to O(n^2) using our proposed forward selection method rather than the classical forward selection method.
• We envision that this property can be leveraged for model selection and inference in different types of graphical models.
The discussion below is organized as follows. In Section 2, we start with a brief overview and the mathematical foundations of the theory of Markov networks and Graphical Log-Linear Models (GLLM). In Section 3, we prove that MCIP holds among the elements of an independent set. In Section 4, we derive the global Markov property from the MCIP and prove their equivalence. In Section 5, we discuss an application of mutual conditional independence relations to model selection for GLLMs. In Section 6, we give the computational details used for model selection. In Section 7, we conclude and discuss the future scope and applicability of MCIP. The proposed model selection algorithm is illustrated step by step with a real data set example in the Appendix.

An overview and mathematical foundations
A graphical model is a technique for representation of the conditional independences between variables in a multivariate probability distribution. The nodes or vertices in the graph correspond to random variables. The absence of an edge between two random variables denotes a conditional independence relation between them. In the literature, several classes of graphs with various conditional independence interpretations have been described. Undirected graphical models (Markov Networks) and directed acyclic graphs based graphical models (Bayesian Networks) are the most commonly known graphical models. In this article we only consider undirected graphical models, also known as Markov random fields or Markov networks. For details on the foundation of the theory of Markov networks, see [11,15,20] and [17]. We now briefly discuss the necessary background theory, present some existing results and introduce notations that will be used throughout the paper.

Graph theory
A graph is a pair G = (V, E), where V is the set of vertices and E is the set of edges. A graph is said to be undirected if its vertices are connected by undirected edges. We consider only simple graphs, which have neither self-loops nor multiple edges. An independent set of a graph G is a subset S ⊆ V such that no two nodes in S are adjacent. An independent set is said to be maximal if no node can be added to S without violating the independent set property. A clique of a graph G is a subset C of the vertices such that all vertices in C are mutually adjacent. A clique is said to be maximal if no vertex can be added to C without violating the clique property. Note that a graph is uniquely determined by either a complete list of its maximal independent sets or a complete list of its maximal cliques.
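The definitions above can be checked with a small sketch. The following illustration is ours, in Python (the paper's own implementation is in R); the helper names are not from the paper.

```python
# The graph is a dict mapping each vertex to its set of neighbours.
G = {1: {2}, 2: {1, 3}, 3: {2}}  # the path graph 1 - 2 - 3

def is_independent_set(G, S):
    """No two vertices of S are adjacent."""
    return all(v not in G[u] for u in S for v in S if u != v)

def is_clique(G, C):
    """All vertices of C are mutually adjacent."""
    return all(v in G[u] for u in C for v in C if u != v)

def is_maximal_independent_set(G, S):
    """S is independent and no further vertex can be added to it."""
    return (is_independent_set(G, S) and
            all(not is_independent_set(G, S | {v}) for v in set(G) - S))

print(is_independent_set(G, {1, 3}))          # True: 1 and 3 are non-adjacent
print(is_clique(G, {1, 2}))                   # True
print(is_maximal_independent_set(G, {1, 3}))  # True: adding 2 creates an edge
```

On the path 1-2-3, the maximal independent sets are {1, 3} and {2}, and the maximal cliques are {1, 2} and {2, 3}; either list determines the graph.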

Conditional independence
Suppose that X, Y and Z are random variables with joint distribution f. The random variables X and Y are said to be conditionally independent given the variable Z, denoted by X ⊥⊥ Y | Z, if f(x, y | z) = f(x | z) f(y | z) for all z with f(z) > 0. Next, we define some properties of conditional independence in terms of graphoid axioms. For an alternative set of the complete axioms, see [8].
A semi-graphoid is a dependency model which satisfies the first four properties listed above (symmetry, decomposition, weak union and contraction). If, in addition, the intersection property (7) holds, it is called a graphoid. In probability theory, conditional independence as defined in (2) is a semi-graphoid. If f is a positive probability distribution, then conditional independence is a graphoid. Graph separation in an undirected graph satisfies the graphoid axioms. For further details on graphoid axioms, see [11] and [6].
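The definition of conditional independence can be verified numerically on a toy example. The sketch below (ours, in Python; the distribution is made up) constructs a positive joint distribution in which X ⊥⊥ Y | Z holds by construction, and checks the defining identity f(x, y | z) = f(x | z) f(y | z) cell by cell.

```python
from itertools import product

# A toy positive joint distribution built so that X is independent of Y
# given Z: f(x, y, z) = f(z) f(x | z) f(y | z).
pz = {0: 0.4, 1: 0.6}
px_given_z = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.8, 1: 0.2}}
py_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}

f = {(x, y, z): pz[z] * px_given_z[z][x] * py_given_z[z][y]
     for x, y, z in product([0, 1], repeat=3)}

for z in [0, 1]:
    fz = sum(f[(x, y, z)] for x, y in product([0, 1], repeat=2))
    for x, y in product([0, 1], repeat=2):
        lhs = f[(x, y, z)] / fz                           # f(x, y | z)
        rhs = (sum(f[(x, yy, z)] for yy in [0, 1]) / fz) * \
              (sum(f[(xx, y, z)] for xx in [0, 1]) / fz)  # f(x|z) f(y|z)
        assert abs(lhs - rhs) < 1e-12
print("conditional independence X _||_ Y | Z verified")
```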

Markov properties of undirected graphs
We define the following three Markov properties for undirected graphs. Let G = (V, E) be an undirected graph and f be a probability distribution over G. The probability distribution f satisfies the pairwise Markov property (P) if, for every pair of non-adjacent vertices X and Y, X is independent of Y given the rest: X ⊥⊥ Y | V \ {X, Y}.
It satisfies the local Markov property (L) if every variable X is conditionally independent of its non-neighbours in the graph, given its neighbours: X ⊥⊥ V \ (bd(X) ∪ {X}) | bd(X), where bd(X) denotes the boundary, i.e., the neighbours of X. The global Markov property (G) is said to be satisfied if, for any disjoint subsets of nodes A, B, C such that C separates A and B in the graph, the distribution satisfies A ⊥⊥ B | C. For a probabilistic independence model that satisfies the graphoid axioms with respect to G, the three properties are equivalent: (G) ⇐⇒ (L) ⇐⇒ (P). For the proof, see [11,14] and [6].
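The separation condition in (G) is purely graph-theoretic: C separates A and B exactly when removing C disconnects A from B. A small sketch (ours, not code from the paper) makes this concrete with a breadth-first search.

```python
from collections import deque

def separates(G, A, B, C):
    """True if C separates A and B in G, i.e. every path from A to B
    passes through C.  G maps each vertex to its set of neighbours."""
    seen, queue = set(A), deque(A)
    while queue:                      # BFS from A in the graph minus C
        u = queue.popleft()
        for v in G[u]:
            if v not in C and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen.isdisjoint(B)

G = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}  # the path 1 - 2 - 3 - 4
print(separates(G, {1}, {4}, {2}))     # True: 2 blocks every 1-4 path
print(separates(G, {1}, {4}, set()))   # False: 1 and 4 are connected
```

On this path graph, (G) asserts, for example, X1 ⊥⊥ X4 | X2 and X1 ⊥⊥ X4 | X3.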

Markov network graphs and Markov network
A Markov network graph is an undirected graph G = (V, E) where V = {X_1, X_2, ..., X_n} corresponds to the random variables of a multivariate distribution. A Markov network is a tuple M = (G, ψ), where G is a Markov network graph, ψ = {ψ_1, ψ_2, ..., ψ_m} is a set of non-negative functions, one for each clique C_i ∈ G, i = 1, ..., m, and the joint probability density function (pdf) decomposes into factors as

f(x) = (1/Z) ∏_{i=1}^{m} ψ_i(x_{C_i}),

where Z is the normalizing constant.
Theorem 1 (Hammersley-Clifford) A positive probability distribution f satisfies the pairwise Markov property with respect to an undirected graph G if and only if it factorizes as f(x) = ∏_{a ∈ C_m} ψ_a(x), where C_m is the set of maximal cliques of G and each ψ_a(x) depends on x through x_a = (x_v)_{v∈a} only.
It follows from the above theorem that if a positive probability distribution factorizes with respect to G, then it also satisfies all Markov properties (pairwise, local and global) with respect to G, see [13].
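The factorization can be made concrete with a tiny example. The sketch below (ours, in Python; the potentials are arbitrary made-up positive numbers) builds a joint distribution over the path X1 - X2 - X3 from its two maximal-clique potentials and computes the normalizing constant Z by brute force.

```python
from itertools import product

# Clique factorization for the path X1 - X2 - X3 with binary variables:
# maximal cliques {X1, X2} and {X2, X3}, arbitrary positive potentials.
psi12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi23 = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 1.0}

unnorm = {(x1, x2, x3): psi12[(x1, x2)] * psi23[(x2, x3)]
          for x1, x2, x3 in product([0, 1], repeat=3)}
Z = sum(unnorm.values())                 # normalizing constant
f = {x: v / Z for x, v in unnorm.items()}

print(abs(sum(f.values()) - 1.0) < 1e-12)  # True: f is a distribution
# X1 and X3 are non-adjacent, so by the pairwise Markov property the
# resulting f must satisfy X1 _||_ X3 | X2 (checked below).
```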

Graphical log linear models
In this section, we briefly review graphical log-linear models for contingency tables. A contingency table is a table of counts that summarizes the relationship between factors, i.e., categorical variables (see [2]). In the context of contingency tables, a log-linear model (LLM) is a linear model in the log scale of the expected cell counts; it models the association and interaction among the factors of a contingency table. For illustration, we consider a three-dimensional table with factors X, Y and Z. Suppose the factors X, Y and Z have I, J and K levels respectively; then we have an I × J × K contingency table. Bracket notation such as [XZ] and [XYZ] denotes two-factor and three-factor interaction terms. We are particularly interested in the class of LLMs that can be represented by graphs, called graphical log-linear models (GLLMs). In a GLLM, the vertices correspond to the factors and the edges correspond to the two-factor interactions. A LLM is said to be graphical if, whenever the model contains all the lower-order terms that can be derived from the variables of a higher-order term, it also contains that higher-order interaction. In the previous example of a three-factor contingency table, if a model includes all the two-factor interactions [XY], [YZ] and [XZ], then it must also contain the three-factor interaction [XYZ]. We usually represent a graphical model as a set of maximal cliques, which is [XYZ] in this case. We note that there is a one-to-one correspondence between GLLMs and graphs (see [4]). For more details on GLLMs we refer to the books [3,4] and [1] and a review article [7].
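The graphicality criterion can be operationalized: a hierarchical LLM, given by its generating class, is graphical exactly when its generators coincide with the maximal cliques of the graph formed by its two-factor interactions. The following sketch (ours, in Python; naive clique enumeration, fine only for tiny examples) encodes this check.

```python
from itertools import combinations

def interaction_graph(generators):
    """Edge set generated by all two-factor interactions of the model."""
    edges = set()
    for g in generators:
        edges |= {frozenset(p) for p in combinations(g, 2)}
    return edges

def maximal_cliques(vertices, edges):
    """Naive maximal-clique enumeration by exhaustive search."""
    cliques = [set(c) for r in range(1, len(vertices) + 1)
               for c in combinations(sorted(vertices), r)
               if all(frozenset(p) in edges for p in combinations(c, 2))]
    return [c for c in cliques if not any(c < d for d in cliques)]

def is_graphical(generators):
    gens = [set(g) for g in generators]
    verts = set().union(*gens)
    cliques = maximal_cliques(verts, interaction_graph(gens))
    return sorted(map(sorted, cliques)) == sorted(map(sorted, gens))

print(is_graphical([["X", "Y"], ["Y", "Z"], ["X", "Z"]]))  # False
print(is_graphical([["X", "Y", "Z"]]))                     # True
```

The first model contains all three two-factor terms [XY], [YZ], [XZ] but not [XYZ], so it is hierarchical yet not graphical, exactly as in the example above.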

Mutual conditional independence in Markov networks
In this section, we prove that the elements of an independent set are mutually conditionally independent given the rest. Proof Without loss of generality, assume that the first k elements of X form an independent set I, i.e., let I = {X_1, X_2, ..., X_k} be an independent set of G. Since {X_1, X_2, ..., X_k} are mutually non-adjacent, when we condition on V \ I, or equivalently when we remove the nodes V \ I from G, the remaining vertices {X_1, X_2, ..., X_k} are disconnected, which implies, in probability, complete "conditional" independence among the vertices of I. We now make this argument precise.
Since {X_1, X_2, ..., X_k} form an independent set, they belong to separate cliques, say X_i ∈ C_i for i = 1, ..., k, where C_i is a maximal clique in G. Without loss of generality, we can assume that there are exactly k maximal cliques. By Theorem 1 (the Hammersley-Clifford theorem), the probability distribution f factorizes as

f(x) = (1/Z) ∏_{i=1}^{k} ψ_i(X_i, Y_i),

where Y_i is the set of nodes connecting two or more of the C_i's, and each {X_i, Y_i} forms a maximal clique in G (see Fig. 1). Note that Y_i can be empty in the case of a disconnected graph, and also ∪_i Y_i = V \ I.
Then the conditional probability f(I | V \ I) can be expressed as

f(I | V \ I) = ∏_{i=1}^{k} ψ_i(X_i, Y_i) / Σ_{x_1', ..., x_k'} ∏_{i=1}^{k} ψ_i(x_i', Y_i) = ∏_{i=1}^{k} φ_i(X_i),

where each φ_i is a function of the corresponding variable X_i only, since the Y_i's are fixed by the conditioning. Hence {X_1, X_2, ..., X_k} are mutually conditionally independent given V \ I.
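The theorem can be checked numerically. The sketch below (ours, in Python; the potentials are arbitrary made-up positive numbers) builds a star-shaped Markov network with centre Y and leaves X1, X2, X3, so that the leaves form an independent set, and verifies that f(x1, x2, x3 | y) = ∏ f(xi | y).

```python
from itertools import product

# Star network: maximal cliques {X1, Y}, {X2, Y}, {X3, Y}.
psi = [{(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0},
       {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0},
       {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0}]

unnorm = {(x1, x2, x3, y): psi[0][(x1, y)] * psi[1][(x2, y)] * psi[2][(x3, y)]
          for x1, x2, x3, y in product([0, 1], repeat=4)}
Z = sum(unnorm.values())
f = {k: v / Z for k, v in unnorm.items()}

def marg(fixed):
    """Sum f over all cells matching the fixed coordinates."""
    return sum(v for k, v in f.items()
               if all(k[i] == val for i, val in fixed.items()))

for y in [0, 1]:
    fy = marg({3: y})
    for x1, x2, x3 in product([0, 1], repeat=3):
        joint = f[(x1, x2, x3, y)] / fy              # f(x1, x2, x3 | y)
        prod_ = (marg({0: x1, 3: y}) / fy) * \
                (marg({1: x2, 3: y}) / fy) * \
                (marg({2: x3, 3: y}) / fy)           # prod of f(xi | y)
        assert abs(joint - prod_) < 1e-12
print("MCIP verified: X1, X2, X3 mutually independent given Y")
```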

Mutual conditional independence and the Markov properties
In this section, we present an alternative way to derive the conditional independence relations required for the Markov properties of Markov networks. More specifically, we prove the equivalence between the Mutual Conditional Independence Property (MCIP) and the pairwise Markov property; from (11) it then follows that MCIP is also equivalent to the local and global Markov properties.
First, we assume that the conditional independence relations required to satisfy the pairwise Markov property hold in f with respect to G. By Theorem 2, the elements of each maximal independent set C_i are then mutually conditionally independent given {V \ C_i}. It follows that the pairwise Markov property =⇒ MCIP.
We recall that mutual conditional independence implies pairwise conditional independence. Since the elements of a C_i are mutually conditionally independent given the set V \ C_i, they are also pairwise conditionally independent given V \ C_i. Now let {x, y} ⊆ V be a pair of nodes that is not an edge. Then {x, y} is an independent set and is thus contained in some maximal independent set C_i; hence x and y are pairwise conditionally independent given the rest. Therefore,

pairwise Markov property ⇐⇒ MCIP
We illustrate the proof by an example as follows. Let us consider the Markov network as given in Fig. 2.
Among the maximal independent sets of this graph is C_1 = {A, C, F}. First consider C_1 and suppose that the MCIP holds, so that A, C and F are mutually independent given the rest, {B, D, E, G}. Equivalently, the independence relation can be expressed as

A ⊥⊥ C ⊥⊥ F | {B, D, E, G}.

Applying the weak union graphoid axiom (5) to this relation, we get, for instance,

A ⊥⊥ C | {B, D, E, F, G}.

Applying similar arguments to the other maximal independent sets, it can be shown that x ⊥⊥ y | V \ {x, y} for every non-adjacent pair x, y ∈ V, which is exactly the definition of the pairwise Markov property. Conversely, suppose that the pairwise Markov property holds; we now show that MCIP also holds. From Theorem 1, it is clear that, under the positivity assumption, the probability distribution f satisfies the pairwise Markov property with respect to the graph G if and only if it factorizes over the maximal cliques of G.
Considering the pdf of (A, C, F) conditioned on (B, D, E, G), the conditional distribution factorizes into a product of three factors, one involving each of A, C and F.
From the above factorization of pdf, it follows that (A, C, F ) are mutually independent conditioned on (B, D, E, G).
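The factorization step can be written out generically as follows (our own sketch; φ_A, φ_C, φ_F denote the products of the clique potentials containing A, C and F respectively, which is possible because no clique contains more than one of them).

```latex
% With r = (B, D, E, G) held fixed:
f(a, c, f \mid r)
  = \frac{\phi_A(a; r)\,\phi_C(c; r)\,\phi_F(f; r)}
         {\sum_{a'}\phi_A(a'; r)\,\sum_{c'}\phi_C(c'; r)\,\sum_{f'}\phi_F(f'; r)}
  = f(a \mid r)\, f(c \mid r)\, f(f \mid r).
```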
Similarly, it can be shown that the mutual conditional independence relations hold for the remaining maximal independent sets. Hence it follows that MCIP ⇐⇒ pairwise Markov property.
Applying (11), we obtain the equivalence MCIP ⇐⇒ (P) ⇐⇒ (L) ⇐⇒ (G), which completes the proof.

Applications and illustrations
In this section, we develop a forward model selection algorithm for graphical log-linear models that exploits the mutual conditional independence property, and we illustrate the algorithm with an example. We also discuss how the proposed algorithm improves on the classical forward model selection algorithm in terms of a reduced search space (and hence reduced computational complexity). In particular, we show that the size of the search space explored by the proposed forward selection algorithm is O(n^2) for star graphs, compared with O(n^3) for the classical forward selection method.

Model selection using mutual conditional independence
Estimation of conditional independence structures is an important problem. In the context of GLLMs, the goal of model selection is to choose, from the class of graphical models under consideration, a smallest model that best fits the data, i.e., one with the fewest edges (interaction terms). Since computing the exact independence structure is intractable, greedy algorithms are a natural way to tackle this problem. Most existing model selection algorithms are based on forward selection, backward elimination, or a combination of both. For detailed discussions of model selection in the graphical log-linear case, we refer to [3][4][5][9] and [19]. These greedy algorithms mostly use the conditional independence property as the model selection criterion, and thus may still face a huge search space even for a modest number of variables. We propose a forward model selection algorithm based on the concept of MCIP, in which a mutual conditional independence check is used to reduce the search space of the local search algorithm.
In this approach, we start with the null model (the complete independence model), and our main goal is to find All the Maximal Independent Sets (AMIS) of the underlying graphical model from the data sample. We maintain two lists, tempAMIS and AMIS. The tempAMIS list contains the subsets of factors still to be tested for MCIP, and the AMIS list contains the MISs for which the data supports MCIP. At each step, we first move from tempAMIS to AMIS all the elements for which the data supports MCIP. Then we pick a largest set from tempAMIS, find the most significant edge between its elements, and add the required two-factor and higher-order terms to the current model. Next, we split all the sets in tempAMIS that contain both endpoints of the newly added edge. At this point tempAMIS may contain duplicates, or sets that are proper subsets of other members of tempAMIS or of members of AMIS. We eliminate the duplicates and make tempAMIS irreducible, so that no element of tempAMIS is a proper subset of another member of tempAMIS or of any member of AMIS. We repeat the process until tempAMIS is empty. If the union of all members of AMIS is not the exhaustive set of all nodes, we create a singleton set for each missing node and add these singleton sets to AMIS. Finally, the algorithm returns AMIS, which determines the graph structure uniquely.
For example, consider a five-factor contingency table with factors denoted 1, 2, 3, 4 and 5. The initial model assumption is the null model (all factors form one MIS), so tempAMIS = {{1, 2, 3, 4, 5}} and AMIS = ∅. Suppose that the complete independence model does not fit the data, and that the edge (1, 2) is the most significant edge; we add this edge to the complete independence model. The set {1, 2, 3, 4, 5} was assumed to be a MIS at the beginning, but after adding the edge (1, 2) it is no longer one; the candidate MISs are now tempAMIS = {{1, 3, 4, 5}, {2, 3, 4, 5}}. Next, suppose that the data supports the MCI condition for the set {2, 3, 4, 5}; we remove it from tempAMIS and add it to AMIS. As the next step, we look for the most significant edge within the set {1, 3, 4, 5}. It is important to note that the only node pairs to consider are (1, 3), (1, 4) and (1, 5): the remaining combinations, formed from the subset {3, 4, 5}, are contained in the set {2, 3, 4, 5}, which is assumed to be an independent set. Under that assumption, the node pairs (3, 4), (3, 5) and (4, 5) are also pairwise conditionally independent and are not considered as candidate edges. The procedure continues in this way until tempAMIS becomes empty. Finally, if the union of all members of AMIS is not the exhaustive set of all nodes {1, 2, 3, 4, 5}, we create singleton sets for each missing node and add them to the list AMIS.
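The bookkeeping step described above can be sketched in a few lines. The following is our own illustration in Python (function and variable names are ours, not from the paper's Algorithm 1): after an edge (u, v) is added, every set in tempAMIS containing both u and v is split, and the list is made irreducible with respect to tempAMIS and AMIS.

```python
def split_and_reduce(tempAMIS, AMIS, u, v):
    """Split sets containing both endpoints of the new edge (u, v),
    then drop duplicates and sets contained in other members."""
    splits = []
    for S in tempAMIS:
        splits += [S - {u}, S - {v}] if (u in S and v in S) else [S]
    splits = {frozenset(S) for S in splits}            # drop duplicates
    keep = [S for S in splits
            if not any(S < T for T in splits)          # irreducible in tempAMIS
            and not any(S <= T for T in AMIS)]         # not inside an AMIS member
    return [set(S) for S in keep]

# The five-factor walk-through from the text:
step1 = split_and_reduce([{1, 2, 3, 4, 5}], [], 1, 2)
print(sorted(map(sorted, step1)))   # [[1, 3, 4, 5], [2, 3, 4, 5]]

# After {2, 3, 4, 5} is confirmed as a MIS and edge (1, 3) is added:
step2 = split_and_reduce([{1, 3, 4, 5}], [{2, 3, 4, 5}], 1, 3)
print(sorted(map(sorted, step2)))   # [[1, 4, 5]]
```

In the second call, the split set {3, 4, 5} is discarded because it is contained in the already-confirmed MIS {2, 3, 4, 5}, exactly as in the walk-through.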

An illustration of MCIP based model selection algorithm
Since the proposed algorithm (Algorithm 1) is more notational than conceptual, we begin the development with an example. First, we define the test statistic used to compare models, where a smaller model is tested against a (larger) saturated model to see whether the smaller model is an inadequate explanation of the data. The likelihood ratio chi-square test statistic (G^2) for testing a model against the saturated model is defined as

G^2 = 2 Σ O log(O / E),   (12)

where O denotes the observed cell count, E denotes the expected cell count, and the sum runs over the cells of the table. Under the null hypothesis, G^2 is distributed as χ^2 with the appropriate degrees of freedom (see [4]). To test a model M_2 that is strictly larger than a model M_1, we first compute the saturated-model test statistic G^2 for each model, say G^2_1 and G^2_2 for M_1 and M_2 respectively. Then the test of the adequacy of the smaller log-linear model M_1 is obtained by subtraction of the saturated-model test statistics, G^2 = G^2_1 − G^2_2. The degrees of freedom for the chi-square test is the difference in the degrees of freedom of models M_1 and M_2. The advantage of using G^2 is that it simplifies the process of testing models against each other, see [4]. Note that we do not address the multiple testing problem here, and we refer to [18]; the topic is a separate research problem that is out of the scope of this work.
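The statistic in (12) is straightforward to compute. The sketch below (ours, with made-up counts, not data from the paper) evaluates G^2 for a table of four cells, using the convention that cells with O = 0 contribute zero.

```python
from math import log

def g_squared(observed, expected):
    """Likelihood-ratio statistic G^2 = 2 * sum(O * ln(O / E))."""
    return 2.0 * sum(o * log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

O = [30, 10, 20, 40]          # observed cell counts (made up)
E = [20, 20, 30, 30]          # fitted counts of some smaller model
print(round(g_squared(O, E), 3))  # 17.261

# Two nested models are then compared by the difference of their
# saturated-model statistics, G^2 = G^2_1 - G^2_2.
```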

Example 1 (Forward Model Selection for the Reinis Dataset)
We illustrate the proposed algorithm (Algorithm 1) using the Reinis dataset. For details about the Reinis dataset, see [16]. The Reinis data is shown in Table 1.
The results obtained in each iteration are summarized in Table 2, and each step is illustrated in more detail in the Appendix. Note that each iteration involves the following three steps:
1. Testing for MCI: we first check whether MCIP holds for each member of tempAMIS.
2. Addition of a new edge: we add the most significant edge, as long as its significance level is below a cut-off value.
3. Redundancy elimination: before moving to the next step, we make sure that the list tempAMIS is irreducible and contains no duplicates.

A comparison of the search spaces for forward selection algorithms
In this section, we compare the sizes of the search spaces explored by the classical forward selection algorithm (Chapter 6.1, [4]) and the proposed forward selection algorithm (Algorithm 1). Although the worst-case performance of the proposed algorithm is the same as that of the classical forward selection algorithm, we show that for some sparse models the size of the search space can be reduced from O(n^3) to O(n^2).
For illustration, we compare the total search spaces explored by the classical forward selection algorithm and the MCIP-based forward selection algorithm on a simple example. Consider a star-shaped graph, as given in Fig. 4, and suppose that the order of significance of the two-factor interactions is (1, 2), (1, 3), (1, 4) and (1, 5). Assuming the graph structure is not known, the classical forward selection algorithm fits all C(n, 2) candidate single-edge models in the first step, C(n, 2) − 1 in the second, and so on. This generalizes to any graph with n nodes and k edges: the effective size of the search space is approximately

Σ_{i=0}^{k−1} (C(n, 2) − i) = k C(n, 2) − k(k − 1)/2,

which is O(n^3) for the star graph, since k = n − 1. Now let us look into the steps and the number of intermediate fits required by the proposed forward selection method using the MCI test.
1. In the first step, the algorithm fits all C(5, 2) = 10 single-edge models and adds the most significant edge (1, 2); tempAMIS becomes {{1, 3, 4, 5}, {2, 3, 4, 5}}.
2. Before deciding on the next significant edge, the algorithm detects that the set {2, 3, 4, 5} is an independent set, and it is removed from the list tempAMIS. It then searches for the most significant edge in the set {1, 3, 4, 5}. Note that there are only three candidate edges, (1, 3), (1, 4) and (1, 5), since pairwise independence already holds for the remaining pairs of nodes (3, 4), (3, 5) and (4, 5). The edge (1, 3) is added.
3. Similarly, in the next step only the two candidate edges (1, 4) and (1, 5) are fitted, and (1, 4) is added.
4. Finally, the single remaining candidate edge (1, 5) is fitted and added, after which tempAMIS is empty.
5. In total, the algorithm fits C(5, 2) + 3 + 2 + 1 = 16 models to decide on the final star-shaped graph.
This can be generalized to any star graph with n nodes: the effective size of the search space explored by the proposed algorithm is

C(n, 2) + Σ_{i=1}^{n−2} i = C(n, 2) + C(n − 1, 2),

which is O(n^2).
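The two counts can be tabulated directly. The sketch below (ours, in Python) reproduces the 16 model fits of the five-node star example and contrasts them with the classical count.

```python
from math import comb

def classical(n):
    """Approximate number of model fits by classical forward selection
    on the star graph: C(n,2) - i fits at step i for k = n - 1 edges."""
    k = n - 1
    return sum(comb(n, 2) - i for i in range(k))

def mcip_based(n):
    """Fits by the MCIP-based method on the star graph:
    C(n,2) in the first step, then n-2, n-3, ..., 1."""
    return comb(n, 2) + comb(n - 1, 2)

print(classical(5), mcip_based(5))  # 34 16 for the five-node star
```

The gap widens with n: classical(n) grows cubically while mcip_based(n) grows quadratically.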
In general, for any sparse graph in which the size of the maximum independent set(s) is proportional to n, the search space can be reduced by our proposed algorithm irrespective of the underlying graph structure. For example, for the sparse graphs given in Fig. 5, it is easy to show that the search space is reduced from O(n^3) to O(n^2). The iterations for Sparse Graph-2 are summarized in Table 3. Also, for Sparse Graph-3 in (b) of Fig. 5, the maximum-sized independent set is {4, 5, 6, 7, 8, 9}, which can be discovered after 3-4 iterations, and thus the search space is reduced from cubic to quadratic order in the number of nodes. In Table 3, Iter denotes the iteration number, Edge the edge added in each iteration, currModel the present model, tempAMIS the list of sets to be tested for mutual conditional independence, AMIS the list of MISs, and # of candidate edges the search space in terms of candidate edges to explore in each iteration. All the experimental results in this paper were obtained using R 3.4.0, with the packages gRim and MASS. All packages used are available at http://CRAN.R-project.org. We implemented the new forward selection algorithm in R; the code is available at github: https://github.com/niharikag/gMCI.

Conclusion
The notions of conditional independence and Markov properties are fundamental for graphical models. We have discussed different Markov properties for the class of Markov networks and have introduced the concept of mutual conditional independence in an independent set of a Markov network. In particular, the similarity between separation in a graph and conditional independence in probability has been extended to similarity between mutual separation in a graph and mutual conditional independence in probability. We have developed a new forward model selection algorithm for log-linear models, where mutual conditional independence is used to reduce the search space. Specifically, for some sparse graphs the reduction of the search space from O(n^3) to O(n^2) has been shown. It remains an interesting problem for future research to determine, for an arbitrary sparse graph, how much the size of the search space can be reduced by the proposed forward selection algorithm compared to the classical forward selection algorithm.
We conclude with the remark that, in the literature, the dominant perspectives on Markov networks are the Markov properties and clique factorization. MCIP enables a new perspective on the Markov properties and on the factorization of the corresponding probability distributions. This new perspective could be useful for understanding the structures and problems of Markov random fields, and the MCIP can be a promising new direction for model selection and inference in Markov networks.

Appendix: Step-by-step illustration of Example 1

We start with the model of complete independence. For details on the derivation of the expected cell counts under this model, we refer to the book [3]. Having computed the (estimates of the) expected cell counts, the G^2 statistic for this model is computed using (12), and we get G^2 = 843.957 (df: 57, p-value: 0). Since the observed value of G^2 is highly significant, the model of complete independence is rejected, and the data structure is initialized as

tempAMIS = {{A, B, C, D, E, F}}, AMIS = ∅
As mentioned before, at each step we add the most significant edge as long as its significance level is below a cut-off value (we use α = 0.05). As a first step, we compare all the models with a single edge added to the model of complete independence, using the G^2 statistic (12) for comparing models. Table 4 gives the model fits and Table 5 summarizes the test results.
The model with the edge (B, C) has the largest difference in G^2 (equivalently, the smallest p-value), so we choose this model as the current model. Also, the set containing the factors B and C gets split as follows.

tempAMIS = {{A, B, D, E, F}, {A, C, D, E, F}}, AMIS = ∅
Before moving to the next step we make sure that the list tempAMIS is irreducible and contains no duplicate.
As a next step, we first check whether MCIP holds for the members of tempAMIS. The MCI tests for the elements of tempAMIS, {A, B, D, E, F} and {A, C, D, E, F}, give the G^2 values summarized in the tables: Table 8 gives the model fits and Table 9 summarizes the test results. The term (A, C) is added to the current model, and the modified data structure is given below.