Improving probabilistic inference in graphical models with determinism and cycles
Abstract
Many important real-world applications of machine learning, statistical physics, constraint programming and information theory can be formulated using graphical models that involve determinism and cycles. Accurate and efficient inference and training of such graphical models remains a key challenge. Markov logic networks (MLNs) have recently emerged as a popular framework for expressing a number of problems which exhibit these properties. While loopy belief propagation (LBP) can be an effective solution in some cases, when both determinism and cycles are present it frequently fails to converge or converges to inaccurate results. As such, sampling based algorithms have been found to be more effective and are more popular for general inference tasks in MLNs. In this paper, we introduce Generalized arc-consistency Expectation Maximization Message-Passing (GEM-MP), a novel message-passing approach to inference in an extended factor graph that combines constraint programming techniques with variational methods. We focus our experiments on Markov logic and Ising models but the method is applicable to graphical models in general. In contrast to LBP, GEM-MP formulates the message-passing structure as steps of variational expectation maximization. Moreover, in the algorithm we leverage the local structures in the factor graph by using generalized arc consistency when performing a variational mean-field approximation. Thus each such update increases a lower bound on the model evidence. Our experiments on Ising grids, entity resolution and link prediction problems demonstrate the accuracy and convergence of GEM-MP over existing state-of-the-art inference algorithms such as MC-SAT, LBP, and Gibbs sampling, as well as convergent message passing algorithms such as the concave–convex procedure, residual BP, and the L2-convex method.
Keywords
Markov logic, Message passing, Constraint propagation, Statistical relational learning, Expectation maximization
1 Introduction
Graphical models that involve cycles and determinism are applicable to a growing number of applications in different research communities, including machine learning, statistical physics, constraint programming, information theory, bioinformatics, and other sub-disciplines of artificial intelligence. Accurate and efficient inference within such graphical models is thus an important issue that impacts a wide number of communities. Inspired by the substantial impact of statistical relational learning (SRL) (Getoor and Taskar 2007), Markov logic (Richardson and Domingos 2006; Singla 2012) is a powerful formalism for graphical models that has made significant progress towards the goal of combining the powers of both first-order logic (Flach 2010) and probability. However, probabilistic inference represents a major bottleneck and can be problematic for learning when using it as a subroutine.
Loopy belief propagation (LBP) is a commonly used message-passing algorithm for performing approximate inference in graphical models in general, including models instantiated by an underlying Markov Logic. However, LBP often exhibits erratic behavior in practice. In particular, it is still not well understood when LBP will provide good approximations in the presence of cycles and when models possess both probabilistic and deterministic dependencies. Therefore, the development of more accurate and stable message passing based inference methods is of great theoretical and practical interest. Perhaps surprisingly, belief propagation achieves good results for coding theory problems with loopy graphs (Mceliece et al. 1998; Frey and MacKay 1998). In other applications, however, LBP often leads to convergence problems. In general LBP therefore has the following limitation:
Limitation 1. In the presence of cycles, LBP is not guaranteed to converge.
It is known that the local optima of the Bethe free energy correspond to fixed points of LBP, and it has been proven that violating the uniqueness condition for the Bethe free energy generates several local minima (i.e., fixed points) in the space of LBP’s marginal distributions (Heskes 2004; Yedidia et al. 2005). From a variational perspective, it is known that if a factor graph has more than one cycle, then the convexity of the Bethe free energy is violated. A graph involving a single cycle has a unique local minimum and usually guarantees the convergence of LBP (Heskes 2004). From the viewpoint of a local search, LBP performs a gradient-descent/ascent search over the marginal space, endeavoring to converge to a local optimum (Heskes 2002). Heskes’ viewpoint is that the problem of non-convergence is related to the fact that LBP updates the unnormalized marginal of each variable by computing a coarse geometric average of the incoming messages received from its neighboring factors (Heskes 2002). Under this line of analysis, LBP can make large moves in the space of the marginals and therefore becomes more likely to overshoot the nearest local optimum. This produces an orbiting effect and increases the possibility of non-convergence. Other lines of analysis are based on the fact that messages in LBP may circulate around the cycles, which can lead to local evidence being counted multiple times (Pearl 1988). This, in turn, can aggravate the possibility of non-convergence. In practice, non-convergence occasionally appears as oscillatory behavior when updating the marginals (Koller and Friedman 2009).
Determinism plays a substantial role in reducing the effectiveness of LBP (Heskes 2004). For example, hard clauses in a Markov logic lead to deterministic dependencies in the corresponding factor graphs for groundings and therefore are particularly challenging for inference with LBP. It has been observed empirically that carrying out LBP on cyclic graphical models with determinism is more likely to result in a two-fold problem of non-convergence or incorrectness of the results (Mooij and Kappen 2005; Koller and Friedman 2009; Potetz 2007; Yedidia et al. 2005; Roosta et al. 2008). A second limitation of LBP could thus be formulated as:
Limitation 2. In the presence of determinism (a.k.a. hard clauses), LBP may deteriorate to inaccurate results.
In its basic form LBP also does not leverage the local structures of factors, handling them as black boxes. Using Markov logic as a concrete example, LBP often does not take into consideration the logical structures of the underlying clauses that define factors (Gogate and Domingos 2011). Thus, if some of these clauses are deterministic (e.g., hard clauses) or have extreme skewed probabilities, then LBP will be unable to reconcile the clauses. This, in turn, impedes the smoothing out of differences between the messages. The problem is particularly acute for those messages that pass through hard clauses which fall inside dense cycles. This can drastically elevate oscillations, making it difficult to converge to accurate results, and leading to the instability of the algorithm with respect to finding a local minimum (see pages 413–429 of Koller and Friedman 2009, for more details). On the flip side of this issue Koller and Friedman point out that one can prove that if the factors in a graph are less extreme—such that the skew of the network is sufficiently bounded—it can give rise to a contraction property that guarantees convergence (Koller and Friedman 2009). In our work here we are interested in taking advantage of determinism when it exists in the factors of an underlying graph in a way that does not increase the threat of non-convergence.
The literature available on LBP—which is perhaps the most widely used form of message-passing based inference—is heavily influenced by ideas from machine learning (ML) and constraint satisfaction (CS) among others. Although LBP has been scrutinized both theoretically and practically in various ways, most of the existing research either avoids the limitation of determinism when handling cycles, or does not take into consideration the limitation of cycles when handling determinism.
It is well known that techniques such as the junction tree algorithm (Lauritzen and Spiegelhalter 1988) are able to transform a graphical model into larger clusters of variables such that the clusters satisfy the running intersection property and that such a structure can then be used to obtain exact inference results. Such results also hold when the underlying graphical models possess deterministic dependencies. For many problems however, the tree width of the resulting junction tree may be so large that inference becomes intractable. More recent work has explored the interesting question of how to construct thin junction trees (Bach and Jordan 2001). However, many graphical models derived from a Markov logic or problems with complex constraints quickly lead to trees with large tree widths.
Table 1 An excerpt of Markov logic for the Cora dataset. The atoms SameBib and SameAuthor are unknown. Ar() is an abbreviation for atom Author(), SAr() for SameAuthor(), and SBib() for SameBib(). \(a_1,a_2\) denote authors and \(r_1,r_2,r_3\) denote citations.
Rule | First-order logic | Clausal form | W |
---|---|---|---|
Regularity | \(\forall a_1,a_2, \forall r_1,r_2, \, \text {Ar}(r_1,a_1) \wedge \text {Ar}(r_2,a_2) \wedge \text {SAr}(a_1,a_2) \Rightarrow \text {SBib}(r_1,r_2)\) | \(\lnot \text {Ar}(r_1,a_1) \vee \lnot \text {Ar}(r_2,a_2) \vee \lnot \text {SAr}(a_1,a_2) \vee \text {SBib}(r_1,r_2)\) | 1.1 |
Transitivity | \(\forall r_1,r_2,r_3 \, \text {SBib}(r_1,r_2) \wedge \text {SBib}(r_2,r_3) \Rightarrow \text {SBib}(r_1,r_3)\) | \(\lnot \text {SBib}(r_1,r_2) \vee \lnot \text {SBib}(r_2,r_3) \vee \text {SBib}(r_1,r_3)\) | \(\infty \) |
We have organized the rest of the paper in the following manner. In Sect. 2, we review some key concepts in further detail, including Markov logic, LBP, constraint propagation techniques, variational bounds, expectation maximization (EM) and KL divergences. In Sect. 3, we present the framework of GEM-MP variational inference. In Sect. 4, we derive GEM-MP’s general update rule for Markov logic. In Sect. 5, we generalize GEM-MP’s update rules to be applicable to MRFs. In Sect. 6, we conduct a thorough experimental study. This is followed by a discussion in Sect. 7. In Sect. 8, we examine related work. Finally, in Sect. 9, we present our conclusions and discuss directions for future research. The “Appendix” contains the proofs of all propositions used in the paper.
2 Preliminaries
To set the stage for our work, in this section we provide a more detailed discussion of: Markov logic; belief propagation; constraint satisfaction problems, constraint propagation and generalized arc consistency; and variational methods. We begin by reviewing Markov logic using a concrete explanatory example presented in Table 1. This example is an excerpt of the knowledge base for the Cora dataset. That is, suppose that we are given a citation database in which each citation has author, title, and venue fields. We need to know which pairs of citations refer to the same citation and the same authors (i.e., both the SameBib and SameAuthor relations are unknown). For simplicity, our objective will be to predict the SameBib ground atoms’ marginals. At this point, let us first introduce our basic notation.
Notation A first-order knowledge base (KB) is a set of formulas in first-order logic. Traditionally, as shown in Table 1, it is convenient to convert formulas to clausal form (CNF). After propositional grounding, we get a formula \({\mathcal {F}}\), which is a conjunction of m ground clauses. We use \(f \in {\mathcal {F}}\) to denote a ground clause which is a disjunction of literals built from \({\mathcal {X}}\), where \({\mathcal {X}} = \left\{ X_1, X_2, \ldots , X_n\right\} \) is a set of n Boolean random variables representing ground atoms. The set \({\mathcal {X}}_{f}\) corresponds to the variables appearing in the scope of a ground clause f. Both “\(+\)” and “−” will be used to denote the positive (true) and negative (false) appearance of the ground atoms. We use \(Y_i\) as a subset of satisfying (or valid) entries of ground clause \(f_i\), and \(y_k \in Y_i, \, k \in \left\{ 1,.., |Y_i| \right\} \) denotes each valid entry in \(Y_i\), where the local entry of a factor is valid if it has non-zero probability. We use \(f^s_i\) (resp. \(f^h_i\)) to indicate that the clause \(f_i\) is soft (resp. hard); the soft and the hard clauses are included in the two sets \({\mathcal {F}}^s\) and \({\mathcal {F}}^h\) respectively. The sets \({\mathcal {F}}_{X_j+}\) and \({\mathcal {F}}_{X_j-}\) include the clauses that contain positive and negative literals for ground atom \(X_j\), respectively. Thus \({\mathcal {F}}_{X_j}={\mathcal {F}}_{X_j+} \cup {\mathcal {F}}_{X_j-}\) denotes the whole of \(X_j\)’s clauses, and its cardinality as \(\left| {\mathcal {F}}_{X_j}\right| \). For each ground atom \(X_j\), we use \(\beta _{X_j} = \left[ \beta ^+_{X_j},\beta ^-_{X_j}\right] \) to denote its positive and negative marginal probabilities, respectively.
Markov logic (Richardson and Domingos 2006) is a set of first-order logic formulas (or CNF clauses), each of which is associated with a numerical weight w. Larger weights w reflect stronger dependencies, and thereby deterministic dependencies have the largest weight (\(w \rightarrow \infty \)), in the sense that they must be satisfied. We say that a clause has deterministic dependency if at least one of its entries has zero probability.
To understand the semantics of Markov logic, recall the explanatory example in Table 1. In this example, Markov logic enables us to model the KB by using rules such as the following: 1. Regularity rules of the type that say “if the authors are the same, then their records are the same.” This rule is helpful but innately uncertain (i.e., it is not true in all cases). Markov logic considers this rule as soft and attaches it to a weight (say, 1.1); 2. Transitivity rules that state “If one citation is identical to two other citations, then these two other citations are identical too.” These types of rules are important for handling non-unique names of citations. Therefore, we suppose that Markov logic considers these rules as hard and assigns them an infinite weight.^{1}
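These semantics can be made concrete with a minimal sketch. It assumes the standard Markov logic convention in which the unnormalized weight of a possible world is the exponential of the summed weights of its satisfied soft ground clauses, and in which a world violating any hard clause receives weight zero. The constants (C1, C2, C3, A1, A2), the single grounding per rule, and all function names are illustrative choices, not taken from the paper.

```python
import math
from itertools import product

# Ground atoms for one grounding of each rule in Table 1 (constant names are illustrative).
ATOMS = ["Ar(C1,A1)", "Ar(C2,A2)", "SAr(A1,A2)",
         "SBib(C1,C2)", "SBib(C2,C3)", "SBib(C1,C3)"]

# A ground clause is a list of (atom, sign) literals; sign=False marks a negated atom.
REGULARITY = [("Ar(C1,A1)", False), ("Ar(C2,A2)", False),
              ("SAr(A1,A2)", False), ("SBib(C1,C2)", True)]
TRANSITIVITY = [("SBib(C1,C2)", False), ("SBib(C2,C3)", False),
                ("SBib(C1,C3)", True)]

SOFT = [(REGULARITY, 1.1)]   # soft clause with its weight
HARD = [TRANSITIVITY]        # hard clause, treated as having infinite weight


def satisfied(clause, world):
    """A disjunction holds if at least one literal matches the world."""
    return any(world[atom] == sign for atom, sign in clause)


def world_weight(world):
    """Unnormalized weight: 0 if a hard clause is violated,
    otherwise exp of the summed weights of satisfied soft clauses."""
    if not all(satisfied(c, world) for c in HARD):
        return 0.0
    return math.exp(sum(w for c, w in SOFT if satisfied(c, world)))


worlds = [dict(zip(ATOMS, vals))
          for vals in product([True, False], repeat=len(ATOMS))]
Z = sum(world_weight(w) for w in worlds)

# Marginal of the query atom SBib(C1,C3) by brute-force enumeration.
p_sbib13 = sum(world_weight(w) for w in worlds if w["SBib(C1,C3)"]) / Z
```

Brute-force enumeration is only feasible for such toy groundings; it is used here purely to show how soft and hard clauses shape the distribution.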
Now consider the atoms that we are interested in as a query [SBib(\(C_1,C_1\)), SBib(\(C_2,C_1\)), SBib(\(C_1,C_2\)), and SBib(\(C_2,C_2\))] on the factor graph represented in Fig. 1. Remarkably, these query atoms are involved in many cycles. This emphasizes, at least theoretically, the existence of more than one fixed point (or local optimum) which raises the threat of non-convergence (Limitation 1). In addition, six of these cycles (i.e., those represented with dashed orange lines) such as SBib(\(C_1,C_1\))—\(f_5\)—SBib(\(C_2,C_1\))—\(f_4\)— SBib(\(C_1,C_2\)) have no evidences (i.e., all the atoms in the cycles are queries). Therefore, the double counting problem is expected to happen (Limitation 1). Moreover the six cycles contain only hard clauses, which hinders the process of smoothing out the messages to converge to accurate results (Limitation 2).
Constraint propagation A Constraint Satisfaction Problem (Rossi et al. 2006) is a triple \(\big <{\mathcal {X}}, {\mathcal {D}}, {\mathcal {C}}\big>\) where \({\mathcal {X}}\) is an n-tuple of variables \({\mathcal {X}}=\big <X_1,\ldots ,X_n\big>\), \({\mathcal {D}}\) is a corresponding n-tuple of domains \({\mathcal {D}}=\big <D_1,\ldots ,D_n\big>\) such that \(X_j \in D_j\), and \({\mathcal {C}}\) is an m-tuple of constraints \({\mathcal {C}}=\big <c_1,\ldots ,c_m\big>\). A constraint \(c_i\) is a pair \(\big <{\mathcal {R}}_{{\mathcal {X}}_{c_i}},{\mathcal {X}}_{c_i}\big>\) where \({\mathcal {R}}_{{\mathcal {X}}_{c_i}}\) is a relation on the variables \({\mathcal {X}}_{c_i}=\text {scope}(c_i)\). A solution to the CSP is a complete assignment (or a possible world) \(s=\big <v_1,\ldots ,v_n\big>\) where \(v_j \in D_j\) and each \(c_i \in {\mathcal {C}}\) is satisfied in that \({\mathcal {R}}_{{\mathcal {X}}_{c_i}}\) holds on the projection of s onto the scope \({\mathcal {X}}_{c_i}\). S denotes the set of solutions to the CSP. Constraint propagation (Rossi et al. 2006) is the process of removing inconsistent values in the domains that violate some constraint in \({\mathcal {C}}\). One form of constraint propagation is to apply generalized arc consistency for each constraint \(c \in {\mathcal {C}}\) until a fixed point is reached.
Definition 1
(Generalized arc consistency (GAC)) Given a constraint \(c \in {\mathcal {C}}\) which is defined over the subset of variables \({\mathcal {X}}_{c}\), it is generalized arc consistent (GAC) iff for each variable \(X_j \in {\mathcal {X}}_{c}\) and for each value \(d \in {\mathcal {D}}_{X_j}\) in its domain, there exists a value \(d_k \in {\mathcal {D}}_{X_k}\) for each variable \(X_k \in {\mathcal {X}}_{c} {\setminus } \left\{ X_j\right\} \) that constitutes at least one valid tuple (or valid local entry) that satisfies c.
We can extend this CSP formalism to Weighted CSPs (Rossi et al. 2006) to include soft constraints. This too requires extending GAC to soft generalized arc consistency (soft GAC) to tackle the soft constraints (van Hoeve et al. 2006). At a high level, one can view GAC (or soft GAC) as a function that takes any variable \(X_j \in {\mathcal {X}}\) and returns all other consistent variables’ values that support the values of \(X_j\) with respect to the constraints \(c \in {\mathcal {C}}\). For instance, in our example of Cora in Fig. 1, applying GAC to the hard constraint (or clause) \(f_{6}: \lnot \text {SBib}(C_1,C_1) \vee \lnot \text {SBib}(C_1,C_2)\) with respect to ground atom assignment \(\text {SBib}(C_1,C_1) = true\) implies maintaining only the truth value “false” in the domain of \(\text {SBib}(C_1,C_2)\). This is because the only valid local entry of \(f_{6}\) that supports \(\text {SBib}(C_1,C_1) = true\) is \(\left\{ (true, false)\right\} \).
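The \(f_6\) example above can be reproduced with a minimal sketch of a GAC revision step for a single clause. It assumes Boolean domains and represents a clause by the signs of its literals; the helper names `clause_valid_entries` and `gac_revise` are hypothetical and only illustrate the filtering idea, not the soft GAC or probabilistic variants.

```python
from itertools import product


def clause_valid_entries(signs):
    """Enumerate the truth assignments that satisfy a disjunction.
    signs[i] is True for a positive literal and False for a negated one."""
    n = len(signs)
    return [entry for entry in product([True, False], repeat=n)
            if any(entry[i] == signs[i] for i in range(n))]


def gac_revise(valid_entries, domains):
    """Keep, for every variable, only the values that occur in at least one
    valid entry whose values all lie inside the current domains."""
    supported = [entry for entry in valid_entries
                 if all(entry[i] in domains[i] for i in range(len(domains)))]
    return [{entry[i] for entry in supported} for i in range(len(domains))]


# f6: not SBib(C1,C1) or not SBib(C1,C2)
f6_entries = clause_valid_entries([False, False])

# SBib(C1,C1) is fixed to true; SBib(C1,C2) is still unconstrained.
domains = [{True}, {True, False}]
revised = gac_revise(f6_entries, domains)
# revised[1] now contains only False, matching the single supporting
# entry (true, false) described in the text.
```

For a single constraint one revision pass already reaches a fixed point; with several constraints the revision would be repeated over all of them until no domain changes.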
We can also apply GAC in a probabilistic form. For instance, probabilistic arc consistency (pAC) (Horsch and Havens 2000) performs BP in the form of arc consistency to compute the relative frequency of a variable taking on a particular value in all solutions for binary CSPs (see Horsch and Havens 2000, for more details). pAC can be summarized as follows. We start by initializing all variables to have uniform distributions. At each step, each variable stores its previous solution probability distribution, then incoming messages from neighbouring variables are processed, and the results are maintained locally so that there is no need to send messages to all neighbours when no changes are made in the distribution. The new distribution is approximated by multiplying all information maintained from the most recent messages received from all neighbours. If the variable’s solution distribution has changed then a new message is sent to all neighbours.
This lower bound \({\mathcal {F}}_{{\mathcal {M}}_{\theta }}\) in Eq. (5e) is called the free energy. In Eq. (5d), \(\large {E}_{\large {q}_{{\mathcal {H}}}({\mathcal {H}})}\) is the expected log marginal likelihood and \(\large {H}\) is the Shannon entropy term. Its role in variational EM (Beal and Ghahramani 2003) is that it justifies an iterative optimization algorithm for the lower bound whereby one performs the following steps: the E-step, in which one makes the bound tighter by computing and updating \(\large {q}_{{\mathcal {H}}}({\mathcal {H}})\), and the M-step, which uses the approximation to update the parameters of the model, which typically will increase the log marginal likelihood. If the exact posterior is used, or if the approximation to the posterior is exact, then the inequality is met with equality and the original EM algorithm is obtained. Both LBP and variational EM approaches share a similar objective, which is to minimize a corresponding energy equation (Yedidia et al. 2005): the Gibbs free energy and the variational free energy, respectively. Variational inference over hidden or unobserved variables in the E-step of traditional variational EM has an advantage in that it corresponds to minimizing the KL divergence between an approximation and our quantity of interest, as we discuss below.
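The relationship between the lower bound, the log marginal likelihood and the KL divergence can be checked numerically on a toy model. The sketch below assumes a single discrete hidden variable with three states and illustrative joint values; the names `free_energy` and `kl` are hypothetical. It uses the standard identity that the log evidence equals the lower bound plus the KL divergence from the approximation to the exact posterior.

```python
import math

# Joint values p(o, h) for h = 0, 1, 2 at the observed o (illustrative numbers).
joint = [0.10, 0.25, 0.05]

evidence = sum(joint)
log_evidence = math.log(evidence)
posterior = [p / evidence for p in joint]


def free_energy(q):
    """Lower bound: expected log joint under q plus the Shannon entropy of q."""
    expected = sum(qi * math.log(pi) for qi, pi in zip(q, joint) if qi > 0)
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return expected + entropy


def kl(q, p):
    """KL divergence KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


q_uniform = [1 / 3, 1 / 3, 1 / 3]

gap = log_evidence - free_energy(q_uniform)       # equals KL(q || posterior)
tight = log_evidence - free_energy(posterior)     # zero when q is the exact posterior
```

Tightening the bound in the E-step therefore amounts to shrinking `gap`, which is the KL divergence term.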
3 GEM-MP framework
At a conceptual level our overall GEM-MP approach consists of the following three key elements. First, we extend the factor graph used to represent a given problem using mega-node random variables which behave identically to groups of variables participating in a factor. Second, we perform variational inference to update an approximation over the original variables and the mega-nodes. Third, we use a probabilistic form of generalized arc consistency to more efficiently make inferences about hard constraints. Unlike inference operations formulated using LBP, since we formulate inference using variational updates we directly minimize the KL divergence between our approximation for the joint conditional distribution and the true distribution of interest.
Before presenting the inference components of GEM-MP in detail, we will first examine a small concrete example, then present our more general approach for extending factor graphs. Let us consider a simple example factor graph \({\mathcal {G}}\) (Fig. 2 (left)), which is a fragment of the Cora example in Fig. 1, that involves factors \({\mathcal {F}} =\left\{ f_1,f_2,f_3,f_{4}\right\} \) and three random variables \(\left\{ X_1,X_2,X_3\right\} \) denoting query ground atoms \(\left\{ \text {SBib}(C_2,C_2), \text {SBib}(C_2,C_1), \text {SBib}(C_1,C_2)\right\} \) respectively.
Factor \(f_1\) in the original factor graph (top) and its extended factor \(\hat{f_1}\) (bottom)
\(X_1\) | \(X_2\) | \(f_1(X_1,X_2)\) | ||
---|---|---|---|---|
T | T | 1 | ||
T | F | 0 | ||
F | T | 1 | ||
F | F | 1 |
\(X_1\) | \(X_2\) | \(Y_1\) | \(O_1\) | \(\hat{f_1}(X_1, X_2, Y_1, O_1)\) |
---|---|---|---|---|
T | T | TT | 1 | 1 |
T | F | TT | 1 | 0 |
F | T | TT | 1 | 0 |
F | F | TT | 1 | 0 |
T | T | TT | 0 | 1 |
T | F | TT | 0 | 0 |
F | T | TT | 0 | 0 |
F | F | TT | 0 | 0 |
T | T | TF | 1 | 0 |
T | F | TF | 1 | 0 |
F | T | TF | 1 | 0 |
F | F | TF | 1 | 0 |
T | T | TF | 0 | 0 |
T | F | TF | 0 | 1 |
F | T | TF | 0 | 0 |
F | F | TF | 0 | 0 |
T | T | FT | 1 | 0 |
T | F | FT | 1 | 0 |
F | T | FT | 1 | 1 |
F | F | FT | 1 | 0 |
T | T | FT | 0 | 0 |
T | F | FT | 0 | 0 |
F | T | FT | 0 | 1 |
F | F | FT | 0 | 0 |
T | T | FF | 1 | 0 |
T | F | FF | 1 | 0 |
F | T | FF | 1 | 0 |
F | F | FF | 1 | 1 |
T | T | FF | 0 | 0 |
T | F | FF | 0 | 0 |
F | T | FF | 0 | 0 |
F | F | FF | 0 | 1 |
We attach an auxiliary mega-node \(Y_i\) (dashed oval) to each factor node \(f_i \in {\mathcal {F}}\). Each of these mega-nodes \(Y_i\) captures the local entries of its corresponding factor \(f_i\). Thus, it has a domain size that equals (at the most) the number of local entries in the factor \(f_i\) (i.e., the states of each mega-node correspond to a subset of the Cartesian product of the domains of the variables that are the arguments to the factor \(f_i\)). \({\mathcal {Y}} = \left\{ Y_i\right\} _{i=1}^m\) is the set of mega-nodes in the extended factor graph, where \(m=4\) in the example factor graph.
- In addition, we connect an auxiliary activation node, \(O_i\) (dashed circle), to each factor \(f_i\). The auxiliary activation node \(O_i\) enforces an indicator constraint \({\mathbbm {1}}_{\left( Y_i,f_i\right) }\) for ensuring that the particular configuration of the variables that are the argument to the original factor \(f_i\) is identical to the state of the mega-node \(Y_i\):$$\begin{aligned} {\mathbbm {1}}_{\left( Y_i,f_i\right) } = {\left\{ \begin{array}{ll} 1 &{} \quad \text {If the state of } Y_i \text { is identical to local entry of } f_i. \\ 0 &{} \quad \text {Otherwise} \end{array}\right. } \end{aligned}$$(8)
Now, since we expand the arguments of each factor \(f_i\) by including both auxiliary mega-node and auxiliary activation node variables, we get an extended factor \(\hat{f_i}\). \(\hat{{\mathcal {F}}} = \left\{ \hat{f_i}\right\} _{i=1}^m\) is the set of extended factors in the extended factor graph.
- When the activation node \(O_i\) equals one, then it activates the indicator constraint in Eq. (8). If this indicator constraint is satisfied, then the extended factor \(\hat{f_i}\) preserves the same value of \(f_i\) for the configuration that is defined over the original input variables defining the factor \(f_i\). Thus, clearly, the following condition holds for each extended factor \(\hat{f_i}\) when a configuration, \((x_1,\ldots ,x_n)\), of \(f_i\) equals the state, \(y_i\), of mega-node, \(Y_i\):$$\begin{aligned} \hat{f_i}\left( X_1=x_1,\ldots ,X_n=x_n,Y_i=y_i,\bar{O_i}\right) \Bigm |_{\bar{O_i} = 1} = f_i\left( X_1=x_1,\ldots ,X_n=x_n\right) . \end{aligned}$$(9)
But if the indicator constraint in Eq. (8) is not satisfied then the extended factor \(\hat{f_i}\) assigns a value 0. Thus, this condition also holds for each extended factor \(\hat{f_i}\) when a configuration \((x_1,\ldots ,x_n)\) of \(f_i\) is not equal to the state \(y_i\) of mega-node, \(Y_i\):$$\begin{aligned} \hat{f_i}\left( X_1=x_1,\ldots ,X_n=x_n,Y_i=y_i,\bar{O_i}\right) \Bigm |_{\bar{O_i} = 1} = 0. \end{aligned}$$(10)
Setting \(O_i=0\) effectively removes the impact of \(f_i\) from the model. That is, when the activation node \(O_i\) is not equal to one, then it deactivates the indicator constraint in Eq. (8). Here, the extended factor \(\hat{f_i}\) assigns a value 1 when the possible state of \(Y_i\) matches the configuration of variables that are the arguments to the factor \(f_i\). Otherwise it assigns a value 0. Note that by assigning the values in this way, all factors \(f_i \in {\mathcal {F}}\) will have identical values in their corresponding \(\hat{f_i} \in \hat{{\mathcal {F}}}\) when \(O_i=0\). This implies that the deactivation of their indicator constraint has no impact on the distribution from the inclusion of the factors \(f_i \in {\mathcal {F}}\).
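The construction of an extended factor can be sketched for the factor \(f_1\) tabulated above. This is a minimal illustration assuming the rules just described: with \(O_i=1\) the extended factor returns the original value when the mega-node state matches the configuration and 0 otherwise, and with \(O_i=0\) it returns 1 on a match and 0 otherwise. The function name `extended_factor` and the dictionary layout are hypothetical.

```python
from itertools import product

T, F = True, False


def f1(x1, x2):
    """Original factor from the table: only the entry (T, F) is invalid."""
    return 0 if (x1, x2) == (T, F) else 1


def extended_factor(f, x, y, o):
    """x: configuration of the original variables (a tuple),
    y: state of the mega-node (a tuple of the same length),
    o: value of the activation node (1 or 0)."""
    if x != y:
        return 0            # mega-node state disagrees with the configuration
    return f(*x) if o == 1 else 1


configs = list(product([T, F], repeat=2))

# Full table of the extended factor, keyed by (x, y, o).
table = {(x, y, o): extended_factor(f1, x, y, o)
         for y in configs for o in (1, 0) for x in configs}

# With the activation node switched on, summing out the mega-node
# returns the original factor value for every configuration.
recovered = {x: sum(table[(x, y, 1)] for y in configs) for x in configs}
```

The resulting `table` has 32 entries, of which seven are non-zero, in agreement with the tabulated extended factor.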
Proposition 1
Proof
see “Appendix”. \(\square \)
Proposition 2
Proof
see “Appendix”. \(\square \)
Given this extended factor graph formulation we can now examine the task of performing inference over unobserved quantities given observed quantities through the lens of variational analysis and inference.
- GEM-MP “\(M_{q\small {({\mathcal {Y}})}}\)-step”: (for maximizing mega-nodes’ parameters distributions)$$\begin{aligned}&\overbrace{{\mathcal {T}}^{\text {(t+1)}}_{\small {{\mathcal {Y}}}}}^{\text {Max. w.r.t }\large {q}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \nonumber \\&\quad = \underset{{\mathcal {T}}_{\small {{\mathcal {Y}}}}}{{\text {argmax}}} \, \overbrace{\large {E}_{\large {q}^{\text {(t)}}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}^{\text {(t)}}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ]}^{\text { E-step}} + \large {H}\big (\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) \end{aligned}$$(21)
- GEM-MP “\(M_{q\small {({\mathcal {X}})}}\)-step”: (for maximizing variable-nodes’ parameter distributions)$$\begin{aligned}&\overbrace{{\mathcal {B}}^{\text {(t+1)}}_{\small {{\mathcal {X}}}}}^{\text {Max. w.r.t }\large {q}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) }} \nonumber \\&\quad = \underset{{\mathcal {B}}_{\small {{\mathcal {X}}}}}{{\text {argmax}}} \, \overbrace{\large {E}_{\large {q}^{\text {(t)}}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}^{\text {(t+1)}}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ]}^{\text { E-step}} + \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\big ) \end{aligned}$$(22)
\(E_{q\small {({\mathcal {X}})}}\)-step messages, \(\{\mu _{X_j \rightarrow \hat{f_i}} = \large {q}(X_j;\beta _{\small {X_j}})\}\), that are sent from variables \({\mathcal {X}}\) to factors \(\hat{{\mathcal {F}}}\) (as depicted in Fig. 3 (left)). The aim of sending these messages is to perform GEM-MP’s \(M_{q\small {({\mathcal {Y}})}}\)-step in Eq. (21). That is, the current settings of the distributions, \(\{\large {q}(X_j;\beta _{\small {X_j}})\}_{\small {\forall X_j \in {\mathcal {X}}}}\), are used to estimate the distributions, \(\{\large {q}(Y_i;\alpha _{\small {Y_i}})\}_{\small {\forall Y_i \in {\mathcal {Y}}}}\), that maximize the lower bound on the log marginal-likelihood of Eq. (21). To do so, each variable \(X_j \in {\mathcal {X}}\) sends its current marginal probability \(\beta _{\small {X_j}}\) as an \(E_{q\small {({\mathcal {X}})}}\)-step message, \(\mu _{X_j \rightarrow \hat{f_i}} =\large {q}(X_j;\beta _{\small {X_j}})\), to its neighboring extended factors. Then, at the factor level, each extended factor \(\hat{f_i} \in \hat{{\mathcal {F}}}\) uses the relevant marginals from the incoming messages of its argument variables, i.e., \(\{\large {q}(X_j;\beta _{\small {X_j}})\}_{\small {\forall X_j \in {\mathcal {X}}_{\hat{f_i}}}}\), to perform the computations of the \(E_{q\small {({\mathcal {X}})}}\)-step of Eq. (21). This implies updating the distribution \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) of its mega-node \(Y_i\) by computing what we call the probabilistic generalized arc consistency (pGAC) (we will discuss pGAC in more detail in Sect. 4).
\(E_{q\small {({\mathcal {Y}})}}\)-step messages, \(\{\mu _{\hat{f_i} \rightarrow X_j} = \sum _{\small {Y_i:\forall y_k(X_j)}} \large {q}(Y_i;\alpha _{\small { Y_i}})\}\), that are sent from factors to variables (as depicted in Fig. 3 (right)). Sending these messages corresponds to GEM-MP’s \(M_{q\small {({\mathcal {X}})}}\)-step in Eq. (22). Here, the approximation of the distributions, \(\{\large {q}(Y_i;\alpha _{\small {Y_i}})\}_{\small {\forall Y_i \in {\mathcal {Y}}}}\), obtained from GEM-MP’s \(M_{q\small {({\mathcal {Y}})}}\)-step will be used to update the marginals, i.e., \(\{\large {q}(X_j;\beta _{\small {X_j}})\}_{\small {\forall X_j \in {\mathcal {X}}}}\), that maximize the lower bound on the log marginal-likelihood in Eq. (22). Characteristically, each extended factor \(\hat{f_i} \in \hat{{\mathcal {F}}}\) sends a corresponding refinement of the pGAC distribution, which approximates the \(\large {q}(Y_i;\alpha _{\small { Y_i}})\) of its mega-node, as an \(E_{q\small {({\mathcal {Y}})}}\)-step message, \(\mu _{\hat{f_i} \rightarrow X_j} = \sum _{\small {Y_i:\forall y_k(X_j)}} \large {q}(Y_i;\alpha _{\small { Y_i}})\), to each of its argument variables, \(X_j \in {\mathcal {X}}_{\hat{f_i}}\). Now, at the variable level, each \(X_j \in {\mathcal {X}}\) uses the relevant refinement of pGAC distributions from the incoming messages, which are the outgoing messages coming from its extended factors \(\hat{f_i} \in \hat{{\mathcal {F}}}_{X_j}\), to perform the computations of the \(E_{q\small {({\mathcal {Y}})}}\)-step of Eq. (22). This implies updating its distribution \(\large {q}(X_j;\beta _{\small {X_j}})\) by summing these messages (as will be discussed in more detail in Sect. 4).
Theorem 1
(GEM-MP guarantees convergence) At each iteration of updating the marginals (i.e., variational parameters \({\mathcal {B}}_{\small {{\mathcal {X}}}}\)), GEM-MP monotonically increases the lower bound on the model evidence, so that it never overshoots the global optimum, until it converges naturally to some local optimum.
Proof
Since the exact log marginal-likelihood, \(\log \sum _{\small {{\mathcal {X}}},\small {{\mathcal {Y}}}} P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}}|{\mathcal {M}})\), is a fixed quantity and the Kullback–Leibler divergence is non-negative (\(\textit{KL} \ge 0\)), GEM-MP never overshoots the global optimum of the variational free energy.
Since GEM-MP applies a variational mean-field approximation for the \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) and \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) distributions [refer to Eqs. (24) and (23)] over the mega-nodes and the variable nodes respectively, it inherits the mean-field guarantee of converging to a local minimum of the negative variational free energy, or equivalently of the KL divergence. \(\square \)
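The proof relies on the standard identity that the log evidence equals the variational lower bound plus a non-negative KL term. The following is a minimal numerical sketch of that identity on a toy model with one discrete hidden variable. The joint values, the distribution `q` and the function name `elbo_and_kl` are illustrative assumptions, not quantities from the paper.

```python
import math

def elbo_and_kl(joint, q):
    """joint[z] = p(o, z) for a fixed observation o; q[z] = variational distribution."""
    evidence = sum(joint)                      # p(o), a fixed quantity
    posterior = [p / evidence for p in joint]  # p(z | o)
    # Lower bound: E_q[log p(o, z) - log q(z)]
    elbo = sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)
    # KL(q || posterior), always non-negative
    kl = sum(qz * math.log(qz / post) for qz, post in zip(q, posterior) if qz > 0)
    return math.log(evidence), elbo, kl

joint = [0.10, 0.05, 0.25]  # hypothetical values of p(o, z)
q = [0.2, 0.3, 0.5]         # arbitrary variational distribution
log_evidence, elbo, kl = elbo_and_kl(joint, q)
```

Because `elbo + kl` equals `log_evidence` for any `q`, raising the bound can never push it above the log evidence, which is the sense in which an update cannot overshoot.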
Note that the convergence behaviour of GEM-MP for the inference task resembles the behaviour of the variational Bayesian expectation maximization approach proposed by Beal and Ghahramani (2003) for the Bayesian learning task. Both can be seen as variational techniques (forming a factorial approximation) that minimize a free-energy-based function for estimating the marginal likelihood of probabilistic models with hidden variables.
It is worth noting that when reaching the GEM-MP “\(M_{q\small {({\mathcal {Y}})}}\)-step”, one could select between a local or global approximation to distribution \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\). However, in this paper, we restricted ourselves to local approximations.^{8} Furthermore, although GEM-MP represents a general template framework for applying variational inference to probabilistic graphical models, we concentrate on Markov logic models, where the variables will be ground atoms and the factors will be both hard and soft ground clauses (as will be explained in Sect. 4) and Ising models (as will be explained in Sect. 5).
4 GEM-MP general update rule for Markov logic
By substituting the local approximation for \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) from the \(M_{q\small {({\mathcal {Y}})}}\)-step into the \(M_{q\small {({\mathcal {X}})}}\)-step, we can synthesize update rules that tell us how to set the new marginal in terms of the old one. So, in practice, the \(M_{q\small {({\mathcal {Y}})}}\)-step and the \(E_{q\small {({\mathcal {Y}})}}\)-step messages of GEM-MP can be expressed as one set of messages (from atoms to atoms through clauses). This set of messages yields a general update rule for GEM-MP that is applicable to Markov logic. However, since the underlying factor graph often contains both hard and soft clauses, within the GEM-MP framework we deliberately distinguish them by using two variants of the general update rule (denoted Hard-update-rule and Soft-update-rule) for hard and soft clauses, respectively.
4.1 Hard update rule
For notational convenience, we explain the derivation of the hard update rule by considering untyped atoms; but extending it to the more general case is straightforward. Also, for clarity, we begin the derivation with the \(M_{q\small {({\mathcal {X}})}}\)-step rather than with the usual \(M_{q\small {({\mathcal {Y}})}}\)-step. So we assume that we have already constructed \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\). We also assume that all clauses are hard.
We apply Eq. (40) in Eq. (39), convert logarithms of products into sums of logarithms, exchange summations, and handle each hard ground clause \(f^h_i \in {\mathcal {F}}^h\) separately in a sum.
Thus the sum of the \(E_{q\small {({\mathcal {Y}})}}\)-step messages that ground atom \(X_j\) will receive from its neighboring hard ground clauses represents a weight (i.e., \(\textit{Weight}^+_{X_j}\)) used to update its positive marginal.
Now to obtain the completed hard update rule, what remains is the \(M_{q\small {({\mathcal {Y}})}}\)-step, through which we need to substitute the distribution \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) in Eqs. (41) and (42).
2. \(M_{q\small {({\mathcal {Y}})}}\)-step: The goal here is to produce the distribution \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) by using the current setting of the marginals \({\mathcal {B}}_{\small {{\mathcal {X}}}}\). However, the summation \(\sum _{\small {Y_i:\forall y_k(X_j)=``-''}}\) involves enumerating all the valid local entries for each \(Y_i\), which is inefficient. Instead, we approximate the distribution \(\sum _{\small {Y_i:\forall y_k(X_j)=``-''}} \large {q}(Y_i;\alpha _{\small {Y_i}})\) for each hard ground clause \(f^h_i \in {\mathcal {F}}^h\) by a probability \(1-\xi (X_j,f^h_i)\), which we call the probabilistic generalized arc consistency (pGAC). We elaborate on pGAC in the next subsection.
4.1.1 Note on the connection between pGAC and variational inference
Hence, if \(\xi (X_j,f_i)\) is the probability that \(X_j=d\) does not satisfy \(f_i\), then \(1-\xi (X_j,f_i)\) is directly the probability that \(X_j=d\) satisfies the ground clause \(f_i\). It also represents the probability that \(X_j=d\) is GAC with respect to \(f_i\), because the event of \(X_j=d\) satisfying \(f_i\) implies that it must be GAC to \(f_i\). This interpretation entails a form of generalized arc consistency, adapted to CNF, in a probabilistic sense; we call it probabilistic generalized arc consistency.
Definition 2
(Probabilistic generalized arc consistency (pGAC))
The definition of traditional GAC in Sect. 2 corresponds to the two extreme cases of pGAC: \(\xi (X_j,f_i)=0\), meaning that \(X_j=d\) is certainly GAC to \(f_i\), and \(\xi (X_j,f_i)=1\), meaning that it is never GAC to \(f_i\). Based on that, if \(f_i\) contains \(X_j\) positively then the pGAC probability of \(X_j=+\) equals 1 because it is always GAC to \(f_i\). Analogously, the pGAC probability is 1 for \(X_j=-\) when \(f_i\) contains \(X_j\) negatively.
From a probabilistic perspective, the pGAC probability of \(X_j=d\) represents the probability that \(X_j=d\) is involved in a valid local entry of \(f_i\). This is similar to the computation of the solution probability of \(X_j=d\) by using probabilistic arc consistency (pAC) (presented by Horsch and Havens 2000, and summarized in Sect. 2). However, it should be noted that our pGAC applies a mean-field approximation: when computing \(\xi (X_j,f_i)\), as defined in Eq. (44), for each ground atom \(X_j \in {\mathcal {X}}_{f_i}\), we use the marginal probabilities that the other ground atoms \(X_k \in {\mathcal {X}}_{f_i} {\setminus } \{X_j\}\) take the values that do not satisfy \(f_i\). Thus the main difference between our pGAC and pAC (Horsch and Havens 2000) is that pGAC uses mean-field, whereas pAC uses BP, to compute the probability that \(X_j=d\) belongs to a valid local entry of \(f_i\). Furthermore, pAC is restricted to binary constraints, whilst pGAC is also applicable to non-binary ones.
From the point of view of computational complexity, \(\xi (X_j,f_i)\) requires only linear computational time in the arity of the ground clause (as will be shown in Proposition 3). Thus, pGAC is an efficient form of GAC compared to pAC. In addition, pGAC guarantees the convergence of mean-field whereas pAC inherits the possibility of non-convergence from BP.
\([1,1-\xi (X_j,f_i)]\) if \(f_i\) contains \(X_j\) positively.
\([1-\xi (X_j,f_i),1]\) if \(f_i\) contains \(X_j\) negatively.
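To make the linear-time claim concrete, here is a minimal sketch of one way \(\xi (X_j,f_i)\) could be computed for a ground clause. Eq. (44) is not reproduced in this section, so the product form below is an assumption based on the description above: it multiplies, over the other atoms of the clause, the current marginal probability that each atom takes the value that does not satisfy the clause. The clause encoding (`+1` for a positive literal, `-1` for a negated one) and the name `xi` are hypothetical.

```python
def xi(atom_j, clause, beta_pos):
    """Assumed form of xi(X_j, f_i): probability that all other atoms of the
    clause take their non-satisfying values, under independent marginals.

    clause:   dict mapping atom name -> +1 (positive literal) or -1 (negated)
    beta_pos: dict mapping atom name -> current marginal P(atom = true)
    """
    prob = 1.0
    for atom_k, sign in clause.items():
        if atom_k == atom_j:
            continue  # X_j itself is excluded
        # A positive literal fails when the atom is false; a negated one when it is true.
        prob *= (1.0 - beta_pos[atom_k]) if sign > 0 else beta_pos[atom_k]
    return prob

# Clause A v ~B v C with illustrative marginals.
clause = {"A": +1, "B": -1, "C": +1}
beta_pos = {"A": 0.5, "B": 0.8, "C": 0.25}
xi_A = xi("A", clause, beta_pos)  # P(B = true) * P(C = false)
```

The loop visits each atom of the clause once, so the cost is linear in the clause arity.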
4.1.2 Using pGAC in the derivation of the hard update rule
The end result, as in Eq. (48d), is the \(\textit{Weight}^+_{X_j}\) of ground atom \(X_j\), computed as the number of hard ground clauses that include \(X_j\) minus the sum of \(\xi (X_j,f^h_i)\) over the hard ground clauses that involve \(X_j\) as a negative atom.
4.2 Soft update rule
To derive the update rule for soft ground clauses, what we need to do is to soften some restrictions on the weight parts (i.e. \(\textit{Weight}^+_{X_j}\), \(\textit{Weight}^-_{X_j}\)) of the hard update rule. This encompasses modifying the distributions, \(\large {q}(Y_i;\alpha _{\small {Y_i}})\), of hard ground clauses for soft ground clauses by applying two consecutive steps: softening and embedding.
Note that, comparing Eq. (53c) to its counterpart Eq. (48c) for the hard update rule, we have an additional term “\(\xi (X_j,f^s_{i}) \cdot 1\)” in the second summation. This is because computing the second part of Eq. (53b) involves two terms, which appear in the second part of Eq. (53c). The first is (\(1 - \xi (X_j,f^s_{i})\)), the probability that \(X_j\) being positive satisfies a factor \(f^s_{i}\) that includes \(X_j\) as a negative ground atom; it is multiplied by \(\exp (w_{f^s_{i}})\), since a satisfied soft ground clause \(f^s_{i}\) yields \(\exp (w_{f^s_{i}})\). The second term is \(\xi (X_j,f^s_{i})\), the probability that \(X_j\) being positive does not satisfy the factor \(f^s_{i}\); it is multiplied by 1, since an unsatisfied \(f^s_{i}\) yields 1. This “\(\xi (X_j,f^s_{i}) \cdot 1\)” term is absent from the hard update rule in Eq. (48c) because there \(\xi (X_j,f^h_{i})\) is multiplied by 0: an unsatisfied hard ground clause yields 0, whereas an unsatisfied soft ground clause yields 1.
At this point, we take the \(\textit{Weight}^+_{X_j}\) and \(\textit{Weight}^-_{X_j}\) from Eqs. (48d), (49), (53d), and (54) and substitute these for the \(\textit{Weight}^+_{X_j}\) and \(\textit{Weight}^-_{X_j}\) in Eqs. (41) and (42) to obtain the final set of GEM-MP rules for updating the marginals of query ground atoms. These rules are summarized in Table 3. Their main advantage is that they capture the relationships among ground atoms directly. Thus, we do not need to pass messages explicitly from atoms to clauses or vice versa.
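The distinction between the hard and soft cases can be illustrated per clause. The sketch below computes the expected clause value for a given \(\xi\): a satisfied soft clause contributes \(\exp (w)\) and an unsatisfied one contributes 1, whereas a hard clause contributes 1 when satisfied and 0 when unsatisfied. The function names are hypothetical, and the code covers only this single-clause term, not the full derivation of Eqs. (48) and (53).

```python
import math

def expected_clause_value(xi_value, satisfied_value, unsatisfied_value):
    """(1 - xi) * value when satisfied + xi * value when unsatisfied."""
    return (1.0 - xi_value) * satisfied_value + xi_value * unsatisfied_value

def hard_term(xi_value):
    # Hard clause: 1 if satisfied, 0 if unsatisfied, so the xi term vanishes.
    return expected_clause_value(xi_value, 1.0, 0.0)

def soft_term(xi_value, weight):
    # Soft clause: exp(w) if satisfied, 1 if unsatisfied, so "xi * 1" remains.
    return expected_clause_value(xi_value, math.exp(weight), 1.0)
```

For example, with \(\xi = 0.25\) the hard term is 0.75, while the soft term with weight \(w\) is \(0.75\exp (w) + 0.25\).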
Table 3 General update rules of GEM-MP inference for Markov logic. These rules capture the relationships among ground atoms directly, and therefore they do not require explicitly passing messages between atoms and clauses
\(\beta ^+_{X_j} = \frac{\textit{Weight}^+_{X_j}}{\lambda _{X_j}}, \, \beta ^-_{X_j} = \frac{\textit{Weight}^-_{X_j}}{\lambda _{X_j}}, \, \lambda _{X_j} = \textit{Weight}^+_{X_j} +\textit{Weight}^-_{X_j}\) | |
---|---|
Hard-update-rule | |
\(\textit{Weight}^+_{X_j} \leftarrow \left| {\mathcal {F}}_{X_j}^h\right| - \sum _{f^h_i \in {\mathcal {F}}_{X_j-}^h} \xi (X_j,f^h_{i})\) | |
\(\textit{Weight}^-_{X_j} \leftarrow \left| {\mathcal {F}}_{X_j}^h\right| - \sum _{f^h_i \in {\mathcal {F}}_{X_j+}^h} \xi (X_j,f^h_{i})\) | |
Soft-update-rule | |
\(\begin{array}{lll} \textit{Weight}^+_{X_j} &{} \leftarrow &{} \bigg [\exp \big (\sum _{f^s_{i} \in {\mathcal {F}}_{X_j}^s} w_{f^s_{i}}\big )\bigg ] \\ &{} &{} - \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j+}^s} \exp (w_{f^s_{i}}) \times \big [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j-}^s} \xi (X_j,f^s_{i}) (\exp (w_{f^s_{i}}) - 1) \big ]\bigg ] \end{array}\) | |
\(\begin{array}{lll} \textit{Weight}^-_{X_j} &{} \leftarrow &{} \bigg [\exp \big (\sum _{f^s_{i} \in {\mathcal {F}}_{X_j}^s} w_{f^s_{i}}\big )\bigg ] \\ &{} &{} - \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j-}^s} \exp (w_{f^s_{i}}) \times \big [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j+}^s} \xi (X_j,f^s_{i}) (\exp (w_{f^s_{i}}) - 1) \big ]\bigg ] \end{array}\) |
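As an illustration, the Hard-update-rule and the normalization in the first row of the table can be sketched as follows. The function name and argument layout are hypothetical, and the \(\xi\) values are assumed to have been computed beforehand.

```python
def hard_update(num_hard_clauses, xi_negative_clauses, xi_positive_clauses):
    """One application of the Hard-update-rule for a single ground atom X_j.

    num_hard_clauses:    number of hard ground clauses containing X_j
    xi_negative_clauses: xi(X_j, f) for hard clauses containing X_j negatively
    xi_positive_clauses: xi(X_j, f) for hard clauses containing X_j positively
    Returns the normalized marginals (beta_plus, beta_minus).
    """
    if num_hard_clauses == 0:
        raise ValueError("the atom must appear in at least one hard clause")
    weight_pos = num_hard_clauses - sum(xi_negative_clauses)
    weight_neg = num_hard_clauses - sum(xi_positive_clauses)
    lam = weight_pos + weight_neg  # normalizer lambda_{X_j}
    return weight_pos / lam, weight_neg / lam

# Atom in three hard clauses: negated in one, positive in two (illustrative xi values).
beta_plus, beta_minus = hard_update(3, [0.5], [0.5, 1.0])
```

Here the weights are 2.5 and 1.5, giving marginals 0.625 and 0.375.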
4.3 GEM-MP versus LBP
GEM-MP and LBP inference can be contrasted as follows. Recall the basic quantities used by GEM-MP in Eqs. (41) and (42) versus LBP in Eqs. (3) and (4) for updating the marginal of a single variable \(X_j\). Although the marginal update rules of both algorithms look similar, they are derived by very different routes and differ in important ways. The first significant difference is that, due to the expectations involved in variational message passing, GEM-MP takes a summation (i.e. \(\sum _{f_i \in {\mathcal {F}}_{X_j}}\)) over the incoming messages to a given node, which are the outgoing messages of its factors. This is in contrast to the multiplication (i.e. \(\prod _{f_i \in {\mathcal {F}}_{X_j}}\)) associated with standard LBP. In other words, GEM-MP handles the incoming message (the \(E_{q\small {({\mathcal {Y}})}}\)-step message) from each factor as a separate term in a sum. This means that when moving toward the local maximum of the energy functional \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) in Eq. (17c), GEM-MP computes a moderate arithmetic average of the incoming \(E_{q\small {({\mathcal {Y}})}}\)-step messages to yield the marginal update steps for \(X_j\). Due to the variational underpinnings of GEM-MP, these steps update a quantity that is a lower bound on the log marginal likelihood. This follows from the use of Jensen’s inequality in Eq. (12c), which lower-bounds the model evidence, and at each update step we minimize the Kullback–Leibler divergence. We therefore cannot ‘overstep’ in our approximation of the true model evidence (refer to Theorem 1). In contrast, LBP computes a (coarse) geometric average of the incoming messages in a setting where there is no such bound.
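The effect of summing rather than multiplying incoming messages can be seen in a small sketch. It is a simplified illustration with made-up message values, not an implementation of either algorithm: each message is a pair of scores for \(X_j=+\) and \(X_j=-\), and the two combiners differ only in using a sum or a product before normalizing.

```python
def normalize(scores):
    total = sum(scores)
    return [s / total for s in scores]

def combine_sum(messages):
    # Sum-style combination: each message is a separate additive term.
    return normalize([sum(m[state] for m in messages) for state in range(2)])

def combine_product(messages):
    # Product-style combination: messages are multiplied state by state.
    scores = [1.0, 1.0]
    for m in messages:
        scores[0] *= m[0]
        scores[1] *= m[1]
    return normalize(scores)

# Index 0 is "+", index 1 is "-". The last message is deterministic.
messages = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
by_sum = combine_sum(messages)          # moderate: about [0.567, 0.433]
by_product = combine_product(messages)  # extreme: [0.0, 1.0]
```

A single deterministic message forces the product to an extreme value, whereas the sum moves the marginal only moderately.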
4.4 GEM-MP algorithm
Algorithm 1 gives pseudo-code for the GEM-MP inference algorithm. The algorithm starts by uniformly initializing (i.e., \({\mathcal {U}}\)) the marginals of all ground atoms in the query set \({\mathcal {X}}\) (lines 1–3). Then it distinguishes two subsets of query ground atoms. The first, \({\mathcal {X}}_h\), contains the query ground atoms involved in hard ground clauses (line 4). The second, \({\mathcal {X}}_s\), contains those involved in soft ground clauses (line 5). Note that a query atom involved in both soft and hard ground clauses is included in both subsets. At each step, the algorithm updates the marginals of the first subset of query atoms by using the hard update rule (lines 7–9). Then it updates the marginals of the query atoms in the second subset by applying the soft update rule (lines 10–12). The algorithm keeps alternating between the two update rules until convergence (i.e., \(\forall X_j \in {\mathcal {X}}, \, \left| \beta _{X_j}({\mathcal {I}})-\beta _{X_j}({\mathcal {I}}-1)\right| <\epsilon \), where \(\epsilon \) is a specified precision) or until the maximum number of iterations is reached (line 13). Although the marginals of the query atoms involved in both soft and hard ground clauses (i.e., those in both \({\mathcal {X}}_h\) and \({\mathcal {X}}_s\)) may be affected by switching from the hard to the soft update rule, or vice versa, these marginals serve to propagate information about hard ground clauses to query atoms in \({\mathcal {X}}_s\) when they are used by the soft update rule, and information about soft ground clauses to query atoms in \({\mathcal {X}}_h\) when they are used by the hard update rule. It should be noted that the checks performed by each update rule are extremely cheap (a fraction of a second, on average) and that the subset of ground clauses at each particular step is unlikely to be in the hard critical region.
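The control flow described above can be sketched as follows. This is a skeleton under stated assumptions, not a transcription of Algorithm 1: `hard_rule` and `soft_rule` are hypothetical callables that return the new positive marginal of an atom given the current marginals, updates are applied in place, and the defaults for `epsilon` and `max_iterations` are arbitrary.

```python
def gem_mp(atoms, hard_atoms, soft_atoms, hard_rule, soft_rule,
           epsilon=1e-4, max_iterations=100):
    # Uniform initialization of the positive marginal of every query atom.
    beta = {atom: 0.5 for atom in atoms}
    for iteration in range(1, max_iterations + 1):
        previous = dict(beta)
        # Atoms involved in hard ground clauses: hard update rule.
        for atom in hard_atoms:
            beta[atom] = hard_rule(atom, beta)
        # Atoms involved in soft ground clauses: soft update rule.
        for atom in soft_atoms:
            beta[atom] = soft_rule(atom, beta)
        # Stop when no marginal changed by epsilon or more.
        if all(abs(beta[a] - previous[a]) < epsilon for a in atoms):
            return beta, iteration
    return beta, max_iterations

# Toy usage: a made-up contraction stands in for the real update rules.
toy_rule = lambda atom, beta: 0.5 * beta[atom] + 0.4
marginals, iterations = gem_mp(["A"], ["A"], [], toy_rule, toy_rule)
```

With the toy rule the marginal of `A` approaches the fixed point 0.8 and the loop stops well before the iteration cap.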
Proposition 3
(Computational Complexity) Given an MLN’s ground network with n ground atoms, m ground clauses, and a maximum arity of the ground clauses of r, one iteration of computing the marginals of query atoms takes time in O(nmr) in the worst case.
Proof
See “Appendix”. \(\square \)
Note that even though GEM-MP is built on a propositional basis, its computational complexity is quite efficient since the size of the grounded network is proportional to \(O(d^r)\), where d is the number of objects (constants) in the domain. Also, in practice, we can improve this computational time by precomputing some terms. For instance, we do not compute the constant terms (such as \(\left| {\mathcal {F}}_{X_j}^h\right| \) in the hard update rule) at each iteration; instead, we compute them once and then reuse their values.
5 GEM-MP update rules for Ising MRFs
Unit clauses: \((X_i,\theta _i)\), \(\forall X_i \in {\mathcal {X}}\)
Pairwise clauses: \(\big [(\lnot X_i \vee X_j) \wedge (X_i \vee \lnot X_j), \theta _{ij}=\eta \cdot C\big ]\), \(\forall X_i,X_j \in {\mathcal {E}}\)
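The pairwise clause \((\lnot X_i \vee X_j) \wedge (X_i \vee \lnot X_j)\) is the CNF form of the equivalence \(X_i \Leftrightarrow X_j\), so it is satisfied exactly when the two neighbouring variables agree. A short truth-table check (the function name is illustrative):

```python
from itertools import product

def pairwise_clause(x_i, x_j):
    """(not x_i or x_j) and (x_i or not x_j), i.e. x_i <=> x_j."""
    return ((not x_i) or x_j) and (x_i or (not x_j))

# Enumerate all four joint assignments of the two Boolean variables.
truth_table = {
    (a, b): pairwise_clause(a, b)
    for a, b in product([False, True], repeat=2)
}
```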
6 Empirical evaluation
(Q1.) Is GEM-MP’s accuracy competitive with state-of-the-art inference algorithms for Markov logic? This question is important to answer as it examines the soundness of GEM-MP inference.
(Q2.) On graphs with problematic cycles, where LBP exhibits oscillations, does GEM-MP converge? We want to explore and emphasize experimentally that GEM-MP inference indeed addresses Limitation 1.
(Q3.) Is GEM-MP more accurate than LBP in the presence of determinism? We want to check experimentally the effectiveness of GEM-MP inference to remedy Limitation 2.
(Q4.) Is GEM-MP scalable compared to other state-of-the-art propositional inference algorithms for Markov logic? We wish to examine the real-world applicability of GEM-MP inference.
(Q5.) Is GEM-MP accurate compared to state-of-the-art convergent message-passing algorithms for other probabilistic graphical models such as Markov Random Fields? We wish to examine the accuracy and convergence behaviour of GEM-MP inference for other related model classes and algorithms.
(Q6.) Is GEM-MP’s accuracy influenced by the initialization of the marginals? We will examine if the initialization of approximate marginals using random values differs from initializing marginals using a uniform distribution.
MC-SAT proposed by Poon and Domingos (2006).
Lazy MC-SAT (LMCSAT) proposed by Poon et al. (2008).
Loopy Belief Propagation (LBP) (refer to Yedidia et al. 2005).
Gibbs sampling (Gibbs) (Richardson and Domingos 2006).
Lifted Importance sampling (L-Im) proposed by Venugopal and Gogate (2014b) as an improvement of the one proposed by Gogate et al. (2012).
6.1 Datasets
MLN: We used the MLN model which is similar to the established one of Singla and Domingos (2006). The MLN involves formulas stating regularities such as: if two citations are the same, their fields are the same; if two fields are the same, their citations are the same. It also has formulas representing transitive closure, which are assigned very high weight (i.e. near deterministic clauses). The final knowledge base contains 10 atoms and 32 formulas (adjusted as 4 hard, 3 near-deterministic and 25 soft).
Query: The goal of inference is to predict which pairs of citations refer to the same citation (SameBib), and similarly for the author, title and venue fields (SameAuthor, SameTitle and SameVenue). The other atoms are considered evidence atoms.
MLN: We used the MLN model described by Davis and Domingos (2009). It involves singleton rules for predicting the interaction relationship, and rules describing how protein functions relate to interactions between proteins (i.e. two interacting proteins tend to have similar functions). The final knowledge base has 7 atoms and 8 first-order formulas (2 hard and 6 soft).
Query: The goal of inference is to predict the interaction relation (Interaction, Function). All other atoms (e.g., location, protein-class, enzyme, etc.) are considered evidence atoms.
MLN: We used the MLN model available from the Alchemy website.^{18} It includes formulas such as the following: each student has at most one advisor; if a student is an author of a paper, so is her advisor; advanced students only TA courses taught by their advisors; a student may not have both temporary and formal advisors at the same time (\(\lnot TemAdvised(s,p)\vee \lnot Advised(s,p)\), which is a true statement at UW-CSE), etc. The final knowledge base contains 22 atoms and 94 formulas (considered as 7 hard and 65 soft; we excluded the 22 unit clauses). Note that ten out of these 22 atoms are equality predicates: Sameperson(person; person), Samecourse(course; course), etc., which always have known, fixed values that are true if the two arguments are the same constant. The rest of them are easily predictable using the unit clause method.
Query: The inference task is to predict advisory relationships (AdvisedBy), and all other atoms are evidence (corresponding to the all-information scenario in Richardson and Domingos (2006)).
6.2 Metrics
Since computing exact marginal or joint conditional distributions is not feasible for the underlying domains, we evaluated the quality of inference with our method using two metrics: the average conditional log marginal-likelihood (CLL) and the balanced \(F_1\) score. The CLL, which approximates the KL-divergence between the actual and computed marginals returned by an inference algorithm for query ground atoms, is an intuitive way of measuring the quality of the produced marginal probabilities. After obtaining the marginal probabilities from the inference algorithm, the average CLL of a query atom is computed by averaging the log-marginal probabilities of the true values over all its groundings. For the \(F_1\)-score metric, we predict that a query ground atom is true if its marginal probability is at least 0.5; otherwise we predict that it is false (see Huynh and Mooney 2009, 2011; Papai et al. 2012, for more details about measuring prediction quality on the basis of marginal probabilities). The advantage of the \(F_1\) score is its insensitivity to true negatives (TNs), and thus it can demonstrate the quality of an algorithm for predicting the few true positives (TPs).
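Both metrics can be computed directly from the marginals and the ground truth. The sketch below follows the definitions above; the function names, the clipping constant `eps` (added to avoid taking the log of zero) and the example values are assumptions made for illustration.

```python
import math

def average_cll(marginals, truths, eps=1e-12):
    """Average log-probability assigned to the true value of each grounding."""
    total = 0.0
    for p_true, is_true in zip(marginals, truths):
        p_of_truth = p_true if is_true else 1.0 - p_true
        total += math.log(max(p_of_truth, eps))  # clipped to avoid log(0)
    return total / len(marginals)

def f1_score(marginals, truths, threshold=0.5):
    """Balanced F1, predicting true when the marginal is at least the threshold."""
    predictions = [p >= threshold for p in marginals]
    tp = sum(1 for y, t in zip(predictions, truths) if y and t)
    fp = sum(1 for y, t in zip(predictions, truths) if y and not t)
    fn = sum(1 for y, t in zip(predictions, truths) if not y and t)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative marginals P(atom = true) and ground-truth values.
marginals = [0.9, 0.6, 0.2, 0.4]
truths = [True, False, False, True]
cll = average_cll(marginals, truths)
f1 = f1_score(marginals, truths)
```

True negatives never enter the \(F_1\) computation, which is why the score is insensitive to them.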
6.3 Methodology and results
All the experiments were run on a cluster of nodes with multiprocessors running 2.4 GHz Intel CPUs with 4 GB of RAM under Red Hat Linux 5.5. We used the implementations of both the training algorithm (preconditioned scaled conjugate gradient) and inference algorithms (MC-SAT, LBP, and Gibbs) that exist in the Alchemy system (Kok et al. 2007). In addition, we implemented our GEM-MP algorithm as an extension to Alchemy’s inference. All of Alchemy’s default parameters were retained (e.g., 100 burn-in iterations to negate the effect of initialization in MC-SAT and Gibbs). We conducted our experimental evaluation through five experiments.
6.3.1 Experiment I
The first experiment was dedicated to answering Q1 and Q2. We ran our experiments using a five-way cross-validation for both Cora and UW-CSE, and a four-way cross-validation for Yeast. In the training phase we learned the weights of the models by running a preconditioned scaled conjugate gradient (PSCG) algorithm (Lowd and Domingos 2007 showed that PSCG performed best). In the testing phase, using the learned models, we carried out inference on the held-out dataset with each of the four underlying inference algorithms to produce the marginals of all groundings of query atoms being true. These marginal probabilities were used to compute the \(F_1\) and average CLL metrics.
Table 4 Average \(F_1\) scores for the GEM-MP, MC-SAT, Gibbs, LBP, LMCSAT, and L-Im inference algorithms on Cora, Yeast, and UW-CSE at the end of the allotted time
Dataset | Query | GEM-MP | MC-SAT | Gibbs | LBP | LMCSAT | L-Im |
---|---|---|---|---|---|---|---|
Cora | SameBib | 0.778 | 0.695 | 0.443 | 0.382 | 0.690 | 0.491 |
Cora | SameAuthor | 0.960 | 0.926 | 0.660 | 0.657 | 0.926 | 0.690 |
Cora | SameTitle | 0.860 | 0.790 | 0.570 | 0.515 | 0.780 | 0.581 |
Cora | SameVenue | 0.843 | 0.747 | 0.584 | 0.504 | 0.739 | 0.613 |
Yeast | Interacts | 0.792 | 0.669 | 0.474 | 0.536 | 0.651 | 0.512 |
Yeast | Function | 0.820 | 0.691 | 0.492 | 0.575 | 0.679 | 0.532 |
UW-CSE | advisedBy | 0.762 | 0.589 | 0.483 | 0.415 | 0.580 | 0.504 |
Overall average | | 0.831 | 0.730 | 0.529 | 0.512 | 0.720 | 0.560 |
Table 4 reports the average \(F_1\) scores for the inference algorithms on the underlying datasets. The results complement those of Fig. 5, underscoring the promise of our proposed GEM-MP algorithm to obtain the highest quality among the alternatives for predicting marginals, particularly for the TP query atoms (i.e. query atoms that are true and predicted to be true). GEM-MP substantially outperformed LBP, Gibbs and L-Im on all datasets, achieving 39, 37, and 33 % greater accuracy respectively (answering Q2). MC-SAT was relatively competitive with GEM-MP on Cora and UW-CSE, but on the Yeast dataset GEM-MP performed significantly better, attaining \(13 \,\%\) greater accuracy than MC-SAT (a conclusive answer to Q1). Gibbs and LBP rivaled each other on the tested datasets but were both dominated by MC-SAT. LMCSAT was very competitive with its propositional counterpart MC-SAT, with approximately a \(2.2\,\%\) loss in accuracy.
6.3.2 Experiment II
Figure 6 reports the average CLL as a function of time for GEM-MP, LBP, and MC-SAT at different levels of determinism. Overall the results confirm that the amount of determinism in the model has a great impact on both the accuracy and the convergence of GEM-MP and LBP. That is, when increasing the level of determinism, we observe an increase in the accuracy of GEM-MP and a decrease in the accuracy of LBP. At each level of determinism and on all datasets GEM-MP prevailed over the corresponding LBP in terms of accuracy of results (answering Q3). In addition the greater the level of determinism, the greater the convergence for GEM-MP, and the greater the non-convergence for LBP (answering Q2). Remarkably the 0-level, which has no amount of determinism, exhibits the worst behaviour for GEM-MP.^{20} In contrast it is the best level for LBP, though even at this level GEM-MP surpassed LBP on all datasets. For MC-SAT increasing the determinism in the model has a small negative impact on its accuracy.
6.3.3 Experiment III
This experiment examines Q4. We are interested here in judging the scalability of various inference algorithms. To guarantee a fair comparison, we reran Experiment I while increasing the number of objects in the domain from 100 to 200 by increments of 25, following the methodology previously used by Poon et al. (2008) and Shavlik and Natarajan (2009). Then we reported the average running time to achieve convergence or up to a maximum of 5000 and 10,000 iterations respectively for the entire inference process.
6.3.4 Experiment IV
L2-convex proposed by Hazan and Shashua (2010, 2008), which runs sequential message passing on the convex-L2 Bethe free energy.
RBP proposed by Elidan et al. (2006), which runs damped Residual BP, a greedy informed schedule for message passing.
CCCP double loop algorithm proposed by Yuille (2001, 2002), which runs message-passing on the convex-concave Bethe free energy.
Figure 8 (top) displays the cumulative percentage of convergence as a function of the number of iterations for each algorithm at Levels 1 and 2. Overall the results show that GEM-MP converges significantly more often than all other compared convergent message-passing algorithms (answering Q5). It also converges much faster than they do. At Level 1 it finishes at a 97 % convergence rate versus 82 % for L2-convex, 68 % for CCCP, and 59 % for residual BP. At Level 2 it achieves at least 17.5, 34.8, and 48.4 % better convergence than L2-convex, CCCP, and residual BP respectively.
Figure 8 (bottom) displays the average KL-divergence (KL) between the approximate and exact node marginals for each algorithm as a function of the number of iterations at the two levels. The results complement those of Fig. 8 (top), again underscoring the promise of GEM-MP for converging to more accurate solutions more rapidly than all other compared algorithms (answering Q5). In the two determinism scenarios, it achieves on average 37.8, 56, and 61.6 % higher quality marginals in terms of the average KL compared to the L2-convex, CCCP, and residual BP methods respectively. It also finishes at a KL-divergence of 0.23 and 0.19 at the two determinism levels respectively. This shows that the marginals obtained by GEM-MP at Level 2 are more accurate than those obtained at Level 1, which is consistent with the results of Experiment II, which demonstrate that GEM-MP provides more robust results when there is more determinism in the model.
6.3.5 Experiment V
This experiment attempts to answer Q6. The goal is to compare the quality of solutions returned by GEM-MP at different initialization settings of marginals: GEM-MP with random initialization (GEM-MP-random) and GEM-MP with uniform initialization (GEM-MP-uniform). We re-ran Experiment I for MLNs and recorded the relative correlations of the average CLL between GEM-MP-random and GEM-MP-uniform. In addition, we re-ran Experiment IV for Ising models and report the relative correlations of the average KL-divergence between GEM-MP-random and GEM-MP-uniform.
Figure 9 shows the quality of marginals obtained from GEM-MP-random relative to the quality of marginals of GEM-MP-uniform as a function of the number of iterations at two determinism levels for Cora (red), Yeast (green), UW-CSE (magenta), and Ising (blue). In each scatter plot the line of best fit indicates that both GEM-MP-random and GEM-MP-uniform yield results of nearly identical quality. Any point below the line means that GEM-MP-uniform was more accurate than GEM-MP-random in that iteration, and the contrary is true if the point is above the line. Overall the results show that none of the initialization settings dominates the other (answering Q6), and that GEM-MP is not sensitive to the initialization settings.
7 Discussion
The experimental results from the previous section suggest that, in terms of both accuracy and scalability, GEM-MP outperforms LBP inference. It improves message-passing inference in two ways. First, it alleviates the threat of non-convergence in the presence of cycles. This is due to making moderate moves in the marginal likelihood space and to Jensen’s inequality, which prevents such moves from overshooting the nearest fixed point. Second, it improves the quality of the approximate marginals obtained in the presence of determinism, which we attribute to using generalized arc consistency to leverage the local entries of factors in order to compute more accurate outgoing messages.
Moreover, GEM-MP performs at least as well as the other state-of-the-art sampling-based inference methods (such as MC-SAT and Gibbs). The goal of MC-SAT is to combine a satisfiability-based method (e.g., SampleSAT) with MCMC-based sampling approaches to remedy the challenges engendered by determinism in the setting of MCMC inference. On one hand, GEM-MP achieves a similar goal, but by integrating a satisfiability-based method (i.e., GAC) with message-passing inference, instead of sampling inference. On the other hand, they completely differ in how they use ideas from satisfiability-oriented methods to deal with the issue of determinism.
From the satisfiability perspective, MC-SAT uses SampleSAT (Wei et al. 2004) to help slice sampling (i.e. MCMC) to near-uniformly sampling a new state given the auxiliary variables. This provides MC-SAT with the ability to rapidly jump between breaking modes, and thus it avoids the local search in MCMC inference from being trapped in isolated modes. Accordingly, one of the limitation of MC-SAT is that it applies a stochastic greedy local search procedure which is unable to make large moves in the state-space between isolated modes. This may affect its capacity to converge to accurate results. Conversely, at a high level, GEM-MP optimizes the setting of parameters with respect to a distribution over hidden variables that captures the relative weights of samples (i.e., the valid local entries) that are generated by individual variables in closed form. Thereby it performs a sort of gradient descent/ascent local search procedure. This gives GEM-MP an advantage in converging to more accurate results than MC-SAT, though MC-SAT is more likely to converge faster than GEM-MP. This could explain the great success of GEM-MP over MC-SAT on most of the experiments (MC-SAT only surpassed GEM-MP on the Cora dataset in experiment III). But we have to remember that, during the training phase, we trained the models by applying a preconditioned scaled conjugate gradient (PSCG) algorithm which uses MC-SAT for its inference step. This in turn gave an advantage to the MC-SAT algorithm when performing inference in the testing phase.
Gibbs is only reliable when neither determinism nor near-determinism is present. LBP, for its part, deteriorates in the presence of determinism and near-determinism, and also when cycles are present. Thus if LBP gets stuck in cycles with determinism, it may be lodged there forever. However, if Gibbs hits a local optimum, it would eventually leave, even though it may take considerable time. This could explain the success of Gibbs over LBP. But with the increase of determinism in the model, Gibbs loses out to LBP, as seen in the case of the Yeast dataset in experiment I. Thus determinism apparently has a stronger effect on Gibbs than on LBP in this experiment.
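The effect of determinism on Gibbs sampling can be illustrated with a minimal Python sketch. It uses a toy two-variable model that is not one of the models evaluated above; the function name `gibbs_equivalence` and the hard constraint \(x_1 \Leftrightarrow x_2\) are assumptions made purely for illustration. Both satisfying states are equally probable, so the true marginal of each variable is 0.5, yet the chain can never leave its starting state because every conditional is deterministic.

```python
import random

def gibbs_equivalence(x1, x2, steps, rng):
    """Gibbs sampling over two binary variables under the hard constraint x1 <=> x2.

    The satisfying states (0, 0) and (1, 1) have equal probability, so the
    true marginal P(x1 = 1) is 0.5.
    """
    visited = []
    for _ in range(steps):
        # Resample x1 given x2: only x1 == x2 has non-zero probability.
        p_x1_is_1 = 1.0 if x2 == 1 else 0.0
        x1 = 1 if rng.random() < p_x1_is_1 else 0
        # Resample x2 given x1: the same deterministic conditional applies.
        p_x2_is_1 = 1.0 if x1 == 1 else 0.0
        x2 = 1 if rng.random() < p_x2_is_1 else 0
        visited.append((x1, x2))
    return visited

# Start in the state (0, 0) and run the chain (hypothetical settings).
samples = gibbs_equivalence(0, 0, steps=1000, rng=random.Random(0))
estimate = sum(x1 for x1, _ in samples) / len(samples)
print(set(samples), estimate)
```

In this sketch the estimated marginal stays at the value dictated by the initial state rather than approaching 0.5. With near-determinism (a large but finite weight) the chain would eventually cross to the other mode, but possibly only after a very long time.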
Furthermore, GEM-MP performs better than the other state-of-the-art convergent message-passing inference algorithms such as L2-convex, CCCP and damped residual BP. The goal of L2-convex is to convexify the Bethe free energy to guarantee that BP converges to an accurate local minimum. The CCCP algorithm uses a convex-concave Bethe energy to achieve the same purpose. On the one hand, GEM-MP achieves a similar purpose by optimizing a concave variational free energy, which is a lower bound to the model evidence. On the other hand, it additionally leverages the determinism and therefore, while the presence of determinism in a model can hinder the performance and convergence behaviour of both L2-convex and CCCP in reaching a local minimum, it increases the possibility that GEM-MP converges to an accurate one.
Overall the experimental results suggest that the initialization of GEM-MP does not significantly matter in practice since the correlation of two initialization settings (i.e., uniform and random) is often moderately positive on average. While we believe that it is important to have a good initialization to ensure that the local minimum that is found is sufficiently close to the global minimum, it seems that a good initialization will depend on the model and data. Therefore in some cases either random or uniform initialization will suffice, whilst in others it may be necessary to use a heuristic. Generally speaking it appears however that GEM-MP is able to reach an accurate result given any initialization, possibly at the expense of a minor increase in computation time.
From the scalability point of view, although Singla (2012) conjectured that lifted inference may subsume lazy inference, a clear relationship between lifted inference and lazy inference still eludes us. Our experimental results show that neither one was able to dominate the other. On one hand, lazy inference exploits sparseness to ground the network lazily, and therefore greatly reduces both the memory and the time needed for inference. But lazy inference still works at the propositional level, in the sense that the basic units during inference are ground clauses. In contrast, lifted inference exploits a key property of first-order logic to allow answering queries without materializing all the objects in the domain. On the other hand, lifted inference requires the network to have a specific symmetric structure, which is not always the case in real-world applications and, in addition, in the presence of evidence most models are not liftable because evidence breaks symmetries. Thus at a high level the structure of the model network plays a significant role in the scalability of inference through two factors: symmetry and sparseness. If the model is extremely sparse then one can expect lazy inference to be more scalable. Lifted inference dominates when symmetry prevails in the model’s structure.
8 Related work
Belief propagation (BP) was developed by Pearl (1988) as an inference procedure for singly connected belief networks. Pearl was the first to observe that running LBP leads to incorrect results on multi-connected networks. Conversely, other work (such as McEliece et al. 1998; Frey and MacKay 1998) has shown success with LBP on loopy networks for turbo code applications. Further, Murphy et al. (1999) reported that LBP can provide good results on graphs with loops. These promising results shed light on evaluating the performance of BP in other applications and suggest the value of a closer study of its behavior for understanding the reasons for this success. Accordingly, several formulations of LBP have appeared, such as the direct implementation in a factor graph by Kschischang et al. (2001), tree-weighted BP (Wainwright et al. 2003) and the generalized cluster graph method of Mateescu et al. (2010). Most such formulations were influenced by the admirable analysis of Yedidia et al. (2003), who proved a relationship between LBP and the Bethe approximation such that the local minima of the Bethe free energy are the fixed points of LBP. Complementing this, further analysis has also explored LBP’s relationship to variational approximations (Yedidia et al. 2005). This pioneering work outlined new research directions for a deeper understanding of and improvements to LBP.
In this paper, by relying on a variational formulation, our algorithm optimizes variational bounds on the model evidence and it implicitly guarantees not to overstep a local minimum. “Still, loopy belief propagation can fail to converge, and apparently for two different reasons. The first rather innocent one is a too large step size, similar to taking a too large ‘learning parameter’ in gradient-descent learning.”
LBP and Bethe free energy Here, mainstream work attempts to derive new types of LBP for approximate inference by directly optimizing the Bethe energy functional, such as the double loop algorithm (Yuille 2001). However, the main disadvantage of this algorithm is that it requires solving an optimization problem at each iteration, which results in a slower convergence. Another class of algorithm is known as cluster-graph BP, which runs LBP on sub-trees of the cluster graph. These algorithms exhibit faster convergence and introduce a new way of characterizing the connections between LBP and optimization problems based on the energy functional. Consequently, several works appeared which generalized the class of LBP by introducing variants of the energy functional that improve the convergence of LBP. For instance, Wainwright and Jordan (2003) and Nguyen et al. (2004) proposed a convexified free energy that provides an upper bound on the partition functions. But the algorithms that have been built on this energy functional still cannot guarantee convergence. Recently, alternative algorithms have been introduced to guarantee convergence for such energy functionals (Hazan and Shashua 2008; Meltzer et al. 2009; Globerson and Jaakkola 2007; Hazan and Shashua 2010).
At a high level, our GEM-MP approach resembles previously mentioned approaches in that it is based on variational inference and involves minimizing a free-energy functional.
It remains unclear if there is a relationship between determinism and the uniqueness of local minima of LBP. However, our experiments here support prior work that has also observed that applying LBP on graphical models with determinism and cycles is more likely to oscillate or converge to wrong results.
LBP and Constraint propagation Horsch and Havens (2000) proposed an algorithm that is a generalization of arc consistency used in constraint reasoning, and a specialization of the LBP used for probabilistic reasoning. The idea was to exploit the relationship between LBP and arc consistency to compute the solution probabilities, which can be then used as a heuristic to guide constructive search algorithms to solve binary CSPs. The bucket-elimination procedure was proposed by Dechter and Mateescu (2003). However it is known that such a procedure has a time and space complexity that is exponential in the induced width of the problem graph, related to the processing order of variables and to how densely these variables are connected to each other. Alternatively, Mateescu et al. (2010) presented approaches that are based on constructing a relationship between LBP and constraint propagation techniques. One idea underlying these approaches is to transform the loopy graph into a tree-like structure to alleviate the presence of cycles, and then to exploit constraint propagation techniques to tackle the determinism. Building on these ideas we explore the second research hypothesis: constraint satisfaction techniques might be able to help address the challenges resulting from determinism in the graphical models.
A recent extension of such approaches is the combination of LBP, constraint propagation, and expectation maximization to derive an efficient heuristic search for solving both satisfiability problems (Hsu et al. 2007, 2008) and constraint satisfaction problems (Le Bras et al. 2009). Although these algorithms perform well in finding solutions, they apply only to graphical models that have no probabilistic knowledge. In contrast, our GEM-MP method is able to handle probabilistic knowledge.
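As a rough illustration of how arc consistency can expose determinism, the following Python sketch filters the domains of the variables in a single hard clause, keeping only values that appear in at least one satisfying tuple. It is a generic, brute-force form of generalized arc consistency for one constraint, not the specific procedure used inside GEM-MP; the function `gac_filter`, its arguments, and the example clause are all hypothetical.

```python
from itertools import product

def gac_filter(scope, domains, is_satisfied):
    """Prune values that have no supporting tuple in a single constraint.

    scope: list of variable names in the constraint.
    domains: dict mapping each variable to a set of candidate values.
    is_satisfied: function taking a dict assignment and returning a bool.
    Returns (pruned domains, list of valid tuples as dicts).
    """
    # Enumerate every tuple allowed by the current domains; keep the valid ones.
    valid = []
    for values in product(*(sorted(domains[v]) for v in scope)):
        assignment = dict(zip(scope, values))
        if is_satisfied(assignment):
            valid.append(assignment)
    # A value survives only if at least one valid tuple uses it.
    pruned = {v: {t[v] for t in valid} for v in scope}
    return pruned, valid

# Hard clause (not A) or B, with evidence fixing A to true.
domains = {"A": {True}, "B": {True, False}}
pruned, valid = gac_filter(["A", "B"], domains, lambda t: (not t["A"]) or t["B"])
print(pruned, valid)
```

Enumerating all tuples is exponential in the size of the clause, so this sketch is only meant for small scopes. Here the value `False` for `B` has no support once `A` is fixed to true, so it is removed from the domain of `B`.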
Damped LBP Another traditional research area to handle non-convergence has involved dampening the marginals (Koller and Friedman 2009) in order to diminish oscillation. However, in many cases, the dampening causes LBP to converge but often yields a poor quality result (Mooij and Kappen 2005). This is because the correct results are not usually in the average point (Murphy et al. 1999). The second track of this research direction is to alleviate double counting by changing the schedule of updating messages [e.g., sequentially on an Euler path, as per Yeang (2010), residual BP, as per Elidan et al. (2006), among others] as well as adapting the initialization of the marginals [e.g., restart with different initializations, as per Koller and Friedman (2009)]. However, this cannot guarantee convergence since the algorithm still runs the risk of overshooting the nearest local minimum. In contrast, a key aspect of GEM-MP is that its iterations are constrained by the variational inequality, and therefore updates to distributions over hidden variables are done in such a way that the variational lower bound never exceeds the log marginal likelihood.
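The damping idea can be sketched in a few lines of Python. The sketch below shows one common form, a convex combination of the previous and the newly computed message followed by renormalization; the function names and the `damping` parameter are assumptions for illustration and are not taken from any of the cited implementations.

```python
def normalize(message):
    """Rescale a non-negative message so that its entries sum to one."""
    total = sum(message)
    return [m / total for m in message]

def damped_update(old_message, new_message, damping):
    """Blend the previous message with the freshly computed one.

    damping is assumed to lie in [0, 1); damping = 0 recovers the plain update.
    """
    mixed = [damping * o + (1.0 - damping) * n
             for o, n in zip(old_message, new_message)]
    return normalize(mixed)

# Two messages that would otherwise make the belief flip on every iteration.
old = [0.9, 0.1]
new = [0.1, 0.9]
half = damped_update(old, new, damping=0.5)
print(half)
```

With a damping factor of 0.5 the update lands midway between the two oscillating messages, which shows both why damping reduces oscillation and why the averaged point need not be the correct answer.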
Re-parameterized LBP More recently, Smith and Gogate (2014) introduced a new approach aimed at dealing with determinism more effectively. The idea of this approach is to re-parameterize the Markov network by changing the entry in a factor that has zero to any non-negative real value in such a way that the LBP algorithm converges faster. Our GEM-MP also addresses the problem of determinism by improving message-passing inference to deal with determinism and cycles more effectively, but our approach is different being rooted in both variational techniques and leveraging generalized arc consistency.
LBP and variational methods Another research area combines message-passing with other variational methods to produce new types of LBP that can guarantee convergence. For example, Winn and Bishop (2005) presented variational message passing as a way to view many variational inference techniques, and it represents a general purpose algorithm for approximate inference. This algorithm performs well when applied to conjugate-exponential family models. Weinman et al. (2008) proposed a sparse variational message passing algorithm to dramatically accelerate the approximate inference needed for parameter optimization related to the problem of stereo vision. Dauwels et al. (2005) proposed a generic form of structured variational message-passing and investigated a message-passing formulation of EM. Our GEM-MP method can be seen as akin to these message-passing inference methods. But a basic aspect of GEM-MP is the exploitation of ideas from constraint satisfaction (CS) to handle the challenges stemming from determinism.
Lifted LBP Another promising research area that has been recently explored seeks to improve the scalability of LBP on models that feature large networks. Here, mainstream work attempts to exploit some structural properties in the network like symmetry (Ahmadi et al. 2013), determinism (Papai et al. 2011; Ibrahim et al. 2015), sparseness (Poon et al. 2008), and type hierarchy (Kiddon and Domingos 2011) to scale LBP inference. For instance, lifted inference either directly operates on the first-order structure or uses the symmetry present in the structure of the network to reduce its size (e.g., Ahmadi et al. 2013). In this context, the key idea is to deal with groups of indistinguishable variables rather than individual variables. Poole (2003) was one of the first to show that variable elimination can be lifted to avoid propositionalization. This has been extended with some lifted variants of the algorithm proposed by De Salvo Braz et al. (2005) and Milch et al. (2008). Subsequently, Singla and Domingos (2008) proposed the first lifted version of LBP, which has been extended by Sen et al. (2009), and generalized with the emergence of the color message-passing algorithm introduced by Kersting et al. (2009) for approximating the computational symmetries. Subsequently, it was shown by Gogate and Domingos (2011) that to avoid dissipating the capabilities of first-order theorem proving, we have to take the logical structure into consideration. Based on that, lifted variants of weighted model counting have been proposed by Gogate and Domingos (2011), while variants of lifted knowledge compilation such as the bisimulation-based algorithm were introduced by Van den Broeck et al. (2011). Later on, it was observed that in some cases the constructed lifted network can itself be quite large, making it very close in size to the fully propositionalized one, and yielding no speedup by lifting the inference.
The interesting argument proposed by Kersting (2012) concludes that the evidence problem could be the reason: symmetries within models easily break down when variables become correlated by virtue of depending asymmetrically on evidence, and thus lifting produces models that are often not far from propositionalized ones, diminishing the power of lifted inference. Thus, one can obtain better lifting by performing shattering as needed during BP inference, as in the anytime BP proposed by De Salvo Braz et al. (2009), by exploiting the model’s symmetries before obtaining the evidence, as demonstrated by Bui et al. (2012), or by shattering a model into local pieces and then iteratively handling the pieces independently and re-combining the parameters from each piece, as explained by Ahmadi et al. (2013). Recently, Gogate et al. (2012) showed that the evidence problem with lifted inference can be solved when applied to importance sampling algorithms by using an informed distribution derived from a compressed representation of the MLN. Our approach differs from the above lifted message-passing algorithms in being built on a propositional basis, but it could readily be combined with them to lift its inference.
9 Conclusion and future work
Our work has targeted the less studied issue of the use of LBP and message passing techniques in probabilistic models possessing both cycles and determinism. To fully exploit determinism as opposed to having determinism posing a problem for inference, we have examined some of the intricacies of message passing algorithms. The novelty of our work lies in the proposal and exploration of an approach which we have named Generalized arc-consistency Expectation-Maximization Message-Passing (GEM-MP), a message-passing algorithm that applies a form of variational approximate inference in an extended form of an underlying graphical model. We have focused our experiments on Markov logic, but our method is easily generalized to other graphical models. To demonstrate the ease of generalizing our approach, we have also presented results using Ising models and we find that our method outperforms a variety of state-of-the-art techniques. The rules of GEM-MP can be viewed as a free energy minimization method whose successive updates form a path of bounded steps to the nearest local minimum in the space of approximate marginals. Using entity resolution and link prediction problems, we have experimentally validated the effectiveness of GEM-MP for converging to more accurate marginals and addressed the limitations of LBP engendered by the presence of cycles and determinism.
As with other variational methods, much of the strength of our method is a consequence of Jensen’s inequality, which enables variational message-passing inference to estimate marginals, through the optimization of variational parameters, by tightening a lower bound on the model’s marginal likelihood at each approximate marginal update, such that we cannot overshoot the underlying true marginal likelihood. We believe this effect alleviates the threat of non-convergence due to cycles. In addition, generalized arc consistency, which is effective at handling logical structure, can be used to exploit structure in the problem that is not normally available to a more naive message-passing algorithm. In so doing, our formulation transforms determinism from a limitation into an advantage from the perspective of GEM-MP.
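For reference, the standard form of this bound can be written out as follows. This is a generic sketch of the Jensen's-inequality argument using the symbols that appear in the footnotes (\(q(\mathcal{X},\mathcal{Y})\), \(P(\mathcal{O},\mathcal{X},\mathcal{Y}\mid\mathcal{M})\)); the entropy symbol \(H\) is the usual Shannon entropy, and the derivation is not a restatement of the specific GEM-MP update rules.

\[
\log P(\mathcal{O}\mid\mathcal{M})
= \log \sum_{\mathcal{X},\mathcal{Y}} q(\mathcal{X},\mathcal{Y})\,
\frac{P(\mathcal{O},\mathcal{X},\mathcal{Y}\mid\mathcal{M})}{q(\mathcal{X},\mathcal{Y})}
\;\ge\;
\sum_{\mathcal{X},\mathcal{Y}} q(\mathcal{X},\mathcal{Y})\,
\log \frac{P(\mathcal{O},\mathcal{X},\mathcal{Y}\mid\mathcal{M})}{q(\mathcal{X},\mathcal{Y})}
= E_{q}\big[\log P(\mathcal{O},\mathcal{X},\mathcal{Y}\mid\mathcal{M})\big] + H(q).
\]

The inequality follows from the concavity of the logarithm, and the gap between the two sides equals \(\textit{KL}\big[q(\mathcal{X},\mathcal{Y}) \,||\, P(\mathcal{X},\mathcal{Y}\mid\mathcal{O},\mathcal{M})\big] \ge 0\), so no choice of \(q\) can push the bound above the true log marginal likelihood.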
These explorations point to a number of promising directions for future work. We plan to evaluate the use of GEM-MP as an inference subroutine for learning. Also, we intend to investigate the lifted (Ahmadi et al. 2013; Singla et al. 2010) and the lazy (Poon et al. 2008) versions of GEM-MP to enhance its scalability. Finally, we intend to increase the accuracy of GEM-MP by deriving new update rules that apply a global approximation for \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) in the \(M_{q\small {({\mathcal {Y}})}}\)-step of GEM-MP.
Footnotes
- 1.
In practice, the transitivity rules are assigned very high weights, which complicates the inference.
- 2.
Note that the geometric average of values \(\left\{ a_i\right\} _{i=1}^{n}\) is computed as \(\sqrt[n]{\prod _i a_i}\). Without \(\sqrt[n]{\cdot }\), it becomes a coarse geometric average because the result would be an extreme value.
- 3.
Note that “log” is a concave function and it can play an important role in maintaining convergence via Jensen’s inequality, as will be explained further on.
- 4.
This is equivalent to finding the \(\large {q}({\mathcal {X}},{\mathcal {Y}})\) that minimizes the KL divergence between the true distribution and its approximation: \(\large {q}^*{\small {({\mathcal {X}},{\mathcal {Y}})}} = \underset{q}{{\text {argmin}}} \,\, \textit{KL} \big [\large {q}({\mathcal {X}},{\mathcal {Y}}) \, || \, P({\mathcal {X}}, {\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}}) \big ]\).
- 5.
An additional source of intractability arises in many models (e.g., SRL models) in which the number of hidden variables is very large. For instance, a model with N binary hidden variables generally requires a distribution over all \(2^N\) possible states of those variables. So even for moderately large N this results in computational intractability.
- 6.
Note that when the optimization is over the parameters \({\mathcal {T}}_{\small {{\mathcal {Y}}}}\), that affects the function \(\large {H}(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}))\) and not \(\large {H}(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}))\), i.e., only \(\large {H}(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}))\) is dropped.
- 7.
This is equivalent to converging to a local minimum of the negative free energy functional \(-{\mathcal {F}}_{\small {{\mathcal {M}}}}\), which is a stable local minimum with respect to an inference task on the original factor graph.
- 8.
Note that the local approximation means that we handle mega-nodes individually. This appears in the factorization of \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) into independent distributions [refer to Eq. (23)].
- 9.
Note that: \(\large {E}_{\large {q}(\small {{\mathcal {X}}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] \propto \large {E}_{\large {q}(\small {{\mathcal {X}}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \big [\log P({\mathcal {O}}, {\mathcal {Y}} | {\mathcal {X}}, {\mathcal {M}}) \big ]\).
- 10.
Note that since we marginalize extended factors over their mega-nodes then it is sufficient to work directly with original factors in the original factor graph, which here are the hard ground clauses. In addition, we can eliminate the observed variable \(\bar{O}=1\) from \(\log P({\mathcal {O}}, {\mathcal {Y}} | {\mathcal {X}}, {\mathcal {M}})\) by explicitly considering only the valid local entries of the hard ground clauses.
- 11.
- 12.
Note that \(l_1 \Leftrightarrow l_2\) converted into CNF gives two clauses: \((\lnot l_1\vee l_2) \wedge (l_1 \vee \lnot l_2)\).
- 13.
Publicly available: http://alchemy.cs.washington.edu/data/.
- 14.
These algorithms run on the original factor graph \({\mathcal {G}}\).
- 15.
Alchemy v0.2 is publicly available at: http://alchemy.cs.washington.edu/.
- 16.
Primarily labeled by Andrew McCallum (https://www.cs.umass.edu/~Cmccallum/data/cora-refs.tar.gz) and recently cleaned up and split into five subsets for cross-validation by Singla and Domingos (2006).
- 17.
Originally prepared by the Munich Information Center for Protein Sequence.
- 18.
Available at: http://alchemy.cs.washington.edu/data/uw-cse/.
- 19.
Note that for the Cora dataset the construction of the grounded network required for inference takes about 185 min on average.
- 20.
With no determinism the hard update rule of GEM-MP is not being used.
- 21.
For simplicity, suppose that we want to compute their marginal probability for all variables \(\{X_1,\ldots ,X_N\}\).
Notes
Acknowledgments
We are grateful to Henry Kautz for helpful discussions. Thanks to Aniruddh Nath for providing help with some datasets and models. The authors would like to thank the anonymous reviewers for their constructive comments and suggestions. This work was funded by Natural Science and Engineering Research Council (NSERC) Discovery Grants Program and the Egyptian Cultural Affairs and Mission Sector. We sincerely thank these sponsors.
References
- Ahmadi, B., Kersting, K., Mladenov, M., & Natarajan, S. (2013). Exploiting symmetries for scaling loopy belief propagation and relational training. Machine Learning, 92(1), 91–132.
- Bach, F. R., & Jordan, M. I. (2001). Thin junction trees. In Proceedings of the 14th conference on neural information processing systems: Advances in neural information processing systems 14 (NIPS-2001) (pp. 569–576). MIT Press.
- Beal, M. J., & Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Statistics, 7, 453–464.
- Bui, H. B., Huynh, T. N., & de Salvo Braz, R. (2012). Exact lifted inference with distinct soft evidence on every object. In Proceedings of the twenty-sixth AAAI conference on artificial intelligence, July 22–26, Toronto, ON, Canada (pp. 1875–1881). AAAI Press.
- Dauwels, J., Korl, S., & Loeliger, H.-A. (2005). Expectation maximization as message passing. In Proceedings of IEEE international symposium on information theory (ISIT 2005), Adelaide Convention Centre Adelaide, Australia (pp. 583–586). IEEE Computer Society.
- Davis, J., & Domingos, P. (2009). Deep transfer via second-order Markov logic. In Proceedings of the 26th annual international conference on machine learning (ICML-09). Montreal, QC: ACM.
- De Salvo Braz, R., Amir, E., & Roth, D. (2005). Lifted first-order probabilistic inference. In Proceedings of the 19th international joint conference on artificial intelligence, Edinburgh, Scotland (pp. 1319–1325). AAAI Press.
- De Salvo Braz, R., Natarajan, S., Bui, H., Shavlik, J., & Russell, S. (2009). Anytime lifted belief propagation. In Proceedings of 6th international workshop on statistical relational learning, Leuven, Belgium (Vol. 9, pp. 1–3).
- Dechter, R., & Mateescu, R. (2003). A simple insight into iterative belief propagation’s success. In Proceedings of the nineteenth conference on uncertainty in artificial intelligence (pp. 175–183). Acapulco, Mexico: Morgan Kaufmann Publishers Inc.
- Elidan, G., McGraw, I., & Koller, D. (2006). Residual belief propagation: Informed scheduling for asynchronous message passing. In Proceedings of the twenty-second conference annual conference on uncertainty in artificial intelligence (UAI-06) (pp. 165–173). Arlington, VA: AUAI Press.
- Flach, P. A. (2010). First-order logic. In Encyclopedia of machine learning (pp. 410–415). New York: Springer.
- Frey, B. J., & MacKay, D. J. (1998). A revolution: Belief propagation in graphs with cycles. In Proceedings of the 11th conference on neural information processing systems: Advances in neural information processing systems 11 (NIPS-1998) (pp. 479–485). Morgan Kaufmann.
- Getoor, L., & Taskar, B. (2007). Introduction to statistical relational learning (adaptive computation and machine learning). Cambridge: The MIT Press.
- Globerson, A., & Jaakkola, T. (2007). Convergent propagation algorithms via oriented trees. In Proceedings of the twenty-third conference on uncertainty in artificial intelligence, Vancouver, BC, Canada, July 19–22 (pp. 133–140). AUAI Press.
- Gogate, V., & Domingos, P. (2011). Probabilistic theorem proving. In Proceedings of the twenty-seventh conference annual conference on uncertainty in artificial intelligence (UAI-11) (pp. 256–265). Corvallis, OR: AUAI Press.
- Gogate, V., Jha, A. K., & Venugopal, D. (2012). Advances in lifted importance sampling. In Proceedings of the twenty-sixth AAAI conference on artificial intelligence, July 22–26, 2012 (pp. 1910–1916). Toronto, ON: AAAI Press.
- Hazan, T., & Shashua, A. (2008). Convergent message-passing algorithms for inference over general graphs with convex free energies. In Proceedings of the 24th conference in uncertainty in artificial intelligence, Helsinki, Finland, July 9–12 (pp. 264–273).
- Hazan, T., & Shashua, A. (2010). Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12), 6294–6316.
- Heskes, T. (2002). Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In Proceedings of the 15th conference on neural information processing systems, Vancouver, BC, Canada, December 9–14: Advances in neural information processing systems 15 (NIPS-2002) (pp. 343–350). Curran Associates Inc.
- Heskes, T. (2004). On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11), 2379–2413.
- Horsch, M. C., & Havens, W. S. (2000). Probabilistic arc consistency: A connection between constraint reasoning and probabilistic reasoning. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence (pp. 282–290). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
- Hsu, E. I., Kitching, M., Bacchus, F., & McIlraith, S. A. (2007). Using expectation maximization to find likely assignments for solving CSP’s. In Proceedings of 22nd national conference on artificial intelligence (AAAI ’07) (Vol. 22, pp. 224–232). Vancouver, Canada: AAAI Press.
- Hsu, E. I., Muise, C., Beck, J. C., & McIlraith, S. A. (2008). Probabilistically estimating backbones and variable bias. In Proceedings of 14th international conference on principles and practice of constraint programming (CP ’08) (pp. 613–617). Sydney, Australia: Springer.
- Huynh, T. N., & Mooney, R. J. (2009). Max-margin weight learning for Markov logic networks. In Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases, Part 1, Bled, Slovenia, September 7–11 (Vol. 5781, pp. 564–579). Springer.
- Huynh, T. N., & Mooney, R. J. (2011). Online max-margin weight learning for Markov logic networks. In Proceedings of SIAM-11 international conference on data mining (pp. 642–651). Mesa, AZ: SIAM/Omnipress.
- Ibrahim, M.-H., Pal, C., & Pesant, G. (2015). Exploiting determinism to scale relational inference. In Proceedings of the twenty-ninth national conference on artificial intelligence (AAAI’15), January 25–30, 2015 (pp. 1756–1762). Austin, TX: AAAI Press.
- Kersting, K. (2012). Lifted probabilistic inference. In Proceedings of 20th European conference on artificial intelligence (ECAI–2012), August 27–31 (pp. 33–38). Montpellier, France: IOS Press: ECCAI.
- Kersting, K., Ahmadi, B., & Natarajan, S. (2009). Counting belief propagation. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, Montreal, Quebec, June 18–21 (pp. 277–284). AUAI Press.
- Kiddon, C., & Domingos, P. (2011). Coarse-to-fine inference and learning for first-order probabilistic models. In Proceedings of the twenty-fifth AAAI conference on artificial intelligence, San Francisco, CA, USA, August 7–11 (pp. 1049–1056). AAAI Press.
- Kok, S., Singla, P., Richardson, M., Domingos, P., Sumner, M., Poon, H., & Lowd, D. (2007). The alchemy system for statistical relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA. http://alchemy.cs.washington.edu.
- Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. Cambridge: MIT Press.
- Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47, 498–519.
- Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society Series B (Methodological), 50, 157–224.
- Le Bras, R., Zanarini, A., & Pesant, G. (2009). Efficient generic search heuristics within the EMBP framework. In Proceedings of the 15th international conference on principles and practice of constraint programming (CP’09), Lisbon, Portugal (pp. 539–553). Berlin: Springer.
- Lowd, D., & Domingos, P. (2007). Efficient weight learning for Markov logic networks. In Proceedings of 11th European conference on principles and practice of knowledge discovery in databases (PKDD 2007), Warsaw, Poland, September 17–21 (pp. 200–211). Springer.
- Mateescu, R., Kask, K., Gogate, V., & Dechter, R. (2010). Join-graph propagation algorithms. Journal of Artificial Intelligence Research, 37, 279–328.
- McEliece, R. J., MacKay, D. J. C., & Cheng, J.-F. (1998). Turbo decoding as an instance of Pearl’s belief propagation algorithm. IEEE Journal on Selected Areas in Communications, 16, 140–152.
- Meltzer, T., Globerson, A., & Weiss, Y. (2009). Convergent message passing algorithms - A unifying view. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, Montreal, QC, Canada, June 18–21 (pp. 393–401). AUAI Press.
- Milch, B., Zettlemoyer, L. S., Kersting, K., Haimes, M., & Kaelbling, L. P. (2008). Lifted probabilistic inference with counting formulas. In Proceedings of the twenty third conference on artificial intelligence (Vol. 8, pp. 1062–1068). Chicago, IL: AAAI Press.
- Mooij, J. M., & Kappen, H. J. (2005). Sufficient conditions for convergence of loopy belief propagation. In Proceedings of the 21st annual conference on uncertainty in artificial intelligence (UAI-05), Edinburgh, Scotland, July 26-29 (pp. 396–403). AUAI Press.Google Scholar
- Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the fifteenth annual conference on uncertainty in artificial intelligence (UAI-99), Stockholm, Sweden (pp. 467–476). Morgan Kaufmann.
- Neal, R. M., & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models (pp. 355–368). MIT Press.
- Nguyen, X., Wainwright, M. J., & Jordan, M. I. (2004). Decentralized detection and classification using kernel methods. In Proceedings of the twenty-first international conference on machine learning (ICML) (Vol. 69, pp. 80–88). Banff, Canada: ACM.
- Papai, T., Kautz, H. A., & Stefankovic, D. (2012). Slice normalized dynamic Markov logic networks. In Proceedings of the 26th conference on neural information processing systems, December 3–8, Harrahs and Harveys, Lake Tahoe: Advances in Neural Information Processing Systems (Vol. 25, pp. 1916–1924). Curran Associates Inc.
- Papai, T., Singla, P., & Kautz, H. (2011). Constraint propagation for efficient inference in Markov logic. In Proceedings of the 17th international conference on principles and practice of constraint programming (CP 2011), Perugia, Italy, September 12–16 (pp. 691–705). Springer.
- Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo: Morgan Kaufmann.
- Poole, D. (2003). First-order probabilistic inference. In Proceedings of the 18th international joint conference on artificial intelligence (IJCAI’03) (Vol. 3, pp. 985–991). Acapulco, Mexico: Morgan Kaufmann Publishers Inc.
- Poon, H., & Domingos, P. (2006). Sound and efficient inference with probabilistic and deterministic dependencies. In Proceedings of the 21st national conference on artificial intelligence, July 16–20 (Vol. 1, pp. 458–463). Boston, MA: AAAI Press.
- Poon, H., Domingos, P., & Sumner, M. (2008). A general method for reducing the complexity of relational inference and its application to MCMC. In Proceedings of the twenty-third AAAI conference on artificial intelligence, Chicago, IL, July 13–17 (pp. 1075–1080). AAAI Press.
- Potetz, B. (2007). Efficient belief propagation for vision using linear constraint nodes. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR’07) (pp. 1–8). Minneapolis, MN: IEEE Computer Society.
- Richardson, M., & Domingos, P. (2006). Markov logic networks. Machine Learning, 62(1–2), 107–136.
- Roosta, T., Wainwright, M. J., & Sastry, S. S. (2008). Convergence analysis of reweighted sum-product algorithms. IEEE Transactions on Signal Processing, 56(9), 4293–4305.
- Rossi, F., Van Beek, P., & Walsh, T. (2006). Handbook of constraint programming. New York: Elsevier.
- Sen, P., Deshpande, A., & Getoor, L. (2009). Bisimulation-based approximate lifted inference. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, Montreal, Canada, June 18–21.
- Shavlik, J., & Natarajan, S. (2009). Speeding up inference in Markov logic networks by preprocessing to reduce the size of the resulting grounded network. In Proceedings of the 21st international joint conference on artificial intelligence (pp. 1951–1956). Pasadena, CA: IJCAI Organization.
- Shi, X., Schonfeld, D., & Tuninetti, D. (2010). Message error analysis of loopy belief propagation. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP 2010), March 14–19 (pp. 2078–2081). Dallas, TX: IEEE Computer Society.
- Singla, P. (2012). Markov logic networks: Theory, algorithms and applications. In Proceedings of the 18th international conference on management of data, Computer Society of India (pp. 15–150).
- Singla, P., & Domingos, P. (2006). Entity resolution with Markov logic. In Proceedings of the sixth international conference on data mining (ICDM’06), Hong Kong, China, 18–22 December (pp. 572–582). IEEE Computer Society.
- Singla, P., & Domingos, P. (2008). Lifted first-order belief propagation. In Proceedings of the twenty-third AAAI conference on artificial intelligence, Chicago, IL, July 13–17 (pp. 1094–1099). AAAI Press.
- Singla, P., Nath, A., & Domingos, P. (2010). Approximate lifted belief propagation. In Proceedings of the twenty-fourth AAAI conference on artificial intelligence, Atlanta, Georgia, USA, July 11–15, 2010 (pp. 92–97). AAAI Press.
- Smith, D., & Gogate, V. (2014). Loopy belief propagation in the presence of determinism. In Proceedings of the seventeenth international conference on artificial intelligence and statistics, April 22–25 (Vol. 33, pp. 895–903). Reykjavik, Iceland: JMLR: W&CP.
- Van den Broeck, G., Taghipour, N., Meert, W., Davis, J., & De Raedt, L. (2011). Lifted probabilistic inference by first-order knowledge compilation. In Proceedings of the twenty-second international joint conference on artificial intelligence, Barcelona, Catalonia, Spain, 16–22 July (pp. 2178–2185). AAAI Press.
- van Hoeve, W. J., Pesant, G., & Rousseau, L.-M. (2006). On global warming: Flow-based soft global constraints. Journal of Heuristics, 12(4–5), 347–373.
- Venugopal, D., & Gogate, V. (2014a). Evidence-based clustering for scalable inference in Markov logic. In Proceedings of the 7th European conference on machine learning and data mining conference (ECML PKDD 2014), Nancy, France, September 15–19 (pp. 258–273). Springer.
- Venugopal, D., & Gogate, V. G. (2014b). Scaling-up importance sampling for Markov logic networks. In Proceedings of the 28th conference on neural information processing systems, 8–13 December, Montreal, Canada: Advances in Neural Information Processing Systems 27 (NIPS 2014) (pp. 2978–2986). Curran Associates Inc.
- Wainwright, M., & Jordan, M. (2003). Semidefinite relaxations for approximate inference on graphs with cycles. In Proceedings of the 17th conference on neural information processing systems: Advances in neural information processing systems 16 (NIPS-2003) (pp. 369–376). MIT Press.
- Wainwright, M., Jaakkola, T., & Willsky, A. (2003). Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49(5), 1120–1146.
- Wei, W., Erenrich, J., & Selman, B. (2004). Towards efficient sampling: Exploiting random walk strategies. In Proceedings of the nineteenth national conference on artificial intelligence, July 25–29 (Vol. 4, pp. 670–676). San Jose, CA: AAAI Press.
- Weinman, J. J., Tran, L. C., & Pal, C. J. (2008). Efficiently learning random fields for stereo vision with sparse message passing. In Proceedings of the 10th European conference on computer vision (pp. 617–630). Marseille, France: Springer.
- Winn, J. M. (2004). Variational message passing and its applications. PhD thesis, University of Cambridge.
- Winn, J. M., & Bishop, C. M. (2005). Variational message passing. Journal of Machine Learning Research, 6, 661–694.
- Yeang, C.-H. (2010). Exact loopy belief propagation on Euler graphs. In Proceedings of the 12th international conference on artificial intelligence, Las Vegas, Nevada, USA, July 12–15 (pp. 471–477). CSREA Press.
- Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003). Understanding belief propagation and its generalizations. Exploring Artificial Intelligence in the New Millennium, 8, 236–239.
- Yedidia, J., Freeman, W., & Weiss, Y. (2005). Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 7, 2282–2312.
- Yuille, A. L. (2001). A double-loop algorithm to minimize the Bethe free energy. In Proceedings of the third international workshop on energy minimization methods in computer vision and pattern recognition, INRIA Sophia-Antipolis, France, September 3–5 (pp. 3–18). Springer.
- Yuille, A. L. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14(7), 1691–1722.