1 Introduction

Graphical models that involve cycles and determinism arise in a growing number of applications across research communities, including machine learning, statistical physics, constraint programming, information theory, bioinformatics, and other sub-disciplines of artificial intelligence. Accurate and efficient inference in such graphical models is thus an important issue that impacts a wide range of communities. Inspired by the substantial impact of statistical relational learning (SRL) (Getoor and Taskar 2007), Markov logic (Richardson and Domingos 2006; Singla 2012) has emerged as a powerful formalism for graphical models that makes significant progress towards combining the powers of both first-order logic (Flach 2010) and probability. However, probabilistic inference remains a major bottleneck in Markov logic, and can be problematic for learning when inference is used as a subroutine.

Loopy belief propagation (LBP) is a commonly used message-passing algorithm for performing approximate inference in graphical models, including models obtained by grounding an underlying Markov logic. However, LBP often exhibits erratic behavior in practice. In particular, it is still not well understood when LBP will provide good approximations in the presence of cycles and when models possess both probabilistic and deterministic dependencies. The development of more accurate and stable message-passing inference methods is therefore of great theoretical and practical interest. Perhaps surprisingly, belief propagation achieves good results for coding theory problems with loopy graphs (McEliece et al. 1998; Frey and MacKay 1998). In other applications, however, LBP often suffers from convergence problems. In general, LBP therefore has the following limitation:

Limitation 1 In the presence of cycles, LBP is not guaranteed to converge.

It is known that the fixed points of LBP correspond to stationary points (local optima) of the Bethe free energy, and it has been proven that violating the uniqueness condition for the Bethe free energy generates several local minima (i.e., fixed points) in the space of LBP's marginal distributions (Heskes 2004; Yedidia et al. 2005). From a variational perspective, it is known that if a factor graph has more than one cycle, then the convexity of the Bethe free energy is violated. A graph involving a single cycle has a unique local minimum and usually guarantees the convergence of LBP (Heskes 2004). From the viewpoint of a local search, LBP performs a gradient-descent/ascent search over the marginal space, endeavoring to converge to a local optimum (Heskes 2002). Heskes' viewpoint is that the problem of non-convergence is related to the fact that LBP updates the unnormalized marginal of each variable by computing a coarse geometric average of the incoming messages received from its neighboring factors (Heskes 2002). Under this line of analysis, LBP can make large moves in the space of the marginals, and therefore it becomes more likely to overshoot the nearest local optimum. This produces an orbiting effect and increases the possibility of non-convergence. Other lines of analysis are based on the fact that messages in LBP may circulate around the cycles, which can lead to local evidence being counted multiple times (Pearl 1988). This, in turn, can aggravate the possibility of non-convergence. In practice, non-convergence occasionally appears as oscillatory behavior when updating the marginals (Koller and Friedman 2009).

Determinism plays a substantial role in reducing the effectiveness of LBP (Heskes 2004). For example, hard clauses in a Markov logic lead to deterministic dependencies in the corresponding ground factor graphs, and therefore are particularly challenging for inference with LBP. It has been observed empirically that running LBP on cyclic graphical models with determinism is more likely to result in the two-fold problem of non-convergence or incorrect results (Mooij and Kappen 2005; Koller and Friedman 2009; Potetz 2007; Yedidia et al. 2005; Roosta et al. 2008). A second limitation of LBP can thus be formulated as:

Limitation 2 In the presence of determinism (a.k.a. hard clauses), LBP may yield inaccurate results.

In its basic form, LBP also does not leverage the local structure of factors, handling them as black boxes. Using Markov logic as a concrete example, LBP often does not take into consideration the logical structure of the underlying clauses that define the factors (Gogate and Domingos 2011). Thus, if some of these clauses are deterministic (e.g., hard clauses) or have extremely skewed probabilities, then LBP will be unable to reconcile the clauses. This, in turn, impedes the smoothing out of differences between the messages. The problem is particularly acute for messages that pass through hard clauses lying inside dense cycles. This can drastically elevate oscillations, making it difficult to converge to accurate results, and leading to instability of the algorithm with respect to finding a local minimum (see pages 413-429 of Koller and Friedman 2009 for more details). On the flip side of this issue, Koller and Friedman point out that one can prove that if the factors in a graph are less extreme, such that the skew of the network is sufficiently bounded, this gives rise to a contraction property that guarantees convergence (Koller and Friedman 2009). In our work here we are interested in taking advantage of determinism when it exists in the factors of an underlying graph, in a way that does not increase the threat of non-convergence.

The literature on LBP, which is perhaps the most widely used form of message-passing inference, is heavily influenced by ideas from machine learning (ML) and constraint satisfaction (CS), among other fields. Although LBP has been scrutinized both theoretically and practically in various ways, most existing research either avoids the limitation of determinism when handling cycles, or does not take into consideration the limitation of cycles when handling determinism.

It is well known that techniques such as the junction tree algorithm (Lauritzen and Spiegelhalter 1988) can transform a graphical model into larger clusters of variables such that the clusters satisfy the running intersection property, and that such a structure can then be used to obtain exact inference results. These results also hold when the underlying graphical models possess deterministic dependencies. For many problems, however, the tree width of the resulting junction tree is so large that inference becomes intractable. More recent work has explored the interesting question of how to construct thin junction trees (Bach and Jordan 2001). However, many graphical models derived from a Markov logic, or from problems with complex constraints, quickly lead to junction trees with large tree widths.

In this paper, one of our key objectives is to bring probabilistic artificial intelligence, machine learning, and constraint programming techniques closer together through the lens of variational message-passing inference. That is, to address the limitations of LBP discussed above, we introduce Generalized arc-consistency Expectation-Maximization Message-Passing (GEM-MP), a novel approach to inference for graphical models based on variational message passing and arc consistency within extended factor graphs. In this work we focus on Markov logic and Ising models, but our GEM-MP framework is applicable to other representations, including standard graphical models defined in terms of factor graphs. We achieve this by first re-parameterizing the factor graph in such a way that the inference task of computing the probability of unobserved variables given observed variables can be formulated as a variational message-passing procedure using auxiliary variables in the extended, re-parameterized factor graph. We then take advantage of the fact that procedures such as variational inference and EM can be viewed in terms of free energy minimization. We formulate our message-passing approach as the E and M steps of a variational message-passing technique reminiscent of classical variational EM procedures (Beal and Ghahramani 2003; Neal and Hinton 1999). This variational formulation leads to the synthesis of new rules that update an approximation to the joint conditional distribution, minimizing the Kullback–Leibler (KL) divergence in a way that also maximizes a lower bound on the true model evidence. Since the procedure monotonically decreases the KL divergence, it alleviates Limitation 1 and leads to convergence of the lower bound and the KL divergence to a fixed quantity. Furthermore, we exploit the logical structure within factors by applying a generalized arc-consistency concept (Rossi et al. 2006), using it to perform a variational mean-field approximation when updating the marginals. Since the procedure is cast within a variational framework, variational bounds apply, which ensure the algorithm converges to a local minimum in terms of the KL divergence.

Table 1 An excerpt of Markov logic for the Cora dataset. The atoms SameBib and SameAuthor are unknown. Ar() is an abbreviation for atom Author(), SAr() for SameAuthor(), and SBib() for SameBib(). \(a_1,a_2\) define authors and \(r_1,r_2,r_3\) define citations

We have organized the rest of the paper in the following manner. In Sect. 2, we review some key basic concepts in further detail, including Markov logic, LBP, constraint propagation techniques, variational bounds, expectation maximization (EM), and KL divergences. In Sect. 3, we present the framework of GEM-MP variational inference. In Sect. 4 we then derive GEM-MP's general update rule for Markov logic. In Sect. 5, we generalize GEM-MP's update rules to make them applicable to MRFs. In Sect. 6, we conduct a thorough experimental study. This is followed by a discussion in Sect. 7. In Sect. 8 we examine related work. Finally, in Sect. 9, we present our conclusions and discuss directions for future research. The "Appendix" contains the proofs of all propositions used in the paper.

2 Preliminaries

To set the stage for our work, in this section we provide a more detailed discussion of: Markov logic; belief propagation; constraint satisfaction problems, constraint propagation, and generalized arc consistency; and variational methods. We begin by reviewing Markov logic using the concrete explanatory example presented in Table 1. This example is an excerpt of the knowledge base for the Cora dataset. That is, suppose that we are given a citation database in which each citation has author, title, and venue fields. We need to know which pairs of citations refer to the same publication and which pairs of author names refer to the same author (i.e., both the SameBib and SameAuthor relations are unknown). For simplicity, our objective will be to predict the marginals of the SameBib ground atoms. At this point, let us first establish our basic notation.

Notation A first-order knowledge base (KB) is a set of formulas in first-order logic. Traditionally, as shown in Table 1, it is convenient to convert formulas to clausal form (CNF). After propositional grounding, we get a formula \({\mathcal {F}}\), which is a conjunction of m ground clauses. We use \(f \in {\mathcal {F}}\) to denote a ground clause, which is a disjunction of literals built from \({\mathcal {X}}\), where \({\mathcal {X}} = \left\{ X_1, X_2, \ldots , X_n\right\} \) is a set of n Boolean random variables representing ground atoms. The set \({\mathcal {X}}_{f}\) corresponds to the variables appearing in the scope of a ground clause f. Both “\(+\)” and “−” will be used to denote the positive (true) and negative (false) appearance of ground atoms. We use \(Y_i\) to denote the set of satisfying (or valid) entries of ground clause \(f_i\), and \(y_k \in Y_i, \, k \in \left\{ 1,\ldots , |Y_i| \right\} \) denotes a valid entry in \(Y_i\), where a local entry of a factor is valid if it has non-zero probability. We use \(f^s_i\) (resp. \(f^h_i\)) to indicate that the clause \(f_i\) is soft (resp. hard); the soft and hard clauses are collected in the two sets \({\mathcal {F}}^s\) and \({\mathcal {F}}^h\), respectively. The sets \({\mathcal {F}}_{X_j+}\) and \({\mathcal {F}}_{X_j-}\) include the clauses that contain positive and negative literals for ground atom \(X_j\), respectively. Thus \({\mathcal {F}}_{X_j}={\mathcal {F}}_{X_j+} \cup {\mathcal {F}}_{X_j-}\) denotes the whole of \(X_j\)’s clauses, with cardinality \(\left| {\mathcal {F}}_{X_j}\right| \). For each ground atom \(X_j\), we use \(\beta _{X_j} = \left[ \beta ^+_{X_j},\beta ^-_{X_j}\right] \) to denote its positive and negative marginal probabilities, respectively.

Markov logic (Richardson and Domingos 2006) is a set of first-order logic formulas (or CNF clauses), each of which is associated with a numerical weight w. Larger weights w reflect stronger dependencies; deterministic dependencies thus have the largest weight (\(w \rightarrow \infty \)), in the sense that they must be satisfied. We say that a clause has a deterministic dependency if at least one of its entries has zero probability.

The power of Markov logic appears in its ability to bridge the gap between logic and probability theory. Thus it has become one of the preferred probabilistic graphical models for representing both probabilistic and deterministic knowledge, with deterministic dependencies (for short we say determinism) represented as hard clauses, and probabilistic ones represented as soft clauses.

Fig. 1

Grounded (factor graph) obtained by applying clauses in Table 1 to the constants: \(\left\{ \text {Gilles(G), Chris(C)} \right\} \) for \(a_1\) and \(a_2\); \(\left\{ \text {Citation1}(C_1)\text { and Citation2}(C_2) \right\} \) for \(r_1\), \(r_2\), and \(r_3\). The factor graph involves: 12 ground atoms in which 4 are evidence (dark ovals) and 8 are non-evidence (non-dark ovals); 24 ground clauses wherein 8 are hard (\({\mathcal {F}}^h = \left\{ f_{1},\ldots ,f_{8}\right\} \)) and 16 are soft (\({\mathcal {F}}^s = \left\{ f_9,\ldots ,f_{24}\right\} \))

To understand the semantics of Markov logic, recall the explanatory example in Table 1. In this example, Markov logic enables us to model the KB by using rules such as the following: 1. Regularity rules of the type that say “if the authors are the same, then their records are the same.” This rule is helpful but innately uncertain (i.e., it is not true in all cases). Markov logic considers this rule as soft and attaches a weight to it (say, 1.1); 2. Transitivity rules that state “if one citation is identical to two other citations, then these two other citations are identical too.” These types of rules are important for handling non-unique names of citations. Therefore, we suppose that Markov logic considers these rules as hard and assigns them an infinite weight.

Subsequently, we will represent Markov logic as a factor graph after grounding it using a small set of typed constants (say, for example, \(a_1, a_2 \in \left\{ \text {Gilles(G), Chris(C)} \right\} \), and \(r_1, r_2, r_3 \in \left\{ \text {Citation1}(C_1), \text {Citation2}(C_2) \right\} \)). The output is the factor graph shown in Fig. 1, which is a bipartite graph \(\left( {\mathcal {X}},{\mathcal {F}}=\left\{ {\mathcal {F}}^h,{\mathcal {F}}^s\right\} \right) \) that has a variable node (oval) for each ground atom \(X_j \in {\mathcal {X}}\) (here \({\mathcal {X}}\) includes the ground atoms SBib(\(C_1,C_1\)), SBib(\(C_2,C_1\)), SBib(\(C_2,C_2\)), SBib(\(C_1,C_2\)), Ar(\(C_1,G\)), Ar(\(C_2,G\)), Ar(\(C_1,C\)), Ar(\(C_2,C\)), SAr(\(C,C\)), SAr(\(C,G\)), SAr(\(G,C\)), and SAr(\(G,G\))). If the truth value of a ground atom is known from the evidence database, we mark it as evidence (dark ovals). The graph also involves a factor node for each hard ground clause \(f^h_i \in {\mathcal {F}}^h\) (bold red square) and each soft ground clause \(f^s_i \in {\mathcal {F}}^s\) (non-bold blue square), with an edge linking node \(X_j\) to factor \(f_i\) if \(f_i\) involves \(X_j\). This factor graph compactly represents the joint distribution over \({\mathcal {X}}\) as:

$$\begin{aligned} P\left( X_1,\ldots ,X_n\right) =\frac{1}{\lambda } \prod _{i=1}^{|{\mathcal {F}}^h|} f^h_i \left( {\mathcal {X}}_{f^h_i}\right) \cdot \prod _{i=1}^{|{\mathcal {F}}^s|} f^s_i \left( {\mathcal {X}}_{f^s_i}\right) , \end{aligned}$$
(1)

where \(\lambda \) is the normalizing constant, \(f^s_i\) and \(f^h_i\) are soft and hard ground clauses respectively, and \(|{\mathcal {F}}^h|\) and \(|{\mathcal {F}}^s|\) are the number of hard and soft ground clauses, respectively. Note that, typically, the hard clauses are assigned the same weight (\(w \rightarrow \infty \)). But, without loss of accuracy, we can recast them as factors that allow \(\{0,1\}\) probabilities without recourse to infinity.
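To make Eq. (1) concrete, the following minimal Python sketch evaluates the unnormalized product of ground-clause factors for a single possible world. The clause encoding, atom names, and weight are illustrative assumptions rather than the actual Cora model; soft clauses take the usual Markov logic factor value \(e^w\) when satisfied and 1 otherwise, and hard clauses are recast as \(\{0,1\}\) factors as noted above.

```python
import math

# A ground clause as a list of (atom, sign) literals plus a weight;
# weight = None marks a hard clause (recast as a {0,1} factor, cf. Eq. (1)).
clauses = [
    ([("SB_c1c1", False), ("SB_c1c2", False)], None),  # hard: ~SBib(C1,C1) v ~SBib(C1,C2)
    ([("SAr_gg", False), ("SB_c1c2", True)], 1.1),     # soft clause, weight 1.1
]

def clause_value(literals, weight, world):
    """Factor value of one ground clause in a world (dict: atom -> bool)."""
    satisfied = any(world[a] == s for a, s in literals)
    if weight is None:                               # hard clause: 0/1 factor
        return 1.0 if satisfied else 0.0
    return math.exp(weight) if satisfied else 1.0    # soft clause: e^w if satisfied

def unnormalized_prob(world):
    """Product of all factor values, i.e. lambda * P(world) in Eq. (1)."""
    p = 1.0
    for lits, w in clauses:
        p *= clause_value(lits, w, world)
    return p

world = {"SB_c1c1": True, "SB_c1c2": False, "SAr_gg": False}
print(unnormalized_prob(world))  # both clauses satisfied here: e^{1.1}
```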

Loopy belief propagation The object of the inference task is to compute the marginal probability of the non-evidence atoms (e.g., SameBib) given some others as evidence (e.g., Author). One widely used approximate inference technique is loopy belief propagation (LBP) (Yedidia et al. 2005), which provides exact marginals of query atoms conditional on evidence ones when the factor graph is a tree or a forest, and approximate marginals if the factor graph has cycles. LBP proceeds by alternating the passing of messages between variable (ground atom) nodes and their neighboring factor (ground clause) nodes. The message from a variable \(X_j\) to a factor \(f_i\) is:

$$\begin{aligned} \mu _{X_j \rightarrow f_i} = \prod _{f_k \in {\mathcal {F}}_{X_j} {\setminus } \left\{ f_i\right\} } \mu _{f_k \rightarrow X_j} \end{aligned}$$
(2)

The message from a factor \(f_i\) to variable \(X_j\) is:

$$\begin{aligned} \mu _{f_i \rightarrow X_j} = \sum _{\small {X_1}} \cdots \sum _{\small {X_{j-1}}} \sum _{\small {X_{j+1}}} \cdots \sum _{\small {X_l}} \left( f_i\left( X_1,\ldots ,X_j,\ldots ,X_l\right) \prod _{X_k \in {\mathcal {X}}_{f_i} {\setminus } \left\{ X_j\right\} } \mu _{X_k \rightarrow f_i}\right) . \end{aligned}$$
(3)

The messages are frequently initialized to 1, and the unnormalized marginal of a single variable \(X_j\) can be approximated by computing a coarse geometric average of its incoming messages:

$$\begin{aligned} \beta _{X_j} \propto \prod _{f_i \in {\mathcal {F}}_{X_j}} \mu _{f_i \rightarrow X_j}. \end{aligned}$$
(4)

While there are different schedules for passing messages in graphs with loops, one of the most commonly used is synchronous scheduling, wherein all messages are simultaneously updated by using the messages from the previous iteration.
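As an illustration of Eqs. (2)-(4) under a synchronous schedule, the following self-contained Python sketch runs LBP on a tiny two-atom factor graph. The graph (one soft clause \(X_1 \vee X_2\) plus a unary prior on \(X_1\)) and all names are invented for illustration; this is a minimal sketch of the update rules, not our implementation.

```python
import itertools
import numpy as np

# Tiny factor graph: Boolean atoms X1, X2; one soft clause X1 v X2 with weight
# w (factor value e^w if satisfied, 1 otherwise) and a unary prior on X1.
w = 1.1
factors = {
    "f_or": {"scope": ["X1", "X2"],
             "val": lambda a: np.exp(w) if (a["X1"] or a["X2"]) else 1.0},
    "f_x1": {"scope": ["X1"], "val": lambda a: 0.7 if a["X1"] else 0.3},
}
variables = ["X1", "X2"]
nbrs = {v: [f for f, d in factors.items() if v in d["scope"]] for v in variables}

# All messages initialized to 1, one entry per truth value (False, True).
msg_vf = {(v, f): np.ones(2) for v in variables for f in nbrs[v]}
msg_fv = {(f, v): np.ones(2) for f, d in factors.items() for v in d["scope"]}

for _ in range(20):  # synchronous schedule: all updates use old messages
    new_vf, new_fv = {}, {}
    for (v, f) in msg_vf:  # Eq. (2): variable -> factor
        m = np.ones(2)
        for g in nbrs[v]:
            if g != f:
                m = m * msg_fv[(g, v)]
        new_vf[(v, f)] = m / m.sum()
    for (f, v) in msg_fv:  # Eq. (3): factor -> variable, sum out other atoms
        d, m = factors[f], np.zeros(2)
        others = [u for u in d["scope"] if u != v]
        for xv in (False, True):
            for combo in itertools.product((False, True), repeat=len(others)):
                a = dict(zip(others, combo), **{v: xv})
                term = d["val"](a)
                for u, xu in zip(others, combo):
                    term *= msg_vf[(u, f)][int(xu)]
                m[int(xv)] += term
        new_fv[(f, v)] = m / m.sum()
    msg_vf, msg_fv = new_vf, new_fv

for v in variables:  # Eq. (4): marginal ~ product of incoming messages
    b = np.ones(2)
    for f in nbrs[v]:
        b = b * msg_fv[(f, v)]
    print(v, b / b.sum())
```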

Now consider the atoms that we are interested in as a query [SBib(\(C_1,C_1\)), SBib(\(C_2,C_1\)), SBib(\(C_1,C_2\)), and SBib(\(C_2,C_2\))] on the factor graph represented in Fig. 1. Notably, these query atoms are involved in many cycles. This suggests, at least theoretically, the existence of more than one fixed point (or local optimum), which raises the threat of non-convergence (Limitation 1). In addition, six of these cycles (i.e., those represented with dashed orange lines), such as SBib(\(C_1,C_1\))–\(f_5\)–SBib(\(C_2,C_1\))–\(f_4\)–SBib(\(C_1,C_2\)), contain no evidence (i.e., all the atoms in these cycles are queries). Therefore, the double counting problem is expected to occur (Limitation 1). Moreover, the six cycles contain only hard clauses, which hinders the process of smoothing out the messages to converge to accurate results (Limitation 2).

Constraint propagation A Constraint Satisfaction Problem (CSP) (Rossi et al. 2006) is a triple \(\big <{\mathcal {X}}, {\mathcal {D}}, {\mathcal {C}}\big>\), where \({\mathcal {X}}\) is an n-tuple of variables \({\mathcal {X}}=\big <X_1,\ldots ,X_n\big>\), \({\mathcal {D}}\) is a corresponding n-tuple of domains \({\mathcal {D}}=\big <D_1,\ldots ,D_n\big>\) such that each \(X_j\) takes values in \(D_j\), and \({\mathcal {C}}\) is an m-tuple of constraints \({\mathcal {C}}=\big <c_1,\ldots ,c_m\big>\). A constraint \(c_i\) is a pair \(\big <{\mathcal {R}}_{{\mathcal {X}}_{c_i}},{\mathcal {X}}_{c_i}\big>\), where \({\mathcal {R}}_{{\mathcal {X}}_{c_i}}\) is a relation on the variables \({\mathcal {X}}_{c_i}=\text {scope}(c_i)\). A solution to the CSP is a complete assignment (or a possible world) \(s=\big <v_1,\ldots ,v_n\big>\), where \(v_j \in D_j\) and each \(c_i \in {\mathcal {C}}\) is satisfied in that \({\mathcal {R}}_{{\mathcal {X}}_{c_i}}\) holds on the projection of s onto the scope \({\mathcal {X}}_{c_i}\). S denotes the set of solutions to the CSP. Constraint propagation (Rossi et al. 2006) is the process of removing inconsistent values from the domains, i.e., values that violate some constraint in \({\mathcal {C}}\). One form of constraint propagation is to apply generalized arc consistency to each constraint \(c \in {\mathcal {C}}\) until a fixed point is reached.

Definition 1

(Generalized arc consistency (GAC)) Given a constraint \(c \in {\mathcal {C}}\) which is defined over the subset of variables \({\mathcal {X}}_{c}\), it is generalized arc consistent (GAC) iff for each variable \(X_j \in {\mathcal {X}}_{c}\) and for each value \(d \in {\mathcal {D}}_{X_j}\) in its domain, there exists a value \(d_k \in {\mathcal {D}}_{X_k}\) for each variable \(X_k \in {\mathcal {X}}_{c} {\setminus } \left\{ X_j\right\} \) that constitutes at least one valid tuple (or valid local entry) that satisfies c.

We can extend this CSP formalism to Weighted CSPs (Rossi et al. 2006) to include soft constraints. This too requires extending GAC to soft generalized arc consistency (soft GAC) to tackle the soft constraints (van Hoeve et al. 2006). At a high level, one can view GAC (or soft GAC) as a function that takes any variable \(X_j \in {\mathcal {X}}\) and returns all other consistent variables’ values that support the values of \(X_j\) with respect to the constraints \(c \in {\mathcal {C}}\). For instance, in our example of Cora in Fig. 1, applying GAC to the hard constraint (or clause) \(f_{6}: \lnot \text {SBib}(C_1,C_1) \vee \lnot \text {SBib}(C_1,C_2)\) with respect to ground atom assignment \(\text {SBib}(C_1,C_1) = true\) implies maintaining only the truth value “false” in the domain of \(\text {SBib}(C_1,C_2)\). This is because the only valid local entry of \(f_{6}\) that supports \(\text {SBib}(C_1,C_1) = true\) is \(\left\{ (true, false)\right\} \).
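The following Python sketch implements the GAC revision of Definition 1 on the \(f_6\) example just discussed, assuming Boolean domains and clauses encoded as lists of signed literals; the atom names are illustrative abbreviations.

```python
import itertools

# Hard clause f6: ~SBib(C1,C1) v ~SBib(C1,C2), encoded as (atom, sign) pairs,
# where sign=False means a negated literal. Evidence: SBib(C1,C1) = true.
clause = [("SB_c1c1", False), ("SB_c1c2", False)]
domains = {"SB_c1c1": {True}, "SB_c1c2": {True, False}}

def revise(clause, domains):
    """One GAC pass (Definition 1): keep a value d in D(X_j) only if some
    assignment to the other atoms' remaining values satisfies the clause."""
    changed = False
    atoms = [a for a, _ in clause]
    for atom, _ in clause:
        others = [a for a in atoms if a != atom]
        supported = set()
        for d in domains[atom]:
            for combo in itertools.product(*(sorted(domains[o]) for o in others)):
                world = dict(zip(others, combo))
                world[atom] = d
                if any(world[a] == s for a, s in clause):  # clause satisfied
                    supported.add(d)
                    break
        if supported != domains[atom]:
            domains[atom], changed = supported, True
    return changed

while revise(clause, domains):  # propagate to a fixed point
    pass
print(domains)  # SB_c1c2 retains only False, as in the f6 example above
```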

We can also apply GAC in a probabilistic form. For instance, probabilistic arc consistency (pAC) (Horsch and Havens 2000) performs BP in the form of arc consistency to compute the relative frequency with which a variable takes on a particular value across all solutions of a binary CSP (see Horsch and Havens 2000 for more details). pAC can be summarized as follows. We start by initializing all variables to have uniform distributions. At each step, each variable stores its previous solution probability distribution, then incoming messages from neighboring variables are processed, and the results are maintained locally, so that there is no need to send messages to all neighbors when the distribution has not changed. The new distribution is approximated by multiplying together the most recent messages received from all neighbors. If the variable's solution distribution has changed, then a new message is sent to all neighbors.

Variational bounds, EM and KL divergences To derive a method with enhanced algorithmic behavior and theoretical semantics for BP, we shall be interested in a variational bound that is widely known in the context of variational expectation maximization algorithms (Beal and Ghahramani 2003; Neal and Hinton 1999). Suppose that we have a model \({\mathcal {M}}_\theta \) with parameters \(\theta \), observed data \({\mathcal {O}}=\{O_1,\ldots ,O_n\}\) and hidden variables \({\mathcal {H}}=\{H_1,\ldots ,H_n\}\). By introducing an approximation to our distribution over hidden variables given by \(\large {q}_{{\mathcal {H}}}({\mathcal {H}})\), we can leverage Jensen’s inequality to obtain a lower bound \({\mathcal {F}}_{{\mathcal {M}}_{\theta }}\) on the log marginal likelihood of the form \(\log \, P({\mathcal {O}} | {\mathcal {M}}_{\theta })\) as follows:

$$\begin{aligned} \log \, P({\mathcal {O}} | {\mathcal {M}}_{\theta })&= \log \sum _{{\mathcal {H}}} P({\mathcal {O}}, {\mathcal {H}} | {\mathcal {M}}_{\theta }) \end{aligned}$$
(5a)
$$\begin{aligned}&= \log \sum _{{\mathcal {H}}} P({\mathcal {O}}, {\mathcal {H}} | {\mathcal {M}}_{\theta }) \,\frac{\large {q}_{{\mathcal {H}}}({\mathcal {H}})}{\large {q}_{{\mathcal {H}}}({\mathcal {H}})} \end{aligned}$$
(5b)
$$\begin{aligned}&\ge \sum _{{\mathcal {H}}} \large {q}_{{\mathcal {H}}}({\mathcal {H}}) \, \log \frac{P({\mathcal {O}}, {\mathcal {H}} | {\mathcal {M}}_{\theta })}{\large {q}_{{\mathcal {H}}}({\mathcal {H}})} \end{aligned}$$
(5c)
$$\begin{aligned}&= \large {E}_{\large {q}_{{\mathcal {H}}}({\mathcal {H}})} \big [\log P({\mathcal {O}}, {\mathcal {H}} | {\mathcal {M}}_{\theta }) \big ] + \large {H}\big (\large {q}_{{\mathcal {H}}}({\mathcal {H}})\big ) \end{aligned}$$
(5d)
$$\begin{aligned}&= {\mathcal {F}}_{{\mathcal {M}}_{\theta }}(\large {q}_{{\mathcal {H}}} ({\mathcal {H}})). \end{aligned}$$
(5e)

This lower bound \({\mathcal {F}}_{{\mathcal {M}}_{\theta }}\) in Eq. (5e) is called the free energy. In Eq. (5d), \(\large {E}_{\large {q}_{{\mathcal {H}}}({\mathcal {H}})}\) is the expected log marginal likelihood and \(\large {H}\) is the Shannon entropy term. Its role in variational EM (Beal and Ghahramani 2003) is that it justifies an iterative optimization algorithm for the lower bound, alternating two steps: the E-step, in which one makes the bound tighter by computing and updating \(\large {q}_{{\mathcal {H}}}({\mathcal {H}})\), and the M-step, which uses the approximation to update the parameters of the model, typically increasing the log marginal likelihood. If the exact posterior is used, or if the approximation to the posterior is exact, then the inequality is met with equality and the original EM algorithm is recovered. Both LBP and variational EM share a similar objective, namely to minimize a corresponding energy equation (Yedidia et al. 2005): the Gibbs free energy and the variational free energy, respectively. Variational inference over hidden or unobserved variables in the E-step of traditional variational EM has the advantage that it corresponds to minimizing the KL divergence between an approximation and our quantity of interest, as we discuss below.

With a little more analysis it is possible to also determine that the free-energy is smaller than the log-marginal likelihood by an amount equal to the Kullback–Leibler (KL) divergence between \(\large {q}_{{\mathcal {H}}}({\mathcal {H}})\) and the posterior distribution of the hidden variables \(P({\mathcal {H}} | {\mathcal {O}}, {\mathcal {M}}_{\theta })\):

$$\begin{aligned} {\mathcal {F}}_{{\mathcal {M}}_{\theta }}(\large {q}_{{\mathcal {H}}}({\mathcal {H}}))&= \sum _{{\mathcal {H}}} \large {q}_{{\mathcal {H}}}({\mathcal {H}}) \, \log \frac{P({\mathcal {H}} | {\mathcal {O}}, {\mathcal {M}}_{\theta }) P({\mathcal {O}} | {\mathcal {M}}_{\theta })}{\large {q}_{{\mathcal {H}}}({\mathcal {H}})} \end{aligned}$$
(6a)
$$\begin{aligned}&= \log \, P({\mathcal {O}} | {\mathcal {M}}_{\theta }) - \sum _{{\mathcal {H}}} \large {q}_{{\mathcal {H}}}({\mathcal {H}}) \, \log \frac{\large {q}_{{\mathcal {H}}}({\mathcal {H}})}{P({\mathcal {O}}, {\mathcal {H}} | {\mathcal {M}}_{\theta })} \end{aligned}$$
(6b)
$$\begin{aligned}&= \log \, P({\mathcal {O}} | {\mathcal {M}}_{\theta }) - \textit{KL} \big [\large {q}_{{\mathcal {H}}}({\mathcal {H}}) \, || \, P({\mathcal {H}} | {\mathcal {O}}, {\mathcal {M}}_{\theta }) \big ]. \end{aligned}$$
(6c)

That is, since the marginal likelihood of the observed data is a fixed quantity, maximizing the lower bound (or the variational free energy) through variational inference over hidden variables is equivalent to minimizing the KL divergence between our approximation and the true distribution over hidden variables. Thus, in the E-step of a variational EM algorithm, one can perform iterative updates of \(\large {q}_{{\mathcal {H}}}({\mathcal {H}})\) within a class of distributions \({\mathcal {Q}}\) so as to minimize the KL divergence to the posterior \(P({\mathcal {H}} | {\mathcal {O}}, {\mathcal {M}}_{\theta })\), with the goal of obtaining

$$\begin{aligned} \large {q}^{\text {VE}}_{{\mathcal {H}}}({\mathcal {H}}) = \underset{q \in {\mathcal {Q}}}{{\text {argmin}}} \,\, \textit{KL} \big [\large {q}_{{\mathcal {H}}}({\mathcal {H}}) \, || \, P({\mathcal {H}} | {\mathcal {O}}, {\mathcal {M}}_{\theta }) \big ]. \end{aligned}$$
(7)

In our work here we will perform variational inference reminiscent of this approach.
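The identity in Eq. (6c) can be checked numerically. The toy model below, with one Boolean hidden variable, a single fixed observation, and invented probabilities, verifies that \(\log P({\mathcal {O}} | {\mathcal {M}}_{\theta }) = {\mathcal {F}}_{{\mathcal {M}}_{\theta }}(q) + \textit{KL}\big [q \, || \, P({\mathcal {H}} | {\mathcal {O}}, {\mathcal {M}}_{\theta })\big ]\) holds for an arbitrary choice of \(q\):

```python
import numpy as np

# Toy model: one Boolean hidden variable H, observation O = 1.
p_h = np.array([0.4, 0.6])            # P(H)
p_o_given_h = np.array([0.9, 0.2])    # P(O=1 | H)
joint = p_h * p_o_given_h             # P(O=1, H)
log_evidence = np.log(joint.sum())    # log P(O=1)
posterior = joint / joint.sum()       # P(H | O=1)

q = np.array([0.5, 0.5])              # any approximating distribution q(H)
free_energy = np.sum(q * (np.log(joint) - np.log(q)))  # lower bound, Eq. (5c)
kl = np.sum(q * (np.log(q) - np.log(posterior)))       # KL(q || posterior)

print(log_evidence, free_energy + kl)  # identical up to rounding, Eq. (6c)
assert np.isclose(log_evidence, free_energy + kl)
```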

3 GEM-MP framework

At a conceptual level, our overall GEM-MP approach consists of the following three key elements. First, we extend the factor graph used to represent a given problem with mega-node random variables that behave identically to the groups of variables participating in a factor. Second, we perform variational inference to update an approximation over the original variables and the mega-nodes. Third, we use a probabilistic form of generalized arc consistency to make inferences about hard constraints more efficiently. Unlike LBP, because we formulate inference using variational updates, we directly minimize the KL divergence between our approximation of the joint conditional distribution and the true distribution of interest.

Before presenting the inference components of GEM-MP in detail, we will first examine a small concrete example, then present our more general approach for extending factor graphs. Let us consider a simple example factor graph \({\mathcal {G}}\) (Fig. 2 (left)), which is a fragment of the Cora example in Fig. 1, that involves factors \({\mathcal {F}} =\left\{ f_1,f_2,f_3,f_{4}\right\} \) and three random variables \(\left\{ X_1,X_2,X_3\right\} \) denoting query ground atoms \(\left\{ \text {SBib}(C_2,C_2), \text {SBib}(C_2,C_1), \text {SBib}(C_1,C_2)\right\} \) respectively.

In our GEM-MP framework the first thing we do is to modify the factor graph. Specifically, we need to re-parameterize the factor graph in such a way that carrying out a learning task on the new parameterization is equivalent to running an inference task on the original factor graph. That is, we modify the original factor graph \({\mathcal {G}}\) (depicted in Fig. 2 (left)) by transforming it into an extended factor graph \(\hat{{\mathcal {G}}}\) (depicted in Fig. 2 (right)) as follows:

Fig. 2

An example factor graph \({\mathcal {G}}\) (left) which is a fragment of the Cora example in Fig. 1, that involves factors \({\mathcal {F}} = \left\{ f_1,f_2,f_3,f_{4}\right\} \) and three random variables \(\left\{ X_1,X_2,X_3\right\} \) representing query ground atoms \(\left\{ \text {SBib}(C_2,C_2), \text {SBib}(C_2,C_1), \text {SBib}(C_1,C_2)\right\} \). The extended factor graph \(\hat{{\mathcal {G}}}\) (right) which is a transformation of the original factor graph after adding auxiliary mega-node variables \({\mathcal {Y}} = \left\{ y_1,y_2,y_3,y_4\right\} \), and auxiliary activation-node variables \({\mathcal {O}} = \left\{ O_1,O_2,O_3,O_4\right\} \), which yields extended factors \(\hat{{\mathcal {F}}} = \left\{ \hat{f_1},\hat{f_2},\hat{f_3},\hat{f_{4}}\right\} \)

Table 2 Factor \(f_1\) in the original factor graph (top)
  • We attach an auxiliary mega-node \(Y_i\) (dashed oval) to each factor node \(f_i \in {\mathcal {F}}\). Each of these mega-nodes \(Y_i\) captures the local entries of its corresponding factor \(f_i\). Thus, it has a domain size that equals (at most) the number of local entries in the factor \(f_i\) (i.e., the states of each mega-node correspond to a subset of the Cartesian product of the domains of the variables that are the arguments to the factor \(f_i\)). \({\mathcal {Y}} = \left\{ Y_i\right\} _{i=1}^m\) is the set of mega-nodes in the extended factor graph, where \(m=4\) in the example factor graph.

  • In addition, we connect an auxiliary activation node, \(O_i\) (dashed circle), to each factor \(f_i\). The auxiliary activation node \(O_i\) enforces an indicator constraint \({\mathbbm {1}}_{\left( Y_i,f_i\right) }\) ensuring that the particular configuration of the variables that are the arguments to the original factor \(f_i\) is identical to the state of the mega-node \(Y_i\):

    $$\begin{aligned} {\mathbbm {1}}_{\left( Y_i,f_i\right) } = {\left\{ \begin{array}{ll} 1 &{} \quad \text {If the state of } Y_i \text { is identical to local entry of } f_i. \\ 0 &{} \quad \text {Otherwise} \end{array}\right. } \end{aligned}$$
    (8)
  • Now, since we expand the arguments of each factor \(f_i\) by including both auxiliary mega-node and auxiliary activation node variables, then we get an extended factor \(\hat{f_i}\). \(\hat{{\mathcal {F}}} = \left\{ \hat{f_i}\right\} _{i=1}^m\) is the set of extended factors in the extended factor graph.

  • When the activation node \(O_i\) equals one, it activates the indicator constraint in Eq. (8). If this indicator constraint is satisfied, then the extended factor \(\hat{f_i}\) preserves the value of \(f_i\) for the configuration defined over the original input variables of the factor \(f_i\). Thus, the following condition holds for each extended factor \(\hat{f_i}\) when a configuration \((x_1,\ldots ,x_n)\) of \(f_i\) equals the state \(y_i\) of mega-node \(Y_i\):

    $$\begin{aligned} \hat{f_i}\left( X_1=x_1,\ldots ,X_n=x_n,Y_i=y_i,\bar{O_i}\right) \Bigm |_{\bar{O_i} = 1} = f_i\left( X_1=x_1,\ldots ,X_n=x_n\right) . \end{aligned}$$
    (9)

    But if the indicator constraint in Eq. (8) is not satisfied, then the extended factor \(\hat{f_i}\) assigns the value 0. Thus, the following condition also holds for each extended factor \(\hat{f_i}\) when a configuration \((x_1,\ldots ,x_n)\) of \(f_i\) is not equal to the state \(y_i\) of mega-node \(Y_i\):

    $$\begin{aligned} \hat{f_i}\left( X_1=x_1,\ldots ,X_n=x_n,Y_i=y_i,\bar{O_i}\right) \Bigm |_{\bar{O_i} = 1} = 0. \end{aligned}$$
    (10)
  • Setting \(O_i=0\) effectively removes the impact of \(f_i\) from the model. That is, when the activation node \(O_i\) is not equal to one, it deactivates the indicator constraint in Eq. (8). In this case, the extended factor \(\hat{f_i}\) assigns the value 1 when the possible state of \(Y_i\) matches the configuration of the variables that are the arguments to the factor \(f_i\); otherwise it assigns the value 0. Note that by assigning the values in this way, all factors \(f_i \in {\mathcal {F}}\) have identical values in their corresponding \(\hat{f_i} \in \hat{{\mathcal {F}}}\) when \(O_i=0\), which implies that deactivating the indicator constraint removes any impact of the factors \(f_i \in {\mathcal {F}}\) on the distribution.

Table 2 visualizes the expansion of factor \(f_1\), in the original factor graph, to its corresponding extended factor \(\hat{f_1}\) in the extended factor graph.
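To make the construction concrete, here is a small Python sketch that extends a single Boolean factor according to Eqs. (8)-(10) and checks the marginalization property formalized in Proposition 1 below; the clause \(X_1 \vee \lnot X_2\) is an invented example rather than one of the Cora factors.

```python
import itertools

# Extend a Boolean factor f over (X1, X2) into f_hat over (X1, X2, Y, O),
# where the states of the mega-node Y enumerate the local entries of f.
def f(x1, x2):                        # hard clause X1 v ~X2 as a {0,1} factor
    return 1.0 if (x1 or not x2) else 0.0

entries = list(itertools.product((False, True), repeat=2))   # states of Y

def f_hat(x1, x2, y, o):
    if o == 1:                        # indicator constraint active, Eq. (8)
        return f(x1, x2) if (x1, x2) == y else 0.0            # Eqs. (9)-(10)
    else:                             # O = 0 removes f's influence
        return 1.0 if (x1, x2) == y else 0.0

# Proposition 1: summing Y out of f_hat with O = 1 recovers f.
for x1, x2 in entries:
    marg = sum(f_hat(x1, x2, y, 1) for y in entries)
    assert marg == f(x1, x2)
print("marginalizing Y out of f_hat recovers f on this example")
```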

Proposition 1

In the extended factor graph \(\hat{{\mathcal {G}}}\), reducing each extended factor \(\hat{f_i}\) by evidencing its activation node with one, \(\bar{O_i}=1\), and then eliminating its auxiliary mega-node \(Y_i\) by marginalization yields its corresponding original factor \(f_i\) in the original factor graph \({\mathcal {G}}\).

$$\begin{aligned} \sum _{Y_i} \hat{f_i} \left( X_1,\ldots ,X_n,Y_i,\bar{O_i}\right) \Bigm |_{\small {\bar{O_i} = 1}} = f_i \left( X_1,\ldots ,X_n \right) , \, \forall \hat{f_i} \in \hat{{\mathcal {F}}}. \end{aligned}$$
(11)

Proof

see “Appendix”. \(\square \)

Proposition 2

Any arbitrary factor graph \({\mathcal {G}}\) is equivalent to its extension \(\hat{{\mathcal {G}}}\) (i.e., the two define an identical joint probability over the variables \({\mathcal {X}}\)) iff the activation nodes in \(\hat{{\mathcal {G}}}\) are all evidenced with one:

$$\begin{aligned} {\mathcal {G}} \equiv \hat{{\mathcal {G}}} \, \text {iff} \, \bar{O_i}=1,\, \forall O_i \in {\mathcal {O}} \, \text {in} \, \hat{{\mathcal {G}}} \end{aligned}$$

Proof

see “Appendix”. \(\square \)

Given this extended factor graph formulation we can now examine the task of performing inference over unobserved quantities given observed quantities through the lens of variational analysis and inference.

Let \({\mathcal {O}}= \left\{ O_i\right\} _{i=1}^m\) be the observed variables, represented as a binary vector (of 1’s) indicating the observation of the activation node variables \(\bar{O_i}=1,\,\forall O_i \in {\mathcal {O}}\). Let \({\mathcal {H}}=\left\{ {\mathcal {X}},{\mathcal {Y}}\right\} \) be the hidden variables, where \({\mathcal {X}} = \left\{ X_j\right\} _{j=1}^n\) is the set of variables (i.e., ground atoms) whose marginals we want to compute, and \({\mathcal {Y}}=\left\{ Y_i\right\} _{i=1}^m\) is the set of mega-nodes. Let \(\large {q}({\mathcal {X}},{\mathcal {Y}})\) be an auxiliary distribution over the set of hidden variables \({\mathcal {H}}\), satisfying \(\sum _{\small {{\mathcal {X}},{\mathcal {Y}}}} \, \large {q}({\mathcal {X}},{\mathcal {Y}}) =1\). Now, using the distribution \(\large {q}({\mathcal {X}},{\mathcal {Y}})\), we can leverage Jensen’s inequality to obtain a lower bound on the log-marginal likelihood \(\log P({\mathcal {O}}|{\mathcal {M}})\) as follows:

$$\begin{aligned} \log P({\mathcal {O}}|{\mathcal {M}})&= \log \sum _{\small {{\mathcal {X}},{\mathcal {Y}}}} P({\mathcal {O}},{\mathcal {X}}, {\mathcal {Y}}|{\mathcal {M}}) \end{aligned}$$
(12a)
$$\begin{aligned}&= \log \sum _{\small {{\mathcal {X}},{\mathcal {Y}}}} P({\mathcal {O}},{\mathcal {X}}, {\mathcal {Y}}|{\mathcal {M}}) \, \frac{\large {q}({\mathcal {X}}, {\mathcal {Y}})}{\large {q}({\mathcal {X}},{\mathcal {Y}})} \end{aligned}$$
(12b)
$$\begin{aligned}&\ge \sum _{\small {{\mathcal {X}},{\mathcal {Y}}}} \large {q}({\mathcal {X}},{\mathcal {Y}}) \, \log \frac{P({\mathcal {O}},{\mathcal {X}}, {\mathcal {Y}}|{\mathcal {M}})}{\large {q}({\mathcal {X}},{\mathcal {Y}})} \end{aligned}$$
(12c)
$$\begin{aligned}&= \large {E}_{\large {q}({\mathcal {X}},{\mathcal {Y}})} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] + \large {H}\big (\large {q}({\mathcal {X}},{\mathcal {Y}})\big ) \end{aligned}$$
(12d)
$$\begin{aligned}&= {\mathcal {F}}_{\small {{\mathcal {M}}}} (\large {q}({\mathcal {X}},{\mathcal {Y}})) \end{aligned}$$
(12e)

where \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) in Eq. (12e) is the negative of the variational free energy functional of the free distribution \(\large {q}({\mathcal {X}},{\mathcal {Y}})\). As shown in Eq. (12d), it is the sum of two terms: \(\large {E}_{\large {q}({\mathcal {X}},{\mathcal {Y}})}\), the expected log marginal-likelihood with respect to the distribution \(\large {q}({\mathcal {X}},{\mathcal {Y}})\), and \(\large {H}\big (\large {q}({\mathcal {X}},{\mathcal {Y}})\big )\), the entropy of the distribution \(\large {q}({\mathcal {X}},{\mathcal {Y}})\) (see Neal and Hinton 1999 for more details).

We can also easily see that similarly to other traditional settings the free-energy \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) is smaller than the log-marginal likelihood by an amount equal to the Kullback–Leibler (KL) divergence between \(\large {q}({\mathcal {X}},{\mathcal {Y}})\) and the distribution over the hidden variables \(P({\mathcal {X}}, {\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})\):

$$\begin{aligned} {\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q}({\mathcal {X}},{\mathcal {Y}}))&= \sum _{\small {{\mathcal {X}},{\mathcal {Y}}}} \large {q}({\mathcal {X}},{\mathcal {Y}}) \, \log \frac{P({\mathcal {X}}, {\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}}) P({\mathcal {O}}|{\mathcal {M}})}{\large {q}({\mathcal {X}},{\mathcal {Y}})} \end{aligned}$$
(13a)
$$\begin{aligned}&= \log P({\mathcal {O}}|{\mathcal {M}}) - \sum _{\small {{\mathcal {X}},{\mathcal {Y}}}} \large {q}({\mathcal {X}},{\mathcal {Y}}) \, \log \frac{\large {q}({\mathcal {X}},{\mathcal {Y}})}{P({\mathcal {X}}, {\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})} \end{aligned}$$
(13b)
$$\begin{aligned}&= \log P({\mathcal {O}}|{\mathcal {M}}) - \textit{KL} \big [\large {q}({\mathcal {X}},{\mathcal {Y}}) \, || \, P({\mathcal {X}}, {\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}}) \big ] \end{aligned}$$
(13c)

Since \(\textit{KL} \big [\large {q}({\mathcal {X}},{\mathcal {Y}}) \, || \, P({\mathcal {X}}, {\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}}) \big ] \ge 0\) in Eq. (13c) and the log marginal probability under the model is a fixed quantity, minimizing the KL divergence term is equivalent to maximizing the variational free energy \({\mathcal {F}}_{\small {{\mathcal {M}}}}\). That is, one can equivalently choose either to maximize the lower bound (the variational free energy) or to minimize the KL divergence. Based on that, we now want to infer the distribution \(\large {q}({\mathcal {X}},{\mathcal {Y}})\) in a class of distributions \({\mathcal {Q}}\) that maximizes the variational free energy:

$$\begin{aligned} \large {q}^*{\small {({\mathcal {X}},{\mathcal {Y}})}} = \underset{q \in {\mathcal {Q}}}{{\text {argmax}}} \,\, {\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q} ({\mathcal {X}},{\mathcal {Y}})) \end{aligned}$$
(14)

One problem is that the variational free energy \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) is unwieldy as a target for direct optimization: it requires an explicit summation over all possible instantiations of \({\mathcal {X}}\) and over all valid local entries of the factors (i.e., ground clauses) involved in the model for \({\mathcal {Y}}\), an operation that is infeasible in practice. Instead, we constrain the auxiliary distribution \(\large {q}({\mathcal {X}},{\mathcal {Y}})\) to be a factorized (separable) approximation:

$$\begin{aligned} \large {q}({\mathcal {X}},{\mathcal {Y}}) = \large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \, \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}) \end{aligned}$$
(15)

where \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) is an approximation of the true distribution \(P({\mathcal {X}}|{\mathcal {O}},{\mathcal {M}})\) over hidden variables, \({\mathcal {X}}\). This distribution is characterized by a set of variational parameters, \({\mathcal {B}}_{{\mathcal {X}}} = \left\{ \beta _{X_j}\right\} _{j=1}^n\). Since we use a fully factored distribution, these approximations are somewhat similar to the approximate marginal probabilities of variables \(X_j \in {\mathcal {X}}\), which one might obtain using standard loopy message-passing inference (e.g., LBP); however, unlike the situation with LBP, here we can subject these approximations to a variational analysis leading to an understanding of message-passing inference in terms of KL divergence minimization. The distribution \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) is an approximation to the true distribution \(P({\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})\) over hidden mega-nodes, \({\mathcal {Y}}\), which is characterized by a set of variational parameters, \({\mathcal {T}}_{\small {{\mathcal {Y}}}} = \left\{ \alpha _{Y_i}\right\} _{i=1}^m\), for adapting the weights associated with the particular states of mega-nodes \(Y_i \in {\mathcal {Y}}\). As a particular formulation of how the \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) distribution is parametrized, these variational parameters \(\alpha _{\small {Y_i}}\left( f_i\right) \) can be defined as:

$$\begin{aligned} \alpha _{\small {Y_i}}\left( f_i\right) ={\left\{ \begin{array}{ll} v_s &{} \quad \text {if the state of } Y_i \text { satisfies } f_i , \\ v_u &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
(16)

where \(v_s\) and \(v_u\) are the values obtained from \(f_i\) when a particular state of \(Y_i\) satisfies or violates the factor \(f_i\), respectively. Note that \(v_s\) and \(v_u\) can be adapted using the distributions of the factor \(f_i\)’s argument variables and the weight associated with \(f_i\) (as will be explained in Sects. 4.1 and 4.2 for hard and soft factors, respectively). Now, using Eq. (15), we can write the lower bound as follows:

$$\begin{aligned} {\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q}({\mathcal {X}},{\mathcal {Y}}))&= {\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \, \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})) \end{aligned}$$
(17a)
$$\begin{aligned}&= \sum _{\small {{\mathcal {X}}},\small {{\mathcal {Y}}}} \large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}) \, \log \frac{P({\mathcal {O}},{\mathcal {X}},{\mathcal {Y}}|{\mathcal {M}})}{\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \end{aligned}$$
(17b)
$$\begin{aligned}&= \large {E}_{\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] + \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) \end{aligned}$$
(17c)

where \(\large {E}_{\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}\) is the expected log marginal-likelihood with respect to the distributions \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) and \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\), and the second term, \(\large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big )\), is the entropy. Hence, our goal is to find the distributions \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) and \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) that maximize the lower bound \({\mathcal {F}}_{\small {{\mathcal {M}}}}\).
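As a concrete reading of Eq. (16), the short sketch below tabulates \(\alpha _{\small {Y_i}}(f_i)\) over the states of a mega-node attached to a soft clause \(X_1 \vee X_2\) with weight w; taking \(v_s = e^w\) and \(v_u = 1\) is an illustrative choice only, since Sects. 4.1 and 4.2 describe how \(v_s\) and \(v_u\) are actually adapted for hard and soft factors.

```python
import itertools
import math

# Mega-node Y_i for a soft clause f_i = X1 v X2 with weight w: its states are
# the local entries of f_i, and alpha assigns v_s or v_u per Eq. (16).
w = 1.1
def alpha(state):                  # state = (x1, x2), one local entry of f_i
    v_s, v_u = math.exp(w), 1.0    # illustrative values for v_s and v_u
    return v_s if (state[0] or state[1]) else v_u

for state in itertools.product((False, True), repeat=2):
    print(state, alpha(state))     # only (False, False) receives v_u
```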

Now the role of the GEM-MP algorithm is to iteratively maximize the lower bound \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) (or minimize the negative free energy \(- {\mathcal {F}}_{\small {{\mathcal {M}}}}\)) with respect to the distributions \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) and \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) by applying two steps. In the first step, \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) is used to maximize \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) with respect to \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\). Then in the second step, \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) is used to maximize \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) with respect to \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\). That is, GEM-MP maximizes \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) by performing two iterative updates

$$\begin{aligned} {\mathcal {T}}^*_{\small {{\mathcal {Y}}}} \propto&\underset{{\mathcal {T}}_{\small {{\mathcal {Y}}}}}{{\text {argmax}}} \, \large {E}_{\large {q}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] + \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) \end{aligned}$$
(18a)
$$\begin{aligned} {\mathcal {B}}^*_{\small {{\mathcal {X}}}} \propto&\underset{{\mathcal {B}}_{\small {{\mathcal {X}}}}}{{\text {argmax}}} \, \large {E}_{\large {q}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] + \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) \end{aligned}$$
(18b)

Note that the entropy term can be re-written as:

$$\begin{aligned} \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) = \large {H}\big (\large {q}({\mathcal {X}},{\mathcal {Y}})\big ) = \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\big ) + \large {H}\big (\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) \end{aligned}$$
(19)

Now, if we substitute the entropy \(\large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big )\) from Eq. (19) into Eqs. (18a) and (18b), then the maximization of \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) with respect to the variational parameters \({\mathcal {T}}_{\small {{\mathcal {Y}}}}\) does not depend on the entropy \(\large {H}(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}))\), and the maximization of \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) with respect to the variational parameters \({\mathcal {B}}_{\small {{\mathcal {X}}}}\) does not depend on \(\large {H}(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}))\). We thus have

$$\begin{aligned} {\mathcal {T}}^*_{\small {{\mathcal {Y}}}} \propto&\underset{{\mathcal {T}}_{\small {{\mathcal {Y}}}}}{{\text {argmax}}} \, \large {E}_{\large {q}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] + \large {H}\big (\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) \end{aligned}$$
(20a)
$$\begin{aligned} {\mathcal {B}}^*_{\small {{\mathcal {X}}}} \propto&\underset{{\mathcal {B}}_{\small {{\mathcal {X}}}}}{{\text {argmax}}} \, \large {E}_{\large {q}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] + \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\big ). \end{aligned}$$
(20b)

Therefore, the goal of GEM-MP can be expressed as that of maximizing a lower bound on the log marginal-likelihood by performing two steps, using superscript (t) to denote the iteration number:

  • GEM-MP “\(M_{q\small {({\mathcal {Y}})}}\)-step”: (for maximizing mega-nodes’ parameters distributions)

    $$\begin{aligned}&\overbrace{{\mathcal {T}}^{\text {(t+1)}}_{\small {{\mathcal {Y}}}}}^{\text {Max. w.r.t }\large {q}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \nonumber \\&\quad = \underset{{\mathcal {T}}_{\small {{\mathcal {Y}}}}}{{\text {argmax}}} \, \overbrace{\large {E}_{\large {q}^{\text {(t)}}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}^{\text {(t)}}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ]}^{\text { E-step}} + \large {H}\big (\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\big ) \end{aligned}$$
    (21)
  • GEM-MP “\(M_{q\small {({\mathcal {X}})}}\)-step”: (for maximizing variable-nodes’ parameter distributions)

    $$\begin{aligned}&\overbrace{{\mathcal {B}}^{\text {(t+1)}}_{\small {{\mathcal {X}}}}}^{\text {Max. w.r.t }\large {q}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) }} \nonumber \\&\quad = \underset{{\mathcal {B}}_{\small {{\mathcal {X}}}}}{{\text {argmax}}} \, \overbrace{\large {E}_{\large {q}^{\text {(t)}}_{\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) } \large {q}^{\text {(t+1)}}_{\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) }} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ]}^{\text { E-step}} + \large {H}\big (\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\big ) \end{aligned}$$
    (22)

where the expectation terms in Eqs. (21) and (22) constitute the \(E_{q\small {({\mathcal {X}})}}\)-step and \(E_{q\small {({\mathcal {Y}})}}\)-step corresponding to the \(M_{q\small {({\mathcal {Y}})}}\)-step and \(M_{q\small {({\mathcal {X}})}}\)-step, respectively. In Eq. (21), the current values of \(\large {q}\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) \) and \(\large {q}\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) \) are used to optimize the mega-nodes’ variational parameters \({\mathcal {T}}_{\small {{\mathcal {Y}}}}\), which maximizes the lower bound with respect to \(\large {q}\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) \). Next, in Eq. (22), the new value of \(\large {q}\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) \) and the current value of \(\large {q}\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) \) are used to optimize the variable nodes’ variational parameters \({\mathcal {B}}_{\small {{\mathcal {X}}}}\), which maximizes the lower bound once again, this time with respect to \(\large {q}\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) \). Note that the difficulty of evaluating the expectations in Eqs. (21) and (22) depends on the properties of the distributions \(\large {q}\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) \) and \(\large {q}\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) \): if inference is easy in these two distributions, then evaluating the expectations is relatively easy. For the entropy terms, the choice of how we approximate the distributions \(\large {q}\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) \) and \(\large {q}\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) \) determines whether we can evaluate them. As will be shown hereafter, using the variational mean-field approximation for these distributions makes evaluating the entropy terms tractable.
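Schematically, the alternation of Eqs. (21) and (22) is a coordinate ascent on the lower bound. The Python skeleton below shows only this control flow; update_T, update_B, and lower_bound are placeholders for the closed-form mean-field updates derived in the following sections, not actual implementations.

```python
# A control-flow sketch of GEM-MP's alternation of Eqs. (21) and (22).
def gem_mp(B, T, update_T, update_B, lower_bound, max_iters=100, tol=1e-6):
    """Alternate the M_q(Y)-step and M_q(X)-step until the bound stabilizes."""
    prev = -float("inf")
    for t in range(max_iters):
        T = update_T(B, T)         # M_q(Y)-step, Eq. (21): refit mega-node params
        B = update_B(B, T)         # M_q(X)-step, Eq. (22): refit variable marginals
        curr = lower_bound(B, T)   # F_M(q); non-decreasing across iterations
        if abs(curr - prev) < tol: # converged to a fixed point of the bound
            break
        prev = curr
    return B, T
```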

Now, using a fully factored variational mean field approximation for \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) implies that we create our approximation from independent distributions over the hidden (mega-node) variables as follows:

$$\begin{aligned} \large {q}\left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) = \prod _{Y_i \in {\mathcal {Y}}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \end{aligned}$$
(23)

where \(\large {q}(Y_i;\alpha _{\small { Y_i}})\) is our complete approximation to the true probability distribution \(P(Y_i | {\mathcal {O}}, {\mathcal {X}}, {\mathcal {M}})\) of a randomly chosen valid local entry of mega-node \(Y_i\). Also, the variational mean-field approximation to \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) is similarly defined as a factorization of independent distributions over the hidden variables in \({\mathcal {X}}\), and can be expressed as follows:

$$\begin{aligned} \large {q}\left( {\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}\right) = \prod _{X_j \in {\mathcal {X}}} \large {q}\left( X_j;\beta _{\small {X_j}}\right) , \end{aligned}$$
(24)

where \(\large {q}(X_j;\beta _{\small {X_j}})\) is an approximate distribution for the true marginal probability distribution \(P(X_j | {\mathcal {O}}, {\mathcal {M}})\) of variable \(X_j\).

From Eqs. (23) and (24), we can write the expected log marginal-likelihood in Eqs. (21) and (22) as follows:

$$\begin{aligned} \begin{aligned}&\large {E}_{\large {q}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})} \large {q}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}} \big [\log P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}) \big ] \\&\quad = \sum _{\small {{\mathcal {Y}}}} \prod _{\small {Y_i \in {\mathcal {Y}}}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \bigg [ \sum _{\small {{\mathcal {X}}}} \prod _{\small {X_j \in {\mathcal {X}}}} \large {q}\left( X_j;\beta _{\small {X_j}}\right) \log P \left( {\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}\right) \bigg ] \end{aligned} \end{aligned}$$
(25)

We now proceed to optimize the lower bound, through the use of Eqs. (21) and (22), using our variational mean-field approximations for both \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) and \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\).

Substituting Eqs. (23)–(25) into Eq. (21) yields the maximization of the lower bound on the log marginal-likelihood as

$$\begin{aligned} \begin{aligned} {\mathcal {T}}^{\text {(t+1)}}_{\small {{\mathcal {Y}}}}&= \underset{{\mathcal {T}}_{\small {{\mathcal {Y}}}}}{{\text {argmax}}} \, \sum _{\small {{\mathcal {Y}}}} \prod _{\small {Y_i \in {\mathcal {Y}}}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \bigg [ \sum _{{\mathcal {X}}} \prod _{\small {X_j} \in {\mathcal {X}}} \large {q}\left( X_j;\beta _{\small {X_j}}\right) \log P\left( {\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}}\right) \bigg ] \\&\qquad + \sum _{\small {Y_i \in {\mathcal {Y}}}} H\left( \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \right) \end{aligned} \end{aligned}$$
(26)

One can then separate out the terms related to the updates of the variational parameters for each mega-node \(Y_i\) in Eq. (26). In addition, updating the parameter distribution of mega-node \(Y_i\) requires considering the distributions \(\{\large {q}(X_j;\beta _{\small {X_j}})\}\) of only those variables in \({\mathcal {X}}\) that are arguments to the extended factor \(\hat{f_i}\) (i.e., \(X_j \in {\mathcal {X}}_{\hat{f_i}}\)), where \(Y_i\) is attached to \(\hat{f_i}\). That is

$$\begin{aligned} \begin{aligned} \alpha ^{\text {(t+1)}}_{\small {Y_i}}&= \underset{\alpha _{\small {Y_i}}}{{\text {argmax}}} \,\, \sum _{\small {Y_i}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \bigg [ \sum _{\small {{\mathcal {X}}_{\hat{f_i}}}} \prod _{\small {X_j \in {\mathcal {X}}_{\hat{f_i}}}} \large {q}\left( X_j;\beta _{\small {X_j}}\right) \log \hat{f_i}\left( O_i, {\mathcal {X}}_{\hat{f_i}}, Y_i |{\mathcal {M}}\right) \bigg ] \\&\qquad + H\left( \large {q}\left( Y_i;\alpha _{\small {Y_i}} \right) \right) \end{aligned} \end{aligned}$$
(27)

where \(\hat{f_i}(O_i, {\mathcal {X}}_{\hat{f_i}}, Y_i |{\mathcal {M}})\) is the part of \(P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}})\) corresponding to the factor associated with the mega-node \(Y_i\). This allows optimizing the variational parameters of the distribution of each mega-node \(Y_i\) as

$$\begin{aligned} \large {q}\left( Y_i;\alpha _{\small {Y_i}}^*\right) = \frac{1}{{\mathcal {Z}}_{Y_i}} \exp \bigg (\overbrace{ \large {E}_{\{\large {q}^{(t)} (X_j;\beta _{\small {X_j}}) \, | \, \small {X_j \in {\mathcal {X}}_{\hat{f_i}}}\}} \big [\log \hat{f_i}(O_i, {\mathcal {X}}_{\hat{f_i}}, Y_i |{\mathcal {M}}) \big ]}^{E_{q\small {({\mathcal {X}})}}\text {-step}} \bigg ) \end{aligned}$$
(28)

where \({\mathcal {Z}}_{Y_i}=\sum _{\small {Y_i}} \exp \big ( \large {E}_{\{\large {q}^{(t)}(X_j;\beta _{\small {X_j}}) \, | \, \small {X_j \in {\mathcal {X}}_{\hat{f_i}}}\}} \big [\log \hat{f_i}(O_i, {\mathcal {X}}_{\hat{f_i}}, Y_i |{\mathcal {M}}) \big ] \big )\) is the normalization factor ensuring that \(\large {q}(Y_i;\alpha _{\small {Y_i}}^*)\) sums to one, and the expectation part can be written as

$$\begin{aligned} \begin{aligned}&\large {E}_{\{\large {q}^{(t)} \left( X_j;\beta _{\small {X_j}}\right) \, | \, \small {X_j \in {\mathcal {X}}_{\hat{f_i}}}\}} \big [\log \hat{f_i}(O_i, {\mathcal {X}}_{\hat{f_i}}, Y_i |{\mathcal {M}}) \big ] \\&\quad = \sum _{\small {{\mathcal {X}}_{\hat{f_i}}}} \prod _{\small {X_j \in {\mathcal {X}}_{\hat{f_i}}}} \large {q}\left( X_j;\beta _{\small {X_j}}\right) \log \hat{f_i}\left( O_i, {\mathcal {X}}_{\hat{f_i}}, Y_i |{\mathcal {M}}\right) \end{aligned} \end{aligned}$$
(29)

This update is similar in form to the simpler case of fully factored mean field updates in a model without the additional mega-nodes. See Winn (2004) for more details on the traditional mean field updates. Note that here by using Eqs. (29) and (28) in Eq. (27), we also have that

$$\begin{aligned} \alpha ^{\text {(t+1)}}_{\small {Y_i}}&= \,\, \underset{\alpha _{\small {Y_i}}}{{\text {argmax}}} \,\, \sum _{\small {Y_i}}\large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \, \log \, \large {q}\left( Y_i;\alpha _{\small {Y_i}}^*\right) + H\left( \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \right) - \log {\mathcal {Z}}_{Y_i} \end{aligned}$$
(30a)
$$\begin{aligned}&= \,\, \underset{\alpha _{\small {Y_i}}}{{\text {argmax}}} \,\, - \textit{KL} \big [\large {q}(Y_i;\alpha _{\small {Y_i}}) \, || \, \large {q}(Y_i;\alpha _{\small {Y_i}}^*) \big ] + const. \end{aligned}$$
(30b)
$$\begin{aligned}&= \,\, \underset{\alpha _{\small {Y_i}}}{{\text {argmax}}} \,\, - \textit{KL} \big [\large {q}(Y_i;\alpha _{\small {Y_i}}) \, || \, \large {q}(Y_i;\alpha _{\small {Y_i}}^*) \big ], \end{aligned}$$
(30c)

where \(\textit{KL}\big [\large {q}(Y_i;\alpha _{\small {Y_i}}) \, || \, \large {q}(Y_i;\alpha _{\small {Y_i}}^*) \big ]\) is the Kullback–Leibler divergence. The constant in Eq. (30b) is simply the logarithm of the normalization factor, which involves only the variables' distributions \(\{\large {q}(X_j;\beta _{\small {X_j}})\}\) and is therefore independent of \(\large {q}(Y_i;\alpha _{\small {Y_i}})\). Note that, from Eq. (30c), we maximize the lower bound with respect to \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) by minimizing the Kullback–Leibler divergence. This means that the lower bound is maximized by setting \(\large {q}(Y_i;\alpha _{\small {Y_i}}) = \large {q}(Y_i;\alpha _{\small {Y_i}}^*)\).
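To make the update of Eqs. (28) and (29) concrete, here is a minimal sketch that enumerates the configurations of one small factor; factor_table and all names are illustrative, and the factor entries are assumed strictly positive so the logarithm is defined:

```python
import itertools
import numpy as np

def mean_field_update_q_y(factor_table, marginals):
    """q(Y_i) proportional to exp( E_{q(X)}[ log f_i(X, Y_i) ] ), Eq. (28).
    factor_table: dict mapping (x_config, y_value) -> positive factor value.
    marginals: list where marginals[j][v] = q(X_j = v)."""
    n_vars = len(marginals)
    y_values = sorted({y for (_, y) in factor_table})
    log_q = np.zeros(len(y_values))
    for k, y in enumerate(y_values):
        # Expected log-factor under the fully factored q(X), Eq. (29).
        for x in itertools.product([0, 1], repeat=n_vars):
            prob_x = np.prod([marginals[j][x[j]] for j in range(n_vars)])
            log_q[k] += prob_x * np.log(factor_table[(x, y)])
    q = np.exp(log_q - log_q.max())  # subtract max for numerical stability
    return q / q.sum()               # normalization Z_{Y_i}
```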

Now, likewise, when updating the distribution of each variable \(X_j\) we only consider the updated distributions \(\{\large {q}(Y_i;\alpha _{\small {Y_i}})\}\) of mega-nodes attached to the extended factors on which \(X_j\) appears (i.e., \(\hat{f_i} \in \hat{{\mathcal {F}}}_{X_j}\)). That is

$$\begin{aligned}&\large {q}(X_j;\beta _{\small {X_j}}^*) \nonumber \\&\quad = \frac{1}{{\mathcal {Z}}_{X_j}} \exp \bigg (\overbrace{\large {E}_{\{\large {q}^{(t+1)}(Y_i;\alpha _{\small {Y_i}}), \, \large {q}^{\text {(t)}}(X_k;{\mathcal {\beta }}_{\small {X_k}}) \, | \, \small {\hat{f_i} \in \hat{{\mathcal {F}}}_{X_j}}\}} \big [\log \hat{F}(O, X_j, {\mathcal {Y}} |{\mathcal {M}})\big ]}^{E_{q\small {({\mathcal {Y}})}} \text {-step}} \bigg ) \end{aligned}$$
(31)

where \(\hat{F}(O, X_j, {\mathcal {Y}} |{\mathcal {M}})\) is the part of \(P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}} |{\mathcal {M}})\) corresponding to the factors associated with the node \(X_j\). This part involves only the mega-nodes' \(q\) distributions in the Markov boundary of each \(X_j\), together with the \(q\) distributions from the previous iteration for the other variables \(X_k \ne X_j\) that are arguments to factors in which \(X_j\) appears.

Fig. 3 Illustrating the message-passing process of GEM-MP: (left) \(E_{q\small {({\mathcal {X}})}}\)-step messages from variables to factors; (right) \(E_{q\small {({\mathcal {Y}})}}\)-step messages from factors to variables

At this point, we have paved the way for GEM-MP message-passing inference by transforming the inference task into an instance of an EM-style approach often associated with learning tasks. GEM-MP inference proceeds by iteratively sending two types of messages on the extended factor graph so as to compute the updated \(q\) distributions needed for the M-steps above. The \(E_q\) and \(M_q\) steps are alternated until converging to a local maximum (Footnote 7) of \({\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}), \large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}))\). These messages differ from those of the standard LBP algorithm; they are formulated in the form of the E (i.e., \(E_{q\small {({\mathcal {X}})}}\), \(E_{q\small {({\mathcal {Y}})}}\)) and M (i.e., \(M_{q\small {({\mathcal {Y}})}}\), \(M_{q\small {({\mathcal {X}})}}\)) steps outlined in Eqs. (21), (28), (22), and (31), where the E-steps can be computed through the message-passing procedures outlined below:

  • \(E_{q\small {({\mathcal {X}})}}\)-step messages, \(\{\mu _{X_j \rightarrow \hat{f_i}} = \large {q}(X_j;\beta _{\small {X_j}})\}\), sent from variables \({\mathcal {X}}\) to factors \(\hat{{\mathcal {F}}}\) (as depicted in Fig. 3, left). The aim of sending these messages is to perform GEM-MP's \(M_{q\small {({\mathcal {Y}})}}\)-step in Eq. (21). That is, the current setting of the distributions \(\{\large {q}(X_j;\beta _{\small {X_j}})\}_{\small {\forall X_j \in {\mathcal {X}}}}\) is used to estimate the distributions \(\{\large {q}(Y_i;\alpha _{\small {Y_i}})\}_{\small {\forall Y_i \in {\mathcal {Y}}}}\) that maximize the lower bound on the log marginal-likelihood of Eq. (21). To do so, each variable \(X_j \in {\mathcal {X}}\) sends its current marginal probability \(\beta _{\small {X_j}}\) as an \(E_{q\small {({\mathcal {X}})}}\)-step message, \(\mu _{X_j \rightarrow \hat{f_i}} =\large {q}(X_j;\beta _{\small {X_j}})\), to its neighboring extended factors. Then, at the factor level, each extended factor \(\hat{f_i} \in \hat{{\mathcal {F}}}\) uses the relevant marginals from the incoming messages of its argument variables, i.e., \(\{\large {q}(X_j;\beta _{\small {X_j}})\}_{\small {\forall X_j \in {\mathcal {X}}_{\hat{f_i}}}}\), to perform the computations of the \(E_{q\small {({\mathcal {X}})}}\)-step of Eq. (21). This amounts to updating the distribution \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) of its mega-node \(Y_i\) by computing what we call the probabilistic generalized arc consistency (pGAC), which we discuss in more detail in Sect. 4.

  • \(E_{q\small {({\mathcal {Y}})}}\)-step messages, \(\{\mu _{\hat{f_i} \rightarrow X_j} = \sum _{\small {Y_i:\forall y_k(X_j)}} \large {q}(Y_i;\alpha _{\small { Y_i}})\}\), sent from factors to variables (as depicted in Fig. 3, right). Sending these messages corresponds to GEM-MP's \(M_{q\small {({\mathcal {X}})}}\)-step in Eq. (22). Here, the approximations of the distributions \(\{\large {q}(Y_i;\alpha _{\small {Y_i}})\}_{\small {\forall Y_i \in {\mathcal {Y}}}}\), obtained from GEM-MP's \(M_{q\small {({\mathcal {Y}})}}\)-step, are used to update the marginals \(\{\large {q}(X_j;\beta _{\small {X_j}})\}_{\small {\forall X_j \in {\mathcal {X}}}}\) that maximize the lower bound on the log marginal-likelihood in Eq. (22). Specifically, each extended factor \(\hat{f_i} \in \hat{{\mathcal {F}}}\) sends a corresponding refinement of the pGAC distribution (which approximates the \(\large {q}(Y_i;\alpha _{\small { Y_i}})\) of its mega-node) as an \(E_{q\small {({\mathcal {Y}})}}\)-step message, \(\mu _{\hat{f_i} \rightarrow X_j} = \sum _{\small {Y_i:\forall y_k(X_j)}} \large {q}(Y_i;\alpha _{\small { Y_i}})\), to each of its argument variables \(X_j \in {\mathcal {X}}_{\hat{f_i}}\). Then, at the variable level, each \(X_j \in {\mathcal {X}}\) uses the relevant refinements of the pGAC distributions from the incoming messages, i.e., the outgoing messages of its extended factors \(\hat{f_i} \in \hat{{\mathcal {F}}}_{X_j}\), to perform the computations of the \(E_{q\small {({\mathcal {Y}})}}\)-step of Eq. (22). This amounts to updating its distribution \(\large {q}(X_j;\beta _{\small {X_j}})\) by summing these messages (as will be discussed in more detail in Sect. 4). A minimal sketch of this summation follows the list.
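The following sketch makes the \(E_{q\small {({\mathcal {Y}})}}\)-step summation concrete (the clause, its valid entries, and all names are illustrative assumptions): given \(q\) over the valid local entries of a clause, the message to \(X_j\) sums \(q\) over the entries consistent with each value of \(X_j\).

```python
def factor_to_variable_message(q_valid_entries, j):
    # mu_{f -> X_j}[v] = sum over valid local entries y with y[j] == v of q(y)
    msg = [0.0, 0.0]
    for entry, prob in q_valid_entries.items():
        msg[entry[j]] += prob
    return msg

# Example: a clause over (X_0, X_1) whose valid entries are all but (0, 0).
q_y = {(0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.2}
print(factor_to_variable_message(q_y, 0))   # [0.5, 0.5]
```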

Other work has empirically observed that asynchronous belief propagation scheduling often yields faster convergence than synchronous schemes (Elidan et al. 2006). In variational message-passing schemes, the mathematical derivations lead to updates that are asynchronous in nature. In Sect. 4 we derive a general update rule for GEM-MP based on variational principles in more detail, and we shall see that it leads to an asynchronous scheduling of messages. However, messages can be passed in a structured form whereby the variables \({\mathcal {X}}\) send their \(E_{q\small {({\mathcal {X}})}}\)-step messages simultaneously to their factors (or mega-node variables \({\mathcal {Y}}\)). At the level of factors, the marginals are updated one at a time, and the factors then send the \(E_{q\small {({\mathcal {Y}})}}\)-step messages back simultaneously to their variables. Moreover, we carry out the asynchronous updating schedule between variables \({\mathcal {X}}\) and mega-nodes \({\mathcal {Y}}\) in a form that allows updates to be computed in parallel: the version of GEM-MP presented here sends messages in parallel from mega-nodes to variables and from variables to mega-nodes, so the updates to the \(\large {q}(X_j;\beta _{\small {X_j}})\) approximations for variables \(X_j \in {\mathcal {X}}\) can be computed in parallel, as can the updates to the \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) approximations for mega-nodes \(Y_i \in {\mathcal {Y}}\).

Theorem 1

(GEM-MP guarantees convergence) At each iteration of updating the marginals (i.e., the variational parameters \({\mathcal {B}}_{\small {{\mathcal {X}}}}\)), GEM-MP monotonically increases the lower bound on the model evidence, never overshooting the global optimum, until it converges naturally to some local optimum.

Proof

Substituting Eq. (15) into Eq. (13c) implies that the maximization of the lower bound \({\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}), \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}))\) is equivalent to the minimization of the Kullback–Leibler (\(\textit{KL}\)) divergence between the product \(\large {q}(\small {{\mathcal {X}}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) and the true distribution, \(P({\mathcal {X}},{\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})\), over the hidden variables:

$$\begin{aligned} \log \sum _{\small {{\mathcal {X}}},\small {{\mathcal {Y}}}} P({\mathcal {O}},{\mathcal {X}},{\mathcal {Y}}|{\mathcal {M}}) - {\mathcal {F}}_{\small {{\mathcal {M}}}}&= \sum _{\small {{\mathcal {X}}},\small {{\mathcal {Y}}}} \large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}) \, \log \frac{\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}{P({\mathcal {X}},{\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})} \end{aligned}$$
(32a)
$$\begin{aligned}&= \textit{KL}\bigg [\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}) \, || \, P({\mathcal {X}},{\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})\bigg ] \end{aligned}$$
(32b)
Fig. 4 Illustrating how each step of the GEM-MP algorithm is guaranteed to increase the lower bound on the log marginal-likelihood. In its “\(M_{q\small {({\mathcal {Y}})}}\)-step”, the variational distribution over the hidden mega-node variables is maximized according to Eq. (21). Then, in its “\(M_{q\small {({\mathcal {X}})}}\)-step”, the variational distribution over the hidden \({\mathcal {X}}\) variables is maximized according to Eq. (22)

Now assume that \(\large {q}^{(t)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})}\) and \(\large {q}^{(t+1)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})}\) denote the settings of \({\mathcal {B}}_{\small {{\mathcal {X}}}}\) before and after a given iteration (t), respectively, and likewise \(\large {q}^{(t)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}\) and \(\large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}\) for \({\mathcal {T}}_{\small {{\mathcal {Y}}}}\), where one iteration is a run of the GEM-MP “\(M_{q\small {({\mathcal {Y}})}}\)-step” followed by the “\(M_{q\small {({\mathcal {X}})}}\)-step”. By construction, in the \(M_{q\small {({\mathcal {Y}})}}\)-step, \(\large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}\) is chosen such that it maximizes \({\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q}^{(t)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})},\large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})})\) given \(\large {q}^{(t)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})}\). Then, in the \(M_{q\small {({\mathcal {X}})}}\)-step, \(\large {q}^{(t+1)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})}\) is set by maximizing \({\mathcal {F}}_{\small {{\mathcal {M}}}}(\large {q}^{(t+1)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})},\large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})})\) given \(\large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}\), and we have (as shown in Fig. 4):

$$\begin{aligned}&\textit{KL}\bigg [\large {q}^{(t)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})} \, \large {q}^{(t)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \, || \, P({\mathcal {X}}, {\mathcal {Y}} |{\mathcal {O}},{\mathcal {M}})\bigg ] \nonumber \\&\quad \ge \textit{KL}\bigg [\large {q}^{(t)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})} \, \large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \, || \, P({\mathcal {X}},{\mathcal {Y}} |{\mathcal {O}},{\mathcal {M}})\bigg ] \end{aligned}$$
(33)

and similarly:

$$\begin{aligned}&\textit{KL}\bigg [ \large {q}^{(t)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})} \, \large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \, || \, P({\mathcal {X}},{\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})\bigg ] \nonumber \\&\quad \ge \textit{KL}\bigg [ \large {q}^{(t+1)}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})} \, \large {q}^{(t+1)}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \, || \, P({\mathcal {X}},{\mathcal {Y}}|{\mathcal {O}},{\mathcal {M}})\bigg ] \end{aligned}$$
(34)

This implies that GEM-MP increases the lower bound monotonically.

Now, since the exact log marginal-likelihood, \(\log \sum _{\small {{\mathcal {X}}},\small {{\mathcal {Y}}}} P({\mathcal {O}}, {\mathcal {X}},{\mathcal {Y}}|{\mathcal {M}})\), is a fixed quantity and the Kullback–Leibler divergence is non-negative (\(\textit{KL} \ge 0\)), GEM-MP never overshoots the global optimum of the variational free energy.

Since GEM-MP applies a variational mean-field approximation to the distributions \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) and \(\large {q}({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\) [refer to Eqs. (23) and (24)] over the mega-nodes and variable nodes respectively, it inherits the guarantee of mean field to converge to a local minimum of the negative variational free energy, or equivalently of the KL divergence. \(\square \)

Note that the convergence behaviour of GEM-MP for the inference task resembles the behaviour of the variational Bayesian expectation maximization approach proposed by Beal and Ghahramani (2003) for the Bayesian learning task. Both can be seen as variational techniques (forming a factorial approximation) that minimize a free-energy-based function for estimating the marginal likelihood of probabilistic models with hidden variables.

It is worth noting that when reaching the GEM-MP “\(M_{q\small {({\mathcal {Y}})}}\)-step”, one could select between a local or a global approximation to the distribution \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\). In this paper, however, we restrict ourselves to local approximations (Footnote 8). Furthermore, although GEM-MP represents a general template framework for applying variational inference to probabilistic graphical models, we concentrate on Markov logic models, where the variables are ground atoms and the factors are both hard and soft ground clauses (as explained in Sect. 4), and on Ising models (as explained in Sect. 5).

4 GEM-MP general update rule for Markov logic

By substituting the local approximation for \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) from the \(M_{q\small {({\mathcal {Y}})}}\)-step into the \(M_{q\small {({\mathcal {X}})}}\)-step, we can synthesize update rules that tell us how to set the new marginal in terms of the old one. So, in practice, the \(M_{q\small {({\mathcal {Y}})}}\)-step and the \(E_{q\small {({\mathcal {Y}})}}\)-step messages of GEM-MP can be expressed as one set of messages (from atoms to atoms through clauses). This set of messages synthesizes a general update rule for GEM-MP, applicable to Markov logic. However, since the underlying factor graph often contains both hard and soft clauses, within the GEM-MP framework we intentionally distinguish them by using two variants of the general update rule (denoted the Hard-update-rule and the Soft-update-rule) for tackling hard and soft clauses, respectively.

4.1 Hard update rule

For notational convenience, we explain the derivation of the hard update rule by considering untyped atoms, but extending it to the more general case is straightforward. Also, for clarity, we begin the derivation with the \(M_{q\small {({\mathcal {X}})}}\)-step rather than with the usual \(M_{q\small {({\mathcal {Y}})}}\)-step; we thus assume that we have already constructed \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\). We also assume that all clauses are hard.

1. \(M_{q\small {({\mathcal {X}})}}\)-step: Recalling Sect. 3, our basic goal in this step is to use \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) to estimate the marginals (i.e., parameters) \({\mathcal {B}}_{\small {{\mathcal {X}}}}\) that maximize the expected log-likelihood such that each \(\beta _{\small {X_j}} \in {\mathcal {B}}_{\small {{\mathcal {X}}}}\) is a proper probability distribution. Thus we have an optimization problem of the form (Footnote 9):

$$\begin{aligned}&\displaystyle \max _{{\mathcal {B}}_{\small {{\mathcal {X}}}}} \large {E}_{\large {q} (\small {{\mathcal {X}}};{\mathcal {B}}_{\small {{\mathcal {X}}}}) \large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \big [\log P({\mathcal {O}}, {\mathcal {Y}} | {\mathcal {X}}, {\mathcal {M}}) \big ] \nonumber \\&\text {s.t.} \quad \displaystyle \left( \beta ^+_{\small {X_j}}+\beta ^-_{\small {X_j}}\right) =1, \quad \forall \beta _{\small {X_j}} \in {\mathcal {B}}_{\small {{\mathcal {X}}}} \end{aligned}$$
(35)

To perform this optimization, we first express it as the Lagrangian function \({\varLambda }({\mathcal {B}}_{\small {{\mathcal {X}}}})\):

$$\begin{aligned} {\varLambda }({\mathcal {B}}_{\small {{\mathcal {X}}}}) = \large {E}_{\large {q}(\small {{\mathcal {X}}};{\mathcal {B}}_{\small {{\mathcal {X}}}})\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})} \big [\log P({\mathcal {O}}, {\mathcal {Y}} | {\mathcal {X}}, {\mathcal {M}}) \big ] - T({\mathcal {B}}_{\small {{\mathcal {X}}}}) \end{aligned}$$
(36)

where \(T({\mathcal {B}}_{\small {{\mathcal {X}}}})\) is a constraint that ensures the marginals, \({\mathcal {B}}_{\small {{\mathcal {X}}}}\), are sound probability distributions. This constraint can be simply represented as follows:

$$\begin{aligned} T({\mathcal {B}}_{{\mathcal {X}}}) = \sum _{\small {X_j} \in {\mathcal {X}}} \lambda _{X_j} \left( 1-\beta ^+_{X_j}-\beta ^-_{X_j}\right) \end{aligned}$$
(37)

where \(\lambda _{\small {X_j}}\) are Lagrange multipliers that penalize the marginal distribution \(\beta _{\small {X_j}}\) if it does not sum to exactly one.

Now, let us turn to the derivation of the expected log-likelihood. We have that:

$$\begin{aligned} \begin{aligned}&\large {E}_{\large {q}_{({\mathcal {X}};{\mathcal {B}}_{\small {{\mathcal {X}}}})} \large {q}_{({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})}} \big [\log P({\mathcal {O}}, {\mathcal {Y}} |{\mathcal {X}}, {\mathcal {M}}) \big ] \\&\quad = \sum _{\small {Y_i \in {\mathcal {Y}}}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \sum _{\small {X_j \in {\mathcal {X}}}} \large {q}\left( X_j;\beta _{\small {X_j}}\right) \log P\left( {\mathcal {O}}, {\mathcal {Y}} |{\mathcal {X}}, {\mathcal {M}}\right) \end{aligned} \end{aligned}$$
(38)

Based on the (hidden) variables \(X_j \in {\mathcal {X}}\) and mega-nodes \(Y_i \in {\mathcal {Y}}\), we can then decouple the distribution, \(\log P({\mathcal {O}}, {\mathcal {Y}} | {\mathcal {X}}, {\mathcal {M}})\), into individual distributions corresponding to hard ground clauses, and we have (Footnote 10):

$$\begin{aligned} \large {E}_{\large {q}\left( \small {{\mathcal {X}}}; {\mathcal {B}}_{\small {{\mathcal {X}}}}\right) \large {q} \left( {\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}}\right) } \big [\log P({\mathcal {O}}, {\mathcal {Y}} | {\mathcal {X}}, {\mathcal {M}}) \big ]&\propto \sum _{X_j, Y_i} \large {q}\left( X_j;\beta _{\small {X_j}}\right) \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \nonumber \\&\quad \times \log \bigg [\prod _{f^h_i \in {\mathcal {F}}^h} P(Y_i |{\mathcal {X}}, {\mathcal {M}})\bigg ] \end{aligned}$$
(39)

where \(P(Y_i |{\mathcal {X}}, {\mathcal {M}})\) is the probability of randomly choosing a valid local entry in the mega-node \(Y_i\) given the marginal probabilities of the ground atoms, \({\mathcal {B}}_{\small {{\mathcal {X}}}}\). Now, we can proceed by decomposing \(P(Y_i |{\mathcal {X}}, {\mathcal {M}})\) into individual marginals of ground atoms that possess consistent truth values in the valid local entries of \(Y_i\). That is:

$$\begin{aligned} \log \bigg [\prod _{f^h_i \in {\mathcal {F}}^h} P(Y_i |{\mathcal {X}}, {\mathcal {M}})\bigg ] \approx \log \bigg [\prod _{f^h_i \in {\mathcal {F}}^h} \prod _{\small {X_j} \in {\mathcal {X}}_{f^h_i}} \beta _{\small {X_j}} (Y_i(X_j)) \bigg ] \end{aligned}$$
(40)

where \(\beta _{\small {X_j}} (Y_i(X_j))\) is the marginal probability of ground atom \(X_j\) at its consistent values with \(Y_i\).

It is important to note that the decomposition in Eq. (40) is a mean field approximation for \(P(Y_i |{\mathcal {X}}, {\mathcal {M}})\). It implies that the probability of valid local entries of \(Y_i\) for the ground clause \(f_i\) can be computed using individual marginals of the variables in the scope of \(f_i\) at their instantiations over such local entries. For instance, suppose that \(f_i(X_1,X_2,X_3)\) is defined over three Boolean variables \(\{X_1,X_2,X_3 \}\) with marginal probabilities \(\{\beta _{\small {X_1}},\beta _{\small {X_2}},\beta _{\small {X_3}}\}\). Now let (0, 0, 1) be a valid local entry in the mega-node \(Y_i\) of \(f_i\). To compute the probability \(P(0,0,1 | \beta _{\small {X_1}},\beta _{\small {X_2}},\beta _{\small {X_3}}, {\mathcal {M}})\), we can simply multiply the marginals of the three variables at their instantiations over this valid local entry as:

$$\begin{aligned} P\left( 0,0,1 | \beta _{\small {X_1}}, \beta _{\small {X_2}}, \beta _{\small {X_3}}, {\mathcal {M}}\right) = \beta ^{-}_{\small {X_1}} \times \beta ^{-}_{\small {X_2}} \times \beta ^{+}_{\small {X_3}} \end{aligned}$$

where \(\beta ^{-}_{\small {X_1}}\), \(\beta ^{-}_{\small {X_2}}\) and \(\beta ^{+}_{\small {X_3}}\) are the marginal probabilities of \(X_1\), \(X_2\), and \(X_3\) at values 0, 0, and 1, respectively.
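For instance, with made-up marginals this product evaluates as follows (a sketch, not values from the paper):

```python
# Numeric check of the product above; beta[x] = (negative, positive).
beta = {"X1": (0.3, 0.7), "X2": (0.6, 0.4), "X3": (0.2, 0.8)}
p_001 = beta["X1"][0] * beta["X2"][0] * beta["X3"][1]
print(p_001)  # 0.3 * 0.6 * 0.8 = 0.144
```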

We apply Eq. (40) in Eq. (39), convert logarithms of products into sums of logarithms, exchange summations, and handle each hard ground clause \(f^h_i \in {\mathcal {F}}^h\) separately in a sum.

Subsequently, we take the partial derivative of the Lagrangian function in Eq. (36) with respect to an individual ground atom's positive marginal \(\beta ^+_{X_j}\) and equate it to zero (Footnote 11):

$$\begin{aligned} \frac{\partial }{\partial \beta ^{+}_{\small {X_j}}}\left[ {\varLambda }({\mathcal {B}}_{\small {{\mathcal {X}}}})\right] = 0 \Rightarrow \beta ^{+}_{\small {X_j}} = \frac{1}{\lambda _{\small {X_j}}} \underbrace{\left( \sum _{f^h_i \in {\mathcal {F}}_{X_j}^h} \overbrace{\sum _{\small {Y_i:\forall y_k(X_j)=``+''}} \large {q}(Y_i; \alpha _{\small {Y_i}})}^{(E_{q\small {({\mathcal {Y}})}}\text {-step message}) \,\, \mu _{f_i\rightarrow X_j}}\right) }_{\textit{Weight}^+_{X_j}} \end{aligned}$$
(41)

where \(\sum _{\small {Y_i:\forall y_k(X_j)=``+''}} \large {q}(Y_i;\alpha _{\small {Y_i}})\) is the \(E_{q\small {({\mathcal {Y}})}}\)-step message that \(X_j\) will receive from each hard ground clause (\(f^h_i \in {\mathcal {F}}_{X_j}^h\)) conveying what it believes about \(X_j\)’s positive marginal. Each \(E_{q\small {({\mathcal {Y}})}}\)-step message is computed by adding a term for those valid local entries \((Y_i:\forall y_k(X_j)=``+'')\) which instantiate the current hard ground clause using the positive value \(``+''\) for ground atom \(X_j\).

Thus the sum of the \(E_{q\small {({\mathcal {Y}})}}\)-step messages that ground atom \(X_j\) will receive from its neighboring hard ground clauses represents a weight (i.e., \(\textit{Weight}^+_{X_j}\)) used to update its positive marginal.

Furthermore an analogous expression can be applied for a negative marginal \(\beta ^-_{X_j}\):

$$\begin{aligned} \frac{\partial }{\partial \beta ^-_{X_j}}\left[ {\varLambda }({\mathcal {B}}_{{\mathcal {X}}})\right] = 0 \Rightarrow \beta ^-_{X_j} = \frac{1}{\lambda _{X_j}} \underbrace{\left( \sum _{f^h_i \in {\mathcal {F}}_{X_j}^h} \overbrace{\sum _{\small {Y_i:\forall y_k(X_j)=``-''}} \large {q}(Y_i;\alpha _{\small { Y_i}})}^{(E_{q\small {({\mathcal {Y}})}}\text {-step message}) \,\, \mu _{f_i(Y_i) \rightarrow X_j}}\right) }_{\textit{Weight}^-_{X_j}} \end{aligned}$$
(42)

Finally, we solve for \(\lambda _{\small {X_j}}\) as follows:

$$\begin{aligned} \beta ^-_{X_j} + \beta ^+_{X_j} = 1 \overset{\textit{Eqs.}~(41) \, \textit{and} \, (42)}{\Rightarrow } \frac{\textit{Weight}^+_{X_j}+\textit{Weight}^-_{X_j}}{\lambda _{\small {X_j}}} = 1 \Rightarrow \lambda _{\small {X_j}} = \textit{Weight}^+_{X_j}+\textit{Weight}^-_{X_j} \end{aligned}$$
(43)

which shows that \(\lambda _{\small {X_j}}\) serves as a normalizing constant that converts such weights (i.e., \(\textit{Weight}^+_{X_j}\), and \(\textit{Weight}^-_{X_j}\)) into a sound marginal probability (i.e. \(\beta _{\small {X_j}} = \left[ \beta ^-_{\small {X_j}} , \beta ^+_{\small {X_j}}\right] \)).

Now to obtain the completed hard update rule, what remains is the \(M_{q\small {({\mathcal {Y}})}}\)-step, through which we need to substitute the distribution \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) in Eqs. (41) and (42).

2. \(M_{q\small {({\mathcal {Y}})}}\)-step: The goal here is to produce the distribution \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) using the current setting of the marginals \({\mathcal {B}}_{\small {{\mathcal {X}}}}\). However, the summation \(\sum _{\small {Y_i:\forall y_k(X_j)=``-''}}\) involves enumerating all the valid local entries of each \(Y_i\), which is inefficient. Instead, we approximate the distribution \(\sum _{\small {Y_i:\forall y_k(X_j)=``-''}} \large {q}(Y_i;\alpha _{\small {Y_i}})\) for each hard ground clause \(f^h_i \in {\mathcal {F}}^h\) by a probability \(1-\xi (X_j,f^h_i)\), which we call the probabilistic generalized arc consistency (pGAC). Let us pause to elaborate on pGAC in the next subsection.

4.1.1 Note on the connection between pGAC and variational inference

According to the concept of generalized arc consistency, a necessary (but not sufficient) condition for a ground atom \(X_j\) to be assigned a value \(d \in \{+,-\}\) is for every other ground atom appearing in the ground clause \(f_i\) to be individually consistent in the support of this assignment, i.e., \(X_j=d\). Without loss of generality, suppose that \(X_j\) appears positively in \(f_i\): there is a probability that \(X_j=d\) is not generalized arc consistent with respect to \(f_i\) when the other ground atoms appearing in \(f_i\) are individually inconsistent with this assignment, since \(X_j=d\) can belong to an invalid local entry of \(f_i\). This means that there is a probability that \(X_j=d\) is unsatisfiable with respect to \(f_i\) when all other ground atoms appearing in \(f_i\) are assigned values that do not satisfy \(f_i\). We use \(\xi (X_j,f_i)\) to denote this probability; assuming independence, we approximate it as:

$$\begin{aligned} \xi (X_j,f_i) = \left( \prod _{\small {X_k \in {\mathcal {X}}_{f_i^+} {\setminus } \{X_j\}}} \left( 1- \beta _{\small {X_k}}^{+} \right) \cdot \prod _{\small {X_k \in {\mathcal {X}}_{f_i^-} {\setminus } \{X_j\} }} \left( \beta _{\small {X_k}}^{+} \right) \right) \end{aligned}$$
(44)

As indicated in Eq. (44), \(\xi (X_j,f_i)\) is computed by iterating through all the other ground atoms in clause \(f_i\) and consulting their marginals toward the opposite truth value of their appearance in \(f_i\). In other words, \(\xi (X_j,f_i)\) is a product representing the probability that all ground atoms \({\mathcal {X}}_{f_i} {\setminus } \{X_j\}\) in \(f_i\) other than \(X_j\) take on the particular values that constitute invalid local entries of \(f_i\). Such invalid local entries support \(X_j\) unsatisfying \(f_i\), and their probability can be approximated from the marginal distributions of those ground atoms (i.e., \({\mathcal {X}}_{f_i} \setminus \{X_j\}\)) at these particular values. It should be noted that \(f_i\) obtains those marginal distributions from the incoming \(E_{q\small {({\mathcal {X}})}}\)-step messages sent by its argument ground atoms \({\mathcal {X}}_{f_i}\) during GEM-MP's \(M_{q\small {({\mathcal {Y}})}}\)-step.

Hence, if \(\xi (X_j,f_i)\) is the probability of \(X_j=d\) unsatisfying \(f_i\), then \(1-\xi (X_j,f_i)\) is directly the probability of \(X_j=d\) satisfying the ground clause \(f_i\). It also represents the probability that \(X_j=d\) is GAC with respect to \(f_i\), because the event of \(X_j=d\) satisfying \(f_i\) implies that it must be GAC to \(f_i\). This interpretation entails a form of generalized arc consistency, adapted to CNF, in a probabilistic sense; we call it probabilistic generalized arc consistency.
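Before stating the formal definition, a minimal sketch of Eq. (44) may help; the integer atom indexing, the clause encoding, and the beta_pos container are illustrative assumptions, not notation from the paper:

```python
def xi(j, pos_atoms, neg_atoms, beta_pos):
    """Eq. (44), sketched: the probability that every atom of clause f_i
    other than X_j takes the value that fails to satisfy f_i, assuming
    independence (mean field).  pos_atoms / neg_atoms index the atoms
    appearing positively / negatively in f_i; beta_pos[k] = q(X_k = +)."""
    prob = 1.0
    for k in pos_atoms:
        if k != j:
            prob *= 1.0 - beta_pos[k]  # a positive literal fails when X_k = -
    for k in neg_atoms:
        if k != j:
            prob *= beta_pos[k]        # a negative literal fails when X_k = +
    return prob

# Example: clause (X0 v ~X1) seen from X0, with q(X1 = +) = 0.9:
# xi = 0.9, so the pGAC probability of X0 = - is 1 - 0.9 = 0.1.
print(xi(0, [0], [1], {0: 0.5, 1: 0.9}))  # 0.9
```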

Definition 2

(Probabilistic generalized arc consistency (pGAC))

Given a ground clause \(f_i \in {\mathcal {F}}\) defined over ground atoms \({\mathcal {X}}_{f_i}\), and for every \(X_j \in {\mathcal {X}}_{f_i}\), let \(D_{X_j}=\{+,-\}\) be the domain of \(X_j\). A ground atom \(X_j\) assigned a truth value \(d \in D_{X_j}\) is said to be probabilistically generalized arc consistent (pGAC) to ground clause \(f_i\) if the probability of \(X_j=d\) belonging to a valid local entry of \(f_i\) is non-zero. That is to say, if there is a non-zero probability that \(X_j=d\) is GAC to \(f_i\). The pGAC probability of \(X_j=d\) can be approximated as:

$$\begin{aligned} 0 < 1 - \xi (X_j,f_i) \le 1 \end{aligned}$$
(45)

The definition of traditional GAC in Sect. 2 corresponds to the particular case of pGAC where \(\xi (X_j,f_i)=0\), meaning that \(X_j=d\) is certainly GAC to \(f_i\), and \(\xi (X_j,f_i)=1\) when it is never GAC to \(f_i\). Based on that, if \(f_i\) contains \(X_j\) positively, then the pGAC probability of \(X_j=+\) equals 1 because it is always GAC to \(f_i\). In an analogous way, the pGAC probability is 1 for \(X_j=-\) when \(f_i\) contains \(X_j\) negatively.

From a probabilistic perspective, the pGAC probability of \(X_j=d\) represents the probability that \(X_j=d\) is involved in a valid local entry of \(f_i\). This is similar to the computation of the solution probability of \(X_j=d\) by probabilistic arc consistency (pAC) (presented by Horsch and Havens 2000, and summarized in Sect. 2). However, it should be noted that our pGAC applies a mean-field approximation: when computing \(\xi (X_j,f_i)\), as defined in Eq. (44), for each ground atom \(X_j \in {\mathcal {X}}_{f_i}\), we use the marginal probabilities of the other ground atoms \(X_k \in {\mathcal {X}}_{f_i} {\setminus } \{X_j\}\) at the values that do not satisfy \(f_i\). Thus, the main difference between our pGAC and pAC (Horsch and Havens 2000) lies in the use of mean field versus BP, respectively, for computing the probability that \(X_j=d\) belongs to a valid local entry of \(f_i\). Furthermore, it should be noted that pAC is restricted to binary constraints whilst pGAC is additionally applicable to non-binary ones.

From the point of view of computational complexity, \(\xi (X_j,f_i)\) requires only time linear in the arity of the ground clause (as will be shown in Proposition 3). Thus, pGAC is an efficient form of GAC compared to pAC. In addition, pGAC inherits the convergence guarantee of mean field, whereas pAC inherits the possibility of non-convergence from BP.

From a statistical perspective, the pGAC probability of \(X_j=+\) is a closed form approximation for a sample from the valid local entries of \(f_i\) that involve \(X_j=+\). Thus we have that:

$$\begin{aligned} \big [1-\xi (X_j,f_i)\big ]\Bigm |_{X_j = ``+''} \propto \sum _{\small {Y_i:\forall y_k(X_j)=``+''}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \end{aligned}$$
(46)

And similarly the pGAC probability for \(X_j=-\):

$$\begin{aligned} \big [1-\xi (X_j,f_i)\big ]\Bigm |_{X_j = ``-''} \propto \sum _{\small {Y_i:\forall y_k(X_j)=``-''}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \end{aligned}$$
(47)

Based on Eqs. (46) and (47), we can use pGAC to compute the two components of the \(E_{q\small {({\mathcal {Y}})}}\)-step message, in Eqs. (41) and (42), that \(f_i\) sends to \(X_j\), as follows:

  • \([1,1-\xi (X_j,f_i)]\) if \(f_i\) contains \(X_j\) positively.

  • \([1-\xi (X_j,f_i),1]\) if \(f_i\) contains \(X_j\) negatively.

Note that computing the components of \(f_i\)'s \(E_{q\small {({\mathcal {Y}})}}\)-step message in this way requires having in hand the marginals of all other ground atoms, \(X_k \in {\mathcal {X}}_{f_i} {\setminus } \{X_j\}\). Thus, one of the best choices is to simultaneously pass the \(E_{q\small {({\mathcal {X}})}}\)-step messages, which convey the marginals, from the ground atoms \({\mathcal {X}}_{f_i}\) to the ground clause \(f_i\). Additionally, at \(f_i\)'s level we can update the marginals sequentially: obtain the marginal of the first ground atom, use its new marginal when updating the second atom's marginal, then use the first and second atoms' new marginals when updating the third atom's marginal, and so on. This sequential updating allows GEM-MP to use the latest available information about the marginals throughout the updating process. In addition, doing so enables a single update rule that performs both the E- and M-steps at the same time, by directly representing the \(M_{q\small {({\mathcal {Y}})}}\)-step within the rule we derived for the \(M_{q\small {({\mathcal {X}})}}\)-step.

4.1.2 Using pGAC in the derivation of the hard update rule

We now continue the derivation of the hard update rule by using pGAC to address the task of producing \(\sum _{\small {Y_i:\forall y_k(X_j)}} \large {q}(Y_i;\alpha _{\small {Y_i}})\) in Eqs. (41) and (42) as follows:

$$\begin{aligned} \textit{Weight}^+_{X_j}&= \sum _{f^h_i \in {\mathcal {F}}_{X_j}^h} \sum _{\small {Y_i:\forall y_k(X_j)=``+''}} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\right) \end{aligned}$$
(48a)
$$\begin{aligned}&= \bigg [\sum _{f^h_i \in {\mathcal {F}}_{X_j+}^h} \sum _{\small {Y_i:\forall y_k(X_j)=``+''}} \large {q}(Y_i;\alpha _{\small {Y_i}})\bigg ] + \bigg [\sum _{f^h_i \in {\mathcal {F}}_{X_j-}^h} \sum _{\small {Y_i:\forall y_k(X_j)=``+''}} \large {q}(Y_i;\alpha _{\small {Y_i}})\bigg ] \end{aligned}$$
(48b)
$$\begin{aligned}&\approx \sum _{f^h_i \in {\mathcal {F}}_{X_j+}^h} [1] + \sum _{f^h_i \in {\mathcal {F}}_{X_j-}^h} \left( 1-\xi \left( X_j,f^h_{i}\right) \right) \end{aligned}$$
(48c)
$$\begin{aligned}&= \left| {\mathcal {F}}_{X_j}^h\right| - \sum _{f^h_i \in {\mathcal {F}}_{X_j-}^h} \xi \left( X_j,f^h_{i}\right) \end{aligned}$$
(48d)

where in Eq. (48b) we first separate the summation into \(X_j\)'s positive and negative hard ground clauses, to distinguish the situation where \(X_j\) appears as a positive ground atom from the one where it appears as a negative ground atom. Then, in the first (positive) summation of Eq. (48c), we replace the inner summation with the constant 1, because all other atoms will be generalized arc consistent with \(X_j=``+''\) for the hard clauses in which \(X_j\) appears positively (as explained in Sect. 4.1.1).

The end result, as in Eq. (48d), is the \(\textit{Weight}^+_{X_j}\) of ground atom \(X_j\), computed as the number of all hard ground clauses that include \(X_j\) minus the sum of the probabilities \(\xi (X_j,f^h_{i})\) over the hard ground clauses that involve \(X_j\) as a negative atom.

\(\textit{Weight}^+_{X_j}\) can be interpreted as reducing the positive probability of \(X_j\) according to the expected probability that \(X_j\) is needed by its negative hard ground clauses. These reductions are taken from a constant representing the overall number of hard ground clauses that involve \(X_j\) (i.e., \(\left| {\mathcal {F}}_{X_j}^h\right| \)). Similarly, we can obtain:

$$\begin{aligned} \textit{Weight}^-_{X_j} = \left| {\mathcal {F}}_{X_j}^h\right| - \sum _{f^h_i \in {\mathcal {F}}_{X_j+}^h} \xi \left( X_j,f^h_{i}\right) \end{aligned}$$
(49)

where \(\textit{Weight}^-_{X_j}\) has an analogous interpretation of \(\textit{Weight}^+_{X_j}\) for the negative probability of \(X_j\).
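Putting Eqs. (41)–(43), (48d), and (49) together, the hard update rule can be sketched as follows (reusing the xi helper sketched in Sect. 4.1.1; the clause encodings and names are illustrative assumptions):

```python
def hard_update(j, hard_pos, hard_neg, beta_pos):
    """Hard update rule, sketched.  hard_pos / hard_neg: the hard clauses
    containing X_j positively / negatively, each encoded as a
    (pos_atoms, neg_atoms) pair as in the xi() sketch above."""
    n_clauses = len(hard_pos) + len(hard_neg)          # |F^h_{X_j}|
    w_plus = n_clauses - sum(xi(j, p, n, beta_pos) for (p, n) in hard_neg)
    w_minus = n_clauses - sum(xi(j, p, n, beta_pos) for (p, n) in hard_pos)
    lam = w_plus + w_minus                             # Eq. (43)
    return w_minus / lam, w_plus / lam                 # (beta^-, beta^+)
```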

4.2 Soft update rule

To derive the update rule for soft ground clauses, we need to soften some restrictions on the weight parts (i.e., \(\textit{Weight}^+_{X_j}\), \(\textit{Weight}^-_{X_j}\)) of the hard update rule. This encompasses modifying the distributions \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) of hard ground clauses for soft ground clauses by applying two consecutive steps: softening and embedding.

For clarity, let us recall the example of the extended factor graph shown in Fig. 2 (right). In the softening step, we define the variational parameters \(\alpha _{\small {Y_i}}\) of the distributions \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) appended to soft clauses differently from those appended to hard clauses, in a way that matches the semantics of soft ground clauses. That is, we discriminate the variational parameters of the distributions \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) for hard and soft ground clauses respectively as follows:

$$\begin{aligned} \alpha _{\small {Y_i}}\left( f^h_i\right)= & {} {\left\{ \begin{array}{ll} 1 &{} \quad \text {if the state of } Y_i \text { satisfies } f^h_i, \\ 0 &{} \quad \text {Otherwise.} \end{array}\right. } \nonumber \\ \alpha _{\small {Y_i}}\left( f^s_i\right)= & {} {\left\{ \begin{array}{ll} \exp (w_{f^s_{i}}) &{} \quad \text {if the state of } Y_i\text { satisfies } f^s_i, \\ 1 &{} \quad \text {Otherwise.} \end{array}\right. } \end{aligned}$$
(50)

where \(w_{f^s_{i}}\) is the numeric weight associated with soft ground clause \(f^s_i\). Now, the use of variational parameters \(\alpha _{\small {Y_i}}\left( f^s_i\right) \) (instead of \(\alpha _{\small {Y_i}}\left( f^h_i\right) \)) for the hard update rule in Eq. (48d) implies taking the exponential transformation as follows:

$$\begin{aligned} \overset{\textit{Softening}}{\Rightarrow } \beta ^+_{\small {X_j}} = \frac{1}{\lambda _{\small {X_j}}} \left( \sum _{\small {Y_i:\forall y_k(X_j)=``+''}} \underbrace{\exp \bigg [\sum _{f^h_i \in {\mathcal {F}}_{X_j}^h} \large {q}(Y_i;\alpha _{\small {Y_i}}(f^h_i))\bigg ]}_{\prod _{f^s_i \in {\mathcal {F}}_{X_j}^s} \exp \big (\large {q}(Y_i;\alpha _{\small {Y_i}}(f^h_i))\big )}\right) \end{aligned}$$
(51)

Note that \(\exp \bigg [\sum _{f^h_i \in {\mathcal {F}}_{X_j}^h} \large {q}(Y_i;\alpha _{\small {Y_i}}(f^h_i))\bigg ]\) is converted simply to:

$$\begin{aligned} \prod _{f^s_i \in {\mathcal {F}}_{X_j}^s} \exp \big (\large {q}(Y_i;\alpha _{\small {Y_i}}(f^h_i))\big ) \end{aligned}$$

where

$$\begin{aligned} \exp \big (\large {q}(Y_i;\alpha _{\small {Y_i}}(f^h_i))\big ) \approx \large {q}(Y_i;\alpha _{\small {Y_i}}(f^s_i)). \end{aligned}$$

Accordingly, in the embedding step we embed the support of the invalid local entries. This is because an unsatisfied soft ground clause contributes 1, whereas an unsatisfied hard ground clause contributes 0. Thus, we discard the summation over valid local entries (i.e., remove \(\sum _{\small {Y_i:\forall y_k(X_j)=``+''}}\) in Eq. (51)) and instead consider the support of both the valid local entries (weighted by \(\exp (w_{f^s_{i}})\)) and the invalid local entries (weighted by 1), ending up with:

$$\begin{aligned} \overset{\textit{Embedding}}{\Rightarrow } \beta ^+_{\small {X_j}} = \frac{1}{\lambda _{\small {X_j}}} \underbrace{\left( \prod _{f^s_i \in {\mathcal {F}}_{X_j}^s} \large {q}\left( Y_i;\alpha _{\small {Y_i}}(f^s_i)\right) \right) }_{v^+_{X_j}} \end{aligned}$$
(52)

Likewise, adhering to the derivation of the hard update rule, we can obtain the local approximation of the \(\textit{Weight}^+_{X_j}\) part for the soft update rule as:

$$\begin{aligned} \textit{Weight}^+_{X_j}&= \prod _{f^s_i \in {\mathcal {F}}_{X_j}^s} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\left( f^s_i\right) \right) \end{aligned}$$
(53a)
$$\begin{aligned}&= \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j+}^s} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\left( f^s_i\right) \right) \bigg ] \times \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j-}^s} \large {q}\left( Y_i;\alpha _{\small {Y_i}}\left( f^s_i\right) \right) \bigg ] \end{aligned}$$
(53b)
$$\begin{aligned}&\approx \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j+}^s} \exp \left( w_{f^s_{i}}\right) [1] \bigg ] \nonumber \\&\quad \times \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j-}^s} \left[ \left( 1-\xi \left( X_j,f^s_{i}\right) \right) \exp \left( w_{f^s_{i}}\right) + \xi \left( X_j,f^s_{i}\right) \cdot 1\right] \bigg ] \end{aligned}$$
(53c)
$$\begin{aligned}&= \left[ \exp \left( \sum _{f^s_{i} \in {\mathcal {F}}_{X_j}^s} w_{f^s_{i}}\right) \right] \nonumber \\&\quad - \left[ \prod _{f^s_{i} \in {\mathcal {F}}_{X_j+}^s} \exp \left( w_{f^s_{i}}\right) \times \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j-}^s} \xi \left( X_j,f^s_{i}\right) \left( \exp \left( w_{f^s_{i}}\right) - 1\right) \bigg ]\right] \end{aligned}$$
(53d)

Note that, comparing Eq. (53c) to the corresponding Eq. (48c) for the hard update rule, we have an additional term “\(\xi (X_j,f^s_{i}) \cdot 1\)” in the second product. This is because computing the second part of Eq. (53b) involves the two terms that appear in the second part of Eq. (53c). The first is \(1 - \xi (X_j,f^s_{i})\), the probability that \(X_j\) being positive satisfies the factor \(f^s_{i}\) that includes \(X_j\) as a negative ground atom; it is multiplied by \(\exp (w_{f^s_{i}})\), since satisfying the soft ground clause \(f^s_{i}\) yields \(\exp (w_{f^s_{i}})\). The second term is \(\xi (X_j,f^s_{i})\), the probability that \(X_j\) being positive dissatisfies the factor \(f^s_{i}\); it is multiplied by 1, since dissatisfying \(f^s_{i}\) yields 1. This “\(\xi (X_j,f^s_{i}) \cdot 1\)” term disappears from the hard update rule in Eq. (48c) because there \(\xi (X_j,f^h_{i})\) is multiplied by 0: the dissatisfaction of a hard ground clause yields 0 instead of the 1 yielded by the dissatisfaction of a soft ground clause.

Similarly, we can obtain the negative weight part \(\textit{Weight}^-_{X_j}\) for the soft update rule as:

$$\begin{aligned} {\textit{Weight}}^-_{X_j}&= \left[ \exp \left( \sum _{f^s_{i} \in {\mathcal {F}}_{X_j}^s} w_{f^s_{i}}\right) \right] \nonumber \\&\quad \,- \left[ \prod _{f^s_{i} \in {\mathcal {F}}_{X_j-}^s} \exp \left( w_{f^s_{i}}\right) \times \bigg [\prod _{f^s_{i} \in {\mathcal {F}}_{X_j+}^s} \xi \left( X_j,f^s_{i}\right) \left( \exp \left( w_{f^s_{i}}\right) - 1\right) \bigg ]\right] \end{aligned}$$
(54)

Note that the weight parts [in Eqs. (53d) and (54)] used for the soft update rule are soft versions of the previously derived weight parts [in Eqs. (48d) and (49)] used for the hard update rule. Therefore, at a high level, they have similar interpretations.

At this point, we take \(\textit{Weight}^+_{X_j}\) and \(\textit{Weight}^-_{X_j}\) from Eqs. (48d), (49), (53d), and (54) and substitute them for \(\textit{Weight}^+_{X_j}\) and \(\textit{Weight}^-_{X_j}\) in Eqs. (41) and (42) to obtain our ultimate set of GEM-MP update rules for the marginals of the query ground atoms, shown in Table 3. The main advantage of these update rules is that they capture the relationships among ground atoms directly; thus, we do not need to explicitly pass messages from atoms to clauses or vice versa.
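As an illustration only, the two soft weight parts can be sketched in the same style (the clause encoding (pos_atoms, neg_atoms, w) and all names are assumptions, with xi() as sketched in Sect. 4.1.1):

```python
import math

def soft_weights(j, soft_pos, soft_neg, beta_pos):
    """Weight parts of the soft update rule, Eqs. (53d) and (54), sketched.
    soft_pos / soft_neg: soft clauses containing X_j positively / negatively,
    each encoded as (pos_atoms, neg_atoms, w) with numeric weight w."""
    total = math.exp(sum(w for (_, _, w) in soft_pos + soft_neg))

    def correction(satisfied, opposed):
        prod = 1.0
        for (_, _, w) in satisfied:       # clauses X_j itself satisfies
            prod *= math.exp(w)
        for (p, n, w) in opposed:         # clauses where X_j may fail
            prod *= xi(j, p, n, beta_pos) * (math.exp(w) - 1.0)
        return prod

    weight_plus = total - correction(soft_pos, soft_neg)    # Eq. (53d)
    weight_minus = total - correction(soft_neg, soft_pos)   # Eq. (54)
    return weight_plus, weight_minus
```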

Note that, on one hand, using a single update rule for updating the marginals keeps the implementation simple. On the other hand, using a scheduling other than the one used here for the GEM-MP framework requires either re-deriving GEM-MP's equations to obtain single update rules adapted to the new scheduling, or forgoing single update rules and explicitly passing the \(M_{q\small {({\mathcal {Y}})}}\)-step and \(E_{q\small {({\mathcal {Y}})}}\)-step messages from variables to factors and from factors to variables, respectively.

Table 3 General update rules of GEM-MP inference for Markov logic. These rules capture the relationships among ground atoms, and therefore do not require explicitly passing messages between atoms and clauses

4.3 GEM-MP versus LBP

One might contrast GEM-MP with LBP inference. Recall the basic quantities used by GEM-MP in Eqs. (41) and (42) versus LBP in Eqs. (3) and (4) for updating the marginal of a single variable \(X_j\). Although the marginal update rules of the two algorithms look similar, they are constructed by very different routes and have important differences. The first significant difference is that, due to the expectations involved in variational message passing, GEM-MP takes a summation (i.e., \(\sum _{f_i \in {\mathcal {F}}_{X_j}}\)) over the incoming messages to a given node, which are the outgoing messages coming from the factors. This is in contrast to the multiplication (i.e., \(\prod _{f_i \in {\mathcal {F}}_{X_j}}\)) associated with standard LBP. In other words, GEM-MP handles the incoming message (the \(E_{q\small {({\mathcal {Y}})}}\)-step message) from each factor as a separate term in a sum. This means that when moving toward the local maximum of the energy functional \({\mathcal {F}}_{\small {{\mathcal {M}}}}\) in Eq. (17c), GEM-MP computes a moderate arithmetic average of the incoming \(E_{q\small {({\mathcal {Y}})}}\)-step messages to yield the marginal update steps for \(X_j\). Due to the variational underpinnings of GEM-MP, these steps update a quantity that is a lower bound on the log marginal-likelihood. This is attributable to the use of Jensen's inequality in Eq. (12c), which lower-bounds the model evidence, and at each update step we minimize the Kullback–Leibler divergence. We therefore cannot ‘overstep’ in our approximation of the true model evidence (refer to Theorem 1). In contrast, LBP computes a (coarse) geometric average of the incoming messages in a setting where there is no such bound.

The second important difference between the algorithms is how they compute their “outgoing messages” from factors to variables based on the previous iteration's incoming messages from variables to factors. In LBP, the outgoing message is a partial sum over the product of the factor's probability distribution with its incoming messages from the other neighboring variables, which arises naturally from the exact computations for correctly marginalizing a tree-structured graphical model. However, the operations of simply multiplying and then taking partial sums do nothing to exploit any local structure of the underlying factor. In contrast, GEM-MP leverages the fact that factors (e.g., in Markov logic and Ising models) are represented as logical clauses, and therefore we can take advantage of generalized arc consistency to convey the local structures' semantics into their outgoing messages. Strictly speaking, GEM-MP's outgoing \(E_{q\small {({\mathcal {Y}})}}\)-step message is an approximate marginal distribution \(\large {q}(Y_i;\alpha _{\small {Y_i}})\) over the valid local entries \(Y_i:\forall y_k(X_j)\) in which the \(X_j\) that will receive the message is GAC with the other variables in the factor; we approximate this distribution by computing the pGAC of \(X_j\) using the marginals of the other variables in the factor that are GAC with \(X_j\) (refer to Sect. 4.1.1). This means that the outgoing \(E_{q\small {({\mathcal {Y}})}}\)-step message received by \(X_j\) ensures that its marginal is consistent with the marginals of the other variables according to the local structure's semantics of the factor, which helps drive the variables toward converging correctly. Hence, exploiting the logical structures via pGAC when computing the outgoing messages of factors is what we believe helps GEM-MP alleviate the problems associated with determinism.

Algorithm 1 Pseudo-code of the GEM-MP inference algorithm

4.4 GEM-MP algorithm

Algorithm 1 gives pseudo-code for the GEM-MP inference algorithm. The algorithm starts by uniformly initializing (i.e., \({\mathcal {U}}\)) the marginals of all ground atoms in the query set \({\mathcal {X}}\) (lines 1–3). Then it distinguishes two subsets of query ground atoms: \({\mathcal {X}}_h\), containing the query ground atoms that appear in hard ground clauses (line 4), and \({\mathcal {X}}_s\), containing those that appear in soft ground clauses (line 5). Note that if a query atom appears in both soft and hard ground clauses, then it is included in both subsets. At each step, the algorithm proceeds by updating the marginals of the first subset of query atoms using the hard update rule (lines 7–9). Then it updates the marginals of the second subset by applying the soft update rule (lines 10–12). The algorithm keeps alternating between the two update rules until convergence (i.e., \(\forall X_j \in {\mathcal {X}}, \, \left| \beta _{X_j}({\mathcal {I}})-\beta _{X_j}({\mathcal {I}}-1)\right| <\epsilon \), where \(\epsilon \) is a specified precision) or until reaching the maximum number of iterations (line 13). Although the marginals of query atoms that appear in both soft and hard ground clauses (i.e., in both subsets \({\mathcal {X}}_h\) and \({\mathcal {X}}_s\)) may be affected by swapping between the hard and soft update rules, such query atoms' marginals play the role of propagating the information about hard ground clauses to the query atoms in \({\mathcal {X}}_s\) when used by the soft update rule, and propagating the information about soft ground clauses to the query atoms in \({\mathcal {X}}_h\) when used by the hard update rule. It should be noted that the checks performed by each update rule are extremely cheap (a fraction of a second, on average) and the subset of ground clauses at each particular step is unlikely to be in the hard critical region.
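Since the pseudo-code is given as a figure, a minimal Python sketch of the same loop may help; the two rule callables are placeholders (e.g., built from the hard_update and soft_weights sketches of Sects. 4.1–4.2), and all names are illustrative:

```python
def gem_mp(query_atoms, hard_atoms, soft_atoms,
           hard_rule, soft_rule, max_iters=100, eps=1e-4):
    """Top-level loop mirroring Algorithm 1 (a sketch).  hard_rule and
    soft_rule stand in for the update rules of Table 3; each maps
    (atom, current marginals) to the atom's new positive marginal."""
    beta = {x: 0.5 for x in query_atoms}              # uniform init (lines 1-3)
    for _ in range(max_iters):
        old = dict(beta)
        for x in hard_atoms:                          # hard update rule (7-9)
            beta[x] = hard_rule(x, beta)
        for x in soft_atoms:                          # soft update rule (10-12)
            beta[x] = soft_rule(x, beta)
        if all(abs(beta[x] - old[x]) < eps for x in query_atoms):
            break                                     # converged (line 13)
    return beta
```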

Proposition 3

(Computational Complexity) Given an MLN's ground network with n ground atoms, m ground clauses, and a maximum ground-clause arity of r, one iteration of computing the marginals of the query atoms takes O(nmr) time in the worst case.

Proof

See “Appendix”. \(\square \)

Note that even though GEM-MP is built on a propositional basis, its computation remains quite efficient, since the size of the grounded network is proportional to \(O(d^r)\), where d is the number of objects (constants) in the domain. Also, in practice, we can improve the running time by precomputing some terms. For instance, we do not recompute constant terms (such as \(\left| {\mathcal {F}}_{X_j}^h\right| \) in the hard update rule) at each iteration; instead, we compute them once and then reuse their values.

5 GEM-MP update rules for Ising MRFs

In this section we demonstrate how the GEM-MP algorithm can be adapted to handle inference in the presence of determinism in other typical probabilistic graphical models beyond Markov logic networks. For simplicity, we consider Ising models with arbitrary topology, which are a specific subset of the canonical (pairwise) Markov random fields (MRFs): undirected graphical models that compactly represent a joint distribution over \({\mathcal {X}}\) by assuming that it factorizes into a product of potentials defined on subsets of variables (Koller and Friedman 2009). Although pairwise Markov random fields are commonly used as a benchmark for inference because of their simple and compact representation, they often pose a challenge for inference. Now assume that \({\mathcal {X}} = \{X_1,\ldots ,X_n\}\) is a set of binary random variables, each Bernoulli distributed. An Ising model \({\mathcal {I}} = (({\mathcal {X}},{\mathcal {E}}); \theta )\) consists of an undirected graph over the set of all variables \({\mathcal {X}}\), a set of edges between variables \({\mathcal {E}}\), and a set of parameters \(\theta =\{\theta _i,\theta _{ij}\}\). The model is then given as

$$\begin{aligned} p({\mathcal {X}}=x)&= {\mathcal {Z}}_{\theta }^{-1} \overbrace{e^{\big [\sum _{(X_i,X_j) \in {\mathcal {E}}} \theta _{ij} \cdot X_iX_j + \sum _{X_i \in {\mathcal {X}}} \theta _i \cdot X_i\big ]}}^{\text {energy function}} \end{aligned}$$
(55a)
$$\begin{aligned}&= {\mathcal {Z}}_{\theta }^{-1} \bigg [\prod _{(X_i,X_j) \in {\mathcal {E}}} \overbrace{e^{\theta _{ij} \cdot {\mathbbm {1}}_{(X_i,X_j)}}}^{\phi _{ij}(X_i,X_j)} \bigg ] \times \bigg [\prod _{X_i \in {\mathcal {X}}} \overbrace{e^{\theta _i \cdot X_i}}^{\phi _i(X_i)} \bigg ] \end{aligned}$$
(55b)

where \(\{\theta _i\}\) and \(\{\theta _{ij}\}\) are the parameters of the uni-variate potentials \(\{\phi _i(X_i)\}\) and the pairwise potentials \(\{\phi _{ij}(X_i,X_j)\}\), respectively. It is typically convenient to represent the Ising model in Eq. (55b) as a factor graph, where the variables \(X_i \in {\mathcal {X}}\) are represented as variable nodes and the potentials \(\{\phi _i(X_i)\}\) and \(\{\phi _{ij}(X_i,X_j)\}\) as factor nodes. Traditionally, the parameters \(\{\theta _i\}\) are drawn uniformly from \({\mathcal {U}}[- d_f , d_f ]\), where \(d_f \in {\mathbb {R}}\). For the pairwise potentials, the parameters \(\{\theta _{ij}\}\) are chosen as \(\eta \cdot C\), where \(\eta \) is sampled from the range \([- d_f, d_f]\) so that some pairs of nodes tend to agree and others to disagree, and C is a chosen constant. Higher values of C impose stronger constraints, leading to a harder inference task.

Hence each univariate potential \(\phi _i(X_i)\) can be represented as a unit clause involving only the variable \(X_i\) with associated weight \(\theta _i\), such that it equals \(e^{\theta _i}\) when satisfied and 1 otherwise. Similarly, each \(\phi _{ij}(X_i,X_j)\) can be formulated as a conjunction of two clausesFootnote 12 \(\big [(\lnot X_i \vee X_j) \wedge (X_i \vee \lnot X_j)\big ]\) with associated weight \(\eta \cdot C\), which equals \(e^{\eta \cdot C}\) when \(X_i=X_j\) and \(e^{-\eta \cdot C}\) otherwise. Hence, the Ising model can be translated into CNF as:

  • Unit clauses: \((X_i,\theta _i)\), \(\forall X_i \in {\mathcal {X}}\)

  • Pairwise clauses: \(\big [(\lnot X_i \vee X_j) \wedge (X_i \vee \lnot X_j), \theta _{ij}=\eta \cdot C\big ]\), \(\forall X_i,X_j \in {\mathcal {E}}\)

Now we can directly apply the soft update rule from Table 3 to such clauses when computing the marginals on the factor graph. Suppose next that we want to introduce some determinism into the model. We can achieve this by adjusting the parameters so that either a univariate or a pairwise potential produces 0 when unsatisfied. For instance, if C is very large (say \(C \rightarrow \infty \)) in the setting of the parameters \(\theta _{ij}\), then all the valid local entries of \(\phi _{ij}(X_i,X_j)\)’s clauses tend to \(e^{\infty }\) and all its invalid local entries tend to 0 (i.e., \(e^{-\infty }\)), which can simply be re-cast as \(\{0,1\}\) clauses. Thus, in this case we can apply the hard update rule from Table 3 when computing the marginals.
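The translation just described is mechanical enough to sketch in a few lines (our representation choices: clauses are (literals, weight) pairs, each of the two pairwise clauses carries the weight \(\theta _{ij}=\eta \cdot C\), and an infinite weight marks a hard clause to be handled by the hard update rule):

    import math

    def ising_to_cnf(theta_uni, theta_pair):
        # theta_uni: {X_i: theta_i}; theta_pair: {(X_i, X_j): eta * C}
        clauses = [([(xi, True)], th) for xi, th in theta_uni.items()]  # unit clauses
        for (xi, xj), th in theta_pair.items():
            clauses.append(([(xi, False), (xj, True)], th))  # clause: not X_i or X_j
            clauses.append(([(xi, True), (xj, False)], th))  # clause: X_i or not X_j
        return clauses

Letting a pairwise parameter be math.inf then reproduces the \(C \rightarrow \infty \) limit above: the corresponding clauses become \(\{0,1\}\) constraints.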

6 Empirical evaluation

The goal of our experimental evaluation was to investigate the following key questions:

  • (Q1.) Is GEM-MP’s accuracy competitive with state-of-the-art inference algorithms for Markov logic? This question is important to answer as it examines the soundness of GEM-MP inference.

  • (Q2.) In the presence of graphs with problematic cycles, where LBP exhibits oscillations, does GEM-MP converge? We want to show experimentally that GEM-MP inference indeed addresses Limitation 1.

  • (Q3.) Is GEM-MP more accurate than LBP in the presence of determinism? We want to check experimentally the effectiveness of GEM-MP inference to remedy Limitation 2.

  • (Q4.) Is GEM-MP scalable compared to other state-of-the-art propositional inference algorithms for Markov logic? We wish to examine the real-world applicability of GEM-MP inference.

  • (Q5.) Is GEM-MP accurate compared to state-of-the-art convergent message-passing algorithms for other probabilistic graphical models such as Markov Random Fields? We wish to examine the accuracy and convergence behaviour of GEM-MP inference for other related model classes and algorithms.

  • (Q6.) Is GEM-MP’s accuracy influenced by the initialization of the marginals? We will examine whether initializing the approximate marginals with random values differs from initializing them with a uniform distribution.

To answer Questions Q1–Q4, we first selected three real-world datasets: Cora for entity resolution, Yeast for protein interactions, and UW-CSE for advising relationships. These datasetsFootnote 13 and their corresponding MLN formulations contain the problematic properties of determinism and cycles, and therefore represent good bases for our experimental evaluation. The first point to note is that their expressive Markov logic networks have a formidable number of cycles. In addition, some of their rules are expressed as hard formulas. Thus, the inference procedure can be expected to face the obstacles engendered by determinism and cycles. The second point is that they exemplify important applications: entity resolution has become something of a holy-grail task, while advising relationships and protein interactions are instances of link prediction, an important task that continues to receive much interest in statistical relational learning (Richardson and Domingos 2006).

To evaluate our proposed GEM-MP inference algorithm, we compared its results with five prominent state-of-the-art inference algorithmsFootnote 14 that are built into the Alchemy system (Kok et al. 2007), one of the most powerful tools to perform inference on Markov logic models:Footnote 15

  • MC-SAT proposed by Poon and Domingos (2006).

  • Lazy MC-SAT (LMCSAT) proposed by Poon et al. (2008).

  • Loopy Belief Propagation (LBP) (refer to Yedidia et al. 2005).

  • Gibbs sampling (Gibbs) (Richardson and Domingos 2006).

  • Lifted Importance sampling (L-Im) proposed by Venugopal and Gogate (2014b) as an improvement of the one proposed by Gogate et al. (2012).

MC-SAT converges rapidly when performing inference in the presence of determinism, and L-Im is a recent lifted importance sampling algorithm that addresses the evidence problem (Venugopal and Gogate 2014a) and thereby improves the scalability and accuracy of reasoning. Therefore, to answer Q1 and Q4, our main comparison is with MC-SAT and L-Im. Since our GEM-MP algorithm is a variant of message-passing inference, we compare with LBP to answer Q1, Q2, Q3, and Q4. Gibbs, a popular MCMC algorithm, serves as a good baseline here. Additionally, even though GEM-MP is built on a propositional basis, it is instructive to compare its scalability with two state-of-the-art approaches for scaling inference: lifting in the L-Im algorithm and laziness in the LMCSAT algorithm. Note that a few other efficient inference methods are not considered in our experiments either because they are dominated by one of the considered algorithms (e.g., simulated tempering has shown poor results compared to MC-SAT, as shown by Poon and Domingos (2006)), or because they run exact inference [like PTP, introduced by Gogate and Domingos (2011)], which is out of reach for the underlying datasets.

6.1 Datasets

Cora. This dataset consists of 1295 citations of 132 different computer science papers.Footnote 16

  • MLN: We used an MLN model similar to the established one of Singla and Domingos (2006). The MLN involves formulas stating regularities such as: if two citations are the same, their fields are the same; if two fields are the same, their citations are the same. It also has formulas representing transitive closure, which are assigned very high weights (i.e., near-deterministic clauses). The final knowledge base contains 10 atoms and 32 formulas (adjusted as 4 hard, 3 near-deterministic, and 25 soft).

  • Query: The goal of inference is to predict which pairs of citations refer to the same citation (SameBib), and similarly for author, title and venue fields (SameTitle, SameAuthor and SameVenue). The other atoms are considered evidence atoms.

Yeast. This dataset captures information about a protein’s location, function, phenotype, class, enzymes, and protein-protein interaction for the Comprehensive Yeast Genome.Footnote 17 It contains four subsets, each of which contains the information about 450 proteins.

  • MLN: We used the MLN model described by Davis and Domingos (2009). It involves singleton rules for predicting the interaction relationship, and rules describing how protein functions relate to interactions between proteins (i.e. two interacting proteins tend to have similar functions). The final knowledge base has 7 atoms and 8 first-order formulas (2 hard and 6 soft).

  • Query: The goal of inference is to predict the interaction relation (Interaction, Function). All other atoms (e.g., location, protein-class, enzyme, etc.) are considered evidence atoms.

UW-CSE. This dataset records information about the University of Washington (UW), Computer Science and Engineering Department (CSE). The database consists of five subsets: AI, graphics, programming languages, systems, and theory (which corresponds to five research areas).

  • MLN: We used the MLN model available from the Alchemy website.Footnote 18 It includes formulas such as the following: each student has at most one advisor; if a student is an author of a paper, so is her advisor; advanced students only TA courses taught by their advisors; and a formula stating that a student may not have both temporary and formal advisors at the same time (\(\lnot TemAdvised(s,p)\vee \lnot Advised(s,p)\), which is a true statement at UW-CSE). The final knowledge base contains 22 atoms and 94 formulas (7 hard and 65 soft, excluding the 22 unit clauses). Note that ten of these 22 unit clauses are equality predicates: Sameperson(person, person), Samecourse(course, course), etc., which always have known, fixed values that are true if the two arguments are the same constant. The rest are easily predicted using the unit clause method.

  • Query: The inference task is to predict advisory relationships (AdvisedBy), and all other atoms are evidence (corresponding to the all-information scenario in Richardson and Domingos (2006)).

6.2 Metrics

Since computing exact marginal or joint conditional distributions is not feasible for the underlying domains, we evaluated the quality of inference using two metrics: the average conditional log marginal-likelihood (CLL) and the balanced \(F_1\) score. The CLL, which approximates the KL-divergence between the actual marginals and those computed by an inference algorithm for the query ground atoms, is an intuitive way of measuring the quality of the produced marginal probabilities. After obtaining the marginal probabilities from the inference algorithm, the average CLL of a query atom is computed by averaging the log-marginal probabilities of the true values over all its groundings. For the \(F_1\)-score metric, we predict that a query ground atom is true if its marginal probability is at least 0.5, and false otherwise (see Huynh and Mooney 2009, 2011; Papai et al. 2012 for more details about measuring prediction quality on the basis of marginal probabilities). The advantage of the \(F_1\) score is its insensitivity to true negatives (TNs); it can thus demonstrate the quality of an algorithm at predicting the few true positives (TPs).
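Both metrics are straightforward to compute from the inferred marginals. The sketch below assumes p[a] is the predicted probability that query ground atom a is true and y[a] is its gold truth value:

    import math

    def average_cll(p, y, eps=1e-10):
        # mean log-probability assigned to the *true* value of each grounding
        logs = [math.log(max(p[a] if y[a] else 1.0 - p[a], eps)) for a in y]
        return sum(logs) / len(logs)

    def f1_score(p, y, threshold=0.5):
        pred = {a: p[a] >= threshold for a in y}  # predict true iff marginal >= 0.5
        tp = sum(1 for a in y if pred[a] and y[a])
        fp = sum(1 for a in y if pred[a] and not y[a])
        fn = sum(1 for a in y if not pred[a] and y[a])
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)

Note that true negatives appear nowhere in f1_score, which is exactly the insensitivity to TNs mentioned above.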

6.3 Methodology and results

All the experiments were run on a cluster of multiprocessor nodes with 2.4 GHz Intel CPUs and 4 GB of RAM under Red Hat Linux 5.5. We used the implementations of both the training algorithm (preconditioned scaled conjugate gradient) and the inference algorithms (MC-SAT, LBP, and Gibbs) that exist in the Alchemy system (Kok et al. 2007). In addition, we implemented our GEM-MP algorithm as an extension of Alchemy’s inference. All of Alchemy’s default parameters were retained (e.g., 100 burn-in iterations to negate the effect of initialization in MC-SAT and Gibbs). We conducted our experimental evaluation through five experiments.

6.3.1 Experiment I

The first experiment was dedicated to answering Q1 and Q2. We ran our experiments using five-way cross-validation for both Cora and UW-CSE, and four-way cross-validation for Yeast. In the training phase we learned the weights of the models by running the preconditioned scaled conjugate gradient (PSCG) algorithm (Lowd and Domingos (2007) showed that PSCG performed best). In the testing phase, using the learned models, we carried out inference on the held-out data with each of the underlying inference algorithms to produce the marginals of all groundings of the query atoms being true. These marginal probabilities were used to compute the \(F_1\) and average CLL metrics.

Although the traditional way to assess inference algorithms would be to run them until convergence and compare their running times, this is problematic here because some of the algorithms may never converge in the presence of determinism and cycles (e.g., LBP), while others may converge very slowly in the presence of near-determinism (e.g., Gibbs). Instead we assigned all inference algorithms an identical running time sufficient to judge their inference behavior. Then, at each time step, we recorded the average CLL over all query atoms by averaging their CLLs on each held-out test set. In addition, we computed the \(F_1\) score from the results obtained at the end of the allotted time.

Fig. 5 Average CLL as a function of inference time for the GEM-MP, MC-SAT, LBP, Gibbs, LMCSAT, and L-Im algorithms on Cora (top), Yeast (middle), and UW-CSE (bottom)

Figure 5 shows the results for the average CLL as a function of time for the inference algorithms on the underlying datasets. For each point, we plotted error bars displaying the average standard deviation over the predictions for the groundings of each predicate. Note that when the error bars are tiny, they may not be clearly visible in the plots. Overall, GEM-MP is the most accurate of all the algorithms compared, achieving the best average CLL on the Yeast and UW-CSE datasets (this answers Q1). On the Cora dataset it took about 225 min to dominate all the other inference algorithms.Footnote 19 MC-SAT came close behind GEM-MP on both Cora and UW-CSE, but considerably further behind on Yeast. LBP was marginally less accurate than Gibbs on both Cora and UW-CSE [which is consistent with the experiments of Singla and Domingos (2008)], but more accurate than Gibbs on Yeast. Remarkably, GEM-MP converged quickly on both the Yeast and UW-CSE datasets and comparatively fast on Cora as well (this answers Q2). By contrast, LBP was unable to converge, oscillating on both the Cora and Yeast datasets, and Gibbs converged very slowly on all datasets. L-Im was clearly more accurate than Gibbs on all the tested datasets. In addition, its accuracy exceeded LBP’s by a large margin on Cora and UW-CSE, while being slightly less accurate than LBP on the Yeast dataset. The accuracies of MC-SAT and of its lazy variant (LMCSAT) were very close on all the datasets.

Table 4 Average \(F_1\) scores for the GEM-MP, MC-SAT, Gibbs, LBP, LMCSAT, and L-Im inference algorithms on Cora, Yeast, and UW-CSE at the end of the allotted time

Table 4 reports the average \(F_1\) scores for the inference algorithms on the underlying datasets. The results complement those of Fig. 5, underscoring the promise of our proposed GEM-MP algorithm to attain the highest quality among the alternatives for predicting marginals, particularly for the TP query atoms (i.e., query atoms that are true and predicted to be true). GEM-MP substantially outperformed LBP, Gibbs, and L-Im on all datasets, achieving 39, 37, and 33 % greater accuracy, respectively (answering Q2). MC-SAT was relatively competitive with GEM-MP on Cora and UW-CSE, but on the Yeast dataset GEM-MP performed significantly better, attaining \(13 \,\%\) greater accuracy than MC-SAT (a conclusive answer to Q1). Gibbs and LBP rivaled each other on the tested datasets but were both dominated by MC-SAT. LMCSAT was very competitive with its propositional counterpart MC-SAT, with approximately a \(2.2\,\%\) loss in accuracy.

6.3.2 Experiment II

Here we concentrated on Q3. To obtain robust answers we examined the performance of GEM-MP, MC-SAT, and LBP at varying amounts of determinism; that is, we re-ran Experiment I at gradually increasing amounts of determinism. We refer to each amount of determinism as a level, with levels in the range \(\left[ 0,50\right] \). For example, the 0-level stands for zero percent determinism (i.e., all clauses in the model are considered soft), and the 50-level means 50 % determinism (i.e., we considered \(50\,\%\) of the clauses in the model as hard and \(50\,\%\) as soft).
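A sketch of how such a level can be imposed on a model follows; which clauses get hardened is not prescribed above, so the random choice here is purely illustrative:

    import math, random

    def set_determinism_level(clauses, level_percent, seed=0):
        # clauses: list of (formula, weight) pairs; returns a (hard, soft) split
        rng = random.Random(seed)
        shuffled = clauses[:]
        rng.shuffle(shuffled)
        n_hard = int(len(shuffled) * level_percent / 100.0)
        hard = [(f, math.inf) for f, _ in shuffled[:n_hard]]  # made deterministic
        soft = shuffled[n_hard:]                              # weights unchanged
        return hard, soft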

Fig. 6 The impact of determinism on the accuracy of GEM-MP, MC-SAT, and LBP for Cora (top), Yeast (middle), and UW-CSE (bottom)

Figure 6 reports the average CLL as a function of time for GEM-MP, LBP, and MC-SAT at different levels of determinism. Overall, the results confirm that the amount of determinism in the model has a great impact on both the accuracy and the convergence of GEM-MP and LBP: as the level of determinism increases, the accuracy of GEM-MP increases and the accuracy of LBP decreases. At each level of determinism, and on all datasets, GEM-MP prevailed over LBP in accuracy (answering Q3). In addition, the greater the level of determinism, the stronger the convergence of GEM-MP and the worse the non-convergence of LBP (answering Q2). Remarkably, the 0-level, which involves no determinism, exhibits the worst behaviour for GEM-MP.Footnote 20 In contrast, it is the best level for LBP, though even at this level GEM-MP surpassed LBP on all datasets. For MC-SAT, increasing the determinism in the model has a small negative impact on its accuracy.

6.3.3 Experiment III

This experiment examines Q4. We are interested here in judging the scalability of the various inference algorithms. To guarantee a fair comparison, we re-ran Experiment I while increasing the number of objects in the domain from 100 to 200 in increments of 25, following the methodology previously used by Poon et al. (2008) and Shavlik and Natarajan (2009). We then reported the average running time to achieve convergence, or up to a maximum of 5000 and 10,000 iterations respectively, for the entire inference process.

Figure 7 reports the average inference time as a function of the number of objects in the domain. Overall, the results show that LMCSAT (a lazy algorithm) and L-Im (a lifted algorithm) rivaled each other, and both dominated all the propositional algorithms compared. L-Im was relatively more scalable than LMCSAT on Yeast and UW-CSE, but on the Cora dataset LMCSAT’s scalability was significantly better than L-Im’s. Setting aside the lazy and lifted algorithms and considering the propositional ones, the results demonstrate that GEM-MP is scalable compared to the other evaluated inference algorithms. It clearly prevailed over both LBP and Gibbs across the entire range of domain sizes by a significant margin, saving time by more than a factor of 2 on all datasets. It also rivaled the MC-SAT algorithm overall: although it came in slightly behind MC-SAT on the Cora dataset, it outperformed MC-SAT at all domain sizes on both the Yeast and UW-CSE datasets, and MC-SAT ran out of memory with 200 objects on UW-CSE.

Fig. 7 Inference time versus number of objects in Cora (top), Yeast (middle), and UW-CSE (bottom)

6.3.4 Experiment IV

This experiment was performed to answer Q5. Here the goal is to compare GEM-MP with three state-of-the-art convergent message-passing algorithms:

  • L2-convex proposed by Hazan and Shashua (2010, 2008), which runs sequential message passing on the convex-L2 Bethe free energy.

  • RBP proposed by Elidan et al. (2006), which runs damped Residual BP, a greedy informed schedule for message passing.

  • CCCP double loop algorithm proposed by Yuille (2001, 2002), which runs message-passing on the convex-concave Bethe free energy.

To evaluate the four underlying message-passing algorithms we apply them to Ising models on a two-dimensional grid network. These networks are standard benchmarks for evaluating message-passing algorithms, as they provide a systematic way to analyze iterative algorithms (Elidan et al. 2006). Following Hazan and Shashua (2010) and Elidan et al. (2006), we generated \(20 \times 20\) grids. The distribution has the form \(p(x) \propto e^{\sum _{(X_i,X_j) \in {\mathcal {E}}} \theta _{ij} \cdot X_iX_j+ \sum _{X_i} \theta _i \cdot X_i}\), where \(\theta _i\) and \(\theta _{ij}\) are the parameters (i.e., weights) of the univariate and pairwise potentials, respectively. For the univariate potentials, the parameters \(\theta _i\) were drawn uniformly from \({\mathcal {U}}[- d_f , d_f ]\) with \(d_f \in \{0.05,1\}\). For the pairwise potentials, we use \(e^{\eta \cdot C}\) when \(x_i = x_j\), where \(\eta \) is sampled from the range \([-0.5, 0.5]\) so that some pairs of nodes tend to agree and others to disagree. C is an agreement factor, so higher values of C impose stronger clauses (e.g., \(C=200\) and \(\eta =0.5\) yield deterministic potentials, since a state that violates a potential with \(C=200\) becomes \(2.69 \times 10^{43}\) times less probable). Thus, to explore the difficulty of inference in different regimes, we generated the networks with two levels of determinism: Level 1 is [0, 20 %] and Level 2 is [20, 40 %], with realizations obtained at \(10\,\%\) intervals and 50 graphs at each interval. For each individual realization, we ran the four underlying inference algorithms until convergence or up to 500 iterations. To diagnose convergence we considered the cumulative percentage of convergence of all algorithms as a function of the number of iterations. To assess the quality of results, since exact inference was tractable using the junction tree algorithm, we computed the average KL-divergence (KL) between the approximate and exact node marginals for each algorithm on all the generated \(20 \times 20\) Ising grids.
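For concreteness, the grid-generation protocol can be sketched as follows (our reading of the setup above; function and parameter names are illustrative):

    import random

    def generate_ising_grid(size=20, d_f=1.0, C=200.0, seed=0):
        rng = random.Random(seed)
        # univariate parameters drawn uniformly from U[-d_f, d_f]
        theta_uni = {(i, j): rng.uniform(-d_f, d_f)
                     for i in range(size) for j in range(size)}
        # pairwise parameters eta * C on right/down grid edges,
        # eta ~ U[-0.5, 0.5] so some pairs agree and others disagree
        theta_pair = {}
        for i in range(size):
            for j in range(size):
                for ni, nj in ((i + 1, j), (i, j + 1)):
                    if ni < size and nj < size:
                        theta_pair[((i, j), (ni, nj))] = rng.uniform(-0.5, 0.5) * C
        return theta_uni, theta_pair

Raising C then tightens the pairwise constraints exactly as described, pushing more of the generated potentials toward determinism.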

Fig. 8 Results on \(20 \times 20\) Ising grids: (top) the cumulative percentage of convergence versus the number of iterations, and (bottom) the average KL-divergence (KL) versus the number of iterations at determinism Level 1 [0, 20 %] (left) and Level 2 [20, 40 %] (right)

Fig. 9 Top and middle: the average CLL of GEM-MP-random (x-axis) versus the average CLL of GEM-MP-uniform (y-axis) for Cora (left, red), Yeast (middle, green), and UW-CSE (right, magenta) at two determinism levels, respectively. Bottom: the average KL-divergence of GEM-MP-random versus that of GEM-MP-uniform for \(20 \times 20\) Ising grids at Level 1 [0, 20 %] (left, blue) and Level 2 [20, 40 %] (right, blue) during iterations

Figure 8 (top) displays the cumulative percentage of convergence as a function of the number of iterations for each algorithm at Levels 1 and 2. Overall, the results show that GEM-MP converges significantly more often than all the other convergent message-passing algorithms compared (answering Q5), and it also converges much faster. At Level 1 it finishes at a 97 % convergence rate, versus 82 % for L2-convex, 68 % for CCCP, and 59 % for residual BP. At Level 2 it achieves at least 17.5, 34.8, and 48.4 % better convergence than L2-convex, CCCP, and residual BP, respectively.

Figure 8 (bottom) displays the average KL-divergence (KL) between the approximate and exact node marginals for each algorithm as a function of the number of iterations at the two levels. The results complement those of Fig. 8 (top), again underscoring the promise of GEM-MP for converging to more accurate solutions more rapidly than all the other compared algorithms (answering Q5). In the two determinism scenarios, it achieves on average 37.8, 56, and 61.6 % higher-quality marginals in terms of average KL compared to the L2-convex, CCCP, and residual BP methods, respectively. It finishes at a KL-divergence of 0.23 and 0.19 at the two determinism levels, respectively. This shows that the marginals obtained by GEM-MP at Level 2 are more accurate than those obtained at Level 1, which is consistent with the results of Experiment II demonstrating that GEM-MP provides more robust results when there is more determinism in the model.

6.3.5 Experiment V

This experiment attempts to answer Q6. The goal is to compare the quality of the solutions returned by GEM-MP under different initializations of the marginals: random initialization (GEM-MP-random) and uniform initialization (GEM-MP-uniform). We re-ran Experiment I for MLNs and recorded the relative correlation of the average CLL between GEM-MP-random and GEM-MP-uniform. In addition, we re-ran Experiment IV for Ising models and report the relative correlation of the average KL-divergence between the two.

Figure 9 shows the quality of the marginals obtained by GEM-MP-random relative to those of GEM-MP-uniform as a function of the number of iterations at two determinism levels for Cora (red), Yeast (green), UW-CSE (magenta), and Ising (blue). In each scatter plot the line of best fit indicates that GEM-MP-random and GEM-MP-uniform yield results of nearly identical quality. A point below the line means that GEM-MP-uniform was more accurate than GEM-MP-random at that iteration, and conversely for a point above the line. Overall, the results show that neither initialization setting dominates the other (answering Q6), and that GEM-MP is not sensitive to the initialization.

7 Discussion

The experimental results from the previous section suggest that, in terms of both accuracy and scalability, GEM-MP outperforms LBP inference. It improves message-passing inference in two ways. First, it alleviates the threat of non-convergence in the presence of cycles, owing to its moderate moves in the marginal likelihood space and to Jensen’s inequality, which prevents such moves from overshooting the nearest fixed point. Second, it improves the quality of the approximate marginals obtained in the presence of determinism, which we believe is attributable to using generalized arc consistency to leverage the local entries of factors and thereby compute more accurate outgoing messages.

Moreover, GEM-MP performs at least as well as other state-of-the-art sampling-based inference methods (such as MC-SAT and Gibbs). The goal of MC-SAT is to combine a satisfiability-based method (e.g., SampleSAT) with MCMC sampling to remedy the challenges engendered by determinism in the setting of MCMC inference. On one hand, GEM-MP achieves a similar goal, but by integrating a satisfiability-based method (i.e., GAC) with message-passing inference instead of sampling inference. On the other hand, the two completely differ in how they use ideas from satisfiability-oriented methods to deal with determinism.

From the satisfiability perspective, MC-SAT uses SampleSAT (Wei et al. 2004) to help slice sampling (i.e., MCMC) to sample a new state near-uniformly given the auxiliary variables. This gives MC-SAT the ability to jump rapidly between modes, preventing the local search in MCMC inference from being trapped in isolated modes. Still, one limitation of MC-SAT is that it applies a stochastic greedy local search procedure that is unable to make large moves in the state space between isolated modes, which may affect its capacity to converge to accurate results. Conversely, at a high level, GEM-MP optimizes the setting of parameters with respect to a distribution over hidden variables that captures the relative weights of the samples (i.e., the valid local entries) generated by individual variables in closed form. It thereby performs a form of gradient descent/ascent local search. This gives GEM-MP an advantage in converging to more accurate results than MC-SAT, though MC-SAT is likely to converge faster than GEM-MP. This could explain the success of GEM-MP over MC-SAT in most of the experiments (MC-SAT only surpassed GEM-MP on the Cora dataset in Experiment III). We must also remember that, during the training phase, we trained the models with the preconditioned scaled conjugate gradient (PSCG) algorithm, which uses MC-SAT for its inference step; this in turn gave MC-SAT an advantage when performing inference in the testing phase.

Gibbs is only reliable when neither determinism nor near-determinism is present. LBP likewise deteriorates in the presence of determinism and near-determinism, but also when cycles are present. Thus, if LBP gets stuck in cycles with determinism, it may be lodged there forever, whereas if Gibbs hits a local optimum, it will eventually leave, even though this may take considerable time. This could explain the success of Gibbs over LBP. But as determinism in the model increases, Gibbs loses out to LBP, as seen on the Yeast dataset in Experiment I. Thus determinism apparently has a stronger effect on Gibbs than on LBP in this experiment.

Furthermore, GEM-MP performs better than other state-of-the-art convergent message-passing inference algorithms such as L2-convex, CCCP, and damped residual BP. The goal of L2-convex is to convexify the Bethe free energy in order to guarantee that BP converges to an accurate local minimum, and the CCCP algorithm uses a convex-concave decomposition of the Bethe energy to the same end. On the one hand, GEM-MP achieves a similar purpose by optimizing a concave variational free energy, which is a lower bound on the model evidence. On the other hand, it additionally leverages the determinism itself: while the presence of determinism in a model can hinder the performance and the convergence behaviour of both L2-convex and CCCP, it increases the possibility that GEM-MP converges to an accurate local minimum.

Overall, the experimental results suggest that the initialization of GEM-MP does not significantly matter in practice, since the correlation between the two initialization settings (i.e., uniform and random) is, on average, moderately positive. While we believe it is important to have a good initialization to ensure that the local minimum found is sufficiently close to the global minimum, a good initialization will depend on the model and the data. Therefore, in some cases either random or uniform initialization will suffice, whilst in others it may be necessary to use a heuristic. Generally speaking, however, GEM-MP appears able to reach an accurate result from any initialization, possibly at the expense of a minor increase in computation time.

From the scalability point of view, although Singla (2012) conjectured that lifted inference may subsume lazy inference, a clear relationship between the two still eludes us; our experimental results show that neither dominates the other. On one hand, lazy inference exploits sparseness to ground the network lazily, greatly reducing inference memory and time, but it still works at the propositional level, in the sense that the basic units during inference are ground clauses. On the other hand, lifted inference exploits a key property of first-order logic to answer queries without materializing all the objects in the domain, but it requires the network to have a specific symmetric structure, which is not always the case in real-world applications; moreover, in the presence of evidence most models are not liftable, because evidence breaks symmetries. Thus, at a high level, the structure of the model network plays a significant role in the scalability of inference through two factors: symmetry and sparseness. If the model is extremely sparse then one can expect lazy inference to be more scalable; lifted inference dominates when symmetry prevails in the model’s structure.

8 Related work

Belief propagation (BP) was developed by Pearl (1988) as an inference procedure for singly connected belief networks. Pearl was the first to observe that running LBP yields incorrect results on multiply connected networks. Conversely, other work (such as Mceliece et al. 1998; Frey and MacKay 1998) has shown success with LBP on loopy networks for turbo-code applications. Further, Murphy et al. (1999) reported that LBP can provide good results on graphs with loops. These promising results shed light on the performance of BP in other applications and suggested the value of a closer study of its behavior to understand the reasons for this success. Accordingly, several formulations of LBP have appeared, such as the direct implementation on a factor graph by Kschischang et al. (2001), tree-reweighted BP (Wainwright et al. 2003), and the generalized cluster graph method of Mateescu et al. (2010). Most such formulations were influenced by the analysis of Yedidia et al. (2003), who proved a relationship between LBP and the Bethe approximation such that the local minima of the Bethe free energy are the fixed points of LBP. Complementing this, further analysis has explored LBP’s relationships to variational approximations (Yedidia et al. 2005). This pioneering work outlined new research directions for a deeper understanding of, and improvements to, LBP.

Studying LBP’s convergence The convergence of LBP has been studied by examining sufficient conditions that ensure the existence and uniqueness of local minima. Early on, Heskes (2004) pointed out that if a graph involves a single cycle, then there is a unique local minimum and the convergence of LBP is all but guaranteed. Supplementing Heskes (2004), Yedidia et al. (2005) showed that if a factor graph has more than one cycle then the convexity of the Bethe free energy is violated, and thus so is the uniqueness of LBP’s fixed points. More recently, Shi et al. (2010) discussed new sufficient conditions for the convergence of LBP by deriving uniform and non-uniform error bounds on the messages. But this research direction ignores an important observation made by Heskes (2002):

“Still, loopy belief propagation can fail to converge, and apparently for two different reasons. The first rather innocent one is a too large step size, similar to taking a too large “learning parameter” in gradient-descent learning”

In this paper, by relying on a variational formulation, our algorithm optimizes variational bounds on the model evidence, which implicitly guarantees that it does not overstep a local minimum.

LBP and Bethe free energy Here, mainstream work attempts to derive new variants of LBP for approximate inference by directly optimizing the Bethe energy functional, such as the double-loop algorithm (Yuille 2001). The main disadvantage of this algorithm is that it requires solving an optimization problem at each iteration, which results in slower convergence. Another class of algorithms, known as cluster-graph BP, runs LBP on sub-trees of the cluster graph. These algorithms exhibit faster convergence and introduce a new way of characterizing the connections between LBP and optimization problems based on the energy functional. Consequently, several works appeared that generalized LBP by introducing variants of the energy functional that improve convergence. For instance, Wainwright and Jordan (2003) and Nguyen et al. (2004) proposed a convexified free energy that provides an upper bound on the partition function, but the algorithms built on this energy functional still cannot guarantee convergence. Recently, alternative algorithms have been introduced that do guarantee convergence for such energy functionals (Hazan and Shashua 2008; Meltzer et al. 2009; Globerson and Jaakkola 2007; Hazan and Shashua 2010).

At a high level, our GEM-MP approach resembles previously mentioned approaches in that it is based on variational inference and involves minimizing a free-energy functional.

It remains unclear whether there is a relationship between determinism and the uniqueness of LBP’s local minima. However, our experiments support prior work observing that applying LBP to graphical models with determinism and cycles is more likely to oscillate or to converge to wrong results.

LBP and constraint propagation Horsch and Havens (2000) proposed an algorithm that is a generalization of the arc consistency used in constraint reasoning, and a specialization of the LBP used for probabilistic reasoning. The idea was to exploit the relationship between LBP and arc consistency to compute solution probabilities, which can then be used as a heuristic to guide constructive search algorithms solving binary CSPs. The bucket-elimination procedure was proposed by Dechter and Mateescu (2003); however, such a procedure has time and space complexity exponential in the induced width of the problem graph, which depends on the processing order of the variables and on how densely they are connected to each other. Alternatively, Mateescu et al. (2010) presented approaches based on constructing a relationship between LBP and constraint propagation techniques. One idea underlying these approaches is to transform the loopy graph into a tree-like structure to alleviate the presence of cycles, and then to exploit constraint propagation techniques to tackle the determinism. Building on these ideas, we explore the second research hypothesis: constraint satisfaction techniques might be able to help address the challenges that determinism creates in graphical models.

A recent extension of such approaches combines LBP, constraint propagation, and expectation maximization to derive an efficient heuristic search for solving both satisfiability problems (Hsu et al. 2007, 2008) and constraint satisfaction problems (Le Bras et al. 2009). Although these algorithms perform well at finding solutions, they apply only to graphical models that contain no probabilistic knowledge. In contrast, our GEM-MP method is able to handle probabilistic knowledge.

Damped LBP Another traditional way to handle non-convergence is to dampen the marginals (Koller and Friedman 2009) in order to diminish oscillation. In many cases, however, dampening causes LBP to converge but yields a poor-quality result (Mooij and Kappen 2005), because the correct results do not usually lie at the average point (Murphy et al. 1999). A second track of this research direction alleviates double counting by changing the schedule of message updates [e.g., sequentially on an Euler path, as per Yeang (2010), or residual BP, as per Elidan et al. (2006)], as well as by adapting the initialization of the marginals [e.g., restarting with different initializations, as per Koller and Friedman (2009)]. However, this cannot guarantee convergence, since the algorithm still runs the risk of overshooting the nearest local minimum. In contrast, a key aspect of GEM-MP is that its iterations are constrained by the variational inequality: updates to the distributions over hidden variables are made in such a way that the variational lower bound never exceeds the log marginal likelihood.
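For contrast with GEM-MP’s bounded updates, the damping heuristic discussed here is a one-line fix: blend the newly computed message with the previous one using a damping factor \(\lambda \in (0,1]\) (a sketch; \(\lambda = 1\) recovers undamped LBP):

    def damped_update(old_msg, new_msg, lam=0.5):
        # convex combination of the previous and the freshly computed message
        return {v: (1.0 - lam) * old_msg[v] + lam * new_msg[v] for v in new_msg}

This slows the oscillation down but, as noted above, does not change where the fixed points lie, nor does it prevent overshooting them.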

Re-parameterized LBP More recently, Smith and Gogate (2014) introduced a new approach aimed at dealing with determinism more effectively. The idea is to re-parameterize the Markov network by changing a zero entry in a factor to a non-negative real value in such a way that the LBP algorithm converges faster. GEM-MP also addresses the problem of determinism by improving message-passing inference to deal with determinism and cycles more effectively, but our approach is different, being rooted in both variational techniques and generalized arc consistency.

LBP and variational methods Another research area combines message passing with other variational methods to produce new types of LBP that can guarantee convergence. For example, Winn and Bishop (2005) presented variational message passing as a way to view many variational inference techniques, providing a general-purpose algorithm for approximate inference; this algorithm performs very well when applied to conjugate-exponential family models. Weinman et al. (2008) proposed a sparse variational message-passing algorithm to dramatically accelerate the approximate inference needed for parameter optimization in stereo vision. Dauwels et al. (2005) proposed a generic form of structured variational message passing and investigated a message-passing formulation of EM. Our GEM-MP method can be seen as akin to these message-passing inference methods, but a basic aspect of GEM-MP is its exploitation of ideas from constraint satisfaction to handle the challenges stemming from determinism.

Lifted LBP Another promising research area that has recently been explored seeks to improve the scalability of LBP on models that feature large networks. Here, mainstream work attempts to exploit structural properties of the network, such as symmetry (Ahmadi et al. 2013), determinism (Papai et al. 2011; Ibrahim et al. 2015), sparseness (Poon et al. 2008), and type hierarchy (Kiddon and Domingos 2011), to scale LBP inference. For instance, lifted inference either operates directly on the first-order structure or uses the symmetry present in the structure of the network to reduce its size (e.g., Ahmadi et al. 2013); the key idea is to deal with groups of indistinguishable variables rather than individual variables. Poole (2003) was one of the first to show that variable elimination can be lifted to avoid propositionalization, and this has been extended with lifted variants of the algorithm proposed by De Salvo Braz et al. (2005) and Milch et al. (2008). Subsequently, Singla and Domingos (2008) proposed the first lifted version of LBP, which was extended by Sen et al. (2009) and generalized with the color message-passing algorithm introduced by Kersting et al. (2009) for approximating the computational symmetries. It was then shown by Gogate and Domingos (2011) that, to avoid dissipating the capabilities of first-order theorem proving, the logical structure must be taken into consideration; based on this, lifted variants of weighted model counting were proposed by Gogate and Domingos (2011), while variants of lifted knowledge compilation, such as the bisimulation-based algorithm, were introduced by Van den Broeck et al. (2011). Later, it was observed that in some cases the constructed lifted network can itself be quite large, making it very close in size to the fully propositionalized one and yielding no speedup from lifting. An interesting argument proposed by Kersting (2012) concludes that the evidence problem could be the reason: symmetries within models easily break down when variables become correlated by depending asymmetrically on evidence, so lifting produces models that are often not far from propositionalized ones, diminishing the power of lifted inference. One can obtain better lifting by performing shattering as needed during BP inference, as in the anytime BP of De Salvo Braz et al. (2009), by exploiting the model’s symmetries before the evidence is obtained, as demonstrated by Bui et al. (2012), or by shattering a model into local pieces, iteratively handling the pieces independently, and re-combining the parameters from each piece, as explained by Ahmadi et al. (2013). Recently, Gogate et al. (2012) showed that the evidence problem in lifted inference can be mitigated, for importance sampling algorithms, by using an informed distribution derived from a compressed representation of the MLN. Our approach differs from the above lifted message-passing algorithms in being built on a propositional basis, but it could easily be combined with them to lift its inference.

9 Conclusion and future work

Our work has targeted the comparatively little-studied issue of using LBP and message-passing techniques in probabilistic models possessing both cycles and determinism. To fully exploit determinism, rather than having it pose a problem for inference, we have examined some of the intricacies of message-passing algorithms. The novelty of our work lies in the proposal and exploration of an approach which we have named Generalized arc-consistency Expectation-Maximization Message-Passing (GEM-MP), a message-passing algorithm that applies a form of variational approximate inference to an extended form of the underlying graphical model. We have focused our experiments on Markov logic, but our method generalizes readily to other graphical models; to demonstrate this, we have also presented results on Ising models, where our method outperforms a variety of state-of-the-art techniques. The update rules of GEM-MP can be viewed as a free energy minimization method whose successive updates form a path of bounded steps to the nearest local minimum in the space of approximate marginals. Using entity resolution and link prediction problems, we have experimentally validated the effectiveness of GEM-MP at converging to more accurate marginals, addressing the limitations of LBP engendered by the presence of cycles and determinism.

As with other variational methods, much of the strength of our method is a consequence of Jensen’s inequality, which enables variational message-passing inference to estimate marginals, through the optimization of variational parameters, by tightening a lower bound on the model’s marginal likelihood at each approximate marginal update, so that we can never overshoot the underlying true marginal likelihood. We believe this effect alleviates the threat of non-convergence due to cycles. In addition, the effectiveness of generalized arc consistency for handling the logical structures lets us exploit structure in the problem that is not normally available to a more naive message-passing algorithm. In so doing, our formulation transforms determinism from a limitation into an advantage from the perspective of GEM-MP.
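Concretely, the bound in question is the standard variational lower bound obtained from Jensen’s inequality: for any distribution \(q({\mathcal {Y}})\) over the hidden variables,

$$\begin{aligned} \log p({\mathcal {X}}) = \log \sum _{{\mathcal {Y}}} q({\mathcal {Y}}) \, \frac{p({\mathcal {X}},{\mathcal {Y}})}{q({\mathcal {Y}})} \ge \sum _{{\mathcal {Y}}} q({\mathcal {Y}}) \log \frac{p({\mathcal {X}},{\mathcal {Y}})}{q({\mathcal {Y}})}, \end{aligned}$$

so every update that increases the right-hand side moves the approximation toward, but never past, \(\log p({\mathcal {X}})\).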

These explorations point to a number of promising directions for future work. We plan to evaluate the use of GEM-MP as an inference subroutine for learning. Also, we intend to investigate the lifted (Ahmadi et al. 2013; Singla et al. 2010) and the lazy (Poon et al. 2008) versions of GEM-MP to enhance its scalability. Finally, we intend to increase the accuracy of GEM-MP by deriving new update rules that apply a global approximation for \(\large {q}({\mathcal {Y}};{\mathcal {T}}_{\small {{\mathcal {Y}}}})\) in the \(M_{q\small {({\mathcal {Y}})}}\)-step of GEM-MP.