Exploiting symmetries for scaling loopy belief propagation and relational training
Abstract
Judging by the increasing impact of machine learning on large-scale data analysis in the last decade, one can anticipate a substantial growth in the diversity of machine learning applications for "big data" over the next decade. This exciting new opportunity, however, also raises many challenges. One of them is scaling inference within and training of graphical models. Typical ways to address this scaling issue are inference by approximate message passing, stochastic gradients, and MapReduce, among others. Often, we encounter inference and training problems with symmetries and redundancies in the graph structure. A prominent example is relational models, which compactly capture complex dependencies. Exploiting these symmetries, however, has not yet been considered for scaling. In this paper, we show that inference and training can indeed benefit from exploiting symmetries. Specifically, we show that (loopy) belief propagation (BP) can be lifted. That is, a model is compressed by grouping together nodes that send and receive identical messages, so that a modified BP running on the lifted graph yields the same marginals as BP on the original one, but often in a fraction of the time. By establishing a link between lifting and radix sort, we show that lifting is MapReduceable. Still, in many if not most situations, training relational models will not benefit from this (scalable) lifting: symmetries within models easily break since variables become correlated by virtue of depending asymmetrically on evidence. An appealing idea for such situations is to train and recombine local models. This breaks long-range dependencies and allows us to exploit lifting within and across the local training tasks. Moreover, it naturally paves the way for the first scalable lifted training approaches based on stochastic gradients, both in an online and a MapReduced fashion. On several datasets, the online training, for instance, converges to a solution of the same quality over an order of magnitude faster, simply because it starts optimizing long before having seen the entire mega-example even once.
Keywords
Statistical relational learning · Lifted inference · Lifted online training · MapReduce

1 Introduction
Machine learning thrives on large datasets and models, and one can anticipate substantial growth in the diversity and the scale of impact of machine learning applications over the coming decade. Such datasets and models originate, for example, from social networks and media, online books at Google, image collections at Flickr, and robots entering real life. And as storage capacity, computational power, and communication bandwidth continue to expand, today's "large" is certainly tomorrow's "medium" and next week's "small".
This exciting new opportunity, however, also raises many challenges. One of them is scaling inference within and training of graphical models. Statistical learning provides a rich toolbox for scaling. Actually, statistical learning is unthinkable for many practical applications without techniques such as inference by approximate message passing, stochastic gradients, and MapReduce, among others. Often, however, we face inference and training problems with symmetries and redundancies in the graph structure. A prominent example is relational models, see De Raedt et al. (2008), Getoor and Taskar (2007) for overviews, which tackle a long-standing goal of AI, namely unifying first-order logic—capturing regularities and symmetries—and probability—capturing uncertainty. They often encode large, complex models using only a few weighted rules and, hence, symmetries and redundancies abound. However, symmetries, whether stemming from relational models or not, have not yet been considered for scaling inference and training.
Machine learning thrives on large-scale datasets and models. Several approaches have been developed to scale traditional statistical learning, such as approximate message passing, stochastic gradients, and MapReduce, among others (denoted by "x" if mainstream or by a representative citation). Scaling by lifting, i.e., exploiting symmetries within the graphical model structure, however, has not received a lot of attention. In this paper, we aim at closing this gap (denoted as "Sect.") in order to boost cross-fertilization. Please note that the references are naturally not exhaustive but representatives of the two worlds
|  | Statistical Learning: single core | online | MapReduce | Statistical Relational Learning: single core | online | MapReduce |
| --- | --- | --- | --- | --- | --- | --- |
| Loopy BP, standard | x | Acar et al. (2008) | Gonzalez et al. (2009a) | x | de Salvo Braz et al. (2009) | Gonzalez et al. (2009a) |
| Loopy BP, lifted | Sect. 4 | Hadiji et al. (2011) | Sect. 5 |  |  | Sect. 5 |
| Training, standard | x | x | Zinkevich et al. (2010) | x | Huynh and Mooney (2011) | Sect. 6 |
| Training, lifted | – | – | – | Sect. 6 | Sect. 6 | Sect. 7 |
Limitation 1
Loopy belief propagation does not exploit symmetries.
Indeed, for relational models, lifted belief propagation has been proposed that exploits symmetries, see e.g. Singla and Domingos (2008). It often renders large, previously intractable probabilistic inference problems quickly solvable by employing symmetries to handle whole sets of indistinguishable random variables. However, symmetries are present in abundance in traditional, non-relational models, too. Moreover, although with the availability of affordable commodity hardware and high-performance networking we have increasing access to computer clusters,
Limitation 2
Lifted belief propagation is still carried out on a single core.
That is, there are no inference approaches that exploit both symmetries and MapReduce for scaling. But even if there were, in many if not most situations, training relational models will not benefit from scalable lifting.
Limitation 3
Symmetries within models easily break since variables become correlated by virtue of depending asymmetrically on evidence,
so that lifting produces models that are often not far from propositionalized, therefore canceling the benefits of lifting for training. Moreover, in relational learning we typically face only a single mega-example (Mihalkova et al. 2007), a single large set of interconnected facts. Consequently, many if not all standard statistical learning methods do not naturally carry over to the relational case. Consider, for example, stochastic gradient methods. Similar to the perceptron method (Rosenblatt 1962), stochastic gradient approaches update the weight vector in an online setting. We essentially assume that the training examples are given one at a time. The algorithms examine the current training example and then update the parameter vector accordingly. Stochastic gradient approaches often scale sub-linearly with the amount of training data, making them very attractive for the large training data targeted by statistical relational learning. Empirically, they are often found to be even more resilient to errors made when approximating the gradient. Unfortunately, stochastic gradient methods do not naturally carry over to the relational case:
Limitation 4
Stochastic gradients coincide with batch gradients in the relational case since there is only a single mega-example.
Consequently, while there are efficient parallelized gradient approaches such as developed by Zinkevich et al. (2010) that impose very little I/O overhead and are well suited for a MapReduce implementation, these have not been used for lifted training.

We introduce lifted loopy belief propagation that exploits symmetries and hence often scales much better than standard loopy belief propagation. Its underlying idea is rather simple: group together nodes that are indistinguishable in terms of messages received and sent given the evidence. The lifted graph is often significantly smaller and can be used to perform a modified loopy belief propagation yielding the same results as loopy belief propagation applied to the unlifted graph. This overcomes Limitation 1, and our experimental results show that considerable efficiency gains are obtainable.

We present the first MapReduce lifted belief propagation approach. More precisely, we establish a link between color-passing, the specific way of lifting the graph, and radix sort, which is well known to be MapReduceable. Together with the MapReduce belief propagation approach of Gonzalez et al. (2009a, 2009b), this overcomes Limitation 2. Our experimental results show that MapReduced lifting scales much better than single-core lifting.
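The link can be sketched as follows: one round of color refinement is a map step that emits each node keyed by its color signature, followed by a reduce step that assigns one new color per signature group; grouping by signature is exactly sorting the signatures, for which radix sort (and hence MapReduce's shuffle) suffices. A minimal single-machine sketch, with a dictionary standing in for the actual distributed shuffle (function and variable names are ours, not the paper's):

```python
from collections import defaultdict

def map_phase(neighbors, colors):
    """Map: each node emits (signature, node), where the signature is the
    node's own color plus the sorted multiset of its neighbors' colors."""
    for node in neighbors:
        sig = (colors[node], tuple(sorted(colors[m] for m in neighbors[node])))
        yield sig, node

def reduce_phase(pairs):
    """Reduce: nodes sharing a signature (grouped by the shuffle, conceptually
    a radix sort over signatures) receive one common new color."""
    groups = defaultdict(list)
    for sig, node in pairs:
        groups[sig].append(node)
    return {node: new_color
            for new_color, (_, nodes) in enumerate(sorted(groups.items()))
            for node in nodes}
```

Iterating map and reduce until the number of colors no longer changes yields the lifted partition.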

We develop the first lifted training approach. More specifically, we shatter a model into local pieces. In each iteration, we then train the pieces independently and recombine the learned parameters from each piece. This breaks longrange dependencies and allows one to exploit lifting across the local training tasks. Hence, it overcomes Limitation 3.

Based on the lifted piecewise training, we introduce the first online training approach for relational models that can deal with partially observed training data. The idea is to treat (mini-batches of) pieces as training examples and to process one piece after the other. This overcomes Limitation 4. Our experimental results on several datasets demonstrate that the online training converges to a solution of the same quality over an order of magnitude faster than batch learning, simply because it starts optimizing long before having seen the entire mega-example even once. Moreover, it naturally paves the way to MapReduced lifted training. Indeed, the way we shatter the full model into pieces greatly affects the learning quality: important influences between variables might get broken. To overcome this, we randomly grow relational piece patterns that form trees. Our experimental results show that tree pieces can balance lifting and the quality of the online training.
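The piece-based online scheme can be sketched as plain stochastic gradient ascent over mini-batches of pieces. Here `grad_fn`, which returns the gradient contribution of a single piece, is a stand-in of ours for the (lifted) per-piece gradient computation, not the paper's actual training code:

```python
import random

def online_piecewise_training(pieces, grad_fn, theta, lr=0.05, epochs=3, batch=2):
    """Treat (mini-batches of) local pieces as training examples: take one
    stochastic gradient step per batch, so optimization starts long before
    the whole mega-example has been seen."""
    for _ in range(epochs):
        random.shuffle(pieces)                      # visit pieces in random order
        for start in range(0, len(pieces), batch):
            grad = [0.0] * len(theta)
            for piece in pieces[start:start + batch]:
                for k, g in enumerate(grad_fn(piece, theta)):
                    grad[k] += g
            # ascent step on the (approximate) log-likelihood
            theta = [t + lr * g / batch for t, g in zip(theta, grad)]
    return theta
```

In the actual approach each step additionally exploits lifting within the batch, since identical pieces yield identical gradient computations.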
The present paper is a significant extension of the ECML/PKDD 2012 and UAI 2009 conference papers (Ahmadi et al. 2012; Kersting et al. 2009). It provides the first coherent view on lifted inference and training using loopy belief propagation, suggesting (piecewise) symmetries as a novel but promising dimension for scaling statistical machine learning. It develops the first MapReduce approaches to both tasks by establishing a link between color-passing and radix sort, which is well known to be MapReduceable. Additionally, the present paper provides a detailed description of the color-passing procedure and the resulting lifted equations for lifted BP, as well as a characterization of the lifting in terms of colored computation trees (CCTs). Finally, and probably most importantly, it experimentally proves that exploiting symmetries and MapReduce together can scale much better than exploiting either of them alone.
We proceed as follows. After touching upon related work, we recap factor graphs, loopy belief propagation, and Markov logic networks, the probabilistic relational framework we focus on for illustration purposes only. We then show how to scale message-passing inference in two dimensions, namely by exploiting symmetries and by MapReduce. Then, we develop stochastic relational gradients that naturally pave the way to MapReduced training. Each section is concluded by presenting the respective experimental results.
2 Related work
The paper aims at getting mainstream statistical and statistical relational learning closer together. As argued above, we do so by employing symmetries, MapReduce and stochastic gradients. Consequently, there are several related lines of research.
Lifted probabilistic inference
Employing symmetries in graphical models to speed up probabilistic inference, called lifted probabilistic inference, has recently received a lot of attention, see e.g. Kersting (2012) for an overview. The closest work to our approach for exploiting symmetries within message passing is that of Singla and Domingos (2008). Actually, an investigation of their approach was the seed that grew into our proposal. Singla and Domingos' lifted first-order belief propagation (LFOBP) builds upon Jaimovich et al. (2007) and also groups random variables, i.e., nodes that send and receive identical messages. Lifted BP, the approach introduced in the present paper, differs from LFOBP on two important counts. First, lifted BP is conceptually simpler than LFOBP. This is because efficient inference approaches for first-order and relational probabilistic models are typically rather complex. Second, LFOBP requires as input the specification of the probabilistic model in first-order logical format. Only nodes over the same predicate can be grouped together to form so-called cluster-nodes. This means that LFOBP coincides with standard BP for propositional MLNs, i.e., MLNs involving propositional variables only. The reason is that propositions are predicates with arity 0, so that the cluster-nodes are singletons. Hence, no nodes and no features are grouped together. In contrast, our lifting can be applied directly to any factor graph over finite random variables. In this sense, lifted BP is a generalization of LFOBP.
Sen et al. (2008) presented another "clustered" inference approach based on bisimulation. Like lifted BP, which can also be viewed as running a bisimulation-like algorithm on the factor graph, Sen et al.'s approach does not require a first-order logical specification. In contrast to lifted BP, it is guaranteed to converge but is also much more complex. Its efficiency in dynamic relational domains, in which variables easily become correlated over time by virtue of sharing common influences in the past, is unclear, and its evaluation is interesting future work.
Others, such as Poole (2003), Braz et al. (2005, 2006), Milch et al. (2008), Kisynski and Poole (2009), Choi et al. (2011) and Taghipour et al. (2012), have developed lifted versions of the variable elimination (VE) algorithm. They typically also employ a counting elimination operator that is equivalent to counting indistinguishable random variables and then summing them out immediately. The different variations of these algorithms improve upon the work of Poole (2003) and Braz et al. (2005, 2006) by introducing counting formulas (Milch et al. 2008), aggregation operators (Choi et al. 2011; Kisynski and Poole 2009), and the handling of arbitrary constraints (Taghipour et al. 2012). Choi et al. (2010) developed a variable elimination algorithm for the case where the underlying distributions are over continuous random variables. These exact inference approaches are extremely complex, so far do not easily scale to realistic domains, and hence have only been applied to rather small artificial problems. Recently, Van den Broeck et al. (2012) proposed a method that relaxes first-order conditions to perform exact lifted inference on the model and then incrementally improves the approximation by adding more constraints back into the model. This can be viewed as bridging the lifted VE and lifted BP methods presented above. Again, as for LFOBP, a first-order logical specification of the model is required.
An alternative to BP and VE is to use search-based methods based on recursive conditioning. That is, by conditioning on parameterized variables, we decompose a lifted network into smaller networks that can be solved independently. Each of these subnetworks is then solved recursively using the same method, until we reach a network simple enough to be solved directly (Darwiche 2001). Recently, several lifted search-based methods have been proposed (Gogate and Domingos, 2010, 2011; Van den Broeck et al., 2011; Poole et al., 2011) that assume a relational model is given. Gogate and Domingos (2011) reduced the problem of lifted probabilistic inference to weighted model counting in a lifted graph. Van den Broeck et al. (2011) employ circuits in first-order deterministic decomposable negation normal form to do the same, also for higher-order marginals (Van den Broeck and Davis 2012). Both of these approaches were developed in parallel and have promising potential for lifted inference. There are also sampling methods that employ ideas of lifting. Milch and Russell developed an MCMC approach where states are only partial descriptions of possible worlds (Milch and Russell 2006). Zettlemoyer et al. (2007) extended particle filters to a logical setting. Gogate and Domingos introduced a lifted importance sampling (Gogate and Domingos 2011). Recently, Niepert proposed permutation groups and group-theoretical algorithms to represent and manipulate symmetries in probabilistic models, which can be used for MCMC (Niepert 2012). And Bui et al. (2012) have shown that for MAP inference we can exploit the symmetries of the model before evidence is obtained. It is an interesting question whether one can characterize these symmetries more precisely. Work on symmetry in exponential families (Bui et al. 2012) shows that one can create cluster-nodes using the automorphism group of the graph, using the notion of orbit partitions. Mladenov et al. (2012) explored color-passing in the setting of linear programming and showed that symmetries can be found and exploited in linear programs.
As of today, none of these approaches have been shown to be MapReduceable. Moreover, many of them have exponential costs in the treewidth of the graph, making them infeasible for most realworld applications, and none of them have been used for training relational models.
Local training
Our lifted training approach is related to local training methods well known for propositional graphical models. Besag (1975) presented a pseudo-likelihood (PL) approach for training an Ising model with a rectangular array of variables. PL, however, tends to introduce a bias and is not necessarily a good approximation of the true likelihood for a small number of samples. In the limit, however, the maximum pseudo-likelihood estimate coincides with that of the true likelihood (Winkler 1995). Hence, it is a very popular method for training models such as Conditional Random Fields (CRFs), where the normalization can become intractable while PL requires normalizing over only one node. An alternative approach, which the present paper also follows, is to decompose the factor graph into tractable subgraphs (or pieces) that are trained independently (Sutton and McCallum 2009). This piecewise training can be understood as approximating the exact likelihood using a propagation algorithm such as BP. Sutton and McCallum (2009) also combined the two ideas of PL and piecewise training to propose piecewise pseudo-likelihood (PWPL), which, in spite of being a double approximation, has the benefit of being accurate like piecewise training and scales well due to the use of PL. Another intuitive approach is to compute approximate marginal distributions using a global propagation algorithm like BP and simply substitute the resulting beliefs into the exact ML gradient (Lee et al. 2007), which results in approximate partial derivatives. Similarly, the beliefs can also be obtained by a sampling method such as MCMC, where the true marginals are approximated by running an MCMC algorithm for a few iterations. Such an approach is called contrastive divergence (Hinton 2002) and is popular for training CRFs.
Statistical relational learning
All the above training methods were originally developed for propositional data, while real-world data is inherently noisy and relational. Statistical Relational Learning (SRL) (De Raedt et al. 2008; Getoor and Taskar 2007) deals with uncertainty and relations among objects. The advantage of relational models is that they can succinctly represent probabilistic dependencies among the attributes of different related objects, leading to a compact representation of learned models. While relational models are very expressive, learning them is a computationally intensive task. Recently, there have been some advances in learning SRL models, especially in the case of Markov logic networks (Khot et al., 2011; Kok and Domingos, 2009, 2010; Lowd and Domingos, 2007). Algorithms based on functional-gradient boosting (Friedman 2001) have been developed for learning SRL models such as relational dependency networks (Natarajan et al. 2012) and Markov logic networks (Khot et al. 2011). Piecewise learning has also already been pursued in SRL. For instance, Richardson and Domingos (2006) used pseudo-likelihood to approximate the joint distribution of MLNs, which is inspired by the local training methods mentioned above. Though all these methods exhibit good empirical performance, they make the closed-world assumption, i.e., whatever is unobserved in the world is considered to be false. They cannot easily deal with missing information. To do so, algorithms based on classical EM (Dempster et al. 1977) have been developed for ProbLog, CP-logic, PRISM, probabilistic relational models, and Bayesian logic programs (Getoor et al. 2002; Gutmann et al. 2011; Kersting and De Raedt 2001; Sato and Kameya 2001; Thon et al. 2011), among others, as well as gradient-based approaches for relational models with complex combining rules (Natarajan et al. 2009; Jaeger 2007).
Poon and Domingos (2008) extended the approach of Lowd and Domingos (2007), which uses a scaled conjugate gradient with a preconditioner, to handle missing data. All these approaches, however, assume a batch learning setting; they do not update the parameters until the entire data has been scanned. In the presence of large amounts of data, such as relational data, this can be wasteful. Stochastic gradient methods as considered in the present paper, on the other hand, are online and scale sub-linearly with the amount of training data, making them very attractive for large data sets. Only Huynh and Mooney (2011) have recently studied online training of MLNs. Here, training was posed as an online max-margin optimization problem, and a gradient for the dual was derived and solved using incremental dual-ascent algorithms. Huynh and Mooney's approach, however, is orthogonal to ours in that they do discriminative learning as opposed to the generative learning in the current paper. Also, they do not employ lifted inference for training and make the closed-world assumption. It would be interesting to see where the approaches can complement each other, e.g., by employing parallel lifted max-product belief propagation for the max-margin computation.
Distributed inference and training
To scale probabilistic inference and training, Gonzalez et al. (2009a, 2009b) present algorithms for parallel inference on large factor graphs using belief propagation in shared memory as well as in the distributed memory setting of computer clusters. Although Gonzalez et al. report on MapReducing lifted inference within MLNs, they actually assume the lifted network to be given. We here demonstrate for the first time that lifting per se is MapReduceable, thus putting the scaling of lifted SRL to "Big Data" within reach. As our experimental results illustrate, we achieve orders-of-magnitude improvements over existing methods using our approach. As far as we are aware, the only other method that has addressed scaling up SRL algorithms is the work by Niu et al. (2011), which considered the problem of scaling up ground inference and learning over factor graphs that are multiple terabytes in size. They achieve this using database technology with two key observations: first, grounding the entire SRL model into a factor graph is seen as an RDBMS join that is realized using a distributed RDBMS; second, they make learning I/O bound by using a storage manager to run inference efficiently over factor graphs that are larger than main memory. It remains very interesting and exciting future work to implement our algorithm using this database technology.
3 Loopy belief propagation and Markov logic networks
An important inference task is to compute the conditional probability of some variables given the values of some others, the evidence, by summing out the remaining variables. The belief propagation (BP) algorithm is an efficient way to solve this problem that is exact when the factor graph is a tree, but only approximate when the factor graph has cycles. One should note that the problem of computing marginal probability functions is in general hard (#P-complete).
Belief propagation makes local computations only. It makes use of the graphical structure such that the marginals can be computed much more efficiently. We will now describe the BP algorithm in terms of operations on a factor graph. The computed marginal probability functions will be exact if the factor graph has no cycles, but the BP algorithm is still well-defined when the factor graph does have cycles. Although this loopy belief propagation has no guarantees of convergence or of giving the correct result, in practice it often does converge, and it can be much more efficient than other methods (Murphy et al. 1999).
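The two message types (variable-to-factor and factor-to-variable) can be sketched as follows. This is a minimal, unoptimized sum-product implementation for binary variables with synchronous updates, written by us for illustration; it is not the paper's libDAI-based implementation:

```python
import itertools
import numpy as np

def loopy_bp(factors, num_vars, iters=50):
    """Sum-product loopy BP for binary variables.

    factors: list of (var_indices, table), table of shape (2,)*len(var_indices).
    Returns approximate marginals, one length-2 array per variable.
    """
    # messages[(i, v)]: factor i -> variable v; messages[(v, i)]: variable -> factor
    msg_fv = {(i, v): np.ones(2) for i, (vs, _) in enumerate(factors) for v in vs}
    msg_vf = {(v, i): np.ones(2) for i, (vs, _) in enumerate(factors) for v in vs}
    for _ in range(iters):
        # variable -> factor: product of incoming factor messages except the target
        for (v, i) in msg_vf:
            m = np.ones(2)
            for (j, w) in msg_fv:
                if w == v and j != i:
                    m *= msg_fv[(j, w)]
            msg_vf[(v, i)] = m / m.sum()
        # factor -> variable: multiply in the other variables' messages, sum them out
        for (i, v) in msg_fv:
            vs, table = factors[i]
            m = np.zeros(2)
            for assignment in itertools.product([0, 1], repeat=len(vs)):
                p = table[assignment]
                for u, x in zip(vs, assignment):
                    if u != v:
                        p *= msg_vf[(u, i)][x]
                m[assignment[vs.index(v)]] += p
            msg_fv[(i, v)] = m / m.sum()
    # beliefs: product of all incoming factor messages, normalized
    marginals = []
    for v in range(num_vars):
        b = np.ones(2)
        for (i, w) in msg_fv:
            if w == v:
                b *= msg_fv[(i, w)]
        marginals.append(b / b.sum())
    return marginals
```

On a tree-structured factor graph this returns the exact marginals; with cycles it returns the usual loopy BP approximation.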
Since loopy belief propagation is efficient, it can be used directly for inference within statistical relational models, which have recently gained attention within the machine learning and AI communities. Statistical relational models provide powerful formalisms to compactly represent complex real-world domains. These formalisms enable us to effectively represent and tackle large problem instances for which inference and training are increasingly challenging. One of the most prominent examples of statistical relational models are Markov logic networks (Richardson and Domingos 2006).
(Top) Example of a Markov logic network inspired by Singla and Domingos (2008). Free variables are implicitly universally quantified. (Center) Grounding of the MLN for the constants Anna and Bob. (Bottom) Additional clause about similar smoking habits of buddies instead of friends
| English | First-Order Logic | Weight |
| --- | --- | --- |
| Smoking causes cancer | Smokes(x) ⇒ Cancer(x) | 1.5 |
| Friends have similar smoking habits | Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)) | 2.0 |

1.5 : Smokes(Anna) ⇒ Cancer(Anna)
1.5 : Smokes(Bob) ⇒ Cancer(Bob)
2.0 : Friends(Anna,Bob) ⇒ (Smokes(Anna) ⇔ Smokes(Bob))
2.0 : Friends(Bob,Anna) ⇒ (Smokes(Bob) ⇔ Smokes(Anna))

| English | First-Order Logic | Weight |
| --- | --- | --- |
| Buddies have similar smoking habits | Buddies(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)) | 2.0 |
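In the ground Markov logic network, each ground clause with weight w induces a factor with potential exp(w) when the clause is satisfied and exp(0) = 1 otherwise. A minimal sketch of grounding Table 2 (top) for the constants Anna and Bob; the helper and function names are ours, not part of any MLN system:

```python
import itertools
import math

CONSTANTS = ["Anna", "Bob"]

def ground_smoking_mln():
    """Ground the example MLN: each instantiated clause becomes one factor."""
    factors = []  # entries: (ground_atoms, weight, satisfied_fn)
    for x in CONSTANTS:
        # 1.5 : Smokes(x) => Cancer(x)
        factors.append(((f"Smokes({x})", f"Cancer({x})"), 1.5,
                        lambda s, c: (not s) or c))
    for x, y in itertools.permutations(CONSTANTS, 2):
        # 2.0 : Friends(x,y) => (Smokes(x) <=> Smokes(y))
        factors.append(((f"Friends({x},{y})", f"Smokes({x})", f"Smokes({y})"), 2.0,
                        lambda f, sx, sy: (not f) or (sx == sy)))
    return factors

def potential(weight, satisfied):
    """Factor value for one joint assignment of the clause's ground atoms."""
    return math.exp(weight) if satisfied else 1.0

factors = ground_smoking_mln()
print(len(factors))  # 4 ground clauses, as in Table 2 (center)
```

Note that all four factors share only two distinct potential tables, one per clause, which is precisely the redundancy lifted BP exploits.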
4 Scaling up inference: lifted belief propagation
Although already quite efficient, many graphical models produce inference problems with a lot of additional regularities reflected in the graphical structure but not exploited by BP. Probabilistic graphical models such as MLNs are prominent examples. As an illustrative example, reconsider the factor graph in Fig. 1. The associated potentials are identical. In other words, although the factors involved are different on the surface, they actually share quite a lot of information. Standard BP cannot make use of this information. In contrast, lifted BP—which we will introduce now—can make use of it and speed up inference by orders of magnitude.
Lifted BP performs two steps: Given a factor graph G, it first computes a compressed factor graph \({\mathfrak {G}}\) and then runs a modified BP on \({\mathfrak {G}}\). We will now discuss each step in turn using fraktur letters such as \({\mathfrak {G}}\), \({\mathfrak {X}}\), and \({\mathfrak {f}}\) to denote compressed graphs, nodes, and factors.
Step 1—compressing the factor graph
Essentially, we simulate BP keeping track of which nodes and factors send the same messages, and group nodes and factors together correspondingly.
The final compressed factor graph \({\mathfrak {G}}\) is constructed by grouping all nodes with the same color/shade into so-called cluster-nodes and all factors with the same color/shade signatures into so-called cluster-factors. In our case, the variable nodes A, C and the factor nodes f _{1}, f _{2} are grouped together, see the right-hand side of Fig. 2. Cluster-nodes (resp. cluster-factors) are sets of nodes (resp. factors) that send and receive the same messages at each step of carrying out BP on G. It is clear that they form a partition of the nodes in G.
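The compression step can be sketched as iterated color refinement: each node's new color is determined by its own color and the multiset of its neighbors' colors, repeated until the partition stabilizes. A minimal sketch over a factor graph given as an adjacency dictionary (variable and factor nodes treated uniformly; names are ours):

```python
def color_pass(neighbors, init_colors, max_iters=100):
    """Group nodes that send and receive identical messages.

    neighbors: dict node -> list of adjacent nodes.
    init_colors: dict node -> initial color (from evidence / potentials).
    Returns dict node -> final color; equal colors mark one cluster.
    """
    colors = dict(init_colors)
    for _ in range(max_iters):
        # signature = own color plus sorted multiset of neighbor colors
        sigs = {n: (colors[n], tuple(sorted(colors[m] for m in neighbors[n])))
                for n in neighbors}
        # recolor: identical signatures receive identical new colors
        mapping, new_colors = {}, {}
        for n, sig in sigs.items():
            if sig not in mapping:
                mapping[sig] = len(mapping)
            new_colors[n] = mapping[sig]
        # the partition only refines, so a stable color count means convergence
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors
```

For a chain A—f1—B—f2—C with identical potentials and no evidence, A and C end up in one cluster-node and f1 and f2 in one cluster-factor, matching the example above. The sorting of signatures is the step that later links color-passing to radix sort.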
Now we can run BP with minor modifications on the compressed factor graph \({\mathfrak {G}}\).
Step 2—BP on the compressed factor graph
Evidence is incorporated either on the ground level by setting f(x)=0 or on the lifted level by setting \({\mathfrak {f}}(\mathbf {x}) = 0\) for states x that are incompatible with it.^{2} Again, different schedules may be used for message-passing. If no compression is possible in the factor graph, i.e., there are no symmetries to exploit, there will be only a single position for a variable \({\mathfrak {X}}\) in factor \({\mathfrak {f}}\) and the counts \(c({\mathfrak {f}},{\mathfrak {X}},1)\) will be 1. In this case, the equations simplify to Eqs. (3)–(5).
To conclude the section, the following theorem states the correctness of lifted BP.
Theorem 1
Given a factor graph G, there exists a unique minimal compressed factor graph \({\mathfrak {G}}\), and algorithm CFG(G) returns it. Running BP on \({\mathfrak {G}}\) using Eqs. (7) and (9) produces the same results as BP applied to G.
The theorem generalizes the theorem of Singla and Domingos (2008) but can essentially be proven along the same lines. Although very similar in spirit, lifted BP has one important advantage: not only can it be applied to first-order and relational probabilistic models, but also directly to traditional, i.e., propositional, models such as Markov networks.
Proof
We prove the uniqueness of \({\mathfrak {G}}\) by contradiction. Suppose there are two minimal lifted networks \({\mathfrak {G}}_{1}\) and \({\mathfrak {G}}_{2}\). Then there exists a variable node X that is in cluster-node \({\mathfrak {X}}_{1}\) in \({\mathfrak {G}}_{1}\) and in cluster-node \({\mathfrak {X}}_{2}\) in \({\mathfrak {G}}_{2}\) with \({\mathfrak {X}}_{1} \neq {\mathfrak {X}}_{2}\); or similarly for some cluster-factor \({\mathfrak {f}}\). Since all nodes in \({\mathfrak {X}}_{1}\), and in \({\mathfrak {X}}_{2}\) respectively, send and receive the same messages, we must have \({\mathfrak {X}}_{1} = {\mathfrak {X}}_{2}\), a contradiction. Following the definition of cluster-nodes, any pair of nodes \({\mathfrak {X}}\) and \({\mathfrak {Y}}\) in \({\mathfrak {G}}\) send and receive different messages, so no further grouping is possible. Hence, \({\mathfrak {G}}\) is the unique minimal compressed network.
Now we show that algorithm CFG(G) returns this minimal compressed network. The following arguments are made for the variable nodes in the graph, but they apply analogously to the factor nodes. Reconsider the colored computation trees (CCTs), which resemble the paths along which each node communicates in the network. Variable nodes are grouped if they send and receive the same messages. Thus, nodes X _{1} and X _{2} are in the same cluster-node iff they have the same colored computation tree. Unfolding the computation tree to depth k gives the exact messages that the root node receives after k BP iterations. CFG(G) finds exactly these similar CCTs. Initially, all nodes are colored by the evidence we have; thus, for iteration k=0 we group all nodes that are colored similarly at level k in the CCT. The signatures at iteration k+1 consist of the signatures at depth k (the node's own color in the previous iteration) and the colors of all direct neighbors. That is, at iteration k+1 all nodes that have a similar CCT up to level k+1 are grouped. CFG(G) is iterated until the grouping does not change. The number of iterations is bounded by the length of the longest path connecting two nodes in the graph. The proof that modified BP applied to \({\mathfrak {G}}\) gives the same results as BP applied to G also follows from the CCTs, Eqs. (7) and (8), and the counts representing the number of identical messages sent from the nodes in G. □
In contrast to the color-passing procedure, Singla and Domingos (2008) work on the relational representation and lift the Markov logic network in a top-down fashion. Color-passing, on the other hand, starts from the ground network and groups together nodes and factors bottom-up. While a top-down construction of the lifted network has the advantage of being more efficient for liftable relational models, since the model does not need to be grounded, a bottom-up construction has the advantage that we do not rely on a relational model such as Markov logic networks. Color-passing can group ground instances of similar clauses and atoms even if they are named differently. Reconsider the two-clause example from Table 2 (top). Now, we add the clause from Table 2 (bottom), which has the same weight as the second clause and is similar in structure, i.e., it has the same neighbors. Starting top-down, these two clauses would never be grouped by Singla and Domingos (2008), whereas color-passing would initially give these clauses the same color. More importantly, as long as we can initially color the nodes and factors, based on the evidence and the potentials respectively, color-passing is applicable to relational as well as propositional data, such as the Boolean formulae shown in the following.
4.1 Evaluation: lifted belief propagation
 (Q1)
Can we scale inference in graphical models by exploiting symmetries?
To this aim, we implemented lifted belief propagation (LBP) (Ahmadi et al. 2011; Kersting et al. 2009) in C++ based on libDAI^{3} with bindings to Python. We will evaluate the gain for inference by presenting significant showcases for the application of lifted BP, namely approximate inference for dynamic relational models and model counting of Boolean formulae.
Showing highly impressive lifting ratios for inference is commonly done by restricting attention to classical symmetric relational models without evidence. The two showcases for which we demonstrate lifted inference in the following, however, are particularly suited to show not only the benefits but also the shortcomings of lifted inference. Our first showcase, dynamic relational domains, consists of long chains that destroy the indistinguishability of variables that might exist within a single time step. Due to long chains, within and across time steps, variables become correlated by virtue of sharing some common influence. The second showcase, the problem of model counting of Boolean formulae, as we will see later, is an iterative procedure that repeatedly runs inference. In each iteration, new asymmetric evidence is introduced and lifted inference is run on the modified model. Both are very challenging tasks for inference and in particular for lifting. Thus, in this section, we already confront random evidence and randomness in the graphical structure, which are major issues for lifted inference and training, as we will learn in the following sections.
Lifted inference in dynamic relational domains
Stochastic processes evolving over time are widespread. The truth values of relations depend on the time step t. For instance, a smoker may quit smoking tomorrow. Therefore, we extend MLNs by allowing the modeling of time. The resulting framework is called dynamic MLNs (DMLNs). Specifically, we introduce fluents, a special form of predicates whose last argument is time.
Here, we focus on discrete-time processes, i.e., the time argument is non-negative integer valued. Furthermore, we assume a successor function \(\mathtt{succ(t)}\), which maps the integer t to \(\mathtt{t+1}\). There are two kinds of formulas: intra-time and inter-time ones. Intra-time formulas specify dependencies within a time slice and, hence, do not involve the succ function. To enforce the Markov assumption, each term in a formula is restricted to at most one application of the succ function, i.e., terms such as \(\mathtt{succ(succ(t))}\) are disallowed. A dynamic MLN is now a set of weighted intra- and inter-time formulas. Given the domain constants, in particular the time range 0,…,T _{max} of interest, a DMLN induces an MLN and in turn a Markov network over time.
(Top) Example of a social network Markov logic network inspired by Singla and Domingos (2008). Free variables are implicitly universally quantified. (Bottom) Dynamic extension of the static social network model
English | First-Order Logic | Weight
Most people do not smoke | ¬Smokes(x) | 1.4
Most people do not have cancer | ¬Cancer(x) | 2.3
Most people are not friends | ¬Friends(x,y) | 4.6
Smoking causes cancer | Smokes(x) ⇒ Cancer(x) | 2.0
Friends have similar smoking habits | Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)) | 2.0
A priori most people do not smoke | ¬Smokes(x,0) | 1.4
A priori most people do not have cancer | ¬Cancer(x,0) | 2.3
A priori most people are not friends | ¬Friends(x,y,0) | 4.6
Smoking causes cancer | Smokes(x,t) ⇒ Cancer(x,t) | 2.0
Friends have similar smoking habits | Friends(x,y,t) ⇒ (Smokes(x,t) ⇔ Smokes(y,t)) | 2.0
Most friends stay friends | Friends(x,y,t) ⇔ Friends(x,y,succ(t)) | 5.0
Most smokers stay smokers | Smokes(x,t) ⇔ Smokes(x,succ(t)) | 5.0
Assume that there are two constants, Anna and Bob. Let us say that Bob smokes at time 0 and that he is friends with Anna. Then the ground Markov network will have a clique corresponding to the first two clauses for every time step starting from 0. There will also be edges between the \(\mathtt{Smokes(Bob)}\) atoms (correspondingly for Anna) and between the \(\mathtt{Friends(Bob,Anna)}\) atoms of consecutive time steps.
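Grounding a DMLN over the time range of interest can be sketched as follows; the representation of formulas as opaque templates and the function name are illustrative placeholders, not the actual DMLN implementation.

```python
def unroll_dmln(intra_formulas, inter_formulas, t_max):
    """Instantiate intra-time formulas at each time step 0..t_max and
    inter-time formulas between each pair (t, succ(t)) = (t, t+1)."""
    grounded = [(f, t) for t in range(t_max + 1) for f in intra_formulas]
    grounded += [(f, t, t + 1) for t in range(t_max) for f in inter_formulas]
    return grounded
```

For two intra-time formulas and one inter-time formula over time steps 0..2, this yields six intra-time and two inter-time instantiations.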
To perform inference, we could employ any known MLN inference algorithm. Unlike the case for static MLNs, however, we need approximation even for sparse models: Random variables easily become correlated over time by virtue of sharing common influences in the past.
Classical approaches to approximate inference in dynamic Bayesian networks (DBNs) are the Boyen-Koller (BK) algorithm (Boyen and Koller 1998) and Murphy and Weiss's factored frontier (FF) algorithm (Murphy and Weiss 2001). Both approaches have been shown to be equivalent to one iteration of BP, albeit on different graphs (Murphy and Weiss 2001). BK, however, involves exact inference, which for probabilistic logic models is extremely complex, so far does not scale to realistic domains, and hence has only been applied to rather small artificial problems. In contrast, FF is a more aggressive approximation. It is equivalent to (loopy) BP on the regular factor graph with a forwards-backwards message protocol: each node first sends messages from “left” to “right” and then from “right” to “left”. To this end, a frontier set is maintained. Starting from time step t=0, we first send local messages and then messages to the next time step. A node is included in the frontier set iff all of its parents, that is, all neighbors from time step t−1, and its neighbors from the same time step are included. Only then does it receive a message from its neighbors. The basic idea of lifted first-order factored frontier (LFF) is to plug in lifted BP in place of BP in FF. The local structure is replicated for all time steps in the dynamic network. Thus, the initial coloring of the nodes is the same for all time steps. However, the communication patterns of the instantiations from different time steps differ. Therefore, when we compress such a dynamic network, nodes and factors from different time steps end up in different cluster-nodes and cluster-factors respectively. To see this, suppose we have a network consisting of a single node over three time steps X ^{ t }, t∈{0,1,2}, i.e. a chain of length three. Initially all nodes get the same color. However, the signatures are different.
Node X ^{0} only has a right neighbor, node X ^{1} has two neighbors (left and right) and node X ^{2} has a left neighbor. The frontier set for the lifted network is still well defined and we can run the lifted factored frontier algorithm.
We used the social network DMLN in Table 3 (bottom). There were 20 people in the domain. For fractions r∈{0.0,0.25,0.5,0.75,1.0} of the people we randomly chose whether they smoke or not and who 5 of their friends are, and randomly assigned a time step to this information. Other friendship relations remain unknown. \(\mathtt {Cancer(x,t)}\) is unknown for all persons x and all time steps. The “observed” people were chosen randomly. The query predicate was Cancer.
In the second experiment, we compared the “forwards-backwards” message protocol with the “flooding” protocol, the most widely used and generally best-performing protocol for static networks. Using the “flooding” protocol, messages are passed from each variable to each corresponding factor and back at each iteration. Again, we considered 10 time steps. The results shown in Fig. 4 (middle) clearly favor the FB protocol.
For a qualitative comparison, we finally computed the probability estimates for \(\mathtt{cancer(A,t)}\) using LFF and MC-SAT, the default inference algorithm of the Alchemy system.^{4} For MC-SAT, we used default parameters. There were four persons (A, B, C, and D), and we observed that A smokes at time step 2. All other relations were unobserved for all time steps. We expect the probability of A having cancer to peak at t=2, smoothly fading out over time. Figure 4 (right) shows the results. In contrast to LFF, MC-SAT does not show the expected behavior. The probabilities drop irrespective of the distance to the observation.
So far, Q1 is answered affirmatively. The results clearly show that by lifting we can exploit symmetries for inference in graphical models. A compression and thereby a speedup, however, is not guaranteed. If there are no symmetries, as in the random 3-CNF in the next section, lifted BP essentially coincides with BP.
Model counting using (lifted) belief propagation
Model counting is the classical problem of computing the number of solutions of a given propositional formula. It vastly generalizes the NPcomplete problem of propositional satisfiability, and hence is both highly useful and extremely expensive to solve in practice. Interesting applications include multiagent reasoning, adversarial reasoning, and graph coloring, among others.
Our approach, called LBPCount, is based on BPCount for computing a probabilistic lower bound on the model count of a Boolean formula F, which was introduced by Kroc et al. (2008). The basic idea is to efficiently obtain a rough estimate of the “marginals” of propositional variables using belief propagation with damping. The marginal of variable u in a set of satisfying assignments of a formula is the fraction of such assignments with u=true and u=false respectively. If this information is computed accurately enough, it is sufficient to recursively count the number of solutions of only one of “F with u=true” and “F with u=false”, and scale the count up accordingly. Kroc et al. have empirically shown that BPCount can provide good quality bounds in a fraction of the time compared to previous, samplebased methods.
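The scaling idea behind BPCount can be illustrated with a toy model counter; here brute-force enumeration stands in for the BP marginal estimates, and the CNF encoding (signed integers per clause) is introduced purely for this sketch.

```python
from itertools import product

def count_models(clauses, n_vars):
    """Brute-force model count of a CNF over variables 1..n_vars.
    A clause is a list of signed ints, e.g. [1, -2] means (x1 OR NOT x2)."""
    return sum(
        all(any(assignment[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
        for assignment in product([False, True], repeat=n_vars)
    )

def marginal(clauses, n_vars, u):
    """Fraction of satisfying assignments with variable u set to True."""
    total = count_models(clauses, n_vars)
    return count_models(clauses + [[u]], n_vars) / total if total else 0.0

# If the marginal p of u is known accurately enough (BPCount estimates it
# via damped BP), the total count can be recovered from the count of the
# simpler formula "F with u=True" by scaling it up with 1/p.
```

For F = (x1 ∨ x2), the marginal of x1 is 2/3, and scaling the two models of “F with x1=True” by 3/2 recovers the true count of three.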
The basic idea of LBPCount now is to plug in lifted BP in place of BP. However, we have to be a bit more cautious: propositional variables can appear at any position in the clauses. This makes high compression rates unlikely because, for each cluster-node (set of propositional variables) and cluster-feature (set of clauses) combination, we carry a count for each position the cluster-node appears in within the cluster-feature. Fortunately, however, we deal with disjunctions only (assuming the formula F is in CNF). Propositional variables may appear negated or unnegated in the clauses, which is the only distinction we have to make. Therefore, we can safely assume two positions (negated, unnegated), and besides sorting the node color signatures we can now also sort the factor color signatures by position. Reconsider the example from Fig. 2 and assume that the potentials associated with f _{1},f _{2} encode disjunctions. Indeed, assuming B to be the first argument of f _{1} does not change the semantics of f _{1}. As our experimental results will show, this can result in huge compression rates and large efficiency gains.
We have implemented (L)BPCount based on SampleCount ^{5} using our (L)BP implementation. We ran BPCount and LBPCount on the circuit synthesis problem 2bitmax_6 with damping factor 0.5 and convergence threshold 10^{−8}. The formula has 192 variables, 766 clauses and a true count of 2.1×10^{29}. The resulting factor graph has 192 variable nodes, 766 factor nodes, and 1800 edges.
Unfortunately, such a significant efficiency gain is not always obtainable. We ran BPCount and LBPCount on the random 3-CNF wff-3-100-150. The formula has 100 variables, 150 clauses and a true count of 1.8×10^{21}. Both approaches again yield the same lower bound, which is in the same range as Kroc et al. report. The statistics of running (L)BPCount are shown in Fig. 5 (middle). Lifted BP is not able to compress the factor graph at all. In turn, it does not gain any efficiency but actually incurs a small overhead from trying to compress the factor graph and computing the counts.
In real-world domains, however, there is often a lot of redundancy. As a final experiment, we ran BPCount and LBPCount on the Latin square construction problem ls8-norm. The formula has 301 variables, 1601 clauses and a true count of 5.4×10^{11}. Again, we obtained similar estimates as Kroc et al. The statistics of running (L)BPCount are shown in Fig. 5 (right). In the first iteration, lifted BP sent only 0.6 % of the number of messages BP sent, i.e., 162 times fewer messages than BP. The results on model counting and on lifted inference in dynamic relational domains clearly affirm Q1 and show that lifted belief propagation can exploit symmetries and thus scale inference.
5 Scaling up inference: MapReduced lifting
Indeed, lifted belief propagation as introduced is an attractive avenue to scaling inference. The empirical evaluation has shown that lifting can render large, previously intractable probabilistic inference problems quickly solvable by employing symmetries to handle whole sets of indistinguishable random variables.
With the availability of affordable commodity hardware and high-performance networking, however, we have increasing access to computer clusters, providing an additional dimension along which to scale lifted inference. We now show that this is indeed the case. That is, we can distribute the inference and, in particular, the color-passing procedure for lifting message passing using the MapReduce framework (Dean and Ghemawat 2008).
The MapReduce programming model allows parallel processing of massive data sets. Inspired by the functional programming primitives map and reduce, it is basically divided into two steps, the Map-step and the Reduce-step. In the Map-step the input is taken, divided into smaller subproblems, and distributed to the worker nodes. These smaller subproblems are then solved by the nodes independently in parallel. Alternatively, the subproblems can be further distributed to be solved in a hierarchical fashion. In the subsequent Reduce-step, the outputs of all subproblems are collected and combined to form the output for the original problem.
 1.
Form the color-signatures
 2.
Group similar color-signatures
 3.
Assign a new color to each group
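These three steps map naturally onto a single MapReduce iteration. A minimal sketch, assuming each mapper holds one partition of the nodes; the function names and interfaces are illustrative, not the paper's implementation:

```python
from collections import defaultdict

def map_signatures(partition, colors, neighbors):
    """Map-step (step 1): emit a (signature, node) pair for every node in
    this worker's partition; the signature is the node's own color followed
    by the sorted colors of its neighbors."""
    for v in partition:
        yield (colors[v],) + tuple(sorted(colors[u] for u in neighbors[v])), v

def reduce_colors(pairs):
    """Steps 2 and 3: group nodes with equal signatures (the framework's
    shuffle would normally do this) and assign one fresh color per group."""
    groups = defaultdict(list)
    for sig, v in pairs:
        groups[sig].append(v)
    return {v: c for c, sig in enumerate(sorted(groups)) for v in groups[sig]}
```

On a chain a-b-c with identical initial colors, one round separates the middle node from the two ends.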
Indeed, although radix sort is very efficient (its sorting effort is on the order of the number of edges in the ground network), one may wonder whether it is actually well suited for an efficient implementation of lifting within a concrete MapReduce framework such as Hadoop. The Hadoop framework performs a sort between the Map- and the Reduce-step: the key-value pairs returned by the mappers have to be grouped and are then sent to the respective reducers for further processing. This sorting is realized within Hadoop using an instance of quicksort with complexity \(\mathcal{O}(n \log n)\). In our case, however, the bounded degree of the nodes in the graph limits the length of the signatures, so radix sort remains the algorithm of choice.
The Map-phase of (Step 2) could also be integrated into the parallel building of the signatures (Step 1), such that we have one single Map-step for building and sorting the signatures.
Taking the MapReduce arguments for (Step 1) to (Step 3) together proves that color-passing itself is MapReduceable. Together with the result of Gonzalez et al. (2009a, 2009b) that the modified belief propagation is MapReduceable, this proves the following theorem.
Theorem 2
Lifted belief propagation is MapReduceable.
Moreover, we have the following time complexity, which essentially results from running radix sort h times.
Theorem 3
The runtime complexity of the colorpassing algorithm with h iterations is \(\mathcal{O}(hm)\), where m is the number of edges in the graph.
Proof
Assume that every graph has n nodes (ground atoms) and m edges (ground atom appearances in ground clauses). Defining the signatures in step 1 for all nodes is an \(\mathcal {O}(m)\) operation. The elements of the signature of a factor are \(s(f) = ( s(X_{1}), s(X_{2}), \ldots, s(X_{d_{f}}))\), where X _{ i }∈nb(f), i=1,…,d _{ f }. Now there are two sorts to be carried out. The first sort is within the signatures: we have to sort the colors within a node's signature, and in the case where the position in the factor does not matter, we can safely sort the colors within the factor signatures while compressing the factor graph. Sorting the signatures is an \(\mathcal{O}(m)\) operation for all nodes. This efficiency can be achieved by using counting sort, an instance of bucket sort, due to the limited range of the elements of the signatures. The number of distinct colors is upper-bounded by n, which means that we can sort all signatures in \(\mathcal{O}(m)\) by the following procedure. We assign the elements of all signatures to their corresponding buckets, recording which signature they came from. By reading through all buckets in ascending order, we can then extract the sorted signatures for all nodes in the graph. The runtime is \(\mathcal {O}(m)\) as there are \(\mathcal{O}(m)\) elements in the signatures of the graph in each iteration. The second sort is that of the resulting signatures, to group similar nodes and factors. This sorting has time complexity \(\mathcal{O}(m)\) via radix sort. The label compression requires one pass over all signatures and their positions, that is, \(\mathcal{O}(m)\). Hence all these steps result in a total runtime of \(\mathcal{O}(hm)\) for h iterations. □
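The bucket-based sort within signatures described in the proof can be sketched as follows; this is a minimal sketch that assumes signatures are tuples of color indices from a bounded range 0..num_colors−1.

```python
def sort_within_signatures(signatures, num_colors):
    """Sort the colors inside every signature in O(m) total time: drop each
    element into its color bucket, remembering which signature it came from,
    then read the buckets back in ascending color order."""
    buckets = [[] for _ in range(num_colors)]
    for i, sig in enumerate(signatures):
        for c in sig:
            buckets[c].append(i)   # record the source signature
    sorted_sigs = [[] for _ in signatures]
    for c, bucket in enumerate(buckets):
        for i in bucket:
            sorted_sigs[i].append(c)
    return [tuple(s) for s in sorted_sigs]
```

The total work is proportional to the number of signature elements, i.e. \(\mathcal{O}(m)\), matching the proof's counting-sort argument.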
5.1 Evaluation: lifting with MapReduce
 (Q2)
Does the MapReduce lifting additionally improve scalability?
6 Scaling up training of relational models: lifted online training
When training a relational model on a given set of observations, however, the presence of evidence on the variables mostly destroys the symmetries. This renders lifted approaches virtually useless if the evidence is asymmetric. In the fully observed case, this may not be a major obstacle, since we can simply count how often a clause is true. Unfortunately, in many real-world domains, the mega-example available is incomplete, i.e., the truth values of some ground atoms may not be observed. For instance, in medical domains, a patient rarely gets all of the possible tests. In the presence of missing data, however, the maximum likelihood estimate typically cannot be written in closed form. It is a numerical optimization problem that typically involves nonlinear, iterative optimization and multiple calls to a relational inference engine as a subroutine.
Since efficient lifted inference is troublesome in the presence of partial evidence and most lifted approaches easily fall back to the ground case, we need to seek a way to make the learning task tractable. An appealing idea for efficiently training large models is to break the asymmetries, i.e., to divide the model into pieces that are trained independently and to exploit symmetries across multiple pieces for lifting.
6.1 Breaking asymmetries: piecewise shattering
In piecewise training, we decompose the mega-example and its corresponding factor graph into tractable but not necessarily disjoint subgraphs (or pieces) \(\mathcal{P} = \{p_{1}, \ldots, p_{k} \}\) that are trained independently (Sutton and McCallum 2009). Intuitively, the pieces turn the single mega-example into a set of many training examples and hence pave the way for online training. This is a reasonable idea since in many applications, the local information in each factor alone is already enough to do well at predicting the outputs. The parameters learned locally are then used to perform global inference on the whole model.
More formally, at training time, each piece from \(\mathcal{P} = \{p_{1}, \ldots, p_{k} \}\) has a local likelihood as if it were a separate graph, i.e., a separate training example, and the global likelihood is estimated by the sum over its pieces: \(\hat{\ell} (\theta, D) = \sum\nolimits_{p_{i} \in\mathcal{P}} \ell (\theta_{p_{i}}, D_{p_{i}})\). Here \(\theta_{p_{i}}\) denotes the parameter vector containing only the parameters appearing in piece p _{ i }, and \(D_{p_{i}}\) the evidence for the variables appearing in piece p _{ i }. The standard piecewise decomposition breaks the model into a separate piece for each factor. Intuitively, however, this discards dependencies between the model parameters when we decompose the mega-example into pieces. Although the piecewise model helps to significantly reduce the cost of training, the way we shatter the full model into pieces greatly affects the learning and lifting quality. Strong influences between variables might get broken. Consequently, we next propose a shattering approach that aims at keeping strong influences while still featuring lifting.
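For single-factor pieces with a log-linear potential, the surrogate objective can be written out explicitly. The following sketch is illustrative only; the feature functions and state spaces are toy placeholders, and real pieces would be factor subgraphs evaluated by inference.

```python
import math

def piece_log_likelihood(weight, feature, states, observed):
    """Local log-likelihood of a one-factor piece: weight * f(x_obs) minus
    the log partition function computed over this piece's states only."""
    log_z = math.log(sum(math.exp(weight * feature(s)) for s in states))
    return weight * feature(observed) - log_z

def piecewise_objective(pieces, weights):
    """Global surrogate: the sum of the pieces' local log-likelihoods,
    one term per piece (feature, states, observed)."""
    return sum(piece_log_likelihood(weights[i], f, states, obs)
               for i, (f, states, obs) in enumerate(pieces))
```

With all weights at zero, each binary piece contributes −log 2, i.e. the uniform-model log-likelihood, as expected.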
6.2 Breaking asymmetries: relational tree shattering
To overcome this, we now present a shattering approach that randomly grows piece patterns forming trees. Formally, a tree is defined as a set of factors such that for any two factors f _{1} and f _{ n } in the set, there exists one and only one ordering of (a subset of) factors in the set f _{1},f _{2},…,f _{ n } such that f _{ i } and f _{ i+1} share at least one variable, i.e. there are no loops. A tree of factors can then be generalized into a tree pattern, i.e., conjunctions of relational “clauses” by variabilizing their arguments. For every clause of the MLN we thus form a tree by performing a random walk rooted in one ground instance of that clause. This process can be viewed as a form of relational pathfinding (Richards and Mooney 1992).
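A minimal sketch of growing one tree piece by a random walk rooted in a ground factor; the `neighbors_of` function (returning factors that share a variable) and the breadth-limited walk are illustrative assumptions. A full implementation would additionally verify that a newly added factor does not close a loop through a shared variable.

```python
import random

def grow_tree_piece(root_factor, neighbors_of, depth, rng=random):
    """Random-walk tree growing: starting from the root factor, repeatedly
    pick one unvisited neighboring factor per frontier factor, up to the
    given depth."""
    tree = [root_factor]
    frontier = [root_factor]
    for _ in range(depth):
        new_frontier = []
        for f in frontier:
            candidates = [g for g in neighbors_of(f) if g not in tree]
            if candidates:
                g = rng.choice(candidates)
                tree.append(g)
                new_frontier.append(g)
        frontier = new_frontier
    return tree
```

Generalizing such a tree into a tree pattern then amounts to variabilizing the arguments of its factors' clauses.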
English | First-Order Logic | Weight
Only one person can be promoted | Promotion(x) ⇔ !Promotion(y) | 2.0
A promotion comes with an increased income | Promotion(x) ⇒ HighIncome(x) | 1.5
 | Promotion(x) ⇒ !HighIncome(y) | 1.1
Now, we show how to turn this upper bound into a lifted online training for relational models.
6.3 Lifted online training via stochastic metadescent
Stochastic gradient descent algorithms update the weight vector in an online setting. We essentially assume that the pieces are given one at a time. The algorithms examine the current piece and then update the parameter vector accordingly. They often scale sublinearly with the amount of training data, making them very attractive for large training sets as targeted by statistical relational learning. To reduce variance, we may form mini-batches consisting of several pieces on which we learn the parameters locally. In contrast to the propositional case, however, mini-batches have another important advantage: we can now make use of the symmetries within and across pieces for lifting. The pieces within a mini-batch can be seen as disconnected parts of a larger network. If we now perform inference for the whole mini-batch, we naturally exploit the symmetries within each piece and across the pieces in the mini-batch.
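The resulting training loop is a plain mini-batch stochastic gradient ascent over pieces; in the lifted variant, the single inference call per mini-batch runs on the compressed (lifted) batch graph. The `gradient` argument here is a placeholder for that inference call, and the whole function is a sketch rather than the paper's SMD implementation.

```python
def sgd_on_pieces(pieces, theta, gradient, eta=0.05, batch_size=10):
    """One pass of stochastic gradient ascent over mini-batches of pieces.
    gradient(theta, batch) returns the log-likelihood gradient computed by
    (lifted) inference on the batch."""
    for start in range(0, len(pieces), batch_size):
        batch = pieces[start:start + batch_size]
        g = gradient(theta, batch)          # one inference call per batch
        theta = [t + eta * gi for t, gi in zip(theta, g)]
    return theta
```

Note that parameters are updated after every mini-batch, which is why learning can converge long before a full pass over the mega-example.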
7 Scaling up training of relational models: MapReduced stochastic gradient
So far, we have shown how the model can be shattered into smaller pieces to learn the parameters efficiently. This shattering makes training large models tractable and improves the speed of convergence, as we will show in the experimental section. Even more importantly, since the lifted pieces are processed one after the other, it naturally paves the way for a MapReduce approach. The gradients of the shattered pieces can be computed locally in a distributed fashion, which in turn allows a MapReduce-friendly parallel approach without bandwidth constraints or considerable latency; see, e.g., Zinkevich et al. (2010), Langford et al. (2009). This proves the following theorem.
Theorem 4
Lifted approximate training of relational models is MapReduceable.
Now, we have everything together to investigate scalable lifted inference and training.
8 Scalable lifted inference and training: experimental evaluation
 (Q3)
Does piecewise lifted inference help in nonsymmetric cases?
 (Q4)
Can we efficiently train relational models using stochastic gradients?
 (Q5)
Are there symmetries within minibatches that result in lifting?
 (Q6)
Can relational treefinding produce pieces that balance accuracy and lifting well?
 (Q7)
Is it even possible to achieve onepass relational training?
To this aim, we implemented lifted online learning for relational models in Python and C++. As a batch learning reference, we used scaled conjugate gradient (SCG) (Møller 1993). SCG chooses the search direction and the step size by using information from the second-order approximation. For inference we used our lifted belief propagation (LBP) (Ahmadi et al. 2011; Kersting et al. 2009) implementation. Inference as a subroutine for the training methods was also carried out by LBP, run to convergence with a threshold of 10^{−8} and a maximum of 1000 iterations.
8.1 Lifted piecewise inference (Q3)
8.2 Training of friendsandsmokers MLN (Q4, Q5)
As one can see, the lifted SMD using single-factor pieces has a steep learning curve and has already learned the parameters before seeing the mega-example even once (indicated by the dashed vertical line). Note that we learned the models without a stopping criterion and for a fixed number of passes over the data; thus the CMLL on the test data can decrease. SCG, on the other hand, requires four passes over the entire training data to achieve a similar result in terms of CMLL. Thus Q4 can be answered affirmatively. Moreover, as Fig. 12 (right) shows, piecewise learning greatly increases the lifting compared to batch learning, which essentially features no lifting at all. Thus, Q5 can be answered affirmatively.
8.3 Voting MLN (Q5)
8.4 Training of CORA entity resolution MLN (Q6, Q7)
Here we learned the parameters for the Cora entity resolution MLN,^{15} one of the standard datasets for relational learning. In the current paper, however, it is used in a non-standard, more challenging setting. For a set of bibliography entries (papers), the Cora MLN has facts, e.g., about word appearances in the titles and in author names, the venue a paper appeared in, its title, etc. The task is now to infer whether two entries in the bibliography denote the same paper (predicate sameBib), two venues are the same (sameVenue), two titles are the same (sameTitle), and whether two authors are the same (sameAuthor). We sampled 20 bibliography entries and extracted all facts corresponding to these bibliography entries. We constructed five folds, then trained on four folds and tested on the fifth. The mega-example E is composed of the four folds we train on. We employed a transductive learning setting for this task. The MLN was parsed with all facts for the bibliography entries from the five folds, i.e., the queries were hidden for the test fold. The query consisted of all four predicates (sameAuthor, sameBib, sameTitle, sameVenue). The resulting ground network consisted of 36,390 factors and 11,181 variables. We learned the parameters using SCG and lifted stochastic meta-descent with standard pieces as well as with pieces obtained by relational tree-finding with a depth d of 1 and a threshold t of 0.9. The trees consisted of around ten factors on average, so we updated with a batch size of 100 for the trees and 1000 for standard pieces, with a step size of 0.05. Furthermore, the other parameters were chosen to be λ=0.99, μ=0.9, and γ=0.9. Figure 13 (right) shows the averaged learning results for this entity resolution task. Again, online training does not need to see the whole mega-example; it has learned long before finishing one pass over the entire data. Thus, (Q6) can be answered affirmatively.
Moreover, Fig. 13 also shows that by building tree pieces one can considerably speed up the learning process. They convey a lot of additional information, so that one obtains a better solution from a smaller amount of data. This is due to the fact that the Cora dataset contains many strong dependencies, which are all broken if we form one piece per factor. The trees, on the other hand, preserve parts of the local structure, which significantly helps during learning. Thus, (Q7) can be answered affirmatively.
8.5 Lifted imitation learning in the Wumpus domain (Q6, Q7)
To further investigate (Q6) and (Q7), we considered imitation learning in a relational domain modeled as a partially observable Markov decision process (POMDP). We created a simple version of the Wumpus task (Russell and Norvig 2003) where the location of the Wumpus is partially observed. We used a 5×5 grid with a Wumpus placed at a random location in every training trajectory. The Wumpus is always surrounded by stench on all four sides. There are no pits or breezes in our task. The agent can perform 8 possible actions: 4 move actions and 4 shoot actions, one in each direction. The agent's task is to move to a cell from which it can fire an arrow to kill the Wumpus. The Wumpus is not observed in all trajectories, although the stench is always observed. Trajectories were created by real human users playing the game.
The cells of the 5×5 grid were numbered, and we use predicates like cellAtRow(cell, row) and cellAbove(cell, cell) to define the structure of the grid and where each cell is located. These facts were always given. Other predicates were wumpus(cell), stench(cell), agent(cell,t) and the move/shoot actions for all four directions, e.g. shootUp(t). The rules we learn weights for describe the state or whether an action should be performed. Two examples of such rules are:
w _{1}: stench(scell) ∧ cellAbove(scell,wcell) ⇒ wumpus(wcell)
w _{2}: wumpus(wcell) ∧ agent(acell,t) ∧ cellCol(acell,acol) ∧ cellCol(wcell,wcol) ∧ less(acol,wcol) ⇒ shootRight(t)
Taking all experimental results together, all questions Q1–Q7 can be clearly answered affirmatively.
9 Conclusions
Symmetries can be found almost everywhere, in arabesques and French gardens, in the rose windows and vaults in Gothic cathedrals, in the meter, rhythm, and melody of music, in the metrical and rhyme schemes of poetry as well as in the patterns of steps when dancing. Symmetric faces are even said to be more beautiful to humans. Actually, symmetry is both a conceptual and a perceptual notion often associated with beautyrelated judgments (Zaidel and Hessamian 2010). Or, to quote Hermann Weyl “Beauty is bound up with symmetry” (Weyl 1952). This link between symmetry and beauty is often made by scientists. In physics, for instance, symmetry is linked to beauty in that symmetry describes the invariants of nature, which, if discerned, could reveal the fundamental, true physical reality (Zaidel and Hessamian 2010). In mathematics, as Herr and Bödi note, “we expect objects with many symmetries to be uniform and regular, thus not too complicated” (Herr and Bödi 2010). Therefore, it is not surprising that symmetries have also been explored in many AI tasks. For instance, there are symmetryaware approaches in (mixed)integer programming (Bödi et al. 2011; Margot 2010), linear programming (Herr and Bödi 2010; Mladenov et al. 2012), SAT and CSP (Sellmann and Van Hentenryck 2005) as well as MDPs (Dean and Givan 1997; Ravindran and Barto 2001).
Surprisingly, however, symmetries have not been the subject of great interest within statistical learning. In this paper, we have shown that scaling inference and relational training of graphical models can actually greatly benefit from symmetries. We have introduced lifted belief propagation and shown that lifting is MapReduceable. However, already in 1848, Louis Pasteur recognized that “Life as manifested to us is a function of the asymmetry of the universe”. This remark somehow characterizes one of the main challenges we are facing: not only are almost all large graphs asymmetric (Erdös and Rényi 1963), but even if there are symmetries within a model, they easily break when it comes to inference and training, since variables become correlated by virtue of depending asymmetrically on evidence. This, however, does not mean that lifted inference and training are hopeless. We have demonstrated that breaking long-range dependencies via piecewise inference and training naturally breaks asymmetries and paves the way to lifted online and MapReduced relational training, respectively.
The symmetry-aware framework for learning outlined in the present paper puts many interesting research goals into reach. For instance, one should tackle one-pass relational learning by investigating different ways of gain adaptation and scheduling of pieces for updates. Since piecewise training is a simple form of dual decomposition, further exploration of dual decomposition methods is an attractive future direction. One should also investigate budget constraints on both the number of examples and the computation time per iteration. Another interesting avenue for future work is to use sequences of increasingly finer approximations to control the tradeoff between lifting and accuracy (Kiddon and Domingos 2011). Besides belief propagation, lifted message passing approaches have been introduced for Gaussian belief propagation (Ahmadi et al. 2011) as well as warning propagation and survey propagation (Hadiji et al. 2011). The definition of an abstract lifted message passing framework that unifies these and other message passing algorithms remains to be done and is very interesting future work. Finally, one should start investigating symmetries in general machine learning approaches such as support vector machines and Gaussian processes. So, while there have been considerable advances, there are more than enough problems, in particular asymmetric ones, to go around to really establish symmetry-aware machine learning.
Footnotes
 1.
The partitioning of the nodes obtained by color-passing corresponds to the so-called coarsest equitable partition of the graph (Mladenov et al. 2012). However, a formal characterization of the symmetries is beyond the scope of the current paper.
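Color-passing itself is a simple fixed-point computation: each node's color is repeatedly refined by the multiset of its neighbors' colors until the partition stabilizes. The following minimal Python sketch runs this refinement on a plain undirected graph; the adjacency-dict layout is a simplification for illustration (the paper runs the same idea on factor graphs, with evidence determining the initial colors).

```python
from collections import Counter

def color_passing(adj, init_colors):
    """Iteratively refine node colors until the partition stabilizes.

    adj:         dict mapping node -> list of neighbor nodes
    init_colors: dict mapping node -> initial color (e.g. evidence)
    Returns a stable coloring whose color classes form the coarsest
    equitable partition refining the initial colors.
    """
    colors = dict(init_colors)
    while True:
        # New signature: own color plus the multiset of neighbor colors.
        sigs = {v: (colors[v],
                    tuple(sorted(Counter(colors[u] for u in adj[v]).items())))
                for v in adj}
        # Compress signatures back to small integer color ids.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        # Refinement only splits classes; stop when no class splits.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
```

For example, on a star graph with uniform initial colors, the leaves end up sharing one color and the center gets another, so belief propagation messages need only be computed once per color class.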
 2.
Note that the variables have been grouped according to evidence and their local structure. Thus all factors within a cluster-factor are indistinguishable, and we can set the states of the whole cluster-factor \({\mathfrak {f}}\) at once.
 3.
 4.
 5.
 6.
Note that this is not necessary in a shared-memory setting. Here we use the MapReduce framework, and thus we have to introduce copy-nodes.
 7.
Note that how we partition the model greatly affects the efficiency of the lifting. Finding an optimal partitioning that balances communication cost and CPU load, however, is beyond the scope of this paper. In general, the partitioning problem for parallelism is well-studied (Chamberlain 1998), and there are efficient tools, e.g. http://glaros.dtc.umn.edu/gkhome/metis/parmetis/. We show that color-passing is MapReduceable and dramatically improves scalability even with a naive partitioning.
 8.
 9.
Since this is essentially the output of the previous step that is passed through, the two steps can easily be integrated. We keep them separate to illustrate the two distinct steps of color-passing.
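In the MapReduce view, one round of color-passing is exactly a grouping of nodes by their color signatures, which is where the link to (radix) sorting comes in: the shuffle phase groups equal signatures, and the reduce phase assigns one fresh color per group. The sketch below simulates a single such round locally; the concrete pair layout of a signature is our own assumption for illustration.

```python
from collections import defaultdict

def mapreduce_color_round(signatures):
    """Simulate one map/sort/reduce round of color-passing.

    signatures: iterable of (node, signature) pairs, where a signature
    encodes a node's current color together with its sorted neighbor
    colors (layout assumed for illustration).
    """
    # Map + shuffle: group nodes carrying identical signatures.
    groups = defaultdict(list)
    for node, sig in signatures:
        groups[sig].append(node)
    # Reduce: assign one fresh color per signature group.
    return {node: color
            for color, (sig, nodes) in enumerate(sorted(groups.items()))
            for node in nodes}
```

In an actual MapReduce job, the `groups` dict is replaced by the framework's sort-based shuffle, which is why lifting inherits the linear-time behavior of radix sort on the signature keys.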
 10.
 11.
 12.
We develop our lifted training approach within the framework of Markov logic networks for illustration purposes only. We would like to stress that it naturally carries over to other relational frameworks.
 13.
In fact, grounding the MLN would lead to more factors; for illustration purposes, however, we assume that, e.g., "\(\mathtt {Promotion(Anna)\Leftrightarrow\ !Promotion(Bob)}\)" and "\(\mathtt {Promotion(Bob) \Leftrightarrow\ !Promotion(Anna)}\)" are simplified to a single factor.
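To make the simplification concrete: grounding instantiates a first-order formula with all (here: distinct) constant bindings, yielding one ground factor per binding. The following sketch is purely illustrative; the string-template syntax and the `ground_formula` helper are our own assumptions, not MLN syntax from the paper.

```python
from itertools import product

def ground_formula(template, variables, constants):
    """Enumerate the ground instances of a first-order formula template.

    Each binding of the logical variables to distinct constants yields
    one ground factor of the induced Markov network.
    """
    groundings = []
    for binding in product(constants, repeat=len(variables)):
        if len(set(binding)) < len(binding):
            continue  # skip bindings that reuse a constant
        groundings.append(template.format(**dict(zip(variables, binding))))
    return groundings
```

For the footnote's example, grounding over the constants Anna and Bob produces two ground factors that are logically equivalent and can hence be collapsed into a single factor.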
 14.
 15.
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments. Thanks to Roman Garnett, Fabian Hadiji, and Mirwaes Wahabzada for helpful discussions on MapReduce, and to Tushar Khot for kindly providing the Wumpus dataset. The authors also would like to thank Tuyen N. Huynh and Raymond J. Mooney for communicating their idea and code on online discriminative learning. Babak Ahmadi and Kristian Kersting were supported by the Fraunhofer ATTRACT fellowship STREAM and by the European Commission under contract number FP7-248258-FirstMM. Martin Mladenov and Kristian Kersting were supported by the German Research Foundation (DFG), KE 1686/2-1, within the SPP 1527. Sriraam Natarajan gratefully acknowledges the support of the DARPA Machine Reading Program under AFRL prime contract no. FA8750-09-C-0181. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the view of DARPA, AFRL, or the US government.
References
 Acar, U., Ihler, A., Mettu, R., & Sumer, O. (2008). Adaptive inference on general graphical models. In UAI-08. Corvallis, Oregon: AUAI Press.
 Ahmadi, B., Kersting, K., & Sanner, S. (2011). Multi-evidence lifted message passing, with application to PageRank and the Kalman filter. In IJCAI.
 Ahmadi, B., Kersting, K., & Natarajan, S. (2012). Lifted online training of relational models with stochastic gradient methods. In Proceedings of ECML-PKDD, Bristol, UK, September 24–28. Berlin: Springer.
 Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
 Besag, J. (1975). Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D (The Statistician), 24(3), 179–195.
 Bödi, R., Herr, K., & Joswig, M. (2011). Algorithms for highly symmetric linear and integer programs. Mathematical Programming, Series A. Online First.
 Boyen, X., & Koller, D. (1998). Tractable inference for complex stochastic processes. In Proc. of the conf. on uncertainty in artificial intelligence (UAI-98) (pp. 33–42).
 Bui, H. H., Huynh, T., & de Salvo Braz, R. (2012). Exact lifted inference with distinct soft evidence on every object. In Proc. of the 26th AAAI conf. on artificial intelligence (AAAI 2012).
 Bui, H. H., Huynh, T. N., & Riedel, S. (2012). Automorphism groups of graphical models and lifted variational inference. CoRR.
 Chamberlain, B. L. (1998). Graph partitioning algorithms for distributing workloads of parallel computations (Technical report).
 Choi, J., Hill, D., & Amir, E. (2010). Lifted inference for relational continuous models. In UAI.
 Choi, J., de Salvo Braz, R., & Bui, H. (2011). Efficient methods for lifted inference with aggregate factors. In AAAI 2011.
 Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms (2nd ed.). Cambridge: MIT Press.
 Darwiche, A. (2001). Recursive conditioning. Artificial Intelligence, 126(1–2), 5–41.
 De Raedt, L., Frasconi, P., Kersting, K., & Muggleton, S. H. (Eds.) (2008). Lecture notes in computer science: Vol. 4911. Probabilistic inductive logic programming. Berlin: Springer.
 de Salvo Braz, R., Amir, E., & Roth, D. (2005). Lifted first-order probabilistic inference. In Proc. of the 19th international joint conference on artificial intelligence (IJCAI-05) (pp. 1319–1325).
 de Salvo Braz, R., Amir, E., & Roth, D. (2006). MPE and partial inversion in lifted probabilistic variable elimination. In Proc. of the 21st AAAI conf. on artificial intelligence (AAAI-06).
 de Salvo Braz, R., Natarajan, S., Bui, H., Shavlik, J., & Russell, S. (2009). Anytime lifted belief propagation. In Proc. SRL-09.
 Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
 Dean, T., & Givan, R. (1997). Model minimization in Markov decision processes. In Proc. of the fourteenth national conf. on artificial intelligence (AAAI-97) (pp. 106–111).
 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39.
 Erdős, P., & Rényi, A. (1963). Asymmetric graphs. Acta Mathematica Academiae Scientiarum Hungaricae, 14, 295–315.
 Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
 Getoor, L., & Taskar, B. (2007). Introduction to statistical relational learning. Cambridge: MIT Press.
 Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2002). Learning probabilistic models of link structure. Journal of Machine Learning Research, 3, 679–707.
 Gogate, V., & Domingos, P. (2010). Exploiting logical structure in lifted probabilistic inference. In Working notes of the AAAI-10 workshop on statistical relational artificial intelligence.
 Gogate, V., & Domingos, P. (2011). Probabilistic theorem proving. In Proc. of the 27th conf. on uncertainty in artificial intelligence (UAI).
 Gonzalez, J. E., Low, Y., & Guestrin, C. (2009b). Residual splash for optimally parallelizing belief propagation. In Artificial intelligence and statistics (AISTATS) (pp. 177–184).
 Gonzalez, J., Low, Y., Guestrin, C., & O'Hallaron, D. (2009a). Distributed parallel inference on large factor graphs. In UAI, Montreal, Canada, July 2009.
 Gutmann, B., Thon, I., & De Raedt, L. (2011). Learning the parameters of probabilistic logic programs from interpretations. In ECML-PKDD (pp. 581–596).
 Hadiji, F., Ahmadi, B., & Kersting, K. (2011). Efficient sequential clamping for lifted message passing. In Proceedings of the 34th annual German conference on artificial intelligence (KI-11). Berlin: Springer.
 Herr, K., & Bödi, R. (2010). Symmetries in linear and integer programs. CoRR, abs/0908.3329.
 Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14.
 Huynh, T., & Mooney, R. (2011). Online max-margin weight learning for Markov logic networks. In SDM.
 Ihler, A. T., Fisher, J. W. III, & Willsky, A. S. (2005). Loopy belief propagation: convergence and effects of message errors. Journal of Machine Learning Research, 6, 905–936.
 Jaeger, M. (2007). Parameter learning for relational Bayesian networks. In ICML.
 Jaimovich, A., Meshi, O., & Friedman, N. (2007). Template-based inference in symmetric relational Markov random fields. In Proc. of the conf. on uncertainty in artificial intelligence (UAI-07) (pp. 191–199).
 Kersting, K. (2012). Lifted probabilistic inference. In L. De Raedt, C. Bessiere, D. Dubois, P. Doherty, P. Frasconi, F. Heintz, & P. Lucas (Eds.), Proceedings of the 20th European conference on artificial intelligence (ECAI-2012), Montpellier, France, August 27–31, 2012. Amsterdam: ECCAI IOS Press. (Invited talk at the frontiers of AI track.)
 Kersting, K., & De Raedt, L. (2001). Adaptive Bayesian logic programs. In ILP.
 Kersting, K., Ahmadi, B., & Natarajan, S. (2009). Counting belief propagation. In UAI, Montreal, Canada.
 Khot, T., Natarajan, S., Kersting, K., & Shavlik, J. (2011). Learning Markov logic networks via functional gradient boosting. In ICDM.
 Kiddon, C., & Domingos, P. (2011). Coarse-to-fine inference and learning for first-order probabilistic models. In Proc. of the 25th AAAI conf. on artificial intelligence (AAAI 2011).
 Kisynski, J., & Poole, D. (2009). Constraint processing in lifted probabilistic inference. In UAI (pp. 293–302).
 Kok, S., & Domingos, P. (2009). Learning Markov logic network structure via hypergraph lifting. In ICML.
 Kok, S., & Domingos, P. (2010). Learning Markov logic networks using structural motifs. In ICML.
 Kroc, L., Sabharwal, A., & Selman, B. (2008). Leveraging belief propagation, backtrack search, and statistics for model counting. In Proc. of the 5th int. conf. on the integration of AI and OR techniques in constraint programming for combinatorial optimization problems (CPAIOR-08) (pp. 127–141).
 Langford, J., Smola, A. J., & Zinkevich, M. (2009). Slow learners are fast. In NIPS (pp. 2331–2339).
 Le Roux, N., Manzagol, P.-A., & Bengio, Y. (2007). Topmoumoute online natural gradient algorithm. In NIPS.
 Lee, S.-I., Ganapathi, V., & Koller, D. (2007). Efficient structure learning of Markov networks using L1-regularization. In NIPS.
 Lowd, D., & Domingos, P. (2007). Efficient weight learning for Markov logic networks. In Proceedings of the eleventh European conference on principles and practice of knowledge discovery in databases.
 Margot, F. (2010). Symmetry in integer linear programming. In M. Jünger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, & L. A. Wolsey (Eds.), 50 years of integer programming 1958–2008: from the early years to the state-of-the-art (pp. 1–40). Berlin: Springer.
 Mihalkova, L., Huynh, T. N., & Mooney, R. J. (2007). Mapping and revising Markov logic networks for transfer learning. In AAAI (pp. 608–614).
 Milch, B., & Russell, S. J. (2006). General-purpose MCMC inference over relational structures. In Proc. of the 22nd conf. in uncertainty in artificial intelligence (UAI-2006).
 Milch, B., Zettlemoyer, L., Kersting, K., Haimes, M., & Pack Kaelbling, L. (2008). Lifted probabilistic inference with counting formulas. In Proc. of the 23rd AAAI conf. on artificial intelligence (AAAI-08) (pp. 13–17).
 Mladenov, M., Ahmadi, B., & Kersting, K. (2012). Lifted linear programming. In JMLR: W&CP: Vol. 22. 15th int. conf. on artificial intelligence and statistics (AISTATS 2012) (pp. 788–797).
 Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.
 Murphy, K. P., & Weiss, Y. (2001). The factored frontier algorithm for approximate inference in DBNs. In Proc. of the conf. on uncertainty in artificial intelligence (UAI-01) (pp. 378–385).
 Murphy, K. P., Weiss, Y., & Jordan, M. I. (1999). Loopy belief propagation for approximate inference: an empirical study. In Proc. of the conf. on uncertainty in artificial intelligence (UAI-99) (pp. 467–475).
 Natarajan, S., Tadepalli, P., Dietterich, T. G., & Fern, A. (2009). Learning first-order probabilistic models with combining rules. Annals of Mathematics and AI.
 Natarajan, S., Khot, T., Kersting, K., Gutmann, B., & Shavlik, J. (2012). Gradient-based boosting for statistical relational learning: the relational dependency network case. Machine Learning.
 Nath, A., & Domingos, P. (2010). Efficient lifting for online probabilistic inference. In AAAI.
 Niepert, M. (2012). Markov chains on orbits of permutation groups. In Proc. of the 28th conf. on uncertainty in artificial intelligence (UAI).
 Niu, F., Ré, C., Doan, A., & Shavlik, J. (2011). Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS. Proceedings of the VLDB Endowment, 4(6), 373–384.
 Pearl, J. (1991). Probabilistic reasoning in intelligent systems: networks of plausible inference (2nd ed.). San Mateo: Morgan Kaufmann.
 Poole, D. (2003). First-order probabilistic inference. In Proc. of the 18th international joint conference on artificial intelligence (IJCAI-03) (pp. 985–991).
 Poole, D., Bacchus, F., & Kisyński, J. (2011). Towards completely lifted search-based probabilistic inference. CoRR, abs/1107.4035.
 Poon, H., & Domingos, P. (2008). Joint unsupervised coreference resolution with Markov logic. In Proceedings of the conference on empirical methods in natural language processing.
 Ravindran, B., & Barto, A. G. (2001). Symmetries and model minimization in Markov decision processes (Technical Report 01-43). University of Massachusetts, Amherst, MA, USA.
 Richards, B. L., & Mooney, R. J. (1992). Learning relations by pathfinding. In AAAI.
 Richardson, M., & Domingos, P. (2006). Markov logic networks. Machine Learning, 62(1–2), 107–136.
 Rosenblatt, F. (1962). Principles of neurodynamics: perceptrons and the theory of brain mechanisms. New York: Spartan.
 Russell, S. J., & Norvig, P. (2003). Artificial intelligence: a modern approach. Upper Saddle River: Pearson Education.
 Sato, T., & Kameya, Y. (2001). Parameter learning of logic programs for symbolic-statistical modeling. The Journal of Artificial Intelligence Research, 15, 391–454.
 Schraudolph, N., & Graepel, T. (2003). Combining conjugate direction methods with stochastic approximation of gradients. In AISTATS (pp. 7–13).
 Sellmann, M., & Van Hentenryck, P. (2005). Structural symmetry breaking. In Proc. of the 19th international joint conf. on artificial intelligence (IJCAI-05).
 Sen, P., Deshpande, A., & Getoor, L. (2008). Exploiting shared correlations in probabilistic databases. In Proc. of the intern. conf. on very large data bases (VLDB-08).
 Singla, P., & Domingos, P. (2008). Lifted first-order belief propagation. In Proc. of the 23rd AAAI conf. on artificial intelligence (AAAI-08), Chicago, IL, USA, July 13–17, 2008 (pp. 1094–1099).
 Sutton, C., & McCallum, A. (2009). Piecewise training for structured prediction. Machine Learning, 77(2–3), 165–194.
 Taghipour, N., Fierens, D., Davis, J., & Blockeel, H. (2012). Lifted variable elimination with arbitrary constraints. In JMLR: workshop and conference proceedings (Vol. 22, pp. 1194–1202).
 Thon, I., Landwehr, N., & De Raedt, L. (2011). Stochastic relational processes: efficient inference and applications. Machine Learning, 82(2), 239–272.
 Van den Broeck, G., & Davis, J. (2012). Conditioning in first-order knowledge compilation and lifted probabilistic inference. In Proc. of the 26th AAAI conf. on artificial intelligence (AAAI-2012).
 Van den Broeck, G., Taghipour, N., Meert, W., Davis, J., & De Raedt, L. (2011). Lifted probabilistic inference by first-order knowledge compilation. In Proc. of the 22nd int. joint conf. on artificial intelligence (IJCAI) (pp. 2178–2185).
 Van den Broeck, G., Choi, A., & Darwiche, A. (2012). Lifted relax, compensate and then recover: from approximate to exact lifted probabilistic inference. In UAI (pp. 131–141).
 Vishwanathan, S. V. N., Schraudolph, N. N., Schmidt, M. W., & Murphy, K. P. (2006). Accelerated training of conditional random fields with stochastic gradient methods. In ICML (pp. 969–976).
 Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In UAI (pp. 536–543).
 Weyl, H. (1952). Symmetry. Princeton: Princeton University Press.
 Winkler, G. (1995). Image analysis, random fields and dynamic Monte Carlo methods. Berlin: Springer.
 Zaidel, D. W., & Hessamian, M. (2010). Asymmetry and symmetry in the beauty of human faces. Symmetry, 2, 136–149.
 Zettlemoyer, L. S., Pasula, H. M., & Kaelbling, L. P. (2007). Logical particle filtering. In Proc. of the Dagstuhl seminar on probabilistic, logical, and relational learning.
 Zhu, S., Xiao, Z., Chen, H., Chen, R., Zhang, W., & Zang, B. (2009). Evaluating SPLASH-2 applications using MapReduce. In Proceedings of the 8th international symposium on advanced parallel processing technologies (APPT '09) (pp. 452–464). Berlin: Springer.
 Zinkevich, M. A., Smola, A., Weimer, M., & Li, L. (2010). Parallelized stochastic gradient descent. In Advances in neural information processing systems (Vol. 23, pp. 2595–2603).