kProbLog: an algebraic Prolog for machine learning
Abstract
We introduce kProbLog as a declarative logical language for machine learning. kProbLog is a simple algebraic extension of Prolog with facts and rules annotated by semiring labels. It allows one to elegantly combine algebraic expressions with logic programs. We introduce the semantics of kProbLog, its inference algorithm, its implementation and provide convergence guarantees. We provide several code examples to illustrate its potential for a wide range of machine learning techniques. In particular, we show the encodings of state-of-the-art graph kernels such as Weisfeiler-Lehman graph kernels, propagation kernels and an instance of graph invariant kernels, a recent framework for graph kernels with continuous attributes. However, kProbLog is not limited to kernel methods: it can concisely express declarative formulations of tensor-based algorithms such as matrix factorization and energy-based models, and it can exploit semirings of dual numbers to perform algorithmic differentiation. Furthermore, experiments show that kProbLog is not only of theoretical interest, but can also be applied to real-world datasets. At the technical level, kProbLog extends aProbLog (an algebraic Prolog) by allowing multiple semirings to coexist in a single program and by introducing metafunctions for manipulating algebraic values.
Keywords
Algebraic Prolog · Kernel programming · Graph kernels · Machine learning
1 Introduction
The field of logical and relational learning already has a long tradition, cf. (Sammut 1993; De Raedt 2008; Muggleton et al. 2012). In the ’80s and ’90s, the goal of this field was to use purely logical and relational representations within machine learning and, in this way, provide more expressive representations that allow complex datasets and background knowledge to be represented. The key challenge at the time was to tightly integrate these representations with the symbolic machine learning methods that were then popular, such as rule learning and decision trees (Van Laer and De Raedt 2001). But the field of machine learning has evolved and broadened; in the last two decades it has focused more on statistical and probabilistic approaches, on kernel methods and support vector machines, and on neural networks. These trends in machine learning have inspired logical and relational learning researchers to extend their goals and to investigate how logical and relational learning principles can be exploited within probabilistic methods, kernel methods, and neural networks.
This is best illustrated by the success of statistical relational learning and probabilistic programming (De Raedt et al. 2016; Getoor and Taskar 2007), which combine logical and relational learning and programming with probabilistic graphical models. Today there exist many frameworks and formalisms that tightly integrate these two paradigms; they support probabilistic and logical inference as well as learning. Prominent examples include PRISM (Sato and Kameya 1997), Dyna (Eisner et al. 2004; Eisner and Filardo 2011), Markov Logic (Richardson and Domingos 2006), BLOG (Milch et al. 2005), and ProbLog (De Raedt et al. 2007). Statistical relational learning and probabilistic programming have enabled an entirely new generation of applications.
While there has been a lot of research on integrating probabilistic and logical reasoning, the combination of kernel-based methods with logic has been much less investigated, with the notable exceptions of kLog (Frasconi et al. 2014), kFOIL (Landwehr et al. 2006) and Gärtner et al.'s work (Gärtner et al. 2003, 2004). kLog is a relational language for specifying kernel-based learning problems. It produces a graph representation of a relational learning problem in the spirit of knowledge-based model construction and then employs a graph kernel on the resulting representation. kFOIL is a variation on the rule learner FOIL (Quinlan 1990) that can learn kernels defined as the number of clauses that succeed in both interpretations. Gärtner et al. developed kernels within a typed higher-order language and used them on some inductive logic programming benchmarks.
As for neural networks, there is a stream of research that combines neural with logical and symbolic representations, often referred to as neural-symbolic learning and reasoning (Garcez et al. 2015, 2008).
This research on probabilistic models, kernel-based methods and neural networks shows that it is important for logical and relational learning to integrate its principles and techniques with those of other schools in machine learning. Furthermore, the power of logical and relational learning lies not only in the expressiveness of the logical and relational representations but also in their declarativeness. Indeed, it has been repeatedly argued that logical and relational learning allows one to declaratively specify and solve problems by specifying background knowledge and declarative bias (De Raedt 2008; Muggleton et al. 2012). This property of logical and relational learning has turned out to be essential for many successes in applications, as making small changes to the background knowledge or bias allows one to easily control the learning algorithm. While in the above-mentioned probabilistic, kernel-based and neural approaches to logical and relational learning it is typically possible to tune the logical and relational part in a declarative way, the probabilistic, kernel or neural components are typically built-in and hard-coded into the underlying formalisms and are very hard to modify. For instance, kLog was designed to allow different graph kernels to be plugged in, but support for declaratively specifying the kernel is missing. Standard probabilistic programming languages such as PRISM and ProbLog have clear and fixed semantics (the distribution semantics) that cannot be changed. These limitations have motivated the development of algebraic logical languages such as Dyna (Eisner et al. 2004; Eisner and Filardo 2011) and aProbLog (Kimmig et al. 2011). While standard probabilistic programming languages such as PRISM and ProbLog label facts with probabilities, Dyna and aProbLog use algebraic labels belonging to a semiring, which allows the use of algebraic structures other than the probabilistic semiring on top of the underlying logic programs.
Dyna has been used to encode many AI problems, particularly in the area of natural language processing.
But so far, the expressiveness of these languages is still limited, which explains why many contemporary machine learning techniques involving probabilistic models, kernels and support vector machines, or neural networks cannot yet be modeled in these languages. Although Dyna and aProbLog have already been used to represent probabilistic models,^{1} and the Dyna papers mention some simple neural networks, there is—to the best of our knowledge—not yet work on using such languages for kernel-based learning. It is precisely this gap that we want to fill in this paper.
 1.
tensor-based operations kProbLog allows one to encode tensor operations in a way that is reminiscent of tensor relational algebra (Kim and Candan 2011). kProbLog supports recursion and is therefore more expressive than tensor relational algebra and related representations that have been proposed for relational learning (Nickel et al. 2011).
 2.
a wide family of kernel functions Declarative programming of kernels on structured data can be achieved via algebraic labels in the semiring of polynomials. Polynomials were previously used in combination with logic programming for sensitivity analysis by Kimmig et al. (2011) and for data provenance by Green et al. (2007). In this paper, we show that polynomials as kProbLog's algebraic labels enable the specification of label propagation and feature extraction schemas such as those used in recent graph kernels, e.g. Weisfeiler-Lehman (WL) graph kernels (Shervashidze et al. 2011), propagation kernels (Neumann et al. 2012) and graph kernels with continuous attributes such as graph invariant kernels (Orsini et al. 2015). Other graph kernels, such as those based on random walks (Kashima et al. 2003; Mahé et al. 2004), can also be easily declared in our language.
 3.
probabilistic programs kProbLog is, as we show in Sect. 6, a generalization of the ProbLog probabilistic programming language.
 4.
algorithmic differentiation kProbLog supports algorithmic differentiation by means of dual numbers (Eisner 2002). Many learning strategies (ranging from collaborative filtering to neural networks and deep learning) that combine tensor-based operations with gradient descent parameter tuning can therefore be implemented within the language.
At the more technical level, the key novelty of kProbLog compared to Dyna and aProbLog is the introduction of two simple yet powerful mechanisms: the coexistence of multiple semirings within the same program, and the use of metafunctions for combining and manipulating algebraic values beyond simple “sum” and “product” operations. This makes it possible to use kProbLog for declaratively specifying not only the logical component but also the algebraic one. The underlying idea is that the logic captures the structural aspect of the problem while the atom labels capture the algebraic aspect (including counts of substructures). We shall formally define the underlying semantics, provide an implementation of the language and prove its convergence properties.
The paper is organized as follows. First, we provide some basic notions of algebra and logic programming in Sect. 2. We then introduce kProbLog in Sect. 3, first giving a simplified version of the language based on a single semiring and then describing the full kProbLog language with multiple semirings and metafunctions. Section 3 also illustrates the relationship with tensor algebra. In Sect. 4 we explain how kProbLog can be used to declaratively specify some complex state-of-the-art graph kernels, Sect. 5 shows that it is possible to perform algorithmic differentiation in kProbLog, while Sect. 6 shows that kProbLog is a proper generalization of ProbLog and hence can be used as a probabilistic programming language. The work on kProbLog is then evaluated in Sect. 7: we show that kProbLog is expressive enough to encode kernels for some real-world application domains and that the implementation is usable in that we obtain good statistical performance and runtimes on some benchmarks. Finally, in Sect. 8, we offer a comparative analysis of kProbLog and related languages, and draw some conclusions in Sect. 9.
2 Background
In this section, we provide some basic notions about algebra and logic programming.
2.1 Algebra
We now review some mathematical definitions.
Definition 1
(Monoid) A monoid \((\mathbb {S}, \cdot , e)\) is a set \(\mathbb {S}\) with a binary operation \(\cdot \) that satisfies:
 1.
associativity \(\forall a, b, c \in \mathbb {S}\) \((a \cdot b) \cdot c = a \cdot (b \cdot c)\).
 2.
neutral element \(\exists e \in \mathbb {S}: \forall a \in \mathbb {S}: \ e \cdot a = a \cdot e = a\).
Definition 2
(Semiring) A semiring \((\mathbb {S}, \oplus , \otimes , 0_S, 1_S)\) is a set \(\mathbb {S}\) equipped with binary operations \(\oplus \) (sum) and \(\otimes \) (product) such that:
 1.
\((\mathbb {S}, \oplus , 0_{S})\) is a commutative monoid,
 2.
\((\mathbb {S}, \otimes , 1_{S})\) is a monoid,
 3.
distributivity multiplication distributes over addition on the left and on the right, i.e. \(a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)\) and \((a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)\).
 4.
annihilating element the neutral element of the sum \(0_S\) is the annihilating element of multiplication: \(0_S \otimes a = a \otimes 0_S = 0_S\).
Definition 3
(Complete semiring) A semiring \((\mathbb {S}, \oplus , \otimes , 0_S, 1_S)\) is complete if it has an infinitary sum operation \(\bigoplus \) that satisfies the following properties:
 1.
\(\bigoplus _{i \in \emptyset }{a_i} = 0_S\), \(\bigoplus _{i \in \{j\}}{a_i} = a_j\), \(\bigoplus _{i \in \{j, k \}}{a_i} = a_j \oplus a_k\) for \(j \not = k\).
 2.
\(\bigoplus _{j \in J}{(\bigoplus _{i \in I_j}{a_i})} = \bigoplus _{i \in I}{a_i}\) for \(\bigcup _{j \in J}{I_j} = I\) and \(I_j \cap I_k = \emptyset , j \not = k\).
 3.
\(\bigoplus _{i \in I}{(c \otimes a_i)} = c \otimes \left( \bigoplus _{i \in I}{a_i} \right) \), \(\bigoplus _{i \in I}{(a_i \otimes c)} = \left( \bigoplus _{i \in I}{a_i}\right) \otimes c\).
These properties of a complete semiring \(\mathbb {S}\) ensure that infinite sums extend finite sums, are associative and commutative, and satisfy the distributive law (Droste and Kuich 2009).
Definition 4
A semiring \((\mathbb {S}, \oplus , \otimes , 0_S, 1_S)\) is naturally ordered if the set \(\mathbb {S}\) is partially ordered by the relation \(\sqsubseteq \) such that \(\forall a, b \in \mathbb {S}: \ a \sqsubseteq b \) if \(\exists c \in \mathbb {S}: a \oplus c = b\). The partial order relation \(\sqsubseteq \) on \(\mathbb {S}\) is called the natural order (Kuich 1997).
Definition 5
A semiring \((\mathbb {S}, \oplus , \otimes , 0_S, 1_S)\) is \(\omega \)-continuous when: (a) it is complete, (b) it is naturally ordered, and (c) if \(\bigoplus _{i = 1}^{n}{a_i} \sqsubseteq c\) for all \(n \in \mathbb {N}\), then \(\bigoplus _{i \in \mathbb {N}}{a_i} \sqsubseteq c\), for all sequences \(\{a_i\}_{i \in \mathbb {N}}\) in \(\mathbb {S}\) and all \(c \in \mathbb {S}\).
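As a concrete illustration of these definitions, here is a minimal Python sketch (ours, not part of any kProbLog implementation) of the tropical semiring \((\mathbb {N} \cup \{\infty \}, \min , +, \infty , 0)\), an \(\omega \)-continuous semiring:

```python
import math

class Tropical:
    """Tropical semiring (N ∪ {∞}, min, +, ∞, 0): sum is min, product is +."""
    zero = math.inf  # neutral element of ⊕ and annihilator of ⊗
    one = 0          # neutral element of ⊗

    @staticmethod
    def add(a, b):
        return min(a, b)

    @staticmethod
    def mul(a, b):
        return a + b

# Semiring laws on sample values:
assert Tropical.add(3, Tropical.zero) == 3         # 0_S is neutral for ⊕
assert Tropical.mul(3, Tropical.one) == 3          # 1_S is neutral for ⊗
assert Tropical.mul(3, Tropical.zero) == math.inf  # 0_S annihilates ⊗

# Natural order: a ⊑ b iff ∃c. a ⊕ c = b; here min(a, c) = b is solvable
# exactly when b <= a, so ⊑ is the reverse of the usual numeric order.
assert Tropical.add(5, 2) == 2
```

Note that the natural order of the tropical semiring reverses the numeric order, which is what makes iterated ⊕ compute shortest paths.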
2.2 Logic programming
A term t is recursively defined as a constant c, a logical variable X, or a functor f applied to terms t_i, yielding f(t_1,...,t_n). An atom takes the form p(t_1,...,t_m) where p is a predicate of arity m and t_1,...,t_m are terms. A definite clause h :- b_1,...,b_n is a universally quantified expression where b_1,...,b_n and h are atoms. The atom h is called the head of the clause while b_1,...,b_n is called the body. The head h of the clause is true whenever all the atoms b_1,...,b_n in its body are true. A fact is a clause h :- true whose body is true and can be compactly written as h. A definite clause program P is a finite set of definite clauses, also called rules. An expression that does not contain variables is called ground. The Herbrand base A is the set of all the ground atoms that can be constructed from the constants, functors and predicates in a definite clause program P. A Herbrand interpretation I of P is a truth value assignment to all the atoms \(a \in A\) and it is often written as the subset of true atoms. A Herbrand interpretation that satisfies all the rules in the program P is called a Herbrand model. The model-theoretic semantics of a definite clause program is given by its least Herbrand model, that is, the set of all ground atoms \(a \in A\) that are entailed by the logic program P. Logical inference is the task of determining whether a query atom a is entailed by a given logic program P. The two most common approaches to logical inference are backward reasoning and forward reasoning. The former starts from the query and reasons back toward the facts (Nilsson and Maluszynski 1990) and is usually implemented in logic programming by SLD-resolution, while the latter starts from the facts and derives new true atoms using the immediate consequence operator \(T_{P}\) (van Emden and Kowalski 1976).
Definition 6
The least Herbrand model of a program P is the least fixed point of the \(T_{P}\)-operator, i.e. the least set of atoms I such that \(T_{P}(I) = I\).
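The forward-reasoning semantics above can be sketched in a few lines of Python; this propositional toy (all names ours) iterates the immediate consequence operator up to its least fixed point:

```python
def tp(rules, facts, interpretation):
    """One application of the immediate consequence operator T_P: derive
    every head whose body atoms are all true in the given interpretation."""
    derived = set(facts)
    for head, body in rules:
        if all(b in interpretation for b in body):
            derived.add(head)
    return derived

def least_herbrand_model(rules, facts):
    """Iterate T_P from the empty interpretation up to its least fixed point."""
    interpretation = set()
    while True:
        new = tp(rules, facts, interpretation)
        if new == interpretation:
            return interpretation
        interpretation = new

# A tiny ground program: edge facts plus ground instances of a path rule.
facts = {("edge", 1, 2), ("edge", 2, 3)}
rules = [
    (("path", 1, 2), [("edge", 1, 2)]),
    (("path", 2, 3), [("edge", 2, 3)]),
    (("path", 1, 3), [("path", 1, 2), ("edge", 2, 3)]),
]
model = least_herbrand_model(rules, facts)
assert ("path", 1, 3) in model  # entailed by forward reasoning
```

The loop terminates because each application of T_P is monotone and the Herbrand base of a ground program is finite.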
3 The kProbLog language
We introduce kProbLog stepwise. In the first subsection, we assume that a single semiring is used; in the second subsection we introduce metafunctions and allow for multiple semirings; in the third, we present the inference algorithm of kProbLog; and in the fourth subsection we analyze its convergence.
3.1 \(\hbox {kProbLog}^{\mathbb {S}}\)
\(\hbox {kProbLog}^{\mathbb {S}}\) is an algebraic extension of Prolog with labeled facts and rules, where labels are chosen from a semiring \(\mathbb {S}\).
Definition 7
(\(\hbox {kProbLog}^{\mathbb {S}}\) program) A \(\hbox {kProbLog}^{\mathbb {S}}\) program is a tuple \((F, R, \mathbb {S}, \ell )\), where:

F is a finite set of facts;

R is a finite set of definite clauses (also called rules);

\(\mathbb {S}\) is a semiring with sum \(\oplus \) and product \(\otimes \) operations, whose neutral elements are \(0_S\) and \(1_S\) respectively;

\(\ell : F \rightarrow \mathbb {S}\) is a function that maps facts to semiring values.
Definition 8
An algebraic interpretation \(I_{w} = (I, w)\) of a ground \(\hbox {kProbLog}^{\mathbb {S}}\) program P with facts F and atoms A is a set of tuples (a, w(a)) where a is an atom in the Herbrand base A and w(a) is an algebraic formula over the fact labels \(\{ \ell (f) \mid f \in F \}\). We use the symbol \(\emptyset \) to denote the empty algebraic interpretation, i.e. \(\{(\text{ true }, 1_S )\} \cup \{(a, 0_S) \mid a \in A \}\).
In this definition and below we adapt the notation of Vlasselaer et al. (2015).
Definition 9
Example 1
The compound terms^{2} i/2 and j/2 were used to create the new indices that are needed by the Kronecker product. These definitions of matrix operations are reminiscent of tensor relational algebra (Kim and Candan 2011). Each of the above programs can be evaluated by applying the \(T_{(P,S)}(I_w)\)-operator only once. For each program we have a different definition of the C matrix, which is represented by the predicate c/2. As a consequence of Eq. 2, all the algebraic labels of the c/2 facts are polynomials in the algebraic labels of the a/2 and b/2 facts. We draw an analogy between the representation of a sparse tensor in coordinate format and the representation of an algebraic interpretation. A ground fact can be regarded as a tuple of indices/domain elements that uniquely identifies the cell of a tensor, while the algebraic label of the fact represents the value stored in that cell.
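The analogy between algebraic interpretations and sparse tensors in coordinate format can be made concrete with a small Python sketch (ours): matrices are dictionaries from index tuples to labels, and matrix product and Kronecker product follow the clause patterns described above.

```python
# Two sparse matrices in coordinate format: each ground fact a(I, J) is a
# key (I, J) whose algebraic label is the stored value.
a = {(1, 1): 2.0, (1, 2): 3.0}
b = {(1, 1): 4.0, (2, 1): 5.0}

def matmul(a, b):
    """c(I, J) :- a(I, K), b(K, J), accumulating over K with the semiring sum."""
    c = {}
    for (i, k), wa in a.items():
        for (k2, j), wb in b.items():
            if k == k2:
                c[(i, j)] = c.get((i, j), 0.0) + wa * wb
    return c

def kron(a, b):
    """c(i(I1,I2), j(J1,J2)) :- a(I1,J1), b(I2,J2): compound terms as indices."""
    return {((i1, i2), (j1, j2)): wa * wb
            for (i1, j1), wa in a.items()
            for (i2, j2), wb in b.items()}

assert matmul(a, b) == {(1, 1): 2.0 * 4.0 + 3.0 * 5.0}  # a single nonzero cell
assert kron(a, b)[((1, 1), (1, 1))] == 2.0 * 4.0
```

As in an algebraic interpretation, only cells with a non-\(0_S\) label are materialized.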
Definition 10
An algebraic interpretation \(I_w = (I, w)\) is a fixed point of the \(T_{(P,S)}\)-operator if and only if for all \(a \in A\), \(w(a) \equiv w'(a)\), where w(a) and \(w'(a)\) are the algebraic formulae for a in \(I_w\) and \(T_{(P,S)}(I_w)\) respectively.
We denote with \(T_{(P,S)}^{i}\) the function composition of \(T_{(P,S)}\) with itself i times.
Proposition 1
(Application of Kleene's theorem) If S is an \(\omega \)-continuous semiring, the algebraic system of fixed-point equations \(I_w = T_{(P,S)}(I_w)\) admits a unique least solution \(T_{(P,S)}^{\infty }(\emptyset )\) with respect to the partial order \(\sqsubseteq \), and \(T_{(P,S)}^{\infty }(\emptyset )\) is the supremum of the sequence \(T_{(P,S)}^1(\emptyset ), T_{(P,S)}^2(\emptyset ), \ldots , T_{(P,S)}^i(\emptyset ), \ldots \) So \(T_{(P,S)}^{\infty }(\emptyset )\) can be approximated by computing successive elements of the sequence. If the semiring satisfies the ascending chain property (see Esparza et al. 2014) then \(T_{(P,S)}^{\infty }(\emptyset ) = T_{(P,S)}^{i}(\emptyset )\) for some \(i \ge 0\) and \(T_{(P,S)}^{\infty }(\emptyset )\) can be computed exactly (Esparza et al. 2014).
Examples of \(\omega \)-continuous semirings are the Boolean semiring \((\{\text{ T }, \text{ F }\}, \vee , \wedge , \text{ F }, \text{ T })\), the tropical semiring \((\mathbb {N} \cup \{ \infty \},\min ,+,\infty ,0)\) and the fuzzy semiring \(([0, 1],\max , \min , 0, 1)\) (Green et al. 2007).
Example 2
3.2 kProbLog
\(\hbox {kProbLog}\) generalizes \(\hbox {kProbLog}^{\mathbb {S}}\) in two ways: it allows multiple semirings to coexist in the same program, and it enriches the algebraic expressivity by means of metafunctions and metaclauses.
Definition 11
(Metafunction) A metafunction m \(:\mathbb {S}_1 \times \ldots \times \mathbb {S}_k \mapsto \mathbb {S}'\) is a function that maps k semiring values \(x_i \in \mathbb {S}_i, \ i = 1,\ldots , k \) to a value of type \(\mathbb {S}'\), where \(\mathbb {S}_1,\ldots , \mathbb {S}_k\) and \(\mathbb {S}'\) can be distinct sets. If a_1,...,a_k are algebraic atoms, in kProbLog we use the syntax @m[a_1,...,a_k] to express the application of metafunction @m to the values \(w({\texttt {a\_1}}),...,w({\texttt {a\_k}})\) of the atoms a_1,...,a_k.
Definition 12
(Metaclause) A metaclause h :- b_1,...,b_n is a universally quantified expression where h is an atom and b_1,...,b_n can be either atoms or metafunctions applied to other algebraic atoms. The head predicate of a metaclause, the algebraic atoms in the body, and the return types of the metafunctions in the body must all belong to the same semiring.
The introduction of metafunctions in kProbLog allows us to deal with other algebraic structures such as rings, which require the additive inverse @minus/1, and fields, which require both the additive inverse and the multiplicative inverse @inv/1.
Definition 13
(kProbLog program) A kProbLog program P is a union of \(\hbox {kProbLog}^{\mathbb {S}_i}\) programs and metaclauses.
3.2.1 kProbLog \(T_{P}\)-operator with metafunctions
The algebraic \(T_P\)-operator of kProbLog is defined on the metatransformed program.
Definition 14
(Metatransformed program) A metatransformed kProbLog program is a kProbLog program in which all the metafunctions are expanded to algebraic atoms. For each rule h :- b_1,...,@m[a_1,...,a_k],...,b_n in the program P, each metafunction @m[a_1,...,a_k] is replaced by a fresh atom b’ and a metaclause b’ :- @m[a_1,...,a_k] is added to the program P.
Definition 15
Example 3
where we used the value \(\sin (0.9) = 0.78\ldots \)
3.2.2 Recursive kProbLog program with metafunctions
Recursion is a basic tool in logic programming. For our purposes, it is necessary in most useful computations on structured data, such as shortest paths (see Example 2) or random walk graph kernels (see Sect. 4.4.3). Weights need to be updated whenever the groundings of a predicate appear in the cycles of the ground program.
Definition 16
A ground program P is cyclic if it contains a cycle. A cycle is a sequence of rules \(r_1,\ldots , r_n\) such that the head of \(r_i\) is contained in the body of \(r_{i-1}\) for \(i = 2, \ldots , n\) and the head of \(r_1\) is contained in the body of \(r_n\). A ground rule that is contained in a cycle is called a cyclic rule; otherwise it is called an acyclic rule.
\(\hbox {kProbLog}\) allows both additive and destructive updates, as specified by the builtin predicate declare(P, S, U) where U can be either additive or destructive.
Definition 17

additive \(\displaystyle w({\texttt {h}}) = w({\texttt {h}}) \oplus \varDelta w({\texttt {h}})\) or

destructive \(\displaystyle w({\texttt {h}}) = \varDelta w({\texttt {h}})\).
The distinction between additive and destructive is only relevant for cyclic rules. In Sect. 3.3 we give the evaluation algorithm of kProbLog which uses this kind of update when necessary.
Programs such as the transitive closure of a binary relation (see Example 2) or the compilation of ProbLog programs with sentential decision diagrams (SDD) (Darwiche 2011) require additive updates (see Sect. 6). Destructive updates are necessary to specify iterated function composition, as shown in the next example.
Example 4
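As an illustration (not the kProbLog listing of Example 4), the two update policies can be sketched in Python; the helper functions and the choice of g are ours:

```python
def evaluate_cyclic(delta, w0, update, max_iters=100):
    """Iterate w ← update(w, Δw) until a fixed point (or an iteration cap)."""
    w = w0
    for _ in range(max_iters):
        d = delta(w)
        new = update(w, d)
        if new == w:
            return w
        w = new
    return w

# Additive update over the Boolean semiring (⊕ = or): once an atom's label
# becomes true it stays true, as in reachability/transitive closure.
reach = evaluate_cyclic(delta=lambda w: True, w0=False,
                        update=lambda w, d: w or d)
assert reach is True

# Destructive update over the reals: iterated function composition x ← g(x)
# with g(x) = (x + 2/x) / 2, which converges to sqrt(2).
x = evaluate_cyclic(delta=lambda w: (w + 2 / w) / 2, w0=1.0,
                    update=lambda w, d: d)
assert abs(x - 2 ** 0.5) < 1e-9
```

With an additive update the label only grows along the natural order, whereas the destructive update simply overwrites it with the latest value.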
3.2.3 The Jacobi method
We already showed that kProbLog can express linear algebra operations. We now combine recursion and metafunctions in an algebraic program that specifies the Jacobi method (Golub and Van Loan 2012), an iterative algorithm used for solving diagonally dominant systems of linear equations \(\displaystyle A \mathbf {x} = \mathbf {b}\).
We consider the field of real numbers \(\mathbb {R}\) (i.e. \(\hbox {kProbLog}^{\mathbb {R}}\)) as semiring together with the metafunctions @minus and @inv that provide the inverse element of sum and product respectively.
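A plain Python sketch of the Jacobi iteration (ours; the matrix and vector are toy data) may help fix ideas before the kProbLog encoding:

```python
def jacobi(A, b, iters=50):
    """Jacobi iteration x_i ← (b_i - Σ_{j≠i} A[i][j]·x_j) / A[i][i] for a
    diagonally dominant system A x = b; each sweep is a destructive update."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# A diagonally dominant 2x2 system with exact solution x = [1, 2].
A = [[4.0, 1.0],
     [2.0, 5.0]]
b = [6.0, 12.0]
x = jacobi(A, b)
assert abs(x[0] - 1.0) < 1e-9 and abs(x[1] - 2.0) < 1e-9
```

The division by \(A[i][i]\) is where the multiplicative inverse @inv enters, and the subtraction is where @minus enters.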
The introduction of metafunctions makes the result of the evaluation of a kProbLog program dependent on the order in which rules and metaclauses are evaluated. For this reason we explain the order adopted by the kProbLog language.
3.3 kProbLog implementation
The kProbLog interpreter grounds the program and partitions the ground program into strata \(P_1, \ldots , P_n\) such that:

every stratum \(P_i\) is a set of ground atoms which is both maximal and strongly connected (i.e. each ground atom in \(P_i\) depends on every other ground atom in \(P_i\));

a ground atom in an acyclic stratum \(P_i\) can only depend^{4} on the ground atoms from the previous strata \(\bigcup _{j<i}{P_j}\);

a ground atom in a cyclic stratum can depend on the ground atoms in \(\bigcup _{j \le i}{P_j}\).
The update for the weight w(a) of a cyclic atom a is computed by accumulating the result of the application of the \(T_P\)-operator to all the cyclic rules with head a. The new weight is then computed as \(w(a) = w(a) \oplus \varDelta w(a)\) (additive updates) or \(w(a) = \varDelta w(a)\) (destructive updates).
This program evaluation procedure is an adaptation of the work of Whaley et al. (2005) on Datalog and binary decision diagrams. kProbLog was implemented in Python 3.5 using Gringo 4.5^{6} as grounder. The source code of our kProbLog implementation is available at https://github.com/orsinif/kProbLogDSL.
Example 4
(Continued) Evaluation of a cyclic program. The cyclic program P in Sect. 3.2.2 is already ground and contains two ground atoms x0 and x. The ground atoms x0 and x correspond to two nodes in the dependency graph: x0 is a fact and does not have incoming arcs, while x has two dependencies/incoming arcs, namely x0 and itself. As shown in Fig. 1, P is then subdivided into two strata \(P_1\) and \(P_2\): \(P_1\) contains x0 and is acyclic, \(P_2\) contains x and is cyclic.
The algebraic \(T_{P}\)-operator is applied only once for acyclic rules, and multiple times, until convergence, for cyclic rules (i.e. x :- @g[x]).
3.4 Convergence analysis of the kProbLog interpreter
In order to analyze the convergence of the kProbLog interpreter on a kProbLog program P, we assume that all the metafunctions in P terminate and that the finite support condition (Sato 1995) holds.
The finite support condition is commonly used in probabilistic logic programming and ensures that the \(\textsc {Ground}\) procedure outputs a finite ground program.
The convergence properties of kProbLog are characterized by the following theorems.
Theorem 1
(Convergence of acyclic kProbLog programs) The evaluation of an acyclic kProbLog program P invokes the algebraic \(T_{P}\)-operator exactly once for each ground rule in \(\textsc {Ground}(P)\) and terminates.
Theorem 2
(Convergence of kProbLog programs) The evaluation of a kProbLog program is guaranteed to terminate only if all the cyclic strata are \(\hbox {kProbLog}^{\mathbb {S}_i}\) programs where \(\mathbb {S}_i\) are \(\omega \)-continuous semirings.
The proofs of Theorems 1 and 2 are reported in Appendix A.
Theorem 1 can be used to prove the convergence of the elementary programs that specify matrix operations in Sect. 3.1 and the convergence of the WL algorithm that we shall see in Sect. 4.2. Theorem 2 ensures the convergence of the cyclic program in Example 2 when an \(\omega \)-continuous semiring is used for the algebraic labels, but not the convergence of the program in Example 4. While the cyclic program in Example 4 actually converges, this property cannot be entailed from Theorem 2. Indeed, the program in Example 4 has a cyclic stratum \(P_2\) (see Fig. 1) involving a metafunction (i.e. @g). Stratum \(P_2\) is not a \(\hbox {kProbLog}^{\mathbb {S}}\) program because \(\hbox {kProbLog}^{\mathbb {S}}\) programs do not admit metafunctions, and for this reason we cannot apply Theorem 2 to Example 4. kProbLog programs whose strata are either acyclic programs or cyclic programs on \(\omega \)-continuous semirings are guaranteed to converge, and it is possible to verify that these conditions are met at runtime. However, since kProbLog is an extension of Prolog, which is a Turing-complete language, we choose to allow metafunctions and non-\(\omega \)-continuous semirings in cyclic strata. In this way we do not restrict the expressivity of the language.
4 Kernel programming
We now show that kProbLog can be used to declaratively encode state-of-the-art graph kernels. But before doing so, we introduce the semiring \(\mathbb {S}[\mathbf {x}]\) that can be used for feature extraction.
4.1 \(\hbox {kProbLog}^{\mathbb {S}[\mathbf {x}]}\): polynomials for feature extraction
\(\hbox {kProbLog}^{\mathbb {S}[\mathbf {x}]}\) labels facts and rule heads with polynomials over the semiring S. \(\hbox {kProbLog}^{\mathbb {S}[\mathbf {x}]}\) is a particular case of \(\hbox {kProbLog}^{\mathbb {S}}\) because polynomials over semirings are semirings in which addition and multiplication are defined as usual.
While polynomials have been used in combination with logic programming for provenance (Green et al. 2007) and sensitivity analysis (Kimmig et al. 2011), we use multivariate polynomials to represent the explicit feature map of a graph kernel.
Definition 18
4.1.1 Operations for feature extraction
Sum of polynomials
The semiring sum \(\oplus \) between polynomials is used in \(\hbox {kProbLog}^{\mathbb {S}[\mathbf {x}]}\) to sum features or equivalently compute a multiset union operation (see Fig. 2a).
Inner product between polynomials
Example 5
The compression metafunction. The metafunction @id \(:\mathbb {S}[\mathbf {x}] \rightarrow \mathbb {S}[\mathbf {x}]\) is injective: @id/1 transforms a polynomial \(\mathcal {P}(\mathbf {x})\) to a new term t and returns the polynomial \({\texttt {@id}}[\mathcal {P}(\mathbf {x})] = 1.0 \cdot x(t)\). This function can be used to compress a multivariate polynomial to a polynomial in a single new variable. We use the @id metafunction for polynomial compression in the same way as Shervashidze et al. (2011) use the function f to compress multisets of labels. We now show how these functions are used to specify graph kernels.
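To make the polynomial operations of this section concrete, here is a Python sketch (ours) that represents polynomials as Counters from variables to coefficients; the compress function is a stand-in for @id:

```python
from collections import Counter

# A polynomial over the natural numbers as a Counter from variables
# (features) to coefficients (counts): p = 2·x(red) + x(blue).
p = Counter({"x(red)": 2, "x(blue)": 1})
q = Counter({"x(red)": 1, "x(green)": 3})

# Semiring sum ⊕: coefficient-wise addition, i.e. multiset union.
assert p + q == Counter({"x(red)": 3, "x(blue)": 1, "x(green)": 3})

def inner(p, q):
    """Inner product of two feature vectors stored as polynomials."""
    return sum(p[v] * q[v] for v in p.keys() & q.keys())

assert inner(p, q) == 2  # only x(red) is shared: 2 * 1

# A stand-in for @id: injectively map a whole polynomial to a fresh single
# variable, as the WL compression function f does with label multisets.
_ids = {}
def compress(p):
    key = tuple(sorted(p.items()))
    t = _ids.setdefault(key, "id%d" % len(_ids))
    return Counter({"x(%s)" % t: 1})

assert compress(p) == compress(Counter(p))  # equal polynomials, equal ids
assert compress(p) != compress(q)           # distinct polynomials, distinct ids
```

Compression keeps feature vectors short across iterations while preserving distinguishability, which is exactly the role @id plays in the WL encoding below.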
4.2 The WeisfeilerLehman algorithm
The one-dimensional WL method is an iterative vertex classification algorithm for the graph isomorphism problem. It begins by coloring vertices with their labels and, at each round, recolors vertices with a “compressed” version of the multiset of colors of their neighbors. If, at any iteration, two graphs have different sets of vertex colors, they cannot be isomorphic. We will use polynomials to represent WL colors, associating variables with colors and using integer coefficients to encode the number of occurrences of a color in a multiset.
A colored graph \(G = (V, E, \ell )\), where V is a set of vertices, \(E \subseteq V \times V\) is the set of the edges, and \(\ell : V \mapsto \varSigma \) is a function that maps vertices to a color alphabet \(\varSigma \), can be declared in kProbLog as follows:
The WL algorithm has been specified as an acyclic program. Indeed, while wl_color/3 and wl_color_multiset/3 are mutually recursive, wl_color/3 at step H depends on wl_color/3 and wl_color_multiset/3 at step H-1; the ground program is therefore acyclic and we can apply Theorem 1 to verify that it converges.
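For reference, the color refinement that this kProbLog program encodes can be sketched directly in Python (all names ours):

```python
from collections import Counter

def wl_colors(adj, labels, h):
    """One-dimensional WL refinement: at each round a vertex's new color is a
    compressed pair (current color, multiset of its neighbors' colors)."""
    color = dict(labels)
    compress = {}  # injective map from (color, multiset) pairs to fresh ids
    for _ in range(h):
        new = {}
        for v in adj:
            multiset = tuple(sorted(Counter(color[u] for u in adj[v]).items()))
            key = (color[v], multiset)
            new[v] = compress.setdefault(key, len(compress))
        color = new
    return color

# A triangle with one distinctly labeled vertex: refinement keeps vertices
# 1 and 2 symmetric and separates vertex 0.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
labels = {0: "a", 1: "b", 2: "b"}
c = wl_colors(adj, labels, h=2)
assert c[1] == c[2] and c[0] != c[1]
```

The `compress` dictionary plays the role of the @id metafunction on the polynomial representation of the color multisets.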
4.3 Graph kernels
In this section we give the declarative specification of some recent graph kernels such as the WL graph kernel (Shervashidze et al. 2011), propagation kernels (Neumann et al. 2012) and graph invariant kernels (Orsini et al. 2015). These methods have been applied to different domains such as natural language processing (Orsini et al. 2015), computer vision (Neumann et al. 2012) and bioinformatics (Shervashidze et al. 2011; Neumann et al. 2012; Orsini et al. 2015).
4.4 Weisfeiler-Lehman graph kernel and propagation kernels
where \(\mathcal {P}_{\textsc {wl}}^{(h)}(v)\) is the polynomial that represents the WL color of vertex v at step h.
The lsh (locality-sensitive hashing) function discretizes vectors to integer identifiers so that similar vectors receive the same integer identifier with high probability.
4.4.1 Shortest-path Weisfeiler-Lehman graph kernel
4.4.2 Graph invariant kernels
There are multiple ways to instantiate giks; we choose the version called \({\textsc {lwl}}_{\textsc {v}}\), which achieves very good accuracies most of the time, as shown by Orsini et al. (2015). \({\textsc {lwl}}_{\textsc {v}}\) uses R-neighborhood subgraphs as the \(\mathcal {R}\)-decomposition relation, computes the kernel on vertex invariants \(k_{\textsc {inv}}(v, v')\) at the pattern level (local gik) and uses \(\delta _{m}(g, g')\) to match subgraphs that have the same number of nodes.
An R-neighborhood subgraph of a graph G around a vertex v is the subgraph induced by all the vertices in G whose shortest-path distance from v is less than or equal to R.
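A BFS-based Python sketch (ours) of the R-neighborhood computation:

```python
from collections import deque

def r_neighborhood(adj, v, R):
    """Vertices whose shortest-path distance from v is at most R (BFS)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == R:
            continue  # do not expand beyond radius R
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

# On the path graph 0 - 1 - 2 - 3:
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assert r_neighborhood(adj, 1, 1) == {0, 1, 2}
assert r_neighborhood(adj, 1, 2) == {0, 1, 2, 3}
```

The induced subgraph is then obtained by restricting the edge relation to the returned vertex set.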
4.4.3 Random walk graph kernels
Vishwanathan et al. (2010) propose generalized random walk kernels. The similarity between a pair of graphs is computed by performing random walks on both graphs and then counting the number of matching walks.
Counting the number of matching random walks between two graphs \(G_a = (V_a, E_a)\) and \(G_b = (V_b, E_b)\) is equivalent to counting the number of walks in \(G_{\times } = (V_{\times }, E_{\times })\), where \(G_{\times }\) is the direct product between the graphs \(G_a\) and \(G_b\) (Vishwanathan et al. 2010).
Predicate edge/3, when its first argument is graph_a (graph_b), represents the adjacency matrix \(W_a \in \mathbb {R}^{V_a\times V_a}\) (\(W_b \in \mathbb {R}^{V_b\times V_b}\)) of graph \(G_a\) (\(G_b\)).
We shall now consider the same example graphs used by Vishwanathan et al. (2010) starting from two graphs \(G_a\) and \(G_b\) encoded with the kProbLog symbols graph_a and graph_b.
The product graph \(G_{\times }\) can be specified following Eq. 16. When the first argument of predicate edge/3 is kron(graph_a, graph_b) it represents the adjacency matrix \(W_{\times } = W_a \times W_b \in \mathbb {R}^{V_aV_b \times V_aV_b}\) of \(G_{\times }\).
According to Vishwanathan et al. (2010), the initial (stopping) probabilities \(\mathbf {p}_{\times }\) (\(\mathbf {q}_{\times }\)) of the vertices \(V_{\times }\) can also be obtained as the Kronecker product of the initial (stopping) probabilities of \(G_a\) and \(G_b\), i.e. \(\mathbf {p}_{\times } = \mathbf {p}_a \times \mathbf {p}_b\) (\(\mathbf {q}_{\times } = \mathbf {q}_a \times \mathbf {q}_b\)).
The above definition of the Kronecker product differs from the one given in Sect. 3.1 only in the parametrization of the connectivity with graph identifiers.
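As an illustration of the product-graph construction, the following NumPy sketch computes a truncated generalized random walk kernel; the uniform start/stop probabilities, the geometric discount lam and the finite walk length k are assumptions for this toy example, not choices made in the paper:

```python
import numpy as np

# Random walk kernel on the direct product graph: the number of length-i
# matching walks between G_a and G_b equals the corresponding walk count
# in G_x, whose adjacency matrix is the Kronecker product W_a (x) W_b.

def random_walk_kernel(W_a, W_b, p_a, p_b, k=3, lam=0.5):
    W_x = np.kron(W_a, W_b)       # adjacency matrix of the product graph
    p_x = np.kron(p_a, p_b)       # initial probabilities p_x = p_a (x) p_b
    q_x = np.kron(p_a, p_b)       # stopping probabilities (same here)
    value, W_pow = 0.0, np.eye(W_x.shape[0])
    for i in range(k + 1):
        value += lam**i * (q_x @ W_pow @ p_x)   # discounted walk counts
        W_pow = W_pow @ W_x
    return value

W_a = np.array([[0., 1.], [1., 0.]])                        # single edge
W_b = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # 3-vertex path
p_a, p_b = np.full(2, 0.5), np.full(3, 1 / 3)
k_ab = random_walk_kernel(W_a, W_b, p_a, p_b)
```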
5 \(\hbox {kProbLog}^{\epsilon }\): dual numbers for algorithmic differentiation
The need for accurately computed derivatives is ubiquitous, and algorithmic differentiation (AD) (Griewank and Walther 2008) has emerged as a very useful tool in many areas of science. In particular, AD has occupied a special role in machine learning (Baydin et al. 2015) since the introduction of the backpropagation algorithm for training neural networks. Recent advances in deep learning have led to a proliferation of frameworks for AD such as Torch (Collobert et al. 2002), Theano (Bastien et al. 2012) and TensorFlow (Abadi et al. 2015). While it is beyond the scope of this paper to develop an alternative to these frameworks for deep learning, we show in this section how to use the semiring of dual numbers and the gradient semiring (a generalization of dual numbers) (Eisner 2002; Kimmig et al. 2011) in kProbLog for AD.
A dual number \(x + \epsilon x'\) consists of a primal part x and a dual part \(x'\), where \(\epsilon \) is a nilpotent element (i.e. \(\epsilon ^2 = 0\)). For variables \(x' = 1\), while for constants \(x' = 0\).
Example 6
 1.
sum rule \(\frac{d}{d x}(f(x) + g(x)) = f'(x) + g'(x)\),
 2.
product rule \(\frac{d}{d x}(f(x) g(x)) = f(x)g'(x) + f'(x)g(x)\).
 1.
sum \(y + z = f(x) + g(x) + \epsilon (f'(x) + g'(x))\),
 2.
product \(yz = f(x) g(x) + \epsilon (f(x)g'(x) + f'(x)g(x))\).
Dual numbers are generalized to gradients by introducing multiple nilpotent elements \(\epsilon _1, \ldots , \epsilon _n\) such that \(\epsilon _i \epsilon _j = 0, \ \forall i, j\). The gradient number \(x + \epsilon _1 x_{1}' + \ldots + \epsilon _n x_{n}'\) combines the primal part x with n partial derivatives \(x_{1}', \ldots , x_{n}'\).
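The sum and product rules above can be made concrete with a minimal forward-mode AD sketch. This is a toy illustration of dual-number arithmetic, not kProbLog's implementation:

```python
# Forward-mode AD with dual numbers: each value carries a primal and a
# dual part, and + / * implement the sum and product rules.

class Dual:
    def __init__(self, primal, dual=0.0):
        self.primal, self.dual = primal, dual

    def __add__(self, other):
        # (x + eps x') + (y + eps y') = (x + y) + eps (x' + y')
        return Dual(self.primal + other.primal, self.dual + other.dual)

    def __mul__(self, other):
        # eps^2 = 0, so the eps^2 term of the product vanishes:
        # (x + eps x')(y + eps y') = xy + eps (x y' + x' y)
        return Dual(self.primal * other.primal,
                    self.primal * other.dual + self.dual * other.primal)

# Differentiate f(x) = x*x + x at x = 3: f(3) = 12 and f'(3) = 7.
x = Dual(3.0, 1.0)   # a variable: dual part 1
y = x * x + x
```

The gradient semiring is the same idea with one nilpotent element per input variable, so a single forward pass carries all the partial derivatives.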
In kProbLog we denote the nilpotent element \(\epsilon \) with the compound term eps(index_term), where the argument index_term is some term that is used to index distinct partial derivatives. The metafunction @grad/2 takes as inputs a dual number y and a nilpotent element \(\epsilon _x\) and outputs the partial derivative \(\frac{\partial y}{\partial x}\).
Example 7
where we defined x(I) as x0(I) \(+ \epsilon \).
The gradient semiring adds to kProbLog support for AD and can naturally be employed for gradient descent learning.
A very natural task to express in kProbLog is matrix factorization (Nickel et al. 2011; Koren et al. 2009; Kim and Candan 2011).
Example 8
Koren et al. (2009) propose a basic factorization model. Users and items are mapped to a joint f-dimensional factor space. The interaction between an item and a user is modeled as the inner product between their representations in the factor space.
Each user u is associated with a vector \(\mathbf {q}_{u} \in \mathbb {R}^{f}\), while each item i is associated with a vector \(\mathbf {p}_{i}\), and \(r_{ui}\) is the rating given by user u to item i. The goal is to approximate the rating \(r_{ui}\) with a score derived from the inner product between \(\mathbf {q}_{u}\) and \(\mathbf {p}_{i}\). Koren et al. (2009) use the mean squared error between the predicted score and the rating \(r_{ui}\) and regularize the factor representations of users and items (\(\mathbf {q}_{u}\) and \(\mathbf {p}_{i}\)) with the \(\ell _2\)-norm.
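The objective can be minimized by stochastic gradient descent, which is what the gradient semiring enables declaratively. The sketch below is a plain-Python illustration of the same model; the toy ratings, learning rate, regularization weight and factor dimension f are assumptions:

```python
import numpy as np

# Basic factorization model: approximate r_ui by q_u . p_i, minimizing
# squared error plus l2 regularization via stochastic gradient descent.

rng = np.random.default_rng(0)
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]  # (u, i, r_ui)
n_users, n_items, f = 2, 2, 3
Q = 0.5 * rng.standard_normal((n_users, f))   # user factors q_u
P = 0.5 * rng.standard_normal((n_items, f))   # item factors p_i
eta, lam = 0.05, 0.01                         # step size, regularization

for _ in range(500):
    for u, i, r in ratings:
        err = r - Q[u] @ P[i]                 # prediction error
        Q[u] += eta * (err * P[i] - lam * Q[u])
        P[i] += eta * (err * Q[u] - lam * P[i])

mse = sum((r - Q[u] @ P[i]) ** 2 for u, i, r in ratings) / len(ratings)
```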
6 \(\hbox {kProbLog}^{D[\mathbb {S}]}\): ProbLog and aProbLog as special cases
We now clarify the relationship between kProbLog and ProbLog, and we show that the ProbLog implementation using SDDs of Vlasselaer et al. (2015) can be emulated by kProbLog. ProbLog is a probabilistic programming language that defines a probability distribution over possible worlds (Herbrand interpretations). A ProbLog program consists of a set of definite clauses \(c_i\) and a set of facts labeled with probabilities \(p_i\). While a Prolog query can either succeed or fail, ProbLog computes the success probability of a query. The success probability of a query q is the sum of the probabilities of all the possible worlds I in which q is true; it thus corresponds to the probability that q is true in a randomly chosen possible world.
Example 9
In Fig. 3 (on the left) we show a ProbLog program in which there are two facts p(a) and p(b) with probability labels 0.5 and 0.6 respectively. p(c) and p(d) are defined as the conjunction p(a) \(\wedge \) p(b) and the disjunction p(a) \(\vee \) p(b) respectively. On the right we show two tables which compute the probabilities of p(c) and p(d). For p(c) we have one possible world, while for p(d) there are three. For both p(c) and p(d) we enumerate the worlds in which they are true and compute their weighted model count.
To compute the probabilities of queries, ProbLog compiles the logical part of the program into a Boolean circuit and then evaluates this circuit on the probabilities \(p_i\). The circuit is evaluated by replacing disjunctions and conjunctions with sums and products respectively. The compilation process is necessary to cope with the disjoint-sum problem (De Raedt et al. 2007; Kimmig et al. 2011). For instance, to compute \(P({\texttt {p(d)}})\) we cannot simply sum up \(P( {\texttt {p(a)}})\) and \(P( {\texttt {p(b)}})\) (two possible explanations/proofs for p(d)) as this would lead to a value larger than one; rather, we need to compute \(P( {\texttt {p(a)}}) + P( {\texttt {p(b)}} \wedge \lnot {\texttt {p(a)}})\). The disjoint-sum problem can be solved by representing the Boolean circuit as a decision diagram: in practice, either an ordered binary decision diagram (OBDD) (Bryant 1992) or an SDD (Darwiche 2011). While the first version of ProbLog (De Raedt et al. 2007) used OBDDs, more recent work (Vlasselaer et al. 2015) uses SDDs.
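A worked version of this arithmetic, using the probabilities 0.5 and 0.6 from the example above, makes the overcounting explicit:

```python
# Disjoint-sum problem on p(d) = p(a) or p(b), with P(p(a)) = 0.5 and
# P(p(b)) = 0.6: naively summing the two proofs overcounts the worlds
# where both facts hold, while summing disjoint explanations is correct.

p_a, p_b = 0.5, 0.6

naive = p_a + p_b                             # 1.1: not a probability
disjoint = p_a + p_b * (1 - p_a)              # P(a) + P(b and not a)
inclusion_exclusion = p_a + p_b - p_a * p_b   # same value by incl.-excl.
```

Both correct computations give \(P({\texttt {p(d)}}) = 0.8\), which is what the decision-diagram evaluation produces.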
Algebraic model counting (AMC) generalizes probabilistic model counting to a semiring \(\mathbb {S}\). In kProbLog it is possible to employ a semiring \(D[\mathbb {S}]\) to specify AMC tasks on an arbitrary commutative semiring \(\mathbb {S}\). The semiring of valued decision diagrams \(D[\mathbb {S}]\) can be represented using an SDD whose variables are labeled with elements from the commutative semiring \(\mathbb {S}\). Valued decision diagrams are similar to PSDDs (Kisa et al. 2014), except that the values are not necessarily probabilities and do not necessarily encode probability distributions.
Any ProbLog program can be directly translated into a \(\hbox {kProbLog}^{\mathrm{D}[\mathbb {R}]}\) program using the semiring of SDDs labeled with probabilities. This is a direct consequence of the fact that the evaluation algorithm of kProbLog generalizes the \(T_{P}\)-compilation with SDDs of Vlasselaer et al. (2015) to arbitrary semirings. If we label kProbLog facts with SDDs we recover the compilation algorithm of Vlasselaer et al. (2015).
Example 10
The notation sdd(Value,Atom)::Atom used for kProbLog is cumbersome and can be replaced by the syntactic sugar Value::Atom. In this way the \(\hbox {kProbLog}^{D[\mathbb {R}]}\) program becomes syntactically identical to the ProbLog one.
So far we have shown that \(\hbox {kProbLog}^{D[\mathbb {R}]}\) can perform probabilistic model counting. This behavior is not enforced by the language as it is in ProbLog, but is optional (i.e. it is induced by the type declaration :- declare(p, sdd(real))).
While \(\hbox {kProbLog}^{\text {D}[\mathbb {R}]}\) is equivalent to ProbLog, it is also straightforward to represent aProbLog on a semiring S as \(\hbox {kProbLog}^{\text {D}[\mathbb {S}]}\) using SDDs labeled with semiring values.^{7}
Algebraic model counting is useful for inference tasks and reasoning about possible worlds, but there are some tasks which are nontrivial to express in aProbLog. Examples are linear algebra operations and explicit feature extraction as explained in Sect. 4.1.
7 Experimental evaluation
We now experimentally evaluate kProbLog and show how it can be used as a declarative language for machine learning. The choices that a kProbLog programmer needs to make in order to satisfy a requirement are quite different from the ones that an imperative programmer would make. While an imperative programmer would have to use different data structures to meet the software requirements, a kProbLog user can simply specify the requirements with logical rules. For instance, when moving from a directed to an undirected graph, imperative programmers would have to change their data structure, while in kProbLog it suffices to add an extra rule to capture the symmetry of undirected graphs. kProbLog is well suited for prototyping. As we will show below with an example of a graph kernel (cf. E2), kProbLog makes it easy to compose existing programs in order to construct new ones. Different graph kernels take into account different structural aspects. For example, the WL subtree kernel can capture degree centrality while shortest path kernels do not. On the other hand, shortest path kernels are a natural choice if one wants to capture patterns with distant nodes. Even though WL subtree patterns can also capture distant nodes, the number of iterations required to do so could lead to diagonally dominant kernels. Since both kernels can easily be specified in kProbLog (as we will show), it is also straightforward to create a hybrid graph kernel combining the strengths of both.
Another powerful construct of kProbLog is its metafunctions. In a machine learning context, metafunctions can be exploited as a flexible and expressive instrument for describing rich families of kernels. In this sense, metafunctions can be interpreted as a powerful generalization of common kernel hyperparameters, lifting them from simple numbers to functions. We will show in E1 how metafunctions can be exploited to explore multiple feature spaces against the same logical specification and provide a rich class of feature spaces.

Q1 Can we use metafunctions to explore multiple feature spaces against the same kProbLog specification and increase the classification accuracy?

Q2 Can kProbLog produce hybrid kernels that combine the strengths of existing ones?

Q3 Are the results obtained with kProbLog in line with the state of the art?
7.1 Datasets
 QC

Li and Roth (2002) is a dataset about question classification and contains 5500 training and 500 test questions from the TREC-10 QA competition. Question classifiers are often used to improve the performance of question answering systems: they can provide constraints on the answer types and determine answer-selection strategies. qc labels questions according to a two-layer taxonomy of answer types. The taxonomy contains 6 coarse classes (abbreviation, entity, description, human, location and numeric value) and 50 fine classes.
Example 11
The sentence “What films featured the character Popeye Doyle ?” is labeled in qc as entity since we expect films in the answer.
 MUTAG

Debnath et al. (1991) is a dataset of 188 mutagenic compounds labeled according to whether or not they have a mutagenic effect on the Gram-negative bacterium Salmonella typhimurium.
 BURSI

Kazius et al. (2005) is a dataset of 4337 molecular compounds subdivided into two classes (2401 mutagens and 1936 non-mutagens) determined with the Ames in vitro assay.
7.2 Experiments
E1 This experiment was designed to provide an answer to Q1 and, in particular, to illustrate the expressiveness of kProbLog's metafunctions in an NLP context, where a large number of options are typically available to describe the feature space. It also aims to answer Q3, since question classification is a typical task for which good results using graph kernels have been reported in the literature (Li and Roth 2002; Zhang and Lee 2003). Each sentence in the qc dataset is represented as a sequence of tokens. For this purpose, we define a predicate token_labels/1, a unary relation that associates with each token t an algebraic label encoding the word, lemma and part-of-speech (pos) tag of t.
List of the 16 configurations that achieve the highest classification accuracy on qc
Unigram features  Shortest path features  Test accuracy (%) 

lw  lp  91.6 
lpw  lp  91.2 
lw  \(\_lpw\)  91.0 
lw  \(\_lw\)  90.6 
lw  \(\_lp\)  90.6 
lpw  \(\_lpw\)  90.6 
lw  lpw  90.4 
lpw  lpw  90.4 
lpw  \(\_pw\)  90.4 
lpw  \(\_lp\)  90.4 
pw  \(\_l\)  90.2 
lw  \(\_pw\)  90.2 
lpw  pw  90.2 
lw  pw  90.0 
lpw  \(\_lw\)  90.0 
w  pw  89.8 
We measured the runtime of the feature extraction and found that none of the 127 runs on qc exceeds 32 seconds. The runtime measurement was performed on a 16-core machine (Intel Xeon CPU E5-2665@2.40GHz and 96 GB of RAM).
E2 In this experiment we mainly aim to answer Q2 and, in particular, to test the ability of kProbLog to hybridize two well-known graph kernels in a context (molecule classification) where they are known to perform well. In order to capture the complementary advantages of the WL subtree and shortest path kernels, mentioned at the beginning of this section, we specify a hybrid kernel: we extract histograms of shortest paths and decorate them with WL labels. This is where we hybridize the two kernels. The reader should not confuse this kernel with the WL shortest path kernel (Shervashidze et al. 2011) (explained in Sect. 4.4.1), which takes as features pairs of WL labels together with their shortest-path distance.
For both mutag and bursi we set the maximum number of WL iterations to \({\texttt {MAX\_ITER}}=1\) and ran our kProbLog specification. We performed classification experiments using 10-fold cross-validation, measuring classification accuracy for mutag and area under the roc curve for bursi. We repeated the 10-fold cross-validation 10 times and obtained an average accuracy of \(91.1\%\) with a standard deviation of \(0.9\%\) for mutag and an average area under the roc curve of 0.902 with a standard deviation of 0.001 for bursi. For both datasets we used a linear \(\textsc {svm}\) classifier with the C parameter set to 1. We measured the runtime on the same hardware used in E1. The runtime for mutag was 32 seconds, while for bursi it was 5 minutes and 7 seconds.
All the experiments can be reproduced by running the code provided with the kProbLog implementation (see Sect. 3.3).
7.3 Discussion
We now answer the experimental questions:
A1 In E1 we explored a parametrized feature space for qc using different combinations of words, lemmas and pos tags; the 16 best parametrizations are listed in Table 1. Since the best results (\(91.6\%\) accuracy) are in line with those reported by Li and Roth (2002) and Zhang and Lee (2003), we conclude that metafunctions are a valid language construct to parametrize the feature space.
A2 Shervashidze et al. (2011) experimented on mutag with 8 different graph kernels and achieved the highest accuracy (\(87.3 \pm 0.6\)) with shortest path kernels, while the accuracy obtained with the WL subtree kernel is \(82.1 \pm 0.4\) (see Table 1 of Shervashidze et al. 2011). As anticipated at the beginning of this section, the WL subtree kernel and the shortest path kernel capture different topological aspects. In experiment E2, thanks to the declarative nature of kProbLog, we built a hybrid kernel by labeling shortest paths with WL colors. We experimented with this kernel on mutag and obtained an accuracy of \(91.1 \pm 0.9\%\), which is significantly higher than the accuracies individually achieved by the shortest path and WL subtree kernels. In E2, we also experimented on bursi with the same hybrid kernel and obtained an area under the roc curve of \(0.902 \pm 0.001\); this result is in line with those reported in Table 1 of Costa and De Grave (2010).
A3 The \(91.6\%\) accuracy obtained in E1 on qc is in line with the results reported by Li and Roth (2002) and Zhang and Lee (2003). The \(91.1 \pm 0.9\%\) accuracy obtained with our hybrid kernel in E2 on mutag is significantly higher than the accuracies obtained with 8 different graph kernels by Shervashidze et al. (2011). Also the \(0.902 \pm 0.001\) area under the roc curve obtained in E2 on bursi is in line with the results reported by Costa and De Grave (2010). For these reasons, we conclude that kProbLog can be used to specify kernels that work well on real-world application domains. The measured runtimes are reasonable and show that kProbLog is usable in practice: feature extraction on qc and mutag took less than a minute, while on bursi it took less than 6 minutes.
In principle we could reimplement any kProbLog program in a declarative language such as Prolog or in an imperative language such as Python, but we would lose flexibility and elegance in both cases. Prolog easily expresses relational data, but in order to handle mathematical labels the user would be forced to code inside the rules not only the relational aspect but also the algebraic aspect. In this sense, kProbLog is advantageous because it decouples the relational aspect from the algebraic aspect and avoids writing boilerplate code. Imperative languages such as Python, C++ and Java offer rich libraries for scientific computing and machine learning, but do not have built-in support for logical variables and unification. In particular, each metafunction in a kProbLog program can be put into one-to-one correspondence with an ordinary function (e.g. a Python function). However, these imperative languages do not have an equivalent of metaclauses, which are first-order constructs and support logical variables.
8 Related work
In the introduction, we claimed that kProbLog can express models for tensorbased operations, for kernels, and for probabilistic programs; we also mentioned approaches such as Dyna and aProbLog. We now discuss related work along these lines.
First, kProbLog is able to combine logic with tensors and can express tasks such as matrix factorization. As such kProbLog is related to Tensor Relational Algebra (Kim and Candan 2011), which combines tensors with relational algebra and which was successfully employed for tensor decomposition. However, tensor relational algebra does not support recursion and is therefore less expressive than kProbLog.
Secondly, and perhaps most importantly, kProbLog can be used to declaratively specify a wide range of relational and graph kernels and feature extraction problems using polynomial semirings. As such it is related to the kLog system (Frasconi et al. 2014), which has focused on the specification of relational learning problems and provides a framework to map them into graph-based learning problems via a procedure called graphicalization. In conjunction with a graph kernel, kLog can construct feature vectors associated with tuples of objects in relational domains. However, kLog does not provide support for programming the kernel itself: it uses a built-in kernel (the NSPDK of Costa and De Grave (2010)) or defers the kernel specification to external plugins. kLog and kProbLog are therefore complementary languages; by adopting kProbLog in kLog one would obtain a statistical relational learning system in which the kernel could be declaratively specified as well. Gärtner et al. (2004) also contributed kernels within a typed higher-order logic in which individuals (the examples) are represented as terms and the kernel definitions, specified in a lambda calculus, exploit the syntactic structure of these example representations. While this also yields a declarative language for specifying kernels on structured objects, it neither involves the use of semirings nor was it applied to other modeling tasks such as those involving probabilistic reasoning.
Finally, kProbLog is an algebraic logic programming system building upon aProbLog (Kimmig et al. 2011) and Dyna (Eisner et al. 2004; Eisner and Filardo 2011). The relationships to these languages are quite subtle and more technical. Nevertheless, distinguishing features of kProbLog are that it supports A) multiple semirings, B) metafunctions, C) additive and destructive updates, D) algebraic model counting, and E) its semantics are rooted in logic programming theory (using an adaptation of the \(T_P\)-operator (Vlasselaer et al. 2015)).
On the other hand, aProbLog (Kimmig et al. 2011) is a generalization of the probabilistic programming language ProbLog (De Raedt et al. 2007) to semirings. ProbLog and other statistical relational learning formalisms are based on a possible-world semantics and weighted model counting. The key contribution of aProbLog is that it generalizes weighted model counting to algebraic model counting (Kimmig et al. 2012), based on commutative semirings instead of the probabilistic semiring. kProbLog extends aProbLog in that it supports multiple semirings (A), metafunctions (B) and destructive as well as additive updates (C). Furthermore, kProbLog (in particular the \(\hbox {kProbLog}^{D[\mathbb {S}]}\) fragment) replicates aProbLog by performing AMC on a semiring S using the semiring of SDDs whose variables are labeled with values from S. Finally, aProbLog was conceived for algebraic reasoning about possible worlds, while kProbLog's main design goal was the specification of tensor algebra and feature extraction problems.
A second closely related language is Dyna (Eisner et al. 2004; Eisner and Filardo 2011), a language that was initially conceived as a semiring-weighted extension of Datalog for dynamic programming. Dyna has been developed for quite a while and is a fairly complex language supporting many different extensions of the basic algebraic Datalog. While kProbLog builds upon Dyna's ideas, Dyna does not support metafunctions (B), destructive updates (C), or algebraic model counting (D). Concerning (D), Dyna has not dealt with the disjoint-sum problem occurring in probabilistic and algebraic logics such as ProbLog and aProbLog. Furthermore, the semantics of Dyna were specified in a more informal way in Eisner and Blatz (2007) using the definition of a valuation function, and although Eisner and Blatz (2007) and Eisner and Filardo (2011) relate Dyna's semantics to a \(T_{P}\)-operator, that operator is not formally defined in these papers (E).
9 Conclusions
We proposed kProbLog, a simple algebraic extension of Prolog that can be used for declarative machine learning and, most importantly, for kernel programming. Indeed, polynomials and metafunctions allow many recent kernels (e.g. the WL graph kernel, propagation kernels and giks) to be elegantly specified in kProbLog.
We further introduced into the language the semiring of dual numbers, so that kProbLog can also express gradient-descent learning, and its generalization, the gradient semiring, allowed us to specify matrix factorization. We showed how the semiring of decision diagrams allows kProbLog to capture aProbLog (and so ProbLog and, hence, probabilistic programming) as a fragment.
All these features make kProbLog a language in which the user can combine rich logical and relational representations with algebraic ones to declaratively specify models for machine learning. Our experimental evaluations showed that kProbLog can be applied to real world datasets, obtaining good statistical performance and runtimes.
Footnotes
 1.
Dyna does not handle the disjoint-sum problem; a more detailed explanation about reasoning about possible worlds and the disjoint-sum problem can be found in Sect. 6.
 2.
We use the notation functor / arity for compound terms.
 3.
The library can be extended with the Python language.
 4.
We say that an atom a directly depends on an atom b if a is the head of a rule or a metaclause and b is a body literal or an argument of a metafunction in the meta clause. We say that an atom a depends on an atom b either if a directly depends on b or there is an atom c such that a directly depends on c and c depends on b.
 5.
Atoms of distinct semirings cannot be mutually dependent without using metaclauses.
 6.
 7.
Probabilities have neutral sums (i.e. for each atom a we have that \(p(a) + p(\lnot a) = 1\)) but this property is not verified for semirings in general. This issue is known as the neutralsums problem (Kimmig et al. 2011). Kimmig et al. (2011) explain how to overcome the neutralsums problem by modifying the evaluation of a Boolean circuit.
 8.
We used the spaCy Python library to extract lemmas, pos tags and dependency relations.
Notes
Acknowledgements
We would like to thank Angelika Kimmig and Anton Dries for the fruitful discussions about ProbLog.
References
 Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving G, Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens J, Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke M, Yu, Y., & Zheng, X. (2015). TensorFlow: Largescale machine learning on heterogeneous systems. http://tensorflow.org/, software available from tensorflow.org.
 Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow IJ, Bergeron, A., Bouchard, N., & Bengio, Y. (2012). Theano: New features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.Google Scholar
 Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2015). Automatic differentiation in machine learning: A survey. arXiv:1502.05767
 Bryant, R. E. (1992). Symbolic boolean manipulation with ordered binarydecision diagrams. ACM Computing Surveys, 24(3), 293–318.MathSciNetCrossRefGoogle Scholar
 Ceri, S., Gottlob, G., & Tanca, L. (1989). What you always wanted to know about datalog (and never dared to ask). IEEE Transactions on Knowledge and Data Engineering, 1(1), 146–166. doi: 10.1109/69.43410.CrossRefGoogle Scholar
 Collobert, R., Bengio, S., & Mariéthoz, J. (2002). Torch: A modular machine learning software library. IDIAP: Tech. rep.Google Scholar
 Costa, F., & De Grave, K. (2010). Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 27th international conference on machine learning (ICML10), June 21–24, Haifa, Israel, pp. 255–262, http://www.icml2010.org/papers/347.pdf
 Darwiche, A. (2011). SDD: A new canonical representation of propositional knowledge bases. In IJCAI 2011, Proceedings of the 22nd international joint conference on artificial intelligence, Barcelona, Catalonia, Spain, July 16–22, pp. 819–826. doi: 10.5591/9781577355168/IJCAI11143
 Darwiche, A., & Marquis, P. (2002). A knowledge compilation map. Journal of Artificial Intelligence Research, 17(1), 229–264.MathSciNetzbMATHGoogle Scholar
 De Marneffe, M. C., & Manning, C. D. (2008). The stanford typed dependencies representation. InColing 2008: Proceedings of the workshop on crossframework and crossdomain parser evaluation, Association for Computational Linguistics, pp. 1–8.Google Scholar
 De Raedt, L. (2008). Logical and relational learning. Berlin: Springer.CrossRefzbMATHGoogle Scholar
 De Raedt, L., Kimmig, A., Toivonen, H. (2007). Problog: A probabilistic prolog and its application in link discovery. In IJCAI 2007, proceedings of the 20th international joint conference on artificial intelligence, Hyderabad, India, January 6–12, pp. 2462–2467, http://ijcai.org/Proceedings/07/Papers/396.pdf
 De Raedt, L., Kersting, K., Natarajan, S., & Poole, D. (2016). Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2), 1–189.CrossRefzbMATHGoogle Scholar
 Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., & Hansch, C. (1991). Structureactivity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2), 786–797.CrossRefGoogle Scholar
 Droste, M., & Kuich, W. (2009). Semirings and formal power series. Berlin: Springer.CrossRefGoogle Scholar
 Eisner, J. (2002). Parameter estimation for probabilistic finitestate transducers. In Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, pp. 1–8.Google Scholar
 Eisner, J., Blatz, J. (2007). Program transformations for optimization of parsing algorithms and other weighted logic programs. In Proceedings of formal grammar, pp. 45–85.Google Scholar
 Eisner, J., & Filardo, N. W. (2011). Dyna: Extending datalog for modern AI. In Datalog reloaded. Springer, pp. 181–220.Google Scholar
 Eisner, J., Goldlust, E., & Smith, N. A. (2004). Dyna: A declarative language for implementing dynamic programs. In Proceedings of the 42nd annual meeting of the association for computational linguistics (ACL), Companion Volume, Barcelona, pp. 218–221.Google Scholar
 Esparza, J., Luttenberger, M., & Schlund, M. (2014). Fpsolve: A generic solver for fixpoint equations over semirings. In Proceedings of 19th international conference implementation and application of automata, CIAA 2014, Giessen, Germany, July 30–August 2, pp. 1–15. doi: 10.1007/9783319088464_1.
 Frasconi, P., Costa, F., De Raedt, L., & De Grave, K. (2014). klog: A language for logical and relational learning with kernels. Artificial Intelligence, 217, 117–143. doi: 10.1016/j.artint.2014.08.003.CrossRefzbMATHGoogle Scholar
 Garcez, Ad., Besold, T. R., de Raedt, L., Földiak, P., Hitzler, P., Icard, T., Kühnberger, K. U., Lamb, L. C., Miikkulainen, R., & Silver, D. L. (2015). Neuralsymbolic learning and reasoning: contributions and challenges. In Proceedings of the AAAI spring symposium on knowledge representation and reasoning: Iintegrating symbolic and neural approaches, Stanford.Google Scholar
 Garcez, A. S., Lamb, L. C., & Gabbay, D. M. (2008). Neuralsymbolic cognitive reasoning. Berlin: Springer.zbMATHGoogle Scholar
 Gärtner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines. Springer, pp. 129–143Google Scholar
 Gärtner, T., Lloyd, J. W., & Flach, P. A. (2004). Kernels and distances for structured data. Machine Learning, 57(3), 205–232.
 Getoor, L., & Taskar, B. (Eds.). (2007). Introduction to statistical relational learning. Adaptive computation and machine learning. Cambridge, MA: MIT Press.
 Golub, G. H., & Van Loan, C. F. (2012). Matrix computations (Vol. 3). Baltimore: JHU Press.
 Green, T. J., Karvounarakis, G., & Tannen, V. (2007). Provenance semirings. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM.
 Griewank, A., & Walther, A. (2008). Evaluating derivatives: Principles and techniques of algorithmic differentiation (2nd ed.). Philadelphia: Society for Industrial and Applied Mathematics.
 Kashima, H., Tsuda, K., & Inokuchi, A. (2003). Marginalized kernels between labeled graphs. ICML, 3, 321–328.
 Kazius, J., McGuire, R., & Bursi, R. (2005). Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry, 48(1), 312–320.
 Kim, M., & Candan, K. S. (2011). Approximate tensor decomposition within a tensor-relational algebraic framework. In Proceedings of the 20th ACM conference on information and knowledge management, CIKM 2011, Glasgow, United Kingdom, October 24–28, pp. 1737–1742. doi: 10.1145/2063576.2063827.
 Kimmig, A., Van den Broeck, G., & De Raedt, L. (2011). An algebraic Prolog for reasoning about possible worlds. In Proceedings of the twenty-fifth AAAI conference on artificial intelligence, AAAI 2011, San Francisco, California, USA, August 7–11, http://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3685.
 Kimmig, A., Van den Broeck, G., & De Raedt, L. (2012). Algebraic model counting. CoRR, abs/1211.4475. arXiv:1211.4475.
 Kisa, D., Van den Broeck, G., Choi, A., & Darwiche, A. (2014). Probabilistic sentential decision diagrams. In Proceedings of the 14th international conference on principles of knowledge representation and reasoning (KR).
 Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
 Kuich, W. (1997). Semirings and formal power series: Their relevance to formal languages and automata. In Handbook of formal languages (pp. 609–677). Springer.
 Landwehr, N., Passerini, A., De Raedt, L., & Frasconi, P. (2006). kFOIL: Learning simple relational kernels, pp. 389–394.
 Li, X., & Roth, D. (2002). Learning question classifiers. In Proceedings of the 19th international conference on computational linguistics, Volume 1. Association for Computational Linguistics.
 Mahé, P., Ueda, N., Akutsu, T., Perret, J. L., & Vert, J. P. (2004). Extensions of marginalized graph kernels. In Proceedings of the twenty-first international conference on machine learning, ACM, p. 70.
 Milch, B., Marthi, B., Russell, S. J., Sontag, D., Ong, D. L., & Kolobov, A. (2005). BLOG: Probabilistic models with unknown objects. In Proceedings of the nineteenth international joint conference on artificial intelligence (IJCAI-05), pp. 1352–1359, http://ijcai.org/Proceedings/05/Papers/1546.pdf.
 Muggleton, S., De Raedt, L., Poole, D., Bratko, I., Flach, P. A., Inoue, K., et al. (2012). ILP turns 20: Biography and future challenges. Machine Learning, 86(1), 3–23. doi: 10.1007/s10994-011-5259-2.
 Neumann, M., Patricia, N., Garnett, R., & Kersting, K. (2012). Efficient graph kernels by randomization. In Proceedings of machine learning and knowledge discovery in databases: European conference, ECML PKDD 2012, Bristol, UK, September 24–28. Part I, pp. 378–393. doi: 10.1007/978-3-642-33460-3_30.
 Nickel, M., Tresp, V., & Kriegel, H. P. (2011). A three-way model for collective learning on multi-relational data. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 809–816.
 Nilsson, U., & Maluszynski, J. (1990). Logic, programming and Prolog. Chichester: Wiley.
 Orsini, F., Frasconi, P., & De Raedt, L. (2015). Graph invariant kernels. In Proceedings of the twenty-fourth international joint conference on artificial intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25–31, pp. 3756–3762, http://ijcai.org/Abstract/15/528.
 Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5(3), 239–266.
 Richardson, M., & Domingos, P. M. (2006). Markov logic networks. Machine Learning, 62(1–2), 107–136. doi: 10.1007/s10994-006-5833-1.
 Sammut, C. (1993). The origins of inductive logic programming: A prehistoric tale. In Proceedings of the 3rd international workshop on inductive logic programming. J. Stefan Institute, pp. 127–147.
 Sato, T. (1995). A statistical learning method for logic programs with distribution semantics. In Logic programming: Proceedings of the twelfth international conference on logic programming, Tokyo, Japan, June 13–16, pp. 715–729.
 Sato, T., & Kameya, Y. (1997). PRISM: A language for symbolic-statistical modeling. In Proceedings of the fifteenth international joint conference on artificial intelligence, IJCAI 97, Nagoya, Japan, August 23–29, Vol. 2, pp. 1330–1339. http://ijcai.org/Proceedings/972/Papers/078.pdf.
 Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., & Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12, 2539–2561, http://dl.acm.org/citation.cfm?id=2078187.
 van Emden, M. H., & Kowalski, R. A. (1976). The semantics of predicate logic as a programming language. Journal of the ACM, 23(4), 733–742. doi: 10.1145/321978.321991.
 Van Laer, W., & De Raedt, L. (2001). How to upgrade propositional learners to first order logic: A case study. In Machine learning and its applications. Springer, pp. 102–126.
 Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R., & Borgwardt, K. M. (2010). Graph kernels. The Journal of Machine Learning Research, 11, 1201–1242.
 Vlasselaer, J., Van den Broeck, G., Kimmig, A., Meert, W., & De Raedt, L. (2015). Anytime inference in probabilistic logic programs with Tp-compilation. In Proceedings of the twenty-fourth international joint conference on artificial intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25–31, pp. 1852–1858, http://ijcai.org/Abstract/15/263.
 Whaley, J., Avots, D., Carbin, M., & Lam, M. S. (2005). Using Datalog with binary decision diagrams for program analysis. In Programming languages and systems. Springer.
 Zhang, D., & Lee, W. S. (2003). Question classification using support vector machines. In SIGIR 2003: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, July 28–August 1, 2003, Toronto, Canada, pp. 26–32. doi: 10.1145/860435.860443.