Learning hierarchical probabilistic logic programs

Probabilistic logic programming (PLP) combines logic programs and probabilities. Due to its expressiveness and simplicity, it has been considered as a powerful tool for learning and reasoning in relational domains characterized by uncertainty. Still, learning the parameter and the structure of general PLP is computationally expensive due to the inference cost. We have recently proposed a restriction of the general PLP language called hierarchical PLP (HPLP) in which clauses and predicates are hierarchically organized. HPLPs can be converted into arithmetic circuits or deep neural networks and inference is much cheaper than for general PLP. In this paper we present algorithms for learning both the parameters and the structure of HPLPs from data. We first present an algorithm, called parameter learning for hierarchical probabilistic logic programs (PHIL) which performs parameter estimation of HPLPs using gradient descent and expectation maximization. We also propose structure learning of hierarchical probabilistic logic programming (SLEAHP), that learns both the structure and the parameters of HPLPs from data. Experiments were performed comparing PHIL and SLEAHP with PLP and Markov Logic Networks state-of-the art systems for parameter and structure learning respectively. PHIL was compared with EMBLEM, ProbLog2 and Tuffy and SLEAHP with SLIPCOVER, PROBFOIL+, MLB-BC, MLN-BT and RDN-B. The experiments on five well known datasets show that our algorithms achieve similar and often better accuracies but in a shorter time.


Introduction
Probabilistic logic programs (PLPs) extend logic programs (LPs) with probabilities (Riguzzi, 2018). Due to its expressiveness, PLPs have been, for more than two decades now, used for modeling, learning and reasoning in domains where uncertainty plays a crucial role, such as bioinformatics (Mørk & Holmes, 2012;De Raedt et al., 2007), natural language processing  and link prediction in social networks (Meert et al., 2010) just to cite a few.
Despite its expressiveness and simplicity, learning the parameters and the structure of general PLPs is still a computationally expensive task due to the high cost of inference often performed many times during the learning process. In Nguembang Fadja et al. (2017) we proposed a new language called hierarchical probabilistic logic programming (HPLP) which is a restriction of the language of logic programs with annotated disjunctions (Vennekens et al., 2004) in which clauses and predicates are hierarchically organized. HPLPs can be translated in an efficient way into arithmetic circuits (ACs) from which computing the probability of queries is linear in the number of nodes of the circuit. This makes inference and learning in HPLPs faster than for general PLPs.
In this paper, we present and implement algorithms for learning both the parameters and the structure of HPLPs from data. In order to estimate the parameters of an HPLP, the algorithm, parameter learning for hierarchical probabilistic logic programs (PHIL), first translates the HPLP into a set of arithmetic circuits (ACs) sharing parameters and then applies gradient descent or expectation maximization (EM) over the ACs. The gradients are computed by evaluating the ACs bottom-up and top-down and the expectations by performing message passing over the ACs. Besides, we present an algorithm, called structure learning of hierarchical probabilistic logic programming (SLEAHP), that learns both the structure and the parameters of HPLP from data. SLEAHP generates a large HPLP from a bottom clause obtained from a language bias as described in Bellodi and Riguzzi (2015) and subsequently applies a regularized version of PHIL on the generated HPLP to cut clauses with a small value of their parameter.
The paper is organized as follows: Sect. 2 defines general concepts underlying first order logic and logic programming. Section 3 introduces general notions of PLP and investigates inference in general PLP. Then, hierarchical probabilistic logic programs are described in Sect. 4. Sects. 5 and 6 present parameter and structure learning respectively. Related work is discussed in Sect. 7. Experiments are presented in Sects. 8 and 9 concludes the paper.

Background
Before describing probabilistic logic programming (PLP) and hierarchical PLP (HPLP) in the following sections, let us define some basic concepts of first order logic (FOL) and logic programming (LP). Readers familiar with these concepts can safely skip this section.

First order logic
Given a domain of interest, a constant identifies an individual entity in the domain. A variable refers to objects in the domain. A function symbol (or functor) univocally identifies an object of the domain as a function of n other (called arity) objects. A predicate symbol is a generic relation (which can be true or false) among n objects.
A term is a variable, a constant, or a functor, f, applied to terms, f (t 1 , t 2 , … , t n ) . An atom is a predicate, p, applied to terms, p(t 1 , t 2 , … , t n ) . A literal is an atom or its negation. A formula is built from atoms using universal and existential quantifier ( ∃, ∀ ) and logical connectives ( ¬, ∨, ∧, →, ↔ ). A FOL theory is a conjunction of formulas. An expression (literal, term or formula) is ground if it does not contain any variable. A clause is a disjunction of literals with all variables universally quantified with scope the entire disjunction. A clause with exactly one positive literal is called definite clause.
The Herbrand universe of a theory T is the set of all ground terms constructed by using the function symbols and constants in T. If the theory does not contain function symbols, the Herbrand universe is finite. The Herbrand base of a theory T is the set of all ground atoms obtained from predicates appearing in T and terms of its Herbrand universe. If the theory does not contain function symbols, the Herbrand base is finite as well. A Herbrand interpretation is an assignment of a truth value to all atoms in the Herbrand base. An interpretation is a (Herbrand) model of a theory T, if all the formulas of T are evaluated to true with respect the interpretation.

Logic programming
A definite logic program (LP), P, is a set of definite clauses represented as where the head of the clause, h, is an atom and its body, b 1 , … , b n , is a conjunction of atoms. A fact is a definite clause with empty body and is written h. For definite clauses, Herbrand models have an important property: the intersection of a set of Herbrand models of P is still a Herbrand model of P. The intersection of all Herbrand models is called the minimal Herbrand model (MHM) of P. Intuitively, the MHM is the set of all ground atoms that are entailed by the LP.
A Normal LP allows negative literals in the body of clauses. A normal clause is represented as where h, b 1 , … , b n , c 1 … c m are atoms. The literals not c i for i = 1, … , m are default negation iterals and are different from classical truth-functional logical negative literals ¬c i . There are different semantics for default negation including Clark's completion (1978), stable models (Gelfond & Lifschitz, 1988) and well-founded semantics (Przymusinski, 1989;Van Gelder et al., 1991). The well-founded semantics assigns a three-valued model to a program, i.e., it identifies a three-valued interpretation as the meaning of the program. In this paper we use the well-founded semantics for range restricted normal programs. In such programs, variables appearing in the head of a clause must appear in positive literals in its body.

Probabilistic logic programming
Probabilistic Logic Programming (PLP) combines Logic Programming with probability. PLP under the distribution semantics has been shown expressive enough to represent a wide variety of domains characterized by uncertainty Alberti et al., 2017). A PLP under the distribution semantics defines a probability distribution over normal logic programs called instances or worlds. Each world is assumed to have a total wellfounded model. The distribution is extended to queries and the probability of a query is computed by marginalizing the joint distribution of the query and the worlds. We consider in this paper a PLP language with a general syntax called Logic Programs with Annotated Disjunctions (LPADs) (Vennekens et al., 2004).
Programs in LPADs allow alternatives in the head of clauses. Each clause head is a disjunction of atoms annotated with probabilities. Consider a program P with p clauses: P = {C 1 , … , C p } . Each clause C i takes the form: where h i1 , … , h in i are logical atoms, b i1 , … , b im i are logical literals and i1 , … , in i are real numbers in the interval [0, 1] that sum up to 1. b i1 , … , b im i is indicated with body(C i ) . Note that if n i = 1 the clause corresponds to a non-disjunctive clause. We also allow clauses where ∑ n i k=1 ik < 1 : in this case, the head of the annotated disjunctive clause implicitly contains an extra atom null that does not appear in the body of any clause and whose annotation is 1 − ∑ n i k=1 ik . We denote by ground(P) the grounding of an LPAD P, i.e. the replacement of all variables with constants. We consider here programs without function symbols so ground(P) is always finite.
Each grounding C i j of a clause C i corresponds to a random variable X ij with values {1, … , n i } where n i is the number of head atoms of C i . The random variables X ij are independent of each other. An atomic choice (Poole, 1997) is a triple (C i , j , k) where C i ∈ P , j is a substitution that grounds C i and k ∈ {1, … , n i } identifies one of the head atoms. In practice (C i , j , k) corresponds to an assignment X ij = k.
A selection is a set of atomic choices that, for each clause C i j in ground(P), contains an atomic choice (C i , j , k) . Let us indicate with S P the set of all selections. A selection identifies a normal logic program l defined as l = {(h ik ∶− body(C i )) j |(C i , j , k) ∈ } . l is called an instance or possible program of P. Since the random variables associated to ground clauses are independent, we can assign a probability to instances: We consider only sound LPADs where, for each selection in S P , the well-founded model of the possible program l chosen by is two-valued. We write l ⊧ q to mean that the query q is true in the well-founded model of the possible program l . Since the wellfounded model of each possible program is two-valued, q can only be true or false in l .
Let L P denotes the set of all instances and P(L P ) the distribution over instances. The probability of a query q given an instance l is P(q|l) = 1 if l ⊧ q and 0 otherwise. The probability of a query q is given by (1) P(q) = ∑ l∈L P P(q, l) = ∑ l∈L P P(q|l)P(l) = ∑ l∈L P ∶l⊧q

P(l)
Example 1 Let us consider the UWCSE domain (Kok & Domingos, 2005) in which the objective is to predict the "advised_by" relation between students and professors. In this case a program for advised_by∕2 may be where project(C, A) means that C is a project with participant A, ta(C, A) means that A is a teaching assistant (TA) for course C and taught_by (C, B) means that course C is taught by B. The probability that a student is advised by a professor depends on the number of joint projects and on the number of courses the professor teaches where the student is a TA, the higher these numbers the higher the probability. The facts in the program state that harry is a student and ben is a professor. They have one joint project and ben teaches one course where harry is a TA. Suppose we want to compute the probability of q = advised_by(harry, ben) , the first clause has one grounding with head q where the body is true and the second clause has one grounding with head q where the body is true. The possible programs where q is true are those that contain at least one of these ground clauses independently of the presence of other groundings so

Inference in PLP
Computing the probabilities of queries by generating all possible programs is infeasible as the number of possible programs is exponential in the number of probabilistic clauses. Therefore much work has been devoted to the development of faster alternatives.
The most common approach for exact inference involves knowledge compilation: the program is converted to a Boolean language from which the computation of the probability of the query is fast.
In order to convert the program to a Boolean language, we must consider composite choices: a composite choice is a consistent set of atomic choices, i.e., a set where the same ground clause C i j appears once.
Given a composite choice , we can define the set of possible programs compatible with L as L = {l | ∈ S P , ⊆ } . Given a set K of composite choices, L K is the set of possible programs compatible with K and is defined as L K = ∪ ∈K L . A composite choice is an explanation for a query q if q is true in all possible programs of L . A set K of composite choices for a query q is covering if the set of all possible programs where q is true is L K .
Given a covering set of explanations K of a query q, we can build the Boolean formula professor(B), ta(C, A), taught_by (C, B). student(harry). professor(ben). project(pr 1 , harry).project(pr 1 , ben). taught_by(c 1 , ben). ta(c 1 , harry). P(advised_by(harry, ben)) = 0.3 ⋅ 0.6 + 0.3 ⋅ 0.4 + 0.7 ⋅ 0.6 = 0.72 where X C is a discrete random variable having values in {1, … , n} with n the number of heads of clause C. The probability that q is true is the probability that B(q) takes value true, given the probability distributions over the variables X C that are all independent of each other.
We can associate a probability to a composite choice as follows: ik . However, to compute P(q) it is not possible to use a summation even if B(q) is a disjunction because the individual disjuncts may share some of the variables. We must make the disjuncts mutually exclusive so that their probabilities can be summed up.
This task is solved by knowledge compilation: B(q) is converted into a language such as Binary Decision Diagrams (BDDs), Deterministic Decomposable Negation Normal Form (d-DNNF) or Sentential Decision Diagrams (SDD) from which computing P(q) is linear in the number of nodes. However, compilation to one of these languages has a cost of #P in the number of random variables.
In order to speed up inference, some languages impose restrictions. For example, PRISM (Sato, 1995) requires the program to be such that queries always have a pairwise incompatible covering set of explanations. In this case, once the set is found, the computation of the probability amounts to performing a sum of products. For programs to allow this kind of approach, they must satisfy the assumptions of independence of subgoals and exclusiveness of clauses, which mean that (Sato et al., 2018): 1. the probability of a conjunction (A, B) is computed as the product of the probabilities of A and B (independent-and assumption), 2. the probability of a disjunction (A; B) is computed as the sum of the probabilities of A and B (exclusive-or assumption).
These assumptions can be stated more formally by referring to explanations. Given an explanation , let RV( ) = {C i j |(C i , j , k) ∈ } be the set of random variables in . Given a set of explanations K, let RV(K) = ⋃ ∈K RV( ) . Two sets of explanations, K 1 and K 2 , are independent if RV(K 1 ) ∩ RV(K 2 ) = � and exclusive if, ∀ 1 ∈ K 1 , 2 ∈ K 2 , 1 and 2 are incompatible.
The independent-and assumption means that, when deriving a covering set of explanations for a goal, the covering sets of explanations K i and K j for two ground subgoals in the body of a clause are independent.
The exclusive-or assumption means that, when deriving a covering set of explanations for a goal, two sets of explanations K i and K j obtained for a ground subgoal h from two different ground clauses are exclusive. This implies that the atom h is derived using clauses that have mutually exclusive bodies, i.e., that their bodies are not true at the same time in any world.
These assumptions make the computation of probabilities "truth-functional" (Gerla, 2001) (the probability of conjunction/disjunction of two propositions depends only on the probabilities of those propositions), while in the general case this is false.
In Riguzzi and Swift (2011) the authors proposed the system PITA(IND, IND) that performs inference on programs that satisfy the independent-or assumption instead of the exclusive-or assumption: 3. the probability of a disjunction (A; B) is computed as if A and B were independent (independent-or assumption).
This means that, when deriving a covering set of explanations for a goal, two sets of explanations K i and K j obtained for a ground subgoal h from two different ground clauses are independent.
If A and B are independent, the probability of their disjunction is by the laws of probability theory.
In the next section we present Hierarchical Probabilistic Logic Programs (HPLPs) which, by construction, satisfy the independent-or assumption. This makes the computation of probabilities in such programs "truth-functional" (Gerla, 2001). With the independent-or assumption we can collect the contribution of multiple groundings of a clause. Therefore it is suitable for domains where entities may be related to a varying number of other entities. The differences in numbers are taken into account by the semantics. The assumption is not weaker or stronger than the exclusive-or one. It is probably less easy for users to write HPLP programs than PRISM programs but our aim is to devise an algorithm for automatically learning them.

Hierarchical probabilistic logic programs
Suppose we want to compute the probability of atoms for a single predicate r using a PLP. In particular, we want to compute the probability of a ground atom r( ) , where is a vector of terms. Let us call r the target predicate. This is a common situation in relational learning.
We consider a specific form of LPADs defining r in terms of input predicates (their definition is given as input and is certain) and hidden predicates, defined by clauses of the program. We discriminate between input predicates, which encapsulate the input data and the background knowledge, and target predicates, which are predicates we are interested in predicting. Hidden predicates are disjoint from input and target predicates. Each clause in the program has a single head atom annotated with a probability. Furthermore, the program is hierarchically defined so that it can be divided into layers. Each layer defines a set of hidden predicates in terms of predicates of the layer immediately below or in terms of input predicates. A generic clause C is of the form where ( , ) is a conjunction of literals for the input predicates. The vector represents variables appearing in the head of C and represents the variables introduced by input predicates. b i ( , ) for i = 1, … , m is a literal built on a hidden predicate. Variables in are existentially quantified with scope the body. Only literals for input predicates can introduce new variables into the clause. Moreover, all literals for hidden predicates must use the whole set of variables of the predicate in the head and input predicates , see predicate r 1_1 (A, B, C) Example 2. This restriction is imposed to allow independence among ground clauses associated to target/hidden predicates, see Sect. 4.1. Moreover, we require that the predicate of each b i ( , ) does not appear elsewhere in the body of C or in the body of any other clause, i.e each hidden predicate literal is unique in the program. We call hierarchical 1 3 PLP (HPLP) the language that admits only programs of this form . A generic hierarchical program is defined as follows: where r is the target predicate and r 1_1…_n is the predicate of b 1_1…_n , e.g. r 1_1 and r n_1 are the predicates of b 1_1 and b n_1 respectively. The bodies of the lowest layer of clauses are composed only of input predicates and do not contain hidden predicates, e.g C 2 and C 1_1_1 in Example 2. Note that here the variables were omitted except for rule heads. A generic program can be represented by a tree, Fig. 1 with a node for each clause and literal for hidden predicates. Each clause (literal) node is indicated with C ( b ) where is a sequence of integers encoding the path from the root to the node. The predicate of literal b is r which is different for every value of .
Example 2 Let us consider a modified version of the program of Example 1:   where publication(P, A, C) means that P is a publication with author A produced in project C and student/1, professor/1, project/2, ta/2, taughtby/2 and publication/3 are input predicates.
In this case, the probability of q = advised_by(harry, ben) depends not only on the number of joint courses and projects but also on the number of joint publications from projects.

Inference in HPLP
HPLPs satisfy the independent-or assumption as every hidden predicate appears just once in the body of a single clause and has all the variables of the clause as arguments. If we consider the grounding of the program, each ground atom for a hidden predicate appears in the body of a single ground clause. A ground atom for a hidden predicate may appear in the head of more than one ground clauses and each ground clause can provide a set of explanations for it but these sets do not share random variables. Moreover, HPLPs also satisfy the independent-and assumptions as every ground atom for a hidden predicate in the body of a ground clause has explanations that depend on disjoint sets of random variables.
Remember that atoms for input predicates are not random, they are either true or false in the data and they play a role in determining which groundings of clauses contribute to the probability of the goal: the groundings whose atoms for input predicates are false do not contribute to the computation of the probability.
Given a ground clause where is a path, we can compute the probability that the body is true by multiplying the probability of being true of each individual literals. If the literal is positive, its probability is equal to the probability of the corresponding atom. Otherwise it is one minus the probability of the corresponding atom. Therefore the probability of the body of G i is P(b i1 , … , b im ) = ∏m i=k P(b ik ) and P(b ik ) = 1 − P(a ik ) if b ik = ¬a ik . If P(b ik ) is a literal for an input predicate, P(b ik ) = 1 if it is true and P(b ik ) = 0 otherwise. We can use this formula because HPLPs satisfy the independent-and assumption.
Given a ground atom a for a hidden predicate, to compute P(a ) we need to take into account the contribution of every ground clause for the predicate of a . Suppose these clauses If we have two clauses, the contribution of each clause is computed as above. Note that, random variables associated with these contributions are independent since we allow hidden predicates in the body of each clause to use the whole set of variables appearing in the head and those introduced by input predicates. It is also worth noting that the probability of a disjunction of two independent random variables is Therefore, if we have two clauses their contributions are combined as follows: where we defined the operator ⊕ that combines two probabilities as follows . This operator is commutative and associative and we can compute sequences of applications as The operator is called probabilistic sum and is the t-norm of the product fuzzy logic (Hájek, 1998). Therefore, if the probabilistic program is ground, the probability of the example atom can be computed with the Arithmetic Circuit (AC) of Fig. 3, where nodes are labeled with the operation they perform and edges from ⊕ nodes are labeled with a probabilistic parameter that must be multiplied by the child output before combining it with ⊕ . We can thus compute the output p in time linear in the number of ground clauses. Note that, the AC can also be interpreted as a deep neural network with nodes (or neurons) whose (non P(a ) = 1 ⋅ P(body(G 1 )).

Fig. 3
Arithmetic circuit/neural net linear) activation functions are × and ⊕ . When the program is not ground, we can build its grounding obtaining a circuit/neural network of the type of Fig. 3, where however some of the parameters can be the same for different edges. In this case the circuit will exhibit parameter sharing.

3
The program tree is shown in Fig. 4 and the corresponding arithmetic circuit in Fig. 5 together with the values computed by the nodes. Inference in the AC of Fig. 5 can be computed at https:// cplint. eu/ examp le/ phil/ uwcse. pl by running the query

Building the arithmetic circuit
The network can be built by performing inference using tabling and answer subsumption using PITA(IND,IND) (Riguzzi, 2014;Riguzzi & Swift, 2010). Suppose we want to compute the probability of a query, which is typically the target predicate in hierarchical PLP.
To speed the computation, PITA(IND,IND) applies a program transformation that adds an extra argument to the query and to each subgoal of the program. The probability of answers to each subgoal is stored in the extra argument. In fact, when a subgoal returns, the inference_hplp(advisedby(harry, ben), ai, Prob).

adivsedby(harry, ben)
G 1 r 1 1 (harry, ben, pr1) extra argument will be instantiated to the probability of the ground atom that corresponds to the subgoal without the extra argument. Note that because of its structure, when a subgoal returns the original arguments are guaranteed to be instantiated in hierarchical PLP. Moreover, suitable literals are also added to the body of clauses for combining the extra arguments of the subgoals: the probability to be assigned to the extra argument of the head atom is the product of probabilities of the answers for the subgoals in the body.
Since a subgoal may unify with the head of multiple groundings of multiple clauses, we need to combine the contributions of these groundings. This is achieved by means of tabling with answer subsumption (Swift & Warren, 2012). Tabling keeps a store of subgoals encountered in a derivation together with answers to these subgoals. If one of the subgoals is encountered again, its answers are retrieved from the store rather than recomputing. Answer subsumption (Swift & Warren, 2012) is a tabling feature that, when a new answer for a tabled subgoal is found, combines old answers with the new one. This combination operator is the probabilistic sum in PITA (IND, IND). Computation by PITA(IND, IND) is thus equivalent to the evaluation of the program arithmetic circuit.
We use an algorithm similar to PITA(IND,IND) to build the Arithmetic circuit or the neural network instead of just computing the probability. To build the arithmetic circuit, it is enough to use the extra argument for storing a term representing the circuit instead of the probability and changing the implementation of the predicates for combining the values of the extra arguments in the body and for combining the values from different clause groundings. The result of inference would thus be a term representing the arithmetic circuit.
There is a one to one correspondence between the data structure built by PITA(IND,IND) and the circuit so there is no need of an expensive compilation step. The term representing the AC in Fig where the operators and and or represent the product × and the probabilistic sum ⊕ respectively. Once the ACs are built, parameter learning can be performed over them by applying gradient descent/back-propagation or expectation maximization. The following section presents these algorithms and their regularized versions. Because of the constraints imposed on HPLPs, writing these programs may be unintuitive for human so we also propose, in Sect. 6, an algorithm for learning both the structure and the parameters from data.

Parameter learning
In this section, we present an algorithm, called parameter learning for hierarchical probabilistic logic programs (PHIL), which learns HPLP's parameters from data. We present two versions of PHIL. The first, deep PHIL (DPHIL), learns HPLP's parameters by applying a gradient-based method and the second, expectation maximization PHIL (EMPHIL), applies expectation maximization (EM). Different regularized versions of each algorithm will also be presented. The parameter learning algorithm can be defined as follows:

3
Definition 1 (Parameter learning problem) Given a HPLP H with parameters = { 1 ⋯ n } , an interpretation I defining input predicates and a training set where each e i is a ground atom for the target predicate r, find the values of that maximize the log likelihood (LL) where P(e i ) is the probability assigned to e i by H ∪ I.
Maximizing the LL can be equivalently seen as minimizing the sum of cross entropy errors err i for all the examples where e i is an example with y i indicating its sign ( y i = 1 if the example is positive and y i = 0 otherwise) and p i indicating the probability that the atom is true.
DPHIL and EMPHIL minimize the cross entropy error or equivalently maximize the log-likelihood of the data. These algorithms (and their regularized variants) are presented in Sects. 5.1 and 5.2 respectively.

Gradient descent: DPHIL
DPHIL computes the gradient of the error err (4) with respect to each parameter and updates the parameters. We do this by building an AC for each example and by running a dynamic programming algorithm for computing the gradient. To simplify gradient computation, we transformed the AC of Fig. 3 as follows: weight, i , labeling arcs from ⊕ to × nodes, are set as children leaves of × nodes and shared weights are considered as individual leaves with many × parents. Moreover, negative literals are represented by nodes of the form not(a) with the single child a. The AC in Fig. 5 is converted into the one shown in   Fig. 6 highlights parameters sharing which is more convenient for illustrating the algorithm. The standard gradient descent algorithm computes gradients at each iteration using all examples in the training set. If the training set is large, the algorithm can converge very slowly. To avoid slow convergence, gradients can be computed using a single example, randomly selected in the training set. Even if in this case the algorithm can converge quickly, it is generally hard to reach high training set accuracy. A compromise often used is mini batch stochastic gradient descent (SGD): at each iteration a mini batch of examples is randomly sampled to compute the gradient. This method usually provides fast converge and high accuracy. DPHIL, shown in Algorithm 1, implements SGD.
After building the ACs and initializing the weights, the gradients and the moments, line 2-6, DPHIL performs two passes over each AC in the current batch, line 8-15. In the first, the circuit is evaluated so that each node is assigned a real value representing its probability. This step is bottom-up or forward (line 12) from the leaves to the root. The second step is backward (line 13) or top-down, from the root to the leaves, and computes the derivatives of the loss function with respect to each node. At the end of the backward step, G contains the vector of the derivatives of the error with respect to each parameter. Line 16 updates the weights.
The parameters are repeatedly updated until a maximum number of steps, MaxIter , is reached, or until the difference between the LL of the current and the previous iteration drops below a threshold, , or the difference is below a fraction of the current LL. Finally, function updatetheory (line 18) updates the parameters of the theory.
Examples ← BuildACs(T heory) Build the set of ACs 3: for i ← 1 → |T heory| do Initialize weights,gradient and moments vector 4: end for 7: Iter ← 1 8: repeat 9: LL ← 0 10: Batch ← NextBatch(Examples) Select the batch according to the strategy 11: for all node ∈ Batch do 12: P ← Forward(node) 13: Backward(G, − 1 P , node) 14: LL ← LL + log P 15: end for 16: return F inalT heory 20: end function We reparametrized the program using weights between -∞ and + ∞ and expressing the parameters using the sigma function i = (W i ) = 1 1+e −W i (5) . In this way we do not have to impose the constraint that the parameters are in [0,1].
Function forward 2 is a recursive function that takes as input an AC node (root node) and evaluates each node from the leaves to the root, assigning value v(n) to each node n. If node = not(n) , p = FORWARD(n) is computed and 1 − p is assigned to v(node), lines 2-5. If node = ⨁ (n 1 , … n m ) , function v(n i ) = FORWARD(n i ) is recursively called on each child node, and the node value is given by is recursively called on each child node, and the node value is given by , lines 14-20.

Algorithm 2 Function Forward
return v(node) 5: else 6: Compute the output example by recursively call Forward on its sub AC 7: if node = (n 1 , . . . n m ) then node 8: for all n j do 9: v(n j ) ← Forward(n j ) 10: return v(node) 13: else and Node 14: return v(node) 20: end if 21: end if 22: end if 23: end function Procedure backward takes an evaluated AC node and computes the derivative of the contribution of the AC to the cost function, err = −y log(p) − (1 − y) log(1 − p) where p is the probability of the atom representing the example. This derivative is given in Eq. 6, see "Appendix 10.1" for the proof.
with where pa n indicate the parents of n.
This leads to procedure backward shown in Algorithm 3 which is a simplified version for the case v(n) ≠ 0 for all ⨁ nodes. To compute d(n), Backward proceeds by recursily propagating the derivative of the parent node to the children. Initially, the derivative of the error with respect to the root node, − 1 v(r) , is computed. If the current node is not(n), with derivative AccGrad, the derivative of its unique child, n, is −AccGrad , line 2-3. If the current node is a ⨁ node, with derivative AccGrad, the derivative of each child, n, is computed as follows: and back-propagated, line 5-9. If the current node is a × node, the derivative of a non leaf child node n is computed as follows: the one for a leaf child node n = i is and back-propagated, line 11-15. For leaf node, i.e a i node, the derivative is accumulated, line 20.
for all n j do 7:

14:
Backward(G, AccGrad 1 , n j ) 15: end for 16: After the computation of the gradients, weights are updated. Standard gradient descent adds a fraction , called learning rate, of the gradient to the current weights. is a value between 0 and 1 that is used to control the parameter update. Small can slow down the algorithm and find local minimum. High avoids local minima but can swing around global minima. A good compromise updates the learning rate at each iteration combining the advantages of both strategies. We use the update method Adam, adaptive moment estimation (Kingma & Ba, 2014), that uses the first order gradient to compute the exponential moving averages of the gradient and the squared gradient. Hyper-parameters 1 , 2 ∈ [0, 1) control the exponential decay rates of these moving averages. These quantities are estimations of the first moment (the mean M 0 ) and the second moment (the uncentered variance M 1 ) of the gradient. The weights are updated with a fraction, current learning rate, of the combination of these moments, see Procedure updateweightsadam in Algorithm 4.

DPHIL regularization: DPHIL 1 and DPHIL 2
In deep learning and machine learning in general, a technique called regularization is often used to avoid over-fitting. Regularization penalizes the loss function by adding a regularization term for favoring smaller parameters. In the literature, there exist two main regularization techniques called L 1 and L 2 regularization that differ on the way they penalize the loss function. While L 1 adds to the loss function the sum of the absolute value of the parameters, L 2 adds the sum of their squares. Given the loss function defined in Eq. 4, the corresponding regularized loss function are given by Eqs. 8 and 9 .
where the regularization hyper-parameter determines how much to penalize the parameters and k the number of parameters. When is zero, the regularization term becomes zero and we are back to the original loss function. When is large, we penalize the parameters and they tend to become small. Note also that we add the regularization term from the initial loss function because we are performing minimization. The main difference between these techniques is that while L 1 favor sparse parameters (parameters closer to zero) L 2 does not. Moreover in general, L 1 (resp. L 2 ) is computationally inefficient(resp. efficient due to having analytical solutions). Now let us compute the derivative of the regularized error with respect to each node in the AC. The regularized term depends only on the leaves (the parameters i ) of the AC. So the gradients of the parameters can be calculated by adding the derivative of the regularization term with respect to i to the one obtained in Eq. 7. The regularized errors are given by: In order to implementent the regularized version of DPHIL ( DPHIL 1 and DPHIL 2 ), the forward and the backward passes described in algorithms 2 and 3 remain unchanged. The unique change occurs while updating the parameters in the Algorithm 4. UpdateWeights-Adam line 6 becomes

Expectation maximization: EMPHIL
We propose another algorithm, EMPHIL, that learns the parameters of HPLP by applying expectation maximization (EM). The algorithm maximizes the log-likelihood LL defined in Eq. 3 by alternating between an expectation (E) and a maximization (M) step. E-step computes the expected values of the incomplete data given the complete data and the current parameters and the M-step determines the new values of the parameters that maximize the likelihood. Each iteration is guaranteed to increase the log-likelihood. Given a hier- . These values are aggregated over all examples obtaining Then the M-step computes i by maximum likelihood, i.e. i = Therefore to perform EMPHIL, we have to compute P(X ij = 1|e) for each example e by performing the belief propagation algorithm over the factor graph associated with the AC. Message passing is then applied over the AC. We show, in the Appendix 10.2, that the messages in the bottom-up, c N , and top-down, t N , are respectively given as where v(N) is the value of node N, P is its parent and the operator ⊖ is defined as c N is performed by applying the forward pass described in Algorithm 2.
Since the belief propagation algorithm (for ACs) converges after two passes, we can compute the unnormalized belief of each parameter during the backward pass by multiplying t N by v(N) (that is all incoming messages). Algorithm 5 performs the backward pass of belief propagation algorithm and computes the normalized belief of each parameter, i.e t N . It also computes the expectations N 0 and N 1 for each parameter, lines 17-19.
if node = (n 1 , . . . n m ) then node 6: for all child n i do 7:

14:
BackwardEM(t n i , n i , B, Count) 15: end for 16: end if 23: end procedure EMPHIL is then presented in Algorithm 6. After building the ACs (sharing parameters) for positive and negative examples and initializing the parameters, the expectations and the counters, lines 2-5, EMPHIL proceeds by alternating between expectation step 8-13 and maximization step 13-24. The algorithm stops when the difference between the current value of the LL and the previous one is below a given threshold or when such a difference relative to the absolute value of the current one is below a given threshold. The theory is then updated and returned (lines 26-27).

3
Algorithm 6 Function EMPHIL. for all node ∈ Examples do 10: P ← Forward(node) 11: BackwardEM(1, node, N 0 , N 1 ) 12: LL ← LL + log P 13: end for Maximization step 14: for i ← 1 → |T heory| do 15: switch T ype 16: case 0: 18: case 2: In this section, we propose three regularized versions of EMPHIL. As described in Li et al. (2005), EM can be regularized for two reasons: first, for highlighting the strong relationship existing between the incomplete and the missing data, assuming in the standard EM algorithm, and second for favoring smaller parameters. We regularize EMPHIL mainly for the latter reason. As in gradient descent regularization, we define the following regularization objective functions for L 1 and L 2 respectively. where = i , N 0 and N 1 are the expectations computed in the E-step (see Eqs. 14 and 15 ). The M-step aims at computing the value of that maximizes J( ) . This is done by solving the equation J( ) = 0 . The following theorems give the optimal value of in each case (see appendix 10.3 for the proofs). So algorithms EMPHIL (the standard EM), EMPHIL 1 , EMPHIL 2 and EMPHIL B differ in the way they update the parameters in the M-step, Algorithm 6 lines 15-22.

Structure learning
In the previous section, the structure of an HPLP was given and the task was to learn its parameters from data. Since hidden predicates could be difficult to interprete for humans in many domains of interest, providing the structure of an HPLP may be unintuitive and tedious even for experts in the domains. In this section, we propose an algorithm for learning both the structure and the parameters from data. The structure is learned by mean of predicate invention. The Structure learning problem is defined as follows: where P(e i k ) is the probability assigned to e i k (an example from the interpretation k). Note that a mega-example, also called mega-interpretation, describes one world in the domain of interest. It includes positive and negative examples for the target predicate and fact for input predicates. It is called mega because it can include more than one example, positive or negative. A mega-example is similar to a partial interpretation in a context-dependent partial interpretation defined in Law et al. (2016) SLEAHP learns HPLPs by generating an initial set of bottom clauses (from a language bias) from which a large HPLP is derived. Then SLEAHP performs structure learning by using parameter learning. Regularization is used to bring as many parameters as possible close to 0 so that their clauses can be removed, thus pruning the initial large program and keeping only useful clauses.

Language bias
An initial set of clauses are generated according to a language bias expressed by means of mode declaration. A mode declaration (Muggleton, 1995) is a set of head, modeh(rec, s), and body, modeb(rec, s), declarations where rec is a recall and s, called schema, is a template for ground literals in the head or body of clauses. The schema contains special place-marker terms of the form #type, +type and -type, which stand, respectively, for ground terms, input variables and output variables of a type. The language defined by a set of mode declarations M is indicated with L(M). In clauses from L(M), an input variable in a body literal of a clause must be either an input variable in the head or an output variable in a preceding body literal in the clause. The head atoms (resp. body literals) of clauses are obtained from some head (resp. body) declaration in M by replacing all # place-markers with ground terms and all + (resp. -) place-markers with input (resp. output) variables. This type of mode declarations is extended with place-marker terms of the form -# which are treated as # when defining L(M ) but differ in the creation of the bottom clauses, see Sect. 6.2.1.

Description of the algorithm
In order to learn an HPLP, SLEAHP (Algorithm 7) initially generates a set of bottom clauses, line 2. Then an n-ary tree, whose nodes are literals appearing in the head or in the body of bottom clauses is constructed, line 3. Then an initial HPLP is generated from the tree, line 4 and a regularized version of PHIL is performed on the initial program. Finally clauses with very small probabilities are removed, line 5. The components of this algorithm are described in the following subsections.

Clause generation
Algorithm 8 generates a set of bottom clauses as in Progol (Muggleton, 1995). Given a language bias and a set of mega-examples, Algorithm 8 generates a set of bottom clauses. These bottom clauses are then used for creating a tree of literals, Algorithm 9.
Consider the target predicate r/ar, the predicate of the schema s associates with a fact modeh(rec, s) in the language bias. In order to create a clause, a mega-example I and an answer h for the goal schema(s) are randomly selected with replacement, I from the available set of mega-examples and h from the set of all answers found for the goal schema(s) in I. schema(s) is the literal obtained from s by replacing all place-markers with distinct variables X 1 ⋯ X ar . Then h is saturated using Progol's saturation method as described in Muggleton (1995).
This cycle is repeated for a user-defined number NS of times and the resulting ground clause BC = h ∶− b 1 , … , b m is then processed to obtain a probabilistic program clause by replacing each term in a + orplace-marker with a variable, using the same variable for identical terms. Terms corresponding to # or -# place-markers are instead kept in the clause. This process is repeated for a number NInt of input mega-examples and a number NA of answers, thus obtaining NInt ⋅ NA bottom clauses. for all modeh declarations modeh(rec, s) with P/Ar predicate of s do 5:

Tree generation
Since an HPLP can be mapped into a tree as described in Sect. 4, we create a tree whose nodes are literals appearing in the head or in the body of bottom clauses generated in the previous section. Every node in the tree share at least one variable with its parent. The tree is then converted into the initial HPLP, see Sect. 6.2.3.
To create the tree, Algorithm 9, starts by considering each Bottom clause in turn, line 3. Each bottom clause creates a tree, lines 4 -11. Consider the following bottom clause where Arg and Arg i are tuples of arguments and b i (Arg i ) (with arguments Arg i ) are literals. Initially r(Arg) is set as the root of the tree, lines 5. Literals in the body are considered in turn from left to right. When a literal b i (Arg i ) is chosen, the algorithm tries to insert the literal in the tree, see Algorithm 10. If b i (Arg i ) cannot be inserted, it is set as the right-most child of the root. This proceeds until all the b i (Arg i ) are inserted into a tree, lines 6-11. Then the resulted tree is appended into a list of trees (initially empty), line 11, and the list is merged obtaining a unique tree, 13. The trees in L are merged by unifying the arguments of the roots. if not(insertTree(T ree, b i (Arg i ))) then

hasposition(B, C) and publication(D, B)) are set as the children of advised_by(A, B) because their arguments share variable A (resp. B). Then the predicate publication(D, E) is set as a child of publication(D, B) because their arguments share variable D, in_phase(A, F) as a child of advised_by(A, B) (they share variable A), taughtby(G, E, H) as a child of publication(D, E) (they share variable E) and finally ta(I, J, H) as a child of taughtby(G, E, H) (they share variable H).
The corresponding tree is shown in Fig. 7.

HPLP generation
Once the tree is built, an initial HPLP is generated at random from the tree. Before describing how the program is created, note that for computation purpose, we considered clauses with at most two literals in the body. This can be extended to any number of literals. Algorithm 11 takes as input the tree, Tree, an initial probability, 0 ≤ MaxProb ≤ 1 , a rate, 0 ≤ rate ≤ 1 , and the number of layers 1 ≤ NumLayer ≤ height(Tree) of HPLP we are about to generate. Let ( k = X 1 , ⋯ , X k , l = Y 1 , ⋯ , Y l , t = Z 1 , ⋯ , Z t and m = W 1 , ⋯ , W m ) be tuples of variables. In order to generate the initial HPLP, the tree is visited breadth-first, starting from level 1. For each node n i at level Level ( 1 ≤ Level ≤ NumLayer ), n i is visited with probability Prob. Otherwise n i and the subtree rooted at n i are not visited. Prob is initialized to Max-Prob, 1.0 by default. The new value of Prob at each level is Prob * rate where rate ∈ [0, 1] is a constant value, 0.95 by default. Thus the deeper the level, the lower the probability value. Supposing that n i is visited, two cases can occur: n i is a leaf or an internal node.
If n i = b i ( l i ) is a leaf node with parent Parent i , we consider two cases. If Parent i = r( k ) (the root of the tree), then the clause is generated, lines 9-11. Otherwise is generated. hidden path ( t i ) is the hidden predicate associated with Parent i and path is the path from the root to Parent i , lines 13-16.
If n i = b i ( l i ) is an internal node having parent Parent i , we consider two cases. If Parent i is the root, the clause is generated. hidden_i( t i ) is associate with b_i( l i ) and t i = k ∪ l i , lines 20-21. If Parent i is an internal node with the associated hidden predicate hidden path ( t ) then the clause is generated where m i = t ∪ l i , lines 24-25.
The generated clause C is added into a list (initially empty), line 28, and the algorithm proceeds for every node at each level until layer NumLayer is reached or all nodes in the tree are visited, line 5. Then hidden predicates appearing in the body of clauses without associated clauses (in the next layer) are removed, line 34, and the program is reduced, line 35. To reduce the program, a set of clauses (without hidden predicates in the body) that are renaming of each other are reduced to a single representative of the set. Note that, atoms in the head of the generated clauses are all annotated with probability 0.5 for exposition purposes. These values are replaced by random values between 0 and 1 at the beginning the parameter learning process, see Algorithm 7 line 5.  if n i is a leaf node then n i is a leaf node 9: if P arent i is the root node then 10: 12: if P arent i is the root node then 20:

Related work
PHIL is related to ProbLog2, (Fierens et al., 2015) which is an algorithm for learning probabilistic logic programs parameters using Expectation Maximization. PHIL and ProbLog2 differ from the Probabilistic language they use. While PHIL uses hierachical PLP  which is a restriction of Logic Programs with Annotated Disjunctions (LPADs) (Vennekens et al., 2004), ProbLog2 uses an extended version of the Prob-Log language in which both facts and rules can be annotated with probabilities. To perform inference, PHIL converts a program into ACs and evaluates the ACs bottom-up. ProbLog2 converts a program into a weighted boolean formulas (WBFs). The WBFs are then converted into deterministic-decomposable negation normal forms (d-DNNFs), see Darwiche (2004), which are in turn converted into arithmetics circuits for performing Weighted Model Counting (WMC) (Sang et al., 2005). ProbLog2 performs parameter learning using EM on top of the circuits.
PHIL related to EMBLEM (Expectation maximization over binary decision diagrams for probabilistic logic programs; Bellodi & Riguzzi, 2013) which is an algorithm for learning PLP parameters by applying EM over Binary decision diagram (Akers, 1978).
EMPHIL, ProbLog2 and EMBLEM are strongly related as they all apply the EM algorithm. The main difference among these languages is the computation cost for compiling the program. In EMBLEM and ProbLog2, the generation of BDDs and ACs respectively is #P in the number of ground clauses while the ACs generation in HPLP is linear in the number of ground clauses. Once the BDDs for EMBLEM and the ACs for HPLP and Prob-Log2 are generated, inference is linear in the number of nodes in BDDs or circuits.
PHIL is also related to Tuffy Niu et al. (2011) which is a machine learning algorithm that performs MAP/Marginal inference and parameter learning on Markov logic networks (MLNs). Differently from PHIL which uses a top-down approach to perform the groundings of clauses (as in ProLog), Tuffy uses a bottom-up approach. In fact, Tuffy expresses grounding as a sequence of SQL queries similarly to waht is done in Datalog. Indeed, it relies on a RDBMS such as PostgreSQL (Drake & Worsley, 2002). Each predicate in the input MLN is associated with a relation in the RDBMS where each row represents a ground atom. Given a MLN query (which could be a conjunction of predicates), Tuffy produces a SQL query that joins together the relations corresponding to the predicates in MLN to produce the atom of the ground clauses. While PHIL relies on ACs, generated from a PLP, to perform inference, Tuffy relies on the WalkSAT algorithm (Kautz et al., 1996). Two algorithms to perform inference are available in the Tuffy systems: the default WalkSAT and a modified version of MAxSAT (Raman et al., 1998) algorithm called SweepSAT (Niu et al., 2011). To learn Markov logic networks weights, Tuffy implements the Diagonal Newton (DN) algorithm described in Lowd and Domingos (2007). The DN algorithm is a gradient descent-based algorithm which relies on the Hessian matrix to update the weights. In fact at each iteration DN uses the inverse of the diagonalized Hessian to compute the gradient. Tuffy's parameter learning algorithm is related more to DPHIL than EMPHIL since both are based on gradient descent.
SLEAHP is related to SLIPCASE (Bellodi & Riguzzi, 2012), SLIPCOVER (Bellodi & Riguzzi, 2015) which are algorithms for learning general PLP, and LIFTCOVER, Nguembang Fadja and Riguzzi (2018) which is an algorithm for learning liftable PLP, where the head of all the clauses in the program share the same predicate (the target predicate) and their bodies contain only input predicates. Therefore, Liftable PLP can be seen as a restriction of HPLP where hidden predicates and clauses are not allowed. Both LIFTCOVER and SLIPCOVER learn the program by performing a search in the space of clauses and then refine the search by greedily adding refined clauses into the theory, while SLIPCASE performs search by theory revision. SLIPCASE and SLIPCOVER use EMBLEM to tune the parameters and compute the log-likehood of the data and LIFTCOVER uses a Gradient descent or EM-based method.
SLEAHP is also related to PROBFOIL+ (Raedt et al., 2015) which is a generalization of mFOIL (Džeroski, 1993). PROBFOIL+ learns both the structure and the parameters of ProbLog programs by performing a hill climbing search in the space of Programs. It consists of a covering loop in which one rule is added to the program at each iteration. The covering loop ends when a condition based on a global scoring function is satisfied. The rule added at each iteration is obtained by a nested loop which iteratively adds literals to the body of the rule performing a beam search in the space of clauses (as in mFOIL) guided by a local scoring function.
Similar to LIFTCOVER, SLIPCOVER and PROBFOIL+, SLEAHP initially performs a search in the space of clauses but , differently from these systems, it creates a tree of literals from which a large HPLP is generated. Then a regularized version of PHIL is applied to the HPLP to tune the parameters and discard irrelevant rules.
SLEAHP finds prominent bottom clauses from which a tree of literals is built. To generate a bottom clause, SLEAHP randomly selects a mega-example and randomly selects in it an atom for the target predicate. Then, bottom clauses are buit as in Progol (Muggleton, 1995). Much work exists on using stochastic components for learning the structure of logic programs, see Železnỳ et al. (2006), Železnỳ et al. (2004) and Železnỳ et al. (2002). In Železnỳ et al. (2006) and Železnỳ et al. (2004), the authors present an empirical study of randomised restarted search in Inductive Logic Programming. Specifically, a search in the clause subsumption lattice is performed by applying different strategies. These strategies all include restarts and differ mainly on the starting clause and the refinement operation. Both algorithms described in Železnỳ et al. (2006) and SLEAHP saturate the bottom clauses as in Progol, see Muggleton (1995). Differently from Železnỳ et al. (2006), SLEAHP generates a tree of literals from the bottom clauses instead of performing clause refinements. In Železnỳ et al. (2002) the authors propose an algorithm called Randomized Rapid Restarts (RRR). To randomize the lattice search of clauses, the algorithm begins by randomly selecting a starting clause, then it generates deterministic refinements through a non traditional refinement operator producing a radial lattice, and finally it returns the first clause satisfying conditions on minimal accuracy. This operation is done for a maximum number of restarts.
SLEAHP performs a form of predicate invention: the hidden predicates represent new predicates that are not present in the data. Much work has been devoted to predicate invention. In  and , for example, the authors propose algorithms that are able to perform predicate invention. Both proposals rely on a form of language bias based on metarules, i.e., rule skeletons where the predicate of literals is a not specified. Learning is then performed by metainterpretation and new predicates are introduced when applying the metarules. These works focus on learning programs: the learned theory often involve recursion. Instead, the language of HPLP is less expressive, as recursion is not allowed, and the focus here is on learning probabilistic classifiers, i.e., functions returning the class of an individual given what is known about him.
SLEAHP performs structure learning of HPLP by parameter learning, i.e, it initially generates a large HPLP and performs a regularized parameter learning on it. Then, clauses providing no contribution are pruned, typically those with low values of probabilities. Learning the structure of models by exploiting the regularization of parameter learning has been successfully investigated, e.g. in Lee et al. (2007) and Huynh and Mooney (2011). In Lee et al. (2007), the authors present an algorithm for learning the structure of log-linear graphical model, e.g Markov networks (MNs), using L 1 -regularization. The algorithm starts with a reasonable number of features and progressively introduces new features into the model letting the L 1 -regularization scheme discard features providing no contribution, i.e. those with small weight, via an optimization process. In Huynh and Mooney (2011), the authors present an algorithm named OSL for learning the structure of MLNs by exploiting parameter estimation. OSL starts with a (possibly) empty set of clauses and, at each step, new clauses are added in order to improve the prediction of the model. Similarly to Lee et al. (2007), regularization is applied to the new set of clauses to discard clauses which are not useful in the long run. SLEAHP and algorithms described in Lee et al. (2007) and Huynh and Mooney (2011) all use regularization in parameter learning, i.e. L 1 and L 2 for SLEAHP and L 1 for the others, to perform structure learning. But differently from Lee et al. (2007) and Huynh and Mooney (2011), which apply the regularization at each step, SLEAHP applies the regularization once on a large HPLP. This can be done since inference in HPLP is cheaper.
To perform structure learning, SLEAHP learns the parameters of an AC obtained from an initially large HPLP. Learning the structure of (graphical) models by exploiting ACs has also been explored in the literature, see Lowd and Rooshenas (2013) and Rooshenas and Lowd (2016). In Lowd and Rooshenas (2013) the authors describe an algorithm, named ACMN, which learns the structure of MNs using ACs. ACMN performs a greedy search in the structure space. It initially considers a MN structure which includes all single-variable features. The MN is then compiled into an AC and the algorithm gradually updates the AC, by splitting features, without recompiling it from scratch. In Rooshenas and Lowd (2016) the authors present an algorithm called DACLearn which discriminatively learns an AC which represents a conditional distribution. DACLearn searches in the combinatorial space of conjunctive features by greedily selecting features that increase the conditional likelihood. While SLEAHP, at each step, updates the parameters of the AC (i.e of the leaves), both ACMN and DACLearn update the AC that represents the model by adding features. The ACs learned in SLEAHP and DACLearn represent a conditional distribution instead of a full joint distribution as the one modelled by ACMN. In ACMN and DACLearn the AC is composed of sums and products while in SLEAHP the AC is composed of probabilistic sums, (Eq. 2) and products.
Ground HPLPs can also be seen as neural network (NNs) where the nodes in the arithmetic circuits are the neurons and the activation function of nodes is the probabilistic sum. Parameter learning by DPHIL is in fact performed as in NNs by backpropagation. Combining logical languages with NNs is an active research field, see De Raedt et al. (2019) for an excellent review. For example, relational neural networks (RelNNs; Kazemi & Poole, 2018) generalize relational logistic regression (RLR) by stacking multiple RLR layers together. The authors provide strong motivations for having multiple layers, highlighting in particular that the layers improve the representation power by allowing aggregation at different levels and on different object populations. HPLPs benefit from the same advantage. Moreover, HPLPs keep a semantics as probabilistic logic programs: the output of the network is a probability according to the distribution semantics.
In Sourek et al. (2016) the authors discuss an approach for building deep neural networks using a template expressed as a set of weighted rules. Similarly to our approach, the resulting network has nodes representing ground atoms and nodes representing ground rules and the values of ground rule nodes are aggregated to compute the value of atom nodes. Differently from us, the contribution of different ground rules are aggregated in two steps, first the contributions of different groundings of the same rule sharing the same head and then the contributions of groundings for different rules, resulting in an extra level of nodes between the ground rule nodes and the atom nodes. The proposal is parametric in the activation functions of ground rule nodes, extra level nodes and atom nodes. In particular, the authors introduce two families of activation functions that are inspired by Lukasiewicz fuzzy logic. In this paper we show that by properly restricting the form of weighted rules and by suitably choosing the activation functions, we can build a neural network whose output is the probability of the example according to the distribution semantics.
MLN-BC and MLN-BT learn Markov logic networks (MLNs) considering a series of relational functional approximation problems. Two kinds of representations for the gradients on the pseudo-likelihood are used: clause-based (MLN-BC) and tree-based (MLN-BT). At each gradient step, MLN-BC simply learns a set of Horn clauses with an associated regression value, while MLN-BT consider MLNs as a set of relational regression trees, in which each path from the root to a leaf can be seen as a clause and the regression values in the leaves are the clause weights. The goal is to minimize the squared error between the potential function and the functional gradient over all training examples.
RDN-B learns Relational Dependency Networks (RDNs) considering a series of relational function approximation problems using Friedman's functional gradient-based boosting. RDN-B approximates the joint distribution of a relational model to a product of conditional distributions over ground atoms. It considers the conditional probability distribution of each predicate as a set of relational regression trees each of which approximates the corresponding gradient. These regression trees serve as the individual components of the final potential function. They are learned such that at each iteration the new set of regression trees aims at maximizing the likelihood. The different regression trees provide the structure of the conditional distributions while the regression values at the leaves form the parameters of the distributions.
Section 8.1 describes the various datasets used in the experiments. Then in Sect. 8.2, PHIL (DPHIL and EMPHIL) and its regularized versions was compared with ProbLog2 and EMBLEM. Finally, Sect. 8.4 presents experiments comparing SLEAHP with SLIP-COVER, PROBFOIL+, MLN-BC, MLN-BT and RDN-B. Moreover, we provide phil in cplint on SWISH 4 which is a web application for reasoning and learning with probabilistic logic programs. In the application, the user can write hierarchical PLPs and perform inference, parameter and structure learning without installing phil locally.

Datasets
The proposed systems were experimented on five well known datasets: The Mutagenesis dataset (Srinivasan et al., 1996) contains information about a number of aromatic/heteroaromatic nitro drugs, and their chemical structures in terms of atoms, bonds and other molecular substructures. For example, the dataset contains atoms of the form bond (compound, atom1, atom2, bondtype) which states that in the compound, a bond of type bondtype can be found between the atoms atom1 and atom2. The goal is to predict the mutagenicity of the drugs which is important for understanding carcinogenesis. The subset of the compounds having positive levels of log mutagenicity are labeled active (the target predicate) and the remaining ones are inactive.
The Carcinogenesis dataset (Srinivasan et al., 1997) is similar to Mutagenesis dataset whose objective is to predict the carcinogenicity of molecular.
The Mondial dataset (May, 1999) contains data from multiple geographical web data sources. The goal is to predict the religion of a country as Christian (target predicate christian_religion(A)).
The UWCSE dataset (Beerenwinkel et al., 2005) contains information about the Computer Science department of the University of Washington. The goal is to predict the target predicate advisedby(A, B) expressing that a student (A) is advised by a professor (B).
In the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD15) dataset 5 the task is to predict gender of a client based on E-commerce data. The dataset was also used in Kazemi and Poole (2018).

Experiments: parameter learning
In this section, we present experiments comparing PHIL with EMBLEM, ProbLog2 and Tuffy on the five datasets. We manually build the initial HPLPs for the datasets UWCSE and PAKDD15 and generate those for Mutagenesis, 6 Carcinogenesis and Mondial using SLEAHP. The following hyper-parameters were used: As stop conditions, we use = 10 −4 , = 10 −5 , MaxIter = 1000 for PHIL and EMBLEM and MIN_IMPROV = 10 −4 , MaxIter = 1000 for ProbLog2. We use the Adam hyperparameters 1 = 0.9 , 2 = 0.999 , = 0.9 , ̂= 10 −8 and we apply batch gradient descent (all ACs are used for computing gradients at each iteration) on every dataset except for UWCSE where we use stochastic gradient descent with batch size BatchSize = 100 . In the regularized version of PHIL, = 10 is used as regularization coefficient and clauses with parameters less than MinProb = 10 −5 are removed. We experiment with three versions of EMPHIL B ( EMPHIL B 1 , EMPHIL B 2 , EMPHIL B 3 ) which use a = 0 and differ from the fraction of the examples n they use at the M-step. They respectively use b = n 10 , b = n 5 , b = n 4 . The parameters in the gradient descent method are initialized between [-0.5, 0.5] and the ones in the EM between [0,1]. In order to test the performance of the algorithms, we apply the cross-validation method: each dataset is partitioned into NF folds of which one is used for testing and the remaining for training. The characteristics of the datasets in terms of number of clauses NC, layers NL, folds NF and the average number of Arithmetic circuits NAC for each fold of each dataset are summarized in Table 1.
We draw, for each test fold, the receiver operating characteristics (ROC) and the precision-recall (PR) curves and compute the area under each curve (AUCROC and AUCPR) as described in Davis and Goadrich (2006). The average values (over the folds) of the areas for each algorithm are shown in Tables 2, 3. Table 4 shows the average training time. In these tables, the best values are highlighted in bold. Note that EMBLEM ran out of memory on the datasets Carcinogenesis and Mondial and we could not compute the areas and the training time. Moreover, to start learning, ProbLog2 needed more memory. Tuffy ran out of memory on all datasets. We found it was due to the fact that, before performing inference/parameter learning, Tuffy maps all predicates and ground clauses into relations in a PostgreSQL database as explained in Sect. 7. Since all programs have hidden predicates which are not included in the input predicates, Tuffy attempts to populate the database with all possible groundings of the hidden predicates which requires much memory. In all datasets, Tuffy ran out of 500 Giga hard disk memory available.

Discussion of the results for parameter learning
The experiments show that PHIL beats EMBLEM, ProbLog2 and Tuffy either in terms of area or in terms of time in all datasets. In Table 4, we highlight in italic the times associated with the best accuracies from Tables 2 and 3 . It can be observed that these times are either the best or in the same order of the best time, in bold. Among DPHIL (resp. EMPHIL) and its regularized versions, DPHIL 2 (resp. EMPHIL 2 ) is often a good compromise in terms of accuracy and time. Note also that regularization is often not necessary in dataset with few clauses like the Mondial dataset. Between DPHIL and EMPHIL, DPHIL is often convenient in dataset with many clauses and examples (see Mutagenesis).
Tables 2 and 3 present several values that are very close to each other both for PHIL systems and the other systems (EMBLEM and ProbLog2). This probably means that EMBLEM, ProbLog2, PHIL and their regularized versions return solutions of the same quality. To verify this assertion, a statistical significance test (t-test) was performed among these systems. Tables showing the p-values are in Supplementary Material and in the Experiments repository 3.We compute the t-test between the fold values in datasets splitting in more than 2 folds i.e in Mutagenesis (10 folds), UWCSE (5 folds) and Mondial (5 folds). We use a paired two-tailed t-test. The test was done w.r.t the AUCROC (highlighted in yellow) and w.r.t to the AUCPR (highlighted in green). In the tables showing the p-values, EMPHIL B refers to EMPHIL B 1 .
The tests show that PHIL systems are statistically equivalent both in terms of AUCROC and in terms of AUCPR in all dataset except in Mutagenesis in which EMPHIL B and DPHIL 1 are almost always significantly worse than the others. Indeed, in Mutagenesis, EMPHIL B is always significantly worse than the other systems ( p ≤ 0.009 ) for both AUCROC and AUCPR except for DPHIL 1 where p = 0.15 and p = 0.19 for AUCROC and  EMBLEM, on the Mutagenis dataset, is statistically equivalent to EMPHIL and its regularized versions both in terms of AUCROC and in terms of AUCPR. This does not happen in the gradient versions of PHIL except for DHPIL. This situation is clearly understandable since EM versions of PHIL and EMBLEM all rely on the EM algorithm. They compute the same expectations. The former on a factor graph and the latter on a Binary Decision Diagram.
ProbLog is not statistically equivalent to the other systems except in the Mondial dataset in which there is a possible equivalence with other systems both in terms of AUCROC and in terms of AUCPR.

Experiments: structure learning
In this section, we present experiments comparing SLEAHP with state-of-the art structure learning systems on the five datasets described in Sect. 8.1. We compared SLEAHP with PLP systems such as SLIPCOVER (Bellodi & Riguzzi, 2015) and PROBFOIL+ (Raedt et al., 2015). We also compared SLEAHP with Statistical Relational Learning methods such as MLN-BC and MLN-BT (Khot et al., 2011) for learning Markov logic networks (MLNs) and with RDN-B (Natarajan et al., 2012) for learning Relational Dependency Networks. Note that the dataset PAKDD15 was randomly divided into 10 folds and the same language bias expressed in terms of modes was used in all systems. To generate the initial HPLP in SLEAHP, we used MaxProb = 1.0 , Rate = 0.95 and Maxlayer = +∞ for every dataset except in UWCSE where we used Maxlayer = 3 . After generating the initial HPLP we use the same hyper-parameters presented in Sect. 8.2 to perform parameter learning. We performed five experiments (one for each regularization). SLEAHP G 1 (resp. SLEAHP G 2 ) uses DPHIL 1 (resp. DPHIL 2 ) and SLEAHP E 1 (resp. SLEAHP E 2 and SLEAHP B ) uses EMPHIL 1 (resp. EMPHIL 2 and EMPHIL B 1 ). The average area under the ROC/PR curves and the average time are shown in Tables 5, 6, 7 respectively. In these tables, the best values are highlighted in bold. The results indicated with -for PROBFOIL+ mean that it was

Discussion of the results for structure learning
In terms of quality of the solutions, SLEAHP outperforms SLIPCOVER in the three datasets Mutagenesis, UWCSE and PAKDD15, and achieves solution of similar quality in the other datasets. Note that SLEAHP E 1 and other EM regularizations do not perform well, in terms of AUCPR, on the UWCSE dataset. On the other hand, on the same dataset, SLEAHP G 1 and SLEAHP G 2 outperform SLIPCOVER in terms of accuracy and time. This motivates the use of different types of algorithms and regularizations. In terms of computation time, SLEAHP outperforms SLIPCOVER in almost all datasets excepts in UWCSE in which the computation time still remains reasonable. In Table 7 we also highlight in italic, as done in Table 4, the times associated with the best accuracies of SLEAHP from Tables 5  and 6 . Between DPHIL and EMPHIL regularizations, as stated before, DPHIL is often preferred in dataset with large examples (see UWCSE). With respect to PROBFOIL+, SLEAHP outperforms PROBFOIL+ in terms of solution quality and time in all datasets.
With respect to MLN-BT/BC systems, SLEAHP systems beat them in all datasets except in UWCSE in which they provide similar solutions of quality. In terms of AUCROC, SLE-AHP beats RDN-B in the Carcinogenis dataset and provides similar quality in the other datasets. In terms of AUCPR, SLEAHP beats RDN-B in three out of five datasets including Mutagenesis, Carcinogenesis and PAKDD15. In the other datasets it provides similar solutions of quality.
In terms of time, SLEAHP systems clearly outperform the other systems in all datasets except in UWCSE where it provides a learning time similar to that of SLIPCOVER.
We also performed a statistical significance test for structure learning as done for parameter learning. The p-values for each couple of systems are shown in the supplementary material and in the Experiments repository 3. so in this case, it can be observed that SLEAHP systems are statistically equivalent both in terms of AUCROC and in terms of AUCPR in almost all datasets except in Mutagenesis in which SLEAHP B is always significantly worse than SLEAHP E 1 and SLEAHP E 2 ( p < 0.05 ) for both AUCROC and AUCPR. In UWCSE and Mondial, there is only one significance difference, that of SLEAHP G 1 with To summarize, SLEAHP beats other systems in terms of computation time in almost all datasets and achieves a similar quality of the solution. We believe that as the dimension of data increases the gap between SLEAHP learning time and the other systems learning time would be strongly remarkable. This would make SLEAHP a good compromise between accuracy and learning time.

Comparing PHIL and SLEAHP
The performance, either in terms of AUCROC, AUCPR and time was recapped for both PHIL and SLEAHP in the same tables in order to highlight a possible benefit of learning the structure of HPLPs instead of manually building them. The different tables (Tables 8,  9 and 10) are available in the appendix and in the Experiments repository 3. SLEAHP is highlighted in bold. From Tables 8 and 9, it can be clearly observed that SLEAHP provides similar and often better performance in terms of AUCROC and AUCPR in all datasets.
In the Mondial dataset, in particular, SLEAHP outperforms PHIL in terms of AUCPR. In terms of time (see Table 10) SLEAHP drastically outperforms PHIL in all datasets except in PAKDD15. This is mainly due to the fact that SLEAHP generates the initial hierarchical PLP with a depth that leads to a good compromise between solution quality and learning time.
In the PAKDD15 dataset, PHIL learning time outperforms SLEAHP learning time because the programs generated by SLEAHP were remarkably deeper than the one manually built used with PHIL. These observations show the usefulness of performing structure learning with SLEAHP. Having a reasonable tradeoff between solution quality and learning time overcomes the tedious process of manually writing the structure of a program for each dataset.

Conclusion
In this work we have presented different algorithms for learning both the structure and the parameters of hierarchical PLP from data. We first presented PHIL, Parameter learning for HIerarchical probabilistic logic programs, that learns the parameters of hierarchical PLP from data. Two versions of PHIL have been presented: DPHIL which learns the parameters by applying gradient descent and EMPHIL which applies Expectation Maximization. Different regularized versions of DPHIL ( DPHIL 1 , DPHIL 2 ) and EMPHIL(EMPHIL 1 , EMPHIL 2 , EMPHIL B ) have also been proposed. These regularizations favor small parameters during learning. Then, we proposed SLEAHP which learns both the structure and the parameters of hierarchical PLP from data. SLEAHP initially generates a large HPLP (containing many clauses) and applies a regularized version of PHIL to perform parameter learning. Clauses with a small parameters are removed from the final program. Finally we performed experiments comparing, in five real datasets, PHIL with EMBLEM, ProbLog2 and Tuffy and SLEAHP with SLIPCOVER, PROBFOIL+, MLN-BC, MLN-BT and RDN-B. PHIL and SLEAHP achieve similar and often better accuracies in a shorter time. The significance tests performed among the different versions of PHIL and SLEAHP show that these versions are statistically equivalent and can be used interchangeably.
Regarding the restriction imposed on the number of literals in the body of clauses in SLEAHP, we plan to extend this number in order to explore a large space of HPLPs. We also plan to extend HPLPs in domains with continuous random variables in order to apply PHIL and SLEAHP to data such as images.

Gradient calculation
In this section we describe how to compute the derivative of the error w.r.t each node of the AC.
Consider the root node r of the AC for an example e. We want to compute err v(n) for each node n in the AC. By the chain rule, Let us first compute err v (r) where v(r) is the output of the AC For a positive example, p = v(r) , while for a negative example r = not(n) , p = 1 − v(r) . In this case, the error defined in Eq. 4 becomes err = − log(v(r)) . Therefore Let us now compute the derivative, d(n), of v(r) with respect to each v(n) d(n) can be computed by observing that d(r) = 1 and, by the chain rule of calculus, for an arbitrary non root node n with pa n indicating its parents If parent pa n is a × node with n ′ indicating its children v(pa n ) = ∏ n � v(n � ) and if node n is not a leaf (not a parameter node), then if n = i then The derivative of pa n with respect to W i corresponding to i is: If the unique parent of n is a not(n) v(pa n ) = 1 − v(n) and Because of the graph construction, ⨁ and × nodes can only have one × and ⨁ parent respectively. Leaf nodes can have many × parent nodes. Therefore Eq. 26 can be written as Combining Eqs. 25 and 32 we have:

Message passing
This appendix describes how to perform message passing over the arithmetic circuit.
In order to illustrate the passes, we construct a graphical model associated with the AC and then apply the belief propagation (BP) algorithm (Pearl, 1988).
A Bayesian Network (BN) can be obtained from the AC by replacing each node with a random variable. The variables associated with an ⨁ node have a conditional probabilistic table (CPT) that encodes deterministic OR function, while variables associated with an × node have a CPT encoding a deterministic AND function. Variables associated with a ¬ node have a CPT encoding the NOT function. Leaf nodes associated with the same parameter are split into as many nodes X ij as the groundings of the rule C i , each associated with a CPT such that P(X ij = 1) = i . We convert the BN into a Factor Graph (FG) using the standard translation because BP can be expressed in a simpler way for FGs. The FG corresponding to the AC of Fig. 6 is shown in Fig. 8.
After constructing the FG, P(X ij = 0|e) and P(X ij = 1|e) are computed by exchanging messages among nodes and factors until convergence. In the case of FG obtained from an AC, the graph is a tree and it is sufficient to propagate the message first bottom-up and then top-down. The message from a variable N to a factor f is defined as, see Nguembang  and Pearl (1988): where nb(X) is the set of neighbors of X (the set of factors X appears in). The message from a factor f to a variable N is defined as: where v(N) is the value of node N, P is its parent and the operator ⊖ is defined as
We want to solve the equation The number of real and complex roots is determined by the discriminant of the cubic equation (Cox, 2012): In our case we have If > 0 the equation has three distinct real roots. If ≤ N 0 + N 1 If ≥ N 0 + N 1 So Eq. 44 has three real roots that can be computed as (Cox, 2012): 1 is minimal when N 1 is 0 and we get 1 is maximal when N 0 is 0 and we get

Comparison between PHIL and SLEAHP
See Tables 8, 9 and 10.