
1 Introduction

Probabilistic Logic Programming (PLP) combines uncertainty and logic-based languages [17]. Given its expressiveness, in the last decades PLP, and in particular PLP under the distribution semantics [21], has been widely adopted in domains characterized by uncertainty [5, 11, 12, 19, 20]. A probabilistic logic program without function symbols under the distribution semantics defines a probability distribution over normal logic programs, also called instances or worlds. The distribution is extended to a joint distribution over worlds and interpretations (or queries), and the probability of a query can be obtained from this distribution [17]. Logic Programs with Annotated Disjunctions (LPADs) [26] are a PLP language under the distribution semantics. In LPADs without function symbols, heads of clauses are disjunctions in which each atom is annotated with a probability.

Since learning probabilistic logic programs is expensive, various approaches have been proposed to reduce this cost. Lifted inference [15] was introduced to improve the performance of reasoning in probabilistic relational models by considering populations of individuals instead of each individual separately. Liftable probabilistic logic programs have recently been proposed to perform inference in a lifted way. LIFTCOVER [13] is an algorithm that performs structure and parameter learning of liftable probabilistic logic programs, with parameter learning carried out via either Expectation-Maximization (LIFTCOVER-EM) or Limited-memory BFGS (LIFTCOVER-LBFGS). Previous results [13] showed that LIFTCOVER-EM often outperformed LIFTCOVER-LBFGS and other state-of-the-art systems.

In this paper, we present LIFTCOVER+, an algorithm that extends LIFTCOVER with regularization and gradient descent for parameter learning to improve the quality of the solutions and prevent overfitting. We test LIFTCOVER+ on 12 real-world datasets and compare the results with LIFTCOVER-EM. Empirical results show that LIFTCOVER+ with the regularized Expectation-Maximization algorithm obtains slightly better results than the original LIFTCOVER-EM.

The paper is organized as follows: Sect. 2 presents background on PLP; Sect. 3 introduces LIFTCOVER+; Sect. 4 shows the results of the experiments; in Sect. 5 we discuss related work; and in Sect. 6 we draw the conclusions.

2 Background

We consider the liftable PLP language [13], a restriction of probabilistic logic programs so that inference can be performed in a lifted way. Such programs contain clauses with a single annotated atom in the head and the predicate of this atom is the same for all clauses, i.e., clauses of the form:

$$C_i=h_{i}:\varPi _{i}\ {:\!-}\ b_{i1},\ldots ,b_{iu_i}$$

where the single atom in the head is built over predicate target/a, with a the arity. The bodies of the clauses contain predicates other than target/a, whose facts and rules have a single atom in the head with probability 1 (they are certain). The predicate target/a is the target of learning and the other predicates are input predicates. In other words, in the liftable PLP language uncertainty appears only in the rules. The goal is to compute the probability of a ground instantiation (or query) q of target/a. To do so, we find the number of ground instantiations of clauses for target/a such that the body is true and the head is equal to q. Let \(\{\theta _{i1},...,\theta _{im_i}\}\) be the \(m_i\) instantiations for clause \(C_i\), \(i=1,...,n\). Every instantiation \(\theta _{ij}\) corresponds to a random variable \(X_{ij}\) that is equal to 1 (0) with probability \(\varPi _i\) (\(1 - \varPi _i\)). The query q is true if at least one random variable for a rule is true, i.e., takes value 1. Equivalently, the query q is false only if none of the random variables is true. Since all the random variables are mutually independent, the probability that q is false is \(\prod _{i=1}^n(1-\varPi _i)^{m_i}\), and thus the probability that q is true can be computed as \(P(q)=1-\prod _{i=1}^n(1-\varPi _i)^{m_i}\). The fact that the random variables associated with the rules are mutually independent does not limit the capability to represent probability distributions, as shown in [17].
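As a small illustration of the formula for \(P(q)\), the following Python sketch computes the probability of a query from the clause probabilities \(\varPi _i\) and the grounding counts \(m_i\). The function name `query_probability` and the example values are assumptions made for illustration and do not come from the LIFTCOVER implementation.

```python
from math import prod


def query_probability(clause_probs, grounding_counts):
    """Return P(q) = 1 - prod_i (1 - Pi_i)^(m_i).

    clause_probs[i] is Pi_i, the probability annotating clause C_i;
    grounding_counts[i] is m_i, the number of groundings of C_i whose
    body is true and whose head equals the query q.
    """
    if len(clause_probs) != len(grounding_counts):
        raise ValueError("one grounding count is needed per clause")
    # Probability that none of the independent variables X_ij is true.
    prob_all_false = prod(
        (1.0 - p) ** m for p, m in zip(clause_probs, grounding_counts)
    )
    return 1.0 - prob_all_false


# Hypothetical theory with two clauses: Pi_1 = 0.2 with 2 true groundings,
# Pi_2 = 0.5 with 1 true grounding, so P(q) = 1 - 0.8^2 * 0.5.
print(query_probability([0.2, 0.5], [2, 1]))  # approximately 0.68
```

Only the counts \(m_i\) are needed, not the individual groundings, which is what makes the computation lifted.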

LIFTCOVER [13], shown in Algorithm 1, learns the structure of liftable probabilistic logic programs. Given a set \(E^+=\{e_1,\ldots ,e_Q\}\) of positive examples, a set \(E^-=\{e_{Q+1},\ldots ,e_R\}\) of negative examples, and a background knowledge B (possibly a normal logic program defining the input predicates), the goal of structure learning is to find a liftable probabilistic logic program T such that the likelihood

$$L=\prod _{q=1}^Q P(e_q)\prod _{r=Q+1}^R P(\lnot e_r)$$

is maximized. LIFTCOVER solves this problem by first identifying good clauses guided by the log-likelihood (LL) of the data, with a top-down beam search. The refinement operator adds a literal taken from a bottom clause to the body of the current clause. The beam search is repeated a user-defined number of times or until the beam is empty. Then, parameter learning is performed on the full set of clauses found, which is considered as a single theory. LIFTCOVER can use either Expectation-Maximization (EM) or Limited-memory BFGS (LBFGS). LBFGS is used to find the values of the parameters that optimize the likelihood by exploiting the gradient of the log-likelihood with respect to the parameters. The likelihood can be unfolded to

$$L=\prod _{l=1}^n(1-\varPi _l)^{m_{l-}}\prod _{q=1}^Q\left( 1-\prod _{l=1}^n(1-\varPi _l)^{m_{lq}}\right) $$

where \(m_{lq}\) (\(m_{lr}\)) is the number of instantiations of \(C_l\) whose head is \(e_q\) (\(e_r\)) and whose body is true, and \({m_{l-}} = \sum _{r=Q+1}^R m_{lr}\). Its gradient can be computed as:

$$\begin{aligned} \frac{\partial L}{\partial \varPi _i} = \frac{L}{1-\varPi _i}\left( \sum _{q=1}^Q m_{iq}\left( \frac{1}{P(e_q)}-1\right) -m_{i-}\right) \end{aligned}$$
(1)

Because the equation \(\frac{\partial L}{\partial \varPi _i}=0\) does not admit a closed-form solution, optimization is needed to find the maximum of L. The clauses with a probability below a user-defined threshold are discarded.
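The unfolded likelihood and the gradient of Eq. (1) can be sketched in Python as follows. This is a minimal illustration, not the LIFTCOVER code: the data layout (`pos_counts[q][l]` for \(m_{lq}\) and `neg_totals[l]` for \(m_{l-}\)) and all function names are assumptions.

```python
import math


def example_prob(probs, counts):
    """P(e) = 1 - prod_l (1 - Pi_l)^(m_l) for a single example."""
    none_true = 1.0
    for p, m in zip(probs, counts):
        none_true *= (1.0 - p) ** m
    return 1.0 - none_true


def log_likelihood(probs, pos_counts, neg_totals):
    """Logarithm of the unfolded likelihood L."""
    ll = sum(m * math.log(1.0 - p) for p, m in zip(probs, neg_totals))
    ll += sum(math.log(example_prob(probs, c)) for c in pos_counts)
    return ll


def likelihood_gradient(probs, pos_counts, neg_totals):
    """Partial derivatives of L with respect to each Pi_i, as in Eq. (1)."""
    likelihood = math.exp(log_likelihood(probs, pos_counts, neg_totals))
    grad = []
    for i, p in enumerate(probs):
        # sum_q m_iq * (1 / P(e_q) - 1)
        pos_term = sum(
            c[i] * (1.0 / example_prob(probs, c) - 1.0) for c in pos_counts
        )
        grad.append(likelihood / (1.0 - p) * (pos_term - neg_totals[i]))
    return grad


# Hypothetical data: two clauses, two positive examples, and the total
# number of true groundings over the negative examples for each clause.
probs = [0.3, 0.6]
pos_counts = [[1, 0], [2, 1]]
neg_totals = [1, 2]
print(likelihood_gradient(probs, pos_counts, neg_totals))
```

A gradient-based optimizer such as LBFGS would call functions of this kind repeatedly. In practice the log-likelihood is usually preferred to L itself for numerical stability.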

In models in which the variables are hidden, the EM algorithm [7] must be used to find the maximum likelihood estimates of parameters. In the Expectation step, the distribution of the unseen variables in each instance is computed given the observed data and the current value of the parameters. In the Maximization step, the new parameters are computed so that the expected likelihood is maximized. The alternation between the Expectation and the Maximization steps continues until the likelihood does not improve anymore. To use the EM algorithm, the distributions of the hidden variables given the observed ones, \(P(X_{ij}=1|e)\) and \(P(X_{ij}=1|\lnot e)\), have to be computed. Since \(P(e|X_{ij}=1)=1\), we have \(P(X_{ij}=1,e)=P(e|X_{ij}=1) \cdot P(X_{ij}=1)=P(X_{ij}=1)=\varPi _i\), and therefore

$$\begin{aligned} P(X_{ij}=1|e)=\frac{P(X_{ij}=1,e)}{P(e)}=\frac{\varPi _i}{1-\prod _{l=1}^n(1-\varPi _l)^{m_l}} \end{aligned}$$
(2)
$$\begin{aligned} P(X_{ij}=0|e)=1-\frac{\varPi _i}{1-\prod _{l=1}^n(1-\varPi _l)^{m_l}} \end{aligned}$$
(3)

Since \(P(\lnot e|X_{ij}=1)=0\), we have \(P(X_{ij}=1,\lnot e)=P(\lnot e|X_{ij}=1) \cdot P(X_{ij}=1)=0\), and therefore

$$\begin{aligned} P(X_{ij}=1|\lnot e)=0 \end{aligned}$$
(4)
$$\begin{aligned} P(X_{ij}=0|\lnot e)=1 \end{aligned}$$
(5)
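To make Eqs. (2)-(5) concrete, the sketch below performs one unregularized EM iteration in Python. The Expectation step accumulates the expected counts of \(X_{ij}=1\) and \(X_{ij}=0\) for each clause; the Maximization step uses the relative-frequency update \(N_1/(N_0+N_1)\), which is the standard maximum likelihood estimate and is assumed here because the text does not state it explicitly. The data layout and function names are likewise assumptions.

```python
import math


def example_prob(probs, counts):
    """P(e) = 1 - prod_l (1 - Pi_l)^(m_l) for a single example."""
    none_true = 1.0
    for p, m in zip(probs, counts):
        none_true *= (1.0 - p) ** m
    return 1.0 - none_true


def log_likelihood(probs, pos_counts, neg_totals):
    """Log-likelihood of the positive and negative examples."""
    ll = sum(m * math.log(1.0 - p) for p, m in zip(probs, neg_totals))
    return ll + sum(math.log(example_prob(probs, c)) for c in pos_counts)


def em_step(probs, pos_counts, neg_totals):
    """One EM iteration; returns (new_probs, N0, N1) per clause."""
    n1 = [0.0] * len(probs)
    # Eqs. (4)-(5): in a negative example every X_ij is 0 with certainty.
    n0 = [float(m) for m in neg_totals]
    for counts in pos_counts:
        p_e = example_prob(probs, counts)
        for i, m in enumerate(counts):
            posterior = probs[i] / p_e      # Eq. (2)
            n1[i] += m * posterior
            n0[i] += m * (1.0 - posterior)  # Eq. (3)
    # Maximization step: relative frequency of X_ij = 1 (assumed update).
    new_probs = [
        a / (a + b) if a + b > 0 else p for a, b, p in zip(n1, n0, probs)
    ]
    return new_probs, n0, n1


# Hypothetical data: pos_counts[q][i] plays the role of m_iq and
# neg_totals[i] the role of m_i-.
probs = [0.3, 0.6]
pos_counts = [[1, 0], [2, 1]]
neg_totals = [1, 2]
for _ in range(30):
    probs, n0, n1 = em_step(probs, pos_counts, neg_totals)
print(probs, log_likelihood(probs, pos_counts, neg_totals))
```

The expected counts returned by `em_step` correspond to the quantities on which a regularized Maximization step would operate.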

3 LIFTCOVER+

LIFTCOVER can learn very large sets of clauses that may overfit the data. For this reason, we introduce LIFTCOVER+, a modified version of LIFTCOVER that adds regularization to parameter learning and uses gradient descent instead of LBFGS to optimize the likelihood.

Regularization is a well-known technique to prevent overfitting, in which a penalty term is added to the loss function to penalize large weights. In this way, we aim to obtain few clauses with large weights. Clauses with small weights have little influence on the probability of the query and can be removed, thus simplifying the theory. Regularization is usually performed in gradient-based algorithms, but it can also be performed in EM, in the Maximization phase, where the parameters that maximize the LL are found. For EM, regularization can be Bayesian, L1, or L2.

In Bayesian regularization, the parameters are updated assuming a prior distribution that takes the form of a Dirichlet probability density with parameters [a, b]. This has the same effect as having observed a extra occurrences of \(X_{ij}=1\) and b extra occurrences of \(X_{ij}=0\). If b is much larger than a, the parameters are shrunk. L1 and L2 differ in how they penalize the loss function: L1 adds the sum of the absolute values of the parameters to the loss function, while L2 adds the sum of their squares.
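Under the pseudo-count reading given above, the Bayesian Maximization step can be sketched as follows. The closed form \((N_1+a)/(N_0+N_1+a+b)\) is an assumption derived from that reading, not a formula quoted from the paper, and the function name is hypothetical.

```python
def bayesian_m_step(n0, n1, a, b):
    """Update a clause probability with a extra pseudo-observations of
    X_ij = 1 and b extra pseudo-observations of X_ij = 0.

    n0 and n1 are the expected counts of X_ij = 0 and X_ij = 1 from the
    Expectation step.
    """
    total = n0 + n1 + a + b
    if total <= 0:
        raise ValueError("at least one (pseudo-)observation is required")
    return (n1 + a) / total


# Hypothetical expected counts: the unregularized estimate is 2 / 5 = 0.4;
# with a = 0 and b = 5 it is shrunk to 2 / 10 = 0.2.
print(bayesian_m_step(3.0, 2.0, 0.0, 0.0))  # 0.4
print(bayesian_m_step(3.0, 2.0, 0.0, 5.0))  # 0.2
```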

The L1 objective function [14] is:

$$\begin{aligned} J_1(\theta ) = N_1 \log \theta + N_0 \log (1 - \theta ) - \gamma \theta \end{aligned}$$
(6)

where \(\theta = \pi _i\), \(N_0\) and \(N_1\) are the expected occurrences of \(X_{ij}=0\) and \(X_{ij}=1\) computed in the Expectation step, and \(\gamma \) is the regularization coefficient. The value of \(\theta \) that maximizes \(J_1\) is computed in the Maximization step by solving the equation \(\frac{\partial J_1(\theta )}{\partial \theta }=0\) [14]. \(J_1(\theta )\) attains its maximum at

$$\begin{aligned} \theta _1=\frac{4 N_1 }{2\left( \gamma +N_0+N_1+\sqrt{(N_0+N_1)^2+\gamma ^2+2\gamma (N_0-N_1)}\right) } \end{aligned}$$
(7)
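The closed form of Eq. (7) can be checked numerically with the short Python sketch below, which evaluates \(\theta _1\) and the derivative of \(J_1\) at that point. Function names and the example counts are assumptions; \(\gamma = 50\) is used only as an example value.

```python
import math


def l1_m_step(n0, n1, gamma):
    """Maximizer of J_1 in closed form, Eq. (7)."""
    s = n0 + n1
    disc = s * s + gamma * gamma + 2.0 * gamma * (n0 - n1)
    return 4.0 * n1 / (2.0 * (gamma + s + math.sqrt(disc)))


def j1_derivative(theta, n0, n1, gamma):
    """dJ_1/dtheta = N_1 / theta - N_0 / (1 - theta) - gamma."""
    return n1 / theta - n0 / (1.0 - theta) - gamma


# Hypothetical expected counts from an Expectation step.
theta = l1_m_step(3.0, 2.0, 50.0)
print(theta, j1_derivative(theta, 3.0, 2.0, 50.0))  # derivative close to 0
```

With \(\gamma = 0\) the expression reduces to the unregularized estimate \(N_1/(N_0+N_1)\), and larger values of \(\gamma \) push \(\theta _1\) towards 0.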

The L2 objective function [14] is:

$$\begin{aligned} J_2(\theta ) = N_1 \log \theta + N_0 \log (1 - \theta ) - \frac{\gamma }{2} \theta ^2 \end{aligned}$$
(8)

and the value of \(\theta \) that maximizes \(J_2\) is:

$$\begin{aligned} \theta _2= \frac{2 \sqrt{\frac{3 N_{0} + 3 N_{1} + \gamma }{\gamma }} \cos {\left( \frac{\arccos {\left( \frac{\sqrt{\frac{\gamma }{3 N_{0} + 3 N_{1} + \gamma }} \left( \frac{9 N_{0}}{2} - 9 N_{1} + \gamma \right) }{3 N_{0} + 3 N_{1} + \gamma } \right) }}{3} - \frac{2\pi }{3} \right) }}{3} + \frac{1}{3} \end{aligned}$$
(9)
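Setting the derivative of \(J_2\) to zero yields the cubic equation \(\gamma \theta ^3 - \gamma \theta ^2 - (N_0+N_1)\theta + N_1 = 0\), and Eq. (9) is the trigonometric expression of its root in (0, 1). The Python sketch below evaluates Eq. (9) and checks that the derivative vanishes there. The function names and example counts are assumptions, and the clamping of the arccos argument is a numerical safeguard added for this sketch.

```python
import math


def l2_m_step(n0, n1, gamma):
    """Maximizer of J_2 in closed form, Eq. (9); requires gamma > 0."""
    s = 3.0 * n0 + 3.0 * n1 + gamma
    arg = math.sqrt(gamma / s) * (4.5 * n0 - 9.0 * n1 + gamma) / s
    # Guard against round-off pushing the argument just outside [-1, 1].
    arg = max(-1.0, min(1.0, arg))
    angle = math.acos(arg) / 3.0 - 2.0 * math.pi / 3.0
    return 2.0 * math.sqrt(s / gamma) * math.cos(angle) / 3.0 + 1.0 / 3.0


def j2_derivative(theta, n0, n1, gamma):
    """dJ_2/dtheta = N_1 / theta - N_0 / (1 - theta) - gamma * theta."""
    return n1 / theta - n0 / (1.0 - theta) - gamma * theta


# Hypothetical expected counts from an Expectation step.
theta = l2_m_step(3.0, 2.0, 50.0)
print(theta, j2_derivative(theta, 3.0, 2.0, 50.0))  # derivative close to 0
```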

In LIFTCOVER+, LBFGS is replaced by regularized gradient descent. The objective function is the sum of cross entropy errors \(err_i\) for all the examples:

$$\begin{aligned} err=\sum _{i=1}^{Q+R}(-y_i\log P(e_i)-(1-y_i)\log (1-P(e_i))) \end{aligned}$$
(10)

where \(Q+R\) is the total number of examples, \(e_i\) is an example, and \(y_i\) is its sign, i.e., \(y_i\) equals 1 (0) if the example is positive (negative). L1 regularization can then be applied to minimize the loss function [14]:

$$\begin{aligned} err_{L1} = \sum _{i = 1}^{Q+R} \left( - y_i \log P(e_i) - (1-y_i) \log (1-P(e_i)) \right) + \gamma \sum _{i=1}^{k} |\pi _i| \end{aligned}$$
(11)

where k is the number of parameters and the \(\pi _i\)s are the probabilities of the clauses. After learning the parameters, all the clauses with a probability below a fixed threshold are removed.
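A minimal Python sketch of gradient descent on the L1-regularized loss of Eq. (11) is given below. It is an illustration under several assumptions rather than the LIFTCOVER+ implementation: each example is a pair (grounding counts per clause, label), probabilities are kept inside (0, 1) by clipping, and the subgradient of \(|\pi _i|\) is taken to be 1 because the parameters stay positive. All function names are hypothetical.

```python
import math

EPS = 1e-9  # assumed clipping margin that keeps logarithms finite


def clip(x):
    return min(max(x, EPS), 1.0 - EPS)


def predict(probs, counts):
    """P(e) = 1 - prod_l (1 - pi_l)^(m_l)."""
    none_true = 1.0
    for p, m in zip(probs, counts):
        none_true *= (1.0 - p) ** m
    return 1.0 - none_true


def l1_loss(probs, examples, gamma):
    """Cross-entropy error plus the L1 penalty, as in Eq. (11)."""
    err = 0.0
    for counts, y in examples:
        p_e = clip(predict(probs, counts))
        err += -y * math.log(p_e) - (1 - y) * math.log(1.0 - p_e)
    return err + gamma * sum(abs(p) for p in probs)


def l1_gradient(probs, examples, gamma):
    grad = [gamma] * len(probs)  # subgradient of gamma * |pi_i| for pi_i > 0
    for counts, y in examples:
        p_e = clip(predict(probs, counts))
        for i, (p, m) in enumerate(zip(probs, counts)):
            if m == 0:
                continue
            dp = m * (1.0 - p_e) / (1.0 - p)  # dP(e) / dpi_i
            grad[i] += -y * dp / p_e + (1 - y) * dp / (1.0 - p_e)
    return grad


def gradient_descent(probs, examples, gamma, eta=0.0001, iterations=1000):
    """Fixed-learning-rate descent; parameters are clipped to (0, 1)."""
    probs = [clip(p) for p in probs]
    for _ in range(iterations):
        grad = l1_gradient(probs, examples, gamma)
        probs = [clip(p - eta * g) for p, g in zip(probs, grad)]
    return probs


# Hypothetical data: two clauses, two positive and two negative examples.
examples = [([1, 0], 1), ([1, 1], 1), ([0, 2], 0), ([1, 0], 0)]
start = [0.5, 0.5]
learned = gradient_descent(start, examples, gamma=0.1, eta=0.01, iterations=200)
print(l1_loss(start, examples, 0.1), l1_loss(learned, examples, 0.1))
```

In this toy run the second clause, which fires mostly on the negative example, is driven towards 0 and would be removed by a probability threshold.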

Algorithm 1. Function LIFTCOVER

4 Experiments

The main goal of the experiments is to assess whether the regularization added in LIFTCOVER+ improves the quality of the solution. All experiments were conducted on a GNU/Linux machine with an Intel Core i3-10320 Quad Core 3.80 GHz CPU.

We tested LIFTCOVER+ on 12 real-world datasets: UW-CSE [10] (a dataset that describes the Computer Science Department of the University of Washington, used to predict whether a student is advised by a professor), Mondial [22] (a dataset containing information regarding geographical regions of the world, such as population size, political system, and the country border relationship), Carcinogenesis [23] (a classic ILP benchmark dataset for Quantitative Structure-Activity Relationship (QSAR), i.e., predicting the biological activity of chemicals from their physicochemical properties or molecular structure; the goal is to predict the carcinogenicity of compounds from their chemical structure), Mutagenesis [24] (a classic ILP benchmark dataset for QSAR in which the goal is to predict the mutagenicity (a property correlated with carcinogenicity) of compounds from their chemical structure), Bupa (for diagnosing patients with liver disorders), Nba (for predicting the results of NBA basketball games), Pyrimidine and Triazine (QSAR datasets for predicting the inhibition of dihydrofolate reductase by pyrimidines and triazines, respectively), Financial (for predicting the success of loan applications by clients of a bank), Sisya and Sisyb (datasets regarding insurance business clients, used to classify households and persons in relation to private life insurance), and Yeast (for predicting if a yeast gene codes for a protein involved in metabolism) from [25]. Table 1 shows the characteristics of the datasets.

Four different configurations of LIFTCOVER+ were compared: EM with Bayesian regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent with fixed learning rate \(\eta = 0.0001\) and L1 regularization (GD). Hyper-parameters for Bayesian regularization were set as \(a=0\) and b equal to 15% of the total number of examples in the dataset. We set \(\gamma = 50\) for L1 and L2 in EM, and \(\gamma = 10\) for L1 in gradient descent. The parameters controlling structure learning are the following: \(\textit{NInt}\) is the number of mega-examples on which to build the bottom clauses, \(\textit{NA}\) is the number of bottom clauses to be built for each mega-example, \(\textit{NS}\) is the number of saturation steps (for building the bottom clauses), \(\textit{NI}\) is the maximum number of clause search iterations, \(\textit{NB}\) is the size of the beam, \(\textit{NV}\) is the maximum number of variables in a rule, and \( WMin \) is the probability threshold below which a rule is removed. Their values are listed in Table 2.

All the configurations were evaluated in terms of Area Under the Precision-Recall Curve (AUC-PR) and Area Under the Receiver Operating Characteristics Curve (AUC-ROC). Both were computed with the methods reported in [4, 8]. LIFTCOVER+ was compared with LIFTCOVER-EM from [13].

Table 1. Characteristics of the datasets for the experiments: number of predicates (P), of tuples (T) (i.e., ground atoms), of positive (PEx) and negative (NEx) examples for target predicate(s), of folds (F). The number of tuples includes the target positive examples.
Table 2. Parameters controlling structure search for LIFTCOVER+.

Tables 3, 4, and 5 show the performance of the different configurations in terms of the average over the folds of AUC-ROC, AUC-PR, and execution time, respectively. The results of LIFTCOVER-EM were taken from [13]. Execution times for LIFTCOVER+ were scaled (i.e., multiplied by 3.8/2.4) in order to compare them with those of LIFTCOVER-EM in [13], which was run on a machine with an Intel Xeon Haswell E5-2630 v3 (2.40 GHz) CPU. Figures 1, 2, and 3 show the histograms of the above-mentioned data.

LIFTCOVER+ performs slightly better than LIFTCOVER-EM in terms of AUC-PR on 6 datasets out of 12 with EM-Bayes and EM-L1, on 7 datasets with EM-L2, and on 3 datasets with GD. The average AUC-PR over all datasets is highest for LIFTCOVER+ with EM and L2 regularization, followed closely by Bayesian regularization. Results obtained with LIFTCOVER+ and GD were considerably worse on the Pyrimidine and Yeast datasets and were lower in almost all other cases. LIFTCOVER+ markedly improved the performance on the Nba dataset, achieving an AUC-PR of 0.7 against the 0.5 reached by LIFTCOVER-EM. Nevertheless, the Sisyb dataset remains a challenge for LIFTCOVER+ (both with EM and GD). Regarding AUC-ROC, LIFTCOVER+ beats LIFTCOVER-EM on 4 datasets out of 12 with EM-Bayes and EM-L1, on 3 datasets with EM-L2, and on 2 datasets with GD. In general, GD led to a deterioration of the solution in most cases, probably because the loss function is highly non-convex and GD ends up in local minima, while EM seems more capable of escaping them. In terms of execution times, LIFTCOVER+ is comparable to LIFTCOVER-EM, although it was slower in some cases. This is especially true for GD, which on some datasets (Bupa, Mondial, Mutagenesis, Pyrimidine, Yeast, Triazine, Carcinogenesis) turns out to be slower by one or more orders of magnitude. However, the scaling approach we have used is only a rough approximation: the architectures of the two processors are different, and thus differences in caches and pipelining may have an effect. In the future, we plan to repeat the LIFTCOVER+ experiments on a machine more similar to the one used for LIFTCOVER-EM.

Table 3. Average AUC-ROC over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD). For each row, the best result is highlighted in bold.
Table 4. Average AUC-PR over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD). For each row, the best result is highlighted in bold.
Table 5. Average time in seconds over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD). For each row, the best result is highlighted in bold.
Fig. 1. Histograms of average AUC-ROC over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD).

Fig. 2. Histograms of average AUC-PR over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD).

Fig. 3. Histograms of average time in seconds over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD). The scale of the X axis is logarithmic.

5 Related Work

Lifted inference for PLP under the distribution semantics has been surveyed in [18], in which the authors describe and evaluate three different approaches, namely Lifted Probabilistic Logic Programming (\(LP^2\)), lifted inference with aggregation parfactors, and Weighted First Order Model Counting (WFOMC). The authors of [9], instead, focused their survey on lifted graphical models.

LIFTCOVER (and thus LIFTCOVER+) derives from SLIPCOVER [3], an algorithm for learning general probabilistic logic programs that first performs a search in the space of clauses and then builds the theory by greedily adding the refined clauses to it. Aside from the simplified structure search, LIFTCOVER and LIFTCOVER+ also differ from SLIPCOVER in the approach used for parameter learning. While SLIPCOVER uses EMBLEM [2] to learn the parameters of a probabilistic logic program by applying EM over Binary Decision Diagrams [1], LIFTCOVER and LIFTCOVER+ use EM, LBFGS, and gradient descent.

Hierarchical PLP (HPLP) [14] is a restriction of the general PLP language in which clauses and predicates are hierarchically organized. HPLPs can be efficiently converted into arithmetic circuits (ACs) or deep neural networks so that inference is much cheaper than for general PLP. Liftable PLP can be seen as a restriction of HPLP. For this reason, LIFTCOVER+ is related to HPLP tools such as PHIL and SLEAHP [14]. PHIL performs parameter learning of hierarchical probabilistic logic programs using gradient descent (DPHIL) or EM (EMPHIL). First, it converts the program into a set of ACs sharing parameters. Then, it applies gradient descent or EM over the ACs, evaluating them bottom-up. SLEAHP, on the other hand, learns both the structure and the parameters of HPLPs from data. It generates a large hierarchical logic program from an initial set of bottom clauses generated from a language bias [3]. Then, it applies a regularized version of PHIL to prune the initial large program by removing irrelevant rules, i.e., those for which the parameters are close to 0.

LIFTCOVER+ is related also to PROBFOIL+ [16], an algorithm used to perform parameter and structure learning of ProbLog [6] programs with a hill climbing search in the space of programs, consisting of a covering loop that adds one rule to the theory at each iteration and stops when a condition based on a global scoring function is satisfied. The rule to add is obtained from a clause search loop that builds the rule by iteratively adding literals to the body using a local scoring function as the heuristic.

6 Conclusions

In this paper, we have presented LIFTCOVER+, an updated version of LIFTCOVER that performs parameter learning using the EM algorithm or gradient descent with regularization to penalize large weights and prevent overfitting. Experiments were conducted on 12 real-world datasets and results were compared with LIFTCOVER-EM. In summary, we found that using gradient descent does not bring much benefit, having AUC-PR and AUC-ROC comparable to LIFTCOVER-EM, and execution times often much higher. On the other hand, using EM with regularization (and with L2 or Bayesian regularization especially) we obtain a higher AUC-PR on several datasets with roughly equal execution times. Furthermore, when there are no improvements, there is not a significant degradation in the quality of the solutions either. In conclusion, the present findings confirm that adding regularization can help improve the solution in terms of AUC-PR, although some datasets remain hard for LIFTCOVER+.

As future work, we plan to employ LIFTCOVER+ to learn theories from Knowledge Graphs (KG) to perform KG completion and triple classification.