
1 Introduction

In [14], Halpern discusses the difference between “Type 1” (T1) and “Type 2” (T2) statements: the former describes a statistical property of the world of interest while the latter represents a degree of belief. “The probability that a random person smokes is 20%” is an example of a “Type 1” statement, while “John smokes with probability 30%”, where John is a particular individual, is an example of a “Type 2” statement.

Answer Set Programming (ASP) [7] is a powerful language that makes it possible to easily encode complex domains. However, ASP does not allow the representation of uncertain data. To handle this, we consider Probabilistic ASP (PASP), where uncertainty is expressed through probabilistic facts, as done in Probabilistic Logic Programming [10]. We focus here on PASP under the Credal Semantics [9], where each query is associated with a probability interval defined by a lower and an upper bound.

Recently, the authors of [3] introduced PASTA (“Probabilistic Answer set programming for STAtistical probabilities”), a new language (and software) where statistical statements are translated into PASP rules and inference is performed by converting the PASP program into an equivalent answer set program. However, exact inference is exponential in the number of probabilistic facts, and thus it is infeasible for more than a few dozen variables. In this paper, we propose four algorithms to perform approximate inference in PASTA programs: one for unconditional sampling and three for conditional sampling that adopt rejection sampling, Metropolis Hastings sampling, and Gibbs sampling. Empirical results show that our algorithms can handle programs with hundreds of variables. Moreover, we compare our algorithms with PASOCS [23], a solver able to perform approximate inference in PASP programs under the Credal Semantics, showing that our algorithms reach a comparable accuracy in a lower execution time.

The paper is structured as follows: Sect. 2 discusses some related works and Sect. 3 introduces background concepts. Section 4 describes our algorithms for approximate inference in PASTA programs that are tested in Sect. 5. Section 6 concludes the paper.

2 Related Work

PASTA [3] extends Probabilistic Logic Programming [20] under the Distribution Semantics [21] by allowing the definition of statistical statements. Statistical statements, also referred to as “Probabilistic Conditionals”, are discussed in [16], where the authors give a semantics to T1 statements leveraging the maximum entropy principle. Under this interpretation, they consider the unique model that yields the maximum entropy. Differently from them, we consider all the models, thus obtaining a more general framework [3].

T1 statements are also studied in [15] and [24]: the former adopts the cross entropy principle to assign a semantics to T1 statements while the latter identifies only a specific model and a sharp probability value, rather than all the models and an interval for the probability, as we do.

We adopt the credal semantics [9] for PASP, where the probability of a query is defined by a range. To the best of our knowledge, the only work that performs inference in PASP under the Credal Semantics is PASOCS [23]. Its authors propose both an exact solver, which relies on the generation of all the possible combinations of probabilistic facts, and an approximate one, based on sampling. We compare our approach with it in Sect. 5.

Other solutions for inference in PASP consider different semantics that assign to a query a sharp probability value, such as [6, 17, 19, 22].

3 Background

We assume that the reader is familiar with the basic concepts of Logic Programming. For a complete treatment of the field, see [18].

An Answer Set Programming (ASP) [7] rule has the form h1 ; ... ; hm :- b1, ... , bn. where each \(h_i\) is an atom, each \(b_i\) is a literal, and :- is called the neck operator. The disjunction of the \(h_i\) is called the head while the conjunction of the \(b_i\) is called the body of the rule. Particular configurations of the atoms/literals in the head/body identify specific types of rules: if the head is empty and the body is not, the rule is a constraint. Likewise, if the body is empty and the head is not, the rule is a fact, and the neck operator is usually omitted. We consider only rules where every variable also appears in a positive literal in the body. These rules are called safe. Finally, a rule is called ground if it does not contain variables.

In addition to atoms and literals, we also consider aggregate atoms of the form \(\gamma _1 \omega _1 \ \#\zeta \{\epsilon _1, \dots , \epsilon _l\} \ \omega _2 \gamma _2\) where \(\gamma _1\) and \(\gamma _2\) are constants or variables called guards, \(\omega _1\) and \(\omega _2\) are arithmetic comparison operators (such as >, \(\ge \), <, and \(\le \)), \(\zeta \) is an aggregate function symbol, and each \(\epsilon _i\) is an expression of the form \(t_1, \dots , t_k \ : \ F\) where each \(t_j\) is a term, F is a conjunction of literals, and \(k > 0\). Moreover, each variable in \(t_1, \dots , t_k\) also appears in F.

We denote an answer set program with \(\mathcal {P}\) and its Herbrand base, i.e., the set of atoms that can be constructed with all the symbols in it, as \(B_\mathcal {P}\). An interpretation \(I \subseteq B_\mathcal {P}\) satisfies a ground rule when at least one of the \(h_i\) is true in I whenever the body is true in I. A model is an interpretation that satisfies all the ground rules of a program \(\mathcal {P}\). The reduct [11] of a ground program \(\mathcal {P}_g\) with respect to an interpretation I is a new program \(\mathcal {P}_g^r\) obtained from \(\mathcal {P}_g\) by removing the rules in which a \(b_i\) is false in I. Finally, an interpretation I is an answer set for \(\mathcal {P}\) if it is a minimal model of \(\mathcal {P}_g^r\). We consider minimality in terms of set inclusion and denote with \(AS(\mathcal {P})\) the set of all the answer sets of \(\mathcal {P}\).

Probabilistic Answer Set Programming (PASP) [8] is to Answer Set Programming what Probabilistic Logic Programming [20] is to Logic Programming: it allows the definition of uncertain data through probabilistic facts. Following the ProbLog [10] syntax, these facts can be represented with \(\varPi :\,\!: f\) where f is a ground atom and \(\varPi \) is its probability. If we assign a truth value to every probabilistic fact (where \(\top \) represents true and \(\bot \) represents false) we obtain a world, i.e., an answer set program. There are \(2^n\) worlds for a probabilistic answer set program, where n is the number of ground probabilistic facts. Many Probabilistic Logic Programming languages rely on the distribution semantics [21], according to which the probability of a world w is computed with the formula

$$\begin{aligned} P(w) = \prod _{i \mid f_i = \top } \varPi _i \cdot \prod _{i \mid f_i = \bot } (1 - \varPi _i) \end{aligned}$$

while the probability of a query q (a conjunction of ground literals) is computed with the formula

$$\begin{aligned} P(q) = \sum _{w \models q} P(w) \end{aligned}$$

provided that every world has exactly one answer set.
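As a small illustration of these two formulas, the following Python sketch computes the probability of a single world from the probabilities of the ground probabilistic facts and a truth assignment (the function name and data structures are ours, not part of any existing implementation):

```python
# Probability of a world under the distribution semantics: multiply Pi_i
# for facts set to true and (1 - Pi_i) for facts set to false.
def world_probability(probs, truth_values):
    """probs[i] is the probability of the i-th ground probabilistic fact,
    truth_values[i] is True/False for that fact in the world."""
    p = 1.0
    for prob, value in zip(probs, truth_values):
        p *= prob if value else (1 - prob)
    return p

# Example: facts with probabilities 0.2, 0.9, 0.6 and a world where only
# the second fact is true: (1 - 0.2) * 0.9 * (1 - 0.6) = 0.288.
print(round(world_probability([0.2, 0.9, 0.6], [False, True, False]), 3))
```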

For performing inference in PASP, we consider the Credal Semantics [8], where every query q is associated with a probability range: the upper probability bound \(\overline{\textrm{P}}(q)\) is given by the sum of the probabilities of the worlds w with at least one answer set in which the query is true, while the lower probability bound \(\underline{\textrm{P}}(q)\) is given by the sum of the probabilities of the worlds w in which the query is true in all the answer sets, i.e.,

$$\begin{aligned} \overline{\textrm{P}}(q) = \sum _{w_i \mid \exists m \in AS(w_i), \ m \models q} P(w_i), \ \ \underline{\textrm{P}}(q) = \sum _{w_i \mid |AS(w_i)| > 0 \ \wedge \ \forall m \in AS(w_i), \ m \models q} P(w_i) \end{aligned}$$

Note that the credal semantics requires that every world has at least one answer set. In the remaining part of the paper we consider only programs where this requirement is satisfied.
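To make the two sums concrete, the following Python sketch enumerates all the worlds of a (tiny) probabilistic answer set program and accumulates the lower and upper probabilities. It relies on the clingo Python API to compute answer sets; the function names and the way the program is assembled are ours and serve only as an illustration, under the assumption above that every world has at least one answer set.

```python
from itertools import product
import clingo

def answer_sets(program):
    """Return the answer sets of an ASP program as sets of atom strings."""
    ctl = clingo.Control(["0"])          # "0": enumerate all answer sets
    ctl.add("base", [], program)
    ctl.ground([("base", [])])
    models = []
    ctl.solve(on_model=lambda m: models.append({str(a) for a in m.symbols(shown=True)}))
    return models

def credal_bounds(prob_facts, rules, query):
    """prob_facts: list of (atom, probability); rules: ASP rules as a string."""
    lp, up = 0.0, 0.0
    for world in product([True, False], repeat=len(prob_facts)):
        p = 1.0
        program = rules
        for (atom, prob), true in zip(prob_facts, world):
            p *= prob if true else (1 - prob)
            if true:
                program += f"{atom}.\n"
        models = answer_sets(program)
        if models and all(query in m for m in models):   # query true in all answer sets
            lp += p
        if any(query in m for m in models):               # query true in at least one
            up += p
    return lp, up
```

For instance, called on the two probabilistic facts 0.2::iron(1) and 0.9::iron(2) together with only the disjunctive rule rusty(X) ; not_rusty(X) :- iron(X) (no constraint), it returns the bounds [0, 0.2] for the query rusty(1).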

Example 1 (PASP Example)

We consider 3 objects whose composition is unknown and suppose that some of them may be made of iron with a given probability. An object made of iron may get rusty or not. We want to know the probability that a particular object is rusty. This can be modelled with:

figure a

The constraint states that at least 60% of the objects made of iron are rusty. This program has \(2^3 = 8\) worlds. For example, the world where all three probabilistic facts are true has 4 answer sets. If we consider the query q = rusty(1), this world only contributes to the upper probability since the query is present only in 3 of the 4 answer sets. By considering all the worlds, we get \(\underline{\textrm{P}}(q) = 0.092\) and \(\overline{\textrm{P}}(q) = 0.2\), so the probability of the query lies in the range [0.092, 0.2].

If we want to compute the conditional probability for a query q given evidence e, \(P(q \mid e)\), we need to consider two different formulas for the lower and upper probability bounds [8]:

$$\begin{aligned} \overline{\textrm{P}}(q \mid e) = \frac{\overline{\textrm{P}}(q,e)}{\overline{\textrm{P}}(q,e) + \underline{\textrm{P}}(\lnot q,e)}, \ \ \underline{\textrm{P}}(q \mid e) = \frac{\underline{\textrm{P}}(q,e)}{\underline{\textrm{P}}(q,e) + \overline{\textrm{P}}(\lnot q,e)} \end{aligned}$$
(1)

Clearly, these formulas are valid only if the denominator is different from 0, otherwise the value is undefined. If we consider again Example 1 with query q = rusty(1) and evidence e = iron(2), we get \(\underline{\textrm{P}}(q \mid e) = 0.08\) and \(\overline{\textrm{P}}(q \mid e) = 0.2\).
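A direct translation of Eq. 1 into Python, guarding against undefined values, could look as follows (a minimal sketch with names of our choosing):

```python
def conditional_bounds(lp_qe, up_qe, lp_nqe, up_nqe):
    """Lower/upper conditional probability from the joint bounds of
    (q, e) and (not q, e), following Eq. 1; returns None when undefined."""
    lower = lp_qe / (lp_qe + up_nqe) if lp_qe + up_nqe > 0 else None
    upper = up_qe / (up_qe + lp_nqe) if up_qe + lp_nqe > 0 else None
    return lower, upper
```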

Following the syntax proposed in [3], a probabilistic conditional is a formula of the form \((C \mid A)[\varPi _l,\varPi _u]\) stating that the fraction of As that are also Cs is between \(\varPi _l\) and \(\varPi _u\), where C and A are conjunctions of literals. To perform inference, a conditional is converted into three answer set rules: i) C ; not_C :- A, ii) :- #count{X : C, A} = V1, #count{X : A} = V2, 10*V1 < \(\varPi _l\)*10*V2, and iii) :- #count{X : C, A} = V1, #count{X : A} = V2, 10*V1 > \(\varPi _u\)*10*V2, where X is a vector of elements containing all the variables in C and A. If \(\varPi _l\) or \(\varPi _u\) are respectively 0 or 1, rule ii) or iii) can be omitted. Moreover, if the probability values \(\varPi _l\) and \(\varPi _u\) have n decimal digits, the 10 in the multiplications above should be replaced with \(10^n\), since ASP cannot deal with floating point values.
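A rough Python sketch of this translation, assuming the constraint shape given above and a single aggregation variable (this helper is ours, not the actual PASTA implementation), could be:

```python
def translate_conditional(c, a, var, pl, pu, digits=1):
    """Translate (c | a)[pl, pu] into ASP rules, scaling by 10**digits
    so that the comparisons use only integers."""
    scale = 10 ** digits
    rules = [f"{c} ; not_{c} :- {a}."]
    counts = (f"#count{{{var} : {c}, {a}}} = V1, "
              f"#count{{{var} : {a}}} = V2")
    if pl > 0:
        rules.append(f":- {counts}, {scale}*V1 < {int(round(pl * scale))}*V2.")
    if pu < 1:
        rules.append(f":- {counts}, {scale}*V1 > {int(round(pu * scale))}*V2.")
    return rules

# Example: (rusty(X) | iron(X))[0.6, 1] yields the disjunctive rule and a
# single constraint comparing 10*V1 with 6*V2 (rule iii is omitted).
print(translate_conditional("rusty(X)", "iron(X)", "X", 0.6, 1))
```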

A PASTA program [3] is composed of a set of probabilistic facts, a set of ASP rules, and a set of probabilistic conditionals.

Example 2 (Probabilistic Conditional (PASTA program))

The following program

figure d

is translated into the PASP program shown in Example 1. The rule iii) is omitted since \(\varPi _u = 1\).

In [3], an exact inference algorithm was proposed for probabilistic conditionals, which basically requires the enumeration of all the worlds. This is clearly infeasible when the number of variables exceeds 20–30. To overcome this issue, in the following section we present different algorithms that compute the probability interval approximately, based on sampling techniques.

4 Approximate Inference for PASTA Programs

To perform approximate inference in PASTA programs, we developed four algorithms: one for unconditional sampling (Algorithm 1) and three for conditional sampling, adopting rejection sampling (Algorithm 2), Metropolis Hastings sampling (Algorithm 3), and Gibbs sampling (Algorithm 4) [4, 5]. Algorithm 1 describes the basic procedure to sample a query (without evidence) in a PASTA program. First, we keep a list of sampled worlds. Then, for a given number of iterations (the number of samples, \( Samples \)), we sample a world \( id \) with function SampleWorld by choosing a truth value for every probabilistic fact according to its probability. For every probabilistic fact, the process is the following: we sample a random value r between 0 and 1; if \(r < \varPi _i\) for a probabilistic fact \(f_i\) with associated probability \(\varPi _i\), \(f_i\) is set to true, otherwise to false. \( id \) is a binary string representing a world: if its ith digit is 0, the ith probabilistic fact (in order of appearance in the program) is false, and true otherwise. To clarify this, if we consider the program shown in Example 2, a possible world id could be 010, indicating that iron(1) is not selected, iron(2) is selected, and iron(3) is not selected. The probability of this world is \((1-0.2) \cdot 0.9 \cdot (1-0.6) = 0.288\). If we have already considered the currently sampled world, we look up in the list of sampled worlds whether it contributes to the lower or upper count (function GetContribution) and update the lower (\( lp \)) and upper (\( up \)) counters accordingly. In particular, GetContribution returns two values, one for the lower and one for the upper probability, each of which is either 0 (the world \( id \) does not contribute to the probability) or 1 (the world \( id \) contributes to the probability). If, instead, the world had never been encountered before, we assign a probability value to the probabilistic facts in the program according to the sampled truth values (probability \(\varPi \) for \(\top \), \(1-\varPi \) for \(\bot \)) (function SetFacts), we compute its contribution to the lower and upper probabilities (function CheckLowerUpper), and store the result in the list of already encountered worlds (function InsertContribution). In this way, if we sample the same world again, there is no need to recompute its contribution to the two probability bounds. Once we have taken a number of samples equal to \( Samples \), we simply return the lower and upper counters divided by \( Samples \).

figure e
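A condensed Python sketch of the procedure just described could look as follows. The helper lower_upper_contribution is hypothetical and stands in for the combination of SetFacts and CheckLowerUpper (i.e., it computes the answer sets of a world and checks the query); the names are ours.

```python
import random

def sample_world(probs):
    """Sample a truth value for every probabilistic fact and return the
    world id as a binary string (1 = fact true, 0 = fact false)."""
    return "".join("1" if random.random() < p else "0" for p in probs)

def unconditional_sampling(probs, lower_upper_contribution, n_samples):
    """lower_upper_contribution(wid) returns (1, 1), (0, 1), or (0, 0),
    as described for GetContribution/CheckLowerUpper."""
    sampled = {}                      # world id -> (lower, upper) contribution
    lp = up = 0
    for _ in range(n_samples):
        wid = sample_world(probs)
        if wid not in sampled:        # memoize, so each world is solved once
            sampled[wid] = lower_upper_contribution(wid)
        l, u = sampled[wid]
        lp, up = lp + l, up + u
    return lp / n_samples, up / n_samples
```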

When we also need to account for the evidence, other algorithms should be applied, such as rejection sampling, described in Algorithm 2: as in Algorithm 1, we maintain a list of the already sampled worlds. Moreover, we need four variables to store the joint lower and upper counters of q and e (\( lpqe \) and \( upqe \)) and of \(\lnot q\) and e (\( lpnqe \) and \( upnqe \)), see Eq. 1. Then, with the same procedure as before, we sample a world. If we have already considered it, we retrieve its contribution from the \( sampled \) list. If not, we set the probabilistic facts according to the sampled choices, compute the contribution to the four values, update them accordingly, and store the results. \( lpqe _0\) is 1 if both the evidence and the query are present in all the answer sets of the current world, 0 otherwise. \( upqe _0\) is 1 if both the evidence and the query are present in at least one answer set of the current world, 0 otherwise. \( lpnqe _0\) is 1 if the evidence is present and the query is absent in all the answer sets of the current world, 0 otherwise. \( upnqe _0\) is 1 if the evidence is present and the query is absent in at least one answer set of the current world, 0 otherwise. Finally, we return the four counters combined as in Eq. 1.

figure f
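The conditional case can be sketched in the same style. Here sample_world is a zero-argument callable returning a world id and contribution is a hypothetical helper returning the four 0/1 values (lpqe, upqe, lpnqe, upnqe) described above; both are assumptions of this sketch.

```python
def rejection_sampling(sample_world, contribution, n_samples):
    """Approximate conditional lower/upper bounds following Eq. 1."""
    sampled = {}                      # world id -> (lpqe, upqe, lpnqe, upnqe)
    lpqe = upqe = lpnqe = upnqe = 0
    for _ in range(n_samples):
        wid = sample_world()
        if wid not in sampled:        # memoize the contribution of each world
            sampled[wid] = contribution(wid)
        a, b, c, d = sampled[wid]
        lpqe, upqe, lpnqe, upnqe = lpqe + a, upqe + b, lpnqe + c, upnqe + d
    # combine the four counters as in Eq. 1
    lower = lpqe / (lpqe + upnqe) if lpqe + upnqe > 0 else None
    upper = upqe / (upqe + lpnqe) if upqe + lpnqe > 0 else None
    return lower, upper
```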

In addition to rejection sampling, we developed two other algorithms that mimic Metropolis Hastings sampling (Algorithm 3) and Gibbs sampling (Algorithm 4). Algorithm 3 proceeds as follows; its overall structure is similar to that of Algorithm 2. However, after sampling a world, we count the number of probabilistic facts set to true (function CountTrueFacts). Then, with function CheckContribution we check whether the current world has already been considered. If so, we accept it with probability \(\min (1,N_0/N_1)\) (line 18), where \(N_0\) is the number of true probabilistic facts in the previous iteration and \(N_1\) is the number of true probabilistic facts in the current iteration. If the world was never considered before, we set the truth values of the probabilistic facts in the program (function SetFacts), compute its contribution with function CheckLowerUpper, save the values (function InsertContribution), and check whether the sample is accepted or not (line 27) with the same criterion just discussed. As for rejection sampling, we return the four counters combined as in Eq. 1.

figure g
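The acceptance step can be sketched as follows. The contribution helper is hypothetical as before, and on rejection this sketch follows the usual Metropolis Hastings convention of counting the retained previous sample again, which may differ in detail from Algorithm 3.

```python
import random

def mh_sampling(sample_world, contribution, n_samples):
    """sample_world(): world id as a binary string; contribution(wid):
    the four 0/1 values (lpqe, upqe, lpnqe, upnqe)."""
    sampled = {}
    lpqe = upqe = lpnqe = upnqe = 0
    current = sample_world()                  # initial sample
    n_prev = current.count("1")               # N0: true facts in previous sample
    for _ in range(n_samples):
        candidate = sample_world()
        n_cur = candidate.count("1")          # N1: true facts in current sample
        # accept the candidate with probability min(1, N0 / N1)
        if n_cur == 0 or random.random() < min(1.0, n_prev / n_cur):
            current, n_prev = candidate, n_cur
        if current not in sampled:
            sampled[current] = contribution(current)
        a, b, c, d = sampled[current]
        lpqe, upqe, lpnqe, upnqe = lpqe + a, upqe + b, lpnqe + c, upnqe + d
    lower = lpqe / (lpqe + upnqe) if lpqe + upnqe > 0 else None
    upper = upqe / (upqe + lpnqe) if upqe + lpnqe > 0 else None
    return lower, upper
```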

Finally, for Gibbs sampling (Algorithm 4), we first sample worlds until e is true (function TrueEvidence), saving, as before, the already encountered worlds. Once we get a world that satisfies this requirement, we switch the truth values of \( Block \) random probabilistic facts (function SwitchBlockValues, line 19) and check the contribution of this new world as in Algorithm 2. Also in this case, the returned value is computed as in Eq. 1.

figure h
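A sketch of this loop is given below; evidence_true(wid) stands in for TrueEvidence and contribution(wid) for the helpers of the previous sketches. Memoization of already encountered worlds and other details of Algorithm 4 are omitted here for brevity.

```python
import random

def flip_block(wid, block):
    """Flip the truth values of `block` randomly chosen probabilistic facts."""
    bits = list(wid)
    for i in random.sample(range(len(bits)), block):
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

def gibbs_sampling(sample_world, contribution, evidence_true, n_samples, block=1):
    """evidence_true(wid): True when the evidence e holds in world wid."""
    lpqe = upqe = lpnqe = upnqe = 0
    for _ in range(n_samples):
        wid = sample_world()
        while not evidence_true(wid):         # resample until e is true
            wid = sample_world()
        wid = flip_block(wid, block)          # switch Block truth values
        a, b, c, d = contribution(wid)
        lpqe, upqe, lpnqe, upnqe = lpqe + a, upqe + b, lpnqe + c, upnqe + d
    lower = lpqe / (lpqe + upnqe) if lpqe + upnqe > 0 else None
    upper = upqe / (upqe + lpnqe) if upqe + lpnqe > 0 else None
    return lower, upper
```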

5 Experiments

We implemented the previously described algorithms in Python 3 and integrated them into the PASTA solver [3]. We use clingo [12] to compute the answer sets. To assess the performance, we ran multiple experiments on a computer with an Intel® Xeon® E5-2630v3 CPU running at 2.40 GHz and 16 GB of RAM. Execution times are computed with the bash command time; the reported values are those of the real field.

We consider two datasets with different configurations. The first one, iron, contains programs with the structure shown in Example 2. In this case, the size of an instance indicates the number of probabilistic facts. The second dataset, smoke, describes a network where some people are connected by a probabilistic friendship relation. In this case, the size of an instance is the number of involved people. Some of the people in the network smoke. A conditional states that at least 40% of the people who have a friend who smokes are smokers. An example of an instance of size 5 is

figure i

The probabilistic facts follow a Barabási-Albert preferential attachment model generated with the networkx [13] Python package. The number of nodes of the graph, n, is the size of the instance, while the number of edges used to attach a new node to existing nodes, m, is 3.
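For illustration, an instance could be generated along the following lines; the predicate name friend and the probability value 0.5 are assumptions of this sketch, mirroring the description above, and the script is ours rather than the actual dataset generator.

```python
import networkx as nx

def smoke_instance(n, m=3, prob=0.5, seed=42):
    """Generate probabilistic friendship facts from a Barabasi-Albert graph."""
    g = nx.barabasi_albert_graph(n, m, seed=seed)
    return [f"{prob}::friend({a},{b})." for a, b in g.edges()]

for fact in smoke_instance(5):
    print(fact)
```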

In a first set of experiments, we fixed the number of probabilistic facts (for iron) and the number of people (for smoke) to 10 and plotted the computed lower and upper probabilities and the execution time as the number of samples increases. All the probabilistic facts have probability 0.5. The goal of these experiments is to check how many samples are needed to converge and how the execution time varies with the number of samples, for a fixed program. For the iron dataset, the query q is rusty(1) and the evidence e is iron(2). Here, the exact values are \(\underline{\textrm{P}}(q) = 0.009765625\), \(\overline{\textrm{P}}(q) = 0.5\), \(\underline{\textrm{P}}(q \mid e) = 0.001953125\), and \(\overline{\textrm{P}}(q \mid e) = 0.5\). For the smoke dataset, the program has 21 connections (probabilistic facts): node 0 is connected to all the other nodes, node 2 to nodes 4, 6, and 8, node 3 to nodes 4, 5, and 7, node 4 to nodes 5, 6, 7, and 9, and node 7 to nodes 8 and 9. All the connections have probability 0.5. Nodes 2, 5, 6, 7, and 9 certainly smoke. The query q is smokes(8) and the evidence is smokes(4). The targets are \(\underline{\textrm{P}}(q) = 0.158\), \(\overline{\textrm{P}}(q) = 0.75\), \(\underline{\textrm{P}}(q \mid e) = 0\), and \(\overline{\textrm{P}}(q \mid e) = 0.923\). Results for all four algorithms are shown in Figs. 1 (iron) and 2 (smoke). For Gibbs sampling, we set \( Block \) (i.e., the number of probabilistic facts to resample) to 1. All the algorithms seem to stabilize after a few thousand samples on both datasets. For iron, MH seems to slightly overestimate the upper probability. Gibbs and rejection sampling require a few seconds to take \(10^6\) samples, while Metropolis Hastings (MH) requires almost 100 s. However, for the smoke dataset, MH and rejection sampling have comparable execution times (more than 100 s for \(5 \cdot 10^5\) samples) while Gibbs is the slowest of the three. This may be due to the low probability of the evidence.

Fig. 1.

Comparison of the sampling algorithms on the iron dataset. Solid lines show the results for PASTA, dashed lines those for PASOCS.

Fig. 2.

Comparison of the sampling algorithms on the smoke dataset. Solid lines show the results for PASTA, dashed lines those for PASOCS. In Fig. 2a, the target line at 0.75 is for the upper unconditional probability.

Fig. 3.

Comparison of PASTA and PASOCS with Gibbs sampling on the iron dataset.

Fig. 4.

Comparison of PASTA and PASOCS with Gibbs sampling and MH on the smoke dataset.

We compared our results with PASOCS [23] (after translating by hand the probabilistic conditionals into PASP rules). We used the following settings: -n_min n -n_max -1 -ut -1 -p 300 -sb 1 -b 0, where n is the number of considered samples, n_min is the minimum number of samples, n_max is the maximum number of samples (-1 deactivates it), ut is the uncertainty threshold (-1 deactivates it), p is the percentile (since PASOCS estimates values with Gaussians), sb is the number of samples to run at once during sampling, and b is the burn-in value for Gibbs and Metropolis Hastings sampling (0 deactivates it). We do not select parallel solving, since PASTA is not parallelized yet (this may be the subject of future work). PASOCS adopts a different approach for conditional inference: at each iteration, instead of sampling a world, it updates the probabilities of the probabilistic facts and samples a world using these values. In Fig. 1b, the execution times of PASOCS for all the tested algorithms are comparable and seem to grow exponentially with the number of samples. The lines for rejection and unconditional sampling for PASTA overlap, and so do the lines for MH, Gibbs, and rejection sampling for PASOCS. PASOCS also seems to be slower on the smoke dataset (Fig. 2b), but the difference with PASTA is smaller. We also plotted how PASTA and PASOCS perform in terms of the number of samples required to converge. In Fig. 3, we compare Gibbs sampling on the iron dataset. Here, PASTA seems to be more stable on both the lower and the upper probability. However, even with 5000 samples, both still underestimate the lower probability, although the values involved are very small. In Fig. 4, we compare PASOCS and PASTA with Gibbs sampling and Metropolis Hastings sampling on the smoke dataset. Also here, PASTA seems more stable, but neither system has completely settled on the true probability after 5000 samples. Finally, Fig. 5 compares the unconditional sampling of PASTA and PASOCS on both datasets. Here, the results are similar: after approximately 3000 samples, the computed probability seems to have stabilized. In another experiment, we fixed the number of samples to 1000, increased the size of the instances for the iron dataset, and plotted how the execution time varies for PASTA and PASOCS. The goal is to check how the execution time varies as the number of probabilistic facts increases. The query is rusty(1). Results are shown in Fig. 6. For PASOCS, we get a memory error starting from size 32. PASTA requires approximately 500 s to take 1000 samples on a program with the structure of Example 2 and 1500 probabilistic facts. Note again that, during sampling, we assume that every world has at least one answer set: checking this would require generating all the worlds, and inference would clearly not scale.

Fig. 5.

Comparison of unconditional sampling on the iron and the smoke datasets.

Fig. 6.

Comparison between PASTA and PASOCS by increasing the number of probabilistic facts for the iron dataset.

6 Conclusions

In this paper, we propose four algorithms to perform approximate inference, both conditional and unconditional, in PASTA programs. We tested execution time and accuracy also against the PASOCS solver (after manually converting the probabilistic conditionals into PASP rules). Empirical results show that our algorithms reach a comparable accuracy in a lower execution time. As future work, we plan to further investigate the convergence of the algorithms and to develop approximate methods for abduction [1, 2] in PASTA programs.