onto2problog: A Probabilistic Ontology-Mediated Querying System using Probabilistic Logic Programming

We present onto2problog, a tool that supports ontology-mediated querying of probabilistic data via probabilistic logic programming engines. Our tool supports conjunctive queries on probabilistic data under ontologies encoded in the description logic $\mathcal{ELH}^{dr}$, thus capturing a large part of the OWL 2 EL profile.


Introduction
The amount of data collected has grown considerably in recent years, and with it the uncertainty in this data. For example, sophisticated NLP systems like the Never-Ending Language Learner (NELL) [15] are capable of searching the Internet continuously, extracting information from text into a computer-readable logical form. Yet systems like this are not perfectly accurate; indeed, NELL assigns a score to each extracted fact representing the system's confidence in its truth. These scores can be viewed as degrees of belief in the truth of these facts: in other words, probabilities in the Bayesian sense. Typically, these probabilistic facts are assumed to be mutually independent, resulting in a (tuple-independent) probabilistic database [19].
However, in many cases we have some supplementary domain knowledge in the form of an ontology, which can be considered in conjunction with the probabilistic facts. Motivated by this, Jung and Lutz introduced the framework of ontology-mediated querying of probabilistic data (OMQPD): given a set of independent probabilistic facts, an ontology, and a query, evaluate the query on the facts taking into account the supplementary knowledge from the ontology [12]. It is important to note that in this line of work the closed-world assumption that is usually adopted in databases is replaced by the open-world assumption, that is, the ontology might imply facts that are not explicitly stated in the initial set provided.
For example, suppose we have two probabilistic facts:

$$0.9 :: \mathsf{DepartmentHead}(\mathsf{alice}) \qquad 0.4 :: \mathsf{mentors}(\mathsf{alice}, \mathsf{charlie})$$

This expresses the knowledge that Alice is a department head with probability 0.9, and, independently, Alice is a mentor of Charlie with probability 0.4. It gives rise to a distribution on four deterministic databases (Table 1): one in which neither fact is true (with probability (1 − 0.9)(1 − 0.4) = 0.06), one in which both facts are true ((0.9)(0.4) = 0.36), and two in which exactly one is true ((0.9)(1 − 0.4) = 0.54 and (1 − 0.9)(0.4) = 0.04).

Now suppose that we also have the following (entirely deterministic) ontology expressed in the description logic $\mathcal{EL}$:

$$\mathsf{DepartmentHead} \sqsubseteq \mathsf{Professor} \tag{1}$$

$$\mathsf{Professor} \sqcap \exists \mathsf{mentors}.\top \sqsubseteq \mathsf{AcademicSupervisor} \tag{2}$$

Intuitively, this ontology expresses that:
1. All department heads are professors
2. A professor who mentors someone is an academic supervisor

Assume we wish to pose the query:

$$\Phi = \mathsf{AcademicSupervisor}(\mathsf{alice})$$

Evaluating the query directly on the set of probabilistic facts above returns a probability of zero, as information relating to the class "AcademicSupervisor" does not appear anywhere in the set. But if we evaluate it in combination with the ontology, we get a probability of 0.36, corresponding to the world in which Alice is both a department head and a mentor of Charlie. Thus, the addition of an ontology can change the results of our query, and in particular, reduce the uncertainty. This underpins the idea of OMQPD.
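The possible-world reading of this example can be checked by brute force. The following is a minimal Python sketch, not part of onto2problog: the string encoding of facts and the helper names (`possible_worlds`, `entails_supervisor`) are assumptions made for illustration, and the two ontology axioms are hard-coded rather than read from an ontology file.

```python
from itertools import product

# Independent probabilistic facts from the running example.
facts = {
    "DepartmentHead(alice)": 0.9,
    "mentors(alice, charlie)": 0.4,
}

def possible_worlds(prob_facts):
    """Yield every subset of the facts together with its probability."""
    names = list(prob_facts)
    for bits in product([False, True], repeat=len(names)):
        world = frozenset(n for n, keep in zip(names, bits) if keep)
        prob = 1.0
        for n, keep in zip(names, bits):
            prob *= prob_facts[n] if keep else 1.0 - prob_facts[n]
        yield world, prob

def entails_supervisor(world):
    """Hard-coded use of the two example axioms (an illustrative shortcut)."""
    professor = "DepartmentHead(alice)" in world             # axiom 1
    return professor and "mentors(alice, charlie)" in world  # axiom 2

worlds = list(possible_worlds(facts))

# Without the ontology, the query fact never occurs in any world.
p_without_ontology = sum(p for w, p in worlds if "AcademicSupervisor(alice)" in w)
# With the ontology, we sum over the worlds that entail the query.
p_with_ontology = sum(p for w, p in worlds if entails_supervisor(w))

for world, prob in worlds:
    print(sorted(world), round(prob, 2))
print(round(p_without_ontology, 2), round(p_with_ontology, 2))
```

Enumerating all subsets is exponential in the number of facts, so this only serves to make the semantics concrete on a toy input.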
To the best of our knowledge there are so far only preliminary implementations realizing this framework in practice, such as the one proposed by Schoenfisch and Stuckenschmidt [18]. Unfortunately, this system is incomplete in the sense that it only works for certain safe combinations of query and ontology, and only for ontologies in DL-Lite [2]. On the other hand, Zese et al. [23] presented semantics for DISPONTE knowledge bases and, based on two algorithms (BUNDLE and TRILL), an implementation for inference on these knowledge bases. DISPONTE knowledge bases are slightly different from the framework considered here in the sense that each axiom in the knowledge base (both facts and ontology) is annotated with an independent probability. They use a type-based semantics orthogonal to ours and thus obtain different probabilities for queries. For an overview of other combinations of uncertainty and description logics, we refer the interested reader to (the related work section of) [10].
Here, we propose the tool onto2problog for the task of OMQPD when the ontology is formulated in the description logic $\mathcal{ELH}^{dr}$ and the query is a conjunctive query. Conjunctive queries are a common query language and subsume, for example, the query $\Phi = \mathsf{AcademicSupervisor}(\mathsf{alice})$ above, but can be more complex, such as

$$\varphi(x) = \exists y.\, \mathsf{DepartmentHead}(x) \wedge \mathsf{mentors}(y, x),$$

which asks for all department heads who are mentored by someone.

Further, $\mathcal{ELH}^{dr}$ (which underlies the OWL 2 EL profile [16]) is the extension of $\mathcal{EL}$ [3] with domain and range restrictions as well as role hierarchies. Thus, beyond statements like (1) and (2) above, in $\mathcal{ELH}^{dr}$ we can write statements like

$$\mathsf{dom}(\mathsf{mentors}) \sqsubseteq \mathsf{PhD} \tag{3}$$

$$\mathsf{ran}(\mathsf{mentors}) \sqsubseteq \mathsf{Student} \tag{4}$$

$$\mathsf{mentors} \sqsubseteq \mathsf{manages} \tag{5}$$

expressing that:
3. Anyone who mentors has a PhD
4. Anyone who is mentored is a student
5. Someone who mentors a person also manages that person

In contrast to previous work, our tool is complete in the sense that it can process all combinations of a query and an ontology. Our implementation is based on the adaptation of the combined approach to ontology-mediated querying over deterministic data [14] to the probabilistic setting [20]. It therefore reduces OMQPD in $\mathcal{ELH}^{dr}$ to the task of marginal inference in a probabilistic logic program, which has an extensive literature surrounding it with many practical techniques available. In principle, this reduction can be used on top of any off-the-shelf probabilistic logic programming engine; we chose ProbLog 2 [8] for our implementation due to its flexibility and widespread use.

In this paper, we first give some background on ontology-mediated querying of probabilistic data, probabilistic databases, and probabilistic logic programs. We then describe the implementation of our system and show how it can be used. Finally, we evaluate our system on the Lehigh University Benchmark. For the technical details of our approach, we refer the reader to our earlier conference paper [20].

Background
In this section, we provide the formal background of ontology-mediated query answering over probabilistic data. We start by reviewing the description logic $\mathcal{ELH}^{dr}$.

Ontologies in $\mathcal{ELH}^{dr}$
Fix disjoint countably infinite sets of concept and role names $N_C$ and $N_R$, respectively. Then $\mathcal{EL}$-concepts are formed according to the syntax rule

$$C, D ::= \top \mid A \mid C \sqcap D \mid \exists r.C$$

where $A \in N_C$ and $r \in N_R$. An $\mathcal{ELH}^{dr}$-ontology (hereafter ontology) $\mathcal{T}$ is a set of concept inclusions $C \sqsubseteq D$, role inclusions $r \sqsubseteq s$, domain restrictions $\mathsf{dom}(r) \sqsubseteq C$, and range restrictions $\mathsf{ran}(r) \sqsubseteq C$, where $C$ and $D$ are $\mathcal{EL}$-concepts and $r, s \in N_R$. An ABox $\mathcal{A}$ is a finite set of concept assertions $A(a)$ and role assertions $r(a, b)$ where $A \in N_C$, $r \in N_R$, and $a, b$ range over a countably infinite set of individual names $N_I$. We denote with $\mathsf{ind}(\mathcal{A})$ the set of all individual names that occur in $\mathcal{A}$.

The semantics of $\mathcal{ELH}^{dr}$ is defined as usual in terms of interpretations $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$; we elide a full description here and instead refer the reader to Baader et al. [4] for details. We use standard terminology, e.g., $\mathcal{I}$ is a model of $\mathcal{T}$ or $\mathcal{A}$ if it satisfies all the concept and role inclusions as well as domain and range restrictions in $\mathcal{T}$, or all the assertions in $\mathcal{A}$, respectively.
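To make the concept syntax and the notion of satisfying a concept inclusion concrete, here is a small Python sketch that evaluates EL-concepts in a finite interpretation. It follows the standard semantics of ⊤, concept names, conjunction, and existential restriction; the class and function names (`Top`, `Name`, `And`, `Exists`, `extension`, `satisfies_inclusion`) are illustrative choices and do not come from onto2problog.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Top:
    """The top concept, satisfied by every domain element."""

@dataclass(frozen=True)
class Name:
    name: str

@dataclass(frozen=True)
class And:
    left: "Concept"
    right: "Concept"

@dataclass(frozen=True)
class Exists:
    role: str
    filler: "Concept"

Concept = Union[Top, Name, And, Exists]

def extension(concept, domain, concept_ext, role_ext):
    """Return the set of domain elements that satisfy the given EL-concept."""
    if isinstance(concept, Top):
        return set(domain)
    if isinstance(concept, Name):
        return set(concept_ext.get(concept.name, set()))
    if isinstance(concept, And):
        return (extension(concept.left, domain, concept_ext, role_ext)
                & extension(concept.right, domain, concept_ext, role_ext))
    if isinstance(concept, Exists):
        fillers = extension(concept.filler, domain, concept_ext, role_ext)
        return {d for (d, e) in role_ext.get(concept.role, set()) if e in fillers}
    raise TypeError(f"not an EL-concept: {concept!r}")

def satisfies_inclusion(lhs, rhs, domain, concept_ext, role_ext):
    """Check whether the interpretation satisfies the inclusion lhs ⊑ rhs."""
    return (extension(lhs, domain, concept_ext, role_ext)
            <= extension(rhs, domain, concept_ext, role_ext))

# A finite interpretation for the running example.
domain = {"alice", "charlie"}
concept_ext = {"Professor": {"alice"}, "AcademicSupervisor": {"alice"}}
role_ext = {"mentors": {("alice", "charlie")}}

lhs = And(Name("Professor"), Exists("mentors", Top()))
print(satisfies_inclusion(lhs, Name("AcademicSupervisor"), domain, concept_ext, role_ext))
```

Checking a single finite interpretation is of course not the same as reasoning over all models; the sketch only illustrates what it means for one interpretation to satisfy an inclusion.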

Ontology-Mediated Querying over Probabilistic Data
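Let $N_V$ denote a countably infinite set of variables disjoint from $N_I$. Then $N_T = N_V \cup N_I$ forms the set of terms. A conjunctive query (CQ) is a first-order formula

$$\varphi(\mathbf{x}) = \exists \mathbf{y}\, \psi(\mathbf{x}, \mathbf{y})$$

where $\mathbf{x}$ and $\mathbf{y}$ are tuples of variables in $N_V$, and $\psi(\mathbf{x}, \mathbf{y})$ is a conjunction of atoms over signature $N_C \cup N_R$ using terms from $N_T$, but only variables from $\mathbf{x}$ and $\mathbf{y}$. We drop the free variables $\mathbf{x}$ of $\varphi(\mathbf{x})$ whenever no confusion can arise. An ontology-mediated query (OMQ) is a pair $(\mathcal{T}, \varphi)$ of an ontology $\mathcal{T}$ and a CQ $\varphi$. Given an ABox $\mathcal{A}$ and an OMQ $(\mathcal{T}, \varphi)$, we say that a tuple $\mathbf{a}$ of individuals from $\mathcal{A}$ is a certain answer for $(\mathcal{T}, \varphi)$ over $\mathcal{A}$ if $\mathcal{I} \models \varphi(\mathbf{a})$ for every model $\mathcal{I}$ of $\mathcal{T}$ and $\mathcal{A}$. The set of all certain answers to $(\mathcal{T}, \varphi)$ is denoted by $\mathsf{cert}_{\mathcal{A}}(\mathcal{T}, \varphi)$.

Following [12], we use assertion-independent probabilistic ABoxes (ipABoxes) to model uncertain data. Formally, an ipABox is a pair $(\mathcal{A}, p)$ where $\mathcal{A}$ is a classical ABox and $p : \mathcal{A} \to [0, 1]$ assigns a probability to every assertion in $\mathcal{A}$. An ipABox $(\mathcal{A}, p)$ induces a distribution $p(\cdot)$ over possible ABoxes $\mathcal{A}' \subseteq \mathcal{A}$, which is defined by taking

$$p(\mathcal{A}') = \prod_{\alpha \in \mathcal{A}'} p(\alpha) \cdot \prod_{\alpha \in \mathcal{A} \setminus \mathcal{A}'} (1 - p(\alpha)) \tag{6}$$

for every $\mathcal{A}' \subseteq \mathcal{A}$. The probability of an answer $\mathbf{a}$ to an OMQ $(\mathcal{T}, \varphi)$ over an ipABox $(\mathcal{A}, p)$ is then defined as:

$$\Pr_{\mathcal{A},p}(\mathcal{T}, \varphi, \mathbf{a}) = \sum_{\mathcal{A}' \subseteq \mathcal{A},\ \mathbf{a} \in \mathsf{cert}_{\mathcal{A}'}(\mathcal{T}, \varphi)} p(\mathcal{A}')$$

The prime inference task here is to compute answer probabilities, that is, given an ipABox $(\mathcal{A}, p)$ and an OMQ $(\mathcal{T}, \varphi)$, compute $\Pr_{\mathcal{A},p}(\mathcal{T}, \varphi, \mathbf{a})$ for all answer candidates $\mathbf{a}$.

Coming back to the example from the introduction, the set of probabilistic facts corresponds to the ipABox $(\mathcal{A}, p)$ where

$$\mathcal{A} = \{\mathsf{DepartmentHead}(\mathsf{alice}),\ \mathsf{mentors}(\mathsf{alice}, \mathsf{charlie})\}$$

and

$$p(\mathsf{DepartmentHead}(\mathsf{alice})) = 0.9, \qquad p(\mathsf{mentors}(\mathsf{alice}, \mathsf{charlie})) = 0.4.$$

If we denote with $\mathcal{T}$ the ontology from the introduction and let $\varphi(x)$ be the query $\mathsf{AcademicSupervisor}(x)$, we have:

$$\Pr_{\mathcal{A},p}(\mathcal{T}, \varphi, \mathsf{alice}) = 0.9 \cdot 0.4 = 0.36$$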

Probabilistic Logic Programs
We introduce a variant of probabilistic logic programs that is sufficient for our purposes, though some systems support more features. A probabilistic logic program (PLP) is a triple $(F, p, \Pi)$ where $F$ is a set of facts, $p : F \to [0, 1]$ assigns a probability to every fact, and $\Pi$ is a stratified logic program consisting of rules of the form:

$$H \leftarrow B_1, \ldots, B_n, \neg B_{n+1}, \ldots, \neg B_m$$

where $H$ and all $B_i$ are relational atoms over terms.

The semantics of PLPs $(F, p, \Pi)$ is defined as follows. The pair $(F, p)$ induces a probability distribution $p(\cdot)$ over subsets $F' \subseteq F$ just as in Eq. (6). Moreover, given a set of facts $F$ and a set of rules $\Pi$, we denote with $\Pi(F)$ the minimal supported model of $F \cup \Pi$, obtained via the iterated fixed point construction of [1]. The prime inference task for PLPs is marginal inference, that is, given a PLP $(F, p, \Pi)$ and a distinguished goal predicate $G$, compute the probability of all ground facts $G(\mathbf{a})$ under $(F, p, \Pi)$, which is defined as:

$$\Pr_{F,p,\Pi}(G(\mathbf{a})) = \sum_{F' \subseteq F,\ G(\mathbf{a}) \in \Pi(F')} p(F')$$
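The definition of marginal inference can be read directly as an (exponential) algorithm. The sketch below implements it in Python under two simplifying assumptions that are not part of the definition above: rules are already ground, and they are negation-free, so the minimal supported model coincides with the least fixed point of the rules. The function names and the ground rules for the running example are illustrative.

```python
from itertools import combinations

def least_model(facts, rules):
    """Least fixed point of ground, negation-free rules over a set of facts."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in model and all(atom in model for atom in body):
                model.add(head)
                changed = True
    return model

def marginal(prob_facts, rules, goal):
    """Sum the probabilities of all fact subsets whose model contains the goal."""
    names = list(prob_facts)
    total = 0.0
    for size in range(len(names) + 1):
        for chosen in combinations(names, size):
            chosen = set(chosen)
            weight = 1.0
            for fact in names:
                weight *= prob_facts[fact] if fact in chosen else 1.0 - prob_facts[fact]
            if goal in least_model(chosen, rules):
                total += weight
    return total

# Ground rules mirroring the example ontology and query (written by hand).
prob_facts = {"departmenthead(alice)": 0.9, "mentors(alice,charlie)": 0.4}
rules = [
    ("professor(alice)", ["departmenthead(alice)"]),
    ("supervisor(alice)", ["professor(alice)", "mentors(alice,charlie)"]),
    ("g(alice)", ["supervisor(alice)"]),
]

print(round(marginal(prob_facts, rules, "g(alice)"), 2))
```

Practical engines avoid this enumeration, for example through knowledge compilation; the sketch is only meant to pin down what quantity is being computed.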

Our Tool: onto2problog
We have implemented a tool, onto2problog, that enables the use of probabilistic logic programming inference methods for computing answer probabilities of ontology-mediated queries over ipABoxes. The overall architecture of the inference pipeline supported by our tool is depicted in Figure 1. The input of the query answering task consists of the ontology-mediated query (a pair comprising a conjunctive query $\varphi$ and an $\mathcal{ELH}^{dr}$-ontology $\mathcal{T}$), and the probabilistic data given by an ipABox $(\mathcal{A}, p)$. Our tool processes only the ontology-mediated query $(\mathcal{T}, \varphi)$ and outputs a stratified logic program $\Pi_{\mathcal{T},\varphi}$ with a distinguished goal predicate $G$, which is equivalent to $(\mathcal{T}, \varphi)$ in the following sense:

($*$) for every ipABox $(\mathcal{A}, p)$ and answer candidate $\mathbf{a}$, we have

$$\Pr_{\mathcal{A},p}(\mathcal{T}, \varphi, \mathbf{a}) = \Pr_{\mathcal{A}',p,\Pi_{\mathcal{T},\varphi}}(G(\mathbf{a})),$$

where $\mathcal{A}'$ is essentially $\mathcal{A}$ in a slightly different representation (described below).

For more concrete information on the structure of $\Pi_{\mathcal{T},\varphi}$, we again refer the reader to our accompanying technical paper [20]. Here, we only stress that its size is polynomial in the sizes of $\mathcal{T}$ and $\varphi$, that the arity of the relation symbols used is bounded by the arity of the query, and that it has only two strata. The use of negation is required to exclude some spurious answers.
We will next give some details on our system and demonstrate its use with the example given earlier in the introduction. We have implemented onto2problog as a Python library, so that it can be called in a flexible and modular way. The ontology is specified in the OWL 2 ontology language (encoded in the standard RDF/XML format [17]), and the query is specified in a simple predicate logic-style syntax.
For example, the fragment of our ontology $\mathcal{T}$ expressing the knowledge that all department heads are professors is represented in RDF/XML as a subclass axiom relating the two corresponding classes. Now suppose we wish to use this ontology and pose the query from earlier in the paper asking for all department heads mentored by someone. We first specify the query $\varphi$ in our Python script and then load the relevant ontology $\mathcal{T}$. Given $\mathcal{T}$ and $\varphi$, onto2problog can then be used to compute the rewriting $\Pi_{\mathcal{T},\varphi}$ as described above (after first normalizing the ontology).

We are now ready to pair the rewriting with an ipABox $(\mathcal{A}, p)$. As mentioned above, the rewriting relies on a certain representation of the ABox, which we detail next. We represent ipABoxes as strings of probabilistic facts over two fixed predicate names, concept and role. For example, the facts DepartmentHead(alice) and mentors(alice, charlie) from earlier, along with their probabilities, are each specified as one probabilistic fact in such a string. Note that concept, role, and individual names all become constants under this representation.

Putting it all together, we get our final probabilistic logic program with the distinguished query predicate q (the name of our query above). We may now pass this to ProbLog to do the "heavy lifting" of computing the marginal probabilities for the distinguished predicate q in the constructed PLP, producing a list of tuples together with their respective probabilities. By construction, and in particular because of property ($*$) above, the results returned are the answers to the original ontology-mediated query task.
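As an illustration of the ABox representation over the two predicates concept and role, the following Python sketch assembles a small ProbLog program as a string. Several things here are assumptions rather than the tool's actual output: the argument order of concept and role, the quoting of names, the helper names, and above all the rules, which are written by hand for the simpler query AcademicSupervisor(x) from the introduction and are not the rewriting that onto2problog would produce.

```python
def quote(name):
    """Render a name as a quoted atom so capitalised names are not read as variables.

    Assumes the name itself contains no single quote.
    """
    return "'" + name + "'"

def ipabox_to_problog(concept_assertions, role_assertions):
    """Encode an ipABox as ProbLog probabilistic facts (assumed argument order)."""
    lines = []
    for (concept, individual), prob in concept_assertions.items():
        lines.append(f"{prob}::concept({quote(concept)}, {quote(individual)}).")
    for (role, source, target), prob in role_assertions.items():
        lines.append(f"{prob}::role({quote(role)}, {quote(source)}, {quote(target)}).")
    return "\n".join(lines)

# Hand-written stand-in for a rewriting; NOT generated by onto2problog.
HAND_WRITTEN_RULES = """\
concept('Professor', X) :- concept('DepartmentHead', X).
concept('AcademicSupervisor', X) :- concept('Professor', X), role('mentors', X, _).
q(X) :- concept('AcademicSupervisor', X).
query(q(X)).
"""

facts = ipabox_to_problog(
    {("DepartmentHead", "alice"): 0.9},
    {("mentors", "alice", "charlie"): 0.4},
)
program = facts + "\n" + HAND_WRITTEN_RULES
print(program)
```

If ProbLog is installed, a string like this can be evaluated through its documented Python interface, wrapping it in `PrologString` from `problog.program` and calling `get_evaluatable().create_from(...).evaluate()` from `problog`; for this toy input one would expect the answer alice with the probability 0.36 derived in the introduction.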
ProbLog supports marginal inference via a variety of different algorithms based on knowledge compilation [6], for example, to d-DNNF and SDD. It also supports forward inference in a process known as $T_P$-compilation [22]. Using ProbLog's Python interface, the user may select which inference method they wish to use in order to evaluate their query.
Our tool together with some documentation and an example is available online at http://www.informatik.uni-bremen.de/~jeanjung/onto2problog.html.

Evaluation
We evaluated onto2problog on a probabilistic version of the Lehigh University Benchmark (LUBM) [9]. LUBM is a benchmark for measuring the performance of semantic knowledge base systems in a consistent manner, comprising an ontology, a data generation tool, and a set of test queries. For the purposes of our experiments, we dropped transitive and inverse role declarations from the ontology in order to obtain a valid $\mathcal{ELH}^{dr}$-ontology. We also omitted queries 11, 12, and 13 from the test queries, as they are specifically designed to test reasoning with inverse and transitive role declarations. We set the parameters of the original data generation tool to generate an ABox of cardinality 15189. Of these statements, 12260 were role assertions and the remainder were concept assertions.
We wrote scripts to transform the assertions generated by the data generation tool to probabilistic facts in ProbLog. As the data from the tool is deterministic by default, we enriched the output by associating each ABox assertion $\alpha$ with an independent, uniformly drawn probability $p(\alpha) \sim U(0, 1)$ to obtain an ipABox. Using our tool, we then computed the rewritings of each of the LUBM queries with respect to the ontology. In a second step, we used ProbLog to compute the query probabilities.
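The enrichment step can be sketched in a few lines of Python. This is not the script used in the experiments: the function name `attach_probabilities`, the tuple encoding of assertions, and the sample individuals are hypothetical, and a seed is added only to make the sketch reproducible.

```python
import random

def attach_probabilities(assertions, seed=None):
    """Map each assertion to an independent draw from the uniform distribution on [0, 1)."""
    rng = random.Random(seed)
    return {assertion: rng.random() for assertion in assertions}

# Hypothetical deterministic assertions standing in for generated benchmark data.
assertions = [
    ("concept", "Professor", "prof0"),
    ("concept", "Student", "student0"),
    ("role", "advisor", "student0", "prof0"),
]

ipabox = attach_probabilities(assertions, seed=42)
for assertion, prob in ipabox.items():
    print(f"{prob:.3f}", assertion)
```

Because every assertion receives its own independent draw, the result satisfies the independence assumption that underlies ipABoxes.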
We used two different inference methods supported by ProbLog: (1) the "classic" ProbLog inference approach of cycle-breaking and compilation to sentential decision diagrams (SDDs) [21], and (2) $T_P$-compilation to SDDs, which avoids the cycle-breaking step altogether through forward inference [22]. Regardless of the method used, ProbLog first computes the ground program relevant to the query, that is, it transforms the probabilistic logic program into one using only ground atoms (while returning the same probabilities). We refer to this first phase as the grounding step. We refrain from giving more details on the methods (1) and (2) here and instead refer the reader to the aforementioned papers. The runtimes of the computation, divided into the relevant steps, are shown on the left side of Table 2.
We compared onto2problog to an alternative approach to query answering, based on first-order rewritings. Informally, first-order rewritings transform the input ontology-mediated query $(\mathcal{T}, \varphi)$ into an equivalent first-order query $\varphi_{\mathcal{T}}$ (or equivalently, a non-recursive datalog program). Although first-order rewritings have been used mainly in classical, that is, non-probabilistic, ontology-mediated query answering, it has been observed that they remain valid in the probabilistic setting of OMQPD [12]. In the case of the ontology language $\mathcal{EL}$, first-order rewritings are well-studied and it is known that they do not always exist [11]. Thus, they do not provide a complete tool for OMQPD. However, LUBM does not use all features provided by $\mathcal{ELH}^{dr}$. In fact, when dropping the role transitivity axioms, it is essentially formulated in a variant of DL-Lite, which implies that for all ontology-mediated queries based on LUBM, first-order rewritings do exist [2]. We therefore manually computed these rewritings and evaluated them using ProbLog as well. The results of this can be found on the right side of Table 2.
Interestingly, we see that most of the time is spent in the grounding step rather than the knowledge compilation step for each query. These steps correspond to the (deterministic) query answering phase and probability computation phase, respectively. This means that a large amount of time is taken in the computation of the relevant ground program, which is based on SLD-resolution. As SLD-resolution is theoretically not a hard task, we believe this to be the result of inefficiencies in ProbLog's implementation of grounding which become apparent when dealing with large programs like the ones here.
Moreover, the classic ProbLog inference method of cycle-breaking and compilation to SDDs consistently outperforms $T_P$-compilation. We also observe that first-order rewritings seem to have somewhat better inference times overall, as a trade-off for the incompleteness of this approach. We conclude that in practice, it may be best to first test the first-order rewritability of the query before resorting to the complete approach provided by onto2problog as a second option.
Finally, to get an indication of how our method scales, we examined the total inference time on different ipABox sizes for a subset of the queries in Table 2 for which inference appeared non-trivial. The total inference time here is the sum of grounding, cycle-breaking, and SDD compilation time. The results are shown in Fig. 2. We observe that the runtime increases with ipABox size, but the exact nature of the relationship appears to be dependent on the query in question: the increase is much steeper for query 8 than query 5, for example.

Conclusion and Future Work
We have presented our tool onto2problog for answering queries over incomplete probabilistic data in the presence of ontologies formulated in the description logic $\mathcal{ELH}^{dr}$. The evaluation shows potential for our tool to be used in at least small-scale scenarios. At the same time, it shows that the grounding step can be unexpectedly time-consuming. While it is known that grounding can be expensive in logic programming (see for instance [13] in the context of answer set programming), the PLP $\Pi_{\mathcal{T},\varphi}$ we produce should not be "dangerous" in this sense. We therefore conclude that this is a bottleneck in ProbLog's implementation, which indeed has been addressed in very recent work [7]. It would be interesting to combine their results with our efforts.
Beyond these improvements to the grounding step, we would like to extend our tool in three directions. First, we want to integrate first-order rewritings natively into our program; these exhibited better performance in some of our experiments, but are incomplete in general. Second, we want to investigate whether our approach can be extended to different ontology languages, such as those in the Datalog± family [5]. Finally, it would be interesting to see whether other capabilities of ProbLog, such as learning, can be transferred to the OMQPD setting.