Keywords

1 Introduction

Over the Web, we are witnessing a growing amount of data available in RDF. The LOD CloudFootnote 1 recorded that there were 1014 RDF data sources in 2014, covering various domains from life science to government. RDF follows the Open-World Assumption (OWA), assuming data is incomplete by defaultĀ [1]. Yet, given such a large quantity of RDF data, one might wonder if it is complete for some topics. As an illustration, consider Wikidata, a crowdsourced KB with RDF supportĀ [2]. For data about the movie Reservoir Dogs, Wikidata is incomplete, as it is missing the fact that Michael Sottile was acting in the movie.Footnote 2 On the other hand, for data about Apollo 11, it is the case that Neil Armstrong, Buzz Aldrin, and Michael Collins, who are recorded as crew members on Wikidata, are indeed all the crew (see Fig.Ā 1).Footnote 3 However, such completeness information is not recorded and thus it is left to the reader to decide whether some data on the Web is already complete.

Fig. 1.
figure 1

Wikidata is actually complete for all the Apollo 11 crew

Nevertheless, the availability of explicit completeness information can benefit data access over RDF data sources, commonly done via SPARQL queries. To illustrate, suppose that in addition to the complete data of Apollo 11 crew, Wikidata is also complete for the children of the three astronauts. Consequently, a user asking for the children of Apollo 11 crew should obtain not only query answers, but also the information that the answers are complete.

Motivated by the above rationales, we argue that it is important to describe completeness of RDF data and provide a technique to check query completeness based on RDF data with its completeness information. Such a check is called completeness entailment. In previous work, Darari et al.Ā [3] proposed a framework to describe completeness of RDF data and check query completeness based on completeness information. One fundamental limitation of this work is that the completeness check is agnostic of the content of the RDF data to which the completeness information is given, which results in weaker inferences. In the next section, we show that incorporating the content of RDF data may provide stronger inferences about query completeness. From the relational databases, Razniewski et al.Ā [4], proposed wildcard-based completeness patterns to provide completeness information over databases. To check query completeness, they defined a pattern algebra, which works upon database tables enriched with completeness patterns. The work incorporated database instances in completeness check, which are conceptually similar to the content of RDF data. However, only a sound algorithm was provided for completeness check.

In this work, we make the following contributions:

  1. 1.

    We provide a formalization of the completeness entailment problem for RDF data, and develop a sound and complete algorithm to solve the completeness entailment problem.

  2. 2.

    We identify a practically relevant fragment of completeness information suitable for crowdsourced, entity-centric RDF data sources like Wikidata, and develop an indexing technique to improve the feasibility of completeness entailment within the fragment.

  3. 3.

    We develop COOL-WD, a tool to manage completeness over Wikidata.

Our paper is structured as follows: Sect.Ā 2 presents a motivating scenario. In Sect.Ā 3, we provide a formalization to the completeness problem, followed by Sect.Ā 4 where we describe formal notions and a generic algorithm to check completeness entailment. SectionĀ 5 introduces a fragment of completeness information, suitable for crowdsourced, entity-centric RDF KBs like Wikidata, and presents an optimization technique for checking completeness entailment within this fragment. SectionĀ 6 reports our experimental evaluations. In Sect.Ā 7, we describe COOL-WD. Related work is given in Sect.Ā 8, whereas further discussion about our work is in Sect.Ā 9. SectionĀ 10 concludes the paper and sketches futureĀ work.

2 Motivating Scenario

Let us consider a motivating scenario for the main problem of this work, that is, the check of query completeness based on RDF data with its completeness information. Consider an RDF graph about the crew of Apollo 99 (or for short,Ā A99), a fictional space mission, and the children of the crew, as shown below.

figure a

Consider now the query \(Q_0\) asking for the crew of A99 and their children:

$$ Q_0 = (W_0, P_0) = (\{\, ?crew , ?child \,\}, \{\,( a99 , crew , ?crew ), ( ?crew , child , ?child )\,\}) $$

Evaluating \(Q_0\) over the graph gives only one mapping result, where the crew is mapped to Tony and the child is mapped to Toby. Up until now, nothing can be said about the completeness of the query since: (i) there can be another crew member of A99 with a child; (ii) Tony may have another child; or (iii) Ted may have a child.

Let us now consider the same graph as before, now enriched with completeness information, as shown below.

figure b

Informally, the above figure contains three completeness statements: \(C_1\), which states that the graph contains all crew members of A99; \(C_2\), which states the graph contains all Tonyā€™s children; and \(C_3\), which states the graph contains all Tedā€™s children (i.e., Ted has no children). With the addition of completeness information, let us see whether we may answer our query completely.

First, since we know that all the crew of A99 are Tony and Ted, the queryĀ \(Q_0\) then becomes equivalent to the following two queries:

  • \(Q_1 = (W_1, P_1) = (\{\,?\textit{child}\,\}, \{\,( a99 , crew , tony ), ( tony , child , ?child )\,\})\)

  • \(Q_2 = (W_2, P_2) = (\{\,?\textit{child}\,\}, \{\,( a99 , crew , ted ), ( ted , child , ?child )\,\})\)

where the variable \( ?crew \) is instantiated with Tony and Ted, respectively.

Moreover, for \(Q_2\) according to our graph and completeness information, Ted has no children. Thus, there is no way that \(Q_2\) will return an answer, so \(Q_2\) can be safely removed. Now, only \(Q_1\) is left. Again, from our graph and completeness information, we know that Toby is the only child of Tony. Thus, \(Q_1\) in turn is equivalent to the following boolean query:

$$\begin{aligned} Q_3 = (W_3, P_3) = (\{\}, \{\,( a99 , crew , tony ), ( tony , child , toby )\,\}) \end{aligned}$$

with the variable \( ?crew \) is instantiated to Tony and \( ?child \) to Toby. However, our graph is complete for \(Q_3\) as it contains the whole body of \(Q_3\). Since from our reasoning the query \(Q_3\) is equivalent to our original query \(Q_0\), we conclude that our graph with its completeness information can guarantee the completeness ofĀ \(Q_0\), that is, Toby is the only child of Tony, the only crew member of A99 having a child.

Note that using the data-agnostic approach fromĀ [3], it is not possible to derive the same conclusion. Without looking at the actual graph, we cannot conclude that Ted and Tony are all the crew members of Apollo 99. Consequently, just having the children of Tony and Ted complete does not help in reasoning about Apollo 99. In the rest of the paper, we discuss how the intuitive, data-specific reasoning from above can be formalized.

3 Formal Framework

In this section, we remind the reader of RDF and SPARQL, and provide formalization to our completeness problem.

RDF and SPARQL. Assume three pairwise disjoint infinite sets I (IRIs), L (literals), and VĀ (variables). A tuple \((s, p, o) \in I \times I \times (I \cup L)\) is called a triple, while a finite set of triples is called an RDF graph.

The standard query language for RDF graphs is SPARQLĀ [5]. At the core of SPARQL lies triple patterns, which are like triples, but also variables are allowed in each position. In this work, we focus on the conjunctive fragment of SPARQL, which uses sets of triple patterns, called basic graph patterns (BGPs). Evaluating a BGP P over G gives the set of mappings \({\llbracket P \rrbracket _{G}} = \{\,\mu \mid \mu P \subseteq G \text{ and } dom (\mu ) = \textit{var}(P)\,\}\). Over P, we define a freeze mapping \(\tilde{\textit{id}}\) that maps each variable \( ?v \) inĀ P to a fresh IRI \({\tilde{v}}\). From such a mapping, we construct the prototypical graph \({\tilde{P}} := \tilde{\textit{id}}\, P\) that encodes any possible graph that can satisfy the BGP. A query \(Q = (W, P)\) projects the evaluation results of a BGP P to a set W of variables. Moreover, a CONSTRUCT query has the abstract form \(({\texttt {CONSTRUCT}} \; P_1 \; P_2)\) where both \(P_1\) and \(P_2\) are BGPs. Evaluating a CONSTRUCT query over G results in a graph where \(P_1\) is instantiated with all the mappings in \({\llbracket P_2 \rrbracket _{G}}\).

Completeness Statements. A completeness statement describes which parts of an RDF graph are complete. We adopt the definition of completeness statements inĀ [3].

Definition 1

(Completeness Statement). A completeness statement C is defined as \(\textit{Compl}(P_C)\) where \(P_C\) is a non-empty BGP.

Example 1

The completeness statements in our motivating scenario are as follows: \(C_1 = Compl (\{\,( a99 , crew , ?c )\,\})\), \(C_2 = Compl (\{\,( tony , child , ?c )\,\})\), and \(C_3 = Compl (\{\,( ted , child , ?c )\,\})\).

To serialize completeness statements in RDF, we refer the reader toĀ [3]. Now, let us define the semantics of completeness statements. First, we associate the CONSTRUCT query \(Q_C = ({\texttt {CONSTRUCT}} \; P_C \; P_C)\) to each statement C. From now on, we fix a graph G upon which we describe its completeness. Given a graph \(G' \supseteq G\), we call \((G, G')\) an extension pair. In general, with no completeness statement, every extension pair is a valid extension pair, that is, \(G'\) is a possible state of the ideal world where all the information is complete. For instance, without completeness statement, in the motivating scenario, all of the following would be valid extensions: That there are more crew members of A99; that Tony has more children; and that Ted has children. Completeness statements restrict the valid extensions of a graph.

Definition 2

(Valid Extension Pairs). Let G be a graph and C a completeness statement. We say that an extension pair \((G, G')\) is valid wrt. C, written \((G, G') \models C\), if \({\llbracket Q_C \rrbracket _{G'}} \subseteq G\).

The above definition naturally extends to sets of completeness statements. Over a set \(\mathcal{C}\) of completeness statements and a graph G, we define the transfer operator \(T_\mathcal{C}(G) = \bigcup _{C \in \mathcal{C}} {\llbracket Q_C \rrbracket _{G}}\). We have the following characterization: for all extension pairs \((G, G')\), it is the case that \((G, G')\models \mathcal{C}\) iff \(T_{\mathcal{C}}(G') \subseteq G\).

Query Completeness. We write \( Compl (Q)\) to denote that a query Q is complete. Over an extension pair, a query is complete iff it returns the same results over both the graphs of the extension pair.

Definition 3

(Query Completeness). Let \((G, G')\) be an extension pair and Q be a query. We define that: \((G, G') \models Compl (Q)\) iff \({\llbracket Q \rrbracket _{G'}} = {\llbracket Q \rrbracket _{G}}\).Footnote 4

Completeness Entailment. We now define the main problem of our work, the completeness entailment.

Definition 4

(Completeness Entailment). Given a set \(\mathcal{C}\) of completeness statements, a graph G, and a query Q, we define that \(\mathcal{C}\) and G entail the completeness of Q, written as \(\mathcal{C}, G \models \textit{Compl}(Q)\), if for all extension pairs \((G, G') \models \mathcal{C}\), it holds that \((G, G') \models \textit{Compl}(Q)\).

In our motivating scenario, we have seen that the graph about the crew of A99 and the completeness statements there entail the completeness of the queryĀ \(Q_0\) asking for the crew of A99 and their children.

In this work, we assume bag semantics for query evaluation, which is the default of SPARQL.Footnote 5 Consequently, this allows us to focus on the BGPs used in the body of conjunctive queries for completeness entailment.

4 Checking Completeness Entailment

In this section, we present an algorithm for performing the completeness check as demonstrated in our motivating scenario.

4.1 Preliminaries

Before presenting the algorithm, we introduce important notions.

First, we need to have a notion for a BGP with a stored mapping from variable instantiations. Let P be a BGP and \(\mu \) be a mapping such that \( dom (\mu )\cap \textit{var}(P) = \emptyset \). We define the pair \((P, \mu )\) as a partially mapped BGP, which is a BGP with a stored mapping. Over a graph G, the evaluation of \((P, \mu )\) is defined as \({\llbracket (P, \mu ) \rrbracket _{G}} = \{\, \mu \cup \nu \mid \nu \in {\llbracket P \rrbracket _{G}}\,\}\). It is easy to see that \(P \equiv (P, \emptyset )\). Furthermore, we define the evaluation of a set of partially mapped BGPs over a graph G as the union of evaluating each of them over G.

Example 2

Consider our motivating scenario. Over the BGP \(P_0\) of the queryĀ \(Q_0\), instantiating the variable \( ?crew \) to \( tony \) results in the BGP \(P_1\) of the query \(Q_1\). Pairing \(P_1\) with this instantiation gives the partially mapped BGP \((P_1, \{\, ?crew \mapsto \textit{tony}\,\})\). Moreover, it is the case that \({\llbracket (P_1, \{\, ?crew \mapsto \textit{tony}\,\}) \rrbracket _{G}} = \{\,\{\, ?crew \mapsto \textit{tony}, ?child \mapsto \textit{toby}\,\}\,\}\).

Next, we would like to formalize the equivalence between partially mapped BGPs wrt. a set \(\mathcal{C}\) of completeness statements and a graph G.

Definition 5

(Equivalence under \(\mathcal{C}\) and G ). Let \((P, \mu )\) and \((P', \nu )\) be partially mapped BGPs, \(\mathcal{C}\) be a set of completeness statements, and G be a graph. We define that \((P, \mu )\) is equivalent to \((P', \nu )\) wrt. \(\mathcal{C}\) and G, written \((P, \mu ) \equiv _{\mathcal{C}, G} (P', \nu )\), if for all \((G, G') \models \mathcal{C}\), it holds that \({\llbracket (P, \mu ) \rrbracket _{G'}} = {\llbracket (P', \nu ) \rrbracket _{G'}}\).

The above definition naturally extends to sets of partially mapped BGPs.

Example 3

Consider all the queries in our motivating scenario. It is the case that \(\{\,(P_0, \emptyset )\,\} \equiv _{\mathcal{C}, G} \{\,(P_1, \{\, ?crew \mapsto \textit{tony}\,\}), (P_2, \{\, ?crew \mapsto \textit{ted}\,\})\,\} \equiv _{\mathcal{C}, G}\) \(\{\,(P_3, \{\, ?crew \mapsto \textit{tony}, ?child \mapsto \textit{toby}\,\})\,\}\).

Next, we would like to figure out which parts of a query contain variables that can be instantiated completely. For this reason, we define

$$ \textit{cruc}_{\mathcal{C}, G}(P) = P \cap {{\tilde{ id }}}^{-1}(T_{\mathcal{C}}(\tilde{P} \cup G)) $$

as the crucial part of P wrt. \(\mathcal{C}\) and G. It is the case that \(\mathcal{C}, G \models \textit{Compl}(\)) \({\textit{cruc}_{\mathcal{C}, G}(P)}\), that is, we are complete for the crucial part. Later on, we will see that the crucial part is used to ā€˜guideā€™ the instantiation process during the completeness entailment check.

Example 4

Consider the query \(Q_0 = (W_0, P_0)\) in our motivating scenario. We have that \(\textit{cruc}_{\mathcal{C}, G}(P_0) = P_0 \cap {{\tilde{ id }}}^{-1}(T_{\mathcal{C}}(\tilde{P}_0 \cup G)) = \{\,( a99 , crew , ?crew )\,\}\) with \({\tilde{ id }}= \{\, ?crew \mapsto \widetilde{ crew }, ?child \mapsto \widetilde{ child }\,\}\). Consequently, we can have a complete instantiation of the crew of A99.

The operator below implements the instantiations of a partially mapped BGP wrt. its crucial part.

Definition 6

(Equivalent Partial Grounding). Let \(\mathcal{C}\) be a set of completeness statements, G be a graph, and \((P, \nu )\) be a partially mapped BGP. We define the operator equivalent partial grounding:

$$\begin{aligned} \textit{epg}((P, \nu ), \mathcal{C}, G) = \{\, (\mu P, \nu \cup \mu ) \mid \mu \in {\llbracket \textit{cruc}_{\mathcal{C}, G}(P) \rrbracket _{G}}\,\}. \end{aligned}$$

The following lemma shows that such instantiations produce a set of partially mapped BGPs equivalent to the original partially mapped BGP, hence the name equivalent partial grounding. The lemma holds since the instantiation is done over the crucial part, which is complete wrt. \(\mathcal{C}\) and G.

Lemma 1

(Equivalent Partial Grounding). Let \(\mathcal{C}\) be a set of completeness statements, G be a graph, and \((P, \nu )\) be a partially mapped BGP. We have that

$$\begin{aligned} \{\,(P, \nu )\,\}\equiv _{\mathcal{C}, G}\textit{epg}((P, \nu ), \mathcal{C}, G). \end{aligned}$$

Example 5

Consider our motivating scenario. We have that:

  • \(\textit{epg}((P_2, \{\, ?crew \mapsto \textit{ted}\,\}), \mathcal{C}, G) = \emptyset \)

  • \(\textit{epg}((P_3, \{\, ?crew \mapsto \textit{tony}, ?child \mapsto \textit{toby}\,\}), \mathcal{C}, G) = \{(P_3, \{ ?crew \mapsto \textit{tony},\) \( ?child \mapsto \textit{toby}\})\}\)

  • \(\textit{epg}((P_0, \emptyset ), \mathcal{C}, G) = \{\,(P_1, \{\, ?crew \mapsto \textit{tony}\,\}), (P_2, \{\, ?crew \mapsto \textit{ted}\,\})\,\}\)

Generalizing from the example above, there are three cases of \(\textit{epg}((P, \nu ), \mathcal{C}, G)\):

  • If \({\llbracket \textit{cruc}_{\mathcal{C}, G}(P) \rrbracket _{G}} = \emptyset \), it returns an empty set.

  • If \({\llbracket \textit{cruc}_{\mathcal{C}, G}(P) \rrbracket _{G}} = \{\,\emptyset \,\}\), it returns \(\{(P, \nu )\}\).

  • Otherwise, it returns a non-empty set of partially mapped BGPs where some variables in P are instantiated.

From these three cases and the finite number of triple patterns with variables of a BGP, it holds that the repeated applications of the \(\textit{epg}\) operator, with the first and second cases above as the base cases, are terminating. Note that the difference between these two base cases is on the effect of their corresponding epg operations, as illustrated in ExampleĀ 5: for the first case, the epg operation returns an empty set, whereas for the second case, it returns back the input partially mapped BGP.

We define that a partially mapped BGP \((P, \nu )\) is saturated wrt. \(\mathcal{C}\) and G, if \(\textit{epg}((P, \nu ), \mathcal{C}, G) = \{\,(P, \nu )\,\},\) that is, if the second case above applies. Note that the notion of saturation is independent from the mapping in a partially mapped BGP: given a mapping \(\nu \), a partially mapped BGP \((P, \nu )\) is saturated wrt. \(\mathcal{C}\) andĀ G iff \((P, \nu ')\) is saturated wrt. \(\mathcal{C}\) andĀ G for any mapping \(\nu '\). Thus, wrt. \(\mathcal{C}\) andĀ G we say that a BGP P is saturated if \((P, \emptyset )\) is saturated.

The completeness checking of saturated BGPs is straightforward as we only need to check if they are contained in the graph G.

Proposition 1

(Completeness Entailment of Saturated BGPs). Let P be a BGP, \(\mathcal{C}\) be a set of completeness statements, and G be a graph. Suppose P is saturated wrt. \(\mathcal{C}\) and G. Then, it is the case that: \(\mathcal{C}, G \models \textit{Compl}(P)\) iff \(\tilde{P} \subseteq G\).

Based on the above notions, we are ready to provide an algorithm to check completeness entailment. The next subsection gives the algorithm.

4.2 Algorithm for Checking Completeness Entailment

Now we introduce an algorithm to compute all saturated, equivalent partial grounding results of a BGP wrt. \(\mathcal{C}\) and G. Following from PropositionĀ 1, we can then check whether all the resulting saturated BGPs are contained in the graphĀ G to see if the completeness entailment holds.

figure c

Consider a BGP \(P_\textit{orig}\), a set \(\mathcal{C}\) of completeness statements, and a graph G. The algorithm works as follows: First, we transform our original BGP \(P_\textit{orig}\) into its equivalent partially mapped BGP \((P_\textit{orig}, \emptyset )\) and put it in \(\mathbf {P_\textit{working}}\). Then, in each iteration of the while loop, we take and remove a partially mapped BGP \((P, \nu )\) from \(\mathbf {P_\textit{working}}\) via the method takeOne. Afterwards, we compute \(\textit{epg}((P,\nu ), \mathcal{C}, G)\). As discussed above there might be three result cases here: (i) If \(\textit{epg}((P,\nu ), \mathcal{C}, G) = \emptyset \), then simply we remove \((P, \nu )\) and will not consider it anymore in the later iteration; (ii) If \(\textit{epg}((P,\nu ), \mathcal{C}, G) = \{\,(P, \nu )\,\}\), that is, \((P, \nu )\) is saturated, then we collect the mapping \(\nu \) to the set \(\varOmega \); and (iii) otherwise, we add to \(\mathbf {P_\textit{working}}\) a set of partially mapped BGPs instantiated from \((P, \nu )\). We keep iterating until \(\mathbf {P}_\textit{working}= \emptyset \), and finally return the set \(\varOmega \).

The following proposition follows from the construction of the above algorithm and LemmaĀ 1.

Proposition 2

Given a BGP P, a set \(\mathcal{C}\) of completeness statements, and a graph G, the following properties hold:

  • For all \(\mu \in \textit{sat}(P, \mathcal{C}, G)\), it is the case that \(\mu P\) is saturated wrt. \(\mathcal{C}\) and G.

  • It holds that \(\{(P, \emptyset )\} \equiv _{\mathcal{C}, G} \{\,(\mu P, \mu ) \mid \mu \in \textit{sat}(P, \mathcal{C}, G)\,\}\).

From the above proposition, we can derive the following theorem, which shows the soundness and completeness of the algorithm to check completeness entailment.

Theorem 1

(Completeness Entailment Check). Let P be a BGP, \(\mathcal{C}\) be a set of completeness statements, and G be a graph. It holds that

$$ \mathcal{C}, G \models \textit{Compl}(P) \quad {iff} \quad { for\,\, all\,\, } \mu \in \textit{sat}(P, \mathcal{C}, G) \; . \; \widetilde{\mu P} \subseteq G.$$

Example 6

Consider our motivating scenario. We have that \(\textit{sat}(P_0, \mathcal{C}, G) = \{\, \{\, ?crew \mapsto \textit{tony}, ?child \mapsto \textit{toby}\,\} \,\}\). It is the case that for all \(\mu \in \textit{sat}(P_0, \mathcal{C}, G)\), it holds that \(\widetilde{\mu P_0} \subseteq G\). Thus, by TheoremĀ 1 the entailment \(\mathcal{C}, G \models \textit{Compl}(P_0)\) holds.

By reduction from validity of \(\forall \exists \)3SAT formula, one can show that the complexity of the completeness entailment is \(\varPi ^P_2\)-complete.

Corollary 1

(Complexity of Completeness Check). Deciding whether the entailment \(\mathcal{C}, G \models Compl (P)\) holds, given a set \(\mathcal{C}\) of completeness statements, a graph G, and a BGP P, is \(\varPi ^P_2\)-complete.

In what follows, we provide optimization techniques for the algorithm, which work for generic cases of completeness entailment.

Early Failure Detection. In our algorithm, the containment checks for saturated BGPs are done at the end. Indeed, if there is a single saturated BGP not contained in the graph, we cannot guarantee query completeness. Thus, instead of having to collect all saturated BGPs and then check the containment later on, we can improve the performance of the algorithm by performing the containment check right after the saturation check (Line 6 of the algorithm). So, as soon as there is a failure in the containment check, we stop the loop and conclude that the completeness entailment does not hold.

Completeness Skip. Recall the definition of the operator \(\textit{epg}((P, \nu ), \mathcal{C}, G) = \{\, (\mu P, \nu \cup \mu ) \mid \mu \in {\llbracket \textit{cruc}_{\mathcal{C}, G}(P) \rrbracket _{G}}\,\}\), which relies on the cruc operator. Now, suppose that \(\textit{cruc}_{\mathcal{C}, G}(P) = P\), that is, we are complete for the whole part of the BGP P. Thus, we actually do not have to instantiate P in the \(\texttt {epg}\) operator, since we know that the instantiation results are contained in G as the consequence of it being complete wrt. \(\mathcal{C}\) and G. In conclusion, whenever \(\textit{cruc}_{\mathcal{C}, G}(P) = P\), we just remove \((P, \nu )\) from \(\mathbf {P_\textit{working}}\) and thus skip its instantiations.

Despite these optimizations, for a large number of completeness statements, the completeness entailment check may take long. In the next section, we identify a practically relevant fragment of completeness statements, for which we develop an indexing technique to make the entailment check feasible.

5 A Practical Fragment of Completeness Statements

This section identifies SP-statements, a fragment of completeness statements possessing several properties that are suitable to be used in practice. In the next sections, we show by experimental evaluations the feasibility of this fragment with an indexing technique we describe below, and demonstrate a completeness tool for Wikidata using the fragment.

5.1 SP-Statements

An SP-statement \( Compl (\{\,(s, p, ?v )\,\})\) is a completeness statement with only one triple pattern in the BGP of the statement, where the subject and the predicate are IRIs, and the object is a variable.Footnote 6 In our motivating scenario, all the completeness statements are in fact SP-statements. The statements possess the following properties, which are suitable for practical use:

  • Having a simple structure, completeness statements within this fragment are easy to create and to be read. Thus, they are suitable for crowdsourced KBs, where humans are involved.

  • An SP-statement denotes the completeness of all the property values of the entity which is the subject of the statement. This fits entity-centric KBs like Wikidata, where data is organized into entities (i.e., each entity has its own data page).

  • Despite their simplicity, SP-statements can be used to guarantee the completeness of more complex queries such as queries whose length is greater than one (as illustrated by our motivating scenario).

5.2 SP-Indexing

We describe here how to optimize completeness entailment check with SP-statements. Recall our generic algorithm to check completeness entailment: InĀ the cruc operator within the epg operator (Line 5 of AlgorithmĀ 1), we have to compute \(T_\mathcal{C}(\tilde{P} \cup G)\), that is, evaluate all CONSTRUCT queries of the completeness statements in \(\mathcal{C}\) over the graph \(\tilde{P} \cup G\). This may be problematic if there are a large number of completeness statements in \(\mathcal{C}\). Thus, we want to avoid such costly \(T_\mathcal{C}\) applications. Given that completeness statements are of the form SP-statements, we may instead look for the statements having the same subject and predicate of the triple patterns in the BGP. The crucial part of the BGP P wrt. \(\mathcal{C}\) and G are the triple patterns with the matching subject and predicate of the completeness statements.

Proposition 3

Given a BGP P, a graph G, and a set \(\mathcal{C}\) of SP-statements, it is the case that

$$\begin{aligned} \textit{cruc}_{\mathcal{C}, G}(P) = \{\,(s, p, o)\! \in P \mid {there\, exists\, a\, statement\, } Compl (\{\,(s, p, ?v )\,\}) \in \mathcal{C}\,\}. \end{aligned}$$

From the above proposition, to get the crucial part, we only have to find an SP-statement with the same subject and predicate for each triple pattern of the BGP. In practice, we can facilitate this search using a standard hashmap, providing constant-time performance, also for other basic operations such as add and delete. The hashmap provides a mapping from the concatenation of the subject and the predicate of a statement to the statement itself. To illustrate, the hashmap of the completeness statements in our motivating scenario is as follows: \(\{\,\textit{a99-crew} \mapsto C_1, \textit{tony-child} \mapsto C_2, \textit{ted-child} \mapsto C_3\,\}\).

6 Experimental Evaluation

Now that we have an indexing technique for SP-statements, we want to see the performance of completeness check. To do so, we perform experimental evaluations with a realistic scenario, where we compare the runtime of completeness entailment when query completeness can be guaranteed (i.e., the success case), completeness entailment when query completeness cannot be guaranteed (i.e., the failure case), and query evaluation.

Experimental Setup. Our reasoning algorithm and indexing modules are implemented in Java using the Apache Jena library.Footnote 7 We use Jena-TDB as the triple store of our experiment. The SP-indexing is implemented using the standard Java hashmap, where the keys are strings, constructed from the concatenation of the subject and predicate of completeness statements, and the values are Java objects representing completeness statements. All experiments are done on a standard laptop with a 2.4 GHz Intel Core i5 and 8 GB of memory.

To perform the experiment, we need three ingredients: a graph, completeness statements, and queries. For the graph, we use the direct-statement fragment of the Wikidata graph, which does not include qualifiers nor references and consists of 100 mio triples.Footnote 8 The completeness statements and queries of this experiment are constructed based on the following pattern queries:

  1. 1.

    Give all mothers of mothers of mothers.

    \(P_1 = \{\,( ?v , P25 , ?w ), ( ?w , P25 , ?x ), ( ?x , P25 , ?y )\,\}\)

  2. 2.

    Give the crew of a thing, the astronaut missions of that crew, and the operator of the missions.

    \(P_2 = \{\,( ?v , P1029 , ?w ), ( ?w , P450 , ?x ), ( ?x , P137 , ?y )\,\}\)

  3. 3.

    Give the administrative divisions of a thing, the administrative divisions of those divisions, and their area.

    \(P_3 = \{\,( ?v , P150 , ?w ), ( ?w , P150 , ?x ), ( ?x , P2046 , ?y )\,\}\)

To generate queries, we simply evaluate each pattern query over the graph, and instantiate the variable \( ?v \) of each pattern query with the corresponding mappings from the evaluation. We record 5200 queries instantiated from \(P_1\), 57 queries from \(P_2\), and 475 queries from \(P_3\). Each pattern query has a different average number of query results: the instantiations of \(P_1\) give 1 result, those ofĀ \(P_2\) give 4 results, and those of \(P_3\) give 108 results on average.

To generate completeness statements, from each generated query, we iteratively evaluate each triple pattern from left to right, and construct SP-statements from the instantiated subject and the predicate of the triple patterns. This way, we guarantee that all the queries can be answered completely. We generate inĀ total around 1.7 mio statements, with 30072 statements for \(P_1\), 484 statements for \(P_2\), and 1682263 statements for \(P_3\). Such a large number of completeness statements would make completeness checks without indexing very slow: Performing just a single application of the \(T_\mathcal{C}\) operator with all these statements, which occurs in the cruc operator of the algorithm without SP-indexing, took about 20Ā min. Note that in a completeness check, there might be many \(T_\mathcal{C}\) applications.

Now we describe how to observe the behavior when queries cannot be guaranteed to be complete, that is, the failure case. In this case, we drop randomly 20Ā % of the completeness statements for each pattern query. To make up the statements we drop, we add dummy statements with the number equal to the number of dropped statements. This way, we ensure the same number of completeness statements for both the success and failure case.

For each query pattern, we measure the runtime of completeness check for both the success case and the failure case, and then query evaluation for the success case.Footnote 9 We take 40 sample queries for each pattern query, repeat each run 10 times, and report the median of these runs.

Experimental Results. The experimental results are shown in Fig.Ā 2. Note that the runtime is in log scale. We can see that in all cases, the runtime increases with the first pattern query having the lowest runtime, and the third pattern query having the highest runtime. This is likely due to the increased number of query results. We observe that in all pattern queries, completeness check when queries are guaranteed to be complete is slower than those whose completeness cannot be guaranteed. We suspect that this is because in the former case, variable instantiations have to be performed much more than in the latter case. In the latter case, as also described in Subsect.Ā 4.2, as soon as we find a saturated BGP not contained in the graph, we stop the loop in the algorithm and return false, meaning that the query completeness cannot be guaranteed.

Fig. 2.
figure 2

Experiment results of completeness entailment

In absolute scale, completeness check runs relatively fast, with 796 \(\upmu \)s forĀ \(P_1\), 5264 \(\upmu \)s for \(P_2\), and 35130 \(\upmu \)s for \(P_3\) in success case; and 485 \(\upmu \)s for \(P_1\), 903 \(\upmu \)s for \(P_2\), and 1209 \(\upmu \)s for \(P_3\) in failure case. Note that as mentioned before, completeness check without indexing is not feasible at all here, as there are a large number of completeness statements, making the \(T_\mathcal{C}\) application very slow (i.e., 20Ā min for a single application). For all pattern queries, however, query evaluation runs faster than completeness checking. This is because completeness checking may involve several query evaluations during the instantiation process with the epg operator.

To conclude, we have observed that completeness checking with a large number of SP-statements can be done reasonably fast, even for large datasets, by the employment of indexing. Also, we observe a clear positive correlation between the number of query results and the runtime of completeness checking. Last, performing completeness check when a query is complete is slower than that when a query cannot be guaranteed to be complete.

7 COOL-WD: A Completeness Tool for Wikidata

In this section, we introduce COOL-WD, a COmpleteness toOL for WikiData. The tool implements our completeness framework with SP-statements and focuses to provide completeness information for direct statements of Wikidata. While our implementation is based on Apache Jena, our approach can be applied also via other Semantic Web frameworks like Sesame.Footnote 10 Our tool is inspired by real, natural language completeness statements on Wikipedia, where completeness statements are given in a crowdsourced way.Footnote 11 The tool is available at http://cool-wd.inf.unibz.it/.

7.1 System Architecture

As shown in Fig.Ā 3, COOL-WD consists of three main components: user interface (UI), COOL-WD engine, and Wikidata-backend.

The first component is the UI, developed using GWT.Footnote 12 The UI provides the front-end interface for COOL-WD users, enabling them to search for Wikidata entities, look at facts about them enriched with completeness information, add/remove completeness statements, and check the completeness of a Wikidata query.

The second component is the engine, responsible for storing completeness statements using SQLite and performing completeness checks. We use optimization techniques as described in Subsect.Ā 4.2 and SP-indexing as described in Sect.Ā 5 to improve the performance of completeness checks.

The last component is the Wikidata-backend. It consists of two subcomponents: Wikidata API and Wikidata SPARQL endpoint. The API is used for the suggestions feature in searching for Wikidata entities, while the Wikidata SPARQL endpoint serves as the source of Wikidata facts to which completeness statements are given, and of query evaluation.

Fig. 3.
figure 3

COOL-WD architecture

7.2 Tool Usage

Here, we describe how one can use COOL-WD. From the landing page, the user is provided with a search bar for Wikidata entities. The search bar features auto-complete search suggestions, matching user keywords with the English labels of Wikidata entities. Clicking on a search suggestion gives the users the entity page, consisting of Wikidata facts about the entity with its completeness information. An example is shown in Fig.Ā 4, which is the Apollo 11 page with the complete crew. Complete properties are distinguished by the checkmark symbol. To add a completeness statement, a user simply clicks a question mark next to the respective properties of an entity. Additionally, it is possible to add provenance information about authors, timestamps, and references of the statement. Suppose the user would also like to add completeness statements for the astronaut missions of Neil Armstrong, Buzz Aldrin, and Michael Collins. Therefore, she may perform an analogous operation: go to the entity pages, and click the question mark next to the respective properties. We also have a feature to see all stored completeness statements over Wikidata filtered by properties on the aggregation page.

Fig. 4.
figure 4

The COOL-WD page of Apollo 11. Complete property values are with checkmarks.

If a user would like to evaluate a query and check its completeness, she has to go to the query page. Suppose she wants to know the crew of Apollo 11 and their astronaut missions. The user then specifies her query, and executes it. Instead of having only query answers, she can also see the completeness information of the answers.

8 Related Work

Data completeness concerns the breadth, depth, and scope of informationĀ [6]. In the relational databases, MotroĀ [7] and LevyĀ [8] were among the first to investigate data completeness. Motro developed a sound technique to check query completeness based on database views, while Levy introduced the notion of local completeness statements to denote which parts of a database are complete. Razniewski and NuttĀ [9] further extended their results by reducing completeness reasoning to containment checking, for which many algorithms are known, and characterizing the complexity of reasoning for different classes of queries. In terms of their terminology, our completeness entailment problem is one of QC-QC entailment under bag semantics, for which so far it was only known that it is in \(\varPi ^P_3\)Ā [10]. InĀ [4], Razniewski et al. proposed completeness patterns and defined a pattern algebra to check the completeness of queries. The work incorporated database instances, yet provided only a sound algorithm for completeness check.

We now move on to the Semantic Web. FĆ¼rber and HeppĀ [11] distinguished three types of completeness: ontology completeness, concerning which ontology classes and properties are represented; population completeness, referring to whether all objects of the real-world are represented; and property completeness, measuring the missing values of a specific property. In our work, SP-statements can be used to state the property completeness of an entity. Mendes et al.Ā [12] proposed Sieve, a framework for expressing quality assessment and fusion methods, where completeness is also considered. With Sieve, users can specify how to compute quality scores and express a quality preference specifying which characteristics of data indicate higher quality. In the context of crowdsourcing, Chu et al.Ā [13] developed KATARA, a hybrid data cleaning system, which not only cleans data, but may also add new facts to increase the completeness of the KB; whereas Acosta et al.Ā [14] developed HARE, a hybrid SPARQL engine to enhance answer completeness.

GalĆ”rraga et al. [15] proposed a rule mining system that is able to operate under the Open-World Assumption (OWA) by simulating negative examples using the Partial Completeness Assumption (PCA). The PCA assumes that if the dataset knows some r-attribute of x, then it knows all r-attributes of x. This heuristic was also employed by Dong et al.Ā [16] to develop Knowledge Vault, a Web-scale system for probabilistic knowledge fusion. In their paper, they used the term Local Closed-World Assumption (LCWA).

9 Discussion

We discuss here various aspects of our work: sources of completeness statements, completeness statements with provenance, and no-value information.

Sources of Completeness Statements. As demonstrated by COOL-WD, one way to provide completeness statements is via crowdsourcing. For domain-specific data like biology and archeology, domain experts may be a suitable source of completeness statements. An automated way to add completeness statements can also be leveraged by using NLP techniques to extract natural language completeness statements already available on the Web: around 13000 Wikipedia pages contain the keywords ā€œcomplete list ofā€ and ā€œlist is completeā€, while IMDb provides complete cast information with the keywords ā€œverified as completeā€ for some movies like Reservoir Dogs.Footnote 13

Completeness Statements with Provenance. Just as data can be wrong, completeness statements can be wrong, too. Moreover, as data may change over time, completeness statements can be out-of-date. As a possible solution, one can add provenance information. Adding information about the author and reference of completeness statements may be useful to check the correctness of the statements, while attaching timestamps would provide timeliness information to the statements.

No-Value Information. Completeness statements can also be used to represent the non-existence of information. For example, in our motivating scenario, there is the completeness statement about the children of Ted with no corresponding data in the graph. In this case, we basically say that Ted has no children. As a consequence of having no-value information, we can be complete for queries despite having the empty answer. Such a feature is similar to that proposed inĀ [17]. The only difference is that here we need to pair completeness statements with a graph that has no corresponding data captured by the statements, while in that work, no-value statements are used to directly say that some parts of data do not exist.

10 Conclusions and Future Work

The availability of an enormous amount of RDF data calls for better data quality management. In this work, we focus on the data quality aspect of completeness. We develop a technique to check query completeness based on RDF data with its completeness information. To increase the practical benefits of our framework, we identify a practically relevant fragment of completeness information upon which an indexing can be implemented to optimize completeness check, and develop COOL-WD, a completeness management tool for Wikidata.

For future work, we would like to investigate indexing techniques for more general cases. One challenge here is that how to index the arbitrary structure of completeness statements. Another plan is to develop a technique to extract completeness statements on the Web. To do so, we in particular want to detect if a Web page contains completeness statements in natural language, and transform them into RDF-based completeness statements. Last, we also want to increase the expressivity of queries, say, to also handle negations. Queries with negations are especially interesting since negation naturally needs complete information to work correctly.