1 Introduction

With the rapid development of the Semantic Web, more and more applications are built on ontologies, in particular query answering applications [3]. Given a user query, ontology-based query answering can return implicit information that is not explicitly stored in the database but is entailed by the ontology. In this sense, it improves the quality of the answers compared to traditional database querying. The most popular query answering systems fall into two major types: materialization-based and query rewriting-based.

The materialization approach precomputes all consequences entailed by the ontology (also known as the universal model or chase) offline, so that online queries can be evaluated directly on the extended RDF data. It is therefore preferred in scenarios where online query performance is critical. PAGOdA [30] is based on a materialization algorithm. It achieves scalability by delegating a large part of the computational load to a datalog reasoner [21, 23] and using the hypertableau algorithm [22] only when necessary.

However, when the ontology contains cyclic dependencies, the universal model can be infinite [17]. Infinite extended data cannot be stored in a database or queried directly by a query engine. Infinite materialization is therefore a significant challenge for materialization-based query answering systems. PAGOdA is incomplete in the presence of infinite materialization (we show this in Sect. 6).

The commonly adopted approach to deal with infinite materialization is query rewriting. Query rewriting techniques have been studied intensively and implemented in many systems, e.g., QuOnto [5], Mastro [7], Ontop [6] for DL-Lite, Grind [11] for \({\mathcal{E}}{\mathcal{L}}\), Clipper [8] for Horn-\({\mathcal{S}}{\mathcal{H}}{\mathcal{I}}{\mathcal{Q}}\). These systems do not materialize the data, but rather first rewrite the query according to the ontologies and mappings. Query rewriting uses a virtual RDF graph technique to avoid infinite materialization. However, it significantly increases the cost of query evaluation because rewriting is performed at runtime, and it usually requires manual mappings. Moreover, the rewritten query can be exponentially large [26].

Besides query rewriting, materialization-based ontology reasoners such as Pellet [28] adopt a tableau algorithm with a roll-up technique [14] to handle infinite materialization. However, Pellet is not scalable: owing to the high complexity of the tableau algorithm, it can only be applied to small and medium-sized datasets.

gOWL [19] proposes a partial materialization-based approach that deals with acyclic queries. The materialization algorithm of gOWL has high time and space complexity due to the poor indexing of its storage and rules. Besides, gOWL cannot handle cyclic queries or Boolean queries, and its approximation rules lose most of the semantics of OWL 2 DL.

There is also a hybrid approach [15] that computes the canonical model, which is always finite because anonymous individual names are reused, and rewrites queries to remove the wrong answers that the canonical model produces. However, this rewriting algorithm cannot deal with role inclusion axioms. A filter mechanism was then proposed to replace the rewriting technique: it does not rewrite queries at runtime but filters spurious answers after query evaluation [17]. It supports role inclusions. However, it is still limited to a lightweight ontology language, DL-\({\text{Lite}}_{\mathcal{R}}\) [2].

Motivated by the observation that users are mainly interested in the first few levels of the anonymous part of the universal model [10], in this paper we describe a novel partial materialization approach. Here, partial means that we do not compute all consequences entailed by the ontology, but instead compute a subset of the universal model. This ensures that the extended RDF data are always finite. We then propose a query analysis algorithm (QAA). QAA takes conjunctive queries as input, and its output determines the size of the partial universal model. This algorithm is designed to ensure that the partial materialization always produces the same answers as the universal model for rooted conjunctive queries [16] and for a subset of Boolean conjunctive queries in \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) [2]. Consider the following query \(\mathrm {Q}:\text {select}~?X~\text {where}~\{?X~\text {type}~\text{Student}.~?X~\text {advisor}~?Y_0.~?Y_0~\text {teaches}~?Y_1.\}\) and the infinite universal model (the RDF), as shown in Fig. 1.

Fig. 1 A running example of partial materialization

In this paper, we select only one part from the entire model (the sRDF) for answering the query \(\mathrm {Q}\) as \(\mathrm {ans}\)(sRDF, \(\mathrm {Q}\)) = \(\mathrm {ans}\)(RDF, \(\mathrm {Q}\)). The sRDF is the partial universal model.

To extend our approach beyond lightweight ontology languages, we generalize it to OWL 2 DL, in a sound but incomplete way, by means of rewriting and approximation techniques. The approximation techniques apply to axioms whose semantics exceed \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\). In addition, dedicated data structures are designed to preserve the semantics that the approximation techniques may lose.

We implement our approach as a prototype system SUMA and integrate a role rewriting algorithm [27] to optimize materialization efficiency further. From a system perspective, SUMA allows us to design an offline modular architecture to integrate off-the-shelf efficient SPARQL [12] query engines. In this way, it makes online queries more efficient.

We validate our proposal in two settings: the finite universal model scenario and the infinite universal model scenario. In the former, experiments are conducted on two widely adopted benchmarks and two real datasets. In the latter, we manually extend the LUBM [9] and UOBM [18] ontologies to evaluate query answering systems on an infinite universal model. These experiments confirm that: (i) the ontology reasoning algorithm used in PAGOdA cannot deal with infinite materialization, (ii) although Pellet is complete, it is not scalable and can only be used for small and medium-sized data, and (iii) SUMA achieves a good trade-off between scalability and completeness. The experiments show that SUMA is highly efficient, taking only 124s to materialize LUBM(1000) and 411s to materialize UOBM(500). Moreover, for each test query, it returns answers of the same quality as Pellet.

The rest of the paper is structured as follows. Section 2 introduces the basic notions. Section 3 defines QAA for rooted and Boolean conjunctive queries and gives a detailed proof. Section 4 shows how to approximate OWL 2 DL axioms to \(\ DL\) axioms while preserving their semantics. Section 5 presents the architecture of SUMA and the algorithms used in SUMA. The performance of SUMA on two benchmarks and real datasets is demonstrated in Sect. 6. Section 7 concludes this paper.

This work is an extension of the earlier proceedings paper [1]. In particular, (i) it extends QAA to support Boolean conjunctive queries that contain cyclic or fork structures, (ii) it adopts a role rewriting algorithm [27] to further optimize materialization efficiency, and (iii) it extends the experiments with an evaluation of the role rewriting algorithm, the gOWL system, and the YAGO dataset. This work confirms and extends the main finding of [1]: SUMA achieves a good trade-off between scalability and completeness.

2 Preliminaries

In this section, we briefly introduce the syntax and semantics of the description logic (DL) \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\), conjunctive queries, and the universal model.

2.1 Description Logics

Description logics (DLs) are a family of logics that have been studied and used in knowledge representation and reasoning. DLs underlie the standard Web Ontology Languages OWL and OWL 2. In DLs, the elements of the domain are grouped into concepts (corresponding to unary predicates in first-order logic), and their properties are structured by means of roles (corresponding to binary predicates in first-order logic). Complex concept and role expressions are built from atomic concept names and atomic role names by means of suitable constructors. The set of available constructors depends on the specific description logic. The richer the set of constructors, the more complex the semantics that the description logic can capture.

A description logic knowledge base \({\mathcal{K}}\) consists of a TBox (\({\mathcal{T}}\)) and an ABox (\({\mathcal{A}}\)). A TBox typically consists of a set of axioms stating inclusions between concepts and roles. The semantics of a TBox depends on the constructors it uses. In an ABox, one can assert membership of objects (i.e., constants) in concepts, or that a pair of objects is connected by a role.

Let \(\mathsf {N_I}\) be the set of individual names. In this paper, by default, we use \(a, b, c, d, e\) (with subscripts) to represent individual names. \(A, B, Z\) denote concept names, \(C\) (with subscripts) denotes concepts, \(P, S\) denote role names, and \(R\) (with subscripts) denotes roles. Next, we present a brief overview of two description logics, \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) and \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\).

\(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) defines roles and concepts as follows:

$$\begin{aligned} R\;:= P \mid P^-, \quad \quad C := \bot \mid A \mid \ge m\;R. \end{aligned}$$

In \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\), \(\exists R\) is an abbreviation of \(\ge 1 R\), and \((P^{-})^{-}\) is identified with \(P\). Let \(\mathsf {N_R^-}\) denote the set of roles.

A \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) TBox \({\mathcal{T}}\) is a finite set of concept inclusion (CI) axioms of the form \(C_1 \sqcap \ldots \sqcap C_n \sqsubseteq C\). A \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) ABox consists of concept assertions \(A(a)\) and role assertions \(P(a, b)\).

The semantics of \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) is defined by interpretations \({\mathcal{I}}= (\varDelta ^{{\mathcal{I}}}, \cdot ^{{\mathcal{I}}})\), where \(\varDelta ^{{\mathcal{I}}}\) is a non-empty domain and \(\cdot ^{{\mathcal{I}}}\) is an interpretation function that maps each \(A\) to a set \(A^{{\mathcal{I}}} \subseteq \varDelta ^{{\mathcal{I}}}\), each \(P\) to a relation \(P^{{\mathcal{I}}} \subseteq \varDelta ^{{\mathcal{I}}} \times \varDelta ^{{\mathcal{I}}}\), and each \(a\) to an element \(a^{{\mathcal{I}}} \in \varDelta ^{{\mathcal{I}}}\). Besides, \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) adopts the unique name assumption (UNA), that is, if the individual names \(v\) and \(w\) are distinct, then \(v^{{\mathcal{I}}}\) is different from \(w^{{\mathcal{I}}}\).

\(\cdot ^{{\mathcal{I}}}\) interprets complex concepts and roles as follows: (1) \(\bot ^{{\mathcal{I}}} = \emptyset\) (\(\bot\) is the bottom concept); (2) \({(S^{-})}^{{\mathcal{I}}} = \{(v, w) \mid (w, v) \in S^{{\mathcal{I}}}\}\); and (3) \({(\ge m S)}^{{\mathcal{I}}} = \{w \mid \sharp \{v \mid (w, v) \in S^{{\mathcal{I}}}\} \ge m\}\). Here \(\sharp\) denotes the cardinality.

An interpretation \({\mathcal{I}}\) satisfies a CI axiom of the form \(C_1 \sqcap \ldots \sqcap C_n \sqsubseteq C\), written \({\mathcal{I}}\models C_1 \sqcap \ldots \sqcap C_n \sqsubseteq C\), if and only if \(\bigcap ^n_{i=1} C^{{\mathcal{I}}}_i \subseteq C^{{\mathcal{I}}}\). \({\mathcal{I}}\models A(a)\) holds if \(a^{{\mathcal{I}}} \in A^{{\mathcal{I}}}\), and \({\mathcal{I}}\models P(a, b)\) holds if \((a^{{\mathcal{I}}}, b^{{\mathcal{I}}}) \in P^{{\mathcal{I}}}\). If \({\mathcal{I}}\) satisfies all TBox and ABox axioms of \({\mathcal{K}}\), then \({\mathcal{I}}\) is a model of \({\mathcal{K}}\).
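The satisfaction condition for concept inclusions can be checked mechanically on a finite interpretation. The following is a minimal sketch under stated assumptions: the tuple encoding of \(\ge m\;R\), the `"BOT"` marker, the trailing `-` for inverse roles, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical encoding (not from the paper):
#   concept name -> "A"
#   >= m R       -> ("atleast", m, "P"), or ("atleast", m, "P-") for the inverse role
#   bottom       -> "BOT"
Pairs = Set[Tuple[str, str]]


def role_ext(role: str, roles: Dict[str, Pairs]) -> Pairs:
    """Extension of a role; a trailing '-' denotes the inverse role."""
    if role.endswith("-"):
        return {(w, v) for (v, w) in roles.get(role[:-1], set())}
    return roles.get(role, set())


def concept_ext(c, domain: Set[str], concepts: Dict[str, Set[str]], roles: Dict[str, Pairs]) -> Set[str]:
    """Extension of a basic concept in a finite interpretation."""
    if c == "BOT":
        return set()
    if isinstance(c, str):
        return set(concepts.get(c, set()))
    _, m, role = c
    ext = role_ext(role, roles)
    # w belongs to (>= m R) if it has at least m distinct R-successors
    return {w for w in domain if len({v for (u, v) in ext if u == w}) >= m}


def satisfies_ci(body, head, domain, concepts, roles) -> bool:
    """Check that the intersection of the body extensions is a subset of the head extension."""
    inter = set(domain)
    for c in body:
        inter &= concept_ext(c, domain, concepts, roles)
    return inter <= concept_ext(head, domain, concepts, roles)


domain = {"a", "b", "c"}
concepts = {"A": {"a"}}
roles = {"P": {("a", "b"), ("a", "c")}}

two_successors = satisfies_ci(["A"], ("atleast", 2, "P"), domain, concepts, roles)
three_successors = satisfies_ci(["A"], ("atleast", 3, "P"), domain, concepts, roles)
```

In this toy interpretation, `a` has exactly two `P`-successors, so the first inclusion holds and the second does not.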

\({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\) is the underlying logic of OWL 2 DL. The concept of \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\) is defined as: \(C := \bot~\mid~\top~\mid~A~\mid~\lnot~A~\mid~\{a\}~\mid~\ge m R.A ~\mid\) \(\exists R.A\). Besides, \(A \sqcup B\), \(\forall R.A\), and \(\le n R.A\) can be rewritten to \(\lnot (\lnot A \sqcap \lnot B)\), \(\lnot \exists R.\lnot A\), and \(\lnot \ge (n + 1)R.A\), respectively. Enumeration \(\{a_1, a_2,\ldots , a_n\}\) is equal to \(\{a_1\} \sqcup \{a_2\} \sqcup \ldots \sqcup \{a_n\}\).

A \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\) \({\mathcal{K}}\) consists of RBox \({\mathcal{R}}\), TBox \({\mathcal{T}}\) and ABox \({\mathcal{A}}\).

In addition to the concept inclusion (CI) axioms of DL-Lite, a \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\) TBox \({\mathcal{T}}\) also includes disjointness axioms (\(\mathrm {Dis}(C_1, C_2)\)) and concept equivalence axioms (\(C_1 \equiv C_2\)). Given an interpretation \({\mathcal{I}}\), we write \({\mathcal{I}}\models \mathrm {Dis}(C_1, C_2)\) if \({C_1}^{{\mathcal{I}}} \cap {C_2}^{{\mathcal{I}}} = \emptyset\), and \({\mathcal{I}}\models \mathrm {Dis}(R_1, R_2)\) if \({R_1}^{{\mathcal{I}}} \cap {R_2}^{{\mathcal{I}}} = \emptyset\).

The RBox is a finite collection of role inclusion axioms of the form \(R_1 \sqsubseteq R_2\) or \(R_1 \circ R_2 \sqsubseteq R_3\), and disjointness axioms of the form \(\mathrm {Dis}(R_1, R_2)\). The inverse role is denoted as \(\mathrm {Inv}(R)\) with \(\mathrm {Inv}(R) = R^-\), the symmetric role is denoted as \(\mathrm {Sym}(R)\) (defined as \(\mathrm {Inv}(R) \equiv R\)), and the transitive role is denoted as \(\mathrm {Trans}(R)\) (defined as \(R \circ R \sqsubseteq R\)). \(\mathrm {Fun}(R)\) represents a functional role. Given an interpretation \({\mathcal{I}}\), we write \({\mathcal{I}}\models R_1 \sqsubseteq R_2\) if \({R_1}^{{\mathcal{I}}} \subseteq {R_2}^{{\mathcal{I}}}\), and \({\mathcal{I}}\models R_1 \circ R_2 \sqsubseteq R_3\) if \({R_1}^{{\mathcal{I}}} \circ {R_2}^{{\mathcal{I}}} \subseteq {R_3}^{{\mathcal{I}}}\), where \(\circ\) on the right-hand side denotes the composition of binary relations.
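As a small illustration of the RBox semantics, the sketch below checks role inclusions on finite relations, reading \(R_1 \circ R_2\) as relational composition (the standard semantics of complex role inclusions). The function names and the family-relation data are hypothetical.

```python
from typing import Set, Tuple

Pairs = Set[Tuple[str, str]]


def compose(r1: Pairs, r2: Pairs) -> Pairs:
    """Relational composition: (x, z) whenever (x, y) in r1 and (y, z) in r2."""
    return {(x, z) for (x, y) in r1 for (y2, z) in r2 if y == y2}


def satisfies_inclusion(r1: Pairs, r2: Pairs) -> bool:
    """R1 is included in R2 if every pair of R1 is also in R2."""
    return r1 <= r2


def satisfies_chain(r1: Pairs, r2: Pairs, r3: Pairs) -> bool:
    """R1 o R2 is included in R3."""
    return compose(r1, r2) <= r3


def is_transitive(r: Pairs) -> bool:
    """Trans(R) is the special case R o R included in R."""
    return satisfies_chain(r, r, r)


has_parent = {("ann", "bob")}
has_brother = {("bob", "carl")}
has_uncle = {("ann", "carl")}

chain_ok = satisfies_chain(has_parent, has_brother, has_uncle)
```

Here `chain_ok` holds because the only composed pair, `("ann", "carl")`, is contained in `has_uncle`.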

A \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\) ABox \({\mathcal{A}}\) without UNA includes individual equality \(a \doteq b\) (\(\doteq\) is called sameAs in OWL) and individual inequality \(a \dot{ \ne } b\). If \({\mathcal{I}}\models a \doteq b\) then \(a^{\mathcal{I}}= b^{\mathcal{I}}\). If \({\mathcal{I}}\models a \dot{ \ne } b\) then \(a^{\mathcal{I}}\not = b^{\mathcal{I}}\). If the role R is functional, then \({\mathcal{I}}\models (\ge 2 R \sqsubseteq \bot )\). Besides, if both \((a, b) \in R^{\mathcal{I}}\), \((a, d) \in R^{\mathcal{I}}\) and \(b \dot{ \ne } d \notin {\mathcal{A}}\), then \(b \doteq d\).

2.2 Conjunctive Query

We use \(\mathsf {N_V}\) to denote a set of variables. \(A(t)\) denotes a concept atom, and \(P(t, t')\) denotes a role atom, with \(t, t' \in {\mathsf {N_I}} \cup {\mathsf {N_V}}\). A conjunctive query (CQ) has the form \(q = \exists \mathbf {u} \psi (\mathbf {u}, \mathbf {v})\), where \(\psi\) is a conjunction of concept and role atoms. The vector \(\mathbf {v}\) consists of the free variables, and the vector \(\mathbf {u}\) consists of the quantified variables. If \(\vert \mathbf {v}\vert = 0\), we call q Boolean. Since disconnected queries can be divided into connected subqueries for processing, this article only considers connected conjunctive queries. If a CQ is connected and not Boolean, it is a rooted CQ.

The notions of answers and certain answers of a CQ are introduced as follows [15]. Let \(q(\mathbf {v})\) be a CQ with \(\vert \mathbf {v}\vert = k\), and let \({\mathcal{I}}\) be an interpretation. \(\mathsf {N_T}\) denotes the set of all terms in q, with \(\mathsf {N_T} \subseteq {\mathsf {N_I}} \cup {\mathsf {N_V}}\). Let \(\pi\) be a mapping that maps each term of q to an element of \(\varDelta ^{{\mathcal{I}}}\) and each constant a to \(a^{\mathcal{I}}\). We say that \({\mathcal{I}}\) satisfies q under \(\pi\), written \({\mathcal{I}}\models ^{\pi } q\), if and only if \(\pi (t) \in A^{\mathcal{I}}\) for every \(A(t) \in q\) and \((\pi (t), \pi (t')) \in P^{\mathcal{I}}\) for every \(P(t, t') \in q\). Such a \(\pi\) is called a match for q in \({\mathcal{I}}\). The vector \(\mathbf {a} = a_1\ldots a_k\) is an answer of q if there is a mapping \(\pi\) with \(\pi (v_i) = a^{{\mathcal{I}}}_i\) (\(i \le k\)) and \({\mathcal{I}}\models ^{\pi } q\). \(\mathrm {ans}(q(\mathbf {v}), {\mathcal{K}})\) denotes the set of all answers of \(q(\mathbf {v})\). \({\mathrm {Ind}}({\mathcal{A}})\) denotes the set of individual names occurring in \({\mathcal{A}}\). We call \(\mathbf {a}\) a certain answer if \(\mathbf {a} \subseteq {\mathrm {Ind}}({\mathcal{A}})\) and each model of \({\mathcal{K}}\) satisfies \(q(\mathbf {a})\). The set of certain answers is denoted as \(\mathrm {cert}(q(\mathbf {v}), {\mathcal{K}})\).
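To make the notion of a match concrete, the following sketch enumerates matches of a CQ in a finite interpretation by brute force. It is an illustration only: the atom encoding (tuples whose first component is the predicate name), the `?` prefix for variables, and the convention that constants denote themselves are assumptions, and the enumeration is exponential in the number of variables.

```python
from itertools import product
from typing import Dict, Iterator, List, Set, Tuple

Atom = Tuple[str, ...]  # ("A", t) for a concept atom, ("P", t, t2) for a role atom


def is_var(term: str) -> bool:
    return term.startswith("?")


def holds(atom: Atom, pi: Dict[str, str], concepts, roles) -> bool:
    """Check one atom under the mapping pi (constants are mapped to themselves)."""
    terms = [pi.get(t, t) for t in atom[1:]]
    if len(terms) == 1:
        return terms[0] in concepts.get(atom[0], set())
    return (terms[0], terms[1]) in roles.get(atom[0], set())


def matches(atoms: List[Atom], domain: Set[str], concepts, roles) -> Iterator[Dict[str, str]]:
    """Yield every mapping of the query variables that satisfies all atoms."""
    variables = sorted({t for atom in atoms for t in atom[1:] if is_var(t)})
    for values in product(sorted(domain), repeat=len(variables)):
        pi = dict(zip(variables, values))
        if all(holds(atom, pi, concepts, roles) for atom in atoms):
            yield pi


def answers(atoms, answer_vars, domain, concepts, roles) -> Set[Tuple[str, ...]]:
    """Project the matches onto the free (answer) variables."""
    return {tuple(pi[v] for v in answer_vars) for pi in matches(atoms, domain, concepts, roles)}


# A finite interpretation in which a0 has an advisor u1 who teaches u2.
domain = {"a0", "u1", "u2"}
concepts = {"Student": {"a0"}}
roles = {"advisor": {("a0", "u1")}, "teaches": {("u1", "u2")}}
query = [("Student", "?x"), ("advisor", "?x", "?y0"), ("teaches", "?y0", "?y1")]

result = answers(query, ["?x"], domain, concepts, roles)
```

For certain answers one would additionally restrict the projected tuples to individuals from \({\mathrm {Ind}}({\mathcal{A}})\); the sketch does not do this.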

2.3 Universal Model

Materialization is a forward-chaining algorithm that expands the ABox according to the axioms in the TBox. The ABox extension expands the ABox \({\mathcal{A}}\) to a universal model for the given KB \({\mathcal{K}}= ({\mathcal{T}}, {\mathcal{A}})\). More specifically, during materialization, the universal model is enriched with additional individuals derived from existential and number restriction axioms and with additional assertions derived from the CIs in \({\mathcal{T}}\).

A role R is called generating in \({\mathcal{K}}\) if there exist \(a \in {\mathrm {Ind}}({\mathcal{A}})\) and \(R_1,\;\ldots ,\;R_n = R\) such that the following hold: (agen) \({\mathcal{K}}\models \exists R_1(a)\) but \(R_1(a, b) \not \in {\mathcal{A}}\), for all \(b \in {\mathrm {Ind}}({\mathcal{A}})\) (written \(a \leadsto c_{R_1}\)); (rgen) for \(i < n\), \({\mathcal{T}}\models \exists R^-_{i} \sqsubseteq \exists R_{i+1}\) and \(R^-_{i} \ne R_{i+1}\) (written \(c_{R_i} \leadsto c_{R_{i+1}}\)). If R is generating in \({\mathcal{K}}\), then \(c_R\) is called an anonymous individual. The set of anonymous individuals is denoted as \({\mathsf {N}}_{{\mathsf {I}}}^{\mathcal{T}}\), which is disjoint from \({\mathrm {Ind}}({\mathcal{A}})\). A new assertion \(C_2(a)\) is added to the universal model if the ABox \({\mathcal{A}}\) contains \(C_1(a)\), the TBox \({\mathcal{T}}\) contains \(C_1 \sqsubseteq C_2\), and the ABox \({\mathcal{A}}\) does not contain \(C_2(a)\).

The canonical interpretation \({\mathcal{I}}_{\mathcal{K}}\) for \({\mathcal{K}}\) is defined as follows:

  • \(\varDelta ^{{\mathcal{I}}_{\mathcal{K}}} ={\mathrm{Ind}}({\mathcal{A}}) \cup \{c_R \mid R \in \mathsf {N_R^-},\;R \text { is generating in } {\mathcal{K}}\}\);

  • \(a^{{\mathcal{I}}_{\mathcal{K}}} =a\), for all \(a \in {\mathrm {Ind}}({\mathcal{A}})\);

  • \(A^{{\mathcal{I}}_{\mathcal{K}}} = \{a \in {\mathrm {Ind}}({\mathcal{A}}) \mid {\mathcal{K}}\models A(a)\} \cup \{c_R \in \varDelta ^{{\mathcal{I}}_{\mathcal{K}}} \mid {\mathcal{T}}\models \exists \;R^{-} \sqsubseteq A\}\);

  • \(P^{{\mathcal{I}}_{\mathcal{K}}} = \{(a, b) \in {\mathrm {Ind}}({\mathcal{A}}) \times {\mathrm {Ind}}({\mathcal{A}}) \mid P(a, b) \in {\mathcal{A}}\}\) \(\cup \{(d, c_P) \in \varDelta ^{{\mathcal{I}}_{\mathcal{K}}} \times {\mathsf {N}}_{{\mathsf {I}}}^{\mathcal{T}}\mid d \leadsto c_P\}\) \(\cup \{(c_{P^-}, d) \in {\mathsf {N}}_{{\mathsf {I}}}^{\mathcal{T}}\times \varDelta ^{{\mathcal{I}}_{\mathcal{K}}} \mid d \leadsto c_{P^-}\}\).

A path in \({\mathcal{I}}_{\mathcal{K}}\) is a finite sequence \(ac_{R_1} \cdots c_{R_n}\) \((n \ge 0)\), such that \(a \in {\mathrm {Ind}}({\mathcal{A}})\) and \(R_1, \ldots , R_n\) satisfy (agen) and (rgen) (that is, \(a \leadsto c_{R_1}\) and \(c_{R_i} \leadsto c_{R_{i+1}}\), for \(1 \le i < n\)). The last element of a path \(\sigma\) is denoted by \(\mathrm {tail}(\sigma )\).

The universal model \({\mathcal{U}}_{\mathcal{K}}\) is defined as follows:

  • \(\varDelta ^{{\mathcal{U}}_{\mathcal{K}}} = \{a \cdot c_{R_1} \cdots c_{R_n} \mid a \in {\mathrm {Ind}}({\mathcal{A}}),\;n \ge 0,\;a \leadsto c_{R_1} \leadsto \cdots \leadsto c_{R_n}\}\);

  • \(a^{{\mathcal{U}}_{\mathcal{K}}} = a\), for all \(a \in {\mathrm {Ind}}({\mathcal{A}})\);

  • \(A^{{\mathcal{U}}_{\mathcal{K}}} = \{\sigma \in \varDelta ^{{\mathcal{U}}_{\mathcal{K}}} \mid \mathrm {tail}(\sigma )\in A^{{\mathcal{I}}_{\mathcal{K}}}\}\);

  • \(P^{{\mathcal{U}}_{\mathcal{K}}} = \{(a, b) \in {\mathrm {Ind}}({\mathcal{A}}) \times {\mathrm {Ind}}({\mathcal{A}}) \mid P(a, b) \in {\mathcal{A}}\} \cup \{(\sigma , \sigma \cdot c_P) \in \varDelta ^{{\mathcal{U}}_{\mathcal{K}}} \times \varDelta ^{{\mathcal{U}}_{\mathcal{K}}} \mid \mathrm {tail}(\sigma ) \leadsto c_P\} \cup \{(\sigma \cdot c_{P^-}, \sigma ) \in \varDelta ^{{\mathcal{U}}_{\mathcal{K}}} \times \varDelta ^{{\mathcal{U}}_{\mathcal{K}}} \mid \mathrm {tail}(\sigma ) \leadsto c_{P^-}\}\).

The difference between the canonical interpretation and the universal model is that the canonical interpretation is always finite. It ensures that the extended ABox is finite by reusing the symbols of anonymous individuals. However, this reuse can cause the canonical interpretation to produce wrong answers to conjunctive queries under the certain answer semantics. We compare the definitions of the canonical interpretation \({\mathcal{I}}_{\mathcal{K}}\) and the universal model \({\mathcal{U}}_{\mathcal{K}}\) in Examples 1 and 2.

Example 1

Let \({\mathcal{K}}\) consist of \({\mathcal{T}}= \{B \sqsubseteq \exists S, \exists S^- \sqsubseteq \exists S\}\) and \({\mathcal{A}}= \{B(d_0)\}\).

Then \(\varDelta ^{{\mathcal{I}}_{\mathcal{K}}}\;=\;\{d_0, d_1\}\), \(B^{{\mathcal{I}}_{\mathcal{K}}} = \{d_0\}\), and \(S^{{\mathcal{I}}_{\mathcal{K}}} = \{(d_0, d_1), (d_1, d_1)\}\).

\(\varDelta ^{{\mathcal{U}}_{\mathcal{K}}}\;=\;\{d_0, d_1, d_2, d_3, \ldots \}\), \(B^{{\mathcal{U}}_{\mathcal{K}}} = \{d_0\}\), and \(S^{{\mathcal{U}}_{\mathcal{K}}} = \{(d_0, d_1), (d_1, d_2), \ldots \}\).

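Let q be \(\exists v\;S(v, v)\). If \(\pi (v) = d_1\), then \({\mathcal{I}}_{\mathcal{K}}\models ^\pi q\); however, \({\mathcal{U}}_{\mathcal{K}}\not \models ^\pi q\), and in fact no match for q exists in \({\mathcal{U}}_{\mathcal{K}}\), because every S-edge in \({\mathcal{U}}_{\mathcal{K}}\) leads from \(d_i\) to \(d_{i+1}\). Thus \({\mathcal{K}}\not \models q\), and the canonical interpretation \({\mathcal{I}}_{\mathcal{K}}\) produces a wrong answer.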
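The contrast in Example 1 can be replayed on the two role extensions directly. The sketch below is illustrative only: it hard-codes \(S^{{\mathcal{I}}_{\mathcal{K}}}\) from the example and inspects finite prefixes of \(S^{{\mathcal{U}}_{\mathcal{K}}}\), since the full universal model is infinite; the function names are hypothetical.

```python
from typing import Set, Tuple

Pairs = Set[Tuple[str, str]]


def has_self_loop(s_ext: Pairs) -> bool:
    """True if the Boolean query 'exists v. S(v, v)' has a match in s_ext."""
    return any(v == w for (v, w) in s_ext)


# Canonical interpretation of Example 1: the anonymous individual d1 is reused.
canonical_s: Pairs = {("d0", "d1"), ("d1", "d1")}


def universal_s_prefix(k: int) -> Pairs:
    """First k edges of the infinite chain d0 -> d1 -> d2 -> ... of Example 1."""
    return {(f"d{i}", f"d{i + 1}") for i in range(k)}


spurious_match = has_self_loop(canonical_s)
match_in_prefixes = any(has_self_loop(universal_s_prefix(k)) for k in range(1, 50))
```

The self-loop exists only in the canonical interpretation, which is exactly the spurious match that the rewriting and filtering techniques mentioned in the introduction are meant to remove.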

Example 2

Let \({\mathcal{K}}\) consist of \({\mathcal{T}}= \{\text {Student} \sqsubseteq \exists \text {advisor},\) \(\exists \text {advisor}^- \sqsubseteq \exists \text {teaches}\), \(\exists \text {teaches}^- \sqsubseteq \exists \text {studiedBy}, \exists \text {studiedBy}^- \sqsubseteq \;\text {Student}\}\), and \({\mathcal{A}}= \{\text {Student}(a_0)\}\).

The universal model of this example is shown in Fig. 1.

Kontchakov et al. showed that for every consistent \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) KB \({\mathcal{K}}\) and every CQ q, we have \(\mathrm {cert}(q, {\mathcal{K}}) = \mathrm {ans}(q, \mathcal U_{\mathcal K})\) [15].

3 n-step Universal Model for \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\)

In this section, we present the definitions of the n-step universal model and the query analysis algorithm. The n-step universal model is an implementation of the partial materialization idea. The query analysis algorithm is proposed to ensure that the n-step universal model always yields the same answers as the universal model.

3.1 n-step Universal Model

Query answering over \({\mathcal{A}}\) with \({\mathcal{T}}\) can be reduced to query answering over the universal model, as shown in preliminaries. However, the universal model can be infinite.

Consider axioms with the following three characteristics. First, they are concept inclusions. Second, both the head and the body contain, directly or indirectly, existential quantifiers. Third, the roles occurring in the axiom are inverse to each other. We refer to such axioms as cyclic existential quantifier axioms (CEQ, for short) in this paper. The simplest forms of CEQ axioms are \(\exists R^- \sqsubseteq \exists R\), or the pair \(\exists R^- \sqsubseteq A\), \(A \sqsubseteq \exists R\). When the ontology contains CEQ axioms, the universal model is infinite [17], as shown in Examples 1 and 2. Infinite RDF data cannot be directly stored or queried with an off-the-shelf query engine.
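One way to spot the cyclic pattern described above is to look for a cycle in the generating relation \(c_R \leadsto c_S\) induced by the TBox. The sketch below is a simplification under stated assumptions, not the procedure used in SUMA: it considers only told single-premise inclusions between basic concepts, ignores conjunctions and number restrictions other than \(\exists\), and does not check whether the cycle is reachable from an ABox individual via (agen). The string encoding (`"E:R"` for \(\exists R\), trailing `-` for inverses) is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Axiom = Tuple[str, str]  # (lhs, rhs) meaning lhs is subsumed by rhs


def inv(role: str) -> str:
    return role[:-1] if role.endswith("-") else role + "-"


def ex(role: str) -> str:
    return "E:" + role  # hypothetical encoding of the concept "exists role"


def subsumers(tbox: List[Axiom]) -> Dict[str, Set[str]]:
    """Reflexive-transitive closure of the told inclusions."""
    names = {c for axiom in tbox for c in axiom}
    sup = {c: {c} for c in names}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in tbox:
            for c in names:
                if lhs in sup[c] and not sup[rhs] <= sup[c]:
                    sup[c] |= sup[rhs]
                    changed = True
    return sup


def has_generating_cycle(tbox: List[Axiom]) -> bool:
    """True if c_R ~> c_S (exists R^- subsumed by exists S, with R^- != S) contains a cycle."""
    sup = subsumers(tbox)
    roles = {c[2:] for c in sup if c.startswith("E:")}
    roles |= {inv(r) for r in roles}
    edges = {
        r: {s for s in roles
            if s != inv(r) and ex(s) in sup.get(ex(inv(r)), {ex(inv(r))})}
        for r in roles
    }
    white, grey, black = 0, 1, 2
    color = {r: white for r in roles}

    def visit(r: str) -> bool:
        # depth-first search; reaching a grey node again closes a cycle
        color[r] = grey
        for s in edges[r]:
            if color[s] == grey or (color[s] == white and visit(s)):
                return True
        color[r] = black
        return False

    return any(color[r] == white and visit(r) for r in sorted(roles))


example1 = [("B", ex("S")), (ex("S-"), ex("S"))]
example2 = [
    ("Student", ex("advisor")),
    (ex("advisor-"), ex("teaches")),
    (ex("teaches-"), ex("studiedBy")),
    (ex("studiedBy-"), "Student"),
]
acyclic = [("A", ex("P"))]
```

Applied to the TBoxes of Examples 1 and 2, the check reports a cycle, while the single axiom \(A \sqsubseteq \exists P\) does not produce one.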

We propose an n-step universal model to replace the possibly infinite universal model. Intuitively, materialization extends the ABox to a universal model. The process can be viewed as a sequence \(U =\;\{\text {ABox},\;{\mathcal{U}}^1_{\mathcal{K}},\;{\mathcal{U}}^2_{\mathcal{K}},\;\ldots ,\;{\mathcal{U}}^n_{\mathcal{K}},\;\ldots \}\) with \({\mathcal{U}}^i_{\mathcal{K}}\subseteq {\mathcal{U}}^{i+1}_{\mathcal{K}}\) for \(i < \vert U \vert\). The conditions (agen) and (rgen) (see Sect. 2) are what drive the expansion of \({\mathcal{U}}^i_{\mathcal{K}}\) to \({\mathcal{U}}^{i+1}_{\mathcal{K}}\). The element \({\mathcal{U}}^n_{\mathcal{K}}\) of U is called the n-step universal model. \({\mathcal{U}}^n_{\mathcal{K}}\) is always finite. Partial materialization differs from full materialization in that it computes only the n-step universal model, whereas full materialization computes the whole model. We ensure that our materialization is always finite by selecting \({\mathcal{U}}^n_{\mathcal{K}}\) from \(\mathcal U_{\mathcal K}\) for query answering.

To formalize \({\mathcal{U}}^n_{\mathcal{K}}\), we require some preliminary definitions. We call a role R n-step generating in \({\mathcal{K}}\) if there exist an individual \(a \in {\mathrm {Ind}}({\mathcal{A}})\) and roles \(R_1, \ldots , R_{n} = R\) that satisfy (agen) and (rgen). \(\mathsf {N_R^n}\) denotes the set of roles that are \(\le n\)-step generating in \({\mathcal{K}}\). The n-step canonical interpretation is defined as follows:

Definition 1

n-step canonical interpretation

  • \(\varDelta ^{{\mathcal{I}}^n_{\mathcal{K}}}= {\mathrm {Ind}}({\mathcal{A}}) \cup \{c_R \vert R \in \mathsf {N_R^n}\}\),

  • \(a^{{\mathcal{I}}^n_{\mathcal{K}}} = a, \text {for all}\;a \in {\mathrm {Ind}}({\mathcal{A}})\),

  • \(A^{{\mathcal{I}}^n_{\mathcal{K}}} = \{a \in {\mathrm {Ind}}({\mathcal{A}})\;\vert \;{\mathcal{K}}\models A(a)\} \cup \{c_R \in \varDelta ^{{\mathcal{I}}^n_{\mathcal{K}}}\;\vert \;{\mathcal{T}}\models \exists R^- \sqsubseteq A\}\),

  • \(P^{{\mathcal{I}}^n_{\mathcal{K}}} = \{(a,b) \in {\mathrm {Ind}}({\mathcal{A}}) \times {\mathrm {Ind}}({\mathcal{A}}) \vert P(a,b) \in {\mathcal{A}}\} \cup \{(d, c_P) \in \varDelta ^{{\mathcal{I}}^n_{\mathcal{K}}}\times {\mathsf {N}}_{{\mathsf {I}}}^{\mathcal{T}} \vert d \leadsto c_P\} \cup \{(c_{P^-}, d) \in {\mathsf {N}}_{{\mathsf {I}}}^{\mathcal{T}} \times \varDelta ^{{\mathcal{I}}^n_{\mathcal{K}}}\vert d \leadsto c_{P^-}\}\).

All roles included in the n-step canonical interpretation must be \(\le n\)-step generating in \({\mathcal{K}}\). All anonymous individuals are witnesses of roles that are \(\le n\)-step generating in \({\mathcal{K}}\). Based on the definition of the n-step canonical interpretation, we can derive the definition of the n-step universal model (\({\mathcal{U}}^n_{\mathcal{K}}\)).

Definition 2

n-step universal model

  • \(\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;=\;\{a \cdot c_{R_1} \cdots c_{R_l}\;\vert \;a \in {\mathrm {Ind}}({\mathcal{A}}), R_{l} \in \mathsf {N_R^n}, a \leadsto c_{R_1} \leadsto \ldots \leadsto c_{R_l}\}\),

  • \(a^{{\mathcal{U}}^n_{\mathcal{K}}} = a,\;\text {for all}\;a \in {\mathrm {Ind}}({\mathcal{A}})\),

  • \(A^{{\mathcal{U}}^n_{\mathcal{K}}} = \{\sigma \in \varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;\vert \;\mathrm {tail}(\sigma ) \in A^{{\mathcal{I}}^n_{\mathcal{K}}}\}\),

  • \(P^{{\mathcal{U}}^n_{\mathcal{K}}} = \{(a, b) \in {\mathrm {Ind}}({\mathcal{A}}) \times {\mathrm {Ind}}({\mathcal{A}})\;\vert \;P(a, b) \in {\mathcal{A}}\} \cup \{(\sigma , \sigma \cdot c_P) \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\times \varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;\vert \;\mathrm {tail}(\sigma ) \leadsto c_P\} \cup \{(\sigma \cdot c_{P^-}, \sigma ) \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\times \varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;\vert \;\mathrm {tail}(\sigma )\;\leadsto \;c_{P^-}\}\).
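Definition 2 can be read as a bounded breadth-first expansion of the ABox. The sketch below illustrates this reading under stated assumptions: the generating relation is supplied as two precomputed tables (`start` for \(a \leadsto c_R\), `step` for \(c_R \leadsto c_S\)), paths are tuples, anonymous depth is bounded by n, and concept extensions are omitted. All names are hypothetical and this is not the SUMA implementation.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]  # (a, R1, ..., Rl) stands for a . c_R1 ... c_Rl


def inv(role: str) -> str:
    return role[:-1] if role.endswith("-") else role + "-"


def n_step_universal_model(
    individuals: Set[str],
    abox_roles: Dict[str, Set[Tuple[str, str]]],
    start: Dict[str, Set[str]],   # a -> roles R with a ~> c_R (assumed precomputed)
    step: Dict[str, Set[str]],    # R -> roles S with c_R ~> c_S (assumed precomputed)
    n: int,
):
    """Build the domain and role extensions with at most n anonymous steps per path."""
    domain: List[Path] = [(a,) for a in sorted(individuals)]
    roles: Dict[str, Set[Tuple[Path, Path]]] = {
        p: {((a,), (b,)) for (a, b) in pairs} for p, pairs in abox_roles.items()
    }
    frontier = list(domain)
    for _ in range(n):
        next_frontier = []
        for sigma in frontier:
            table = start if len(sigma) == 1 else step
            for r in sorted(table.get(sigma[-1], set())):
                tau = sigma + (r,)
                domain.append(tau)
                next_frontier.append(tau)
                if r.endswith("-"):
                    # tail(sigma) ~> c_{P^-} adds (sigma . c_{P^-}, sigma) to P
                    roles.setdefault(inv(r), set()).add((tau, sigma))
                else:
                    # tail(sigma) ~> c_P adds (sigma, sigma . c_P) to P
                    roles.setdefault(r, set()).add((sigma, tau))
        frontier = next_frontier
    return domain, roles


# Example 1: B(d0), B subsumed by exists S, exists S^- subsumed by exists S.
domain2, roles2 = n_step_universal_model({"d0"}, {}, {"d0": {"S"}}, {"S": {"S"}}, 2)
```

For Example 1 with n = 2, the three paths `("d0",)`, `("d0", "S")`, and `("d0", "S", "S")` play the roles of \(d_0\), \(d_1\), and \(d_2\).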

Example 3

We use Example 1 to illustrate \({\mathcal{U}}^n_{\mathcal{K}}\). \(\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;=\;\{d_0, d_1, \ldots , d_n\}, {d_0}^{{\mathcal{U}}^n_{\mathcal{K}}}\) \(=\;d_0,\;B^{{\mathcal{U}}^n_{\mathcal{K}}}\;=\;\{d_0\},\;S^{{\mathcal{U}}^n_{\mathcal{K}}}\;=\;\{(d_0, d_1), (d_1, d_2), \ldots , (d_{n-1}, d_n)\}\). The graph of the 2-step universal model is shown in Fig. 2.

Fig. 2 The 2-step universal model of Example 3

Example 4

The 2-step universal model (sRDF) of Example 2 is shown in Fig. 1.

3.2 QAA Based on Rooted Conjunctive Queries

We design a query analysis algorithm (QAA) to ensure that the n-step universal model always computes the same answers as the universal model. QAA takes a rooted query as input and computes the number n of quantified variables, which can be done in \({\mathcal{O}}(\vert \mathsf {N_T}\vert )\) time. This number is used as the step of the universal model: if it is n, then the n-step universal model, i.e., \({\mathcal{U}}^n_{\mathcal{K}}\) in U, produces the same answers for q as the universal model. This claim is proved in Theorem 2. To this end, we define a new ternary relation on q, in which \(\sigma\) denotes a path consisting of terms and \(\delta\) denotes a path consisting of roles.
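The counting step of QAA for rooted queries is simple enough to sketch directly. The code below only counts the quantified variables as described in the text; the atom encoding and the `?` prefix for variables are assumptions made for illustration, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Atom = Tuple[str, ...]  # ("A", t) for a concept atom, ("P", t, t2) for a role atom


def qaa_step(atoms: List[Atom], answer_vars: Sequence[str]) -> int:
    """Return the number of quantified (non-answer) variables of a rooted CQ.

    A single pass over the terms suffices, in line with the linear bound
    stated in the text.
    """
    variables = {t for atom in atoms for t in atom[1:] if t.startswith("?")}
    return len(variables - set(answer_vars))


# The query Q from the introduction: ?X is the answer variable.
q_atoms = [("Student", "?X"), ("advisor", "?X", "?Y0"), ("teaches", "?Y0", "?Y1")]
n = qaa_step(q_atoms, ["?X"])
```

For the query Q from the introduction, the quantified variables are `?Y0` and `?Y1`, so the step is 2, which agrees with the 2-step universal model (sRDF) used for Example 2.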

Definition 3

Let \(q = \exists \mathbf {u} \varphi (\mathbf {u}, \mathbf {v})\) be a CQ, \({\mathcal{T}}\) be a TBox, and \(\pi\) be a mapping. We define a ternary relation \(f_\rho = \cup _{i \ge 0} f_\rho ^i \subseteq \mathsf {N_T}\times {\mathsf {N_T}}^* \times { \mathsf {N_R^-}}^*\) with \(\rho = \{t \mid t \in \mathsf {N_T},\;\pi (t) \in {\mathrm {Ind}}({\mathcal{A}})\}\), where

  • \(f_\rho ^0 = \{(t, t, \varepsilon )\;\vert \;t \in \rho \}\);

  • \(f_\rho ^{i+1} = f_\rho ^i\;\cup \;\{(t, \sigma s t, \delta S R) \mid (s, \sigma s, \delta S) \in f_\rho ^i, R(s, t) \in q, tail(\pi (s)) \leadsto\) \(tail(\pi (t))\} \;\cup \;\{(t, \sigma , \delta ) \mid (s, \sigma s, \delta R^-) \in f_\rho ^i, R(s, t) \in q, tail(\pi (s))\) \(\leadsto tail(\pi (t))\}\).

A path \(\sigma = t_0 \cdot t_1 \cdots t_{n-1} \cdot t_n\) is a certain path of q if \(t_0\) is mapped to \({\mathrm {Ind}}({\mathcal{A}})\) and all other terms are mapped to \({\mathsf {N}}_{{\mathsf {I}}}^{\mathcal{T}}\). The set of certain paths is denoted as \(\mathrm {CertPath}(q, \pi )\). The set of maximal certain paths is defined as \(\mathrm {MaxCertPath}(q, \pi )\;:=\;\{\sigma \in \mathrm {CertPath}(q, \pi ) \mid \;\vert \sigma _i \vert \le \vert \sigma \vert ,\;\text {for all}\;\sigma _i \in \;\mathrm {CertPath}(q, \pi )\}\). Given a mapping \(\pi\), we define the depth of q as \(\mathrm {dep}(q, \pi )\;:=\;\vert \sigma \vert - 1\), with \(\sigma \in \;\mathrm {MaxCertPath}(q, \pi )\).

The anonymous part of the universal model is a forest-shaped structure, as shown in Examples 3 and 4. Thus, any subquery that is matched to the anonymous part of the universal model must also be forest-shaped. If a term t in q is to be mapped to the anonymous part of \({\mathcal{U}}_{\mathcal{K}}\), it can only be mapped as \(\pi (t) = \pi (t_0) \cdot c_{R_1} \cdots c_{R_{n}}\) with \((t, \sigma , \delta ) \in f_\rho\), where \(\sigma = t_0 t_1 \cdots t_n\) is a certain path and \(\delta = R_1 R_2\cdots R_n\).
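The relation \(f_\rho\) and the depth \(\mathrm {dep}(q, \pi )\) can be illustrated with a simplified computation. The sketch below is an assumption-laden reading of Definition 3: it implements only the forward-extension clause (a term t extends the path of s when \(\pi (t)\) is the anonymous successor of \(\pi (s)\) along the role of an atom \(R(s, t)\)) and omits the clause that steps back over an inverse role. Matches are given as paths in tuple form, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]                       # (a, R1, ..., Rl) stands for a . c_R1 ... c_Rl
RoleAtom = Tuple[str, str, str]              # (R, s, t) for the atom R(s, t)
Triple = Tuple[str, Tuple[str, ...], Tuple[str, ...]]  # (t, sigma, delta)


def f_rho_forward(role_atoms: List[RoleAtom], pi: Dict[str, Path]) -> Set[Triple]:
    """Forward-only fragment of f_rho for a given match pi (simplified sketch)."""
    # f^0: every term mapped to an ABox individual starts a path of its own
    f: Set[Triple] = {(t, (t,), ()) for t, path in pi.items() if len(path) == 1}
    changed = True
    while changed:
        changed = False
        for (s, sigma, delta) in list(f):
            for (role, s2, t) in role_atoms:
                # tail(pi(s)) ~> tail(pi(t)) is read as: pi(t) extends pi(s) by c_role
                if s2 == s and pi[t] == pi[s] + (role,):
                    triple = (t, sigma + (t,), delta + (role,))
                    if triple not in f:
                        f.add(triple)
                        changed = True
    return f


def depth(f: Set[Triple]) -> int:
    """dep(q, pi) = |sigma| - 1 for a longest certain path sigma."""
    return max((len(sigma) for (_, sigma, _) in f), default=1) - 1


# The query Q from the introduction, matched into the universal model of Example 2.
atoms = [("advisor", "?X", "?Y0"), ("teaches", "?Y0", "?Y1")]
pi = {
    "?X": ("a0",),
    "?Y0": ("a0", "advisor"),
    "?Y1": ("a0", "advisor", "teaches"),
}
f = f_rho_forward(atoms, pi)
dep = depth(f)
```

For this match the longest certain path is `?X ?Y0 ?Y1` with role path `advisor teaches`, so the depth is 2. When every term is mapped to an ABox individual, only the initial triples remain and the depth is 0.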

Example 5

Let q(x) be a cyclic CQ of the following form: \(\exists \;y P(x, y) \wedge P(y, y)\), \({\mathcal{T}}= \{A \sqsubseteq \exists P, \exists P^- \sqsubseteq \exists P\}\).

Consider the first knowledge base \({\mathcal{K}}_1 = \{{\mathcal{T}}, {\mathcal{A}}_1\}\) with \({\mathcal{A}}_1 = \{A(a)\}\). As shown in Fig. 3, the universal model is heterogeneous to the query q, i.e., q has no match in it. Thus, \(\mathrm {ans}(q, {\mathcal{U}}_{{\mathcal{K}}_1}) = \mathrm {cert}(q, {\mathcal{K}}_1) = \emptyset\).

Fig. 3 The universal model of \({\mathcal{K}}_1\)

Consider the second knowledge base \({\mathcal{K}}_2 = \{{\mathcal{T}}, {\mathcal{A}}_2\}\) with \({\mathcal{A}}_2 = \{A(a),\) \(P(a,b), P(b,b)\}\). As shown in Fig. 4, the universal model is homogeneous to the query q under \(\pi = \{x \rightarrow a, y \rightarrow b\}\), and \(f_\rho = f_\rho ^0 = \{(x, x, \varepsilon ), (y, y, \varepsilon )\}\). Thus, \(\mathrm {dep}(q, \pi ) = 0\) and \(\mathrm {ans}(q, \mathcal U_{\mathcal K}) = \mathrm {cert}(q, {\mathcal{K}}_2) = \{a\}\).

Fig. 4 The universal model of \({\mathcal{K}}_2\)

These two examples demonstrate that the cyclic part of a query can only be mapped into the ABox part of the universal model, because the anonymous part of the universal model is forest-shaped.

Example 6

Let q(x) be a fork-shaped CQ of the following form: \(\exists \;y P(x, y) \wedge P(z, y)\), \({\mathcal{T}}= \{A \sqsubseteq \exists P, \exists P^- \sqsubseteq \exists P\}\).

Consider \({\mathcal{K}}_1 = \{{\mathcal{T}}, {\mathcal{A}}_1\}\) and \({\mathcal{A}}_1 = \{A(a), A(b)\}\). Then, as shown in Fig. 5, the universal model is heterogeneous to the query q. Thus, \(\mathrm {ans}(q, \mathcal U_{\mathcal K}) = \mathrm {cert}(q, {\mathcal{K}}_1) = \emptyset\).

Fig. 5 The universal model of \({\mathcal{K}}_1\)

Let \({\mathcal{K}}_2 = \{{\mathcal{T}}, {\mathcal{A}}_2\}\), where \({\mathcal{A}}_2 = \{A(a), A(b),\) \(P(a,c), P(b,c)\}\). Then, as shown in Fig. 6, the universal model is homogeneous to the query q under \(\pi = \{x \rightarrow a, z \rightarrow b, y \rightarrow c\}\), and \(f_\rho = f_\rho ^0 =\) \(\{(x, x, \varepsilon ), (y, y, \varepsilon ),\) \((z, z, \varepsilon )\}\). Thus, \(\mathrm {dep}(q, \pi ) = 0\) and \(\mathrm {ans}(q, \mathcal U_{\mathcal K}) = \mathrm {cert}(q, {\mathcal{K}}_2) = \{a, b\}\).

Fig. 6 The universal model of \({\mathcal{K}}_2\)

These two examples show that, like the cyclic part of a cyclic query, the fork part of a fork query can only be mapped into the ABox part of the universal model, because the anonymous part of the universal model is forest-shaped.
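The structural notions used in Examples 5 and 6 can be checked syntactically. The following Python sketch is an illustration only and is not taken from SUMA: it assumes that a query is given as a list of binary atoms (role, s, t), treats a directed cycle (including a self-loop such as P(y, y)) as a cyclic structure, and treats a term with two distinct predecessors as a fork structure. The function name `analyse` is hypothetical.

```python
from collections import defaultdict


def analyse(atoms):
    """Return (has_cycle, has_fork) for a query given as (role, s, t) atoms."""
    succs, preds = defaultdict(set), defaultdict(set)
    for _role, s, t in atoms:
        succs[s].add(t)
        preds[t].add(s)

    # Fork: some term is reached from two different terms (self-loops ignored).
    has_fork = any(len(sources - {t}) > 1 for t, sources in preds.items())

    # Cycle: depth-first search that looks for an edge back into the open path.
    state = {}

    def visit(node):
        state[node] = "open"
        for nxt in succs.get(node, ()):
            if state.get(nxt) == "open":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    has_cycle = any(node not in state and visit(node) for node in list(succs))
    return has_cycle, has_fork


example_5 = [("P", "x", "y"), ("P", "y", "y")]  # cyclic query of Example 5
example_6 = [("P", "x", "y"), ("P", "z", "y")]  # fork-shaped query of Example 6
print(analyse(example_5), analyse(example_6))
```

Under these assumptions, a query for which both flags are false is forest-shaped in the directed sense and may therefore be matched inside the anonymous part of the universal model.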

Theorem 1

For every consistent \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) KB \({\mathcal{K}}\), every rooted conjunctive query \(q = \exists \mathbf {u} \varphi (\mathbf {u}, \mathbf {v})\), and every mapping \(\pi\) with \(\mathrm {dep}(q, \pi ) = n\), we have \(\mathcal U_{\mathcal K}\models ^\pi q\) if and only if \({\mathcal{U}}^n_{\mathcal{K}}\;\models ^\pi q\).

Proof

(\(\Rightarrow\)) For every \(\pi\) with \(\mathrm {dep}(q, \pi ) = n\), then there exists a max certain path \(\sigma = t_0 t_1 \cdots t_{n-1} t_n\) and \(R(t_{n-1}, t_n) \in q\). Thus, R is n-step generating, and all other roles are \(\le n\)-step generating. By definition of \({\mathcal{U}}^n_{\mathcal{K}}\), we have that for every \(R \in q\), if \((a, b) \in R^{{\mathcal{U}}_{\mathcal{K}}}\) then \(a \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;\text {and}\;b \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\). Thus, \((a, b) \in R^{{\mathcal{U}}^n_{\mathcal{K}}}\). Since q is connected, for every \(A \in q\), suppose \(a \in A^{{\mathcal{U}}_{\mathcal{K}}}\), then \((a, *) \in R^{{\mathcal{U}}_{\mathcal{K}}}\) or \((*, a) \in R^{{\mathcal{U}}_{\mathcal{K}}}\). Thus, \(a \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\) and \(a \in A^{{\mathcal{U}}^n_{\mathcal{K}}}\). In conclusion, \({\mathcal{U}}^n_{\mathcal{K}}\;\models q\).

(\(\Leftarrow\)) For every \(R \in q\), if \((a, b) \in R^{{\mathcal{U}}^n_{\mathcal{K}}}\) then \(a \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\) and \(b \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\). Because \(\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\) is a subset of \(\varDelta ^{{\mathcal{U}}_{\mathcal{K}}}\), \(a \in \varDelta ^{{\mathcal{U}}_{\mathcal{K}}}\), \(b \in \varDelta ^{{\mathcal{U}}_{\mathcal{K}}}\) and \((a, b) \in R^{{\mathcal{U}}_{\mathcal{K}}}\). For every \(A \in q\), if \(a \in A^{{\mathcal{U}}^n_{\mathcal{K}}}\), then \(a \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\). Thus \(a \in \;\varDelta ^{{\mathcal{U}}_{\mathcal{K}}}\) and \(a \in A^{{\mathcal{U}}_{\mathcal{K}}}\).

Therefore, \({\mathcal{U}}^n_{\mathcal{K}}\;\models q(\mathbf {a}, \mathbf {b})\) if and only if \(\mathcal U_{\mathcal K}\;\models q(\mathbf {a}, \mathbf {b})\). \(\square\)

We say that \(f_\rho\) is a function if, for every term t, either \(f_\rho (t)\) is a singleton set, or \((t, \sigma , \delta )\) \(\in f_\rho (t)\;\text {and}\;(t, \sigma ', \delta ') \in f_\rho (t)\) imply \(\delta = \delta '\).
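As a minimal sketch of this condition, assume that \(f_\rho\) is available as a collection of triples (t, σ, δ) in which σ and δ are encoded as Python tuples (an encoding chosen here purely for illustration). The check then reduces to comparing the role sequences recorded for each term. The first relation below is the \(f_\rho\) of Example 7; the second adds a hypothetical conflicting triple.

```python
def is_function(triples):
    """Return True if every term is associated with a single role sequence delta."""
    delta_of = {}
    for term, _path, delta in triples:
        # setdefault stores delta on first sight and returns the stored value afterwards.
        if delta_of.setdefault(term, delta) != delta:
            return False
    return True


f_rho = [
    ("x", ("x",), ()),
    ("y0", ("x", "y0"), ("advisor",)),
    ("y1", ("x", "y0", "y1"), ("advisor", "teaches")),
]
# Hypothetical second derivation of y1 with a different role sequence.
conflicting = f_rho + [("y1", ("x", "y1"), ("teaches",))]

print(is_function(f_rho), is_function(conflicting))
```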

Lemma 1

For every mapping, if \(\mathcal U_{\mathcal K}\;\models ^{\pi } q\), then \(f_\rho\) is a function and every \(\sigma \in \;\mathrm {MaxCertPath}(q, \pi )\) is finite and \(\mathrm {dep}(q, \pi ) \le \vert \mathbf {u} \vert\).

Proof

Suppose \(f_\rho\) is not a function, then there exists \(t \in {\mathsf{N}}_{\mathsf{T}}\), with \((t, \sigma _i, \delta _i) \in\) \(f_\rho (t)\) and \((t, \sigma _j, \delta _j) \in f_\rho (t)\), where \(\delta _i \ne \delta _j\). We labeled \(\delta _i\) and \(\delta _j\) as \(\delta _i =\) \(R_i^0 \cdot R_i^1 \cdots {R}^{n_i}_i\) and \(\delta _j = R_j^0 \cdot R_j^1 \cdots {R}^{n_j}_j\), respectively. Thus, \(\mathcal U_{\mathcal K}\;\not \models ^{\pi } q\) due to \(c_{R_i^0}\cdots c_{{R}^{n_i}_i}\;\ne \;c_{R_j^0}\cdots c_{{R}^{n_j}_j}\). This creates a contradiction.

Because \({\mathsf{N}}_{\mathsf{T}}\) is finite, if \(\sigma\) is not finite, then there exists \(t'\) with \(\sigma = \sigma _1 t' \sigma _2 t' \sigma _3\). Thus, \((t', \sigma _1 t', \delta _1) \in f_\rho (t')\) and \((t', \sigma _1 t' \sigma _2 t', \delta _2) \in f_\rho (t')\). By the definition of \(f_\rho\), we have that \(\delta _1\) is a subsequence of \(\delta _2\) because \(\sigma _1 t'\) is a subsequence of \(\sigma _1 t' \sigma _2 t'\). Thus, \(f_\rho\) is not a function. However, we have proved that if \(\mathcal U_{\mathcal K}\;\models ^{\pi } q\), then \(f_\rho\) is a function. This creates a contradiction.

Suppose \(\mathrm {dep}(q, \pi ) > \vert \mathbf {u}\vert\). Then there exists a \(\sigma\) with \(\vert \sigma \vert > \vert \mathbf {u}\vert + 1\). Because free variables can only be mapped into \({\mathrm {Ind}}({\mathcal{A}})\), some quantified variable must appear more than once in the path \(\sigma\). Thus, \(\sigma\) is infinite. We have proved that if \(\mathcal U_{\mathcal K}\;\models ^{\pi } q\), then every \(\sigma \in \;\mathrm {MaxCertPath}(q, \pi )\) is finite. This creates a contradiction. \(\square\)

Based on Lemma 1 and Theorem 1, we can conclude that, for every rooted query, the model needs to be extended by at most \(\vert \mathbf {u} \vert\) steps. This is the core of the QAA:

Theorem 2

For each consistent \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) KB \({\mathcal{K}}\) and each rooted conjunctive query \(q = \exists \mathbf {u} \varphi (\mathbf {u}, \mathbf {v})\), with \(\vert \mathbf {u} \vert = n\), we have \(\mathrm {cert}(q, {\mathcal{K}}) = \mathrm {ans}(q, {\mathcal{U}}^n_{\mathcal{K}})\).

Proof

Kontchakov et al. showed that \(\mathrm {cert}(q, {\mathcal{K}}) = \mathrm {ans}(q, {\mathcal{U}}_{\mathcal{K}})\) (see Sect. 2). Thus, we only need to prove \(\mathrm {ans}(q, {\mathcal{U}}_{\mathcal{K}}) = \mathrm {ans}(q, {\mathcal{U}}^n_{\mathcal{K}})\).

(\(\Rightarrow\)) Lemma 1 shows that for every mapping \(\pi\), if \(\mathcal U_{\mathcal K}\;\models ^\pi q\), then \(n^*= \mathrm {dep}(q,\) \(\pi ) \le \vert \mathbf {u} \vert\). Based on Theorem 1, we can conclude that \({\mathcal{U}}^{n^*}_{\mathcal{K}}\models ^\pi q\). Because \(n^*\le \vert \mathbf {u} \vert\), \(\varDelta ^{{\mathcal{U}}^{n^*}_{\mathcal{K}}} \subseteq \varDelta ^{{\mathcal{U}}^{n}_{\mathcal{K}}} \subseteq \; \varDelta ^{{\mathcal{U}}_{\mathcal{K}}}\). We can conclude that \({\mathcal{U}}^{n}_{\mathcal{K}}\models ^\pi q\).

(\(\Leftarrow\)) Because \(\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;\subseteq \;\varDelta ^{{\mathcal{U}}_{\mathcal{K}}}\), if \({\mathcal{U}}^{n}_{\mathcal{K}}\models ^\pi q\) then \(\mathcal U_{\mathcal K}\;\models ^\pi q\).

In conclusion, \(\mathrm {ans}(q, {\mathcal{U}}_{\mathcal{K}}) = \mathrm {ans}(q, {\mathcal{U}}^n_{\mathcal{K}})\), that is, \(\mathrm {cert}(q, {\mathcal{K}}) = \mathrm {ans}(q, {\mathcal{U}}^n_{\mathcal{K}})\). \(\square\)

Example 7

The example in Sect. 1 illustrates our idea. Under the mapping \(\pi = \{x \rightarrow a_0, y_0 \rightarrow a_1, y_1 \rightarrow a_2\}\), we obtain \(f_\rho = \{(x, x, \varepsilon ), (y_0, x y_0, \text {advisor}),\) \((y_1, x y_0 y_1, \text {advisor}\cdot \text {teaches})\}\) and \(\mathrm {CertPath}(q, \pi ) = \{\sigma _1, \sigma _2, \sigma _3\}\) with \(\sigma _1 = x\), \(\sigma _2 = xy_0\), \(\sigma _3 = xy_0y_1\). Then \(\mathrm {MaxCertPath}(q, \pi ) = \{\sigma _3\}\) and \(\mathrm {dep}(q, \pi ) = 2\). Clearly, \(\mathrm {dep}(q, \pi )\) is not greater than the number of quantified variables; thus, for the sRDF (the 2-step universal model), we have \(\mathrm {ans}(\text {sRDF}, \mathrm {Q}) = \mathrm {ans}(\text {RDF}, \mathrm {Q})\).
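The depth computation in Example 7 can be replayed with a few lines of Python. This is a sketch under the assumption that the certain paths are already known and are encoded as tuples of terms; the helper name `depth` is hypothetical.

```python
def depth(cert_paths):
    """Return (dep, max_paths): the depth and the longest certain paths."""
    longest = max(len(path) for path in cert_paths)
    max_paths = [path for path in cert_paths if len(path) == longest]
    # dep(q, pi) is the length of a max certain path minus one.
    return longest - 1, max_paths


cert_paths = [("x",), ("x", "y0"), ("x", "y0", "y1")]  # sigma_1, sigma_2, sigma_3
quantified = {"y0", "y1"}

dep, max_paths = depth(cert_paths)
print(dep, max_paths, dep <= len(quantified))
```

The final comparison mirrors the bound of Lemma 1: the depth does not exceed the number of quantified variables.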

3.3 QAA Based on Boolean Conjunctive Queries

A Boolean conjunctive query has the form \(q = \exists \mathbf {u} \psi (\mathbf {u})\). It is a special case of a conjunctive query \(q = \exists \mathbf {u} \psi (\mathbf {u}, \mathbf {v})\) with \(\vert \mathbf {v}\vert = 0\). As mentioned in Sect. 2, every non-connected query can be decomposed into connected subqueries for processing, so we only consider connected Boolean queries. Because a Boolean conjunctive query has no variable that must be mapped into the ABox, the query analysis algorithm does not directly apply to Boolean conjunctive queries. As shown in Example 8, the query analysis algorithm is incomplete for Boolean conjunctive queries q: even when \({\mathcal{U}}^n_{\mathcal{K}}\) does not satisfy q, it is still possible for \({\mathcal{K}}\) to entail q.

Example 8

Let q be a Boolean CQ of the following form: \(\exists \;x\;y\;z\;(R_1(x, y) \wedge R_2(y, z))\), \({\mathcal{T}}= \{A \sqsubseteq \exists S, \exists S^- \sqsubseteq B, B \sqsubseteq \exists P, \exists P^- \sqsubseteq C, C \sqsubseteq \exists R_1, \exists {R_1}^- \sqsubseteq D, D \sqsubseteq \exists R_2, \exists {R_2}^- \sqsubseteq C\}\) and \({\mathcal{A}}= \{A(a)\}\). Figure 7 shows the graphs of the 3-step universal model and of the query. Clearly, the 3-step universal model does not satisfy q, but the universal model does.

Fig. 7 The universal model of \({\mathcal{K}}\)
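Example 8 can be reproduced with a small simulation. The Python sketch below is a simplification and not the materialization algorithm of SUMA: it only handles axioms of the two forms used in the example (\(C \sqsubseteq \exists R\) and \(\exists R^- \sqsubseteq D\)), names each anonymous witness by appending `c_R` to its parent, and assumes that the n-step model contains exactly the witnesses of depth at most n. All function names are hypothetical.

```python
def n_step_model(exists, inverse, abox, n):
    """Build concept memberships and role edges after n generating steps."""
    concepts = {ind: set(cs) for ind, cs in abox.items()}
    edges = set()
    frontier = list(concepts)
    for _ in range(n):
        new_elements = []
        for elem in frontier:
            for concept in sorted(concepts[elem]):
                for role in exists.get(concept, ()):  # concept ⊑ ∃role
                    witness = f"{elem}.c_{role}"
                    if witness not in concepts:
                        concepts[witness] = set()
                        new_elements.append(witness)
                    edges.add((role, elem, witness))
                    # ∃role⁻ ⊑ D: the witness becomes an instance of D.
                    concepts[witness].update(inverse.get(role, ()))
        frontier = new_elements
    return concepts, edges


def satisfies_chain(edges, r1, r2):
    """Check the Boolean query ∃x y z (r1(x, y) ∧ r2(y, z))."""
    return any(
        e1[0] == r1 and e2[0] == r2 and e1[2] == e2[1]
        for e1 in edges
        for e2 in edges
    )


exists = {"A": ["S"], "B": ["P"], "C": ["R1"], "D": ["R2"]}
inverse = {"S": ["B"], "P": ["C"], "R1": ["D"], "R2": ["C"]}
abox = {"a": {"A"}}

for n in (3, 4):
    _, edges = n_step_model(exists, inverse, abox, n)
    print(n, satisfies_chain(edges, "R1", "R2"))
```

In this simulation the \(R_2\) edge is only generated in the fourth step, which is why a bound of three steps, the number of quantified variables, is not sufficient for this Boolean query.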

To extend QAA to support Boolean conjunctive queries, we first summarize the general properties of the universal model.

  • If the ontology contains the CEQ axioms, the universal model is infinite.

  • The anonymous part of the universal model is forest-like, and is heterogeneous to the cyclic and fork structures.

The two properties are explained or proved in the preceding two sections. Based on these two properties, we can conclude that if a Boolean conjunctive query contains a cyclic structure or fork structure and the query can be satisfied, then there must be a variable mapped to the ABox. Therefore, Boolean conjunctive queries that contain a cyclic or fork structure can be directly solved by QAA. To verify this, we first extend the triple relation \(f_\rho\) defined for rooted queries to Boolean queries by modifying the definition of \(\rho\).

Definition 4

Let \(q = \exists \mathbf {u} \varphi (\mathbf {u}, \mathbf {v})\) be a CQ, \({\mathcal{T}}\) be a TBox, and \(\pi\) be a mapping. We define a triple relation \(f_\rho = \cup _{i \ge 0} f_\rho ^i \subseteq {\mathsf{N}}_{\mathsf{T}}\times {\mathsf{N}}_{\mathsf{T}}^* \times { {\mathsf{N}}_{\mathsf{R}}^-}^{*}\) with \(\rho = \{t \mid t \in {\mathsf{N}}_{\mathsf{T}},\) \(\pi (t) \in {\mathrm {Ind}}({\mathcal{A}})\;\text {or there is no}\;t'\;\text {such that}\;tail(\pi (t')) \leadsto tail(\pi (t))\}\), where

  • \(f_\rho ^0 = \{(t, t, \varepsilon )\;\vert \;t \in \rho \}\);

  • \(f_\rho ^{i+1} = f_\rho ^i\;\cup \;\{(t, \sigma s t, \delta S R) \mid (s, \sigma s, \delta S) \in f_\rho ^i, R(s, t) \in q, tail(\pi (s)) \leadsto\) \(tail(\pi (t))\} \;\cup \;\{(t, \sigma , \delta ) \mid (s, \sigma s, \delta R^-) \in f_\rho ^i, R(s, t) \in q, tail(\pi (s))\) \(\leadsto tail(\pi (t))\}\).

Lemma 2

For each consistent \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) KB \({\mathcal{K}}\) and each Boolean conjunctive query \(q = \exists \mathbf {u} \varphi (\mathbf {u})\) that contains a cyclic structure or fork structure, if \(\mathcal U_{\mathcal K}\) satisfies q, then there are variables in the query that must be mapped to the ABox.

Proof

Assuming that all variables in q are mapped to anonymous variables, if \(\mathcal U_{\mathcal K}\) satisfies q and q contains a cyclic structure or fork structure, then there must be a variable \(t \in {\mathsf{N}}_{\mathsf{T}}\), such that \((t, \sigma _i, \delta _i) \in\) \(f_\rho (t)\) and \((t, \sigma _j, \delta _j) \in f_\rho (t)\), where \(head(\sigma _i) \cdot \delta _i \ne head(\sigma _j) \cdot \delta _j\). The \(\delta _i\) and \(\delta _j\) are labeled as \(\delta _i =\) \(R_i^0 \cdot R_i^1 \cdots {R}^{n_i}_i\) and \(\delta _j = R_j^0 \cdot R_j^1 \cdots {R}^{n_j}_j\), respectively. Thus, \(\mathcal U_{\mathcal K}\;\not \models ^{\pi } q\) due to \(\pi (head(\sigma _i))\cdot c_{R_i^0}\cdots c_{{R}^{n_i}_i}\;\ne \;\pi (head(\sigma _j)) \cdot c_{R_j^0}\cdots c_{{R}^{n_j}_j}\). This creates a contradiction. Thus, if \(\mathcal U_{\mathcal K}\) satisfies Boolean conjunctive query q and q contains a cyclic structure or fork structure, then there are variables in the query that must be mapped to ABox. \(\square\)

Example 9

Let q be a fork-shaped Boolean CQ of the following form: \(\exists \;x\;y\;z\;P(x, z) \wedge P(y, z)\), \({\mathcal{T}}= \{A \sqsubseteq \exists R\}\) and \({\mathcal{A}}= \{A(a), A(b)\}\).

Based on the definition of \(f_\rho\), we have that \((z, x \cdot z, R) \in f_\rho\) and \((z, y \cdot z, R) \in f_\rho\). If \(\mathcal U_{\mathcal K}\) satisfies q under \(\pi\) and all variables are mapped to anonymous individuals, then \(\pi (z) = \pi (x)\cdot c_R = \pi (y)\cdot c_R\). However, \(\pi (x) \not = \pi (y)\), which is a contradiction. Thus, there is no mapping under which \(\mathcal U_{\mathcal K}\) satisfies q and all variables are mapped to anonymous individuals.

As shown in Fig. 8, the universal model is heterogeneous to the query q. Thus, \({\mathcal{U}}_{\mathcal{K}}\not \models q\).

Fig. 8 The graph of the query and the universal model

Theorem 3

For each consistent \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) KB \({\mathcal{K}}\) and each Boolean conjunctive query \(q = \exists \mathbf {u} \varphi (\mathbf {u})\) that contains a cyclic structure or fork structure, we have that \(\mathcal U_{\mathcal K}\) satisfies q if and only if \({\mathcal{U}}^n_{\mathcal{K}}\) satisfies q with \(n = \vert \mathbf {u} \vert\).

Proof

(\(\Rightarrow\)) Based on Lemma 2, if \(\mathcal U_{\mathcal K}\) satisfies q under \(\pi\) and q contains a cyclic structure or fork structure, then there are variables in the query that must be mapped to the ABox. Then, we can conclude that \(\mathrm {dep}(q, \pi ) \le \vert \mathbf {u} \vert\) based on Lemma 1. We label the max certain path of q as \(\sigma = t_0 t_1 \cdots t_{m-1} t_m\) with \(R(t_{m-1}, t_m) \in q\). Thus, R is m-step generating with \(m \le \vert \mathbf {u} \vert\). By definition of \({\mathcal{U}}^n_{\mathcal{K}}\), we have that for every \(R \in q\), if \((a, b) \in R^{{\mathcal{U}}_{\mathcal{K}}}\) then \(a \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\;\text {and}\;b \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\). Thus, \((a, b) \in R^{{\mathcal{U}}^n_{\mathcal{K}}}\). Since q is connected, for every \(A \in q\), suppose \(a \in A^{{\mathcal{U}}_{\mathcal{K}}}\), then \((a, *) \in R^{{\mathcal{U}}_{\mathcal{K}}}\) or \((*, a) \in R^{{\mathcal{U}}_{\mathcal{K}}}\). Thus, \(a \in \;\varDelta ^{{\mathcal{U}}^n_{\mathcal{K}}}\) and \(a \in A^{{\mathcal{U}}^n_{\mathcal{K}}}\). In conclusion, \({\mathcal{U}}^n_{\mathcal{K}}\;\models q\).

(\(\Leftarrow\)) Because \({\mathcal{U}}^n_{\mathcal{K}}\) satisfies q and \({\mathcal{U}}^n_{\mathcal{K}}\) is a subset of \(\mathcal U_{\mathcal K}\), \(\mathcal U_{\mathcal K}\) satisfies q. \(\square\)

4 n-step Universal Model in OWL 2 DL

OWL 2 DL is based on \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\). Because \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\) has more complex constructors than \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\), more complex information about the described domain can be captured by OWL 2 DL. Particularly, OWL 2 DL has complex restrictions, such as property restrictions and arbitrary cardinality. Besides, UNA is not adopted by OWL 2 DL, that is, different individual names may point to the same individual.

However, as the expressiveness of the ontology language increases, so does the complexity of the reasoning algorithms. For instance, when dealing with axioms of the form UNION, such as \(C \sqsubseteq C_1 \mathop {\sqcup } C_2\), it is necessary to try all possibilities and to backtrack in the tree when contradictions arise in the leaf nodes. This process significantly increases the time and memory consumption of materialization.

To benefit from the high expressiveness of OWL 2 DL while improving materialization efficiency, we support OWL 2 DL through approximation and rewriting mechanisms. Given an OWL 2 DL ontology, we first attempt to rewrite each axiom as an equivalent \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) TBox axiom. If this is not possible, we have three choices: approximate processing, adding additional data structures, or adding other ABox transformation rules. The last two methods can preserve the semantics that approximate processing would lose.

4.1 TBox Transformation

TBox transformation is presented in Tables 1 and 2 (r and s denote role names). Given a \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\) TBox axiom, we first rewrite it to an equivalent one according to the rewriting rules in Table 1, a step also known as normalization [25]. We add each concept and its complement to a complement table (CT), which is designed to record complement semantics.

Table 1 OWL 2 DL rewriting rules
Table 2 Approximation rules

Because the normalized TBox axioms \({\mathcal{T}}\) go beyond the expressiveness of \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\), and axioms of the form \(C \sqsubseteq \mathop {\sqcup }\nolimits _{i = 1}^{n} C_i\) or \(C \sqsubseteq \exists R.\{a_1, a_2, a_3\}\) lead to non-determinism, we syntactically approximate some axioms by their complement, as shown in Table 2. In particular, approximation rule No. 10 constructs a new concept for nominals. All TBox transformation rules are sound, as shown in [25].

4.2 ABox Transformation

The semantics of \(\textit{DL-Lite}^{{\mathcal{N}}}_{\textit{horn}}\) do not cover the axioms shown in Table 3. Thus, we design tractable ABox transformation rules, i.e., ABox reasoning rules, for them based on the semantics of \({\mathcal{S}}{\mathcal{R}}{\mathcal{O}}{\mathcal{I}}{\mathcal{Q}}\).

Table 3 ABox transformation

An extended TBox \({\mathcal{T}}^{*}\) is a set of axioms obtained from \({\mathcal{T}}\) by applying TBox transformation rules and adding ABox transformation rules. Let \({\mathcal{K}}^{*} := ({\mathcal{T}}^{*}, {\mathcal{A}})\). Let n be a natural number. The n-step universal model of \(({\mathcal{T}}^{*}, {\mathcal{A}})\) is called an extended n-step universal model, denoted by \({{\mathcal{U}}}^n_{{\mathcal{K}}^{*}}\), of \({\mathcal{K}}\).

By Theorem 2 and the definition of the transformation rules, we can conclude:

Proposition 1

(Approximation) Let \({\mathcal{K}}= ({\mathcal{T}}, {\mathcal{A}})\) be a consistent KB and \(q = \exists \mathbf {u} \varphi (\mathbf {u}, \mathbf {v})\) be a rooted CQ with \(\vert \mathbf {u} \vert = n\). For every \(\mathbf {a} \subseteq {\mathrm {Ind}}({\mathcal{A}})\) with \(\vert \mathbf {a}\vert = \vert \mathbf {v}\vert\), if \(\mathbf {a} \in \mathrm {ans}(q, {\mathcal{U}}^{n}_{{\mathcal{K}}^{*}})\) then \(\mathbf {a} \in \mathrm {cert}(q, {\mathcal{K}})\).

Example 10

Let \({\mathcal{K}}\;=\;(\{\alpha _1, \alpha _2, \alpha _3\}, \{\beta _1, \beta _2, \beta _3, \beta _4, \beta _5\})\) be the KB shown in the following table. GraduateStudent is abbreviated as GS.

Axiom | Expression | Axiom | Expression
\(\alpha _1\) | \(\text {GS}\;\equiv \;\text {Person}\;\sqcap \;\ge \;3 \text {takeCourse}\) | \(\beta _2\) | \(\text {takeCourse}(b,\;c_1)\)
\(\alpha _2\) | \(\text {Person}\;\equiv \;\text {Woman}\;\sqcup \;\text {Man}\) | \(\beta _3\) | \(\text {takeCourse}(b,\;c_2)\)
\(\alpha _3\) | \(\text {Dis}(\text {Woman},\;\text {Man})\) | \(\beta _4\) | \(\text {takeCourse}(b,\;c_3)\)
\(\beta _1\) | \(\text {GS}(a)\) | \(\beta _5\) | \(\text {Man}(b)\)

First step: we get new TBox axioms:

Axiom | Expression | Derived from
\(\alpha _1^1\) | \(\text {GS}\;\sqsubseteq \;\text {Person}\) | \((\alpha _1, \text {No1}, \text {No2})\)
\(\alpha _1^2\) | \(\text {GS}\;\sqsubseteq \;\ge \;3 \text {takeCourse}.\text {Thing}\) | \((\alpha _1, \text {No1}, \text {No2})\)
\(\alpha _1^3\) | \(\text {Person}\;\sqcap \;\ge \;3 \text {takeCourse}\;\sqsubseteq \;\text {GS}\) | \((\alpha _1, \text {No1})\)
\(\alpha _2^1\) | \(\text {Woman}\;\sqsubseteq \;\text {Person}\) | \((\alpha _2, \text {No4}, \text {No5})\)
\(\alpha _2^2\) | \(\text {Man}\;\sqsubseteq \;\text {Person}\) | \((\alpha _2, \text {No4}, \text {No5})\)
\(\alpha _3^1\) | \(\text {Woman}\;\sqsubseteq \;\lnot \text {Man}\) | \((\alpha _3, \text {No6})\)
\(\alpha _3^2\) | \(\text {Man}\;\sqsubseteq \;\lnot \text {Woman}\) | \((\alpha _3, \text {No6})\)

Second step: from the above axioms, we can get the following new facts:

Fact | Expression | Derived from | Fact | Expression | Derived from
\(\beta _1^1\) | \(\text {Person}(a)\) | \((\beta _1, \alpha _1^1)\) | \(\beta _1^2\) | \(\text {takeCourse}(a,a_1)\) | \((\beta _1, \alpha _1^2)\)
\(\beta _5^1\) | \(\text {Person}(b)\) | \((\beta _5, \alpha _2^2)\) | \(\beta _1^3\) | \(\text {takeCourse}(a,a_2)\) | \((\beta _1, \alpha _1^2)\)
\(\beta _6\) | \(\text {GS}(b)\) | \((\beta _2\;-\;\beta _4, \beta _5^1, \alpha _1^3)\) | \(\beta _1^4\) | \(\text {takeCourse}(a,a_3)\) | \((\beta _1, \alpha _1^2)\)
\(\beta _5^2\) | \(\lnot \text {Woman}(b)\) | \((\beta _5, \alpha _3^2)\) | \(\beta _8\) | \((a_1, \dot{ \ne } , a_2)\) | \((\beta _1^2, \beta _1^3, \alpha _1^2)\)
\(\beta _7\) | \((a_1, \dot{ \ne } , a_3)\) | \((\beta _1^2, \beta _1^4, \alpha _1^2)\) | \(\beta _9\) | \((a_2, \dot{ \ne } , a_3)\) | \((\beta _1^3, \beta _1^4, \alpha _1^2)\)
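The derivation of \(\beta _6\) in Example 10 can be sketched in Python as follows. The sketch applies the axiom \(\alpha _1^3\) to the ABox facts \(\beta _2\) to \(\beta _4\) and the derived fact \(\beta _5^1\). It assumes that the course names \(c_1\), \(c_2\), \(c_3\) denote pairwise distinct individuals, which OWL 2 DL does not guarantee in general because the UNA is not adopted. The function name and the data encoding are hypothetical.

```python
def min_cardinality_instances(types, role_facts, body_concept, role, k):
    """Individuals in body_concept that have at least k distinct role successors."""
    successors = {}
    for fact_role, subj, obj in role_facts:
        if fact_role == role:
            successors.setdefault(subj, set()).add(obj)
    return {
        subj
        for subj, objs in successors.items()
        if len(objs) >= k and body_concept in types.get(subj, set())
    }


# Concept memberships after applying alpha_1^1 and alpha_2^2.
types = {"a": {"GS", "Person"}, "b": {"Man", "Person"}}
# ABox facts beta_2, beta_3 and beta_4.
role_facts = [("takeCourse", "b", course) for course in ("c1", "c2", "c3")]

# alpha_1^3: Person ⊓ ≥3 takeCourse ⊑ GS
new_gs = min_cardinality_instances(types, role_facts, "Person", "takeCourse", 3)
print(new_gs)
```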

5 The System and Implementation of SUMA

SUMA computes the universal model offline and executes queries online. The offline stage consists of three modules: ontology processor, storage, and materialization (Fig. 9).

Fig. 9 The architecture of SUMA

The ontology processor module has four submodules. The OWL 2 DL processor parses the ontology through the OWL API [13] and rewrites it into \(\textit{DL-Lite}\) axioms according to the rewriting and approximation techniques. It builds the inverse role table and the equivalent role table to support the role rewriting algorithm.

The role processor implements the role scoring algorithm [27]. The role rewriter first generates equivalent and inverse role mappings following [27]. For instance, if role d is equivalent to role e and \(d.score < e.score\), then d is mapped to e, that is, \(mapping(d) = e\) and \(mapping(e) = e\). If \(d.score = e.score\) and \(d.id > e.id\), then d is also mapped to e. Second, the role rewriter rewrites axioms and facts.
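The selection rule for two equivalent roles can be written down directly. The following Python sketch only covers the two cases stated above (higher score wins; on equal scores the smaller ID wins). The `Role` class, the function name, and the concrete scores and IDs are hypothetical and are not taken from [27].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    id: int
    score: int


def choose_mapping(d, e):
    """Map two equivalent roles to a common representative."""
    if d.score != e.score:
        representative = d if d.score > e.score else e
    else:
        # Equal scores: the role with the smaller ID is kept.
        representative = d if d.id < e.id else e
    return {d.name: representative.name, e.name: representative.name}


like = Role("like", id=7, score=2)
love = Role("love", id=3, score=5)
print(choose_mapping(like, love))
```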

Example 11

We use the ontologies and facts in Fig. 10 to illustrate the process of role rewriting. The role graph is generated by the role scoring algorithm. Based on the role graph, taughtBy is mapped to teach and like is mapped to love. After the mapping is chosen, role rewriting proceeds in three phases: axiom rewriting, fact rewriting, and optimizing rewriting. In the fact rewriting phase, taughtBy(Physics, Lisa) and like(Lisa, basketball) are changed to teach(Lisa, Physics) and love(Lisa, basketball), respectively. The axiom range(taughtBy, Teacher) is rewritten in the axiom rewriting phase: it is first rewritten to range(teach\(^-\), Teacher) and then optimized to domain(teach, Teacher). During the materialization process, taughtBy(Physics, Lisa) and like(Lisa, basketball) are not stored in memory, and Teacher(Lisa) is generated only once, by teach(Lisa, Physics).

Fig. 10 An example of role rewriting algorithm
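The fact rewriting phase of Example 11 can be sketched in Python. The mapping table below encodes, for each rewritten role, its target role and whether the two are inverses of each other (in which case subject and object are swapped). The data layout and the function name are assumptions made for illustration.

```python
# role -> (target role, is_inverse); taken from the mappings chosen in Example 11.
ROLE_MAPPING = {
    "taughtBy": ("teach", True),   # taughtBy is the inverse of teach
    "like": ("love", False),       # like is equivalent to love
}


def rewrite_fact(role, subj, obj, mapping=ROLE_MAPPING):
    """Rewrite a role assertion role(subj, obj) according to the role mapping."""
    target, is_inverse = mapping.get(role, (role, False))
    return (target, obj, subj) if is_inverse else (target, subj, obj)


print(rewrite_fact("taughtBy", "Physics", "Lisa"))
print(rewrite_fact("like", "Lisa", "basketball"))
```

Facts whose role has no entry in the mapping are returned unchanged, so the function can be applied uniformly to all role assertions.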

The storage module uses the Jena API to load RDF data and generates a dictionary by encoding each RDF resource as an integer ID. The dictionary is shared by the storage module and the ontology processor. The RDF data are stored as a triple table with three types of indexes: a primary index, a secondary index, and a tertiary index. TableIterator traverses the triple table efficiently using these indexes. It also maintains an index array that records, for each n-step model, the corresponding index range in the triple table.
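The combination of dictionary encoding and per-step index ranges can be illustrated with a small Python sketch. It is not the storage layout of SUMA: it assumes an append-only triple table in which the triples produced by materialization step k are stored after those of step k-1, and it omits the three indexes. The class and method names are hypothetical.

```python
class TripleTable:
    """Append-only triple table with dictionary encoding and per-step ranges."""

    def __init__(self):
        self.ids = {}        # RDF resource -> integer ID
        self.triples = []    # encoded triples in insertion order
        self.step_end = []   # step_end[k] = number of triples in the k-step model

    def encode(self, resource):
        # A new resource receives the next free integer ID.
        return self.ids.setdefault(resource, len(self.ids))

    def add(self, subj, pred, obj):
        self.triples.append((self.encode(subj), self.encode(pred), self.encode(obj)))

    def close_step(self):
        # Record where the model of the current step ends.
        self.step_end.append(len(self.triples))

    def model(self, n):
        # Steps beyond the last materialized one fall back to the full table.
        n = min(n, len(self.step_end) - 1)
        return self.triples[: self.step_end[n]]


table = TripleTable()
table.add("a", "rdf:type", "A")
table.close_step()                 # 0-step model: the original data
table.add("a", "S", "a.c_S")
table.close_step()                 # 1-step model
print(len(table.model(0)), len(table.model(1)), len(table.model(5)))
```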

The materialization module has three submodules: binding query, axiom matcher, and sameAs reasoner. The detailed materialization algorithm is shown in Algorithm 1. It iteratively reads a new triple F from the triple table through TableIterator. If F has an equivalent triple G (for instance, \(F = (d, R, e)\), \(G = (d^{\prime }, R, e^{\prime })\), \(d \doteq d^{\prime }\), \(e \doteq e^{\prime }\)), then the program does not process F, which improves reasoning efficiency. The reasoning over F can be divided into three situations.

figure a

Firstly, if F has the form \((d, \doteq , e)\), it is processed by Algorithm 2. The sameAs reasoning function puts the individuals d and e into an equivalence pool and selects the individual with the smallest ID as the identifier. We set the sameAs mapping of d to c if there exists \((d, \doteq , c)\) and c has the smallest ID in the equivalence pool. Secondly, if the role in F is a functional or an inverse functional role (for instance, \(F = (d, P, e)\) and P is a functional role), then all triples of the form \((d, P, *)\) are returned by \({\mathcal{I}}.\text {getIndividual}(d, P)\). A new fact \((e, \doteq , *)\) is added to \({\mathcal{I}}\) if \({\mathcal{I}}\) does not contain a fact \((e, \dot{ \ne } , *)\). Thirdly, the axiom matcher returns all axioms that match the triple F through \(\text {matchAxiom}(F, {\mathcal{T}})\). The binding query function converts these partially matched axioms into partial binding queries. The function \({\mathcal{I}}.\text {evaluate}\) executes these queries over \({\mathcal{I}}\) and returns a new fact.

figure b
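The equivalence pool with a smallest-ID identifier can be sketched with a union-find structure. This Python sketch is one possible realization and is not claimed to be Algorithm 2: it assumes that individuals are already encoded as integer IDs, and the class and method names are hypothetical.

```python
class SameAsPool:
    """Union-find over integer IDs whose representative is the smallest ID."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, d, e):
        root_d, root_e = self.find(d), self.find(e)
        if root_d == root_e:
            return root_d
        # Attach the larger root to the smaller one so the root stays minimal.
        small, large = min(root_d, root_e), max(root_d, root_e)
        self.parent[large] = small
        return small


pool = SameAsPool()
for d, e in [(5, 3), (3, 7), (9, 1), (7, 9)]:   # facts of the form (d, sameAs, e)
    pool.union(d, e)
print(pool.find(5), pool.find(9))
```

Because the larger root is always attached to the smaller one, the representative returned by `find` is the smallest ID of the pool, which corresponds to the sameAs mapping described above.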

The online part includes a SPARQL processor and a model matcher. The SPARQL processor applies QAA to compute the step size (n) of the model. In essence, calculating the number of quantified variables takes O(1) time. The model matcher takes n as input and passes the n-step universal model to the SPARQL query engine. The SPARQL query engine executes the SPARQL queries and returns the query results.
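As a minimal illustration of the step-size computation, the following Python sketch counts the quantified variables of a conjunctive query. It assumes that the query is given as a list of atoms whose variables start with `?` and that the answer variables are listed separately; real SPARQL parsing is out of scope, and the function name is hypothetical. The query used below is the one suggested by \(f_\rho\) in Example 7.

```python
def step_size(atoms, answer_vars):
    """Number of quantified (non-answer) variables in a conjunctive query."""
    variables = {
        term
        for _role, *terms in atoms
        for term in terms
        if term.startswith("?")
    }
    return len(variables - set(answer_vars))


# q(x) = ∃ y0 y1 . advisor(x, y0) ∧ teaches(y0, y1)
atoms = [("advisor", "?x", "?y0"), ("teaches", "?y0", "?y1")]
n = step_size(atoms, ["?x"])
print(n)
```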

6 Experiments and Evaluations

SUMA delegates SPARQL queries to RDF-3X [24] in this experiment. The experimental environment is a 24-core machine equipped with 180 GB of RAM and running Ubuntu 18.04. The experiment is divided into three parts: the evaluation of the query answering system on finite models, the evaluation of the query answering system on infinite models, and the evaluation of the role rewriting algorithm.

Evaluation criteria We test two aspects: (i) the soundness and completeness of the answers and (ii) the scalability of the query answering system. The first aspect is measured by the number of queries that the system can answer correctly under the certain answer semantics. For scalability, we measure the pre-processing time, which consists of the data load time and the materialization time, and the query processing time on datasets of increasing size.

  • Data load time: This time includes all the data pre-processing steps before materialization, such as constructing a dictionary, generating an index, etc.

  • Materialization time: The time taken by reasoner to compute consequences.

  • Query processing time: The time taken by the system to execute a query on the extended data and return the query results.

Comparison system We compare SUMA with the following four systems, all of which use materialization methods.

  • Pellet is sound and complete in OWL 2 DL. It is adopted as the criterion for soundness and completeness evaluation.

  • PAGOdA employs RDFox for highly scalable reasoning. Therefore, we mainly test the scalability of SUMA with PAGOdA.

  • gOWL also adopts a partial materialization algorithm to solve the infinite materialization problem. We compare the two kinds of partial materialization algorithms experimentally.

  • SUMA-N indicates a query answering system that does not use the role rewriting algorithm. SUMA-N is used to evaluate the performance of the role rewriting algorithm.

6.1 Query Answering over Finite Universal Model

Table 4 gives a summary of all datasets and queries used in this experiment. Besides the 14 standard queries of LUBM, we also test ten queries from PAGOdA. The DBPedia [4] axioms are simple and can be captured by OWL 2 RL [20]. Therefore, we adopt the DBPedia+ axioms and the 1024 DBPedia+ queries provided by PAGOdA. The DBPedia+ axioms include additional tourism ontologies. We generate 260 atomic queries for the YAGO [29] dataset.

Table 4 The information of datasets

6.1.1 The Soundness and Completeness Evaluation

Because Pellet and gOWL cannot give query results on LUBM(100), UOBM(100), DBPedia+ and YAGO in 2 h, we do not display the results of them. As shown in Table 5, SUMA can correctly answer all queries on each test dataset.

Table 5 The quality of the answers

6.1.2 The Scalability Test

Since gOWL cannot materialize DBPedia+ and YAGO in less than 2 h, we compare SUMA and gOWL on small datasets before comparing SUMA and PAGOdA. gOWL takes 1.77 h to materialize LUBM(10), while SUMA takes 1.51 s. The materialization time of gOWL on UOBM(10) is 3.19 h, whereas SUMA takes only 8.34 s to materialize UOBM(10). SUMA is thus more scalable than gOWL.

Next, we test SUMA and PAGOdA on a series of datasets. For LUBM, we use datasets of increasing size with a step of 200. Since UOBM ontology is more complicated than LUBM ontology, we set the UOBM dataset growth step length as 100. For each dataset and ontology, we test the pre-processing time (pre-time), data load time, materialization time (mat-time), and average query processing time (avg-time).

Pre-processing Time Evaluation As shown in Fig. 11, SUMA significantly reduces pre-processing time, which increases linearly with the size of the dataset. On each test dataset, the pre-processing time of SUMA is shorter than that of PAGOdA.

Fig. 11 Pre-processing experimental results

SUMA takes only 124 s to materialize LUBM(1000). The pre-processing time of SUMA on LUBM(1000) is 549 s, shorter than PAGOdA’s 1692 s. The time taken by SUMA to materialize UOBM(500) is 411 s, and the total pre-processing time is 862 s. Compared with the 5937 s pre-processing time of PAGOdA, SUMA is much faster. SUMA takes 18 s and 6 s to materialize DBPedia+ and YAGO, respectively. The pre-processing time of SUMA on DBPedia+ and YAGO is 71 s and 63 s, respectively. PAGOdA takes 309 s to pre-process DBPedia+ and 139 s to pre-process YAGO. SUMA is thus more scalable than PAGOdA.
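The relative pre-processing times can be read off from the figures reported above. The following Python snippet is a small worked calculation that uses only those reported numbers; the variable names are chosen here for illustration.

```python
# Pre-processing time in seconds: dataset -> (SUMA, PAGOdA), as reported above.
pre_processing_seconds = {
    "LUBM(1000)": (549, 1692),
    "UOBM(500)": (862, 5937),
    "DBPedia+": (71, 309),
    "YAGO": (63, 139),
}

# Ratio of PAGOdA's pre-processing time to SUMA's for each dataset.
speedups = {
    name: pagoda / suma
    for name, (suma, pagoda) in pre_processing_seconds.items()
}

for name, ratio in speedups.items():
    print(f"{name}: PAGOdA / SUMA = {ratio:.1f}x")
```

On these numbers the ratio ranges from roughly 2.2 on YAGO to roughly 6.9 on UOBM(500).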

Average Query Processing Time Evaluation

The average query processing time of SUMA on LUBM(1) and UOBM(1) is four and five orders of magnitude faster than that of Pellet, respectively. Because SUMA and gOWL both rely on existing query engines to perform queries, we only compare SUMA with PAGOdA in the average query processing time evaluation.

We test the average query processing time of 24 LUBM queries on six LUBM datasets, 15 UOBM queries on five UOBM datasets, 1024 DBPedia+ queries on one DBPedia+ dataset, and 260 YAGO queries on the YAGO dataset. As shown in Fig. 12a, SUMA has a shorter average query processing time than PAGOdA on all LUBM datasets except LUBM(100), where SUMA takes 0.62 s and PAGOdA takes 0.57 s. The significant decrease in the query processing time of SUMA on LUBM(500) is related to RDF-3X, which can provide shorter query times on larger data by building different efficient indexes.

Figure 12b shows that the average query processing time of SUMA is an order of magnitude shorter than that of PAGOdA on all UOBM datasets.

The average query processing time of SUMA on DBPedia+ is 24.337 ms, shorter than PAGOdA’s 33.235 ms. The average query processing time of SUMA on YAGO is 39.166 ms, shorter than PAGOdA’s 67.096 ms.

6.2 Query Answering over Infinite Universal Model

Since LUBM, UOBM, and DBPedia+ all have finite universal models, they are not suitable for the second experiment. We therefore manually add some CEQ axioms to the LUBM and UOBM ontologies. Table 6 lists all the data used in the infinite model evaluation.

Table 6 The information of datasets

We also customize some additional queries to test LUBM+ and UOBM+. Table 7 summarizes our queries. Besides the queries included in the first experiment, we add nine queries for LUBM+ and five customized queries for UOBM+. The third column shows the number of queries that contain a cyclic structure. The number of queries with more than two quantified variables is given in the table’s fourth column.

Table 7 The information of queries

6.2.1 The Soundness and Completeness Evaluation

As shown in Table 8, SUMA and Pellet can calculate all the correct answers for all test queries, whereas PAGOdA is incomplete on five LUBM+ queries (Q2, Q4, Q5, Q6, Q7) and three UOBM+ queries (Q1, Q2, Q3) (see Footnote 1). gOWL is complete only on a few queries.

Table 8 The quality of the answers

6.2.2 The Scalability Test

According to a statistical analysis of actual SPARQL queries, more than 96% of the queries include up to 7 triple patterns [10]. Therefore, in most cases, we only need to consider universal models whose number of steps n is not greater than 7. In addition, we find that SUMA remains efficient when n is greater than 7.
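To make the role of the step bound n concrete, the following toy sketch builds a depth-bounded model for axioms of the form A ⊑ ∃R.B. It is not SUMA's materialization algorithm; the function name, the data layout, and the example axiom Person ⊑ ∃hasParent.Person are assumptions chosen only to show how a cyclic axiom would generate fresh individuals forever unless expansion stops at depth n.

```python
from itertools import count


def n_step_model(concepts, existential_rules, n):
    """Build a depth-bounded model for axioms of the form A ⊑ ∃R.B.

    concepts: dict mapping individual -> set of concept names
    existential_rules: list of (A, R, B) triples, read as A ⊑ ∃R.B
    n: maximum depth of the fresh (anonymous) individuals to generate
    Returns (concepts, roles), where roles is a set of (subject, R, object).
    """
    concepts = {ind: set(cs) for ind, cs in concepts.items()}
    roles = set()
    fresh = count()
    # The frontier holds (individual, depth); named individuals have depth 0.
    frontier = [(ind, 0) for ind in concepts]
    while frontier:
        ind, depth = frontier.pop()
        if depth >= n:
            continue  # n-step cut-off: do not expand this individual further
        for a, r, b in existential_rules:
            if a in concepts[ind]:
                # Simplification: always create a new witness, without
                # checking whether a suitable successor already exists.
                null = f"_:n{next(fresh)}"
                concepts[null] = {b}
                roles.add((ind, r, null))
                frontier.append((null, depth + 1))
    return concepts, roles


if __name__ == "__main__":
    abox = {"alice": {"Person"}}
    tbox = [("Person", "hasParent", "Person")]  # cyclic axiom
    individuals, edges = n_step_model(abox, tbox, n=7)
    print(len(individuals), len(edges))
```

Without the cut-off, the loop in this sketch would never terminate on the cyclic axiom; with n = 7 it stops after creating a chain of seven anonymous individuals.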

SUMA shows high scalability on LUBM+ and UOBM+. The average query processing time of SUMA on LUBM+(1) and UOBM+(1) is 1.99 ms and 6.48 ms, respectively, which is shorter than PAGOdA’s 11.78 ms and 10.40 ms and five orders of magnitude shorter than that of Pellet.

We focus on testing the materialization time of the infinite universal model. To make the test more challenging, we manually add 100 CEQ axioms to the LUBM+ and UOBM+ ontologies; the resulting ontologies are named LUBM++ and UOBM++, as shown in Table 6.

The materialization time of the 15-step universal model of LUBM++(1000) and UOBM++(500) is 276.869 s and 584.600 s, respectively. When \(n = 7\), the materialization time of LUBM++(1000) and UOBM++(500) is 172.308 s and 486.504 s, respectively. SUMA is highly scalable on the infinite universal model.

6.3 The Role Rewriting Algorithm Evaluation

As shown in the previous two experiments, SUMA is complete on all test queries, which shows that the role rewriting algorithm does not compromise the completeness of materialization. Table 9 shows the number of equivalent and inverse roles in the test datasets. Because the YAGO dataset includes neither equivalent roles nor inverse roles, it is not used to evaluate the role rewriting algorithm.

Table 9 The information of the datasets

Materialization Efficiency Evaluation As shown in Fig. 12e, f, on all LUBM and UOBM test datasets, the materialization time of SUMA is shorter than that of SUMA-N. SUMA takes 124 s to materialize the LUBM(1000) dataset, while SUMA-N takes 202 s. SUMA takes 411 s to materialize the UOBM(500) dataset, while SUMA-N takes 515 s.

As shown in Fig. 12g, h, on all LUBM++ and UOBM++ test datasets, the materialization time of SUMA is shorter than that of SUMA-N. SUMA takes 276 s to materialize the 15-step LUBM++(1000) model, while SUMA-N takes 351 s. SUMA takes 584 s to materialize the 15-step UOBM++(500) model, while SUMA-N takes 698 s.

On the DBPedia+ dataset, SUMA materialization takes 18 s, while SUMA-N takes 20 s.

Fig. 12 Experimental results

The experiment verifies that the role rewriting algorithm can improve materialization efficiency without reducing the answers’ quality.

6.4 Memory Optimization Evaluation

Since SUMA performs its computation in memory, this experiment uses the number of triples in the materialization process to measure the memory consumption of the system. Figure 12i, j shows the number of redundant facts eliminated by SUMA. As the size of the dataset increases, the number of redundant facts eliminated by the role rewriting algorithm increases linearly. The role rewriting algorithm reduces the memory consumption on LUBM data by 9.48% on average and on UOBM data by 12.42%. As the number of equivalent roles and inverse roles increases, the effect of the memory optimization becomes more pronounced.
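One way to see why rewriting equivalent and inverse roles can shrink the set of stored triples is the toy sketch below. It illustrates the general idea of mapping every role to a single representative and is not the role rewriting algorithm evaluated here; the function name and the role names (teaches, instructs, taughtBy) are hypothetical.

```python
def rewrite_roles(triples, equivalent, inverse):
    """Rewrite every triple so that it uses one representative role.

    triples: iterable of (subject, role, object)
    equivalent: dict mapping a role to its chosen representative role
    inverse: dict mapping a role to the role it is the inverse of
    Returns the set of rewritten triples.
    """
    rewritten = set()
    for s, r, o in triples:
        if r in inverse:
            # (s, r, o) with r the inverse of p states the same fact as (o, p, s).
            s, r, o = o, inverse[r], s
        r = equivalent.get(r, r)
        rewritten.add((s, r, o))
    return rewritten


if __name__ == "__main__":
    facts = [
        ("a", "teaches", "c"),
        ("c", "taughtBy", "a"),   # inverse of teaches (hypothetical role)
        ("a", "instructs", "c"),  # equivalent to teaches (hypothetical role)
    ]
    compact = rewrite_roles(
        facts,
        equivalent={"instructs": "teaches"},
        inverse={"taughtBy": "teaches"},
    )
    print(len(facts), len(compact))
```

In this example, three input triples that state the same fact collapse into one, so the more equivalent and inverse roles an ontology declares, the more such duplicates a representation of this kind can avoid storing.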

7 Conclusions

In this paper, we have proposed a partial materialization-based approach for ontology-mediated query answering over OWL 2 DL. The core idea of our technique is that, for a rooted conjunctive query or a Boolean conjunctive query with n quantified variables, the answer over the n-step universal model is the same as the answer over the universal model in DL. SUMA significantly reduces offline materialization costs by building efficient indexes for facts and rules and by integrating the role rewriting algorithm. The low-complexity materialization algorithm enables SUMA to support efficient reasoning over large-scale datasets. In future work, we are interested in extending this proposal to support distributed reasoning and in extending our approach to support other normalized ontology models.