# Approximate querying of RDF graphs via path alignment

## Abstract

A query over RDF data is usually expressed in terms of a matching between a graph representing the target and a huge graph representing the source. Unfortunately, graph matching is typically performed in terms of subgraph isomorphism, which makes semantic data querying a hard problem. In this paper we illustrate a novel technique for querying RDF data in which the answers are built by combining paths of the underlying data graph that align with paths specified by the query. The approach is approximate and generates the combinations of the paths that best align with the query. We show that, in this way, the complexity of the overall process is significantly reduced, and we verify experimentally that our framework outperforms other approaches in terms of both efficiency and effectiveness.

### Keywords

Path Graph RDF Approximate matching Alignment

## 1 Introduction

The Web 3.0 aims at turning the Web into a global knowledge base, where resources are identified by means of URIs, semantically described with RDF, and related through RDF statements. This vision is becoming a reality thanks to the spread of Semantic Web technology and the availability of more and more linked data sources. However, the rapid growth of semantic data raises severe data management issues in this context [1, 2]. Among them, a major problem lies in the difficulty for users to find the information they need in such a huge and heterogeneous repository of semantic data.

In this scenario, approaches to approximate query processing are increasingly capturing the attention of researchers [3, 4, 5, 6, 7] since they relax the matching between queries and data, and thus provide effective support to non-expert users, who are usually unaware of the way in which data is organized. Support for approximate query processing is particularly relevant in the context of linked data, in which the data usually do not strictly follow the reference ontology and therefore queries posed against the schema may not retrieve valid answers.

Since semantic data have a natural representation in the form of a graph, this problem has been often addressed in terms of approximate matching between a small graph \(Q\) representing the query and a very large graph \(G\) representing the database. The usual approach to this problem is based on searching all the subgraphs of \(G\) that are isomorphic to \(Q\). Unfortunately, this problem is known to be NP-complete [8] and the problem is even harder if the matching between query and data is approximate. For this reason, the various approaches to approximate query processing on graph databases rely on heuristics, based on similarity or distance metrics, on the use of specific indexing structures to reduce the complexity of the problem [4, 6, 9], and on fixing some threshold on the maximum number of *hops* (i.e., node/edge additions/deletions needed to perfectly match the query graph with the underlying graph database) that are allowed [5]. Moreover, given that the set of answers to a query is potentially very large, a mechanism that aims to efficiently select the “best” \(k\) answers is desirable.

In this framework, we propose a novel technique for querying graph-shaped data in an approximate way that combines a strategy for building possible answers with a ranking method for evaluating the relevance of the results as soon as they are computed. The goal is to generate the best results among the first retrieved answers, thus avoiding the computation of all the candidate answers. We focus in particular on Basic Graph Pattern queries [7], which basically express conjunctive queries on graph data models, over RDF data. RDF is the “de-facto” standard language for the representation of semantic information: it encodes Web data as a labeled directed graph in which the nodes represent resources and values (also called literals), and the links represent semantic relationships between resources. A resource is uniquely identified in the Semantic Web by a URI.

*Example 1*

Let us consider the graph \(G_d\) depicted in Fig. 1, taken from [4]: it represents a simplified portion of GovTrack^{1}, a database that stores events occurring in the US Congress. In RDF graphs, nodes represent RDF classes, literals, or URIs, whereas edges represent RDF properties.

Assume that a user needs to know all amendments sponsored by Carla Bunes to a bill on the subject of Health Care that was originally sponsored by a male person. Queries \(Q_{1}\) and \(Q_{2}\) in Fig. 1 are two possible ways to express this information need. They only differ by the presence of an “optional” node and an “optional” edge. While \(Q_{1}\) has an exact matching over \(G_d\), a perfect matching algorithm would retrieve an empty result for \(Q_{2}\) over \(G_d\). Conversely, in the context of RDF data, it would be desirable to provide an answer also to \(Q_{2}\).

Usually, different paths of the query graph denote different relationships between nodes. For instance, the edges of \(Q_1\) indicate that Male is the gender of someone sponsoring something on the subject Health Care. This simple observation suggests that query answering can proceed as follows: first, the query is decomposed into a set of paths that start from a source and end into a sink, then those paths are matched against the data graph, and finally the data paths that best match the query paths are combined to generate the answer.

Therefore, we tackle the problem of querying RDF graphs by finding the combinations of the paths of the data graph that best *align* with the paths of the query graph. Note that in the example above the result is an exact answer to \(Q_1\), but the same strategy can be adopted to generate approximate answers to queries with a suitable relaxation of the notion of alignment between query paths and data paths. Indeed, by using this technique, the same answer as for \(Q_{1}\) is returned for the query \(Q_{2}\) in Fig. 1, for which there is no exact answer.

The query processing phase first extracts all the paths of the data graph \(G\) that align with the paths of a query graph \(Q\), taking advantage of a special index structure that is built off-line. During the construction, a score function evaluates the answers in terms of *quality* and *conformity*. The former measures how well the retrieved paths align with the paths in the query. The latter measures how much, in \(G\), the combination of the retrieved paths is similar to the combination of the paths in the query. This strategy exhibits, in the worst case, a polynomial time complexity in the number of nodes of the data graph, and our experiments show that the technique scales seamlessly with the size of the input.

In order to test the feasibility of our approach, we have developed a system^{2} for querying RDF data that implements the above described technique. A number of experiments over widely used benchmarks have shown that our technique outperforms other approaches, in terms of both effectiveness and efficiency.

The rest of the paper is organized as follows. In Sect. 2 we introduce some preliminary notions and definitions. In Sect. 3 we illustrate our strategies for graph matching over RDF data, in Sect. 4 we describe the implementation of our approach, and in Sect. 5 we present the experimental results. In Sect. 6 we discuss related work and finally, in Sect. 7, we draw some conclusions and sketch some future work.

## 2 Preliminary issues

This section states the problem we address in this paper and introduces some preliminary notions and terminology. We start with the definition of an answer to a query in our context and then introduce the data structures and the scoring function that are used in our technique to build and rank the answers to queries.

### 2.1 Problem definition

A graph is a 4-tuple \(G = \langle N, E, L_N, L_E \rangle \) where \(N\) is a set of *nodes*, \(E\,\subseteq \,N \times N\) is a set of ordered pairs of nodes, called *edges*, and \(L_N\) and \(L_E\) are injective functions that associate an element of a set of *node labels* \(\Sigma _N\) with each node in \(N\) and an element of a set of *edge labels* \(\Sigma _E\) with each edge in \(E\), respectively.

We focus our attention on (possibly large) RDF databases, which are conceptually conceived as labeled directed graphs in which the nodes represent either resources or values while the edges relate resources to resources and resources to values. We then introduce the following notion. Let \(\mathcal {U}\) be a set of *URIs* and \(\mathcal {L}\) be a set of *literals*.

**Definition 1**

(*Data Graph*) A data graph \(G\) is a graph where \(\Sigma _{N} = \mathcal {U} \cup \mathcal {L}\) and \(\Sigma _{E} = \mathcal {U}\).

Let \(\mathtt{VAR }\) be a set of *variables*, denoted by the prefix “?”. A *query graph* \(Q\) is a graph in which nodes and edges can also be labeled with variables.

**Definition 2**

(*Query Graph*) A query graph \(Q\) is a graph where \(\Sigma _{N} = \mathcal {U} \cup \mathcal {L} \cup \mathtt{VAR }\) and \(\Sigma _{E} = \mathcal {U} \cup \mathtt{VAR }\).
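To make the two definitions concrete, here is a minimal sketch (ours, not the paper's implementation) of a labeled directed graph in which a query graph is simply a graph whose labels may include variables; the node identifiers and the use of labels from Fig. 1 are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    """A labeled directed graph G = <N, E, L_N, L_E> (Definitions 1 and 2).

    node_labels maps each node to its label (L_N); edge_labels maps each
    ordered pair (src, dst) to its label (L_E). In a data graph, labels
    are URIs or literals; in a query graph they may also be variables.
    """
    node_labels: dict = field(default_factory=dict)
    edge_labels: dict = field(default_factory=dict)

    def is_query_graph(self):
        # A query graph is recognizable by labels drawn from VAR (prefix "?").
        labels = list(self.node_labels.values()) + list(self.edge_labels.values())
        return any(str(l).startswith("?") for l in labels)

# A fragment of query Q2 of Example 1 (node identifiers are arbitrary):
q = Graph(
    node_labels={1: "Carla Bunes", 2: "?v1", 3: "?v2", 4: "Health Care"},
    edge_labels={(1, 2): "sponsor", (2, 3): "aTo", (3, 4): "subject"},
)
```

A graph without variable labels would, under the same test, be recognized as a plain data graph.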

The evaluation of a query consists in retrieving the portions of the data graph that match the query graph. This process can be relaxed by assuming that, before the match, the query graph can be slightly transformed, as formalized in the following.

A *substitution* for a query graph \(Q\) is a function that maps the variables in \(Q\) to either URIs or literals. A *transformation*\(\tau \) on a query graph is a sequence of the following basic update operations: node and edge insertion, node and edge deletion, and labeling modification of both nodes and edges.

**Definition 3**

( *Query Answer*) An (approximate) answer to a query graph \(Q\) over a data graph \(G\) is a subgraph \(G'\) of \(G\) for which there exists a substitution \(\phi \) and a transformation \(\tau \) such that \(G'=\tau (\phi (Q))\). If \(\tau \) is the identity function, \(G'\) is an *exact* answer to \(Q\).

In our implementation, the labeling modification operation of the \(\tau \) function relies on standard libraries that test the equality of values by means of traditional full-text search techniques (such as stemming). This allows the matching between labels such as *fishing*, *fished*, and *fish*. Since this aspect is outside the scope of the paper, we will simply assume, hereinafter, that the labeling modification operation provides support for approximate matching between labels, and we will use the term *matching* between values in this sense, without discussing this aspect further.

Intuitively, an answer \(a_1=\tau _1(\phi _1(Q))\) is more relevant than another answer \(a_2=\tau _2(\phi _2(Q))\) if \(\tau _1\) contains a lower number of operations than \(\tau _2\). Moreover, in the context of RDF data, in which nodes represent concepts and edges represent relationships, it is useful to associate a weight of relevance with each basic update operation. For instance, it is reasonable that the modification of a label is less relevant than a node insertion, since the latter increases the semantic distance between concepts. Therefore, let \(\omega \) be a function that associates a *weight of relevance* with each basic operation \(\odot \). We say that the *cost* \(\gamma \) of a transformation \(\tau =\odot _1\circ \ldots \circ \odot _z\) is \(\gamma (\tau )=\sum _{i=1}^{z}\omega (\odot _i)\).

**Definition 4**

(*Relevance of an Answer*) An answer \(a_1=\tau _1(\phi _1(Q))\) is more relevant than another answer \(a_2=\tau _2(\phi _2(Q))\) if \(\gamma (\tau _1)< \gamma (\tau _2)\).

Then, given a data graph \(G\) and a query graph \(Q\), we aim at finding the top-k answers \(a_1,\ldots ,a_k\) of \(Q\) according to their relevance.
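As an illustration of the cost of a transformation and of Definition 4, the following sketch (ours; the operation names are hypothetical, and the numeric weights are those fixed later in the experiments, i.e. \(a=1\), \(b=0.5\), \(c=2\), \(d=1\), with weight 0 for label modifications) compares two answers by the cost of their transformations:

```python
# Hypothetical encoding of the basic update operations and their
# weights of relevance (our naming, not the paper's):
WEIGHTS = {
    "node_del": 1.0,   # a
    "node_ins": 0.5,   # b
    "edge_del": 2.0,   # c
    "edge_ins": 1.0,   # d
    "node_mod": 0.0,   # label modifications are not penalized
    "edge_mod": 0.0,
}

def cost(tau):
    """gamma(tau): sum of the weights of the basic operations in tau."""
    return sum(WEIGHTS[op] for op in tau)

def more_relevant(tau1, tau2):
    """Definition 4: answer a1 beats a2 iff gamma(tau1) < gamma(tau2)."""
    return cost(tau1) < cost(tau2)
```

For instance, an answer obtained by a single label modification is more relevant than one requiring a node insertion, since the former has cost 0.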

### 2.2 Paths and computed answers

We now introduce a number of notions that, in our approach, are used in the construction of the answers to a query. Given a graph \(G\), we call *start* nodes the nodes of \(G\) with no in-going edges, and *end* nodes the nodes of \(G\) with no out-going edges.

Basically, a *path* in a graph \(G\) is a sequence of labels from a start node to an end node of \(G\). In the case of cycles, a path ends, intuitively, just before the repetition of a node label. Moreover, if there is no start node in \(G\), a path starts from nodes whose difference between the number of outgoing edges and the number of the incoming edges is maximal in \(G\). We call these nodes *hubs*.

**Definition 5**

(*Path*) Given a data graph \(G = \langle N, E, L_N, L_E \rangle \), a path is a sequence \(\small p=l_{n_1}-l_{e_1}-l_{n_2}-\cdots -l_{e_{k-1}}-l_{n_k}\) where: (i) \(l_{n_i} = L_N(n_i)\), \(l_{e_i} = L_E(e_i)\), and \(n_i \in N\), \(e_i=(n_i,n_{i+1}) \in E\), (ii) \(n_1\) is either a start node or, if \(G\) has no start nodes, a hub, and (iii) \(n_{k}\) is either an end node or a node such that, for every edge \((n_{k},n_{k+1})\), the label \(L_N(n_{k+1})\) of \(n_{k+1}\) already occurs in \(p\).

In the following, given a path \(\small p=l_{n_1}-\cdots -l_{n_k}\), we will call \(n_1\) and \(n_k\) the *source* and the *sink* of \(p\), respectively. The *length* of a path is the number of nodes occurring in it, while the *position* of a node is its position among the nodes of the path.

Consider the data graph \(G_d\) of Fig. 1 (note the nodes *Health Care* and *Male*, marked in gray). An example of path is \(p_z\), whose source and sink are *Jeff Ryser* and *Health Care*, respectively. \(p_z\) has length \(4\) and the node A1589 has position \(2\). The query \(Q_{1}\) in Fig. 1 has three paths, which are listed in Sect. 3.1.
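A possible reading of Definition 5 as code (our sketch; the helper name and the dictionary-based graph encoding are assumptions) enumerates paths from start nodes, falling back to hubs when no start node exists, and ends a path when every successor would repeat an already-seen node label:

```python
def paths_of(nodes, edges):
    """Enumerate the paths of a graph (a simplified sketch of Definition 5).

    `nodes` maps node id -> label; `edges` maps (src, dst) -> edge label.
    A path starts at a start node (no in-going edges) or, if none exists,
    at a hub (maximal out-degree minus in-degree), and stops at an end
    node or just before a node label would repeat.
    """
    out = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for (s, t) in edges:
        out[s].append(t)
        indeg[t] += 1
    starts = [n for n in nodes if indeg[n] == 0]
    if not starts:  # no start node: fall back to hubs
        best = max(len(out[n]) - indeg[n] for n in nodes)
        starts = [n for n in nodes if len(out[n]) - indeg[n] == best]
    result = []

    def walk(n, seq, seen):
        nxt = [t for t in out[n] if nodes[t] not in seen]
        if not nxt:  # end node, or all successors repeat a label
            result.append("-".join(seq))
            return
        for t in nxt:
            walk(t, seq + [edges[(n, t)], nodes[t]], seen | {nodes[t]})

    for s in starts:
        walk(s, [nodes[s]], {nodes[s]})
    return result
```

On a simple chain the function returns the single source-to-sink path; on a two-node cycle, where no start node exists, both nodes are hubs and each yields a path that stops just before its source label repeats.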

**Definition 6**

(*Alignment*) Given a data graph \(G\) and a query graph \(Q\) an alignment is a substitution \(\phi \) and a transformation \(\tau \) of a path \(p\) of \(Q\) such that \(\tau (\phi (p))\) is a path of \(G\).

We are ready to introduce our notion of *computed* answer. We say that a set \(P\) of paths of a graph \(G\) is a *connected component* of \(G\) if, for each pair of paths \(p_1, p_2 \in P\), there is a sequence of paths \([p_1,\ldots ,p_2]\) in \(P\) in which each element has at least a node in common with the following element in the sequence.

**Definition 7**

(*Computed Answer*) Given a query graph \(Q\), a computed answer of \(Q\) over a data graph \(G\) is a set of alignments of all the paths of \(Q\) that forms a connected component of \(G\).

Note that a computed answer of \(Q\) over a data graph \(G\) is indeed an answer of \(Q\) over \(G\) (Definition 3). For this reason, in the following we will often make no distinction between answer and computed answer when the notion we are referring to is clear from the context.

### 2.3 Scoring function

The function \({\textit{score}}\) is an approximate implementation of the general notion of relevance (Definition 4) that can be computed in linear time in the size of the data. It estimates the relevance of a computed answer \(a_i\) by taking into account two different aspects, *quality* and *conformity*. The former measures how well the retrieved paths align with the paths in the query. The latter measures how much, in \(a_i\), the combination of the retrieved paths is similar to the combination of the paths in the query.

The second aspect the *score* function considers is the conformity between the combination of the paths in the computed answer and the combination of the paths in the query.

The *score* function is *coherent* with the notion of relevance of an answer, that is, for each pair of answers \(a_1\) and \(a_2\) for a query \(Q\) such that \(a_1\) is more relevant than \(a_2\), we have that \({\textit{score}}(a_1,Q)<{\textit{score}}(a_2,Q)\).

**Theorem 1**

Given a query graph \(Q\) and a data graph \(G\), for each pair of computed answers \(a_{i}\) and \(a_{j}\) for \(Q\) over \(G\), if \(a_{i}\) is more relevant than \(a_{j}\) then we have \({\textit{score}}(a_i,Q)<{\textit{score}}(a_j,Q)\).

*Proof*

Let \(\odot _{N}^{\curvearrowright }\), \(\odot _{N}^{-}\) and \(\odot _{N}^{\times }\) be basic update operations of node insertion, node deletion, and labeling modification, respectively. Analogously, \(\odot _{E}^{-}, \odot _{E}^{\curvearrowright }\) and \(\odot _{E}^{\times }\) are the respective operations on edges. On these operations, we fix the function \(\omega \): (i) \(\omega (\odot _{N}^{-}) = a\), (ii) \(\omega (\odot _{N}^{\curvearrowright }) = b\), (iii) \(\omega (\odot _{E}^{-}) = c\) and (iv) \(\omega (\odot _{E}^{\curvearrowright }) = d\). We consider, as in other works [10], \(\omega (\odot _{N}^{\times }) = 0\) and \(\omega (\odot _{E}^{\times }) = 0\) because we do not want to penalize the case where the answer gathers more labels than \(Q\).

Now, let us count the number of basic update operations in a transformation \(\tau _i\) for an answer \(a_i\). In this case: \(n_{N}^{-}\) and \(n_{E}^{-}\) are, respectively, the number of nodes and edges of \(a_i\) that are inserted in \(Q\), and \(n_{N}^{\curvearrowright }\) and \(n_{E}^{\curvearrowright }\) are, respectively, the number of nodes and edges updated in \(Q\) by \(\tau _i\).

### 2.4 Computation of alignments

**aTo-B1432** and a substitution \(\phi \) on the variables. In the former case we have \(\lambda (p,q_{1}) = (0 + 0) + (0 + 0) = 0\), since \(n_{N}^{-} = n_{N}^{\curvearrowright } = n_{E}^{-} = n_{E}^{\curvearrowright } = 0\). In the latter case \(\lambda (p,q_{2}) = (0 + b) + (0 + d)\), since \(n_{N}^{-} = n_{E}^{-} = 0\) and \(n_{N}^{\curvearrowright } = n_{E}^{\curvearrowright } = 1\). If we set \(b = 0.5\) and \(d = 1\), we have \(\lambda (p,q_{2}) = 1.5\) (i.e. \(p\) has the best alignment with \(q_{1}\)). In the same way, given

It is straightforward to demonstrate that the time complexity of the alignment is \(O(I)\), where \(I = |p| + |q|\) is the total number of nodes and edges of the paths \(p\) and \(q\).
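From the worked example above, the alignment cost \(\lambda(p,q)\) can be reconstructed as a weighted count of the basic operations (our sketch; the parameter names mirror the counts \(n_{N}^{-}\), \(n_{N}^{\curvearrowright }\), \(n_{E}^{-}\), \(n_{E}^{\curvearrowright }\), and the default weights are those used in the experiments):

```python
def align_score(nN_del, nN_ins, nE_del, nE_ins, a=1.0, b=0.5, c=2.0, d=1.0):
    """lambda(p, q): weighted count of the basic operations needed to
    align a query path q with a data path p. The four counts play the
    role of n_N^-, n_N^(curved arrow), n_E^-, n_E^(curved arrow) of
    Sect. 2.3; label modifications have weight 0 and do not appear."""
    return (nN_del * a + nN_ins * b) + (nE_del * c + nE_ins * d)
```

With the counts of the example above, the call reproduces \(\lambda (p,q_{1}) = 0\) and \(\lambda (p,q_{2}) = 1.5\).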

## 3 Path-based query processing

### 3.1 Overview

Let \(G\) be a data graph and \(Q\) a query graph on it. The approach is composed of two main phases: the *indexing* (done off-line), in which all the paths of \(G\) are indexed, and the *query processing* (done on-the-fly), where the query evaluation takes place. The first task will be described in more detail in Sect. 5.

*Preprocessing* Given a query graph \(Q\), in this step the set PQ of all paths of \(Q\) is computed on the fly by traversing \(Q\). We exploit an optimized implementation of the breadth-first search (BFS) traversal. For instance, referring to Example 1, PQ consists of the following paths:

$$\begin{aligned} \begin{array}{l} q_1: \mathsf{Carla Bunes-sponsor-?v1-aTo-?v2-subject-Health Care } \\ q_2: \mathsf{?v3-sponsor-?v2-subject-Health Care } \\ q_3: \mathsf{?v3-gender-Male } \end{array} \end{aligned}$$

The elements of PQ are organized in the so-called *intersection query graph* (\(IG\)). Nodes of \(IG\) are paths of \(Q\), while an edge \((q_{i}, q_{j})\) means that \(q_{i}\) and \(q_{j}\) have nodes in common. The intersection query graph built from \(q_{1}\), \(q_{2}\) and \(q_{3}\) is depicted in Fig. 3. This data structure keeps track of the fact that \(q_1\) and \(q_2\) have nodes in common (?v2 and Health Care) and that \(q_2\) and \(q_3\) also have nodes in common (only ?v3).

*Clustering* In the second step we build a cluster for each element \(q\) of PQ and group in it all the paths \(p\) of \(G\) having a sink that matches the sink of \(q\). If a variable occurs in the sink of \(q\), we retrieve the last value \(v\) occurring in \(q\) and group in the cluster all the paths \(p\) of \(G\) containing a label matching \(v\). Before inserting a path \(p\) in the cluster for \(q\), we evaluate the alignment needed to obtain \(p\) from \(q\). This allows us to compute the score of \(p\), i.e. \(\lambda (p,q)\). The paths in a cluster are ordered according to their score, with the lower coming first. Note that the same path \(p\) can be inserted in different clusters, possibly with a different score. As an example, given the data graph \(G_{d}\) and the query graph \(Q_{1}\) of Fig. 1, we obtain the clusters shown in Fig. 4. In this case clusters \(cl_{1}\), \(cl_{2}\) and \(cl_{3}\) correspond to the paths \(q_{1}\), \(q_{2}\) and \(q_{3}\) of PQ, respectively; note the scores at the right side of each path and, in particular, the path \(p_{1}\) occurring in both \(cl_{1}\) and \(cl_{2}\) with different scores, i.e. 7 in \(cl_{1}\) and 5 in \(cl_{2}\).

*Search* The last step aims at generating the most relevant answers by combining the paths in the clusters built in the previous step. This is done by picking and combining the paths with the lowest score from each cluster. The intersection query graph allows us to verify efficiently whether they form an answer. As an example, given the clusters in Fig. 4, the first answer is obtained by combining the paths \(p_{1}\), \(p_{10}\) and \(p_{20}\), which are the elements with the best score in each corresponding cluster and provide the best alignment with the paths of PQ associated with the clusters.
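The intersection query graph of the preprocessing step can be sketched as follows (our illustration; paths are encoded as lists alternating node and edge labels, so node labels sit at even positions):

```python
def intersection_query_graph(query_paths):
    """Build the intersection query graph IG: nodes are (indexes of) the
    query paths; an edge (i, j) is present when paths i and j share at
    least one node label, and carries the shared labels."""
    node_labels = [set(p[0::2]) for p in query_paths]  # node labels at even positions
    ig = {}
    for i in range(len(query_paths)):
        for j in range(i + 1, len(query_paths)):
            shared = node_labels[i] & node_labels[j]
            if shared:
                ig[(i, j)] = shared
    return ig

# The three paths of Q1 from the preprocessing step:
q1 = ["Carla Bunes", "sponsor", "?v1", "aTo", "?v2", "subject", "Health Care"]
q2 = ["?v3", "sponsor", "?v2", "subject", "Health Care"]
q3 = ["?v3", "gender", "Male"]
ig = intersection_query_graph([q1, q2, q3])
```

As in Fig. 3, the resulting \(IG\) links \(q_1\) with \(q_2\) (shared nodes ?v2 and Health Care) and \(q_2\) with \(q_3\) (shared node ?v3), leaving \(q_1\) and \(q_3\) unlinked.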

The trickiest task of the whole process occurs in the third step above, where we try to generate the top-k answers while minimizing the number of combinations between paths. This is done by organizing the combinations of paths in a forest where nodes represent the retrieved paths, while an edge between two paths means that they have nodes in common. The label of each edge \((p_{i}, p_{j})\) is \(\langle (q_{i}, q_{j}):[\psi (q_{i}, q_{j}, p_{i}, p_{j})] \rangle \), where \(q_{i}\) and \(q_{j}\) are the paths corresponding to the clusters in which \(p_{i}\) and \(p_{j}\) were included, respectively.

The rest of the section describes in more detail the clustering and search steps of the approach.

### 3.2 Clustering

The set \(\mathcal {CL}\) of clusters is implemented as a map where the key is a path \(q\) from PQ and the value is a cluster with all the paths \(p\) ending in the sink of \(q\). Each cluster is implemented as a priority queue of paths, where the priority is based on the score associated with each path (in ascending order). We use an implementation of priority queues that guarantees constant time complexity for insertion/deletion operations. For each \(q \in \mathsf{PQ }\) [line 1], we extract the sink \(sk\) of \(q\) and retrieve all the paths \(p\) of \(G\) matching \(sk\). This task is performed by the function getPaths [line 3]. Once we have obtained the set \(\mathsf{PD }\), we evaluate the score of each \(p \in \mathsf{PD }\) with respect to \(q\) and insert \(p\) in the cluster cl [lines 4–5]. Finally, we insert cl in \(\mathcal {CL}\) [line 8].
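The clustering step can be sketched with one priority queue per query path (our illustration; the sink-matching test and the toy score function passed by the caller are simplifying assumptions, standing in for the alignment cost \(\lambda\) and for getPaths):

```python
import heapq

def build_clusters(query_paths, data_paths, score):
    """Clustering (Algorithm 1, sketched): one priority queue per query
    path q, holding the data paths whose sink matches the sink of q,
    ordered by their score (lower, i.e. better, first)."""
    clusters = {}
    for qi, q in enumerate(query_paths):
        cl = []
        for p in data_paths:
            if p[-1] == q[-1]:  # sink of p matches sink of q
                heapq.heappush(cl, (score(p, q), p))
        clusters[qi] = cl
    return clusters
```

Dequeuing a cluster then yields its paths in ascending score order, which is exactly what the search step needs.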

### 3.3 Search

As long as we have not generated \(k\) answers and the set of clusters is not empty [line 1], we build a forest \({{\mathcal {F}}}\) [line 2] from the most promising paths in \(\mathcal {CL}\) and provide the top-k answers by visiting \({{\mathcal {F}}}\) [lines 4–8]. As we said in Sect. 3.1, nodes of \({{\mathcal {F}}}\) denote paths in \(\mathcal {CL}\), while edges \((p_{i}, p_{j})\) represent the fact that \(p_{i}\) and \(p_{j}\) have nodes in common. The trees of \({{\mathcal {F}}}\) are then returned as answers to the query. In the following we describe in more detail the building and the visiting of \({{\mathcal {F}}}\).

*Building* Given the set of clusters \(\mathcal {CL}\) and the intersection query graph \(IG\), first of all we have a *building* phase generating a *forest* of paths, as shown in Algorithm 3. The most promising paths extracted from the clusters become the *roots* of the forest \({\mathcal {F}}\). We implement \({\mathcal {F}}\) as a map where the keys are paths, i.e. the roots, and the values are the trees of \({\mathcal {F}}\). Each tree T of \({\mathcal {F}}\) is modeled as a graph \(\langle N, E \rangle \) where the nodes in \(N\) are paths and each edge \(l(n_{i},n_{j}) \in E\) is described in terms of \(\langle n_{i}, n_{j}, l\rangle \), where \(l\) is the label of the edge. For each \(p \in \mathsf{PD }\), we build a tree rooted in \(p\) by using the procedure treeBuild [line 9]. The procedure is described in detail in Algorithm 4.

The procedure treeBuild starts the navigation of \(IG\) from the input query path \(q\). For each edge (\(q,q'\)), where \(q'\) is not yet visited, we dequeue the top paths \(\mathsf{PD }\) from the cluster cl corresponding to \(q'\) [lines 3–4]. Then for each path \(p \in \mathsf{PD }\) having nodes in common with r (denoted by \(p \leftrightarrow \mathsf{r }\)), we build the edge \((\mathsf{r }, p)\) and the corresponding label \(\langle \mathsf{r }, p, (q,q'):[\psi (q, q', \mathsf{r }, p)]\rangle \), [line 7], and we insert it in the set \(E\) of T [line 8]. The pair \(\langle p,q'\rangle \) is added to the set \(\mathsf L\) [line 9]. Finally, we include \(q'\) in the set V of visited query paths [line 10] and we recursively call the procedure for each \(p\) [line 12].
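A compact sketch of treeBuild (ours; it simplifies Algorithm 4 by representing each tree as a nested (path, children) pair, by keying the clusters and the intersection graph on path indexes, and by testing node sharing as plain label intersection):

```python
def tree_build(root, q, ig, clusters, visited=None):
    """treeBuild (Algorithm 4, sketched): starting from a root data path
    and its query path q, follow the edges (q, q') of the intersection
    query graph and attach, as children, the clustered paths for q' that
    share a node with the current path. Returns (path, children)."""
    visited = visited if visited is not None else {q}
    children = []
    for (qa, qb) in ig:                       # edges of the intersection graph
        if q not in (qa, qb):
            continue
        q2 = qb if qa == q else qa
        if q2 in visited:                     # each query path is expanded once
            continue
        visited.add(q2)
        for p in clusters.get(q2, []):
            if set(root) & set(p):            # p has nodes in common with root
                children.append(tree_build(p, q2, ig, clusters, visited))
    return (root, children)
```

Applied to two clustered paths linked by one \(IG\) edge, the sketch attaches the second path as a child of the first.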

*Visiting* The last step consists in a visit of the forest \({\mathcal {F}}\). Algorithm 2 starts the visit from the root with the best score. The visit implements a *depth-first search* (DFS) traversal, as shown in detail in Algorithm 5.

As in the building step, we exploit the intersection query graph to explore \({\mathcal {F}}\). Starting from a root \(p\) of \({\mathcal {F}}\) to which the path \(q\) is associated (i.e. \(p\) was in the cluster corresponding to \(q\)), for each \((q,q')\) in \(IG\) we select the successor \(p'\) of \(p\) to which \(q'\) is associated. In particular, since there can be multiple successors \(p'\) to which \(q'\) is associated, we select the most conforming one by means of getPathMaxConformity [line 5]. We then include \(p'\) in the answer a and recursively call the algorithm [line 7]. If \(p\) has no successors, then we return \(p\).

Referring to Fig. 6, we start from roots = \(\{p_{10}, p_{7}, p_{9}, p_{8}\}\) (i.e. in order of priority). Then the first answer \(a_{1}\) dequeues \(p_{10}\). From \(p_{10}\), we add \(p_{20}\) and \(p_{1}\) to \(a_{1}\), that are the most important paths to which \(q_{3}\) and \(q_{1}\) are associated. Finally, we have \(a_{1} = \{p_{10}, p_{1}, p_{20}\}\). Similarly we generate, in order, \(a_{2} = \{p_{7}, p_{1}, p_{17}\}, a_{3} = \{p_{9}, p_{1}\}\) and \(a_{4} = \{p_{8}, p_{1}\}\).
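The visiting step can be sketched as a plain DFS over such a tree (our simplification: we assume the most conforming successor per query path has already been selected during building, so the visit just collects the paths of a tree into one answer):

```python
def visit(tree):
    """Visiting (Algorithm 5, sketched): a depth-first traversal that
    collects one answer from a tree of the forest, i.e. the root path
    followed by the paths of all its (pre-selected) subtrees."""
    path, children = tree
    answer = [path]
    for child in children:
        answer.extend(visit(child))
    return answer
```

On a root with two children, for example, the visit returns the three paths of the tree as a single answer, mirroring how \(a_{1} = \{p_{10}, p_{1}, p_{20}\}\) is assembled above.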

*Overall complexity* Table 1 summarizes all the complexities discussed. The overall process consists of two main phases: clustering (Algorithm 1) and search (Algorithm 2). In turn, the search algorithm consists of two main parts: building (Algorithm 3) and visiting (Algorithm 5). As we said above, search is the core of the overall process and iterates at most \(k\) times, where \(k\) is the number of answers to return. In each iteration, there is a call to building and, at most, \(I\) calls to visiting. Therefore \( O(\mathsf{search }) \in k \times O(\mathsf{building }) + k \times I \times O(\mathsf{visiting }) \in O(k \times I^{h})\), where \(I\) is the number of paths retrieved and \(h\) is the number of paths in \(Q\). It turns out that the overall complexity is bounded by the complexity of the search phase. Note that the complexity is exponential in the size of the *query*, which is usually very limited with respect to the size of the data. On the other hand, the maximum number of paths in a data graph \(G\) (see Definition 5) is proportional to \(n^2\), where \(n\) is the number of nodes of \(G\): in the worst case, we have \(n/2\) sources and \(n/2\) sinks with edges between each source and each sink, and thus \(n^2/4\) paths. It follows that our technique exhibits a polynomial time complexity in the size of the *data*. Note that, as soon as \(I\) tends to \(n^2\), the depth of the tree computed in the build phase tends to 1, because the graph becomes strongly connected. In this case, the complexity of search reduces to \(O(k\times I)\). We also point out that, in Sect. 5, we show that our technique improves on other approaches in terms of efficiency over real-world data sets and scales seamlessly with the size of the input.

Table 1: Overall complexity

| Clustering | Building | Visiting | Search | Overall |
|---|---|---|---|---|
| \(\vert Q\vert \times O(I)\) | \(O(I^{h})\) | \(O(h^{h-1} \times I)\) | \(O(k \times I^{h})\) | \(O(k \times I^{h})\) |

## 4 Implementation

We have implemented our approach in Sama^{3}, a Java system with a Web front end.

In Sama, the data graph is represented as a hypergraph \(\langle X, E \rangle \), in which the edges, called *hyperedges*, are sets of vertices. In other words, \(E\) is a subset of the power set of \(X\). This representation allows us to define indexes on both vertices and hyperedges: \(X = \{x_{m} \mid m \in M\}\) and \(E = \{e_{f} \mid f \in F, e_{f} \subseteq X\}\), where each vertex \(x_{m}\) and hyperedge \(e_{f}\) are indexed by an index \(m \in M\) and \(f \in F\), respectively. Figure 7 shows an example of reference.

Physically, HGDB implements several HGHandle objects to wrap and index the nodes and edges of \(G\); each subset of nodes (i.e. hyperedge) is implemented as a HGEdgeLink holding several *target* links to the HGHandle of the contained nodes (i.e. a hyperedge is implemented as a list of cursors to the contained nodes). In our framework, each path is modeled as a HGEdgeLink in HGDB. The matching is supported by standard IR engines (cf. Lucene Domain index, LDi^{4}) embedded into HGDB. In particular, we define a LDi index on the labels of nodes and edges. In this way, given a label, HGDB retrieves all paths containing data elements matching the label in a very efficient way (i.e. exploiting the cursors). Further, semantically similar entries such as synonyms, hyponyms and hypernyms are extracted from WordNet [12], supported by LDi.
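The hypergraph organization can be sketched as follows (our illustration; the class and method names are ours, not HGDB's API, and the label-to-path index is reduced to a linear scan):

```python
class HyperGraph:
    """A hypergraph <X, E>: X is a set of indexed vertices and each
    hyperedge e_f is a subset of X, so E is a subset of the power set
    of X. In our sketch, a path of the data graph is stored as one
    hyperedge linking all the vertices it traverses (mirroring, loosely,
    a HGEdgeLink holding cursors to its nodes)."""

    def __init__(self):
        self.vertices = {}    # m -> vertex x_m
        self.hyperedges = {}  # f -> frozenset of vertex indexes

    def add_vertex(self, m, x):
        self.vertices[m] = x

    def add_hyperedge(self, f, members):
        assert set(members) <= set(self.vertices)  # e_f must be a subset of X
        self.hyperedges[f] = frozenset(members)

    def edges_containing(self, m):
        """All paths (hyperedges) containing vertex m: the lookup that
        the label index answers efficiently in the real system."""
        return [f for f, e in self.hyperedges.items() if m in e]
```

Two paths sharing a vertex are then both retrievable from that vertex, which is the access pattern the clustering and search steps rely on.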

## 5 Experimental results

In our experiments we used several widely-accepted benchmarks for graph matching evaluation. We have compared Sama with three representative graph matching systems: Sapper [6], Bounded [5] and Dogma [4]. Experiments were conducted on a dual-core 2.66 GHz Intel Xeon, running Linux RedHat, with 4 GB of memory and a 2-disk 1 TByte striped RAID array.

*Indexing* In our experiments we consider real RDF datasets, such as PBlog^{5}, GovTrack, KEGG, IMDB [13] and DBLP, and synthetic datasets, such as Berlin [14], LUBM [15] and UOBM [16]. Table 2 provides import information for each dataset: number of triples, number of nodes (\(|HV|\) column), number of generated hyperedges (\(|HE|\) column), time to create the index on HGDB (t column) and memory consumption on disk. In our case, building the index takes hours for large RDF data graphs, due to the demanding traversal of the complete graph, and requires GBs of disk space to store data and metadata. However, our framework benefits from the high performance of HGDB in retrieving data elements, as shown in Table 3. The table illustrates the average response times to retrieve, given a label, a path \(p\) and all data elements (nodes and edges) associated with \(p\). We performed *cold-cache* experiments (i.e. by dropping all file-system caches before restarting the various systems and running the queries) and *warm-cache* experiments (i.e. without dropping the caches).

Table 2: HyperGraphDB indexing

| DG | #Triples | \(\vert HV \vert\) | \(\vert HE \vert\) | t | Space |
|---|---|---|---|---|---|
| PBlog | 50 K | 1.5 K | 96 K | 1 sec | 56 MB |
| GOV | 1 M | 280 K | 330 K | 4 min | 340 MB |
| KEGG | 1 M | 300 K | 606 K | 7 min | 700 MB |
| Berlin | 1 M | 320 K | 700 K | 10 min | 910 MB |
| IMDB | 6 M | 900 K | 3 M | 47 min | 1.2 GB |
| LUBM | 12 M | 1 M | 15 M | 102 min | 12.9 GB |
| UOBM | 12 M | 1 M | 15 M | 102 min | 12.9 GB |
| DBLP | 26 M | 4 M | 17 M | 441 min | 23.6 GB |

Table 3: Average time to retrieve a path

| DG | Cold (msec) | Warm (msec) |
|---|---|---|
| IMDB | 0.01 | 0.005 |
| LUBM | 0.04 | 0.008 |
| DBLP | 0.06 | 0.009 |

On top of this index organization, to avoid recompiling the entire index on HGDB, we also implemented several maintenance procedures: insertion, deletion and update of vertices or edges in the data graph \(G\). Such operations are documented in [17].

Table 4: Index maintenance performance

| | Insertion (ms) | Deletion (ms) | Update (ms) |
|---|---|---|---|
| Vertex | 4.6 | 4.3 | 5.7 |
| Edge | 11.4 | 22.1 | 6.3 |

*Query execution* In this experiment, for each indexed dataset we formulated 12 SPARQL queries of different complexities (i.e. number of nodes, edges and variables). In the following we focus on the largest datasets: in particular, we discuss in depth the results over the LUBM dataset, which are the most representative; DBLP and IMDB exhibit very similar behavior. Hence, we consider 12 queries from the LUBM benchmark that provide results without involving reasoning.^{6}

We ran each query ten times and measured the average response time, in ms and on a logarithmic scale. Precisely, the total time of each query is the time for computing the top-10 answers, including *preprocessing*, *execution* and *traversal*. We performed both cold-cache and warm-cache experiments. To compare with the other systems, we reformulated the 12 queries using the input format of each competitor. In Sama we set the coefficients of the scoring function as follows: \(a = 1\), \(b = 0.5\), \(c = 2\) and \(d = 1\). We then show the behavior of all systems in terms of number of triples and query complexity. The query run-times are shown in Fig. 8.
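The measurement protocol (ten runs per query, averaged) can be sketched as a small timing harness. This is a minimal sketch of the methodology, not the paper's benchmarking code; `run_query` is a hypothetical stand-in for the engine under test:

```python
# Sketch of the averaging protocol: execute a query ten times and take
# the mean wall-clock time in milliseconds.
import time
import statistics

def average_response_time(run_query, query, runs=10):
    """Mean wall-clock time (ms) of run_query(query) over `runs` executions."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

# Example with a dummy engine standing in for the real system:
avg_ms = average_response_time(lambda q: sum(range(1000)), "Q1")
```

For cold-cache runs, each iteration would additionally be preceded by dropping the file-system caches and restarting the system, as described above.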

*Effectiveness* The last experiment evaluates the effectiveness of Sama and of the other competitors. The first measure we used is the reciprocal rank (RR). For a query, RR is the ratio between 1 and the rank at which the first correct answer is returned, or 0 if no correct answer is returned. On every dataset, for all 12 queries we obtained RR = 1; in this case monotonicity is never violated. To compare with the other systems we inspected the matches found in terms of the answers returned. Figure 12 shows the effectiveness of all systems on LUBM, where we ran the queries without imposing the number k of answers: in this setting the competitors introduce *noise* (i.e. uninteresting approximate results) at high values of recall.
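The reciprocal rank defined above is straightforward to compute; the following sketch makes the definition concrete:

```python
# Reciprocal rank (RR): 1 divided by the (1-based) rank of the first
# correct answer in the returned list, or 0 if no returned answer is correct.
def reciprocal_rank(returned, correct):
    """returned: ranked list of answers; correct: set of correct answers."""
    for rank, answer in enumerate(returned, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0

rr = reciprocal_rank(["a3", "a1", "a7"], {"a1", "a9"})  # first hit at rank 2, so RR = 0.5
```

RR = 1 for all queries thus means the first returned answer was always correct.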

## 6 Related work

Most of the mentioned works focus on medical, chemical and protein networks and are usually not efficient over semantic and social data [10]. Therefore, specialized metrics were proposed [10, 30]: GMO [30] introduces a structural metric based on a bipartite graph, while NESS [10] proposes a measure based on both topological and content information in the neighborhood of a graph node. All these approaches differ substantially from our method: we tackle the problem with a technique that takes into account the structural constraints on how different relations between nodes have to be correlated, and that relies on the tractable problem of alignment between paths.

## 7 Conclusion and future work

In this paper we have presented a novel approach to approximate querying of large RDF datasets. The approach is based on a strategy for building the answers of a query by selecting and combining paths of the underlying data graph that best align with the paths in the query. A ranking function is used during query answering to evaluate the relevance of the results as soon as they are computed. In the worst case our technique exhibits a polynomial computational cost with respect to the size of the input, and experimental results show that it behaves very well with respect to other approaches in terms of both efficiency and effectiveness. This work opens several directions for further research. From a conceptual point of view, we intend to improve the construction of answers and the on-line computation of the scoring function. From a practical point of view, we intend to implement the approach in a Grid environment (for instance using Hadoop/HBase) and to develop optimization techniques to speed up the creation and update of the index, as well as compression mechanisms to reduce the overhead of its construction and maintenance.

## Footnotes

- 1.
- 2.
A prototype application is available at https://www.dropbox.com/sh/d5u1u24qnyqg18f/7Oefq8-qVa.

- 3.
A prototype application is available at https://www.dropbox.com/sh/d5u1u24qnyqg18f/7Oefq8-qVa.

- 4.
- 5.
- 6.
At https://www.dropbox.com/sh/d5u1u24qnyqg18f/7Oefq8-qVa you can find the complete set of queries.

### References

- 1. De Virgilio, R., Giunchiglia, F., Tanca, L. (eds.): Semantic Web Information Management—A Model-Based Perspective. Springer, Berlin (2010)
- 2. De Virgilio, R., Guerra, F., Velegrakis, Y. (eds.): Semantic Search Over the Web. Springer, Berlin, Heidelberg (2012)
- 3. De Virgilio, R., Orsi, G., Tanca, L., Torlone, R.: Nyaya: a system supporting the uniform management of large sets of semantic data. In: ICDE, pp. 1309–1312 (2012)
- 4. Bröcheler, M., Pugliese, A., Subrahmanian, V.S.: DOGMA: a disk-oriented graph matching algorithm for RDF databases. In: ISWC, pp. 97–113 (2009)
- 5. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomial time. Proc. VLDB Endow. **3**(1), 264–275 (2010)
- 6. Zhang, S., Yang, J., Jin, W.: SAPPER: subgraph indexing and approximate matching in large graphs. Proc. VLDB Endow. **3**(1), 1185–1194 (2010)
- 7. Wood, P.T.: Query languages for graph databases. SIGMOD Rec. **41**(1), 50–60 (2012)
- 8. Gallagher, B.: Matching structure and semantics: a survey on graph-based pattern matching. In: Artificial Intelligence, pp. 45–53 (2006)
- 9. Zhang, S., Li, S., Yang, J.: GADDI: distance index based subgraph matching in biological networks. In: EDBT, pp. 192–203 (2009)
- 10. Khan, A., Li, N., Yan, X., Guan, Z., Chakraborty, S., Tao, S.: Neighborhood based fast graph search in large networks. In: SIGMOD, pp. 901–912 (2011)
- 11. Iordanov, B.: HyperGraphDB: a generalized graph database. In: WAIM Workshops, pp. 25–36 (2010)
- 12. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press, Cambridge (1998)
- 13. Hassanzadeh, O., Consens, M.P.: Linked movie data base (triplification challenge report). In: I-SEMANTICS, pp. 194–196 (2008)
- 14. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. Int. J. Semant. Web Inf. Syst. **5**(2), 1–24 (2009)
- 15. Guo, Y., Pan, Z., Heflin, J.: LUBM: a benchmark for OWL knowledge base systems. J. Web Semant. **3**(2–3), 158–182 (2005)
- 16. Ma, L., Yang, Y., Qiu, Z., Xie, G.T., Pan, Y., Liu, S.: Towards a complete OWL ontology benchmark. In: ESWC, pp. 125–139 (2006)
- 17. Cappellari, P., De Virgilio, R., Maccioni, A., Roantree, M.: A path-oriented RDF index for keyword search query processing. In: DEXA, pp. 366–380 (2011)
- 18. Zou, L., Chen, L., Özsu, M.T.: Distance-join: pattern match query in a large graph database. Proc. VLDB Endow. **2**(1), 886–897 (2009)
- 19. Fan, W., Bohannon, P.: Information preserving XML schema embedding. ACM Trans. Database Syst. **33**(1) (2008)
- 20. Tran, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data. In: ICDE, pp. 405–416 (2009)
- 21. Neumann, T., Weikum, G.: x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. Proc. VLDB Endow. **3**(1), 256–263 (2010)
- 22. Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: SIGMOD, pp. 335–346 (2004)
- 23. Zhang, S., Hu, M., Yang, J.: TreePi: a novel graph indexing method. In: ICDE, pp. 966–975 (2007)
- 24. Cheng, J., Ke, Y., Ng, W., Lu, A.: FG-index: towards verification-free query processing on graph databases. In: SIGMOD, pp. 857–872 (2007)
- 25. Tian, Y., Patel, J.M.: TALE: a tool for approximate large graph matching. In: ICDE, pp. 963–972 (2008)
- 26. Zeng, Z., Tung, A.K.H., Wang, J., Feng, J., Zhou, L.: Comparing stars: on approximating graph edit distance. Proc. VLDB Endow. **2**(1), 25–36 (2009)
- 27. Jin, R., Xiang, Y., Ruan, N., Fuhry, D.: 3-HOP: a high-compression indexing scheme for reachability query. In: SIGMOD, pp. 813–826 (2009)
- 28. Poulovassilis, A., Wood, P.T.: Combining approximation and relaxation in semantic web path queries. In: ISWC, pp. 631–646 (2010)
- 29. Chan, E.P.F., Lim, H.: Optimization and evaluation of shortest path queries. VLDB J. **16**(3), 343–369 (2007)
- 30. Hu, W., Jian, N., Qu, Y., Wang, Y.: GMO: a graph matching for ontologies. In: Integrating Ontologies (2005)