Processing SPARQL queries over distributed RDF graphs
DOI: 10.1007/s00778-015-0415-0
- Cite this article as:
- Peng, P., Zou, L., Özsu, M.T. et al. The VLDB Journal (2016) 25: 243. doi:10.1007/s00778-015-0415-0
Abstract
We propose techniques for processing SPARQL queries over a large RDF graph in a distributed environment. We adopt a “partial evaluation and assembly” framework. Answering a SPARQL query Q is equivalent to finding subgraph matches of the query graph Q over RDF graph G. Based on properties of subgraph matching over a distributed graph, we introduce local partial matches as partial answers in each fragment of RDF graph G. For assembly, we propose two methods: centralized and distributed assembly. We analyze our algorithms both theoretically and experimentally. Extensive experiments over both real and benchmark RDF repositories of billions of triples confirm that our method is superior to the state-of-the-art methods in both performance and scalability.
Keywords
RDF SPARQL RDF graph Distributed queries
1 Introduction
The semantic Web data model, called the “Resource Description Framework,” or RDF, represents data as a collection of triples of the form \(\langle \)subject, property, object\(\rangle \). A triple can be naturally seen as a pair of entities connected by a named relationship or an entity associated with a named attribute value. Hence, an RDF dataset can be represented as a graph where subjects and objects are vertices, and triples are edges with property names as edge labels. With the increasing amount of RDF data published on the Web, system performance and scalability issues have become increasingly pressing. For example, the Linking Open Data (LOD) project builds an RDF data cloud by linking more than 3000 datasets, which currently have more than 84 billion triples^{1}. The recent work [40] shows that the number of data sources has doubled within 3 years (2011–2014). Obviously, the computational and storage requirements coupled with rapidly growing datasets have stressed the limits of single-machine processing.
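As a concrete illustration of this data model, the following Python sketch builds the vertex set and labeled edge multiset from a list of triples. The entity names are hypothetical (loosely modeled on the example in Fig. 1), not taken from any real dataset.

```python
# Hypothetical sketch: entity names and properties are invented for illustration.
def build_rdf_graph(triples):
    """Represent <subject, property, object> triples as a graph:
    subjects and objects become vertices; each triple becomes a
    directed edge labeled with its property."""
    vertices = set()
    edges = []  # (subject, object, property); a multiset, hence a list
    for s, p, o in triples:
        vertices.update((s, o))
        edges.append((s, o, p))
    return vertices, edges

triples = [
    ("s2:act1", "isMarriedTo", "s1:dir1"),   # a crossing link, as in Fig. 1
    ("s1:dir1", "directed", "s1:movie1"),    # invented triple
]
V, E = build_rdf_graph(triples)
```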
There have been a number of recent efforts in distributed evaluation of SPARQL queries over large RDF datasets [20]. We broadly classify these solutions into three categories: cloud-based, partition-based and federated approaches. These are discussed in detail in Sect. 2; the highlights are as follows.
Cloud-based approaches (e.g., [23, 27, 33, 34, 37, 48, 49]) maintain a large RDF graph using existing cloud computing platforms, such as Hadoop (http://hadoop.apache.org) or Cassandra (http://cassandra.apache.org), and employ triple pattern-based join processing most commonly using MapReduce.
Partition-based approaches [15, 18, 21, 22, 28, 29] divide the RDF graph G into a set of subgraphs (fragments) \(\{F_{i}\}\) and decompose the SPARQL query Q into subqueries \(\{Q_{i}\}\). These subqueries are then executed over the partitioned data using techniques similar to relational distributed databases.
Federated SPARQL processing systems [16, 19, 36, 38, 39] evaluate queries over multiple SPARQL endpoints. These systems typically target LOD and follow a query processing over data integration approach. These systems operate in a very different environment from the one we are targeting, since we focus on exploiting distributed execution for speedup and scalability.
In this paper, we propose an alternative strategy that partitions the data graph but does not decompose the query. Our approach is based on the “partial evaluation and assembly” framework [24]. An RDF graph is partitioned using some graph partitioning algorithm such as METIS [26] into vertex-disjoint fragments (edges that cross fragments are replicated in source and target fragments). Each site receives the full SPARQL query Q and executes it on the local RDF graph fragment, providing data-parallel computation. To the best of our knowledge, this is the first work that adopts the partial evaluation and assembly strategy to evaluate SPARQL queries over a distributed RDF data store. The most important advantage of this approach is that the number of involved vertices and edges in the intermediate results is minimized, which is proven theoretically (see Proposition 3 in Sect. 4).
Because of interconnections between graph fragments, computing graph homomorphism matches over a distributed graph requires special care. For example, consider the distributed RDF graph in Fig. 1. Each entity in RDF is represented by a URI (uniform resource identifier), whose prefix here denotes the location of the dataset. For example, “s1:dir1” has the prefix “s1,” meaning that the entity is located at site s1. The prefix is just for simplifying presentation, not a general assumption made by the approach. There are crossing links between two datasets identified in bold font. For example, “\(\langle \)s2:act1 isMarriedTo s1:dir1\(\rangle \)” is a crossing link (a link between different datasets), which means that act1 (at site s2) is married to dir1 (at site s1).
Now consider the following SPARQL query Q, which consists of five triple patterns (e.g., ?a isMarriedTo ?d) over this distributed RDF graph (the query graph is shown in Fig. 2):
Some SPARQL query matches are contained within a fragment, which we call inner matches. These inner matches can be found locally by existing centralized techniques at each site. However, if we consider the four datasets independently and ignore the crossing links, some correct answers will be missed, such as (?a=s2:act1, ?d=s1:dir1). The key issue in the distributed environment is how to find subgraph matches that cross multiple fragments—these are called crossing matches. For query Q in Fig. 2, the subgraph induced by vertices 014, 007, 001, 002, 009 and 018 is a crossing match between fragments \(F_1\) and \(F_2\) in Fig. 1 (shown in the shaded vertices and red edges). This is the focus of this paper.
Our solution does not depend on any specific partitioning strategy. In existing partition-based methods, the query processing always depends on a certain RDF graph partitioning strategy, which may be difficult to enforce in certain circumstances. The partition-agnostic framework enables us to adopt any partition-based optimization, although this is orthogonal to our solution in this paper.
Our method is guaranteed to involve fewer vertices or edges in intermediate results than other partition-based solutions, which we prove in Sect. 4 (Proposition 3). This property often results in a smaller number of intermediate results and lowers the cost of our approach, which we demonstrate experimentally in Sect. 7.
2 Related work
2.1 Distributed SPARQL query processing
As noted above, there are three general approaches to distributed SPARQL query processing: cloud-based approaches, partition-based approaches and federated SPARQL query systems.
2.1.1 Cloud-based approaches
There have been a number of works (e.g., [23, 27, 33, 34, 37, 47, 48, 49]) focused on managing large RDF datasets using existing cloud platforms; a very good survey of these is [25]. Many of these approaches follow the MapReduce paradigm; in particular, they use HDFS [23, 37, 48, 49], and store RDF triples in flat files in HDFS. When a SPARQL query is issued, the HDFS files are scanned to find the matches of each triple pattern, which are then joined using one of the MapReduce join implementations (see [30] for a more detailed description of these). The most important difference among these approaches is how the RDF triples are stored in HDFS files; this determines how the triples are accessed and the number of MapReduce jobs. In particular, SHARD [37] directly stores the data in a single file and each line of the file represents all triples associated with a distinct subject. HadoopRDF [23] and PredicateJoin [49] further partition RDF triples based on the predicate and store each partition within one HDFS file. EAGRE [48] first groups all subjects with similar properties into an entity class and then constructs a compressed RDF graph containing only entity classes and the connections between them. It partitions the compressed RDF graph using the METIS algorithm [26]. Entities are placed into HDFS according to the partition set that they belong to.
Besides the HDFS-based approaches, there are also some works that use other NoSQL distributed data stores to manage RDF datasets. JenaHBase [27] and H\(_2\)RDF [33, 34] use some permutations of subject, predicate, object to build indices that are then stored in HBase (http://hbase.apache.org). Trinity.RDF [47] uses the distributed memory-cloud graph system Trinity [44] to index and store the RDF graph. It uses hashing on the vertex values to obtain a disjoint partitioning of the RDF graph that is placed on nodes in a cluster.
These approaches benefit from the high scalability and fault tolerance offered by cloud platforms, but may suffer lower performance due to the difficulties of adapting MapReduce to graph computation.
2.1.2 Partition-based approaches
The partition-based approaches [15, 18, 21, 22, 28, 29] partition an RDF graph G into several fragments and place each at a different site in a parallel/distributed system. Each site hosts a centralized RDF store of some kind. At run time, a SPARQL query Q is decomposed into several subqueries such that each subquery can be answered locally at one site, and the results are then aggregated. Each of these papers proposes its own data partitioning strategy, and different partitioning strategies result in different query processing methods.
In GraphPartition [22], an RDF graph G is partitioned into n fragments, and each fragment is extended by including N-hop neighbors of boundary vertices. According to the partitioning strategy, the diameter of the graph corresponding to each decomposed subquery should not be larger than N to enable subquery processing at each local site. WARP [21] uses some frequent structures in workloads to further extend the results of GraphPartition. Partout [15] extends the concepts of minterm predicates in relational database systems and uses the results of minterm predicates as the fragmentation units. Lee et al. [28, 29] define the partition unit as a vertex and its neighbors, which they call a “vertex block.” The vertex blocks are distributed based on a set of heuristic rules. A query is partitioned into blocks that can be executed among all sites in parallel and without any communication. TriAD uses METIS [26] to divide the RDF graph into many partitions, where the number of partitions is much larger than the number of sites. Each partition is considered as a unit and distributed among different sites. At each site, TriAD maintains six large, in-memory vectors of triples, which correspond to all SPO permutations of triples. Meanwhile, TriAD constructs a summary graph to maintain the partitioning information.
All of the above methods require partitioning and distributing the RDF data according to specific requirements of their approaches. However, in some applications, the RDF repository partitioning strategy is not controlled by the distributed RDF system itself. There may be some administrative requirements that influence the data partitioning. For example, in some applications, the RDF knowledge bases are partitioned according to topics (i.e., different domains) or are partitioned according to different data contributors. Therefore, partition-tolerant SPARQL processing may be desirable. This is the motivation of our partial evaluation and assembly approach.
Also, these approaches evaluate the SPARQL query based on query decomposition, which generates more intermediate results. We provide a detailed experimental comparison in Sect. 7.
2.1.3 Federated SPARQL query systems
Federated queries run SPARQL queries over multiple SPARQL endpoints. A typical example is linked data, where different RDF repositories are interconnected, providing a virtually integrated distributed database. Federated SPARQL query processing is a very different environment than what we target in this paper, but we discuss these systems for completeness.
A common technique is to precompute metadata for each individual SPARQL endpoint. Based on the metadata, the original SPARQL query is decomposed into several subqueries, where each subquery is sent to its relevant SPARQL endpoints. The results of subqueries are then joined together to answer the original SPARQL query. In DARQ [36], the metadata are called service descriptions, which describe which triple patterns (i.e., predicates) can be answered. In [19], the metadata are called a Q-Tree, which is a variant of the R-Tree. Each leaf node in the Q-Tree stores a set of source identifiers, including one for each source of a triple approximated by the node. SPLENDID [16] uses the Vocabulary of Interlinked Datasets (VOID) as the metadata. HiBISCuS [38] relies on capabilities to compute the metadata. For each source, HiBISCuS defines a set of capabilities which map the properties to their subject and object authorities. TopFed [39] is a biological federated SPARQL query engine. Its metadata comprise an N3 specification file and a Tissue Source Site to Tumour (TSS-to-Tumour) hash table, which is devised based on the data distribution.
In contrast to these, FedX [42] does not require preprocessing, but sends “SPARQL ASK” queries to collect the metadata on the fly. Based on the results of these queries, it decomposes the query into subqueries and assigns the subqueries to relevant SPARQL endpoints.
Global query optimization in this context has also been studied. Most federated query engines employ existing optimizers, such as dynamic programming [3], for optimizing the join order of local queries. Furthermore, DARQ [36] and FedX [42] discuss the use of semijoins to compute a join between intermediate results at the control site and SPARQL endpoints.
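As a rough illustration of the semijoin idea (not DARQ's or FedX's actual implementation), the control site can project the join-variable bindings from its intermediate results, ship only those values to an endpoint, and join the returned tuples locally. In the sketch below, `query_endpoint` is a hypothetical stand-in for a remote SPARQL endpoint; real systems speak SPARQL over HTTP.

```python
def semijoin(left_bindings, join_vars, query_endpoint):
    """Ship only join-variable values to the endpoint, then join the
    returned tuples with the local intermediate results.
    left_bindings: list of {variable: value} dicts at the control site;
    query_endpoint(values): hypothetical remote call returning bindings
    whose join-variable values appear in `values`."""
    projected = {tuple(b[v] for v in join_vars) for b in left_bindings}
    right = query_endpoint(projected)      # only distinct values travel
    return [dict(l, **r)                   # local join on join_vars
            for l in left_bindings for r in right
            if all(l[v] == r[v] for v in join_vars)]

# A toy "endpoint" that knows one tuple for ?x = "a":
endpoint = lambda values: (
    [{"?x": "a", "?y": "1"}] if ("a",) in values else [])
joined = semijoin([{"?x": "a"}, {"?x": "b"}], ["?x"], endpoint)
```

The saving comes from transferring only the distinct projected values rather than full intermediate relations, at the cost of an extra round trip per join.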
2.2 Partial evaluation
Partial evaluation has been used in many applications ranging from compiler optimization to distributed evaluation of functional programming languages [24]. Recently, partial evaluation has also been used for evaluating queries on distributed XML trees and graphs [6, 7, 8, 13]. In [6, 7, 8], partial evaluation is used to evaluate some XPath queries on distributed XML. These works serialize XPath queries to a vector of subqueries and find the partial results of all subqueries at each site by using a top-down [7] or bottom-up [6] traversal over the XML tree. Finally, all partial results are assembled together at the server site to form final results. Note that since XML is a tree-based data structure, these works serialize XPath queries and traverse XML trees in a topological order. However, the RDF data and SPARQL queries are graphs rather than trees. Serializing the SPARQL queries and traversing the RDF graph in a topological order are not intuitive.
There are some prior works that consider partial evaluation on graphs. For example, Fan et al. [13] study reachability query processing over distributed graphs using the partial evaluation strategy. Partial evaluation-based graph simulation is well studied by Fan et al. [14] and Shuai et al. [31]. However, SPARQL query semantics is based on graph homomorphism [35], not graph simulation. The two concepts are formally different (i.e., they produce different results), and the two problems have very different complexities. Homomorphism defines a “function,” while simulation defines a “relation”—relation allows “one-to-many” mappings while function does not. Consequently, the results are different. The computational hardness of the two problems is also different. Graph homomorphism is a classical NP-complete problem [11], while graph simulation has a polynomial-time algorithm (\(O((|V(G)| + |V(Q)|)(|E(G)| + |E(Q)|))\)) [12], where |V(G)| (|V(Q)|) and |E(G)| (|E(Q)|) denote the number of vertices and edges in RDF data graph G (and query graph Q). Thus, the solutions based on graph simulation cannot be applied to the problem studied in this paper. To the best of our knowledge, there is no prior work in applying partial evaluation to SPARQL query processing.
3 Background and framework
An RDF dataset can be represented as a graph where subjects and objects are vertices and triples are labeled edges.
Definition 1
(RDF graph) An RDF graph is denoted as \(G=\{V,E,\varSigma \}\), where V is a set of vertices that correspond to all subjects and objects in RDF data; \(E \subseteq V \times V\) is a multiset of directed edges that correspond to all triples in RDF data; \(\varSigma \) is a set of edge labels. For each edge \(e \in E\), its edge label is its corresponding property.
Similarly, a SPARQL query can also be represented as a query graph Q. In this paper, we first focus on basic graph pattern (BGP) queries as they are foundational to SPARQL and focus on techniques for handling these. We extend this discussion in Sect. 6 to general SPARQL queries involving FILTER, UNION, and OPTIONAL.
Definition 2
(SPARQL BGP query) A SPARQL BGP query is denoted as \(Q=\{V^{Q},E^{Q},\varSigma ^{Q}\}\), where \(V^{Q} \subseteq V\cup V_{Var}\) is a set of vertices, where V denotes all vertices in RDF graph G and \(V_{Var}\) is a set of variables; \(E^{Q} \subseteq V^{Q} \times V^{Q}\) is a multiset of edges in Q; each edge e in \(E^Q\) either has an edge label in \(\varSigma \) (i.e., property) or the edge label is a variable.
We assume that Q is a connected graph; otherwise, all connected components of Q are considered separately. Answering a SPARQL query is equivalent to finding all subgraph matches (Definition 3) of Q over RDF graph G.
Definition 3
(SPARQL match) Consider an RDF graph G and a connected query graph Q that has n vertices \(\{v_1,\ldots ,v_n\}\). A subgraph M with m vertices \(\{u_1,\ldots ,u_m\}\) (in G) is said to be a match of Q if and only if there exists a function f from \(\{v_1,\ldots ,v_n\}\) to \(\{u_1,\ldots ,u_m\}\) (\(n \ge m\)), where the following conditions hold:
- 1.
if \(v_i\) is not a variable, \(f(v_i)\) and \(v_i\) have the same URI or literal value (\(1\le i \le n\));
- 2.
if \(v_i\) is a variable, there is no constraint over \(f(v_i)\) except that \(f(v_i)\in \{u_1,\ldots ,u_m\}\) ;
- 3.
if there exists an edge \(\overrightarrow{v_iv_j}\) in Q, there also exists an edge \(\overrightarrow{f{(v_i)}f{(v_j)}}\) in G. Let \(L(\overrightarrow{v_iv_j})\) denote a multi-set of labels between \(v_i\) and \(v_j\) in Q; and \(L(\overrightarrow{f{(v_i)}f{(v_j)}})\) denote a multi-set of labels between \(f(v_i)\) and \(f(v_j)\) in G. There must exist an injective function from edge labels in \(L(\overrightarrow{v_iv_j})\) to edge labels in \(L(\overrightarrow{f{(v_i)}f{(v_j)}})\). Note that a variable edge label in \(L(\overrightarrow{v_iv_j})\) can match any edge label in \(L(\overrightarrow{f{(v_i)}f{(v_j)}})\).
Vector \([ f{(v_1)}, \ldots , f{(v_n)}]\) is a serialization of a SPARQL match. Note that we allow that \(f(v_i)=f(v_j)\) when \(1\le i\ne j \le n\). In other words, a match of SPARQL Q defines a graph homomorphism.
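To make the matching conditions concrete, here is a minimal brute-force sketch in Python that enumerates homomorphisms from a query graph to a data graph. It simplifies Condition 3: it checks label containment rather than an injective mapping over the label multisets, which only matters for queries with parallel edges. Variables are marked with a leading "?"; the function names are ours, and the enumeration is exponential, so this is for exposition only, not the engine used in the paper.

```python
from itertools import product

def is_var(term):
    return term.startswith("?")

def homomorphic_matches(q_edges, g_edges):
    """Brute-force enumeration of homomorphisms (simplified Definition 3).
    q_edges/g_edges: lists of (src, dst, label); '?x' marks a variable."""
    g_vertices = sorted({v for s, o, _ in g_edges for v in (s, o)})
    q_vertices = sorted({v for s, o, _ in q_edges for v in (s, o)})
    labels = {}  # (src, dst) -> list of edge labels in the data graph
    for s, o, p in g_edges:
        labels.setdefault((s, o), []).append(p)
    results = []
    for image in product(g_vertices, repeat=len(q_vertices)):
        f = dict(zip(q_vertices, image))
        # Condition 1: a constant (non-variable) vertex must map to itself
        if any(not is_var(v) and f[v] != v for v in q_vertices):
            continue
        # Condition 3 (simplified): every query edge needs a data edge
        # with a compatible label; a variable edge label matches anything
        if all(labels.get((f[s], f[o])) and
               (is_var(p) or p in labels[(f[s], f[o])])
               for s, o, p in q_edges):
            results.append(f)
    return results

g_edges = [("a", "b", "knows"), ("b", "c", "knows")]
matches = homomorphic_matches([("?x", "?y", "knows")], g_edges)
```

Note that nothing forbids two query vertices from mapping to the same data vertex, matching the non-injective "function, not relation" semantics discussed in Sect. 2.2.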
In the context of this paper, an RDF graph G is vertex-disjoint partitioned into a number of fragments, each of which resides at one site. The vertex-disjoint partitioning has been used in most distributed RDF systems, such as GraphPartition [22], EAGRE [48] and TripleGroup [28]. Different distributed RDF systems utilize different vertex-disjoint partitioning algorithms, and the partitioning algorithm is orthogonal to our approach. Any vertex-disjoint partitioning method can be used in our method, such as METIS [26] and MLP [46].
The vertex-disjoint partitioning methods guarantee that there are no overlapping vertices between fragments. However, to guarantee data integrity and consistency, we store some replicas of crossing edges. Since the RDF graph G is partitioned by our system, metadata are readily available regarding crossing edges (both outgoing and incoming edges) and the endpoints of crossing edges. Formally, we define the distributed RDF graph as follows.
Definition 4
(Distributed RDF graph) Given an RDF graph \(G=\{V,E,\varSigma \}\), a distributed RDF graph \(G=\{F_1,\ldots ,F_k\}\) consists of k fragments, where each fragment \(F_i=\{V_i \cup V_i^e, E_i \cup E_i^c, \varSigma _i\}\) (\(i=1,\ldots ,k\)) satisfies the following:
- 1.
\(\{V_1,\ldots ,V_k\}\) is a partitioning of V, i.e., \(V_i \cap V_j = \emptyset ,1 \le i,j \le k,i \ne j \) and \(\bigcup \nolimits _{i = 1,\ldots ,k} {V_i = V}\) ;
- 2.
\(E_i \subseteq V_i \times V_i\), \(i=1,\ldots ,k\);
- 3.
\(E_i^c\) is a set of crossing edges between \(F_i\) and other fragments, i.e.,$$\begin{aligned} E_i^c&= \left( \bigcup \nolimits _{1 \le j \le k \wedge j \ne i} \{\overrightarrow{uu^\prime } \in E \mid u \in V_i \wedge u^\prime \in V_j \} \right) \bigcup \\&\quad \quad \left( \bigcup \nolimits _{1 \le j \le k \wedge j \ne i} \{\overrightarrow{u^\prime u} \in E \mid u^\prime \in V_j \wedge u \in V_i \} \right) \end{aligned}$$
- 4.
A vertex \(u^\prime \in V_i^e\) if and only if vertex \(u^\prime \) resides in another fragment \(F_j\) and \(u^{\prime }\) is an endpoint of a crossing edge between fragment \(F_i\) and \(F_j\) (\(F_i \ne F_j\)), i.e.,$$\begin{aligned} V_i^e&= \left( \bigcup \nolimits _{1 \le j \le k \wedge j \ne i} \{ {u^\prime } |\overrightarrow{uu^\prime }\,\in E_i ^c \wedge u \in F_i \} \right) \bigcup \\&\quad \quad \left( \bigcup \nolimits _{1 \le j \le k \wedge j \ne i} {\{ {u^\prime } |\overrightarrow{u^\prime u} \in E_i ^c \wedge u \in F_i \} }\right) \end{aligned}$$
- 5.
Vertices in \(V_i^e\) are called extended vertices of \(F_i\), and all vertices in \(V_i\) are called internal vertices of \(F_i\);
- 6.
\(\varSigma _i\) is a set of edge labels in \(F_i\).
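Under Definition 4, the crossing-edge sets \(E_i^c\) and extended-vertex sets \(V_i^e\) can be derived mechanically from any vertex-disjoint partitioning. The following Python sketch (all names ours) does exactly that, assuming edges are given as (subject, object, property) triples; the replication of each crossing edge at both its source and target fragments mirrors the data-integrity requirement discussed above.

```python
def fragment_metadata(partition, edges):
    """Compute each fragment's crossing edges (E_i^c) and extended
    vertices (V_i^e) per Definition 4. `partition` maps a fragment id
    to its internal vertex set (vertex-disjoint); `edges` is a list of
    (subject, object, property) triples."""
    site_of = {v: i for i, vs in partition.items() for v in vs}
    crossing = {i: set() for i in partition}   # E_i^c
    extended = {i: set() for i in partition}   # V_i^e
    for s, o, p in edges:
        i, j = site_of[s], site_of[o]
        if i != j:
            # a crossing edge is replicated at source and target fragments
            crossing[i].add((s, o, p))
            crossing[j].add((s, o, p))
            extended[i].add(o)   # the endpoint living in the other fragment
            extended[j].add(s)
    return crossing, extended

# Toy partitioning loosely echoing Example 1's vertex IDs:
partition = {1: {"001", "002"}, 2: {"009"}}
crossing, extended = fragment_metadata(
    partition, [("001", "002", "p"), ("002", "009", "q")])
```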
Example 1
Figure 1 shows a distributed RDF graph G consisting of four fragments \(F_1\), \(F_2\), \(F_3\) and \(F_4\). The numbers beside the vertices are vertex IDs that are introduced for ease of presentation. In Fig. 1, \(\overrightarrow{002,001}\) is a crossing edge between \(F_1\) and \(F_2\). Also, edges \(\overrightarrow{004,011}\), \(\overrightarrow{001,012}\) and \(\overrightarrow{006,008}\) are crossing edges between \(F_1\) and \(F_3\). Hence, \(V_1^e=\{002,006,012,004\}\) and \(E_1^c=\{\overrightarrow{002,001},\)\(\overrightarrow{004,011},\overrightarrow{001,012},\)\( \overrightarrow{006,008}\}\).
Definition 5
(Problem statement) Let G be a distributed RDF graph that consists of a set of fragments \(\mathcal {F} = \{F_{1}, \ldots , F_{k}\}\) and let \(\mathcal {S} = \{S_{1}, \ldots , S_{k}\}\) be a set of computing nodes such that \(F_{i}\) is located at \(S_{i}\). Given a SPARQL query graph Q, our goal is to find all SPARQL matches of Q in G.
Note that for simplicity of exposition, we are assuming that each site hosts one fragment. Inner matches can be computed locally using a centralized RDF triple store, such as RDF-3x [32], SW-store [1] or gStore [50]. In our prototype development and experiments, we modify gStore, a graph-based SPARQL query engine [50], to perform partial evaluation. The main issue of answering SPARQL queries over the distributed RDF graph is finding crossing matches efficiently. That is a major focus of this paper.
Example 2
Given a SPARQL query graph Q in Fig. 2, the subgraph induced by vertices 014,007,001,002,009 and 018 (shown in the shaded vertices and the red edges in Fig. 1) is a crossing match of Q.
We utilize a partial evaluation and assembly [24] framework to answer SPARQL queries over a distributed RDF graph G. Each site \(S_i\) treats fragment \(F_i\) as the known input and other fragments as the yet unavailable input \(\overline{G}\) (as defined in Sect. 1) [13].
In our execution model, each site \(S_i\) receives the full query graph Q. In the partial evaluation stage, at each site \(S_i\), we find all local partial matches (Definition 6) of Q in \(F_i\). We prove that an overlapping part between any crossing match and fragment \(F_i\) must be a local partial match in \(F_i\) (see Proposition 1).
In the assembly stage, these local partial matches are assembled to form crossing matches. In this paper, we consider two assembly strategies: centralized and distributed (or parallel). In centralized, all local partial matches are sent to a single site for the assembly. In distributed/parallel, local partial matches are combined at a number of sites in parallel (see Sect. 5).
Step 1 (Initialization): A SPARQL query Q is input and sent to each site in \(\mathcal {S}\).
Step 2 (Partial Evaluation): Each site \(S_i\) finds local partial matches of Q over fragment \(F_i\). This step is executed in parallel at each site (Sect. 4).
Step 3 (Assembly): Finally, we assemble all local partial matches to compute complete crossing matches. The system can use the centralized (Sect. 5.2) or the distributed assembly approach (Sect. 5.3) to find crossing matches.
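The three steps above can be sketched as a simple driver. This is a hypothetical Python skeleton: `find_local_partial_matches` and `assemble` stand in for the algorithms of Sects. 4 and 5, and threads stand in for sites.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(query, fragments, find_local_partial_matches, assemble):
    """Ship the full query to every site, run partial evaluation in
    parallel, then assemble (here: centralized assembly)."""
    # Step 1 (Initialization): every site receives the full query Q
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        # Step 2 (Partial Evaluation): in parallel at each site
        partial = list(pool.map(
            lambda frag: find_local_partial_matches(query, frag),
            fragments))
    # Step 3 (Assembly): combine local partial matches into matches
    return assemble(query, partial)

# Stub plumbing just to exercise the control flow:
find = lambda q, frag: [(q, frag)]
flatten = lambda q, partial: [m for site in partial for m in site]
result = evaluate("Q", [1, 2], find, flatten)
```

Note that no query decomposition happens anywhere in this skeleton; every site sees the identical query graph Q.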
4 Partial evaluation
We first formally define a local partial match (Sect. 4.1) and then discuss how to compute it efficiently (Sect. 4.2).
4.1 Local partial match: definition
Recall that each site \(S_i\) receives the full query graph Q (i.e., there is no query decomposition). In order to answer query Q, each site \(S_i\) computes the partial answers (called local partial matches) based on the known input \(F_i\) (recall that, for simplicity of exposition, we assume that each site hosts one fragment as indicated by its subscript). Intuitively, a local partial match \(PM_i\) is an overlapping part between a crossing match M and fragment \(F_i\) at the partial evaluation stage. Moreover, M may or may not exist depending on the yet unavailable input \(\overline{G}\). Based only on the known input \(F_i\), we cannot judge whether or not M exists. For example, the subgraph induced by vertices 014, 007, 001 and 002 (shown in shaded vertices and red edges) in Fig. 1 is a local partial match between M and \(F_1\).
Definition 6
(Local partial match) Given a SPARQL query graph Q with n vertices \(\{v_1,\ldots ,v_n\}\) and a connected subgraph PM with m vertices \(\{u_1,\ldots ,u_m\}\) in fragment \(F_k\), PM is a local partial match between Q and \(F_k\) if and only if there exists a function \(f: \{v_1,\ldots ,v_n\} \rightarrow \{u_1,\ldots ,u_m\} \cup \{NULL\}\), where the following conditions hold:
- 1.
If \(v_i\) is not a variable, \(f(v_i)\) and \(v_i\) have the same URI or literal or \(f(v_i)=NULL\).
- 2.
If \(v_i\) is a variable, \(f(v_i) \in \{u_1,\ldots ,u_m\}\) or \(f(v_i)=NULL\).
- 3.
If there exists an edge \(\overrightarrow{v_iv_j}\) in Q (\(1 \le i\ne j \le n\)), then PM must meet one of the following five conditions: (1) there also exists an edge \(\overrightarrow{f{(v_i)}f{(v_j)}}\) in PM with property p, and p is the same as the property of \(\overrightarrow{v_iv_j}\); (2) there also exists an edge \(\overrightarrow{f{(v_i)}f{(v_j)}}\) in PM with property p, and the property of \(\overrightarrow{v_iv_j}\) is a variable; (3) there does not exist an edge \(\overrightarrow{f{(v_i)}f{(v_j)}}\), but \(f{(v_i)}\) and \(f{(v_j)}\) are both in \(V_k^e\); (4) \(f{(v_i)}=NULL\); (5) \(f{(v_j)}=NULL\).
- 4.
PM contains at least one crossing edge, which guarantees that an empty match does not qualify.
- 5.
If \(f(v_i) \in V_k\) (i.e., \(f(v_i)\) is an internal vertex in \(F_k\)) and \(\exists \overrightarrow{v_i v_j} \in Q\) (or \(\overrightarrow{v_j v_i} \in Q\)), there must exist \(f(v_j) \ne NULL\) and \(\exists \overrightarrow{f(v_i)f(v_j)} \in PM\) (or \(\exists \overrightarrow{f(v_j)f(v_i)} \in PM\)). Furthermore, if \(\overrightarrow{v_i v_j}\) (or \(\overrightarrow{v_j v_i})\) has a property p, \(\overrightarrow{f(v_i)f(v_j)}\) (or \(\overrightarrow{f(v_j)f(v_i)}\)) has the same property p.
- 6.
Any two vertices \(v_i\) and \(v_j\) (in query Q), where \(f(v_i)\) and \(f(v_j)\) are both internal vertices in PM, are weakly connected (see Definition 7) in Q.
Example 3
Given a SPARQL query Q with six vertices in Fig. 2, the subgraph induced by vertices 001, 002, 007 and 014 (shown in shaded circles and red edges) is a local partial match of Q in fragment \(F_1\). The function is \(\{(v_1, 002), (v_2, 001),\)\((v_3, NULL), (v_4,\)\(007), (v_5, NULL), (v_6, 014)\}\). The five different local partial matches in \(F_1\) are shown in Fig. 4.
Definition 6 formally defines a local partial match, which is a subset of a complete SPARQL match. Therefore, some conditions in Definition 6 are analogous to those of a SPARQL match, with some subtle differences. In Definition 6, some vertices of query Q are not matched in a local partial match; they are allowed to match a special value NULL (e.g., \(v_3\) and \(v_5\) in Example 3). As mentioned earlier, a local partial match is the overlapping part of an unknown crossing match and a fragment \(F_i\). Therefore, it must have a crossing edge, i.e., Condition 4.
The basic intuition of Condition 5 is that if vertex \(v_i\) (in query Q) is matched to an internal vertex, all of \(v_i\)’s neighbors should be matched in this local partial match as well. The following example illustrates the intuition.
Example 4
Let us recall the local partial match \(PM_1^2\) of Fragment \(F_1\) in Fig. 4. An internal vertex 001 in fragment \(F_1\) is matched to vertex \(v_2\) in query Q. Assume that PM is an overlapping part between a crossing match M and fragment \(F_1\). Obviously, \(v_2\)’s neighbors, such as \(v_1\) and \(v_4\), should also be matched in M. Furthermore, the matching vertices should be 001’s neighbors. Since 001 is an internal vertex in \(F_1\), 001’s neighbors are also in fragment \(F_1\).
Therefore, if a PM violates Condition 5, it cannot be a subgraph of a crossing match. In other words, we are not interested in these subgraphs when finding local partial matches, since they do not contribute to any crossing match.
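Conditions 4 and 5 are the easiest to check mechanically. The following Python sketch (a simplification, all names ours) tests only these two conditions over a candidate mapping f; it does not verify edge labels or the remaining conditions of Definition 6.

```python
def check_conditions_4_and_5(f, pm_edges, crossing_edges, internal, q_adj):
    """Check only Conditions 4 and 5 of Definition 6 (simplified).
    f: {query_vertex: data_vertex or None};
    pm_edges / crossing_edges: sets of (src, dst) pairs;
    internal: the internal vertices V_k of the fragment;
    q_adj: undirected adjacency of the query graph."""
    # Condition 4: PM must contain at least one crossing edge
    if not any(e in crossing_edges for e in pm_edges):
        return False
    # Condition 5: if f(v) is an internal vertex, every neighbor of v
    # in Q must also be matched (i.e., not NULL)
    for v, image in f.items():
        if image in internal:
            if any(f.get(w) is None for w in q_adj.get(v, ())):
                return False
    return True

q_adj = {"v1": {"v2"}, "v2": {"v1"}}
ok = check_conditions_4_and_5(
    {"v1": "001", "v2": "002"}, {("001", "002")}, {("001", "002")},
    {"001"}, q_adj)
bad = check_conditions_4_and_5(
    {"v1": "001", "v2": None}, {("001", "002")}, {("001", "002")},
    {"001"}, q_adj)
```

The second call fails exactly for the reason given above: "001" is internal, so all of its query neighbors must be matched.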
Definition 7
Two vertices are weakly connected in a directed graph if and only if there exists a connected path between the two vertices when all directed edges are replaced with undirected edges. The path is called a weakly connected path between the two vertices.
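Definition 7 amounts to reachability in the undirected version of the graph, which a standard BFS checks; the sketch below (names ours) is one straightforward way to implement it.

```python
from collections import deque

def weakly_connected(edges, u, v):
    """True iff u and v are connected when every directed edge in
    `edges` (a list of (src, dst) pairs) is treated as undirected."""
    adj = {}
    for s, o in edges:
        adj.setdefault(s, set()).add(o)   # forward direction
        adj.setdefault(o, set()).add(s)   # reverse: ignore direction
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj.get(x, set()) - seen:
            seen.add(y)
            queue.append(y)
    return False
```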
Condition 6 will be used to prove the correctness of our algorithm in Propositions 1 and 2. The following example shows all local partial matches in the running example.
Example 5
Given a query Q in Fig. 2 and an RDF graph G in Fig. 1, Fig. 4 shows all local partial matches and their serialization vectors in each fragment. A local partial match in fragment \(F_i\) is denoted as \(PM_i^j\), where the superscript distinguishes different local partial matches in the same fragment. Furthermore, we underline all extended vertices in serialization vectors.
The definition of local partial matches guarantees the following properties:
- 1.
The overlapping part between any crossing match M and internal vertices of fragment \(F_i\) (\(i=1,\ldots ,k\)) must be a local partial match (see Proposition 1).
- 2.
Missing any local partial match may lead to result dismissal. Thus, the algorithm should find all local partial matches in each fragment (see Proposition 2).
- 3.
It is impossible to find two local partial matches M and \(M^\prime \) in fragment F, where \(M^\prime \) is a subgraph of M, i.e., each local partial match is maximal (see Proposition 4).
Proposition 1
Given any crossing match M of SPARQL query Q in an RDF graph G, if M overlaps with some fragment \(F_i\), let \((M \cap F_i)\) denote the overlapping part between M and fragment \(F_i\). Assume that \((M \cap F_i)\) consists of several weakly connected components, denoted as \((M \cap F_i )=\{PM_1,\ldots ,PM_n\}\). Each weakly connected component \(PM_a\) (\(1\le a \le n\)) in \((M \cap F_i)\) must be a local partial match in fragment \(F_i\).
Proof
(1) Since \(PM_a\) (\(1\!\le \! a \!\le \! n\)) is a subset of a SPARQL match, it is easy to show that Conditions 1–3 of Definition 6 hold.
(2) We prove that each weakly connected component \(PM_a\) (\(1\le a \le n\)) must have at least one crossing edge (i.e., Condition 4) as follows.
Since M is a crossing match of SPARQL query Q, M must be weakly connected, i.e., any two vertices in M are weakly connected. Assume that \((M \cap F_i)\) consists of several weakly connected components, denoted as \((M \cap F_i )=\{PM_1,\ldots ,PM_n\}\). Let \(M = (M \cap F_i ) + \overline{(M \cap F_i )} \), where \( \overline{(M \cap F_i )}\) denotes the complement of \((M \cap F_i )\). It is straightforward to show that \( \overline{(M \cap F_i )}\) must occur in other fragments; otherwise, it would have been found in \((M \cap F_i )\). The components \(PM_a\) (\(1 \le a \le n\)) are weakly disconnected from each other because we remove \( \overline{(M \cap F_i )}\) from M. Hence, each \(PM_a\) must have at least one crossing edge connecting it with \(\overline{(M \cap F_i )}\), since \(\overline{(M \cap F_i )}\) lies in other fragments and only crossing edges connect fragment \(F_i\) with other fragments; otherwise, \(PM_a\) would be a separated part of the crossing match M. Since M is weakly connected, \(PM_a\) has at least one crossing edge, i.e., Condition 4.
(3) For Condition 5, for any internal vertex u in \(PM_a\) (\(1\le a \le n\)), \(PM_a\) retains all its incident edges. Thus, we can prove that Condition 5 holds.
(4) We define \(PM_a\) (\(1\le a \le n\)) as a weakly connected part in \((M \cap F_i )\). Thus, Condition 6 holds.
To summarize, the overlapping part between M and fragment \(F_i\) satisfies all conditions in Definition 6. Thus, Proposition 1 holds. \(\square \)
Let us recall Example 5. There are some local partial matches that do not contribute to any crossing match, such as \(PM_1^5\) in Fig. 4. We call these local partial matches false positives. However, the partial evaluation stage only depends on the known input. If we do not know the structures of other fragments, we cannot judge whether or not \(PM_1^5\) is a false positive. Formally, we have the following proposition, stating that we have to find all local partial matches in each fragment \(F_i\) in the partial evaluation stage.
Proposition 2
The partial evaluation and assembly algorithm does not miss any crossing matches in the answer set if and only if all local partial matches in each fragment are found in the partial evaluation stage.
Proof
The proof is in two parts:
(1) The “If” part: (proven by contradiction).
Assume that all local partial matches are found in each fragment \(F_i\) but a crossing match M is missed in the answer set. Since M is a crossing match, suppose that M overlaps with m fragments \(F_1\),...,\(F_m\). According to Proposition 1, the overlapping part between M and \(F_i\) (\(i=1,\ldots ,m\)) must be a local partial match \(PM_i\) in \(F_i\). According to the assumption, these local partial matches have been found in the partial evaluation stage. Obviously, we can assemble these partial matches \(PM_i\) (\(i=1,\ldots ,m\)) to form the complete crossing match M.
In other words, M would not be missed if all local partial matches are found. This contradicts the assumption.
(2) The “Only If” part: (proven by contradiction).
We assume that a local partial match \(PM_i\) in fragment \(F_i\) is missed while the answer set still satisfies the no-false-negative requirement. Suppose that \(PM_i\) matches a part of Q, denoted as \(Q^\prime \). Assume that there exists another local partial match \(PM_j\) in \(F_j\) that matches the complementary graph of \(Q^\prime \), denoted as \(\overline{Q}= Q \setminus Q^\prime \). In this case, we can obtain a complete match M by assembling the two local partial matches. If \(PM_i\) in \(F_i\) is missed, then match M is missed. In other words, the answer set cannot satisfy the no-false-negative requirement. This also contradicts the assumption. \(\square \)
Proposition 2 guarantees that no local partial matches will be missed. This is important to avoid false negatives. Based on Proposition 2, we can further prove the following proposition, which guarantees that the intermediate results in our method involve the smallest number of vertices and edges.
Proposition 3
Given the same underlying partitioning over RDF graph G, the number of involved vertices and edges in the intermediate results (in our approach) is not larger than that in any other partition-based solution.
Proof
In Proposition 2, we proved that every local partial match must be found to guarantee result completeness (i.e., to avoid false negatives), and that our method produces complete results. Therefore, if a partition-based solution omits some of the partial matches (i.e., intermediate results) that are in our solution (i.e., its intermediate results are smaller than ours), then it cannot produce complete results. Assuming that all solutions produce complete results, what remains to be proven is that our set of partial matches is a subset of those generated by any other partition-based solution. We prove this by contradiction.
Let A be a solution generated by an alternative partition-based approach. Assume that there exists a vertex u in a local partial match PM produced by our method, but u is not in the intermediate results of the partition-based solution A. This would mean that, during the assembly phase that produces the final result, any edge adjacent to u would be missed. This would produce an incomplete answer, which contradicts the completeness assumption.
Similarly, it can be argued that it is impossible that there exists an edge in our local partial matches (i.e., intermediate results) that is not in the intermediate results of other partition-based approaches.
In other words, all vertices and edges in local partial matches must occur in the intermediate results of other partition-based approaches. Therefore, Proposition 3 holds. \(\square \)
Finally, we discuss another feature of a local partial match \(PM_i\) in fragment \(F_i\). Any \(PM_i\) cannot be enlarged by introducing more vertices or edges to become a larger local partial match. The following proposition formalizes this.
Proposition 4
Given a query graph Q and an RDF graph G, if \(PM_i\) is a local partial match under function f in fragment \(F_i\), there exists no local partial match \(PM^\prime _i\) under function \(f^\prime \) in \(F_i\), where \(f\subset f^\prime \).
Proof
(by contradiction) Assume that there exists another local partial match \(PM^\prime _i\) of query Q in fragment \(F_i\), where \(PM_i\) is a subgraph of \(PM^\prime _i\). Since \(PM_i\) is a subgraph of \(PM^\prime _i\), there must exist at least one edge \(e=\overrightarrow{uu^\prime }\) where \(e\in PM^\prime _i\) and \(e \notin PM_i\). Assume that \(\overrightarrow{uu^\prime }\) matches edge \(\overrightarrow{vv^{\prime }}\) in query Q. Obviously, at least one endpoint of e must be an internal vertex; we assume that u is an internal vertex. According to Condition (5) of Definition 6 and Claim (1), edge \(\overrightarrow{vv^{\prime }}\) should also be matched in \(PM_i\), since \(PM_i\) is a local partial match. However, edge \(\overrightarrow{uu^\prime }\) (matching \(\overrightarrow{vv^{\prime }}\)) does not exist in \(PM_i\). This contradicts \(PM_i\) being a local partial match. Thus, Proposition 4 holds. \(\square \)
4.2 Computing local partial matches
Given a SPARQL query Q and a fragment \(F_i\), the goal of partial evaluation is to find all local partial matches (according to Definition 6) in \(F_i\). The matching process consists of determining a function f that associates vertices of Q with vertices of \(F_i\). The matches are expressed as a set of pairs (v, u) (\(v \in Q\) and \(u \in F_i\)). A pair (v, u) represents the matching of a vertex v of query Q with a vertex u of fragment \(F_i\). The set of vertex pairs (v, u) constitutes function f referred to in Definition 6.
A high-level description of finding local partial matches is outlined in Algorithm 1 and Function ComParMatch. According to Conditions 1 and 2 of Definition 6, each vertex v in query graph Q has a candidate list of vertices in fragment \(F_i\). Since function f is a set of vertex pairs (v, u) (\(v \in Q\) and \(u \in F_i\)), we start with an empty set. In each step, we introduce a candidate vertex pair (v, u) to expand the current function f, where vertex u (in fragment \(F_i\)) is a candidate of vertex v (in query Q).
At each step, a new candidate vertex pair \((v^\prime ,u^\prime )\) is added to an existing function f to form a new function \(f^\prime \). The order of selecting query vertices can be arbitrary; however, QuickSI [43] proposes several heuristic rules to select an optimized order that speeds up the matching process. These rules are also utilized in our experiments.
To compute local partial matches (Algorithm 1), we revise gStore, a graph-based SPARQL query engine from our previous work. Since gStore adopts a subgraph matching technique to answer SPARQL queries, it is easy to revise its subgraph matching algorithm to find local partial matches in each fragment. gStore adopts a state transformation technique to find SPARQL matches, where a state corresponds to a partial match (i.e., a function from Q to G).
Our state transformation algorithm is as follows. Assume that vertex v (in the RDF graph) matches vertex u in SPARQL query Q. We first initialize a state with v. Then, we search the RDF data graph for a neighbor \(v^\prime \) of v corresponding to \(u^\prime \) in Q, where \(u^\prime \) is one of u's neighbors and edge \(\overrightarrow{vv^\prime }\) satisfies query edge \(\overrightarrow{uu^\prime }\). The search extends the state step by step. A search branch terminates when a state corresponding to a match is found or the search cannot continue; the algorithm then backtracks and tries another search branch.
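To make the state transformation concrete, the following is a minimal backtracking matcher in the spirit of Algorithm 1. It is a simplified sketch, not the paper's implementation: candidate filtering, the QuickSI ordering heuristics, and the fragment-specific conditions of Definition 6 (crossing edges, internal vs. extended vertices) are omitted, and matching is plain homomorphism over labeled directed edges.

```python
from typing import Dict, List, Tuple

def find_matches(query_edges: List[Tuple[str, str, str]],
                 data_edges: List[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    # Grow a (partial) function f one (query_vertex, data_vertex) pair at a
    # time; backtrack when the current f cannot be expanded any further.
    query_vertices = sorted({v for s, _, o in query_edges for v in (s, o)})
    data_vertices = sorted({v for s, _, o in data_edges for v in (s, o)})
    results: List[Dict[str, str]] = []

    def consistent(f: Dict[str, str]) -> bool:
        # every query edge with both endpoints mapped must exist in the data
        return all((f[s], p, f[o]) in data_edges
                   for s, p, o in query_edges if s in f and o in f)

    def expand(f: Dict[str, str]) -> None:
        if len(f) == len(query_vertices):
            results.append(dict(f))     # a complete match is found
            return
        v = next(u for u in query_vertices if u not in f)
        for u in data_vertices:         # candidates for v (unfiltered here)
            f[v] = u
            if consistent(f):
                expand(f)
            del f[v]                    # backtrack

    expand({})
    return results
```

In the real engine, the candidate list for each query vertex is pruned first (Conditions 1 and 2 of Definition 6), which keeps the search space far smaller than iterating over all data vertices.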
Example 6
Figure 5 shows how to compute Q's local partial matches in fragment \(F_1\). Suppose that we initialize a function f with \((v_3,005)\). In the second step, we expand to \(v_1\) and consider \(v_1\)'s candidates, which are 002 and 028. Hence, we introduce two vertex pairs \((v_1,002)\) and \((v_1,028)\) to expand f. Similarly, we introduce \((v_5,027)\) into the function \(\{(v_3,005),(v_1,002)\}\) in the third step. Then, \(\{(v_3,005),(v_1,002),(v_5,027)\}\) satisfies all conditions of Definition 6; thus, it is a local partial match and is returned. In another search branch, we check the function \(\{(v_3,005),(v_1,028)\}\), which cannot be expanded, i.e., we cannot introduce a new matching pair without violating some condition of Definition 6. Therefore, this search branch is terminated.
5 Assembly
Each site \(S_i\) finds all local partial matches in fragment \(F_i\). The next step is to assemble these partial matches to compute crossing matches and form the final results. We propose two assembly strategies: centralized and distributed (or parallel). In centralized assembly, all local partial matches are sent to a single site; for example, in a client/server system, all local partial matches may be sent to the server. In distributed (parallel) assembly, local partial matches are combined at a number of sites in parallel. When \(S_i\) sends its local partial matches to the final assembly site for joining, it also tags which vertices in the local partial matches are internal and which are extended vertices of \(F_i\). This is useful for avoiding some computations, as discussed in this section.
In Sect. 5.1, we define a basic join operator for assembly. Then, we propose a centralized assembly algorithm in Sect. 5.2 using the join operator. In Sect. 5.3, we study how to assemble local partial matches in a distributed manner.
5.1 Join-based assembly
We first define the conditions under which two partial matches are joinable. Obviously, crossing matches can only be formed by assembling partial matches from different fragments. If local partial matches from the same fragment could be assembled, this would result in a larger local partial match in the same fragment, which is contrary to Proposition 4.
Definition 8
(Joinable) Given a SPARQL query Q and two fragments \(F_i\) and \(F_j\) (\(i \ne j\)), let \(PM_i\) and \(PM_j\) be local partial matches of Q over \(F_i\) and \(F_j\) under functions \(f_i\) and \(f_j\), respectively. \(PM_i\) and \(PM_j\) are joinable if and only if the following conditions hold:
- 1.
There exist no vertices u and \(u^\prime \) in \(PM_i\) and \(PM_j\), respectively, such that \(f^{-1}_i(u)=f^{-1}_j(u^\prime )\).
- 2.
There exists at least one crossing edge \(\overrightarrow{uu^\prime }\) such that u is an internal vertex and \(u^\prime \) is an extended vertex in \(F_i\), while u is an extended vertex and \(u^\prime \) is an internal vertex in \(F_j\). Furthermore, \(f^{-1}_i{(u)}=f^{-1}_j{(u)}\) and \(f^{-1}_i{(u^\prime )}=f^{-1}_j{(u^\prime )}\).
The first condition says that the same query vertex cannot be matched by different internal vertices in joinable partial matches. The second condition says that two local partial matches share at least one common crossing edge that corresponds to the same query edge.
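The joinability test can be sketched as follows. This is a simplified illustration of Definition 8, not the paper's code: a local partial match is modeled as its mapping (query vertex \(\rightarrow \) data vertex) plus the set of its internal vertices, and the crossing-edge condition is checked via a shared matched query edge.

```python
from typing import Dict, List, Set, Tuple

def joinable(f_i: Dict[str, str], internal_i: Set[str],
             f_j: Dict[str, str], internal_j: Set[str],
             query_edges: List[Tuple[str, str, str]]) -> bool:
    # Condition 1: a query vertex matched by both PMs must map to the same
    # data vertex, and not to internal vertices on both sides.
    for v in set(f_i) & set(f_j):
        if f_i[v] != f_j[v]:
            return False
        if f_i[v] in internal_i and f_j[v] in internal_j:
            return False
    # Condition 2: the PMs share a crossing edge matching the same query
    # edge, internal on one side and extended on the other (either way).
    for s, _, o in query_edges:
        if s in f_i and o in f_i and s in f_j and o in f_j:
            u, w = f_i[s], f_i[o]
            if (u, w) == (f_j[s], f_j[o]):
                if ((u in internal_i and w not in internal_i
                     and u not in internal_j and w in internal_j) or
                    (w in internal_i and u not in internal_i
                     and w not in internal_j and u in internal_j)):
                    return True
    return False
```

With the data of Example 7, \(PM_1^2\) and \(PM_2^2\) pass both conditions because they agree on 002 and 001 and share the crossing edge \(\overrightarrow{002,001}\), with the internal/extended roles reversed between the two fragments.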
Example 7
Let us recall query Q in Fig. 2. Figure 3 shows two different local partial matches \(PM_1^2\) and \(PM_2^2\). We also show the functions in Fig. 3. There do not exist two different vertices in the two local partial matches that match the same query vertex. Furthermore, they share a common crossing edge \(\overrightarrow{002,001}\), where 002 and 001 match query vertices \(v_2\) and \(v_1\) in the two local partial matches, respectively. Hence, they are joinable.
The join result of two joinable local partial matches is defined as follows.
Definition 9
(Join result) Given two joinable local partial matches \(PM_i\) and \(PM_j\) of query Q over fragments \(F_i\) and \(F_j\) under functions \(f_i\) and \(f_j\), respectively, their join result \(PM_i \bowtie _{f} PM_j\) is defined under a new function f, where for each vertex \(v\) in Q:
- 1.
if \(f_i(v) \ne NULL \wedge f_j(v)=NULL \), \(f(v) \leftarrow f_i(v)\);
- 2.
if \(f_i(v) = NULL \wedge f_j(v)\ne NULL \), \(f(v) \leftarrow f_j(v)\);
- 3.
if \(f_i(v) \ne NULL \wedge f_j(v) \ne NULL \), \(f(v) \leftarrow f_i(v)\) (in this case, \(f_i(v)=f_j(v)\));
- 4.
if \(f_i(v) = NULL \wedge f_j(v) = NULL \), \(f(v) \leftarrow NULL \).
Figure 3 shows the join result of \(PM_1^2 \bowtie _{f} PM_2^2\).
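Representing a function as a dict with absent keys standing for NULL, the four cases of Definition 9 collapse into a simple merge. This is an illustrative sketch, not the paper's implementation:

```python
from typing import Dict

def join(f_i: Dict[str, str], f_j: Dict[str, str]) -> Dict[str, str]:
    # Absent keys play the role of NULL.
    f = dict(f_i)                      # cases 1 and 3: f_i(v) is defined
    for v, u in f_j.items():
        if v in f:
            assert f[v] == u           # case 3 requires f_i(v) = f_j(v)
        else:
            f[v] = u                   # case 2: only f_j(v) is defined
    return f                           # case 4: v absent from both stays NULL
```

The assertion encodes the precondition that the two matches are joinable (Definition 8), under which case 3 can never disagree.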
5.2 Centralized assembly
In centralized assembly, all local partial matches are sent to a final assembly site. We propose an iterative join algorithm (Algorithm 2) to find all crossing matches. In each iteration, a pair of local partial matches is joined. When the join is complete (i.e., a match has been found), the result is returned (Lines 12–13 in Algorithm 2); otherwise, it is joined with other local partial matches in the next iteration (Lines 14–15). There are |V(Q)| iterations of Lines 4–16 in the worst case, since at each iteration only a single new matching vertex is introduced (worst case) and Q has |V(Q)| vertices. If no new intermediate results are generated at some iteration, the algorithm can stop early (Lines 5–6).
Example 8
In this example, we consider a crossing match formed by three local partial matches: \(PM_1^4\), \(PM_4^1\) and \(PM_3^1\) in Fig. 4. In the first iteration, we obtain the intermediate result \(PM_1^4 \bowtie _{f} PM_3^1\) in Fig. 6. Then, in the next iteration, \((PM_1^4 \bowtie _{f} PM_3^1)\) joins with \(PM_4^1\) to obtain a crossing match.
5.2.1 Partitioning-based join processing
The join space in Algorithm 2 is large, since we need to check whether every pair of local partial matches \(PM_i\) and \(PM_j\) is joinable. This subsection proposes an optimized technique to reduce the join space.
The intuition of our method is as follows. We divide all local partial matches into multiple partitions such that two local partial matches in the same partition cannot be joinable, and we only consider joining local partial matches from different partitions. The following theorem specifies which local partial matches can be put in the same partition.
Theorem 1
Given two local partial matches \(PM_i\) and \(PM_j\) from fragments \(F_i\) and \(F_j\) with functions \(f_i\) and \(f_j\), respectively, if there exists a query vertex v where both \(f_i(v)\) and \(f_j(v)\) are internal vertices of fragments \(F_i\) and \(F_j\), respectively, \(PM_i\) and \(PM_j\) are not joinable.
Proof
If \(f_i(v)\ne f_j(v)\), then a vertex v in query Q matches two different vertices in \(PM_i\) and \(PM_j\), respectively. Obviously, \(PM_i\) and \(PM_j\) cannot be joinable.
If \(f_i(v)= f_j(v)\), since \(f_i(v)\) and \(f_j(v)\) are both internal vertices, both \(PM_i\) and \(PM_j\) are from the same fragment. As mentioned earlier, it is impossible to assemble two local partial matches from the same fragment (see the first paragraph of Sect. 5.1); thus, \(PM_i\) and \(PM_j\) cannot be joinable. \(\square \)
Example 9
Figure 7 shows the serialization vectors (defined in Definition 6) of four local partial matches. For each local partial match, there is an internal vertex that matches \(v_1\) in the query graph. The underline indicates the extended vertex in the local partial match. According to Theorem 1, none of them are joinable.
Definition 10
(Local partial match partitioning) Given a query Q with vertices \(\{v_1,\ldots ,v_n\}\) and the set \(\varOmega \) of all its local partial matches, \(\mathcal {P}=\{P_{v_1},\ldots ,P_{v_n}\}\) is a partitioning of \(\varOmega \) if and only if the following conditions hold:
- 1.
Each partition \(P_{v_i}\) (\(i=1,\ldots ,n\)) consists of a set of local partial matches, each of which has an internal vertex that matches \(v_i\).
- 2.
\(P_{v_i} \cap P_{v_j} = \emptyset \), where \(1 \le i \ne j \le n\).
- 3.
\(P_{v_1} \cup \ldots \cup P_{v_n} = \varOmega \)
Example 10
Let us consider all local partial matches of our running example in Fig. 4. Figure 8 shows two different partitionings.
The basic idea of Algorithm 3 is to iterate the join process over the partitions of \(\mathcal {P}\). First, we set MS \(\leftarrow \) \(P_{v_{1}}\) (Line 1 in Algorithm 3). Then, we try to join each local partial match PM in MS with each local partial match \(PM^{\prime }\) in \(P_{v_{2}}\) (the first loop of Lines 3–13). If the join result is a complete match, it is inserted into the answer set RS (Lines 8–9). If the join result is an intermediate result, we insert it into a temporary set \(MS^{\prime }\) (Lines 10–11). We also insert \(PM^{\prime }\) into \(MS^{\prime }\), since \(PM^{\prime }\) (in \(P_{v_{2}}\)) may join with local partial matches in later partitions of \(\mathcal {P}\) (Line 12). At the end of the iteration, we insert all intermediate results (in \(MS^{\prime }\)) into MS, which may join with local partial matches in later partitions in subsequent iterations (Line 13). We iterate the above steps for each remaining partition of \(\mathcal {P}\) (Lines 3–13).
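The iteration above can be sketched as follows. This is a schematic rendering of the partition-wise loop, with the joinability test, join operator, and completeness check passed in as parameters (they stand in for the machinery of Definitions 8 and 9):

```python
def partitioned_join(partitions, can_join, join, is_complete):
    # MS holds partial assemblies that may still grow; each round joins MS
    # against the next partition, as in Lines 3-13 of Algorithm 3.
    ms = list(partitions[0])
    results = []
    for part in partitions[1:]:
        ms_new = []
        for pm in ms:
            for pm2 in part:
                if can_join(pm, pm2):
                    r = join(pm, pm2)
                    if is_complete(r):
                        results.append(r)      # complete match: emit
                    else:
                        ms_new.append(r)       # keep for later partitions
        ms_new.extend(part)   # PMs of this partition may join later ones too
        ms.extend(ms_new)
    return results
```

A toy run with dict-based matches (joinable when they overlap consistently, complete when all query vertices are bound) reproduces the control flow of the algorithm.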
5.2.2 Finding the optimal partitioning
Obviously, given a set \(\varOmega \) of local partial matches, there may be multiple feasible local partial match partitionings, each of which leads to different join performance. In this subsection, we discuss how to find the “optimal” local partial match partitioning over \(\varOmega \), i.e., the one that minimizes the join time of Algorithm 4.
First, we need a measure that defines the join cost of a local partial match partitioning more precisely. We define it as follows.
Definition 11
(Join cost) Given a partitioning \(\mathcal {P}=\{P_{v_1},\ldots ,P_{v_n}\}\) over all local partial matches \(\varOmega \), the join cost of \(\mathcal {P}\) is defined as \(Cost(\mathcal {P})=\prod _{i=1}^{n} \max (|P_{v_i}|,1)\), i.e., the product of the sizes of all non-empty partitions.
Definition 11 assumes that each pair of local partial matches (from different partitions of \(\mathcal {P}\)) are joinable so that we can quantify the worst-case performance. Naturally, more sophisticated and more realistic cost functions can be used instead, but finding the most appropriate cost function is a major research issue in itself and outside the scope of this paper.
Example 11
The cost of the partitioning in Fig. 8a is \(5\times 4 \times 4=80\), while that of Fig. 8b is \(6\times 3 \times 4=72\). Hence, the partitioning in Fig. 8b has lower join cost.
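The worst-case cost of Definition 11 is just the product of the non-empty partition sizes; a one-liner reproduces the arithmetic of Example 11 (this is an illustration, not part of the system):

```python
from math import prod

def join_cost(partitioning):
    # Worst-case join cost: the product of the non-empty partition sizes,
    # assuming every cross-partition pair of local partial matches is joinable.
    return prod(len(p) for p in partitioning if p)
```

For the two partitionings of Fig. 8, `join_cost` yields 80 and 72, respectively.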
Based on the definition of join cost, the “optimal” local partial match partitioning is the one with minimal join cost. We formally define it as follows.
Definition 12
(Optimal partitioning). Given a partitioning \(\mathcal {P}\) over all local partial matches \(\varOmega \), \(\mathcal {P}\) is the optimal partitioning if and only if no other partitioning has a smaller join cost.
Unfortunately, Theorem 2 shows that finding the optimal partitioning is NP-complete.
Theorem 2
Finding the optimal partitioning is an NP-complete problem.
Proof
We can reduce the 0–1 integer programming problem to finding the optimal partitioning. We build a bipartite graph B, which contains two vertex groups \(B_1\) and \(B_2\). Each vertex \(a_j\) in \(B_1\) corresponds to a local partial match \(PM_j\) in \(\varOmega \), \(j=1,\ldots ,|\varOmega |\). Each vertex \(b_i\) in \(B_2\) corresponds to a query vertex \(v_i\), \(i=1,\ldots ,n\). We introduce an edge between \(a_j\) and \(b_i\) if and only if \(PM_j\) has an internal vertex matching query vertex \(v_i\). Let a variable \(x_{ji}\) denote the label of edge \(\overline{a_jb_i}\). Figure 9 shows an example bipartite graph for the local partial matches in Fig. 4.
The equivalence between 0–1 integer programming and finding the optimal partitioning is straightforward. The former is a classical NP-complete problem. Thus, the theorem holds. \(\square \)
Theorem 3
Given a query graph Q with n vertices \(\{v_1\),...,\(v_n\}\) and a set of all local partial matches \(\varOmega \), let \(U_{v_i}\) (\(i=1,\ldots ,n\)) be all local partial matches (in \(\varOmega \)) that have internal vertices matching \(v_i\). For the optimal partitioning \(\mathcal {P}_{opt}=\{P_{v_1},\ldots ,P_{v_n}\}\) where \(P_{v_n}\) has the largest size (i.e., the number of local partial matches in \(P_{v_n}\) is maximum) in \(\mathcal {P}_{opt}\), \(P_{v_n} = U_{v_n}\).
Proof
Therefore, in the optimal partitioning \(\mathcal {P}_{opt}\), we cannot find a local partial match PM, where \(|P_{v_n}|\) is the largest, \(PM \notin P_{v_n}\) and \(PM \in U_{v_n}\). In other words, \(P_{v_n}=U_{v_n}\) in the optimal partitioning. \(\square \)
Let \(\varOmega \) denote all local partial matches. Assume that the optimal partitioning is \(\mathcal {P}_{opt}=\{P_{v_1},P_{v_2},\ldots ,P_{v_n}\}\). We reorder the partitions of \(\mathcal {P}_{opt}\) in non-increasing order of size, i.e., \(\mathcal {P}_{opt}=\{P_{v_{k_1}},\ldots ,P_{v_{k_n}}\}\), \(|P_{v_{k_1}}| \ge |P_{v_{k_2}}| \ge \ldots \ge |P_{v_{k_n}}|\). According to Theorem 3, we can conclude that \(P_{v_{k_1}}=U_{v_{k_1}}\) in the optimal partitioning \(\mathcal {P}_{opt}\).
Let \(\varOmega _{\overline{v_{k_1}}}=\varOmega - U_{v_{k_1}}\), i.e., the set of local partial matches excluding those with an internal vertex matching \(v_{k_1}\). It is straightforward to see that \(Cost(\varOmega )_{opt}=|P_{v_{k_1}}| \times Cost(\varOmega _{\overline{v_{k_1}}})_{opt}=|U_{v_{k_1}}| \times Cost(\varOmega _{\overline{v_{k_1}}})_{opt}\). In the optimal partitioning over \(\varOmega _{\overline{v_{k_1}}}\), we assume that \(P_{v_{k_2}}\) has the largest size. Iteratively, according to Theorem 3, we know that \(P_{v_{k_2}} = U^{\prime }_{v_{k_2}}\), where \(U^{\prime }_{v_{k_2}}\) denotes the set of local partial matches with an internal vertex matching \(v_{k_2}\) in \(\varOmega _{\overline{v_{k_1}}}\).
According to the above analysis, once a vertex order is given, the partitioning over \(\varOmega \) is fixed. Assume that the optimal vertex order, i.e., the one leading to the minimum join cost, is \(\{v_{k_1},\ldots , v_{k_n}\}\). The partitioning algorithm works as follows.
Let \(U_{v_{k_1}}\) denote all local partial matches (in \(\varOmega \)) that have internal vertices matching vertex \(v_{k_1}\). Obviously, \(U_{v_{k_1}}\) is fixed once \(\varOmega \) and the vertex order are given. We set \(P_{v_{k_1}}=U_{v_{k_1}}\). In the second iteration, we remove all local partial matches in \(U_{v_{k_1}}\) from \(\varOmega \), obtaining \(\varOmega _{\overline{v_{k_1}}}=\varOmega - U_{v_{k_1}}\). We set \(U_{v_{k_2}}^{\prime }\) to be all local partial matches (in \(\varOmega _{\overline{v_{k_1}}}\)) that have internal vertices matching vertex \(v_{k_2}\). Then, we set \(P_{v_{k_2}}=U_{v_{k_2}}^{\prime }\). Iteratively, we can obtain \(P_{v_{k_3}},\ldots , P_{v_{k_n}}\).
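The peeling procedure above can be sketched directly. In this illustration (not the paper's code) each local partial match is modeled as a dict mapping a query vertex to a pair (data vertex, is_internal):

```python
def partition_by_order(omega, order):
    # Given a vertex order, the partitioning is fixed: P_v collects every
    # *remaining* local partial match with an internal vertex matching v,
    # and those matches are removed before the next vertex is processed.
    remaining = list(omega)
    partitioning = []
    for v in order:
        p_v = [pm for pm in remaining if v in pm and pm[v][1]]
        remaining = [pm for pm in remaining if not (v in pm and pm[v][1])]
        partitioning.append(p_v)
    return partitioning
```

Each match lands in the partition of the first vertex (in the given order) that it matches internally, so the result is a valid partitioning in the sense of Definition 10.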
Example 12
Consider all local partial matches in Fig. 11. Assume that the optimal vertex order is \(\{v_3,v_1,v_2\}\); we discuss how to find the optimal order later. In the first iteration, we set \(P_{v_{3}}=U_{v_{3}}\), which contains five matches. For example, \(PM_1^1=[\underline{002},NULL,005,NULL,027,NULL]\) is in \(U_{v_3}\), since internal vertex 005 matches \(v_3\). In the second iteration, we set \(\varOmega _{\overline{v_3}}=\varOmega -P_{v_{3}}\). Let \(U_{v_{1}}^{\prime }\) be all local partial matches in \(\varOmega _{\overline{v_3}}\) that have internal vertices matching vertex \(v_{1}\). Then, we set \(P_{v_{1}}=U_{v_{1}}^{\prime }\). Iteratively, we obtain the partitioning \(\{P_{v_{3}},P_{v_{1}}, P_{v_{2}}\}\), as shown in Fig. 11.
However, if Eq. 6 is used naively in the dynamic programming formulation, it results in repeated computations. For example, \(Cost(\varOmega _{\overline{v_1 v_2 } } )_{opt} \) would be computed twice, in both \(|U_{v_1} | \times |U_{v_2 }^{\prime }| \times Cost(\varOmega _{\overline{v_1 v_2 } } )_{opt}\) and \( |U_{v_2 } | \times |U_{v_1 }^{\prime }| \times Cost(\varOmega _{\overline{v_1 v_2 } } )_{opt} \). To avoid this, we introduce a map that records each \(Cost(\varOmega ^\prime )\) already calculated (Line 16 in Function OptComCost), so that subsequent uses of \(Cost(\varOmega ^\prime )\) can be served directly by looking it up in the map (Lines 8–10 in Function ComCost).
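The memoized recursion can be sketched as follows. This is an illustrative sketch of the search over vertex orders, not the paper's OptComCost/ComCost functions; it assumes every local partial match has at least one internal matched vertex, and encodes a match as a frozenset of (query vertex, data vertex, is_internal) triples so that sets of matches are hashable:

```python
from functools import lru_cache

def optimal_cost(omega, query_vertices):
    # Pick which vertex's partition is peeled off first, multiply its size
    # by the optimal cost of the rest, and memoize on the remaining set.
    @lru_cache(maxsize=None)
    def cost(remaining):
        if not remaining:
            return 1
        best = None
        for v in query_vertices:
            u_v = frozenset(pm for pm in remaining
                            if any(q == v and internal
                                   for q, _, internal in pm))
            if u_v:
                c = len(u_v) * cost(remaining - u_v)
                if best is None or c < best:
                    best = c
        return best
    return cost(frozenset(omega))
```

The `lru_cache` plays the role of the cost map: each distinct remaining set is evaluated once, no matter how many vertex orders lead to it.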
5.2.3 Join order
When we determine the optimal partitioning of local partial matches, the join order is also determined. If the optimal partitioning is \(\mathcal {P}_{opt}=\{P_{v_{k_1}},\ldots ,P_{v_{k_n}}\}\) and \(|P_{v_{k_1}}| \ge |P_{v_{k_2}}| \ge \ldots \ge |P_{v_{k_n}}|\), then the join order must be \(P_{v_{k_1}} \bowtie P_{v_{k_2}} \bowtie \ldots \bowtie P_{v_{k_n}}\). The reasons are as follows.
First, changing the join order does not prune any intermediate results. Let us recall the optimal partitioning \(\{P_{v_{3}}, P_{v_{2}}, P_{v_{1}}\}\) shown in Fig. 8b. The join order should be \(P_{v_{3}} \bowtie P_{v_{2}} \bowtie P_{v_{1}}\), and no change in the join order would prune intermediate results. For example, if we first join \(P_{v_{2}}\) with \(P_{v_{1}}\), we cannot prune the local partial matches in \(P_{v_{2}}\) that cannot join with any local partial match in \(P_{v_{1}}\), because some local partial matches in \(P_{v_{3}}\) may have an internal vertex matching \(v_1\) and can join with local partial matches in \(P_{v_{2}}\). In other words, the result of \(P_{v_{2}} \bowtie P_{v_{1}}\) is not smaller than \(P_{v_{2}}\). Similarly, any other change of the join order within the partitioning has no pruning effect.
Second, in some special cases, the join order may affect performance. Given a partitioning \(\mathcal {P}_{opt}=\{P_{v_{k_1}},\ldots ,P_{v_{k_n}}\}\) with \(|P_{v_{k_1}}| \ge |P_{v_{k_2}}| \ge \ldots \ge |P_{v_{k_n}}|\), if the set of the first \(n^\prime \) vertices, \(\{v_{k_1}, v_{k_2}, \ldots , v_{k_{n^\prime }}\}\), is a vertex cut of the query graph, the join order of the remaining \(n-n^\prime \) partitions of \(\mathcal {P}\) has an effect. For example, let us consider the partitioning \(\{P_{v_{1}}, P_{v_{3}}, P_{v_{2}}\}\) in Fig. 8a. If this partitioning is optimal, then joining \(P_{v_{1}}\) with \(P_{v_{2}}\) first and joining \(P_{v_{1}}\) with \(P_{v_{3}}\) first both work; however, their costs may differ. In the worst case, if the query graph is a complete graph, the join order has no effect on the performance.
In conclusion, when the optimal partitioning is determined as \(\mathcal {P}_{opt}=\{P_{v_{k_1}},\ldots ,P_{v_{k_n}}\}\) and \(|P_{v_{k_1}}| \ge |P_{v_{k_2}}| \ge \ldots \ge |P_{v_{k_n}}|\), then the join order must be \(P_{v_{k_1}} \bowtie P_{v_{k_2}} \bowtie \ldots \bowtie P_{v_{k_n}}\). The join cost can be estimated based on the cost function (Definition 11).
5.3 Distributed assembly
An alternative to centralized assembly is to assemble the local partial matches in a distributed fashion. We adopt the Bulk Synchronous Parallel (BSP) model [45] to design a synchronous algorithm for distributed assembly. A BSP computation proceeds in a series of global supersteps, each of which consists of three components: local computation, communication, and barrier synchronization. In the following, we discuss how we apply this model to distributed assembly.
5.3.1 Local computation
Each processor performs some computation based on the data stored in the local memory. The computations on different processors are independent in the sense that different processors perform the computation in parallel.
5.3.2 Communication
Processors exchange data among themselves. Consider the mth superstep. A straightforward communication strategy is as follows. If an intermediate result PM in \(\varDelta _{out}^{m}(F_i)\) shares a crossing edge with fragment \(F_j\), PM will be sent to site \(S_j\) from \(S_i\) (assuming fragments \(F_i\) and \(F_j\) are stored in sites \(S_i\) and \(S_j\), respectively).
However, the above communication strategy may generate duplicate results. For example, as shown in Fig. 4, we can assemble \(PM_1^4\) (at site \(S_1\)) and \(PM_3^1\) (at site \(S_3\)) to form a complete crossing match. Under the straightforward communication strategy, \(PM_1^4\) is sent from \(S_1\) to \(S_3\) to produce \(PM_1^4 \bowtie PM_3^1\) at \(S_3\); similarly, \(PM_3^1\) is sent from \(S_3\) to \(S_1\) and the same join is performed at \(S_1\). In other words, we obtain the join result \(PM_1^4 \bowtie PM_3^1\) at both sites \(S_1\) and \(S_3\). This wastes resources and increases total evaluation time.
To avoid duplicate result computation, we introduce a “divide-and-conquer” approach. We define a total order (\(\prec \)) over fragments \(\mathcal {F}\) in a non-descending order of \(|\varOmega (F_i)|\), i.e., the number of local partial matches in fragment \(F_i\) found at the partial evaluation stage.
Definition 13
Given any two fragments \(F_i\) and \(F_j\), \(F_i \prec F_j \) if and only if \(|\varOmega (F_i)| \le |\varOmega (F_j)|\) (\(1 \le i,j \le n\)).
Without loss of generality, we assume that \(F_1 \prec F_2 \prec \ldots \prec F_n\) in the remainder. The basic idea of the divide-and-conquer approach is as follows. Assume that a crossing match M is formed by joining local partial matches that are from different fragments \(F_{i_1}\),...,\(F_{i_m}\), where \(F_{i_1} \prec F_{i_2} \prec \ldots \prec F_{i_m}\) (\(1 \le i_1,\ldots ,i_m \le n\)). The crossing match should only be generated at fragment site \(S_{i_m}\) rather than other fragment sites.
For example, at site \(S_2\), we generate crossing matches by joining local partial matches from \(F_1\) and \(F_2\). The crossing matches generated at \(S_2\) should not contain any local partial matches from \(F_3\) or even larger fragments (such as \(F_4\),...,\(F_n\)). Similarly, at site \(S_3\), we should generate crossing matches by joining local partial matches from \(F_3\) and fragments smaller than \(F_3\). The crossing matches should not contain any local partial match from \(F_4\) or even larger fragments (such as \(F_5\),...,\(F_n\)).
The “divide-and-conquer” framework can avoid duplicate results, since each crossing match can be only generated at a single site according to the “divided search space.” To enable the “divide-and-conquer” framework, we need to introduce some constraints over data communication. The transmission (of local partial matches) from fragment site \(S_i\) to \(S_j\) is allowed only if \(F_i \prec F_j\).
Let us consider an intermediate result PM in \(\varDelta ^m_{out}(F_i)\). Assume that PM is generated by joining intermediate results from m different fragments \(F_{i_1},\ldots ,F_{i_m}\), where \(F_{i_1} \prec F_{i_2} \prec \ldots \prec F_{i_m}\). We send PM to another fragment \(F_j\) if and only if two conditions hold: (1) \(F_{i_m} \prec F_j\); and (2) \(F_j\) shares common crossing edges with at least one fragment of \(F_{i_1}\),...,\(F_{i_m}\).
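The routing decision can be sketched as a small predicate. This is an illustration of the two conditions above, not the system's code; it realizes the total order of Definition 13 by comparing local-partial-match counts, breaking ties by fragment id to make the order strict:

```python
def should_send(source_fragments, f_j, omega_sizes, shares_crossing_edge):
    # source_fragments: fragments whose partial matches formed PM.
    # omega_sizes: fragment -> |Omega(F)|; shares_crossing_edge: predicate.
    def key(f):
        return (omega_sizes[f], f)     # Definition 13, with id tie-break
    if any(key(f_j) <= key(f) for f in source_fragments):
        return False                   # f_j must exceed every source fragment
    return any(shares_crossing_edge(f, f_j) for f in source_fragments)
```

Because a crossing match assembled from fragments \(F_{i_1} \prec \ldots \prec F_{i_m}\) is only ever completed at the largest fragment \(F_{i_m}\), no two sites produce the same result.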
5.3.3 Barrier synchronization
All communication in the mth superstep must finish before the \((m+1)\)th superstep begins.
We now discuss the initial state (i.e., 0th superstep) and the system termination condition.
Initial state In the 0th superstep, each fragment \(F_i\) has only the local partial matches in \(F_i\), i.e., \(\varOmega _{F_i}\). Since it is impossible to assemble local partial matches from the same fragment, the 0th superstep requires no local computation and enters the communication stage directly. Each site \(S_i\) sends \(\varOmega _{F_i}\) to other fragments according to the communication strategy discussed above.
5.3.4 System termination condition
A key problem in the BSP algorithm is determining the number of supersteps needed before the system terminates. To facilitate the analysis, we propose using a fragmentation topology graph.
Definition 14
(Fragmentation topology graph) Given a fragmentation \(\mathcal {F}\) over an RDF graph G, the corresponding fragmentation topology graph T is defined as follows: each node in T is a fragment \(F_i\), \(i=1,\ldots ,n\). There is an edge between nodes \(F_i\) and \(F_j\) in T, \(1 \le i \ne j \le n\), if and only if there is at least one crossing edge between \(F_i\) and \(F_j\) in RDF graph G.
Let Dia(T) be the diameter of T. We need at most Dia(T) supersteps to transfer the local partial matches of one fragment \(F_i\) to any other fragment \(F_j\). Hence, the number of supersteps in the BSP-based algorithm is at most Dia(T).
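The superstep bound is just the diameter of the fragmentation topology graph, which can be computed with breadth-first search. This is an illustrative sketch (it assumes T is connected):

```python
from collections import deque

def superstep_bound(fragments, crossing_pairs):
    # Build T: nodes are fragments; edges exist where two fragments share
    # at least one crossing edge.  Return its diameter.
    adj = {f: set() for f in fragments}
    for a, b in crossing_pairs:
        adj[a].add(b)
        adj[b].add(a)

    def eccentricity(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        return max(dist.values())

    return max(eccentricity(f) for f in fragments)
```

For a chain of four fragments, for example, the bound is 3 supersteps.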
6 Handling general SPARQL
So far, we only consider basic graph pattern (BGP) query evaluation. In this section, we discuss how to extend our method to general SPARQL queries involving UNION, OPTIONAL and FILTER statements.
A general SPARQL query and SPARQL query results can be defined recursively based on BGP queries.
Definition 15
(General SPARQL query) Any BGP is a SPARQL query. If \(Q_1\) and \(Q_2\) are SPARQL queries, then expressions \((Q_1 \; AND \; Q_2)\), \((Q_1 \; UNION \; Q_2)\), \((Q_1 \; OPT \; Q_2)\) and \((Q_1\; FILTER\; F)\) are also SPARQL queries.
Figure 12 shows an example general SPARQL query with multiple operators, including UNION, OPTIONAL and FILTER. The set of all matches for Q is denoted as \(\llbracket Q \rrbracket \).
Definition 16
(General SPARQL query results)
- 1.
If Q is a BGP, \(\llbracket Q \rrbracket \) is the set of matches defined in Definition 3 of Section 3.
- 2.
If \(Q=Q_1 \; AND \; Q_2\), then \(\llbracket Q \rrbracket = \llbracket Q_1 \rrbracket \bowtie \llbracket Q_2 \rrbracket \)
- 3.
If \(Q=Q_1 \; UNION \; Q_2\), then \(\llbracket Q \rrbracket = \llbracket Q_1 \rrbracket \cup \llbracket Q_2 \rrbracket \)
- 4.
If \(Q=Q_1 \; OPT \; Q_2\), then \(\llbracket Q \rrbracket = (\llbracket Q_1 \rrbracket \bowtie \llbracket Q_2 \rrbracket ) \cup (\llbracket Q_1 \rrbracket \backslash \llbracket Q_2 \rrbracket )\)
- 5.
If \(Q=Q_1 \; FILTER \; F\), then \(\llbracket Q \rrbracket = \varTheta _F (\llbracket Q_1 \rrbracket )\)
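The five cases above can be sketched over sets of solution mappings (dicts from variables to values), using the standard compatible-mapping semantics of SPARQL. This is an illustrative sketch of the algebra, not the system's implementation:

```python
def compatible(m1, m2):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(r1, r2):
    """[[Q1 AND Q2]]: merge every compatible pair of mappings."""
    return [{**m1, **m2} for m1 in r1 for m2 in r2 if compatible(m1, m2)]

def union(r1, r2):
    """[[Q1 UNION Q2]]: keep mappings from either side."""
    return r1 + r2

def opt(r1, r2):
    """[[Q1 OPT Q2]]: left outer join -- extend m1 where possible, else keep it."""
    out = []
    for m1 in r1:
        extended = [{**m1, **m2} for m2 in r2 if compatible(m1, m2)]
        out.extend(extended if extended else [m1])
    return out

def filt(r1, pred):
    """[[Q1 FILTER F]]: keep mappings satisfying the filter predicate."""
    return [m for m in r1 if pred(m)]
```

For instance, `opt([{'x': 1}, {'x': 2}], [{'x': 1, 'y': 9}])` extends the first mapping with `y` but retains the second unextended, matching case 4.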
Further optimization of general SPARQL evaluation is also possible (e.g., [4]); however, this issue is orthogonal to the problem studied in this paper.
7 Experiments
Datasets
Dataset | Number of triples | RDF N3 file size (KB) | Number of entities |
---|---|---|---|
WatDiv 100M | 109,806,750 | 15,386,213 | 5,212,745 |
WatDiv 300M | 329,539,576 | 46,552,961 | 15,636,385 |
WatDiv 500M | 549,597,531 | 79,705,831 | 26,060,385 |
WatDiv 700M | 769,065,496 | 110,343,152 | 36,486,007 |
WatDiv 1B | 1,098,732,423 | 159,625,433 | 52,120,385 |
LUBM 1000 | 133,553,834 | 15,136,798 | 21,715,108 |
LUBM 10000 | 1,334,481,197 | 153,256,699 | 217,006,852 |
BTC | 1,056,184,911 | 238,970,296 | 183,835,054 |
7.1 Setting
- 1.
WatDiv [2] is a benchmark that enables diversified stress testing of RDF data management systems. In WatDiv, instances of the same type can have different attribute sets. We generate five datasets varying in size from 100 million to 1 billion triples. We use 20 queries of the basic testing templates provided by WatDiv [2] to evaluate our method. We randomly partition the WatDiv datasets into several fragments (except in Exp. 6, where we test different partitioning strategies). We assign each vertex v in the RDF graph to the ith fragment if \(H(v)\ \mathrm{MOD}\ N=i\), where H(v) is a hash function and N is the number of fragments. By default, we use a uniform hash function and \(N=10\). Each machine stores a single fragment.
- 2.
LUBM [17] is a benchmark that adopts an ontology for the university domain and can generate synthetic OWL data scalable to an arbitrary size. We set the number of universities to 10,000, which yields about 1.33 billion triples. We partition the LUBM datasets according to the university identifiers. Although LUBM defines 14 queries, some of these are similar; therefore, we use the 7 benchmark queries that have been used in some recent studies [5, 50]. We report the results over all 14 queries in Appendix B for completeness. As expected, the results over the 14 benchmark queries are similar to the results over the 7 queries.
- 3.
BTC 2012 (http://km.aifb.kit.edu/projects/btc-2012/) is a real dataset that serves as the basis of submissions to the Billion Triples Track of the Semantic Web Challenge. After eliminating all redundant triples, this dataset contains about 1 billion triples. We use METIS to partition the RDF graph, and use the 7 queries in [48].
- 4.
FedBench [41] is used for testing against federated systems; it is described in Appendix E along with the results.
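The default WatDiv partitioning above assigns vertex v to fragment \(H(v)\ \mathrm{MOD}\ N\). A minimal sketch under one assumption: the paper does not specify H, so a stable MD5-based hash is used here as a stand-in uniform hash function.

```python
import hashlib

def fragment_of(vertex, n_fragments=10):
    """Assign a vertex (IRI or literal as a string) to a fragment via
    H(v) MOD N. MD5 is a stand-in: the paper's hash function is unspecified."""
    h = int(hashlib.md5(vertex.encode("utf-8")).hexdigest(), 16)
    return h % n_fragments

def partition(vertices, n_fragments=10):
    """Group vertices into N fragments, one per machine."""
    frags = [[] for _ in range(n_fragments)]
    for v in vertices:
        frags[fragment_of(v, n_fragments)].append(v)
    return frags
```

Because the hash is deterministic, every site agrees on which fragment owns a vertex without any coordination, which is what makes hash partitioning attractive for random fragmentation.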
Table 2 Evaluation of each stage on WatDiv 1B
Partial evaluation | Assembly | Total | \(\#\) of LPMFs\(^\mathrm{h}\) | \(\#\) of CMFs\(^\mathrm{i}\) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (in ms) | \(\#\) of LPMs\(^\mathrm{b}\) | \(\#\) of IMs\(^\mathrm{c}\) | Time (in ms) | \(\#\) of CMs\(^\mathrm{d}\) | Time (in ms) | \(\#\) of Matches\(^\mathrm{g}\) | ||||||||
Centralized | Distributed | PECA\(^\mathrm{e}\) | PEDA\(^\mathrm{f}\) | |||||||||||
Star | \(S_1\) | \(\surd ^{a}\) | 43,803 | 0 | 1 | 0 | 0 | 0 | 43,803 | 43,803 | 1 | 0 | 0 | |
\(S_2\) | \(\surd \) | 74,479 | 0 | 13,432 | 0 | 0 | 0 | 74,479 | 74,479 | 13,432 | 0 | 0 | ||
\(S_3\) | \(\surd \) | 8087 | 0 | 13,335 | 0 | 0 | 0 | 8087 | 8087 | 13,335 | 0 | 0 | ||
\(S_4\) | \(\surd \) | 16,520 | 0 | 2 | 0 | 0 | 0 | 16,520 | 16,520 | 1 | 0 | 0 | ||
\(S_5\) | \(\surd \) | 1861 | 0 | 112 | 0 | 0 | 0 | 1861 | 1861 | 940 | 0 | 0 | ||
\(S_6\) | \(\surd \) | 50,865 | 0 | 14 | 0 | 0 | 0 | 50,865 | 50,865 | 14 | 0 | 0 | ||
\(S_7\) | \(\surd \) | 56,784 | 0 | 1 | 0 | 0 | 0 | 56,784 | 56,784 | 1 | 0 | 0 | ||
Linear | \(L_1\) | \(\surd \) | 15,340 | 2 | 0 | 1 | 16 | 1 | 15,341 | 15,356 | 1 | 2 | 2 | |
\(L_2\) | \(\surd \) | 1492 | 794 | 88 | 18 | 130 | 793 | 1510 | 1622 | 881 | 10 | 10 | ||
\(L_3\) | \(\surd \) | 16,889 | 0 | 5 | 0 | 0 | 0 | 16,889 | 16,889 | 5 | 0 | 0 | ||
\(L_4\) | \(\surd \) | 261 | 0 | 6005 | 0 | 0 | 0 | 261 | 261 | 6005 | 0 | 0 | ||
\(L_5\) | \(\surd \) | 48,055 | 1274 | 141 | 572 | 1484 | 1273 | 48,627 | 49,539 | 1414 | 10 | 10 | ||
Snowflake | \(F_1\) | \(\surd \) | 64,699 | 29 | 1 | 9 | 49 | 14 | 64,708 | 64,748 | 15 | 10 | 10 | |
\(F_2\) | \(\surd \) | 203,968 | 2184 | 99 | 1598 | 3757 | 1092 | 205,566 | 207,725 | 1191 | 10 | 10 | |
\(F_3\) | \(\surd \) | 2,341,932 | 4,065,632 | 58 | 3,673,409 | 2,489,325 | 6200 | 6,015,341 | 4,831,257 | 6258 | 10 | 10 | |
\(F_4\) | \(\surd \) | 251,546 | 6909 | 0 | 13,693 | 8864 | 1808 | 265,239 | 260,410 | 1808 | 10 | 10 | |
\(F_5\) | \(\surd \) | 25,180 | 92 | 3 | 58 | 1028 | 46 | 25,238 | 26,208 | 49 | 10 | 10 | ||
Complex | \(C_1\) | 206,864 | 161,803 | 4 | 9195 | 5265 | 356 | 216,059 | 212,129 | 360 | 10 | 10 | ||
\(C_2\) | 1,613,525 | 937,198 | 0 | 229,381 | 174,167 | 155 | 1,842,906 | 1,787,692 | 155 | 10 | 10 | | |
\(C_3\) | 123,349 | 0 | 80,997 | 0 | 0 | 0 | 123,349 | 123,349 | 80,997 | 0 | 0 |
Table 3 Evaluation of each stage on LUBM 1000
Partial evaluation | Assembly | Total | \(\#\) of LPMFs | \(\#\) of CMFs | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (in ms) | \(\#\) of LPMs | \(\#\) of IMs | Time (in ms) | \(\#\) of CMs | Time (in ms) | \(\#\) of Matches | ||||||||
Centralized | Distributed | PECA | PEDA | |||||||||||
Star | \(Q_2\) | 1818 | 0 | 1,081,187 | 0 | 0 | 0 | 1818 | 1818 | 1,081,187 | 0 | 0 | |
\(Q_4\) | \(\surd \) | 82 | 0 | 10 | 0 | 0 | 0 | 82 | 82 | 10 | 0 | 0 | ||
\(Q_5\) | \(\surd \) | 8 | 0 | 10 | 0 | 0 | 0 | 8 | 8 | 10 | 0 | 0 | ||
Snowflake | \(Q_6\) | \(\surd \) | 158 | 6707 | 110 | 164 | 125 | 15 | 322 | 283 | 125 | 10 | 10 | |
Complex | \(Q_1\) | 52,548 | 3033 | 2524 | 53 | 60 | 4 | 52,601 | 52,608 | 2528 | 10 | 10 | ||
\(Q_3\) | 920 | 3358 | 0 | 36 | 48 | 0 | 956 | 968 | 0 | 10 | 0 | |||
\(Q_7\) | 3945 | 167,621 | 42,479 | 211,670 | 35,856 | 1709 | 215,615 | 39,801 | 44,190 | 10 | 10 |
7.2 Exp 1: Evaluating each stage’s performance
In this experiment, we study the performance of our system at each stage (i.e., partial evaluation and assembly process) with regard to different queries in WatDiv 1B and LUBM 1000. We report the running time of each stage (i.e., partial evaluation and assembly) and the number of local partial matches, inner matches, and crossing matches, with regard to different query types in Tables 2 and 3. We also compare the centralized and distributed assembly strategies. The time for assembly includes the time for computing the optimal join order. Note that we classify SPARQL queries into four categories according to query graphs’ structures: star, linear, snowflake (several stars linked by a path) and complex (a combination of the above with complex structure).
7.2.1 Partial evaluation
Tables 2 and 3 show that when a query contains selective triple patterns^{10}, partial evaluation is much faster than for queries without them. Our partial evaluation algorithm (Algorithm 1) is based on state transformation, and selective triple patterns reduce the search space. Furthermore, the running time also depends on the number of inner matches and local partial matches, as given in Tables 2 and 3: more inner matches and local partial matches lead to higher running times in the partial evaluation stage.
7.2.2 Assembly
In this experiment, we compare centralized and distributed assembly approaches. Obviously, there is no assembly process for a star query. Thus, we only study the performance of linear, snowflake and complex queries. We find that distributed assembly outperforms the centralized one when there are many local partial matches and crossing matches. The reason is as follows: in centralized assembly, all local partial matches must be sent to the server where they are assembled. Obviously, if there are many local partial matches, the server becomes the bottleneck. In distributed assembly, by contrast, we can take advantage of parallelization to speed up both the network communication and the assembly. For example, \(F_3\) has 4,065,632 local partial matches; with centralized assembly it takes a long time to transfer them to the server and assemble them there, so distributed assembly outperforms the centralized alternative. However, if the numbers of local partial matches and crossing matches are small, the barrier synchronization cost dominates the total cost in distributed assembly, and its advantage disappears. A quantitative comparison between distributed and centralized assembly would require more statistics about network communication, CPU and other parameters; such a study is beyond the scope of this paper and is left as future work.
In Tables 2 and 3, we also show the number of fragments involved in each test query. For most queries, their local partial matches and crossing matches involve all fragments. Queries containing selective triple patterns (\(L_1\) in WatDiv) may only involve a part of the fragmentation.
7.3 Exp 2: Evaluating optimizations in assembly
In this experiment, we use WatDiv 1B to evaluate two different optimization techniques in the assembly: the partitioning-based join strategy (Sect. 5.1) and the divide-and-conquer approach in distributed assembly (Sect. 5.3). If a query has no local partial matches in RDF graph G, it does not need the assembly process. Therefore, we only use the benchmark queries that need assembly (\(L_1\), \(L_2\), \(L_5\), \(F_1\), \(F_2\), \(F_3\), \(F_4\), \(F_5\), \(C_1\) and \(C_2\)) in our experiments.
7.3.1 Partitioning-based join
First, we compare partitioning-based join (i.e., Algorithm 3) with naive join processing (i.e., Algorithm 2) in Table 4, which shows that the partitioning-based strategy can greatly reduce the join cost. Second, we evaluate the effectiveness of our cost model. Note that the join order depends on the partitioning strategy, which is based on our cost model as discussed in Sect. 5.2.2. In other words, once the partitioning is given, the join order is fixed. So, we use the cost model to find the optimal partitioning and report the running time of the assembly process in Table 4. We find that the assembly with the optimal partitioning is faster than that with random partitioning, which confirms the effectiveness of our cost model. Especially for \(C_2\), the assembly with the optimal partitioning is an order of magnitude faster than the assembly with random partitioning.
7.3.2 Divide-and-conquer in distributed assembly
Table 5 shows that dividing the search space speeds up distributed assembly; without it, duplicate results can be generated, as discussed in Sect. 5.3. Eliminating duplicates and parallelization together speed up distributed assembly. For example, for \(C_1\), dividing the search space cuts the assembly time by more than half compared with not dividing it.
7.4 Exp 3: Scalability test
Table 4 Running time of partitioning-based join versus naive join (in ms)
Partitioning-based join based on the optimal partitioning | Partitioning-based join based on the random partitioning | Naive join | |
---|---|---|---|
\(L_1\) | 1 | 1 | 1 |
\(L_2\) | 18 | 23 | 139 |
\(L_5\) | 572 | 622 | 3419 |
\(F_1\) | 1 | 1 | 1 |
\(F_2\) | 1598 | 2286 | 48,096 |
\(F_3\) | 3,673,409 | 4,005,409 | Timeout\(^\mathrm{a}\)
\(F_4\) | 13,693 | 13,972 | Timeout |
\(F_5\) | 58 | 80 | 8383 |
\(C_1\) | 9195 | 10,582 | Timeout |
\(C_2\) | 229,381 | 4,083,181 | Timeout
Table 5 Dividing versus no dividing (in ms)
Distributed assembly time (in ms) | ||
---|---|---|
Dividing | No dividing | |
\(L_1\) | 16 | 19 |
\(L_2\) | 130 | 151 |
\(L_5\) | 1484 | 1684 |
\(F_1\) | 49 | 55 |
\(F_2\) | 3757 | 5481 |
\(F_3\) | 2,489,325 | 4,439,430
\(F_4\) | 8864 | 19,759 |
\(F_5\) | 1028 | 1267 |
\(C_1\) | 5265 | 12,194 |
\(C_2\) | 174,167 | 225,062 |
Note that, as mentioned in Exp. 1, there is no assembly process for star queries, since matches of a star query cannot cross two fragments. Therefore, the query response times for star queries with centralized and distributed assembly are the same. In contrast, for other query types, local partial matches and crossing matches cause performance differences between centralized and distributed assembly. Here, \(L_3\), \(L_4\) and \(C_3\) are special cases: although they are not star queries, they have few local partial matches, and their crossing match counts are 0 (Table 2). Therefore, their assembly times are so small that the query response times with centralized and distributed assembly are almost the same.
7.5 Exp 4: Intermediate result size and query performance versus query decomposition approaches
Table 6 compares the number of intermediate results in our method with two typical query decomposition approaches, i.e., GraphPartition and TripleGroup. We use undirected 1-hop guarantee for GraphPartition and 1-hop bidirection semantic hash partition for TripleGroup. The dataset is still WatDiv 1B.
A star query has no intermediate results, since it can be answered locally at each fragment by every method. Thus, all methods have the same response time, as given in Table 7 (\(S_1\)–\(S_7\)).
For other query types, both GraphPartition and TripleGroup need to decompose them into several star subqueries and find the matches of these subqueries (in each fragment) as intermediate results. Neither GraphPartition nor TripleGroup distinguishes the star subquery matches that contribute to crossing matches from those that contribute to inner matches: all star subquery matches are involved in the assembly process. In our method, by contrast, only local partial matches are involved in the assembly process, leading to lower communication and assembly computation costs. Therefore, the intermediate results that need to be assembled are smaller in our approach.
Table 6 Number of intermediate results of different approaches on different partitioning strategies
PECA & PEDA | GraphPartition | TripleGroup | |
---|---|---|---|
\(S_1\)–\(S_7\) | 0 | 0 | 0 |
\(L_1\) | 2 | 249,571 | 249,598 |
\(L_2\) | 794 | 73,307 | 79,630 |
\(L_3\)–\(L_4\) | 0 | 0 | 0 |
\(L_5\) | 1274 | 99,363 | 99,363 |
\(F_1\) | 29 | 76,228 | 15,702 |
\(F_2\) | 2184 | 501,146 | 1,119,881 |
\(F_3\) | 4,065,632 | 4,515,731 | 4,515,752 |
\(F_4\) | 6909 | 132,193 | 329,426 |
\(F_5\) | 92 | 2,500,773 | 9,000,762 |
\(C_1\) | 161,803 | 4,551,562 | 4,451,693 |
\(C_2\) | 937,198 | 1,457,156 | 2,368,405 |
\(C_3\) | 0 | 0 | 0 |
Table 7 Query response time of different approaches (in milliseconds)
PECA | PEDA | GraphPartition | TripleGroup | |
---|---|---|---|---|
\(S_1\) | 43,803 | 43,803 | 43,803 | 43,803 |
\(S_2\) | 74,479 | 74,479 | 74,479 | 74,479 |
\(S_3\) | 8087 | 8087 | 8087 | 8087 |
\(S_4\) | 16,520 | 16,520 | 16,520 | 16,520 |
\(S_5\) | 1861 | 1861 | 1861 | 1861 |
\(S_6\) | 50,865 | 50,865 | 50,865 | 50,865 |
\(S_7\) | 56,784 | 56,784 | 56,784 | 56,784 |
\(L_1\) | 15,341 | 15,776 | 40,840 | 39,570 |
\(L_2\) | 1510 | 1622 | 36,150 | 36,420 |
\(L_3\) | 16,889 | 16,889 | 16,889 | 16,889 |
\(L_4\) | 261 | 261 | 261 | 261 |
\(L_5\) | 48,627 | 49,539 | 57,550 | 57,480 |
\(F_1\) | 64,708 | 64,748 | 66,230 | 66,200 |
\(F_2\) | 205,566 | 207,725 | 240,700 | 248,180 |
\(F_3\) | 6,015,341 | 4,831,257 | 6,244,000 | 6,142,800 |
\(F_4\) | 265,239 | 260,410 | 340,540 | 340,600 |
\(F_5\) | 25,238 | 29,208 | 52,180 | 91,110 |
\(C_1\) | 216,059 | 212,129 | 216,720 | 223,670 |
\(C_2\) | 1,842,906 | 1,787,692 | 1,954,800 | 2,168,300 |
\(C_3\) | 123,349 | 123,349 | 123,349 | 123,349 |
Table 8 Query response time under different partitioning strategies (in ms)
Uniform | Exponential | Min-cut | |
---|---|---|---|
\(S_1\) | |||
PECA | 4095 | 7472 | 3210 |
PEDA | 4095 | 7472 | 3210 |
\(S_2\) | |||
PECA | 5910 | 5830 | 5053 |
PEDA | 5910 | 5830 | 5053 |
\(S_3\) | |||
PECA | 869 | 2003 | 1098 |
PEDA | 869 | 2003 | 1098 |
\(S_4\) | |||
PECA | 1506 | 1532 | 1525 |
PEDA | 1506 | 1532 | 1525 |
\(S_5\) | |||
PECA | 208 | 384 | 255 |
PEDA | 208 | 384 | 255 |
\(S_6\) | |||
PECA | 5153 | 5642 | 4145 |
PEDA | 5153 | 5642 | 4145 |
\(S_7\) | |||
PECA | 5047 | 5720 | 4085 |
PEDA | 5047 | 5720 | 4085 |
\(L_1\) | |||
PECA | 2301 | 4271 | 3162 |
PEDA | 2325 | 4296 | 3168 |
\(L_2\) | |||
PECA | 271 | 502 | 261 |
PEDA | 339 | 505 | 297 |
\(L_3\) | |||
PECA | 1115 | 2122 | 1334 |
PEDA | 1115 | 2122 | 1334 |
\(L_4\) | |||
PECA | 37 | 54 | 27 |
PEDA | 37 | 54 | 27 |
\(L_5\) | |||
PECA | 7741 | 6736 | 4984 |
PEDA | 7863 | 6946 | 5163 |
\(F_1\) | |||
PECA | 5754 | 7889 | 4386 |
PEDA | 5768 | 7943 | 4415 |
\(F_2\) | |||
PECA | 11,809 | 16,461 | 10,209
PEDA | 11,832 | 16,598 | 10,539 |
\(F_3\) | |||
PECA | 246,277 | 155,064 | 122,539 |
PEDA | 163,642 | 115,214 | 103,618 |
\(F_4\) | |||
PECA | 26,439 | 37,608 | 21,979 |
PEDA | 26,421 | 36,817 | 22,030
\(F_5\) | |||
PECA | 11,630 | 16,433 | 8735 |
PEDA | 11,654 | 16,501 | 8262
\(C_1\) | |||
PECA | 14,980 | 30,271 | 14,131 |
PEDA | 14,667 | 29,861 | 13,807
\(C_2\) | |||
PECA | 147,962 | 105,926 | 36,038 |
PEDA | 147,406 | 104,084 | 35,220
\(C_3\) | |||
PECA | 11,631 | 16,368 | 13,959 |
PEDA | 11,631 | 16,368 | 13,959 |
7.6 Exp 5: Performance on RDF datasets with one billion triples
This experiment is a comparative evaluation of our method against GraphPartition, TripleGroup and EAGRE on three very large RDF datasets with more than one billion triples, WatDiv 1B, LUBM 10000 and BTC. Figure 16 shows the performance of different approaches.
Note that almost half of the queries (\(S_1\)–\(S_7\), \(L_3\), \(L_4\) and \(C_3\) in WatDiv; \(Q_2\), \(Q_4\) and \(Q_5\) in LUBM; \(Q_1\), \(Q_2\) and \(Q_3\) in BTC) generate no intermediate results in any of the approaches. For these queries, the response times of our approaches and the partition-based approaches are the same. However, for the other queries, the gap between our approach and the others is significant. For example, for \(L_2\) in WatDiv, \(Q_3\), \(Q_6\) and \(Q_7\) in LUBM, and \(Q_3\), \(Q_4\), \(Q_5\) and \(Q_6\) in BTC, our approach outperforms the others by one or more orders of magnitude. We have already explained the reasons for GraphPartition and TripleGroup in Exp 4; the reasons for EAGRE's performance follow.
EAGRE stores all triples as flat files in HDFS and answers SPARQL queries by scanning the files. Because HDFS does not provide fine-grained data access, a query can only be evaluated by a full scan of the files followed by a MapReduce job to join the intermediate results. Although EAGRE proposes some techniques to reduce I/O and data processing, it is still very costly. In contrast, we use graph matching to answer queries, which avoids scanning the whole dataset.
7.7 Exp 6: Impact of different partitioning strategies
In this experiment, we test the performance under three different partitioning strategies over WatDiv 100 M. The impact of different partitioning strategies is shown in Table 8. We implement three partitioning strategies: uniformly distributed hash partitioning, exponentially distributed hash partitioning, and minimum-cut graph partitioning.
The first partitioning strategy uniformly hashes a vertex v in RDF graph G to a fragment (machine). Thus, fragments on different machines have approximately the same size. The second strategy uses an exponentially distributed hash function with a rate parameter of 0.5: each vertex v has a probability of \(0.5^k\) of being assigned to fragment (machine) k. This partitioning strategy results in skewed fragment sizes. Finally, we use a min-cut-based partitioning strategy (i.e., the METIS algorithm) to partition graph G.
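The skewed assignment (fragment k with probability \(0.5^k\)) can be sketched by geometric sampling. This is an illustrative sketch under one assumption: with a finite number of fragments the probabilities \(0.5^k\) do not sum to exactly 1, so we let the last fragment absorb the leftover mass.

```python
import random

def exp_fragment(n_fragments, rate=0.5, rng=random):
    """Sample a fragment id so that fragment k (0-based) is chosen with
    probability rate^(k+1); the last fragment absorbs the leftover mass.
    This finite-N handling is our assumption, not specified in the text."""
    for k in range(n_fragments - 1):
        if rng.random() < rate:
            return k
    return n_fragments - 1
```

With 5 fragments, roughly half the vertices land in fragment 0, a quarter in fragment 1, and so on, producing exactly the size skew the experiment is designed to stress.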
The minimum-cut partitioning strategy generally leads to fewer crossing edges than the other two. Thus, it beats the other two approaches in most cases, especially for complex queries (such as the F and C category queries). For example, for \(C_2\), minimum-cut partitioning is more than four times faster than uniform partitioning. For star queries (i.e., S category queries), since there are no crossing matches, uniform partitioning and minimum-cut partitioning have similar performance; sometimes uniform partitioning is better, but the performance gap is very small. Due to the skew in fragment sizes, exponentially distributed hashing performs worse, in most cases, than uniformly distributed hashing.
Although our partial evaluation and assembly framework is agnostic to the particular partitioning strategy, it clearly works better when fragment sizes are balanced and crossing edges are minimized. Many heuristic minimum-cut graph partitioning algorithms (a typical one is METIS [31]) satisfy these requirements.
7.8 Exp 7: Comparing with memory-based distributed RDF systems
We compare our approach (which is disk-based) against TriAD [18] and Trinity.RDF [47], which are memory-based distributed systems. To enable a fair comparison, we cache the whole RDF graph together with the corresponding index into memory. Experiments show that our system is faster than Trinity.RDF and TriAD on these benchmark queries. Results are given in Appendix D.
7.9 Exp 8: Comparing with federated SPARQL systems
We compare our approach against federated SPARQL systems over FedBench; the setup and results are reported in Appendix E.
7.10 Exp 9: Comparing with centralized RDF systems
In this experiment, we compare our method with RDF-3X in LUBM 10000. Table 9 shows the results.
Table 9 Comparison with centralized system (in ms)
RDF-3X | PECA | PEDA | |
---|---|---|---|
\(Q_1\) | 1,084,047 | 326,167 | 309,361
\(Q_2\) | 81,373 | 23,685 | 23,685 |
\(Q_3\) | 72,257 | 10,239 | 10,368 |
\(Q_4\) | 7 | 753 | 753 |
\(Q_5\) | 6 | 125 | 125 |
\(Q_6\) | 355 | 3388 | 1914 |
\(Q_7\) | 146,325 | 143,779 | 46,123
8 Conclusion
In this paper, we propose a graph-based approach to distributed SPARQL query processing that adopts the partial evaluation and assembly framework. This is a two-step process. In the first step, we evaluate a query Q on each graph fragment in parallel to find local partial matches, each of which, intuitively, is the overlap between a crossing match and a fragment. The second step assembles these local partial matches to compute crossing matches. Two different assembly strategies are proposed in this work: centralized assembly, where all local partial matches are sent to a single site, and distributed assembly, where the local partial matches are assembled at a number of sites in parallel.
The main benefits of our method are twofold. First, our solution is partition-agnostic, as opposed to existing partition-based methods, each of which depends on a particular RDF graph partitioning strategy that may be infeasible to enforce in certain circumstances; our method is therefore much more flexible. Second, compared with other partition-based methods, the number of vertices and edges involved in the intermediate results is minimized in our method, which is proven theoretically and demonstrated experimentally.
There are a number of extensions we are currently working on. An important one is handling SPARQL queries over linked open data (LOD). We can treat the interconnected RDF repositories (in LOD) as a virtually integrated distributed database. Some RDF repositories provide SPARQL endpoints, while others may not have query capability; therefore, data at such sites need to be moved for processing, which will affect the algorithm and cost functions. Furthermore, multiple SPARQL query optimization in the context of distributed RDF graphs is also ongoing work. In real applications, queries issued at the same time commonly overlap, so there is much room for sharing computation when executing them. This observation motivates us to revisit the classical problem of multi-query optimization in the context of distributed RDF graphs.
\(f_j(v)=NULL\) means that vertex v in query Q is not matched in local partial match \(PM_j\). This is formally defined in condition (2) of Definition 6.
An algorithm is called fixed-parameter tractable for a problem of size l, with respect to a parameter n, if it can be solved in time O(f(n)g(l)), where f(n) can be any function but g(l) must be polynomial [10].
When we find local partial matches in fragment \(F_i\) and send them to join, we tag which vertices in local partial matches are internal vertices of \(F_i\).
A problem is said to have optimal substructure if an optimal solution can be constructed efficiently from optimal solutions of its subproblems [9]. This property is often used in dynamic programming formulations.
We use ANTLR v3's grammar, which is an implementation of the SPARQL grammar specification. It is available at http://www.antlr3.org/grammar/1200929755392/.
A triple pattern t is a "selective triple pattern" if it has no more than 100 matches in RDF graph G.