DistanceAware Selective Online Query Processing Over Large Distributed Graphs
 1.2k Downloads
 2 Citations
Abstract
Performing online selective queries against graphs is a challenging problem due to the unbounded nature of graph queries which leads to poor computation locality. It becomes even difficult when a graph is too large to be fit in the memory. Although there have been emerging efforts on managing large graphs in a distributed and parallel setting, e.g., Pregel, HaLoop and etc, these computing frameworks are designed from the perspective of scalability instead of the query efficiency. In this work, we present our solution methodology for online selective graph queries based on the shortest path distance semantic, which finds various applications in practice. The essential intuition is to build a distanceaware index for online distancebased query processing and to eliminate redundant graph traversal as much as possible. We discuss how the solution can be applied to two types of research problems, distance join and vertex set bonding, which are distancebased graph pattern discovery and finding the structurewise bonding of vertices, respectively.
Keywords
Graph processing Shortest path distance Graph partition1 Introduction
The tremendous size of realworld graph data raises series of challenges in efficient graph management and query processing. With the generic vertexcentric model [22] applied to practices, there has been a huge advancement in largescale graph analytic tasks, e.g. PageRank, SCC, subgraph listing and etc. More efficient generic graph processing frameworks, like the subgraphcentric model [33, 35], are emerging to accelerate the graph analytic task. However, online graph query which finds various applications in real practice does not attract much research effort, especially the highly selective queries which has an limited output size. In this work, we study the problem of answering shortest path distancebased selective graph queries in a online fashion, such that ad hoc queries of such type can be answered promptly.
Graph queries using the shortest path or shortest path distance semantic are widely used in practices. For example, one popular way to define the similarity of two vertices in the network would be the similarity of their distance vector to a number of preselected vertices. Other examples like influential maximization, or adaptive betweenness calculation and etc., are all based on the shortest path semantic. Therefore, effective distance estimation as well as efficient shortest path retrieving are essential for online graph queries. However, one fundamental issue is the unbounded nature of graph queries which often leads to poor computation locality. For example, to find out the shortest path(s) from vertex u to v, a naive BFS would access a large number of vertices in the graph to answer the query. In an extreme case, if graph G is a social network, and the distance between u and v is five, a shortest path query evaluation much likely leads to an entire graph visit due to the sixseparation law [13].
Various graph indexing techniques have been proposed to speed up the query evaluation, like the embedding technique introduced in GStore [39] for efficient SPARQL query processing, independent setbased labeling [10], the distance oracle approach [28] and etc. However, effective generic index structures for adhoc graph queries are spaceconsuming and involve a long setup time. Therefore, answering online selective graph queries with a combination of lightweight indices and fast graph exploration makes it a feasible solution with respect to both time and space efficiency.

We propose a novel partition strategy for webscaled graphs in the sharednothing distributed environment to efficiently support the pairwise shortest path estimation;

We develop a vertex filtering scheme to effectively support guided graph exploration, such that the cost on redundant vertex accessing could be significantly saved;

We show how our technique can be applied to two types of powerful graph queries;

We discuss our prototype implementation with extensive experiments over both real and synthetic webscaled graphs on an indoor cluster.
2 Problem Definition
In this section, we first clarify the problem definition and notations used in this paper. For comprehensiveness, we present an overview of our solution before diving into the any technical details.
Given graph \(G = \langle V,E\rangle , E\subset V\times V\), a path from vertex u to v is a sequence P of edges: \(e(x_0,x_1)\),...,\(e(x_{i1},x_i)\), where \(e(x_i,x_j)\in E\) and \(x_0=u, x_i=v\). Intuitively, the shortest path from u to v, denoted as SP(u, v), is the edge sequence of the minimum length. The shortest path distance, denoted as d(u, v), is the length of a shortest path. Although one may precompute all pairwise shortest path distance by bruteforce and materialize all the results, the space cost would be at least \(O(n^2)\), which is prohibited in practical usage. On the contrary, we would like to find a quality guaranteed pairwise distance estimation with as less space cost as possible.
Definition 1
As shown in the definition, all pairwise shortest path distance should be estimated in constant time with quality guarantees. The fundamental challenge is to minimize the space consumption as much as possible. Clearly, if we let \(\epsilon \) be 0, then the problem can only be solved by precompute all pairwise distances. Thus, the problem essentially asks for the space lower bound of a parameteradjustable solution for distance estimation.
Although finding the shortest path is a well established problem, we define the optimal shortest path computing problem under an exploration semantic as follows:
Definition 2
(Graph exploration strategy) Given graph \(G = \langle V,E\rangle \), a vertex u is explored only if \(\exists e(u,v)\in E\) and v is explored.
Definition 3
(Atomic graph exploration cost) Given vertex \(u\in V\), the atomic graph exploration cost of u is \(\{e(u,*)\}\), where {\(e(u,*)\)} denotes the set of edges going out from u.
Definition 4
(Optimal shortest path computing) Given a pairwise shortest path query \(Q_{SP}(u,v)\), let \({\mathcal {V}}\) be a set of vertices to access to answer the query in an exploration manner, then the optimal goal is \(\underset{{\mathcal {V}}}{\arg }\!\min\nolimits\sum _{e(v,*)}\), where \(v\in {\mathcal {V}}\).
Intuitively, the optimal shortest path computing is to locate the minimum set of vertices to access in order to answer the query. Unfortunately, the exploration based path computing has been proven to be untractable in terms of the number of vertex access. Thus, we shall study effective heuristics that eliminate redundant vertex access as much as possible.
The solution we introduced for these two problems are somehow correlated. We first introduce a partitionbased distance estimation index, which effectively estimates d(u, v) in constant time. Essentially, we partition the graph into a set of subgraph pairs, such that vertices that are far away enough from each other would be grouped into different partitions. Thus, we can use the distance between partitions to estimate the true pairwise distance. Based on the lightening fast yet accurate (with error guarantee) distance estimation, we can perform a guided graph exploration. Moreover, we introduce the landmarkbased guided graph exploration and probebased graph exploration which requires much less space overhead.
3 PartitionBased Distance Index
Being a crucial criteria of online queries, the latency of query processing should be reduced as much as possible. To address this efficiency issue, we partition graph G based on its summary graph extracted from the Delaunay triangulation of a set of selected vertices. A WellSeparatedSubgraph Decomposition method is employed to guarantee a distanceaware partition. Since the partition task is beyond the capability of a single standalone server, we first randomly partition G to all computing nodes and compute the summary graph in a distributed manner. As the summary graph is small, a partition schema can be derived in one server. Afterward, a repartition of G is performed among all computing nodes.
This process includes three steps, selecting a subset of vertices \(V'\) from G, building the Voronoi diagrams \(Vor(G_{V'})\) and extracting a graph which is the corresponding Delaunay triangulation of \(V'\), i.e., \(DT(G_{V'})\).
Step 1. Selecting \(V'\). We adopt the betweenness approximation method proposed in a series of work [2, 4, 21] to sample \(V'\). The cardinality of \(V'\), however, is considered as a customizable parameter in our solution. It would be great to have a larger \(V'\), but it would be inefficient to compute the partition of \(DT({\mathcal {G}}_{V'})\) in the main memory. Therefore, we make \(V'\) reasonably large as long as a \(O(V'^2)\) matrix can be completely loaded in a server’s main memory.
Step 2. Computing Vor \((G_{V'})\) The pseudocode given in Algorithm 1 illustrate the computation of Vor \((G_{V'})\). Notice that there could be many different ways to handle conflicts, i.e., a vertex can be of equal distance to different reference points. In our solution, we specifically assign a conflicting vertex to the reference point with a smaller id value.
Step 3. Constructing \(DT(G_{V'}). DT(G_{V'})\) can be easily constructed after Vor \((G_{V'})\) is obtained. \(DT(G_{V'})\) is a weighted undirected graph. There is an edge between \(u,v\in V'\) iffu and v reside in two adjacent Voronoi cells in Vor \((G_{V'})\). And the weight of this edge is \(diam(u)+diam(v)\), where \(diam(\cdot )\) denotes the graph diameter.
3.1 cWSSD Partition
To partition a graph in a distanceaware fashion, we actually employ a twolayer hierarchical partition method. The bottom layer is constructed with the identification of numbers of distancepreserving Voronoi cells. The top layer is to partition the Delaunay graph \(DT(G_{V'})\), which is a summary of the bottom layer. Different from most existing graph partition techniques that aim at reducing cross partition cuts, we would like to have any pairwise distance query be evaluated as quickly as possible. The intuitions are simple: 1) the vertices that are far away from each other should be partitioned into different subgraphs; 2) the vertices that are close to each other should be located in the same or adjacent subgraphs. For clear illustration purpose, we first introduce the concept of WellSeparatedSubgraph Decomposition, which serves as a partition constraint for \(DT(G_{V'})\).
Definition 5
 1.
For any \(x,y \in S\), there exists a unique pair (A, B) \(\in {\mathcal {F}}\) such that \(x\in A\) and \(y\in B\).
 2.
\(d(A,B)\ge c\times Max\{diam(A),diam(B)\}\).
In the definition, diam(A) refers to the diameter of subgraph A; d(A, B) denotes the shortest path distance between two subgraphs A and B. We reason the cWSSD graph partition from two aspects. First, such a definition is query oriented. As stated in the first condition, given any two vertices from \(DT(G_{V'})\), the distance constraint can be examined on just one machine that holds the corresponding pair of graph partitions. Second, it restricts the distance between partitions to ensure a straightforward query constraint verification. The second condition given in cWSSD is to guarantee that two clusters of vertices from \({\mathcal {G}}\) are far apart from each other. As we elaborate later in this section, a cWSSD graph partition ensures an errorbounded immediate pairwise distance estimation.
We solve the cWSSD construction problem by employing a comparisonbased binary tree traversing procedure presented in Algorithm 2. To be specific, let \(R(\cdot )\) denote the minimal spanning tree structure of a given subgraph; u, v denote subgraphs of \(DT(G_{V'})\); \(l_u\) and \(l_v\) denote the diameter value of R(u) and R(v), respectively. To elaborate, we first compute \(R( DT(G_{V'}))\) and assign it as the root of a binary tree T. T is constructed as follows. For each nonleaf internal (or root) node \(u\subseteq V'\), we split it into two children by removing the edge on which the center of R(u) resides. Binary tree T is built along with the subgraph pair extraction process. An internal node u splits only if certain conditions hold, as shown in the if...then clauses of Algorithm 2.
Essentially, Algorithm 2 constructs the cWSSD by traversing T with the help of a queue Q. We first initialize Q to store the ordered pair (\(T_{\mathrm {root}},T_{\mathrm {root}}\)), where \(T_{\mathrm {root}} = R(G_{V'}))\). Then we incrementally grow T in a BFStraversal manner, during which process we compare and discover qualified ordered subgraph pairs and have them be stored in \({\mathcal {F}}\). We shall first prove the correctness of Algorithm 2 and then explain its running time complexity.
Lemma 1
\({\mathcal {F}}\) constructed with Algorithm 2 is cWSSD.
Proof
Take any two vertices \(x,y\in DT(G_{V'})\), since Q contains (\(T_{\mathrm {root}},T_{\mathrm {root}}\)) initially, the traversal algorithm must eventually put a pair of nodes (u, v) into Q such that \(R(u)\cap R(v)=\emptyset , x\in u\), and \(y\in v\). Afterward, the traversal algorithm expends upon (u, v) and finally put a pair (\(u',v'\)) into \({\mathcal {F}}\) such that \(x\in R(u')\) and \(y\in R(v')\). Notice that (\(u',v'\)) is inserted into Q exactly once in the algorithm. This guarantees that x and y are not covered by another pair in \({\mathcal {F}}\). Thus, \({\mathcal {F}}\) satisfies the first constraint of cWSSD’s definition. Moreover, it is clear that if (u, v)\(\in {\mathcal {F}}, d(R(u),R(v)) \ge c\times Max\{diam(R(u)),diam(R(v))\}\). Thus, \({\mathcal {F}}\) satisfies the second constraint given in the definition. \(\square \)
Lemma 2
The cardinality of \({\mathcal {F}}\), i.e. \({\mathcal {F}}\), is \(O(V')\).
Proof
It suffices to prove that given any internal node u, there are at most O(1) nodes \(v'\) such that (\(u,v'\)) appears in Q. Suppose that (\(u,v'\)) appears in Q. It holds if (\(u,v'\))=(r, r). Otherwise, (\(u,v'\)) was put into Q because we split the parent of u or the parent of \(v'\). Without loss of generality, assume that we split the parent v of \(v'\). Thus, \(l_v\ge l_u\) and \(d(R(u),R(v))<c\times Max\{l_u,l_v\}\). In other words, R(v) lies inside a spanning tree R with the same center of R(u) and length equal to 4\(c\times l_v\). Then there are no more than 16\(c^2\) disjoint binary tree segments of width \(l_v\) inside R, each of which may generate two children that appear with u in Q. Hence, u appears with at most 32\(c^2\) nodes in Q. \(\square \)
Since each pair in \({\mathcal {F}}\) must appear in Q too, \({\mathcal {F}} =O(n)\). It takes \(O(n\;\mathrm {log}\;n)\) time to construct \({\mathcal {T}}(V')\). Then the traversal algorithm runs in \(O(m\;\mathrm {log}\;m)\), where m is the total number of pairs that appear in Q. Lemma 2 has already shown that \(m=O(n)\). Therefore, Algorithm 2 constructs \({\mathcal {F}}\) in \(O(n\mathrm {log}n)\) time (\(n=V'\)). To achieve the (\(1+\epsilon \)) approximation of the pairwise shortest path distance, given 0\(<\epsilon<\)1, we make \({\mathcal {F}}\) an errorbounded partition of \(DT(G_{V'})\) by making \(c=\frac{2(1+\epsilon )}{\epsilon }\). Then we can have the following result:
Lemma 3
Given any two vertices \(x,y\in DT(G_{V'})\), the shortest path distance approximation between x, y is at most (\(1+\epsilon )d(x,y)\).
Proof
It is worth pointing out that the partition strategy illustrated above also significantly reduces the computation overhead across different storage nodes. Although the cWSSD computation is sequential and handled centrally, the data repartition can be easily conducted in parallel.
4 Guided Graph Exploration
In addition to fast and accurate pairwise distance estimation, finding the exact pairwise shortest path efficiently, in terms of minimizing the redundant edge visit, raises a grand challenge as well. In this section, we study a landmarkbased guided graph exploration strategy.
4.1 Landmark Selection
Although the technique presented in Sect. 3 gives errorbounded distance estimation in O(1) time, the storage cost of subgraph pair index depends on the underlying graph topology structure. If a graph is extremely dense, the constant factor c could be over 100 which makes it an infeasible solution. Therefore, we consider a more generic strategy to estimate pairwise distance, which is sufficient to perform guided graph exploration.
Selecting landmarks or reference points to facilitate the shortest path distance computation has been adopted in many works [26, 28, 29]. Existing landmark selection criteria are quite biased according to different graph structures and applications. In our solution, we select landmarks not only based on the consideration of graph partition and pairwise shortest distance estimation, but an evenly coverage property is desired. To elaborate, we find that given two vertices s and t, the landmark best serves \(p_{st}\) ^{1} computation is the one closest to \(p_{st}\). Therefore, we define a set of landmarks of evenly coverage as follows:
Definition 6
(\(\delta \)evenly coverage) Given a graph G = \(\langle V,E\rangle \), a set of landmarks, \(O = \{o_1,o_2,\ldots ,o_d\}\), is said to be an evenly coverage of G, iff \(\forall v\in V, \exists o_i\in O\) such that \(p_{vo_i}\le \delta \), where \(\delta \) is a customizable parameter.
According to the definition, an interesting question is how to decide an evenly coverage O of a given graph G. Intuitively, if \(\delta \) is small, the cardinality of O, denoted with parameter d would be large. As a matter of fact, it is easy to derive that in an extreme case, d needs to be at least as large as \(\frac{n1}{2\delta }\). On the other hand, at most 3 landmarks are sufficient if the diameter of G is smaller or equal to \(2\delta \). In practice, we would like to select the minimal number of landmarks that satisfy a \(\delta \)evenly coverage of G in order to save index space and computation costs. Algorithm 3 gives a deterministic solution of finding the minimal d, which also helps decide the selection of landmark vertices.
In the first line of Algorithm 3, \(G_{diam}\) denotes the diameter of G. We consider \(G_{diam}\) as a given input as it can be easily computed following the super step based message passing model. Apparently, the above algorithm is to recursively partition G into a set of small graphs with diameter smaller than \(2\delta \), and report the center vertices of these small graphs as landmarks. Let the level of recursions is h, then the total number of landmarks is \(d=2^h\). The computation cost of Algorithm 3 is O(hG), because on each level of recursion the entire graph is traversed. We can save the computation cost using a random algorithm given in Algorithm 4. It is worth pointing out that Algorithm 4 does not need \(G_{diam}\) to be precomputed. On the other hand, as shown on line 3 of the algorithm, we randomly select a path (simply using graph exploration) of length \(2\delta \) at each iteration and filter out all vertices that could be evenly covered in \(\delta \)hops from the middle vertex of this selected path, until all vertices from G are covered.
Lemma 4
Algorithm 4 runs at the complexity of O(G) and returns an evenly coverage of G with at most \(3\times 2^{h1}\) landmarks.
Proof
Consider an uncovered subgraph g with a diameter falls in (\(2\delta ,4\delta ]\), it takes two landmarks to evenly cover g according to Algorithm 3. However, according to Algorithm 4, a subgraph \(g'\in g\) could be selected, leaving the remaining part to be sufficiently covered by at most 2 landmarks. Therefore, it takes three landmarks to cover any two adjacent small graphs after partition in Algorithm 3. Therefore, Algorithm 4 reports at most \(\frac{3}{2}\times 2^{h} = 3\times 2^{h1}\) landmarks.
With the set of landmarks O determined, we are able to associate each vertex a label vector denoting its distance to all landmarks. Let l(v) be a ddimensional vector, where \(l(v)_i\) denotes v’s distance to landmark \(o_i\). Starting from the d landmarks, with one time graph exploration, every l(v) can be determined.
Associating each vertex with a d dimensional vector ideally trades off space cost to empower filtering on graph exploration. However, in realworld scenarios, d could be very large if \(\delta \) is set to a small value, which could impose infeasible space overhead for graph storage. As a matter of fact, given a vertex u and a landmark o, their shortest path distance can be denoted as \(p_{uo'} + p_{o'o}  \sigma \), where \(p_{uo'}\) is the distance from u to its nearest landmark \(o'\). As dist(\(o,o'\)) can be precomputed during preprocessing, then only the adjusted value \(\sigma \) needs to be stored. Note that the employed graph partition strategy potentially promises a localitybased landmark clustering. It results in the value locality of \(\sigma \) in u’s label, where a simple valuebased compression technique can be applied to reduce the total space cost significantly.
4.2 Guided Graph Exploration
To explore the shortest path from s to t, at least \(p_{st}\) super steps are necessary using a vertexcentric model. Starting from s, a naive graph exploration method like BFS would access all vertices within a distance of \(p_{st}\) to s. Thus, we would like to investigate a guided graph exploration approach to significantly reduce the redundant vertex access.
Our design is simple and straightforward. Let \(v_k\) resides on the shortest path between s and t. Assume \(p_{st}\) is given, \(v_k\) is a khop vertex from s, then according to the cosine law, the distance from \(v_k\) to a landmark \(o_i\) is solely determined on \(l(s)_i, l(t)_i\) and k. And such a condition must be hold between every landmark and \(v_k\), which could greatly help filtering out possible candidates for future examination. Plus, as \(v_k\)’s label has been computed during the preprocessing phase, it is easy to verify whether \(v_k\) exists. If negative, it only shows that the assumption on \(p_{st}\) is wrong.
Given vertex s and t, we can simply bound the \(p_{st}\) using the triangle inequality. It is easy to verify that \(p_{st}\in [Max(l(s)_il(t)_i),Min(l(s)_i+l(t)_i)]\), where \(1\le i\le d\). For comprehensive presentation, the notation \(p_{st}\in [LB(p_{st}),UB(p_{st})]\) is employed for the rest of this paper. An observation on the determination of \(p_{st}\) is that an assumption of \(p_{st}\) is correct iff \(\forall k\in [1,p_{st}) \exists v_k\), such that \(\forall o_i\in O, l(v_k)_i\) is valid according to the cosine law. Based on this observation, given a range of possible \(p_{st}\), a bruteforce solution is to check all possible values of \(p_{st}\) in an ascending order and report the first valid result as the correct \(p_{st}\), as described in Algorithm 5. Note that the loop given on line 3 indicates an iterative exploration process. In each iteration, we identify a set of valid vertices to be explored according Observation 1. The benefit of Algorithm 5 is that we can get exact \(p_{st}\) as a side product. However, the worst case happens when some landmark resides on \(p_{st}\), meaning we get correct \(p_{st}\) only after checking all the possible values.
Apparently, Algorithm 5 is efficient only for the scenarios where \(p_{st}\) is very close to its lower bound. In the worst case, it takes \(O(p_{st}^2)\) iterations to find \(p_{st}\). Therefore, we would like to propose another algorithm which has strict performance guarantees on all possible conditions. The intuition is that by starting from a set of vertices possibly residing on \(p_{st}\), which must be a superset of \(p_{st}\), we perform a guided exploration that iteratively prunes all candidates that do not belong to \(p_{st}\).
Lemma 5
Given vertices s and t, a vertex v possibly resides on \(p_{st}\) if \(Max\{l(v)_il(s)_i+l(v)_il(t)_i\}\le UB(p_{st})\), where \(1\le i\le d\).
Proof
Let vertices u and v be directly adjacent to each other. Then \(Max\{l(v)_il(u)_i\}=1\), where \(1\le i\le d\), because jumping from u to v, the distances between u and all landmarks alter by at most one. Therefore, given any two vertices u and \(v, Max\{l(v)_il(u)_i\}\) indicates a lower bound of the pairwise shortest path distance between them. Thus, if the sum of lower bounds of a vertex v’s distance to s and t is greater than an upper bound of \(p_{st}\), denoted as \(UB(p_{st})\), then v must not reside on \(p_{st}\). \(\square \)
Although Lemma 5 indicates a filter on the possible vertices to explore, the cost to examine the entire graph set remains unacceptable. We could rule out some candidate vertices based on their distances to all landmarks, as guaranteed by the following:
Lemma 6
Lemma 6 can be easily proved following the cosine law and the Heron’s formula. By applying the filtering criteria suggested in Lemmas 5 and 6, we could obtain a subgraph of G, denoted as \(g_{st}\), which must be a superset of \(p_{st}\). Note that \(\forall v\in g_{st}, v\)’s degree is at least 2 and all of v’s neighbors belong to \(g_{st}\). This is easy to prove by contradiction. Then, we start an iterative validation process on \(g_{st}\) to obtain \(p_{st}\) by filtering out unnecessary vertices step by step, as described in Algorithm 6.
Algorithm 6 employs a range label to check whether a vertex resides on the path \(p_{st}\). Each vertex that receives a lower(upper) bound of the range label, it sets up the list to watch if any upper(lower) bound would be sent from the same vertex, e.g. v.swatch and v.twatch in lines 9 and 14, respectively. Initially, s and t are only half bounded, and they pass on the range to its neighbors. Iteratively, if a vertex v finds that it receives both the lower and upper range bounds from the same vertex, as examined in the two IF clauses on lines 7 and 11, v definitely does not reside on \(p_{st}\). Therefore, v can be marked as inactive, and it will not participate in any further computation. Finally, all vertices that remains active and closely bounded shall be returned.
Correctness. There are only two cases where v does not reside on \(p_{st}\). One is that v reaches both s and t from a same vertex u. In this case, according to Algorithm 6, v would receive range updates from u only; thus, it will be pruned. The other case is that the sum of two shortest path distances \(p_{uv}+p_{u'v}\) is larger than \(p_{uu'}\), where u and \(u'\) resides on \(p_{st}\). Thus, the algorithm terminates before all vertices on the path \(p_{uv}\) and \(p_{u'v}\) get closely bounded, and these paths would be removed eventually.
Complexity. Obviously, Algorithm 6 takes the space complexity of up to \(O(g_{st})\), and the total iteration step of Algorithm 6 is the same as \(p_{st}\). And within each step, only vertices with range updates would send out messages to selected neighbors. Therefore, comparing to the naive exploration method, Algorithm 6 reduces the communication cost at each superstep. While comparing with Algorithm 5, Algorithm 6’s total number of iteration steps is fixed. It makes Algorithm 6 more generic for all possible workloads.
Note that it is trivial to add a global counter in Algorithm 6 to record each vertex’s shortest path distance to s and t. Then the exact \(p_{st}\) can be obtained after the program execution. The difference between Algorithm 5 and 6 is that the former aims at fast validation of \(p_{st}\) with as least vertex access as possible under the help of vertex labeling. Algorithm 6 first uses vertex labeling to identify a super set of \(p_{st}\) to explore, then conduct the exploration in a way that eliminate communication as much as possible.
Our guided graph exploration method could serve as a building block to evaluate other distance aware queries. For example, in the network field, there are common requests like routing a package from s to t that must pass or must not pass some given node within a transfer budget. Our vertex label method makes it straightforward to estimate the cost to include or exclude a vertex on the shortest path exploration. Therefore, cost aware solutions can be easily constructed to discover such a constraint routing path efficiently.
5 Apply to Online Graph Queries
The distance estimation and guided graph exploration technique elaborated above can serve as fundamental building blocks for various online graph queries. In this section, we present two different types of selective online queries that can benefit from the proposed technique.
5.1 Distance Join Query
Given a query graph Q and the data graph G, a distance join query returns all the subgraphs from G that satisfy every pairwise distance constraint in Q. Such a kind of query is handy and expressive in social network analysis and Biochemical network investigation [8, 14, 38]. It captures not only the structure information about the query graph, but also implies strong connectivity constraints, i.e. the pairwise distance between any two given vertices. Clearly, a distance join query is more flexible than the subgraph search and especially useful in graph analytic tasks that target on corelationship discovery. For comprehensiveness, we first define the distance join query as follows:
Definition 7
 1.
\(\forall v_{i,1\le i\le n}, f(v_i)\in S\);
 2.
\({\hat{d}}(f(v_i),f(v_j))\le w(v_i,v_j)\), if (\(v_i,v_j)\in Q (1\le i \ne j \le n\)), where function \({\hat{d}}(\cdot ,\cdot )\) denotes the pairwise shortest path distance of two vertices in \({\mathcal {G}}\).
Note that DJ(Q, G) allows different pairwise distance constraints, which makes the query introduced in [38] a special case of our study (as the query in [38] restricts all pairwise distance constraints to a same value \({\varDelta }\)). Such a generalized query semantic implies at least the same computational complexity as the query defined in [38], which is reported to be #\({\mathcal {P}}\)complete.
5.1.1 Evaluation Overview
Intuitively, it is the join order selection problem to generate a good query plan for DJ(Q, G). First of all, we need a cost model to evaluate different query plans. There have been many literatures studying the cost metric of a distributed jobs w.r.t various constraints. We omit the details on cost model construction as it is beyond the scope of this paper. For the rest of this paper, we use \(C^*(\cdot )\) to denote the cost function.^{2}
Given the costdriven query evaluation plan, which is a sequence of subqueries to evaluate, the join condition validation falls into two categories: (1) validate two vertices that do not reside on the same machine; (2) validate two vertices that are colocated on the same machine. As elaborated in Sect. 3, the distanceaware graph partition particularly favors the distance approximation of vertices that are far away from each other. Thus, it only takes constant time to justify a join condition for the case (1). However, for the query inputs of two vertices that are not far away from each other, i.e., they fall into the same set of Voronoi cells after partition, we need further computation on each computing node to answer the query. The essential challenge is to minimize the I/O operation as much as possible such that queries can be answered more efficiently.
5.1.2 Data Block Construction
We discuss how we organize graph data locally according to the cWSSD property. Let \(F_i\subset {\mathcal {F}}\) be the set of subgraph partitions distributed to a computing node i. As it is infeasible to keep \(F_i\) completely in the main memory, \(F_i\) must be written back to the file system in a certain manner. Since we employ HDFS as the underlying file system, challenge rises because HDFS, like other Cloud file systems, is managed on the block basis. First, the data loading from disk is at the level of blocks. Second, HDFS demonstrates unsatisfactory performance in random block access. Therefore, it is not a trivial task to write \(F_i\) back to HDFS. A data block needs to be carefully constructed.
Moreover, we introduce redundancies for cell clustering in our implementation, as the three large overlapping rectangles shown in Fig. 2(b), such that it reduces the probability of involving more groups of Voronoi cells for the query evaluation. Although the factor of redundancy itself can be a research topic if more data statistics and query patterns are presented, in our prototype system, we only include the twohop neighbor cells. For each group of Voronoi cells, we write them back to the disk on the basis of Voronoi cells. For each cell, it is organized as the (key, value) data model, although we need to add special links from values to keys, such that when it is loaded in the main memory, it serves as an auxiliary index structure for shortest path computing.
As elaborated above, we now have the power to validate the distance join conditions effectively for both cases: (1) vertices that are far away from each other; (2) vertices that are close by each other. Although to generate the optimal query plan remains an open question, fast and accurate distance estimation always serves as a fundamental building block.
5.2 Vertex Set Bonding
V ertex S et B onding query (VSB query for short) extracts the most prominent vertices, called bonding agents, in connecting two sets of input vertices. The prominence of a vertex is defined on its contribution to the shortest path connectivity between input vertex sets. Intuitively, given two input sets of vertices X and Y, the desired bonding agents are the minimum set of vertices to remove in order to enlarge every pair of shortest path distance between X and Y. Such type of query finds various applications in practice, for example, network monitoring [7], community bonding [5, 13] and etc. In this section, we formally define the VSB query and then show how the query evaluation can benefit from our guided graph exploration. For comprehensiveness, we first introduce the vertexpath and vertex–vertex dominance concept.
Definition 8
(v–p Dominance) A vertex v dominates a path \(p_{st}\), denoted as \(v_\vdash p_{st}\), iff \(p_{st}\) increases by removing v from the graph. \(\{v_\vdash \}^P\) denotes the set of shortest paths dominated by v.
If there exists multiple shortest paths between s and t, then \(p_{st}\) may not be dominated by any single vertex. Instead, \(p_{st}\) is dominated by a vertex set U, denoted as \(U_\vdash p_{st}\), where \(p_{st}\) increases if U is removed from the graph.
Definition 9
(uv Dominance) A vertex u dominates another vertex v, denoted as \(u_\vdash v\), iff \(\{v_\vdash \}^P \subseteq \{u_\vdash \}^P\). The set of vertices dominated by vertex u is denoted as \(\{u_\vdash \}^V\).
Given two sets of vertices X and Y, let \(P_{XY} =\{p_{xy} x\in X, y\in Y\}\) denote the set of all pairwise shortest paths between the elements of X and Y, we further define closed dominance and minimum closed dominance as follows:
Definition 10
(Closed dominance) A vertex set U is said to be a closed dominance of \(P_{XY}\), iff \(P_{XY} \subset \bigcup _{u\in U}\{u_\vdash \}^P\).
Definition 11
(Minimum closed dominance) A vertex set U is a minimum closed dominance of \(P_{XY}\) iff U is no longer a closed dominance of \(P_{XY}\) after removing any element in U.
Definition 12
(Optimal minimum closed dominance) A vertex set U is an optimal minimum closed dominance of \(P_{XY}\) iff U is a minimum closed dominance of \(P_{XY}, \not \exists U'\) which is another minimum closed dominance of \(P_{XY}\) that \(\exists u'\in U', \exists u\in U\) having \(u'_\vdash u\).
Based on the terminology introduced above, now we formally define the vertex set bonding query, a.k.a the VSB query.
Definition 13
In the above definition, \(\sigma _{xy}(v)\) denotes the number of shortest paths between x and y that pass through \(v, \sigma _{xy}\) denotes the total number of shortest paths between x and y.
From the problem definition, one can easily tell that the VSB problem is a variation of the weighted set cover problem, which has been proven to be NPhard. However, one upfront problem is that X and Y are given at ad hoc, no vertexpath dominance relation is determined until the run time. In other words, for any vertex \(v\in V, \{v_\vdash \}^V\) is unpredictable until X and Y are determined and \({\mathcal {G}}\) is extracted. More importantly, the essential difficulty of the VSB problem is that there could be exponential number of vertex sets for the minimum closed dominance verification, which makes none of the existing solutions for weighted set cover problem applicable.
Our general solution framework works as follows. We label all vertices according to their distances to selected landmarks. Then the guided graph exploration building block would effectively filter unnecessary vertices when a VSB query is submitted to the query engine. Later, we shall perform the betweenness ranking computation on exploring only the valid vertices^{4}. To show how guided graph exploration helps with the query evaluation, we highlight two building blocks: path sharing and probebased communication.
5.2.1 Naive Plan Versus Path Sharing
A naive query plan is to apply the same computation procedure on each \(p_{xy}\), where \(x\in X, y\in Y\), and assemble the final results based on a reduction of every \(p_{xy}\)’s dominance vertices, as described in Algorithm 7. Note that a temporary data set \(D_{xy}\) is employed in the algorithm to store all the dominant vertices of \(p_{xy}\) (on Line 4). As Line 13 applies the same computing procedure to all pairwise paths between X and Y, this part can be executed in parallel. The reduction process on Line 4 is to reduce the dominance vertices of each path to a single set \({\mathcal {D}}\) and then compute centrally.
Theorem 1
Algorithm 7 takes up \(O(XYp_{xy})\) space, communicates at \(O(XYp_{xy})\) volume of data, and runs at the time complexity of \(O(X^2Y^2)\), returns an optimal minimum dominance of \(P_{XY} R\) having \(AB(R)>\frac{1}{2}AB(R^*)\), where \(R^*\) is the optimal answer,.
Proof
For each pairwise shortest path \(p_{xy}\), the temporary dominance vertex set \(D_{xy}\) computed on Line 4 can be as large as \(p_{xy}\), which explains the space and communication complexity. The nested loop structure indicates a comparison between a path against every other path, which is of complexity \(O(X^2Y^2)\).
We prove R is an optimal minimum dominant set of \(P_{XY}\) by contradictory. Assume there exists another vertex \(u\in p_{xy}\) that dominates \(v\in R\). As \(v_\vdash p_{xy}\) holds, therefore, both u and v are pushed into the priority queue (Line 9). However, v is returned only if it is the vertex of the largest dominance in the priority queue, meaning \(\{v_\vdash \}^P\supseteq \{u_\vdash \}^P\), which indicates the assumption must be invalid. As we elaborated before, a vertex u’s betweenness \(C_B(u)\) equals to \(\{u_\vdash \}^P+f\), where f is u’s contribution to other shortest paths that it resides on but does not dominate. Clearly f cannot exceed 1, therefore, at each step a returned result’s betweenness is at least \(\frac{1}{2}\) of the optimal choice. Accumulatively, the final \(AB(R)\ge AB(R^*)\). \(\square \)
Algorithm 7 is straightforward and easy to implement, and it works for all query workload. However, its efficiency can suffer from the alltoone large volume of data copy in the reduction step (Line 4). Meanwhile, the efficacy of Algorithm 7 can be further improved if we take the f part of a vertex’s betweenness estimation into consideration. Thus, we develop several optimization techniques to improve the performance of VSB query processing.
In contrast to naively compute the pairwise shortest path between two sets of input vertices X and Y, an optimization opportunity lies in taking the advantage of vertex distribution in X and Y. As the VSB query can be applied to find the bonding between communities, where a community must be composed of vertices that are close to each other. It implies the potential of shortest path sharing property. Thus, there are two problems to solve: (1) how to quickly decide the input vertex distribution, as X and Y are given at ad hoc; (2) how to make the best of path sharing.
We solve the first problem with group prediction using vertex labels. Note that all vertices are labeled by their distances to landmarks. Graph G is partitioned into a number of small graphs that have a diameter restriction. Thus, given two vertices u and v, if both \(l(u)_i\) and \(l(v)_i\) is smaller or equal to \(\delta \) (the graph partition parameter discussed in Sect. 4), it is certain that u and v are in the same partition graph. Intuitively, if u and v share similar distances to multiple landmarks, they are close to each other. Given a VSB query \(Q=\langle G,X,Y,R\rangle \), we first partition vertices in X and Y according to their labels. With so many distancebased clustering algorithms offtheshelf, we choose the simplest one. We groups vertices of the same small graph partition together to obtain long shared paths, such that the exploration cost can be greatly saved.
The second problem essentially concerns how to identify the shared paths when vertices are grouped. As a matter of fact, such shared paths can only be determined during the runtime. Sometimes, vertices that are close to each other may not share an single edge to destinations at all. Therefore, we only need to identify the region or the boundary of shared paths to save the exploration cost. Given a set of grouped vertices, denoted as \(X'\), we simply add a virtual node \(x_v\) to the graph to represent \(X'\). The trick is how we decide \(l(x_v)\).
Lemma 7
Proof
Consider the case when \(p_{x'y}< p_{x_vy}\). Clearly, two adjacent vertex’s label difference on every dimension is at most one, where the label is a ddimensional vector. Thus, the label of \(l(x_v)\) defined in the Lemma indicates the center of \(X'\), which reaches every vertex \(x'\) in \(X'\) with minimum hops. Therefore, the path \(p_{x_vy}\) must passes \(x'\), implying \(p_{x'y}\subset p_{x_vy}\). Similarly, the case when \(p_{x'y}> p_{x_vy}\) can be easily verified. \(\square \)
By employing the path sharing, we could greatly save the concurrent exploration cost of Algorithm 7, as well as the space and communication cost on \(D_{xy}\), since the total number of such dominant vertex set are reduced.
5.2.2 ProbeBased Communication
A main bottleneck of Algorithm 7 is its alltoone communication at the reduction part, which brings about a burst of data copying over network. Instead of such a bruteforce solution, we develop a probebased lookup strategy which could greatly save the overall communication cost. Let each graph vertex be associated with a set of independent hash functions, denoted by H. We could use H to build up a bloom filter for the elementinset test, which is essential to our probebased communication. To elaborate, instead of directly copying \(D_{xy}\) over network (Line 5 in Algorithm 7), we first compute the bloom filter of each \(D_{xy}\) for \(p_{xy}\), denoted as \({\mathcal {F}}_{xy}\), which is a \(m_f\) bits vector. Then we pass \(m_f\) to threads examining other pairwise shortest paths. In this way, each thread can check whether any dominant vertex it finds could also be dominant vertex on other paths. Although the bloom filter may introduce false positive, it greatly reduces the size of data to transfer for verification. Another benefit of using probebased communication is that most computation is local, such that the centralized computing workload (Lines 511 in Algorithm 7) could be reduced. Following the same context of Algorithm 7, we show how the probebased communication is employed to evaluate a VSB query in Algorithm 8.
In Algorithm 8, we eliminated the centralized computation. Although R is a shared variable, a distributed lock can be employed for synchronous updates. As the algorithm shows, it is easy to be executed in parallel, e.g., each computing thread computes for each pairwise shortest path. Apparently, the communication cost is much reduced comparing to Algorithm 7, since only the bloom filter vector is transferred in the first place. The verification later on (Line 9) transfers one vertex’s label at a time. More importantly, each thread aborts as soon as it contributes a dominance vertex to R, or finds out that a residing path is already in R. This early stop property leads to a fast convergence of the final answer. It is worth pointing out that Algorithm 8 achieves the same approximation ratio on AB(R) as Algorithm 7, as long as R greedily chooses a vertex u of the largest \(\{u_\vdash \}^P\) at each synchronous update. Comparing to the path sharing technique, which only benefits when input vertices tend to be close to each other, this probebased solution is generic for all kinds of workloads.
In this section we show how to apply the distance estimation and guided graph exploration to two types of graph queries. As a matter of fact, there are other technical challenges in evaluating such queries. For example, adaptive query plan generation, as well as taking the query structure feature into consideration would yield better performance of distance join processing. As VSB query use the betweenness semantic for bonding vertices, online betweenness approximation and ranking technique would greatly contribute to efficient query processing. However, the two techniques discussed in Sects. 3 and 4 are orthogonal research problems and applicable to any distanceaware graph queries.
6 Experiments
We report two sets of experiments in this section: 1) testing the effectiveness of cWSSD partition method on reducing the computation cost of pairwise join validation for distance join queries; 2) how guided graph exploration accelerates shortest path computing and the VSB query evaluation.
6.1 Setup
6.1.1 cWSSD Partition for Distance Join
Testbed We built up the test bed on a cluster of 16 servers. Every server has 4 Intel(R) Xeon(R) CPU E52650 of 2.0GHz, each of which has two cores and supports 16 threads, 12 GB memory and 1 TB hard disk storage. The running operating system is 2.6.3522server #35Ubuntu SMP.
Graph data used for cWSSD partition and distance join
ID  # of nodes  # of edges  # of labels  \(d_{\mathrm {max}}\) 

A  3,774,768  16,522,438  481  22 
B  428,136,613  454,075,638  1,325  78 
C  500,000,000  4,287,029,468  1,000,000  1,052 
D  10,000,000,000  23,946,452,156  1,000,000  1,927 
Query Workload Given a data set, we randomly select \(10\sim 20\) labels and generate three types of query graphs: starshaped, pathshaped and circled graph. Meanwhile, we randomly assign the pairwise distance constraints. We generate 300 distance queries for each data set and evaluate the batch one by one. We run every job batch with 3 coldstart and report the average execution time.
6.1.2 Guided Graph Exploration for VSB Queries
General statistics of employed graph data sets
ID  No. of nodes (million)  No. of edges (million)  \(d_{\mathrm {max}}\)  Size (GB) 

A  ~428  \(\sim \)454  78  3.6 
B  ~1,825  \(\sim \)65,219  5328  869.2 
C  ~33  \(\sim \)1108  7  25.6 
D  10,000  23,946  2,927  42.2 
Datasets We employ four data sets of different scales and topologies in the experiments, as briefly summarized in Table 2. Data set A describes the web graph of the TREC 2009 Category B data set. Data set B comes from the WebGraph 2012 project [23], which is extracted from the Web cropus released by the Common Crawl Foundation in August 2012. Data set C is a crawled social graph from twitter [18]. Note that we only employ the largest connected component of graph data B and C and make the graphs undirected. Synthetic data sets D is a random graph generated with igraph.
Query Workload For each VSB query, we randomly select \(10\sim 100\) vertices as input X and Y. We generate three types of queries, which essentially represent different kinds of workloads: (1) \(\forall x \in X, l(x)_i\le \delta , \forall y \in Y, l(y)_j \le \delta \), where \(i,j\in [1,d]\) are randomly selected, i.e., both vertices in X and Y are close to each other, denoted as XLYL; (2) \(\forall x \in X, l(x)_i\le \delta \), where \(i\in [1,d]\) is randomly selected, \(\forall y\in Y\) is randomly selected, denoted as XLYR; (3) both X and Y are randomly generated from G, denoted as XRYR. We would like to show that our solution works well for all kinds of workloads, and the optimization techniques we proposed would be very useful for certain kind of workload. We generate 100 VSB queries for each type of workload, and evaluate the batch one by one. We run every job batch with 3 coldstart and report the average execution time.
6.2 cWSSD Partition
cWSSD partition method provides a distanceaware partition of a large graph, which makes it possible to estimate the pairwise shortest path distance in constant time. In the experiments we study from two aspects: (1) the effectiveness of cWSSD method in terms of query evaluation time cost as well as the I/O and network cost; (2) the scalability of cWSSD method under different scales of data and parameter.
As shown in Fig. 3a, cWSSD and random partition have about the same cost, while kminimal cut can introduce high I/O cost. The rationale behind is that kminimal cut tends to group large number of connected vertices in one partition. Therefore, the queried two vertices are very much likely fall into the same storage node where graph exploration method needs to be adopted. It is not surprising that, as shown in Fig. 3b, kminimal cut greatly saves the network traffic. However, cWSSD is the winner of query evaluation time, as shown in Fig. 3c. Because it does not introduce much network traffic or I/O cost during the query evaluation. The extra cost of cWSSD is paid during the preprocessing stage of graph partition.
Figure 4a gives the time cost evaluation on four data sets with respect to the initial vertices selection based on different selection ratios, where \(r=\frac{V'}{V}\) denotes the percentage of employed vertices to partition the graph. We have two main observations. First, for a given r, along with the size of a graph grows, the selection cost increases dramatically. Second, given a data set, when r increases in the order of magnitude, the selection cost also increases, however, following a Logarithm level growth.
Figure 4b demonstrates the time cost of computing Voronoi diagram with Algorithm 1 given in Sect. 3. Given a data set, when r increases in the order of magnitude, the time cost to compute the Voronoi diagram does not grow in the same pace. Since the computing process is essentially in the BSP (Bulk Synchronous Processing) style; therefore, more initial vertices actually help to explore the entire graph faster. However, extra cost to maintain the boundary vertices cannot be neglected.
The time cost to compute the partition set \({\mathcal {F}}\) is presented in Fig. 4c. The results are obtained when \(\epsilon \) is set to 0.05, similar trends of results are observed when \(\epsilon =0.01\) and \(\epsilon =0.1\). As proved in Sect. 3, the cardinality of final \({\mathcal {F}}\) is only subjected to the size of initial vertex set \(V'\). We observe the same trend in the experiment that the time cost to compute \({\mathcal {F}}\) is closely related to the selection of r.
6.3 Guided Graph Exploration for VSB Query
Our experiment study mainly includes three parts: 1) how the proposed guided graph exploration and betweenness ranking onexploration help VSB query processing; 2) how different query processing algorithms work under different query workloads.
Preprocessing As presented in Sect. 3.1, we can select d landmarks using either a deterministic or a random algorithm. \(\delta \) is the crucial parameter to choose. Intuitively, the larger \(\delta \) is, the number of vertices covered by a single landmark gets larger, which leads to a smaller d. Experiments also validate this point. In Table 3, we report the time efficiency of graph preprocessing and the value of d accordingly, as well as the total disk space cost after preprocessing.
Graph preprocessing using different algorithms
ID  d  T(sec)  Size(GB)  

\(\delta \)  dm.  rd.  dm.  rd.  dm.  rd.  
A  4  64  98  146  39  33.2  57.7 
8  36  78  129  32  22.4  41.3  
16  8  42  89  27  8.2  23.6  
B  8  2231  3029  549  227  4216  4248 
16  1429  2574  531  189  4094  4225  
32  879  1782  492  141  2709  2799  
C  1  126  145  329  124  1139  1178 
2  10  26  69.2  36.4  95.6  105.4  
4  3  59  2.3  78.5  78.9  131.7  
D  8  1576  2109  421  179  1465  1509 
16  1206  2005  392  164  1437  1486  
32  457  1324  354  139  1128  1305 
Fast shortest path computing To validate the guided graph exploration for shortest path, we randomly pick 100 pairs of vertices from each graph and ask for \(p_{xy}\), and rank the betweenness of two random vertices from \(p_{xy}\). As a comparison, we employ the GraphLab’s shortest path utility implementation and the parallel betweenness computing algorithm introduced in [2]. Due to the space limit, we highlight our findings on graph B.
To investigate how \(\delta \) affects our algorithm, we present time cost of greedy and guided exploration with different \(\delta \) setting on the same workload in Fig. 5c, d. Note that we normalize all the time cost using GraphLab’s result, as it does not rely on the setting of \(\delta \). Apparently, our algorithm achieves more speedup when \(\delta \) is smaller, which is reasonable as it promises better pruning power. Another observation from the result is that, comparing to the guided exploration, the greedy algorithm is more resistent to different \(\delta \). It is because greedy algorithm uses vertex label pruning in a passive way, while the guided exploration employs the pruning actively before making a decision on the next hop. Clearly, as shown in Fig. 5d, when \(p_{st}\) is large, the guided exploration is more sensitive to the setting of \(\delta \). Although smaller \(\delta \) works better for the path query, there is the greater extra space overhead to trade off.
We report the normalized query processing makespan of different methods on all data sets in Fig. 6g–i. We have made two observations from the efficiency experiments. First, given the same query workload, the underlying graph structure would greatly affect algorithm performance. Take graphs A for example, it is much more sparse than graph C. As shown in Fig. 6g, over the same query workload, the best evaluation strategy for graph A is path sharing, while for graph C it is a combination of path sharing and probebased communication. Clearly, reducing network communication as much as possible for a dense network brings more benefits than packing shared paths. Second, path sharing clearly helps a lot when the vertices in X or Y are close to each other. For example, for the XLYL workload, comparing to the naive algorithm, we can obtain almost 5x speed up on graph A by applying path sharing. On the contrary, the probebased communication method performs more stable on different workloads. One thing to notice is that combining path sharing with probebased solution does not double the speedup. The reason is that path sharing reduces the concurrent computing threads itself, but makes the computing workload of each thread unbalanced. Note that in Algorithm 8, R is updated with synchronization, which could easily suffer from unbalanced current computing workloads.
Another critical concern is that how \(\delta \) affects the query evaluation performance. As the algorithms, we proposed are based on the guided graph exploration method, therefore, we observe the similar trend of efficiency improvement when \(\delta \) decreases as shown in Fig. 5d. We highlight our findings using the results from graph B and C. For graph B, as shown in Fig. 6j, the path sharing optimization method is closely bounded with the total number of vertices to explore. Therefore, the probebased method is essential to the performance improvement. For graph C, as shown in Fig. 6k, due to the density property, path sharing is desirable when \(\delta \) is set to a proper value, like \(\delta \)=2. When delta equals to 1, there is not much optimization space left after a probebased method is employed, as the pruning power on vertex exploration is sufficiently strong. On the contrary, when \(\delta \) equals to 4, almost the entire graph needs to be considered to extract the shared path, which would result in severe performance decay. The hints we learn from the results are that if the query workload is unknown, smaller \(\delta \) is preferred for fast query processing as long as the extra space cost is manageable; the crucial performance optimization lies in reducing the total number of vertices to access and compute; path sharing does not help with the speedup if \(\delta \) is too small or too large.
7 Related Work
Distributed graph processing models and systems General purpose large graph management has drawn great research interest. Early work [20] illustrates the challenging issues of large graph management. Proposals in [3] and [22] are two well recognized models for parallel large graph processing, which are MPI(Message Passing Interface)based and BSP, respectively. Although MPI usually gains more time efficiency, it is relatively complicated and puts a heavy burden of system implementation on programmers. Pregel [22] is a vertexcentric computing model, which is more flexible and relatively easy to program. However, it has to sacrifice time efficiency due to inevitable synchronization costs at each iteration step. Work [24] conducts an empirical comparison of three computing diagrams for large graph processing, which are RDBMSlike approaches, data parallel approach (e.g. Pregel) and inmemory graph exploration approach. Improvement works over Pregel, like Pregelix [6], Blogel [35] targets on network cost reduction to yield better performance. Trinity [31] and GBase [15] are two other stateofart distributed general purpose graph management systems with substantially different designs. GBase models graphs using adjacent matrices. It transforms nearly all the graph analytic functions into matrix manipulations using iterative MapReduce jobs. On the contrary, Trinity employs an inmemory keyvalue store in a distributed shared memory environment. It models graph data following the vertexcentric model, i.e., each vertex is associated with its onehop neighbor(s), such that all the classical graph exploring algorithms can be directly plugged in.
Distancebased query over distributed graph Employing landmarks to approximate the shortest path distance is a widely adopted technique [17, 25, 27]. The basic idea is to precompute the shortest distances between all the nodes and selected landmarks and then apply the triangle inequality to help estimate the shortest path distance. Work [27] investigates finding the optimal set of landmarks. In particular, they target on answering the pairwise shortest path distance query. They introduce the LandMarkCover problem, which is to find a minimum number of points such that given any pair of vertices u and v, there exists at least one landmark residing on the shortest path from u to v. This problem is proven to be closely related with the 2hop labeling scheme [9]. Landmarkbased methods do not aim to provide the exact distance. Instead, they use a small number of landmarks to do estimation. Tao et al. [32] introduce the kskip shortest path, which is a natural substantial of returning the exact shortest path. Intuitively, it reports a set of vertices V that consecutively reside on a shortest path from s to t, having every vertex on this path is at most khop away from at least one vertex in V. Followup works, like graph simplification [30], shortest path discovery over road network [11, 34], employ similar concepts to perform a distancepreserving graph partition. The \(\delta \)evenly coverage landmark selection defined in this work, however, is orthogonal to the kskip concept. Because shortest path is not the substantial concern in our problem. We select landmarks to serve online graph exploration. There is no sequence semantic of our landmarks. In other words, kskip returns more vertices residing on the shortest path of two query points when k decreases. On the contrast, given a smaller \(\delta \), the \(\delta \)evenly coverage serves better in reducing redundant vertex access on exploration step by step. Vertex labeling is another line of research to answer distance queries. Gavoille et al. show that general graphs support an exact distance labeling scheme with labels of O(n) bits [12]. Several special graph families, including trees or graphs with bounded treewidth, have distance labeling schemes with \(O(log_2n)\) bit labels [1]. However, it is infeasible to directly apply these theory results to a large graph of billion nodes, as the space overhead of labeling would be unaccepted. Our solution, on the other hand, simply targets on vertex pruning using distance labels. And due to the \(\delta \)evenly coverage landmark selection scheme, the locality of vertices’ label vectors is well preserved. Therefore, a simple valuebased compression could greatly help to reduce the overall space cost on vertex labeling.
8 Conclusion
In this paper, we study two fundamental building blocks for distanceaware online graph query: fast and accurate distance estimation, as well as guided graph exploration. A cWSSD partition method is introduced to generate the index structure to produce errorbounded shortest path distance estimation in O(1) time with space complexity of O(cV), where c is a constant factor. Furthermore, we discuss how to perform guided graph exploration with landmark referencing. We validate the proposed technique with distance join and VSB query workload over both real and synthetic graph data in real Cloud environment.
Footnotes
 1.
For the rest of this paper, we use \(p_{st}\) and d(s, t) interchangeably, they both denote the shortest path distance from s to t.
 2.
The cost model we employed is elaborated in [36]. As a matter of fact, any offshelf cost models can be applied.
 3.
Figure 2 is to provide a visual aid, and the boundaries are not specifically defined.
 4.
As a matter of fact, we can perform approximated betweenness ranking on exploration as presented in work [37].
 5.
References
 1.Alstrup S, Bille P, Rauhe T (2005) Labeling schemes for small distances in trees. SIAM J Discrete Math 19(2):448–462MathSciNetCrossRefzbMATHGoogle Scholar
 2.Bader DA, Madduri K (2006) Parallel algorithms for evaluating centrality indices in realworld networks. In: ICPP, pp 539–550Google Scholar
 3.Bader DA, Madduri K (2008) Snap, smallworld network analysis and partitioning: an opensource parallel graph framework for the exploration of largescale networks. In: IPDPS, pp 1–12Google Scholar
 4.Bader DA, Kintali S, Madduri K, Mihail M (2007) Approximating betweenness centrality. In: WAW, pp 124–137Google Scholar
 5.Brandtzæg PB, Heim J, Kaare BH (2010) Bridging and bonding in social network sites—investigating familybased capital. IJWBC 6(3):231–253CrossRefGoogle Scholar
 6.Bu Y, Borkar VR, Jia J, Carey MJ, Condie T (2014) Pregelix: big(ger) graph analytics on a dataflow engine. PVLDB 8(2):161–172Google Scholar
 7.Castro M et al (2003) Future directions in distributed computing. In: Topologyaware routing in structured peertopeer overlay networks, pp 103–107Google Scholar
 8.Cheng J, Yu JX, Yu PS (2011) Graph pattern matching: a join/semijoin approach. IEEE Trans Knowl Data Eng 23(7):1006–1021CrossRefGoogle Scholar
 9.Cohen E, Halperin E, Kaplan H, Zwick U (2003) Reachability and distance queries via 2hop labels. SIAM J Comput 32(5):1338–1355MathSciNetCrossRefzbMATHGoogle Scholar
 10.Fu AW, Wu H, Cheng J, Wong RC (2013) ISLABEL: an independentset based labeling scheme for pointtopoint distance querying. PVLDB 6(6):457–468Google Scholar
 11.Funke S, Nusser A, Storandt S (2014) On kpath covers and their applications. PVLDB 7(10):893–902Google Scholar
 12.Gavoille C, Peleg D, Pérennes S, Raz R (2004) Distance labeling in graphs. J Algorithms 53(1):85–112MathSciNetCrossRefzbMATHGoogle Scholar
 13.Guille A, Hacid H, Favre C, Zighed DA (2013) Information diffusion in online social networks: a survey. SIGMOD Rec 42(2):17–28CrossRefGoogle Scholar
 14.Jin W, Yang J (2011) A flexible graph pattern matching framework via indexing. In: SSDBM, pp 293–311Google Scholar
 15.Kang U, Tong H, Sun J, Lin CY, Faloutsos C (2011) Gbase: a scalable and general graph management system. In: KDD, pp 1091–1099Google Scholar
 16.Karypis G et al (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392MathSciNetCrossRefzbMATHGoogle Scholar
 17.Kleinberg JM, Slivkins A, Wexler T (2004) Triangulation and embedding using small sets of beacons. In: FOCS 17–19, pp 444–453Google Scholar
 18.Kwak H, Lee C, Park H, Moon SB (2010) What is twitter, a social network or a news media? In: WWW, pp 591–600Google Scholar
 19.Low Y, et al (2010) Graphlab: a new framework for parallel machine learning. In: UAI, pp 340–349Google Scholar
 20.Lumsdaine A, Gregor D, Hendrickson B, Berry JW (2007) Challenges in parallel graph processing. Parallel Process Lett 17(1):5–20MathSciNetCrossRefGoogle Scholar
 21.Madduri K, et al (2009) A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In: IPDPS, pp 1–8Google Scholar
 22.Malewicz G, et al (2010) Pregel: a system for largescale graph processing. In: SIGMOD, pp 135–146Google Scholar
 23.Meusel R, et al (2014) Graph structure in the web  revisited: a trick of the heavy tail. In: WWW, pp 427–432Google Scholar
 24.Najork M, et al (2012) Of hammers and nails: an empirical comparison of three paradigms for processing large graphs. In: WSDM, pp 103–112Google Scholar
 25.Ng TSE, Zhang H (2002) Predicting internet network distance with coordinatesbased approaches. In: INFOCOMGoogle Scholar
 26.Potamias M, Bonchi F, Castillo C, Gionis A (2009a) Fast shortest path distance estimation in large networks. CIKM, pp 867–876Google Scholar
 27.Potamias M, Bonchi F, Castillo C, Gionis A (2009b) Fast shortest path distance estimation in large networks. In: CIKM, pp 867–876Google Scholar
 28.Qi Z, Xiao Y, Shao B, Wang H (2013) Toward a distance oracle for billionnode graphs. PVLDB 7(1):61–72Google Scholar
 29.Qiao M, Cheng H, Yu JX (2011) Querying shortest path distance with bounded errors in large graphs. In: SSDBM, pp 255–273Google Scholar
 30.Ruan N, Jin R, Huang Y (2011) Distance preserving graph simplification. In: ICDM, pp 1200–1205Google Scholar
 31.Shao B, Wang H, Li Y (2012) The trinity graph engine. Technical Report 161291, Microsoft ResearchGoogle Scholar
 32.Tao Y, Sheng C, Pei J (2011) On kskip shortest paths. In: SIGMOD, pp 421–432Google Scholar
 33.Tian Y, Balmin A, Corsten SA, Tatikonda S, McPherson J (2013) From “think like a vertex” to “think like a graph”. PVLDB 7(3):193–204Google Scholar
 34.Yan D, Cheng J, Ng W, Liu S (2013) Finding distancepreserving subgraphs in large road networks. In: ICDE, pp 625–636Google Scholar
 35.Yan D, Cheng J, Lu Y, Ng W (2014) Blogel: a blockcentric framework for distributed computation on realworld graphs. PVLDB 7(14):1981–1992Google Scholar
 36.Zhang X, Chen L, Wang M (2015a) Efficient parallel processing of distance join queries over distributed graphs. IEEE Trans Knowl Data Eng 27(3):740–754CrossRefGoogle Scholar
 37.Zhang X, Cheng H, Chen L (2015b) Bonding vertex sets over distributed graph: a betweenness aware approach. PVLDB 8(12):1418–1429Google Scholar
 38.Zou L, Chen L, Özsu MT (2009) Distancejoin: pattern match query in a large graph database. PVLDB 2(1):886–897Google Scholar
 39.Zou L, Özsu MT, Chen L, Shen X, Huang R, Zhao D (2014) gstore: a graphbased SPARQL query engine. VLDB J 23(4):565–590CrossRefGoogle Scholar
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.