Abstract
Nowadays, the scale of various graphs is soaring rapidly, which poses a serious challenge to the development of processing and analytic algorithms. Among these tasks, graph pattern matching is one of the most primitive, finding a wide spectrum of applications, yet its performance is often affected by the size and dynamicity of graphs. In order to handle large dynamic graphs, incremental pattern matching has been proposed to avoid recomputing matches of patterns over the entire data graph, hence reducing the matching time and improving the overall execution performance. Due to the complexity of the problem, little work has been reported so far, and most existing solutions address graph pattern matching only under the scenario where the data graph varies alone. In this article, we are devoted to a more complicated but very practical graph pattern matching problem, continuous matching of evolving patterns over dynamic graph data, and present a novel algorithm CEPDG for continuous pattern matching under changes of both the pattern graph and the data graph. Specifically, we propose a concise representation, TreeMat, of partial matching solutions, which helps to avoid recomputing matches of the pattern and speeds up the subsequent matching process. In order to support updates of the data graph and pattern graph, we propose an incremental maintenance strategy to efficiently maintain the intermediate results. Moreover, we conceive an effective model for estimating the stepwise cost of pattern evaluation to drive the matching process. Extensive experiments verify the superiority of CEPDG.
Introduction
In recent years, graph analysis has played an increasingly important role in the area of data analytics [14, 15]. Graph pattern matching is one of the most fundamental problems in graph analytics. Given a pattern graph P and a large data graph G, graph pattern matching finds all subgraphs of G that are isomorphic to P, and has a wide range of applications such as fraud detection and cyber security.
However, graphs are dynamic in nature [11] and continuously evolve over time. A dynamic graph is defined by an initial graph and a graph update stream of edge insertions and edge deletions. Identifying and monitoring critical patterns in a dynamic graph is important in various application domains [6] such as fraud detection, cyber security, and emergency response. For example, cyber security applications should detect cyber intrusions and attacks in computer network traffic as soon as they appear in the data graph [3]. Most previous works solve the subgraph matching problem only under the scenario where the data graph varies alone. But it is common that the pattern graph also evolves over time while the data graph is updated. For example, in cyber-threat surveillance, one could predict upcoming malicious activities and determine the ultimate goal of an adversary by concealing and supplementing selective edges of attacking patterns, respectively [16].
The aforementioned two update scenarios motivate us to investigate a new problem, continuous matching of evolving patterns over dynamic graph data. Formally, we are given an initial data graph G_{0}, an initial pattern graph P_{0}, and a graph update stream (Δg_{1},Δg_{2},Δp_{3},Δp_{4},⋯ ) consisting of edge insertions and deletions of the data graph and pattern graph, where G_{i} = G_{i− 1} ⊕Δg_{i} (resp. P_{i} = P_{i− 1} ⊕Δp_{i}), and M(P,G) denotes the set of subgraph matching results between P and G. Here, ⊕ means that Δg_{i} (resp. Δp_{i}) is applied to G_{i− 1} (resp. P_{i− 1}). The continuous matching of evolving patterns over dynamic graph data problem is then to report M(P_{i− 1} ⊕Δp_{i},G_{i− 1}) (resp. M(P_{i− 1},G_{i− 1} ⊕Δg_{i})) when each update operation Δp_{i} (resp. Δg_{i}) occurs. A naïve method to solve this problem is to repetitively execute pattern matching for each update to the data graph and pattern graph. Nonetheless, this can be prohibitively costly due to the extensive involvement of expensive subgraph isomorphism tests [8].
To address the challenge, efforts to support incremental graph pattern matching over dynamic data graphs have enjoyed some success. In [5], INCISOMAT extracts the subgraph of the data graph that can be affected by each update operation and conducts subgraph matching on the extracted subgraph to obtain the new matches by performing the set difference. GraphFlow [9] applies a worst-case optimal join algorithm called Generic Join to incrementally evaluate subgraph matching for each update. SJ-Tree [2] uses a left-deep tree, where an internal node corresponds to a subgraph containing more than two connected query vertices, and a leaf node corresponds to a subgraph containing two adjacent query vertices. TurboFlux is the state-of-the-art algorithm for continuous subgraph matching [10]; it employs a data-centric representation of intermediate results, namely DCG, in the sense that the query pattern P is embedded into the data graph G. TurboFlux achieves higher performance than the above algorithms. However, it only considers update operations on the data graph and is no longer applicable when both update scenarios occur; to put it in our context, TurboFlux has to recompute DCG when updates occur on the pattern graph, which can be detrimental.
These problems of existing methods motivated us to develop a fully-fledged framework, namely CEPDG, to achieve fast pattern matching under variations of both the data graph and the pattern graph. To the best of our knowledge, this is among the first attempts to conduct pattern matching under the situation where the data graph and pattern graph vary simultaneously. In summary, we make the following contributions:

We introduce a concise representation TreeMat of partial solutions, which can help to avoid executing subgraph pattern matching repeatedly for edge updates on the data graph and pattern graph;

In order to enable frequent updates on the data graph, we propose a vertex state transition strategy, to efficiently maintain the intermediate results.

We devise an execution model to efficiently and incrementally maintain the representation during edge updates on the pattern graph, which is well compatible with the algorithms proposed for the data graph.

We conceive an effective cost model for estimating stepwise cost of pattern matching.
Comprehensive empirical study verifies the efficiency of the proposed algorithm and techniques.
Organization
Section 2 formulates the problem, and presents the overview of the proposed framework. Section 3 introduces a novel representation of intermediate results called the TreeMat and proposes the incremental maintenance strategy. Section 4 explains the algorithms of CEPDG in detail. Experimental results and analyses are reported in Section 5. A brief overview of related work follows immediately in Section 6. Section 7 concludes the paper.
Preliminaries and framework
In this section, we first introduce several essential notions and formalize the continuous matching of evolving patterns over dynamic graph data problem. Then, we overview the proposed solution.
Preliminaries
We focus on a labeled undirected graph g = (V,E,L). Here, V is the set of vertices, E ⊆ V × V is the set of edges, and L is a labeling function that assigns a label l to each v ∈ V. Each vertex has only one label, representing the attribute of the vertex. Note that our techniques can be readily extended to handle directed graphs.
Definition 1 (Graph update stream)
A graph update stream Δo is a sequence of update operations (Δo_{1},Δo_{2},⋯ ), where Δo_{i} is a triple \(\left \langle op,v_{i},v_{j} \right \rangle \) such that \(op\in \left \{I,D \right \}\) is the type of operation, with I and D representing insertion and deletion of an edge 〈v_{i},v_{j}〉, respectively.
A dynamic graph is abstracted as an initial graph g and an update stream Δo; g transforms to \(g^{\prime }\) after applying Δo to g. Here, g represents a data graph or a pattern graph. Note that insertion of a vertex can be represented by a set of edge insertions; similarly, deletion of a vertex can be considered as a set of edge deletions.
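For concreteness, a graph update stream and its application to an undirected edge set can be sketched in Python as follows (the names `UpdateOp` and `apply_stream` are illustrative, not from our implementation):

```python
from dataclasses import dataclass

# One update operation <op, vi, vj> of Definition 1:
# op is 'I' (edge insertion) or 'D' (edge deletion).
@dataclass(frozen=True)
class UpdateOp:
    op: str
    vi: int
    vj: int

def apply_stream(edges, stream):
    """Apply a graph update stream to an undirected edge set."""
    edges = set(edges)
    for o in stream:
        e = frozenset((o.vi, o.vj))  # undirected edge
        if o.op == 'I':
            edges.add(e)
        else:
            edges.discard(e)
    return edges
```

For example, inserting 〈2,3〉 and then deleting 〈1,2〉 transforms the edge set {〈1,2〉} into {〈2,3〉}.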
Definition 2 (Subgraph isomorphism)
Given a pattern graph P = (V_{P},E_{P},L_{P}) and a data graph G = (U_{G},E_{G},L_{G}), P is subgraph isomorphic to G if there is an injective function f from V_{P} to U_{G} such that: (1) ∀v ∈ V_{P}, L_{P}(v) = L_{G}(f(v)); and (2) ∀(v_{i},v_{j}) ∈ E_{P}, (f(v_{i}),f(v_{j})) ∈ E_{G}, where f(v) is the vertex in G to which v is mapped.
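As an illustration of Definition 2 only, a brute-force checker can be written as follows (exponential in the pattern size, for exposition; the function name and argument layout are hypothetical):

```python
from itertools import permutations

def is_subgraph_isomorphic(p_vertices, p_edges, p_label,
                           g_vertices, g_edges, g_label):
    """Return True if there is an injective, label- and
    edge-preserving mapping f from the pattern into the data graph."""
    g_edge_set = {frozenset(e) for e in g_edges}
    # permutations enumerates every injective image of the pattern vertices
    for image in permutations(g_vertices, len(p_vertices)):
        f = dict(zip(p_vertices, image))
        if all(p_label[v] == g_label[f[v]] for v in p_vertices) and \
           all(frozenset((f[a], f[b])) in g_edge_set for a, b in p_edges):
            return True
    return False
```

Practical matchers avoid this enumeration; the point of TreeMat, introduced later, is precisely to sidestep repeating such expensive tests on every update.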
Definition 3 (Problem statement)
Given a pattern graph P = (V_{P},E_{P},L_{P}), a data graph G = (U_{G},E_{G},L_{G}), and a graph update stream Δo, the continuous matching of evolving patterns over dynamic graph data problem is to continuously return occurrences of P in G when the updates in Δo occur on the pattern graph P or data graph G.
Frequently used notations are summarized in Table 1.
Overview of solution
In this subsection, we overview the proposed solution, which is referred to as CEPDG (Continuous matching of Evolving Patterns over Dynamic Graph data). Specifically, we address two technical challenges:

Update operation needs to be efficient such that the intermediate results can be maintained incrementally.

Pattern matching needs to be efficient such that the number of intermediate results is minimized.
The former challenge corresponds to the update handling phase, while the latter corresponds to the query evaluation phase.
Algorithm 1 shows the outline of CEPDG, which takes an initial pattern graph P_{0}, an initial data graph G_{0}, and a graph update stream Δo as input, and finds the matching results of P in G when necessary. We first select a root vertex v_{r} (Line 1). Then we extract from the pattern graph P_{0} a structural tree P_{T} rooted at v_{r}, by computing a spanning tree via breadth-first search and removing non-tree edges from P_{0} (Line 2). The purpose is to execute fast query evaluation by leveraging the tree structure [8], i.e., we handle the edges in the query tree first, and then the non-tree edges.
In particular, to perform continuous subgraph matching, we construct an auxiliary data structure, namely TreeMat, based on P_{T} to store the matching results of the structural tree, which provides guidance for generating answers with light computational overhead (Line 3). During a graph update stream, when an update arrives, we first amend the auxiliary data structure, and then calculate the matching results if necessary (Lines 4–13). For example, on an update o of the data graph, we first match o to corresponding edges in P_{T}, and then incrementally maintain the intermediate results in TreeMat (Lines 6–9). On an update o of the pattern graph, we incrementally maintain TreeMat directly (Lines 10–12). After that, we call subgraphSearch to obtain the matching results if output is requested (Line 13). The design and rationale of the auxiliary data structure maintenance, as well as the algorithm details, are given in the subsequent sections.
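The extraction of the structural tree P_{T} in Line 2 can be sketched as a plain BFS spanning tree (the function name and return layout are illustrative):

```python
from collections import deque

def extract_structural_tree(adj, root):
    """BFS spanning tree of the pattern graph: returns the parent map,
    the BFS (level) order of vertices, and the set of non-tree edges."""
    parent = {root: None}
    order = [root]
    q = deque([root])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)
                q.append(w)
    tree_edges = {frozenset((v, parent[v]))
                  for v in parent if parent[v] is not None}
    all_edges = {frozenset((v, w)) for v in adj for w in adj[v]}
    return parent, order, all_edges - tree_edges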
Root Vertex Selection
Intuitively, we favor a root vertex that has a small number of candidates and a large degree: fewer candidates means fewer partial embeddings being generated, while a larger degree means more chances to prune partial embeddings at early stages. In order to minimize the number of matching data vertices for the root vertex v_{r}, chooseRootVertex first selects a pattern edge \(\langle v, v^{\prime }\rangle \) which has the smallest number of matching data edges. Between v and \(v^{\prime }\), chooseRootVertex chooses the pattern vertex that has the smaller number of matching data vertices. Finally, if there is a tie, chooseRootVertex chooses the pattern vertex having the larger degree.
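Assuming the counts of matching data edges and matching data vertices have been precomputed, chooseRootVertex can be sketched as follows (parameter names are illustrative):

```python
def choose_root_vertex(p_edges, p_deg, num_matching_edges, num_cand):
    """Sketch of chooseRootVertex. num_matching_edges maps a pattern edge
    to its count of matching data edges; num_cand maps a pattern vertex
    to its count of matching data vertices; p_deg is the pattern degree."""
    # Step 1: the pattern edge with the fewest matching data edges.
    v, vp = min(p_edges, key=lambda e: num_matching_edges[e])
    # Step 2: fewer candidates wins; on a tie, larger degree wins.
    return min((v, vp), key=lambda x: (num_cand[x], -p_deg[x]))
```

In the tie case the key `(num_cand[x], -p_deg[x])` compares candidate counts first and falls back to the (negated) degree, which realizes the rule stated above.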
Incremental maintenance of intermediate results
The central idea of update handling is to employ a delicate data structure to store and incrementally maintain partial solutions.
A concise representation
There has been a long tradition in the graph community of harnessing a tree structure for fast pattern matching/search [1, 8]. We follow this tradition, and conceive a succinct data structure for keeping partial solutions. P_{T} is constructed by removing the edges that are not in the spanning tree, i.e., non-tree edges, if P contains cycles. The vertices in P are partitioned according to their levels in the spanning tree, where the level of a vertex in P_{T} is its depth relative to the root vertex of P_{T}.
To keep partial solutions, we offer a concise representation named TreeMat, which stores, for each vertex of P_{T}, its matching vertices in the data graph G. Given a vertex v in P_{T}, its matching vertices in TreeMat are arranged into

match(⋅): the set of vertices {u} in G that map to v in some solutions to P_{T}; and

stree(⋅): the set of vertices {u} in G such that 1) the subtree residing at v matches the corresponding subtree at u via subgraph homomorphism [10], and 2) there does not exist a solution to P_{T} that maps v to u.
Here, subgraph homomorphism is obtained from subgraph isomorphism by removing the injectivity constraint. It can be seen that the two sets are mutually exclusive, and we use the general designation candidates of v, i.e., cand(v), to refer to the vertices in either match(v) or stree(v). The structure of TreeMat is then defined as follows.

It is a tree-like structure, and for each query vertex v in P_{T}, there is a node containing the candidates of v, which consists of the two sets match(v) and stree(v); and

there is an edge between u ∈cand(v) and \(u^{\prime }\in \textsf {cand}(v^{\prime })\) for adjacent query vertices v and \(v^{\prime }\) in TreeMat, if and only if edge \(\langle u,u^{\prime }\rangle \in G\).
It is noted that stree(v_{r}) of the root vertex v_{r} in P_{T} is empty, since P_{T} is also a subtree residing at v_{r}.
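A minimal sketch of one TreeMat node, holding the two mutually exclusive sets for a query vertex, is given below (the class layout is illustrative; the full structure also keeps adjacency lists between candidates of adjacent query vertices):

```python
class TreeMatNode:
    """One TreeMat node for a query vertex v of the structural tree."""

    def __init__(self):
        self.match = set()  # u mapped to v in some full solution to P_T
        self.stree = set()  # subtree-only matches: no full solution maps v to u

    def cand(self):
        """cand(v): match(v) and stree(v) are mutually exclusive by
        construction, so their union is the candidate set."""
        return self.match | self.stree
```

For the root vertex v_{r}, the `stree` set stays empty, as noted above.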
Example 1
Figure 4b shows the TreeMat for P_{T} (Figure 4a) and initial data graph G_{0}. Given a vertex v in T, the orange square in cand(v) represents a data vertex u ∈stree(v); and the black square in cand(v) represents a data vertex u ∈match(v). Furthermore, we can see that the root vertex v_{1} of P_{T} only has the set match(⋅).
Remark 1
As pointed out in [10], existing work on continuous subgraph matching caches either a set of partial solutions or a set of candidate vertices for each query vertex. These paradigms incur not only great memory overhead but also large computational cost. In contrast, our model takes a more eager strategy, and proposes to keep complete solutions (in match(⋅)) as well as likely solutions (in stree(⋅)). In this way, we keep TreeMat from filling up the main memory while offering guidance to efficiently derive affected answers.
Data graph change-oriented rationale of maintenance
In this subsection, we propose a vertex state transition strategy (denoted as VST) to efficiently maintain the intermediate results.
When an edge update operation \(\langle u,u^{\prime }\rangle \) arrives, we try to match it with an edge \(\langle v,v^{\prime }\rangle \) in P_{T}. Here, the level of v is deemed to be smaller than the level of \(v^{\prime }\). Then, we use VST to maintain the TreeMat. We say a data vertex u is in the state NULL if u∉cand(v). Figure 1 shows the state transition diagram, consisting of three states and six transition rules (Transitions 1–6), which demonstrates how one state transits to another. Here, Transitions 1–3 are triggered by edge insertion, and Transitions 4–6 are triggered by edge deletion.
Handling edge insertion
Consider an edge \(\langle u,u^{\prime }\rangle \) inserted into G_{0}, to which \(\langle v,v^{\prime }\rangle \) is matched in P_{T}. Let v be the parent vertex of \(v^{\prime }\).
From NULL to match. Suppose that u ∈match(v) and \(u^{\prime }\in \textsf {NULL}\). If \(v^{\prime }\) is a leaf vertex, then we add \(u^{\prime }\) into \(\textsf {match}(v^{\prime })\).
Suppose that v is the root vertex in P_{T}, \(u^{\prime }\in \textsf {cand}(v^{\prime })\) and u ∈NULL. For each child vertex v_{c} of v except \(v^{\prime }\), if v_{c} is a leaf vertex, we check whether there is an edge 〈u,u_{c}〉 matching 〈v,v_{c}〉; otherwise we further check whether u_{c} ∈cand(v_{c}). If so, we add vertex u into match(v). Specifically, if v_{c} is a leaf vertex and u_{c} ∈NULL, we should also add vertex u_{c} into match(v_{c}).
From NULL to stree. Suppose that u ∈NULL and \(u^{\prime }\in \textsf {cand}(v^{\prime })\). Here, v is not the root vertex in P_{T}. For each child vertex v_{c} of v except \(v^{\prime }\), if v_{c} is a leaf vertex, we check whether there is an edge 〈u,u_{c}〉 matching 〈v,v_{c}〉; otherwise we further check whether u_{c} ∈cand(v_{c}). If so, we add vertex u into stree(v). Specifically, if v_{c} is a leaf vertex and u_{c} ∈NULL, we should also add vertex u_{c} into stree(v_{c}).
Suppose that the data vertex u is added into stree(v). For each u_{p} ∈NULL that is adjacent to u, if 〈u,u_{p}〉 matches 〈v,v_{p}〉 where v_{p} is the parent vertex of v, we further check whether u_{p} can be added into stree(v_{p}) in a similar manner (Fig. 2).
From stree to match. Suppose that \(u^{\prime }\in \textsf {stree}(v^{\prime })\) and u ∈match(v). Then we move \(u^{\prime }\) from \(\textsf {stree}(v^{\prime })\) to \(\textsf {match}(v^{\prime })\).

Suppose that the data vertex u is added into match(v). For each child vertex v_{c} of v, if there is a vertex u_{c} in stree(v_{c}) that is adjacent to u in TreeMat, then we move u_{c} from stree(v_{c}) to match(v_{c}).
Example 2
Figure 2c–h give examples of the vertex state transition strategy for edge insertion, where Figure 2c–d show the NULL-to-match strategy, Figure 2e–f show the NULL-to-stree strategy, and Figure 2g–h show the stree-to-match strategy. In Figure 2c, the edge insertion Δo_{1} matches 〈v_{4},v_{7}〉 where u_{6} ∈match(v_{4}). Since v_{7} is a leaf vertex in P_{T}, we add u_{17} to match(v_{7}). In Figure 2d, the edge insertion Δo_{2} matches 〈v_{1},v_{2}〉 where u_{2} ∈match(v_{2}). Since v_{1} is the root vertex in P_{T} and 〈u_{18},u_{4}〉 matches 〈v_{1},v_{3}〉 with u_{4} ∈match(v_{3}), we add u_{18} into match(v_{1}). In Figure 2e, the edge insertion Δo_{3} matches 〈v_{4},v_{7}〉 where u_{14} ∈stree(v_{7}). Since v_{4} has no child vertex except v_{7}, we add u_{19} into stree(v_{4}). In Figure 2f, there is a neighbor u_{20} of u_{19} such that 〈u_{19},u_{20}〉 matches 〈v_{4},v_{2}〉. Since 〈u_{20},u_{9}〉 matches 〈v_{2},v_{5}〉, we further add u_{20} into stree(v_{2}). In Figure 2g, the edge insertion Δo_{4} matches 〈v_{4},v_{2}〉 where u_{2} ∈match(v_{2}) and u_{7} ∈stree(v_{4}). We then move u_{7} from stree(v_{4}) to match(v_{4}). In Figure 2h, we further check the data vertices in stree(v_{7}) where v_{7} is the child vertex of v_{4}. Since u_{13} and u_{14} are the neighbors of u_{7} in stree(v_{7}), we move u_{13} and u_{14} from stree(v_{7}) to match(v_{7}).
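The downward stree-to-match propagation illustrated in Figure 2g–h can be sketched as follows (the dictionary layout of TreeMat, with a per-candidate adjacency map towards the candidates of the child vertex, is a simplification assumed for illustration):

```python
def promote_stree_to_match(treemat, children, v0, u0):
    """Once u0 enters match(v0), every candidate of a child query vertex
    that is adjacent to it in TreeMat moves from stree to match,
    recursively. treemat[v] = {'match': set, 'stree': set,
    'adj': {u: its TreeMat neighbors among the child's candidates}}
    (single child per adjacency map assumed for simplicity)."""
    stack = [(v0, u0)]
    while stack:
        v, u = stack.pop()
        for vc in children.get(v, []):
            node = treemat[vc]
            for uc in list(node['stree']):
                if uc in treemat[v]['adj'].get(u, set()):
                    node['stree'].discard(uc)
                    node['match'].add(uc)
                    stack.append((vc, uc))  # propagate further downwards
```

Replaying Example 2g–h: with u_{7} in match(v_{4}) and u_{13}, u_{14} adjacent to it in stree(v_{7}), both are promoted to match(v_{7}).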
Handling edge deletion
Consider an edge \(\langle u,u^{\prime }\rangle \) deleted from G_{0}, to which \(\langle v,v^{\prime }\rangle \) is matched in P_{T}. Let v be the parent vertex of \(v^{\prime }\).
From match to NULL. Suppose that u ∈match(v) and \(u^{\prime }\in \textsf {match}(v^{\prime })\). If there is no data vertex in \(\textsf {match}(v^{\prime })\) other than \(u^{\prime }\) that is adjacent to u, we delete u from match(v). Specifically, if \(v^{\prime }\) is a leaf vertex, and there is no other data vertex in cand(v) that is adjacent to \(u^{\prime }\), we delete \(u^{\prime }\) from \(\textsf {match}(v^{\prime })\).
Suppose that u is deleted from match(v). For each neighbor u_{p} of u in match(v_{p}) where v_{p} is the parent of v, if there is no other data vertex in match(v) that is adjacent to u_{p}, then we delete u_{p} from match(v_{p}).
From match to stree. Suppose that u ∈match(v) and \(u^{\prime }\in \textsf {match}(v^{\prime })\). If there is no other data vertex in match(v) that is adjacent to \(u^{\prime }\), then we move \(u^{\prime }\) from \(\textsf {match}(v^{\prime })\) to \(\textsf {stree}(v^{\prime })\). Specifically, if \(v^{\prime }\) is a leaf vertex, we further check whether there is a vertex in stree(v) that is adjacent to \(u^{\prime }\); if so, we move \(u^{\prime }\) from \(\textsf {match}(v^{\prime })\) to \(\textsf {stree}(v^{\prime })\).
From stree to NULL. Suppose that u ∈stree(v) and \(u^{\prime }\in \textsf {cand}(v^{\prime })\). If there is no other data vertex in \(\textsf {cand}(v^{\prime })\) that is adjacent to u, we then delete u from stree(v). Specifically, if \(v^{\prime }\) is a leaf vertex in P_{T} and \(u^{\prime }\in \textsf {stree}(v^{\prime })\), we further check whether there is a data vertex in stree(v) that is adjacent to \(u^{\prime }\). If not, we delete \(u^{\prime }\) from \(\textsf {stree}(v^{\prime })\).
Suppose that the vertex u is deleted from stree(v). For each neighbor u_{p} of u in stree(v_{p}) where v_{p} is the parent of v, if there is no other data vertex in cand(v) that is adjacent to u_{p}, then we delete u_{p} from stree(v_{p}).
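The upward cascade triggered by a deletion can be sketched analogously (a simplified dictionary layout of TreeMat is assumed for illustration, where `adj` maps a candidate of a vertex to its TreeMat neighbors among the candidates of its child, one child per vertex for simplicity):

```python
def cascade_match_deletion(treemat, parent, v0, u0):
    """After u0 is deleted from match(v0), a parent candidate up is in
    turn deleted from match(vp) when no remaining vertex of match(v0)
    is adjacent to it in TreeMat; the cascade continues to the root."""
    treemat[v0]['match'].discard(u0)
    stack = [(v0, u0)]
    while stack:
        v, u = stack.pop()
        vp = parent.get(v)
        if vp is None:  # reached the root of P_T
            continue
        for up in list(treemat[vp]['match']):
            child_adj = treemat[vp]['adj'].get(up, set())
            # up was adjacent to the deleted u and has no surviving neighbor
            if u in child_adj and not (child_adj & treemat[v]['match']):
                treemat[vp]['match'].discard(up)
                stack.append((vp, up))
    return treemat
```

For instance, if u_{2} is the only candidate of the child supporting u_{1} in match of the parent, deleting u_{2} also evicts u_{1}.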
Pattern graph change-oriented rationale of maintenance
It can be seen that if the inserted (or deleted) edge is a non-tree edge, we do not update TreeMat, since it has no impact on TreeMat. Thus, the following exposition concentrates on tree edges.
Handling edge insertion
Consider a tree edge \(\langle v,v^{\prime }\rangle \) inserted into P_{T}, where \(v^{\prime }\) is the newly introduced vertex. Under this scenario, candidate vertices are only to be excluded from match(⋅) or stree(⋅), back to the NULL state, but not vice versa. To identify affected candidates, we check, for each vertex u in match(v), whether there is an edge \(\langle u,u^{\prime }\rangle \) with \(u^{\prime }\in \textsf {NULL}\) matching \(\langle v,v^{\prime }\rangle \). If not, we delete u from match(v); otherwise, we add vertex \(u^{\prime }\) into \(\textsf {match}(v^{\prime })\). stree(v) and \(\textsf {stree}(v^{\prime })\) can be updated in a similar fashion.
Moreover, when vertex u is excluded from the candidates of v, such update needs to be propagated upwards in TreeMat till the root vertex. Consider the parent vertex v_{p} of v, if u_{p} is the neighbor of u in match(v_{p}), and there is no vertex in match(v) that is adjacent to u_{p} in TreeMat, we exclude u_{p} from match(v_{p}).
Handling edge deletion
We discuss edge deletion in two cases based on whether the deletion involves a leaf vertex of P_{T}.
Case 1
Consider a tree edge \(\langle v,v^{\prime }\rangle \) with \(v^{\prime }\) as a leaf vertex. Note that in this case, NULL vertices are only to be included into match(⋅) or stree(⋅), but not vice versa. Intuitively, a vertex u of G_{0} is added into stree(v) only if, for each child vertex v_{c} of v except \(v^{\prime }\), there is a vertex u_{c} that is a candidate of v_{c} such that 〈u,u_{c}〉 matches 〈v,v_{c}〉.
Then, the update needs to be propagated upwards to the root of TreeMat. Suppose that vertex u is added into stree(v). For each vertex u_{p} that is adjacent to u such that 〈u_{p},u〉 matches 〈v_{p},v〉, if u_{p} ∈NULL, we check whether u_{p} can be added into stree(v_{p}) in a similar manner; otherwise, if u_{p} ∈match(v_{p}), we move u from stree(v) to match(v). In the other situation, when vertex u is added into match(v), we examine, for each child vertex v_{c} of v, whether there is a vertex u_{c} in stree(v_{c}) that is adjacent to u in TreeMat; if so, we move u_{c} from stree(v_{c}) to match(v_{c}).
Case 2
Consider a tree edge \(\langle v,v^{\prime }\rangle \) not involving any leaf vertex. This type of edge deletion breaks the connectivity of P_{T} but not that of P. Thus, a non-tree edge that connects \(v^{\prime }\) with an arbitrary vertex will become a tree edge. Intuitively, we choose, among all the non-tree edges incident to \(v^{\prime }\), the one whose other endpoint \(v^{\prime \prime }\) is closer to the root and has a smaller match(⋅) set.
Then, for each vertex \(u^{\prime \prime } \in \textsf {stree}(v^{\prime \prime })\), we check whether there is a candidate \(u^{\prime }\) of \(v^{\prime }\) such that \(\langle u^{\prime \prime },u^{\prime }\rangle \) matches \(\langle v^{\prime \prime },v^{\prime }\rangle \); if not, we exclude \(u^{\prime \prime }\) from \(\textsf {stree}(v^{\prime \prime })\), and further check the vertices in stree(v_{p}), where v_{p} is the parent of \(v^{\prime \prime }\). The update is propagated upwards till the root.
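Choosing the replacement non-tree edge described above can be sketched as follows (the inputs `level` and `match_size`, i.e., precomputed vertex depths in P_{T} and sizes of the match(⋅) sets, are assumptions for illustration):

```python
def choose_replacement_edge(nontree_edges, level, match_size, v_prime):
    """Among the non-tree edges incident to v_prime, pick the one whose
    other endpoint is closest to the root (smallest level), breaking
    ties by the smaller match(.) set; edges are frozensets of endpoints."""
    incident = [e for e in nontree_edges if v_prime in e]

    def other(e):
        a, b = tuple(e)
        return b if a == v_prime else a

    return min(incident,
               key=lambda e: (level[other(e)], match_size[other(e)]))
```

On Example 3, with 〈v_{5},v_{6}〉 and 〈v_{6},v_{8}〉 as candidates and v_{5} closer to the root, the edge 〈v_{5},v_{6}〉 is selected.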
Example 3
Figure 3d–h give examples of the updating process for edge insertions and a deletion of the pattern graph. In Figure 3d, since 〈v_{4},v_{5}〉 is a non-tree edge, we only add edge 〈v_{6},v_{10}〉 into P_{T}. In Figure 3e, since there is no vertex \(u^{\prime }\) adjacent to u_{11} such that \(\langle u_{11},u^{\prime }\rangle \) matches 〈v_{6},v_{10}〉, we remove u_{11} from stree(v_{6}). Accordingly, we remove the parent vertex u_{5} of u_{11} from stree(v_{3}). Moreover, since u_{10} ∈match(v_{6}), and there are two vertices u_{17} and u_{18} adjacent to u_{10} such that edges 〈u_{10},u_{17}〉 and 〈u_{10},u_{18}〉 match 〈v_{6},v_{10}〉, we add u_{17} and u_{18} into match(v_{10}). Figure 3f gives the updated TreeMat after edge insertion Δg_{2}. When the edge Δp_{1} is deleted from P, there are two non-tree edges 〈v_{5},v_{6}〉 and 〈v_{6},v_{8}〉 that can be translated into tree edges. Here, we translate 〈v_{5},v_{6}〉 into a tree edge, since match(v_{5}) = match(v_{8}) and v_{5} is closer to the root vertex v_{1}. The updated P_{T} and TreeMat are given in Figures 3g and h, respectively.
CEPDG algorithms
In this section, we present detailed algorithms for CEPDG. We develop efficient techniques for constructing TreeMat. When updating the TreeMat, we need only apply the necessary transition rules, which motivates an enhanced version of the maintenance algorithm for the TreeMat. We then conceive an effective cost model for estimating the stepwise cost of query pattern matching.
TreeMat construction
To construct TreeMat, constructTreeMat (Line 3 of Algorithm 1) (1) first generates cand(v) (the candidates of v) for each query vertex v in P_{T}; (2) then constructs the adjacency lists corresponding to query vertices and their parent vertices; and (3) finally divides cand(v) into stree(v) and match(v).
In the forward processing, we mark all the leaf vertices of P_{T} as visited and then process the query vertices level by level in a bottom-up fashion (Lines 1–20). In processing an unvisited vertex v, let N(v) denote the set of visited neighbors of v in P_{T} (Line 13). Intuitively, a data vertex u is in cand(v) only if, for each \(v^{\prime }\in N(v)\), there is a data vertex \(u^{\prime } \in \textsf {cand}(v^{\prime })\) such that \(\langle u,u^{\prime }\rangle \) matches \(\langle v,v^{\prime }\rangle \). Specifically, in the above process, if \(v^{\prime }\) is a leaf vertex, we need only verify whether there is a data vertex \(u^{\prime }\) such that \(\langle u,u^{\prime }\rangle \) matches \(\langle v,v^{\prime }\rangle \). To achieve this, we maintain a counter V (u) for each data vertex in G_{0} to count the number of visited query neighbors \(v^{\prime }\) of v that have a candidate \(u^{\prime }\) adjacent to u such that \(\langle u,u^{\prime }\rangle \) matches \(\langle v,v^{\prime }\rangle \). V (u) is updated at Lines 8–10. The candidate set cand(v) is the set of vertices u satisfying V (u) = |N(v)| (Lines 14–15). After generating cand(v), we further generate \(\textsf {cand}(v^{\prime })\) if \(v^{\prime }\) is a leaf vertex. That is, \(u^{\prime }\) is added to \(\textsf {cand}(v^{\prime })\) if there is a data vertex u ∈cand(v) such that \(\langle u,u^{\prime }\rangle \) matches \(\langle v,v^{\prime }\rangle \) (Lines 16–18).
At the same time, we construct the adjacency lists corresponding to vertex v and its parent vertex v_{p} in P_{T} (Line 19). That is, for each data vertex u ∈cand(v_{p}), an adjacency list \(N_{v}^{v_{p}}(u)\) is constructed, which is the set of data vertices \(\{u^{\prime }\}\) in cand(v) such that \(\langle u^{\prime },u\rangle \) matches 〈v_{p},v〉. Then, we mark v as visited, and reset V (u) to 0 for every vertex u that has a positive count (Line 18).
In the backward processing, we reprocess the query vertices of P_{T} in a top-down manner to divide cand(v) into match(v) and stree(v) for each query vertex v. Firstly, we set match(v_{r}) = cand(v_{r}) for the root vertex v_{r}, since P_{T} is itself a subtree residing at v_{r}. Then, we process vertices downwards according to their levels. In processing a query vertex v, let v_{p} denote the parent vertex of v. For each data vertex u in cand(v), we check whether there is a data vertex u_{p} in match(v_{p}) that is adjacent to u. If so, we move u to match(v); otherwise we move u to stree(v) (Lines 24–26).
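The backward (top-down) pass that splits cand(v) into match(v) and stree(v) can be sketched as follows (`adjacent` is an assumed predicate testing adjacency of two data vertices in TreeMat; vertices are given in BFS order, root first):

```python
def split_candidates(order, parent, cand, adjacent):
    """Top-down split of cand(v) into match(v) and stree(v):
    a candidate u of v goes to match(v) iff some candidate of v's
    parent that is already in match is adjacent to u."""
    match, stree = {}, {}
    root = order[0]
    match[root], stree[root] = set(cand[root]), set()  # root has no stree
    for v in order[1:]:
        vp = parent[v]
        match[v], stree[v] = set(), set()
        for u in cand[v]:
            if any(adjacent(u, up) for up in match[vp]):
                match[v].add(u)
            else:
                stree[v].add(u)
    return match, stree
```

Processing in BFS order guarantees that match(v_{p}) is finalized before any child v is split, mirroring the level-wise traversal described above.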
Lemma 1
The worst storage complexity of TreeMat is \(O(E_{G_{0}}\times V_{P_{T}})\).
Proof
The TreeMat stores at most \(E_{G_{0}}\) edges for each pattern vertex in P_{T} and thus, its worst storage complexity is \(O(E_{G_{0}}\times V_{P_{T}})\). □
Lemma 2
The worst time complexity of constructTreeMat is \(O(E_{G_{0}}\times E_{P_{T}})\).
Proof
In the worst case, constructTreeMat is called for every query vertex v and every data vertex u. We show that the forward processing of a specific v takes time \(O(E_{G_{0}}\times N(v))\). In particular, for each data vertex \(u^{\prime }\in \textsf {cand}(v^{\prime })\), it takes \(O(deg(u^{\prime }))\) time to check whether \(\langle u,u^{\prime }\rangle \) matches \(\langle v,v^{\prime }\rangle \), where \(deg(u^{\prime })\) is the degree of \(u^{\prime }\); thus, for all vertices in \(\textsf {cand}(v^{\prime })\), the checking takes \(O({\sum }_{u^{\prime }\in \textsf {cand}(v^{\prime })}deg(u^{\prime }))=O(E_{G_{0}})\) time. Similarly, the backward processing of a specific v takes \(O(E_{G_{0}})\) time. Thus, the total time for a specific v is \(O(E_{G_{0}}\times (N(v)+1))=O(E_{G_{0}}\times deg(v))\), where deg(v) is the degree of v in P_{T}, and the total running time of constructTreeMat is \(O({\sum }_{v\in P_{T}}E_{G_{0}}\times deg(v))=O(E_{G_{0}}\times E_{P_{T}})\). □
Edge updates on the data graph
Now, we explain GinsertEval (Algorithm 3), which is invoked for each edge insertion \(\langle u,u^{\prime }\rangle \). The main idea of GinsertEval is as follows: we try to match \(\langle u,u^{\prime }\rangle \) with tree edges in P_{T} and then update the TreeMat through the vertex state transition strategy. Note that there may be more than one query edge in P_{T} to which \(\langle u,u^{\prime }\rangle \) matches, and not all matching situations cause an update of TreeMat. For this purpose, we should exclude the invalid matching situations.
In order to exclude invalid matching situations, we first obtain the query edges in P_{T} with the same edge label as \(\langle u,u^{\prime }\rangle \). Let v be the parent of \(v^{\prime }\). Then, for each matched query edge \(\langle v,v^{\prime }\rangle \), we check whether \(u^{\prime }\in \textsf {cand}(v^{\prime })\); if not, it cannot cause an update of TreeMat and is ignored (Lines 1–3). For each valid matching situation, we execute chooseVST to check whether \(\langle u,u^{\prime }\rangle \) can cause an update of TreeMat (Line 5). If so, chooseVST chooses the corresponding transition rule and updates the states of u and \(u^{\prime }\). Moreover, chooseVST also checks whether the update caused by \(\langle u,u^{\prime }\rangle \) needs to be propagated upwards or downwards. If so, we set \(\textsf {TreeMat.getTransition}(\langle u,u^{\prime }\rangle )\)=true and update TreeMat by calling updateTreeMat (Algorithm 4) recursively (Lines 6–8). Here, updateTreeMat decides the update propagation direction (i.e., upwards or downwards) for the current iteration and executes the corresponding transition rule. Algorithms for edge deletions on the data graph are similar to those for edge insertions, except that they use Transitions 4–6 instead of Transitions 1–3; in the interest of space, the algorithm GdeleteEval (Line 9 of Algorithm 1) is omitted.
Edge updates on the pattern graph
In this subsection, we introduce PdeleteEval (Algorithm 5), which is invoked for each edge deletion \(\langle v,v^{\prime }\rangle \).
We first check whether \(\langle v,v^{\prime }\rangle \) is a nontree edge; if so, it will not cause an update of TreeMat (Lines 2–3). Otherwise, if \(v^{\prime }\) is a leaf vertex, some NULL vertices may be added into stree(v). In detail, if a vertex u satisfies: (1) u has the same label as v; (2) u∉cand(v); and (3) for each child vertex v_{c} of v except \(v^{\prime }\), there is a data vertex u_{c} ∈cand(v_{c}) that is adjacent to u, then we add u into stree(v) (Lines 4–16). Note that, if v_{c} is a leaf vertex, we should further check whether there is an edge 〈u,u_{c}〉 matching 〈v,v_{c}〉 with u_{c} in state NULL; if so, we add u_{c} into stree(v_{c}) (Line 17). After that, we call updateTreeMat (Algorithm 4) recursively to update TreeMat based on the status of u (Line 18). Moreover, if \(v^{\prime }\) is not a leaf vertex, we should translate the nontree edge with an endpoint of \(v^{\prime }\) into a tree edge. In this case, we also set the status of all the candidates of \(v^{\prime }\) and of the descendants of \(v^{\prime }\) as stree (Line 20). Next, we update stree(v) in a similar way to Lines 5–18. Adding a nontree edge into P_{T} may cause some candidate vertices to be excluded. As a result, we should further check, for each vertex \(u^{\prime \prime }\in \textsf {cand}(v^{\prime \prime })\), whether there is a vertex in \(\textsf {cand}(v^{\prime })\) that is adjacent to \(u^{\prime \prime }\). If not, we remove \(u^{\prime \prime }\) from \(\textsf {cand}(v^{\prime \prime })\); otherwise, we call updateTreeMat (Algorithm 4) recursively to update TreeMat based on the status of \(u^{\prime \prime }\) (Lines 22–26). The update is propagated upwards till the root vertex (Line 27).
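Conditions (1)–(3) for adding a NULL vertex into stree(v) can be expressed as a predicate. The sketch below is illustrative Python, not the authors' code; `labels`, `cand`, `adj`, and `children` are assumed container shapes.

```python
def becomes_stree_candidate(u, v, deleted_child, labels, cand, adj, children):
    """Check whether data vertex u should be added into stree(v) after the
    tree edge (v, deleted_child) is removed and deleted_child is a leaf."""
    if labels[u] != labels[v]:
        return False                  # (1) u must share v's label
    if u in cand[v]:
        return False                  # (2) u must not already be a candidate of v
    for v_c in children[v]:
        if v_c == deleted_child:
            continue
        # (3) every remaining child must have a candidate adjacent to u
        if not any(u_c in adj[u] for u_c in cand[v_c]):
            return False
    return True
```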
Algorithms for edge insertions on the pattern graph are similar to those for edge deletions under the situation that \(v^{\prime }\) is not a leaf vertex. In the interest of space, the algorithm PinsertEval (Line 11 of Algorithm 1) is not described here.
Cost-driven pattern matching
The pattern evaluation phase harvests complete solutions to pattern graphs by leveraging TreeMat. We seek to boost performance by conducting exploration on TreeMat.
Standard backtracking is viable but inefficient, as it neglects the matching order, which may greatly affect performance. A classic cost model for generic graph pattern matching [1, 12] is as follows. Assume the total cost is proportional to the number of comparisons for determining whether a vertex (or an edge) matches. Given an arbitrary order of vertices (v_{1},v_{2},…,v_{n}) for P, the number of comparisons performed in a backtracking algorithm is \(T_{iso}={\sum }_{i=1}^{n}{\sum }_{j=1}^{\left | M_{i-1} \right |} {d_{i}^{j}} (r_{i}+1)\),
where M_{i} represents the set of intermediate results for the subgraph of P induced by (v_{1},v_{2},…,v_{i}), \({d_{i}^{j}}\) is the number of vertices in match(v_{i}) joinable with the jth intermediate result in M_{i−1}, and r_{i} is the number of nontree edges between v_{i} and the vertices before v_{i} in the matching order.
Nonetheless, r_{i} largely depends on the actual order. The total number of configurations of r_{i} is \(O(\left | V_{P} \right |!)\), and thus, it is prohibitively expensive to optimize T_{iso} online. In response, we choose to minimize T_{iso} greedily, i.e., each time we choose the vertex of minimum cost on the basis of the current intermediate results. Then, to match vertex v_{i}, the number of comparisons concerning v_{i} can be expressed by \(T^{\prime }(v_{i})={\sum }_{j=1}^{\left | M_{i-1} \right |} {d_{i}^{j}} (r_{i}+1)\).
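The greedy selection can be sketched in a few lines. This is an illustrative Python sketch under assumed inputs: `d[v]` holds the joinable-candidate counts \(d_v^j\) of vertex v over the current intermediate results, and `r[v]` holds its number of nontree back-edges.

```python
def step_cost(d_v, r_v):
    """T'(v) = sum over intermediate results j of d_v^j * (r_v + 1)."""
    return sum(d_v) * (r_v + 1)

def next_vertex(unmatched, d, r):
    """Greedily pick the unmatched pattern vertex with minimum step cost."""
    return min(unmatched, key=lambda v: step_cost(d[v], r[v]))
```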
In addition, we unveil that the advantage of harnessing TreeMat also comes from the derivation of \({d_{i}^{j}}\) given M_{j}, which is inaccessible in pattern matching. Recall that a likelihood estimated over the entire topology graph is used to delegate \({d_{i}^{j}}\) [1, 12], which can be inaccurate. Lastly, to select the first vertex, we choose the one with minimum \(\frac {\left | \textsf {match}(v) \right |}{deg(v)}\), where deg(v) is the total degree of v.
The estimation above only considers the cost thus far (i.e., current cost), but ignores the cost from the vertices yet to be accessed (i.e., future cost). It is contended that combining current and future costs may provide rewarding guidance for future steps. However, it is nontrivial to precisely compute the actual intermediate results after mapping v_{i}. To this end, we heuristically estimate the number of intermediate results as
where \({p_{j}^{i}}\) is the likelihood that a vertex in \({d_{i}^{j}}\) has an edge satisfying the restriction of the jth nontree edge of v_{i} connecting to a vertex that has been accessed. Then, we estimate the number of intermediate results for each vertex that has not been accessed. Let v_{k} be an unvisited vertex; the number of intermediate results predicted for v_{k} is \(\left | \textsf {match}(v_{k}) \right |\times {\prod }_{j=0}^{n-1} {p_{j}^{k}}\). Thus, the summation becomes the total number of intermediate results predicted for all the vertices that have not been accessed. Then, the future cost of mapping vertex v_{i} can be expressed by
where r_{k} represents the number of vertices that have been accessed, except the parent of v_{i}, that have edges connected to unaccessed vertices.
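The per-vertex prediction described above can be sketched as follows (illustrative Python; the inputs are assumed shapes, with `probs` holding the nontree-edge likelihoods \(p_j^k\) of an unvisited vertex):

```python
from math import prod

def predicted_result_count(match_size, probs):
    """Predicted number of intermediate results for an unvisited vertex v_k:
    |match(v_k)| times the product of its nontree-edge likelihoods p_j^k."""
    return match_size * prod(probs)
```

Summing `predicted_result_count` over all unvisited vertices gives the total prediction that feeds the future-cost term.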
Overall, the cost of mapping v_{i} can be estimated by \(T^{\prime }(v_{i})+T^{\prime \prime }(v_{i})\). Experiments show that it provides better guidance to the matching process, in comparison with alternative strategies.
Example 4
Consider the pattern graph and the match(⋅) set of TreeMat in Figure 4. v_{1} is set as the root vertex since \(\frac {\left | \textsf {match}(v_{1}) \right |}{2}\) is minimum. Suppose that the vertices v_{1} and v_{3} have been matched. At this time, the number of intermediate results is 2, and we are going to choose the next vertex. If we choose v_{5}, the number of comparisons is 1 + 2 = 3; if we choose v_{2}, the number of comparisons is 8 × 2 = 16. According to the greedy selection that only considers the current matching cost, we would choose v_{5} as the next vertex, and the eventual total number of comparisons is 1 + 2 + 3 + 12 × 2 + 1 = 31. However, if we take the future matching cost into account, we will choose v_{2} as the next vertex, and the total number of comparisons is 1 + 2 + 8 × 2 + 1 + 1 = 21, which is smaller than 31.
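The arithmetic of the example can be checked directly; the comparison counts below are the illustrative numbers quoted in the text, not derived from an actual graph.

```python
# Comparisons already spent matching v1 and v3
base = 1 + 2

# Greedy (current cost only): pick v5 first (3 comparisons), then v2
# against the enlarged intermediate results (12 * 2), then finish (1).
greedy_total = base + 3 + 12 * 2 + 1

# Lookahead (current + future cost): pick v2 first (8 * 2), then v5 (1),
# then finish (1).
lookahead_total = base + 8 * 2 + 1 + 1

print(greedy_total, lookahead_total)  # 31 21
```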
Correctness and complexity
Based on the discussion, we can implement a procedure for choosing the next vertex for matching. Note that using the new cost model may bring lower cost. While the details of the procedure are omitted in the interest of space, it can be seen that the procedure runs in \(O\left (\left | E_{G^{*}} \right | \times \left | V_{P^{*}} \right | \times \left | E_{P^{*}} \right | \right )\), where G^{∗} and P^{∗} are the updated data graph and the updated pattern graph, respectively.
Remark 2
In comparison with existing cost models for pattern matching and order selection, the proposed model and algorithm are advantageous in the sense that

As identified by existing work [1], TurboFlux [10] fails to be applicable to large and complex query patterns; in contrast, CEPDG lends itself to large and complex queries against the more difficult matching criterion of subgraph isomorphism;

Compared with QuickSI [12], which merely concentrates on a local cost with a greedy strategy, our proposed cost model generates a more effective matching order that takes both current and future costs into account, and hence, reduces a large number of unpromising intermediate results;

In comparison with CFL [1], which implements a path-based cost model, our model adopts an edge-based cost model, and thus, is more flexible and less computationally expensive, while retaining the quality of order selection.
It can be seen that the cost-driven matching algorithm heavily relies on a good estimation of cand(⋅): the more accurate the estimation, the better the guidance for matching ordering. In the sequel, we strive to offer a good estimation of candidates by leveraging an online saturation strategy with index support.
Experiments
In this section, we evaluate the performance of CEPDG against the state-of-the-art continuous subgraph matching methods, TurboFlux [10] and GraphFlow [9], on two real-life datasets. The source code of TurboFlux was obtained from its authors, and the source code of GraphFlow was downloaded from GitHub. Then, we report experimental results and analyses.
Experiment setup
The proposed algorithms were implemented in C++ and run on a Linux machine with two 2.2 GHz Intel Xeon CPUs and 32 GB main memory.
Datasets/Queries
We used two datasets, referred to as Yago and Netflow. Yago is a dataset that extracts facts from Wikipedia and integrates them with the WordNet thesaurus. This dataset consists of an initial graph G_{0} and a graph update stream Δg. G_{0} contains 12,375,749 triples, while Δg consists of insertions of 1,124,302 triples and deletions of 1,027,828 triples. Netflow contains anonymized passive traffic traces monitored from high-speed internet backbone links. In this dataset, G_{0} contains 14,378,113 triples, and Δg consists of insertions of 1,236,412 triples and deletions of 1,107,635 triples.
As the datasets do not come with patterns, we comprehensively generated various patterns as follows. We first defined 4 pattern categories (\(A1\sim A4\)), and then extracted 20 patterns for each category by randomly traversing the topology graph. The size of patterns in A1, A2, A3 and A4 is 15, 20, 25 and 30, respectively. Then, for each graph pattern, to generate the update stream, each time we (1) randomly removed an existing edge while keeping the pattern graph connected; and (2) randomly added an edge between two disconnected vertices with a random edge label conforming to a uniform distribution. Note that the number of edge insertions/deletions for each pattern graph did not exceed half of the pattern size (≤ 50%); otherwise, fundamental characteristics of the pattern would disappear.
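The two-step generation above can be sketched as follows. This is an illustrative Python sketch, not the authors' generator; edge labels are omitted, vertices are assumed to be numbered 0..n−1, and all names are assumptions.

```python
import random

def gen_pattern_updates(edges, n_vertices, steps, seed=0):
    """Each step deletes a random edge whose removal keeps the pattern
    connected, then inserts a random edge between two currently
    non-adjacent vertices."""
    rng = random.Random(seed)
    edges = set(edges)
    stream = []

    def connected(es):
        # BFS from vertex 0 over an undirected adjacency built from es
        adj = {i: set() for i in range(n_vertices)}
        for a, b in es:
            adj[a].add(b)
            adj[b].add(a)
        seen, todo = {0}, [0]
        while todo:
            for y in adj[todo.pop()]:
                if y not in seen:
                    seen.add(y)
                    todo.append(y)
        return len(seen) == n_vertices

    for _ in range(steps):
        # (1) remove an existing edge while keeping the pattern connected
        removable = [e for e in edges if connected(edges - {e})]
        e = rng.choice(removable)
        edges.discard(e)
        stream.append(("del", e))
        # (2) add an edge between two currently non-adjacent vertices
        absent = [(a, b) for a in range(n_vertices)
                  for b in range(a + 1, n_vertices)
                  if (a, b) not in edges and (b, a) not in edges]
        f = rng.choice(absent)
        edges.add(f)
        stream.append(("ins", f))
    return stream, edges
```

Note that step (1) requires the current pattern to contain at least one cycle; since step (2) always re-adds an edge, a pattern that starts with a cycle keeps one across steps.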
Algorithms
Since there is no existing research directly targeting our problem, two state-of-the-art algorithms were adapted for comparison with our proposed algorithm CEPDG: 1) TurboFlux [10], an algorithm for pattern matching over dynamic graphs, which has to recompute its auxiliary data structure during an update in order to deal with an evolving pattern graph; and 2) GraphFlow [9], an incremental algorithm that does not maintain intermediate results.
Unless specified otherwise, values in boldface in Table 2 are used as default parameters in the experiments.
Evaluation of data graph updates
We use two measures, the average elapsed time and the size of intermediate results. Note that, for a fair comparison, we exclude the elapsed time for updating the data graph. That is, we set the average elapsed time of CEPDG as the difference between the time for processing the graph update stream with and without continuous query answering, and measure the time of the competitors for query processing only. Moreover, we conduct experiments by inserting/deleting edges in batches of 10K (= 10 × 10^{3}). Inserting/deleting edges in batches means that we only calculate matching results after all the edges in a batch have been added into or removed from the data graph. We set a 1-hour timeout for each query.
Varying pattern size
Figure 5 shows the performance results on the Yago dataset. Here, we set edge insertions/deletions to 500K (= 500 × 10^{3}) and vary the query size from 15 to 30. Figure 5(1) shows the average elapsed time. CEPDG behaves better than its competitors regardless of pattern size. Specifically, CEPDG outperforms TurboFlux by \(2.28\sim 3.13\) times, and GraphFlow by \(36.67\sim 44.28\) times. The reason is that GraphFlow does not maintain any intermediate results and hence generates a much larger number of partial solutions than CEPDG and TurboFlux. CEPDG only needs to update part of the intermediate results for an edge update operation, so even when |E(P)| is large, CEPDG can still achieve better performance. Moreover, CEPDG significantly reduces the time cost by exploiting the cost model in the pattern matching process. Figure 5(2) shows the average number of intermediate results. Since GraphFlow does not maintain any intermediate results, we only compare CEPDG with TurboFlux. Specifically, the average size of intermediate results of TurboFlux is larger than that of CEPDG by \(1.28\sim 1.54\) times. This means that the representation used by CEPDG (TreeMat) is more concise than that of TurboFlux.
Figure 6 shows the performance results on the Netflow dataset. CEPDG behaves better than its competitors in both average elapsed time and average size of intermediate results, regardless of pattern size. Specifically, in Figure 6(1), CEPDG outperforms TurboFlux by up to 2.86 times, and GraphFlow by up to 90.72 times; in Figure 6(2), the average size of intermediate results of TurboFlux is larger than that of CEPDG by up to 1.47 times. This is because Netflow has only eight edge labels and no vertex labels. Hence, the size of intermediate results is enormous, and the time costs of TurboFlux and GraphFlow are very high.
Varying edge insertion size
In this subsection, we evaluate the impact of edge insertions on the data graph on the performance of CEPDG and its competitors. Here, we fixed the patterns in A3 and varied the number of newly inserted edges from 250K (= 250 × 10^{3}) to 1000K in 250K increments on Yago. Thus, the number of total update operations also increases accordingly. Figure 7(1) shows the processing time for each algorithm. We see that CEPDG has consistently better performance than its competitors. Moreover, the figure shows a non-exponential increase as the edge insertion size grows. Specifically, CEPDG outperforms TurboFlux by up to 2.44 times, and GraphFlow by up to 46.78 times, at edge insertion size 1000K. CEPDG also outperforms its competitors in terms of the size of intermediate results, as shown in Figure 7(2). Specifically, the size of intermediate results of TurboFlux is larger than that of CEPDG by up to 1.43 times when the insertion size is 1000K.
Varying edge deletion size
In this subsection, we evaluate the impact of edge deletions on the data graph on the performance of CEPDG and its competitors. Here, we fixed the patterns in A3 and varied the number of deleted edges from 250K (= 250 × 10^{3}) to 1000K in 250K increments on Yago. Figure 8(1) shows the processing time for each algorithm. Note that the gap between the performance of CEPDG and TurboFlux is larger than that in Figure 7(1). This is because the deletion of an edge \((u,u^{\prime })\) could affect all subtrees of \(u^{\prime }\) in TurboFlux; in CEPDG, however, we only need to maintain the affected vertices in TreeMat, which are relatively few. Note also that the processing time of GraphFlow slightly decreases as more edges are deleted. This is because the edge deletions directly reduce the input data size of GraphFlow. Specifically, CEPDG outperforms TurboFlux by up to 3.16 times, and GraphFlow by up to 74.51 times. CEPDG also outperforms its competitors in terms of the size of intermediate results, as shown in Figure 8(2). Specifically, the size of intermediate results of TurboFlux is larger than that of CEPDG by up to 1.25 times when the deletion size is 1000K.
Varying the data size
In this test, we evaluate the scalability of CEPDG against the existing algorithms by varying the dataset size of Yago. Here, we fixed the patterns in A3, set edge insertions/deletions to 500K (= 500 × 10^{3}), and randomly sampled about 20% to 100% of the Yago dataset so that the data and result distributions remain approximately the same as for the whole dataset. Then, we plot the total processing time and the size of intermediate results in Figure 9.
It is revealed that CEPDG consistently outperforms its competitors regardless of the dataset size. In general, CEPDG and TurboFlux show similar trends for all sizes of datasets. This can be attributed to the proposed pruning and validation technique, which dramatically reduces the required sample size and maintains the intermediate results incrementally. The scalability suggests that CEPDG and TurboFlux can handle reasonably large real-life graphs, as existing algorithms for deterministic graphs do. Specifically, CEPDG outperforms TurboFlux by up to 2.14 times, and GraphFlow by up to 37.57 times. Figure 9(2) shows similar scalability of intermediate result sizes for CEPDG and TurboFlux. The size of intermediate results of TurboFlux is larger than that of CEPDG by up to 1.64 times.
Evaluating the effectiveness of the cost model
In this subsection, we evaluate the effectiveness of our proposed cost model. We compare the time cost of pattern matching with the state-of-the-art algorithm CFL [1] over the Yago and Netflow datasets, respectively. Since the size of candidates is also a key factor affecting the running time, besides the matching order, for a fair comparison we use the same candidate set for every pattern vertex in both solutions. Here, we use the match(⋅) set of TreeMat as candidates and plot the running time in Figure 10.
It is revealed that our proposed cost model never performs worse than that of CFL. Specifically, it can help lower the time cost by a factor of 10. The reason is that CFL implements a path-based cost model: the path selected each time is the one with minimal growth in result size, and after dealing with this path, a new growing path is selected. Compared with CFL, our cost model does not estimate the cost for each path, but analyses the cost for each edge and considers the costs of the current and next steps. Adjusting the cost model is more flexible after joining an edge than after joining a path. The results also suggest that our cost model is close to the real cost of the join process; otherwise, the new join strategy would not work well and might choose a poor edge in some steps, which would make the cost of the join process high.
Evaluation of pattern graph updates
In this section, we measure the average elapsed time and the size of intermediate results of CEPDG and TurboFlux.
Comparison of different matching orderings
In this set of experiments, we ran CEPDG using patterns in A3 and measured the average elapsed time for a unit insertion/deletion, applying three different ordering strategies—randomly choosing an order, greedily choosing as per the current cost, and greedily choosing as per the estimated overall cost (our method). The results are within expectation: our proposed strategy outperforms the first strategy by 20.12x/23.74x, and the second strategy by 7.26x/8.12x, for each edge insertion/deletion.
Varying pattern size
In this set of experiments, we demonstrate the advantage of CEPDG regarding pattern updates. We used two measures—average elapsed time for a unit insertion/deletion and size of partial solutions. Figure 11 shows the comparison for edge insertion, where we varied the pattern size from 15 to 30 (\(A1\sim A4\)). Note that the matching cost does not always increase as the pattern size increases. Specifically, in Figure 11(1), CEPDG outperforms TurboFlux by up to 4.36 times. Note that TurboFlux has to recompute its auxiliary data structure, which is not rewarding under this setting. Figure 11(2) shows the size of partial solutions, which shows that the size of partial results of CEPDG is smaller than that of TurboFlux. This is intuitive: the representation used by CEPDG (TreeMat) is more concise than that of TurboFlux.
Figure 12 shows the comparison for edge deletion, and trends similar to those in Figure 8 are observed. Figure 12(1) shows the average elapsed time of the two algorithms. CEPDG significantly outperforms TurboFlux in all cases; specifically, by up to 3.43 times. Further, the average elapsed time for a deletion is much longer than that for an insertion, which suggests that deleting an edge from the pattern graph may be more computationally expensive. Figure 12(2) shows the average size of partial solutions. The average size of partial solutions of TurboFlux is larger than that of CEPDG (TreeMat) by up to 1.14 times.
Varying edge update volume
In this set of experiments, we evaluate the impact of the number of edge updates on the performance of CEPDG (and its alternatives). We fixed the patterns in A3 and varied the number of edge updates from 3 to 12 in increments of 3. Figure 13 shows the average elapsed time for each algorithm. We see that CEPDG has better performance than the others. In Figure 13(1), a non-exponential increase is witnessed as the number of insertions grows, and CEPDG beats TurboFlux by up to 26.12 times. In Figure 13(2), for edge deletion, the performance gap is slightly larger than that for edge insertion, and CEPDG is more efficient than TurboFlux by up to 29.82 times.
Related work
This section categorizes related work on graph pattern matching into two streams: static and dynamic.
Graph pattern matching
Graph pattern matching is typically defined in terms of subgraph isomorphism [4], which has been studied extensively since 1976. A key issue in subgraph isomorphism is to reduce the number of unpromising intermediate results when iteratively mapping vertices one by one from a pattern graph to a data graph. VF2 [4] and QuickSI [12] enforce connectivity to prune the candidates. TurboISO [8] merges the nodes in a pattern graph with the same labels and the same neighborhoods to further reduce unpromising candidates. Another key issue is to generate an effective matching order. QuickSI [12] generates a matching order based on an infrequent-labels-first strategy. SPath [17] generates a matching order based on an infrequent-paths-first strategy, but its efficiency degrades as the size of the pattern graph grows. Bi et al. [1] develop a new framework that decomposes a pattern graph into a core and a forest for graph pattern matching; they showed that the core-forest-leaf ordering effectively reduces redundant Cartesian products. Han et al. [7] propose novel techniques for subgraph matching: dynamic programming between a DAG and a graph, adaptive matching order with DAG ordering, and pruning by failing sets. These methods work well on static graphs; however, substantial work is needed to support dynamic graphs.
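To make the pruning and ordering discussion concrete, the following is a bare-bones backtracking matcher for (non-induced) subgraph isomorphism, written as an illustrative Python sketch; none of the cited systems' optimizations (candidate pruning, adaptive ordering, failing sets) are reproduced here.

```python
def subgraph_matches(p_adj, p_labels, g_adj, g_labels):
    """Enumerate injective mappings of pattern vertices to data vertices
    that preserve vertex labels and every pattern edge."""
    order = sorted(p_adj)                 # a fixed (naive) matching order
    results = []

    def backtrack(i, mapping):
        if i == len(order):
            results.append(dict(mapping))
            return
        v = order[i]
        for u in g_adj:
            if g_labels[u] != p_labels[v] or u in mapping.values():
                continue
            # every already-matched neighbor of v must map to a neighbor of u
            if all(mapping[w] in g_adj[u] for w in p_adj[v] if w in mapping):
                mapping[v] = u
                backtrack(i + 1, mapping)
                del mapping[v]

    backtrack(0, {})
    return results
```

The matching order is fixed here (`sorted(p_adj)`); the work surveyed above is largely about choosing that order adaptively and shrinking the candidate loop.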
Dynamic graph pattern matching
As graphs are dynamic in nature in real-life applications, pattern matching over a large dynamic graph attracts increasing attention. INCISOMAT [5] identifies the nodes of the data graph that may produce new matches according to the changes of the data graph; however, the number of these nodes grows with the pattern graph, and the efficiency decreases dramatically. GraphFlow [9] applies a worst-case optimal join algorithm called Generic Join to incrementally evaluate subgraph matching for each update, without maintaining intermediate results. For each query edge \((v,v^{\prime })\) that matches an updated edge \((u,u^{\prime })\), GraphFlow evaluates subgraph matching starting from the partial solution \(\{(v,u), (v^{\prime },u^{\prime })\}\). SJ-Tree [2] decomposes the main pattern graph based on the selectivity of vertex attributes; the highly selective subpattern is evaluated first, and the remaining subpatterns are evaluated only when new results are found in previously evaluated subpatterns, which avoids much unnecessary computational cost. However, its decomposition features are simple, and many intermediate results are produced when the pattern graph gets larger. The pattern decomposition approach of the work in [6] is based on identifying optimal sub-DAGs (directed acyclic graphs) in the pattern graph. The DAGs are then traversed to identify source and sink vertices to define message transition rules in the Giraph framework. This approach relies on a distributed implementation and is not suitable for all types of patterns. TurboFlux [10] is the state-of-the-art algorithm for continuous subgraph matching, which employs a data-centric representation of intermediate results, in the sense that the query pattern P is embedded into the data graph G, and its execution model allows fast incremental maintenance. Wang and Chen [13] also deal with continuous subgraph matching for evolving graphs.
However, this method produces approximate results only, while our approach generates exact results.
The above algorithms only solve the graph pattern matching problem under the scenario of the data graph updating alone. In this paper, we investigate a new problem, continuous matching of evolving patterns over dynamic graph data, to continuously report matches for each update operation in the graph update stream.
Conclusion
In this paper, we are devoted to a more complicated but very practical graph pattern matching problem, continuous matching of evolving patterns over dynamic graph data, and present a novel algorithm CEPDG for continuous pattern matching along with changes of both the pattern graph and the data graph. We showed that CEPDG solves the problems of existing methods and efficiently processes continuous subgraph matching for each update operation on the data graph and the pattern graph.
We first proposed a concise representation, TreeMat, based on the spanning tree of the initial pattern graph, for storing partial solutions. We then proposed the vertex state transition strategy, which efficiently identifies which update operations on the data graph can affect the current partial solutions and maintains TreeMat accordingly. We next presented an execution model to efficiently and incrementally maintain the representation during edge updates on the pattern graph, which is highly compatible with the algorithms proposed for data graph updates. Finally, we conceived an effective cost model for estimating the stepwise cost of pattern matching.
Extensive experiments showed that CEPDG outperformed existing competitors by up to orders of magnitude. Overall, we believe our continuous subgraph matching solution provides comprehensive insight and a substantial framework for future research.
Notes
A pattern graph seldom loses connectivity in threat surveillance.
References
Bi, F., Chang, L., Lin, X., Qin, L., Zhang, W.: Efficient subgraph matching by postponing Cartesian products. In: SIGMOD '16, San Francisco, CA, USA, June 26 – July 1, 2016, pp. 1199–1214 (2016)
Choudhury, S., Holder, L.B. Jr., Agarwal, G.C.K., Feo, J.: A selectivity based approach to continuous pattern detection in streaming graphs. In: EDBT '15, Brussels, Belgium, March 23–27, 2015, pp. 157–168 (2015)
Choudhury, S., Holder, L.B. Jr., Ray, G.C., Beus, A., Feo, S.J.: StreamWorks: A system for dynamic graph search. In: Ross, K.A., Srivastava, D., Papadias, D. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22–27, 2013, pp. 1101–1104. ACM (2013)
Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26(10), 1367–1372 (2004)
Fan, W., Li, J., Luo, J., Tan, Z., Wang, X., Wu, Y.: Incremental graph pattern matching. In: SIGMOD '11, Athens, Greece, June 12–16, 2011, pp. 925–936 (2011)
Gao, J., Zhou, C., Yu, J.X.: Toward continuous pattern detection over evolving large graph with snapshot isolation. VLDB J. 25(2), 269–290 (2016)
Han, M., Kim, H., Gu, G., Park, K., Han, W.: Efficient subgraph matching: Harmonizing dynamic programming, adaptive matching order, and failing set together. In: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 – July 5, 2019, pp. 1429–1446 (2019)
Han, W., Lee, J., Lee, J.: TurboISO: Towards ultrafast and robust subgraph isomorphism search in large graph databases. In: SIGMOD '13, New York, USA, June 22–27, 2013, pp. 337–348 (2013)
Kankanamge, C., Sahu, S., Mhedbhi, A., Chen, J., Salihoglu, S.: Graphflow: An active graph database. In: SIGMOD '17, Chicago, IL, USA, May 14–19, 2017, pp. 1695–1698 (2017)
Kim, K., Seo, I., Han, W., Lee, J., Hong, S., Chafi, H., Shin, H., Jeong, G.: TurboFlux: A fast continuous subgraph matching system for streaming graph data. In: SIGMOD '18, Houston, TX, USA, June 10–15, 2018, pp. 411–426 (2018)
Ouyang, D., Yuan, L., Qin, L., Chang, L., Zhang, Y., Lin, X.: Efficient shortest path index maintenance on dynamic road networks with theoretical guarantees. Proc. VLDB Endow. 13(5), 602–615 (2020)
Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. PVLDB 1(1), 364–375 (2008)
Wang, C., Chen, L.: Continuous subgraph pattern search over graph streams. In: ICDE’09, Shanghai, China, March 29  April 2, 2009, pp. 393–404 (2009)
Yuan, L., Qin, L., Lin, X., Chang, L., Zhang, W.: Diversified topk clique search. VLDB J. 25(2), 171–196 (2016)
Yuan, L., Qin, L., Zhang, W., Chang, L., Yang, J.: Indexbased densest clique percolation community search in networks. IEEE Trans. Knowl. Data Eng. 30(5), 922–935 (2018)
Zhang, Q., Guo, D., Zhao, X., Guo, A.: On continuously matching of evolving graph patterns. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3–7, 2019, pp. 2237–2240 (2019)
Zhao, P., Han, J.: On graph query optimization in large networks. PVLDB 3(1), 340–351 (2010)
Acknowledgements
This work is supported by the National key research and development program under Grant Nos. 2018YFB1800203 and 2018YFE0207600.
Additional information
This article is an extension of our earlier published work [16] in ACM CIKM 2019.
About this article
Cite this article
Zhang, Q., Guo, D., Zhao, X. et al. Continuous matching of evolving patterns over dynamic graph data. World Wide Web 24, 721–745 (2021). https://doi.org/10.1007/s11280-020-00860-5
Keywords
 Dynamic graph
 Subgraph matching
 Incremental algorithm