Continuous matching of evolving patterns over dynamic graph data

The scale of graphs is growing rapidly, which poses a serious challenge to the development of processing and analytic algorithms. Among these, graph pattern matching is one of the most primitive tasks and finds a wide spectrum of applications, yet its performance is often affected by the size and dynamicity of graphs. To handle large dynamic graphs, incremental pattern matching has been proposed to avoid re-computing matches of patterns over the entire data graph, hence reducing matching time and improving overall execution performance. Owing to the complexity of the problem, little work has been reported so far, and most of it addresses graph pattern matching only under the scenario in which the data graph alone varies. In this article, we study a more complicated but very practical problem, continuous matching of evolving patterns over dynamic graph data, and present a novel algorithm, CEPDG, for continuous pattern matching under changes to both the pattern graph and the data graph. Specifically, we propose a concise representation, TreeMat, of partial matching solutions, which helps to avoid re-computing matches of the pattern and speeds up the subsequent matching process. To support updates of the data graph and the pattern graph, we propose an incremental maintenance strategy that efficiently maintains the intermediate results. Moreover, we conceive an effective model for estimating the step-wise cost of pattern evaluation to drive the matching process. Extensive experiments verify the superiority of CEPDG.


Introduction
In recent years, graph analysis has played an increasingly important role in data analytics [14,15]. Graph pattern matching is one of the most fundamental problems in graph analytics. Given a pattern graph $P$ and a large data graph $G$, graph pattern matching is to find all subgraphs of $G$ that are isomorphic to $P$, which has a wide range of applications such as fraud detection and cyber security.
However, graphs are dynamic in nature [11] and continuously evolve over time. A dynamic graph is defined by an initial graph and a graph update stream of edge insertions and edge deletions. Identifying and monitoring critical patterns in a dynamic graph is important in various application domains [6], such as fraud detection, cyber security, and emergency response. For example, cyber security applications should detect cyber intrusions and attacks in computer network traffic as soon as they appear in the data graph [3]. Most previous works solve the subgraph matching problem only under the scenario in which the data graph alone varies. However, it is common that the pattern graph also evolves over time while the data graph is updated. For example, in cyberthreat surveillance, one could predict upcoming malicious activities and determine the ultimate goal of an adversary by concealing and supplementing selected edges of attack patterns, respectively [16].
The aforementioned two update scenarios motivate us to investigate a new problem, continuous matching of evolving patterns over dynamic graph data. Formally, we are given an initial data graph $G_0$, an initial pattern graph $P_0$, and a graph update stream $(\Delta g_1, \Delta g_2, \Delta p_3, \Delta p_4, \cdots)$ consisting of edge insertions and deletions on the data graph and the pattern graph. Let $G_i = G_{i-1} \oplus \Delta g_i$ (resp. $P_i = P_{i-1} \oplus \Delta p_i$), where $\oplus$ means that $\Delta g_i$ (resp. $\Delta p_i$) is applied to $G_{i-1}$ (resp. $P_{i-1}$), and let $M(P, G)$ denote the set of subgraph matching results between $G$ and $P$. Then the continuous matching of evolving patterns over dynamic graph data problem is to report $M(P_{i-1} \oplus \Delta p_i, G_{i-1})$ (resp. $M(P_{i-1}, G_{i-1} \oplus \Delta g_i)$) whenever an update operation $\Delta p_i$ (resp. $\Delta g_i$) occurs. A naïve method to solve this problem is to repeatedly execute pattern matching from scratch for each update to the data graph or the pattern graph. Nonetheless, this can be prohibitively costly because it involves a large number of expensive subgraph isomorphism tests [8].
To address this challenge, efforts to support incremental graph pattern matching over dynamic data graphs have enjoyed some success. In [5], INCISOMAT extracts the subgraph of the data graph that can be affected by each update operation, conducts subgraph matching on the extracted subgraph, and obtains the new matches by computing the set difference. GraphFlow [9] applies a worst-case optimal join algorithm called Generic Join to incrementally evaluate subgraph matching for each update. SJ-TREE [2] uses a left-deep tree, in which an internal node corresponds to a subgraph containing more than two connected query vertices, and a leaf node corresponds to a subgraph containing two adjacent query vertices. TurboFlux [10], the state-of-the-art algorithm for continuous subgraph matching, employs a data-centric representation of intermediate results, namely DCG, in the sense that the query pattern $P$ is embedded into the data graph $G$. TurboFlux achieves higher performance than the above algorithms. However, it considers only update operations on the data graph and is not applicable when both update scenarios occur; in our context, TurboFlux has to re-compute the DCG whenever an update occurs on the pattern graph, which can be detrimental.
These problems of existing methods motivated us to develop a fully-fledged framework, namely CEPDG, to achieve fast pattern matching under variations of both the data graph and the pattern graph. To the best of our knowledge, this is among the first attempts to conduct pattern matching when the data graph and the pattern graph vary simultaneously. In summary, we make the following contributions:
- We introduce a concise representation, TreeMat, of partial solutions, which helps to avoid executing subgraph pattern matching repeatedly for edge updates on the data graph and the pattern graph;
- To support frequent updates on the data graph, we propose a vertex state transition strategy to efficiently maintain the intermediate results;
- We devise an execution model to efficiently and incrementally maintain the representation during edge updates on the pattern graph, which is fully compatible with the algorithm proposed for the data graph;
- We conceive an effective cost model for estimating the step-wise cost of pattern matching.
Comprehensive empirical study verifies the efficiency of the proposed algorithm and techniques.
Organization Section 2 formulates the problem, and presents the overview of the proposed framework. Section 3 introduces a novel representation of intermediate results called the TreeMat and proposes the incremental maintenance strategy. Section 4 explains the algorithms of CEPDG in detail. Experimental results and analyses are reported in Section 5. A brief overview of related work follows immediately in Section 6. Section 7 concludes the paper.

Preliminaries and framework
In this section, we first introduce several essential notions and formalize the continuous matching of evolving patterns over dynamic graph data problem. Then, we overview the proposed solution.

Preliminaries
We focus on a labeled undirected graph $g = (V, E, L)$. Here, $V$ is the set of vertices, $E \subseteq V \times V$ is the set of edges, and $L$ is a labeling function that assigns a label $l$ to each $v \in V$. Each vertex has exactly one label, representing the attribute of the vertex. Note that our techniques can be readily extended to handle directed graphs.

Definition 1 (Graph update stream) A graph update stream $\Delta o$ is a sequence of update operations $(\Delta o_1, \Delta o_2, \cdots)$, where $\Delta o_i$ is a triple $\langle op, v_i, v_j \rangle$ such that $op \in \{I, D\}$ is the type of operation, with $I$ and $D$ representing the insertion and the deletion of an edge $(v_i, v_j)$, respectively.
A dynamic graph abstracts an initial graph $g$ and an update stream $\Delta o$; $g$ transforms to $g'$ after applying $\Delta o$ to $g$. Here, $g$ represents a data graph or a pattern graph. Note that the insertion of a vertex can be represented by a set of edge insertions; similarly, the deletion of a vertex can be considered as a set of edge deletions.
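To make Definition 1 concrete, the following minimal sketch applies a stream of $\langle op, v_i, v_j \rangle$ triples to an undirected graph. The function names `apply_update` and `apply_stream` and the dict-of-sets graph representation are assumptions made for illustration; they are not taken from CEPDG.

```python
def apply_update(adj, op, vi, vj):
    """Apply one triple <op, vi, vj> to an undirected graph stored as
    a dict mapping each vertex to the set of its neighbors."""
    if op == "I":
        # Edge insertion; unseen endpoints are created implicitly,
        # so a vertex insertion is just a set of edge insertions.
        adj.setdefault(vi, set()).add(vj)
        adj.setdefault(vj, set()).add(vi)
    elif op == "D":
        # Edge deletion; a vertex that loses its last edge is dropped,
        # so a vertex deletion is just a set of edge deletions.
        for a, b in ((vi, vj), (vj, vi)):
            if a in adj:
                adj[a].discard(b)
                if not adj[a]:
                    del adj[a]
    else:
        raise ValueError(f"unknown operation type: {op!r}")


def apply_stream(adj, stream):
    """Transform g into g' by applying every update of the stream in order."""
    for op, vi, vj in stream:
        apply_update(adj, op, vi, vj)
    return adj


g = apply_stream({}, [("I", "a", "b"), ("I", "b", "c"), ("D", "a", "b")])
```

The same routine serves both the data graph and the pattern graph, since Definition 1 does not distinguish between them.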
Definition 2 (Subgraph isomorphism) Given a pattern graph $P = (V_P, E_P, L_P)$ and a data graph $G = (U_G, E_G, L_G)$, $P$ is subgraph-isomorphic to $G$ if there is a bijective function $f$ between $V_P$ and a subset of $U_G$ such that $f$ preserves vertex labels and maps every edge of $P$ to an edge of $G$.

Definition 3 (Problem statement) Given a pattern graph $P = (V_P, E_P, L_P)$, a data graph $G = (U_G, E_G, L_G)$, and a graph update stream $\Delta o$, the continuous matching of evolving patterns over dynamic graph data problem is to continuously return the occurrences of $P$ in $G$ as the updates in $\Delta o$ occur on the pattern graph $P$ or the data graph $G$.
Frequently used notations are summarized in Table 1.

Overview of solution
In this subsection, we overview the proposed solution, which is referred to as CEPDG (Continuous matching of Evolving Patterns over Dynamic Graph data). Specifically, we address two technical challenges: the former corresponds to the update handling phase, while the latter corresponds to the query evaluation phase. Algorithm 1 shows the outline of CEPDG, which takes an initial pattern graph $P_0$, an initial data graph $G_0$ and a graph update stream $\Delta o$ as input, and finds the matching results of $P$ in $G$ when necessary. We first select a root vertex $v_r$ (Line 1). Then we extract from the pattern graph $P_0$ a structural tree $P_T$ rooted at $v_r$, by traversing $P_0$ in breadth-first order to obtain a spanning tree and removing the non-tree edges from $P_0$ (Line 2). The purpose is to enable fast query evaluation by leveraging the tree structure [8], i.e., we handle the edges in the query tree first, and then the non-tree edges.
In particular, to perform continuous subgraph matching, we construct an auxiliary data structure, namely TreeMat, based on $P_T$ to store the matching results of the structural tree [10][11][12]. After that, we call subgraphSearch to obtain the matching results if output is requested (Line 13). The design and rationale of auxiliary data structure maintenance, as well as the algorithm details, are given in the subsequent sections.

Root Vertex Selection
Intuitively, we favor a root vertex that has a small number of candidates and a large degree: fewer candidates mean fewer partial embeddings being generated, while a larger degree means more chances to prune partial embeddings at early stages. To minimize the number of matching data vertices for the root vertex $v_r$, chooseRootVertex first selects a pattern edge $(v, v')$ that has the smallest number of matching data edges. Between $v$ and $v'$, chooseRootVertex chooses the pattern vertex that has the smaller number of matching data vertices. Finally, if there is a tie, chooseRootVertex chooses the pattern vertex with the larger degree.
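The three-step rule above can be sketched as follows. This is a minimal illustration, assuming the numbers of matching data edges and data vertices have already been counted; the function name `choose_root_vertex` and its dictionary parameters are hypothetical and are not the signature used by CEPDG.

```python
def choose_root_vertex(pattern_edges, edge_match_count, vertex_match_count, degree):
    """Pick a root vertex for the query tree.

    pattern_edges      : list of (v, v2) pattern edges
    edge_match_count   : (v, v2) -> number of matching data edges
    vertex_match_count : v -> number of matching data vertices
    degree             : v -> degree of v in the pattern graph
    """
    # Step 1: the pattern edge with the fewest matching data edges.
    v, v2 = min(pattern_edges, key=lambda e: edge_match_count[e])
    # Steps 2 and 3: between its endpoints, prefer fewer matching data
    # vertices; on a tie, prefer the larger degree (hence the minus sign).
    return min((v, v2), key=lambda x: (vertex_match_count[x], -degree[x]))


root = choose_root_vertex(
    pattern_edges=[("v1", "v2"), ("v2", "v3")],
    edge_match_count={("v1", "v2"): 5, ("v2", "v3"): 3},
    vertex_match_count={"v1": 2, "v2": 4, "v3": 4},
    degree={"v1": 1, "v2": 2, "v3": 1},
)
```

In this toy input the edge (v2, v3) has the fewest matching data edges, its endpoints tie on matching data vertices, and v2 wins because of its larger degree.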

Incremental maintenance of intermediate results
The central idea of update handling is to employ a delicate data structure to store and incrementally maintain partial solutions.

A concise representation
There has been a long tradition in the graph community of harnessing a tree structure for fast pattern matching and search [1,8]. We follow this tradition and conceive a succinct data structure for keeping partial solutions. $P_T$ is constructed by removing the edges that are not in the spanning tree, i.e., the non-tree edges, if $P$ contains cycles. The vertices in $P$ are partitioned according to their levels in the spanning tree, where the level of a vertex in $P_T$ is its depth relative to the root vertex of $P_T$. To keep partial solutions, we offer a concise representation named TreeMat, which comprises the vertices of the data graph $G$ that match those of $P_T$. Given a vertex $v$ in $P_T$, its matching vertices in TreeMat are arranged into match(·), the set of vertices $\{u\}$ in $G$ that map to $v$ in some solution to $P_T$; and stree(·), the set of vertices $\{u\}$ in $G$ such that 1) the subtree rooted at $v$ matches the corresponding subtree at $u$ via subgraph homomorphism [10], and 2) there does not exist a solution to $P_T$ that maps $v$ to $u$.
Here, subgraph homomorphism is obtained from subgraph isomorphism by simply removing the injectivity constraint. It can be seen that the two sets are mutually exclusive, and we use the general designation candidates of $v$, i.e., $cand(v)$, to refer to the vertices in either $match(v)$ or $stree(v)$. As a consequence, the structure of TreeMat is defined as follows.
- It is a tree-like structure, and for each query vertex $v$ in $P_T$, there is a node containing the candidates of $v$, which consists of the two sets $match(v)$ and $stree(v)$; and
- there is an edge between $u \in cand(v)$ and $u' \in cand(v')$ for adjacent query vertices $v$ and $v'$ in TreeMat if and only if the edge $(u, u')$ is in $G$.
It is noted that stree(v r ) of the root vertex v r in P T is empty, since P T is also a subtree residing at v r .
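The definition above can be pictured with the following minimal container sketch. The class names `TreeMatNode` and `TreeMat`, the method names, and the edge encoding are illustrative assumptions; the paper does not prescribe a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class TreeMatNode:
    """Candidates of one query vertex v, split into two disjoint sets."""
    match: set = field(default_factory=set)  # u maps to v in some solution to P_T
    stree: set = field(default_factory=set)  # subtree at v matches at u, no full solution

    @property
    def cand(self):
        return self.match | self.stree


class TreeMat:
    def __init__(self, parent):
        # parent: query vertex -> its parent in P_T (None for the root)
        self.parent = parent
        self.nodes = {v: TreeMatNode() for v in parent}
        # ((v, u), (vc, uc)): data edge (u, uc) stored under tree edge (v, vc)
        self.edges = set()

    def set_state(self, v, u, state):
        """Place data vertex u in match(v) or stree(v), never in both."""
        if state not in ("match", "stree"):
            raise ValueError(state)
        if state == "stree" and self.parent[v] is None:
            raise ValueError("stree of the root vertex is always empty")
        node = self.nodes[v]
        node.match.discard(u)
        node.stree.discard(u)
        getattr(node, state).add(u)

    def link(self, v, u, vc, uc):
        """Record a data edge between candidates of the tree edge (v, vc)."""
        if self.parent[vc] != v:
            raise ValueError("(v, vc) is not a tree edge of P_T")
        if u not in self.nodes[v].cand or uc not in self.nodes[vc].cand:
            raise ValueError("both endpoints must be candidates")
        self.edges.add(((v, u), (vc, uc)))
```

Keeping the two sets inside one node makes the mutual exclusion easy to enforce, and a data vertex that belongs to neither set is implicitly in the NULL state.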
Example 1 Figure 4b shows the TreeMat for $P_T$ (Figure 4a) and the initial data graph $G_0$. Given a vertex $v$ in $P_T$, an orange square in $cand(v)$ represents a data vertex $u \in stree(v)$, and a black square in $cand(v)$ represents a data vertex $u \in match(v)$. Furthermore, we can see that the root vertex $v_1$ of $P_T$ only has the set match(·).
Remark As pointed out in [10], existing work on continuous subgraph matching caches either a set of partial solutions or a set of candidate vertices for each query vertex. These paradigms incur not only great memory overhead but also large computational cost. In contrast, our model takes a more eager strategy, and proposes to keep complete solutions (in match(·)) as well as solution-likely-to-be's (in stree(·)). In this way, we save TreeMat from filling up the main memory while offering guidance to efficiently derive affected answers.

Data graph change-oriented rationale of maintenance
In this subsection, we propose a vertex state transition strategy (denoted as VST) to efficiently maintain the intermediate results.
When an edge update operation $(u, u')$ arrives, we try to match it with an edge $(v, v')$ in $P_T$. Here, the level of $v$ is deemed to be smaller than the level of $v'$. Then, we use VST to maintain the TreeMat. We say that a data vertex $u$ is in the NULL state, written $u \in$ NULL, if $u \notin cand(v)$. Figure 1 shows the state transition diagram, consisting of three states and six transition rules (Transitions 1-6), which demonstrates how one state transits to another. Here, Transitions 1-3 are triggered by edge insertion, and Transitions 4-6 are triggered by edge deletion.

Handling edge insertion
Suppose that the data vertex $u$ is added into $stree(v)$. For each $u_p \in$ NULL that is adjacent to $u$, if $(u, u_p)$ matches $(v, v_p)$, where $v_p$ is the parent vertex of $v$, we further check whether $u_p$ can be added into $stree(v_p)$ in a similar manner (Fig. 2).
From stree to match. Suppose that $u' \in stree(v')$ and $u \in match(v)$. Then we move $u'$ from $stree(v')$ to $match(v')$.
Suppose that the data vertex $u$ is added into $match(v)$. For each child vertex $v_c$ of $v$, if there is a vertex $u_c$ in $stree(v_c)$ that is adjacent to $u$ in TreeMat, then we move $u_c$ from $stree(v_c)$ to $match(v_c)$.

In Figure 2c, the edge insertion $\Delta o_1$ matches $(v_4, v_7)$ where $u_6 \in match(v_4)$. Since $v_7$ is a leaf vertex in $P_T$, we add $u_{17}$ to $match(v_7)$. In Figure 2d, an edge insertion is handled: since $v_4$ has no child vertex other than $v_7$, we add $u_{19}$ into $stree(v_4)$. In Figure 2f, there is a neighbor $u_{20}$ of $u_{19}$ such that $(u_{19}, u_{20})$ matches $(v_4, v_2)$. Since $(u_{20}, u_9)$ matches $(v_2, v_5)$, we further add $u_{20}$ into $stree(v_2)$. In Figure 2g, the edge insertion $\Delta o_4$ matches $(v_4, v_2)$ where $u_2 \in match(v_2)$ and $u_7 \in stree(v_4)$. We then move $u_7$ from $stree(v_4)$ to $match(v_4)$. In Figure 2h, we further check the data vertices in $stree(v_7)$, where $v_7$ is the child vertex of $v_4$. Since $u_{13}$ and $u_{14}$ are the neighbors of $u_7$ in $stree(v_7)$, we move $u_{13}$ and $u_{14}$ from $stree(v_7)$ to $match(v_7)$.
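The downward propagation that follows a stree-to-match transition can be sketched as a short recursion. This is an illustrative sketch only: the function name `promote_to_match`, the plain dictionaries, and the adjacency key `(v, vc, u)` are assumptions, and the toy data reproduces just the step described for Figures 2g and 2h.

```python
def promote_to_match(u, v, match, stree, children, adj):
    """Move data vertex u into match(v), then move every stree candidate of a
    child query vertex that is adjacent to u in TreeMat, recursively.

    match, stree : query vertex -> set of data vertices
    children     : query vertex -> list of child query vertices in P_T
    adj          : (v, vc, u) -> candidates of vc adjacent to u in TreeMat
    """
    stree[v].discard(u)
    match[v].add(u)
    for vc in children.get(v, ()):
        # The intersection builds a new set, so stree[vc] can be modified
        # safely by the recursive calls below.
        for uc in adj.get((v, vc, u), set()) & stree[vc]:
            promote_to_match(uc, vc, match, stree, children, adj)


match = {"v4": set(), "v7": set()}
stree = {"v4": {"u7"}, "v7": {"u13", "u14"}}
children = {"v4": ["v7"]}
adj = {("v4", "v7", "u7"): {"u13", "u14"}}
promote_to_match("u7", "v4", match, stree, children, adj)
```

After the call, u7 is in match(v4) and both u13 and u14 have moved from stree(v7) to match(v7), as in the example.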

Handling edge deletion
From match to stree. Suppose that $u \in match(v)$ and $u' \in match(v')$. If there is no other data vertex in $match(v)$ that is adjacent to $u'$, then we move $u'$ from $match(v')$ to $stree(v')$. Specifically, if $v'$ is a leaf vertex, we need to further check whether there is a vertex in $stree(v)$ that is adjacent to $u'$; if so, we move $u'$ from $match(v')$ to $stree(v')$.
From stree to NULL. Suppose that $u \in stree(v)$ and $u' \in cand(v')$. If there is no other data vertex in $cand(v')$ that is adjacent to $u$, we then delete $u$ from $stree(v)$. Specifically, if $v'$ is a leaf vertex in $P_T$ and $u' \in stree(v')$, we need to further check whether there is a data vertex in $stree(v)$ that is adjacent to $u'$. If not, we delete $u'$ from $stree(v')$.
Suppose that the vertex u is deleted from stree(v). For each neighbor

Pattern graph change-oriented rationale of maintenance
It can be seen that if the inserted (or deleted) edge is a non-tree edge, we do not update TreeMat, since the edge has no impact on TreeMat. Thus, the following exposition concentrates on tree edges.
Handling edge insertion Consider a tree edge $(v, v')$ inserted into $P_T$, where $v'$ is the newly introduced vertex. Under this scenario, candidate vertices can only be excluded from match(·) or stree(·) and return to the NULL state, but not vice versa. To identify the affected candidates, we check, for each vertex $u$ in $match(v)$, whether there is an edge $(u, u')$ with $u' \in$ NULL matching $(v, v')$. If not, we delete $u$ from $match(v)$; otherwise, we add vertex $u'$ into $match(v')$ if $u \in match(v)$. $stree(v)$ and $stree(v')$ can be updated in a similar fashion.
Handling edge deletion We discuss edge deletion in two cases based on whether the deletion involves a leaf vertex of P T .
Case 1 Consider a tree edge $(v, v')$ with $v'$ as a leaf vertex. Note that in this case, NULL vertices can only be included into match(·) or stree(·), but not vice versa. Intuitively, a vertex $u$ of $G_0$ is added into $stree(v)$ only if, for each child vertex $v_c$ of $v$ other than $v'$, there is a vertex $u_c$ that is a candidate of $v_c$ such that $(u, u_c)$ matches $(v, v_c)$.
Then, the update needs to be propagated upwards to the root of TreeMat. Suppose that vertex $u$ is added into $stree(v)$. For each vertex $u_p$ that is adjacent to $u$ such that $(u_p, u)$ matches $(v_p, v)$, if $u_p \in$ NULL, we check whether $u_p$ can be added into $stree(v_p)$ in a similar manner; otherwise, if $u_p \in match(v_p)$, we move $u$ from $stree(v)$ to $match(v)$. In the other situation, when vertex $u$ is added into $match(v)$, we examine, for each child vertex $v_c$ of $v$, whether there is a vertex $u_c$ in $stree(v_c)$ that is adjacent to $u$ in TreeMat; if so, we move the data vertex $u_c$ to $match(v_c)$.
Case 2 Consider a tree edge $(v, v')$ not involving any leaf vertex. This type of edge deletion breaks the connectivity of $P_T$ but not of $P_1$. Thus, a non-tree edge that connects $v'$ with another vertex will become a tree edge. Intuitively, we choose, among all such non-tree edges, the one that connects $v'$ to a vertex that is closer to the root and has a smaller match(·) set.
Then, for each vertex $u' \in stree(v')$, we check whether there is a candidate $u''$ of the other endpoint $v''$ of the new tree edge such that $(u', u'')$ matches $(v', v'')$; if not, we exclude $u'$ from $stree(v')$ and further check the vertices in $stree(v_p)$, where $v_p$ is the parent of $v'$. The update is propagated upwards to the root.

In Figure 3d, since $(v_4, v_5)$ is a non-tree edge, we only add edge $(v_6, v_{10})$ into $P_T$. In Figure 3e, since there is no vertex $u$ adjacent to $u_{11}$ such that $(u_{11}, u)$ matches $(v_6, v_{10})$, we remove $u_{11}$ from $stree(v_6)$. Accordingly, we remove the parent vertex $u_5$ of $u_{11}$ from $stree(v_3)$. Moreover, since $u_{10} \in match(v_6)$, and there are two vertices $u_{17}$ and $u_{18}$ adjacent to $u_{10}$ such that the edges $(u_{10}, u_{17})$ and $(u_{10}, u_{18})$ match $(v_6, v_{10})$, we add $u_{17}$ and $u_{18}$ into $match(v_{10})$. Figure 3f gives the updated TreeMat after the edge insertion $\Delta g_2$. When the edge $\Delta p_1$ is deleted from $P$, there are two non-tree edges, $(v_5, v_6)$ and $(v_6, v_8)$, that can be turned into tree edges. Here, we turn $(v_5, v_6)$ into a tree edge, since $|match(v_5)| = |match(v_8)|$ and $v_5$ is closer to the root vertex $v_1$. The updated $P_T$ and TreeMat are given in Figures 3g and 3h, respectively.
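The choice of the replacement tree edge in Case 2 can be sketched as a single selection. The text does not state which criterion takes priority, so this sketch assumes that the size of the match(·) set is compared first and the level breaks ties, which agrees with the example above. The function name, the parameters, and the numeric values are hypothetical.

```python
def choose_replacement_edge(v, non_tree_edges, level, match_size):
    """Among the non-tree edges incident to v, return the one whose other
    endpoint has the smallest match set, breaking ties by the level closest
    to the root. Returns None if no non-tree edge touches v."""
    incident = [e for e in non_tree_edges if v in e]
    if not incident:
        return None

    def other(e):
        return e[1] if e[0] == v else e[0]

    return min(incident, key=lambda e: (match_size[other(e)], level[other(e)]))


# Hypothetical sizes and levels, chosen to mirror the tie in the example.
edge = choose_replacement_edge(
    "v6",
    non_tree_edges=[("v5", "v6"), ("v6", "v8"), ("v4", "v5")],
    level={"v4": 2, "v5": 2, "v8": 3},
    match_size={"v4": 1, "v5": 3, "v8": 3},
)
```

With equal match sizes for v5 and v8, the edge (v5, v6) is selected because v5 lies closer to the root.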

CEPDG algorithms
In this section, we present detailed algorithms for CEPDG. We develop efficient techniques for constructing TreeMat. While we update the TreeMat, we need only apply necessary transition rules. This motivated us to develop an enhanced version of the maintenance algorithm for the TreeMat. Then we conceive an effective cost model for estimating the step-wise cost of query pattern matching.

TreeMat construction
To construct TreeMat, constructTreeMat (Line 3 of Algorithm 1) (1) first generates $cand(v)$ (the candidates of $v$) for each query vertex $v$ in $P_T$; (2) then constructs the adjacency lists corresponding to query vertices and their parent vertices; and (3) finally divides $cand(v)$ into $stree(v)$ and $match(v)$.
In the forward processing, we mark all the leaf vertices of $P_T$ as visited and then process the query vertices level by level in a bottom-up fashion (Lines 1-20). In processing an unvisited vertex $v$, let $N(v)$ denote the set of visited neighbors of $v$ in $P_T$ (Line 13).
Specifically, in the above process, if $v$ is a leaf vertex, we need only verify whether there is a data vertex $u'$ such that $(u, u')$ matches $(v, v')$. To achieve this, we maintain a counter $V(u)$ for each data vertex $u$ in $G_0$ to count the number of visited query neighbors $v'$ of $v$ that have a candidate $u'$ adjacent to $u$ such that $(u, u')$ matches $(v, v')$. $V(u)$ is updated at Lines 8-10. The candidate set $cand(v)$ is the set of vertices satisfying $|N(v)| = V(u)$ (Lines 14-15). At the same time, we construct the adjacency lists corresponding to vertex $v$ and its parent vertex $v_p$ in $P_T$ (Line 19). That is, for the edge $(v_p, v)$ and each data vertex $u \in cand(v_p)$, an adjacency list $N^{v_p}_{v}(u)$ is constructed, which is the set of data vertices $\{u'\}$ in $cand(v)$ such that $(u, u')$ matches $(v_p, v)$. Then, we mark $v$ as visited and reset $V(u)$ to 0 for every vertex $u$ that has a positive count (Line 18).
In the backward processing, we reprocess the query vertices of $P_T$ in a top-down manner to divide $cand(v)$ into $match(v)$ and $stree(v)$ for each query vertex $v$. First, we set $match(v_r) = cand(v_r)$ for the root vertex $v_r$, since $P_T$ is itself the subtree rooted at $v_r$. Then, we process the vertices downwards according to their levels. In processing a query vertex $v$, let $v_p$ denote the parent vertex of $v$. For each data vertex $u$ in $cand(v)$, we check whether there is a data vertex $u_p$ in $match(v_p)$ that is adjacent to $u$. If so, we move $u$ to $match(v)$; otherwise, we move $u$ to $stree(v)$ (Lines 24-26).
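The two passes can be illustrated with the simplified sketch below. It is not Algorithm 2: it replaces the counter $V(u)$ with set intersections, filters leaf candidates by vertex label only, treats an edge match as a match of endpoint labels, and does not build the adjacency lists. The function name and all parameter names are assumptions.

```python
def construct_treemat(parent, qlabel, adj, dlabel):
    """Bottom-up candidate generation followed by a top-down split.

    parent : query vertex -> parent in P_T (None for the root)
    qlabel : query vertex -> label
    adj    : data vertex -> set of neighboring data vertices
    dlabel : data vertex -> label
    Returns (match, stree), each mapping a query vertex to a set.
    """
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)

    # Query vertices in breadth-first (top-down) order.
    order, i = [root], 0
    while i < len(order):
        order.extend(children[order[i]])
        i += 1

    # Forward pass: u is a candidate of v if the labels agree and u has a
    # neighbor among the candidates of every child of v.
    cand = {}
    for v in reversed(order):
        cand[v] = {
            u for u in adj
            if dlabel[u] == qlabel[v]
            and all(adj[u] & cand[vc] for vc in children[v])
        }

    # Backward pass: a candidate joins match(v) if it is adjacent to a
    # vertex in match(parent of v); otherwise it stays in stree(v).
    match, stree = {root: set(cand[root])}, {root: set()}
    for v in order[1:]:
        match[v] = {u for u in cand[v] if adj[u] & match[parent[v]]}
        stree[v] = cand[v] - match[v]
    return match, stree


parent = {"v1": None, "v2": "v1", "v3": "v1"}
qlabel = {"v1": "A", "v2": "B", "v3": "C"}
adj = {"u1": {"u2", "u3"}, "u2": {"u1"}, "u3": {"u1"}, "u4": {"u5"}, "u5": {"u4"}}
dlabel = {"u1": "A", "u2": "B", "u3": "C", "u4": "A", "u5": "B"}
match, stree = construct_treemat(parent, qlabel, adj, dlabel)
```

In this toy graph u4 lacks a C-labeled neighbor and is therefore not a candidate of v1, so u5 remains in stree(v2) while u2 enters match(v2).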

Lemma 1 The worst-case storage complexity of TreeMat is $O(|E_{G_0}| \times |V_{P_T}|)$.
Proof The TreeMat stores at most $|E_{G_0}|$ edges for each pattern vertex in $P_T$; thus, its worst-case storage complexity is $O(|E_{G_0}| \times |V_{P_T}|)$.

Lemma 2 The worst time complexity of
Proof In the worst case, constructTreeMat is called for every query vertex $v$ and every data vertex $u$. We show that the forward process for a specific $v$ takes $O(|E_{G_0}| \times |N(v)|)$ time.
In particular, for each data vertex $u' \in cand(v')$, it takes $O(deg(u'))$ time to check whether $(u, u')$ matches $(v, v')$, where $deg(u')$ is the degree of $u'$; thus, for all vertices in $cand(v')$, the checking processes take $O(\sum_{u' \in cand(v')} deg(u')) = O(|E_{G_0}|)$ time. Similarly, the backward process for a specific $v$ takes $O(|E_{G_0}|)$ time. Thus, the total time for a specific $v$ is $O(|E_{G_0}| \times |N(v)|) + O(|E_{G_0}|) = O(|E_{G_0}| \times (|N(v)| + 1))$.

Edge updates on the data graph
Now, we explain G-insertEval (Algorithm 3), which is invoked for each edge insertion $(u, u')$. The main idea of G-insertEval is as follows: we try to match $(u, u')$ with tree edges in $P_T$ and then update the TreeMat through the vertex state transition strategy. Note that there may be more than one query edge in $P_T$ that $(u, u')$ matches, and not all matching situations cause an update of TreeMat. For this reason, we should exclude the invalid matching situations.
To exclude invalid matching situations, we first obtain the query edges in $P_T$ with the same edge label as $(u, u')$. Let $v$ be the parent of $v'$. Then, for each matched query edge $(v, v')$, we check whether $u' \in cand(v')$; if not, the match does not cause an update of TreeMat and is ignored (Lines 1-3). For each valid matching situation, we execute chooseVST to check whether $(u, u')$ can cause an update of TreeMat (Line 5). If so, chooseVST chooses the corresponding transition rule and updates the states of $u$ and $u'$. Moreover, chooseVST also checks whether the update caused by $(u, u')$ needs to be propagated upwards or downwards. If so, we set TreeMat.getTransition($(u, u')$) = true and update TreeMat by calling updateTreeMat (Algorithm 4) recursively (Lines 6-8). Here, updateTreeMat decides the update propagation direction (i.e., upwards or downwards) for the current iteration and executes the corresponding transition rule. The algorithms for edge deletions on the data graph are similar to those for edge insertions, except that they use Transitions 4-6 instead of Transitions 1-3. In the interest of space, the algorithm G-deleteEval (Line 9 of Algorithm 1) is not described here.

Edge updates on the pattern graph
In this subsection, we introduce P-deleteEval (Algorithm 5), which is invoked for each edge deletion $(v, v')$.
We first check whether $(v, v')$ is a non-tree edge; if so, it does not cause an update of TreeMat (Lines 2-3). Otherwise, if $v'$ is a leaf vertex, some NULL vertices may be added into $stree(v)$. In detail, if a vertex $u$ satisfies (1) $u$ has the same label as $v$; (2) $u \notin cand(v)$; and (3) for each child vertex $v_c$ of $v$ except $v'$, there is a data vertex $u_c \in cand(v_c)$ that is adjacent to $u$, then we add $u$ into $stree(v)$ (Lines 4-16). Note that, if $v_c$ is a leaf vertex, we should further check whether there is an edge $(u, u_c)$ matching $(v, v_c)$ with $u_c \in$ NULL; if so, we add $u_c$ into $stree(v_c)$ (Line 17). After that, we call updateTreeMat (Algorithm 4) recursively to update the TreeMat based on the status of $u$ (Line 18). Moreover, if $v'$ is not a leaf vertex, we should turn a non-tree edge that has $v'$ as an endpoint into a tree edge. In this case, we also set the status of all the candidates of $v'$ and of the descendants of $v'$ to stree (Line 20). Next, we update $stree(v)$ in a similar way to Lines 5-18. Adding a non-tree edge into $P_T$ causes some candidate vertices to be excluded. As a result, we should further check, for each vertex $u' \in cand(v')$, whether it is adjacent to a candidate of the other endpoint of the new tree edge. If not, we remove $u'$ from $cand(v')$; otherwise, we call updateTreeMat (Algorithm 4) recursively to update the TreeMat based on the status of $u'$ (Lines 22-26). The update is propagated upwards to the root vertex (Line 27).
Algorithms for edge insertions on the pattern graph are similar to those for edge deletions under the situation that v is not a leaf vertex. Omitted in the interest of space, the algorithm P-insertEval (Line 11 of Algorithm 1) is not described here.

Cost-driven pattern matching
Pattern evaluation phase is to harvest complete solutions to pattern graphs by leveraging TreeMat. We are in quest of boosting performance by conducting exploration on TreeMat.
Standard backtracking is viable but inefficient, since it neglects the matching order, which may greatly affect performance. A classic model for generic graph pattern matching [1,12] is as follows. Assume that the total cost is proportional to the number of comparisons for determining whether a vertex (or an edge) matches. Given an arbitrary order of vertices $(v_1, v_2, \ldots, v_n)$ for $P$, let $T_{iso}$ denote the number of comparisons performed in a backtracking algorithm. Nonetheless, $r_i$ largely depends on the actual order. The total number of configurations of $r_i$ is $O(|V_P|!)$, and thus it is prohibitively expensive to optimize $T_{iso}$ online. In response, we choose to minimize $T_{iso}$ greedily, i.e., every time we choose the vertex of minimum cost on the basis of the current intermediate results. Then, to match vertex $v_i$, the number of comparisons concerning $v_i$ can be expressed by

$$T(v_i) = \sum_{j=1}^{|M_{i-1}|} |d_i^j| \, (r_i + 1).$$

In addition, we unveil that the advantage of harnessing TreeMat also comes from the derivation of $d_i^j$ given $M_j$, which is inaccessible in pattern matching. Recall that a likelihood estimated over the entire topology graph is used in place of $d_i^j$ [1,12], which can be inaccurate. Lastly, to select the first vertex, we choose the one with minimum $\frac{|match(v)|}{deg(v)}$, where $deg(v)$ is the total degree of $v$.
The estimation above only considers the cost thus far (i.e., the current cost), but ignores the cost of the vertices yet to be accessed (i.e., the future cost). We contend that combining current and future costs may provide rewarding guidance for future steps. However, it is non-trivial to precisely compute the actual intermediate results after mapping $u_i$. To this end, we heuristically estimate the number of intermediate results using $p_i^j$, the likelihood that a vertex in $d_i^j$ has an edge satisfying the restriction of the $j$-th non-tree edge of $v_i$ connecting to a vertex that has been accessed.
Here, $r_k$ represents the number of vertices that have been accessed, except the parent of $v_i$, and that have edges connected to unaccessed vertices. Overall, the cost of mapping $u_i$ can be estimated by $T(v_i) + T'(v_i)$, where $T'(v_i)$ denotes the future cost. Experiments show that this provides better guidance to the matching process than alternative strategies.
Example 4 Consider the pattern graph and the match(·) sets of TreeMat in Figure 4. $v_1$ is set as the root vertex since $\frac{|match(v_1)|}{2}$ is minimum. Suppose that the vertices $v_1$ and $v_3$ have been matched. At this time, the number of intermediate results is 2, and we are going to choose the next vertex. If we choose $v_5$, the number of comparisons is $1 + 2 = 3$; if we choose $v_2$, the number of comparisons is $8 \times 2 = 16$. According to the greedy selection that only considers the current matching cost, we will choose $v_5$ as the next vertex, and the total number of comparisons is $1 + 2 + 3 + 12 \times 2 + 1 = 31$. However, if we take the future matching cost into account, we will choose $v_2$ as the next vertex, and the total number of comparisons is $1 + 2 + 8 \times 2 + 1 + 1 = 21$, which is smaller than 31.

Figure 4 Sample pattern graph and match(·) set of TreeMat
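The arithmetic of Example 4 can be replayed with a small helper that evaluates the step cost $T(v_i)$ as a sum of $|d_i^j| (r_i + 1)$ over the current intermediate results. The per-result candidate counts passed in below (1 and 2 for $v_5$, 8 and 8 for $v_2$, each with $r_i = 0$) are assumptions chosen only to be consistent with the figures quoted in the example; Figure 4 itself is not reproduced here.

```python
def step_cost(d_sizes, r):
    """T(v_i): for each current intermediate result j, |d_i^j| candidates are
    compared, each against r non-tree edges plus one tree edge."""
    return sum(d * (r + 1) for d in d_sizes)


# Two intermediate results exist after matching v1 and v3.
cost_v5 = step_cost([1, 2], r=0)   # 1 + 2
cost_v2 = step_cost([8, 8], r=0)   # 8 * 2

# Totals quoted in the example for the two candidate orders.
greedy_total = 1 + 2 + cost_v5 + 12 * 2 + 1
lookahead_total = 1 + 2 + cost_v2 + 1 + 1
```

The greedy order is cheaper at the current step (3 versus 16) but more expensive overall (31 versus 21), which is the motivation for adding the future cost.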
Correctness and complexity Based on the discussion above, we can implement a procedure for choosing the next vertex for matching. Note that using the new cost model may incur a lower cost. While the details of the procedure are omitted in the interest of space, it can be seen that the procedure runs in $O(|E_{G^*}| \times |V_{P^*}| \times |E_{P^*}|)$ time, where $G^*$ and $P^*$ are the updated data graph and the updated pattern graph, respectively.
Remark In comparison with existing cost models for pattern matching and order selection, the proposed model and algorithm are advantageous in the following respects:
- As identified by existing work [1], TurboFlux [10] fails to be applicable to large and complex query patterns; in contrast, CEPDG lends itself to large and complex queries under the more difficult matching criterion of subgraph isomorphism;
- Compared with QuickSI [12], which merely concentrates on a local cost with a greedy strategy, our proposed cost model generates a more effective matching order, which takes both existing and future costs into account and hence eliminates a large number of unpromising intermediate results;
- In comparison with CFL [1], which implements a path-based cost model, our model adopts an edge-based cost model, and thus is more flexible and less computationally expensive, while retaining the quality of order selection.
It can be seen that the cost-driven matching algorithm relies heavily on a good estimation of cand(·): the more accurate the estimation, the better the guidance for the matching order. In the sequel, we strive to offer a good estimation of candidates by leveraging an online saturation strategy with index support.

Experiments
In this section, we evaluate the performance of CEPDG against the state-of-the-art continuous subgraph matching methods, TurboFlux [10] and GraphFlow [9], on two real-life datasets. The source code of TurboFlux was obtained from its authors. The source code of GraphFlow was downloaded from GitHub. Then, we report experimental results and analyses.

Experiment setup
The proposed algorithms were implemented in C++ and run on a Linux machine with a two-core Intel Xeon 2.2 GHz CPU and 32 GB of main memory.

Datasets/Queries
We used two datasets, referred to as Yago and Netflow. Yago is a dataset that extracts facts from Wikipedia and integrates them with the WordNet thesaurus. This dataset consists of an initial graph G0 and a graph update stream Δg. G0 contains 12,375,749 triples, while Δg consists of insertions of 1,124,302 triples and deletions of 1,027,828 triples. Netflow contains anonymized passive traffic traces monitored from high-speed internet backbone links. In this dataset, G0 contains 14,378,113 triples and Δg consists of insertions of 1,236,412 triples and deletions of 1,107,635 triples.
As the datasets do not come with patterns, we generated a variety of patterns as follows. We first defined 4 pattern categories (A1 ∼ A4), and then extracted 20 patterns for each category by randomly traversing the topology graph. The size of the patterns in A1, A2, A3 and A4 is 15, 20, 25 and 30, respectively. Then, to generate the update stream for each graph pattern, every time we (1) randomly removed an existing edge while keeping the pattern graph connected; and (2) randomly added an edge between two disconnected vertices, with a random edge label drawn from a uniform distribution. Note that the number of edge insertions/deletions for each pattern graph did not exceed half of the pattern size (≤ 50%); otherwise, the fundamental characteristics of the pattern would disappear.
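As an illustration, the pattern update generation described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the generator used in the experiments: the names `is_connected` and `generate_pattern_updates` are hypothetical, "two disconnected vertices" is read as two non-adjacent vertices, connectivity is checked on the undirected view of the pattern, and "pattern size" is taken to be the number of initial edges.

```python
import random
from collections import defaultdict

def is_connected(vertices, edges):
    """Return True if the pattern, viewed as an undirected graph, is connected."""
    adj = defaultdict(set)
    for u, v, _label in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return len(seen) == len(vertices)

def generate_pattern_updates(vertices, edges, labels, rounds, rng):
    """Return a list of ("del" | "ins", (u, v, label)) operations.

    Each round removes one edge whose removal keeps the pattern connected,
    then adds one edge between two non-adjacent vertices with a uniformly
    random label. The total number of operations is capped at half the
    number of initial edges (an assumed reading of "pattern size").
    """
    current = set(edges)
    budget = len(current) // 2
    ops = []
    for _ in range(rounds):
        if len(ops) + 2 > budget:
            break
        # (1) Deletion: try edges in random order until connectivity is preserved.
        candidates = sorted(current)
        rng.shuffle(candidates)
        for e in candidates:
            if is_connected(vertices, current - {e}):
                current.remove(e)
                ops.append(("del", e))
                break
        # (2) Insertion: pick a non-adjacent vertex pair and a uniform random label.
        adjacent = {frozenset((u, v)) for u, v, _label in current}
        free = [(u, v) for u in vertices for v in vertices
                if u < v and frozenset((u, v)) not in adjacent]
        if free:
            u, v = rng.choice(free)
            e = (u, v, rng.choice(labels))
            current.add(e)
            ops.append(("ins", e))
    return ops
```

Passing a seeded `random.Random` instance as `rng` makes the generated stream reproducible across runs.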
Algorithms Since there is no existing research directly targeting our problem, two state-of-the-art algorithms were adapted and included for comparison with our proposed algorithm CEPDG: (1) TurboFlux [10] is an algorithm for pattern matching over dynamic graphs; to deal with an evolving pattern graph, it has to recompute its auxiliary data structure during each update. (2) GraphFlow [9] is an incremental algorithm that does not maintain intermediate results. The settings in Table 2 are used as default parameters in the experiments.

Evaluation of data graph updates
We use two measures: the average elapsed time and the size of intermediate results. For a fair comparison, we exclude the elapsed time for updating the data graph. That is, we take the average elapsed time of CEPDG to be the difference between the time for processing the graph update stream with and without continuous query answering, and we measure the time of the competitors for query processing only. Moreover, we conduct the experiments by inserting/deleting edges in batches of 10K (= 10 × 10^3). Inserting/deleting edges in batches means that matching results need to be computed only after all the edges of a batch have been added to or removed from the data graph. We set a 1-hour timeout for each query.
Figure 5 shows the performance results on the Yago dataset. Here, we set the number of edge insertions/deletions to 500K (= 500 × 10^3) and vary the query size from 15 to 30. Figure 5(1) shows the average elapsed time. CEPDG performs better than its competitors regardless of pattern size. Specifically, CEPDG outperforms TurboFlux by 2.28 ∼ 3.13 times, and GraphFlow by 36.67 ∼ 44.28 times. The reason is that GraphFlow does not maintain any intermediate results and therefore generates a much larger number of partial solutions than CEPDG and TurboFlux. CEPDG only needs to update part of the intermediate results for an edge update operation, so even when |E(P)| is large, CEPDG can still achieve better performance. Moreover, CEPDG can significantly reduce the time cost of the pattern matching process by means of the cost model. Figure 5(2) shows the average number of intermediate results. Since GraphFlow does not maintain any intermediate results, we only compare CEPDG with TurboFlux. Specifically, the average size of the intermediate results of TurboFlux is larger than that of CEPDG by 1.28 ∼ 1.54 times. This means that the representation used by CEPDG (TreeMat) is more concise than that used by TurboFlux.
Figure 6 shows the performance results on the Netflow dataset. CEPDG performs better than its competitors in both average elapsed time and average size of intermediate results, regardless of pattern size. Specifically, in Figure 6(1), CEPDG outperforms TurboFlux by up to 2.86 times, and GraphFlow by up to 90.72 times; in Figure 6(2), the average size of the intermediate results of TurboFlux is larger than that of CEPDG by up to 1.47 times. This is because Netflow has only eight edge labels and no vertex labels. Hence, the size of the intermediate results is enormous, and the time costs of TurboFlux and GraphFlow are very high.
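The batching and timing protocol described above can be illustrated with a short sketch. It is a simplified illustration only, not the measurement harness used in the experiments: the function names `batches`, `process_stream` and `query_time`, the representation of the data graph as a set of edges, and the `on_batch` matcher callback are all assumptions made for this example.

```python
import time
from itertools import islice

def batches(stream, batch_size):
    """Split an update stream into consecutive batches of at most batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_stream(graph_edges, stream, batch_size, on_batch=None):
    """Apply ("ins" | "del", edge) updates batch by batch.

    The optional on_batch callback (a stand-in for the matcher) is invoked
    once per batch, after all edges of that batch have been applied.
    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for batch in batches(stream, batch_size):
        for op, edge in batch:
            if op == "ins":
                graph_edges.add(edge)
            else:
                graph_edges.discard(edge)
        if on_batch is not None:
            on_batch(graph_edges, batch)
    return time.perf_counter() - start

def query_time(initial_edges, stream, batch_size, matcher):
    """Time with continuous query answering minus time for graph updates alone."""
    stream = list(stream)
    with_queries = process_stream(set(initial_edges), stream, batch_size, matcher)
    without_queries = process_stream(set(initial_edges), stream, batch_size, None)
    return with_queries - without_queries
```

Subtracting the update-only run isolates the query-processing cost; for very small inputs the difference can be dominated by timer noise, so in practice it would be averaged over many batches.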

Varying edge insertion size
In this subsection, we evaluate the impact of edge insertions into the data graph on the performance of CEPDG and its competitors. Here, we fixed the patterns in A3 and varied the number of newly inserted edges from 250K (= 250 × 10^3) to 1000K in 250K increments on Yago. Thus, the total number of update operations also increases accordingly. Figure 7(1) shows the processing time for each algorithm. CEPDG consistently performs better than its competitors. Moreover, the figure shows a non-exponential increase as the edge insertion size grows. Specifically, CEPDG outperforms TurboFlux by up to 2.44 times, and GraphFlow by up to 46.78 times at an edge insertion size of 1000K. CEPDG also outperforms its competitors in terms of the size of intermediate results, as shown in Figure 7(2). Specifically, the size of the intermediate results of TurboFlux is larger than that of CEPDG by up to 1.43 times when the insertion size is 1000K.

Varying edge deletion size
In this subsection, we evaluate the impact of edge deletions from the data graph on the performance of CEPDG and its competitors. Here, we fixed the patterns in A3 and varied the number of deleted edges from 250K (= 250 × 10^3) to 1000K in 250K increments on Yago. Figure 8(1) shows the processing time for each algorithm. Note that the gap between the performance of CEPDG and TurboFlux is larger than that in Figure 7(1). This is because the deletion of an edge (u, u′) could affect all subtrees of u in TurboFlux, whereas in CEPDG we only need to maintain the affected vertices in TreeMat, which are relatively few. Note also that the processing time of GraphFlow decreases slightly as the number of edge deletions increases. This is because edge deletions directly reduce the input data size of GraphFlow. Specifically, CEPDG outperforms TurboFlux by up to 3.16 times, and GraphFlow by up to 74.51 times. CEPDG also outperforms its competitors in terms of the size of intermediate results, as shown in Figure 8(2). Specifically, the size of the intermediate results of TurboFlux is larger than that of CEPDG by up to 1.25 times when the deletion size is 1000K.

Varying the data size
In this experiment, we evaluate the scalability of CEPDG against existing algorithms by varying the dataset size on Yago. Here, we fixed the patterns in A3, set the number of edge insertions/deletions to 500K (= 500 × 10^3), and randomly sampled about 20% to 100% of the Yago dataset so that the data and result distributions remain approximately the same as in the whole dataset. Then, we plot the total processing time and the size of intermediate results in Figure 9.
It is revealed that CEPDG consistently outperforms its competitors regardless of the dataset size. In general, CEPDG and TurboFlux show similar scaling behavior across all dataset sizes. This can be attributed to the proposed pruning and validation technique, which dramatically reduces the required sample size and maintains the intermediate results incrementally. The scalability suggests that CEPDG and TurboFlux can handle reasonably large real-life graphs, as existing algorithms for deterministic graphs do. Specifically, CEPDG outperforms TurboFlux by up to 2.14 times, and GraphFlow by up to 37.57 times. Figure 9(2) shows similar scalability of intermediate result sizes for CEPDG and TurboFlux. The size of the intermediate results of TurboFlux is larger than that of CEPDG by up to 1.64 times.
Figure 9 Performance for varying data size on Yago

Evaluating the effectiveness of the cost model
In this subsection, we evaluate the effectiveness of our proposed cost model. We compare the time cost of pattern matching with that of the state-of-the-art algorithm CFL [1] over the Yago and Netflow datasets, respectively. Since the number of candidates is, besides the matching order, also a key factor affecting the running time, for a fair comparison we use the same candidate set for every pattern vertex in both solutions. Here, we use the match(·) sets of TreeMat as candidates and plot the running time in Figure 10.
It is revealed that our proposed cost model never performs worse than that of CFL. Specifically, it can help lower the time cost by a factor of 10. The reason is that CFL implements a path-based cost model: the path selected each time is the one with minimal growth in result size, and after this path is processed, a new path is selected. Compared with CFL, our cost model does not estimate the cost of each path, but analyses the cost of each edge and considers the cost of both the current and the next step. Adjusting the cost model after joining an edge is more flexible than after joining a path. The results also suggest that our cost model is close to the real cost of the join process; otherwise, the join strategy would not work well and might choose a poor edge in some steps, resulting in a high cost for the join process.

Evaluation of pattern graph updates
In this section, we measure the average elapsed time and the size of intermediate results of CEPDG and TurboFlux.

Comparison of different matching orderings
In this set of experiments, we ran CEPDG using the patterns in A3 and measured the average elapsed time for a unit insertion/deletion, applying three different ordering strategies: randomly choosing an order, greedily choosing as per the current cost, and greedily choosing as per the estimated overall cost (our method). As expected, our proposed strategy outperforms the first strategy by 20.12x/23.74x, and the second strategy by 7.26x/8.12x for each edge insertion/deletion.
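To make the difference between the second and third strategies concrete, the following toy sketch contrasts a greedy choice on the current step cost with a greedy choice on the current cost plus a one-step estimate of the future cost. It is not the cost model of CEPDG: the per-edge expansion factors in `factor`, the function names, and the one-step lookahead are all assumptions chosen for illustration, and the pattern is assumed to be connected.

```python
def total_cost(order, factor):
    """Sum of intermediate-result sizes when joining pattern edges in the given order."""
    size, cost = 1.0, 0.0
    for e in order:
        size *= factor[e]
        cost += size
    return cost

def frontier(matched, remaining):
    """Pattern edges that touch at least one already matched vertex."""
    return [e for e in remaining if e[0] in matched or e[1] in matched]

def choose_order(edges, factor, start, lookahead):
    """Greedy edge ordering starting from vertex `start`.

    lookahead=False: minimise the size produced by the current step only.
    lookahead=True:  minimise current step size plus the cheapest next step.
    """
    matched, remaining, order, size = {start}, list(edges), [], 1.0

    def score(e):
        step = size * factor[e]
        if not lookahead:
            return step
        rest = [x for x in remaining if x != e]
        nxt = frontier(matched | set(e), rest)
        return step + min((step * factor[x] for x in nxt), default=0.0)

    while remaining:
        best = min(frontier(matched, remaining), key=score)
        order.append(best)
        remaining.remove(best)
        matched |= set(best)
        size *= factor[best]
    return order

# Toy pattern: a tree rooted at "a" with assumed expansion factors per edge.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")]
factor = {("a", "b"): 2.0, ("a", "c"): 3.0, ("b", "d"): 10.0, ("c", "e"): 0.5}

greedy = choose_order(edges, factor, "a", lookahead=False)
ahead = choose_order(edges, factor, "a", lookahead=True)
print(total_cost(greedy, factor))  # 41.0
print(total_cost(ahead, factor))   # 37.5
```

In this toy instance, the lookahead variant first pays for the slightly more expensive edge (a, c) because it unlocks the highly selective edge (c, e), which shrinks the intermediate results before the remaining edges are joined.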

Varying pattern size
In this set of experiments, we demonstrate the advantage of CEPDG regarding updates. We used two measures: the average elapsed time for a unit insertion/deletion and the size of partial solutions. Figure 11 shows the comparison for edge insertion, where we varied the pattern size from 15 to 30 (A1 ∼ A4). Note that the matching cost does not always increase as the pattern size grows. In Figure 11(1), CEPDG outperforms TurboFlux by up to 4.36 times. Note that TurboFlux has to recompute its auxiliary data structure, which is not rewarding under this setting. Figure 11(2) shows the size of partial solutions: the partial results of CEPDG are smaller than those of TurboFlux. This is intuitive, since the representation used by CEPDG (TreeMat) is more concise than that used by TurboFlux. Figure 12 shows the comparison for edge deletion, and trends similar to those in Figure 8 are observed. Figure 12(1) shows the average elapsed time of the two algorithms. CEPDG significantly outperforms TurboFlux in all cases. Specifically, CEPDG outperforms TurboFlux by up to 3.43 times. Further, the average elapsed time for a deletion is much longer than that for an insertion, which suggests that deleting an edge from the pattern graph may be more computationally expensive. Figure 12(2) shows the average size of partial solutions. The average size with TreeMat is smaller than that of TurboFlux by up to 1.14 times.

Varying edge update volume
In this set of experiments, we evaluate the impact of the number of edge updates on the performance of CEPDG and its alternatives. We fixed the patterns in A3 and varied the number of edge updates from 3 to 12 in increments of 3. Figure 13 shows the average elapsed time for each algorithm. CEPDG performs better than the others. Figure 13(1) shows a non-exponential increase as the number of insertions grows, with CEPDG outperforming TurboFlux. In Figure 13(2), for edge deletion, the performance gap is slightly larger than that for edge insertion, and CEPDG is more efficient than TurboFlux by up to 29.82 times.

Related work
This section categorizes related work on graph pattern matching into two streams: static and dynamic.
Graph pattern matching Graph pattern matching is typically defined in terms of subgraph isomorphism [4], which has been studied extensively since 1976. A key issue in subgraph isomorphism is to reduce the number of unpromising intermediate results when iteratively mapping vertices one by one from a pattern graph to a data graph. VF2 [4] and QuickSI [12] propose to enforce connectivity to prune the candidates. TurboISO [8] proposes to merge the nodes in a pattern graph that have the same labels and the same neighborhoods, to further reduce unpromising candidates. Another key issue is to generate an effective matching order. QuickSI [12] proposes to generate a matching order based on the infrequent-labels-first strategy. SPath [17] proposes to generate a matching order based on the infrequent-paths-first strategy, but its efficiency decreases as the size of the pattern graph grows. Bi et al. [1] develop a new framework that decomposes a pattern graph into a core and a forest for graph pattern matching. They showed that the core-forest-leaf ordering effectively reduces redundant Cartesian products. Han et al. [7] propose novel techniques for subgraph matching: dynamic programming between a DAG and a graph, adaptive matching order with DAG ordering, and pruning by failing sets. These methods work well on static graphs. However, substantial work is needed to support dynamic graphs.
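The idea of merging pattern nodes with the same label and the same neighborhood can be illustrated with a small sketch. It is a simplified illustration rather than the procedure of TurboISO [8]: the function name `merge_equivalent_vertices` is hypothetical, the pattern is treated as undirected with unlabeled edges, and only vertices with identical open neighborhoods are grouped.

```python
from collections import defaultdict

def merge_equivalent_vertices(labels, edges):
    """Group pattern vertices that share a label and an identical neighbor set.

    labels: dict mapping each pattern vertex to its label.
    edges:  iterable of undirected (u, v) pattern edges.
    Returns a sorted list of vertex groups; each group can be matched once
    and its candidates reused for every member.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    groups = defaultdict(list)
    for v in sorted(labels):
        groups[(labels[v], frozenset(neighbors[v]))].append(v)
    return sorted(groups.values())

# Star pattern: leaves 1 and 2 share label "B" and the single neighbor 0.
labels = {0: "A", 1: "B", 2: "B", 3: "C"}
edges = [(0, 1), (0, 2), (0, 3)]
print(merge_equivalent_vertices(labels, edges))  # [[0], [1, 2], [3]]
```

Because vertices 1 and 2 are interchangeable, a matcher only needs to compute their candidate set once, which is the source of the reduction in unpromising candidates.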
Dynamic graph pattern matching As graphs are dynamic in nature in real-life applications, pattern matching over large dynamic graphs is attracting increasing attention. INCISOMAT [5] identifies the nodes of the data graph that may produce new matches according to the changes in the data graph. However, the number of these nodes grows as the pattern graph gets larger, and the efficiency decreases dramatically. GraphFlow [9] applies a worst-case optimal join algorithm called Generic Join to incrementally evaluate subgraph matching for each update without maintaining intermediate results. For each query edge (v, v′) that matches an updated edge (u, u′), GraphFlow evaluates subgraph matching starting from the partial solution {(v, u), (v′, u′)}. SJ-TREE [2] decomposes the main pattern graph based on the selectivity of vertex attributes; the highly selective sub-pattern is evaluated first, and the remaining sub-patterns are evaluated only when new results are found in the previously evaluated sub-patterns. Thus, a lot of unnecessary computational cost is avoided. However, since the decomposition features are simple, many intermediate results are produced when the pattern graph gets larger. The pattern decomposition approach in [6] is based on identifying optimal sub-DAGs (directed acyclic graphs) in the pattern graph. The DAGs are then traversed to identify source and sink vertices, which define message transition rules in the Giraph framework. This approach targets a distributed implementation and is not suitable for all types of patterns. TurboFlux [10] is the state-of-the-art algorithm for continuous subgraph matching; it employs a data-centric representation of intermediate results, in the sense that the query pattern P is embedded into the data graph G, and its execution model allows fast incremental maintenance. Wang and Chen [13] also deal with continuous subgraph matching for evolving graphs. However, their method produces approximate results only, while our approach generates exact results.
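The seeding step described above for GraphFlow, where each query edge (v, v′) matching an updated edge (u, u′) yields the partial solution {(v, u), (v′, u′)}, can be sketched as follows. This is an illustrative sketch, not GraphFlow's code: the function name `seed_partial_solutions` is hypothetical, and a query edge is assumed to match an updated edge whenever their edge labels are equal (vertex labels are ignored).

```python
def seed_partial_solutions(query_edges, updated_edge):
    """Build one seed mapping per query edge that matches the updated data edge.

    query_edges:  iterable of (v, v2, label) pattern edges.
    updated_edge: (u, u2, label) edge inserted into or deleted from the data graph.
    Returns a list of partial solutions, each a dict {v: u, v2: u2}.
    """
    u, u2, label = updated_edge
    seeds = []
    for v, v2, q_label in query_edges:
        if q_label == label:
            seeds.append({v: u, v2: u2})
    return seeds

query = [("v1", "v2", "knows"), ("v2", "v3", "likes"), ("v3", "v1", "knows")]
print(seed_partial_solutions(query, ("x", "y", "knows")))
# [{'v1': 'x', 'v2': 'y'}, {'v3': 'x', 'v1': 'y'}]
```

Each seed would then be extended to the remaining query vertices; since no intermediate results are kept between updates, this extension is recomputed from scratch for every update.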
The above algorithms only solve the graph pattern matching problem in the scenario where the data graph alone is updated. In this paper, we investigate a new problem, continuous matching of evolving patterns over dynamic graph data, which continuously reports matches for each update operation in the graph update stream.

Conclusion
In this paper, we addressed a more complicated but very practical graph pattern matching problem, continuous matching of evolving patterns over dynamic graph data, and presented a novel algorithm, CEPDG, for continuous pattern matching under changes to both the pattern graph and the data graph. We showed that CEPDG overcomes the problems of existing methods and efficiently processes continuous subgraph matching for each update operation on the data graph and the pattern graph.
We first proposed a concise representation, TreeMat, based on the spanning tree of the initial pattern graph, for storing partial solutions. We then proposed the vertex state transition strategy, which efficiently identifies which update operations on the data graph can affect the current partial solutions and maintains TreeMat accordingly. We next presented an execution model to efficiently and incrementally maintain the representation during edge updates on the pattern graph, which is fully compatible with the algorithm proposed for the data graph. Finally, we conceived an effective cost model for estimating the step-wise cost of pattern matching.
Extensive experiments showed that CEPDG outperformed existing competitors by up to orders of magnitude. Overall, we believe our continuous subgraph matching solution provides comprehensive insight and a substantial framework for future research.