# Graph Pattern Matching

**DOI:** https://doi.org/10.1007/978-3-319-63962-8_74-1

## Keywords

Graph Pattern Matching; Uncertain Graph; Transaction Graph; Graph Query Processing; Underlying Graph Data

## Definition

The graph pattern matching problem is to find the answers *Q*(*G*) of a pattern query *Q* in a given graph *G*. The answers are induced by a specific query language and ranked by a quality measure. The problem can be categorized into three classes (Khan and Ranu 2017): (1) subgraph/supergraph containment queries, (2) graph similarity queries, and (3) graph pattern matching.

In the context of searching a graph database *D* that consists of many (small) graph transactions, graph pattern matching finds the answers *Q*(*G*) as a set of graphs from *D*. A subgraph (resp. supergraph) containment query finds the graphs in *D* that contain *Q* as a subgraph (resp. that are contained in *Q* as subgraphs). Graph similarity queries find *Q*(*G*) as all graph transactions that are similar to *Q* under a particular similarity measure.

In the context of searching a single graph *G*, graph pattern matching is to find all the occurrences of a query graph *Q* in a given data graph *G*, specified by a matching function. The remainder of this entry discusses various aspects of graph pattern matching in a single graph.

## Overview

This entry discusses the foundations of the graph pattern matching problem: formal definitions, query languages, and complexity bounds.

### Data graph.

A data graph *G* = (*V*, *E*, *L*) is a directed graph with a finite set of nodes *V*, a set of edges *E* ⊆ *V* × *V*, and a function *L* that assigns node and edge content. Given a query \(Q = (V_Q, E_Q, L_Q)\), a node *v* ∈ *V* in *G* with content *L*(*v*) (resp. an edge *e* ∈ *E* with content *L*(*e*)) that satisfies the constraint \(L_Q(u)\) of a node \(u \in V_Q\) (resp. \(L_Q(e_u)\) of an edge \(e_u \in E_Q\)) is a candidate match of *u* (resp. \(e_u\)). Common data graph models include RDF and property graphs (see chapter “Graph Data Models”).

### Graph pattern query.

A graph pattern query *Q* is a directed graph \((V_Q, E_Q, L_Q)\) that consists of a set of nodes \(V_Q\) and a set of edges \(E_Q\), as defined for data graphs. The function \(L_Q\) assigns a predicate \(L_Q(v)\) to each node \(v \in V_Q\) (resp. \(L_Q(e)\) to each edge \(e \in E_Q\)). A predicate \(L_Q(\cdot)\) can be expressed as a conjunction of atomic formulas over attributes from the node and edge schema of a data graph, or over information retrieval metrics.

### Graph pattern matching.

A match *Q*(*G*) of a pattern query *Q* in a data graph *G* is a substructure of *G* induced by a matching function \(F \subseteq V_Q \times V\). The node *F*(*u*) (resp. edge *F*(*e*)) is called a match of a node \(u \in V_Q\) (resp. an edge \(e \in E_Q\)). Given a query \(Q = (V_Q, E_Q, L_Q)\) and a data graph *G* = (*V*, *E*, *L*), the graph pattern matching problem is to find all the matches of *Q* in *G*.
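The definitions above can be made concrete with a minimal Python sketch. It assumes node content is an attribute dictionary and each query node carries a predicate over such dictionaries; the node names, attributes, and the helper `candidates` are hypothetical and only illustrate how candidate matches are selected before any structural matching takes place. Edge predicates are omitted for brevity.

```python
from typing import Any, Callable, Dict, List

Node = str
Attrs = Dict[str, Any]
Predicate = Callable[[Attrs], bool]

# Node content L(v) of a small data graph (hypothetical data).
data_nodes: Dict[Node, Attrs] = {
    "v1": {"type": "person", "age": 34},
    "v2": {"type": "person", "age": 19},
    "v3": {"type": "city", "name": "Oslo"},
}

# Query predicates L_Q(u): conjunctions of atomic formulas over attributes.
query_nodes: Dict[Node, Predicate] = {
    "u_person": lambda a: a.get("type") == "person" and a.get("age", 0) >= 21,
    "u_city": lambda a: a.get("type") == "city",
}


def candidates(q_nodes: Dict[Node, Predicate],
               g_nodes: Dict[Node, Attrs]) -> Dict[Node, List[Node]]:
    """For each query node u, list the data nodes whose content satisfies L_Q(u)."""
    return {
        u: sorted(v for v, attrs in g_nodes.items() if pred(attrs))
        for u, pred in q_nodes.items()
    }


cand = candidates(query_nodes, data_nodes)
print(cand)
```

A matching function *F* then has to be chosen within these candidate sets, subject to the structural constraints of the query model in use.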

### Subgraph pattern queries.

Traditional graph pattern queries are defined by subgraph isomorphism (Ullmann 1976): (1) each node in *Q* has a unique label, and (2) the matching function *F* specifies a subgraph isomorphism, which is an injective function that enforces an edge-to-edge matching and label equality.

The subgraph isomorphism problem is NP-complete. Ullmann proposed the first practical backtracking algorithm for subgraph isomorphism search in graphs (Ullmann 1976). The algorithm enumerates exact matches by extending partial solutions and prunes a partial solution as soon as it determines that it cannot be completed. A number of methods, such as VF2 (Cordella et al. 2004), QuickSI (Shang et al. 2008), GraphQL (He and Singh 2008), GADDI (Zhang et al. 2009), and SPath (Zhao and Han 2010), have been proposed for graph pattern queries defined by subgraph isomorphism. These algorithms follow the backtracking principle of Ullmann (1976) and improve efficiency by exploiting different join orders and various pruning techniques. Lee et al. (2012) conducted an experimental study to compare these methods and found that QuickSI, although designed for small graphs, often outperforms its peers on both small and large graphs.
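The backtracking principle can be sketched as follows. This is a deliberately naive illustration, not Ullmann's algorithm or any of the optimized methods cited above: it fixes an arbitrary matching order, extends a partial injective mapping one query node at a time, and abandons a branch as soon as a label or edge constraint fails. The function name `subgraph_matches` and the example graphs are assumptions made for illustration.

```python
from typing import Dict, Iterator, Set, Tuple

Node = str
Edge = Tuple[Node, Node]


def subgraph_matches(q_labels: Dict[Node, str], q_edges: Set[Edge],
                     g_labels: Dict[Node, str], g_edges: Set[Edge]
                     ) -> Iterator[Dict[Node, Node]]:
    """Enumerate injective, label-preserving mappings that map every query
    edge to a data edge (edge-to-edge matching)."""
    order = list(q_labels)  # naive matching order; real systems optimize this

    def consistent(u: Node, v: Node, mapping: Dict[Node, Node]) -> bool:
        if q_labels[u] != g_labels[v] or v in mapping.values():
            return False
        for a, b in q_edges:
            # Check query edges between u and nodes that are already mapped.
            if a == u and b in mapping and (v, mapping[b]) not in g_edges:
                return False
            if b == u and a in mapping and (mapping[a], v) not in g_edges:
                return False
            if a == u and b == u and (v, v) not in g_edges:  # self-loop
                return False
        return True

    def extend(i: int, mapping: Dict[Node, Node]) -> Iterator[Dict[Node, Node]]:
        if i == len(order):
            yield dict(mapping)  # complete match
            return
        u = order[i]
        for v in g_labels:
            if consistent(u, v, mapping):
                mapping[u] = v
                yield from extend(i + 1, mapping)
                del mapping[u]  # backtrack

    yield from extend(0, {})


# Query: a(A) -> b(B) -> c(C).  Data: 1(A) -> 2(B) -> 3(C), and 1(A) -> 4(B).
q_labels = {"a": "A", "b": "B", "c": "C"}
q_edges = {("a", "b"), ("b", "c")}
g_labels = {"1": "A", "2": "B", "3": "C", "4": "B"}
g_edges = {("1", "2"), ("2", "3"), ("1", "4")}

matches = list(subgraph_matches(q_labels, q_edges, g_labels, g_edges))
print(matches)
```

In this example the branch that maps `b` to node `4` is pruned when no `C`-labeled child can be found, which is the kind of early termination the cited methods make far more aggressive.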

The major challenge for querying big graphs is twofold: (1) The query models have high complexity; for example, subgraph isomorphism is NP-hard. (2) It is often hard for end users to write accurate graph pattern queries. Major research effort has therefore focused on two aspects: (a) developing computationally efficient query models and algorithms, and (b) developing user-friendly graph search paradigms.

## Key Research Findings

### Computationally Efficient Query Models

### Graph simulation.

The class of simulation-based query models includes graph simulation and bounded simulation (Fan et al. 2010), strong simulation (Ma et al. 2011), and pattern queries with regular expressions (Fan et al. 2011). A graph simulation \(R \subseteq V_Q \times V\) relaxes subgraph isomorphism to an edge-to-edge matching relation. For any pair (*u*, *v*) ∈ *R*, *u* and *v* have the same label, and each child *u′* of *u* must have at least one match *v′* that is a child of *v*, such that (*u′*, *v′*) ∈ *R*. Strong simulation (Ma et al. 2011) enforces simulation for both parents and children of a node in *Q*. Regular expressions are posed on edges in *Q* to find paths whose concatenated edge labels satisfy the regular expression (Fan et al. 2011). All these query models can be evaluated in low polynomial time.
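The child condition of graph simulation lends itself to a simple fixpoint computation, sketched below under the assumption of plain node labels and unlabeled edges. The sketch starts from all label-compatible pairs and repeatedly removes pairs that violate the child condition; it is a naive polynomial-time refinement written for clarity, not the optimized algorithm of the cited papers, and the name `graph_simulation` is hypothetical.

```python
from typing import Dict, Set, Tuple

Node = str
Edge = Tuple[Node, Node]


def graph_simulation(q_labels: Dict[Node, str], q_edges: Set[Edge],
                     g_labels: Dict[Node, str], g_edges: Set[Edge]
                     ) -> Dict[Node, Set[Node]]:
    """Return the maximum simulation as a map from query node to data nodes,
    or all-empty sets if some query node has no match."""
    q_children: Dict[Node, Set[Node]] = {u: set() for u in q_labels}
    for a, b in q_edges:
        q_children[a].add(b)
    g_children: Dict[Node, Set[Node]] = {v: set() for v in g_labels}
    for a, b in g_edges:
        g_children[a].add(b)

    # Start from every label-compatible pair, then refine.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u in q_labels:
            for v in list(sim[u]):
                # v must have, for each child u' of u, a child v' matching u'.
                if any(not (g_children[v] & sim[uc]) for uc in q_children[u]):
                    sim[u].discard(v)
                    changed = True
    if any(not vs for vs in sim.values()):
        return {u: set() for u in q_labels}
    return sim


# Query: a(A) <-> b(B) (a two-node cycle).
# Data: 1(A) <-> 2(B), 3(A) -> 2(B), and an isolated node 4(B).
q_labels = {"a": "A", "b": "B"}
q_edges = {("a", "b"), ("b", "a")}
g_labels = {"1": "A", "2": "B", "3": "A", "4": "B"}
g_edges = {("1", "2"), ("2", "1"), ("3", "2")}

sim = graph_simulation(q_labels, q_edges, g_labels, g_edges)
print(sim)
```

Here both `1` and `3` match `a`, although no injective function could map `a` to two nodes at once; this is the sense in which simulation relaxes subgraph isomorphism from a function to a relation.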

### Approximate pattern matching.

Approximate graph matching techniques use cost functions to find approximate (top-*k*) matches quickly. Answer quality is usually characterized by the aggregated proximity among matches, as determined by their neighborhoods, which copes more effectively with ambiguous queries and possible mismatches in data graphs. Traditional cost measures include graph edit distance, maximum common subgraph, the number of mismatched edges, and various graph kernels. Indexing techniques and synopses are typically used to improve the efficiency of pattern matching in large data graphs (see chapter “Graph Indexing and Synopses”).

Notable examples include PathBLAST (Kelley et al. 2004), SAGA (Tian et al. 2006), NetAlign (Liang et al. 2006), and IsoRank (Singh et al. 2008), which target bioinformatics applications. G-Ray (Tong et al. 2007) aims to maintain the shape of the query via effective path exploration. TALE (Tian and Patel 2008) uses edge misses to measure the quality of a match. NESS and NeMa (Khan et al. 2011, 2013) incorporate more flexibility in graph pattern matching by allowing an edge in the query graph to be matched with a path of up to a certain number of hops in the target graph. Both methods identify the top-*k* matches, which is useful in searching knowledge graphs and social networks.
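The idea of matching a query edge to a bounded-length path can be illustrated with a short sketch. The cost used here (the number of extra hops summed over all query edges) is an assumption chosen for simplicity; it is not the cost function of NESS, NeMa, or any other cited system, and the helper names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple

Node = str


def hops_within(adj: Dict[Node, List[Node]], src: Node, dst: Node,
                h: int) -> Optional[int]:
    """Length of the shortest directed path (at least one hop) from src to
    dst if it is at most h hops, otherwise None."""
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == h:
            continue  # do not explore beyond the hop bound
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None


def relaxed_match_cost(mapping: Dict[Node, Node],
                       q_edges: Iterable[Tuple[Node, Node]],
                       adj: Dict[Node, List[Node]], h: int) -> Optional[int]:
    """Illustrative cost: total number of extra hops over all query edges,
    or None if some query edge has no path within h hops."""
    cost = 0
    for a, b in q_edges:
        d = hops_within(adj, mapping[a], mapping[b], h)
        if d is None:
            return None
        cost += d - 1  # an exact edge match contributes zero
    return cost


# Data graph: 1 -> 2 -> 3.  Query: a single edge a -> b.
adj = {"1": ["2"], "2": ["3"], "3": []}
q_edges = [("a", "b")]

exact = relaxed_match_cost({"a": "1", "b": "2"}, q_edges, adj, h=2)
relaxed = relaxed_match_cost({"a": "1", "b": "3"}, q_edges, adj, h=2)
too_far = relaxed_match_cost({"a": "1", "b": "3"}, q_edges, adj, h=1)
print(exact, relaxed, too_far)
```

A top-*k* procedure would rank candidate mappings by such a cost; how candidates are generated and pruned efficiently is where the cited methods differ.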

### Making Big Graphs Small

When the query class is inherently expensive, it is usually nontrivial to reduce the complexity. The second category of research aims to identify a small and relevant fraction \(G_Q\) of a large-scale graph *G* that suffices to provide approximate matches for a query *Q*. That is, the match \(Q(G_Q)\) is the same as, or close to, its exact counterpart *Q*(*G*). Several strategies have been developed for graph pattern matching under this principle.

### Making graph query bounded.

Bounded evaluability of graph pattern queries aims to develop algorithms whose computation cost is provably determined by a small fraction of the graph. Notable examples include query-preserving graph compression (Fan et al. 2012a) and querying with bounded resources (Fan et al. 2014b; Cao et al. 2015). (1) Given a query class \(\mathscr{Q}\), query-preserving compression applies a fast, once-for-all preprocessing step that compresses the data graph *G* into a small, queryable graph \(G_Q\), which guarantees that \(Q(G) = Q(G_Q)\) for any query instance *Q* from \(\mathscr{Q}\). Compression schemes have been developed for graph pattern queries defined by reachability and graph simulation. (2) Querying with bounded resources additionally imposes a size bound on \(G_Q\). The method of Cao et al. (2015) makes use of a class of access constraints to estimate and fetch the amount of data needed to evaluate graph queries. When no such constraints exist, another method (Fan et al. 2014b) induces \(G_Q\) dynamically to give approximate matches. Both guarantee that only a \(G_Q\) of bounded size is accessed. (3) Graph views are also used to achieve bounded evaluability, where query processing refers only to a set of materialized views of bounded size (Fan et al. 2014a).
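As a simple stand-in for query-preserving compression, the sketch below collapses each strongly connected component of *G* into a single node. This classical condensation preserves the answer to every reachability query and is performed once, before any query arrives; it is not the compression scheme of Fan et al. (2012a), which is more aggressive. The function names are hypothetical, and reachability is treated as reflexive.

```python
from typing import Dict, Iterable, List, Set, Tuple

Node = str
Edge = Tuple[Node, Node]


def condense(nodes: Iterable[Node], edges: Iterable[Edge]
             ) -> Tuple[Dict[Node, Node], Dict[Node, Set[Node]]]:
    """Kosaraju-style condensation. Returns (component of each node,
    adjacency of the compressed graph over component representatives)."""
    nodes, edges = list(nodes), list(edges)
    adj: Dict[Node, List[Node]] = {v: [] for v in nodes}
    radj: Dict[Node, List[Node]] = {v: [] for v in nodes}
    for a, b in edges:
        adj[a].append(b)
        radj[b].append(a)

    # Pass 1: record nodes in order of DFS completion (iterative DFS).
    visited: Set[Node] = set()
    order: List[Node] = []
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:  # all children explored
                order.append(v)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    comp_of: Dict[Node, Node] = {}
    for s in reversed(order):
        if s in comp_of:
            continue
        comp_of[s] = s
        todo = [s]
        while todo:
            v = todo.pop()
            for w in radj[v]:
                if w not in comp_of:
                    comp_of[w] = s
                    todo.append(w)

    small: Dict[Node, Set[Node]] = {c: set() for c in set(comp_of.values())}
    for a, b in edges:
        if comp_of[a] != comp_of[b]:
            small[comp_of[a]].add(comp_of[b])
    return comp_of, small


def reaches(comp_of: Dict[Node, Node], small: Dict[Node, Set[Node]],
            u: Node, v: Node) -> bool:
    """Answer 'does u reach v in G?' using only the compressed graph."""
    src, dst = comp_of[u], comp_of[v]
    seen, todo = {src}, [src]
    while todo:
        c = todo.pop()
        if c == dst:
            return True
        for d in small[c]:
            if d not in seen:
                seen.add(d)
                todo.append(d)
    return False


# Data graph: a cycle 1 -> 2 -> 3 -> 1, followed by 3 -> 4 -> 5.
comp_of, small = condense(
    ["1", "2", "3", "4", "5"],
    [("1", "2"), ("2", "3"), ("3", "1"), ("3", "4"), ("4", "5")],
)
print(len(small), reaches(comp_of, small, "2", "5"), reaches(comp_of, small, "5", "1"))
```

The five-node graph is compressed to three nodes, and every reachability query over the original nodes is answered on the smaller graph with the same result.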

### Incremental evaluation.

When the data graph *G* undergoes changes Δ*G*, bounded evaluability is achieved by applying *incremental pattern matching* (Fan et al. 2013). Since Δ*G* is typically small in practice, \(G_Q\) is induced by combining Δ*G* with a small fraction of *G* (called the “affected area”) that must be visited to update *Q*(*G*) to its counterpart in the updated graph. An algorithm guarantees bounded computation if it incurs a cost determined only by *Q* and the neighbors of \(G_Q\) (up to a certain number of hops), independent of the size of *G*. Bounded incremental algorithms have been identified for several graph pattern queries, including reachability and graph simulation (Fan et al. 2013).
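The notion of an affected area can be illustrated for graph simulation under edge deletions. Deleting data edges can only remove pairs from the maximum simulation, so the sketch below starts from the previously computed relation and re-checks only the sources of deleted edges, propagating to parents when a match is lost. It is a simplified illustration under these assumptions, not the algorithm of Fan et al. (2013); the function name and arguments are hypothetical, and edge insertions are not handled.

```python
from typing import Dict, Iterable, List, Set, Tuple

Node = str
Edge = Tuple[Node, Node]


def refine_after_deletions(q_edges: Iterable[Edge],
                           g_edges_after: Iterable[Edge],
                           old_sim: Dict[Node, Set[Node]],
                           deleted: Iterable[Edge]) -> Dict[Node, Set[Node]]:
    """Update a maximum simulation after the edges in `deleted` were removed.
    `g_edges_after` is the edge set of the updated data graph."""
    q_children: Dict[Node, Set[Node]] = {u: set() for u in old_sim}
    for a, b in q_edges:
        q_children[a].add(b)
    g_children: Dict[Node, Set[Node]] = {}
    g_parents: Dict[Node, Set[Node]] = {}
    for a, b in g_edges_after:
        g_children.setdefault(a, set()).add(b)
        g_parents.setdefault(b, set()).add(a)

    sim = {u: set(vs) for u, vs in old_sim.items()}  # do not mutate the input
    # Initially, only sources of deleted edges can have lost a required child.
    worklist: List[Node] = [a for a, _ in deleted]
    while worklist:
        v = worklist.pop()
        for u in sim:
            if v in sim[u] and any(
                not (g_children.get(v, set()) & sim[uc]) for uc in q_children[u]
            ):
                sim[u].discard(v)
                # Parents of v may have relied on v as a match; re-check them.
                worklist.extend(g_parents.get(v, ()))
    return sim


# Query: a -> b.  Old data graph: 1 -> 2, 3 -> 2, 3 -> 4 (1, 3 match a; 2, 4 match b).
old_sim = {"a": {"1", "3"}, "b": {"2", "4"}}
new_sim = refine_after_deletions(
    q_edges=[("a", "b")],
    g_edges_after=[("3", "2"), ("3", "4")],
    old_sim=old_sim,
    deleted=[("1", "2")],
)
print(new_sim)
```

Only node `1` is inspected in this example, regardless of how large the rest of the data graph is. A caller would still need to treat the result as empty if some query node is left without matches.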

### Parallelizing sequential computation.

When the data graph *G* is too large for a single processor, bounded evaluability can be achieved by distributing *G* across several processors. Each processor handles a fragment \(G_Q\) of *G* of manageable size with bounded evaluability, guaranteed by partitioning, load balancing, and other optimization techniques. Existing models include bulk (synchronous and asynchronous) vertex-centric computation and graph-centric models, which are designed to parallelize node-level and graph-level computations, respectively (see chapter “Parallel Graph Processing”).

Several characterizations of such models have been proposed. Notable examples include *scale independence* (Armbrust et al. 2009), which ensures that a parallel system enforces strict invariants on the cost of query execution as data size grows, and *parallel scalability* (Fan et al. 2014c, 2017), which ensures that the system achieves a polynomial speed-up over sequential computation as more processors are used. One notable technique automatically parallelizes sequential graph computations without recasting them (Fan et al. 2017). Its declarative platform defines parallel computation with three core functions for any processor: local computation, incremental computation (upon receiving new messages), and assembly of partial answers from workers. Several parallel scalable algorithms have been developed under this principle for graph pattern queries, including reachability and graph simulation (Fan et al. 2012b, 2014c).
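The three-function style of parallel computation can be sketched with single-source reachability over a partitioned graph. The sketch simulates the workers sequentially in one process; the names `peval`, `inceval`, and `assemble`, the `Fragment` class, and the message format are assumptions made for illustration and do not reproduce the interface of the platform described by Fan et al. (2017).

```python
from typing import Dict, Iterable, List, Set, Tuple

Node = str
Edge = Tuple[Node, Node]


class Fragment:
    """One worker's share of the graph: owned nodes plus their outgoing edges."""

    def __init__(self, nodes: Iterable[Node], edges: Iterable[Edge]) -> None:
        self.nodes: Set[Node] = set(nodes)
        self.adj: Dict[Node, List[Node]] = {v: [] for v in self.nodes}
        for a, b in edges:  # every source a is assumed to be owned here
            self.adj[a].append(b)
        self.reached: Set[Node] = set()  # partial answer kept by the worker


def _expand(frag: Fragment, seeds: Iterable[Node]) -> Set[Node]:
    """Sequential traversal inside one fragment. Returns the nodes owned by
    other fragments that were touched (the outgoing messages)."""
    outgoing: Set[Node] = set()
    stack = [s for s in seeds if s in frag.nodes and s not in frag.reached]
    frag.reached.update(stack)
    while stack:
        v = stack.pop()
        for w in frag.adj[v]:
            if w not in frag.nodes:
                outgoing.add(w)
            elif w not in frag.reached:
                frag.reached.add(w)
                stack.append(w)
    return outgoing


def peval(frag: Fragment, source: Node) -> Set[Node]:
    """Local computation: run the sequential algorithm on the fragment."""
    return _expand(frag, [source])


def inceval(frag: Fragment, messages: Set[Node]) -> Set[Node]:
    """Incremental computation: continue from newly received border nodes."""
    return _expand(frag, messages)


def assemble(frags: List[Fragment]) -> Set[Node]:
    """Combine the partial answers of all workers."""
    return set().union(*(f.reached for f in frags))


def run(frags: List[Fragment], source: Node) -> Set[Node]:
    inbox: Set[Node] = set()
    for f in frags:
        inbox |= peval(f, source)
    while inbox:  # iterate until no worker produces new messages
        nxt: Set[Node] = set()
        for f in frags:
            nxt |= inceval(f, inbox)
        inbox = nxt
    return assemble(frags)


f1 = Fragment(["1", "2"], [("1", "2"), ("2", "3")])
f2 = Fragment(["3", "4", "5"], [("3", "4"), ("4", "1"), ("5", "3")])
result = run([f1, f2], source="1")
print(sorted(result))
```

Each function reuses the same sequential traversal, which reflects the idea of parallelizing an existing sequential algorithm rather than rewriting it for a vertex-centric model.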

### User-Friendly Pattern Matching

A third category of research aims to develop algorithms that can better infer the search intent of graph queries, especially when end users have no prior knowledge of underlying graph data. We review two approaches specialized for graph pattern matching.

### Graph keyword search.

Keyword search over graphs allows users to provide a list of keywords and returns subtrees/subgraphs as interpretable answers. Various ranking criteria also exist to find the top-*k* answers, e.g., sum of all edge weights in the resulting tree/graph, sum of all path weights from root to each keyword in the tree, maximum pairwise distance among nodes, etc. Notable examples include BANKS (Aditya et al. 2002) and BLINKS (He et al. 2007) (see chapter “Graph Exploration and Search” for details). Several keyword searches find answers that are connected subgraphs containing all the query keywords. Examples include *r*-clique (Kargar and An 2011), where the nodes in an *r*-clique are separated by a distance at most *r*. Query refinement and reformulation techniques are studied to convert the keyword queries to structured queries such as SPARQL. Beyond keyword queries, Zheng et al. (2017) develop a more interactive way of answering natural language queries over graph data. For more details of graph queries, see chapters “Graph Query Languages” and “Graph Query Processing.”
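The *r*-clique condition can be checked directly from its definition. The sketch below is a brute-force illustration that enumerates one node per keyword and keeps the combinations whose pairwise distances are at most *r*; it assumes an undirected graph given as a symmetric adjacency map and makes no attempt to reproduce the ranking or approximation techniques of Kargar and An (2011). The function names and the example data are hypothetical.

```python
from collections import deque
from itertools import combinations, product
from typing import Dict, List, Sequence, Tuple

Node = int


def bfs_dist(adj: Dict[Node, List[Node]], src: Node, limit: int) -> Dict[Node, int]:
    """Hop distances from src to every node within `limit` hops."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if dist[v] == limit:
            continue  # do not expand beyond the distance bound
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def r_cliques(adj: Dict[Node, List[Node]],
              keyword_nodes: Dict[str, List[Node]],
              keywords: Sequence[str], r: int) -> List[Tuple[Node, ...]]:
    """All choices of one node per keyword with pairwise distance at most r.
    Exponential in the number of keywords; for illustration only."""
    near = {v: bfs_dist(adj, v, r)
            for k in keywords for v in keyword_nodes.get(k, ())}
    answers = []
    for combo in product(*(keyword_nodes.get(k, ()) for k in keywords)):
        if all(b in near[a] for a, b in combinations(combo, 2)):
            answers.append(combo)
    return answers


# Undirected path 1 - 2 - 3 - 4; node 1 contains "graph", nodes 2 and 4 contain "query".
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
keyword_nodes = {"graph": [1], "query": [2, 4]}

tight = r_cliques(adj, keyword_nodes, ["graph", "query"], r=1)
loose = r_cliques(adj, keyword_nodes, ["graph", "query"], r=3)
print(tight, loose)
```

A smaller *r* yields answers whose keyword nodes are more tightly connected, at the price of possibly returning no answer at all.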

### Graph query by example.

A relatively new paradigm for user-friendly graph querying is graph query by example. Query-by-example (QBE) has a successful history in relational databases, HTML tables, and entity sets. Exemplar queries (Mottin et al. 2014) and GQBE (Jayaram et al. 2015) adapt similar ideas to knowledge graphs. In particular, a user may find it difficult to write her query precisely, but she might know a few answers to it. A graph query-by-example system allows her to input an answer tuple as the query and returns other similar tuples that are present in the target graph. The underlying system follows a two-step approach. Given the input example tuple(s), it first identifies the query graph that captures the user’s query intent. Then, it evaluates the query graph to find other relevant answer tuples. Major techniques of graph query by example are introduced in chapter “Graph Exploration and Search.”

## Key Applications

Graph pattern matching has been widely adopted for accessing and understanding network data. Representative applications include community detection in social networks, attack detection in cyber security, querying the Web and knowledge bases as knowledge graphs, pattern recognition in brain and healthcare data, sensor networks in smart environments, and the Internet of Things.

## Future Directions

One open challenge remains the design of efficient and scalable graph pattern query models and algorithms that meet the need of querying big graphs that are heterogeneous, large, and dynamic. Another challenge is the lack of support for declarative languages for more flexible graph pattern queries beyond SPARQL. A third challenge is to enable fast and scalable pattern matching with quality guarantees in incomplete and uncertain graphs, with applications in data fusion and provenance of high-value graph data sources. These challenges call for new scalable techniques for graph pattern query evaluation.

## References

- Aditya B, Bhalotia G, Chakrabarti S, Hulgeri A, Nakhe C, Parag P, Sudarshan S (2002) BANKS: browsing and keyword searching in relational databases. In: VLDB
- Armbrust M, Fox A, Patterson D, Lanham N, Trushkowsky B, Trutna J, Oh H (2009) SCADS: scale-independent storage for social computing applications. arXiv preprint arXiv:0909.1775
- Cao Y, Fan W, Huai J, Huang R (2015) Making pattern queries bounded in big graphs. In: ICDE, pp 161–172
- Cordella LP, Foggia P, Sansone C, Vento M (2004) A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans Pattern Anal Mach Intell 26(10):1367–1372
- Fan W, Li J, Ma S, Tang N, Wu Y, Wu Y (2010) Graph pattern matching: from intractable to polynomial time. PVLDB 3(1–2):264–275
- Fan W, Li J, Ma S, Tang N, Wu Y (2011) Adding regular expressions to graph reachability and pattern queries. In: ICDE, pp 39–50
- Fan W, Li J, Wang X, Wu Y (2012a) Query preserving graph compression. In: SIGMOD, pp 157–168
- Fan W, Wang X, Wu Y (2012b) Performance guarantees for distributed reachability queries. PVLDB 5(11):1304–1316
- Fan W, Wang X, Wu Y (2013) Incremental graph pattern matching. TODS 38(3):18:1–18:47
- Fan W, Wang X, Wu Y (2014a) Answering graph pattern queries using views. In: ICDE
- Fan W, Wang X, Wu Y (2014b) Querying big graphs within bounded resources. In: SIGMOD
- Fan W, Wang X, Wu Y, Deng D (2014c) Distributed graph simulation: impossibility and possibility. PVLDB 7(12):1083–1094
- Fan W, Xu J, Wu Y, Yu W, Jiang J, Zheng Z, Zhang B, Cao Y, Tian C (2017) Parallelizing sequential graph computations. In: SIGMOD
- He H, Singh A (2008) Graphs-at-a-time: query language and access methods for graph databases. In: SIGMOD
- He H, Wang H, Yang J, Yu PS (2007) BLINKS: ranked keyword searches on graphs. In: SIGMOD
- Jayaram N, Khan A, Li C, Yan X, Elmasri R (2015) Querying knowledge graphs by example entity tuples. TKDE 27(10):2797–2811
- Kargar M, An A (2011) Keyword search in graphs: finding R-cliques. PVLDB 4(10):681–692
- Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T (2004) PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res 32:83–88
- Khan A, Ranu S (2017) Big-graphs: querying, mining, and beyond. In: Handbook of big data technologies. Springer, Cham, pp 531–582
- Khan A, Li N, Guan Z, Chakraborty S, Tao S (2011) Neighborhood based fast graph search in large networks. In: SIGMOD
- Khan A, Wu Y, Aggarwal C, Yan X (2013) NeMa: fast graph search with label similarity. PVLDB 6(3):181–192
- Lee J, Han WS, Kasperovics R, Lee JH (2012) An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB 6(2):133–144
- Liang Z, Xu M, Teng M, Niu L (2006) NetAlign: a web-based tool for comparison of protein interaction networks. Bioinformatics 22(17):2175–2177
- Ma S, Cao Y, Fan W, Huai J, Wo T (2011) Capturing topology in graph pattern matching. PVLDB 5(4):310–321
- Mottin D, Lissandrini M, Velegrakis Y, Palpanas T (2014) Exemplar queries: give me an example of what you need. PVLDB 7(5):365–376
- Shang H, Zhang Y, Lin X, Yu J (2008) Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB 1(1):364–375
- Singh R, Xu J, Berger B (2008) Global alignment of multiple protein interaction networks with application to functional orthology detection. PNAS 105(35):12763–12768
- Tian Y, Patel JM (2008) TALE: a tool for approximate large graph matching. In: ICDE
- Tian Y, McEachin R, Santos C, States D, Patel J (2006) SAGA: a subgraph matching tool for biological graphs. Bioinformatics 23(2):232–239
- Tong H, Faloutsos C, Gallagher B, Eliassi-Rad T (2007) Fast best-effort pattern matching in large attributed graphs. In: KDD
- Ullmann JR (1976) An algorithm for subgraph isomorphism. J ACM 23:31–42
- Zhang S, Li S, Yang J (2009) GADDI: distance index based subgraph matching in biological networks. In: EDBT
- Zhao P, Han J (2010) On graph query optimization in large networks. In: VLDB
- Zheng W, Cheng H, Zou L, Yu JX, Zhao K (2017) Natural language question/answering: let users talk with the knowledge graph. In: CIKM