Big Graph Analyses: From Queries to Dependencies and Association Rules

This position paper provides an overview of our recent advances in the study of big graphs, from theory to systems to applications. We introduce a theory of bounded evaluability, to query big graphs by accessing a bounded amount of the data. Based on this, we propose a framework to query big graphs with constrained resources. Beyond queries, we propose functional dependencies for graphs, to detect inconsistencies in knowledge bases and catch spam in social networks. As an example application of big graph analyses, we extend association rules from itemsets to graphs for social media marketing. We also identify open problems in connection with querying, cleaning and mining big graphs.


Introduction
The study of graphs has generated renewed interest in the past decade. Graphs make an important source of big data and have found prevalent use in, e.g., social media marketing, knowledge discovery, transportation networks, mobile network analysis, computer vision, the study of adolescent drug use [93], and intelligence analysis for identifying terrorist organizations [55]. In light of these, a large number of algorithms, optimization techniques, graph partition strategies and parallel systems have been developed for graph computations.
Are we done with the study of graphs? Not yet! Real-life graphs introduce new challenges to query evaluation, data cleaning and data mining, among other things. They demand a departure from traditional theory to systems and applications and call for new techniques to query big graphs, improve data quality and identify associations among entities.

Querying Big Graphs
Consider a class $\mathcal{Q}$ of graph queries, such as graph traversal (e.g., depth-first search DFS and breadth-first search BFS), graph connectivity (e.g., strongly connected components), graph pattern matching (via, e.g., graph simulation or subgraph isomorphism) and keyword search. Given a query $Q \in \mathcal{Q}$ and a big graph G, the problem of querying big graphs is to compute the answers Q(G) to Q in G. When G is "big," it is often costly to compute Q(G). Indeed, even DFS takes O(|G|) time, not to mention graph pattern matching via subgraph isomorphism, for which it is NP-complete to decide whether Q(G) is empty, i.e., whether there exists a match of pattern Q in G (cf. [102]). Worse yet, real-life graphs are often of large scale, e.g., Facebook has billions of users and trillions of links, which amount to about 300 PB of data [37].
One might be tempted to think that we could cope with big graphs by means of parallel computing. That is, when G grows big, we add more processors and parallelize the computation of Q(G), to make the computation scale with G. Based on this assumption, several parallel graph query systems have been developed, e.g., Pregel [52], GraphLab [99], GraphX [36], Giraph [34], Giraph++ [104], Blogel [106] and Trinity [105].
However, there exist graph computation problems that are not parallel scalable. That is, for some query classes $\mathcal{Q}$, their parallel running time cannot be substantially reduced no matter how many processors are used. Consider, for example, graph simulation [41], a quadratic-time problem. It has been shown that no parallel algorithms for the problem can scale well with the increase in processors used [26]. This is actually not very surprising. The degree of parallelism is constrained by the depth of a computation, i.e., the longest chain of dependencies among its operations [46]. As a consequence, some graph computation problems are "inherently sequential" [94]. Add to the complication that parallel algorithms nowadays are typically developed over a shared-nothing architecture [58]. For such algorithms, with the increase in processors also come higher communication costs, not to mention skewed graphs, skewed workload, start-up costs and interference when processors compete for, e.g., network bandwidth.
Moreover, even for queries that are parallel scalable, small businesses often have constrained resources such as limited budget and available processors and cannot afford renting thousands of Amazon EC2 instances.
With these observations come the following questions. Is it possible to efficiently compute Q(G) when G is big and Q is expensive, and when we have constrained resources? In other words, can we provide small businesses with the benefit of big graph analysis?
We tackle these questions in this paper. We propose a theory of bounded evaluability, which helps us answer queries in big graphs with constrained resources [13-15, 17, 22, 24]. Based on the theory, we introduce a resource-constrained framework to query big graphs.

Catching Inconsistencies
To make practical use of big data, we have to cope with not only its quantity (volume) but also its quality (veracity). Real-life data are dirty: "more than 25% of critical data in the world's top companies is flawed" [33]. Dirty data are costly. Indeed, "bad data or poor data quality costs US businesses $600 billion annually" [79], "poor data can cost businesses 20-35% of their operating revenue" [67], and "poor data across businesses and the government costs the US economy $3.1 trillion a year" [67].
The quality of real-life graph data is no better.
Example 1 It is common to find inconsistencies in knowledge bases that are being widely used.
(a) DBPedia: Flight A123 has two entries with the same departure time 14:50 and arrival time 22:35, but one entry is from Paris to New York, while the other is from Paris to Singapore [59].
(b) DBPedia: John Brown is claimed to be both a child and a parent of the same person, Owen Brown.
(c) Yago: Soccer player David Beckham is labeled with two birth places, Leytonstone and Old Trafford [21].
(d) MKNF marks that all birds can fly and penguins are birds [43], despite their evolved wing structures.
To build a knowledge base of high quality, effective methods have to be in place to catch inconsistencies in graph-structured data. Indeed, consistency checking is a major challenge to knowledge acquisition and knowledge base enrichment, among other things.
This highlights the need for theory and techniques to improve data quality. To catch semantic inconsistencies, we need data quality rules, which are typically expressed as dependencies. For relational data, a variety of dependencies have been studied, such as conditional functional dependencies (CFDs) [23] and denial constraints [4]. Employing the dependencies, a host of techniques have been developed to detect errors in relational data and repair the data (see [91] for a survey).
When it comes to graphs, however, the study of dependencies is still in its infancy. Even primitive dependencies such as functional dependencies and keys are not yet well studied for graph-structured data. Such dependencies are particularly important for graphs since, unlike relational databases, real-life graphs typically do not come with a schema. Dependencies provide us with one of the few means to specify a fundamental part of the semantics of the data, and help us detect inconsistencies in knowledge bases and catch spam in social networks [29], among other things. However, as will be seen shortly, dependencies for graph-structured data are far more challenging than their relational counterparts.
We introduce a class of graph functional dependencies, referred to as GFDs [29]. GFDs capture both attribute-value dependencies and topological structures of entities and subsume CFDs as a special case. We show that GFDs can be used as data quality rules and are capable of catching inconsistencies commonly found in knowledge bases, as violations of the GFDs. We study the classical problems for reasoning about GFDs, such as their satisfiability, implication and validation problems. We also show that there exist effective algorithms for catching violations of GFDs in large-scale graphs, which are parallel scalable under practical conditions.

Identifying Associations
Association rules have been well studied for discovering regularities between items in relational data and have proven effective in marketing activities such as promotional pricing and product placements [72,107]. They have a traditional form $X \Rightarrow Y$, where X and Y are disjoint itemsets. For example, $\{\text{diaper}, \text{milk}\} \Rightarrow \{\text{beer}\}$ is an association rule indicating that if customers buy diapers and milk, then the chances are that they will also buy beer.
The need for studying associations between entities in graphs is also evident, in emerging applications such as social media marketing. Social media marketing is predicted to trump traditional marketing. Indeed, "90% of customers trust peer recommendations versus 14% who trust advertising" [101], "60% of users said Twitter plays an important role in their shopping" [62], and "the peer influence from one's friends causes more than 50% increases in odds of buying products" [7].
Example 2 Association rules for social graphs are defined on entities in a graph, not on itemsets. As examples, below are association rules taken from [28,89].
(a) If x and x′ are friends living in the same city c, there are at least 3 French restaurants in city c that x and x′ both like, and if x′ went to a newly opened French restaurant y in c, then x may also go to y.
(b) If person x is in a music club, and among the people whom x follows, at least 80% of them like an album y, then it is likely that x will also buy y.
(c) If all the people followed by x buy Nova Plus (a brand of mobile phones), and none of them gives Nova Plus a bad rating, then the chances are that x may also buy Nova Plus.
These rules help us identify potential customers. For example, consider a newly opened French restaurant y. If a person x satisfies the conditions specified in rule (a) above, then restaurant y may opt to send x a coupon, and the chances are that x will become a customer of y. Similarly for rules (b) and (c), which help music album vendors and mobile phone manufacturers find potential customers and advertise their new products.
As opposed to association rules for itemsets, association rules for graphs, referred to as GPARs, involve social groups with multiple entities. GPARs depart from association rules for itemsets and introduce several challenges.
(1) To identify social groups, the rules need to be defined in terms of graph pattern matching, possibly with counting quantifiers (see rules (b) and (c)).
(2) As will be seen later, conventional support and confidence metrics no longer work for GPARs.
(3) It is intractable to discover top-ranked diversified GPARs, and conventional mining algorithms for traditional rules and frequent graph patterns cannot be directly used to discover such rules.
(4) A major application of such rules is to identify potential customers in social graphs. This is costly: graph pattern matching by subgraph isomorphism is intractable. Worse still, real-life social graphs are typically big, as remarked earlier.
We propose a class of GPARs defined in terms of graph patterns [89] and counting quantifiers [28]. These GPARs differ from conventional association rules for itemsets in both syntax and semantics. They are useful in social media marketing, community structure analysis, social recommendation, knowledge extraction and link prediction [100], among other things. We propose topological support and confidence measures for GPARs. We also study the problem of discovering top-k diversified GPARs, and the problem of identifying potential customers with GPARs, establishing their complexity bounds and providing algorithms that are parallel scalable under practical conditions.

Organization
This paper is a progress report of our recent work. The remainder of the paper is organized as follows. We start with basic notations in Sect. 2. We then present a theory of bounded evaluation and a resource-bounded framework for querying big graphs in Sect. 3. We propose GFDs in Sect. 4, from formulation to classical decision problems to their applications. We present association rules for graphs in Sect. 5 and show how the rules help us in social media marketing. Open problems are identified in Sect. 6.
The study of big graphs has raised as many questions as it has answered. We hope that the paper will incite interest in the study of big graphs, and we invite interested colleagues to join forces with us in the study.

Preliminaries
We first review basic notations of graphs and queries that will be used in the rest of the paper.

Graphs
We consider w.l.o.g. directed graphs $G = (V, E, L)$, where (1) V is a finite set of nodes; (2) $E \subseteq V \times V$ is a set of edges, in which $(v, v')$ denotes an edge from node $v$ to $v'$; (3) each node v in V carries a label L(v) taken from an alphabet $\Sigma$ of labels, indicating the content of the node, as found in social networks, knowledge bases and property graphs.
We denote the size of G as $|G| = |V| + |E|$.
We will use two notions of subgraphs. A graph $G' = (V', E', L')$ is a subgraph of G if $V' \subseteq V$, $E' \subseteq E$, and $L'(v) = L(v)$ for each node $v \in V'$. Subgraph $G'$ is said to be induced by $V'$ if $E'$ consists of all the edges in G whose endpoints are both in $V'$.
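The definitions above can be made concrete with a minimal Python sketch. The class name `Graph`, its fields and its methods are illustrative assumptions introduced here, not notation from the paper; the sketch only mirrors the definitions of |G| and of an induced subgraph.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Directed, node-labeled graph G = (V, E, L) (illustrative only)."""
    labels: dict                               # node -> label; the keys form V
    edges: set = field(default_factory=set)    # set of (v, v2) pairs, i.e. E

    def size(self):
        # |G| = |V| + |E|
        return len(self.labels) + len(self.edges)

    def induced_subgraph(self, nodes):
        # Keep the given nodes and every edge of G with both endpoints kept.
        keep = set(nodes) & set(self.labels)
        return Graph(
            labels={v: self.labels[v] for v in keep},
            edges={(a, b) for (a, b) in self.edges if a in keep and b in keep},
        )


g = Graph(
    labels={1: "movie", 2: "actor", 3: "country"},
    edges={(1, 2), (2, 3), (1, 3)},
)
sub = g.induced_subgraph({1, 2})
```

Here `sub` retains only the edge between nodes 1 and 2, since the other two edges have an endpoint outside the chosen node set.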

Graph Pattern Matching
As an example of graph queries, we take graph pattern matching defined in terms of subgraph isomorphism, stated as follows.
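A graph pattern Q is a graph $(V_Q, E_Q, L_Q)$, in which $V_Q$ and $E_Q$ are the sets of pattern nodes and edges, respectively, and each node u in $V_Q$ has a label $L_Q(u)$. A match of Q in G is a subgraph $G_s$ of G that is isomorphic to Q, i.e., there exists a bijective function h from $V_Q$ to the set of nodes of $G_s$ such that (a) for each node $u \in V_Q$, $L_Q(u) = L(h(u))$, and (b) $(u, u')$ is an edge in Q if and only if $(h(u), h(u'))$ is an edge in $G_s$. The answer Q(G) to Q in G is the set of all matches of Q in G. The problem is as follows.
• Input: A graph G and a pattern Q.
• Output: The set Q(G) of all matches of Q in G.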
The graph matching problem is intractable: it is NP-complete to decide whether Q(G) is empty (cf. [102]).
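To illustrate the semantics of matching, here is a deliberately naive Python sketch. The function name `matches` and the dictionary-based graph encoding are assumptions made for this example; it is not one of the algorithms cited in the paper. It enumerates every injective, label-preserving mapping h from pattern nodes to graph nodes that sends each pattern edge to a graph edge; the image of h then defines a matching subgraph. Note that it returns mappings, so two mappings may describe the same subgraph, and that it tries every ordered tuple of graph nodes, which is why it is only usable on toy inputs.

```python
from itertools import permutations


def matches(q_labels, q_edges, g_labels, g_edges):
    """Brute-force enumeration of mappings h witnessing a match of Q in G."""
    q_nodes = list(q_labels)
    found = []
    # Each permutation is an injective assignment of graph nodes to q_nodes.
    for image in permutations(g_labels, len(q_nodes)):
        h = dict(zip(q_nodes, image))
        # (a) labels must agree.
        if any(q_labels[u] != g_labels[h[u]] for u in q_nodes):
            continue
        # (b) every pattern edge must be present between the images.
        if all((h[u], h[v]) in g_edges for (u, v) in q_edges):
            found.append(h)
    return found


q_labels = {"m": "movie", "a": "actor"}
q_edges = {("m", "a")}
g_labels = {1: "movie", 2: "actor", 3: "actor", 4: "movie"}
g_edges = {(1, 2), (1, 3), (4, 3)}
result = matches(q_labels, q_edges, g_labels, g_edges)
```

On this toy graph the pattern has three matches, one per movie-to-actor edge.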

Querying Big Graphs
We start with querying big real-life graphs with constrained resources, in order to provide small businesses with the benefit of big graph analyses. We first present a theory of bounded evaluability in Sect. 3.1. We then propose a resource-constrained framework to cope with the sheer volume of big graphs, based on the theory and approximate query answering in Sect. 3.2. This section is based on results from [15,22,26,27,84].

Bounded Evaluability
Consider graph pattern queries Q defined in terms of subgraph isomorphism. As remarked earlier, such pattern queries are intractable and expensive.
Can we still efficiently compute exact answers Q(G) to pattern queries when a graph G is big and when we have constrained resources such as a single processor?

Bounded Evaluability
We approach this by making big graphs small. The idea is to make use of a set A of access constraints, which are a combination of indices and simple cardinality constraints defined on the labels of neighboring nodes of G. Given a query Q, we check whether Q is boundedly evaluable under A, i.e., whether for all graphs G that satisfy the access constraints of A, there exists a subgraph $G_Q$ of G such that (a) $Q(G_Q) = Q(G)$, and (b) the size $|G_Q|$ of $G_Q$ and the time for identifying $G_Q$ depend on A and Q only, independent of |G|. If so, we can compute Q(G) by accessing $G_Q$ instead of G.
A large number of real-life queries are actually boundedly evaluable under simple access constraints, as illustrated by the example below, taken from [15].
Example 3 Consider IMDb [44], a graph $G_0$ in which nodes represent movies, casts, countries, years and awards from 1880 to 2013, and edges denote various relationships between the nodes. An example query on IMDb is to find pairs of first-billed actor and actress (main characters) from the same country who co-starred in an award-winning film released in 2011-2013.
The query can be represented as a graph pattern $Q_0$ shown in Fig. 1. It is to first find the set $Q_0(G_0)$ of matches, i.e., subgraphs $G'$ of $G_0$ that are isomorphic to $Q_0$; it then extracts and returns actor-actress pairs from each match $G'$. The challenge is that $Q_0(G_0)$ takes exponential time to compute on the IMDb graph, which has 5.1 million nodes and 19.5 million edges.
Not all is lost. Using simple aggregate queries, one can readily find the following real-life cardinality constraints on the movie dataset from 1880-2013: (a) in each year, every award is presented to no more than 4 movies (C1); (b) each movie has at most 30 first-billed actors and actresses (C2), and each person has only one country of origin (C3); and (c) there are no more than 135 years (C4, i.e., 2013-1880), 24 major movie awards (C5) and 196 countries (C6) in IMDb in total [44].
An index can be built on the labels and nodes of $G_0$ for each of these cardinality constraints, yielding a set $A_0$ of 8 access constraints. For instance, given a year and an award, the index for C1 returns at most 4 movies that received the award in that year. Under $A_0$, query $Q_0$ is boundedly evaluable. We can compute $Q_0(G_0)$ by accessing at most 17923 nodes and 35136 edges in $G_0$, no matter how big $G_0$ is, by the following query plan:
(a) we first identify a set $V_1$ of 135 year nodes, 24 award nodes and 196 country nodes, by using the indices built for access constraints C4-C6;
(b) we then fetch a set $V_2$ of at most $24 \times 3 \times 4 = 288$ award-winning movies released between 2011 and 2013, with no more than $288 \times 2 = 576$ edges connecting movies to awards and years, by using those award and year nodes in $V_1$ and the index for C1;
(c) after these, we fetch a set $V_3$ of at most $(30 + 30) \times 288 = 17280$ actors and actresses with 17280 edges, by using the set $V_2$ and the index for C2; and
(d) we connect the actors and actresses in $V_3$ to country nodes in $V_1$, with at most 17280 edges, by using the index for constraint C3.
Finally, we output (actor, actress) pairs connected to the same country in $V_1$.
The query plan visits at most 135 + 24 + 196 + 288 + 17280 = 17923 nodes, and 576 + 17280 + 17280 = 35136 edges, by using the cardinality constraints and indices in $A_0$, as opposed to tens of millions of nodes and edges in IMDb. Moreover, the number of nodes and edges is decided by $Q_0$ and the cardinality bounds in $A_0$; it remains a constant no matter how big IMDb grows.
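The arithmetic behind these bounds can be replayed in a few lines of Python. The constant names below are ours, chosen for readability; the numbers are the cardinality bounds C1-C6 stated in Example 3, and the script only recomputes the worst-case counts of the plan, it does not touch any data.

```python
# Cardinality bounds from constraints C1-C6 of Example 3.
MOVIES_PER_YEAR_AWARD = 4                  # C1
FIRST_BILLED_PER_MOVIE = 30                # C2 (actors; same bound for actresses)
YEARS, AWARDS, COUNTRIES = 135, 24, 196    # C4, C5, C6
YEARS_IN_QUERY = 3                         # 2011, 2012 and 2013

# Step (b): award-winning movies in the queried years, and their edges
# to one award node and one year node each.
movies = AWARDS * YEARS_IN_QUERY * MOVIES_PER_YEAR_AWARD
movie_edges = movies * 2

# Step (c): first-billed actors and actresses of those movies.
people = (FIRST_BILLED_PER_MOVIE + FIRST_BILLED_PER_MOVIE) * movies
cast_edges = people

# Step (d): C3 gives one country per person, hence one edge per person.
country_edges = people

max_nodes = YEARS + AWARDS + COUNTRIES + movies + people
max_edges = movie_edges + cast_edges + country_edges
print(max_nodes, max_edges)  # 17923 35136
```

None of the quantities depends on the size of the IMDb graph, which is the point of the example.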

Bounded Evaluation
We next provide more insight into bounded evaluation of graph pattern queries. We invite the interested reader to consult [15] for details.
Access Schema An access constraint is of the form $S \rightarrow (l, N)$, where $S \subseteq \Sigma$ is a (possibly empty) set of labels, $l$ is a label in $\Sigma$ and N is a natural number. Recall that $\Sigma$ is the alphabet of labels (see Sect. 2).
A graph G(V, E, L) satisfies the access constraint if
• for any S-labeled set $V_S$ of nodes in V, there exist at most N common neighbors of $V_S$ with label $l$; and
• there exists an index on S for $l$ that, for any S-labeled set $V_S$ in G, finds all common neighbors of $V_S$ labeled with $l$ in O(N) time, independent of |G|.
Here $V_S$ is a set in which each node is labeled with a distinct label in S. A node v is a common neighbor of $V_S$ if for each node $v' \in V_S$, either $(v, v')$ or $(v', v)$ is an edge in G. In particular, when $V_S$ is $\emptyset$, all nodes of G are common neighbors of $V_S$.
Intuitively, an access constraint is a combination of (a) a cardinality constraint and (b) an index on the labels of neighboring nodes. It tells us that for any S-labeled node set $V_S$, there exist a bounded number of common neighbors $V_l$ labeled with $l$, and moreover, $V_l$ can be efficiently retrieved with the index. As an illustration, denote by $\varphi_1$-$\varphi_8$ the eight access constraints of $A_0$ in Example 3. Constraint $\varphi_1$ states that for any pair of year and award nodes, there are at most 4 movie nodes connected to both, i.e., an award is given to at most 4 movies each year; similarly for $\varphi_2$-$\varphi_5$. Constraint $\varphi_6$ is simpler. It says that (between 1880 and 2013) there are at most 135 years in the entire graph; note that the set S (and hence the set $V_S$) for $\varphi_6$ is empty; similarly for $\varphi_7$ and $\varphi_8$.
We denote a set A of access constraints as an access schema. We say that G satisfies A, denoted by $G \models A$, if G satisfies all the access constraints in A.
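As a hedged illustration of what an access constraint demands, the sketch below builds a toy in-memory index for one constraint and checks its cardinality bound. The function names `build_index` and `satisfies`, and the encoding of a graph as a label dictionary plus an edge set, are assumptions of this example; a practical index such as the one in [15] would be built offline and would answer lookups in O(N) time, which this sketch does not attempt to guarantee.

```python
from collections import defaultdict
from itertools import product


def build_index(labels, edges, S, l):
    """Map each S-labeled node set (a tuple ordered by label name) to its
    common neighbors labeled l. Sets without such a neighbor get no entry."""
    nbrs = defaultdict(set)
    for a, b in edges:            # neighborhood ignores edge direction
        nbrs[a].add(b)
        nbrs[b].add(a)
    order = sorted(S)
    index = defaultdict(set)
    for v, lab in labels.items():
        if lab != l:
            continue
        # For each label s in S, the neighbors of v carrying that label.
        per_label = [[w for w in nbrs[v] if labels[w] == s] for s in order]
        # v is a common neighbor of every combination of one node per label.
        # When S is empty, product() yields the single empty tuple ().
        for combo in product(*per_label):
            index[combo].add(v)
    return index


def satisfies(labels, edges, S, l, N):
    """Cardinality part of the constraint S -> (l, N)."""
    index = build_index(labels, edges, S, l)
    return all(len(vs) <= N for vs in index.values())


labels = {"y2011": "year", "oscar": "award", "m1": "movie", "m2": "movie"}
edges = {("m1", "y2011"), ("m1", "oscar"), ("m2", "y2011"), ("m2", "oscar")}
idx = build_index(labels, edges, {"year", "award"}, "movie")
```

In the toy graph, the pair (oscar, y2011) has two movie neighbors, so the constraint with N = 4 holds while the same constraint with N = 1 is violated.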
Deciding Bounded Evaluability To make practical use of bounded evaluation, we need to answer the following question, to decide whether a given query is boundedly evaluable under a set of available access constraints.
• Input: A pattern query Q, an access schema A.
• Question: Is Q boundedly evaluable under A?
The question is nontrivial for relational queries. It is decidable but EXPSPACE-hard for SPC queries and is undecidable for queries in the relational algebra [22].
The good news is that for graph pattern queries, the problem is in low polynomial time in the size of Q and A, independent of data graphs G. Indeed, for pattern queries it can be decided in time polynomial in $|V_Q|$, $|E_Q|$, $\|A\|$ and $|A|$ (see [15] for the precise bound), where $|V_Q|$ and $|E_Q|$ are the numbers of nodes and edges in Q, respectively; $\|A\|$ is the number of constraints in A, and $|A|$ is the size of A. In practice, Q and A are much smaller than data graphs G.
With this complexity bound, an algorithm for deciding the bounded evaluability of graph pattern queries is given in [15].It is based on a characterization of bounded evaluability, i.e., a sufficient and necessary condition for deciding whether a pattern query Q is boundedly evaluable under an access schema A.
Generating Bounded Query Plans After a pattern query Q is found boundedly evaluable under an access schema A, we need to generate a "good" query plan for Q that, given any graph $G \models A$, computes Q(G) by accessing a bounded amount of data in G. In a nutshell, a query plan P for Q under A consists of three phases, presented as follows:
(1) Plan P tells us what nodes to retrieve from G. It starts with a sequence of node fetching operations of the form $\mathsf{fetch}(u, V_S, \varphi)$, where u is an $l$-labeled node in Q, $V_S$ denotes an S-labeled set of Q and $\varphi$ is a constraint $\varphi = S \rightarrow (l, N)$ in A. On a graph G, the operation is to retrieve a set V(u) of candidate matches for u from G: given $V_S$ that was retrieved from G earlier, it fetches common neighbors of $V_S$ from G that are labeled with $l$. These nodes are fetched by using the index of $\varphi$ and are stored in V(u). In particular, when $S = \emptyset$, the operation fetches all $l$-labeled nodes in G as V(u) for u. The operations $\mathsf{fetch}_1, \mathsf{fetch}_2, \ldots, \mathsf{fetch}_n$ in P are executed one by one. In $\mathsf{fetch}_i$, its $V_S$ consists of nodes from $V_j$ fetched earlier by $\mathsf{fetch}_j$ for $j < i$.
(2) From the data fetched by P, a subgraph $G_Q$ of G is built. More specifically, (a) its node set $V_P$ consists of candidates V(u) fetched for each pattern node u in Q, and (b) its edges are edges of G between these candidates; identifying such edges in G is also confined to the nodes fetched via access constraints and thus can also be done with bounded data access.
(3) Finally, plan P simply computes $Q(G_Q)$ as Q(G).
We say that P is a bounded query plan for Q if for all graphs $G \models A$, it builds a subgraph $G_Q$ of G such that (a) $Q(G_Q) = Q(G)$, and (b) it accesses G via fetch operations only, and each fetch is controlled by an access constraint $\varphi$ in A. Since P fetches data from G by using the indices in A only, the time for fetching data from G by all operations in P depends on A and Q only. That is, P fetches a bounded amount of data from G and builds a small $G_Q$ from it. As a consequence, $|G_Q|$ is also independent of the size |G| of G.
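The fetching phase of such a plan can be sketched as follows. This is an illustrative reading of phase (1) only, under assumptions of our own: the function name `run_plan`, the triple encoding of a fetch operation, and the representation of each index as a dictionary from tuples of data nodes to sets of common neighbors are all hypothetical, and the plan itself is supplied by hand rather than generated by the algorithm of [15].

```python
from itertools import product


def run_plan(plan, indices):
    """Execute fetch operations in order and return V(u) per pattern node u.

    plan:    list of (u, parents, key); `parents` lists pattern nodes fetched
             earlier (in the order the index expects), `key` names the index
             of the access constraint that controls this fetch.
    indices: indices[key] maps a tuple of data nodes to the set of their
             common neighbors carrying the target label.
    """
    cand = {}
    for u, parents, key in plan:
        fetched = set()
        # Look up every combination of already-fetched candidates; with no
        # parents, product() yields the single empty tuple ().
        for combo in product(*(cand[p] for p in parents)):
            fetched |= indices[key].get(combo, set())
        cand[u] = fetched
    return cand


indices = {
    "years": {(): {"y2011", "y2012"}},
    "awards": {(): {"oscar"}},
    "movies": {("oscar", "y2011"): {"m1"}, ("oscar", "y2012"): {"m2", "m3"}},
}
plan = [
    ("year", [], "years"),
    ("award", [], "awards"),
    ("movie", ["award", "year"], "movies"),
]
cand = run_plan(plan, indices)
```

The graph itself is never scanned: every node in `cand` comes from an index lookup, so the number of lookups is bounded by the products of the candidate-set sizes, which in turn are bounded by the constants N of the constraints used.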
An algorithm is developed in [15] that, given any boundedly evaluable pattern query Q under an access schema A, finds a bounded query plan for Q in $O(|V_Q||E_Q||A|)$ time. As remarked earlier, Q and A are much smaller than data graphs G.
Effectiveness The approach has been verified to be effective using real-life graphs consisting of billions of nodes and edges [15]. We find the following. (1) Under a couple of hundred access constraints, more than 60% of pattern queries are boundedly evaluable. (2) Bounded query plans outperform conventional algorithms such as VF2 [76] by 4 orders of magnitude, and access $G_Q$ such that $|G_Q| = 3.2 \times 10^{-5} \times |G|$ on average, reducing |G| of PB size to 32 GB. (3) It takes at most 37 ms to decide whether a pattern query Q is boundedly evaluable and to generate a bounded query plan for bounded Q.

Related Work
As remarked earlier, the principle behind bounded evaluation is to make big graphs small. There are typically two ways to reduce search space. (1) Graph indexing uses precomputed global information of G to compute distance [75], shortest paths [38] or substructure matching [57]. (2) Graph compression computes a summary $G_c$ of a big graph G and uses $G_c$ to answer all queries posed on G [9,25,54].
In contrast to the prior work, (1) bounded evaluation is based on access schema, which extends traditional indices by incorporating cardinality constraints, such that we can reason about the cardinality constraints and decide whether a query can be answered by accessing a bounded amount of data in advance, before we access the underlying graphs. Moreover, the indices in an access schema are based on labels of neighboring nodes, which are quite different from prior indexing structures. (2) Instead of using one-size-fits-all compressed graphs $G_c$ to answer all queries posed on G, we adopt a dynamic data reduction scheme that finds a subgraph $G_Q$ of G for each query Q. Since $G_Q$ consists of only the information needed for answering Q, it allows us to compute Q(G) by using $G_Q$ that is much smaller than $G_c$ and hence using far fewer resources. (3) When Q is boundedly evaluable, for all graphs G that satisfy A we can find $G_Q$ of size independent of |G|; in contrast, $|G_c|$ may be proportional to |G|.
The theory of bounded evaluation was first studied for relational queries [13,14,17,22,24]. It has proven effective on a variety of real-life datasets. It is shown that on average 77% of SPC queries [17] and 67% of relational algebra queries [13] are boundedly evaluable under a few hundred access constraints. Bounded evaluation outperforms commercial query engines by 3 orders of magnitude, and in fact, the gap gets larger on bigger datasets. The evaluation results from our industry collaborators are even more encouraging. They find that more than 90% of their big-data queries are boundedly evaluable, improving the performance from 25 times to 5 orders of magnitude [16].
The theory is extended from relations to graphs in [15], showing that bounded evaluation is also effective for graph pattern queries defined in terms of subgraph isomorphism and graph simulation.

A Resource-Constrained Framework
As remarked earlier, we can answer about 60% of pattern queries in big graphs by accessing a bounded amount of data no matter how big the graphs grow. Then, what should we do about the queries that are not boundedly evaluable under an access schema? Can we still answer those queries with constrained resources?
To this end, we propose a resource-constrained framework to query big graphs, which can be readily built on top of (parallel) graph query engines.

A Framework to Query Big Graphs
The framework, referred to as RESOURCE (RESOURce-Constrained Engine), aims to answer queries posed on big graphs when we have constrained resources such as limited available processors and time. To measure the constraints on resources, it takes a resource ratio $\alpha \in (0, 1]$ as a parameter, indicating that our available resources allow us to only access an $\alpha$-fraction of a big graph. Employing an access schema A, RESOURCE works as follows. Given a query Q and a graph G that satisfies A,
(1) it first checks whether Q is boundedly evaluable under A, i.e., whether exact answers Q(G) can be computed by accessing a fraction $G_Q \subseteq G$ such that its size $|G_Q|$ is independent of the size |G| of G;
(2) if so, it computes Q(G) by accessing a bounded fraction $G_Q$ of G, by generating a bounded query plan under A as described in Sect. 3.1;
(3) otherwise, it answers Q in G by means of data-driven approximation [27,92], which accesses a small $G_Q \subseteq G$ in the entire process such that $|G_Q| \le \alpha|G|$, possibly by also using access constraints in A.
That is, under the resource constraint specified by $\alpha$, RESOURCE computes exact answers Q(G) whenever bounded evaluation is possible by employing access schema A; otherwise, it returns approximate answers $Q(G_Q)$ within the given budget $\alpha|G|$.
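The control flow of steps (1)-(3) can be summarized by the following sketch. It is a hypothetical outline rather than the RESOURCE implementation: `resource_answer` and its three callables (`is_bounded`, `run_bounded_plan`, `approximate`) are names we introduce as stand-ins for the algorithms of Sect. 3.1 and [27], and the access schema A is assumed to be captured inside those callables.

```python
def resource_answer(query, graph_size, alpha,
                    is_bounded, run_bounded_plan, approximate):
    """Answer `query` exactly when it is boundedly evaluable, and otherwise
    approximately within a budget of alpha * |G| (all callables hypothetical)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    if is_bounded(query):                               # step (1)
        return "exact", run_bounded_plan(query)          # step (2)
    budget = int(alpha * graph_size)                    # at most alpha * |G|
    return "approximate", approximate(query, budget)    # step (3)


# Toy stand-ins: the approximate branch simply reports its budget.
exact = resource_answer("Q1", 1000, 0.25,
                        lambda q: True, lambda q: ["match"], lambda q, b: b)
approx = resource_answer("Q2", 1000, 0.25,
                         lambda q: False, lambda q: ["match"], lambda q, b: b)
```

Keeping the three algorithms behind callables reflects the statement that RESOURCE can be built on top of existing graph query engines: only the dispatch logic is fixed, while the underlying algorithms are supplied per query class.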
We next give more details about RESOURCE.
(1) Resource Ratio $\alpha$ The ratio is decided by available resources and the complexity of the class of queries to be processed. For example, for graph pattern queries (an essentially exponential-time process), one may pick an $\alpha$ smaller than the one for reachability queries (to decide whether there exists a path from one node to another, which is a linear-time problem). Intuitively, it indicates the "resolution" of the data we can afford: the larger $\alpha$ is, the more accurate the query answers are.
(2) Data-driven Approximation For each class $\mathcal{Q}$ of graph queries of users' choice, one can develop a data-driven approximation algorithm. Given a query $Q \in \mathcal{Q}$ posed on a (possibly big) graph G, the approximation algorithm identifies a fraction $G_Q$ such that $|G_Q| \le \alpha|G|$, and computes $Q(G_Q)$ as approximate answer to Q in G. A detailed presentation of the data-driven approximation scheme can be found in [92].
Such a data-driven approximation algorithm has been developed for graph pattern queries for personalized social search [27], as used by Graph Search of Facebook. Experimenting with real-life social graphs, we find that the algorithm easily scales with large-scale graphs: when graphs grow big, we simply decrease $\alpha$ and hence access a smaller amount of data. Better still, the algorithm is accurate: even when the resource ratio $\alpha$ is as small as $15 \times 10^{-6}$, the algorithm returns matches with 100% accuracy. That is, when G consists of 1 PB of data, $\alpha|G|$ is down to 15 GB, i.e., data-driven approximation makes big data small, without paying too high a price of sacrificing the accuracy of query answers.
(3) Algorithms Underlying RESOURCE RESOURCE can be built on top of existing graph query engines provided with the following algorithms:
1. offline algorithms for discovering access constraints from real-life graphs and for maintaining the constraints in response to changes to the graphs; and
2. online algorithms for deciding whether a query is boundedly evaluable under an access schema, generating a bounded query plan for a boundedly evaluable query, and for data-driven approximation.
As remarked earlier, these algorithms are already available for graph pattern queries (Sect. 3.1 and [27]).
The framework can also incorporate other techniques for querying big graphs, by making big graphs small, including but not limited to the following.
(a) Query-driven approximation For an expensive query class $\mathcal{Q}$, we can approximate its queries by adopting a cheaper class $\mathcal{Q}'$ of queries. For instance, for social community detection, one may want to use bounded graph simulation [51,81], which takes cubic time, instead of subgraph isomorphism, for which the decision problem is NP-complete. Another example is to compute top-k diversified answers for queries of $\mathcal{Q}$, instead of computing the entire set Q(G) of answers [86] (see [92] for details of query-driven approximation).
(b) Query preserving graph compression We may compress a big graph G relative to a query class $\mathcal{Q}$ of users' choice [25]. More specifically, a query preserving graph compression for $\mathcal{Q}$ is a pair $\langle R, P \rangle$ of functions, where $R(\cdot)$ is a compression function and $P(\cdot)$ is a post-processing function. For any graph G, $G_c = R(G)$ is the compressed graph computed from G by $R(\cdot)$, such that (i) $|G_c| \le |G|$, and (ii) for all queries $Q \in \mathcal{Q}$, $Q(G) = P(Q(G_c))$. Here $P(Q(G_c))$ is the result of post-processing the answers $Q(G_c)$ to Q in $G_c$. That is, we preprocess G by computing the compressed $G_c$ of G offline. After this step, for any query $Q \in \mathcal{Q}$, the answers Q(G) to Q in the original big G can be computed by evaluating the same query Q on the smaller $G_c$ online, without decompressing $G_c$. The compression scheme may be lossy: we do not need to restore the original graph G from $G_c$. That is, $G_c$ only needs to retain the information necessary for answering queries in $\mathcal{Q}$ and hence can achieve a better compression ratio than lossless compression schemes. The effectiveness of this approach has been verified in [25].
(c) Query answering using views Given a query $Q \in \mathcal{Q}$ and a set $\mathcal{V}$ of view definitions, query answering using views is to reformulate Q into another query $Q'$ such that (i) Q and $Q'$ are equivalent, i.e., for all graphs G, Q and $Q'$ produce the same answers in G, and moreover, (ii) $Q'$ refers only to $\mathcal{V}$ and its extensions (small cached views) $\mathcal{V}(G)$, without accessing the underlying G.
More specifically, given a big graph G, one may identify a set $\mathcal{V}$ of views (pattern queries) and materialize them with $\mathcal{V}(G)$ of matches for patterns of $\mathcal{V}$ in G, as a preprocessing step offline. We compute matches of input queries Q online by using $\mathcal{V}(G)$ only. In practice, $\mathcal{V}(G)$ is typically much smaller than G, and hence, this approach allows us to query big G by accessing small $\mathcal{V}(G)$. Better still, the views can be incrementally maintained offline in response to changes to G and adaptively adjusted to cover various queries [90].
One can further extend the traditional notion of query answering using views, by incorporating bounded evaluation, as studied for relational queries [14].
(d) Parallel query processing RESOURCE can be built on top of a parallel graph query engine and hence combine parallel query processing with bounded evaluation and data-driven approximation. In particular, we promote GRAPE, a parallel GRAPh Engine [30]. It allows us to "plug in" existing sequential graph algorithms, and makes the computations parallel across multiple processors, without drastic degradation in performance or functionality of existing systems.
GRAPE has the following unique feature. The state-of-the-art parallel graph systems require users to recast existing graph algorithms into a new model. While graph computations have been studied for decades and a large number of sophisticated sequential graph algorithms are already in place, to use Pregel, for instance, one has to "think like a vertex" and recast the existing algorithms into Pregel; the same holds when programming with other systems. The recasting is nontrivial for people who are not very familiar with the parallel models. This makes these systems a privilege for experienced users only, just like computers three decades ago that were accessible only to people who knew DOS or Unix.
In contrast, GRAPE supports a simple programming model. For a class $\mathcal{Q}$ of graph queries, users only need to plug in three existing sequential (incremental) algorithms for $\mathcal{Q}$, without the need for recasting the algorithms into a new model. GRAPE automatically parallelizes the computation across processors and inherits all optimization strategies developed for sequential graph algorithms. This makes parallel graph computations accessible to users who know conventional graph algorithms covered in undergraduate textbooks.
Better still, GRAPE is based on a principled approach by combining partial evaluation and incremental computation and can be modeled as fixpoint computation. As shown in [30], it guarantees its parallel processing to terminate with correct answers as long as the sequential algorithms plugged in are correct.
In addition, automated parallelization does not imply performance degradation. Indeed, GRAPE outperforms Giraph [34] (an open-source version of Pregel [52]), GraphLab [99] and Blogel [106] in both response time and communication costs, for a variety of computations such as graph traversal, pattern matching, connectivity and keyword search. We invite the interested reader to consult [30] for the details of GRAPE.
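The combination of partial evaluation, incremental computation and a fixpoint can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the interface of GRAPE (for which see [30]): the graph is assumed to be split into fragments given as edge lists that share border nodes, the query class is connected components with each node labeled by the smallest node id in its component, and the three plugged-in functions are given the hypothetical names partial_eval, inc_eval and assemble. All three are ordinary sequential code; the coordinator loop only exchanges labels of shared nodes until nothing changes.

```python
def propagate(edges, label):
    """Sequential min-label propagation over one fragment until stable."""
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label

def partial_eval(edges):
    """Hypothetical plug-in 1: solve the fragment in isolation."""
    nodes = {n for e in edges for n in e}
    return propagate(edges, {n: n for n in nodes})

def inc_eval(edges, label, updates):
    """Hypothetical plug-in 2: refine a fragment given smaller border labels."""
    label.update(updates)
    return propagate(edges, label)

def assemble(partials):
    """Hypothetical plug-in 3: combine partial results (minimum per node)."""
    result = {}
    for label in partials:
        for n, l in label.items():
            result[n] = min(result.get(n, l), l)
    return result

def run(fragments):
    """Coordinator: iterate to a fixpoint (workers are simulated in turn)."""
    partials = [partial_eval(edges) for edges in fragments]
    while True:
        best = assemble(partials)
        progressed = False
        for edges, label in zip(fragments, partials):
            updates = {n: best[n] for n in label if best[n] < label[n]}
            if updates:
                inc_eval(edges, label, updates)
                progressed = True
        if not progressed:
            return best

# Three fragments sharing the border nodes 3, 4 and 6.
fragments = [[(1, 2), (2, 3)], [(3, 4), (5, 6)], [(6, 7), (4, 8)]]
components = run(fragments)
print(components)
```

Labels only ever decrease and are bounded below, so the loop terminates; at the fixpoint every edge has equal labels at both ends, which is why the assembled result is correct whenever the sequential propagation is.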

Related Work
In addition to bounded evaluation, RESOURCE highlights data-driven approximation. Recall that traditional approximate query answering is often based on synopses such as sampling, sketching, histograms or wavelets (see [18,77] for surveys). It is to compute a synopsis $G'$ of a graph G and use $G'$ to answer all queries posed on G. As opposed to a one-size-fits-all $G'$, data-driven approximation dynamically identifies $G_Q$ for each input query Q and hence achieves a higher accuracy of approximate query answers.
There has also been work on dynamic sampling for answering relational aggregate queries, e.g., [1,5]. Assuming certain information about a query load, e.g., queries, the frequency of columns used in queries, or system logs, the prior work adaptively precomputes samples offline and picks some samples for answering the "predictable queries" online. In contrast, we study graph queries, where sampling is much harder. This is because (a) the graph queries are rather "unpredictable" due to topological constraints embedded in graph queries, and (b) as opposed to homogeneous relational data, there is no one-size-fits-all schema available for data nodes in a graph. We also do not assume the existence of abundant query logs and workloads for sampling strategies. Instead, we develop dynamic reduction techniques to identify and only access promising "areas" that lead to reasonable approximate answers.
Related to the data-driven approximation scheme are also anytime algorithms [108], which allow users either to specify a budget on resources (e.g., running time, known as contract algorithms [70]), or to terminate the run of the algorithms at any time and return intermediate answers as approximate answers (known as interruptible algorithms [39]). Contract anytime algorithms have been explored for (a) budgeted search such as bounded-cost planning [64,66,108] under a user-specified budget and (b) graph search via subgraph isomorphism, to find intermediate answers within the budget, either by assigning dynamically maintained budgets and costs to nodes during the traversal [11], or by deciding search orders based on the frequencies of certain features in queries and graphs [61].
In contrast, RESOURCE (a) computes exact answers whenever bounded evaluation is possible, instead of resorting to heuristics; (b) aims to strike a balance between the cost of finding solutions and the quality of the answers, by dynamic data reduction; and (c) takes a given (arbitrarily small) ratio $\alpha$ as a parameter, accesses promising nodes only and guarantees a bounded search space, by leveraging access schema as much as possible.

Dependencies for Graphs
We now turn to the other side of big graphs, namely the quality of graph-structured data. As remarked earlier, when the data are dirty, query answers computed in the data may not be correct and may even do more harm than good, no matter how efficient and scalable our systems and algorithms are for querying big graphs.
To catch inconsistencies in graphs, we propose a class of functional dependencies for graphs, referred to as GFDs, in Sect. 4.1. We settle the classical problems for reasoning about GFDs in Sect. 4.2. We make use of GFDs to catch errors in real-life graphs in Sect. 4.3.
The main results of this section come from [29,88].

GFDs: Graph Functional Dependencies
We now present GFDs introduced in [29]. GFDs are defined with graph patterns. To simplify the discussion, we extend the notation of Sect. 2 and write a graph pattern as $Q[\bar{x}] = (V_Q, E_Q, L_Q, \mu)$, where $V_Q$ and $E_Q$ are the same as before; $L_Q$ is extended to also associate edges with labels; $\bar{x}$ is a list of distinct variables, one for each node in $V_Q$; and $\mu$ is a bijective mapping from $\bar{x}$ to $V_Q$, i.e., it assigns a distinct variable to each node v in $V_Q$. For $x \in \bar{x}$, we use $\mu(x)$ and x interchangeably when it is clear from the context.
We also allow wildcard '_' as a special label in $L_Q$.

GFDs
A GFD $\varphi$ is a pair $Q[\bar{x}](X \to Y)$, where
• $Q[\bar{x}]$ is a graph pattern, called the pattern of $\varphi$, and
• X and Y are two sets of literals of $\bar{x}$.
Here a literal of $\bar{x}$ has the form of either $x.A = c$ or $x.A = y.B$, where $x, y \in \bar{x}$, A and B denote attributes (not specified in Q) and c is a constant.
Intuitively, GFD $\varphi$ specifies two constraints:
• a topological constraint imposed by pattern Q, and
• an attribute dependency specified by $X \to Y$.
Recall that the "scope" of a relational functional dependency (FD) $R(X \to Y)$ is specified by a relation schema R: the FD is applied only to instances of R. Unlike relational databases, graphs do not have a schema. Here Q specifies the scope of the GFD, such that the dependency $X \to Y$ is imposed only on the attributes of the vertices in each subgraph identified by Q. Constant literals $x.A = c$ enforce bindings of semantically related constants, along the same lines as CFDs [23].
Example 5 To catch the inconsistencies in real-life knowledge bases described in Example 1, we use GFDs defined with patterns $Q_1$-$Q_4$ of Fig. 2 as follows.
(1) Flight GFD $\varphi_1 = Q_1[x, x_1, \ldots, x_5, y, y_1, \ldots, y_5](X_1 \to Y_1)$, in which pattern $Q_1$ specifies two flight entities, where $\mu$ maps x to a flight, $x_1$-$x_5$ to its id, departure city, destination, departure time and arrival time, respectively; similarly for y and $y_1$-$y_5$; in addition, val is an attribute indicating the content of a node (not shown in $Q_1$). In $\varphi_1$, $X_1$ is $x_1.val = y_1.val$, and $Y_1$ consists of $x_2.val = y_2.val$ and $x_3.val = y_3.val$.
Intuitively, GFD $\varphi_1$ states that for all flight entities x and y, if they share the same flight id, then they must have the same departure city and destination.
(2) Parent-child GFD $\varphi_2 = Q_2[x, y](\emptyset \to \text{false})$, where $Q_2$ specifies a pair of persons connected by child and parent relationships. It states that there exists no person entity x who is both a child and a parent of another person entity y. Note that X in $\varphi_2$ is an empty set, i.e., no precondition is imposed on the attributes of $Q_2$, and Y is the Boolean constant false, a syntactic sugar that can be expressed as, e.g., $y.A = c \wedge y.A = d$ for distinct constants c and d, for attribute A of y.
(3) Birth places GFD $\varphi_3 = Q_3[x, y, z](\emptyset \to y.val = z.val)$, where $Q_3$ depicts a person entity with two distinct cities as birth places. Intuitively, $\varphi_3$ is to ensure that for all person entities x, if x has two birth places y and z, then y and z share the same name.
(4) Generic is_a GFD $\varphi_4 = Q_4[x, y](\emptyset \to x.A = y.A)$. It enforces a general property of the is_a relationship: if entity y is_a x, then for any property A of x (represented by attribute A), $x.A = y.A$. In particular, if x is labeled with bird, y with penguin, and A is can_fly, then $\varphi_4$ catches the inconsistency described in Example 1. Observe that x and y in $Q_4$ are labeled with wildcard '_', to match arbitrary generic entities.

Semantics
To interpret GFD $\varphi = Q[\bar{x}](X \to Y)$, we use the following notations. We denote a match of pattern Q in a graph G as a vector $h(\bar{x})$, consisting of h(x) for all $x \in \bar{x}$, in the same order as $\bar{x}$. Here we write $h(\mu(x))$ as h(x), where $\mu$ is the mapping in Q from $\bar{x}$ to nodes in Q. Consider a match $h(\bar{x})$ of Q in G, and a literal $x.A = c$ of $\bar{x}$. We say that $h(\bar{x})$ satisfies the literal if there exists attribute A at the node $v = h(x)$ and $v.A = c$; similarly for literal $x.A = y.B$. We write $h(\bar{x}) \models X$ if $h(\bar{x})$ satisfies all the literals in X; similarly for $h(\bar{x}) \models Y$. We write $h(\bar{x}) \models X \to Y$ if $h(\bar{x}) \models Y$ whenever $h(\bar{x}) \models X$.
We say that graph G satisfies GFD $\varphi = Q[\bar{x}](X \to Y)$, denoted by $G \models \varphi$, if for all matches $h(\bar{x})$ of Q in G, we have that $h(\bar{x}) \models X \to Y$.
To check whether $G \models \varphi$, we need to examine all matches of Q in G. In addition, observe the following.
(1) For a literal
We say that a graph G satisfies a set $\Sigma$ of GFDs if for all $\varphi \in \Sigma$, $G \models \varphi$, i.e., G satisfies every GFD in $\Sigma$.
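The semantics above can be sketched in a few lines. The following is a brute-force illustration only, not the algorithms of [29]: it assumes a tiny in-memory encoding (nodes as id to (label, attributes), edges as labeled triples), enumerates injective mappings to find matches via subgraph isomorphism, and checks $X \to Y$ on each match. The literal encoding, the function names and the sample data are hypothetical; the GFD mirrors the birth-place rule of Example 5(3).

```python
from itertools import permutations

WILDCARD = "_"

def matches(pattern, graph):
    """Enumerate matches h of the pattern by brute-force subgraph isomorphism."""
    p_nodes, p_edges = pattern      # {var: label}, [(var, edge_label, var)]
    g_nodes, g_edges = graph        # {id: (label, attrs)}, {(id, edge_label, id)}
    variables = list(p_nodes)
    for image in permutations(g_nodes, len(variables)):   # injective mappings
        h = dict(zip(variables, image))
        labels_ok = all(p_nodes[x] in (WILDCARD, g_nodes[h[x]][0]) for x in variables)
        edges_ok = all((h[x], lab, h[y]) in g_edges for x, lab, y in p_edges)
        if labels_ok and edges_ok:
            yield h

def holds(literal, h, graph):
    """A literal holds only if the attributes exist and the values agree."""
    g_nodes = graph[0]
    if literal[0] == "const":                  # x.A = c
        _, x, a, c = literal
        attrs = g_nodes[h[x]][1]
        return a in attrs and attrs[a] == c
    _, x, a, y, b = literal                    # x.A = y.B
    ax, ay = g_nodes[h[x]][1], g_nodes[h[y]][1]
    return a in ax and b in ay and ax[a] == ay[b]

def violations(gfd, graph):
    """Matches that satisfy every literal of X but not every literal of Y."""
    pattern, X, Y = gfd
    return [h for h in matches(pattern, graph)
            if all(holds(l, h, graph) for l in X)
            and not all(holds(l, h, graph) for l in Y)]

def satisfies(graph, gfd):
    return not violations(gfd, graph)

# A person with two birth places must see the same name on both.
Q3 = ({"x": "person", "y": "city", "z": "city"},
      [("x", "born_in", "y"), ("x", "born_in", "z")])
phi3 = (Q3, [], [("var", "y", "val", "z", "val")])

nodes = {
    "p1": ("person", {}), "c1": ("city", {"val": "Edinburgh"}),
    "c2": ("city", {"val": "Glasgow"}),
    "p2": ("person", {}), "c3": ("city", {"val": "Paris"}),
    "c4": ("city", {"val": "Paris"}),
}
edges = {("p1", "born_in", "c1"), ("p1", "born_in", "c2"),
         ("p2", "born_in", "c3"), ("p2", "born_in", "c4")}
G = (nodes, edges)

bad = violations(phi3, G)
print(sorted({h["x"] for h in bad}), satisfies(G, phi3))
```

Enumerating all injective mappings is exponential in the pattern size, which is consistent with the remark that every match of Q has to be examined; it is meant for reading, not for large graphs.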
Special Cases GFDs have the following special cases.
(1) As shown in [29], relational FDs and CFDs can be expressed as GFDs when tuples in a relation are represented as nodes in a graph. In fact, GFDs are able to express equality-generating dependencies (EGDs) [71].
(2) GFDs can specify certain type information. For an entity x of type $\tau$, GFD $Q[x](\emptyset \to x.A = x.A)$ enforces that x must have an A attribute, where Q consists of a single vertex labeled $\tau$ and denoted by variable x. However, GFDs cannot enforce that an attribute A of x has a finite domain, e.g., Boolean. In relational databases, finite domains are specified by a relational schema, which is typically not in place for real-life graphs.

Related Work
There has been work on extending relational FDs to graph-structured data, mostly focusing on RDF [2,12,19,40,42,49,69]. This line of work started from [49], by extending relational techniques to RDF. Based on triple patterns with variables, [2,19] define FDs with triple embedding, homomorphism and coincidence of variable valuations. Employing clustered values, [69] defines FDs with conjunctive path patterns; the work is extended to CFDs for RDF [42]. FDs are also defined by mapping relations to RDF [12], using tree patterns in which nodes represent relation attributes.
The class of GFDs differs from the prior work as follows. (a) GFDs are defined for general property graphs, not limited to RDF. (b) GFDs support topological constraints by incorporating (possibly cyclic) graph patterns with variables, as opposed to [12,42,69]. In contrast to [2,19,40,49,69] that take a value-based approach to defining FDs, GFDs are enforced on graph-structured entities identified by graph patterns via subgraph isomorphism. (c) GFDs support bindings of semantically related constants like CFDs [23], as well as forbidding GFDs with false. These allow us to specify data quality rules for consistency checking, but cannot be expressed as the FDs of [2,12,19,42,69]. (d) The validation and implication problems for GFDs have been settled [29], while matching complexity bounds for the FDs previously proposed are yet to be developed.
Related to GFDs is a class of keys defined for RDF [88]. Keys are defined as a graph pattern Q[x], with a designated variable x denoting an entity. Intuitively, it indicates that for any two matches $h_1$ and $h_2$ of Q in a graph G, $h_1(x)$ and $h_2(x)$ refer to the same entity and should be identified. Keys are recursively defined, i.e., Q may include entities other than x to be identified (perhaps with other keys), in order to match entities with a graph structure. Such keys aim to detect duplicate entities and to fuse information from different sources that refers to the same entity, in knowledge fusion and knowledge base expansion; they also find applications in social network reconciliation, to reconcile user accounts across multiple social networks. We invite the interested reader to consult [88] for details.

Reasoning about GFDs
There are two classical problems associated with any class of dependencies, namely the satisfiability and implication problems, which are stated as follows.

Satisfiability
A set $\Sigma$ of GFDs is satisfiable if $\Sigma$ has a model; that is, there exists a graph G such that (a) $G \models \Sigma$, and (b) for each GFD $Q[\bar{x}](X \to Y)$ in $\Sigma$, there exists a match of Q in G. Intuitively, it is to check whether the GFDs are "dirty" themselves when used as data quality rules. A model G of $\Sigma$ requires all patterns in the GFDs of $\Sigma$ to find a match in G, to ensure that the GFDs in $\Sigma$ do not conflict with each other.
The satisfiability problem for GFDs is to determine, given a set $\Sigma$ of GFDs, whether $\Sigma$ is satisfiable.
Over relational data, any set $\Sigma$ of FDs is satisfiable, i.e., there always exists a nonempty relation that satisfies $\Sigma$ [91]. However, a set $\Sigma$ of conditional functional dependencies (CFDs) may not be satisfiable, i.e., there may exist no nonempty relation that satisfies $\Sigma$ [23]. As GFDs subsume CFDs, it is not surprising that a set of GFDs may not be satisfiable, as shown in [29].

Implication
A set $\Sigma$ of GFDs implies another GFD $\varphi$, denoted by $\Sigma \models \varphi$, if for all graphs G, if $G \models \Sigma$ then $G \models \varphi$, i.e., $\varphi$ is a logical consequence of $\Sigma$. In practice, the implication analysis helps us eliminate redundant data quality rules defined as GFDs and hence optimize our error detection process by minimizing rules.
The implication problem for GFDs is to decide, given a set $\Sigma$ of GFDs and another GFD $\varphi$, whether $\Sigma \models \varphi$.

Complexity
These problems have been well studied for relational dependencies. For FDs, the satisfiability problem is in O(1) time (since all FDs are satisfiable) and the implication problem is in linear time (cf. [71]). For CFDs, the satisfiability problem is NP-complete and the implication problem is coNP-complete in the presence of finite-domain attributes, but both are in PTIME when all attributes involved have an infinite domain [23].
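For relational FDs, implication is commonly decided by computing an attribute closure. The sketch below uses the textbook fixpoint loop, which is simpler than (and not as fast as) the linear-time procedure referred to above; the function names and the sample FDs are illustrative assumptions.

```python
def closure(attrs, fds):
    """Attributes determined by `attrs` under FDs given as (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its left-hand side is covered and it adds something.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def implies(fds, lhs, rhs):
    """Decide whether the FDs imply lhs -> rhs."""
    return rhs <= closure(lhs, fds)

fds = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C", "D"}, {"E"})]
print(implies(fds, {"A"}, {"C"}), implies(fds, {"A"}, {"E"}), implies(fds, {"A", "D"}, {"E"}))
```

Each pass either adds an attribute or stops, so the loop runs at most once per attribute plus one final pass; this simple variant is therefore quadratic in the worst case.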
These problems have also been settled for GFDs [29]:
• the satisfiability problem is coNP-complete, and
• the implication problem is NP-complete.
The complexity bounds are rather robust, e.g., the problems remain intractable for GFDs defined with graph patterns that are directed acyclic graphs (DAGs). As shown in [29], the intractability of the satisfiability and implication problems arises from subgraph isomorphism embedded in these problems, which is NP-complete (cf. [102]). The complexity is not inherited from CFDs although GFDs subsume CFDs as a special case. Indeed, the satisfiability analysis of CFDs is NP-hard only under a relational schema that enforces attributes to have a finite domain [23], e.g., Boolean, i.e., the problem is intractable when CFDs and finite domains are put together. In contrast, graphs do not come with a schema; while GFDs subsume CFDs, they cannot specify finite domains. That is, the satisfiability problem for GFDs is already coNP-hard in the absence of a schema; similarly for the implication analysis.
Several tractable special cases of the satisfiability and implication problems for GFDs are identified in [29].
Putting these together, our main conclusion is that while GFDs combine a topological constraint with an attribute dependency and are more complicated than CFDs, reasoning about GFDs is no harder than reasoning about their relational counterparts such as CFDs.

Putting GFDs in Actions
One of the applications of GFDs is to detect inconsistencies in graph-structured data. That is, we use GFDs as data quality rules along the same lines as CFDs and catch violations of the rules by means of the validation analysis of GFDs, which is stated as follows.

Validation Analysis
Given a GFD $\varphi = Q[\bar{x}](X \to Y)$ and a graph G, we say that a match $h(\bar{x})$ of Q in G is a violation of $\varphi$ if $G_h \not\models \varphi$, where $G_h$ is the subgraph induced by $h(\bar{x})$. For a set $\Sigma$ of GFDs, we denote by $\mathrm{Vio}(\Sigma, G)$ the set of all violations of GFDs of $\Sigma$ in G, i.e., $h(\bar{x}) \in \mathrm{Vio}(\Sigma, G)$ if and only if there exists a GFD $\varphi$ in $\Sigma$ such that $h(\bar{x})$ is a violation of $\varphi$ in G. That is, $\mathrm{Vio}(\Sigma, G)$ collects all entities of G that are inconsistent when the set $\Sigma$ of GFDs is used as data quality rules.
The error detection problem is stated as follows:
• Input: A set $\Sigma$ of GFDs and a graph G.
• Output: The set $\mathrm{Vio}(\Sigma, G)$ of violations.
Recall that the error detection problem is in PTIME for relational FDs and CFDs. In fact, when FDs and CFDs are used as data quality rules, errors in relations can be detected by two SQL queries that can be automatically generated from the FDs and CFDs [23].
In contrast, error detection is more challenging in graphs. Indeed, consider the decision version of the problem, referred to as the validation problem for GFDs. It is to decide whether $G \models \Sigma$, i.e., whether $\mathrm{Vio}(\Sigma, G)$ is empty. This problem is coNP-complete [29].

Parallel Scalable Algorithms
The error detection problem is intractable. As remarked earlier, real-life graphs are often of large scale. Then, is error detection feasible in real-life graphs? The answer is affirmative, by using parallel algorithms to compute $\mathrm{Vio}(\Sigma, G)$.
As shown in [29], there exist parallel scalable algorithms for detecting errors in graphs by using GFDs, with the following property. Denote by
• $t(|\Sigma|, |G|)$ the running time of a "best" sequential algorithm to compute $\mathrm{Vio}(\Sigma, G)$, i.e., the least worst-case complexity among all such algorithms; and
• $T(|\Sigma|, |G|, p)$ the time taken by a parallel algorithm to compute $\mathrm{Vio}(\Sigma, G)$ by using p processors.
Then, there exist parallel algorithms $T_p$ such that
$T(|\Sigma|, |G|, p) = c \times \frac{t(|\Sigma|, |G|)}{p}$
under certain practical conditions. Intuitively, $T_p$ guarantees to reduce its running time when p gets larger. That is, the more processors are used, the less time it takes to compute $\mathrm{Vio}(\Sigma, G)$. In other words, it can scale with large-scale graphs despite the complexity, by increasing resources employed when graphs get larger.
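As a purely illustrative calculation of what this property promises, suppose the best sequential algorithm needs $t = 3600$ s on a given input and the constant is $c = 2$; both numbers are assumptions chosen for the arithmetic, not measurements from [29]. Doubling the number of processors then halves the parallel running time:

```latex
T(|\Sigma|,|G|,p) = c \times \frac{t(|\Sigma|,|G|)}{p}, \qquad
T(|\Sigma|,|G|,8) = 2 \times \frac{3600}{8} = 900\ \text{s}, \qquad
T(|\Sigma|,|G|,16) = 2 \times \frac{3600}{16} = 450\ \text{s}.
```

With these assumed values the parallel run only beats the sequential one once $p > c$, i.e., from three processors onward.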

Association Rules for Graphs
Besides the quantity and quality of big graphs, we next consider how to make practical use of big graph analyses in social media marketing, an emerging application. We first introduce a class of primitive graph pattern association rules, referred to as GPARs, in Sect. 5.1. We then explore possible extensions of GPARs, by adding counting quantifiers in Sect. 5.2. To apply GPARs in social media marketing, we finally address how to discover GPARs and how to identify potential customers by using GPARs, in Sect. 5.3.
The results of the section are taken from [28,89].

GPARs: Graph Pattern Association Rules
We start with the GPARs introduced in [89].

GPARs
A graph pattern association rule (GPAR) R(x, y) is defined as $Q(x, y) \Rightarrow q(x, y)$, where Q(x, y) is a graph pattern in which x and y are two designated nodes in Q, and q(x, y) is an edge labeled q from x to y, i.e., a relationship between x and y. We refer to Q and q as the antecedent and consequent of R, respectively. The rule states that for all nodes $v_x$ and $v_y$ in a (social) graph G, if there exists a match $h \in Q(G)$ such that $h(x) = v_x$ and $h(y) = v_y$, i.e., $v_x$ and $v_y$ match the designated nodes x and y in Q, respectively, then the consequent $q(v_x, v_y)$ will likely hold.
Intuitively, $q(v_x, v_y)$ indicates that $v_x$ is a potential customer of $v_y$. Denote by Q(x, G) the set of h(x) for all matches h in Q(G), i.e., the matches of x in G via Q. Then in a social graph G, Q(x, y) identifies potential customers by computing matches Q(x, G).
We model R(x, y) as a graph pattern $P_R$, by extending Q with a (dotted) edge q(x, y). We refer to pattern $P_R$ simply as R when it is clear from the context.
Example 7 Recall association rule (a) described in Example 2. It can be expressed as a GPAR $R_1(x, y)$: $Q_5(x, y) \Rightarrow \text{visit}(x, y)$, as depicted in Fig. 3. Its antecedent is the pattern $Q_5$ (excluding the dotted edge) and its consequent is visit(x, y). As opposed to conventional association rules, the GPAR is specified with a graph pattern $Q_5$ that enforces topological conditions on various entities: associations between customers (the friend relation), customers and restaurants (like, visit), city and restaurants (in), and city and customers (in).
This GPAR helps us identify potential customers for restaurant y. In a social graph G, we find matches of pattern $Q_5$ via subgraph isomorphism; for x and y in each of the matches (subgraphs of G), i.e., for x and y satisfying the antecedent of $R_1$, the chances are that x likes y, and hence, we can recommend y to x.
To simplify the discussion, we define the consequent of a GPAR in terms of a single predicate q(x, y) following [72]. However, a consequent can be readily extended to multiple predicates and even to a graph pattern. We consider nontrivial GPARs by requiring that (a) $P_R$ is connected; (b) Q is nonempty, i.e., it has at least one edge; and (c) q(x, y) does not appear in Q.

Related Work
Introduced in [72], association rules are traditionally defined on relations. Prior work on association rules for social networks [60] and RDF resorts to mining conventional rules and Horn rules (as conjunctive binary predicates) [31] on tuples with extracted attributes from graphs, instead of exploiting graph patterns. While [6] studies time-dependent rules via graph patterns, it focuses on evolving graphs and adopts different semantics for support and confidence.
GPARs extend association rules from relations to graphs. (a) They demand topological support and confidence metrics. (b) GPARs are interpreted with isomorphic functions and hence cannot be expressed as conjunctive queries, which do not support the negation or inequality needed for such functions. (c) Applying GPARs becomes an intractable problem of multi-pattern-query processing in big graphs. (d) Mining (diversified) GPARs is beyond traditional rule mining from itemsets [107].
It should be remarked that conventional association rules [72] and a range of prediction and classification rules [103] can be considered as a special case of GPARs, since their antecedents can be readily modeled as a graph pattern in which nodes represent items.

Adding Counting Quantifiers
In applications such as social media marketing, knowledge discovery and cyber security, more expressive patterns are needed, notably ones with counting quantifiers. In light of this, we extend GPARs with quantified graph patterns, by supporting counting quantifiers [28].

Quantified Graph Patterns
A quantified graph pattern $Q(x_o)$ is defined as $(V_Q, E_Q, L_Q, f)$, where (a) $V_Q$, $E_Q$ and $L_Q$ are the same as in patterns defined in Sect. 2, (b) $x_o$ is a designated node in $V_Q$, referred to as the query focus of Q, and (c) f is a function such that for each edge $e \in E_Q$, f(e) is a predicate of
• a positive form $\sigma(e) \odot p\%$ for a real number $p \in (0, 100]$, or $\sigma(e) \odot p$ for a positive integer p, or
• $\sigma(e) = 0$, where e is referred to as a negated edge.
Here $\odot$ is either $=$ or $\ge$, and $\sigma(e)$ indicates the number of matches of edge e (via subgraph isomorphism with Q; see [28] for detailed semantics of $\sigma(e)$). We refer to f(e) as the counting quantifier of e, and p% and p as ratio and numeric aggregate, respectively.
We leave out f(e) from $Q(x_o)$ if it is $\sigma(e) \ge 1$.
We extend GPARs with quantified graph patterns.
Example 8 Association rules (b) and (c) described in Example 2 are defined with quantified graph patterns. They are depicted in Fig. 3 and illustrated as follows.
For (b), the GPAR is $R_2(x, y)$: $Q_6(x, y) \Rightarrow \text{buy}(x, y)$. Its antecedent is a quantified pattern $Q_6$ (excluding the dotted edge) and its consequent is buy(x, y). Its query focus is x, indicating potential customers. Observe that edge follow(x, x') carries a counting quantifier "$\ge 80\%$". In a social graph G, a node $v_x$ matches x if (i) there exists an isomorphism h from $Q_6$ to a subgraph $G'$ of G such that $h(x) = v_x$, i.e., $G'$ satisfies the topological constraints of $Q_6$, and (ii) among all the people whom $v_x$ follows, at least 80% of them account for matches of x' in $Q_6(G)$, satisfying the counting quantifier.
For (c), the GPAR is $R_3(x, y)$: $Q_7(x, y) \Rightarrow \text{buy}(x, y)$, where the antecedent is again a quantified pattern $Q_7$ (excluding the dotted edge), and its query focus is x. Note that $Q_7$ carries both a universal quantification (= 100%) and a negation (= 0). More specifically, a node $v_x$ in G matches x in $Q_7$ only if (i) for all people x' followed by x, x' buys a Nova Plus, i.e., counting quantifier "= 100%" enforces a universal quantification, and (ii) there exists no node $v_w$ in G such that $\text{follow}(v_x, v_w)$ is an edge in G and there exists an edge from $v_w$ to Nova Plus labeled "bad rating"; that is, counting quantifier "= 0" on edge $\text{follow}(x_o, z_2)$ enforces negation.
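The counting conditions of Example 8 can be checked with simple arithmetic once the relevant neighbors are known. The sketch below is a simplified, star-shaped reading of $Q_6$ and $Q_7$ that is assumed here because Fig. 3 is not reproduced; the precise semantics of the edge counts is given in [28]. The data, the function names and the choice to reject nodes that follow nobody (mirroring condition (i), which requires at least one match) are all illustrative assumptions.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge}

def quantifier_holds(matched, total, op, p, ratio=False):
    """Check `count op p` (numeric) or `count op p%` (ratio) for one pattern edge."""
    if ratio:
        if total == 0:          # assumption: a ratio needs at least one neighbor
            return False
        # Compare matched/total with p/100 without floating point.
        return OPS[op](matched * 100, p * total)
    return OPS[op](matched, p)

# Hypothetical social graph: who follows whom, who bought, who rated badly.
follows = {"ann": {"bob", "cat"}, "dan": {"bob", "eve"}, "fay": {"cat", "eve", "bob"}}
bought = {"bob", "cat"}
bad_rating = {"eve"}

def matches_q6(v):
    """At least 80% of the people v follows bought the product."""
    followees = follows.get(v, set())
    return quantifier_holds(len(followees & bought), len(followees), ">=", 80, ratio=True)

def matches_q7(v):
    """All followees bought (= 100%) and none gave a bad rating (= 0)."""
    followees = follows.get(v, set())
    all_buy = quantifier_holds(len(followees & bought), len(followees), "=", 100, ratio=True)
    none_bad = quantifier_holds(len(followees & bad_rating), len(followees), "=", 0)
    return all_buy and none_bad

print([v for v in sorted(follows) if matches_q6(v)],
      [v for v in sorted(follows) if matches_q7(v)])
```

Comparing `matched * 100` with `p * total` keeps the ratio test exact for integer p, which avoids rounding surprises at thresholds such as 80%.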
As demonstrated by Example 8, counting quantifiers express first-order logic (FO) quantifiers as follows:
• negation when f(e) is $\sigma(e) = 0$ (e.g., $Q_7$);
• universal quantification when f(e) is $\sigma(e) = 100\%$ (e.g., $Q_7$); and
• existential quantification when f(e) is $\sigma(e) \ge 1$.
A conventional graph pattern Q is a special case of quantified patterns when f(e) is $\sigma(e) \ge 1$ for all edges e in Q, i.e., it carries existential quantification only.
We call a quantified pattern Q positive if it contains no negated edges, and negative otherwise. For example, in the quantified patterns shown in Fig. 3, $Q_5$ and $Q_6$ are positive, while $Q_7$ is negative.
Restrictions To strike a balance between the expressive power and complexity, we assume a predefined constant l such that on any simple path (i.e., a path that contains no cycle) in $Q(x_o)$, (a) there exist at most l quantifiers that are not existential, and (b) there exists no more than one negated edge, i.e., we exclude "double negation" from quantified patterns.
The reason for imposing the restriction is twofold. (1) Without the restriction, quantified patterns can express first-order logic (FO) on graphs. Such patterns inherit the complexity of FO, in addition to #P complication. Then, even the problem of deciding whether there exists a graph that matches such a pattern is beyond reach in practice. As will be seen shortly, the restriction makes discovery and applications of quantified patterns feasible in large-scale graphs. (2) Moreover, we find that quantified patterns with the restriction suffice to express graph patterns commonly needed in real-life applications, with small l. Indeed, empirical study suggests that l is at most 2, and "double negation" is rare, since "99% of real-world queries are star-like" [32].
One can extend f(e) in $Q(x_o)$ to support other built-in predicates $>$, $\neq$ and $\le$ as $\odot$, and conjunctions of predicates. To simplify the discussion, we focus on the simple form of quantified patterns given above.

Quantified pattern matching
We revise the statement of the graph pattern matching problem given in Sect. 2 for quantified patterns as follows.
• Input: A quantified pattern $Q(x_o)$ and a graph G.
• Output: The set $Q(x_o, G)$ of $h(x_o)$ for all h in Q(G), i.e., all matches of query focus $x_o$ of Q in G.
Its decision problem, referred to as the quantified matching problem, is stated as follows.
• Input: A quantified graph pattern $Q(x_o)$, a graph G and a node v in G.
• Question: Is $v \in Q(x_o, G)$?
When $Q(x_o)$ is a conventional graph pattern, the problem is NP-complete. When it comes to quantified patterns, however, ratio aggregates $\sigma(e) \odot p\%$ and negation $\sigma(e) = 0$ increase the expressive power and make the analysis more intriguing. It has been shown [28] that the increased expressive power does come with a price; however, the complexity bound of the quantified matching problem does not get much higher. More specifically, the quantified matching problem is
• DP-complete for general quantified patterns and
• NP-complete for positive quantified patterns.
Here DP is a complexity class above NP (unless P = NP), denoting the class of languages recognized by oracle machines that make a call to an NP oracle and a call to a coNP oracle. That is, a language L is in DP if there exist $L_1 \in \text{NP}$ and $L_2 \in \text{coNP}$ such that $L = L_1 \cap L_2$ (see [102] for details about DP).

Related Work
Over relational data, quantified association rules [63] and ratio rules [48] impose value ranges or ratios (e.g., the aggregated ratio of two attribute values) as constraints on attribute values. Similarly, mining quantitative correlated patterns [47] has been studied, with value ranges imposed on correlated attribute values, rather than on matches. GPARs with quantified patterns extend quantified and ratio association rules from relations to graph-structured data.
The need for counting in graph queries has long been recognized. To this end, SPARQLog [97] extends SPARQL with FO rules, including existential and universal quantification over node variables. Rules for social recommendation are studied in [98], using support count as constraints. QGRAPH [74] annotates nodes and edges with a counting range (count 0 as negated edge) to specify the number of matches that must exist in a database. Set regular path queries (SRPQ) [50] extend regular path queries with quantification for group selection, to restrict the nodes in one set connected to the nodes of another. For social networks, SocialScope [3] and SNQL [53] define algebraic languages with numeric aggregates on node and edge sets.
We define quantified patterns to strike a balance between their expressive power and complexity. They differ from the prior work in the following. (1) Using a uniform form of counting quantifiers, quantified patterns support numeric and ratio aggregates (e.g., at least p friends and 80% of friends), and universal (100%) and existential quantification ($\ge 1$). In contrast, previous proposals do not allow at least one of these. (2) We focus on graph pattern queries, beyond set regular expressions [50] and rules of [98]. (3) We show that quantified matching is DP-complete at worst, slightly higher than conventional matching (NP-complete) in the polynomial hierarchy [102]. In contrast, SPARQL and SPARQLog are PSPACE-hard [97], and SRPQ takes EXPTIME [50]; while the complexity bounds for QGRAPH [74], SocialScope [3] and SNQL [53] are unknown, they are either more expensive than quantified patterns (e.g., QGRAPH is a fragment of FO(count)) or cannot express numeric and ratio quantifiers [3,53].

Discovering and Applying GPARs
To make practical use of GPARs, we next consider two problems, namely GPAR discovery and application of GPARs for identifying potential customers. Below we focus on GPARs studied in [89] (Sect. 5.1) in the absence of counting quantifiers, unless stated otherwise.

Discovering GPARs
To discover nontrivial and interesting GPARs, we first present their topological support and confidence, which are a departure from their conventional counterparts over relations.
Support The support of a pattern Q in a graph G, denoted by $\mathrm{supp}(Q, G)$, indicates how often Q is applicable. As for association rules over itemsets, the support measure should be anti-monotonic, i.e., for patterns Q and Q', if $Q' \sqsubseteq Q$ (in terms of containment), then in any graph G, $\mathrm{supp}(Q', G) \geq \mathrm{supp}(Q, G)$.
One may want to define $\mathrm{supp}(Q, G)$ as the number $|Q(G)|$ of matches of Q in G, following its counterpart for itemsets [107]. However, as observed in [10,80,96], this conventional notion is not anti-monotonic. For example, consider pattern Q' with a single node labeled person, and Q with a single edge child(person, person). When posed on a real-life graph G, one may find that $\mathrm{supp}(Q', G) < \mathrm{supp}(Q, G)$ although $Q' \sqsubseteq Q$, as a person may have multiple children.
We define the support of pattern $Q(x_o)$ in G as $\mathrm{supp}(Q, G) = |Q(x_o, G)|$, i.e., the number of distinct matches of the designated node $x_o$ in Q(G). One can verify that this support measure is anti-monotonic.
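To make the contrast concrete, the following minimal sketch compares the two support notions on a toy graph. The data and the helper names (`match_count`, `designated_support`) are hypothetical and serve illustration only; matches are listed by hand rather than computed by a pattern matcher.

```python
# Toy graph (hypothetical data): child(parent, child) edges between persons.
persons = {"ann", "eve", "bob", "cat", "dan"}
child_edges = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("eve", "bob"), ("eve", "cat"), ("eve", "dan"),
]

# Each match is a tuple whose first component is the image of the
# designated node x_o.
matches_q_prime = [(p,) for p in sorted(persons)]  # Q': a single person node
matches_q = list(child_edges)                      # Q: a single edge child(x_o, y)


def match_count(matches):
    """Conventional notion: the number of matches |Q(G)|."""
    return len(matches)


def designated_support(matches):
    """Topological notion: the number of distinct images of x_o, |Q(x_o, G)|."""
    return len({m[0] for m in matches})


# Counting matches: Q' has fewer matches than Q although Q' is contained in Q.
count_q_prime = match_count(matches_q_prime)
count_q = match_count(matches_q)

# Counting distinct designated nodes: Q' has at least the support of Q.
supp_q_prime = designated_support(matches_q_prime)
supp_q = designated_support(matches_q)
```

On this graph the match count grows from Q' to Q because each parent has several children, while the designated-node support can only shrink when the pattern is extended.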
Confidence To find how likely q(x, y) holds when x and y satisfy the constraints of Q(x, y), we study the confidence of R(x, y) in graph G, denoted as $\mathrm{conf}(R, G)$. We follow the local closed world assumption (LCWA) [20], assuming that G is locally complete, i.e., either G includes the complete neighbors of a node for any edge type, or it has no information about these neighbors.
We define

$$\mathrm{conf}(R, G) = \frac{|R(x, G)|}{|Q(x, G) \cap X_o|},$$

where $X_o$ is the set of candidates of x that are associated with an edge labeled q. Intuitively, $X_o$ retains ''true'' negative examples under LCWA, i.e., those that have the required q relationship of x but are not a match of x.
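As an illustration, the confidence measure can be computed from three sets of nodes. This is a minimal sketch: the function name `confidence`, the toy node sets and the choice to return 0.0 for an empty denominator are our assumptions, not part of the definition.

```python
def confidence(r_matches, q_matches, has_q_edge):
    """conf(R, G) = |R(x, G)| / |Q(x, G) & X_o|.

    r_matches:  matches of x for the whole rule R (antecedent and consequent).
    q_matches:  matches of x for the antecedent pattern Q only.
    has_q_edge: X_o, the nodes with at least one edge labeled q.
    """
    denominator = q_matches & has_q_edge
    if not denominator:
        return 0.0  # assumption: no evidence under LCWA is treated as confidence 0
    return len(r_matches) / len(denominator)


# Hypothetical example: "d" matches Q but has no q-labeled edge at all, so
# under LCWA it is unknown rather than a negative example and is excluded.
q_matches = {"a", "b", "c", "d"}
r_matches = {"a", "b"}
has_q_edge = {"a", "b", "c", "z"}

conf_value = confidence(r_matches, q_matches, has_q_edge)
```

Here "c" is the only true negative example: it matches Q and has a q-labeled edge, but not to a node that makes it a match of R.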
These support and confidence measures apply to GPARs with or without counting (see [28,89]).
The Diversified Mining Problem We want to find GPARs for a particular event q(x, y). However, this often generates an excessive number of rules, many of which pertain to the same or similar people [68,73]. This suggests that we study a diversified mining problem, to discover GPARs that are both interesting and diverse.
To formalize the problem, we first define a function $\mathrm{diff}(\cdot, \cdot)$ to measure the difference of GPARs. Given two GPARs $R_1$ and $R_2$,

$$\mathrm{diff}(R_1, R_2) = 1 - \frac{|R_1(x, G) \cap R_2(x, G)|}{|R_1(x, G) \cup R_2(x, G)|}.$$

It measures the difference between GPARs in terms of the Jaccard distance of their match sets, by treating $R_1$ and $R_2$ as graph patterns. Such diversification has been adopted by recommender systems to avoid overconcentration and reduce overly ''homogeneous'' items [73]. Given a set $L_k$ of k GPARs that pertain to the same predicate q(x, y), where k is a given natural number, we define the objective function $F(L_k)$ by following the practice of recommender systems [35,78], combining the confidence of the rules in $L_k$ with their pairwise differences $\mathrm{diff}(\cdot, \cdot)$. This is known as max-sum diversification and aims to strike a balance between the interestingness and diversity of the rules with a parameter $\lambda$ controlled by users.
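The Jaccard-based difference and a max-sum objective can be sketched as follows. The difference function follows the description above; the particular weighting in `objective` (a trade-off parameter `lam` in [0, 1] and a 2/(k-1) scaling of the pairwise term) is one common max-sum form that we assume for illustration, and it omits any normalization of confidence.

```python
from itertools import combinations


def diff(matches_1, matches_2):
    """Jaccard distance between the match sets of two GPARs."""
    union = matches_1 | matches_2
    if not union:
        return 0.0  # assumption: two rules without matches are not different
    return 1.0 - len(matches_1 & matches_2) / len(union)


def objective(rules, lam):
    """Assumed max-sum objective over a list of (confidence, match_set) pairs.

    The first term rewards confidence, the second rewards pairwise difference;
    lam shifts the weight from the former to the latter.
    """
    k = len(rules)
    relevance = sum(conf for conf, _ in rules)
    diversity = sum(diff(m1, m2) for (_, m1), (_, m2) in combinations(rules, 2))
    scale = 2.0 / (k - 1) if k > 1 else 0.0
    return (1.0 - lam) * relevance + lam * scale * diversity


# Hypothetical rules: two GPARs with disjoint match sets.
rules = [(0.8, {1, 2}), (0.6, {3, 4})]
score = objective(rules, lam=0.5)
```

With `lam = 0` the objective ranks rule sets by confidence alone, and with `lam = 1` by diversity alone.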
Based on the objective function, the diversified GPAR mining problem is stated as follows.
• Input: A graph G, a predicate q(x, y), a support bound r and positive integers k and d.
• Output: A set $L_k$ of k GPARs pertaining to q(x, y) such that (a) $F(L_k)$ is maximized, and (b) for each GPAR $R \in L_k$, $\mathrm{supp}(R, G) \geq r$ and $r(P_R, x) \leq d$.
Here $r(P_R, x)$ denotes the radius of $P_R$ (i.e., R) at x, i.e., the longest distance from the designated node x to any node in $P_R$ when $P_R$ is treated as an undirected graph. This is a bi-criteria optimization problem. It aims to discover GPARs for a particular event q(x, y) with high support, bounded radius, and a balanced confidence and diversity. In practice, users can freely specify q(x, y) of interest. Proper parameters (e.g., support, confidence, diversity) can be estimated from query logs or be recommended by domain experts.
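The radius bound can be checked with a breadth-first search from the designated node. The sketch below assumes a connected pattern given as a list of directed edges; the function name `radius` and the example pattern are hypothetical.

```python
from collections import deque


def radius(edges, x):
    """Longest shortest-path distance from x, ignoring edge directions.

    Assumes the pattern is connected; nodes unreachable from x are ignored.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)  # treat the pattern as undirected

    distance = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for w in adjacency.get(u, ()):
            if w not in distance:
                distance[w] = distance[u] + 1
                queue.append(w)
    return max(distance.values())


# Hypothetical pattern: x -> a -> b and c -> x.
pattern_edges = [("x", "a"), ("a", "b"), ("c", "x")]
pattern_radius = radius(pattern_edges, "x")
```

A candidate GPAR would then be kept only if this value does not exceed the bound d.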
The problem is nontrivial. It is not surprising that its decision problem is intractable, since max-sum diversification is intractable itself [35]. Nonetheless, a parallel algorithm is developed in [89] that is able to find a set $L_k$ of top-k diversified GPARs such that $L_k$ has approximation ratio 2, and moreover, it is parallel scalable with the increase in processors under practical conditions. That is, while the problem is intractable, it is feasible to find useful GPARs in real-life graphs by leveraging parallel computing, provided that we can employ more processors when the graphs grow big.
It remains open whether there exist parallel scalable algorithms for discovering diversified top-k GPARs defined with quantified patterns (Sect. 5.2).

Identifying Potential Customers
We want to use GPARs to identify entities of interest that match certain behavior patterns specified by (quantified) patterns. We formalize this problem as follows [28,89].
Consider a set R of GPARs that pertain to the same predicate q(x, y), i.e., their consequents are the same event q(x, y). We define the set of entities identified by R in a (social) graph G with confidence g as follows:

$$R(x, G, g) = \{ v_x \mid v_x \in Q(x, G),\ Q(x, y) \Rightarrow q(x, y) \in R,\ \mathrm{conf}(R, G) \geq g \}.$$

We study the entity identification problem:
• Input: A set R of GPARs pertaining to the same q(x, y), a confidence bound $g > 0$, and a graph G.
• Output: The set $R(x, G, g)$ of entities.
Intuitively, it can be used to find potential customers x of y in a social graph G that are identified by at least one GPAR in R, with confidence of at least g.
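Once the matches and the confidence of each rule are known, the set of identified entities is a filtered union. The sketch below assumes both have been precomputed; the dictionary layout and the name `identify_entities` are hypothetical, and the costly part (pattern matching) is not shown.

```python
def identify_entities(rules, g):
    """Union of the antecedent matches Q(x, G) over all rules with conf >= g.

    rules: list of dicts with keys "q_matches" (a set of nodes) and "conf".
    """
    identified = set()
    for rule in rules:
        if rule["conf"] >= g:
            identified |= rule["q_matches"]
    return identified


# Hypothetical rule set pertaining to the same predicate q(x, y).
rules = [
    {"q_matches": {"u1", "u2"}, "conf": 0.9},
    {"q_matches": {"u2", "u3"}, "conf": 0.7},
    {"q_matches": {"u4"}, "conf": 0.2},
]
customers = identify_entities(rules, g=0.5)
```

An entity is returned as soon as one sufficiently confident rule identifies it, which mirrors the "at least one GPAR" reading above.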
The problem is also nontrivial. Its decision problem is to determine, given R, G and g, whether $R(x, G, g) \neq \emptyset$. It is NP-hard for GPARs without counting quantifiers [89], and DP-hard when counting quantifiers are present [28]. Nonetheless, parallel algorithms are already in place for entity identification, which are parallel scalable under practical conditions, no matter whether the GPARs carry counting quantifiers or not. That is, these algorithms guarantee a reduction in parallel running time when more processors are employed. In other words, it is feasible to identify potential customers in real-life social graphs by employing GPARs.

Conclusion
We have reported an account of our recent work in connection with big graph analyses. The area of big graphs is, however, a rich source of questions and vitality. Much more work needs to be done, and many questions remain to be answered. Below we list some of the topics for future work, which deserve a full treatment.

Querying Big Graphs
We start with two questions associated with RESOURCE. We then address a general question about the effectiveness of parallel computing.

Discovering Access Schema
As we have seen in Sect. 3.1, bounded evaluation allows us to answer a large number of real-life queries by accessing a bounded amount of data no matter how big graphs grow. The key idea is to decide the bounded evaluability of an input query Q by reasoning about an access schema A, so as to access only the part of the data needed for answering Q by employing the indices in A. Now the question is how we can discover an ''effective'' access schema A from real-life graphs for answering queries in a given application.
The discovery problem is a bi-criteria optimization problem. On one hand, we want to find an access schema A such that A ''covers'' as many queries of the application as possible. On the other hand, we want to reduce the cost of A and make the indices in A as small as possible. The aim is to strike a balance between the effectiveness of A and its cost. The solution also depends on whether the query load is known in advance or not.

Accuracy Guarantee
As remarked in Sect. 3.2, to answer queries that are not boundedly evaluable under A, RESOURCE employs data-driven approximation. This gives rise to another question. For a class $\mathcal{Q}$ of graph queries and a resource ratio a, do there exist a data-driven approximation algorithm T and an accuracy bound g, such that given any query $Q \in \mathcal{Q}$ and graph G, the approximate answers $Q(G_Q)$ computed by T under a are guaranteed to have accuracy at least g? That is, up to g, (a) each approximate answer in $Q(G_Q)$ is close enough to an exact answer in Q(G), i.e., it is a sensible answer in users' interest, and conversely, (b) for each exact answer in Q(G), there exists an approximate answer in $Q(G_Q)$ that is close enough, i.e., $Q(G_Q)$ ''covers'' all exact answers in Q(G). One naturally wants to find an approximation scheme T that maximizes the accuracy ratio g subject to the resource budget given by a.
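One way to read conditions (a) and (b) is as a two-sided closeness requirement between the approximate and the exact answer sets. The sketch below is an assumption on our part, not a definition from the work surveyed: it takes a hypothetical `closeness` function with values in [0, 1] and returns the largest bound for which both conditions hold.

```python
def accuracy(approx, exact, closeness):
    """Largest bound g such that (a) and (b) both hold for the given answers.

    (a) every approximate answer is within closeness g of some exact answer;
    (b) every exact answer is within closeness g of some approximate answer.
    """
    if not approx and not exact:
        return 1.0  # assumption: nothing to miss and nothing wrongly returned
    if not approx or not exact:
        return 0.0
    relevance = min(max(closeness(s, t) for t in exact) for s in approx)
    coverage = min(max(closeness(s, t) for s in approx) for t in exact)
    return min(relevance, coverage)


def numeric_closeness(s, t):
    """Hypothetical closeness for numeric answers: 1 when equal, decaying with distance."""
    return 1.0 / (1.0 + abs(s - t))


# Every approximate answer is exact, but the exact answer 4 is only
# approximated by 5, which limits the coverage side.
acc = accuracy([1, 5], [1, 4, 5], numeric_closeness)
```

In this example condition (a) holds with closeness 1 while condition (b) holds only with closeness 0.5, so the overall accuracy is 0.5.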

Parallel Scalability
As remarked in Sect. 1, not all parallel algorithms have the property that the more processors (resources) are used, the faster their computations get. Worse still, there are graph query classes for which there exists no parallel algorithm that has this property. A natural question is then how to characterize the effectiveness of parallel algorithms. In other words, we want to assess a parallel algorithm by evaluating its scalability with the increase in resources used.
Several models have been proposed for this purpose, e.g., [29,45,56,65,85]. However, the study of this issue is still in its infancy. A characterization remains to be developed for general shared-nothing systems beyond MapReduce, to be widely accepted in practice.

Cleaning Big Graphs
Querying big graphs is hard, and cleaning big graphs is even harder.

Discovering GFDs
To use GFDs to detect inconsistencies in real-life graphs, effective algorithms have to be in place to discover nontrivial and interesting GFDs from real-life graphs. GFD discovery is much harder than discovery of relational FDs (e.g., [95]) and CFDs (e.g., [82]), since GFDs are a combination of topological constraints and attribute dependencies. Among other things, the validation analysis of GFDs discovered is NP-complete, compared to low PTIME for its FD and CFD counterparts (Sect. 4.3). It is also more challenging than graph pattern mining since it has to deal with disconnected patterns (see $u_1$ of Example 5) and forbidding GFDs that do not expect to find matches in any consistent graphs (e.g., $u_3$), not to mention their intractable satisfiability and implication analyses.

Repairing Graph-structured Data
After we detect errors in a graph, we need effective methods to fix the errors, known as data repairing [4]. Repairing big data is much harder than error detection and introduces a variety of challenges (see [87] for a survey). Even when only relational FDs are involved, the data complexity of the data repairing problem is already intractable [8], i.e., the problem is NP-hard even when we only use fixed FDs. It is even more challenging when certain fixes have to be computed [83], i.e., fixes that are guaranteed 100% correct and accurate, to repair ''critical data'' such as a knowledge base for medical data.

Big Graph Mining
As we have seen in Sect. 5, GPARs are catching up in practice when social media marketing is predicted to trump traditional marketing. However, an immediate topic is to develop effective algorithms for discovering GPARs with quantified patterns (counting quantifiers). As remarked earlier, quantified pattern matching is DP-complete for patterns with possibly negated edges, and real-life graphs are often big. It is not yet known whether parallel scalability is within reach for discovering general GPARs, although the problem has been settled in the positive for GPARs without counting quantifiers [89].
Another question concerns how to determine parameters in the diversified GPAR mining problem, namely support bound r and radius bound d (Sect. 5.3). To make practical use of GPARs in social media marketing, we need to identify the ''right'' thresholds that yield interesting GPARs. Similarly, we need to determine the ''right'' threshold for confidence bound g in the entity identification problem for real-life applications.

(3) As shown by $u_2$ of Example 5, we can express ''forbidding'' GFDs of the form $Q[\bar{x}](X \to \mathit{false})$, where X is satisfiable. A forbidding GFD states that there exists no nonempty graph G such that Q can find a match $h(\bar{x})$ in G and $h(\bar{x}) \models X$. That is, Q and X put together specify an inconsistent combination. (4) As indicated by $u_4$ of Example 5, GFDs can express the generic is_a relationship. Along the same lines, GFDs can enforce the inheritance relationship subclass as well.

Fig. 3 Graph pattern association rules
$Q(G)$, and (b) the size $|G_Q|$ of $G_Q$ and the time for identifying $G_Q$ are determined by A and Q only, independent of |G|.
Intuitively, if a match $h(\bar{x})$ of pattern Q in G violates the attribute dependency $X \to Y$, i.e., $h(\bar{x}) \models X$ but $h(\bar{x}) \not\models Y$, then the subgraph induced by $h(\bar{x})$ is inconsistent, i.e., its entities have inconsistencies.
Example 6 Recall the inconsistencies about Flight A123 in DBPedia from Example 1 and GFD $u_1$ from Example 5. Then, there exists a match $h(\bar{x})$ of the pattern $Q_1$ of $u_1$ in the graph depicting DBPedia, such that h(x) and h(y) have the same id, i.e., $h(\bar{x}) \models X_1$; however, $h(\bar{x}) \not\models Y_1$, a violation of $u_1$. That is, $u_1$ catches the inconsistencies of the flight in DBPedia. Similarly, we can apply $u_2$-$u_4$ of Example 5 as data quality rules to knowledge bases and catch the other inconsistencies described in Example 1.
has no attribute A, then $h(\bar{x})$ trivially satisfies $X \to Y$. That is, node h(x) (3) When X is $\emptyset$, $h(\bar{x}) \models X$ for any match $h(\bar{x})$ of Q in G. That is, empty X indicates Boolean constant true. (4) When $Y = \emptyset$, it indicates that Y is constantly true, and u becomes trivial. When Y is false and $X = \emptyset$, $G \not\models u$ if there exists a match of Q; i.e., u states that Q is an ''illegal'' pattern that should not find any matches.