Fact Checking in Knowledge Graphs with Ontological Subgraph Patterns

Given a knowledge graph and a fact (a triple statement), fact checking is to decide whether the fact belongs to the missing part of the graph. Facts in real-world knowledge bases are typically interpreted by both topological and semantic context that is not fully exploited by existing methods. This paper introduces a novel fact checking method that explicitly exploits discriminant subgraph structures. Our method discovers discriminant subgraphs associated with a set of training facts, characterized by a class of graph fact checking rules. These rules incorporate expressive subgraph patterns to jointly describe both topological and ontological constraints. (1) We extend graph fact checking rules (GFCs) to a class of ontological graph fact checking rules (OGFCs). OGFCs generalize GFCs by incorporating both topological constraints and ontological closeness to best distinguish between true and false fact statements.
We provide quality measures to characterize useful patterns that are both discriminant and diversified. (2) Despite the increased expressiveness, we show that it is feasible to discover OGFCs in large graphs with ontologies, by developing a supervised pattern discovery algorithm. To find useful OGFCs as early as possible, it generates subgraph patterns relevant to training facts and dynamically selects patterns from a pattern stream with a small update cost per pattern. We verify that OGFCs can be used as rules and provide useful features for other statistical learning-based fact checking models. Using real-world knowledge bases, we experimentally verify the efficiency and the effectiveness of OGFC-based techniques for fact checking.


Introduction
Knowledge graphs have been utilized to support emerging applications, for example, Web search [8], recommendation [33], and decision making [17]. Real-life knowledge bases often contain two components: (1) a knowledge graph G that consists of a set of facts, where each fact is a triple statement 〈v_x, r, v_y〉 that contains a subject entity v_x, an object entity v_y, and a predicate r that encodes the relationship between v_x and v_y; and (2) an external ontology O [7,35,46] to support organizing meta-data such as types and labels. An ontology is typically a graph that contains a set of concepts and their relationships in terms of semantic closeness, such as equivalence, hyponym, and descriptive relations [2,46,48]. Among the cornerstones of knowledge base management is the task of fact checking. Given a knowledge graph G and a fact t, it is to decide whether t belongs to the missing part of G. The verified facts can be used to (1) directly refine incomplete knowledge bases [3,8,23,32], (2) provide cleaned evidence for error detection in dirty knowledge bases [4,16,27,44], (3) improve the quality of knowledge search [31,34], and (4) integrate multiple knowledge bases [8,10].
Facts in knowledge graphs are often associated with nontrivial regularities that are jointly described by imposing both topological constraints and ontological closeness. Such regularities can be captured by subgraphs associated with the facts. How can these associated subgraphs and ontologies be exploited to effectively support fact checking in knowledge graphs? Consider the following example. Example 1. The graph G_1 in Fig. 1 illustrates a fraction of DBpedia [23] that depicts facts about philosophers (e.g., "Plato"). The knowledge base is associated with an ontology O_1, which depicts semantic relationships among the concepts (e.g., "philosopher") that are referred to by the entity types in G_1. A user is interested in finding "whether a logician ('Cicero') or a theologian ('St. Augustine') as v_x is influenced by a philosopher ('Plato') as v_y".
It is observed that graph patterns help explain the existence of certain entities and relationships in knowledge bases [26]. Consider a rule represented by a graph pattern P_1 associated with philosophers, which states that "if a philosopher v_x gave one or more speeches that cited a book of v_y with the same topic, then v_x is likely to be influenced by v_y". One may want to apply this rule to verify whether Cicero is influenced by Plato. Nevertheless, such a rule cannot be directly applied, as Cicero is not directly labeled by "philosopher". On the other hand, as "logician" (resp. "masterpiece") is a type semantically close to the concept "philosopher" (resp. "speech") in the philosopher ontology O_1, "Cicero" and "Plato" should be considered as matches of P_1, and the triple 〈Cicero, influencedBy, Plato〉 should be true in G_1. Similarly, another fact 〈St. Augustine, influencedBy, Plato〉 should be identified as true, given that (a) "theologian" and "writtenWork" are semantically close to "philosopher" and "book" in O_1, respectively, and (b) there is a subgraph of G_1 that contains "St. Augustine" and "Plato" and matches P_1.
Consider another example, a business knowledge base G_2 from a fraction of a real-world offshore activity network [19] in Fig. 1. To find whether an active broker (close to active intermediary) is likely to serve a company in transition, a pattern P_2 that explains such an action may be identified in G_2, stating that "A is likely an intermediary of C if it served a dissolved (closed) company D, which has the same shareholder O and shares one or more service providers with C".
Subgraph patterns with "weaker" constraints may not explain facts well. Consider a graph pattern P′_1 obtained by removing the edge (speech, book) from P_1. Although "Cicero" and "Plato" match P′_1, a false fact 〈Cicero, influencedBy, John Stuart Mill〉 also matches P′_1, because "John Stuart Mill" also has a book belonging to "Ancient Philosophy" (not shown). Thus, P′_1 alone does not distinguish well between true and false facts for (philosopher, philosopher). However, as "Cicero" does not have a speech citing a book of "John Stuart Mill," the fact is identified as false by P_1, since it does not satisfy its constraints.
These graph patterns can be easily interpreted as rules, and the matches of the graph patterns readily provide instance-level evidence to "explain" the facts. These matches also indicate more accurate predictive models for various facts. We ask the following questions: How to jointly characterize and discover useful patterns with subgraphs and ontologies? and How to use these patterns to support fact checking in large knowledge graphs?
Contribution. We propose models and algorithms that explicitly incorporate discriminant subgraphs and ontologies to support fact checking in knowledge graphs.
(1) We propose a class of ontological graph fact checking rules (OGFCs). OGFCs incorporate discriminant graph patterns as the antecedent and generalized triple patterns as the consequent, and build a unified model to check multiple types of facts by graph pattern matching with ontology closeness. We adopt computationally efficient pattern models and closeness functions to ensure tractable fact checking via OGFCs. We develop statistical measures (e.g., support, confidence, significance, and diversity) to characterize useful OGFCs (Sect. 3). Based on these measures, we formulate the top-k OGFC discovery problem to mine useful OGFCs for fact checking.
(Fig. 1: Facts and their associated subgraphs. Subgraphs suggest the existence of facts by jointly describing topological and semantic constraints; such subgraphs can be identified by approximate graph pattern matching via associated ontologies.)

(2) We develop a feasible supervised discovery algorithm to compute OGFCs over a set of training facts (Sect. 4). In contrast to conventional pattern mining, the algorithm solves a submodular optimization problem with provable optimality guarantees, by a single scan of a stream of patterns, and incurs a small cost for each pattern.
(3) To evaluate the applications of OGFCs, we apply OGFCs to enhance rule-based and learning-based models for the fact checking task, by developing two such classifiers. The first model directly uses OGFCs as rules. The second model extracts instance-level features from the matches of patterns induced by OGFCs to learn a classifier (Sect. 4.2). (4) Using real-world knowledge bases, we experimentally verify the efficiency and effectiveness of OGFC-based techniques (Sect. 5). We found that the discovery of OGFCs is feasible over large graphs.
OGFC-based fact checking also achieves high accuracy and outperforms its counterparts using Horn clause rules and path-based learning. We also show that the models are highly interpretable by providing case studies.
Our work nontrivially extends graph fact checking rules (GFCs) [26] with the following new contributions that are not addressed by GFC techniques: (1) new rule models that incorporate semantic closeness in ontologies beyond label equality, (2) improved rule discovery algorithms that incorporate ontological subgraph matching and an ontological pattern growth strategy, (3) a unified model for multiple types of facts with semantic closeness, unlike GFCs, which need to build a separate model for each single triple pattern, and (4) experimental studies that verify the effectiveness of adding ontologies to the models.

Related work.
We categorize the related work as follows.
Fact checking. Fact checking has been studied for unstructured data [13,36] and structured (relational) data [18,45], mostly relying on text analysis and crowd sourcing. Automatic fact checking in knowledge graphs is not addressed in these works. Beyond relational data, the following methods have been studied to predict triples in graphs.
(1) Rule-based models extract association rules to predict facts.
AMIE (or its improved version AMIE+) discovers rules with conjunctive Horn clauses [14,15] for knowledge base enhancement. Beyond Horn rules, GPARs [11] discover association rules in the form of Q ⇒ p, with a subgraph pattern Q and a single edge p, and recommend users via co-occurring frequent subgraphs.
(2) Supervised link prediction has been applied to train predictive models with latent features extracted from entities [8,22]. Recent works make use of path features [5,6,16,37,42]. The paths involving targeted entities are sampled from 1-hop neighbors [6] or via random walks [16], or constrained to be shortest paths [5]. Discriminant paths with the same ontology are grouped to generate training examples in [37].
Rule-based models are easy to interpret but usually cover only a subset of useful patterns [31]. It is also expensive to discover useful rules (e.g., via subgraph isomorphism) [11]. On the other hand, latent feature models are more difficult to interpret [31] compared with rule models [15]. Our work aims to balance interpretability and model construction cost. (a) In contrast to [15], we use more expressive rules enhanced with graph patterns to express both constant and topological context of facts. Unlike [11], we use approximate pattern matching for OGFCs instead of subgraph isomorphism, since the latter may produce redundant examples and is computationally hard in general. (b) OGFCs can induce useful and discriminant features from patterns and subgraphs, beyond path features [6,16,42]. (c) OGFCs can be used as a stand-alone rule-based method. They also provide context-dependent features to support supervised link prediction and to learn highly interpretable models. These are not addressed in [11,15].
Ontological graph pattern matching. Ontology-based pattern matching has been proposed to replace label equality with grouping semantically related labels [24,46]. Wu et al. [46] revise subgraph isomorphism with a quantitative metric that measures the similarity between the query and its matches in the graph. We adopt the ontology-based matching introduced in [46] and its closeness function between concepts (labels) to find OGFCs with semantically related labels.
Graph pattern mining. Frequent pattern mining defined by subgraph isomorphism has been studied for a single graph. GRAMI [9] discovers frequent subgraph patterns without edge labels. Parallel algorithms are also developed for association rules with subgraph patterns [11]. In contrast, (1) we adopt approximate graph pattern matching for feasible fact checking, rather than subgraph isomorphism as in [9,11].
(2) We develop a more feasible stream mining algorithm with optimality guarantees on rule quality, which incurs a small cost to process each pattern. (3) Supervised graph pattern mining over observed ground truth is not discussed in [9,11]. In contrast, we develop supervised pattern discovery algorithms that compute discriminant patterns to best distinguish between the observed true and false facts, and we apply them to fact checking.
Graph dependency. Data dependencies have been extended to capture inconsistencies in graph data. Functional dependencies for graphs (GFDs) [12] enforce topological and value constraints by incorporating graph patterns with variables and subgraph isomorphism. Ontology functional dependencies (OFDs) on relational data have been proposed to capture synonym and is-a relationships defined in an ontology [2]. These hard constraints are useful for detecting and cleaning data inconsistencies for follow-up fact checking tasks [31]. On the other hand, they are often violated by incomplete knowledge graphs [31] and thus can be overkill for discovering useful substructures when applied to fact checking. We focus on "soft rules" to infer new facts toward data completion rather than identifying errors with hard constraints [34]. While hard rules are designed to enforce value constraints on node attribute values to capture data inconsistencies, OGFCs can be viewed as a class of association rules that incorporates approximate graph pattern matching with ontology closeness functions to identify missing facts. The semantics and applications of OGFCs are quite different from their counterparts in these data dependencies.

Fact Checking with Graph Patterns
We review the notions of knowledge graphs and fact checking. We then introduce a class of rules that incorporate graph patterns and ontologies for fact checking.

Graphs, Ontologies, and Patterns
Knowledge graphs. A knowledge graph [8] is a directed graph G = (V, E, L), which consists of a finite set of nodes V and a set of edges E ⊆ V × V. Each node v ∈ V (resp. edge e ∈ E) carries a label L(v) (resp. L(e)) that encodes the content of v (resp. e), such as types, names, or property values.

Ontologies. An ontology is a directed graph O = (V_o, E_o), where V_o is a set of concept labels and E_o ⊆ V_o × V_o is a set of semantic relations among the concept nodes. In practice, an edge (l, l′) ∈ E_o may encode three types of relations [21]: (a) equivalence states that l and l′ are semantically equivalent, representing, e.g., "refersTo" or "knownAs"; (b) hyponym states that l is a kind of l′, such as "isA" or "subclassOf", which enforces a preorder over V_o; and (c) descriptive states that l is described by another label l′ in terms of, for example, "association," "partOf," or "similarTo". In practice, an ontology may encode taxonomies, thesauri, or RDF schemas.
Label closeness function. Given an ontology O and a concept label l, a label closeness function sim(⋅) computes a set of labels close to l, i.e., the set {l′ | sim(l, l′) ≥ α}, where (1) sim(l, l′) computes a relevance score between l and l′, and (2) α (resp. 1 − α) is a similarity (resp. distance) bound. One may set sim(l, l′) as the normalized sum of the edge weights along a shortest (undirected) path between l and l′ in O [21,46]. For the equivalence, hyponym, and descriptive edges modeled in O, tunable weights w_1, w_2, and w_3 can be assigned, respectively, to differentiate equivalence, inheritance, and association properties [21]. Example 2. Consider the knowledge graph G_1 in Fig. 1.
The fact 〈Cicero, influencedBy, Plato〉 is encoded by an edge in G between the subject node v_x and the object node v_y. The label of v_x encodes its name "Cicero" and carries a type x = "philosopher"; similarly for v_y with name "Plato" and type y = "philosopher". By setting w_1 = 0.0, w_2 = 0.1, and w_3 = 0.4, the corresponding ontology O_1 of G_1 (Fig. 1) suggests, for example, that "logician" and "theologian" are close to the concept "philosopher". Fact checking. Given a knowledge graph G and a fact t = 〈v_x, r, v_y〉, where v_x and v_y are in G and t ∉ E, the task of fact checking is to compute a model M to decide whether the relation r exists between v_x and v_y [31]. This task can be represented by a binary query in the form of 〈v_x, ?, v_y〉, where the model M outputs "true" or "false" for the query. We study how subgraphs and ontologies can be jointly explored to support effective fact checking for knowledge graphs. To characterize useful subgraphs and concept labels, we introduce a class of ontology-based subgraph patterns, which extends its counterpart in graph fact checking rules (GFCs) [26] with ontology closeness.
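The weighted shortest-path closeness described above can be sketched as follows; the ontology fragment, the weight values, and the function names (`sim`, `close_labels`) are illustrative assumptions, not the paper's exact implementation:

```python
import heapq

# Illustrative ontology fragment as undirected weighted edges; weights
# w_2 (hyponym) and w_3 (descriptive) follow the setting in Example 2.
ONTOLOGY = {
    ("philosopher", "logician"): 0.1,      # hyponym (w_2)
    ("philosopher", "theologian"): 0.1,    # hyponym (w_2)
    ("speech", "masterpiece"): 0.4,        # descriptive (w_3)
    ("book", "writtenWork"): 0.1,          # hyponym (w_2)
}

def _adjacency(ontology):
    adj = {}
    for (a, b), w in ontology.items():
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    return adj

def sim(l1, l2, ontology=ONTOLOGY):
    """Closeness as 1 minus the weight of a shortest undirected path in O,
    floored at 0; disconnected labels get closeness 0."""
    if l1 == l2:
        return 1.0
    adj = _adjacency(ontology)
    dist = {l1: 0.0}
    heap = [(0.0, l1)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == l2:
            return max(0.0, 1.0 - d)
        if d > dist.get(node, float("inf")):
            continue
        for nxt, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return 0.0

def close_labels(l, alpha, ontology=ONTOLOGY):
    """All labels whose closeness to l is at least the similarity bound alpha."""
    labels = {x for pair in ontology for x in pair} | {l}
    return {l2 for l2 in labels if sim(l, l2, ontology) >= alpha}
```

With the weights above, `close_labels("philosopher", 0.8)` yields "logician" and "theologian" (plus "philosopher" itself), consistent with Example 2.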

Subgraph patterns.
A subgraph pattern P(x, y) = (V_P, E_P, L_P) is a directed graph that contains a set of pattern nodes V_P and a set of pattern edges E_P. Each pattern node u_p ∈ V_P (resp. edge e_p ∈ E_P) has a label L_P(u_p) (resp. L_P(e_p)). Moreover, it contains two designated anchored nodes u_x and u_y in V_P of types x and y, respectively. Specifically, when P contains a single pattern edge with label r between u_x and u_y, P is called a triple pattern, denoted as r(x, y). We next extend the approximate pattern matching of [26] with ontologies.
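As a concrete (hypothetical) encoding of such patterns, one might use the following; the field names and the shape of P_1 are our illustration of Example 1, not the paper's data structure:

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    """A subgraph pattern P(x, y) with two designated anchored nodes."""
    nodes: dict      # pattern node id -> type label, e.g. {0: "philosopher"}
    edges: dict      # (src id, dst id) -> edge label
    anchor_x: int    # id of anchored node u_x
    anchor_y: int    # id of anchored node u_y

    def is_triple_pattern(self):
        # A triple pattern r(x, y) has a single edge between the anchors.
        return (len(self.edges) == 1
                and set(self.edges) == {(self.anchor_x, self.anchor_y)})

# Sketch of P_1 from Example 1: v_x gave a speech citing a book of v_y,
# where speech and book share the same topic (structure is illustrative).
P1 = Pattern(
    nodes={0: "philosopher", 1: "philosopher", 2: "speech", 3: "book", 4: "topic"},
    edges={(0, 2): "gave", (2, 3): "cite", (1, 3): "authorOf",
           (2, 4): "topic", (3, 4): "topic"},
    anchor_x=0, anchor_y=1,
)
```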
Ontological pattern matching. Given a graph G, a pattern P(x, y), and a function sim(⋅), a node v in G is a candidate of a pattern node u ∈ V_P if the label of v is in the set of labels close to L_P(u), as determined by sim(⋅) and O. Match relation. Given P(x, y), G, O, and function sim(⋅), a pair of nodes (v_x, v_y) matches P(x, y), or P covers the pair (v_x, v_y), if (1) there exists a matching relation R ⊆ V_P × V such that for each pair (u, v) ∈ R, (a) v is a candidate of u (verified by the ontology closeness function sim(⋅)), and (b) for every edge e_P = (u, u′) ∈ E_P, there exists a candidate v′ of u′ such that (v, v′) ∈ E and (u′, v′) ∈ R (and symmetrically for every incoming pattern edge of u); and (2) v_x (resp. v_y) is a match of u_x (resp. u_y) in R, respectively.

Example 3.
Consider G_1 and its associated ontology O_1 in Fig. 1. Given the label "philosopher," the set of close labels induced by sim(⋅) over O_1 includes, for example, "logician" and "theologian"; hence "Cicero" and "St. Augustine" are candidates of the pattern node with type philosopher in P_1. Remarks. As observed in [26,28,39,40], subgraph patterns defined by, for example, subgraph isomorphism may be an overkill in capturing meaningful patterns and are computationally expensive (NP-hard). Moreover, subgraph isomorphism generates (exponentially) many isomorphic subgraphs and thus introduces redundant features for model learning [26]. In contrast, it takes O(|V_P|(|V_P| + |V|)(|E_P| + |E|)) time to find whether a fact is covered by an approximate pattern [26]. The tractability carries over to the validation of OGFCs (Sect. 4). To ensure feasible fact checking in large knowledge graphs and ontologies, we shall consider ontological pattern matching to balance the expressiveness and computational cost of our rule model.

Ontological Graph Fact Checking Rules
We now introduce our rule model that incorporates graph patterns and ontologies.

Rule model. An ontological graph fact checking rule (denoted as OGFC)
is in the form of φ: P(x, y) → r(x, y), where (1) P(x, y) and r(x, y) are two graph patterns carrying the same pair of anchored nodes (u_x, u_y), and (2) r(x, y) is a triple pattern and is not in P(x, y).
Semantics. Given a knowledge graph G, an ontology O, and a closeness function sim(⋅), an OGFC φ: P(x, y) → r(x, y) states that for a pair of entities (v_x, v_y), the fact 〈v_x, r, v_y〉 holds if (v_x, v_y) is covered by the pattern P(x, y) in G in terms of O and sim(⋅).

Example 4.
Consider the patterns and graphs in Fig. 1. To verify the influence between two philosophers, an OGFC is φ_1: P_1(x, y) → influencedBy(x, y). Pattern P_1 has two anchored nodes x and y, both with type philosopher, and covers the pair (Cicero, Plato) in G_1. To verify the service between a pair of matched entities (A, C), another OGFC is φ_2: P_2(x, y) → intermediaryOf(x, y). Note that with subgraph isomorphism, P_1 induces two subgraphs of G_1 that differ only by entities with labels speech and masterpiece. It is impractical for users to inspect such highly overlapping subgraphs with subgraph isomorphism.

Remarks.
We compare OGFCs with two models below. (1) Horn rules are adopted by AMIE+ [14], in the form of ⋀ B_i → r(x, y), where each B_i is an atom (fact) carrying variables. It mines only closed (each variable appears at least twice) and connected (atoms transitively share variables/entities with all others) rules. We allow general approximate graph patterns in OGFCs to mitigate missing data and capture richer context features for supervised models (Sect. 4). (2) The association rules with graph patterns [11] have similar syntax to OGFCs but adopt strict subgraph isomorphism for social recommendation. In contrast, we define OGFCs with semantics and quality measures (Sect. 3) specified for observed true and false facts to support fact checking. (3) The GFC model [26] is a special case of OGFCs in which sim(⋅) enforces label equality (α = 1).

Supervised OGFC Discovery
To characterize useful OGFCs, we introduce a set of metrics that jointly measure pattern significance and rule quality, which extend their counterparts from established rule models [15] and discriminant patterns [47] and are specialized for a set of training facts. We then formalize the supervised OGFC discovery problem.
Statistical measures. Our measures are defined over a knowledge graph G, an ontology O (with function sim(⋅)), and a set of training facts Γ. The training facts consist of a set of true facts Γ+ in G and a set of false facts Γ− that are known not to be in G, respectively. Extending the silver standard in knowledge base completion [34], (1) Γ+ can usually be sampled from manually cleaned knowledge bases [29]; and (2) Γ− is populated following the partial closed-world assumption (see "Confidence").
We use the following notations. Given an OGFC φ: P(x, y) → r(x, y), a graph G, and facts Γ+ and Γ−: (1) P(Γ+) (resp. P(Γ−)) refers to the set of training facts in Γ+ (resp. Γ−) that are covered by P(x, y) in terms of O and sim(⋅). P(Γ) is defined as P(Γ+) ∪ P(Γ−), i.e., all the facts in Γ covered by P. (2) r(Γ+), r(Γ−), and r(Γ) are defined similarly. Support and confidence. The support of an OGFC φ: P(x, y) → r(x, y), denoted by supp(φ, G, Γ) (or simply supp(φ)), is defined as supp(φ) = |P(Γ+)| / |r(Γ+)|. Intuitively, the support is the fraction of the true facts that are instances of r(x, y) and also satisfy the constraints of the subgraph pattern P(x, y) over the ontology O and the closeness function sim(⋅). It extends head coverage, a practical version of rule support [15], to address triple patterns r(x, y) that do not have many matches due to the incompleteness of knowledge bases.
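Under this definition, support reduces to set arithmetic over the covered pairs; a minimal sketch, assuming P(Γ+) and r(Γ+) have been materialized as sets of entity pairs (the pairs shown are illustrative):

```python
def support(p_pos, r_pos):
    """supp(phi) = |P(Gamma+)| / |r(Gamma+)|: the fraction of the true
    instances of r(x, y) that are also covered by the pattern P(x, y)."""
    if not r_pos:
        return 0.0
    # Pairs covered by P are restricted to true instances of r(x, y).
    return len(p_pos & r_pos) / len(r_pos)

# Hypothetical training pairs for r(x, y) = influencedBy(philosopher, philosopher):
r_pos = {("Cicero", "Plato"), ("St. Augustine", "Plato"), ("Aquinas", "Aristotle")}
p_pos = {("Cicero", "Plato"), ("St. Augustine", "Plato")}   # pairs covered by P_1
```

Here `support(p_pos, r_pos)` gives 2/3: two of the three true instances also satisfy P_1.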
Given two patterns P_1(x, y) and P_2(x, y), we say P_2(x, y) refines P_1(x, y) (denoted by P_1(x, y) ⪯ P_2(x, y)), if P_1 is a subgraph of P_2 and they pertain to the same pair of anchored nodes (u_x, u_y). We show that the support of OGFCs preserves anti-monotonicity in terms of pattern refinement.

Lemma 1.
For graph G, given any two OGFCs φ_1: P_1(x, y) → r(x, y) and φ_2: P_2(x, y) → r(x, y), if P_1(x, y) ⪯ P_2(x, y), then supp(φ_2) ≤ supp(φ_1).

Proof sketch. It suffices to show that any pair covered by P_2 is also covered by P_1. Assume, by contradiction, that there exists a pair (v_x2, v_y2) covered by P_2 but not by P_1. Then there exists either (a) a pattern edge in P_1 such that no adjacent edge of v_x2 in G is a match, or (b) a node u as an ancestor or a descendant of u_x in P_1, such that no ancestor or descendant of v_x2 in G is a match. As P_2 refines P_1, both (a) and (b) lead to that v_x2 is not covered by P_2, which contradicts the definition of approximate patterns. □ Extending the partial closed-world assumption. Following rule discovery in incomplete knowledge bases [15], we extend the partial closed-world assumption (PCA) to characterize the confidence of OGFCs. Given a triple pattern r(x, y) and a true instance 〈v_x, r, v_y〉, PCA assumes that any fact 〈v_x, r, v′_y〉 that is not in G is false. In other words, for a given entity v_x, it assumes that r(Γ+) contains all the true facts about v_x that pertain to the specific r. Given the ontology and the function sim(⋅) that tolerates concept label dissimilarity, our extension identifies a fact as false only when it claims a fact that connects v_x and v′_y via r, and v′_y is not ontologically close to any known entity that is connected to v_x via r. This necessarily extends the conventional PCA (where sim(⋅) simply enforces label equality, i.e., α = 1) to reduce the impact of facts that should not be counted as "false" due to true facts that are ontologically close to them.
We define a normalizer set P(Γ+)_N, which contains all the pairs (v_x, v_y) from P(Γ+) that have at least one false counterpart under the ontology-based PCA. The confidence of φ in G, denoted as conf(φ, G, Γ) (or simply conf(φ)), is defined as conf(φ) = |P(Γ+)_N| / (|P(Γ+)_N| + |P(Γ−)|). The confidence measures the probability that an OGFC holds over the entity pairs that satisfy P(x, y), normalized by the facts that are assumed to be false under PCA. We follow the ontology-based PCA to construct false facts in our experimental study.
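One plausible reading of this normalization can be sketched as follows; the predicate `has_false_counterpart` abstracts the ontology-based PCA check, and the pairs are illustrative:

```python
def confidence(p_pos, p_neg, has_false_counterpart):
    """A sketch of conf(phi): among the pairs covered by P(x, y) that the
    ontology-based PCA can judge -- true pairs with at least one false
    counterpart (the normalizer set P(Gamma+)_N) plus pairs assumed
    false (P(Gamma-)) -- the fraction that are true."""
    p_pos_n = {pair for pair in p_pos if has_false_counterpart(pair)}
    denom = len(p_pos_n) + len(p_neg)
    return len(p_pos_n) / denom if denom else 0.0

# Hypothetical pairs covered by P(x, y):
p_pos = {("Cicero", "Plato"), ("St. Augustine", "Plato")}
p_neg = {("Cicero", "John Stuart Mill")}
# Assume each true pair here has at least one false counterpart under PCA.
conf_value = confidence(p_pos, p_neg, lambda pair: True)   # 2 / 3
```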
Significance. We next quantify how significant an OGFC φ is in "distinguishing" between the true and false facts, by extending the G-test score [47]. This test verifies the null hypothesis of whether the number of true facts "covered" by P(x, y) fits the distribution in the false facts. If not, P(x, y) is considered significant. Specifically, the score (denoted as sig(φ, p, n), or simply sig(φ)) is defined as sig(φ, p, n) = p ln(p/n) + (1 − p) ln((1 − p)/(1 − n)), where p (resp. n) is the frequency of the facts covered by φ in the true (resp. false) facts. As sig(φ) is not anti-monotonic, a common practice is to use a "rounded up" score to find significant patterns [47]. We adopt an upper bound of sig(φ), denoted as ŝig(φ, p, n) (or ŝig(φ) for simplicity), which is defined as tanh(max{sig(φ, p, ε), sig(φ, ε, n)}), where ε > 0 is a small constant (to prevent the case that ŝig(φ) = ∞), and ŝig is normalized to [0, 1] by the hyperbolic function tanh(⋅). We show the following results.

Lemma 2.
Given graph G, for any two OGFCs φ_1 and φ_2, if φ_1 ⪯ φ_2, then ŝig(φ_2) ≤ ŝig(φ_1). Proof. (1) It suffices to show that both sig(φ, p, ε) and sig(φ, ε, n) are anti-monotonic in terms of rule refinement.
(2) Given Lemma 1, we have p_2 ≤ p_1 and n_2 ≤ n_1 if φ_1 ⪯ φ_2. Then, sig(φ_2, p_2, ε) ≤ sig(φ_1, p_1, ε) and sig(φ_2, ε, n_2) ≤ sig(φ_1, ε, n_1), and therefore ŝig(φ_2) ≤ ŝig(φ_1). This completes the proof of Lemma 2. □ Redundancy-aware selection. In practice, one wants to find OGFCs with both high significance and low redundancy. Indeed, a set of OGFCs can be less useful if they "cover" the same set of true facts in Γ+. We introduce a bi-criteria function that favors significant OGFCs that cover more diversified true facts. Given a set 𝒮 of OGFCs, when the set of true facts Γ+ is known, the coverage score of 𝒮, denoted as cov(𝒮), is defined as cov(𝒮) = sig(𝒮) + div(𝒮). The first term, defined as sig(𝒮) = (1/Z) Σ_{φ∈𝒮} ŝig(φ), aggregates the total significance of the OGFCs in 𝒮. The second term is defined as div(𝒮) = (1/Z′) Σ_{t∈Γ+} T_t(𝒮), where T_t(𝒮) = √|𝒮_t| and 𝒮_t refers to the OGFCs in 𝒮 that cover a true fact t ∈ Γ+. div(𝒮) quantifies the diversity of 𝒮 and follows a reward function [25]. Intuitively, it rewards diversity in that there is more benefit in selecting an OGFC that covers new facts that are not yet covered by other OGFCs in 𝒮. Both terms are normalized to (0, 1] by the normalizers Z and Z′. The coverage score favors OGFCs that cover more distinct true facts with more discriminant patterns. We next show that cov(⋅) is well defined in terms of diminishing returns. That is, adding a new OGFC to a set 𝒮 improves its significance and coverage at least as much as adding it to any superset of 𝒮 (diminishing gain). This also verifies that cov(⋅) enjoys submodularity [30], a property widely used to justify goodness measures for set mining. Define the marginal gain g(φ, 𝒮) of an OGFC φ to a set 𝒮 (φ ∉ 𝒮) as cov(𝒮 ∪ {φ}) − cov(𝒮). We have the following result.

Lemma 3. The function cov(⋅) is a monotone submodular function; that is, for any two sets 𝒮_1 ⊆ 𝒮_2 of OGFCs, (1) cov(𝒮_1) ≤ cov(𝒮_2) (monotonicity), and (2) for any OGFC φ ∉ 𝒮_2, g(φ, 𝒮_2) ≤ g(φ, 𝒮_1) (submodularity). Proof. We show that both parts pertaining to cov(𝒮), i.e., sig(𝒮) and div(𝒮), are monotone submodular functions w.r.t. 𝒮, and therefore cov(𝒮) is a monotone submodular function w.r.t. 𝒮.
(1) We show that both sig(𝒮) and div(𝒮) are monotone functions w.r.t. 𝒮. Each term ŝig(φ) is positive, and Σ_{φ∈𝒮_1} ŝig(φ) ≤ Σ_{φ∈𝒮_2} ŝig(φ), since every OGFC in 𝒮_1 is also in 𝒮_2 for any two sets 𝒮_1 ⊆ 𝒮_2 of OGFCs. Hence, sig(𝒮) is a monotone function w.r.t. the set 𝒮.
Similarly, T_t(𝒮_1) ≤ T_t(𝒮_2), since every OGFC in 𝒮_1 that covers t is also in 𝒮_2 for any two sets 𝒮_1 ⊆ 𝒮_2 of OGFCs. Hence, each term T_t(𝒮) in div(𝒮) is a monotone function w.r.t. 𝒮, and thus div(𝒮) is a monotone function w.r.t. 𝒮.
(2) Next, we show that both sig(𝒮) and div(𝒮) are submodular functions w.r.t. 𝒮. For any φ′ ∉ 𝒮, the marginal gain of sig(𝒮) is ŝig(φ′)/Z, which is independent of 𝒮; hence sig(⋅) is submodular. Similarly, for any φ′ ∉ 𝒮, the marginal gain of div(⋅) for each term T_t(𝒮) (when φ′ covers t) is √(|𝒮_t| + 1) − √|𝒮_t|, which is an anti-monotonic function w.r.t. |𝒮_t|. As |𝒮_t| is monotonic w.r.t. 𝒮, for any two sets 𝒮_1 ⊆ 𝒮_2 we have g(φ′, 𝒮_2) ≤ g(φ′, 𝒮_1). Hence div(⋅) is submodular, and so is cov(⋅). □ We now formulate the top-k discovery problem over observed facts.
Top-k supervised OGFC discovery. Given a graph G, a corresponding ontology O with an ontology closeness function sim(⋅), a support threshold σ, a confidence threshold η, training facts Γ as instances of a triple pattern r(x, y), and an integer k, the problem is to identify a set 𝒮 of top-k OGFCs that pertain to r(x, y), such that (a) for each φ ∈ 𝒮, supp(φ) ≥ σ and conf(φ) ≥ η, and (b) cov(𝒮) is maximized.
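The objective above can be sketched in code. The two-term G-test form and the square-root diversity reward are standard choices consistent with the description, but the normalizers Z and Z′ are omitted, and all rule identifiers, scores, and fact sets are illustrative:

```python
import math

def sig_upper(p, n, eps=1e-3):
    """Rounded-up, tanh-normalized significance: p (resp. n) is the
    frequency of facts covered by phi among the true (resp. false) facts."""
    def g_test(a, b):
        return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    # Replace one frequency by a small eps and squash into [0, 1].
    return math.tanh(max(g_test(p, eps), g_test(eps, n)))

def cov(rules, sig, covers, true_facts):
    """cov(S): total significance plus the sqrt diversity reward over
    the true facts covered (normalizers omitted for brevity)."""
    sig_part = sum(sig[r] for r in rules)
    div_part = sum(math.sqrt(sum(1 for r in rules if t in covers[r]))
                   for t in true_facts)
    return sig_part + div_part

def marginal_gain(rule, rules, sig, covers, true_facts):
    """g(phi, S) = cov(S | {phi}) - cov(S)."""
    return (cov(rules | {rule}, sig, covers, true_facts)
            - cov(rules, sig, covers, true_facts))
```

Because the diversity term is concave, a rule's marginal gain shrinks once overlapping rules are already selected, which is exactly the diminishing-returns behavior Lemma 3 establishes.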

Top-k OGFC Discovery
Unsurprisingly, the supervised discovery problem for OGFCs is intractable. A naive "enumerate-and-verify" algorithm that generates and verifies all k-subsets of candidate OGFCs is clearly impractical for large G, O, and Γ. We introduce efficient algorithms with near-optimality guarantees. Before introducing these algorithms, we first introduce a common building-block procedure that computes the pairs covered by a subgraph pattern (a "pattern matching" procedure).

Procedure.
We start with an ontology-aware graph pattern matching procedure. Given a knowledge graph G, an ontology O, and a closeness function sim(⋅), for a subgraph pattern P(x, y), it computes the node pairs (v_x, v_y) that can be covered by P(x, y). In a nutshell, the procedure extends the approximate matching that computes a graph dual-simulation relation [28], while the candidates are dynamically determined by sim(⋅) and O. More specifically, it first finds the candidate matches v ∈ V of each node u ∈ V_P, such that v has a type close to that of u as determined by sim(⋅) and O. It then iteratively removes from the match set the candidates that violate the topological constraints of P, per the definition of the matching relation R, until the match set cannot be further refined.
Complexity. Note that a once-for-all preprocessing identifies all similar labels in the ontology O in O(|V_P|(|V_O| + |E_O|)) time, following a traversal of O. Given that O is typically small (and thus its diameter is a small constant), the computation of sim(⋅) for given labels is in O(1) time. It then takes O((|V_P| + |V|)(|E_P| + |E|)) time to compute the matching relation for each pattern.
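The refinement loop can be sketched as follows; edge labels are elided for brevity, `close_labels` stands for the candidate-label set induced by sim(⋅) and O, and all graph contents are illustrative:

```python
def ont_match(pattern_nodes, pattern_edges, graph_nodes, graph_edges, close_labels):
    """Sketch of ontology-aware approximate matching (dual-simulation style).
    pattern_nodes/graph_nodes: id -> type label; *_edges: set of (src, dst).
    close_labels(l): labels ontologically close to l (via sim(.) and O).
    Returns the refined candidate match set of each pattern node."""
    # 1. Initialize candidates by ontological label closeness.
    cand = {u: {v for v, lv in graph_nodes.items() if lv in close_labels(lu)}
            for u, lu in pattern_nodes.items()}
    # 2. Iteratively prune candidates violating topological constraints:
    #    every match of u needs a successor matching u2, and every match
    #    of u2 needs a predecessor matching u (both directions, as in
    #    dual simulation).
    changed = True
    while changed:
        changed = False
        for (u, u2) in pattern_edges:
            keep_u = {v for v in cand[u]
                      if any((v, v2) in graph_edges for v2 in cand[u2])}
            keep_u2 = {v2 for v2 in cand[u2]
                       if any((v, v2) in graph_edges for v in cand[u])}
            if keep_u != cand[u] or keep_u2 != cand[u2]:
                cand[u], cand[u2] = keep_u, keep_u2
                changed = True
    return cand
```

Unlike subgraph isomorphism, the result is a single relation per pattern node rather than an enumeration of isomorphic subgraphs, which keeps the per-pattern cost polynomial.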
We next introduce discovery algorithms.
"Batch + Greedy". We start with an algorithm that combines batch pattern discovery with greedy selection, as follows. (1) Apply graph pattern mining (e.g., Apriori [20]) to generate and verify all graph patterns. The verification invokes the pattern matching procedure to compute the support and confidence of each pattern. (2) Invoke a greedy algorithm that makes k passes over the verified patterns: in iteration i, it selects the pattern P_i whose rule P_i(x, y) → r(x, y) maximizes the marginal gain. This algorithm guarantees a (1 − 1/e) approximation, following Lemma 3 and the seminal result of [30]. Nevertheless, it requires verifying all patterns before the selection starts, and the selection further requires k passes over all verified patterns. This can be expensive for large G and large training sets.
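The greedy selection of step (2) can be sketched as follows; the coverage utility is simplified here to set cover over the facts each pattern covers, and all names (`greedy_topk`, `utility`) are ours:

```python
def greedy_topk(candidates, utility, k):
    """Greedy selection achieving a (1 - 1/e) approximation for a
    monotone submodular utility (sketch). `utility` maps a list of
    selected patterns to its coverage score."""
    selected = []
    for _ in range(k):
        base = utility(selected)
        best, best_gain = None, 0.0
        for p in candidates:
            if p in selected:
                continue
            gain = utility(selected + [p]) - base
            if gain > best_gain:
                best, best_gain = p, gain
        if best is None:  # no remaining pattern has positive marginal gain
            break
        selected.append(best)
    return selected
```

Each of the k passes re-evaluates every verified pattern, which is exactly the cost the stream-based algorithm below avoids.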
We can do better: instead of "batch processing" the pattern discovery and then applying the verification sequentially, we organize newly generated patterns in a stream and interleave pattern generation and verification, assembling new patterns into the top-k set with small update costs. This requires a single scan of the patterns with early termination, without waiting for all patterns to be verified. Capitalizing on stream-based optimization [1, 40], we develop a near-optimal discovery algorithm. As a constructive proof of Theorem 1, we next introduce this stream discovery algorithm.
"Stream + Sieve". Our supervised discovery algorithm (illustrated in Fig. 2) interleaves pattern generation and selection as follows. (1) Ontology-aware pattern stream generation. The algorithm invokes a generation procedure to produce a pattern stream (lines 2 and 8). Unlike the batch algorithm, which verifies patterns against the entire graph G, it partitions the facts into blocks and iteratively spawns and verifies patterns by visiting the local neighbors of the facts in each block. This progressively finds patterns that better "purify" the labels of only those facts they cover, and thus reduces unnecessary enumeration and verification. Instead of partitioning by exactly matching triples [26], the procedure leverages the ontology O and the closeness function to group ontologically similar triples.
(2) Selection on the fly. The algorithm invokes a selection procedure (line 7) to select patterns and construct OGFCs on the fly. To achieve the optimality guarantee, it applies the stream-sieving strategy of stream data summarization [1]. In a nutshell, it estimates the optimal value of a monotone submodular function F(⋅) with multiple "sieve values", initialized from the maximum coverage score over single patterns (Sect. 3) (lines 4-5), and eagerly constructs the top-k set with patterns of high marginal benefit while refining the sieve values progressively.
The two procedures interact with each other: each pattern verified by the generation procedure is sent to the selection procedure. The algorithm terminates when no new pattern can be verified or the selected set can no longer be improved (as will be discussed). We next introduce the details of the two procedures.

Generation Procedure
The generation procedure improves on its "batch" counterpart by locally generating patterns that cover particular sets of facts, in the manner of decision tree construction. It maintains the following structures in each iteration i: (1) a pattern set containing graph patterns of size (number of pattern edges) i, initialized with the size-0 pattern that contains the anchored nodes u_x and u_y only; and (2) for each such pattern P, a partition recording the sets of positive and negative training facts covered by P, initialized as the full positive and negative training sets. At iteration i, it performs the following.
(1) For each block from iteration i − 1, it generates a set of graph patterns of size i. A size-i pattern P is constructed by adding a triple pattern e(u, u′) to its size-(i − 1) counterpart P′. Moreover, it only inserts triple patterns e(u, u′) whose instances occur among the neighbors of the matches of P′, as determined by the closeness function.
(2) For each size-i pattern P, it computes the support, confidence, and significance (G-test) by invoking the matching procedure, as in the batch algorithm, and prunes the patterns that fail the thresholds. It refines the positive and negative facts covered by P′ to those covered by P accordingly; note that the facts covered by P are subsets of those covered by P′. Once a promising pattern P is verified, it is returned to the selection procedure for the construction of the top-k OGFCs.
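The level-wise growth of step (1) can be sketched as follows. Representing a pattern as a tuple of triple patterns, and `neighbor_triples` as a map from a verified pattern to the candidate extension edges found among its matches' neighbors, are our simplifications:

```python
def grow_patterns(prev_patterns, neighbor_triples):
    """Level-wise pattern growth (sketch): each size-i pattern extends a
    size-(i-1) pattern P' with one triple pattern e(u, u') whose
    instances occur among the neighbors of P's matches
    (`neighbor_triples[P']`). Patterns are tuples of triple patterns."""
    next_patterns = set()
    for p in prev_patterns:
        for triple in neighbor_triples.get(p, ()):
            next_patterns.add(p + (triple,))
    return next_patterns
```

Restricting the extensions to the neighborhood of existing matches is what keeps the enumeration local, in contrast to a blind Apriori-style join over all verified patterns.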

Selection Procedure
To compute the set of OGFCs that maximizes the coverage score for a given r(x, y), it suffices for the selection procedure to compute the top-k graph patterns that maximize the score accordingly. It solves a submodular optimization problem over the pattern stream, specializing the sieve-streaming technique [1] to OGFC discovery.
Sieve streaming [1, 26]. Given a monotone submodular function F(⋅), a constant ε > 0, and an element set, sieve streaming finds top-k elements S that maximize F(S) as follows. It first finds the largest singleton value m = max_e F({e}) and then uses a set of sieve values (1 + ε)^j (j an integer) to discretize the range [m, k·m]. As the optimal value, denoted F(S*), lies in [m, k·m], there exists a value (1 + ε)^j that "best" approximates F(S*). For each sieve value v, a set of top patterns S_v is maintained, adding elements whose marginal gain is at least (v/2 − F(S_v))/(k − |S_v|). It is shown that selecting the sieve with the best k elements produces a set S with F(S) ≥ (1/2 − ε)F(S*) [1]. A direct application of sieve streaming to OGFC discovery seems infeasible: one needs to find the maximum coverage score of a single pattern (for fixed r(x, y)), which requires verifying the entire pattern set. Capitalizing on the data locality of graph pattern matching, Lemma 3, and Lemma 1, we show that this is doable at a small cost.
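The sieve-streaming strategy described above can be sketched as follows; `F` plays the role of the monotone submodular coverage function, and the threshold rule (v/2 − F(S))/(k − |S|) follows [1]. This is a simplified sketch, not the paper's implementation:

```python
import math

def sieve_stream(stream, F, k, eps):
    """Sieve-streaming summarization (sketch). F is a monotone
    submodular set function over tuples of items; returns a set of at
    most k items with F(S) >= (1/2 - eps) * OPT."""
    m = max(F((p,)) for p in stream)          # best singleton value
    lo = math.ceil(math.log(m, 1 + eps))
    hi = math.floor(math.log(k * m, 1 + eps)) + 1
    # One candidate set per sieve value (1 + eps)^j covering [m, k*m].
    sieves = {(1 + eps) ** j: [] for j in range(lo, hi + 1)}
    for p in stream:
        for v, S in sieves.items():
            if len(S) < k:
                gain = F(tuple(S) + (p,)) - F(tuple(S))
                if gain >= (v / 2 - F(tuple(S))) / (k - len(S)):
                    S.append(p)
    # Return the sieve with the best final utility.
    return max(sieves.values(), key=lambda S: F(tuple(S)))
```

Note that `m` here is computed by a full pass over singletons; Lemma 4 below shows that for OGFCs this maximum comes from size-1 patterns only, so it can be obtained without verifying the whole stream.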

Lemma 4. The maximum single-pattern coverage score can be computed in O(|P_1|) time, where P_1 is the set of size-1 patterns.
This can be verified by observing that the coverage score also preserves anti-monotonicity with respect to pattern refinement, because it aggregates significance and support, both of which are anti-monotone for single patterns: for any pattern P(x, y) and any refinement of P, the refinement's score is no larger than that of P. Thus, the value of the maximum coverage score must come from a single-edge pattern. That is, the selection procedure only needs to cache at most |P_1| size-1 patterns from the stream to find the global maximum (lines 4-5). The rest of the procedure follows the sieve-streaming strategy, as illustrated in Fig. 3; the OGFCs are constructed from the top-k graph patterns (line 8).
Optimization. To further prune unpromising patterns, the selection procedure estimates an upper bound of the marginal gain of each pattern P. To this end, it first traces back to a verified rule P′(x, y) → r(x, y), where P′ is a verified sub-pattern of P and P is obtained by adding a triple pattern r′ to P′. It estimates an upper bound of the support of the rule P(x, y) → r(x, y) as the support of the rule of P′ minus l divided by the number of positive facts of r(x, y), where l is the number of positive facts that have no match of r′ in their i-hop neighbors (and thus cannot be covered by P). Similarly, one can estimate upper bounds for the positive and negative counts in the confidence, and thus obtain an upper bound of the marginal gain. To see that the estimate is indeed an upper bound, note that the marginal gains of the significance part and the diversity part are both defined in terms of square roots: for any two positive numbers a_1 and a_2, the square root of their sum is at most the sum of their square roots. Applying this inequality to each square-root term, substituting the exact and estimated quantities for a_1 and a_2, yields an upper bound for each term and hence for the total gain.
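The square-root inequality used above can be verified by squaring both sides:

```latex
\sqrt{a_1 + a_2} \;\le\; \sqrt{a_1} + \sqrt{a_2}
\quad \text{for } a_1, a_2 > 0,
\qquad \text{since} \quad
\left(\sqrt{a_1} + \sqrt{a_2}\right)^2
  \;=\; a_1 + 2\sqrt{a_1 a_2} + a_2
  \;\ge\; a_1 + a_2 .
```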
The approximation ratio follows the analysis of optimizing stream summarization [1], by viewing patterns as data items that each carry a benefit, and the general pattern coverage as the utility function to be optimized (when the selected set has size k). Following [1], there exists at least one sieve value v_j that best estimates the optimal coverage and thus achieves approximation ratio (1/2 − ε). Therefore, selecting the patterns from the sieve set with the largest coverage guarantees approximation ratio (1/2 − ε). The above analysis completes the proof of Theorem 1.

OGFC-based Fact Checking
OGFCs can be applied to enhance fact checking either directly as rule models or via supervised link prediction. We introduce two OGFC-based models.
Generating training facts. Given a knowledge graph G = (V, E, L) and a triple pattern r(x, y), we generate training facts as follows. (1) For each fixed r(x, y), a set of true facts is sampled from the matches of r(x, y) in G. For each true fact ⟨v_x, r, v_y⟩, we further introduce "noise" by replacing labels with semantically close counterparts, as asserted by the ontology O and the closeness function. This generates a set of true facts that approximately match r(x, y).
(2) We follow the silver standard [34] and generate only false facts that do not appear among the existing facts in G; thus the sets of true and false training facts are disjoint.
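A minimal sketch of the silver-standard negative sampling; the tail-corruption scheme and all names are our simplification (the paper's generator additionally perturbs positive facts via the ontology):

```python
import random

def generate_negatives(pos, entities, known, n_neg, seed=0):
    """Silver-standard negative sampling (sketch): corrupt the tail of
    a true fact (h, r, t) with a random entity, keeping only corrupted
    triples not already known in G, so positives and negatives are
    guaranteed disjoint."""
    rng = random.Random(seed)
    neg = set()
    attempts = 0
    while len(neg) < n_neg and attempts < 10000:
        attempts += 1
        h, r, t = rng.choice(pos)
        cand = (h, r, rng.choice(entities))
        if cand not in known:   # never emit an existing fact
            neg.add(cand)
    return sorted(neg)
```

The attempt cap is a safeguard for the sketch; a production sampler would instead enumerate the candidate space when it is small.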

Using OGFCs as Rules
Given training facts, a rule-based model invokes the discovery algorithm to find top-k OGFCs as fact checking rules. Given a new fact t = ⟨v_x, r, v_y⟩, it follows a "hit and miss" convention [15] and checks whether some discovered OGFC covers t (i.e., both the consequent and the antecedent of the rule cover t), in terms of the ontology O and the closeness function. If so, the model accepts t; otherwise, it rejects t.
Using OGFCs in supervised link prediction. Useful instance-level features can be extracted from the patterns and the matches induced by OGFCs to train classifiers. We develop a second, supervised model with the following specification. For each example t = ⟨v_x, r, v_y⟩, it constructs a feature vector of size k, where the i-th entry encodes whether the i-th OGFC in the top-k set covers t. The class label of the example t is true (resp. false) if t is a positive (resp. negative) training fact.
By default, the supervised model adopts logistic regression, which we experimentally verified to achieve slightly better performance than alternatives (e.g., Naive Bayes and SVM). We find that the supervised model outperforms the rule-based model over real-world graphs (see Sect. 5).
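The instance-level features can be sketched as follows; `covers` stands in for the ontology-aware test of whether a rule covers a fact, and all names are ours. The resulting matrix can be fed to any off-the-shelf logistic regression:

```python
def featurize(fact, topk_rules, covers):
    """Feature vector of size k (sketch): entry i is 1 iff the i-th
    top-k rule covers the fact; `covers(rule, fact)` is assumed to
    wrap the ontology-aware matching procedure."""
    return [1 if covers(rule, fact) else 0 for rule in topk_rules]

def build_dataset(pos, neg, topk_rules, covers):
    """Assemble the training matrix X and labels y from positive and
    negative example facts."""
    X = [featurize(f, topk_rules, covers) for f in pos + neg]
    y = [1] * len(pos) + [0] * len(neg)
    return X, y
```

This also makes the role of k concrete: it is exactly the feature dimensionality of the downstream classifier, which explains the under-/over-fitting behavior observed when k is varied in Exp-2.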

Experimental Study
Using real-world knowledge bases, we empirically evaluate the efficiency of OGFC discovery and the effectiveness of OGFC-based fact checking.
Ontologies. We extracted an ontology for each knowledge graph, either from its original knowledge base sources or, for datasets without external ontologies, by extending graph summarization [40] to construct one. Specifically, we start with a set of manually selected seed concept labels (e.g., conferences, institutions, and authors) and extend the ontologies by grouping labels that frequently co-occur with them in the node content (e.g., venues, universities, collaborators). We manually cleaned these ontologies to ensure their applicability.

Methods.
We implemented the following methods in Java: (1) the stream-based discovery algorithm, compared with (a) its "Batch + Greedy" counterpart (Sect. 4), (b) AMIE+ [14], which discovers rules, (c) PRA [22], the path ranking algorithm that trains classifiers with path features from random walks, and (d) a variant of PRA [37] that makes use of features from discriminant paths; (2) the rule-based and supervised OGFC fact checking models, compared with the learning-based counterparts of AMIE+, PRA, and the PRA variant, respectively. For practical comparison, we set a pattern size (the number of pattern edges) bound b = 4 for discovery.
Ontology closeness. We apply weighted path lengths in the closeness function, in which each edge on a path has a weight according to one of the three types of relations (Sect. 2.1). (1) Given ontology O and two labels l and l′, the closeness of l and l′ is defined as 1 − dist(l, l′), where dist(l, l′) is the sum of the weights on the shortest path between l and l′ in O, normalized to the range [0, 1]. (2) Given a closeness threshold, the ontologically close labels of l are all the concept labels from O with closeness to l no less than the threshold. By default, we set the threshold to 1, i.e., the subgraph patterns enforce label equality by requiring dist(l, l′) = 0, which coincides with GFCs. As will be shown, varying the threshold provides a trade-off between discovery cost and model accuracy.
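A sketch of this closeness computation, using Dijkstra's algorithm over the ontology; the edge weights are assumed to be already normalized so that path lengths fall in [0, 1], and the function name is ours:

```python
import heapq

def closeness(ontology_edges, l1, l2):
    """Ontology closeness (sketch): 1 - dist(l1, l2), where dist is the
    weighted shortest-path length between two labels in the (undirected)
    ontology. Edge weights are assumed to reflect the relation type and
    to be pre-normalized so that distances lie in [0, 1]."""
    adj = {}
    for a, b, w in ontology_edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist = {l1: 0.0}
    heap = [(0.0, l1)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == l2:                     # first pop of l2 is the shortest
            return 1.0 - d
        if d > dist.get(u, float('inf')):
            continue                    # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return 0.0  # unreachable labels: closeness 0
```

Since the ontology is small, this per-pair cost is negligible; as noted in Sect. 4, all pairwise closeness values can also be precomputed once and cached.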
Model configuration. For a fair comparison, we calibrated the models and the training/testing sets with consistent settings. (1) For the supervised link prediction methods (our supervised model, PRA, and the PRA variant), we sample 80% of the facts in a knowledge graph as the training set and 20% as the testing set. For example, we use in total 107 triple patterns over one dataset, and each triple pattern has 5K-50K instances. In the training (resp. testing) set, 20% are true examples and 80% are false examples. We generate the false examples under the ontology-based scheme (Sect. 3) for all the models. For all methods, we use logistic regression to train the classifiers (the default setting of PRA and its variant).
(2) For the rule-based model and AMIE+, we discover rules that cover the same set of true training facts. We set the size of AMIE+ rule bodies to 3, comparable to the number of pattern edges in our work. (3) We evaluate the impact of the ontology closeness constraints on the efficiency and effectiveness of OGFC-based models by varying the closeness threshold.
Overview of results. We find the following. (1) It is feasible to discover OGFCs in large graphs (Exp-1). For example, it takes 211 seconds for the stream-based algorithm to discover OGFCs over a graph with 4 million edges, given 3000 training facts; on average, it outperforms AMIE+ by 3.4 times. (2) OGFCs can improve the accuracy of fact checking models (Exp-2). For example, they achieve an additional 30%, 20%, and 5% gain in precision, and 20%, 15%, and 16% gain in F1 score, compared with AMIE+, PRA, and the PRA variant, respectively. (3) The ontological closeness and its threshold enable trade-offs between discovery cost and effectiveness. With a smaller threshold, the stream algorithm takes more time to discover OGFCs, but the resulting rules cover more training instances and verify more missing facts than their counterparts induced by a larger threshold. (4) Our case study shows that OGFCs yield interpretable models (Exp-3). We next report the details of our findings.

Exp-1: Efficiency.
We report the efficiency of the stream-based algorithm, compared with PRA, AMIE+, and the batch algorithm, and study the efficiency under ontology closeness by varying the closeness threshold. The PRA variant is omitted, since it has unstable learning time and is not comparable.
(1) Figure 4a shows that the stream algorithm is on average 3.2 (resp. 4.1) times faster than AMIE+ (resp. the batch algorithm), due to its approximate matching scheme and top-k selection strategy. (2) Although AMIE+ is faster than the batch algorithm over smaller graphs, we find that it returns few rules due to low support; enlarging the rule size (e.g., to 5), AMIE+ does not run to completion. (3) The cost of PRA is less sensitive because it samples a (predefined) fixed number of paths, but it does not perform well (Fig. 4b).
(4) The stream algorithm outperforms the other three methods except when |E| = 0.4M (Fig. 4b), because it incurs an overhead to compute and allocate sieves while too few rules can be discovered in small data.
Varying σ. Fixing |E| = 1.8M, 15K positive training facts, confidence threshold 0.005, and k = 200, we varied σ from 0.05 to 0.25. As shown in Fig. 4e, the batch algorithm takes longer over smaller σ, since more patterns and candidates need to be verified. On the other hand, the stream algorithm is much less sensitive, as it terminates early without verifying all patterns.
Varying k. Fixing |E| = 1.8M, σ = 0.1, and confidence threshold 0.005, we varied k from 200 to 1000. Figure 4f shows that the stream algorithm is more sensitive to k, since it takes longer to find the k best patterns for each sieve value. Although the batch algorithm is less sensitive, its major bottleneck is the verification cost. In addition, we found that with a larger support threshold, fewer patterns are needed; thus the stream algorithm takes less time.
Varying the closeness threshold. We next evaluate the impact of the closeness threshold on the cost of OGFC discovery. Fixing |E| = 1.2M, σ = 0.01, confidence threshold 0.005, b = 4, and k = 200 for one dataset, and |E| = 1.5M, σ = 0.1, confidence threshold 0.005, b = 4, and k = 200 for the other, we varied the threshold from 1 to 0.5. On the one hand, Fig. 4g, h shows that both algorithms take longer to discover OGFCs under smaller thresholds, because the pattern verification cost increases with the additional candidates introduced by the closeness function. On the other hand, the stream algorithm improves over the batch algorithm more under smaller thresholds, due to its early termination: it is 2.1, 4.4, and 8.14 times faster when the threshold is 1 (label equality), 0.75, and 0.5, respectively. Compared with the label-equality setting, the batch algorithm is on average 5 times slower when the threshold is 0.75 and 15 times slower when it is 0.5. Indeed, enabling ontologies enlarges the positive training set, which takes longer to process, as verified in Fig. 4c, d. However, a benefit of OGFCs is to build a unified model for multiple triple patterns r(x, y), rather than a separate model for each r(x, y) as in PRA. In practice, users can select an applicable threshold (closer to 1) to avoid including too many similar labels.
Exp-2: Accuracy. We report the accuracy of all the models in Table 1.
Rule-based models. We apply the same support threshold σ = 0.1 for AMIE+ and the rule-based model, and set confidence threshold 0.005 and k = 200 for the latter. We sample 20 triple patterns and report the average accuracy. As shown in Table 1, the rule-based model consistently improves over AMIE+ with up to 21% gain in prediction rate, and achieves comparable performance in the other cases. We found that AMIE+ reports rules with high support that are not necessarily meaningful, while OGFCs capture more meaningful context (see Exp-3). Both models have relatively high recall but low precision: they have a better chance of covering missing facts, but may introduce errors when hitting false facts.
Supervised models. We next compare the supervised OGFC model with the supervised link prediction models. It achieves the highest prediction rates and F1 scores, outperforming PRA with 12% gain in precision and 23% gain in recall on average, and the PRA variant with 16% gain in precision and 19% in recall. Indeed, it extracts useful features from OGFCs with both high significance and diversity, beyond path features.
We next evaluate the impact of several factors on model accuracy, and study the impact of ontology closeness by varying the closeness threshold (Fig. 5).
Varying σ and the confidence threshold. For the first dataset, fixing |E| = 2.0M, 135K positive training facts, and k = 200, we varied σ from 0.05 to 0.25 and compared patterns with confidence 0.02 and 0.04, respectively, as shown in Fig. 5a. For the second, fixing |E| = 1.5M, 250K positive training facts, and k = 200, we varied σ from 0.05 to 0.25 and compared patterns with confidence 0.001 and 0.002, respectively, as shown in Fig. 5b. Both figures show that the supervised and rule-based models have lower prediction rates when the support threshold is higher or the confidence threshold is lower. Fewer patterns can be discovered with higher σ, leading to more "misses" of facts, while higher confidence leads to stronger associations between patterns and more accurate predictions. In general, the supervised model achieves a higher prediction rate than the rule-based one.
Varying the number of positive facts. For the first dataset, fixing |E| = 2.0M, support threshold 0.001, confidence threshold 5 × 10−5, and k = 200, we vary the number of positive training facts from 75K to 135K, as shown in Fig. 5c; for the second, fixing |E| = 1.5M, σ = 0.01, confidence threshold 0.005, and k = 200, we vary it from 50K to 250K, as shown in Fig. 5d.
Varying k. For the first dataset, fixing |E| = 2.0M, 135K positive training facts, support threshold 0.001, and confidence threshold 5 × 10−5, we varied k from 50 to 250. Figure 5e shows that the prediction rate first increases and then decreases. For the rule-based model, more rules increase accuracy by covering more true facts, while also increasing the risk of hitting false facts. For supervised link prediction, the model underfits with too few features when k is small and overfits with too many features when k is large. We observe that k = 200 is the best setting for a high prediction rate; this also explains the need for top-k discovery instead of a full enumeration of graph patterns. Figure 5f verifies an interesting observation: smaller patterns contribute more to recall and larger patterns contribute more to precision, because smaller patterns are more likely to "hit" new facts, while larger patterns impose stricter constraints for the correct prediction of true facts.
Varying the closeness threshold. Using the same settings as in Fig. 4g, h, we report the impact of the closeness threshold on the accuracy of OGFC-based models. Figure 5g, h shows that with a smaller threshold, both models achieve higher recall while retaining reasonable precision. Indeed, a smaller threshold allows rules to be learned from more training examples and to cover more missing facts. As the threshold is varied from 1 to 0.75 (resp. 0.5), on the first dataset the recall of the supervised model increases from 45% to 56% (resp. 70%) with at most 5% (resp. 9%) loss in precision; on the second, its recall increases from 31% to 42% (resp. 55%) with at most 1% (resp. 4%) loss in precision. Note that with threshold 1, the results are the same as using GFCs without ontologies, which have much lower recall than OGFCs. This justifies the benefit of introducing ontological matching.

Exp-3: Case study.
We perform case studies to evaluate the applications of OGFCs.
Test cases. A test case consists of a triple pattern r(x, y) and a set of test facts that are ontologically close to r(x, y).
According to the type information on nodes and edges, the triple patterns are categorized as follows: (a) Functional cases refer to functional predicates (a "one-to-one" mapping) of the relationship r between nodes v_x and v_y. For a relationship such as "capitalOf", two locations can only map to each other through it; for example, "London" is the capital of the "UK".

(b) Pseudo-functional predicates can be "one-to-many", but have high functionality (a.k.a. "usually functional"). For example, relationships like "graduatedFrom" are not necessarily functional, but are functional for many "persons".
Accuracy. We show 30 r(x, y) cases from each category and report their overall F1 scores in Table 2. Non-functional cases are those that allow "many-to-many" relations, for which the functionality assumption of [14] may not hold. We found that the OGFC-based models perform well for all test cases, especially the non-functional ones. Indeed, relaxing label equality via ontology closeness, in both pattern matching and the ontology-based generation of training facts, helps improve the recall of the fact checking models without losing much precision (Fig. 5g, h), and the graph patterns of OGFCs mitigate the non-functional cases with enriched context.
Interpretability. We further illustrate three top OGFCs in Fig. 6, which contribute highly important features with high confidence and significance, over a real-world financial network and two knowledge graphs. (1) The first rule, P_3(x, y) → (company, company), states that two (anonymous) companies are likely to have the same name and registration date if they share a shareholder and a beneficiary, and one is registered within a jurisdiction in Panama while the other is active in Panama. It has support 0.12 and confidence 0.0086, and is quite significant. For the same r(x, y), AMIE+ discovers a top rule stating that (x, Jurisdiction_in_Panama) ∧ (y, Jurisdiction_in_Panama) implies that x and y have the same name and registration date; this rule has a low prediction rate.
(2) The second rule, P_4(x, y) → (TVShow, film), states that a TV show and a film have relevant content if they share a common language, authors, and producers. It has support 0.15 and high confidence and significance scores. Within size bound 3, AMIE+ reports a top rule of the form (x, z) ∧ (y, z) → (x, y), which has low accuracy. Our rule also identifies relevant relationships between BBC programs (e.g., "BBC News at Six") and other programs relevant to "TVShow" and "Film", respectively, enabled by ontological matching; these facts cannot be captured by the path-based methods. (3) The third rule, P_5(x, y) → (writer, philosopher), states that a writer v_x influences a philosopher v_y if v_x influences a philosopher p and a scholar s, who both influence v_y. It identifies true facts such as ⟨Bertrand Russell, influences, Ludwig Wittgenstein⟩, the influence between a logician and a philosopher, enabled by ontological matching following O_2.

Conclusion
We have introduced OGFCs, a class of rules that incorporate graph patterns to predict facts in knowledge graphs. We developed an ontology-aware rule discovery algorithm that finds useful OGFCs for observed true and false facts by selecting the top discriminant graph patterns generated in a stream. We have shown that OGFCs can readily be applied as rule models or can provide useful instance-level features for supervised link prediction; a benefit of enabling ontologies is to build a unified model for multiple triple patterns. Our experimental study over real-world graphs has verified the effectiveness and efficiency of the OGFC-based techniques. One future topic is to extend OGFC techniques to entity resolution, social recommendation, anomaly detection, and data imputation. A second direction is to extend the model to cope with multi-label knowledge graphs or property graphs. A third is to develop scalable OGFC-based models and methods with parallel graph mining and distributed rule learning.