To address the problem of contextually matching RDF entities, we propose COMET, a context-aware RDF molecule matching technique (Fig. 5). COMET is grounded on the semantic data integration techniques proposed by Collarana et al. [84], whose work matches and merges semantically similar RDF molecules using semantic similarity metrics and fusion policies. Our work reuses the concept of RDF molecules and the entity matching component of that technique, but contributes a new approach that takes the context of the system into consideration while matching entities. COMET is thus an entity matching framework designed to create, identify, and match contextually equivalent RDF entities.
4.1 Problem Definition
RDF Molecule [84] – If \(\varPhi (G)\) is a given RDF Graph, we define RDF Molecule M as a subgraph of \(\varPhi (G)\) such that,
$$\begin{aligned} M = \{t_1,\dots , t_n\} \end{aligned}$$
$$\begin{aligned} \forall \quad i,j \in \{1,\dots ,n\} \big ( subject(t_i) = subject(t_j) \big ) \end{aligned}$$
Where \(t_1, t_2, \dots , t_n\) denote the triples in M. In other words, an RDF Molecule M consists of triples which have the same subject. That is, it can be represented by a tuple M = (R, T), where R denotes the URI of the molecule’s subject, and T denotes a set of property and value pairs p = (prop, val) such that the triple (R, prop, val) belongs to M. For example, the RDF molecule for Arnold Schwarzenegger is (dbr:Arnold-Schwarzenegger, { (dbo:occupation, Politician), (dbp:title, Governor)}). An RDF Graph \(\varPhi (G)\) described in terms of RDF molecules is defined as follows:
$$\begin{aligned} \varPhi (G) = \{ M = (R,T) | t = (R, prop, val) \in G \wedge (prop, val) \in T \} \end{aligned}$$
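The molecule representation above can be illustrated with a minimal sketch that groups triples by subject. The function name `to_molecules` and the identifier `dbr:Donald-Trump` are hypothetical and chosen only for illustration; they are not part of the COMET implementation.

```python
from collections import defaultdict

def to_molecules(triples):
    """Group (subject, property, value) triples into molecules M = (R, T)."""
    grouped = defaultdict(set)
    for subject, prop, val in triples:
        grouped[subject].add((prop, val))
    # R is the subject URI, T the set of (prop, val) pairs sharing that subject.
    return {subject: (subject, frozenset(pairs)) for subject, pairs in grouped.items()}

# Illustrative triples; the second subject is an assumed identifier.
triples = [
    ("dbr:Arnold-Schwarzenegger", "dbo:occupation", "Politician"),
    ("dbr:Arnold-Schwarzenegger", "dbp:title", "Governor"),
    ("dbr:Donald-Trump", "dbo:occupation", "Politician"),
]
molecules = to_molecules(triples)
```

Each value of `molecules` corresponds to one tuple (R, T), so the whole dictionary plays the role of \(\varPhi (G)\) in this sketch.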
Context – We define a context C as any Boolean expression which represents the criteria of a system. Two entities, such as RDF molecules \(M_1\) and \(M_2\), can be either similar or not similar with respect to a given context. That is, C is a Boolean function that takes as input two molecules \(M_1\) and \(M_2\) and returns true if they are similar according to the system context, and false otherwise. Below is an example of a context C, modeled after the example presented in Fig. 1, where two molecules are similar if they have the same occupation. If \(P = (p, v)\) is the property-value pair representing the occupation of a molecule, then the context is defined as:
$$\begin{aligned} C(M_1, M_2)={\left\{ \begin{array}{ll} \texttt {true}, &{} \text {if } P \in M_1 \wedge P \in M_2.\\ \texttt {false}, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
Depending on the requirements of the integration scenario, this context can be any Boolean expression.
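As a rough illustration, the occupation context can be written as an ordinary Boolean function over two molecules. This is only a sketch under the assumption that a molecule is a pair (R, T) with T a set of (prop, val) pairs; the name `same_occupation` and the example subjects are hypothetical.

```python
def same_occupation(m1, m2, prop="dbo:occupation"):
    """Context C: true iff both molecules share a (prop, val) pair for `prop`."""
    _, pairs1 = m1
    _, pairs2 = m2
    values1 = {val for p, val in pairs1 if p == prop}
    values2 = {val for p, val in pairs2 if p == prop}
    return bool(values1 & values2)

# Illustrative molecules (subject names are assumptions).
politician_a = ("g:Arnold.S", frozenset({("dbo:occupation", "Politician")}))
politician_b = ("d:Donald.T", frozenset({("dbo:occupation", "Politician")}))
actor = ("d:Arnold.S", frozenset({("dbo:occupation", "Actor")}))

in_context = same_occupation(politician_a, politician_b)
out_of_context = same_occupation(politician_a, actor)
```

Any other Boolean function with the same two-molecule signature could be supplied in its place.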
Semantic Similarity Function – Let \(M_1\) and \(M_2\) be any two RDF molecules. Then the semantic similarity function \(\textit{Sim}_f\) is a function that measures the semantic similarity between these two molecules and returns a value in the interval [0, 1]. A value of 0 expresses that the two molecules are completely dissimilar and a value of 1 expresses that the molecules are identical. Such a similarity function is defined in GADES [371].
Contextually Equivalent RDF Molecule – Let \(\varPhi (G)\) and \(\varPhi (D)\) be two sets of RDF molecules. Let \(M_G\) and \(M_D\) be two RDF molecules from \(\varPhi (G)\) and \(\varPhi (D)\), respectively. Then, \(M_G\) and \(M_D\) are defined as contextually equivalent iff
1. They are in the same context. That is, \(C(M_G, M_D) = \texttt {true}\).
2. They have the highest similarity value, i.e., \(\textit{Sim}_f(M_G, M_D) = \max _{m \in \varPhi (D)} \textit{Sim}_f(M_G, m)\).
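The two conditions can be checked with a short sketch. It assumes a Jaccard overlap of property-value pairs as a simple stand-in for \(\textit{Sim}_f\) (GADES is not reproduced here) and a same-occupation context; all function names and example subjects are hypothetical.

```python
def jaccard(m1, m2):
    """Stand-in similarity in [0, 1]: overlap of (prop, val) pairs. Not GADES."""
    t1, t2 = m1[1], m2[1]
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 1.0

def same_occupation(m1, m2, prop="dbo:occupation"):
    """Example context C: both molecules share a value for `prop`."""
    v1 = {val for p, val in m1[1] if p == prop}
    v2 = {val for p, val in m2[1] if p == prop}
    return bool(v1 & v2)

def contextually_equivalent(m_g, m_d, phi_d, context, sim):
    """Condition 1: same context. Condition 2: m_d maximizes sim(m_g, .) over phi_d."""
    if not context(m_g, m_d):
        return False
    return sim(m_g, m_d) == max(sim(m_g, m) for m in phi_d)

m_g = ("g:Arnold.S", frozenset({("dbo:occupation", "Politician"), ("dbp:title", "Governor")}))
donald = ("d:Donald.T", frozenset({("dbo:occupation", "Politician"), ("dbp:title", "President")}))
arnold_actor = ("d:Arnold.S", frozenset({("dbo:occupation", "Actor")}))
phi_d = [donald, arnold_actor]

match_donald = contextually_equivalent(m_g, donald, phi_d, same_occupation, jaccard)
match_actor = contextually_equivalent(m_g, arnold_actor, phi_d, same_occupation, jaccard)
```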
Let \(F_c\) be an idealized set of contextually integrated RDF molecules from \(\varPhi (G)\) and \(\varPhi (D)\). Let \(\theta _C\) be a homomorphism such that \( \theta _C : \varPhi (G) \cup \varPhi (D) \rightarrow F_c\). Then there is an RDF molecule \(M_F\) from \(F_c\) such that \(\theta _C(M_D) = \theta _C(M_G) = M_F\). In the motivating example, this means that the molecule of Arnold Schwarzenegger, the politician, is contextually equivalent to the molecule of Donald Trump, as they are similar and they satisfy the context condition of having the same occupation.
In this work, we tackle the problem of explicitly modeling the context and then matching RDF molecules from RDF graphs that are both highly similar and equivalent in terms of this context. This problem is defined as follows: given RDF graphs \(\varPhi (G)\) and \(\varPhi (D)\), let \(M_G\) and \(M_D\) be two RDF molecules such that \(M_G \in \varPhi (G)\) and \(M_D \in \varPhi (D)\). The system is supplied with a context parameter C, which is a Boolean function evaluating if two molecules are in the same context. It is also supplied with a similarity function \(\textit{Sim}_f\), which evaluates the semantic similarity between \(M_G\) and \(M_D\).
The problem of creating a contextualized graph \(\varPhi _{C}\) consists of building a homomorphism \( \theta _C : \varPhi (G) \cup \varPhi (D) \rightarrow F_c\), such that no two RDF molecules belonging to \(\varPhi _C\) are contextually equivalent according to the system context C. If \(M_G\) and \(M_D\) are contextually equivalent molecules belonging to \(F_c\), then \(\theta _C(M_G) = \theta _C(M_D)\), otherwise \(\theta _C(M_G) \ne \theta _C(M_D)\).
An example of this problem is illustrated in Figure X, which depicts a use case with two RDF graphs and a single context condition C. With respect to C, the RDF molecule Arnold.S from \(\varPhi (G)\) is in the same context as Donald.T from \(\varPhi (D)\), but not in the same context as the molecule Arnold.S from \(\varPhi (D)\). The problem is therefore to identify a homomorphism \(\theta _C\) which evaluates the RDF molecules based on the system context and maps them in a way that they can be integrated into a contextualized graph.
4.2 The COMET Architecture
We propose COMET, an approach to match contextually equivalent RDF entities across RDF graphs according to a given context, thus providing a solution to the problem of contextually matching RDF graphs. Figure 6 depicts the main components of the COMET architecture. COMET follows a two-fold approach to solve the problem of entity matching in RDF graphs in a context-aware manner. First, COMET computes the similarity measures across RDF entities and resorts to the Formal Concept Analysis algorithm to map contextually equivalent RDF entities. Second, COMET combines the results of the first step and executes a 1-1 perfect matching algorithm for matching RDF entities based on the combined scores, and finally synthesizes the matching into a contextualized RDF graph.
4.3 Identifying Contextually Equivalent Entities
Building a Bipartite Graph. The COMET pipeline receives two RDF graphs \(\varPhi (G), \varPhi (D)\) as input, along with a context parameter C and a similarity function \( Sim _f\). COMET first constructs a bipartite graph between the sets \(\varPhi (G)\) and \(\varPhi (D)\). The Dataset Partitioner employs the similarity function \( Sim _f\) and an ontology O to compute the similarity between RDF molecules in \(\varPhi (G)\) and \(\varPhi (D)\), assigning the similarity scores as edge weights in the bipartite graph. COMET allows for arbitrary, user-supplied similarity functions that leverage different algorithms to estimate the similarity between RDF molecules. Thus, COMET supports a variety of similarity functions, including simple string similarity. However, as shown in [84], semantic similarity measures are advocated (in the implementation of this work we use GADES [371]) as they achieve better results by considering the semantics encoded in RDF graphs.
After the similarity between RDF molecules has been computed, the result of the similarity function is tested against a threshold \(\gamma \), the minimum acceptable similarity score, to determine entity similarity. Edges whose weights are lower than \(\gamma \) are discarded from the bipartite graph. A threshold equal to 0.0 does not impose any restriction on the values of similarity; thus the bipartite graph includes all the edges. High thresholds, e.g. 0.8, restrict the values of similarity, resulting in a bipartite graph comprising just a few edges.
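The construction and thresholding of the bipartite graph can be sketched as follows. The sketch assumes a Jaccard overlap as a stand-in for \( Sim _f\) and omits the ontology O; `build_bipartite_edges` and the example molecules are hypothetical.

```python
def jaccard(m1, m2):
    """Stand-in similarity in [0, 1]: overlap of (prop, val) pairs. Not GADES."""
    t1, t2 = m1[1], m2[1]
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 1.0

def build_bipartite_edges(phi_g, phi_d, sim, gamma):
    """Weight every cross pair with sim and keep edges with weight >= gamma."""
    edges = {}
    for m_g in phi_g:
        for m_d in phi_d:
            score = sim(m_g, m_d)
            if score >= gamma:  # edges below the threshold are discarded
                edges[(m_g[0], m_d[0])] = score
    return edges

phi_g = [
    ("g:1", frozenset({("p", "a"), ("q", "b")})),
    ("g:2", frozenset({("p", "c")})),
]
phi_d = [
    ("d:1", frozenset({("p", "a"), ("q", "b")})),
    ("d:2", frozenset({("p", "c"), ("q", "b")})),
]

all_edges = build_bipartite_edges(phi_g, phi_d, jaccard, gamma=0.0)
strong_edges = build_bipartite_edges(phi_g, phi_d, jaccard, gamma=0.5)
```

With a threshold of 0.0 all four candidate edges survive, whereas a threshold of 0.5 keeps only the two strongest ones.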
Pruning RDF Entities According to Context. The main step in the COMET pipeline is to validate and prune pairs of RDF molecules that do not comply with the input context C, making COMET a context-aware approach. For identifying contextually equivalent RDF entities, the Context Validator component employs the Formal Concept Analysis (FCA) algorithm. FCA is the study of binary data tables that describe the relationship between objects and their attributes. Applying this context validation step over the RDF molecules ensures that only contextually relevant tuples are kept. In COMET, context is modeled as any Boolean function. Two molecules are matched if they satisfy this condition; otherwise they are not matched. The algorithm by V. Vychodil [451] is applied in COMET; it performs formal concept analysis to compute formal concepts within a set of objects and their attributes. This algorithm is extended in our approach for validating complex Boolean conditions. A typical formal concept analysis table is shown in Table 1.
Table 1. Object-Attribute table for performing FCA.
Instead of using attributes as the columns of the FCA matrix, our approach replaces the attributes with the Boolean conditions that make up the context condition C. For example, the context C from the motivating example can be broken down into \(C = C_1 \wedge C_2\) where \(C_1\) = “contains property dbo:occupation”, and \(C_2\) = “has the same value for property dbo:occupation”. The execution of the FCA algorithm remains unchanged by this adaptation since the format of the input to FCA is still a binary matrix.
When applied to RDF molecules, formal concept analysis returns a set of formal concepts \(\langle M, C \rangle \) where M is the set of all molecules that satisfy all conditions contained in C. That is, by applying FCA, the set of molecules that satisfy a certain context condition can be obtained. Thus, the molecules that do not meet the context condition are pruned. In Fig. 7, an example of context validation is demonstrated. Edges in a bipartite graph are filtered according to a threshold value \(\gamma \) as detailed in the previous section. Next, the remaining edges are validated by constructing an FCA matrix according to the context condition C. The FCA algorithm returns the edges satisfying the context conditions. The edges that do not satisfy the context condition are discarded.
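The context validation step can be approximated with a small sketch that builds the binary matrix over candidate edges and reads off the extent of the concept whose intent is \(\{C_1, C_2\}\). This is a naive derivation operator, not the algorithm of [451]; the helper names and example molecules are assumptions made for illustration.

```python
OCCUPATION = "dbo:occupation"

def values(molecule, prop):
    """Return the set of values a molecule (R, T) holds for a property."""
    _, pairs = molecule
    return {val for p, val in pairs if p == prop}

# Columns of the binary matrix: the sub-conditions C1 and C2 of the context C.
CONDITIONS = {
    "C1": lambda a, b: bool(values(a, OCCUPATION)) and bool(values(b, OCCUPATION)),
    "C2": lambda a, b: bool(values(a, OCCUPATION) & values(b, OCCUPATION)),
}

def build_table(pairs):
    """Rows are candidate edges, columns are conditions, cells are booleans."""
    return {
        (a[0], b[0]): {name: cond(a, b) for name, cond in CONDITIONS.items()}
        for a, b in pairs
    }

def extent(table, conditions):
    """All rows (edges) that satisfy every condition in `conditions`."""
    return {obj for obj, row in table.items() if all(row[c] for c in conditions)}

arnold_g = ("g:Arnold.S", frozenset({(OCCUPATION, "Politician")}))
donald_d = ("d:Donald.T", frozenset({(OCCUPATION, "Politician")}))
arnold_d = ("d:Arnold.S", frozenset({(OCCUPATION, "Actor")}))

table = build_table([(arnold_g, donald_d), (arnold_g, arnold_d)])
kept_edges = extent(table, ["C1", "C2"])
```

Edges outside `kept_edges` are the ones that would be pruned from the bipartite graph.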
4.4 The 1-1 Perfect Matching Calculator
COMET solves the problem of context-aware entity matching by computing a 1-1 weighted perfect matching between the sets of RDF molecules. The input of the 1-1 weighted perfect matching component is the weighted bipartite graph created in the previous step. Since the weight of each edge between two RDF molecules corresponds to a combined score of semantic similarity and context equivalence, we call this component a 1-1 context-aware matching calculator. The effect of this 1-1 context-aware matching calculator is demonstrated in Fig. 9. Finally, a combinatorial optimization algorithm like the Hungarian algorithm [267] is utilized to compute the matching.
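The effect of a 1-1 matching on the pruned bipartite graph can be illustrated with a tiny exhaustive search. Note that this brute-force search is only a stand-in for the Hungarian algorithm: it finds the same optimum on small inputs but scales factorially. The function name, node names, and weights are hypothetical.

```python
from itertools import permutations

def best_one_to_one(left, right, weights):
    """Exhaustively find the 1-1 assignment with maximum total weight.

    Assumes len(left) <= len(right); pruned (missing) edges count as weight 0.
    """
    best_score, best_pairs = -1.0, []
    for perm in permutations(right, len(left)):
        pairs = list(zip(left, perm))
        score = sum(weights.get(pair, 0.0) for pair in pairs)
        if score > best_score:
            best_score, best_pairs = score, pairs
    # Report only pairs backed by an edge that survived pruning.
    return [pair for pair in best_pairs if pair in weights]

left = ["g:1", "g:2"]
right = ["d:1", "d:2"]
weights = {("g:1", "d:1"): 0.9, ("g:1", "d:2"): 0.8, ("g:2", "d:1"): 0.85}

matching = best_one_to_one(left, right, weights)
```

In this example a greedy choice of the heaviest edge (g:1, d:1) would leave g:2 unmatched, while the optimal assignment matches both nodes. For realistic input sizes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be a more appropriate choice than exhaustive search.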
4.5 Integration Use Case: Applying Fusion Policies
In order to apply this context-aware entity matching pipeline in a data integration scenario, we envision the usage of the fusion policies defined by Collarana et al. [84]. To consolidate entities identified as contextually equivalent, COMET can make use of synthesis policies, i.e., user-supplied functions that define how the RDF molecules should be combined to form a connected whole. COMET can adopt the following synthesis policies:
1. The Union Policy, which includes all predicate-object pairs, removing those that are syntactically identical;
2. The Linking Policy, which produces owl:sameAs links between contextually equivalent RDF molecules;
3. The Authoritative Policy, which allows for defining one RDF graph as a prevalent source, selecting its properties in case of property conflicts, i.e., properties annotated as owl:FunctionalProperty, equivalent properties owl:equivalentProperty, and equivalent classes annotated with owl:sameAs or owl:equivalentClass.
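The first two policies can be sketched in a few lines. This is a simplified illustration, not the implementation of [84]: it assumes molecules are pairs (R, T), keeps the subject of the first molecule in the union (an arbitrary choice made here), and emits a single owl:sameAs triple for linking. The function names and example subjects are hypothetical.

```python
def union_policy(m1, m2):
    """Merge all (prop, val) pairs; the set union drops identical pairs."""
    return (m1[0], m1[1] | m2[1])

def linking_policy(m1, m2):
    """Keep both molecules and connect their subjects with owl:sameAs."""
    return [(m1[0], "owl:sameAs", m2[0])]

m_g = ("g:Arnold.S", frozenset({("dbo:occupation", "Politician"), ("dbp:title", "Governor")}))
m_d = ("d:Donald.T", frozenset({("dbo:occupation", "Politician"), ("dbp:title", "President")}))

merged = union_policy(m_g, m_d)
links = linking_policy(m_g, m_d)
```

The Authoritative Policy is not sketched here, since resolving property conflicts depends on ontology annotations that this simplified molecule representation does not carry.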
By applying these policies, the end output is a synthesized graph with linked entities that are contextually equivalent. In the next chapter, we take a look at another use case of context-aware entity matching: the temporal summarization of knowledge graph entities.