Considering Semantics on the Discovery of Relations in Knowledge Graphs

  • Ignacio Traverso-Ribón
  • Guillermo Palma
  • Alejandro Flores
  • Maria-Esther Vidal
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10024)

Abstract

Knowledge graphs encode semantic knowledge that can be exploited to enhance different data-driven tasks, e.g., query answering, data mining, ranking, or recommendation. However, knowledge graphs may be incomplete, and relevant relations may not be included in the graph, affecting the accuracy of these data-driven tasks. We tackle the problem of relation discovery in a knowledge graph, and devise \(\mathcal {KOI}\), a semantics-based approach able to discover relations in portions of knowledge graphs that comprise similar entities. \(\mathcal {KOI}\) exploits both datatype and object properties to compute the similarity among entities, i.e., two entities are similar if their datatype and object properties have similar values. \(\mathcal {KOI}\) implements graph partitioning techniques that exploit similarity values to discover relations from knowledge graph partitions. We conduct an experimental study on a knowledge graph of TED talks with state-of-the-art similarity measures and graph partitioning techniques. The observed results suggest that \(\mathcal {KOI}\) is able to discover missing edges between related TED talks that cannot be discovered by state-of-the-art approaches. These results reveal that combining the semantics encoded in similarity measures with the semantics encoded in the knowledge graph structure has a positive impact on the relation discovery problem.

Keywords

Relation discovery · Semantic similarity · Graph partitioning

1 Introduction

Following Linked Data initiatives and exploiting features of Semantic Web technologies, large volumes of data are publicly available in the form of knowledge graphs, usually described using the RDF data model, e.g., DBpedia or YAGO. Simultaneously, data-driven applications that rely on knowledge graphs are progressively increasing [5]. However, as with traditional semi-structured data, knowledge graphs may be incomplete, either because relations among graph entities were unknown at the time the graph was created, or because the knowledge graph creation process failed to identify all existing relations. This situation encourages the development of techniques for the discovery of missing relations.

Discovering relations in knowledge graphs requires the analysis of both the semantics encoded in the knowledge graph, and the connectivity or structure of the represented relations. However, the majority of the state-of-the-art approaches are based either on the structure of the graph [2, 10], or on properties of the knowledge graph entities [7, 16]. Although some approaches combine both types of knowledge [21], they do not take into account domain semantics encoded in semantic similarity measures to discover missing relations [15].

In this paper we propose \(\mathcal {KOI}\), an approach for relation discovery in knowledge graphs that considers the semantics of both entities represented in the knowledge graph and their neighborhoods. \(\mathcal {KOI}\) receives as input a knowledge graph, and encodes the semantics about the properties of graph entities and their neighbors in a bipartite graph. Entity neighbors correspond to ego-networks, e.g., the friends of a person in a social network or the set of TED talks related to a given TED talk. \(\mathcal {KOI}\) partitions the bipartite graph into parts of highly similar entities connected to also similar ego-networks. Relations are discovered in these parts following the homophily prediction principle, which states that entities with similar characteristics tend to be related to similar entities [13]. Intuitively, the homophily prediction principle allows for relating two entities t1 and t2 whenever they have similar datatype and object property values (neighborhoods).

We evaluate the behavior of \(\mathcal {KOI}\) on a knowledge graph of TED talks; we crafted this knowledge graph by crawling data from the official TED website (http://www.ted.com/). We compare the relations discovered by \(\mathcal {KOI}\) with two baselines of relations identified by the METIS [9] and k-Nearest Neighbors (KNN) algorithms. We empowered KNN with statistical and semantic similarity measures (Sect. 6.3). Experimental outcomes suggest the following statements: (i) the semantics encoded in similarity measures and in the knowledge graph structure enhances the performance of relation discovery methods; and (ii) \(\mathcal {KOI}\) outperforms state-of-the-art approaches, obtaining higher values of precision and recall.

To summarize, the contributions of this paper are as follows:
  • \(\mathcal {KOI}\), a relation discovery method that implements graph partitioning techniques and relies on semantics encoded in similarity measures and graph structure to discover relations in knowledge graphs;

  • A knowledge graph describing TED talks crafted from the TED website; and

  • An empirical evaluation on a real-world knowledge graph of TED talks to analyze the performance of \(\mathcal {KOI}\) with respect to state-of-the-art approaches.

This paper comprises six additional sections. Section 2 motivates our approach with an example, and Sect. 3 introduces preliminary definitions. We explain our approach in Sect. 4 and the related work in Sect. 5. Section 6 reports on experimental results and describes the crafted TED knowledge graph. Section 7 concludes and presents future work ideas.

2 Motivating Example

In this section, we provide an example that motivates the relation discovery problem tackled in this paper. We consider relations between TED talks publicly available on the TED website. TED talks are described through textual properties, e.g., title, abstract, or tags, and through their relations with other talks, which are used to provide recommendations to users. Relations between talks are defined manually by TED curators, which is a time-expensive task prone to omissions. Therefore, automatic methods able to ease relation discovery and other curation tasks would be helpful.

We checked the TED website in 2015 and 2016, and compared both versions in order to detect relations between talks that are only represented in the newer version of the website. In total, we observed 62 relations that are included in 2016 but not present in the 2015 version, i.e., TED curators did not identify these relations until 2016. One example is the relation between the talks The politics of fiction and The shared wonder of film. Both talks are present in both versions of the website; however, only in 2016 is it possible to find a relation between them. Thus, we can conclude that relations between TED talks are missing in the 2015 version of the website. An approach able to discover these relations automatically would alleviate the effort of curators and improve the quality (completeness) of the data.

Though the relation between The politics of fiction and The shared wonder of film is not included in the 2015 website, the remaining knowledge about these talks allows for intuiting a high degree of relatedness between them. Both talks have keywords or tags in common, such as Culture or Storytelling. We also find expressions in their abstracts or descriptions that, though they do not match exactly, are clearly related, such as identity politics and cultural walls, or film and novel. Moreover, if their sets of related TED talks are compared, we observe that they share two related talks, The clues to a great story and The mystery box. Thus, related talks have properties in common. \(\mathcal {KOI}\) relies on this observation and exploits entity properties to discover missing relations between these entities.

3 Preliminaries

In this section we present definitions required to understand our approach.

Definition 1

(RDF Triple [1]). Let U be a set of RDF URI references, B a set of blank nodes, and L a set of RDF literals. A tuple \((s, p, o) \in (U \cup B) \times U \times (U \cup B \cup L)\) is an RDF triple, where s is called the subject, p the predicate, and o the object.

Definition 2

(Knowledge graph [18]). Given a set T of RDF triples, a knowledge graph is a pair \(G=(V, E)\), where \(V = \{s | (s, p, o) \in T\} \cup \{o | (s, p, o) \in T\}\) is a set of entities and \(E=\{(s, p, o) \in T\}\) a set of relations.

Fig. 1.

Portion of a knowledge graph of TED talks. Nodes represent TED talks, while dashed squares represent datatype property values.

Figure 1 shows a portion of a knowledge graph describing TED talks. The predicate vol:hasLink connects related talks, while the remaining predicates are datatype properties that connect talks with string literals.

Definition 3

(Ego-Network). Let \(G=(V, E)\) be a knowledge graph and \(L =\{p \mid (s, p, o) \in E\}\) be a set of predicates. Given an entity \(v_i \in V\) and a predicate \(r \in L\), the ego-network of \(v_i\) according to r is defined as the set of entities connected to \(v_i\) through an edge with predicate r: ego-net\((v_i, r)=\{v_j \mid (v_i,r, v_j)\; \in \;E \}\).

The ego-network of the entity ted:256 with respect to the predicate vol:hasLink (Fig. 1) is formed by entities ted:59, ted:73, and ted:184.
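To make Definitions 1-3 concrete, the following is a minimal sketch in Java (the paper's implementation language) of triples, a knowledge graph as a set of triples, and the ego-net function. The Triple record, the string-based representation of URIs and literals, and the toy data are simplifications introduced here for illustration; they are not the paper's actual data structures.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of Definitions 1-3: RDF triples, a knowledge graph as a set of
// triples, and the ego-network of an entity for a given predicate.
public class EgoNetworkExample {

    // An RDF triple (subject, predicate, object); URIs and literals are both
    // represented as plain strings for brevity.
    record Triple(String s, String p, String o) {}

    // ego-net(v, r) = { o | (v, r, o) is an edge of the graph }
    static Set<String> egoNet(Set<Triple> graph, String v, String r) {
        return graph.stream()
                .filter(t -> t.s().equals(v) && t.p().equals(r))
                .map(Triple::o)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Toy fragment resembling Fig. 1 (the edges are illustrative only).
        Set<Triple> g = Set.of(
                new Triple("ted:256", "vol:hasLink", "ted:59"),
                new Triple("ted:256", "vol:hasLink", "ted:73"),
                new Triple("ted:256", "vol:hasLink", "ted:184"),
                new Triple("ted:256", "dc:title", "The politics of fiction"));

        // Prints the set {ted:59, ted:73, ted:184} (in some order).
        System.out.println(egoNet(g, "ted:256", "vol:hasLink"));
    }
}
```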

4 Our Approach: \(\mathcal {KOI}\)

4.1 Problem Definition

Let \(G'=(V, E')\) and \(G=(V, E)\) be two knowledge graphs. \(G'\) is an ideal knowledge graph that contains all the existing relations between entities in V. G is the actual knowledge graph, which contains only a portion of the relations represented in \(G'\), i.e., \(E \subseteq E'\). Let \(\varDelta (E', E) = E' - E\) be the set of relations existing in the ideal graph that are not represented in the actual knowledge graph G, and let \(G_\text {comp}=(V, E_\text {comp})\) be the complete knowledge graph, which contains a relation for each possible combination of entities and predicates, i.e., \(E\subseteq E'\subseteq E_\text {comp}\).

Given a relation \(e \in \varDelta (E_\text {comp}, E)\), the relation discovery problem consists of determining if \(e \in E'\), i.e., if a relation e corresponds to an existing relation in the ideal graph \(G'\).

4.2 Our Solution

We propose \(\mathcal {KOI}\), a relation discovery method for knowledge graphs that considers semantics encoded in similarity measures and the knowledge graph structure. \(\mathcal {KOI}\) implements an unsupervised graph partitioning approach to identify parts of the graph from where relations are discovered. \(\mathcal {KOI}\) applies the homophily prediction principle to each part of the partitioned bipartite graph, in a way that two entities with similar characteristics are related to similar entities. Similarity values are computed based on: (a) the neighbors or ego-networks of two entities, and (b) their datatype property values (e.g., textual descriptions).

Figure 2 depicts the \(\mathcal {KOI}\) architecture. \(\mathcal {KOI}\) receives a knowledge graph \(G=(V,E)\) like the one shown in Fig. 1, two similarity measures \(S_v\) and \(S_u\), and a set of constraints S. As a result, \(\mathcal {KOI}\) returns a set of relations discovered in the input graph G. \(\mathcal {KOI}\) builds a bipartite graph BG(G, r) where each entity in V is connected with its ego-network according to the predicate r. Figure 3a contains the bipartite graph built from the knowledge graph in Fig. 1 according to the predicate vol:hasLink. By means of a graph partitioning algorithm and the similarity measures \(S_v\) and \(S_u\), \(\mathcal {KOI}\) identifies graph parts containing highly similar entities with highly similar ego-networks, i.e., similar entities that are highly connected in the original graph. According to the homophily prediction principle, \(\mathcal {KOI}\) produces candidate missing relations inside the identified graph parts. Figure 3b represents with red dashed lines the set of candidate discovered relations. Only those relations that satisfy the set of constraints S are considered discovered relations. Listing 1.1 shows an example of a constraint, and Fig. 4 includes the corresponding score values for each candidate relation.
Fig. 2.

\(\mathcal {KOI}\) Architecture. \(\mathcal {KOI}\) receives a knowledge graph G, two similarity measures \(S_v\) and \(S_u\), and a set S of constraints. BG(G, r) is a bipartite graph that represents relations between entities in G and their corresponding ego-networks built in terms of r. A graph partitioning algorithm is used to partition BG(G, r) into a set P of parts; each part corresponds to a portion of BG(G, r) where both entities and ego-networks are highly similar. Parts in P are used to identify candidate discovered relations CDR (red edges). Then, a constraint satisfaction step outputs in DR the relations in CDR that meet the constraints in S (green edges). (Color figure online)

Bipartite Graph Creation. Determining the membership of each relation \(e \in \varDelta (E_\text {comp},E)\) in \(E'\) is expensive in terms of time due to the large number of relations included in \(\varDelta (E_\text {comp},E)\), and may produce a large number of false positives. \(\mathcal {KOI}\) leverages the homophily intuition to tackle this problem by finding highly similar portions of the graph, i.e., portions including entities with similar ego-networks and similar datatype property values. In order to consider both similarities at the same time, \(\mathcal {KOI}\) builds a bipartite graph where each entity is associated with its ego-network. The objective is to find a partitioning of this graph such that each part contains highly similar entities and highly similar ego-networks. Thus, the \(\mathcal {KOI}\) graph partitioning problem is an optimization problem where these two similarities are maximized over the entities of each part.

Definition 4

(\(\mathcal {KOI}\) Bipartite Graph). Let \(G=(V, E)\) be a knowledge graph and \(L =\{p \mid (s, p, o) \in E\}\) be a set of predicates. Given a predicate \(r \in L\), the \(\mathcal {KOI}\) Bipartite Graph of G and r is defined as \(BG(G,r) = (V \cup U(r), E_{BG}(r))\), where \(U(r) = \{\text {ego-net}(v_i,r) \mid v_i \in V\}\) is the set of ego-networks of entities in V, and \(E_{BG}(r) = \{(v_i, u_i) \mid v_i \in V \wedge u_i = \text {ego-net}(v_i, r)\}\) is the set of edges that associate each entity with its ego-network.

Figure 3a shows a \(\mathcal {KOI}\) bipartite graph for the knowledge graph in Fig. 1.
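One possible reading of Definition 4 as code is sketched below. The BGEdge record and the decision to keep only entities with a non-empty ego-network are assumptions made for brevity; they are not part of the definition.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of Definition 4: BG(G, r) pairs every entity with its ego-network
// for predicate r. Entities are plain strings and ego-networks are string sets.
public class BipartiteGraphExample {

    record Triple(String s, String p, String o) {}

    // One edge (v_i, u_i) of the KOI bipartite graph: an entity and its ego-network.
    record BGEdge(String entity, Set<String> egoNet) {}

    static Set<String> egoNet(Set<Triple> g, String v, String r) {
        return g.stream()
                .filter(t -> t.s().equals(v) && t.p().equals(r))
                .map(Triple::o)
                .collect(Collectors.toSet());
    }

    // E_BG(r) = { (v, ego-net(v, r)) | v in V }; entities without outgoing
    // r-edges (including literals) are dropped here to keep the example small.
    static List<BGEdge> buildBipartiteGraph(Set<Triple> g, String r) {
        Set<String> entities = g.stream()
                .flatMap(t -> Stream.of(t.s(), t.o()))
                .collect(Collectors.toSet());
        return entities.stream()
                .map(v -> new BGEdge(v, egoNet(g, v, r)))
                .filter(e -> !e.egoNet().isEmpty())
                .collect(Collectors.toList());
    }
}
```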
Fig. 3.

Example of \(\mathcal {KOI}\) Graphs. A \(\mathcal {KOI}\) bipartite graph and its partitioning. (Color figure online)

Bipartite Graph Partitioning. To identify portions of the knowledge graph where the homophily prediction principle can be applied, the bipartite graph BG(G, r) is partitioned in a way that entities in each part are highly similar (i.e., have similar datatype properties) and connected (i.e., have similar ego-networks).

Definition 5

(A Partition of a \(\mathcal {KOI}\) Bipartite Graph). Given a \(\mathcal {KOI}\) bipartite graph \(BG(G,r) = (V \cup U, E_{BG})\), a partition \(P(E_{BG})=\{p_1, p_2, ..., p_n\}\) satisfies the following conditions:
  • Each part \(p_i\) contains a set of edges \(p_i = \{(v_x, u_x) \in E_{BG}\}\),

  • Each edge \((v_x, u_x)\) in \(E_{BG}\) belongs to one and only one part p of \(P(E_{BG})\), i.e., \(\forall p_i, p_j \in P(E_{BG}), p_i \cap p_j = \emptyset \) and \(E_{BG} = \bigcup _{p \in P(E_{BG})} p\).

Definition 6

(The Problem of \(\mathcal {KOI}\) Bipartite Graph Partitioning). Given a \(\mathcal {KOI}\) bipartite graph \(BG(G,r) = (V \cup U, E_{BG})\) and similarity measures \(S_v\) and \(S_u\) for entities in V and ego-networks in U, the problem of \(\mathcal {KOI}\) Bipartite Graph Partitioning corresponds to the problem of finding a partition \(P(E_{BG})\) such that Density(\(P(E_{BG})\)) is maximized, where:
  • Density(\(P(E_{BG})\))=\(\sum _{p \in P(E_{BG})} (\text {partDensity}(p))\), and

  • \( \text {partDensity}(p) = \overbrace{\frac{\sum _{v_i,v_j \in V_p}[v_i \ne v_j]S_v(v_i, v_j)}{|V_p|(|V_p| - 1)}}^{(A)}+ \overbrace{\frac{\sum _{u_i,u_j \in U_p}[u_i \ne u_j]S_u(u_i, u_j)}{|U_p|(|U_p| - 1)}}^{(B)}\)

where component (A) represents the similarity between entities in edges of part p and (B) represents the similarity between the corresponding ego-networks. \(S_v\) and \(S_u\) are similarity measures for entities and ego-networks, respectively.

\(\mathcal {KOI}\) utilizes the partitioning algorithm proposed by Palma et al. [15] to solve the optimization problem of partitioning a \(\mathcal {KOI}\) bipartite graph.

The bipartite graph in Fig. 3a is partitioned into the two parts represented in Fig. 3b. The entities of the part at the bottom are \(V_p=\{\text{ted:256}, \text{ted:595}, \text{ted:184}\}\) and their corresponding ego-networks are \(U_p=\{u_{256}, u_{595}, u_{184}\}\). In order to calculate the partDensity of this part, we compare pair-wise the entities in \(V_p\) with \(S_v\) and the ego-networks in \(U_p\) with \(S_u\). Thus, we compute the similarity \(S_v\) for the entity pairs \(S_v(\text{ted:256}, \text{ted:595})\), \(S_v(\text{ted:256}, \text{ted:184})\), and \(S_v(\text{ted:595}, \text{ted:184})\), and the similarity \(S_u\) for the ego-network pairs \(S_u(u_{256}, u_{595})\), \(S_u(u_{256}, u_{184})\), and \(S_u(u_{595}, u_{184})\). In this case, the computed partDensity value is 0.775.
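The partDensity computation can be written down directly from Definition 6. The sketch below assumes the pairwise similarities are already available; the numeric values in main are illustrative placeholders chosen only so that the toy result matches the 0.775 reported above, and they are not the similarities actually computed by \(\mathcal {KOI}\).

```java
import java.util.*;
import java.util.function.*;

// Sketch of partDensity from Definition 6: the average pairwise entity
// similarity (term A) plus the average pairwise ego-network similarity
// (term B) within one part of the bipartite-graph partition.
public class PartDensityExample {

    // Average of sim(x, y) over all ordered pairs with x != y; both terms of
    // Definition 6 have this shape, with |X|(|X|-1) pairs in the denominator.
    static <T> double avgPairwise(List<T> items, BiFunction<T, T, Double> sim) {
        int n = items.size();
        if (n < 2) return 0.0;                     // a singleton contributes no pairs
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j) sum += sim.apply(items.get(i), items.get(j));
        return sum / (n * (n - 1.0));
    }

    static <V, U> double partDensity(List<V> entities, List<U> egoNets,
                                     BiFunction<V, V, Double> sv,
                                     BiFunction<U, U, Double> su) {
        return avgPairwise(entities, sv)           // term (A): entity similarity
             + avgPairwise(egoNets, su);           // term (B): ego-network similarity
    }

    public static void main(String[] args) {
        // Illustrative symmetric S_v values (placeholders, not the paper's numbers).
        Map<String, Double> sv = Map.of(
                "ted:256|ted:595", 0.6, "ted:595|ted:256", 0.6,
                "ted:256|ted:184", 0.4, "ted:184|ted:256", 0.4,
                "ted:595|ted:184", 0.5, "ted:184|ted:595", 0.5);

        double density = partDensity(
                List.of("ted:256", "ted:595", "ted:184"),
                List.of("u256", "u595", "u184"),
                (a, b) -> sv.get(a + "|" + b),
                (a, b) -> 0.275);                  // constant S_u keeps the toy example short
        System.out.printf("%.3f%n", density);      // 0.500 + 0.275 = 0.775
    }
}
```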

Candidate Relation Discovery. \(\mathcal {KOI}\) applies the homophily prediction principle to the parts of a partition of a \(\mathcal {KOI}\) bipartite graph, and discovers relations between entities included in the same part.

Definition 7

(Candidate relation). Let \(G=(V, E)\) and \(G_{comp}=(V, E_{comp})\) be two knowledge graphs, let \(BG(G,r) = (V \cup U, E_{BG})\) be a \(\mathcal {KOI}\) bipartite graph, and let \(P(E_{BG})\) be a partition of \(E_{BG}\). Given a part \(p= \{(v_x, u_x) \in E_{BG}\} \in P(E_{BG})\), the set of candidate relations CDR(p) in part p corresponds to the set of relations \(\{(v_i, r, v_j) \in E_{comp}\}\) such that \(v_j\) is included in some ego-network \(u_x\), and the edges \((v_i, u_i)\) and \((v_x, u_x)\) are contained in the part p.

In Fig. 3b candidate relations are represented as red dashed lines. One example is the relation (ted:59, vol:hasLink, ted:595). This candidate relation is discovered due to the presence of ted:59 and ego-net(ted:73, vol:hasLink) in the same part, and the inclusion of the entity ted:595 in the ego-network ego-net(ted:73, vol:hasLink).
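The enumeration of candidate relations inside one part can be sketched as follows. Skipping self-links and relations already present in G reflects our reading of which candidates are of interest (the red dashed edges); the part contents in main are illustrative.

```java
import java.util.*;

// Sketch of Definition 7: within one part p, each entity v_i is paired with
// every member v_j of every ego-network u_x in p. Self-links and relations
// already present in G are skipped, since only missing edges are of interest.
public class CandidateRelationsExample {

    record BGEdge(String entity, Set<String> egoNet) {}
    record Relation(String s, String p, String o) {}

    static Set<Relation> candidateRelations(List<BGEdge> part, String r,
                                            Set<Relation> existing) {
        Set<Relation> candidates = new HashSet<>();
        for (BGEdge ei : part) {                    // edge (v_i, u_i) of part p
            for (BGEdge ex : part) {                // edge (v_x, u_x) of part p
                for (String vj : ex.egoNet()) {     // v_j contained in u_x
                    Relation cand = new Relation(ei.entity(), r, vj);
                    if (!ei.entity().equals(vj) && !existing.contains(cand))
                        candidates.add(cand);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // A part resembling the bottom part of Fig. 3b (memberships illustrative).
        List<BGEdge> part = List.of(
                new BGEdge("ted:256", Set.of("ted:59", "ted:73", "ted:184")),
                new BGEdge("ted:595", Set.of("ted:73", "ted:184")),
                new BGEdge("ted:184", Set.of("ted:256", "ted:595")));
        Set<Relation> existing = Set.of(
                new Relation("ted:256", "vol:hasLink", "ted:184"));
        candidateRelations(part, "vol:hasLink", existing).forEach(System.out::println);
    }
}
```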

Constraint Satisfaction. A relation constraint is a set of RDF constraints that states conditions that must be satisfied by a candidate discovered relation in order to become a discovered relation, i.e., relations belonging to the ideal knowledge graph. RDF constraints are expressed using the SPARQL language as suggested by Lausen et al. [11] and Fischer et al. [3]. Only the candidate relations that fulfill relation constraints are considered as discovered relations.

Definition 8

(Discovered Relations). Given a set of candidate relations CDR and a set of relation constraints S, the set of discovered relations DR is defined as the subset of candidate relations that satisfy the given constraints, i.e., \(DR = \{cdr \in CDR \mid cdr \text{ satisfies every constraint in } S\}\).

Although an upper bound for the problem of checking whether a constraint is satisfied by a candidate discovered relation is given by the PSPACE-completeness of SPARQL evaluation [17], the number of constraints is smaller than the size of the knowledge graph; hence, the complexity of this decision problem can be expressed in terms of data complexity, which is LOGSPACE for SPARQL [11, 17].
Fig. 4.

Application of the relation constraint described in Listing 1.1 for the candidate relations (red dashed edges) found in Fig. 3b.

Listing 1.1 illustrates a constraint that states a condition for a candidate discovered relation \(cdr=(v_i\; r\; v_j)\) to become a discovered relation. Whenever the candidate discovered relation \(cdr=(v_i\; r \; v_j)\) is identified in several parts of a partition P, the number of times that cdr appears is taken into account, as well as the similarity between the ego-network of \(v_i\) and the ego-networks in which \(v_j\) is included. To determine whether the constraint is satisfied, a score is computed and compared against a threshold \(\theta \). The score is defined as the product of the number of times a relation is discovered and the similarity between the corresponding ego-networks. For each discovered relation, Fig. 4 contains the value of the score described in Listing 1.1. The relation (ted:256, vol:hasLink, ted:595) gets the highest value for this score, being discovered four times in Fig. 3b; moreover, the similarity between the ego-networks ego-net(ted:595, vol:hasLink) and ego-net(ted:184, vol:hasLink) is 0.5. The constraint, specified as an ASK query, holds if at least one score value is greater than the threshold \(\theta \); therefore, only the maximum similarity value between the ego-networks is considered.
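Since Listing 1.1 itself is not reproduced here, the following sketch recasts the described test in plain Java rather than as a SPARQL ASK query: a candidate relation is accepted when the number of times it was discovered, multiplied by the maximum \(S_u\) similarity between the corresponding ego-networks, exceeds THETA. In the example, only the count (4) and the maximum similarity (0.5) are taken from the text; the other values are illustrative.

```java
import java.util.*;

// Sketch of the test that Listing 1.1 expresses as a SPARQL ASK query:
// a candidate relation becomes a discovered relation when
//   (number of times it was discovered) * (maximum S_u similarity) > THETA.
// Each occurrence of the candidate carries the S_u similarity of the pair of
// ego-networks that produced it.
public class ConstraintCheckExample {

    static boolean satisfiesConstraint(List<Double> egoNetSimilarities, double theta) {
        long count = egoNetSimilarities.size();
        double maxSim = egoNetSimilarities.stream()
                .mapToDouble(Double::doubleValue)
                .max().orElse(0.0);
        return count * maxSim > theta;              // score > THETA
    }

    public static void main(String[] args) {
        // (ted:256, vol:hasLink, ted:595): discovered four times with maximum
        // S_u = 0.5 (the other three values are illustrative), so score = 2.0.
        System.out.println(satisfiesConstraint(List.of(0.5, 0.4, 0.3, 0.2), 0.7)); // true
    }
}
```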

5 Related Work

Palma et al. [15] and Flores et al. [4] present approaches for relation discovery in heterogeneous bipartite graphs. Palma et al. present semEP, a semantics-based graph partitioning approach that finds the minimal partition of a weighted bipartite graph with the highest density. semEP utilizes parts in the same way \(\mathcal {KOI}\) does, in order to find missing relations. However, semEP considers entities as isolated elements and does not take their ego-networks into account during the partitioning process. esDSG [4] performs similarly to semEP, i.e., given a weighted bipartite graph, esDSG identifies a subgraph that is highly dense and comprises highly similar entities. Again, ego-networks are not considered.

Researchers in the social network field study the structure of friendship-induced graphs, and define the concept of ego-network as the set of entities that are at one-hop distance from a given entity. Epasto et al. [2] report high-quality results in the friend suggestion task by analyzing the ego-networks of the induced knowledge graphs. In this case, the discovery of relations is based purely on the ego-networks of the entities, and no datatype property values are considered.

Redondo et al. [7] propose an approach to discover relations between video fragments based on visual information and background knowledge extracted from the Web of Data in the form of annotations. As in [4, 15], entities, i.e., video fragments, are considered as isolated elements in the knowledge graph, and the similarity is computed as the number of coincident annotations between two video fragments.

Sachan and Ichise [21] discover relations between authors in a co-author network extracted from DBLP. Their approach is based on dense subgraphs. They consider the connections in the knowledge graph and some features of the authors and papers, such as keywords. However, the comparison of such features is performed at the syntactic level, and the semantics is omitted.

Kastrin et al. [10] present an approach to discover relations among biomedical terms. They build a knowledge graph of such terms with the help of SemRep [20], a tool for recovering semantic propositions from the literature. In this case, not only the existence of a relation is important, but also its type. Unlike \(\mathcal {KOI}\), they only consider the graph topology, discarding the semantic knowledge encoded in datatype properties.

Nunes et al. [14] link entities based on the number of co-occurrences in a text corpus and distance, measured in number of hops, between the entities in a knowledge graph. Unlike \(\mathcal {KOI}\), this approach needs a corpus labeled with entities and only takes into account the object properties, omitting the semantics encoded in datatype properties.

6 Empirical Evaluation

6.1 Knowledge Graph Creation

In this section we describe the characteristics of the crafted TED knowledge graph and its links to external vocabularies. This knowledge graph is built from a real-world dataset of TED talks and playlists.

The knowledge graph of TED talks consists of 846 talks and 125 playlists (15/12/2015). Playlists are described with a title and the set of included TED talks. Each TED talk is described with the following set of datatype properties:
  • dc:title (Dublin Core vocabulary) represents the title of the talk;

  • dc:creator models the speaker;

  • dc:description represents the abstract; and

  • ted:relatedTags corresponds to the set of related keywords.

Apart from the datatype properties, TED talks are connected to the playlists that include them through the object property ted:playlist. A vol:hasLink (Vocabulary Of Links) object property connects each pair of talks that appear together in at least one playlist. We crawled the playlists available on the TED website. Playlists contain sets of TED talks that usually address similar topics. TED playlists are created and maintained by curators, who decide whether a certain video may or may not be included in a certain playlist.
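The paragraph above suggests that vol:hasLink edges can be materialized from playlist co-membership. A straightforward sketch of that derivation is shown below; whether the published graph was produced exactly this way is not stated, so this is only one possible reading.

```java
import java.util.*;

// Sketch: derive vol:hasLink edges from playlists by connecting every pair
// of talks that co-occur in at least one playlist. Playlists are given as
// sets of talk IDs; the output is a set of undirected pairs.
public class HasLinkDerivationExample {

    record Link(String talk1, String talk2) {}

    static Set<Link> deriveLinks(Collection<Set<String>> playlists) {
        Set<Link> links = new HashSet<>();
        for (Set<String> playlist : playlists) {
            List<String> talks = new ArrayList<>(playlist);
            for (int i = 0; i < talks.size(); i++)
                for (int j = i + 1; j < talks.size(); j++) {
                    // Store each pair in a canonical order so duplicates collapse.
                    String a = talks.get(i), b = talks.get(j);
                    links.add(a.compareTo(b) < 0 ? new Link(a, b) : new Link(b, a));
                }
        }
        return links;
    }
}
```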

Additionally, we enriched the knowledge graph by adding similarity values between each pair of entities. We computed four similarity measures (TFIDF, ESA [6], Doc2Vec [12], and Doc2Vec Neighbors) using as input the concatenation of the datatype properties title, description, and related tags. ESA similarity values were computed using the public ESA endpoint, and Doc2Vec (D2V) values were obtained by training the gensim implementation [19] with the pre-trained Google News dataset. Doc2Vec Neighbors (D2VN) is defined as:
$$\text{D2VN}(v_1, v_2) = \frac{\sqrt{S_v(v_1, v_2)^2 + S_u(\text{ego-net}(v_1, r), \text{ego-net}(v_2, r))^2}}{\sqrt{2}},$$

where r corresponds to vol:hasLink, and \(S_v\) and \(S_u\) are defined as follows:

$$S_v(v_i, v_j) = \text{Doc2Vec}(v_i, v_j) \qquad (1)$$

$$S_u(V_1, V_2) = \frac{2 \cdot \sum_{(v_i, v_j) \in WE_r} \text{Doc2Vec}(v_i, v_j)}{|V_1| + |V_2|} \qquad (2)$$

where \(WE_r\) represents the set of edges included in the 1-1 maximal weighted bipartite graph matching between \(V_1\) and \(V_2\), following the definition of Schwartz et al. [22].
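A sketch of Eqs. (1)-(2) and the D2VN measure follows. Doc2Vec similarities are treated as a black-box function, and the exact 1-1 maximal weighted bipartite matching of Schwartz et al. [22] is replaced by a simple greedy matching, so the resulting \(S_u\) values only approximate those used in the paper.

```java
import java.util.*;
import java.util.function.*;

// Sketch of Eqs. (1)-(2) and the D2VN similarity measure.
public class D2VNExample {

    record Pair(String x, String y, double w) {}

    // Greedy 1-1 matching: repeatedly pick the highest-similarity unmatched pair.
    // This only approximates the maximal weighted bipartite matching of [22].
    static double matchingWeight(List<String> a, List<String> b,
                                 BiFunction<String, String, Double> sim) {
        List<Pair> pairs = new ArrayList<>();
        for (String x : a)
            for (String y : b)
                pairs.add(new Pair(x, y, sim.apply(x, y)));
        pairs.sort(Comparator.comparingDouble(Pair::w).reversed());
        Set<String> usedA = new HashSet<>(), usedB = new HashSet<>();
        double total = 0.0;
        for (Pair p : pairs) {
            if (!usedA.contains(p.x()) && !usedB.contains(p.y())) {
                usedA.add(p.x());
                usedB.add(p.y());
                total += p.w();
            }
        }
        return total;
    }

    // Eq. (2): S_u(V1, V2) = 2 * (matching weight) / (|V1| + |V2|)
    static double su(List<String> egoNet1, List<String> egoNet2,
                     BiFunction<String, String, Double> doc2vec) {
        return 2.0 * matchingWeight(egoNet1, egoNet2, doc2vec)
                / (egoNet1.size() + egoNet2.size());
    }

    // D2VN(v1, v2) = sqrt(S_v^2 + S_u^2) / sqrt(2), with S_v = Doc2Vec(v1, v2) (Eq. 1)
    static double d2vn(double sv, double suValue) {
        return Math.sqrt(sv * sv + suValue * suValue) / Math.sqrt(2.0);
    }
}
```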

Unlike the knowledge graph created by Taibi et al. [23], our knowledge graph of TED talks includes information about the playlists, the relations between TED talks, and four similarity values for each pair of talks (TFIDF, ESA, Doc2Vec, and Doc2Vec Neighbors). The knowledge graph of TED talks is publicly available at https://goo.gl/7TnsqZ.

6.2 Experimental Configuration

We empirically evaluate the effectiveness of \(\mathcal {KOI}\) to discover missing relations in the 2015 TED knowledge graph, which is based on a real-world dataset. We compare \(\mathcal {KOI}\) with METIS [9] and k-Nearest Neighbors (KNN) empowered with four similarity measures: TFIDF, ESA, Doc2Vec, and Doc2Vec Neighbors.

Research Questions: We aim at answering the following research questions: (RQ1) Does the semantics encoded in similarity measures affect the relation discovery task? In order to answer this question, we compare four similarity measures: one statistics-based measure (TFIDF) and three semantic similarity measures (ESA [6], Doc2Vec [12], and Doc2Vec Neighbors). Doc2Vec Neighbors considers both the semantics encoded in datatype properties and the structure of the graph, by taking into account the ego-networks. (RQ2) Is \(\mathcal {KOI}\) able to outperform common discovery approaches such as METIS or KNN?

Implementation: We implemented \(\mathcal {KOI}\) in Java 1.8 and executed the experiments on an Ubuntu 14.04 64-bit machine with an Intel(R) Core(TM) i5-4300U 1.9 GHz CPU (4 physical cores) and 8 GB RAM. In order to perform a fair evaluation, we used the WEKA library [8], version 3.7.12, to split the dataset following a 10-fold cross-validation strategy. The cross-validation was performed over the set of relations among TED talks. In order to discover relations using the METIS solver version 5.1, we apply METIS to a \(\mathcal {KOI}\) bipartite graph with the same similarity measures \(S_u\) and \(S_v\) specified above for \(\mathcal {KOI}\). METIS returns a partitioning of the given graph, and we produce candidate discovered relations as explained in Sect. 4. In order to perform a fair comparison, the same constraint (Listing 1.1) is applied to the results of both \(\mathcal {KOI}\) and METIS.

Evaluation Metrics: For each discovery approach, we compute the following metrics: (i) Precision: the ratio between the number of correctly discovered relations and the total number of discovered relations; (ii) Recall: the ratio between the number of correctly discovered relations and the number of existing relations in the dataset; (iii) F-Measure: the harmonic mean of precision and recall. The values shown in Tables 1 and 2 are the averages over the 10 folds. Moreover, we draw the F-Measure curves for \(\mathcal {KOI}\) and METIS and calculate the Precision-Recall Area Under the Curve (AUC) coefficients (Table 3).
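The three metrics can be computed per fold as in the following sketch, assuming the discovered relations and the held-out ground-truth relations of the fold are available as sets.

```java
import java.util.*;

// Sketch of the evaluation metrics: precision, recall, and F-measure,
// computed from the discovered and ground-truth relation sets of one fold.
public class MetricsExample {

    record Relation(String s, String p, String o) {}

    static double precision(Set<Relation> discovered, Set<Relation> truth) {
        if (discovered.isEmpty()) return 0.0;
        long correct = discovered.stream().filter(truth::contains).count();
        return (double) correct / discovered.size();
    }

    static double recall(Set<Relation> discovered, Set<Relation> truth) {
        if (truth.isEmpty()) return 0.0;
        long correct = discovered.stream().filter(truth::contains).count();
        return (double) correct / truth.size();
    }

    static double fMeasure(double p, double r) {
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);   // harmonic mean
    }
}
```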

6.3 Discovering Relations with K-Nearest Neighbors

In our first experiment, we discover relations in the graph using the K-Nearest Neighbors (KNN) algorithm under the hypothesis that highly similar TED talks should be related. Given a talk, we discover a relation between it and its K most similar talks. This experiment evaluates the impact of considering semantics encoded in domain similarity measures during the relation discovery task (RQ1).
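A minimal sketch of this KNN baseline is given below; the similarity measure is passed in as a black-box function, and K, the predicate name, and the list of talks are parameters of the sketch rather than values fixed by the paper.

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Sketch of the KNN baseline: for each talk, a relation is proposed to its
// K most similar talks according to the chosen similarity measure.
public class KnnBaselineExample {

    record Relation(String s, String p, String o) {}

    static Set<Relation> knnRelations(List<String> talks, int k, String predicate,
                                      BiFunction<String, String, Double> sim) {
        Set<Relation> discovered = new HashSet<>();
        for (String talk : talks) {
            List<String> nearest = talks.stream()
                    .filter(other -> !other.equals(talk))
                    .sorted(Comparator.comparingDouble(
                            (String other) -> sim.apply(talk, other)).reversed())
                    .limit(k)                       // the K most similar talks
                    .collect(Collectors.toList());
            for (String other : nearest)
                discovered.add(new Relation(talk, predicate, other));
        }
        return discovered;
    }
}
```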

Table 1 reports the results obtained with four similarity measures: TFIDF, ESA [6], Doc2Vec [12], and Doc2Vec Neighbors. The first three similarity measures only consider the knowledge encoded in datatype properties. In contrast, Doc2Vec Neighbors compares two entities considering both the knowledge located in datatype properties and the structure of the graph, by taking into account the ego-networks. The results obtained with the first three similarity measures suggest that Doc2Vec and ESA, which are semantic similarity measures, are able to outperform TFIDF, which does not take semantics into account. Doc2Vec obtains the highest F-measure value (0.196) with \(K=13\), which is significantly better than the maximum values obtained by ESA (0.137) and TFIDF (0.133). Thus, we can conclude that considering the semantics encoded in Doc2Vec has a positive impact on the relation discovery task with respect to ESA and TFIDF. The results obtained with Doc2Vec Neighbors indicate that the knowledge encoded in ego-networks is of great value and that combining it with the knowledge encoded in datatype properties yields a higher F-measure value (0.285) than the other three similarity measures.
Table 1.

Effectiveness of KNN. D2V = Doc2Vec, D2VN = Doc2Vec Neighbors. D2VN presents the best results, with an F-measure of 0.285 for \(K=4\). These results highlight the relevance of the knowledge encoded in ego-networks

6.4 Effectiveness of \(\mathcal {KOI}\) Discovering Relations

We executed \(\mathcal {KOI}\) using the definitions of \(S_v\) and \(S_u\) in Eqs. 1 and 2, respectively. We compare \(\mathcal {KOI}\) with METIS [9] using the relation constraint defined in Listing 1.1.
Table 2.

Comparison of \(\mathcal {KOI}\) and METIS. Values of \(\theta \) correspond to the value of the variable THETA of the constraint in Listing 1.1

Table 3.

Area Under the Curve coefficients for \(\mathcal {KOI}\), KNN Doc2Vec Neighbors, and METIS

Approach   | AUC   | F-Measure
\(\mathcal {KOI}\) | 0.396 | 0.512
METIS      | 0.244 | 0.39
KNN D2VN   | 0.223 | 0.285

Fig. 5.

F-Measure curves of \(\mathcal {KOI}\) and METIS. The area under the curve indicates the quality of the approaches

Table 2 contains the results obtained with \(\mathcal {KOI}\) and METIS. The highest F-measure value is 0.512 and is obtained by \(\mathcal {KOI}\) with \(\theta = 0.7\). This F-measure value is higher than the one obtained with KNN and Doc2Vec Neighbors (0.285), and also higher than the maximum value obtained by METIS (0.39). We also observe that the parameter \(\theta \), which corresponds to THETA in Listing 1.1, can be configured depending on the relative importance of precision and recall. Lower values of \(\theta \) deliver higher recall, while higher values of \(\theta \) deliver higher precision. Figure 5 shows the F-Measure curves for values of \(\theta \in [0,2]\). \(\mathcal {KOI}\) obtains higher F-Measure values for almost all values of \(\theta \). We also computed the Precision-Recall curves for \(\mathcal {KOI}\), METIS, and KNN Doc2Vec Neighbors. Table 3 shows that \(\mathcal {KOI}\) obtains a higher AUC value (0.396) than METIS (0.244) and KNN (0.223).

7 Conclusions and Future Work

In this paper we present \(\mathcal {KOI}\), an approach that exploits semantics and graph structure information to discover missing relations in a knowledge graph. \(\mathcal {KOI}\) considers the semantics encoded in entities and their ego-networks to identify relations between entities with similar datatype properties and similar ego-networks. The reported experimental results suggest that \(\mathcal {KOI}\) outperforms state-of-the-art approaches that: (i) do not consider semantics (KNN TFIDF), or (ii) do not identify graph portions containing highly similar entities (KNN D2VN and METIS). In the future, we plan to extend \(\mathcal {KOI}\) to take into account domain-specific knowledge in graphs from more specific domains, e.g., social network, financial, or clinical data. Further, we plan to extend \(\mathcal {KOI}\) to consider the relevance or importance of the entities in ego-networks, as well as to discover relations between different types of entities, e.g., drugs and proteins.


Acknowledgements

This work is supported by the German Ministry of Education and Research within the SHODAN project (Ref. 01IS15021C) and the German Ministry of Economy and Technology within the ReApp project (Ref. 01MA13001A).

References

  1. Arenas, M., Gutierrez, C., Pérez, J.: Foundations of RDF databases. In: Tessaris, S., Franconi, E., Eiter, T., Gutierrez, C., Handschuh, S., Rousset, M.-C., Schmidt, R.A. (eds.) Reasoning Web. LNCS, vol. 5689, pp. 158–204. Springer, Heidelberg (2009)
  2. Epasto, A., Lattanzi, S., Mirrokni, V., Sebe, I.O., Taei, A., Verma, S.: Ego-net community mining applied to friend suggestion. Proc. VLDB Endow. 9(4), 324–335 (2015)
  3. Fischer, P.M., Lausen, G., Schätzle, A., Schmidt, M.: RDF constraint checking. In: EDBT/ICDT 2015 Joint Conference (2015)
  4. Flores, A., Vidal, M., Palma, G.: Exploiting semantics to predict potential novel links from dense subgraphs. In: 9th Alberto Mendelzon International Workshop on Foundations of Data Management (2015)
  5. Fundulaki, I., Auer, S.: Linked open data - introduction to the special theme. ERCIM News 2014(96) (2014)
  6. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI, vol. 7 (2007)
  7. García, J.L.R., Sabatino, M., Lisena, P., Troncy, R.: Detecting hot spots in web videos. In: ISWC Poster and Demo Track. CEUR-WS.org (2014)
  8. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
  9. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1) (1998)
  10. Kastrin, A., Rindflesch, T.C., Hristovski, D.: Link prediction on the semantic MEDLINE network - an approach to literature-based discovery. In: Džeroski, S., Panov, P., Kocev, D., Todorovski, L. (eds.) DS 2014. LNCS, vol. 8777, pp. 135–143. Springer, Heidelberg (2014)
  11. Lausen, G., Meier, M., Schmidt, M.: SPARQLing constraints for RDF. In: 11th International Conference on Extending Database Technology, EDBT. ACM (2008)
  12. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. CoRR abs/1405.4053 (2014)
  13. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
  14. Pereira Nunes, B., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W.: Combining a co-occurrence-based and a semantic measure for entity linking. In: Cimiano, P., Corcho, O., Presutti, V., Hollink, L., Rudolph, S. (eds.) ESWC 2013. LNCS, vol. 7882, pp. 548–562. Springer, Heidelberg (2013). doi:10.1007/978-3-642-38288-8_37
  15. Palma, G., Vidal, M.-E., Raschid, L.: Drug-target interaction prediction using semantic similarity and edge partitioning. In: Mika, P., et al. (eds.) ISWC 2014, Part I. LNCS, vol. 8796, pp. 131–146. Springer, Heidelberg (2014)
  16. Pappas, N., Popescu-Belis, A.: Combining content with user preferences for TED lecture recommendation. In: 11th International Workshop on Content-Based Multimedia Indexing. IEEE (2013)
  17. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3), 30–43 (2009)
  18. Pirró, G.: Explaining and suggesting relatedness in knowledge graphs. In: Arenas, M., et al. (eds.) ISWC 2015. LNCS, vol. 9366, pp. 622–639. Springer, Heidelberg (2015). doi:10.1007/978-3-319-25007-6_36
  19. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA (2010). http://is.muni.cz/publication/884893/en
  20. Rindflesch, T.C., Kilicoglu, H., Fiszman, M., Rosemblat, G., Shin, D.: Semantic MEDLINE: an advanced information management application for biomedicine. Inf. Serv. Use 31(1–2), 15–21 (2011)
  21. Sachan, M., Ichise, R.: Using semantic information to improve link prediction results in network datasets. Int. J. Eng. Technol. 2(4), 71–76 (2010)
  22. Schwartz, J., Steger, A., Weißl, A.: Fast algorithms for weighted bipartite matching. In: Nikoletseas, S.E. (ed.) WEA 2005. LNCS, vol. 3503, pp. 476–487. Springer, Heidelberg (2005)
  23. Taibi, D., Chawla, S., Dietze, S., Marenzi, I., Fetahu, B.: Exploring TED talks as linked data for education. Br. J. Educ. Technol. 46(5), 1092–1096 (2015)

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Ignacio Traverso-Ribón (1)
  • Guillermo Palma (2)
  • Alejandro Flores (4)
  • Maria-Esther Vidal (2, 3)

  1. FZI Research Center for Information Technology, Karlsruhe, Germany
  2. Universidad Simón Bolívar, Caracas, Venezuela
  3. University of Bonn and Fraunhofer, Bonn, Germany
  4. University of Maryland, College Park, USA
