1 Background

In the past few years, there has been a significant increase in the use and development of KGs and their various applications. These KGs store world knowledge as triples (i.e., \(<\)entity, relation, entity\(>\)), where each entity refers to a distinct real-world object and each relation represents a connection between two such objects. Since triples share entities, they are inherently interconnected, forming a large and complex graph of knowledge. There now exist many general KGs (e.g., DBpedia [1], YAGO [52], Google’s Knowledge Vault [14]) and domain-specific KGs (e.g., medical [48] and scientific KGs [56]). KGs have been utilized to improve a wide range of downstream applications, including but not limited to keyword search [64], fact-checking [30], and question answering [12, 28].

A knowledge graph, denoted as \(G = (E, R, T)\), is a graph that consists of three main components: a set of entities E, a set of relations R, and a set of triples T, where \(T \subseteq E \times R \times E\) represents the directed edges in the graph. In the set of triples T, a single triple \((h, r, t)\) represents a relationship between a head entity h and a tail entity t through a specific relation r. Each entity in the graph is identified by a unique identifier, such as http://dbpedia.org/resource/Spain in the case of DBpedia.
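For illustration, the following minimal Python sketch represents a KG by its triple set T, from which the entity set E and relation set R can be derived; the DBpedia-style identifiers are used only as examples.

```python
from typing import NamedTuple, Set

class Triple(NamedTuple):
    head: str      # identifier of the head entity h
    relation: str  # identifier of the relation r
    tail: str      # identifier of the tail entity t

# A KG G = (E, R, T): E and R are induced by the triple set T.
triples: Set[Triple] = {
    Triple("dbr:Spain", "dbo:capital", "dbr:Madrid"),
    Triple("dbr:Madrid", "dbo:country", "dbr:Spain"),
}
entities = {t.head for t in triples} | {t.tail for t in triples}  # E
relations = {t.relation for t in triples}                         # R
```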

In practice, KGs are typically constructed from a single data source, making it difficult to achieve comprehensive coverage of a given domain [46]. To improve the completeness of a KG, one popular strategy is to integrate information from other KGs that may contain supplementary or complementary data. For instance, a general KG may only include basic information about a scientist, while a scientific domain-specific KG may provide additional details such as a biography and a list of publications. To combine knowledge across multiple KGs, a crucial step is to align the equivalent entities in different KGs, a task known as entity alignment (EA) [7, 25].

Given a source KG \(G_1 = (E_1, R_1, T_1)\), a target KG \(G_2 = (E_2, R_2, T_2)\), and a set of seed entity pairs (the training set) \(S = \{(u,v) \mid u\in E_1, v\in E_2, u \leftrightarrow v\}\), where \(\leftrightarrow \) denotes equivalence (i.e., u and v refer to the same real-world object), the task of EA is to discover the remaining equivalent entity pairs, i.e., those in the test set.
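In the benchmarks discussed later (Sect. 3), the seed pairs are typically a fixed fraction (e.g., 30%) of the gold-standard pairs, with the rest held out for testing. A minimal sketch of this split (the function name and the fixed random seed are ours, chosen purely for illustration):

```python
import random
from typing import List, Set, Tuple

def train_test_split(gold_pairs: List[Tuple[str, str]], seed_ratio: float = 0.3
                     ) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    """Split gold equivalent pairs into seed (training) pairs S and test pairs."""
    pairs = gold_pairs[:]             # copy so the input list is left untouched
    random.Random(42).shuffle(pairs)  # fixed seed for reproducibility
    cut = int(len(pairs) * seed_ratio)
    return set(pairs[:cut]), set(pairs[cut:])
```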


Fig. 1.1 An example of EA: two mirrored KGs whose entities have English and Spanish names, respectively. Entity identifiers are placed in square brackets; the prefixes of entity identifiers and the full relation identifiers are omitted for clarity. Seed entity pairs are connected by dashed lines

Example Figure 1.1 shows a partial English KG (KG\({ }_{\text{EN}}\)) and a partial Spanish KG (KG\({ }_{\text{ES}}\)) concerning the director Alfonso Cuarón. Note that each entity in a KG has a unique identifier; for example, the movie “Roma” in the source KG is uniquely identified by Roma(film). Given the seed entity pair, i.e., Mexico from KG\({ }_{\text{EN}}\) and Mexico from KG\({ }_{\text{ES}}\), EA aims to find the equivalent entity pairs in the test set, e.g., returning Roma(ciudad) in KG\({ }_{\text{ES}}\) as the target entity corresponding to the source entity Roma(city) in KG\({ }_{\text{EN}}\).

Broadly speaking, current EA methods address the problem by assuming that equivalent entities in different KGs share similar local structures and by applying representation learning techniques to embed entities as data points in a low-dimensional feature space. With effective entity embeddings, the pairwise dissimilarity of two entities can be computed as the distance between their data points, which allows us to judge whether the entities match.
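For illustration, the sketch below performs the nearest-neighbor matching step that this paradigm implies; the random vectors merely stand in for the output of a real representation learning model, and Euclidean distance is one of several plausible choices of dissimilarity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for learned embeddings: one row per entity.
src_emb = rng.normal(size=(1000, 128))  # embeddings of source-KG entities
tgt_emb = rng.normal(size=(1200, 128))  # embeddings of target-KG entities

def predict_counterparts(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """For each source entity, return the index of the nearest target entity
    by Euclidean distance (smaller distance = more likely a match)."""
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2ab.
    dist = (
        (src_emb ** 2).sum(axis=1, keepdims=True)
        + (tgt_emb ** 2).sum(axis=1)
        - 2.0 * src_emb @ tgt_emb.T
    )
    return dist.argmin(axis=1)

predictions = predict_counterparts(src_emb, tgt_emb)  # shape: (1000,)
```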

2 Related Work

While the problem of EA was introduced only a few years ago, the more generic version of the problem, namely identifying entity records that refer to the same real-world entity across different data sources, has been investigated from various angles by different communities, under the names of entity resolution (ER) [15, 18, 45], entity matching [13, 42], record linkage [8, 34], deduplication [16], instance/ontology matching [20, 35, 49, 50, 51], link discovery [43, 44], and entity linking/entity disambiguation [11, 29]. Next, we describe the related work and the scope of this book.

2.1 Entity Linking

Entity linking (EL), also called entity disambiguation, is the task of recognizing entity mentions in natural language text and linking them to the corresponding entities in a given reference catalog, which is usually a knowledge graph. For example, given the mention “Rome,” the task is to determine whether it refers to the city in Italy, a movie, or another entity, and to link it to the right entity in the reference catalog. Prior studies in EL [21, 22, 29, 36, 68] have used various sources of information to disambiguate entity mentions, including surrounding words, prior probabilities of certain target entities, already disambiguated entity mentions, and background knowledge from sources such as Wikipedia. However, much of this information, such as the textual context of a mention or the prior probability of a target entity given a mention, is not available when aligning KGs. Moreover, EL is concerned with mapping natural language text to a KG, while this book investigates the mapping of entities between two KGs.

2.2 Entity Resolution

Entity resolution, also referred to as entity matching, deduplication, or record linkage, assumes that the input is relational data, where each data object typically carries a large amount of textual information spread across multiple attributes. Therefore, various similarity or distance functions are used to measure the similarity between two objects, e.g., the Jaro-Winkler distance for comparing names and numerical distance for comparing dates. Based on these similarity measures, both rule-based and machine learning-based methods can be employed to classify two objects as matching or non-matching [9].

To clarify further, in ER tasks, the attributes of data objects are first aligned, which can be done manually or automatically. Then, the similarity or distance functions are used to calculate the similarities between corresponding attribute values of the two objects. Finally, the similarity scores between the aligned attributes are combined or aggregated to determine the overall similarity between the two objects. This process allows rule-based or machine learning-based methods to classify pairs of objects as either matching or non-matching, based on the computed similarity scores [32, 45].
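As a minimal sketch of this pipeline (using Python’s standard-library SequenceMatcher as a stand-in for a production string metric such as Jaro-Winkler, with attribute weights and the decision threshold chosen purely for illustration):

```python
from difflib import SequenceMatcher

def name_sim(a: str, b: str) -> float:
    """String similarity in [0, 1]; a real system might use Jaro-Winkler instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def year_sim(y1: int, y2: int, scale: float = 10.0) -> float:
    """Numerical similarity in [0, 1] that decays with the absolute difference."""
    return max(0.0, 1.0 - abs(y1 - y2) / scale)

def match(rec1: dict, rec2: dict, threshold: float = 0.8) -> bool:
    # Step 1: attributes are assumed to be already aligned (name<->name, year<->year).
    # Step 2: per-attribute similarity scores.
    sims = {
        "name": name_sim(rec1["name"], rec2["name"]),
        "year": year_sim(rec1["birth_year"], rec2["birth_year"]),
    }
    # Step 3: aggregate with illustrative weights; a learned model could replace this rule.
    score = 0.7 * sims["name"] + 0.3 * sims["year"]
    return score >= threshold

print(match({"name": "Alfonso Cuaron", "birth_year": 1961},
            {"name": "Alfonso Cuarón", "birth_year": 1961}))  # True
```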

2.3 Entity Resolution on KGs

Certain ER methods are designed specifically for KGs and focus solely on binary relations, i.e., graph-shaped data; they are sometimes called instance/ontology matching approaches [49, 50]. Graph-shaped data comes with its own challenges: (1) Entities in graph-shaped data often lack detailed textual descriptions and may be represented only by a name with a minimal amount of accompanying information. (2) Unlike classical databases, which assume that all fields of a record are present, KGs are built on the Open World Assumption: the absence of an attribute of an entity in the KG does not necessarily mean that it does not exist in reality. (3) KGs have their own predefined semantics. At a basic level, these can take the form of a taxonomy of classes; in more complex cases, KGs can be endowed with an ontology of logical axioms.

In the past 20 years, various techniques have been developed to address the specific challenges of KGs, particularly in the context of the Semantic Web and the Linked Open Data cloud [26]. These techniques can be categorized along several different dimensions:

  • Scope. Alignment techniques target different parts of a KG: some align the entities of two KGs, others align the relation names (the schema) between KGs, and still others align the class taxonomies of two KGs; a few techniques tackle all three tasks at once. This book focuses on the first task, aligning entities in KGs.

  • Background knowledge. Certain techniques rely on an ontology (T-box) as background knowledge, particularly those that participate in the Ontology Alignment Evaluation Initiative (OAEI). This book, however, focuses on techniques that do not require such prior knowledge and can operate without an ontology.

  • Training. Some techniques for aligning knowledge graphs are unsupervised and operate directly on the input data, without any training data or training phase; examples include PARIS [51] and SiGMa [35]. Other approaches learn mappings between entities from predefined seeds. This book focuses on the latter class of approaches.

Most supervised or semi-supervised approaches to entity alignment build on recent advances in deep learning [23]. These approaches primarily rely on graph representation learning techniques to model the structure of knowledge graphs and to generate entity embeddings for alignment. We use the term “entity alignment (EA) approaches” to refer to these supervised or semi-supervised approaches, which are the main focus of this book. In the next chapter, however, we include PARIS [51] for comparison as a representative of the unsupervised approaches, as well as AgreementMakerLight (AML) [17] as a representative of unsupervised systems that use background knowledge. For other systems, we refer the reader to existing surveys [9, 33, 41, 43].

In addition, since EA pursues the same goal as ER, it can be deemed a special but nontrivial case of ER. In this light, general ER approaches can be adapted to the problem of EA, and we include representative ER methods for comparison (to be detailed in Chap. 2).

Existing Benchmarks

To assess the effectiveness of EA methods, several synthetic datasets, such as DBP15K and DWY100K, were created using the inter-language and reference links already present in DBpedia. Chapter 2 contains more extensive statistics about these datasets.

Notably, the OAEI promotes a knowledge graph track. While existing EA benchmarks provide only instance-level information, the KGs in that track include both schema and instance information, which would lead to an unfair evaluation of current EA approaches that do not assume the availability of ontology information. Hence, these datasets are not used in this book.

3 Evaluation Settings

This section provides an introduction to the evaluation settings that are commonly used for the EA task.

Datasets

Three representative datasets are commonly used:

  • DBP15K [53]. This dataset comprises three pairs of multilingual KGs extracted from DBpedia: Chinese-English (\({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\)), Japanese-English (\({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\)), and French-English (\({\mathtt {DBP15K}_{\mathtt {FR-EN}}}\)). Each KG pair comes with 15,000 inter-language links, which serve as gold standards.

  • DWY100K [54]. The dataset consists of two pairs of mono-lingual knowledge graphs, namely, \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) and \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\). These pairs were extracted from DBpedia, Wikidata, and YAGO 3, and each one contains 100,000 pairs of entities. The extraction process is similar to that of DBP15K, except that the inter-language links have been replaced with reference links that connect these knowledge graphs.

  • SRPRS. Guo et al. [24] observed that the KGs in previous EA datasets, such as DBP15K and DWY100K, are overly dense and do not accurately reflect the degree distributions observed in real-life KGs. In response, they developed a new EA benchmark that uses reference links in DBpedia to establish KGs whose degree distributions better reflect real-life situations. The resulting benchmark includes both cross-lingual (\({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\), \({\mathtt {SRPRS}_{\mathtt {EN-DE}}}\)) and mono-lingual KG pairs (\({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\), \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\)), where EN, FR, DE, DBP, WD, and YG denote DBpedia (English), DBpedia (French), DBpedia (German), DBpedia, Wikidata, and YAGO 3, respectively. Each KG pair comprises 15,000 pairs of entities.

Table 1.1 provides a summary of the datasets used in this study. Each KG pair includes relational triples, cross-KG entity pairs (30% of which are seed entity pairs and used for training), and attribute triples. The cross-KG entity pairs serve as gold standards.

Table 1.1 Statistics of EA benchmarks and our constructed dataset

Degree Distribution

Figure 1.2 presents the degree distributions of entities in the datasets, which provides insights into the characteristics of these datasets. The degree of an entity is defined as the number of triples in which the entity is involved. Entities with higher degrees tend to have richer neighboring structures. The degree distributions of the different KG pairs in each dataset are very similar. Thus, for brevity, we present only one KG pair’s degree distribution in Fig. 1.2.
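Such degree statistics can be computed directly from the triple set, for example as in the following straightforward sketch, where both the head and the tail of a triple contribute to the respective entity’s degree:

```python
from collections import Counter
from typing import Iterable, Tuple

def degree_distribution(triples: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count, for each degree value d, how many entities occur in exactly d triples."""
    entity_degree: Counter = Counter()
    for head, _relation, tail in triples:
        entity_degree[head] += 1
        entity_degree[tail] += 1
    # Histogram: degree value -> number of entities with that degree.
    return Counter(entity_degree.values())

triples = [("a", "r1", "b"), ("a", "r2", "c"), ("b", "r1", "c")]
print(sorted(degree_distribution(triples).items()))  # [(2, 3)]: all 3 entities have degree 2
```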

Fig. 1.2 Degree distributions on different datasets. The X-axis denotes entity degree. The left Y-axis represents the number of entities (corresponding to bars), while the right Y-axis represents the percentage of entities with a degree lower than a given x value (corresponding to lines). Panels: (a-1, a-2) DBP15K (ZH, EN); (b-1, b-2) DWY100K (DBP, WD); (c-1, c-2) SRPRS (EN, FR); (d-1, d-2) our dataset (DBP, FB)

The sub-figures in series (a) correspond to the DBP15K dataset. As shown, entities with a degree of 1 comprise the largest proportion, while the number of entities generally decreases with increasing degree values, with some fluctuations. It is worth noting that the coverage curve approximates a straight line, as the number of entities changes only slightly when the degree increases from 2 to 10.

The (b) set of figures is related to DWY100K. This dataset has a distinct structure from (a), as there are no entities with a degree of 1 or 2. Additionally, the number of entities reaches its highest point at degree 4 and then decreases as the entity degree increases.

The (c) set of figures is related to SRPRS. It is clear that the degree distribution of entities in this dataset is more realistic, with entities of lower degrees making up a larger proportion. This is due to its well-thought-out sampling approach. Additionally, the (d) set of figures corresponds to the dataset we created, which will be discussed in Chap. 2.

Evaluation Metrics

Most existing EA solutions use Hits@k (\(k=1, 10\)) and mean reciprocal rank (MRR) as their evaluation metrics. To make a prediction, the target entities are ranked in ascending order of their distance scores to the source entity. Hits@k measures the proportion of source entities whose ground-truth target appears among the k nearest target entities; Hits@1 is therefore the most direct measure of alignment accuracy.

MRR denotes the average of the reciprocal ranks of the ground truths. Note that higher Hits@k and MRR indicate better performance. Unless otherwise specified, the results of Hits@k are represented in percentages.
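For concreteness, given the 1-indexed rank of each ground-truth target in the sorted candidate list, both metrics can be computed as in the following sketch (the rank values are illustrative):

```python
from typing import Sequence

def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test pairs whose ground-truth target is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mrr(ranks: Sequence[int]) -> float:
    """Mean of the reciprocal ranks of the ground truths."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 3, 2, 1, 15]  # illustrative 1-indexed ranks of the ground truths
print(f"Hits@1  = {hits_at_k(ranks, 1):.1%}")   # 40.0%
print(f"Hits@10 = {hits_at_k(ranks, 10):.1%}")  # 80.0%
print(f"MRR     = {mrr(ranks):.3f}")            # (1 + 1/3 + 1/2 + 1 + 1/15) / 5 = 0.580
```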