1 Introduction

In this chapter, we conduct an empirical evaluation of state-of-the-art EA approaches, which possesses the following characteristics:

Fair Comparison Within and Across Categories

Most recent studies have limited themselves to comparing only a subset of methods [4, 11, 15, 23, 27, 28, 29, 30, 33]. Moreover, different approaches follow different protocols: some use only the KG structure for alignment, while others incorporate additional information; some perform one-pass alignment of KGs, while others use an iterative (re-)training strategy. While the literature presents a direct comparison of these methods, which highlights their overall effectiveness, a more desirable and equitable approach would be to classify these methods into categories and then compare the outcomes within and across categories.

In this chapter, we incorporate most of the state-of-the-art methods to facilitate a comprehensive comparison, including the very recent approaches that have not been evaluated against other methods previously. We divide them into three groups and conduct a thorough analysis of both intra- and inter-group evaluations, enabling us to better position these methods and evaluate their effectiveness.

Comprehensive Evaluation on Representative Datasets

To assess the performance of EA systems, various datasets have been developed, which can be broadly classified into two categories: cross-lingual benchmarks, exemplified by DBP15K [21], and mono-lingual benchmarks, exemplified by DWY100K [22]. A recent study [11] highlights that the KGs in prior datasets are much denser than those in real-world scenarios, which led them to create the SRPRS dataset with entity degrees that follow a normal distribution. Despite the availability of multiple datasets, previous studies only report their results on one or two specific datasets, making it challenging to evaluate their efficacy across a wide range of potential scenarios, such as cross-lingual/mono-lingual, dense/normal, and large-scale/medium-scale KGs.

In light of this observation, this chapter performs a thorough experimental evaluation on all the prominent datasets, namely, DBP15K, DWY100K, and SRPRS, which together consist of nine pairs of knowledge graphs. The evaluation is conducted across various dimensions, including effectiveness, efficiency, and robustness.

New Dataset for Real-Life Challenges

It has been noted that current EA datasets assume that each entity in the source KG has exactly one corresponding entity in the target KG, which is an unrealistic assumption. In reality, there are entities in one KG that may not have a corresponding entity in the other KG. For example, when aligning YAGO 4 and IMDB, only a small percentage (1%) of entities in YAGO 4 are related to movies, while the remaining 99% of entities in YAGO 4 do not have any corresponding entities in IMDB. These unmatchable entities would make the EA task more challenging.

Furthermore, we notice that the mono-lingual datasets currently available for EA evaluation assume that the entities in the different KGs share the same naming convention. Therefore, the baseline method that relies on comparing the string similarity between entity names can achieve perfect accuracy. However, this assumption is often not valid in real-life scenarios, where equivalent entities in different KGs may have dissimilar names, such as “America” and “USA” for the same entity. In addition, another challenge that is often overlooked in EA is that different entities in a KG might have the same name. This can make it difficult to determine whether an entity with the name “Paris” in the source KG refers to the same entity as one with the same name in the target KG, as they could potentially refer to different entities, such as the city in France and the city in Texas.

For these reasons, we believe that the current EA datasets do not fully capture the realistic challenges posed by unmatchable entities and ambiguous entity names. To address this issue, we introduce a new dataset that more closely mirrors these practical difficulties.

The main contributions of this chapter are the following:

  • This chapter provides a comprehensive evaluation of state-of-the-art EA approaches. The evaluation includes: (1) identifying the main components of existing EA approaches and proposing a general EA framework; (2) categorizing state-of-the-art approaches into three groups and conducting detailed intra- and inter-group evaluations to better understand their strengths and weaknesses; and (3) examining these approaches in various scenarios, including cross-/mono-lingual alignment and alignment on dense/normal, large-/medium-scale data, to evaluate their effectiveness, efficiency, and robustness. The empirical results provide insights into the performance of each approach. This evaluation aims to provide a more systematic and comprehensive understanding of the current state of EA research.

  • Through our study, we gained valuable experience and insights that allow us to identify the shortcomings of current EA datasets. To address these issues, we have created a new mono-lingual dataset that accurately reflects the real-life challenges of unmatchable entities and ambiguous entity names. We anticipate that this new dataset will provide a more effective benchmark for evaluating EA systems.

2 A General EA Framework

This section presents a general EA framework that is designed to include state-of-the-art EA approaches. Through a thorough analysis of current EA approaches, we identify four primary components, as shown in Fig. 2.1:

Fig. 2.1 A general EA framework. The embedding learning module feeds the alignment module, which feeds the prediction module; the prediction module is linked to the extra information module by two-way dotted arrows, and the extra information module points to the alignment module by a dotted arrow

  • Embedding learning module. This component is designed to train embeddings for entities, which can be broadly classified into two groups: KG representation-based models such as TransE [3] and graph neural network (GNN)-based models such as the graph convolutional network (GCN) [13].

  • Alignment module. This component focuses on aligning the entity embeddings learned in the previous module across different KGs. The goal is to map these embeddings into a unified space. Margin-based loss is a common approach used in this module to ensure that the seed entity embeddings from different KGs are close to each other. Another approach used frequently is corpus fusion, which aligns KGs at the corpus level and directly embeds entities in different KGs into the same vector space.

  • Prediction module. Once the unified embedding space is established, the next step is to predict the corresponding target entity for each source entity in the test set. One common approach is to use distance-based similarity measures such as cosine similarity, Manhattan distance, or Euclidean distance between entity embeddings to calculate the similarity between entities. The target entity with the highest similarity (or lowest distance) is then selected as the counterpart.

  • Extra information module. In addition to the basic modules, some EA approaches use additional information to improve their performance. One approach is bootstrapping, where confident alignment results are used as training data for subsequent alignment iterations. Another approach is to use multi-type literal information such as attributes, entity descriptions, and entity names to complement the KG structure. These additional sources of information are shown in Fig. 2.1 as blue dashed lines.

Example. Continuing the example in Chap. 1, we illustrate these modules. The embedding learning module generates embeddings for entities in KG\({ }_{\text{EN}}\) and KG\({ }_{\text{ES}}\), respectively. Then the alignment module projects the entity embeddings into the same vector space, where the entity embeddings in KG\({ }_{\text{EN}}\) and KG\({ }_{\text{ES}}\) are directly comparable. Finally, using the unified embeddings, the prediction module aims to predict the equivalent target entity in KG\({ }_{\text{ES}}\) for each source entity in KG\({ }_{\text{EN}}\). The extra information module leverages several techniques to improve the EA performance. Concretely, the bootstrapping strategy adds the confident EA pairs detected in a previous round, e.g., (Spain, España), to the training set used in the next round. Another approach is to use additional textual information to complement the entity embeddings for alignment.

We organize the state-of-the-art approaches based on each module of the EA framework and present them in Table 2.1. For a more detailed view of the approaches, readers can refer to the Appendix. Now, we will explain how each of these modules is implemented in various state-of-the-art approaches.

Table 2.1 A summary of the EA approaches involved in this study

2.1 Embedding Learning Module

In this section, we will explain the techniques used in the embedding learning module, which utilize the KG structure to create embeddings for each entity.

Table 2.1 shows that the most commonly used models for this module are TransE [3] and GCN [13]. We will provide a brief overview of these fundamental models.

TransE

The TransE model views relationships as translations that act on the lower-dimensional representations of entities. To clarify, when presented with a relational triple \((h, r, t)\), TransE proposes that the embedded representation of the tail entity t should be similar to the embedded representation of the head entity h plus the embedded representation of the relationship r, or \(\mathbf {h} + \mathbf {r} \approx \mathbf {t}\). By doing so, the model is able to maintain the structural information of the entities and produce close representations for entities that share similar neighbors in the embedding space.
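The translation assumption can be illustrated with a small scoring function. The following is a minimal sketch, not the implementation used by any of the evaluated methods: the function name `transe_score`, the use of the Euclidean norm, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math


def transe_score(h, r, t):
    """Euclidean norm of h + r - t; a smaller value means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Toy embeddings with made-up values (illustration only).
h = [1.0, 2.0]        # head entity
r = [0.5, -1.0]       # relation, acting as a translation
t_good = [1.5, 1.0]   # exactly h + r, so the triple fits perfectly
t_bad = [4.5, 5.0]    # far away from h + r

score_good = transe_score(h, r, t_good)
score_bad = transe_score(h, r, t_bad)
```

Training pushes the score of observed triples toward zero, which is why entities that occur with similar relations and neighbors end up close to each other.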

GCN

The graph convolutional network (GCN) is a type of convolutional network that operates directly on graph-structured data. It creates embeddings for individual nodes by encoding information about the neighborhoods of those nodes. GCN takes as input feature vectors for each node in the KG, as well as a description of the graph structure in matrix form, such as an adjacency matrix. The output of the GCN is a new feature matrix. A typical GCN model consists of multiple stacked GCN layers, which allows it to capture the partial KG structure up to several hops away from the entity being processed.
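To make the role of stacked layers concrete, the sketch below implements a single GCN layer with the commonly used symmetric normalization and a ReLU activation, in plain Python. It is an illustrative assumption rather than the configuration of any evaluated method: the helper names, the identity feature and weight matrices, and the three-node path graph are made up.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 * feats * weight)."""
    n = len(adj)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    hidden = matmul(matmul(norm, feats), weight)
    return [[max(0.0, x) for x in row] for row in hidden]


# Path graph 0 - 1 - 2: node 2 is two hops away from node 0.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

h1 = gcn_layer(adj, identity, identity)  # one layer: one-hop information
h2 = gcn_layer(adj, h1, identity)        # two layers: two-hop information
```

With one-hot input features, `h1[0][2]` is zero because node 2 is not a direct neighbor of node 0, whereas `h2[0][2]` is positive: the second layer is what lets information from two hops away reach node 0.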

On top of these basic models, some methods make modifications. Regarding the TransE-based models, MTransE removes the negative triples during training, BootEA and NAEA replace the original margin-based loss function with a limit-based objective function, MuGNN uses the logistic loss to substitute for the margin-based loss, and JAPE designs a new loss function.

Concerning the GCN-based models, it has been observed that the GCN does not take into account the relations present in KGs. Therefore, as a solution, RDGCN employs the dual-primal graph convolutional neural network (DPGCNN) [17]. In contrast, MuGNN leverages an attention-based GNN model to assign varying weights to neighboring nodes. Additionally, KECG merges graph attention network (GAT) [25] and TransE to capture both the inner-graph structure and the inter-graph alignment information.

Several approaches have introduced new embedding models. For example, in RSNs, the authors contend that triple-level learning is inadequate for capturing long-term relational dependencies between entities and is insufficient for propagating semantic information among entities. Therefore, they propose using recurrent neural networks (RNNs) with residual learning to learn the long-term relational paths between entities.

Similarly, TransEdge devises a new energy function to measure the error of edge translation between entity embeddings for KG embedding learning. This method models edge embeddings using context compression and projection.

2.2 Alignment Module

In this subsection, we introduce the methods used for the alignment module, which aims to unify separated KG embeddings.

The prevailing approach is to apply a margin-based loss function on top of the embedding learning module. This loss requires the distance between the entities in a positive pair to be small and the distance between the entities in a negative pair to be large, with a margin separating the two. Positive pairs are the seed entity pairs, while negative pairs are generated by corrupting the positive pairs. Minimizing this loss merges the two separate KG embedding spaces into one vector space. Table 2.1 indicates that the majority of GNN-based methods rely on a margin-based alignment model. In contrast, GM-Align employs a matching framework that maximizes the matching probabilities of seed entity pairs to achieve alignment.
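A margin-based alignment loss of this kind can be sketched as follows. The concrete form is an assumption for illustration: we use the Manhattan distance and pair each positive example with one negative example, whereas the individual methods differ in the distance measure, the number of negatives, and how negatives are sampled.

```python
def manhattan(u, v):
    """L1 distance between two embedding vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def margin_loss(pos_pairs, neg_pairs, margin):
    """Sum of max(0, d(positive) - d(negative) + margin) over paired examples."""
    total = 0.0
    for (ps, pt), (ns, nt) in zip(pos_pairs, neg_pairs):
        total += max(0.0, manhattan(ps, pt) - manhattan(ns, nt) + margin)
    return total


# Made-up embeddings: one seed pair and two candidate corruptions of it.
pos = [([0.0, 0.0], [0.0, 1.0])]       # distance 1
neg_far = [([0.0, 0.0], [3.0, 3.0])]   # distance 6
neg_near = [([0.0, 0.0], [1.0, 1.0])]  # distance 2

loss_far = margin_loss(pos, neg_far, margin=3.0)
loss_near = margin_loss(pos, neg_near, margin=3.0)
```

The loss is zero once the negative pair is farther apart than the positive pair by at least the margin (`neg_far`), and positive otherwise (`neg_near`), so only the violating pairs contribute gradients.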

Corpus fusion is another common approach, which involves using the seed entity pairs to connect the training corpora of two KGs. Some methods, such as BootEA and NAEA, generate new triples by swapping the entities in the seed entity pairs to align the embeddings in a unified space. Concretely, given an entity pair \((u,v)\), the newly generated triples for \(G_1\) are \(T_1^{new} = \{(v,r,t)|(u,r,t)\in T_1\}\cup \{(h,r,v)|(h,r,u)\in T_1\}\) and for \(G_2\) are \(T_2^{new} = \{(u,r,t)|(v,r,t)\in T_2\}\cup \{(h,r,u)|(h,r,v)\in T_2\}\). Alternatively, an overlay graph can be built by connecting the entities in seed entity pairs with edges, while the remaining entities are connected with edges based on their similarity or co-occurrence in the training corpus. Entity embeddings are then learned using the adjacency matrix of the overlay graph and the training corpus.
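The two set definitions for triple swapping translate directly into code. The sketch below is illustrative only: the function name `swap_triples`, the relation labels, and the toy triples are hypothetical and do not come from any dataset used in this chapter.

```python
def swap_triples(triples, old, new):
    """Copy every triple that mentions `old`, replacing `old` with `new`."""
    generated = set()
    for h, r, t in triples:
        if h == old:            # (old, r, t) yields (new, r, t)
            generated.add((new, r, t))
        if t == old:            # (h, r, old) yields (h, r, new)
            generated.add((h, r, new))
    return generated


# Toy triples with hypothetical relation names.
T1 = {("Madrid", "capital_of", "Spain"), ("Spain", "borders", "France")}
T2 = {("Madrid", "capital_de", "España")}
u, v = "Spain", "España"   # a seed entity pair (u, v)

T1_new = swap_triples(T1, u, v)   # triples of G1 with u replaced by v
T2_new = swap_triples(T2, v, u)   # triples of G2 with v replaced by u
```

Training on the original and the generated triples together forces \(u\) and \(v\) to satisfy the same translation constraints, which pulls their embeddings toward each other.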

Some earlier works proposed transition functions to map the embedding vectors from one KG to another, while others utilized additional information such as entity attributes to align the entity embeddings into a unified space.

2.3 Prediction Module

This module typically involves computing similarity scores between source and target entity embeddings and selecting the target entity with the highest score as the alignment.

To align entities, the most common method is to generate a ranked list of target entities for each source entity based on a specific distance or similarity measure between their embeddings. The measures commonly used include Euclidean distance, Manhattan distance, and cosine similarity. The top-ranked entity in the list is then considered a match for the source entity. It is worth noting that a similarity score can be converted into a distance score by subtracting it from 1, and vice versa. In contrast, in GM-Align, the entity with the highest matching probability is aligned with the source entity.
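The ranking step can be sketched in a few lines. This is a minimal illustration using cosine similarity; the function names, the target identifiers, and the two-dimensional vectors are made up, and the code assumes that no embedding is the zero vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_targets(source_vec, target_vecs):
    """Return target identifiers ordered from most to least similar."""
    return sorted(target_vecs,
                  key=lambda name: cosine_similarity(source_vec, target_vecs[name]),
                  reverse=True)


source = [1.0, 0.0]
targets = {"t1": [0.9, 0.1], "t2": [0.0, 1.0], "t3": [-1.0, 0.0]}

ranking = rank_targets(source, targets)
prediction = ranking[0]   # the top-ranked target is taken as the counterpart
```

Ranking by ascending distance, e.g., `1 - cosine_similarity(...)`, produces the same order, which is the conversion between similarity and distance noted above.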

Additionally, a recent method called CEA observes that there is a correlation between different entity alignment decisions, meaning that if a target entity is already matched to a source entity with high confidence, it is less likely to be matched to another source entity. To capture this correlation, CEA models it as a stable matching problem, and addresses the problem based on the distance measure, which decreases the number of mismatches and improves the accuracy of entity alignment.
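To illustrate how a stable matching formulation differs from independent top-1 prediction, the sketch below applies the textbook deferred acceptance procedure to a similarity matrix. It is not CEA's implementation; the function name, the assumption of a square matrix (equally many source and target entities), and the similarity values are ours.

```python
def stable_match(sim):
    """Deferred acceptance on a square matrix; sim[i][j] scores source i vs. target j."""
    n = len(sim)
    # Each source ranks the targets from most to least similar.
    prefs = [sorted(range(n), key=lambda j: -sim[i][j]) for i in range(n)]
    next_choice = [0] * n      # position in prefs[i] of the next target to try
    holder = {}                # target -> source currently matched to it
    free = list(range(n))      # sources that are still unmatched
    while free:
        i = free.pop()
        j = prefs[i][next_choice[i]]
        next_choice[i] += 1
        if j not in holder:
            holder[j] = i                  # target was free: accept
        elif sim[i][j] > sim[holder[j]][j]:
            free.append(holder[j])         # target prefers the newcomer
            holder[j] = i
        else:
            free.append(i)                 # rejected: try the next target later
    return {src: tgt for tgt, src in holder.items()}


# Both sources like target 0 best, so independent top-1 prediction collides.
sim = [[0.90, 0.80],
       [0.85, 0.10]]

greedy = [max(range(len(row)), key=lambda j: row[j]) for row in sim]
matching = stable_match(sim)
```

Here `greedy` assigns target 0 to both sources, while the stable matching gives target 0 to source 0, which it fits better, and moves source 1 to target 1, so every target is used at most once.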

2.4 Extra Information Module

In this subsection, we discuss the methods used in the extra information module.

One approach to improving EA performance is the bootstrapping strategy, also known as iterative training or self-learning. This approach iteratively labels highly probable EA pairs and adds them to the training set for the next round, which gradually improves the alignment results. Several methods are based on this approach and differ in how they select confident EA pairs. ITransE identifies the most similar nonaligned target entity for each nonaligned source entity and, if the similarity score between them exceeds a certain threshold, regards them as a confident pair. BootEA, NAEA, and TransEdge follow a similar approach in which they calculate the probability of each source entity being aligned with every target entity. They only consider pairs with probability scores above a certain threshold and use a maximum likelihood matching algorithm with a 1-to-1 mapping constraint to generate a set of confident EA pairs.
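The threshold-based selection described for ITransE can be sketched as follows. This is an illustration under our own assumptions rather than the authors' code: the function name, the nested-dictionary representation of similarity scores, the scores themselves, and the threshold values are hypothetical. The sketch also does not enforce the 1-to-1 constraint that BootEA, NAEA, and TransEdge add.

```python
def confident_pairs(sim, threshold, aligned_sources=(), aligned_targets=()):
    """For each nonaligned source, keep its best nonaligned target if it beats the threshold."""
    pairs = []
    for source, row in sim.items():
        if source in aligned_sources:
            continue
        candidates = {t: s for t, s in row.items() if t not in aligned_targets}
        if not candidates:
            continue
        best = max(candidates, key=candidates.get)
        if candidates[best] > threshold:
            pairs.append((source, best))
    return pairs


# Made-up similarity scores between source and target entities.
sim = {
    "Spain": {"España": 0.95, "Madrid": 0.20},
    "Madrid": {"España": 0.30, "Madrid": 0.55},
}

round_one = confident_pairs(sim, threshold=0.9)
# Pairs found in round one are treated as aligned in the next round.
round_two = confident_pairs(sim, threshold=0.5,
                            aligned_sources={"Spain"},
                            aligned_targets={"España"})
```

In an actual bootstrapping loop the embeddings would be retrained between rounds with the newly added pairs, so the similarity scores themselves would change; the fixed `sim` above only illustrates the selection step.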

Several methods utilize multi-type literal information to improve alignment by providing a more comprehensive view. Commonly used types of information are the attributes associated with entities. Some methods, such as JAPE, GCN-Align, and HMAN, only consider the statistical characteristics of the attribute names. Other methods, such as AttrE and M-Greedy, generate attribute embeddings by encoding the characters of attribute values. AttrE uses attribute embeddings to unify entity embeddings into the same space, while M-Greedy uses them to complement the entity embeddings.

There is a growing tendency toward the use of entity names, either as input features for learning entity embeddings or as individual features that capture the semantic and string-level aspects of the names. Specifically, GM-Align, RDGCN, and HGCN utilize entity names as input features to learn entity embeddings. On the other hand, CEA leverages both semantic and string-level aspects of entity names as individual features for alignment. Furthermore, KDCoE and the description-enhanced version of HMAN encode entity descriptions into vector representations and treat them as new features for alignment.

The availability of multi-type information is not always guaranteed in knowledge graph alignment. Some types of information like entity names are commonly available in most scenarios, while others like entity descriptions are often missing in many knowledge graphs. Additionally, due to the graph-based nature of knowledge graph alignment, most existing alignment datasets have limited textual information, which makes some approaches like KDCoE, M-Greedy, and AttrE less applicable.

3 Experiments and Analysis

This section presents an in-depth empirical study.

3.1 Categorization

According to the main components, we can broadly categorize current methods into three groups: Group I, which merely utilizes the KG structure for alignment, Group II, which harnesses the iterative training strategy to improve alignment results, and Group III, which utilizes information in addition to the KG structure. We introduce and compare these three categories using the example in Chap. 1.

Group I

This category of methods solely relies on the structure of the knowledge graph to align entities. Consider again the example in Chap. 1. In KG\({ }_{\text{EN}}\), the entity Alfonso Cuarón is connected to the entity Mexico and three other entities, while Spain is connected to Mexico and one more entity. The same structural information can be observed in KG\({ }_{\text{ES}}\). Since we already know that Mexico in KG\({ }_{\text{EN}}\) is aligned to Mexico in KG\({ }_{\text{ES}}\), by using the KG structure, it is easy to conclude that the equivalent target entity for Spain is España, and the equivalent target entity for Alfonso Cuarón is Alfonso Cuarón.

Group II

This category of approaches is known as iterative or self-learning strategies, where likely entity alignment pairs are labeled iteratively as the training set for the next round, leading to a progressive improvement in the alignment results. They can also be categorized into Group I or III, depending on whether they merely use the KG structure or not. Nevertheless, they are all characterized by the use of the bootstrapping strategy.

We still use the example in Chap. 1 to illustrate the bootstrapping mechanism. As shown in Fig. 1.1, by utilizing the KG structure, it is straightforward to identify that the source entity Spain is aligned with the target entity España, and the source entity Alfonso Cuarón is aligned with the target entity Alfonso Cuarón. The source entity Madrid does not have a clear target entity, as both Roma(ciudad) and Madrid in the target KG have the same structural information as the source entity. This is because they are both two hops away from the seed entity and have a degree of 1. To address this problem, bootstrapping-based approaches perform multiple rounds of alignment, using the confident entity pairs from the previous round as seed pairs for the next round. More specifically, they consider the entity pairs detected from the first round, i.e., (Spain, España) and (Alfonso Cuarón, Alfonso Cuarón), as the seed pairs in the following rounds. Consequently, in the second round, for the source entity Madrid, only the target entity Madrid shares the same structural information with it—two hops away from the seed entity pair (Mexico, Mexico) and one hop away from the seed entity pair (Spain, España).

Group III

Utilizing the KG structure for alignment when presented with graph-formatted input data sources is a natural choice; however, KGs also contain a wealth of semantic information that can be used to supplement structural data. These methods stand out by taking advantage of additional information beyond the KG structure.

As seen in Chap. 1, even with the KG structure and bootstrapping strategy, it is still difficult to identify the target entity for the source entity Gravity(film), since its structural information (connected to the entity Alfonso Cuarón and with degree 2) is shared by two target entities Gravity(película) and Roma(película). However, by combining the KG structure with the names in the identifiers, it is easy to differentiate between the two entities and correctly identify Gravity(película) as the target entity for Gravity(film).

3.2 Experimental Settings

The datasets and metrics utilized for assessment were previously introduced in Chap. 1. In the following section, we will elaborate on the techniques and parameter configurations used for comparison.

Methods to Compare

We will compare the previously mentioned methods, with the exception of KDCoE and MultiKE, due to the absence of entity descriptions in the evaluation benchmarks. Additionally, we will exclude AttrE since it is only functional in the mono-lingual context. Furthermore, we will provide the outcomes of the structure-only versions of JAPE and GCN-Align, specifically JAPE-Stru and GCN-Align(SE).

As previously stated in Chap. 1, to showcase the ability of ER methods in addressing EA, we will also compare with various name-based heuristics. These approaches are commonly used in related tasks [8, 18, 19], as they heavily depend on the resemblance between object names to identify equivalences. Concretely, we use the following:

  • Lev aligns entities using the Levenshtein distance [14], a string metric that measures the dissimilarity between two sequences.

  • Embed aligns entities based on the cosine similarity between the averaged word embeddings, or name embeddings, of two entities. In accordance with [31], we utilize the pre-trained fastText embeddings [1] as word embeddings. For multilingual KG pairs, we use the MUSE word embeddings [7].
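The Lev heuristic is simple enough to sketch in full. The code below uses the standard dynamic-programming formulation of the Levenshtein distance and aligns each source name to the closest target name. The function names and the example names are assumptions for illustration; the Embed heuristic is not shown because it depends on pre-trained word embeddings.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def align_by_name(source_names, target_names):
    """Map each source name to the target name with the smallest edit distance."""
    return {s: min(target_names, key=lambda t: levenshtein(s, t))
            for s in source_names}


alignment = align_by_name(["Mexico", "Madrid"], ["México", "Madrid", "Roma"])
```

Names that differ by a single accented character, such as Mexico and México, are one edit apart, which is consistent with the heuristic working well between closely related languages and poorly when the scripts differ.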

Implementation Details

The experiments were performed using a personal computer equipped with an Intel Core i7-4790 CPU, an NVIDIA GeForce GTX TITAN X GPU, and 128 GB of memory. All programs were implemented in Python.

To ensure reproducibility, we employ the source codes provided by the authors and utilize the parameter settings specified in their original papers to execute the models. For datasets not included in the original papers, we use the same parameter settings as those employed in the original experiments to ensure consistency.

All of the evaluated methods provide results on the DBP15K dataset in their original papers, with the exception of MTransE and ITransE. We compare our implemented results with the reported results from the original papers. If the difference between our results and the reported results falls outside of a reasonable range, which we define as \(\pm 5\%\) of the original results, we mark the methods with an asterisk \({ }^*\). It is worth noting that there should not be a significant difference theoretically since we use the same source codes and parameter settings for implementation. For the SRPRS dataset, only RSNs reports results in its original paper [11]. We conduct experiments on all methods for SRPRS and present the results in Table 2.3. For the DWY100K dataset, we run all approaches and compare the performance of BootEA, MuGNN, NAEA, KECG, and TransEdge with the results provided in their original papers. We mark methods with notable differences with an asterisk \({ }^*\).

On each dataset, we highlight the best results within each group by denoting them in bold. We also mark the best Hits@1 performance among all approaches with \({ }^\blacktriangle \) since this metric is the most crucial and can best reflect the effectiveness of EA methods.

3.3 Results and Analyses on DBP15K

The experimental results on the cross-lingual dataset DBP15K can be found in Table 2.2. Note that the Hits@10 and MRR results of CEA are missing in this table since it directly generates aligned entity pairs instead of returning a ranked list of entities. We then compare the performance both within each category and across categories.
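The dependence of Hits@10 and MRR on a ranked list can be seen from how these metrics are usually computed. The sketch below follows the standard definitions, with each entry of `ranks` being the 1-based position of the correct target entity in the ranked list of one source entity; the function names and the rank values are made up for illustration.

```python
def hits_at_k(ranks, k):
    """Fraction of source entities whose correct target is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the correct target over all source entities."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Made-up ranks of the correct target for four source entities.
ranks = [1, 2, 10, 4]

hits1 = hits_at_k(ranks, 1)
hits10 = hits_at_k(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
```

A method that outputs only one target per source entity still determines Hits@1, but the ranks beyond the first position are unknown, so Hits@10 and MRR cannot be derived.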

Table 2.2 Experimental results on DBP15K
Table 2.3 Experimental results on SRPRS
Table 2.4 Experimental results on DWY100K and DBP-FB

Group I

Out of the methods that only utilize the KG structure, RSNs consistently obtains superior outcomes in Hits@1 and MRR metrics. This success can be attributed to its ability to capture long-term relational paths, which offer more structural indications for alignment. The performance of MuGNN and KECG is equivalent, which can be partly attributed to their shared goal of completing KGs and reconciling structural disparities. While MuGNN utilizes AMIE+ [10] to induce rules for completion, KECG harnesses TransE to implicitly achieve this aim.

The remaining three techniques achieve comparatively lower outcomes. MTransE and JAPE-Stru both leverage TransE to capture the KG structure, but JAPE-Stru outperforms MTransE because the latter models KG structures in different vector spaces, resulting in information loss when translating between them [21]. GCN-Align(SE), in turn, attains better results than both MTransE and JAPE-Stru.

Group II

Among these methods, ITransE obtains notably poorer outcomes, which can be attributed to the information loss during embedding space translation and its simpler bootstrapping strategy as described in Sect. 2.4. BootEA, NAEA, and TransEdge all utilize the same bootstrapping strategy. BootEA achieves slightly inferior performance compared to reported outcomes, while NAEA performs significantly worse. In theory, NAEA should outperform BootEA as it employs an attention mechanism to capture neighbor-level information. On the other hand, TransEdge employs an edge-centric embedding model to capture structural information, resulting in more accurate entity embeddings and hence better alignment outcomes.

Group III

Both JAPE and GCN-Align utilize attributes to enhance entity embeddings, and their outcomes surpass those of their structure-only counterparts, demonstrating the utility of attribute information. HMAN, which incorporates relation types as input in addition to attributes, outperforms both JAPE and GCN-Align.

The remaining four methods utilize entity names instead of attributes for alignment and achieve superior outcomes. Among them, RDGCN and HGCN attain similar results, surpassing GM-Align. This can be attributed to their use of relations to optimize entity embedding learning, which was mostly overlooked in prior GNN-based EA models. However, CEA achieves the best performance in this group by effectively utilizing and merging available features.

Name-Based Heuristics

Regarding KG pairs with closely related languages, Lev achieves encouraging results, but it is ineffective on distantly related language pairs such as \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) and \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\). On the other hand, Embed attains consistent performance on all KG pairs.

Inter-Category Comparison

Across all datasets, CEA obtains the best Hits@1 performance, while TransEdge, RDGCN, and HGCN achieve the top results for other metrics. This confirms the effectiveness of incorporating additional information such as the bootstrapping strategy and textual information.

The performance of name-based heuristics, such as Embed, is highly competitive, surpassing most methods that do not utilize entity name information in terms of Hits@1. This indicates that conventional ER solutions can still be effective for the EA task. However, Embed still lags behind most EA methods that integrate entity name information, such as RDGCN, HGCN, and CEA.

We can also observe that methods from the first two groups, such as TransEdge, achieve consistent results across all three KG pairs. In contrast, methods that utilize entity name information, such as HGCN, achieve much better results on KG pairs with closely related languages (\({\mathtt {DBP15K}_{\mathtt {FR-EN}}}\)) than those with distantly related languages (\({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\)). This indicates that language barriers can hinder the use of textual information, which can, in turn, undermine the overall effectiveness of the method.

3.4 Results and Analyses on SRPRS

The results on SRPRS are presented in Table 2.3. Many of the observations made on DBP15K also hold here and are not repeated. Instead, we focus on the differences from DBP15K and on the patterns specific to this dataset.

Group I

The results show that the performance of the methods on the relatively sparse KGs in SRPRS is lower compared to DBP15K. However, RSNs outperforms the other methods, closely followed by KECG. It is important to note that while MuGNN achieves decent results on DBP15K, it performs much worse on SRPRS because there are no aligned relations on SRPRS, which results in the failure of rule transferring. Additionally, the sparser KG structure leads to a smaller number of detected rules.

Group II

Among these solutions, TransEdge still yields consistently superior results.

Group III

Compared with the structure-only variants GCN-Align(SE) and JAPE-Stru, incorporating attributes improves the results of GCN-Align but does not improve those of JAPE. This is likely because the dataset has a relatively smaller number of attributes. On the other hand, using entity names significantly improves the results. It is worth noting that CEA achieves ground-truth performance on \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\).

Name-Based Heuristics

On the mono-lingual KG pairs, which are drawn from DBpedia, Wikidata, and YAGO, Lev and Embed achieve ground-truth performance: the equivalent entities in different KGs have identical names according to their entity identifiers, so a simple comparison of these names suffices. Additionally, Lev shows promising results on cross-lingual KG pairs with closely related languages.

Inter-Category Comparison

In contrast to DBP15K, methods that incorporate entity names (Group III) perform much better on SRPRS. This is likely due to two reasons: (1) the KG structure is less effective on this dataset, which is much sparser compared to DBP15K, and (2) the entity name information plays a significant role on both mono-lingual and cross-lingual datasets with closely related language pairs, where the names of equivalent entities are very similar.

3.5 Results and Analyses on DWY100K

Table 2.4 shows the results on the large-scale mono-lingual dataset DWY100K. We were unable to obtain the results of RDGCN and NAEA, as they require an extremely large amount of memory in our experimental environment.

The methods in the first group perform significantly better on this dataset, which can be attributed to the relatively richer KG structure (as shown in Fig. 1.2 in Chap. 1). Among them, MuGNN and KECG achieve over 60% Hits@1 on \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) and over 70% on \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\), due to the rich structure that facilitates the process of KG completion, ultimately leading to improved EA performance.

The approaches in the second group achieve further improvement in results with the aid of the iterative training strategy. Note, however, that the results reported in the original papers of BootEA and TransEdge are slightly higher than the values we obtained. Among the methods in Group III, CEA achieves ground-truth performance. As on SRPRS, the name-based heuristics Lev and Embed also achieve ground-truth results.

3.6 Efficiency Analysis

In order to provide a comprehensive evaluation, we report the average running time of each method on each dataset in Table 2.5, which allows us to compare the efficiency of different state-of-the-art solutions and provides insights into their scalability. We acknowledge that different parameter settings, such as the learning rate and number of epochs, may influence the final time cost. However, we aim to provide a general understanding of the efficiency of these methods by adopting the parameters reported in their original papers. As previously mentioned, we were unable to obtain the results of RDGCN and NAEA on DWY100K due to their requirement for an extremely large amount of memory space in our experimental environment.

Table 2.5 Averaged time cost on each dataset (in seconds)

On DBP15K and SRPRS, GCN-Align(SE) is the most efficient method with consistent alignment performance, followed closely by JAPE-Stru and ITransE. Most of the other methods have similar time costs (ranging from 1,000 to 10,000 seconds), except for NAEA and GM-Align, which require significantly longer running times.

The larger size of the DWY100K dataset leads to a significant increase in the time costs of all methods. MuGNN, KECG, and HMAN cannot run on GPUs due to memory limitations, and the authors of the original papers suggest running them on CPUs, which results in longer running times. Only three methods can complete the alignment process within 10,000 seconds, while most of the other approaches take between 10,000 and 100,000 seconds. In particular, GM-Align requires 5 days to generate the results, indicating that current state-of-the-art EA methods still have low efficiency when dealing with very large-scale data. Some methods, such as NAEA, RDGCN, and GM-Align, have poor scalability.

3.7 Comparison with Unsupervised Approaches

There exist some unsupervised methods aimed at aligning KGs that do not employ representation learning methodologies. To ensure the study’s comprehensiveness, we compare with a typical system, namely, PARIS [20]. PARIS relies on the comparison of similarities between literals and employs a probabilistic algorithm to align entities jointly in an unsupervised manner. We also evaluate AgreementMakerLight (AML) [9], an unsupervised system for ontology alignment that leverages KGs’ background knowledge.

The F1 score is employed as the evaluation metric since PARIS and AML do not produce a target entity for every source entity, thereby addressing cases where certain entities do not have a corresponding match in the other KG. The F1 score is calculated as the harmonic mean between precision (i.e., the number of correctly aligned entity pairs divided by the number of source entities for which an approach returns a target entity) and recall (i.e., the number of source entities for which an approach returns a target entity divided by the total number of source entities).
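The metric definitions above can be written down directly. The sketch below follows the wording of this section literally, including the definition of recall as the share of source entities for which a target is returned. The function name and the toy data are assumptions made for illustration.

```python
def evaluate(predictions: dict, gold: dict, num_source_entities: int):
    """Return (precision, recall, f1) under the definitions given in the text.

    predictions: source entity -> predicted target entity, containing only
                 the source entities for which the system returned an answer.
    gold:        source entity -> correct target entity.
    """
    returned = len(predictions)
    correct = sum(1 for src, tgt in predictions.items() if gold.get(src) == tgt)

    # Precision: correctly aligned pairs / source entities with a returned target.
    precision = correct / returned if returned else 0.0
    # Recall: source entities with a returned target / all source entities.
    recall = returned / num_source_entities if num_source_entities else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with made-up entities: 4 source entities, 3 answers, 2 correct.
gold = {"a": "x", "b": "y", "c": "z", "d": "w"}
predictions = {"a": "x", "b": "y", "c": "q"}
precision, recall, f1 = evaluate(predictions, gold, num_source_entities=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In the toy example the precision is 2/3 and the recall is 3/4, which gives an F1 score of 12/17.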

Figure 2.2 illustrates that the overall performance of PARIS and AML is marginally lower than that of CEA. Despite CEA exhibiting more robust performance, it depends on training data (seed entity pairs) that may not be present in actual KGs. In contrast, unsupervised systems do not necessitate any training data and can still produce highly favorable outcomes. Furthermore, the results from PARIS and AML demonstrate that ontology information does, in fact, enhance the alignment outcomes.

Fig. 2.2 F1 scores of PARIS, AML, and CEA on EA datasets (grouped column chart over nine datasets; AML is reported only on EN-FR and EN-DE, with F1 scores of approximately 0.930 and 0.960, respectively)

3.8 Module-Level Evaluation

To obtain a better understanding of the techniques employed in various modules, we conduct an evaluation at the module level and present the associated experimental outcomes. More specifically, we select the representative methods from each module and create feasible combinations. By comparing the performance of different combinations, we can obtain a more precise assessment of the efficacy of various methods in these modules.

Regarding the embedding learning module, we use GCN and TransE. As for the alignment module, we adopt the margin-based loss function (Mgn) and the corpus fusion strategy (Cps). Following current approaches, we combine GCN with Mgn, and TransE with Cps, where the parameters are tuned in accordance with GCN-Align and JAPE, respectively. In the prediction module, we use the Euclidean distance (Euc), the Manhattan distance (Manh), and the cosine similarity (Cos). With regard to the extra information module, we denote the use of the bootstrapping strategy as B by implementing the iterative method in [32]. The use of multi-type information is represented as Mul, and we adopt the semantic and string-level features of entity names as in CEA.
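As an illustration of the prediction module, the sketch below implements the three measures (Euc, Manh, Cos) on plain Python lists and picks, for one source embedding, the best target under each measure. The function names, the dictionary layout, and the toy vectors are assumptions. The vectors are chosen so that the measures disagree, and they are assumed to be non-zero so that the cosine similarity is defined.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan (L1) distance; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine similarity; larger means more similar (non-zero vectors assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(source_vec, target_vecs, measure):
    """Return the id of the best target entity under the chosen measure."""
    if measure == "Cos":
        # Similarity: take the maximum.
        return max(target_vecs,
                   key=lambda t: cosine_similarity(source_vec, target_vecs[t]))
    distances = {"Euc": euclidean, "Manh": manhattan}
    if measure not in distances:
        raise ValueError(f"unknown measure: {measure}")
    # Distance: take the minimum.
    return min(target_vecs,
               key=lambda t: distances[measure](source_vec, target_vecs[t]))


# Toy 2-d embeddings (made up for illustration).
source_vec = [1.0, 0.0]
target_vecs = {"t1": [2.0, 0.0], "t2": [0.9, 0.3]}
for name in ("Euc", "Manh", "Cos"):
    print(name, predict(source_vec, target_vecs, name))
```

Here t1 points in exactly the same direction as the source vector but is farther away, so Cos selects t1 while Euc and Manh select t2. This is one simple way in which the choice of measure can change the predicted alignment.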

The Hits@1 results of 24 combinations are shown in Table 2.6. The addition of the bootstrapping strategy and/or textual information clearly improves the overall performance. Regarding the embedding model, GCN+Mgn appears to be more robust and to perform better than TransE+Cps. The choice of distance measure also affects the outcomes: compared with Manh and Euc, Cos leads to better performance on TransE-based models but to worse results on GCN-based models. Nevertheless, once entity name embeddings are integrated, Cos yields consistently superior performance.

Table 2.6 Hits@1 results of module-level evaluation

Notably, GCN+Mgn+Cos+Mul+B (referred to as Comb.) attains the best performance, indicating that a simple combination of techniques from existing modules can lead to highly favorable alignment outcomes.

3.9 Summary

We summarize the major findings from the experimental results.

EA vs. ER

EA is distinct from other related tasks since it operates on graph-structured data. As a result, all current EA solutions utilize the KG structure to create entity embeddings for aligning entities, which can produce favorable results on DBP15K and DWY100K. Nonetheless, depending solely on the KG structure has certain limitations, as there are long-tail entities with minimal structural information or entities that have similar neighboring entities but do not refer to the same real-world object. To address this issue, recent studies propose incorporating textual information, leading to better performance. This raises the question of whether ER approaches can handle the EA task, given that the texts associated with entities are what conventional ER solutions typically rely on.

We answer this question by including, for comparison, the name-based heuristics that have been used in most typical ER methods. The experimental results reveal that: (1) ER solutions can indeed be applied to EA, but their performance is heavily reliant on the textual similarity between entities; (2) while ER solutions can surpass the majority of structure-based EA methods, they are still surpassed by EA techniques that use name information to supplement entity embeddings; and (3) incorporating the primary concept in ER, namely utilizing literal similarity to identify the equivalence between entities, into EA methods is a promising direction that is worth exploring (as demonstrated by CEA).

Influence of Datasets

Figure 2.3 illustrates that the performance of EA methods varies significantly across different datasets. In general, dense datasets such as DBP15K and DWY100K tend to yield relatively better results than sparse ones. Moreover, results on mono-lingual KGs are better than those on cross-lingual ones (DWY100K vs. DBP15K). Notably, on all mono-lingual datasets, the best-performing method CEA, as well as the name-based heuristics Lev and Embed, achieve 100% accuracy. This is because these datasets are sourced from DBpedia, Wikidata, and YAGO, where equivalent entities in different KGs have identical names based on their entity identifiers, making it possible to obtain ground-truth results through a simple comparison of these names. However, these datasets do not reflect the real-life challenge of ambiguous entity names. To address this, we introduce a new mono-lingual benchmark, which will be discussed in the following section.

Fig. 2.3 The box plot of Hits@1 of all methods on different datasets

3.10 Guidelines and Suggestions

In this subsection, we provide guidelines and suggestions for potential users of EA approaches.

Guidelines for Practitioners

There are several considerations that may impact the selection of EA models. We have identified four of the most prevalent factors and provide the following recommendations:

  • Input information. If the input data only includes structured information from a knowledge graph, one may need to decide between using methods from Group I or Group II. On the other hand, if there is a lot of additional information available, one may prefer to use methods from Group III to make the most of these features and generate more trustworthy signals for alignment.

  • The scale of data. As explained in Sect. 2.3.6, certain cutting-edge techniques may not be scalable enough. Thus, it is important to consider the scale of the data before deciding on an alignment approach. For very large datasets, it may be wise to utilize simpler yet effective models, like GCN-Align, in order to minimize computational burden.

  • The objective of alignment. When the primary focus is on aligning entities, it may be preferable to employ models based on GNNs because they tend to be more resilient and adaptable. However, if there are other tasks involved, such as aligning relations, it might be more appropriate to use KG representation-based methods as they inherently learn both entity and relation representations. Additionally, recent research studies [23, 27] indicate that relations can aid in aligning entities.

  • The trade-off in bootstrapping. The bootstrapping process is a useful technique that can enhance the training set gradually and lead to improved alignment results. However, it can be susceptible to the problem of error propagation, which may introduce incorrectly matched entity pairs and amplify their negative effects in subsequent rounds. Additionally, it can be time-consuming. Therefore, when deciding whether to utilize the bootstrapping strategy, it is important to assess the difficulty of the datasets. If the datasets are relatively straightforward, with ample literal information and dense KG structures, utilizing the bootstrapping strategy may be a more suitable option. Otherwise, one should exercise caution when using this approach.

Suggestions for Future Research

We also discuss some open problems that are worthy of exploration in the future:

  • EA for long-tail entities. In actual knowledge graphs, most entities have few connections to other entities, while only a small number of entities have many connections. Aligning these long-tail entities is important for achieving good overall alignment performance, but current research on entity alignment has largely ignored them. A recent study [32] addresses this issue by using additional information to help align long-tail entities and by reducing their number through a KG completion process integrated into iterative self-training. However, there is still considerable room for improvement in this area.

  • Multi-modal EA. Entities can be linked to information in various forms, including texts, images, and even videos. Therefore, it is necessary to explore multi-modal entity alignment, which involves aligning entities that have multiple modalities of associated information. This topic is worth further research [16].

  • EA in the open world. Most existing EA methods [12] operate under a closed-domain assumption, meaning that every entity in the source KG has a corresponding entity in the target KG. However, in real-world scenarios, there are always entities that cannot be matched. Moreover, labeled data, which is often necessary for state-of-the-art approaches, may not be accessible. Therefore, it is important to investigate EA in open-world settings, where unmatchable entities and limited labeled data are taken into account.

4 New Dataset and Further Experiments

As mentioned earlier, in current mono-lingual datasets, equivalent entities in different knowledge graphs have the same names based on their entity identifiers, which allows for accurate results through simple name comparison (with 100% precision on \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\)). However, in real-life KGs, entity identifiers are often not human-readable; instead, they are linked to one or more human-readable names. For instance, Freebase identifies the capital of France as /m/05qtj, which is linked to names like “Paris” or “The City of Light.” Retrieving these names and matching entities that share the same name can still yield a precision of 100% on datasets such as \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\). In actual knowledge graphs, however, different entities with different identifiers can share the same name. For instance, the Freebase entities /m/05qtj (the capital of France) and /m/0h0_x (the king of Troy) share the name “Paris,” as do 20 cities in the USA. Consequently, an entity named “Paris” in the source knowledge graph is not necessarily the same as an entity with that name in the target knowledge graph: one might refer to the city in France, while the other might refer to the king of Troy. Matching on entity names alone therefore does not work in real-life knowledge graphs. The complication is significant: in YAGO 3, about 34% of entities share a name with one or more other entities. Yet this problem is not fully reflected in the commonly used mono-lingual datasets for EA.

A second issue with EA datasets is that they assume that for each entity in the source KG, there is exactly one corresponding entity in the target KG. This means that an EA approach can map each source entity to the most similar target entity. However, this is not a realistic scenario since KGs in real life may contain entities that are not present in other KGs. For instance, when aligning YAGO 3 and DBpedia, some entities may appear in YAGO 3 but not in DBpedia and vice versa. This problem is even more severe for KGs that draw data from various sources, such as YAGO 4 and IMDB. In YAGO 4, only 1% of entities are related to movies, while the remaining 99% are unrelated to IMDB entities, such as universities and smartphone brands. As a result, these entities have no matches in IMDB, and this problem is not addressed in current EA datasets.

We thus observe that the existing datasets for EA are an oversimplification of the real-life problem. Our solution is to create a new dataset that reflects these challenges. We anticipate that this dataset will lead to improved EA models that can handle more demanding problem scenarios and will provide a clearer research direction for the community. In this section, we describe the development of the new dataset and present our experimental findings on it.

4.1 Dataset Construction

To reflect the difficulty of using entity names, we choose Freebase [2] as our target knowledge graph because it represents entities using indecipherable identifiers (i.e., Freebase MIDs), and different entities may have the same name. As to the source knowledge graph, we utilize DBpedia, which contains external links to Freebase that can be regarded as gold standards. The detailed process of constructing the new dataset is explained below:

Determining the Source Entity Set

We utilize the disambiguation information available in DBpedia to gather entities that have the same disambiguation term and create the entity set for the source knowledge graph. For example, for the ambiguous term Apple, the disambiguation records consist of entities such as Apple Inc. and Apple (fruit), both of which are included in the source entity set.

Determining Links and the Target Entity Set

Next, we utilize the external links between DBpedia and Freebase to obtain the entities in Freebase that correspond to the source entities and create the entity set for the target knowledge graph. These external links are considered as the gold standards. It should be noted that the entities in the target knowledge graph are identified using Freebase MIDs and multiple entities may have the same name, such as Apple. To retrieve the name for each entity, we use the label triples.

Retrieving Triples

Once the entity sets for the source and target knowledge graphs are determined, we extract the relational and attributive triples involving these entities from their respective knowledge graphs.

Refining Links and Entity Sets

Following the approach in previous work [21, 22], we retain only the links whose source and target entities are involved in at least one triple in their respective knowledge graphs, resulting in a total of 25,542 links. The entity sets are adjusted accordingly, including entities that participate in triples but not in links. Ultimately, there are 29,861 entities in the source knowledge graph, of which 4,319 cannot be matched, and 25,542 matchable entities in the target knowledge graph. Consistent with existing datasets, 30% of the links and unmatchable entities are utilized as the training set. For additional statistics on the dataset, please refer to Chap. 1.
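The refinement step can be sketched as a simple filter. The code below is a hedged illustration and not the script used to build the dataset: it assumes that triples are (head, relation, tail) tuples whose heads and tails are entity ids, that links are (source, target) pairs, and that the function name and the toy identifiers are free choices. The last lines check that the counts quoted above are mutually consistent, since 29,861 source entities minus 4,319 unmatchable ones leaves exactly the 25,542 linked entities.

```python
def refine_links(links, source_triples, target_triples):
    """Keep links whose endpoints both occur in at least one triple.

    Returns the retained links and the source entities that take part in
    triples but in no retained link (treated here as unmatchable).
    """
    source_seen = {e for head, _, tail in source_triples for e in (head, tail)}
    target_seen = {e for head, _, tail in target_triples for e in (head, tail)}

    kept = [(s, t) for s, t in links if s in source_seen and t in target_seen]

    linked_sources = {s for s, _ in kept}
    unmatchable_sources = source_seen - linked_sources
    return kept, unmatchable_sources


# Toy data with made-up identifiers.
links = [("src:A", "tgt:1"), ("src:B", "tgt:2"), ("src:C", "tgt:3")]
source_triples = [("src:A", "rel", "src:B"), ("src:B", "rel", "src:D")]
target_triples = [("tgt:1", "rel", "tgt:2")]

kept, unmatchable = refine_links(links, source_triples, target_triples)
print(kept)         # the link for src:C is dropped: neither endpoint has a triple
print(unmatchable)  # src:D occurs in a triple but has no link

# Consistency check on the figures quoted in the text.
assert 29_861 - 4_319 == 25_542
```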

4.2 Experimental Results on DBP-FB

In accordance with the current evaluation paradigm, we first analyze the performance of EA methods without considering unmatchable entities. As shown in Table 2.4, the overall performance of the methods in the first two groups is lower than that on SRPRS. This can be attributed to the greater structural heterogeneity of DBP-FB, which can be observed from sub-figure (d) in Fig. 1.2. In contrast to the KG pairs in sub-figures (a), (b), and (c), the entity distributions of the two KGs in DBP-FB are highly dissimilar, which makes it challenging to effectively leverage the structural information.

Methods that utilize entity names continue to produce the best results, although their performance is lower than that on previous mono-lingual datasets. Furthermore, on DBP-FB, Embed and Lev achieve Hits@1 values of only 58.3% and 57.8%, respectively, while they attain 100% on \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\), \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\), \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\), and \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\). This confirms that DBP-FB reflects the challenge of entity name ambiguity better than existing datasets, which makes it a preferable mono-lingual benchmark.

4.3 Unmatchable Entities

In addition, DBP-FB also contains unmatchable entities, which presents another real-life challenge for EA. We therefore evaluate the performance of Comb. (from Sect. 2.3.8) on DBP-FB, taking into account these unmatchable entities. Consistent with Sect. 2.3.7, we utilize the precision, recall, and F1 score as evaluation metrics, with the exception that we define recall as the number of matchable source entities for which an approach returns a target entity, divided by the total number of matchable source entities.

Table 2.7 shows that Comb. exhibits a high level of recall, but its precision is relatively low. This is because it returns a target entity for every source entity, including those that cannot be matched. This pattern is representative of how current entity alignment solutions behave when faced with unmatchable source entities, a problem that current entity alignment datasets do not capture.

Table 2.7 EA performance on DBP-FB after considering unmatchable entities

In order to address this issue, we suggest a straightforward approach to handle unmatchable entities in DBP-FB, in addition to the current entity alignment solutions. Specifically, we propose setting a NIL threshold, denoted as \(\theta \), to predict unmatchable entities. As discussed in Sect. 2.2.3, entity alignment solutions typically employ a distance measure to find the corresponding target entity. If the distance value between a source entity and its nearest target entity is greater than \(\theta \), we consider the source entity to be unmatchable and exclude it from the alignment results. The value of the threshold \(\theta \) can be determined from the training data.
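The thresholding idea can be sketched in a few lines. The text only states that the threshold can be determined from the training data, so the selection procedure shown here (picking, from a list of candidate values, the one with the best F1 score on the training data) is an assumption made for illustration, as are all function names and toy values. Recall follows the definition used for unmatchable entities in this section.

```python
def align_with_nil(distances, theta):
    """Align each source entity to its nearest target unless it is too far.

    distances: source id -> {target id: distance}.
    Sources whose nearest target is farther than theta are predicted to be
    unmatchable and are left out of the result.
    """
    alignment = {}
    for src, candidates in distances.items():
        nearest = min(candidates, key=candidates.get)
        if candidates[nearest] <= theta:
            alignment[src] = nearest
    return alignment


def f1_with_unmatchable(alignment, gold_links):
    """F1 where sources missing from gold_links are unmatchable."""
    returned = len(alignment)
    correct = sum(1 for s, t in alignment.items() if gold_links.get(s) == t)
    matchable_returned = sum(1 for s in alignment if s in gold_links)
    precision = correct / returned if returned else 0.0
    recall = matchable_returned / len(gold_links) if gold_links else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def choose_theta(distances, gold_links, candidate_thetas):
    """Hypothetical selection rule: best training F1 among the candidates."""
    return max(
        candidate_thetas,
        key=lambda th: f1_with_unmatchable(align_with_nil(distances, th), gold_links),
    )


# Toy training data (made up): s1 and s2 are matchable, s3 is unmatchable.
train_distances = {
    "s1": {"t1": 0.1, "t2": 0.9},
    "s2": {"t1": 0.8, "t2": 0.2},
    "s3": {"t1": 0.7, "t2": 0.6},
}
train_links = {"s1": "t1", "s2": "t2"}

theta = choose_theta(train_distances, train_links, [0.05, 0.3, 1.0])
print(theta, align_with_nil(train_distances, theta))
```

In the toy data, a very small threshold rejects every source entity, a very large one forces a wrong match for the unmatchable entity s3, and the intermediate value keeps the two correct pairs while rejecting s3.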

As shown in Table 2.7, the threshold-enhanced solution Comb. +TH achieves a better F1 score. We hope this preliminary study can inspire follow-up research on this issue.

5 Conclusion

Entity alignment plays a crucial role in integrating KGs to enhance knowledge coverage and quality. Despite the numerous proposed solutions, there has been limited comprehensive evaluation and detailed analysis of their performance. To address this gap, this chapter presents an empirical assessment of state-of-the-art approaches in terms of effectiveness and efficiency on representative datasets. We also conduct a thorough analysis of their performance and provide evidence-based discussions. Furthermore, we introduce a new dataset that more accurately reflects real-world challenges, which can serve as a benchmark for future research in this field.