State-of-the-Art Approaches

Zhao, Xiang; Zeng, Weixin; Tang, Jiuyang

doi:10.1007/978-981-99-4250-3_2

Part of the book series: Big Data Management (BIGDM)

Abstract

This chapter performs a thorough assessment and meticulous examination of the most advanced EA techniques. Initially, we introduce a broad EA framework that covers all current methods and classify these methods into three main groups. Then, we carefully appraise these solutions on various scenarios, taking into account their efficacy, efficiency, and scalability. Lastly, we create a novel EA dataset that reflects the actual difficulties encountered in alignment, which prior literature mostly ignored. This chapter aims to offer a comprehensive understanding of the advantages and drawbacks of current EA methods, in order to encourage further high-quality research.

1 Introduction

In this chapter, we conduct an empirical evaluation of state-of-the-art EA approaches, which possesses the following characteristics:

Fair Comparison Within and Across Categories

Most recent studies have limited themselves to comparing only a subset of methods [4, 11, 15, 23, 27, 28, 29, 30, 33]. Moreover, different approaches follow different protocols: some use only the KG structure for alignment, while others incorporate additional information; some perform one-pass alignment of KGs, while others use an iterative (re-)training strategy. While the literature presents a direct comparison of these methods, which highlights their overall effectiveness, a more desirable and equitable approach would be to classify these methods into categories and then compare the outcomes within and across categories.

In this chapter, we incorporate most of the state-of-the-art methods to facilitate a comprehensive comparison, including the very recent approaches that have not been evaluated against other methods previously. We divide them into three groups and conduct a thorough analysis of both intra- and inter-group evaluations, enabling us to better position these methods and evaluate their effectiveness.

Comprehensive Evaluation on Representative Datasets

To assess the performance of EA systems, various datasets have been developed, which can be broadly classified into two categories: cross-lingual benchmarks, exemplified by DBP15K [21], and mono-lingual benchmarks, exemplified by DWY100K [22]. A recent study [11] highlights that the KGs in prior datasets are much denser than those in real-world scenarios, which led them to create the SRPRS dataset with entity degrees that follow a normal distribution. Despite the availability of multiple datasets, previous studies only report their results on one or two specific datasets, making it challenging to evaluate their efficacy across a wide range of potential scenarios, such as cross-lingual/mono-lingual, dense/normal, and large-scale/medium-scale KGs.

In light of this observation, this chapter performs a thorough experimental evaluation on all the prominent datasets, namely, DBP15K, DWY100K, and SRPRS, which together consist of nine pairs of knowledge graphs. The evaluation is conducted across various dimensions, including effectiveness, efficiency, and robustness.

New Dataset for Real-Life Challenges

It has been noted that current EA datasets assume that each entity in the source KG has exactly one corresponding entity in the target KG, which is an unrealistic assumption. In reality, there are entities in one KG that may not have a corresponding entity in the other KG. For example, when aligning YAGO 4 and IMDB, only a small percentage (1%) of entities in YAGO 4 are related to movies, while the remaining 99% of entities in YAGO 4 do not have any corresponding entities in IMDB. These unmatchable entities would make the EA task more challenging.

Furthermore, we notice that the mono-lingual datasets currently available for EA evaluation assume that the entities in the different KGs share the same naming convention. Therefore, the baseline method that relies on comparing the string similarity between entity names can achieve perfect accuracy. However, this assumption is often not valid in real-life scenarios, where equivalent entities in different KGs may have dissimilar names, such as “America” and “USA” for the same entity. In addition, another challenge that is often overlooked in EA is that different entities in a KG might have the same name. This can make it difficult to determine whether an entity with the name “Paris” in the source KG refers to the same entity as one with the same name in the target KG, as they could potentially refer to different entities, such as the city in France and the city in Texas.

For these reasons, we believe that the current EA datasets do not fully capture the realistic challenges posed by unmatchable entities and ambiguous entity names. To address this issue, we introduce a new dataset that more closely mirrors these practical difficulties.

The main contributions of this chapter are the following:

This chapter provides a comprehensive evaluation of state-of-the-art EA approaches. The evaluation includes (1) identifying the main components of existing EA approaches and proposing a general EA framework; (2) categorizing state-of-the-art approaches into three groups and conducting detailed intra- and inter-group evaluations to better understand their strengths and weaknesses; and (3) examining these approaches in various scenarios, including cross-/mono-lingual alignment and alignment on dense/normal, large-/medium-scale data, to evaluate their effectiveness, efficiency, and robustness. The empirical results provide insights into the performance of each approach. This evaluation aims to provide a more systematic and comprehensive understanding of the current state of EA research.
Through our study, we gained valuable experience and insights that allow us to identify the shortcomings of current EA datasets. To address these issues, we have created a new mono-lingual dataset that accurately reflects the real-life challenges of unmatchable entities and ambiguous entity names. We anticipate that this new dataset will provide a more effective benchmark for evaluating EA systems.

2 A General EA Framework

This section presents a general EA framework that is designed to include state-of-the-art EA approaches. Through a thorough analysis of current EA approaches, we identify four primary components, as shown in Fig. 2.1:

Fig. 2.1 A model of the EA framework. The embedding learning module points to the alignment module, which points to the prediction module. The prediction module connects via two-way dotted arrows to the extra information module, which points via a dotted arrow to the alignment module.

Embedding learning module. This component is designed to train embeddings for entities, which can be broadly classified into two groups: KG representation-based models such as TransE [3] and graph neural network (GNN)-based models such as the graph convolutional network (GCN) [13].
Alignment module. This component focuses on aligning the entity embeddings learned in the previous module across different KGs. The goal is to map these embeddings into a unified space. Margin-based loss is a common approach used in this module to ensure that the seed entity embeddings from different KGs are close to each other. Another approach used frequently is corpus fusion, which aligns KGs at the corpus level and directly embeds entities in different KGs into the same vector space.
Prediction module. Once the unified embedding space is established, the next step is to predict the corresponding target entity for each source entity in the test set. One common approach is to use distance-based similarity measures such as cosine similarity, Manhattan distance, or Euclidean distance between entity embeddings to calculate the similarity between entities. The target entity with the highest similarity (or lowest distance) is then selected as the counterpart.
Extra information module. In addition to the basic modules, some EA approaches use additional information to improve their performance. One approach is bootstrapping, where confident alignment results are used as training data for subsequent alignment iterations. Another approach is to use multi-type literal information such as attributes, entity descriptions, and entity names to complement the KG structure. These additional sources of information are shown in Fig. 2.1 as blue dashed lines.

Example Further to the example in Chap. 1, we explain these modules. The embedding learning module generates embeddings for entities in KG\({ }_{\text{EN}}\) and KG\({ }_{\text{ES}}\), respectively. Then the alignment module projects the entity embeddings into the same vector space, where the entity embeddings in KG\({ }_{\text{EN}}\) and KG\({ }_{\text{ES}}\) are directly comparable. Finally, using the unified embeddings, the prediction module aims to predict the equivalent target entity in KG\({ }_{\text{ES}}\) for each source entity in KG\({ }_{\text{EN}}\). The extra information module leverages several techniques to improve the EA performance. Concretely, the bootstrapping strategy aims to include the confident EA pairs detected from a previous round, e.g., (Spain, España), into the training set for learning in the next round. Another approach is to use additional textual information to complement the entity embeddings for alignment.

We organize the state-of-the-art approaches based on each module of the EA framework and present them in Table 2.1. For a more detailed view of the approaches, readers can refer to the Appendix. Now, we will explain how each of these modules is implemented in various state-of-the-art approaches.

Table 2.1 A summary of the EA approaches involved in this study

2.1 Embedding Learning Module

In this section, we will explain the techniques used in the embedding learning module, which utilize the KG structure to create embeddings for each entity.

Table 2.1 shows that the most commonly used models for this module are TransE [3] and GCN [13]. We will provide a brief overview of these fundamental models.

TransE

The TransE model views relationships as translations that act on the lower-dimensional representations of entities. To clarify, when presented with a relational triple \((h, r, t)\), TransE proposes that the embedded representation of the tail entity t should be similar to the embedded representation of the head entity h plus the embedded representation of the relationship r, or \(\mathbf {h} + \mathbf {r} \approx \mathbf {t}\). By doing so, the model is able to maintain the structural information of the entities and produce close representations for entities that share similar neighbors in the embedding space.
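The translation assumption can be illustrated with a minimal sketch. The function name `transe_score`, the choice of the L1 norm, and the toy vectors below are assumptions made for illustration; they are not learned embeddings and are not taken from any of the evaluated systems.

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    """Dissimilarity of a triple under TransE: ||h + r - t||.

    A lower score means the triple fits the translation assumption better.
    """
    return np.linalg.norm(h + r - t, ord=norm)

# Toy 3-dimensional embeddings (hypothetical values, not learned).
h = np.array([0.1, 0.2, 0.0])
r = np.array([0.3, -0.1, 0.5])
t_good = np.array([0.4, 0.1, 0.5])    # equals h + r, so the triple fits
t_bad = np.array([-0.5, 0.9, 0.0])    # far from h + r

good_score = transe_score(h, r, t_good)
bad_score = transe_score(h, r, t_bad)
print(good_score < bad_score)
```

During training, such a score would be pushed down for observed triples and up for corrupted ones; the sketch shows only the scoring step.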

GCN

A type of convolutional network that processes graph-based data directly is known as the graph convolutional network (GCN). It creates embeddings for individual nodes by encoding information about the neighborhoods of those nodes. GCN takes as input feature vectors for each node in the KG, as well as a representative graph structure description in matrix form, such as an adjacency matrix. The output of the GCN is a new feature matrix. A typical GCN model consists of multiple stacked GCN layers, which allows it to capture a partial KG structure that extends several hops away from the entity being processed.
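As a rough illustration of how stacked layers reach entities several hops away, the sketch below implements one layer with the widely used symmetric normalization and self-loops. Whether each surveyed method uses exactly this propagation rule is an assumption; the identity weight matrix and the three-node path graph are chosen only to make the effect visible.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = A_hat.sum(axis=1) ** -0.5      # inverse square root of degrees
    S = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return np.maximum(S @ X @ W, 0.0)

# Toy path graph 0 - 1 - 2 with one-hot input features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3)
W = np.eye(3)   # identity weights (hypothetical), so only propagation is shown

H1 = gcn_layer(A, X, W)    # one layer: node 0 sees itself and node 1
H2 = gcn_layer(A, H1, W)   # two layers: node 0 now receives signal from node 2

print(H1[0, 2], H2[0, 2])
```

After one layer the entry linking node 0 to node 2 is zero; after two layers it is positive, which is the multi-hop effect described above.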

On top of these basic models, some methods make modifications. Regarding the TransE-based models, MTransE removes the negative triples during training, BootEA and NAEA replace the original margin-based loss function with a limit-based objective function, MuGNN uses the logistic loss to substitute for the margin-based loss, and JAPE designs a new loss function.

Concerning the GCN-based models, it has been observed that the GCN does not take into account the relations present in KGs. Therefore, as a solution, RDGCN employs the dual-primal graph convolutional neural network (DPGCNN) [17]. In contrast, MuGNN leverages an attention-based GNN model to assign varying weights to neighboring nodes. Additionally, KECG merges graph attention network (GAT) [25] and TransE to capture both the inner-graph structure and the inter-graph alignment information.

Several approaches have introduced new embedding models. For example, in RSNs, the authors contend that triple-level learning is inadequate for capturing long-term relational dependencies between entities and is insufficient for propagating semantic information among entities. Therefore, they propose using recurrent neural networks (RNNs) with residual learning to learn the long-term relational paths between entities.

Similarly, TransEdge devises a new energy function to measure the error of edge translation between entity embeddings for KG embedding learning. This method models edge embeddings using context compression and projection.

2.2 Alignment Module

In this subsection, we introduce the methods used for the alignment module, which aims to unify separated KG embeddings.

The prevailing approach for alignment is to apply a margin-based loss function on top of the embedding learning module. This loss function requires that the distance between entities in positive pairs should be small, while the distance between entities in negative pairs should be large, with a margin between the distances of positive and negative pairs. The positive pairs refer to seed entity pairs, while negative pairs are generated by corrupting the positive pairs. This approach helps to merge the two separate KG embedding spaces into one vector space. Table 2.1 indicates that the majority of methods that use GNNs rely on a margin-based alignment model to merge the two KG embedding spaces. In contrast, GM-Align employs a matching framework that maximizes the matching probabilities of seed entity pairs, which achieves the alignment.
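A generic form of the margin-based alignment objective can be sketched as follows. The exact loss differs between methods; the L1 distance, the one-negative-per-positive pairing, and the names `margin_alignment_loss` and `corrupt` are assumptions made for illustration.

```python
import numpy as np

def margin_alignment_loss(src, tgt, pos_pairs, neg_pairs, margin=1.0):
    """Sum of max(0, d(pos) - d(neg) + margin) over paired positives and negatives."""
    def dist(pairs):
        i, j = np.asarray(pairs).T
        return np.abs(src[i] - tgt[j]).sum(axis=1)   # L1 distance per pair
    return float(np.maximum(0.0, dist(pos_pairs) - dist(neg_pairs) + margin).sum())

def corrupt(pos_pairs, n_tgt, rng):
    """Build negatives by replacing each target with a different random target.

    Requires n_tgt >= 2, otherwise no different target exists.
    """
    neg = []
    for i, j in pos_pairs:
        k = int(rng.integers(n_tgt))
        while k == j:
            k = int(rng.integers(n_tgt))
        neg.append((i, k))
    return neg

# Toy embeddings (hypothetical): seed pairs (0, 0) and (1, 1) already coincide.
src = np.array([[0.0, 0.0], [1.0, 1.0]])
tgt = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
pos = [(0, 0), (1, 1)]

loss_far = margin_alignment_loss(src, tgt, pos, [(0, 2), (1, 2)], margin=3.0)
loss_near = margin_alignment_loss(src, tgt, pos, [(0, 1), (1, 0)], margin=3.0)
print(loss_far, loss_near)
```

Negatives that are already farther away than the margin contribute nothing, whereas negatives that lie too close to the source entity are penalized.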

Corpus fusion is another common approach, which involves using the seed entity pairs to connect the training corpora of two KGs. Some methods, such as BootEA and NAEA, generate new triples by swapping the entities in the seed entity pairs to align the embeddings in a unified space. Concretely, given a seed entity pair \((u,v)\), the newly generated triples for \(G_1\) are \(T_1^{new} = \{(v,r,t)|(u,r,t)\in T_1\}\cup \{(h,r,v)|(h,r,u)\in T_1\}\) and for \(G_2\) are \(T_2^{new} = \{(u,r,t)|(v,r,t)\in T_2\}\cup \{(h,r,u)|(h,r,v)\in T_2\}\), where \(T_1\) and \(T_2\) denote the triple sets of \(G_1\) and \(G_2\). Other methods instead build an overlay graph: the entities in seed entity pairs are connected with edges, and the remaining entities are connected with edges based on their similarity or co-occurrence in the training corpus. Entity embeddings are then learned from the adjacency matrix of the overlay graph and the training corpus.
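The triple-swapping construction of \(T_1^{new}\) and \(T_2^{new}\) can be written directly from the set definitions. The function name `swap_triples` and the toy entity and relation identifiers are hypothetical and serve only as an illustration.

```python
def swap_triples(triples_1, triples_2, seed_pairs):
    """Generate new triples by swapping the entities of each seed pair (u, v)."""
    new_1, new_2 = set(), set()
    for u, v in seed_pairs:
        for h, r, t in triples_1:
            if h == u:
                new_1.add((v, r, t))   # (u, r, t) in T_1 yields (v, r, t)
            if t == u:
                new_1.add((h, r, v))   # (h, r, u) in T_1 yields (h, r, v)
        for h, r, t in triples_2:
            if h == v:
                new_2.add((u, r, t))   # (v, r, t) in T_2 yields (u, r, t)
            if t == v:
                new_2.add((h, r, u))   # (h, r, v) in T_2 yields (h, r, u)
    return new_1, new_2

# Toy triples with hypothetical identifiers.
T1 = [("Mexico_en", "language", "Spanish_en"),
      ("Cuaron_en", "citizenship", "Mexico_en")]
T2 = [("Mexico_es", "idioma", "Espanol_es")]
seeds = [("Mexico_en", "Mexico_es")]

new_T1, new_T2 = swap_triples(T1, T2, seeds)
print(sorted(new_T1))
print(sorted(new_T2))
```

The generated triples mention entities from both KGs, which is what ties the two embedding spaces together when they are added to the training corpus.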

Some earlier works proposed transition functions to map the embedding vectors from one KG to another, while others utilized additional information such as entity attributes to align the entity embeddings into a unified space.

2.3 Prediction Module

This module typically involves computing similarity scores between source and target entity embeddings and selecting the target entity with the highest score as the alignment.

To align entities, the most common method is to generate a ranked list of target entities for each source entity based on a specific distance measure between their embeddings. The distance measures commonly used include Euclidean distance, Manhattan distance, and cosine similarity. The top-ranked entity in the list is then considered a match for the source entity. It is worth noting that the similarity score can be converted into the distance score by subtracting it from 1 and vice versa.^{Footnote 1} In contrast, in GM-Align, the entity with the highest matching probability is aligned with the source entity.
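The ranking step can be sketched with plain NumPy. The function names, the toy embeddings, and the use of 1 minus cosine similarity as a distance are assumptions for illustration; the dense pairwise computation is suitable only for small inputs.

```python
import numpy as np

def pairwise_distance(src, tgt, metric="manhattan"):
    """Distance matrix of shape (n_src, n_tgt) between two embedding matrices."""
    if metric == "manhattan":
        return np.abs(src[:, None, :] - tgt[None, :, :]).sum(axis=2)
    if metric == "euclidean":
        return np.sqrt(((src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=2))
    if metric == "cosine":
        s = src / np.linalg.norm(src, axis=1, keepdims=True)
        t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
        return 1.0 - s @ t.T          # similarity converted into a distance
    raise ValueError(f"unknown metric: {metric}")

def predict(src, tgt, metric="manhattan"):
    """Return the top-ranked target per source and the full ranked lists."""
    ranking = np.argsort(pairwise_distance(src, tgt, metric), axis=1)
    return ranking[:, 0], ranking

# Toy unified embeddings (hypothetical values).
src_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt_emb = np.array([[0.1, 0.9], [0.9, 0.1]])

top1, ranked = predict(src_emb, tgt_emb, metric="cosine")
print(top1.tolist())
```

On this toy input all three measures agree on the top-ranked target; on real embeddings they can produce different rankings.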

Additionally, a recent method called CEA observes that there is a correlation between different entity alignment decisions, meaning that if a target entity is already matched to a source entity with high confidence, it is less likely to be matched to another source entity. To capture this correlation, CEA models it as a stable matching problem, and addresses the problem based on the distance measure, which decreases the number of mismatches and improves the accuracy of entity alignment.
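To illustrate how a stable matching formulation differs from independent nearest-neighbor prediction, the sketch below applies the textbook deferred-acceptance (Gale-Shapley) procedure to a similarity matrix. It is a generic illustration under that assumption and is not taken from the CEA implementation; the similarity values are hypothetical.

```python
import numpy as np

def stable_matching(sim):
    """Deferred acceptance: sources propose to targets in decreasing similarity."""
    n_src, n_tgt = sim.shape
    prefs = np.argsort(-sim, axis=1)    # each source's targets, best first
    next_choice = [0] * n_src           # position of the next target to try
    holder = {}                         # target -> source it currently holds
    free = list(range(n_src))
    while free:
        s = free.pop()
        if next_choice[s] >= n_tgt:
            continue                    # s has been rejected by every target
        t = int(prefs[s, next_choice[s]])
        next_choice[s] += 1
        if t not in holder:
            holder[t] = s               # target was free, accept provisionally
        elif sim[s, t] > sim[holder[t], t]:
            free.append(holder[t])      # target prefers s, release the old source
            holder[t] = s
        else:
            free.append(s)              # rejected, s will try its next choice
    return {s: t for t, s in holder.items()}

# Both sources prefer target 0, so independent top-1 prediction collides.
sim = np.array([[0.90, 0.80],
                [0.85, 0.10]])
greedy = sim.argmax(axis=1).tolist()
matching = stable_matching(sim)
print(greedy, matching)
```

Independent prediction assigns target 0 to both sources, while the matching gives target 0 to the source it is most similar to and moves the other source to its next choice.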

2.4 Extra Information Module

In this subsection, we discuss the methods used in the extra information module.

One approach to improve the EA framework is through a bootstrapping strategy, also known as an iterative training or self-learning strategy. This approach involves iteratively labeling highly probable EA pairs as the training set for the next round, leading to the gradual enhancement of alignment results. There are several methods based on this approach, with variations in the selection of confident EA pairs. ITransE identifies the most similar nonaligned target entity for each nonaligned source entity, and if the similarity score between them exceeds a certain threshold, they are regarded as a confident pair. BootEA, NAEA, and TransEdge follow a similar approach where they calculate the probability of each source entity being aligned with every target entity. They only consider pairs with probability scores above a certain threshold and use a maximum likelihood matching algorithm with a 1-to-1 mapping constraint to generate a set of confident EA pairs.
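The threshold-based selection described for ITransE can be sketched as follows. The function name, the threshold value, and the similarity matrix are hypothetical; the sketch covers only the selection step of one round, not the retraining that would follow it.

```python
import numpy as np

def select_confident_pairs(sim, aligned_src, aligned_tgt, threshold=0.8):
    """For each nonaligned source, take its most similar nonaligned target
    and keep the pair if the similarity exceeds the threshold."""
    free_tgt = [j for j in range(sim.shape[1]) if j not in aligned_tgt]
    pairs = []
    if not free_tgt:
        return pairs
    for i in range(sim.shape[0]):
        if i in aligned_src:
            continue
        j = max(free_tgt, key=lambda c: sim[i, c])
        if sim[i, j] > threshold:
            pairs.append((i, j))
    return pairs

# Toy similarity matrix (hypothetical); pair (0, 0) is already a seed.
sim = np.array([[0.95, 0.10, 0.20],
                [0.30, 0.90, 0.85],
                [0.20, 0.60, 0.70]])
seed_src, seed_tgt = {0}, {0}

new_pairs = select_confident_pairs(sim, seed_src, seed_tgt, threshold=0.8)
print(new_pairs)

# The confident pairs would then join the seeds for the next round.
seed_src |= {i for i, _ in new_pairs}
seed_tgt |= {j for _, j in new_pairs}
```

This simple rule lets two sources select the same target; the 1-to-1 mapping constraint used by BootEA, NAEA, and TransEdge would rule that out.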

Several methods utilize multi-type literal information to improve alignment by providing a more comprehensive view. Commonly used types of information are the attributes associated with entities. Some methods, such as JAPE, GCN-Align, and HMAN, only consider the statistical characteristics of the attribute names. Other methods, such as AttrE and M-Greedy, generate attribute embeddings by encoding the characters of attribute values. AttrE uses attribute embeddings to unify entity embeddings into the same space, while M-Greedy uses them to complement the entity embeddings.

There is a growing tendency toward the use of entity names.^{Footnote 2} Several methods use entity names as input features to learn entity embeddings, or exploit the semantic and string-level aspects of entity names as individual features. Specifically, GM-Align, RDGCN, and HGCN utilize entity names as input features to learn entity embeddings. On the other hand, CEA leverages both semantic and string-level aspects of entity names as individual features for alignment. Furthermore, KDCoE and the description-enhanced version of HMAN encode entity descriptions into vector representations and treat them as new features for alignment.

The availability of multi-type information is not always guaranteed in knowledge graph alignment. Some types of information like entity names are commonly available in most scenarios, while others like entity descriptions are often missing in many knowledge graphs. Additionally, due to the graph-based nature of knowledge graph alignment, most existing alignment datasets have limited textual information, which makes some approaches like KDCoE, M-Greedy, and AttrE less applicable.

3 Experiments and Analysis

This section presents an in-depth empirical study.^{Footnote 3}

3.1 Categorization

According to the main components, we can broadly categorize current methods into three groups: Group I, which merely utilizes the KG structure for alignment, Group II, which harnesses the iterative training strategy to improve alignment results, and Group III, which utilizes information in addition to the KG structure. We introduce and compare these three categories using the example in Chap. 1.

Group I

This category of methods solely relies on the structure of the knowledge graph to align entities. Consider again the example in Chap. 1. In KG\({ }_{\text{EN}}\), the entity Alfonso Cuarón is connected to the entity Mexico and three other entities, while Spain is connected to Mexico and one more entity. The same structural information can be observed in KG\({ }_{\text{ES}}\). Since we already know that Mexico in KG\({ }_{\text{EN}}\) is aligned to Mexico in KG\({ }_{\text{ES}}\), by using the KG structure, it is easy to conclude that the equivalent target entity for Spain is España, and the equivalent target entity for Alfonso Cuarón is Alfonso Cuarón.

Group II

This category of approaches is known as iterative or self-learning strategies, where likely entity alignment pairs are labeled iteratively as the training set for the next round, leading to a progressive improvement in the alignment results. They can also be categorized into Group I or III, depending on whether they merely use the KG structure or not. Nevertheless, they are all characterized by the use of the bootstrapping strategy.

We still use the example in Chap. 1 to illustrate the bootstrapping mechanism. As shown in Fig. 1.1, by utilizing the KG structure, it is straightforward to identify that the source entity Spain is aligned with the target entity España, and the source entity Alfonso Cuarón is aligned with the target entity Alfonso Cuarón. The source entity Madrid does not have a clear target entity, as both Roma(ciudad) and Madrid in the target KG have the same structural information as the source entity. This is because they are both two hops away from the seed entity and have a degree of 1. To address this problem, bootstrapping-based approaches perform multiple rounds of alignment, using the confident entity pairs from the previous round as seed pairs for the next round. More specifically, they consider the entity pairs detected from the first round, i.e., (Spain, España) and (Alfonso Cuarón, Alfonso Cuarón), as the seed pairs in the following rounds. Consequently, in the second round, for the source entity Madrid, only the target entity Madrid shares the same structural information with it—two hops away from the seed entity pair (Mexico, Mexico) and one hop away from the seed entity pair (Spain, España).

Group III

Utilizing the KG structure for alignment when presented with graph-formatted input data sources is a natural choice; however, KGs also contain a wealth of semantic information that can be used to supplement structural data. These methods stand out by taking advantage of additional information beyond the KG structure.

As seen in Chap. 1, even with the KG structure and bootstrapping strategy, it is still difficult to identify the target entity for the source entity Gravity(film), since its structural information (connected to the entity Alfonso Cuarón and with degree 2) is shared by two target entities Gravity(película) and Roma(película). However, by combining the KG structure with the names in the identifiers, it is easy to differentiate between the two entities and correctly identify Gravity(película) as the target entity for Gravity(film).

3.2 Experimental Settings

The datasets and metrics utilized for assessment were previously introduced in Chap. 1. In the remainder of this section, we elaborate on the methods compared and the parameter configurations used.

Methods to Compare

We will compare the previously mentioned methods, with the exception of KDCoE and MultiKE, due to the absence of entity descriptions in the evaluation benchmarks. Additionally, we will exclude AttrE since it is only functional in the mono-lingual context. Furthermore, we will provide the outcomes of the structure-only versions of JAPE and GCN-Align, specifically JAPE-Stru and GCN-Align(SE).

As previously stated in Chap. 1, to showcase the ability of ER methods in addressing EA, we will also compare with various name-based heuristics. These approaches are commonly used in related tasks [8, 18, 19], as they heavily depend on the resemblance between object names to identify equivalences. Concretely, we use the following:

Lev aligns entities through the utilization of Levenshtein distance [14], which is a string-based measurement tool for computing the dissimilarity between two sequences.
Embed aligns entities based on the cosine similarity between the averaged word embeddings, or name embeddings, of two entities. In accordance with [31], we utilize the pre-trained fastText embeddings [1] as word embeddings. For multilingual KG pairs, we use the MUSE word embeddings [7].
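A minimal sketch of the Lev heuristic is given below, using the standard dynamic-programming edit distance. The helper names are hypothetical, and the entity names are borrowed from the running example; ties are broken by list order, which is an arbitrary choice.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def lev_align(src_names, tgt_names):
    """Align each source name to the target name with the smallest edit distance."""
    return {s: min(tgt_names, key=lambda t: levenshtein(s, t)) for s in src_names}

src_names = ["Spain", "Gravity(film)"]
tgt_names = ["España", "Gravity(película)", "Roma(película)"]

alignment = lev_align(src_names, tgt_names)
print(alignment)
```

Because the comparison is purely character-based, the heuristic can only work when equivalent names share most of their characters, as is the case for closely related languages.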

Implementation Details

The experiments were performed using a personal computer equipped with an Intel Core i7-4790 CPU, an NVIDIA GeForce GTX TITAN X GPU, and 128 GB of memory. All programs were implemented in Python.

To ensure reproducibility, we employ the source codes provided by the authors and utilize the parameter settings specified in their original papers to execute the models.^{Footnote 4} For datasets not included in the original papers, we use the same parameter settings as those employed in the original experiments to ensure consistency.

All of the evaluated methods provide results on the DBP15K dataset in their original papers, with the exception of MTransE and ITransE. We compare our implemented results with the reported results from the original papers. If the difference between our results and the reported results falls outside of a reasonable range, which we define as \(\pm 5\%\) of the original results, we mark the methods with an asterisk \({ }^*\). It is worth noting that there should not be a significant difference theoretically since we use the same source codes and parameter settings for implementation. For the SRPRS dataset, only RSNs reports results in its original paper [11]. We conduct experiments on all methods for SRPRS and present the results in Table 2.3. For the DWY100K dataset, we run all approaches and compare the performance of BootEA, MuGNN, NAEA, KECG, and TransEdge with the results provided in their original papers. We mark methods with notable differences with an asterisk \({ }^*\).
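The asterisk rule amounts to a simple check. The sketch assumes that the \(\pm 5\%\) range is relative to the reported value; the function name and the example numbers are hypothetical and are not results from the tables.

```python
def needs_asterisk(ours, reported, tolerance=0.05):
    """True if our result deviates from the reported one by more than
    the given fraction of the reported value (assumed relative range)."""
    return abs(ours - reported) > tolerance * abs(reported)

# Hypothetical Hits@1 values in percent.
within_range = needs_asterisk(60.0, 62.9)    # difference 2.9, allowed 3.145
outside_range = needs_asterisk(50.0, 65.0)   # difference 15.0, allowed 3.25
print(within_range, outside_range)
```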

On each dataset, we highlight the best results within each group by denoting them in bold. We also mark the best Hits@1 performance among all approaches with \({ }^\blacktriangle \) since this metric is the most crucial and can best reflect the effectiveness of EA methods.
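For reference, Hits@k and MRR can be computed from a similarity matrix as sketched below. The sketch assumes the standard definitions of these metrics, which are introduced in Chap. 1 rather than here; the function name and the toy scores are hypothetical.

```python
import numpy as np

def hits_and_mrr(sim, gold, ks=(1, 10)):
    """sim[i, j] is the similarity of source i and target j;
    gold[i] is the index of the correct target for source i."""
    gold = np.asarray(gold)
    gold_scores = sim[np.arange(len(gold)), gold][:, None]
    # Rank of the correct target (1 = best); ties are counted in its favor.
    ranks = (sim > gold_scores).sum(axis=1) + 1
    hits = {k: float((ranks <= k).mean()) for k in ks}
    mrr = float((1.0 / ranks).mean())
    return hits, mrr

# Toy scores (hypothetical): the correct targets rank 1st, 2nd, and 1st.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.3, 0.8],
                [0.5, 0.4, 0.6]])
hits, mrr = hits_and_mrr(sim, gold=[0, 1, 2], ks=(1, 2))
print(hits, mrr)
```

A method that outputs only aligned pairs instead of ranked lists can be scored with Hits@1 alone, since the other two metrics need the full ranking.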

3.3 Results and Analyses on DBP15K

The experimental results on the cross-lingual dataset DBP15K can be found in Table 2.2. Note that the Hits@10 and MRR results of CEA are missing in this table since it directly generates aligned entity pairs instead of returning a list of ranked entities.^{Footnote 5} We then compare the performance both within each category and across categories.

Table 2.2 Experimental results on DBP15K

Table 2.3 Experimental results on SRPRS

Table 2.4 Experimental results on DWY100K and DBP-FB

Group I

Among the methods that only utilize the KG structure, RSNs consistently obtains superior outcomes in Hits@1 and MRR metrics. This success can be attributed to its ability to capture long-term relational paths, which offer more structural indications for alignment. The performance of MuGNN and KECG is comparable, which can be partly attributed to their shared goal of completing KGs and reconciling structural disparities. While MuGNN utilizes AMIE+ [10] to induce rules for completion, KECG harnesses TransE to implicitly achieve this aim.

The remaining three techniques achieve comparatively lower outcomes. MTransE and JAPE-Stru leverage TransE to capture the KG structure, but JAPE-Stru outperforms MTransE because the latter models KG structures in different vector spaces, resulting in information loss when translating between them [21]. On the other hand, GCN-Align(SE) attains better results than MTransE and JAPE-Stru.

Group II

Among these methods, ITransE obtains notably poorer outcomes, which can be attributed to the information loss during embedding space translation and its simpler bootstrapping strategy as described in Sect. 2.4. BootEA, NAEA, and TransEdge all utilize the same bootstrapping strategy. BootEA achieves slightly inferior performance compared to reported outcomes, while NAEA performs significantly worse. In theory, NAEA should outperform BootEA as it employs an attention mechanism to capture neighbor-level information. On the other hand, TransEdge employs an edge-centric embedding model to capture structural information, resulting in more accurate entity embeddings and hence better alignment outcomes.

Group III

Both JAPE and GCN-Align utilize attributes to enhance entity embeddings, and their outcomes surpass those of their structure-only counterparts, demonstrating the utility of attribute information. Additionally, HMAN outperforms JAPE and GCN-Align because, besides attributes, it also takes relation types as input.

The remaining four methods utilize entity names instead of attributes for alignment and achieve superior outcomes. Among them, RDGCN and HGCN attain similar results, surpassing GM-Align. This can be attributed to their use of relations to optimize entity embedding learning, which was mostly overlooked in prior GNN-based EA models. However, CEA achieves the best performance in this group by effectively utilizing and merging available features.

Name-Based Heuristics

Regarding KG pairs with closely related languages, Lev achieves encouraging results, but it is ineffective on distantly related language pairs such as \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) and \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\). On the other hand, Embed attains consistent performance on all KG pairs.

Inter-Category Comparison

Across all KG pairs, CEA obtains the best Hits@1 performance, while TransEdge, RDGCN, and HGCN achieve the top results for other metrics. This confirms the effectiveness of incorporating additional information such as the bootstrapping strategy and textual information.

The performance of name-based heuristics, such as Embed, is highly competitive, surpassing most methods that do not utilize entity name information in terms of Hits@1. This indicates that conventional ER solutions can still be effective for the EA task. However, Embed still lags behind most EA methods that integrate entity name information, such as RDGCN, HGCN, and CEA.

We can also observe that methods from the first two groups, such as TransEdge, achieve consistent results across all three KG pairs. In contrast, methods that utilize entity name information, such as HGCN, achieve much better results on KG pairs with closely related languages (\({\mathtt {DBP15K}_{\mathtt {FR-EN}}}\)) than those with distantly related languages (\({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\)). This indicates that language barriers can hinder the use of textual information, which can, in turn, undermine the overall effectiveness of the method.

3.4 Results and Analyses on SRPRS

The results on SRPRS are presented in Table 2.3. Similar observations can be made as in the case of DBP15K, which we will not elaborate on. However, we can focus on the differences from DBP15K as well as the patterns specific to this dataset.

Group I

The results show that the performance of the methods on the relatively sparse KGs in SRPRS is lower compared to DBP15K. However, RSNs outperforms the other methods, closely followed by KECG. It is important to note that while MuGNN achieves decent results on DBP15K, it performs much worse on SRPRS because there are no aligned relations on SRPRS, which results in the failure of rule transferring. Additionally, the sparser KG structure leads to a smaller number of detected rules.

Group II

Among these solutions, TransEdge still yields consistently superior results.

Group III

Compared with the structure-only variants GCN-Align(SE) and JAPE-Stru, incorporating attributes improves the results of GCN-Align but does not improve those of JAPE. This is likely because the dataset has a relatively smaller number of attributes. On the other hand, using entity names significantly improves the results. It is worth noting that CEA achieves ground-truth performance on \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\).

Name-Based Heuristics

For the mono-lingual KG pairs, which are built from DBpedia, Wikidata, and YAGO, Lev and Embed achieve ground-truth performance: equivalent entities in different KGs have identical names derived from their entity identifiers, so a simple comparison of these names yields accurate results. Additionally, Lev shows promising results on cross-lingual KG pairs with closely related languages.
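To make the name-comparison idea concrete, the following is a minimal sketch of a Levenshtein-style heuristic. It assumes that each entity carries a single name string and that every source entity is mapped to the target entity whose name has the smallest edit distance; the function names, the lowercasing step, and the toy identifiers (`s1`, `t1`, ...) are illustrative assumptions, not the exact Lev implementation evaluated in this chapter.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def align_by_name(source_names: dict, target_names: dict) -> dict:
    """Map each source entity to the target entity with the closest name."""
    alignment = {}
    for s_id, s_name in source_names.items():
        alignment[s_id] = min(
            target_names,
            key=lambda t_id: levenshtein(s_name.lower(), target_names[t_id].lower()),
        )
    return alignment


# Toy data with hypothetical identifiers.
source = {"s1": "Paris", "s2": "Berlin"}
target = {"t1": "Paris", "t2": "Berlín", "t3": "London"}
result = align_by_name(source, target)
```

When names are identical, the distance is zero and the match is trivial, which is why such a heuristic can reach perfect accuracy on these KG pairs; the second toy entity shows the closely related language case, where names differ by only a character or two.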

Intra-Category Comparison

In contrast to DBP15K, methods that incorporate entity names (Group III) perform much better on SRPRS. This is likely due to two reasons: (1) the KG structure is less effective on this dataset, which is much sparser compared to DBP15K, and (2) the entity name information plays a significant role on both mono-lingual and cross-lingual datasets with closely related language pairs, where the names of equivalent entities are very similar.

3.5 Results and Analyses on DWY100K

Table 2.4 shows the results on the large-scale mono-lingual dataset DWY100K. However, we were unable to obtain the results of RDGCN and NAEA due to their requirement for an extremely large amount of memory space in our experimental environment.

The methods in the first group perform significantly better on this dataset, which can be attributed to the relatively richer KG structure (as shown in Fig. 1.2 in Chap. 1). Among them, MuGNN and KECG achieve over 60% Hits@1 on \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) and over 70% on \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\), due to the rich structure that facilitates the process of KG completion, ultimately leading to improved EA performance.

The approaches in the second group achieve further improvement in results with the aid of the iterative training strategy. However, the reported results of BootEA and TransEdge are slightly higher than the values we obtained. Among the methods in Group III, CEA achieves ground-truth performance. Similar to SRPRS, the name-based heuristics Lev and Embed also achieve ground-truth results.

3.6 Efficiency Analysis

In order to provide a comprehensive evaluation, we report the average running time of each method on each dataset in Table 2.5, which allows us to compare the efficiency of different state-of-the-art solutions and provides insights into their scalability. We acknowledge that different parameter settings, such as the learning rate and number of epochs, may influence the final time cost. However, we aim to provide a general understanding of the efficiency of these methods by adopting the parameters reported in their original papers. As previously mentioned, we were unable to obtain the results of RDGCN and NAEA on DWY100K due to their requirement for an extremely large amount of memory space in our experimental environment.

Table 2.5 Averaged time cost on each dataset (in seconds)


On DBP15K and SRPRS, GCN-Align(SE) is the most efficient method with consistent alignment performance, followed closely by JAPE-Stru and ITransE. Most of the other methods have similar time costs (ranging from 1,000 to 10,000 seconds), except for NAEA and GM-Align, which require significantly longer running times.

The larger size of the DWY100K dataset leads to a significant increase in the time costs of all methods. MuGNN, KECG, and HMAN cannot run on GPUs due to memory limitations, and the authors of the original papers suggest running them on CPUs, which results in longer running times. Only three methods can complete the alignment process within 10,000 seconds, while most of the other approaches take between 10,000 and 100,000 seconds. In particular, GM-Align requires 5 days to generate the results, indicating that current state-of-the-art EA methods still have low efficiency when dealing with very large-scale data. Some methods, such as NAEA, RDGCN, and GM-Align, have poor scalability.

3.7 Comparison with Unsupervised Approaches

There exist some unsupervised methods for aligning KGs that do not employ representation learning. To ensure the study’s comprehensiveness, we compare with a typical system, namely, PARIS [20]. PARIS relies on the comparison of similarities between literals and employs a probabilistic algorithm to align entities jointly in an unsupervised manner. We also evaluate AgreementMakerLight (AML) [9], an unsupervised system for ontology alignment that leverages KGs’ background knowledge.^{Footnote 6}

The F1 score is employed as the evaluation metric since PARIS and AML do not produce a target entity for every source entity, thereby addressing cases where certain entities do not have a corresponding match in the other KG. The F1 score is calculated as the harmonic mean between precision (i.e., the number of correctly aligned entity pairs divided by the number of source entities for which an approach returns a target entity) and recall (i.e., the number of source entities for which an approach returns a target entity divided by the total number of source entities).
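The definitions above can be written down directly. The sketch below assumes predictions are given as a mapping that contains only the source entities for which a system returned a target; the function and variable names are illustrative.

```python
def precision_recall_f1(predictions: dict, gold: dict, num_source: int):
    """Compute precision, recall, and F1 as defined in this section.

    predictions: source entity -> returned target entity (only for source
                 entities where the system gave an answer)
    gold:        source entity -> correct target entity
    num_source:  total number of source entities
    """
    returned = len(predictions)
    correct = sum(1 for s, t in predictions.items() if gold.get(s) == t)
    # Precision: correct pairs / source entities with a returned target.
    precision = correct / returned if returned else 0.0
    # Recall (as defined here): source entities with a returned target / all source entities.
    recall = returned / num_source if num_source else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 4 source entities, 3 answered, 2 of the answers correct.
gold = {"s1": "t1", "s2": "t2", "s3": "t3", "s4": "t4"}
predictions = {"s1": "t1", "s2": "t2", "s3": "t9"}
p, r, f1 = precision_recall_f1(predictions, gold, num_source=4)
```

In the toy example, precision is 2/3, recall is 3/4, and the harmonic mean is 12/17 (about 0.706).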

Figure 2.2 illustrates that the overall performance of PARIS and AML is marginally lower than that of CEA. Despite CEA exhibiting more robust performance, it depends on training data (seed entity pairs) that may not be present in actual KGs. In contrast, unsupervised systems do not necessitate any training data and can still produce highly favorable outcomes. Furthermore, the results from PARIS and AML demonstrate that ontology information does, in fact, enhance the alignment outcomes.

Fig. 2.2 Grouped column graph of F1 scores on nine datasets. AML is present only on EN-FR and EN-DE, with approximately 0.930 and 0.960, respectively. PARIS and CEA reach the highest score of 1.000 on the DBP-WD and DBP-YG datasets. CEA scores lowest on ZH-EN, and PARIS on JA-EN.

3.8 Module-Level Evaluation

To obtain a better understanding of the techniques employed in various modules, we conduct an evaluation at the module level and present the associated experimental outcomes. More specifically, we select the representative methods from each module and create feasible combinations. By comparing the performance of different combinations, we can obtain a more precise assessment of the efficacy of various methods in these modules.

Regarding the embedding learning module, we use GCN and TransE. As for the alignment module, we adopt the margin-based loss function (Mgn) and the corpus fusion strategy (Cps). Following current approaches, we combine GCN with Mgn, and TransE with Cps, where the parameters are tuned in accordance with GCN-Align and JAPE, respectively. In the prediction module, we use the Euclidean distance (Euc), the Manhattan distance (Manh), and the cosine similarity (Cos). With regard to the extra information module, we denote the use of the bootstrapping strategy as B by implementing the iterative method in [32]. The use of multi-type information is represented as Mul, and we adopt the semantic and string-level features of entity names as in CEA.
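As an illustration of the prediction module, the sketch below implements the three measures named above (Euc, Manh, Cos) and a nearest-target lookup in plain Python. The cosine similarity is turned into a distance as 1 minus the similarity; that conversion, the function names, and the toy vectors are assumptions made for this example only.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(u, v)) / norm


MEASURES = {"Euc": euclidean, "Manh": manhattan, "Cos": cosine_distance}


def predict(source_emb, target_embs, measure="Euc"):
    """Return the index of the target embedding closest to source_emb."""
    dist = MEASURES[measure]
    return min(range(len(target_embs)), key=lambda j: dist(source_emb, target_embs[j]))


# Toy embeddings: target 0 has the same direction but a larger norm,
# target 1 is closer in space but points in a different direction.
source_emb = [1.0, 0.0]
target_embs = [[3.0, 0.0], [0.8, 0.9]]
choices = {name: predict(source_emb, target_embs, name) for name in MEASURES}
```

On this toy input, Euc and Manh select target 1 while Cos selects target 0, which shows in miniature why the choice of measure can change the alignment outcome.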

The Hits@1 results of 24 combinations are shown in Table 2.6.^{Footnote 7} It is evident that the addition of the bootstrapping strategy and/or textual information does, in fact, improve the overall performance. Regarding the embedding model, the GCN+Mgn model appears to have more robust and superior performance than TransE+Cps. Furthermore, the selection of distance measures also has an impact on the outcomes. Compared with Manh and Euc, Cos leads to better performance on TransE-based models, while it brings worse results on GCN-based models. Despite this, the integration of entity name embeddings results in consistently superior performance when using the Cos distance measure.

Table 2.6 Hits@1 results of module-level evaluation


Significantly, GCN+Mgn+Cos+Mul+B (referred to as Comb.) attains the most exceptional performance, indicating that a basic amalgamation of techniques from existing modules can lead to highly favorable alignment outcomes.

3.9 Summary

We summarize the major findings from the experimental results.

EA vs. ER

EA is distinctive from other related tasks since it operates on graph-structured data. As a result, all current EA solutions utilize the KG structure to create entity embeddings for aligning entities, which can produce favorable results on DBP15K and DWY100K. Nonetheless, depending solely on the KG structure has certain limitations, as there are long-tail entities with minimal structural information or entities that have similar neighboring entities but do not refer to the same real-world object. To address this issue, recent studies propose incorporating textual information, leading to better performance. However, this prompts a question regarding whether ER approaches can handle the EA task, given that the texts linked to entities are often used by conventional ER solutions.

We answer this question by including for comparison the name-based heuristics used in most typical ER methods. The experimental results reveal that: (1) ER solutions can indeed function on EA, but their performance relies heavily on the textual similarity between entities; (2) while ER solutions can surpass the majority of structure-based EA methods, they are still outperformed by EA techniques that use name information to supplement entity embeddings; and (3) incorporating the primary idea of ER, namely using literal similarity to identify equivalent entities, into EA methods is a promising direction worth exploring (as demonstrated by CEA).

Influence of Datasets

Figure 2.3 illustrates that the performance of EA methods varies significantly across different datasets. In general, dense datasets such as DBP15K and DWY100K tend to yield relatively better results than sparse ones. Moreover, methods perform better on mono-lingual KGs than on cross-lingual ones (DWY100K vs. DBP15K). Notably, on all mono-lingual datasets, the most performant method CEA, as well as the name-based heuristics Lev and Embed, achieve 100% accuracy. This is because these datasets are sourced from DBpedia, Wikidata, and YAGO, where equivalent entities in different KGs have identical names based on their entity identifiers, making it possible to obtain ground-truth results through a simple comparison of these names. However, these datasets do not reflect the real-life challenge of ambiguous entity names. To address this, we introduce a new mono-lingual benchmark, which will be discussed in the following section.

Fig. 2.3 Box plot of Hits@1 on nine datasets. EN-FR has the lowest median value and DBP-YG the highest. ZH-EN and EN-FR have a short interquartile range, and DBP-YG a large one. The mean values across all datasets range between 38 and 60. Values are estimated.

3.10 Guidelines and Suggestions

In this subsection, we provide guidelines and suggestions for potential users of EA approaches.

Guidelines for Practitioners

There are several considerations that may impact the selection of EA models. We have identified four of the most prevalent factors and provide the following recommendations:

Input information. If the input data only includes structured information from a knowledge graph, one may need to decide between using methods from Group I or Group II. On the other hand, if there is a lot of additional information available, one may prefer to use methods from Group III to make the most of these features and generate more trustworthy signals for alignment.
The scale of data. As explained in Sect. 2.3.6, certain cutting-edge techniques may not be scalable enough. Thus, it is important to consider the scale of the data before deciding on an alignment approach. For very large datasets, it may be wise to utilize simpler yet effective models, like GCN-Align, in order to minimize computational burden.
The objective of alignment. When the primary focus is on aligning entities, it may be preferable to employ models based on GNNs because they tend to be more resilient and adaptable. However, if there are other tasks involved, such as aligning relations, it might be more appropriate to use KG representation-based methods as they inherently learn both entity and relation representations. Additionally, recent research studies [23, 27] indicate that relations can aid in aligning entities.
The trade-off in bootstrapping. The bootstrapping process is a useful technique that can enhance the training set gradually and lead to improved alignment results. However, it can be susceptible to the problem of error propagation, which may introduce incorrectly matched entity pairs and amplify their negative effects in subsequent rounds. Additionally, it can be time-consuming. Therefore, when deciding whether to utilize the bootstrapping strategy, it is important to assess the difficulty of the datasets. If the datasets are relatively straightforward, with ample literal information and dense KG structures, utilizing the bootstrapping strategy may be a more suitable option. Otherwise, one should exercise caution when using this approach.

Suggestions for Future Research

We also discuss some open problems that are worthy of exploration in the future:

EA for long-tail entities. In actual knowledge graphs, most entities have few connections to other entities, while only a small number of entities have many connections. Aligning these less common entities is important for achieving good overall alignment performance, but current research on entity alignment has largely ignored them. A recent study [32] addresses this issue by using additional information to help align these less common entities and reducing their number through a KG completion process integrated into iterative self-training. However, there is still a lot of potential for further improvement in this area.
Multi-modal EA. Entities can be linked to information in various forms, including texts, images, and even videos. Therefore, it is necessary to explore multi-modal entity alignment, which involves aligning entities that have multiple modalities of associated information. This topic is worth further research [16].
EA in the open world. Most existing EA methods [12] operate under a closed-domain assumption, meaning that every entity in the source KG has a corresponding entity in the target KG. However, in real-world scenarios, there are always entities that cannot be matched. Moreover, labeled data, which is often necessary for state-of-the-art approaches, may not be accessible. Therefore, it is important to investigate EA in open-world settings, where unmatchable entities and limited labeled data are taken into account.

4 New Dataset and Further Experiments

As mentioned earlier, in current mono-lingual datasets, entities that have equivalent counterparts in different knowledge graphs have the same names based on their entity identifiers, which allows for reasonably accurate results through simple name comparison (with 100% precision on \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\)). However, in real-life KGs, entity identifiers are often not human-readable, and instead, they are linked to one or more human-readable names. For instance, Freebase identifies the capital of France as /m/05qtj, which is linked to names like “Paris” or “The City of Light.” Retrieving these names and matching entities that share the same name can still yield a precision of 100% on datasets such as \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\). However, in actual knowledge graphs, different entities can have the same name, even if they have different identifiers. For instance, the Freebase entities /m/05qtj (the capital of France) and /m/0h0_x (the king of Troy) share the name “Paris,” as do 20 cities in the USA. This means that using just the entity name to match entities will not work in real-life knowledge graphs. This presents a significant challenge for EA because it is not always certain that an entity with the name “Paris” in the source knowledge graph is the same as an entity with the same name in the target knowledge graph. The reason is that one might refer to the city in France, while the other might refer to the king of Troy. This is a significant complication in real-life knowledge graphs, as illustrated by the fact that in YAGO 3, about 34% of entities share a name with one or more other entities. This problem is not fully reflected in the commonly used mono-lingual datasets for EA.

A second issue with EA datasets is that they assume that for each entity in the source KG, there is exactly one corresponding entity in the target KG. This means that an EA approach can map each source entity to the most similar target entity. However, this is not a realistic scenario since KGs in real life may contain entities that are not present in other KGs. For instance, when aligning YAGO 3 and DBpedia, some entities may appear in YAGO 3 but not in DBpedia and vice versa. This problem is even more severe for KGs that draw data from various sources, such as YAGO 4 and IMDB. In YAGO 4, only 1% of entities are related to movies, while the remaining 99% are unrelated to IMDB entities, such as universities and smartphone brands. As a result, these entities have no matches in IMDB, and this problem is not addressed in current EA datasets.

We thus observe that the existing datasets for EA are an oversimplification of the real-life problem. Our solution is to create a fresh dataset that mimics these challenges. We anticipate that this dataset will result in improved EA models that can handle even more demanding problem scenarios and provide a clearer research direction for the community. In this section, we describe the development of the new dataset and present our experimental findings on it.

4.1 Dataset Construction

To reflect the difficulty of using entity names, we choose Freebase [2] as our target knowledge graph because it represents entities using indecipherable identifiers (i.e., Freebase MIDs), and different entities may have the same name. As to the source knowledge graph, we utilize DBpedia, which contains external links to Freebase that can be regarded as gold standards. The detailed process of constructing the new dataset is explained below:

Determining the Source Entity Set

We utilize the disambiguation information available in DBpedia to gather entities that have the same disambiguation term and create the entity set for the source knowledge graph. For example, for the ambiguous term Apple, the disambiguation records consist of entities such as Apple Inc. and Apple(fruit), both of which are included in the source entity set.

Determining Links and the Target Entity Set

Next, we utilize the external links between DBpedia and Freebase to obtain the entities in Freebase that correspond to the source entities and create the entity set for the target knowledge graph. These external links are considered as the gold standards. It should be noted that the entities in the target knowledge graph are identified using Freebase MIDs and multiple entities may have the same name, such as Apple. To retrieve the name for each entity, we use the label triples.

Retrieving Triples

Once the entity sets for the source and target knowledge graphs are determined, we extract the relational and attributive triples involving these entities from their respective knowledge graphs.

Refining Links and Entity Sets

Following the approach in previous work [21, 22], we retain only the links whose source and target entities are involved in at least one triple in their respective knowledge graphs, resulting in a total of 25,542 links. The entity sets are adjusted accordingly, including entities that participate in triples but not in links. Ultimately, there are 29,861 entities in the source knowledge graph, of which 4,319 cannot be matched, and 25,542 matchable entities in the target knowledge graph. Consistent with existing datasets, 30% of the links and unmatchable entities are utilized as the training set. For additional statistics on the dataset, please refer to Chap. 1.
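The refinement and split steps can be sketched as follows. This is a simplified illustration under several assumptions: only relational triples of the form (head, relation, tail) are considered, unmatchable entities are taken to be source entities that occur in triples but in no retained link, and the split is a seeded random shuffle. The function names and toy data are hypothetical and do not reproduce the actual construction scripts.

```python
import random


def refine_links(links, source_triples, target_triples):
    """Keep links whose two entities each occur in at least one triple."""
    src_entities = {e for h, _, t in source_triples for e in (h, t)}
    tgt_entities = {e for h, _, t in target_triples for e in (h, t)}
    kept = [(s, t) for s, t in links if s in src_entities and t in tgt_entities]
    linked_sources = {s for s, _ in kept}
    # Source entities that take part in triples but have no retained link.
    unmatchable = sorted(src_entities - linked_sources)
    return kept, unmatchable


def split_train(items, ratio=0.3, seed=0):
    """Shuffle with a fixed seed and put `ratio` of the items in the training set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


# Toy data with hypothetical identifiers.
links = [("a", "x"), ("b", "y"), ("c", "z")]
source_triples = [("a", "r1", "b"), ("b", "r2", "d")]
target_triples = [("x", "r3", "y")]
kept, unmatchable = refine_links(links, source_triples, target_triples)
train_links, test_links = split_train(kept)
```

In the toy data, the link ("c", "z") is dropped because neither entity occurs in a triple, and "d" becomes an unmatchable source entity. The reported counts follow the same logic: 25,542 linked source entities plus 4,319 unmatchable ones give the 29,861 source entities.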

4.2 Experimental Results on DBP-FB

In accordance with the current evaluation paradigm, we first analyze the performance of EA methods without considering unmatchable entities. As shown in Table 2.4, the overall performance of the methods in the first two groups is lower than that on SRPRS. This can be attributed to the greater structural heterogeneity of DBP-FB, which can be observed from sub-figures (d) in Fig. 1.2. In contrast to the KG pairs in sub-figures (a), (b), or (c), the entity distributions in these KGs are highly dissimilar, which makes it challenging to effectively leverage the structural information.

Methods that utilize entity names continue to produce the best results, although their performance is lower than on previous mono-lingual datasets. Furthermore, on DBP-FB, Embed and Lev achieve Hits@1 values of only 58.3% and 57.8%, respectively, whereas they attain 100% on \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\), \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\), \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\), and \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\). This confirms that DBP-FB reflects the challenge of entity name ambiguity better than existing datasets and is therefore a preferable mono-lingual benchmark.

4.3 Unmatchable Entities

In addition, DBP-FB also contains unmatchable entities, which presents another real-life challenge for EA. We therefore evaluate the performance of Comb. (from Sect. 2.3.8) on DBP-FB, taking into account these unmatchable entities. Consistent with Sect. 2.3.7, we utilize the precision, recall, and F1 score as evaluation metrics, with the exception that we define recall as the number of matchable source entities for which an approach returns a target entity, divided by the total number of matchable source entities.

Table 2.7 shows that Comb. exhibits a high level of recall, but its precision is relatively low. This is because it returns a target entity for every source entity, including those that cannot be matched. This pattern reflects the current performance of entity alignment solutions when dealing with unmatchable source entities. Nonetheless, this problem is not addressed in the current entity alignment datasets.

Table 2.7 EA performance on DBP-FB after considering unmatchable entities


In order to address this issue, we suggest a straightforward approach to handle unmatchable entities in DBP-FB, in addition to the current entity alignment solutions. Specifically, we propose setting a NIL threshold, denoted as \(\theta \), to predict unmatchable entities. As discussed in Sect. 2.2.3, entity alignment solutions typically employ a distance measure to find the corresponding target entity. If the distance value between a source entity and its nearest target entity is greater than \(\theta \), we consider the source entity to be unmatchable and exclude it from the alignment results. The value of the threshold \(\theta \) can be determined from the training data.
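A minimal sketch of this thresholding step is given below, together with the evaluation used in this subsection (recall computed over matchable source entities only). The text does not specify how \(\theta \) is derived from the training data, so the sketch simply takes it as an input; the data layout, function names, and toy values are assumptions.

```python
def align_with_nil(distances: dict, theta: float) -> dict:
    """Return source -> nearest target, omitting sources predicted as NIL.

    distances: source entity -> {target entity: distance}
    theta:     NIL threshold; a source is unmatchable if its nearest
               target is farther away than theta
    """
    alignment = {}
    for s, row in distances.items():
        target, d = min(row.items(), key=lambda kv: kv[1])
        if d <= theta:
            alignment[s] = target
    return alignment


def evaluate(alignment: dict, gold: dict, matchable: set):
    """Precision, recall over matchable source entities, and F1."""
    returned = len(alignment)
    correct = sum(1 for s, t in alignment.items() if gold.get(s) == t)
    precision = correct / returned if returned else 0.0
    answered_matchable = sum(1 for s in alignment if s in matchable)
    recall = answered_matchable / len(matchable) if matchable else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: s3 has no counterpart in the target KG.
distances = {
    "s1": {"t1": 0.1, "t2": 0.9},
    "s2": {"t1": 0.8, "t2": 0.2},
    "s3": {"t1": 0.7, "t2": 0.6},
}
gold = {"s1": "t1", "s2": "t2"}
matchable = {"s1", "s2"}

no_threshold = evaluate(align_with_nil(distances, float("inf")), gold, matchable)
with_threshold = evaluate(align_with_nil(distances, 0.5), gold, matchable)
```

Without a threshold, the unmatchable entity s3 still receives a target, so precision falls to 2/3 while recall stays at 1. With the hypothetical threshold of 0.5, s3 is excluded and precision rises to 1, mirroring the precision gain that the threshold is meant to deliver.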

As shown in Table 2.7, the threshold-enhanced solution Comb.+TH achieves a better F1 score. We hope this preliminary study can inspire follow-up research on this issue.

5 Conclusion

Entity alignment plays a crucial role in integrating KGs to enhance knowledge coverage and quality. Despite the numerous proposed solutions, there has been limited comprehensive evaluation and detailed analysis of their performance. To address this gap, this chapter presents an empirical assessment of state-of-the-art approaches in terms of effectiveness and efficiency on representative datasets. We also conduct a thorough analysis of their performance and provide evidence-based discussions. Furthermore, we introduce a new dataset that more accurately reflects real-world challenges, which can serve as a benchmark for future research in this field.

Notes

1. In this work, we use the distance between entity embeddings and the similarity between entity embeddings interchangeably.
2. To obtain the names of entities, for DBpedia and YAGO, current approaches directly adopt the names in the identifiers, while for Wikidata, they use the entity identifier to retrieve the name of the corresponding Wikipedia page. Notably, these names from different KGs share the same naming convention.
3. The relevant materials are available at https://github.com/DexterZeng/EAE.
4. In the interest of space, we put the detailed parameter settings in Appendix B.
5. The Hits@10 and MRR results of CEA are also missing in Table 2.3 and Table 2.4 for the same reason.
6. AML requires ontology information, which does not exist in current EA datasets. Therefore, we mine the ontology information for these KGs. However, we can only successfully run AML on \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) and \({\mathtt {SRPRS}_{\mathtt {EN-DE}}}\).
7. The results on other datasets exhibit similar trends and hence are omitted in the interest of space.

References

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250, 2008.
A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
Y. Cao, Z. Liu, C. Li, Z. Liu, J. Li, and T. Chua. Multi-channel graph neural network for entity alignment. In ACL, pages 1452–1461, 2019.
M. Chen, Y. Tian, K. Chang, S. Skiena, and C. Zaniolo. Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In IJCAI, pages 3998–4004, 2018.
M. Chen, Y. Tian, M. Yang, and C. Zaniolo. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In IJCAI, pages 1511–1517, 2017.
A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
S. Das, P. S. G. C., A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, and Y. Park. Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In SIGMOD, pages 1431–1446, 2017.
D. Faria, C. Pesquita, E. Santos, I. F. Cruz, and F. M. Couto. Agreementmakerlight 2.0: Towards efficient large-scale ontology matching. In M. Horridge, M. Rospocher, and J. van Ossenbruggen, editors, ISWC, volume 1272 of CEUR Workshop Proceedings, pages 457–460. CEUR-WS.org, 2014.
L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek. Fast rule mining in ontological knowledge bases with AMIE+. VLDB J., 24(6):707–730, 2015.
L. Guo, Z. Sun, and W. Hu. Learning to exploit long-term relational dependencies in knowledge graphs. In ICML, pages 2505–2514, 2019.
S. Hertling and H. Paulheim. The knowledge graph track at OAEI - gold standards, baselines, and the golden hammer bias. In A. Harth, S. Kirrane, A. N. Ngomo, H. Paulheim, A. Rula, A. L. Gentile, P. Haase, and M. Cochez, editors, ESWC, volume 12123 of Lecture Notes in Computer Science, pages 343–359. Springer, 2020.
T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710, 1966.
C. Li, Y. Cao, L. Hou, J. Shi, J. Li, and T.-S. Chua. Semi-supervised entity alignment via joint knowledge embedding model and cross-graph model. In EMNLP, pages 2723–2732, 2019.
Y. Liu, H. Li, A. García-Durán, M. Niepert, D. Oñoro-Rubio, and D. S. Rosenblum. MMKG: multi-modal knowledge graphs. In P. Hitzler, M. Fernández, K. Janowicz, A. Zaveri, A. J. G. Gray, V. López, A. Haller, and K. Hammar, editors, ESWC, volume 11503 of Lecture Notes in Computer Science, pages 459–474. Springer, 2019.
F. Monti, O. Shchur, A. Bojchevski, O. Litany, S. Günnemann, and M. M. Bronstein. Dual-primal graph convolutional networks. CoRR, abs/1806.00770, 2018.
S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, pages 19–34, 2018.
V. Rastogi, N. N. Dalvi, and M. N. Garofalakis. Large-scale collective entity matching. PVLDB, 4(4):208–218, 2011.
F. M. Suchanek, S. Abiteboul, and P. Senellart. PARIS: probabilistic alignment of relations, instances, and schema. PVLDB, 5(3):157–168, 2011.
Z. Sun, W. Hu, and C. Li. Cross-lingual entity alignment via joint attribute-preserving embedding. In ISWC, pages 628–644, 2017.
Z. Sun, W. Hu, Q. Zhang, and Y. Qu. Bootstrapping entity alignment with knowledge graph embedding. In IJCAI, pages 4396–4402, 2018.
Z. Sun, J. Huang, W. Hu, M. Chen, L. Guo, and Y. Qu. Transedge: Translating relation-contextualized embeddings for knowledge graphs. In ISWC, pages 612–629, 2019.
B. D. Trisedya, J. Qi, and R. Zhang. Entity alignment between knowledge graphs using attribute embeddings. In AAAI, pages 297–304, 2019.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
Z. Wang, Q. Lv, X. Lan, and Y. Zhang. Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pages 349–357, 2018.
Y. Wu, X. Liu, Y. Feng, Z. Wang, R. Yan, and D. Zhao. Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI, pages 5278–5284, 2019.
Y. Wu, X. Liu, Y. Feng, Z. Wang, and D. Zhao. Jointly learning entity and relation representations for entity alignment. In EMNLP, pages 240–249, 2019.
K. Xu, L. Wang, M. Yu, Y. Feng, Y. Song, Z. Wang, and D. Yu. Cross-lingual knowledge graph alignment via graph matching neural network. In ACL, pages 3156–3161, 2019.
H.-W. Yang, Y. Zou, P. Shi, W. Lu, J. Lin, and S. Xu. Aligning cross-lingual entities with multi-aspect information. In EMNLP, pages 4422–4432, 2019.
W. Zeng, X. Zhao, J. Tang, and X. Lin. Collective entity alignment via adaptive features. In ICDE, pages 1870–1873. IEEE, 2020.
W. Zeng, X. Zhao, W. Wang, J. Tang, and Z. Tan. Degree-aware alignment for entities in tail. In SIGIR, pages 811–820. ACM, 2020.
Q. Zhang, Z. Sun, W. Hu, M. Chen, L. Guo, and Y. Qu. Multi-view knowledge graph embedding for entity alignment. In IJCAI, pages 5429–5435, 2019.
H. Zhu, R. Xie, Z. Liu, and M. Sun. Iterative entity alignment via joint knowledge embeddings. In IJCAI, pages 4258–4264, 2017.
Q. Zhu, X. Zhou, J. Wu, J. Tan, and L. Guo. Neighborhood-aware attentional representation for multilingual knowledge graphs. In IJCAI, pages 1943–1949, 2019.


Author information

Authors and Affiliations

Laboratory for Big Data and Decision, National University of Defense Technology, Changsha, Hunan, China
Xiang Zhao, Weixin Zeng & Jiuyang Tang


Appendices

Appendix A

1.1 Methods in Group I of Table 2.1

MTransE

The MTransE model [6] is a translation-based approach for learning multilingual KG embeddings to support EA. Initially, it utilizes TransE (without negative triples) to project each KG into separate embedding spaces. Next, MTransE applies three distinct transition strategies: distance-based axis calibration, translation vectors, and linear transformation, to map the embedding vectors to their cross-lingual counterparts. During the prediction stage, a KNN search is conducted on the cross-lingual transition point of a target entity to obtain its corresponding counterpart.
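
The prediction step described above can be illustrated with a minimal sketch of the linear-transformation variant: a source entity embedding is mapped into the target space and its nearest target entities are returned. The function names, the toy matrix `M`, and the entity names are hypothetical and serve only as an illustration; this is not the authors' implementation, and the L2 distance used for the search is an assumption.

```python
import math

def apply_linear_map(matrix, vector):
    """Multiply a d x d matrix (given as a list of rows) by a d-dimensional vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def l2_distance(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_counterparts(source_vec, transform, target_embeddings, k=1):
    """Map a source entity to its cross-lingual transition point,
    then return the names of the k nearest target entities."""
    transition_point = apply_linear_map(transform, source_vec)
    ranked = sorted(
        target_embeddings,
        key=lambda name: l2_distance(transition_point, target_embeddings[name]),
    )
    return ranked[:k]

# Toy data (made-up values): M swaps the two coordinates.
M = [[0.0, 1.0], [1.0, 0.0]]
target = {"Paris_fr": [0.9, 0.1], "Berlin_fr": [0.1, 0.9]}
print(knn_counterparts([0.1, 0.9], M, target, k=1))
```

In practice the transformation would be learned from seed alignments rather than fixed by hand, and the search would run over all entities of the target KG.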

RSNs

In this study [11], RNNs are combined with residual learning to effectively capture the long-term relational dependencies among KGs and generate more comprehensive KG structural embeddings for EA.

The paper [11] argues that triple-level learning is inadequate for capturing the long-term relational dependencies of entities and for propagating semantic information among entities. To address this limitation, recurrent skipping networks (RSNs) are proposed to learn the long-term relational paths between entities. To obtain the desired paths, biased random walks are used to efficiently sample paths from the KGs, with elements in two KGs connected by seed alignments. During the prediction phase, cosine similarity is utilized to predict the results.

MuGNN

The paper [4] proposes a multichannel GNN for learning KG embeddings that are oriented toward entity alignment.

The MuGNN approach first conducts relation weighting to generate a weight matrix for each KG using KG self-attention and cross-KG attention schemes, which correspond to different GNN channels. Next, it applies the GNN encoder (and the corresponding weight matrix) for each channel to model the KG structure. The outputs of the different channels are combined using the pooling operation. Finally, a margin-based alignment model is utilized to embed the two KGs into a unified embedding space.

MuGNN also addresses structural differences between KGs by completing missing relations. It uses AMIE+ to induce rules and transfers these rules between KGs through aligned relations. Note, however, that not all datasets have aligned relations, in which case the rule transfer may fail.

KECG

The paper [15] proposes a method for jointly learning a knowledge embedding model that encodes inner-graph relationships and a cross-graph model that enhances entity embeddings with their neighbors’ information.

The main concept behind KECG involves employing a cross-graph model, which is an enhanced version of a graph attention network (GAT), to convert entities into a single vector space by incorporating both intra-graph and inter-graph alignment information. The resulting embeddings are then utilized as input for TransE, which models intra-graph connections and enforces relational constraints between entities to promote consistency across different KGs. During inference, equivalent entities are identified based on the L2 distance between entities in terms of the unified embeddings.

1.2 Methods in Group II of Table 2.1

ITransE

This study [34] extends the use of TransE to learn the structure of knowledge graphs. It develops three models, including translation-based, linear transformation-based, and parameter sharing-based models, to generate joint embeddings for various knowledge graphs. The study proceeds to iteratively align entities and update the joint knowledge embeddings, progressively considering highly confident aligned entities identified by the model. During the prediction stage, the model retrieves the closest entity from the target knowledge graph as the corresponding entity for each source entity.
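
One round of the iterative alignment described above can be sketched as follows. This is a simplified illustration, not ITransE itself: the distance threshold, the one-to-one restriction, and all names are assumptions, and the embeddings would be retrained between rounds in a real system.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def propose_confident_pairs(source_emb, target_emb, aligned, threshold):
    """Return new source -> target pairs whose nearest-neighbor distance
    falls below `threshold`. Already aligned entities are skipped and each
    target entity is used at most once (a simplifying assumption)."""
    used_targets = set(aligned.values())
    proposals = {}
    for s, s_vec in source_emb.items():
        if s in aligned:
            continue
        candidates = [t for t in target_emb if t not in used_targets]
        if not candidates:
            break
        best = min(candidates, key=lambda t: l2_distance(s_vec, target_emb[t]))
        if l2_distance(s_vec, target_emb[best]) < threshold:
            proposals[s] = best
            used_targets.add(best)
    return proposals

# Toy data (made-up values): "a" is a seed pair, "b" is confidently matched,
# and "c" stays unaligned because its nearest neighbor is too far away.
source = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
target = {"x": [0.1, 0.0], "y": [1.0, 1.1], "z": [9.0, 9.0]}
seeds = {"a": "x"}
new_pairs = propose_confident_pairs(source, target, seeds, threshold=0.5)
seeds.update(new_pairs)  # the enlarged set would drive the next training round
print(seeds)
```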

BootEA

This work [22] proposes a bootstrapping technique for EA, which iteratively labels probable EA pairs as training data for learning alignment-oriented KG embeddings.

In terms of the KG structure encoder, BootEA employs TransE, but substitutes the margin-based loss function with a limit-based objective function. The approach involves learning alignment-oriented KG embeddings by swapping aligned entities between triples from different KGs. Additionally, the authors develop a bootstrapping strategy to refine alignment-oriented embeddings, which involves iteratively labeling probable alignments and adding them to the training data. BootEA further models EA as a classification problem and aims to maximize alignment likelihood across all labeled and unlabeled entities based on KG embeddings. During the prediction stage, cosine similarity is used to identify latent aligned entities.
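
The limit-based objective mentioned above can be written as follows. This is a sketch in commonly used notation: the symbols \(T^{+}\), \(T^{-}\), and \(f\) are introduced here for illustration, and reading \(\gamma_1\), \(\gamma_2\), and \(\mu_1\) as the BootEA parameters of the same names in Appendix B is an assumption.

\[
O_e = \sum_{\tau \in T^{+}} \left[ f(\tau) - \gamma_1 \right]_{+} + \mu_1 \sum_{\tau' \in T^{-}} \left[ \gamma_2 - f(\tau') \right]_{+}, \qquad f(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert_2^2
\]

Here \(T^{+}\) and \(T^{-}\) denote the sets of positive and negative triples, and \([x]_{+} = \max(0, x)\). A margin-based loss constrains only the gap between positive and negative scores, whereas this form pushes the scores of positive triples below \(\gamma_1\) and those of negative triples above \(\gamma_2\) directly.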

NAEA

This paper [35] introduces a technique called neighborhood-aware attentional representation to enhance the effectiveness of EA, which is built on the fundamental framework of BootEA.

NAEA comprises two components: a knowledge embedding (KE) component and an entity alignment (EA) component. KE employs an attention mechanism to obtain neighbor-level representations of entities by combining their neighbors with weighted attention, and subsequently utilizes TransE to model both neighbor-level and relation-level representations. In contrast, BootEA only encodes relation-level representations.

Like BootEA, NAEA also treats the alignment task as a classification problem in its EA component. However, in NAEA, alignment probability calculation also incorporates neighbor-level knowledge information. During the prediction stage, the approach employs cosine similarity to identify aligned entity pairs based on integrated representations of entities.

TransEdge

This work [23] introduces a new edge-centric embedding model for EA, which contextualizes relation representations with respect to particular head-tail entity pairs.

The proposed method, TransEdge, defines a novel energy function to evaluate the accuracy of edge translation between entity embeddings for KG embedding learning. To model edge embeddings, two methods are employed: context compression and context projection. The limit-based loss function of TransEdge is used to optimize entity embeddings for EA, and the distance between seed entities is minimized to reconcile two KGs. During the prediction phase, the model ranks entities in another KG based on the cosine similarity of their entity embeddings in descending order for a given entity to be aligned. The intended match is expected to have the highest rank.
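
The ranking step above can be sketched as follows: candidates are sorted by cosine similarity in descending order, and the position of the intended match is read from the ranking. All names and values are hypothetical, and zero-norm vectors are not handled in this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidate_emb):
    """Return candidate names sorted by descending cosine similarity to the query."""
    return sorted(
        candidate_emb,
        key=lambda name: cosine_similarity(query_vec, candidate_emb[name]),
        reverse=True,
    )

def rank_of(gold, ranking):
    """1-based position of the intended match in the ranking."""
    return ranking.index(gold) + 1

# Toy data (made-up values).
candidates = {"e1": [0.9, 0.1], "e2": [0.0, 1.0], "e3": [0.5, 0.5]}
ranking = rank_candidates([1.0, 0.0], candidates)
print(ranking, rank_of("e3", ranking))
```

A rank of 1 for the intended match corresponds to a correct top-1 prediction.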

1.3 Methods in Group III of Table 2.1

JAPE

This work [21] presents a joint attribute-preserving embedding model for EA, which generates embeddings that incorporate both KG relations and attributes.

The proposed JAPE approach first employs TransE to encode the structure of each KG, but adapts the loss function. In addition to the large margin between scores of positive and negative triples, JAPE aims to assign lower scores to positive triples and higher scores to negative triples. The seed EA pairs are used to construct an overlay relationship graph in the corpus, which can align separate KG embeddings into a unified one.

Additionally, JAPE observes that latent aligned entities tend to have similar attribute values. It therefore abstracts attribute values to their range types and generates attribute embeddings to capture attribute correlations. Finally, attribute similarity constraints are combined with structural embeddings to refine entity representations by clustering entities with high attribute correlations. During the search for latent aligned entities, the model uses cosine similarity between entity embeddings.

GCN-Align

This work [26] utilizes GCN as the KG structure encoder for aligning entities.

To elaborate, GCN-Align uses a GCN to capture the structural information of KGs and generate neighborhood-aware entity embeddings. It also embeds the attribute names of entities to provide a complementary view. A margin-based ranking loss function is used to unify the embeddings from different KGs. At prediction time, the structural and attribute embeddings are combined, and latent aligned entity pairs are predicted according to the Manhattan distance between entities from the two KGs.
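
The two ingredients above, a margin-based ranking loss and a Manhattan-distance score that combines both views, can be sketched as follows. This is an illustration rather than the authors' code: the per-dimension normalization and the use of a weight `beta` to balance the two views are assumptions, with defaults chosen to mirror the \(\beta = 0.9\) and \(\gamma_s = 3\) values listed in Appendix B.

```python
def manhattan(u, v):
    """L1 (Manhattan) distance between two equally sized vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def combined_distance(struct_a, struct_b, attr_a, attr_b, beta=0.9):
    """Weighted sum of the structural and attribute L1 distances,
    each divided by its embedding dimension (an assumed normalization)."""
    d_s, d_a = len(struct_a), len(attr_a)
    structural = manhattan(struct_a, struct_b) / d_s
    attributive = manhattan(attr_a, attr_b) / d_a
    return beta * structural + (1 - beta) * attributive

def margin_ranking_loss(pos_distance, neg_distances, margin=3.0):
    """Hinge loss that asks the seed pair to be closer than each
    negative pair by at least `margin`."""
    return sum(max(0.0, pos_distance - neg + margin) for neg in neg_distances)

# Toy values (made up): one seed pair at distance 1.0 and two negative pairs.
print(margin_ranking_loss(1.0, [5.0, 2.0], margin=3.0))
print(combined_distance([0.0, 0.0], [1.0, 1.0], [0.0], [2.0], beta=0.9))
```

Only the second negative pair contributes to the loss in this example, because the first one already lies more than the margin beyond the seed pair.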

AttrE

This work [24] proposes to learn attribute embeddings of entities, which shift the entity embeddings of two KGs into the same vector space.

First, AttrE creates a module for matching the predicates of two KGs, renaming them into a shared naming system to make sure the relation embeddings are compatible. Subsequently, TransE is used to learn the structural embeddings, and attributes are encoded as attribute character embeddings. A transitivity rule is used to enrich the attribute triples. Finally, the attribute character embeddings are used to project the structural embeddings of entities into the same vector space, and cosine similarity is used to make the prediction.

KDCoE

This work [5] develops a semi-supervised cross-lingual method to align multilingual KGs with minimal supervision.

The KDCoE approach uses TransE as the structure encoder and combines it with a linear transformation-based network to bring together different knowledge graph embeddings. Additionally, it uses an attentive gated recurrent unit encoder (AGRU) to create representations of entity descriptions. The KDCoE approach trains both modules simultaneously, with both models suggesting a set of the most confident entity alignment pairs during each iteration to improve cross-lingual learning accuracy over time. Similarly to MTransE, the prediction is made through a KNN search from the cross-lingual conversion point of a target entity.

HMAN

The HMAN [30] method utilizes GCN to merge multiple types of information in order to generate entity embeddings. Additionally, it proposes a modified model that incorporates the textual descriptions of entities, which are encoded using a pre-trained multilingual BERT model.

In detail, the HMAN approach employs GCN to model the structural connections and uses feedforward neural networks to generate embeddings for attributes and relations, as using GCN to learn attribute and relation embeddings inherently considers the neighboring entities’ attributes and relations, which could lead to noise. The approach then concatenates these representations to form a hybrid multi-aspect entity embedding. Finally, the method utilizes a margin-based ranking loss function to align the entities.

Furthermore, the HMAN method introduces two additional techniques, pointwise-BERT and pairwise-BERT, for utilizing multilingual BERT on entity descriptions to aid in the entity alignment process. To integrate entity descriptions with the hybrid multi-aspect entity embeddings, two strategies, reranking and weighted concatenation, are proposed. For prediction, the method leverages the L1 distance between entity embeddings.

GM-Align

The work described in reference [29] approaches the entity alignment problem as a graph matching challenge that can be addressed through both entity-level and graph-level matching techniques.

The GM-Align method initially generates a topic entity graph to depict the connections between a given entity (the “topic entity”) and its neighboring entities. This graph is then used to apply GCN to encode the structural information and generate matching scores. The method employs a word-based LSTM to embed the entity names as an initial feature matrix for GCN. The matching framework learns alignment information between the two knowledge graphs. During prediction, the method ranks all entities in the other knowledge graph in descending order of their matching probabilities, with the top-ranked entity considered as the result.

RDGCN

The authors of reference [27] propose a relation-aware dual-graph convolutional network to include relation information by employing attentive interactions between a knowledge graph and its dual relation counterpart, so as to achieve an effective entity alignment process.

The RDGCN method acknowledges that GCN-based models often disregard the relation information present in knowledge graphs. To address this, the authors employ the dual-primal graph CNN (DPGCNN) method to incorporate relation information. To adapt DPGCNN to the entity alignment task, the RDGCN method proposes a weighted model and explores the head/tail representations, which are initialized with entity names, as a way to capture the relation information.

The RDGCN method permits multiple rounds of interactions between the primal entity graph and its dual relation graph, thus allowing the model to integrate more complex relation information into entity representations effectively. The method employs GCN with highway gates to incorporate neighboring structural information. The authors devise a margin-based scoring function to align embeddings from different knowledge graphs. During prediction, the method uses the Manhattan distance between entity embeddings.

HGCN

The authors of reference [28] suggest jointly learning entity and relation representations for the entity alignment task.

The HGCN method first embeds entities from the different knowledge graphs using highway-GCNs, in which highway gates control noise propagation across GCN layers. Next, the entity embeddings are utilized to approximate relation representations, which are then used to align relations across knowledge graphs. Finally, HGCN incorporates the relation representations into the entity embeddings to obtain joint entity representations and continues to use GCN to iteratively integrate neighboring structural information to improve the entity and relation representations further. Similar to RDGCN, a margin-based scoring function is used to align embeddings from different knowledge graphs, and the entity name is used as the initial feature matrix for GCN. During prediction, the method employs the Manhattan distance between entity embeddings.
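
The highway gates mentioned for RDGCN and HGCN can be written in the standard highway-network form below. This is a sketch under assumed notation: \(X^{(l)}\) is the input of layer \(l\), \(\tilde{X}^{(l+1)}\) is the output of the GCN layer, and \(W^{(l)}\) and \(b^{(l)}\) are the gate parameters. None of these symbols appear in this chapter.

\[
T\left(X^{(l)}\right) = \sigma\left(X^{(l)} W^{(l)} + b^{(l)}\right), \qquad X^{(l+1)} = T\left(X^{(l)}\right) \odot \tilde{X}^{(l+1)} + \left(1 - T\left(X^{(l)}\right)\right) \odot X^{(l)}
\]

Here \(\sigma\) is the sigmoid function and \(\odot\) denotes element-wise multiplication. When the gate is close to 0, the layer passes its input through largely unchanged, which limits how much noise from the neighborhood aggregation reaches the next layer.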

MultiKE

The MultiKE method proposes a new framework that integrates entity names, relations, and attributes to learn embeddings for alignment [33].

The MultiKE method defines three different perspectives for EA, namely, entity name, relation, and attribute, and employs specific models to learn embeddings for each perspective. The TransE model is used to encode KG structure, with logistic loss replacing the margin-based loss. Two cross-KG identity inference strategies are proposed to capture and propagate alignment information between KGs. The view-specific entity embeddings are then combined and used for prediction through nearest-neighbor search. It should be noted that this method is currently only applicable to mono-lingual EA.

CEA

The authors of [31] create a unified EA framework that takes into account how different EA decisions are interconnected.

CEA uses three types of features (structural, semantic, and string signals) to capture different aspects of entities in heterogeneous knowledge graphs. The authors then model the problem of making collective EA decisions by framing it as a stable matching problem, which is solved using the deferred acceptance algorithm.
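
The deferred acceptance algorithm mentioned above can be sketched as follows. This is a generic Gale-Shapley implementation, not the CEA code: it assumes that every entity has a complete preference list over the entities of the other KG (for example, obtained by sorting fused similarity scores), and all names are hypothetical.

```python
def deferred_acceptance(source_prefs, target_prefs):
    """Source entities propose in order of preference; each target entity
    tentatively keeps the proposer it prefers most. Returns source -> target."""
    # rank[t][s] is the position of s in t's preference list (lower is better).
    rank = {t: {s: i for i, s in enumerate(prefs)} for t, prefs in target_prefs.items()}
    next_choice = {s: 0 for s in source_prefs}  # index of the next target to try
    engaged = {}                                # target -> currently accepted source
    free = list(source_prefs)
    while free:
        s = free.pop()
        if next_choice[s] >= len(source_prefs[s]):
            continue  # s has exhausted its list and stays unmatched
        t = source_prefs[s][next_choice[s]]
        next_choice[s] += 1
        current = engaged.get(t)
        if current is None:
            engaged[t] = s            # t was free and accepts tentatively
        elif rank[t][s] < rank[t][current]:
            engaged[t] = s            # t prefers the new proposer
            free.append(current)      # the previous partner becomes free again
        else:
            free.append(s)            # t rejects s, who will try the next choice
    return {s: t for t, s in engaged.items()}

# Toy preferences (made up): both sources prefer v1, but v1 prefers u2.
source_prefs = {"u1": ["v1", "v2"], "u2": ["v1", "v2"]}
target_prefs = {"v1": ["u2", "u1"], "v2": ["u1", "u2"]}
matching = deferred_acceptance(source_prefs, target_prefs)
print(matching)
```

In the toy example, choosing the top candidate independently for each source entity would assign v1 twice. The stable matching resolves this conflict by giving v1 to u2 and v2 to u1.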

Appendix B

1.1 Parameter Setting

The definitions of the parameters can be found in the original papers of the respective methods.

MTransE: \(\lambda = 0.01\), \(\alpha = 5\), \(k = 75\), and \(epoch = 1000\).
JAPE-Stru, JAPE: \(d = 75\), \(\alpha = 0.1\), \(\beta = 0.05\), and \(\delta = 0.05\). For SE, the learning rate is set to 0.01 and early stopping is used. For AE, the learning rate is set to 0.1 and the number of epochs is set to 100.
GCN-Align(SE), GCN-Align: \(d_s = 300\), \(d_a = 100\), \(\gamma _s = 3\), \(\gamma _a = 3\), and \(\beta = 0.9\).
RSNs: \(\alpha = 0.9\) and \(\beta = 0.9\); learning rate is set to 0.003, embedding size set to 256, batch size set to 512, and length set to 15.
MuGNN: \(\gamma _1 = 1.0\), \(\gamma _2 = 1.0\), and \(\gamma _r = 0.12\); embedding size is set to 128, learning rate set to 0.001, L2 set to 0.01, dropout set to 0.2, and epoch set to 500.
KECG: \(K_1 = 25\), \(K_2 = 2\), \(\lambda = 0.005\), \(\gamma _1 = 3.0\), \(\gamma _2 = 3.0\), dimension set to 128, and epoch set to 1,000.
ITransE: \(n = 50\), \(\gamma = 1.0\), \(k = 1.0\), \(\lambda = 0.001\), and \(epoch = 3000\).
BootEA: \(\gamma _1 = 0.01\), \(\gamma _2 = 2.0\), \(\gamma _3 = 0.7\), \(\mu _1 = 0.2\), \(\mu _2 = 0.1\). For DBP15K and SRPRS, \(\epsilon = 0.9\); for DWY100K, \(\epsilon = 0.98\). Learning rate is set to 0.01, epoch set to 50, and dimension set to 75.
NAEA: \(m = 75\), \(\beta = 0.8\), \(\lambda = 1\), \(\mu _1 = 1\), \(\mu _2 = 0.1\), \(\gamma = 2\), \(K = 4\), \(\eta = 0.01\), and \(epoch = 50\).
TransEdge: \(\gamma _1 = 0.2\), \(\gamma _2 = 2.0\), \(\alpha = 0.8\), \(s = 0.7\), and \(d = 75\); the learning rate is set to 0.01 and early stopping is used.
HMAN: \(F = 1,000\), \(\beta = 3\), \(\tau = 0.8\), and \(epoch = 50,000\). For DBP15K and SRPRS, the dimensions of the topological, relation, and attribute embeddings are set to 200, 100, and 100, respectively; for DWY100K, they are set to 100, 50, and 50, respectively.
GM-Align: \(K_1 = 2\), and \(K_2 = 3\); learning rate is set to 0.001 and batch size set to 32.
RDGCN: \(\beta _1 = 0.1\), \(\beta _2 = 0.3\), \(\gamma = 1.0\), \(d = 300\), \(d^\prime = 600\), \(\tilde {d} = 300\), and \(\kappa = 125\); learning rate is set to 0.001.
HGCN: \(\gamma = 1\), \(\beta = 20\), and \(\kappa = 125\); learning rate is set to 0.001.

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this chapter

Cite this chapter

Zhao, X., Zeng, W., Tang, J. (2023). State-of-the-Art Approaches. In: Entity Alignment. Big Data Management. Springer, Singapore. https://doi.org/10.1007/978-981-99-4250-3_2

DOI: https://doi.org/10.1007/978-981-99-4250-3_2
Published: 26 October 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4249-7
Online ISBN: 978-981-99-4250-3
eBook Packages: Computer Science, Computer Science (R0)