1 Introduction

Google proposed knowledge graphs to improve the accuracy of search engines and the efficiency of user retrieval (Liu et al. 2016). Knowledge graphs can help mine the semantic information of user needs and eliminate ambiguities. The semantic network (Berners-Lee et al. 2001) is the predecessor of the knowledge graph. The semantic network focuses more on describing relationships between concepts, while the knowledge graph is more inclined to describe associations between entities. The emergence of knowledge graphs is more in line with the development trend of computer semantics.

In recent years, knowledge graphs have been widely used in many fields. The data sources used to build knowledge graphs can be structured data, semi-structured data, unstructured data, generic knowledge graphs, etc., and different organizations choose data sources according to their business needs. In addition, there is no unified industry standard for building knowledge graphs in different domains. These factors lead to heterogeneity and redundancy among different knowledge graphs. For example, the entries for “Childhood” in Interactive Encyclopedia and Wikipedia are partly complementary and partly redundant. If the information of the two encyclopedias can be correlated, users will obtain a more detailed and comprehensive knowledge of the book. To make full use of the information of entities, more and more researchers are fusing different knowledge graphs (Lin et al. 2020).

As a way to integrate knowledge (Mishra et al. 2017), entity alignment (EA) extracts entities that refer to the same real-world objects in different knowledge graphs (Yang et al. 2020), which is beneficial for knowledge-driven applications. Traditional EA methods rely on machine translation or feature engineering, which are labor-intensive; moreover, hand-designed features contain subjective factors, so the accuracy of traditional methods depends heavily on the quality of translation and the definition of features. Recently, representation learning techniques have proven to capture structural information better, so more and more researchers adopt them for knowledge graph EA. Embedding-based EA methods are free from the reliance on manually constructed features or rules.

Several works (Zhao et al. 2022; Zhang et al. 2021; Sun et al. 2020; Fanourakis et al. 2022; Chaurasiya et al. 2022) have reviewed the development of EA. However, the field is evolving rapidly and the existing review papers do not include the latest EA models. In addition, their work is not presented in sufficient detail to facilitate the reader’s understanding. Zhao et al. (2022) divide the EA framework into four parts: embedding learning module, alignment module, prediction module, and additional information module. They divide the existing state-of-the-art methods into three groups and perform group evaluations to compare the results of the same models on different datasets. Representative methods from each module are selected to generate possible combinations, and the effectiveness of the methods in each module is assessed by comparing the performance of the different combinations. However, most EA methods in their experiments are based on local alignment, and they do not cover multimodal knowledge graph EA or Chinese knowledge graph EA. Zhang et al. (2021) conduct a comprehensive survey and analysis of embedding-based knowledge graph EA, dividing the EA framework into two processes: embedding and alignment. They present the history of embedding and methods based on TransE and graph convolutional networks, listing nearly 30 representative structural embedding models that use these two embedding approaches. Zhang et al. analyze the embedding models by examining whether they add attribute information, whether they use relational predicates as input, and whether they use seed alignments; however, this classification of information is not refined enough. Sun et al. (2020) investigate 23 embedding-based EA methods, classify them according to their techniques and features, and construct an open-source library that contains 12 representative embedding-based EA methods together with evaluation procedures. However, their focus is on dataset construction and experimental results; the models are not introduced according to the classification of techniques, and their technical discussion cannot meet the needs of subsequent new models. Fanourakis et al. (2022) do not provide a comprehensive introduction to embedding-based EA tasks; for example, path sequence models among embedding methods, multimodal EA, dangling-entity EA, and alignment inference strategies are not presented. Chaurasiya et al. (2022) focus on aspects such as degree distribution, non-isomorphic neighbourhoods, and name bias, and do not present the details of the various parts of the alignment process.

This paper presents comprehensive research in this field to fill the gap in existing reviews, with the following main contributions.

(1) This paper proposes a new EA framework, which is divided into three parts: information aggregation module, alignment module, and post-alignment module. Each module has unique functions. In the information aggregation module, this paper not only introduces different embedding initialization methods, but also further refines the subsequent parts into global structure embedding and local semantic information. Compared to existing reviews, which tend to treat relations simply as structures, this paper not only considers the structural aspects of relations on the macro level, but also captures the local semantic information of relations on the micro level. In addition, this paper details the interaction between global structure and local semantics, revealing their complementarities and collaborations in the entity alignment process. In the alignment module, this paper introduces alignment optimization strategies and non-alignable entity prediction methods, which are rarely mentioned in previous reviews. Moreover, this paper also comprehensively analyzes different alignment inference strategies from both global and local perspectives. In the post-alignment module, this paper compares and analyzes a variety of iterative strategies to provide guidance for practical applications.

(2) In the experimental part, this paper presents the performance of both unimodal and multimodal EA; the unimodal experiments also cover Chinese EA. Considering that entity alignment is a bidirectional matching problem, this paper examines the effect of EA direction on model performance. By comparing the experimental results, it is found that the direction does affect model performance, which provides a reference for researchers to optimize their models. In the comparative analysis of unimodal experiments, this paper classifies the representative models differently from existing reviews and compares them along four aspects: whether global alignment is applied, whether a noise filtering strategy is applied, whether only the global structure is utilized, and whether global structure and local semantics are combined. The existing methods are also compared and analyzed.

(3) This paper follows the latest research trends in the field and details the advanced methods used for knowledge graph entity alignment. It not only introduces existing methods, but also proposes a series of research directions. In particular, the paper suggests combining other features such as video with textual information to achieve more accurate multimodal entity alignment. In addition, to improve the robustness and applicability of entity alignment techniques, the paper emphasizes the importance of constructing datasets that are close to real-world situations along multiple dimensions. The paper also proposes mapping the knowledge graph into more expressive vector spaces (e.g., complex spaces) to obtain better-quality entity embedding representations. Meanwhile, this paper proposes that spatial and temporal dimensions should be considered comprehensively to cope with the dynamic changes of the knowledge graph and enhance the generalization ability of the model. This paper provides a reference for advancing research in knowledge graph entity alignment, as well as for solving the challenges in real-world problems.

2 Preliminary

2.1 Knowledge graph EA problem description

A knowledge graph is a knowledge base that organizes data from a semantic perspective, and it is a general framework for describing formal semantic knowledge. A knowledge graph can be formalized as \(KG = (E, R, T)\), where E, R, and T represent entities, relations, and triples, respectively. A knowledge graph is a graph structure in which nodes represent entities and edges represent relationships. There are two types of triples. The first type is the relational triple, such as (Yuan_Longping, Birthplace, China). The other type is the attribute triple, e.g. (Yuan_Longping, Gender, “Male”). The task of knowledge graph EA is to find equivalent entities in two knowledge graphs (Sun et al. 2020), which is defined formally as:

$${Align}_{\text{ entity } }\left( K G_{1}, K G_{2}\right) =\left\{ \left( e_{1}, e_{2}\right) \mid e_{1} \in K G_{1}, e_{2} \in K G_{2}, e_{1} \sim e_{2}\right\} ,$$
(1)

where \(KG_{1}, KG_{2}\) denote two knowledge graphs, \(e_{1}, e_{2}\) denote entities, and \(\sim\) denotes the equivalence relation. Usually, the subset \({Align}_{\text{entity}}^{\prime }\left( KG_{1}, KG_{2}\right) \subset {Align}_{\text{entity}}\left( KG_{1}, KG_{2}\right)\) is called the seed set, which is known in advance and used as training data.
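To make the definition concrete, the following minimal sketch (with made-up toy triples) shows two knowledge graphs as triple sets, the full alignment set of Eq. (1), and the seed subset used for training:

```python
# Hypothetical toy data illustrating the EA setting of Eq. (1).
KG1 = {("Yuan_Longping", "Birthplace", "China"),
       ("Yuan_Longping", "Occupation", "Agronomist")}
KG2 = {("袁隆平", "出生地", "中国"),
       ("袁隆平", "职业", "农学家")}

# Full (gold) alignment: entity pairs referring to the same real-world objects.
align_entity = {("Yuan_Longping", "袁隆平"), ("China", "中国")}

# Seed alignment: the subset known in advance and used as training data.
seed = {("Yuan_Longping", "袁隆平")}
assert seed <= align_entity  # Align'_entity is a subset of Align_entity
```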

2.2 Data preprocessing

Data preprocessing operates on the data before the main alignment steps, in order to obtain more targeted data and improve the subsequent alignment process. Data preprocessing in the EA task can be divided into syntactic regularization and data regularization. Most EA algorithms, such as that of Zhu et al. (2021), perform alignment directly after simply organizing the data format and removing noisy data, while some other EA algorithms employ special data preprocessing. For example, Trisedya et al. (2019) first align predicates and then rename similar predicates uniformly so that relations and entities can be embedded into the same vector space. Chen et al. (2020c) use radial basis functions to encode continuous values. To mine the hidden information in the knowledge graph, Jiang et al. (2019) use logic rules to derive new triples and thus enrich the set of triples; such rule-based methods are generally divided into deductive reasoning and transfer rules. RpAlign (Huang et al. 2022) expands the training data using data augmentation to produce supervised triples across the knowledge graphs, which allows information to be exchanged between different knowledge graphs.

3 Related foundations

3.1 Translation model

The translation model represents relationships as vector translations in the embedding space. TransE (Bordes et al. 2013) is the representative of the translation model family and is widely used. Based on the vector representations of entities and relationships, TransE treats the relationship of a triple as a translation from the head entity to the tail entity. The purpose is to embed all entities and relations in the knowledge graph into a low-dimensional vector space. The energy function of the relation triple \(\left( e_{1}, r_{1}, e_{2}\right)\) is defined as:

$$\varphi \left( e_{1}, r_{1}, e_{2}\right) =\left\| e_{1}+r_{1}-e_{2}\right\| ,$$
(2)

where \(\Vert \cdot \Vert\) denotes the \(L_{1}\)-norm or \(L_{2}\)-norm of the vector. TransE has become the baseline standard for vectorized representation of knowledge graphs and has given rise to many variants, such as TransR (Lin et al. 2015a), TransC (Lv et al. 2018) and KG2E (He et al. 2015).
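As a concrete illustration, the short sketch below evaluates the TransE energy of Eq. (2) for a toy triple; the embedding values are randomly generated and purely illustrative.

```python
import numpy as np

def transe_energy(h, r, t, norm=1):
    """TransE energy of a triple (Eq. 2): ||h + r - t|| under the L1 or L2 norm."""
    return np.linalg.norm(h + r - t, ord=norm)

# Toy 4-dimensional embeddings (made-up values for illustration only).
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))
print(transe_energy(h, r, t, norm=1))  # a lower energy indicates a more plausible triple
```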

3.2 Deep model

Deep models use deep learning techniques to learn embedding representations; commonly used deep models include the graph neural network (GNN), the graph convolutional network (GCN), and the graph attention network (GAT).

GNN (Zhou et al. 2020) is based on an information propagation mechanism, where each node updates its state by exchanging information with its neighbors until the states reach a stable value. The goal of a GNN is to learn a state embedding \(h_{v}\) for each node, from which the final output can be obtained. The formula for \(h_{v}\) is as follows:

$$h_{v}=f\left( X_{v}, X_{co[v]}, h_{ne[v]}, X_{ne[v]}\right) ,$$
(3)

where \(f(\cdot )\) is a parameterized local transformation function that updates the current node state according to the states of neighboring nodes; \(X_{v}\) denotes the feature vector of node v; \(X_{co[v]}\) denotes the feature vectors of the edges incident to node v; \(h_{ne[v]}\) denotes the state vectors of node v’s neighboring nodes; \(X_{ne[v]}\) denotes the feature vectors of the neighboring nodes of node v.

GCN (Kipf and Welling 2017) consists of an input layer, propagation layers, and an output layer. When the knowledge graph is embedded into a low-dimensional vector space, entities are treated as nodes. GCN uses an activation function to iteratively update the neighbor node information, formulated as follows.

$$H^{(l+1)}=\sigma \left( D^{-\frac{1}{2}} {\widehat{A}} D^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) ,$$
(4)

where \({\widehat{A}}=A+I\), A is the adjacency matrix and I is the identity matrix; D is the degree matrix of \({\widehat{A}}\); H is the feature matrix of a layer, which equals X at the input layer and is called the hidden representation during propagation; W is the weight matrix; \(\sigma\) is the activation function.
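A minimal NumPy sketch of one propagation step of Eq. (4) is given below; the toy adjacency matrix, features, and weights are made up, and ReLU is used as the activation \(\sigma\).

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step (Eq. 4): H' = sigma(D^{-1/2} A_hat D^{-1/2} H W),
    with A_hat = A + I and D the degree matrix of A_hat."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^{-1/2} A_hat D^{-1/2}
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU as the activation sigma

# Toy graph with 3 entity nodes and 4-dimensional features (made-up values).
rng = np.random.default_rng(1)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H, W = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
print(gcn_layer(A, H, W).shape)  # (3, 4): updated entity representations
```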

GAT (Velickovic et al. 2018) introduces an attention mechanism to assign corresponding weights to neighboring nodes and obtains information about the whole network from local information. The attention coefficients \(a_{ij}\) are obtained by normalizing the attention scores \(e_{ij}\) over each neighborhood:

$$a_{i j}=\frac{\exp \left( \operatorname{LeakyReLU}\left( e_{i j}\right) \right) }{\sum _{k \in N_{i}} \exp \left( \operatorname{LeakyReLU}\left( e_{i k}\right) \right) }.$$
(5)

Using the computed attention coefficients, the features are weighted and summed to obtain the new features incorporating the neighborhood information.

$$h_{i}^{\prime }=\sigma \left( \sum _{j \in N_{i}} a_{i j} W \overrightarrow{h_{j}}\right) .$$
(6)

Generally, GAT also uses multi-head attention, concatenating or averaging the heads, to enhance the capacity of the model and stabilize the training process.
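The sketch below implements a single attention head following Eqs. (5)-(6); the scoring form \(e_{ij}=a^{\top}[Wh_i \Vert Wh_j]\) and the toy graph are assumptions for illustration, and tanh is used as the activation \(\sigma\).

```python
import numpy as np

def gat_layer(A, H, W, a):
    """Single-head GAT aggregation: LeakyReLU-scored attention normalized over
    each neighborhood (Eq. 5), then a weighted sum of projected features (Eq. 6)."""
    def leaky_relu(x, slope=0.2):
        return np.where(x > 0, x, slope * x)

    Wh = H @ W                                   # project node features
    H_new = np.zeros_like(Wh)
    for i in range(A.shape[0]):
        neigh = np.where(A[i] > 0)[0]            # N_i: neighbors of node i
        e = leaky_relu(np.array([a @ np.concatenate([Wh[i], Wh[j]]) for j in neigh]))
        alpha = np.exp(e) / np.exp(e).sum()      # Eq. (5): attention coefficients a_ij
        H_new[i] = np.tanh((alpha[:, None] * Wh[neigh]).sum(axis=0))  # Eq. (6)
    return H_new

# Made-up toy graph: 3 fully connected nodes with 4-dimensional features.
rng = np.random.default_rng(2)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H, W, a = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), rng.normal(size=8)
print(gat_layer(A, H, W, a).shape)
```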

3.3 Semantic matching model

The semantic matching model calculates the similarity from the semantic level based on entities and relations in the vector space. Semantic matching models include RESCAL (Nickel et al. 2011), DistMult (Yang et al. 2015) and MLP (Multi-Layer Perceptron, Dong et al. 2014), etc.

RESCAL associates each entity with a vector to capture its latent semantics. Relationships are represented as matrices that model the pairwise interactions between latent factors, with the following score function.

$$f_{r}(h, t)={\textbf{h}}^{\top } {\textbf{M}}_{r} {\textbf{t}}=\sum _{i=0}^{d-1} \sum _{j=0}^{d-1}\left[ {\textbf{M}}_{r}\right] _{i j} \cdot [{\textbf{h}}]_{i} \cdot [{\textbf{t}}]_{j},$$
(7)

where \({\textbf{h}}, {\textbf{t}} \in {\mathbb {R}}^{d}\) are the vector representations of the head and tail entities and \({\textbf{M}}_{r} \in {\mathbb {R}}^{d \times d}\) is the matrix associated with the relationship.

DistMult simplifies RESCAL by restricting \({\textbf{M}}_{r}\) to the diagonal matrix. For each relation r, it introduces a vector embedding \({\textbf{r}} \in {\mathbb {R}}^{d}\) and requires that \({\textbf{M}}_{r}={\text { diag}}({\textbf{r}})\).

$$f_{r}(h, t)={\textbf{h}}^{\top } {\text {diag}}({\textbf{r}}) {\textbf{t}}=\sum _{i=0}^{d-1}[{\textbf{r}}]_{i} \cdot [{\textbf{h}}]_{i} \cdot [{\textbf{t}}]_{i}.$$
(8)

MLP is relatively simple: each relation (and entity) is associated with a vector; the vectors h, r, and t are combined in the input layer and mapped to a nonlinear hidden layer. The scoring function is as follows.

$$f_{r}(h, t)={\textbf{w}}^{\top } \tanh \left( {\textbf{M}}^{1} {\textbf{h}}+{\textbf{M}}^{2} {\textbf{r}}+{\textbf{M}}^{3} {\textbf{t}}\right) ,$$
(9)

where \({\textbf{M}}^{1}, {\textbf{M}}^{2}, {\textbf{M}}^{3} \in {\mathbb {R}}^{d \times d}\) are the first-layer weights and \({\textbf{w}} \in {\mathbb {R}}^{d}\) are the second-layer weights, all of which are shared across relations.
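For comparison, the sketch below evaluates the three score functions of Eqs. (7)-(9) on the same toy embeddings; all parameter values are made up for illustration.

```python
import numpy as np

def rescal_score(h, M_r, t):
    """RESCAL (Eq. 7): h^T M_r t with a full relation matrix M_r."""
    return h @ M_r @ t

def distmult_score(h, r, t):
    """DistMult (Eq. 8): RESCAL with M_r restricted to diag(r)."""
    return np.sum(h * r * t)

def mlp_score(h, r, t, M1, M2, M3, w):
    """MLP (Eq. 9): project h, r, t, apply tanh, then a linear output layer."""
    return w @ np.tanh(M1 @ h + M2 @ r + M3 @ t)

# Toy 4-dimensional embeddings and parameters (made-up values).
rng = np.random.default_rng(3)
d = 4
h, r, t, w = rng.normal(size=(4, d))
M_r, M1, M2, M3 = rng.normal(size=(4, d, d))
print(rescal_score(h, M_r, t), distmult_score(h, r, t), mlp_score(h, r, t, M1, M2, M3, w))
```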

3.4 Random walk

Random walk (RW) methods learn node embeddings by generating node sequences. Nodes that co-occur on random walk paths in the graph are expected to have similar embeddings. When sampling paths over a knowledge graph, the generated sequences are cross-combinations of nodes and relations (Chen et al. 2020e). DeepWalk (Perozzi et al. 2014) and node2vec (Grover and Leskovec 2016) are pioneering works that introduce deep learning techniques into network analysis to learn node embeddings. When node2vec is applied to the knowledge graph, the transition probability of reaching the next entity is calculated as follows.

$$P\left( e_{i+1} \mid e_{i}\right) =\left\{ \begin{array}{ll} \alpha _{p q}\left( e_{i}, e_{i+1}\right) \cdot w &{} \exists r \in R,\left( e_{i}, r, e_{i+1}\right) \in T, \\ 0 &{} \text{ otherwise } , \end{array}\right.$$
(10)

where \(e_{i}\) is the ith entity in a walk, for which the next entity \(e_{i+1}\) must be decided; if there is a relationship r between \(e_{i}\) and \(e_{i+1}\), the transition probability from \(e_{i}\) to \(e_{i+1}\) is evaluated, \(\alpha _{pq}\) is the node2vec search bias controlled by the return parameter p and the in-out parameter q, and w is the edge weight between entities \(e_{i}\) and \(e_{i+1}\).
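The sketch below shows how the biased transition distribution of Eq. (10) is commonly computed in node2vec-style walks; the toy neighborhood structure and the parameter values are made up for illustration.

```python
def transition_probs(e_i, prev, neighbors, weights, p=1.0, q=1.0):
    """Normalized node2vec-style transition probabilities from entity e_i (Eq. 10):
    alpha_pq(e_i, e_next) * w for every candidate connected by some relation,
    where alpha depends on the graph distance between the candidate and the
    previous node of the walk."""
    probs = {}
    for nxt in neighbors[e_i]:
        w = weights.get((e_i, nxt), 1.0)
        if nxt == prev:                      # return to the previous node
            alpha = 1.0 / p
        elif nxt in neighbors[prev]:         # stay close to the previous node
            alpha = 1.0
        else:                                # move outward
            alpha = 1.0 / q
        probs[nxt] = alpha * w
    z = sum(probs.values())
    return {e: v / z for e, v in probs.items()}

# Hypothetical toy neighborhood structure (entities connected by some relation).
neighbors = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
print(transition_probs("B", prev="A", neighbors=neighbors, weights={}, p=0.5, q=2.0))
```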

3.5 Multimodal knowledge graph

Knowledge graph techniques have been widely used to deal with structured and textual data, but relatively little attention has been paid to unstructured data such as images, and few effective technical means exist to extract structured knowledge from image data. Therefore, multimodal knowledge graphs are proposed to describe entities under multiple modalities (e.g., the image modality). Multimodal knowledge graphs can provide sufficient visual information for entities, thus allowing EA on a larger scale. Ultimately, multimodal knowledge graphs enable existing models to perform better because text and image features are considered together.

Although multimodal data are heterogeneous in their underlying representations, different modal data of the same entity are unified in their high-level semantics. Therefore, fusing multimodal data is helpful for language representation models. There are still few studies on multimodal knowledge graphs; several important open-source multimodal knowledge graphs include IMGpedia (Ferrada et al. 2017), MMKG (Liu et al. 2019), and Richpedia (Wang et al. 2019).

4 Representation learning-based EA framework

As shown in Fig. 1, we design a typical representation learning-based knowledge graph EA framework. The framework includes an information aggregation module, an alignment module, and a post-alignment module. When aligning entities, two knowledge graphs are first input and seed data are collected for training. Because the quality of the original data directly affects the final alignment results, the input data are often preprocessed.

Fig. 1 A basic framework for EA based on representation learning

In the information aggregation module, the embedding representation first needs to be initialized, generally by random initialization or entity name-based initialization. Based on the initial embedding, the global structure embedding part updates the entity embeddings using translation-family models, deep models, or path sequence models. The topological connections of the knowledge graph provide only global structure information, while the local semantic information of entities, such as relations, attributes, summaries, contexts, names, types, ontologies, and images, also has a positive impact on EA. Therefore, many methods fuse local semantic information to improve the alignment. The effect of the model can be enhanced by iterative co-training between global structural embedding and local semantic information, or by integrating global and local information to enrich entity features. The information aggregation module produces the final entity embeddings, which serve as input to the alignment module.

In the alignment module, the embeddings of the source and target knowledge graphs are first unified into one vector space using a combination method. Then the distance between source and target entity vectors is calculated based on the final embedding representations of the entities. Commonly used metrics include Euclidean distance, Manhattan distance, Cosine distance, Cross-Domain Similarity Local Scaling (CSLS), and Edit distance. After the distance calculation between entity vectors, the entity similarity matrix is obtained. Some studies design optimization strategies and non-alignable entity prediction to improve the accuracy of alignment. Alignment inference then follows either a global or a local strategy. Finally, the alignment module outputs the alignment result.

In the post-alignment module, semi-supervised strategies are used to iteratively generate new seed pairs and expand the size of the seed set.

5 Information aggregation module

5.1 Embedding initialization method

5.1.1 Random initialization

Most current EA methods, such as GCN-Align (Wang et al. 2018), BootEA (Sun et al. 2018) and COTSAE (Yang et al. 2020), depend on the graph structure to form the initial entity vectors. The goal of graph embedding is to obtain a low-dimensional vector representation of a high-dimensional graph, and the structure embedding needs to be initialized before global structure embedding is performed. Random initialization of entity embeddings is the easiest and most convenient option, but it may lead to local optima and produce low-quality embeddings. In real-world knowledge graphs, most entities have low node degrees and little structural information, so using only structural information to initialize entity embeddings may limit the effectiveness of EA models.

5.1.2 Vector initialization based on entity names

The entity name is considered a special attribute that is independent of the node degree, and it is an important clue in determining whether two entities are equivalent. If entity names are available in the KG, the entity name vector can be used as the initial feature vector of the entity. GMNN (Xu et al. 2019) uses a word-based LSTM to convert the entity name into its initial feature vector. To better initialize the model, RDGCN (Wu et al. 2019a), UED (Luo and Yu 2022), RNM (Zhu et al. 2021), EAMI (Zhu et al. 2023) and RAGA (Zhu et al. 2021) use GloVe embeddings of entity names for model initialization. They translate non-English entity names into English via Google Translate and initialize entity features with pre-trained English word vectors.
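A minimal sketch of this kind of name-based initialization is shown below: each entity feature is the average of the pre-trained word vectors of its (translated) name tokens. The vocabulary here is a made-up stub; in practice the vectors would be loaded from a GloVe file.

```python
import numpy as np

def init_entity_features(entity_names, word_vectors, dim=300):
    """Initialize entity features by averaging the pre-trained word vectors of
    the tokens in each entity name; names with no known token stay all-zero."""
    feats = np.zeros((len(entity_names), dim))
    for i, name in enumerate(entity_names):
        vecs = [word_vectors[w] for w in name.lower().split("_") if w in word_vectors]
        if vecs:
            feats[i] = np.mean(vecs, axis=0)
    return feats

# Made-up stub standing in for pre-trained GloVe vectors.
word_vectors = {"yuan": np.ones(300), "longping": np.full(300, 0.5)}
X0 = init_entity_features(["Yuan_Longping", "China"], word_vectors)
print(X0.shape)  # the initial feature matrix fed to the structure embedding model
```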

5.2 Global structure embedding method

In recent years, structure embedding methods for EA have mainly been divided into three categories: structural embedding methods based on translation models, on deep models, and on path sequence models. The characteristics of the three embedding methods are compared in Table 1.

Table 1 Comparative analysis of global structure embedding methods

5.2.1 Structural embedding based on translation model

Early EA approaches based on representation learning rely on TransE to capture the structural information of the knowledge graph. They directly use TransE for structural embedding, modeling in-graph relationships and treating relationships as translation vectors between entities; examples include COTSAE (Yang et al. 2020), MEEA (Chen et al. 2021b), DAEA (Sun et al. 2020), MMEA (Chen et al. 2020c), and JSAE (Munne and Ichise 2020).

Some approaches improve TransE. For example, Trisedya et al. (2019) add weights to TransE, which enables aligned triples to receive higher attention and improves the alignment effect. Because the contribution of different neighbors to EA varies, NAEA (Zhu et al. 2019) uses a neighborhood-aware attention mechanism on top of TransE to aggregate entity neighbors of different importance and obtain a neighborhood-level entity representation. RTEA (Jiang et al. 2019) combines a string similarity-based approach with an embedding-based approach represented by TransE to refine the structure embedding. Based on RTEA, ESEA (Jiang et al. 2022a) uses an embedding-based model to filter weakly correlated entities and then explores the final alignment using a symbol-based approach. Based on TransE, AMKE (Shen et al. 2022) sets different margin hyperparameters for different relations and adapts the learned margin parameter.

Variants of TransE are also used to embed structures. TransE only represents one-hop relationships between entities, ignoring important multi-hop relationship information, and its modeling of complex relationships is limited, so IPTransE (Zhu et al. 2017) uses PTransE (Lin et al. 2015b) for structural embedding. Ps-TransC (Kang et al. 2020) uses TransC for structure embedding; it divides the knowledge graph into an ontology layer and an instance layer, where entities in the ontology layer are considered as classes. TransC models all the triples of each class as a sphere, and all the instances of that class are contained within the sphere.

5.2.2 Structural embedding based on deep model

Although translation models transmit the information of neighbors to the central entity, the neighbors are only captured implicitly. Therefore, to fuse the information of neighbors into the entity embedding, structural embedding methods based on the GNN family of models have been proposed. These methods consider complex parameters and relations, so they can learn more expressive embeddings (Yan et al. 2020). Deep methods usually stack more than two GNN layers to learn entity representations, where the nodes of the first GNN layer are randomly initialized and the node representations of the last GNN layer are the final entity representations. Representative models for structure embedding based on deep models include REA (Pei et al. 2020) and HyperKA (Sun et al. 2020).

GCN follows a neighborhood aggregation scheme that iteratively updates the representation of each entity node. Representative works include RNM (Zhu et al. 2021), NMN (Wu et al. 2020), and the work of Xiong and Gao (2019). The number of GCN layers has an impact on EA: HMAN (Yang et al. 2019) stacks multiple layers of GCNs to collect multi-hop neighbor information. Directly using multi-layer GCNs to aggregate information leads to the propagation of noisy information, so AliNet (Sun et al. 2020) uses a gating mechanism to aggregate multi-hop neighbor information. SSP (Nie et al. 2020) and HGCN (Wu et al. 2019b) use GCN to explicitly encode structure information, and highway gates are used to control the amount of neighborhood information passed to nodes. Tam et al. (2021) adjust the number of GCN layers to limit the noise propagated from previous layers as well as topology loss. To address the over-smoothing problem caused by an increasing number of GCN layers, RAC (Zeng et al. 2021) uses both approximate personalized propagation of neural predictions and GCN models to capture structure information. In addition to the over-smoothing problem, EchoEA (Lin et al. 2021) also addresses the overfitting problem by introducing a four-level (entity-level, feature-level, entity-to-relationship, and relationship-to-entity) attention mechanism to further encode entity features.

GAT is also used to learn global structural embeddings of knowledge graphs. To effectively utilize pre-aligned links in the knowledge graph, CAECGAT (Xie et al. 2020) and DuGa-DIT (Xie et al. 2022) share cross-knowledge-graph entity embeddings and update the embeddings using a gate mechanism. By stacking multiple attention layers, the models can learn multi-hop information. In addition, TTEA (Zhang et al. 2023) uses GAT in the last part of the model to re-aggregate the information of neighbors.

5.2.3 Structural embedding based on path sequence model

Translation models and deep models do not fully exploit the long-term structural dependencies among entities and suffer from low expressiveness and inefficient information dissemination. To better explore the structural information among entities, SAEA (Chen et al. 2020e) designs a degree-aware random walk method to generate heterogeneous sequence data and capture the long-term structural dependencies among entities. Deep paths carry more relational dependencies than single triples, and cross-knowledge-graph paths are used as bridges between knowledge graphs to transfer information. RSNs (Guo et al. 2019) and the work of Chen et al. (2020f) apply a biased random walk path sampling method to effectively explore deep and cross-KG relational paths for embedding learning.

5.3 Combination of global structure and local semantic

5.3.1 Collaborative training

In the EA framework, different modules can be trained collaboratively, with a positive influence between modules. The first category is the co-training of relationship alignment and entity alignment. For example, RNM (Zhu et al. 2021) adds the relationship information between entities to the neighbor matching model and designs a semi-supervised framework so that entity alignment and relationship alignment can enhance each other. HGCN (Wu et al. 2019b) first uses the entity embeddings learned by GCN to approximate the relationship representation; the relational representation is then merged into the entities to iteratively learn better representations.

Some models iteratively perform attribute alignment and entity alignment, for example IMUSE (He et al. 2019), COTSAE (Yang et al. 2020), and NovEA (Sun et al. 2020). In each iteration, IMUSE first performs EA based on attribute values to build a matching set of entity pairs and then performs attribute alignment to build a matching set of attribute pairs. COTSAE learns entity embeddings using a collaborative training framework that alternates between a TransE component and a pseudo-Siamese network. NovEA assumes that all common attributes of two entities have the same weight and uses the aligned entities for attribute alignment; when two entities have no common attributes, aligned entity pairs can be used to find more possible aligned attribute pairs.

In EA methods that apply adversarial learning, the generation and discrimination modules are trained collaboratively. REA (Pei et al. 2020) first trains the noise-aware module to update the entity embeddings, and then uses the learned embeddings to optimize the noise-detection module; the trust scores provided by the noise detection module are fed back in the next iteration to train the noise-aware entity alignment. SEA (Pei et al. 2019b) uses an adversarial training model to iteratively refine the knowledge graph embedding with respect to entity degree differences; the iterative training stops considering the effect of degree on embeddings when the discriminator can no longer distinguish entities based on degree information.

The knowledge graph completion module and the entity alignment module can also iterate over each other. ALIGNKGC (Singh et al. 2021) uses ComplEx to initialize the knowledge graph completion task and define the triple scores, which ensures that two aligned entities share the same embedding vector. Entity alignment allows the knowledge graph to obtain more facts, and high-confidence completion predictions in turn facilitate EA.

5.3.2 Integration

Global structure and local semantics complement each other, and combining them usually provides additional help in obtaining a better entity representation. Integrating multiple knowledge representations using vector concatenation enhances the complementarity of different information, thus improving the accuracy of EA tasks. For example, the unimodal EA model GCN-Align (Wang et al. 2018) concatenates entity (structure) embeddings and attribute embeddings according to weights. FuzzyEA (Jiang et al. 2022b) fuses structural embeddings and local semantic embeddings based on Dempster’s combination rule. The multimodal EA model MMEA (Chen et al. 2020c) migrates the embeddings of multimodal knowledge, including relational, visual, and numerical data, from separate spaces into a common space and sets a proportional hyperparameter for each type of knowledge.

In addition, some methods fuse global and local information at the matrix level. For example, CEA (Zeng et al. 2020) and CUEA (Zhao et al. 2022) first compute the global similarity matrix based on embeddings, then compute the local name-based semantic similarity matrix, and finally combine the two by a weighted sum to fuse the global structure and local semantic information.
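A minimal sketch of this matrix-level fusion is given below; the weight \(\alpha\) and the toy similarity matrices are made-up illustrations rather than the settings of any specific model.

```python
import numpy as np

def fuse_similarity(sim_struct, sim_name, alpha=0.7):
    """Matrix-level fusion: weighted sum of a structure-based similarity matrix
    and a name-based semantic similarity matrix."""
    return alpha * sim_struct + (1.0 - alpha) * sim_name

# Toy 3x3 similarity matrices between source and target entities (made-up values).
rng = np.random.default_rng(4)
sim_struct, sim_name = rng.random((2, 3, 3))
fused = fuse_similarity(sim_struct, sim_name, alpha=0.7)
print(fused.argmax(axis=1))  # best target for each source entity under the fused similarity
```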

5.4 Local semantic information

The local semantic information incorporated into entities can complement the structural embedding of the knowledge graph and benefit EA. In terms of its form, the local semantic information of entities mainly includes relations, attributes, entity summaries/descriptions, contexts, entity names, types, ontologies, and images. Tables 2 and 3 show the advantages and disadvantages of unimodal and multimodal local semantic information, respectively.

Table 2 Comparative analysis of various unimodal local semantic information
Table 3 Comparative analysis of various multimodal local semantic information

5.4.1 Incorporating relational semantic

To accurately disambiguate entities with similar structures, relational semantics can be used to refine the structure-based representation so that similar entities can be distinguished. The relations in relational triples are connected by head and tail entities, so a relational embedding can be approximated by combining the averaged representations of its head and tail entities; representative models include HGCN (Wu et al. 2019b), RDGCN (Wu et al. 2019a), RNM (Zhu et al. 2021), and AVR-GCN (Ye et al. 2019). RREA (Mao et al. 2020a) uses multilayer neural networks to learn entity embeddings: for different relationship types, the same entity is embedded in different relationship spaces, and the embeddings of the same entity in different relationship spaces are then aggregated into one entity embedding. The diversity of relationship structures poses challenges for relationship representation, so SREA (Zhang et al. 2024) constructs weighted line graphs to model diverse relational structures and learns relational representations independently of entities.

Structural embedding models based on TransE or its variants are unsuitable for encoding multi-mapping relationships; for example, a movie made by a director has multiple actors. For this reason, Shi and Xiao (2019) calibrate the embeddings of different KGs with a small set of pre-aligned seeds to encode multi-mapping relations via dot product scaling. Contextualized relational representation improves on the above approach by arguing that relations occurring in different entity contexts should have different embeddings, regardless of whether they have the same surface form. For example, SSP (Nie et al. 2020) computes relational embeddings based on adjacent entities and the relations themselves. The approach is intuitive, and SSP captures the semantic differences between relations even if they have the same surface form but occur in different contexts. RpAlign (Huang et al. 2022) treats relationships as rotational operations between entities and can handle three relationship patterns: symmetry/antisymmetry, inversion, and composition; RpAlign can thus learn hybrid knowledge graph embeddings.

Directed edges force adjacent information to accumulate only along the direction of flow, so some studies, such as MRAEA (Mao et al. 2020b) and SHEA (Yan et al. 2021), create inverse relations. ESEA (Jiang et al. 2022a) uses a symbol-based method to align relationships, and the relationship seeds further affect the alignment of multiple entities.

5.4.2 Incorporating entity attribute

Knowledge graphs contain attribute triples that can provide valid information for EA. Sun et al. (2017) use the idea of Skip-gram to predict attribute relevance and refine it by clustering entities with higher attribute relevance. Zhang et al. (2017) define different feature functions based on different features, formally showing the correlation between attributes and discovering more attribute mappings.

Not all similarities between attributes are beneficial for detecting aligned entities. Therefore, to automatically find useful attributes for EA, EPEA (Wang et al. 2020) uses a CNN model to encode the sparse similarity matrix into a short, dense vector that captures the attribute similarity of two entities. AttrGNN (Liu et al. 2020) divides the knowledge graph into four subgraphs according to attribute value categories (name attribute, text attribute, numeric attribute, and no attribute) and uses Bidirectional Encoder Representations from Transformers (BERT) to encode the attribute values. Tang et al. (2020) similarly use BERT to encode entity attribute values and compute the similarity matrix.

A simpler way to apply attribute information is to treat attribute triples in the same way as relationship triples. Haihong et al. (2020) and Trisedya et al. (2019) use TransE to learn attribute embeddings and then jointly use attribute embeddings and relational triple embeddings for alignment. EASA (Huang and Luo 2020) generates semantic aggregations of entities from different attributes and attribute values, and adds attribute attention to distinguish the different roles of different attributes during EA. Wang et al. (2018) and Liu et al. (2021) use GCN to generate a structural feature vector and an attribute feature vector for each entity, where Wang et al. (2018) use one-hot encoding of the most frequently occurring attributes of each entity as the attribute feature vector and then combine it with the structural feature vector for EA. However, selecting the most frequently occurring attributes leads to too little differentiation among entities, so Pang et al. (2019) discard the most frequent attributes to ensure both differentiation among entities and that the selected entities are not long-tailed entities. Considering that the distance between attributes and attribute values affects the performance of EA, MultiKE (Zhang et al. 2019) integrates attributes and attribute values into the same matrix when processing attribute information and then feeds it into a CNN for feature extraction.

The number of entity attributes also contributes to EA to some extent. He et al. (2019) measure entity similarity by counting the number of identical attributes between entities. Sorting attributes by their counts is equivalent to setting weights for them, so Xiong and Gao (2019) arrange attributes in descending order of their counts to improve the embedding of attribute information. Similarly, Yang et al. (2019) use the E-CBOW model for embedding attribute information together with an attention mechanism. The influence of different attributes on EA may differ significantly, so Yang et al. (2020) propose a joint attention method that calculates the attention of attribute values using attribute types, which share an attention weight with their attribute values, and captures the forward and reverse sequence information of attribute values using a Bi-GRU. The self-attention mechanism plays an important role in distinguishing similar entities; CG-MuAlign (Zhu et al. 2020) and LinkNBed (Trivedi et al. 2018) also use attention mechanisms, where LinkNBed first initializes attribute embeddings, then aggregates related embedding vectors to enrich entity and relationship embeddings through attention, and finally captures the relational interaction information between two entities using the entity and relationship embedding representations.

An EA model needs to differentiate candidate target entities when searching for the target entity of a source entity. Yan et al. (2020) learn entity topics from attributes through BTM4EA, which uses high-level entity semantics for attribute modeling to filter out weakly related entities. In addition, some scholars automatically generate optimal attributes based on data features to constrain the results of attribute triple alignment; for example, NovEA (Sun et al. 2020) selects optimal attributes as candidate values based on decision trees. Guan et al. (2019) apply a probabilistic model to iteratively update the embeddings of attributes and attribute values when learning from attribute triples.

5.4.3 Incorporating entity summary/description

Many entities do not have attribute values, and summary embeddings can be used to reduce such discrepancies. Wikidata (Vrandecic and Krötzsch 2014) provides a summary text description of each entity, containing basic information about the entity. Wang et al. (2018) use the first paragraph of article data as entity descriptions, using external resources to enrich the entity embedding.

Munne and Ichise (2020), Yang et al. (2019) and EASAE (Munne and Ichise 2023) use BERT to generate a set of word vectors from the summary of each particular entity to obtain entity embeddings. Chen et al. (2018) use a multilingual word embedding pre-training corpus, and convert each entity description into a vector sequence that is input to the description encoder. A GRU incorporating a self-attentive mechanism is used to highlight sentence parts with important shared information and output the final description embedding representation.

In addition to applying a single embedding method, Xu et al. (2020) propose two text embedding models to embed the description of each entity. The Cross-TextGCN model uses GCN to encode the entities by transferring semantics between words and entities in the knowledge graph. The Cross-TextMatch model uses BiLSTM to encode entity descriptions.

5.4.4 Incorporating entity context

Entity context contains a large amount of information related to entities and relationships in the knowledge graph, with clear information sources and little noise, so fusing entity context information can enhance the knowledge representation learning ability. Yang et al. (2019) utilize contextual information to enhance the accuracy of EA and add Jaccard coefficients to strengthen the logic of contextual information. The contexts of two equivalent entities are usually similar, and the stronger the contextual association of a neighbor entity with the central entity, the more alignment cues this neighbor may provide. Therefore, Wang et al. (2018) use the same encoder to independently embed each context and then generate the context vector. TransEdge (Sun et al. 2019) studies multiple relationships and uses contextual projections to optimize the EA task under the same relationship type, facilitating the propagation of information in the graph; it extends the relationship representation based on the TransE embedding structure while also using contextual projections to refine the embedding. Given that the TransE model cannot capture neighbor information, FuAlign (Wang et al. 2023) proposes a message propagation scheme to aggregate contextual information between an entity and its neighbors. DAEA (Zhang et al. 2021) generates multiple random walks for each entity to be aligned to capture its 10-hop neighborhood information and long sequence context to guide EA. JEANS (Chen et al. 2020a) performs a grounding process that links entities and text tokens in the same language to a shared vocabulary and thus discovers enough entity contexts for EA. While the above approaches are based on the contexts of neighbors or paths, IMEA (Xin et al. 2022) utilizes two Transformers to encode multiple contexts, including neighborhood subgraphs and paths.

5.4.5 Incorporating entity name

Given two entities, comparing their names is the simplest way to determine whether they are identical. Entity name embeddings can be used to initialize the feature matrix, or they can serve as information enhancement signals for EA. Representing entity names with average word vectors makes them easy to use; for example, Zeng et al. (2020) use a weighted average word embedding to represent the semantic information of entity names and integrate the name features with the separately learned structural information at the similarity-matrix level. Although representing entity names as averaged word vectors is convenient, the averaging process inevitably causes a certain degree of semantic loss and thus cannot fully represent the semantic information of entity names. For this reason, Wei-Xin Zeng et al. (2020) propose a re-ranking model based on word mover's distance: on the generated entity ranking results, the word mover's distance model is used to further mine entity name information and combine it with structural information. To avoid the out-of-vocabulary problem, COEA (Lin et al. 2023) combines word-level embeddings and character embeddings to perform entity alignment.

5.4.6 Incorporating type information

When two KGs differ in structural sparsity and domain features, significant alignment errors can occur. Entity type information helps resolve some ambiguity and vagueness issues. Therefore, JTMEA (Lu et al. 2021) combines the similarity of entity vectors with entity type matching, in which type features are first extracted from entities of the same type and type matching constraints are then applied to the comparison of candidate aligned entities. To make full use of entity type information, JETEA (Song et al. 2021) utilizes an encoding function to obtain the type features of entities for type matching, and the common features of entities are extracted as the representation of the type information. TypeEA (Ge et al. 2023) considers entity type information for entity alignment and proposes a semantic matching-based type embedding model that utilizes a bilinear product score function to capture associations between types. To account for the diversity of entity roles, TTEA (Zhang et al. 2023) uses triple-aware entity augmentation to model the diverse roles of triple elements, using a nonlinear mapping to generate type embeddings from semantic embeddings.

5.4.7 Incorporating ontology information

Incorporating ontology information contributes to solving semantic heterogeneity problems and also enhances the generality and extensibility of entity alignment. OntoEA (Xiang et al. 2021) claims to be the first to perform entity alignment by combining ontology information with embedding techniques, utilizing the relative positions of classes and their membership relations with entities. In addition, OTIEA (Zhang et al. 2023) uses an attention mechanism and designs an ontology-pair enhancement approach in the encoding process to capture complex intrinsic correlations through ontology information, while complementing the semantic triples with ontology information and introducing entity role features in the decoder.

5.4.8 Incorporating image

The relational structure information in knowledge graphs may lead to ambiguity. Image features correspond to a unified visual concept, so images can be a good source of EA information.

To extract visual features, MMEA (Chen et al. 2020c) and ACK-MMEA (Li et al. 2023) vectorize images and learn image embeddings using the VGG16 model, in preparation for subsequent multimodal knowledge fusion. To establish a direct linkage with text entities, ITMEA (Wang et al. 2020) also uses VGGNet for image feature projection, mapping 4096-dimensional image feature vectors into n-dimensional entity embedding vectors. EVA (Liu et al. 2021) uses ResNet-152 as the feature extractor for all images: for each image, a forward pass is done and the output of the last layer is taken as the image representation, which is then sent through a trainable feedforward layer to produce the final image embedding. HMEA (Guo et al. 2021) models and integrates multimodal information in hyperbolic space and uses DenseNet to learn image embeddings. IKRL (Xie et al. 2017) uses attention to construct image-based representations that jointly consider all image instances of each entity. PoE (Liu et al. 2019) combines multimodal features and measures the plausibility of facts by matching the underlying semantics of entities and mining the relationships contained in the embedding space; entity embeddings are learned while computing the fact scores under each modality. To allow the visual encoder to have different receptive fields and adapt to images from different domains, PSNEA (Ni et al. 2023) utilizes an Inception-based network to extract the visual features of entities. PCMEA (Wang et al. 2024) uses a pretrained visual model (PVM), whose output is passed through a forward propagation layer to obtain the visual embedding representation.

6 Entity alignment module

6.1 Combination method

Embedding-based EA needs to use the distance between entity vectors to determine the probability of alignment. Therefore, different knowledge graphs must be embedded into a unified vector space. There are two general methods to reconcile knowledge graph embeddings.

(1) Transformation

Transformation embeds the knowledge graphs into different vector spaces and transforms the embedding of one knowledge graph into the vector space of the other using a linear transformation matrix.

(2) Sharing

There are three ways to achieve sharing: (a) letting the seed entity pairs in the knowledge graphs share the same embedding when creating the model (Sun et al. 2017); (b) using pre-aligned entity pairs to generate new cross-knowledge-graph triples that serve as bridges between different knowledge graphs, e.g., given a seed entity pair (h1, h2) and a triple (h1, r, t1), the swapping method generates a new triple (h2, r, t1) (Mao et al. 2020a), as sketched below; (c) directly minimizing the distance between the vectors of each pre-aligned entity pair (Yan et al. 2020).
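A minimal sketch of the triple-swapping idea in (b) is given below; the triple and seed pair are the hypothetical ones from the text.

```python
def swap_triples(triples, seed_pairs):
    """Sharing via triple swapping: for each seed pair (h1, h2), every triple
    containing h1 (as head or tail) yields a new triple with h1 replaced by h2,
    bridging the two knowledge graphs."""
    seed = dict(seed_pairs)
    new_triples = set()
    for h, r, t in triples:
        if h in seed:
            new_triples.add((seed[h], r, t))
        if t in seed:
            new_triples.add((h, r, seed[t]))
    return new_triples

# The example from the text: seed pair (h1, h2) and triple (h1, r, t1).
print(swap_triples({("h1", "r", "t1")}, {("h1", "h2")}))  # {('h2', 'r', 't1')}
```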

6.2 Similarity metric

In the process of EA, the similarity between entities needs to be measured, and the common similarity measures are as follows.

(1) Euclidean distance

Euclidean distance is the distance between two points in a multidimensional space. The formula for calculating the Euclidean distance between two points \((x_{1},y_{1})\) and \((x_{2},y_{2})\) in the two-dimensional plane is defined as:

$$d=\sqrt{\left( x_{1}-x_{2}\right) ^{2}+\left( y_{1}-y_{2}\right) ^{2}}.$$
(11)

(2) Manhattan distance

Manhattan distance is the sum of the absolute differences between the coordinates of two points in the standard coordinate system, and is defined as:

$$d=\left| x_{1}-x_{2}\right| +\left| y_{1}-y_{2}\right| .$$
(12)

(3) Cosine distance

Cosine distance measures the difference between two vectors using the cosine of the angle between them in vector space. The closer the cosine is to 1, the closer the angle is to 0 degrees, i.e., the more similar the two vectors are. The formula is defined as:

$$d(A, B)=\frac{A \cdot B}{\Vert A\Vert \times \Vert B\Vert }=\frac{\sum _{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum _{i=1}^{n} A_{i}^{2}} \times \sqrt{\sum _{i=1}^{n} B_{i}^{2}}}.$$
(13)

(4) Cross-Domain similarity local scaling (CSLS)

CSLS deals with the hubness phenomenon in high-dimensional space, i.e., the existence of dense regions in vector space where some points are the nearest neighbors of many other points. Whereas the previous approaches use cosine distance to select nearest neighbors directly, CSLS is calculated as:

$${\text {CSLS}}\left( W x_{s}, y_{t}\right) =2 \cos \left( W x_{s}, y_{t}\right) -r_{\textrm{T}}\left( W x_{s}\right) -r_{\textrm{S}}\left( y_{t}\right) ,$$
(14)

where \(r_{\textrm{T}}\left( W x_{s}\right)\) is the average cosine similarity between \(W x_{s}\) and its K nearest neighbors in the target language, and \(r_{\textrm{S}}\left( y_{t}\right)\) is defined analogously on the source side.

(5) Edit distance

Some works use Edit distance to calculate the similarity of strings, such as entity name strings. Edit distance measures the difference between two character sequences: the Edit distance between two words is the minimum number of single-character edit operations (insertion, deletion, or replacement) required to convert one word into the other, and it is defined as:

$$\begin{aligned} d[i, j]= & {} \min \left\{ \begin{array}{l} d[i, j-1]+1, \\ d[i-1, j]+1, \\ d[i-1, j-1]+c\left( s_{1}[i], s_{2}[j]\right) , \end{array}\right. \end{aligned}$$
(15)
$$\begin{aligned} c\left( s_{1}[i], s_{2}[j]\right)= & {} \left\{ \begin{array}{l} 1, s_{1}[i] \ne s_{2}[j], \\ 0, s_{1}[i]=s_{2}[j], \end{array}\right. \end{aligned}$$
(16)

where \(s_{1}[i]\) is the ith character in string \(s_{1}\) and \(s_{2}[j]\) is the jth character in string \(s_{2}\).
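The sketch below implements three of these metrics (cosine similarity, CSLS over full embedding matrices, and edit distance); it is a minimal illustration, and the value of K and the vectorized CSLS form are assumptions rather than the settings of any specific model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity (Eq. 13)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def csls_matrix(X, Y, k=10):
    """CSLS scores (Eq. 14) between all source vectors X and target vectors Y.
    r_T(x) is the mean cosine similarity of x to its K nearest targets, and
    r_S(y) the mean cosine similarity of y to its K nearest sources."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T                                   # pairwise cosine similarities
    k = min(k, cos.shape[0], cos.shape[1])
    r_T = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # per-source hubness term
    r_S = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # per-target hubness term
    return 2 * cos - r_T[:, None] - r_S[None, :]

def edit_distance(s1, s2):
    """Levenshtein edit distance (Eqs. 15-16) via dynamic programming."""
    d = np.zeros((len(s1) + 1, len(s2) + 1), dtype=int)
    d[:, 0] = np.arange(len(s1) + 1)
    d[0, :] = np.arange(len(s2) + 1)
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i, j] = min(d[i, j - 1] + 1, d[i - 1, j] + 1, d[i - 1, j - 1] + cost)
    return d[-1, -1]

# Made-up toy embeddings and entity name strings.
rng = np.random.default_rng(5)
X, Y = rng.normal(size=(2, 6, 4))
print(csls_matrix(X, Y, k=3).shape)             # (6, 6) CSLS score matrix
print(edit_distance("alignment", "alinement"))  # 2
```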

6.3 Alignment optimization strategy

Each element in the entity similarity matrix represents the similarity between two entities. If fine-grained features can be incorporated into the entity similarity matrix, the accuracy of the EA model will improve. According to prior knowledge, EA is a bidirectional matching problem between two knowledge graphs. Therefore, RAGA (Zhu et al. 2021) calculates a fine-grained similarity matrix by summing the weights of each entity aligned in both directions; specifically, a softmax operation is applied to both the rows and the columns of the initial entity similarity matrix. To alleviate the uncertainty and ambiguity of the EA process, FuzzyEA (Jiang et al. 2022b) models the uncertainty based on intuitionistic fuzzy sets. Guo et al. (2022) propose a deep reinforcement learning-based framework that transforms the EA problem into a sequential decision-making task and can be adapted to most embedding-based EA models. DATTI (Mao et al. 2022) focuses on the decoding process and uses adjacency tensor and Gramian tensor isomorphism equations to enhance the decoding power, bringing large performance improvements at little additional time cost.

6.4 Non-alignable entity prediction

Most existing studies assume that, given a test source entity, an equivalent target entity can be found for it. However, in realistic knowledge graphs, some entities have no counterpart aligned with them (Luo and Yu 2022). SoTead (Luo et al. 2022) and WOGCL (Xu et al. 2023) call such unmatched entities dangling entities and convert knowledge graph EA into an optimal transport problem: based on constructed pseudo entity pairs, contrastive metric learning is performed to calculate the transport cost of entity pairs, and virtual entities are finally matched to the dangling entities. MHP (Liu et al. 2022a) also uses optimal transport for global higher-order similarity computation, where the dangling entities correspond to the part in which the source and target entity embeddings differ; MHP considers multi-order neighbor entities when performing local similarity calculations. UEA (Zeng et al. 2021) uses a thresholded bidirectional nearest neighbor strategy to generate EA results, and entities left unmatched by this process are considered non-alignable. Building on UEA, CUEA (Zhao et al. 2022) takes into account that different pseudo-labeled data have different characteristics and uses confidence levels to measure the likelihood that an entity pair is true.

6.5 Alignment inference strategy

Alignment inference strategies are mainly divided into two categories: global alignment and local alignment. Table 4 summarizes and analyzes the two inference strategies.

Table 4 Alignment comparison analysis

6.5.1 Global alignment

To constrain EA to be one-to-one and exploit the interdependence between alignment decisions, some studies impose a one-to-one matching constraint. CEA (Zeng et al. 2020) and RAGA (Zhu et al. 2021) use the deferred acceptance algorithm to find stable matching results for two sets with an equal number of entities: no two entities from different sets prefer each other over the matches already assigned to them. The deferred acceptance algorithm guarantees that a solution can be found in \(O(N^{2})\) time.

Furthermore, the global EA task can be transformed into a maximum-weight bipartite graph matching problem. The Hungarian algorithm provides an exact solution to this task assignment problem and guarantees that a solution can be found in \(O(N^{4})\) time. GM-EHD-JEA (Xu et al. 2020) and LatsEA (Chen et al. 2021c) transform the EA problem into a task assignment problem, which is essentially a basic combinatorial optimization problem whose exact solution can be found by the Hungarian algorithm. SEU (Mao et al. 2021) combines the Hungarian algorithm with the Sinkhorn operation; models that combine the Hungarian algorithm with other operations perform better than those using the Hungarian algorithm alone.
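A minimal sketch of this assignment view of global EA, using SciPy's Hungarian-algorithm implementation on a toy similarity matrix (the matrix and names are illustrative and not taken from any particular model):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_alignment(sim):
    """Treat EA as a task assignment problem: find the one-to-one matching
    that maximizes total similarity (Hungarian / Kuhn-Munkres algorithm)."""
    # linear_sum_assignment minimizes cost, so negate the similarity matrix
    row_idx, col_idx = linear_sum_assignment(-sim)
    return list(zip(row_idx.tolist(), col_idx.tolist()))

sim = np.array([[0.9, 0.4, 0.1],
                [0.8, 0.7, 0.2],
                [0.3, 0.6, 0.5]])
# Greedy local decisions would map both source 0 and source 1 to target 0;
# the global one-to-one constraint resolves the conflict.
print(global_alignment(sim))  # [(0, 0), (1, 1), (2, 2)]
```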

Although the above works achieve global EA by imposing a one-to-one constraint on the EA process, they still do not adequately model the potential interdependencies. CEAFF (Zeng et al. 2021) therefore investigates the dynamic properties of the decision process and provides a reinforcement learning-based model that aligns entities collectively; in this framework, coherence and exclusivity constraints are designed to characterize interdependencies and to restrict the collective alignment. UED (Luo and Yu 2022) formulates EA as an optimal transport problem and finds the optimal global alignment by minimizing the total transportation distance.
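The optimal-transport view can be illustrated with a simple Sinkhorn iteration: scaling the similarity matrix and alternately normalizing rows and columns yields an approximately doubly stochastic plan from which a soft one-to-one alignment can be read off. The temperature `tau` and the iteration count are illustrative choices, not parameters of any cited model.

```python
import numpy as np

def sinkhorn(sim, tau=0.05, n_iter=50):
    """Turn a square similarity matrix into an approximately doubly stochastic
    transport plan by alternating row and column normalization."""
    K = np.exp(sim / tau)                       # temperature-scaled similarities
    for _ in range(n_iter):
        K = K / K.sum(axis=1, keepdims=True)    # normalize rows
        K = K / K.sum(axis=0, keepdims=True)    # normalize columns
    return K

sim = np.random.rand(5, 5)
plan = sinkhorn(sim)
alignment = plan.argmax(axis=1)                 # read off a hard alignment if needed
print(alignment)
```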

6.5.2 Local alignment

Each element in the entity similarity matrix reflects the distance between a source entity and a target entity in the vector space. After obtaining the embedding-based similarity matrix, EA enters the alignment decision stage. Most current embedding-based EA methods, such as RNM (Zhu et al. 2021), HGCN (Wu et al. 2019b), and IMUSE (He et al. 2019), use an independent decision strategy to generate alignment results, applying a greedy search to find a target entity for each test source entity. Specifically, given the vector representations of the knowledge graphs and a distance metric, the alignment model computes, for each source entity, the distance between its vector and all target entity vectors to find the most plausible target entity. This plain enumeration increases the workload of EA and leads to less efficient alignment, and a many-to-one situation may occur in the matching process, i.e., many test source entities are matched to the same test target entity. These are the limitations of the local EA strategy.
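For contrast with the global strategies above, the following sketch shows the independent (local) decision strategy: each source entity greedily picks its nearest target entity, here under cosine similarity as an example distance metric, with no one-to-one constraint.

```python
import numpy as np

def local_alignment(src_emb, tgt_emb):
    """Independent (local) decision strategy: for every source entity,
    greedily pick the nearest target entity under cosine similarity."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                        # pairwise cosine similarity
    return sim.argmax(axis=1)                # many-to-one matches are possible

src_emb = np.random.rand(3, 8)
tgt_emb = np.random.rand(4, 8)
print(local_alignment(src_emb, tgt_emb))     # one target index per source entity
```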

7 Post-alignment module

In the post-alignment module, the main focus is to discover more aligned entities by adding newly aligned entities to the seed set. Embedding-based EA methods use pre-aligned entities as seed data, and the performance depends heavily on the quality and quantity of seed data (Chen et al. 2020b). The data size of the knowledge graph is large, so the time and labor cost of manually labeling the aligned seeds is also large. Some studies propose using iterative training to add newly generated aligned entities to the seed set, which expands the size of the seed set and guides the subsequent training process. Table 5 provides a comparative analysis of various iterative strategies for the post-alignment module.

Table 5 Comparative analysis of post-alignment module

IPTransE (Zhu et al. 2017), RATransE (Haihong et al. 2020), and East (Zeng et al. 2019) design hard and soft alignment strategies. The hard alignment strategy directly applies the parameter-sharing model of the joint embedding part to the generation of new aligned entities, adding the new aligned entity pairs to the seed set. The soft alignment strategy is designed to alleviate the error accumulation produced by hard alignment. Kang et al. (2020) and Shize et al. (2019) use a re-initialization strategy in addition to a soft alignment strategy: by re-initializing the embeddings and the newly aligned entity-pair set in each iteration, the propagation of errors into subsequent iterations is reduced.

The above methods can introduce only a small number of high-confidence entity pairs, which cannot bring significant improvement. Wei-Xin Zeng et al. (2020) design an "easy to hard" iterative strategy: using the degree of entity nodes as the measure, entities with higher degrees are treated as easy samples and long-tailed entities as hard samples, and high-confidence entity pairs are added to the training set in an easy-to-hard order. Qu et al. (2019) regard entity pairs whose alignment probability exceeds a predefined threshold as easy alignments and the rest as hard alignments; if more than K easy alignments are found in an iteration, they are added to the seed set and the iteration continues, otherwise the iteration ends. However, such methods readily introduce wrong samples and suffer from low efficiency. On this basis, Ge et al. (2021) use a refinement strategy to improve the quality of the new seed alignments generated in each iteration and provide a plausible seed generator to produce pseudo-seed alignments.

BootEA (Sun et al. 2018) applies Bootstrapping (Yarowsky 1995) to iteratively expand the size of the seed set. The iterative process inevitably produces incorrect labels, and incorrect training samples can mislead subsequent training, so an alignment editing method is used to reduce error accumulation. Similar to Bootstrapping, Lu et al. (2021) and Song et al. (2021) employ error evaluation during the iteration process so that labeled entities can be relabeled or unlabeled in subsequent iterations. Lin et al. (2021) propose an attribute combination bidirectional full filtering strategy to generate semi-supervised data, no longer using only bootstrapped positive samples as input but also adding negative samples while iterating. The above Bootstrapping-based methods do not consider the effect of seed entity selection on the entity vector representation; therefore, Chen et al. (2020b) additionally consider the centrality and distinguishability of entities when selecting seeds on top of BootEA's iterative strategy, achieving better knowledge graph alignment with only a small number of high-quality seed-aligned entities.

Bootstrapping has achieved significant performance improvements, but it relies on complex selection criteria that inevitably introduce a set of hyperparameters. Therefore, based on the one-to-one nature of entity correspondence and the asymmetry of EA direction, MRAEA (Mao et al. 2020b), JEANS (Chen et al. 2020a), EVA (Liu et al. 2021), Inga (Pang et al. 2019), RANM (Cai et al. 2023), and AdaptiveEA (Zhang et al. 2021) propose bidirectional iterative strategies. Specifically, the entity pair \((e_{i},e_{j})\) is considered a newly predicted aligned pair in the current iteration if and only if the entities \(e_{i}\) and \(e_{j}\) are mutual nearest neighbors. This effectively alleviates the error propagation problem. However, even if \(e_{i}\) and \(e_{j}\) are mutual nearest neighbors, the similarity between them may still be low, so UEA (Zeng et al. 2021) proposes a thresholded bidirectional nearest-neighbor search strategy to generate alignment results: \((e_{i},e_{j})\) is considered aligned only when their distance is below a given dynamic threshold \(\theta\). Building on the bidirectional iterative strategy, DuGa-DIT (Xie et al. 2022) uses the newly added EA pairs to dynamically adjust the cross-graph attention score matrix and the objective function. Negative samples are rarely added during bidirectional iteration; the attribute combination bidirectional full filtering strategy of Lin et al. (2021) mentioned above adds negative samples while iterating and filters candidates through one-to-one constraints, since the correctness of the local alignment is unknown.
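A minimal sketch of a bidirectional iterative strategy: in each round, mutual nearest-neighbor pairs are added to the seed set and the model is retrained. The `train_step` function is a hypothetical placeholder for one round of embedding learning; in the toy usage it simply returns a fixed similarity matrix.

```python
import numpy as np

def mutual_nn_pairs(sim, existing):
    """Predict new aligned pairs (i, j) that are mutual nearest neighbours
    and not already in the seed set."""
    src_best = sim.argmax(axis=1)
    tgt_best = sim.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(src_best)
            if tgt_best[j] == i and (i, int(j)) not in existing]

def bidirectional_iterative_training(train_step, seeds, n_rounds=5):
    """train_step(seeds) -> similarity matrix; a stand-in for one round of
    embedding learning on the current seed set."""
    seeds = set(seeds)
    for _ in range(n_rounds):
        sim = train_step(seeds)
        new_pairs = mutual_nn_pairs(sim, seeds)
        if not new_pairs:               # no confident new predictions left
            break
        seeds.update(new_pairs)         # expand the seed set for the next round
    return seeds

# toy usage: a fixed similarity matrix as a stand-in for a trained model
toy_sim = np.array([[0.9, 0.2], [0.1, 0.8]])
print(bidirectional_iterative_training(lambda seeds: toy_sim, {(0, 0)}))  # {(0, 0), (1, 1)}
```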

Active learning algorithms update the embeddings in an iterative self-learning manner using the alignment results of the previous iteration. Zeng et al. (2021) employ active learning to select the entities to be manually labeled so as to maximize model performance with minimal effort: given a label budget B, at each iteration a query policy selects the b (b < B) most informative entities for labeling, pairs containing these entities are added to the labeled data used to train the EA model, and the process iterates until the label budget is exhausted. JEANS (Chen et al. 2020a) captures cross-lingual correspondences of entities and lexical elements in a self-learning manner: starting from a small number of seed alignments, transformations between language-specific embedding spaces are iteratively induced, and more entity and lexical alignments are inferred in each iteration. DAGCN (Wang et al. 2022) uses a degree-aware adversarial idea to iteratively train a generator and a discriminator: the discriminator adjusts the embedding representation in the generator, and the discriminator parameters are updated based on the embeddings; when degree differences between entities can no longer be detected, the effect of the degree difference is considered eliminated.
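The budgeted active-learning loop described above can be sketched as follows; the query policy used here (the gap between the top-2 similarities as an uncertainty score) and the `train_step` placeholder are illustrative assumptions rather than the exact policies of the cited works.

```python
import numpy as np

def uncertainty(sim_row):
    """A simple query policy: a smaller gap between the top-2 similarities
    means the model is less certain about this source entity."""
    top2 = np.sort(sim_row)[-2:]
    return -(top2[1] - top2[0])

def active_learning_loop(train_step, budget_B, batch_b, unlabeled):
    """train_step(labeled) -> similarity matrix; a stand-in for retraining
    the EA model on the currently labeled entities."""
    labeled, spent = set(), 0
    while spent < budget_B and unlabeled:
        sim = train_step(labeled)
        # pick the b most uncertain source entities for manual labeling
        ranked = sorted(unlabeled, key=lambda i: uncertainty(sim[i]), reverse=True)
        batch = ranked[:batch_b]
        labeled.update(batch)            # assume an oracle supplies their labels
        unlabeled -= set(batch)
        spent += len(batch)
    return labeled

toy_sim = np.random.rand(6, 6)
print(active_learning_loop(lambda labeled: toy_sim, budget_B=4, batch_b=2,
                           unlabeled=set(range(6))))
```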

8 Negative sampling

In the EA task, using negative samples helps improve model performance. The negative sampling techniques commonly used in the field of EA include uniform negative sampling, truncated negative sampling, and nearest-neighbor negative sampling; some scholars also use other negative sampling methods. Table 6 shows the negative sampling methods applied by each EA model.

Table 6 Negative sampling method
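As an illustration of two of these strategies, the sketch below (names are illustrative) contrasts uniform negative sampling with nearest-neighbor (hard) negative sampling over a similarity matrix.

```python
import numpy as np

def nearest_neighbour_negatives(sim, pos_pairs, k=5):
    """For each positive pair (i, j), take the k target entities most similar
    to i other than j as hard negatives (nearest-neighbour negative sampling)."""
    negatives = {}
    for i, j in pos_pairs:
        order = np.argsort(-sim[i])               # targets sorted by similarity to i
        negatives[(i, j)] = [int(t) for t in order if t != j][:k]
    return negatives

def uniform_negatives(n_target, pos_pairs, k=5, rng=np.random.default_rng(0)):
    """Uniform negative sampling: replace the target with randomly drawn entities."""
    return {(i, j): [int(t) for t in rng.choice(n_target, size=k + 1, replace=False)
                     if t != j][:k]
            for i, j in pos_pairs}

sim = np.random.rand(3, 10)
pos = [(0, 2), (1, 5)]
print(nearest_neighbour_negatives(sim, pos, k=3))
print(uniform_negatives(10, pos, k=3))
```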

9 Loss function

The loss function estimates the degree of inconsistency between the predicted value f(x) and the true value Y. It is a non-negative real-valued function, usually written as L(Y, f(x)); the smaller its value, the better the robustness of the model. The loss functions commonly used in the field of EA are shown in Table 7.

Table 7 Loss function
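As an example of one loss commonly listed in Table 7, the following sketch shows a margin-based ranking loss, which penalizes cases where a negative pair is not at least a margin \(\gamma\) farther apart than the corresponding positive pair; the exact formulation varies by model.

```python
import numpy as np

def margin_ranking_loss(pos_dist, neg_dist, gamma=1.0):
    """Margin-based ranking loss: aligned (positive) entity pairs should be
    at least `gamma` closer in the embedding space than negative pairs."""
    return np.maximum(0.0, pos_dist + gamma - neg_dist).mean()

# toy usage: distances of 4 positive pairs and their sampled negatives
pos_dist = np.array([0.2, 0.5, 0.1, 0.9])
neg_dist = np.array([1.5, 0.7, 0.4, 0.8])
print(margin_ranking_loss(pos_dist, neg_dist))  # only pairs violating the margin contribute
```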

10 Benchmarking

10.1 Dataset

10.1.1 Unimodal dataset

Commonly used unimodal English datasets in the field of EA are classified into monolingual and cross-lingual datasets, which are mostly extracted from open-linked datasets according to different requirements. Table 8 provides statistics on the commonly used unimodal English datasets. In addition to the English dataset, we also introduce the commonly used Chinese dataset.

Table 8 Unimodal English dataset

(1) DBP15K

The DBP15K (Sun et al. 2017) dataset includes three cross-lingual sub-datasets constructed from DBpedia. Links between 15,000 popular entities are extracted from English to Chinese, Japanese, and French, respectively. The number of entities involved in each language is usually far more than 15,000, and attribute triples make up a large proportion of the dataset.

(2) DWY100K

The DWY100K (Sun et al. 2018) dataset is a monolingual dataset containing two large-scale sub-datasets drawn from DBpedia, Wikidata, and YAGO3, denoted DBP–WD and DBP–YG, respectively. Each sub-dataset has 100,000 reference entity alignments, extracted following the DBP15K procedure. Taking DBP–WD as an example, 100,000 aligned entity pairs are randomly extracted between the English version of DBpedia and Wikidata.

(3) SRPRS

RSNs (Guo et al. 2019) first proposed the SRPRS dataset, which can control the degree distribution of entities in the sampled dataset. Here, the degree of an entity is defined as the number of relational triples that the entity is associated with.

(4) DBP v1.1

DBP v1.1 (Sun et al. 2020) contains both cross-knowledge-graph and cross-lingual settings. The dataset has two versions: v1 is generated with the iterative degree-based sampling (IDS) method, while v2 first randomly removes entities with degree less than or equal to 5 and then applies the IDS method, which increases the density.

(5) Chinese dataset

Huang and Luo (2020) collect data from Baidu Encyclopedia and Interactive Encyclopedia in the military domain and extract triples from the infoboxes, forming a small-scale dataset named Dataset-1. Similarly, an entertainment dataset is collected from Baidu Encyclopedia and Interactive Encyclopedia, forming a large-scale dataset named Dataset-2. As shown in Table 9, the Chinese datasets report the number of entities, the number of relations, and the total number of fact triples; the triples of the Interactive Encyclopedia and Baidu Encyclopedia are merged.

Table 9 Chinese dataset

10.1.2 Multimodal dataset

In the EA domain, two multimodal datasets are constructed in MMKG (Liu et al. 2019), namely FB15K–DB15K and FB15K–YAGO15K. FB15K is a representative subset extracted from the Freebase knowledge base. To keep the number of entities close to that of FB15K, DBpedia's DB15K and YAGO's YAGO15K are built mainly around FB15K, using links to align the entities in FB15K with those in the other knowledge graphs. Table 10 describes the statistics of the multimodal datasets. Each dataset contains nearly 15,000 entities and over 11,000 entity image sets.

Table 10 Multimodal dataset

10.2 Evaluation metric

Three evaluation metrics are commonly used to evaluate EA performance: Hits@k, MR, and MRR. In the EA task, Hits@k is the proportion of test entities whose correct counterpart is ranked in the top k, MR is the average rank of all correctly aligned entities, and MRR is the mean of the reciprocal ranks of all correctly aligned entities in the alignment results.

(1) Hits@k

Hits@k is the proportion of correctly aligned entities ranked in the top k (k is usually 1 or 10). If the correct counterpart appears among the top-k candidates, the hit count is increased by 1. The higher the value of Hits@k, the better the model. Hits@k is the hit count divided by the total number of test entities, calculated as follows.

$${\text {Hits}} @ k=\frac{{\text {count}}\left( \left\{ e \in S \mid {\text {rank}}_{e} \le k\right\} \right) }{{\text {count}}(S)},$$
(17)

where count(S) denotes the number of elements in the set, \(rank_{e}\) is the true rank of entity e, and S is the set of test entities to be aligned.

(2) Mean rank (MR)

MR is the average rank of all correctly aligned entities in the EA results; the lower the MR value, the better the model.

$${\textrm{MR}}=\frac{\sum _{e \in S} {\text {rank}}_{e}}{{\text {count}}(S)},$$
(18)

where count(S) denotes the number of elements in the set, \(rank_{e}\) is the true rank of entity e, and S is the set of test entities to be aligned.

(3) Mean reciprocal rank (MRR)

MRR is the mean of the reciprocal ranks of all correctly aligned entities in the EA results. The higher the value of MRR, the better the model; it is calculated as follows.

$${\textrm{MRR}}=\frac{1}{{\text {count}}(S)} \sum _{e \in S} \frac{1}{{\text {rank}}_{e}},$$
(19)

where count(S) denotes the number of elements in the set, \(rank_{e}\) is the true rank of entity e, and S is the set of test entities to be aligned.
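The three metrics can be computed directly from the rank of the true counterpart for each test entity, as in the following sketch implementing Eqs. (17)–(19) (variable names are illustrative).

```python
import numpy as np

def ea_metrics(ranks, ks=(1, 10)):
    """ranks: the true target entity's rank (1-based) for every test source entity."""
    ranks = np.asarray(ranks, dtype=float)
    hits = {k: float((ranks <= k).mean()) for k in ks}   # Hits@k, Eq. (17)
    mr = float(ranks.mean())                             # Mean Rank, Eq. (18)
    mrr = float((1.0 / ranks).mean())                    # Mean Reciprocal Rank, Eq. (19)
    return hits, mr, mrr

# toy usage: ranks of 5 test entities
print(ea_metrics([1, 3, 2, 15, 1]))
# ({1: 0.4, 10: 0.8}, 4.4, 0.58)
```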

11 Experiment and analysis

11.1 Effect of direction on performance

EA is a bidirectional alignment problem. To explore the influence of alignment direction on the model, we study the performance of some representative models, such as GCN-Align, HMAN, and GMNN, in both directions on DBP15K. The experimental results are shown in Tables 11, 12, and 13. For example, \({DBP15K}_{ZH{-}EN}\) in Table 11 indicates that the source knowledge graph is the Chinese dataset and the target knowledge graph is the English dataset.

Table 11 The effect of direction on the model on \(DBP15K_{ZH{-}EN}\)
Table 12 The effect of direction on the model on \(DBP15K_{JA{-}EN}\)
Table 13 The effect of direction on the model on \(DBP15K_{FR{-}EN}\)

The results show that the models perform better in most cases when the English knowledge graph is used as the target knowledge graph. This is because the English knowledge graph is denser than the knowledge graphs of the other languages. However, HMAN, AKE, and DAEA perform better than, or close to, the forward alignment when reverse alignment is performed on the test dataset. All three models use external factors such as entity descriptions, attributes, or entity frequencies. This indicates that external information can alleviate the sparsity of the knowledge graph structure and facilitate EA. Therefore, the subsequent analysis in this paper uses English knowledge graphs as the target knowledge graphs.

11.2 Experimental setting

To ensure fairness, the model performance on different datasets is analyzed separately in this paper. Under the unimodal dataset, 30% of the seeds are used for training, and the remaining 70% are used for testing. Under the multimodal dataset, 20%, 50%, and 80% of the seeds are used for training, and the rest are used for testing. Unless otherwise specified, the experimental results in this paper are from the original paper. Table 14 shows the information of representative EA models.

Table 14 Representative EA model information

Combined with the current research trends in the field, existing EA methods are classified into four categories in this paper: (1) global alignment models; (2) models using a noise filtering strategy; (3) models using only global structural information; and (4) models combining global structural and local semantic information. Some models may belong to several categories; for example, RAGA could be placed in categories 1, 2, and 4. We therefore set a classification priority: once a model has been assigned to a category, it does not participate in the intra-class comparison of subsequent categories, so RAGA is placed in the first category. We then perform a comparative analysis of the models within each category.

The first category is EA from the global perspective. Most of the current EA models use plain enumeration alignment, which has room for improvement in both accuracy and alignment efficiency. Global EA can effectively exploit the interdependencies between alignment decisions to ensure one-to-one entity matching. Therefore, we separate global alignment and local alignment for experimental comparison and analysis.

The second category is models that employ a noise filtering strategy. As the number of network layers increases, the model can more effectively aggregate information from neighbors and capture structural representations. But this inevitably introduces noise that is not conducive to the learning of entity representations. Therefore, we analyze whether noise filtering has a positive impact on model performance.

The third category is the models using only structural information. Representation learning of knowledge graph is essentially the process of mining graph features, and the structure of the knowledge graph is an important basis for obtaining entity representations. Therefore, we classify the models that utilize only global structural information into a separate category.

The fourth category is the models that combine global structure and local semantics. When the knowledge graph structure is sparse, local semantics such as attributes will provide useful alignment signals for EA.

11.3 Comparison of unimodal EA models

11.3.1 Comparative analysis of models on DBP15K

The performance of each model on DBP15K is shown in Table 15. DBP15K is a small, dense cross-lingual dataset. We divide the models into four categories and perform intra-class comparisons as follows.

Table 15 Model performance comparison on DBP15K dataset

Category 1: among the global EA models, SoTead performs best on Hits@1 across the three sub-datasets of DBP15K, and SEU is the second-best performer in this category. Both SoTead and SEU are unsupervised approaches; their good performance lies in the fact that SoTead transforms EA into an optimal transport problem, while SEU transforms EA into a task assignment problem, significantly reducing the computational complexity compared with neural-network-based alignment. LatsEA performs the weakest: the other models consider entity names or local relational semantics, whereas LatsEA focuses on global structural embedding. CEA, CEAFF, and GM-EHD-JEA all focus on the collective alignment process: GM-EHD-JEA treats EA as a task assignment problem and uses the Hungarian algorithm, CEA views EA as a stable matching problem and solves it with a deferred acceptance algorithm, and CEAFF collectively aligns source entities through reinforcement learning, which adequately captures the interdependencies between EA decisions. Thus CEAFF achieves better results than CEA and GM-EHD-JEA. RAGA achieves clearly better results than CEA, CEAFF, and GM-EHD-JEA because it incorporates the local semantics of relations and fine-tunes the similarity matrix to consider fine-grained semantic features. This shows that combining local semantic information and fine-tuning the similarity matrix is beneficial for EA.

Category 2: on DBP15K, RNM consistently ranks at the top on the three sub-datasets. HGCN, RDGCN, and SSP perform similarly because they all use local relational semantics. NMN performs slightly better than HGCN, RDGCN, and SSP because, in addition to the structural embedding, it considers neighborhood matching. RNM improves considerably over NMN because it further introduces relation semantics on top of NMN and uses collaborative training. This shows the importance of combining global structure with local semantics, and that co-training can promote model performance.

Category 3: MuGNN and KECG obtain better results than MTransE on DBP15K, which indicates that deep models capture structural features better than translation models. Specifically, MuGNN uses a multi-channel graph neural network to capture structural information, and KECG adopts a similar idea by jointly learning entity embeddings and encoding intra-graph relations and neighborhood information. Their performance is still weaker than that of RSNs, because RSNs consider long-term relational dependencies between entities, capturing more structural signals for alignment. BootEA ranks first on all metrics for all sub-datasets in category 3, because it uses a bootstrapping strategy that adds newly generated EA pairs to the seed set; this indicates that the bootstrapping strategy has a positive impact on performance.

Category 4: the results of JTMEA are higher than those of JAPE and close to those of GCN-Align; JTMEA captures semantic information better during embedding. Both JTMEA and JETEA use entity type information and achieve similar performance. TTEA uses type enhancement while considering triple specificity and role diversity, resulting in superior performance. Compared with GCN-Align and JAPE, HMAN works better because it considers entity description information in addition to attribute information, which indicates that exploiting multiple kinds of local semantic information is useful. Compared with HMAN, JAPE, and GCN-Align, GM-Align performs better because it uses an entity-name-based initialization, whereas HMAN, JAPE, and GCN-Align use random initialization. MRAEA, TransEdge, and NAEA all use an iterative strategy together with local semantic information, and the methods using iteration generally outperform the others. MRAEA performs better than TransEdge and NAEA: TransEdge and NAEA use a bootstrapping iteration strategy that ignores direction, while MRAEA uses a bidirectional iteration strategy, which indicates that the effect of direction on EA deserves attention. OTIEA utilizes ontology-enhanced triple encoders, mining intrinsic associations and ontology-pair information, with good results. In this category, FuzzyEA performs best: it fuses entity names and descriptions and takes into account the uncertainty caused by a single alignment metric.

11.3.2 Comparative analysis of models on DWY100K

The performance of each model on DWY100K is shown in Table 16. DWY100K is a large-scale, dense monolingual dataset. We classify the models into four categories and perform intra-class comparisons as follows.

Table 16 Model performance comparison on DWY100K dataset

Category 1: on DWY100K, both CEA and GM-EHD-JEA focus on the collective alignment process. GM-EHD-JEA uses the Hungarian algorithm to maximize the local similarity score under one-to-one constraints, while CEA uses the deferred acceptance algorithm to guarantee a one-to-one alignment, i.e., an optimal assignment for each source entity. CEA consistently outperforms GM-EHD-JEA on all sub-datasets of DWY100K, which indicates that the search-space separation strategy of GM-EHD-JEA can harm model performance; moreover, the Hits@1 of CEA reaches 1. The performance of CEA on DBP15K is much lower than on DWY100K, and the same phenomenon occurs for GM-EHD-JEA, because the names of entities to be aligned are more similar in DWY100K than in DBP15K.

Category 2: on DWY100K, HGCN and NMN perform much better than SSP, and their performance is close to each other. Both HGCN and NMN use entity-name-based initialization, and HGCN additionally considers local relational semantics. SSP also makes use of relational information but is not initialized from entity names, so its performance is much inferior to HGCN and NMN, which shows that the initialization of the embeddings affects model performance.

Category 3: on DWY100K, MTransE performs the worst because it only considers the original topological information. MuGNN and KECG perform much better than MTransE: MuGNN uses a multi-channel graph neural network to capture structural information at different levels, and KECG jointly learns entity embeddings while taking neighborhood information into account. It is worth noting that RSNs do not outperform MuGNN and KECG on DWY100K; RSNs improve performance by considering long-term relational dependencies between entities, but when structural data are more adequate, long-term dependencies may not bring significant gains. BootEA still ranks first on the metrics for both sub-datasets of DWY100K, which indicates that the bootstrapping strategy improves EA models on both large-scale and small-scale datasets. In addition, the performance of each model on DWY100K is clearly better than on small-scale data such as DBP15K; for example, MuGNN shows a marked improvement in Hits@1, because the larger dataset provides more structural information to support EA.

Category 4: RALG performs best because it constructs a heterogeneous line graph, which is used to learn the relational representations of entities independently. SelfKG achieves the second-best performance using self-supervised alignment; DWY100K is a monolingual dataset with high entity-name similarity, so alignment is easier and supervised learning is not strictly necessary. SHEA performs well on DWY100K because it considers both intra-graph and cross-graph attention mechanisms when learning alignment-oriented entity embeddings. EASAE uses both summary and attribute embeddings and therefore achieves strong alignment performance. GM-Align also performs well because it implements cross-graph matching, whereas the other models learn their structural representations independently.

11.3.3 Comparative analysis of models on SRPRS

Table 17 shows the performance of each model on the SRPRS dataset. We classify the models into four categories and perform intra-class comparisons as follows.

Table 17 Model performance comparison on SRPRS dataset

Category 1: compared with DBP15K and DWY100K, the SRPRS dataset is sparser and closer to realistic datasets. Table 17 shows that the performance of every model degrades significantly on the sparse dataset. The performance of SEU still ranks first; SEU uses word-vector embeddings as well as character embeddings in the embedding part. The SRPRS dataset contains a large number of proper names, and the performance of CEA and CEAFF also remains high, with Hits@1 on DBP–WD and DBP–YG being almost perfect. This indicates that both the collective search algorithm and the task assignment algorithm are applicable to sparse datasets.

Category 2: HGCN and RDGCN both use GCNs for structural embedding and both use entity names and relations, so their performance is close. Comparing sub-datasets of similar sparsity, the two models perform somewhat better on the sparse monolingual dataset, which suggests that language heterogeneity is a bigger obstacle to the EA task than knowledge graph sparsity.

Category 3: among the EA models using only structural information, the trends are close to those on DBP15K and DWY100K, and the models using an iterative strategy perform slightly better. In addition, the performance of RSNs on the sparse datasets is close to that of BootEA with its iterative strategy, which indicates that long-term dependencies help to obtain more accurate representations when the knowledge graph structure is relatively sparse.

Category 4: GM-Align still performs well because it considers graph matching, incorporates local matching information between knowledge graphs, and relies less on structure. NAEA, TransEdge, and MRAEA, which use iterative strategies, degrade significantly, but MRAEA degrades less than TransEdge and NAEA, which indicates that the bidirectional iterative strategy is more robust than the bootstrapping iterative strategy. On sparse datasets, the performance of JAPE and GCN-Align, which both use attributes, does not differ much, indicating that when the knowledge graph structure is sparse, GCNs cannot learn good structural representations. It is worth noting that COEA and FGWEA still maintain good performance even on sparse datasets, with values above 0.9 on all metrics, because both models not only take structural and semantic information into account but also improve the alignment process: COEA converts EA into a combinatorial optimization problem, and FGWEA solves EA with optimal transport.

11.3.4 Comparative analysis of models on DBP v1.1

Table 18 shows the performance of the models on the DBP v1.1 dataset. The model results marked with * are from Ge et al. (2023), and the other results are from Xiang et al. (2021). Table 18 shows that models using only structural information, such as MTransE and RSNs, do not work well. Although BootEA also uses only structural information, it employs an iterative strategy to expand the seed set and therefore achieves better performance. In comparison, combining structural and semantic information improves model performance: RDGCN uses a noise filtering strategy and relational semantics, and OntoEA introduces ontology information, and both achieve better results. TypeEA-B, TypeEA-R, and TypeEA-M apply type information to BootEA, RDGCN, and MultiKE, respectively, and the comparison shows that introducing entity type information facilitates EA.

Table 18 Model performance comparison on DBPv1.1 dataset

11.3.5 Comparative analysis of models on Chinese dataset

As shown in Table 19, representative models on the Chinese datasets include TransH (Wang et al. 2014), TransD (Ji et al. 2015), IEAJKE (Zhu et al. 2017), AttrE (Trisedya et al. 2019), and EASA (Huang and Luo 2020). On Dataset-1 and Dataset-2, EASA performs best on Hits@1 and Hits@10 and is far ahead of the second-best model AttrE on the Hits metrics. The MR of EASA is second only to the top-ranked IEAJKE on both datasets, which demonstrates the importance of fully exploiting the semantic information of entities. Although AttrE uses the frequency ratio of relations and attributes as weights for EA, it cannot capture the importance of the semantic aggregation produced by many attributes. TransH, TransD, and IEAJKE do not consider entity semantic integration or attribute weights and thus perform worse than EASA and AttrE.

Table 19 Model performance comparison on Chinese dataset

11.4 Comparison of multimodal EA models

Tables 20 and 21 show the experimental results of representative multimodal knowledge graph EA models on the FB15K–DB15K and FB15K–YAGO15K datasets. This paper compares the performance of each model when 20%, 50%, and 80% of the seed entity pairs are provided as the training set. The performance of each model increases gradually as the proportion of seed entity pairs increases, and the metrics of MMEA and ACK-MMEA grow the fastest, indicating that MMEA and ACK-MMEA have strong robustness and adaptability. The experimental data show that PCMEA achieves the best results: it filters modality-specific noise and uses pseudo-label calibration and contrastive learning, which reduce the effect of noise and improve the quality of the pseudo-labels. Every metric of MMEA is clearly better than that of IKRL at each seed proportion, which indicates that MMEA is applicable to real-world multimodal knowledge graphs. On FB15K–DB15K with 80% of the seed entity pairs, MMEA outperforms HMEA on Hits@1 by nearly 20%; on FB15K–YAGO15K with 80% of the seed entity pairs, MMEA outperforms HMEA and GCN-Align by more than 15% on Hits@1. Even when the seed proportion is only 20%, MMEA improves over HMEA by about 14% on Hits@1 and over 15% on Hits@10. This analysis shows that migrating multimodal knowledge embeddings from separate spaces into a common space is an effective method for EA.

Table 20 Model performance comparison analysis in dataset FB15K–DB15K
Table 21 Model performance comparison analysis in dataset FB15K–YAGO15K

12 Research prospect

12.1 Study of other multimodal data

The relational structure information of knowledge graphs sometimes leads to ambiguity, so multimodal knowledge (Liu et al. 2019) plays a key role in the knowledge embedding process. Although a few studies (Chen et al. 2020c; Wang et al. 2020) have applied static image data to multimodal EA, other modalities have not been fully explored. For example, dynamic video data have not been applied to EA tasks, yet video data contain richer and more intuitive information than static images. Therefore, how to incorporate advanced video features, as well as other features, into multimodal knowledge graph EA may be a focus of future research.

12.2 Study of realistic datasets

There are significant differences between existing datasets and real-world knowledge graphs, which makes it difficult for existing EA models to run on real-world knowledge graphs. The entities in current datasets have more neighbors and richer semantic information, so these high-degree entities are relatively easy to align. In addition, current datasets focus on only one aspect of heterogeneity, such as multilingualism, and ignore differences in schema and scale. Therefore, constructing datasets from multiple perspectives that are closer to real-world knowledge graphs deserves further research.

12.3 Study of other vector spaces

A key step in embedding-based knowledge graph EA is learning the embedding representation, and the quality of the embedding has a direct impact on subsequent EA performance. It has been shown that non-Euclidean spaces can embed graph structure better than Euclidean space (Nickel and Kiela 2017). Most current EA methods use Euclidean space; although some models use hyperbolic space (Guo et al. 2021) and spherical space (Huang et al. 2022), many other vector spaces (e.g., complex spaces, Sun et al. 2019) are still worth studying.

12.4 Study of dynamic knowledge graph EA

At present, most datasets used in the field of EA are static knowledge graphs, while realistic knowledge graphs change frequently, so it is necessary to consider dynamic factors in knowledge graph EA. The dynamics of knowledge graphs are mainly reflected in the temporal and spatial dimensions (Zheng et al. 2020). Although some scholars have studied EA for temporal knowledge graphs (Song et al. 2022), the dynamics of knowledge graphs have not yet been considered from the spatial dimension. Therefore, it is worthwhile to design spatio-temporal knowledge graph EA models that cover both dimensions.