1 Introduction

Google proposed knowledge graphs to improve the accuracy of search engines and the efficiency of user retrieval (Liu et al. 2016). Knowledge graphs can help mine the semantic information of user needs and eliminate ambiguities. The semantic network (Berners-Lee et al. 2001) is the predecessor of the knowledge graph. The semantic network focuses more on describing relationships between concepts, while the knowledge graph is more inclined to describe associations between entities. The emergence of knowledge graphs is more in line with the development trend of computer semantics.

In recent years, knowledge graphs have been widely used in many fields. The data sources used to build knowledge graphs can be structured data, semi-structured data, unstructured data, generic knowledge graphs, etc., and different organizations choose data sources according to their business needs. In addition, there is no unified industry standard for building knowledge graphs in different domains. These factors lead to heterogeneity and redundancy among different knowledge graphs. For example, the entries for “Childhood” in Interactive Encyclopedia and Wikipedia are partly complementary and partly redundant. If the information of the two encyclopedias can be correlated, users will obtain a more detailed and comprehensive knowledge of the book. To make full use of the information of entities, more and more researchers are fusing different knowledge graphs (Lin et al. 2020).

As a way to integrate knowledge (Mishra et al. 2017), entity alignment (EA) extracts entities that refer to the same real-world objects in different knowledge graphs (Yang et al. 2020), which is beneficial for knowledge-driven applications. Traditional EA methods rely on machine translation or feature engineering, which are labor-intensive; moreover, hand-designed features contain subjective factors, so the accuracy of traditional methods depends heavily on the quality of translation and the definition of features. Recently, representation learning techniques have proven to capture structural information better, so more and more researchers adopt them for knowledge graph EA. Embedding-based EA methods are free from the reliance on manually constructed features or rules.

Several works (Zhao et al. 2022; Zhang et al. 2021; Sun et al. 2020; Fanourakis et al. 2022; Chaurasiya et al. 2022) have reviewed the development of EA. However, the field is evolving rapidly and the existing review papers do not include the latest EA models. In addition, their work is not presented in sufficient detail to facilitate the reader’s understanding. Zhao et al. (2022) divide the EA framework into four parts: embedding learning module, alignment module, prediction module, and additional information module. They divide the existing state-of-the-art methods into three groups and perform group evaluations to compare the results of the same models on different datasets. Representative methods from each module are selected to generate possible combinations, and the effectiveness of the methods in each module is assessed by comparing the performance of the different combinations. However, most EA methods in their experiments are based on local alignment, and they do not cover multimodal knowledge graph EA or Chinese knowledge graph EA. Zhang et al. (2021) conduct a comprehensive survey and analysis of embedding-based knowledge graph EA, dividing the EA framework into two processes: embedding and alignment. They present the history of embedding and methods based on TransE and graph convolutional networks, listing nearly 30 representative structural embedding models that use these two embedding approaches. Zhang et al. analyze the embedding models by examining whether they add attribute information, whether they use relational predicates as input, and whether they use seed alignments; however, this classification of information is not refined enough. Sun et al. (2020) investigate 23 embedding-based EA methods, classify them according to their techniques and features, and construct an open-source library that contains 12 representative embedding-based EA methods together with evaluation procedures. However, their focus is on dataset construction and experimental results; the models are not introduced according to the classification of techniques, and their technical discussion cannot meet the needs of subsequent new models. Fanourakis et al. (2022) do not provide a comprehensive introduction to embedding-based EA tasks; for example, path sequence models among embedding methods, multimodal EA, dangling-entity EA, and alignment inference strategies are not presented. Chaurasiya et al. (2022) focus on aspects such as degree distribution, non-isomorphic neighbourhoods, and name bias, and do not present the details of the various parts of the alignment process.

This paper presents comprehensive research in this field to fill the gap in existing reviews, with the following main contributions.

(1) This paper proposes a new EA framework, which is divided into three parts: information aggregation module, alignment module, and post-alignment module. Each module has unique functions. In the information aggregation module, this paper not only introduces different embedding initialization methods, but also further refines the subsequent parts into global structure embedding and local semantic information. Compared to existing reviews, which tend to treat relations simply as structures, this paper not only considers the structural aspects of relations on the macro level, but also captures the local semantic information of relations on the micro level. In addition, this paper details the interaction between global structure and local semantics, revealing their complementarities and collaborations in the entity alignment process. In the alignment module, this paper introduces alignment optimization strategies and non-alignable entity prediction methods, which are rarely mentioned in previous reviews. Moreover, this paper also comprehensively analyzes different alignment inference strategies from both global and local perspectives. In the post-alignment module, this paper compares and analyzes a variety of iterative strategies to provide guidance for practical applications.

(2) In the experimental part, this paper presents the performance of both unimodal and multimodal EA; the unimodal experiments also cover Chinese EA. Considering that entity alignment is a bidirectional matching problem, this paper examines the effect of EA direction on model performance. By comparing the experimental results, it is found that the direction does affect model performance, which provides a reference for researchers to optimize their models. In the comparative analysis of unimodal experiments, this paper classifies the representative models differently from existing reviews and compares them along four aspects: whether global alignment is applied, whether a noise filtering strategy is applied, whether only the global structure is utilized, and whether global structure and local semantics are combined. The existing methods are also compared and analyzed.

(3) This paper follows the latest research trends in the field and details the advanced methods used for knowledge graph entity alignment. It not only introduces existing methods, but also proposes a series of research directions. In particular, the paper suggests combining other features such as video with textual information to achieve more accurate multimodal entity alignment. In addition, to improve the robustness and applicability of entity alignment techniques, the paper emphasizes the importance of constructing datasets that are close to real-world situations along multiple dimensions. The paper also proposes mapping the knowledge graph into more expressive vector spaces (e.g., complex spaces) to obtain better-quality entity embedding representations. Meanwhile, this paper proposes that spatial and temporal dimensions should be considered comprehensively to cope with the dynamic changes of the knowledge graph and enhance the generalization ability of the model. This paper provides a reference for advancing research in knowledge graph entity alignment, as well as for solving the challenges in real-world problems.

2 Preliminary

2.1 Knowledge graph EA problem description

A knowledge graph is a knowledge base that organizes data from a semantic perspective, and it is a general framework for describing formal semantic knowledge. A knowledge graph can be formalized as \(KG = (E, R, T)\), where E, R, and T represent entities, relations, and triples, respectively. A knowledge graph is a graph structure in which nodes represent entities and edges represent relationships. There are two types of triples. The first type is the relational triple, such as (Yuan_Longping, Birthplace, China). The other type is the attribute triple, e.g. (Yuan_Longping, Gender, “Male”). The task of knowledge graph EA is to find equivalent entities in two knowledge graphs (Sun et al. 2020), which is defined formally as:

$${Align}_{\text{ entity } }\left( K G_{1}, K G_{2}\right) =\left\{ \left( e_{1}, e_{2}\right) \mid e_{1} \in K G_{1}, e_{2} \in K G_{2}, e_{1} \sim e_{2}\right\} ,$$
(1)

where \(KG_{1}, KG_{2}\) denote two knowledge graphs, \(e_{1}, e_{2}\) denote entities, and \(\sim\) denotes the equivalence relation. Usually, the subset \({Align}_{\text{entity}}^{\prime }\left( KG_{1}, KG_{2}\right) \subset {Align}_{\text{entity}}\left( KG_{1}, KG_{2}\right)\) is called the seed set, which is known in advance and used as training data.
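To make the definition concrete, the following minimal sketch (with made-up toy triples) shows two knowledge graphs as triple sets, the full alignment set of Eq. (1), and the seed subset used for training:

```python
# Hypothetical toy data illustrating the EA setting of Eq. (1).
KG1 = {("Yuan_Longping", "Birthplace", "China"),
       ("Yuan_Longping", "Occupation", "Agronomist")}
KG2 = {("袁隆平", "出生地", "中国"),
       ("袁隆平", "职业", "农学家")}

# Full (gold) alignment: entity pairs referring to the same real-world objects.
align_entity = {("Yuan_Longping", "袁隆平"), ("China", "中国")}

# Seed alignment: the subset known in advance and used as training data.
seed = {("Yuan_Longping", "袁隆平")}
assert seed <= align_entity  # Align'_entity is a subset of Align_entity
```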

2.2 Data preprocessing

Data preprocessing operates on the data before the main alignment steps, in order to obtain more targeted data and improve the subsequent alignment process. Data preprocessing in the EA task can be divided into syntactic regularization and data regularization. Most EA algorithms, such as that of Zhu et al. (2021), perform alignment directly after simply organizing the data format and removing noisy data, while some other EA algorithms employ special data preprocessing. For example, Trisedya et al. (2019) first align predicates and then rename similar predicates uniformly so that relations and entities can be embedded into the same vector space. Chen et al. (2020c) use radial basis functions to encode continuous values. To mine the hidden information in the knowledge graph, Jiang et al. (2019) use logic rules to derive new triples and thus enrich the set of triples; such rule-based methods are generally divided into deductive reasoning and transfer rules. RpAlign (Huang et al. 2022) expands the training data using data augmentation to produce supervised triples across the knowledge graphs, which allows information to be exchanged between different knowledge graphs.

3 Related foundations

3.1 Translation model

The translation model represents relationships as vector translations in the embedding space. TransE (Bordes et al. 2013) is the representative of the translation model family and is widely used. Based on the vector representations of entities and relationships, TransE treats the relationship of a triple as a translation from the head entity to the tail entity. The purpose is to embed all entities and relations in the knowledge graph into a low-dimensional vector space. The energy function of the relation triple \(\left( e_{1}, r_{1}, e_{2}\right)\) is defined as:

$$\varphi \left( e_{1}, r_{1}, e_{2}\right) =\left\| e_{1}+r_{1}-e_{2}\right\| ,$$
(2)

where \(\Vert \cdot \Vert\) denotes the \(L_{1}\)-norm or \(L_{2}\)-norm of the vector. TransE has become the baseline standard for vectorized representation of knowledge graphs and has given rise to many variants, such as TransR (Lin et al. 2015a), TransC (Lv et al. 2018) and KG2E (He et al. 2015).
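As a concrete illustration, the short sketch below evaluates the TransE energy of Eq. (2) for a toy triple; the embedding values are randomly generated and purely illustrative.

```python
import numpy as np

def transe_energy(h, r, t, norm=1):
    """TransE energy of a triple (Eq. 2): ||h + r - t|| under the L1 or L2 norm."""
    return np.linalg.norm(h + r - t, ord=norm)

# Toy 4-dimensional embeddings (made-up values for illustration only).
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))
print(transe_energy(h, r, t, norm=1))  # a lower energy indicates a more plausible triple
```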

3.2 Deep model

Deep models use deep learning techniques to learn embedding representations; commonly used deep models include the graph neural network (GNN), the graph convolutional network (GCN), and the graph attention network (GAT).

GNN (Zhou et al. 2020) is based on an information propagation mechanism, where each node updates its state by exchanging information with its neighbors until the states reach a stable value. The goal of a GNN is to learn a state embedding \(h_{v}\) for each node, from which the final output can be obtained. The formula for \(h_{v}\) is as follows:

$$h_{v}=f\left( X_{v}, X_{co[v]}, h_{ne[v]}, X_{ne[v]}\right) ,$$
(3)

where \(f(\cdot )\) is a parameterized local transformation function that updates the current node state according to the states of neighboring nodes; \(X_{v}\) denotes the feature vector of node v; \(X_{co[v]}\) denotes the feature vectors of the edges incident to node v; \(h_{ne[v]}\) denotes the state vectors of node v’s neighboring nodes; \(X_{ne[v]}\) denotes the feature vectors of the neighboring nodes of node v.

GCN (Kipf and Welling 2017) consists of an input layer, propagation layers, and an output layer. When the knowledge graph is embedded into a low-dimensional vector space, entities are treated as nodes. GCN uses an activation function to iteratively update the neighbor node information, formulated as follows.

$$H^{(l+1)}=\sigma \left( D^{-\frac{1}{2}} {\widehat{A}} D^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) ,$$
(4)

where \({\widehat{A}}=A+I\), A is the adjacency matrix and I is the identity matrix; D is the degree matrix of \({\widehat{A}}\); H is the feature matrix of a layer, which equals X at the input layer and is called the hidden representation during propagation; W is the weight matrix; \(\sigma\) is the activation function.
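A minimal NumPy sketch of one propagation step of Eq. (4) is given below; the toy adjacency matrix, features, and weights are made up, and ReLU is used as the activation \(\sigma\).

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step (Eq. 4): H' = sigma(D^{-1/2} A_hat D^{-1/2} H W),
    with A_hat = A + I and D the degree matrix of A_hat."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^{-1/2} A_hat D^{-1/2}
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU as the activation sigma

# Toy graph with 3 entity nodes and 4-dimensional features (made-up values).
rng = np.random.default_rng(1)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H, W = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
print(gcn_layer(A, H, W).shape)  # (3, 4): updated entity representations
```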

GAT (Velickovic et al. 2018) introduces an attention mechanism to assign corresponding weights to neighboring nodes and obtains information about the whole network from local information. The attention coefficients \(a_{ij}\) are obtained by normalizing the attention scores \(e_{ij}\) over each neighborhood:

$$a_{i j}=\frac{\exp \left( \operatorname{LeakyReLU}\left( e_{i j}\right) \right) }{\sum _{k \in N_{i}} \exp \left( \operatorname{LeakyReLU}\left( e_{i k}\right) \right) }.$$
(5)

Using the computed attention coefficients, the features are weighted and summed to obtain the new features incorporating the neighborhood information.

$$h_{i}^{\prime }=\sigma \left( \sum _{j \in N_{i}} a_{i j} W \overrightarrow{h_{j}}\right) .$$
(6)

Generally, GAT also uses multi-head attention, concatenating or averaging the heads, to enhance the capacity of the model and stabilize the training process.
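The sketch below implements a single attention head following Eqs. (5)-(6); the scoring form \(e_{ij}=a^{\top}[Wh_i \Vert Wh_j]\) and the toy graph are assumptions for illustration, and tanh is used as the activation \(\sigma\).

```python
import numpy as np

def gat_layer(A, H, W, a):
    """Single-head GAT aggregation: LeakyReLU-scored attention normalized over
    each neighborhood (Eq. 5), then a weighted sum of projected features (Eq. 6)."""
    def leaky_relu(x, slope=0.2):
        return np.where(x > 0, x, slope * x)

    Wh = H @ W                                   # project node features
    H_new = np.zeros_like(Wh)
    for i in range(A.shape[0]):
        neigh = np.where(A[i] > 0)[0]            # N_i: neighbors of node i
        e = leaky_relu(np.array([a @ np.concatenate([Wh[i], Wh[j]]) for j in neigh]))
        alpha = np.exp(e) / np.exp(e).sum()      # Eq. (5): attention coefficients a_ij
        H_new[i] = np.tanh((alpha[:, None] * Wh[neigh]).sum(axis=0))  # Eq. (6)
    return H_new

# Made-up toy graph: 3 fully connected nodes with 4-dimensional features.
rng = np.random.default_rng(2)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H, W, a = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), rng.normal(size=8)
print(gat_layer(A, H, W, a).shape)
```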

3.3 Semantic matching model

The semantic matching model calculates the similarity from the semantic level based on entities and relations in the vector space. Semantic matching models include RESCAL (Nickel et al. 2011), DistMult (Yang et al. 2015) and MLP (Multi-Layer Perceptron, Dong et al. 2014), etc.

RESCAL associates each entity with a vector to capture its latent semantics. Relationships are represented as matrices that model the pairwise interactions between latent factors, with the following score function.

$$f_{r}(h, t)={\textbf{h}}^{\top } {\textbf{M}}_{r} {\textbf{t}}=\sum _{i=0}^{d-1} \sum _{j=0}^{d-1}\left[ {\textbf{M}}_{r}\right] _{i j} \cdot [{\textbf{h}}]_{i} \cdot [{\textbf{t}}]_{j},$$
(7)

where \({\textbf{h}}, {\textbf{t}} \in {\mathbb {R}}^{d}\) are the vector representations of the head and tail entities and \({\textbf{M}}_{r} \in {\mathbb {R}}^{d \times d}\) is the matrix associated with the relationship.

DistMult simplifies RESCAL by restricting \({\textbf{M}}_{r}\) to the diagonal matrix. For each relation r, it introduces a vector embedding \({\textbf{r}} \in {\mathbb {R}}^{d}\) and requires that \({\textbf{M}}_{r}={\text { diag}}({\textbf{r}})\).

$$f_{r}(h, t)={\textbf{h}}^{\top } {\text {diag}}({\textbf{r}}) {\textbf{t}}=\sum _{i=0}^{d-1}[{\textbf{r}}]_{i} \cdot [{\textbf{h}}]_{i} \cdot [{\textbf{t}}]_{i}.$$
(8)

MLP is relatively simple: each relation (and entity) is associated with a vector; the vectors h, r, and t are combined in the input layer and mapped to a nonlinear hidden layer. The scoring function is as follows.

$$f_{r}(h, t)={\textbf{w}}^{\top } \tanh \left( {\textbf{M}}^{1} {\textbf{h}}+{\textbf{M}}^{2} {\textbf{r}}+{\textbf{M}}^{3} {\textbf{t}}\right) ,$$
(9)

where \({\textbf{M}}^{1}, {\textbf{M}}^{2}, {\textbf{M}}^{3} \in {\mathbb {R}}^{d \times d}\) are the first-layer weights and \({\textbf{w}} \in {\mathbb {R}}^{d}\) are the second-layer weights, all of which are shared across relations.
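For comparison, the sketch below evaluates the three score functions of Eqs. (7)-(9) on the same toy embeddings; all parameter values are made up for illustration.

```python
import numpy as np

def rescal_score(h, M_r, t):
    """RESCAL (Eq. 7): h^T M_r t with a full relation matrix M_r."""
    return h @ M_r @ t

def distmult_score(h, r, t):
    """DistMult (Eq. 8): RESCAL with M_r restricted to diag(r)."""
    return np.sum(h * r * t)

def mlp_score(h, r, t, M1, M2, M3, w):
    """MLP (Eq. 9): project h, r, t, apply tanh, then a linear output layer."""
    return w @ np.tanh(M1 @ h + M2 @ r + M3 @ t)

# Toy 4-dimensional embeddings and parameters (made-up values).
rng = np.random.default_rng(3)
d = 4
h, r, t, w = rng.normal(size=(4, d))
M_r, M1, M2, M3 = rng.normal(size=(4, d, d))
print(rescal_score(h, M_r, t), distmult_score(h, r, t), mlp_score(h, r, t, M1, M2, M3, w))
```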

3.4 Random walk

Random walk (RW) methods learn node embeddings by generating node sequences. Nodes that co-occur on random walk paths in the graph are expected to have similar embeddings. When sampling paths over a knowledge graph, the generated sequences are cross-combinations of nodes and relations (Chen et al. 2020e). DeepWalk (Perozzi et al. 2014) and node2vec (Grover and Leskovec 2016) are pioneering works that introduce deep learning techniques into network analysis to learn node embeddings. When node2vec is applied to the knowledge graph, the transition probability of reaching the next entity is calculated as follows.

$$P\left( e_{i+1} \mid e_{i}\right) =\left\{ \begin{array}{ll} \alpha _{p q}\left( e_{i}, e_{i+1}\right) \cdot w &{} \exists r \in R,\left( e_{i}, r, e_{i+1}\right) \in T, \\ 0 &{} \text{ otherwise } , \end{array}\right.$$
(10)

where \(e_{i}\) is the ith entity in a walk, for which the next entity \(e_{i+1}\) must be decided; if there is a relationship r between \(e_{i}\) and \(e_{i+1}\), the transition probability from \(e_{i}\) to \(e_{i+1}\) is evaluated, \(\alpha _{pq}\) is the node2vec search bias controlled by the return parameter p and the in-out parameter q, and w is the edge weight between entities \(e_{i}\) and \(e_{i+1}\).
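The sketch below shows how the biased transition distribution of Eq. (10) is commonly computed in node2vec-style walks; the toy neighborhood structure and the parameter values are made up for illustration.

```python
def transition_probs(e_i, prev, neighbors, weights, p=1.0, q=1.0):
    """Normalized node2vec-style transition probabilities from entity e_i (Eq. 10):
    alpha_pq(e_i, e_next) * w for every candidate connected by some relation,
    where alpha depends on the graph distance between the candidate and the
    previous node of the walk."""
    probs = {}
    for nxt in neighbors[e_i]:
        w = weights.get((e_i, nxt), 1.0)
        if nxt == prev:                      # return to the previous node
            alpha = 1.0 / p
        elif nxt in neighbors[prev]:         # stay close to the previous node
            alpha = 1.0
        else:                                # move outward
            alpha = 1.0 / q
        probs[nxt] = alpha * w
    z = sum(probs.values())
    return {e: v / z for e, v in probs.items()}

# Hypothetical toy neighborhood structure (entities connected by some relation).
neighbors = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
print(transition_probs("B", prev="A", neighbors=neighbors, weights={}, p=0.5, q=2.0))
```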

3.5 Multimodal knowledge graph

Knowledge graph techniques have been widely used to deal with structured and textual data, but relatively little attention has been paid to unstructured data such as images, and few effective technical means exist to extract structured knowledge from image data. Therefore, multimodal knowledge graphs are proposed to describe entities under multiple modalities (e.g., the image modality). Multimodal knowledge graphs can provide sufficient visual information for entities, thus allowing EA on a larger scale. Ultimately, multimodal knowledge graphs enable existing models to perform better because text and image features are considered together.

Although multimodal data are heterogeneous in their underlying representations, different modal data of the same entity are unified in their high-level semantics. Therefore, fusing multimodal data is helpful for language representation models. There are still few studies on multimodal knowledge graphs; several important open-source multimodal knowledge graphs include IMGpedia (Ferrada et al. 2017), MMKG (Liu et al. 2019), and Richpedia (Wang et al. 2019).

4 Representation learning-based EA framework

As shown in Fig. 1, we design a typical representation learning-based knowledge graph EA framework. The framework includes an information aggregation module, an alignment module, and a post-alignment module. When aligning entities, two knowledge graphs are first input and seed data are collected for training. Because the quality of the original data directly affects the final alignment results, the input data are often preprocessed.

Fig. 1 A basic framework for EA based on representation learning

In the information aggregation module, the embedding representation first needs to be initialized, generally by random initialization or entity name-based initialization. Based on the initial embedding, the global structure embedding part updates the entity embeddings using translation-family models, deep models, or path sequence models. The topological connections of the knowledge graph provide only global structure information, while the local semantic information of entities, such as relations, attributes, summaries, contexts, names, types, ontologies, and images, also has a positive impact on EA. Therefore, many methods fuse local semantic information to improve the alignment. The effect of the model can be enhanced by iterative co-training between global structural embedding and local semantic information, or by integrating global and local information to enrich entity features. The information aggregation module produces the final entity embeddings, which serve as input to the alignment module.

In the alignment module, the embeddings of the source and target knowledge graphs are first unified into one vector space using a combination method. Then the distance between source and target entity vectors is calculated based on the final embedding representations of the entities. Commonly used metrics include Euclidean distance, Manhattan distance, Cosine distance, Cross-Domain Similarity Local Scaling (CSLS), and Edit distance. After the distance calculation between entity vectors, the entity similarity matrix is obtained. Some studies design optimization strategies and non-alignable entity prediction to improve the accuracy of alignment. Alignment inference then follows either a global or a local strategy. Finally, the alignment module outputs the alignment result.

In the post-alignment module, semi-supervised strategies are used to iteratively generate new seed pairs and expand the size of the seed set.

5 Information aggregation module

5.1 Embedding initialization method

5.1.1 Random initialization

Most current EA methods, such as GCN-Align (Wang et al. 2018), BootEA (Sun et al. 2018) and COTSAE (Yang et al. 2020), depend on the graph structure to form the initial entity vectors. The goal of graph embedding is to obtain a low-dimensional vector representation of a high-dimensional graph, and the structure embedding needs to be initialized before global structure embedding is performed. Random initialization of entity embeddings is the easiest and most convenient option, but it may lead to local optima and produce low-quality embeddings. In real-world knowledge graphs, most entities have low node degrees and little structural information, so using only structural information to initialize entity embeddings may limit the effectiveness of EA models.

5.1.2 Vector initialization based on entity names

The entity name is considered a special attribute that is independent of the node degree, and it is an important clue in determining whether two entities are equivalent. If entity names are available in the KG, the entity name vector can be used as the initial feature vector of the entity. GMNN (Xu et al. 2019) uses a word-based LSTM to convert the entity name into its initial feature vector. To better initialize the model, RDGCN (Wu et al. 2019a), UED (Luo and Yu 2022), RNM (Zhu et al. 2021), EAMI (Zhu et al. 2023) and RAGA (Zhu et al. 2021) use GloVe embeddings of entity names for model initialization. They translate non-English entity names into English via Google Translate and initialize entity features with pre-trained English word vectors.
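A minimal sketch of this kind of name-based initialization is shown below: each entity feature is the average of the pre-trained word vectors of its (translated) name tokens. The vocabulary here is a made-up stub; in practice the vectors would be loaded from a GloVe file.

```python
import numpy as np

def init_entity_features(entity_names, word_vectors, dim=300):
    """Initialize entity features by averaging the pre-trained word vectors of
    the tokens in each entity name; names with no known token stay all-zero."""
    feats = np.zeros((len(entity_names), dim))
    for i, name in enumerate(entity_names):
        vecs = [word_vectors[w] for w in name.lower().split("_") if w in word_vectors]
        if vecs:
            feats[i] = np.mean(vecs, axis=0)
    return feats

# Made-up stub standing in for pre-trained GloVe vectors.
word_vectors = {"yuan": np.ones(300), "longping": np.full(300, 0.5)}
X0 = init_entity_features(["Yuan_Longping", "China"], word_vectors)
print(X0.shape)  # the initial feature matrix fed to the structure embedding model
```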

5.2 Global structure embedding method

In recent years, structure embedding methods for EA have mainly been divided into three categories: structural embedding methods based on translation models, on deep models, and on path sequence models. The characteristics of the three embedding methods are compared in Table 1.

Table 1 Comparative analysis of global structure embedding methods

5.2.1 Structural embedding based on translation model

Early EA approaches based on representation learning rely on TransE to capture the structural information of the knowledge graph. They directly use TransE for structural embedding, modeling in-graph relationships and treating relationships as translation vectors between entities; examples include COTSAE (Yang et al. 2020), MEEA (Chen et al. 2021b), DAEA (Sun et al. 2020), MMEA (Chen et al. 2020c), and JSAE (Munne and Ichise 2020).

Some approaches improve TransE. For example, Trisedya et al. (2019) add weights to TransE, which enables aligned triples to receive higher attention and improves the alignment effect. Because the contribution of different neighbors to EA varies, NAEA (Zhu et al. 2019) uses a neighborhood-aware attention mechanism on top of TransE to aggregate entity neighbors of different importance and obtain a neighborhood-level entity representation. RTEA (Jiang et al. 2019) combines a string similarity-based approach with an embedding-based approach represented by TransE to refine the structure embedding. Based on RTEA, ESEA (Jiang et al. 2022a) uses an embedding-based model to filter weakly correlated entities and then explores the final alignment using a symbol-based approach. Based on TransE, AMKE (Shen et al. 2022) sets different margin hyperparameters for different relations and adapts the learned margin parameter.

Variants of TransE are also used to embed structures. TransE only represents one-hop relationships between entities, ignoring important multi-hop relationship information, and its modeling of complex relationships is limited, so IPTransE (Zhu et al. 2017) uses PTransE (Lin et al. 2015b) for structural embedding. Ps-TransC (Kang et al. 2020) uses TransC for structure embedding; it divides the knowledge graph into an ontology layer and an instance layer, where entities in the ontology layer are considered as classes. TransC models all the triples of each class as a sphere, and all the instances of that class are contained within the sphere.

5.2.2 Structural embedding based on deep model

Although translation models transmit the information of neighbors to the central entity, the neighbors are only captured implicitly. Therefore, to fuse the information of neighbors into the entity embedding, structural embedding methods based on the GNN family of models have been proposed. These methods consider complex parameters and relations, so they can learn more expressive embeddings (Yan et al. 2020). Deep methods usually stack more than two GNN layers to learn entity representations, where the nodes of the first GNN layer are randomly initialized and the node representations of the last GNN layer are the final entity representations. Representative models for structure embedding based on deep models include REA (Pei et al. 2020) and HyperKA (Sun et al. 2020).

GCN follows a neighborhood aggregation scheme that iteratively updates the representation of each entity node. Representative works include RNM (Zhu et al. 2021), NMN (Wu et al. 2020), and the work of Xiong and Gao (2019). The number of GCN layers has an impact on EA: HMAN (Yang et al. 2019) stacks multiple layers of GCNs to collect multi-hop neighbor information. Directly using multi-layer GCNs to aggregate information leads to the propagation of noisy information, so AliNet (Sun et al. 2020) uses a gating mechanism to aggregate multi-hop neighbor information. SSP (Nie et al. 2020) and HGCN (Wu et al. 2019b) use GCN to explicitly encode structure information, and highway gates are used to control the amount of neighborhood information passed to nodes. Tam et al. (2021) adjust the number of GCN layers to limit the noise propagated from previous layers as well as topology loss. To address the over-smoothing problem caused by an increasing number of GCN layers, RAC (Zeng et al. 2021) uses both approximate personalized propagation of neural predictions and GCN models to capture structure information. In addition to the over-smoothing problem, EchoEA (Lin et al. 2021) also addresses the overfitting problem by introducing a four-level (entity-level, feature-level, entity-to-relationship, and relationship-to-entity) attention mechanism to further encode entity features.

GAT is also used to learn global structural embeddings of knowledge graphs. To effectively utilize pre-aligned links in the knowledge graph, CAECGAT (Xie et al. 2020) and DuGa-DIT (Xie et al. 2022) share cross-knowledge-graph entity embeddings and update the embeddings using a gate mechanism. By stacking multiple attention layers, the models can learn multi-hop information. In addition, TTEA (Zhang et al. 2023) uses GAT in the last part of the model to re-aggregate the information of neighbors.

5.2.3 Structural embedding based on path sequence model

Translation models and deep models do not fully exploit the long-term structural dependencies among entities and suffer from low expressiveness and inefficient information dissemination. To better explore the structural information among entities, SAEA (Chen et al. 2020e) designs a degree-aware random walk method to generate heterogeneous sequence data and capture the long-term structural dependencies among entities. Deep paths carry more relational dependencies than single triples, and cross-knowledge-graph paths are used as bridges between knowledge graphs to transfer information. RSNs (Guo et al. 2019) and the work of Chen et al. (2020f) apply a biased random walk path sampling method to effectively explore deep and cross-KG relational paths for embedding learning.

5.3 Combination of global structure and local semantic

5.3.1 Collaborative training

In the EA framework, different modules can be trained collaboratively, with a positive influence between modules. The first category is the co-training of relationship alignment and entity alignment. For example, RNM (Zhu et al. 2021) adds the relationship information between entities to the neighbor matching model and designs a semi-supervised framework so that entity alignment and relationship alignment can enhance each other. HGCN (Wu et al. 2019b) first uses the entity embeddings learned by GCN to approximate the relationship representation; the relational representation is then merged into the entities to iteratively learn better representations.

Some models iteratively perform attribute alignment and entity alignment, for example IMUSE (He et al. 2019), COTSAE (Yang et al. 2020), and NovEA (Sun et al. 2020). In each iteration, IMUSE first performs EA based on attribute values to build a matching set of entity pairs and then performs attribute alignment to build a matching set of attribute pairs. COTSAE learns entity embeddings using a collaborative training framework that alternates between a TransE component and a pseudo-Siamese network. NovEA assumes that all common attributes of two entities have the same weight and uses the aligned entities for attribute alignment; when two entities have no common attributes, aligned entity pairs can be used to find more possible aligned attribute pairs.

In EA methods that apply adversarial learning, the generation and discrimination modules are trained collaboratively. REA (Pei et al. 2020) first trains the noise-aware module to update the entity embeddings, and then uses the learned embeddings to optimize the noise-detection module; the trust scores provided by the noise detection module are fed back in the next iteration to train the noise-aware entity alignment. SEA (Pei et al. 2019b) uses an adversarial training model to iteratively refine the knowledge graph embedding with respect to entity degree differences; the iterative training stops considering the effect of degree on embeddings when the discriminator can no longer distinguish entities based on degree information.

The knowledge graph completion module and the entity alignment module can also iterate over each other. ALIGNKGC (Singh et al. 2021) uses ComplEx to initialize the knowledge graph completion task and define the triple scores, which ensures that two aligned entities share the same embedding vector. Entity alignment allows the knowledge graph to obtain more facts, and high-confidence completion predictions in turn facilitate EA.

5.3.2 Integration

Global structure and local semantics complement each other, and combining them usually provides additional help in obtaining a better entity representation. Integrating multiple knowledge representations using vector concatenation enhances the complementarity of different information, thus improving the accuracy of EA tasks. For example, the unimodal EA model GCN-Align (Wang et al. 2018) concatenates entity (structure) embeddings and attribute embeddings according to weights. FuzzyEA (Jiang et al. 2022b) fuses structural embeddings and local semantic embeddings based on Dempster’s combination rule. The multimodal EA model MMEA (Chen et al. 2020c) migrates the embeddings of multimodal knowledge, including relational, visual, and numerical data, from separate spaces into a common space and sets a proportional hyperparameter for each type of knowledge.

In addition, some methods fuse global and local information at the matrix level. For example, CEA (Zeng et al. 2020) and CUEA (Zhao et al. 2022) first compute the global similarity matrix based on embeddings, then compute the local name-based semantic similarity matrix, and finally combine the two by a weighted sum to fuse the global structure and local semantic information.
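A minimal sketch of this matrix-level fusion is given below; the weight \(\alpha\) and the toy similarity matrices are made-up illustrations rather than the settings of any specific model.

```python
import numpy as np

def fuse_similarity(sim_struct, sim_name, alpha=0.7):
    """Matrix-level fusion: weighted sum of a structure-based similarity matrix
    and a name-based semantic similarity matrix."""
    return alpha * sim_struct + (1.0 - alpha) * sim_name

# Toy 3x3 similarity matrices between source and target entities (made-up values).
rng = np.random.default_rng(4)
sim_struct, sim_name = rng.random((2, 3, 3))
fused = fuse_similarity(sim_struct, sim_name, alpha=0.7)
print(fused.argmax(axis=1))  # best target for each source entity under the fused similarity
```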

5.4 Local semantic information

The local semantic information incorporated into entities can complement the structural embedding of the knowledge graph and benefit EA. In terms of its form, the local semantic information of entities mainly includes relations, attributes, entity summaries/descriptions, contexts, entity names, types, ontologies, and images. Tables 2 and 3 show the advantages and disadvantages of unimodal and multimodal local semantic information, respectively.

Table 2 Comparative analysis of various unimodal local semantic information
Table 3 Comparative analysis of various multimodal local semantic information

5.4.1 Incorporating relational semantic

To accurately disambiguate entities with similar structures, relational semantics can be used to refine the structure-based representation so that similar entities can be distinguished. The relations in relational triples are connected by head and tail entities, so a relational embedding can be approximated by combining the averaged representations of its head and tail entities; representative models include HGCN (Wu et al. 2019b), RDGCN (Wu et al. 2019a), RNM (Zhu et al. 2021), and AVR-GCN (Ye et al. 2019). RREA (Mao et al. 2020a) uses multilayer neural networks to learn entity embeddings: for different relationship types, the same entity is embedded in different relationship spaces, and the embeddings of the same entity in different relationship spaces are then aggregated into one entity embedding. The diversity of relationship structures poses challenges for relationship representation, so SREA (Zhang et al. 2024) constructs weighted line graphs to model diverse relational structures and learns relational representations independently of entities.

Structural embedding models based on TransE or its variants are unsuitable for encoding multi-mapping relationships; for example, a movie made by a director has multiple actors. For this reason, Shi and Xiao (2019) calibrate the embeddings of different KGs with a small set of pre-aligned seeds to encode multi-mapping relations via dot product scaling. Contextualized relational representation improves on the above approach by arguing that relations occurring in different entity contexts should have different embeddings, regardless of whether they have the same surface form. For example, SSP (Nie et al. 2020) computes relational embeddings based on adjacent entities and the relations themselves. The approach is intuitive, and SSP captures the semantic differences between relations even if they have the same surface form but occur in different contexts. RpAlign (Huang et al. 2022) treats relationships as rotational operations between entities and can handle three relationship patterns: symmetry/antisymmetry, inversion, and composition; RpAlign can thus learn hybrid knowledge graph embeddings.

Directed edges force adjacent information to accumulate only along the direction of flow, so some studies, such as MRAEA (Mao et al. 2020b) and SHEA (Yan et al. 2021), create inverse relations. ESEA (Jiang et al. 2022a) uses a symbol-based method to align relationships, and the relationship seeds further affect the alignment of multiple entities.

5.4.2 Incorporating entity attribute

Knowledge graphs contain attribute triples that can provide valid information for EA. Sun et al. (2017) use the idea of Skip-gram to predict attribute relevance and refine it by clustering entities with higher attribute relevance. Zhang et al. (2017) define different feature functions based on different features, formally showing the correlation between attributes and discovering more attribute mappings.

Not all similarities between attributes are beneficial for detecting aligned entities. Therefore, to automatically find useful attributes for EA, EPEA (Wang et al. 2020) uses a CNN model to encode the sparse similarity matrix into a short, dense vector that captures the attribute similarity of two entities. AttrGNN (Liu et al. 2020) divides the knowledge graph into four subgraphs according to attribute value categories (name attribute, text attribute, numeric attribute, and no attribute) and uses Bidirectional Encoder Representations from Transformers (BERT) to encode the attribute values. Tang et al. (2020) similarly use BERT to encode entity attribute values and compute the similarity matrix.

A simpler way to apply attribute information is to treat attribute triples in the same way as relationship triples. Haihong et al. (2020) and Trisedya et al. (2019) use TransE to learn attribute embeddings and then jointly use attribute embeddings and relational triple embeddings for alignment. EASA (Huang and Luo 2020) generates semantic aggregations of entities from different attributes and attribute values, and adds attribute attention to distinguish the different roles of different attributes during EA. Wang et al. (2018) and Liu et al. (2021) use GCN to generate a structural feature vector and an attribute feature vector for each entity, where Wang et al. (2018) use one-hot encoding of the most frequently occurring attributes of each entity as the attribute feature vector and then combine it with the structural feature vector for EA. However, selecting the most frequently occurring attributes leads to too little differentiation among entities, so Pang et al. (2019) discard the most frequent attributes to ensure both differentiation among entities and that the selected entities are not long-tailed entities. Considering that the distance between attributes and attribute values affects the performance of EA, MultiKE (Zhang et al. 2019) integrates attributes and attribute values into the same matrix when processing attribute information and then feeds it into a CNN for feature extraction.

The number of entity attributes also contributes to EA to some extent. He et al. (2019) measure entity similarity by counting the number of identical attributes between entities. Sorting attributes by their counts is equivalent to setting weights for them, so Xiong and Gao (2019) arrange attributes in descending order of their counts to improve the embedding of attribute information. Similarly, Yang et al. (2019) use the E-CBOW model for embedding attribute information together with an attention mechanism. The influence of different attributes on EA may differ significantly, so Yang et al. (2020) propose a joint attention method that calculates the attention of attribute values using attribute types, which share an attention weight with their attribute values, and captures the forward and reverse sequence information of attribute values using a Bi-GRU. The self-attention mechanism plays an important role in distinguishing similar entities; CG-MuAlign (Zhu et al. 2020) and LinkNBed (Trivedi et al. 2018) also use attention mechanisms, where LinkNBed first initializes attribute embeddings, then aggregates related embedding vectors to enrich entity and relationship embeddings through attention, and finally captures the relational interaction information between two entities using the entity and relationship embedding representations.

An EA model needs to differentiate candidate target entities when searching for the target entity of a source entity. Yan et al. (2020) learn entity topics from attributes through BTM4EA, which uses high-level entity semantics for attribute modeling to filter out weakly related entities. In addition, some scholars automatically generate optimal attributes based on data features to constrain the results of attribute triple alignment; for example, NovEA (Sun et al. 2020) selects optimal attributes as candidate values based on decision trees. Guan et al. (2019) apply a probabilistic model to iteratively update the embeddings of attributes and attribute values when learning from attribute triples.

5.4.3 Incorporating entity summary/description

Many entities do not have attribute values, and summary embeddings can be used to reduce such discrepancies. Wikidata (Vrandecic and Krötzsch 2014) provides a summary text description of each entity, containing basic information about the entity. Wang et al. (2018) use the first paragraph of article data as entity descriptions, using external resources to enrich the entity embedding.

Munne and Ichise (2020), Yang et al. (2019) and EASAE (Munne and Ichise 2023) use BERT to generate a set of word vectors from the summary of each particular entity to obtain entity embeddings. Chen et al. (2018) use a multilingual word embedding pre-training corpus, and convert each entity description into a vector sequence that is input to the description encoder. A GRU incorporating a self-attentive mechanism is used to highlight sentence parts with important shared information and output the final description embedding representation.

In addition to applying a single embedding method, Xu et al. (2020) propose two text embedding models to embed the description of each entity. The Cross-TextGCN model uses GCN to encode the entities by transferring semantics between words and entities in the knowledge graph. The Cross-TextMatch model uses BiLSTM to encode entity descriptions.

5.4.4 Incorporating entity context

Entity context contains a large amount of information related to entities and relationships in the knowledge graph, with clear information sources and little noise, so fusing entity context information can enhance the knowledge representation learning ability. Yang et al. (2019) utilize contextual information to enhance the accuracy of EA and add Jaccard coefficients to strengthen the logic of contextual information. The contexts of two equivalent entities are usually similar, and the stronger the contextual association of a neighbor entity with the central entity, the more alignment cues this neighbor may provide. Therefore, Wang et al. (2018) use the same encoder to independently embed each context and then generate the context vector. TransEdge (Sun et al. 2019) studies multiple relationships and uses contextual projections to optimize the EA task under the same relationship type, facilitating the propagation of information in the graph; it extends the relationship representation based on the TransE embedding structure while also using contextual projections to refine the embedding. Given that the TransE model cannot capture neighbor information, FuAlign (Wang et al. 2023) proposes a message propagation scheme to aggregate contextual information between an entity and its neighbors. DAEA (Zhang et al. 2021) generates multiple random walks for each entity to be aligned to capture its 10-hop neighborhood information and long sequence context to guide EA. JEANS (Chen et al. 2020a) performs a grounding process that links entities and text tokens in the same language to a shared vocabulary and thus discovers enough entity contexts for EA. While the above approaches are based on the contexts of neighbors or paths, IMEA (Xin et al. 2022) utilizes two Transformers to encode multiple contexts, including neighborhood subgraphs and paths.

5.4.5 Incorporating entity name

Given two entities, comparing their names is the simplest way to determine whether they are identical. Entity name embeddings can be used to initialize the feature matrix, or they can serve as information enhancement signals for EA. Representing entity names with average word vectors makes them easy to use; for example, Zeng et al. (2020) use a weighted average word embedding to represent the semantic information of entity names and integrate the name features with the separately learned structural information at the similarity-matrix level. Although representing entity names as averaged word vectors is convenient, the averaging process inevitably causes a certain degree of semantic loss and thus cannot fully represent the semantic information of entity names. For this reason, Wei-Xin Zeng et al. (2020) propose a re-ranking model based on word mover's distance: on the generated entity ranking results, the word mover's distance model is used to further mine entity name information and combine it with structural information. To avoid the out-of-vocabulary problem, COEA (Lin et al. 2023) combines word-level embeddings and character embeddings to perform entity alignment.

5.4.6 Incorporating type information

When two KGs differ in structural sparsity and domain features, significant alignment errors can occur. Entity type information helps resolve some ambiguity and vagueness issues. Therefore, JTMEA (Lu et al. 2021) combines the similarity of entity vectors with entity type matching, in which type features are first extracted from entities of the same type and type matching constraints are then applied to the comparison of candidate aligned entities. To make full use of entity type information, JETEA (Song et al. 2021) utilizes an encoding function to obtain the type features of entities for type matching, and the common features of entities are extracted as the representation of the type information. TypeEA (Ge et al. 2023) considers entity type information for entity alignment and proposes a semantic matching-based type embedding model that utilizes a bilinear product score function to capture associations between types. To account for the diversity of entity roles, TTEA (Zhang et al. 2023) uses triple-aware entity augmentation to model the diverse roles of triple elements, using a nonlinear mapping to generate type embeddings from semantic embeddings.

5.4.7 Incorporating ontology information

Incorporating ontology information contributes to solving semantic heterogeneity problems and also enhances the generality and extensibility of entity alignment. OntoEA (Xiang et al. 2021) claims to be the first to perform entity alignment by combining ontology information with embedding techniques, utilizing the relative positions of classes and their membership relations with entities. In addition, OTIEA (Zhang et al. 2023) uses an attention mechanism and designs an ontology-pair enhancement approach in the encoding process to capture complex intrinsic correlations through ontology information, while complementing the semantic triples with ontology information and introducing entity role features in the decoder.

5.4.8 Incorporating image

The relational structure information in knowledge graphs may lead to ambiguity. Image features correspond to a unified visual concept, so images can be a good source of EA information.

To extract visual features, MMEA (Chen et al. 2020c) and ACK-MMEA (Li et al. 2023) vectorize images and learn image embeddings using the VGG16 model, in preparation for subsequent multimodal knowledge fusion. To establish a direct linkage with text entities, ITMEA (Wang et al. 2020) also uses VGGNet for image feature projection, mapping 4096-dimensional image feature vectors into n-dimensional entity embedding vectors. EVA (Liu et al. 2021) uses ResNet-152 as the feature extractor for all images: for each image, a forward pass is done and the output of the last layer is taken as the image representation, which is then sent through a trainable feedforward layer to produce the final image embedding. HMEA (Guo et al. 2021) models and integrates multimodal information in hyperbolic space and uses DenseNet to learn image embeddings. IKRL (Xie et al. 2017) uses attention to construct image-based representations that jointly consider all image instances of each entity. PoE (Liu et al. 2019) combines multimodal features and measures the plausibility of facts by matching the underlying semantics of entities and mining the relationships contained in the embedding space; entity embeddings are learned while computing the fact scores under each modality. To allow the visual encoder to have different receptive fields and adapt to images from different domains, PSNEA (Ni et al. 2023) utilizes an Inception-based network to extract the visual features of entities. PCMEA (Wang et al. 2024) uses a pretrained visual model (PVM), whose output is passed through a forward propagation layer to obtain the visual embedding representation.

6 Entity alignment module

6.1 Combination method

Embedding-based EA needs to use the distance between entity vectors to determine the probability of alignment. Therefore, different knowledge graphs must be embedded into a unified vector space. There are two general methods to reconcile knowledge graph embeddings.

(1) Transformation

Transformation embeds the knowledge graphs into different vector spaces and transforms the embedding of one knowledge graph into the vector space of the other using a linear transformation matrix.

(2) Sharing

There are three ways to achieve sharing: (a) letting the seed entity pairs in the knowledge graphs share the same embedding when creating the model (Sun et al. 2017); (b) using pre-aligned entity pairs to generate new cross-knowledge-graph triples that serve as bridges between different knowledge graphs, e.g., given a seed entity pair (h1, h2) and a triple (h1, r, t1), the swapping method generates a new triple (h2, r, t1) (Mao et al. 2020a), as sketched below; (c) directly minimizing the distance between the vectors of each pre-aligned entity pair (Yan et al. 2020).
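A minimal sketch of the triple-swapping idea in (b) is given below; the triple and seed pair are the hypothetical ones from the text.

```python
def swap_triples(triples, seed_pairs):
    """Sharing via triple swapping: for each seed pair (h1, h2), every triple
    containing h1 (as head or tail) yields a new triple with h1 replaced by h2,
    bridging the two knowledge graphs."""
    seed = dict(seed_pairs)
    new_triples = set()
    for h, r, t in triples:
        if h in seed:
            new_triples.add((seed[h], r, t))
        if t in seed:
            new_triples.add((h, r, seed[t]))
    return new_triples

# The example from the text: seed pair (h1, h2) and triple (h1, r, t1).
print(swap_triples({("h1", "r", "t1")}, {("h1", "h2")}))  # {('h2', 'r', 't1')}
```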

6.2 Similarity metric

In the process of EA, the similarity between entities needs to be measured, and the common similarity measures are as follows.

(1) Euclidean distance

Euclidean distance is the distance between two points in a multidimensional space. The formula for calculating the Euclidean distance between two points \((x_{1},y_{1})\) and \((x_{2},y_{2})\) in the two-dimensional plane is defined as:

$$d=\sqrt{\left( x_{1}-x_{2}\right) ^{2}+\left( y_{1}-y_{2}\right) ^{2}}.$$
(11)

(2) Manhattan distance

Manhattan distance is the sum of the absolute differences between the coordinates of two points in the standard coordinate system, and is defined as:

$$d=\left| x_{1}-x_{2}\right| +\left| y_{1}-y_{2}\right| .$$
(12)

(3) Cosine distance

Cosine distance measures the difference between two vectors using the cosine of the angle between them in vector space. The closer the cosine is to 1, the closer the angle is to 0 degrees, i.e., the more similar the two vectors are. The formula is defined as:

$$d(A, B)=\frac{A \cdot B}{\Vert A\Vert \times \Vert B\Vert }=\frac{\sum _{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum _{i=1}^{n} A_{i}^{2}} \times \sqrt{\sum _{i=1}^{n} B_{i}^{2}}}.$$
(13)

(4) Cross-Domain similarity local scaling (CSLS)

CSLS deals with the hubness phenomenon in high-dimensional space, i.e., the existence of dense regions in vector space where some points are the nearest neighbors of many other points. Whereas the previous approaches use cosine distance to select nearest neighbors directly, CSLS is calculated as:

$${\text {CSLS}}\left( W x_{s}, y_{t}\right) =2 \cos \left( W x_{s}, y_{t}\right) -r_{\textrm{T}}\left( W x_{s}\right) -r_{\textrm{S}}\left( y_{t}\right) ,$$
(14)

where \(r_{\textrm{T}}\left( W x_{s}\right)\) is the average cosine similarity between \(W x_{s}\) and its K nearest neighbors in the target language, and \(r_{\textrm{S}}\left( y_{t}\right)\) is defined analogously on the source side.

(5) Edit distance

Some works use Edit distance to calculate the similarity of strings, such as entity name strings. Edit distance measures the difference between two character sequences: the Edit distance between two words is the minimum number of single-character edit operations (insertion, deletion, or replacement) required to convert one word into the other, and it is defined as:

$$\begin{aligned} d[i, j]= & {} \min \left\{ \begin{array}{l} d[i, j-1]+1, \\ d[i-1, j]+1, \\ d[i-1, j-1]+c\left( s_{1}[i], s_{2}[j]\right) , \end{array}\right. \end{aligned}$$
(15)
$$\begin{aligned} c\left( s_{1}[i], s_{2}[j]\right)= & {} \left\{ \begin{array}{l} 1, s_{1}[i] \ne s_{2}[j], \\ 0, s_{1}[i]=s_{2}[j], \end{array}\right. \end{aligned}$$
(16)

where \(s_{1}[i]\) is the ith character in string \(s_{1}\) and \(s_{2}[j]\) is the jth character in string \(s_{2}\).
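The sketch below implements three of these metrics (cosine similarity, CSLS over full embedding matrices, and edit distance); it is a minimal illustration, and the value of K and the vectorized CSLS form are assumptions rather than the settings of any specific model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity (Eq. 13)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def csls_matrix(X, Y, k=10):
    """CSLS scores (Eq. 14) between all source vectors X and target vectors Y.
    r_T(x) is the mean cosine similarity of x to its K nearest targets, and
    r_S(y) the mean cosine similarity of y to its K nearest sources."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T                                   # pairwise cosine similarities
    k = min(k, cos.shape[0], cos.shape[1])
    r_T = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # per-source hubness term
    r_S = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # per-target hubness term
    return 2 * cos - r_T[:, None] - r_S[None, :]

def edit_distance(s1, s2):
    """Levenshtein edit distance (Eqs. 15-16) via dynamic programming."""
    d = np.zeros((len(s1) + 1, len(s2) + 1), dtype=int)
    d[:, 0] = np.arange(len(s1) + 1)
    d[0, :] = np.arange(len(s2) + 1)
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i, j] = min(d[i, j - 1] + 1, d[i - 1, j] + 1, d[i - 1, j - 1] + cost)
    return d[-1, -1]

# Made-up toy embeddings and entity name strings.
rng = np.random.default_rng(5)
X, Y = rng.normal(size=(2, 6, 4))
print(csls_matrix(X, Y, k=3).shape)             # (6, 6) CSLS score matrix
print(edit_distance("alignment", "alinement"))  # 2
```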

6.3 Alignment optimization strategy

Each element in the entity similarity matrix represents the similarity between two entities. If fine-grained features can be incorporated into the entity similarity matrix, the accuracy of the EA model will improve. According to prior knowledge, EA is a bidirectional matching problem between two knowledge graphs. Therefore, RAGA (Zhu et al. 2021) calculates a fine-grained similarity matrix by summing the weights of each entity aligned in both directions; specifically, a softmax operation is applied to both the rows and the columns of the initial entity similarity matrix. To alleviate the uncertainty and ambiguity of the EA process, FuzzyEA (Jiang et al. 2022b) models the uncertainty based on intuitionistic fuzzy sets. Guo et al. (2022) propose a deep reinforcement learning-based framework that transforms the EA problem into a sequential decision-making task and can be adapted to most embedding-based EA models. DATTI (Mao et al. 2022) focuses on the decoding process and uses adjacency tensor and Gramian tensor isomorphism equations to enhance the decoding power, bringing large performance improvements at little additional time cost.

6.4 Non-alignable entity prediction

Most existing studies assume that, given a test source entity, an equivalent target entity can be found for it. However, in realistic knowledge graphs, some entities have no counterpart aligned with them (Luo and Yu 2022). SoTead (Luo et al. 2022) and WOGCL (Xu et al. 2023) call such unmatched entities dangling entities and convert knowledge graph EA into an optimal transport problem: based on constructed pseudo entity pairs, contrastive metric learning is performed to calculate the transport cost of entity pairs, and virtual entities are finally matched to the dangling entities. MHP (Liu et al. 2022a) also uses optimal transport for global higher-order similarity computation, where the dangling entities correspond to the part in which the source and target entity embeddings differ; MHP considers multi-order neighbor entities when performing local similarity calculations. UEA (Zeng et al. 2021) uses a thresholded bidirectional nearest neighbor strategy to generate EA results, and entities left unmatched by this process are considered non-alignable. Building on UEA, CUEA (Zhao et al. 2022) takes into account that different pseudo-labeled data have different characteristics and uses confidence levels to measure the likelihood that an entity pair is true.

6.5 Alignment inference strategy

Alignment inference strategies are mainly divided into two categories: global alignment and local alignment. Table 4 summarizes and analyzes the two inference strategies.

Table 4 Alignment comparison analysis

6.5.1 Global alignment

To constrain EA to be one-to-one and exploit the interdependence between alignment decisions, some studies impose a one-to-one matching constraint. CEA (Zeng et al. 2020) and RAGA (Zhu et al. 2021) use the deferred acceptance algorithm to find stable matching results for two sets with an equal number of entities: no two entities from different sets prefer each other over the matches already assigned to them. The deferred acceptance algorithm guarantees that a solution can be found in \(O(N^{2})\) time.

Furthermore, the global EA task can be transformed into a maximum-weight bipartite graph matching problem. The Hungarian algorithm provides an exact solution to this task assignment problem and guarantees that a solution can be found in \(O(N^{4})\) time. GM-EHD-JEA (Xu et al. 2020) and LatsEA (Chen et al. 2021c) transform the EA problem into a task assignment problem, which is essentially a basic combinatorial optimization problem whose exact solution can be found by the Hungarian algorithm. SEU (Mao et al. 2021) combines the Hungarian algorithm with the Sinkhorn operation; models that combine the Hungarian algorithm with other operations perform better than those using the Hungarian algorithm alone.
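A minimal sketch of this assignment view of global EA, using SciPy's Hungarian-algorithm implementation on a toy similarity matrix (the matrix and names are illustrative and not taken from any particular model):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def global_alignment(sim):
    """Treat EA as a task assignment problem: find the one-to-one matching
    that maximizes total similarity (Hungarian / Kuhn-Munkres algorithm)."""
    # linear_sum_assignment minimizes cost, so negate the similarity matrix
    row_idx, col_idx = linear_sum_assignment(-sim)
    return list(zip(row_idx.tolist(), col_idx.tolist()))

sim = np.array([[0.9, 0.4, 0.1],
                [0.8, 0.7, 0.2],
                [0.3, 0.6, 0.5]])
# Greedy local decisions would map both source 0 and source 1 to target 0;
# the global one-to-one constraint resolves the conflict.
print(global_alignment(sim))  # [(0, 0), (1, 1), (2, 2)]
```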

Although the above works achieve global EA by imposing a one-to-one constraint on the EA process, they still do not adequately model the potential interdependencies. CEAFF (Zeng et al. 2021) therefore investigates the dynamic properties of the decision process and provides a reinforcement learning-based model that aligns entities collectively; in this framework, coherence and exclusivity constraints are designed to characterize interdependencies and to restrict the collective alignment. UED (Luo and Yu 2022) formulates EA as an optimal transport problem and finds the optimal global alignment by minimizing the total transportation distance.
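The optimal-transport view can be illustrated with a simple Sinkhorn iteration: scaling the similarity matrix and alternately normalizing rows and columns yields an approximately doubly stochastic plan from which a soft one-to-one alignment can be read off. The temperature `tau` and the iteration count are illustrative choices, not parameters of any cited model.

```python
import numpy as np

def sinkhorn(sim, tau=0.05, n_iter=50):
    """Turn a square similarity matrix into an approximately doubly stochastic
    transport plan by alternating row and column normalization."""
    K = np.exp(sim / tau)                       # temperature-scaled similarities
    for _ in range(n_iter):
        K = K / K.sum(axis=1, keepdims=True)    # normalize rows
        K = K / K.sum(axis=0, keepdims=True)    # normalize columns
    return K

sim = np.random.rand(5, 5)
plan = sinkhorn(sim)
alignment = plan.argmax(axis=1)                 # read off a hard alignment if needed
print(alignment)
```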

6.5.2 Local alignment

Each element in the entity similarity matrix reflects the distance between a source entity and a target entity in the vector space. After obtaining the embedding-based similarity matrix, EA enters the alignment decision stage. Most current embedding-based EA methods, such as RNM (Zhu et al. 2021), HGCN (Wu et al. 2019b), and IMUSE (He et al. 2019), use an independent decision strategy to generate alignment results, applying a greedy search to find a target entity for each test source entity. Specifically, given the vector representations of the knowledge graphs and a distance metric, the alignment model computes, for each source entity, the distance between its vector and all target entity vectors to find the most plausible target entity. This plain enumeration increases the workload of EA and leads to less efficient alignment, and a many-to-one situation may occur in the matching process, i.e., many test source entities are matched to the same test target entity. These are the limitations of the local EA strategy.
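For contrast with the global strategies above, the following sketch shows the independent (local) decision strategy: each source entity greedily picks its nearest target entity, here under cosine similarity as an example distance metric, with no one-to-one constraint.

```python
import numpy as np

def local_alignment(src_emb, tgt_emb):
    """Independent (local) decision strategy: for every source entity,
    greedily pick the nearest target entity under cosine similarity."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                        # pairwise cosine similarity
    return sim.argmax(axis=1)                # many-to-one matches are possible

src_emb = np.random.rand(3, 8)
tgt_emb = np.random.rand(4, 8)
print(local_alignment(src_emb, tgt_emb))     # one target index per source entity
```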

7 Post-alignment module

In the post-alignment module, the main focus is to discover more aligned entities by adding newly aligned entities to the seed set. Embedding-based EA methods use pre-aligned entities as seed data, and the performance depends heavily on the quality and quantity of seed data (Chen et al. 2020b). The data size of the knowledge graph is large, so the time and labor cost of manually labeling the aligned seeds is also large. Some studies propose using iterative training to add newly generated aligned entities to the seed set, which expands the size of the seed set and guides the subsequent training process. Table 5 provides a comparative analysis of various iterative strategies for the post-alignment module.

Table 5 Comparative analysis of post-alignment module

IPTransE (Zhu et al. 2017), RATransE (Haihong et al. 2020), and East (Zeng et al. 2019) design hard and soft alignment strategies. The hard alignment strategy directly applies the parameter-sharing model of the joint embedding part to the generation of new aligned entities, adding the new aligned entity pairs to the seed set. The soft alignment strategy is designed to alleviate the error accumulation produced by hard alignment. Kang et al. (2020) and Shize et al. (2019) use a re-initialization strategy in addition to a soft alignment strategy: by re-initializing the embeddings and the newly aligned entity-pair set in each iteration, the propagation of errors into subsequent iterations is reduced.

The above methods can introduce only a small number of high-confidence entity pairs, which cannot bring significant improvement. Wei-Xin Zeng et al. (2020) design an "easy to hard" iterative strategy: using the degree of entity nodes as the measure, entities with higher degrees are treated as easy samples and long-tailed entities as hard samples, and high-confidence entity pairs are added to the training set in an easy-to-hard order. Qu et al. (2019) regard entity pairs whose alignment probability exceeds a predefined threshold as easy alignments and the rest as hard alignments; if more than K easy alignments are found in an iteration, they are added to the seed set and the iteration continues, otherwise the iteration ends. However, such methods readily introduce wrong samples and suffer from low efficiency. On this basis, Ge et al. (2021) use a refinement strategy to improve the quality of the new seed alignments generated in each iteration and provide a plausible seed generator to produce pseudo-seed alignments.

BootEA (Sun et al. 2018) applies Bootstrapping (Yarowsky 1995) to iteratively expand the size of the seed set. The iterative process inevitably produces incorrect labels, and incorrect training samples can mislead subsequent training, so an alignment editing method is used to reduce error accumulation. Similar to Bootstrapping, Lu et al. (2021) and Song et al. (2021) employ error evaluation during the iteration process so that labeled entities can be relabeled or unlabeled in subsequent iterations. Lin et al. (2021) propose an attribute combination bidirectional full filtering strategy to generate semi-supervised data, no longer using only bootstrapped positive samples as input but also adding negative samples while iterating. The above Bootstrapping-based methods do not consider the effect of seed entity selection on the entity vector representation; therefore, Chen et al. (2020b) additionally consider the centrality and distinguishability of entities when selecting seeds on top of BootEA's iterative strategy, achieving better knowledge graph alignment with only a small number of high-quality seed-aligned entities.

Bootstrapping has achieved significant performance improvements, but it relies on complex selection criteria that inevitably introduce a set of hyperparameters. Therefore, based on the one-to-one nature of entity correspondence and the asymmetry of EA direction, MRAEA (Mao et al. 2020b), JEANS (Chen et al. 2020a), EVA (Liu et al. 2021), Inga (Pang et al. 2019), RANM (Cai et al. 2023), and AdaptiveEA (Zhang et al. 2021) propose bidirectional iterative strategies. Specifically, the entity pair \((e_{i},e_{j})\) is considered a newly predicted aligned pair in the current iteration if and only if the entities \(e_{i}\) and \(e_{j}\) are mutual nearest neighbors. This effectively alleviates the error propagation problem. However, even if \(e_{i}\) and \(e_{j}\) are mutual nearest neighbors, the similarity between them may still be low, so UEA (Zeng et al. 2021) proposes a thresholded bidirectional nearest-neighbor search strategy to generate alignment results: \((e_{i},e_{j})\) is considered aligned only when their distance is below a given dynamic threshold \(\theta\). Building on the bidirectional iterative strategy, DuGa-DIT (Xie et al. 2022) uses the newly added EA pairs to dynamically adjust the cross-graph attention score matrix and the objective function. Negative samples are rarely added during bidirectional iteration; the attribute combination bidirectional full filtering strategy of Lin et al. (2021) mentioned above adds negative samples while iterating and filters candidates through one-to-one constraints, since the correctness of the local alignment is unknown.
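A minimal sketch of a bidirectional iterative strategy: in each round, mutual nearest-neighbor pairs are added to the seed set and the model is retrained. The `train_step` function is a hypothetical placeholder for one round of embedding learning; in the toy usage it simply returns a fixed similarity matrix.

```python
import numpy as np

def mutual_nn_pairs(sim, existing):
    """Predict new aligned pairs (i, j) that are mutual nearest neighbours
    and not already in the seed set."""
    src_best = sim.argmax(axis=1)
    tgt_best = sim.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(src_best)
            if tgt_best[j] == i and (i, int(j)) not in existing]

def bidirectional_iterative_training(train_step, seeds, n_rounds=5):
    """train_step(seeds) -> similarity matrix; a stand-in for one round of
    embedding learning on the current seed set."""
    seeds = set(seeds)
    for _ in range(n_rounds):
        sim = train_step(seeds)
        new_pairs = mutual_nn_pairs(sim, seeds)
        if not new_pairs:               # no confident new predictions left
            break
        seeds.update(new_pairs)         # expand the seed set for the next round
    return seeds

# toy usage: a fixed similarity matrix as a stand-in for a trained model
toy_sim = np.array([[0.9, 0.2], [0.1, 0.8]])
print(bidirectional_iterative_training(lambda seeds: toy_sim, {(0, 0)}))  # {(0, 0), (1, 1)}
```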

Active learning algorithms update the embeddings in an iterative self-learning manner using the alignment results of the previous iteration. Zeng et al. (2021) employ active learning to select the entities to be manually labeled so as to maximize model performance with minimal effort: given a label budget B, at each iteration a query policy selects the b (b < B) most informative entities for labeling, pairs containing these entities are added to the labeled data used to train the EA model, and the process iterates until the label budget is exhausted. JEANS (Chen et al. 2020a) captures cross-lingual correspondences of entities and lexical elements in a self-learning manner: starting from a small number of seed alignments, transformations between language-specific embedding spaces are iteratively induced, and more entity and lexical alignments are inferred in each iteration. DAGCN (Wang et al. 2022) uses a degree-aware adversarial idea to iteratively train a generator and a discriminator: the discriminator adjusts the embedding representation in the generator, and the discriminator parameters are updated based on the embeddings; when degree differences between entities can no longer be detected, the effect of the degree difference is considered eliminated.
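The budgeted active-learning loop described above can be sketched as follows; the query policy used here (the gap between the top-2 similarities as an uncertainty score) and the `train_step` placeholder are illustrative assumptions rather than the exact policies of the cited works.

```python
import numpy as np

def uncertainty(sim_row):
    """A simple query policy: a smaller gap between the top-2 similarities
    means the model is less certain about this source entity."""
    top2 = np.sort(sim_row)[-2:]
    return -(top2[1] - top2[0])

def active_learning_loop(train_step, budget_B, batch_b, unlabeled):
    """train_step(labeled) -> similarity matrix; a stand-in for retraining
    the EA model on the currently labeled entities."""
    labeled, spent = set(), 0
    while spent < budget_B and unlabeled:
        sim = train_step(labeled)
        # pick the b most uncertain source entities for manual labeling
        ranked = sorted(unlabeled, key=lambda i: uncertainty(sim[i]), reverse=True)
        batch = ranked[:batch_b]
        labeled.update(batch)            # assume an oracle supplies their labels
        unlabeled -= set(batch)
        spent += len(batch)
    return labeled

toy_sim = np.random.rand(6, 6)
print(active_learning_loop(lambda labeled: toy_sim, budget_B=4, batch_b=2,
                           unlabeled=set(range(6))))
```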

8 Negative sampling

In the EA task, using negative samples helps improve model performance. The negative sampling techniques commonly used in the field of EA include uniform negative sampling, truncated negative sampling, and nearest-neighbor negative sampling; some scholars also use other negative sampling methods. Table 6 shows the negative sampling methods applied by each EA model.

Table 6 Negative sampling method
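As an illustration of two of these strategies, the sketch below (names are illustrative) contrasts uniform negative sampling with nearest-neighbor (hard) negative sampling over a similarity matrix.

```python
import numpy as np

def nearest_neighbour_negatives(sim, pos_pairs, k=5):
    """For each positive pair (i, j), take the k target entities most similar
    to i other than j as hard negatives (nearest-neighbour negative sampling)."""
    negatives = {}
    for i, j in pos_pairs:
        order = np.argsort(-sim[i])               # targets sorted by similarity to i
        negatives[(i, j)] = [int(t) for t in order if t != j][:k]
    return negatives

def uniform_negatives(n_target, pos_pairs, k=5, rng=np.random.default_rng(0)):
    """Uniform negative sampling: replace the target with randomly drawn entities."""
    return {(i, j): [int(t) for t in rng.choice(n_target, size=k + 1, replace=False)
                     if t != j][:k]
            for i, j in pos_pairs}

sim = np.random.rand(3, 10)
pos = [(0, 2), (1, 5)]
print(nearest_neighbour_negatives(sim, pos, k=3))
print(uniform_negatives(10, pos, k=3))
```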

9 Loss function

The loss function estimates the degree of inconsistency between the predicted value f(x) and the true value Y. It is a non-negative real-valued function, usually written as L(Y, f(x)); the smaller its value, the better the robustness of the model. The loss functions commonly used in the field of EA are shown in Table 7.

Table 7 Loss function
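As an example of one loss commonly listed in Table 7, the following sketch shows a margin-based ranking loss, which penalizes cases where a negative pair is not at least a margin \(\gamma\) farther apart than the corresponding positive pair; the exact formulation varies by model.

```python
import numpy as np

def margin_ranking_loss(pos_dist, neg_dist, gamma=1.0):
    """Margin-based ranking loss: aligned (positive) entity pairs should be
    at least `gamma` closer in the embedding space than negative pairs."""
    return np.maximum(0.0, pos_dist + gamma - neg_dist).mean()

# toy usage: distances of 4 positive pairs and their sampled negatives
pos_dist = np.array([0.2, 0.5, 0.1, 0.9])
neg_dist = np.array([1.5, 0.7, 0.4, 0.8])
print(margin_ranking_loss(pos_dist, neg_dist))  # only pairs violating the margin contribute
```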

10 Benchmarking

10.1 Dataset

10.1.1 Unimodal dataset

Commonly used unimodal English datasets in the field of EA are classified into monolingual and cross-lingual datasets, which are mostly extracted from open-linked datasets according to different requirements. Table 8 provides statistics on the commonly used unimodal English datasets. In addition to the English dataset, we also introduce the commonly used Chinese dataset.

Table 8 Unimodal English dataset

(1) DBP15K

The DBP15K (Sun et al. 2017) dataset includes three cross-lingual sub-datasets constructed from DBpedia. Links between 15,000 popular entities are extracted from English to Chinese, Japanese, and French, respectively. The number of entities involved in each language is usually far more than 15,000, and attribute triples make up a large proportion of the dataset.

(2) DWY100K

The DWY100K (Sun et al. 2018) dataset is a monolingual dataset containing two large-scale sub-datasets drawn from DBpedia, Wikidata, and YAGO3, denoted DBP–WD and DBP–YG, respectively. Each sub-dataset has 100,000 reference entity alignments, extracted following the DBP15K procedure. Taking DBP–WD as an example, 100,000 aligned entity pairs are randomly extracted between the English version of DBpedia and Wikidata.

(3) SRPRS

RSNs (Guo et al. 2019) first proposed the SRPRS dataset, which can control the degree distribution of entities in the sampled dataset. Here, the degree of an entity is defined as the number of relational triples that the entity is associated with.

(4) DBP v1.1

DBP v1.1 (Sun et al. 2020) contains both cross-knowledge-graph and cross-lingual settings. The dataset has two versions: v1 is generated with the iterative degree-based sampling (IDS) method, while v2 first randomly removes entities with degree less than or equal to 5 and then applies the IDS method, which increases the density.

(5) Chinese dataset

Huang and Luo (2020) collect data from Baidu Encyclopedia and Interactive Encyclopedia in the military domain and extract triples from the infoboxes, forming a small-scale dataset named Dataset-1. Similarly, an entertainment dataset is collected from Baidu Encyclopedia and Interactive Encyclopedia, forming a large-scale dataset named Dataset-2. As shown in Table 9, the Chinese datasets report the number of entities, the number of relations, and the total number of fact triples; the triples of the Interactive Encyclopedia and Baidu Encyclopedia are merged.

Table 9 Chinese dataset

10.1.2 Multimodal dataset

In the EA domain, two multimodal datasets are constructed in MMKG (Liu et al. 2019), namely FB15K–DB15K and FB15K–YAGO15K. FB15K is a representative subset extracted from the Freebase knowledge base. To keep the number of entities close to that of FB15K, DBpedia's DB15K and YAGO's YAGO15K are built mainly around FB15K, using links to align the entities in FB15K with those in the other knowledge graphs. Table 10 describes the statistics of the multimodal datasets. Each dataset contains nearly 15,000 entities and over 11,000 entity image sets.

Table 10 Multimodal dataset

10.2 Evaluation metric

Three evaluation metrics are commonly used to evaluate EA performance: Hits@k, MR, and MRR. In the EA task, Hits@k is the proportion of test entities whose correct counterpart is ranked in the top k, MR is the average rank of all correctly aligned entities, and MRR is the mean of the reciprocal ranks of all correctly aligned entities in the alignment results.

(1) Hits@k

Hits@k is the proportion of correctly aligned entities ranked in the top k (k is usually 1 or 10). If the correct counterpart appears among the top-k candidates, the hit count is increased by 1. The higher the value of Hits@k, the better the model. Hits@k is the hit count divided by the total number of test entities, calculated as follows.

$${\text {Hits}} @ k=\frac{{\text {count}}\left( \left\{ e \in S \mid {\text {rank}}_{e} \le k\right\} \right) }{{\text {count}}(S)},$$
(17)

where count(S) denotes the number of elements in the set, \(rank_{e}\) is the true rank of entity e, and S is the set of test entities to be aligned.

(2) Mean rank (MR)

MR is the average rank of all correctly aligned entities in the EA results; the lower the MR value, the better the model.

$${\textrm{MR}}=\frac{\sum _{e \in S} {\text {rank}}_{e}}{{\text {count}}(S)},$$
(18)

where count(S) denotes the number of elements in the set, \(rank_{e}\) is the true rank of entity e, and S is the set of test entities to be aligned.

(3) Mean reciprocal rank (MRR)

MRR is the mean of the reciprocal ranks of all correctly aligned entities in the EA results. The higher the value of MRR, the better the model; it is calculated as follows.

$${\textrm{MRR}}=\frac{1}{{\text {count}}(S)} \sum _{e \in S} \frac{1}{{\text {rank}}_{e}},$$
(19)

where count(S) denotes the number of elements in the set, \(rank_{e}\) is the true rank of entity e, and S is the set of test entities to be aligned.
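The three metrics can be computed directly from the rank of the true counterpart for each test entity, as in the following sketch implementing Eqs. (17)–(19) (variable names are illustrative).

```python
import numpy as np

def ea_metrics(ranks, ks=(1, 10)):
    """ranks: the true target entity's rank (1-based) for every test source entity."""
    ranks = np.asarray(ranks, dtype=float)
    hits = {k: float((ranks <= k).mean()) for k in ks}   # Hits@k, Eq. (17)
    mr = float(ranks.mean())                             # Mean Rank, Eq. (18)
    mrr = float((1.0 / ranks).mean())                    # Mean Reciprocal Rank, Eq. (19)
    return hits, mr, mrr

# toy usage: ranks of 5 test entities
print(ea_metrics([1, 3, 2, 15, 1]))
# ({1: 0.4, 10: 0.8}, 4.4, 0.58)
```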

11 Experiment and analysis

11.1 Effect of direction on performance

EA is a bidirectional alignment problem. To explore the influence of alignment direction on the model, we study the performance of some representative models, such as GCN-Align, HMAN, and GMNN, in both directions on DBP15K. The experimental results are shown in Tables 11, 12, and 13. For example, \({DBP15K}_{ZH{-}EN}\) in Table 11 indicates that the source knowledge graph is the Chinese dataset and the target knowledge graph is the English dataset.

Table 11 The effect of direction on the model on \(DBP15K_{ZH{-}EN}\)
Table 12 The effect of direction on the model on \(DBP15K_{JA{-}EN}\)
Table 13 The effect of direction on the model on \(DBP15K_{FR{-}EN}\)

The results show that the models perform better in most cases when the English knowledge graph is used as the target knowledge graph. This is because the English knowledge graph is denser than the knowledge graphs of the other languages. However, HMAN, AKE, and DAEA perform better than, or close to, the forward alignment when reverse alignment is performed on the test dataset. All three models use external factors such as entity descriptions, attributes, or entity frequencies. This indicates that external information can alleviate the sparsity of the knowledge graph structure and facilitate EA. Therefore, the subsequent analysis in this paper uses English knowledge graphs as the target knowledge graphs.

11.2 Experimental setting

To ensure fairness, the model performance on different datasets is analyzed separately in this paper. Under the unimodal dataset, 30% of the seeds are used for training, and the remaining 70% are used for testing. Under the multimodal dataset, 20%, 50%, and 80% of the seeds are used for training, and the rest are used for testing. Unless otherwise specified, the experimental results in this paper are from the original paper. Table 14 shows the information of representative EA models.

Table 14 Representative EA model information

Combined with the current research trends in the field, existing EA methods are classified into four categories in this paper: (1) global alignment models; (2) models using a noise filtering strategy; (3) models using only global structural information; and (4) models combining global structural and local semantic information. Some models may belong to several categories; for example, RAGA could be placed in categories 1, 2, and 4. We therefore set a classification priority: once a model has been assigned to a category, it does not participate in the intra-class comparison of subsequent categories, so RAGA is placed in the first category. We then perform a comparative analysis of the models within each category.

The first category is EA from the global perspective. Most of the current EA models use plain enumeration alignment, which has room for improvement in both accuracy and alignment efficiency. Global EA can effectively exploit the interdependencies between alignment decisions to ensure one-to-one entity matching. Therefore, we separate global alignment and local alignment for experimental comparison and analysis.

The second category is models that employ a noise filtering strategy. As the number of network layers increases, the model can more effectively aggregate information from neighbors and capture structural representations. But this inevitably introduces noise that is not conducive to the learning of entity representations. Therefore, we analyze whether noise filtering has a positive impact on model performance.

The third category is the models using only structural information. Representation learning of knowledge graph is essentially the process of mining graph features, and the structure of the knowledge graph is an important basis for obtaining entity representations. Therefore, we classify the models that utilize only global structural information into a separate category.

The fourth category is the models that combine global structure and local semantics. When the knowledge graph structure is sparse, local semantics such as attributes will provide useful alignment signals for EA.

11.3 Comparison of unimodal EA models

11.3.1 Comparative analysis of models on DBP15K

The performance of each model on DBP15K is shown in Table 15. DBP15K is a small, dense cross-lingual dataset. We divide the models into four categories and perform intra-class comparisons as follows.

Table 15 Model performance comparison on DBP15K dataset

Category 1: among the global EA models, SoTead performs best on Hits@1 across the three sub-datasets of DBP15K, and SEU is the second-best performer in this category. Both SoTead and SEU are unsupervised approaches; their good performance lies in the fact that SoTead transforms EA into an optimal transport problem, while SEU transforms EA into a task assignment problem, significantly reducing the computational complexity compared with neural-network-based alignment. LatsEA performs the weakest: the other models consider entity names or local relational semantics, whereas LatsEA focuses on global structural embedding. CEA, CEAFF, and GM-EHD-JEA all focus on the collective alignment process: GM-EHD-JEA treats EA as a task assignment problem and uses the Hungarian algorithm, CEA views EA as a stable matching problem and solves it with a deferred acceptance algorithm, and CEAFF collectively aligns source entities through reinforcement learning, which adequately captures the interdependencies between EA decisions. Thus CEAFF achieves better results than CEA and GM-EHD-JEA. RAGA achieves clearly better results than CEA, CEAFF, and GM-EHD-JEA because it incorporates the local semantics of relations and fine-tunes the similarity matrix to consider fine-grained semantic features. This shows that combining local semantic information and fine-tuning the similarity matrix is beneficial for EA.

Category 2: on DBP15K, RNM consistently ranks at the top on the three sub-datasets. HGCN, RDGCN, and SSP perform similarly because they all use local relational semantics. NMN performs slightly better than HGCN, RDGCN, and SSP because, in addition to the structural embedding, it considers neighborhood matching. RNM improves considerably over NMN because it further introduces relation semantics on top of NMN and uses collaborative training. This shows the importance of combining global structure with local semantics, and that co-training can promote model performance.

Category 3: MuGNN and KECG obtain better results than MTransE on DBP15K, which indicates that deep models capture structural features better than translation models. Specifically, MuGNN uses a multi-channel graph neural network to capture structural information, and KECG adopts a similar idea by jointly learning entity embeddings and encoding intra-graph relations and neighborhood information. Their performance is still weaker than that of RSNs, because RSNs consider long-term relational dependencies between entities, capturing more structural signals for alignment. BootEA ranks first on all metrics for all sub-datasets in category 3, because it uses a bootstrapping strategy that adds newly generated EA pairs to the seed set; this indicates that the bootstrapping strategy has a positive impact on performance.

Category 4: the results of JTMEA are higher than those of JAPE and close to those of GCN-Align; JTMEA captures semantic information better during embedding. Both JTMEA and JETEA use entity type information and achieve similar performance. TTEA uses type enhancement while considering triple specificity and role diversity, resulting in superior performance. Compared with GCN-Align and JAPE, HMAN works better because it considers entity description information in addition to attribute information, which indicates that exploiting multiple kinds of local semantic information is useful. Compared with HMAN, JAPE, and GCN-Align, GM-Align performs better because it uses an entity-name-based initialization, whereas HMAN, JAPE, and GCN-Align use random initialization. MRAEA, TransEdge, and NAEA all use an iterative strategy together with local semantic information, and the methods using iteration generally outperform the others. MRAEA performs better than TransEdge and NAEA: TransEdge and NAEA use a bootstrapping iteration strategy that ignores direction, while MRAEA uses a bidirectional iteration strategy, which indicates that the effect of direction on EA deserves attention. OTIEA utilizes ontology-enhanced triple encoders, mining intrinsic associations and ontology-pair information, with good results. In this category, FuzzyEA performs best: it fuses entity names and descriptions and takes into account the uncertainty caused by a single alignment metric.

11.3.2 Comparative analysis of models on DWY100K

The performance of each model on DWY100K is shown in Table 16. DWY100K is a large-scale, dense monolingual dataset. We classify the models into four categories and perform intra-class comparisons as follows.

Table 16 Model performance comparison on DWY100K dataset

Category 1: on DWY100K, both CEA and GM-EHD-JEA focus on the collective alignment process. GM-EHD-JEA uses the Hungarian algorithm to maximize the local similarity score under one-to-one constraints, while CEA uses the deferred acceptance algorithm to guarantee a one-to-one alignment, i.e., an optimal assignment for each source entity. CEA consistently outperforms GM-EHD-JEA on all sub-datasets of DWY100K, which indicates that the search-space separation strategy of GM-EHD-JEA can harm model performance; moreover, the Hits@1 of CEA reaches 1. The performance of CEA on DBP15K is much lower than on DWY100K, and the same phenomenon occurs for GM-EHD-JEA, because the names of entities to be aligned are more similar in DWY100K than in DBP15K.

Category 2: on DWY100K, HGCN and NMN perform much better than SSP, and their performance is close to each other. Both HGCN and NMN use entity-name-based initialization, and HGCN additionally considers local relational semantics. SSP also makes use of relational information but is not initialized from entity names, so its performance is much inferior to HGCN and NMN, which shows that the initialization of the embeddings affects model performance.

Category 3: on DWY100K, MTransE performs the worst because it only considers the original topological information. MuGNN and KECG perform much better than MTransE: MuGNN uses a multi-channel graph neural network to capture structural information at different levels, and KECG jointly learns entity embeddings while taking neighborhood information into account. It is worth noting that RSNs do not outperform MuGNN and KECG on DWY100K; RSNs improve performance by considering long-term relational dependencies between entities, but when structural data are more adequate, long-term dependencies may not bring significant gains. BootEA still ranks first on the metrics for both sub-datasets of DWY100K, which indicates that the bootstrapping strategy improves EA models on both large-scale and small-scale datasets. In addition, the performance of each model on DWY100K is clearly better than on small-scale data such as DBP15K; for example, MuGNN shows a marked improvement in Hits@1, because the larger dataset provides more structural information to support EA.

Category 4: RALG performs best because it constructs a heterogeneous line graph, which is used to learn the relational representations of entities independently. SelfKG achieves the second-best performance using self-supervised alignment; DWY100K is a monolingual dataset with high entity-name similarity, so alignment is easier and supervised learning is not strictly necessary. SHEA performs well on DWY100K because it considers both intra-graph and cross-graph attention mechanisms when learning alignment-oriented entity embeddings. EASAE uses both summary and attribute embeddings and therefore achieves strong alignment performance. GM-Align also performs well because it implements cross-graph matching, whereas the other models learn their structural representations independently.

11.3.3 Comparative analysis of models on SRPRS

Table 17 shows the performance of each model on the SRPRS dataset. We classify the models into four categories and perform intra-class comparisons as follows.

Table 17 Model performance comparison on SRPRS dataset

Category 1: compared with DBP15K and DWY100K, the SRPRS dataset is sparser and closer to realistic datasets. Table 17 shows that the performance of every model degrades significantly on the sparse dataset. The performance of SEU still ranks first; SEU uses word-vector embeddings as well as character embeddings in the embedding part. The SRPRS dataset contains a large number of proper names, and the performance of CEA and CEAFF also remains high, with Hits@1 on DBP–WD and DBP–YG being almost perfect. This indicates that both the collective search algorithm and the task assignment algorithm are applicable to sparse datasets.

Category 2: HGCN and RDGCN both use GCNs for structural embedding and both use entity names and relations, so their performance is close. Comparing sub-datasets of similar sparsity, the two models perform somewhat better on the sparse monolingual dataset, which suggests that language heterogeneity is a bigger obstacle to the EA task than knowledge graph sparsity.

Category 3: among the EA models using only structural information, the trends are close to those on DBP15K and DWY100K, and the models using an iterative strategy perform slightly better. In addition, the performance of RSNs on the sparse datasets is close to that of BootEA with its iterative strategy, which indicates that long-term dependencies help to obtain more accurate representations when the knowledge graph structure is relatively sparse.

Category 4: GM-Align still performs well because it considers graph matching, incorporates local matching information between knowledge graphs, and relies less on structure. NAEA, TransEdge, and MRAEA, which use iterative strategies, degrade significantly, but MRAEA degrades less than TransEdge and NAEA, which indicates that the bidirectional iterative strategy is more robust than the bootstrapping iterative strategy. On sparse datasets, the performance of JAPE and GCN-Align, which both use attributes, does not differ much, indicating that when the knowledge graph structure is sparse, GCNs cannot learn good structural representations. It is worth noting that COEA and FGWEA still maintain good performance even on sparse datasets, with values above 0.9 on all metrics, because both models not only take structural and semantic information into account but also improve the alignment process: COEA converts EA into a combinatorial optimization problem, and FGWEA solves EA with optimal transport.

11.3.4 Comparative analysis of models on DBP v1.1

Table 18 shows the performance of the models on the DBP v1.1 dataset. The model results marked with * are from Ge et al. (2023), and the other results are from Xiang et al. (2021). Table 18 shows that models using only structural information, such as MTransE and RSNs, do not work well. Although BootEA also uses only structural information, it employs an iterative strategy to expand the seed set and therefore achieves better performance. In comparison, combining structural and semantic information improves model performance: RDGCN uses a noise filtering strategy and relational semantics, and OntoEA introduces ontology information, and both achieve better results. TypeEA-B, TypeEA-R, and TypeEA-M apply type information to BootEA, RDGCN, and MultiKE, respectively, and the comparison shows that introducing entity type information facilitates EA.

Table 18 Model performance comparison on DBPv1.1 dataset

11.3.5 Comparative analysis of models on Chinese dataset

As shown in Table 19, representative models on the Chinese datasets include TransH (Wang et al. 2014), TransD (Ji et al. 2015), IEAJKE (Zhu et al. 2017), AttrE (Trisedya et al. 2019), and EASA (Huang and Luo 2020). On Dataset-1 and Dataset-2, EASA performs best on Hits@1 and Hits@10 and is far ahead of the second-best model AttrE on the Hits metrics. The MR of EASA is second only to the top-ranked IEAJKE on both datasets, which demonstrates the importance of fully exploiting the semantic information of entities. Although AttrE uses the frequency ratio of relations and attributes as weights for EA, it cannot capture the importance of the semantic aggregation produced by many attributes. TransH, TransD, and IEAJKE do not consider entity semantic integration or attribute weights and thus perform worse than EASA and AttrE.

Table 19 Model performance comparison on Chinese dataset

11.4 Comparison of multimodal EA models

Tables 20 and 21 show the experimental results of representative multimodal knowledge graph EA models on the FB15K–DB15K and FB15K–YAGO15K datasets. This paper compares the performance of each model when 20%, 50%, and 80% of the seed entity pairs are provided as the training set. The performance of each model increases gradually as the proportion of seed entity pairs increases, and the metrics of MMEA and ACK-MMEA grow the fastest, indicating that MMEA and ACK-MMEA have strong robustness and adaptability. The experimental data show that PCMEA achieves the best results: it filters modality-specific noise and uses pseudo-label calibration and contrastive learning, which reduce the effect of noise and improve the quality of the pseudo-labels. Every metric of MMEA is clearly better than that of IKRL at each seed proportion, which indicates that MMEA is applicable to real-world multimodal knowledge graphs. On FB15K–DB15K with 80% of the seed entity pairs, MMEA outperforms HMEA on Hits@1 by nearly 20%; on FB15K–YAGO15K with 80% of the seed entity pairs, MMEA outperforms HMEA and GCN-Align by more than 15% on Hits@1. Even when the seed proportion is only 20%, MMEA improves over HMEA by about 14% on Hits@1 and over 15% on Hits@10. This analysis shows that migrating multimodal knowledge embeddings from separate spaces into a common space is an effective method for EA.

Table 20 Model performance comparison analysis in dataset FB15K–DB15K
Table 21 Model performance comparison analysis in dataset FB15K–YAGO15K

12 Research prospect

12.1 Study of other multimodal data

The relational structure information of knowledge graphs sometimes leads to ambiguity, so multimodal knowledge (Liu et al. 2019) plays a key role in the knowledge embedding process. Although a few studies (Chen et al. 2020c; Wang et al. 2020) have applied static image data to multimodal EA, other modalities have not been fully explored. For example, dynamic video data have not been applied to EA tasks, yet video data contain richer and more intuitive information than static images. Therefore, how to incorporate advanced video features, as well as other features, into multimodal knowledge graph EA may be a focus of future research.

12.2 Study of realistic datasets

There are significant differences between existing datasets and real-world knowledge graphs, which makes it difficult for existing EA models to run on real-world knowledge graphs. The entities in current datasets have more neighbors and richer semantic information, so these high-degree entities are relatively easy to align. In addition, current datasets focus on only one aspect of heterogeneity, such as multilingualism, and ignore differences in schema and scale. Therefore, constructing datasets from multiple perspectives that are closer to real-world knowledge graphs deserves further research.

12.3 Study of other vector spaces

A key step in embedding-based knowledge graph EA is learning the embedding representation, and the quality of the embedding has a direct impact on subsequent EA performance. It has been shown that non-Euclidean spaces can embed graph structure better than Euclidean space (Nickel and Kiela 2017). Most current EA methods use Euclidean space; although some models use hyperbolic space (Guo et al. 2021) and spherical space (Huang et al. 2022), many other vector spaces (e.g., complex spaces, Sun et al. 2019) are still worth studying.

12.4 Study of dynamic knowledge graph EA

At present, most datasets used in the field of EA are static knowledge graphs, while realistic knowledge graphs change frequently, so it is necessary to consider dynamic factors in knowledge graph EA. The dynamics of knowledge graphs are mainly reflected in the temporal and spatial dimensions (Zheng et al. 2020). Although some scholars have studied EA for temporal knowledge graphs (Song et al. 2022), the dynamics of knowledge graphs have not yet been considered from the spatial dimension. Therefore, it is worthwhile to design spatio-temporal knowledge graph EA models that cover both dimensions.