1 Introduction

Many current approaches to entity alignment (EA) in knowledge graphs (KGs) rely heavily on the graph structure of KGs [4, 7, 10, 15, 18], assuming that equivalent entities have similar neighborhood structures. While these methods achieve state-of-the-art performance on synthetic datasets extracted from large-scale KGs [2, 15, 24], recent studies have shown that such synthetic datasets are much denser than real-life KGs, and that existing EA methods cannot yield satisfactory results on datasets with real-life distributions [7].

A recent study [7] has shown that nearly half of the entities in real-life KGs are connected to fewer than three other entities; such entities are called long-tail entities, and their prevalence makes a KG a relatively sparse graph. This matches our perception that only a few entities in real-life KGs are frequently accessed and have rich connections and detailed attributes, while the majority remain under-explored and provide little structural information. Consequently, existing EA methods that rely solely on structural information struggle to align these entities accurately, as demonstrated in the following example.

Example Figure 6.1 shows a partial English KG (KG\({ }_{\text{EN}}\)) and a partial Spanish KG (KG\({ }_{\text{ES}}\)) concerning the film Summer 1993. Note that the entities The Bookshop and La Librería in gray describe the original novel, while those in white depict the film.

Fig. 6.1 An example of EA. Nodes in gray (resp. white) are long-tail (resp. popular) entities (relation names and other entities are omitted in the interest of space)

When aligning entities of high degree, e.g., Spain and España, structural information is of great help; for long-tail entities such as Carla Simón in KG\({ }_{\text{EN}}\), however, structural information may wrongly suggest Laia Artigas in KG\({ }_{\text{ES}}\) as its match, since each has a single link to Summer 1993 and Verano 1993, respectively.

The example unveils the shortcoming of relying solely on structural information for EA, which renders existing EA methods suboptimal, and even infeasible, for long-tail entities. Hence, we are motivated to revisit the key phases of the EA pipeline and address the challenge of EA when structural information is insufficient.

In the pre-alignment phase, we are searching for extra signals that can improve EA, and we find that entity names can provide a source of valuable information. This type of information is commonly present in real-life entities, but previous research has not given it sufficient attention. For example, if we consider the long-tail entity Carla Simón in KG\({ }_{\text{EN}}\), incorporating entity name information would be beneficial in finding the correct mapping, which is Carla Simón in KG\({ }_{\text{ES}}\). This shows that entity name information can provide a supplementary perspective to the commonly used structural information in EA.

Previous studies [19,20,21] have already used name embeddings, specifically averaged word embeddings, to populate initial feature matrices for learning structural representation. However, our approach is different in that we use entity names as an additional source of signal, in addition to structural information. We achieve this by encoding the names through concatenated power mean word embeddings [22].

During the alignment phase, we carefully merge the two signals mentioned earlier, taking into account that the significance of structural and name information differs for entities with varying degrees. In the example above, aligning the long-tail entity Carla Simón in KG\({ }_{\text{EN}}\) relies more on entity name information than on its limited neighboring structure. Conversely, for mapping popular entities such as the film La Librería, where entity names are ambiguous (i.e., both the film La Librería and the novel La Librería share the same name), structure plays a more significant role. Generally speaking, we can assume that the importance of the entity name signal is higher (resp. lower) for entities with lower (resp. higher) degrees, while the opposite holds for the signal from the neighboring structure. To accurately capture the nonlinear dynamics between the two signals, we develop a co-attention network that uses entity degrees as a guide to determine the weights of the different signals. It is worth noting that [10] introduced degrees to address the bias of structural embedding methods, which tend to place entities with similar degrees in close proximity. Our motivation is different: we use degrees in computing pairwise similarities rather than in learning individual embeddings.

During the post-alignment phase, we propose to substantially improve the structural information of the KGs by letting them recursively examine and cross-reference each other. While long-tail entities may lack structural information in their original knowledge graph (the “source KG”), the knowledge graph being aligned with (the “target KG”) may hold this information in a complementary manner. As an illustration, consider the entity Carla Simón: KG\({ }_{\text{EN}}\) may be missing the fact that Carla Simón is from España, which is present in KG\({ }_{\text{ES}}\). By pairing the surrounding entities and leveraging information from the target KG, the source KG can acquire this missing information and improve the alignment. Inspired by the beneficial impact of using rules to complete knowledge graphs [2], we propose an iterative training procedure that includes knowledge graph completion. In each round, we use confident entity alignment results as anchors to identify and add missing relations, thereby enriching the current knowledge graphs. The enriched knowledge graphs in turn allow better structural embeddings to be learned. Additionally, the matching signal can propagate to long-tail entities, which were previously difficult to align in a single shot but may become easier to align through this iterative process.

Contribution

In short, the contribution of this chapter can be summarized as follows:

  • We have observed a shortcoming in current EA methods regarding the alignment of long-tail entities, primarily because they heavily rely on structure. To overcome this limitation, we propose two solutions: (1) incorporating an additional signal from entity names through concatenated power mean word embeddings and (2) devising an efficient degree-aware co-attention mechanism to dynamically integrate the name and structural signals.

  • Our proposal aims to decrease the number of long-tail entities by enhancing relational structure through KG completion, integrated into an iterative self-training approach. This is achieved by utilizing confident EA outcomes as anchors and using other KGs as references. Our strategy not only improves the performance of EA but also enhances the coverage of KGs.

  • The techniques presented form a new framework called DAT. We conduct empirical evaluations of the implementation of DAT on both mono-lingual and cross-lingual EA tasks, comparing it to state-of-the-art methods. The results of our comparison and ablation analysis demonstrate the superiority of DAT.

Organization

Section 6.2 overviews related work. In Sect. 6.3, we analyze the long-tail phenomenon in EA. DAT and its components are elaborated in Sect. 6.4. Section 6.5 introduces experimental settings, evaluation results, and detailed analysis, followed by conclusion in Sect. 6.6.

2 Related Work

Conventional EA Framework

The advancements made by state-of-the-art methods can be analyzed based on a phased pipeline. Firstly, for the pre-alignment phase, KG representation methods such as TransE [3, 4, 23] and GCN [18] are utilized to encode structural information and embed KGs into low-dimensional spaces individually. Subsequently, for the alignment phase, the embedding spaces are evaluated and compared to derive alignment results under the supervision of seed entity pairs. Certain techniques [7, 14, 15] employ a method of combining training data to create a unified embedding space. This allows for the direct projection of entities from various KGs into the same space. Equivalence across KGs can then be identified by measuring the distance between entities in the unified embedding space during alignment. In order to enhance supervision signals by utilizing the outcomes of the alignment stage, post-alignment iterative techniques are utilized as described in [15, 23]. This approach involves updating structural embeddings and performing alignment recursively until a stopping condition is met. These techniques can be roughly summarized into a framework, depicted by Fig. 6.2.

Fig. 6.2 Conventional framework of EA

Recent Advancement on EA

Recent endeavors have been directed toward addressing structural heterogeneity by developing sophisticated structural learning models such as topic graph matching [21] and multichannel graph neural networks [2]. A recent work enhances structural embedding through adversarial training that takes degree difference into account [10]. However, this approach may not be effective when the entities to be aligned have low degrees in both KGs. Furthermore, in that study, degree information is used to improve the learning of structural embeddings, whereas in our approach, degree information is used to combine two different alignment signals: structural and name information.

While iterative strategies can be effective in improving entity alignment (EA), previous research has shown that they can also have drawbacks. For example, they can be biased toward one knowledge graph (KG) and time-consuming [15], or they may introduce many false-positive instances [23], which is not ideal for real-life applications. In order to balance precision and computational efficiency, we propose a novel iterative training approach that incorporates a KG completion module. This module updates the structure of the KG in each round based on confident anchoring entity pairs. Our strategy is lightweight and limits the inclusion of incorrect pairs, reducing the likelihood of introducing false positives.

It is apparent that the majority of the aforementioned embeddings rely on structural information for learning, which can be inadequate for long-tail entities in some cases. To address this issue, some researchers have suggested incorporating attributes into embeddings in order to potentially compensate for the shortcomings of relying solely on structural information [14, 17, 18, 22]. However, a significant percentage (between 69 and 99%) of instances in popular KGs are lacking at least one attribute that other entities in the same class possess [6]. The use of entity descriptions [3] has been proposed as a way to provide additional information that is often missing in many KGs. While these efforts can improve overall performance, they may not effectively align entities in the long tail. Previous approaches have explored using entity names either as initial features for learning structural representation [19,20,21] or in combination with other information for representation learning [22]. In contrast, our proposed approach consolidates features from separate similarity matrices learned from structure and name information, with different strategies evaluated in Sect. 6.5.2.

3 Impact of Long-Tail Phenomenon

Task Definition

Given a source KG \(G_1 = (E_1, R_1, T_1)\) and a target KG \(G_2 = (E_2, R_2, T_2)\), where \(E_1\) (resp. \(E_2\)) denotes the source (resp. target) entities, \(R\) denotes relations, and \(T \subseteq E\times R\times E\) denotes triples. Let the seed entity pairs be \(S = \{(e^{i}_1,e^{i}_{2})|e^{i}_1 = e^{i}_2, e^{i}_1\in E_1, e^{i}_{2}\in E_2\}\), \(i \in [1, |S|]\), where \(|\cdot |\) denotes the cardinality of a set. The EA task is to find new EA pairs based on S and return the eventual results \(S^{\prime } = \{(e^{i}_1,e^{i}_{2})|e^{i}_1 = e^{i}_2, e^{i}_1\in E_1, e^{i}_{2}\in E_2\}\), \(i \in [1, \min \{|E_1|, |E_2|\}]\), where \(=\) expresses that two entities refer to the same physical one.

A recently published study [7] identified that previous entity alignment datasets had knowledge graphs that were too densely connected and had degree distributions that differed significantly from real-life knowledge graphs. To address this issue, they created a new entity alignment benchmark that better reflects real-life distributions. The benchmark includes both cross-lingual datasets such as \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\), \({\mathtt {SRPRS}_{\mathtt {EN-DE}}}\), and mono-lingual datasets such as \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\). The degree of an entity is defined as the number of relational triples it participates in. The study reports the degree distributions of entities in the test sets in Table 6.1. The researchers also evaluated the performance of RSNs, which was found to be the best solution in [7]. The evaluation included measuring the number of correctly aligned entities in different degrees.

Table 6.1 Degree distribution of entities in test set (the first KG in each KG pair) and results of RSNs

The results presented in Table 6.1 indicate that in the \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\) datasets, over 50% of the entities have degrees less than three, and in the \({\mathtt {SRPRS}_{\mathtt {EN-DE}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) datasets, almost half of the entities have a degree of only 1 or 2. This confirms that the majority of entities in the knowledge graph have very few connections to others and are considered long-tail entities. The results also demonstrate that the accuracy on long-tail entities is much lower than that on higher-degree entities, even though RSNs is the leading method on the benchmark. This suggests that current methods are not effective in handling long-tail entities, which limits overall performance. Therefore, it is crucial to re-evaluate the entity alignment pipeline, with a particular focus on addressing the challenges posed by long-tail entities.

4 Methodology

To provide an overview, we summarize the main components of the DAT (degree-aware entity alignment in tail) framework in Fig. 6.3, highlighting the new designs in purple blue. In pre-alignment, the structural representation learning module and the name representation learning module are put forward to learn useful features of entities, i.e., the name representation and the structural representation; in alignment, these features are forwarded to the degree-aware fusion module for effective fusion and alignment under the guidance of degree information. In post-alignment, the KG completion module aims to complete the KGs with confident EA pairs in the results, and the augmented KGs are then utilized again in the next round, iteratively.

Fig. 6.3 The framework of DAT

Since the structural representation learning module has been extensively studied, we adopt the state-of-the-art model RSNs [7] for this purpose. Given a structural embedding matrix \(\mathbf Z \in \mathbb {R}^{n\times d_s}\) and two entities \(e_1 \in G_1\) and \(e_2 \in G_2\), their structural similarity \(Sim_s(e_1,e_2)\) is the cosine similarity between \(\mathbf {Z}(e_1)\) and \(\mathbf Z(e_2)\), where n denotes the number of all entities in the two KGs, \(d_s\) is the dimension of structural embeddings, and \(\mathbf Z(e)\) denotes the embedding vector for entity e (i.e., \(\mathbf Z(e) = \mathbf {Z} \mathbf {e}\), where \(\mathbf {e}\) is the one-hot encoding of entity e). From the perspective of structure, the target entity with the highest similarity to a source entity is returned as its alignment result.
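
To make this concrete, the following is a minimal NumPy sketch of the structural matching step, assuming `Z1` and `Z2` are hypothetical sub-matrices of \(\mathbf Z\) holding the RSNs embeddings of the entities in \(G_1\) and \(G_2\); it is illustrative only, not the reference implementation.

```python
import numpy as np

def cosine_similarity_matrix(Z1, Z2):
    """Pairwise cosine similarities between rows of Z1 (|E1| x d_s) and Z2 (|E2| x d_s)."""
    Z1n = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2n = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    return Z1n @ Z2n.T

# Sim_s = cosine_similarity_matrix(Z1, Z2)
# structural_match = Sim_s.argmax(axis=1)  # best target entity for each source entity
```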

4.1 Name Representation Learning

Recalling that structural information has limited effectiveness for aligning long-tail entities, we take a different approach from previous attempts that focus on utilizing structure. Instead, we search for a signal that is generally available for long-tail entities and can benefit alignment.

In order to achieve this goal, we suggest including the textual names of entities, which has largely been ignored by current embedding-based EA methods. This approach is particularly attractive for several reasons, including: (1) the name of an entity is typically sufficient to identify it, and when given two entities, comparing their names is often the most straightforward way to determine if they are equivalent and (2) the majority of real-life entities have a name, and the proportion of entities with names is much greater than the proportion with other textual information, such as descriptions and attributes. This is particularly relevant for long-tail entities, which tend to lack such additional information.

Although there are many classic approaches for measuring the string similarity between entity names, we opt for semantic similarity, since it still works when the vocabularies of the KGs differ, especially in the cross-lingual scenario. Specifically, we choose a general form of power mean embeddings [11], which encompasses many well-known means such as the arithmetic mean, the geometric mean, and the harmonic mean. Given a sequence of word embeddings, \(\mathbf w_1, \ldots , \mathbf w_l \in \mathbb {R}^d\), the power mean operation is formalized as:

$$\displaystyle \begin{aligned} {} \left(\frac{w_{1i}^p + \cdots + w_{li}^p}{l}\right)^{1/ p}, \quad \forall i = 1, \ldots ,d, \quad p \in \mathbb{R}\cup\{\pm\infty\}, \end{aligned} $$
(6.1)

where l is the number of words and d denotes the dimension of embeddings. It can be seen that setting p to 1 results in the arithmetic mean, to 0 the geometric mean, to \(-\)1 the harmonic mean, to \(+\infty \) the maximum operation, and to \(-\infty \) the minimum operation [12].
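
As a concrete illustration, Eq. (6.1) can be realized in a few lines; the sketch below (Python/NumPy, hypothetical helper name) treats \(p = \pm \infty\) as the max/min operations and otherwise applies the generalized mean element-wise. Only the p values used later (1, \(+\infty\), \(-\infty\)) are handled exactly; other finite p would require care with negative components.

```python
import numpy as np

def power_mean(W, p):
    """Power mean of the word vectors W (l x d) along the word axis, per Eq. (6.1)."""
    if p == float("inf"):
        return W.max(axis=0)
    if p == float("-inf"):
        return W.min(axis=0)
    # generalized mean; exact for p = 1 (arithmetic mean)
    return np.power(np.power(W, p).mean(axis=0), 1.0 / p)
```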

Given a word embedding space \(\mathbb {E}^i\), the embeddings of the words in the name of entity s can be represented as \(\mathbf W^i = [\mathbf w_1^i, \ldots , \mathbf w_l^i]\in \mathbb {R}^{l\times d^i}\). Correspondingly, \(H_p(\mathbf W^i) \in \mathbb {R}^{d^i}\) denotes the power mean embedding vector after feeding \(\mathbf w_1^i, \ldots , \mathbf w_l^i\) to Eq. (6.1). To obtain summary statistics of entity s, we compute K power means of s and concatenate them to get the entity name representation \(\mathbf s^i \in \mathbb {R}^{d^i\cdot K}\), i.e.,

$$\displaystyle \begin{aligned} \mathbf s^i = H_{p_1}(\mathbf W^i)\oplus\cdots\oplus H_{p_K}(\mathbf W^i), \end{aligned} $$
(6.2)

where \(\oplus \) represents concatenation along rows and \(p_1, \ldots , p_K\) are K different power mean values [12].

To get further representational power from different word embeddings, we generate the final entity name representation \(\mathbf n_s\) by concatenating \(\mathbf s^i\) obtained from different embedding spaces \( \mathbb {E}^i\):

$$\displaystyle \begin{aligned} {} \mathbf n_s = \bigoplus\limits_{i} \mathbf s^i. \end{aligned} $$
(6.3)

Note that the dimensionality of this representation is \(d_n = \sum _i d^i\cdot K\). The name embeddings of all entities can be denoted in matrix form as \(\mathbf N \in \mathbb {R}^{n\times d_n}\).
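
Putting Eqs. (6.1)–(6.3) together, the name embedding of one entity might be assembled as follows. This is a sketch that reuses the `power_mean` helper above and assumes each embedding space is given as a word-to-vector dictionary (e.g., fastText or MUSE lookups); the power values \(\{1, -\infty, +\infty\}\) are those used later in the experiments, and OOV handling is simplified.

```python
import numpy as np

def name_embedding(name, emb_spaces, ps=(1.0, float("-inf"), float("inf"))):
    """Concatenated power mean embedding of an entity name (dimension sum_i d_i * K)."""
    parts = []
    for space in emb_spaces:                      # one dict per embedding space E^i
        dim = len(next(iter(space.values())))
        vecs = [space[w] for w in name.lower().split() if w in space]
        W = np.stack(vecs) if vecs else np.zeros((1, dim))   # skip OOV words
        parts.extend(power_mean(W, p) for p in ps)           # Eq. (6.2): K power means
    return np.concatenate(parts)                             # Eq. (6.3): concat over spaces
```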

Similar to word embeddings, the representation space groups together entity names that are semantically related. Given the textual names of two entities \(e_1 \in G_1\) and \(e_2 \in G_2\), their name similarity \(Sim_t(e_1,e_2)\) is calculated as the cosine similarity between \(\mathbf N(e_1)\) and \(\mathbf N(e_2)\). The alignment result for a source entity is the target entity with the highest similarity score.

Discussion

The concatenated power mean word embedding, as presented by Rücklé et al. [12], provides a superior alternative to averaged word embeddings for representing entity names, because it is better equipped to capture and synthesize the relevant information conveyed by an entity name. Averaging word embeddings results in a significant loss of information because it fails to account for the semantic variation that can exist within different names. On the other hand, using concatenated power means produces a more accurate summary by reducing ambiguity and uncertainty in the representation of an entity name. This is supported by the empirical evidence presented in Sect. 6.5.3.

It should be noted that in the context of cross-lingual entity alignment, we rely on pre-trained multilingual word embeddings, as described in [5]. These embeddings have already aligned words from different languages into a shared semantic space. As a result, entity names from multiple languages can exist within the same semantic space, obviating the need to design a separate mapping function for aligning multilingual embeddings.

The method described above can be extended to accommodate other textual information, such as attributes, without sacrificing its generality. One simple approach is to concatenate the attributes and entity name to form a “sentence” that provides a more comprehensive description of the entity. This combined sentence can then be encoded using concatenated power mean word embeddings. However, the integration of additional information and more complex adaptations is not within the scope of this chapter.

4.2 Degree-Aware Co-attention Feature Fusion

Entity identities can be characterized by various types of features from different perspectives. Therefore, it is important to have a feature fusion module that effectively combines these different signals. Some researchers have proposed to integrate different embeddings into a unified representation space [22], but this approach necessitates additional training to align irrelevant features. A more desirable strategy involves first computing the similarity matrix within each feature-specific space and then combining the similarity scores for each feature-specific space [9, 18]. However, the contributions of different features vary for entities with different degrees. For long-tail entities that lack structural information, entity name representation should be given more weight, whereas for popular entities, the structural representation is relatively more informative than the entity name information. To address this dynamic shift, we draw inspiration from the bi-attention mechanism proposed in [13] and design a degree-aware co-attention network, depicted in Fig. 6.4.

Fig. 6.4 Degree-aware co-attention feature fusion

Formally, we are given the structural embedding matrix \(\mathbf Z\) and the name embedding matrix \(\mathbf N\). For each entity pair \((e_1, e_2)\), where \(e_1 \in G_1\) and \(e_2 \in G_2\), we calculate a similarity score between \(e_1\) and \(e_2\). This similarity score is then used to determine the alignment result. To compute the overall similarity between entity pairs, we first calculate the feature-specific similarity scores, \(Sim_s(e_1,e_2)\) and \(Sim_t(e_1,e_2)\), between \(e_1\) and \(e_2\), as explained in the previous subsections. Our degree-aware co-attention network is designed to determine the weights for \(Sim_s(e_1,e_2)\) and \(Sim_t(e_1,e_2)\) by incorporating degree information. This network consists of three stages: feature matrix construction, co-attention similarity matrix calculation, and weight assignment.

Feature Matrix Construction

Apart from entity name and structural information, we also include entity degree information to construct a feature matrix for each entity. To be precise, we represent entity degrees as one-hot vectors of all possible degree values and pass them through a fully connected layer to obtain a continuous degree vector. As an example, the degree vector of \(e_1\) can be represented as \(\mathbf g_{e_1} = \mathbf M \cdot \mathbf h_{e_1} \in \mathbb {R}^{d_g}\), where \(\mathbf h_{e_1}\) is the one-hot representation of its degree, \(\mathbf M\) is the weight matrix in the fully-connected layer, and \(d_g\) denotes the dimension of the degree vector. This continuous degree vector, along with structural and entity name representations, is stacked to form an entity’s feature matrix. For entity \(e_1\):

$$\displaystyle \begin{aligned} \mathbf F_{e_1} = [\mathbf N(e_1); \mathbf Z(e_1); \mathbf g_{e_1}] \in \mathbb{R}^{3 \times d_m}, \end{aligned} $$
(6.4)

where \(;\) denotes the concatenation along columns, \(d_m = \max \{d_n, d_s, d_g\}\), and we pad the missing values with 0s.
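
A minimal sketch of this construction (Python/NumPy, hypothetical variable names) is given below: the degree one-hot vector is mapped through the trainable matrix \(\mathbf M\), and the three feature vectors are stacked as rows and zero-padded to \(d_m\), as in Eq. (6.4).

```python
import numpy as np

def feature_matrix(name_vec, struct_vec, degree_onehot, M):
    """Build F_e of Eq. (6.4): rows are [name, structure, degree], zero-padded to d_m."""
    g = M @ degree_onehot                                   # continuous degree vector (d_g)
    d_m = max(len(name_vec), len(struct_vec), len(g))
    F = np.zeros((3, d_m))
    F[0, :len(name_vec)] = name_vec
    F[1, :len(struct_vec)] = struct_vec
    F[2, :len(g)] = g
    return F
```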

Co-attention Similarity Matrix Calculation

To model the interaction between \(\mathbf F_{e_1}\) and \(\mathbf F_{e_2}\), as well as highlight important features, we build a co-attention matrix \(\mathbf S \in \mathbb {R}^{3 \times 3}\), where the similarity between the i-th feature of \(e_1\) and the j-th feature of \(e_2\) is computed by:

$$\displaystyle \begin{aligned} \mathbf S_{ij} = \alpha(\mathbf F_{e_1}^{i:}, \mathbf F_{e_2}^{j:}) \in \mathbb{R}, \end{aligned} $$
(6.5)

where \(\mathbf F_{e_1}^{i:}\) is the i-th row vector and \(\mathbf F_{e_2}^{j:}\) is the j-th row vector, \(i=1,2,3; j=1,2,3\). \(\alpha (\mathbf u, \mathbf v) = \mathbf w^{\top }(\mathbf u\oplus \mathbf v\oplus (\mathbf u\circ \mathbf v))\) is a trainable scalar function that encodes the similarity, where \(\mathbf w \in \mathbb {R}^{3d_m}\) is a trainable weight vector and \(\circ \) is the element-wise multiplication. Note that the implicit multiplication is a matrix multiplication.

Weight Assignment

The co-attention similarity matrix, denoted by \(\mathbf S\), is used to generate attention vectors, which are \(\mathbf {att}_{\mathbf {1}}\) and \(\mathbf {att}_{\mathbf {2}}\), in both directions. The attention vector \(\mathbf {att}_{\mathbf {1}}\) indicates the feature vectors in \(e_1\) that are most important or relevant to the feature vectors in \(e_2\). Similarly, \(\mathbf {att}_{\mathbf {2}}\) indicates the feature vectors in \(e_2\) that are most important or relevant to the feature vectors in \(e_1\). To achieve this, we pass the co-attention similarity matrix \(\mathbf S\) through a softmax layer. Next, the resulting matrix from the softmax layer is compressed using an average layer to create the attention vectors. It is worth noting that when performing column-wise operations in the softmax layer and row-wise operations in the average layer, we get \(\mathbf {att}_{\mathbf {1}}\). Conversely, when conducting row-wise operations in the softmax layer and column-wise operations in the average layer, we obtain \(\mathbf {att}_{\mathbf {2}}\).

Eventually, we multiply the feature-specific similarity scores with the attention values to obtain the final similarity score:

$$\displaystyle \begin{aligned} Sim(e_1,e_2) = Sim_s(e_1,e_2)\cdot {\mathbf{att}_{\mathbf{1}}}^{s} + Sim_t(e_1,e_2)\cdot {\mathbf{att}_{\mathbf{1}}}^{t}, \end{aligned} $$
(6.6)

where \({\mathbf {att}_{\mathbf {1}}}^{s}\) and \({\mathbf {att}_{\mathbf {1}}}^{t}\) are the corresponding weight values for structural and name similarity scores, respectively. Note that \(Sim(e_1,e_2) \neq Sim(e_2,e_1)\) as they may have different attention weight vectors.
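
The co-attention and weighting steps of Eqs. (6.5)–(6.6) can then be sketched as a single forward pass (Python/NumPy, with the trainable vector `w` passed in). Following the description above, a column-wise softmax followed by a row-wise average yields \(\mathbf{att}_{\mathbf{1}}\); the symmetric operations would yield \(\mathbf{att}_{\mathbf{2}}\). With the row order of Eq. (6.4), index 0 corresponds to the name feature and index 1 to the structural feature.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fused_similarity(F1, F2, w, sim_s, sim_t):
    """Degree-aware fusion of structural and name similarities for one entity pair."""
    # Co-attention similarity matrix S (3 x 3), Eq. (6.5)
    S = np.array([[w @ np.concatenate([F1[i], F2[j], F1[i] * F2[j]])
                   for j in range(3)] for i in range(3)])
    att1 = softmax(S, axis=0).mean(axis=1)     # attention over e_1's features
    # att2 = softmax(S, axis=1).mean(axis=0)   # attention over e_2's features, for Sim(e_2, e_1)
    return sim_s * att1[1] + sim_t * att1[0]   # Eq. (6.6)
```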

The model that combines co-attention and feature fusion has a relatively simple structure with only two parameters, \(\mathbf M\) and \(\mathbf w\). Furthermore, it is straightforward to modify this model to include additional features.

Training

The training objective is to maximize the similarity scores of the training entity pairs, which can be converted to minimizing the following loss function:

$$\displaystyle \begin{aligned} {} L = \sum_{(e_1,e_2)\in S} [\ -Sim(e_1,e_2) + \gamma]_+ + [\ -Sim(e_2,e_1) + \gamma]_+ , \end{aligned} $$
(6.7)

where \([x]_+ = \max \{0,x\}\) and \(\gamma \) is a constant.
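
Written out, Eq. (6.7) is a hinge loss on the bidirectional similarities of the seed pairs; below is a sketch of evaluating it (Python, with a hypothetical `similarity(e1, e2)` wrapping the degree-aware fused score in the given direction, and \(\gamma = 0.8\) as used later in the experimental settings). In practice the gradients of this objective with respect to \(\mathbf M\) and \(\mathbf w\) would be handled by an automatic differentiation framework; only the objective value is shown here.

```python
def training_loss(seed_pairs, similarity, gamma=0.8):
    """Hinge loss of Eq. (6.7): push Sim(e1, e2) and Sim(e2, e1) above the margin gamma."""
    loss = 0.0
    for e1, e2 in seed_pairs:
        loss += max(0.0, gamma - similarity(e1, e2))   # [-Sim(e1, e2) + gamma]_+
        loss += max(0.0, gamma - similarity(e2, e1))   # [-Sim(e2, e1) + gamma]_+
    return loss
```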

Discussion

Alternative methods of implementing degree-aware weighting are possible, such as applying sigmoid\((\mathbf W \cdot [\mathbf N(e), \mathbf Z(e), \mathbf g_e])\) where \(\mathbf W\) represents the parameter. In this study, we utilize a co-attention mechanism to combine various signal channels with degree-aware weights, which highlights the benefits of incorporating degrees for effective EA in the tail. However, a more comprehensive comparison with other implementations is a subject for future research.

4.3 Iterative KG Completion

The concept of iterative self-training has been shown to be effective and warrants further investigation, as demonstrated in previous studies [15, 23]. However, current research has failed to consider the potential for enriching structural information during the iterative process. Our findings suggest that, while long-tail entities in the source KG may lack structural information, this information can be found in the target KG in a complementary manner. By mining confident EA results and using them as pseudo matching pairs to anchor subgraphs, we can replenish the original KG with facts from its counterpart, thereby mitigating the KGs’ structural sparsity. This can significantly improve KG coverage and reduce the number of long-tail entities. As the structural learning model generates increasingly better structural embeddings from the amplified KGs, the accuracy of EA results in subsequent rounds also improves naturally in an iterative fashion.

To start, we describe how we select EA pairs with a high level of confidence. Our focus is on preventing the inclusion of incorrect pairs that could harm the model, and to this end we adopt a strict selection rule. For a given entity \(e_1 \in E_1 - S_1\) (in \(G_1\) but not in the training set), let its most similar entity in \(G_2\) be \(e_2\), its second most similar entity be \(e_2^\prime \), and the difference between the similarity scores be \(\Delta _1 \triangleq Sim(e_1, e_2)-Sim(e_1, e_2^\prime )\). Symmetrically, suppose the most similar entity of \(e_2\) in \(G_1\) is exactly \(e_1\), its second most similar entity is \(e_1^\prime \), and the difference between the similarity scores is \(\Delta _2 \triangleq Sim(e_2, e_1)-Sim(e_2, e_1^\prime )\). If \(\Delta _1\) and \(\Delta _2\) are both above a given threshold \(\theta \), then \((e_1, e_2)\) is considered a correct pair. This is a relatively strong constraint, as it requires that (1) the two entities are each other's most similar candidate and (2) there is a margin between the top two candidates on both sides.
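
A sketch of this selection rule (Python/NumPy, hypothetical names) is given below. `Sim12[i, j]` holds \(Sim(e_1^i, e_2^j)\) and `Sim21[j, i]` holds \(Sim(e_2^j, e_1^i)\) over the not-yet-aligned entities; the two matrices may differ, since the fused similarity is not symmetric.

```python
import numpy as np

def select_confident_pairs(Sim12, Sim21, theta):
    """Mutual top-1 pairs whose top-1/top-2 margin exceeds theta in both directions."""
    pairs = []
    best_target = Sim12.argmax(axis=1)     # most similar target entity for each source entity
    best_source = Sim21.argmax(axis=1)     # most similar source entity for each target entity
    for i, j in enumerate(best_target):
        if best_source[j] != i:            # require mutual top-1 preference
            continue
        delta1 = Sim12[i, j] - np.partition(Sim12[i], -2)[-2]   # margin from e_1's side
        delta2 = Sim21[j, i] - np.partition(Sim21[j], -2)[-2]   # margin from e_2's side
        if delta1 > theta and delta2 > theta:
            pairs.append((i, j))
    return pairs
```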

Once we have added the EA results with high confidence to the initial set of entity pairs, we use these anchored pairs (\(S_a\)) to connect the two KGs and supplement them with new facts from each other. For example, if a triple \(t_1 \in T_1\) has both its head and tail entities matching entries in \(S_a\), we replace the entities in \(t_1\) with the corresponding entities in \(E_2\) and add the new triple to \(T_2\). While this may seem like a simple and straightforward approach, it effectively increases the overall coverage of the KGs. Finally, we leverage the augmented KGs to improve the quality of the structural representations, which in turn enhances the EA performance. This iterative completion process is repeated for \(\zeta \) rounds.
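
The completion step itself is a direct substitution over the anchored pairs; a minimal sketch (Python, triples as (head, relation, tail) tuples, `anchors` a hypothetical dict mapping each anchored \(G_1\) entity to its \(G_2\) counterpart) for the \(G_1 \rightarrow G_2\) direction is shown below, the opposite direction being symmetric.

```python
def complete_target_kg(T1, T2, anchors):
    """Copy a G1 triple into G2 whenever both of its entities are anchored (S_a)."""
    T2_new = set(T2)
    for h, r, t in T1:
        if h in anchors and t in anchors:
            T2_new.add((anchors[h], r, anchors[t]))   # transfer the fact to the target KG
    return T2_new
```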

Discussion

Certain EA methods also use bootstrapping or iterative training techniques, but their primary goal is to expand the training signals for updating the embeddings, without modifying the underlying structure of the KGs. Compared to existing approaches for selecting EA pairs, which can be slow and may generate inaccurate results [15, 23], we improve this process by selecting two entities only if they mutually prefer each other. This is empirically validated in Sect. 6.5.5.

5 Experiments

This section reports the experiments with in-depth analysis.

5.1 Experimental Setting

Dataset

We use SRPRS [7] because its KG pairs follow degree distributions similar to those of real-world KGs. It was created using inter-language links and references in DBpedia, and each entity has an equivalent counterpart in the other KG. The relevant details are listed in Table 6.2, and 30% of the entity pairs are utilized for training.

Table 6.2 Statistics of SRPRS

Parameter Settings

For the structural representation learning module, we follow the settings in [7], except that \(d_s\) is set to 300. Regarding the name representation learning module, we set \(\mathbf p = [p_1, \ldots , p_K]\) to \([1, \min , \max ]\). For mono-lingual datasets, we merely use the fastText embeddings [1] as the word embedding (i.e., only one embedding space in Eq. (6.3)). For cross-lingual datasets, the multilingual word embeddings are obtained from MUSE, and two word embedding spaces (one per language) are used in Eq. (6.3). As for the degree-aware fusion module, we set \(d_g\) to 300, \(\gamma \) to 0.8, and the batch size to 32. Stochastic gradient descent is harnessed to minimize the loss function, with the learning rate set to 0.1, and we use early stopping to prevent over-fitting. In the KG completion module, \(\theta \) is set to 0.05 and \(\zeta \) is set to 3.

Evaluation Metric

We use Hits@k (\(k=1\), 10) and the mean reciprocal rank (MRR) as evaluation metrics. For each source entity, entities in the other KG are ranked according to their similarity scores Sim with the source entity in descending order. Hits@k measures the proportion of correctly aligned entities among the top-k entities most similar to the source entity. In particular, Hits@1 indicates the accuracy of the alignment results. MRR, on the other hand, is the average of the reciprocal ranks of the ground-truth results. Higher Hits@k and MRR indicate better performance. Unless stated otherwise, the results of Hits@k are reported as percentages. The best results are displayed in bold in the tables.
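
For reference, a sketch of how Hits@k and MRR can be computed from a similarity matrix (Python/NumPy), under the simplifying assumption that the ground-truth counterpart of source entity i is target entity i, as holds in SRPRS where every entity has exactly one counterpart:

```python
import numpy as np

def hits_and_mrr(Sim, ks=(1, 10)):
    """Hits@k and MRR when the gold match of source entity i is target entity i."""
    n = Sim.shape[0]
    order = np.argsort(-Sim, axis=1)                     # targets ranked by descending similarity
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])
    hits = {k: float((ranks <= k).mean()) for k in ks}
    mrr = float((1.0 / ranks).mean())
    return hits, mrr
```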

Competitors

Overall 13 state-of-the-art methods are involved in comparison. The group that solely utilizes structural feature includes (1) MTransE [4], which proposes to utilize TransE for EA; (2) IPTransE [23], which uses an iterative training process to improve the alignment results; (3) BootEA [15], which devises an alignment-oriented KG embedding framework and a bootstrapping strategy; (4) RSNs [7], which integrates recurrent neural networks with residual learning; (5) MuGNN [2], which puts forward a multichannel graph neural network to learn alignment-oriented KG embeddings; (6) KECG [8], which proposes to jointly learn knowledge embeddings that encode inner-graph relationships, and a cross-graph model that enhances entity embeddings with their neighbors’ information; and (7) TransEdge [16], which presents a novel edge-centric embedding model that contextualizes relation representations in terms of specific head-tail entity pairs.

Various methods have been proposed to incorporate other types of information in EA. JAPE [14] utilizes attributes of entities to refine structural information. GCN [18] generates entity embeddings and attribute embeddings to align entities in different KGs. GM-Align [21] builds a local subgraph of an entity to represent it and utilizes entity name information to initialize the framework. MultiKE [22] offers a novel framework that unifies the views of entity names, relations, and attributes at representation-level for mono-lingual EA. RDGCN [19] proposes a relation-aware dual-graph convolutional network to incorporate relation information via attentive interactions between KG and its dual relation counterpart. HGCN [20] is a learning framework that jointly learns entity and relation representations for EA.

5.2 Results

Table 6.3 presents the results. The first group of approaches only use structural information for alignment. BootEA and KECG outperform MTransE and IPTransE because of their alignment-oriented KG embedding framework and attention-based graph embedding model, respectively. RSNs further improves the results by taking into account long-term relational dependencies between entities, which can capture more structural signals for alignment. TransEdge achieves the best performance due to its edge-centric KG embedding and bootstrapping strategy. MuGNN fails to produce effective results as there are no aligned relations on SRPRS, which prevents the rule transferring from taking place and limits the number of detected rules. It is noteworthy that Hits@1 values on most datasets are below 50%, demonstrating the inadequacy of solely relying on KG structure, especially when long-tail entities make up the majority.

Table 6.3 Overall results of entity alignment

Regarding the second group, both GCN and JAPE exploit attribute information to complement structural signals. However, they fail to outperform the leading method in the first group, which can be attributed to the limited effect of attributive information. The other four methods make use of the publicly available entity name data. The substantial improvement in results compared to those of the first group confirms the value of this feature. Our framework, DAT, demonstrates its superiority over GM-Align, RDGCN, and HGCN with a 10% improvement in Hits@1 over all datasets, validating the effectiveness of exploiting entity name information. The fundamental explanation for this is that the fusion of features on the representation level by GM-Align, RDGCN, and HGCN may lead to information loss since the resulting merged feature representation may not retain the distinguishing features of the original ones. On the other hand, DAT adopts a co-attention network to compute feature weights and fuse features at the output level, which is based on feature-specific similarity scores.

Evaluation by Degree

We present the outcomes of DAT in terms of degree to illustrate its ability to align long-tail entities, as shown in Table 6.4. It is worth noting that the degree pertains to the original degree distribution since the entity degree may be changed by the completion process.

Table 6.4 Hits@1 results by degrees

Table 6.4 indicates that for entities with a degree of 1, the Hits@1 scores of DAT are two or three times higher than those of RSNs, confirming the capability of DAT in handling the long-tail problem. While there is also an improvement in the performance of DAT for popular entities, the gap between DAT and RSNs is much smaller than that observed in the case of long-tail entities. Furthermore, DAT outperforms RDGCN in all degree categories across four datasets, despite both using entity name information as an external signal for EA.

Comparison with MultiKE on Dense Datasets

We do not provide the results of MultiKE on SRPRS because it can only handle datasets in a single language and requires prior knowledge of the relations' semantics. However, to better understand DAT, we present the experimental results of DAT on the dense datasets that MultiKE was previously evaluated on. Specifically, the dense datasets, \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) and \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\), are similar to \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\), but have a larger scale (100K entities on each side) and higher density [15].

When evaluated on dense datasets, DAT produces superior results with Hits values exceeding 90% and MRR surpassing 0.95, as presented in Table 6.5. This indicates that DAT effectively utilizes name information, which can be credited to the degree-aware feature fusion module and the approach of first computing scores within each view rather than learning a merged representation that may result in the loss of information.

Table 6.5 Experimental results on dense datasets

5.3 Ablation Study

We report an ablation study on \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) dataset in Table 6.6.

Table 6.6 Experimental results of ablation

Iterative KG Completion

If we remove the entire module, the performance of EA drops by 3.7% on Hits@1 (comparing DAT with DAT w/o IKGC). However, if we eliminate only the KG completion module while keeping the iterative process (similar to [23]), Hits@1 decreases by 1.9% (DAT vs. DAT w/o KGC). This validates the significance of KG completion. We also present the dynamic change of the degree distribution after each round (original, R1, R2, R3) in Fig. 6.5, which suggests that the embedded KG completion improves KG coverage and reduces the number of long-tail entities.

Fig. 6.5 Distribution of entity degree in \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\)

Degree-Aware Co-attention Feature Fusion

In Table 6.6, it can be observed that if the fixed equal weights are used instead of the degree-aware fusion module, the Hits@1 decreases by 2.7% (DAT vs. DAT w/o ATT). This result confirms that adjusting the weights of features dynamically based on their degree leads to better integration of features and, as a result, more accurate alignment results. In Fig. 6.6, we present the weight of the structural representation generated by our degree-aware fusion model across different degrees (in the first round). This figure demonstrates that, in general, the importance of structural information increases with the degree of entities, which is in line with our expectations.

Fig. 6.6 Weight distribution of structural representation

Concatenated Power Mean Word Embeddings

We compare concatenated power mean word embeddings and averaged word embeddings in terms of aligning entities, denoted as DAT and DAT w/o CPM, respectively. The findings indicate that combining multiple power mean embeddings effectively captures more alignment features.

5.4 Error Analysis

We conduct an error analysis on the \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) dataset to investigate the contribution of each module and the cases where DAT falls short. Using only structural information leads to a high error rate of 65.5% on Hits@1. The dataset contains 67.0% long-tail (i.e., with degree \(\leq \)3) entities, with a majority (65.1%) being misaligned. However, incorporating entity name information and dynamically fusing it with structural information significantly reduces the overall Hits@1 error rate to 27.9%, with a corresponding reduction in the long-tail entity error rate to 33.2%. Furthermore, we employ iterative KG completion to replenish structure and propagate signals, which further decreases the overall Hits@1 error rate to 24.2%. This approach also reduces the percentage of long-tail entities to 49.7%, with only 8.3% being misaligned. Overall, our results indicate that long-tail entities initially account for most of the errors, but employing the proposed techniques reduces not only the error rate but also the contribution of long-tail entities to the overall error.

For the cases that DAT cannot solve, we provide an analysis focusing on entity name information. Among the incorrect cases (24.2% in \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\)), 41% do not have an appropriate entity name embedding because all the words in the name are out-of-vocabulary (OOV), and 31% contain partial OOV words. Additionally, 15% could have been aligned correctly by using the name information alone but were misled by structural signals, while 13% fail to align because of either the inadequacy of the entity name representation method or the fact that entities with the same name refer to different physical objects.

5.5 Further Experiment

We substantiate the efficacy of our iterative training approach by performing the following experiments.

Our iterative approach differs from current methods not only in the embedded KG completion procedure but also in the choice of confident pairs. To showcase its advantage, we remove the KG completion module from DAT, obtaining DAT-I, and compare its pair selection strategy with those of [15, 23]. In [23], the authors use a threshold-based method (TH) to find pairs: for each nonaligned source entity, it identifies the most similar nonaligned target entity, and if the similarity between the two entities exceeds a specified threshold, they are deemed a confident pair. In [15], the authors use a maximum weight graph matching (MWGM) method to find confident entity alignment pairs: for each source entity, it calculates the alignment likelihood to every target entity, and only those with likelihood above a given threshold are considered in a maximum likelihood matching process under a 1-to-1 mapping constraint, which yields a solution containing confident EA pairs. We implement both methods within our framework and adjust the parameters based on the original papers. To evaluate the effectiveness of the various iterative training techniques, we use the number of chosen confident EA pairs, the accuracy of these pairs, and the duration of each round as the primary metrics.

To ensure fairness in the comparison, we present the outcomes of the initial three rounds in Fig. 6.7. The findings indicate that DAT-I outperforms the other two methods regarding the quantity and quality of chosen pairs in a relatively shorter time. As MWGM necessitates solving a global optimization problem, it takes considerably more time. Nonetheless, compared to TH, it performs better in terms of the accuracy of selected pairs.

Fig. 6.7 Comparison results of iterative training strategies. (a) Number of pairs selected. (b) Accuracy of pairs selected. (c) Running time consumption (s)

6 Conclusion

In this chapter, we present an improved framework called DAT for entity alignment, which specifically focuses on handling long-tail entities. Recognizing the limitations of relying solely on structural information, we propose to incorporate entity name information in the pre-alignment phase through concatenated power mean embedding. For alignment, we introduce a co-attention feature fusion network that dynamically adjusts the weights of different features guided by degree to consolidate various signals. In the post-alignment phase, we enhance the performance by iteratively completing the KG with confident EA results as anchors, thereby amplifying the structural information. We evaluate DAT on cross-lingual and mono-lingual EA benchmarks and achieve superior results.