1 Introduction

State-of-the-art EA solutions [2, 3, 4, 5] assume that equivalent entities usually possess similar neighboring information. Consequently, they utilize KG embedding models, e.g., TransE [6], or graph neural network (GNN) models, e.g., GCN [7], to generate structural embeddings of entities in the individual KGs. Then, these separately learned embeddings are projected into a unified embedding space by using the seed entity pairs as connections, so that entities from different KGs become directly comparable. Finally, to determine the alignment results, the majority of current works [1, 8, 9, 10] formalize the alignment process as a ranking problem; that is, for each entity in the source KG, they rank all entities in the target KG according to some distance metric, and the closest one is considered the equivalent target entity.

Example Figure 8.1 shows a partial English KG and a partial Spanish KG concerning the director Hirokazu Koreeda, where the dashed lines indicate known alignments (i.e., seeds). The task of EA is to identify equivalent entity pairs between the two KGs, e.g., (Shoplifters, Manbiki Kazoku).

Fig. 8.1

An example of EA. \(KG_{EN}\) contains the entities Nobody Knows, Japan, Ryo Kase, Still Walking, Kirin Kiki, Shoplifters, and Hirokazu Koreeda; \(KG_{ES}\) contains Japón, Nadie sabe, Aruitemo, Kirin Kiki, Manbiki Kazoku, and Hirokazu Koreeda. The two Hirokazu Koreeda nodes are connected by a dashed line (the seed pair)

Nevertheless, we still observe several issues from current EA works:

  • Reliance on labeled data. Most of the approaches rely on pre-aligned seed entity pairs to connect two KGs and use the unified KG structural embeddings to align entities. These labeled data, however, might not exist in real-life settings. For instance, in the example, the equivalence between Hirokazu Koreeda in \(KG_{EN}\) and Hirokazu Koreeda in \(KG_{ES}\) might not be known in advance. In this case, state-of-the-art methods that solely rely on the structural information would fall short, as there are no seeds to connect these individual KGs.

  • Closed-domain setting. All current EA solutions work under the closed-domain setting [11]; that is, they assume that every entity in the source KG has an equivalent entity in the target KG. Nevertheless, in practical settings, there always exist unmatchable entities. For instance, in the example, the source entity Ryo Kase has no equivalent entity in the target KG. Therefore, an ideal EA system should be capable of predicting the unmatchable entities.

In response to these issues, we put forward an unsupervised EA solution UEA that is capable of addressing the unmatchable problem. Specifically, to mitigate the reliance on labeled data, we mine useful features from the KG side information and use them to produce preliminary pseudo-labeled data. These preliminary seeds are forwarded to our devised progressive learning framework to generate unified KG structural representations, which are integrated with the side information to provide a more comprehensive view for alignment. This framework also progressively augments the training data and improves the alignment results in a self-training fashion. Besides, to tackle the unmatchable issue, we design an unmatchable entity prediction module, which leverages thresholded bidirectional nearest neighbor search (TBNNS) to filter out the unmatchable entities and excludes them from the alignment results. We embed the unmatchable entity prediction module into the progressive learning framework to control the pace of progressive learning by dynamically adjusting the thresholds in TBNNS.

Furthermore, considering that the pseudo-labeled data generated during the progressive learning process might be of different quality, we introduce the concept of confidence to measure the probability that an entity pair is correct. We further incorporate these confidence scores into KG representation learning with the aim of producing more accurate structural embeddings. Through empirical studies, we demonstrate that the confidence-based framework, CUEA, delivers more stable performance than UEA regardless of the quality of the input side information and is particularly useful when the side information is low-grade.

Contribution

The main contributions of the chapter can be summarized as follows:

  • We identify the deficiencies of existing EA methods, i.e., requiring labeled data and working under the closed-domain setting, and propose an unsupervised EA framework UEA, as well as a confidence-based extension CUEA, that are able to deal with unmatchable entities. This is done by (1) exploiting the side information of KGs to generate preliminary pseudo-labeled data; (2) devising an unmatchable entity prediction module that leverages the (confidence-based) thresholded bidirectional nearest neighbor search strategy to produce alignment results, which can effectively exclude unmatchable entities; and (3) offering a progressive learning algorithm to improve the quality of KG embeddings and enhance the alignment performance.

  • We empirically evaluate our proposals against state-of-the-art methods, and the comparative results demonstrate their superiority.

Organization

In Sect. 8.2, we formally define the task of EA and introduce related work. Section 8.3 elaborates the framework. In Sect. 8.4, we introduce experimental results and conduct detailed analysis. Section 8.5 concludes this chapter.

2 Task Definition and Related Work

In this section, we formally define the task of EA and then introduce the related work.

Task Definition

The inputs to EA are a source KG \(G_1\) and a target KG \(G_2\). The task of EA is defined as finding the equivalent entities between the KGs, i.e., \(\Psi = \{(u,v)|u\in E_1, v\in E_2, u \leftrightarrow v\}\), where \(E_1\) and \(E_2\) refer to the entity sets in \(G_1\) and \(G_2\), respectively, and \(u \leftrightarrow v\) represents that the source entity u and the target entity v are equivalent, i.e., u and v refer to the same real-world object.

Most current EA solutions assume that there exists a set of seed entity pairs \(\Psi _s = \{(u_s,v_s)|u_s\in E_1, v_s\in E_2, u_s \leftrightarrow v_s\}\). Nevertheless, in this chapter, we focus on unsupervised EA and do not assume the availability of such labeled data.

Unsupervised Entity Alignment

A few methods have investigated the alignment without labeled data. Qu et al. [20] propose an unsupervised approach toward knowledge graph alignment with the adversarial training framework. Nevertheless, the experimental results are extremely poor. He et al. [21] utilize the shared attributes between heterogeneous KGs to generate aligned entity pairs, which are used to detect more equivalent attributes. They perform entity alignment and attribute alignment alternately, leading to more high-quality aligned entity pairs, which are used to train a relation embedding model. Finally, they combine the alignment results generated by attribute and relation triples using a bivariate regression model. The overall procedure of this work might seem similar to our proposed model. However, there are many notable differences; for instance, the KG embeddings in our work are updated progressively, which can lead to more accurate alignment results, and our model can deal with unmatchable entities. We empirically demonstrate the superiority of our model in Sect. 8.4.

We notice that there are some entity resolution (ER) approaches established in a setting similar to EA, represented by PARIS [22]. They adopt collective alignment algorithms such as similarity propagation so as to model the relations among entities. We include them in the experimental study for the comprehensiveness of the chapter.

3 Methodology

In this section, we first introduce the outline of our proposal. Then, we elaborate on each of its modules, starting with the processing of side information to produce preliminary alignment seeds.

3.1 Model Outline

As shown in Fig. 8.2, given two KGs, CUEA first mines useful features from the side information. These features are forwarded to the unmatchable entity prediction module to generate initial alignment results with confidence scores, which are regarded as pseudo-labeled data. Then, the progressive learning framework uses these pseudo seeds, along with the probability scores, to connect two KGs and learn unified entity structural embeddings. It further combines the alignment signals from the side information and structural information to provide a more comprehensive view for alignment. Finally, it progressively improves the quality of structural embeddings and augments the alignment results by iteratively updating the pseudo-labeled data with results from the previous round, which also leads to increasingly better alignment. Note that by assigning a confidence score of 1 to all entity pairs, CUEA turns into the UEA model.

Fig. 8.2

Outline of CUEA. The side information mined from the two KGs feeds both the feature fusion step and the unmatchable entity prediction module, which outputs unmatchable entities and seed pairs with confidence scores; confidence-based KG representation learning then supplies structural information back to the fusion step. Arrows in blue represent the progressive learning process. By setting the confidence to 1, the UEA model is restored

3.2 Side Information

There is abundant side information in KGs, such as attributes, descriptions, and classes. In this chapter, we use a particular form of attribute, the entity name, as it exists in the majority of KGs. To make the most of the entity name information, inspired by Zeng et al. [5], we exploit it at both the semantic level and the string level and generate a textual distance matrix between the entities in the two KGs.

More specifically, we use the averaged word embeddings to represent the semantic meanings of entity names. Given the semantic embeddings of a source and a target entity, we obtain the semantic distance score by subtracting their cosine similarity score from 1. We denote the semantic distance matrix between the entities in two KGs as \({\mathbf {M}}^{\mathbf {n}}\), where rows represent source entities, columns denote target entities, and each element in the matrix denotes the distance score between a pair of source and target entities. As for the string-level feature, we adopt the Levenshtein distance [23] to measure the difference between two sequences. We denote the string distance matrix as \({\mathbf {M}}^{\mathbf {l}}\).

To obtain a more comprehensive view for alignment, we combine these two distance matrices and generate the textual distance matrix as \({\mathbf {M}}^{\mathbf {t}} = \alpha {\mathbf {M}}^{\mathbf {n}} + (1-\alpha ){\mathbf {M}}^{\mathbf {l}}\), where \(\alpha \) is a hyper-parameter that balances the weights. Then, we forward the textual distance matrix \({\mathbf {M}}^{\mathbf {t}}\) into the unmatchable entity prediction module to produce alignment results, which are considered as the pseudo-labeled data for training KG structural embeddings. The details are introduced in the next subsection.
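For illustration, the construction of \({\mathbf {M}}^{\mathbf {n}}\), \({\mathbf {M}}^{\mathbf {l}}\), and \({\mathbf {M}}^{\mathbf {t}}\) can be sketched as follows. This is a minimal sketch: the `word_vecs` lookup (e.g., a dictionary of fastText vectors), the whitespace tokenization, and the normalization of the Levenshtein distance by the longer string length are our own assumptions, not details prescribed by the chapter.

```python
import numpy as np
import Levenshtein  # pip install python-Levenshtein

def semantic_distance_matrix(src_names, tgt_names, word_vecs, dim=300):
    """M^n: averaged word embeddings per name, then 1 - cosine similarity."""
    def embed(name):
        vecs = [word_vecs[w] for w in name.lower().split() if w in word_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    S = np.stack([embed(n) for n in src_names])  # |E1| x d
    T = np.stack([embed(n) for n in tgt_names])  # |E2| x d
    S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-8
    T /= np.linalg.norm(T, axis=1, keepdims=True) + 1e-8
    return 1.0 - S @ T.T

def string_distance_matrix(src_names, tgt_names):
    """M^l: Levenshtein distance, here normalized by the longer string."""
    M = np.zeros((len(src_names), len(tgt_names)))
    for i, a in enumerate(src_names):
        for j, b in enumerate(tgt_names):
            M[i, j] = Levenshtein.distance(a, b) / max(len(a), len(b), 1)
    return M

def textual_distance_matrix(Mn, Ml, alpha=0.5):
    """M^t = alpha * M^n + (1 - alpha) * M^l."""
    return alpha * Mn + (1 - alpha) * Ml
```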

Remark

The goal of this step is to exploit available side information to generate useful features for alignment. Other types of side information, e.g., attributes and entity descriptions, can also be leveraged. Besides, more advanced textual encoders, such as misspelling oblivious word embeddings [24] and convolutional embedding for edit distance [25], can be utilized. We will investigate them in the future.

3.3 Unmatchable Entity Prediction

State-of-the-art EA solutions generate for each source entity a corresponding target entity and fail to consider the potential unmatchable issue. Nevertheless, as mentioned in [12], in real-life settings, KGs contain entities that other KGs do not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of entities in YAGO 4 are related to movies, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB. These unmatchable entities would increase the difficulty of EA. Therefore, in this chapter, we devise an unmatchable entity prediction module to predict the unmatchable entities and filter them out from the alignment results.

3.3.1 Thresholded Bidirectional Nearest Neighbor Search

More specifically, we put forward a novel strategy, i.e., thresholded bidirectional nearest neighbor search (TBNNS), to generate the alignment results, and the resulting unaligned entities are predicted to be unmatchable. As can be observed from Algorithm 1, given a source entity u and a target entity v, if u and v are the nearest neighbor of each other, and the distance between them is below a given threshold \(\theta \), we consider \((u,v)\) as an aligned entity pair. Note that \(\mathbf M(u,v)\) represents the element in the u-th row and v-th column of the distance matrix \(\mathbf M\).

Algorithm 1: TBNNS in the unmatchable entity prediction module
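The pseudo-code is omitted here; the following is a minimal Python sketch of the strategy just described, assuming \(\mathbf M\) is a NumPy distance matrix with source entities as rows and target entities as columns:

```python
import numpy as np

def tbnns(M, theta):
    """Thresholded bidirectional nearest neighbor search (sketch of Algorithm 1).

    Returns aligned (u, v) index pairs; entities appearing in no returned
    pair are predicted to be unmatchable.
    """
    pairs = []
    nn_of_src = M.argmin(axis=1)  # nearest target for each source entity
    nn_of_tgt = M.argmin(axis=0)  # nearest source for each target entity
    for u, v in enumerate(nn_of_src):
        # u and v must be each other's nearest neighbor, and their
        # distance must fall below the threshold theta
        if nn_of_tgt[v] == u and M[u, v] < theta:
            pairs.append((u, v))
    return pairs
```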

The TBNNS strategy exerts strong constraints on alignment, since it requires that the matched entities should both prefer each other the most, and the distance between their embeddings should be below a certain value. Therefore, it can effectively predict unmatchable entities and prevent them from being aligned. Notably, the threshold \(\theta \) plays a significant role in this strategy. A larger threshold would lead to more matches, whereas it would also increase the risk of including erroneous matches or unmatchable entities. In contrast, a small threshold would only lead to a few aligned entity pairs, and almost all of them would be correct. This is further discussed and verified in Sect. 8.4.4. Therefore, our progressive learning framework dynamically adjusts the threshold value to produce more accurate alignment results (to be discussed in the next subsection).

3.3.2 Confidence-Based TBNNS

Considering that the aligned entity pairs generated by TBNNS are of different quality (i.e., some are true, while some are not), we further put forward confidence-based TBNNS, C-TBNNS, to measure the confidence of an entity pair being true. Specifically, we define the confidence score \(\Theta \) of an entity pair \((u,v)\) as:

$$\displaystyle \begin{aligned} {} \Theta(u,v)= \mathbf M(u,v^\prime) - \mathbf M(u,v) + \mathbf M(v,u^\prime) - \mathbf M(v,u), \end{aligned} $$
(8.1)

where \(\Delta _1 = \mathbf M(u,v^\prime ) - \mathbf M(u,v)\) denotes the gap between the distance scores of the top two closest entities (i.e., v and \(v^\prime \)) to entity u, while \(\Delta _2 = \mathbf M(v,u^\prime ) - \mathbf M(v,u)\) denotes the gap between the distance scores of the top two closest entities (i.e., u and \(u^\prime \)) to entity v. This is based on the intuition that, for an entity pair \((u,v)\), if the distance between them is the smallest from both sides and there are larger margins between the distances of the top two candidates, it would be more confident to consider them as a correct entity pair. We further restrict the confidence scores to a certain range:

$$\displaystyle \begin{aligned} {} \Theta(\mathcal{S}) = (1-\lambda) \frac{\Theta(\mathcal{S}) - \min\{\Theta(\mathcal{S})\}}{\max\{\Theta(\mathcal{S})\} - \min\{\Theta(\mathcal{S})\}} + \lambda \end{aligned} $$
(8.2)

where \(\Theta (\mathcal {S})\) represents the confidence scores of the entity pairs in \(\mathcal {S}\). The core of Eq. (8.2) is min-max normalization, which converts the confidence scores to \([0,1]\). We add a hyper-parameter \(\lambda \in [0,1]\) to further restrict the range of the confidence scores to \([\lambda ,1]\). Thus, by setting \(\lambda \) to 1, all entity pairs receive the same confidence score of 1, and C-TBNNS reduces to TBNNS. Hence, C-TBNNS can be regarded as a generalization of TBNNS, introducing the concept of confidence (probability) into the alignment result generation process.
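A minimal sketch of Eqs. (8.1) and (8.2), applied to the pairs returned by TBNNS above (since each pair is a mutual nearest neighbor, the two smallest entries of row \(u\) and column \(v\) give the margins \(\Delta_1\) and \(\Delta_2\)); the handling of the degenerate case where all raw scores are equal is our own assumption:

```python
import numpy as np

def confidence_scores(M, pairs, lam=0.4):
    """Eq. (8.1) margins, then Eq. (8.2) min-max scaling into [lam, 1]."""
    raw = []
    for u, v in pairs:
        top2_u = np.sort(M[u, :])[:2]   # distances of the two closest targets to u
        top2_v = np.sort(M[:, v])[:2]   # distances of the two closest sources to v
        delta1 = top2_u[1] - top2_u[0]  # M(u, v') - M(u, v)
        delta2 = top2_v[1] - top2_v[0]  # M(v, u') - M(v, u)
        raw.append(delta1 + delta2)
    raw = np.asarray(raw)
    span = raw.max() - raw.min()
    norm = (raw - raw.min()) / span if span > 0 else np.ones_like(raw)
    return (1 - lam) * norm + lam       # confidence in [lam, 1]
```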

3.4 The Progressive Learning Framework

To exploit the rich structural patterns in KGs that could provide useful signals for alignment, we design a progressive learning framework to combine structural and textual features for alignment and improve the quality of both structural embeddings and alignment results in a self-training fashion.

3.4.1 Knowledge Graph Representation Learning

As mentioned above, we forward the textual distance matrix \({\mathbf {M}}^{\mathbf {t}}\) generated by using the side information to the unmatchable entity prediction module to produce the preliminary alignment results, which are considered as pseudo-labeled data for learning unified KG embeddings. Concretely, following [18], we adopt GCN to capture the neighboring information of entities. Since the implementation details are not the focus of this chapter, we omit them; they can be found in [18].
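Although we defer the implementation details to [18], the vanilla GCN propagation rule of [7] can be sketched as follows; this is a generic illustration rather than the exact architecture used in our model:

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One vanilla GCN step: act(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # inverse sqrt degrees
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return act(A_norm @ H @ W)                     # aggregate neighbors, transform
```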

Alignment Objective

Since the representations of the source and target KGs are learned individually, they need to be projected into a unified embedding space, where the entities across KGs can be compared directly. To this end, we use a semi-supervised loss function that enforces the distance between the embeddings of entities in labeled pairs to be small and the distance for negative samples (i.e., nonequivalent entity pairs) to be large. Formally:

$$\displaystyle \begin{aligned} {} \mathcal{L} = \sum_{(u,v)\in\mathcal{S}} \sum_{(u^\prime,v^\prime)\in\mathcal{S}^\prime_{(u,v)}} [d(\mathbf u, \mathbf v) + \gamma - d(\mathbf u^\prime, \mathbf v^\prime)]_+ , \end{aligned} $$
(8.3)

where \([\cdot ]_+ = \max \{0,\cdot \}\), \((u,v)\) is a labeled entity pair from the training data and \(\mathcal {S}^\prime _{(u,v)}\) represents the set of negative entity pairs obtained by corrupting \((u,v)\) using nearest neighbor sampling [1]. \(\mathbf u\) and \(\mathbf v\) represent the embeddings of source and target entities learned by GCN, respectively. \(d(\cdot ,\cdot )\) is the distance function that measures the distance between two embeddings. \(\gamma \) is a hyper-parameter separating positive samples from negative ones.

Confidence-Based Objective

Considering that the pseudo-labeled entity pairs have different probabilities of being true, we incorporate these probabilities into the alignment objective to learn more accurate structural embeddings:

$$\displaystyle \begin{aligned} {} \mathcal{L}_c = \sum_{(u,v)\in\mathcal{S}} \sum_{(u^\prime,v^\prime)\in\mathcal{S}^\prime_{(u,v)}} \Theta(u, v)\ast[d(\mathbf u, \mathbf v) + \gamma - d(\mathbf u^\prime, \mathbf v^\prime)]_+ , \end{aligned} $$
(8.4)

where \(\Theta (u, v)\) is the confidence score attached to each entity pair. Thus, the more confident entity pairs play a more important role during training, while the less confident pseudo pairs have a smaller effect, so that the impact of false positives is mitigated.
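For illustration, Eq. (8.4) can be written as follows in PyTorch; the tensor layout and the helper signature are our own assumptions, and passing all-ones confidence scores recovers Eq. (8.3):

```python
import torch

def confidence_margin_loss(emb1, emb2, pos_pairs, neg_pairs, conf, gamma=3.0):
    """Confidence-weighted margin loss, Eq. (8.4), with Manhattan distance.

    pos_pairs: (N, 2) indices of pseudo-labeled pairs; neg_pairs: (N, K, 2)
    indices of K negatives per pair from nearest neighbor sampling;
    conf: (N,) confidence scores (all ones recovers Eq. (8.3)).
    """
    u = emb1[pos_pairs[:, 0]]                      # (N, d) source embeddings
    v = emb2[pos_pairs[:, 1]]                      # (N, d) target embeddings
    un = emb1[neg_pairs[..., 0]]                   # (N, K, d)
    vn = emb2[neg_pairs[..., 1]]                   # (N, K, d)
    d_pos = (u - v).abs().sum(-1, keepdim=True)    # Manhattan distance, (N, 1)
    d_neg = (un - vn).abs().sum(-1)                # (N, K)
    hinge = torch.relu(d_pos + gamma - d_neg)      # [d(u,v) + gamma - d(u',v')]_+
    return (conf.unsqueeze(-1) * hinge).sum()
```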

Feature Fusion

Given the learned structural embedding matrix \(\mathbf Z\), we calculate the structural distance score between a source and a target entity by subtracting the cosine similarity score between their embeddings from 1. We denote the resultant structural distance matrix as \({\mathbf {M}}^{\mathbf {s}}\). Then, we combine the textual and structural information to generate more accurate signals for alignment: \(\mathbf {M} = \beta {\mathbf {M}}^{\mathbf {t}} + (1-\beta ){\mathbf {M}}^{\mathbf {s}}\), where \(\beta \) is a hyper-parameter that balances the weights. The fused distance matrix \(\mathbf {M}\) can be used to generate more accurate matches.
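A minimal sketch of this fusion step, assuming the embeddings of the two KGs are stored as NumPy matrices `Z1` and `Z2`:

```python
import numpy as np

def fused_distance_matrix(Z1, Z2, Mt, beta=0.5):
    """M = beta * M^t + (1 - beta) * M^s, with M^s = 1 - cosine similarity."""
    Z1 = Z1 / (np.linalg.norm(Z1, axis=1, keepdims=True) + 1e-8)
    Z2 = Z2 / (np.linalg.norm(Z2, axis=1, keepdims=True) + 1e-8)
    Ms = 1.0 - Z1 @ Z2.T                # structural distance matrix M^s
    return beta * Mt + (1 - beta) * Ms  # fused distance matrix M
```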

3.4.2 The Progressive Learning Algorithm

The amount of training data has an impact on the quality of the unified KG embeddings, which in turn affects the alignment performance [3, 26]. Thus, we devise an algorithm (Algorithm 2) to progressively augment the pseudo training data, so as to improve the quality of KG embeddings and enhance the alignment performance. The algorithm starts with learning unified structural embeddings and generating the fused distance matrix \(\mathbf {M}\) by using the preliminary pseudo-labeled data \(\mathcal {S}_0\) (Lines 1–2). Then, the fused distance matrix is used to produce the new alignment results \(\Delta \mathcal {S}\) using C-TBNNS (Line 4). These newly generated entity pairs \(\Delta \mathcal {S}\) are added to the alignment results, which are used for generating the fused distance matrix in the next round (Lines 6–7). The entities in \(\mathcal {S}\) are removed from the entity sets (Lines 9–10). In order to progressively improve the quality of KG embeddings and detect more alignment results, we perform the aforementioned process recursively until the number of newly generated entity pairs falls below a given threshold \(\mu \). Finally, we consider the entity pairs in \(\mathcal {S}\) as the final alignment results \(\Psi \).

Algorithm 2: Progressive learning
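Since the pseudo-code is not reproduced here, the loop can be sketched as follows; the line references in the surrounding text refer to the original Algorithm 2, not to this sketch, and `train_embeddings`, `fuse_distances`, and `c_tbnns` are hypothetical wrappers around the components described earlier:

```python
import numpy as np

def progressive_learning(Mt, theta0=0.05, eta=0.1, theta_max=0.45, mu=30):
    """Sketch of Algorithm 2; confidence bookkeeping (Eq. (8.2)) is elided."""
    theta = theta0
    S = set(c_tbnns(Mt, theta))               # preliminary pseudo seeds S_0
    while True:
        Z1, Z2 = train_embeddings(S)          # unified embeddings via Eq. (8.4)
        M = fuse_distances(Mt, Z1, Z2)        # beta * M^t + (1 - beta) * M^s
        for u, v in S:                        # matched entities are excluded
            M[u, :] = np.inf                  # from the remaining search space
            M[:, v] = np.inf
        delta = set(c_tbnns(M, theta))        # new matches this round
        S |= delta
        theta = min(theta + eta, theta_max)   # dynamic threshold adjustment
        if len(delta) < mu:                   # too few new pairs: terminate
            return S                          # final alignment results Psi
```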

Notably, in the learning process, once a pair of entities is considered as a match, the entities will be removed from the entity sets (Lines 5–6 and Lines 12–13). This could gradually reduce the alignment search space and lower the difficulty for aligning the rest of the entities. Obviously, this strategy suffers from the error propagation issue, which, however, could be effectively mitigated by the progressive learning process that dynamically adjusts the threshold. We will verify the effectiveness of this setting in Sect. 8.4.3.

3.4.3 Dynamic Threshold Adjustment

It can be observed from Algorithm 2 that the matches generated by the unmatchable entity prediction module are part of not only the eventual alignment results but also the pseudo training data for learning subsequent structural embeddings. Therefore, to enhance the overall alignment performance, the alignment results generated in each round should, ideally, have both large quantity and high quality. Unfortunately, these two goals cannot be achieved at the same time. This is because, as stated in Sect. 8.3.3, a larger threshold in TBNNS can generate more alignment results (large quantity), whereas some of them might be erroneous (low quality). These wrongly aligned entity pairs can cause the error propagation problem and result in more erroneous matches in the following rounds. In contrast, a smaller threshold leads to fewer alignment results (small quantity), while almost all of them are correct (high quality).

To address this issue, we aim to balance between the quantity and the quality of the matches generated in each round. An intuitive idea is to set the threshold to a moderate value. However, this fails to take into account the characteristics of the progressive learning process. That is, in the beginning, the quality of the matches should be prioritized, as these alignment results will have a long-term impact on the subsequent rounds. In comparison, in the later stages where most of the entities have been aligned, the quantity is more important, as we need to include more possible matches that might not have a small distance score. In this connection, we set the initial threshold \(\theta _0\) to a very small value so as to reduce potential errors. Then, in the following rounds, we gradually increase the threshold by \(\eta \), so that more possible matches could be detected. We will empirically validate the superiority of this strategy over a fixed threshold in Sect. 8.4.3.

Notably, our proposed confidence-based framework CUEA can further help mitigate the low-quality issue, since we calculate and assign a confidence score to each entity pair, where wrongly aligned entity pairs would presumably receive lower confidence scores and thus exert smaller influence on the subsequent alignment process.

Remark

As mentioned in the related work, there are some existing EA approaches that exploit the iterative learning (bootstrapping) strategy to improve EA performance. Particularly, BootEA calculates for each source entity the alignment likelihood to every target entity and includes those with likelihood above a given threshold in a maximum likelihood matching process under the 1-to-1 mapping constraint, producing a solution containing confident EA pairs [15]. This strategy is also adopted by [8, 16]. Zhu et al. use a threshold to select the entity pairs with very close distances as the pseudo-labeled data [14]. DAT employs a bidirectional margin-based constraint to select the confident EA pairs as labels [17]. Our progressive learning strategy differs from these existing solutions in four aspects: (1) we exclude the entities in the confident EA pairs from the test sets; (2) we use the dynamic threshold adjustment strategy to control the pace of the learning process; (3) our strategy can deal with unmatchable entities; and (4) we attach a confidence score to each selected entity pair, which can mitigate the negative influence of false positives on the KG representation learning process as well as the alignment results. The superiority of our strategy is validated in Sect. 8.4.3.

4 Experiment

This section reports the experimental results with in-depth analysis. The source code is available at https://github.com/DexterZeng/UEA.

4.1 Experimental Settings

Datasets

Following existing works, we adopt the DBP15K dataset [3] for evaluation. This dataset consists of three multilingual KG pairs extracted from DBpedia. Each KG pair contains 15,000 inter-language links as gold standards. The statistics can be found in Table 8.1. We note that state-of-the-art studies merely consider the labeled entities and divide them into training and testing sets. Nevertheless, as can be observed from Table 8.1, there exist unlabeled entities, e.g., 4,388 and 4,572 entities in the Chinese and English KG of \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\), respectively. In this connection, we adapt the dataset by including the unmatchable entities. Specifically, for each KG pair, we keep 30% of the labeled entity pairs as the training set (for training the supervised or semi-supervised methods). Then, to construct the test set, we include the rest of the entities in the first KG and the rest of the labeled entities in the second KG, so that the unlabeled entities in the first KG become unmatchable. The statistics of the test sets can be found in the test set column in Table 8.1.
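For concreteness, a sketch of this test set construction follows (function and variable names are ours, not taken from the released code):

```python
import random

def adapt_dbp15k(labeled_pairs, kg1_entities, train_ratio=0.3, seed=0):
    """Keep 30% of labeled pairs for training; unlabeled KG1 entities
    then appear only in the test set and become unmatchable."""
    rng = random.Random(seed)
    pairs = list(labeled_pairs)
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    train, gold = pairs[:cut], pairs[cut:]
    train_src = {u for u, _ in train}
    test_src = [u for u in kg1_entities if u not in train_src]  # incl. unlabeled
    test_tgt = [v for _, v in gold]  # only the remaining labeled KG2 entities
    return train, test_src, test_tgt, gold
```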

Table 8.1 The statistics of the evaluation benchmarks

Parameter Settings

For the side information module, we utilize the fastText embeddings [27] as word embeddings. To deal with cross-lingual KG pairs, following [19], we use Google Translate to translate the entity names from one language to another, i.e., translating Chinese, Japanese, and French to English. \(\alpha \) is set to 0.5. For the structural information learning, we set \(\beta \) to 0.5. Following [18], we set \(\gamma \) in the alignment objectives to 3 and adopt Manhattan distance as \(d(\cdot ,\cdot )\). Regarding C-TBNNS, we set \(\lambda \) to 0.4. For progressive learning, we set the initial threshold \(\theta _0\) to 0.05, the incremental parameter \(\eta \) to 0.1, and the termination threshold \(\mu \) to 30. Note that if the threshold \(\theta \) exceeds 0.45, we reset it to 0.45. We use these default values since there is no extra validation set for hyper-parameter tuning.
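For reference, these defaults can be gathered in one place (a convenience sketch; the released code may organize them differently):

```python
# Default hyper-parameters used throughout the experiments (no validation set).
CONFIG = {
    "alpha": 0.5,      # weight of semantic vs. string distance in M^t
    "beta": 0.5,       # weight of textual vs. structural distance in M
    "gamma": 3,        # margin in the alignment objectives, Eqs. (8.3)-(8.4)
    "lambda": 0.4,     # lower bound of confidence scores, Eq. (8.2)
    "theta_0": 0.05,   # initial TBNNS threshold
    "eta": 0.1,        # per-round threshold increment
    "theta_max": 0.45, # cap on the threshold
    "mu": 30,          # termination threshold on new pairs per round
}
```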

Evaluation Metrics

We use precision (P), recall (R), and F1 score as evaluation metrics. The precision is computed as the number of correct matches divided by the number of matches found by a method. The recall is computed as the number of correct matches found by a method divided by the number of gold matches. The F1 score is the harmonic mean between precision and recall. The bold figures in the tables represent the best results.
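These metrics can be computed directly from the predicted and gold match sets; a minimal sketch:

```python
def evaluate(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)            # correct matches
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```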

Competitors

We select the most performant state-of-the-art solutions for comparison. Within the group that solely utilizes structural information, we compare with BootEA [15], TransEdge [8], MRAEA [26], and SSP [28]. Among the methods incorporating other sources of information, we compare with GCN-Align [18], HMAN [9], HGCN [4], RE-GCN [29], DAT [17], and RREA [30]. We also include the unsupervised approaches, i.e., IMUSE [21] and PARIS [22]. To make a fair comparison, we only use entity name labels as the side information.

4.2 Results

Table 8.2 reports the alignment results, which shows that state-of-the-art supervised or semi-supervised methods have rather low precision values. This is because these approaches cannot predict the unmatchable source entities and instead generate a target entity for every source entity (including the unmatchable ones). Particularly, methods incorporating additional information attain relatively better performance than the methods in the first group, demonstrating the benefit of leveraging such information.

Table 8.2 Alignment results

Regarding the unsupervised methods, although IMUSE cannot deal with the unmatchable entities and achieves a low precision score, it outperforms most of the supervised or semi-supervised methods in terms of recall and F1 score. This indicates that, for the EA task, the KG side information is useful for mitigating the reliance on labeled data. In contrast to the abovementioned methods, PARIS attains very high precision, since it only generates matches that it believes to be highly likely, which effectively filters out the unmatchable entities. It also achieves the second best F1 score among all approaches, showcasing its effectiveness when the unmatchable entities are involved. Our proposals, UEA and CUEA, attain the best balance between precision and recall and obtain the best F1 scores, outperforming the second best by a large margin, validating their effectiveness. Notably, although our proposed models do not require labeled data, they achieve even better performance than the most performant supervised methods HMAN and DAT.

Furthermore, it can be seen that, by integrating the notion of confidence into UEA, CUEA achieves comparable results to UEA. At first sight, it seems that assigning confidence scores to entity pairs does not have a large influence on the representation learning and the alignment results. However, this can be ascribed to the fact that the side information is highly effective on these datasets (solely using the string information achieves an F1 score of 0.814, as shown in Table 8.4), which renders the structural information (the part most affected by the confidence scores) less contributive to the overall results. Next, we show that the confidence-based framework is much more useful on datasets with low-quality side information.

4.2.1 Results Using Low-Quality Side Information

We compare the unsupervised approaches under a practical scenario where the side information is of low quality. Specifically, we assume that pre-trained word embeddings as well as machine translation tools are not available. Under this circumstance, to use the entity name information, a viable solution is to compare the name strings directly. However, direct string comparison is ineffective for cross-lingual datasets such as \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) and \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\), where the languages in the source and target KGs are disparate. Hence, we aim to examine the effectiveness of these unsupervised approaches when the side information is of low quality and cannot provide many useful signals for alignment.

We report the results on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) and \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\) in Table 8.3, where the direct comparison between entity name strings serves as the side information. It can be observed that the F1 scores of all methods are very low (compared with those in Table 8.2), revealing that the quality of side information does affect the overall alignment results. Besides, given the low-quality side information, our proposed models UEA and CUEA still outperform the baselines IMUSE and PARIS in terms of the F1 score, demonstrating the effectiveness of the progressive learning framework and the unmatchable entity prediction module. Moreover, it is notable that CUEA achieves better results than UEA in terms of all metrics. This could be attributed to the confidence-based alignment result generation process, which could enable the entity pairs of higher confidence (higher probability of being correct, presumably) to have a larger impact on the representation learning and alignment process.

Table 8.3 Alignment results given low-grade side information
Table 8.4 Ablation results

4.3 Ablation Study

In this subsection, we examine the usefulness of the proposed modules by conducting an ablation study. More specifically, in Table 8.4, we report the results of UEA w/o Unm, which excludes the unmatchable entity prediction module, and UEA w/o Prg, which excludes the progressive learning process. It shows that removing the unmatchable entity prediction module (UEA w/o Unm) brings down the performance on all metrics and datasets, validating its effectiveness in detecting the unmatchable entities and enhancing the overall alignment performance. Besides, without the progressive learning (UEA w/o Prg), the precision increases, while the recall and F1 score values drop significantly. This shows that the progressive learning framework can discover more correctly aligned entity pairs and is crucial to the alignment process.

To provide insights into the progressive learning framework, we report the results of UEA w/o Adj, which does not adjust the threshold, and UEA w/o Excl, which does not exclude the entities in the alignment results from the entity sets during the progressive learning. Table 8.4 shows that setting the threshold to a fixed value (UEA w/o Adj) leads to worse F1 results, verifying that the progressive learning process depends on the choice of the threshold and the quality of the alignment results. We will further discuss the setting of the threshold in the next subsection. Besides, the performance also decreases if we do not exclude the matched entities from the entity sets (UEA w/o Excl), validating that this strategy can indeed reduce the difficulty of aligning entities.

Moreover, we replace our progressive learning framework with other state-of-the-art iterative learning strategies (i.e., MWGM [15], TH [14], and DAT-I [17]) and report the results in Table 8.4. It shows that using our progressive learning framework (UEA) can attain the best F1 score, verifying its superiority.

4.4 Quantitative Analysis

In this subsection, we perform quantitative analysis of the modules in UEA and CUEA.

The Threshold \(\theta \) in TBNNS

We discuss the setting of \(\theta \) to reveal the trade-off between the risk and gain from generating the alignment results in the progressive learning. Identifying a match leads to the integration of additional structural information, which benefits the subsequent learning. However, for the same reason, the identification of a false positive, i.e., an incorrect match, potentially leads to mistakenly modifying the connections between KGs, with the risk of amplifying the error in successive rounds. As shown in Fig. 8.3, a smaller \(\theta \) (e.g., 0.05) brings low risk and low gain; that is, it merely generates a small number of matches, among which almost all are correct. In contrast, a higher \(\theta \) (e.g., 0.45) increases the risk and brings relatively higher gain; that is, it results in many more aligned entity pairs, while a certain portion of them are erroneous. Additionally, using a higher threshold leads to increasingly more alignment results, while for a lower threshold, the progressive learning process barely increases the number of matches. This is consistent with our analysis in Sect. 8.3.3.

Fig. 8.3

Alignment results given different threshold values (number of entities per round on ZH-EN, JA-EN, and FR-EN). Correct-\(\theta \) refers to the number of correct matches generated by the progressive learning framework at each round given the threshold value \(\theta \); Wrong refers to the number of erroneous matches generated in each round

Unmatchable Entity Prediction

Zhao et al. [12] propose an intuitive strategy (U-TH) to predict the unmatchable entities. They set an NIL threshold, and if the distance value between a source entity and its closest target entity is above this threshold, they consider the source entity to be unmatchable. We compare our unmatchable entity prediction strategy with it in terms of the percentage of unmatchable entities that are included in the final alignment results and the F1 score. On \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\), replacing our unmatchable entity prediction strategy with U-TH attains an F1 score of 0.837, which is 8.4% lower than that of UEA. Besides, among the alignment results generated by using U-TH, 18.9% involve unmatchable entities, while this figure for UEA is merely 3.9%. This demonstrates the superiority of our unmatchable entity prediction strategy.

Influence of Parameters

As mentioned in Sect. 8.4.1, we set \(\alpha \) and \(\beta \) to 0.5 since there are no training/validation data. Here, we aim to show that different values of these parameters do not have a large influence on the final results. More specifically, we keep \(\alpha \) at 0.5 and choose \(\beta \) from [0.3, 0.4, 0.5, 0.6, 0.7]; then we keep \(\beta \) at 0.5 and choose \(\alpha \) from [0.3, 0.4, 0.5, 0.6, 0.7]. It can be observed from Fig. 8.4 that, although smaller \(\alpha \) and \(\beta \) lead to better results, the performance does not change significantly.

Fig. 8.4

The F1 scores obtained by setting \(\alpha \) and \(\beta \) to different values on ZH-EN, JA-EN, and FR-EN

The Hyper-Parameter \(\lambda \) in CUEA

We then analyze the influence of \(\lambda \) in Eq. (8.2), which determines the range of the confidence scores, on the final alignment results. To highlight its influence on the structural representation learning, we follow the settings in Sect. 8.4.2.1 and report the results in Table 8.5.

Table 8.5 The influence of \(\lambda \) on the alignment results

Table 8.5 shows that the alignment performance is relatively stable when \(\lambda \) is not too large. Nevertheless, when setting \(\lambda \) to a large value (e.g., 1, which restores UEA), the results drop sharply. This reveals that assigning probability scores to the entity pairs according to their confidence of being true can facilitate the alignment. Besides, generally speaking, CUEA is robust to perturbations of \(\lambda \) (as long as it is not too large).

Influence of Input Side Information

We adopt different side information as input to examine the performance of UEA. More specifically, we report the results of UEA-\({{\mathbf {M}}^{\mathbf {l}}}\), which merely uses the string-level feature of entity names as input, and UEA-\({{\mathbf {M}}^{\mathbf {n}}}\), which only uses the semantic embeddings of entity names as input. We also provide the results of \({{\mathbf {M}}^{\mathbf {l}}}\) and \({{\mathbf {M}}^{\mathbf {n}}}\), which use the string-level and semantic information, respectively, to directly generate alignment results (without progressive learning).

As shown in Table 8.3, the performance of solely using the input side information is not very promising (\({{\mathbf {M}}^{\mathbf {l}}}\) and \({{\mathbf {M}}^{\mathbf {n}}}\)). Nevertheless, by forwarding the side information into our model, the results of UEA-\({{\mathbf {M}}^{\mathbf {l}}}\) and UEA-\({{\mathbf {M}}^{\mathbf {n}}}\) become much better. This reveals that UEA can work with different types of side information and consistently improve the alignment results. Additionally, by comparing UEA-\({{\mathbf {M}}^{\mathbf {l}}}\) with UEA-\({{\mathbf {M}}^{\mathbf {n}}}\), it is evident that the input side information does affect the final results, and the quality of the side information is significant to the overall alignment performance.

Pseudo-Labeled Data

We further examine the usefulness of the preliminary alignment results generated by the side information, i.e., the pseudo-labeled data. Concretely, we replace the training data in HGCN with these pseudo-labeled data, resulting in HGCN-U, and then compare its alignment results with the original performance. Regarding the F1 score, HGCN-U is 4% lower than HGCN on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\), 2.9% lower on \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\), and 2.8% lower on \({\mathtt {DBP15K}_{\mathtt {FR-EN}}}\). The minor difference validates the effectiveness of the pseudo-labeled data generated by the side information. It also demonstrates that this strategy can be applied to other supervised or semi-supervised frameworks to reduce their reliance on labeled data.

5 Conclusion

In this chapter, we propose an unsupervised EA solution that is capable of dealing with unmatchable entities. We first exploit the side information of KGs to generate preliminary alignment results, which are considered as pseudo-labeled data and forwarded to the progressive learning framework to produce better KG embeddings and alignment results in a self-training fashion. We also devise an unmatchable entity prediction module to detect the unmatchable entities. The experimental results validate the usefulness of our proposed model and its superiority over state-of-the-art approaches.