1 Introduction

State-of-the-art EA solutions [2, 3, 4, 5] assume that equivalent entities usually possess similar neighboring information. Consequently, they utilize KG embedding models, e.g., TransE [6], or graph neural network (GNN) models, e.g., GCN [7], to generate structural embeddings of entities in the individual KGs. Then, these separately learned embeddings are projected into a unified embedding space by using the seed entity pairs as connections, so that entities from different KGs become directly comparable. Finally, to determine the alignment results, the majority of current works [1, 8, 9, 10] formalize the alignment process as a ranking problem; that is, for each entity in the source KG, they rank all entities in the target KG according to some distance metric, and the closest one is considered the equivalent target entity.

Example Figure 8.1 shows a partial English KG and a partial Spanish KG concerning the director Hirokazu Koreeda, where the dashed lines indicate known alignments (i.e., seeds). The task of EA is to identify equivalent entity pairs between the two KGs, e.g., (Shoplifters, Manbiki Kazoku).

Fig. 8.1

An example of EA. \(KG_{EN}\) contains the entities Nobody Knows, Japan, Ryo Kase, Still Walking, Kirin Kiki, Shoplifters, and Hirokazu Koreeda; \(KG_{ES}\) contains Japón, Nadie sabe, Aruitemo, Kirin Kiki, Manbiki Kazoku, and Hirokazu Koreeda. The two Hirokazu Koreeda nodes are connected by a dashed line (the seed pair)

Nevertheless, we still observe several issues from current EA works:

  • Reliance on labeled data. Most of the approaches rely on pre-aligned seed entity pairs to connect two KGs and use the unified KG structural embeddings to align entities. These labeled data, however, might not exist in real-life settings. For instance, in the example, the equivalence between Hirokazu Koreeda in \(KG_{EN}\) and Hirokazu Koreeda in \(KG_{ES}\) might not be known in advance. In this case, state-of-the-art methods that solely rely on the structural information would fall short, as there are no seeds to connect these individual KGs.

  • Closed-domain setting. All current EA solutions work under the closed-domain setting [11]; that is, they assume that every entity in the source KG has an equivalent entity in the target KG. Nevertheless, in practical settings, there always exist unmatchable entities. For instance, in the example, the source entity Ryo Kase has no equivalent entity in the target KG. Therefore, an ideal EA system should be capable of predicting the unmatchable entities.

In response to these issues, we put forward an unsupervised EA solution UEA that is capable of addressing the unmatchable problem. Specifically, to mitigate the reliance on labeled data, we mine useful features from the KG side information and use them to produce preliminary pseudo-labeled data. These preliminary seeds are forwarded to our devised progressive learning framework to generate unified KG structural representations, which are integrated with the side information to provide a more comprehensive view for alignment. This framework also progressively augments the training data and improves the alignment results in a self-training fashion. Besides, to tackle the unmatchable issue, we design an unmatchable entity prediction module, which leverages thresholded bidirectional nearest neighbor search (TBNNS) to filter out the unmatchable entities and excludes them from the alignment results. We embed the unmatchable entity prediction module into the progressive learning framework to control the pace of progressive learning by dynamically adjusting the thresholds in TBNNS.

Furthermore, considering that the pseudo-labeled data generated during the progressive learning process might be of different quality, we introduce the concept of confidence to measure the probability that an entity pair is correct. We further incorporate these confidence scores into KG representation learning with the aim of producing more accurate structural embeddings. Through empirical studies, we demonstrate that the confidence-based framework, CUEA, delivers more stable performance than UEA regardless of the quality of the input side information and is particularly useful when the side information is low-grade.

Contribution

The main contributions of the chapter can be summarized as follows:

  • We identify the deficiencies of existing EA methods, i.e., requiring labeled data and working under the closed-domain setting, and propose an unsupervised EA framework UEA, as well as a confidence-based extension CUEA, that are able to deal with unmatchable entities. This is done by (1) exploiting the side information of KGs to generate preliminary pseudo-labeled data; (2) devising an unmatchable entity prediction module that leverages the (confidence-based) thresholded bidirectional nearest neighbor search strategy to produce alignment results, which can effectively exclude unmatchable entities; and (3) offering a progressive learning algorithm to improve the quality of KG embeddings and enhance the alignment performance.

  • We empirically evaluate our proposals against state-of-the-art methods, and the comparative results demonstrate their superiority.

Organization

In Sect. 8.2, we formally define the task of EA and introduce related work. Section 8.3 elaborates the framework. In Sect. 8.4, we introduce experimental results and conduct detailed analysis. Section 8.5 concludes this chapter.

2 Task Definition and Related Work

In this section, we formally define the task of EA and then introduce the related work.

Task Definition

The inputs to EA are a source KG \(G_1\) and a target KG \(G_2\). The task of EA is defined as finding the equivalent entities between the KGs, i.e., \(\Psi = \{(u,v)|u\in E_1, v\in E_2, u \leftrightarrow v\}\), where \(E_1\) and \(E_2\) refer to the entity sets in \(G_1\) and \(G_2\), respectively, and \(u \leftrightarrow v\) represents that the source entity u and the target entity v are equivalent, i.e., u and v refer to the same real-world object.

Most current EA solutions assume that there exists a set of seed entity pairs \(\Psi _s = \{(u_s,v_s)|u_s\in E_1, v_s\in E_2, u_s \leftrightarrow v_s\}\). Nevertheless, in this chapter, we focus on unsupervised EA and do not assume the availability of such labeled data.

Unsupervised Entity Alignment

A few methods have investigated the alignment without labeled data. Qu et al. [20] propose an unsupervised approach toward knowledge graph alignment with the adversarial training framework. Nevertheless, the experimental results are extremely poor. He et al. [21] utilize the shared attributes between heterogeneous KGs to generate aligned entity pairs, which are used to detect more equivalent attributes. They perform entity alignment and attribute alignment alternately, leading to more high-quality aligned entity pairs, which are used to train a relation embedding model. Finally, they combine the alignment results generated by attribute and relation triples using a bivariate regression model. The overall procedure of this work might seem similar to our proposed model. However, there are many notable differences; for instance, the KG embeddings in our work are updated progressively, which can lead to more accurate alignment results, and our model can deal with unmatchable entities. We empirically demonstrate the superiority of our model in Sect. 8.4.

We notice that there are some entity resolution (ER) approaches established in a setting similar to EA, represented by PARIS [22]. They adopt collective alignment algorithms such as similarity propagation so as to model the relations among entities. We include them in the experimental study for the comprehensiveness of the chapter.

3 Methodology

In this section, we first introduce the outline of our proposal. Then, we elaborate on each of its modules, starting with the processing of side information to produce preliminary alignment seeds.

3.1 Model Outline

As shown in Fig. 8.2, given two KGs, CUEA first mines useful features from the side information. These features are forwarded to the unmatchable entity prediction module to generate initial alignment results with confidence scores, which are regarded as pseudo-labeled data. Then, the progressive learning framework uses these pseudo seeds, along with the probability scores, to connect two KGs and learn unified entity structural embeddings. It further combines the alignment signals from the side information and structural information to provide a more comprehensive view for alignment. Finally, it progressively improves the quality of structural embeddings and augments the alignment results by iteratively updating the pseudo-labeled data with results from the previous round, which also leads to increasingly better alignment. Note that by assigning a confidence score of 1 to all entity pairs, CUEA turns into the UEA model.

Fig. 8.2

Outline of CUEA. The side information mined from the two KGs feeds both the feature fusion step and the unmatchable entity prediction module, which outputs unmatchable entities and seed pairs with confidence scores; confidence-based KG representation learning then supplies structural information back to the fusion step. Arrows in blue represent the progressive learning process. By setting the confidence to 1, the UEA model is restored

3.2 Side Information

There is abundant side information in KGs, such as attributes, descriptions, and classes. In this chapter, we use a particular form of attribute, the entity name, as it exists in the majority of KGs. To make the most of the entity name information, inspired by Zeng et al. [5], we exploit it at both the semantic level and the string level and generate a textual distance matrix between the entities in the two KGs.

More specifically, we use the averaged word embeddings to represent the semantic meanings of entity names. Given the semantic embeddings of a source and a target entity, we obtain the semantic distance score by subtracting their cosine similarity score from 1. We denote the semantic distance matrix between the entities in two KGs as \({\mathbf {M}}^{\mathbf {n}}\), where rows represent source entities, columns denote target entities, and each element in the matrix denotes the distance score between a pair of source and target entities. As for the string-level feature, we adopt the Levenshtein distance [23] to measure the difference between two sequences. We denote the string distance matrix as \({\mathbf {M}}^{\mathbf {l}}\).

To obtain a more comprehensive view for alignment, we combine these two distance matrices and generate the textual distance matrix as \({\mathbf {M}}^{\mathbf {t}} = \alpha {\mathbf {M}}^{\mathbf {n}} + (1-\alpha ){\mathbf {M}}^{\mathbf {l}}\), where \(\alpha \) is a hyper-parameter that balances the weights. Then, we forward the textual distance matrix \({\mathbf {M}}^{\mathbf {t}}\) into the unmatchable entity prediction module to produce alignment results, which are considered as the pseudo-labeled data for training KG structural embeddings. The details are introduced in the next subsection.
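For illustration, the construction of \({\mathbf {M}}^{\mathbf {n}}\), \({\mathbf {M}}^{\mathbf {l}}\), and \({\mathbf {M}}^{\mathbf {t}}\) can be sketched as follows. This is a minimal sketch: the `word_vecs` lookup (e.g., a dictionary of fastText vectors), the whitespace tokenization, and the normalization of the Levenshtein distance by the longer string length are our own assumptions, not details prescribed by the chapter.

```python
import numpy as np
import Levenshtein  # pip install python-Levenshtein

def semantic_distance_matrix(src_names, tgt_names, word_vecs, dim=300):
    """M^n: averaged word embeddings per name, then 1 - cosine similarity."""
    def embed(name):
        vecs = [word_vecs[w] for w in name.lower().split() if w in word_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    S = np.stack([embed(n) for n in src_names])  # |E1| x d
    T = np.stack([embed(n) for n in tgt_names])  # |E2| x d
    S /= np.linalg.norm(S, axis=1, keepdims=True) + 1e-8
    T /= np.linalg.norm(T, axis=1, keepdims=True) + 1e-8
    return 1.0 - S @ T.T

def string_distance_matrix(src_names, tgt_names):
    """M^l: Levenshtein distance, here normalized by the longer string."""
    M = np.zeros((len(src_names), len(tgt_names)))
    for i, a in enumerate(src_names):
        for j, b in enumerate(tgt_names):
            M[i, j] = Levenshtein.distance(a, b) / max(len(a), len(b), 1)
    return M

def textual_distance_matrix(Mn, Ml, alpha=0.5):
    """M^t = alpha * M^n + (1 - alpha) * M^l."""
    return alpha * Mn + (1 - alpha) * Ml
```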

Remark

The goal of this step is to exploit available side information to generate useful features for alignment. Other types of side information, e.g., attributes and entity descriptions, can also be leveraged. Besides, more advanced textual encoders, such as misspelling oblivious word embeddings [24] and convolutional embedding for edit distance [25], can be utilized. We will investigate them in the future.

3.3 Unmatchable Entity Prediction

State-of-the-art EA solutions generate for each source entity a corresponding target entity and fail to consider the potential unmatchable issue. Nevertheless, as mentioned in [12], in real-life settings, KGs contain entities that other KGs do not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of entities in YAGO 4 are related to movies, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB. These unmatchable entities would increase the difficulty of EA. Therefore, in this chapter, we devise an unmatchable entity prediction module to predict the unmatchable entities and filter them out from the alignment results.

3.3.1 Thresholded Bidirectional Nearest Neighbor Search

More specifically, we put forward a novel strategy, i.e., thresholded bidirectional nearest neighbor search (TBNNS), to generate the alignment results, and the resulting unaligned entities are predicted to be unmatchable. As can be observed from Algorithm 1, given a source entity u and a target entity v, if u and v are the nearest neighbor of each other, and the distance between them is below a given threshold \(\theta \), we consider \((u,v)\) as an aligned entity pair. Note that \(\mathbf M(u,v)\) represents the element in the u-th row and v-th column of the distance matrix \(\mathbf M\).

Algorithm 1: TBNNS in the unmatchable entity prediction module
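The pseudo-code is omitted here; the following is a minimal Python sketch of the strategy just described, assuming \(\mathbf M\) is a NumPy distance matrix with source entities as rows and target entities as columns:

```python
import numpy as np

def tbnns(M, theta):
    """Thresholded bidirectional nearest neighbor search (sketch of Algorithm 1).

    Returns aligned (u, v) index pairs; entities appearing in no returned
    pair are predicted to be unmatchable.
    """
    pairs = []
    nn_of_src = M.argmin(axis=1)  # nearest target for each source entity
    nn_of_tgt = M.argmin(axis=0)  # nearest source for each target entity
    for u, v in enumerate(nn_of_src):
        # u and v must be each other's nearest neighbor, and their
        # distance must fall below the threshold theta
        if nn_of_tgt[v] == u and M[u, v] < theta:
            pairs.append((u, v))
    return pairs
```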

The TBNNS strategy exerts strong constraints on alignment, since it requires that the matched entities should both prefer each other the most, and the distance between their embeddings should be below a certain value. Therefore, it can effectively predict unmatchable entities and prevent them from being aligned. Notably, the threshold \(\theta \) plays a significant role in this strategy. A larger threshold would lead to more matches, whereas it would also increase the risk of including erroneous matches or unmatchable entities. In contrast, a small threshold would only lead to a few aligned entity pairs, and almost all of them would be correct. This is further discussed and verified in Sect. 8.4.4. Therefore, our progressive learning framework dynamically adjusts the threshold value to produce more accurate alignment results (to be discussed in the next subsection).

3.3.2 Confidence-Based TBNNS

Considering that the aligned entity pairs generated by TBNNS are of different quality (i.e., some are true, while some are not), we further put forward confidence-based TBNNS, C-TBNNS, to measure the confidence of an entity pair being true. Specifically, we define the confidence score \(\Theta \) of an entity pair \((u,v)\) as:

$$\displaystyle \begin{aligned} {} \Theta(u,v)= \mathbf M(u,v^\prime) - \mathbf M(u,v) + \mathbf M(v,u^\prime) - \mathbf M(v,u), \end{aligned} $$
(8.1)

where \(\Delta _1 = \mathbf M(u,v^\prime ) - \mathbf M(u,v)\) denotes the gap between the distance scores of the top two closest entities (i.e., v and \(v^\prime \)) to entity u, while \(\Delta _2 = \mathbf M(v,u^\prime ) - \mathbf M(v,u)\) denotes the gap between the distance scores of the top two closest entities (i.e., u and \(u^\prime \)) to entity v. This is based on the intuition that, for an entity pair \((u,v)\), if the distance between them is the smallest from both sides and there are larger margins between the distances of the top two candidates, it would be more confident to consider them as a correct entity pair. We further restrict the confidence scores to a certain range:

$$\displaystyle \begin{aligned} {} \Theta(\mathcal{S}) = (1-\lambda) \frac{\Theta(\mathcal{S}) - \min\{\Theta(\mathcal{S})\}}{\max\{\Theta(\mathcal{S})\} - \min\{\Theta(\mathcal{S})\}} + \lambda \end{aligned} $$
(8.2)

where \(\Theta (\mathcal {S})\) represents the confidence scores of the entity pairs in \(\mathcal {S}\). The core of Eq. (8.2) is min-max normalization, which converts the confidence scores to \([0,1]\). We add a hyper-parameter \(\lambda \in [0,1]\) to further restrict the range of the confidence scores to \([\lambda ,1]\). Thus, by setting \(\lambda \) to 1, all entity pairs receive the same confidence score of 1, and C-TBNNS reduces to TBNNS. Hence, C-TBNNS can be regarded as a generalization of TBNNS, introducing the concept of confidence (probability) into the alignment result generation process.
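A minimal sketch of Eqs. (8.1) and (8.2), applied to the pairs returned by TBNNS above (since each pair is a mutual nearest neighbor, the two smallest entries of row \(u\) and column \(v\) give the margins \(\Delta_1\) and \(\Delta_2\)); the handling of the degenerate case where all raw scores are equal is our own assumption:

```python
import numpy as np

def confidence_scores(M, pairs, lam=0.4):
    """Eq. (8.1) margins, then Eq. (8.2) min-max scaling into [lam, 1]."""
    raw = []
    for u, v in pairs:
        top2_u = np.sort(M[u, :])[:2]   # distances of the two closest targets to u
        top2_v = np.sort(M[:, v])[:2]   # distances of the two closest sources to v
        delta1 = top2_u[1] - top2_u[0]  # M(u, v') - M(u, v)
        delta2 = top2_v[1] - top2_v[0]  # M(v, u') - M(v, u)
        raw.append(delta1 + delta2)
    raw = np.asarray(raw)
    span = raw.max() - raw.min()
    norm = (raw - raw.min()) / span if span > 0 else np.ones_like(raw)
    return (1 - lam) * norm + lam       # confidence in [lam, 1]
```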

3.4 The Progressive Learning Framework

To exploit the rich structural patterns in KGs that could provide useful signals for alignment, we design a progressive learning framework to combine structural and textual features for alignment and improve the quality of both structural embeddings and alignment results in a self-training fashion.

3.4.1 Knowledge Graph Representation Learning

As mentioned above, we forward the textual distance matrix \({\mathbf {M}}^{\mathbf {t}}\) generated by using the side information to the unmatchable entity prediction module to produce the preliminary alignment results, which are considered as pseudo-labeled data for learning unified KG embeddings. Concretely, following [18], we adopt GCN to capture the neighboring information of entities. Since the implementation details are not the focus of this chapter, we omit them; they can be found in [18].
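Although we defer the implementation details to [18], the vanilla GCN propagation rule of [7] can be sketched as follows; this is a generic illustration rather than the exact architecture used in our model:

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One vanilla GCN step: act(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # inverse sqrt degrees
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return act(A_norm @ H @ W)                     # aggregate neighbors, transform
```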

Alignment Objective

Since the representations of the source and target KGs are learned individually, they need to be projected into a unified embedding space, where the entities across KGs can be compared directly. To this end, we use a semi-supervised loss function that enforces the distance between the embeddings of entities in labeled pairs to be small and the distance for negative samples (i.e., nonequivalent entity pairs) to be large. Formally:

$$\displaystyle \begin{aligned} {} \mathcal{L} = \sum_{(u,v)\in\mathcal{S}} \sum_{(u^\prime,v^\prime)\in\mathcal{S}^\prime_{(u,v)}} [d(\mathbf u, \mathbf v) + \gamma - d(\mathbf u^\prime, \mathbf v^\prime)]_+ , \end{aligned} $$
(8.3)

where \([\cdot ]_+ = \max \{0,\cdot \}\), \((u,v)\) is a labeled entity pair from the training data and \(\mathcal {S}^\prime _{(u,v)}\) represents the set of negative entity pairs obtained by corrupting \((u,v)\) using nearest neighbor sampling [1]. \(\mathbf u\) and \(\mathbf v\) represent the embeddings of source and target entities learned by GCN, respectively. \(d(\cdot ,\cdot )\) is the distance function that measures the distance between two embeddings. \(\gamma \) is a hyper-parameter separating positive samples from negative ones.

Confidence-Based Objective

Considering that the pseudo-labeled entity pairs have different probabilities of being true, we incorporate these probabilities into the alignment objective to learn more accurate structural embeddings:

$$\displaystyle \begin{aligned} {} \mathcal{L}_c = \sum_{(u,v)\in\mathcal{S}} \sum_{(u^\prime,v^\prime)\in\mathcal{S}^\prime_{(u,v)}} \Theta(u, v)\ast[d(\mathbf u, \mathbf v) + \gamma - d(\mathbf u^\prime, \mathbf v^\prime)]_+ , \end{aligned} $$
(8.4)

where \(\Theta (u, v)\) is the confidence score attached to each entity pair. Thus, the more confident entity pairs play a more important role during training, while the less confident pseudo pairs have a smaller effect, so that the impact of false positives is mitigated.
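For illustration, Eq. (8.4) can be written as follows in PyTorch; the tensor layout and the helper signature are our own assumptions, and passing all-ones confidence scores recovers Eq. (8.3):

```python
import torch

def confidence_margin_loss(emb1, emb2, pos_pairs, neg_pairs, conf, gamma=3.0):
    """Confidence-weighted margin loss, Eq. (8.4), with Manhattan distance.

    pos_pairs: (N, 2) indices of pseudo-labeled pairs; neg_pairs: (N, K, 2)
    indices of K negatives per pair from nearest neighbor sampling;
    conf: (N,) confidence scores (all ones recovers Eq. (8.3)).
    """
    u = emb1[pos_pairs[:, 0]]                      # (N, d) source embeddings
    v = emb2[pos_pairs[:, 1]]                      # (N, d) target embeddings
    un = emb1[neg_pairs[..., 0]]                   # (N, K, d)
    vn = emb2[neg_pairs[..., 1]]                   # (N, K, d)
    d_pos = (u - v).abs().sum(-1, keepdim=True)    # Manhattan distance, (N, 1)
    d_neg = (un - vn).abs().sum(-1)                # (N, K)
    hinge = torch.relu(d_pos + gamma - d_neg)      # [d(u,v) + gamma - d(u',v')]_+
    return (conf.unsqueeze(-1) * hinge).sum()
```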

Feature Fusion

Given the learned structural embedding matrix \(\mathbf Z\), we calculate the structural distance score between a source and a target entity by subtracting the cosine similarity score between their embeddings from 1. We denote the resultant structural distance matrix as \({\mathbf {M}}^{\mathbf {s}}\). Then, we combine the textual and structural information to generate more accurate signals for alignment: \(\mathbf {M} = \beta {\mathbf {M}}^{\mathbf {t}} + (1-\beta ){\mathbf {M}}^{\mathbf {s}}\), where \(\beta \) is a hyper-parameter that balances the weights. The fused distance matrix \(\mathbf {M}\) can be used to generate more accurate matches.
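A minimal sketch of this fusion step, assuming the embeddings of the two KGs are stored as NumPy matrices `Z1` and `Z2`:

```python
import numpy as np

def fused_distance_matrix(Z1, Z2, Mt, beta=0.5):
    """M = beta * M^t + (1 - beta) * M^s, with M^s = 1 - cosine similarity."""
    Z1 = Z1 / (np.linalg.norm(Z1, axis=1, keepdims=True) + 1e-8)
    Z2 = Z2 / (np.linalg.norm(Z2, axis=1, keepdims=True) + 1e-8)
    Ms = 1.0 - Z1 @ Z2.T                # structural distance matrix M^s
    return beta * Mt + (1 - beta) * Ms  # fused distance matrix M
```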

3.4.2 The Progressive Learning Algorithm

The amount of training data has an impact on the quality of the unified KG embeddings, which in turn affects the alignment performance [3, 26]. Thus, we devise an algorithm (Algorithm 2) to progressively augment the pseudo training data, so as to improve the quality of KG embeddings and enhance the alignment performance. The algorithm starts with learning unified structural embeddings and generating the fused distance matrix \(\mathbf {M}\) by using the preliminary pseudo-labeled data \(\mathcal {S}_0\) (Lines 1–2). Then, the fused distance matrix is used to produce the new alignment results \(\Delta \mathcal {S}\) using C-TBNNS (Line 4). These newly generated entity pairs \(\Delta \mathcal {S}\) are added to the alignment results, which are used for generating the fused distance matrix in the next round (Lines 6–7). The entities in \(\mathcal {S}\) are removed from the entity sets (Lines 9–10). In order to progressively improve the quality of KG embeddings and detect more alignment results, we perform the aforementioned process recursively until the number of newly generated entity pairs falls below a given threshold \(\mu \). Finally, we consider the entity pairs in \(\mathcal {S}\) as the final alignment results \(\Psi \).

Algorithm 2: Progressive learning
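Since the pseudo-code is not reproduced here, the loop can be sketched as follows; the line references in the surrounding text refer to the original Algorithm 2, not to this sketch, and `train_embeddings`, `fuse_distances`, and `c_tbnns` are hypothetical wrappers around the components described earlier:

```python
import numpy as np

def progressive_learning(Mt, theta0=0.05, eta=0.1, theta_max=0.45, mu=30):
    """Sketch of Algorithm 2; confidence bookkeeping (Eq. (8.2)) is elided."""
    theta = theta0
    S = set(c_tbnns(Mt, theta))               # preliminary pseudo seeds S_0
    while True:
        Z1, Z2 = train_embeddings(S)          # unified embeddings via Eq. (8.4)
        M = fuse_distances(Mt, Z1, Z2)        # beta * M^t + (1 - beta) * M^s
        for u, v in S:                        # matched entities are excluded
            M[u, :] = np.inf                  # from the remaining search space
            M[:, v] = np.inf
        delta = set(c_tbnns(M, theta))        # new matches this round
        S |= delta
        theta = min(theta + eta, theta_max)   # dynamic threshold adjustment
        if len(delta) < mu:                   # too few new pairs: terminate
            return S                          # final alignment results Psi
```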

Notably, in the learning process, once a pair of entities is considered as a match, the entities will be removed from the entity sets (Lines 5–6 and Lines 12–13). This could gradually reduce the alignment search space and lower the difficulty for aligning the rest of the entities. Obviously, this strategy suffers from the error propagation issue, which, however, could be effectively mitigated by the progressive learning process that dynamically adjusts the threshold. We will verify the effectiveness of this setting in Sect. 8.4.3.

3.4.3 Dynamic Threshold Adjustment

It can be observed from Algorithm 2 that the matches generated by the unmatchable entity prediction module are part of not only the eventual alignment results but also the pseudo training data for learning subsequent structural embeddings. Therefore, to enhance the overall alignment performance, the alignment results generated in each round should, ideally, have both large quantity and high quality. Unfortunately, these two goals cannot be achieved at the same time. This is because, as stated in Sect. 8.3.3, a larger threshold in TBNNS can generate more alignment results (large quantity), whereas some of them might be erroneous (low quality). These wrongly aligned entity pairs can cause the error propagation problem and result in more erroneous matches in the following rounds. In contrast, a smaller threshold leads to fewer alignment results (small quantity), while almost all of them are correct (high quality).

To address this issue, we aim to balance between the quantity and the quality of the matches generated in each round. An intuitive idea is to set the threshold to a moderate value. However, this fails to take into account the characteristics of the progressive learning process. That is, in the beginning, the quality of the matches should be prioritized, as these alignment results will have a long-term impact on the subsequent rounds. In comparison, in the later stages where most of the entities have been aligned, the quantity is more important, as we need to include more possible matches that might not have a small distance score. In this connection, we set the initial threshold \(\theta _0\) to a very small value so as to reduce potential errors. Then, in the following rounds, we gradually increase the threshold by \(\eta \), so that more possible matches could be detected. We will empirically validate the superiority of this strategy over a fixed threshold in Sect. 8.4.3.

Notably, our proposed confidence-based framework CUEA can further help mitigate the low-quality issue, since we calculate and assign a confidence score to each entity pair, where wrongly aligned entity pairs would presumably receive lower confidence scores and thus exert smaller influence on the subsequent alignment process.

Remark

As mentioned in the related work, there are some existing EA approaches that exploit the iterative learning (bootstrapping) strategy to improve EA performance. Particularly, BootEA calculates for each source entity the alignment likelihood to every target entity and includes those with likelihood above a given threshold in a maximum likelihood matching process under the 1-to-1 mapping constraint, producing a solution containing confident EA pairs [15]. This strategy is also adopted by [8, 16]. Zhu et al. use a threshold to select the entity pairs with very close distances as the pseudo-labeled data [14]. DAT employs a bidirectional margin-based constraint to select the confident EA pairs as labels [17]. Our progressive learning strategy differs from these existing solutions in four aspects: (1) we exclude the entities in the confident EA pairs from the test sets; (2) we use the dynamic threshold adjustment strategy to control the pace of the learning process; (3) our strategy can deal with unmatchable entities; and (4) we attach a confidence score to each selected entity pair, which can mitigate the negative influence of false positives on the KG representation learning process as well as the alignment results. The superiority of our strategy is validated in Sect. 8.4.3.

4 Experiment

This section reports the experimental results with in-depth analysis. The source code is available at https://github.com/DexterZeng/UEA.

4.1 Experimental Settings

Datasets

Following existing works, we adopt the DBP15K dataset [3] for evaluation. This dataset consists of three multilingual KG pairs extracted from DBpedia. Each KG pair contains 15,000 inter-language links as gold standards. The statistics can be found in Table 8.1. We note that state-of-the-art studies merely consider the labeled entities and divide them into training and testing sets. Nevertheless, as can be observed from Table 8.1, there exist unlabeled entities, e.g., 4,388 and 4,572 entities in the Chinese and English KG of \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\), respectively. In this connection, we adapt the dataset by including the unmatchable entities. Specifically, for each KG pair, we keep 30% of the labeled entity pairs as the training set (for training the supervised or semi-supervised methods). Then, to construct the test set, we include the rest of the entities in the first KG and the rest of the labeled entities in the second KG, so that the unlabeled entities in the first KG become unmatchable. The statistics of the test sets can be found in the test set column in Table 8.1.
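For concreteness, a sketch of this test set construction follows (function and variable names are ours, not taken from the released code):

```python
import random

def adapt_dbp15k(labeled_pairs, kg1_entities, train_ratio=0.3, seed=0):
    """Keep 30% of labeled pairs for training; unlabeled KG1 entities
    then appear only in the test set and become unmatchable."""
    rng = random.Random(seed)
    pairs = list(labeled_pairs)
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_ratio)
    train, gold = pairs[:cut], pairs[cut:]
    train_src = {u for u, _ in train}
    test_src = [u for u in kg1_entities if u not in train_src]  # incl. unlabeled
    test_tgt = [v for _, v in gold]  # only the remaining labeled KG2 entities
    return train, test_src, test_tgt, gold
```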

Table 8.1 The statistics of the evaluation benchmarks

Parameter Settings

For the side information module, we utilize the fastText embeddings [27] as word embeddings. To deal with cross-lingual KG pairs, following [19], we use Google Translate to translate the entity names from one language to another, i.e., translating Chinese, Japanese, and French to English. \(\alpha \) is set to 0.5. For the structural information learning, we set \(\beta \) to 0.5. Following [18], we set \(\gamma \) in the alignment objectives to 3 and adopt Manhattan distance as \(d(\cdot ,\cdot )\). Regarding C-TBNNS, we set \(\lambda \) to 0.4. For progressive learning, we set the initial threshold \(\theta _0\) to 0.05, the incremental parameter \(\eta \) to 0.1, and the termination threshold \(\mu \) to 30. Note that if the threshold \(\theta \) exceeds 0.45, we reset it to 0.45. We use these default values since there is no extra validation set for hyper-parameter tuning.
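For reference, these defaults can be gathered in one place (a convenience sketch; the released code may organize them differently):

```python
# Default hyper-parameters used throughout the experiments (no validation set).
CONFIG = {
    "alpha": 0.5,      # weight of semantic vs. string distance in M^t
    "beta": 0.5,       # weight of textual vs. structural distance in M
    "gamma": 3,        # margin in the alignment objectives, Eqs. (8.3)-(8.4)
    "lambda": 0.4,     # lower bound of confidence scores, Eq. (8.2)
    "theta_0": 0.05,   # initial TBNNS threshold
    "eta": 0.1,        # per-round threshold increment
    "theta_max": 0.45, # cap on the threshold
    "mu": 30,          # termination threshold on new pairs per round
}
```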

Evaluation Metrics

We use precision (P), recall (R), and F1 score as evaluation metrics. The precision is computed as the number of correct matches divided by the number of matches found by a method. The recall is computed as the number of correct matches found by a method divided by the number of gold matches. The F1 score is the harmonic mean between precision and recall. The bold figures in the tables represent the best results.
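These metrics can be computed directly from the predicted and gold match sets; a minimal sketch:

```python
def evaluate(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)            # correct matches
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```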

Competitors

We select the most performant state-of-the-art solutions for comparison. Within the group that solely utilizes structural information, we compare with BootEA [15], TransEdge [8], MRAEA [26], and SSP [28]. Among the methods incorporating other sources of information, we compare with GCN-Align [18], HMAN [9], HGCN [4], RE-GCN [29], DAT [17], and RREA [30]. We also include the unsupervised approaches, i.e., IMUSE [21] and PARIS [22]. To make a fair comparison, we only use entity name labels as the side information.

4.2 Results

Table 8.2 reports the alignment results, which shows that state-of-the-art supervised or semi-supervised methods have rather low precision values. This is because these approaches cannot predict the unmatchable source entities and instead generate a target entity for every source entity (including the unmatchable ones). Particularly, methods incorporating additional information attain relatively better performance than the methods in the first group, demonstrating the benefit of leveraging such information.

Table 8.2 Alignment results

Regarding the unsupervised methods, although IMUSE cannot deal with the unmatchable entities and achieves a low precision score, it outperforms most of the supervised or semi-supervised methods in terms of recall and F1 score. This indicates that, for the EA task, the KG side information is useful for mitigating the reliance on labeled data. In contrast to the abovementioned methods, PARIS attains very high precision, since it only generates matches that it believes to be highly likely, which effectively filters out the unmatchable entities. It also achieves the second best F1 score among all approaches, showcasing its effectiveness when the unmatchable entities are involved. Our proposals, UEA and CUEA, attain the best balance between precision and recall and obtain the best F1 scores, outperforming the second best by a large margin, validating their effectiveness. Notably, although our proposed models do not require labeled data, they achieve even better performance than the most performant supervised methods HMAN and DAT.

Furthermore, it can be seen that, by integrating the notion of confidence into UEA, CUEA achieves comparable results to UEA. At first sight, it seems that assigning confidence scores to entity pairs does not have a large influence on the representation learning and the alignment results. However, this can be ascribed to the fact that the side information is highly effective on these datasets (solely using the string information achieves an F1 score of 0.814, as shown in Table 8.4), which renders the structural information (the part most affected by the confidence scores) less contributive to the overall results. Next, we show that the confidence-based framework is much more useful on datasets with low-quality side information.

4.2.1 Results Using Low-Quality Side Information

We compare the unsupervised approaches under a practical scenario where the side information is of low quality. Specifically, we assume that pre-trained word embeddings as well as machine translation tools are not available. Under this circumstance, to use the entity name information, a viable solution is to compare the name strings directly. However, direct string comparison is ineffective for cross-lingual datasets such as \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) and \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\), where the languages in the source and target KGs are disparate. Hence, we aim to examine the effectiveness of these unsupervised approaches when the side information is of low quality and cannot provide many useful signals for alignment.

We report the results on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) and \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\) in Table 8.3, where the direct comparison between entity name strings serves as the side information. It can be observed that the F1 scores of all methods are very low (compared with those in Table 8.2), revealing that the quality of side information does affect the overall alignment results. Besides, given the low-quality side information, our proposed models UEA and CUEA still outperform the baselines IMUSE and PARIS in terms of the F1 score, demonstrating the effectiveness of the progressive learning framework and the unmatchable entity prediction module. Moreover, it is notable that CUEA achieves better results than UEA in terms of all metrics. This could be attributed to the confidence-based alignment result generation process, which could enable the entity pairs of higher confidence (higher probability of being correct, presumably) to have a larger impact on the representation learning and alignment process.

Table 8.3 Alignment results given low-grade side information
Table 8.4 Ablation results

4.3 Ablation Study

In this subsection, we examine the usefulness of the proposed modules by conducting an ablation study. More specifically, in Table 8.4, we report the results of UEA w/o Unm, which excludes the unmatchable entity prediction module, and UEA w/o Prg, which excludes the progressive learning process. It shows that removing the unmatchable entity prediction module (UEA w/o Unm) brings down the performance on all metrics and datasets, validating its effectiveness in detecting the unmatchable entities and enhancing the overall alignment performance. Besides, without the progressive learning (UEA w/o Prg), the precision increases, while the recall and F1 score values drop significantly. This shows that the progressive learning framework can discover more correctly aligned entity pairs and is crucial to the alignment process.

To provide insights into the progressive learning framework, we report the results of UEA w/o Adj, which does not adjust the threshold, and UEA w/o Excl, which does not exclude the entities in the alignment results from the entity sets during the progressive learning. Table 8.4 shows that setting the threshold to a fixed value (UEA w/o Adj) leads to worse F1 results, verifying that the progressive learning process depends on the choice of the threshold and the quality of the alignment results. We will further discuss the setting of the threshold in the next subsection. Besides, the performance also decreases if we do not exclude the matched entities from the entity sets (UEA w/o Excl), validating that this strategy can indeed reduce the difficulty of aligning entities.

Moreover, we replace our progressive learning framework with other state-of-the-art iterative learning strategies (i.e., MWGM [15], TH [14], and DAT-I [17]) and report the results in Table 8.4. It shows that using our progressive learning framework (UEA) can attain the best F1 score, verifying its superiority.

4.4 Quantitative Analysis

In this subsection, we perform quantitative analysis of the modules in UEA and CUEA.

The Threshold \(\theta \) in TBNNS

We discuss the setting of \(\theta \) to reveal the trade-off between the risk and gain from generating the alignment results in the progressive learning. Identifying a match leads to the integration of additional structural information, which benefits the subsequent learning. However, for the same reason, the identification of a false positive, i.e., an incorrect match, potentially leads to mistakenly modifying the connections between KGs, with the risk of amplifying the error in successive rounds. As shown in Fig. 8.3, a smaller \(\theta \) (e.g., 0.05) brings low risk and low gain; that is, it merely generates a small number of matches, among which almost all are correct. In contrast, a higher \(\theta \) (e.g., 0.45) increases the risk and brings relatively higher gain; that is, it results in many more aligned entity pairs, while a certain portion of them are erroneous. Additionally, using a higher threshold leads to increasingly more alignment results, while for a lower threshold, the progressive learning process barely increases the number of matches. This is consistent with our analysis in Sect. 8.3.3.

Fig. 8.3

Alignment results given different threshold values (number of entities per round on ZH-EN, JA-EN, and FR-EN). Correct-\(\theta \) refers to the number of correct matches generated by the progressive learning framework at each round given the threshold value \(\theta \); Wrong refers to the number of erroneous matches generated in each round

Unmatchable Entity Prediction

Zhao et al. [12] propose an intuitive strategy (U-TH) to predict the unmatchable entities. They set an NIL threshold, and if the distance value between a source entity and its closest target entity is above this threshold, they consider the source entity to be unmatchable. We compare our unmatchable entity prediction strategy with it in terms of the percentage of unmatchable entities that are included in the final alignment results and the F1 score. On \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\), replacing our unmatchable entity prediction strategy with U-TH attains an F1 score of 0.837, which is 8.4% lower than that of UEA. Besides, among the alignment results generated by using U-TH, 18.9% involve unmatchable entities, while this figure for UEA is merely 3.9%. This demonstrates the superiority of our unmatchable entity prediction strategy.

Influence of Parameters

As mentioned in Sect. 8.4.1, we set \(\alpha \) and \(\beta \) to 0.5 since there are no training/validation data. Here, we aim to show that different values of these parameters do not have a large influence on the final results. More specifically, we keep \(\alpha \) at 0.5 and choose \(\beta \) from [0.3, 0.4, 0.5, 0.6, 0.7]; then we keep \(\beta \) at 0.5 and choose \(\alpha \) from [0.3, 0.4, 0.5, 0.6, 0.7]. It can be observed from Fig. 8.4 that, although smaller \(\alpha \) and \(\beta \) lead to better results, the performance does not change significantly.

Fig. 8.4

The F1 scores obtained by setting \(\alpha \) and \(\beta \) to different values on ZH-EN, JA-EN, and FR-EN

The Hyper-Parameter \(\lambda \) in CUEA

We then analyze the influence of \(\lambda \) in Eq. (8.2), which determines the range of the confidence scores, on the final alignment results. To highlight its influence on the structural representation learning, we follow the settings in Sect. 8.4.2.1 and report the results in Table 8.5.

Table 8.5 The influence of \(\lambda \) on the alignment results

Table 8.5 shows that the alignment performance is relatively stable when \(\lambda \) is not too large. Nevertheless, when setting \(\lambda \) to a large value (e.g., 1, which restores UEA), the results drop sharply. This reveals that assigning probability scores to the entity pairs according to their confidence of being true can facilitate the alignment. Besides, generally speaking, CUEA is robust to perturbations of \(\lambda \) (as long as it is not too large).

Influence of Input Side Information

We adopt different side information as input to examine the performance of UEA. More specifically, we report the results of UEA-\({{\mathbf {M}}^{\mathbf {l}}}\), which merely uses the string-level feature of entity names as input, and UEA-\({{\mathbf {M}}^{\mathbf {n}}}\), which only uses the semantic embeddings of entity names as input. We also provide the results of \({{\mathbf {M}}^{\mathbf {l}}}\) and \({{\mathbf {M}}^{\mathbf {n}}}\), which use the string-level and semantic information, respectively, to directly generate alignment results (without progressive learning).

As shown in Table 8.3, the performance of solely using the input side information is not very promising (\({{\mathbf {M}}^{\mathbf {l}}}\) and \({{\mathbf {M}}^{\mathbf {n}}}\)). Nevertheless, by forwarding the side information into our model, the results of UEA-\({{\mathbf {M}}^{\mathbf {l}}}\) and UEA-\({{\mathbf {M}}^{\mathbf {n}}}\) become much better. This reveals that UEA can work with different types of side information and consistently improve the alignment results. Additionally, by comparing UEA-\({{\mathbf {M}}^{\mathbf {l}}}\) with UEA-\({{\mathbf {M}}^{\mathbf {n}}}\), it is evident that the input side information does affect the final results, and the quality of the side information is significant to the overall alignment performance.

Pseudo-Labeled Data

We further examine the usefulness of the preliminary alignment results generated by the side information, i.e., the pseudo-labeled data. Concretely, we replace the training data in HGCN with these pseudo-labeled data, resulting in HGCN-U, and then compare its alignment results with the original performance. Regarding the F1 score, HGCN-U is 4% lower than HGCN on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\), 2.9% lower on \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\), and 2.8% lower on \({\mathtt {DBP15K}_{\mathtt {FR-EN}}}\). The minor difference validates the effectiveness of the pseudo-labeled data generated by the side information. It also demonstrates that this strategy can be applied to other supervised or semi-supervised frameworks to reduce their reliance on labeled data.

5 Conclusion

In this chapter, we propose an unsupervised EA solution that is capable of dealing with unmatchable entities. We first exploit the side information of KGs to generate preliminary alignment results, which are considered as pseudo-labeled data and forwarded to the progressive learning framework to produce better KG embeddings and alignment results in a self-training fashion. We also devise an unmatchable entity prediction module to detect the unmatchable entities. The experimental results validate the usefulness of our proposed model and its superiority over state-of-the-art approaches.