1 Introduction

Knowledge graphs (KGs) have been applied to various fields such as natural language processing and information retrieval. To improve the quality of KGs, many efforts have been dedicated to the alignment of KGs, since different KGs usually contain complementary information. Particularly, entity alignment (EA), which aims to identify equivalent entities in different KGs, is a crucial step of KG alignment and has been intensively studied over the last few years [1,2,3,4,5,6,7,8]. We use Example 1 to illustrate this task.

Example 1

Figure 1 shows a partial English KG and a partial Spanish KG concerning the director Hirokazu Koreeda, where the dashed lines indicate known alignments (i.e., seeds). The task of EA is to identify equivalent entity pairs between the two KGs, e.g., (Shoplifters, Manbiki Kazoku).

Fig. 1 An example of EA

State-of-the-art EA solutions [9,10,11,12] assume that equivalent entities usually possess similar neighboring information. Consequently, they utilize KG embedding models, e.g., TransE [13], or graph neural network (GNN) models, e.g., GCN [14], to generate structural embeddings of entities in the individual KGs. Then, these separate embeddings are projected into a unified embedding space by using the seed entity pairs as connections, so that the entities from different KGs become directly comparable. Finally, to determine the alignment results, the majority of current works [3, 15,16,17] formalize the alignment process as a ranking problem; that is, for each entity in the source KG, they rank all the entities in the target KG according to some distance metric, and the closest entity is considered the equivalent target entity.

Nevertheless, we still observe several issues from current EA works:

  • Reliance on labeled data Most of the approaches rely on pre-aligned seed entity pairs to connect two KGs and use the unified KG structural embeddings to align entities. These labeled data, however, might not exist in real-life settings. For instance, in Example 1, the equivalence between Hirokazu Koreeda in \(KG_{EN}\) and Hirokazu Koreeda in \(KG_{ES}\) might not be known in advance. In this case, state-of-the-art methods that solely rely on the structural information would fall short, as there are no seeds to connect these individual KGs.

  • Closed-domain setting All current EA solutions work under the closed-domain setting [18]; that is, they assume that every entity in the source KG has an equivalent entity in the target KG. Nevertheless, in practical settings, there always exist unmatchable entities. For instance, in Example 1, the source entity Ryo Kase has no equivalent entity in the target KG. Therefore, an ideal EA system should be capable of predicting the unmatchable entities.

In response to these issues, we put forward an unsupervised EA solution, UEA, that is capable of addressing the unmatchable problem. Specifically, to mitigate the reliance on labeled data, we mine useful features from the KG side information and use them to produce preliminary pseudo-labeled data. These preliminary seeds are forwarded to our devised progressive learning framework to generate unified KG structural representations, which are integrated with the side information to provide a more comprehensive view for alignment. This framework also progressively augments the training data and improves the alignment results in a self-training fashion. Besides, to tackle the unmatchable issue, we design an unmatchable entity prediction module, which leverages thresholded bi-directional nearest neighbor search (TBNNS) to filter out the unmatchable entities and exclude them from the alignment results. We embed the unmatchable entity prediction module into the progressive learning framework to control the pace of progressive learning by dynamically adjusting the thresholds in TBNNS. Nevertheless, we discover that there is still a notable issue with UEA:

  • Ignorance of the quality of pseudo-labeled data UEA treats the pseudo-labeled data generated in the progressive learning process equally. Nevertheless, these pseudo-labeled data are generated with different degrees of confidence. That is, the framework can be highly confident that some pseudo-labeled entity pairs are correct, while for others this confidence is relatively low.

Thus, we introduce the concept of confidence to measure the probability of an entity pair being correct. We further incorporate these confidence scores into KG representation learning with the aim of producing more accurate structural embeddings. Through empirical studies, we demonstrate that the confidence-based framework, CUEA, delivers more stable performance than UEA regardless of the quality of the input side information, and is particularly useful when the side information is low-grade.

This article is an extended version of our previous work [19]. In this extension, we make the following improvements:

  • We extend UEA to a confidence-based framework CUEA, where we put forward C-TBNNS to assign confidence scores to aligned entity pairs and incorporate such probabilities into the KG representation learning process, so as to improve the quality of learned entity representations and also the alignment performance.

  • We add more datasets for evaluation and conduct a more comprehensive analysis, which empirically validates that, compared with UEA, CUEA delivers more consistent performance and is more effective given low-quality side information.

Organization In Section 2, we formally define the task of EA and introduce related work. In Section 3, we introduce the preliminaries. In Section 4, we detail unmatchable entity prediction and the confidence-based extension. In Section 5, we introduce the progressive learning framework. In Section 6, we report the experimental results and conduct detailed analysis. In Section 7, we conclude this article.

2 Task Definition and Related Work

In this section, we formally define the task of EA, and then introduce the related work.

2.1 Task Definition

The inputs to EA are a source KG \({\mathcal {G}}_1\) and a target KG \({\mathcal {G}}_2\). The task of EA is defined as finding the equivalent entities between the KGs, i.e., \(\Psi = \{(u,v)|u\in {\mathcal {E}}_1, v\in {\mathcal {E}}_2, u \leftrightarrow v\}\), where \({\mathcal {E}}_1\) and \({\mathcal {E}}_2\) refer to the entity sets in \({\mathcal {G}}_1\) and \({\mathcal {G}}_2\), respectively, and \(u \leftrightarrow v\) denotes that the source entity u and the target entity v are equivalent, i.e., u and v refer to the same real-world object.

Most current EA solutions assume that there exists a set of seed entity pairs \(\Psi _s = \{(u_s,v_s)|u_s\in {\mathcal {E}}_1, v_s\in {\mathcal {E}}_2, u_s \leftrightarrow v_s\}\). Nevertheless, in this work, we focus on unsupervised EA and do not assume the availability of such labeled data.

2.2 Related Work

Entity alignment The majority of state-of-the-art methods are supervised or semi-supervised. They can be roughly divided into three categories: methods merely using the structural information, methods that utilize an iterative training strategy, and methods using information in addition to the structural information [20].

The approaches in the first category aim to mine useful structural signals for alignment, and devise structure learning models such as recurrent skipping networks [21] and multi-channel GNN [17], or exploit existing models such as TransE [3, 9, 22,23,24] and graph attention networks [3]. The embedding spaces of different KGs are connected by seed entity pairs, and the alignment results can then be predicted according to the distances in the unified embedding space.

Methods in the second category iteratively label likely EA pairs as the training set and gradually improve alignment results [15, 22,23,24,25]. A more detailed discussion of these methods and their difference from our framework is provided in Section 5. Methods in the third category incorporate the side information to offer a complementary view to the KG structure, including the attributes [10, 26,27,28,29,30], entity descriptions [16, 31], and entity names [12, 25, 32,33,34,35]. These methods devise various models to encode the side information and consider it as a feature parallel to the structural information. In comparison, the side information in this work has an additional role, i.e., generating pseudo-labeled data for learning unified structural representations.

Unsupervised entity alignment A few methods have investigated the alignment without labeled data. Qu et al. [36] propose an unsupervised approach toward knowledge graph alignment with an adversarial training framework. Nevertheless, the experimental results are extremely poor. He et al. [37] utilize the shared attributes between heterogeneous KGs to generate aligned entity pairs, which are used to detect more equivalent attributes. They perform entity alignment and attribute alignment alternately, leading to more high-quality aligned entity pairs, which are used to train a relation embedding model. Finally, they combine the alignment results generated by attribute and relation triples using a bivariate regression model. The overall procedure of this work might seem similar to that of our proposed model. However, there are notable differences; for instance, the KG embeddings in our work are updated progressively, which leads to more accurate alignment results, and our model can deal with unmatchable entities. We empirically demonstrate the superiority of our model in Sect. 6.

We notice that some entity resolution (ER) approaches, represented by PARIS [38], are established in a setting similar to EA. They adopt collective alignment algorithms, such as similarity propagation, to model the relations among entities. We include them in the experimental study for the comprehensiveness of this article.

3 Preliminaries

In this section, we first introduce the outline of our proposal. Then, we elaborate on how the side information is processed to produce preliminary alignment seeds.

Fig. 2 Outline of CUEA. Arrows in blue represent the progressive learning process. By setting the confidence to 1, the UEA model [19] can be restored

3.1 Model Outline

As shown in Fig. 2, given two KGs, CUEA first mines useful features from the side information. These features are forwarded to the unmatchable entity prediction module to generate initial alignment results with confidence scores, which are regarded as pseudo-labeled data. Then, the progressive learning framework uses these pseudo seeds, along with the confidence scores, to connect the two KGs and learn unified entity structural embeddings. It further combines the alignment signals from the side information and structural information to provide a more comprehensive view for alignment. Finally, it progressively improves the quality of the structural embeddings and augments the alignment results by iteratively updating the pseudo-labeled data with the results from the previous round, which also leads to increasingly better alignment. Note that by assigning a confidence score of 1 to all entity pairs, CUEA reduces to the UEA model [19].

3.2 Side Information

There is abundant side information in KGs, such as attributes, descriptions and classes. In this work, we use a particular form of attribute, the entity name, as it exists in the majority of KGs. To make the most of the entity name information, inspired by [12], we exploit it at both the semantic and string levels and generate the textual distance matrix between the entities in two KGs.

More specifically, we use the averaged word embeddings to represent the semantic meanings of entity names. Given the semantic embeddings of a source and a target entity, we obtain the semantic distance score by subtracting their cosine similarity score from 1. We denote the semantic distance matrix between the entities in two KGs as \(\mathbf {M^n}\), where rows represent source entities, columns denote target entities, and each element in the matrix denotes the distance score between a pair of source and target entities. As for the string-level feature, we adopt the Levenshtein distance [39] to measure the difference between two sequences. We denote the string distance matrix as \(\mathbf {M^l}\).

To obtain a more comprehensive view of alignment, we combine the two distance matrices and generate the textual distance matrix as \(\mathbf {M^t} = \alpha \mathbf {M^n} + (1-\alpha )\mathbf {M^l}\), where \(\alpha\) is a hyper-parameter balancing the weights. Then, we forward the textual distance matrix \(\mathbf {M^t}\) to the unmatchable entity module to produce alignment results, which are considered as the pseudo-labeled data for training KG structural embeddings. The details are introduced in the next subsection.
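To make the construction of these matrices concrete, below is a minimal Python sketch. The embedding inputs, the helper names, and the normalization of the edit distance by the longer name (so that both matrices lie on a comparable scale) are our assumptions, not details prescribed above.

```python
import numpy as np

def semantic_distance_matrix(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """M^n: 1 - cosine similarity between averaged word embeddings of entity names."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return 1.0 - src @ tgt.T

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance [39]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def string_distance_matrix(src_names, tgt_names) -> np.ndarray:
    """M^l: edit distance, normalized by the longer name (our assumption)."""
    M = np.empty((len(src_names), len(tgt_names)))
    for i, a in enumerate(src_names):
        for j, b in enumerate(tgt_names):
            M[i, j] = levenshtein(a, b) / max(len(a), len(b), 1)
    return M

def textual_distance_matrix(Mn: np.ndarray, Ml: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """M^t = alpha * M^n + (1 - alpha) * M^l."""
    return alpha * Mn + (1 - alpha) * Ml
```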

Remark The goal of this step is to exploit available side information to generate useful features for alignment. Other types of side information, e.g., attributes and entity descriptions, can also be leveraged. Besides, more advanced textual encoders, such as misspelling oblivious word embeddings [40] and convolutional embedding for edit distance [41], can be utilized. We will investigate them in the future.

4 Unmatchable Entity Prediction

State-of-the-art EA solutions generate for each source entity a corresponding target entity and fail to consider the potential unmatchable issue. Nevertheless, as discussed in [20], in real-life settings, KGs contain entities that other KGs do not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of entities in YAGO 4 are related to movies, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB. These unmatchable entities would increase the difficulty of EA. Therefore, in this work, we devise an unmatchable entity prediction module to predict the unmatchable entities and filter them out from the alignment results.

4.1 Thresholded Bi-directional Nearest Neighbor Search

We put forward a novel strategy, i.e., thresholded bi-directional nearest neighbor search (TBNNS), to generate the alignment results, and the resulting unaligned entities are predicted to be unmatchable. Specifically, given a source entity u and a target entity v, if u and v are the nearest neighbors of each other, and the distance between them, \(\mathbf {M}(u,v)\), is below a given threshold \(\theta\), we consider (u, v) an aligned entity pair, where \(\mathbf {M}(u,v)\) denotes the element in the u-th row and v-th column of the distance matrix \(\mathbf {M}\).

The TBNNS strategy exerts strong constraints on alignment, since it requires that the matched entities should both prefer each other the most, and the distance between their embeddings should be below a certain value. Therefore, it can effectively predict unmatchable entities and prevent them from being aligned. Notably, the threshold \(\theta\) plays a significant role in this strategy. A larger threshold would lead to more matches, whereas it would also increase the risk of including erroneous matches or unmatchable entities. In contrast, a small threshold would only lead to a few aligned entity pairs, and almost all of them would be correct. This is further discussed and verified in Sect. 6.3. Therefore, our progressive learning framework dynamically adjusts the threshold value to produce more accurate alignment results (to be discussed in the next section).

Algorithm 1 TBNNS
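As an illustration of the strategy, the following is a minimal NumPy sketch of TBNNS; the function name and matrix layout (rows: source entities, columns: target entities) are assumptions consistent with the definition of \(\mathbf {M}\) above.

```python
import numpy as np

def tbnns(M: np.ndarray, theta: float):
    """Thresholded bi-directional nearest neighbor search (sketch).

    Returns aligned (u, v) index pairs; source/target entities left
    unaligned are predicted to be unmatchable.
    """
    nn_of_src = M.argmin(axis=1)  # nearest target for each source entity
    nn_of_tgt = M.argmin(axis=0)  # nearest source for each target entity
    matches = []
    for u, v in enumerate(nn_of_src):
        # mutual nearest neighbors whose distance is below the threshold
        if nn_of_tgt[v] == u and M[u, v] < theta:
            matches.append((u, v))
    return matches
```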

4.2 Confidence-based TBNNS

Considering that the aligned entity pairs generated by TBNNS are of different quality (i.e., some are true while some are not), we further put forward confidence-based TBNNS, C-TBNNS, to measure the confidence of an entity pair being true. Specifically, we define the confidence score \(\Theta\) of an entity pair (u, v) as:

$$\begin{aligned} \Theta (u,v)= \mathbf {M}(u,v^\prime ) - \mathbf {M}(u,v) + \mathbf {M}(v,u^\prime ) - \mathbf {M}(v,u), \end{aligned}$$
(1)

where \(\Delta _1 = \mathbf {M}(u,v^\prime ) - \mathbf {M}(u,v)\) denotes the gap between the distance scores of the top-2 closest entities (i.e., v and \(v^\prime\)) to entity u, while \(\Delta _2 = \mathbf {M}(v,u^\prime ) - \mathbf {M}(v,u)\) denotes the gap between the distance scores of the top-2 closest entities (i.e., u and \(u^\prime\)) to entity v. This is based on the intuition that, for an entity pair (u, v), if the distance between them is the smallest from both sides, and there are larger margins between the distances of the top-2 candidates, we can be more confident in considering them a correct entity pair. We further restrict the confidence scores to a certain range:

$$\begin{aligned} \Theta ({\mathcal {S}}) = (1-\lambda ) \frac{\Theta ({\mathcal {S}}) - \min \{\Theta ({\mathcal {S}})\}}{\max \{\Theta ({\mathcal {S}})\} - \min \{\Theta ({\mathcal {S}})\}} + \lambda \end{aligned}$$
(2)

where \(\Theta ({\mathcal {S}})\) represents the confidence scores of the entity pairs in \({\mathcal {S}}\). The core of Eq. (2) is min-max normalization, which converts the confidence scores to [0, 1]. We add a hyper-parameter \(\lambda \in [0,1]\) to further restrict the range of the confidence scores to \([\lambda ,1]\). Thus, by setting \(\lambda\) to 1, all entity pairs would have the same confidence score of 1, and C-TBNNS is restored to TBNNS. Hence, C-TBNNS can be regarded as a generalization of TBNNS that introduces the concept of confidence (probability) into the alignment result generation process.
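A minimal sketch of computing the confidence scores of Eqs. (1) and (2) for the pairs returned by TBNNS follows; reading \(\mathbf {M}(v,u)\) as the (u, v)-entry of the same distance matrix is our interpretation of the symmetric distance.

```python
import numpy as np

def confidence_scores(M: np.ndarray, matches, lam: float = 0.4) -> np.ndarray:
    """Eq. (1): sum of top-2 distance gaps from both directions,
    then Eq. (2): min-max normalization into [lam, 1]."""
    if not matches:
        return np.array([])
    raw = []
    for u, v in matches:
        row = np.partition(M[u], 1)     # two smallest distances in row u
        col = np.partition(M[:, v], 1)  # two smallest distances in column v
        # M[u, v] is the smallest in both directions (mutual nearest neighbors),
        # so each gap is the second-best distance minus M[u, v]
        raw.append((row[1] - M[u, v]) + (col[1] - M[u, v]))
    raw = np.asarray(raw)
    span = raw.max() - raw.min()
    normed = (raw - raw.min()) / span if span > 0 else np.ones_like(raw)
    return (1 - lam) * normed + lam
```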

5 The Progressive Learning Framework

To exploit the rich structural patterns in KGs that could provide useful signals for alignment, we design a progressive learning framework to combine structural and textual features for alignment and improve the quality of structural embeddings and alignment results in a self-training fashion.

5.1 Knowledge Graph Representation Learning

As mentioned above, we forward the textual distance matrix \(\mathbf {M^t}\) generated by using the side information to the unmatchable entity prediction module to produce the preliminary alignment results, which are considered as pseudo-labeled data for learning unified KG embeddings. Concretely, following [26], we adopt GCN to capture the neighboring information of entities. We omit the implementation details, which can be found in [26], since they are not the focus of this paper.

Alignment objective Since the representations of the source and target KGs are learned individually, they need to be projected into a unified embedding space, where the entities across KGs can be compared directly. To this end, we use a margin-based loss function that enforces the distances between the embeddings of entities in labeled entity pairs to be small, and the distances for negative samples (i.e., nonequivalent entity pairs) to be large. Formally:

$$\begin{aligned} {\mathcal {L}} = \sum _{(u,v)\in {\mathcal {S}}} \sum _{(u^\prime ,v^\prime )\in {\mathcal {S}}^\prime _{(u,v)}} [d(\mathbf {u}, \mathbf {v}) + \gamma - d(\mathbf {u}^\prime , \mathbf {v}^\prime )]_+ , \end{aligned}$$
(3)

where \([\cdot ]_+ = \max \{0,\cdot \}\), (u, v) is a labeled entity pair from the training data, and \({\mathcal {S}}^\prime _{(u,v)}\) represents the set of negative entity pairs obtained by corrupting (u, v) using nearest neighbor sampling [3]. \(\mathbf {u}\) and \(\mathbf {v}\) represent the embeddings of source and target entities learned by GCN, respectively. \(d(\cdot ,\cdot )\) is the distance function that measures the distance between two embeddings. \(\gamma\) is a hyper-parameter separating positive samples from negative ones.

Confidence-based objective Considering that the pseudo-labeled entity pairs have different confidences of being true, we incorporate such probabilities into the alignment objective to learn more accurate structural embeddings:

$$\begin{aligned} {\mathcal {L}}_c = \sum _{(u,v)\in {\mathcal {S}}} \sum _{(u^\prime ,v^\prime )\in {\mathcal {S}}^\prime _{(u,v)}} \Theta (u,v)\cdot [d(\mathbf {u}, \mathbf {v}) + \gamma - d(\mathbf {u}^\prime , \mathbf {v}^\prime )]_+ , \end{aligned}$$
(4)

where \(\Theta (u, v)\) is the confidence score attached to each entity pair. Thus, the more confident entity pairs play a more important role during training, while the less confident pseudo entity pairs have a smaller effect, such that the impact of false positives can be mitigated. We will empirically demonstrate its effectiveness in Section 6.
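As a sketch of Eq. (4), assuming the GCN has already produced the embedding matrices and that each positive pair comes with a single pre-sampled negative (the paper sums over a set of negatives per pair), with Manhattan distance as \(d\) following the parameter settings in Sect. 6.1.2:

```python
import torch

def confidence_weighted_loss(src_emb: torch.Tensor, tgt_emb: torch.Tensor,
                             pairs: torch.Tensor, negs: torch.Tensor,
                             conf: torch.Tensor, gamma: float = 3.0) -> torch.Tensor:
    """Eq. (4): confidence-weighted margin loss (sketch).

    pairs/negs: (k, 2) index tensors of pseudo-labeled and negative pairs;
    conf: (k,) confidence scores Theta(u, v); Manhattan distance as d.
    """
    d_pos = (src_emb[pairs[:, 0]] - tgt_emb[pairs[:, 1]]).abs().sum(dim=1)
    d_neg = (src_emb[negs[:, 0]] - tgt_emb[negs[:, 1]]).abs().sum(dim=1)
    return (conf * torch.clamp(d_pos + gamma - d_neg, min=0.0)).sum()
```

Setting conf to all ones recovers the original objective of Eq. (3).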

Feature fusion Given the learned structural embedding matrix \(\mathbf {Z}\), we calculate the structural distance score between a source and a target entity by subtracting the cosine similarity score between their embeddings from 1. We denote the resultant structural distance matrix as \(\mathbf {M^s}\). Then, we combine the textual and structural information to generate more accurate signals for alignment: \(\mathbf {M} = \beta \mathbf {M^t} + (1-\beta )\mathbf {M^s}\), where \(\beta\) is a hyper-parameter that balances the weights. The fused distance matrix \(\mathbf {M}\) is used to generate more accurate matches.

5.2 The Progressive Learning Algorithm

The amount of training data has an impact on the quality of the unified KG embeddings, which in turn affects the alignment performance [10, 42]. Thus, we devise an algorithm (Algorithm 2) to progressively augment the pseudo training data, so as to improve the quality of KG embeddings and enhance the alignment performance. The algorithm starts by learning unified structural embeddings and generating the fused distance matrix \(\mathbf {M}\) using the preliminary pseudo-labeled data \({\mathcal {S}}_0\) (lines 1-2). Then, the fused distance matrix is used to produce the new alignment results \(\Delta {\mathcal {S}}\) using C-TBNNS (line 4). These newly generated entity pairs \(\Delta {\mathcal {S}}\) are added to the alignment results, which are used for generating the fused distance matrix in the next round (lines 6-7). The entities in \({\mathcal {S}}\) are removed from the entity sets (lines 9-10). To progressively improve the quality of KG embeddings and detect more alignment results, we perform the aforementioned process recursively until the number of newly generated entity pairs falls below a given threshold \(\mu\). Finally, we consider the entity pairs in \({\mathcal {S}}\) as the final alignment results \(\Psi\).

Algorithm 2 The progressive learning algorithm
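To make the control flow concrete, here is a simplified sketch of the loop, reusing the tbnns and confidence_scores sketches above; learn_embeddings is a hypothetical callback that trains the GCN on the current pseudo seeds and returns the structural distance matrix \(\mathbf {M^s}\), the entity-set bookkeeping (lines 9-10) is elided for brevity, and the default values follow the parameter settings in Sect. 6.1.2.

```python
import numpy as np

def progressive_alignment(Mt, learn_embeddings, theta0=0.05, eta=0.1,
                          theta_cap=0.45, mu=30, beta=0.5):
    """Sketch of Algorithm 2 (entity-set bookkeeping omitted)."""
    seeds = tbnns(Mt, theta0)               # preliminary pseudo-labeled data S_0
    conf = confidence_scores(Mt, seeds)
    theta = theta0
    while True:
        Ms = learn_embeddings(seeds, conf)  # unified structural embeddings -> M^s
        M = beta * Mt + (1 - beta) * Ms     # feature fusion
        known = set(seeds)
        new = [p for p in tbnns(M, theta) if p not in known]
        if len(new) < mu:                   # termination condition
            break
        seeds += new
        conf = confidence_scores(M, seeds)
        theta = min(theta + eta, theta_cap) # dynamic threshold adjustment (Sect. 5.3)
    return seeds                            # final alignment results Psi
```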

Notably, in the learning process, once a pair of entities is considered a match, the entities are removed from the entity sets (lines 9-10). This gradually reduces the alignment search space and lowers the difficulty of aligning the remaining entities. Obviously, this strategy suffers from the error propagation issue, which, however, can be effectively mitigated by the progressive learning process that dynamically adjusts the threshold. We will verify the effectiveness of this setting in Sect. 6.2.4.

5.3 Dynamic Threshold Adjustment

It can be observed from Algorithm 2 that the matches generated by the unmatchable entity prediction module are not only part of the eventual alignment results, but also the pseudo training data for learning subsequent structural embeddings. Therefore, to enhance the overall alignment performance, the alignment results generated in each round should, ideally, have both large quantity and high quality. Unfortunately, these two goals cannot be achieved at the same time. This is because, as stated in Sect. 4, a larger threshold in TBNNS can generate more alignment results (large quantity), whereas some of them might be erroneous (low quality). These wrongly aligned entity pairs can cause the error propagation problem and result in more erroneous matches in the following rounds. In contrast, a smaller threshold leads to fewer alignment results (small quantity), while almost all of them are correct (high quality).

To address this issue, we aim to balance between the quantity and the quality of the matches generated in each round. An intuitive idea is to set the threshold to a moderate value. However, this fails to take into account the characteristics of the progressive learning process. That is, in the beginning, the quality of the matches should be prioritized, as these alignment results will have a long-term impact on the subsequent rounds. In comparison, in the later stages where most of the entities have been aligned, the quantity is more important, as we need to include more possible matches that might not have a small distance score. Hence, we set the initial threshold \(\theta _0\) to a very small value so as to reduce potential errors. Then, in the following rounds, we gradually increase the threshold by \(\eta\), so that more possible matches can be detected. We will empirically validate the superiority of this strategy over a fixed threshold in Sect. 6.2.4.

Notably, our proposed confidence-based framework CUEA can further help mitigate the low-quality issue, as we calculate and assign a confidence score to each entity pair, where the wrongly aligned entity pairs would presumably have lower confidence scores and thus exert a smaller influence on the subsequent alignment process.

Remark As mentioned in the related work, some existing EA approaches exploit the iterative learning (bootstrapping) strategy to improve EA performance. Particularly, BootEA calculates for each source entity the alignment likelihood to every target entity, and includes those with likelihood above a given threshold in a maximum likelihood matching process under the 1-to-1 mapping constraint, producing a solution containing EA pairs [23]. This strategy is also adopted by [15, 24]. Zhu et al. use a threshold to select the entity pairs with very close distances as the pseudo-labeled data [22]. DAT employs a bi-directional margin-based constraint to select the confident EA pairs as labels [25]. Our progressive learning strategy differs from these existing solutions in four aspects: (1) we exclude the entities in the confident EA pairs from the test sets; (2) we use the dynamic threshold adjustment strategy to control the pace of the learning process; (3) our strategy can deal with unmatchable entities; and (4) we attach a confidence score to each selected entity pair, which can mitigate the negative influence of false positives on the KG representation learning process as well as the alignment results. The superiority of our strategy is validated in Sect. 6.

6 Experiment

This section reports the experiment results with in-depth analysis. The source code is available at https://github.com/DexterZeng/UEA.

6.1 Experiment Settings

In this subsection, we first introduce the datasets, and then we detail the parameter settings. Next, we introduce the evaluation metrics and the baseline models used for comparison.

6.1.1 Datasets

Following existing works, we adopt the DBP15K dataset [10] for evaluation. This dataset consists of three multilingual KG pairs extracted from DBpedia. Each KG pair contains 15 thousand inter-language links as gold standards. The statistics can be found in Table 1. We note that state-of-the-art studies merely consider the labeled entities and divide them into training and testing sets. Nevertheless, as shown in Table 1, there exist unlabeled entities, e.g., 4,388 and 4,572 entities in the Chinese and English KGs of \(\texttt {DBP15K}_\texttt {ZH-EN}\), respectively. Accordingly, we adapt the dataset to include the unmatchable entities. Specifically, for each KG pair, we keep 30% of the labeled entity pairs as the training set (for training the supervised or semi-supervised methods). Then, to construct the test set, we include the rest of the entities in the first KG and the rest of the labeled entities in the second KG, so that the unlabeled entities in the first KG become unmatchable. The statistics of the test sets can be found in the Test set column of Table 1.

In addition, we also use the SRPRS dataset for evaluation. Concretely, we adopt the two cross-lingual datasets, \(\texttt {SRPRS}_\texttt {EN-FR}\) and \(\texttt {SRPRS}_\texttt {EN-DE}\), which are extracted from the multilingual KGs of DBpedia. Note that we do not use the mono-lingual KG pairs in SRPRS since using the side information can already achieve the ground-truth results [20]. There are no unmatchable entities in SRPRS.

Table 1 The statistics of the evaluation benchmarks

6.1.2 Parameter Settings

For the side information module, we utilize the fastText embeddings [43] as word embeddings. To deal with cross-lingual KG pairs, following [33], we use Google Translate to translate the entity names from one language to another, i.e., translating Chinese, Japanese and French to English. \(\alpha\) is set to 0.5. For the structural information learning, we set \(\beta\) to 0.5. Following [26], we set \(\gamma\) in the alignment objectives to 3 and adopt the Manhattan distance as \(d(\cdot ,\cdot )\). Regarding C-TBNNS, we set \(\lambda\) to 0.4. For progressive learning, we set the initial threshold \(\theta _0\) to 0.05, the incremental parameter \(\eta\) to 0.1, and the termination threshold \(\mu\) to 30. Note that if the threshold \(\theta\) exceeds 0.45, we reset it to 0.45. These hyper-parameters are set to default values since there is no extra validation set for hyper-parameter tuning. We conduct the parameter analysis in the experiments.

6.1.3 Evaluation Metrics

We use precision (P), recall (R), and F1 score as evaluation metrics. The precision is computed as the number of correct matches divided by the number of matches found by a method. The recall is computed as the number of correct matches found by a method divided by the number of gold matches. The F1 score is the harmonic mean between precision and recall.
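For concreteness, a minimal sketch of these metrics (the helper name is ours, not from the paper); note that predicting a target for an unmatchable source entity enlarges the set of found matches without adding correct ones, which is why methods lacking unmatchable entity prediction suffer in precision.

```python
def precision_recall_f1(predicted: set, gold: set):
    """P = correct / |predicted|, R = correct / |gold|, F1 = harmonic mean.

    predicted: (source, target) matches found by a method; gold: gold matches.
    Unmatchable source entities have no gold match, so any prediction made
    for them lowers precision.
    """
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```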

6.1.4 Competitors

We select the most performant state-of-the-art solutions for comparison. Within the group that solely utilizes structural information, we compare with:

  • BootEA [23], which employs the bootstrapping strategy to iteratively label likely entity alignment as training data for learning alignment-oriented KG embeddings;

  • TransEdge [15], which proposes a novel edge-centric embedding model that contextualizes relation representations in terms of specific head-tail entity pairs;

  • MRAEA [42], which models entity embeddings by attending over the node’s incoming and outgoing neighbors and its connected relations’ meta semantics;

  • SSP [44], which jointly leverages the global KG structure and entity-specific relational triples for better entity alignment;

  • RREA [45], which leverages relational reflection transformation to obtain relation specific embeddings for each entity and achieves effective entity alignment.

Among the methods incorporating other sources of information, we compare with:

  • GCN-Align [26], which employs GCN to learn structural embeddings and attribute embeddings for alignment;

  • HMAN [16], which harnesses the attributes and textual descriptions of entities to complement the structural information;

  • HGCN [11], which jointly learns entity and relation representations for EA;

  • RE-GCN [46], which exploits multiple structural graph convolution driven by triadic graph and primal graph to learn entity and relation embeddings;

  • DAT [25], which proposes an EA framework with emphasis on long-tail entities.

We also include the unsupervised approaches IMUSE [37] and PARIS [38]. To make a fair comparison, we only use entity name labels as the side information.

Table 2 Alignment results

6.2 Results

In this subsection, we first present the main alignment results. Then, we report the performance of the unsupervised approaches given side information of low quality. Finally, we conduct an ablation study to provide insights into UEA.

6.2.1 Main Alignment Results

Table 2 reports the alignment results, which show that state-of-the-art supervised or semi-supervised methods have rather low precision values. This is because these approaches cannot predict the unmatchable source entities and instead generate a target entity for every source entity (including the unmatchable ones). Notably, methods incorporating additional information attain relatively better performance than the methods in the first group, demonstrating the benefit of leveraging such additional information.

Regarding the unsupervised methods, although IMUSE cannot deal with the unmatchable entities and achieves a low precision score, it outperforms most of the supervised or semi-supervised methods in terms of recall and F1 score. This indicates that, for the EA task, the KG side information is useful for mitigating the reliance on labeled data. In contrast to the methods discussed before, PARIS attains very high precision, since it only generates matches that it believes to be highly probable, which effectively filters out the unmatchable entities. It also achieves the second best F1 score among all approaches, showcasing its effectiveness when unmatchable entities are involved. Our proposals, UEA and CUEA, attain the best balance between precision and recall and obtain the best F1 scores, outperforming the second best by a large margin, which validates their effectiveness. Notably, although our proposed models do not require labeled data, they achieve even better performance than the most performant supervised methods. This could be attributed to the following facts: (1) our proposals are capable of dealing with unmatchable entities and hence achieve a good balance between precision and recall, while all the supervised approaches fail to identify the unmatchable entities and make alignment predictions for every source entity (including the unmatchable ones), thus attaining a low precision and in turn a low F1 score; (2) most of the state-of-the-art supervised approaches merely perform one-time alignment and cannot benefit from the progressive learning framework that utilizes the pseudo-labeled data for better training; and (3) some supervised approaches fail to make use of the side information that could provide useful signals for alignment. To verify the effectiveness of our proposed modules in the supervised setting, we allow CUEA to make use of labeled data, resulting in CUEA-sup. The results in Table 2 show that CUEA-sup attains much better performance than the state-of-the-art supervised approaches, as well as the unsupervised variant CUEA.

Furthermore, it can be seen that, by integrating the notion of confidence into UEA, CUEA achieves comparable results to UEA. At first sight, it seems that assigning confidence scores to entity pairs does not have a large influence on the representation learning and the alignment results. However, this could be ascribed to the fact that the side information is highly effective on these datasets (solely using the string information achieves an F1 score of 0.814, as shown in Table 5), rendering the structural information (which is most affected by the confidence scores) less contributive to the overall results. Next, we will show that the confidence-based framework is much more useful on datasets with low-quality side information.

Table 3 Alignment results given low-grade side information

6.2.2 Results Using Low-Quality Side Information

We compare the unsupervised approaches under a practical scenario where the side information is of low quality. Specifically, we assume that pre-trained word embeddings as well as machine translation tools are not available. Under this circumstance, to use the entity name information, a viable solution is to compare the name strings directly. However, direct string comparison is ineffective for cross-lingual datasets such as \(\texttt {DBP15K}_\texttt {ZH-EN}\) and \(\texttt {DBP15K}_\texttt {JA-EN}\), where the languages of the source and target KGs are disparate. Hence, we examine the effectiveness of these unsupervised approaches when the side information is of low quality and cannot provide many useful signals for alignment.

We report the results on \(\texttt {DBP15K}_\texttt {ZH-EN}\) and \(\texttt {DBP15K}_\texttt {JA-EN}\) in Table 3, where the direct comparison between entity name strings serves as the side information. It can be observed that the F1 scores of all methods are very low (compared with those in Table 2), revealing that the quality of side information does affect the overall alignment results. Besides, given the low-quality side information, our proposed models UEA and CUEA still outperform the baselines IMUSE and PARIS in terms of the F1 score, demonstrating the effectiveness of the progressive learning framework and the unmatchable entity prediction module. Moreover, it is notable that CUEA achieves better results than UEA in terms of all metrics. This could be attributed to the confidence-based alignment result generation process, which enables entity pairs of higher confidence (presumably, a higher probability of being correct) to have a larger impact on the representation learning and alignment process.

6.2.3 Efficiency Comparison

In this subsection, we evaluate the alignment efficiency. In Table 4, we report the running time of the unsupervised approaches, as well as the two most performant supervised approaches. The corresponding alignment performance can be found in Table 2. Table 4 shows that, generally speaking, the time costs of our proposals are acceptable and mainly come from the progressive learning process. PARIS and IMUSE have high efficiency since they adopt simple models to capture the KG structural information and mainly rely on the existing side information for alignment, while the supervised models conduct complicated modeling of the KG structure and hence require more time.

Table 4 Comparison of time costs (in seconds)
Table 5 Ablation results

6.2.4 Ablation Study

In this subsection, we examine the usefulness of the proposed modules by conducting an ablation study. First, by comparing the results of CUEA and UEA in Tables 2 and 3, we can conclude that the confidence-based framework is of great use, especially when the side information is inferior. Next, we perform the ablation study on the basis of UEA.

More specifically, in Table 5, we report the results of UEA w/o Unm, which excludes the unmatchable entity prediction module, and UEA w/o Prg, which excludes the progressive learning process. It shows that removing the unmatchable entity prediction module (UEA w/o Unm) brings down the performance on all metrics and datasets, validating its effectiveness in detecting the unmatchable entities and enhancing the overall alignment performance. Besides, without progressive learning (UEA w/o Prg), the precision increases, while the recall and F1 scores drop significantly. This shows that the progressive learning framework can discover more correctly aligned entity pairs and is crucial to the alignment process.

To provide insights into the progressive learning framework, we report the results of UEA w/o Adj, which does not adjust the threshold, and UEA w/o Excl, which does not exclude the entities in the alignment results from the entity sets during progressive learning. Table 5 shows that setting the threshold to a fixed value (UEA w/o Adj) leads to worse F1 results, verifying that the progressive learning process depends on the choice of the threshold and the quality of the alignment results. We further discuss the setting of the threshold in the next subsection. Besides, the performance also decreases if we do not exclude the matched entities from the entity sets (UEA w/o Excl), validating that this strategy can indeed reduce the difficulty of aligning entities.

Moreover, we replace our progressive learning framework with other state-of-the-art iterative learning strategies (i.e., MWGM [23], TH [22] and DAT-I [25]) and report the results in Table 5. It shows that using our progressive learning framework (UEA) can attain the best F1 score, verifying its superiority.

6.3 Quantitative Analysis

In this subsection, we perform quantitative analysis of the modules in UEA and CUEA. We first investigate the unmatchable entity prediction module. Then, we examine the robustness of the progressive learning framework by varying the hyper-parameters. Finally, we provide the analysis on the side information, i.e., the influence of the quality of side information on the overall results, and the usefulness of the preliminary alignment results generated by the side information.

Fig. 3 Alignment results given different threshold values. C-\(\theta\) refers to the number of correct matches generated by the progressive learning framework at each round given the threshold value \(\theta\). W refers to the number of erroneous matches generated in each round

6.3.1 Analysis on Unmatchable Entity Prediction

Regarding the unmatchable entity prediction module, we aim to examine: (1) whether the unmatchable entities can be accurately detected; (2) the influence of \(\theta\) in TBNNS on the overall performance; and (3) the influence of \(\lambda\) in C-TBNNS on the overall performance.

Unmatchable entity prediction Zhao et al. [20] propose an intuitive strategy (U-TH) to predict the unmatchable entities. They set an NIL threshold, and if the distance between a source entity and its closest target entity is above this threshold, they consider the source entity unmatchable. We compare our unmatchable entity prediction strategy with U-TH in terms of the percentage of unmatchable entities included in the final alignment results and the F1 score. On \(\texttt {DBP15K}_\texttt {ZH-EN}\), replacing our unmatchable entity prediction strategy with U-TH attains an F1 score of 0.837, which is 8.4% lower than that of UEA. Besides, in the alignment results generated using U-TH, 18.9% are unmatchable entities, while this figure for UEA is merely 3.9%. This demonstrates the superiority of our unmatchable entity prediction strategy.

The threshold \(\theta\) in TBNNS We discuss the setting of \(\theta\) to reveal the trade-off between the risk and gain of generating alignment results in progressive learning. Identifying a match leads to the integration of additional structural information, which benefits the subsequent learning. However, for the same reason, identifying a false positive, i.e., an incorrect match, potentially leads to mistakenly modifying the connections between KGs, at the risk of amplifying the error in successive rounds. As shown in Fig. 3, a smaller \(\theta\) (e.g., 0.05) brings low risk and low gain; that is, it merely generates a small number of matches, among which almost all are correct. In contrast, a higher \(\theta\) (e.g., 0.45) increases the risk and brings relatively higher gain; that is, it results in many more aligned pairs, while a certain portion of them are erroneous. Additionally, using a higher threshold leads to increasingly more alignment results, while for a lower threshold, the progressive learning process barely increases the number of matches. This is consistent with our analysis in Sect. 4.

Table 6 The influence of \(\lambda\) on the alignment results

The hyper-parameter \(\lambda\) in CUEA We then analyze the influence of \(\lambda\) in Eq. (2), which determines the range of the confidence scores, on the final alignment results. To highlight its influence on the structural representation learning, we follow the settings in Sect. 6.2.2 and report the results in Table 6.

Table 6 shows that the alignment performance is relatively stable when \(\lambda\) is not too large. Nevertheless, when setting \(\lambda\) to a large value (e.g., 1, which restores UEA), the results drop sharply. This reveals that assigning probability scores to the entity pairs according to their confidence of being true can facilitate the alignment. Besides, generally speaking, CUEA is robust to the perturbation of \(\lambda\) (as long as it is not too large).

Fig. 4 The F1 scores by setting \(\alpha\) and \(\beta\) to different values

6.3.2 Analysis on Progressive Learning Framework

Influence of hyper-parameters \(\alpha\) and \(\beta\) As mentioned in Sect. 6.1, we set \(\alpha\) and \(\beta\) to 0.5 since there are no training/validation data. Here, we aim to show that different values of these parameters do not have a large influence on the final results. More specifically, we keep \(\alpha\) at 0.5 and choose \(\beta\) from [0.3, 0.4, 0.5, 0.6, 0.7]; then we keep \(\beta\) at 0.5 and choose \(\alpha\) from [0.3, 0.4, 0.5, 0.6, 0.7]. It can be observed from Fig. 4 that, although smaller \(\alpha\) and \(\beta\) lead to better results, the performance does not change significantly.

6.3.3 Analysis on Side Information

We first analyze the influence of the side information on the final alignment results. Then, we examine the usefulness of preliminary alignment results generated by using the side information.

Influence of input side information We adopt different side information as input to examine the performance of UEA. More specifically, we report the results of \(\textsf {UEA-}\mathbf {M}^{\mathbf {l}}\), which merely uses the string-level feature of entity names as input, and \(\textsf {UEA-}\mathbf {M}^{\mathbf {n}}\), which only uses the semantic embeddings of entity names as input. We also provide the results of \(\mathbf {M}^{\mathbf {l}}\) and \(\mathbf {M}^{\mathbf {n}}\), which use the string-level and semantic information, respectively, to directly generate alignment results (without progressive learning).

As shown in Table 3, the performance of solely using the input side information is not very promising (\(\mathbf {M}^{\mathbf {l}}\) and \(\mathbf {M}^{\mathbf {n}}\)). Nevertheless, by forwarding the side information into our model, the results of \(\textsf {UEA-}\mathbf {M}^{\mathbf {l}}\) and \(\textsf {UEA-}\mathbf {M}^{\mathbf {n}}\) become much better. This reveals that UEA can work with different types of side information and consistently improve the alignment results. Additionally, by comparing \(\textsf {UEA-}\mathbf {M}^{\mathbf {l}}\) with \(\textsf {UEA-}\mathbf {M}^{\mathbf {n}}\), it is evident that the input side information does affect the final results, and the quality of the side information is of significance to the overall alignment performance.

Pseudo-labeled data We further examine the usefulness of the preliminary alignment results generated by the side information, i.e., the pseudo-labeled data. Concretely, we replace the training data in HGCN with these pseudo-labeled data, resulting in HGCN-U, and then compare its alignment results with the original performance. Regarding the F1 score, HGCN-U is 4% lower than HGCN on \(\texttt {DBP15K}_\texttt {ZH-EN}\), 2.9% lower on \(\texttt {DBP15K}_\texttt {JA-EN}\), and 2.8% lower on \(\texttt {DBP15K}_\texttt {FR-EN}\). The minor differences validate the effectiveness of the pseudo-labeled data generated by the side information. This also demonstrates that the strategy can be applied to other supervised or semi-supervised frameworks to reduce their reliance on labeled data.

7 Conclusion

In this article, we propose unsupervised EA solutions that are capable of dealing with unmatchable entities. We first exploit the side information of KGs to generate preliminary alignment results, which are considered as pseudo-labeled data and forwarded to the progressive learning framework to produce better KG embeddings and alignment results in a self-training fashion. We also devise an unmatchable entity prediction module to detect the unmatchable entities. The experimental results validate the usefulness of our proposed models and their superiority over state-of-the-art approaches.