1 Introduction

Figure 5.1 describes a toy example of EA. Typically, state-of-the-art EA solutions follow a pipeline that can be broadly divided into two main stages: representation learning and alignment inference. Most current works [4, 6, 33, 35] are dedicated to the former, leveraging various KG embedding models, e.g., TransE [2] and the graph convolutional network (GCN) [15], to learn the representations of entities. By using seed entity pairs as reference points, the entity embeddings of different KGs are projected onto a common embedding space. This allows the similarity or distance (Footnote 1) between entities from different KGs to be measured by comparing data points in the unified embedding space. Once the entities have been projected onto the unified embedding space, the alignment inference stage predicts the alignment results using these embeddings. Given an entity from the source KG, most state-of-the-art solutions adopt the direct alignment inference strategy, which ranks the entities in the target KG according to a specific similarity measure between entity embeddings. The top-ranked target entity is then considered a match for the source entity.

Fig. 5.1

An example of EA. There is an English and a Spanish KG concerning the band The National in the figures. The aim of EA is to find equivalent entities in these KGs using the KG structure, e.g., \([\text{A. Dessner}]_{\text{en}}\) and \([\text{A. Dessner}]_{\text{es}}\). The left part shows the alignment results generated by current EA solutions that perform direct alignment inference based on structural similarity, where both \([\text{A. Dessner}]_{\text{en}}\) and \([\text{B. Dessner}]_{\text{en}}\) are aligned to \([\text{A. Dessner}]_{\text{es}}\). In comparison, the right part shows the results generated by our proposed reciprocal alignment inference, where \([\text{A. Dessner}]_{\text{en}}\) is aligned to \([\text{A. Dessner}]_{\text{es}}\), while \([\text{B. Dessner}]_{\text{en}}\) is matched with \([\text{B. Dessner}]_{\text{es}}\). (a) Results of direct alignment inference using structural similarity. (b) Results of reciprocal alignment inference using entity preference

Despite the improvements made by current techniques in boosting the precision of EA, these sophisticated models typically involve a substantial number of parameters and demand significant computational resources. In other words, scalability is sacrificed for accuracy, making these approaches unsuitable for practical, large-scale KGs. For instance, it is reported in [50] that, on the DWY100K dataset with 200,000 entities [33], the time cost of most state-of-the-art solutions exceeds 20,000 seconds, and some approaches [39, 52] cannot even produce alignment results. Thus, real-life KGs consisting of tens of millions of entities pose a significant obstacle for current EA solutions, necessitating research on large-scale entity alignment. The investigation of large-scale EA aligns with the current trend of responsible design, development, use, and oversight of automated decision systems in the data management community [29].

Drawing inspiration from traditional graph partitioning strategies [12, 14], a feasible technique is to divide large KG pairs into several smaller subgraph pairs and then perform entity alignment on them. However, partitioning KG pairs for alignment is a challenging task that must achieve two objectives: (1) preserving the original structure of the KG as much as possible and (2) ensuring that the partition results of the source and target KGs match, meaning that equivalent entities in the source and target KGs are placed in the same subgraph pair. Although the first objective can be accomplished by modifying classical graph partitioning techniques such as METIS [13], the second objective is specific to the alignment task.

To achieve the second objective, we can use the seed entity pairs to guide the partition process. Seed entity pairs are pre-labeled entity pairs whose entities are known to be equivalent and which link the two individual KGs. Ideally, if we can preserve the seed entity pairs during the partition and distribute them among the smaller subgraph pairs, the remaining (unknown) equivalent entities would have a greater likelihood of being placed in the same subgraph pair using these seed entity pairs as references, as equivalent entities usually have similar neighboring structures. Following this idea, a concurrent work [10] proposes a preliminary approach, METIS-CPS (CPS for short). CPS first partitions one KG into subgraphs; then, based on the distribution of seed entities, it assigns appropriate weights to the edges in the other KG and partitions that KG. However, it can be challenging for methods of this type (referred to as unidirectional partition strategies) to achieve the first objective, because the partitioning of the second KG is constrained by the requirement to maintain the seed links, which may compromise the structure of the KG to some extent.

To address this issue, this chapter proposes the Seed-oriented Bidirectional graph Partition framework, SBP, which aims to satisfy both objectives by conducting bidirectional partitions and aggregating the partition results from the source-to-target and target-to-source directions. The motivation behind this approach is that the subgraphs generated from partitioning the first KG tend to have more complete structures, while the subgraphs generated from partitioning the second KG mainly retain alignment signals. By performing bidirectional partitions and combining the subgraphs, the resulting subgraphs in each KG can have both complete structures and larger numbers of seed entities pointing to the subgraphs in the opposite KG, which can lead to more precise alignment results. Note that SBP can be used with various unidirectional partitioning strategies. Additionally, an iterative variant of SBP, I-SBP, is proposed to improve partition performance by incorporating confident alignment results from previous rounds into the seed entity pairs.

During the partition process, the accuracy of alignment results may be compromised because equivalent entities could be placed in different subgraph pairs, and the original KG structure information may also be lost to some extent. To improve alignment performance, we propose to enhance the alignment inference stage, which has received little attention in previous work. Specifically, we introduce a reciprocal alignment inference strategy. The idea of reciprocal modeling of the alignment process is motivated by the fact that the commonly used direct alignment inference approach (1) considers an entity’s preference toward entities on the other side via a similarity score, but neglects other influential factors, and (2) fails to integrate bidirectional preference scores or capture the mutual preferences of entities when making alignment decisions. Such an alignment inference strategy tends to produce many inaccurate results, as illustrated in the following example.

Example As shown in Fig. 5.1a, using the structural information, the direct alignment inference strategy would align both \([\text{A. Dessner}]_{\text{en}}\) and \([\text{B. Dessner}]_{\text{en}}\) to \([\text{A. Dessner}]_{\text{es}}\), since \([\text{A. Dessner}]_{\text{es}}\) is the entity that has the most similar structural information with them (connected to three entities, including \([\text{The national}]_{\text{en/es}}\)).

However, this direct inference approach overlooks the fact that entities’ preferences are not solely determined by the similarity score but also by the impact of alignment in the reverse direction. For instance, it is evident that \([\text{A. Dessner}]_{\text{es}}\) has higher similarity with \([\text{A. Dessner}]_{\text{en}}\) than \([\text{B. Dessner}]_{\text{en}}\) since it shares more neighboring information with \([\text{A. Dessner}]_{\text{en}}\). Under this circumstance, \([\text{B. Dessner} ]_{\text{en}}\) will lower its preference toward \([\text{A. Dessner}]_{\text{es}}\), since in its view, although \([\text{A. Dessner}]_{\text{es}}\) is its most preferred candidate in terms of similarity, they are less likely to form a match because \([\text{A. Dessner}]_{\text{es}}\) has a higher similarity with \([\text{A. Dessner}]_{\text{en}}\).

Therefore, by modeling and aggregating the bidirectional preferences as depicted in Fig. 5.1b, we could avoid matching \([\text{B. Dessner} ]_{\text{en}}\) with \([\text{A. Dessner}]_{\text{es}}\) and possibly help identify its correct equivalent entity \([\text{B. Dessner} ]_{\text{es}}\).

Specifically, we propose to model the entity alignment task as a reciprocal recommendation process [18, 27], which takes effect at two levels: (1) Entity preference modeling. It first incorporates the influence of the alignment in the reverse direction into an entity’s preference, so as to generate more accurate preference scores. (2) Bidirectional preference integration. It integrates bidirectional preferences to generate a reciprocal preference matrix that encodes the mutual preferences of entities on both sides. Experimental results have shown that the two-level reciprocal modeling approach achieves superior results compared to direct inference (to be detailed in Sect. 5.7).

We further notice that while the reciprocal inference approach achieves superior alignment performance, it also consumes more memory and time than direct alignment inference. Therefore, to improve the efficiency, we propose two variants that approximate the reciprocal alignment inference: no-ranking aggregation and progressive blocking. The former removes the time- and resource-consuming ranking step during preference aggregation, while the latter divides the entities into multiple blocks and performs alignment within each block. These variant strategies can significantly reduce the memory and time costs associated with the reciprocal alignment inference, albeit at the cost of a slight decrease in effectiveness.

The proposed techniques form a novel and scalable solution for Large-scale entIty alignMEnt, namely, LIME. Notably, LIME is model-agnostic and can be used with any entity representation learning model. For empirical evaluation, we use the widely adopted GCN model [15] and the state-of-the-art RREA model [22]. To validate the effectiveness of LIME, we create a large EA dataset FB_DBP_2M with millions of entities and tens of millions of facts. Experimental results demonstrate that LIME can effectively handle EA at scale while remaining reasonably effective and efficient. We also compare LIME against state-of-the-art solutions on three mainstream datasets, showing that LIME can achieve promising results even on small-scale datasets.

Contributions

The main contributions of this chapter are the following:

  • We identify the scalability issue in state-of-the-art EA approaches and propose an EA framework LIME to deal with large-scale entity alignment.

  • We propose seed-oriented bidirectional graph partition strategies to partition large-scale KG pairs into smaller ones, where the alignment process is then conducted.

  • We propose a reciprocal alignment inference strategy that models and integrates the bidirectional preferences of entities when inferring alignment results.

  • We introduce two variants of reciprocal alignment inference that increase its scalability while incurring a small decrease in performance.

  • Our proposed model, LIME, is generic and can be applied to existing EA models to enhance their ability to handle large-scale entity alignment.

  • We demonstrate the effectiveness of our proposed model through a comprehensive experimental evaluation on popular entity alignment benchmarks and a newly constructed dataset with tens of millions of facts.

Organization

In Sect. 5.2, we present the outline of LIME. In Sect. 5.3, we introduce the partition strategies. In Sect. 5.4, we introduce the reciprocal alignment inference strategy. In Sect. 5.5, we introduce the variants of reciprocal alignment inference. In Sects. 5.6 and 5.7, we introduce the experimental settings and results, respectively. In Sect. 5.8, we review related work, followed by the conclusion in Sect. 5.9.

2 Framework

We present the overall framework of our proposal, LIME, in Fig. 5.2.

  • To handle large-scale KGs, we begin by performing seed-oriented bidirectional partition (SBP) to partition the source and target KGs into multiple subgraph pairs with the aid of seed entity pairs.

    Fig. 5.2

The framework of our proposal. The entities in gray represent the seed entities. The corresponding seed entities are connected by dotted lines on the left of the figure

  • Subsequently, for each subgraph pair, we employ a KG structural learning model (Footnote 2) to generate unified entity embeddings, enabling direct comparisons between entities from different KGs.

  • Afterward, using the unified entity representations, we apply a reciprocal alignment inference strategy to model entity preferences on both sides and aggregate bidirectional preference information to generate alignment results. We also recognize that although reciprocal modeling achieves superior performance, it is computationally expensive in terms of time and memory. Therefore, we propose two alternative strategies to reduce the memory and time consumption at a slight cost of effectiveness: no-ranking aggregation and progressive blocking.

  • Moreover, we introduce an iterative version of LIME to enhance the partition performance by incorporating confident alignment results from the previous round with the seed entity pairs. This iterative process leads to gradual improvements in both partitioning and alignment results.

3 Partition Strategies for Entity Alignment

To handle large-scale input KGs, a common approach is to partition the KGs and parallelize the computation across a distributed cluster of machines [3]. In this work, we adopt this approach and propose to partition KGs into smaller subgraphs, align entities in each subgraph pair, and aggregate the alignment results in each partition to produce the final aligned entity pairs.

We leverage the commonly used graph partition tool, METIS [13], as the basic partition strategy. The algorithms in METIS are based on multilevel graph partitioning [12, 14], which reduces the graph size by collapsing vertices and edges, partitions the smaller graph, and then uncoarsens it to construct a partition for the original graph. The aim is to create a balanced vertex partition that divides the set of vertices into multiple partitions of roughly equal size while minimizing the number of edges spanning the partitions. In the case of EA, however, there are two separate graphs at scale and only a small number of seed entity pairs connecting them. Although the two graphs are interlinked by the seed entity pairs and could be treated as one graph forwarded to METIS for partitioning, this approach is likely to generate subgraphs that contain only source or only target entities, because the seed links form a sparse cut between the two KGs and the minimum-cut objective tends to separate the graphs along these links. This is contrary to the goal of EA, which aims to identify equivalent entities between KGs. Therefore, we use seed-oriented graph partition strategies in this work.

In this section, we first introduce the seed-oriented unidirectional graph partition strategy as the baseline model. Then, we describe our proposed bidirectional partition framework and its iterative variants.

3.1 Seed-Oriented Unidirectional Graph Partition

Unidirectional graph partition strategies for EA conduct only one-way partition (e.g., source-to-target) of KG pairs using the seed entity pairs. Formally, they partition the source KG \(\mathcal {K}\mathcal {G}_s\) and target KG \(\mathcal {K}\mathcal {G}_t\) into k subgraph pairs \(\Phi = \{\mathcal {C}_1, \mathcal {C}_2, \ldots , \mathcal {C}_k\}\), where each subgraph pair \(\mathcal {C}_i = \{\mathcal {K}\mathcal {G}_s^i, \mathcal {K}\mathcal {G}_t^i, \mathcal {S}^i\}\) contains a pair of source subgraph \(\mathcal {K}\mathcal {G}_s^i\) and target subgraph \(\mathcal {K}\mathcal {G}_t^i\), as well as a number of seed entity pairs \(\mathcal {S}^i\) connecting the subgraphs. Specifically, in this work, we adopt a state-of-the-art unidirectional partition strategy CPS [10] as the baseline model.

CPS first directly partitions the source KG into k subgraphs \(\Phi _s = \{\mathcal {K}\mathcal {G}_s^1, \ldots , \mathcal {K}\mathcal {G}_s^k\}\) using METIS. Each source subgraph \(\mathcal {K}\mathcal {G}_s^i\) contains \(\varepsilon _i\) source entities \(\mathcal {S}^i_s = \{u^i_1, \ldots , u^i_{\varepsilon _i}\}\) from the seed entity pairs \(\mathcal {S}\). To partition the target KG (\(\mathcal {K}\mathcal {G}_t\)), we still use METIS, but with two modifications: (1) we assign higher weights to edges among seed target entities whose corresponding source entities are in the same subgraph, which encourages METIS to place these seed target entities in the same subgraph while retaining the overall KG structure; and (2) we assign a weight of 0 to edges among seed target entities whose corresponding source entities are from different subgraphs, say \(\mathcal {S}^i_s\) and \(\mathcal {S}^j_s\), which discourages METIS from placing these seed target entities in the same subgraph, since their corresponding seed source entities are not in the same subgraph. Partitioning the target KG also results in k subgraphs \(\Phi _t =\{\mathcal {K}\mathcal {G}_t^1, \ldots , \mathcal {K}\mathcal {G}_t^k\}\). Then, for each source subgraph \(\mathcal {K}\mathcal {G}_s^i\), CPS retrieves the target subgraph \(\mathcal {K}\mathcal {G}_t^*\) that possesses the largest number of target entities \(\mathcal {S}^*_t\) corresponding to seed source entities \(\mathcal {S}^i_s\) in \(\mathcal {K}\mathcal {G}_s^i\) and considers them as a subgraph pair \(\mathcal {C}_i = \{\mathcal {K}\mathcal {G}_s^i, \mathcal {K}\mathcal {G}_t^*, \mathcal {S}^*\}\), where \(\mathcal {S}^*\) refers to the links connecting \(\mathcal {S}^*_t\) and \(\mathcal {S}^i_s\). We illustrate this process using the following example.

Example As shown in Fig. 5.3, there are two KGs to be aligned (i.e., \(\mathcal {K}\mathcal {G}_s\) and \(\mathcal {K}\mathcal {G}_t\)), where the colored lines denote the links in seed entity pairs, and the seed entities are also represented in gray. The entities with the same subscripts are equivalent.

Fig. 5.3

Illustration of the partition process. In each box, the solid line separates different subgraph pairs, while the dotted line differentiates the source subgraphs from the target ones

CPS conducts a one-off source-to-target partition. It first partitions \(\mathcal {K}\mathcal {G}_s\), resulting in the two source subgraphs shown in the left part of the box, which consist of \(\{u_1, u_2, u_3, u_4\}\) and \(\{u_5, u_6, u_7, u_8, u_9\}\), respectively. Next, when partitioning \(\mathcal {K}\mathcal {G}_t\), it increases the weight of the edge between \(v_1\) and \(v_4\) (resp., \(v_6\) and \(v_7\)) since the seed source entities \(u_1\) and \(u_4\) (resp., \(u_6\) and \(u_7\)) are in the same subgraph. Additionally, it sets the weight of the edge between \(v_4\) and \(v_7\) to 0. \(\mathcal {K}\mathcal {G}_t\) is thus partitioned into the two subgraphs shown in the right part of the box, which consist of \(\{v_1, v_2, v_3, v_4, v_5\}\) and \(\{v_6, v_7, v_8, v_9\}\), respectively. Finally, using the seed entity pairs as anchors, it generates two subgraph pairs, i.e., \(\mathcal {C}_1\) and \(\mathcal {C}_2\).
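The edge-weighting step in this example can be sketched in a few lines of Python. The sketch below assumes the source partition is available as a mapping src_membership from seed source entities to subgraph ids; the names (seed_pairs, target_edges, HIGH_WEIGHT) and the concrete weight values are illustrative assumptions, not the original implementation.

HIGH_WEIGHT = 10      # assumed boost for edges whose seed counterparts share a subgraph
DEFAULT_WEIGHT = 1

def weight_target_edges(target_edges, seed_pairs, src_membership):
    """Assign METIS edge weights to the target KG, following the CPS heuristic."""
    # Map each seed target entity to the subgraph id of its seed source counterpart.
    tgt_block = {v: src_membership[u] for u, v in seed_pairs if u in src_membership}
    weights = {}
    for v1, v2 in target_edges:
        b1, b2 = tgt_block.get(v1), tgt_block.get(v2)
        if b1 is not None and b2 is not None:
            # Both endpoints are seed target entities: boost or zero out the edge.
            weights[(v1, v2)] = HIGH_WEIGHT if b1 == b2 else 0
        else:
            weights[(v1, v2)] = DEFAULT_WEIGHT
    return weights  # these weights are then fed to METIS when partitioning the target KG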

3.2 Bidirectional Graph Partition

It is observed that in unidirectional partition strategies like CPS, the partition of the source KG can preserve its original structure well. However, the partition of the target KG is limited by the goal of retaining the seed entity pairs, which may lead to the destruction of the KG structure to some extent. As a solution, we propose a seed-oriented bidirectional graph partition framework, called SBP. The SBP framework first conducts the source-to-target partition using any unidirectional strategy, resulting in a set of source subgraphs (\(\Phi _s^0\)) and a set of target subgraphs (\(\Phi _t^0\)). Then, it conducts the partition process reversely, obtaining another set of source subgraphs (\(\Phi _s^1\)) and target subgraphs (\(\Phi _t^1\)). Next, it identifies and combines corresponding source subgraphs in \(\Phi _s^0\) and \(\Phi _s^1\), resulting in the aggregated set of source subgraphs (\(\Phi _s\)). Similarly, it generates the aggregated set of target subgraphs (\(\Phi _t\)). Finally, for each source subgraph (\(\mathcal {K}\mathcal {G}_s^i \in \Phi _s\)), it retrieves the target subgraph (\(\mathcal {K}\mathcal {G}_t^* \in \Phi _t\)) that possesses the largest number of seed target entities (\(\mathcal {S}^*_t\)) corresponding to seed source entities in \(\mathcal {K}\mathcal {G}_s^i\). It considers them as a subgraph pair (\(\mathcal {C}_i = \{\mathcal {K}\mathcal {G}_s^i, \mathcal {K}\mathcal {G}_t^*, \mathcal {S}^*\}\)) for alignment. The detailed process is presented in Algorithm 1 and the following example.

Algorithm 1: Bidirectional graph partition (SBP)

Example Continuing with the previous example, the SBP framework conducts the target-to-source partition, resulting in two target subgraphs comprising \(\{v_1, v_2, v_3, v_4, v_7\}\) and \(\{v_5, v_6, v_8, v_9\}\) and two source subgraphs comprising \(\{u_1, u_2, u_3, u_4, u_5, u_7\}\) and \(\{u_6, u_8, u_9\}\). Next, it identifies and combines corresponding source and target subgraphs generated by the source-to-target and target-to-source partition. For instance, based on the number of overlapping source seed entities, it identifies that the source subgraph comprising \(\{u_1, u_2, u_3, u_4\}\) (resp., \(\{u_5, u_6, u_7, u_8, u_9\}\)) generated by the source-to-target partition and the source subgraph comprising \(\{u_1, u_2, u_3, u_4, u_5, u_7\}\) (resp., \(\{u_6, u_8, u_9\}\)) generated by the target-to-source partition are corresponding. It combines them to generate the aggregated subgraph \(\{u_1, u_2, u_3, u_4, u_5, u_7\}\) (resp., \(\{u_5, u_6, u_7, u_8, u_9\}\)). The target subgraphs are aggregated in the same way. Finally, using the seed entity pairs as anchors, it generates two subgraph pairs, as shown in the rightmost box.

As shown in Fig. 5.3, in the partition results of CPS, equivalent entities may be placed in different subgraph pairs, such as \(u_5\) and \(v_5\). The SBP framework can effectively mitigate this issue by conducting bidirectional partitions and aggregating the results. Hence, while the partition results of the SBP framework may include redundant entities that exist in multiple subgraph pairs, it can still effectively decrease the instances where equivalent entities are allocated to different subgraph pairs.
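As a rough illustration of the correspondence-and-union step described above, the following Python sketch matches each subgraph from one partition direction with the subgraph from the other direction that shares the most seed source entities and takes their union; phi_0, phi_1, and seed_src are assumed to be collections of entity-id sets and are illustrative names.

def aggregate_subgraphs(phi_0, phi_1, seed_src):
    """phi_0, phi_1: lists of entity sets produced by the two partition directions."""
    aggregated = []
    for sub0 in phi_0:
        seeds0 = sub0 & seed_src
        # The corresponding subgraph is the one with the largest seed overlap.
        best = max(phi_1, key=lambda sub1: len(seeds0 & sub1))
        aggregated.append(sub0 | best)   # the union may introduce redundant entities
    return aggregated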

Merits of Bidirectional Partitioning

Notably, \(\Phi _s^0\) (resp., \(\Phi _t^1\)) is generated with the aim of preserving the original KG structure, while \(\Phi _s^1\) (resp., \(\Phi _t^0\)) is generated with the aim of both retaining the seed links and preserving the original KG structure. Consequently, integrating the subgraphs in \(\Phi _s^0\) (and \(\Phi _t^1\)) with those in \(\Phi _s^1\) (and \(\Phi _t^0\)) yields aggregated subgraphs in \(\Phi _s\) and \(\Phi _t\) that have a more comprehensive structure and a greater number of seed entities pointing to the subgraphs on the opposite side. This can ultimately lead to more precise alignment outcomes.

Moreover, unlike \(\Phi _s^0\), \(\Phi _s^1\), \(\Phi _t^0\), or \(\Phi _t^1\), where the subgraphs do not have common entities, the subgraphs in \(\Phi _s\) and \(\Phi _t\) overlap. This is comparable to the concept of redundancy-based methods in traditional entity resolution (ER) blocking techniques, where an entity can be assigned to multiple blocks [26]. This is because the partitioning process may unavoidably assign equivalent entities to different subgraph pairs, which limits the upper bound of the alignment performance (as the alignment is only performed within each subgraph pair). However, this upper bound can be raised through bidirectional partitioning, which assigns an entity to multiple subgraph pairs. This is empirically validated in Sect. 5.7.3.

Integration of Subgraph-Wise Alignment Results

As previously mentioned, the partition results produced by unidirectional strategies do not have redundancies, and therefore, the alignment outcomes can be obtained by directly merging subgraph-wise alignment results. However, since the subgraph pairs generated by SBP may contain overlapping entities, an additional result aggregation module is necessary to resolve any potential conflicts in the alignment outcomes. To address this, we adopt a straightforward voting strategy. Specifically, for the source entity aligned to multiple target entities generated by different subgraph pairs, we choose the target entity with the highest number of “votes” from the subgraph pairs as the final alignment outcome. If multiple target entities have the same highest vote, we select the one with the lowest mutual preference rank (explained in Sect. 5.4.3) as the match.
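A minimal sketch of this voting-based conflict resolution is given below; subgraph_results is assumed to be a list of per-subgraph-pair dictionaries mapping each source entity to its predicted target entity and its mutual preference rank (Sect. 5.4.3), and all names are illustrative.

from collections import Counter

def aggregate_alignments(subgraph_results):
    candidates = {}
    for result in subgraph_results:
        for u, (v, rank) in result.items():
            candidates.setdefault(u, []).append((v, rank))
    final = {}
    for u, cand in candidates.items():
        votes = Counter(v for v, _ in cand)
        best = max(votes.values())
        tied = [v for v, c in votes.items() if c == best]
        if len(tied) == 1:
            final[u] = tied[0]
        else:
            # Tie-break: pick the candidate with the lowest mutual preference rank.
            final[u] = min((r, v) for v, r in cand if v in tied)[1]
    return final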

3.3 Iterative Bidirectional Graph Partition

It is clear that the one-off partitioning approach tends to generate inaccurate partition results, where equivalent entities may be placed in different subgraph pairs, and the original KG structure information could also be partially lost. To address this issue, we propose an iterative framework called I-SBP, which performs the partitioning process for \(\gamma \) rounds based on the signals provided by the previous round. Specifically, in each iteration, we partition the KG into k subgraph pairs using SBP and perform entity alignment within each subgraph pair (detailed in the next section). We then aggregate the subgraph-wise alignment results to generate the final aligned entity pairs. Since the final alignment results include confident entity pairs, which can be considered as pseudo seeds according to previous studies [33, 45], we select these entity pairs using the bidirectional nearest neighbor search in [45] and add them to the seed entity pairs \(\mathcal {S}\) to aid the partition in the next round. The process is detailed in Algorithm 2.

Algorithm 2: Iterative SBP (I-SBP)
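The seed-augmentation step in Algorithm 2 relies on selecting confident pairs. One simple way to realize the bidirectional nearest neighbor search of [45] is to keep the pairs that are mutual nearest neighbors under the similarity matrix, as in the hedged NumPy sketch below (sim is an |E_s| x |E_t| similarity matrix, indices stand for entities, and the names are illustrative).

import numpy as np

def mutual_nearest_pairs(sim):
    """Return (source_idx, target_idx) pairs that are bidirectional nearest neighbors."""
    fwd = sim.argmax(axis=1)   # most similar target for each source entity
    bwd = sim.argmax(axis=0)   # most similar source for each target entity
    return [(u, v) for u, v in enumerate(fwd) if bwd[v] == u]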

3.4 Complexity Analysis

The time complexity of SBP is roughly double that of the unidirectional partition strategy it employs, while the time complexity of I-SBP is approximately \(\gamma \) times that of SBP. We use CPS as the unidirectional partition strategy in this study, and its time complexity is \(O\big(|\mathcal{S}|+\frac{(2k-1)|\mathcal{S}|^2}{k^2} + |E_s| + |E_t| + |T_s| + |T_t| + k\log k\big)\) [10], where \(|E_s|\) (resp., \(|E_t|\)) and \(|T_s|\) (resp., \(|T_t|\)) denote the number of entities and triples in the source (resp., target) KG, \(|\mathcal {S}|\) refers to the number of seed entity pairs, and k denotes the number of subgraph pairs.

Regarding space complexity, most unidirectional partition strategies need to store the two KGs simultaneously. SBP, however, performs bidirectional partitions and therefore needs to store four KGs. The space complexity of I-SBP is similar to that of SBP. In general, the space complexity of the partition process is determined by the size of the KGs involved.

3.5 Discussion

It is important to note that partition strategies are used to divide large-scale KG pairs into smaller ones so that state-of-the-art deep learning-based methods can be applied to identify equivalent entities. However, the partition process can reduce the alignment performance, as equivalent entities can be placed into different subgraph pairs. While this issue can be mitigated by improving partition strategies, it cannot be entirely avoided. Therefore, when dealing with small- or medium-sized datasets such as current entity alignment benchmarks, it may not be worthwhile to use partition strategies, since partitioning would not significantly reduce computational costs but would compromise alignment accuracy. This is also supported by empirical evidence from experiments conducted on the DWY100K dataset. Whether to use partition strategies ultimately depends on the alignment goal, i.e., efficiency or effectiveness. In this work, we follow previous works and do not employ partition strategies when dealing with small- or medium-sized EA datasets, except for the analysis of partition strategies.

4 Reciprocal Alignment Inference

After partitioning large-scale KG pairs into smaller ones, we perform alignment on each subgraph pair and combine the alignment results. In this section, we provide a brief overview of the representation learning process and then propose a reciprocal inference strategy (illustrated in Fig. 5.4) that takes into account the mutual interactions between bidirectional alignments to enhance the alignment inference process.

Fig. 5.4

An example of the preference modeling and aggregation. (a) Similarity matrix, (b) preference matrix, (c) ranking matrix, (d) reciprocal matrix

4.1 Entity Structural Representation Learning

The entity structural representation learning phase aims to model the structural characteristics of entities and project them from different knowledge graphs into a unified embedding space. In this space, the similarity between entities can be directly inferred by comparing their structural embeddings. Most state-of-the-art EA solutions focus on improving this phase by designing advanced structural representation learning models. However, our focus in this work is to enhance the alignment inference process and the capability of EA models to handle large-scale datasets. As such, our proposed model, LIME, is agnostic to the choice of structural learning models. We adopt a state-of-the-art embedding learning model for EA, RREA [22], which reflects entity representations along different relational hyperplanes to construct relation-specific entity embeddings for alignment. More model and implementation details can be found in the original paper. Besides, to demonstrate that LIME is generic and can be applied to existing representation learning models, we also adopt the most commonly used model in the EA literature, GCN [15, 38], as the baseline model. Relevant experimental evaluations can be found in Sect. 5.7.

4.2 Preference Modeling

Once we have obtained the unified entity representations, we can infer the alignment results based on entity preferences. Specifically, for each source entity, we predict its most preferred target entity as its equivalent entity.

Direct Alignment Inference

Previous studies only considered the similarity between entity representations to model entity preferences. We refer to this as direct alignment inference. Given an entity pair \((u,v), u \in \mathcal {E}_s, v \in \mathcal {E}_t\), their similarity score is denoted as \(sim(\boldsymbol {u},\boldsymbol {v})\) (Footnote 3), where \(\boldsymbol {u}\) and \(\boldsymbol {v}\) are the entity embeddings of u and v, respectively. The corresponding similarity matrix is denoted as \(\boldsymbol {S}\). For direct alignment inference, the preference score of u toward v is defined as:

$$\displaystyle \begin{aligned} p_{u,v} = sim(\boldsymbol{u},\boldsymbol{v}). {} \end{aligned} $$
(5.1)

According to this definition, the preference score of u toward v is the same as the preference score of v toward u, i.e., \(p_{u,v} = p_{v,u}\), since similarity measures are usually symmetric and do not differentiate between the two input elements.
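For reference, direct alignment inference under Eq. (5.1) amounts to a row-wise argmax over the similarity matrix, as in the following minimal NumPy sketch (src_emb and tgt_emb are assumed to be row-wise embedding matrices; cosine similarity is used here, as in our experiments).

import numpy as np

def direct_inference(src_emb, tgt_emb):
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    S = src @ tgt.T              # similarity matrix, i.e., the preference scores of Eq. (5.1)
    return S, S.argmax(axis=1)   # top-ranked target entity for each source entity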

Reciprocal Preference Modeling

We believe that to accurately model entity preferences, an entity’s preference score toward another entity should also consider the likelihood of a match between them. For instance, as can be observed from Fig. 5.1, for \([\text{B. Dessner} ]_{\text{en}}\), despite the high similarity score, it might have a low preference toward \([ \text{A. Dessner} ]_{\text{es}}\), since in its view, they are less likely to form a match (considering that \([\text{A. Dessner}]_{\text{es}}\) has a higher similarity with \([\text{A. Dessner}]_{\text{en}}\)). Theoretically, a source (target) entity would prefer the target (source) entities that have high similarities with it and meanwhile low similarities with other source (target) entities. In this connection, we define the preference score of u toward v as:

$$\displaystyle \begin{aligned} p_{u,v} = sim(\boldsymbol{u},\boldsymbol{v}) - \max\{sim(\boldsymbol{v},\boldsymbol{u}^\prime), u^\prime\in \mathcal{E}_s\} + 1, {} \end{aligned} $$
(5.2)

where \(0\leq p_{u,v} \leq 1\), and a larger \(p_{u,v}\) denotes a higher degree of preference. The preference score of v toward u is defined similarly.

Our definition of the preference score of an entity toward another entity is composed of three elements. The first element is the similarity score between the two entities, while the second element is the highest similarity score that the target entity has with any of the source entities. Intuitively, u would prefer v more if their similarity score \(sim(\boldsymbol {u},\boldsymbol {v})\) is close (ideally, equal) to the highest similarity score that v has, i.e., \(\max \{sim(\boldsymbol {v},\boldsymbol {u}^\prime ), u'\in \mathcal {E}_s\}\). Hence, we subtract the second element from the first. If the difference is close to 0, it shows that u is satisfied with v. To make the preference value positive, we add the third element, i.e., 1.

Our definition of the preference score takes into account the alignment in the reverse direction (i.e., the preference of the target entity toward the source entity), which is naturally incorporated into an entity’s preference modeling. Moreover, \(p_{u,v}\) is not necessarily equal to \(p_{v,u}\), since the preference score encodes the alignment information at the entity level (rather than the pairwise level as in Eq. (5.1)). We denote the matrix forms of the source-to-target and target-to-source preference scores as \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\), respectively, and in general \(\boldsymbol {P}_{s,t} \neq \boldsymbol {P}_{t,s}^\top \), where \(\boldsymbol {P}_{t,s}^\top \) is the transpose of \(\boldsymbol {P}_{t,s}\).
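In matrix form, Eq. (5.2) can be computed for all entity pairs at once: with S as the |E_s| x |E_t| similarity matrix (rows indexed by source entities), the column-wise maximum gives the highest similarity each target entity has with any source entity, and vice versa. A hedged NumPy sketch:

import numpy as np

def preference_matrices(S):
    # p_{u,v} = sim(u,v) - max_{u'} sim(v,u') + 1  (source-to-target preferences)
    P_st = S - S.max(axis=0, keepdims=True) + 1
    # p_{v,u} = sim(v,u) - max_{v'} sim(u,v') + 1  (target-to-source preferences)
    P_ts = S.T - S.max(axis=1, keepdims=True).T + 1
    return P_st, P_ts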

4.3 Preference Aggregation

The preference scores only reflect the preferences in one direction, whereas an optimal alignment result should consider the preference scores in both directions. Hence, we propose to aggregate the unidirectional preferences. More specifically, we first convert the preference matrix \(\boldsymbol {P}\) into the ranking matrix \(\boldsymbol {R}\). The elements in each row of \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\) are ranked in descending order of their values, resulting in \(\boldsymbol {R}_{s,t}\) and \(\boldsymbol {R}_{t,s}\), respectively (Footnote 4). Each element in the ranking matrix \(\boldsymbol {R}\) represents the rank of the corresponding preference score, where a lower rank indicates a higher preference value. Thus, the ranking matrices also encode the preference information.

The primary objective of transforming scores into ranks is to magnify the disparities between the scores. As we combine the source-to-target and target-to-source matrices to capture shared preferences, the small differences in scores on one side may be easily overlooked after the aggregation with information on the other side. Transforming scores into ranks allows us to preserve and integrate such differences into the ultimate mutual preference.

Afterward, we combine the two ranking matrices to capture the mutual preferences of entities and create the corresponding preference matrix:

$$\displaystyle \begin{aligned} \boldsymbol{P}_{s \leftrightarrow t} = \phi\big(\boldsymbol{R}_{s,t}, \boldsymbol{R}_{t,s}^\top\big), {} \end{aligned} $$
(5.3)

where \(\phi \) is an aggregation function, which can be any of the mean operators, cross-ratio uniform [43], or other viable methods [25]. For this study, we use the arithmetic mean, which remains impartial and presents both entities' preferences toward each other precisely, without showing any inclination toward a higher or lower rank [23, 25]. The reciprocal matrix contains elements denoted by \(p_{u\leftrightarrow v}\), each of which indicates the degree of mutual preference between a pair of entities (u and v). The lower the value of \(p_{u\leftrightarrow v}\), the higher the level of mutual preference between the two entities.

Algorithm 3: reciprocal_inference(ℰs, ℰt, S)

Algorithm 3 provides the details of the reciprocal alignment inference process. We also use Example 1 to further illustrate the process.

Example 1

As shown in Fig. 5.4, there are a total of four source entities (\(u_1, u_2, u_3, u_4\)) and four target entities (\(v_1, v_2, v_3, v_4\)). In \(\boldsymbol {S}\), \(\boldsymbol {P}_{s,t}\), \(\boldsymbol {R}_{s,t}\), and \(\boldsymbol {P}_{s \leftrightarrow t}\), the rows correspond to the source entities and the columns correspond to the target entities, while in \(\boldsymbol {P}_{t,s}\) and \(\boldsymbol {R}_{t,s}\), the rows correspond to the target entities and the columns correspond to the source entities. The entities with the same subscripts are equivalent.

  1. (a):

    The similarity scores in matrix \(\boldsymbol {S}\) are computed using cosine similarity between entity embeddings. If we consider these similarity scores as the entity preferences and align each source entity to its most preferred target entity, the results would be \((u_1, v_2)\), \((u_2, v_2)\), \((u_3, v_2)\), and \((u_4, v_2)\), which contain only one correct match in this example.

  2. (b):

    Using Eq. (5.2), we calculate the preference scores and obtain the preference matrices \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\). Based on the preference scores, if we align each source entity to its most preferred target entity, the results \((u_1, v_1)\), \((u_2, v_2)\), and \((u_3, v_3)\) are correct. However, for \(u_4\), both \(v_3\) and \(v_4\) are likely to be the correct match. Additionally, if we align each target entity to its most preferred source entity, it is difficult to determine the result for \(v_2\).

  3. (c):

    We convert the preference matrices \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\) into the ranking matrices \(\boldsymbol {R}_{s,t}\) and \(\boldsymbol {R}_{t,s}\), respectively.

  4. (d):

    We can aggregate the two ranking matrices using the arithmetic mean. Based on the reciprocal preference matrix \(\boldsymbol {P}_{s \leftrightarrow t}\), if we align each source (target) entity to its most preferred target (source) entity, all the results will be correct. □
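Putting the pieces together, the reciprocal inference described by Algorithm 3 can be approximated by the following self-contained NumPy sketch (a hedged illustration consistent with Eqs. (5.2) and (5.3), not the exact pseudocode of Algorithm 3).

import numpy as np

def rank_rows_desc(P):
    """Rank each row in descending order: the largest preference gets rank 1."""
    order = (-P).argsort(axis=1)
    ranks = np.empty_like(order)
    ranks[np.arange(P.shape[0])[:, None], order] = np.arange(1, P.shape[1] + 1)
    return ranks

def reciprocal_inference(S):
    P_st = S - S.max(axis=0, keepdims=True) + 1        # Eq. (5.2), source-to-target
    P_ts = S.T - S.max(axis=1, keepdims=True).T + 1    # Eq. (5.2), target-to-source
    R_st, R_ts = rank_rows_desc(P_st), rank_rows_desc(P_ts)
    P_mutual = (R_st + R_ts.T) / 2.0                   # Eq. (5.3), arithmetic mean
    return P_mutual.argmin(axis=1)                     # lowest mutual rank wins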

Discussion

One might note that Eq. (5.2) is similar to the definition of cross-domain similarity local scaling (CSLS) [16], a metric proposed to mitigate the hubness issue during nearest neighbor search. However, CSLS subtracts the average of the top-n highest similarity scores of both source and target entities from the pairwise similarity, resulting in a score that is still at the pairwise level and cannot fully characterize the preference of each entity. In contrast, our proposed entity-level preference measure better reflects the individual preferences of entities, and the integrated reciprocal preference matrix leads to more accurate alignment results, as demonstrated in Sect. 5.7.4.

4.4 Correctness Analysis

The optimal solution for alignment inference is to correctly identify all entity pairs. For example, in Fig. 5.4, where there are four source entities (\(u_1, u_2, u_3, u_4\)), four target entities (\(v_1, v_2, v_3, v_4\)), and the similarity matrix \(\boldsymbol {S}\), the optimal solution is \(\mathcal {M} = \{(u_1, v_1), (u_2, v_2), (u_3, v_3), (u_4, v_4)\}\).

Nevertheless, the ability of Algorithm 3 to attain a correct or optimal solution depends on the input similarity matrix \(\boldsymbol {S}\), which is generated by the deep learning-based representation learning process that captures the relatedness among entities. In the worst-case scenario, where the representation learning process fails to learn anything useful and the similarity matrix is composed of 0s, Algorithm 3, or any alignment inference strategy such as direct alignment inference, would produce results full of wrongly aligned entity pairs. However, if the similarity matrix is accurate (i.e., for each ground-truth entity pair \((u,v)\), u has a higher similarity score with v than the rest of the source entities, and likewise, v has a higher similarity score with u than the rest of the target entities), Algorithm 3 can find the correct solution, as proven below.

Proof

We prove that, given an accurate similarity matrix where for each ground-truth entity pair \((u,v)\):

$$\displaystyle \begin{aligned} \begin{aligned} &u= \mathop{\arg\max}\{sim(\boldsymbol{v},\boldsymbol{u}^\prime), u^\prime\in \mathcal{E}_s\}; \\ &v= \mathop{\arg\max}\{sim(\boldsymbol{u},\boldsymbol{v}^\prime), v^\prime\in \mathcal{E}_t\}, \end{aligned} \end{aligned}$$

the reciprocal inference algorithm could accurately identify these entity pairs.

Without loss of generality, we consider the ground-truth entity pair \((u,v)\). The following proof also applies to the rest of the ground-truth entity pairs.

First, we can derive that \(p_{u,v}= 1\) and \(p_{v,u}= 1\) according to Eq. (5.2). Further, we can derive that:

$$\displaystyle \begin{aligned} p_{u,v} = \max\{p_{u,v^\prime}, v^\prime\in \mathcal{E}_t\}; \quad p_{v,u} = \max\{p_{v,u^\prime}, u^\prime\in \mathcal{E}_s\}. \end{aligned}$$

After converting the scores into ranks, we have:

$$\displaystyle \begin{aligned} r_{u,v} = \min\{r_{u,v^\prime}, v^\prime\in \mathcal{E}_t\}; \quad r_{v,u} = \min\{r_{v,u^\prime}, u^\prime\in \mathcal{E}_s\}, \end{aligned}$$

where r denotes the rank value. Next, after aggregating with arithmetic mean using Eq. (5.3), we can derive that:

$$\displaystyle \begin{aligned} p_{u\leftrightarrow v} = \min\{p_{u\leftrightarrow v^\prime}, v^\prime\in \mathcal{E}_t\}; \quad p_{v\leftrightarrow u} = \min\{p_{v\leftrightarrow u^\prime}, u^\prime\in \mathcal{E}_s\}. \end{aligned}$$

Finally, according to Lines 10–12 in Algorithm 3, u and v would be aligned by reciprocal alignment inference. □

Therefore, the main challenge lies in obtaining an accurate similarity matrix. However, in most cases, the similarity matrix is likely to be inaccurate, as the representation learning process cannot guarantee to learn high-quality entity representations for generating an accurate similarity matrix. Thus, we categorize the similarity scores of the ground-truth entity pairs into four cases and discuss the performance of our proposed reciprocal alignment inference and the direct inference (baseline model) under these circumstances in the appendix. Empirically, reciprocal alignment inference achieves much better results than direct inference.

4.5 Complexity Analysis

Regarding the worst-case time complexity of Algorithm 3, the preference modeling process (Lines 1–3) requires \(O(n^2) + O(2n^2) + O(2n^2)\), as we can calculate the highest similarity scores outside of the loops; the ranking process (Lines 5–8) requires \(O(2n\times n\lg n)\); the aggregation process (Line 9) requires \(O(n^2)\); and the matching process (Lines 10–12) requires \(O(n^2)\), where n denotes the number of entities in a KG. Overall, the time complexity of Algorithm 3 is \(O(n^2\lg n)\). Notably, the time complexity of the direct alignment inference strategy is \(O(n^2)\). Our proposed reciprocal alignment inference has a higher time complexity than direct inference, as it includes an additional ranking process that converts the preference scores into ranks. However, the ranking process is crucial for enhancing the alignment performance, as will be confirmed experimentally.

The primary factor contributing to the space complexity of LIME is the reciprocal alignment inference stage, specifically the computation of the similarity, preference, and ranking matrices. In contrast, the direct alignment inference approach only requires computing the similarity matrix with a size of \(n\times n\), where n represents the number of entities. In our reciprocal modeling strategy, we remove the matrices once they are no longer necessary to decrease memory usage, and only up to three matrices are present at any given time. Thus, our model’s maximum memory consumption is three times that of the direct alignment inference.

5 Variants of Reciprocal Alignment Inference

As discussed in Sect. 5.4.5, incorporating the reciprocal preferences of entities into the model requires a greater amount of memory and time compared to the direct alignment strategy, as a result of computing the preference and ranking matrices. Consequently, in this section, we propose two alternative methods to minimize the memory and time usage associated with the reciprocal modeling.

5.1 No-Ranking Aggregation

The complexity analysis in Sect. 5.4.5 has identified that the increased time complexity is primarily due to the calculation of preference score rankings. Therefore, we propose a no-ranking aggregation strategy in order to approximate reciprocal alignment inference, which eliminates the ranking process and instead directly aggregates \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\) to produce the reciprocal preference matrix \(\boldsymbol {P}_{s \leftrightarrow t}\).
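A minimal sketch of this variant is given below, assuming the preference matrices from Eq. (5.2) are given; taking the argmax of the averaged preferences is an assumption of this sketch.

import numpy as np

def no_ranking_inference(P_st, P_ts):
    P_mutual = (P_st + P_ts.T) / 2.0   # average the raw preferences, skipping the ranking step
    return P_mutual.argmax(axis=1)     # here a higher aggregated preference is better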

5.2 Progressive Blocking

To further reduce the time and space requirements of reciprocal alignment inference, we propose a method to decrease the value of n. We introduce a progressive blocking method that partitions the entities into smaller blocks and infers the alignment results at the block level. The algorithm for this method is presented in Algorithm 4 and the process is illustrated in Fig. 5.5.

Fig. 5.5

An example of the progressive blocking process. The shape in gray denotes a block. For instance, there are five blocks in (b), i.e., \(\{u_2, v_2, u_4, v_4\}\), \(\{u_1\}\), \(\{v_1\}\), \(\{u_3\}\), and \(\{v_3\}\). (a) The unified graph and the similarity matrix, (b) first round of blocking, (c) second round of blocking, (d) third round of blocking

Difference from the Graph Partition Strategies

It is important to note that the progressive blocking process and the graph partition strategies presented in Sect. 5.3 are distinct, despite both being methods for dividing large graphs into smaller ones. The input to the graph partition strategies is a KG, and the goal is to partition it into smaller subgraphs while preserving the original KG structure. In contrast, the input to the progressive blocking method is a bipartite graph with nodes representing source and target entities to be aligned and edges representing pairwise connections between them. The aim is to divide the bipartite graph into smaller blocks, where alignment can be inferred within a smaller search space. Consequently, when aligning large KG pairs, we first conduct graph partitioning to divide the KGs into smaller subgraphs. For each small KG pair, we learn entity structural representations and reciprocally infer the alignment results, where the progressive blocking method can be used to reduce the time and memory costs of reciprocal inference.

One-Off Blocking

To provide more detail, the inputs to the progressive blocking process include the unified graph \(\mathcal {G}\), which contains entities from both source and target KGs and their pairwise connections; the similarity matrix \(\boldsymbol {S}\), which encodes the pairwise similarities between source and target entities; and \(\Theta \), the set of given thresholds (hyper-parameters). The blocking process begins by removing the connection between a source entity u and a target entity v if the similarity score between them, \(sim(\boldsymbol {u},\boldsymbol {v})\), is lower than a predefined threshold \(\theta \in \Theta \). This division creates different blocks of source and target entities, as illustrated in Fig. 5.5b. After obtaining the blocks, we perform reciprocal entity alignment on the entities within each block and aggregate the results from different blocks to obtain the overall alignment performance.

It is important to note that setting a small value for \(\theta \) will result in most connections remaining, and most entities remaining in the same block. Therefore, the threshold is typically set to a relatively large value to ensure that the entities are effectively divided into appropriate blocks. However, this blocking process may still produce isolated blocks containing only a single entity, as depicted in Fig. 5.5b. We have found that these isolated blocks can represent a significant portion of the overall entities. One intuitive approach to handling these isolated entities is to gather them together and place them in the same block. However, this block would likely be large in size, and reciprocally aligning the entities within it would still require a significant amount of memory (as empirically validated in Sect. 5.7.5).

Algorithm 4: progressive_blocking(G, S, Θ)

Progressive Blocking

To address this issue, we propose a progressive blocking strategy. The strategy begins by removing connections between source and target entities in \(\mathcal {G}\) with similarity scores lower than the threshold \(\theta \) and computing the connected components of \(\mathcal {G}\). Each connected component is considered as a block (Lines 2–3 in Algorithm 4). For each block in the block sets, if it contains more than one entity, it is added to the final set of blocks (Lines 6–7). We gather the isolated entities (blocks), place them into one block, and restore the connections among the entities in this block, which forms the new unified graph \(\mathcal {G}\) (Lines 9–10). Then, we choose the next \(\theta \) (smaller than the previous one) from \(\Theta \) and block \(\mathcal {G}\) using the same strategy, i.e., removing connections with similarity scores lower than \(\theta \). As the threshold is lower than the previous one, some of the connections among these entities would remain, and these entities would be placed into different blocks. Similarly, there may still be isolated entities, and we can repeat the progressive blocking strategy to generate more non-isolated blocks by gathering up the isolated entities and adjusting the threshold. Finally, we obtain the final set of blocks (Line 11). We perform reciprocal entity alignment within each block and aggregate the individual results to attain the final alignment performance. A running example is provided below.
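A hedged Python sketch of this loop is shown below, using NetworkX connected components on a bipartite graph built from the similarity matrix. The node labels ('s', i) and ('t', j), the handling of leftover entities after the last threshold, and all names are illustrative assumptions rather than the exact Algorithm 4.

import networkx as nx

def progressive_blocking(S, thresholds):
    nodes = {('s', i) for i in range(S.shape[0])} | {('t', j) for j in range(S.shape[1])}
    edges = [(('s', i), ('t', j), S[i, j])
             for i in range(S.shape[0]) for j in range(S.shape[1])]
    blocks, isolated = [], set()
    for theta in sorted(thresholds, reverse=True):   # thresholds decrease over the rounds
        G = nx.Graph()
        G.add_nodes_from(nodes)
        G.add_edges_from((u, v) for u, v, w in edges if w >= theta)
        isolated = set()
        for comp in nx.connected_components(G):
            if len(comp) > 1:
                blocks.append(set(comp))             # keep non-isolated blocks
            else:
                isolated |= comp
        # Next round: gather the isolated entities and restore the connections among them.
        nodes = isolated
        edges = [(u, v, w) for u, v, w in edges if u in isolated and v in isolated]
    if isolated:
        blocks.append(isolated)                      # any leftovers form one final block
    return blocks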

The Benefits and Limitations of Progressive Blocking

By applying the progressive blocking method, the memory and time costs of reciprocal alignment inference are significantly reduced, as the number of entities in each block is much smaller than in the original graph. This reduction is empirically validated in Table 5.2 in Sect. 5.7.1. However, the blocking process may partition equivalent entities into different blocks, which can negatively impact the alignment accuracy. This issue is further discussed and analyzed in Sect. 5.7.

Example Continuing with the previous example, we now explain the progressive blocking process, which is also illustrated in Fig. 5.5.

  1. (a)

    The inputs include the unified graph \(\mathcal {G}\) and the similarity matrix \(\boldsymbol {S}\).

  2. (b)

    We set the initial threshold \(\theta _0\) to 0.75 and remove the pairwise connections in \(\mathcal {G}\) with similarity scores lower than \(\theta _0\). This results in five blocks, i.e., \(\{u_2, v_2, u_4, v_4\}\), \(\{u_1\}\), \(\{v_1\}\), \(\{u_3\}\), and \(\{v_3\}\), among which the latter four blocks contain only one entity. We gather these entities and restore connections among them, resulting in \(\mathcal {G}_0\).

  3. (c)

    We lower the threshold and set it to \(\theta _1 = 0.7\). We then remove the connections in \(\mathcal {G}_0\) with similarity scores lower than \(\theta _1\), which generates three blocks, \(\{u_3, v_3\}\), \(\{u_1\}\), and \(\{v_1\}\). The latter two blocks are isolated, and we aggregate them to generate \(\mathcal {G}_1\).

  4. (d)

    Again, we lower the threshold to \(\theta _2 = 0.6\). The similarity score between \(u_1\) and \(v_1\) is higher than this threshold and they form a block. Finally, we obtain three blocks, \(\mathcal {B} = \{b_1, b_2, b_3\} = \{\{u_2, v_2, u_4, v_4\}, \{u_3, v_3\}, \{u_1, v_1\}\}\).

6 Experimental Settings

In this section, we introduce the experimental settings.

6.1 Dataset

Following previous works, we adopt three popular EA datasets for evaluation:

  • DBP15K [32], which includes three cross-lingual KG pairs extracted from DBpedia [1], i.e., \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) (Chinese to English), \({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\) (Japanese to English), and \({\mathtt {DBP15K}_{\mathtt {FR-EN}}}\) (French to English). Each KG pair comprises 15,000 aligned entity pairs and approximately 200,000 relational triples.

  • SRPRS [11], which involves two cross-lingual KG pairs that are extracted from DBpedia, i.e., \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) (English to French) and \({\mathtt {SRPRS}_{\mathtt {EN-DE}}}\) (English to German), and two mono-lingual datasets, i.e., \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) (DBpedia to Wikidata [37]) and \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\) (DBpedia to YAGO [30]). Compared with DBP15K, the entity degree distribution in SRPRS is closer to the real-life distribution. Each KG pair comprises 15,000 aligned entity pairs and approximately 70,000 relational triples.

  • DWY100K [33], which involves two mono-lingual KG pairs, i.e., \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) (DBpedia to Wikidata) and \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\) (DBpedia to YAGO). Each KG pair comprises 100,000 aligned entity pairs and approximately 900,000 relational triples. Compared with DBP15K and SRPRS, the scale of DWY100K is much larger.

Table 5.1 presents a summary of the statistics of these datasets. We use 30% of the aligned pairs for training and 10% for validation.

Table 5.1 Statistics of the datasets used for evaluation
Table 5.2 Evaluation results of variants of LIME on large-scale datasets

6.2 Construction of a Large-Scale Dataset

To evaluate the scalability of EA, we create a new dataset with millions of entities by using DBpedia and Freebase as the source and target KGs, respectively. We obtain the gold standards, i.e., aligned entity pairs, from the external links between DBpedia and Freebase (Footnote 5). We then extract the relational triples involving the entities in the external links from their respective KGs. To ensure the quality of the extracted triples, we follow the method proposed in a previous work [50]: we keep only the links whose source and target entities are involved in at least one triple in their respective KGs, and the entity sets are adjusted accordingly. As a result of this process, each KG contains over two million entities and tens of millions of triples. Table 5.1 presents the statistics of the newly constructed dataset.

6.3 Implementation Details

For the graph partition, we set the number of subgraph pairs k to 75 for FB_DBP_2M and 5 for DWY100K. For CPS, we adopt the same settings as in the original paper. The number of rounds \(\gamma \) of I-SBP is set to 3. For the representation learning models RREA and GCN, we adopt the same settings as in the original papers [22, 38]. We use cosine similarity to measure the similarity between entity embeddings. The reciprocal alignment inference stage does not require any additional parameters. Regarding the progressive blocking process, we conduct three rounds and set the thresholds \(\Theta \) (hyper-parameters) to the 50th percentile (median), the 25th percentile (first quartile), and the 1st percentile of the set comprising the largest similarity score of each source entity, all of which can be obtained directly from the similarity matrix. The intuition is that this schedule guarantees the thresholds are decreasing while preventing them from becoming too small, since they are drawn from the set of the largest similarity scores of all source entities.
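To illustrate how this threshold schedule can be obtained directly from the similarity matrix, the following is a minimal sketch; the function name and the exact percentile computation are our assumptions rather than the released implementation.

```python
import numpy as np


def threshold_schedule(sim, percentiles=(50, 25, 1)):
    """Derive the blocking thresholds Theta described above.

    sim: n_src x n_tgt matrix of cosine similarities between source and
    target entity embeddings. For each source entity we take its largest
    similarity score; the thresholds are the given percentiles of this set,
    so the schedule is guaranteed to be decreasing and stays in a realistic
    range of match scores.
    """
    row_max = sim.max(axis=1)            # best score of each source entity
    return [float(np.percentile(row_max, p)) for p in percentiles]
```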

To compare with the approaches that leverage extra information, we incorporate entity names into our proposal. We directly adopt the strategies proposed in [46] to generate useful features from entity names for alignment. We acknowledge that some methods [36, 44] use entity descriptions to improve the alignment performance significantly; however, we leave the integration of such information and the comparison with these methods to future work, as it is outside the scope of this study. The source code of LIME is publicly available at https://github.com/DexterZeng/LIME.

6.4 Evaluation Metrics

As per convention [6], we adopt Hits@1 as the performance measure, which indicates the proportion of source entities whose ground-truth counterpart is ranked first. Unless otherwise specified, Hits@1 is reported in percentage. We omit the frequently used Hits@10 and mean reciprocal rank (MRR) metrics since (1) they are less important indicators, as pointed out in previous works [6, 50], and (2) they show similar trends to Hits@1.
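For clarity, the following is a minimal sketch of how Hits@1 can be computed from a similarity matrix under direct alignment inference; the function name and data layout are ours for illustration.

```python
import numpy as np


def hits_at_1(sim, gold):
    """Percentage of source entities whose top-ranked target is the correct one.

    sim: n_src x n_tgt similarity matrix; gold: dict mapping a source entity
    index to the index of its ground-truth counterpart in the target KG.
    """
    predictions = sim.argmax(axis=1)   # direct inference: take the nearest target
    correct = sum(predictions[s] == t for s, t in gold.items())
    return 100.0 * correct / len(gold)
```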

In addition, we assess the alignment methods based on their memory usage (in GB) and time consumption (in seconds).

6.5 Competing Methods

Our model is compared against 24 methods, which are categorized into two groups. The first category consists of methods that employ various embedding learning models to acquire valuable entity representations for alignment, such as the following:

  • MTransE (2017) [6]: This work uses TransE to learn entity embeddings.

  • RSNs (2019) [11]: This work integrates recurrent neural networks with residual learning to capture the long-term relational dependencies within and between KGs.

  • MuGNN (2019) [4]: This work proposes a new multichannel graph neural network model that aims to learn alignment-focused embeddings for knowledge graphs by effectively encoding two KGs through multiple channels.

  • KECG (2019) [17]: This work jointly learns a knowledge embedding model and a cross-graph model: the former encodes intra-graph relationships, while the latter improves entity embeddings by incorporating information from their neighbors.

  • TransEdge (2019) [34]: It introduces a new embedding model that is focused on the edge or relation between entities. This model contextualizes the representation of relations by considering the specific pair of head and tail entities involved.

  • MMEA (2019) [28]: This paper proposes to model the multi-mapping relations in KG for EA.

  • AliNet (2020) [35]: It suggests an EA network that utilizes attention and gating mechanism to aggregate information from both direct and distant neighborhoods.

  • MRAEA (2020) [21]: This work involves creating a model that generates cross-lingual entity embeddings by focusing on the node’s incoming and outgoing neighbors, as well as the meta semantics of the relations it is connected to.

  • SSP (2020) [24]: This work proposes to combine the global structure of the knowledge graph with relational triples specific to each entity for alignment.

  • LDSD (2020) [5]: This paper proposes to capture both short-term variations and long-term interdependencies within knowledge graphs, with the goal of achieving better alignment.

  • HyperKA (2020) [31]: This work puts forward a hyperbolic relational graph neural network to embed knowledge graphs, utilizing a hyperbolic transformation to capture associations between pieces of knowledge.

  • RREA (2020) [22]: This work involves reflecting entity embeddings across various relational hyperplanes, in order to create relation-specific entity embeddings that can be utilized for alignment.

The second group of techniques makes use of data beyond the KG structure. This comprises the following:

  • JAPE (2017) [32]: This work uses the attributes of entities to refine the structural information.

  • GCN-Align (2018) [38]: This work utilizes GCN to produce entity embeddings, which are then merged with attribute embeddings to align entities present in separate KGs.

  • RDGCN (2019) [39]: The proposed approach involves a dual-graph convolutional network that is capable of incorporating relation information through attentive interactions between a knowledge graph and its dual relation counterpart.

  • HGCN (2019) [40]: This work proposes to learn entity and relation representations for EA jointly.

  • GM-EHD-JEA (2020) [42]: This work presents two coordinated reasoning techniques that can effectively address the many-to-one problem encountered during the inference process of entity alignment.

  • NMN (2020) [41]: This work proposes a neighborhood matching network that can handle the structural variability between KGs. The network utilizes similarity estimation between entities to capture both the topological structure and the difference in neighborhoods.

  • CEA (2020) [46]: This work introduces a collective framework that formulates EA as a standard stable matching problem, which is solved using the deferred acceptance algorithm.

  • DAT (2020) [48]: This work proposes a degree-aware co-attention network that integrates semantic and structural features to enhance the performance of long-tail entities.

  • DGMC (2020) [8]: This work introduces a two-stage neural architecture for acquiring and refining structural correspondences between graphs.

  • AttrGNN (2020) [19]: Besides structural features, this research suggests utilizing an attributed value encoder and dividing the knowledge graph (KG) into subgraphs to model diverse types of attribute triples for alignment.

  • RNM (2021) [53]: This work puts forward a relation-aware neighborhood matching model for entity alignment.

  • CEAFF (2021) [47]: As an extension to CEA, this research suggests an adaptive feature fusion strategy to incorporate various features and a reinforcement learning-based model for conducting collective alignment.

We choose these baselines since they are the most recent and best-performing approaches. Indeed, the majority of the baselines are embedding-based methods, since most EA approaches focus solely on the embedding learning stage; only a limited number of methods, such as CEA, GM-EHD-JEA, and CEAFF, focus on the alignment inference stage. To ensure a fair comparison, we executed the source code of the baseline methods in our experimental environment and report the results obtained there, which may differ from those reported in the original papers. We highlight the top-performing results in each table in bold.

7 Results

We aim to answer the following research questions by conducting relevant experiments:

  1. 1.

    Can LIME effectively cope with large-scale datasets? (Sect. 5.7.1)

  2. 2.

    Can LIME outperform state-of-the-art solutions on datasets in normal scales? (Sect. 5.7.2)

  3. 3.

    What influence do the partition strategies have on the alignment process? (Sect. 5.7.3)

  4. 4.

    Is reciprocal alignment inference more effective than the frequently used CSLS metric? What further insights can be gained into the reciprocal modeling process? (Sect. 5.7.4)

  5. 5.

    Is the progressive blocking process sensitive to the hyper-parameters? How should the hyper-parameters be set? (Sect. 5.7.5)

7.1 Evaluation on Large-Scale Dataset

Settings

To address RQ1, we experimented on FB_DBP_2M. None of the state-of-the-art approaches can be directly applied to this dataset due to the huge computational cost. Hence, we utilized the SBP algorithm to partition KGs and used CPS and I-SBP for comparison. We used GCN and RREA as the entity representation learning models. Regarding the alignment inference stage, we compared our proposed reciprocal alignment inference strategy RInf and its variants RInf-wr and RInf-pb with the direct alignment inference strategy DInf. For the comprehensiveness of evaluation, we also conducted the experiments on the medium-sized dataset DWY100K. The results are presented in Table 5.2.

Overall Results

According to Table 5.2, the best alignment performance on FB_DBP_2M and \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) is achieved by the combination of I-SBP, RREA, and RInf-pb, whereas replacing RInf-pb with RInf in this combination leads to the highest Hits@1 on the other DWY100K KG pair, \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\). In terms of efficiency, the combination of CPS, GCN, and DInf is the fastest across all three KG pairs. Additionally, the alignment results on DWY100K are much higher, while the memory and time costs are lower, than those on FB_DBP_2M, demonstrating that our newly constructed large-scale EA dataset poses a significant challenge to EA solutions.

Partition Strategies

In terms of partition strategies, it is clear that I-SBP consistently achieves the best alignment results, regardless of the choice of embedding and inference models. Moreover, using SBP results in better alignment performance than using CPS, highlighting the effectiveness of leveraging bidirectional information for KG partitioning. However, SBP is more time- and memory-intensive than CPS as it requires bidirectional partitions. I-SBP further increases the time cost to a significantly higher level, at least three times that of SBP. This excessive time cost is due to the iterative re-partitioning process. Additionally, I-SBP consumes more memory space than other partition strategies.

Alignment Inference Strategies

First, we compare our proposed alignment inference strategies with the direct alignment inference approach. The results presented in Table 5.2 indicate that our proposed reciprocal inference strategy RInf outperforms the commonly used direct alignment inference DInf by a significant margin on DWY100K. On the FB_DBP_2M dataset, although RInf cannot work due to its high memory cost, its approximation strategies still attain better results than DInf. In particular, compared to DInf, RInf-pb only requires moderately more time and memory while consistently achieving superior alignment results across all datasets under different combinations of partition and embedding learning models; it is especially effective in the iterative partition setting. On the other hand, RInf-wr incurs only slightly higher time and memory costs than DInf; it achieves better results than DInf when using CPS, but performs worse than DInf under bidirectional partitioning on \({\mathtt {DWY100K}_{\mathtt {DBP-YG}}}\) and FB_DBP_2M. This can be attributed to the fact that directly aggregating the preference scores can result in the loss of information, as the preference scores are typically very close. This is also discussed in Sect. 5.4.3.

In the next step, we compare RInf with its variants. On DWY100K, applying the blocking strategy reduces the Hits@1 performance of RInf by 2–5%, with the exception of cases where I-SBP is used. This is because the blocking process cannot guarantee that equivalent entities are placed in the same block. Nonetheless, RInf-pb reduces the memory cost by over 90% and the time cost by over 70%. This validates that our progressive blocking strategy can significantly increase the efficiency of the reciprocal modeling process at the cost of a slight performance drop. Although applying the blocking strategy reduces the Hits@1 performance of LIME, the results are still significantly higher than those of DInf. When using the iterative partition strategy, it can be observed that RInf-pb achieves comparable or even better Hits@1 performance than RInf. This is because the progressive blocking process reduces the search space and can generate more confident pairs, which can lead to increasingly better partition and alignment results.

Regarding the no-ranking variant RInf-wr, even though its time and memory costs are small (close to DInf), its alignment performance is significantly lower than RInf across all settings. This confirms that the ranking process is crucial in the preference aggregation process, as discussed in Sect. 5.4.3.

Representation Learning Models

Regarding the entity structural embedding learning methods, the more advanced model RREA achieves better results than the baseline model GCN with various partition and inference strategies, demonstrating the importance of modeling KG structure information for overall alignment performance. This also confirms that our proposal is independent of the embedding learning model and can consistently improve alignment results.

For further details on the design of each component and more experiments and discussions, please refer to the following subsections.

7.2 Comparison with State-of-the-Art Methods

In this subsection, we answer RQ2.

Settings

In the previous section, we demonstrated that LIME can effectively handle large-scale EA datasets. However, since state-of-the-art methods cannot handle the FB_DBP_2M dataset, we conducted further experiments on popular medium-sized and small datasets to validate the effectiveness of our proposal. Given that these datasets are relatively small, we did not use our proposed partition strategies in LIME, as discussed in Sect. 5.3.5. Therefore, we evaluated the effectiveness of our proposed reciprocal alignment inference strategy and its variants, using the RREA model as the representation learning module in LIME. We denote the variants using the no-ranking and progressive blocking strategies as LIME-wr and LIME-pb, respectively.

We present the results of methods that only utilize KG structure to learn entity embeddings for alignment in Table 5.3 and the results of methods that use additional information in Table 5.4. Additionally, we demonstrate that LIME can be applied to other representation learning models, with the results reported in Table 5.5. We also provide a comparison of efficiency in Fig. 5.6.

Fig. 5.6

Running time comparison of methods merely using structural information. (a) On DBP15K. (b) On DWY100K. (c) On SRPRS

Table 5.3 Hits@1 results of methods merely using structural information
Table 5.4 Hits@1 results of methods using additional information
Table 5.5 Hits@1 results of applying LIME to other methods on DBP15K and SRPRS

Comparison of Alignment Performance

We can observe from Tables 5.3 and 5.4 that LIME achieves the best alignment performance in both categories, and the performance of LIME-wr and LIME-pb also surpasses that of the baseline models, validating the effectiveness of the reciprocal inference strategy and its variant strategies. Notably, LIME adopts RREA as the representation learning component, which has already attained the highest Hits@1 among existing methods. To further validate that LIME is a generic framework that can improve the alignment performance of any representation learning-based EA method, we removed the RREA model and applied LIME to other models. Specifically, we selected a representative approach from each group, namely, RSNs and RDGCN, and report the results on DBP15K and SRPRS in Table 5.5; results on other datasets are omitted due to space limitations. The results in Table 5.5 verify that applying LIME leads to much better alignment performance than the direct alignment inference strategy, regardless of the approach or dataset, which further demonstrates the effectiveness and generality of the LIME framework.

Additionally, we can observe several trends from the tables: (1) the results on DWY100K are higher than those on DBP15K and SRPRS, since the KGs in DWY100K are denser and can thus provide more structural information for alignment. In comparison, the results on SRPRS are the worst among the three, as its KG structure is the sparsest. This reveals that the density of the KG structure is crucial to the alignment of entities; and (2) overall, compared with methods that only use structural information, the methods that incorporate additional features achieve much better alignment performance. On the mono-lingual datasets, some solutions even attain near-perfect results, showcasing the benefits of incorporating other useful features.

Usage of Partition Strategies

In Sect. 5.3.5, we discussed that when dealing with small- or medium-sized datasets, it may not be worth using the partition strategy, since partitioning may not significantly reduce computational costs while it may decrease the alignment accuracy. We empirically validate this point by comparing the results of LIME in Table 5.3 with the results in Table 5.2. Specifically, we can see from Table 5.2 that the Hits@1 of SBP+RREA+RInfFootnote 6 on \({\mathtt {DWY100K}_{\mathtt {DBP-WD}}}\) is 76.9%, while this figure is 81.6% for LIME (equivalent to RREA+RInf) in Table 5.3, demonstrating that the partition process indeed harms the alignment accuracy. Furthermore, the time costs of the two settings are of the same order of magnitude (thousands of seconds), so although using the partition strategy is faster, the speedup is not substantial.

Comparison of the Efficiency

We compare LIME with state-of-the-art approaches in terms of efficiency and show the results in Fig. 5.6.Footnote 7 The results demonstrate that LIME is efficient on all datasets, primarily because the representation learning model RREA is highly efficient; LIME only requires slightly more running time than RREA alone, in exchange for a significant improvement in alignment performance. It is also worth noting that the time cost is generally higher on larger datasets (such as DWY100K compared to DBP15K) and on denser datasets (such as DBP15K compared to SRPRS).

7.3 Experiments and Analyses on Partitioning

In this section, we seek to answer RQ3. By examining Table 5.2, we can conclude that I-SBP and SBP outperform CPS in generating precise alignment outcomes, albeit at the expense of greater time and memory usage. This section presents additional experiments aimed at assessing the efficacy of these partition methods.

Influence of Partition Strategies on Alignment Links

Our initial goal is to evaluate the percentage of preserved alignment links following the partitioning process. This is a critical aspect, as the optimal partition strategy should place equivalent entity pairs in the same subgraph pair, thereby enabling accurate alignment in subsequent stages. The ability of the partition strategy to group equivalent entity pairs together determines the maximum achievable alignment accuracy, as discussed in Sect. 5.3.2. As a result, we present the percentage of preserved gold alignment links following partitioning in Table 5.6.

Table 5.6 The percentage of gold alignment links preserved after partitioning

Table 5.6 shows that CPS destroys over 10% of the links on DWY100K and more than half of the links on FB_DBP_2M. This indicates that the partition process itself significantly reduces the maximum achievable alignment accuracy, which is undesirable. In comparison, adopting SBP retains 67.5% of the links on FB_DBP_2M, a relative improvement of over 50% compared with CPS. Moreover, I-SBP produces a remarkable improvement, preserving 80% of the links in FB_DBP_2M and almost all links in DWY100K. This demonstrates that iterative partitioning can effectively optimize the partition process and prevent equivalent entities from being placed into different subgraph pairs. Nevertheless, as shown in Table 5.2, the time cost of SBP is almost double that of CPS, while I-SBP requires significantly more time depending on the number of iterations.
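For reference, the percentage reported in Table 5.6 can be computed as sketched below, assuming the partition output is represented as mappings from entities to the id of the subgraph pair they were assigned to (our assumed representation, not necessarily the one used in the released code).

```python
def preserved_link_ratio(gold_links, src_partition, tgt_partition):
    """Percentage of gold alignment links whose two entities end up in the
    same subgraph pair after partitioning.

    gold_links: list of (src_entity, tgt_entity); *_partition: dicts mapping
    entities to the id of their assigned subgraph pair.
    """
    preserved = sum(
        1 for s, t in gold_links
        if s in src_partition and t in tgt_partition
        and src_partition[s] == tgt_partition[t]
    )
    return 100.0 * preserved / len(gold_links)
```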

Influence of the Number of Subgraph Pairs k

Our next step is to analyze the impact of the number of subgraph pairs k on the partition process. To be specific, Table 5.7 presents the percentage of preserved links, time cost, and the number of entities in the largest subgraph pair for CPS and SBP, with k set to 50, 75, and 100. Indeed, Table 5.7 indicates that as the number of subgraph pairs increases, the percentage of preserved links decreases, and the partition time cost increases for both CPS and SBP. However, increasing the number of subgraph pairs results in smaller subgraphs, which can be beneficial for structural representation learning strategies due to their scalability limitations.

Table 5.7 The percentage of preserved links, the time cost, and the number of entities in the largest subgraph pair of CPS and SBP given different k on FB_DBP_2M

7.4 Experiments and Analyses on Reciprocal Inference

In this subsection, we address RQ4.

Comparison with the CSLS Metric

In Sect. 5.4, we mentioned the CSLS metric, which was introduced to address the hubness problem in nearest neighbor search and may have a similar effect to the reciprocal alignment inference strategy. We thus replaced the reciprocal inference approach in LIME with the CSLS metric (with its hyper-parameter n set to 1, 5, or 10) and evaluated the corresponding Hits@1 results, which are presented in Fig. 5.7. All other settings were kept the same.
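For completeness, below is a minimal numpy sketch of the CSLS rescaling used as the comparison point here, computed from a precomputed cosine similarity matrix; this is our illustration of the standard formulation, not the evaluation code used in the experiments.

```python
import numpy as np


def csls_matrix(sim, n=10):
    """Rescale a similarity matrix with CSLS to penalize 'hub' entities.

    sim: n_src x n_tgt cosine similarity matrix; n: neighborhood size
    (the hyper-parameter varied in Fig. 5.7). CSLS(x, y) equals 2 * cos(x, y)
    minus the mean similarity of x to its n nearest targets and of y to
    its n nearest sources.
    """
    r_src = np.sort(sim, axis=1)[:, -n:].mean(axis=1)   # shape (n_src,)
    r_tgt = np.sort(sim, axis=0)[-n:, :].mean(axis=0)   # shape (n_tgt,)
    return 2 * sim - r_src[:, None] - r_tgt[None, :]


# Direct inference on the rescaled scores:
# predictions = csls_matrix(sim, n=1).argmax(axis=1)
```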

Fig. 5.7

Comparison of the reciprocal inference in LIME and the CSLS metric. DBP-WD\({ }^*\) and DBP-YG\({ }^*\) refer to the KG pairs in DWY100K. The results on FB_DBP_2M are omitted due to the excessive time and memory costs required by reciprocal inference in LIME and the CSLS metric

The results presented in Fig. 5.7 demonstrate that LIME consistently outperforms the CSLS metric on all datasets. This confirms that our reciprocal alignment inference strategy can more effectively model and integrate entity preferences, leading to more accurate alignment results compared to the CSLS metric (as discussed in Sect. 5.4). Additionally, we observe that the performance of the CSLS metric deteriorates as the hyper-parameter n increases.

Deeper Insights into the Preference Modeling and Aggregation

It is worth noting that when the entity representation learning model performs poorly on EA (i.e., it produces largely homogeneous entity embeddings, so that many similarity scores coincide), the preference matrix can contain many ties, which may impede the effectiveness of the reciprocal modeling approach. Therefore, we aim to (1) analyze the likelihood of ties occurring in the preference matrix and (2) empirically demonstrate that our proposed inference strategy can still improve the performance of a low-performing entity representation learning model even in the presence of ties.

Take \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) as an example. We analyzed the preference matrices of RREA and of a low-performing model, RSNs, both of size 10,500 × 10,500. On average, ties occur 8.72 times in each row or column of the RREA preference matrix, which is not a frequent occurrence. For the low-performing RSNs model, this figure increases to 12.82. This suggests that the quality of entity representations can influence the frequency of ties during preference aggregation, but the effect is not significant. Furthermore, despite the presence of ties in the ranking matrices, applying our reciprocal inference strategy improves the Hits@1 of RSNs by 10.9%, as shown in Table 5.5. This demonstrates that the reciprocal modeling approach can still benefit a low-performing entity representation learning method.
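As a rough illustration of how such tie statistics can be obtained, the sketch below counts, for each row of a preference (similarity) matrix, how many entries share a value with another entry in that row; both the rounding precision and the exact tie definition are our assumptions and may differ from the analysis above.

```python
import numpy as np


def average_ties_per_row(sim, decimals=4):
    """Average number of entries per row that are involved in a tie, i.e.,
    that would receive the same rank when the row is converted into a
    preference ranking. Scores are rounded before comparison (assumption).
    """
    rounded = np.round(sim, decimals)
    tie_counts = []
    for row in rounded:
        _, counts = np.unique(row, return_counts=True)
        tie_counts.append(int(counts[counts > 1].sum()))
    return float(np.mean(tie_counts))
```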

7.5 Experiments and Analyses on Progressive Blocking

In this subsection, we proceed to answer RQ5. First, we analyze the impact of the hyper-parameter \(\theta \) on the alignment performance and efficiency. Next, we discuss the parameter settings of the progressive blocking process.

Analysis of \(\theta \)

As mentioned in Sect. 5.5.2, setting \(\theta \) to a small value will retain the majority of connections, resulting in most entities being placed in the same block. On the other hand, setting \(\theta \) to a large value will remove many connections and separate entities into different isolated blocks. To empirically verify this claim, we conducted an experiment on the \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) dataset, varying the value of \(\theta \). We reported the total number of blocks (#Total), the size of the largest block (#MaxSize), the number of blocks that only contain one entity (which we refer to as isolated blocks, #Iso), the percentage of isolated blocks (Perc.), the aggregated Hits@1 results of performing the alignment within each block (H@1), and the aggregated Hits@1 results of performing the alignment within each block and the aggregated isolated blocks (H@1*), in Table 5.8.

Table 5.8 Analysis of the hyper-parameter \(\theta \) in progressive blocking on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\)

The results in Table 5.8 show that setting \(\theta \) to a large value, specifically 0.75, removes most pairwise connections, leading to over 10,000 blocks, of which 71.9% are isolated blocks; the Hits@1 result is also very low (48.1%). After aggregating the 8,225 isolated blocks into one block and performing alignment within it, the Hits@1 result increases to 67.9%. However, this aggregated block contains over 8,000 entities and still requires significant memory space. In contrast, setting \(\theta \) to a small value, specifically 0.4, places the majority of entities (over 20,000) in the same block, which does not achieve the objective of reducing memory space.

Therefore, in a progressive blocking setting, the value of \(\theta \) in the first round is typically set to a larger value. Although this may result in a larger size of the aggregated isolated block, the subsequent rounds with lower \(\theta \) values further process the aggregated block.

Analysis of the Progressive Blocking

In this work, we conduct three rounds of progressive blocking and directly set \(\Theta \) to the 50th percentile (median), the 25th percentile (first quartile), and the 1st percentile of the set of the largest similarity scores of all source entities, respectively. Here, our goal is to investigate the impact of the values of \(\theta \) and of the number of rounds of progressive blocking on the alignment performance and memory consumption. To be more specific, we keep two threshold values constant and vary the remaining one, and report the Hits@1 and memory size in Fig. 5.8a, b, and c. Moreover, we perform progressive blocking for 0 to 4 rounds and present the Hits@1 and memory size in Fig. 5.8d.

Fig. 5.8

Analysis of the progressive blocking. (a) Threshold in the first round. On \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\). (b) Threshold in the second round. On \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\). (c) Threshold in the third round. On \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\). (d) Rounds of progressive blocking. On \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\)

As shown in Fig. 5.8a, the value of the initial threshold has an impact on the final Hits@1 result and memory cost. Setting the initial threshold to a relatively small value may produce more accurate alignment results, but it also comes with a high memory cost since most entities are still connected and placed in the same block. On the other hand, a larger threshold can reduce the memory cost, but it also leads to a lower alignment performance.

Figures 5.8b and c demonstrate that the values of the thresholds in the second and third rounds have no significant impact on the memory cost and only a small influence on the alignment performance. Furthermore, Fig. 5.8d indicates that the Hits@1 performance and memory cost drop in the first few rounds and remain relatively stable with more rounds of blocking. Therefore, conducting progressive blocking for a few rounds is sufficient.

Threshold Setting in Practice

Based on the analysis, we can identify two crucial factors when setting the threshold schedule: (1) the threshold values should be gradually decreased; and (2) the initial threshold value should be selected carefully, possibly with the guidance of statistical information regarding the similarity scores. Therefore, our proposed strategy for scheduling the threshold is a feasible option in practice, and it can be adjusted based on the statistical information available.

8 Related Work

We provide a brief overview of the studies that have addressed the scalability problem in EA. The experimental study on EA [50] indicates that even state-of-the-art EA methods still suffer from poor scalability. While simpler models such as GCN-Align [38] and ITransE [51] are faster, they tend to be less effective [50]. In contrast, more effective models typically have complex architectures and are inefficient.

There have been several studies on relevant tasks that propose strategies for handling large-scale data. For instance, Flamino et al. [9] approach the alignment of entities in large-scale networks by clustering nodes using network-specific features. However, these features are not present in KGs, and the structure of KGs is more intricate than that of such networks. Zhuang et al. [54] suggest partitioning entities from various knowledge bases into smaller blocks using predicates in the triples. Nonetheless, aligning predicates across knowledge bases is itself a challenging task, and the source code of these methods is not available, so their implementations cannot be applied to the EA task. Zhang et al. [49] address the problem of linking large-scale heterogeneous entity graphs. However, such entity graphs only include entities of a few types, such as paper, author, and venue, and the relation types are also limited, which is very different from KGs. Thus, their proposed method, which depends on the characteristics of entity graphs, cannot be used for EA.

Several recent works have focused on addressing the efficiency issue in EA. Mao et al. [20] identify over-complex graph encoders and inefficient negative sampling strategies as the primary causes of poor efficiency in EA. They propose a novel KG encoder, the Dual Attention Matching Network, to reduce computational complexity. However, their work focuses only on the representation learning stage and is evaluated on a medium-sized dataset, DWY100K. GM-EHD-JEA [42] formulates EA as a task assignment problem and proposes to solve it using the Hungarian algorithm. However, the Hungarian algorithm cannot be directly applied to EA due to its prohibitive computation time; therefore, they propose a space separation strategy to reduce the search space so that the Hungarian algorithm can work properly. This method is similar to our blocking strategy without the progressive procedure. However, we improve the performance by aggregating isolated blocks, and our progressive blocking process further enhances efficiency.

Another recent work proposes a unidirectional strategy, CPS, to partition large-scale KGs and uses name information to improve alignment performance [10]. Overall, however, the scalability issue in EA remains a critical and underexplored problem. It is worth noting that entity resolution (ER) can be regarded as the general version of the EA task [50]. There have been several studies on improving the efficiency and scalability of ER, and we refer readers to the survey paper [7]. Our blocking strategy is inspired by these relevant works on ER.

9 Conclusion

In this chapter, we have highlighted the scalability issue in state-of-the-art EA approaches and proposed an effective solution, LIME, to address EA at scale. LIME first uses seed-oriented graph partition strategies to divide large-scale KG pairs into smaller subgraph pairs. Then, within each subgraph pair, it employs a novel reciprocal alignment inference strategy to generate alignment results based on the entity representations learned by existing embedding learning models. To enhance the scalability of reciprocal alignment inference, LIME offers two variant strategies that reduce computational costs, albeit with a slight decrease in performance. The experimental evaluations conducted on a novel large-scale EA dataset reveal that LIME can successfully address EA at scale. Besides, the empirical results on the popular EA datasets also validate the superiority of LIME and show that it can be applied to existing methods to improve their performance.