1 Introduction

The entity alignment performance heavily relies on the amount of labeled data (i.e., aligned entity pairs). It has been empirically verified that the alignment accuracy drops sharply when decreasing the number of seed entity pairs [23]. This is also illustrated in Fig. 7.1, where we summarize the alignment performance of the most performant EA solutions given varying sizes of training data.Footnote 1 Although this problem is prominent, it has been largely neglected by the existing literature, which directly extracts the supervision signals from the inter-language links in DBpedia [1] or reference links among DBpedia, YAGO [22], Freebase [3], and Wikidata [28]. In practice, though, such prior alignments might not exist among KGs constructed from different sources. In this case, manual annotation is required to produce the labeled data, which is a nontrivial task since the annotator needs to retrieve the entity equivalent to a given entity from a vast pool of candidates. Thus, to reduce the manual labeling cost and the reliance on labeled data, it is of great significance to study EA with scarce supervision.

Fig. 7.1

The Hits@1 alignment results of state-of-the-art methods (RREA [17], MRAEA [16], SSP [19], TransEdge [24], and AliNet [25]) on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) [23] given decreasing amounts of labeled data; RREA performs best, followed by MRAEA, SSP, TransEdge, and AliNet. Label rate denotes the percentage of labeled data in the whole dataset

In this chapter, we propose to approach EA with limited supervision by addressing two key research questions: (Q1) Given a fixed labeling budget, how do we select the entities for manual annotation so that the labeled data provide the most useful guidance for the alignment? Equivalently, to reach a certain target of alignment performance, how do we optimize the selection of entities for labeling so that we label as few entities as possible? (Q2) Given a limited amount of labeled data, how can we leverage the rich unlabeled data to facilitate the alignment?

In response to Q1, we exploit active learning (AL) to overcome the labeling bottleneck by asking queries, in the form of unlabeled entities, to be labeled by an oracle (e.g., a human annotator) [9, 21]. Through effective query strategies, the active learner can achieve satisfactory performance using as few labeled instances as possible, thereby reducing the cost of obtaining labeled data [21]. In this chapter, we develop several query strategies to characterize the informativeness of entities from different angles and offer a reinforced AL framework that blends these query strategies adaptively with the aim of selecting the most valuable entities to be labeled.

To answer Q2, inspired by recent advances in contrastive learning (CL) [12, 27], we devise an unsupervised contrastive loss to exploit the abundant unlabeled data for augmenting supervision signals. CL generates data representations by learning to encode the similarities or dissimilarities among a set of unlabeled examples. The underlying intuition is that the rich unlabeled data themselves can serve as supervision to help guide model training [29]. In this chapter, we employ two graph encoders to model different views of the structural information of entities and design a contrastive objective to distinguish the embeddings of the same entity in these two views from the embeddings of other entities. By incorporating the unsupervised contrastive loss into the semi-supervised alignment objective, the scarce supervision signals are amplified, which further improves the alignment performance.

The reinforced AL and the contrastive representation learning constitute RAC, an EA framework developed specifically to deal with scarce supervision. We empirically evaluate RAC on eight popular KG pairs against active and non-active baseline models. The results demonstrate that RAC achieves superior performance under scarce supervision and can be applied on existing EA models.

Contribution

The main contribution of this chapter can be summarized as follows: (1) we put forward RAC, an EA framework that aims to solve the scarce supervision issue, which can be employed on top of existing EA solutions to improve their capability of tackling limited labeled data; (2) we devise a reinforced active learning approach to blend query heuristics adaptively and select valuable entities for labeling, which benefits the subsequent alignment process; and (3) we make one of the first attempts to exploit contrastive learning for EA, where the underlying supervision signals in the abundant unlabeled entities are leveraged to facilitate alignment.

2 Preliminaries

2.1 Problem Formulation

A KG is denoted as \(\mathcal {G}=\{(s,r,o)\} \subset \mathcal {E}\times \mathcal {R}\times \mathcal {E}\), where \(\mathcal {E}\) is the entity set, \(\mathcal {R}\) is the relation set, and a triple \((s,r,o)\) represents a subject entity \(s\in \mathcal {E}\) and an object entity \(o\in \mathcal {E}\) connected by a relation \(r\in \mathcal {R}\). The inputs to entity alignment include two KGs to be aligned (i.e., the source KG \(\mathcal {G}_s\) and the target KG \(\mathcal {G}_t\)) and a set of labeled entity pairs \(\mathcal {S} = \{(u^*, v^*)| u^* \in \mathcal {E}_s, v^* \in \mathcal {E}_t, u^*\Leftrightarrow v^*\}\), where \(\Leftrightarrow \) represents equivalence, and \(\mathcal {E}_s\) and \(\mathcal {E}_t\) refer to the entity sets in the source and target KGs, respectively. The objective is to detect equivalent entity pairs among the rest of the entities.

The focus of this chapter is to study entity alignment with scarce supervision, which is decomposed into two problems, i.e., selecting entities for labeling and entity alignment under such limited supervision signals. The former is defined as, given a pool of unlabeled entities \(\mathcal {U}\), a labeling budget B, and an oracle to label the entities, selecting B entities from \(\mathcal {U}\) for annotation so that the labeled data can provide more useful guidance for the subsequent alignment.

2.2 Model Overview

We provide an overview of our proposed model RAC in Fig. 7.2. RAC operates in multiple iterations. In each iteration, we first conduct reinforced active learning, where we use query strategies to measure the informativeness of entities and exploit the multi-armed bandit (MAB) mechanism to blend these strategies adaptively, so as to produce the entities to be labeled by the oracle. Next, the labeled entity pairs are added to the training set and forwarded to the contrastive entity representation learning module. In this module, using the labeled entity pairs as connections, we project individual KGs to a unified embedding space, where the entities from different KGs become comparable and the equivalence between entities can thus be inferred. Specifically, we design a semi-supervised alignment loss function to enforce the embeddings of the entities in each labeled entity pair to be close, such that the supervision signals can be propagated to unlabeled entities and their embeddings are updated to be comparable across KGs. Then we add the unsupervised contrastive loss to contrast the structural entity representations learned by different graph encoders, which can leverage the rich unlabeled information for learning more expressive entity representations. Finally, the learned unified entity representations are used to conduct alignment inference to produce the results and also to help improve the query strategies.

Fig. 7.2

The framework of our proposed model RAC

3 Reinforced Active Learning

To cope with EA with scarce supervision, we first address the entity selection (for annotation) problem. Concretely, we adopt active learning (AL) to select the entities to be manually labeled with the aim of maximizing model performance with minimal effort. The AL process normally consists of multiple iterations. Given the labeling budget B, in each iteration, guided by the query strategies, we select \(b\) (\(b<B\)) entities with the highest informativeness for labeling and add the annotated entity pairs into the labeled data for training the EA framework. The iterations continue until the labeling budget is exhausted. Next, we introduce the query strategies in detail and then elaborate on the reinforced active entity selection framework.

3.1 Query Strategies

We leverage three query strategies, i.e., degree centrality, PageRank centrality, and information density, to characterize the informativeness (more specifically, the representativeness) of entities.Footnote 2

Degree and PageRank Centrality

Since the entities in KGs are not i.i.d., we consider that nodes with higher centrality contain more useful information and are of greater value. Hence, we adopt the commonly used centrality metric, the degree centrality \(\phi _{deg}(e)\), which is defined as the number of edges directly connected to entity e. Besides, we also leverage the PageRank centrality [18] \(\phi _{pr}(e_i;\Theta _p)\) to characterize the representativeness of entities:

$$\displaystyle \begin{aligned} {} \phi_{pr}(e_i;\Theta_p) =\rho\sum_{j}\boldsymbol{A}_{ij}\frac{\phi_{pr}(e_j;\Theta_p)}{\sum_{k}\boldsymbol{A}_{jk}} + \frac{1-\rho}{n}, \end{aligned} $$
(7.1)

where \(\boldsymbol {A}\) is the adjacency matrix, n is the number of entities in the KG, \(\rho \) is the damping parameter, and \(\Theta _p\) denotes the parameter set.
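As a concrete illustration, the two centrality strategies can be sketched with networkx as below; the graph construction, function names, and the damping value \(\rho = 0.85\) (the default used in Sect. 7.5.1) are illustrative assumptions rather than the exact implementation.

```python
# A minimal sketch of the degree and PageRank query strategies via networkx.
import networkx as nx

def centrality_scores(kg_edges, rho=0.85):
    # kg_edges: iterable of (head, tail) entity pairs taken from the KG triples
    graph = nx.Graph(kg_edges)
    phi_deg = dict(graph.degree())            # degree centrality phi_deg(e)
    phi_pr = nx.pagerank(graph, alpha=rho)    # PageRank centrality, Eq. (7.1)
    return phi_deg, phi_pr
```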

Information Density

In addition to the topological structure, the representativeness of an entity can also be measured at the embedding level. Concretely, we apply K-means to the embeddings of unlabeled entities. We consider that entities placed at or close to the centers of clusters are of greater value. Thus, we calculate the Euclidean distance \(d(\boldsymbol {e}, \boldsymbol {c}_e)\) between each entity e and the center \(c_e\) of the cluster it belongs to and define the information density of entity e as \(\phi _i(e;\Theta _i) = \frac {1}{1+d(\boldsymbol {e}, \boldsymbol {c}_e)}\), where \(\Theta _i\) denotes the parameter set. A larger \(\phi _i(e;\Theta _i)\) indicates that entity e is located in a denser area of the embedding space and is more representative.
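A minimal sketch of this strategy, assuming scikit-learn's K-means and treating the cluster centroid as \(c_e\); the number of clusters and all names are illustrative.

```python
# Information-density scores from K-means clusters over entity embeddings.
import numpy as np
from sklearn.cluster import KMeans

def information_density(emb, n_clusters=10):
    # emb: (n, d) array holding the embeddings of the unlabeled entities
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(emb)
    centers = km.cluster_centers_[km.labels_]      # centroid of each entity's cluster
    d = np.linalg.norm(emb - centers, axis=1)      # Euclidean distance d(e, c_e)
    return 1.0 / (1.0 + d)                         # phi_i(e; Theta_i)
```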

Considering that the query strategy scores are on incomparable scales, we convert them into percentiles as in [34]. Denote \(\mathcal {P}_\phi (e, \mathcal {U})\) as the percentile of the score of e among the unlabeled data \(\mathcal {U}\) in terms of query strategy \(\phi \). Accordingly, the converted percentile scores of degree centrality, PageRank centrality, and information density are denoted as \(\mathcal {P}_{deg}\), \(\mathcal {P}_{pr}\), and \(\mathcal {P}_i\), respectively.
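The percentile conversion can be sketched as follows; we assume scipy's percentileofscore and rescale to [0, 1], which leaves the relative ordering required by the subsequent blending unchanged.

```python
# Converting raw strategy scores into comparable percentile scores.
from scipy.stats import percentileofscore

def to_percentiles(scores):
    # scores: dict mapping each unlabeled entity to its raw phi value
    values = list(scores.values())
    return {e: percentileofscore(values, s) / 100.0 for e, s in scores.items()}
```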

3.2 Reinforced Active Entity Selection via MAB

We leverage the aforementioned query strategies to select the most informative entities for labeling. Considering that the significance of the query strategies might vary across iterations and that no single query strategy can satisfy the needs of all datasets, we propose to adaptively blend these strategies by adopting the multi-armed bandit (MAB) mechanism [26]. MAB problems are among the simplest reinforcement learning (RL) problems: we are given a slot machine with n arms (bandits), each with its own probability distribution of success. Pulling an arm yields a stochastic reward, and the objective is to pull arms so as to maximize the total reward collected in the long run. Based on MAB, we treat each query strategy as an arm and approximate the importance of each strategy by estimating the expected reward (i.e., utility) of the corresponding arm. In this chapter, we adopt an extended framework of MAB, i.e., the combinatorial MAB (CMAB) [6], which allows playing multiple arms in each iteration. Next, we elaborate on the implementation details for the alignment task.

Let \(\Phi \) be the set of arms. In each iteration t, based on the percentile scores, each arm \(\phi \in \Phi \) suggests its own set of entities to be labeled \(\mathcal {Q}_t(\phi )\), while the actual set of queried entities \(\mathcal {Q}_t\) is chosen based on the utility score \(\varepsilon \) assigned to each unlabeled entity \(e\in \mathcal {U}\), which is defined as:

$$\displaystyle \begin{aligned} \varepsilon_t(e) = \sum_{\phi \in \Phi} \varepsilon_t(\phi)\mathcal{P}_{\phi}(e), \end{aligned} $$
(7.2)

where \(\varepsilon _t(\phi )\) is the utility score of arm \(\phi \) in iteration t and \(\mathcal {P}_{\phi }(e)\) is the percentile score of entity e in terms of query strategy \(\phi \). Then, the top b entities from the unlabeled entity set with the highest \(\varepsilon _t(e)\) are selected as \(\mathcal {Q}_t\) for querying the oracle in iteration t.
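A sketch of this blended selection, Eq. (7.2), follows; the dictionary layout and names are illustrative assumptions.

```python
# Blend per-arm percentile scores with arm utilities and query the top-b entities.
import heapq

def select_queries(percentiles, utilities, unlabeled, b=50):
    # percentiles[phi][e] holds P_phi(e); utilities[phi] holds epsilon_t(phi)
    def utility(e):                                    # Eq. (7.2): sum over all arms
        return sum(utilities[phi] * percentiles[phi][e] for phi in utilities)
    return heapq.nlargest(b, unlabeled, key=utility)   # the queried set Q_t
```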

We estimate the utility \(\varepsilon (\phi )\) of each arm by taking into account the exploration-exploitation trade-off in MAB [6]. That is, we consider both the exploitation of the arm that has the highest expected payoff and the exploration needed to gather more information about the expected payoffs of the other arms. Regarding the former, we define the expected reward of choosing arm \(\phi \) in iteration t as the average reward it received in previous rounds:

$$\displaystyle \begin{aligned} {} \bar{\varepsilon}_t(\phi) = \frac{1}{t-1}\sum_{i=1}^{t-1} \hat{\varepsilon}_i(\phi), \end{aligned} $$
(7.3)

where \(\hat {\varepsilon }_i(\phi )\) is the reward received by arm \(\phi \) in round i, defined as the change of the alignment result on the validation set:

$$\displaystyle \begin{aligned} \hat{\varepsilon}_i(\phi) = \left( F(\mathcal{L}_i\cup \mathcal{L}_i(\phi)) - F(\mathcal{L}_i)\right) + \frac{|\mathcal{L}_i(\phi)|}{b} (F(\mathcal{L}_i\cup \mathcal{Q}_i) - F(\mathcal{L}_i)), \end{aligned} $$
(7.4)

where \(F(\cdot )\) denotes the value of a specific alignment measure (e.g., Hits@1, to be detailed in Sect. 7.5.1) on the validation set, computed using the labeled data given as the argument. \(\mathcal {L}_i\) represents the already labeled entities in iteration i, and \(\mathcal {L}_i(\phi ) = \mathcal {Q}_i(\phi )\cap \mathcal {Q}_i\) denotes the set of labeled entities suggested by query strategy \(\phi \) in iteration i. The difference between \( F(\mathcal {L}_i\cup \mathcal {L}_i(\phi ))\) and \(F(\mathcal {L}_i)\) represents the direct change of alignment performance brought by arm \(\phi \). Besides, we posit that the performance change caused by adding all the labeled entities \(\mathcal {Q}_i\) can also be used to measure the utility of each arm. Hence, we use \(\frac {|\mathcal {L}_i(\phi )|}{b}\) to denote the contribution of arm \(\phi \) and multiply it by the overall performance change to produce the implicit change of alignment performance brought by arm \(\phi \).
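The reward of Eq. (7.4) can be sketched as below; here F is assumed to be a callable that retrains the aligner on the given labeled set and returns the validation measure, which is the expensive step in practice.

```python
# Per-arm reward, Eq. (7.4); all sets are plain Python sets of entities.
def arm_reward(F, labeled, queried, suggested, b=50):
    # labeled: L_i; queried: Q_i; suggested: Q_i(phi)
    L_phi = suggested & queried                        # L_i(phi) = Q_i(phi) ∩ Q_i
    direct = F(labeled | L_phi) - F(labeled)           # explicit performance change
    implicit = len(L_phi) / b * (F(labeled | queried) - F(labeled))
    return direct + implicit
```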

Next, we move on to exploration. Following [6], to encourage leveraging the under-explored arms, we obtain the utility score of arm \(\phi \) in iteration t by adjusting \(\bar {\varepsilon }_t(\phi )\):

$$\displaystyle \begin{aligned} {} \varepsilon_t(\phi) = \bar{\varepsilon}_t(\phi) + \sqrt{\frac{3\ln t}{2n_\phi}}, \end{aligned} $$
(7.5)

where \(n_\phi \) represents the total number of labeled entities suggested by arm \(\phi \) until iteration t. Thus, the utility \(\varepsilon (\phi )\) of arm \(\phi \) is estimated by considering both exploitation and exploration, which provides more accurate signals for suggesting the entities to be labeled. Note that t starts from 1, and when \(t=1\), we omit the calculations of Eqs. (7.3)–(7.5) and set \(\varepsilon _1(\phi )\) to 1 for all arms.
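Putting Eqs. (7.3) and (7.5) together, the utility update can be sketched as follows; the guard on \(n_\phi \) is a defensive assumption for arms that have not yet contributed any labeled entity.

```python
# Exploitation/exploration utility update, Eqs. (7.3) and (7.5).
import math

def arm_utilities(history, n_phi, t):
    # history[phi]: rewards hat-eps_i(phi) from rounds 1..t-1;
    # n_phi[phi]: labeled entities arm phi has suggested so far
    if t == 1:
        return {phi: 1.0 for phi in history}       # uniform start, as in the text
    utilities = {}
    for phi, rewards in history.items():
        mean = sum(rewards) / (t - 1)              # Eq. (7.3): average past reward
        bonus = math.sqrt(3 * math.log(t) / (2 * max(n_phi[phi], 1)))  # Eq. (7.5)
        utilities[phi] = mean + bonus
    return utilities
```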

4 Contrastive Embedding Learning

Given the scarce labeled data generated by reinforced AL, in this section, we introduce contrastive entity representation learning, which further mitigates the scarce supervision issue by mining supervision signals from the unlabeled data. We first introduce the semi-supervised alignment loss, the core of EA models. Then we introduce the graph encoders. Finally, we elaborate on the unsupervised contrastive loss, as well as the training and inference processes.

4.1 Semi-supervised Alignment Loss

Since the entities (nodes) from different KGs cannot be compared directly, following current EA solutions, we first learn the entity structural embeddings of the source and target KGs independently, i.e., \(\boldsymbol {O}_s\) and \(\boldsymbol {O}_t\), and then devise a semi-supervised loss function that enforces the distance between the embeddings of the entities in each labeled entity pair to be small and, meanwhile, the distance for negative samples (i.e., nonequivalent entity pairs) to be large. Formally:

$$\displaystyle \begin{aligned} {} \mathcal{L}_s = \sum_{(u,v)\in\mathcal{S}} \sum_{(u^\prime,v^\prime)\in\mathcal{S}^\prime_{(u,v)}} [dis(\boldsymbol{u}, \boldsymbol{v}) + \gamma - dis(\boldsymbol{u}^\prime, \boldsymbol{v}^\prime)]_+ , \end{aligned} $$
(7.6)

where \([\cdot ]_+ = \max \{0,\cdot \}\), \((u,v)\) is a labeled entity pair from the training data, and \(\mathcal {S}^\prime _{(u,v)}\) represents the set of negative entity pairs obtained by corrupting \((u,v)\) using nearest neighbor sampling [15]. \(\boldsymbol {u}\) and \(\boldsymbol {v}\) represent the embeddings of the source and target entities retrieved from \(\boldsymbol {O}_s\) and \(\boldsymbol {O}_t\), respectively. \(dis(\cdot ,\cdot )\) is the distance function that measures the distance between two embeddings, and \(\gamma \) is a hyper-parameter separating positive samples from negative ones.

In this chapter, the entity representation is obtained by aggregating the embeddings generated by two graph encoders: \(\boldsymbol {O}_\omega = agg(\boldsymbol {Z}^{\psi _1}_\omega , \boldsymbol {Z}^{\psi _2}_\omega )\), where agg is the aggregation function, which can be implemented as a weighted average, concatenation, etc. \(\boldsymbol {Z}^{\psi _1}_\omega \) and \(\boldsymbol {Z}^{\psi _2}_\omega \) represent the embeddings generated from the two different views, and \(\omega \in \{s,t\}\) denotes the source or target KG.
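A minimal PyTorch sketch of the weighted-average aggregation and the hinge loss in Eq. (7.6), assuming Euclidean distance for \(dis(\cdot ,\cdot )\) and pre-sampled negatives; all tensor names and shapes are illustrative.

```python
# Weighted-average aggregation and the margin-based alignment loss, Eq. (7.6).
import torch

def aggregate(z1, z2, lambda_e=0.2):
    # agg(Z^{psi_1}, Z^{psi_2}) as the weighted average used in Sect. 7.5.1
    return lambda_e * z1 + (1.0 - lambda_e) * z2

def alignment_loss(u, v, u_neg, v_neg, gamma=1.0):
    # u, v: (p, d) embeddings of the p labeled pairs; u_neg, v_neg: (p, m, d)
    # embeddings of the m negative pairs sampled per labeled pair
    pos = torch.norm(u - v, dim=-1, keepdim=True)        # dis(u, v)
    neg = torch.norm(u_neg - v_neg, dim=-1)              # dis(u', v')
    return torch.clamp(pos + gamma - neg, min=0).sum()   # [.]_+ hinge
```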

Note that we devise two graph encoders since (1) they can capture different views of the structural information and the integrated embeddings could be more expressive and (2) by devising a contrastive objective to enforce the embeddings of each entity in the two different views to agree with each other and meanwhile to be distinguished from the embeddings of other entities, the rich unlabeled data can be leveraged as supervision signals to learn discriminative entity representations and benefit the alignment.

4.2 Graph Encoders

In this chapter, we use two basic models, the graph convolutional network (GCN) [13] and approximate personalized propagation of neural predictions [14], to capture the nearby and distant structural information of entities and generate different views of KG embeddings.Footnote 3

The GCN model has been leveraged to generate entity embeddings by many previous works [30, 33]. It is a simple message-passing algorithm whose inputs include the node feature matrix \(\boldsymbol {X}\) and the adjacency matrix \(\boldsymbol {A}\) of the graph. With two message-passing layers, GCN can be formulated as:

$$\displaystyle \begin{aligned} \boldsymbol{Z}^{\psi_1} = \text{ReLU}\left(\hat{\boldsymbol{A}}\ \text{ReLU}\left( \hat{\boldsymbol{A}} \boldsymbol{X} \boldsymbol{W}_0 \right) \boldsymbol{W}_1\right), \end{aligned} $$
(7.7)

where \(\boldsymbol {Z}^{\psi _1}\) is the output entity embedding matrix. \(\boldsymbol {{\hat {A}}}\) is the symmetrically normalized adjacency matrix with self-loops. ReLU is the activation function, and \(\boldsymbol {W}_0\) and \(\boldsymbol {W}_1\) are the weight matrices.
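The following sketch implements Eq. (7.7) with dense tensors in PyTorch; \(\hat {\boldsymbol {A}}\) is assumed precomputed, and the class layout is illustrative.

```python
# Two-layer GCN encoder of Eq. (7.7).
import torch
import torch.nn as nn

class GCN(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.W1 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, A_hat, X):
        # A_hat: symmetrically normalized adjacency with self-loops; X: features
        H = torch.relu(A_hat @ self.W0(X))        # first message-passing layer
        return torch.relu(A_hat @ self.W1(H))     # output embeddings Z^{psi_1}
```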

While many approaches adopt GCN to learn entity representations, it is pointed out in [17] that increasing the number of GCN layers actually hurts the alignment performance due to the oversmoothing issue. Therefore, we exploit approximate personalized propagation [14] to generate the entity embeddings:

$$\displaystyle \begin{aligned} {} \boldsymbol{Z}^{(i)} = (1 - \alpha) \boldsymbol{\hat{A}} \boldsymbol{Z}^{(i-1)} + \alpha \boldsymbol{X}, \quad i = 1, 2, \ldots, k, \end{aligned} $$
(7.8)

where \(\alpha \) is the teleport probability and k denotes the number of propagation iterations. \(\boldsymbol {Z}^{(0)} = \boldsymbol {X}\), and the initial feature matrix \(\boldsymbol {X}\) acts as both the starting vector and the teleport set. The output entity embedding matrix is \(\boldsymbol {Z}^{\psi _2} = \boldsymbol {Z}^{(k)}\). Note that we remove the neural prediction network \(f_\theta \) in the original model since it is not required in EA, and we denote the resultant model as APP. By removing the weight matrices and nonlinearity of GCN, APP can capture distant structural information while retaining the quality of entity embeddings [14].
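A sketch of the parameter-free APP propagation in Eq. (7.8), again with a dense \(\hat {\boldsymbol {A}}\); \(\alpha = 0.2\) and \(k = 5\) follow the settings in Sect. 7.5.1.

```python
# APP propagation, Eq. (7.8): no weight matrices and no nonlinearity.
import torch

def app_propagate(A_hat, X, alpha=0.2, k=5):
    Z = X                                          # Z^(0) = X
    for _ in range(k):
        Z = (1 - alpha) * (A_hat @ Z) + alpha * X  # personalized propagation step
    return Z                                       # Z^{psi_2} = Z^(k)
```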

4.3 Unsupervised Contrastive Loss

Inspired by the successful application of contrastive learning (CL) to unsupervised graph representation learning [27, 36], in this chapter, we also devise a contrastive objective to distinguish the embeddings of the same entity under the two views from the embeddings of other entities, so as to leverage the supervision signals in the unlabeled data. Given an entity \(x_i\), we denote its embedding generated by the first view as \(\boldsymbol {Z}^{\psi _1}_\omega (i)\) and the embedding generated by the second view as \(\boldsymbol {Z}^{\psi _2}_\omega (i)\), where \(\omega \in \{s,t\}\) refers to the source and target KGs. These two embeddings form a positive sample. We consider the pairs of embeddings that contain \(\boldsymbol {Z}^{\psi _1}_\omega (i)\) (or \(\boldsymbol {Z}^{\psi _2}_\omega (i)\)) and the embedding of another entity as the negative samples. Then, the contrastive objective of the entity in the first view is defined as:

$$\displaystyle \begin{aligned} {} \ell^{\psi_1}_\omega(x_i) = -\log \frac {e^{\theta\big(\boldsymbol{Z}^{\psi_1}_\omega(i), \boldsymbol{Z}^{\psi_2}_\omega(i)\big)}} {e^{\theta\big(\boldsymbol{Z}^{\psi_1}_\omega(i), \boldsymbol{Z}^{\psi_2}_\omega(i)\big)} + \mathcal{N}_{cross} + \mathcal{N}_{intra}}, \end{aligned} $$
(7.9)
$$\displaystyle \begin{aligned} {} \mathcal{N}_{cross} = \sum_{k=1}^{n_\omega} \boldsymbol{1}_{[k \neq i]} e^{\theta\big(\boldsymbol{Z}^{\psi_1}_\omega(i), \boldsymbol{Z}^{\psi_2}_\omega(k)\big)} \end{aligned} $$
(7.10)
$$\displaystyle \begin{aligned} {} \mathcal{N}_{intra} =\sum_{k=1}^{n_\omega} \boldsymbol{1}_{[k \neq i]} e^{\theta\big(\boldsymbol{Z}^{\psi_1}_\omega(i), \boldsymbol{Z}^{\psi_1}_\omega(k)\big)} \end{aligned} $$
(7.11)

where \(\theta (\cdot , \cdot )\) is a score function that calculates the similarity between two embeddings, implemented as \(\theta (\cdot , \cdot ) = f(g(\cdot ), g(\cdot ))\), where \(g(\cdot )\) is a multilayer perceptron (MLP) with nonlinear activation functions for transforming the embeddings and \(f(\cdot , \cdot )\) is a similarity metric capturing the similarity between embeddings. \(\boldsymbol {1}_{[\cdot ]}\) is an indicator function that equals 1 if the condition inside the bracket holds and 0 otherwise. \(n_\omega \) is the number of entities in the KG. In the denominator, the first term is the positive sample, the second term \(\mathcal {N}_{cross}\) corresponds to the cross-view negative samples, and the third term \(\mathcal {N}_{intra}\) corresponds to the intra-view negative samples. Detailed illustrations can be found in Fig. 7.3. The contrastive objective of the second view \(\ell ^{\psi _2}_\omega (x_i)\) is defined similarly. Thus, the overall unsupervised loss is defined as:

$$\displaystyle \begin{aligned} {} \mathcal{L}_u = \frac{1}{2n_s} \sum_{i = 1}^{n_s} \left[\ell^{\psi_1}_s(e_i) + \ell^{\psi_2}_s(e_i)\right] + \frac{1}{2n_t} \sum_{i = 1}^{n_t} \left[\ell^{\psi_1}_t(e_i) + \ell^{\psi_2}_t(e_i)\right], \end{aligned} $$
(7.12)

where \(n_s\) and \(n_t\) denote the number of entities in the source and target KGs, respectively.
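A PyTorch sketch of Eqs. (7.9)–(7.12) follows. We assume the inputs are already projected by the MLP \(g(\cdot )\) and that \(f(\cdot ,\cdot )\) is the cosine similarity; all names are illustrative.

```python
# Cross-view contrastive loss over the two structural views of one KG.
import torch
import torch.nn.functional as F

def view_loss(z1, z2):
    # z1, z2: (n, d) projected embeddings of one KG under the two views
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    cross = torch.exp(z1 @ z2.t())                 # exp(theta(z1_i, z2_k))
    intra = torch.exp(z1 @ z1.t())                 # exp(theta(z1_i, z1_k))
    pos = cross.diag()                             # positive samples (k = i)
    n_cross = cross.sum(dim=1) - pos               # Eq. (7.10): cross-view negatives
    n_intra = intra.sum(dim=1) - intra.diag()      # Eq. (7.11): intra-view negatives
    return -torch.log(pos / (pos + n_cross + n_intra)).mean()   # Eq. (7.9)

def contrastive_loss(z1_s, z2_s, z1_t, z2_t):      # Eq. (7.12): both views, both KGs
    return 0.5 * (view_loss(z1_s, z2_s) + view_loss(z2_s, z1_s)) \
         + 0.5 * (view_loss(z1_t, z2_t) + view_loss(z2_t, z1_t))
```

Per Eq. (7.13) below, this unsupervised term would simply be added to the semi-supervised alignment loss with weight \(\lambda _u\).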

Fig. 7.3

Illustration of the losses. Within each KG, the positive samples and the intra- and cross-view negative samples form the unsupervised contrastive loss; across the two KGs, the aggregated embeddings with positive and negative entity pairs form the semi-supervised alignment loss

Model Training

Finally, we combine the semi-supervised alignment loss and the unsupervised contrastive loss, resulting in the loss function of our proposed model:

$$\displaystyle \begin{aligned} {} \mathcal{L} = \mathcal{L}_s + \lambda_{u}\mathcal{L}_u, \end{aligned} $$
(7.13)

where \(\lambda _{u}>0\) is the hyper-parameter balancing the two objectives.

4.4 Alignment Inference

After obtaining the learned unified embeddings, the alignment results can be inferred. For each source entity, we calculate its distance to all target entities according to a specific distance metric and consider the target entity with the smallest distance as the match. We describe the overall procedure of RAC in Algorithm 1.
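The nearest-neighbor inference step just described can be sketched as follows, assuming the Euclidean metric used in our experiments; names are illustrative.

```python
# Match each source entity to the closest target entity in the unified space.
import numpy as np
from scipy.spatial.distance import cdist

def infer_alignment(src_emb, tgt_emb):
    # src_emb: (n_s, d), tgt_emb: (n_t, d) unified entity embeddings
    dist = cdist(src_emb, tgt_emb, metric="euclidean")   # pairwise distances
    return np.argmin(dist, axis=1)                       # matched target per source
```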

Algorithm 1: Reinforced active entity alignment

5 Experiment

In this section, we empirically evaluate our proposed modelFootnote 4 by answering the following questions:

  • RQ1: Does RAC outperform baseline alignment models on EA with limited supervision? Are the contrastive learning and reinforced AL modules useful?

  • RQ2: Where does the performance gain brought by CL come from? Is the combination of embeddings from different views effective enough? Is it sensitive to hyper-parameters?

  • RQ3: Can the reinforced AL strategy be applied on baseline models? Is it better than using the query strategies separately or blending strategies with equal weights?

5.1 Experimental Settings

Datasets

Following previous works, we adopt three popular EA datasets for evaluation: (1) DBP15K [23], which includes three cross-lingual KG pairs extracted from DBpedia; (2) SRPRS [11], which comprises two cross-lingual and two mono-lingual KG pairs extracted from DBpedia, Wikidata, and YAGO; and (3) DBP-FB [35], which is a mono-lingual KG pair extracted from DBpedia and Freebase. In each KG pair, 70%, 10%, and 20% of the gold standards are used for testing, validation, and training, respectively. Since we study EA with limited supervision, we only keep 500 seed entity pairs as the initial training set. Then, according to the labeling budget, we select the entities from the rest of the training data for annotation and add the labeled entity pairs into the initial training set. The details of datasets can be found in Table 7.1.

Table 7.1 Statistics of the datasets used for evaluation

Implementation Details

Regarding the query strategies, we set the damping parameter in Eq. (7.1) to the default value 0.85. We set b, the number of entities selected in each iteration, to 50. As to the semi-supervised alignment loss in Eq. (7.6), we adopt the Euclidean distance as \(dis(\cdot ,\cdot )\) and select \(\gamma \) among \([1,3,5,10]\). We implement the embedding aggregation function as: \( agg(\boldsymbol {Z}^{\psi _1}_\omega , \boldsymbol {Z}^{\psi _2}_\omega ) = \lambda _e\boldsymbol {Z}^{\psi _1}_\omega + (1-\lambda _e)\boldsymbol {Z}^{\psi _2}_\omega \), where \(\lambda _e\in (0,1)\) is the hyper-parameter that balances the weights of the two views, and we select it among \([0.2,0.4,0.6,0.8]\). As for the graph encoders, we follow previous works [30, 33] by adopting two two-layer GCNs. We follow the original work of APP [14] and directly set the teleport probability \(\alpha \) in Eq. (7.8) to 0.2 and the number of propagation rounds k to 5. The dimensionality of entity embeddings is set to 100. Concerning the unsupervised contrastive loss in Eq. (7.9), we implement \(g(\cdot )\) as a two-layer MLP with the ELU nonlinear activation and adopt the cosine similarity as \(f(\cdot , \cdot )\). We select \(\lambda _u\) in Eq. (7.13) among \([0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35]\) and adopt the Adam optimizer to minimize the training objective. The distance function in the alignment inference process is set to the Euclidean distance.

By tuning the hyper-parameters on the validation set, we set \(\gamma \) to 1, \(\lambda _e\) to 0.2, and \(\lambda _u\) to 0.2. The experiments are conducted on a personal computer running Ubuntu with an Intel Core i7-4790 CPU, an NVIDIA GeForce GTX TITAN X GPU, and 32 GB of memory. We conduct the experiments for five independent runs and report the averaged performance (and the standard deviation) on each dataset.

Evaluation Metrics

Following the convention [5], for each source entity in the test set, we rank the target entities in ascending order of the embedding distance, as in Sect. 7.4.4, and adopt Hits@1 as the evaluation metric, defined as the percentage of source entities whose ground-truth target entity is ranked first. Note that the Hits@1 results are reported as percentages, and the bold figures in the tables represent the best results.
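For concreteness, Hits@1 as defined above can be sketched as below, reusing the pairwise distance matrix from the inference step; `gold` is a hypothetical ground-truth index array.

```python
# Hits@1: fraction of source entities whose true match ranks first, in percent.
import numpy as np

def hits_at_1(dist, gold):
    # dist: (n_s, n_t) distances; gold[i]: index of the true match of source i
    return 100.0 * np.mean(np.argmin(dist, axis=1) == gold)
```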

Competing Methods

The majority of state-of-the-art EA methods focus on designing advanced representation learning models to capture more useful structural information for alignment. In comparison, our proposed framework RAC aims to improve the alignment performance under limited supervision by using reinforced AL and CL, which are agnostic to the choices of these embedding learning models. RAC can be applied on these methods to improve their capability of dealing with scarce supervision signals. Hence, the main goal of this chapter is not to compare with these state-of-the-art models but with methods that improve EA performance under scarce supervision. In this light, we compare RAC with ALEA, a very recent work [2] that harnesses AL for EA. Specifically, we adopt the most performant variants of ALEA, i.e., ALEA-D and ALEA-B, as the baseline models, which leverage the degree and betweenness centrality as the query strategies, respectively.

Notably, to demonstrate the wide applicability of RAC, we employ it on the most performant embedding learning model RREA [17], as well as a state-of-the-art EA model that leverages auxiliary information, CEA [33], in Sect. 7.5.2.

5.2 Main Results (RQ1)

We report the alignment results in Table 7.2 by setting the labeling budget B to 500 and 1500, respectively. It can be observed that RAC significantly outperforms the embedding learning-based baseline models GCN and APP across all datasets (over 40% on DBP-FB), showcasing the effectiveness of our proposal when the supervision signals are limited. Particularly, RAC (\(B=500\)) even achieves comparable results to GCN (\(B=1500\)) on \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) and DBP-FB, which validates that, to reach a certain performance target, adopting RAC significantly reduces manual labeling effort. Besides, it is notable that the improvement is more prominent when there are fewer labeled data. For instance, RAC outperforms APP by over 15% on most datasets when \(B=500\), while the improvement is less than 15% on most datasets when \(B=1500\).

Table 7.2 Hits@1 results under different budgets

Then, by comparing the results of RAC with ALEA-D and ALEA-B, it is obvious that our proposed model is more effective and robust than existing AL-based EA models given scarce labeled data. The superior performance can be attributed to the reinforced AL strategy and the contrastive representation learning, which we will analyze in detail in the following.

Ablation Results

To examine the usefulness of the two key components of RAC, i.e., the unsupervised contrastive loss and the reinforced AL strategy, we conduct an ablation study. As shown in Fig. 7.4, we select the labeling budget B among [250, 500, 750, 1000, 1250, 1500, 1750, 2000] and obtain the corresponding alignment results of RAC-Active, RAC-Rand., RAC w/o CL-Active, and RAC w/o CL-Rand., where -Active denotes using our proposed reinforced AL strategy, -Rand. denotes selecting the entities randomly, and w/o CL denotes removing the unsupervised contrastive loss. Note that, in the interest of space, we only select representative KG pairs from each dataset and report their results, among which \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) and \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) are cross-lingual datasets, while \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) and DBP-FB are mono-lingual ones.

Fig. 7.4

Hits@1 results of the ablation study on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\), \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\), \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\), and DBP-FB. The shaded area denotes the standard deviation

Figure 7.4 shows that the reinforced AL and CL strategies both contribute positively to the overall performance. More concretely, as the labeling budget increases, the effectiveness of the AL strategy becomes less significant, while the unsupervised contrastive loss begins to play a more important role. This can be ascribed to two facts: (1) the quality of the entities selected by AL drops as the budget increases, since the most valuable entities have already been chosen in the early stages, and (2) the effectiveness of CL relies on the quality of the entity representations, which improves when there are more labeled data.

Applying RAC on Embedding Learning-Based EA Model

We apply RAC on RREA, the most performant EA method so far, to see whether RAC would improve its capability of dealing with limited supervision. Specifically, we follow the implementation details in the original paper [17] and contrast the entity embeddings learned by it with the embeddings generated by GCN and then conduct the reinforced AL. The results are provided in Table 7.3, which validate that RAC can be applied on existing EA models to improve their performance under limited supervision, and the improvement is more notable when there are fewer labeled data (\(B=500\) vs. \(B=1500\)).

Table 7.3 Hits@1 results of applying RAC on RREA and CEA

Applying RAC on EA Model that Uses Auxiliary Information

We apply RAC on CEA [33], an EA model leveraging the entity name information to complement the KG structural information for alignment. The results are provided in Table 7.3, which demonstrate that our proposal is also effective on EA models harnessing auxiliary information. We notice that the improvements on \({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\) and \({\mathtt {SRPRS}_{\mathtt {EN-DE}}}\) are not significant. This is because the entity name information in SRPRS can already provide very accurate alignment signals; e.g., solely comparing the entity names can lead to ground-truth performance on \({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\) and \({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\) [35] (and thus we omit their results in Table 7.3). This suggests that it is more beneficial to study EA with scarce supervision when the auxiliary information is not available or of low quality (as is often the case) [35].

5.3 Experiments on Contrastive Learning (RQ2)

In this subsection, we carefully examine the effectiveness of unsupervised CL. We first empirically validate that the main performance enhancement brought by CL comes from the unsupervised contrastive loss itself rather than the combination of embeddings. Then we conduct parameter analysis to show its robustness.

Comparison with the Mere Combination of Embeddings

Since different representation learning models capture different structural information in KGs, one might wonder whether the effectiveness of unsupervised CL mainly comes from the combination of embeddings. To investigate this issue, we remove the effect of AL and report the results of GCN, APP, the combination of these two embeddings (denoted as Comb.), and the combination of these two embeddings with the unsupervised CL loss (denoted as Comb.+CL) in Table 7.4. It shows that, compared with utilizing the representation learning models separately, Comb. only slightly improves the alignment performance in some cases and even brings down the results under a few settings, e.g., \(B=500\). After adding the unsupervised contrastive loss, Comb.+CL achieves superior results to Comb. and APP across all settings. This demonstrates the significance of using CL to mine supervision signals from the abundant unlabeled data. Furthermore, by comparing RAC with RAC w/o CL (which combines the two representations) in Fig. 7.4, we can conclude that the contrastive loss is effective both with and without AL.

Table 7.4 Hits@1 results of variants of CL on \({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\) after removing the influence of AL

Sensitivity Analysis

We conduct a sensitivity analysis on a critical hyper-parameter of the contrastive entity representation learning, \(\lambda _u\), which determines the relative contributions of the semi-supervised alignment loss \(\mathcal {L}_s\) and the unsupervised contrastive loss \(\mathcal {L}_u\) to the overall training objective, to show the stability of the model under perturbation of this hyper-parameter. Since it is intuitive that \(\mathcal {L}_s\) provides more accurate alignment signals than \(\mathcal {L}_u\), we vary \(\lambda _u\) from 0.05 to 0.35 and report the results in Table 7.4. The alignment performance is relatively stable as long as \(\lambda _u\) is not too large. We thus conclude that, overall, our model is robust to perturbations of \(\lambda _u\).

5.4 Experiments on Reinforced AL (RQ3)

In this subsection, we aim to examine the usefulness of the reinforced AL component. We first demonstrate that the reinforced AL strategy can be applied on the baseline models to improve their performance given scarce labeled data. Next, we empirically verify that using our proposed reinforced AL to blend query strategies can lead to better results than using these strategies individually or combining the query strategies with equal weights.

Effectiveness of Reinforced AL on Baseline Models

We apply our proposed reinforced AL on the baseline models and report the results in Fig. 7.5. It shows that the performance of both APP and GCN is enhanced after applying the reinforced AL strategy, and the improvement is more prominent when the budget is smaller.

Fig. 7.5

Hits@1 results of applying the reinforced AL strategy on the baseline models (APP and GCN), on the same four KG pairs as in Fig. 7.4. The shaded area denotes the standard deviation

Comparison with Using Query Strategies Individually

To verify that blending the query strategies with MAB is more effective than using these strategies individually, we replace reinforced AL in RAC with degree centrality, PageRank centrality, and information density, resulting in RAC-Deg, RAC-Pr, and RAC-Emb, respectively, and report the results in Table 7.5. It shows that, overall, the reinforced active entity selection strategy leads to better alignment results than using the query strategies individually.

Table 7.5 Hits@1 results of the variants of reinforced AL

Comparison with Combination with Equal Weights

To demonstrate that reinforced AL can adaptively integrate query strategies and lead to better alignment performance, we compare it with blending query strategies with equal weights (RAC-Avg) and provide the results in Table 7.5. It can be observed that RAC is more effective than RAC-Avg, especially when the budget value is small, showcasing the importance of combining query strategies adaptively.

6 Related Work

Entity Alignment

The task of EA has been intensively studied over the last few years [35]. The majority of the existing EA literature [4, 5, 10, 23, 30] is devoted to learning better entity representations using KG embedding techniques such as TransE and GCN. Specifically, some works propose to capture the neighboring information [11, 25] for learning expressive entity representations, while others propose to model the relations to help guide the alignment of entities [24, 31]. All of these approaches require seed entity pairs to project entity embeddings from different KGs into a unified space, where the entities can be directly compared across KGs. Nevertheless, such labeled data are hard to obtain in real-life settings. To reduce the reliance on labeled data, some efforts are devoted to aligning entities in unsupervised settings [32]. They leverage the auxiliary (side) information of KGs, such as attributes and entity names, to produce pseudo-labeled data, which are then used to learn the unified structural embeddings. Nevertheless, the effectiveness of these approaches is largely constrained by the quality of the side information, which in practice could be unavailable or unevenly distributed [35].

EA with Limited Supervision

The most similar work to ours is [2], which examines the effectiveness of various heuristics from AL in terms of improving EA performance under limited supervision. Our work differs from [2] in that (1) we devise a reinforced AL framework to adaptively blend query strategy heuristics and (2) we exploit the idea of CL to help further improve the EA performance. We also empirically validate the superiority of our proposal over [2].

Reinforced Active Learning

Reinforced AL approaches have also been developed for other related problems, where RL takes the role of traditional query strategy heuristics [7, 8, 9]. To tackle the cross-lingual named entity recognition task, Fang et al. design a deep Q-network to select data for annotation in a streaming setting [8]. In [7, 9], different multi-armed bandit models [6] are used to learn active discriminative network representations for the node classification task. Note that the MAB mechanism implemented in RAC differs from theirs and is developed specifically for the alignment task.

Contrastive Learning on Graphs

Recently, contrastive learning (CL) has emerged as a successful method for unsupervised graph representation learning [27, 36]. CL is an active field of self-supervised learning, which generates data representations by learning to encode the similarities or dissimilarities among a set of unlabeled examples [12]. The underlying intuition is that the rich unlabeled data themselves can be used as supervision signals to help guide model training. In this chapter, we also exploit this idea and leverage the abundant unlabeled entities to facilitate the alignment.

7 Conclusion

State-of-the-art EA approaches are overly dependent on labeled data, which are difficult to obtain in practical settings. In response, we propose RAC, a framework combining reinforced active learning and contrastive learning, to tackle EA with scarce supervision. In each labeling iteration, RAC selects the valuable entities to be labeled according to the multi-armed bandit mechanism that adaptively blends different query strategies. Then, given the limited labeled data, it mines useful supervision signals from the rich unlabeled data to generate more accurate entity representations (and, in turn, alignment results). We evaluate RAC on popular EA benchmarks, and the empirical results validate that RAC is effective at coping with limited labeled data. Besides, we also demonstrate that RAC is a general framework for tackling EA with scarce supervision and can be employed on top of existing EA solutions.