1 Overview

To better understand current advanced representation learning methods, we propose a general framework to describe these methods, which includes six modules, i.e., pre-processing, messaging, attention, aggregation, post-processing, and loss function. In pre-processing, the initial entity and relation representations are generated. Then, KG representations are obtained via a representation learning network, which usually consists of three steps, i.e., messaging, attention, and aggregation. Among them, messaging aims to extract the features of the neighboring elements, attention aims to estimate the weight of each neighbor, and aggregation integrates the neighboring information with attention weights. Through the post-processing operation, the final representations are obtained. The whole model is then optimized by the loss function in the training stage.

More specifically, we summarize ten representative methods in terms of these modules in Table 3.1.

  • In the pre-processing module, there are two main ways to obtain the initial representations: some methods use a pre-trained model to embed names or descriptions into initial representations, while others generate the initial structural representations through GNN-based networks.

    Table 3.1 Overview and comparison of advanced representation learning
  • In the messaging module, linear transformation is the most frequently used strategy, which makes use of a learnable matrix to transform neighboring features. Other methods include extracting neighboring features by concatenating multi-head messages, directly utilizing neighboring representations, etc.

  • In the attention module, the main focus is the computation of similarity. Most of the methods concatenate the representations and multiply the result by a learnable attention vector to calculate attention weights, while some use the inner product of entity representations to compute similarity.

  • In the aggregation module, almost all methods aggregate 1-hop neighboring entity or relation information, while a few works propose to combine multi-hop neighboring information. Some use a set of randomly chosen entities, i.e., anchor set, to obtain position-aware representations.

  • In the post-processing module, most of the methods enhance final representations by concatenating the outputs of all layers of GNN. Besides, some methods propose to combine the features adaptively via strategies such as the gate mechanism [10].

  • In terms of the loss function, the majority of methods use a margin-based loss during training. Some additionally add the TransE [1] loss, while others improve the margin loss with LogSumExp and a normalization operation, or use the Sinkhorn [3] algorithm to calculate the loss.

2 Models

We use Eq. (3.1) to characterize the core procedure of representation learning:

$$\displaystyle \begin{aligned} \boldsymbol{e}_{i}^l = \mathbf{Aggregation}_{\forall j\in \mathcal{N}(i)} (\mathbf{Attention}(i,j)\cdot \mathbf{Messaging}(i,j))\,, {} \end{aligned} $$
(3.1)

where \(\mathbf {Messaging}\) aims to extract the features of neighboring elements, \(\mathbf {Attention}\) aims to estimate the weight of each neighbor, and \(\mathbf {Aggregation}\) integrates the neighborhood information with attention weights.
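To make Eq. (3.1) concrete, the following is a minimal NumPy sketch of one layer update. It is an illustration under stated assumptions, not the implementation of any specific method: the linear messaging matrix `W`, the concatenation-based attention vector `a`, the ReLU in the aggregation, and the `neighbors` dictionary are all hypothetical choices.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def layer_update(h, neighbors, W, a):
    """One pass of Eq. (3.1) for every entity.

    h: (n, d) entity representations from the previous layer.
    neighbors: dict mapping entity index i to the list of indices in N(i).
    W: (d, d) messaging matrix (assumed linear Messaging).
    a: (2d,) attention vector (assumed concatenation-based Attention).
    """
    out = np.zeros_like(h)
    for i, nbrs in neighbors.items():
        msgs = h[nbrs] @ W.T                       # Messaging(i, j) = W h_j
        scores = np.array([a @ np.concatenate([h[i], h[j]]) for j in nbrs])
        alpha = softmax(scores)                    # Attention(i, j)
        out[i] = np.maximum(alpha @ msgs, 0.0)     # Aggregation: weighted sum + ReLU
    return out

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
W = rng.normal(size=(3, 3))
a = rng.normal(size=6)
h_next = layer_update(h, neighbors, W, a)
```

The methods reviewed in this section differ mainly in how they instantiate the three functions inside the loop.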

Next, we briefly introduce recent advances in representation learning for EA in terms of the modules mentioned in Table 3.1.

2.1 AliNet

It aims to aggregate multi-hop structural information for learning entity representations [12].

Aggregation

This work devises a multi-hop aggregation strategy. For 2-hop aggregation, the aggregation is denoted as:

$$\displaystyle \begin{aligned} \boldsymbol{h}_{i,2}^l = \sigma\left(\sum_{j\in\mathcal{N}_2(i)\cup\{i\}} \mathbf{Attention}(i,j)\cdot \mathbf{Messaging}(i,j)\right)\,, {} \end{aligned} $$
(3.2)

where \(\mathcal {N}_2(i)\) denotes the set of 2-hop neighbors of entity i.

Then, it aggregates the multi-hop aggregation results to generate the entity representation. Aggregating 1-hop and 2-hop information is denoted as:

$$\displaystyle \begin{aligned} \boldsymbol{h}_i = g\left(\boldsymbol{h}_{i,2}^l\right)\cdot\boldsymbol{h}_{i,1}^l + \left(1-g(\boldsymbol{h}_{i,2}^l)\right)\cdot\boldsymbol{h}_{i,2}^l\,, {} \end{aligned} $$
(3.3)

where \(g(\boldsymbol {h}_{i,2}^l) = \sigma (\boldsymbol {M}\boldsymbol {h}_{i,2}^l + \boldsymbol {b})\), which is the gate to control the influences of different hops. \(\boldsymbol {M}\) and \(\boldsymbol {b}\) are learnable parameters.
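The gate in Eq. (3.3) can be sketched in a few lines of NumPy. This is only an illustration: it assumes the gate is applied element-wise, that \(\sigma\) is the logistic sigmoid, and that the 1-hop and 2-hop representations are stored as row-wise matrices `h1` and `h2` (hypothetical names).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_combine(h1, h2, M, b):
    """Eq. (3.3): mix 1-hop (h1) and 2-hop (h2) representations, each of shape (n, d)."""
    g = sigmoid(h2 @ M.T + b)          # gate g(h_{i,2}), every entry in (0, 1)
    return g * h1 + (1.0 - g) * h2     # element-wise convex combination

rng = np.random.default_rng(0)
n, d = 5, 4
h1, h2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
M, b = rng.normal(size=(d, d)), np.zeros(d)
h = gated_combine(h1, h2, M, b)
```

Because the gate lies strictly between 0 and 1, every coordinate of the output lies between the corresponding 1-hop and 2-hop coordinates.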

Attention

Regarding the attention weight, it assumes that not all distant entities contribute positively to the characterization of the target entity representation, and the softmax function is used to produce the attention weights:

$$\displaystyle \begin{aligned} \mathbf{Attention}(i,j) = \alpha_{ij}^l = softmax \left(c_{ij}^l\right) = \frac{\exp\left(c_{ij}^l\right)}{\sum_{n\in\mathcal{N}_2(i)\cup \{i\}}\exp\left(c_{in}^l\right)}\,, {} \end{aligned} $$
(3.4)

where \(c_{ij}^l = LeakyReLU((\boldsymbol {M}_1^l\boldsymbol {h}_i^l)^T \boldsymbol {M}_2^l\boldsymbol {h}_j^l)\), and \(\boldsymbol {M}_1, \boldsymbol {M}_2\) are two learnable matrices.
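As a small sketch of Eq. (3.4), the attention weights of one target entity over its candidate set can be computed as follows. The LeakyReLU slope of 0.2 and the argument names are assumptions made for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # The slope value is an assumption; the text does not specify it.
    return np.where(x > 0, x, slope * x)

def alinet_attention(h, i, candidates, M1, M2):
    """Eq. (3.4): softmax over c_ij = LeakyReLU((M1 h_i)^T (M2 h_j)).

    candidates: indices in N_2(i) ∪ {i}.
    """
    q = M1 @ h[i]                  # transformed target entity
    k = h[candidates] @ M2.T       # transformed candidates, shape (m, d)
    c = leaky_relu(k @ q)          # unnormalized scores c_ij
    e = np.exp(c - c.max())        # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))
M1, M2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
alpha = alinet_attention(h, 0, [0, 2, 3], M1, M2)
```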

Messaging

The extraction of the features of neighboring entities is implemented as a simple linear transformation: \(\mathbf {Messaging}(i,j) = \boldsymbol {W}_q^l \boldsymbol {h}_j^{l-1}\), where \(\boldsymbol {W}_q\) denotes the weight matrix for the q-hop aggregation.

Post-processing

The representations of all layers are concatenated to produce the final entity representation:

$$\displaystyle \begin{aligned} \boldsymbol{h}_i = \oplus_{l=1}^L norm\left(\boldsymbol{h}_i^l\right)\,. {} \end{aligned} $$
(3.5)

Loss Function

The loss function is formulated as:

$$\displaystyle \begin{aligned} \mathcal{L} = \sum_{(i,j)\in\mathcal{A}^+} ||\boldsymbol{h}_i-\boldsymbol{h}_j|| + \sum_{(i^\prime,j^\prime)\in\mathcal{A}^-} \alpha_1[\gamma - ||\boldsymbol{h}_{i^\prime}-\boldsymbol{h}_{j^\prime}||]_+\,, {} \end{aligned} $$
(3.6)

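where \(\mathcal {A}^+\) is the set of pre-aligned (positive) entity pairs, \(\mathcal {A}^-\) is the set of negative samples obtained through random sampling, \(\gamma \) is the margin, and \(\alpha _1\) is a balancing hyper-parameter. \(||\cdot ||\) denotes the L2 norm, and \([\cdot ]_+ = \max (0,\cdot )\).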
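A direct NumPy transcription of Eq. (3.6) may help to see how the two terms interact. The sketch assumes, for simplicity, that the entities of both KGs are stored in one embedding matrix `h` and that pairs are given as index arrays; the default values of `gamma` and `alpha1` are placeholders, not recommended settings.

```python
import numpy as np

def alinet_loss(h, pos_pairs, neg_pairs, gamma=1.5, alpha1=0.1):
    """Eq. (3.6): pull aligned pairs together, push negatives beyond the margin gamma."""
    pos, neg = np.asarray(pos_pairs), np.asarray(neg_pairs)
    d_pos = np.linalg.norm(h[pos[:, 0]] - h[pos[:, 1]], axis=1)   # L2 distance of positives
    d_neg = np.linalg.norm(h[neg[:, 0]] - h[neg[:, 1]], axis=1)   # L2 distance of negatives
    return d_pos.sum() + alpha1 * np.maximum(gamma - d_neg, 0.0).sum()

h = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
loss = alinet_loss(h, pos_pairs=[(0, 1)], neg_pairs=[(0, 2)])
```

In this toy input the positive pair is at distance 5 and the negative pair at distance 1, so the hinge term contributes 0.1 × (1.5 − 1) = 0.05.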

2.2 MRAEA

It proposes to utilize the relation information to facilitate the entity representation learning process [8].

Pre-processing

Specifically, it first creates an inverse relation for each relation, resulting in the extended relation set \(\mathcal {R}\). Then, it generates the initial features for entities by averaging and concatenating the embeddings of neighboring entities and relations:

$$\displaystyle \begin{aligned} \boldsymbol{h}_{e_i}^{in} = \left[\frac{1}{|\mathcal{N}_i^e| + 1} \sum_{e_j\in\mathcal{N}_i^e\cup\{e_i\}} \boldsymbol{h}_{e_j} || \frac{1}{|\mathcal{N}_i^r|} \sum_{r_k\in\mathcal{N}_i^r} \boldsymbol{h}_{r_k}\right]\,, {} \end{aligned} $$
(3.7)

where the embeddings of entities and relations are randomly initialized.
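The averaging and concatenation in Eq. (3.7) can be sketched as below. The neighbor dictionaries `ent_nbrs` and `rel_nbrs` are hypothetical data structures, and the sketch assumes that every entity has at least one neighboring relation and that an entity is not listed among its own neighbors.

```python
import numpy as np

def initial_features(ent_emb, rel_emb, ent_nbrs, rel_nbrs):
    """Eq. (3.7): [mean over N_i^e ∪ {e_i} of entity embeddings || mean over N_i^r of relation embeddings]."""
    feats = []
    for i in range(len(ent_emb)):
        ents = list(ent_nbrs[i]) + [i]                      # neighbors plus the entity itself
        e_part = ent_emb[ents].mean(axis=0)                 # divides by |N_i^e| + 1
        r_part = rel_emb[list(rel_nbrs[i])].mean(axis=0)    # divides by |N_i^r|
        feats.append(np.concatenate([e_part, r_part]))
    return np.stack(feats)

ent_emb = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
rel_emb = np.array([[1.0], [3.0]])
ent_nbrs = {0: [1, 2], 1: [0], 2: [0]}
rel_nbrs = {0: [0, 1], 1: [0], 2: [1]}
h_in = initial_features(ent_emb, rel_emb, ent_nbrs, rel_nbrs)
```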

Aggregation

The aggregation is a simple combination of the extracted features and the weights:

$$\displaystyle \begin{aligned} \boldsymbol{h}_{e_i}^{out} = \sigma\left(\sum_{e_j\in\mathcal{N}_{i}^{e}} \mathbf{Attention}(i,j)\cdot \mathbf{Messaging}(i,j)\right)\,, {} \end{aligned} $$
(3.8)

where \(\sigma \) is implemented as ReLU.

Attention

It augments the common self-attention mechanism to include relation features:

$$\displaystyle \begin{aligned} \mathbf{Attention}(i,j) = softmax\left(LeakyReLU\left(\boldsymbol{v}^T \left[\boldsymbol{h}_{e_i}^{in} || \boldsymbol{h}_{e_j}^{in} || \frac{1}{|\mathcal{M}_{i,j}|} \sum_{r_k\in\mathcal{M}_{i,j}} \boldsymbol{h}_{r_k}\right]\right)\right)\,, {} \end{aligned} $$
(3.9)

where \(\mathcal {M}_{i,j}\) represents the set of linked relations that connect \(e_i\) to \(e_j\). Notably, it also adopts the multi-head attention mechanism to obtain the representation.

Messaging

The features of neighboring entities are the corresponding features from the pre-processing stage.

Post-processing

Finally, the outputs from different layers are concatenated to produce the final entity representations:

$$\displaystyle \begin{aligned} \hat{\boldsymbol{h}}_{e_i}^{out} = \left[\boldsymbol{h}_{e_i}^{out(0)}||\ldots||\boldsymbol{h}_{e_i}^{out(l)} \right]\,. {} \end{aligned} $$
(3.10)

Loss Function

The loss function is formulated as:

$$\displaystyle \begin{aligned} \mathcal{L} &= \sum_{(e_i,e_j)\in\mathcal{P}} ReLU(dis(e_i, e_j) - dis(e_i^\prime, e_j) + \lambda) + ReLU(dis(e_i, e_j) \\ & \quad - dis(e_i, e_j^\prime) + \lambda)\,, {} \end{aligned} $$
(3.11)

where \(dis(\cdot , \cdot )\) is the Manhattan distance between two entity representations, and \(e_i^\prime \) and \(e_j^\prime \) represent the negative instances.
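The two hinge terms of Eq. (3.11) can be written compactly with NumPy. The sketch below assumes one negative per side for each aligned pair, supplied as index arrays `neg_left` (replacing \(e_i\)) and `neg_right` (replacing \(e_j\)); these names and the sampling strategy are illustrative only.

```python
import numpy as np

def manhattan(x, y):
    return np.abs(x - y).sum(axis=-1)

def mraea_loss(h1, h2, pairs, neg_left, neg_right, lam=1.0):
    """Eq. (3.11) with one corrupted source and one corrupted target entity per pair."""
    i, j = pairs[:, 0], pairs[:, 1]
    d_pos = manhattan(h1[i], h2[j])
    left = np.maximum(d_pos - manhattan(h1[neg_left], h2[j]) + lam, 0.0)    # e_i replaced by e_i'
    right = np.maximum(d_pos - manhattan(h1[i], h2[neg_right]) + lam, 0.0)  # e_j replaced by e_j'
    return (left + right).sum()

h1 = np.array([[0.0, 0.0], [5.0, 5.0]])
h2 = np.array([[0.0, 1.0], [9.0, 9.0]])
pairs = np.array([[0, 0]])
loss_small_margin = mraea_loss(h1, h2, pairs, np.array([1]), np.array([1]), lam=1.0)
loss_large_margin = mraea_loss(h1, h2, pairs, np.array([1]), np.array([1]), lam=10.0)
```

With a margin of 1 both negatives are already far enough away and the loss is zero; with a margin of 10 only the first hinge term is active.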

2.3 RREA

It proposes to use relational reflection transformation to aggregate features for learning entity representations [9].

Aggregation

The entity representations are denoted as:

$$\displaystyle \begin{aligned} \boldsymbol{h}_{e_i}^{l+1} = ReLU\left(\sum_{e_j\in\mathcal{N}_{e_i}^e} \sum_{r_k\in\mathcal{R}_{ij}} \mathbf{Attention}(i,j,k)\cdot \mathbf{Messaging}(i,j,k)\right)\,, {} \end{aligned} $$
(3.12)

where \(\mathcal {N}_{e_i}^e\) and \(\mathcal {R}_{ij}\) represent the neighboring entity and relation sets, respectively.

Attention

\(\mathbf {Attention}(i,j,k)\) denotes the weight coefficient computed by:

$$\displaystyle \begin{aligned} \mathbf{Attention}(i,j,k)= \frac{\exp\left(\beta_{ijk}^l\right)}{\sum_{e_j\in\mathcal{N}_{e_i}^e}\sum_{r_k\in\mathcal{R}_{ij}}\exp\left(\beta_{ijk}^l\right)}\,, {} \end{aligned} $$
(3.13)

where \(\beta _{ijk}^l = \boldsymbol {v}^T [\boldsymbol {h}_{e_i}^{l} || \boldsymbol {M}_{r_k}\boldsymbol {h}_{e_j}^{l}|| \boldsymbol {h}_{r_k}]\). \(\boldsymbol {v}\) is a trainable vector. \(\boldsymbol {M}_{r_k}\) is the relational reflection matrix of \(r_k\). We leave out the details of the relational reflection matrix in the interest of space; they can be found in the original paper.

Messaging

The features of neighboring entities are obtained by transforming their embeddings with the relational reflection matrix:

$$\displaystyle \begin{aligned} \mathbf{Messaging}(i,j,k) = \boldsymbol{M}_{r_k}\boldsymbol{h}_{e_j}^{l}\,, {} \end{aligned} $$
(3.14)

where \(\boldsymbol {M}_{r_k}\) is the relational reflection matrix of \(r_k\).
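This section does not spell out how \(\boldsymbol {M}_{r_k}\) is built. Purely as an illustration, the sketch below assumes the standard Householder form of a reflection, \(\boldsymbol {I} - 2\boldsymbol {r}\boldsymbol {r}^T\) with a unit-norm relation vector \(\boldsymbol {r}\); the original paper should be consulted for the exact definition.

```python
import numpy as np

def reflection_matrix(h_r):
    """Householder reflection across the hyperplane orthogonal to h_r (assumed form of M_r)."""
    r = h_r / np.linalg.norm(h_r)            # normalize so the matrix is a true reflection
    return np.eye(len(r)) - 2.0 * np.outer(r, r)

def messaging(h_j, h_r):
    """Eq. (3.14): Messaging(i, j, k) = M_{r_k} h_{e_j}."""
    return reflection_matrix(h_r) @ h_j

rng = np.random.default_rng(0)
h_j, h_r = rng.normal(size=4), rng.normal(size=4)
M = reflection_matrix(h_r)
msg = messaging(h_j, h_r)
```

Under this assumption the matrix is symmetric and orthogonal, so the transformation keeps the norm of the neighbor embedding unchanged.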

Post-processing

Then, the outputs from different layers are concatenated to produce the output vector:

$$\displaystyle \begin{aligned} \boldsymbol{h}_{e_i}^{out} = \left[\boldsymbol{h}_{e_i}^{0}||\ldots||\boldsymbol{h}_{e_i}^{l} \right]\,. {} \end{aligned} $$
(3.15)

Finally, it concatenates the entity representation with its neighboring relation embeddings to obtain the final entity representation:

$$\displaystyle \begin{aligned} \boldsymbol{h}_{e_i}^{Mul} = \left[\boldsymbol{h}_{e_i}^{out} || \frac{1}{|\mathcal{N}_{e_i}^r|} \sum_{r_j\in\mathcal{N}_{e_i}^r} \boldsymbol{h}_{r_j}\right]\,. {} \end{aligned} $$
(3.16)

Loss Function

The loss function is formulated as:

$$\displaystyle \begin{aligned} \mathcal{L} = \sum_{(e_i,e_j)\in\mathcal{P}} max(dis(e_i, e_j) - dis\left(e_i^\prime, e_j^\prime\right) + \lambda, 0)\,, {} \end{aligned} $$
(3.17)

where \(dis(\cdot , \cdot )\) is the Manhattan distance between two entity representations, and \(e_i^\prime \) and \(e_j^\prime \) represent the negative instances generated by nearest neighbor sampling.

2.4 RPR-RHGT

This work introduces a meta path-based similarity framework for EA [2]. It considers the paths that frequently appear in the neighborhoods of pre-aligned entities to be reliable. In the interest of space, we omit the generation of these reliable paths; the details can be found in Sect. 3.3 of the original paper.

Pre-processing

Specifically, it first generates relation embeddings by aggregating the representations of neighboring entities:

$$\displaystyle \begin{aligned} R^l(r) = \sigma\left[\frac{1}{|\mathcal{H}_r|}\sum_{e_i\in\mathcal{H}_r}\boldsymbol{b}_h\boldsymbol{e}_i^{l-1} ||\frac{1}{|\mathcal{T}_r|}\sum_{e_j\in\mathcal{T}_r}\boldsymbol{b}_t\boldsymbol{e}_j^{l-1} \right]\,, {} \end{aligned} $$
(3.18)

where \(\mathcal {H}_r\) and \(\mathcal {T}_r\) denote the set of head entities and tail entities that are connected with relation r.

Aggregation

The entity representation is obtained by averaging the messages from neighborhood entities with the attention weights:

$$\displaystyle \begin{aligned} \tilde{\boldsymbol{e}}_h^l = \oplus_{\forall (r,t)\in RN(h)} HAttention(h,r,t)\cdot HMessage(h,r,t), {} \end{aligned} $$
(3.19)

where \(\oplus \) denotes the overlay operation.

Attention

The multi-head attention is computed as:

$$\displaystyle \begin{aligned} \begin{aligned} HAttention(h,r,t) &= {||}_{i\in[1,h_n]} {softmax}_{\forall(r,t)\in RN(h)} ({HATT}_{{head}^i}(h,r,t)),\\ {HATT}_{{head}^i}(h,r,t) &= \boldsymbol{a}^T ([K^i(h)||Q^i(t)]R^l(r))/\sqrt{d/h_n}, \end{aligned} {} \end{aligned} $$
(3.20)

where \(K^i(h) = K\_Linear^i(\boldsymbol {e}_h^{l-1})\), \(Q^i(t) = Q\_Linear^i(\boldsymbol {e}_t^{l-1})\), \(RN(h)\) represents the neighborhood entities of h, \(\boldsymbol {a}\) denotes the learnable attention vector, \(h_n\) is the number of attention heads, and \(d/h_n\) is the dimension per head.

Messaging

The multi-head message passing is computed as:

$$\displaystyle \begin{aligned} \begin{aligned} HMessage(h,r,t) &= {||}_{i\in[1,h_n]} ({HMSG}_{{head}^i}(h,r,t)),\\ {HMSG}_{{head}^i}(h,r,t) &= [V\_Linear^i(\boldsymbol{e}_t^{l-1})||R^l(r)], \end{aligned} {} \end{aligned} $$
(3.21)

where \(V\_Linear^i\) is a linear projection of the tail entity, which is then concatenated with the relation representation.

Post-processing

This work also combines the structural representations with name features using the residual connection:

$$\displaystyle \begin{aligned} \boldsymbol{e}_h^l = \omega_\beta A\_Linear\left(\tilde{\boldsymbol{e}}_h^l\right) + (1-\omega_\beta)N\_Linear\left({\boldsymbol{e}}_h^{l-1}\right)\,, {} \end{aligned} $$
(3.22)

where \(A\_Linear\) and \(N\_Linear\) are linear projections. Correspondingly, based on the relation structure \(\mathcal {T}_{rel}\) and path structure \(\mathcal {T}_{path}\), it generates the relation-based embeddings \(\boldsymbol {E}_{rel}\) and the path-based embeddings \(\boldsymbol {E}_{path}\).

Loss Function

Finally, the margin-based ranking loss function is used to formulate the overall loss function:

$$\displaystyle \begin{aligned} \begin{aligned} \mathcal{L} =& \sum_{(p,q)\in\mathcal{L},(p^\prime,q^\prime)\in\mathcal{L}_{rel}^\prime} [d_{rel}(p, q) - d_{rel}(p^\prime, q^\prime) + \lambda_1]_+ \\+ &\theta\left(\sum_{(p,q)\in\mathcal{L},(p^\prime,q^\prime)\in\mathcal{L}_{path}^\prime} [d_{path}(p, q) - d_{path}(p^\prime, q^\prime) + \lambda_2]_+\right)\,, \end{aligned} {} \end{aligned} $$
(3.23)

where the distance is measured by the Manhattan distance and \(\theta \) is the hyper-parameter that controls the weights of relation loss and path loss.

2.5 RAGA

It proposes to adopt the self-attention mechanism to spread entity information to the relations and then aggregate relation information back to entities, which can further enhance the quality of entity representations [17].

Pre-processing

In the pre-processing module, pre-trained vectors are used as input and then forwarded to a two-layer GCN with a highway network to encode structure information. We leave out the implementation details in the interest of space; they can be found in Sect. 4.2 of the original paper.

Aggregation

In RAGA, there are three main GNNs. Denote the initial representation of entity i as \(\boldsymbol h_i\), which is generated in the pre-processing module. The first GNN obtains the relation representation by aggregating all of its connected head entities and tail entities. For relation k, the aggregation of its connected head entities is computed as follows:

$$\displaystyle \begin{aligned} \boldsymbol{r}_k^h=\sigma\left(\sum_{e_i\in\mathcal H_{r_k}}\sum_{e_j\in\mathcal T_{e_ir_k}}\mathbf{Attention}_1(i,j,k)\cdot\mathbf{Messaging}_1(i)\right)\,, {} \end{aligned} $$
(3.24)

where \(\sigma \) is the ReLU activation function, \(\mathcal H_{r_k}\) is the set of head entities for relation \(r_k\), and \(\mathcal T_{e_ir_k}\) is the set of tail entities for head entity \(e_i\) and relation \(r_k\). The aggregation of all tail entities \(\boldsymbol r_k^t\) can be computed through a similar process, and the relation representation is obtained as \(\boldsymbol r_k=\boldsymbol r_k^h+\boldsymbol r_k^t\).

Then, the second GNN generates relation-aware entity representations by aggregating relation information back to entities. For entity i, the aggregation of all its outward relation embeddings is computed as follows:

$$\displaystyle \begin{aligned} \boldsymbol{h}_i^h=\sigma\left(\sum_{e_j\in\mathcal{T}_{e_i}}\sum_{r_k\in\mathcal{R}_{e_ie_j}}\mathbf{Attention}_2(i,k)\cdot\boldsymbol r_k\right)\,, {} \end{aligned} $$
(3.25)

where \(\mathcal {T}_{e_i}\) is the set of tail entities for head entity \(e_i\) and \(\mathcal {R}_{e_ie_j}\) is the set of relations between head entity \(e_i\) and tail entity \(e_j\). The aggregation of inward relation embeddings \(\boldsymbol h_i^t\) is computed through a similar process. Then the relation-aware entity representations \(\boldsymbol {h}_i^{rel}\) can be obtained by concatenation: \(\boldsymbol h_i^{rel}=\left [\boldsymbol h_i\Vert \boldsymbol {h}_i^h\Vert \boldsymbol {h}_i^t\right ]\).

Finally, the third GNN takes the relation-aware entity representations as input and aggregates them to produce the final entity representations:

$$\displaystyle \begin{aligned} \boldsymbol h_i^{out}=\sigma\left(\sum_{j\in\mathcal N_i}\mathbf{Attention}_3(i,j)\cdot\boldsymbol h_j^{rel}\right)\,. {} \end{aligned} $$
(3.26)

Attention

Corresponding to the three GNNs, there are three attention computations in RAGA. In the first GNN, to compute the attention weights, the representations of the head entity and the tail entity are linearly transformed, respectively, and then concatenated:

$$\displaystyle \begin{aligned} \mathbf{Attention}_1(i,j,k)\,{=}\,\frac{\exp\left(\mathrm{LeakyReLU}\left(\boldsymbol{a}_1^T\left[\boldsymbol{W}^h\boldsymbol{h}_i\Vert\boldsymbol{W}^t\boldsymbol{h}_j\right]\right)\right)}{\sum_{e_{i'}\in\mathcal H_{r_k}}\sum_{e_{j'}\in\mathcal T_{e_{i'}r_k}}\exp\left(\mathrm{LeakyReLU}\left(\boldsymbol{a}_1^T\left[\boldsymbol{W}^h\boldsymbol h_{i'}\Vert\boldsymbol{W}^t\boldsymbol{h}_{j'}\right]\right)\right)}\,, {} \end{aligned} $$
(3.27)

where \(\boldsymbol a_1\) is the learnable attention vector.

In the second GNN, representations of entity and its neighboring relations are directly concatenated:

$$\displaystyle \begin{aligned} \mathbf{Attention}_2(i,k)=\frac{\exp\left(\mathrm{LeakyReLU}(\boldsymbol a_2^T\left[\boldsymbol h_i\Vert\boldsymbol r_k\right])\right)}{\sum_{e_j\in\mathcal T_{e_i}}\sum_{r_{k'}\in\mathcal R_{e_ie_j}}\exp\left(\mathrm{LeakyReLU}(\boldsymbol a_2^T\left[\boldsymbol h_i\Vert\boldsymbol r_{k'}\right])\right)}\,, {} \end{aligned} $$
(3.28)

where \(\boldsymbol a_2\) is the learnable attention vector.

The computation of attention in the third GNN, i.e., \(\mathbf {Attention}_3\), is similar to Eq. (3.28), except that the entity representation is concatenated with that of its neighboring entity instead of a relation.
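To illustrate how the first GNN combines Eqs. (3.24), (3.27), and (3.29), the sketch below computes the head-side representation of a single relation from its (head, tail) pairs. The array layout, the LeakyReLU slope of 0.2, and the function names are assumptions for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # The slope value is an assumption; the text does not specify it.
    return np.where(x > 0, x, slope * x)

def raga_attention_1(h, pairs, W_h, W_t, a1):
    """Eq. (3.27): softmax over all (head, tail) pairs linked by one relation r_k."""
    heads, tails = pairs[:, 0], pairs[:, 1]
    feats = np.concatenate([h[heads] @ W_h.T, h[tails] @ W_t.T], axis=1)  # [W^h h_i || W^t h_j]
    scores = leaky_relu(feats @ a1)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def relation_head_repr(h, pairs, W_h, W_t, a1):
    """Eq. (3.24): attention-weighted sum of head messages W^h h_i, followed by ReLU."""
    alpha = raga_attention_1(h, pairs, W_h, W_t, a1)
    msgs = h[pairs[:, 0]] @ W_h.T            # Messaging_1(i) = W^h h_i, Eq. (3.29)
    return np.maximum(alpha @ msgs, 0.0)

rng = np.random.default_rng(0)
d, d_out = 4, 3
h = rng.normal(size=(5, d))
pairs = np.array([[0, 1], [0, 2], [3, 1]])   # (head, tail) index pairs of relation r_k
W_h, W_t = rng.normal(size=(d_out, d)), rng.normal(size=(d_out, d))
a1 = rng.normal(size=2 * d_out)
alpha = raga_attention_1(h, pairs, W_h, W_t, a1)
r_k_h = relation_head_repr(h, pairs, W_h, W_t, a1)
```

The tail-side representation follows the same pattern with \(\boldsymbol W^t\) applied to the tail entities as the messages.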

Messaging

Only the first GNN utilizes linear transformation as the messaging approach:

$$\displaystyle \begin{aligned} \mathbf{Messaging}_1(i)=\boldsymbol W\boldsymbol h_i\,, {} \end{aligned} $$
(3.29)

where \(\boldsymbol W\) can refer to \(\boldsymbol W^h\) or \(\boldsymbol W^t\) depending on the aggregation of head or tail entities.

Post-processing

The final enhanced entity representation is the concatenation of outputs of the second and the third GNNs:

$$\displaystyle \begin{aligned} \boldsymbol h_i^{final}=\left[\boldsymbol h_i^{rel}\Vert\boldsymbol h_i^{out}\right]\,. {} \end{aligned} $$
(3.30)

Loss Function

The loss function is formulated as:

$$\displaystyle \begin{aligned} \mathcal L=\sum_{(e_i,e_j)\in T}\sum_{(e_i^{\prime},e_j^{\prime})\in T_{e_i,e_j}^{\prime}}\max(dis(e_i,e_j)-dis(e_i^{\prime},e_j^{\prime})+\lambda,0)\,, {} \end{aligned} $$
(3.31)

where \(T_{e_i,e_j}^{\prime }\) is the set of negative samples for \(e_i\) and \(e_j\), \(\lambda \) is the margin, and \(dis(\cdot ,\cdot )\) is defined as the Manhattan distance.

2.6 Dual-AMN

Dual-AMN proposes to utilize both intra-graph and cross-graph information for learning entity representations [7]. It constructs a set of virtual nodes, i.e., proxy vectors, through which the messaging and aggregation between graphs are conducted.

Aggregation

Dual-AMN uses two GNNs to learn intra-graph and cross-graph information, respectively. Firstly, it utilizes the relational reflection operation of RREA to obtain intra-graph embeddings:

$$\displaystyle \begin{aligned} \boldsymbol h_{e_i}^{l}=\sigma\left(\sum_{e_j\in\mathcal{N}_{e_i}}\sum_{r_k\in\mathcal R_{ij}}\mathbf{Attention}_1(i,j,k)\cdot\mathbf{Messaging}_1(j,k)\right)\,, {} \end{aligned} $$
(3.32)

where \(\sigma \) is the tanh activation function and \(\boldsymbol h_{e_i}^l\) represents the output of the l-th layer. Then the multi-hop embeddings are obtained by concatenation:

$$\displaystyle \begin{aligned} \boldsymbol h_{e_i}^{multi}=\left[\boldsymbol h_{e_i}^0\Vert\boldsymbol h_{e_i}^1\Vert\dots\Vert\boldsymbol h_{e_i}^l\right]\,. {} \end{aligned} $$
(3.33)

Secondly, it constructs a set of virtual nodes \(\mathcal S_p=\{\boldsymbol q_1,\boldsymbol q_2,\dots ,\boldsymbol q_n\}\), namely, the proxy vectors, which are randomly initialized. The cross-graph aggregation is computed as:

$$\displaystyle \begin{aligned} \boldsymbol h_{e_i}^p=\sum_{j\in\mathcal S_p}\mathbf{Attention}_2(i,j)\cdot\mathbf{Messaging}_2(i,j)\,. {} \end{aligned} $$
(3.34)

Attention

For intra-graph information learning, the attention weights are calculated as:

$$\displaystyle \begin{aligned} \mathbf{Attention}_1(i,j,k)=\frac{\exp(\boldsymbol v^T\boldsymbol h_{r_k})}{\sum_{e_{j'}\in\mathcal N_{e_i}}\sum_{r_{k'}\in\mathcal R_{ij'}}\exp(\boldsymbol v^T\boldsymbol h_{r_{k'}})}\,, {} \end{aligned} $$
(3.35)

where \(\boldsymbol {v}^T\) is a learnable attention vector and \(\boldsymbol h_{r_k}\) is the representation of relation \(r_k\), which is randomly initialized by He_initializer [4].

For cross-graph information learning, the attention weights are computed by the similarity between entity and proxy vectors:

$$\displaystyle \begin{aligned} \mathbf{Attention}_2(i,j)=\frac{\exp(\cos{}(\boldsymbol h_{e_i}^{multi},\boldsymbol q_j))}{\sum_{k\in\mathcal S_p}\exp(\cos{}(\boldsymbol h_{e_i}^{multi},\boldsymbol q_k))}\,. {} \end{aligned} $$
(3.36)

Messaging

For the first GNN, the messaging is the same as RREA, which utilizes a relational reflection matrix to transform neighbor embeddings.

For the second GNN, the features of neighboring entities are represented as the difference between entity and proxy vectors:

$$\displaystyle \begin{aligned} \mathbf{Messaging}_2(i,j)=\boldsymbol h_{e_i}^{multi}-\boldsymbol q_j\,. {} \end{aligned} $$
(3.37)
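The cross-graph aggregation of Eqs. (3.34), (3.36), and (3.37) can be vectorized over all entities. The following NumPy sketch is an illustration under the assumption that the multi-hop embeddings and proxy vectors are stored row-wise in `h_multi` and `Q` (hypothetical names).

```python
import numpy as np

def proxy_aggregate(h_multi, Q):
    """Cross-graph aggregation with proxy vectors.

    h_multi: (n, d) multi-hop entity embeddings; Q: (p, d) proxy vectors.
    Returns the aggregated embeddings (n, d) and the attention weights (n, p).
    """
    hn = h_multi / np.linalg.norm(h_multi, axis=1, keepdims=True)
    qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    att = np.exp(hn @ qn.T)                          # exp(cos(h_i, q_j)), Eq. (3.36)
    att = att / att.sum(axis=1, keepdims=True)       # normalize over the proxy set
    diff = h_multi[:, None, :] - Q[None, :, :]       # Messaging_2(i, j) = h_i - q_j, Eq. (3.37)
    h_p = np.einsum("np,npd->nd", att, diff)         # weighted sum, Eq. (3.34)
    return h_p, att

rng = np.random.default_rng(0)
h_multi = rng.normal(size=(6, 4))
Q = rng.normal(size=(3, 4))
h_p, att = proxy_aggregate(h_multi, Q)
```

Since the attention weights of each entity sum to one, the result equals the entity embedding minus an attention-weighted average of the proxy vectors.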

Post-processing

For the final entity embeddings, the gate mechanism is used to combine intra-graph and cross-graph representations:

$$\displaystyle \begin{aligned} \begin{aligned} \boldsymbol\eta_{e_i}&=\sigma(\boldsymbol M\boldsymbol h_{e_i}^p+\boldsymbol b),\\ \boldsymbol h_{e_i}^{final}&=\boldsymbol\eta_{e_i}\cdot\boldsymbol h_{e_i}^p+(1-\boldsymbol\eta_{e_i})\cdot\boldsymbol h_{e_i}^{multi}\,, {} \end{aligned} \end{aligned} $$
(3.38)

where \(\boldsymbol M\) and \(\boldsymbol b\) are the gate weight matrix and gate bias vector.

Loss Function

Firstly, it calculates the original margin loss as follows:

$$\displaystyle \begin{aligned} l_o(e_i,e_j,e_j^{\prime})=\gamma+\Vert\boldsymbol h_{e_i}^{final}-\boldsymbol h_{e_j}^{final}\Vert^2_2-\Vert\boldsymbol h_{e_i}^{final}-\boldsymbol h_{e_j^{\prime}}^{final}\Vert^2_2\,. {} \end{aligned} $$
(3.39)

Inspired by batch normalization [5], which reduces the internal covariate shift, it proposes a normalization step that fixes the mean and variance of the sample losses, mapping \(l_o(e_i,e_j,e_j^{\prime })\) to \(l_n(e_i,e_j,e_j^{\prime })\), and thereby reduces the dependence on the scale of the hyper-parameter. Finally, the overall loss function is defined as follows:

$$\displaystyle \begin{aligned} \begin{aligned} \mathcal L&=\sum_{(e_i,e_j)\in P}\log\left[1+\sum_{e_j^{\prime}\in E_2}\exp(l_n(e_i,e_j,e_j^{\prime}))\right]\\ &\quad +\sum_{(e_i,e_j)\in P}\log\left[1+\sum_{e_i^{\prime}\in E_1}\exp(l_n(e_j,e_i,e_i^{\prime}))\right]\,, {} \end{aligned} \end{aligned} $$
(3.40)

where P is the set of positive samples and \(E_1\) and \(E_2\) are the sets of entities in two knowledge graphs, respectively.
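One direction of Eq. (3.40) can be sketched as follows. Several details here are assumptions rather than the method's specification: the normalization is taken to be a plain z-score over all negative-sample losses, the aligned entity itself is excluded from the negatives, and the default margin is a placeholder.

```python
import numpy as np

def normalized_lse_loss(h1, h2, pairs, gamma=1.0, eps=1e-8):
    """First term of Eq. (3.40), with negatives e_j' drawn from all of E_2.

    h1: (n1, d) final embeddings of E_1; h2: (n2, d) final embeddings of E_2.
    pairs: (m, 2) integer array of aligned (e_i, e_j) indices.
    """
    i, j = pairs[:, 0], pairs[:, 1]
    rows = np.arange(len(i))
    # Squared L2 distance from each e_i to every entity in E_2, shape (m, n2).
    dist = ((h1[i][:, None, :] - h2[None, :, :]) ** 2).sum(axis=-1)
    l_o = gamma + dist[rows, j][:, None] - dist          # Eq. (3.39)
    # Assumption: the positive e_j itself is not used as a negative.
    is_neg = np.ones_like(l_o, dtype=bool)
    is_neg[rows, j] = False
    # Assumption: z-score normalization over all negative-sample losses.
    l_n = (l_o - l_o[is_neg].mean()) / (l_o[is_neg].std() + eps)
    l_n = np.where(is_neg, l_n, -np.inf)
    # log(1 + sum_k exp(x_k)) evaluated stably by factoring out m >= 0.
    m = np.maximum(l_n.max(axis=1), 0.0)
    inner = np.exp(-m) + np.exp(l_n - m[:, None]).sum(axis=1)
    return float((m + np.log(inner)).sum())

rng = np.random.default_rng(0)
h1, h2 = rng.normal(size=(3, 4)), rng.normal(size=(5, 4))
pairs = np.array([[0, 0], [1, 1], [2, 2]])
loss = normalized_lse_loss(h1, h2, pairs)
```

The second term of Eq. (3.40) is obtained by swapping the roles of the two graphs.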

2.7 ERMC

This work proposes to jointly model and align entities and relations and meanwhile retain their semantic independence [14].

Pre-processing

For pre-processing, it obtains names or descriptions of entities and relations as the inputs for BERT [6] and adds an MLP layer to construct initial representations, which are denoted as \(\boldsymbol x^{e(0)}\) and \(\boldsymbol x^{r(0)}\) for each entity and relation, respectively.

Aggregation

Given an entity e, the model first aggregates the embeddings of entities that point to e:

$$\displaystyle \begin{aligned} \boldsymbol h_{\mathcal N_i^e}^{e(l+1)}=\sigma\left(\frac 1{|\mathcal N_i^{e(e)}|}\sum_{e_i\in\mathcal N_i^{e(e)}}\mathbf{Messaging}(i)\right)\,, {} \end{aligned} $$
(3.41)

where \(\sigma (\cdot )\) contains normalization, dropout, and activation operations. Similarly, the model aggregates the embeddings of relations that point to e, the embeddings of entities that e points to, and the embeddings of relations that e points to, producing \(\boldsymbol h_{\mathcal N_i^r}^{e(l+1)}\), \(\boldsymbol h_{\mathcal N_o^e}^{e(l+1)}\), and \(\boldsymbol h_{\mathcal N_o^r}^{e(l+1)}\), respectively. The model also aggregates the embeddings of entities that point to a relation r and of entities that r points to, producing the relation embeddings \(\boldsymbol h_{\mathcal N_i^e}^{r(l+1)}\) and \(\boldsymbol h_{\mathcal N_o^e}^{r(l+1)}\), respectively.

Messaging

Given an entity e, the messaging process of the entities that point to e is implemented as a simple linear transformation: \(\mathbf {Messaging}(i)=\boldsymbol W_{e_i}^{e(l)}\boldsymbol x^{e_i(l)}\), where \(\boldsymbol x^{e_i(l)}\) is the node representation from the previous layer and \(\boldsymbol W_{e_i}^{e(l)}\) is a learnable weight matrix that transforms the inward entity features. The messaging process of the other operations is implemented similarly.

Post-processing

The final representation of entity e is formulated as follows:

$$\displaystyle \begin{aligned} \begin{aligned} \boldsymbol h^{e(l+1)}&=\left[\boldsymbol h_{\mathcal N_i^e}^{e(l+1)}\Vert\boldsymbol h_{\mathcal N_i^r}^{e(l+1)}\Vert\boldsymbol h_{\mathcal N_o^e}^{e(l+1)}\Vert\boldsymbol h_{\mathcal N_o^r}^{e(l+1)}\right],\\ \boldsymbol x^{e(l+1)}&=MLP\left(\left[\boldsymbol h^{e(l+1)}\Vert\boldsymbol x^{e(l)}\right]\right)\,. {} \end{aligned} \end{aligned} $$
(3.42)

And the final representation of relation r is formulated similarly:

$$\displaystyle \begin{aligned} \begin{aligned} \boldsymbol h^{r(l+1)}&=\left[\boldsymbol h_{\mathcal N_i^e}^{r(l+1)}\Vert\boldsymbol h_{\mathcal N_o^e}^{r(l+1)}\right],\\ \boldsymbol x^{r(l+1)}&=MLP\left(\left[\boldsymbol h^{r(l+1)}\Vert\boldsymbol x^{r(l)}\right]\right)\,. {} \end{aligned} \end{aligned} $$
(3.43)

The graph embedding \(\boldsymbol H\in \mathbb R^{(|E|+|R|)\times d}\) is the concatenation of all entities and relations’ representations.

Loss Function

Denote \(\boldsymbol H_s\) and \(\boldsymbol H_t\) as the representations of two graphs, respectively. The similarity matrix is computed as:

$$\displaystyle \begin{aligned} \boldsymbol S=sinkhorn(\boldsymbol H_s,\boldsymbol H_t^T)\,, {} \end{aligned} $$
(3.44)

where \(s_{i,j}\in \boldsymbol S\) is a real number that denotes the correlation between entity \(e_s^i\) (from source graph) and \(e_t^j\) (from target graph), or the correlation between relation \(r_s^i\) (from source graph) and \(r_t^j\) (from target graph). The other elements are set to \(-\infty \) to mask the correlation between entity and relation across different graphs. The final loss function is formulated as follows:

$$\displaystyle \begin{aligned} \mathcal L=-\sum_{\left(e_s^i,e_t^j\right)\in\mathcal Q^e}\log\left(s_{i,j}\right)-\lambda\sum_{\left(r_s^i,r_t^j\right)\in\mathcal Q^r}\log\left(s_{i,j}\right)\,, {} \end{aligned} $$
(3.45)

where \((e_s^i,e_t^j)\) and \((r_s^i,r_t^j)\) are pre-aligned entity and relation pairs and \(\lambda \in [0,1]\) is a hyper-parameter.
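The text does not detail how the Sinkhorn operator in Eq. (3.44) is applied. As a hedged sketch, the code below assumes the usual formulation: the raw score matrix \(\boldsymbol H_s\boldsymbol H_t^T\) is exponentiated and then alternately row- and column-normalized (done in log space for stability). The entity-relation masking with \(-\infty \) is omitted, and only the entity term of Eq. (3.45) is shown.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def sinkhorn(scores, n_iters=100):
    """Alternate row and column normalization of exp(scores), carried out in log space."""
    log_s = scores.astype(float)
    for _ in range(n_iters):
        log_s = log_s - logsumexp(log_s, axis=1)   # rows sum to 1
        log_s = log_s - logsumexp(log_s, axis=0)   # columns sum to 1
    return np.exp(log_s)

def alignment_nll(S, pairs):
    """Entity term of Eq. (3.45): negative log-similarity of pre-aligned pairs."""
    return float(-np.log(S[pairs[:, 0], pairs[:, 1]]).sum())

rng = np.random.default_rng(0)
H_s = 0.5 * rng.normal(size=(4, 3))
H_t = 0.5 * rng.normal(size=(4, 3))
S = sinkhorn(H_s @ H_t.T)
loss = alignment_nll(S, np.array([[0, 0], [1, 1]]))
```

For a square score matrix the result is approximately doubly stochastic, so each entry can be read as a soft matching probability.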

2.8 KE-GCN

It combines GCNs and advanced KGE methods to learn the representations, where a novel framework is put forward to realize the messaging and aggregation modules in representation learning [15].

Aggregation

Denoting \(\boldsymbol h_v^l\) as the embedding of entity v at layer l, the entity updating rules are:

$$\displaystyle \begin{aligned} \begin{aligned} \boldsymbol m_v^{l+1}&=\sum_{(u,r)\in\mathcal N_{\mathrm{in}}(v)}\mathbf{Messaging}(u,r,v)+\sum_{(u,r)\in\mathcal N_{\mathrm{out}}(v)}\mathbf{Messaging}(u,r,v),\\ \boldsymbol h_v^{l+1}&=\sigma(\boldsymbol m_v^{l+1}+\boldsymbol W_0^l\boldsymbol h_v^l)\,, {} \end{aligned} \end{aligned} $$
(3.46)

where \(\mathcal N_{\mathrm {in}}(v)=\{(u,r)\vert u\stackrel {r}{\rightarrow }v\}\) is the set of inward entity-relation neighbors of entity v, while \(\mathcal N_{\mathrm {out}}(v)=\{(u,r)\vert u\stackrel {r}{\leftarrow }v\}\) is the set of outward neighbors of v. \(\boldsymbol W_0^l\) is a linear transformation matrix. \(\sigma (\cdot )\) denotes the activation function for the update. The embedding of relation is updated through a similar process.

Messaging

It considers GCN as an optimization process, where the messaging process is implemented as a partial derivative:

$$\displaystyle \begin{aligned} \mathbf{Messaging}(u,r,v)=\boldsymbol W_r^l\frac{\partial f(\boldsymbol h_u^l,\boldsymbol h_r^l,\boldsymbol h_v^l)}{\partial\boldsymbol h_v^l}\,, {} \end{aligned} $$
(3.47)

where \(\boldsymbol h_r^l\) represents the embedding of relation r at layer l and \(\boldsymbol W_r^l\) is a relation-specific linear transformation matrix. \(f(\boldsymbol h_u^l,\boldsymbol h_r^l,\boldsymbol h_v^l)\) is the scoring function that measures the plausibility of triple \((u,r,v)\). Thus, \(\boldsymbol m_v^{l+1}+\boldsymbol W_0^l\boldsymbol h_v^l\) in Eq. (3.46) can be regarded as a gradient ascent step that maximizes the sum of the scoring functions. For example, if \(f(\boldsymbol h_u^l,\boldsymbol h_r^l,\boldsymbol h_v^l)=(\boldsymbol h_u^l)^T\boldsymbol h_v^l\), Eq. (3.47) becomes equivalent to the common linear transformation \(\boldsymbol W_r^l\boldsymbol h_u^l\).
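The inner-product example can be checked numerically. The sketch below evaluates Eq. (3.47) with a central finite difference instead of an analytic gradient; the negative squared-distance translation score used as a second example is an assumed TransE-style variant chosen for illustration, not a scoring function prescribed by the text.

```python
import numpy as np

def numerical_grad(f, h_u, h_r, h_v, eps=1e-6):
    """Central finite-difference estimate of the gradient of f with respect to h_v."""
    g = np.zeros_like(h_v)
    for k in range(len(h_v)):
        step = np.zeros_like(h_v)
        step[k] = eps
        g[k] = (f(h_u, h_r, h_v + step) - f(h_u, h_r, h_v - step)) / (2 * eps)
    return g

def messaging(f, W_r, h_u, h_r, h_v):
    """Eq. (3.47): Messaging(u, r, v) = W_r * df/dh_v."""
    return W_r @ numerical_grad(f, h_u, h_r, h_v)

def inner_product(h_u, h_r, h_v):
    return h_u @ h_v

def translation_score(h_u, h_r, h_v):
    # Assumed TransE-style score: negative squared L2 distance.
    return -np.sum((h_u + h_r - h_v) ** 2)

rng = np.random.default_rng(0)
h_u, h_r, h_v = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
W_r = rng.normal(size=(3, 3))
msg_inner = messaging(inner_product, W_r, h_u, h_r, h_v)
msg_trans = messaging(translation_score, W_r, h_u, h_r, h_v)
```

With the inner product the message reduces to \(\boldsymbol W_r\boldsymbol h_u\), while the translation score yields \(2\boldsymbol W_r(\boldsymbol h_u+\boldsymbol h_r-\boldsymbol h_v)\), which shows how the choice of scoring function changes the message.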

Loss Function

Denote the training set as \(S=\{(u,v)\}\); this model utilizes margin-based ranking loss for optimization:

$$\displaystyle \begin{aligned} \mathcal L=\sum_{(u,v)\in S}\sum_{(u',v')\in S_{(u,v)}^{\prime}}\max(\Vert\boldsymbol h_u-\boldsymbol h_v\Vert_1+\gamma-\Vert\boldsymbol h_{u'}-\boldsymbol h_{v'}\Vert_1,0)\,, {} \end{aligned} $$
(3.48)

where \(S_{(u,v)}^{\prime }\) denotes the set of negative entity alignments constructed by corrupting \((u,v)\), i.e., replacing u or v with a randomly chosen entity in the graph. \(\gamma \) represents the margin hyper-parameter separating positive and negative entity alignments.

2.9 RePS

It encodes position and relation information for aligning entities [13].

Aggregation

Firstly, to encode position information, k subsets of nodes (referred to as anchor sets) are randomly sampled. The \(i^{th}\) anchor set is a collection of \(l_i\) nodes (anchors). Then for entity v, the aggregation process is formulated as:

$$\displaystyle \begin{aligned} \boldsymbol h_{v_p}^l=g\left(\frac 1{k+1}\left(\sum_{i=1}^k\mathbf{Messaging}_1(v,\psi_i)+\boldsymbol h_v^{l-1}\right)\right)\,, {} \end{aligned} $$
(3.49)

where \(\boldsymbol h_v^l\) represents the embedding of entity v from layer l, \(\psi _i\) is the \(i^{th}\) anchor set, and \(g(\boldsymbol X)=\sigma (\boldsymbol W_1\boldsymbol X+\boldsymbol b_1)\), where \(\boldsymbol W_1\) and \(\boldsymbol b_1\) are trainable parameters and \(\sigma \) is the activation function.

To encode relation information, a simple relation-specific GNN is used:

$$\displaystyle \begin{aligned} \boldsymbol h_{v_r}^l=f\left((1+c_v)\cdot\boldsymbol h_v^{l-1}+\sum_{i\in\mathcal N_v}\mathbf{Messaging}_2(i)\right)\,, {} \end{aligned} $$
(3.50)

where \(c_v\) is the learnable coefficient for entity v and \(\mathcal N_v\) is the set of neighboring entities of v. \(f(\boldsymbol X)=\boldsymbol W_2\boldsymbol X+\boldsymbol b_2\), where \(\boldsymbol W_2\) and \(\boldsymbol b_2\) are learnable parameters.

Messaging

To ensure similar entities in two graphs have similar representations, the relation-enriched distance function is defined as follows:

$$\displaystyle \begin{aligned} pd(u,v)=\min_q\left(\sum_{r\in P_q(u,v)}f(r,\mathcal{K}\mathcal{G}_i)\right)\,, {} \end{aligned} $$
(3.51)

where \(f(r,\mathcal {K}\mathcal {G}_i)\) is the frequency of relation r in \(\mathcal {K}\mathcal {G}_i\) and \(P_q(u,v)\) is the list of relations in the \(q^{th}\) path between u and v. Thus, \(pd(u,v)\) is the cost of the path between u and v whose relations have the smallest total frequency, which favors short paths made up of infrequent relations. Then the messaging function is formulated as follows:

$$\displaystyle \begin{aligned} \mathbf{Messaging}_1(v,\psi_i)=\min\left(\left\{pd(v,\psi_{i,j})\cdot\boldsymbol h_{\psi_{i,j}}^{l-1}\right\}_{j=1}^{l_i}\right)\,, {} \end{aligned} $$
(3.52)

where \(\psi _{i,j}\) is the j-th entity in the i-th anchor set.
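
The relation-enriched distance of Eq. (3.51) is a shortest-path problem in which each edge costs the frequency of its relation. The sketch below computes it with Dijkstra's algorithm in plain Python. It is only an illustration: the function name and the triple format are our own, and we assume that a path may traverse a triple in either direction, which the text does not state.

```python
import heapq
from collections import Counter, defaultdict


def relation_enriched_distance(triples, source, target):
    """pd(source, target): minimum total relation frequency over all paths."""
    freq = Counter(r for _, r, _ in triples)
    adj = defaultdict(list)
    for h, r, t in triples:
        # Assumption: edges can be walked in both directions.
        adj[h].append((t, freq[r]))
        adj[t].append((h, freq[r]))
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, weight in adj[node]:
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt))
    return float("inf")  # target is unreachable


# Toy graph: r1 occurs three times, r2 once.
toy_triples = [("a", "r1", "b"), ("b", "r1", "c"),
               ("a", "r2", "c"), ("c", "r1", "d")]
```

In the toy graph, the direct edge from a to c through the rare relation r2 costs 1, whereas the detour through b costs 6, so the rare relation is preferred.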

For relation-aware embedding, it sums up the neighboring representations with relation-specific weights:

$$\displaystyle \begin{aligned} \mathbf{Messaging}_2(i)=\frac{\boldsymbol h_i^{l-1}}{1+c_{r_{v,i}}}\,, {} \end{aligned} $$
(3.53)

where \(c_{r_{v,i}}\) is the learnable coefficient for relation r connecting v and i.
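
Combining Eqs. (3.50) and (3.53), the relation-aware update of one entity can be sketched as follows. This is a plain-Python illustration with hypothetical names (`relation_aware_update`, `c_rel`) and toy values. The coefficients that would be learned in the model are simply passed in as fixed numbers.

```python
def relation_aware_update(h_v, neighbors, c_v, c_rel, W2, b2):
    """Compute f((1 + c_v) * h_v + sum_i h_i / (1 + c_rel[r_{v,i}])).

    neighbors is a list of (h_i, relation) pairs for entity v.
    """
    acc = [(1.0 + c_v) * x for x in h_v]
    for h_i, rel in neighbors:
        scale = 1.0 / (1.0 + c_rel[rel])  # Messaging_2, Eq. (3.53)
        acc = [a + scale * x for a, x in zip(acc, h_i)]
    # f(X) = W2 X + b2
    return [sum(w * a for w, a in zip(row, acc)) + b
            for row, b in zip(W2, b2)]


# Toy values (assumptions): identity W2 and zero bias.
updated = relation_aware_update(
    h_v=[1.0, 0.0],
    neighbors=[([0.0, 2.0], "r1"), ([4.0, 4.0], "r2")],
    c_v=0.5,
    c_rel={"r1": 1.0, "r2": 3.0},
    W2=[[1.0, 0.0], [0.0, 1.0]],
    b2=[0.0, 0.0],
)
```

A larger relation coefficient shrinks the contribution of the neighbors reached through that relation, which is how the relation-specific weighting enters the sum.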

Post-processing

The final representation of v is computed as:

$$\displaystyle \begin{aligned} \boldsymbol h_v^l=g\left(\boldsymbol h_{v_p}^l\right)\cdot\boldsymbol h_{v_p}^l+\left(1-g\left(\boldsymbol h_{v_p}^l\right)\right)\cdot\boldsymbol h_{v_r}^l\,, {} \end{aligned} $$
(3.54)

where \(g(\boldsymbol h_{v_p}^l)=\sigma (\boldsymbol W_3\boldsymbol h_{v_p}^l+\boldsymbol b_3)\) learns the relative importance. \(\boldsymbol W_3\) and \(\boldsymbol b_3\) are trainable parameters and \(\sigma \) is the activation function.
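
The gating in Eq. (3.54) can be illustrated with a short plain-Python sketch. We assume here that \(\sigma \) is the logistic sigmoid and that the gate is applied element-wise. Neither choice is stated explicitly in the text, and the function name `gated_fusion` is our own.

```python
import math


def gated_fusion(h_p, h_r, W3, b3):
    """Blend position-aware h_p and relation-aware h_r as in Eq. (3.54)."""
    gate = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, h_p)) + b)))
            for row, b in zip(W3, b3)]
    return [g * p + (1.0 - g) * r for g, p, r in zip(gate, h_p, h_r)]


# With zero weights and biases the gate is 0.5, i.e., a plain average.
fused = gated_fusion([2.0, 4.0], [0.0, 0.0],
                     W3=[[0.0, 0.0], [0.0, 0.0]], b3=[0.0, 0.0])
```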

Loss Function

It introduces a novel knowledge-aware negative sampling (KANS) technique to generate hard negative samples. For each tuple \((v,v')\) in S, the negative instances for v are sampled from set \(\Phi _v\), where \(\Phi _v\) is the set of entities which share at least one (relation, tail) pair or (relation, head) pair with \(v'\). The model is trained by minimizing the following loss:

$$\displaystyle \begin{aligned} \mathcal L=\sum_{(p,p')\in S}\Vert\boldsymbol p-\boldsymbol p'\Vert+\beta\sum_{(p,q)\in S'}[\gamma-\Vert\boldsymbol p-\boldsymbol q\Vert]_+\,, {} \end{aligned} $$
(3.55)

where \(S'\) denotes the set of negative pairs, \(\beta \) is a weighting parameter, and \(\gamma \) is the margin.
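
A possible reading of the candidate set \(\Phi _v\) and of Eq. (3.55) is sketched below in plain Python. The function names and the triple format are hypothetical. We interpret "sharing a (relation, tail) pair" as both entities appearing as the head of triples with the same relation and tail, and symmetrically for (relation, head). The norm in Eq. (3.55) is not specified, so the distance function is passed in as a parameter.

```python
def kans_candidates(triples, v_prime):
    """Entities sharing a (relation, tail) or (relation, head) pair with v_prime."""
    rel_tail = {(r, t) for h, r, t in triples if h == v_prime}
    head_rel = {(h, r) for h, r, t in triples if t == v_prime}
    candidates = set()
    for h, r, t in triples:
        if (r, t) in rel_tail and h != v_prime:
            candidates.add(h)
        if (h, r) in head_rel and t != v_prime:
            candidates.add(t)
    return candidates


def reps_loss(emb, positives, negatives, beta, gamma, dist):
    """Eq. (3.55): pull positive pairs together, push negatives past the margin."""
    pos = sum(dist(emb[p], emb[q]) for p, q in positives)
    neg = sum(max(gamma - dist(emb[p], emb[q]), 0.0) for p, q in negatives)
    return pos + beta * neg


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy knowledge graph (assumed data).
kg = [("Paris", "capital_of", "France"), ("Lyon", "city_in", "France"),
      ("Marseille", "city_in", "France"), ("Berlin", "capital_of", "Germany"),
      ("Europe", "contains", "France"), ("Europe", "contains", "Germany")]
```

In the toy graph, the hard negatives for an entity aligned with "Lyon" are drawn from {"Marseille"}, since both are cities in France and are therefore harder to tell apart than a random entity.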

2.10 SDEA

SDEA utilizes BiGRU to capture correlations among neighbors and generate entity representations [16].

Pre-processing

SDEA devises an attribute embedding module to capture entity associations via entity attributes. Specifically, given an entity \(e_i\), it concatenates the names and descriptions of its attributes, denoted as \(S(e_i)\). Then \(S(e_i)\) is fed into a BERT model to generate the attribute embedding \(\boldsymbol H_a(e_i)\). The implementation details can be found in Section III of the original paper and are omitted here in the interest of space.

Aggregation

It aggregates the neighboring information utilizing attention mechanism:

$$\displaystyle \begin{aligned} \boldsymbol H_r(e_i)=\sum_{t=1}^n\mathbf{Attention}(t)\cdot\mathbf{Messaging}(t)\,. {} \end{aligned} $$
(3.56)

Since SDEA treats the neighborhood as a sequence, t represents the t-th neighboring entity of \(e_i\), and \(\mathbf {Messaging}()\) is computed through a BiGRU.

Attention

SDEA computes attention via simple inner product:

$$\displaystyle \begin{aligned} \mathbf{Attention}(t) = \frac{\exp\left(\boldsymbol h_t^T\cdot \hat{\boldsymbol h}\right)}{\sum_{i=1}^n \exp\left(\boldsymbol h_i^T\cdot \hat{\boldsymbol h}\right)}\,, {} \end{aligned} $$
(3.57)

where \(\hat {\boldsymbol h}\) is the global attention representation, which is obtained after feeding the output of the last unit of the BiGRU, denoted as \(\boldsymbol h_n\), into an MLP layer.
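
Equations (3.56) and (3.57) together amount to a softmax-weighted sum of the BiGRU outputs. The plain-Python sketch below illustrates this under simplifying assumptions: the global representation `h_hat` is passed in directly instead of being produced by an MLP, the messages are taken to be the hidden states themselves, and the function names are our own.

```python
import math


def attention_weights(hidden, h_hat):
    """Softmax over the inner products h_t^T h_hat, Eq. (3.57)."""
    scores = [sum(a * b for a, b in zip(h, h_hat)) for h in hidden]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(hidden, h_hat):
    """Attention-weighted sum of the messages, Eq. (3.56)."""
    weights = attention_weights(hidden, h_hat)
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]


# Toy hidden states for two neighbors (assumed values).
toy_hidden = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention_weights(toy_hidden, [0.0, 0.0])
peaked = attention_weights(toy_hidden, [10.0, 0.0])
```

A zero global vector gives uniform weights, while a global vector aligned with the first hidden state concentrates almost all weight on the first neighbor.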

Messaging

Different from other models, SDEA captures the correlation between neighbors in the messaging module, and all neighbors of entity \(e_i\) are regarded as an input sequence of the BiGRU model. Given entity \(e_i\), let \(\boldsymbol x_t\) denote the t-th input embedding (i.e., the attribute embedding of \(e_i\)’s t-th neighbor, as described in the pre-processing module) and \(\boldsymbol h_t\) denote the t-th output hidden state. The process of BiGRU is formulated as follows:

$$\displaystyle \begin{aligned} \begin{aligned} \boldsymbol r_t &= \sigma(\boldsymbol{W}_r\boldsymbol x_t + \boldsymbol U_r\boldsymbol h_{t-1} + \boldsymbol b_r)\\ \tilde{\boldsymbol h}_t &= \phi(\boldsymbol{W}\boldsymbol x_t + \boldsymbol U(\boldsymbol r_t\odot \boldsymbol h_{t-1}) + \boldsymbol b_h)\\ \boldsymbol z_t &= \sigma(\boldsymbol{W}_z\boldsymbol x_t + \boldsymbol U_z\boldsymbol h_{t-1} + \boldsymbol b_z)\\ \boldsymbol{h}_t &= (\boldsymbol 1-\boldsymbol z_t)\odot \boldsymbol h_{t-1} + \boldsymbol z_t\odot \tilde{\boldsymbol h}_t\,, \end{aligned} {} \end{aligned} $$
(3.58)

where \(\boldsymbol r_t\) is the reset gate that drops the unimportant information and \(\boldsymbol z_t\) is the update gate that combines the important information. \(\boldsymbol {W}, \boldsymbol U, \boldsymbol b\) (with their respective subscripts) are learnable parameters. \(\tilde {\boldsymbol h}_t\) is the candidate hidden state. \(\sigma \) is the sigmoid function and \(\phi \) is the hyperbolic tangent. \(\odot \) is the Hadamard product.

A BiGRU produces outputs in two directions, \(\overleftarrow {\boldsymbol h_t}\) and \(\overrightarrow {\boldsymbol h_t}\), and its final output, namely, the output of the messaging module, is the sum of the two directions: \(\mathbf {Messaging}(t)=\overleftarrow {\boldsymbol h_t}+\overrightarrow {\boldsymbol h_t}\).
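
The recurrence and the bidirectional sum can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SDEA implementation: the candidate state is written in the standard GRU form \(\phi (\boldsymbol W\boldsymbol x_t+\boldsymbol U(\boldsymbol r_t\odot \boldsymbol h_{t-1})+\boldsymbol b_h)\), the initial hidden state is zero, and the parameter dictionary and function names are hypothetical.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def gru_step(x, h_prev, p):
    """One GRU step; p maps parameter names to matrices and bias vectors."""
    r = sigmoid(add(matvec(p["W_r"], x), matvec(p["U_r"], h_prev), p["b_r"]))
    z = sigmoid(add(matvec(p["W_z"], x), matvec(p["U_z"], h_prev), p["b_z"]))
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # element-wise r_t * h_{t-1}
    cand = [math.tanh(a) for a in
            add(matvec(p["W"], x), matvec(p["U"], gated), p["b_h"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def run_gru(xs, p, dim):
    h, outputs = [0.0] * dim, []
    for x in xs:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs


def bigru_messages(xs, p_fwd, p_bwd, dim):
    """Messaging(t): sum of forward and backward hidden states at position t."""
    forward = run_gru(xs, p_fwd, dim)
    backward = run_gru(xs[::-1], p_bwd, dim)[::-1]  # re-align to input order
    return [add(f, b) for f, b in zip(forward, backward)]
```

The backward pass reads the neighbor sequence in reverse, so its outputs are reversed again before the two directions are added position by position.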

Post-processing

After obtaining the attribute embedding \(\boldsymbol H_a(e_i)\) and the relational embedding \(\boldsymbol H_r(e_i)\), they are concatenated and forwarded to another MLP layer, resulting in \(\boldsymbol H_m(e_i)=MLP([\boldsymbol H_a(e_i)\Vert \boldsymbol H_r(e_i)])\). Finally, \(\boldsymbol H_a(e_i)\), \(\boldsymbol H_r(e_i)\), and \(\boldsymbol H_m(e_i)\) are concatenated to produce \(\boldsymbol H_{ent}(e_i)=[\boldsymbol H_r(e_i)\Vert \boldsymbol H_a(e_i)\Vert \boldsymbol H_m(e_i)]\), which is used in the alignment stage.

Loss Function

The model uses the following margin-based ranking loss to train the attribute embedding module:

$$\displaystyle \begin{aligned} \mathcal L=\sum_{e_i,e_i^{\prime},e_i^{\prime\prime}\in D}\max\left\{0,\Vert \boldsymbol H_a(e_i)-\boldsymbol H^{\prime}_a(e_i^{\prime})\Vert_2-\Vert \boldsymbol H_a(e_i)-\boldsymbol H_a^{\prime}(e^{\prime\prime}_i)\Vert_2+\beta\right\}\,, {} \end{aligned} $$
(3.59)

where D is the training set; \(\boldsymbol H_a\) and \(\boldsymbol H_a^{\prime }\) are attribute embeddings of source graph and target graph, respectively; and \(\beta >0\) is the margin hyper-parameter used for separating positive and negative pairs.

The training of the relation embedding module uses a margin-based ranking loss similar to Eq. (3.59), where the embedding \(\boldsymbol H_a(e_i)\) is replaced by \([\boldsymbol H_r(e_i)\Vert \boldsymbol H_m(e_i)]\).

3 Experiments

In this section, we first conduct an overall comparison experiment to reveal the effectiveness of state-of-the-art representation learning methods. Then we conduct further experiments in terms of the six modules of representation learning, so as to examine the effectiveness of various strategies.

3.1 Experimental Setting

Dataset

We use the most frequently used DBP15K dataset [11] for evaluation.

Baselines

For overall comparison, we select seven models, including AliNet [12], MRAEA [8], RREA [9], RAGA [17], SDEA [16], Dual-AMN [7], and RPR-RHGT [2]. We collect their source code and reproduce the results in the same setting. Specifically, to make a fair comparison, we modify and unify the alignment part of these models, forcing them to utilize the L1 distance and a greedy algorithm for alignment inference. We omit the comparison with the remaining models, as they do not provide source code and our implementations cannot reproduce their results. For ablation and further experiments, we choose RAGA as the base model.
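
The unified inference step can be pictured with the plain-Python sketch below. We assume here that "greedy" means each source entity independently picks the target entity with the smallest L1 distance. The function names and toy embeddings are our own and do not come from the evaluated code bases.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def greedy_align(src_emb, tgt_emb):
    """Map every source entity to its nearest target entity under L1 distance."""
    return {s: min(tgt_emb, key=lambda t: l1(h_s, tgt_emb[t]))
            for s, h_s in src_emb.items()}


# Toy embeddings (assumed values).
src = {"s1": [0.0, 0.0], "s2": [5.0, 5.0]}
tgt = {"t1": [4.0, 6.0], "t2": [1.0, 0.0]}
alignment = greedy_align(src, tgt)
```

Because each source entity is matched independently, two source entities may select the same target. Such conflicts are not resolved by this sketch.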

Parameters and Metrics

Since different models have different kinds of hyper-parameters, we only unify the common ones, such as the margin \(\lambda =3\) in the margin loss function and the number of negative samples \(k=5\). For other parameters, we keep the default settings in the original papers.

Following existing studies, we use Hits@k (\(k=1\), 10) and mean reciprocal rank (MRR) as the evaluation metrics. The higher the Hits@k and MRR, the better the performance. In experiments, we report the average performance of three independent runs as the final result.
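
For reference, both metrics can be computed from the rank of the correct target entity for each test source entity. The sketch below is a plain-Python illustration with hypothetical function names and made-up ranks. Ties are resolved optimistically, which is an assumption.

```python
def rank_of_gold(distances, gold):
    """1-based rank of the gold target when candidates are sorted by distance."""
    return 1 + sum(1 for d in distances.values() if d < distances[gold])


def hits_at_k(ranks, k):
    """Fraction of test entities whose gold target is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up ranks for four test entities.
toy_ranks = [1, 3, 12, 2]
h1 = hits_at_k(toy_ranks, 1)
h10 = hits_at_k(toy_ranks, 10)
mrr = mean_reciprocal_rank(toy_ranks)
```

For the made-up ranks, one of four entities is ranked first and three of four fall within the top ten, so Hits@1 is 0.25 and Hits@10 is 0.75.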

3.2 Overall Results and Analysis

Firstly, we compare the overall performance of seven advanced models in Table 3.2, where the best results are highlighted in bold, and the second best results are underlined.

Table 3.2 Comparison of representation learning models on DBP15K

From the results, it can be observed that:

  • No model achieves state-of-the-art performance over all three KG pairs. This indicates that current advanced models have advantages and disadvantages in different situations.

  • SDEA achieves the best performance on ZH-EN and FR-EN, and RPR-RHGT leads on JA-EN. Considering that both models leverage a pre-trained model to obtain initial embeddings and devise novel approaches to extract neighboring features, we may draw the preliminary conclusion that utilizing a pre-trained model benefits the representation learning process and that an effective messaging approach is important to the overall results.

  • RAGA achieves the second best performance on JA-EN and FR-EN, and Dual-AMN attains the second best result on ZH-EN. Notably, RAGA also leverages the pre-trained model, which further validates the effectiveness of using pre-trained model for initialization. Dual-AMN uses proxy vectors that can help capture cross-graph information and hence improve representation learning.

  • AliNet performs the worst on all three datasets. As AliNet is the only model that aggregates 2-hop neighboring entities, this may indicate that directly incorporating 2-hop neighboring information brings little benefit, which can also be observed in the further experiments on the aggregation module.

3.3 Further Experiments

To compare various strategies in each module of representation learning, we conduct further experiments using the RAGA model.

3.3.1 Pre-processing Module

RAGA takes pre-trained embeddings as input, which are forwarded to a two-layer GCN with highway network to generate initial representations. To examine the effectiveness of pre-trained embeddings and structural embeddings, we remove them, respectively, and then make comparison. Table 3.3 shows the results, where “w/o Pre-trained” represents removing pre-trained embeddings, “w/o GNN” represents removing GCN, and “w/o Both” represents removing the whole pre-processing module.

Table 3.3 Analysis of the pre-processing module using RAGA

The results show that removing either the structural features or the pre-trained embeddings significantly degrades the performance, and the model that completely removes the pre-processing module achieves the worst result. Hence, it is important to extract useful features to initialize the embeddings. Additionally, we can observe that the semantic features in the pre-trained model are more useful than the structural vectors, which verifies the effectiveness of the prior knowledge contained in the pre-trained embeddings. Using structural embeddings for initialization is less effective, as the subsequent steps in representation learning also aim to extract structural features to produce meaningful representations.

3.3.2 Messaging Module

For the messaging module, linear transformation is the most widely used approach. RAGA only utilizes linear transformation in its first GNN and does not use transformation in the other two GNNs. Thus, we design two variants: one that eliminates the linear transformation in the first GNN (“-Linear Transform”), resulting in a model without linear transformation at all, and the other one that adds linear transformation in the other two GNNs (“\(+\)Linear Transform”), resulting in a model that is fully equipped with linear transformation.

The results are presented in Table 3.4. Besides, we also report their convergence rates in Fig. 3.1.

Fig. 3.1 Comparison of convergences. Line plot of training loss versus epoch for three variants: without linear transformation, the original RAGA, and with linear transformation in all GNNs. The variant without linear transformation starts from the highest loss, and all three curves approach zero by about epoch 80.

Table 3.4 Analysis of the messaging module using RAGA

It is evident that adding linear transformation in the rest of the GNNs improves the performance of RAGA, especially on the JA-EN and FR-EN datasets, where Hits@1 improves by 1.1% and 1.2%, respectively. Additionally, when linear transformation is removed, the performance drops significantly. Furthermore, Fig. 3.1 shows that linear transformation can also speed up the convergence of the model, possibly due to the introduction of extra parameters.

3.3.3 Attention Module

For the attention module, there are two popular implementations, i.e., inner product and concatenation. To compare the two approaches, we design two variants and report their results: “-Inner product,” which replaces the concatenation computation of RAGA with inner product computation by changing \(\boldsymbol {v}^T[\boldsymbol e_i\Vert \boldsymbol e_j]\) to \((\boldsymbol M_1\boldsymbol e_i)^T(\boldsymbol M_2\boldsymbol e_j)\), where \(\boldsymbol M_1, \boldsymbol M_2\) are learnable transformation matrices, and “w/o Attention,” which removes the attention mechanism, i.e., does not compute attention coefficients and simply averages the neighboring representations.
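
The two scoring functions compared here differ only in how the pair \((\boldsymbol e_i,\boldsymbol e_j)\) is turned into a scalar. The plain-Python sketch below shows the raw, unnormalized scores. The function names and toy values are our own, and any nonlinearity or softmax normalization applied afterwards in the actual model is left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, x):
    return [dot(row, x) for row in M]


def concat_score(v, e_i, e_j):
    """Concatenation variant: v^T [e_i || e_j]."""
    return dot(v, e_i + e_j)  # list '+' concatenates the two vectors


def inner_product_score(M1, M2, e_i, e_j):
    """Inner-product variant: (M1 e_i)^T (M2 e_j)."""
    return dot(matvec(M1, e_i), matvec(M2, e_j))


# Toy values (assumptions): identity matrices for M1 and M2.
e_i, e_j = [2.0, 3.0], [4.0, 5.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
s_concat = concat_score([1.0, 0.0, 0.0, 1.0], e_i, e_j)
s_inner = inner_product_score(identity, identity, e_i, e_j)
```

The concatenation score is additive in the two entities, whereas the inner-product score depends on their interaction, which is one way to see why the two variants may behave differently across datasets.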

As shown in Table 3.5, the two variant models perform almost the same as the original model. Considering the influence of the initial representation generated in the pre-processing module, we remove the pre-trained vectors of the pre-processing module and then conduct the same comparison. As shown in Table 3.6, removing the attention mechanism degrades the performance, so we may draw a preliminary conclusion that the attention mechanism can play a better role in the absence of prior knowledge. As for the two strategies of attention computation, inner product performs better than concatenation on the ZH-EN dataset but worse on the JA-EN and FR-EN datasets, which indicates that these two approaches make different contributions on different datasets.

Table 3.5 Analysis of the attention module using RAGA
Table 3.6 Analysis of the attention module using RAGA after removing pre-trained embeddings

3.3.4 Aggregation Module

For the aggregation module, as RAGA incorporates both 1-hop neighbors and relation information to update entity representations, we examine two variants, i.e., adding 2-hop neighboring information (“-2hop”) and removing relation representation (“w/o rel.”). The results are shown in Table 3.7.

Table 3.7 Analysis of the aggregation module using RAGA

We can observe that the performance of the model decreases significantly after removing the relation representation learning. This shows that the integration of relation representations can indeed enhance the learning ability of the model. Besides, the performance of the model decreases slightly after adding the information of 2-hop neighboring entities, which might indicate that the 2-hop neighboring information can introduce noise, as not all entities are useful for aligning the target entity.

3.3.5 Post-processing Module

RAGA concatenates the relation-aware entity representation and the 1-hop aggregation results to produce the final representation. We examine two variants, i.e., “-highway” that replaces concatenation with a highway network [10], and “w/o post-processing” that removes the relation-aware entity representation (Table 3.8).

Table 3.8 Analysis of the post-processing module using RAGA

From the experimental results, it can be seen that removing post-processing module decreases the performance, which indicates that the relation-aware representations can indeed enhance the final representations and improve the alignment performance. After replacing the concatenation operation with highway network, the performance decreases on JA-EN dataset and increases on FR-EN dataset, which indicates that the two post-processing strategies do not have absolute advantages and disadvantages.

3.3.6 Loss Function Module

For the loss function, RAGA employs margin-based loss in training. We consider two other popular choices, i.e., TransE-based loss and margin-based + TransE loss. Specifically, TransE-based loss is formulated as \(l_E=\frac 1k\sum _k\Vert h_k+r_k-t_k\Vert _1\), where the triples \((h_k, r_k, t_k)\) are randomly sampled.
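
The TransE-based loss is simply the mean L1 residual of \(h+r-t\) over the sampled triples, as the plain-Python sketch below illustrates. The function name, the dictionary-based embedding tables, and the toy values are assumptions made for the example.

```python
def transe_loss(ent, rel, triples):
    """Mean of ||h + r - t||_1 over the given (head, relation, tail) triples."""
    total = 0.0
    for h, r, t in triples:
        total += sum(abs(a + b - c)
                     for a, b, c in zip(ent[h], rel[r], ent[t]))
    return total / len(triples)


# Toy embeddings: the first triple satisfies h + r = t exactly.
ent = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
rel = {"r": [1.0, 1.0]}
toy_loss = transe_loss(ent, rel, [("a", "r", "b"), ("b", "r", "a")])
```

The first toy triple contributes zero and the reversed triple contributes 4, so the mean is 2. The same relation vector cannot satisfy the translation in both directions.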

From the results in Table 3.9, it can be seen that the model performance decreases after using or adding the TransE loss. This is mainly because the TransE assumption, i.e., \(h+r\approx t\), is not universal. For example, in the RAGA model used in this experiment, the representation of a relation is actually obtained by adding the representations of the head entity and the tail entity, which conflicts with the TransE assumption.

Table 3.9 Analysis of the loss function module using RAGA

4 Conclusion

In this chapter, we survey recent advances in the representation learning stage of EA. We propose a general framework of GNN-based representation learning models, which consists of six modules, and summarize ten recent works in terms of these modules. Extensive experiments are conducted to show the overall performance of each method and to reveal the effectiveness of the strategies in each module.