1 Introduction

In recent years, there has been a noticeable trend of integrating multimedia data into knowledge graphs (KGs) to facilitate cross-modal activities that involve the interplay of information across multiple modalities, e.g., image and video retrieval [27], video summaries [19], visual entity disambiguation [17], visual question answering [32], etc. To this end, several multi-modal KGs (MMKGs) [16, 28] have been constructed recently. An example of an MMKG is shown in Fig. 9.1. In this study, we focus on MMKGs that consist of two modalities, namely, KG structural information and visual information, while retaining a generalizable approach.

Fig. 9.1

An example of an MMKG, comprising entities, image sets, and entity-image links; the relationships between four distinct entities are depicted

Example Figure 9.1 shows a partial MMKG, which consists of entities, image sets, and the links between them. The KG structural data captures the relationships between the different entities, whereas the visual data is sourced from the sets of images. For the entity The Prestige, its image set may contain scenes, actors, posters, etc.

However, many of the current MMKGs have been sourced from restricted data sources, causing them to have inadequate domain coverage [22]. To broaden the scope of these MMKGs, one potential solution is to incorporate valuable knowledge from other MMKGs. An essential step in consolidating knowledge across MMKGs is to identify matching entities in different KGs, given that entities serve as the links that connect the diverse KGs. This task is referred to as multi-modal entity alignment (MMEA).

MMEA is a complex undertaking that necessitates the modeling and amalgamation of information from multiple modalities. For the KG structural information, existing entity alignment (EA) approaches [3, 9, 25, 33] can be directly adopted to generate entity structural embeddings for MMEA. These methods usually utilize TransE-based or graph convolutional network (GCN)-based models [1, 12] to learn entity representations of individual KGs, which are then unified using the seed entity pairs. However, all of these techniques generate entity representations in the Euclidean space, which can result in significant distortion when embedding real-world graphs that possess scale-free or hierarchical structures [4, 23]. Concerning the visual information, prior work employs the VGG16 model to create embeddings for the images linked to entities, which are subsequently used for alignment. However, the VGG16 model is not adept at extracting valuable features from images, which limits the efficacy of the alignment process. Lastly, the integration of information from both modalities must be executed meticulously to enhance overall effectiveness.

To tackle the problems mentioned above, we introduce a multi-modal entity alignment technique that works in hyperbolic space (HMEA). More specifically, we expand the Euclidean representation to the hyperboloid manifold and utilize hyperbolic graph convolutional networks (HGCNs) to develop structural representations of entities. With regard to visual data, we create image embeddings using the DenseNet model and also map them into the hyperbolic space with HGCN. Ultimately, we combine the structural embeddings and image embeddings in the hyperbolic space to forecast potential alignments.

To sum up, the key contributions of our technique can be outlined as follows:

  • We propose a novel MMEA approach, HMEA, which models and integrates multi-modal information in the hyperbolic space.

  • We apply the hyperbolic graph convolutional networks (HGCNs) to develop structural representations of entities and showcase the benefits of the hyperbolic space for knowledge graph representations.

  • We use a superior image embedding model to acquire improved visual representations for alignment.

  • We perform thorough experimental evaluations to confirm the efficacy of our proposed model.

Organization

Section 9.2 overviews related work, and the preliminaries are introduced in Sect. 9.3. Section 9.4 describes our proposed approach. Section 9.5 presents experimental results, followed by the conclusion in Sect. 9.6.

2 Related Work

In this section, we introduce some efforts that are relevant to this work.

2.1 Multi-Modal Knowledge Graph

Many knowledge graph construction studies concentrate on organizing and discovering textual data in a structured format, neglecting other resources available on the Web [28]. Nevertheless, real-world applications require cross-modal data, such as image and video retrieval, visual question answering, video summaries, visual commonsense reasoning, and so on. Consequently, multi-modal knowledge graphs (MMKGs) have been introduced, which comprise diverse information (e.g., image, text, KG) and cross-modal relationships. However, building MMKGs poses several challenges. Collecting substantial multi-modal data from search engines is a time-consuming and laborious task. Additionally, MMKGs often have low domain coverage and are incomplete. Integrating multi-modal knowledge from other MMKGs is an effective way to enhance their completeness. Currently, there are few studies on merging different MMKGs. Liu et al. [16] built two pairs of MMKGs and extracted relational, latent, numerical, and visual features for predicting SameAs links between entities. Some approaches to multi-modal knowledge representation also exploit visual features from entity images; for instance, IKRL [31] integrates image representations into an aggregated image-based representation via an attention-based method.

2.2 Representation Learning in Hyperbolic Space

Essentially, most of the existing GCN models are designed for graphs in Euclidean spaces [2]. However, research has found that graph data exhibits a non-Euclidean structure [18], and embedding real-world graphs with a scale-free or hierarchical structure results in significant distortion [4, 23]. Moreover, recent studies in network science have shown that hyperbolic geometry is ideal for modeling complex networks, as the hyperbolic space can naturally reflect some graph properties [14]. One of the key features of hyperbolic spaces is that they expand exponentially, whereas Euclidean spaces expand only polynomially. Due to the advantages of hyperbolic space in representing graph structure data, there has been growing interest in representation learning in hyperbolic spaces, particularly in learning the hierarchical representation of a graph [20]. Furthermore, Nickel et al. [21] have demonstrated that the Lorentz model of hyperbolic geometry has favorable properties for stochastic optimization and leads to substantially enhanced embeddings, particularly in low dimensions. Additionally, some researchers have begun to extend deep learning methods to hyperbolic space, achieving state-of-the-art performance on link prediction and node classification tasks [7, 8, 26].

3 Preliminaries

In this section, we start by providing a formal definition of the MMEA task. Then, we provide a brief overview of the GCN model. Lastly, we introduce the fundamental principles of hyperbolic geometry, which serve as the foundation for our proposed model.

3.1 Task Formulation

The goal of MMEA is to align entities in two MMKGs. An MMKG typically encompasses information in several modalities. In this study, we concentrate on the KG structural information and visual information, without any loss of generality. Formally, we represent an MMKG as \(MG = (E,R,T,I)\), where E, R, T, and I denote the sets of entities, relations, triples, and images, respectively. A relational triple \(t \in T\) can be represented as \((e_1, r, e_2)\), where \(e_1, e_2 \in E\) and \(r \in R\). An entity e is associated with multiple images \(I_e = \{i_e^0, i_e^1,\ldots ,i_e^n\}\).

Given two MMKGs, \(MG_1 = (E_1, R_1, T_1, I_1)\), \(MG_2 = (E_2, R_2, T_2, I_2)\), and seed entity pairs (pre-aligned entity pairs for training) \( S=\{(e_s^1, e_s^2)|e_s^1\leftrightarrow e_s^2, e_s^1 \in E_1, e_s^2 \in E_2 \}\), where \(\leftrightarrow \) represents equivalence, the task of MMEA can be defined as discovering more aligned entity pairs \(\{(e^1, e^2)|e^1 \in E_1, e^2 \in E_2 \}\). We use the following example to further illustrate this task.

Example Figure 9.2 shows two partial MMKGs. The equivalence between The Dark Knight in \(MG_1\) and The Dark Knight in \(MG_2\) is known in advance. EA aims to detect potential equivalent entity pairs, e.g., Nolan in \(MG_1\) and Nolan in \(MG_2\), using the known alignments. □

Fig. 9.2

An example of MMEA. \(MG_1\) contains Michael Caine, The Dark Knight, Nolan, and Chicago; \(MG_2\) contains The Dark Knight, Christian Bale, The Prestige, and Nolan. Seed entity pairs are connected by dashed lines. For clarity, we only choose one image to represent the image set of an entity

3.2 Graph Convolutional Neural Networks

GCNs [10, 13] are a type of neural network that operates directly on graph data. A GCN model comprises several stacked GCN layers. The inputs to the l-th layer of the GCN model are node feature vectors and the graph’s structure. \(\boldsymbol {H}^{(l)} \in {R} ^ {n \times d^{l}}\) is the vertex feature representation, where n is the number of vertices and \(d^{l}\) is the dimensionality of the feature vectors. \(\boldsymbol {\hat A} = \boldsymbol {D}^{-\frac {1}{2}}(\boldsymbol {A} + \boldsymbol {I})\boldsymbol {D}^{-\frac {1}{2}}\) represents the symmetric normalized adjacency matrix. The identity matrix \(\boldsymbol {I}\) is added to the adjacency matrix \(\boldsymbol {A}\) to obtain self-loops for each node, and \(\boldsymbol {D}\) is the diagonal degree matrix with \( \boldsymbol {D}_{ii} = \sum _j(\boldsymbol {A}_{ij}+\boldsymbol {I}_{ij})\). The output of the l-th layer is a new feature matrix \(\boldsymbol {H}^{(l+1)}\) obtained by the following convolutional computation:

$$\displaystyle \begin{aligned} \boldsymbol{H}^{(l+1)} = \sigma (\boldsymbol{\hat A} \boldsymbol{H}^{(l)} \boldsymbol{W}^{(l)}). \end{aligned} $$
(9.1)
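To make Eq. (9.1) concrete, the following is a minimal NumPy sketch of a single GCN layer; the toy graph, the function name, and the choice of ReLU for \(\sigma \) are our own illustration, not part of any cited implementation.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer, Eq. (9.1): H^(l+1) = sigma(A_hat H^(l) W^(l))."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                    # add self-loops: A + I
    d = A_tilde.sum(axis=1)                    # row sums give the degree matrix D
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU as the nonlinearity sigma

# Toy usage: 3 nodes, 4-dim input features, 2-dim output features
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.randn(3, 4)
W = np.random.randn(4, 2)
H_next = gcn_layer(H, A, W)                    # shape (3, 2)
```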

3.3 Hyperboloid Manifold

We provide a brief overview of the critical concepts in hyperbolic geometry. For a more comprehensive description, please refer to [6]. Hyperbolic geometry is a non-Euclidean geometry with a constant negative curvature, where curvature measures how a geometric object deviates from a flat plane. In this work, we use the d-dimensional Poincaré ball model with negative curvature \(-c\) \((c > 0)\): \(P^{(d,c)}=\{ \mathbf {x} \in R^d: \| \mathbf {x} \|{ }^2 < \frac {1}{c}\}\), where \(\| \cdot \|\) is the \(L_2\) norm. For each point \( x \in P^{(d,c)}\), the tangent space \(T^c_x \) is a d-dimensional vector space at point x, which contains all possible directions of paths in \(P^{(d,c)}\) leaving from x. Next, we present several fundamental operations in the hyperbolic space, which play a critical role in our proposed model.

Exponential and Logarithmic Maps

Specifically, let \(\boldsymbol {v}\) be a feature vector in the tangent space \( T^c_{\mathbf {o}}\), where \(\mathbf {o}\) is a point in the hyperbolic space \(P^{(d,c)}\) that serves as a reference point; we take \(\mathbf {o}\) to be the origin, \(\mathbf {o} = 0\). The tangent space \( T^c_{\mathbf {o}} \) can be mapped to \(P^{(d,c)}\) via the exponential map:

$$\displaystyle \begin{aligned} \exp_{\mathbf{o}}^c(\boldsymbol{v}) = \operatorname{tanh}(\sqrt{c}\|\boldsymbol{v}\|)\frac{\boldsymbol{v}}{\sqrt{c}\|\boldsymbol{v}\|} {}. \end{aligned} $$
(9.2)

And conversely, the logarithmic map which maps \(P^{(d,c)}\) to \( T^c_{\mathbf {o}} \) is defined as:

$$\displaystyle \begin{aligned} \log_{\mathbf{o}}^c({\boldsymbol{y}}) = \text{arctanh}(\sqrt{c}\|\boldsymbol{y}\|)\frac{{\boldsymbol{y}}}{\sqrt{c}\|\boldsymbol{y}\|}. \end{aligned} $$
(9.3)

Möbius Addition

Vector addition is not well-defined in the hyperbolic space: directly adding the vectors of two points in the Poincaré ball, as in Euclidean space, could yield a point outside the ball. The Möbius addition [7] provides an analogue of Euclidean addition in the hyperbolic space, denoted by \(\oplus _{c} \):

$$\displaystyle \begin{aligned} {\boldsymbol{h}}_{i} \oplus_{c} \boldsymbol{h}_{j} =\frac{\left(1+2 c\left\langle {\boldsymbol{h}}_{i}, {\boldsymbol{h}}_{j}\right\rangle+c\left\|{\boldsymbol{h}}_{j}\right\|{}^{2}\right) {\boldsymbol{h}}_{i}+\left(1-c\left\|{\boldsymbol{h}}_{i}\right\|{}^{2}\right) {\boldsymbol{h}}_{j}}{1+2 c\left\langle {\boldsymbol{h}}_{i}, {\boldsymbol{h}}_{j}\right\rangle+c^{2}\left\|{\boldsymbol{h}}_{i}\right\|{}^{2}\left\|{\boldsymbol{h}}_{j}\right\|{}^{2}}. \end{aligned} $$
(9.4)
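For concreteness, the three operations above can be written as follows for vectors expressed at the origin; this is a minimal NumPy sketch of Eqs. (9.2)–(9.4), where the small epsilon for numerical stability is our own assumption.

```python
import numpy as np

def exp_map(v, c, eps=1e-15):
    """Exponential map at the origin, Eq. (9.2): tangent space -> Poincare ball."""
    norm = np.linalg.norm(v) + eps
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def log_map(y, c, eps=1e-15):
    """Logarithmic map at the origin, Eq. (9.3): Poincare ball -> tangent space."""
    norm = np.linalg.norm(y) + eps
    return np.arctanh(np.sqrt(c) * norm) * y / (np.sqrt(c) * norm)

def mobius_add(h_i, h_j, c):
    """Mobius addition, Eq. (9.4), an analogue of vector addition in the ball."""
    ip = np.dot(h_i, h_j)                        # inner product <h_i, h_j>
    ni, nj = np.dot(h_i, h_i), np.dot(h_j, h_j)  # squared norms
    num = (1 + 2 * c * ip + c * nj) * h_i + (1 - c * ni) * h_j
    den = 1 + 2 * c * ip + c ** 2 * ni * nj
    return num / den
```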

4 Methodology

In this section, we present our proposed approach HMEA, which operates in the hyperbolic space. The framework is shown in Fig. 9.3. We first adopt HGCN to obtain the structural embeddings of entities. Subsequently, we transform the corresponding entity images into visual embeddings employing the DenseNet model, which are further projected into the hyperbolic space. In the end, we fuse these embeddings in the hyperbolic space and predict the alignment results using the hyperbolic distance. We use the following example to illustrate our proposed model.

Fig. 9.3

The framework of our proposed method

Example Further to the previous example, by using structural information, it is easy to detect that Nolan in \(MG_1\) is equivalent to Nolan in \(MG_2\). However, solely relying on structural data is insufficient and might result in an incorrect alignment of Michael Caine in \(MG_1\) with Christian Bale in \(MG_2\). In this scenario, the utilization of visual information would be highly beneficial as the images of Michael Caine in \(MG_1\) and Christian Bale in \(MG_2\) are significantly dissimilar. Consequently, we consider both structural and visual information for alignment. □

In the following, we elaborate on the various components of our proposal.

4.1 Structural Representation Learning

We acquire the structural representation of MMKGs by employing hyperbolic graph convolutional neural networks, which extend convolutional computation to the manifold space and leverage the effectiveness of both graph neural networks and hyperbolic embeddings. Initially, we transform the input Euclidean features to the hyperboloid manifold. Then, through feature transformation, message passing, and nonlinear activation in the hyperbolic space, we obtain the hyperbolic structural representations.

Mapping Input Features to Hyperboloid Manifold

In general, the input node features are produced by pre-trained Euclidean neural networks, and hence, they exist in the Euclidean space. We begin by establishing a conversion from Euclidean features to the hyperbolic space.

Here, we assume that the input Euclidean features \({{\boldsymbol {x}}^{E}} \in T_{\mathbf {o}}H_c\), where \(T_{\mathbf {o}}H_c\) represents the tangent space at \(\mathbf {o}\), and \(\mathbf {o} \in H_c\) denotes the north pole (origin) in the hyperbolic space. We obtain the hyperbolic feature matrix \({\boldsymbol {x}}^H\) via \( {\boldsymbol {x}}^H = \operatorname {exp}_o^c({\boldsymbol {x}}^{E})\), where \(\operatorname {exp}_o^c(\cdot )\) is defined in Eq. (9.2).

Feature Transformation and Propagation

The core operations in hyperbolic structural learning, similar to GCN, are feature transformation and message passing. While these operations are well-established in the Euclidean space, they are considerably more complex in the hyperboloid manifold. One possible solution is to perform these operations with trainable parameters in the tangent space of a point within the hyperboloid manifold, as the tangent space is Euclidean. To this end, we utilize the \(\exp (\cdot )\) and \(\log (\cdot )\) maps to convert between the hyperboloid manifold and the tangent space. This enables us to make use of the tangent space \(T_{\mathbf {o}}H_c^d\) for executing Euclidean operations.

The initial step involves using the logarithmic map to map the hyperbolic representation \({\boldsymbol {x}}_v^H \in R^{1 \times d}\) of node v to the tangent space \(T_{\mathbf {o}}H_c^d\). Next, in \(T_{\mathbf {o}}H_c^d\), we compute the feature transformation and propagation rule for node v as:

$$\displaystyle \begin{aligned} {\boldsymbol{x}}_v^T = \boldsymbol{\hat A} \log _{\mathbf{o}}^{c}\left({\boldsymbol{x}}_v^H\right)\boldsymbol{W}, \end{aligned} $$
(9.5)

where \(\boldsymbol {x}_v^T \in R^{1\times d'} \) denotes the feature representation in the tangent space and \(\boldsymbol {\hat A}\) represents the symmetric normalized adjacency matrix; \(\boldsymbol {W}\) is a \( d \times d' \) trainable weight matrix.

Nonlinear Activation with Different Curvatures

Once the features have been transformed in the tangent space, a nonlinear activation function \(\sigma ^{\otimes ^{c_{l}, c_{l+1}}}\) is applied to learn nonlinear transformations. Specifically, in the tangent space \( T_{\mathbf {o}} H^{d}_{c_{l}} \) of layer l, Euclidean nonlinear activation is performed before mapping the features to the manifold of the next layer:

$$\displaystyle \begin{aligned} \sigma^{\otimes^{c_{l}, c_{l+1}}}\left(\boldsymbol{x}_v^T\right)=\exp _{\mathbf{o}}^{c_{l+1}}\left(\sigma\left(\log _{\mathbf{o}}^{c_{l}}\left(\boldsymbol{x}_v^T\right)\right)\right), \end{aligned} $$
(9.6)

where the hyperbolic curvatures at layer l and \(l+1\) are denoted as \(-1/c_{l}\) and \(-1/c_{l+1}\), respectively. The activation function \(\sigma \) used is the \(\operatorname {ReLU}(\cdot )\) function. This step is critical in enabling us to vary the curvature smoothly at each layer, which is necessary for achieving good performance due to limitations in machine precision and normalization.

Based on the hyperboloid feature transformation and nonlinear activation, the convolutional computation in the hyperbolic space is redefined as:

$$\displaystyle \begin{aligned} \begin{aligned} {\boldsymbol{H}}^{l+1} = \exp _{\mathbf{o}}^{c_{l+1}}\left(\sigma\left( \boldsymbol{\hat A} \log _{\mathbf{o}}^{c_l}\left({\boldsymbol{H}}^{l}\right)\boldsymbol{W}\right)\right), \end{aligned} \end{aligned} $$
(9.7)

where \({\boldsymbol {H}}^{l+1}\in R^{n \times d^{l+1}}\) and \({\boldsymbol {H}}^{l} \in R^{n \times d^l}\) denote the learned node embeddings in the hyperbolic space at layers \(l+1\) and l, respectively. The initial embeddings are \({\boldsymbol {H}}^{0} = {\boldsymbol {x}}^{H}\). The symmetric normalized adjacency matrix is represented by \(\boldsymbol {\hat A}\), and the trainable weight matrix \(\boldsymbol {W}\) has dimensions \(d^l \times d^{l+1}\).
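Putting the three steps together, a single hyperbolic graph convolution of Eq. (9.7) can be sketched as below, reusing exp_map and log_map from Sect. 9.3.3; the row-wise looping is a simplification of our own, not an optimized implementation.

```python
def hgcn_layer(H_hyp, A_hat, W, c_l, c_next):
    """One hyperbolic graph convolution, Eq. (9.7).

    H_hyp: (n, d^l) node embeddings on the Poincare ball of curvature -c_l.
    A_hat: (n, n) symmetric normalized adjacency matrix.
    W:     (d^l, d^{l+1}) trainable weight matrix.
    """
    # 1. Map each node to the tangent space at the origin (Eq. 9.3)
    H_tan = np.stack([log_map(h, c_l) for h in H_hyp])
    # 2. Euclidean feature transformation and propagation (Eq. 9.5)
    H_tan = A_hat @ H_tan @ W
    # 3. ReLU activation, then map back with the next layer's curvature (Eq. 9.6)
    H_tan = np.maximum(H_tan, 0.0)
    return np.stack([exp_map(h, c_next) for h in H_tan])
```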

4.2 Visual Representation Learning

The DenseNet model [11], pre-trained on the ImageNet dataset [5], is used to learn image embeddings. The softmax layer in DenseNet is removed, and 1920-dimensional embeddings are obtained for all images in the MMKGs. These embeddings are then projected into the hyperbolic space using HGCN to enhance their expressive power.
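As a rough illustration of this feature-extraction step, the sketch below assumes torchvision's DenseNet-201 variant, whose penultimate feature map has 1920 channels, matching the dimensionality above; the exact model variant and preprocessing are our assumptions rather than details given by the chapter.

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# DenseNet pre-trained on ImageNet; we assume the DenseNet-201 variant,
# whose features before the (removed) classifier are 1920-dimensional.
net = models.densenet201(pretrained=True)
net.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_embedding(path):
    """Return a 1920-dim embedding for one entity image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        f = net.features(x)                               # (1, 1920, 7, 7)
        f = torch.nn.functional.relu(f)
        f = torch.nn.functional.adaptive_avg_pool2d(f, 1)
    return f.flatten(1)                                   # (1, 1920)
```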

4.3 Multi-Modal Information Fusion

Both visual and structural information can impact the alignment results. To combine these two types of information, we propose a novel method that merges the structural information and visual information of MMKGs. Specifically, we obtain the merged representation of entity \({\mathbf {e}}_i\) in the hyperbolic space using the following approach:

$$\displaystyle \begin{aligned} {\boldsymbol{h}}_{i} =\big(\beta \cdot{\boldsymbol{H}}_{s}^i\big) \oplus_{c} \big((1-\beta) \cdot {\boldsymbol{H}}_{v}^i\big), \end{aligned} $$
(9.8)

where \({\boldsymbol {H}}_{s}\) and \({\boldsymbol {H}}_{v}\) are the structural and visual embeddings learned by the HGCN model, respectively; the hyper-parameter \(\beta \) adjusts the relative weight of the structural and visual features in the final merged representation. The Möbius addition operator \(\oplus _c\) is used to combine the structural and visual embeddings; note that this requires the dimensions of the structural and visual representations to be identical.
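A direct reading of Eq. (9.8) as code, reusing mobius_add from Sect. 9.3.3 (a sketch; the function name is ours):

```python
def fuse(h_s, h_v, beta, c):
    """Merge the structural and visual embeddings of one entity, Eq. (9.8)."""
    return mobius_add(beta * h_s, (1.0 - beta) * h_v, c)
```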

4.4 Alignment Prediction

To predict the alignment results, we compute the distance between the entity representations from two MMKGs. The Euclidean distance and Manhattan distance are popular distance measures used in the Euclidean space [15, 30]. However, in the hyperbolic space, we must use the hyperbolic distance between nodes as the distance measure. For entities \(e_i\) in \(MG_1\) and \(e_j\) in \(MG_2\), the distance is defined as:

$$\displaystyle \begin{aligned} d_{c}\left({\boldsymbol{h}}_{i}, {\boldsymbol{h}}_{j}\right)= ||(-{\boldsymbol{h}}_{i}) \oplus_{c} {\boldsymbol{h}}_{j}||, \end{aligned} $$
(9.9)

where \({\boldsymbol {h}}_{i}\) and \({\boldsymbol {h}}_{j}\) denote the merged embeddings of \(e_i\) and \(e_j\) in the hyperbolic space, respectively; \(\| \cdot \|\) is the \(L_1\) norm; the operator \(\oplus _c\) is the Möbius addition.

We expect the distance to be small for equivalent entities and large for nonequivalent ones. To align a specific entity \(e_i\) in \(MG_1\), our approach calculates the distances between \(e_i\) and all entities in \(MG_2\) and presents a ranked list of entities as candidate alignments.
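The distance of Eq. (9.9) and the candidate ranking can be sketched as follows (function names are illustrative):

```python
def hyperbolic_distance(h_i, h_j, c):
    """Eq. (9.9): L1 norm of (-h_i) Mobius-added to h_j."""
    return np.abs(mobius_add(-h_i, h_j, c)).sum()

def rank_candidates(h_query, candidates, c):
    """Sort MG_2 entities by increasing hyperbolic distance to an MG_1 entity."""
    dists = [hyperbolic_distance(h_query, h, c) for h in candidates]
    return np.argsort(dists)   # indices of candidate alignments, best first
```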

4.5 Model Training

To embed equivalent entities as closely as possible in the vector space, we utilize a set of established entity alignments (known as seed entities) S as training data to train the model. Specifically, we minimize the margin-based ranking loss function during model training:

$$\displaystyle \begin{aligned} \begin{aligned} L =& \sum_{(e, v) \in S} \sum_{(e^{\prime}, v^{\prime}) \in S_{(e,v)}^{\prime}}\left[d_{c}\left({\boldsymbol{h}}_{e}, {\boldsymbol{h}}_{v}\right)+\gamma - d_{c}\left({\boldsymbol{h}}_{e^{\prime}}, {\boldsymbol{h}}_{v^{\prime}}\right)\right]_{+} \end{aligned}, \end{aligned} $$
(9.10)

where \([x]_+ = \max \{{0,x}\}\); \((e,v)\) represents a seed entity pair and S is the set of entity pairs; \(S_{(e,v)}^\prime \) represents the set of negative instances created by altering \((e, v)\), i.e., by substituting e or v with a randomly selected entity from either \(MG_1\) or \(MG_2\); \(\gamma > 0\) denotes the margin hyper-parameter that separates positive and negative instances. The margin-based loss function stipulates that the distance between entities in positive pairs should be small, and the distance between entities in negative pairs should be large.
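The loss of Eq. (9.10) can be sketched as below; a real implementation would compute it with autograd-friendly tensor operations, so this loop form is only illustrative.

```python
def ranking_loss(pos_pairs, neg_pairs_of, c, gamma):
    """Margin-based ranking loss, Eq. (9.10).

    pos_pairs:    list of (h_e, h_v) embeddings of seed entity pairs.
    neg_pairs_of: neg_pairs_of[i] holds the corrupted pairs of pos_pairs[i].
    """
    loss = 0.0
    for i, (h_e, h_v) in enumerate(pos_pairs):
        d_pos = hyperbolic_distance(h_e, h_v, c)
        for h_e2, h_v2 in neg_pairs_of[i]:
            d_neg = hyperbolic_distance(h_e2, h_v2, c)
            loss += max(0.0, d_pos + gamma - d_neg)   # [x]_+ = max{0, x}
    return loss
```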

5 Experiment

5.1 Dataset and Evaluation Metric

In this study, we utilized datasets sourced from FreeBase, DBpedia, and YAGO, which were created by Liu et al. [16]. These datasets were developed by starting with FB15K to establish multi-modal knowledge graphs, which were then aligned with entities from other knowledge graphs such as DB15K and YAGO15K through reference links. Our experiments focused on two pairs of multi-modal knowledge graphs: FB15K-DB15K and FB15K-YAGO15K.

Due to the absence of original images in the datasets, we acquired the corresponding images for each entity using the URIs provided in [17]. To achieve this, we developed a Web crawler that can extract query results from image search engines, i.e., Google Images, Bing Images, and Yahoo Image Search. Following this, we allocated the images obtained from various search engines to different MMKGs, thereby showcasing the dissimilarity among different MMKGs.

The detailed information on the datasets is provided in Table 9.1. Each dataset comprises approximately 15,000 entities and over 11,000 sets of entity images. The Images column reports the number of entities that possess image sets. The known alignments between the KGs are given by previously established SameAs predicates. In the experiments, these known equivalent entity pairs are used for model training and testing.

Table 9.1 Statistics of the MMKG datasets

Evaluation Metric

We utilize \(Hits@k\) as the evaluation metric to gauge the efficacy of all the approaches. This metric determines the percentage of correctly aligned entities that are ranked among the top-k candidates.
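Concretely, given the rank of each test entity's true counterpart among all candidates, \(Hits@k\) can be computed as in this small sketch:

```python
def hits_at_k(ranks, k):
    """Fraction of test entities whose true counterpart ranks in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Example: ranks of the correct alignments for five test entities
print(hits_at_k([1, 3, 12, 2, 54], k=10))   # -> 0.6
```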

5.2 Experimental Setting and Competing Approaches

Experimental Setting

To analyze the effectiveness of the methods across various percentages \(P(\%)\) of the provided alignments, we evaluate the methods with low (\(20\%\)), medium (\(50\%\)), and high (\(80\%\)) percentages of the given seed entity pairs. The remaining sameAs triples are used for testing. To ensure fairness, we have maintained the same number of dimensions (i.e., 400) for both GCN-Align and HMEA. The other parameters of GCN-Align follow [29]. For the parameters of our approach HMEA, we create six negative samples for each positive sample. The margin hyper-parameters used in the loss function are \(\gamma _{\textsf { HMEA} -s} = 0.5\) and \(\gamma _{\textsf { HMEA} -v} = 1.5\), respectively. We optimize HMEA using the Adam optimizer.

Competing Approaches

To showcase the effectiveness of our proposed model, we have selected three state-of-the-art approaches as competitors:

  • GCN-Align [29] utilizes GCN to encode the structural information of entities and then combines relation and image embeddings for the purpose of entity alignment.

  • PoE [16] is based on the product-of-experts model. It computes the scores of facts under each modality and learns the entity embeddings for entity alignment. PoE combines information from two modalities. Additionally, we compare our approach with the PoE-s variant, which solely utilizes the structural information.

  • IKRL [31] integrates image representations into an aggregated image-based representation via an attention-based method. The method was initially proposed in the domain of knowledge representation, and we adapted it to address the MMEA problem.

In order to showcase the advantages of hyperbolic geometry, particularly in the learning of structural features, we have conducted preliminary experiments which solely utilize the structural information for EA, resulting in HMEA-s, GCN-Align-s, and PoE-s. In addition, to evaluate the contribution of visual information, we compare PoE, GCN-Align, and HMEA with just visual information, namely, PoE-v, GCN-Align-v, and HMEA-v.

5.3 Results

Table 9.2 displays the results, indicating that HMEA exhibits the most superior performance in all scenarios. Notably, in the case of FB15K-YAGO15K with 80% seed entity pairs, HMEA outperforms PoE and GCN-Align by almost 15% in terms of \(Hits@1\). With 20% seed entity pairs, our approach also shows better results and the improvement of \(Hits@1\) is around 2% and \(Hits@10\) is up to 20%. Based on the results obtained from PoE, it is evident that there is only a slight improvement in performance from \(Hits@1\) to \(Hits@10\), with the range being between 4 and 9%. In contrast, the enhancements in performance from \(Hits@1\) to \(Hits@10\) observed for HMEA are at least 20% across all scenarios. Moreover, it is worth noting that HMEA achieves significantly better results than IKRL.

Table 9.2 Alignment prediction on both datasets for different percentages of P

Table 9.3 demonstrates that even when utilizing solely structural information, HMEA-s still achieves superior results compared to the other two methods. Specifically, our proposed approach outperforms GCN-Align-s by almost 5% in terms of \(Hits@1\) on FB15K-DB15K and by 3% on FB15K-YAGO15K with 20% seed alignments. When using 50 and 80% seed entity pairs, HMEA-s shows significant improvements in performance. The improvements range from 10 to 18% regarding \(Hits@1\) and from 20 to 30% in terms of \(Hits@10\). These results suggest that our approach excels in capturing precise hierarchical structural representations.

Table 9.3 Results of three methods with structural information

Table 9.4 presents the results when incorporating visual information into the model. We compare the performance of three variants: PoE-v, GCN-Align-v, and HMEA-v. The results indicate that GCN-Align-v does not produce valuable visual representations for MMEA. HMEA-v, in contrast, achieves better results than PoE-v even though both rely solely on visual information. Specifically, our proposed approach outperforms PoE-v slightly on both datasets in terms of \(Hits@1\), by less than 1% with 20% seed alignments. On the FB15K-DB15K dataset, when using 80% seeds, our proposed approach HMEA-v demonstrates significant improvements in performance. The improvements are around 7% regarding \(Hits@1\) and 18% in terms of \(Hits@10\). These results indicate that our proposed method is effective in learning visual features and incorporating them into the model to improve the overall performance.

Table 9.4 Comparison of three methods with visual information

5.4 Ablation Experiment

In this work, we consider multiple modalities of information in MMKGs. Specifically, we take into account the structural and visual aspects of the information. To further confirm the usefulness of multi-modal knowledge for MMEA, we carry out an ablation experiment. Comparing HMEA and HMEA-s in Tables 9.2 and 9.3, we observe that incorporating visual information in our approach results in slightly better performance; the improvements are approximately 1% in terms of \(Hits@1\). Moreover, by comparing HMEA and HMEA-v in Tables 9.2 and 9.4, we can also conclude that the structural information plays a significant role. From the ablation study, we conclude that MMEA primarily relies on the structural information, but the visual information still plays a useful role. Furthermore, the study also highlights that the combination of these two types of information leads to even better results.

5.5 Case Study

A key property of hyperbolic spaces is their exponential expansion, which means that they expand much faster than Euclidean spaces, whose expansion is polynomial. This property can be advantageous for distinguishing between similar entities, since the neighbor nodes of a central node can be distributed in a larger space, resulting in greater distances between them.

To demonstrate the effectiveness of hyperbolic embeddings, we conducted a case study using Michael Caine as the root node. We visualized the embeddings of 1-hop film-related entities learned from GCN-Align and HMEA separately, in the PCA-projected spaces shown in Fig. 9.4. We observed that for entities of the same type or with similar structural information, such as the entities Alfie and B-o-B, their Euclidean embeddings (generated via GCN-Align) are placed closely together. In contrast, the distances between such entities in the hyperbolic space are relatively larger, with only a few exceptions. This validates that the hyperbolic structural representation can help distinguish between similar entities. Furthermore, by placing similar entities (in the same KG) far apart, the hyperbolic representation can facilitate the alignment process across KGs.

Fig. 9.4

The embeddings of 1-hop film-related neighbor entities of Michael Caine generated from GCN-Align and HMEA separately in the PCA-projected space. The green points represent entities in FB15K; red points represent entities in DB15K. For simplicity, we annotate only part of the entities. B-o-B is the abbreviation of Battle of Britain. (a) Embeddings generated from GCN-Align. (b) Embeddings generated from HMEA

An example can be seen in Fig. 9.4a, where entity Alfie in FB15K is closest to entity B-o-B, which is incorrect. However, in Fig. 9.4b, entity B-o-B is placed far away from Alfie, and the closest entity to Alfie is its equivalent entity in DB15K. By using hyperbolic projections, similar entities in the same KG are well distinguished and placed far apart, reducing the likelihood of alignment mistakes.

5.6 Additional Experiment

The cross-lingual EA datasets are the most commonly used datasets for evaluating EA methods. We included experiments on these datasets to demonstrate that our proposed approach is also effective on popular datasets, including the cross-lingual EA task. Note that diverse languages are not taken as multiple modalities; cross-lingual EA is in essence single-modal EA. We use the DBP15K datasets in the experiments, which were built by Sun et al. [24]. As shown in Table 9.5, the datasets were generated from DBpedia, which contains rich inter-language links between different language versions of Wikipedia. Each dataset contains data in different languages and 15,000 known inter-language links connecting equivalent entities in two KGs, which are used for model training and testing. Following the setting in [29], we use \(30\%\) of the inter-language links for training and \(70\%\) of them for testing. \(Hits@k\) is used as the evaluation measure.

Table 9.5 Details of the cross-lingual datasets

The dimensions of both structural and attribute embeddings were set to 300 for GCN-Align. GCN-Align-s and HMEA-s adopt only structural information; GCN-Align-a and HMEA-a adopt only attribute information; and GCN-Align and HMEA combine both the structural and attribute information.

Table 9.6 shows that in all datasets, HMEA-s outperforms GCN-Align-s, with improvements of around 7% in terms of \(Hits@1\) and more than 10% in terms of \(Hits@10\). These results demonstrate that HMEA benefits from hyperbolic geometry and is able to capture better structural features. Furthermore, our proposed approach achieves better results compared to GCN-Align as it combines both structural and attributive information, resulting in an approximately 10% increase in \(Hits@1\). Regarding attribute information, it is worth noting that our approach, HMEA-a, outperforms GCN-Align-a by a significant margin. Specifically, our approach achieves an approximately 15% improvement in \(Hits@1\) across all datasets.

Table 9.6 Results on the cross-lingual datasets

6 Conclusion

This chapter introduces our proposed approach, HMEA, a multi-modal EA approach designed to efficiently integrate multi-modal information for EA in MMKGs. To achieve this, our approach extends the Euclidean representation to a hyperboloid manifold and employs HGCN to learn structural embeddings of entities. Additionally, we leverage a more advanced model, DenseNet, to learn more accurate visual embeddings. These structural and visual embeddings are then aggregated in the hyperbolic space to predict potential alignments. We validate the effectiveness of our proposed approach through comprehensive experimental evaluations. Additionally, we conduct further experiments that confirm the superior performance of HGCN in learning structural features of knowledge graphs in the hyperbolic space.