1 Introduction

Entity alignment (EA) aims to find entities in different knowledge graphs (KGs) that refer to the same real-world object. It plays an important role in KG construction and knowledge fusion, as KGs are often created independently and suffer from incompleteness. Most existing models for EA leverage graph structures and/or side information of entities, such as names and attributes, together with KG embedding techniques to achieve alignment [1, 2]. Several recent methods enrich entity representations by incorporating images, a natural component of entity profiles in many KGs such as DBpedia [3] and Wikidata [4], to address EA from a multimodal view [5,6,7].

While experimental results have demonstrated that incorporating visual context benefits the EA task [5, 7], it is worth noting that the use of entity images may introduce noise. An error analysis in EVA [7] pointed out that hundreds of source entities were correctly matched to their counterparts before injecting images but were mismatched once images were present. Divergent visual representations of equivalent entities are a potential source of noise that induces mismatches, and there are various reasons for the visual inconsistency between two equivalent entities. One major reason is that entities naturally have multiple visual representations. As shown in Fig. 1, the images (visual context) on the left are dissimilar from their counterparts on the right, yet they refer to the same real-world entities. In addition, the incompleteness of visual data is also a challenging issue for multimodal EA: as reported in [7], ca. 15–50% of the entities in the most commonly used benchmark DBP15K [8] are not provided with images.

Fig. 1

Thumbnail examples of DBpedia entities. A and C correspond to entities \({Oakland\_(Californie)}\) and \({Little\_Mix}\) in the French version of DBpedia, respectively. B and D correspond to entities \({Oakland,\_California}\) and \({Little\_Mix}\) in the English version of DBpedia, respectively

The aforementioned observations raise a question: to what extent, or under what circumstances, is visual context truly helpful to the EA task? Is there a way to filter potential noise and make better use of entity images? To investigate these issues, in this work we propose MMEA-s+v, a simple approach that combines embedding similarities between entities from the structural and visual modalities at the output level. To fully exploit visual context, we explore a mechanism based on classification techniques and entity types to identify potential visual noise and generate binary entity mask vectors, which are used with MMEA-s+v to filter images during alignment learning and inference.

In summary, our main contributions are three-fold: (1) To the best of our knowledge, we are the first to investigate the positive and negative aspects of incorporating visual context for EA. We provide insights into the actual visual noise that tends to induce misalignment in multimodal EA. (2) We explore a mechanism based on classification techniques and entity types to locate potential visual noise, and conduct extensive experiments to examine this mechanism. (3) We construct a multimodal version of DBP15K which contains a full set of entity images. With the proposed dataset, we hope to facilitate the community in the development of multimodal learning approaches for KGs.

2 Related Work

Embedding-based approaches for entity alignment (EA) can generally be divided into two categories: those that only utilize graph structures and those that additionally use side information of entities [2]. Among the first category, MTransE [9] adopted TransE [10] to encode language-specific KGs in separate embedding spaces and learned a transformation to align counterpart entities across embeddings. IPTransE [11] and BootEA [12] embedded two KGs in a unified space and bootstrapped the labeled alignments iteratively. Among the second category, GCN-Align [13], JAPE [8] and AttrE [14] used attribute triples in the KGs to refine structural embeddings. MultiKE [15] explored more types of features, learning entity embeddings from three different views including entity names, relations and attributes. HMAN [16] further exploited literal descriptions of entities to boost performance. UEA [17] utilized useful features from side information in an unsupervised framework to perform EA in the open world.

Recently, a few attempts have been made to incorporate entity images into KGs and build multimodal embeddings for EA. MMEA [5] applied TransE to learn structural embeddings for entities, and utilized image features to learn visual representations. It integrated multiple representations of entities via common space learning. HMEA [6] adopted hyperbolic graph convolutional networks (HGCNs) to learn structural and visual embeddings of entities separately, then merged them in the hyperbolic space by a weighted Mobius addition. EVA [7] employed GCNs [18] to learn structural representations for entities, and used feed-forward networks to learn embeddings from image, relation and attribute features, respectively. It then fused the embeddings of different modalities by a trainable weighted concatenation. MCLEA [19] considered task-oriented modality modeling and utilized contrastive learning to model the intra-modal and inter-modal interactions for each entity representation. Although existing multimodal entity alignment approaches have shown promising performance, they all ignore the potential negative impact of leveraging visual context for EA.

3 Method

We start with the task definition and notations. A KG is denoted as \(G = \left( E, R, T, I\right)\), where E, R, T, I are the sets of entities, relations, triples and images, respectively. Given a source KG \(G_{1} = \left( E_{1}, R_{1}, T_{1}, I_{1}\right)\) and a target KG \(G_{2} = \left( E_{2}, R_{2}, T_{2}, I_{2}\right)\), multimodal entity alignment (MMEA) aims to find every pair \((e_{1}, e_{2})\) where \(e_{1} \in E_{1}\), \(e_{2} \in E_{2}\) and \(e_{1}\) and \(e_{2}\) refer to the same real-world object. To solve this task, we adopt different encoders to encode the structural information and visual context of entities, and propose a late fusion mechanism which combines embedding similarity scores at the output level to find alignments. We name this approach MMEA-s+v. For comparison, we also present two variants, MMEA-avg and MMEA-cat, which adopt different early fusion strategies and learn multimodal joint embeddings to achieve alignment. The main structures of the three variants are shown in Fig. 2. We further explore a mechanism to filter potential visual noise and generate entity mask vectors, which are used with MMEA-s+v to better exploit visual context for EA. Section 3.1 details how we identify visual noise. Sections 3.2 and 3.3 focus on entity representation learning, alignment learning and inference.

Fig. 2

An illustration of the framework, including the entity embedding module, the two multimodal early fusion strategies, and the visual noise identification and late fusion mechanisms

3.1 Visual Noise Identification

We observe that, in most cases, the visual representations of entities vary largely from one type to another, while they differ less within a type. Based on this observation, we take entity types as the classes of images to train a classifier, and use it to identify images whose predicted class is semantically distant from their actual class, i.e., visual noise. To this end, we obtain entity types and inter-class conflicts from the ontology of the KGs, and design mask vectors to store the identification results.

3.1.1 Entity Types

The ontology of a KG usually contains properties and hierarchical classes, and defines subsumption relationships between classes and, optionally, class disjointness [20]. Types (classes) are often organized in a hierarchical tree structure in the ontology of a KG, and an entity is often associated with a set of types. For example, as shown in Fig. 3, the entity Barack Obama in French DBpedia has four types declared (we do not include the root \({owl\#Thing}\)): Agent, Person, Politician and President, with Agent as the most generic type and President as the most specific and a leaf node. We observe that entities of fine-grained types like President and Senator differ more semantically than visually; therefore, we take the type of each entity at most at the fourth level (Politician in this example) as the label of its image. We also empirically find that the choice of the fourth level, rather than the third or the fifth, yields better classification performance.
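For illustration, the following minimal sketch implements this labeling rule, assuming the class hierarchy is available as a child-to-parent mapping (the `parent_of` dictionary and helper names are hypothetical); levels are counted from the root \({owl\#Thing}\), which reproduces the Politician example above.

```python
def path_from_root(cls, parent_of, root="owl#Thing"):
    """Return the path [root, ..., cls] from the root down to `cls`."""
    path = [cls]
    while cls != root:
        cls = parent_of[cls]
        path.append(cls)
    return list(reversed(path))

def image_label(entity_types, parent_of, max_level=4):
    """Label of an entity's image: its most specific declared type,
    truncated to at most the fourth level of the hierarchy."""
    # pick the deepest declared type, e.g. President for Barack Obama
    deepest = max(entity_types, key=lambda t: len(path_from_root(t, parent_of)))
    path = path_from_root(deepest, parent_of)
    # e.g. [owl#Thing, Agent, Person, Politician, President] -> Politician
    return path[:max_level][-1]
```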

Fig. 3

Subfigure a is an example of hierarchical classes. Subfigure b presents four entities (denoted by red text), their finest types in parentheses and their thumbnails. Because we take entity types at most at the fourth level as image labels, Barack_Obama and Jeff_Flake share the same label Politician, while the labels of Rich_Nash and Lake_Ontario are Athlete and BodyOfWater, respectively

3.1.2 Inter-class Conflicts

To measure the semantic discrepancy between the predicted and actual classes of an entity image, and inspired by OntoEA [21], we use a class conflict dictionary (CCD) to store the inter-class conflicts. Given two classes a and b, we calculate their conflict degree as C[a, b]. For better illustration, we let V denote the hierarchical class tree in which each node refers to a unique class and o the root (typically \({owl\#Thing}\)), and define \(S_{x}^{c}\) as the set of children (subclasses) of node x and \(S_{x}^{d}\) as the set of all the descendants of x in V, respectively. We assume that all subclasses of the root in V are mutually disjoint, which is in accordance with the design intent of the class hierarchy, and we regard any two descendants of two disjoint classes as disjoint. Let D denote the set of all disjoint class pairs, thus \(D=\{(a,b) \mid a,b \in S_{o}^{c}, a \ne b \; {\text {or}} \; \exists c_{1},c_{2} \in S_{o}^{c}, a \in S_{c_{1}}^{d}, b \in S_{c_{2}}^{d}, c_{1} \ne c_{2}\}\). Given two classes a and b, we first determine whether \(a \equiv b\) or \(a \in S_{b}^{d}\) or \(b \in S_{a}^{d}\), and set \(C[a,b]=0\) if the condition is satisfied, which ensures that a class does not conflict with itself or its descendant classes; otherwise we look up D and set \(C[a,b]=1\) if \((a,b) \in D\), i.e., two disjoint classes are treated as fully conflicting. If neither of the above two conditions is met, we follow OntoEA and calculate C[a, b] as:

$$\begin{aligned} C[a,b]=1-\frac{\vert S(a) \cap S(b) \vert }{\vert S(a) \cup S(b)\vert }, \end{aligned}$$
(1)

where S(a) and S(b) denote the sets of classes passed by routing from a and b to the root class, respectively, and \(\mid \cdot \mid\) denotes the set cardinality.
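A minimal sketch of the conflict-degree computation described above and in Eq. (1), assuming the class tree is given as a child-to-parent mapping and the set D of disjoint pairs has been enumerated as defined; helper names are illustrative.

```python
def root_path(cls, parent_of, root="owl#Thing"):
    """S(x): the set of classes passed when routing from `cls` up to the root."""
    nodes = {cls}
    while cls != root:
        cls = parent_of[cls]
        nodes.add(cls)
    return nodes

def conflict_degree(a, b, parent_of, disjoint_pairs, root="owl#Thing"):
    """C[a, b] as defined in Sect. 3.1.2."""
    sa, sb = root_path(a, parent_of, root), root_path(b, parent_of, root)
    # a class does not conflict with itself, its ancestors or its descendants
    if a == b or a in sb or b in sa:
        return 0.0
    # (descendants of) mutually disjoint top-level classes fully conflict
    if (a, b) in disjoint_pairs or (b, a) in disjoint_pairs:
        return 1.0
    # otherwise Eq. (1): Jaccard distance between the two paths to the root
    return 1.0 - len(sa & sb) / len(sa | sb)
```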

3.1.3 Entity Mask

We use \({\textbf{M}}\) as an entity mask vector and denote by \({\textbf{M}}_{{e}_{i}}\) the mask value of the ith entity \(e_{i}\) in E. If the image of \(e_{i}\) is determined to be potential noise, we set \({\textbf{M}}_{{e}_{i}}=0\), which means \(e_{i}\) is masked and its image is filtered out in the training or test phase; otherwise, we set \({\textbf{M}}_{{e}_{i}}=1\). Note that the length of \({\textbf{M}}\) equals the total number of source and target entities in a dataset. We initialize \({\textbf{M}}\) with all zeros and update it iteratively. Specifically, given a conflict degree threshold \(\lambda\), for each \(e \in E\), we feed its corresponding image to a classifier to obtain the top k predictions (denoted as \(p_{1}\), ..., \(p_{k}\)), and if the minimum conflict degree between the predictions and the actual class (denoted as g) of e is no greater than \(\lambda\), i.e., \(\min _{1 \le i \le k}\left\{ C\left[ p_{i}, g\right] \right\} \le \lambda\), we reset the mask value of e to 1.
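The mask update can be sketched as follows, assuming a fine-tuned classifier exposing a top-k prediction interface and a two-argument `conflict_degree` callable (e.g., the function of Sect. 3.1.2 with the ontology arguments bound); the `top_k` method and data structures are hypothetical, and \(\lambda =0.4\) is just one of the thresholds examined in Sect. 4.2.2.

```python
import numpy as np

def build_entity_mask(entities, images, gold_class, classifier, conflict_degree,
                      lam=0.4, k=5):
    """Binary mask M: M[i] = 1 keeps the image of the i-th entity, 0 filters it out."""
    mask = np.zeros(len(entities), dtype=np.int64)      # initialize with all zeros
    for i, e in enumerate(entities):
        if e not in images:                              # entities without images stay masked
            continue
        preds = classifier.top_k(images[e], k=k)         # hypothetical top-k prediction API
        g = gold_class[e]
        # keep the image if at least one prediction is close enough to the actual class
        if min(conflict_degree(p, g) for p in preds) <= lam:
            mask[i] = 1
    return mask
```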

3.2 Entity Embedding

To better analyze the impacts of visual context on MMEA, we only model two modalities in the entity embeddings, i.e., graph structures and visual context.

3.2.1 Structural Embedding

Graph convolutional networks (GCNs) have proven to be effective in capturing information from graph structures and have been used for embedding-based EA recently [1]. Formally, given as input the adjacency matrix \({\textbf{A}}\) of a KG and randomly initialized feature matrix \({\textbf{H}}^{(0)}\) of its entities, a multi-layer GCN iteratively updates entity representations from the ith layer to the \((i+1)\)th layer with the following propagation rule:

$$\begin{aligned} {\textbf{H}}^{(i+1)}=\phi \left( {\hat{\textbf{D}}}^{-\frac{1}{2}} {\hat{\textbf{A}}} {\hat{\textbf{D}}}^{-\frac{1}{2}} \textbf{H}^{(i)} {\textbf{W}}^{(i)}\right) , \end{aligned}$$
(2)

where \({\hat{\textbf{A}}}={\textbf{A}}+{\textbf{I}}\) and \({\textbf{I}}\) is an identity matrix, \({\hat{\textbf{D}}}\) is the diagonal degree matrix of \({\hat{\textbf{A}}}\), \({\textbf{W}}^{(i)}\) denotes learnable parameters in the ith layer and \(\phi\) is the activation function ReLU. Following previous works [13, 16], we adopt GCNs to encode the neighborhood information of entities and take the output of the last GCN layer as the structural embeddings.
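For concreteness, a minimal PyTorch sketch of the propagation rule in Eq. (2); it uses a dense adjacency matrix for clarity, whereas a sparse implementation would be used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One propagation step of Eq. (2): H^(i+1) = ReLU(D^-1/2 (A+I) D^-1/2 H^(i) W^(i))."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)   # add self-loops: A + I
        deg = A_hat.sum(dim=1)                               # degrees of A_hat
        D_inv_sqrt = torch.diag(deg.pow(-0.5))               # D^{-1/2}
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt             # symmetric normalization
        return F.relu(A_norm @ H @ self.weight)              # phi = ReLU
```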

3.2.2 Visual Embedding

We choose ResNet-152 [22] pre-trained on the ImageNet [23] recognition task as the initial image classifier and fine-tune it on our datasets for EA. The fine-tuning details are given in Sect. 4.1. The fine-tuned model is used to extract image features. We feed each image \(i \in I\) through a forward pass and take the output of the last layer before the logits as its feature vector. Then, we project the feature into a low-dimensional space by a linear transformation to obtain the visual embedding \(\textbf{e}_{v}\):

$$\begin{aligned} {{\textbf{e}}}_{v}={\textbf{W}}_{v}\cdot {\text {ResNet}}(i) + \textbf{b}_{v}, \end{aligned}$$
(3)

where \({\textbf{W}}_{v}\) is the projection matrix and \(\textbf{b}_{v}\) is the bias vector.
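A sketch of the feature extraction and projection in Eq. (3) with torchvision's ResNet-152 (2048-dimensional penultimate features); in our setting the backbone would be the fine-tuned classifier of Sect. 4.1, and the projection layer is trained with the alignment objective.

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()        # keep the 2048-d features before the logits
backbone.eval()

proj = nn.Linear(2048, 200)        # W_v and b_v; 200-d visual embeddings (Sect. 4.1.2)

def visual_embedding(image_batch):
    """image_batch: (B, 3, 224, 224) preprocessed images -> (B, 200) visual embeddings."""
    with torch.no_grad():                  # image features are extracted once and kept fixed
        feats = backbone(image_batch)      # ResNet(i): output of the last layer before logits
    return proj(feats)                     # e_v = W_v · ResNet(i) + b_v  (Eq. (3))
```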

3.2.3 Multimodal Representation

Given an entity e, its structural embedding \({{\textbf{e}}_{s}}\) and visual embedding \({{\textbf{e}}}_{v}\), we present two strategies to combine \({{\textbf{e}}_{s}}\) and \({{\textbf{e}}}_{v}\) into a multimodal embedding \({{\textbf{e}}}\) as the joint representation of entity e.

(1) Weighted concatenation. Following the same setting in EVA [7], we calculate \({{\textbf{e}}}\) as:

$$\begin{aligned} {{\textbf{e}}}=\frac{e^{w_{\textrm{s}}}}{e^{w_{\textrm{s}}}+e^{w_{v}}}{{\textbf{e}}}_{\textrm{s}} \oplus \frac{e^{w_{v}}}{e^{w_{\textrm{s}}}+e^{w_{v}}}{{\textbf{e}}}_{v}, \end{aligned}$$
(4)

where \(w_{\textrm{s}}\) and \(w_{v}\) represent the weight of structural modality and the weight of visual modality, respectively, and both are trainable during learning. The symbol \(\oplus\) means concatenation of embeddings. We denote the variant using this kind of fusion as MMEA-cat, which is the same as the setting in EVA where only structural information and visual context are kept.

(2) Weighted averaging. We have \({{\textbf{e}}}= \sum _{i \in \{s, v\}}w_{i}{{\textbf{e}}}_{i}\), where \(w_{i}\) is calculated by:

$$\begin{aligned} w_{i} = \frac{ \cos \left( {{\textbf{e}}}_{i}, {{\bar{\textbf{e}}}} \right) }{\sum _{j\in \{s, v\}} \cos \left( {{\textbf{e}}}_{j}, {{\bar{\textbf{e}}}} \right) }, \quad i \in \{s, v\}, \end{aligned}$$
(5)

and \({{\bar{\textbf{e}}}}=\frac{1}{2}\left( {{\textbf{e}}_{s}}+{{\textbf{e}}}_{v}\right)\). By assigning weights to modality-specific entity embeddings, this kind of combination allows the model to emphasize important modalities. We denote the corresponding variant which adopts this fusion strategy as MMEA-avg.
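A sketch of the two early-fusion strategies in Eqs. (4) and (5): MMEA-cat learns two scalar modality weights and applies a softmax over them, while MMEA-avg derives its weights from the cosine similarity of each modality embedding to their mean.

```python
import torch
import torch.nn.functional as F

# MMEA-cat: weighted concatenation, Eq. (4)
modality_weights = torch.nn.Parameter(torch.zeros(2))     # trainable [w_s, w_v]

def fuse_cat(e_s, e_v):
    a = torch.softmax(modality_weights, dim=0)             # e^{w_s}/(e^{w_s}+e^{w_v}), ...
    return torch.cat([a[0] * e_s, a[1] * e_v], dim=-1)     # weighted concatenation

# MMEA-avg: weighted averaging, Eq. (5)
def fuse_avg(e_s, e_v):
    e_bar = 0.5 * (e_s + e_v)                               # mean embedding
    c_s = F.cosine_similarity(e_s, e_bar, dim=-1)
    c_v = F.cosine_similarity(e_v, e_bar, dim=-1)
    w_s = (c_s / (c_s + c_v)).unsqueeze(-1)
    w_v = (c_v / (c_s + c_v)).unsqueeze(-1)
    return w_s * e_s + w_v * e_v
```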

3.3 Alignment Learning and Inference

This section presents details about alignment learning and inference. We integrate \(G_{1}\) and \(G_{2}\) into one KG and learn both the structural and visual embeddings of the entities in \(E_{1}\) and \(E_{2}\) in a unified space. For notation, we let \(E_{\textrm{s}}\) and \(E_{\textrm{t}}\) denote the sets of source entities and the corresponding target entities, respectively, where \(E_{\textrm{s}} \subseteq E_{1}\), \(E_{\textrm{t}} \subseteq E_{2}\) and \(\vert {E_{\textrm{s}}}\vert =\vert {E_{\textrm{t}}}\vert\). We rearrange the elements in both sets so that the ith entity in \(E_{\textrm{s}}\) corresponds to the ith entity in \(E_{\textrm{t}}\). We denote by P the set of all aligned pairs, i.e., \(P=\{(e_{1}, e_{2}) \mid e_{1} \equiv e_{2}, e_{1} \in E_{\textrm{s}}, e_{2} \in E_{\textrm{t}}\}\), and by \({\textbf{M}} \in {\mathbb{R}}^{\vert {E_{\textrm{s}}}\vert + \vert {E_{\textrm{t}}}\vert }\) the entity mask used to filter potentially noisy images. The training and test sets are obtained by splitting P with a ratio r.

3.3.1 Alignment Learning

Let \({\hat{E}}_{\textrm{s}}\) and \({\hat{E}}_{\textrm{t}}\) denote the source entities and target entities, respectively, in the training set. We align each modality separately. For the structural modality s, we compute a similarity matrix \({\textbf{Sim}}^{(s)} =\langle {\hat{\textbf{E}}}_{\textrm{s}}^{(s)},{\hat{\textbf{E}}}_{\textrm{t}}^{(s)} \rangle \in {\mathbb{R}}^{\mid {\hat{E}}_{\textrm{s}}\mid \times \mid {\hat{E}}_{\textrm{t}}\mid }\), where \({\hat{\textbf{E}}}_{\textrm{s}}^{(s)}\) (\({\hat{\textbf{E}}}_{\textrm{t}}^{(s)}\)) represents the structural embeddings of the entities in \({\hat{E}}_{\textrm{s}}\) (\({\hat{E}}_{\textrm{t}}\)), and each entry \({\textbf{Sim}}_{i j}^{(s)}\) corresponds to the cosine similarity between the ith entity in \({\hat{E}}_{\textrm{s}}\) and the jth entity in \({\hat{E}}_{\textrm{t}}\). To better penalize hard negatives and mitigate the hubness problem [24], we choose the HAL loss [25] as the objective function and apply it to obtain the structural-modality loss \({\mathcal{L}}^{(s)}\) and train the structural embeddings:

$$\begin{aligned} {\mathcal{L}}^{(s)}&=\frac{1}{N} \sum _{i=1}^{N}\left( \frac{1}{\alpha }\log \left( 1+\sum _{m \ne i} e^{\alpha {\textbf{Sim}}_{m i}^{(s)}}\right) \right. \nonumber \\&\quad \left. +\frac{1}{\alpha } \log \left( 1+\sum _{n \ne i} e^{\alpha {\textbf{Sim}}_{i n}^{(s)}}\right) -\log \left( 1+\beta {\textbf{Sim}}_{i i}^{(s)}\right) \right) , \end{aligned}$$
(6)

where \(\alpha\), \(\beta\) are temperature scales and N is the batch size. Likewise, we compute \(\textbf{Sim}^{(v)}\) and \({\mathcal{L}}^{(v)}\) for the visual modality v. Thus the final loss \({\mathcal{L}}\) of MMEA-s+v is formulated as:

$$\begin{aligned} {\mathcal{L}}={\mathcal{L}}^{(s)}+{\mathcal{L}}^{(v)}. \end{aligned}$$
(7)
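A sketch of the HAL loss in Eq. (6), computed from the in-batch similarity matrix of a modality; the diagonal entries correspond to aligned (positive) pairs, and α, β follow the values in Sect. 4.1.2.

```python
import torch

def hal_loss(sim, alpha=5.0, beta=10.0):
    """HAL loss of Eq. (6); `sim` is the N x N cosine similarity matrix of a training batch."""
    n = sim.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    exp_sim = torch.exp(alpha * sim) * off_diag           # keep only negative pairs
    col_neg = torch.log1p(exp_sim.sum(dim=0))             # sum over m != i of e^{alpha Sim_mi}
    row_neg = torch.log1p(exp_sim.sum(dim=1))             # sum over n != i of e^{alpha Sim_in}
    pos = torch.log1p(beta * sim.diag())                  # log(1 + beta Sim_ii)
    return ((col_neg + row_neg) / alpha - pos).mean()     # average over the batch
```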

To apply the entity mask vectors \({\textbf{M}}\) to MMEA-s+v in alignment learning, we obtain a new set of alignment pairs \(P^{\prime }=\{(e_{1}, e_{2}) \mid e_{1} \equiv e_{2}, e_{1} \in {\hat{E}}_{\textrm{s}}, e_{2} \in {\hat{E}}_{\textrm{t}}, {\textbf{M}}_{e_{1}}=1, {\textbf{M}}_{e_{2}}=1\}\) with P and \({\textbf{M}}\), determine from \(P^{\prime }\) new sets of source entities and target entities, denoted by \({\tilde{E}}_{\textrm{s}}\) and \({\tilde{E}}_{\textrm{t}}\), respectively, and compute the visual similarity matrix as \(\textbf{Sim}^{(v)} =\langle {\tilde{\textbf{E}}}_{\textrm{s}}^{(v)},{\tilde{\textbf{E}}}_{\textrm{t}}^{(v)} \rangle \in {\mathbb{R}}^{\vert {\tilde{E}}_{\textrm{s}}\vert \times \vert {\tilde{E}}_{\textrm{t}}\vert }\).

For MMEA-avg and MMEA-cat, because we use the multimodal embeddings of entities to find alignment, we also optimize the joint representations and calculate a multimodal loss \({\mathcal{L}}^{(mm)}\) similar to Eq. (6). Then the final loss \({\mathcal{L}}\) of MMEA-avg/MMEA-cat is:

$$\begin{aligned} {\mathcal{L}}={\mathcal{L}}^{(s)}+{\mathcal{L}}^{(v)} + {\mathcal{L}}^{(mm)}. \end{aligned}$$
(8)

3.3.2 Inference

Given the source entity set \({\bar{E}}_{\textrm{s}}\) and target entity set \({\bar{E}}_{\textrm{t}}\) used for inference, we compute \(\textbf{Sim}^{(s)} =\langle {\bar{\textbf{E}}}_{\textrm{s}}^{(s)},{\bar{\textbf{E}}}_{\textrm{t}}^{(s)}\rangle\) and \({\textbf{Sim}}^{(v)} =\langle {\bar{\textbf{E}}}_{\textrm{s}}^{(v)},{\bar{\textbf{E}}}_{\textrm{t}}^{(v)}\rangle\), where \({\textbf{Sim}}^{(s)}, {\textbf{Sim}}^{(v)} \in {\mathbb{R}}^{\vert {\bar{E}}_{\textrm{s}}\vert \times \vert {\bar{E}}_{\textrm{t}}\vert }\) are the cosine similarity matrices of the structural and visual modalities, respectively. For MMEA-s+v, we simply combine them by a weighted addition to obtain the final similarity matrix \(\textbf{Sim} = w \cdot {\textbf{Sim}}^{(s)} + (1-w) \cdot {\textbf{Sim}}^{(v)}\), where \(w \in (0,1)\) is a hyper-parameter to balance the two modalities.

To curb the potential negative impact of visual information when measuring the similarity between two entities, i.e., pulling two nonequivalent entities closer or pushing two identical entities farther apart, we use the entity mask vector on top of MMEA-s+v. Specifically, we define the similarity score between the ith entity \(e_{i}\) in \({\bar{E}}_{\textrm{s}}\) and the jth entity \(e_{j}\) in \({\bar{E}}_{\textrm{t}}\), i.e., the (i, j) entry of \(\textbf{Sim}\), as:

$$\begin{aligned} \textbf{Sim}_{i j} = {\left\{ \begin{array}{ll} w \cdot {\textbf{Sim}}_{i j}^{(s)} + (1 - w) \cdot {\textbf{Sim}}_{i j}^{(v)} &{} {\text {if}}\; {\textbf{M}}_{e_{i}} = 1 \; {\text {and}}\; {\textbf{M}}_{e_{j}} = 1 \\ {\textbf{Sim}}_{i j}^{(s)} &{} {\text {otherwise}} \end{array}\right. }. \end{aligned}$$
(9)

Equation (9) illustrates the principal idea of fusing the two modalities: for a source entity \(e_{i}\) and a candidate target entity \(e_{j}\), their similarity is predicted from both sources of knowledge only when both of their images are regarded as potentially useful; otherwise, it is based solely on the structural similarity.

For MMEA-avg and MMEA-cat, the cosine similarity matrix \(\textbf{Sim}\) is simply computed from the multimodal embeddings of the entities in \({\bar{E}}_{\textrm{s}}\) and \({\bar{E}}_{\textrm{t}}\), i.e., \({\textbf{Sim}} =\langle {\bar{\textbf{E}}}_{\textrm{s}},{\bar{\textbf{E}}}_{\textrm{t}}\rangle\). After obtaining \(\textbf{Sim}\), we further post-process it with cross-domain similarity local scaling (CSLS) [24]. Then, for each \(e_{i} \in {\bar{E}}_{\textrm{s}}\), we retrieve the similarity scores in the ith row of \(\textbf{Sim}\), rank them in descending order, and take the top-ranked entity as the match.
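A sketch of the masked late fusion at inference (Eq. (9)) followed by CSLS post-processing and top-1 retrieval; `mask_s` and `mask_t` hold the mask values of the source and target entities, and the neighborhood size k of CSLS is an illustrative default not specified in the text.

```python
import torch

def fuse_similarity(sim_s, sim_v, mask_s, mask_t, w=0.5):
    """Eq. (9): use the visual similarity only when both entities keep their images."""
    both_kept = mask_s.bool().unsqueeze(1) & mask_t.bool().unsqueeze(0)
    fused = w * sim_s + (1.0 - w) * sim_v
    return torch.where(both_kept, fused, sim_s)

def csls(sim, k=10):
    """Cross-domain similarity local scaling [24]: penalize hub entities by subtracting
    the mean similarity of each entity's k nearest cross-KG neighbors."""
    r_src = sim.topk(k, dim=1).values.mean(dim=1, keepdim=True)   # per source entity
    r_tgt = sim.topk(k, dim=0).values.mean(dim=0, keepdim=True)   # per target entity
    return 2 * sim - r_src - r_tgt

def predict(sim_s, sim_v, mask_s, mask_t, w=0.5):
    sim = csls(fuse_similarity(sim_s, sim_v, mask_s, mask_t, w))
    return sim.argmax(dim=1)      # top-ranked target entity for each source entity
```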

4 Experiments

4.1 Experimental Settings

4.1.1 Dataset

We construct the multimodal version of DBP15K and evaluate the methods on this benchmark. DBP15K is a widely used cross-lingual dataset extracted from DBpedia (2016-04) and contains three bilingual subsets: Chinese-English (ZH-EN), Japanese-English (JA-EN), and French-English (FR-EN). Each subset has 15K aligned entity pairs. DBpedia provides links to thumbnails for many entities; however, it does not cover all of them. Statistics show that ca. 50–85% of the entities in DBP15K have images [7]. To solve the problem of data incompleteness, for (almost) every entity without an image in DBP15K, we construct a request URL with its surface name, obtain the top 10 image URLs ranked by the keyword “selectedIndex” from Bing Images search, and download the images. In our experiments, we take the most relevant image (with “selectedIndex = 0” in the URL) as the visual representation of an entity. The statistics of image coverage are presented in Table 1. To retrieve entity types, we query the classes of each entity with rdf:type via a public SPARQL endpoint (Footnote 1). We also obtain the subsumption and disjointness relationships between classes, which are explicitly defined by the rdfs:subClassOf and \({owl\#disjointWith}\) properties in the DBpedia ontology, respectively.
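For illustration, type retrieval can be done with SPARQLWrapper against a public DBpedia endpoint as sketched below; the endpoint URL and query shape are assumptions, not necessarily the exact setup referred to in Footnote 1.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def entity_types(entity_uri, endpoint="https://dbpedia.org/sparql"):
    """Query the rdf:type classes of an entity, keeping DBpedia ontology classes only."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(f"""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        SELECT ?type WHERE {{
            <{entity_uri}> rdf:type ?type .
            FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
        }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["type"]["value"] for b in results["results"]["bindings"]]
```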

Table 1 Statistics of image coverage

4.1.2 Implementation Details

Alignment We employ a three-layer GCN (including the input layer) to encode the structural information of entities. The dimensions of the input and hidden layers are both set to 400. For MMEA-cat and MMEA-s+v, we set both the dimension of the GCN output layer and the dimension of the visual embeddings to 200. For MMEA-avg, in contrast, we set both the GCN output dimension and the visual embedding dimension to 400, so that its final embedding dimension is the same as that of MMEA-cat. For all three variants, we adopt AdamW to update parameters and set the learning rate to \(5\times 10^{-4}\). When calculating losses, we set \(\alpha =5\), \(\beta =10\) for \({\mathcal{L}}^{(s)}\), and \(\alpha =15\), \(\beta =10\) for \({\mathcal{L}}^{(v)}\) and \({\mathcal{L}}^{(mm)}\). We train MMEA-avg and MMEA-s+v for 1000 epochs, while we only train MMEA-cat for 600 epochs because we observe evident overfitting after epoch 600. For MMEA-s+v, we set \(w=0.5\) as the weight of structural similarities between entities during inference.

Following conventions, we use 30% of the aligned pairs for training and the remainder for evaluation, and choose H@1 (Hits@1), H@10 (Hits@10) and mean reciprocal rank (MRR) as the evaluation metrics. For the proposed variants MMEA-avg, MMEA-cat and MMEA-s+v, we conduct five experiments with different random seeds and report the averaged results along with their standard deviations.

Classification We collect unique entities from all three subsets of DBP15K, filter out those without a type or an image, and use the remaining entities \(E^{\prime }\) as indices to retrieve their images and labels. For each split of DBP15K, we fine-tune a classifier based on the pre-trained ResNet-152 [22], and build the test and training data from \(E_{\textrm{s}} \cup E_{\textrm{t}}\) and \(E^{\prime } \setminus (E_{\textrm{s}} \cup E_{\textrm{t}})\), respectively. We adopt stochastic gradient descent (SGD) to update the parameters of the classifiers with a learning rate of 0.001 and a momentum of 0.9. We set the batch size to 32 and the number of training epochs to 25. At test time, we obtain the top 5 predictions for each entity image and calculate the mask value of each entity based on its ground-truth class and the predictions.
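A sketch of this fine-tuning setup (ImageNet-pretrained ResNet-152, SGD with learning rate 0.001 and momentum 0.9, batch size 32, 25 epochs); the data loader and class count are assumed to be prepared as described above.

```python
import torch
import torch.nn as nn
from torchvision import models

def finetune_classifier(train_loader, num_classes, epochs=25, device="cuda"):
    """Fine-tune an ImageNet-pretrained ResNet-152 on entity images labeled by type."""
    model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # replace the classification head
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:                    # batches of size 32
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```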

4.1.3 Comparative Methods

To investigate the effectiveness of unimodal data, we develop two variants: RelEA, which uses only structural information (relational triples), and VisEA, which uses only visual context (entity images) to achieve alignment. To generally verify the effectiveness of incorporating visual context, we compare MMEA-s+v with RelEA and other public structure-based EA approaches, including MTransE, IPTransE, MuGNN [26], SEA [27] and AliNet. To compare the effects of leveraging visual context with those of using other types of side information, such as entity names and/or attributes, we also include JAPE, GCN-Align, HMAN and MultiKE. Note that for fair comparison, the results of HMAN are from its variant that only uses the training data in DBP15K as alignment signals.

Recent multimodal approaches for entity alignment, such as MCLEA, EVA and MSNEA [28], use three or more types of information, including structural data, numerical/attribute triples, visual knowledge and surface names of entities, to improve alignment performance. Because our work focuses on probing the impact of visual context and therefore only utilizes graph structures and visual context in the experiments, we do not include these methods for fair comparison (Table 2).

Table 2 Entity alignment results on DBP15K. Bold denotes the best results, and underlined values are results averaged over five experiments with different random seeds. Gray and blue shading mark our proposed variants (RelEA, VisEA, MMEA-avg, MMEA-cat and MMEA-s+v), reported as means ± standard deviations

4.2 Alignment Results and Analyses

4.2.1 Performance Comparison

We analyze the alignment results from the following perspectives: (1) comparison between our unimodal variants (i.e., RelEA and VisEA) and the baselines; (2) comparison between our multimodal variants and the remaining methods to verify the effects of visual context; (3) comparison of different fusion strategies.

(1) Our variant RelEA, which uses only structural information, is comparable to other structure-based approaches, and even surpasses two models using additional side information, JAPE and GCN-Align. We think that many factors, such as the choice of models to learn structural embeddings for entities, the choice of loss functions, and the hyperparameter settings for training, have an impact on model performance. In detail, both MTransE and RelEA only use relational triples to embed entities; however, the former utilizes TransE while the latter adopts GCNs. We deem that model capacity limits the quality of entity embeddings, which directly determines alignment accuracy. Training (and inference) settings and the design of losses also affect performance. The use of both the HAL loss (rather than a simple margin-based ranking loss) and CSLS in RelEA greatly improves performance, which may explain why it outperforms GCN-Align, which additionally used attribute triples when learning entity embeddings. VisEA performs worse than RelEA on all three datasets, indicating that leveraging visual context alone is insufficient to achieve satisfactory results.

(2) The proposed three multimodal variants, i.e., MMEA-avg, MMEA-cat and MMEA-s+v, all outperform the other baselines using structural and/or side information, indicating that visual context is as useful as other side information such as entity attributes and names. They gain absolute Hits@1 improvements over RelEA of 20.8–22.1% on FR-EN, 12.2–13.8% on JA-EN and 13.3–14.5% on ZH-EN, respectively, which demonstrates that incorporating visual context can substantially improve the EA system.

(3) As for the modality combination strategies, MMEA-cat achieves slightly better results than MMEA-avg and MMEA-s+v. We attribute this to the trainable weighting during modality fusion, which allows the modality weights to be learned automatically. Overall, there is no prominent difference in effectiveness among the three strategies.

4.2.2 Impacts of Filtering Entity Images

To maximize the benefits that incorporating visual context brings to EA, we selectively combine the feature similarities based on MMEA-s+v during inference with precomputed entity masks to filter potential visual noise. The set of possible values of the conflict degree threshold \(\lambda\), calculated according to the rules and Eq. (1) presented in Sect. 3.1.2, is the finite set \(\{0, 0.4, 0.5, 0.6, 0.67, 1\}\). We choose \(\lambda \in \{0, 0.4, 0.67, 1\}\) to calculate the mask values of entities from the classification results. \(\lambda =0\) corresponds to the strictest setting and \(\lambda =1\) is the no-masking setting, where no entity images are filtered. A larger \(\lambda\) means that more image pairs are involved and that visual context has more influence on alignment prediction during inference. Additionally, we design a special mask based on the alignment result obtained when \(\lambda =1\). Specifically, we reset the mask value of an entity to 0 if it is correctly matched by structural similarity alone but missed by the joint decision of the two modalities.

Table 3 Alignment results and absolute improvements (Improv.) over RelEA under different settings on DBP15K, obtained with random seed 2021

We conduct experiments under the above settings and present the results in Table 3. As shown in Table 3, Hits@1 increases as \(\lambda\) is set larger, and the no-masking setting (\(\lambda = 1\)) outperforms the strictest setting by 2.7–5.2%. We consider that this is mainly attributable to the relatively low quality of the visual context. Nevertheless, filtering visual noise is non-trivial, as we observe an average performance gain of 6.7\(\%\) in Hits@1 with the special masks over the no-masking setting. It is also clear that filtering visual noise with the special masks achieves obvious improvements compared with MMEA-cat and MMEA-avg, which only implicitly weaken the impact of visual noise through their weighted concatenation and weighted averaging mechanisms when generating multimodal embeddings. We further analyze the change of errors after visual context is injected under three settings, i.e., the strictest (mask), the no-masking and the special (Spec.). As shown in Fig. 4, on all three datasets the use of special masks greatly reduces errors while retaining most of the benefits brought by the no-masking setting. These observations highlight the complexity of the problem: the model does not necessarily produce better results when visual context is considered. They also show that visual noise filtering is beneficial to multimodal entity alignment; the key challenge lies in locating the real visual noise.

Fig. 4

Number of new errors caused (left) and number of errors eliminated (right) with the use of images on DBP15K. Different colors indicate the results from different settings (cf. Sect. 4.2.2)

4.3 Classification Performance and Analysis

The classification accuracies, along with the numbers of images in the training and test sets for each split of DBP15K, are reported in Table 4. We collect all classification results and merge them for a general analysis. For better understanding, we take nodes at the second level of the hierarchical class tree as base classes, and then use them to group the fine-grained types, i.e., the image labels used in the classification experiments. Note that we additionally treat Person and Organization, which are subclasses of Agent, as two base classes, as they are drastically different in both semantics and visual representation. A total of 17 base classes are identified; including their descendants, the total number of classes is 76 for FR-EN and 82 for JA-EN and ZH-EN (cf. Table 4). Among them, the top 4 base classes (together with their descendants), Person, Place, Work and Organisation, cover 92% of all test entities over the three datasets. Figure 5 illustrates the distribution of accuracy and the number of test (entity) images with respect to all classes.

Table 4 Entity image classification results on the DBP15K dataset
Fig. 5

The distribution of classification accuracy and the number of test images w.r.t. all classes. Each base class is denoted with a unique marker. The same markers scattered at different positions denote fine-grained types that share a common base class, such as blue stars denoting Royalty, Athlete and Cleric sharing the base class Person, and cyan dots denoting Country and Settlement sharing the base class Place. Because of limited space, we only present top 4 base classes and explicitly annotate top 10 classes (ranked by classification accuracy) beside their markers

We summarize the classification errors into two kinds: (1) the predicted class of an (entity) image and its true class are relatively close and in the same group, i.e., one is the superclass of the other, or they are siblings or cousins, and (2) the predicted class and the true class are disjoint. We find that without the first kind of errors, the accuracies of the four base classes Person, Place, Organization and Work rise from 0.53, 0.65, 0.36 and 0.31 to 0.91, 0.83, 0.51 and 0.52, respectively, which indicates that entities of Person or Place are more visually distinguishable, while entities of Organization and Work have less stable visual characteristics. By investigating the mispredictions, we identify several reasons that may explain the poor classification performance on many classes, which also provides insights into the quality of the visual data used for MMEA. First, the image provided for an entity can be irrelevant to the entity itself. Second, the visual representations of entities of some classes are unstable. For example, entities of type Single or Album often have covers as their thumbnails; these covers vary widely in design style and are easily misclassified into other classes such as Artist and Settlement. Third, it is difficult to find accurate visual representations for conceptual entities, namely entities referring to cognitive objects rather than physical objects. A typical type is MusicGenre, whose accuracy is as low as 0.03.

4.4 Study of Alignment Errors

In this subsection, we analyze alignment errors and investigate how visual context impacts entity alignment. Generally, the incorporation of entity images can eliminate thousands of errors; on the other hand, it also introduces noise that leads to many new mismatches, as illustrated in Fig. 4. Overall, it improves alignment performance.

For the positive impact, we find that visual context is particularly helpful when structural information is insufficient to make correct alignment predictions. This finding is supported by the observation that, among the 3011 newly aligned entity pairs under the no-masking setting on FR-EN, 78% have a summed degree below the mean summed degree of all aligned entity pairs (i.e., long-tailed entities), and a lower entity degree means that less structural information is available to learn reliable structural embeddings.

To gain insight into the negative impact of injecting visual context, we take the results on FR-EN as an example and collect the new errors that occurred under the no-masking setting. These new errors shed light on the true visual noise that should be filtered. Among the 818 errors on FR-EN, 139 source entities have a mask value of 0, meaning that the top 1 class predicted by the classifier for their image is disjoint with their actual (entity) type; these 139 errors could be avoided if those images were filtered. The remaining 679 errors mostly concern source entities with mask values of 1, which we divide into three categories for detailed analysis: (1) The first category contains 436/679 source entities for which the mask values of both their aligned counterparts and their predicted matches are 1, and 80% of these mismatches are between entities of the same or very close types, such as siblings, with Person and Place as the two largest base classes. These mismatches are quite difficult to address because such entity types show relatively stable visual characteristics, and the corresponding entity images are not easily distinguishable from those of other entities of the same types. (2) The second category includes 154/679 source entities for which the mask value of either their aligned counterpart or their predicted match is 0, indicating that inappropriate or inconsistent images induced the mismatches; these errors could be avoided if the noise were excluded. (3) The last category, making up about 9% of the total errors, concerns source entities mismatched to entities without images, which suggests that their images are not as useful as structural information in multimodal entity alignment.

5 Conclusion

This paper investigated the impacts of incorporating visual context (entity images) into multimodal entity alignment. We proposed to learn entity embeddings from structural information and visual context, and to integrate feature similarities at the output level. On top of this fusion strategy, we further explored a mechanism which uses image classification techniques and entity types to filter potential noise, and conducted extensive experiments to examine this mechanism. We found that visual context is beneficial overall and that, while challenging, noise filtering can further boost performance. We experimentally showed that selectively using visual context brings the most benefit to EA, though the results depend largely on the quality of the visual data. Our work also examined the quality of entity images in some multimodal KGs, which has not been inspected by existing studies.