1 Introduction

With the ever-growing amount of digital information stored in online databases, novel methods are needed to improve the accessibility of online content [1, 2], especially content from large online encyclopedias. In recent years, new deep learning techniques have achieved outstanding results in many computer vision and language tasks, and numerous attempts have been made to merge the two worlds. In particular, Transformer-based networks have recently redefined the ways in which vision and language are processed, and many architectures, such as CLIP [3], have been introduced to create informative common spaces, where efficient k-NN search can be performed to retrieve images given a natural language text and vice-versa. This ability to align text and images in a latent space has been widely employed to solve many vision-language tasks, ranging from image captioning [4] to text-guided image synthesis [5].

Among all the emerging multi-modal tasks, image-caption matching is becoming particularly important, especially for scalable and efficient cross-modal retrieval [6, 7]. Image-caption matching involves associating an image with the text that best describes it. It can be used to find the most relevant images for a given query text (text-to-image retrieval) or vice-versa (image-to-text retrieval). These are important challenges that can make multimedia content more accessible and complete. While text-to-image retrieval has important applications in multimedia search engines – where a natural language phrase is used to search for visual content [8] – fewer natural use-cases have emerged for the complementary image-to-text retrieval scenario. Recently, the Wikimedia Foundation issued a Kaggle competition (Footnote 1) that concerns the retrieval of captions from Wikipedia pages associated with a certain image. An example of the setup is reported in Fig. 1. This task turns out to be critical in large online encyclopedias, where automatically linking images to the textual concepts referenced in the page text enables the underlying knowledge base to remain complete and up-to-date. A system able to perform this task would improve the accessibility and completeness of the underlying multi-modal knowledge graph in online encyclopedias, also assisting article writers by suggesting relevant captions for an inserted figure.

Existing solutions depend on straightforward approaches that use translations or page links, but these methods have restricted coverage. One possibility is employing free-form image captioning, where text is generated from scratch given the image. However, even the most sophisticated image captioning methods still struggle to handle images with intricate semantic content. Furthermore, generating text from scratch is unnecessary – besides being computationally expensive – if we assume that the text describing the picture is already present somewhere on the page where the image resides. For these reasons, the best option is to frame the problem as a caption retrieval task, given the image content as context.

In principle, recent cross-modal models that produce an informative common latent space, like CLIP [3], could solve this problem. However, there are some problems in adopting CLIP-like models for this task: i) the task is framed as a multi-modal retrieval task, where the query information is represented not only by the image itself but also by its URL, which carries important priors useful for effective matching; ii) the CLIP latent space may struggle if not adapted to Wikipedia images, whose distribution may diverge considerably from the CLIP training set; iii) the dataset employed in the competition – the WIT dataset [9] – is also multi-lingual (captions are written in 108 different languages), requiring the development of an ad-hoc multi-modal multi-language retrieval pipeline.

Fig. 1: Given an image and its URL, the objective is to find the most relevant caption

Another important requirement for the developed solution is efficiency, given that the model should ideally process billions of Wikipedia pages. The WIT dataset [9] on which the competition is based – alongside other proposed Wikipedia-based collections [10, 11] – is composed of millions of image-text pairs that require scalable architectures to be processed. Therefore, in our work, we devise an efficient and effective two-stage pipeline for retrieving the most relevant captions given both textual and visual information (image URL and image itself), solving the challenge proposed by the Wikimedia Foundation.

Specifically, we propose a cascade of two image-text matching models based on large pre-trained Transformer models. The first model, called Multi-modal Caption Proposal (MCProp), is based on the common space matching approach and uses XLM-RoBERTa and CLIP as text and image feature extractors, respectively. Being very efficient at inference time, this model is used to quickly propose potentially relevant candidates. The second model, Caption Re-Rank (CRank), is a fine-tuned XLM-RoBERTa pairwise classifier. This model is less efficient but more accurate and is used to re-score and reorder the candidates from the first stage. After training each stage separately, we run extensive studies on the two modules to understand the validity of the whole pipeline on a carefully chosen validation set. Finally, we perform inference on the test set provided by the challenge, keeping only the top five captions as required by the challenge rules.

We compare our performance with the other competitors. Our approach achieves the fifth position on the final private leaderboard, with a final nDCG of 0.53, a +8% improvement with respect to the team in the sixth position. The detailed experimental analysis performed on the two modules shows the effectiveness and the efficiency of the proposed pipeline.

To summarize, in this paper, we propose the following contributions:

  • We introduce a multi-lingual multi-modal architecture composed of a cascade of Transformer-based networks for solving the novel, challenging task of image-based caption retrieval on Wikipedia pages.

  • We show the effectiveness and the efficiency of the proposed solution, achieving the fifth position in the Wikimedia challenge proposed on Kaggle.

  • We perform extensive experimentation on the two models to validate the proposed pipeline.

The rest of the paper is organized as follows: Section 2 summarizes some of the most influential works related to the presented task; Section 3 explains the two-stage pipeline that efficiently and effectively retrieves the most relevant captions; Section 4 presents in-depth experiments on the two proposed models, showing the validity of the entire pipeline; Section 5 draws final conclusions and presents possible extensions to this work; finally, Appendices A, B and C report implementation details, together with some more in-depth experimentation done at inference time to try to boost the performance of the proposed model.

The code for reproducing our results is publicly available on GitHub (Footnote 2).

2 Related works

In this section, we review relevant literature for the explored task, with particular attention to Transformer-based networks, image captioning, and cross-modal retrieval, together with some works on multimedia understanding in Wikipedia.

Transformer-based networks

A big step forward for multi-modal understanding has been achieved with the introduction of Transformers [12], which proved to be extremely powerful both in the field of natural language processing, with models like BERT [13], ELMo [14], GPT-3 [15], RoBERTa [16] and ALBERT [17], and in the field of computer vision, with the introduction of Vision Transformers [18] and their variants like Swin Transformer [16], CrossViT [19] and Twins-SVT [20]. Since Transformers are so effective in both fields, they have been used to extract common representations from multi-modal data so that they can later be compared or processed jointly. For example, some works showed the effectiveness of Transformers in tasks ranging from image captioning [21] to text-driven object detection [22], while TERAN [6] and, more recently, CLIP [3] demonstrated the power of Transformers for cross-modal retrieval. Recently, researchers devised large vision-language Transformers like VinVL [23], ViLT [24], or Flamingo [25], pretrained on massive amounts of data, to solve many downstream vision-language tasks. Driven by these recent advances, our model employs state-of-the-art Transformer networks as image and text backbones for processing images and captions, respectively.

Image captioning

When trying to obtain a textual description from an image, one of the possibilities is image captioning, which consists of generating natural language text conditioned on the given input image. Early neural models for image captioning [26,27,28] encoded visual information using a single feature vector representing the whole image. Therefore, they were not able to exploit information about objects and their spatial relationships. In recent years, the concept of attention, which underpins the operation of Transformers, has proven to be crucial for image captioning tasks. Indeed, when deciding which combination of natural language words best describes an image, it is necessary to identify its most important and discriminating parts. The first application of Transformers to this task can be found in [29], in which the authors proposed the novel Conceptual Captions dataset and proved the effectiveness of Transformers in the captioning task. In [30], the authors exploit an image Transformer to obtain captions by attending to the different image regions. Differently, in [21], the authors propose a meshed-memory Transformer, which uses mesh-like connectivity at the decoding stage to exploit the activations at different depths of the network. Some works also use GANs as a framework to learn to caption images. Specifically, in [31], the authors present a novel method relying on a conditional GAN, which introduces an extension to traditional encoder-decoder architectures based on reinforcement learning (RL).

Sentence and image-sentence retrieval

Our research is more focused on sentence or image-sentence matching. This setup differs slightly from image captioning: in the matching task, captions need not be generated, but only carefully chosen from a set of candidates. Sentence matching has received a huge boost in the last few years, thanks to the large-scale pre-training of Transformer models such as BERT [13], RoBERTa [16], and their multi-lingual versions like XLM-RoBERTa [32]. These models have been recently extended to work with images. Some works apply BERT-like processing to both visual and textual modalities, such as ViLBERT [33], ImageBERT [34], Pixel-BERT [35], and VL-BERT [36]. Nevertheless, all these methods require a number of network evaluations that scales quadratically with the number of items in the inference set: all the possible image-caption pairs must be fed into the network to obtain the matching scores. For this reason, many methods instead rely on the projection of visual and textual information into the same common space, where only a simple dot-product is needed to obtain the similarity between a given pair. In particular, in [37] the authors use VGG and ResNet visual feature extractors, together with an LSTM for sentence processing, and they match images and captions exploiting hard negatives at training time. Subsequently, other methods focused on contrastive learning of common embedding spaces [38,39,40]. Differently, in [41], an adversarial learning method is proposed, and a discriminator is used to learn modality-invariant representations. The authors in [42] use a contextual attention-based LSTM-RNN which can selectively attend to salient regions of an image at each time step, and they employ a recurrent canonical correlation analysis to find hidden semantic relationships between regions and words. In [43] the authors proposed a system for combining image and text features for image retrieval. They introduced a fusion approach called Text Image Residual Gating (TIRG), in which the image feature is first gated and then added to a residual feature which works as a modification feature. Transformer networks have been used for image-text matching in [6, 7] for the task of multi-modal large-scale information retrieval. The authors introduced a novel disentangled Transformer architecture that reasons separately on the two different modalities and enforces a final common abstract concept space.

Unlike all these works, our efficient candidate proposal model deals with multi-modal queries composed of a text (the image URL) and the image itself. Therefore, we have a slightly different setup, in which an (image, text) pair is used to retrieve captions (another text).

Multi-modal understanding in Wikipedia

Some relevant works have been recently proposed to tackle important challenges in Wikipedia. In order to train Wikipedia-scale multi-modal models, many datasets have been built by scraping Wikipedia pages, like WIT [9], WikiWeb2M [10], and AToMiC [11], which collect millions of image-text pairs. These datasets have been used in recent large multilingual vision-language pre-training models [44]. In particular, MURAL [45] extends ALIGN [46] by performing both image-text matching and translation pair matching, while REVEAL [47] handles image captioning and image-text retrieval using an external memory network pretrained on massive data. The work that most closely resembles our task and motivation is the context-driven Wikipedia captioning method of [48], which employs images, descriptions, and sections in the article to generate a very precise caption. However, the authors do not frame the task as a retrieval problem, and they train a complex encoder-decoder Transformer model, which is hard to deploy in real-world large-scale scenarios.

3 Method

The data provided in the Kaggle challenge consists of three main fields: image URLs, images, and captions. The challenge consists of finding the most relevant caption given the image URLs, the images, or both as a query. Given that the test set is composed of around \(n_t=92K\) elements, using a large Transformer to compute a relevance score for every (query, target) pair is infeasible, as we would need to compute \(n_t^2\) relevance scores to get the ranking for the whole test set. Driven by this concern, we decided to adopt a cascade of two different models to produce the final rankings. The first one, which we call Multi-modal Caption Proposal (MCProp) model, employs both the textual information in the image URLs and the visual information in the images as a compound query to infer the caption. This model projects queries and captions into the same common space, where cosine similarity is used to measure the similarity between a query and a caption. With this model, efficient k-nearest neighbor search can be performed to create, for every query, a rank of all the \(n_t\) captions. The top-ranked elements are then used as candidates by the second model, called Caption Re-Rank (CRank) model. This is a large Transformer fine-tuned for pair classification, i.e., a binary classifier that classifies a (query, caption) pair as either a match or a non-match. This second model employs only the textual information in the image URL to infer the caption, without relying again on the visual information. The highest match probabilities returned by CRank determine the top-5 captions selected for every image, as requested by the challenge. In the following, we present the two models in detail.
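To make the cascade concrete, the following sketch illustrates the two stages on precomputed embeddings: a cosine-similarity k-NN proposal step followed by pairwise re-scoring of the shortlisted candidates. It is a minimal sketch with illustrative names, not the released code; `crank_score` stands for any pairwise scorer.

```python
# Minimal sketch of the two-stage cascade (illustrative names, not the released code).
# Stage 1: cosine-similarity k-NN in the common space produced by MCProp.
# Stage 2: re-score only the top-k candidates with the (slower) pairwise re-ranker.
import numpy as np

def propose_candidates(query_emb: np.ndarray, caption_emb: np.ndarray, k: int = 1000) -> np.ndarray:
    """Return the indices of the k captions most similar to each query (cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    sims = q @ c.T                              # (n_queries, n_captions)
    return np.argsort(-sims, axis=1)[:, :k]     # top-k caption indices per query

def rerank(query_texts, caption_texts, candidates, crank_score, top_n: int = 5):
    """Re-rank each query's candidates with a pairwise scorer and keep the top_n captions."""
    results = []
    for qi, cand in enumerate(candidates):
        scores = [crank_score(query_texts[qi], caption_texts[ci]) for ci in cand]
        order = np.argsort(scores)[::-1][:top_n]
        results.append([cand[i] for i in order])
    return results
```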

3.1 The multi-modal caption proposal model

The core idea of the Multi-modal Caption Proposal method is to project the query and the target data into a common feature space. In this space, we can calculate similarities in an efficient and scalable manner using cosine similarity. The model comprises two pipelines: the query pipeline and the caption pipeline. In turn, a query is composed of an image URL and an image. Therefore, in total, we need to process two textual fields and one visual field.

Visual features are extracted from images via the image encoder part of a CLIP network [3]. CLIP is a powerful multi-modal model composed of an image encoder and a text encoder that is trained to predict the correct visual-textual pairings. Being pre-trained on a multi-modal task, the image encoder module is a very good fit for our task.

We do not use the textual pipeline of CLIP as our textual backbone. The main reason for this is that this challenge is inherently multilingual, and CLIP is not trained on multiple languages. For this reason, we use instead a pre-trained large language Transformer model, XLM-RoBERTa [32], as a textual pipeline.

Fig. 2: The Multi-modal Caption Proposal model. The XLM-RoBERTa backbone is shared among the query and caption pipelines, and the representations are further specialized using downstream Transformer Encoders (TE), whose architecture is reported on the left side of Fig. 1 in [12]. The final CLS tokens in output from the TEs are used as final embeddings for the image URL and caption, respectively

The overall architecture is shown in Fig. 2. Specifically, the image encoder of CLIP outputs an aggregated visual feature \(\bar{\textbf{v}}\); differently, XLM-RoBERTa outputs a set \(\{\textbf{w}_1, \textbf{w}_2,\ldots ,\textbf{w}_M\}\) of textual features, which are used by the heads of the original model for solving the various downstream tasks. Instead, we post-process these features by means of additional Transformer Encoder layers, in a way similar to TERN [7] and TERAN [6], obtaining \(\{\textbf{w}'_1, \textbf{w}'_2,\ldots ,\textbf{w}'_M\}\). We use the token embedding from the first element of the output sequence, the CLS token, as the final representation of the input text: \(\textbf{c} = \textbf{w}'_1\). The CLS token has been introduced in the BERT architecture [13], and it is a special token – usually the first element in the sequence – aimed at collecting global information from the sentence. All the other tokens are needed inside both XLM-RoBERTa and the subsequent Transformer Encoder layers for computing self-attention, but they are discarded afterward since they are not needed for the downstream task. Notice that the XLM-RoBERTa backbone is shared among the image URL and the caption pipelines. In fact, once the image URL has been properly cleaned, it resembles a valid natural language text that can be processed with standard pre-trained textual models. In order to specialize the representations to the downstream task, the two downstream Transformer Encoder modules of the two pipelines do not share weights. Concerning input preprocessing, the image URL is cleaned by removing the extension and the part preceding the actual filename, and by replacing special characters like underscores or dashes with a space.
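The following is a minimal PyTorch sketch of the textual post-processing head described above: downstream Transformer Encoder layers applied to the XLM-RoBERTa token features, with the first (CLS) token taken as the final embedding. The number of layers and attention heads here are assumptions, not the values used in our experiments.

```python
# Sketch of the text-side head on top of the (frozen) XLM-RoBERTa token features.
# Layer and head counts are illustrative assumptions.
import torch
import torch.nn as nn

class TextHead(nn.Module):
    def __init__(self, d_model: int = 768, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_feats: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # token_feats: (B, M, d_model) XLM-RoBERTa outputs {w_1, ..., w_M}
        # pad_mask:    (B, M) boolean, True where the token is padding
        out = self.encoder(token_feats, src_key_padding_mask=pad_mask)
        return out[:, 0]  # CLS token (first position) as the final text embedding
```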

The image URL and the image are fused using an attentive fusion module described in the next paragraph.

3.1.1 Attentive feature fusion

The proposed problem is challenging since two different modalities (the image URL and the image) can be used as a compound query to infer the image caption. It would be interesting to automatically infer the relative importance of the two components of the query for solving the matching task. The attentive feature fusion module, inspired by other works in this direction [49, 50], serves precisely this purpose. This module is composed of a sub-network that computes two attention values, one for each query component. Specifically, the network is a simple MLP with a final sigmoid layer that takes as input the concatenation of the two input vectors \(\textbf{u}\) (the image URL) and \(\textbf{v}\) (the image) and outputs two scalars, \(\alpha _{u}\) and \(\alpha _{v}\):

$$\begin{aligned} \alpha _u, \alpha _v = \text {sigmoid}(\text {MLP}([\textbf{u}, \textbf{v}])) \end{aligned}$$
(1)

where \([\cdot , \cdot ]\) denotes the concatenation operation. Thanks to the final sigmoid layer, these values lie in the range [0, 1]. They are then used for computing the final query representation \(\textbf{q}\) as a weighted average of the normalized input vectors:

$$\begin{aligned} \textbf{q} = \alpha _u \frac{\textbf{u}}{\Vert \textbf{u}\Vert } + \alpha _v \frac{\textbf{v}}{\Vert \textbf{v}\Vert } \end{aligned}$$
(2)

The vectors are normalized so that their magnitude is 1. In this way, the \(\alpha _{u}\) and \(\alpha _{v}\) values are forced to be directly comparable and more easily interpretable.
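A minimal PyTorch sketch of the attentive fusion module implementing (1) and (2) is reported below; the hidden size of the MLP is an assumption.

```python
# Sketch of the attentive fusion of Eq. (1)-(2); the MLP width is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveFusion(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # one scalar per query component
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # u: image-URL embedding, v: image embedding, both of shape (B, dim)
        alphas = torch.sigmoid(self.mlp(torch.cat([u, v], dim=-1)))  # (B, 2), values in [0, 1]
        alpha_u, alpha_v = alphas[:, :1], alphas[:, 1:]
        # Weighted average of the L2-normalized inputs, as in Eq. (2)
        return alpha_u * F.normalize(u, dim=-1) + alpha_v * F.normalize(v, dim=-1)
```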

3.1.2 Training

In order to match images and captions in the same common space, we use a hinge-based triplet ranking loss focusing on hard negatives, as in [6, 7, 37, 38]. Specifically, given the final query representation \(\textbf{q}\) and the target caption feature \(\textbf{c}\), we use the following loss function:

$$\begin{aligned} L_{match}({\textbf {q}}, {\textbf {c}})&= \max _{{\textbf {c}}'} [\gamma + S({\textbf {q}}, {\textbf {c}}') - S({\textbf {q}}, {\textbf {c}})]_+ + \nonumber \\&\max _{\textbf{q}'} [\gamma + S(\textbf{q}', \textbf{c}) - S(\textbf{q}, \textbf{c})]_+, \end{aligned}$$
(3)

where \([x]_+ \equiv \text {max}(0, x)\). \(S(\textbf{q},\textbf{c})\) is the similarity function between the query vector and the target caption features. We used the standard cosine similarity as \(S(\cdot , \cdot )\). As in [37], the hard negatives are sampled from the mini-batch and not globally, for performance reasons.
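For clarity, the following sketch shows how the loss in (3) can be computed over a mini-batch with in-batch hard negatives, assuming that the i-th query matches the i-th caption; the margin value is illustrative.

```python
# Sketch of the hinge-based triplet loss of Eq. (3) with in-batch hard negatives
# (the margin value is an assumption).
import torch
import torch.nn.functional as F

def matching_loss(q: torch.Tensor, c: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    # q, c: (B, d) query and caption embeddings; the i-th query matches the i-th caption.
    q = F.normalize(q, dim=-1)
    c = F.normalize(c, dim=-1)
    sims = q @ c.t()                          # S(q_i, c_j): cosine similarities
    pos = sims.diag().unsqueeze(1)            # S(q_i, c_i)
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_c = (margin + sims - pos).clamp(min=0).masked_fill(mask, 0)      # wrong captions
    cost_q = (margin + sims - pos.t()).clamp(min=0).masked_fill(mask, 0)  # wrong queries
    # Keep only the hardest negative for each query and for each caption.
    return cost_c.max(dim=1)[0].mean() + cost_q.max(dim=0)[0].mean()
```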

3.2 The caption re-rank model

The Caption Re-Rank (CRank) model is a binary classifier based on the XLM-RoBERTa model. More specifically, the network consists of the XLM-RoBERTa model, i.e., the encoder part of a Transformer model, with the pooled output of the CLS token connected to a linear layer with an output size equal to the number of labels. The overall architecture is shown in Fig. 3.

The classification task aims to determine whether an image URL and a caption match or not. We use a processed version of the image URL to represent the image. We do not use visual information from the image in this phase. As for the Multi-modal Caption Proposal model, the URL is processed by removing the path component preceding the actual filename and any file type extension. Any underscore or dash characters in the remaining string are replaced by a space. The input of the matching process is the concatenation of the processed URL and the caption text, with a SEP token separating them.
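A sketch of the URL cleaning and input construction is shown below; the exact cleaning rules and separator handling in the released code may differ slightly (in practice, the tokenizer's text-pair encoding would insert the SEP token automatically).

```python
# Sketch of the image-URL cleaning and pair construction described above.
import os
import re
from urllib.parse import unquote

def clean_image_url(url: str) -> str:
    """Keep only the filename, drop the extension, and turn separators into spaces."""
    filename = os.path.basename(unquote(url))   # drop everything preceding the filename
    stem, _ = os.path.splitext(filename)        # drop the file type extension
    return re.sub(r"[_\-]+", " ", stem).strip() # underscores / dashes -> spaces

def build_pair_input(url: str, caption: str, sep_token: str = "</s>") -> str:
    """Concatenate the cleaned URL and the caption, separated by the model's SEP token."""
    return f"{clean_image_url(url)} {sep_token} {caption}"
```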

Fig. 3: The Caption Re-Rank model. It uses an XLM-RoBERTa masked language model pre-trained on the CommonCrawl dataset and fine-tuned on the image URL-caption match classification task. The CLS token in output from XLM-RoBERTa is attached to a feed-forward network (FFN) head, which outputs the final matching probability

3.2.1 Training

To fine-tune the pre-trained model for our classification task, we trained CRank using all the (image URL, caption) pairs available in the training dataset. The dataset explicitly defines only examples of matches. To get examples of non-matching pairs, we used a simple negative sampling strategy that randomly pairs image URLs and captions from the dataset. We generated a number of non-matching pairs equal to the number of matching ones, obtaining a training set of about 74 million examples. The training process used a batch size of 64, with each batch containing an equal amount of matching and non-matching pairs. We trained for 2 epochs, using the Adam algorithm with weight decay [51], requiring 65 hours per epoch on a single NVidia RTX2080 GPU.

3.2.2 Selection of candidates for classification

The number of classifications required for the image-caption match problem we are facing grows with the square of the size of the test set. Each classification requires passing the string representing the (image URL, caption) pair through the XLM-RoBERTa model, which takes a non-negligible time. This is structurally different from the MCProp model, in which images and captions are projected into the common space separately, thus with a cost that is linear with respect to the test set size. For MCProp, the quadratic cost is limited to the computation of the cosine similarities between the resulting vectors, which is a faster operation that can also be parallelized much more easily.

The overall cost for CRank can rapidly exceed the available computational and time resources. With a single NVidia RTX2080 GPU, the time required to compute the classification scores for all the (image URL, caption) pairs in the test set is more than three months. For comparison, computing all the pairwise similarities on embedding vectors of length 768 for the same test set requires eight minutes on a desktop CPU. Multiple GPUs can reduce the cost to a more manageable time, yet the quadratic nature of the process still makes it not scalable to larger datasets. For this reason, we employed a two-step approach, in which a faster method, e.g., MCProp, selects a smaller set of promising candidates to be processed by CRank. Having a fixed number of candidate captions per image makes the cost linear with respect to the dataset size. Using 1,000 candidate captions per image, determined using MCProp or other methods, it takes only 27 hours (compared to three months) to apply CRank to the 92K image URLs from the test set. In our experiments, we show that the final effectiveness is not affected, while the efficiency of the overall inference phase is greatly improved.
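A back-of-the-envelope comparison of the number of classifier calls in the exhaustive and the cascaded settings illustrates the scaling argument (the timings reported above were measured on the RTX2080; the snippet below only counts pairs and is purely illustrative):

```python
# Illustrative comparison of classifier calls: exhaustive vs. candidate-based re-ranking.
n = 92_367            # queries and captions in the test set
k = 1_000             # candidates kept per query by the first stage

full_pairs = n * n    # ~8.5e9 calls for exhaustive CRank scoring
cascade_pairs = n * k # ~9.2e7 calls when re-ranking only the proposed candidates
print(f"exhaustive: {full_pairs:.2e} pairs, cascade: {cascade_pairs:.2e} pairs, "
      f"reduction: {full_pairs / cascade_pairs:.0f}x")
```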

4 Experiments

4.1 Dataset

The dataset used for our experiments is the one publicly released on the Kaggle competition page. It is based on the Wikipedia-based Image Text Dataset (WIT) [9] and contains 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. There are at least 12,000 examples for each language, making the dataset particularly interesting for building a model that is not restricted to a specific language. Each example contains an image URL, from which the image can be downloaded, and the target caption.

The dataset is already divided into a training and a test set. The training set contains for each (image, image URL) pair the associated caption describing the image. On the other hand, the test set separately comprises a list of (image, image URL) pairs that compose the query and a list of captions not paired with the given queries. Each of these two lists contains 92,367 entries. The ground-truth for the test set, i.e., the (image, caption) pairs, is not released. The only way to obtain the results on the test set is to submit the inferred top-5 captions for each query to the Kaggle evaluation server.

The dataset employed in this research is available on the Kaggle challenge page (Footnote 3), while the full WIT dataset can be downloaded following the instructions provided by the authors on their GitHub repository (Footnote 4).

4.2 Evaluation

The quality of the obtained ranking is calculated considering the top 5 most similar captions for each image and applying the normalized Discounted Cumulative Gain (nDCG), the normalized version of the well-known Discounted Cumulative Gain (DCG). The rationale behind DCG is to penalize relevant items that are preceded by non-relevant items in the ordered list of results. The DCG grows with the exponential of the graded relevance of an item while it is inversely proportional to the logarithm of the item rank. The DCG at a particular rank position p is defined as follows:

$$\text {DCG}_p = \sum _{i=1}^p \frac{2^{\text {rel}_i} - 1}{\log _2{(i+1)}}, $$

where \(\text {rel}_i\) is the graded relevance of the item at position i in the results.

The nDCG normalizes DCG by its maximum theoretical value and thus returns values in the [0, 1] range. To calculate the nDCG at a specific rank position p the following formula is used:

$$\text {nDCG}_p = \frac{\text {DCG}_p}{\text {IDCG}_p},$$

where IDCG is the ideal discounted cumulative gain. Besides the nDCG metric employed in the challenge, we also compute the recall@K metric on our validation sets. This metric is widely used when there is only one relevant item for every given query, as in our case. The recall@K measures the percentage of queries for which the correct result is retrieved among the first K retrieved items.
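For reference, a minimal implementation of nDCG at rank p and recall@K under the single-relevant-item assumption is given below.

```python
# Reference implementation of the metrics described above.
import math

def dcg(relevances):
    # DCG_p = sum_i (2^rel_i - 1) / log2(i + 1), with 1-based positions
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_p(ranked_relevances, p=5):
    # ranked_relevances: graded relevance of the returned items, in ranked order
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg(ideal[:p])
    return dcg(ranked_relevances[:p]) / idcg if idcg > 0 else 0.0

def recall_at_k(rank_of_correct_item, k):
    # With a single relevant item per query, recall@K is 1 if it appears in the top K.
    return 1.0 if rank_of_correct_item <= k else 0.0
```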

Given that the final nDCG is computed by averaging the nDCG values over all queries, we can also estimate the confidence interval of the mean to evaluate the statistical significance of its values and better compare different experiments. Therefore, for the nDCG\(_k\) values in the tables reporting results on our validation set (Tables 1 and 2), we report the 95% confidence intervals.

Table 1 Caption retrieval results for the multi-modal caption proposal model on the validation set
Table 2 Caption retrieval results for the Caption Re-Rank model, the candidate selector methods, and their combination, on the small validation set (1,000 elements)

4.3 Preliminary results on the MCProp model

For training the Multi-modal Caption Proposal model, we reserved 10,000 randomly chosen examples from the given training set for validation. For validating the model, we used the main metric of the challenge, nDCG\(_5\). However, we also report the Recall@K, as it is one of the main metrics used in the cross-modal retrieval literature.

We used CLIP with a ViT backbone for the visual pipeline. The language backbone is instead XLM-RoBERTa, a large Transformer Encoder model pre-trained on a large multilingual textual corpus. Both the visual and textual backbones were frozen during training.

Fig. 4: Weights estimated by the model for the image URL and the image (\(\alpha _{u}\) and \(\alpha _{v}\)) on the validation set as training progresses

As a query, we tried two different configurations. As a first experiment, we used a compound query built by employing both the image URL (processed with the textual pipeline) and the image (processed with the image pipeline). In the second configuration, instead, we used only the image URL. These two experiments aim to understand the role of images in solving the matching task, given that the image URL alone seems already sufficient in most cases.

When both the image URL and the image are used as a query, as mentioned in Section 3.1, we used two different fusion techniques: straightforward concatenation and the attentive fusion mechanism.

Discussion

Table 1 shows the results reached on the validation set for the different experiments. In particular, we can see that when the image features are concatenated to the image URL information, the overall metrics slightly improve. Nevertheless, the performance increase is not substantial, especially for recall@1, which seems to degrade (2nd row of Table 1). Better results are obtained when using the attentive fusion approach to merge image URLs with visual information (3rd row of Table 1). The use of attentive fusion allowed us to inspect the relevance given by the model to the two different modalities that compose the query. Figure 4 shows the evolution of the two attention values on the validation set. As we can notice, the weight assigned by the model to the visual pipeline settles at around 65% of the weight assigned to the image URL pipeline. This confirms that visual information contributes less to the matching task in this scenario.

In Appendix B, we explored another inference methodology that exploits the fact that captions already assigned to an image are no longer eligible for another image.

4.4 Preliminary results on the CRank model

The CRank model takes as input an (image URL, caption) pair and returns a classification score. The higher the score, the higher the confidence of a match. Given a pool of candidate captions for an image URL, all the (image URL, caption) pairs are classified by CRank, and the pairs are then sorted by classification score, from highest to lowest.

As detailed in Section 3.2, the computational complexity and cost of CRank make this method not directly applicable to the test set without access to substantial computational resources, which we did not have. In order to test the performance of CRank independently of the candidate selection method, we used a smaller validation set of 1,000 elements (held out from the training data). The small size of this validation set made it possible to apply CRank to the whole set without performing a candidate selection first.

We then applied several methods that act as candidate selectors. Each candidate selector is used to rank all the captions, and then only the top-ranked 20% of the captions are re-ranked using the classification scores produced by CRank. In this way, we can measure which method for selecting candidates works better in combination with CRank.

Discussion

Table 2 shows that CRank obtains a very high nDCG when applied to the whole validation set, placing the right caption in the first position 74% of the time and among the first 10 results 86% of the time. As comparative baselines, we tested three methods. First, we selected the top 5 most similar captions using the Levenshtein similarity with the cleaned image URL. Second, we used the embedding vector of the CLS token produced by the XLM-RoBERTa model before the classification layer, for each caption and each image URL, to compute pairwise cosine similarities as the measure to select the top 5 captions. Finally, we applied MCProp to the small validation set. Comparing the baselines, it is interesting to see that the XLM-RoBERTa similarity method performs worse than the others. This indicates that the embeddings extracted from XLM-RoBERTa are so specialized for the classification task that they are no longer suitable for language representation. The Levenshtein-based model obtains an average performance. This is a remarkable result considering that this method is fundamentally not able to handle pairs of texts in different languages. On this smaller validation set, the MCProp method performs very well, placing the proper caption in the first position 57% of the time.

We then used the baseline methods as a way to select a reduced set of candidates (20% of the 1,000 elements in the validation set), which are then re-ranked by CRank. This two-step procedure makes CRank applicable to larger test sets. In all cases, the CRank re-ranking produces a noticeable improvement over the baselines.

When Levenshtein and XLM-RoBERTa are used, the final scores are lower than the maximum achieved by CRank on the full validation set. This indicates that both Levenshtein and XLM-RoBERTa fail to put the correct result in the top 20% of their ranks in a considerable number of cases. On the contrary, the combination of MCProp and CRank produces a slight increase in recall@1 and nDCG with respect to the use of CRank only. This is due to a positive interaction between the two methods: (i) all the elements placed by CRank in its top 5 positions are also placed by MCProp in its top 20% positions; (ii) a few cases of ties in CRank that caused the top result not to be the correct one are resolved by MCProp, which puts the tied elements in the correct order (Footnote 5).

4.5 Final results

This section reports the final results obtained on the private leaderboard of the Wikipedia challenge on Kaggle. Until the end of the competition, only the results on the public test set, composed of only 25% of the full test set, were shown to the teams. Five submissions per day were accepted on the public leaderboard; therefore, teams had the opportunity to optimize their methods on this small test subset. The results on the private leaderboard were instead computed at the end of the challenge, on 85% of the test set. For all these reasons, despite not being publicly accessible during the competition, the results on the private leaderboard are more representative of the final model ranking.

In Table 3, we report our models’ performance on the private test set. Unfortunately, it was not possible to evaluate confidence intervals, as we do not have access to the ground truth of the challenge test set. We can notice how the choices made using the validation sets are well reflected in the wider test corpus. In particular, the cascaded two-model pipeline obtains better results than all the baseline methods. This confirms the ability of the MCProp model to propose relevant candidates and the proficiency of CRank in moving correct items toward the top of the ranked list.

In Appendix C, we further tried to eliminate the need to explicitly deal with multiple languages by translating the whole test set into English. This trial did not improve the results reported in Table 3, confirming the strength of multilingual models in solving this complex matching task.

Table 3 nDCG\(_5\) of our models on the private leaderboard (85% test data)
Table 4 nDCG\(_5\) of the top-10 performing methods on the private leaderboard

In Table 4, we report the results on the private leaderboard for the top-10 methods. More than 100 teams took part in this challenge, and we placed fifth. The first participant significantly outperformed the rest, with the second also performing distinctly well. We obtained a remarkable overall result, performing very similarly to the two methods immediately ahead of us and distancing the sixth team by a substantial margin (+8% in nDCG\(_5\)). These results confirm the validity of the approach, which already demonstrated promising outcomes on the validation set. The larger performance gap between our method and the top-performing teams can be explained by our choice to keep a balance between effectiveness and efficiency, and by our limited training computational resources. However, we argue that these limitations can easily be overcome when deploying the method in large production environments offering large computing facilities.

Finally, in Fig. 5, we can qualitatively appreciate the outcomes of the proposed pipeline. We report results on the 1,000-item validation set for which the ground truth is available, and we leave out the Levenshtein distance since, as a baseline, it always produces ranks \(>10\) in all our reported examples. Figure 5a and b show success cases, where CRank successfully attracts relevant items already found by MCProp towards the top of the list. Figure 5c and d show scenarios in which CRank alone is already able to retrieve the correct caption due to the clear syntactic correspondence between the URL and the caption – although in different languages – which makes the figure unnecessary. For example, the caption in Figure 5d is the Russian translation of Wittenbeck, a municipality in northern Germany, whose name is also present in the image URL. Figure 5e and f show failure cases, where MCProp finds the correct result in a good position and CRank worsens its rank. This is probably due to the lack of visual information in CRank, which is important for discriminating among the many possible matching candidates when only the URL is considered.

Fig. 5: Qualitative examples, showing our approach’s success and failure cases. Specifically, examples (a) and (b) show the cases in which CRank successfully attracts relevant items already found by MCProp towards the top of the list; examples (c) and (d) show scenarios in which CRank alone is already able to retrieve the correct caption; examples (e) and (f) show failure cases, where MCProp finds the correct result in a good position and CRank worsens its rank

5 Conclusions

In this paper, we proposed a system able to match images with corresponding multilingual captions. This is an important tool for managing large encyclopedia websites like Wikipedia, where most article images do not have any written context connected to them. Driven by the power of recent Transformer-based models, we addressed the matching problem using a cascade of two models: Multi-modal Caption Proposal (MCProp), which efficiently proposes relevant caption candidates, and Caption Re-Rank (CRank), which re-ranks the proposals using a fine-tuned XLM-RoBERTa model. The results obtained on the validation sets show that MCProp is effective at proposing candidates and that CRank is able to bring the correct results (chosen among the candidates) towards the top of the ranked list.

We participated in the Wikipedia image-caption matching challenge proposed on Kaggle, reaching the fifth position on the private leaderboard among more than 100 participating teams with a system running on limited computational resources.

Although promising, this approach suffers from some known limitations. In particular, it seems that MCProp does not fully exploit the visual information, and, as shown in the qualitative results, the cascaded pipeline cannot always improve the results because CRank exploits only textual information. Future research directions include the use of additional contextual data available in the provided dataset during training to regularize the model and improve generalization. It would also be interesting to experiment with other fusion techniques for MCProp, to exploit visual information also in CRank, or to approach the problem by distilling efficient vision-language scores from large pre-trained vision-language Transformers, as done in ALADIN [52].