1 Introduction

The task of multimodal retrieval aims to retrieve images or videos that are semantically similar to a given text query and vice versa. In computer vision, image retrieval and video retrieval have been treated as separate tasks. For example, transformer models pretrained on large amounts of image-text pairs are fine-tuned for image retrieval (Chen et al., 2019; Jia et al., 2021; Li et al., 2021, 2022). In contrast, video retrieval models have developed along two parallel directions. The first line of work (Bain et al., 2021; Ge et al., 2022; Li et al., 2022) focused on video-text pretraining on large-scale datasets such as HowTo100M (Miech et al., 2019) and WebVid-2M (Bain et al., 2021). The second line of work (Luo et al., 2021) focused on using pretrained image features such as CLIP (Radford et al., 2021) for video retrieval, often surpassing models pretrained on video datasets.

Most current multimodal retrieval datasets contain around 10k visual inputs, with corresponding captions whose maximum lengths range from 30 to 60. In multimodal retrieval, the text and visual encoders project the input caption and the visual input (image or video) into a common embedding space. With longer captions, the text embeddings may lose the required contextual information, resulting in incorrect retrievals. One could address this by incorporating more structured knowledge (e.g., parts of speech, dependency graphs) into the text encoder (Cao et al., 2022). However, a drop in performance is observed (Cao et al., 2022) when structural knowledge is added in a text-to-video retrieval setting. One could argue that the reason is the small size of the retrieval datasets, which makes creating meaningful structural knowledge a challenging task (Fig. 1).

Fig. 1

Illustration of the improved video retrieval ranking for a test sample in the MSRVTT dataset. For the current state-of-the-art video retrieval model, the rank of the ground-truth video is 47. When multilingual data is used for knowledge transfer, the rank of the ground-truth video improves significantly from 47 to 2 (Color figure online)

There are over 6000 languages in the world, each with its own vocabulary, grammar and morphology. However, there exists an overlap of knowledge among these languages (Hu et al., 2020; Artetxe et al., 2019; Wu, 2019). In natural language processing, some works (Conneau and Lample, 2019) explored the idea of using multilingual data to improve performance on monolingual English datasets. Conneau and Lample introduced a new pretraining objective, Translation Language Modeling (TLM), in which random words are masked in concatenated English and multilingual sentences and the model predicts the masked words. The objective is to use the multilingual context to predict masked English words when the English context alone is not sufficient, and vice versa.

Most current works in multimodal retrieval focus primarily on monolingual English datasets. Recently, a few works have proposed models that can perform retrieval in a multilingual setting (Huang et al., 2021; Zhou et al., 2021; Ni et al., 2021; Burns et al., 2020). However, these works use large amounts of multilingual data for pretraining and are tuned for retrieval on each language separately. Moreover, these models are limited to performing either image retrieval or video retrieval. In addition, they show a drop in performance on English retrieval datasets even though they gain performance on multilingual data.

In this work, we pursue two objectives: (i) design a model capable of performing retrieval on both images and videos, and (ii) design a model capable of performing retrieval in multiple languages, including English. Multilingual data serves as a powerful knowledge augmentation for monolingual models (Conneau and Lample, 2019). Nevertheless, creating multilingual data requires substantial human effort. To overcome this, we use state-of-the-art machine translation models (Tang et al., 2020) to translate English text captions into other languages. Specifically, we choose languages whose performance on the XNLI benchmark (Conneau et al., 2018) is comparable to that of English (i.e., French, German, Spanish). This lets us create high-quality multilingual data without human labelling. To the best of our knowledge, this is the first work that uses multilingual knowledge transfer to improve multimodal retrieval.

We propose a new framework, MuMUR (Multilingual Multimodal Universal Retrieval), that is capable of performing both image and video retrieval on multilingual datasets. The framework is based on CLIP (Radford et al., 2021) to effectively utilize and adapt multilingual knowledge transfer. Our model takes a visual input, an English text caption and a multilingual text caption as inputs and extracts joint visual-text representations. The multilingual text representations act as a knowledge augmentation to the English text representations, aiding retrieval. For this purpose, we introduce a dual cross-modal (DCM) encoder block that learns the similarity between the English text representations and the visual representations. In addition, the DCM encoder block associates the visual representations with the multilingual text representations. In the common embedding space, our model learns important contextual information from the multilingual representations that is otherwise missing from the English text representations, effectively serving as knowledge transfer.

We validate our proposed model on a comprehensive set of benchmarks: the image retrieval dataset Flickr30k (Plummer et al., 2015) and the video retrieval datasets MSRVTT-9k (Xu et al., 2016), MSRVTT-7k (Xu et al., 2016), MSVD (Chen and Dolan, 2011), DiDeMo (Anne Hendricks et al., 2017) and Charades (Sigurdsson et al., 2016). We show that our approach achieves state-of-the-art results, outperforming previous models on most datasets. In addition to the evaluation on monolingual retrieval datasets, we also compare the performance of our model on the multilingual datasets Multi30k (Elliott et al., 2016) and MSRVTT multilingual (Huang et al., 2021). Experimental results demonstrate that MuMUR achieves state-of-the-art results on all English video retrieval datasets and significantly outperforms previous models on multilingual video retrieval in a zero-shot setting. Furthermore, MuMUR establishes a new benchmark on multilingual image retrieval while achieving strong performance on English image retrieval. These results demonstrate the universal capability of MuMUR to perform all types of multimodal and multilingual retrieval.

To summarize, our contributions are as follows: (i) We generate multilingual data using external state-of-the-art machine translation models. (ii) We propose a model that is capable of knowledge transfer from multilingual data to improve the performance of multimodal retrieval. (iii) We evaluate the proposed framework on six English retrieval benchmarks and achieve state-of-the-art results in both text-to-visual and visual-to-text retrieval settings. (iv) Finally, we demonstrate that our model significantly outperforms previous approaches on multilingual retrieval datasets.

2 Related work

2.1 Multimodal retrieval

Pre-train then fine-tune is the most popular paradigm for image retrieval (Chen et al., 2019; Li et al., 2020, 2021, 2022). These models are pretrained on huge amounts of image-text pairs such as Conceptual Captions (Changpinyo et al., 2021), Visual Genome (Krishna et al., 2017) and SBU (Lu et al., 2020), and tested on image retrieval datasets such as Flickr30k (Plummer et al., 2015) and COCO (Karpathy and Fei-Fei, 2015).

The task of video retrieval has seen tremendous progress in recent years. This is partly due to the availability of large-scale video datasets like HowTo100M (Miech et al., 2019) and WebVid-2M (Bain et al., 2021). In addition, the adaptation of transformers to image tasks such as image classification (Dosovitskiy et al., 2020) spurred the development of transformer-based video models. However, videos require considerably more memory and compute, and computing full self-attention matrices over them can be infeasible. With the introduction of more efficient architectures (Bertasius et al., 2021), large-scale pretraining on videos became possible. In this direction, several transformer-based architectures (Bain et al., 2021; Ge et al., 2022; Madasu et al., 2022, 2023) were proposed and pretrained on large video datasets, achieving state-of-the-art results on downstream video retrieval datasets in both zero-shot and fine-tuning settings.

In a parallel direction, a few works (Luo et al., 2021) have adapted image-level features pretrained on large-scale image-text pairs to perform video retrieval. Surprisingly, these works perform significantly better than models pretrained from scratch on large-scale video datasets. Compared to these models, our approach differs completely in architecture and training methodology.

2.2 Multilingual training

The recent success of multimodal image-text models on a variety of tasks, such as retrieval and question answering, has been mostly limited to monolingual models trained on English text. This is mainly due to the availability and high quality of English-based multimodal datasets. Recent work indicates that incorporating a second language or a multilingual encoder, thus creating a shared multilingual token embedding space, can improve monolingual pure-NLP downstream tasks (Conneau and Lample, 2019). This concept was rapidly embraced for training multimodal models. Previous works used images as a bridge for translating between two languages without requiring a shared language-to-language training dataset (Chen and Dolan, 2011; Surís et al., 2022; Sigurdsson et al., 2020).

Recent work has focused on multimodal tasks, such as image retrieval, aiming to add multilingual capabilities to multimodal models (Burns et al., 2020; Gella et al., 2017). These works often indicate that incorporating a second language during training improves performance on single-language multimodal tasks such as image retrieval, compared to multimodal models trained on a single language (Gella et al., 2017; Kim et al., 2020; Wehrmann et al., 2019). MULE (Kim et al., 2020), a multilingual universal language encoder trained on image-multilingual text pairs, showed an improvement of up to 20% on image-sentence retrieval compared to monolingual models. Nevertheless, all these previous works design models separately for image and video retrieval. Our objective is to use multilingual knowledge transfer to improve performance on current image and video retrieval datasets. In addition, our model is capable of performing multilingual retrieval in more than 10 languages.

Fig. 2

Illustration of the proposed MuMUR model. The model takes as input a visual input (image or video), a corresponding English text query and a translated multilingual query. The multilingual text query is obtained with an off-the-shelf machine translation model, which is used only in inference mode to generate these captions and is not part of the architecture. The video and English text features are extracted using the CLIP model, whereas the multilingual text features are extracted using the M-CLIP model. The features are then passed to a cross-modal encoder to learn their association in a common embedding space. A cross-entropy loss is applied to measure the similarity between the features \(R_{E}\) and \(R_{vE}\), and between \(R_{M}\) and \(R_{vM}\). The final loss is the sum of both losses (Color figure online)

3 MuMUR: Multilingual Multimodal Universal Retrieval

In this section, we introduce our framework MuMUR: Multilingual Multimodal Universal Retrieval. We first describe the problem statement, then the multilingual data augmentation strategy, and finally the proposed approach that enables knowledge transfer from multilingual data for video retrieval.

3.1 Problem statement

Given a set of visual data V, their corresponding English text captions E and related multilingual text captions M, our goal is to learn similarity functions \(s_{1}(v_{i}, e_{i})\) and \(s_{2}(v_{i}, m_{i})\), where \(v_{i} \in V\), \(e_{i} \in E\) and \(m_{i} \in M\). In other words, we propose a framework, MuMUR, that enables end-to-end learning on tuples of visual input, English text caption and multilingual text caption by bringing the joint representations of these three elements closer together. Specifically, for each visual input in V and its English text caption in E, we generate the multilingual translation in M using an external state-of-the-art machine translation model (Tang et al., 2020). Next, we present the proposed approach that facilitates end-to-end learning using multilingual data.
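As an illustration of this data-generation step, the sketch below translates English captions into French with a publicly available mBART-50 checkpoint on Hugging Face; the specific checkpoint, language codes and generation settings are assumptions for illustration, not necessarily the ones used in our pipeline.

```python
# Hedged sketch of multilingual caption generation with an mBART-50
# many-to-many checkpoint (assumed stand-in for the translation model).
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

MT_NAME = "facebook/mbart-large-50-many-to-many-mmt"   # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(MT_NAME, src_lang="en_XX")
mt_model = MBartForConditionalGeneration.from_pretrained(MT_NAME)

def translate(captions, tgt_lang="fr_XX", max_length=64):
    """Translate a batch of English captions into the target language."""
    batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    out = mt_model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],  # force French output
        max_length=max_length,
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)

french_captions = translate(["a man is playing a guitar on stage"])
```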

3.2 Approach

Our model, illustrated in Fig. 2, comprises three components: (i) a visual encoder, (ii) a text encoder and (iii) a dual cross-modal encoder. Next, we describe each component in detail.

3.2.1 Visual encoder

Given a visual input V, we consider uniformly sampled frames \(C \in R^{N_{v} \times H \times W \times 3}\), where \(N_{v}\) is the number of frames (1 for images) and H and W are the spatial dimensions of an RGB frame. We then use a pretrained CLIP-ViT image encoder (Radford et al., 2021) to extract the frame embeddings \(F_{v} \in R^{N_{v} \times D_{v}}\), where \(D_{v}\) denotes the dimension of the frame embeddings. The frame embeddings are concatenated to obtain the final representation of the visual input V.
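A minimal sketch of this step with the Hugging Face CLIP ViT-B/32 implementation is shown below; the checkpoint name and the use of `get_image_features` are tooling assumptions rather than part of the original description.

```python
# Hedged sketch of the visual encoder: encode N_v sampled frames (or a single
# image) with a pretrained CLIP ViT-B/32 image encoder.
import torch
from transformers import CLIPModel, CLIPImageProcessor

CLIP_NAME = "openai/clip-vit-base-patch32"             # assumed checkpoint
clip = CLIPModel.from_pretrained(CLIP_NAME)
image_processor = CLIPImageProcessor.from_pretrained(CLIP_NAME)

@torch.no_grad()
def encode_frames(frames):
    """frames: list of N_v PIL images (N_v = 1 for an image input).
    Returns F_v with shape (N_v, D_v); D_v = 512 for ViT-B/32."""
    inputs = image_processor(images=frames, return_tensors="pt")
    return clip.get_image_features(**inputs)
```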

3.2.2 Text encoder

Let the input English text caption be E and the multilingual text caption be M, of lengths p and q respectively. We use a pretrained CLIP-ViT text encoder to convert the English text caption into a sequence of embeddings \(R_{E} \in R^{p \times D_{E}}\), where \(D_{E}\) denotes the embedding dimension. We take the representation of the [EOS] token as the final representation of the English text caption. To encode the multilingual text caption M, we use the M-CLIP model, a multilingual CLIP model pretrained on multilingual text-image pairs. Specifically, the multilingual text caption is converted into a sequence of embeddings \(R_{M} \in R^{q \times D_{M}}\), where \(D_{M}\) denotes the embedding dimension. As with the CLIP model, we take the [EOS] representation as the final output of the M-CLIP model.
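The sketch below mirrors the two text branches: the English branch uses the Hugging Face CLIP text encoder (which pools at the [EOS] position internally), and a sentence-transformers multilingual CLIP-aligned text model is used as a stand-in for M-CLIP; both checkpoint names are assumptions.

```python
# Hedged sketch of the dual text encoders: CLIP for English captions and a
# multilingual CLIP-aligned text model as a stand-in for M-CLIP.
import torch
from transformers import CLIPModel, CLIPTokenizerFast
from sentence_transformers import SentenceTransformer

CLIP_NAME = "openai/clip-vit-base-patch32"             # assumed checkpoint
clip = CLIPModel.from_pretrained(CLIP_NAME)
clip_tokenizer = CLIPTokenizerFast.from_pretrained(CLIP_NAME)
# Assumed stand-in for the M-CLIP text encoder.
mclip = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

@torch.no_grad()
def encode_captions(english_caption, multilingual_caption):
    en = clip_tokenizer([english_caption], return_tensors="pt", padding=True)
    r_e = clip.get_text_features(**en)                          # R_E, shape (1, 512)
    r_m = torch.as_tensor(mclip.encode([multilingual_caption])) # R_M, shape (1, 512)
    return r_e, r_m

R_E, R_M = encode_captions("a dog catches a frisbee", "un chien attrape un frisbee")
```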

3.2.3 Dual cross-modal encoder (DCM)

Our goal is to closely associate the visual embeddings \(F_{v}\), the English text embeddings \(R_{E}\) and the multilingual text embeddings \(R_{M}\) in a common embedding space. For this purpose, we propose a dual cross-modal encoder (DCM). To incorporate textual information into the visual features and to learn visual features that are semantically closest to the text features, we use multi-head attention, with the text features as the queries and the visual features as the keys and values.

$$r_{vE} = \mathrm{Attention}(R_{E}, F_{v}, F_{v})$$
(1)
$$r_{vM} = \mathrm{Attention}(R_{M}, F_{v}, F_{v})$$
(2)

where multi-head attention (Attention) is defined as:

$$\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
(3)

Here Q, K and V are the query, key and value matrices, as in the original transformer multi-head attention. We then apply a fully connected layer to the attention outputs, followed by a residual connection and layer normalization, to obtain \(R_{vE}\) and \(R_{vM}\).

$$R_{vE} = LN(FC(r_{vE}) + r_{vE})$$
(4)
$$R_{vM} = LN(FC(r_{vM}) + r_{vM})$$
(5)

where FC is the fully connected layer and LN is layer normalization.
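A compact PyTorch sketch of the DCM block described by Eqs. (1)-(5) is shown below; the hidden size, the number of attention heads, and whether the English and multilingual branches share weights are assumptions here.

```python
# Hedged PyTorch sketch of the dual cross-modal (DCM) encoder block.
import torch
import torch.nn as nn

class DCMBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8, dropout=0.4):   # sizes are assumptions
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout,
                                          batch_first=True)
        self.fc = nn.Linear(dim, dim)
        self.ln = nn.LayerNorm(dim)

    def forward(self, text, frames):
        # text:   (B, 1, D) pooled caption embedding used as the query
        # frames: (B, N_v, D) frame embeddings used as keys and values
        r, _ = self.attn(query=text, key=frames, value=frames)  # Eqs. (1)-(3)
        return self.ln(self.fc(r) + r).squeeze(1)               # Eqs. (4)-(5)

dcm = DCMBlock()  # a single block is reused for both branches in this sketch
F_v = torch.randn(4, 16, 512)
R_vE = dcm(torch.randn(4, 1, 512), F_v)   # English branch
R_vM = dcm(torch.randn(4, 1, 512), F_v)   # multilingual branch
```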

3.2.4 Loss

We use the standard image-text or video-text matching loss (Wu and Zhai, 2019) to train the model. It is measured via the dot-product similarity between matching text and visual embeddings within a batch. First, we compute the loss \(L_{E}\) between \(R_{vE}\) and \(R_{E}\), and then the loss \(L_{M}\) between \(R_{vM}\) and \(R_{M}\). The final loss is the sum of \(L_{E}\) and \(L_{M}\).

$$\begin{aligned} L = L_{E} + L_{M}. \end{aligned}$$
(6)

where \(L_{E} = L_{E}^{t2v} + L_{E}^{v2t}\) and \(L_{M} = L_{M}^{t2v} + L_{M}^{v2t}\).

$$\mathcal{L}_{E}^{\mathrm{v2t}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{E}^{(i)} \cdot R_{vE}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{E}^{(i)} \cdot R_{vE}^{(j)}\right)},$$
(7)
$$\mathcal{L}_{E}^{\mathrm{t2v}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{vE}^{(i)} \cdot R_{E}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{vE}^{(i)} \cdot R_{E}^{(j)}\right)},$$
(8)
$$\mathcal{L}_{M}^{\mathrm{v2t}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{M}^{(i)} \cdot R_{vM}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{M}^{(i)} \cdot R_{vM}^{(j)}\right)},$$
(9)
$$\mathcal{L}_{M}^{\mathrm{t2v}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{vM}^{(i)} \cdot R_{M}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{vM}^{(i)} \cdot R_{M}^{(j)}\right)}.$$
(10)
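A minimal sketch of Eqs. (6)-(10) as in-batch symmetric cross-entropy over dot-product similarities follows; no temperature scaling is shown, matching the equations as written.

```python
# Hedged sketch of the matching loss: symmetric cross-entropy over in-batch
# dot-product similarities, applied to both text branches.
import torch
import torch.nn.functional as F

def branch_loss(text_emb, visual_emb):
    """text_emb, visual_emb: (B, D). Returns L^v2t + L^t2v for one branch."""
    sim = text_emb @ visual_emb.t()                      # sim[i, j] = R^(i) . R_v^(j)
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)

def total_loss(R_E, R_vE, R_M, R_vM):
    return branch_loss(R_E, R_vE) + branch_loss(R_M, R_vM)   # Eq. (6)
```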

3.2.5 Inference

During inference on English retrieval datasets, the multilingual branch is not used and retrieval performance is measured only with \(R_{E}\) and \(R_{vE}\). Similarly, for multilingual datasets, the English branch is not used and the retrieval score is computed with \(R_{M}\) and \(R_{vM}\).
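For completeness, a sketch of inference-time scoring is given below: only the embeddings of the selected branch are compared, and gallery items are ranked by dot-product similarity (using a plain dot product mirrors the training loss and is an assumption about the scoring function).

```python
# Hedged sketch of retrieval scoring: rank gallery items for each query by
# dot-product similarity, using only the embeddings of the selected branch.
import torch

@torch.no_grad()
def rank_gallery(query_emb, gallery_emb):
    """query_emb: (Q, D), gallery_emb: (G, D). Returns (Q, G) ranked indices."""
    sim = query_emb @ gallery_emb.t()
    return sim.argsort(dim=-1, descending=True)        # best match first

# English evaluation uses (R_E, R_vE); multilingual evaluation uses (R_M, R_vM).
```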

4 Experiments

4.1 Datasets

Our goal is to design a universal multimodal multilingual retrieval model. Therefore, our experiments evaluate both image and video retrieval datasets comprising monolingual and multilingual captions.

4.1.1 Video retrieval

We perform experiments on six standard text-video retrieval datasets: MSRVTT-9k and MSRVTT-7k splits (Xu et al., 2016), MSVD (Chen and Dolan, 2011), DiDeMo (Anne Hendricks et al., 2017), Charades (Sigurdsson et al., 2016) and MSRVTT multilingual (Huang et al., 2021).

MSRVTT contains 10K videos, each ranging from 10 to 32 s, and 200K captions. We report results on both the MSRVTT-9k and MSRVTT-7k splits following Luo et al. (2021).

MSVD consists of 1970 videos and 80K descriptions. We use the standard training, validation and test splits following Luo et al. (2021). In this dataset, each video has multiple captions, which are treated as independent samples during testing.

DiDeMo is made up of 10K videos and 40K localized descriptions. We concatenate all sentences for each video and evaluate paragraph-to-video retrieval following Lin et al. (2022) and Luo et al. (2021).

Charades contains 9848 videos, each associated with a single caption. We use the standard training and test splits following Lin et al. (2022).

MSRVTT multilingual is a multilingual version of MSRVTT in which the English captions are translated into nine different languages. We use the standard splits following Huang et al. (2021).

4.1.2 Image retrieval

We evaluate the proposed approach MuMUR on the following image retrieval datasets:

Flickr30k contains 31,000 images; each image has a single caption in the training and validation sets and 5 captions in the test set. We follow the standard 29k/1k/1k splits (Li et al., 2022).

Multi30K (Elliott et al., 2016) is a multilingual version of Flickr30k (Plummer et al., 2015) in which the English text captions are translated into German (de), French (fr) and Czech (cs) languages. We use the training, validation and testing splits of 29k/1k/1k following Ni et al. (2021).

4.2 Metrics

To evaluate model performance, we use recall at rank K (R@1, R@5, R@10), median rank (MedR) and mean rank (MnR). Unless otherwise specified, the reported values are the mean of three runs with different seeds.
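These metrics can be computed from a query-by-gallery similarity matrix as sketched below, assuming the ground-truth item for query i sits at gallery index i.

```python
# Hedged sketch of R@K, median rank (MedR) and mean rank (MnR) from a
# similarity matrix whose ground-truth matches lie on the diagonal.
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    order = np.argsort(-sim, axis=1)                    # best match first
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1          # 1-indexed rank of ground truth
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics
```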

Table 1 Text-to-video and video-to-text retrieval results on MSR-VTT dataset 9k split

4.3 Implementation details

We use French translations as the multilingual inputs to train the MuMUR model. The visual encoder and the English text encoder are initialized with CLIP-ViT-B-32. The multilingual text encoder is initialized with M-CLIP-ViT-B-32. The dimension of the video, English caption and multilingual caption representations is 512. The dual cross-modal encoder is initialized randomly and trained from scratch. The dimension of the key, query and value projection layers is 512. The fully connected layer in the transformer has a size of 512, and a dropout of 0.4 is applied to this layer. We use 16 frames for the MSRVTT-9k, MSRVTT-7k and MSVD datasets, and 42 frames for the DiDeMo and Charades datasets. The maximum sequence length is set to 32 for MSRVTT-9k and MSRVTT-7k, 64 for DiDeMo and 30 for Charades. The model is trained using AdamW (Loshchilov and Hutter, 2018) with a learning rate of 1e-4 and a cosine decay to 1e-6. MSRVTT-9k and MSRVTT-7k are trained with a batch size of 32 for 15 epochs. MSVD is trained with a batch size of 32 for 5 epochs. DiDeMo and Charades are trained with a batch size of 16 for 12 and 15 epochs respectively.
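The optimization setup above corresponds to something like the following sketch; the placeholder model and the number of training steps are assumptions.

```python
# Hedged sketch of the optimizer and schedule: AdamW with lr 1e-4 and a
# cosine decay of the learning rate down to 1e-6.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(512, 512)                 # placeholder for the full MuMUR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
steps_per_epoch = 9000 // 32                      # assumed dataset/batch sizes (MSRVTT-9k)
scheduler = CosineAnnealingLR(optimizer, T_max=15 * steps_per_epoch, eta_min=1e-6)

# Per training step: loss = L_E + L_M; loss.backward();
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```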

Table 2 Text-to-video retrieval results on MSR-VTT - 7k split
Table 3 Text-to-video retrieval results on MSVD dataset (multi-caption evaluation)
Table 4 Text-to-video retrieval results on DiDeMo dataset
Table 5 Text-to-video retrieval results on Charades dataset
Table 6 Text-to-image and image-to-text retrieval results on Flickr30k (Plummer et al., 2015) dataset split
Table 7 Text-to-video retrieval (R@1 metric) results on MSR-VTT - multilingual (Huang et al., 2021)
Table 8 Text-to-image retrieval (R@1 metric) results on Multi30K (Elliott et al., 2016) dataset (de - German, fr - French and cs - Czech)

5 Results and discussion

5.1 Evaluation on English video retrieval datasets

In Table 1 we report the results of our proposed approach on the MSRVTT-9k split. The gap between CLIP-based models and other models is substantial (\(> 5\%\)), which motivates building our model on CLIP features. On the MSRVTT-9k split, our model significantly outperforms the CLIP4Clip model on all metrics in both text-to-video and video-to-text retrieval. VCM employs a knowledge graph between the video and text modalities, making its video-to-text retrieval performance superior to other models. Our model surpasses VCM significantly on all metrics, indicating that the multilingual representations serve as a powerful form of knowledge transfer. Moreover, our approach outperforms MCQ and MILES, which are pretrained on WebVid-2M, initialized with CLIP features and employ additional semantic information such as parts of speech. This shows that our model does not require video pretraining or structural knowledge injection; the multilingual text representations effectively serve this purpose.

In Tables 2, 3, 4 and 5 we report the results on the MSRVTT-7k, MSVD, DiDeMo and Charades datasets respectively. Our model outperforms all previous approaches across all metrics on all datasets. On the MSRVTT-7k split, our model achieves a significant boost of 2.1%, 2.5% and 4.4% in R@1, R@5 and R@10 respectively over the previous baselines. On MSVD, we observe improvements of 0.2%, 0.5% and 0.3% in R@1, R@5 and R@10 respectively. MSVD is a relatively small dataset with a test set of 670 videos; hence, the improvements are relatively marginal.

On the DiDeMo dataset, our model shows a marginal boost of 0.2% in R@1 but significant boosts of 4.3% and 2.9% in R@5 and R@10 respectively over previous approaches. On the Charades dataset, our model outperforms previous approaches by 0.9% in R@1 and by significant margins of 4.6%, 7.6% and 6.0% in R@5, R@10 and MedR respectively. ECLIPSE uses audio as additional information for video retrieval; we show that multilingual text acts as a better knowledge transfer signal.

5.2 Evaluation on English image retrieval datasets

Next, we measure the performance of MuMUR on the English image retrieval dataset Flickr30k. The results for both image-to-text and text-to-image retrieval are reported in Table 6. As shown in the table, MuMUR achieves performance comparable to previous approaches on image-to-text retrieval. Moreover, MuMUR significantly outperforms these models by 5.5% in text-to-image retrieval. Note that these models are pretrained on large amounts of image-text pairs and fine-tuned on Flickr30k. In contrast, our model uses a small amount of multilingual data and achieves remarkable results in both settings. This validates that multilingual data acts as a superior knowledge transfer signal for image retrieval as well.

5.3 Evaluation on multilingual video retrieval datasets

In addition to the monolingual datasets, we also evaluate the proposed approach on multilingual video retrieval. Specifically, we take the model trained only with French captions and test it on six languages, German (de), Czech (cs), Chinese (zh), Swahili (sw), Russian (ru) and Spanish (es), in a zero-shot setting. Table 7 shows the results on the MSRVTT-multilingual dataset. Our model achieves a significant boost of 8.2% (average) in R@1 in the zero-shot setting. It is worth noting that our model, evaluated zero-shot, outperforms previous approaches fine-tuned on these languages by a large margin of 6.1% (average). MMP (Huang et al., 2021) is pretrained on the large-scale multilingual HowTo100M dataset in 9 languages, yet our model trained on just one language outperforms MMP. This shows that our dual cross-modal (DCM) encoder block can effectively learn the association among video, English and multilingual representations even without large-scale video pretraining.

5.4 Evaluation on multilingual image retrieval datasets

We also evaluate MuMUR on the multilingual image retrieval dataset Multi30K. Table 8 shows the results for the three languages German (de), French (fr) and Czech (cs), with the model fine-tuned on these languages. As shown in the table, MuMUR significantly outperforms previous models by 3.1% for German, 3.2% for French and 4.3% for Czech.

5.5 Ablation studies

5.5.1 Effect of multilingual knowledge transfer

We investigate the effect of multilingual knowledge transfer on video retrieval performance. Specifically, we train a model without the multilingual text encoder, keeping the rest of the architecture intact. As shown in Fig. 3, using multilingual data for knowledge transfer significantly improves performance on the DiDeMo and Charades datasets: the improvement is 2.4% for DiDeMo and 3.62% for Charades.

Fig. 3

Comparison of models with and without using multilingual data as input. The first model takes as input only video and English text captions whereas the second model takes video, English text and multilingual text captions as input. As shown in the figure, using multilingual text data as knowledge transfer significantly improved the performance (Color figure online)

Fig. 4

Comparison of models consisting of only multilingual text encoder and multilingual text encoder + English text encoder. Using a separate English text encoder for encoding English text captions outperforms the model using multilingual text encoder to encode English text captions (Color figure online)

Fig. 5

Comparison of models with and without DCM block in the architecture. Using DCM block in the architecture showed superior performance to models without the DCM block (Color figure online)

Fig. 6

Comparison of MuMUR trained with different multilingual caption data. It is evident from the figure that training with more languages improved the performance (Color figure online)

Fig. 7

Figure shows the effect of the number of video frames used to train MuMUR model on MSRVTT-9k dataset. Results demonstrate that R@1 score is the highest when 16 frames are used in text-to-video and video-to-text settings (Color figure online)

Fig. 8

Figure demonstrates the performance of MuMUR for varying number of video frames on MSVD dataset (single caption evaluation). It is evident that the R@1 score is maximum at 16 frames when evaluated in both text-to-video and video-to-text settings (Color figure online)

Fig. 9

We ablate the sampling strategy used for selecting video frames on MSRVTT-9k dataset. We observe that uniform and random sampling techniques achieve similar performance for all the video frames (Color figure online)

Fig. 10

Figure illustrates the comparative performance of random and uniform video frame sampling on the MSVD dataset. It is evident that random sampling significantly outperforms uniform sampling but fails for a larger number of video frames (Color figure online)

5.5.2 Using only multilingual text encoder

Next, we ablate the choice of using a separate English text encoder. We showed above that multilingual data improves video retrieval performance. This raises the question: why is a separate English text encoder required if the multilingual text encoder can produce both English and multilingual text representations? In Fig. 4, we show the results of two model variants. The first uses a separate English text encoder to encode the English captions, whereas the second encodes both the English and multilingual text with the same multilingual text encoder. The results show that encoding English captions with a separate English text encoder surpasses using the multilingual text encoder for both. Multilingual pretraining includes only a portion of English data, whereas the English text encoder is pretrained on comparatively larger English corpora. Hence, a separate English text encoder gives much better performance than using the multilingual text encoder for English text.

5.5.3 Effectiveness of dual cross encoder block

Next, we ablate the effectiveness of the dual cross-modal encoder block. We train a model without the DCM block and directly compute the losses between the video and English text representations, and between the video and multilingual text representations. From Fig. 5, we see that the model with the DCM block achieves better performance than the model without it, which justifies our choice to use the DCM block in our model.

5.5.4 Training with more languages

Next, we ablate training our model with more than one language. Concretely, we train our model with German (de) and Spanish (es) captions. These languages are chosen because their performance on the XNLI benchmark (Conneau et al., 2018) is comparable to English. The results in Fig. 6 show that training with more languages improves video retrieval performance. These results validate that multilingual data acts as an effective knowledge transfer mechanism for improving video retrieval.

5.5.5 Effect of video frames

We investigate the impact of the number of video frames on retrieval performance. We train the MuMUR model with an increasing number of video frames in intervals of 4. Figures 7 and 8 show the results of this ablation on the MSRVTT-9k and MSVD datasets respectively. As shown in the figures, the R@1 score is highest when the model is trained with 16 video frames in both text-to-video and video-to-text retrieval. These results motivate the use of 16 frames for training MuMUR on video retrieval datasets.

5.5.6 Effect of video frame sampling

Next, we study the impact of choosing video frames at random versus uniformly. Figures 9 and 10 illustrate the results for varying numbers of video frames on the MSRVTT-9k and MSVD datasets respectively. On MSRVTT-9k, uniform and random sampling achieve nearly the same performance. On MSVD, however, random sampling performs considerably better than uniform sampling, although it fails when the number of frames exceeds 20. The two strategies are sketched below.
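The sketch compares uniform (evenly spaced indices) and random frame-index sampling; sorting the randomly chosen indices to preserve temporal order is an assumption.

```python
# Hedged sketch of uniform vs. random frame-index sampling.
import numpy as np

def uniform_sample(num_total, num_frames=16):
    return np.linspace(0, num_total - 1, num_frames).astype(int)

def random_sample(num_total, num_frames=16, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.choice(num_total, size=num_frames, replace=num_total < num_frames)
    return np.sort(idx)                               # keep temporal order (assumption)

print(uniform_sample(120))
print(random_sample(120, seed=0))
```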

6 Conclusion

In this paper, we introduced MuMUR, a multilingual knowledge transfer framework that improves the performance of multimodal retrieval. We constructed multilingual captions using off-the-shelf state-of-the-art machine translation models. We then proposed a CLIP-based model that enables multilingual knowledge transfer through a dual cross-modal encoder block. Experimental results on six standard multimodal retrieval datasets show that our framework achieves state-of-the-art results on all datasets. Finally, our model also outperforms previous approaches on multilingual retrieval datasets in both zero-shot and fine-tuned settings. In future work, we will focus on more efficient ways of achieving multilingual knowledge transfer for multimodal retrieval.