1 Introduction

The task of multimodal retrieval aims to retrieve images or videos that are semantically similar to a given text query and vice versa. In computer vision, image retrieval and video retrieval have been treated as separate tasks. For example, transformer models pretrained on large amounts of image-text pairs are fine-tuned for image retrieval (Chen et al., 2019; Jia et al., 2021; Li et al., 2021, 2022). In contrast, video retrieval models have developed along two parallel directions. The first line of work (Bain et al., 2021; Ge et al., 2022; Li et al., 2022) focused on video-text pretraining on large-scale datasets such as HowTo100M (Miech et al., 2019) and WebVid-2M (Bain et al., 2021). The second line of work (Luo et al., 2021) focused on using pretrained image features such as CLIP (Radford et al., 2021) for video retrieval, often surpassing models pretrained on video datasets.

Most current multimodal retrieval datasets contain around 10k visual inputs, with corresponding captions whose maximum lengths range from 30 to 60. In multimodal retrieval, the text and visual encoders project the input caption and the visual input (image or video) into a common embedding space. With longer captions, the text embeddings may lose the required contextual information, resulting in incorrect retrievals. One could address this by incorporating more structured knowledge (e.g., parts of speech, dependency graphs) into the text encoder (Cao et al., 2022). However, a drop in performance is observed (Cao et al., 2022) when structural knowledge is added in a text-to-video retrieval setting. One could argue that the reason is the small size of the retrieval datasets, which makes creating meaningful structural knowledge a challenging task (Fig. 1).

Fig. 1

Illustration of the improved video retrieval ranking for a test sample in the MSRVTT dataset. For the current state-of-the-art video retrieval model, the rank of the ground-truth video is 47. When multilingual data is used for knowledge transfer, the rank of the ground-truth video improves significantly from 47 to 2 (Color figure online)

There are over 6000 languages in the world, each with its own vocabulary, grammar and morphology. However, there exists an overlap of knowledge among these languages (Hu et al., 2020; Artetxe et al., 2019; Wu, 2019). In natural language processing, some works (Conneau and Lample, 2019) explored the idea of using multilingual data to improve performance on monolingual English datasets. Conneau and Lample introduced a new pretraining objective, Translation Language Modeling (TLM), in which random words are masked in concatenated English and multilingual sentences and the model predicts the masked words. The objective is to use the multilingual context to predict masked English words when the English context alone is not sufficient, and vice versa.

Most current works in multimodal retrieval focus primarily on monolingual English datasets. Recently, a few works have proposed models that can perform retrieval in a multilingual setting (Huang et al., 2021; Zhou et al., 2021; Ni et al., 2021; Burns et al., 2020). However, these works use large amounts of multilingual data for pretraining and are tuned for retrieval on each language separately. Moreover, these models are limited to performing either image retrieval or video retrieval. In addition, they show a drop in performance on English retrieval datasets even though they gain performance on multilingual data.

In this work, we pursue two objectives: (i) design a model capable of performing retrieval on both images and videos, and (ii) design a model capable of performing retrieval in multiple languages, including English. Multilingual data serves as a powerful knowledge augmentation for monolingual models (Conneau and Lample, 2019). Nevertheless, creating multilingual data requires substantial human effort. To overcome this, we use state-of-the-art machine translation models (Tang et al., 2020) to translate English text captions into other languages. Specifically, we choose languages whose performance on the XNLI benchmark (Conneau et al., 2018) is comparable to that of English (i.e., French, German, Spanish). This lets us create high-quality multilingual data without human labelling. To the best of our knowledge, this is the first work that uses multilingual knowledge transfer to improve multimodal retrieval.

We propose a new framework, MuMUR (Multilingual Multimodal Universal Retrieval), that is capable of performing both image and video retrieval on multilingual datasets. The framework is based on CLIP (Radford et al., 2021) to effectively utilize and adapt multilingual knowledge transfer. Our model takes a visual input, an English text caption and a multilingual text caption as inputs and extracts joint visual-text representations. The multilingual text representations act as a knowledge augmentation to the English text representations, aiding retrieval. For this purpose, we introduce a dual cross-modal (DCM) encoder block that learns the similarity between the English text representations and the visual representations. In addition, the DCM encoder block associates the visual representations with the multilingual text representations. In the common embedding space, our model learns important contextual information from the multilingual representations that is otherwise missing from the English text representations, effectively serving as knowledge transfer.

We validate our proposed model on a comprehensive set of benchmarks: the image retrieval dataset Flickr30k (Plummer et al., 2015) and the video retrieval datasets MSRVTT-9k (Xu et al., 2016), MSRVTT-7k (Xu et al., 2016), MSVD (Chen and Dolan, 2011), DiDeMo (Anne Hendricks et al., 2017) and Charades (Sigurdsson et al., 2016). We show that our approach achieves state-of-the-art results, outperforming previous models on most datasets. In addition to the evaluation on monolingual retrieval datasets, we also compare the performance of our model on the multilingual datasets Multi30k (Elliott et al., 2016) and MSRVTT multilingual (Huang et al., 2021). Experimental results demonstrate that MuMUR achieves state-of-the-art results on all English video retrieval datasets and significantly outperforms previous models on multilingual video retrieval in a zero-shot setting. Furthermore, MuMUR establishes a new benchmark on multilingual image retrieval while achieving strong performance on English image retrieval. These results demonstrate the universal capability of MuMUR to perform all types of multimodal and multilingual retrieval.

To summarize, our contributions are as follows: (i) We generate multilingual data using external state-of-the-art machine translation models. (ii) We propose a model that is capable of knowledge transfer from multilingual data to improve the performance of multimodal retrieval. (iii) We evaluate the proposed framework on six English retrieval benchmarks and achieve state-of-the-art results in both text-to-visual and visual-to-text retrieval settings. (iv) Finally, we demonstrate that our model significantly outperforms previous approaches on multilingual retrieval datasets.

2 Related work

2.1 Multimodal retrieval

Pre-train then fine-tune is the most popular paradigm for image retrieval (Chen et al., 2019; Li et al., 2020, 2021, 2022). These models are pretrained on huge amounts of image-text pairs such as Conceptual Captions (Changpinyo et al., 2021), Visual Genome (Krishna et al., 2017) and SBU (Lu et al., 2020), and tested on image retrieval datasets such as Flickr30k (Plummer et al., 2015) and COCO (Karpathy and Fei-Fei, 2015).

The task of video retrieval has seen tremendous progress in recent years. This is partly due to the availability of large-scale video datasets like HowTo100M (Miech et al., 2019) and WebVid-2M (Bain et al., 2021). In addition, the adaptation of transformers to image tasks such as image classification (Dosovitskiy et al., 2020) spurred the development of transformer-based video models. However, videos require considerably more memory and compute, and computing full self-attention matrices over them can be infeasible. With the introduction of more efficient architectures (Bertasius et al., 2021), large-scale pretraining on videos became possible. In this direction, several transformer-based architectures (Bain et al., 2021; Ge et al., 2022; Madasu et al., 2022, 2023) were proposed and pretrained on large video datasets, achieving state-of-the-art results on downstream video retrieval datasets in both zero-shot and fine-tuning settings.

In a parallel direction, a few works (Luo et al., 2021) have adapted image-level features pretrained on large-scale image-text pairs to perform video retrieval. Surprisingly, these works perform significantly better than models pretrained from scratch on large-scale video datasets. Compared to these models, our approach differs completely in architecture and training methodology.

2.2 Multilingual training

The recent success of multimodal image-text models on a variety of tasks, such as retrieval and question answering, has been mostly limited to monolingual models trained on English text. This is mainly due to the availability and high quality of English-based multimodal datasets. Recent work indicates that incorporating a second language or a multilingual encoder, thus creating a shared multilingual token embedding space, can improve monolingual pure-NLP downstream tasks (Conneau and Lample, 2019). This concept was rapidly embraced for training multimodal models. Previous works used images as a bridge for translating between two languages without requiring a shared language-to-language training dataset (Chen and Dolan, 2011; Surís et al., 2022; Sigurdsson et al., 2020).

Recent work has focused on multimodal tasks, such as image retrieval, aiming to add multilingual capabilities to multimodal models (Burns et al., 2020; Gella et al., 2017). These works often indicate that incorporating a second language during training improves performance on single-language multimodal tasks such as image retrieval, compared to multimodal models trained on a single language (Gella et al., 2017; Kim et al., 2020; Wehrmann et al., 2019). MULE (Kim et al., 2020), a multilingual universal language encoder trained on image-multilingual text pairs, showed an improvement of up to 20% on image-sentence retrieval compared to monolingual models. Nevertheless, all these previous works design models separately for image and video retrieval. Our objective is to use multilingual knowledge transfer to improve performance on current image and video retrieval datasets. In addition, our model is capable of performing multilingual retrieval in more than 10 languages.

Fig. 2

Illustration of the proposed MuMUR model. The model takes as input a visual input (image or video), a corresponding English text query and a translated multilingual query. The multilingual text query is obtained with an off-the-shelf machine translation model, which is used only in inference mode to generate these captions and is not part of the architecture. The video and English text features are extracted using the CLIP model, whereas the multilingual text features are extracted using the M-CLIP model. The features are then passed to a cross-modal encoder to learn their association in a common embedding space. A cross-entropy loss is applied to measure the similarity between the features \(R_{E}\) and \(R_{vE}\), and between \(R_{M}\) and \(R_{vM}\). The final loss is the sum of both losses (Color figure online)

3 MuMUR: Multilingual Multimodal Universal Retrieval

In this section, we introduce our framework MuMUR: Multilingual Multimodal Universal Retrieval. We first describe the problem statement, then the multilingual data augmentation strategy, and finally the proposed approach that enables knowledge transfer from multilingual data for video retrieval.

3.1 Problem statement

Given a set of visual data V, their corresponding English text captions E and related multilingual text captions M, our goal is to learn similarity functions \(s_{1}(v_{i}, e_{i})\) and \(s_{2}(v_{i}, m_{i})\), where \(v_{i} \in V\), \(e_{i} \in E\) and \(m_{i} \in M\). In other words, we propose a framework, MuMUR, that enables end-to-end learning on tuples of visual input, English text caption and multilingual text caption by bringing the joint representations of these three elements closer together. Specifically, for each visual input in V and its English text caption in E, we generate the multilingual translation in M using an external state-of-the-art machine translation model (Tang et al., 2020). Next, we present the proposed approach that facilitates end-to-end learning using multilingual data.
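As an illustration of this data-generation step, the sketch below translates English captions into French with a publicly available mBART-50 checkpoint on Hugging Face; the specific checkpoint, language codes and generation settings are assumptions for illustration, not necessarily the ones used in our pipeline.

```python
# Hedged sketch of multilingual caption generation with an mBART-50
# many-to-many checkpoint (assumed stand-in for the translation model).
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

MT_NAME = "facebook/mbart-large-50-many-to-many-mmt"   # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(MT_NAME, src_lang="en_XX")
mt_model = MBartForConditionalGeneration.from_pretrained(MT_NAME)

def translate(captions, tgt_lang="fr_XX", max_length=64):
    """Translate a batch of English captions into the target language."""
    batch = tokenizer(captions, return_tensors="pt", padding=True, truncation=True)
    out = mt_model.generate(
        **batch,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],  # force French output
        max_length=max_length,
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)

french_captions = translate(["a man is playing a guitar on stage"])
```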

3.2 Approach

Our model, illustrated in Fig. 2, comprises three components: (i) a visual encoder, (ii) a text encoder and (iii) a dual cross-modal encoder. Next, we describe each component in detail.

3.2.1 Visual encoder

Given a visual input V, we consider uniformly sampled frames \(C \in R^{N_{v} \times H \times W \times 3}\), where \(N_{v}\) is the number of frames (1 for images) and H and W are the spatial dimensions of an RGB frame. We then use a pretrained CLIP-ViT image encoder (Radford et al., 2021) to extract the frame embeddings \(F_{v} \in R^{N_{v} \times D_{v}}\), where \(D_{v}\) denotes the dimension of the frame embeddings. The frame embeddings are concatenated to obtain the final representation of the visual input V.
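A minimal sketch of this step with the Hugging Face CLIP ViT-B/32 implementation is shown below; the checkpoint name and the use of `get_image_features` are tooling assumptions rather than part of the original description.

```python
# Hedged sketch of the visual encoder: encode N_v sampled frames (or a single
# image) with a pretrained CLIP ViT-B/32 image encoder.
import torch
from transformers import CLIPModel, CLIPImageProcessor

CLIP_NAME = "openai/clip-vit-base-patch32"             # assumed checkpoint
clip = CLIPModel.from_pretrained(CLIP_NAME)
image_processor = CLIPImageProcessor.from_pretrained(CLIP_NAME)

@torch.no_grad()
def encode_frames(frames):
    """frames: list of N_v PIL images (N_v = 1 for an image input).
    Returns F_v with shape (N_v, D_v); D_v = 512 for ViT-B/32."""
    inputs = image_processor(images=frames, return_tensors="pt")
    return clip.get_image_features(**inputs)
```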

3.2.2 Text encoder

Let the input English text caption be E and the multilingual text caption be M, of lengths p and q respectively. We use a pretrained CLIP-ViT text encoder to convert the English text caption into a sequence of embeddings \(R_{E} \in R^{p \times D_{E}}\), where \(D_{E}\) denotes the embedding dimension. We take the representation of the [EOS] token as the final representation of the English text caption. To encode the multilingual text caption M, we use the M-CLIP model, a multilingual CLIP model pretrained on multilingual text-image pairs. Specifically, the multilingual text caption is converted into a sequence of embeddings \(R_{M} \in R^{q \times D_{M}}\), where \(D_{M}\) denotes the embedding dimension. As with the CLIP model, we take the [EOS] representation as the final output of the M-CLIP model.
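The sketch below mirrors the two text branches: the English branch uses the Hugging Face CLIP text encoder (which pools at the [EOS] position internally), and a sentence-transformers multilingual CLIP-aligned text model is used as a stand-in for M-CLIP; both checkpoint names are assumptions.

```python
# Hedged sketch of the dual text encoders: CLIP for English captions and a
# multilingual CLIP-aligned text model as a stand-in for M-CLIP.
import torch
from transformers import CLIPModel, CLIPTokenizerFast
from sentence_transformers import SentenceTransformer

CLIP_NAME = "openai/clip-vit-base-patch32"             # assumed checkpoint
clip = CLIPModel.from_pretrained(CLIP_NAME)
clip_tokenizer = CLIPTokenizerFast.from_pretrained(CLIP_NAME)
# Assumed stand-in for the M-CLIP text encoder.
mclip = SentenceTransformer("sentence-transformers/clip-ViT-B-32-multilingual-v1")

@torch.no_grad()
def encode_captions(english_caption, multilingual_caption):
    en = clip_tokenizer([english_caption], return_tensors="pt", padding=True)
    r_e = clip.get_text_features(**en)                          # R_E, shape (1, 512)
    r_m = torch.as_tensor(mclip.encode([multilingual_caption])) # R_M, shape (1, 512)
    return r_e, r_m

R_E, R_M = encode_captions("a dog catches a frisbee", "un chien attrape un frisbee")
```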

3.2.3 Dual cross-modal encoder (DCM)

Our goal is to closely associate the visual embeddings \(F_{v}\), the English text embeddings \(R_{E}\) and the multilingual text embeddings \(R_{M}\) in a common embedding space. For this purpose, we propose a dual cross-modal encoder (DCM). To incorporate textual information into the visual features and to learn visual features that are semantically closest to the text features, we use multi-head attention, with the text features as the queries and the visual features as the keys and values.

$$r_{vE} = \mathrm{Attention}(R_{E}, F_{v}, F_{v})$$
(1)
$$r_{vM} = \mathrm{Attention}(R_{M}, F_{v}, F_{v})$$
(2)

where multi-head attention (Attention) is defined as:

$$\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$
(3)

Here Q, K and V are the query, key and value matrices, as in the original transformer multi-head attention. We then apply a fully connected layer to the attention outputs, followed by a residual connection and layer normalization, to obtain \(R_{vE}\) and \(R_{vM}\).

$$R_{vE} = LN(FC(r_{vE}) + r_{vE})$$
(4)
$$R_{vM} = LN(FC(r_{vM}) + r_{vM})$$
(5)

where FC is the fully connected layer and LN is layer normalization.
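A compact PyTorch sketch of the DCM block described by Eqs. (1)-(5) is shown below; the hidden size, the number of attention heads, and whether the English and multilingual branches share weights are assumptions here.

```python
# Hedged PyTorch sketch of the dual cross-modal (DCM) encoder block.
import torch
import torch.nn as nn

class DCMBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8, dropout=0.4):   # sizes are assumptions
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout,
                                          batch_first=True)
        self.fc = nn.Linear(dim, dim)
        self.ln = nn.LayerNorm(dim)

    def forward(self, text, frames):
        # text:   (B, 1, D) pooled caption embedding used as the query
        # frames: (B, N_v, D) frame embeddings used as keys and values
        r, _ = self.attn(query=text, key=frames, value=frames)  # Eqs. (1)-(3)
        return self.ln(self.fc(r) + r).squeeze(1)               # Eqs. (4)-(5)

dcm = DCMBlock()  # a single block is reused for both branches in this sketch
F_v = torch.randn(4, 16, 512)
R_vE = dcm(torch.randn(4, 1, 512), F_v)   # English branch
R_vM = dcm(torch.randn(4, 1, 512), F_v)   # multilingual branch
```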

3.2.4 Loss

We use the standard image-text or video-text matching loss (Wu and Zhai, 2019) to train the model. It is measured via the dot-product similarity between matching text and visual embeddings within a batch. First, we compute the loss \(L_{E}\) between \(R_{vE}\) and \(R_{E}\), and then the loss \(L_{M}\) between \(R_{vM}\) and \(R_{M}\). The final loss is the sum of \(L_{E}\) and \(L_{M}\).

$$\begin{aligned} L = L_{E} + L_{M}. \end{aligned}$$
(6)

where \(L_{E} = L_{E}^{t2v} + L_{E}^{v2t}\) and \(L_{M} = L_{M}^{t2v} + L_{M}^{v2t}\).

$$\mathcal{L}_{E}^{\mathrm{v2t}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{E}^{(i)} \cdot R_{vE}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{E}^{(i)} \cdot R_{vE}^{(j)}\right)},$$
(7)
$$\mathcal{L}_{E}^{\mathrm{t2v}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{vE}^{(i)} \cdot R_{E}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{vE}^{(i)} \cdot R_{E}^{(j)}\right)},$$
(8)
$$\mathcal{L}_{M}^{\mathrm{v2t}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{M}^{(i)} \cdot R_{vM}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{M}^{(i)} \cdot R_{vM}^{(j)}\right)},$$
(9)
$$\mathcal{L}_{M}^{\mathrm{t2v}} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp\left(R_{vM}^{(i)} \cdot R_{M}^{(i)}\right)}{\sum_{j=1}^{B}\exp\left(R_{vM}^{(i)} \cdot R_{M}^{(j)}\right)}.$$
(10)
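A minimal sketch of Eqs. (6)-(10) as in-batch symmetric cross-entropy over dot-product similarities follows; no temperature scaling is shown, matching the equations as written.

```python
# Hedged sketch of the matching loss: symmetric cross-entropy over in-batch
# dot-product similarities, applied to both text branches.
import torch
import torch.nn.functional as F

def branch_loss(text_emb, visual_emb):
    """text_emb, visual_emb: (B, D). Returns L^v2t + L^t2v for one branch."""
    sim = text_emb @ visual_emb.t()                      # sim[i, j] = R^(i) . R_v^(j)
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)

def total_loss(R_E, R_vE, R_M, R_vM):
    return branch_loss(R_E, R_vE) + branch_loss(R_M, R_vM)   # Eq. (6)
```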

3.2.5 Inference

During inference on English retrieval datasets, the multilingual branch is not used and retrieval performance is measured only with \(R_{E}\) and \(R_{vE}\). Similarly, for multilingual datasets, the English branch is not used and the retrieval score is computed with \(R_{M}\) and \(R_{vM}\).
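For completeness, a sketch of inference-time scoring is given below: only the embeddings of the selected branch are compared, and gallery items are ranked by dot-product similarity (using a plain dot product mirrors the training loss and is an assumption about the scoring function).

```python
# Hedged sketch of retrieval scoring: rank gallery items for each query by
# dot-product similarity, using only the embeddings of the selected branch.
import torch

@torch.no_grad()
def rank_gallery(query_emb, gallery_emb):
    """query_emb: (Q, D), gallery_emb: (G, D). Returns (Q, G) ranked indices."""
    sim = query_emb @ gallery_emb.t()
    return sim.argsort(dim=-1, descending=True)        # best match first

# English evaluation uses (R_E, R_vE); multilingual evaluation uses (R_M, R_vM).
```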

4 Experiments

4.1 Datasets

Our goal is to design a universal multimodal multilingual retrieval model. Therefore, our experiments evaluate both image and video retrieval datasets comprising monolingual and multilingual captions.

4.1.1 Video retrieval

We perform experiments on six standard text-video retrieval datasets: MSRVTT-9k and MSRVTT-7k splits (Xu et al., 2016), MSVD (Chen and Dolan, 2011), DiDeMo (Anne Hendricks et al., 2017), Charades (Sigurdsson et al., 2016) and MSRVTT multilingual (Huang et al., 2021).

MSRVTT contains 10K videos, each ranging from 10 to 32 s, and 200K captions. We report results on both the MSRVTT-9k and MSRVTT-7k splits following Luo et al. (2021).

MSVD consists of 1970 videos and 80K descriptions. We use the standard training, validation and test splits following Luo et al. (2021). In this dataset, each video has multiple captions, which are treated as independent samples during testing.

DiDeMo is made up of 10K videos and 40K localized descriptions. We concatenate all sentences for each video and evaluate paragraph-to-video retrieval following Lin et al. (2022) and Luo et al. (2021).

Charades contains 9848 videos, each associated with a single caption. We use the standard training and test splits following Lin et al. (2022).

MSRVTT multilingual is a multilingual version of MSRVTT in which the English captions are translated into nine different languages. We use the standard splits following Huang et al. (2021).

4.1.2 Image retrieval

We evaluate the proposed approach MuMUR on the following image retrieval datasets:

Flickr30k contains 31,000 images; each image has a single caption in the training and validation sets and 5 captions in the test set. We follow the standard 29k/1k/1k splits (Li et al., 2022).

Multi30K (Elliott et al., 2016) is a multilingual version of Flickr30k (Plummer et al., 2015) in which the English text captions are translated into German (de), French (fr) and Czech (cs) languages. We use the training, validation and testing splits of 29k/1k/1k following Ni et al. (2021).

4.2 Metrics

To evaluate model performance, we use recall at rank K (R@1, R@5, R@10), median rank (MedR) and mean rank (MnR). Unless otherwise specified, the reported values are the mean of three runs with different seeds.
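These metrics can be computed from a query-by-gallery similarity matrix as sketched below, assuming the ground-truth item for query i sits at gallery index i.

```python
# Hedged sketch of R@K, median rank (MedR) and mean rank (MnR) from a
# similarity matrix whose ground-truth matches lie on the diagonal.
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    order = np.argsort(-sim, axis=1)                    # best match first
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1          # 1-indexed rank of ground truth
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics
```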

Table 1 Text-to-video and video-to-text retrieval results on MSR-VTT dataset 9k split

4.3 Implementation details

We use French translations as the multilingual inputs to train the MuMUR model. The visual encoder and the English text encoder are initialized with CLIP-ViT-B-32. The multilingual text encoder is initialized with M-CLIP-ViT-B-32. The dimension of the video, English caption and multilingual caption representations is 512. The dual cross-modal encoder is initialized randomly and trained from scratch. The dimension of the key, query and value projection layers is 512. The fully connected layer in the transformer has a size of 512, and a dropout of 0.4 is applied to this layer. We use 16 frames for the MSRVTT-9k, MSRVTT-7k and MSVD datasets, and 42 frames for the DiDeMo and Charades datasets. The maximum sequence length is set to 32 for MSRVTT-9k and MSRVTT-7k, 64 for DiDeMo and 30 for Charades. The model is trained using AdamW (Loshchilov and Hutter, 2018) with a learning rate of 1e-4 and a cosine decay to 1e-6. MSRVTT-9k and MSRVTT-7k are trained with a batch size of 32 for 15 epochs. MSVD is trained with a batch size of 32 for 5 epochs. DiDeMo and Charades are trained with a batch size of 16 for 12 and 15 epochs respectively.
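The optimization setup above corresponds to something like the following sketch; the placeholder model and the number of training steps are assumptions.

```python
# Hedged sketch of the optimizer and schedule: AdamW with lr 1e-4 and a
# cosine decay of the learning rate down to 1e-6.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(512, 512)                 # placeholder for the full MuMUR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
steps_per_epoch = 9000 // 32                      # assumed dataset/batch sizes (MSRVTT-9k)
scheduler = CosineAnnealingLR(optimizer, T_max=15 * steps_per_epoch, eta_min=1e-6)

# Per training step: loss = L_E + L_M; loss.backward();
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```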

Table 2 Text-to-video retrieval results on MSR-VTT - 7k split
Table 3 Text-to-video retrieval results on MSVD dataset (multi-caption evaluation)
Table 4 Text-to-video retrieval results on DiDeMo dataset
Table 5 Text-to-video retrieval results on Charades dataset
Table 6 Text-to-image and image-to-text retrieval results on Flickr30k (Plummer et al., 2015) dataset split
Table 7 Text-to-video retrieval (R@1 metric) results on MSR-VTT - multilingual (Huang et al., 2021)
Table 8 Text-to-image retrieval (R@1 metric) results on Multi30K (Elliott et al., 2016) dataset (de - German, fr - French and cs - Czech)

5 Results and discussion

5.1 Evaluation on English video retrieval datasets

In Table 1 we report the results of our proposed approach on the MSRVTT-9k split. The gap between CLIP-based models and other models is substantial (\(> 5\%\)), which motivates building our model on CLIP features. On the MSRVTT-9k split, our model significantly outperforms the CLIP4Clip model on all metrics in both text-to-video and video-to-text retrieval. VCM employs a knowledge graph between the video and text modalities, making its video-to-text retrieval performance superior to other models. Our model surpasses VCM significantly on all metrics, indicating that the multilingual representations serve as a powerful form of knowledge transfer. Moreover, our approach outperforms MCQ and MILES, which are pretrained on WebVid-2M, initialized with CLIP features and employ additional semantic information such as parts of speech. This shows that our model does not require video pretraining or structural knowledge injection; the multilingual text representations effectively serve this purpose.

In Tables 2, 3, 4 and 5 we report the results on the MSRVTT-7k, MSVD, DiDeMo and Charades datasets respectively. Our model outperforms all previous approaches across all metrics on all datasets. On the MSRVTT-7k split, our model achieves a significant boost of 2.1%, 2.5% and 4.4% in R@1, R@5 and R@10 respectively over the previous baselines. On MSVD, we observe improvements of 0.2%, 0.5% and 0.3% in R@1, R@5 and R@10 respectively. MSVD is a relatively small dataset with a test set of 670 videos; hence, the improvements are relatively marginal.

On the DiDeMo dataset, our model shows a marginal boost of 0.2% in R@1 but significant boosts of 4.3% and 2.9% in R@5 and R@10 respectively over previous approaches. On the Charades dataset, our model outperforms previous approaches by 0.9% in R@1 and by significant margins of 4.6%, 7.6% and 6.0% in R@5, R@10 and MedR respectively. ECLIPSE uses audio as additional information for video retrieval; we show that multilingual text acts as a better knowledge transfer signal.

5.2 Evaluation on English image retrieval datasets

Next, we measure the performance of MuMUR on the English image retrieval dataset Flickr30k. The results for both image-to-text and text-to-image retrieval are reported in Table 6. As shown in the table, MuMUR achieves performance comparable to previous approaches on image-to-text retrieval. Moreover, MuMUR significantly outperforms these models by 5.5% in text-to-image retrieval. Note that these models are pretrained on large amounts of image-text pairs and fine-tuned on Flickr30k. In contrast, our model uses a small amount of multilingual data and achieves remarkable results in both settings. This validates that multilingual data acts as a superior knowledge transfer signal for image retrieval as well.

5.3 Evaluation on multilingual video retrieval datasets

In addition to the monolingual datasets, we also evaluate the proposed approach on multilingual video retrieval. Specifically, we take the model trained only with French captions and test it on six languages, German (de), Czech (cs), Chinese (zh), Swahili (sw), Russian (ru) and Spanish (es), in a zero-shot setting. Table 7 shows the results on the MSRVTT-multilingual dataset. Our model achieves a significant boost of 8.2% (average) in R@1 in the zero-shot setting. It is worth noting that our model, evaluated zero-shot, outperforms previous approaches fine-tuned on these languages by a large margin of 6.1% (average). MMP (Huang et al., 2021) is pretrained on the large-scale multilingual HowTo100M dataset in 9 languages, yet our model trained on just one language outperforms MMP. This shows that our dual cross-modal (DCM) encoder block can effectively learn the association among video, English and multilingual representations even without large-scale video pretraining.

5.4 Evaluation on multilingual image retrieval datasets

We also evaluate MuMUR on the multilingual image retrieval dataset Multi30K. Table 8 shows the results for the three languages German (de), French (fr) and Czech (cs), with the model fine-tuned on these languages. As shown in the table, MuMUR significantly outperforms previous models by 3.1% for German, 3.2% for French and 4.3% for Czech.

5.5 Ablation studies

5.5.1 Effect of multilingual knowledge transfer

We investigate the effect of multilingual knowledge transfer on video retrieval performance. Specifically, we train a model without the multilingual text encoder, keeping the rest of the architecture intact. As shown in Fig. 3, using multilingual data for knowledge transfer significantly improves performance on the DiDeMo and Charades datasets: the improvement is 2.4% for DiDeMo and 3.62% for Charades.

Fig. 3

Comparison of models with and without using multilingual data as input. The first model takes as input only video and English text captions whereas the second model takes video, English text and multilingual text captions as input. As shown in the figure, using multilingual text data as knowledge transfer significantly improved the performance (Color figure online)

Fig. 4

Comparison of models consisting of only multilingual text encoder and multilingual text encoder + English text encoder. Using a separate English text encoder for encoding English text captions outperforms the model using multilingual text encoder to encode English text captions (Color figure online)

Fig. 5

Comparison of models with and without DCM block in the architecture. Using DCM block in the architecture showed superior performance to models without the DCM block (Color figure online)

Fig. 6

Comparison of MuMUR trained with different multilingual caption data. It is evident from the figure that training with more languages improved the performance (Color figure online)

Fig. 7

Figure shows the effect of the number of video frames used to train MuMUR model on MSRVTT-9k dataset. Results demonstrate that R@1 score is the highest when 16 frames are used in text-to-video and video-to-text settings (Color figure online)

Fig. 8

Figure demonstrates the performance of MuMUR for varying number of video frames on MSVD dataset (single caption evaluation). It is evident that the R@1 score is maximum at 16 frames when evaluated in both text-to-video and video-to-text settings (Color figure online)

Fig. 9

We ablate the sampling strategy used for selecting video frames on MSRVTT-9k dataset. We observe that uniform and random sampling techniques achieve similar performance for all the video frames (Color figure online)

Fig. 10

Figure illustrates the comparative performance of random and uniform video frame sampling on the MSVD dataset. It is evident that random sampling significantly outperforms uniform sampling but fails for a larger number of video frames (Color figure online)

5.5.2 Using only multilingual text encoder

Next, we ablate the choice of using a separate English text encoder. We showed above that multilingual data improves video retrieval performance. This raises the question: why is a separate English text encoder required if the multilingual text encoder can produce both English and multilingual text representations? In Fig. 4, we show the results of two model variants. The first uses a separate English text encoder to encode the English captions, whereas the second encodes both the English and multilingual text with the same multilingual text encoder. The results show that encoding English captions with a separate English text encoder surpasses using the multilingual text encoder for both. Multilingual pretraining includes only a portion of English data, whereas the English text encoder is pretrained on comparatively larger English corpora. Hence, a separate English text encoder gives much better performance than using the multilingual text encoder for English text.

5.5.3 Effectiveness of dual cross encoder block

Next, we ablate the effectiveness of the dual cross-modal encoder block. We train a model without the DCM block and directly compute the losses between the video and English text representations, and between the video and multilingual text representations. From Fig. 5, we see that the model with the DCM block achieves better performance than the model without it, which justifies our choice to use the DCM block in our model.

5.5.4 Training with more languages

Next, we ablate training our model with more than one language. Concretely, we train our model with German (de) and Spanish (es) captions. These languages are chosen because their performance on the XNLI benchmark (Conneau et al., 2018) is comparable to English. The results in Fig. 6 show that training with more languages improves video retrieval performance. These results validate that multilingual data acts as an effective knowledge transfer mechanism for improving video retrieval.

5.5.5 Effect of video frames

We investigate the impact of the number of video frames on retrieval performance. We train the MuMUR model with an increasing number of video frames in intervals of 4. Figures 7 and 8 show the results of this ablation on the MSRVTT-9k and MSVD datasets respectively. As shown in the figures, the R@1 score is highest when the model is trained with 16 video frames in both text-to-video and video-to-text retrieval. These results motivate the use of 16 frames for training MuMUR on video retrieval datasets.

5.5.6 Effect of video frame sampling

Next, we study the impact of choosing video frames at random versus uniformly. Figures 9 and 10 illustrate the results for varying numbers of video frames on the MSRVTT-9k and MSVD datasets respectively. On MSRVTT-9k, uniform and random sampling achieve nearly the same performance. On MSVD, however, random sampling performs considerably better than uniform sampling, although it fails when the number of frames exceeds 20. The two strategies are sketched below.
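The sketch compares uniform (evenly spaced indices) and random frame-index sampling; sorting the randomly chosen indices to preserve temporal order is an assumption.

```python
# Hedged sketch of uniform vs. random frame-index sampling.
import numpy as np

def uniform_sample(num_total, num_frames=16):
    return np.linspace(0, num_total - 1, num_frames).astype(int)

def random_sample(num_total, num_frames=16, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.choice(num_total, size=num_frames, replace=num_total < num_frames)
    return np.sort(idx)                               # keep temporal order (assumption)

print(uniform_sample(120))
print(random_sample(120, seed=0))
```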

6 Conclusion

In this paper, we introduced MuMUR, a multilingual knowledge transfer framework that improves the performance of multimodal retrieval. We constructed multilingual captions using off-the-shelf state-of-the-art machine translation models. We then proposed a CLIP-based model that enables multilingual knowledge transfer through a dual cross-modal encoder block. Experimental results on six standard multimodal retrieval datasets show that our framework achieves state-of-the-art results on all datasets. Finally, our model also outperforms previous approaches on multilingual retrieval datasets in both zero-shot and fine-tuned settings. In future work, we will focus on more efficient ways of achieving multilingual knowledge transfer for multimodal retrieval.