Multi-modal Transformer for Video Retrieval

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)


The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at


Video Language Retrieval Multi-modal Cross-modal Temporality Transformer Attention 



We thank the authors of  [14] for sharing their codebase and features, and Samuel Albanie, in particular, for his help with implementation details. This work was supported in part by the ANR project AVENUE.

Supplementary material

504439_1_En_13_MOESM1_ESM.pdf (191 kb)
Supplementary material 1 (pdf 191 KB)


  1. 1.
    Burns, A., Tan, R., Saenko, K., Sclaroff, S., Plummer, B.A.: Language features matter: effective language representations for vision-language tasks. In: ICCV (2019)Google Scholar
  2. 2.
    Carreira, J., Zisserman, A.: Quo vadis, action recognition? In: CVPR, A New Model and the Kinetics Dataset (2017)Google Scholar
  3. 3.
    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)Google Scholar
  4. 4.
    Gabeur, V., Sun, C., Alahari, K., Schmid, C.: CVPR 2020 video pentathlon challenge: multi-modal transformer for video retrieval. In: CVPR Video Pentathlon Workshop (2020)Google Scholar
  5. 5.
    Harwath, D., Recasens, A., Surís, D., Chuang, G., Torralba, A., Glass, J.: Jointly discovering visual objects and spoken words from raw sensory input. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 659–677. Springer, Cham (2018). Scholar
  6. 6.
    Hershey, S., et al.: CNN architectures for large-scale audio classification. In: ICASSP (2017)Google Scholar
  7. 7.
    Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8) (1997)Google Scholar
  8. 8.
    Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. (2019)Google Scholar
  9. 9.
    Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2016)Google Scholar
  10. 10.
    Karpathy, A., Joulin, A., Fei-Fei, L.: Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS (2014)Google Scholar
  11. 11.
    Klein, B., Lev, G., Sadeh, G., Wolf, L.: Associating neural word embeddings with deep image representations using fisher vectors. In: CVPR (2015)Google Scholar
  12. 12.
    Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: ICCV (2017)Google Scholar
  13. 13.
    Lee, K.-H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 212–228. Springer, Cham (2018). Scholar
  14. 14.
    Liu, Y., Albanie, S., Nagrani, A., Zisserman, A.: Use what you have: video retrieval using representations from collaborative experts. arXiv abs/1907.13487 (2019)Google Scholar
  15. 15.
    Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end learning of visual representations from uncurated instructional videos. arXiv e-prints arXiv:1912.06430, December 2019
  16. 16.
    Miech, A., Laptev, I., Sivic, J.: Learning a text-video embedding from incomplete and heterogeneous data. arXiv abs/1804.02516 (2018)Google Scholar
  17. 17.
    Miech, A., Zhukov, D., Alayrac, J.B., Tapaswi, M., Laptev, I., Sivic, J.: HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In: ICCV (2019)Google Scholar
  18. 18.
    Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)Google Scholar
  19. 19.
    Mithun, N.C., Li, J., Metze, F., Roy-Chowdhury, A.K.: Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: ICMR (2018)Google Scholar
  20. 20.
    Mithun, N.C., Li, J., Metze, F., Roy-Chowdhury, A.K.: Joint embeddings with multimodal cues for video-text retrieval. Int. J. Multimedia Inf. Retrieval 8(1), 3–18 (2019). Scholar
  21. 21.
    Rohrbach, A., Rohrbach, M., Tandon, N., Schiele, B.: A dataset for movie description. In: CVPR (2015)Google Scholar
  22. 22.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)Google Scholar
  23. 23.
    Sun, C., Baradel, F., Murphy, K., Schmid, C.: Learning video representations using contrastive bidirectional transformer. arXiv:1906.05743 (2019)
  24. 24.
    Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: ICCV (2019)Google Scholar
  25. 25.
    Vaswani, A., et al.: Attention is all you need. In: NIPS (2017)Google Scholar
  26. 26.
    Wray, M., Larlus, D., Csurka, G., Damen, D.: Fine-grained action retrieval through multiple parts-of-speech embeddings. In: ICCV (2019)Google Scholar
  27. 27.
    Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144 (2016)
  28. 28.
    Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 318–335. Springer, Cham (2018). Scholar
  29. 29.
    Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with application to clustering with side-information. In: NIPS (2002)Google Scholar
  30. 30.
    Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: CVPR (2016)Google Scholar
  31. 31.
    Yu, Y., Kim, J., Kim, G.: A joint sequence fusion model for video question answering and retrieval. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 487–503. Springer, Cham (2018). Scholar
  32. 32.
    Yu, Y., Ko, H., Choi, J., Kim, G.: End-to-end concept word detection for video captioning, retrieval, and question answering. In: CVPR (2017)Google Scholar
  33. 33.
    Zhang, B., Hu, H., Sha, F.: Cross-modal and hierarchical modeling of video and text. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 385–401. Springer, Cham (2018). Scholar
  34. 34.
    Zhang, Y., Jin, R., Zhou, Z.H.: Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybernet. 1, 43–52 (2010)CrossRefGoogle Scholar
  35. 35.
    Zhou, B., Lapedriza, À., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1452–1464 (2018)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Inria, Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJKGrenobleFrance
  2. 2.Google ResearchMeylanFrance

Personalised recommendations