
Multi-query Video Retrieval

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Retrieving target videos based on text descriptions is a task of great practical value and has received increasing attention over the past few years. Despite recent progress, imperfect annotations in existing video retrieval datasets have posed significant challenges for model evaluation and development. In this paper, we tackle this issue by focusing on the less-studied setting of multi-query video retrieval, where multiple descriptions are provided to the model for searching over the video archive. We first show that the multi-query retrieval task effectively mitigates the dataset noise introduced by imperfect annotations and correlates better with human judgement when evaluating the retrieval abilities of current models. We then investigate several methods which leverage multiple queries at training time, and demonstrate that multi-query-inspired training can lead to superior performance and better generalization. We hope further investigation in this direction can bring new insights into building systems that perform better in real-world video retrieval applications (code is available at https://github.com/princetonvisualai/MQVR).
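To make the setting concrete, the following is a minimal sketch of one simple way multiple queries can be fused at inference time: embed each description, score every video in the archive against each query, and average the similarities before ranking. The encoders, the similarity-averaging fusion, and all names below (e.g. rank_videos) are illustrative assumptions for this sketch, not the specific method proposed in the paper.

```python
# Illustrative sketch of multi-query retrieval scoring (not the authors'
# implementation). Assumes precomputed embeddings from any text/video
# encoder, e.g. a CLIP-style dual encoder.
import numpy as np

def rank_videos(query_embs: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Rank archive videos for one target described by multiple text queries.

    query_embs: (num_queries, dim) text embeddings for the same target video.
    video_embs: (num_videos, dim) embeddings of the video archive.
    Returns video indices sorted from best to worst match.
    """
    # Normalize so dot products become cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)

    # (num_queries, num_videos) similarity matrix, averaged over queries:
    # each paraphrased description contributes equally to the final score.
    sims = q @ v.T
    fused = sims.mean(axis=0)
    return np.argsort(-fused)

# Toy usage: 3 paraphrased queries, an archive of 100 videos, 512-d embeddings.
rng = np.random.default_rng(0)
ranking = rank_videos(rng.normal(size=(3, 512)), rng.normal(size=(100, 512)))
print(ranking[:5])  # indices of the top-5 retrieved videos
```

Other fusion choices fit the same interface, for example taking the maximum similarity across queries or averaging the query embeddings themselves before scoring; the paper studies such alternatives at both evaluation and training time.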


Notes

  1. Throughout this paper, we use the term ‘video retrieval’ to refer to the specific task of text-to-video retrieval.


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 2107048. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We would like to thank members of the Princeton Visual AI Lab (Jihoon Chung, Zhiwei Deng, William Yang and others) for their helpful comments and suggestions.

Author information

Corresponding author

Correspondence to Zeyu Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 18715 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Z., Wu, Y., Narasimhan, K., Russakovsky, O. (2022). Multi-query Video Retrieval. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_14


  • DOI: https://doi.org/10.1007/978-3-031-19781-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19780-2

  • Online ISBN: 978-3-031-19781-9

  • eBook Packages: Computer Science, Computer Science (R0)
