
Text-video retrieval method based on enhanced self-attention and multi-task learning

Published in: Multimedia Tools and Applications

Abstract

The explosive growth of videos on the Internet makes it a great challenge to retrieve the videos we need using text queries. The general approach to text-video retrieval is to project both modalities into a common semantic space and compute a similarity score there. The key challenges for a retrieval model are obtaining strong feature representations of text and video and bridging the semantic gap between the two modalities. Moreover, most existing methods do not consider the strong consistency of text-video positive sample pairs. To address these problems, we propose a text-video retrieval method based on enhanced self-attention and multi-task learning. First, the extracted text feature vectors and video feature vectors are fed into a Transformer based on an enhanced self-attention mechanism for encoding and fusion. The resulting text and video representations are then projected into a common semantic space. Finally, by introducing multi-task learning in the common semantic space, our approach combines a semantic similarity measurement task with a semantic consistency judgement task, optimizing the common space through semantic consistency constraints. Our method achieves better retrieval performance than several existing approaches on the MSR-Video to Text (MSRVTT), Large Scale Movie Description Challenge (LSMDC), and ActivityNet datasets, which demonstrates the effectiveness of the proposed strategies.
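
To make the described pipeline concrete, the following is a minimal PyTorch sketch of a dual-encoder of this kind: pre-extracted text and video features are encoded, projected into a common space where cosine similarity is computed, and trained with a retrieval loss plus an auxiliary semantic-consistency classification loss. All module names, dimensions, pooling choices, and loss weights are illustrative assumptions, not the authors' implementation; in particular, a standard nn.TransformerEncoderLayer stands in for the paper's enhanced self-attention, and a symmetric cross-entropy (InfoNCE-style) loss stands in for the paper's similarity measurement loss.

    # Hypothetical sketch of a dual-encoder with an auxiliary consistency task.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualEncoder(nn.Module):
        def __init__(self, text_dim=768, video_dim=1024, common_dim=512,
                     n_heads=8, n_layers=2):
            super().__init__()
            # Project pre-extracted features to a shared width, then encode each
            # modality with a (standard) Transformer encoder; the paper's enhanced
            # self-attention would replace nn.TransformerEncoderLayer here.
            self.text_proj = nn.Linear(text_dim, common_dim)
            self.video_proj = nn.Linear(video_dim, common_dim)
            make_enc = lambda: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(common_dim, n_heads, batch_first=True),
                n_layers)
            self.text_enc, self.video_enc = make_enc(), make_enc()
            # Head for the auxiliary semantic-consistency judgement task:
            # classify whether a (text, video) pair is semantically consistent.
            self.consistency_head = nn.Linear(2 * common_dim, 2)

        def encode(self, feats, proj, enc):
            h = enc(proj(feats))           # (B, T, common_dim)
            h = h.mean(dim=1)              # simple temporal pooling
            return F.normalize(h, dim=-1)  # unit vectors for cosine similarity

        def forward(self, text_feats, video_feats):
            t = self.encode(text_feats, self.text_proj, self.text_enc)
            v = self.encode(video_feats, self.video_proj, self.video_enc)
            sim = t @ v.t()                # (B, B) cosine similarity matrix
            # Matched pairs as positives, rolled pairing as mismatched negatives
            # for the consistency classifier.
            pos = self.consistency_head(torch.cat([t, v], dim=-1))
            neg = self.consistency_head(torch.cat([t, v.roll(1, dims=0)], dim=-1))
            return sim, pos, neg

    def multitask_loss(sim, pos_logits, neg_logits, alpha=0.5, temperature=0.05):
        """Symmetric contrastive retrieval loss plus consistency-judgement loss."""
        b = sim.size(0)
        labels = torch.arange(b, device=sim.device)
        retrieval = (F.cross_entropy(sim / temperature, labels) +
                     F.cross_entropy(sim.t() / temperature, labels)) / 2
        logits = torch.cat([pos_logits, neg_logits])
        targets = torch.cat([torch.ones(b), torch.zeros(b)]).long().to(sim.device)
        consistency = F.cross_entropy(logits, targets)
        return retrieval + alpha * consistency

    # Toy usage with random pre-extracted features (batch of 4 text-video pairs).
    model = DualEncoder()
    sim, pos, neg = model(torch.randn(4, 20, 768), torch.randn(4, 30, 1024))
    loss = multitask_loss(sim, pos, neg)

At inference time only the similarity matrix is needed: texts are ranked against videos (or vice versa) by their cosine scores in the common space, while the consistency head serves purely as a training-time constraint.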




Data Availability

The data used in this paper are available at http://pascal.inrialpes.fr/data2/vgabeur/video-features.

Code Availability

The code required to reproduce these findings cannot be shared at this time as the code also forms part of an ongoing study.


Funding

This work was supported by the State Key Development Program of the 14th Five-Year Plan under Grant Nos. 2021YFF0900701, 2021YFF0602103, 2021YFF0602102, and 2021QY1702, and in part by the Natural Science Foundation of China (No. 61801441). We also thank the Institute for Guo Qiang, Tsinghua University, for research funds under Grant No. 2019GQG0001, and the High-quality and Cutting-edge Disciplines Construction Project for Universities in Beijing (Internet Information, Communication University of China).

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and experiments were performed by Jiayao Qian. Analysis and conclusions were performed by Xiaoyu Wu, Jiayao Qian, and Tiantian Wang. The first draft of the manuscript was written by Jiayao Qian, and all authors commented on and revised previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaoyu Wu.

Ethics declarations

Conflict of Interests

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jiayao Qian and Tiantian Wang contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, X., Qian, J. & Wang, T. Text-video retrieval method based on enhanced self-attention and multi-task learning. Multimed Tools Appl 82, 24387–24406 (2023). https://doi.org/10.1007/s11042-023-14589-6


