
Text-video retrieval method based on enhanced self-attention and multi-task learning

Published in: Multimedia Tools and Applications

Abstract

The explosive growth of videos on the Internet makes it a great challenge to retrieve the videos we need using text queries. The general approach to text-video retrieval is to project both modalities into a common semantic space and compute a similarity score there. The key challenges for a retrieval model are obtaining strong feature representations of text and video and bridging the semantic gap between the two modalities. Moreover, most existing methods do not consider the strong consistency of text-video positive sample pairs. To address these problems, we propose a text-video retrieval method based on enhanced self-attention and multi-task learning. First, the extracted text feature vectors and video feature vectors are fed into a Transformer based on an enhanced self-attention mechanism for encoding and fusion. The resulting text and video representations are then projected into a common semantic space. Finally, by introducing multi-task learning in the common semantic space, our approach combines a semantic similarity measurement task with a semantic consistency judgement task, optimizing the common space through semantic consistency constraints. Our method achieves better retrieval performance than several existing approaches on the MSR-Video to Text (MSRVTT), Large Scale Movie Description Challenge (LSMDC), and ActivityNet datasets, which demonstrates the effectiveness of the proposed strategies.
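
To make the described pipeline concrete, the following is a minimal PyTorch sketch of a dual-encoder of this kind: pre-extracted text and video features are encoded, projected into a common space where cosine similarity is computed, and trained with a retrieval loss plus an auxiliary semantic-consistency classification loss. All module names, dimensions, pooling choices, and loss weights are illustrative assumptions, not the authors' implementation; in particular, a standard nn.TransformerEncoderLayer stands in for the paper's enhanced self-attention, and a symmetric cross-entropy (InfoNCE-style) loss stands in for the paper's similarity measurement loss.

    # Hypothetical sketch of a dual-encoder with an auxiliary consistency task.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualEncoder(nn.Module):
        def __init__(self, text_dim=768, video_dim=1024, common_dim=512,
                     n_heads=8, n_layers=2):
            super().__init__()
            # Project pre-extracted features to a shared width, then encode each
            # modality with a (standard) Transformer encoder; the paper's enhanced
            # self-attention would replace nn.TransformerEncoderLayer here.
            self.text_proj = nn.Linear(text_dim, common_dim)
            self.video_proj = nn.Linear(video_dim, common_dim)
            make_enc = lambda: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(common_dim, n_heads, batch_first=True),
                n_layers)
            self.text_enc, self.video_enc = make_enc(), make_enc()
            # Head for the auxiliary semantic-consistency judgement task:
            # classify whether a (text, video) pair is semantically consistent.
            self.consistency_head = nn.Linear(2 * common_dim, 2)

        def encode(self, feats, proj, enc):
            h = enc(proj(feats))           # (B, T, common_dim)
            h = h.mean(dim=1)              # simple temporal pooling
            return F.normalize(h, dim=-1)  # unit vectors for cosine similarity

        def forward(self, text_feats, video_feats):
            t = self.encode(text_feats, self.text_proj, self.text_enc)
            v = self.encode(video_feats, self.video_proj, self.video_enc)
            sim = t @ v.t()                # (B, B) cosine similarity matrix
            # Matched pairs as positives, rolled pairing as mismatched negatives
            # for the consistency classifier.
            pos = self.consistency_head(torch.cat([t, v], dim=-1))
            neg = self.consistency_head(torch.cat([t, v.roll(1, dims=0)], dim=-1))
            return sim, pos, neg

    def multitask_loss(sim, pos_logits, neg_logits, alpha=0.5, temperature=0.05):
        """Symmetric contrastive retrieval loss plus consistency-judgement loss."""
        b = sim.size(0)
        labels = torch.arange(b, device=sim.device)
        retrieval = (F.cross_entropy(sim / temperature, labels) +
                     F.cross_entropy(sim.t() / temperature, labels)) / 2
        logits = torch.cat([pos_logits, neg_logits])
        targets = torch.cat([torch.ones(b), torch.zeros(b)]).long().to(sim.device)
        consistency = F.cross_entropy(logits, targets)
        return retrieval + alpha * consistency

    # Toy usage with random pre-extracted features (batch of 4 text-video pairs).
    model = DualEncoder()
    sim, pos, neg = model(torch.randn(4, 20, 768), torch.randn(4, 30, 1024))
    loss = multitask_loss(sim, pos, neg)

At inference time only the similarity matrix is needed: texts are ranked against videos (or vice versa) by their cosine scores in the common space, while the consistency head serves purely as a training-time constraint.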




Data Availability

The data used in this paper are available at http://pascal.inrialpes.fr/data2/vgabeur/video-features.

Code Availability

The code required to reproduce these findings cannot be shared at this time as the code also forms part of an ongoing study.


Funding

This work was supported by the State Key Development Program of the 14th Five-Year Plan under Grant Nos. 2021YFF0900701, 2021YFF0602103, 2021YFF0602102, and 2021QY1702, and in part by the Natural Science Foundation of China (No. 61801441). We also thank the Institute for Guo Qiang, Tsinghua University, for research funds under Grant No. 2019GQG0001, and the High-quality and Cutting-edge Disciplines Construction Project for Universities in Beijing (Internet Information, Communication University of China).

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and experiments were performed by Jiayao Qian. Analysis and conclusions were performed by Xiaoyu Wu, Jiayao Qian, and Tiantian Wang. The first draft of the manuscript was written by Jiayao Qian, and all authors commented on and revised previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaoyu Wu.

Ethics declarations

Conflict of Interests

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jiayao Qian and Tiantian Wang contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, X., Qian, J. & Wang, T. Text-video retrieval method based on enhanced self-attention and multi-task learning. Multimed Tools Appl 82, 24387–24406 (2023). https://doi.org/10.1007/s11042-023-14589-6


