Abstract
The video grounding (VG) task aims to locate a queried action or event in an untrimmed video based on a rich linguistic description. Existing proposal-free methods struggle with the complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. Specifically, we present a simple but effective proposal-free framework, the video grounding transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. The benefits of this learnable token in ViGT are twofold: (1) the token is independent of both the video and the query, which avoids data bias toward either modality; and (2) the token simultaneously aggregates global context from the video and query features. Concretely, we first employ a shared feature encoder to project both video and query into a joint feature space, and then apply cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Next, we concatenate a learnable regression token [REG] with the video and query features as the input to a vision-language transformer. Finally, we use the [REG] token to predict the target moment and the visual features to constrain the foreground and background probabilities at each timestamp. ViGT performs well on three public datasets: ANet-Captions, TACoS, and YouCookII. Extensive ablation studies and qualitative analyses further validate the interpretability of ViGT.
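To make the pipeline described above concrete, the following is a minimal PyTorch-style sketch of the regression-token idea, under assumed feature dimensions (e.g., C3D-like video features and GloVe-like word embeddings) and hypothetical module names; the co-attention stage and training losses are omitted, and this is not the authors' implementation.

```python
# Sketch only (not the authors' code): a learnable [REG] token is prepended to
# projected video and query features, a transformer encoder aggregates global
# context, and the [REG] output regresses a normalized (start, end) boundary.
import torch
import torch.nn as nn

class ViGTSketch(nn.Module):
    def __init__(self, video_dim=1024, query_dim=300, d_model=256,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Shared joint space: modality-specific projections into d_model.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.query_proj = nn.Linear(query_dim, d_model)
        # Learnable regression token, independent of any video/query content.
        self.reg_token = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        # Regression head: predicts a normalized (start, end) in [0, 1].
        self.reg_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 2), nn.Sigmoid())

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, video_dim); query_feats: (B, L, query_dim)
        v = self.video_proj(video_feats)
        q = self.query_proj(query_feats)
        reg = self.reg_token.expand(v.size(0), -1, -1)
        # [REG] attends to all video and query tokens inside the transformer.
        tokens = torch.cat([reg, v, q], dim=1)
        out = self.transformer(tokens)
        return self.reg_head(out[:, 0])  # (B, 2): predicted (start, end)

# Toy usage with random tensors standing in for real video/query features.
model = ViGTSketch()
boundary = model(torch.randn(2, 64, 1024), torch.randn(2, 20, 300))
print(boundary.shape)  # torch.Size([2, 2])
```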
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 72188101, 62020106007, 62272144, U20A20183) and the Major Project of Anhui Province (Grant No. 202203a05020011).
Cite this article
Li, K., Guo, D. & Wang, M. ViGT: proposal-free video grounding with a learnable token in the transformer. Sci. China Inf. Sci. 66, 202102 (2023). https://doi.org/10.1007/s11432-022-3783-3