Centralized sub-critic based hierarchical-structured reinforcement learning for temporal sentence grounding

  • Regular Paper
  • Published:
Multimedia Systems

Abstract

Temporal sentence grounding aims to localize the video clip that corresponds to a given sentence within a video. An existing study based on hierarchical-structured reinforcement learning treats the task as training an agent to learn a strategy, decomposed into a master policy and several sub-policies, that progressively adjusts the predicted boundary toward the target clip. It adopts a decentralized sub-critic framework, equipping every sub-policy with its own sub-critic network that perceives the current environment to enhance training. However, maintaining many sub-critics inflates the number of network parameters. In addition, each decentralized sub-critic considers only the action of its own sub-policy and fails to model the impact of the other sub-policies' actions on the environment, which can mislead the sub-policies' learning. To address this, we contribute a novel solution: centralized sub-critic based hierarchical-structured reinforcement learning (CSC-HSRL). The key idea is to train a single centralized sub-critic network that evaluates the effects of all sub-policies' actions. The centralized sub-critic helps each sub-policy determine whether its actions contribute to localizing the target clip more precisely, thereby supporting training, and it requires fewer parameters than the decentralized design. Experiments on the Charades-STA and ActivityNet datasets show that, compared with the decentralized sub-critic based model TSP-PRL, CSC-HSRL achieves higher accuracy while reducing model parameters.
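The parameter-reduction claim can be made concrete with a small back-of-the-envelope sketch. The dimensions, the two-layer MLP critic shape, and the variable names below are illustrative assumptions, not the paper's actual architecture; the point is only that one centralized critic conditioned on the joint action is smaller than one critic per sub-policy.

```python
# Hypothetical sketch (not the paper's implementation): compare the parameter
# cost of N decentralized sub-critics, each seeing (state, own action), with a
# single centralized sub-critic seeing (state, all sub-policies' actions).
# All dimensions below are assumed for illustration only.

STATE_DIM, ACTION_DIM, N_SUB, HIDDEN = 128, 8, 4, 256

def critic_params(in_dim, hidden=HIDDEN):
    # Parameter count of a two-layer MLP value head: in_dim -> hidden -> 1
    # (weights plus biases for both layers).
    return (in_dim * hidden + hidden) + (hidden * 1 + 1)

# Decentralized: one sub-critic per sub-policy; each input is the state
# concatenated with that sub-policy's own action only.
n_dec = N_SUB * critic_params(STATE_DIM + ACTION_DIM)

# Centralized: one sub-critic; its input is the state concatenated with every
# sub-policy's action, so it can model how all actions jointly affect the
# environment.
n_cen = critic_params(STATE_DIM + N_SUB * ACTION_DIM)

print(n_dec, n_cen)  # prints 141316 41473
```

The saving comes from sharing the state-dependent weights: with N sub-policies, the decentralized design pays for the state portion of the input N times, while the centralized critic pays for it once, so it is smaller whenever there is more than one sub-policy.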


Data availability

The data that support this study are available from the Charades-STA dataset at https://doi.org/10.1109/ICCV.2017.563 and the ActivityNet dataset at https://doi.org/10.1109/ICCV.2017.83. These data were derived from the following resource available in the public domain: https://github.com/jiyanggao/TALL.

References

  1. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., Zaremba, W.: Hindsight experience replay. Advances in neural information processing systems 30 (2017)

  2. Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the IEEE international conference on computer vision, pp. 5803–5812 (2017)

  3. Bacon, P.L., Harb, J., Precup, D.: The option-critic architecture. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)

  4. Chaplot, D.S., Sathyendra, K.M., Pasumarthi, R.K., Rajagopal, D., Salakhutdinov, R.: Gated-attention architectures for task-oriented language grounding. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  5. Chen, S., Jiang, Y.G.: Semantic proposal for activity localization in videos via sentence query. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8199–8206 (2019)

  6. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  7. Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., Whiteson, S.: Counterfactual multi-agent policy gradients. In: Proceedings of the AAAI conference on artificial intelligence, vol. 32 (2018)

  8. Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: Temporal activity localization via language query. In: Proceedings of the IEEE international conference on computer vision, pp. 5267–5275 (2017)

  9. Gao, J., Xu, C.: Fast video moment retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1523–1532 (2021)

  10. Ge, R., Gao, J., Chen, K., Nevatia, R.: MAC: Mining activity concepts for language-based temporal localization. In: 2019 IEEE winter conference on applications of computer vision (WACV), pp. 245–253. IEEE (2019)

  11. Hahn, M., Kadav, A., Rehg, J.M., Graf, H.P.: Tripping through time: Efficient localization of activities in videos. arXiv preprint arXiv:1904.09936 (2019)

  12. He, D., Zhao, X., Huang, J., Li, F., Liu, X., Wen, S.: Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8393–8400 (2019)

  13. Jiang, B., Huang, X., Yang, C., Yuan, J.: Cross-modal video moment retrieval with spatial and language-temporal attention. In: Proceedings of the 2019 on international conference on multimedia retrieval, pp. 217–225 (2019)

  14. Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S.: Skip-thought vectors. Advances in neural information processing systems 28 (2015)

  15. Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Carlos Niebles, J.: Dense-captioning events in videos. In: Proceedings of the IEEE international conference on computer vision, pp. 706–715 (2017)

  16. Liu, M., Wang, X., Nie, L., He, X., Chen, B., Chua, T.S.: Attentive moment retrieval in videos. In: The 41st international ACM SIGIR conference on research & development in information retrieval, pp. 15–24 (2018)

  17. Liu, M., Wang, X., Nie, L., Tian, Q., Chen, B., Chua, T.S.: Cross-modal moment localization in videos. In: Proceedings of the 26th ACM international conference on Multimedia, pp. 843–851 (2018)

  18. Lowe, R., Wu, Y.I., Tamar, A., Harb, J., Pieter Abbeel, O., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems 30 (2017)

  19. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International conference on machine learning, pp. 1928–1937. PMLR (2016)

  20. Ning, K., Cai, M., Xie, D., Wu, F.: An attentive sequence to sequence translator for localizing video clips by natural language. IEEE Transactions on Multimedia 22(9), 2434–2443 (2019)

  21. Rodriguez, C., Marrese-Taylor, E., Saleh, F.S., Li, H., Gould, S.: Proposal-free temporal moment localization of a natural-language query in video using guided attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2464–2473 (2020)

  22. Ryu, H., Kang, S., Kang, H., Yoo, C.D.: Semantic grouping network for video captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2514–2522 (2021)

  23. Su, J., Adams, S., Beling, P.: Value-decomposition multi-agent actor-critics. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11352–11360 (2021)

  24. Sun, X., Wang, H., He, B.: MABAN: Multi-agent boundary-aware network for natural language moment retrieval. IEEE Transactions on Image Processing 30, 5589–5599 (2021)

  25. Sutton, R.S., Barto, A.G.: Reinforcement learning: an introduction. MIT press (2018)

  26. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp. 4489–4497 (2015)

  27. Vezhnevets, A.S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., Kavukcuoglu, K.: Feudal networks for hierarchical reinforcement learning. In: International Conference on Machine Learning, pp. 3540–3549. PMLR (2017)

  28. Wang, J., Ma, L., Jiang, W.: Temporally grounding language queries in videos by contextual boundary-aware prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12168–12175 (2020)

  29. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks: Towards good practices for deep action recognition. In: European conference on computer vision, pp. 20–36. Springer (2016)

  30. Wang, W., Huang, Y., Wang, L.: Language-driven temporal activity localization: A semantic matching reinforcement learning model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 334–343 (2019)

  31. Wang, X., Chen, W., Wu, J., Wang, Y.F., Wang, W.Y.: Video captioning via hierarchical reinforcement learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4213–4222 (2018)

  32. Wu, J., Li, G., Liu, S., Lin, L.: Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12386–12393 (2020)

  33. Wu, W., He, D., Tan, X., Chen, S., Wen, S.: Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6222–6231 (2019)

  34. Xiao, S., Chen, L., Shao, J., Zhuang, Y., Xiao, J.: Natural language video localization with learnable moment proposals. arXiv preprint arXiv:2109.10678 (2021)

  35. Xu, H., He, K., Plummer, B.A., Sigal, L., Sclaroff, S., Saenko, K.: Multilevel language and vision integration for text-to-clip retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9062–9069 (2019)

  37. Yuan, Y., Ma, L., Wang, J., Liu, W., Zhu, W.: Semantic conditioned dynamic modulation for temporal sentence grounding in videos. Advances in Neural Information Processing Systems 32 (2019)

  38. Yuan, Y., Mei, T., Zhu, W.: To find where you talk: Temporal sentence localization in video with attention based location regression. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9159–9166 (2019)

Download references

Acknowledgements

This work was supported by the National Key Research and Development Project (No. 2020AAA0106200), the National Natural Science Foundation of China (Grants No. 61936005 and 61872424), the Natural Science Foundation of Jiangsu Province (Grants No. BK20200037 and BK20210595), and the Open Project of the Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University (Grant No. MMC202010).

Author information

Authors and Affiliations

Authors

Contributions

YZ performed the experiments and wrote the main manuscript text; ZT, ZT, and B-KB guided and revised the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Zhiyi Tan.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhao, Y., Tan, Z., Bao, BK. et al. Centralized sub-critic based hierarchical-structured reinforcement learning for temporal sentence grounding. Multimedia Systems 29, 2181–2191 (2023). https://doi.org/10.1007/s00530-023-01091-0

