Abstract
Automatic video captioning is an active and flourishing research topic that involves complex interactions between visual features and natural language generation. Attention mechanisms extract the key visual information corresponding to each word by filtering out redundant information. However, existing visual attention methods are guided only indirectly by the hidden state of the language model, ignoring the interactions among the visual features that the attention mechanisms produce. Moreover, because of incomplete objects and interference noise, attention over frame features struggles to locate the regions of interest that are closely related to the motion state. Worse still, at each time step the hidden states have no access to posterior decoding states, so future predicted information is not fully exploited and detailed context-aware information is lost. In this paper, we propose a novel video captioning framework, the Memory-attended Semantic Context-aware Network (MaSCN), to capture the sequential dependencies between the visual features attended at adjacent time steps. To exploit pivotal features from coarse-grained to fine-grained, we introduce an attention module in MaSCN that uses tailored Visual Semantic LSTM (VSLSTM) layers to map visual relationship information more precisely through a multi-level attention mechanism. In addition, we integrate the visual features obtained through the attention mechanisms via late fusion. A visual semantic loss is used to explicitly memorize contextual information and capture fine-grained detailed cues. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
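To make two of the abstract's ingredients concrete, hidden-state-guided temporal attention over frame features and late fusion of the attended visual streams, here is a minimal PyTorch-style sketch. It is an illustration under our own assumptions, not the authors' implementation: the module and variable names (TemporalAttention, LateFusionDecoderStep, the appearance/motion split) are hypothetical, and MaSCN's tailored VSLSTM layers and visual semantic loss are not reproduced here.

```python
# Minimal sketch (NOT the authors' code): soft temporal attention guided by
# a decoder hidden state, plus late fusion of two attended visual streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Scores each frame feature against the current hidden state and
    returns the attention-weighted sum (standard soft attention)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hidden_dim, attn_dim)
        self.w_score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, n_frames, feat_dim); hidden: (batch, hidden_dim)
        scores = self.w_score(torch.tanh(
            self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))  # (B, T, 1)
        alpha = F.softmax(scores, dim=1)        # attention weights over frames
        return (alpha * feats).sum(dim=1)       # (B, feat_dim)

class LateFusionDecoderStep(nn.Module):
    """One decoding step: attend separately to appearance and motion
    features, fuse the two attended vectors (late fusion), then feed the
    fused context and the previous word into a language LSTM cell."""
    def __init__(self, app_dim, mot_dim, hidden_dim, vocab_size):
        super().__init__()
        self.attn_app = TemporalAttention(app_dim, hidden_dim)
        self.attn_mot = TemporalAttention(mot_dim, hidden_dim)
        self.fuse = nn.Linear(app_dim + mot_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim * 2, hidden_dim)  # fused ctx + word
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, app_feats, mot_feats, word_ids, state):
        h, c = state
        ctx = self.fuse(torch.cat(
            [self.attn_app(app_feats, h), self.attn_mot(mot_feats, h)], dim=-1))
        h, c = self.lstm(torch.cat([ctx, self.embed(word_ids)], dim=-1), (h, c))
        return self.out(h), (h, c)

# Toy usage with made-up dimensions (e.g., 2D appearance and 3D motion features):
step = LateFusionDecoderStep(app_dim=1536, mot_dim=1024, hidden_dim=512, vocab_size=10000)
app, mot = torch.randn(2, 26, 1536), torch.randn(2, 26, 1024)
h = c = torch.zeros(2, 512)
logits, (h, c) = step(app, mot, torch.zeros(2, dtype=torch.long), (h, c))
```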
Acknowledgements
This work was supported in part by the Fundamental Research Funds for the Central Universities of China under Grant 191010001, in part by the Hubei Key Laboratory of Transportation Internet of Things under Grant 2020III026GX, and in part by the Department of Science and Technology, Hubei Provincial People’s Government under Grant 2017CFA012.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
This article does not contain any studies with animals performed by any of the authors.
Additional information
Communicated by Gopal Chaudhary.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Chen, S., Zhong, X., Wu, S. et al. Memory-attended semantic context-aware network for video captioning. Soft Comput (2021). https://doi.org/10.1007/s00500-021-06360-6