Abstract
Automatic video captioning is an active and flourishing research topic that involves complex interactions between visual features and natural language generation. Attention mechanisms extract the key visual information corresponding to each word by filtering out redundant information. However, existing visual attention methods are guided only indirectly by the hidden state of the language model, ignoring the interactions among the visual features that the attention mechanisms produce. Moreover, because of incomplete objects and interference noise, attention over frame features struggles to locate the regions of interest that are closely related to the motion state. Worse still, at each time step the hidden states have no access to posterior decoding states, so future predicted information is not fully exploited and detailed context-aware information is lost. In this paper, we propose a novel video captioning framework, the Memory-attended Semantic Context-aware Network (MaSCN), to capture the sequential dependencies between the visual features attended at adjacent time steps. To exploit pivotal features from coarse-grained to fine-grained, we introduce an attention module in MaSCN that uses tailored Visual Semantic LSTM (VSLSTM) layers to map visual relationship information more precisely through a multi-level attention mechanism. In addition, we integrate the visual features obtained through the attention mechanisms via late fusion. A visual semantic loss is used to explicitly memorize contextual information and capture fine-grained detailed cues. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
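To make two of the abstract's ingredients concrete, hidden-state-guided temporal attention over frame features and late fusion of the attended visual streams, here is a minimal PyTorch-style sketch. It is an illustration under our own assumptions, not the authors' implementation: the module and variable names (TemporalAttention, LateFusionDecoderStep, the appearance/motion split) are hypothetical, and MaSCN's tailored VSLSTM layers and visual semantic loss are not reproduced here.

```python
# Minimal sketch (NOT the authors' code): soft temporal attention guided by
# a decoder hidden state, plus late fusion of two attended visual streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Scores each frame feature against the current hidden state and
    returns the attention-weighted sum (standard soft attention)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hid = nn.Linear(hidden_dim, attn_dim)
        self.w_score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, n_frames, feat_dim); hidden: (batch, hidden_dim)
        scores = self.w_score(torch.tanh(
            self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))  # (B, T, 1)
        alpha = F.softmax(scores, dim=1)        # attention weights over frames
        return (alpha * feats).sum(dim=1)       # (B, feat_dim)

class LateFusionDecoderStep(nn.Module):
    """One decoding step: attend separately to appearance and motion
    features, fuse the two attended vectors (late fusion), then feed the
    fused context and the previous word into a language LSTM cell."""
    def __init__(self, app_dim, mot_dim, hidden_dim, vocab_size):
        super().__init__()
        self.attn_app = TemporalAttention(app_dim, hidden_dim)
        self.attn_mot = TemporalAttention(mot_dim, hidden_dim)
        self.fuse = nn.Linear(app_dim + mot_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim * 2, hidden_dim)  # fused ctx + word
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, app_feats, mot_feats, word_ids, state):
        h, c = state
        ctx = self.fuse(torch.cat(
            [self.attn_app(app_feats, h), self.attn_mot(mot_feats, h)], dim=-1))
        h, c = self.lstm(torch.cat([ctx, self.embed(word_ids)], dim=-1), (h, c))
        return self.out(h), (h, c)

# Toy usage with made-up dimensions (e.g., 2D appearance and 3D motion features):
step = LateFusionDecoderStep(app_dim=1536, mot_dim=1024, hidden_dim=512, vocab_size=10000)
app, mot = torch.randn(2, 26, 1536), torch.randn(2, 26, 1024)
h = c = torch.zeros(2, 512)
logits, (h, c) = step(app, mot, torch.zeros(2, dtype=torch.long), (h, c))
```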
Acknowledgements
This work was supported in part by the Fundamental Research Funds for the Central Universities of China under Grant 191010001, in part by the Hubei Key Laboratory of Transportation Internet of Things under Grant 2020III026GX, and in part by the Department of Science and Technology, Hubei Provincial People’s Government under Grant 2017CFA012.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
This article does not contain any studies with animals performed by any of the authors.
Additional information
Communicated by Gopal Chaudhary.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Chen, S., Zhong, X., Wu, S. et al. Memory-attended semantic context-aware network for video captioning. Soft Comput (2021). https://doi.org/10.1007/s00500-021-06360-6