
Memory-attended semantic context-aware network for video captioning

Soft Computing

Abstract

Automatic video captioning is an active and flourishing research topic that involves complex interactions between visual features and natural language generation. Attention mechanisms extract the key visual information corresponding to each word by removing redundant information. However, existing visual attention methods are guided only indirectly by the hidden state of the language model and ignore the interactions among the visual features that the attention mechanisms produce. Because objects may be incomplete or corrupted by interference noise, attention over frame-level features struggles to locate the regions of interest that are closely related to the motion state. Worse still, at each time step the hidden states have no access to the posterior decoding states, so the predicted future information is not fully exploited and detailed context-aware information is lost. In this paper, we propose a novel video captioning framework, the Memory-attended Semantic Context-aware Network (MaSCN), which captures the sequential dependencies of visual features across adjacent time steps between different outputs. To exploit pivotal features from coarse-grained to fine-grained, we introduce an attention module in MaSCN that uses tailored Visual Semantic LSTM (VSLSTM) layers to map visual relationship information more precisely through a multi-level attention mechanism. In addition, we integrate the visual features obtained through the attention mechanism by late fusion, and a visual semantic loss is used to explicitly memorize contextual information and capture fine-grained detailed cues. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
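
For intuition only, the sketch below shows one way the hidden-state-guided temporal attention and the late fusion of attended visual features mentioned in the abstract could be wired up in PyTorch. All names (TemporalAttention, LateFusionDecoderStep), feature dimensions, and the fusion scheme are illustrative assumptions, not the MaSCN implementation; in particular, the memory-attended VSLSTM layers and the visual semantic loss are not modeled here.

```python
# Minimal, illustrative sketch (assumptions, not the authors' code): additive
# temporal attention over per-frame features guided by the decoder hidden
# state, with late fusion of appearance (2D-CNN) and motion (3D-CNN) contexts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttention(nn.Module):
    """Bahdanau-style attention over frame features, guided by the LSTM hidden state."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_frames, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hidden(hidden).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)        # attention weights over frames
        return (alpha * feats).sum(dim=1)       # attended context vector


class LateFusionDecoderStep(nn.Module):
    """One decoding step: attend separately to appearance and motion features,
    fuse the attended contexts (late fusion), then update the language LSTM."""

    def __init__(self, app_dim=2048, mot_dim=1024, hidden_dim=512, vocab_size=10000, emb_dim=300):
        super().__init__()
        self.app_attn = TemporalAttention(app_dim, hidden_dim)
        self.mot_attn = TemporalAttention(mot_dim, hidden_dim)
        self.fuse = nn.Linear(app_dim + mot_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, app_feats, mot_feats, state):
        h, c = state
        ctx_app = self.app_attn(app_feats, h)   # attended appearance context
        ctx_mot = self.mot_attn(mot_feats, h)   # attended motion context
        ctx = torch.tanh(self.fuse(torch.cat([ctx_app, ctx_mot], dim=-1)))  # late fusion
        h, c = self.lstm(torch.cat([self.embed(word_ids), ctx], dim=-1), (h, c))
        return self.out(h), (h, c)              # next-word logits + new state


if __name__ == "__main__":
    step = LateFusionDecoderStep()
    app = torch.randn(2, 26, 2048)              # 26 sampled frames, 2D-CNN features
    mot = torch.randn(2, 26, 1024)              # clip-level 3D-CNN features
    state = (torch.zeros(2, 512), torch.zeros(2, 512))
    logits, state = step(torch.tensor([1, 1]), app, mot, state)
    print(logits.shape)                         # torch.Size([2, 10000])
```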

Acknowledgements

This work was supported in part by the Fundamental Research Funds for the Central Universities of China under Grant 191010001, in part by the Hubei Key Laboratory of Transportation Internet of Things under Grant 2020III026GX, and in part by the Department of Science and Technology, Hubei Provincial People’s Government under Grant 2017CFA012.

Author information

Corresponding author

Correspondence to Xian Zhong.

Ethics declarations

Conflict of interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Additional information

Communicated by Gopal Chaudhary.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chen, S., Zhong, X., Wu, S. et al. Memory-attended semantic context-aware network for video captioning. Soft Comput (2021). https://doi.org/10.1007/s00500-021-06360-6
