Abstract
Describing the visual content of a video in natural language is a fundamental task in computer vision: the generated description must both concisely summarize the video and express its visual information in well-formed sentences with correct grammar and appropriate wording. The task has wide potential applications in early education, visual aids, automatic narration, and human-machine interaction. With the help of deep learning, a variety of effective video description models now exist. However, visual and language semantics are frequently mined in isolation, so the two modalities cannot complement each other, which makes it difficult to further improve the accuracy and semantic quality of the generated sentences. To address this challenge, this work proposes a framework for video description with hybrid enhancement and complementation of visual and language semantics. Specifically, language and visual semantic enhancement branches are first integrated with a multimodal feature-based module. A multi-objective joint training strategy is then employed to optimize the model. Finally, the output probabilities of the three branches are fused by weighted averaging to predict the word at each time step. Additionally, deep fusion modules based on language and visual semantic enhancement are combined under the same joint training and sequential probability fusion scheme for further performance gains. Experimental results on the MSVD and MSR-VTT2016 datasets demonstrate the effectiveness of the proposed models: they substantially outperform the baseline Deep-Glove model (denoted E-LSC for brevity and comparison) and achieve competitive performance against state-of-the-art methods. In particular, the proposed \(\hbox {HE-VLSC}^\#\) model reaches 52.4 BLEU4 and 81.5 CIDEr on MSVD.
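To make the sequential probability fusion concrete, the following is a minimal Python/NumPy sketch of weighted-average fusion of the word distributions produced by three decoder branches at a single time step. The branch names, the `softmax` and `fuse_and_predict` helpers, and the fusion weights are illustrative assumptions for exposition, not the values or API of the paper's implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_predict(logits_mm, logits_lang, logits_vis, weights=(0.4, 0.3, 0.3)):
    """Fuse one time step's word distributions from three branches.

    logits_* : (vocab_size,) logit vectors from the multimodal,
    language-enhancing, and visual-enhancing branches.
    weights  : illustrative fusion weights (assumed, not from the paper);
    as long as they sum to one, the fused result is a valid distribution.
    """
    w_mm, w_lang, w_vis = weights
    p = (w_mm * softmax(logits_mm)
         + w_lang * softmax(logits_lang)
         + w_vis * softmax(logits_vis))
    return int(np.argmax(p)), p  # predicted word index and fused distribution
```

In practice the weights would be tuned on a validation set; the sketch only illustrates that late fusion of the branch distributions reduces, at each decoding step, to a convex combination followed by an argmax over the vocabulary.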
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Nos. 62062041 and 61961023), the Jiangxi Provincial Natural Science Foundation (No. 20212BAB202020), the Ph.D. Research Initiation Project of Jinggangshan University (Nos. JZB1923 and JZB1807), and the Bidding Project for the Foundation of College's Key Research on Humanities and Social Science of Jiangxi Province (No. JD17082).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Tang, P., Tan, Y. & Luo, W. Visual and language semantic hybrid enhancement and complementary for video description. Neural Comput & Applic 34, 5959–5977 (2022). https://doi.org/10.1007/s00521-021-06733-w