Abstract
Describing the visual content of a video in natural language is a fundamental task in computer vision: the generated description must both concisely summarize the video and express its visual information in well-formed sentences with correct grammar and appropriate wording. The task has wide potential applications in early education, visual aids, automatic narration, and human-machine interaction. With the help of deep learning, a variety of effective video description models now exist. However, visual and language semantics are frequently mined in isolation, so the two modalities cannot complement each other, which makes it difficult to further improve the accuracy and semantic quality of the generated sentences. To address this challenge, this work proposes a framework for video description with hybrid enhancement and complementation of visual and language semantics. Specifically, language and visual semantic enhancement branches are first integrated with a multimodal feature-based module. A multi-objective joint training strategy is then employed to optimize the model. Finally, the output probabilities of the three branches are fused by weighted averaging to predict the word at each time step. Additionally, deep fusion modules based on language and visual semantic enhancement are combined under the same joint training and sequential probability fusion scheme for further performance gains. Experimental results on the MSVD and MSR-VTT2016 datasets demonstrate the effectiveness of the proposed models: they substantially outperform the baseline Deep-Glove model (denoted E-LSC for brevity and comparison) and achieve competitive performance against state-of-the-art methods. In particular, the proposed \(\hbox {HE-VLSC}^\#\) model reaches 52.4 BLEU4 and 81.5 CIDEr on MSVD.
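To make the sequential probability fusion concrete, the following is a minimal Python/NumPy sketch of weighted-average fusion of the word distributions produced by three decoder branches at a single time step. The branch names, the `softmax` and `fuse_and_predict` helpers, and the fusion weights are illustrative assumptions for exposition, not the values or API of the paper's implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_predict(logits_mm, logits_lang, logits_vis, weights=(0.4, 0.3, 0.3)):
    """Fuse one time step's word distributions from three branches.

    logits_* : (vocab_size,) logit vectors from the multimodal,
    language-enhancing, and visual-enhancing branches.
    weights  : illustrative fusion weights (assumed, not from the paper);
    as long as they sum to one, the fused result is a valid distribution.
    """
    w_mm, w_lang, w_vis = weights
    p = (w_mm * softmax(logits_mm)
         + w_lang * softmax(logits_lang)
         + w_vis * softmax(logits_vis))
    return int(np.argmax(p)), p  # predicted word index and fused distribution
```

In practice the weights would be tuned on a validation set; the sketch only illustrates that late fusion of the branch distributions reduces, at each decoding step, to a convex combination followed by an argmax over the vocabulary.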
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Nos. 62062041 and 61961023), the Jiangxi Provincial Natural Science Foundation (No. 20212BAB202020), the Ph.D. Research Initiation Project of Jinggangshan University (Nos. JZB1923 and JZB1807), and the Bidding Project for the Foundation of College's Key Research on Humanities and Social Science of Jiangxi Province (No. JD17082).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Tang, P., Tan, Y. & Luo, W. Visual and language semantic hybrid enhancement and complementary for video description. Neural Comput & Applic 34, 5959–5977 (2022). https://doi.org/10.1007/s00521-021-06733-w