Abstract
Automatically generating captions for videos is challenging because it is a cross-modal task that bridges vision and language. Most existing models generate words based solely on visual content features and ignore the important underlying semantic information; the relationship between explicit semantics and hidden visual content is not exploited holistically, so fine-grained captions are rarely produced accurately from a global view. To better extract and integrate semantic information, we propose a novel encoder-decoder framework, bi-directional long short-term memory with an attention model and a conversion gate (BiLSTM-CG), which transfers auxiliary attributes and then generates detailed captions. Specifically, we extract semantic attributes from sliced frames in a multiple-instance learning (MIL) manner, where MIL algorithms learn a classification function that predicts the labels of bags and/or instances in the visual content. In the encoding stage, we adopt 2D and 3D convolutional neural networks to encode video clips and feed the concatenated features into a BiLSTM. In the decoding stage, we design a conversion gate (CG) that adaptively fuses semantic attributes into hidden features at the word level and converts auxiliary attributes and textual embeddings for video captioning. Furthermore, the CG automatically decides, at each time step, whether to capture the explicit semantics or to rely on the hidden states of the language model to generate the next word. Extensive experiments on the MSR-VTT and MSVD video captioning datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
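To make the conversion-gate idea concrete, the following is a minimal PyTorch-style sketch of a gating module that, at each decoding step, balances MIL-predicted semantic attributes against the decoder's hidden state. The module name, dimensions, and exact gating formula are illustrative assumptions based on the abstract, not the paper's published formulation.

```python
import torch
import torch.nn as nn


class ConversionGate(nn.Module):
    """Illustrative sketch of a word-level conversion gate (CG).

    At each decoding step the gate decides, per dimension, whether the
    next-word prediction should draw on the explicit semantic attributes
    or on the hidden state of the language model.
    """

    def __init__(self, hidden_dim: int, attr_dim: int):
        super().__init__()
        # Project MIL attribute scores into the decoder's hidden space.
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)
        # Gate conditioned on the current hidden state and previous word embedding.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, hidden, word_emb, attributes):
        # hidden:     (batch, hidden_dim) decoder LSTM hidden state
        # word_emb:   (batch, hidden_dim) embedding of the previously generated word
        # attributes: (batch, attr_dim)   MIL-predicted semantic attribute scores
        g = torch.sigmoid(self.gate(torch.cat([hidden, word_emb], dim=-1)))
        attr_feat = torch.tanh(self.attr_proj(attributes))
        # Convex combination: explicit semantics vs. language-model state.
        return g * attr_feat + (1.0 - g) * hidden
```

In this sketch the fused vector would feed the output projection that predicts the next word; the actual BiLSTM-CG decoder may combine the streams differently.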
Acknowledgements
This work was supported in part by the National Social Science Foundation of China under Grant 15BGL048, the Fundamental Research Funds for the Central Universities of China under Grant 191010001, the Hubei Key Laboratory of Transportation Internet of Things under Grants 2018IOT003 and 2020III026GX, and the Science and Technology Department of Hubei Province under Grant 2017CFA012.
Cite this article
Chen, S., Zhong, X., Li, L. et al. Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM. Neural Process Lett 52, 2353–2369 (2020). https://doi.org/10.1007/s11063-020-10352-2