Abstract
Automatically generating captions for videos is challenging because it is a cross-modal task that bridges vision and language. Most existing models generate words based solely on visual content features and ignore the important underlying semantic information; the relationship between explicit semantics and hidden visual content is not exploited holistically, so fine-grained captions are rarely produced accurately from a global view. To better extract and integrate semantic information, we propose a novel encoder-decoder framework, bi-directional long short-term memory with an attention model and a conversion gate (BiLSTM-CG), which transfers auxiliary attributes and then generates detailed captions. Specifically, we extract semantic attributes from sliced frames in a multiple-instance learning (MIL) manner, where MIL algorithms learn a classification function that predicts the labels of bags and/or instances in the visual content. In the encoding stage, we adopt 2D and 3D convolutional neural networks to encode video clips and feed the concatenated features into a BiLSTM. In the decoding stage, we design a conversion gate (CG) that adaptively fuses semantic attributes into hidden features at the word level and converts auxiliary attributes and textual embeddings for video captioning. Furthermore, the CG automatically decides, at each time step, whether to capture the explicit semantics or to rely on the hidden states of the language model to generate the next word. Extensive experiments on the MSR-VTT and MSVD video captioning datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
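To make the conversion-gate idea concrete, the following is a minimal PyTorch-style sketch of a gating module that, at each decoding step, balances MIL-predicted semantic attributes against the decoder's hidden state. The module name, dimensions, and exact gating formula are illustrative assumptions based on the abstract, not the paper's published formulation.

```python
import torch
import torch.nn as nn


class ConversionGate(nn.Module):
    """Illustrative sketch of a word-level conversion gate (CG).

    At each decoding step the gate decides, per dimension, whether the
    next-word prediction should draw on the explicit semantic attributes
    or on the hidden state of the language model.
    """

    def __init__(self, hidden_dim: int, attr_dim: int):
        super().__init__()
        # Project MIL attribute scores into the decoder's hidden space.
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)
        # Gate conditioned on the current hidden state and previous word embedding.
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, hidden, word_emb, attributes):
        # hidden:     (batch, hidden_dim) decoder LSTM hidden state
        # word_emb:   (batch, hidden_dim) embedding of the previously generated word
        # attributes: (batch, attr_dim)   MIL-predicted semantic attribute scores
        g = torch.sigmoid(self.gate(torch.cat([hidden, word_emb], dim=-1)))
        attr_feat = torch.tanh(self.attr_proj(attributes))
        # Convex combination: explicit semantics vs. language-model state.
        return g * attr_feat + (1.0 - g) * hidden
```

In this sketch the fused vector would feed the output projection that predicts the next word; the actual BiLSTM-CG decoder may combine the streams differently.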
Acknowledgements
This work was supported in part by the National Social Science Foundation of China under Grant 15BGL048, the Fundamental Research Funds for the Central Universities of China under Grant 191010001, the Hubei Key Laboratory of Transportation Internet of Things under Grants 2018IOT003 and 2020III026GX, and the Science and Technology Department of Hubei Province under Grant 2017CFA012.
Cite this article
Chen, S., Zhong, X., Li, L. et al. Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM. Neural Process Lett 52, 2353–2369 (2020). https://doi.org/10.1007/s11063-020-10352-2