Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM

Neural Processing Letters

Abstract

Automatically generating captions for videos is highly challenging because it is a cross-modal task that bridges vision and language. Most existing models generate caption words based solely on visual content features, ignoring important underlying semantic information. The relationship between explicit semantics and hidden visual content is not holistically exploited, making it difficult to produce accurate, fine-grained captions from a global view. To better extract and integrate semantic information, we propose a novel encoder-decoder framework built on a bi-directional long short-term memory network with an attention model and a conversion gate (BiLSTM-CG), which transfers auxiliary attributes and then generates detailed captions. Specifically, we extract semantic attributes from sliced frames in a multiple-instance learning (MIL) manner; MIL algorithms learn a classification function that can predict the labels of bags and/or instances in the visual content. In the encoding stage, we adopt 2D and 3D convolutional neural networks to encode video clips and then feed the concatenated features into a BiLSTM. In the decoding stage, we design a CG that adaptively fuses semantic attributes into the hidden features at the word level, converting auxiliary attributes and textual embeddings for video captioning. Furthermore, the CG automatically decides, at each time step, whether to draw on the explicit semantics or to rely on the hidden states of the language model to generate the next word. Extensive experiments on the MSR-VTT and MSVD video captioning datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.
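
To make the gating idea concrete, below is a minimal PyTorch sketch of a BiLSTM encoder over concatenated 2D/3D CNN clip features and a single decoding step with a conversion gate that adaptively mixes an attribute embedding with the previous-word embedding. All layer names, dimensions, and the exact gating formula are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the BiLSTM-CG idea described in the abstract.
# Dimensions, module names, and the gating formula are assumptions for illustration.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encodes concatenated 2D/3D CNN clip features with a bidirectional LSTM."""
    def __init__(self, feat_dim=4096, hidden_dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, clip_feats):            # clip_feats: (B, T, feat_dim)
        outputs, _ = self.bilstm(clip_feats)  # (B, T, 2 * hidden_dim)
        return outputs

class ConversionGateDecoderStep(nn.Module):
    """One decoding step: a gate decides, per word, how much of the
    semantic-attribute embedding vs. the previous-word embedding to use."""
    def __init__(self, vocab_size=10000, embed_dim=512,
                 attr_dim=512, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attr_proj = nn.Linear(attr_dim, embed_dim)
        self.gate = nn.Linear(embed_dim + hidden_dim, embed_dim)
        self.lstm_cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, attr_probs, h, c):
        w = self.embed(prev_word)              # (B, embed_dim)
        a = self.attr_proj(attr_probs)         # (B, embed_dim), MIL attribute scores
        g = torch.sigmoid(self.gate(torch.cat([w, h], dim=-1)))  # conversion gate
        x = g * a + (1.0 - g) * w              # adaptively fuse attributes and words
        h, c = self.lstm_cell(x, (h, c))
        return self.out(h), h, c               # vocabulary logits, new states
```

In this sketch the sigmoid gate g interpolates, element-wise and per time step, between the projected attribute vector and the word embedding, which is one plausible way to realize the "adaptive conversion" behaviour the abstract describes.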

Acknowledgements

This work was supported in part by the National Social Science Foundation of China under Grant 15BGL048, the Fundamental Research Funds for the Central Universities of China under Grant 191010001, the Hubei Key Laboratory of Transportation Internet of Things under Grants 2018IOT003 and 2020III026GX, and the Science and Technology Department of Hubei Province under Grant 2017CFA012.

Author information

Corresponding author

Correspondence to Xian Zhong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chen, S., Zhong, X., Li, L. et al. Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM. Neural Process Lett 52, 2353–2369 (2020). https://doi.org/10.1007/s11063-020-10352-2
