
Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods

  • Survey Article
  • Published:
SN Computer Science

Abstract

Video captioning is the automated generation of natural language sentences that describe the contents of video frames. Owing to the unparalleled performance of deep learning in computer vision and natural language processing in recent years, research in this field has increased exponentially. Numerous approaches, datasets, and evaluation metrics have been introduced in the literature, calling for a systematic survey to guide research efforts in this exciting new direction. Through statistical analysis, this survey focuses on state-of-the-art approaches, emphasizing deep learning models, assessing benchmark datasets across several parameters, and weighing the pros and cons of the various evaluation metrics based on prior work in the deep learning field. This survey identifies the most frequently used variants of neural networks for visual and spatio-temporal feature extraction as well as for language generation. The results show that ResNet and VGG are the most common visual feature extractors and that 3D convolutional neural networks are the most common spatio-temporal feature extractors. In addition, Long Short-Term Memory (LSTM) has been the dominant language model, although the Gated Recurrent Unit (GRU) and the Transformer are gradually replacing it. Regarding dataset usage, MSVD and MSR-VTT remain dominant, as most of the best-reported results across captioning models are obtained on them. From 2015 to 2020, across all major datasets, models such as Inception-ResNet-v2 + C3D + LSTM, ResNet-101 + I3D + Transformer, and ResNet-152 + ResNeXt-101 (R3D) + (LSTM, GAN) have achieved by far the best results in video captioning. Despite this rapid advancement, our survey reveals that video captioning research still has far to go in realizing the full potential of deep learning for classifying and captioning a large number of activities, and in creating large datasets that cover diverse training video samples.
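To make the dominant pipeline identified above concrete, the following is a minimal PyTorch-style sketch of the CNN + LSTM encoder-decoder pattern: per-frame features from a pretrained 2D CNN (e.g. ResNet or VGG) are summarized by a temporal encoder, and an LSTM language model decodes the caption. This is an illustrative sketch only; the class name, dimensions, and hyper-parameters are assumptions and do not correspond to any particular surveyed model.

import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    # Hypothetical encoder-decoder captioner, assuming per-frame features
    # (e.g. 2048-d ResNet activations) are extracted offline.
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Temporal encoder over per-frame CNN features.
        self.frame_encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Language model: LSTM decoder over word embeddings.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len) token ids.
        _, state = self.frame_encoder(frame_feats)               # summarize the clip into an LSTM state
        dec_out, _ = self.decoder(self.embed(captions), state)   # decode conditioned on the video
        return self.out(dec_out)                                 # (batch, seq_len, vocab_size) logits

# Hypothetical usage: 16 sampled frames per clip, trained with cross-entropy (teacher forcing).
model = VideoCaptioner()
logits = model(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))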


Availability of data and material

All data and materials are extracted from, and properly cited to, existing works.


Author information

Corresponding author

Correspondence to Saiful Islam.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Code availability

No code was executed for this work, as it is a survey paper.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Islam, S., Dash, A., Seum, A. et al. Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods. SN COMPUT. SCI. 2, 120 (2021). https://doi.org/10.1007/s42979-021-00487-x

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s42979-021-00487-x
