
Evolution of automatic visual description techniques-a methodological survey

Published in Multimedia Tools and Applications

Abstract

Describing the contents and activities of an image or video in semantically and syntactically correct sentences is known as captioning. Automated captioning is one of the most actively researched topics today, with new and increasingly sophisticated models being proposed every day. Captioning models require intensive training and perform complex computations before successfully generating a caption, and hence take a considerable amount of time even on machines with high specifications. In this survey, we review recent state-of-the-art advances in automatic image and video description using deep neural networks and summarize the concepts that emerge from them. The summary is built on a systematic, detailed, and critical analysis of the latest methodologies published in high-impact proceedings and journals. Our investigation focuses on techniques that can optimize existing concepts and incorporate new methods of visual attention for generating captions. The survey emphasizes the applicability and effectiveness of existing work in real-life settings and highlights computationally feasible, optimized techniques that can run on multiple devices, including lightweight devices such as smartphones. Furthermore, we propose possible improvements and a model architecture to support online video captioning.
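
To make the recurring encoder-decoder pattern concrete, the sketch below shows a minimal, illustrative captioning decoder with soft visual attention in the spirit of "Show, Attend and Tell": a CNN is assumed to have encoded the image into a grid of region features, and an LSTM decoder attends over those regions at each word step. This is not any specific surveyed model; the layer sizes, toy vocabulary, and class names (SoftAttention, CaptionDecoder) are hypothetical choices for illustration, and a real system would use a pretrained CNN encoder and train on a captioning dataset such as MS COCO.

```python
# Minimal sketch (assumed sizes, toy data) of attention-based caption decoding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) attention over image region features."""
    def __init__(self, feat_dim, hid_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hid_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, R, feat_dim) region features; hidden: (B, hid_dim) decoder state
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)          # (B, R) attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(1)   # (B, feat_dim) weighted context
        return context, alpha


class CaptionDecoder(nn.Module):
    """LSTM decoder that conditions each word step on an attended image context."""
    def __init__(self, vocab_size, feat_dim=512, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attention = SoftAttention(feat_dim, hid_dim, attn_dim=256)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, R, feat_dim); captions: (B, T) token ids, used with teacher forcing
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T - 1):
            context, _ = self.attention(feats, h)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                # (B, T-1, vocab_size)


if __name__ == "__main__":
    # Toy forward pass: 2 images, 49 regions (e.g. a 7x7 CNN feature map), 512-d features.
    feats = torch.randn(2, 49, 512)
    caps = torch.randint(0, 1000, (2, 12))               # hypothetical token ids
    model = CaptionDecoder(vocab_size=1000)
    scores = model(feats, caps)
    loss = F.cross_entropy(scores.reshape(-1, 1000), caps[:, 1:].reshape(-1))
    print(scores.shape, loss.item())
```

The same decoder skeleton extends to video captioning by replacing the region features with per-frame (or clip-level) features, which is the setting several of the surveyed video description methods build on.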




Author information

Corresponding author

Correspondence to Sanjay Kumar.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Bhowmik, A., Kumar, S. & Bhat, N. Evolution of automatic visual description techniques-a methodological survey. Multimed Tools Appl 80, 28015–28059 (2021). https://doi.org/10.1007/s11042-021-10964-3

