Abstract
Describing the contents and activities of an image or video in semantically and syntactically correct sentences is known as captioning. Automated captioning is among the most actively researched topics today, with increasingly sophisticated models proposed every year. Captioning models require intensive training and perform complex computations before generating a caption, and hence take considerable time even on machines with high specifications. In this survey, we review recent state-of-the-art advances in automatic image and video description using deep neural networks and summarize the concepts that emerge from them. The summary is based on a systematic, detailed, and critical analysis of recent methods published in high-impact proceedings and journals. Our investigation focuses on techniques that optimize existing concepts and incorporate new forms of visual attention for caption generation. The survey emphasizes the applicability and effectiveness of existing work in real-life applications and highlights computationally feasible, optimized techniques that can run on a range of devices, including lightweight ones such as smartphones. Finally, we propose possible improvements and a model architecture to support online video captioning.
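The encoder-decoder pipeline with visual attention that most of the surveyed methods build on can be summarized in a few lines. The sketch below is a minimal, illustrative PyTorch implementation of soft attention in the style of "show, attend and tell", not the architecture of any particular surveyed paper; all names and dimensions (AttentionDecoder, feat_dim, and so on) are our own assumptions. A CNN encoder (for example a pretrained ResNet, or a MobileNet on lightweight devices) produces a grid of region features; at each decoding step an LSTM attends over those regions and predicts the next word.

```python
# Minimal sketch of a soft-attention captioning decoder.
# Illustrative only: names and dimensions are assumptions, not a reference model.
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (Bahdanau-style) attention over the encoder's region features.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.logits = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, R, feat_dim) region features from a CNN encoder
        # captions: (B, T) ground-truth token ids (teacher forcing)
        B, T = captions.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        emb = self.embed(captions)                            # (B, T, embed_dim)
        outputs = []
        for t in range(T):
            # Attention weights over the R regions, conditioned on the hidden state.
            e = self.att_out(torch.tanh(
                self.att_feat(feats) + self.att_hid(h).unsqueeze(1))).squeeze(-1)
            alpha = torch.softmax(e, dim=1)                   # (B, R)
            context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)
            h, c = self.lstm(torch.cat([emb[:, t], context], dim=1), (h, c))
            outputs.append(self.logits(h))
        return torch.stack(outputs, dim=1)                    # (B, T, vocab_size)
```

Training would minimize cross-entropy between these logits and the reference captions; at inference, greedy argmax or beam search replaces teacher forcing. For video, the region features would typically be replaced by per-frame CNN features pooled or encoded by a second recurrent network.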
Cite this article
Bhowmik, A., Kumar, S. & Bhat, N. Evolution of automatic visual description techniques-a methodological survey. Multimed Tools Appl 80, 28015–28059 (2021). https://doi.org/10.1007/s11042-021-10964-3