Deep Learning Techniques for Automated Image Captioning

  • Conference paper
  • In: Smart Trends in Computing and Communications
  • Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 286)

Abstract

Automated image captioning involves understanding the semantic content of an image and expressing it in natural language. Among the many approaches proposed, deep learning-based techniques have achieved state-of-the-art results on this problem. This paper introduces and compares three distinct deep learning-based approaches: encoder-decoder frameworks, neuroevolution, and attention-based models. It covers their mechanisms and performance, and highlights where they differ from one another. To conclude, the results of these approaches on benchmark datasets and metrics are presented.
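For a concrete picture of the first of these approaches, the sketch below shows a minimal encoder-decoder captioning model: a CNN compresses the image into a fixed-length feature vector, which conditions an LSTM language model that emits the caption word by word. This is an illustrative sketch assuming PyTorch; the layer sizes, the toy vocabulary, and all module names are assumptions made for illustration, not details taken from the paper.

```python
# Minimal encoder-decoder captioning sketch (assumes PyTorch).
# All sizes and names are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small CNN that maps an image to a fixed-length feature vector."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, embed_dim)      # project to word-embedding size

    def forward(self, images):
        return self.fc(self.cnn(images).flatten(1))   # (B, embed_dim)

class Decoder(nn.Module):
    """LSTM language model conditioned on the image feature."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        # The image feature is fed as the first "token" of the sequence;
        # the LSTM then predicts each caption word from the words before it.
        inputs = torch.cat([image_feat.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)           # (B, T+1, hidden_dim)
        return self.out(hidden)                 # logits over the vocabulary

# Toy forward pass: 2 images, captions of length 5, vocabulary of 1000 words.
encoder, decoder = Encoder(), Decoder(vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 5))
logits = decoder(encoder(images), captions)
print(logits.shape)   # torch.Size([2, 6, 1000])
```

In full systems the encoder is typically a network pretrained on image classification (e.g., a ResNet), training minimizes cross-entropy over the caption tokens, and the attention-based variants discussed in the paper let the decoder re-weight spatial feature maps at every decoding step instead of conditioning on a single pooled vector.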

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Srivastava, S., Chaudhari, Y., Damania, Y., Jadhav, P. (2022). Deep Learning Techniques for Automated Image Captioning. In: Zhang, YD., Senjyu, T., So-In, C., Joshi, A. (eds) Smart Trends in Computing and Communications. Lecture Notes in Networks and Systems, vol 286. Springer, Singapore. https://doi.org/10.1007/978-981-16-4016-2_55
