Abstract
Image captioning, also called automatic image annotation, is the task of describing an image in text according to the contents and properties observed in it. It has numerous applications, such as virtual assistants for people with visual impairment, social media, and several other tasks in computer vision and deep learning. Another interesting application is describing a video frame by frame (treating it as a carousel of images). In this paper, we use an encoder–decoder architecture with an attention mechanism to caption images: CNN layers form the encoder and RNN layers the decoder. The Adam optimiser gave the best results for our architecture. Beam search and greedy search were used to decode the captions, and BLEU scores were calculated to estimate the proximity of the generated captions to the reference captions.
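As a concrete illustration of the pipeline sketched above, the following is a minimal PyTorch sketch of a CNN encoder, additive attention, and an LSTM decoder with a greedy decoding loop. It is written under our own assumptions: the ResNet-50 backbone, the class names (`Encoder`, `Attention`, `Decoder`, `greedy_decode`), and all dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """CNN encoder: keeps the spatial feature grid so the decoder can attend to it."""

    def __init__(self):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the average pool and classifier; keep the convolutional trunk.
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])

    def forward(self, images):                              # (B, 3, H, W)
        feats = self.backbone(images)                       # (B, 2048, h, w)
        B, C, h, w = feats.shape
        return feats.view(B, C, h * w).permute(0, 2, 1)     # (B, h*w, 2048)


class Attention(nn.Module):
    """Additive attention: scores each spatial location against the decoder state."""

    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):                       # (B, N, F), (B, H)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)                     # (B, N, 1) weights
        context = (alpha * feats).sum(dim=1)                # (B, F) weighted sum
        return context, alpha


class Decoder(nn.Module):
    """LSTM decoder that re-attends to the image features at every time step."""

    def __init__(self, vocab_size, feat_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.hid_dim = hid_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attention = Attention(feat_dim, hid_dim, att_dim=256)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.fc = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, captions):                     # captions: (B, T) ids
        B, T = captions.shape
        h = feats.new_zeros(B, self.hid_dim)
        c = feats.new_zeros(B, self.hid_dim)
        logits = []
        for t in range(T):                                  # teacher forcing
            context, _ = self.attention(feats, h)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)                   # (B, T, vocab)


@torch.no_grad()
def greedy_decode(encoder, decoder, image, start_id, end_id, max_len=20):
    """Greedy search: emit the single most probable word at each step."""
    feats = encoder(image.unsqueeze(0))
    h = feats.new_zeros(1, decoder.hid_dim)
    c = feats.new_zeros(1, decoder.hid_dim)
    word = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        context, _ = decoder.attention(feats, h)
        x = torch.cat([decoder.embed(word), context], dim=1)
        h, c = decoder.lstm(x, (h, c))
        word = decoder.fc(h).argmax(dim=1)
        if word.item() == end_id:
            break
        caption.append(word.item())
    return caption
```

In such a setup the model would be trained with `torch.optim.Adam` on a cross-entropy loss over the predicted word distributions; beam search differs from the greedy loop only in keeping the k most probable partial captions at each step, and BLEU scores (for example via `nltk.translate.bleu_score.sentence_bleu`) compare the decoded captions against the reference captions.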
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kumar, V., Dahiya, A., Saini, G., Sheokand, S. (2022). A Comprehensive Attention-Based Model for Image Captioning. In: Smys, S., Bestak, R., Palanisamy, R., Kotuliak, I. (eds) Computer Networks and Inventive Communication Technologies. Lecture Notes on Data Engineering and Communications Technologies, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-16-3728-5_10
DOI: https://doi.org/10.1007/978-981-16-3728-5_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3727-8
Online ISBN: 978-981-16-3728-5