Abstract
Image captioning, also called automatic image annotation, is the task of describing an image in text according to the contents and properties observed in it. It has numerous applications, such as virtual assistants for people with visual impairment, social media, and several other tasks in computer vision and deep learning. Another interesting application is describing a video frame by frame (treating it as a carousel of images). In this paper, we use an encoder–decoder architecture with an attention mechanism to caption images: CNN layers form the encoder and RNN layers the decoder. The Adam optimiser gave the best results for our architecture. Beam search and greedy search were used to decode the captions, and BLEU scores were calculated to estimate the proximity of the generated captions to the reference captions.
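As a concrete illustration of the pipeline sketched above, the following is a minimal PyTorch sketch of a CNN encoder, additive attention, and an LSTM decoder with a greedy decoding loop. It is written under our own assumptions: the ResNet-50 backbone, the class names (`Encoder`, `Attention`, `Decoder`, `greedy_decode`), and all dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """CNN encoder: keeps the spatial feature grid so the decoder can attend to it."""

    def __init__(self):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the average pool and classifier; keep the convolutional trunk.
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])

    def forward(self, images):                              # (B, 3, H, W)
        feats = self.backbone(images)                       # (B, 2048, h, w)
        B, C, h, w = feats.shape
        return feats.view(B, C, h * w).permute(0, 2, 1)     # (B, h*w, 2048)


class Attention(nn.Module):
    """Additive attention: scores each spatial location against the decoder state."""

    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):                       # (B, N, F), (B, H)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)                     # (B, N, 1) weights
        context = (alpha * feats).sum(dim=1)                # (B, F) weighted sum
        return context, alpha


class Decoder(nn.Module):
    """LSTM decoder that re-attends to the image features at every time step."""

    def __init__(self, vocab_size, feat_dim=2048, emb_dim=256, hid_dim=512):
        super().__init__()
        self.hid_dim = hid_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attention = Attention(feat_dim, hid_dim, att_dim=256)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.fc = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, captions):                     # captions: (B, T) ids
        B, T = captions.shape
        h = feats.new_zeros(B, self.hid_dim)
        c = feats.new_zeros(B, self.hid_dim)
        logits = []
        for t in range(T):                                  # teacher forcing
            context, _ = self.attention(feats, h)
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)                   # (B, T, vocab)


@torch.no_grad()
def greedy_decode(encoder, decoder, image, start_id, end_id, max_len=20):
    """Greedy search: emit the single most probable word at each step."""
    feats = encoder(image.unsqueeze(0))
    h = feats.new_zeros(1, decoder.hid_dim)
    c = feats.new_zeros(1, decoder.hid_dim)
    word = torch.tensor([start_id])
    caption = []
    for _ in range(max_len):
        context, _ = decoder.attention(feats, h)
        x = torch.cat([decoder.embed(word), context], dim=1)
        h, c = decoder.lstm(x, (h, c))
        word = decoder.fc(h).argmax(dim=1)
        if word.item() == end_id:
            break
        caption.append(word.item())
    return caption
```

In such a setup the model would be trained with `torch.optim.Adam` on a cross-entropy loss over the predicted word distributions; beam search differs from the greedy loop only in keeping the k most probable partial captions at each step, and BLEU scores (for example via `nltk.translate.bleu_score.sentence_bleu`) compare the decoded captions against the reference captions.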
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kumar, V., Dahiya, A., Saini, G., Sheokand, S. (2022). A Comprehensive Attention-Based Model for Image Captioning. In: Smys, S., Bestak, R., Palanisamy, R., Kotuliak, I. (eds) Computer Networks and Inventive Communication Technologies. Lecture Notes on Data Engineering and Communications Technologies, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-16-3728-5_10
DOI: https://doi.org/10.1007/978-981-16-3728-5_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3727-8
Online ISBN: 978-981-16-3728-5