
A Comprehensive Attention-Based Model for Image Captioning

  • Conference paper
Computer Networks and Inventive Communication Technologies

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies ((LNDECT,volume 75))


Abstract

Image captioning, also known as automatic image annotation, is the task of describing an image in text according to the contents and properties observed in the picture. It has numerous applications, such as virtual assistants for people with visual impairment, social media, and several other tasks in computer vision and deep learning. Another interesting application is that a video can be described frame by frame by image captioning (treating the video as a carousel of images). In this paper, we use an encoder–decoder architecture with an attention mechanism for captioning images: layers of a CNN serve as the encoder and layers of an RNN as the decoder. We used the Adam optimiser, which gave the best results for our architecture, and beam search and greedy search for decoding the captions. The BLEU score was calculated to estimate the proximity of the generated captions to the reference captions.
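To make the architecture concrete, here is a minimal sketch of an encoder–decoder captioning model with additive attention, assuming PyTorch. The ResNet-50 backbone, all layer sizes, and every name in the snippet are illustrative assumptions for this sketch, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """CNN encoder: maps an image to a grid of region feature vectors."""

    def __init__(self):
        super().__init__()
        # Backbone is an assumption; pretrained weights would be used in practice.
        backbone = models.resnet50(weights=None)
        # Drop the average-pool and classification head, keep the spatial feature map.
        self.features = nn.Sequential(*list(backbone.children())[:-2])

    def forward(self, images):                           # images: (B, 3, H, W)
        feats = self.features(images)                    # (B, 2048, h, w)
        b, c, h, w = feats.shape
        return feats.view(b, c, h * w).permute(0, 2, 1)  # (B, h*w, 2048) regions


class Attention(nn.Module):
    """Additive attention: scores each image region against the decoder state."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):                    # feats: (B, N, F); hidden: (B, H)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))  # (B, N, 1)
        alpha = torch.softmax(e, dim=1)                  # attention weights over regions
        context = (alpha * feats).sum(dim=1)             # (B, F) attended image summary
        return context, alpha.squeeze(-1)


class Decoder(nn.Module):
    """LSTM decoder that attends to the image before emitting each word."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = Attention(feat_dim, hidden_dim, attn_dim=256)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):                  # captions: (B, T) token ids
        b, t_max = captions.shape
        h = feats.new_zeros(b, self.lstm.hidden_size)
        c = feats.new_zeros(b, self.lstm.hidden_size)
        logits = []
        for t in range(t_max):
            context, _ = self.attend(feats, h)           # re-attend at every time step
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                # (B, T, vocab_size)
```

Training such a model would minimise cross-entropy between the predicted logits and the next ground-truth token, with torch.optim.Adam as the optimiser, as in the paper. At inference time, greedy search takes the argmax token at each step, whereas beam search keeps the k highest-scoring partial captions and extends them in parallel; the generated captions can then be compared with the reference captions using a BLEU implementation such as nltk.translate.bleu_score.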




Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Kumar, V., Dahiya, A., Saini, G., Sheokand, S. (2022). A Comprehensive Attention-Based Model for Image Captioning. In: Smys, S., Bestak, R., Palanisamy, R., Kotuliak, I. (eds) Computer Networks and Inventive Communication Technologies. Lecture Notes on Data Engineering and Communications Technologies, vol 75. Springer, Singapore. https://doi.org/10.1007/978-981-16-3728-5_10


  • DOI: https://doi.org/10.1007/978-981-16-3728-5_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-3727-8

  • Online ISBN: 978-981-16-3728-5

  • eBook Packages: Engineering, Engineering (R0)
