Optimal transformers based image captioning using beam search

Multimedia Tools and Applications

Abstract

Image captioning is the process of generating textual descriptions of given images. It spans two major fields of deep learning: computer vision and natural language processing. This paper presents an image captioning model that uses a Convolutional Neural Network (CNN) for feature extraction and a transformer architecture to generate sequences from the resulting feature vectors. For feature extraction, several CNN architectures are evaluated: Xception, InceptionV3, ResNet50V2, VGG19, DenseNet201, ResNet152V2, EfficientNetV2B3, and EfficientNetV2B0. The proposed method takes advantage of the transformer model for faster processing, and beam search is used to obtain the top-N most probable sequences for each image. The architecture is trained on the Flickr8k dataset, and the model outperforms existing methods, achieving a BLEU-4 score of 0.2184.
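
To make the decoding step concrete, the sketch below implements a generic beam search that keeps the N most probable partial captions at every step, which is how the top-N sequences mentioned above are obtained. It is a minimal illustration rather than the paper's implementation: the `log_prob_fn` callable, the start/end token ids, and the toy bigram table in the usage example are hypothetical stand-ins for the trained transformer decoder (conditioned on CNN image features) and its vocabulary.

```python
import numpy as np

def beam_search(log_prob_fn, start_id, end_id, beam_width=3, max_len=20):
    """Return the top-`beam_width` most probable token sequences.

    `log_prob_fn(seq)` must return a 1-D array of log-probabilities over the
    vocabulary for the next token, given the partial sequence `seq`.
    """
    beams = [([start_id], 0.0)]      # (partial sequence, cumulative log-prob)
    finished = []

    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = log_prob_fn(seq)
            # Expand each beam with its `beam_width` best next tokens.
            for tok in np.argsort(log_probs)[::-1][:beam_width]:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # Keep only the globally best `beam_width` candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == end_id else beams).append((seq, score))
        if not beams:                # every surviving beam has emitted <end>
            break

    finished.extend(beams)           # include unfinished beams as a fallback
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


# Toy usage: vocabulary of 5 tokens, 0 = <start>, 4 = <end>.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    table = np.log(rng.dirichlet(np.ones(5), size=5))  # fake bigram "decoder"

    def toy_log_probs(seq):
        return table[seq[-1]]        # next-token distribution given last token

    for seq, score in beam_search(toy_log_probs, start_id=0, end_id=4,
                                  beam_width=3, max_len=10):
        print(seq, round(score, 3))
```

Because only `beam_width` hypotheses are kept at each step, the search cost grows linearly with caption length rather than exponentially, which is the usual motivation for beam search over exhaustive decoding.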

Data availability statement

This work uses the Flickr8k dataset [15].

References

  1. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv:1607.06450

  2. Balasubramaniam D (2021) Evaluating the performance of transformer architecture over attention architecture on image captioning

  3. Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72

  4. Carrara F, Falchi F, Caldelli R, Amato G, Becarelli R (2019) Adversarial image detection in deep neural networks. Multimedia Tools Appl 78(3):2815–2835

  5. Cho K, Van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259

  6. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1251–1258

  7. Chu Y, Yue X, Yu X, Wang Z (2020) Automatic image captioning based on ResNet50 and LSTM with soft attention

  8. Dash SK, Acharya S, Pakray P, Das R, Gelbukh A (2020) Topic-based image caption generation

  9. do Carmo Nogueira T, Vinhal CDN, da Cruz Júnior G, Ullmann MRD (2020) Reference-based model using multimodal gated recurrent units for image captioning. Multimedia Tools Appl 79(41):30615–30635

  10. Fang F, Wang H, Chen Y, Tang P (2018) Looking deeper and transferring attention for image captioning. Multimedia Tools Appl 77(23):31159–31175

  11. Gu J, Wang G, Cai J, Chen T (2017) An empirical study of language CNN for image captioning. In: Proceedings of the IEEE international conference on computer vision, pp 1222–1231

  12. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  13. He S, Liao W, Tavakoli HR, Yang M, Rosenhahn B, Pugeault N (2020) Image captioning through image transformer. In: Proceedings of the Asian conference on computer vision

  14. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  15. Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: data, models and evaluation metrics. J Artif Intell Res 47:853–899

  16. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708

  17. Katiyar S, Borgohain SK (2021) Image captioning using deep stacked LSTMs, contextual word embeddings and data augmentation. arXiv:2102.11237

  18. Li X, Jiang S (2019) Know more say less: image captioning based on scene graphs. IEEE Trans Multimedia 21(8):2117–2130

  19. Lin C-Y, Och FJ (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In: Proceedings of the 42nd annual meeting of the association for computational linguistics (ACL-04), pp 605–612

  20. Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: European conference on computer vision, Springer, pp 740–755

  21. Liu R, Han Y (2022) Instance-sequence reasoning for video question answering. Front Comput Sci 16(6):166708

  22. Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318

  23. Reimers N, Gurevych I (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing, association for computational linguistics, p 11

  24. Schuster M, Nakajima K (2012) Japanese and Korean voice search. In: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5149–5152

  25. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

  26. Song K, Tan X, Qin T, Lu J, Liu T-Y (2020) MPNet: masked and permuted pre-training for language understanding. Adv Neural Inf Process Syst 33:16857–16867

  27. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826

  28. Tan HY, Chan SC (2018) Phrase-based image caption generator with hierarchical LSTM network

  29. Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, pp 6105–6114

  30. Tiwary T, Mahapatra RP (2022) An accurate generation of image captions for blind people using extended convolutional atom neural network. Multimedia Tools Appl 1–30

  31. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30

  32. Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575

  33. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164

  34. Wang C, Yang H, Bartz C, Meinel C (2016) Image captioning with deep bidirectional LSTMs. In: Proceedings of the 24th ACM international conference on multimedia, pp 988–997

  35. Wang S, Lan L, Zhang X, Dong G, Luo Z (2020) Object-aware semantics of attention for image captioning. Multimedia Tools Appl 79(3):2013–2030

  36. Wang X, Zhu L, Wu Y, Yang Y (2020) Symbiotic attention for egocentric action recognition with object-centric alignment. IEEE Trans Pattern Anal Mach Intell

  37. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057

  38. Xu S (2022) CLIP-Diffusion-LM: apply diffusion model on image captioning. arXiv:2210.04559

  39. Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Computat Linguist 2:67–78

  40. Zhou C, Lei Z, Chen S, Huang Y, Xianrui L (2016) A sparse transformer-based approach for image captioning

Funding

There has been no significant financial support for this work that could have influenced its outcome.

Author information

Corresponding author

Correspondence to Ashish Shetty.

Ethics declarations

Conflicts of interest

The authors confirm that there are no known conflicts of interest associated with this publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shetty, A., Kale, Y., Patil, Y. et al. Optimal transformers based image captioning using beam search. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17359-6
