
An efficient automated image caption generation by the encoder decoder model

Published in: Multimedia Tools and Applications

Abstract

Image caption generation has become a popular research topic that attracts many researchers. It is a complex task because it combines natural language processing (NLP) and computer vision. A range of image-captioning strategies connect visual content with everyday language, for example by describing images with textual captions. In the literature, pre-trained classification networks such as CNNs, together with RNN-based neural network models, are used to encode visual data. Although many works have analyzed image caption techniques, they still fall short of delivering strong performance across diverse databases. To overcome these issues, this work presents an automated, optimized deep learning model for image caption generation. The input image is first pre-processed, and an encoder-decoder structure then extracts the visual features and generates the caption. On the encoder side, a pre-trained ResNet-101 (residual network) extracts the visual features, and on the decoder side an SA-Bi-LSTM (self-attention with bi-directional Long Short-Term Memory) generates the caption. In addition, the Chimp Optimization Algorithm (ChOA) is used to improve detection performance in caption generation. The proposed encoder-decoder model is tested on the benchmark Flickr8k, Flickr30k and COCO datasets, attaining BLEU and RIBES scores of 0.8595 and 0.3531 on Flickr8k. Thus, the proposed SA-BiLSTM model achieves significant performance in image caption generation.
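The abstract's SA-Bi-LSTM decoder applies self-attention over the decoder's Bi-LSTM hidden states before predicting caption words. The paper's exact attention formulation is not reproduced here, but the general idea can be sketched with a minimal scaled dot-product self-attention over a matrix of hidden states (all names, shapes, and the dot-product variant are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def self_attention(H):
    """Scaled dot-product self-attention over a sequence of hidden states.

    H: (T, d) array of T hidden states, e.g. the concatenated
    forward/backward outputs of a Bi-LSTM at each time step.
    Returns the attended representations (T, d) and the weights (T, T).
    """
    d = H.shape[1]
    scores = H @ H.T / np.sqrt(d)                   # pairwise similarities (T, T)
    scores -= scores.max(axis=1, keepdims=True)     # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # each row sums to 1
    return weights @ H, weights                     # weighted mix of states

# Toy example: 5 time steps, hidden size 8
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
context, w = self_attention(H)
print(context.shape, w.shape)           # (5, 8) (5, 5)
print(np.allclose(w.sum(axis=1), 1.0))  # True: each row is a distribution
```

Each output state is thus a weighted combination of every Bi-LSTM state, letting the decoder attend to any part of the sequence when emitting a word.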


Data availability

Data sharing is not applicable to this article.


Funding

No funding was received for the preparation of this manuscript.

Author information

Contributions

All authors contributed equally to this work.

Corresponding author

Correspondence to Khustar Ansari.

Ethics declarations

Ethical approval

This article does not contain any studies with human participants or animals performed by the authors.

Conflict of interest

The authors declare that they have no conflict of interest.

Consent to participate

All authors have agreed to participate in this article.

Consent to publish

All authors give full consent for the publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ansari, K., Srivastava, P. An efficient automated image caption generation by the encoder decoder model. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18150-x
