
An efficient automated image caption generation by the encoder decoder model

Published in: Multimedia Tools and Applications

Abstract

Image caption generation has become a popular research topic that attracts many researchers. It is a complex task because it combines natural language processing (NLP) and computer vision. A range of image-captioning strategies connect visual content with everyday language, for example by describing images with textual captions. In the literature, pre-trained classification networks such as CNNs, together with RNN-based neural network models, are used to encode visual data. Although many works have analyzed image caption techniques, they still fall short of delivering strong performance across diverse databases. To overcome these issues, this work presents an automated, optimized deep learning model for image caption generation. The input image is first pre-processed, and an encoder-decoder structure then extracts the visual features and generates the caption. On the encoder side, a pre-trained ResNet-101 (residual network) extracts the visual features, and on the decoder side an SA-Bi-LSTM (self-attention with bi-directional Long Short-Term Memory) generates the caption. In addition, the Chimp Optimization Algorithm (ChOA) is used to improve detection performance in caption generation. The proposed encoder-decoder model is tested on the benchmark Flickr8k, Flickr30k and COCO datasets, attaining BLEU and RIBES scores of 0.8595 and 0.3531 on Flickr8k. Thus, the proposed SA-BiLSTM model achieves significant performance in image caption generation.
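The abstract's SA-Bi-LSTM decoder applies self-attention over the decoder's Bi-LSTM hidden states before predicting caption words. The paper's exact attention formulation is not reproduced here, but the general idea can be sketched with a minimal scaled dot-product self-attention over a matrix of hidden states (all names, shapes, and the dot-product variant are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def self_attention(H):
    """Scaled dot-product self-attention over a sequence of hidden states.

    H: (T, d) array of T hidden states, e.g. the concatenated
    forward/backward outputs of a Bi-LSTM at each time step.
    Returns the attended representations (T, d) and the weights (T, T).
    """
    d = H.shape[1]
    scores = H @ H.T / np.sqrt(d)                   # pairwise similarities (T, T)
    scores -= scores.max(axis=1, keepdims=True)     # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # each row sums to 1
    return weights @ H, weights                     # weighted mix of states

# Toy example: 5 time steps, hidden size 8
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
context, w = self_attention(H)
print(context.shape, w.shape)           # (5, 8) (5, 5)
print(np.allclose(w.sum(axis=1), 1.0))  # True: each row is a distribution
```

Each output state is thus a weighted combination of every Bi-LSTM state, letting the decoder attend to any part of the sequence when emitting a word.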


Data availability

Data sharing is not applicable to this article.


Funding

No funding was received for the preparation of this manuscript.

Author information

Contributions

All authors contributed equally to this work.

Corresponding author

Correspondence to Khustar Ansari.

Ethics declarations

Ethical approval

This article does not contain any studies with human participants or animals performed by the authors.

Conflict of interest

The authors declare that they have no conflict of interest.

Consent to participate

All authors have agreed to participate in this article.

Consent to publish

All authors give full consent for the publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ansari, K., Srivastava, P. An efficient automated image caption generation by the encoder decoder model. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18150-x
