
A comprehensive literature review on image captioning methods and metrics based on deep learning technique

Published in: Multimedia Tools and Applications

Abstract

One of the trending areas of study in artificial intelligence is image captioning. Image captioning is the process of creating descriptive text for visual objects, image metadata, or entities present in an image. By integrating computer vision and Natural Language Processing (NLP), it extracts features from the image, uses them to identify objects, actions, and the relationships among them, and generates image descriptions. It is not only an extremely important but also a very difficult task in computer vision research. A large body of work on image captioning methods that utilize deep learning has been conducted. The goal of this article is to discover, evaluate, and summarize the works that examine deep learning applications in the context of image captioning systems. Using a systematic literature review (SLR) technique, we found 548 papers, of which 38 were identified as primary studies and therefore underwent in-depth analysis. The results of this review demonstrate that LSTM, CNN, and RNN are the most commonly employed deep learning techniques for image captioning. The most widely used datasets among the selected primary studies are MS COCO, Flickr8k, and Flickr30k; these are standardized benchmark datasets employed by researchers to compare their methods on common test-beds. The review also showed that BLEU, CIDEr, SPICE, METEOR, and ROUGE-L are the most frequently employed evaluation metrics. Despite the considerable advancements achieved by deep learning approaches in this domain, there remains room for improvement. Finally, the review outlines directions for future research on image captioning systems. We believe that this SLR will act as a reference for other scientists and an inspiration to gather the most recent data for their study evaluation.
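To make the pipeline summarized above concrete, the sketch below illustrates the encoder-decoder pattern that most CNN/LSTM-based captioners covered by such reviews follow: a pretrained CNN encodes the image into a feature vector, and an LSTM decodes that vector into a word sequence. This is a minimal illustrative sketch in PyTorch, assuming a recent torchvision; the class names, embedding sizes, and the frozen ResNet-50 backbone are our own illustrative choices and do not correspond to any specific surveyed method.

# Minimal CNN-encoder + LSTM-decoder captioning sketch (PyTorch).
# Illustrative only; not the implementation of any specific surveyed model.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Extracts a fixed-length feature vector from an image with a pretrained CNN."""
    def __init__(self, embed_size):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):                   # images: (B, 3, H, W)
        with torch.no_grad():                    # backbone kept frozen in this sketch
            feats = self.cnn(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                    # (B, embed_size)

class DecoderRNN(nn.Module):
    """Generates a caption word by word with an LSTM conditioned on the image feature."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):       # captions: (B, T) token ids
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)             # (B, T+1, hidden_size)
        return self.out(hidden)                   # word scores at every decoding step

Likewise, the n-gram-based metrics named above score a generated caption against one or more human reference captions. A hedged example using NLTK's BLEU implementation (our choice of toolkit, not one prescribed by the review; METEOR, ROUGE-L, CIDEr, and SPICE are computed analogously with reference-based toolkits) is:

# BLEU-4 between a candidate caption and two reference captions (NLTK).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "on", "the", "grass"],
              ["a", "brown", "dog", "is", "running", "outside"]]
candidate = ["a", "dog", "is", "running", "on", "grass"]

score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),   # equal 1- to 4-gram weights
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")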




Data availability

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.


Acknowledgements

We would like to extend our appreciation to Al-Ahliyya Amman University for providing all necessary support to conduct this research work.

Author information


Corresponding author

Correspondence to Ahmad Sami Al-Shamayleh.

Ethics declarations

Conflicts of interest

The authors have no conflicts of interest to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Al-Shamayleh, A.S., Adwan, O., Alsharaiah, M.A. et al. A comprehensive literature review on image captioning methods and metrics based on deep learning technique. Multimed Tools Appl 83, 34219–34268 (2024). https://doi.org/10.1007/s11042-024-18307-8

