
Image caption generation using transformer learning methods: a case study on instagram image

Published in: Multimedia Tools and Applications

Abstract 

Nowadays, images are used more and more extensively for communication. A single image can convey a variety of stories, depending on the perspective and thoughts of each viewer. Including image captions greatly aids comprehension, especially for individuals with visual impairments who read Braille or rely on audio descriptions. The purpose of this research is to create an automatic captioning system whose output is easy to understand and quick to generate, and which can be applied to other related systems. In this research, the transformer learning process is applied to image captioning in place of the conventional combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which has limitations in processing long-sequence data and managing data complexity; the transformer handles both well and more efficiently. The image captioning system was trained on a dataset of 5,000 Instagram images tagged with the hashtag "Phuket" (#Phuket). The researchers also wrote captions for these images themselves, to serve as the test set for the captioning system. The experiments showed that the transformer learning process can generate natural captions that are close to human language. The generated captions are evaluated with the Bilingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores, metrics that measure the similarity between machine-generated and human-written text, which allows us to compare the researcher-written captions with the transformer-generated captions.
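To illustrate the kind of architecture the abstract describes, the following is a minimal PyTorch sketch of a transformer captioning model: a ViT-style patch encoder feeding a standard transformer decoder. This is an illustrative reconstruction, not the authors' published model; all hyperparameters (vocab_size, d_model, patch size, depth, caption length) are placeholder assumptions.

import torch
import torch.nn as nn

class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, nhead=8,
                 num_layers=4, patch_size=16, img_size=224, max_len=40):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the image into patch_size x patch_size tiles
        # and project each tile to d_model dimensions (the ViT idea of
        # treating an image as a sequence of "visual words").
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size,
                                     stride=patch_size)
        self.patch_pos = nn.Parameter(torch.zeros(1, num_patches, d_model))
        # Token embedding plus learned positions for the caption decoder.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.tok_pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids (teacher forcing).
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)
        src = patches + self.patch_pos
        tgt = self.tok_embed(captions) + self.tok_pos[:, :captions.size(1)]
        # Causal mask so each caption position attends only to earlier tokens.
        mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.lm_head(out)  # (B, T, vocab_size) next-token logits

model = CaptionTransformer()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 10000])

Unlike a CNN-RNN pipeline, nothing here is recurrent: the decoder sees the whole prefix at once through self-attention, which is what lets transformers handle long caption sequences without the step-by-step bottleneck of an RNN.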
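The evaluation step the abstract mentions can be reproduced in miniature with NLTK's reference implementations of BLEU and METEOR. The two captions below are invented placeholders, not examples from the paper's dataset.

# Requires: pip install nltk, then nltk.download('wordnet') for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

reference = "a long tail boat floats on the clear sea at phuket".split()
candidate = "a long tail boat on the clear sea in phuket".split()

# BLEU scores n-gram overlap with the reference; smoothing avoids
# zero scores when a higher-order n-gram never matches in short captions.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
# METEOR also credits stems and WordNet synonyms and penalizes
# fragmented word order, so it tracks human judgments more closely.
meteor = meteor_score([reference], candidate)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")

Scoring every generated caption against its researcher-written reference and averaging gives the corpus-level similarity figures the paper reports.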




Data availability

The data that support the findings of this study are available from the authors upon reasonable request.


Author information

Corresponding author

Correspondence to Kamontorn Prompitak.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Dittakan, K., Prompitak, K., Thungklang, P. et al. Image caption generation using transformer learning methods: a case study on instagram image. Multimed Tools Appl 83, 46397–46417 (2024). https://doi.org/10.1007/s11042-023-17275-9

