
Image caption generation using transformer learning methods: a case study on instagram image

Published in: Multimedia Tools and Applications

Abstract 

Nowadays, images are used more and more extensively for communication. A single image can convey a variety of stories, depending on the perspective and thoughts of each viewer. Including image captions greatly aids comprehension, especially for individuals with visual impairments who read Braille or rely on audio descriptions. The purpose of this research is to create an automatic captioning system whose output is easy to understand and quick to generate, and which can be applied to other related systems. In this research, the transformer learning process is applied to image captioning in place of the conventional combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which has limitations in processing long-sequence data and managing data complexity; the transformer handles both well and more efficiently. The image captioning system was trained on a dataset of 5,000 Instagram images tagged with the hashtag "Phuket" (#Phuket). The researchers also wrote captions for these images themselves, to serve as the test set for the captioning system. The experiments showed that the transformer learning process can generate natural captions that are close to human language. The generated captions are evaluated with the Bilingual Evaluation Understudy (BLEU) and Metric for Evaluation of Translation with Explicit Ordering (METEOR) scores, metrics that measure the similarity between machine-generated and human-written text, which allows us to compare the researcher-written captions with the transformer-generated captions.
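To illustrate the kind of architecture the abstract describes, the following is a minimal PyTorch sketch of a transformer captioning model: a ViT-style patch encoder feeding a standard transformer decoder. This is an illustrative reconstruction, not the authors' published model; all hyperparameters (vocab_size, d_model, patch size, depth, caption length) are placeholder assumptions.

import torch
import torch.nn as nn

class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, nhead=8,
                 num_layers=4, patch_size=16, img_size=224, max_len=40):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the image into patch_size x patch_size tiles
        # and project each tile to d_model dimensions (the ViT idea of
        # treating an image as a sequence of "visual words").
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch_size,
                                     stride=patch_size)
        self.patch_pos = nn.Parameter(torch.zeros(1, num_patches, d_model))
        # Token embedding plus learned positions for the caption decoder.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.tok_pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead,
                                          num_encoder_layers=num_layers,
                                          num_decoder_layers=num_layers,
                                          batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids (teacher forcing).
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)
        src = patches + self.patch_pos
        tgt = self.tok_embed(captions) + self.tok_pos[:, :captions.size(1)]
        # Causal mask so each caption position attends only to earlier tokens.
        mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.lm_head(out)  # (B, T, vocab_size) next-token logits

model = CaptionTransformer()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 10000])

Unlike a CNN-RNN pipeline, nothing here is recurrent: the decoder sees the whole prefix at once through self-attention, which is what lets transformers handle long caption sequences without the step-by-step bottleneck of an RNN.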
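The evaluation step the abstract mentions can be reproduced in miniature with NLTK's reference implementations of BLEU and METEOR. The two captions below are invented placeholders, not examples from the paper's dataset.

# Requires: pip install nltk, then nltk.download('wordnet') for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

reference = "a long tail boat floats on the clear sea at phuket".split()
candidate = "a long tail boat on the clear sea in phuket".split()

# BLEU scores n-gram overlap with the reference; smoothing avoids
# zero scores when a higher-order n-gram never matches in short captions.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
# METEOR also credits stems and WordNet synonyms and penalizes
# fragmented word order, so it tracks human judgments more closely.
meteor = meteor_score([reference], candidate)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")

Scoring every generated caption against its researcher-written reference and averaging gives the corpus-level similarity figures the paper reports.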




Data availability

The data that support the findings of this study are available from the authors upon reasonable request.


Author information

Corresponding author

Correspondence to Kamontorn Prompitak.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Dittakan, K., Prompitak, K., Thungklang, P. et al. Image caption generation using transformer learning methods: a case study on instagram image. Multimed Tools Appl 83, 46397–46417 (2024). https://doi.org/10.1007/s11042-023-17275-9

