Abstract
Image captioning with an encoder–decoder approach, where a CNN serves as the encoder and a sequence generator such as an RNN serves as the decoder, has proven very effective. However, this method has a drawback: the sequence must be processed in order. To overcome this limitation, some researchers have used the transformer model to generate captions from images on English datasets, but none have generated Bengali captions with a transformer. We therefore used three different Bengali datasets to generate Bengali captions from images with the transformer model. Additionally, we compared the performance of the transformer-based model with a visual attention-based encoder–decoder approach. Finally, we compared the results of the transformer-based model with those of other models trained on different Bengali image captioning datasets.
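The transformer's advantage over an RNN decoder is that attention scores every position against every other position in one step, so caption tokens need not be processed strictly in order during training. The sketch below is illustrative only, not code from the paper: a minimal pure-Python scaled dot-product attention, the core operation of the transformer decoder.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query vector, attend over ALL key/value positions at
    once; there is no recurrence, so the sequence is handled in parallel."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot-product similarity of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# With two identical keys, both positions get weight 0.5,
# so the output is the mean of the two value vectors.
result = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 0.0]],
)
```

In a full captioning transformer, the queries come from the partially generated caption and the keys/values from the CNN's image features, but the attention computation itself is the same.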
Ethics declarations
Conflict of Interest
The authors have no conflicts of interest to disclose regarding the subject matter or materials discussed in this manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Muhammad Shah, F., Humaira, M., Jim, M.A.R.K. et al. Bornon: Bengali Image Captioning with Transformer-Based Deep Learning Approach. SN COMPUT. SCI. 3, 90 (2022). https://doi.org/10.1007/s42979-021-00975-0