
Bornon: Bengali Image Captioning with Transformer-Based Deep Learning Approach

  • Original Research

Abstract

Image captioning with an encoder–decoder approach, in which a CNN serves as the encoder and a sequence generator such as an RNN serves as the decoder, has proven very effective. However, this method has a drawback: the sequence must be processed in order. To overcome this drawback, some researchers have applied the transformer model to generate captions from images using English datasets, but none have generated captions in Bengali with the transformer model. We therefore used three different Bengali datasets to generate Bengali captions from images with the transformer model. Additionally, we compared the performance of the transformer-based model with a visual-attention-based encoder–decoder approach. Finally, we compared the results of the transformer-based model with those of other models that used different Bengali image-captioning datasets.
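To make the architecture described above concrete, the sketch below shows a minimal CNN-encoder / transformer-decoder captioning model in TensorFlow/Keras (2.10+ assumed for `use_causal_mask`). The vocabulary size, caption length, embedding width, and single decoder block are illustrative assumptions rather than the configuration reported in this paper; the visual-attention baseline mentioned in the abstract would instead pair the CNN encoder with an RNN decoder.

```python
# Minimal sketch (not the paper's exact configuration): a pre-trained CNN
# encodes the image into a grid of features, and a single transformer
# decoder block attends over those features to predict the next caption token.
import tensorflow as tf

VOCAB_SIZE = 8000   # assumed Bengali vocabulary size
MAX_LEN = 40        # assumed maximum caption length
EMBED_DIM = 256
NUM_HEADS = 4


class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    """Sum of learned token embeddings and learned position embeddings."""

    def __init__(self, max_len, vocab_size, embed_dim):
        super().__init__()
        self.tok = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.pos = tf.keras.layers.Embedding(max_len, embed_dim)

    def call(self, tokens):
        positions = tf.range(start=0, limit=tf.shape(tokens)[-1], delta=1)
        return self.tok(tokens) + self.pos(positions)


# Encoder: InceptionV3 feature grid projected to the transformer width.
cnn = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
image_in = tf.keras.Input(shape=(299, 299, 3))
feat = cnn(image_in)                                  # (batch, 8, 8, 2048)
feat = tf.keras.layers.Reshape((64, 2048))(feat)      # flatten spatial grid
enc_out = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(feat)

# Decoder block: masked self-attention over the partial caption,
# cross-attention over the image features, then a feed-forward layer.
tok_in = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, EMBED_DIM)(tok_in)

self_att = tf.keras.layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM)(
    x, x, use_causal_mask=True)                       # no peeking at future tokens
x = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([x, self_att]))

cross_att = tf.keras.layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM)(
    x, enc_out)                                       # attend to image regions
x = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([x, cross_att]))

ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(EMBED_DIM),
])
x = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([x, ffn(x)]))

logits = tf.keras.layers.Dense(VOCAB_SIZE)(x)         # per-position next-token scores
model = tf.keras.Model(inputs=[image_in, tok_in], outputs=logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```

At inference time such a model would generate a caption token by token: feed the image together with the tokens produced so far, pick the next token from the logits at the last position (greedily or with beam search), append it, and repeat until an end token or MAX_LEN is reached.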

Notes

  1. https://www.vistawide.com/languages/top_30_languages.htm.

  2. https://www.kaggle.com/adityajn105/flickr8k/activity.

  3. https://translate.google.com/.

  4. https://data.mendeley.com/datasets/hf6sf8zrkc/2.


Author information

Corresponding author

Correspondence to Mayeesha Humaira.

Ethics declarations

Conflict of Interest

The authors have no conflicts of interest to disclose regarding the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Muhammad Shah, F., Humaira, M., Jim, M.A.R.K. et al. Bornon: Bengali Image Captioning with Transformer-Based Deep Learning Approach. SN COMPUT. SCI. 3, 90 (2022). https://doi.org/10.1007/s42979-021-00975-0

