Abstract
Image captioning with an encoder–decoder approach, where a CNN serves as the encoder and a sequence generator such as an RNN serves as the decoder, has proven very effective. However, this method has a drawback: the sequence must be processed in order. To overcome this limitation, some researchers have used the transformer model to generate captions from images on English datasets, but none have generated Bengali captions with a transformer. We therefore used three different Bengali datasets to generate Bengali captions from images with the transformer model. Additionally, we compared the performance of the transformer-based model with a visual attention-based encoder–decoder approach. Finally, we compared the results of the transformer-based model with those of other models trained on different Bengali image captioning datasets.
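The transformer's advantage over an RNN decoder is that attention scores every position against every other position in one step, so caption tokens need not be processed strictly in order during training. The sketch below is illustrative only, not code from the paper: a minimal pure-Python scaled dot-product attention, the core operation of the transformer decoder.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query vector, attend over ALL key/value positions at
    once; there is no recurrence, so the sequence is handled in parallel."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot-product similarity of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# With two identical keys, both positions get weight 0.5,
# so the output is the mean of the two value vectors.
result = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[0.0, 2.0], [4.0, 0.0]],
)
```

In a full captioning transformer, the queries come from the partially generated caption and the keys/values from the CNN's image features, but the attention computation itself is the same.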
Ethics declarations
Conflict of Interest
The authors have no conflicts of interest to disclose regarding the subject matter or materials discussed in this manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Muhammad Shah, F., Humaira, M., Jim, M.A.R.K. et al. Bornon: Bengali Image Captioning with Transformer-Based Deep Learning Approach. SN COMPUT. SCI. 3, 90 (2022). https://doi.org/10.1007/s42979-021-00975-0