
Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning


Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 266)


Multimodal machine translation (MMT) refers to the extraction of information from more than one modality, aiming to improve translation performance by using information from modalities other than pure text. The availability of multimodal datasets, particularly for Indian regional languages, is still limited, so there is a need to build such datasets to advance MMT research. In this work, we describe the creation of the Bengali Visual Genome (BVG) dataset. The BVG is the first multimodal dataset consisting of text and images suitable for English-to-Bengali multimodal machine translation tasks and multimodal research. We also demonstrate sample use cases of machine translation and region-specific image captioning using the new BVG dataset. These results can serve as a baseline for subsequent research.


  • Machine translation
  • Multimodal dataset
  • CNN
  • RNN
  • Image captioning

  • DOI: 10.1007/978-981-16-6624-7_7
  • Chapter length: 8 pages
  • ISBN: 978-981-16-6624-7




The author Ondřej Bojar would like to acknowledge the support of the grant 19-26934X (NEUREM3) of the Czech Science Foundation.

Author information



Corresponding author

Correspondence to Satya Ranjan Dash.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.


Cite this paper

Sen, A., Parida, S., Kotwal, K., Panda, S., Bojar, O., Dash, S.R. (2022). Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning. In: Satapathy, S.C., Peer, P., Tang, J., Bhateja, V., Ghosh, A. (eds) Intelligent Data Engineering and Analytics. Smart Innovation, Systems and Technologies, vol 266. Springer, Singapore.
