
MUST-VQA: MUltilingual Scene-Text VQA

Part of the Lecture Notes in Computer Science book series (LNCS,volume 13804)


In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that handles new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and is not necessarily aligned with the language of the scene text. We thus introduce MUST-VQA, a natural step towards a more generalized version of STVQA. Accounting for this, we discuss two evaluation scenarios in the constrained setting, namely IID and zero-shot, and we demonstrate that the models can perform on par in the zero-shot setting. We further provide extensive experimentation and show the effectiveness of adapting multilingual language models to STVQA tasks.
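As a rough illustration of the two evaluation scenarios named in the abstract, the sketch below splits predictions by whether the question language was seen during training (IID) or not (zero-shot) and reports accuracy per scenario. The record fields, the language codes, and the use of case-insensitive exact match as the metric are all assumptions made for this example; they are not taken from the paper, whose actual protocol and metric may differ.

```python
from collections import defaultdict


def accuracy_by_setting(records, train_languages):
    """Return exact-match accuracy for the IID and zero-shot buckets.

    records: iterable of dicts with the hypothetical keys
        "question_lang", "prediction", and "answer".
    train_languages: set of language codes assumed to be seen in training.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        # A question is IID if its language was seen in training, else zero-shot.
        setting = "iid" if rec["question_lang"] in train_languages else "zero_shot"
        totals[setting] += 1
        # Simplifying assumption: case-insensitive exact match counts as correct.
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            hits[setting] += 1
    return {setting: hits[setting] / totals[setting] for setting in totals}


# Toy data, invented for illustration only.
toy_records = [
    {"question_lang": "en", "prediction": "Coca Cola", "answer": "coca cola"},
    {"question_lang": "es", "prediction": "stop", "answer": "exit"},
    {"question_lang": "it", "prediction": "Pepsi", "answer": "pepsi"},
]
print(accuracy_by_setting(toy_records, train_languages={"en", "es"}))
```

Comparing the two numbers is one simple way to check whether a model performs on par in the zero-shot setting: a small gap between the `iid` and `zero_shot` entries suggests that behaviour transfers to unseen question languages.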


  • Visual question answering
  • Scene text
  • Translation robustness
  • Multilingual models
  • Zero-shot transfer
  • Power of language models




This work has been supported by projects PDC2021-121512-I00, PLEC2021-00785, PID2020-116298GB-I00, ACE034/21/000084, the CERCA Programme/Generalitat de Catalunya, AGAUR project 2019PROD00090 (BeARS), the Ramon y Cajal RYC2020-030777-I/AEI/10.13039/501100011033 and a PhD scholarship from UAB (B18P0073).

Author information


Corresponding author

Correspondence to Emanuele Vivoli.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Vivoli, E., Biten, A.F., Mafla, A., Karatzas, D., Gomez, L. (2023). MUST-VQA: MUltilingual Scene-Text VQA. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25068-2

  • Online ISBN: 978-3-031-25069-9

  • eBook Packages: Computer Science, Computer Science (R0)