Spatially Aware Multimodal Transformers for TextVQA

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)


Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully connected transformer-based architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer in which each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves absolute accuracy over current state-of-the-art methods on TextVQA by 2.2% overall (relative to an improved baseline) and by 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly, on ST-VQA we improve absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.
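The core mechanism described above can be illustrated with a minimal, self-contained sketch: multi-head self-attention where each head is restricted by its own boolean mask over entity pairs (one spatial-relation subset per head), so a head only attends to graph neighbors rather than all visual entities. This is an illustrative toy with random projections standing in for learned weights, not the authors' implementation; the function name and interface are assumptions.

```python
import numpy as np

def spatial_self_attention(x, head_masks, seed=0):
    """Toy multi-head self-attention where head h may only attend over
    the entity pairs allowed by head_masks[h] (True = may attend).

    x          : (n, d) visual-entity features
    head_masks : (H, n, n) boolean masks, one spatial-relation subset per head
    Returns the concatenated head outputs and the per-head attention maps.
    """
    n, d = x.shape
    num_heads = len(head_masks)
    d_h = d // num_heads
    rng = np.random.default_rng(seed)
    outputs, attn_maps = [], []
    for mask in head_masks:
        # Random projections stand in for the learned Wq / Wk / Wv matrices.
        Wq, Wk, Wv = (rng.normal(size=(d, d_h)) for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(d_h)
        # Hide non-neighbors from this head before the softmax.
        scores = np.where(mask, scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v)
        attn_maps.append(weights)
    return np.concatenate(outputs, axis=-1), np.stack(attn_maps)
```

Because the mask differs per head, each head specializes in a different relation subset (the paper's second design point), and masked positions receive effectively zero attention weight, keeping each head's context local.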


Keywords: VQA · TextVQA · Self-attention



The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

Supplementary material

504446_1_En_41_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 4.2 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Georgia Institute of Technology, Atlanta, Georgia, USA
  2. Google, Columbus, USA
  3. Facebook AI Research (FAIR), New York City, USA
  4. University of Illinois, Champaign, USA
