
Weakly Supervised Grounding for VQA in Vision-Language Transformers

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Transformers for vision-language representation learning have attracted considerable interest and have shown strong performance on visual question answering (VQA) and grounding. However, most systems that perform well on these tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes those detectors cover. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering with transformers. Our approach leverages capsules by transforming each visual token into a capsule representation in the visual encoder; it then uses activations from the language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA and VQA-HAT datasets for VQA grounding. Our experiments show that while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and yields new state-of-the-art results compared to other approaches in the field. (Code is available at https://github.com/aurooj/WSG-VQA-VLTransformers)
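
The abstract only sketches the mechanism. As a rough illustration of the idea it describes (and not the authors' implementation, which is in the linked repository), the snippet below shows one way a text-guided capsule-masking step could look in PyTorch. The module names, capsule dimensions, sigmoid activations, the pooled language-attention signal, and the hard top-k selection rule are all assumptions made for this sketch.

```python
# Minimal sketch, assuming: visual tokens are projected into capsule
# pose + activation form, and a pooled language self-attention signal
# selects which capsules survive before the next transformer layer.
# All names, shapes, and the top-k rule are illustrative, not from the paper.

import torch
import torch.nn as nn


class CapsuleProjection(nn.Module):
    """Turn each visual token into `num_caps` capsules with an activation score."""

    def __init__(self, dim: int, num_caps: int = 8):
        super().__init__()
        self.num_caps = num_caps
        self.caps_dim = dim // num_caps
        self.pose = nn.Linear(dim, num_caps * self.caps_dim)
        self.act = nn.Linear(dim, num_caps)

    def forward(self, visual_tokens: torch.Tensor):
        # visual_tokens: (B, N, dim)
        B, N, _ = visual_tokens.shape
        poses = self.pose(visual_tokens).view(B, N, self.num_caps, self.caps_dim)
        acts = torch.sigmoid(self.act(visual_tokens))  # (B, N, num_caps)
        return poses, acts


def text_guided_mask(poses, acts, text_attn, keep_ratio: float = 0.5):
    """Mask capsules using a language-derived attention signal.

    text_attn: (B, num_caps) -- e.g. self-attention weights pooled from the
    language encoder, one scalar per capsule (an assumption of this sketch).
    """
    B, N, C, D = poses.shape
    scores = acts * text_attn[:, None, :]                    # (B, N, C)
    k = max(1, int(C * keep_ratio))
    topk = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)  # hard selection
    masked = poses * mask.unsqueeze(-1)                      # zero out unselected capsules
    return masked.view(B, N, C * D)                          # back to token features


if __name__ == "__main__":
    B, N, dim = 2, 49, 256
    tokens = torch.randn(B, N, dim)
    proj = CapsuleProjection(dim, num_caps=8)
    poses, acts = proj(tokens)
    text_attn = torch.softmax(torch.randn(B, 8), dim=-1)     # stand-in language signal
    out = text_guided_mask(poses, acts, text_attn)
    print(out.shape)  # torch.Size([2, 49, 256])
```

The masked token features would then be fed to the next visual encoder layer, so that only question-relevant capsules carry information forward; the actual routing and training details are in the paper and released code.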



Acknowledgement

Aisha Urooj is supported by the ARO grant W911NF-19-1-0356. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ARO, IARPA, DOI/IBC, or the U.S. Government.

Author information

Correspondence to Aisha Urooj Khan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 9881 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Khan, A.U., Kuehne, H., Gan, C., Lobo, N.D.V., Shah, M. (2022). Weakly Supervised Grounding for VQA in Vision-Language Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_38

  • DOI: https://doi.org/10.1007/978-3-031-19833-5_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19832-8

  • Online ISBN: 978-3-031-19833-5

  • eBook Packages: Computer Science (R0)
