Weakly Supervised Grounding for VQA in Vision-Language Transformers

Khan, Aisha Urooj; Kuehne, Hilde; Gan, Chuang; Lobo, Niels Da Vitoria; Shah, Mubarak

doi:10.1007/978-3-031-19833-5_38

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13695))

Included in the following conference series:

European Conference on Computer Vision

2086 Accesses
4 Citations

Abstract

Transformers for visual-language representation learning have been getting a lot of interest and shown tremendous performance on visual question answering (VQA) and grounding. However, most systems that show good performance of those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. Our approach leverages capsules by transforming each visual token into a capsule representation in the visual encoder; it then uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA as well as VQA-HAT dataset for VQA grounding. Our experiments show that: while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field. (Code is available at https://github.com/aurooj/WSG-VQA-VLTransformers)

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

SCBM: A Hybrid Model for Vietnamese Visual Question Answering

Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder

Towards Open Ended and Free Form Visual Question Answering: Modeling VQA as a Factoid Question Answering Problem

References

Abacha, A.B., Hasan, S.A., Datla, V.V., Liu, J., Demner-Fushman, D., Müller, H.: VQA-med: overview of the medical visual question answering task at imageclef 2019. (2019)
Google Scholar
Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)
Google Scholar
Arbelle, A., et al.: Detector-free weakly supervised grounding by separation. arXiv preprint arXiv:2104.09829 (2021)
Caron, M., et al.: Emerging properties in self-supervised vision transformers (2021)
Google Scholar
Chen, K., Gao, J., Nevatia, R.: Knowledge aided consistency for weakly supervised phrase grounding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4042–4050 (2018)
Google Scholar
Chen, Y.C., et al.: Uniter: learning universal image-text representations (2019)
Google Scholar
Chen, Z., Ma, L., Luo, W., Wong, K.Y.K.: Weakly-supervised spatio-temporally grounding natural sentence in video. arXiv preprint arXiv:1906.02549 (2019)
Das, A., Agrawal, H., Zitnick, C.L., Parikh, D., Batra, D.: Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2016)
Google Scholar
Datta, S., Sikka, K., Roy, A., Ahuja, K., Parikh, D., Divakaran, A.: Align2ground: weakly supervised phrase grounding guided by image-caption alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2601–2610 (2019)
Google Scholar
Desai, K., Johnson, J.: VirTex: learning visual representations from textual annotations. In: CVPR (2021)
Google Scholar
Duan, S., Cao, J., Zhao, H.: Capsule-transformer for neural machine translation. arXiv preprint arXiv:2004.14649 (2020)
Duarte, K., Rawat, Y., Shah, M.: Videocapsulenet: a simplified network for action detection. In: Advances in Neural Information Processing Systems, pp. 7610–7619 (2018)
Google Scholar
Duarte, K., Rawat, Y.S., Shah, M.: Capsulevos: semi-supervised video object segmentation using capsule routing. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8480–8489 (2019)
Google Scholar
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913 (2017)
Google Scholar
Gu, S., Feng, Y.: Improving multi-head attention with capsule networks. In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2019. LNCS (LNAI), vol. 11838, pp. 314–326. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32233-5_25
Gurari, D., et al.: Vizwiz grand challenge: answering visual questions from blind people. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3617 (2018)
Google Scholar
Hinton, G.: How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627 (2021)
Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 44–51. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21735-7_6
Hinton, G.E., Sabour, S., Frosst, N.: Matrix capsules with EM routing. In: International Conference on Learning Representations (2018)
Google Scholar
Huang, D.A., Buch, S., Dery, L., Garg, A., Fei-Fei, L., Niebles, J.C.: Finding" it": weakly-supervised reference-aware visual grounding in instructional videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5948–5957 (2018)
Google Scholar
Huang, Z., Zeng, Z., Huang, Y., Liu, B., Fu, D., Fu, J.: Seeing out of the box: End-to-end pre-training for vision-language representation learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, J.: Pixel-bert: aligning image pixels with text by deep multi-modal transformers. CoRR abs/2004.00849 (2020). https://arxiv.org/abs/2004.00849
Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: International Conference on Learning Representations (ICLR) (2018)
Google Scholar
Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Google Scholar
Khan, A.U., Kuehne, H., Duarte, K., Gan, C., Lobo, N., Shah, M.: Found a reason for me? weakly-supervised grounded visual question answering using capsules (2021)
Google Scholar
Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: a survey. arXiv preprint arXiv:2101.01169 (2021)
Kim, W., Son, B., Kim, I.: VILT: vision-and-language transformer without convolution or region supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 5583–5594. PMLR (2021). https://proceedings.mlr.press/v139/kim21k.html
Kim, W., Son, B., Kim, I.: VILT: vision-and-language transformer without convolution or region supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 5583–5594. PMLR (2021). https://proceedings.mlr.press/v139/kim21k.html
Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)
Google Scholar
LaLonde, R., Bagci, U.: Capsules for object segmentation. arXiv preprint arXiv:1804.04241 (2018)
Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11336–11344 (2020)
Google Scholar
Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., Hoi, S.: Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS (2021)
Google Scholar
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
Li, X., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, J., et al.: Transformer-based capsule network for stock movement prediction. In: Proceedings of the First Workshop on Financial Technology and Natural Language Processing, pp. 66–73 (2019)
Google Scholar
Liu, Y., Wan, B., Ma, L., He, X.: Relation-aware instance refinement for weakly supervised visual grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5612–5621 (2021)
Google Scholar
Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265 (2019)
Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10437–10446 (2020)
Google Scholar
Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. Adv. Neural Inf. Process. Syst. 29 (2016)
Google Scholar
Mazzia, V., Salvetti, F., Chiaberge, M.: Efficient-capsnet: capsule network with self-attention routing. arXiv preprint arXiv:2101.12491 (2021)
Miech, A., Alayrac, J.B., Laptev, I., Sivic, J., Zisserman, A.: Thinking fast and slow: efficient text-to-visual retrieval with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9826–9836 (2021)
Google Scholar
Mobiny, A., Cicalese, P.A., Nguyen, H.V.: Trans-caps: transformer capsule networks with self-attention routing (2021). https://openreview.net/forum?id=BUPIRa1D2J
Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.S., Wen, J.R.: Counterfactual VQA: a cause-effect look at language bias. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12700–12710 (2021)
Google Scholar
Pfeiffer, J., et al.: XGQA: cross-lingual visual question answering. arXiv preprint arXiv:2109.06082 (2021)
Pucci, R., Micheloni, C., Martinel, N.: Self-attention agreement among capsules. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 272–280 (2021)
Google Scholar
Qiao, T., Dong, J., Xu, D.: Exploring human-like attention supervision in visual question answering. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Google Scholar
Radford, A., et al.: Learning transferable visual models from natural language supervision (2021)
Google Scholar
Ramakrishnan, S., Agrawal, A., Lee, S.: Overcoming language priors in visual question answering with adversarial regularization. arXiv preprint arXiv:1810.03649 (2018)
Ribeiro, F.D.S., Duarte, K., Everett, M., Leontidis, G., Shah, M.: Learning with capsules: a survey. arXiv preprint arXiv:2206.02664 (2022)
Riquelme, F., De Goyeneche, A., Zhang, Y., Niebles, J.C., Soto, A.: Explaining VQA predictions using visual grounding and a knowledge base. Image Vision Comput. 101, 103968 (2020)
Google Scholar
Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: NIPS (2017)
Google Scholar
Selvaraju, R.R., et al.: Taking a hint: leveraging explanations to make vision and language models more grounded. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2591–2600 (2019)
Google Scholar
Shi, J., Xu, J., Gong, B., Xu, C.: Not all frames are equal: weakly-supervised video grounding with contextual similarity and visual clustering losses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10444–10452 (2019)
Google Scholar
Su, W., et al.: Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)
Tan, H., Bansal, M.: Lxmert: learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (2019)
Google Scholar
Wang, L., Huang, J., Li, Y., Xu, K., Yang, Z., Yu, D.: Improving weakly supervised visual grounding by contrastive knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14090–14100 (2021)
Google Scholar
Whitehead, S., Wu, H., Ji, H., Feris, R., Saenko, K.: Separating skills and concepts for novel visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5632–5641 (2021)
Google Scholar
Wu, L., Liu, X., Liu, Q.: Centroid transformers: learning to abstract with attention. arXiv preprint arXiv:2102.08606 (2021)
Xiao, F., Sigal, L., Jae Lee, Y.: Weakly-supervised visual grounding of phrases with linguistic structures. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5945–5954 (2017)
Google Scholar
Yang, X., Liu, X., Jian, M., Gao, X., Wang, M.: Weakly-supervised video object grounding by exploring spatio-temporal contexts. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1939–1947 (2020)
Google Scholar
Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29 (2016)
Google Scholar
Zeng, X., Wang, Y., Chiu, T.Y., Bhattacharya, N., Gurari, D.: Vision skills needed to answer visual questions. Proc. ACM Hum. Comput. Interact. 4(CSCW2), 1–31 (2020)
Article Google Scholar
Zhan, L.M., Liu, B., Fan, L., Chen, J., Wu, X.M.: Medical visual question answering via conditional reasoning. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2345–2354 (2020)
Google Scholar
Zhang, J., Bargal, S.A., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. Int. J. Comput. Vision 126(10), 1084–1102 (2018)
Article Google Scholar
Zhang, S., Qu, L., You, S., Yang, Z., Zhang, J.: Automatic generation of grounded visual questions. arXiv preprint arXiv:1612.06530 (2016)
Zhang, Y., Niebles, J.C., Soto, A.: Interpretable visual question answering by visual grounding from attention supervision mining. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 349–357 (2019). https://doi.org/10.1109/WACV.2019.00043

Download references

Acknowledgement

Aisha Urooj is supported by the ARO grant W911NF-19-1-0356. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ARO, IARPA, DOI/IBC, or the U.S. Government.

Author information

Authors and Affiliations

University of Central Florida, Orlando, FL, USA
Aisha Urooj Khan, Niels Da Vitoria Lobo & Mubarak Shah
Goethe University Frankfurt, Frankfurt, Hesse, Germany
Hilde Kuehne
MIT-IBM Watson AI Lab, Cambridge, MA, USA
Hilde Kuehne & Chuang Gan

Authors

Aisha Urooj Khan
View author publications
You can also search for this author in PubMed Google Scholar
Hilde Kuehne
View author publications
You can also search for this author in PubMed Google Scholar
Chuang Gan
View author publications
You can also search for this author in PubMed Google Scholar
Niels Da Vitoria Lobo
View author publications
You can also search for this author in PubMed Google Scholar
Mubarak Shah
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Aisha Urooj Khan .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 9881 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Khan, A.U., Kuehne, H., Gan, C., Lobo, N.D.V., Shah, M. (2022). Weakly Supervised Grounding for VQA in Vision-Language Transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_38

Download citation

DOI: https://doi.org/10.1007/978-3-031-19833-5_38
Published: 04 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Weakly Supervised Grounding for VQA in Vision-Language Transformers

Abstract

Access this chapter

Similar content being viewed by others

SCBM: A Hybrid Model for Vietnamese Visual Question Answering

Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder

Towards Open Ended and Free Form Visual Question Answering: Modeling VQA as a Factoid Question Answering Problem

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 9881 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Weakly Supervised Grounding for VQA in Vision-Language Transformers

Abstract

Access this chapter

Similar content being viewed by others

SCBM: A Hybrid Model for Vietnamese Visual Question Answering

Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder

Towards Open Ended and Free Form Visual Question Answering: Modeling VQA as a Factoid Question Answering Problem

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 9881 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation