Abstract
Although remarkable success has been achieved on the Visual Question Answering (VQA) task in recent years, most existing models are heavily driven by surface linguistic correlations in the training set and ignore image content. Several recent methods introduce auxiliary tasks (visual annotation, counterfactual samples, etc.) to overcome language priors and strengthen image dependence. However, the problem of inherent priors, i.e., whether a model answers by memorizing priors in the training data, remains unresolved. We therefore propose a novel self-contrastive learning method that addresses this problem without introducing auxiliary tasks: it contrasts the answers predicted from question-relevant regions with those predicted from question-irrelevant regions. Concretely, when the question attends to question-relevant regions versus question-irrelevant regions, different answer spaces are generated, and this contrast prevents the model from being driven by surface language priors. The question is thus forced to rely on relevant image regions to predict the correct answer. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our method. In particular, built on top of the LMH model, our method achieves state-of-the-art performance of 59.00% on the most commonly used benchmark, VQA-CP v2, without auxiliary tasks, an improvement of 6.51%.
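The contrast described in the abstract could be sketched roughly as follows. This is a minimal illustration under assumed choices, not the paper's implementation: the function names, the margin form of the contrast term, and the use of a single ground-truth answer index are all assumptions. The idea shown is only that the ground-truth answer should score higher under the question-relevant branch than under the question-irrelevant one, alongside the usual supervised loss.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_contrastive_loss(logits_rel, logits_irr, answer_idx, margin=1.0):
    """Hypothetical sketch of a self-contrastive VQA loss.

    logits_rel: (B, A) answer logits from question-relevant regions
    logits_irr: (B, A) answer logits from question-irrelevant regions
    answer_idx: (B,) ground-truth answer indices
    """
    p_rel = softmax(logits_rel)
    p_irr = softmax(logits_irr)
    rows = np.arange(len(answer_idx))
    # Standard cross-entropy supervision on the relevant branch.
    ce = -np.log(p_rel[rows, answer_idx] + 1e-12)
    # Contrast: the true answer should be more probable under the
    # relevant branch than under the irrelevant one, by a margin.
    gap = p_rel[rows, answer_idx] - p_irr[rows, answer_idx]
    contrast = np.maximum(0.0, margin - gap)
    return float((ce + contrast).mean())
```

A model that answers correctly only when attending to relevant regions incurs a small loss; a model whose two branches produce similar answer distributions (i.e., one driven by language priors alone) is penalized by the contrast term.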
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 81860318 and 81560296.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Yan, H., Liu, L., Feng, X. et al. Overcoming language priors with self-contrastive learning for visual question answering. Multimed Tools Appl 82, 16343–16358 (2023). https://doi.org/10.1007/s11042-022-14167-2