Abstract
Although remarkable success has been achieved on the Visual Question Answering (VQA) task in recent years, most existing models are heavily driven by surface linguistic correlations in the training set and ignore image content. Several recent methods introduce auxiliary tasks (visual annotation, counterfactual samples, etc.) to overcome language priors and strengthen image dependence. However, the problem of inherent priors, i.e., whether a model answers by memorizing priors in the training data, remains unresolved. We therefore propose a novel self-contrastive learning method that addresses this problem without introducing auxiliary tasks: it contrasts the answers predicted from question-relevant regions with those predicted from question-irrelevant regions. Concretely, when the question attends to question-relevant regions versus question-irrelevant regions, different answer spaces are generated, and this contrast prevents the model from being driven by surface language priors. The question is thus forced to rely on relevant image regions to predict the correct answer. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our method. In particular, built on top of the LMH model, our method achieves state-of-the-art performance of 59.00% on the most commonly used benchmark, VQA-CP v2, without auxiliary tasks, an improvement of 6.51%.
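The contrast described in the abstract could be sketched roughly as follows. This is a minimal illustration under assumed choices, not the paper's implementation: the function names, the margin form of the contrast term, and the use of a single ground-truth answer index are all assumptions. The idea shown is only that the ground-truth answer should score higher under the question-relevant branch than under the question-irrelevant one, alongside the usual supervised loss.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_contrastive_loss(logits_rel, logits_irr, answer_idx, margin=1.0):
    """Hypothetical sketch of a self-contrastive VQA loss.

    logits_rel: (B, A) answer logits from question-relevant regions
    logits_irr: (B, A) answer logits from question-irrelevant regions
    answer_idx: (B,) ground-truth answer indices
    """
    p_rel = softmax(logits_rel)
    p_irr = softmax(logits_irr)
    rows = np.arange(len(answer_idx))
    # Standard cross-entropy supervision on the relevant branch.
    ce = -np.log(p_rel[rows, answer_idx] + 1e-12)
    # Contrast: the true answer should be more probable under the
    # relevant branch than under the irrelevant one, by a margin.
    gap = p_rel[rows, answer_idx] - p_irr[rows, answer_idx]
    contrast = np.maximum(0.0, margin - gap)
    return float((ce + contrast).mean())
```

A model that answers correctly only when attending to relevant regions incurs a small loss; a model whose two branches produce similar answer distributions (i.e., one driven by language priors alone) is penalized by the contrast term.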
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 81860318 and 81560296.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Yan, H., Liu, L., Feng, X. et al. Overcoming language priors with self-contrastive learning for visual question answering. Multimed Tools Appl 82, 16343–16358 (2023). https://doi.org/10.1007/s11042-022-14167-2