Abstract
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed for the development of AI models that can jointly reason over visual and natural language inputs. Despite a proliferation of VQA datasets, this goal is hindered by a set of common limitations. These include a reliance on relatively simplistic questions that are repetitive in both concepts and linguistic structure, little world knowledge needed outside of the paired image, and limited reasoning required to arrive at the correct answer. We introduce A-OKVQA, a crowdsourced dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer. In contrast to existing knowledge-based VQA datasets, the questions generally cannot be answered by simply querying a knowledge base, and instead require some form of commonsense reasoning about the scene depicted in the image. We demonstrate the potential of this new dataset through a detailed analysis of its contents and baseline performance measurements over a variety of state-of-the-art vision–language models.
Notes
- 1.
Specifically, multi-qa-MiniLM-L6-cos-v1 [14], chosen to avoid overlap with RoBERTa (illustrated in the first sketch after these notes).
- 2.
To make this comparison fair, we chose a random subset of our test set of the same size as the OK-VQA test set, so that the minimum is taken over the same number of possible choices in both cases.
- 3.
- 4.
We use the second-largest available GPT-3 model, Curie, as in [48].
- 5.
For ease of analysis, we record a binary correct/incorrect outcome: a model is counted as correct if its answer matches any answer in the direct answer set (see the second sketch after these notes).
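The following is a minimal sketch of how the model named in note 1 can be used to compute question similarity with the sentence-transformers library [14]; the example questions are invented for illustration, and this is not the authors' exact pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Load the sentence-embedding model referenced in note 1.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# Hypothetical question pair, used only to illustrate the similarity computation.
questions = [
    "What breed of dog is lying on the couch?",
    "What kind of dog is shown in the picture?",
]
embeddings = model.encode(questions, convert_to_tensor=True)

# Cosine similarity between the two question embeddings.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {similarity:.3f}")
```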
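The criterion in note 5 can be made concrete with a small helper. This is a sketch of the stated rule only (a prediction counts as correct if it matches any answer in the direct answer set); the normalization and function name are assumptions for illustration, not the official evaluation code.

```python
def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace; this normalization is an assumption
    # for illustration, not part of the stated rule.
    return " ".join(text.lower().strip().split())


def is_direct_answer_correct(prediction: str, direct_answers: list[str]) -> bool:
    # Binary criterion from note 5: correct if the prediction matches
    # any answer in the direct answer set.
    return _normalize(prediction) in {_normalize(a) for a in direct_answers}


# Usage with hypothetical values.
print(is_direct_answer_correct("Golden Retriever", ["golden retriever", "dog"]))  # True
```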
References
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005)
Brown, T.B., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Chang, Y., Narang, M.B., Suzuki, H., Cao, G., Gao, J., Bisk, Y.: WebQA: multihop and multimodal QA. arXiv (2021)
Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server. arXiv (2015)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., Xu, W.: Are you talking to a machine? Dataset and methods for multilingual image question answering. In: NeurIPS (2015)
García, N., Otani, M., Chu, C., Nakashima, Y.: KnowIT VQA: answering knowledge-based questions about videos. In: AAAI (2020)
Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual Turing test for computer vision systems. Proc. Natl. Acad. Sci. 112, 3618–3623 (2015)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hudson, D.A., Manning, C.D.: GQA: a new dataset for real-world visual reasoning and compositional question answering. In: CVPR (2019)
HuggingFace: https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
HuggingFace: https://huggingface.co/sentence-transformers/nli-bert-base
HuggingFace: https://huggingface.co/sentence-transformers/average_word_embeddings_glove.6B.300d
Hussain, Z., et al.: Automatic understanding of image and video advertisements. In: CVPR (2017)
Jain, A., Kothyari, M., Kumar, V., Jyothi, P., Ramakrishnan, G., Chakrabarti, S.: Select, substitute, search: a new benchmark for knowledge-augmented visual question answering. In: SIGIR (2021)
Jiang, Y., Natarajan, V., Chen, X., Rohrbach, M., Batra, D., Parikh, D.: Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv (2018)
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
Kamath, A., Clark, C., Gupta, T., Kolve, E., Hoiem, D., Kembhavi, A.: Webly supervised concept expansion for general purpose vision models. arXiv (2022)
Krishna, R., et al.: Visual Genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123, 32–73 (2017)
Li, Q., Fu, J., Yu, D., Mei, T., Luo, J.: Tell-and-answer: towards explainable visual question answering using attributes and captions. In: EMNLP (2018)
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, H., Singh, P.: ConceptNet: a practical commonsense reasoning tool-kit. BT Technol. J. 22, 211–226 (2004)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv (2019)
Lu, J., Batra, D., Parikh, D., Lee, S.: VilBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS (2019)
Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: multi-task vision and language representation learning. In: CVPR (2020)
Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: NeurIPS (2014)
Marino, K., Chen, X., Parikh, D., Gupta, A.K., Rohrbach, M.: KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In: CVPR (2021)
Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: OK-VQA: a visual question answering benchmark requiring external knowledge. In: CVPR (2019)
Mokady, R., Hertz, A., Bermano, A.H.: ClipCap: CLIP prefix for image captioning. arXiv (2021)
Park, D.H., et al.: Multimodal explanations: justifying decisions and pointing to the evidence. In: CVPR (2018)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014)
Post, M.: A call for clarity in reporting BLEU scores. In: Conference on Machine Translation (2018)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog (2019)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, 1–67 (2020)
Ren, M., Kiros, J., Zemel, R.S.: Exploring models and data for image question answering. In: NeurIPS (2015)
Shah, S., Mishra, A., Yadati, N., Talukdar, P.P.: KVQA: knowledge-aware visual question answering. In: AAAI (2019)
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL (2018)
Singh, A., et al.: MMF: a multimodal framework for vision and language research (2020). https://github.com/facebookresearch/mmf
Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)
Tan, H.H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: EMNLP (2019)
Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: understanding stories in movies through question-answering. In: CVPR (2016)
Wang, P., Wu, Q., Shen, C., Dick, A.R., van den Hengel, A.: Explicit knowledge-based reasoning for visual question answering. In: IJCAI (2017)
Wang, P., Wu, Q., Shen, C., van den Hengel, A., Dick, A.R.: FVQA: fact-based visual question answering. TPAMI 40, 2413–2427 (2017)
West, P., et al.: Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178 (2021)
Yang, Z., et al.: An empirical study of GPT-3 for few-shot knowledge-based VQA. arXiv (2021)
Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.B.: Neural-symbolic VQA: disentangling reasoning from vision and language understanding. In: NeurIPS (2018)
Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual madlibs: fill in the blank description generation and question answering. In: ICCV (2015)
Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. In: CVPR (2019)
Zhang, P., et al.: VinVL: revisiting visual representations in vision-language models. In: CVPR (2021)
Zhu, X., Anguelov, D., Ramanan, D.: Capturing long-tail distributions of object subcategories. In: CVPR (2014)
Zhu, Y., Groth, O., Bernstein, M.S., Fei-Fei, L.: Visual7W: grounded question answering in images. In: CVPR (2016)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R. (2022). A-OKVQA: A Benchmark for Visual Question Answering Using World Knowledge. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_9
DOI: https://doi.org/10.1007/978-3-031-20074-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8