
VQA: Visual Question Answering

www.visualqa.org


Abstract

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing \(\sim\)0.25M images, \(\sim\)0.76M questions, and \(\sim\)10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).


Notes

  1. https://github.com/tylin/coco-ui

  2. In order to be consistent with ‘human accuracies’ reported in Sect. 4, machine accuracies are averaged over all \(\binom{10}{9}\) sets of human annotators.

  3. http://visualqa.org/challenge.html

  4. http://www.visualqa.org/workshop.html

  5. http://visualqa.org/challenge.html

  6. Noun tags begin with NN, verb tags begin with VB, adjective tags begin with JJ, and prepositions are tagged as IN.

  7. Visualization created using http://worditout.com/.

References

  • Agrawal, H., Mathialagan, C.S., Goyal, Y., Chavali, N., Banik, P., Mohapatra, A., et al. (2015). Cloudcv: Large-scale distributed computer vision as a cloud service. In G. Hua & X.-S. Hua (Eds.), Mobile cloud visual media computing (pp. 265–290). Switzerland: Springer International Publishing.

  • Antol, S., Zitnick, C.L., Parikh, D. (2014). Zero-Shot learning via visual abstraction. In ECCV

  • Bigham, J.P., Jayant, C., Ji, H., Little, G., Miller, A., Miller, R.C., Miller, R., Tatarowicz, A., White, B., White, S., Yeh, T. (2010). VizWiz: Nearly real-time answers to visual questions. In User interface software and technology

  • Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: A collaboratively created graph database for structuring human knowledge. In International conference on management of data. doi:10.1145/1376616.1376746.

  • Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E. R., Mitchell, T. M. (2010). Toward an architecture for never-ending language learning. In AAAI

  • Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L. (2015). Microsoft COCO captions: Data collection and evaluation server. CoRR arXiv:1504.00325

  • Chen, X., Shrivastava, A., Gupta, A. (2013). NEIL: Extracting visual knowledge from web data. In ICCV

  • Chen, X., Zitnick, C.L. (2015). Mind’s eye: A recurrent visual representation for image caption generation. In CVPR

  • Coppersmith, G., Kelly, E. (2014). Dynamic wordclouds and vennclouds for exploratory data analysis. In ACL workshop on interactive language learning and visualization

  • Deng, J., Berg, A.C., Fei-Fei, L. (2011). Hierarchical semantic indexing for large scale image retrieval. In CVPR

  • Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In CVPR

  • Elliott, D., Keller, F. (2014). Comparing automatic evaluation measures for image description. In ACL

  • Fader, A., Zettlemoyer, L., Etzioni, O. (2013). Paraphrase-driven learning for open question answering. In ACL. http://www.aclweb.org/anthology/P13-1158

  • Fader, A., Zettlemoyer, L., Etzioni, O. (2014). Open Question answering over curated and extracted knowledge bases. In International conference on knowledge discovery and data mining

  • Fang, H., Gupta, S., Iandola, F.N., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J.C., Zitnick, C.L., Zweig, G. (2015). From captions to visual concepts and back. In CVPR

  • Farhadi, A., Hejrati, M., Sadeghi, A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D. (2010). Every picture tells a story: Generating sentences for images. In ECCV

  • Gao, H., Mao, J., Zhou, J., Huang, Z., Yuille, A. (2015). Are you talking to a machine? dataset and methods for multilingual image question answering. In NIPS

  • Geman, D., Geman, S., Hallonquist, N., Younes, L. (2014). A visual Turing test for computer vision systems. In PNAS

  • Gordon, J., Durme, B.V. (2013). Reporting bias and knowledge extraction. In Proceedings of the 3rd Workshop on Knowledge Extraction, at CIKM 2013

  • Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K. (2013). YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV. http://www.eecs.berkeley.edu/~sguada/pdfs/2013-ICCV-youtube2text-final.pdf

  • Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093

  • Karpathy, A., Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In CVPR . http://arxiv.org/abs/1412.2306

  • Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.L. (2014). ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP

  • Kiros, R., Salakhutdinov, R., Zemel, R.S. (2015). Unifying visual-semantic embeddings with multimodal neural language models. In TACL

  • Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R.S., Torralba, A., Urtasun, R., Fidler, S. (2015). Skip-thought vectors. arXiv:1506.06726

  • Kong, C., Lin, D., Bansal, M., Urtasun, R., & Fidler, S. (2014). What are you talking about? Text-to-image coreference. In CVPR

  • Krizhevsky, A., Sutskever, I., Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS

  • Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L. (2011). Baby talk: Understanding and generating simple image descriptions. In CVPR

  • Lenat, D. B., & Guha, R. V. (1989). Building large knowledge-based systems: Representation and inference in the Cyc project. Chicago: Addison-Wesley Longman Publishing Co., Inc.

  • Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. (2014). Microsoft COCO: Common objects in context. In ECCV

  • Lin, X., Parikh, D. (2015). Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR

  • Liu, H., & Singh, P. (2004). ConceptNet-A Practical Commonsense Reasoning Tool-Kit. BT Technology Journal, 22(4), 211–226. doi:10.1023/B:BTTJ.0000047600.45421.6d.

  • Malinowski, M., Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS

  • Malinowski, M., Rohrbach, M., Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In ICCV

  • Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.L. (2014). Explain images with multimodal recurrent neural networks. CoRR arXiv:1410.1090

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS

  • Mitchell, M., van Deemter, K., Reiter, E. (2013). Attributes in visual reference. In PRE-CogSci

  • Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X., Mensch, A., Berg, A., Berg, T.L., Daume III, H. (2012). Midge: Generating image descriptions from computer vision detections. In ACL

  • Mitchell, M., Van Deemter, K., Reiter, E. (2013). Generating expressions that refer to visible objects. In HLT-NAACL

  • Ramanathan, V., Joulin, A., Liang, P., Fei-Fei, L. (2014). Linking People with “Their” names using coreference resolution. In ECCV

  • Ren, M., Kiros, R., Zemel, R. (2015). Exploring models and data for image question answering. In NIPS

  • Richardson, M., Burges, C.J., Renshaw, E. (2013). MCTest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP

  • Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B. (2013). Translating video content to natural language descriptions. In ICCV

  • Sadeghi, F., Kumar Divvala, S.K., Farhadi, A. (2015). Viske: Visual knowledge extraction and question answering by visual verification of relation phrases. In CVPR

  • Simonyan, K., Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR arXiv:1409.1556

  • Toutanova, K., Klein, D., Manning, C.D., Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In ACL

  • Tu, K., Meng, M., Lee, M. W., Choe, T. E., & Zhu, S. C. (2014). Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2), 42–70. doi:10.1109/MMUL.2014.29.

  • Vedantam, R., Zitnick, C.L., Parikh, D.(2015). CIDEr: Consensus-based image description evaluation. In CVPR

  • Vedantam, R., Lin, X., Batra, T., Zitnick, C.L., Parikh, D. (2015). Learning common sense through visual abstraction. In ICCV

  • Vinyals, O., Toshev, A., Bengio, S., Erhan, D. (2015). Show and tell: A neural image caption generator. In CVPR. arXiv:1411.4555

  • Weston, J., Bordes, A., Chopra, S., Mikolov, T. (2015). Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR arXiv:1502.05698

  • Yu, L., Park, E., Berg, A.C., Berg, T.L. (2015). Visual madlibs: Fill-in-the-blank description generation and question answering. In ICCV

  • Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., Parikh, D. (2015). Yin and yang: Balancing and answering binary visual questions. CoRR arXiv:1511.05099

  • Zitnick, C.L., Parikh, D. (2013). Bringing semantics into focus using visual abstraction. In CVPR

  • Zitnick, C.L., Parikh, D., Vanderwende, L. (2013). Learning the visual interpretation of sentences. In ICCV

  • Zitnick, C. L., Vedantam, R., & Parikh, D. (2015). Adopting abstract images for semantic scene understanding. IEEE transactions on pattern analysis and machine intelligence, 38, 627–638.

Acknowledgments

We would like to acknowledge the countless hours of effort provided by the workers on Amazon Mechanical Turk. This work was supported in part by the Paul G. Allen Family Foundation via an award to D.P., ICTAS at Virginia Tech via awards to D.B. and D.P., Google Faculty Research Awards to D.P. and D.B., a National Science Foundation CAREER award to D.B., an Army Research Office YIP Award to D.B., and an Office of Naval Research grant to D.B.

Author information

Corresponding author

Correspondence to Aishwarya Agrawal.

Additional information

Communicated by Margaret Mitchell, John Platt, Kate Saenko.

Aishwarya Agrawal, Jiasen Lu and Stanislaw Antol have contributed equally to this study.

Appendices

Appendix Overview

Fig. 15 Proportions of spatial prepositions in the captions and questions & answers for real images (left) and abstract scenes (right)

Fig. 16 Venn-style word clouds (Coppersmith and Kelly 2014) for nouns with size indicating the normalized count for real images

In the appendix, we provide:

  1. Appendix 1: Additional analysis comparing captions and Q&A data

  2. Appendix 2: Qualitative visualizations for “What is” questions

  3. Appendix 3: Human accuracy on multiple-choice questions

  4. Appendix 4: Details on VQA baselines

  5. Appendix 5: “Age” and “Commonsense” of our model

  6. Appendix 6: Details on the abstract scene dataset

  7. Appendix 7: User interfaces used to collect the dataset

  8. Appendix 8: List of the top answers in the dataset

  9. Appendix 9: Additional examples from the VQA dataset

Appendix 1: Captions Versus Questions

Do questions and answers provide further information about the visual world beyond that captured by captions? One way to determine whether the information captured by questions & answers differs from that captured by captions is to measure differences in the word distributions of the two datasets. We cast this comparison in terms of nouns, verbs, and adjectives by extracting all words from the caption data (MS COCO captions for real images and captions collected by us for abstract scenes) using the Stanford part-of-speech (POS) tagger (Toutanova et al. 2003) (see Note 6). We normalize the word frequencies from captions, questions, and answers per image, and compare the captions with the questions and answers combined. Using a Kolmogorov-Smirnov test to determine whether the underlying distributions of the two datasets differ, we find a significant difference for all three parts of speech (p < .001) for both real images and abstract scenes. This helps motivate the VQA task as a way to learn information about visual scenes: although both captions and questions & answers provide information about the visual world, they do so from different perspectives, with different underlying biases (Gordon and Durme 2013), and are complementary to one another.
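
As an illustration only, the following minimal sketch shows how such a comparison can be run. It is not the authors' pipeline: it substitutes NLTK's POS tagger for the Stanford tagger, and captions and qa_text are hypothetical dictionaries mapping image ids to lists of strings (with the question and answer text already merged per image).

```python
# Minimal sketch of the caption vs. Q&A word-distribution comparison (not the authors' pipeline).
from collections import Counter

import nltk                       # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test


def per_image_normalized_counts(texts_by_image, tag_prefix):
    """Count words whose POS tag starts with tag_prefix (NN/VB/JJ; see Note 6),
    normalizing each word's contribution by the number of such words in its image."""
    freqs = Counter()
    for texts in texts_by_image.values():
        tokens = [tok for text in texts for tok in nltk.word_tokenize(text.lower())]
        words = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith(tag_prefix)]
        for w in words:
            freqs[w] += 1.0 / max(len(words), 1)  # per-image normalization
    return freqs


def compare_distributions(captions, qa_text, tag_prefix="NN"):
    """KS test on the normalized word-frequency distributions of captions vs. Q&A."""
    cap = per_image_normalized_counts(captions, tag_prefix)
    qa = per_image_normalized_counts(qa_text, tag_prefix)
    vocab = sorted(set(cap) | set(qa))
    return ks_2samp([cap[w] for w in vocab], [qa[w] for w in vocab])  # (statistic, p-value)
```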

Fig. 17 Venn-style word clouds (Coppersmith and Kelly 2014) for verbs with size indicating the normalized count for real images

Fig. 18 Venn-style word clouds (Coppersmith and Kelly 2014) for adjectives with size indicating the normalized count for real images

Fig. 19 Venn-style word clouds (Coppersmith and Kelly 2014) for nouns with size indicating the normalized count for abstract scenes

We illustrate the similarities and differences between the word distributions in captions versus questions & answers as Venn-style word clouds (Coppersmith and Kelly 2014), with size indicating the normalized count. These are shown in Fig. 16 (nouns), Fig. 17 (verbs), and Fig. 18 (adjectives) for real images, and in Fig. 19 (nouns), Fig. 20 (verbs), and Fig. 21 (adjectives) for abstract scenes (see Note 7). The left side shows the top words in questions & answers, the right side the top words in captions, and the center the words common to both, with size indicating the harmonic mean of the counts.
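
For concreteness, the small sketch below (function and variable names are ours) computes the center-region sizes from two normalized count dictionaries, such as those produced by the sketch in Appendix 1.

```python
def venn_center_sizes(cap_counts, qa_counts):
    """Size of each shared (center-region) word: the harmonic mean of its
    normalized counts in captions and in questions & answers."""
    shared = set(cap_counts) & set(qa_counts)
    return {w: 2.0 * cap_counts[w] * qa_counts[w] / (cap_counts[w] + qa_counts[w])
            for w in shared}
```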

Fig. 20 Venn-style word clouds (Coppersmith and Kelly 2014) for verbs with size indicating the normalized count for abstract scenes

Fig. 21 Venn-style word clouds (Coppersmith and Kelly 2014) for adjectives with size indicating the normalized count for abstract scenes

Qualitatively, we see that adjectives in captions capture some clearly visual properties discussed in previous work on vision to language (Mitchell et al. 2013), such as material and pattern, while the questions & answers have more adjectives that capture what is usual (e.g., “dominant”, “approximate”, “higher”) and other kinds of commonsense properties (e.g., “edible”, “possible”, “unsafe”, “acceptable”). Interestingly, we see that question & answer nouns capture information about “ethnicity” and “hairstyle”, while caption nouns capture information about pluralized visible objects (e.g., “cellphones”, “daughters”) and groups (e.g., “trio”, “some”), among other differences. “Man” and “people” are common in both captions and questions & answers.

One key piece of understanding the visual world is understanding spatial relationships, so we additionally extract spatial prepositions and plot their proportions in the captions versus the questions & answers in Fig. 15 (left) for real images and Fig. 15 (right) for abstract scenes. We see that questions & answers have a higher proportion of specific spatial relations (i.e., “in”, “on”) than captions, which have a higher proportion of general spatial relations (i.e., “with”, “near”).

Fig. 22 Distribution of questions starting with “What is” by their first five words for a random sample of 60K questions for real images (left) and all questions for abstract scenes (right). The ordering of the words starts towards the center and radiates outwards. The arc length is proportional to the number of questions containing the word. White areas are words with contributions too small to show

Fig. 23 Distribution of answers for questions starting with “What is” for a random sample of 60K questions for real images (top) and all questions for abstract scenes (bottom). Each column corresponds to questions ending in different words, such as “doing?”, “on?”, etc.

Appendix 2: “What is” Analysis

In Fig. 22, we show the distribution of questions starting with “What is” by their first five words for both real images and abstract scenes. Note the diversity of objects referenced in the questions, as well as the relations between objects, such as “holding” and “sitting on”. In Fig. 23, we show the distribution of answers for “What is” questions ending in different words. For instance, questions ending in “eating” have answers such as “pizza”, “watermelon” and “hot dog”. Notice the diversity in answers for some questions, such as those that end with “for?” or “picture?”. Other questions result in intuitive responses, such as “holding?” and the response “umbrella”.

Appendix 3: Multiple-Choice Human Accuracy

To compute human accuracy for multiple-choice questions, we collected three human answers per question on a random subset of 3,000 questions for both real images and abstract scenes. Table 6 shows the human accuracies for multiple-choice questions, as well as the inter-human agreement for the open-ended answer task. Compared to the open-ended task, the multiple-choice accuracies are more or less the same for “yes/no” questions and significantly better (\(\approx\)15% increase for real images and \(\approx\)11% increase for abstract scenes) for “other” questions. Since “other” questions may be ambiguous, the increase in accuracy with multiple choice is not surprising.
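
The accuracies reported here use the VQA accuracy metric from the main paper, min(# humans that provided that answer / 3, 1), averaged over all \(\binom{10}{9}\) nine-annotator subsets of the ten collected answers, as in Note 2. The following is a minimal sketch of that computation, not the evaluation server's code.

```python
from itertools import combinations


def vqa_accuracy(answer, human_answers):
    """Accuracy of `answer` for one question, given the 10 human answers:
    min(#matches / 3, 1), averaged over all 9-annotator subsets (cf. Note 2)."""
    scores = []
    for subset in combinations(human_answers, len(human_answers) - 1):
        matches = sum(a == answer for a in subset)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)


# Example: if 3 of the 10 humans answered "umbrella",
# vqa_accuracy("umbrella", ["umbrella"] * 3 + ["parasol"] * 7) == 0.9
```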

Appendix 4: Details on VQA Baselines

“per Q-type prior” baseline: We define question types based on the first few words of the questions in the real images training set and ensure that each question type has at least 30 questions in the training dataset. The most popular answer for each question type is also computed on the real images training set.
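
A minimal sketch of this baseline follows; the helper names and the fallback answer are ours, not part of the released code.

```python
from collections import Counter, defaultdict


def match_question_type(question, question_types):
    """Return the longest question-type prefix that matches, else a catch-all type."""
    q = question.lower()
    matches = [t for t in question_types if q.startswith(t)]
    return max(matches, key=len) if matches else "other"


def build_qtype_prior(train_questions, train_answers, question_types):
    """Map each question type to its most popular answer in the training set."""
    answers_by_type = defaultdict(Counter)
    for q, a in zip(train_questions, train_answers):
        answers_by_type[match_question_type(q, question_types)][a] += 1
    return {t: counts.most_common(1)[0][0] for t, counts in answers_by_type.items()}


# Prediction for a test question:
#   prior = build_qtype_prior(train_qs, train_as, question_types)
#   answer = prior.get(match_question_type(test_q, question_types), "yes")  # hypothetical fallback
```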

“nearest neighbor” baseline: For every question in the VQA test-standard set, we find its k nearest neighbor questions in the training set using cosine similarity in Skip-Thought (Kiros et al. 2015) feature space. We also experimented with bag-of-words and Word2Vec (Mikolov et al. 2013) feature spaces, but obtained the best performance with Skip-Thought. Among this set of k questions and their associated images, we find the image most similar to the query image using cosine similarity in fc7 feature space, using the fc7 features from the CaffeNet model in BVLC Caffe (Jia et al. 2014). The most common ground-truth answer of this most similar image and question pair is the predicted answer for the query image and question pair. We pick \(k = 4\) on the test-dev set.
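
A sketch of this baseline is given below, assuming the Skip-Thought question features and fc7 image features have already been extracted as NumPy arrays (feature extraction is omitted, and the function names are ours).

```python
import numpy as np


def cosine_sim(vec, mat):
    """Cosine similarity between a query vector and each row of a feature matrix."""
    vec = vec / np.linalg.norm(vec)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return mat @ vec


def nn_baseline_answer(q_feat, img_feat, train_q_feats, train_img_feats, train_answers, k=4):
    """Pick the k nearest training questions (Skip-Thought space), then the training image
    among them closest to the query image (fc7 space), and return the most common
    ground-truth answer of that question-image pair."""
    top_k = np.argsort(-cosine_sim(q_feat, train_q_feats))[:k]
    best = top_k[np.argmax(cosine_sim(img_feat, train_img_feats[top_k]))]
    answers = list(train_answers[best])            # the ten human answers for that pair
    return max(set(answers), key=answers.count)
```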

Appendix 5: “Age” and “Commonsense” of our model

We estimate the age and degree of commonsense of our best model (deeper LSTM Q + norm I, selected using VQA test-dev accuracies). To estimate the age, we compute a weighted average of the average age per question, weighted by the accuracy of the model’s predicted answer for that question, over the subset of questions in the VQA validation set for which we have age annotations (how old a human needs to be to answer the question correctly). To estimate the degree of commonsense, we compute the analogous weighted average of the average degree of commonsense per question over the subset of validation questions for which we have commonsense annotations (whether the question requires commonsense to answer it).
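
Both estimates reduce to the same accuracy-weighted average; a minimal sketch (names are ours):

```python
def accuracy_weighted_estimate(per_question_values, per_question_accuracies):
    """Weighted average of a per-question annotation (average age required, or average
    degree of commonsense), weighted by the accuracy of the model's predicted answer."""
    total_weight = sum(per_question_accuracies)
    if total_weight == 0:
        return 0.0
    return sum(v * a for v, a in zip(per_question_values, per_question_accuracies)) / total_weight
```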

Appendix 6: Abstract Scenes Dataset

In Fig. 24 (left), we show a subset of the objects that are present in the abstract scenes dataset. For more examples of the scenes generated, please see Fig. 29. The user interface used to create the scenes is shown in Fig. 24 (right). Subjects used a drag-and-drop interface to create the scenes. Each object could be flipped horizontally and scaled. The scale of the object determined the rendering order of the objects. Many objects have different attributes corresponding to different poses or types. Most animals have five different discrete poses. Humans have eight discrete expressions and their poses may be continuously adjusted using a “paperdoll” model (Antol et al. 2014).

Table 6 For each of the two datasets (real and abstract), the first two rows report the human accuracies for multiple-choice questions when subjects were shown both the image and the question

Appendix 7: User Interfaces

In Fig. 25, we show the AMT interface that we used to collect questions for images. Note that we tell the workers that the robot already knows the answer to the previously asked question(s), inspiring them to ask different kinds of questions, thereby increasing the diversity of our dataset.

Figure 26 shows the AMT interface used for collecting answers to the previously collected questions when subjects were shown the corresponding images. Figure 27 shows the interface that was used to collect answers to questions when subjects were not shown the corresponding image (i.e., to help in gathering incorrect, but plausible, answers for the multiple-choice task and to assess how accurately the questions can be answered using common sense knowledge alone).

Fig. 24 Left: A small subset of the objects present in the abstract scene dataset. Right: The AMT interface for collecting abstract scenes. The light green circles indicate where users can select to manipulate a person’s pose. Different objects may be added to the scene using the folders to the right

Fig. 25 Our AMT interface for collecting the third question for an image; subjects were shown the previously collected questions and asked to pose a question different from them

Fig. 26 The AMT interface used to collect answers to a question when subjects were shown the image while answering the question

Fig. 27 The AMT interface used to collect answers to a question when subjects were not shown the image and answered using only commonsense; these answers serve as the plausible, but incorrect, multiple-choice options

Fig. 28 Random examples of questions (black), (a subset of the) answers given when looking at the image (green), and answers given when not looking at the image (blue) for numerous representative examples of the real image dataset (Color figure online)

Fig. 29 Random examples of questions (black), (a subset of the) answers given when looking at the image (green), and answers given when not looking at the image (blue) for numerous representative examples of the abstract scene dataset (Color figure online)

Fig. 30 Random examples of multiple-choice questions for numerous representative examples of the real and abstract scene dataset

Appendix 8: Answer Distribution

The top 250 answers in our real images dataset, along with their counts and percentage counts, are given in the accompanying figure. The answers are color-coded by their part-of-speech (POS) tag: yes/no, noun, verb, adjective, adverb, and numeral.

Appendix 9: Additional Examples

To provide insight into the dataset, we provide additional examples. In Figs. 28, 29, and 30, we show a random selection of the VQA dataset for the MS COCO (Lin et al. 2014) images, abstract scenes, and multiple-choice questions, respectively.

Cite this article

Agrawal, A., Lu, J., Antol, S. et al. VQA: Visual Question Answering. Int J Comput Vis 123, 4–31 (2017). https://doi.org/10.1007/s11263-016-0966-6
