
Visual Question Answering on Image Sets

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)

Abstract

We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings. Given a natural language question and a set of images as input, the goal is to answer the question based on the content of the images. The questions can be about objects and relationships in one or more images, or about the entire scene depicted by the image set. To enable research on this new topic, we introduce two ISVQA datasets, covering indoor and outdoor scenes. They simulate the real-world scenarios of indoor image collections and multiple car-mounted cameras, respectively. The indoor-scene dataset contains 91,479 human-annotated questions for 48,138 image sets, and the outdoor-scene dataset has 49,617 questions for 12,746 image sets. We analyze the properties of the two datasets, including question-and-answer distributions, question types, dataset biases, and question-image dependencies. We also build new baseline models to investigate the research challenges raised by ISVQA.
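To make the task setup concrete, the following is a minimal sketch of one possible ISVQA baseline, not the authors' implementation: each image in the set is encoded independently, the per-image features are pooled, fused with a question encoding, and classified over a fixed answer vocabulary. All module names, dimensions, and the mean-pooling/element-wise-fusion choices are illustrative assumptions.

```python
# Hypothetical ISVQA baseline sketch (assumptions, not the paper's model):
# pre-extracted per-image features -> pool over the image set -> fuse with
# a GRU question encoding -> classify over a fixed answer vocabulary.
import torch
import torch.nn as nn


class ISVQABaseline(nn.Module):
    def __init__(self, img_dim=2048, q_dim=300, hidden=512, num_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)                 # project per-image features
        self.q_encoder = nn.GRU(q_dim, hidden, batch_first=True)   # encode the question
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers)
        )

    def forward(self, image_feats, question_emb):
        # image_feats: (batch, num_images, img_dim) pre-extracted CNN features
        # question_emb: (batch, seq_len, q_dim) word embeddings (e.g. GloVe)
        img = self.img_proj(image_feats).mean(dim=1)   # pool across the image set
        _, q = self.q_encoder(question_emb)            # final GRU hidden state
        fused = img * q.squeeze(0)                     # element-wise multimodal fusion
        return self.classifier(fused)                  # answer logits


# Usage example with random tensors: a batch of 2 samples,
# each with a set of 6 images and a 10-word question.
model = ISVQABaseline()
logits = model(torch.randn(2, 6, 2048), torch.randn(2, 10, 300))
print(logits.shape)  # torch.Size([2, 3000])
```

Pooling before fusion discards which image contains the answer evidence; models that attend over images or regions conditioned on the question are a natural alternative, and the baselines studied in the paper explore this space.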

Supplementary material

Supplementary material 1 (pdf 1512 KB)

Supplementary material 2 (mp4 187 KB)

Supplementary material 3 (mp4 195 KB)

Supplementary material 4 (mp4 114 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Maryland, College Park, USA
  2. Amazon Web Services (AWS), Beijing, China
