Introduction

Chapter in Visual Question Answering
Abstract

Visual question answering (VQA) is a challenging task that has received increasing attention from the computer vision, natural language processing, and broader artificial intelligence (AI) communities. Given an image and a question posed in natural language, a VQA system must reason over the visual elements of the image, often together with general knowledge, to infer the correct answer, which may be presented in different formats. In this chapter, we first explain the motivation behind the VQA task, i.e., why this new task is necessary and what benefits the AI field can derive from it. Subsequently, we categorize the VQA problem from different perspectives, including data type and task level. Finally, we present an overview and describe the structure of this book.
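The task definition above can be sketched in code. The following is a minimal, illustrative skeleton of the standard VQA pipeline: encode the image, encode the question, fuse the two representations, and score a fixed answer vocabulary. The feature extractors here are toy stand-ins (mean/std pooling and bag-of-words) for the CNN and language encoders a real system would use; the answer vocabulary, vocabulary mapping, and weight matrix are all hypothetical.

```python
# Illustrative VQA skeleton: image encoder + question encoder + joint
# classifier over a closed answer vocabulary. All components are toy
# stand-ins for real vision/language models.
import numpy as np

ANSWERS = ["yes", "no", "red", "two"]  # hypothetical closed answer set


def encode_image(pixels: np.ndarray) -> np.ndarray:
    # Stand-in for a CNN feature extractor: global mean/std pooling.
    return np.array([pixels.mean(), pixels.std()])


def encode_question(question: str, vocab: dict) -> np.ndarray:
    # Stand-in for an RNN/transformer encoder: bag-of-words counts.
    vec = np.zeros(len(vocab))
    for tok in question.lower().rstrip("?").split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def answer(pixels: np.ndarray, question: str, vocab: dict, W: np.ndarray) -> str:
    # Joint embedding: concatenate the two modalities, then apply a
    # linear classifier over the answer vocabulary.
    fused = np.concatenate([encode_image(pixels), encode_question(question, vocab)])
    scores = W @ fused
    return ANSWERS[int(np.argmax(scores))]
```

Real systems replace each stand-in with a learned model and train `W` end to end, but the overall shape (encode, fuse, classify) is the same.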



Author information

Correspondence to Qi Wu.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Wu, Q., Wang, P., Wang, X., He, X., Zhu, W. (2022). Introduction. In: Visual Question Answering. Advances in Computer Vision and Pattern Recognition. Springer, Singapore. https://doi.org/10.1007/978-981-19-0964-1_1

  • DOI: https://doi.org/10.1007/978-981-19-0964-1_1

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-0963-4

  • Online ISBN: 978-981-19-0964-1

  • eBook Packages: Computer Science, Computer Science (R0)
