Abstract
Visual question answering (VQA) is a challenging task that has received increasing attention from the computer vision, natural language processing, and broader AI communities. Given an image and a question in natural language, inferring the correct answer, which may be presented in different formats, requires reasoning over the visual elements of the image as well as general knowledge. In this chapter, we first explain the motivation for studying VQA, i.e., why this new task is necessary and what benefits the artificial intelligence (AI) field can derive from it. We then categorize the VQA problem from different perspectives, including data type and task level. Finally, we present an overview of the book and describe its structure.
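The task structure described above can be sketched as a simple input/output example. The sample data below is entirely hypothetical (the image filenames, questions, and answers are invented for illustration), and the crude `answer_type` helper is only meant to show that VQA answers come in different formats, as noted in the abstract:

```python
# A minimal sketch of VQA samples: each pairs an image with a natural-language
# question and a ground-truth answer, which may be a single word, a number,
# or a yes/no response. All data here is hypothetical.

vqa_samples = [
    {"image": "kitchen_001.jpg",
     "question": "What color is the kettle?",
     "answer": "red"},    # open-ended single-word answer
    {"image": "street_042.jpg",
     "question": "How many people are crossing the road?",
     "answer": "3"},      # counting answer
    {"image": "park_017.jpg",
     "question": "Is the dog on a leash?",
     "answer": "yes"},    # binary (yes/no) answer
]

def answer_type(sample):
    """Crude classification of the answer format, for illustration only."""
    a = sample["answer"]
    if a in ("yes", "no"):
        return "binary"
    if a.isdigit():
        return "number"
    return "open-ended"

for s in vqa_samples:
    print(f'{s["question"]} -> {s["answer"]} ({answer_type(s)})')
```

A real VQA system would replace the lookup of a stored `answer` with a model that reasons jointly over image features and the question text.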
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Wu, Q., Wang, P., Wang, X., He, X., Zhu, W. (2022). Introduction. In: Visual Question Answering. Advances in Computer Vision and Pattern Recognition. Springer, Singapore. https://doi.org/10.1007/978-981-19-0964-1_1
Print ISBN: 978-981-19-0963-4
Online ISBN: 978-981-19-0964-1