Abstract
Visual question answering (VQA) is a challenging task that has received increasing attention from the computer vision, natural language processing, and broader AI communities. Given an image and a question in natural language, inferring the correct answer, which may be presented in different formats, requires reasoning over the visual elements of the image as well as general knowledge. In this chapter, we first explain the motivation for studying VQA, i.e., why this new task is necessary and what benefits the artificial intelligence (AI) field can derive from it. We then categorize the VQA problem from different perspectives, including data type and task level. Finally, we present an overview of the book and describe its structure.
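The task structure described above can be sketched as a simple input/output example. The sample data below is entirely hypothetical (the image filenames, questions, and answers are invented for illustration), and the crude `answer_type` helper is only meant to show that VQA answers come in different formats, as noted in the abstract:

```python
# A minimal sketch of VQA samples: each pairs an image with a natural-language
# question and a ground-truth answer, which may be a single word, a number,
# or a yes/no response. All data here is hypothetical.

vqa_samples = [
    {"image": "kitchen_001.jpg",
     "question": "What color is the kettle?",
     "answer": "red"},    # open-ended single-word answer
    {"image": "street_042.jpg",
     "question": "How many people are crossing the road?",
     "answer": "3"},      # counting answer
    {"image": "park_017.jpg",
     "question": "Is the dog on a leash?",
     "answer": "yes"},    # binary (yes/no) answer
]

def answer_type(sample):
    """Crude classification of the answer format, for illustration only."""
    a = sample["answer"]
    if a in ("yes", "no"):
        return "binary"
    if a.isdigit():
        return "number"
    return "open-ended"

for s in vqa_samples:
    print(f'{s["question"]} -> {s["answer"]} ({answer_type(s)})')
```

A real VQA system would replace the lookup of a stored `answer` with a model that reasons jointly over image features and the question text.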
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Wu, Q., Wang, P., Wang, X., He, X., Zhu, W. (2022). Introduction. In: Visual Question Answering. Advances in Computer Vision and Pattern Recognition. Springer, Singapore. https://doi.org/10.1007/978-981-19-0964-1_1
Print ISBN: 978-981-19-0963-4
Online ISBN: 978-981-19-0964-1