Abstract
Counting questions form a subfield of Visual Question Answering (VQA) research. To evaluate VQA systems properly, a VQA dataset is needed in which all possible answers to all possible counting questions occur equally often. For this purpose, we developed a generator program that creates such a balanced dataset automatically, and we used it to analyze the general VQA network architecture and the VQAv2 dataset. The results show that the accuracy achieved on VQAv2 is mostly due to the structure of the questions and answers. On the generated dataset, by contrast, the VQA network does not exceed an accuracy of 12.12%, far below the 35.18% obtained in the evaluation on the VQAv2 dataset. We found two types of information in the image that a VQA network can exploit to achieve better results: a characteristic object colour and a fixed association of image positions with certain numbers. Our work is a starting point for further analysis of systematic errors in VQA, especially in the area of counting.
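The balancing idea can be illustrated with a minimal sketch. It is not the authors' generator: the object names, the question template, the count range and the function names are all hypothetical, and no images are rendered. It only shows that when every count occurs equally often, a baseline that ignores the image and always gives the most frequent answer cannot do better than chance.

```python
import random
from collections import Counter


def generate_balanced_counting_set(object_names, max_count, per_answer, seed=0):
    """Return (question, answer) pairs in which, for every object,
    each count from 0 to max_count occurs exactly per_answer times.

    Hypothetical stand-in for a dataset generator; images are omitted.
    """
    rng = random.Random(seed)
    samples = []
    for name in object_names:
        for count in range(max_count + 1):
            for _ in range(per_answer):
                samples.append((f"How many {name} are in the image?", count))
    rng.shuffle(samples)
    return samples


def prior_only_accuracy(samples):
    """Accuracy of a baseline that never looks at the image and
    always answers with the single most frequent count."""
    answer_counts = Counter(answer for _, answer in samples)
    _, top_frequency = answer_counts.most_common(1)[0]
    return top_frequency / len(samples)


# Assumed example values: two object types, counts 0..9, five samples each.
samples = generate_balanced_counting_set(["circles", "squares"], max_count=9, per_answer=5)
accuracy = prior_only_accuracy(samples)
print(f"{len(samples)} samples, prior-only accuracy = {accuracy:.2f}")
```

With ten equally frequent answers, the prior-only baseline reaches an accuracy of 1/10 in this sketch. On an unbalanced answer distribution the same baseline would score higher, which is the kind of shortcut a balanced dataset is meant to remove.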
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nuseir, A., Vannahme, M., Ebner, M. (2024). Evaluation of Systematic Errors in Visual Question Answering. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 919. Springer, Cham. https://doi.org/10.1007/978-3-031-53960-2_11
DOI: https://doi.org/10.1007/978-3-031-53960-2_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53959-6
Online ISBN: 978-3-031-53960-2
eBook Packages: Intelligent Technologies and Robotics (R0)