Abstract
AQuA (ASP-based Question Answering) is an Answer Set Programming (ASP) based visual question answering framework that truly “understands” an input picture and answers natural language questions about that picture. The knowledge contained in the picture is extracted using YOLO, a neural network-based object detection technique, and represented as an answer set program. Natural language processing is performed on the question to transform it into an ASP query. Semantic relations are extracted in the process for deeper understanding and to answer more complex questions. The resulting knowledge base, augmented with imported commonsense knowledge, can then be used for reasoning with an ASP system, allowing the framework to answer questions about the picture much as a human would. The framework achieves 93.7% accuracy on the CLEVR dataset, which exceeds human baseline performance. Significantly, AQuA translates a question into an ASP query without requiring any training. Our framework for visual question answering is quite general and closely simulates the way humans operate. In contrast to existing purely machine learning-based methods, our framework provides an explanation for the answer it computes, while maintaining high accuracy.
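To make the pipeline described in the abstract concrete, the following is a minimal sketch of its first step, assuming a hand-written list of detections in place of real YOLO output. The `Detection` type, the predicate names (`object`, `shape`, `color`, `size`), and the `count_matching` helper are hypothetical and chosen only for illustration; they are not taken from the AQuA implementation, and the counting is done in plain Python rather than by an ASP solver.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """One detected object; a stand-in for detector output (assumed fields)."""
    obj_id: int
    shape: str
    color: str
    size: str


def to_asp_facts(detections: List[Detection]) -> List[str]:
    """Render each detection as ASP-style facts (predicate names are illustrative)."""
    facts = []
    for d in detections:
        facts.append(f"object({d.obj_id}).")
        facts.append(f"shape({d.obj_id}, {d.shape}).")
        facts.append(f"color({d.obj_id}, {d.color}).")
        facts.append(f"size({d.obj_id}, {d.size}).")
    return facts


def count_matching(detections: List[Detection], **attrs: str) -> int:
    """Count objects whose attributes equal all given values.

    This mimics a 'how many ...' question in plain Python; a real system
    would pose the corresponding query to an ASP solver instead.
    """
    return sum(
        all(getattr(d, name) == value for name, value in attrs.items())
        for d in detections
    )


# Hypothetical CLEVR-like scene with three objects.
scene = [
    Detection(1, "cube", "red", "large"),
    Detection(2, "sphere", "red", "small"),
    Detection(3, "cube", "blue", "small"),
]

facts = to_asp_facts(scene)
print("\n".join(facts))
print(count_matching(scene, color="red"))
```

In this sketch, a question such as "How many red objects are there?" corresponds to `count_matching(scene, color="red")`. In the framework described above, that question would instead be translated into an ASP query and evaluated against the generated facts together with commonsense knowledge.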
Acknowledgement
We are indebted to Dhruva Pendharkar for his early work on natural language question answering. Thanks also to Sarat Varanasi for discussion and help. The authors gratefully acknowledge support from NSF grants IIS 1910131 and IIS 1718945.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Basu, K., Shakerin, F., Gupta, G. (2020). AQuA: ASP-Based Visual Question Answering. In: Komendantskaya, E., Liu, Y. (eds) Practical Aspects of Declarative Languages. PADL 2020. Lecture Notes in Computer Science, vol 12007. Springer, Cham. https://doi.org/10.1007/978-3-030-39197-3_4
DOI: https://doi.org/10.1007/978-3-030-39197-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39196-6
Online ISBN: 978-3-030-39197-3
eBook Packages: Computer Science (R0)