VQA-LOL: Visual Question Answering Under the Lens of Logic

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)


Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image, are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty in correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding.


Visual question answering Logical robustness 



Support from NSF Robust Intelligence Program (1816039 and 1750082), DARPA (W911NF2020006) and ONR (N00014-20-1-2332) is gratefully acknowledged.

Supplementary material

504479_1_En_23_MOESM1_ESM.pdf (5.2 mb)
Supplementary material 1 (pdf 5342 KB)


  1. 1.
    Aditya, S., Yang, Y., Baral, C.: Integrating knowledge and reasoning in image understanding. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, pp. 6252–6259. AAAI Press (2019).
  2. 2.
    Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)Google Scholar
  3. 3.
    Antol, S., et al.: VQA: visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433 (2015)Google Scholar
  4. 4.
    Asai, A., Hajishirzi, H.: Logic-guided data augmentation and regularization for consistent question answering. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5642–5650. Association for Computational Linguistics (2020).
  5. 5.
    Bhattacharya, N., Li, Q., Gurari, D.: Why does a visual question have different answers? In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4271–4280 (2019)Google Scholar
  6. 6.
    Bobrow, D.G.: Natural language input for a computer problem solving system (1964)Google Scholar
  7. 7.
    Boole, G.: An investigation of the laws of thought: on which are founded themathematical theories of logic and probabilities. Dover Publications (1854)Google Scholar
  8. 8.
    Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, pp. 2787–2795 (2013)Google Scholar
  9. 9.
    Bowman, S.R., Potts, C., Manning, C.D.: Recursive neural networks can learn logical semantics. arXiv preprint arXiv:1406.1827 (2014)
  10. 10.
    Carey, S.: Conceptual Change in Childhood. MIT Press, Cambridge (1985)Google Scholar
  11. 11.
    Cesana-Arlotti, N., Martín, A., Téglás, E., Vorobyova, L., Cetnarski, R., Bonatti, L.L.: Precursors of logical reasoning in preverbal human infants. Science 359(6381), 1263–1266 (2018).
  12. 12.
    Corcoran, J.: Completeness of an ancient logic. J. Symb. Logic 37(4), 696–702 (1972)MathSciNetCrossRefGoogle Scholar
  13. 13.
    Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  14. 14.
    Fang, Z., Gokhale, T., Banerjee, P., Baral, C., Yang, Y.: Video2commonsense: generating commonsense descriptions to enrich video captioning. arXiv preprint arXiv:2003.05162 (2020)
  15. 15.
    Fréchet, M.: Généralisation du théoreme des probabilités totales. Fundamenta Mathematicae 1(25), 379–387 (1935)CrossRefGoogle Scholar
  16. 16.
    Gopnik, A., Meltzoff, A.N., Kuhl, P.K.: The Scientist in the Crib: Minds, Brains, and How Children Learn. William Morrow & Co (1999)Google Scholar
  17. 17.
    Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913 (2017)Google Scholar
  18. 18.
    Hegel, G.W.F.: Hegel’s science of logic (1929)Google Scholar
  19. 19.
    Horn, L.R., Kato, Y.: Negation and Polarity: Syntactic and Semantic Perspectives. OUP, Oxford (2000)Google Scholar
  20. 20.
    Hudson, D.A., Manning, C.D.: GQA: a new dataset for compositional question answering over real-world images. arXiv preprint arXiv:1902.09506 (2019)
  21. 21.
    Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910 (2017)Google Scholar
  22. 22.
    Kassner, N., Schütze, H.: Negated lama: birds cannot fly. arXiv preprint arXiv:1911.03343 (2019)
  23. 23.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  24. 24.
    Lewis, M., Steedman, M.: Combined distributional and logical semantics. Trans. Assoc. Comput. Linguist. 1, 179–192 (2013).
  25. 25.
    Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). Scholar
  26. 26.
    Liu, Y., et al.: Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  27. 27.
    Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, pp. 13–23 (2019)Google Scholar
  28. 28.
    Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: Advances in Neural Information Processing Systems, pp. 1682–1690 (2014)Google Scholar
  29. 29.
    Mao, J., Gan, C., Kohli, P., Tenenbaum, J.B., Wu, J.: The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. In: International Conference on Learning Representations (2019).
  30. 30.
    Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 2, pp. 1003–1011. Association for Computational Linguistics (2009)Google Scholar
  31. 31.
    Morante, R., Sporleder, C.: Modality and negation: an introduction to the special issue. Comput. Linguist. 38(2), 223–260 (2012)MathSciNetCrossRefGoogle Scholar
  32. 32.
    Neelakantan, A., Roth, B., McCallum, A.: Compositional vector space models for knowledge base completion. arXiv preprint arXiv:1504.06662 (2015)
  33. 33.
    Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)Google Scholar
  34. 34.
    Piattelli-Palmarini, M.: Language and learning: the debate between jean piaget and noam chomsky (1980)Google Scholar
  35. 35.
    Raju, P.: The principle of four-cornered negation in Indian philosophy. Rev. Metaphys. 694–713 (1954)Google Scholar
  36. 36.
    Ramshaw, L., Marcus, M.: Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora (1995).
  37. 37.
    Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in Neural Information Processing Systems, pp. 2953–2961 (2015)Google Scholar
  38. 38.
    Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)Google Scholar
  39. 39.
    Riedel, S., Yao, L., McCallum, A., Marlin, B.M.: Relation extraction with matrix factorization and universal schemas. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 74–84. Association for Computational Linguistics, June 2013.
  40. 40.
    Rocktäschel, T., Bošnjak, M., Singh, S., Riedel, S.: Low-dimensional embeddings of logic. In: Proceedings of the ACL 2014 Workshop on Semantic Parsing, pp. 45–49 (2014)Google Scholar
  41. 41.
    Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with neural tensor networks for knowledge base completion. In: Advances in Neural Information Processing Systems, pp. 926–934 (2013)Google Scholar
  42. 42.
    Spinoza, B.D.: Ethics, translated by andrew boyle, introduction by ts gregory (1934)Google Scholar
  43. 43.
    Tan, H., Bansal, M.: Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019)
  44. 44.
    Wittgenstein, L.: Tractatus Logico-Philosophicus. Routledge (1921)Google Scholar
  45. 45.
    Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019Google Scholar
  46. 46.
    Zellers, R., Bisk, Y., Farhadi, A., Choi, Y.: From recognition to cognition: visual commonsense reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720–6731 (2019)Google Scholar
  47. 47.
    Zettlemoyer, L.S., Collins, M.: Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420 (2012)
  48. 48.
    Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J.J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: AAAI, pp. 13041–13049 (2020)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.Arizona State UniversityTempeUSA

Personalised recommendations