Learning Visual Question Answering by Bootstrapping Hard Attention

  • Mateusz Malinowski
  • Carl Doersch
  • Adam Santoro
  • Peter Battaglia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

Attention mechanisms in biological perception are thought to select subsets of perceptual information for more sophisticated processing which would be prohibitive to perform on all sensory inputs. In computer vision, however, there has been relatively little exploration of hard attention, where some information is selectively ignored, despite the success of soft attention, where information is re-weighted and aggregated but never filtered out. Here, we introduce a new approach for hard attention and find that it achieves very competitive performance on recently released visual question answering datasets, equalling and in some cases surpassing similar soft attention architectures while entirely ignoring some features. Even though the hard attention mechanism is thought to be non-differentiable, we find that the feature magnitudes correlate with semantic relevance and provide a useful signal for our mechanism’s attentional selection criterion. Because hard attention selects important features of the input information, it can also be more efficient than analogous soft attention mechanisms. This is especially important for recent approaches that use non-local pairwise operations, whereby computational and memory costs are quadratic in the size of the set of features.
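
The selection criterion described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch of magnitude-based hard attention feeding a quadratic pairwise stage; the function names, tensor shapes, and the choice of k here are illustrative assumptions, not the authors’ exact implementation.

```python
import torch

def select_features(features: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k feature vectors with the largest L2 norm.

    features: (N, D) spatial feature vectors, e.g. CNN grid cells already
    fused with a question embedding. Unlike soft attention, which only
    down-weights entries, the remaining N - k features are discarded.
    """
    norms = features.norm(p=2, dim=1)   # (N,) feature magnitudes
    _, idx = norms.topk(k)              # indices of the k largest norms
    return features[idx]                # hard selection: the rest is ignored

def pairwise_relations(features: torch.Tensor) -> torch.Tensor:
    """Toy non-local pairwise aggregation over all ordered feature pairs.

    Cost is quadratic in the number of features, which is why shrinking
    n from the full grid down to k selected cells saves compute and memory.
    """
    n, d = features.shape
    pairs = torch.cat(
        [features.unsqueeze(0).expand(n, n, d),
         features.unsqueeze(1).expand(n, n, d)],
        dim=-1)                         # (n, n, 2D): every ordered pair
    return pairs.sum(dim=(0, 1))        # placeholder aggregation

# Usage sketch: a 14x14 grid of 512-d features -> keep 16 before the
# pairwise stage, reducing it from 196^2 to 16^2 pair evaluations.
feats = torch.randn(196, 512)
out = pairwise_relations(select_features(feats, k=16))
```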

Keywords

Visual question answering · Visual Turing Test · Attention

Notes

Acknowledgments

We would like to thank Aishwarya Agrawal, Relja Arandjelovic, David G.T. Barrett, Joao Carreira, Timothy Lillicrap, Razvan Pascanu, David Raposo, and many others on the DeepMind team for critical feedback and discussions.

Supplementary material

Supplementary material 1: 474211_1_En_1_MOESM1_ESM.pdf (PDF, 4.2 MB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Mateusz Malinowski (1)
  • Carl Doersch (1)
  • Adam Santoro (1)
  • Peter Battaglia (1)
  1. DeepMind, London, UK
