AiR: Attention with Reasoning Capability

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)


While attention has become an increasingly popular component of deep neural networks, used both to interpret models and to boost their performance, little work has examined how attention progresses over the course of a task and whether that progression is reasonable. In this work, we propose an Attention with Reasoning capability (AiR) framework that uses attention to understand and improve the process leading to task outcomes. We first define an evaluation metric based on a sequence of atomic reasoning operations, enabling quantitative measurement of attention that takes the reasoning process into account. We then collect human eye-tracking and answer-correctness data, and analyze various machine and human attention mechanisms, examining their reasoning capability and how they affect task performance. Furthermore, we propose a supervision method that jointly and progressively optimizes attention, reasoning, and task performance, so that models learn to look at regions of interest by following a reasoning process. We demonstrate the effectiveness of the proposed framework in analyzing and modeling attention with better reasoning capability and task performance. The code and data are available at
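The abstract describes a metric that scores attention against a sequence of atomic reasoning operations, but does not spell out its form. As a loose illustration only (not the paper's actual AiR metric), one could measure, at each reasoning step, how much attention mass falls inside that step's region of interest, and average over the sequence; the function names and the ROI-mask representation below are assumptions for the sketch:

```python
import numpy as np

def attention_roi_score(attention, roi_mask):
    """Fraction of attention mass falling inside the step's region of interest.

    attention: non-negative 2D map over image locations.
    roi_mask:  binary 2D mask (1 inside the region relevant to this step).
    """
    attention = attention / attention.sum()  # normalize to a probability map
    return float((attention * roi_mask).sum())

def sequence_score(attention_maps, roi_masks):
    """Average alignment over an ordered sequence of reasoning steps."""
    return float(np.mean([attention_roi_score(a, m)
                          for a, m in zip(attention_maps, roi_masks)]))
```

For example, a uniform attention map evaluated against a mask covering half the image scores 0.5, while attention concentrated entirely inside the mask scores 1.0; the actual operation-aware metric proposed in the paper should be taken from its Section on evaluation, not from this sketch.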


Keywords: Attention · Reasoning · Eye-tracking dataset



This work is supported by NSF Grants 1908711 and 1849107.

Supplementary material

500725_1_En_6_MOESM1_ESM.pdf (1.1 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

University of Minnesota, Minneapolis, USA
