
Interpretable Visual Understanding with Cognitive Attention Network

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 12891)


While recognition-level image understanding has achieved remarkable advances, reliable visual scene understanding requires not only recognition-level but also cognition-level comprehension, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module that fuses information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at
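The image-text fusion described above can be illustrated with a generic cross-attention sketch: text tokens attend over image-region features so that each token is re-expressed as an image-grounded vector. This is a minimal NumPy illustration of the general technique, not the paper's exact CAN module; the feature dimensions, projection matrices, and single-head setup are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text_feats, image_feats, W_q, W_k, W_v):
    """Single-head cross-attention: text tokens (queries) attend over
    image regions (keys/values). A generic sketch of image-text fusion,
    not the paper's exact architecture."""
    Q = text_feats @ W_q                       # (T, d) token queries
    K = image_feats @ W_k                      # (R, d) region keys
    V = image_feats @ W_v                      # (R, d) region values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T, R) token-region alignment
    attn = softmax(scores, axis=-1)            # each token's weights over regions
    return attn @ V                            # (T, d) image-grounded token features

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(5, d))                 # 5 hypothetical text tokens
image = rng.normal(size=(3, d))                # 3 hypothetical image regions
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention_fuse(text, image, W_q, W_k, W_v)
print(fused.shape)  # -> (5, 8)
```

In a full model, the fused token features would be fed to a downstream inference module that scores each candidate response against the query and image; the sketch stops at the fusion step.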




Author information

Corresponding author

Correspondence to Wenbin Zhang.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Tang, X. et al. (2021). Interpretable Visual Understanding with Cognitive Attention Network. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. Lecture Notes in Computer Science, vol 12891. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86361-6

  • Online ISBN: 978-3-030-86362-3

  • eBook Packages: Computer Science (R0)