Abstract
While recognition-level image understanding has achieved remarkable advances, reliable visual scene understanding requires comprehension not only at the recognition level but also at the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module that fuses information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at https://github.com/tanjatang/CAN.
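The abstract does not detail the internals of the image-text fusion module, but such fusion is commonly implemented with cross-modal attention, where text tokens attend over image region features. The following is a minimal numpy sketch of that general idea; the function name `cross_modal_attention` and all shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats):
    """Illustrative image-text fusion: each text token attends over
    image region features via scaled dot-product attention, and the
    attended visual context is concatenated onto the token feature."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)    # (tokens, regions)
    weights = softmax(scores, axis=-1)                  # attention over regions
    attended = weights @ image_feats                    # (tokens, d)
    return np.concatenate([text_feats, attended], axis=-1)  # (tokens, 2d)

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 64))     # e.g. 5 query tokens
image = rng.standard_normal((36, 64))   # e.g. 36 detected region features
fused = cross_modal_attention(text, image)
print(fused.shape)  # (5, 128)
```

In practice, models of this kind typically use learned query/key/value projections and multiple heads; this sketch omits them to show only the core attention-based fusion step.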
© 2021 Springer Nature Switzerland AG
Cite this paper
Tang, X. et al. (2021). Interpretable Visual Understanding with Cognitive Attention Network. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science(), vol 12891. Springer, Cham. https://doi.org/10.1007/978-3-030-86362-3_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86361-6
Online ISBN: 978-3-030-86362-3