Multi-modal co-attention relation networks for visual question answering

  • Original article
  • Published in The Visual Computer (2023)

Abstract

Current mainstream visual question answering (VQA) models encode only object-level visual representations and ignore the relationships between visual objects. To address this problem, we propose a Multi-Modal Co-Attention Relation Network (MCARN) that combines co-attention with visual object relation reasoning. MCARN models visual representations at both the object level and the relation level, and stacking its visual relation reasoning module further improves accuracy on Number questions. Building on MCARN, we propose two further models, RGF-CA and Cos-Sin+CA, which combine co-attention with the relative geometry features of visual objects and achieve strong overall performance and higher accuracy on Other questions, respectively. Extensive experiments and ablation studies on the benchmark dataset VQA 2.0 demonstrate the effectiveness of our models and verify the synergy of co-attention and visual object relation reasoning in the VQA task.
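The abstract mentions relative geometry features of visual objects without defining them. As a rough, non-authoritative sketch, the snippet below computes the four-dimensional pairwise geometry descriptor commonly used in relation networks for object detection (log-ratios of box offsets and sizes) and expands it with a sine/cosine embedding; the function names, the embedding dimension, and the use of NumPy are illustrative assumptions, not the authors' implementation.

import numpy as np

def relative_geometry(boxes):
    # boxes: (N, 4) array of [x_min, y_min, x_max, y_max] from an object detector.
    # Returns (N, N, 4): [log(|dx|/w_i), log(|dy|/h_i), log(w_j/w_i), log(h_j/h_i)].
    eps = 1e-6
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0          # box centres, shape (N,)
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = np.maximum(boxes[:, 2] - boxes[:, 0], eps)  # box widths
    h = np.maximum(boxes[:, 3] - boxes[:, 1], eps)  # box heights
    dx = np.log(np.maximum(np.abs(cx[None, :] - cx[:, None]), eps) / w[:, None])
    dy = np.log(np.maximum(np.abs(cy[None, :] - cy[:, None]), eps) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def sinusoidal_embed(geom, dim=64, wavelength=1000.0):
    # Transformer-style sine/cosine encoding of each scalar geometry feature;
    # dim must be divisible by 8 (4 features x sin/cos x dim//8 frequencies).
    freqs = wavelength ** (np.arange(dim // 8) / (dim // 8))   # (dim//8,)
    angles = geom[..., None] / freqs                           # (N, N, 4, dim//8)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*geom.shape[:-1], -1)                   # (N, N, dim)

For example, with 36 detected regions the first function yields a (36, 36, 4) tensor and the second a (36, 36, 64) tensor, which a relation-reasoning or co-attention layer could consume as pairwise positional information.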

Data availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.

Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grant 61873160 and Grant 61672338, and the Natural Science Foundation of Shanghai under Grant 21ZR1426500. We thank all the reviewers for their constructive comments and helpful suggestions.

Author information

Contributions

Methodology, material preparation, data collection, and analysis were performed by ZG. ZG wrote the first draft of the manuscript, and DH and ZG commented on previous versions of the manuscript. DH supervised the work and reviewed and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zihan Guo.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guo, Z., Han, D. Multi-modal co-attention relation networks for visual question answering. Vis Comput 39, 5783–5795 (2023). https://doi.org/10.1007/s00371-022-02695-9
