Multi-modal co-attention relation networks for visual question answering

  • Original article
  • Published in The Visual Computer (2023)

Abstract

Current mainstream visual question answering (VQA) models encode only object-level visual representations and ignore the relationships between visual objects. To address this problem, we propose a Multi-Modal Co-Attention Relation Network (MCARN) that combines co-attention with visual object relation reasoning. MCARN models visual representations at both the object level and the relation level, and stacking its visual relation reasoning module further improves accuracy on Number questions. Building on MCARN, we propose two further models, RGF-CA and Cos-Sin+CA, which combine co-attention with the relative geometry features of visual objects and achieve strong overall performance and higher accuracy on Other questions, respectively. Extensive experiments and ablation studies on the benchmark dataset VQA 2.0 demonstrate the effectiveness of our models and verify the synergy of co-attention and visual object relation reasoning in the VQA task.
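The abstract mentions relative geometry features of visual objects without defining them. As a rough, non-authoritative sketch, the snippet below computes the four-dimensional pairwise geometry descriptor commonly used in relation networks for object detection (log-ratios of box offsets and sizes) and expands it with a sine/cosine embedding; the function names, the embedding dimension, and the use of NumPy are illustrative assumptions, not the authors' implementation.

import numpy as np

def relative_geometry(boxes):
    # boxes: (N, 4) array of [x_min, y_min, x_max, y_max] from an object detector.
    # Returns (N, N, 4): [log(|dx|/w_i), log(|dy|/h_i), log(w_j/w_i), log(h_j/h_i)].
    eps = 1e-6
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0          # box centres, shape (N,)
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    w = np.maximum(boxes[:, 2] - boxes[:, 0], eps)  # box widths
    h = np.maximum(boxes[:, 3] - boxes[:, 1], eps)  # box heights
    dx = np.log(np.maximum(np.abs(cx[None, :] - cx[:, None]), eps) / w[:, None])
    dy = np.log(np.maximum(np.abs(cy[None, :] - cy[:, None]), eps) / h[:, None])
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def sinusoidal_embed(geom, dim=64, wavelength=1000.0):
    # Transformer-style sine/cosine encoding of each scalar geometry feature;
    # dim must be divisible by 8 (4 features x sin/cos x dim//8 frequencies).
    freqs = wavelength ** (np.arange(dim // 8) / (dim // 8))   # (dim//8,)
    angles = geom[..., None] / freqs                           # (N, N, 4, dim//8)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*geom.shape[:-1], -1)                   # (N, N, dim)

For example, with 36 detected regions the first function yields a (36, 36, 4) tensor and the second a (36, 36, 64) tensor, which a relation-reasoning or co-attention layer could consume as pairwise positional information.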

Data availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.

Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grant 61873160 and Grant 61672338, and the Natural Science Foundation of Shanghai under Grant 21ZR1426500. We thank all the reviewers for their constructive comments and helpful suggestions.

Author information

Contributions

Methodology, material preparation, data collection, and analysis were performed by ZG. ZG wrote the first draft of the manuscript, and DH and ZG commented on previous versions of the manuscript. DH supervised the work and reviewed and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zihan Guo.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Guo, Z., Han, D. Multi-modal co-attention relation networks for visual question answering. Vis Comput 39, 5783–5795 (2023). https://doi.org/10.1007/s00371-022-02695-9
