
Question guided multimodal receptive field reasoning network for fact-based visual question answering

Published in: Multimedia Tools and Applications

Abstract

Fact-based visual question answering aims to answer questions about images with the help of external knowledge, but understanding the combined semantics of images and text is more challenging than understanding the semantics of text alone. Most existing methods do not take full advantage of the information contained in the knowledge graph, which makes it difficult to consider the information of all modalities comprehensively during reasoning. In this article, we propose a question-guided multi-modal receptive field reasoning network, which expands the receptive field of the image to the text and the knowledge graph. Specifically, our model generates a clue from the image and text at each inference step; these clues are then fed into the attention operator to perform explicit and implicit reasoning over the knowledge graph. We divide the reasoning process into three steps. First, we encode the image to identify objects that match the text and score them. Second, using the highest-scoring object as a clue, we pick out the relationships contained in the text. Finally, we use the relationships and objects obtained in the first two steps as clues to retrieve the matching triplets from the knowledge graph. This procedure yields a more precise answer to the question. Without using a pretrained model, our approach improves performance on the FVQA and ZSFVQA datasets by 1.74% and 0.41%, respectively. Our research provides a novel method for future multi-modal retrieval using knowledge graphs. Our code is available at https://github.com/ZuoZicheng/QuMuQA
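To make the three-step pipeline above concrete, the following is a minimal Python sketch of clue-guided retrieval. It is illustrative only: the function names, the token-overlap scoring, and the toy knowledge graph are assumptions made here for clarity, not the authors' implementation (the actual model is in the repository linked above).

```python
# A minimal, illustrative sketch of the three-step clue-guided reasoning
# described in the abstract. All names (score_objects, pick_relation,
# retrieve_triplets) and the token-overlap scoring are assumptions for
# illustration, not the authors' implementation.

from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str


def score_objects(detected_objects, question_tokens):
    # Step 1: score each detected object against the question. A real model
    # would compare learned visual and text embeddings; simple token overlap
    # stands in for that here.
    return {obj: sum(tok in obj.lower() for tok in question_tokens)
            for obj in detected_objects}


def pick_relation(question_tokens, known_relations):
    # Step 2: with the top-scoring object as a clue, pick out the relation
    # the question mentions (string matching stands in for attention).
    for rel in known_relations:
        if rel.lower() in question_tokens:
            return rel
    return None


def retrieve_triplets(kg, head, relation):
    # Step 3: use the object and relation clues from the first two steps to
    # retrieve the matching triplets from the knowledge graph.
    return [t for t in kg if t.head == head and t.relation == relation]


if __name__ == "__main__":
    # Toy stand-ins for detector output, a question, and a knowledge graph.
    objects = ["umbrella", "person", "street"]
    tokens = "what is the umbrella usedfor".split()
    kg = [Triplet("umbrella", "UsedFor", "keeping dry"),
          Triplet("street", "IsA", "road")]

    scores = score_objects(objects, tokens)
    top_object = max(scores, key=scores.get)              # step 1 clue
    relation = pick_relation(tokens, ["UsedFor", "IsA"])  # step 2 clue
    answers = retrieve_triplets(kg, top_object, relation) # step 3
    print([t.tail for t in answers])                      # ['keeping dry']
```

The point of the structure is that each step narrows the search space for the next: the scored object constrains which relation is plausible, and the two together index directly into the knowledge graph.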


Availability of Supporting Data

The datasets generated and/or analysed during the current study are available in the GitHub repository at https://github.com/ZuoZicheng/QuMuQA.


Funding

This research was supported by the National Social Science Foundation of China (19BYY076).

Author information

Authors and Affiliations

Authors

Contributions

Zicheng Zuo: conceptualization, methodology, validation, writing - original draft. Yanhan Sun, Zhenfang Zhu, Mei Wu and Hui Zhao: writing - review and editing, supervision.

Corresponding author

Correspondence to Zhenfang Zhu.

Ethics declarations

Ethical Approval

Not Applicable.

Competing Interests

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Question Guided Multimodal Receptive Field Reasoning Network for Fact-based Visual Question Answering”.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zuo, Z., Sun, Y., Zhu, Z. et al. Question guided multimodal receptive field reasoning network for fact-based visual question answering. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19387-2

