
Question guided multimodal receptive field reasoning network for fact-based visual question answering

Published in: Multimedia Tools and Applications

Abstract

Fact-based visual question answering aims to answer questions about images with the help of external knowledge, but understanding the combined semantics of images and text is more challenging than understanding the semantics of text alone. Most existing methods do not take full advantage of the information contained in the knowledge graph, which makes it difficult to consider the information of all modalities comprehensively during reasoning. In this article, we propose a question-guided multi-modal receptive field reasoning network, which expands the receptive field of the image to the text and the knowledge graph. Specifically, our model generates a clue from the image and text at each inference step; these clues are then fed into the attention operator to perform explicit and implicit reasoning over the knowledge graph. We divide the reasoning process into three steps. First, we encode the image to identify objects that match the text and score them. Second, using the highest-scoring object as a clue, we pick out the relationships contained in the text. Finally, we use the relationships and objects obtained in the first two steps as clues to retrieve the matching triplets from the knowledge graph. This procedure yields a more precise answer to the question. Without using a pretrained model, our approach improves performance on the FVQA and ZSFVQA datasets by 1.74% and 0.41%, respectively. Our research provides a novel method for future multi-modal retrieval using knowledge graphs. Our code is available at https://github.com/ZuoZicheng/QuMuQA
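To make the three-step pipeline above concrete, the following is a minimal Python sketch of clue-guided retrieval. It is illustrative only: the function names, the token-overlap scoring, and the toy knowledge graph are assumptions made here for clarity, not the authors' implementation (the actual model is in the repository linked above).

```python
# A minimal, illustrative sketch of the three-step clue-guided reasoning
# described in the abstract. All names (score_objects, pick_relation,
# retrieve_triplets) and the token-overlap scoring are assumptions for
# illustration, not the authors' implementation.

from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str


def score_objects(detected_objects, question_tokens):
    # Step 1: score each detected object against the question. A real model
    # would compare learned visual and text embeddings; simple token overlap
    # stands in for that here.
    return {obj: sum(tok in obj.lower() for tok in question_tokens)
            for obj in detected_objects}


def pick_relation(question_tokens, known_relations):
    # Step 2: with the top-scoring object as a clue, pick out the relation
    # the question mentions (string matching stands in for attention).
    for rel in known_relations:
        if rel.lower() in question_tokens:
            return rel
    return None


def retrieve_triplets(kg, head, relation):
    # Step 3: use the object and relation clues from the first two steps to
    # retrieve the matching triplets from the knowledge graph.
    return [t for t in kg if t.head == head and t.relation == relation]


if __name__ == "__main__":
    # Toy stand-ins for detector output, a question, and a knowledge graph.
    objects = ["umbrella", "person", "street"]
    tokens = "what is the umbrella usedfor".split()
    kg = [Triplet("umbrella", "UsedFor", "keeping dry"),
          Triplet("street", "IsA", "road")]

    scores = score_objects(objects, tokens)
    top_object = max(scores, key=scores.get)              # step 1 clue
    relation = pick_relation(tokens, ["UsedFor", "IsA"])  # step 2 clue
    answers = retrieve_triplets(kg, top_object, relation) # step 3
    print([t.tail for t in answers])                      # ['keeping dry']
```

The point of the structure is that each step narrows the search space for the next: the scored object constrains which relation is plausible, and the two together index directly into the knowledge graph.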


Availability of Supporting Data

The datasets generated and/or analysed during the current study are available in the GitHub repository at https://github.com/ZuoZicheng/QuMuQA.


Funding

This research was supported by the National Social Science Foundation of China (19BYY076).

Author information

Authors and Affiliations

Authors

Contributions

Zicheng Zuo: conceptualization, methodology, validation, writing - original draft. Yanhan Sun, Zhenfang Zhu, Mei Wu and Hui Zhao: writing - review and editing, supervision.

Corresponding author

Correspondence to Zhenfang Zhu.

Ethics declarations

Ethical Approval

Not Applicable.

Competing Interests

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Question Guided Multimodal Receptive Field Reasoning Network for Fact-based Visual Question Answering”.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zuo, Z., Sun, Y., Zhu, Z. et al. Question guided multimodal receptive field reasoning network for fact-based visual question answering. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19387-2

