Abstract
Knowledge graph-based visual question answering (KGVQA) aims to use the information in a knowledge graph to answer complex questions that cannot be answered from image features alone. However, incorporating a knowledge graph makes the facts harder for the model to understand and can introduce noise, complicating both fact comprehension and answer retrieval. Previous multimodal fusion approaches typically treat the features of each modality as equally important and explore cross-modal interactions only implicitly. We observe that when text features are fused with image features, the image features provide two kinds of information: common features and specific features. The common features significantly enhance the text features and make the answer classification model more robust, while the specific features complement the text features with a different viewpoint, improving answer classification together with the common features. Based on these two observations, we propose a common-specific feature cross-fusion attention mechanism (CS-CFAN). Unlike existing methods, CS-CFAN learns to extract and efficiently fuse features from complex multimodal data in order to answer complex questions that require external knowledge. On the F-VQA dataset, our model improves on the baseline by 1.97% using the same feature extraction method without a knowledge graph, and reaches 82.66% accuracy with a knowledge graph.
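The common/specific split described above can be illustrated with a minimal sketch. This is not the paper's actual CS-CFAN architecture (whose projection and attention details are not given here); it is an assumed toy version in which the "common" component of each image feature is its projection onto the paired text feature direction, the "specific" component is the orthogonal residual, and both are fused back into the text features with one scaled dot-product attention step. All function names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decompose(image_feats, text_feats):
    """Split each image feature into a component aligned with its paired
    text feature ("common") and an orthogonal residual ("specific")."""
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    coef = (image_feats * t).sum(-1, keepdims=True)  # scalar projection
    common = coef * t
    specific = image_feats - common                  # orthogonal residual
    return common, specific

def cross_fusion_attention(text_feats, image_feats):
    """Fuse text with the common and specific image components using a
    single attention step: text features act as queries, the stacked
    common/specific features act as keys and values."""
    common, specific = decompose(image_feats, text_feats)
    kv = np.concatenate([common, specific], axis=0)
    d = text_feats.shape[-1]
    attn = softmax(text_feats @ kv.T / np.sqrt(d), axis=-1)
    return text_feats + attn @ kv                    # residual fusion

text = rng.standard_normal((4, 8))    # 4 question tokens, dim 8
image = rng.standard_normal((4, 8))   # 4 region features, dim 8
fused = cross_fusion_attention(text, image)
print(fused.shape)  # (4, 8)
```

The decomposition guarantees `common + specific` reconstructs the original image features exactly, so no visual information is discarded; the two components are merely routed into the fusion step separately, mirroring the abstract's claim that they contribute in different ways.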
Acknowledgements
This work has been supported by the National Natural Science Foundation of China (62166042, U2003207), the Natural Science Foundation of Xinjiang, China (2021D01C076), and the Strengthening Plan of National Defense Science and Technology Foundation of China (2021-JCJQ-JJ-0059).
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, M., Tohti, T. & Hamdulla, A. A common-specific feature cross-fusion attention mechanism for KGVQA. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00536-7