
A common-specific feature cross-fusion attention mechanism for KGVQA

  • Regular Paper
  • Published:
International Journal of Data Science and Analytics

Abstract

Knowledge graph-based visual question answering aims to use the information in a knowledge graph to answer complex questions that are difficult to answer from image features alone. However, incorporating a knowledge graph makes the facts harder for the model to understand and can introduce noise, which complicates fact comprehension and answer retrieval. Previous multimodal fusion approaches typically treat the features of each modality as equally important and explore the interactions between modalities only implicitly. We observe that, when fused with text features, image features provide two types of information: common features and specific features. Common features substantially enhance the text features and make the answer classification model more robust, while specific features complement the text features with a different viewpoint and, together with the common features, further improve answer classification. Based on these two observations, we propose a common-specific feature cross-fusion attention mechanism (CS-CFAN). Unlike existing methods, CS-CFAN learns to extract and efficiently fuse features from complex multimodal data in order to answer questions that require external knowledge. On the F-VQA dataset, our model improves accuracy by 1.97% over the baseline with the same feature-extraction method and no knowledge graph, and reaches 82.66% accuracy when the knowledge graph is used.
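To make the idea concrete, the following is a minimal PyTorch sketch of a common-specific decomposition with cross-fusion attention, in the spirit of the abstract. The module name `CommonSpecificCrossFusion`, the projection-based split into common and specific components, the feature dimensions, and the attention and classifier choices are all illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: a plausible common-specific cross-fusion block, assuming pre-extracted
# region features for the image and token features for the question (plus retrieved facts).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonSpecificCrossFusion(nn.Module):
    """Split image features into a text-aligned 'common' part and an orthogonal
    'specific' part, then fuse both with the text features via cross-attention."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_answers: int = 500):
        super().__init__()
        # Cross-attention: text tokens query the common / specific image features.
        self.common_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.specific_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    @staticmethod
    def decompose(img: torch.Tensor, txt: torch.Tensor):
        """Project each region feature onto the pooled text direction to obtain the
        common component; the residual is the specific component."""
        t = F.normalize(txt.mean(dim=1, keepdim=True), dim=-1)          # (B, 1, D)
        common = (img * t).sum(dim=-1, keepdim=True) * t                # projection onto t
        specific = img - common                                         # orthogonal residual
        return common, specific

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor):
        # img_feats: (B, R, D) region features; txt_feats: (B, T, D) question/fact tokens.
        common, specific = self.decompose(img_feats, txt_feats)
        fused_c, _ = self.common_attn(txt_feats, common, common)        # common cues enhance text
        fused_s, _ = self.specific_attn(txt_feats, specific, specific)  # complementary viewpoint
        pooled = torch.cat([fused_c.mean(dim=1), fused_s.mean(dim=1)], dim=-1)
        return self.classifier(pooled)                                  # answer logits


if __name__ == "__main__":
    model = CommonSpecificCrossFusion()
    logits = model(torch.randn(2, 36, 512), torch.randn(2, 20, 512))
    print(logits.shape)  # torch.Size([2, 500])
```

The projection/residual split is one simple way to realize "common" versus "specific" information; the paper's actual mechanism may differ in how the two streams are extracted and fused.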



Acknowledgements

This work has been supported by the National Natural Science Foundation of China (62166042, U2003207), the Natural Science Foundation of Xinjiang, China (2021D01C076), and the Strengthening Plan of National Defense Science and Technology Foundation of China (2021-JCJQ-JJ-0059).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Turdi Tohti.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ma, M., Tohti, T. & Hamdulla, A. A common-specific feature cross-fusion attention mechanism for KGVQA. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00536-7


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s41060-024-00536-7
