Abstract
Knowledge graph-based visual question answering (KGVQA) aims to use the information in a knowledge graph to answer complex questions that cannot be answered from image features alone. However, incorporating a knowledge graph makes the facts harder for the model to understand and can introduce noise, complicating both fact comprehension and answer retrieval. Previous multimodal fusion approaches typically treat the features of each modality as equally important and explore cross-modal interactions only implicitly. We observe that when text features are fused with image features, the image features provide two kinds of information: common features and specific features. The common features significantly enhance the text features and make the answer classification model more robust, while the specific features complement the text features with a different viewpoint, improving answer classification together with the common features. Based on these two observations, we propose a common-specific feature cross-fusion attention mechanism (CS-CFAN). Unlike existing methods, CS-CFAN learns to extract and efficiently fuse features from complex multimodal data in order to answer complex questions that require external knowledge. On the F-VQA dataset, our model improves on the baseline by 1.97% using the same feature extraction method without a knowledge graph, and reaches 82.66% accuracy with a knowledge graph.
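The common/specific split described above can be illustrated with a minimal sketch. This is not the paper's actual CS-CFAN architecture (whose projection and attention details are not given here); it is an assumed toy version in which the "common" component of each image feature is its projection onto the paired text feature direction, the "specific" component is the orthogonal residual, and both are fused back into the text features with one scaled dot-product attention step. All function names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decompose(image_feats, text_feats):
    """Split each image feature into a component aligned with its paired
    text feature ("common") and an orthogonal residual ("specific")."""
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    coef = (image_feats * t).sum(-1, keepdims=True)  # scalar projection
    common = coef * t
    specific = image_feats - common                  # orthogonal residual
    return common, specific

def cross_fusion_attention(text_feats, image_feats):
    """Fuse text with the common and specific image components using a
    single attention step: text features act as queries, the stacked
    common/specific features act as keys and values."""
    common, specific = decompose(image_feats, text_feats)
    kv = np.concatenate([common, specific], axis=0)
    d = text_feats.shape[-1]
    attn = softmax(text_feats @ kv.T / np.sqrt(d), axis=-1)
    return text_feats + attn @ kv                    # residual fusion

text = rng.standard_normal((4, 8))    # 4 question tokens, dim 8
image = rng.standard_normal((4, 8))   # 4 region features, dim 8
fused = cross_fusion_attention(text, image)
print(fused.shape)  # (4, 8)
```

The decomposition guarantees `common + specific` reconstructs the original image features exactly, so no visual information is discarded; the two components are merely routed into the fusion step separately, mirroring the abstract's claim that they contribute in different ways.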
Acknowledgements
This work has been supported by the National Natural Science Foundation of China (62166042, U2003207), the Natural Science Foundation of Xinjiang, China (2021D01C076), and the Strengthening Plan of National Defense Science and Technology Foundation of China (2021-JCJQ-JJ-0059).
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, M., Tohti, T. & Hamdulla, A. A common-specific feature cross-fusion attention mechanism for KGVQA. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00536-7