Fine-Grained Unbalanced Interaction Network for Visual Question Answering

Liao, Xinxin; Wu, Mingyan; Chai, Heyan; Qi, Shuhan; Wang, Xuan; Liao, Qing

doi:10.1007/978-3-030-82153-1_8

ORCID: Xinxin Liao, orcid.org/0000-0002-5761-7832; Qing Liao, orcid.org/0000-0003-1012-5301

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12817)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Learning an effective interaction mechanism is important for Visual Question Answering (VQA), which requires an understanding of both the visual content of images and the textual content of questions. Existing approaches consider both inter-modal and intra-modal interactions but neglect the irrelevant information those interactions carry. In this paper, we propose a novel Fine-grained Unbalanced Interaction Network (FUIN) to adaptively capture the most useful information from interactions. It contains a parallel interaction module that models the two-way interactions and a fine-grained adaptive activation module that activates the interactions for each component according to its specific context. Experimental results on the benchmark VQA-v2 dataset demonstrate that FUIN achieves state-of-the-art VQA performance, with an overall accuracy of 71.14% on the test-std set.
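
The abstract gives no equations, so the following is only an illustrative sketch of the general idea, not the authors' FUIN architecture. It assumes plain scaled dot-product attention without learned projections for the intra-modal and inter-modal interactions, and a hypothetical per-component sigmoid gate (`w_intra`, `w_inter`) standing in for the adaptive activation. All function and variable names are assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Scaled dot-product attention (assumed form, no learned projections)."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, m) / scale for m in memory])
        attended.append([sum(w * m[j] for w, m in zip(weights, memory))
                         for j in range(len(memory[0]))])
    return attended


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_interaction(x, y, w_intra, w_inter):
    """Compute intra-modal (x to x) and inter-modal (x to y) interactions
    side by side, then scale each one with a gate computed per component.

    The two gates are independent sigmoids, so they need not sum to one.
    This is an assumed reading of "unbalanced", not the paper's definition.
    """
    intra = attend(x, x)
    inter = attend(x, y)
    fused, gates = [], []
    for xi, a, b in zip(x, intra, inter):
        g_intra = sigmoid(dot(w_intra, xi))  # hypothetical gate parameters
        g_inter = sigmoid(dot(w_inter, xi))
        fused.append([v + g_intra * av + g_inter * bv
                      for v, av, bv in zip(xi, a, b)])
        gates.append((g_intra, g_inter))
    return fused, gates


# Toy inputs: 3 question-word features and 5 image-region features, dim 4.
random.seed(0)
dim = 4
words = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
regions = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
w_intra = [random.uniform(-1, 1) for _ in range(dim)]
w_inter = [random.uniform(-1, 1) for _ in range(dim)]

# Two-way interaction: words attend to regions, and regions attend to words.
fused_words, word_gates = gated_interaction(words, regions, w_intra, w_inter)
fused_regions, region_gates = gated_interaction(regions, words, w_intra, w_inter)

for i, (g_intra, g_inter) in enumerate(word_gates):
    print(f"word {i}: intra gate {g_intra:.2f}, inter gate {g_inter:.2f}")
```

Each word or region receives its own pair of gate values, so a component whose context makes the cross-modal signal uninformative can suppress it without affecting the others. A trained model would learn the gate parameters; here they are random placeholders.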

Acknowledgement

This work is supported in part by the National Natural Science Foundation of China under grant No. U1711261 and the Guangdong Major Project of Basic and Applied Basic Research under grant No. 2019B030302002.

Author information

Authors and Affiliations

School of Computer Science and Technology, Harbin Institute of Technology Shenzhen, Shenzhen, China
Xinxin Liao, Mingyan Wu, Heyan Chai, Shuhan Qi, Xuan Wang & Qing Liao
Peng Cheng Laboratory, Shenzhen, China
Qing Liao

Corresponding author

Correspondence to Qing Liao.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Han Qiu
Ibaraki University, Hitachi, Japan
Cheng Zhang
University of Kentucky, Lexington, KY, USA
Zongming Fei
Texas A&M University – Commerce, Commerce, TX, USA
Meikang Qiu
Princeton University, Princeton, NJ, USA
Sun-Yuan Kung

About this paper

Cite this paper

Liao, X., Wu, M., Chai, H., Qi, S., Wang, X., Liao, Q. (2021). Fine-Grained Unbalanced Interaction Network for Visual Question Answering. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol 12817. Springer, Cham. https://doi.org/10.1007/978-3-030-82153-1_8

DOI: https://doi.org/10.1007/978-3-030-82153-1_8
Published: 07 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82152-4
Online ISBN: 978-3-030-82153-1
eBook Packages: Computer Science, Computer Science (R0)