
Semantic-aware multi-branch interaction network for deep multimodal learning

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Deep multimodal learning has attracted increasing attention in artificial intelligence because it bridges vision and language. Most existing works focus on a specific multimodal task, which limits their ability to generalize to other tasks. Furthermore, these works learn only coarse-grained interactions at the object level in images and the word level in text, while neglecting fine-grained interactions at the relation and attribute levels. In this paper, to alleviate these issues, we propose a Semantic-aware Multi-Branch Interaction (SeMBI) network for various multimodal learning tasks. SeMBI mainly consists of three modules: a Multi-Branch Visual Semantics (MBVS) module, a Multi-Branch Textual Semantics (MBTS) module and a Multi-Branch Cross-modal Alignment (MBCA) module. MBVS enhances the visual features and performs reasoning through three parallel branches: a latent relationship branch, an explicit relationship branch and an attribute branch. MBTS learns relation-level and attribute-level language context through a textual relationship branch and a textual attribute branch, respectively. The enhanced visual features are then passed into MBCA to learn fine-grained cross-modal correspondence under the guidance of the relation-level and attribute-level language context. We demonstrate the generalizability and effectiveness of the proposed SeMBI by applying it to three deep multimodal learning tasks: Visual Question Answering (VQA), Referring Expression Comprehension (REC) and Cross-Modal Retrieval (CMR). Extensive experiments conducted on five common benchmark datasets indicate superior performance compared with state-of-the-art methods.
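
The abstract describes the architecture only at a high level. The following is a minimal PyTorch-style sketch, assuming generic attention-based branches, of how the three modules could be wired together; all module internals, dimensions and the fusion step are illustrative assumptions rather than the paper's actual design.

```python
# Minimal sketch of the SeMBI structure described in the abstract (assumptions only).
import torch
import torch.nn as nn


class Branch(nn.Module):
    """Generic attention branch, standing in for the latent-relationship,
    explicit-relationship, attribute and textual branches."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context=None):
        # Self-attention when no context is given, cross-attention otherwise.
        kv = x if context is None else context
        out, _ = self.attn(x, kv, kv)
        return self.norm(x + out)


class SeMBISketch(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # MBVS: three parallel visual branches (latent relation, explicit relation, attribute).
        self.mbvs = nn.ModuleList([Branch(dim) for _ in range(3)])
        # MBTS: two textual branches (relation-level and attribute-level context).
        self.mbts = nn.ModuleList([Branch(dim) for _ in range(2)])
        # MBCA: aligns the enhanced visual features with the language context.
        self.mbca = Branch(dim)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (B, N_obj, dim) region features; text_feats: (B, N_tok, dim) word features.
        v = torch.cat([branch(visual_feats) for branch in self.mbvs], dim=-1)
        v = self.fuse(v)                                       # enhanced visual features
        t_rel, t_attr = (branch(text_feats) for branch in self.mbts)
        lang_context = torch.cat([t_rel, t_attr], dim=1)       # relation- and attribute-level context
        return self.mbca(v, context=lang_context)              # cross-modal alignment


if __name__ == "__main__":
    model = SeMBISketch()
    regions = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
    words = torch.randn(2, 14, 512)     # e.g. 14 word embeddings per sentence
    print(model(regions, words).shape)  # torch.Size([2, 36, 512])
```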


Data availability

The datasets analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. If there is no relationship between words \(t_i\) and \(t_j\), the relational embedding between them is padded with zeros (see the sketch below).
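
As a concrete illustration of this zero-padding convention, here is a hypothetical sketch that builds a pairwise relational-embedding table and leaves pairs without a relation as all-zero vectors; the function name, shapes and the `relations` mapping are assumptions for illustration, not part of the paper.

```python
# Hypothetical sketch: pairs of words without a relation keep an all-zero embedding.
import torch
import torch.nn as nn


def build_relation_table(num_tokens: int, relations: dict, rel_embed: nn.Embedding) -> torch.Tensor:
    """`relations` maps an (i, j) token-index pair to a relation-type id."""
    table = torch.zeros(num_tokens, num_tokens, rel_embed.embedding_dim)  # default: zero padding
    for (i, j), rel_id in relations.items():
        table[i, j] = rel_embed(torch.tensor(rel_id))
    return table


# Example: 4 tokens, two relations, 8-dimensional relation embeddings.
with torch.no_grad():
    emb = nn.Embedding(num_embeddings=5, embedding_dim=8)
    table = build_relation_table(4, {(0, 1): 2, (2, 3): 4}, emb)
print(table.shape)  # torch.Size([4, 4, 8])
```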


Acknowledgements

This paper was supported by the National Key R&D Program of China (2019YFC1521204).

Author information

Corresponding author

Correspondence to Jun Huang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pan, H., Huang, J. Semantic-aware multi-branch interaction network for deep multimodal learning. Neural Comput & Applic 35, 7529–7545 (2023). https://doi.org/10.1007/s00521-022-08048-w
