
MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition

Published in Multimedia Tools and Applications, 2024

Abstract

Multimodal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. Previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, the inconsistency between the feature representations of different modalities makes image-text interactions difficult to model. To address this issue, we propose multi-granularity visual contexts that align image features into the textual space for text-text interactions, so that the attention mechanism in pre-trained textual embeddings can be better utilized. Multi-granularity visual information helps establish more accurate and thorough connections between image pixels and linguistic semantics. Specifically, we first extract a global image caption and dense image captions as the coarse-grained and fine-grained visual contexts, respectively. We then treat images as signals of sparse semantic density for image-text interactions, and image captions as dense semantic signals for text-text interactions. To alleviate the bias caused by visual noise and inaccurate alignment, we further design a dynamic filter network that filters visual noise and dynamically allocates visual information for modality fusion. Finally, we propose a novel multi-granularity visual prompt-guided fusion network for more robust modality fusion. Extensive experiments on three MNER datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
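The full model is not reproduced on this page, but the abstract's central mechanism, a text-conditioned dynamic filter that suppresses visual noise before fusing image features into the token representations, can be sketched in a few lines. The PyTorch snippet below is a minimal illustration based only on the abstract's description; the module name, the sigmoid gate design, and the attention-based fusion step are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a "dynamic filter" fusion step, inferred from the
# abstract only. Names and architecture choices are illustrative assumptions.
import torch
import torch.nn as nn


class GatedVisualFilter(nn.Module):
    """Scales each visual region feature by a text-conditioned gate in [0, 1],
    so noisy or irrelevant regions contribute less to modality fusion."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )
        self.fuse = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   (batch, seq_len, d_model)  token embeddings from the text encoder
        # visual: (batch, regions, d_model)  image-region features from the vision encoder
        summary = text.mean(dim=1, keepdim=True).expand(-1, visual.size(1), -1)
        g = self.gate(torch.cat([visual, summary], dim=-1))  # per-region gate values
        filtered = g * visual                                # down-weight noisy regions
        fused, _ = self.fuse(query=text, key=filtered, value=filtered)
        return text + fused                                  # residual image-to-text fusion


# Toy usage: fuse 49 image-region features into a 32-token sentence representation.
tokens = torch.randn(2, 32, 768)
regions = torch.randn(2, 49, 768)
print(GatedVisualFilter(768)(tokens, regions).shape)  # torch.Size([2, 32, 768])
```

In the paper's setting, the token embeddings would come from a pre-trained textual encoder such as BERT, with the global and dense image captions concatenated to the input sentence so that the caption-to-text interactions happen entirely in the textual space.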


Availability of supporting data

Data will be made available upon request.


Acknowledgements

This work was supported by the Major Program of the National Natural Science Foundation of China (No. 61991410), the Natural Science Foundation of Shanghai (No. 23ZR1422800), and the Program of the Pujiang National Laboratory (No. P22KN00391).


Author information


Contributions

Wei Liu: Writing - Original Draft, Writing - Editing, Software, Data curation. Aiqun Ren: Writing - Original Draft, Writing - Editing, Software, Data curation. Chao Wang: Writing - Original Draft, Writing - Editing, Software, Data curation. Yan Peng: Writing - Review, Editing. Shaorong Xie: Writing - Review, Editing. Weimin Li: Writing - Review, Editing.

Corresponding author

Correspondence to Chao Wang.

Ethics declarations

Ethical Approval

Not applicable

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, W., Ren, A., Wang, C. et al. MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18472-w

