Abstract
Multimodal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. Previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, the inconsistency between the feature representations of different modalities complicates the modeling of image-text interaction. To address this issue, we propose multi-granularity visual contexts that align image features into the textual space for text-text interactions, so that the attention mechanism in pre-trained textual embeddings can be better utilized. Multi-granularity visual information helps establish more accurate and thorough connections between image pixels and linguistic semantics. Specifically, we first extract a global image caption and dense image captions as the coarse-grained and fine-grained visual contexts, respectively. We then treat images as signals of sparse semantic density for image-text interactions, and image captions as dense semantic signals for text-text interactions. To alleviate the bias caused by visual noise and inaccurate alignment, we further design a dynamic filter network that filters visual noise and dynamically allocates visual information for modality fusion. Meanwhile, we propose a novel multi-granularity visual prompt-guided fusion network for more robust modality fusion. Extensive experiments on three MNER datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
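The dynamic filtering idea described above can be pictured as a per-token gate that decides how much caption-derived visual context to admit before fusion. The following toy NumPy sketch is our illustrative assumption only (the shapes, names, sigmoid gate, and additive fusion are not the paper's exact architecture); it shows one minimal way coarse-grained (global caption) and fine-grained (dense caption) contexts could be combined and gated.

```python
import numpy as np

def dynamic_filter_fusion(text_feats, visual_feats, w_gate, b_gate):
    """Gate visual context per token: a sigmoid over the concatenated
    token/visual features yields a score in (0, 1) that scales how much
    visual information enters each token representation."""
    joint = np.concatenate([text_feats, visual_feats], axis=-1)  # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_gate + b_gate)))      # (T, 1)
    return text_feats + gate * visual_feats                      # (T, d)

rng = np.random.default_rng(0)
T, d = 5, 8                                    # toy sentence length, hidden size
text = rng.standard_normal((T, d))
coarse = rng.standard_normal((1, d))           # e.g. global-caption embedding
fine = rng.standard_normal((T, d))             # e.g. dense-caption embeddings
visual = np.broadcast_to(coarse, (T, d)) + fine
w, b = rng.standard_normal((2 * d, 1)), np.zeros(1)

fused = dynamic_filter_fusion(text, visual, w, b)
print(fused.shape)  # (5, 8)
```

Because the gate only scales the visual term, a token whose visual context is uninformative (or zeroed out by the filter) falls back to its original textual representation.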
Availability of supporting data
Data will be available upon request.
Acknowledgements
This work was supported by the Major Program of the National Natural Science Foundation of China (No.61991410), the Natural Science Foundation of Shanghai (No.23ZR1422800), and the Program of the Pujiang National Laboratory (No.P22KN00391).
Author information
Contributions
Wei Liu: Writing - Original Draft, Writing - Editing, Software, Data Curation. Aiqun Ren: Writing - Original Draft, Writing - Editing, Software, Data Curation. Chao Wang: Writing - Original Draft, Writing - Editing, Software, Data Curation. Yan Peng: Writing - Review & Editing. Shaorong Xie: Writing - Review & Editing. Weimin Li: Writing - Review & Editing.
Ethics declarations
Ethical Approval
Not applicable
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, W., Ren, A., Wang, C. et al. MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18472-w