Abstract
Multimodal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. Previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, the inconsistency between the feature representations of different modalities complicates the modeling of image-text interaction. To address this issue, we propose multi-granularity visual contexts that align image features into the textual space for text-text interactions, so that the attention mechanism in pre-trained textual embeddings can be better utilized. Multi-granularity visual information helps establish more accurate and thorough connections between image pixels and linguistic semantics. Specifically, we first extract a global image caption and dense image captions as the coarse-grained and fine-grained visual contexts, respectively. We then treat images as signals of sparse semantic density for image-text interactions, and image captions as dense semantic signals for text-text interactions. To alleviate the bias caused by visual noise and inaccurate alignment, we further design a dynamic filter network that filters visual noise and dynamically allocates visual information for modality fusion. Meanwhile, we propose a novel multi-granularity visual prompt-guided fusion network for more robust modality fusion. Extensive experiments on three MNER datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
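The dynamic filtering idea described above can be pictured as a per-token gate that decides how much caption-derived visual context to admit before fusion. The following toy NumPy sketch is our illustrative assumption only (the shapes, names, sigmoid gate, and additive fusion are not the paper's exact architecture); it shows one minimal way coarse-grained (global caption) and fine-grained (dense caption) contexts could be combined and gated.

```python
import numpy as np

def dynamic_filter_fusion(text_feats, visual_feats, w_gate, b_gate):
    """Gate visual context per token: a sigmoid over the concatenated
    token/visual features yields a score in (0, 1) that scales how much
    visual information enters each token representation."""
    joint = np.concatenate([text_feats, visual_feats], axis=-1)  # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_gate + b_gate)))      # (T, 1)
    return text_feats + gate * visual_feats                      # (T, d)

rng = np.random.default_rng(0)
T, d = 5, 8                                    # toy sentence length, hidden size
text = rng.standard_normal((T, d))
coarse = rng.standard_normal((1, d))           # e.g. global-caption embedding
fine = rng.standard_normal((T, d))             # e.g. dense-caption embeddings
visual = np.broadcast_to(coarse, (T, d)) + fine
w, b = rng.standard_normal((2 * d, 1)), np.zeros(1)

fused = dynamic_filter_fusion(text, visual, w, b)
print(fused.shape)  # (5, 8)
```

Because the gate only scales the visual term, a token whose visual context is uninformative (or zeroed out by the filter) falls back to its original textual representation.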
Availability of supporting data
Data will be available upon request.
Acknowledgements
This work was supported by the Major Program of the National Natural Science Foundation of China (No.61991410), the Natural Science Foundation of Shanghai (No.23ZR1422800), and the Program of the Pujiang National Laboratory (No.P22KN00391).
Author information
Contributions
Wei Liu: Writing - Original Draft, Writing - Editing, Software, Data Curation. Aiqun Ren: Writing - Original Draft, Writing - Editing, Software, Data Curation. Chao Wang: Writing - Original Draft, Writing - Editing, Software, Data Curation. Yan Peng: Writing - Review & Editing. Shaorong Xie: Writing - Review & Editing. Weimin Li: Writing - Review & Editing.
Ethics declarations
Ethical Approval
Not applicable
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, W., Ren, A., Wang, C. et al. MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18472-w