Abstract
This paper presents an enhanced approach for adapting a Language Model (LM) to a specific domain, with a focus on Named Entity Recognition (NER) and Named Entity Linking (NEL) tasks. Traditional NER/NEL methods require large amounts of labeled data, which is time- and resource-intensive to produce. Unsupervised and semi-supervised approaches overcome this limitation but yield lower quality. Our approach, called KeyWord Masking (KWM), fine-tunes an LM for the Masked Language Modeling (MLM) task in a special way. Our experiments demonstrate that KWM outperforms traditional methods at restoring domain-specific entities. This work is a preliminary step towards developing a more sophisticated NER/NEL system for domain-specific data.
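The abstract does not spell out how KWM differs from standard MLM training, so the following is a minimal, hypothetical sketch of the core idea: instead of masking 15% of tokens at random as in BERT-style pre-training, only tokens belonging to domain-specific keywords (entities) are masked, and the MLM loss is computed only on those positions. The tokenizer, vocabulary, and keyword list below are toy stand-ins, not the authors' actual implementation.

```python
# Illustrative sketch of KeyWord Masking (KWM) for MLM fine-tuning.
# Only domain-keyword tokens are replaced by [MASK]; all other
# positions are excluded from the loss via the ignore index -100.

MASK, IGNORE = "[MASK]", -100  # -100 is the usual ignore index for MLM loss

def keyword_mask(tokens, keywords, token_to_id):
    """Mask every token that is a domain keyword and build MLM labels."""
    input_tokens, labels = [], []
    for tok in tokens:
        if tok.lower() in keywords:
            input_tokens.append(MASK)                 # hide the entity
            labels.append(token_to_id[tok.lower()])   # model must restore it
        else:
            input_tokens.append(tok)
            labels.append(IGNORE)                     # no loss on other tokens
    return input_tokens, labels

# Toy example: "Xylella fastidiosa" stands in for a domain entity.
vocab = {w: i for i, w in enumerate(
    ["xylella", "fastidiosa", "infects", "olive", "trees"])}
keywords = {"xylella", "fastidiosa"}
tokens = "Xylella fastidiosa infects olive trees".split()
masked, labels = keyword_mask(tokens, keywords, vocab)
print(masked)  # ['[MASK]', '[MASK]', 'infects', 'olive', 'trees']
print(labels)  # [0, 1, -100, -100, -100]
```

In practice the masked inputs and labels would be fed to a subword tokenizer and a pre-trained LM (e.g. via the Hugging Face `transformers` library); the sketch above only shows how the masking decision is driven by a keyword list rather than by random sampling.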
References
BEYOND: Building epidemiological surveillance & prophylaxis with observations near & distant. https://www6.inrae.fr/beyond/. Accessed 06 Feb 2023
GeoNames. https://www.geonames.org/. Accessed 06 Feb 2023
PESV. https://plateforme-esv.fr. Accessed 06 Feb 2023
EPPO (2023). EPPO Global Database (available online). https://gd.eppo.int/. Accessed 06 Feb 2023
Ayoola, T., Fisher, J., Pierleoni, A.: Improving entity disambiguation by reasoning over a knowledge base. arXiv preprint arXiv:2207.04106 (2022)
Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., Auli, M.: Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785 (2019)
Bossy, R., Deléger, L., Chaix, E., Ba, M., Nédellec, C.: Bacteria Biotope 2019 (2022). https://doi.org/10.57745/PCQFC2
Chen, X., et al.: One model for all domains: collaborative domain-prefix tuning for cross-domain NER. arXiv preprint arXiv:2301.10410 (2023)
Derczynski, L., Llorens, H., Saquete, E.: Massively increasing TIMEX3 resources: a transduction approach. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, pp. 3754–3761. European Language Resources Association (ELRA) (2012). http://www.lrec-conf.org/proceedings/lrec2012/pdf/451_Paper.pdf
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota (Volume 1: Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423
Ferré, A., Deléger, L., Bossy, R., Zweigenbaum, P., Nédellec, C.: C-norm: a neural approach to few-shot entity normalization. BMC Bioinform. 21(23), 1–19 (2020)
Fritzler, A., Logacheva, V., Kretov, M.: Few-shot classification in named entity recognition task. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pp. 993–1000 (2019)
Gligic, L., Kormilitzin, A., Goldberg, P., Nevado-Holgado, A.: Named entity recognition in electronic health records using transfer learning bootstrapped neural networks. Neural Netw. 121, 132–139 (2020)
Gritta, M., Pilehvar, M.T., Collier, N.: Which Melbourne? Augmenting geocoding with maps. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1285–1296 (2018)
Imambi, S., Prakash, K.B., Kanagachidambaresan, G.: PyTorch. In: Programming with TensorFlow: Solution for Edge Computing Applications, pp. 87–104 (2021)
Iovine, A., Fang, A., Fetahu, B., Rokhlenko, O., Malmasi, S.: CycleNER: an unsupervised training approach for named entity recognition. In: Proceedings of the ACM Web Conference 2022, pp. 2916–2924 (2022)
Jia, C., Liang, X., Zhang, Y.: Cross-domain NER using cross-domain language modeling. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2464–2474 (2019)
Jiang, S., Cormier, S., Angarita, R., Rousseaux, F.: Improving text mining in plant health domain with GAN and/or pre-trained language model. Front. Artif. Intell. 6 (2023)
Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
Li, J., Chiu, B., Feng, S., Wang, H.: Few-shot named entity recognition via meta-learning. IEEE Trans. Knowl. Data Eng. 34(9), 4245–4256 (2022). https://doi.org/10.1109/TKDE.2020.3038670
Li, X., et al.: Effective few-shot named entity linking by meta-learning. In: 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 178–191. IEEE (2022)
Liu, Z., Jiang, F., Hu, Y., Shi, C., Fung, P.: NER-BERT: a pre-trained model for low-resource entity tagging. arXiv preprint arXiv:2112.00405 (2021)
Liu, Z., et al.: CrossNER: evaluating cross-domain named entity recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 13452–13460 (2021)
Ming, H., Yang, J., Jiang, L., Pan, Y., An, N.: Few-shot nested named entity recognition. arXiv e-prints, pp. arXiv-2212 (2022)
Neumann, M., King, D., Beltagy, I., Ammar, W.: ScispaCy: fast and robust models for biomedical natural language processing. In: BioNLP 2019, p. 319 (2019)
Pergola, G., Kochkina, E., Gui, L., Liakata, M., He, Y.: Boosting low-resource biomedical QA via entity-aware masking strategies. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1977–1985 (2021)
Peters, M.E., Ammar, W., Bhagavatula, C., Power, R.: Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108 (2017)
Popovski, G., Kochev, S., Korousic-Seljak, B., Eftimov, T.: Foodie: a rule-based named-entity recognition method for food information extraction. ICPRAM 12, 915 (2019)
Raiman, J., Raiman, O.: Deeptype: multilingual entity linking by neural type system evolution. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
van Rossum, G.: Python programming language, version 3.8.15 (2022). https://www.python.org/downloads/release/python-3815/. Accessed 06 Feb 2023
Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050 (2003)
Schoch, C., et al.: NCBI taxonomy: a comprehensive update on curation, resources and tools. Database 2020 (2020). https://doi.org/10.1093/database/baaa062. https://www.ncbi.nlm.nih.gov/taxonomy. Accessed 06 Feb 2023
Ushio, A., Camacho-Collados, J.: T-NER: an all-round python library for transformer-based named entity recognition. arXiv preprint arXiv:2209.12616 (2022)
Wang, C., Sun, X., Yu, H., Zhang, W.: Entity disambiguation leveraging multi-perspective attention. IEEE Access 7, 113963–113974 (2019)
Wang, S., et al.: kNN-NER: named entity recognition with nearest neighbor search. arXiv preprint arXiv:2203.17103 (2022)
Wettig, A., Gao, T., Zhong, Z., Chen, D.: Should you mask 15% in masked language modeling? arXiv preprint arXiv:2202.08005 (2022)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online (2020). https://www.aclweb.org/anthology/2020.emnlp-demos.6
Xu, J., Gan, L., Cheng, M., Wu, Q.: Unsupervised medical entity recognition and linking in Chinese online medical text. J. Healthcare Eng. 2018 (2018)
Yamada, I., Asai, A., Shindo, H., Takeda, H., Matsumoto, Y.: LUKE: deep contextualized entity representations with entity-aware self-attention. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6442–6454. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.523. https://aclanthology.org/2020.emnlp-main.523
Yamada, I., Washio, K., Shindo, H., Matsumoto, Y.: Global entity disambiguation with BERT. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3264–3271 (2022)
Zhang, S., Elhadad, N.: Unsupervised biomedical named entity recognition: experiments with clinical and biological texts. J. Biomed. Inform. 46(6), 1088–1098 (2013)
Acknowledgements
The authors would like to express their sincere gratitude to the BEYOND project [1] (ANR-20-PCPA-0002) for providing the funding that made this research possible.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Borovikova, M., Ferré, A., Bossy, R., Roche, M., Nédellec, C. (2023). Could KeyWord Masking Strategy Improve Language Model?. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35319-2
Online ISBN: 978-3-031-35320-8