Abstract
This paper describes our systems for Sub-task I of the Software Mention Detection in Scholarly Publications shared task. We propose three approaches leveraging different pre-trained language models (BERT, SciBERT, and XLM-R) to tackle this challenge. Our best-performing system addresses the named entity recognition (NER) problem through a three-stage framework: (1) Entity Sentence Classification, which identifies sentences containing potential software mentions; (2) Entity Extraction, which detects mentions within the classified sentences; and (3) Entity Type Classification, which categorizes detected mentions into specific software types. Experiments on the official dataset demonstrate that our three-stage framework achieves competitive performance, surpassing both the other participating teams and our alternative approaches. Our framework based on the XLM-R model achieves a weighted F1-score of 67.80%, earning our team 3rd place in Sub-task I of the Software Mention Recognition task. We release our source code at https://github.com/thuynguyen2003/NER-Three-Stage-Framework-for-Software-Mention-Recognition.
1 Introduction
Named Entity Recognition (NER) is an important NLP task that involves identifying and classifying named entities in text. It transforms unstructured text into structured data, making it easier to categorize, search, and feed into other NLP tasks [5] such as text classification, sentiment analysis, and contextual analysis. NER is particularly challenging in the biomedical domain (Bio-NER), which involves a wide range of entities such as genes, proteins, medications, and diseases [9].
The SOMD 2024 shared task, hosted within the Natural Scientific Language Processing and Research Knowledge Graphs (NSLP 2024) workshop [8], is designed to extract mentioned software and its metadata from documents. In this context, both the software and the metadata are identified as specific spans in the original documents. Understanding and identifying the software mentioned in documents is especially important for supporting information extraction from scientific documents.
In this paper, we present three different approaches to address the challenge of Sub-task I:

- Approach 1: Fine-tuning pre-trained language models as a token classification problem.
- Approach 2: Two-stage framework for entity extraction and classification.
- Approach 3: Three-stage framework for entity sentence classification, entity extraction, and entity type classification.
2 Related Work
In recent years, pre-trained language models (PLMs) have driven significant advances in Named Entity Recognition (NER) [16]. Among these, the most popular is BERT [7] and its variants such as SciBERT [2] and RoBERTa [3], often combined with BiLSTM architectures [11] and machine learning techniques, particularly Conditional Random Fields (CRF) [10]. Additionally, some approaches decompose the NER task into two simpler sub-tasks using question-answering methods [1], achieving notable F1 results on various datasets such as BioNLP13CG, CTIReports, OntoNotes 5.0 [12], and WNUT17 [6].
With the emergence of ChatGPT, researchers have been exploring the use of Large Language Models (LLMs) for NER tasks [15, 17], with some studies demonstrating that ChatGPT can be distilled into smaller UniversalNER models for open NER [18]. These UniversalNER models have shown exceptional accuracy across 43 datasets spanning diverse fields such as biomedicine, programming, social media, law, and finance, without requiring direct supervision. UniversalNER surpasses instruction-tuned models like Alpaca and Vicuna by an average of over 30 F1 points and achieves a high F1 score of 0.8 on SoMeSci. In this paper, we nevertheless rely on the BERT, SciBERT, and XLM-R models to address the first sub-task of the SOMD 2024 shared task.
3 Approach
To address the Software Mention Recognition task, we leverage the power of different pre-trained transformer-based language models in different approaches. Figure 1 illustrates the three approaches we use to participate in the competition. Because the shared task operates at the token level, and capitalization strongly affects entity recognition, we do not apply any preprocessing techniques and instead use the data directly as provided by the organizers. Tokenization follows the default tokenizer of each model. In our work, we employ various pre-trained language models, including XLM-RoBERTa (XLM-R) [4], BERT [7], and SciBERT [2], as our main backbones. The details of our three approaches are presented as follows.
3.1 Approach 1: Token Classification with BERTs
For the first approach, we address the task by fine-tuning different BERT-based transformer models for token classification. We adapt different pre-trained language models to the training dataset. After tokenizing the input, we feed the token sequence to the backbone model and take the last-layer hidden states as the final representation of the input sentence. We then apply a fully connected layer to these vectors and predict a label for each input token using a softmax function. There are 27 labels in total (see Table 1): 26 labels covering the 13 entity types, plus one label for non-entity tokens. Figure 1 illustrates the overview of our first approach.
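The 27-label scheme used for token classification can be sketched as follows (a minimal illustration: the helper name and the placeholder type names are ours, not from the official task files; the actual 13 type names come from the SoMeSci label set):

```python
def build_label_maps(entity_types):
    """Build the BIO label set for token classification:
    one O label plus B-/I- labels for each entity type."""
    labels = ["O"]
    for etype in entity_types:
        labels.append(f"B-{etype}")
        labels.append(f"I-{etype}")
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label

# With the task's 13 entity types this yields 27 labels (2 x 13 + O).
example_types = [f"Type{i}" for i in range(13)]  # placeholder names
label2id, id2label = build_label_maps(example_types)
```

These mappings are what a token-classification head is configured with, so that each token position is scored over all 27 classes.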
3.2 Approach 2: Two-Stage Framework for Entity Extraction and Classification
Motivated by recent work by [1], we address Task 1 - Software Mention Recognition with a two-stage framework composed of entity extraction and entity classification components. However, our components are re-designed to improve overall performance relative to the original framework proposed by [1]. Figure 1 illustrates the overview of this approach; the details of each component are presented below:
- Stage 1 - Entity extraction: This stage aims to identify whether each token in a given input sentence belongs to an entity or not. We achieve this through token classification, similar to Approach 1. However, instead of using 27 type-specific labels, we use only 3 labels:

  - O: non-entity token
  - B: beginning token of an entity
  - I: token inside an entity

  Using separate labels for the beginning (B) and inside (I) positions of tokens within an entity allows us to efficiently group all words belonging to the same entity for stage 2.
- Stage 2 - Entity classification: In this stage, we classify the entities detected in stage 1, using a classifier with 13 labels corresponding to the 13 entity types and discarding the B-/I- positional prefixes. This classifier is built by fine-tuning a transformer-based model such as BERT [14]. During fine-tuning for classification tasks, it is common practice to use the hidden state associated with the [CLS] token as input to a classifier. In this approach, however, we fine-tune the entire transformer model end-to-end: the hidden states are not treated as fixed features but are trained alongside the classification head (a component added on top of the pre-trained model) for optimal performance. Additionally, to leverage the knowledge of transformer models, we format this classifier as a question-answering model by constructing the input as the following prompt:

  - Input: What is <entity> in the sentence: <input sentence>
  - Output: Type of entity
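The question-style input for the stage-2 classifier can be sketched as follows (the function name is ours; the prompt wording follows the template above):

```python
def build_entity_prompt(entity: str, sentence: str) -> str:
    """Format a detected entity and its sentence as the QA-style
    input for the stage-2 entity type classifier."""
    return f"What is {entity} in the sentence: {sentence}"

# Example: the classifier predicts one of the 13 entity types for this input.
prompt = build_entity_prompt("SPSS", "All statistics were computed with SPSS.")
```

Framing the classification as a question lets the model attend jointly to the candidate entity and its full sentence context.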
3.3 Approach 3: Three-Stage Framework
Our analysis in Table 2 revealed that only a limited number of sentences in the training set contain entities. This imbalance raised concerns about potential label bias during training for the previously mentioned approaches. To address this, we introduce a new three-stage framework, which integrates a binary classifier with Approach 2. We build a binary classification model to detect sentences that contain at least one entity. As shown in Fig. 1, if a sentence is classified as class 0, all of its tokens are assigned the label O; otherwise, the sentence is passed to Approach 2 to extract the entities and their types.
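The gating logic of the three-stage framework can be sketched as follows (the stage interfaces - a sentence classifier, a span extractor, and a type classifier - are assumptions standing in for the fine-tuned models):

```python
def three_stage_predict(tokens, is_entity_sentence, extract_spans, classify_type):
    """Sketch of three-stage inference: stage 1 filters sentences,
    stage 2 finds entity spans, stage 3 assigns entity types."""
    # Stage 1: sentences predicted to contain no entity get all-O labels.
    if not is_entity_sentence(tokens):
        return ["O"] * len(tokens)
    labels = ["O"] * len(tokens)
    # Stage 2: extract (start, end) token spans of candidate entities.
    for start, end in extract_spans(tokens):
        # Stage 3: classify the span and write BIO labels for it.
        etype = classify_type(tokens, (start, end))
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels

# Toy usage with stub models in place of the fine-tuned classifiers:
tokens = ["We", "used", "SPSS", "."]
out = three_stage_predict(
    tokens,
    is_entity_sentence=lambda t: "SPSS" in t,
    extract_spans=lambda t: [(2, 3)],
    classify_type=lambda t, span: "Application_Usage",
)
# out == ["O", "O", "B-Application_Usage", "O"]
```

The early exit in stage 1 is what shields the downstream models from the large majority of entity-free sentences.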
4 Experimental Setup
4.1 Data and Evaluation Metrics
This shared task uses the SoMeSci dataset [13], which includes 39,768 sentences and 3,756 software mentions, divided into a training set and a private test set. We train our systems only on the training set and evaluate the performance of our model on the private test set using weighted precision, recall, and F1-score. Table 2 summarizes general information about the two sets, where #Sentence denotes the number of sentences, #Sentence with entity denotes the number of sentences containing at least one entity, and Total entity is the total number of entities across all sentences. Max length and Avg length are the maximum and average sentence lengths in each set, respectively. The dataset contains entity groups including Application, OperatingSystem, PlugIn, ProgrammingEnvironment, and SoftwareCoreference, and each group can have entities of four types (Creation, Deposition, Mention, Usage). Table 3 shows the distribution of each entity in the dataset.
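As a reference for the weighted metrics used in this evaluation, here is a minimal sketch of the weighted F1 computation (our own illustration, equivalent in spirit to scikit-learn's `average="weighted"`): each class's F1 is weighted by its support in the gold labels.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each
    class's support (its count in y_true)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
        pred_n = sum(p == cls for p in y_pred)
        precision = tp / pred_n if pred_n else 0.0
        recall = tp / n
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1
    return score
```

Note that the official evaluation scores exact entity matches rather than raw label matches, but the weighting principle is the same.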
4.2 System Settings
We conduct all experiments on the three approaches using three base-version backbones: XLM-R, BERT, and SciBERT. We loaded the backbone weights from the HuggingFace library and carried out training on an NVIDIA T4 (x2) GPU provided by Kaggle. The corresponding hyper-parameters for each approach are presented below:
- Approach 1: batch size = 32, learning rate = 5e-5, and 25 epochs for the XLM-R model and 20 epochs for the two remaining backbones.
- Approach 2:
  - Stage 1: batch size = 32, learning rate = 5e-5, and 20 epochs for all three backbones.
  - Stage 2: batch size = 16, learning rate = 2e-5, and 25 epochs for the XLM-R model and 20 epochs for the two remaining models.
- Approach 3:
  - Stage 1: batch size = 32, learning rate = 2e-5, and 10 epochs for all three backbones.
  - Stage 2 and Stage 3: the same configuration and architecture as Approach 2.
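For reproducibility, the hyper-parameters above can be collected into a single configuration table (values transcribed from the list above; the dictionary layout and key names are ours):

```python
# Hyper-parameters per (approach, stage). "default" epochs apply to
# BERT and SciBERT; "xlm_r" epochs apply to the XLM-R backbone.
HPARAMS = {
    ("approach1", None):     {"batch_size": 32, "lr": 5e-5,
                              "epochs": {"xlm_r": 25, "default": 20}},
    ("approach2", "stage1"): {"batch_size": 32, "lr": 5e-5,
                              "epochs": {"xlm_r": 20, "default": 20}},
    ("approach2", "stage2"): {"batch_size": 16, "lr": 2e-5,
                              "epochs": {"xlm_r": 25, "default": 20}},
    ("approach3", "stage1"): {"batch_size": 32, "lr": 2e-5,
                              "epochs": {"xlm_r": 10, "default": 10}},
    # Approach 3 stages 2 and 3 reuse the Approach 2 configurations.
}
```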
5 Main Results
According to the organizing committee, this sub-task will be evaluated by F1-Score based on exact matches. As shown in Table 4, we provide a tabulated summary of 9 experiments, each representing one of the 9 final systems generated from three different approaches and using three distinct backbones.
The experimental results in Table 4 indicate that Approach 3, the three-stage system, achieves the best performance across all backbones, with the XLM-RoBERTa backbone exhibiting the highest efficacy among all approaches. However, this result is for reference only and holds only within our experiments: different contexts, setups, or datasets might yield different outcomes, and we cannot guarantee this is the best result each backbone could achieve in other settings. Finally, our best system was built following Approach 3 with the XLM-R backbone, and our best submission was ranked 3rd. Table 5 shows the final scores of the top 5 participants.
With the test dataset labels provided by the organizing committee, we evaluated the performance of our best system for each entity class in Table 6. We observed that the SoftwareCoreference_Deposition entity achieved the highest precision, while the ProgrammingEnvironment_Usage entity attained the highest recall and F1-score; the top F1-score classes are ProgrammingEnvironment_Usage, SoftwareCoreference_Deposition, and OperatingSystem_Mention. Entities in the PlugIn group typically scored lower than those in other groups, indicating that this group is difficult to recognize. Although the number of PlugIn_Usage entities in the training set is fairly large, the results on the test set are not positive. Moreover, PlugIn_Creation and PlugIn_Deposition entities have very few training samples, and their scores approach zero. Conversely, the number of OperatingSystem_Mention entities in the training set is low while the test score is high, so we conjecture that the Mention type in this group is more distinctive and easier to recognize than in other groups.
Additionally, in Table 7, we evaluated each individual stage of our final three-stage system, assuming the stages before it are 100% accurate. The first stage works well, with an F1-score of 0.992 in classifying whether a sentence contains an entity. Stage 2, tasked with detecting entities in sentences, achieved a relatively good F1-score, but shows a significant gap between precision and recall (12.6%), which also affects overall system performance. In the final stage, the three metrics are relatively balanced, but the task of classifying 13 entity classes appears to weigh on this stage, which shows relatively lower overall performance. Error propagation between the three stages has a significant impact on the entire system, whose final F1-score is 0.678.
6 Conclusion and Future Work
In this paper, we presented and evaluated three approaches for tackling Sub-task I of the Software Mention Detection in Scholarly Publications shared task. Among the transformer models we explored, such as BERT, our three-stage system leveraging the XLM-R model achieved the highest performance in the competition, placing our best system in the top 3 on the private test set. In future work, we intend to analyze the error propagation between the three stages to enhance the performance of the entire three-stage system. Additionally, with access to more substantial computational resources, we aim to experiment with larger batch sizes and more epochs for each backbone in order to investigate the effects of these hyper-parameters on model performance.
References
Arora, J., Park, Y.: Split-NER: named entity recognition via two question-answering-based classifications. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Toronto, Canada, pp. 416–426. Association for Computational Linguistics (2023). https://doi.org/10.18653/v1/2023.acl-short.36. https://aclanthology.org/2023.acl-short.36
Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3615–3620. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1371. https://aclanthology.org/D19-1371
Chen, T., et al.: RoBERT-Agr: an entity relationship extraction model of massive agricultural text based on the RoBERTa and CRF algorithm. In: 2023 IEEE 8th International Conference on Big Data Analytics (ICBDA), pp. 113–120. IEEE (2023)
Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.747. https://aclanthology.org/2020.acl-main.747
Dash, A., Darshana, S., Yadav, D.K., Gupta, V.: A clinical named entity recognition model using pretrained word embedding and deep neural networks. Decis. Anal. J. 10, 100426 (2024)
Derczynski, L., Nichols, E., van Erp, M., Limsopatham, N.: Results of the WNUT2017 shared task on novel and emerging entity recognition. In: Proceedings of the 3rd Workshop on Noisy User-Generated Text, Copenhagen, Denmark, pp. 140–147. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/W17-4418. https://www.aclweb.org/anthology/W17-4418
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423
Simperl, E., Peter Clark, K.K.: Natural scientific language processing and research knowledge graphs. In: Lecture Notes in Artificial Intelligence (2024)
Li, L., Zhou, R., Huang, D.: Two-phase biomedical named entity recognition using CRFs. Comput. Biol. Chem. 33(4), 334–338 (2009)
Lopez, P., Du, C., Cohoon, J., Ram, K., Howison, J.: Mining software entities in scientific literature: document-level NER for an extremely imbalance and large-scale task. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3986–3995 (2021)
Luo, L., et al.: An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34(8), 1381–1388 (2018)
Pradhan, S., et al.: Towards robust linguistic analysis using OntoNotes. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, pp. 143–152. Association for Computational Linguistics (2013). https://aclanthology.org/W13-3516
Schindler, D., Bensmann, F., Dietze, S., Krüger, F.: SoMeSci-A 5 star open data gold standard knowledge graph of software mentions in scientific articles. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 4574–4583 (2021)
Tunstall, L., Von Werra, L., Wolf, T.: Natural Language Processing with Transformers: Building Language Applications With Hugging Face. O’Reilly (2022)
Wang, S., et al.: GPT-NER: named entity recognition via large language models. arXiv preprint arXiv:2304.10428 (2023)
Zhang, H., et al.: Samsung research China-Beijing at SemEval-2023 Task 2: an AL-R model for multilingual complex named entity recognition. In: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pp. 114–120 (2023)
Zhang, Z., Zhao, Y., Gao, H., Hu, M.: LinkNER: linking local named entity recognition models to large language models using uncertainty. arXiv preprint arXiv:2402.10573 (2024)
Zhou, W., Zhang, S., Gu, Y., Chen, M., Poon, H.: UniversalNER: targeted distillation from large language models for open named entity recognition. In: The Twelfth International Conference on Learning Representations (2023)
Acknowledgements
This research was supported by The VNUHCM-University of Information Technology’s Scientific Research Support Fund. We also thank the anonymous reviewers for their valuable comments on our manuscript.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
Cite this paper
Nguyen Thi, T., Nguyen Viet, A., Dang Van, T., Luu-Thuy Nguyen, N. (2024). Software Mention Recognition with a Three-Stage Framework Based on BERTology Models at SOMD 2024. In: Rehm, G., Dietze, S., Schimmler, S., Krüger, F. (eds) Natural Scientific Language Processing and Research Knowledge Graphs. NSLP 2024. Lecture Notes in Computer Science(), vol 14770. Springer, Cham. https://doi.org/10.1007/978-3-031-65794-8_18
Print ISBN: 978-3-031-65793-1
Online ISBN: 978-3-031-65794-8