1 Introduction

Named Entity Recognition (NER) is an important task in NLP that involves identifying and classifying named entities in text. NER transforms unstructured text into structured data, making it easier to categorize, search, and perform other NLP tasks [5] such as text classification, sentiment analysis, and contextual analysis. This is particularly relevant in Biomedical Named Entity Recognition (Bio-NER), which must handle a wide range of entities such as genes, proteins, medications, and diseases [9].

The SOMD 2024 shared task, hosted within the Natural Scientific Language Processing and Research Knowledge Graphs (NSLP 2024) workshop [8], is designed to extract software mentions and their metadata from documents. In this context, both the software and the metadata are identified as specific spans in the original documents. Understanding and identifying the software mentioned in documents is especially important for supporting information extraction from scientific documents.

In this paper, we present three different approaches to address the challenge of sub-task I, including:

  • Approach 1: Fine-tuning pre-trained language models as a token classification problem.

  • Approach 2: Two-stage framework for entity extraction and classification.

  • Approach 3: Three-stage framework for entity sentence classification, entity extraction, and entity type classification.

2 Related Work

In recent years, pre-trained language models (PLMs) have made significant advancements in Named Entity Recognition (NER) tasks [16]. Among these, the most popular model is BERT [7] and its variants such as SciBERT [2] and RoBERTa [3], often combined with BiLSTM architectures [11]. These models are frequently paired with machine learning techniques, particularly Conditional Random Fields (CRF) [10]. Additionally, some approaches break the NER task down into two simpler sub-tasks using question-answering methods [1], achieving notable results, measured by F1-score, on various datasets such as BioNLP13CG, CTIReports, OntoNotes5.0 [12], and WNUT17 [6].

With the emergence of ChatGPT, researchers have been exploring the use of Large Language Models (LLMs) for NER tasks [15, 17], with some studies demonstrating that ChatGPT can be distilled into smaller UniversalNER models for open NER [18]. These UniversalNER models have shown exceptional accuracy across 43 datasets spanning diverse fields such as biomedicine, programming, social media, law, and finance, without requiring direct supervision. UniversalNER surpasses instruction-tuned models like Alpaca and Vicuna by an average of over 30 F1 points and achieves a high F1 score of 0.8 on SoMeSci. In this paper, we still rely on BERT, SciBERT, and XLM-R models to address the first task of the SOMD 2024 shared task.

3 Approach

To address the Software Mention Recognition task, we utilize the power of different pre-trained transformer-based language models in different approaches. Figure 1 illustrates the three approaches we use to participate in the competition. Because the shared task operates on individual tokens in a sentence, and capitalization strongly affects entity recognition, we do not apply any preprocessing techniques and use the data exactly as released by the organizers. Tokenization relies on the default tokenizer of each model. In our work, we employ various pre-trained language models, including XLM-RoBERTa (XLM-R) [4], BERT [7], and SciBERT [2], as our main backbones. The details of our three approaches are presented as follows.

Fig. 1. Overview of the three approaches. The sample input is "Celeste was written in C #", which contains two entities, E_1 and E_2, representing two entity types in this example.

3.1 Approach 1: Token Classification with BERTs

For the first approach, we address the task by fine-tuning different BERT-based transformer models for token classification. We adapt different pre-trained language models to the training dataset. After tokenizing the input, we feed the token sequence to the backbone model and take the last-layer hidden states as the final representation of the input. We then apply a fully connected layer to these vectors and predict a label for each input token with a softmax function. There are 27 labels in total (listed in Table 1), where 26 correspond to the 13 different entity types and one label represents non-entities. Figure 1 illustrates the overview of our first approach.

Table 1. List of labels for token classification task in Approach 1
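For illustration, the following sketch shows how such a 27-label token classifier could be set up with the HuggingFace transformers library; the entity-type names shown and the choice to label only the first sub-token of each word are our assumptions, not details fixed by the shared task.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative label set: "O" plus B-/I- tags for each of the 13 entity types.
entity_types = ["Application_Usage", "PlugIn_Usage"]  # ... 13 types in total (names illustrative)
labels = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

model_name = "xlm-roberta-base"  # or "bert-base-cased", "allenai/scibert_scivocab_cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels), id2label=id2label, label2id=label2id
)

def encode(words, word_labels):
    """Tokenize a pre-split sentence and align BIO labels with sub-word tokens."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():        # requires a fast tokenizer
        if wid is None:
            aligned.append(-100)      # special tokens are ignored by the loss
        elif wid != prev:
            aligned.append(label2id[word_labels[wid]])
        else:
            aligned.append(-100)      # only the first sub-token of a word carries its label
        prev = wid
    enc["labels"] = aligned
    return enc
```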

3.2 Approach 2: Two-Stage Framework for Entity Extraction and Classification

Motivated by recent work by [1], we address Task 1, Software Mention Recognition, with a two-stage framework composed of entity extraction and entity classification components. However, our components are re-designed to improve the overall performance over the original framework proposed by [1]. Figure 1 illustrates the overview of this approach; the details of each component are presented below:

  • Stage 1 - Entity extraction: This stage aims to identify whether each token in a given input sentence belongs to an entity or not. We achieve this through token classification, similar to Approach 1. However, instead of the 27 type-specific labels, we use only 3 labels:

    • O: Non-entity token

    • B: Beginning token of an entity

    • I: Token inside an entity

    Using separate labels for the beginning (B) and inside (I) positions of tokens within an entity allows us to efficiently extract all words belonging to the same entity in stage 2.

  • Stage 2 - Entity classification: In this stage, we classify the entities detected in stage 1. We use a classifier with 13 labels corresponding to the 13 entity types, discarding the B-/I- prefix used to mark token positions. This classifier is built by fine-tuning a transformer-based model such as BERT [14]. During fine-tuning for classification tasks, it is common practice to use the hidden state associated with the [CLS] token as input to a classifier. In this approach, however, we fine-tune the entire transformer model end-to-end: the hidden states are not treated as fixed features but are trained together with the classification head (a component added on top of the pre-trained model) for optimal performance. Additionally, to leverage the knowledge of transformer models, we format this classifier as a question-answering model by constructing the input as the following prompt (a minimal sketch of this construction is given after the list):

    • Input: What is <entity> in the sentence: <input sentence>

    • Output: Type of entity
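Under our reading of this two-stage design, the sketch below shows how stage-1 B/I tags might be grouped into entity strings and how the stage-2 prompt could be built and classified. All function and variable names are ours, the backbone shown is only one of the three options, and only the prompt wording is taken from the description above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def extract_spans(words, bio_tags):
    """Group the B/I token tags produced in stage 1 into entity strings."""
    spans, current = [], []
    for word, tag in zip(words, bio_tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Stage 2: a 13-way sequence classifier over a question-style prompt.
clf_name = "bert-base-cased"  # any of the three backbones
clf_tokenizer = AutoTokenizer.from_pretrained(clf_name)
classifier = AutoModelForSequenceClassification.from_pretrained(clf_name, num_labels=13)

def classify_entity(entity, sentence):
    """Build the question-style prompt and return the predicted entity-type index."""
    prompt = f"What is {entity} in the sentence: {sentence}"
    inputs = clf_tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return logits.argmax(dim=-1).item()
```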

3.3 Approach 3: Three-Stage Framework

Our analysis in Table 2 revealed that only a limited number of sentences in the training set contain entities. This imbalance raised concerns about potential label bias during training for the previously mentioned approaches. To address this, we introduce a new three-stage framework that places a binary classification stage in front of Approach 2. We build a binary classification model to detect sentences that contain at least one entity. As shown in Fig. 1, if a sentence is classified as class 0, all of its tokens are labeled O; otherwise, the sentence is passed to Approach 2 to extract the entities and their types. A sketch of this gating logic is given below.
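The following is a minimal sketch of the three-stage inference flow, assuming the fine-tuned models are available as simple callables; all names are ours and stand in for the components described in Sects. 3.2 and 3.3.

```python
def predict_sentence(words, sentence_filter, entity_tagger, type_classifier):
    """Three-stage inference sketch: a binary sentence classifier gates the
    two-stage extraction pipeline (the three callables are placeholders)."""
    sentence = " ".join(words)
    if sentence_filter(sentence) == 0:        # stage 1: class 0 means "no entity"
        return ["O"] * len(words)

    bio = entity_tagger(words)                # stage 2: O / B / I tag per word
    tags = ["O"] * len(words)
    i = 0
    while i < len(words):
        if bio[i] == "B":                     # collect the full B, I, I, ... span
            j = i + 1
            while j < len(words) and bio[j] == "I":
                j += 1
            entity = " ".join(words[i:j])
            etype = type_classifier(entity, sentence)   # stage 3: one of 13 types
            tags[i] = f"B-{etype}"
            for k in range(i + 1, j):
                tags[k] = f"I-{etype}"
            i = j
        else:
            i += 1
    return tags
```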

4 Experimental Setup

4.1 Data and Evaluation Metrics

This shared task uses the SoMeSci dataset [13], which includes 39,768 sentences and 3,756 software mentions, divided into a training set and a private test set. We train our systems only on the training set and evaluate the performance of our models on the private test set using weighted precision, recall, and F1-score. Table 2 summarizes general information about the two sets: #Sentence denotes the number of sentences, #Sentence with entity denotes the number of sentences containing at least one entity, and Total entity is the total number of entities in all sentences. Max length and Avg length are the maximum and average sentence lengths in each set, respectively. The dataset groups entities into Application, OperatingSystem, PlugIn, ProgrammingEnvironment, and SoftwareCoreference, and each group can contain entities of four types (Creation, Deposition, Mention, Usage). Table 3 shows the distribution of each entity type in the dataset.

Table 2. General statistics in the training set and private test set
Table 3. Statistics of the number of entities of each entity type in each entity group in the training set and private test set
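For reference, statistics of the kind reported in Table 2 can be recomputed from BIO-tagged data with a few lines of code; the token-per-line file format assumed here is ours and may differ from the exact release format.

```python
def dataset_stats(path):
    """Compute Table 2-style statistics from a CoNLL-style file
    (token<TAB>label per line, blank line between sentences; format is an assumption)."""
    sentences, current = [], []
    for line in open(path, encoding="utf-8"):
        line = line.rstrip("\n")
        if not line:
            if current:
                sentences.append(current)
                current = []
        else:
            token, label = line.split("\t")
            current.append((token, label))
    if current:
        sentences.append(current)

    n_with_entity = sum(any(l != "O" for _, l in s) for s in sentences)
    n_entities = sum(l.startswith("B-") for s in sentences for _, l in s)
    lengths = [len(s) for s in sentences]
    return {
        "#Sentence": len(sentences),
        "#Sentence with entity": n_with_entity,
        "Total entity": n_entities,
        "Max length": max(lengths),
        "Avg length": sum(lengths) / len(sentences),
    }
```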

4.2 System Settings

We conduct all experiments on the three approaches using three base-version backbones: XLM-R, BERT, and SciBERT. We load the backbone weights from the HuggingFace library and carry out training on an NVIDIA T4 (x2) GPU provided by Kaggle. The corresponding hyper-parameters for each approach are presented below, followed by a minimal training-configuration sketch:

  • Approach 1: batch size = 32, learning rate = 5e−05, and 25 epochs for the XLM-R model and 20 epochs for both remaining backbones.

  • Approach 2:

    • Stage 1: batch size = 32, learning rate = 5e−05, and 20 epochs for all three backbones.

    • Stage 2: batch size = 16, learning rate = 2e−05, and 25 epochs for the XLM-R model and 20 epochs for the two remaining models.

  • Approach 3:

    • Stage 1: batch size = 32, learning rate = 2e−05, and 10 epochs for all three backbones.

    • Stage 2 and Stage 3: use the same configuration and architecture as in Approach 2.
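The sketch below shows how these settings map onto a HuggingFace training configuration, using Approach 1 with the XLM-R backbone as the example; the output path and function name are ours, and the dataset preparation (as in the Sect. 3.1 sketch) is omitted.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

def train_approach1(train_dataset, model_name="xlm-roberta-base", epochs=25):
    """Fine-tune one backbone with the Approach 1 hyper-parameters.
    `train_dataset` is a tokenized, label-aligned dataset as in the Sect. 3.1 sketch."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=27)
    args = TrainingArguments(
        output_dir=f"somd-approach1-{model_name.split('/')[-1]}",  # illustrative path
        per_device_train_batch_size=32,   # batch size = 32
        learning_rate=5e-5,               # learning rate = 5e-05
        num_train_epochs=epochs,          # 25 for XLM-R, 20 for BERT and SciBERT
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()
    return trainer
```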

5 Main Results

According to the organizing committee, this sub-task is evaluated by F1-score based on exact matches. Table 4 provides a tabulated summary of 9 experiments, each representing one of the 9 final systems obtained from the three approaches and the three backbones.
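The organizers' scoring script is not reproduced here; as an illustration, entity-level exact-match scores of this kind can be computed with the seqeval library. The toy label sequences below are ours.

```python
# Requires: pip install seqeval
from seqeval.metrics import classification_report, f1_score

# Toy example: gold and predicted BIO tag sequences for two sentences.
y_true = [["O", "B-Application_Usage", "I-Application_Usage", "O"],
          ["O", "O", "B-ProgrammingEnvironment_Usage"]]
y_pred = [["O", "B-Application_Usage", "I-Application_Usage", "O"],
          ["O", "O", "O"]]

print(f1_score(y_true, y_pred, average="weighted"))  # weighted F1 over exact span matches
print(classification_report(y_true, y_pred))         # per-class precision / recall / F1
```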

The experimental results in Table 4 indicate that Approach 3, the three-stage system, achieves the best performance across all backbones, with the XLM-RoBERTa backbone performing best among all approaches. However, this result is for reference only and holds only within our experiments. It is important to acknowledge that different contexts, setups, or datasets might yield different outcomes, and we cannot be sure that this is the best result each backbone could achieve in other cases. Finally, our best system was built according to Approach 3 with the XLM-R backbone, and our best submission was ranked 3rd. Table 5 shows the final scores of the top 5 participants.

Table 4. Comparative performance of our three Approaches with different pre-trained language models on the test set.
Table 5. Official scoreboard (https://codalab.lisn.upsaclay.fr/competitions/16935#results) for the sub-task I: Software mention recognition.

With the test dataset labels provided by the organizing committee, we evaluated the performance of our best system for each entity class in Table 6. The SoftwareCoreference_Deposition entity achieved the highest Precision, while the ProgrammingEnvironment_Usage entity attained the highest Recall and F1-score; the classes with the highest F1-scores are ProgrammingEnvironment_Usage, SoftwareCoreference_Deposition, and OperatingSystem_Mention. Entities in the PlugIn group typically scored lower than those in other groups, showing that they are difficult to recognize. Although the number of PlugIn_Usage entities in the training set is fairly large, the result on the test set is not positive. In addition, PlugIn_Creation and PlugIn_Deposition have very few samples in the training set, and their scores approach zero. By contrast, the number of OperatingSystem_Mention entities in the training set is low while its test-set score is high, so we suspect that mention entities in this group are distinctive and easier to recognize than those of other groups.

Additionally, in Table 7 we evaluate each individual stage of our final three-stage system by assuming that all preceding stages are 100% accurate. The first stage works well, with an F1-score of 0.992 for classifying whether a sentence contains an entity. Stage 2, which detects entities in sentences, achieves a relatively good F1-score, but a significant gap between Precision and Recall (12.6% difference) is evident, which also affects the overall system performance. In the final stage, the three metrics are relatively balanced, but the task of classifying 13 entity classes appears to weigh on this stage, giving it relatively lower overall performance. Error propagation between the three stages has a significant impact on the entire system, whose final F1-score is 0.678.

Table 6. Performance of the final system on the test dataset across entity classes evaluated by Precision, Recall, and F1-score.
Table 7. Performance of components in our final three-stage framework.

6 Conclusion and Future Work

In this paper, we present and evaluate three approaches for tackling sub-task I of the Software Mention Detection in Scholarly Publications shared task. While we explored several suitable transformer models such as BERT, our three-stage system leveraging the XLM-R model achieved the highest performance in the competition, and our best system ranked in the top 3 on the private test set. In future work, we intend to analyze the error propagation between the three stages to enhance the performance of the entire three-stage system. Additionally, with access to more substantial computational resources, we aim to experiment with larger batch sizes and more epochs for each backbone when fine-tuning the sub-tasks, in order to investigate the effects of these hyper-parameters on model performance.