
1 Introduction

The integration of software tools into scientific research is no longer confined to engineering disciplines but has extended to all fields, including the humanities and social sciences. This is largely driven by the escalating need to process and analyze data across various domains. Consequently, these tools are mentioned extensively in scientific articles. Furthermore, the volume of published scholarly articles continues to grow year after year, which makes managing this vast repository of knowledge more important than ever before. In response, numerous projects and initiatives such as EOSC and NFDI have been launched to organize research outputs, including articles, software mentions, and datasets, in a manner that adheres to the FAIR principles [27] (i.e., Findable, Accessible, Interoperable, and Reusable) to enhance the overall integrity and reproducibility of scientific research.

The absence of a standardized mechanism for authors to accurately cite and reference tools and datasets in their scholarly articles necessitates the extraction of such mentions post-publication. However, this task is challenging due to the unstructured nature of these articles, which underscores the importance of developing sophisticated and reliable methodologies capable of automatically and accurately recognizing mentions of these tools. This paper addresses this challenge by concentrating on a key aspect of it: “Subtask I of the Software Mention Detection in Scholarly Publications (SOMD).” The aim is to explore new methodologies and models for software mention detection in academic literature.

Although the challenge of detecting software mentions has been directly addressed using knowledge graph (KG) approaches [20, 22], many approaches have been proposed to solve the more general problem of extracting information from scholarly data, including Named Entity Recognition [23], Metadata Extraction [1, 5, 10], and Reference Extraction and Segmentation [4, 8, 19]. The proven capabilities of large language models (LLMs) to understand and generate human-like text offer new possibilities. Among these models, Falcon-7b stands out due to its decoder-only architecture and the breadth of its training data. Although it was initially designed for generative tasks, the scale of its architecture and the diversity of its training corpus present an exciting opportunity to explore its potential in specialized areas of NLP, such as the detection and classification of software mentioned in academic publications.

This paper investigates the effectiveness of the Falcon-7b model on the task of software mention detection in the scientific literature by adapting it to the task. The choice of model is arbitrary; we do not assume it has a specific advantage over similar models such as LLaMA 2 [24] or GPT-3.5 [6]. This choice allows us to explore how well such LLMs perform on a specialized task; specifically, we leverage the model’s extensive training on a wide array of texts and its advanced understanding of linguistic nuances. This effectiveness is critically assessed through comprehensive experiments, encompassing a variety of evaluation metrics to ensure a balanced analysis of the model’s applicability to the task at hand.

The structure of this paper is laid out as follows: Sect. 2 provides a review of the relevant literature. In Sect. 3, we delve into the methodology of this study, outlining the approaches and techniques utilized. Section 4 presents the experimental setup, alongside the results achieved from our investigations. Finally, Sect. 5 concludes the paper, summarizing our findings and offering insights into potential avenues for future research.

2 Related Work

As there are few efforts to detect software mentions using LLMs, this section reviews approaches to named-entity recognition (NER), of which we consider software mention detection a special case. The review is structured into three categories:

2.1 Rule-Based and Classical Machine Learning Approaches

Named entity recognition has been at the core of machine learning research for decades due to its importance in a variety of applications. However, the task is challenging, and thus many approaches have been proposed to tackle it in specific scenarios. [2] introduces a rule-based NER framework designed for the Malay language to improve the retrieval process of articles. It addresses the challenges faced by NER processes in languages with morphological differences and the lack of existing systems for the Malay language. By analyzing the domain of studies and the specific linguistic features of Malay, this framework accurately classifies named entities such as people, organizations, and locations. [16] proposes to address the limitations of traditional rule-based approaches by using Hidden Markov Models (HMM) for NER tasks in Indian languages beyond domain-specific applications. [25] proposes an approach combining Conditional Random Fields (CRF) with an Active Learning (AL) algorithm, where the training process of the CRF classifier is repeated until the model stabilizes, resulting in an efficient and cost-effective outcome. [9] uses the beam search algorithm to detect named entities in the Persian language by segmenting text into expressions that are suitable or unsuitable for named entities and then applying dynamic external knowledge to recognize emerging named entities.

2.2 Deep Learning-Based Approaches

Deep neural networks have demonstrated significant performance gains over classical machine learning approaches on different tasks, including NLP and computer vision. Consequently, they have been used to address NER tasks as well [12, 14]. [29] proposes a NER framework called E-NER that uses evidential deep learning (EDL) to explicitly model predictive uncertainty for NER tasks. It addresses the challenges of sparse entities and out-of-vocabulary (OOV) entities in NER by introducing uncertainty-guided loss terms and training strategies. E-NER achieves accurate uncertainty estimation, better OOV/OOD detection performance, and improved generalization on OOV entities compared to state-of-the-art baselines. [15] introduces a novel approach to clinical NER for de-identifying sensitive health information in clinical texts by developing a Capsule-LSTM network that leverages the strengths of capsule networks for capturing complex data relationships and LSTM networks for understanding sequential data. [13] proposes the All CNN (ACNN) model for Chinese clinical NER, which employs a CNN enhanced by an attention mechanism, sidestepping the inefficiencies of traditional LSTM models. By leveraging multi-level CNN layers with various kernel sizes and a residual structure, ACNN adeptly captures context information across different scales, addressing the challenges posed by the complex grammar and terminology of Chinese clinical texts. [7] proposes combining CNN and bi-LSTM architectures for biomedical NER (bioNER), aiming to efficiently handle the complexities of biomedical texts, such as variant spellings and inconsistent use of prefixes and suffixes. The proposed combinatorial feature embedding and attention mechanism for enhanced entity recognition demonstrate superior performance on the benchmark datasets JNLPBA and NCBI-Disease compared with existing methods.

2.3 Large Language Model-Based Approaches

Due to their power to understand and generate natural text, LLMs have been employed for many Natural Language Processing (NLP) tasks, including Named Entity Recognition. [26] proposes GPT-NER, a method that transforms the sequence labelling task of NER into a text generation task, allowing LLMs to be easily adapted for NER. GPT-NER achieves performance comparable to fully supervised baselines on five widely adopted NER datasets and exhibits a greater ability in low-resource and few-shot setups, making it suitable for real-world NER applications with limited labelled examples. [17] employs Generative Pre-trained Transformer 3 (GPT-3) by OpenAI together with a weak supervisor to address the NER challenge within the legal domain, exemplified by documents from the Official Gazette of the Federal District (DODF). [11] proposes an architecture that refines the adversarial example generation process for LLMs using disentanglement and word attribution techniques to efficiently generate adversarial examples while maintaining semantic similarity. Experiments conducted on the benchmark datasets CoNLL-2003, OntoNotes 5.0, and MultiCoNER demonstrated that the approach improves F1 scores by 8%.

3 Method

This section elaborates on the methodologies applied for the Software Mention Detection in Scholarly Publications (SOMD) [21] Subtask I, part of the Natural Scientific Language Processing and Research Knowledge Graphs (NSLP) 2024 workshop. Our approach is centred on token classification, addressing the unique challenges software mention recognition poses in scientific texts.

The primary objective of the subtask is to recognize software mentions within individual sentences, further classifying them by mention type (e.g., mention, usage, creation) and software type (e.g., application, programming environment, package). The task is approached with a methodology that innovatively applies a large language model (LLM), such as the Falcon-7b [3] model, as the foundation for a token classification system.

The core of our methodology is the Falcon-7b model, a decoder-only architecture known for its performance across a wide range of NLP tasks. Despite its primary design as a generative model, we adapt Falcon-7b for token classification by appending a classification layer atop its structure. This adaptation is driven by the hypothesis that the extensive pre-training on diverse corpora, coupled with its vast parameter space, provides the Falcon-7b model with a nuanced understanding of textual context. Such capabilities can significantly enhance the model’s proficiency in identifying and classifying software mentions within the complex syntax and semantics of scholarly writing.
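A minimal sketch of this adaptation, assuming the Hugging Face Transformers interface used later in Sect. 4; the number of labels is a placeholder for the size of the actual SOMD IOB2 tag set, and the padding-token handling is an implementation detail not described in the text:

```python
# Sketch: wrapping Falcon-7b with a token-classification head.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "tiiuae/falcon-7b"
NUM_LABELS = 21  # hypothetical size of the IOB2 tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
)

# Falcon's tokenizer defines no padding token by default; reusing EOS lets
# batched sentences be padded for token classification.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id
```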

The methodology implemented in this study is structured around several strategic training approaches, tailored to address the intrinsic challenges of the SOMD task. Initially, the task is framed as a token classification problem, where labels are assigned to each word or subtoken within a sentence to denote the presence and category of software mentions. A notable challenge arises from the tendency of transformer-based models like Falcon-7b to segment words into multiple subtokens, which complicates the direct application of labels.

To address the challenge of labelling consistency in the presence of subtoken segmentation by transformer models, our methodology employs two distinct strategies. The first, referred to as “Unified Labeling,” assigns the same label to all subtokens derived from a single word’s segmentation, ensuring consistency across the subtoken sequence. This maintains label continuity across divisions, which facilitates coherent entity recognition despite the segmentation process. In contrast, the second strategy, “Selective Labeling,” assigns a label only to the first subtoken of a segmented word, disregarding subsequent subtokens. This method aims to minimize the label redundancy and computational complexity associated with processing multiple labels for a single entity. Each approach is explored independently to determine its efficacy in addressing the challenges of subtoken label alignment in the context of the SOMD task.
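The following sketch illustrates the two alignment strategies, assuming a fast tokenizer whose word_ids() method maps subtokens back to source words; function and variable names are illustrative rather than taken from our implementation:

```python
# Sketch of the "Unified Labeling" and "Selective Labeling" strategies.
def align_labels(word_labels, encoding, strategy="selective", ignore_index=-100):
    """word_labels: one integer label per word; encoding: a fast-tokenizer output."""
    aligned = []
    previous_word = None
    for word_id in encoding.word_ids():
        if word_id is None:                      # special or padding tokens
            aligned.append(ignore_index)
        elif strategy == "unified":              # every subtoken keeps the word label
            aligned.append(word_labels[word_id])
        elif word_id != previous_word:           # selective: label the first subtoken only
            aligned.append(word_labels[word_id])
        else:
            aligned.append(ignore_index)         # remaining subtokens are ignored in the loss
        previous_word = word_id
    return aligned

# Illustrative usage:
#   encoding = tokenizer(words, is_split_into_words=True)
#   labels = align_labels(word_labels, encoding, strategy="unified")
```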

A critical obstacle encountered in the SOMD task is the substantial class imbalance, where the data is skewed towards non-mention (’O’) labels. To address this imbalance, we implemented two distinct and independently tested strategies to recalibrate the dataset for training. In the first strategy, “Weighted Loss,” a weighted loss is applied in which class weights are inversely proportional to class frequencies. However, given the substantial imbalance, the raw weight values became impractical for underrepresented classes. To resolve this, we scaled the weights within a sensible range, setting a minimum of 1 and maximum thresholds of 25, 50, 100, and 200, thereby enabling the nuanced training of models across various weight configurations to explore their efficacy in balancing classification performance. The second strategy, “Adaptive Sampling,” segments the dataset into over-represented (all ’O’ tokens) and under-represented (at least one non-’O’ token) categories. This involves oversampling the under-represented data by a factor of 2 and undersampling the over-represented data to sizes equal to multiples (1, 1.5, 3) of the oversampled data volume. Adaptive Sampling aims to achieve a more balanced class distribution, enhancing the model’s capacity to learn from a representative spectrum of the dataset. Each strategy’s independent evaluation provides insights into its effectiveness in addressing dataset imbalance challenges in the SOMD task.
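A sketch of the “Weighted Loss” scaling, assuming inverse-frequency weights clipped to the configured range; the exact weighting formula is an illustrative choice, with only the clipping bounds taken from the description above:

```python
# Sketch: inverse-frequency class weights clipped to [1, max_weight],
# where max_weight is one of the thresholds 25, 50, 100, or 200.
from collections import Counter
import torch

def clipped_class_weights(all_labels, num_labels, max_weight=50.0, ignore_index=-100):
    counts = Counter(label for label in all_labels if label != ignore_index)
    total = sum(counts.values())
    weights = torch.ones(num_labels)
    for label_id in range(num_labels):
        freq = counts.get(label_id, 1)
        weights[label_id] = total / (num_labels * freq)   # inverse frequency
    return weights.clamp(min=1.0, max=max_weight)

# The weights plug into the token-level cross-entropy loss, e.g.:
#   loss_fn = torch.nn.CrossEntropyLoss(weight=weights, ignore_index=-100)
#   loss = loss_fn(logits.view(-1, num_labels), labels.view(-1))
```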

Additionally, recognizing the multifaceted nature of software mentions, separate token classifiers are developed for identifying software types and mention types. We call this strategy “Dual-Classifier.” It enables more precise label application by separating the task into two distinct classification problems, one for software types and one for mention types. This setup is designed to explore whether such a nuanced approach can effectively capture the diverse nature of software mentions, offering a potentially more sophisticated mechanism for their identification and classification.
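The label split behind the “Dual-Classifier” strategy can be sketched as follows, assuming joint IOB2 tags of the hypothetical form “B-Application_Usage”; the real SOMD tag format may differ:

```python
# Sketch: splitting a joint IOB2 tag into a software-type tag and a mention-type tag.
def split_joint_tag(tag):
    """Return (software-type tag, mention-type tag) for a joint IOB2 tag."""
    if tag == "O":
        return "O", "O"
    prefix, body = tag.split("-", 1)             # e.g. "B", "Application_Usage"
    software_type, mention_type = body.rsplit("_", 1)
    return f"{prefix}-{software_type}", f"{prefix}-{mention_type}"

# Each resulting tag stream trains its own token classifier; predictions are
# recombined afterwards to rebuild the joint label.
print(split_joint_tag("B-Application_Usage"))    # ('B-Application', 'B-Usage')
```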

To ensure the effectiveness and generalizability of the model, the pre-defined training dataset was utilized to develop and refine the model’s capabilities. For evaluating the model’s performance in accurately identifying and classifying software mentions, the test dataset provided for Subtask II was employed. With this, we aim to conduct a comprehensive assessment of the model’s ability to generalize across various scenarios and text variations found in scholarly publications, thereby validating the model’s applicability and effectiveness in real-world tasks.

Following the methodologies delineated for the Software Mention Detection (SOMD) task, our evaluation process is designed to rigorously assess the effectiveness of our approaches, namely “Unified Labeling,” “Selective Labeling,” “Weighted Loss,” “Adaptive Sampling,” and the “Dual-Classifier” approach. Central to our evaluation is the F1 score on exact matches, which serves as a critical metric to quantify the precision and recall of our models in accurately identifying and classifying software mentions. Adherence to the IOB2 format for our submission files ensures alignment with the standardized training labels, facilitating direct comparison of our model’s performance against established benchmarks.
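An illustrative evaluation sketch using the seqeval library, which computes entity-level (exact-match) precision, recall, and F1 on IOB2 sequences; the tags shown are placeholders, not the official SOMD labels:

```python
# Sketch: strict, entity-level F1 on IOB2-tagged sequences with seqeval.
from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOB2

gold = [["O", "B-Application", "I-Application", "O"]]
pred = [["O", "B-Application", "I-Application", "O"]]

print(f1_score(gold, pred, mode="strict", scheme=IOB2))
print(classification_report(gold, pred, mode="strict", scheme=IOB2))
```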

4 Experimental Results

4.1 Results

To evaluate our method on Software Mention Detection in Scholarly Publications (SOMD) Subtask I, we explored various settings centred around the Falcon-7b model. The experiments were conducted using the Hugging Face Transformers library [28], with PyTorch [18] as the computational backend. Model training and evaluation were performed on a high-performance computing cluster equipped with NVIDIA A100 GPUs. To maintain consistency across all experiments conducted in this study, we standardized our training hyperparameters. This ensures that any observed differences in model performance are attributable to variations in model architecture, data preprocessing, or other experimental conditions, rather than inconsistencies in training configurations.
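For illustration, such a shared training configuration could be expressed with the Trainer API; the hyperparameter values below are placeholders, not the exact settings of our experiments:

```python
# Sketch: one training configuration reused across all runs, so that only the
# labeling/imbalance strategy and classifier head vary between experiments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="falcon7b-somd",
    num_train_epochs=3,               # placeholder value
    per_device_train_batch_size=4,    # placeholder value
    gradient_accumulation_steps=8,    # placeholder value
    learning_rate=2e-5,               # placeholder value
    bf16=True,                        # A100 GPUs support bfloat16 mixed precision
    logging_steps=50,
    save_strategy="epoch",
    seed=42,
)
# The same `training_args` object would then be passed to a Trainer together
# with the tokenized dataset produced by the chosen labeling strategy.
```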

The focal point of our evaluation is the F1 score, which balances precision and recall and offers a comprehensive measure of model performance. The evaluation results on the test dataset are presented in Table 1. These results were submitted to the shared task platform under the username “fddaFIT”.

Table 1. Evaluation metrics for each experimental approach.

We first compared the two labelling strategies, “Unified Labeling” and “Selective Labeling.” The outcome favoured “Selective Labeling,” which demonstrated superior precision and recall and achieved a higher F1 score of 0.6193 compared to 0.5561 for “Unified Labeling.” This demonstrates the effectiveness of selectively assigning labels in enhancing the model’s precision and recall. This finding guided the direction of subsequent experiments, embedding the “Selective Labeling” strategy in our methodology.

Further investigations focused on adaptive sampling and weighted loss scaling to address the notable challenge of dataset imbalance. Adaptive sampling experiments, applying various undersampling multipliers to the over-represented data, demonstrated the impact of data distribution on model efficacy. Using a multiplier of 1.5 raised the F1 score to 0.6284, the highest among the adaptive sampling variations. This suggests an optimal balance in dataset composition, significantly contributing to model performance.

Experiments with weighted loss, adjusted across a range of maximum weights (25, 50, 100, and 200), aimed to refine the model’s sensitivity to class frequencies. While experimenting with weighted loss scaling offered insights into handling class imbalance, the adjustments did not surpass the adaptive sampling’s peak F1 score, with the highest recorded F1 at 0.6012 for a scaling factor of 25.

The “Dual-Classifier” approach introduced a bifurcated strategy for identifying software types and mention types, aiming to enrich the model’s understanding and classification accuracy. Interestingly, this approach alone achieved an F1 score of 0.6068, comparable to some weighted loss scaling strategies but slightly below the best-performing adaptive sampling method. Combining the “Dual-Classifier” with “Adaptive Sampling” at a multiplier of 1.5 did not further enhance the F1 score, indicating that while each method independently contributes to addressing specific challenges in SOMD, their combined effect does not necessarily result in cumulative improvements.

The experiments demonstrate the critical role of selective labelling and adaptive sampling in enhancing F1 scores for SOMD. While weighted loss scaling and the “Dual-Classifier” approach contribute to performance improvements, the combination of strategies does not yield further enhancements. This indicates the necessity of strategic selection and implementation in model development for SOMD tasks.

In these experiments, we observed that the model effectively identifies software mentions within the complex academic text and accurately classifies entities like specific software tools or programming languages.

For example, it might correctly identify “Python” as a programming language used within a research context. However, challenges arise in distinguishing between mere mentions of software and instances where the software is being actively used or discussed in depth. This differentiation is crucial for understanding the role of software in research, as a mere mention might not signify importance or relevance to the study’s outcomes. Examples of the model’s output are provided in the full version of this paper.

5 Conclusion

In this paper, we presented the efficacy of the Falcon-7b model, a prominent Large Language Model (LLM), in tackling the nuances of Software Mention Detection (SOMD) within scholarly publications. Guided by the hypothesis that advanced LLMs could significantly improve the precision and recall for SOMD tasks due to their extensive training on diverse datasets, our study systematically explored various strategies centred around the Falcon-7b model.

The comparative analysis of the “Unified Labeling” and “Selective Labeling” strategies revealed a preference for “Selective Labeling,” which yielded a higher F1 score. This result underscores the incremental nature of the advancements achievable with the Falcon-7b model in the context of SOMD tasks. Additionally, our experiments with adaptive sampling and weighted loss scaling, aimed at addressing dataset imbalances, showed that, despite certain improvements, the enhancements were not as substantial as anticipated. The adaptive sampling strategy, especially with a multiplier of 1.5, did raise the F1 score to the highest value we obtained; however, the improvement was not large enough to characterize the performance as exceptional. This finding suggests that while the Falcon-7b model and similar LLMs hold promise for NLP tasks, their application in specialized areas such as SOMD may not always yield groundbreaking results.

In conclusion, the outcomes of our study indicate that, despite the advanced capabilities of LLMs like Falcon-7b, the performance improvements in specific NLP tasks such as SOMD are modest. While LLMs offer advantages in processing and understanding complex language patterns, their effectiveness in specialized domains like software mention detection within scholarly texts does not markedly outperform more traditional approaches. This emphasizes the importance of combining LLMs with other approaches and the need to explore a broad spectrum of models and methodologies to identify the most effective solutions for specialized NLP challenges.