1 Introduction

The evolution of information extraction (IE) in scholarly communications necessitates the development of innovative methodologies to accurately identify and categorise software mentions. This is a critical component for ensuring research transparency and reproducibility. The Shared Task on Software Mentions Disambiguation (SOMD) highlights the importance of refined citation practices amidst inconsistent referencing of software artifacts. This research employs generative Large Language Models (LLMs) to address the challenges in software mention extraction and relation identification, marking a significant step towards the sophisticated extraction of software-related information. Our approach integrates Retrieval-Augmented Generation (RAG) techniques with LLMs to dissect and comprehend the intricate web of software citations and their attributive details (e.g. the developer or the version) within scientific texts. By examining the potential of LLMs to perform IE tasks when these are transformed into single-choice question-answering, we present a comprehensive analysis that addresses the nuances of software mention extraction and could be applied to a broader scope of scholarly artifact NER and relation extraction. This paper explores the complexities of applying LLMs to NER tasks, providing insights into the challenges and proposing a new methodology for relation extraction that could pave the way for future innovations in the field.

2 Related Work

This section explores works in Named Entity Recognition (NER) and Information Extraction (IE), focusing on software mention extraction, scholarly artifact NER, and the use of generative LLMs in these areas.

In software mention extraction, the Softcite dataset [2] and the SoMeSci knowledge graph [7] are significant. Softcite offers a gold-standard dataset for extracting software mentions from biomedical and economic research, while SoMeSci Knowledge Graph includes software mentions in scientific articles with relation labels such as version, developer and citations, highlighting the need for accurate software mention extraction.

In scholarly artifact NER, Saji and Matsubara [6] introduced a method using academic knowledge graphs to extract research resource metadata from scholarly papers, enhancing metadata quality and repository size. Otto et al. [5] developed the GSAP-NER corpus for extracting machine learning-related entities from scientific publications, filling the gap in general-purpose NER models. These works demonstrate the importance of domain-specific NER tasks and of leveraging knowledge graphs for enhancing research resource repositories.

Regarding LLMs for NER, Wang [10] explores the use of text-generation models for sequence labeling and the underlying challenges, particularly in low-resource and few-shot setups. Furthermore, Xie [11] investigates LLMs’ self-improving capabilities for zero-shot NER.

For LLMs in relation extraction, Wadhwa [8] uses few-shot prompting and fine-tuning with large language models to achieve state-of-the-art performance, while Wan [9] introduces GPT-RE to enhance relation extraction accuracy through task-specific entity representations.

Lastly, Xu [12] conducts a comprehensive survey providing an overview of generative information extraction using LLMs, categorizing works by IE subtasks and learning paradigms, and highlighting the transformative potential of LLMs in IE.

Overall, these studies underscore the evolving methods and significant contributions in extracting information from scholarly texts using specialized datasets, domain-specific approaches, and advanced generative models.

3 SOMD Shared Task

The Shared Task on Software Mentions Disambiguation (SOMD) aims to enhance transparency and reproducibility in scientific research by improving software citation practices. Participants are required to develop models that can identify and disambiguate software mentions in scholarly texts using the expanded Software Mentions in Science (SoMeSci) knowledge graph, with a focus on AI and Computer Science.

Subtask 1, Software NER, involves identifying four types of software-related entities: Application, Plugin, Operating System, and Programming Environment. Accurately classifying software mentions is necessary to understand their role in academic discourse.

Subtask 2, Attribute NER, aims to extract ten types of attributive information associated with a specific software mention, such as alternative names, abbreviations, authorship (developers), software release maintenance (versions, extensions, release dates, licenses), and elements of the referencing system (in-text citations, URLs, and software coreferences for cross-sentence linkage).

Subtask 3, Relation Extraction, involves establishing relationships between software entities and their attributes using specific relation types and mapping each attribute to these relations. This task aids in comprehending the interrelations of software entities within scholarly texts.

The tasks' evaluation metric is the Weighted Average Macro F1 score. This metric adjusts the influence of each label on the final result based on the number of test instances assigned to it, ensuring that labels with fewer instances have a proportionally smaller impact on the overall performance evaluation.
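To make the weighting explicit, one common reading of this description (our formulation, not an official task definition) is

$$\mathrm{F1}_{\text{weighted}} = \sum_{\ell \in \mathcal{L}} \frac{n_\ell}{N}\,\mathrm{F1}_\ell,$$

where $\mathcal{L}$ denotes the label set, $n_\ell$ the number of test instances carrying label $\ell$, $N = \sum_{\ell} n_\ell$ the total number of test instances, and $\mathrm{F1}_\ell$ the per-label F1 score.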

4 Using LLMs for Software Related IE-Tasks

The exploration of generative LLMs such as GPT-4 for Information Extraction (IE) tasks related to software underscores the transformative potential these models hold as general-purpose task solvers. The allure of leveraging LLMs in this capacity is significant, given their ability to process and generate human-like text across a wide range of topics and formats. However, the challenge arises in the specificity and nuanced requirements of domain-specific tasks, such as NER within specialized fields. It has been observed that, despite their vast knowledge and versatility, current models like GPT-4 often fall short when tasked with domain-specific NER, primarily due to their generalized training and lack of domain-specific tuning.

To mitigate these shortcomings and enhance the performance of LLMs in specialized IE tasks, various in-domain learning strategies are employed. These strategies are designed to equip the LLM with a deeper understanding of the task at hand, essentially guiding the model towards more accurate identification and classification of relevant text spans. Among these strategies, optimizing the task description plays a crucial role. A well-crafted, precise task description can significantly improve the model’s focus and comprehension of the task’s objectives, leading to more relevant and accurate outcomes.

Furthermore, the provision of expressive, prototypical examples serves as another effective strategy. By presenting the model with clear, illustrative examples that encapsulate the essence of the task, we can anchor its understanding and enhance its ability to generalize from these examples to new, unseen instances. This approach leverages the model’s inherent learning capabilities, allowing it to draw parallels and apply learned concepts to the task at hand.

Additionally, augmenting the model’s capabilities with Retrieval Augmented Generation (RAG) introduces a powerful dimension to the information extraction process. RAG combines the generative prowess of LLMs with the specificity and relevance of retrieved documents, enabling the model to access a broader context and detailed examples that are pertinent to the task. This strategy is particularly advantageous in domain-specific applications, where the relevance and accuracy of the information extracted are paramount.

In our approach, we capitalize on the last strategy, RAG, to maximize the utility of training sets for each specific task. The retrieval component of this strategy entails identifying instances within the training set that are similar to the test instance and can provide valuable insights for the identification and classification of relevant information. This method not only enhances the model’s performance by providing it with task-relevant data but also ensures that the information extracted is of high relevance and quality, tailored to the specific demands of the domain-specific NER task. Through these tailored strategies, we aim to bridge the gap between the broad capabilities of LLMs and the precise requirements of domain-specific information extraction tasks, paving the way for more effective and efficient utilization of generative language models in specialized domains.

4.1 Challenges in Applying LLMs to NER Tasks

The integration of generative LLMs into NER tasks introduces a set of unique challenges that can significantly impact the performance and reliability of extraction outcomes. One of the most prevalent issues in generative NER approaches is the phenomenon of hallucination [10], where the model generates entities not present in the test instance. This can result from the model misinterpreting the provided examples as part of the text from which entities should be extracted, leading to inaccuracies and inconsistencies in the results. Furthermore, matching problems during the location of mention positions present considerable challenges, particularly in the context of span-based evaluation systems. These systems evaluate the accuracy of entity extraction based on the precise span of text identified as an entity. Discrepancies in the extracted span, whether through corrected spellings or variations in representation, can complicate the matching process. For example, the extraction of “jquery” as “jQuery” illustrates a common scenario where the prevalent spelling of a software library may differ from its mention in the text, yet both are correct. This variability necessitates sophisticated matching strategies to ensure accuracy in evaluation.
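A minimal sketch of such a post-hoc matching step, assuming the LLM returns mention strings and that a case-insensitive fallback is acceptable; the exact matching rules in our pipeline may differ:

```python
import re

def locate_span(text: str, mention: str):
    """Find the character span of a generated mention in the source text,
    falling back to case-insensitive matching when the surface forms differ
    (e.g. the model returns "jQuery" while the text contains "jquery")."""
    # 1) exact match
    idx = text.find(mention)
    if idx != -1:
        return idx, idx + len(mention)
    # 2) case-insensitive match as a simple normalization fallback
    m = re.search(re.escape(mention), text, flags=re.IGNORECASE)
    if m:
        return m.start(), m.end()
    # 3) no match: treat the mention as hallucinated and discard it
    return None

print(locate_span("We analysed the data with jquery 3.2.", "jQuery"))  # (26, 32)
```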

The situation is further complicated by texts containing multiple mentions of a single entity. The challenge lies in determining whether each mention can be accurately classified as an entity, alongside dealing with the ambiguity of different entities that are written in the same way. Overlapping or nested entity mentions that do not align with the ground truth data introduce additional layers of complexity, requiring nuanced approaches to entity recognition and classification. As a baseline solution, we filter out non-matching entities and employ rule-based decisions for handling multiple matches. The disadvantage of this method, however, is that it relies on simplistic heuristics.

Future work may explore advanced solutions like using LLMs for precise entity matching, offering potential improvements for NER challenges. While this paper does not delve into these complex methods, it highlights the importance of ongoing research to further refine and improve LLMs for more accurate and efficient entity extraction.

4.2 Sample Retrieval for RAG on Various IE-Tasks

Retrieval-Augmented Generation (RAG) significantly bolsters the capabilities of LLMs in IE tasks by effectively utilizing both unstructured and structured contexts [3, 4]. This dual approach is essential in IE for achieving precise extractions, yet the selection of optimal samples for the generative process poses a substantial challenge. Addressing this involves two primary strategies: utilizing sentence embeddings to find contextually similar textual content for the LLM, and identifying analogous entities to uncover beneficial training sentences. These methods enable the LLM to discern structural and semantic patterns for more accurate text extractions. A crucial obstacle is accurately identifying target entities within test instances, for which we employ a pre-trained Language Model (PLM) tailored to our NER task. This PLM is instrumental in both spotting potential entity candidates and facilitating entity similarity searches, leveraging last hidden state embeddings from training examples to locate matching entities within the dataset.
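A minimal sketch of the sentence-similarity variant, assuming a Sentence-Transformers encoder (the specific embedding model shown is an assumption, not necessarily the one used in our experiments):

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; the paper does not prescribe a specific encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_examples(test_sentence: str, train_sentences: list[str], k: int = 7):
    """Return the k training sentences most similar to the test sentence,
    to be inserted as few-shot examples into the LLM prompt."""
    test_emb = encoder.encode(test_sentence, convert_to_tensor=True)
    train_embs = encoder.encode(train_sentences, convert_to_tensor=True)
    scores = util.cos_sim(test_emb, train_embs)[0]
    top_k = scores.topk(k=min(k, len(train_sentences)))
    return [train_sentences[i] for i in top_k.indices.tolist()]
```

The entity-based variant works analogously, except that last-hidden-state embeddings of candidate entities from the PLM are compared instead of whole-sentence embeddings.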

Our methodology extends to evaluating various retrieval techniques and their impact on the LLM’s efficiency, particularly within a Few-Shot learning framework. We explore different methods, including random illustrative samples, text similarity-based RAG, and entity-based sentence retrieval, to provide the LLM with contextually relevant examples, thereby optimizing the software entity extraction process from scientific texts. This exploration aims to identify the most effective strategies for utilizing LLMs in domain-specific software entity extraction tasks. By overcoming challenges related to resource demand, execution time, and precision in entity identification, our approach aims to enhance the accuracy and efficiency of NER processes, contributing valuable insights and methodologies to the field of computational linguistics and information extraction.

For the RE Task, a special retrieval method is employed. As entities for each test instance are provided, all possible relations can be listed based on Table 3. These relations can then be used to find relations of the same type, and with the same domain and range in the train set. If multiple similar relations are found, the one with the highest sentence similarity is selected.
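A sketch of this candidate enumeration, with a placeholder domain/range mapping standing in for Table 3 (which is not reproduced here):

```python
from itertools import product

# Placeholder domain/range constraints in the spirit of Table 3 (not the full table).
RELATION_SCHEMA = {
    "version_of":   ("Version",   "Application"),
    "developer_of": ("Developer", "Application"),
}

def candidate_relations(entities):
    """entities: list of (surface_form, entity_type) pairs for one sentence.
    Yields every (head, relation, tail) triple allowed by the schema."""
    for (head, head_type), (tail, tail_type) in product(entities, repeat=2):
        if head == tail and head_type == tail_type:
            continue
        for relation, (domain, range_) in RELATION_SCHEMA.items():
            if head_type == domain and tail_type == range_:
                yield head, relation, tail

ents = [("8", "Version"), ("SPSS", "Application")]
print(list(candidate_relations(ents)))  # [('8', 'version_of', 'SPSS')]
```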

4.3 Extraction of Software Entities

The extraction of software entities from scientific texts represents a specialized challenge within the realm of NER, targeting the identification of software entities across four distinct categories ranging from applications to operating systems. Furthermore, this task extends beyond mere identification, seeking to understand the intent behind each mention of software entities, whether it pertains to creation, usage, deposition, or mere citation.

We address this challenging task using LLMs, despite their high demands on resources and time, especially when processing extensive publications with sparse relevant text. Our approach includes a pipeline strategy that prioritizes selecting relevant text passages for LLM analysis, improving efficiency by filtering out unrelated content. This method’s success depends on the selection accuracy, directly impacting recall. However, the trade-off for reduced computational costs justifies the potential minor decrease in recall. Our performance optimization employs a hybrid method, combining a fine-tuned NER model for sentence selection with LLMs for information extraction. This approach faces limits, notably when sentences crucial for analysis are missed in the selection phase, capping the LLM extraction phase’s accuracy as indicated by an initial sentence classification task recall of 0.882 (0.884 F1). This establishes a theoretical limit on extraction accuracy due to potential false negatives, illustrating a balance between efficiency and the precision constraints of LLMs in detailed text analysis.
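A condensed sketch of this two-stage pipeline, where classifier and llm_extract stand in for the fine-tuned sentence classifier and the prompted LLM (both are placeholders, not our exact implementations):

```python
def extract_software_entities(document_sentences, classifier, llm_extract):
    """Two-stage pipeline: a fine-tuned classifier first filters sentences that
    likely mention software, then only those are passed to the (expensive) LLM."""
    selected = [s for s in document_sentences if classifier(s)]   # stage 1: recall-bounded filter
    results = []
    for sentence in selected:                                     # stage 2: LLM-based extraction
        results.extend(llm_extract(sentence))
    return results
```

The recall of the stage-1 filter (0.882 in our setup) therefore upper-bounds the recall of the whole pipeline.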

4.4 Extraction of Software Attributes

Following the identification of software entities within scientific texts, a further nuanced aspect of NER and IE tasks emerges in the extraction of associated software attributes. These attributes encompass a wide array of specific details: the version, developer, citations, URLs, release dates, abbreviations, licenses, extensions, software co-references, and alternative names. We used a similar approach as for Subtask 1 and utilised train sample retrieval to augment the task description in a few-shot setup. For each sample, including those derived from few-shot learning, the process entails presenting the sentence containing the software entity(ies) and then predicting a JSON list of identified entities along with their respective attribute types.

4.5 Relation Extraction as Single-Choice Question Answering Task

In the field of Natural Language Processing (NLP), the extraction of relations between entities within a text corpus poses significant challenges. This study proposes a novel approach by conceptualizing the task of relation extraction as a single-choice question-answering (QA) activity. This method entails generating a comprehensive list of all possible relations between the entities within a sentence, drawing from the existing entities and their relationships as delineated in the training dataset. Each potential pair of entities is then evaluated to ascertain if a specific relation, such as “version_of”, appropriately links them. For instance, considering the relationship “version_of”, a sentence may be formulated as “8 is the version of SPSS”, representing a possible relation between the version number and the software entity.

For every sentence in the dataset, this process yields a set of single-choice questions, each positing a potential relationship between entities. These questions are then prompted to a LLM for answering. The LLM’s task is to select the most plausible relation from among the given options, thereby facilitating the extraction of accurate entity relations from the text. However, this approach is not without its challenges. A primary source of error stems from instances where multiple relations could plausibly link a pair of entities, leading to ambiguity and complicating the single-choice question-answering framework. Despite the challenges, we demonstrate that treating relation extraction as a single-choice QA task provides a structured and innovative approach to extracting valuable insights from complex textual data.
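As an illustration, the candidate triples for a sentence could be verbalized into a single-choice question as follows; the templates and option format are illustrative rather than the exact prompts used:

```python
# Illustrative verbalization templates; the actual prompt wording may differ.
TEMPLATES = {
    "version_of":   "{head} is the version of {tail}",
    "developer_of": "{head} is the developer of {tail}",
}

def single_choice_question(candidates):
    """Turn candidate (head, relation, tail) triples for one sentence into a
    single-choice question with lettered options plus a 'none' fallback."""
    options = [TEMPLATES[rel].format(head=h, tail=t) for h, rel, t in candidates]
    options.append("None of the above")
    lines = [f"({chr(97 + i)}) {claim}" for i, claim in enumerate(options)]
    return "Which of the following claims is correct?\n" + "\n".join(lines)

print(single_choice_question([("8", "version_of", "SPSS")]))
# Which of the following claims is correct?
# (a) 8 is the version of SPSS
# (b) None of the above
```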

5 Experiments

5.1 Models

Fine-tuned Model. In our experimental setup, the initial phase focused on fine-tuning a language model specifically for Subtask 1, which involved NER. Following a methodology similar to Schindler [7], we employed the SciBERT model [1], given its pre-training on a scientific corpus, making it apt for NER fine-tuning within scholarly texts. The fine-tuning process entailed rigorous parameter optimization, including adjustments to the batch size, learning rate, and the relative share of negative samples (sentences that do not contain any annotations). This optimization utilized a 90/10 train/evaluation data split from the available training dataset. Subsequent to parameter tuning, we conducted a final training run with a modified data split of 95/5 train/evaluation to maximize the training data’s utility.

Generative Large Language Models. For a comparative analysis, our study integrated LLMs, specifically examining the performance differences between GPT-3.5-fast and GPT-4-fast models accessed via the OpenAI API. To ensure deterministic outputs for comparison, we set the temperature parameter to zero, eliminating randomness in the model’s response generation.
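As a minimal illustration, a deterministic chat-completion call with the current OpenAI Python client might look as follows; the model identifier is a placeholder, not necessarily the exact variant used in our experiments:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4") -> str:
    """Single chat completion; temperature=0 removes sampling randomness."""
    response = client.chat.completions.create(
        model=model,  # placeholder identifier; substitute the deployed model variant
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```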

5.2 Prompting

Software NER. The prompting strategy for Software NER involved providing a concise task description, which included specifying the task as NER and intention classification, alongside an enumeration of target labels. We also highlighted the domain specificity of the texts (i.e., scientific publications) and requested the output in JSON format, delineating separate labels for entity type and intention. Sample sentences from the training set, along with their corresponding JSON output, were included as illustrative examples. This setup varied in the number and order of displayed examples based on the retrieval method, exploring the impact of these factors on model performance through separate experiments.
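A hedged sketch of how such a prompt could be assembled from the task description and the retrieved examples (wording and JSON keys are illustrative, not the exact prompt used):

```python
import json

TASK_DESCRIPTION = (
    "You are given a sentence from a scientific publication. Perform named entity "
    "recognition and intention classification for software mentions. Entity types: "
    "Application, PlugIn, OperatingSystem, ProgrammingEnvironment. Return a JSON "
    "list of objects with the keys 'mention', 'type' and 'intention'."
)

def build_ner_prompt(examples, test_sentence):
    """examples: list of (sentence, annotations) pairs retrieved from the train set."""
    parts = [TASK_DESCRIPTION]
    for sentence, annotations in examples:
        parts.append(f"Sentence: {sentence}\nOutput: {json.dumps(annotations)}")
    parts.append(f"Sentence: {test_sentence}\nOutput:")
    return "\n\n".join(parts)
```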

Attributive NER. For Attributive NER, the prompt construction mirrored the approach taken in Subtask 1 but incorporated more detailed rules, emphasizing the necessity for attributes to relate directly to software entities. Known software entities were additionally provided as input, even for test instances, adhering to the oracle setup defined in the shared task guidelines. This method aimed to refine the model’s ability to extract and classify attributive information accurately.

Relation Extraction. The approach to Relation Extraction reimagined the task as a series of single-choice question-answering challenges. Each potential relation, given the software and attributive entities within a sentence, was listed, with claims formulated for each possible relationship (e.g., “IBM is the developer of Windows”). The model was tasked with identifying the veracity of each claim through a single-choice question format, where all claims were enumerated and solutions provided in a batch format for each example sentence. This setup culminated in presenting the test instance alongside its single-choice questions, expecting the model to deliver decisions on the relational claims.

5.3 Train Sample Retrieval for Few-Shot Generation

Our experimental framework explored various methods for test sample retrieval to ascertain the most effective approach in enhancing model performance. These methods included the use of random illustrative samples to represent every possible signature, retrieval based on entity similarity, and retrieval based on sentence similarity. Each method was evaluated against a baseline to determine its impact on the accuracy and efficiency of the information extraction tasks at hand.

5.4 Relation Extraction Baseline

In the development of our baseline for relation extraction, we established a robust heuristic framework derived from an analysis of potential relations indicated within the text. Our strategy was guided by two principal rules aimed at simplifying the decision-making process for identifying accurate relationships between entities. Firstly, we limited our consideration to relations that necessitate the presence of at least one related entity, deliberately excluding optional inter-software entity relations such as “specification_of” and “PlugIn_of.” Given the infrequent occurrence of these cases within the training dataset, we anticipated only a minimal impact on the overall performance metric, specifically the weighted mean macro F1 score for the subtask. Secondly, for all remaining relation types, our approach favored selecting the closest possible entity positioned to the left of the focal software entity as the most likely relation partner. This heuristic was not only straightforward but proved to be highly effective, aligning our baseline performance with that of the top contenders in the shared task. This methodology highlights the potential of leveraging simple, rule-based strategies to achieve competitive results in complex relation extraction challenges.
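A minimal sketch of the closest-left-entity rule, assuming entities are represented as character spans (the data representation is illustrative):

```python
def nearest_left_partner(software_start: int, attribute_spans):
    """attribute_spans: list of (start, end, label) character spans in the sentence.
    Returns the attribute entity closest to the left of the software mention,
    which the baseline picks as the relation partner."""
    left = [span for span in attribute_spans if span[1] <= software_start]
    return max(left, key=lambda span: span[1], default=None)

sentence = "We used version 8 of SPSS [1]."
# "8" spans 16-17, "[1]" spans 26-29; the software "SPSS" starts at 21.
print(nearest_left_partner(21, [(16, 17, "Version"), (26, 29, "Citation")]))
# (16, 17, 'Version')
```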

6 Results

Table 1. Results of Different Models and Retrieval Methods on Subtask 1

Our analysis in Subtask 1 (Software NER) shows a varied performance landscape across different models and retrieval methods (Table 1). The fine-tuned baseline, which uses SciBERT, achieved a solid foundation with a 59.9% F1 score. However, LLMs that use random samples without fine-tuning showed a decrease in performance, with the highest F1 score reaching only 57.4%.

A closer examination of retrieval-based models indicates that LLMs perform better with retrieved examples. Sentence-similarity retrieval achieved the highest F1 score of 67.9%, while the best entity-retrieval configuration reached 67.7% F1. The transition from GPT-3.5 to GPT-4 models resulted in a significant improvement of approximately 3–5%, although it required around three times more computation time. Notably, when utilizing SciBERT for sentence selection, our best models performed within a mere 3% of the theoretical maximum attainable with oracle positive sentences.

In Subtask 2 (Attributive NER), our methodology showed a significant improvement of +10% in F1 performance compared to our competitors, demonstrating the effectiveness of our approach in a low data regime (Table 2). For Subtask 3 (Relation Extraction), our LLM Single-Question Answering model further improved the F1 score by 5.1%, building on the already competent performance of the heuristic baseline and highlighting the advantage of our method. Furthermore, it has been demonstrated that using 7–10 samples is the most effective strategy, as it optimises the balance between input complexity and model performance.

This analysis highlights the potential of utilising advanced LLM techniques and carefully selected retrieval methods to significantly improve the accuracy and efficiency of NER tasks in specialised domains.

Table 2. SOMD Performance Rankings (Weighted Average Macro)

7 Conclusion

Our research on enhancing Relation Extraction (RE) with LLMs through Single-Choice QA has introduced a novel intersection of methodologies aimed at improving the precision of information extraction in the context of scientific texts. By integrating Retrieval-Augmented Generation (RAG) with LLMs, fine-tuning a pre-trained language model such as SciBERT, and using that model to support the GPT variants, we have demonstrated the capability of LLMs to navigate the complexities inherent in the extraction of software entities and their attributes.

The exploration of different retrieval strategies, ranging from entity to sentence similarity, underscores our commitment to refining the inputs for generative models, ensuring that they are fed the most relevant and contextually appropriate data. This meticulous preparation has allowed us to significantly boost the performance of LLMs in recognising nuanced distinctions among software-related entities and accurately extracting relation types within scholarly articles. Our experiments have not only highlighted the efficacy of LLMs in addressing domain-specific tasks with a limited set of examples but also revealed the inherent challenges, such as the difficulties in matching entity mentions accurately. Despite these hurdles, our single-choice QA approach for RE builds on a strong heuristic baseline for relation extraction and shows how reframing the task simplifies the problem.

The outcomes of our research indicate a promising direction for future work in leveraging LLMs for NER and RE tasks. The development of our system for participation in the SOMD shared task has illustrated the potential of a single-choice QA approach to relation extraction, offering a structured and scalable method for extracting meaningful insights from textual data. Our findings contribute to the growing body of knowledge on the application of generative models in the field of computational linguistics, paving the way for more sophisticated and efficient methodologies in information extraction from scientific literature.