1 Introduction

Research software is a critical instrument in contemporary science: it provides the computational capacity that extends human abilities to observe and investigate phenomena and to acquire new knowledge from ever-growing volumes of data [19]. As a result, software plays an increasingly important role in data-driven science [40] and is regarded as a “first-class research object” by scientists in a growing list of research domains [6]. Over the past decade, there has been significant progress in developing research infrastructure to support publishing, using, and crediting research software [2, 39], which in turn supports empirical investigations into the impacts of research software and its roles in scientific research [27, 34]. All these efforts are believed to contribute to a fairer and more transparent scientific system [14].

Despite this progress, a major gap remains: we lack a granular understanding of the links between scientific publications and research software, i.e., how software is cited in scientific publications. This question is central to citation context analysis, developed in the field of scientometrics, which examines the different types of contexts in citation sentences, such as the sentiment, function, or level of importance of individual citations [43]. This approach can reveal more granular reasons behind citations and impact and hence contributes to a deeper understanding of how credit and knowledge flow between publications [9]. By the same token, when this method is applied to research software, we can understand not only how many times a software object is cited in publications but also why it is cited. This knowledge is critical for constructing a new research infrastructure for research software and for evaluating research software and its developers.

A large number of citation context classification schemes have previously been proposed, focusing specifically on citations between scientific publications (e.g., Scite [31]). However, we argue that these schemes cannot sufficiently explain why software is cited in scientific publications, since the reasons for citing a software package differ from those for citing a paper. To date, very few schemes have been proposed for citations of research software in scientific publications, the notable exceptions being SoftCite [12] and SoMeSci [37]. We believe it is vital to extend these efforts by (1) developing a new classification system for citation intents of research software that builds on existing work and (2) applying and testing the system on new software mention datasets.

In this paper, we present our preliminary results, including (1) a new classification system for software citation intents in full-text scientific publications and (2) a performance assessment of machine learning algorithms, in particular large language models, for classifying the citation intent of software-mention sentences. We compile a new dataset that can be used for software citation intent analysis by aggregating and annotating the SoftCite [12] and SoMeSci [37] datasets. We evaluate the performance of the machine learning models on a subset of a recent large-scale software name mention dataset published by the Chan Zuckerberg Initiative (CZI) [22]. More specifically, we examine the intent of informal citations to software. Following previous research [35], formal and informal citations of software in scientific publications are defined as mentions with and without an official citation to the software, respectively. This project, to our knowledge, is the first effort to identify and classify contexts of software mentions (or informal software citations) from full-text publications. It is our hope that our results will improve existing research infrastructure for empirical studies on research software.

2 Related Work

2.1 Research Software Studies

Software has become a cornerstone of contemporary scientific systems, due to the large quantity of data available to researchers and the computational resources required to analyze such data [10]. Given this elevated importance, research software should be treated as a “first-class research object” just like research articles, which requires infrastructure to publish, peer-review, reuse, and cite software entities [6]. Among these requirements, giving and tracing citations to software is central to assigning credit for software development activities and to motivating researchers to develop and publish software [13].

Software citation is a highly challenging issue in the scholarly communication system: various empirical studies have found that software is inconsistently cited, if cited at all, in scientific publications [20, 28]. In addition, when software is cited, the relationship between citations and software is often complicated, making it very hard to trace how a specific piece of software is cited [26]. These findings have inspired recent efforts to develop software citation principles, particularly principles aligned with the FAIR Principles [2, 39].

Despite this progress towards a more robust software citation infrastructure, it is commonly accepted that informal citations to software, or software mentions, are critical for investigating the links between scientific publications and software [38]. This approach is reflected in recent efforts to publish large-scale software name datasets extracted from full-text publications, especially the CZI dataset covering close to 4 million open-access scientific publications [22], as well as similar datasets [12]. These open datasets will undoubtedly enable new empirical research on this critical topic, including the present study, and thus help promote the openness of science.

2.2 Citation Intent Classification

Citations have long been regarded as a gold standard for measuring impact within the scientific system, as they represent an author’s intellectual debts to other authors [24, 29]. Based on this normative theory, it is possible to construct a systematic evaluation system by collecting the citations between all documents, which is the idea behind the Science Citation Index (SCI) as well as many other scientific evaluation systems [16]. Despite the proven effectiveness (at least to some extent) of using citations to evaluate scientific impact [4, 17], one concern with focusing solely on citation counts is that a citation can bear multiple semantic meanings. For example, Bruno Latour, the famous sociologist of science, made the following argument:

“[Sources] may be cited without being read, that is perfunctorily; or to support a claim which is exactly the opposite of what its author intended; or for technical details so minute that they escaped their author’s attention; or because of intentions attributed to the authors but not explicitly stated in the text.” [25]

Such challenges to classic citation analysis methods gave rise to a new line of research that focuses on the symbolic meanings of citations in full-text publications, often called citation context and content analysis [8]. In their review of this topic, Zhang et al. identified several important aspects of these symbolic meanings that have been analyzed, such as the sentiment, function, or level of importance of individual citations [43].

In addition, several important classification systems have been proposed over the past few decades to classify regular citations (especially those citing research articles), each with its own categories and considerations [23, 30, 31, 41]. However, Cohan et al. [7] argue that such classification systems are usually too fine-grained to support meaningful analysis, a concern that applies equally to software citations: having many fine-grained categories captures rare contexts but hinders a meaningful analysis of citation impact. More recent efforts, such as SoftCite [12] and SoMeSci [37], address this problem directly by developing citation context categories dedicated to software entities. However, both efforts are based on, and have only been tested with, limited samples of publications. Moreover, the citation context categories differ between the two datasets. As a result, we believe there is still a large gap in this field that can be addressed by our effort to apply advanced data science methods to classify software citation sentences under a common scheme.

3 Methods

We first defined a set of citation intent classes and created mappings for any existing informal software citation datasets with intention annotations. We then used the combined dataset to fine-tune multiple language models.

3.1 Citation Intent Classes

We reviewed multiple schemes for citation context and intention that have been proposed for both regular research publications and software. The respective schemes are listed in Table 1. Both the ACL-ARC and SciCite schemes were proposed for the intent classification of research article citations. For their work on the ACL Anthology Reference Corpus (ARC) [5], Jurgens et al. [23] propose six categories for citation function, unifying several previously proposed schemes. Their scheme focuses on how authors align their work with cited publications and maintains a higher granularity than ours for classifying indirect mentions such as background information or contrasting related work. We found that these kinds of mentions currently do not occur frequently enough to warrant splitting these classes further. For SciCite, Cohan et al. [7] similarly argue that previous datasets for citation intent often use overly fine-grained schemes. They propose instead to use only three categories, focusing on direct use and comparison of results, and regarding everything else as background information providing more context.

However, these schemes cannot be directly transferred to software citations. Unlike research articles, software can be cited because it was created as part of the work reported in the publication. Both the SoftCite [12] and SoMeSci [37] datasets provide annotations for the intent of software citations and propose software-specific schemes. Apart from software creation and usage, they both consider categories related to software publication, namely sharing and deposition.

Table 1. Citation Intent Schemes.

For the development of our citation intent scheme, we established the following two guiding principles:

  1. The scheme should be able to distinguish the most common and relevant types of citation intent.

  2. The scheme should exhibit high inter-annotator agreement so that it can be consistently applied by a human.

The first principle ensures that the resulting scheme is not too fine-grained, as this would hinder a meaningful analysis in future empirical research. The second principle allows multiple annotators to label a potentially large corpus consistently. Based on the principles and the previously described schemes, we propose the following three categories in our scheme for informal research software citation intent:

  • Creation: the paper describes or acknowledges the creation of a research software entity.

  • Usage: the paper describes the use of research software in any part of the research procedure, for any purpose.

  • Mention: the paper describes the research software for any other reason beyond the first two categories. Note that throughout the paper, we use “related” and “mention” interchangeably for this category.

The most important distinction in our scheme is between a sentence in a publication describing the creation of a piece of software and one describing its usage. To avoid confounding the annotation process or the analysis with rare labels, all other mentions are subsumed under the third category.

Similar to existing efforts, this scheme only considers functional intents, i.e., functional reasons for mentioning the software in publications, rather than other aspects of intent, such as sentiment and importance [43]. In contrast to the schemes proposed for SoMeSci and SoftCite, we deliberately did not include a category for sharing or deposition. We believe that distinguishing this citation intent from creation and usage is not strongly relevant to evaluating the impact of software mentioned in publications, especially since sharing software is often closely tied to creating it in the first place.

An important attribute of our scheme is that it is designed to be applied at the sentence level: the evaluation is made for each sentence in which a software entity is mentioned. Hence, a paper-software pair can have multiple citation intents if a software entity is mentioned multiple times in the paper. For creating our datasets, we decided that each sentence could only be classified into one category. In cases where multiple categories were applicable, we chose the category carrying more weight in evaluating impact, where Creation carries more weight than Usage, which in turn carries more weight than Mention. Note that since we consider only one citation intent per sentence, we assume that even if multiple software entities are mentioned in the same sentence, they are mentioned with the same intent. In rare corner cases, such as multiple software entities being mentioned in the same sentence with different intents, this assumption does not hold, and our findings do not apply.
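To make this precedence rule concrete, the following minimal sketch (in Python; the function and label names are ours and not taken from any released code) shows how multiple applicable labels could be collapsed into a single intent:

```python
# Collapse multiple applicable intent labels into one, following the
# precedence Creation > Usage > Mention described above.
# Label strings and function name are illustrative only.
PRIORITY = ["creation", "usage", "mention"]  # highest evaluative weight first

def collapse_labels(candidate_labels):
    """Return the single highest-priority intent among the candidates."""
    for label in PRIORITY:
        if label in candidate_labels:
            return label
    return "mention"  # fall back to the catch-all category

print(collapse_labels({"usage", "mention"}))  # -> "usage"
```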

3.2 Data

We wanted to re-use existing datasets as much as possible and build on top of previous work, rather than create new datasets and define new gold standards. This is why we chose to build on the SoftCite [12] and SoMeSci [37] datasets by merging them into one representative dataset that can be used for analyzing software citation intent. The datasets, for the most part, consist of single sentences that contain a software mention (an informal citation, i.e., a verbal reference to software; a formal citation combines such a reference with an included URI or an official citation, such as a literature reference) and their corresponding labels, which vary between the datasets. Fully consolidating these similar yet slightly different datasets was outside the scope of this work. However, given our decision regarding citation intent classes outlined above, we had to make a few adjustments to the existing labels in the provided datasets, which we did through manual curation. Table 2 shows the mappings between our proposed scheme and the software citation intent schemes used in the SoMeSci and SoftCite datasets. From the SoftCite dataset, we transferred the labels Used and Created directly to our Usage and Creation classes and mapped most of the Shared-labeled data to Creation. After careful consideration and debate between multiple annotators, we moved some records that had multiple labels or no labels at all into our Mention category. For the SoMeSci dataset, we transferred the Usage, Mention and Creation labels directly to our corresponding labels. We disregarded entries with a label of Deposition.
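As an illustration of this label harmonization, the dictionaries below sketch the mappings of Table 2 (the exact label spellings are our assumption; records with ambiguous or missing labels were resolved by manual curation, which this sketch does not reproduce):

```python
# Illustrative mappings from the source datasets' labels to our three classes.
SOFTCITE_MAP = {
    "used": "usage",
    "created": "creation",
    "shared": "creation",   # most Shared records were folded into Creation
}

SOMESCI_MAP = {
    "Usage": "usage",
    "Mention": "mention",
    "Creation": "creation",
    # Deposition entries were discarded, so they are intentionally absent here.
}

def map_label(source: str, label: str):
    """Map a source-dataset label to our scheme; None means manual curation is needed."""
    table = SOFTCITE_MAP if source == "softcite" else SOMESCI_MAP
    return table.get(label)
```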

Table 2. Mappings to other software citation intent schemes.

As part of data curation, we created a pipeline that downloaded all available full texts of the papers in the two datasets (SoftCite [12] and SoMeSci [37]) via the PMC API [42], in order to augment the existing sentence-level data with an expanded citation context of three sentences surrounding the citation: the leading, citing, and trailing sentences. After all pre-processing, we ended up with a single dataset consisting of 3188 software citations, each labeled as Creation, Usage, or Mention, along with the sentence in which the software mention occurs and the corresponding citation context (leading, citing, and trailing sentences).
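The snippet below is a minimal sketch of this kind of pipeline, assuming retrieval through the public NCBI E-utilities endpoint for PMC and a pre-computed sentence segmentation; it is not the exact pipeline we used, and the helper names are hypothetical:

```python
# Sketch: fetch a PMC article's full-text XML and build a three-sentence
# citation context (leading, citing, trailing) around a citing sentence.
import requests

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_pmc_xml(pmcid: str) -> str:
    """Download the JATS XML of an open-access PMC article (e.g. 'PMC1234567')."""
    params = {"db": "pmc", "id": pmcid.lstrip("PMC"), "retmode": "xml"}
    resp = requests.get(EFETCH, params=params, timeout=30)
    resp.raise_for_status()
    return resp.text

def context_window(sentences: list[str], citing_index: int) -> dict:
    """Return the leading, citing, and trailing sentences for one mention."""
    return {
        "leading": sentences[citing_index - 1] if citing_index > 0 else "",
        "citing": sentences[citing_index],
        "trailing": sentences[citing_index + 1] if citing_index + 1 < len(sentences) else "",
    }
```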

Some examples from the combined training dataset can be found in Table 3. In addition, we augmented the dataset with 1,000 sentences containing no software mentions, sampled randomly from the set of sentences that were not tagged with a software mention; these serve as negative training examples. The distribution of labels and of the number of words in the contexts of the training dataset is shown in Fig. 1.

Table 3. Examples from the training data
Fig. 1. Training Dataset Data Distribution

We used the combined dataset of 4,188 sentences to train the language models. We split the dataset 80/20 for training and testing in order to facilitate a reasonable comparison between models, and evaluated the models on the held-out test portion that the models had not seen during training. Moreover, we had an additional dataset of 210 samples curated by the Chan Zuckerberg Initiative (CZI). This dataset is a subset of the CZI Software Mentions Dataset [21] and was manually curated by CZI annotators, who reviewed sentences containing mentions of software names; the dataset was initially curated using a more granular intent classification, which was subsequently mapped to the intent classes described above (Creation, Usage, Mention). Note that since the original CZI Software Mentions Dataset was not annotated with intent classes, this was done manually by CZI bio-curators after the initial dataset had been published. Because of the effort required and the size of the initial dataset, we were only able to use a subset for evaluation. One annotator initially classified the sentences, and an additional annotator resolved any conflicts at a later stage.
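A minimal sketch of the 80/20 split is shown below (file and column names are placeholders, and the use of stratification is our assumption; the text does not state whether the original split was stratified):

```python
# Sketch: split the combined dataset 80/20 into training and test portions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("combined_software_citations.csv")  # hypothetical file name
train_df, test_df = train_test_split(
    df,
    test_size=0.2,          # 20% held out for testing
    stratify=df["label"],   # assumption: keep class balance across splits
    random_state=42,
)
```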

3.3 Training Models

We explored fine-tuning several BERT [11] models, as well as GPT-3.5 [32] and GPT-4 [33], in various training settings.

BERT Models. We studied four different BERT models, namely BERT [11], DistilBERT [36], SciBERT [3] and PubMedBERT [18]. BERT [11] was pre-trained on BookCorpus [44] and English Wikipedia [15]. It is well suited for fine-tuning on downstream tasks that operate on full sentences, e.g., text classification. DistilBERT [36] is a smaller and thus faster BERT model; it was pre-trained on the same corpus using knowledge distillation with BERT as its teacher. SciBERT was pre-trained on a corpus of scientific texts from Semantic Scholar [1] and was found to outperform BERT on tasks and datasets in the scientific domain. PubMedBERT [18] was in turn trained on biomedical papers, specifically abstracts from PubMed and full-text articles from PubMed Central [42]. Hence, it is tailored to tasks in the biomedical domain.

We used the model architectures provided by the Hugging Face Transformers library and fine-tuned all four models for text classification. The fine-tuning setup was identical across models, with the same hyperparameters (epochs=10, learning_rate=2e-5, weight_decay=0.01).
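A sketch of this shared fine-tuning setup, using the Hugging Face Transformers Trainer, is shown below; the checkpoint name, file names, and column names are placeholders, and only the hyperparameters listed above are taken from the text:

```python
# Sketch: fine-tune a BERT-family checkpoint for 4-way citation intent
# classification (creation, usage, mention, none) with the stated hyperparameters.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)

# CSV files are assumed to have a "sentence" column and an integer "label" column.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="citation-intent",
    num_train_epochs=10,
    learning_rate=2e-5,
    weight_decay=0.01,
)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=data["train"], eval_dataset=data["test"])
trainer.train()
```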

GPT-3.5/4. We also investigated GPT-3.5 and GPT-4 in three different learning settings: zero-shot learning, few-shot learning, and fine-tuning. In zero-shot learning, the model is only given a description of the task before solving it; specifically, we used the system message shown in Listing 1.1 to instruct the model. In few-shot learning, the model receives a similar instruction followed by a handful of examples of expected interactions between the user and the assistant (i.e., the model). The model is supposed to learn from these few examples how to generalize to new data. For this, we sampled five examples from each class (creation, usage, mention, and none) and provided them to the model. The corresponding prompt is shown in Listing 1.2.

Listing 1.1. System message used to instruct the model in the zero-shot setting.
Listing 1.2. Few-shot prompt including sampled example interactions.
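The sketch below illustrates how such zero- and few-shot requests can be issued through the OpenAI chat API; the system message paraphrases Listing 1.1 rather than reproducing its exact wording, and the prompt format is an assumption on our part:

```python
# Sketch: zero-shot (no examples) and few-shot (example turns prepended)
# classification of a software-mention sentence with the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM = ("You classify the intent of a software mention in a sentence from a "
          "scientific paper as one of: creation, usage, mention, or none.")

def classify(sentence: str, examples=()):
    messages = [{"role": "system", "content": SYSTEM}]
    for example_sentence, label in examples:   # empty tuple -> zero-shot
        messages.append({"role": "user", "content": example_sentence})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(
        model="gpt-4", messages=messages, temperature=0)
    return response.choices[0].message.content
```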

Fine-tuning can further improve a model’s performance: instead of a handful of examples, a larger training set is provided. We fine-tuned GPT-3.5 on the sentence alone and on the full context, using the same dataset as for the BERT models. Both variants were fine-tuned for a total of 5 epochs using the OpenAI API and the “gpt-3.5-turbo” model. The OpenAI API does not allow much hyper-parameter tuning besides the number of epochs, so we used the API’s default settings. Since the model can return answers that do not fit into one of the provided classes, we post-process the answers by lowercasing them and stripping punctuation marks.
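As a rough sketch (assuming the current OpenAI Python client; the training-file ID is a placeholder), launching the fine-tuning job and normalizing the model’s answers might look as follows:

```python
# Sketch: create a GPT-3.5 fine-tuning job for 5 epochs and normalize answers.
import string
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    training_file="file-XXXXXXXX",     # placeholder: a previously uploaded JSONL file
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 5},   # other hyperparameters left at API defaults
)

LABELS = {"creation", "usage", "mention", "none"}

def normalize_answer(answer: str) -> str:
    """Lowercase and strip punctuation; anything unexpected is flagged."""
    cleaned = answer.lower().strip().strip(string.punctuation)
    return cleaned if cleaned in LABELS else "unrecognized"
```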

4 Results

We evaluated the models on a 20% test split of the training data (Table 4), as well as on the additional CZI Validation Dataset (Table 5). The evaluation metrics used to assess model performance were precision, recall, F1-score, and overall accuracy, both at the individual label level and in aggregate. In Table 4 we also include the metrics reported for the classification of intent classes by SoftCite [12] and SoMeSci [37]. These metrics are extracted from the corresponding papers and are evaluated on different datasets than the ones in this paper. Note that since those classifiers are neither evaluated on the same data nor trained to predict the same citation intent classes, the results are not necessarily comparable; nonetheless, we report the metrics where applicable. For example, the SoftCite paper [12] only reports the performance of a trained citation intent classifier for the used and not used categories, so we show the metrics for the used category, mapping it to our Usage category. For the SoMeSci [37] dataset, the paper reports metrics for the following classes: Allusion, Usage, Creation and Deposition. We map Allusion to our Mention category and map the Usage and Creation classes directly. We do not report the Deposition metrics, since we discarded this category in our own scheme.
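For reference, the sketch below shows how these metrics can be computed with scikit-learn on toy labels; the macro averaging is an assumption, since the text does not state how the aggregate scores were averaged:

```python
# Sketch: overall and per-label precision, recall, F1, and accuracy.
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

y_true = ["usage", "creation", "mention", "usage"]   # toy gold labels
y_pred = ["usage", "usage", "mention", "usage"]      # toy predictions

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} Acc={accuracy_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred, zero_division=0))  # per-label scores
```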

4.1 Results of BERT Models

As shown in Table 4, all models achieve high scores across metrics and categories on the test split, with PubMedBERT outperforming the others by a small margin. As seen in Table 5, model performance on the CZI Validation Dataset drops across the board, with DistilBERT outperforming the other BERT models. Given the moderate size of the training dataset, this may imply that a lighter architecture such as DistilBERT generalizes better on this dataset, achieving more balanced results, especially when compared with the original BERT.

For the category-specific results, classification performance is related to the availability of labels in the training and validation data: all models perform best on the Usage label, which is the most frequent in both the training and validation sets, and worst on the Creation label. Notably, though, DistilBERT is the only model to achieve non-zero scores in the Creation category across all three metrics, highlighting its unique capability to identify and classify this particularly challenging category. For the Mention category, performance drops for all BERT models, with PubMedBERT outperforming the other BERT models; the same trend can be observed for the test split in Table 4. By definition, this category encompasses more varied instances than the other two, which might be why the models struggle to recognize it consistently. In the Usage category, SciBERT and PubMedBERT perform best.

4.2 Results of GPT-3.5/GPT-4

In general, fine-tuning at the sentence level achieves the best results for GPT-3.5 on both the test split and the CZI Validation Dataset. The GPT-3.5 model fine-tuned at the sentence level achieves the highest performance on the challenging CZI validation set (P = 0.571, R = 0.531, F1 = 0.545, Accuracy = 0.881), surpassing all BERT models as well. Fine-tuning GPT-3.5 on the entire context (containing the leading, citing, and trailing sentences) leads to a decrease in performance. This suggests that feeding the model the entire context around the sentence does not help it learn more about the intent of citing the software in that sentence. The same observation holds for both GPT-3.5 and GPT-4 few-shot models. None of the few-shot and zero-shot approaches for GPT-3.5 or GPT-4 came close to the performance of the fine-tuned models, which suggests that despite their general capabilities, these models cannot reliably classify software citation intent without additional training data. We observe that GPT-4 models tend to outperform GPT-3.5 models in both zero- and few-shot settings. Notably, however, both GPT-3.5 and GPT-4 few-shot models generally tend to do worse than their zero-shot counterparts. We have not investigated in detail why this happens, but it is interesting to note that for this task, learning from a few examples is detrimental, whereas learning from many (i.e., fine-tuning) is helpful.

Table 4. Evaluation of Different Models on the Test Split. We assessed the overall Precision (P), Recall (R), F1-score (F1), and Accuracy (Acc) across the entire test split. Additionally, we analyzed the Precision, Recall, and F1-score for each label within every model. For comparison, we attach the metrics reported for classification of intent classes by SoftCite [12], as well as SoMeSci [37]. These metrics are extracted from the corresponding papers and are evaluated on different datasets than the ones in this paper.

Inspecting per-class performance, we observe that, similarly to the BERT models, GPT models do very well on predicting the Usage and Unlabeled (i.e., no software mention) classes, achieving precision, recall, and F1 scores above 0.9 on both the test split and the CZI Validation Dataset, and struggle the most with the Mention class. This makes sense given that any cited software is, after all, mentioned in the paper; although we instruct the model to predict this label only when neither Usage nor Creation applies, the category remains ambiguous. Some examples of model mistakes on the CZI Validation Dataset can be seen in Table 6.

Table 5. Evaluation of Different Models on the CZI Validation Dataset. We assessed the overall Precision (P), Recall (R), F1-score (F1), and Accuracy (Acc) across the entire validation dataset. Additionally, we analyzed the Precision, Recall, and F1-score for each label within every model.
Table 6. Examples of mistakes made on the CZI Validation Dataset by the GPT-3.5 model fine-tuned at the sentence level.

5 Data and Code Availability Statement

Training scripts are available in our GitHub repository: https://github.com/karacolada/SoftwareImpactHackathon2023_SoftwareCitationIntent. BERT fine-tuning scripts can be found under BERT_finetuning, and all GPT scripts, including zero-shot, few-shot, and fine-tuning, under GPT_models. The merged dataset, together with the training, validation, and test splits, the GPT-3.5-formatted data, and the CZI Validation Dataset, can be found under the data folder, together with extra documentation and a README file.

6 Discussion

The determination of software citation intent requires a system that not only identifies the entity but also understands the semantic relationships provided by the context around it. In this study, we focused on the latter part of such a system by investigating which model types can effectively learn the intent of authors from the way they report software. Prior work by the SoftCite [12] and SoMeSci [37] groups offered valuable datasets that we combined and normalized to a simple scheme. This corpus allowed for fine-tuning and experimentation with various flavors of BERT, GPT-3.5, and GPT-4 models. Our intuition was that these models would be able to accurately characterize these intents. One interesting finding was that including the full context in classification seemed to hurt model performance. This insensitivity to extra contextual clues indicates that intent can typically be determined in close proximity to the mention of the software entity. Further text analysis may elucidate exactly what type of language is characteristic of each intent class: a quick word frequency count over sentences of the “creation” class identifies “software”, “available”, “http”, “developed”, and “source” as the most common words. A more thorough analysis of word patterns across all intent types could improve the classification system (Fig. 2).
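A minimal sketch of such a word-frequency check is shown below (tokenization and stop-word handling are deliberately simplified, and the example sentence is invented):

```python
# Sketch: count the most frequent words in sentences labeled "creation".
import re
from collections import Counter

def top_words(sentences, n=5):
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts.most_common(n)

creation_sentences = [
    "The software is freely available at http://example.org and was developed in C++.",
]
print(top_words(creation_sentences))
```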

Fig. 2. GPT-3.5 fine-tuned model: comparison of true and predicted label distributions. The y-axis represents counts and the x-axis the label categories.

To test the full breadth of capabilities of the GPT models, we ran experiments with zero-shot, few-shot, and fine-tuned approaches. None of the few-shot or zero-shot approaches for GPT-3.5 or GPT-4 achieved performance comparable to the fine-tuned models, which means that software citation intent classification is not a task these models can do out of the box without additional training. Adding example cases to the prompts in the few-shot setting yielded a decrease in both precision and recall over the evaluation sets compared to the zero-shot setting. Beyond this preliminary work, future experiments will have to test different versions of the prompts to further probe this unexpected behavior, as prompt engineering was not an extensive part of this work. Fine-tuning the GPT-3.5 model generated results comparable to the BERT models. We did not experiment with fine-tuning a GPT-4 model because this process was not available to the general public at the time of writing, but we would expect a fine-tuned GPT-4 model to achieve even higher performance. The easiest category to predict was Usage, followed by Creation. The Mention category was the hardest for models to learn to predict well, which makes sense given that software falling into the other classes could also be described as merely mentioned. While trained on different intent categories and data, our best models surpass the metrics reported by SoftCite and SoMeSci on the Creation and Mention categories and are comparable on the Usage category.

Fine-tuned models generally exhibited high performance on the test set. However, despite our best efforts to identify and resolve systematic differences between the CZI validation set and our test set, most models were not able to achieve similar performance on the former; the only competitive performance came from the fine-tuned GPT-3.5 model. Given that this is a challenging dataset drawn from a different distribution than the training data, this speaks to the ability of the GPT family of models to generalize and find nuance in ambiguous text, compared to BERT models. Further experiments will require a proper quality and error analysis of the validation set.

7 Conclusion

In conclusion, this preliminary work presents a new scheme for the classification of software citation intent in scholarly research, along with insights into the use of large language models for classifying scientific software citation intent. Building on prior work in this research space, we offer an aggregated and normalized corpus that can be used to train and evaluate machine learning models tasked with classifying the mention, usage, and creation of software in text. We present, to the best of our knowledge, the first study to use large language models to predict software citation intent. Identifying these intents strengthens the link between research software and scientific publications. We believe this work establishes a foundational framework for exploring the under-examined area of scientific software citation intent.