Introduction

Citation analysis is a branch of bibliographic analysis that studies how connections between academic publications are established in terms of which publication cites and which is cited (Nicolaisen, 2008). Citation analysis has become a widespread practice for measuring the impact of academic publications. Hlavcheva and Kanishcheva (2020) stated that an academic publication's impact comes from several directions, such as the impact of the researcher, the impact of the group or institution, the local or global academic ranking, and the quality of the publication, all of which are measured by citation counts. In this setting, citation counting involves calculating the number of times a document is cited by other documents and is performed through bibliometric databases. However, no single database gathers all publications together with their cited references, so the analysis must consider several database options, such as Web of Science (WoS), Scopus, and Google Scholar. Several measurements derived from citation analysis, e.g., the h-index as a personal metric or the impact factor as a journal metric, are widely used as impact indicators.

Despite these benefits, measuring publication impact using citation counts has drawn intense criticism, because counting assumes that all citations have an equal impact on the cited publication. In fact, not all citations are equal and should not be treated equally (Valenzuela et al., 2015). Treating citations as always being a positive endorsement of the cited references is problematic because citations are often made to express disapproval of the cited references. Moreover, citation counting fails to capture contextual information (Hirsch, 2005; Mercer et al., 2014), which encodes several citation functions, such as giving background, using the work, making a comparison, or criticizing. Focusing on research papers, this contextual information can be used to dig deeper into a paper. Authors of research papers use citations to show the position of their research in the broader literature (Lin & Sui, 2020). Citation functions indicate the research's novelty (Tahamtan & Bornmann, 2019) and quality (Raamkumar et al., 2016), and help authors understand the big picture of a given topic (Qayyum & Afzal, 2018). Furthermore, citation functions allow a research paper to obtain a higher impact when it is used, approved, and supported by other works, and a lower impact when other works merely mention it. Thus, incorporating citation functions as contextual information deserves serious attention to enrich the impact analysis of scientific publications.

There is growing interest in the automatic identification of citation functions (Pride & Knoth, 2020). This trend stems from the fact that authors' citations play both important and non-important roles (Nazir et al., 2020). According to Zhu et al. (2015), previous works are considered influential if they inspire the authors to propose their solutions, whereas incidental citations refer to previous works that do not have a significant impact on the proposed research. In this domain, the terms important and non-important (Valenzuela et al., 2015) are equivalent to influential and incidental (Pride & Knoth, 2020). However, most previous works relied on a small number of citation instances or considered few types of labels. In addition, existing works have suffered from a lack of research variety: most were developed based on natural language processing (NLP)-related papers. Consequently, several potential citation functions remained unidentified.

The contribution of this paper consists of two parts. In the first part, we propose a new annotation scheme covering citation functions that have not been accommodated in previous works. Our proposed scheme covers all computer science (CS) fields on arXiv from its beginning to December 31, 2017. This paper uses well-organized parsed sentences of research papers from Färber et al. (2018) and selects 1.8 million raw citing sentences. Based on 5,668 randomly selected instances, we developed the proposed annotation scheme in three stages, i.e., top-down analysis, bottom-up analysis, and an annotation experiment. Completing the first two stages revealed potential new labels. We found five fine-grained labels related to the background role of cited papers that were not proposed by existing works: definition, suggest, technical, judgment, and trend. In addition, we found three new labels defining the role of a cited paper, i.e., cited_paper_propose, cited_paper_result, and cited_paper_dominant. Our final scheme consists of 5 coarse and 21 fine-grained labels. Following this, annotation experiments were conducted involving two annotators on 421 samples. We use Cohen's Kappa (Cohen, 1960) to validate the results of the annotation experiments.

The second part of our contribution is to build a dataset of citation functions through a semiautomatic approach. This approach was chosen because manual labeling is time-consuming and requires enormous human effort. The proposed method consists of two development stages. In the first stage, we build two classification tasks, i.e., filtering and fine-grained classification. The filtering task eliminates nonessential labels, and the fine-grained task categorizes the details of the essential labels. In both tasks, we implement classical machine learning and deep learning approaches. Because of the small number of manually labeled instances, pretrained word embedding methods are considered here. In addition, this paper demonstrates pool-based active learning (AL) as a low-resource scenario. In the second stage, labels are assigned to all unlabeled instances using the best models from both tasks of the previous stage.

At the end of this research, this paper delivers several contributions:

  • The annotation scheme for citation functions consists of five coarse and 21 fine-grained labels.

  • The validity of the scheme is demonstrated by Cohen’s Kappa values of 0.85 (almost perfect agreement) for coarse labels and 0.71 (substantial agreement) for fine-grained labels.

  • The low-resource scenario-based AL achieves competitive accuracies on less than half of the training data.

  • While Bidirectional Encoder Representations from Transformers (BERT)-based AL outperformed other methods in the filtering task, SciBERT reached competitive performances compared to non-AL methods in the fine-grained stage.

  • Considering the number of labels, we released the largest dataset, consisting of 1,840,815 instances.

The rest of this paper is organized as follows. The “Related works” section describes existing works covering three parts, namely, the annotation schemes of citation functions, the research papers’ argumentative structure, and the detection of citation functions. Next, the section “Building the dataset of citation functions” discusses how our dataset is developed. This section covers several points, i.e., scheme development, scheme comparison, annotation strategy, and text classification strategy. The section “Experiment results” presents annotation and text classification experiments, including the released dataset. Finally, in the “Conclusion and future work” section, we present other notable findings from the conducted experiments.

Related works

This section contains a review of existing works related to several points, i.e., the annotation schemes of citation functions, the argumentative structures of scientific papers, the datasets of citation functions, and the automatic identification of citation functions. For consistency, this paper uses the following terminology: a citing paper is the authors’ own work, a cited paper is a previous work cited by the citing paper, a citing sentence is a sentence containing citation marks, and a citation function is the reason behind a citation.

Citation function labels

The review was conducted on previous works proposing their annotation schemes. During the review, we found two major categories of citation functions, i.e., coarse label (general) and fine-grained label (detail). While several works provided both categories, other works provided a single category, either coarse or fine-grained label. The existing annotation schemes of citation functions are shown in Table 1.

Table 1 Existing works on annotation schemes of citation functions

We report several notable observations from reviewing previous works on citation functions. The review shows that most of the schemes were developed using NLP-related papers. The paper data sources were dominated by the ACL Anthology, but several works used other sources such as the NIPS Proceedings, PubMed, SciCite, and the Computation and Language E-Print Archive. However, we identified two works that developed their schemes based on multi-disciplinary research papers. In addition, instead of proposing new annotation schemes of citation functions, several works reproduced existing schemes. Turning to the developed schemes, most existing works include citation functions related to background, use-related, and comparison-related labels.

Reviewing the labeling schemes of citation functions in previous works reveals several drawbacks.

  • Most existing works developed only a few types of labels, and the labels were too generic. Casey et al. (2019) proposed detailed labels; however, these labels were designed not only for citing sentences but also for other sentences in the Related Work section. As a consequence, several potential citation functions remain unidentified.

  • The labels developed in previous works were domain-specific, since they were created based on Natural Language Processing (NLP)-related papers. As a result, the compatibility of the labels becomes an issue when they are applied to broader computer science domains. We identified two works that developed labels based on multi-disciplinary fields (Pride & Knoth, 2020; Tuarob et al., 2019), but their scopes are few and too generic, with 8 labels and 4 labels, respectively. In addition, when labels are developed for a wide-ranging domain, for example one spanning computer science and non-computer science fields, it is difficult to justify that they are accurate enough for comprehensively analyzing a research paper, because each domain has its own style of argumentative structure.

To handle these issues, this paper proposes a new labeling scheme of citation functions drawn from multiple fields within the computer science domain. By accommodating the variety of citing sentences across multiple fields while keeping the scope within computer science, our proposed labels arguably provide more comprehensive coverage for future citation function-related analysis tasks.

Research paper argumentative structure

The argumentative structure represents how information is presented, discussed, and motivated. This structure is useful to justify the scientific claim, state the existing trend, and guarantee research reproducibility (Alliheedi et al., 2019). It is worth discussing argumentative structures in this paper since our proposed annotation scheme naturally contains argumentative labels.

Argumentative structures can be applied at the section level or the sentence level. Sollaci and Pereira (2004) presented a study on the adoption of section-level categories, namely, introduction, methods, results, and discussion (IMRAD). This scheme was first used in the 1940s, and since the 1980s it has become the only pattern adopted in health papers. The IMRAD scheme is considered generic since authors use it to structure a paper’s sections. Teufel et al. (1999) developed the first version of Argumentative Zoning (AZ-I) as a sentence-level category. AZ-I consists of seven labels based on 48 computational linguistics papers. AZ-I was later extended using 30 Chemistry papers and 9 Computational Linguistics papers (Teufel et al., 2009); the upgraded version, AZ-II, contains 15 labels. Another sentence-level scheme is Core Scientific Concepts (CoreSCs), proposed by Liakata (2010), which consists of 18 labels based on 265 Physical Chemistry and Biochemistry papers. A further argumentative structure is Dr. Inventor, proposed by Fisas et al. (2015); this scheme contains five categories and three sub-categories built from 40 Computer Graphics papers.

Citation function dataset

Table 2 summarizes the existing datasets of citation functions together with the estimated number of sample papers and the number of labeled instances. The work by Roman et al. (2021) provides the largest dataset, consisting of 10 million instances that were labeled automatically. However, this work provides too few labels, i.e., background, method, and result, which are not sufficient to represent the reasons behind citations.

Table 2 Existing datasets of citation functions, together with the estimated number of source papers and citing sentences

Citation function classification

Existing works that performed citation function classification can be divided into two main categories: first, works that proposed both labeling schemes of citation functions and datasets; second, works that used existing datasets and performed citation function classification.

In the first category, the work by Teufel et al. (2006) is considered a pioneer in the development of citation functions. Next, Valenzuela et al. (2015) built a classification system using a support vector machine (SVM) and random forest (RF). Similarly, the RF approach was implemented by Jurgens et al. (2018) using several features, i.e., pattern, topic, and prototypical features. Zhao et al. (2019) used long short-term memory (LSTM), along with character-based embedding, to classify citation resources (tools, code, media, etc.) and functions. Tuarob et al. (2019) proposed a system to classify algorithm citation functions into four usage labels, i.e., use, extend, mention, and notalgo. Maximum entropy-based classification was used by Li et al. (2013) to propose a coarse annotation with sentiment labels. Because of the limited number of labeled instances, Dong and Schäfer (2011) introduced ensemble-style self-training to reduce annotation effort.

Still in the same category, another work proposing both an annotation scheme of citation functions and a dataset is Hernández-Álvarez et al. (2016). This research covered three classification tasks, i.e., citation functions, citation polarities, and citation aspects, all implemented using sequential minimal optimization. Su et al. (2019) used a convolutional neural network (CNN) for citation function and provenance classification, implemented with multitask learning. Sharing a similar multitask setting, Bakhti et al. (2018) also used a CNN, while Cohan et al. (2019) proposed another multitask learning approach.

In the second category, most existing works focus on classification strategies based on Valenzuela’s dataset. Hassan et al. (2017) proposed six new features combined with Valenzuela’s most important features. This work used five algorithms, i.e., SVM, naive Bayes, decision tree, K-nearest neighbors (KNN), and RF, and outperformed Valenzuela’s result using RF, achieving 84% accuracy. Another work, Hassan et al. (2018), reached 92.5% accuracy by implementing LSTM using 64 features. Following this, Nazir et al. (2020) proposed using citation frequencies, similarity scores, and citation counts; the classification in this research was built using kernel logistic regression, SVM, and RF. Pride and Knoth (2017) used influential and non-influential citations to find highly predictive features, with classification performed using RF. Next, Wang et al. (2020) used syntactic and contextual features for important and non-important citation detection, applying several algorithms, namely SVM, KNN, and RF.

Besides these works, Rachman et al. (2019) used the dataset from Teufel et al. (2006) with re-annotation and developed a model using SVM. Following this, Roman et al. (2021) used the citation context dataset from CORE. This research applied BERT, based on the three labels proposed by SciCite (Cohan et al., 2019).

Building the dataset of citation functions

This section describes how our dataset is developed using a semiautomatic approach. The entire system consists of three stages. The first stage is annotation scheme development. In this stage, we identified and reviewed the existing labels of citation functions and obtained more potential labels by enlarging the research scope. The goal of this stage is to produce the final version of the annotation scheme for citation functions. The second stage is building classification models based on the available handcrafted instances. This paper uses several classification scenarios to build these models. The first scenario is implemented using classical machine learning and a deep learning method. Next, we apply non-contextual and contextual word embeddings to cope with the limited available data. Furthermore, a low-resource scenario is applied using an AL approach. Finally, the third stage is assigning labels to all instances using the best models resulting from the previous stage. Figure 1 depicts how our proposed dataset is developed.

Fig. 1
figure 1

Development of the semiautomatic dataset of citation functions

Annotation scheme development

The proposed annotation scheme for citation functions is developed in several steps. First, we performed top-down and bottom-up analyses. The top-down analysis elaborates on the label definitions of existing schemes, i.e., background, usage, and comparison. In this analysis, the concept of background can be expanded by questioning what, why, when, and how; usage can be expanded by categorizing its degree into inspired, uses method, or uses data; and comparison can be elaborated using the similarities and differences between the citing paper and the cited paper. The bottom-up analysis identifies the citing sentence patterns in 5,668 random instances. This paper uses well-parsed sentences from arXiv (Färber et al., 2018). We filtered sentences containing the citation markers <DBLP:, <GC:, or <ARXIV: as targeted citing sentences. This process results in 1,840,815 citing sentences out of 15,534,328 sentences. The final scheme consists of 5 coarse labels and 21 fine-grained labels, shown in Table 3.

Table 3 The proposed annotation scheme for citation functions in this paper
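To illustrate the marker-based filtering step, the sketch below (not the authors' original code) keeps only sentences containing one of the citation markers described above; the one-sentence-per-line file layout and the file name are assumptions.

```python
# Minimal sketch of the marker-based filter; the file layout
# (one parsed sentence per line) is an assumption.
CITATION_MARKERS = ("<DBLP:", "<GC:", "<ARXIV:")

def is_citing_sentence(sentence):
    """True if the sentence contains at least one citation marker."""
    return any(marker in sentence for marker in CITATION_MARKERS)

def filter_citing_sentences(path):
    """Read one sentence per line and keep only the citing sentences."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if is_citing_sentence(line)]

# Example (hypothetical file name):
# citing = filter_citing_sentences("arxiv_cs_sentences.txt")
```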

Citation scheme comparison

As part of scheme development, a label comparison is performed between our scheme and existing schemes. As mentioned before, the existing schemes include both citation functions and argumentative structures. Through this comparison, we show the compatibility and contribution of our proposed scheme. In Table 4, N/A marks indicate the newly proposed labels of our scheme that were not accommodated in existing works. The comparison reveals that our labels are partially or fully compatible with existing labels. However, some incompatibilities exist, because argumentative labels are not naturally designed for citing sentences. For example, the label AIM in Teufel et al. (1999) and Teufel et al. (2009) is defined as a specific research goal or hypothesis of a research paper; this label is commonly stated using ordinary sentences. Another example is the label Conclusion in Liakata (2010), which connects the experimental results and research hypotheses; sentences expressing this label are naturally not citing sentences. Furthermore, another reason for incompatibility is that labels in argumentative structures can span more than one sentence.

Table 4 Comparison between our proposed labels of citation functions and existing schemes

Annotation strategy

Annotation experiments are the last part of scheme development. Two CS master’s degree graduates served as annotators. The required resources for the experiments are the annotation guidance and unlabeled citing sentence samples. The guidance contains an explanation of the annotation task, label definitions, annotation examples, a step-by-step annotation process, best practices, and the annotation schedule. After training, each annotator was provided with an Excel sheet containing 421 instances to be labeled. We used inter-annotator agreement and Cohen’s Kappa (Cohen, 1960) to validate the annotation results. The Kappa value is categorized into several ranges: 0.01–0.20 indicates slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement (Wang et al., 2019).
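As an illustration of this validation step only (not the authors' code), Cohen's Kappa can be computed with scikit-learn and mapped to the agreement ranges above; the label sequences below are hypothetical, not the actual annotation results.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical coarse labels assigned by the two annotators to the same instances.
annotator_1 = ["background", "citing paper work", "other", "background", "cited paper work"]
annotator_2 = ["background", "citing paper work", "background", "background", "cited paper work"]

kappa = cohen_kappa_score(annotator_1, annotator_2)

# Interpretation bands from Wang et al. (2019), as cited above.
if kappa > 0.80:
    band = "almost perfect"
elif kappa > 0.60:
    band = "substantial"
elif kappa > 0.40:
    band = "moderate"
elif kappa > 0.20:
    band = "fair"
else:
    band = "slight"
print(f"Cohen's Kappa = {kappa:.2f} ({band})")
```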

Text classification strategy

The text classification strategy contains two stages, i.e., a filtering stage and a fine-grained classification stage. The filtering stage eliminates instances of the three fine-grained labels belonging to the coarse label other, and the fine-grained classification categorizes the remaining 18 detailed labels. These two stages are applied to a dataset containing 5,668 manually labeled instances. Here, we evaluate four classification approaches. First, three classical approaches, namely Logistic Regression, Support Vector Machine (SVM), and Naïve Bayes, are used as baseline systems. Then, LSTM is our deep learning method. Considering the small number of labeled instances, it is worth applying pretrained word embeddings. We implement two contextual models, i.e., BERT (Devlin et al., 2019) and SciBERT (Beltagy et al., 2019), and three non-contextual models, i.e., fastText (Bojanowski et al., 2017), word2vec (Mikolov et al., 2013), and GloVe (Pennington et al., 2014). Note that the non-contextual models are combined with LSTM. The labeled dataset is divided into training, development, and testing sets with 80%, 10%, and 10% proportions, respectively. Deep learning approaches are implemented with the Keras API, whereas BERT and SciBERT are built using the ktrain Python library. The best hyperparameters were obtained during the experiments over learning rates from \(1e^{-5}\) to \(5e^{-5}\), batch sizes of 32 and 64, and balanced or imbalanced datasets. The best epoch was determined using early stopping, keeping the best model based on the validation instances. Regarding the imbalance problem, we use the class_weight parameter, which weights minority-class instances more heavily in the loss function so that all classes contribute relatively equally. Figure 2 depicts the distribution of the development dataset for all classification strategies.

Fig. 2
figure 2

Development (initial) labeled instance distribution
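To make the baseline setup concrete, the following is a minimal sketch (not the authors' implementation) of a classical baseline: TF-IDF features with Logistic Regression, the 80/10/10 split, and balanced class weights analogous to the class_weight idea described above. The toy data are placeholders standing in for the manually labeled citing sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data standing in for the manually labeled citing sentences.
texts = ["We use the parser of <DBLP:x> for preprocessing."] * 50 + \
        ["<DBLP:y> proposed a graph-based ranking method."] * 50
labels = ["citing_paper_use"] * 50 + ["cited_paper_propose"] * 50

# 80/10/10 split: carve out 20%, then halve it into development and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

# Balanced class weights: minority classes receive proportionally larger
# weights in the loss, mirroring the class_weight parameter mentioned above.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
clf = LogisticRegression(max_iter=1000, class_weight=class_weight)
clf.fit(vectorizer.fit_transform(X_train), y_train)
print("dev accuracy:", clf.score(vectorizer.transform(X_dev), y_dev))
```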

Active learning strategy

Active learning (AL) is a subfield of machine learning that allows the algorithm to choose the data from which it learns (Settles, 2010). This method is motivated by a common problem in machine learning: huge amounts of unlabeled data are easily obtained, but labels are expensive and time-consuming to acquire. AL argues that an algorithm can perform better with less data thanks to a mechanism for issuing queries to an oracle (a human annotator) to label selected unlabeled instances. AL is implemented through scenarios in which the learner asks queries. Figure 3 shows the pool-based scenario, the most common AL scenario. Lewis and Gale (1994) define pool-based AL by assuming a small set of labeled data L and a large pool of unlabeled data U. Instances are selected from the pool according to an informativeness measure, and the most informative instances are labeled by the oracle.

Fig. 3
figure 3

Pool-based active learning scenario (Settles, 2010)

The mechanism for selecting the most informative instances is called the query strategy. The most popular and simplest query strategy is uncertainty sampling (Lewis & Gale, 1994), in which an instance is selected when the model is least certain how to label it. Uncertainty sampling can be implemented through the following variants, where \({x}^{*}\) denotes the most informative instance under the given selection method (Settles, 2010); a small scoring sketch follows the list below:

  • Least confident

    This is the most general uncertainty sampling strategy. Here, an instance is selected if the model has the least confidence in its most likely label, where \(\widehat{y}\) is the class label having the highest posterior probability under the model \(\theta\):

    $${x}_{LC}^{*}=\underset{x}{\mathrm{argmax}}1- {P}_{\theta }(\widehat{y}|x)$$
  • Margin sampling

    Addressing the drawback of the least confident strategy, which considers only the most probable label, margin sampling selects the instance with the smallest difference between the most and the second most probable labels. Margin sampling is defined as follows (Scheffer et al., 2001):

    $${x}_{M}^{*}=\underset{x}{\mathrm{argmin}}{P}_{\theta }({\widehat{y}}_{1}|x)- {P}_{\theta }({\widehat{y}}_{2}|x)$$
  • Entropy

This is the most popular uncertainty sampling strategy and utilizes all label probabilities \({y}_{i}\). Entropy applies the following formula (Shannon, 1948) to each instance, and the instance with the highest value is queried.

$${x}_{H}^{*}=\underset{x}{\mathrm{argmax}}-\sum_{i}{P}_{\theta }\left({y}_{i}|x\right)\mathrm{log}\,{P}_{\theta }({y}_{i}|x)$$
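The three scores above can be computed directly from a model's predicted class probabilities. The following is a minimal NumPy sketch (not the authors' implementation); higher least-confident and entropy scores, and lower margin scores, mark more informative instances.

```python
import numpy as np

def least_confident(probs):
    """1 - P(most likely label); higher = more informative."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Difference between the top two probabilities; LOWER = more informative."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs):
    """Predictive entropy over all labels; higher = more informative."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# probs: (n_unlabeled, n_classes) posterior probabilities from the current model.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])
print(least_confident(probs))  # second instance is less confidently labeled
print(margin(probs))           # second instance has the smaller margin
print(entropy(probs))          # second instance has the higher entropy
```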

AL has been successfully used to reduce manual labeling effort. This paper implements the pool-based AL strategy in batch mode, as illustrated in Fig. 4. Using BERT and SciBERT, AL is applied in the filtering and fine-grained stages. The filtering stage selects the seed L from 10% of the training instances, whereas the fine-grained stage selects 20% for the initial seed L. The difference in seed proportion is caused by two factors, i.e., the number of available instances and the number of labels in each stage. The rest of the unlabeled instances U are used in the AL iterations. The AL strategy is designed to run for 20 iterations. The pretrained models are first fine-tuned on the seed L. In each iteration, the AL strategy selects a batch of 50 unlabeled instances from U and adds them to L with their real labels, so that 1,000 instances from U are gradually added to L in total. For batch selection, we compare three sampling approaches, i.e., least confident, max-margin, and entropy. Note that this paper follows the AL strategy proposed by Ein-Dor et al. (2020) and Hu et al. (2019), in which fine-tuning is performed from scratch in each iteration to prevent overfitting to data from previous rounds. The best parameters from the non-AL strategy are used in the AL experiments.

Fig. 4
figure 4

Pool-based active learning used in this paper, modified from (Settles, 2010)
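The batch-mode loop described above can be sketched as follows. This is a schematic simulation, not the authors' code: train_model is a hypothetical callable that fine-tunes a fresh classifier each round, score_fn is one of the scores sketched earlier (for margin sampling the score would be negated so that higher still means more informative), and the held-back labels of U play the role of the oracle.

```python
import numpy as np

def pool_based_al(L_texts, L_labels, U_texts, U_labels, score_fn, train_model,
                  n_rounds=20, batch_size=50):
    """Schematic batch-mode pool-based AL simulation.

    train_model: hypothetical callable that trains a fresh classifier on
    (texts, labels) and returns an object exposing predict_proba(texts).
    score_fn:    informativeness score (higher = more informative).
    """
    L_texts, L_labels = list(L_texts), list(L_labels)
    U_texts, U_labels = list(U_texts), list(U_labels)
    for _ in range(n_rounds):
        model = train_model(L_texts, L_labels)        # retrain from scratch each round
        probs = np.asarray(model.predict_proba(U_texts))
        picked = np.argsort(score_fn(probs))[-batch_size:]
        for i in sorted(picked, reverse=True):        # move the queried batch from U to L
            L_texts.append(U_texts.pop(i))
            L_labels.append(U_labels.pop(i))          # oracle label (simulated)
    return train_model(L_texts, L_labels)             # final model on the enlarged L
```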

Statistical significance test

Since this paper implements two classification scenarios, i.e., non-AL and AL, we assess the significance of the achieved performances. McNemar’s test (McNemar, 1947) is a statistical test for checking the significance of differences in paired nominal data. In machine learning, McNemar’s test is used to compare the performance of two classifiers by means of a 2 × 2 contingency table.

According to Table 5, the test statistic is calculated as follows:

Table 5 The 2 × 2 contingency table of the McNemar’s test
$${X}^{2}= \frac{{(b-c)}^{2}}{(b+c)}$$

Under the null hypothesis, in which neither of the compared classifiers performs better than the other, the test statistic \({X}^{2}\) should be small. A high value of \({X}^{2}\) suggests that the null hypothesis can be rejected. In addition, we specify the common significance threshold of 0.05 and then compute the p-value. If the p-value is larger than the threshold, we fail to reject the null hypothesis, which means that neither of the compared classifiers performs better than the other. In contrast, if the p-value is lower than the threshold, we reject the null hypothesis because the two compared classifiers are significantly different. The p-value is calculated as follows:

$$p\text{-}\mathrm{value} = 1 - \mathrm{cdf}\left({X}^{2}\right)$$

where cdf is the cumulative distribution function of the chi-squared distribution with 1 degree of freedom.
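As a small illustration (with hypothetical counts b and c, not values from our experiments), the test statistic and p-value can be computed with the chi-squared distribution from SciPy as follows.

```python
from scipy.stats import chi2

# Hypothetical off-diagonal counts of the 2x2 contingency table:
# b = instances classifier A got right and classifier B got wrong,
# c = instances classifier A got wrong and classifier B got right.
b, c = 34, 28

x2 = (b - c) ** 2 / (b + c)        # McNemar's test statistic
p_value = 1 - chi2.cdf(x2, df=1)   # chi-squared cdf with 1 degree of freedom

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")
```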

Experiment results

This section presents the results of the annotation experiments and the text classification experiments.

Annotation experiment results

The annotation experiment results contain raw agreement and Kappa values. The confusion matrices in Tables 6 and 7 show the raw agreements between annotators. The diagonal bold values indicate the number of agreed instances between annotators. The raw agreement reached 88.59% (373 agreed instances) for coarse labels and 72.55% (305 agreed instances) for fine-grained labels. At the coarse level, citing paper work achieved the highest percentage of 30.56%, followed by background with 25.20% and cited paper work with 24.93%; the two labels with the lowest percentages are other with 10.19% and compare and contrast with 9.12%. The fine-grained agreements are more evenly spread, since each label has a relatively equal number of samples. The highest percentage at the fine-grained level was achieved by suggest with 6.89%, while citing_paper_corroboration and other had the two lowest percentages of 1.64% and 0.33%, respectively. The Kappa statistic reached 0.85 on the coarse labels and 0.71 on the fine-grained labels, which is considered almost perfect and substantial agreement, respectively.

Table 6 Confusion matrix for inter-annotator agreement on five coarse labels
Table 7 Confusion matrix for inter-annotator agreement on fine-grained labels

Considering the number of labels in our scheme, the obtained Kappa values are competitive compared with previous works, e.g., (Casey et al., 2019) with 0.77, (Teufel et al., 2006) with 0.72, (Dong & Schäfer, 2011) with 0.757, and (Zhao et al., 2019) with 0.47.

We highlight several sources of disagreement between annotators. The largest disagreement on the coarse labels occurred in 6 instances that annotator I labeled as background and annotator II labeled as other. The annotators had difficulty identifying the motivation behind the background label through its fine-grained labels and understanding the motivation behind the other label. Focusing on the total number of miscategorized instances, there were 15 such instances for annotator I and 16 for annotator II. On the fine-grained labels, the largest disagreement occurred in 8 instances that annotator I labeled as citing_paper_use and annotator II labeled as citing_paper_corroboration. Both labels are part of the coarse label citing paper work, and our analysis shows that the disagreement between them occurred in ambiguous instances. To handle this, the annotation guidelines, including the labeling examples, need to be improved to resolve ambiguous instances.

Filtering stage result

Table 8 shows the performance metrics of the classification experiments without AL. Focusing on accuracy, the experiments demonstrate that the contextual word embeddings, i.e., BERT and SciBERT, share the highest accuracy of 90.12%. However, SciBERT achieved a higher macro avg f1 of 77.73% compared with BERT’s 75.99%. Notably, the classical classifiers achieved similar accuracies of around 85%, but in terms of macro avg f1, Logistic Regression reached the highest value of 70.06% among the three baselines. The three non-contextual word embeddings, i.e., word2vec, fastText, and GloVe, show nearly equal accuracies and macro avg precision, but for macro avg recall and macro avg f1, GloVe achieved higher values of 85.15% and 75.99%, respectively. Among all methods, the embedding layer showed the poorest performance in all metrics. Table 9 lists the parameters used in the filtering stage.

Table 8 The best testing results of each classification technique for the filtering stage. Bold values indicate the best result in each performance metric. All metrics are measured by percentage (%)
Table 9 The hyperparameter settings used in the filtering stage

Looking at the performance of each label in Fig. 5, all performance metrics for the noother label are lower than those for the other label. There are extreme cases where the noother label has zero values, as with Naïve Bayes, word2vec, and fastText. Two methods, BERT and SciBERT, have relatively balanced proportions compared with the other methods.

Fig. 5
figure 5

The performance metrics of each individual class in the filtering stage. The x-axis depicts the classes and their performance metrics, and the y-axis depicts the performance values

Fine-grained stage result

As expected, the performance in this stage is lower than in the filtering stage. Table 10 shows that there are performance gaps between the contextual word embeddings and the other approaches. SciBERT shows its superiority over all other approaches in all metrics. The three non-contextual word embeddings and the embedding layer produced the lowest performances, with accuracies and macro avg f1 below 10%. Among the baselines, the best results were achieved by Logistic Regression, with all metrics around 70%. Looking at the individual label performance, four approaches, i.e., the embedding layer, word2vec, fastText, and GloVe, show poor results (Fig. 6). The three baseline approaches show better performances but still underperform BERT and SciBERT.

Table 10 The best testing results of each classification technique for fine-grained labels
Fig. 6
figure 6

Performance metrics of each class in the fine-grained stage. The x-axis depicts the classes and their performance metrics, and the y-axis depicts the performance values

All parameter settings in this stage are shown in Table 11. The full performance comparison of BERT and SciBERT in the filtering and fine-grained stages is shown in Fig. 7.

Table 11 Hyperparameter setting for the best results for fine-grained labels
Fig. 7
figure 7

BERT and SciBERT performance comparison in the filtering and fine-grained stages for different learning rates and batch sizes

Active learning results

The AL experiments were performed using the best parameters from the non-AL results. The filtering experiment used a learning rate of \(2e^{-5}\), a batch size of 64, and the imbalanced distribution for BERT-based AL; for SciBERT, the best parameters were a learning rate of \(3e^{-5}\), a batch size of 32, and the balanced distribution. The BERT-based fine-grained experiment implemented the AL strategies with a learning rate of \(3e^{-5}\), a batch size of 32, and the imbalanced distribution; for SciBERT, the parameters were a learning rate of \(3e^{-5}\), a batch size of 32, and the balanced distribution.

Filtering stage results

The AL-based performance in the filtering stage is shown in Table 12. BERT combined with least confident sampling achieved the highest accuracy of 90.29% in the filtering stage; to obtain this result, the AL strategy required 1,000 queried instances for training. Entropy used 500 queried instances to obtain 88.88% accuracy, while max-margin required 450 queried instances to reach 88.71% accuracy. At this stage, the best accuracy reached by SciBERT was 89.59%, when integrated with entropy on 850 queried instances. Integrating SciBERT with max-margin and least confident yielded the same accuracy of 88.88%, although they needed different numbers of queried instances: 900 for max-margin and 800 for least confident. Random sampling reached the lowest accuracy of 88.35% when AL was combined with BERT but achieved the second-highest performance of 89.41% in the SciBERT setting. In summary, the AL strategy outperformed the best result from the classification strategy without AL on the entire training set, especially when integrating BERT with least confident sampling and using fewer training instances. The detailed AL results for the filtering stage are shown in Fig. 8.

Table 12 The best results of the AL strategies in the filtering stage; the bold value indicates the highest accuracy
Fig. 8
figure 8

Result comparison of AL strategies on the filtering stage using BERT and SciBERT with four sampling approaches. The data splitting scenario is 1039 (testing), 4534 (simulating L and U), and 453 (seed)

As the AL-based strategy in the filtering stage achieved slightly higher accuracy (90.29%) than the non-AL strategy (90.12%), we conducted a significance test based on the McNemar approach. However, the accuracy gain of the AL strategy was not significant, with a p-value of 0.73. Instead of relying only on accuracy, we also measured the alternative metrics shown in Table 13, as in the non-AL setting. Even though the test failed to reject the null hypothesis, we can still note that the AL strategy achieved a better macro avg f1 of 78.89% compared to the best non-AL result in the filtering stage of 77.73% using SciBERT.

Table 13 Detailed performance metrics of the best accuracy in the AL strategy. All metrics are measured by percentage (%)

Fine-grained results

The AL-based performance in fine-grained classification is shown in Table 14. The highest accuracy of 81.15% was achieved by two AL settings, namely combining SciBERT with entropy-based sampling using 850 queried instances and combining SciBERT with least confident sampling using 600 queried instances. With the remaining sampling technique, max-margin, the AL strategy reached a maximum accuracy of 80.33% on 850 queried instances. At this stage, the maximum accuracy obtained by combining BERT and AL was 80.95% on 1,000 queried instances, while the other sampling methods only reached 79.08%, 79.71%, and 79.91% with max-margin, least confident, and random sampling, respectively. The detailed AL results for fine-grained classification are shown in Fig. 9.

Table 14 The best results of the AL strategies for fine-grained classification; the bold value indicates the highest accuracy
Fig. 9
figure 9

Result comparison of AL strategies for fine-grained classification using BERT and SciBERT with four sampling approaches. The data splitting scenario is 1039 (testing), 3858 (simulating L and U), and 771 (seed)

The AL-based strategy in the fine-grained stage achieved slightly lower accuracy (81.15%) than the non-AL strategy (83.64%). As in the filtering stage, a significance test was conducted to compare these two accuracies. The test demonstrated that the accuracies were significantly different, with a p-value of 0.011. Considering the more detailed metrics, the AL strategy obtained lower results than the non-AL strategy in all metrics, as shown in Table 15.

Table 15 Detailed performance metrics of the best accuracy in the AL strategy. All metrics are measured by percentage (%)

Overall, the AL strategies required fewer instances (less than half of the total dataset) for training to achieve competitive accuracy in the fine-grained stage and slightly higher accuracy in the filtering stage. This demonstrates two points: first, not all instances in the dataset contribute equally to performance; second, keeping humans in the machine learning loop while using fewer instances can yield better judgments than processing the entire dataset automatically. Focusing on the query strategy, least confident sampling delivered the best performance among the compared methods.

Another point worth mentioning is that the random sampling strategy reached competitive accuracies in the filtering stage when combined with SciBERT and in the fine-grained stage when combined with BERT. In these settings, random sampling slightly outperformed least confident, the best method across the overall scenarios. Moreover, even though random sampling had the lowest accuracies among all strategies in the other settings, its unbiased instance selection can be used to approximate the performance that would be obtained using the whole dataset.

Finally, we use the best models to classify the unlabeled citing sentences. Table 16 shows the label distribution of the resulting dataset. cited_paper_propose has the highest count both within the cited paper work category and in the entire dataset, with 243,031 instances, whereas citing_paper_future has the lowest count, with 5,439 instances. Most interestingly, the label with the highest count within each coarse category is consistent between the manually labeled development dataset (see Fig. 2) and the final dataset, e.g., judgment for the background class, citing_paper_use for the citing paper work class, cited_paper_propose for the cited paper work class, and compare for the compare and contrast class.

Table 16 The distribution of our new dataset of citation functions. The bold values indicate the fine-grained labels with the highest number of instances in each coarse category

Conclusion and future work

This paper developed a dataset of citation functions consisting of 1,840,815 labeled instances. The dataset was built using a semiautomatic approach: we trained machine learning models on manually labeled data and used these models to label the unlabeled instances. Our scheme was developed through top-down analysis, bottom-up analysis, and annotation experiments. Besides the competitive Kappa results, several findings were identified during the experiments. First, assigning coarse labels first helped annotators select appropriate fine-grained labels. Second, the annotation guidance needs to be upgraded to handle ambiguous instances. Third, the proposed scheme is compatible with well-known argumentative structures of research papers.

The classification experiments have shown that BERT and SciBERT achieved higher accuracies than other methods. In addition, these two methods achieved promising results using AL on less than half of the training data. SciBERT consistently outperformed BERT in the fine-grained stage in both AL and non-AL settings. However, BERT outperformed SciBERT in the filtering stage using AL. Note that there is a consistent label distribution between the initial and final datasets.

A limitation of this paper is that the labels of citation functions are determined using only the citing sentences themselves, without considering the surrounding sentences. These sentences would be useful during the manual labeling stage, especially when deciding on the labels of ambiguous samples. In future work, we plan to extract the sentences before and after the citing sentences using a window size of two. Besides helping to judge the labels of difficult samples, this information is also important as classification features. Another potential research direction is to investigate the possibility of applying our scheme of citation functions to other research areas through domain adaptation. Domain adaptation is a promising method since creating entirely new training data on target domains is expensive, time-consuming, and requires massive human effort.