Introduction

The rapid advancement of science and technology has led to a staggering increase in the number of academic publications produced globally each year (Zhu & Ban, 2018). In this ever-growing landscape, effectively evaluating the impact of research papers has become a critical issue (Castillo et al., 2007; Chakraborty et al., 2014; Li et al., 2019; Yan et al., 2011). Citation count, which measures the frequency with which a paper is referenced by other works, is widely recognized as the most prevalent metric for assessing the influence of academic papers, authors, and institutions (Bu et al., 2021; Cao et al., 2016; Lu et al., 2017; Redner, 1998; Stegehuis et al., 2015; Wang et al., 2021). Building upon the foundation of citation counts, numerous additional measures have been proposed to quantify research impact from various perspectives (Braun et al., 2006; Egghe, 2006; Garfield, 1972, 2006; Hirsch, 2005; Persht, 2009; Yan & Ding, 2010).

Predicting the impact of scientific papers has garnered significant research attention due to its profound implications (Abramo et al., 2019; Abrishami & Aliakbary, 2019; Bai et al., 2019; Cao et al., 2016; Chen & Zhang, 2015; Li et al., 2019; Liu et al., 2020; Ma et al., 2021; Ruan et al., 2020; Su, 2020; Wang et al., 2013, 2021, 2023; Wen et al., 2020; Xu et al., 2019; Yan et al., 2011; Yu et al., 2014; Zhao & Feng, 2022; Zhu & Ban, 2018); see the “Citation count prediction” section for a more detailed discussion. Accurately forecasting the future citation impact of academic papers, particularly those published only recently, offers invaluable benefits to various stakeholders within the research ecosystem. It would help researchers find potentially high-impact papers and interesting research topics at an earlier stage, and it would help institutions, government agencies, and funding bodies evaluate published papers, researchers, and project proposals, among others.

For large and diverse collections encompassing papers from various research areas, a one-size-fits-all approach to citation impact prediction may be inadequate. Even within a broad field like Computing, sub-fields such as Theoretical Computing, Artificial Intelligence, Systems, and Applications can exhibit distinct citation patterns. Previous studies have demonstrated that citation dynamics can vary significantly across research areas, journals, researchers in different age groups, and other factors (Kelly, 2015; Levitt & Thelwall, 2008; Mendoza, 2021; Milz & Seifert, 2018). To illustrate this point, consider an example from the DBLP dataset used in our study. Figure 1a depicts the average citation distributions of papers in three research areas: Cryptography, Artificial Intelligence, and Software Engineering. We can observe striking differences in their citation patterns:

  • Software Engineering papers consistently attract relatively few citations over time, without a pronounced peak in their citation curve.

  • Artificial Intelligence papers garner the highest citation counts among the three areas. Their citation curve rises rapidly, peaking around year 4, followed by a gradual decline until year 7, after which the decrease becomes more precipitous.

  • Cryptography papers exhibit a steadily increasing citation trend over the first 10 years, reaching a peak around year 11, followed by a slow decline in citations thereafter.

Fig. 1 Citation patterns in different research areas or different classes of the same research area

These divergent citation patterns across research areas highlight the limitations of employing a single, universal model for citation impact prediction. In light of these observations, a more effective strategy would be to segment papers into distinct groups based on their research areas and develop tailored prediction models for each group. By accounting for the unique citation characteristics of different domains, such a group-specific modelling approach has the potential to significantly enhance the accuracy and reliability of citation impact predictions, particularly for large and heterogeneous collections of academic papers.

Citation patterns are not solely determined by research areas but are also influenced by the quality and intrinsic characteristics of individual papers. Even within the same research area, the citation dynamics of papers can vary considerably (Garfield, 2006; Wang et al., 2021; Yan & Ding, 2010). High-impact papers may exhibit significantly different citation trajectories compared to average or low-impact works. Accounting for these differences by employing multiple models tailored to papers with varying citation potential could further improve prediction performance. Figure 1b illustrates this phenomenon using an example from the Embedded & Real-Time Systems research area. All papers in this domain can be categorized into four classes based on their cumulative citation counts (cc) over 15 years: cc < 10, 10 ≤ cc < 50, 50 ≤ cc < 100, and cc ≥ 100. The general pattern for all the curves is that they initially increase for a few years and then decrease. However, the peak point varies with the total number of citations: papers with higher citation counts take more years to reach their peak. This finding suggests that class-based prediction can be a viable approach for our prediction task, as it accounts for the varying peak times across citation count classes.

If the papers are not already classified, a classification system that encompasses multiple categories is needed, together with an automated method for assigning each paper to one or more suitable categories. For a large collection of papers to be classified, both the effectiveness and the efficiency of the assignment method are crucial.

Taking into account all the observations mentioned earlier, we propose MM, a prediction method based on Multiple Models tailored for different research areas and citation counts, to predict the future citation counts of a paper. This work makes the following contributions:

  1. A new instance-based learning method is introduced to classify papers into a given number of research areas. Both paper contents (titles and abstracts) and citations are considered separately, and an ensemble-based method is then employed to make the final decision. Experiments with the DBLP dataset demonstrate that the proposed method achieves excellent classification performance.

  2. A prediction method for paper citation counts is proposed. For any paper to be predicted, a suitable prediction model is chosen based on its research area and early citation history. This customized approach enables each paper to use a fitting model.

  3. Experiments with two datasets show that the proposed prediction method outperforms the baseline methods considered in this study.

The remainder of this article is structured as follows: “Related work” section reviews related work on citation count prediction and classification of academic papers. “Methodology” section describes the proposed method in detail. “Experimental settings and results” section presents the experimental settings, procedures, and results, along with an analysis of the findings. Finally, “Conclusion” section concludes the paper.

Related work

In this work, the primary task is citation count prediction of papers, while classification of scientific papers serves as an additional task that may be required for the prediction task. Accordingly, we review some related work on citation count prediction and classification of academic papers separately in the following sections.

Citation count prediction

In the literature, there are numerous papers on predicting the citation counts of scientific papers. These methods can be categorized into three groups based on the information used for prediction.

The first group relies solely on the paper’s citation history as input. Wang et al. (2013) developed a model called WSB to predict the total number of citations a paper will receive, assuming its earlier citation data is known. Cao et al. (2016) proposed a data analytic approach to predict the long-term citation count of a paper using its short-term (three years after publication) citation data. Given a large collection of papers C with long citation histories, for a paper p with a short citation history, they matched it with a group of papers in C with similar early citation data and then used those papers in C to predict p’s later citation counts. Abrishami and Aliakbary (2019) proposed a long-term citation prediction method called NNCP based on Recurrent Neural Network (RNN) and the sequence-to-sequence model. Their dataset comprised papers published in five authoritative journals: Nature, Science, NEJM (The New England Journal of Medicine), Cell, and PNAS (Proceedings of the National Academy of Sciences). Wang et al. (2021) introduced a nonlinear predictive combination model, NCFCM, that utilized multilayer perceptron (MLP) to combine WSB and an improved version of AVR for predicting citation counts.

The second group uses not only the citation data but also some other extracted features from the paper or the wider academic network for the prediction task. Yu et al. (2014) adopted a stepwise multiple regression model using four groups of 24 features, including paper, author, publication, and citation-related features. Bornmann et al. (2014) took the percentile approach of Hazen (1914), considering the journal’s impact and other variables such as the number of authors, cited references, and pages. Castillo et al. (2007) used information about past papers written by the same author(s). Chen and Zhang (2015) applied Gradient Boosting Regression Trees (GBRT) with six paper content features and 10 author features. Bai et al. (2019) made long-term predictions using the Gradient Boosting Decision Tree (GBDT) model with five features, including the citation count within 5 years after publication, authors’ impact factor, h-index, Q value, and the journal's impact factor. Akella et al. (2021) exploited 21 features derived from social media shares, mentions, and reads of scientific papers to predict future citations with various machine learning models, such as Random Forest, Decision Tree, Gradient Boosting, and others. Xu et al. (2019) extracted 22 features from heterogeneous academic networks and employed a Convolutional Neural Network (CNN) to capture the complex nonlinear relationship between early network features and the final cumulative citation count. Ruan et al. (2020) employed a four-layer BP neural network to predict the 5th year citation counts of papers, using a total of 30 features, including paper, author, publication, reference, and early citation-related features. By extracting high-level semantic features from metadata text, Ma et al. (2021) adopted a neural network to consider both semantic information and the early citation counts to predict long-term citation counts. Wang et al. (2023) applied neural network technology to a heterogeneous network including author and paper information. Huang et al. (2022) argued that citations should not be treated equally, as the citing text and the section in which the citation occurs significantly impact its importance. Thus, they applied deep learning models to perform fine-grained citation prediction—not just citation count for the whole paper but citation count occurring in each section.

The third group uses other types of information beyond those mentioned above. To investigate the impact of peer-reviewing data on prediction performance, Li et al. (2019) adopted a neural network prediction model, incorporating an abstract-review match method and a cross-review match mechanism to learn deep features from peer-reviewing texts. Combining these learned features with breadth features (topic distribution, topic diversity, publication year, number of authors, and average author h-index), they employed a multilayer perceptron (MLP) to predict citation counts. Li et al. (2022) also utilized peer-reviewing text for prediction, using an aspect-aware capsule network. Zhao and Feng (2022) proposed an end-to-end deep learning framework called DeepCCP, which takes an early citation network as input and predicts the citation count using both GRU and CNN, instead of extracting features.

Citation counts of a paper can be affected by many factors, such as research area, paper type, and author characteristics including age and sex (Andersen & Nielsen, 2018; Mendoza, 2021; Thelwall, 2020). Levitt and Thelwall (2008) compared patterns of annual citations of highly cited papers across six research areas. To our knowledge, Abramo et al. (2019) is the only prior work that uses multiple regression models for prediction, with one model for each subject category, and it is the most relevant to ours. However, there are two major differences. First, we propose a paper classification method, whereas no paper classification is required in Abramo et al. (2019). Second, we apply multiple models to the papers in each category, whereas only one model per category is used in Abramo et al. (2019).

Classification of scientific papers

Classification of scientific papers becomes a critical issue when organizing and managing an ever-increasing number of publications through computerized solutions. Previous research typically used metadata such as titles, abstracts, keywords, and citations for this task; full text was rarely considered because it is unavailable in most situations.

Various machine learning methods, such as K-Nearest Neighbors (Lukasik et al., 2013; Waltman & Van Eck, 2012), K-means (Kim & Gil, 2019), and Naïve Bayes (Eykens et al., 2021), have been applied. Recently, deep neural network models, such as Convolutional Neural Networks (Daradkeh et al., 2022; Rivest et al., 2021), Recurrent Neural Networks (Hoppe et al., 2021; Semberecki & Maciejewski, 2017), and pre-trained language models (Hande et al., 2021; Kandimalla et al., 2020), have also been utilized.

One key issue is the classification system to be used, and many different systems exist. Both Thomson Reuters’ Web of Science database (WoS) and Elsevier’s Scopus database have their own general classification systems covering many subjects and research areas. Other systems focus on one particular subject, such as the Medical Subject Headings (MeSH), the Physics and Astronomy Classification Scheme (PACS), the Chemical Abstracts Sections, the Journal of Economic Literature (JEL) classification, and the ACM Computing Classification System.

Based on the WoS classification system, Kandimalla et al. (2020) applied a deep attentive neural network (DANN) to a collection of papers from the WoS database for the classification task. It was assumed that each paper belonged to only one category, and only abstracts were used.

Zhang et al. (2022) compared three classification systems: Thomson Reuters’ Web of Science, the Fields of Research provided by Dimensions, and the Subjects Classification provided by Springer Nature. Among these, the second was generated automatically by machine learning methods, while the other two were created manually by human experts. They found significant inconsistencies between the machine-generated and human-generated systems.

Rather than using an existing classification system, some researchers build their own classification system using the collection to be classified or other resources such as Wikipedia.

Shen et al. (2018) organized scientific publications into a hierarchical concept structure of up to six levels. The first two levels (similar to areas and sub-areas) were manually selected, while the others were automatically generated. Wikipedia pages were used to represent the concepts. Each publication or concept was represented as an embedding vector, so the similarity between a publication and a concept could be calculated as the cosine similarity of their vector representations. This approach is a core component in the construction of the Microsoft Academic Graph.

In the same vein as Shen et al. (2018), Toney-Wails and Dunham (2022) also used Wikipedia pages to represent concepts and build the classification system. Both publications and concepts were represented as embedding vectors. Their database contains more than 184 million documents in English and more than 44 million documents in Chinese.

Mendoza et al. (2022) presented a benchmark corpus together with a classification system that can be used for the academic paper classification task. The classification system comprises the 36 subjects defined in the UK Research Excellence Framework (REF). According to Cressey and Gibney (2014), this exercise is the largest overall assessment of university research outputs ever undertaken globally. The 191,000 submissions to REF 2014 constitute a very good dataset because every paper was manually categorized by experts when submitted.

Liu et al. (2022) described the NLPCC 2022 Task 5 Track 1, a multi-label classification task for scientific literature, where one paper may belong to multiple categories simultaneously. The data set, crawled from the American Chemistry Society’s publication website, comprises 95,000 papers’ meta-data including titles and abstracts. A hierarchical classification system, with a maximum of three levels, was also defined.

As we can see, the classification of academic papers is quite complicated: many classification systems and classification methods are available, and the two are closely related. The major goal of this work is citation count prediction for published papers, for which classification of papers is a basic requirement. For example, the DBLP dataset includes over four million papers, so special care is needed to perform the classification task effectively and efficiently. We used the classification system from CSRankings, which comprises four categories (research areas) and 26 sub-categories in total. A group of top venues is identified for each sub-category; however, many more venues in DBLP are not assigned to any category. We treat the papers from the venues recommended by CSRankings as representative papers of their research areas. An instance-based learning approach is then used to measure the semantic similarity between a target paper and the papers in each area, and a decision can be made based on the similarity scores the target paper obtains for all research areas. In addition, citation links between the target paper and the papers in those recommended venues are also considered. Quite different from previously proposed classification methods, this instance-based learning approach suits our purpose well. See the “Methodology” section for more details.

Methodology

This research aims to predict the number of citations that academic papers will receive in the next few years based on their metadata, including title, abstract, and citation data since publication. The main idea of our approach is that, for each paper, we use a specific model chosen according to its research area and early citation count to make the prediction. There are two key issues: academic paper classification and citation count prediction. We detail them one by one in the following subsections.

Computing classification system

To carry out the classification task of academic papers, a suitable classification system is required. There are many classification systems available for natural science, social science, humanities, or specific branches of science or technology. Since one of the datasets used in this study is DBLP, which includes over four million papers on computer science so far, we will focus our discussion on classification systems and methods for computer science.

In computer science, quite a few classification systems are available. For example, both the Association for Computing Machinery (ACM) and the China Computer Federation (CCF) define their own classification systems. However, neither is well suited to our purpose. The ACM classification system is quite complicated, and it does not provide representative venues for any of the research areas. The CCF defines 10 categories and recommends dozens of venues in each category; however, some journals and conferences publish papers in more than one category but are recommended in only one. For instance, both IEEE Transactions on Knowledge and Data Engineering and Data and Knowledge Engineering publish papers on Information Systems and Artificial Intelligence, but they are only recommended in the Database/Data Mining/Content Retrieval category.

In this research, we used the classification system from CSRankings. This system divides computer science into four areas: AI, Systems, Theory, and Interdisciplinary Areas. Each area is further divided into several sub-areas, totalling 26 sub-areas. We flatten these 26 sub-areas for classification and ignore the four general areas at level one. One benefit of using this system is that it lists several key venues for every sub-area. For example, three venues are identified for Computer Vision: CVPR (IEEE Conference on Computer Vision and Pattern Recognition), ECCV (European Conference on Computer Vision), and ICCV (IEEE International Conference on Computer Vision). This is very useful for the paper classification task, as we discuss next.

Paper classification

For this research, we need a classification algorithm that can perform the classification task for all the papers in the DBLP dataset effectively and efficiently.

Although many classification methods have been proposed, we could not find one that suits our case well. Therefore, we developed our own approach. Using the classification system of CSRankings, we assume that all the papers published in the venues identified for a research area belong to that area; we refer to these as seed papers. For all the non-seed papers, we need to decide the areas to which they belong. This is done by considering three aspects together: content, references, and citations. We consider content first.

The collection of all the seed papers, denoted as C, was indexed using the search engine Lucene with the BM25 model. Both titles and abstracts were used in the indexing process. Each research area \({a}_{k}\) is represented by all its seed papers C(\({a}_{k}\)). For a given non-seed paper p, we use its title and abstract as a query to search for similar papers in C. Each seed paper s then obtains a score, the similarity between p and s:

$$sim\left( {p,s} \right) = \sum\nolimits_{{t_{j} \in p}} {idf\left( {t_{j} } \right) \times \frac{{f\left( {t_{j} ,s} \right) \times \left( {b_{2} + 1} \right)}}{{f\left( {t_{j} ,s} \right) + b_{2} \times \left( {1 - b_{1} + b_{1} \times \frac{{\left| {T_{s} } \right|}}{AL\left( C \right)}} \right)}}}$$
(1)

in which b1 and b2 are two parameters (set to 0.75 and 1.2, respectively, the default values of Lucene, in the experiments), \({T}_{s}\) is the list of terms in s (so \(\left|{T}_{s}\right|\) is the length of s), \(AL(C)\) is the average length of all the documents in C, \(f\left({t}_{j},s\right)\) is the term frequency of \({t}_{j}\) in s, and \(idf\left({t}_{j}\right)\) is the inverse document frequency of \({t}_{j}\) in the seed-paper collection C. \(idf\left({t}_{j}\right)\) is defined as

$$idf\left({t}_{j}\right)=\text{log}\left(1+\frac{\left|C\right|-\left|C({t}_{j})\right|+0.5}{\left|C({t}_{j})\right|+0.5}\right)$$
(2)

in which \(\left|C\right|\) is the number of papers in \(C\), and \(\left|C({t}_{j})\right|\) is the number of papers in C that contain \({t}_{j}\). For a paper p and a research area \({a}_{k}\), we can calculate the average similarity score between p and all the seed papers in C(\({a}_{k}\)) as

$${sim}{\prime}\left(p,{a}_{k}\right)=\frac{1}{\left|C({a}_{k})\right|}{\sum }_{s\in C({a}_{k})}sim\left(p,s\right)$$
(3)

where C(\({a}_{k}\)) is the collection of seed papers in area \({a}_{k}\).
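
To make the scoring concrete, the following is a minimal Python sketch of Eqs. (1)–(3), assuming papers are given as pre-tokenized term lists. It re-implements the BM25 scoring in plain Python rather than calling Lucene, and all function names are illustrative rather than taken from our implementation.

```python
# Minimal sketch of Eqs. (1)-(3); papers are assumed to be pre-tokenized term lists.
import math
from collections import Counter

def build_idf(seed_papers):
    """idf(t) over the seed collection C, Eq. (2)."""
    n = len(seed_papers)
    df = Counter()
    for terms in seed_papers:
        df.update(set(terms))
    return {t: math.log(1 + (n - dft + 0.5) / (dft + 0.5)) for t, dft in df.items()}

def bm25_sim(query_terms, seed_terms, idf, avg_len, b1=0.75, b2=1.2):
    """sim(p, s), Eq. (1): BM25 similarity between a query paper and one seed paper."""
    tf = Counter(seed_terms)
    score = 0.0
    for t in set(query_terms):
        if t not in tf:
            continue                      # terms absent from s contribute nothing
        norm = 1 - b1 + b1 * len(seed_terms) / avg_len
        score += idf.get(t, 0.0) * tf[t] * (b2 + 1) / (tf[t] + b2 * norm)
    return score

def area_similarity(query_terms, area_seed_papers, idf, avg_len):
    """sim'(p, a_k), Eq. (3): average similarity to one area's seed papers."""
    sims = [bm25_sim(query_terms, s, idf, avg_len) for s in area_seed_papers]
    return sum(sims) / len(sims) if sims else 0.0
```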

We also consider citations between \(p\) and the papers in C. Citations in the two directions are considered separately: \(citingNum\left(p,{a}_{k}\right)\) denotes the number of papers in C(\({a}_{k}\)) that p cites, and \(citedNum\left(p,{a}_{k}\right)\) denotes the number of papers in C(\({a}_{k}\)) that cite p. To combine the three features, normalization is required. The content similarity \({sim}{\prime}\left(p,{a}_{k}\right)\) is normalized as

$$sim\left(p,{a}_{k}\right)=\frac{{sim}{\prime}\left(p,{a}_{k}\right)}{\sum_{{a}_{l}\in RArea}{sim}{\prime}\left(p,{a}_{l}\right)}$$
(4)

in which \(RArea\) is the set of 26 research areas. \(citingNum\left(p,{a}_{k}\right)\) and \(citedNum\left(p,{a}_{k}\right)\) are normalized similarly. Then we let

$$score\left(p,{a}_{k}\right)={\beta }_{1}\times sim\left(p,{a}_{k}\right)+{\beta }_{2}\times citingNum\left(p,{a}_{k}\right)+{\beta }_{3}\times citedNum\left(p,{a}_{k}\right)$$
(5)

for any \({a}_{k}\in RArea\), in which \({\beta }_{1}\), \({\beta }_{2}\), and \({\beta }_{3}\) are three parameters. Applying Eq. 5 to \(p\) and all 26 research areas yields a score for each area, and p is assigned to the research area \({a}_{k}\) whose \(score\left(p,{a}_{k}\right)\) is the largest among the 26 scores. The values of \({\beta }_{1}\), \({\beta }_{2}\), and \({\beta }_{3}\) are determined by Euclidean distance combined with multiple linear regression on a training data set (Wu et al., 2023). Compared with similar methods such as Stacking with MLS and StackingC, this approach achieves comparable performance but is much more efficient, which makes it very suitable for large-scale datasets.

In this study, we assume that each paper belongs to exactly one research area. If required, the method can be modified to support multi-label classification, in which a paper may belong to more than one research area at the same time: with a reasonable threshold \(\tau\), a testing paper \(p\) is assigned to every research area \({a}_{k}\) for which \(score\left(p,{a}_{k}\right)>\tau\). However, this is beyond the scope of this research, and we leave it for future study.

In summary, the proposed instance-based learning (IBL) classification algorithm is sketched as Algorithm 1.

Algorithm 1 IBL
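
As a companion to Algorithm 1, here is a minimal sketch of the decision step of IBL (Eqs. 4–5), assuming the three raw per-area scores have already been computed (e.g., with the similarity sketch above). The beta weights shown are placeholders standing in for the values learned from training data; all names are illustrative.

```python
# Minimal sketch of the IBL decision step (Eqs. 4-5).
def normalize(raw_scores):
    """Normalize a dict of per-area scores so they sum to 1 (Eq. 4)."""
    total = sum(raw_scores.values())
    if total == 0:
        return {a: 0.0 for a in raw_scores}
    return {a: v / total for a, v in raw_scores.items()}

def classify(sim_raw, citing_raw, cited_raw, beta=(1.0, 1.0, 1.0)):
    """Return the research area with the largest combined score (Eq. 5)."""
    sim_n, citing_n, cited_n = map(normalize, (sim_raw, citing_raw, cited_raw))
    b1, b2, b3 = beta
    scores = {a: b1 * sim_n[a] + b2 * citing_n[a] + b3 * cited_n[a] for a in sim_raw}
    return max(scores, key=scores.get), scores
```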

Citation count prediction

As observed above, papers in the same research area may still have different citation patterns, so it is better to treat them with multiple prediction models rather than one unified model. Therefore, within each research area, we divide the papers into up to 10 groups according to the number of citations obtained in the first m years. For the papers in a research area, we count how many of them are cited i times during that period and denote this number by cc(i), where i ranges from 0 to n.

A threshold of 100 is set. We consider the values of cc(0), cc(1),…, cc(n) in order. If cc(0) is greater than or equal to the threshold, we create a group with those papers that received zero citations. Otherwise, we combine cc(0) with cc(1), and if the sum is still less than the threshold, we continue adding the next value cc(2), and so on, until the cumulative sum reaches or exceeds the threshold. At this point, we create a group with all the papers contributing to that cumulative sum. We then move on to the next unassigned value of cc(i) and repeat the process, creating new groups until all papers are assigned to a group. The last group may contain fewer than 100 papers, but it is still considered a valid group.
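
The grouping procedure can be sketched as follows. This is a minimal illustration of the threshold rule described above (the cap of 10 groups mentioned earlier is omitted for brevity), and the function names are illustrative.

```python
# Minimal sketch of the grouping step: `early_counts` holds the early citation
# counts (first m years) of the papers in one research area. Counts are greedily
# merged into groups until each group holds at least `threshold` papers; the
# final group may be smaller.
from collections import Counter

def build_groups(early_counts, threshold=100):
    """Return a list of (low, high) citation-count ranges, one per group."""
    cc = Counter(early_counts)              # cc[i] = number of papers cited i times
    max_c = max(cc) if cc else 0
    groups, low, size = [], 0, 0
    for i in range(max_c + 1):
        size += cc.get(i, 0)
        if size >= threshold:
            groups.append((low, i))         # papers with low <= early count <= i
            low, size = i + 1, 0
    if size > 0 or not groups:
        groups.append((low, max_c))         # last (possibly small) group
    return groups

def group_index(groups, count):
    """Find which group a paper with `count` early citations falls into."""
    for idx, (low, high) in enumerate(groups):
        if low <= count <= high:
            return idx
    return len(groups) - 1                  # counts above the last bound
```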

A regression model is trained for each of these groups for prediction. In the training data set, all the papers are classified by research area and have known citation histories of up to t years. For all the papers belonging to a group \({g}_{i}\) within a research area \({a}_{k}\), we pool their information and consider

$${c{\prime}}_{t}{=w}_{0}{c}_{0}+{w}_{1}{c}_{1}+\dots +{w}_{m}{c}_{m}+b$$
(6)

\({c}_{0}\), \({c}_{1}\), …, \({c}_{m}\) are the citation counts of a paper up to years 0, 1, …, m, and \({c{\prime}}_{t}\) is the predicted citation count in year t (t ≥ m). We train the weights \({w}_{0}\), \({w}_{1}\), …, \({w}_{m}\), and b for this group by multiple linear regression, using \({c}_{0}\), \({c}_{1}\), …, \({c}_{m}\) as independent variables and the actual citation count \({c}_{t}\) as the target variable. The same applies to all other groups and research areas.
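
A minimal sketch of fitting Eq. (6) for one (area, group) pair with ordinary least squares is given below, assuming the early citation histories and the year-t citation counts are available as arrays; the names are illustrative.

```python
# Minimal sketch of fitting Eq. (6) for one group.
# `histories` has shape (num_papers, m+1) with cumulative counts c_0..c_m;
# `targets` holds the actual counts c_t for the same papers.
import numpy as np

def fit_group_model(histories, targets):
    """Return (weights w_0..w_m, intercept b) for one (area, group) pair."""
    X = np.asarray(histories, dtype=float)
    y = np.asarray(targets, dtype=float)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])    # extra column for the intercept b
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1], coef[-1]

def predict_citations(weights, intercept, history):
    """Apply Eq. (6) to a new paper's early citation history c_0..c_m."""
    return float(np.dot(weights, history) + intercept)
```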

To predict the future citation counts of a paper, we need to decide which research area and group that paper should be in. Then the corresponding model can be chosen for the prediction. Algorithm MM is sketched as follows:

Algorithm 2 MM
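
As a companion to Algorithm 2, here is a minimal sketch of the prediction step, assuming the per-(area, group) linear models from Eq. (6) are stored in a dictionary keyed by area and group index, with group boundaries stored per area; all names are illustrative.

```python
# Minimal sketch of the MM prediction step.
# models:          {(area, group_index): (weights, intercept)} from Eq. (6)
# groups_per_area: {area: [(low, high), ...]} citation-count ranges per group
import numpy as np

def mm_predict(area, early_citations, groups_per_area, models):
    """Pick the model matching the paper's area and citation group, then apply Eq. (6)."""
    c_m = early_citations[-1]                    # cumulative citations up to year m
    gidx = len(groups_per_area[area]) - 1        # fall back to the last group
    for i, (low, high) in enumerate(groups_per_area[area]):
        if low <= c_m <= high:
            gidx = i
            break
    weights, intercept = models[(area, gidx)]
    return float(np.dot(weights, early_citations) + intercept)
```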

Note that classification and citation count prediction are two separate tasks. The citation count prediction task requires that every paper involved has a research area label, and this requirement can be satisfied in different ways. For example, the WoS system maintains a list of journals, each assigned to one or two research areas, so all the papers published in those journals are classified by the journals publishing them. In arXiv, an open-access repository of scientific papers, every paper is assigned a research area label by its authors when uploaded. When performing citation count prediction on such datasets, no further classification is needed. However, papers in DBLP are not classified, so it is necessary to classify them before citation count prediction can be performed. In this study, we propose an instance-based learning approach, which provides an efficient and effective solution to this problem.

Experimental settings and results

Datasets

Two datasets were used in this study: one from DBLP and the other from arXiv.

We downloaded a DBLP dataset (Tang et al., 2008). It contains 4,107,340 papers in computer science and 36,624,464 citations from 1961 to 2019. For every paper, the dataset provides its metadata, such as title, abstract, references, authors and their affiliations, publication year, the venue in which the paper was published, and citations since publication. Some subsets of it were used in this study.

For the classification part, we used two subsets of the dataset. The first one (C1) is all the papers published in those 72 recommended venues in CSRankings between 1965 and 2019. There are 191,727 papers. C1 is used as seed papers for all 26 research areas. The second subset (C2) includes 1300 papers, 50 for each research area. Those papers were randomly selected from a group of 54 conferences and journals and judged manually. C2 is used for the testing of the proposed classification method.

For the prediction part, we also used two subsets: one for training and the other for testing. The training dataset (C3) includes selected papers published between 1990 and 1994, and the testing dataset (C4) includes selected papers published in 1995. For the papers published between 1990 and 1994 or in 1995, we removed those that did not receive any citations and those with incomplete information. After this processing, we obtained 38,247 papers for dataset C3 and 9967 papers for dataset C4.

We also downloaded an arXiv dataset (Saier and Farber, 2020). It contains 1,043,126 papers in many research areas, including Physics, Mathematics, Computer Science, and others, with 15,954,664 citations from 1991 to 2022. For every paper, metadata such as title, abstract, references, authors and affiliations, publication year, and citations since publication are provided. Importantly, each paper is given a research area label by its authors, so there is no need to classify papers when using this dataset for citation count prediction. Two subsets were generated for this study: one for training and the other for testing. The training dataset (C5) includes all the papers published between 2008 and 2013, and the testing dataset (C6) includes all the papers published in 2014. There are 5876 papers in dataset C5 and 1471 papers in dataset C6.

Classification results

The CSRankings classification system contains a total of 26 specific research areas, and a few top venues are recommended for each of them. We assume that all the papers published in those recommended venues belong solely to the corresponding research area. For example, three conferences, CVPR, ECCV, and ICCV, are recommended for Computer Vision; we assume that all the papers published in these three conferences belong to the Computer Vision research area and to no other.

To evaluate the proposed method, we used a set of 1300 non-seed papers (C2). It included 50 papers for each research area. All of them were labelled manually. In Eq. 5, three parameters need to be trained. Therefore, we divided those 1300 papers into two equal partitions of 650, and each included the same number of papers in every research area. Then the two-fold cross-validation was performed. Table 1 shows the average performance.

Table 1 Performance of the classification method and its features

We can see that all three features, content similarity (\(sim\)), citations to other papers (\(citingNum\)), and citations by others (\(citedNum\)), are useful for the classification task. Roughly speaking, citations in both directions (\(citingNum+citedNum\)) and content similarity (\(sim\)) have the same discriminative ability. Considering the three features together, we obtain an accuracy and an F-measure approaching 0.8. We are satisfied with this solution. On the one hand, its classification performance is good compared with other methods in the same category, e.g., Ambalavanan and Devarakonda (2020) and Kandimalla et al. (2020). In Kandimalla et al. (2020), F-scores across 81 subject categories are between 0.5 and 0.8 (see Fig. 1 in that paper). In Ambalavanan and Devarakonda (2020), the four models ITL, Cascade Learner, Ensemble-Boolean, and Ensemble-FFN obtain F-scores of 0.553, 0.753, 0.628, and 0.477, respectively, on the Marshall dataset they experimented with (see Table 4 in their paper). Although those results are not directly comparable because the datasets differ, they indicate that our method performs very well. On the other hand, our method can be implemented very efficiently: once the seed papers are indexed, we can process a large collection of papers very quickly with very few resources, so the method is highly scalable.

Setting for the prediction task

For the proposed method MM, we set 10 as the number of groups in each research area for the DBLP dataset, and 5 for the arXiv dataset. This is mainly because the arXiv dataset is smaller and has fewer papers in each research area.

Apart from MM, five baseline prediction methods were used for comparison:

  1. Mean of early years (MEY). A simple prediction function that returns the average of the paper’s early citations as its predicted citations in the future (Abrishami & Aliakbary, 2019).

  2. AVR. Assume that there is a collection of papers with known citation histories as the training data set. For a given paper, this method finds the group of most similar papers in the training set in terms of their early citations (with the minimal sum of squared citation count differences over the years), and then uses the average citations of those similar papers in the subsequent years as the predicted citation counts of the paper (Cao et al., 2016). A sketch of this baseline is given after this list.

  3. RNN. This method adopts a Recurrent Neural Network to predict papers’ future citation counts based on their early citation data (Abrishami & Aliakbary, 2019).

  4. OLS. Linear regression is used as the prediction model (Abramo et al., 2019). There are four variants: OLS_res and OLS_log use only early citations as independent variables, while OLS2_res and OLS2_log additionally use the impact factors of journals. OLS_res and OLS2_res apply a linear transformation to the early citations, while OLS_log and OLS2_log apply a logarithmic transformation.

  5. NCFCM. This method adopts a neural network to predict papers’ future citation counts based on early citation data and the outputs of two simple prediction models (Wang et al., 2021).
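
For illustration, the AVR baseline described in item 2 can be sketched as follows, assuming yearly citation vectors for the training papers and the target paper; the number of neighbours and the data layout are our assumptions for the sketch, not details taken from Cao et al. (2016).

```python
# Minimal sketch of the AVR baseline: average the future citations of the
# training papers whose early citation histories are closest to the query's.
import numpy as np

def avr_predict(train_histories, train_futures, query_early, num_neighbors=10):
    """train_histories: (N, m) early yearly counts; train_futures: (N, f) later counts;
    query_early: (m,) early yearly counts of the target paper."""
    train_histories = np.asarray(train_histories, float)
    train_futures = np.asarray(train_futures, float)
    query_early = np.asarray(query_early, float)
    # similarity = minimal sum of squared citation-count differences over the early years
    dist = np.sum((train_histories - query_early) ** 2, axis=1)
    nearest = np.argsort(dist)[:num_neighbors]
    return train_futures[nearest].mean(axis=0)   # predicted counts for the future years
```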

Evaluation metrics

Two popular metrics are used to evaluate the proposed method and compare it with the baselines: mean square error (MSE) and the coefficient of determination (R2). For a given set of actual values \(Y=\{{y}_{1},{y}_{2},\dots ,{y}_{n}\}\) and a set of predicted values \(\widehat{Y}=\{{\widehat{y}}_{1},{\widehat{y}}_{2},\dots ,{\widehat{y}}_{n}\}\), MSE and R2 are defined as follows:

$$MSE=\frac{1}{n}\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$$
(7)
$${R}^{2}=1-\frac{\sum_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{\sum_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}$$
(8)

where \(\overline{y}\) is the average of the n actual values. MSE measures the deviation of the predicted values from the actual values, so smaller values of MSE are desirable. R2 measures how well the predicted values explain the variation in the actual values; its value is at most 1, and values closer to 1 indicate a better fit, with R2 = 1 meaning a perfect match between the predicted and actual values. Thus larger values of R2 are desirable.
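
For completeness, a minimal sketch of the two metrics (Eqs. 7–8):

```python
# Minimal sketch of MSE and R^2 as defined in Eqs. (7)-(8).
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)
```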

Evaluation results

Evaluation was carried out from two angles: overall performance on all the papers and performance on 100 highly cited papers.

Overall prediction performance

For papers with 0–5 years of citation history, we predict their citation counts in three consecutive future years. The results are shown in Tables 2, 3, 4, 5, 6, 7. “Zero years of early citation data” means that the prediction was made in the same year the paper was published, and “one year of early citation data” means that the prediction was made in the year after the paper was published. Numbers in bold indicate the best performance.

Table 2 Prediction performance of different methods with zero years of early citation data (k = 0)
Table 3 Prediction performance of different methods with 1 year of early citation data (k = 1)
Table 4 Prediction performance of different methods with 2 years of early citation data (k = 2)
Table 5 Prediction performance of different models with 3 years of early citation data (k = 3)
Table 6 Prediction performance of different models with 4 years of early citation data (k = 4)
Table 7 Prediction performance of different models with 5 years of early citation data (k = 5)

One can see that MM performs the best in most cases. In a few cases, OLS2_res performs the best; this is because OLS2_res considers both the paper’s early citation history and the journal’s impact factor, and the latter is not considered in any other method, which gives OLS2_res an advantage, especially when the citation history is very short. In a few cases, RNN performs the best on the arXiv dataset, and in one case AVR and NCFCM tied for first place in R2. Because linear regression is used in both OLS_res and MM, the comparison between them shows that dividing papers into multiple research areas is a very useful strategy for obtaining better prediction performance. See the “Ablation study of MM” section for further experiments and analysis.

Prediction performance of highly cited papers

An important application of citation prediction is the early detection of highly cited papers (Abrishami & Aliakbary, 2019). Therefore, we evaluate the performance of the proposed method and its competitors in predicting highly cited papers. Based on the total citation counts in 2000 (DBLP) and in 2019 (arXiv), the 100 most cited papers were selected for prediction. For these papers, we compute the MSE between the predicted and actual citation counts. The results are shown in Table 8.

Table 8 Performance (in MSE) of each model for predicting 100 highly cited papers

From Table 8, one can see that MM performs better than all the other methods, except when k = 0 (zero years of early citation data), where OLS2_res performs slightly better than MM on the DBLP dataset. In all other cases, MM outperforms the competing methods.

Ablation study of MM

MM mainly takes two factors into consideration: research area and early citation counts. It is desirable to find out how these two factors affect prediction performance; another aspect is the number of groups into which each research area is divided. To examine the impact of these features, we define several variants that implement none or only some of the features of MM:

  1. MM-RA (RA). A variant of the MM algorithm that only considers research area but not early citation counts.

  2. MM-CC (CC). A variant of the MM algorithm that only considers early citation counts.

  3. MM-5. A variant of the MM algorithm that divides the papers in each research area into 5 instead of 10 groups.

  4. MEY. The simplest variant of MM, which considers neither research area nor early citation counts.

Now let us look at how these variants perform compared with the original algorithm; see Tables 9, 10, 11, 12, 13, 14, 15 for the results. It is not surprising that MM performs better than the variants RA, CC, and MEY, and that MEY, the variant with neither of the two components, performs the worst in predicting the citation counts of papers. This demonstrates that both components, research area and early citation counts, are useful for prediction, whether used separately or in combination. However, their usefulness is not the same: the performance of CC is worse than that of RA when k = 0 and k = 1, but better when k > 1. Understandably, this indicates that the research area is a more useful resource than the citation history when the citation history is short, but the latter becomes increasingly useful as the citation history grows longer.

Table 9 Prediction performance of different methods with zero years of early citation data (k = 0)
Table 10 Prediction performance of different methods with 1 year of early citation data (k = 1)
Table 11 Prediction performance of different methods with 2 years of early citation data (k = 2)
Table 12 Prediction performance of different models with 3 years of early citation data (k = 3)
Table 13 Prediction performance of different models with 4 years of early citation data (k = 4)
Table 14 Prediction performance of different models with 5 years of early citation data (k = 5)
Table 15 Performance (in MSE) of each model for the prediction of 100 highly cited papers

When applying the standard MM algorithm, we divide all the papers in one research area into 10 groups based on the number of citations they obtain in the early years. MM-5 reduces the number of groups from 10 to 5 simply by merging each pair of neighbouring groups into one. MM is better than MM-5 in most cases and on average, although the difference between them is usually small. However, MM-5 performs better than MM in two cases, mainly because some of the 10 groups are very small, and predictions based on such small groups are less accurate.

Impact of classification on MM

For the DBLP dataset, some papers were classified automatically through the venues in which they were published, while many others were classified with the classification method IBL. It is interesting to compare these two groups on the prediction task. The results are shown in Table 16. We can see that the group of non-seed papers obtains better prediction results than the group of seed papers by a clear margin in all cases. This demonstrates that the two methods, IBL and MM, can work together well to achieve good prediction results. On the other hand, such a result is a little surprising: why does the non-seed group perform better than the seed group? One major reason is that, for the citation count prediction task, MSE values and citation counts have a strong positive correlation. In this case, there are 2346 seed papers (C7) with an average citation count of 6.339, and 7621 non-seed papers (C8) with an average citation count of 3.085. The two groups are therefore not directly comparable because of the difference in average citation counts. Note that C4 = C7 + C8 (see the “Datasets” section for C4’s definition).

Table 16 Performance comparison (in MSE) of MM with seed and non-seed papers

To make the comparison fair to both parties, we select a subgroup from each of them by adding a restriction: the papers must have a citation count in the range [10, 20] by the year 2000. We obtain 318 papers (C9) from the seed paper group and 418 papers (C10) from the non-seed paper group. Coincidentally, the average citation counts of the two sub-groups are identical: 13.443. This time, the two groups are well matched for comparison. Table 17 shows the results. Not surprisingly, the MSE value pairs are very close. This demonstrates that prediction is equally good for papers classified by our classification algorithm IBL and for papers categorized by the recommended top venues, which also implies that IBL performs the classification properly.

Table 17 Performance comparison (in MSE) of MM with a subset of seed and non-seed papers

Conclusion

Different from previous studies, this paper applies multiple models to predict the citation counts of papers in the next few years, with each model fitted to a specific research area and to the early citation history of the paper in question. The rationale is that, in general, papers in different research areas and with different early citation counts have their own citation patterns. To verify the prediction performance of the proposed method, we tested it on two datasets taken from DBLP and arXiv. The experimental results show that the proposed MM method outperforms all the baseline methods in most cases on two tasks: overall prediction for a large collection of papers and prediction for a group of highly cited papers.

As an important component of prediction for research papers, we have also presented a novel instance-based learning model for classification of research papers. By predefining a small group of papers in each category, the proposed method can classify new papers very efficiently with good accuracy.

As future work, we would like to incorporate other types of information, such as publication venues and author information, to further improve prediction performance. We also plan to explore deep learning methods for research paper classification; for example, such methods could be used to compare the content similarity of two research papers.