1 Introduction

Currently, users can type their preferred keywords into paper-searching websites (e.g., Google Scholar and Baidu Academic) to search for their interested papers, and then, these websites will recommend appropriate papers to them [1]. Generally, a paper just contains partial keywords that a user is interested in, so a paper recommender system must return a set of papers that collectively cover all requested keywords.

As shown in Fig. 1, here is a brief introduction to a process of user’s creation. Figure 1 shows that the user normally achieve his goal by putting the following keywords into research tasks: (1) link prediction addresses the sparsity of cited network, (2) weighting criteria is applied to the link prediction, (3) data mining is concerning about information on the mining paper from a network and is applied to the weighted method, and (4) citation network is for studying the cited relationships among papers. Therefore, the user obtains four corresponding manuscript keywords, i.e., link prediction, weighting criteria, data mining, and citation network, and the user needs to do these keywords searching and research tasks before his writing.

Fig. 1
figure 1

An example of paper keywords research

As shown in Fig. 1, the user obtains a set of keywords including link prediction, weighting criteria, data mining, and citation network. Then, paper-searching websites usually recommend some papers to their users based on those above keywords. As we all know, the keywords of a paper can only represent papers’ topics or themes; therefore, considering keywords only appear in paper-searching process may find a set of papers that belong to different research domains or are actually not correlated, which fails to satisfy the user original requirements on deep and continuous research.

Fortunately, paper citation network that depicts the cited relationships among different papers has provided a promising way to model the correlations among the papers in terms of width and depth perspectives. However, the current paper citation network still faces a big challenge, that is, each paper of the existing paper citation network has slight cited relationships with other papers, so that correlated relationships among papers are also very sparse.

Considering this challenge, we will propose a novel link prediction approach to optimize the existing paper citation network. Furthermore, many previous researches proved that link prediction is the best solution to various network optimization problems [2, 3]. More specifically, link prediction attempts to estimate the likelihood of the existence of a link between two nodes because nodes attribute to information and network structures. In addition, when using our proposal to build new paper relationships (i.e., correlated relationships), we also consider the effect of self-citations from authors and potential correlations among papers (i.e., these correlated relationships are not included in the paper citation network as their published time is close).

Overall, our contributions in this paper are concluded into three aspects below:

  • We propose a novel link prediction approach to construct new relation graphs. Our proposal considers a wide range of factors that influence the correlations among papers, such as paper published time, paper keywords, and paper authors. Furthermore, our link prediction approach takes the network structure of paper citation network into considerations, which makes the predicted results more reasonable and convincing.

  • We optimize the existing paper citation network by reducing the negative influence of intentional self-citations from partial authors.

  • At last, extensive experiments are performed on a real-world paper dataset to demonstrate the actual capability of our method of dealing with the network sparsity problem.

The rest of paper is organized as follows. Related work is presented in Section 2. In Section 3, we introduce the research motivation. In Section 4, the detail of our proposed link prediction approach is described. Next, Section 5 discusses the experimental datasets (i.e., Hep-Th) and experimental evaluated metrics and mainly analyzes the experimental results. Finally, in Section 6, we have summarized our proposal as well as future research topics.

2 Related work

Link prediction is a significant research content and approach of optimizing various network. To the best of our knowledge, an essential fact of the link prediction is that node attributes to those known information and network structure features, so link prediction methods can easily find the missing links. Besides, these methods can build new links (i.e., correlated links) between two nodes without connection. Thus, the link prediction can effectively address a core problem of our proposal, i.e., solve the sparsity in the existing paper citation network.

Currently, link prediction has made massive strides and plays an important role in many research areas. For example, new friends through link prediction can be found in social network [4] and protein-protein interactions can also be found [5]. Link prediction approaches can be classified into three categories: similarity-based methods, maximum likelihood approaches, and probabilistic methods [6]. As far as we know, the similarity-based methods can be used to the large-scale networks, which is because it can calculate the similarity scores between two nodes [7]. Although maximum likelihood approaches can obtain specific parameters and probabilistic methods can predict missing links by using the trained model, maximum likelihood approaches and probabilistic methods cannot dispose of the broad-scale networks [8]. Therefore, we mainly consider the similarity-based approach in our research.

Generally, the similarity-based approach can also be classified into two categories: the network structure-based similarity methods and the node attribute-based similarity methods. The node attribute-based similarity methods mainly focus on the node attribute to information of finding the similar nodes, so these methods are a significant way to form node pairs. Furthermore, these methods also solve the cold-start problem for link prediction research, e.g., Wang et al. [9] used the node attribute information (e.g., user profile) to address the cold-start problem on Twitter and Facebook. In addition, the network structure-based similarity method allocates similarity scores to the node pairs according to the structure features of networks. Currently, the network structure-based similarity method mainly contains four categories, i.e., local approaches, global approaches, quasi-local approaches, and community-based approaches [10]. Here, we mainly pay attention to the local similarity-based approaches, because it calculates the similarity scores of two nodes without connection based on the nodes’ neighboring structural features; furthermore, some common index of the local approaches can be used in the large-scale networks, e.g., Common Neighbors index (CN), Jaccard Coefficient (JC), Adamic–Adar index (AA), and Resource Allocation index (RA).

Many of link prediction researches only concentrate on unweighted networks, but actually, many real-world networks can be weighted. For example, edge weighting value can represent the strength of connection in brain networks and the number of flights in airline networks [11], respectively. For the social network, the work [12] uses local weighted similarity functions to calculate the weighting value of two nodes without connection. Besides, this work also proves that the weak ties have an effect on link prediction. In addition, [13] shows that the ties of spouses or romantic partners play an important role in the social network, so these ties can be regarded as one of the significant edge-weighted ways in the link prediction. Recently, the work in [14] is carrying out a study into the effects of the strength of link in the social network and proposes weighting criterion for link prediction model according to users’ data information and the number of interactions among users. However, in their weighting criterion, their work does not take full advantage of the node and its attribute information.

In view of the above research content, we know that the link prediction is one of the significant approaches to solve network sparsity, as it specializes in predicting the missing/correlated links among two nodes without connection. Thus, we propose a novel link prediction approach to construct the paper correlated graph, that is, the similarity-based weighting method.

3 Research motivation

In our paper, we focus on the following key issue: how to solve the sparsity of the existing paper citation network? As for this problem, link prediction approach is the best solution. Furthermore, in the process of building a correlated relationship on the paper citation network, we consider the effect of self-citations from authors and potential correlations among the papers, which are not included in citation network but with close published time.

An intuitive example is presented in Fig. 2 to motivate our approach. Assume that Fig. 2a and b are two parts: paper citation network GC and paper correlated graph Gp, respectively. Each of the network contains the same 10 nodes, i.e., v1, …, v10; each node represents a paper and contains some attribute information (i.e., paper time, paper keywords, and paper authors). As shown in Fig. 2a, nodes v1 and v2 have cited relationship that is mainly because they have common authors (i.e., authors a1 and a2). This phenomenon (i.e., v1 cites v2) is called self-citation. Therefore, in this paper, we reduce the effect of the intentional self-citations through a weighting model. In addition, nodes v1 and v10 have the same attribute information (i.e., keywords k1, k3, and k4 and authors a1, a2, and a3); however, they do not have direct relationship that is mainly because they have the same published year. Hence, we will establish the new link (i.e., correlated relationship) between two similar nodes by using link prediction approach, e.g., nodes v1 and v10 can build the correlated relationship in Fig. 2b. In view of the analysis mentioned above, we know that the link prediction approach is necessary to optimize current paper citation network, which will be introduced in detail in Section 4.

Fig. 2
figure 2

a and b are part of the paper citation network and the paper correlated graph, respectively

4 Link prediction method

According to the above analysis, we propose a novel link prediction model to optimize existing paper citation network. As shown in Fig. 3, our link prediction process follows a task sequence [15], and this task sequence mainly consists of the following five activities:

  • Activity 1: Pre-processing of the network. In order to construct a paper correlated graph, the paper citation network is regarded as an undirected paper citation network (G).

  • Activity 2: To divide paper citation network. G is partitioned into two parts, i.e., Gtrain and Gtest. In the Gtrain, we need to get the average score from existing pairs of nodes. Furthermore, in the Gtest, we need to get the weighting value of the two nodes without connection.

  • Activity 3: Network to be weighted. In the Gtrain, the weighting value of the two connected nodes are calculated by using the weighting criteria, and the weighting value of two nodes without connection are calculated in the Gtest.

  • Activity 4: Score calculation and ranking. (1) Firstly, we use two weighted similarity function formulas WCN and WJC [16] to calculate the weighting value of two nodes without connection in the Gtrain. Next, we can obtain a ranking list in order to descend weighting value. At last, the average score is saved in waverage(vitrain , vjtrain).

Fig. 3
figure 3

Process for weighting-based link prediction

Two weighted similarity functions are as follow:

  1. (A)

    Weighted Common Neighbor - WCN(vitrain, vjtrain). Which is:

$$ {\sum}_{v_{z\mathrm{train}}\in \Gamma \left({v}_{i\mathrm{train}}\right)\cap \Gamma \left({v}_{j\mathrm{train}}\right)}\frac{w\left({v}_{i\mathrm{train}},{v}_{z\mathrm{train}}\right)+w\left({v}_{j\mathrm{train}},{v}_{z\mathrm{train}}\right)}{2} $$
(1)
$$ {w}_{\mathrm{train}}^{\mathrm{WCN}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)=\frac{\mathrm{WCN}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)}{\left|\Gamma \left({v}_{i\mathrm{train}}\right)\cap \Gamma \left({v}_{j\mathrm{train}}\right)\right|} $$
(2)

where Eq. (2) calculates the actual weighting value between nodes vitrain and vjtrain, |Γ(vitrain) ∩ Γ(vjtrain)| represents the number of common nodes between nodes vitrain and vjtrain.

  1. (B)

    Weighted Jaccard Coefficient - WJC(vitrain, vjtrain). Which is:

$$ \frac{\sum_{v_{z\mathrm{train}}\in \Gamma \left({v}_{i\mathrm{train}}\right)\cap \Gamma \left({v}_{j\mathrm{train}}\right)}\frac{w\left({v}_{i\mathrm{train}},{v}_{z\mathrm{train}}\right)+w\left({v}_{j\mathrm{train}},{v}_{z\mathrm{train}}\right)}{2}}{\sum_{v_i^{\prime}\in \Gamma \left({v}_{i\mathrm{train}}\right)}w\left({v}_{i\mathrm{train}},{v}_i^{\prime}\right)+{\sum}_{v_j^{\prime}\in \Gamma \left({v}_{j\mathrm{train}}\right)}w\left({v}_{j\mathrm{train}},{v}_j^{\prime}\right)} $$
(3)
$$ {w}_{\mathrm{train}}^{\mathrm{WJC}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)=\frac{\mathrm{WJC}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)}{\left|\Gamma \left({v}_{i\mathrm{train}}\right)\cap \Gamma \left({v}_{j\mathrm{train}}\right)\right|} $$
(4)

where Eq. (4) calculates the actual weighting value between nodes vitrain and vjtrain.

$$ {w}_{\mathrm{max}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)=\arg \underset{i,j=\overline{1,N}}{\max }{w}_{\mathrm{train}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right) $$
(5)
$$ {w}_{\mathrm{min}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)=\arg \underset{i,j=\overline{1,N}}{\min }{w}_{\mathrm{train}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right) $$
(6)

Equations (5) and (6) are used to find the maximum value and the minimum value in the Gtrain.

Since the available paper citation datasets are very sparse, the greatest challenge is how to find correlated relationships among papers. Besides, if we select high threshold value, the correlated links among papers will not be predicted and built in the existing paper citation network, i.e., the chances to find correlated papers decrease for papers. To guarantee the accuracy of predicting and building these correlated links among papers, we select the threshold value same as the average score, i.e., waverage(vitrain , vjtrain). The average score is as follows:

$$ {w}_{\mathrm{average}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)=\frac{w_{\mathrm{max}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)+{w}_{\mathrm{min}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)}{2} $$
(7)

(2) In the Gtest, we will perform score calculation of two nodes without connection and produce a descending ranking list.

  • Activity 5: Connected nodes. LP (link prediction) is defined as in Eq. (8):

$$ \mathrm{LP}=\left\{{w}_{\mathrm{test}}\left({v}_{i\mathrm{test}},{v}_{j\mathrm{test}}\right)\ge {w}_{\mathrm{average}}\left({v}_{i\mathrm{train}},{v}_{j\mathrm{train}}\right)\right\} $$
(8)

4.1 Proposed weighting criteria

Consider the case that undirected paper citation network (G) contains node attribute information (paper time, paper keywords, and paper authors). Furthermore, the existing link prediction approach provides some similarity functions for our proposal. Thus, our proposed weighting model will be described in Eq. (9), time ∈ Timekeyword ∈ Keywordauthor ∈ Author and xtime, xkeyword, xauthor ∈ {0, 1}.

$$ {w}^{\ast}\left({v}_i,{v}_j\right)={\mathrm{time}}^{x_{\mathrm{time}}}\times {\mathrm{keyword}}^{x_{\mathrm{keyword}}}\times {\mathrm{author}}^{x_{\mathrm{author}}} $$
(9)

Here, we propose weighting model which can generate two disparate weighting criteria based on the Eq. (9). Please note that the product between paper’s data in weighting criteria emphasizes the fact that the selected data information must be concurrently calculated. Thus, the two disparate weighting criteria are as follows:

4.1.1 Keywords and authors’ weighting criteria

In our research, if the number of common keywords and co-authors of two papers increases, the weighting value between two nodes will be greater. However, when two papers do not have common keywords, the value between the two papers will decrease as the number of co-authors increases. Such strategies have been adopted to reduce the effect of self-citations. Therefore, the weighting criteria for a pair of nodes vi and vj are defined as in Eqs. (10)–(13):

$$ r=\left\{\begin{array}{l}1\kern0.75em \mathrm{Contains}\ \mathrm{common}\ \mathrm{keywords}\\ {}0\kern0.75em \mathrm{Otherwise}\end{array}\right. $$
(10)
$$ \mathrm{cosine}\left({K}_{v_i},{K}_{v_j}\right)=\frac{\left|{K}_{v_i}\cap {K}_{v_j}\right|}{\sqrt{\left|{K}_{v_i}\right|}\times \sqrt{\left|{K}_{v_j}\right|}} $$
(11)
$$ \mathrm{cosine}\left({A}_{v_i}^a,{A}_{v_j}^a\right)=\frac{\left|{A}_{v_i}^a\cap {A}_{v_j}^a\right|}{\sqrt{\left|{A}_{v_i}^a\right|}\times \sqrt{\left|{A}_{v_j}^a\right|}} $$
(12)
$$ {w}^{\mathrm{KA}}\left({v}_i,{v}_j\right)=C\times \left(r\right.\times {\beta}^{\left(1-\mathrm{cosine}\left({K}_{v_i},{K}_{v_j}\right)\right)}\times {\alpha}^{\left(1-\right.\left.\mathrm{cosine}\left({A}_{v_i}^a,{A}_{v_j}^a\right)\right)}+\beta \times \left.{\alpha}^{\mathrm{cosine}\left({A}_{v_i}^a,{A}_{v_j}^a\right)}\times \left(1-r\right)\right) $$
(13)

where α and β (0 < α, β < 1) are arbitrary damping parameters and they are used to calibrate the importance of paper authors and paper keywords in the weighting criteria. \( {A}_{v_i}^a \) (\( {A}_{v_j}^a \)) and \( {K}_{v_i} \) (\( {K}_{v_j} \)) are a set of authors and keywords, respectively. \( {A}_{v_i}^a\cap {A}_{v_j}^a \) and \( {K}_{v_i}\cap {K}_{v_j} \) represent the co-authors and common keywords, respectively. \( \mathrm{cosine}\left({A}_{v_i}^a,{A}_{v_j}^a\right) \) and \( \mathrm{cosine}\left({K}_{v_i},{K}_{v_j}\right) \) denote an approach that computes the similarity between the two nodes vi and vj, respectively. A constant C is defined for convenience of calculation.

4.1.2 Time, keywords, and authors’ weighting criteria

According to the above analysis, we know the effect of paper keywords and paper authors on the weighting criteria. Here, we then try to find the role of paper time in the weighting model, i.e., if the published time of two papers ia relatively close, the weighting value between the two nodes will be greater. Therefore, the weighting criteria of a pair of nodes vi and vj is defined with Eqs. (14)–(15):

$$ k(t)=\left\{\begin{array}{l}0.5\kern5.5em if\ {t}_{v_i}={t}_{v_j}\\ {}\frac{1}{1+{e}^{-\left|{t}_{v_i}-{t}_{v_j}\right|}}\kern2.5em if\ {t}_{v_i}\ne {t}_{v_j}\ \end{array}\right. $$
(14)
$$ {w}^{\mathrm{TKA}}\left({v}_i,{v}_j\right)=C\times {\lambda}^{k\left(\mathrm{t}\right)}\times \left(r\right.\times {\beta}^{\left(1-\mathrm{cosine}\left({K}_{v_i},{K}_{vj}\right)\right)}\times {\alpha}^{\left(1-\right.\left.\mathrm{cosine}\left({A}_{v_i}^a,{A}_{v_j}^a\right)\right)}+\beta \times \left.{\alpha}^{\mathrm{cosine}\left({A}_{v_i}^a,{A}_{v_j}^a\right)}\times \left(1-r\right)\right) $$
(15)

where parameter λ (0 < λ < 1) adjusts the effect of paper time on the weighting criteria. Furthermore, \( {t}_{v_i} \) and \( {t}_{v_j} \) indicate the published time of nodes vi and vj.

Note that, if there are no common keywords and co-authors among two papers, the weighting value of them will be set to the fixed value, namely wn − co = 0.1. In addition, as for synonymy, word inflections and polysemy are tackled with automatic query expansion techniques [17]. However, it is out of the scope of this paper research.

4.2 Paper correlated graph

Here, we define the undirected relation network as a paper correlated graph.

Definition 1. Paper correlated graph: Paper correlated graph is represented by Gp = {Vp, Ep}, where Vp and Ep denote its set of nodes and edges, respectively. Furthermore, for each of node pairs (vi, vj), the paper correlated graph has a corresponding edge e(vi, vj).

5 Experiments

In this section, large-scale experiments are designed and tested on a real-world paper citation dataset which demonstrates the usefulness and effectiveness of our link prediction approach.

5.1 Experimental environments

5.1.1 Experimental tools

The proposed link prediction approach is implemented in PyCharm and executed under the environment of Intel(R) Core(R) CPU @3.0 GHz, 16 GB RAM, and Windows 10 @ 1809, 64bit operating system.

5.1.2 Dataset

A paper citation network was extracted from the available Hep-Th dataset [18]. In the paper citation network, a node represents a paper and an edge indicates that two specific nodes have cited relationship. Furthermore, each node stores the paper’s published time, keywords, and authors’ information. Besides, we will use the information of each paper that the title and abstract are used to construct a set of keywords; here, we mainly use RAKE (Rapid Automatic Keyword Extraction algorithm) method to construct a set of keywords, which is because this method analyzes the frequency of words appearing and their co-occurrence with other words is used to identify keywords or phrases in the body of a text.

Our experimental process of link prediction also follows the task sequence that is depicted in Fig. 3. First, the existing paper citation network is partitioned into two parts according to their published time. So, the existing paper citation network from the Hep-Th dataset is partitioned into Gtrain = [1997, 1999] and Gtest = [2000, 2002]. The Gtrain contains 7304 nodes (papers) and 56,376 edges; likewise, the Gtest contains 8721 nodes (papers) and 70,045 edges. Next, we need to configure parameters in our experiment (i.e., α, β, and λ). In this experimental process, we need to finetune each parameter of two different weighting criteria to find the accurate and credible parameter values. Thus, for parameter values α, β, and λ, we first range the values of α from 0.3 to 0.9 with step 0.2, we range the values of β from 0.3 to 0.9 with step 0.1; and finally, we set the values of λ to 0.3 and 0.9 to reflect the factors of published time. In addition, in order to calculate a value of two nodes without connection in the Gtrain, we select two disparate weighted similarity functions, i.e., WCN and WJC, as well as these two disparate weighted similarity functions are usually used in different link prediction research. Next, these two weighted similarity functions will be combined with two different weighting criteria (i.e., Keywords & Authors (KA) and Time & Keywords & Authors (TKA)). Thus, we will obtain four disparate functions (i.e., WCNKA, WCNTKA, WJCKA, WCNTKA) and apply these functions into our experiments.

5.1.3 Evaluation criteria

  1. (1)

    AUC (area under the receiver operating characteristic curve [19]). The area under the ROC (receiver operating characteristic) curve can demonstrate link prediction methods accuracy, so the AUC can be regarded as measure index. In our research, we first assign a weighting value for each correlated links and non-existent links in the paper citation network. Next, we will randomly select correlated links and non-existent links and compare their values. Finally, we can obtain AUC value by acting the Eq. (16).

$$ \mathrm{AUC}=\frac{n^{\prime }+0.5{n}^{{\prime\prime} }}{n} $$
(16)

where n demonstrates independent comparisons, the randomly selected correlated links are given higher weighting value n and the same weighting value n′′ times.

  1. (2)

    Edge number. For link prediction approach, we argue that the newer generated edges it has, the approach will be better. Therefore, we test the number of the new generated edges of the output graph (i.e., the paper correlated graph).

In this paper, we focus on the link prediction approach based on the paper time, paper keywords, and paper authors’ information of the paper citation network. As far as we know, there are few existing approaches that solve the same problem. Therefore, we compare our proposal approach with the following two approaches:

  1. 1)

    Random [20]: On the test network, if two nodes without connection have more than half of the number of same keywords, these two nodes without connection will generate new edges.

  2. 2)

    Maximum [20]: If the weighting value of two nodes without connection in the test network is greater than the maximum value found in the training network, these two nodes without connection will generate new edges.

5.2 Experimental results

5.2.1 Profile 1: the AUC value of five approaches

In this profile, we compare five different approaches by using the AUC. Figures 4, 5, 6, and 7 show that the experimental results are mostly different in terms of different weighted similarity functions (i.e., WCN and WJC) and different parameters α, β, and λ. That is because the different weighted similarity functions and the different parameters values play a vital influence on different approaches.

Fig. 4
figure 4

The different parameters are employed in the weighted similarity function WCN, and λ = 0.3 to obtain AUC values

Fig. 5
figure 5

The different parameters are employed in the weighted similarity function WCN, and λ = 0.9 to obtain AUC values

Fig. 6
figure 6

The different parameters are employed in the weighted similarity function WJC, and λ = 0.3 to obtain AUC values

Fig. 7
figure 7

The different parameters are employed in the weighted similarity function WJC and λ = 0.9 to obtain AUC values

As shown in Figs. 4, 5, 6, and 7, for our proposal (i.e., Our-KA and Our-TKA) and the maximum approach (i.e., Maximum-KA and Maximum-TKA), with the same weighted similarity functions, the value of AUC generally increases as the parameters value increases. That is because the node attribute information plays an increasingly important effect on link prediction during the process of parameter value increasing. However, for the random approach, the value of AUC, it is not affected by the parameters value and the weighted similarity functions. In addition, Figs. 4, 5, 6, and 7 show that, with the same weighted similarity functions and the parameters value, the AUC values obtained by our proposal are generally greater than those obtained by the maximum and random approaches. Further, it shows that our proposal is superior to other approaches. As far as we know, the larger value of AUC means that our proposal can better improve the existing paper citation network, i.e., our proposal played a significant role in lightening the sparsity of the existing paper citation network.

5.2.2 Profile 2: the number of new edges built by five approaches

In this profile, we compare five different approaches by using the number produced by new edges. Tables 1, 2, 3, and 4 present the predicted results by combining different parameters which are originated from the weighted similarity functions WCN and WJC. As shown in Tables 1, 2, 3, and 4, the random approach can obtain the same results, i.e., the numbers of building new edges are 4, which further indicates that the different weighted similarity functions and the different parameters value seldom have impact on this approach. However, for other approaches, Tables 1, 2, 3, and 4 show that we will mostly gain different experimental results under different weighted similarity functions and parameters values, which is because the different weighted similarity functions and the different parameter values have an important impact on building the new edges.

Table 1 The different parameters are employed in the weighted similarity function WCN, and λ = 0.3 to obtain the edge number values
Table 2 The different parameters are employed in the weighted similarity function WCN, and λ = 0.9 to obtain the edge number values
Table 3 The different parameters are employed in the weighted similarity function WJC, and λ = 0.3 to obtain the edge number values
Table 4 The different parameters are employed in the weighted similarity function WJC, and λ = 0.9 to obtain the edge number values

As shown in Tables 1, 2, 3, and 4, for our proposal (i.e., Our-KA and Our-TKA) and the maximum approach (i.e., Maximum-KA and Maximum-TKA), under the same weighted similarity functions, the number of new edges would increase as the parameter value increases. That is because the node attribute information plays an increasingly important role in link prediction as the parameter value increases. Furthermore, the new edges built by our approach are larger than the maximum and random approaches, which show that our proposal is superior to other approaches. In addition, for our proposal, with the same parameter value and the same weighting criteria (i.e., KA and TKA), we find that the number of the new edges built by the weighted similarity function WJC is generally greater than that built by WCN. It shows that the WJC achieves better results in link prediction than the WCN. According to the above analysis, our proposal can effectively solve sparsity of the existing paper citation network.

5.2.3 Profile 3: investigate the performance in different parameter value setting and the weighted similarity functions

In this profile, we compare the different parameters value setting and the weighted similarity functions by using the AUC. Figures 8 and 9 present the predicted results by parameters α and β combined within the same weighting criteria. Here, we only consider the effect of paper keywords and paper authors’ information within the weighting criteria. As shown in Fig. 8, under the same weighted similarity function, three curves (i.e., α = 0.3, α = 0.5, and α = 0.7) show that the value of AUC increases with the parameter value (β) increasing. Besides, the curve of α = 0.7 can converge. Likewise, in Fig. 9, α = 0.3, α = 0.5, and α = 0.7, these three curves also show that the value of AUC would increase as the parameter value (β) increases, and these curves all converge. Here, Figs. 8 and 9 indirectly show the reason why the Our-KA approach in Figs. 4, 5, 6, and 7 suddenly increases sharply with the increasing parameter value. In addition, according to the comparison between Figs. 8 and 9, under the same parameters, the AUC value obtained by WJCKA is generally greater than that obtained by WCNKA, which further shows that the WJC in our proposal achieves better results than the WCN.

Fig. 8
figure 8

The different parameters are employed in the function WCNKA to obtain AUC values. α = 0.3, α = 0.5, α = 0.7

Fig. 9
figure 9

The different parameters are employed in the function WJCKA to obtain AUC values. α = 0.3, α = 0.5, α = 0.7

Figures 10, 11, 12, and 13 present the prediction results by different parameter value combinations on the weighting criteria. Here, we mainly consider the combined effect of paper time, paper keywords, and paper authors’ information within the weighting criteria. According to the analysis of Figs. 8 and 9, we know the effect on the paper keywords and paper authors; therefore, we further analyze the role of paper published-time information within the weighting criteria. According to the comparison between Figs. 10 and 11, or Figs. 12 and 13, the value of AUC increases as the parameter value λ increases. Here, Figs. 10, 11, 12, and 13 also indirectly show the reason why the Our-TKA approach in Figs. 4, 5, 6, and 7 suddenly increases sharply with increasing parameters. In addition, according to the comparison between Figs. 10 and 12, or Figs. 11 and 13, under the same parameters setting, the AUC value obtained by WJCTKA is also generally greater than that obtained by WCNTKA, which also further shows that the WJC in our proposal achieves better results than the WCN.

Fig. 10
figure 10

The different parameters are employed in the function WCNTKA, and λ = 0.3 to obtain AUC values. α = 0.3, α = 0.5, α = 0.7

Fig. 11
figure 11

The different parameters are employed in the function WCNTKA, and λ = 0.9 to obtain AUC values. α = 0.3, α = 0.5, α = 0.7

Fig. 12
figure 12

The different parameters are employed in the function WJCTKA, and λ = 0.3 to obtain AUC values. α = 0.3, α = 0.5, α = 0.7

Fig. 13
figure 13

The different parameters are employed in the function WJCTKA, and λ = 0.9 to obtain AUC values. α = 0.3, α = 0.5, α = 0.7

5.2.4 Profile 4: performance comparison in different functions

In this profile, under the same weighted similarity functions, we compare the different weighting criteria by using the values of AUC and the edges’ number. According to the experimental results, we get the best results of the function WCNKA, i.e., the value of AUC is 0.9992 and the value of the edge number is 26344478. Similarly, we also acquire the best results of the other three kind of functions (i.e., WCNTKA, WJCKA, WJCTKA). Therefore, Table 5 shows the best results of four functions in terms of AUC and the edge’s number. As shown in Table 5, the values of AUC and the edge’s number within the TKA weighting criteria is greater than that within the KA weighting criteria, which indicates that the weighting criteria of TKA in our proposal achieves better results than the weighting criteria of KA. Hence, the experimental results suggest that combining paper time, paper keywords, and paper authors’ information can enhance link prediction performance significantly and optimize the existing paper citation network effectively.

Table 5 The result of performance metrics in the different functions

5.3 Further discussions

However, there are still some shortages in our proposed link prediction approach. Firstly, there are many attributes of node on the existing paper citation network, and it is hard to obtain these attribute information [21, 22] to further optimize paper citation network. Secondly, since the process of obtaining data may involve some privacy issues [23,24,25,26,27,28,29,30,31,32,33], our work will further consider the privacy-preservation effects when making link prediction. Finally, more complex multi-dimensional or multi-criterion application scenarios [34,35,36,37,38,39,40,41,42,43,44] should be considered in the future to make our proposal more comprehensive.

6 Conclusions

Predicting whether two correlated papers will build correlated links in an existing paper citation network is a significant analysis task, which is regarded as a link prediction problem. To find and build correlated links in the existing paper citation network, we put forward a novel link prediction approach. The novel link prediction approach not only has advantages of predicting and building correlated links, but also helps with alleviating the current paper citation network sparsity. Furthermore, we also use the combination of paper time, paper keywords, and paper authors’ information to reduce the effect of the self-citations. Since the weighting value of nodes pair in the paper citation network is obtained from calculating its attribute information, the experimental results can reflect the actual weighting value of a node pair as accurately as possible. Finally, the feasibility of our proposal is validated by a real-world dataset.

In the future, we will continue to refine our work by considering more complex scenarios, such as privacy-aware or multi-dimensional link prediction problems.