1 Introduction

The process of predicting the probability of a link between two unconnected nodes using the network’s characteristics, structure, and node information is called link prediction [12]. Link prediction covers both implicit links (links that exist in the real-world network but are difficult to observe) and future links (links that do not yet exist in the network but may appear as the network evolves under the influence of external information and other factors).

The application of link prediction has brought great convenience to human society. For example, link prediction is used in social networks to predict the likelihood that two people who do not know each other will become friends in future [8], and it is also used in protein interaction networks to discover the protein structures that are most likely to interact, and then do interaction experiments, which can increase the experiment success rate [25, 37]. With the explosive growth of data, there are more and more incorrect links and implicit links in the networks, and the incorrect links can be corrected and deleted through link prediction methods, then the implicit links can be shown [29, 39]. Link prediction is also common in network modeling, recommendation systems [5, 36], node classification, network reconstruction, knowledge acquisition [19], knowledge question and answering [28], and knowledge graph [2, 17, 34], among other applications.

The gravitational field models the mutual gravitational force that two objects with mass exert on each other in space. The magnitude of the force is determined by the masses of the objects, the distance between them, and the universal gravitational constant: the greater the masses and the smaller the distance, the greater the gravitational force. This is quite similar to the node-to-node interaction in complex networks. The more important two nodes are and the shorter the distance between them, the stronger the relationship between them, and hence the higher the probability that they will connect in future. Based on this consideration, the universal gravitational equation is introduced to model the strength of the relationship between nodes in complex networks, and the problem of computing the future connection probability between nodes is projected into physical space to solve it.

There are numerous methods for evaluating the importance of nodes, including degree centrality-based evaluation methods [3], betweenness-centrality-based evaluation methods [11], and node deletion-based methods [6]. However, the degree centrality-based evaluation methods ignore that some nodes with small degree play a vital role in the network, such as the bridge node which connects two clusters. The time complexity of the betweenness-centrality-based evaluation method is high since it requires calculating the number of shortest paths passing through the target node. The method based on node deletion may cause the network to be disconnected, thus making the importance of the node equal to 0, which is inaccurate. Therefore, [32] propose a node contraction approach for assessing node importance. The node importance is evaluated by comparing the network’s cohesion before and after contracting the node. After contracting the node, the network becomes more cohesive, indicating that the node is more significant. This method overcomes the problem of network disconnection caused by node deletion.

There are many methods to evaluate the distance between nodes, such as finding the shortest path between nodes, common neighbors of two nodes, and similarity between nodes. We utilize the similarity between nodes to estimate the distance between nodes because most algorithms based on node similarity consider not only the shortest path between nodes but also other information between nodes at the same time.

Therefore, we first use the node importance calculated by the node contraction method to measure the mass property in the universal gravitational equation, and the similarity between nodes to measure the distance property. We then consider both the direct and indirect gravitational values between nodes, propose a novel link prediction framework based on gravitational field, abbreviated as LPFGF, and obtain the framework equation for calculating the similarity between nodes. Finally, we conduct simulation experiments on four real datasets and compare the prediction performance of existing link prediction algorithms before and after using the framework. The results demonstrate that the proposed framework is effective and feasible.

Therefore, the innovation of this paper mainly includes the following three points.

  1. The gravitational equation is introduced to measure the strength of the relationship between nodes in this paper. Considering that the nodes with large degree are not necessarily the most influential nodes in the network and the connectivity of the network, the node contraction method is used to measure the importance of nodes. Most existing link prediction algorithms consider both path and distance information when evaluating the similarity between nodes and participating in the calculation process, so the similarity between nodes is used to measure the distance between nodes. In general, the more similar the nodes are, the fewer nodes pass between them, which is more appropriate for the actual gravitational field model and can be extended to other networks, such as social networks and traffic networks.

  2. We not only consider the forces between directly connected nodes, but also the forces between nodes that are not directly connected.

  3. Unlike the link prediction algorithms proposed by other researchers, the proposed LPFGF framework can be used in other link prediction algorithms and improve their link prediction performance. In addition, without the supervised signal of nodes, the proposed LPFGF framework has better link prediction performance than some pre-classical network representation learning algorithms.

2 Related Work

With the rapid evolution of complex networks, researchers have given link prediction a great deal of attention, and several link prediction methods have been proposed [35]. The following are the three main categories that exist right now.

The first category is node similarity-based methods. The greater the similarity between two nodes, the higher the possibility that they are connected. The time complexity of this category is low, but the prediction accuracy is poor. It contains the following four sub-categories. (1) Similarity indexes based on local network information. These mainly include CN, PA [1], Salton [10], Jaccard [1], and LHN-I [19]. Liu et al. [22] combine CN, AA, and Resource Allocation (RA) with the local naive Bayes algorithm and propose three hybrid similarity indexes: LNBCN, LNBAA, and LNBRA. In addition, [10] propose a node-weighted link prediction model, which considers the number and weight of common neighbors between nodes. Tang [31] consider the influence of common neighbors on the accuracy of any algorithm under the assumption of varied network architectures and propose an algorithm for predicting the missing links of small promoted index via common neighbors, which considers the priority of small promoted index based on common neighbors. (2) Path-based similarity indexes. These mainly include Local Path (LP) [1], Katz [1], and LHN-II [22]. In addition, [14] offer a link prediction approach based on high path similarity that uses the path as the judgment characteristic to find missing links and restricts information leakage by punishing public neighbor pairs. Ding [9] consider the structural characteristics of authors in the scientific research cooperation network, propose a weighted mining algorithm that combines path similarity indexes with certain weights, and conduct an empirical test and analysis with the biomedical research field in China as an example. (3) Random walk-based similarity indexes. These mainly include Average Commute Time (ACT) [1], Local Random Walk (LRW) [1], and Random Walk with Restart (RWR) [1]. Albert et al. fully utilize the degree information of neighbor nodes based on the Metropolis-Hasting (MH) algorithm, fuse it with the Random Walk with Restart similarity algorithm, and propose an improved link prediction algorithm based on the MH algorithm, which shows good prediction performance [24]. (4) Other similarity indexes, for example, the Matrix-Forest Theory Index (MFI) [4], Transferring Similarity Common Neighbor (TSCN) [23], and Transferring Similarity Preferential Attachment (TSPA) [23].

The second category is maximum likelihood estimation-based methods. These algorithms calculate the likelihood of the network from the generation and organization patterns of the network structure and the observed edges, assume that the real network maximizes this likelihood, and then calculate the connection probability between unconnected node pairs. The prediction accuracy of this category is higher, but so is the time complexity. A hierarchical structure model is first established and a maximum likelihood-based link prediction algorithm is then proposed, which shows good prediction performance on networks with an obvious hierarchical structure [7]. Clauset [40, 41] construct a benchmark dataset for hierarchical link prediction, named TeleGraph. A maximum likelihood analysis-based link prediction framework built on a closed-circuit model is also proposed, and the experimental results show that, once an appropriate Hamiltonian of the network is defined, the model’s prediction accuracy is better than that of the hierarchy model [26].

The third category is machine learning-based methods. Link prediction can be treated as a machine learning classification problem. The prediction accuracy of this category is higher, but most machine learning algorithms require supervised signals. Machine learning was first used to predict links in 2007, and the experimental results show better prediction performance [20]. Zhang [38] use a graph neural network (GNN) to develop a new link prediction approach that learns heuristics from local subgraphs, and the experimental findings demonstrate unprecedented performance. More recently, [18] utilize a variational autoencoder to propose a link prediction method for temporal networks, which produces low-dimensional, dense representation vectors of nodes while preserving the dynamic nonlinear properties of temporal networks.

All of the aforementioned algorithms calculate the connection probability between node pairs using the network’s local information structure or node information, rather than calculating the probability from the physical space. Currently, there are some studies on introducing gravitational fields into complex networks. For example, in 2005, Academician Li regarded data points in Euclidean space as nodes in a complex network from the physical point of view, and the association between data objects was regarded as the relationship between nodes in a complex network. It is found that the data structure in the Euclidean space is very similar to the topological structure of the complex network, and a data field of two-dimensional static data is constructed [21], which has brought great influence to the complex network research circle, which has led many researchers to study and analyze the topological properties of complex networks from the perspective of physics. The most basic property representing the importance of objects in the gravitational field is mass. The gravitational force between objects increases as the mass of the objects increases. In complex networks, degree is the most intuitive attribute reflecting the importance of a node. He and Li [16] utilize the degree of nodes to determine the node importance, and then established a gravitational field model in complex networks. They apply the model to the task of ranking the importance of nodes. However, they ignore that some nodes with small degree play a vital role in the network, such as bridge nodes, and using this model to assess the node importance can easily result in inaccurate results. 
We use the node importance after node contraction to measure the mass property in the universal gravitational equation, and use the gravitational force to evaluate the relationship strength between nodes, and propose a novel link prediction framework based on the gravitational field, which not only considers the node importance with small degree, but also overcomes the network disconnection problem caused by the deletion of nodes.

3 A Novel Link Prediction Framework

3.1 Related Definitions

The following are relevant definitions used in this paper.

Node contraction [32]: When contracting any node \(v_{i}\) in network \(G\), it means merging all \(k_{i}\) nodes which are adjacent to \(v_{i}\) with \(v_{i}\), and now the \(k_{i}\) edges are associated with \(v_{i}\). \(G*v_{i}\) represents the network after contracting \(v_{i}\). A schematic illustration of contracting node \(v_{4}\) is shown in Fig. 1.

Fig. 1 Schematic illustration of contracting node \(v_{4}\)
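
As an illustration, node contraction can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in the paper: the adjacency-dictionary representation, the function name `contract_node`, and the example path network are all assumptions made here.

```python
def contract_node(adj, v):
    """Return the adjacency map of G*v, where v and its neighbors are merged into v.

    adj maps each node to the set of its neighbors (undirected, no self-loops).
    """
    merged = adj[v] | {v}              # v together with its k_v neighbors
    contracted = {}
    for u, nbrs in adj.items():
        if u == v:
            # The merged node inherits every edge that leaves the merged set.
            outside = set()
            for w in merged:
                outside |= adj[w]
            contracted[v] = outside - merged
        elif u not in merged:
            # Edges that pointed into the merged set now point to v.
            contracted[u] = {v if w in merged else w for w in nbrs}
    return contracted


# Hypothetical example: the path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
contracted = contract_node(path, 2)    # nodes 1 and 3 are absorbed into node 2
```

In this example the contracted network keeps \(n - k_{i} = 5 - 2 = 3\) nodes, matching the node count used later in Eq. 2.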

Link prediction [12]: Let \(G = (V,E)\) be an undirected network and let \(U\) be the universal set of all possible edges in \(G\), where \(|U| = |V|(|V| - 1)/2\). Then \(U - E\) is the set of edges that do not exist in \(G\). Link prediction is described as finding the pairs of vertices in \(U - E\) that are most likely to form edges in future. The proposed link prediction method calculates the similarity between unconnected nodes and ranks the node pairs from largest to smallest similarity. The higher the similarity value, the higher the probability that the pair will be connected.
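
The ranking step in this definition can be sketched as follows. The code is a hedged illustration only: the function names, the 5-node example network, and the choice of the common-neighbor (CN) count as the score are assumptions, and any similarity index could be passed in instead.

```python
from itertools import combinations


def rank_candidate_links(adj, score):
    """Rank all unconnected node pairs (the set U - E) by decreasing score."""
    candidates = [(a, b) for a, b in combinations(sorted(adj), 2) if b not in adj[a]]
    return sorted(candidates, key=lambda pair: score(adj, *pair), reverse=True)


def common_neighbors(adj, a, b):
    """CN index: number of neighbors shared by a and b."""
    return len(adj[a] & adj[b])


# Hypothetical 5-node network with 5 edges, so |U - E| = 10 - 5 = 5.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
ranking = rank_candidate_links(adj, common_neighbors)
```

Pairs near the front of `ranking` are the ones predicted as most likely to form edges.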

3.2 An Improved Node Importance Evaluation Method

The cohesion of the network is defined as the reciprocal of the product of the number of nodes \(n\) and the average path length \(l\). The cohesion of network \(G\) is expressed as follows:

$$\partial [G] = \frac{1}{n \cdot l} = \frac{1}{{n \cdot \frac{{\sum\nolimits_{i \ne j \in V} {D_{ij} } }}{n \cdot (n - 1)}}} = \frac{n - 1}{{\sum\nolimits_{i \ne j \in V} {D_{ij} } }},$$
(1)

where \(n \ge 2\), \(D_{ij}\) represents the shortest path length between nodes \(v_{i}\) and \(v_{j}\).
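
Eq. 1 can be checked numerically on a small network. The sketch below assumes an unweighted, connected network stored as an adjacency dictionary and uses breadth-first search for the shortest path lengths; the function names and the example path network are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def cohesion(adj):
    """Eq. 1: (n - 1) divided by the sum of D_ij over all ordered pairs i != j."""
    n = len(adj)
    if n < 2:
        raise ValueError("Eq. 1 requires n >= 2")
    total = 0
    for source in adj:
        dist = bfs_distances(adj, source)
        if len(dist) != n:
            raise ValueError("this sketch assumes a connected network")
        total += sum(dist.values())
    return (n - 1) / total


# Hypothetical example: the path 0 - 1 - 2 - 3 - 4 has n = 5 and l = 2.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
value = cohesion(path)   # 1 / (n * l) = 1 / 10
```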

From Eq. 1, we find that the cohesion of the network after contracting node \(v_{i}\) is

$$\begin{gathered} \partial [G*v_{i} ] = \frac{1}{{(n - k_{i} ) \cdot l(G*v_{i} )}} \hfill \\ = \frac{1}{{(n - k_{i} ) \cdot \frac{{\sum\nolimits_{{i \ne j \in V - \Gamma (v_{i} )}} {D^{\prime}_{ij} } }}{{(n - k_{i} ) \cdot (n - k_{i} - 1)}}}} \hfill \\ = \frac{{n - k_{i} - 1}}{{\sum\nolimits_{{i \ne j \in V - \Gamma (v_{i} )}} {D^{\prime}_{ij} } }}, \hfill \\ \end{gathered}$$
(2)

where \(k_{i}\) is the degree of \(v_{i}\), \(\Gamma (v_{i} )\) is the set of neighbor nodes of \(v_{i}\), and \(D^{\prime}_{ij}\) is the distance matrix updated after node \(v_{i}\) is contracted.

We may deduce from Eqs. 1 and 2 that if a node's degree is large and the node is in a critical position in the network, contracting the node compresses the whole network into a more compact one and shortens the shortest paths that pass through the node, indicating that the node is more vital.

In order to better calculate the contribution of node \(v_{i}\) to network \(G\), the definition of node importance is given by [32] as follows:

$$\begin{aligned} {\text{IMC}}(v_{i} ) & = 1 - \frac{{\partial [G]}}{{\partial [G*v_{i} ]}} = 1 - \frac{{\frac{1}{{n \cdot l(G)}}}}{{\frac{1}{{(n - k_{i} ) \cdot l(G*v_{i} )}}}} \\ & = \frac{{n \cdot l(G) - (n - k_{i} ) \cdot l(G*v_{i} )}}{{n \cdot l(G)}}. \\ \end{aligned}$$
(3)

where \(\partial [G*v_{i} ]\) represents the cohesion of network \(G\) after node \(v_{i}\) is contracted. Equation 3 can be understood as the increase in cohesion produced by contracting node \(v_{i}\), measured relative to the cohesion after contraction, which gives the contribution of node \(v_{i}\) to network \(G\).
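
The interpretation above follows from writing the first expression of Eq. 3 over a common denominator:

$${\text{IMC}}(v_{i} ) = 1 - \frac{\partial [G]}{\partial [G*v_{i} ]} = \frac{\partial [G*v_{i} ] - \partial [G]}{\partial [G*v_{i} ]}.$$

For instance, with the purely illustrative values \(\partial [G] = 0.1\) and \(\partial [G*v_{i} ] = 0.25\), this gives \({\text{IMC}}(v_{i} ) = 1 - 0.1/0.25 = 0.6\).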

It is seen from Eq. 3 that the node importance is related to the degree and position of the node. Since \(\partial [G]\) is the same for every node, the importance of node \(v_{i}\) increases monotonically with the cohesion of the network after contracting \(v_{i}\). Based on this, we think the cohesion of the network after contracting the node can be directly used to assess the node importance.

As a result, in order to achieve normalization of nodes’ importance, the importance of node \(v_{i}\) is defined as follows:

$${\text{IMC}}(v_{i} ) = \frac{{\partial [G*v_{i} ]}}{{\mathop {\max }\limits_{1 \le j \le n} (\partial [G*v_{j} ])}},$$
(4)

where \(\mathop {\max }\limits_{1 \le j \le n} (\partial [G*v_{j} ])\) represents the maximum cohesion value of the network obtained by contracting each node in turn. The importance therefore lies in \((0, 1]\); the closer the value is to 1, the more cohesive the network becomes after contracting the node, and the more essential the node is in the network.
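
Eqs. 2 and 4 can be combined into a short sketch that scores every node of a small network. It is an illustration under stated assumptions, not the paper's implementation: the network is connected and stored as an adjacency dictionary, all function names are hypothetical, and a contracted network that collapses to a single node is given cohesion 1.0, a case the text does not cover.

```python
from collections import deque


def cohesion(adj):
    """Eq. 1 applied to any connected network given as an adjacency dict."""
    n = len(adj)
    if n == 1:
        return 1.0  # assumption: Eq. 1 requires n >= 2 and leaves this case open
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
    return (n - 1) / total


def contract_node(adj, v):
    """Merge v and all of its neighbors into the single node v."""
    merged = adj[v] | {v}
    contracted = {}
    for u, nbrs in adj.items():
        if u == v:
            outside = set()
            for w in merged:
                outside |= adj[w]
            contracted[v] = outside - merged
        elif u not in merged:
            contracted[u] = {v if w in merged else w for w in nbrs}
    return contracted


def node_importance(adj):
    """Eq. 4: cohesion after contracting each node, divided by the maximum."""
    after = {v: cohesion(contract_node(adj, v)) for v in adj}
    top = max(after.values())
    return {v: c / top for v, c in after.items()}


# Hypothetical example: the path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
imc = node_importance(path)
```

On this path the three interior nodes share the maximum importance of 1, while the two end nodes score lower, since contracting an end node removes only one other node.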

3.3 Link Prediction Framework Based on Gravitational Field

We know the universal gravitational equation is defined as follows:

$$F = G\frac{{M_{a} \cdot M_{b} }}{{r^{2} }},$$
(5)

where \(G\) is the gravitational constant, \(M\) represents the object's attributes, such as the mass and importance, and \(r\) is the distance between two objects.

The strength of the relationship between nodes in complex networks is quite like the gravitational force between objects. The relationship between nodes depends on the degree, characteristics, betweenness and importance of the nodes themselves, and depends on the distance between them. Therefore, we use the gravitational force to evaluate the relationship strength between nodes and project the similarity problem between nodes into the physical space to solve it.

We all know that mass is the most basic feature that can reflect an object's importance in the gravitational field. We find that one of the most important variables for determining the node importance in complex networks is node degree. However, some nodes with small degree are also important to the network, such as bridge nodes. Therefore, we use the improved node importance evaluation method based on node contraction proposed in the previous section to measure the mass attribute M in the universal gravitational equation. Therefore, we have

$$M_{a} = {\text{IMC}}(a),M_{b} = {\text{IMC}}(b).$$
(6)

The universal gravitational equation’s distance attribute r is measured by the node similarity \(S(a,b)\). The more similar the two nodes are, the shorter the distance between the two nodes is. We have

$$r = \frac{1}{S(a,b)}.$$
(7)

If node \(b\) and node \(a\) are more similar and node \(b\) is more vital, the gravitational value produced by node \(b\) on node \(a\) is greater, and the relationship between these two nodes is closer. The force is directional: the force of node \(b\) on node \(a\) generally differs from the force of node \(a\) on node \(b\). For example, on the Internet, the probability of moving from the Google home page to the Baidu home page is necessarily different from the probability of moving from the Baidu home page to the Google home page, which is related to user usage habits and click rates. Therefore, Eq. 5 is modified to obtain the force equation of node \(b\) on node \(a\) as follows:

$$\begin{aligned} F_{{ab}} & = G^{\prime } \cdot \frac{{M_{b} }}{r} = G^{\prime } \cdot \frac{{{\text{IMC}}(b)}}{{\frac{1}{{S(a,b)}}}} \\ & = G^{\prime } \cdot {\text{IMC}}(b) \cdot S(a,b) \\ & = G^{\prime } \cdot \frac{{\partial [G*b]}}{{\mathop {\max }\limits_{{1 \le j \le n}} (\partial [G*j])}} \cdot S(a,b). \\ \end{aligned}$$
(8)

To distinguish it from the network \(G\), the gravitational constant is denoted here by \(G^{\prime}\). Since \(G^{\prime}\) appears as a common factor in every gravitational value, it does not affect the ranking and can be omitted. Thus, the gravitational value is

$$F_{ab} = {\text{IMC}}(b) \cdot S(a,b) = \frac{\partial [G*b]}{{\mathop {\max }\limits_{1 \le j \le n} (\partial [G*j])}} \cdot S(a,b).$$
(9)
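
The directional nature of Eq. 9 is easy to see in code. In the minimal sketch below, the importance values, the similarity table, and the function name are hypothetical; only the product \({\text{IMC}}(b) \cdot S(a,b)\) comes from Eq. 9.

```python
def gravitational_value(imc, sim, a, b):
    """Eq. 9: force of node b on node a, F_ab = IMC(b) * S(a, b)."""
    return imc[b] * sim[a][b]


# Hypothetical importance values and a symmetric similarity table.
imc = {"a": 0.6, "b": 1.0}
sim = {"a": {"b": 2.0}, "b": {"a": 2.0}}

f_ab = gravitational_value(imc, sim, "a", "b")   # uses IMC(b)
f_ba = gravitational_value(imc, sim, "b", "a")   # uses IMC(a)
```

Even though \(S(a,b) = S(b,a)\) here, the two forces differ because each one is weighted by the importance of the node that exerts it.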

We use the direct gravitational value between nodes to determine the relationship strength between nodes, and we additionally consider the indirect gravitational value between nodes to make the prediction results more accurate, that is, the sum of the gravitational values produced by all neighboring nodes of the target node to another target node. We do the summation operation of the two components, direct and indirect gravitational values, and thus obtain the similarity values between nodes.

Next, we will illustrate the process with a simple example. Shown in Fig. 2 is an artificial network composed of nodes 3, 4, 5 and their neighboring nodes. The set of neighboring nodes of \(x\) is denoted by \(\Gamma (x)\).

  1. The direct gravitational value between node 3 and node 5 is

    $$F_{35} = {\text{IMC}}(5) \cdot S(3,5),$$
    (10)

  2. The gravitational values between node 3 and the neighboring nodes of node 5 are

    $$F_{34} = {\text{IMC}}(4) \cdot S(3,4),$$
    (11)
    $$F_{36} = {\text{IMC}}(6) \cdot S(3,6),$$
    (12)
    $$F_{37} = {\text{IMC}}(7) \cdot S(3,7),$$
    (13)
    $$F_{38} = {\text{IMC}}(8) \cdot S(3,8).$$
    (14)

  3. The gravitational values between node 5 and the neighboring nodes of node 3 are

    $$F_{51} = {\text{IMC}}(1) \cdot S(5,1),$$
    (15)
    $$F_{52} = {\text{IMC}}(2) \cdot S(5,2),$$
    (16)
    $$F_{54} = {\text{IMC}}(4) \cdot S(5,4).$$
    (17)

  4. The direct and indirect gravitational values between nodes 3 and 5 are summed to obtain the similarity value between nodes 3 and 5 as

    $$\begin{aligned} S_{35}^{\prime } & = F_{35} + F_{34} + F_{36} + F_{37} + F_{38} + F_{51} + F_{52} + F_{54} \\ & = F_{35} + \sum\limits_{i \in \Gamma (5)} {F_{3i} } + \sum\limits_{j \in \Gamma (3)} {F_{5j} } . \\ \end{aligned}$$
    (18)
Fig. 2 Network composed of nodes 3, 4, 5 and their neighboring nodes

Therefore, according to Eq. 18, we obtain the similarity algorithmic framework equation between any two nodes \(u\) and \(v\) as

$$\begin{aligned} S_{{uv}}^{\prime } & = F_{{uv}} + \sum\limits_{{i \in \Gamma (v)}} {F_{{ui}} } + \sum\limits_{{j \in \Gamma (u)}} {F_{{vj}} } \\ & = {\text{IMC}}(v) \cdot S(u,v) \\ & + \sum\limits_{{i \in \Gamma (v)}} {{\text{IMC}}(i) \cdot S(u,i)} \\ & + \sum\limits_{{j \in \Gamma (u)}} {{\text{IMC}}(j) \cdot S(v,j)} . \\ \end{aligned}$$
(19)
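
Eq. 19 can be traced on the example of nodes 3 and 5. The sketch below is illustrative only: the edge list is reconstructed from the neighbor sets implied by Eqs. 10 to 17, and the importance values and base similarities are made-up numbers rather than results from the paper.

```python
def lpfgf_similarity(adj, imc, sim, u, v):
    """Eq. 19: direct force of v on u plus the two indirect sums."""
    def force(a, b):
        return imc[b] * sim(a, b)                      # Eq. 9
    direct = force(u, v)                               # F_uv
    via_v = sum(force(u, i) for i in adj[v])           # sum over i in Gamma(v)
    via_u = sum(force(v, j) for j in adj[u])           # sum over j in Gamma(u)
    return direct + via_v + via_u


# Neighbor sets implied by Eqs. 10-17: Gamma(3) = {1, 2, 4}, Gamma(5) = {4, 6, 7, 8}.
adj = {1: {3}, 2: {3}, 3: {1, 2, 4}, 4: {3, 5},
       5: {4, 6, 7, 8}, 6: {5}, 7: {5}, 8: {5}}

# Hypothetical importance values and base similarities (unlisted pairs score 0).
imc = {1: 0.3, 2: 0.3, 3: 0.9, 4: 1.0, 5: 0.8, 6: 0.2, 7: 0.2, 8: 0.2}
table = {frozenset({3, 5}): 0.5, frozenset({3, 4}): 1.0,
         frozenset({4, 5}): 1.0, frozenset({3, 6}): 0.2}


def sim(a, b):
    return table.get(frozenset({a, b}), 0.0)


s_35 = lpfgf_similarity(adj, imc, sim, 3, 5)
# = 0.8*0.5 + (1.0*1.0 + 0.2*0.2) + (1.0*1.0) = 2.44
```

Because the direct term uses \({\text{IMC}}(v)\), swapping the arguments generally gives a different value, so \(S^{\prime}_{uv}\) need not equal \(S^{\prime}_{vu}\).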

Applying this algorithmic framework to different link prediction algorithms yields different similarity results. We normalize this similarity matrix, and the final similarity framework equation is obtained as

$$S^{\prime\prime}_{uv} = \frac{{S^{\prime}_{uv} }}{{\sum\nolimits_{k = 1}^{|V|} {S^{\prime}_{uk} } }} = \frac{{F_{uv} + \sum\nolimits_{i \in \Gamma (v)} {F_{ui} } + \sum\nolimits_{j \in \Gamma (u)} {F_{vj} } }}{{\sum\nolimits_{k = 1}^{|V|} {\left( {F_{uk} + \sum\nolimits_{i \in \Gamma (k)} {F_{ui} } + \sum\nolimits_{j \in \Gamma (u)} {F_{kj} } } \right)} }},$$
(20)

where |V| is the number of nodes.

In addition, we provide a diagram of the algorithmic framework of LPFGF. The LPFGF algorithmic framework can be separated into two components, as shown in Fig. 3. The specific descriptions of each component are as follows.

  • Module 1:

    Firstly, we input the network \(G\), and then separate the network dataset into a training set and a test set. Secondly, the adjacency matrix \({\mathbf{A}}\) is computed. If there is an edge linking \(v_{i}\) and \(v_{j}\), then \(a_{ij} = 1\), otherwise \(a_{ij} = 0\). Thirdly, a direct distance matrix \({\mathbf{H}}\) is calculated through the adjacency matrix, and if there is an edge between \(v_{i}\) and \(v_{j}\), then \(h_{ij} = 1\), otherwise, \(h_{ij} = \infty\). Fourthly, before contracting nodes, the Floyd algorithm is applied to compute the shortest path between any two nodes, and then the shortest distance matrix \({\mathbf{D}}\) is calculated. Next, the shortest distance matrix \({\mathbf{D^{\prime}}}\) after contracting node is calculated through matrix \({\mathbf{D}}\), then the cohesion of each node after contracting is calculated by Eq. 2, and the cohesion column vector matrix \({\mathbf{I}}\) is calculated. Finally, \({\mathbf{I}}\) is normalized and the importance of each node is calculated by Eq. 4, and the importance column vector matrix \({\mathbf{I^{\prime}}}\) is calculated.

  • Module 2:

    Firstly, the similarity matrix \({\mathbf{S}}\) is obtained by utilizing node similarity-based link prediction benchmark methods, such as CN, AA, and RA. Secondly, the importance of each node and the similarity value between two nodes are substituted into Eq. 9. Thirdly, we consider the direct and indirect gravitational values between nodes to evaluate the relationship strength between nodes, and then obtain the similarity matrix \({\mathbf{S^{\prime}}}\). Finally, the normalized similarity matrix \({\mathbf{S^{\prime\prime}}}\) is obtained by normalizing \({\mathbf{S^{\prime}}}\).
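
The first steps of Module 1, from the adjacency matrix \({\mathbf{A}}\) through the direct distance matrix \({\mathbf{H}}\) to the shortest distance matrix \({\mathbf{D}}\), can be sketched with the Floyd algorithm as follows. The function name and the example network are hypothetical, and setting the diagonal of \({\mathbf{H}}\) to 0 is an assumption, since the text only specifies the entries 1 and \(\infty\).

```python
import math


def shortest_distance_matrix(adjacency):
    """Build H from the adjacency matrix, then run the Floyd algorithm to get D."""
    n = len(adjacency)
    # Direct distance matrix H: 1 for an edge, infinity otherwise.
    # Assumption: the diagonal is set to 0 (distance from a node to itself).
    dist = [[0 if i == j else (1 if adjacency[i][j] else math.inf)
             for j in range(n)] for i in range(n)]
    # Floyd algorithm: allow each node k in turn as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist


# Hypothetical example: adjacency matrix of the path 0 - 1 - 2 - 3 - 4.
A = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
D = shortest_distance_matrix(A)
```

The sum of all entries of `D` is the denominator of Eq. 1, so the cohesion follows directly from this matrix.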

Fig. 3 The algorithmic framework of the LPFGF

The algorithmic framework of LPFGF is formed by the above two parts. To assess the framework's performance, we apply it to node similarity-based link prediction algorithms like CN, RA, PA, and LRW, and propose improved link prediction algorithms like LPFGF-CN, LPFGF-RA, LPFGF-PA, and LPFGF-LRW, and then, through the later comparative experiments, we demonstrate the feasibility and effectiveness of the proposed framework.

The main pseudo-code of the LPFGF algorithmic framework is given in figure a. Note that we do not provide the pseudo-code for calculating the node importance matrix \({\mathbf{I^{\prime}}}\).

As shown in the pseudo-code above, firstly, S is the similarity matrix between nodes obtained by different link prediction algorithms, then we calculate the sum of the direct and indirect gravitational values between the target nodes through three nested for statements, and finally the updated similarity matrix between nodes is obtained using a normalization method, thereby realizing the extension of the LPFGF framework to different link prediction algorithms. The input importance matrix \({\mathbf{I^{\prime}}}\) can be calculated using the module 1 shown in Fig. 3.
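
Since the pseudo-code figure is not reproduced here, the loop structure described above can be approximated by the following Python sketch. It is a reconstruction from Eqs. 19 and 20 and not the authors' pseudo-code: the function name, the matrix-as-list representation, skipping the diagonal, and the treatment of all-zero rows are assumptions.

```python
def lpfgf(S, importance, adjacency):
    """Apply Eq. 19 to every ordered pair (u, v), then row-normalize as in Eq. 20.

    S          -- base similarity matrix from any benchmark index (e.g. CN)
    importance -- node importance values I' from Module 1
    adjacency  -- 0/1 adjacency matrix of the training network
    """
    n = len(S)
    neighbors = [[j for j in range(n) if adjacency[i][j]] for i in range(n)]
    updated = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u == v:
                continue                       # assumption: no self-similarity
            total = importance[v] * S[u][v]    # direct value F_uv
            for i in neighbors[v]:             # indirect: neighbors of v acting on u
                total += importance[i] * S[u][i]
            for j in neighbors[u]:             # indirect: neighbors of u acting on v
                total += importance[j] * S[v][j]
            updated[u][v] = total
    normalized = []
    for row in updated:
        row_sum = sum(row)
        # Assumption: a row with no positive entries is left as zeros.
        normalized.append([x / row_sum if row_sum > 0 else 0.0 for x in row])
    return normalized


# Hypothetical example: path 0 - 1 - 2 with CN as the base index (zero diagonal).
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
S = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
importance = [0.5, 1.0, 0.5]
S2 = lpfgf(S, importance, adjacency)
```

Swapping `S` for the matrix produced by another index (for example RA or Katz) is all that is needed to obtain the corresponding LPFGF variant in this sketch.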

4 Results and Analysis of Experiments

4.1 Experimental Datasets

Citeseer, DBLP, Cora, and Wiki datasets are used in trials to see if the proposed LPFGF algorithmic framework is an effective prediction framework, and their topological properties are summarized in Table 1, with |V| denoting the number of nodes, |E| denoting the number of edges, |Y| denoting the number of network tags, K denoting the average degree, D denoting the network diameter, L denoting the average path length, P denoting the density, and C denoting the average clustering coefficient.

Table 1 The topological properties of Citeseer, DBLP, Cora, and Wiki datasets

4.2 Evaluation Indicators and Benchmark Methods

Three important indicators for evaluating the prediction accuracy of a link prediction algorithm are the Area Under the Receiver Operating Characteristic Curve (AUC), Precision, and Ranking Score. In this paper, the link prediction performance is evaluated by AUC [15]. An AUC of 0.5 corresponds to random prediction, so the AUC value of a useful algorithm should be greater than 0.5 and at most 1.
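
For reference, one common way to compute AUC in link prediction is to compare the scores of held-out test edges with the scores of nonexistent edges, counting a win as 1 and a tie as 0.5. The sketch below uses that formulation with an exhaustive comparison; whether the experiments here use exhaustive or sampled comparisons is not stated, and the function name and score values are hypothetical.

```python
def auc_score(test_scores, nonexistent_scores):
    """Fraction of (test edge, nonexistent edge) pairs ranked correctly; ties count 0.5."""
    higher = 0
    ties = 0
    for t in test_scores:
        for m in nonexistent_scores:
            if t > m:
                higher += 1
            elif t == m:
                ties += 1
    comparisons = len(test_scores) * len(nonexistent_scores)
    return (higher + 0.5 * ties) / comparisons


# Hypothetical similarity scores.
test_scores = [0.9, 0.4]               # edges hidden in the test set
nonexistent_scores = [0.4, 0.1, 0.3]   # node pairs with no edge at all
auc = auc_score(test_scores, nonexistent_scores)   # (5 + 0.5 * 1) / 6
```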

In order to evaluate the proposed LPFGF algorithmic framework’s prediction performance, 18 different link prediction algorithms based on similarity are utilized as benchmark algorithms, and the link prediction performance of these algorithms before and after employing the LPFGF framework is compared. In addition, in order to emphasize the contribution of the proposed LPFGF framework, the prediction performance of several pre-classical network representation learning models, which are DeepWalk [27], Node2Vec [13], LINE [30], and SDNE [33], on link prediction tasks are provided for comparison.

4.3 Visualization of Degree Distribution

The degree distribution is defined as the probability distribution of the degrees of nodes in the network. Figure 4 shows the degree distributions of Citeseer, Cora, DBLP, and Wiki. On the Citeseer and Cora datasets, the degrees of most nodes are concentrated between 0 and 10, while on the DBLP and Wiki datasets they are concentrated between 10 and 50. Combining this with the densities and other topological properties of the four datasets in Table 1, we find that Citeseer and Cora are fairly sparse networks, while DBLP and Wiki are rather dense networks.

Fig. 4 Visualization of degree distribution
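
The degree distribution defined in this section can be computed in a few lines. The following is a minimal sketch on a hypothetical toy network; the function name and the adjacency-dictionary input are assumptions.

```python
from collections import Counter


def degree_distribution(adj):
    """Map each degree k to the fraction of nodes that have degree k."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical example: the path 0 - 1 - 2 - 3 - 4.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
distribution = degree_distribution(path)   # two end nodes of degree 1, three of degree 2
```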

4.4 Experimental Results

In this paper, the random sampling method is used to divide each of these four datasets into a training set and a test set, with training ratios of 0.7, 0.8, and 0.9. Each experiment is conducted 10 times independently, and the average value is reported as the final result.
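
A random edge split of this kind can be sketched as follows. The function name, the `seed` parameter, and the rounding rule are assumptions made for illustration; the text does not specify how the sampling is implemented.

```python
import random


def split_edges(edges, train_ratio, seed=None):
    """Randomly divide the edge list into a training set and a test set."""
    rng = random.Random(seed)          # a fixed seed makes a run repeatable
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical example: 10 edges split with a training ratio of 0.7.
edges = [(i, i + 1) for i in range(10)]
train, test = split_edges(edges, 0.7, seed=0)
```

Repeating the split with different seeds and averaging the resulting AUC values mirrors the repeated-run protocol described above.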

Tables 2 and 3 show the prediction results of 18 node similarity-based benchmark metrics before and after applying the LPFGF framework, to demonstrate the feasibility of the LPFGF algorithmic framework.

Table 2 AUC values of the original link prediction algorithms (%)
Table 3 AUC values of the link prediction algorithms after optimization with the LPFGF framework (%)

The original AUC values of the node similarity-based link prediction benchmark methods are listed in Table 2. The AUC results show that the overall prediction performance on the DBLP and Wiki datasets is better. On these four datasets, there are only six algorithms whose link prediction accuracy is higher than 80%, accounting for 33.3%: LP, Katz, LHNII, LRW, SRW, and MFI. Among them, the Katz and MFI algorithms outperform the others, with accuracy higher than 90% on all four datasets. In contrast, CN, Salton, HPI, and other algorithms show poor prediction accuracy on the Citeseer dataset, with a minimum of 65.8%. Comparing the four types of node similarity-based algorithms outlined in related work, we find that the path-based, random walk-based, and other link prediction algorithms perform better than the local information-based algorithms.

The AUC values of the node similarity-based link prediction benchmark methods after using the proposed LPFGF algorithmic framework are listed in Table 3. The results show that the overall prediction performance on the DBLP and Wiki datasets is better after using the framework, with AUC values essentially reaching more than 90%. On these four datasets, there are 10 algorithms whose link prediction accuracy is higher than 80%, accounting for 55.6%: Salton, HDI, LHNI, LP, Katz, LHNII, LNBCN, LRW, SRW, and MFI. Among those, the Katz, LHNII, and MFI algorithms perform best, with prediction accuracy higher than 90% on all four datasets. In particular, on the Citeseer dataset, their accuracy is more than 93%. The results of the PA algorithm on the Citeseer, DBLP, and Wiki datasets are close to the original results, while they are slightly reduced on the Cora dataset. Across the four types of node similarity-based link prediction algorithms described in related work, we find that performance has improved to some extent after applying the LPFGF framework.

Comparing the experimental results in Tables 2 and 3, link prediction algorithms based on local network structure information such as CN, Salton, HPI, HDI, LHNI, AA, and RA are substantially improved, and link prediction algorithms based on global information such as LP, Katz, LHNII, ACT, and LRW are slightly improved after using the proposed LPFGF framework. LP, Katz, and LHNII algorithms consider the n-order paths in the network, and ACT, LRW, and SRW algorithms consider the probability of any node in the network walking to other nodes. Therefore, the information considered by these algorithms is more comprehensive, and the resulting link prediction performance is better. These link prediction algorithms that consider more comprehensive information can be improved to some extent by using the proposed LPFGF framework, which considers not only the global information of the network, but also the importance of the nodes. The link prediction performance is slightly improved because of the more comprehensive information considered by these algorithms themselves. It can be seen that the link prediction algorithms based on local network structure information only consider the neighboring information between nodes, resulting in poor link prediction performance. These algorithms not only consider the common neighbor information of the target nodes, but also the importance of the target nodes and the direct and indirect gravitational values between the target nodes after using the LPFGF framework. The information considered is more comprehensive, and the link prediction performance has been greatly improved, which shows that the proposed LPFGF framework is more suitable for link prediction algorithms based on local information such as CN, AA, and RA, and the framework can solve the problem of incomplete information embodied in these algorithms, thereby improving their link prediction performance.

To show the change in prediction performance more intuitively, Fig. 5 compares the average AUC values of the original benchmark methods before and after using the LPFGF algorithmic framework. The average AUC value of a method is the mean of its AUC values over all datasets. In Fig. 5, the abscissa represents the original benchmark methods and the ordinate represents the average AUC values. The polyline with blue triangle markers shows the average AUC values before using the LPFGF framework, and the polyline with rose-red circle markers shows the average AUC values after using it.

Fig. 5 Comparison of the average AUC values of the original benchmark methods before and after using the LPFGF framework

As can be seen from Fig. 5, the prediction performance of the link prediction methods is greatly improved after using the proposed LPFGF framework. In particular, the prediction performance of the CN, Salton, and HPI methods on the Citeseer and Cora datasets is improved by a maximum of 15%. The prediction performance of LP, Katz, LRW, and SRW improves slightly on the Citeseer and Cora datasets and is basically unchanged on the DBLP and Wiki datasets. The prediction performance of PA decreases slightly, although it remains close to the original results. The reason is that when the PA algorithm is optimized by the LPFGF framework, node contraction is used as the node importance evaluation method. According to Eqs. 2 and 4, the importance of a node is determined not only by its degree but also by its position: a node with a high degree may not be important if it is not at a key position in the network. The PA algorithm, however, only considers the impact of node degree on node similarity, so combining it with position-aware importance leads to a slight performance degradation. Overall, these results indicate that projecting the similarity problem between nodes into the physical space is effective and feasible.

Table 4 compares the prediction performance of the proposed algorithms, such as LPFGF-CN, LPFGF-Salton, and LPFGF-HPI, with that of four classical network representation learning models, namely DeepWalk, Node2Vec, LINE, and SDNE, on the Cora and Citeseer datasets with a training ratio of 0.8. To avoid repeating the experiments, the prediction results of the four network representation learning models are quoted from [40, 41]. Note that Table 4 omits the benchmark algorithms whose original prediction performance is already higher than that of the four network representation learning models; the remaining algorithms are listed in their LPFGF-optimized form.

Table 4 Comparison of AUC values between the LPFGF framework and four network representation learning algorithms (%)

Table 4 shows that, after the other classic link prediction algorithms are optimized by the proposed LPFGF algorithmic framework, their prediction performance is better to a certain extent than that of the classical network representation learning models, even though node label information is not considered. This further shows that the proposed LPFGF framework is feasible and effective.

4.5 Time Complexity Analysis

The proposed algorithmic framework consists of two parts, whose time complexities are analyzed separately. The first part calculates the importance of the nodes. The Floyd algorithm is first used to calculate the shortest distance matrix, with time complexity O(N^3). The shortest distance matrix after contracting each node is then calculated using a new algorithm [32], which reduces the time complexity to O(N^2), and the importance of the nodes is computed from these matrices. The overall time complexity of the first part is therefore O(N^3). The second part computes the normalized similarity matrix. First, a node similarity-based algorithm is needed to measure the distance in the universal gravitation equation. Local information-based similarity algorithms, such as CN, AA, and RA, have time complexity O(N^2); global path-based similarity algorithms, such as LP and Katz, have time complexity O(N^3); ACT has time complexity O(N^3); and LRW and SRW have time complexity O(N⟨k⟩^n), where n is the number of random walk steps and ⟨k⟩ is the average degree, which is much smaller than N. In summary, the time complexity of measuring the distance in the universal gravitation equation is at most O(N^3). Then the direct and indirect gravitational values between nodes are calculated, with time complexity O(N^3), so the time complexity of the second part is O(N^3). Therefore, the overall time complexity of the proposed LPFGF algorithmic framework is O(N^3), which equals that of the global node similarity algorithms, while its link prediction performance is much better than theirs.
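The O(N^3) cost of the shortest distance matrix comes directly from the three nested loops of the Floyd algorithm. The following minimal Python sketch of the textbook Floyd-Warshall procedure for an unweighted, undirected network is given only to make this cost concrete; it is not the implementation used in the experiments, and the function name and the adjacency-matrix input format are assumptions.

```python
import math


def floyd_warshall(adj):
    """All-pairs shortest distances for an unweighted graph.

    adj: adjacency matrix as a list of lists with 0/1 entries.
    Returns a matrix of hop distances; unreachable pairs stay at math.inf.
    """
    n = len(adj)
    # Initialise: 0 on the diagonal, 1 for direct links, infinity otherwise.
    dist = [[0 if i == j else (1 if adj[i][j] else math.inf) for j in range(n)]
            for i in range(n)]
    # Three nested loops over N nodes give the O(N^3) running time.
    for k in range(n):
        for i in range(n):
            d_ik = dist[i][k]
            if d_ik == math.inf:
                continue  # node k cannot shorten any path starting at i
            for j in range(n):
                if d_ik + dist[k][j] < dist[i][j]:
                    dist[i][j] = d_ik + dist[k][j]
    return dist


# Toy path graph 0 - 1 - 2 - 3 plus an isolated node 4 (illustrative only).
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
dist = floyd_warshall(adj)
```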

Furthermore, as shown in Table 5, we choose five representative algorithms from the common neighbor-based similarity algorithms, the path-based global similarity algorithms, and the random walk-based global similarity algorithms, and compare their running time before and after using the LPFGF framework on the Citeseer dataset. In the theoretical analysis, the time complexity of the proposed LPFGF algorithmic framework is at the same level as that of the global link prediction algorithms. In practice, however, although the proposed framework significantly improves link prediction performance, there is still a certain gap in running time.

Table 5 Running time comparison before and after using the LPFGF framework

As shown in Table 6, the running time for calculating the importance of nodes using node contraction is much higher than that for calculating the similarity between nodes using the gravitational field technique. The reason is that node contraction requires not only the shortest distance matrix of the network before contracting nodes, calculated with the Floyd algorithm, but also the shortest distance matrix of the network after contracting each node. The calculation involves too many terms, which leads to a long running time.
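To make the source of this cost concrete, the sketch below recomputes all shortest distances after contracting a node, which is the naive procedure and not the accelerated algorithm of [32]. It assumes a commonly used formulation of node contraction, in which the cohesion of a network with n nodes and average shortest path length l is 1/(n·l), and the importance of node v is 1 minus the ratio of the cohesion before contraction to the cohesion after merging v with its neighbors. This formulation may differ in detail from Eqs. 2 and 4, and all function names are hypothetical. The network is assumed to be connected and unweighted.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source in an unweighted graph (dict: node -> set)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def cohesion(graph):
    """Assumed definition: 1 / (n * average shortest path length)."""
    n = len(graph)
    if n == 1:
        return 1.0  # a single node is treated as maximally cohesive
    total = sum(d for s in graph for d in bfs_distances(graph, s).values())
    avg_path = total / (n * (n - 1))
    return 1.0 / (n * avg_path)


def contract(graph, v):
    """Merge v and all of its neighbors into the single node v."""
    merged = {v} | graph[v]

    def rep(u):
        return v if u in merged else u

    new_graph = {rep(u): set() for u in graph}
    for u in graph:
        for w in graph[u]:
            a, b = rep(u), rep(w)
            if a != b:  # drop self-loops created by the merge
                new_graph[a].add(b)
    return new_graph


def importance(graph, v):
    """Assumed importance: 1 - cohesion(G) / cohesion(G after contracting v)."""
    return 1.0 - cohesion(graph) / cohesion(contract(graph, v))


# Toy star graph: center 0 with leaves 1, 2, 3 (illustrative only).
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
center_score = importance(star, 0)
leaf_score = importance(star, 1)
```

In the toy star graph, the center receives a higher importance than a leaf, because contracting it collapses the whole network into one node. Since the naive procedure repeats an all-pairs distance computation for every node, it illustrates why this step dominates the running time unless an accelerated update such as that of [32] is used.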

Table 6 Running time of computing node importance and node similarity

5 Conclusion

We introduce the universal gravitational equation to evaluate the strength of the relationship between nodes in complex networks. The mass in the universal gravitational equation is measured by the node importance obtained with the node contraction method, and the distance is measured by the similarity between nodes. On this basis, we propose a link prediction framework based on the gravitational field, abbreviated as LPFGF, and derive the similarity equation between nodes, so that the similarity problem between nodes is projected into the physical space and solved there. Finally, simulation results on four real datasets show that the prediction performance of most of the compared node similarity-based link prediction algorithms is improved to some extent by the LPFGF framework.

However, the LPFGF framework still has some shortcomings. In a general network, an edge connects only two nodes, so only the relationship and similarity between two nodes can be considered. Many real-world structures go beyond this setting; for example, in a scientific cooperation network, an article often has more than two co-authors. In future work, we will therefore address the question of how to predict such edges that go beyond pairwise interactions, and we will also try to apply the proposed LPFGF framework to high-order network structures.