A Novel Link Prediction Framework Based on Gravitational Field

Yang, Yanlin; Ye, Zhonglin; Zhao, Haixing; Meng, Lei

doi:10.1007/s41019-022-00201-8

RESEARCH PAPER
Open access
Published: 16 January 2023

Volume 8, pages 47–60, (2023)

Data Science and Engineering

Yanlin Yang^1,2,3,
Zhonglin Ye^1,2,3,
Haixing Zhao ORCID: orcid.org/0000-0003-0957-1603^1,2,3 &
…
Lei Meng^1,2,3

Abstract

Currently, most researchers utilize only network information or node characteristics to calculate the connection probability between unconnected node pairs. We instead attempt to project the problem of computing this connection probability into physical space and solve it there. Firstly, the definition of gravitation is introduced, and the concept of gravitation is used to measure the strength of the relationship between nodes in complex networks. It is well known that the gravitational value depends on the mass of objects and the distance between them. In complex networks, the interrelationship between nodes depends on the characteristics, degree, betweenness, and importance of the nodes themselves, as well as on the distance between nodes, which closely resembles the gravitational relationship between objects. Therefore, the importance of nodes is used to measure the mass property in the universal gravitational equation, the similarity between nodes is used to measure the distance property, and a complex network model is then constructed from physical space. Secondly, the direct and indirect gravitational values between nodes are considered, and a novel link prediction framework based on the gravitational field, abbreviated as LPFGF, is proposed together with the node similarity framework equation. Then, the framework is extended to various link prediction algorithms such as Common Neighbors (CN), Adamic-Adar (AA), Preferential Attachment (PA), and Local Random Walk (LRW), resulting in the link prediction algorithms LPFGF-CN, LPFGF-AA, LPFGF-PA, LPFGF-LRW, and so on. Finally, four real datasets are used to compare prediction performance, and the results demonstrate that the proposed framework improves the prediction performance of other link prediction algorithms, with a maximum improvement of 15%.

1 Introduction

The process of predicting the probability of a link between two unconnected nodes using the network’s characteristics, structure, and node information is called link prediction [12]. Link prediction covers both implicit links (links that exist in the real-world network but are not easily observed) and future links (links that do not yet exist but may appear as the network evolves, under the influence of external information and other factors).

Link prediction has brought great convenience to human society. For example, it is used in social networks to predict the likelihood that two people who do not know each other will become friends in the future [8]. It is also used in protein interaction networks to identify the protein structures that are most likely to interact before interaction experiments are carried out, which can increase the experimental success rate [25, 37]. With the explosive growth of data, networks contain more and more incorrect links and implicit links; link prediction methods can correct or delete the incorrect links and reveal the implicit ones [29, 39]. Link prediction is also common in network modeling, recommendation systems [5, 36], node classification, network reconstruction, knowledge acquisition [19], knowledge question answering [28], and knowledge graphs [2, 17, 34], among other applications.

The gravitational field is the model of two objects with mass producing mutual gravitational forces in space. The gravitational value is determined by the mass of the objects, the distance between them, and the universal gravitational constant. The greater the mass of the objects and the smaller the distance between them, the greater the gravitational force. This is quite similar to the node-to-node interaction in complex networks. The greater the importance of two nodes and the shorter the distance between them, the stronger their relationship, and hence the higher the probability that the two nodes will connect in the future. Based on this consideration, the universal gravitational equation is introduced to model the strength of the relationship between nodes in complex networks, and the problem of computing the future connection probability between nodes is projected into physical space and solved there.

There are numerous methods for evaluating the importance of nodes, including degree centrality-based methods [3], betweenness centrality-based methods [11], and node deletion-based methods [6]. However, degree centrality-based methods ignore the fact that some nodes with small degree play a vital role in the network, such as a bridge node that connects two clusters. The time complexity of betweenness centrality-based methods is high, since they require counting the shortest paths that pass through the target node. Node deletion-based methods may disconnect the network, which makes the importance of the node equal to 0 and is therefore inaccurate. For these reasons, [32] propose a node contraction approach for assessing node importance, in which the importance of a node is evaluated by comparing the network’s cohesion before and after contracting the node. The more cohesive the network becomes after the contraction, the more significant the node is. This method avoids the network disconnection problem caused by node deletion.

There are many methods to evaluate the distance between nodes, such as finding the shortest path between nodes, common neighbors of two nodes, and similarity between nodes. We utilize the similarity between nodes to estimate the distance between nodes because most algorithms based on node similarity consider not only the shortest path between nodes but also other information between nodes at the same time.

Therefore, we first use the node importance calculated by the node contraction method to measure the mass property in the universal gravitational equation, and the similarity between nodes to measure the distance property. We then consider the direct and indirect gravitational values between nodes, propose a novel link prediction framework based on the gravitational field, abbreviated as LPFGF, and obtain the framework equation for calculating the similarity between nodes. Finally, simulation experiments are conducted on four real datasets, and the prediction performance of existing link prediction algorithms before and after using the framework is compared. The results demonstrate that the proposed framework is effective and feasible.

Therefore, the innovation of this paper mainly includes the following three points.

(1)
The gravitational equation is introduced to measure the strength of the relationship between nodes. Because nodes with large degree are not necessarily the most influential nodes in the network, and because the connectivity of the network must also be taken into account, the node contraction method is used to measure the importance of nodes. Most existing link prediction algorithms consider both path and distance information when evaluating the similarity between nodes, so the similarity between nodes is used to measure the distance between them. In general, the more similar two nodes are, the fewer nodes lie on the path between them, which fits the actual gravitational field model and can be extended to other networks, such as social networks and traffic networks.
(2)
We not only consider the forces between directly connected nodes, but also the forces between nodes that are not directly connected.
(3)
Unlike the link prediction algorithms proposed by other researchers, the proposed LPFGF framework can be applied to other link prediction algorithms to improve their prediction performance. In addition, without any supervised signal from the nodes, the proposed LPFGF framework achieves better link prediction performance than some classical network representation learning algorithms.

2 Related Work

With the rapid evolution of complex networks, researchers have given link prediction a great deal of attention, and several link prediction methods have been proposed [35]. The following are the three main categories that exist right now.

The first category is node similarity-based methods. The greater the similarity between two nodes, the higher the probability that they are connected. The time complexity of this category is low, but the prediction accuracy is poor. It contains the following four sub-categories. (1) Similarity indexes based on local network information. These mainly include CN, PA [1], Salton [10], Jaccard [1], and LHN-I [19]. Liu et al. [22] combine CN, AA, and Resource Allocation (RA) with the local naive Bayes algorithm and propose three hybrid similarity indexes: LNBCN, LNBAA, and LNBRA. In addition, [10] propose a node-weighted link prediction model, which considers the number and weight of common neighbors between nodes. Tang [31] consider the influence of common neighbors on the accuracy of any algorithm under the assumption of varied network architectures and propose an algorithm for predicting the missing links of small promoted index via common neighbors, which considers the priority of small promoted index based on common neighbors. (2) Path-based similarity indexes. These mainly include Local Path (LP) [1], Katz [1], and LHN-II [22]. In addition, [14] offer a link prediction approach based on high path similarity that uses the path as the judgment characteristic to find missing links and restricts information leakage by penalizing common neighbor pairs. Ding [9] consider the structural characteristics of authors in the scientific research cooperation network, propose a weighted mining algorithm that combines path similarity indexes with certain weights, and conduct an empirical test and analysis using the biomedical research field in China as an example. (3) Random walk-based similarity indexes. These mainly include Average Commute Time (ACT) [1], Local Random Walk (LRW) [1], and Random Walk with Restart (RWR) [1], Albert et al.
Based on the Metropolis-Hastings (MH) algorithm, the degree information of neighbor nodes is fully utilized and fused with the Random Walk with Restart similarity algorithm, yielding an improved link prediction algorithm that shows good prediction performance [24]. (4) Other similarity indexes. Examples include the Matrix-Forest Theory Index (MFI) [4], Transferring Similarity Common Neighbor (TSCN) [23], and Transferring Similarity Preferential Attachment (TSPA) [23].

The second category is maximum likelihood estimation-based methods. These algorithms calculate the likelihood of the network from the generation and organization patterns of the network structure and the observed edges, assume that the real network maximizes this likelihood, and then calculate the connection probability between unconnected node pairs. The prediction accuracy of this category is higher, but so is the time complexity. In [7], a hierarchical structure model is first established and a maximum likelihood-based link prediction algorithm is then proposed, which shows good prediction performance on networks with an obvious hierarchical structure. Clauset [40, 41] construct a benchmark dataset for hierarchical link prediction, named TeleGraph. In [26], a link prediction framework based on maximum likelihood analysis of a closed-circuit model is proposed; the experimental results show that, once an appropriate Hamiltonian of the network is defined, the model’s prediction accuracy is better than that of the hierarchy model.

The third category is machine learning-based methods. Link prediction can be treated as a machine learning classification problem. The prediction accuracy of this category is higher, but most machine learning algorithms require supervised signals. Machine learning was first used to predict links in 2007, and the experimental results showed better prediction performance [20]. Zhang [38] use a graph neural network (GNN) to develop a new link prediction approach that learns heuristics from local subgraphs, and the experimental findings demonstrate a level of performance not seen before. More recently, [18] utilize a variational autoencoder to propose a link prediction method for temporal networks. It produces low-dimensional, dense representation vectors of nodes while preserving the dynamic nonlinear properties of temporal networks.

All of the aforementioned algorithms calculate the connection probability between node pairs using the network’s local structural information or node information, rather than calculating the probability from the physical space. There are, however, some studies on introducing gravitational fields into complex networks. For example, in 2005, Academician Li regarded data points in Euclidean space as nodes in a complex network from the physical point of view, and regarded the association between data objects as the relationship between nodes. The data structure in Euclidean space was found to be very similar to the topological structure of a complex network, and a data field of two-dimensional static data was constructed [21]. This work greatly influenced the complex network research community and led many researchers to study and analyze the topological properties of complex networks from the perspective of physics. The most basic property representing the importance of objects in the gravitational field is mass: the gravitational force between objects increases as their mass increases. In complex networks, degree is the most intuitive attribute reflecting the importance of a node. He and Li [16] utilize the degree of nodes to determine node importance and then establish a gravitational field model in complex networks, which they apply to the task of ranking the importance of nodes. However, they ignore that some nodes with small degree, such as bridge nodes, play a vital role in the network, so using this model to assess node importance can easily produce inaccurate results. We instead use the node importance after node contraction to measure the mass property in the universal gravitational equation, use the gravitational force to evaluate the relationship strength between nodes, and propose a novel link prediction framework based on the gravitational field. The framework not only accounts for the importance of nodes with small degree but also overcomes the network disconnection problem caused by deleting nodes.

3 A Novel Link Prediction Framework

3.1 Related Definitions

The following are relevant definitions used in this paper.

Node contraction [32]: Contracting a node $v_{i}$ in network $G$ means merging the $k_{i}$ nodes adjacent to $v_{i}$ with $v_{i}$ into a single node, so that the edges originally associated with those $k_{i}$ nodes are now associated with the merged node. $G*v_{i}$ denotes the network after contracting $v_{i}$. A schematic illustration of contracting node $v_{4}$ is shown in Fig. 1.

Link prediction [12]: Let $G = (V,E)$ be an undirected network and let $U$ be the set of all possible edges in $G$, where $|U| = |V|(|V| - 1)/2$. Then $U - E$ is the set of edges that do not exist in $G$. Link prediction is described as finding the pairs of vertices in $U - E$ that are most likely to form edges in the future. The proposed link prediction method calculates the similarity between unconnected nodes and ranks the node pairs from largest to smallest similarity. The higher the similarity value, the higher the probability that the corresponding edge will form.
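The ranking step in this definition can be illustrated with a minimal Python sketch. The toy network, the function name `rank_candidate_links`, and the use of Common Neighbors as the similarity index are illustrative assumptions, not part of the original method.

```python
from itertools import combinations

def rank_candidate_links(nodes, edges, similarity):
    """Score every unconnected node pair and sort from most to least similar."""
    edge_set = {frozenset(e) for e in edges}
    candidates = [
        (u, v)
        for u, v in combinations(sorted(nodes), 2)
        if frozenset((u, v)) not in edge_set  # keep only pairs in U - E
    ]
    return sorted(candidates, key=lambda pair: similarity(*pair), reverse=True)

# Hypothetical toy network; Common Neighbors stands in for the similarity index.
nodes = {1, 2, 3, 4}
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
neighbors = {n: set() for n in nodes}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

def common_neighbors(u, v):
    return len(neighbors[u] & neighbors[v])

ranking = rank_candidate_links(nodes, edges, common_neighbors)
```

Here $|U| = 6$ and $|E| = 4$, so two candidate pairs remain to be ranked.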

3.2 An Improved Node Importance Evaluation Method

The cohesion of the network is defined as the reciprocal of the product of the number of nodes $n$ and the average path length $l$. The cohesion of network $G$ is expressed as follows:

$$\partial [G] = \frac{1}{n \cdot l} = \frac{1}{{n \cdot \frac{{\sum\nolimits_{i \ne j \in V} {D_{ij} } }}{n \cdot (n - 1)}}} = \frac{n - 1}{{\sum\nolimits_{i \ne j \in V} {D_{ij} } }},$$

(1)

where $n \ge 2$, $D_{ij}$ represents the shortest path length between nodes $v_{i}$ and $v_{j}$.

From Eq. 1, we find that the cohesion of the network after contracting node $v_{i}$ is

$$\begin{gathered} \partial [G*v_{i} ] = \frac{1}{{(n - k_{i} ) \cdot l(G*v_{i} )}} \hfill \\ = \frac{1}{{(n - k_{i} ) \cdot \frac{{\sum\nolimits_{{i \ne j \in V - \Gamma (v_{i} )}} {D^{\prime}_{ij} } }}{{(n - k_{i} ) \cdot (n - k_{i} - 1)}}}} \hfill \\ = \frac{{n - k_{i} - 1}}{{\sum\nolimits_{{i \ne j \in V - \Gamma (v_{i} )}} {D^{\prime}_{ij} } }}, \hfill \\ \end{gathered}$$

(2)

where $k_{i}$ is the degree of $v_{i}$, $\Gamma (v_{i} )$ is the set of neighbor nodes of $v_{i}$, and $D^{\prime}_{ij}$ is the corresponding entry of the distance matrix updated after node $v_{i}$ is contracted.

We may deduce from Eqs. 1 and 2 that if a node's degree is large and the node occupies a critical position in the network, contracting it compresses the whole network into a more compact one and shortens the shortest paths that pass through it, indicating that the node is more vital.
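Eqs. 1 and 2 can be checked on a small example. The following is a minimal sketch, assuming a connected, unweighted, undirected network stored as a dictionary of neighbor sets; the function names and the 5-node path are illustrative choices, and breadth-first search is used here only because it is compact.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def cohesion(adj):
    """Eq. 1: (n - 1) / sum of D_ij over ordered pairs i != j (connected graph assumed)."""
    n = len(adj)
    if n < 2:
        raise ValueError("Eq. 1 is stated for n >= 2")
    total = sum(sum(shortest_path_lengths(adj, s).values()) for s in adj)
    return (n - 1) / total

def contract(adj, v):
    """Return G*v: v and its k neighbors are merged into the single node v."""
    merged = adj[v] | {v}
    new_adj = {v: set()}
    for u, nbrs in adj.items():
        if u in merged:
            # Outside neighbors of any merged node become neighbors of v.
            new_adj[v] |= nbrs - merged
        else:
            # Edges that pointed into the merged group now point to v.
            new_adj[u] = {v if w in merged else w for w in nbrs}
    return new_adj

# Hypothetical example: the path 1-2-3-4-5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
before = cohesion(path)
after_center = cohesion(contract(path, 3))
after_end = cohesion(contract(path, 1))
```

Contracting the central node leaves a more cohesive network than contracting an end node, which matches the intuition that the central node matters more.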

In order to better calculate the contribution of node $v_{i}$ to network $G$, the definition of node importance is given by [32] as follows:

$$\begin{aligned} {\text{IMC}}(v_{i} ) & = 1 - \frac{{\partial [G]}}{{\partial [G*v_{i} ]}} = 1 - \frac{{\frac{1}{{n \cdot l(G)}}}}{{\frac{1}{{(n - k_{i} ) \cdot l(G*v_{i} )}}}} \\ & = \frac{{n \cdot l(G) - (n - k_{i} ) \cdot l(G*v_{i} )}}{{n \cdot l(G)}}. \\ \end{aligned}$$

(3)

where $\partial [G*v_{i} ]$ represents the cohesion of network $G$ after node $v_{i}$ is contracted. Since $1 - \partial [G]/\partial [G*v_{i} ] = (\partial [G*v_{i} ] - \partial [G])/\partial [G*v_{i} ]$, Eq. 3 can be understood as the increase in cohesion caused by contracting node $v_{i}$, measured relative to the cohesion after contraction, which gives the contribution of node $v_{i}$ to network $G$.

It is seen from Eq. 3 that the node importance is related to the degree and position of the node. Because $\partial [G]$ is the same for every node, the importance of node $v_{i}$ increases monotonically with the cohesion of the network after contracting $v_{i}$. Based on this, we think the cohesion of the network after contracting the node can be directly used to assess the node importance.

As a result, in order to achieve normalization of nodes’ importance, the importance of node $v_{i}$ is defined as follows:

$${\text{IMC}}(v_{i} ) = \frac{{\partial [G*v_{i} ]}}{{\mathop {\max }\limits_{1 \le j \le n} (\partial [G*v_{j} ])}},$$

(4)

where $\mathop {\max }\limits_{1 \le j \le n} (\partial [G*v_{j} ])$ represents the maximum cohesion value of the network over the contractions of all individual nodes. The closer the value of ${\text{IMC}}(v_{i} )$ is to 1, the more cohesive the network is after contracting the node and the more essential the node is in the network.
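A short sketch may help compare the two importance definitions, Eq. 3 and Eq. 4. The cohesion values below are worked out by hand for a hypothetical 5-node path 1-2-3-4-5 (an illustrative assumption), and the function names are not from the original paper.

```python
def imc_relative(cohesion_before, cohesion_after):
    """Eq. 3: 1 - cohesion(G) / cohesion(G*v_i)."""
    return 1.0 - cohesion_before / cohesion_after

def imc_normalized(cohesion_after_by_node):
    """Eq. 4: cohesion after contracting each node, divided by the largest such value."""
    top = max(cohesion_after_by_node.values())
    return {v: c / top for v, c in cohesion_after_by_node.items()}

# Hand-computed cohesion values for the hypothetical path 1-2-3-4-5.
cohesion_g = 0.1
cohesion_after = {1: 0.15, 2: 0.25, 3: 0.25, 4: 0.25, 5: 0.15}

relative = {v: imc_relative(cohesion_g, c) for v, c in cohesion_after.items()}
normalized = imc_normalized(cohesion_after)

# Both definitions rank the nodes in the same order.
same_order = sorted(relative, key=relative.get) == sorted(normalized, key=normalized.get)
```

Because Eq. 3 is monotone in the post-contraction cohesion, Eq. 4 preserves the ranking while mapping the most important node to 1.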

3.3 Link Prediction Framework Based on Gravitational Field

We know the universal gravitational equation is defined as follows:

$$F = G\frac{{M_{a} \cdot M_{b} }}{{r^{2} }},$$

(5)

where $G$ is the gravitational constant, $M$ represents the object's attributes, such as the mass and importance, and $r$ is the distance between two objects.

The strength of the relationship between nodes in complex networks is quite like the gravitational force between objects. The relationship between nodes depends on the degree, characteristics, betweenness and importance of the nodes themselves, and depends on the distance between them. Therefore, we use the gravitational force to evaluate the relationship strength between nodes and project the similarity problem between nodes into the physical space to solve it.

Mass is the most basic feature reflecting an object's importance in the gravitational field. In complex networks, one of the most important variables for determining node importance is node degree. However, some nodes with small degree, such as bridge nodes, are also important to the network. Therefore, we use the improved node importance evaluation method based on node contraction proposed in the previous section to measure the mass attribute $M$ in the universal gravitational equation, so that

$$M_{a} = {\text{IMC}}(a),M_{b} = {\text{IMC}}(b).$$

(6)

The universal gravitational equation’s distance attribute r is measured by the node similarity $S(a,b)$. The more similar the two nodes are, the shorter the distance between the two nodes is. We have

$$r = \frac{1}{S(a,b)}.$$

(7)

If node $b$ is more similar to node $a$ and node $b$ is more vital, the gravitational value produced by node $b$ on node $a$ is greater, and the relationship between the two nodes is closer. This force is directional. For example, on the actual Internet, the probability of navigating from the Google home page to the Baidu home page necessarily differs from the probability of navigating from the Baidu home page to the Google home page, which is related to user habits and click rates. Therefore, Eq. 5 is modified to obtain the force of node $b$ on node $a$ as follows:

$$\begin{aligned} F_{{ab}} & = G^{\prime } \cdot \frac{{M_{b} }}{r} = G^{\prime } \cdot \frac{{{\text{IMC}}(b)}}{{\frac{1}{{S(a,b)}}}} \\ & = G^{\prime } \cdot {\text{IMC}}(b) \cdot S(a,b) \\ & = G^{\prime } \cdot \frac{{\partial [G*b]}}{{\mathop {\max }\limits_{{1 \le j \le n}} (\partial [G*j])}} \cdot S(a,b). \\ \end{aligned}$$

(8)

To distinguish it from the network $G$, the gravitational constant is denoted here by $G^{\prime}$. Since $G^{\prime}$ appears as the same factor in every equation, it does not affect the ranking and can be omitted. Thus, the gravitational value is

$$F_{ab} = {\text{IMC}}(b) \cdot S(a,b) = \frac{\partial [G*b]}{{\mathop {\max }\limits_{1 \le j \le n} (\partial [G*j])}} \cdot S(a,b).$$

(9)
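The directional nature of Eq. 9 can be seen in a tiny numerical sketch. The importance and similarity values are made up for illustration, and `gravitational_value` is a hypothetical helper name.

```python
def gravitational_value(imc, similarity, a, b):
    """Eq. 9: force exerted by node b on node a, F_ab = IMC(b) * S(a, b)."""
    return imc[b] * similarity[frozenset((a, b))]

# Hypothetical values: b is the more important node, S(a, b) is symmetric.
imc = {"a": 0.4, "b": 1.0}
similarity = {frozenset(("a", "b")): 0.5}

f_ab = gravitational_value(imc, similarity, "a", "b")  # uses IMC(b)
f_ba = gravitational_value(imc, similarity, "b", "a")  # uses IMC(a)
```

Even with a symmetric similarity, $F_{ab} \ne F_{ba}$ whenever the two nodes differ in importance.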

We use the direct gravitational value between nodes to determine the relationship strength between them, and we additionally consider the indirect gravitational value to make the prediction results more accurate. The indirect value is the sum of the gravitational values that all neighboring nodes of one target node produce on the other target node. Summing the direct and indirect gravitational values gives the similarity value between nodes.

Next, we will illustrate the process with a simple example. Shown in Fig. 2 is an artificial network composed of nodes 3, 4, 5 and their neighboring nodes. The set of neighboring nodes of $x$ is denoted by $\Gamma (x)$.

(1)
The direct gravitational value between node 3 and node 5 is
$$F_{35} = {\text{IMC}}(5) \cdot S(3,5),$$
(10)
(2)
The gravitational values between node 3 and the neighboring nodes of node 5 are
$$F_{34} = {\text{IMC}}(4) \cdot S(3,4),$$
(11)
$$F_{36} = {\text{IMC}}(6) \cdot S(3,6),$$
(12)
$$F_{37} = {\text{IMC}}(7) \cdot S(3,7),$$
(13)
$$F_{38} = {\text{IMC}}(8) \cdot S(3,8).$$
(14)
(3)
The gravitational values between node 5 and the neighboring nodes of node 3 are
$$F_{51} = {\text{IMC}}(1) \cdot S(5,1),$$
(15)
$$F_{52} = {\text{IMC}}(2) \cdot S(5,2),$$
(16)
$$F_{54} = {\text{IMC}}(4) \cdot S(5,4).$$
(17)
(4)
The direct and indirect gravitational values between nodes 3 and 5 are summed to obtain the similarity value between nodes 3 and 5 as
$$\begin{aligned} S_{35}^{\prime } & = F_{35} + F_{34} + F_{36} + F_{37} + F_{38} + F_{51} + F_{52} + F_{54} \\ & = F_{35} + \sum\limits_{i \in \Gamma (5)} F_{3i} + \sum\limits_{j \in \Gamma (3)} F_{5j} . \\ \end{aligned}$$
(18)

Therefore, according to Eq. 18, we obtain the similarity algorithmic framework equation between any two nodes $u$ and $v$ as

$$\begin{aligned} S_{{uv}}^{\prime } & = F_{{uv}} + \sum\limits_{{i \in \Gamma (v)}} {F_{{ui}} } + \sum\limits_{{j \in \Gamma (u)}} {F_{{vj}} } \\ & = {\text{IMC}}(v) \cdot S(u,v) \\ & + \sum\limits_{{i \in \Gamma (v)}} {{\text{IMC}}(i) \cdot S(u,i)} \\ & + \sum\limits_{{j \in \Gamma (u)}} {{\text{IMC}}(j) \cdot S(v,j)} . \\ \end{aligned}$$

(19)
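Eq. 19 can also be written in matrix form, which may be convenient when implementing the framework. The derivation below is a sketch under the assumption that the neighbor sets $\Gamma(\cdot)$ come from a symmetric 0/1 adjacency matrix ${\mathbf{A}} = (a_{ij})$ with zero diagonal; the matrix ${\mathbf{F}}$ and the notation ${\text{diag}}({\text{IMC}})$ are introduced here for illustration and do not appear in the original text.

$$\begin{aligned} {\mathbf{F}} & = {\mathbf{S}} \cdot {\text{diag}}({\text{IMC}}),\quad F_{ab} = S(a,b) \cdot {\text{IMC}}(b), \\ \sum\limits_{i \in \Gamma (v)} F_{ui} & = \sum\limits_{i} F_{ui} \, a_{iv} = ({\mathbf{FA}})_{uv} ,\quad \sum\limits_{j \in \Gamma (u)} F_{vj} = \sum\limits_{j} F_{vj} \, a_{ju} = ({\mathbf{FA}})_{vu} , \\ {\mathbf{S^{\prime}}} & = {\mathbf{F}} + {\mathbf{FA}} + ({\mathbf{FA}})^{\rm T} . \\ \end{aligned}$$

Because ${\mathbf{F}}$ is generally not symmetric, ${\mathbf{S^{\prime}}}$ is not symmetric either, which is consistent with the directional force in Eq. 9.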

Applying this algorithmic framework to different link prediction algorithms yields different similarity results. We normalize this similarity matrix, and the final similarity framework equation is obtained as

$$S^{\prime\prime}_{uv} = \frac{{S^{\prime}_{uv} }}{{\sum\nolimits_{k = 1}^{|V|} {S^{\prime}_{uk} } }} = \frac{{F_{uv} + \sum\nolimits_{i \in \Gamma (v)} {F_{ui} } + \sum\nolimits_{j \in \Gamma (u)} {F_{vj} } }}{{\sum\nolimits_{k = 1}^{|V|} {\left( {F_{uk} + \sum\nolimits_{i \in \Gamma (k)} {F_{ui} } + \sum\nolimits_{j \in \Gamma (u)} {F_{kj} } } \right)} }},$$

(20)

where |V| is the number of nodes.

In addition, we provide a diagram of the algorithmic framework of LPFGF. The LPFGF algorithmic framework can be separated into two components, as shown in Fig. 3. The specific descriptions of each component are as follows.

Module 1:

Firstly, we input the network $G$ and separate the network dataset into a training set and a test set. Secondly, the adjacency matrix ${\mathbf{A}}$ is computed: if there is an edge linking $v_{i}$ and $v_{j}$, then $a_{ij} = 1$; otherwise $a_{ij} = 0$. Thirdly, a direct distance matrix ${\mathbf{H}}$ is calculated from the adjacency matrix: if there is an edge between $v_{i}$ and $v_{j}$, then $h_{ij} = 1$; otherwise $h_{ij} = \infty$. Fourthly, before any node is contracted, the Floyd algorithm is applied to compute the shortest path between any two nodes, which gives the shortest distance matrix ${\mathbf{D}}$. Next, the shortest distance matrix ${\mathbf{D^{\prime}}}$ after contracting each node is calculated from ${\mathbf{D}}$, the cohesion of the network after contracting each node is calculated by Eq. 2, and the results are collected in the cohesion column vector ${\mathbf{I}}$. Finally, ${\mathbf{I}}$ is normalized by Eq. 4 to obtain the importance of each node, giving the importance column vector ${\mathbf{I^{\prime}}}$.
Module 2:

Firstly, the similarity matrix ${\mathbf{S}}$ is obtained by utilizing node similarity-based link prediction benchmark methods, such as CN, AA, and RA. Secondly, the importance of each node and the similarity value between two nodes are substituted into Eq. 9. Thirdly, we consider the direct and indirect gravitational values between nodes to evaluate the relationship strength between nodes, and then obtain the similarity matrix ${\mathbf{S^{\prime}}}$. Finally, the normalized similarity matrix ${\mathbf{S^{\prime\prime}}}$ is obtained by normalizing ${\mathbf{S^{\prime}}}$.
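The distance computation in Module 1 can be sketched as follows. This is a minimal Floyd-Warshall example, assuming the network is given as a 0/1 adjacency matrix in nested lists and that the diagonal of the distance matrix is set to 0; the function name and the 4-node path are illustrative.

```python
import math

def floyd_shortest_distances(adjacency):
    """Build H from a 0/1 adjacency matrix, then run Floyd-Warshall to obtain D."""
    n = len(adjacency)
    # H: 1 where an edge exists, infinity otherwise; 0 on the diagonal (assumption).
    dist = [
        [0 if i == j else (1 if adjacency[i][j] else math.inf) for j in range(n)]
        for i in range(n)
    ]
    for k in range(n):          # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical 4-node path 0-1-2-3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
D = floyd_shortest_distances(A)
```

The triple loop costs $O(|V|^{3})$ time, which is worth keeping in mind for the larger datasets.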

The algorithmic framework of LPFGF is formed by the above two parts. To assess the framework's performance, we apply it to node similarity-based link prediction algorithms like CN, RA, PA, and LRW, and propose improved link prediction algorithms like LPFGF-CN, LPFGF-RA, LPFGF-PA, and LPFGF-LRW, and then, through the later comparative experiments, we demonstrate the feasibility and effectiveness of the proposed framework.

The following is the LPFGF algorithmic framework's main pseudo-code. Note that, we do not provide the pseudo-code for calculating the node importance matrix ${\mathbf{I^{\prime}}}$.

As the pseudo-code shows, ${\mathbf{S}}$ is the similarity matrix between nodes obtained by a given link prediction algorithm. The sum of the direct and indirect gravitational values between the target nodes is calculated through three nested for statements, and the updated similarity matrix between nodes is then obtained by normalization, which realizes the extension of the LPFGF framework to different link prediction algorithms. The input importance matrix ${\mathbf{I^{\prime}}}$ can be calculated using Module 1 shown in Fig. 3.
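The computation just described can be sketched in Python. This sketch is written from Eqs. 9, 19, and 20 rather than from the authors' pseudo-code, so the function name, the list-of-lists data layout, and the toy inputs (a 4-node path, a Common-Neighbors-style matrix with zero diagonal, and made-up importance values) are all assumptions.

```python
def lpfgf_similarity(S, A, imc):
    """Return (S', S'') following Eq. 19 and Eq. 20.

    S   -- base similarity matrix from any similarity index (list of lists)
    A   -- symmetric 0/1 adjacency matrix defining the neighbor sets
    imc -- node importance values, one per node
    """
    n = len(S)
    s_prime = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = imc[v] * S[u][v]               # direct term F_uv
            for k in range(n):
                if A[k][v]:                        # k in Gamma(v): add F_uk
                    total += imc[k] * S[u][k]
                if A[k][u]:                        # k in Gamma(u): add F_vk
                    total += imc[k] * S[v][k]
            s_prime[u][v] = total
    # Eq. 20: divide each row by its sum; all-zero rows are left as zeros (assumption).
    s_norm = []
    for row in s_prime:
        row_sum = sum(row)
        s_norm.append([x / row_sum if row_sum else 0.0 for x in row])
    return s_prime, s_norm

# Hypothetical inputs for the path 0-1-2-3.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
S = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
imc = [0.6, 1.0, 1.0, 0.6]
s_prime, s_norm = lpfgf_similarity(S, A, imc)
```

In this toy case the pair (0, 3) has a base similarity of zero but receives a positive score through the indirect terms, which is the effect the framework is designed to capture. Diagonal entries are computed as well because Eq. 20 sums over all $k$; self-pairs would simply be ignored when ranking candidate links.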

4 Results and Analysis of Experiments

4.1 Experimental Datasets

The Citeseer, DBLP, Cora, and Wiki datasets are used in the experiments to test whether the proposed LPFGF algorithmic framework is an effective prediction framework. Their topological properties are summarized in Table 1, where |V| denotes the number of nodes, |E| the number of edges, |Y| the number of network tags, K the average degree, D the network diameter, L the average path length, P the density, and C the average clustering coefficient.

Table 1 The topological properties of Citeseer, DBLP, Cora, and Wiki datasets

4.2 Evaluation Indicators and Benchmark Methods

Three important indicators for evaluating the prediction accuracy of a link prediction algorithm are the Area Under the Receiver Operating Characteristic Curve (AUC), Precision, and Ranking Score. In this paper, link prediction performance is evaluated by AUC [15]. An AUC of 0.5 corresponds to random prediction, so a useful algorithm should have an AUC greater than 0.5 and at most 1.
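The text does not spell out how AUC is computed. A minimal sketch follows, assuming the pairwise-comparison form commonly used in link prediction, in which the score of each held-out (test) edge is compared with the score of each nonexistent edge and ties count as one half. The function name and the scores are hypothetical.

```python
def link_prediction_auc(test_scores, nonexistent_scores):
    """Fraction of (test edge, nonexistent edge) pairs ranked correctly; ties count 0.5."""
    wins = 0
    ties = 0
    for t in test_scores:
        for m in nonexistent_scores:
            if t > m:
                wins += 1
            elif t == m:
                ties += 1
    total = len(test_scores) * len(nonexistent_scores)
    return (wins + 0.5 * ties) / total

# Hypothetical similarity scores.
auc = link_prediction_auc([0.9, 0.4], [0.4, 0.1, 0.2])
auc_uninformative = link_prediction_auc([0.3, 0.3], [0.3, 0.3])
```

When all scores are identical the result is 0.5, the value expected from random prediction. On large networks the exhaustive comparison is usually replaced by random sampling of pairs.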

To evaluate the prediction performance of the proposed LPFGF algorithmic framework, 18 different similarity-based link prediction algorithms are used as benchmark algorithms, and their link prediction performance before and after employing the LPFGF framework is compared. In addition, to emphasize the contribution of the proposed LPFGF framework, the link prediction performance of several classical network representation learning models, namely DeepWalk [27], Node2Vec [13], LINE [30], and SDNE [33], is provided for comparison.

4.3 Visualization of Degree Distribution

The degree distribution is defined as the probability distribution of the degree of nodes in the network. Figure 4 shows the degree distributions of Citeseer, Cora, DBLP, and Wiki. On the Citeseer and Cora datasets, the degree of most nodes is concentrated between 0 and 10, while on the DBLP and Wiki datasets, the degree of most nodes is concentrated between 10 and 50. Combining these distributions with the topological properties of the four datasets in Table 1, we find that Citeseer and Cora are fairly sparse networks, while DBLP and Wiki are rather dense networks.
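As a small illustration of this definition, the sketch below computes an empirical degree distribution from an edge list. The edge list is hypothetical, and nodes that appear in no edge are not counted, which is an assumption of this sketch.

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree k to the fraction of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)  # only nodes that appear in at least one edge
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy network with four nodes.
distribution = degree_distribution([(1, 2), (1, 3), (1, 4), (2, 3)])
```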

4.4 Experimental Results

In this paper, each of the four datasets is divided by random sampling, using training ratios of 0.7, 0.8, and 0.9. The final result is the average over 10 independent runs of the experiment.
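The protocol above can be sketched as follows. The helper names, the seed handling, and the placeholder `evaluate` callback are assumptions made for illustration; in a real experiment `evaluate` would train on the first edge list and return the AUC on the second.

```python
import random

def split_edges(edges, train_ratio, rng):
    """Randomly split an edge list into training and test edges."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = round(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def average_over_runs(edges, train_ratio, evaluate, runs=10, seed=0):
    """Average evaluate(train, test) over several independent random splits."""
    rng = random.Random(seed)
    results = [evaluate(*split_edges(edges, train_ratio, rng)) for _ in range(runs)]
    return sum(results) / runs

# Hypothetical edge list of a 11-node path.
edges = [(i, i + 1) for i in range(10)]
train, test = split_edges(edges, 0.7, random.Random(0))

# Placeholder metric: the number of held-out edges in each split.
mean_test_size = average_over_runs(edges, 0.9, lambda tr, te: len(te))
```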

Tables 2 and 3 show the prediction results of 18 node similarity-based benchmark metrics before and after applying the LPFGF framework, to demonstrate the feasibility of the LPFGF algorithmic framework.

Table 2 AUC values of the original link prediction algorithms (%)

Table 3 AUC values of the link prediction algorithms after optimization with the LPFGF framework (%)

The original AUC values of the node similarity-based link prediction benchmark methods are listed in Table 2. The AUC results show that the overall prediction performance is better on the DBLP and Wiki datasets. On these four datasets, only six algorithms achieve a link prediction accuracy higher than 80%, accounting for 33.3%: LP, Katz, LHNII, LRW, SRW, and MFI. Among them, Katz and MFI outperform the others, with accuracy higher than 90% on all four datasets. By contrast, CN, Salton, HPI, and other algorithms show poor prediction accuracy on the Citeseer dataset, with a minimum of 65.8%. Considering the four types of node similarity-based algorithms outlined in the related work, we find that the path-based, random walk-based, and other link prediction algorithms perform better than the local information-based algorithms.
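
For readers who want to see how an AUC value of this kind is typically obtained, the following Python sketch uses the sampling-based estimate that is standard in the link prediction literature: a held-out test edge and a nonexistent edge are drawn repeatedly, and AUC = (n' + 0.5 n'') / n, where n' counts the draws in which the test edge scores higher and n'' counts ties. Whether the paper uses exactly this estimator is an assumption here. The function names, the toy graph, and the choice of Common Neighbors (CN) as the scoring function are illustrative only.

```python
import random
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors_score(adj, u, v):
    """CN similarity: number of neighbors shared by u and v in the training graph."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def sampled_auc(train_edges, test_edges, nodes, score, n_samples=1000, seed=0):
    """Estimate AUC by comparing random test edges against random nonexistent edges."""
    rng = random.Random(seed)
    adj = build_adjacency(train_edges)
    known = {frozenset(e) for e in train_edges} | {frozenset(e) for e in test_edges}
    non_edges = [p for p in combinations(sorted(nodes), 2) if frozenset(p) not in known]
    higher = ties = 0
    for _ in range(n_samples):
        s_missing = score(adj, *rng.choice(test_edges))
        s_absent = score(adj, *rng.choice(non_edges))
        if s_missing > s_absent:
            higher += 1
        elif s_missing == s_absent:
            ties += 1
    return (higher + 0.5 * ties) / n_samples


# Hypothetical toy graph: the held-out edge (0, 2) shares neighbor 1,
# while no nonexistent pair shares any neighbor.
train = [(0, 1), (1, 2), (3, 4)]
test = [(0, 2)]
auc = sampled_auc(train, test, nodes=range(5), score=common_neighbors_score)
print(auc)
```

An AUC of 0.5 corresponds to random guessing, so the values in Table 2 can be read as the margin by which each similarity index beats chance.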

The AUC values of the node similarity-based link prediction benchmark methods after applying the proposed LPFGF algorithmic framework are listed in Table 3. The overall prediction performance on the DBLP and Wiki datasets remains better after applying the framework, with AUC values essentially all above 90%. On these four datasets, 10 algorithms achieve a link prediction accuracy higher than 80%, accounting for 55.6%: Salton, HDI, LHNI, LP, Katz, LHNII, LNBCN, LRW, SRW, and MFI. Among these, Katz, LHNII, and MFI perform best, with prediction accuracy higher than 90% on all four datasets; on the Citeseer dataset in particular, their accuracy exceeds 93%. The results of the PA algorithm on the Citeseer, DBLP, and Wiki datasets are close to the original results, while its results on the Cora dataset are slightly reduced. Considering the four types of node similarity-based link prediction algorithms described in the related work, we find that all four types improve to some extent after optimization with the LPFGF framework.

Comparing the experimental results in Tables 2 and 3, link prediction algorithms based on local network structure information, such as CN, Salton, HPI, HDI, LHNI, AA, and RA, are substantially improved after applying the proposed LPFGF framework, while link prediction algorithms based on global information, such as LP, Katz, LHNII, ACT, and LRW, are slightly improved. The LP, Katz, and LHNII algorithms consider the n-order paths in the network, and the ACT, LRW, and SRW algorithms consider the probability of a walker moving from any node to the other nodes. The information these algorithms consider is therefore already fairly comprehensive, and their link prediction performance is better to begin with. The LPFGF framework, which considers both the global information of the network and the importance of the nodes, can still improve them to some extent, but only slightly, because much of this information is already captured by the algorithms themselves. By contrast, the link prediction algorithms based on local network structure information consider only the neighboring information between nodes, which results in poor link prediction performance. After applying the LPFGF framework, these algorithms consider not only the common neighbor information of the target nodes, but also the importance of the target nodes and the direct and indirect gravitational values between them. The information considered becomes more comprehensive, and the link prediction performance is greatly improved. This shows that the proposed LPFGF framework is particularly suitable for link prediction algorithms based on local information, such as CN, AA, and RA: it compensates for the incomplete information used by these algorithms and thereby improves their link prediction performance.

Fig. 5 compares the average AUC values of the original benchmark methods before and after applying the LPFGF algorithmic framework, so that the changes in prediction performance can be seen more intuitively. The average AUC value is the average of the AUC values on each dataset. In Fig. 5, the abscissa represents the original benchmark methods and the ordinate represents the average AUC values. The polyline with blue triangles shows the average AUC values before applying the LPFGF framework, and the polyline with rose-red circles shows the average AUC values after applying it.

As seen from Fig. 5, the prediction performance of the link prediction methods is greatly improved after applying the proposed LPFGF framework. In particular, the prediction performance of the CN, Salton, and HPI methods on the Citeseer and Cora datasets is improved by a maximum of 15%. The prediction performance of LP, Katz, LRW, and SRW improves slightly on the Citeseer and Cora datasets and is basically unchanged on the DBLP and Wiki datasets. The prediction performance of PA decreases slightly, but remains close to the original results. The reason is that, when the PA algorithm is optimized by the LPFGF framework, node contraction is used as the node importance evaluation method. According to Eqs. 2 and 4, the importance of a node is determined not just by its degree but also by its position: a node with a high degree may not be important if it does not occupy a key position in the network. The PA algorithm, however, considers only the impact of node degree on node similarity, and this mismatch leads to the slight performance degradation. Overall, the results indicate that it is effective and feasible to project the similarity problem between nodes into physical space.

As shown in Table 4, we compare the prediction performance of the proposed algorithms, such as LPFGF-CN, LPFGF-Salton, and LPFGF-HPI, with that of four classical network representation learning models, namely DeepWalk, Node2Vec, LINE, and SDNE, on the Cora and Citeseer datasets with a training ratio of 0.8. To avoid repeating the experiments, the prediction results of the four network representation learning models are quoted from [40, 41]. Note that the original benchmark algorithms whose prediction performance is already higher than that of these four network representation learning models are excluded; Table 4 lists the remaining algorithms after optimization by the proposed LPFGF algorithmic framework.

Table 4 AUC values comparison between the LPFGF framework and four network representation learning algorithms (%)

Table 4 shows that, after the classic link prediction algorithms are optimized by the proposed LPFGF algorithmic framework, their prediction performance is, to a certain extent, better than that of the classical network representation learning models, even without considering node label information. This further shows that the proposed LPFGF framework is feasible and effective.

4.5 Time Complexity Analysis

The proposed algorithmic framework consists of two parts, whose time complexity is analyzed separately. The first part calculates the importance of the nodes. First, the Floyd algorithm is used to calculate the shortest distance matrix, with time complexity O(N³). Then, the shortest distance matrix after contracting each node is calculated using a new algorithm [32], which reduces the time complexity to O(N²), and the importance of the nodes is computed from it. The overall time complexity of the first part is therefore O(N³). The second part computes the normalized similarity matrix. First, a node similarity-based algorithm is needed to measure the distance in the universal gravitation equation. The time complexity is O(N²) for local information-based similarity algorithms such as CN, AA, and RA; O(N³) for global path-based similarity algorithms such as LP and Katz; O(N³) for ACT; and O(N⟨k⟩ⁿ) for LRW and SRW, where ⟨k⟩ is the average degree and is much smaller than N. In summary, the time complexity of measuring the distance in the universal gravitation equation is at most O(N³). Then, the direct and indirect gravitational values between nodes are calculated with time complexity O(N³), so the time complexity of the second part is O(N³). Therefore, the overall time complexity of the proposed LPFGF algorithmic framework is O(N³), which equals that of the global node similarity algorithms, while the link prediction performance of the proposed framework is much better than that of the global node similarity algorithms.
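
The O(N³) cost of the first step comes from the triple loop of the Floyd algorithm. A minimal Python sketch for an unweighted, undirected graph is given here for illustration; the function name `floyd_warshall`, the integer node labels 0..n-1, and the toy graph are assumptions and are not taken from the paper. The node contraction step of [32] is not reproduced.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an unweighted, undirected graph on nodes 0..n-1."""
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    # Three nested loops over N nodes each give the O(N^3) running time.
    for k in range(n):
        for i in range(n):
            d_ik = dist[i][k]
            for j in range(n):
                if d_ik + dist[k][j] < dist[i][j]:
                    dist[i][j] = d_ik + dist[k][j]
    return dist


# Hypothetical toy graph: a path 0-1-2-3 plus an isolated node 4.
d = floyd_warshall(5, [(0, 1), (1, 2), (2, 3)])
print(d[0][3], d[0][4])
```

Unreachable pairs keep the distance `math.inf`, which a downstream importance measure would need to handle explicitly.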

Furthermore, as shown in Table 5, we choose five representative algorithms from the common neighbor-based similarity algorithms, the path-based global similarity algorithms, and the random walk-based global similarity algorithms, and compare their running time before and after applying the LPFGF framework on the Citeseer dataset. We find that, although the time complexity of the proposed LPFGF algorithmic framework is theoretically on the same level as that of the global link prediction algorithms, and although the framework can significantly improve link prediction performance, there is in practice a certain gap in running time.

Table 5 Running time comparison before and after using the LPFGF framework

As shown in Table 6, the running time for calculating node importance using node contraction is much higher than that for calculating the similarity between nodes using the gravitational field technique. The reason is that calculating node importance by node contraction requires not only the shortest distance matrix of the network before contraction, computed with the Floyd algorithm, but also the shortest distance matrix of the network after contracting each node. These computations are costly, which leads to the long running time.

Table 6 Running time of computing node importance and node similarity

5 Conclusion

We introduce the universal gravitational equation to evaluate the strength of the relationship between nodes in complex networks. The mass term of the equation is measured by the node importance obtained with the node contraction method, and the distance term is measured by the similarity between nodes. On this basis, we propose a link prediction framework based on the gravitational field, abbreviated as LPFGF, and derive the similarity equation between nodes, thereby projecting the node similarity problem into physical space. Finally, experimental results on four real datasets show that the prediction performance of most of the compared node similarity-based link prediction algorithms is improved to some extent by the LPFGF framework.

However, the LPFGF framework still has some shortcomings. In an ordinary network, an edge connects only two nodes, so only the relationship and similarity between pairs of nodes can be considered. Yet many real-world structures go beyond such networks; for example, in a scientific collaboration network, an article often has more than two co-authors. In future work, we will therefore address the question of how to predict edges that go beyond pairwise interactions, and we will also try to apply the proposed LPFGF framework to high-order network structures.

References

[1] Albert R, Barabasi AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 75(51):47–97
[2] Baghershahi P, Hosseini R, Moradi H (2022) Self-attention presents low-dimensional knowledge graph embeddings for link prediction. http://arxiv.org/abs/math/2112.10644
[3] Callaway DS, Newman MEJ, Strogatz SH, Watts DJ (2000) Network robustness and fragility: percolation on random graphs. Phys Rev Lett 85(25):5468–5471
[4] Chebotarev PY, Shamis EV (1997) The matrix-forest theorem and measuring relations in small social groups. Autom Remote Control 58(9):1505–1514
[5] Chen JH, Chen W, Huang JJ, Fang JH, Li ZX, Liu A, Zhao L (2020) Co-purchaser recommendation for online group buying. Data Sci Eng 5:280–292
[6] Chen Y, Hu AQ, Hu J, Chen LQ (2004) Method for finding the most vital node in communication networks. Gaojishu Tongxin/High Technol Lett 14(1):21–24
[7] Clauset A, Moore C, Newman MEJ (2008) Hierarchical structure and the prediction of missing links in networks. Nature 453(7191):98–101
[8] Daud NN, Hamid SHA, Saadoon M, Sahran F, Anuar NB (2020) Applications of link prediction in social networks: a review. J Netw Comput Appl 166:102716. https://doi.org/10.1016/j.jnca.2020.102716
[9] Ding JD, Guo J (2021) Mining potential author cooperative relationships based on the similarity of content and path. Inf Stud: Theor Appl 44(1):124–128
[10] Fındık O, Özkaynak E (2021) Link prediction based on node weighting in complex networks. Soft Comput 25:2467–2482. https://doi.org/10.1007/s00500-020-05314-8
[11] Freeman LC (1977) A set of measures of centrality based on betweenness. Sociometry 40(1):35–41
[12] Getoor L, Diehl CP (2005) Link mining: a survey. ACM SIGKDD Explor Newsl 7(2):3–12
[13] Grover A, Leskovec J (2016) node2vec: Scalable Feature Learning for Networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, USA, 855–864
[14] Gu QY, Wu B, Chi RY (2021) Link prediction method based on the similarity of high path. J Commun 42(7):61–69
[15] Hanley JA, Mcneil BJ (1982) The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology 143(1):29–36
[16] He JJ, Li RF (2011) Modified nodes ranking method using random walk model. Comput Eng Appl 47:87–89
[17] Hoyt CT, Berrendorf M, Galkin M, Tresp V, Gyori BM (2022) A unified framework for rank-based evaluation metrics for link prediction in knowledge graphs. arXiv preprint. http://arxiv.org/abs/math/2203.07544
[18] Jiao PF, Guo X, Jing X, He DX, Wu HM, Pan SR, Gong MG, Wang WJ (2021) Temporal network embedding for link prediction via VAE joint attention mechanism. IEEE Trans Neural Netw Learn Syst 45:1–14. https://doi.org/10.1109/tnnls.2021.3084957
[19] Leicht EA, Holme P, Newman MEJ (2006) Vertex similarity in networks. Phys Rev E 73(2):116–120
[20] Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inf Sci 58(7):1019–1031
[21] Li DY, Du Y (2005) Artificial intelligence with uncertainty. Defense and Industry Press
[22] Liu Z, Zhang QM, Lü LY, Zhou T (2011) Link prediction in complex networks: a local naive Bayes model. Europhys Lett 96(4):48007
[23] Lu SY, Shi J, Liu B, Yao JK, Jin Y (2019) Research on a link prediction algorithm based on transferring similarity and preferential attachment. Computer Technol Develop 29(008):18–23
[24] Lü L, He M, Yi C (2021) An improved MH link prediction algorithm combining with Random Walk with Restart. J Yunnan Univ: Nat Sci Edition 43(2):245–253
[25] Nasiri E, Berahmand K, Rostami M, Dabiri M (2021) A novel link prediction algorithm for protein-protein interaction networks by attributed graph embedding. Comput Biol Med 137:104772. https://doi.org/10.1016/j.compbiomed.2021.104772
[26] Pan LM, Zhou T, Lü LY, Hu CK (2016) Predicting missing links and identifying spurious links via likelihood analysis. Sci Rep 6(1):22955–22955
[27] Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, USA, 701–710
[28] Saxena A, Kochsiek A, Gemulla R (2022) Sequence-to-sequence knowledge graph completion and question answering. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 2814–2828. https://doi.org/10.18653/v1/2022.acl-long.201
[29] Si SZ (2014) Link prediction and network reconfiguration in social networks. Dissertation, Northeastern University
[30] Tang J, Qu M, Wang MZ, Yan J, Mei QZ (2015) LINE: Large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 1067–1077
[31] Tang YX, Qi JY (2021) Algorithm of predicting missing links of small promoted index via common neighbors. Modern Electron Tech 44(5):37–40
[32] Tan YJ, Wu J, Deng HZ (2006) Evaluation method for node importance based on node contraction in complex networks. Syst Eng Theor Pract 26(011):79–83
[33] Wang DX, Cui P, Zhu WW (2016) Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, USA, 1225–1234
[34] Wang J, Ilievski F, Szekely P, Yao K (2022) Augmenting knowledge graphs for better link prediction. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 2277–2283. https://doi.org/10.24963/ijcai.2022/316
[35] Wu HX, Song CY, Ge Y, Ge TJ (2022) Link prediction on complex networks: an experimental survey. Data Sci Eng 5:1–26. https://doi.org/10.1007/s41019-022-00188-2
[36] Wu SW, Zhang YX, Gao CL, Bian KG, Cui B (2020) GARG: anonymous recommendation of point-of-interest in mobile networks by graph convolution network. Data Sci Eng 5:433–447
[37] Yu HY, Braun P, Yildirim MA, Lemmens I, Venkatesan K, Sahalie J et al (2008) High-quality binary protein interaction map of the yeast interactome network. Science 322(5898):104–110
[38] Zhang MH, Chen YX (2018) Link prediction based on graph neural networks. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, Canada, 5171–5181
[39] Zhang X, Zhao CL, Wang XJ, Yi DY (2015) Identifying missing and spurious interactions in directed networks. Int J Distrib Sens Netw. https://doi.org/10.1155/2015/507386
[40] Zhou M, Li BS, Yang ML, Pan LJ (2022) TeleGraph: A benchmark dataset for hierarchical link prediction. http://arxiv.org/abs/math/2204.07703
[41] Zhou MQ, Jin HJ, Wu QW, Xie H, Han QZ (2022) Betweenness centrality-based community adaptive network representation for link prediction. Appl Intell 52(4):3545–3558. https://doi.org/10.1007/s10489-021-02633-7

Funding

This work is partially supported by the National Key Research and Development Program of China under Grant No. 2020YFC1523300, the National Natural Science Foundation of China under Grant No. 61763041, the Youth Program of Natural Science Foundation of Qinghai Province of China under Grant No. 2021-ZJ-946Q, and the Middle-Youth Program of Natural Science Foundation of Qinghai Normal University under Grant No. 2020QZR007.

Author information

Authors and Affiliations

College of Computer, Qinghai Normal University, Xining, 810001, Qinghai, China
Yanlin Yang, Zhonglin Ye, Haixing Zhao & Lei Meng
The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining, 810008, Qinghai, China
Yanlin Yang, Zhonglin Ye, Haixing Zhao & Lei Meng
Tibetan Information Processing Engineering Technology and Research Center of Qinghai Province, Xining, 810008, Qinghai, China
Yanlin Yang, Zhonglin Ye, Haixing Zhao & Lei Meng

Authors

Yanlin Yang
Zhonglin Ye
Haixing Zhao
Lei Meng

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Yanlin Yang, Zhonglin Ye, and Lei Meng. The first draft of the manuscript was written by Yanlin Yang and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Haixing Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Yang, Y., Ye, Z., Zhao, H. et al. A Novel Link Prediction Framework Based on Gravitational Field. Data Sci. Eng. 8, 47–60 (2023). https://doi.org/10.1007/s41019-022-00201-8

Received: 12 April 2022
Revised: 15 October 2022
Accepted: 27 November 2022
Published: 16 January 2023
Issue Date: March 2023
DOI: https://doi.org/10.1007/s41019-022-00201-8
