1 Introduction

Many social, biological, and information systems in the real world, from the nervous system to the ecosystem, from road traffic to the Internet, from ant colonies to human social relations, can be naturally described as networks, where vertices represent entities and links denote relations or interactions between vertices. A network is only a topological approximation of the underlying complex system: owing to limitations of time, space, or experimental conditions, the constructed network inevitably contains erroneous or redundant links, while some existing links remain undetected. Moreover, network links evolve dynamically over time. The goal of the link prediction problem is to infer such missing and potential links from the known network information [1, 2].

The link prediction problem has a wide range of practical applications in different fields. For example, in biological networks such as protein-protein interaction, metabolic, and disease-gene networks [3], a link between two nodes indicates an interaction relationship between them. Because experimentally revealing the hidden interactions in these networks is costly, link prediction results can guide experiment design, reducing cost and improving the success rate of experiments. Predicting missing and spurious links in disease-gene networks can help explore the mechanisms of diseases and predict and evaluate their treatment. Furthermore, it can identify new drug targets and open up new avenues for drug development [4].

In social network analysis, link prediction can serve as a powerful supplementary tool for accurately analyzing social network structure. Research on online social network analysis has developed rapidly in recent years. In online social networks, link prediction can reveal potential friends of users, who can then be recommended to them [5]. By analyzing social relations, we can find potential interpersonal links [6, 7]. Link prediction can also be used in academic networks to predict the type and cooperators of an academic paper [8], and it can be applied directly to information recommendation, such as recommending commodities to customers [9]. Marketers would like to recommend products or services based on existing preferences or contacts. Social networking websites would like to customize suggestions for new friends and groups. In monitoring e-mail communication, link prediction is used to detect anomalous e-mails [10]. Financial corporations would like to monitor transaction networks for fraudulent activity. In monitoring criminal networks, link prediction is used to discover hidden connections between criminals so as to prevent criminal or terrorist activity.

Link prediction not only has wide practical value, but also has important theoretical significance. For example, it is helpful for understanding the mechanisms of complex network evolution [11]. Since there are many statistical quantities describing network structure, it is difficult to compare the advantages and disadvantages of different evolution mechanisms directly. Link prediction provides a simple and unified platform for a fair comparison of network evolution mechanisms, thereby promoting theoretical research on complex network evolution models.

In recent years, many methods for link prediction have been reported. These methods can be classified into categories such as similarity-based methods, maximum likelihood methods, and probabilistic model based methods.

In similarity-based methods, each node pair is assigned an index, defined as the similarity between the two nodes. All non-observed links are ranked according to their similarities, and links connecting more similar nodes are assumed to have higher existence likelihoods. Node similarity can be defined using the essential attributes of nodes: two nodes are considered similar if they share many common features [12] or topological structures [13]. Many studies have found substantial levels of topical similarity among users who are close to each other in a social network, such as the friendship prediction in [14], which studied the presence of homophily in three systems that combine tagging social media with online social networks. Many works exploit topological features of the network structure for link prediction. In [15], the overall relations between object pairs are defined as a link pattern, which consists of the interaction pattern and the connection structure in the network. Structural similarity indices can be classified into three categories: local indices, global indices, and quasi-local indices. Local indices use only the neighborhood information of the nodes; typical local indices include the Common Neighbors index [16], Salton Index [17], Jaccard Index [18], Sorensen Index [19], Hub Depressed Index [20], Hub Promoted Index [21], Leicht-Holme-Newman Index (LHN1) [21], Preferential Attachment Index [22], Adamic-Adar Index [23] and Resource Allocation Index [24]. Global indices require global topological information; the Katz Index [25], Leicht-Holme-Newman Index (LHN2) [21] and Matrix Forest Index (MFI) [26] are typical global indices. Quasi-local indices do not require global topological information but use more information than local indices; such indices include the Local Path Index [24, 27], Local Random Walk [28], and Superposed Random Walk [28]. Another group of similarity indices is based on random walks, such as Average Commute Time [29], Cos+ [30], Random Walk with Restart [31], and SimRank [32]. Zhou et al. [24, 33] proposed two new indices, the Resource Allocation index and the Local Path index, and empirical results show that these two indices outperform the other local indices. In particular, the Local Path index, requiring only a little more information than the Common Neighbors index, provides competitively accurate predictions compared with the global indices.
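To make the local indices concrete, the following minimal Python sketch (not from the original paper; the example graph and node pair are illustrative only) computes the Common Neighbors, Jaccard, Adamic-Adar and Resource Allocation scores for a node pair using networkx:

```python
import math
import networkx as nx

def local_similarity_scores(G, x, y):
    """Compute several local similarity indices for the node pair (x, y)."""
    cn = set(nx.common_neighbors(G, x, y))        # common neighbors of x and y
    union_size = len(set(G[x]) | set(G[y]))       # size of the union of the two neighborhoods
    return {
        "CN": len(cn),                                                           # Common Neighbors
        "Jaccard": len(cn) / union_size if union_size else 0.0,                  # Jaccard Index
        "AA": sum(1.0 / math.log(G.degree(z)) for z in cn if G.degree(z) > 1),   # Adamic-Adar
        "RA": sum(1.0 / G.degree(z) for z in cn),                                # Resource Allocation
    }

# toy usage on a built-in example graph
G = nx.karate_club_graph()
print(local_similarity_scores(G, 0, 33))
```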

Liu and Lv [34] studied the link prediction problem based on local random walks, and found that a limited number of walk steps may yield better predictions than a global random walk. Rao [35] proposed an algorithm based on the MapReduce parallel computation model that can be applied to large complex networks. Dong [36] proposed an algorithm based on the gravitation of nodes, which improves prediction accuracy while maintaining low time complexity.

Another category of link prediction methods is based on maximum likelihood estimation. These methods presuppose some organizing principle of the network structure, with the detailed rules and specific parameters obtained by maximizing the likelihood of the observed structure. The likelihood of any non-observed link can then be calculated according to those rules and parameters. Typical organizing models of the network are the hierarchical structure model [37] and the stochastic block model [38-40]. In [41], a set of simple features is proposed as a structural model that can be analyzed to identify missing links. The hierarchical model has high accuracy on networks with a significant level of hierarchical organization, such as terrorist networks and grassland food webs. However, since it needs to generate a large number of sample networks, its computational complexity is too high for large scale networks. Link prediction based on the stochastic block model can predict not only missing links, but can also detect and correct erroneous links in the network, such as spurious links in protein interaction networks. From the viewpoint of practical applications, an obvious drawback of the maximum likelihood methods is that they are very time consuming, and they will certainly fail on huge online networks that often consist of millions of nodes.

Another type of link prediction method is based on probabilistic models. These methods aim to abstract the underlying structure from the observed network and then predict missing links using the learned model. They first create a model containing a set of adjustable parameters, and then use an optimization strategy to find the parameter values such that the resulting model best reflects the structures and relationships of the real network. The probabilistic model optimizes a target function to establish a model with a group of parameters Θ that best fits the observed data of the target network. The probability of a potential link (i,j) is then estimated by the conditional probability P(A_ij = 1|Θ). There are three mainstream probability based methods, namely the Probabilistic Relational Model (PRM) [42], the Probabilistic Entity Relationship Model (PERM) [43] and the Stochastic Relational Model (SRM) [44]. Ramesh et al. [45] proposed an approach for probabilistic link prediction and path analysis using Markov chains. Kashima et al. [46] introduced an approach for link prediction in network structured domains. An advantage of probabilistic model based methods is that they can achieve higher predictive accuracy, but their time complexity and non-universal parameter estimation seriously restrict their application scope.

Several methods adopt a supervised machine learning strategy for link prediction. The target attribute of these methods is a class label indicating the existence or absence of a link between a node pair. The relevance of using weights to improve supervised link prediction is investigated in [47]. In [48], a link propagation method is proposed, a semi-supervised learning algorithm for link prediction on graphs based on the widely studied label propagation. In [46], a parameterized probabilistic model of network evolution was presented and an incremental learning algorithm for such models was derived. Similar to the maximum likelihood methods, a drawback of machine learning based methods is their high time complexity, which is prohibitive in some real applications.

Many studies focus on link prediction in multidimensional and large-scale social networks. In [49], Rossetti et al. presented several predictors based on structural analysis of multidimensional networks. Song et al. [50] proposed a method to approximate a large family of proximity measures for link prediction in large-scale networks. Although these methods are designed specifically for large scale networks, the accuracy of their results cannot be guaranteed because of the limit on computation time.

Another link prediction problem of increasing interest revolves around node attributes. Many real-world networks contain rich categorical node attributes, e.g., users in Google+ have profiles with attributes including employer, school, occupation and address. In the attribute inference problem, we aim to populate attribute information for network nodes with missing or incomplete attribute data. This scenario often arises in practice when users in online social networks set their profiles to be publicly invisible or create an account without providing any attribute information. The growing interest in this problem is highlighted by the privacy implications associated with attribute inference as well as the importance of attribute information for applications including people searching and collaborative filtering.

There are two sources of information in networks with node attributes, namely topological information and attribute information. How to incorporate these two sources of information simultaneously is an important issue in link prediction on networks with node attributes. Approaches based on relational learning [51, 52] and on matrix factorization and alignment [53, 54] have been proposed to leverage attribute information for link prediction, but they suffer from scalability issues. More recently, Backstrom and Leskovec [55] presented a Supervised Random Walk (SRW) algorithm for link prediction that combines network structure and edge attribute information, but this approach does not fully leverage node attribute information, as it only incorporates node information for neighboring nodes. Yin et al. [56, 57] proposed the Social-Attribute Network (SAN) model to gracefully integrate network structure and node attributes in a scalable way. They focused on generalizing the Random Walk with Restart (RWR) algorithm to the SAN model to predict links as well as to infer node attributes.

Ant colony optimization (ACO) is an evolution simulation algorithm proposed by M. Dorigo et al. [58-60]. Inspired by the behavior of real ant colonies, they recognized the similarity between the ants' food-hunting activities and the travelling salesman problem (TSP), and successfully solved TSP instances using the same principle that ants use to find the shortest route to a food source via communication and cooperation. ACO has been successfully applied to system fault detection, job-shop scheduling, frequency assignment, network load balancing, graph coloring, robotics and other combinatorial optimization problems [61-67]. ACO has advantages such as positive feedback, distributed computation, and a constructive greedy heuristic search.

In this paper, we propose a link prediction method based on ant colony optimization. In the algorithm, artificial ants travel on a logical graph. Each ant chooses its path according to the pheromone and heuristic information on the edges. The paths the ants pass through are evaluated, and the pheromone on each edge is updated according to the quality of the paths containing it. Finally, the pheromone on each edge is used as the similarity score between its endpoint nodes. We use AUC and precision to evaluate the performance of the algorithm and compare it with other link prediction algorithms. Experimental results on a number of real networks show that the accuracy of our algorithm is significantly superior to that of the other algorithms. We also extend the method to the link prediction problem in networks with node attributes, where the pheromones on the edges are used both to predict links and to infer node attributes. Our experimental results show that the algorithm obtains higher quality results on networks with node attributes than the other algorithms.

The rest of this paper is organized as follows. Section 2 reviews the problem of link prediction and methods for evaluating the results. Section 3 presents the ACO based algorithm ACO_LP, and describes the implementation details of the algorithm. Section 4 extends the method to solve the link prediction problem in the networks with node attributes. Section 5 shows and analyzes the experimental results obtained by ACO_LP, and compares its performance with other similar methods. Section 6 draws conclusions.

2 Problem formulation and evaluation methods

We consider a network represented by an undirected simple network G(V,E), where V is the set of nodes and E is the set of links. Multiple links and self-connections are not allowed in G. Let N = |V | be the number of nodes in G. We use U to denote the universal set containing all N(N-1)/2 possible links. The task of link prediction is to find out missing links (or the links that will appear in the future) in the set of non-existing links U-E.

The purpose of our method is to assign a score, Score(x,y), to each pair of nodes (x,y) ∈ U. This score reflects the similarity between the two nodes. For a node pair (x,y) in U−E, the larger Score(x,y) is, the higher the probability that a link exists between nodes x and y.

To test the accuracy of our algorithm, the observed links in E are randomly divided into two parts: the training set E^T, which is treated as known information, and the probe set (i.e., validation set) E^P, which is used for testing and provides no information for prediction. Clearly, E^T ∪ E^P = E and E^T ∩ E^P = ∅. As an example, Fig. 1a shows a network with 15 nodes and 21 existing links. Our goal is to predict potential links among the 84 unconnected node pairs. To test the algorithm's accuracy, we select some existing links as the probe set and use the others as the training set. For instance, we pick 5 links as probe links, shown as dashed lines in Fig. 1b. The algorithm then only makes use of the information contained in the training graph, shown as solid lines in Fig. 1b. The algorithm eventually gives a score to each of the 89 node pairs, including the 84 non-existing links in U−E and the 5 probe links in E^P.
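As an illustration of this protocol (a sketch, not part of the original algorithm; the 10% probe fraction is an assumed value), a random training/probe split of the observed links could be implemented as follows:

```python
import random

def split_links(edges, probe_fraction=0.1, seed=42):
    """Randomly split the observed links E into a training set E^T and a probe set E^P."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_probe = int(len(edges) * probe_fraction)
    probe_set = set(edges[:n_probe])    # E^P: hidden from the predictor, used only for testing
    train_set = set(edges[n_probe:])    # E^T: known information used for prediction
    return train_set, probe_set
```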

Fig. 1 A network

In principle, a link prediction algorithm provides an ordered list of all non-observed links (i.e., links in U−E^T), or equivalently gives each non-observed link (x,y) ∈ U−E^T a score s_xy to quantify its existence likelihood. To quantify the accuracy of prediction algorithms, three standard metrics are used: AUC, precision and ranking score.

(1) AUC

AUC (area under the receiver operating characteristic curve) measures the overall accuracy of the link prediction results. Given the ranking of all non-observed links, the AUC score can be interpreted as the probability that a randomly chosen missing link (a link in E^P) is given a higher score than a randomly chosen non-existing link (a link in U−E). In the algorithmic implementation, we usually calculate the score of each non-observed link instead of producing the ordered list, since the latter is more time consuming. Each time we randomly pick a missing link and a non-existing link and compare their scores. If, among n independent comparisons, the missing link has a higher score n' times and the two links have the same score n'' times, the AUC score is:

$$AUC=\frac{n^{\prime}+0.5n^{\prime\prime}}{n} $$

If all the scores are generated from an independent and identical distribution, the AUC score should be about 0.5. Therefore, the degree to which the value exceeds 0.5 indicates how much better the algorithm performs than pure chance.
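A minimal sampled-comparison implementation of this AUC estimate could look as follows (illustrative sketch; it assumes the scores are stored in a dictionary keyed by node pairs):

```python
import random

def auc_score(score, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC by n random comparisons between probe links and non-existing links."""
    rng = random.Random(seed)
    probe = list(probe_links)
    absent = list(nonexistent_links)
    higher = equal = 0
    for _ in range(n):
        s_missing = score[rng.choice(probe)]    # score of a randomly chosen missing link
        s_absent = score[rng.choice(absent)]    # score of a randomly chosen non-existing link
        if s_missing > s_absent:
            higher += 1
        elif s_missing == s_absent:
            equal += 1
    return (higher + 0.5 * equal) / n           # AUC = (n' + 0.5 n'') / n
```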

(2) Precision

Given the ranking of the non-observed links, the precision is defined as the ratio of relevant items selected to the total number of items selected. That is to say, if we take the top-L links as the predicted ones, among which m links are right, then the precision is:

$$precision=\frac{m}{L} $$

Clearly, higher precision means higher prediction accuracy.
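For instance, precision can be computed by taking the top-L scored non-observed links and counting how many of them fall in the probe set (illustrative sketch, same score-dictionary assumption as above):

```python
def precision_at_L(score, probe_links, L):
    """Precision = fraction of the top-L predicted links that are actual probe links."""
    ranked = sorted(score, key=score.get, reverse=True)    # non-observed links by descending score
    top_L = ranked[:L]
    m = sum(1 for link in top_L if link in probe_links)    # number of correct predictions
    return m / L
```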

(3) Ranking Score

Ranking score (RS) considers the ranks of the similarity scores of the probe links. Let H = U−E^T be the set of non-observed links, let e_i be a link in the probe set E^P, and let r_i be the rank of e_i after sorting all links in H in descending order of their scores. The ranking score of link e_i is defined as RS_i = r_i/|H|, and the ranking score of the link prediction result is:

$$RS=\frac{1}{\vert E^{p}\vert }\sum\limits_{i\in E^{p}} {RS_{i} =\frac{1}{\vert E^{p}\vert }} \sum\limits_{i\in E^{p}} {\frac{r_{i} }{\vert H\vert }} $$

It can easily be seen that a prediction result with higher accuracy yields a smaller ranking score.
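Under these definitions, the ranking score could be computed as in the following sketch, where the score dictionary is assumed to cover all non-observed links in H:

```python
def ranking_score(score, probe_links):
    """Average relative rank r_i/|H| of the probe links among all non-observed links H."""
    ranked = sorted(score, key=score.get, reverse=True)      # descending order of scores
    rank = {link: r + 1 for r, link in enumerate(ranked)}    # r_i, 1-based rank of each link
    H_size = len(ranked)                                     # |H|
    return sum(rank[e] / H_size for e in probe_links) / len(probe_links)
```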

3 Framework of the ACO algorithm for link prediction

3.1 Basic idea of the algorithm

Given an undirected simple network G=(V,E), a complete graph G', called the logical graph of G, is constructed by adding all the missing links to G. In our algorithm, artificial ants randomly walk on the logical graph G'. We refer to the connections between node pairs in the original graph G as "links", and to those in the logical graph G' as "edges", which include both existing and non-existing links of G. On each edge of G' we place pheromone and heuristic information. Edges that are more likely to carry a link receive larger pheromone and heuristic values.

Each ant travels on the logical graph G' to visit n nodes and forms a path. In one iteration of an ant's walk, some nodes may be selected multiple times, and some nodes may not be selected at all. In the random walk, an ant at node v_i chooses the next edge (v_i,v_j) to walk through according to a probability p_ij, defined in terms of the pheromone and heuristic information on edge (v_i,v_j). An edge with a higher tendency to carry a link is assigned a larger probability p_ij, so the ants are more likely to pass through it. After the ants finish a round of walks and form their paths, the algorithm evaluates the quality of each path. A path consisting of edges that are more likely to carry links gets a higher quality score. The quality scores are then used to update the pheromone on the edges of the path: an edge on a path with a higher quality score obtains a larger pheromone increment. This pheromone in turn influences the walks of the ants in the next iteration: the more pheromone is laid on an edge, the more likely an ant is to select it. The pheromone on each edge is increased by the ants passing through it and decreased by evaporation in each iteration. Communication and cooperation between individual ants through the pheromone give the algorithm a strong capability of finding the best paths. Finally, the pheromone τ_ij on edge (v_i,v_j) is used as the score reflecting the similarity between the two nodes, and the resulting pheromone matrix is output as the final score matrix.

3.2 Parameter initialization

For the undirected simple network G=(V,E), let |V|=n, V = {v_1,v_2,…,v_n}, and let the n×n matrix A=[a_ij] be the adjacency matrix of G. We use a vector S = (s_1,s_2,…,s_n) to represent a path an ant walks through in the logical graph G', where s_i ∈ V is the ith node on the path. Initially, we set all the elements of S to Φ, which represents an empty node and will be replaced by a real node during the ant's random walk.

We set the initial value of pheromone on the edge between the nodes (v i ,v j ) as

$$ \tau_{ij} =\lambda \ast (a_{ij} +\varepsilon ) $$
(1)

Here, λ and ε are positive constants; that is, if (v_i,v_j) ∈ E the initial pheromone value τ_ij is set to λ(1+ε), otherwise it is set to λε. Obviously, edges corresponding to existing links have higher initial pheromone values. This pheromone information guides the ants to walk through the existing links and their neighborhoods with higher probability.

Denote the set of common neighbors of two nodes x and y as:

$${\Gamma} (x,y)=\{v\vert v\in V,(x,v)\in E\wedge (y,v)\in E\} $$

We set the value of heuristic information on the edge between nodes (v i ,v j ) as

$$ \eta_{ij} =\gamma \ast \vert {\Gamma} (i,j)\vert $$
(2)

Here, parameter γ is a positive constant. Such heuristic information will direct the ants to walk towards the closely connected nodes.
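Under our reading of (1) and (2), the pheromone and heuristic matrices can be initialized from the adjacency matrix as in the following sketch; the values of λ, ε and γ are arbitrary defaults, and zeroing the diagonal is an extra assumption to prevent self-selection:

```python
import numpy as np

def initialize(A, lam=1.0, eps=0.01, gamma=1.0):
    """Initialize the pheromone (1) and heuristic (2) matrices for the logical graph."""
    A = np.asarray(A, dtype=float)
    tau = lam * (A + eps)           # tau_ij = lambda * (a_ij + eps)
    common = A @ A                  # (A^2)_ij = number of common neighbors |Gamma(i, j)|
    eta = gamma * common            # eta_ij = gamma * |Gamma(i, j)|
    np.fill_diagonal(tau, 0.0)      # no self-loops on the logical graph (assumption)
    np.fill_diagonal(eta, 0.0)
    # note: a small floor could be added to eta to keep selection probabilities nonzero
    return tau, eta
```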

3.3 Probability for ants’ path selection

In each iteration, an ant at node v_i selects an edge to reach the next node according to a probability. We define \(p_{ij}^{k} \) as the probability that ant k at node v_i chooses node v_j:

$$ p_{ij}^{k} =\frac{\tau_{ij}^{\alpha} \cdot \eta_{ij}^{\beta} }{\sum\limits_{l=1}^{n} {\tau_{il}^{\alpha} \cdot \eta_{il}^{\beta} } } $$
(3)

Here, τ_ij is the pheromone on the edge between nodes v_i and v_j, and η_ij is a heuristic function defined as the visibility of edge (v_i,v_j). The parameters α and β determine the relative influence of the pheromone and the heuristic information. Obviously, an edge with larger pheromone and heuristic values has a higher probability of being chosen by the ants.
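A roulette-wheel selection step implementing (3) might look like the following sketch (it reuses the matrices from the initialization above and additionally forbids staying at the current node, which the paper does not state explicitly):

```python
import numpy as np

def choose_next_node(tau, eta, i, alpha=0.8, beta=0.7, rng=None):
    """Select the next node j for an ant at node i with the probability p_ij of (3)."""
    rng = rng or np.random.default_rng()
    weights = (tau[i] ** alpha) * (eta[i] ** beta)    # tau_ij^alpha * eta_ij^beta
    weights[i] = 0.0                                  # do not stay at the current node (assumption)
    total = weights.sum()
    if total == 0.0:                                  # fall back to a uniform choice if all weights vanish
        weights = np.ones(len(weights))
        weights[i] = 0.0
        total = weights.sum()
    return rng.choice(len(weights), p=weights / total)
```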

3.4 The fitness function

After each iteration, the tour of each ant forms a path consisting of n nodes. The path of the kth ant is denoted as S(k) = (s_1,s_2,s_3,...,s_n), where s_i ∈ V is the ith node in the path. A fitness function is defined to measure the quality of each path and is used to update the pheromone information. A path containing more existing links and more closely connected nodes has a higher quality score, since each pair of adjacent nodes on such a path is more likely to be connected by a potential link. Therefore, the fitness of a path can be measured in two aspects, namely the importance of the nodes and of the edges on the path.

Generally, the importance of a node is measured by its "centrality" under various definitions. Different centrality measures characterize different functions of a node in the network, such as its spreading ability or its influence. Degree centrality is the simplest and most direct measure of node importance. In general, a node with larger degree is more important and more likely to be linked with other nodes. Accordingly, the fitness of a path can be defined as the sum of the degrees of its nodes:

$$ Q(S)=\sum\limits_{i=1}^{n} {d(} s_{i} ) $$
(4)

Here, d(s_i) is the degree of node s_i.

However, not all nodes with large degree are the most important. The importance of a node is related to the structure of the network and the function of the node. For example, in a communication network, some nodes with small degree may nevertheless be hub points through which a large number of packets pass. Therefore, we can also use betweenness as a measure of node importance. The sociologist Linton Freeman [68] first proposed betweenness as a measure of both the load and the importance of a node; the former is a global property of the network, whereas the latter is a local effect. The betweenness centrality of a node v_i is given by the expression:

$$ B(v_{i})=\sum\limits_{s\ne i \ne t}{\frac{n^{i}_{st}}{g_{st}}} $$
(5)

where g_st is the total number of shortest paths from node s to node t, and \(n^{i}_{st}\) is the number of those paths that pass through v_i.

Based on the node betweenness centrality defined by (5), the fitness of a path is defined as the summation of the betweenness of the nodes on the path:

$$ Q(S)=\sum\limits_{i=1}^{n} {B(s_{i} )} $$
(6)

In addition to the importance of the nodes, the importance of the edges is also a factor in the quality measure of a path. One measure of edge importance is the edge clustering coefficient, which is defined in terms of the number of triangles containing the edge. For an edge e_ij = (s_i,s_j), its clustering coefficient is defined as:

$$ C(e_{ij} )=\frac{z_{ij} +1}{\min \left[ {\left({d_{i} -1} \right),(d_{j} -1)} \right]} $$
(7)

Here, z_ij is the number of triangles containing edge e_ij, and d_i and d_j are the degrees of nodes s_i and s_j, respectively. A larger clustering coefficient of an edge indicates a higher probability that its two endpoints are connected by a link. Let S = (s_1,s_2,s_3,...,s_n) be a path, where s_i ∈ V is the ith node in the path, and denote the ith edge on the path as e_{i,i+1} (i = 1, 2, ..., n−1). Based on the edge clustering coefficient defined by (7), the fitness of a path is defined as:

$$ Q(S)=\sum\limits_{i=1}^{i=n-1} {c(e_{i,i+1} )} $$
(8)

Another measurement of the importance of an edge is the edge betweenness. Similar to node betweenness, the betweenness of the edge is defined as:

$$ B(e_{ij} )=\sum\limits_{s\ne t} {\frac{n_{st}^{e} }{g_{st} }} $$
(9)

Here, \(n_{st}^{e} \) is the number of shortest paths from node s to node t that pass through the edge e_ij, and g_st is the total number of shortest paths from node s to node t. Edge betweenness measures an edge's role in communication: the greater the betweenness of an edge, the more important it is to the connectivity of the network.

Using the edge betweenness, the quality of a path is defined as:

$$ Q(S)=\sum\limits_{i=1}^{i=n-1} {B(e_{i,i+1} )} $$
(10)

We can choose any one of formulas (4), (6), (8), and (10) as the fitness measure for evaluating the paths in each iteration. In this paper, we use the node centrality based measure in our experiments; the specific form used is given in (14) below.
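As an illustration, the degree based fitness (4) and the edge clustering based fitness (8) can be computed directly from the adjacency matrix, as in the sketch below; the betweenness based variants (6) and (10) could be substituted analogously:

```python
import numpy as np

def fitness_degree(A, path):
    """Fitness (4): sum of the degrees of the nodes on the path."""
    degree = A.sum(axis=1)
    return sum(degree[s] for s in path)

def fitness_edge_clustering(A, path):
    """Fitness (8): sum of the edge clustering coefficients (7) along the path."""
    degree = A.sum(axis=1)
    common = A @ A                          # common-neighbor counts, i.e. z_ij for edge (i, j)
    total = 0.0
    for i, j in zip(path[:-1], path[1:]):   # consecutive node pairs e_{i,i+1}
        denom = min(degree[i] - 1, degree[j] - 1)
        if denom > 0:                       # skip degenerate denominators
            total += (common[i, j] + 1) / denom
    return total
```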

3.5 Pheromone updating

After each iteration, the algorithm updates the pheromone value on each edge according to the formulas as follows:

$$ \tau_{ij} (t+1)=\rho \cdot \tau_{ij} (t)+{\Delta} \tau_{ij} (t) $$
(11)

Here, ρ ∈ (0,1) is the pheromone persistence coefficient (1−ρ corresponds to the evaporation rate), and the total pheromone increment on edge (v_i,v_j) is

$$ {\Delta} \tau_{ij} (t)=\sum\limits_{k=1}^{m} {\Delta \tau_{ij}^{k} (t)} $$
(12)

and

$$ {\Delta} \tau_{ij}^{k} (t)=\left\{ \begin{array}{ll} Q(S(k)) & \text{if ant } k \text{ passes through edge } (v_{i},v_{j}) \\ 0 & \text{otherwise} \end{array} \right. $$
(13)

Obviously, the more ants select node v_j when at node v_i, the larger the pheromone increment on edge e_ij, and the higher the probability that the ants select this edge in the next iteration. This forms a positive feedback through the pheromone system. In our experiments, we set the fitness of path S as

$$ Q(S)=C\ast \frac{1}{n}\sum\limits_{i=1}^{n} {d(s_{i} )} , $$
(14)

where C is a positive constant and d(s_i) is the degree of node s_i.
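The update rules (11)-(14) might be realized as in the following sketch, where ρ is the pheromone persistence factor; its value, and the list-of-paths interface, are assumptions of the example:

```python
import numpy as np

def update_pheromone(tau, A, paths, rho=0.9, C=0.95):
    """Apply (11)-(14): evaporate pheromone, then deposit Q(S) on the edges of each path."""
    degree = A.sum(axis=1)
    delta = np.zeros_like(tau)
    for path in paths:                              # one path per ant
        Q = C * degree[list(path)].mean()           # fitness (14): C * (1/n) * sum of node degrees
        for i, j in zip(path[:-1], path[1:]):
            delta[i, j] += Q                        # each ant deposits Q(S) on every traversed edge
            delta[j, i] += Q                        # keep the pheromone matrix symmetric
    return rho * tau + delta                        # tau_ij(t+1) = rho * tau_ij(t) + delta_tau_ij(t)
```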

3.6 Termination conditions and outputs

The algorithm ceases the iterations according to a certain termination condition. We stop the iterations when the pheromone values on each edge obtained in adjacent iterations tend to stabilize. In addition, we also set up a threshold Nc, which is the maximum number of iterations. The iterations should be ended as well when the number of iterations goes beyond Nc.

Finally, the algorithm outputs the pheromone matrix as the score matrix; namely, the final score of node pair (v_i, v_j) is Score(i,j) = τ_ij. To evaluate the quality of the link prediction result, we rank all the non-existing links in decreasing order of their scores, and use AUC and precision to assess the performance of the algorithm.

3.7 Framework of the algorithm

The framework of our ant colony optimization based algorithm for link prediction ACO_LP is as follows.

Algorithm 1 ACO LP (ACO for Link Prediction)

Input:   A:   Adjacency matrix of the network;

            Nc:   The maximum number of iterations;

            ε      The threshold for the error of pheromone information;

Output:Score:  The score matrix;

Begin

1. t = 1;

2. Parameter initialization:

      set the initial values of pheromone matrix τ and

            heuristic matrix η according to (1) and (2);

3. Repeat

4.     For k = 1 to m do /* for the m ants*/

5.       Ant k randomly selects a starting node s_1;

6.       for i = 1 to n-1 do

7.            Ant k selects the next node according to (3);

8.       End for i

9.       Calculate the fitness of the path formed by ant k according to (14);

10.       End for k

11.       Update the pheromone values according to (11), and set t = t + 1;

12.    Until max_{1≤i,j≤n} |τ_ij(t+1) − τ_ij(t)| ≤ ε or t > Nc;

13.    Score = τ;

14.    Output the score matrix Score;

End

Line 2 of algorithm ACO_LP sets the initial values of the pheromone matrix τ and the heuristic matrix η. To calculate the heuristic information η, we need to count the common neighbors of all node pairs. Let n be the number of nodes in the network and k be the average degree. For each node v_i, it takes O(k^2) time to search for the common neighbors of v_i with the other nodes, so the time complexity of this step is O(nk^2).

In lines 3 to 12, the ants travel in the network to form the paths. The time complexity for an ant to choose its path in each iteration is O(n^2). Since there are m ants and N_c iterations, the total time is O(m n^2 N_c). Therefore, the overall time complexity of the algorithm is O(nk^2 + N_c m n^2). Since N_c and m are constants, the time complexity of the algorithm is O(n^2).
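For concreteness, the following compact Python sketch ties the pieces above together into a main loop resembling ACO_LP; the number of ants m, the iteration limit and the parameter values are assumptions of the example, not necessarily the settings used in the paper:

```python
import numpy as np

def aco_lp(A, m=20, Nc=100, tol=1e-3, alpha=0.8, beta=0.7, seed=0):
    """ACO_LP sketch: returns the pheromone matrix, which is used as the score matrix."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    tau, eta = initialize(A)                           # step 2: pheromone and heuristic matrices
    for t in range(Nc):                                # at most Nc iterations
        paths = []
        for _ in range(m):                             # each of the m ants builds a path of n nodes
            node = rng.integers(n)                     # random starting node s_1
            path = [node]
            for _ in range(n - 1):
                node = choose_next_node(tau, eta, node, alpha, beta, rng)
                path.append(node)
            paths.append(path)
        new_tau = update_pheromone(tau, A, paths)      # apply (11)-(14)
        if np.max(np.abs(new_tau - tau)) <= tol:       # stop when the pheromone stabilizes
            tau = new_tau
            break
        tau = new_tau
    return tau                                         # Score(i, j) = tau_ij
```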

4 Link prediction using node attributes

For networks with node attributes, we also use an undirected graph G = (V,E) to represent the network, where the edges in E represent interactions between the n nodes in V. In addition to the network structure, we have categorical attributes for the nodes. For instance, in the Google+ social network, nodes are users and edges represent friendship or some other relationship, while node attributes are derived from user profile information and include fields such as employer, school, and hometown. In this work, we restrict our focus to categorical variables, since other types of variables, e.g., live chats, e-mail messages, real-valued variables, etc., can be clustered into categorical variables via vector quantization, or directly discretized into categorical variables. We use a binary representation for each categorical attribute; for example, different employers can be treated as separate binary attributes. Hence, for a specific social network, the number of distinct attributes m is finite, though it may be very large. The attributes of a node v_i are then represented as an m-dimensional binary column vector b_i = (b_i1,b_i2,...,b_im). The jth entry of b_i is defined as

$$b_{ij} =\left\{ \begin{array}{ll} 1 & \text{if } v_{i} \text{ has the } j\text{th attribute} \\ 0 & \text{otherwise} \end{array} \right. $$

We denote the n×m attribute matrix for all nodes as B = [b_ij].
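For illustration, the binary attribute matrix B can be built from per-node attribute lists as in the following sketch; the dictionary input format and the example profile values are assumptions:

```python
import numpy as np

def build_attribute_matrix(node_attrs, nodes):
    """Build the n-by-m binary matrix B with b_ij = 1 iff node i has attribute j."""
    attributes = sorted({a for attrs in node_attrs.values() for a in attrs})
    index = {a: j for j, a in enumerate(attributes)}    # column index of each distinct attribute
    B = np.zeros((len(nodes), len(attributes)), dtype=int)
    for i, v in enumerate(nodes):
        for a in node_attrs.get(v, []):
            B[i, index[a]] = 1
    return B, attributes

# e.g. node_attrs = {"alice": ["Google", "MIT"], "bob": ["MIT"]}; nodes = ["alice", "bob"]
```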

Given a network G with m distinct categorical attributes and an attribute matrix B, we create an augmented graph G_A by adding m additional nodes to its logical graph G', with each additional node corresponding to an attribute. For each node v in G' that has attribute a, we create an undirected link between v and a in the augmented graph G_A. The augmented graph G_A thus includes both the original network interactions and the relations between nodes and their attributes. Artificial ants randomly walk on the augmented graph G_A instead of on the logical graph G'.

There are two types of nodes in the augmented graph G_A = (V_A, E_A), namely the item node set V_p and the attribute node set V_a, where V_A = V_p ∪ V_a. The edge set E_A of G_A also consists of two types of edges: E_A = E_p ∪ E_a, where E_p is the set of edges between item nodes and E_a is the set of edges between item nodes and attribute nodes. There is no edge between two attribute nodes. Let |V_p|=n and |V_a|=m; then G_A is represented by an (n+m)×(n+m) adjacency matrix.

For each edge (v i ,v j ) in set E p , which is the set of edges between the item nodes, we set the initial pheromone value on the edge (v i ,v j ) as τ ij = λ∗(a ij +ε). Here, λ and ε are positive constants. For an edge (v i ,v j ) in set E a , which is the set of edges between the item and attribute nodes, we set the initial value of its pheromone as τ ij = λ∗(1+ε).

It is obvious that the edges which connect an item and its attributes will have higher initial pheromone value. Such pheromone information will guide the ants to walk through the paths between the nodes of items with the identical attributes.

We set the value of heuristic information on the edge between two item nodes (v i ,v j ) as η ij = γ∗|Γ(i,j)|. Here, parameter γ is a positive constant.

We note that if an attribute is shared by fewer items, those items are more similar and more likely to be linked. Therefore, an edge connecting to an attribute node of lower degree should have a higher heuristic value. For each edge (v_i,v_j) in set E_a, where v_i is an item node and v_j is an attribute node, we set its heuristic value as η_ij = μ/d_j. Here, the parameter μ is a positive constant and d_j is the degree of v_j. Such heuristic information directs the ants to walk between item nodes through their common low-degree attribute nodes.

Each ant travels on the augmented graph G_A to visit n+m nodes and forms a path. In each iteration of an ant's walk, some nodes may be selected multiple times, and some nodes may not be selected at all. In the random walk, an ant at each node chooses the next edge (v_i,v_j) according to the probability p_ij defined in (3). Edges in E_p with a higher tendency to carry a link, and edges in E_a connected to low-degree attribute nodes, are assigned larger probabilities p_ij, so the ants are more likely to pass through them. After the ants finish a round of walks and form their paths, the algorithm evaluates the fitness of the paths according to (14). The fitness scores are then used to update the pheromone on the edges of the paths according to (11). Finally, the pheromone τ_ij on each edge (v_i,v_j) in E_p is used as the similarity score between the two item nodes. In addition, the pheromone τ_ij on each edge (v_i,v_j) in E_a is used to detect potential attributes of item node v_i: if the pheromone on an edge connecting an item node v_i with an attribute node v_j is greater than a threshold, node v_i probably has the attribute represented by node v_j.
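This extension can be sketched as follows: build the (n+m)×(n+m) augmented adjacency matrix, run the same ACO procedure on it, and read attribute predictions from the item-attribute block of the pheromone matrix. The sketch below omits the μ/d_j heuristic for attribute edges for brevity (a small modification of the initialization would be needed for that), and the pheromone threshold is an assumed value:

```python
import numpy as np

def augmented_adjacency(A, B):
    """Build the (n+m) x (n+m) adjacency matrix of G_A from item links A and attribute matrix B."""
    n, m = B.shape
    top = np.hstack([A, B])                      # item-item links and item-attribute links
    bottom = np.hstack([B.T, np.zeros((m, m))])  # no edges between attribute nodes
    return np.vstack([top, bottom])

def infer_attributes(tau, n, threshold):
    """Predict attribute j for item i whenever the pheromone tau_ij exceeds the threshold."""
    item_attr_pheromone = tau[:n, n:]            # block of the pheromone matrix for edges in E_a
    return [(i, j) for i, j in zip(*np.where(item_attr_pheromone > threshold))]

# usage: A_aug = augmented_adjacency(A, B); tau = aco_lp(A_aug)
#        predicted = infer_attributes(tau, A.shape[0], threshold=0.5)
```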

5 Experimental results

In this section, we empirically demonstrate the effectiveness of our proposed algorithm ACO_LP on real world networks. We also compare its performance against traditional similarity based link prediction algorithms, namely CN, Salton, Jaccard, Sorensen, HPI, HDI, LHN-I, PA, LP and Katz. We focus on the accuracy of the results and on the algorithms' computing time. All experiments were conducted on the Microsoft Windows 7 operating system, and the results were visualized with Matlab 6.0. Based on our experience with ACO applications, we set the parameters α = 0.8, β = 0.7, ε = 0.01, and C = 0.95 in our experiments.

5.1 Data Sets

In this paper we consider six benchmark data sets [69] representing networks drawn from disparate fields: a protein-protein interaction network (PPI), a coauthorship network of scientists (NS), the electrical power grid of the western US (Grid), a network of US political blogs (PB), the Internet (INT), and the US airport network (USAir). For each dataset, we test on its largest connected component. Table 1 summarizes the topological features of the largest components of these networks. In the table, N and M are the total numbers of nodes and links, respectively. NUM_C gives the number of connected components in the network and the size of the largest one; for example, 1222/2 means that the network has 2 connected components and the largest one contains 1222 nodes. In the table, e is the efficiency of the network [70], C and r are the clustering coefficient [71] and the assortativity coefficient [72], respectively, and K is the average degree of the network.

Table 1 Topological features of the giant components in the six networks tested

5.2 Test on quality of the results

First, we test the accuracy of the results of the algorithms using the AUC score and precision as measurements. To evaluate the accuracy, random 10-fold cross validation (CV) is used: the observed links are randomly partitioned into 10 subsets; one subset is retained as the validation data for testing the algorithms, and the remaining 9 subsets are used as training data. The cross-validation process is repeated 10 times, and the 10 results are averaged to produce a single estimate. We calculated the standard deviation of the results on each data set and found that all of the standard deviations are less than 0.024. Table 2 presents the average AUC scores of the different algorithms over the 10-fold CV tests. In the table, the highest AUC score for each data set among the 11 algorithms is emphasized in bold-face.

Table 2 Comparison of the algorithms’ accuracy quantified by AUC

As shown in Table 2, among all 11 algorithms ACO_LP has the highest AUC scores on all the datasets. Even for the most difficult data set, Grid, ACO_LP achieves the highest AUC score of 0.9985, while the other algorithms obtain AUC scores from 0.4677 to 0.6375. Comparing Tables 1 and 2, we find that the AUC scores are roughly proportional to the clustering coefficients of the data sets: an algorithm tends to obtain better results on data sets with larger clustering coefficients. Because our algorithm assigns an initial pheromone value to the logical edges that have no link in the real network, it increases the diversity of the ants' search; therefore, it still achieves good results on networks with low clustering coefficients. This shows that ACO_LP achieves high quality results with strong robustness.

Based on Table 2, we use the Wilcoxon signed-rank test to show that the AUC scores of the results by ACO_LP are statistically different from those of the other methods. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when two related or matched samples are compared; it makes no assumptions about the distribution of the data. We calculate the W-values of the AUC scores of ACO_LP against those of the other methods; the results are shown in Table 3. We set the confidence level α = 0.05 with the number of samples n = 6. From the table of critical W values, W(0.05, 6) = 2. Since all the W values are greater than W(0.05, 6), there are significant differences between the AUC scores of ACO_LP and those of the other ten methods. Therefore, the quality of the results of our algorithm ACO_LP is clearly higher than that of the other methods.

Table 3 W-values of AUC scores by ACO_LP with those by other methods

Next, we test and compare the precisions of the algorithms on the different datasets; the results are shown in Table 4.

Table 4 Comparison of the algorithms’ accuracy quantified by precision

As shown in Table 4, the ACO_LP algorithm also has higher precision than all the other algorithms. The experimental results show that ACO_LP performs significantly better than all the other algorithms on all the data sets. The reason ACO_LP achieves high quality results is that it uses both pheromone and heuristic information, which reflect both the local and the global structure of the network. Through the ants' tours on the network, global topological information is transferred to and accumulated in the pheromone on each edge. Therefore, the final pheromone is a more accurate similarity score between nodes than similarity measures based only on local information.

5.3 Test on the time requirement by the algorithms

Computational complexity is another important concern in the design of link prediction algorithms. In our experiments, we also compared the running time of the ACO_LP algorithm with that of the other algorithms; the results are depicted in Fig. 2. We can see from Fig. 2 that the computational time required by ACO_LP is much less than that of the algorithms CN, Salton, Jaccard, Sorensen, HPI, HDI, LHN-I and PA, and slightly more than that of LP and Katz.

Fig. 2 Running time of the algorithms

Let n be the number of nodes in the network and k be the average degree of the nodes. The time complexity of the ACO_LP algorithm is O(n^2). In the CN algorithm, for each node v_i, it takes O(k^2) time to search for the common neighbors of v_i with the other nodes, so the time complexity of CN is O(nk^2). Similarly, the other common neighbor based algorithms, such as Salton, Jaccard, Sorensen, HPI, HDI, LHN-I and PA, have the same time complexity of O(nk^2). Although the common neighbor based algorithms have lower time complexity than ACO_LP, their large hidden constants make them much slower than ACO_LP in practice. All the experimental results show that ACO_LP can achieve high quality prediction results in less computational time.

5.4 Test on networks with node attributes

We also test our extended method on networks with node attributes. We test on eight data sets [73] representing networks drawn from the Digital Bibliography Library Project (DBLP), a computer science bibliography website hosted at Universität Trier, Germany. It has existed at least since the 1980s and lists more than 2.3 million articles on computer science; all important computer science journals and the proceedings of many conferences are tracked, so it constitutes a major database of publications in computer science. In our experiment, we test on eight datasets: Computer Science Conference (ACM), Applications of Natural Language to Data Bases (NLDB), Complex, Intelligent and Software Intensive Systems (CISIS), International Conference on Information and Communication Security (ICICS), International Conference on Machine Learning (ICML), International World Wide Web Conferences (WWW), Computer Analysis of Images and Patterns (CAIP) and International Conference on Artificial Neural Networks (ICANN). We test on part of the authors in each dataset; for instance, in dataset ACM, we only consider the authors of ACM conference papers from 1986 to 1996. For each dataset, we construct a network where each node represents an author, and co-authorship between two authors is mapped to a link between their corresponding nodes. For each author, we obtain the entire publication history, and the terms in the paper titles are considered as the attributes of the corresponding author. Since the networks of some datasets are not connected, we test only on the largest component. Table 5 summarizes the topological features of the largest components of these networks. In the table, N and M are the total numbers of nodes and links, respectively. NUM_C gives the number of connected components in the network and the size of the largest one; for example, 1995/897 means that the network has 897 connected components and the largest one contains 1995 nodes. In the table, e is the efficiency of the network [70], C and r are the clustering coefficient [71] and the assortativity coefficient [72], respectively, and K is the average degree of the network.

Table 5 Topological features of the giant components in the eight networks tested


On these eight datasets, we test the accuracy and the computing time of our algorithm, and compare its performance with the link prediction algorithms CN, Salton, Jaccard, Sorensen, HPI, HDI, LHN-I, PA, LP and Katz.

First, we test the accuracy of the results by the algorithms using AUC score and precision as measurements. To evaluate the accuracy of the results, a random 10-fold cross validation is used. Table 6 presents the average AUC scores on 10-fold CV tests by different algorithms. In the table, the highest AUC scores for each data set by the 11 algorithms are emphasized in bold-face.

Table 6 Comparison of the algorithms’ accuracy quantified by AUC

We can see from Table 6 that among all 11 algorithms, ACO_LP has the highest AUC scores on all the datasets. For instance, for dataset ACM, ACO_LP achieves the highest AUC score of 0.9455, while the other algorithms obtain AUC scores below 0.8635. This shows that ACO_LP achieves high quality results with strong robustness.

Based on Table 6, we again use the Wilcoxon signed-rank test to show that the AUC scores of the results by ACO_LP are statistically different from those of the other methods. We calculate the W-values of the AUC scores of ACO_LP against those of the other methods; the results are shown in Table 7. We again set the confidence level α = 0.05, with the number of samples n = 8. From the table of critical W values, W(0.05, 8) = 6. The W values of the AUC scores of ACO_LP against all methods except Jaccard and PA are greater than W(0.05, 8), so there are significant differences between the AUC scores of ACO_LP and those of the other eight methods. Therefore, the quality of the results of ACO_LP is clearly higher than that of those eight methods.

Table 7 W-values of AUC scores by ACO_LP with those by other methods

Next, we test and compare the precisions of the algorithms on the different datasets; the results are shown in Table 8.

Table 8 Comparison of the algorithms’ accuracy quantified by precision

As shown in Table 8, ACO_LP has higher precision than all the other algorithms on the six datasets other than CISIS and ICICS. For CISIS and ICICS, the precisions of ACO_LP are very close to the highest ones.

We also compared the running time of ACO_LP with that of the other algorithms; the results are depicted in Fig. 3. We can see from Fig. 3 that the computational time required by ACO_LP is less than or very close to that of the other algorithms on most of the datasets. For datasets WWW and ICANN, however, ACO_LP consumes more computation time than some of the other methods. Since these two datasets have a large number of attributes, many additional nodes are created in the augmented graph, which consumes more computation time. Because ACO_LP also discovers the potential attributes of the items, this extra computation time is the cost of attribute detection for the item nodes. To reduce the time cost for datasets with a large number of attributes, it is necessary to eliminate nonessential attributes by dimension reduction in data preprocessing.

Fig. 3 Running time of the algorithms

6 Conclusions

With the large amount of network data available in electronic form today, link prediction has become a popular subarea of data mining. From the perspective of swarm intelligence, a new link prediction method based on ant colony optimization is proposed in this paper. In the algorithm, artificial ants randomly walk on the logical graph, and each ant chooses its path according to the pheromone and heuristic information on the edges. The initial pheromone value on each edge of the logical graph is set according to the link connections of the network, and the initial heuristic value on each edge is set according to the common neighbors of the two nodes it connects.

The paths obtained by the ants' walks are evaluated, and the pheromone on each edge is updated according to the quality of the paths containing it. Finally, the pheromone on each edge is used as the final similarity score of the corresponding node pair. Empirical results show that our algorithm achieves higher quality link prediction results in less computation time than other algorithms. We also extend our method to the link prediction problem in networks with node attributes. We expand the logical graph by adding attribute nodes, and apply the ant colony optimization algorithm to this augmented graph to perform link prediction using both topological information and node attributes. Our experimental results show that the extended method can also accurately detect missing or incomplete attributes of the data.

There are two reasons for ACO_LP achieving high quality results. One is that it uses both pheromone and heuristic information, reflecting both the local and the global structure of the network. The other is that ACO_LP considers both attribute and structure information; our experimental results show that it obtains higher quality results on networks with node attributes.

When a dataset has a large number of attributes, our algorithm ACO_LP creates many additional nodes in the augmented graph and therefore consumes more computation time. Developing an efficient way to eliminate nonessential attributes and thereby reduce this time cost is left for future work.