1 Introduction

In online social networks (OSNs), nodes are organized into communities, where a community represents a group of nodes having similar characteristics, such as similar interests, opinions, or beliefs [1, 2]. The links between nodes belonging to the same community are referred to as intra-community links, and the links between nodes belonging to different communities are referred to as inter-community links. In social networks, intra-community links are driven by the effect of homophily [3], as similar nodes prefer to connect with each other. The formation of inter-community links is still not well explored in the literature; however, it can be explained by different complex phenomena, such as triadic closure and weak ties [4]. In real-world networks, it is observed that the number of intra-community links is larger than the number of inter-community links [5]. The evolution of social networks is regulated by the formation of new links in the network.

In OSNs, probable but not yet existing links are recommended as promising connections to help users make new friends, and a user having more friends will be more loyal towards the website [6, 7]. However, forming the right kind of links is very important, as the opinion of a user is highly influenced by the opinions of its neighbors [8]. Recently, researchers have focused on increasing the diversity in the network so that users receive information on a topic from different viewpoints before forming their opinion [9]. It is crucial that a user receives information from other users having different perspectives to mitigate the negative impact of fake propaganda, false information, or fake news spreading on the network [10]. Hence, a user should have a diverse neighborhood with connections to different communities. In social networks, more inter-community links should be promoted to increase diversity. The link recommendation system plays an important role in forming new links and shaping the network evolution. Besides this, an improved link prediction method can also be used for anomaly detection by better identifying suspicious links among newly formed intra- and inter-community links.

Initially, researchers proposed link prediction methods based on the similarity of the nodes [11]. These methods compute the similarity of a pair of nodes based on network structure, and more similar nodes are more likely to form a link. These methods are also often referred to as classic or heuristic link prediction methods. The well known classic methods include Jaccard coefficient [12], Adamic Adar index [13], resource allocation index [11], preferential attachment index [14], and so on. These methods were extended to include community structure to improve the link prediction accuracy; however, most of the methods improved the total accuracy by improving intra-community link prediction accuracy [15, 16].

In recent works, network characteristics have been studied using network embedding where the network is represented in a low dimensional latent space [17]. In network embedding techniques, the aim is to embed similar nodes closer to each other. Most of the existing network embedding methods [17–19] focus on embedding the nodes closely if they belong to the same community and therefore have high accuracy for the node classification task and intra-community link prediction.

In our work, we propose a network embedding method, called NodeSim embedding, which considers both the nodes’ similarity and their community information while generating the network embedding. In the learned embedding, the nodes belonging to the same community will be embedded closely, and the nodes belonging to different communities will be embedded closer based on their neighborhood similarity. Therefore, the generated embedding preserves the structural properties of the network and is efficient in predicting diverse promising links. Next, we propose a link prediction method that trains a logistic regression model using node pair embedding and their community information to predict both the inter and intra-community links with high accuracy. This is the first work that uses community information for learning the link prediction model and achieves higher accuracy for both types of links. The experiments are performed to show the accuracy and efficiency of the proposed method on real-world networks. The results show that the proposed method outperforms the state-of-the-art methods on all the datasets. We further show the application of the proposed method in anomalous link detection, and the NodeSim embedding provides the best results compared to the baseline methods on medium to large-size networks.

The paper is structured as follows. In Sect. 2, we discuss the state of the art literature on link prediction by focusing on network embedding techniques. In Sect. 3, we discuss the proposed methods, including (i) NodeSim network embedding method and (ii) link prediction method. In Sect. 4, we discuss experimental results on real-world networks, including the performance, sensitivity, scalability, robustness analysis, and application of the proposed method. The paper is concluded in Sect. 5 with future directions.

2 Related work

Link prediction is a well-known problem in network science and has been applied to predict missing links in different types of networks, such as friendship networks, collaboration networks, and chemical networks. Initially, researchers proposed heuristic methods that only considered the neighborhood information of the nodes for link prediction and did not consider the network topology. These heuristic methods were later extended to also consider network structure properties, such as community structure, when predicting links [15, 16, 20, 21]. However, most of these methods improved the overall accuracy of link prediction by improving the accuracy of intra-community link prediction. The main benefit of heuristic methods is that they do not need any training and are comparatively faster.

Another class of link prediction methods uses machine learning models, such as probabilistic graphical models [22, 23], matrix factorization [24, 25], supervised learning methods [26, 27], and semi-supervised learning methods [28, 29]. These machine learning methods provide good accuracy, though they suffer from the class imbalance problem, as the number of existing links in a network is significantly smaller than the number of non-existing links.

In recent years, network embedding techniques have been used to study networks and to propose solutions for various network analysis problems. The network embedding methods can be categorized into three categories based on the structural proximity considered while generating the embedding: (i) microscopic structure embedding, which considers local proximity of nodes, such as first-order [30, 31], second-order [30] or high-order proximity [17, 18, 32], (ii) mesoscopic structure embedding, which captures hierarchical and community structural proximity [33–35], and (iii) network properties preserved embedding, which captures global network properties, such as network transitivity or structural balance [36, 37].

In the existing mesoscopic network embedding, the main focus has been either on the hierarchical embedding where the users belonging to the same hierarchy should be embedded together [33] or on the intra-community proximity where the nodes belonging to one community should be embedded closely [34, 35]. In hierarchical or structural role proximity, the nodes playing the same structural roles are embedded closely; for example, the nodes having a similar degree or similar influential power should be embedded closer [38–43]. In this work, we propose the NodeSim network embedding method, which considers both (i) high-order proximity by the similarity of the nodes and (ii) mesoscopic structure by the network communities while generating the embedding. In NodeSim embedding, the nodes belonging to one community are clustered together, and the similar nodes belonging to different communities are embedded closer. The proposed embedding captures a richer, more diverse neighborhood of the nodes, which is further verified using link prediction.

3 The proposed method

In this section, we first discuss the required network properties for our work. Next, we discuss our proposed NodeSim embedding method to learn the feature representation of the nodes and the proposed link prediction method.

3.1 Community structure

In real-world complex networks, nodes connect with each other if they have similar properties. A group of nodes that are densely connected with each other is referred to as a community [44]. The community label of a node u is denoted by \(C_{u}\). If both end nodes of a link \((u,v)\) belong to the same community, it is referred to as an intra-community link, and \(C_{(u,v)} =1\) for an intra-community link. If the end nodes belong to different communities, then the link \((u,v)\) is referred to as an inter-community link and \(C_{(u,v)} =0\).

In most real-world networks, the ground truth community information is not available. In the literature, several community detection methods have been proposed to identify communities from the network structure. In this work, we apply a widely used community detection method, known as the Louvain method [45], to identify the communities if the ground truth information is not known.

Louvain Community Detection Method: The Louvain method [45] uses two-step greedy optimization to optimize the modularity of a community partition of the network. First, the method optimizes the modularity locally to find small communities. In the second step, it merges all nodes belonging to the same community and creates an aggregated network where each node represents a community. These steps are performed iteratively until we achieve the maximum modularity and the obtained communities are returned.
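The quantity that the Louvain method optimizes can be made concrete with a small sketch. The function below computes the standard Newman modularity of a given partition for an undirected, unweighted graph; it does not implement the Louvain search itself, and the names `modularity`, `edges`, and `community` are illustrative assumptions rather than part of the original implementation.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    Q = sum over communities c of [L_c / m - (d_c / (2m))^2], where L_c is the
    number of edges inside c, d_c the total degree of c, and m the edge count.
    """
    m = len(edges)
    intra = {}   # community -> number of edges with both ends inside it
    degree = {}  # community -> sum of degrees of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())
```

For two triangles joined by a single edge, placing each triangle in its own community gives a positive modularity, while placing all nodes in one community gives zero.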

3.2 Node-pair similarity

In a network, two nodes connect with each other if they have some common interest or characteristics, and therefore, a link between a pair of nodes is the first indication that they are similar. However, these binary/unweighted connections cannot capture the complete information of the system as each connection is not equally important. A better way of representing the network is with weighted edges, where edge-weight denotes the strength of the connection. For example, in a friendship network, the weight of an edge can be computed based on the intimacy of the relationship or frequency of the communication [46]. The similarity of a node pair \((u,v)\) is denoted as \(\operatorname{Sim}(u,v)\).

In most real-world networks, the edge-weight data is not available, as it is not feasible to collect all the information required for computing the strength of each connection. In network science, several methods have been proposed to compute the similarity of a node pair based on their neighborhood connectivity in the network structure. Some of the well-known methods are the number of common neighbors [47], Jaccard coefficient [12], Adamic Adar [13], Resource Allocation [11], and so on, which compute a node-pair similarity based on their local-neighborhood proximity.

In this work, we will use the Jaccard coefficient to compute a node pair’s similarity in unweighted networks. The Jaccard coefficient for a node pair \((u,v)\) is defined as, \(\operatorname{JC}(u,v)= \frac{|\Gamma (u) \cap \Gamma (v)|}{|\Gamma (u) \cup \Gamma (v)|}\), where \(\Gamma (u)\) is the set of neighbors of node u.
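As a minimal sketch of this definition, the function below computes the Jaccard coefficient from an adjacency structure. The representation `adj` (a dictionary from node to its neighbor set) and the toy graph are assumptions made for illustration, as is the convention of returning 0 when both neighborhoods are empty.

```python
def jaccard(adj, u, v):
    """Jaccard coefficient JC(u, v) = |N(u) & N(v)| / |N(u) | N(v)|.

    adj maps each node to the set of its neighbors.
    """
    union = adj[u] | adj[v]
    if not union:
        return 0.0  # assumed convention for two isolated nodes
    return len(adj[u] & adj[v]) / len(union)


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to b.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
print(jaccard(adj, "a", "b"))  # common neighbor {c}, union {a, b, c, d}
```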

3.3 NodeSim network embedding

For a given graph \(G(V, E)\), the network embedding method learns the mapping \(\Phi : V \rightarrow \mathbb{R}^{d}\), where d is the dimension of the embedding space. In recent works, the Skip-gram model has been used to generate the network embedding by representing the network as a document where the nodes are corresponding to the words [18]. In a network, a sampled sequence of nodes is considered the same as an ordered sequence of words in a document. The simplest way to generate the ordered sequence of nodes is by using random walks.

In the random walk [48], if the random walker is at node u, the probability that the random walker will move to node v is defined as,

$$ P_{uv}= \textstyle\begin{cases} 1/\deg (u), & \text{if } (u,v) \in E, \\ 0, & \text{otherwise}. \end{cases} $$

The random walk method does not consider the network structure properties while sampling the nodes. In recent works, different sampling methodologies have been explored to sample the network to learn feature representations of the network [17, 49]. However, the proposed methods do not consider the meso-scale properties, such as community structure, while exploring the network. In this work, we propose a random walk based sampling method, called NodeSim Random Walk, that captures the neighborhood of the node by considering both the nodes’ similarity as well as the meso-scale community structure of the network.

3.3.1 NodeSim random walk

In network embedding, the focus is to embed similar nodes closer. The simplest way to capture the node similarity during the random walk would be to bias the edge probability based on the similarity of its end nodes. However, this will ignore the meso-scale property of the network that is captured through the community structure. In NodeSim random walk, the edge-probabilities are assigned based on both the similarity of the nodes and community structure.

In NodeSim Random walk, the unnormalized probability \(p_{uv}\) to move from node u to node v is defined as,

$$ p_{uv}= \textstyle\begin{cases} \alpha \cdot (\operatorname{Sim}(u,v) +1/\deg (u)), & \text{if } (u,v) \in E \text{ and } C_{(u,v)} = 1, \\ \beta \cdot (\operatorname{Sim}(u,v) +1/\deg (u)), & \text{if } (u,v) \in E \text{ and } C_{(u,v)} = 0, \\ 0, & \text{otherwise}. \end{cases} $$
(1)

The probabilities are normalized for each node u with respect to all of its neighbors. So, the probability to move from node u to node v is computed as, \(P_{uv}= p_{uv} \cdot w_{u}\) where \(w_{u}\) is the normalizing factor for node u.
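Equation (1) and the normalization step can be sketched as follows, assuming Jaccard similarity for \(\operatorname{Sim}(u,v)\), positive values of α and β, and a dictionary-based graph. The names `nodesim_transition_probs`, `adj`, and `community` are hypothetical and are not taken from the released code.

```python
def jaccard(adj, u, v):
    """Jaccard coefficient of u and v; adj maps node -> set of neighbors."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def nodesim_transition_probs(adj, community, u, alpha, beta):
    """Normalized probabilities P_uv for all neighbors v of u (Equation (1)).

    community maps node -> community label; alpha weights intra-community
    edges and beta weights inter-community edges.
    """
    unnormalized = {}
    for v in adj[u]:
        base = jaccard(adj, u, v) + 1.0 / len(adj[u])
        factor = alpha if community[u] == community[v] else beta
        unnormalized[v] = factor * base
    total = sum(unnormalized.values())
    if total == 0:
        return {}  # isolated node, or alpha = beta = 0
    # Multiplying by w_u = 1 / total makes the probabilities sum to one.
    return {v: p / total for v, p in unnormalized.items()}
```

With α = 1 and β = 2, an inter-community neighbor receives twice the weight of an intra-community neighbor that has the same similarity to u.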

In this work, the similarity of the nodes is computed using the Jaccard Coefficient. Figure 1 explains edge-probabilities for NodeSim random walk, where the network has two communities shown by red and blue nodes, and the edges \((u,v)\) and \((u,w)\) are inter and intra-community edges, respectively, which are labeled with \(p_{uv}\) and \(p_{uw}\), respectively.

Figure 1 NodeSim Random Walk probabilities for inter and intra community nodes

Intuitively, parameters α and β control how the random walker explores the neighborhood. A higher value of α shows that the walker will prefer to sample more similar nodes from the same community, and a higher value of β shows that the walker will put a higher weight to explore the inter-community neighborhood of the node.

3.3.2 Learn embedding

Once the ordered sequences of nodes are generated using NodeSim random walk, the network embedding is learned using the Skip-gram model [50]. The network embedding method learns a mapping for each node \(u \in V\) to a d-dimensional embedding space, giving the d-dimensional feature representation of node u based on its structural role. The network embedding is denoted as \(\Phi : V \longrightarrow \mathbb{R}^{d}\), where Φ can be considered a \(|V| \times d\) size matrix that is learned by solving a maximum likelihood optimization problem.

In the skip-gram model, given the corpus, the neighborhood of a word is defined using a sliding window over the consecutive words. In networks, we generate the ordered sequence of nodes using sampling methods. For example, if NodeSim random walker visits the following nodes \(\{u_{1}, u_{2}, \ldots u_{i}, \ldots u_{l}\}\), they will be referred to as an ordered sequence of nodes. In our method, we generate ordered sequences of nodes by taking γ NodeSim walks of length l from each node. The neighborhood of a node \(u_{i}\) will be defined by considering \(k-1\) nodes visited before and after node \(u_{i}\) during the sampling, where k is the window size or context of the node. For every node \(u_{i} \in V\), \(N_{NS}(u_{i}) \subset V\) denotes the neighborhood of node \(u_{i}\) in the network that is generated through the NodeSim sampling method with the given context k.
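The sampling and context construction described above can be sketched as follows. The code assumes the normalized transition probabilities are already available as a nested dictionary `probs[u][v]`, and it leaves the number of context positions on each side as a parameter `window` (under the definition in the text this would be \(k-1\)). All function names are illustrative.

```python
import random


def nodesim_walk(probs, start, length, rng):
    """Sample one walk of `length` nodes; probs[u] maps neighbor v -> P_uv."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = list(probs[current])
        if not neighbors:
            break  # dead end: node without neighbors
        weights = [probs[current][v] for v in neighbors]
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(probs, num_walks, length, seed=0):
    """Take `num_walks` (gamma) walks of the given length from every node."""
    rng = random.Random(seed)
    return [nodesim_walk(probs, u, length, rng)
            for _ in range(num_walks) for u in probs]


def context_pairs(walk, window):
    """(center, context) pairs using `window` positions on each side."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs
```

The resulting walks play the role of sentences, and the pairs returned by `context_pairs` are the co-occurrences that a Skip-gram implementation would consume.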

In the skip-gram model, the network embedding is learned based on the likelihood of a node \(u_{i}\) co-occurring with other neighborhood nodes within the context k in the NodeSim random walk. We, therefore, optimize the following optimization function that aims for maximizing the probability of observing a node in the neighborhood of node \(u_{i}\), given its feature representation \(\Phi (u_{i})\),

$$ \underset{\Phi }{\text{maximize}} \sum _{u_{i} \in V} \log\Pr \bigl( N_{NS}(u_{i}) | \Phi (u_{i}) \bigr). $$
(2)

The optimization problem is solved using two assumptions. The first assumption is conditional independence, that the probability of observing a node in the neighborhood of the source node is independent of observing any other node in its neighborhood given the feature representation of the source node, so,

$$ \Pr \bigl( N_{NS}(u_{i}) | \Phi (u_{i}) \bigr)= \Pi _{u_{j} \in N_{NS}(u_{i})}\Pr \bigl(u_{j}| \Phi (u_{i}) \bigr). $$
(3)

The second assumption is the symmetry that considers the pairwise similarity of a source node and its neighborhood node in the feature space. Therefore, we estimate the probability of a node \(u_{j}\) co-occurring with node \(u_{i}\) using the softmax function,

$$ \Pr \bigl(u_{j} |\Phi (u_{i}) \bigr) = \frac{\exp (\Phi (u_{j}) \cdot \Phi (u_{i}))}{\sum_{v \in V} \exp (\Phi (v) \cdot \Phi (u_{i})) }. $$
(4)

Finally, using both assumptions, the objective function given in Equation (2) is computed as,

$$ \underset{\Phi }{\text{maximize}} \sum _{u_{i} \in V} \biggl( -\log Z_{u_{i}} + \sum _{u_{j} \in N_{NS}(u_{i})}\Phi (u_{j}) \cdot \Phi (u_{i}) \biggr), $$
(5)

where \(Z_{u_{i}} = \sum_{v \in V} \exp (\Phi (u_{i}) \cdot \Phi (v))\) is expensive to compute for large-scale networks, so it is approximated using the negative sampling method [51]. Equation (5) is optimized using stochastic gradient ascent (SGA) over the features Φ [17].
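As a worked step, using only Equations (2)–(4), the passage from Equation (2) to Equation (5) can be sketched for a single node \(u_{i}\). Taking the logarithm of Equation (3) and substituting Equation (4) gives,

$$ \log\Pr \bigl( N_{NS}(u_{i}) | \Phi (u_{i}) \bigr) = \sum _{u_{j} \in N_{NS}(u_{i})} \bigl( \Phi (u_{j}) \cdot \Phi (u_{i}) - \log Z_{u_{i}} \bigr) = - \bigl\vert N_{NS}(u_{i}) \bigr\vert \log Z_{u_{i}} + \sum _{u_{j} \in N_{NS}(u_{i})}\Phi (u_{j}) \cdot \Phi (u_{i}). $$

Summing over all \(u_{i} \in V\) yields the form of Equation (5). In this sketch the \(\log Z_{u_{i}}\) term carries the explicit factor \(|N_{NS}(u_{i})|\), whereas Equation (5) writes it as a single \(-\log Z_{u_{i}}\); the two expressions differ only in that per-node weight.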

3.3.3 Complexity

The complexity of the proposed network embedding method depends on two major steps, (i) identify the communities and (ii) NodeSim embedding learned using the Skip-gram model. The complexity of the community detection method and Skip-gram model is well defined in the literature, so we briefly discuss the complexity of our method. In our implementation, we have used the Louvain community detection method having complexity \(O(n \cdot \log n)\) where n is the total number of nodes in the network. Once the community structure is identified, the complexity to generate the probability distribution for NodeSim random walk is \(O(m)\) where m is the total number of edges in the network. The complexity for learning embedding using the skip-gram model is \(O(nkl \gamma (d+ d \log (n)))\), where d denotes the number of dimensions, l denotes the walk length, k denotes the window size, and γ denotes the number of random walks. So, the overall complexity is \(O(n\log n + m + nkl \gamma (d+ d \log (n)))\).

3.4 Link-prediction method

The link prediction method first generates the feature representation of the given node pairs and then trains a logistic regression model using the feature representation of the node pairs and their community information.

3.4.1 Feature representation of node pair

The feature representation of a pair of nodes \((u,v)\) is generated by applying a binary operator on the feature representations of nodes u and v. The most common operators are mentioned below.

  1. Average: \(e_{i}(u,v)=\frac{\Phi _{i}(u) + \Phi _{i}(v)}{2}\)

  2. Weighted-L1: \(e_{i}(u,v)=|\Phi _{i}(u) - \Phi _{i}(v)|\)

  3. Weighted-L2: \(e_{i}(u,v)=|\Phi _{i}(u) - \Phi _{i}(v)|^{2}\)

  4. Hadamard: \(e_{i}(u,v)=\Phi _{i}(u) * \Phi _{i}(v)\)

\(\Phi _{i}(u)\) denotes the \(i_{th}\) feature of node u, and \(e_{i}(u,v)\) denotes the \(i_{th}\) feature of a node pair \((u,v)\). In this way, a d-dimension feature vector is generated for each node-pair using the d-dimension feature representation of the corresponding nodes.
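The four operators listed above act element-wise on the two node embeddings, which the following sketch makes explicit. Embeddings are represented as plain Python lists, and the operator names used as dictionary keys are hypothetical labels chosen for this illustration.

```python
def pair_features(phi_u, phi_v, operator):
    """Element-wise node-pair embedding e(u, v) for one of the four operators."""
    operators = {
        "average": lambda a, b: (a + b) / 2.0,
        "weighted_l1": lambda a, b: abs(a - b),
        "weighted_l2": lambda a, b: abs(a - b) ** 2,
        "hadamard": lambda a, b: a * b,
    }
    op = operators[operator]
    # Output has the same dimension d as the node embeddings.
    return [op(a, b) for a, b in zip(phi_u, phi_v)]


phi_u, phi_v = [1.0, -2.0], [3.0, 0.5]
print(pair_features(phi_u, phi_v, "hadamard"))  # [3.0, -1.0]
```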

3.4.2 Link prediction model

For link prediction, a logistic regression model is trained using the features of the node pair and their community information, with the output indicating whether a link exists between the given node pair. The input features for a node pair \((u,v)\) are generated as \(f(u,v)= (e(u,v) || C_{(u,v)})\), where || is the concatenation operator and \(C_{(u,v)}\) is 1 if both nodes u and v belong to the same community, otherwise 0. The output label is 1 if there exists a link between the given pair of nodes and 0 otherwise. We have shown results for all four operators applied on \(e(u,v)\).
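A minimal sketch of the input construction and of how a fitted logistic regression model would score a node pair is given below. It assumes the Hadamard operator for \(e(u,v)\); the weight vector and bias stand for parameters obtained by training, which is not shown here, and the function names are illustrative.

```python
import math


def link_features(phi_u, phi_v, same_community):
    """f(u, v) = e(u, v) || C_(u,v), with e(u, v) from the Hadamard operator."""
    e = [a * b for a, b in zip(phi_u, phi_v)]
    return e + [1.0 if same_community else 0.0]


def predict_link_probability(features, weights, bias):
    """Logistic regression score: sigmoid(bias + weights . features)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


f = link_features([1.0, 2.0], [0.5, -1.0], same_community=True)
print(f)  # d embedding features followed by the community indicator
```

The feature vector therefore has \(d+1\) entries, and the last coefficient of the trained model lets it treat intra- and inter-community pairs differently.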

4 Experimental analysis

In this section, we discuss baseline methods, datasets, and experimental results.

4.1 Baseline methods

The proposed method is compared with both types of link prediction methods (i) similarity-based heuristic methods and (ii) network embedding based methods.

We compare with the following three heuristic methods based on network structure.

1. Jaccard Coefficient (JC) [12]: \(\operatorname{JC}(u,v)= \frac{|\Gamma (u) \cap \Gamma (v)|}{|\Gamma (u) \cup \Gamma (v)|}\)

2. Adamic Adar (AA) [13]: \(AA(u,v)=\sum_{ w \in (\Gamma (u) \cap \Gamma (v))} \frac{1}{\log |\Gamma (w)|}\)

3. Resource Allocation (RA) [11]: \(RA(u,v)=\sum_{ w \in (\Gamma (u) \cap \Gamma (v))} \frac{1}{|\Gamma (w)|}\)
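The Adamic Adar and Resource Allocation scores above differ only in how they discount each common neighbor, as the following sketch shows. The graph representation `adj` (node to neighbor set) is an assumption for illustration, and the natural logarithm is assumed for AA.

```python
import math


def adamic_adar(adj, u, v):
    """AA(u, v): sum of 1 / log(deg(w)) over common neighbors w."""
    # A common neighbor of two distinct nodes has degree >= 2, so log > 0;
    # the guard only matters for degenerate inputs such as u == v.
    return sum(1.0 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)


def resource_allocation(adj, u, v):
    """RA(u, v): sum of 1 / deg(w) over common neighbors w."""
    return sum(1.0 / len(adj[w]) for w in adj[u] & adj[v])
```

Both scores are zero for node pairs without common neighbors, so they cannot rank such pairs against each other.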

We compare our method with the following network embedding based link-prediction methods.

4. DeepWalk [18]: The DeepWalk method learns the network embedding using the skip-gram model on the ordered sequence of nodes generated using random walk.

5. Node2Vec [17]: Node2Vec is an extension of DeepWalk where the walker has different probabilities for moving to its neighbors, and the probability to move to the next node depends on its distance from the previously visited node. Once the nodes are sampled, the network embedding is learned using the skip-gram model. We have used the code provided by the authors at https://github.com/aditya-grover/node2vec.

6. NECS [35]: Network Embedding with Community Structural information (NECS) uses nonnegative matrix factorization to generate nodes’ embedding, which preserves the high-order proximity. The final network embedding is learned by jointly optimizing the consensus relationship between the nodes’ representation and the community structure. We have used the implementation provided by the authors at https://github.com/liyu1990/necs.

For DeepWalk, Node2Vec, and NECS methods, the node-pair embedding is generated using the Hadamard operator, and then the logistic regression model is trained for the link prediction as mentioned in these works.

7. Splitter [19]: This network embedding method learns multiple embeddings of each node based on the principled decomposition of the ego-network. These multiple representations of a node denote its embedding with respect to the local communities it belongs to. The implementation is provided by the authors at https://github.com/google-research/google-research/tree/master/graph_embedding/persona. For link prediction, we used the method discussed in their paper. For each node pair \((u,v)\), the similarity score is computed using the dot product of their embeddings. In the persona graph, each node has multiple embeddings, so we compute the similarity score for each combination of their embeddings, and the maximum score is returned as the final similarity score.
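The scoring rule used for Splitter, the maximum dot product over all combinations of the two nodes' embeddings, can be sketched as follows. The function name and the list-of-lists representation are illustrative assumptions and are not taken from the Splitter code.

```python
def persona_similarity(personas_u, personas_v):
    """Maximum dot product over every pair of embeddings of u and v.

    personas_u, personas_v: non-empty lists of embedding vectors, one per
    persona (local community) of the respective node.
    """
    return max(sum(a * b for a, b in zip(x, y))
               for x in personas_u for y in personas_v)


u_embeddings = [[1.0, 0.0], [0.0, 2.0]]
v_embeddings = [[1.0, 1.0], [0.0, -1.0]]
print(persona_similarity(u_embeddings, v_embeddings))  # 2.0
```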

The implementation code of NodeSim method is available at https://github.com/akratiiet/NodeSim.

4.2 Datasets

We perform experiments on real-world networks, and their details are mentioned in Table 1. Facebook and Twitter are snapshots from online social networking websites, Enron is an email communication network, and GrQc, Hep-th, Hep-ph, Astro-ph, and DBLP are co-authorship networks. In all the networks, the communities are detected using the Louvain Method, and a community label is assigned to each node based on which community it belongs to. A node pair is referred to as intra-community node pair if both the nodes belong to the same community; otherwise, it will be referred to as inter-community node pair.

Table 1 Datasets

To generate the training and testing data, we follow the same methodology as used in [17, 19]; however, we additionally maintain the ratio of inter- and intra-community links, which was not considered in previous studies. First, we remove 10% of inter-community and 10% of intra-community edges from E uniformly at random and put them in set \(E_{lp}\) that will be used for link prediction. While removing the 10% edges, it is ensured that the network remains connected. The remaining 90% edges are referred to as \(E_{ne}\), and \(G(V, E_{ne})\) will be used to generate the network embedding.
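The edge removal step can be sketched as below: a fixed fraction of intra-community and of inter-community edges is held out in random order, and a removal is kept only if the remaining graph stays connected. This is a naive illustration that re-checks connectivity with a breadth-first search after every removal; the function names and data layout are assumptions, and the original implementation may proceed differently.

```python
import random
from collections import deque


def is_connected(nodes, edges):
    """Breadth-first search connectivity check on an undirected graph."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for y in adj[queue.popleft()]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) == len(nodes)


def split_edges(nodes, edges, community, fraction=0.1, seed=0):
    """Hold out `fraction` of intra- and of inter-community edges (E_lp),
    keeping the graph on the remaining edges (E_ne) connected."""
    rng = random.Random(seed)
    remaining, held_out = set(edges), []
    for intra in (True, False):
        group = [e for e in edges
                 if (community[e[0]] == community[e[1]]) == intra]
        rng.shuffle(group)
        target, removed = int(fraction * len(group)), 0
        for e in group:
            if removed == target:
                break
            remaining.discard(e)
            if is_connected(nodes, remaining):
                held_out.append(e)
                removed += 1
            else:
                remaining.add(e)  # removal would disconnect the graph
    return held_out, list(remaining)
```

Processing the two edge types separately is what keeps the intra to inter ratio of the held-out set equal to that of the original edge set.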

For the link prediction task, the same number of inter- and intra-community node pairs as in \(E_{lp}\) are chosen uniformly at random from the non-existent links. These sampled node pairs serve as negative cases and are added to set \(E_{lp}\). If a link is formed between a given node pair, then it is referred to as a positive case; otherwise, it is referred to as a negative case. To create train and test data, the node pairs in \(E_{lp}\) are split into \(E_{\mathrm{train}}\) and \(E_{\mathrm{test}}\), and while splitting, we ensure that the ratio of intra- and inter-community node pairs is maintained for both positive and negative cases. The default train and test ratio is \((0.5:0.5)\) unless mentioned explicitly. In the heuristic and Splitter link prediction methods, a node pair is predicted positive if the similarity score for this pair is higher than the similarity score of 50% of the positive training cases.
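One reading of the decision rule for the heuristic and Splitter methods is that the threshold is the median similarity score of the positive training cases. The sketch below implements that interpretation; treating "higher than 50% of the positive cases" as a strict comparison against the median is an assumption, and the function names are illustrative.

```python
from statistics import median


def fit_threshold(positive_train_scores):
    """Threshold assumed to be the median score of positive training pairs."""
    return median(positive_train_scores)


def predict(scores, threshold):
    """Label a pair positive (1) when its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]


threshold = fit_threshold([0.1, 0.4, 0.9])
print(predict([0.2, 0.4, 0.7], threshold))  # [0, 0, 1]
```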

4.3 Performance study

First, we compare the NodeSim method with the baselines, and the ROC-AUC value is computed for all test cases as well as separately for intra-community and inter-community test cases, as shown in Table 2. The table shows the best results observed for the different parameter settings used in the different methods, and each experiment is repeated five times to compute the average. The dimension of the network embedding is \(d=128\). The results show that the proposed NodeSim method with the Hadamard operator for node pair embedding outperforms all baseline methods. The bold-faced values show the best ROC-AUC obtained for the total link prediction, and the best results also provide better intra- and inter-community link prediction results compared to all the baselines. The ‘*’ value for the NECS and Splitter methods indicates that the code execution was not completed in 48 hours on the server, and therefore, the values are not reported. The NECS method uses matrix factorization and therefore has high computational complexity. The Splitter method generates multiple embeddings of each node corresponding to its local communities, and therefore, its execution time grows manyfold depending on the density of the network and the connectivity of the nodes.
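The evaluation protocol, ROC-AUC over all test pairs and separately over intra- and inter-community pairs, can be sketched with a direct pairwise definition of AUC: the fraction of positive and negative pairs that are ranked correctly, with ties counted as one half. The quadratic-time implementation and the function names are illustrative assumptions; a library routine would normally be used in practice.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_type(labels, scores, same_community):
    """ROC-AUC for all test pairs and for the intra / inter subsets.

    same_community[i] is 1 for an intra-community pair and 0 otherwise.
    """
    result = {"total": roc_auc(labels, scores)}
    for name, flag in (("intra", 1), ("inter", 0)):
        idx = [i for i, c in enumerate(same_community) if c == flag]
        result[name] = roc_auc([labels[i] for i in idx],
                               [scores[i] for i in idx])
    return result
```

Reporting the three values side by side shows whether a gain in total accuracy comes from both link types or only from the more numerous intra-community pairs.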

Table 2 ROC-AUC for link prediction

We further study the performance of our method by varying the ratio of train and test set. The results are shown in Fig. 2 for Hep-ph and Astro-ph networks. Results show that the performance of the proposed method is better compared to baselines, even if the training ratio is 0.1; however, the best results are achieved when the ratio of training size is at least 0.5 and 0.3 for Hep-ph and Astro-ph networks, respectively.

Figure 2 Vary train size

4.4 Parameter sensitivity

The NodeSim embedding method depends on a number of parameters, and we examine the impact of different parameters on the performance of link prediction. Table 3 shows the default values of the parameters, which were chosen based on a preliminary analysis, and the ranges that we have considered. The results are shown on two networks, Hep-ph and Astro-ph.

Table 3 Default and varied range values of different network embedding parameters

Figure 3 shows the impact of varying α on inter and intra-community link prediction. The results show that \(\alpha \sim 1 \text{--} 1.5\) achieves the best results. In Fig. 4, the results show that \(\beta \sim 1.5 \text{--} 2\) achieves the best results. The results confirm that the inter-community edges should be weighted higher than the intra-community edges during the sampling to predict inter-community links with high accuracy, as expected.

Figure 3 Impact of varying α

Figure 4 Impact of varying β

Next, we analyze the impact of embedding parameters on link prediction accuracy. Figure 5 shows that the performance of link prediction methods improves with the embedding dimension. In Fig. 6, we observe that the performance decreases as the window size grows, since a larger window includes distant nodes in the local context of a node, and these nodes might not be similar. In real-world networks, most of the new links are driven by the triadic closure phenomenon, and it is less probable that a node will be connected to a distant node.

Figure 5 Impact of varying Dimension (d)

Figure 6 Impact of varying context k

Figures 7 and 8 show results for varying the number of walks and the walk-length. As observed in Fig. 7, the intra-community results are less affected by the number of walks than the inter-community results, as the ratio of inter-community context pairs decreases with a larger number of walks, as expected. Similarly, the inter-community accuracy also decreases with the walk-length even if the total accuracy improves, as shown in Fig. 8 (b).

Figure 7 Impact of varying Number of Walks

Figure 8 Impact of varying Walk-length

4.5 Scalability

We compare the running time of different network embedding based methods on synthetic networks generated using the SCCP (Scale-free networks with Community and Core-Periphery) model [56, 57]. The network generator first creates a seed graph, i.e., a complete graph of m nodes for each community, where m is the average degree of nodes. Next, in each iteration, a new node is added to each community, and the added node builds m connections using the preferential attachment law [14] while ensuring the intra- and inter-community edge ratio. The running time is compared on synthetic networks so that the ratio of intra- and inter-community edges is maintained as we increase the network size. In our experiments, the ratio is (intra : inter = 0.75 : 0.25), and the average degree of the network is 8. The total number of communities is 10 in the networks having 100 and 1000 nodes and 100 in the networks having 10,000 and 100,000 nodes. All communities in a network are of the same size.

Figure 9 shows the running time of the different methods. All experiments are performed on a server having 384 GB RAM and 2× Intel Xeon 4110 @ 2.1 GHz CPU. For the 100,000-node network, the Splitter code did not finish in 48 hours, and the NECS code was killed due to a memory error on the server. The results show that the proposed method executes faster than all the baselines except DeepWalk as the network size grows. The running time of NodeSim and Node2Vec is almost equal. The DeepWalk method is the fastest, as it creates node context using a simple random walk and does not consider the structural properties of the network.

Figure 9 Running time for different embedding methods versus network size

4.6 Robustness for identified communities

Several community detection methods have been proposed in the literature that consider different network properties while identifying the communities. Therefore, the communities identified by different methods might vary. For some methods, such as the Louvain or greedy method, the returned community structure might differ each time the same method is applied.

We, therefore, study the effectiveness of the NodeSim embedding method with respect to different community detection methods. We apply five community detection methods: (i) Louvain method [45], (ii) Fluid Communities Algorithm [58], (iii) Greedy Modularity Maximization [2], (iv) Semi-synchronous Label Propagation [59], and (v) Asynchronous Label Propagation [60]. The details of the community detection methods are explained in Appendix A. After identifying the communities using each method, the training and testing data are created, as discussed in Sect. 4.2. Next, we generate network embeddings by applying the different embedding methods and apply the link prediction method. Each method is executed five times, and the average ROC-AUC value for the Hep-ph network is shown in Fig. 10.
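To illustrate how one of these methods works, the following is a minimal plain-Python sketch of asynchronous label propagation. It is a simplified illustration and not the implementation used in the experiments; the function name `async_label_propagation`, the adjacency-dictionary input, and the `max_sweeps` cap are assumptions made for this example.

```python
import random
from collections import Counter


def async_label_propagation(adj, seed=None, max_sweeps=100):
    """Return a list of communities (sets of nodes).

    adj: dict mapping each node to the set of its neighbours.
    Every node starts with its own label; nodes are then visited in random
    order and adopt the most frequent label among their neighbours.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # Keep the current label on ties; otherwise break ties at random.
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)
                changed = True
        if not changed:
            break  # every node already holds a majority label
    groups = {}
    for v, lab in labels.items():
        groups.setdefault(lab, set()).add(v)
    return list(groups.values())
```

Because both the visiting order and the tie-breaking are random, calling the function with different seeds on a real network may return different partitions, which is the kind of variability examined in this section.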

Figure 10 ROC-AUC for link prediction corresponding to different community detection methods for the Hep-ph network

The results show that the relative performance of the different methods is maintained irrespective of the community detection method. The NodeSim method outperforms the baselines in all cases, as it considers both the similarity of nodes and their communities while generating the network embedding.

4.7 Case study

For visualization, we show the NodeSim embedding of the Zachary Karate Network [61] in 2-dimensional and 3-dimensional space. The network and its embeddings are shown in Fig. 11, where nodes having the same color belong to one community. The embedding shows that the nodes belonging to different communities are well separated, while more similar nodes are embedded closer together. For example, node 12 is more likely to form inter-community links with nodes 4, 5, 6, and 10, so, as observed, they are embedded closer but still well separated. The embedding quality improves with a higher dimension, consistent with the observation in Sect. 4.4 that the accuracy increases with the embedding dimension. Embeddings of the Dutch School Friendship Network [62] and the Illinois Highschool Friendship Network [63] are shown in Appendix B.

Figure 11 (a) Zachary Karate Network with three communities, (b) and (c) 2-dimensional and 3-dimensional embedding of Zachary Karate Network using NodeSim Method, respectively

4.8 Application in anomaly detection

One well-known application of link prediction is the detection of anomalous links. We briefly analyze the performance of the proposed method for anomalous link prediction. For this analysis, we use four real-world anomaly datasets and three synthetic datasets generated from real-world networks; the details are provided in Table 4. The German Boys network [64] is a friendship network of a German school class from 1880–1881, and students are labeled as outliers based on their characteristics and behavior. The Disney and Books networks are co-purchase networks extracted from Amazon [65]. Enron-Anomaly [66] is an email communication network having spammers labeled as anomalous users. In all real-world anomaly datasets, nodes are labeled as anomalous and non-anomalous. To create synthetic anomaly network datasets, we follow the method used in previous anomaly detection works [67, 68]. We first add 0.4% nodes as anomalous nodes to the given network G. Each anomalous node picks its degree \(k\) from the degree distribution of network G and makes \(k\) connections to nodes of network G chosen uniformly at random. The synthetic anomalous networks corresponding to the Hep-ph, Astro-ph, and DBLP networks are referred to as Hep-ph Anomaly, Astro-ph Anomaly, and DBLP Anomaly, respectively. In all networks, each edge that is connected with any anomalous node is labeled as an anomalous edge, and the rest of the edges are considered regular edges (also referred to as non-anomalous edges).
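The synthetic anomaly construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `inject_anomalous_nodes` is hypothetical, the degree of an anomalous node is drawn by copying the degree of a uniformly chosen original node (one way of sampling the empirical degree distribution), at least one anomalous node is always added, and anomalous nodes connect only to original nodes of G.

```python
import random


def inject_anomalous_nodes(adj, fraction=0.004, seed=0):
    """Add anomalous nodes to a network and label every edge.

    adj: dict mapping each node of G to the set of its neighbours.
    Returns (new_adj, anomalous_nodes, edge_labels), where edge_labels maps
    frozenset({u, v}) to 1 for anomalous edges and 0 for regular edges.
    """
    rng = random.Random(seed)
    original = list(adj)
    degrees = [len(adj[v]) for v in original]       # empirical degree distribution
    n_new = max(1, round(fraction * len(original)))  # assumption: at least one
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}  # leave the input intact
    anomalous = set()
    for i in range(n_new):
        node = ("anomaly", i)                        # hypothetical node identifier
        k = min(rng.choice(degrees), len(original))
        new_adj[node] = set(rng.sample(original, k))  # uniform targets in G
        for v in new_adj[node]:
            new_adj[v].add(node)
        anomalous.add(node)
    # An edge is anomalous if at least one endpoint is an anomalous node.
    edge_labels = {}
    for u, nbrs in new_adj.items():
        for v in nbrs:
            edge_labels[frozenset((u, v))] = int(u in anomalous or v in anomalous)
    return new_adj, anomalous, edge_labels
```

On a 1000-node ring, for instance, the default fraction adds four anomalous nodes of degree 2, so eight of the resulting 1008 edges are labeled anomalous.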

Table 4 Datasets for anomaly analysis

For anomalous link detection, we create network embeddings using the Deepwalk, Node2Vec, NECS, and NodeSim methods. For the German Boys network, we create 32-dimensional embeddings, as it is a small network, using the following parameters: 5 walks of length 10, a window size of 3, \(p = 0.25\) and \(q = 0.25\) for Node2Vec, and \(\alpha = 1\) and \(\beta = 1\) for the NodeSim embedding method. For the other networks, we create 128-dimensional embeddings using the default parameter settings for the number of walks, walk length, and window size as used for link prediction. Please note that the Splitter method generates multiple embeddings of each node based on its local personas and, therefore, cannot be used directly for anomaly detection.

To create the training and testing data, we uniformly sample 50% of the anomalous and non-anomalous edges as the training set and use the remaining 50% as the testing set. Given that the training data is imbalanced due to a very small number of anomalous edges, we use SMOTE (Synthetic Minority Oversampling Technique) oversampling [69] to create a balanced dataset using two nearest neighbors. Then, we train a logistic regression model on the balanced dataset for the different network embeddings. The ROC-AUC and Micro-F1 values for the testing data are shown in Tables 5 and 6, respectively. We observe that for very small networks, such as German Boys, the NECS method provides better results. However, the performance of NECS is the worst for medium-size networks, and the method has a very high computational complexity for large-size networks. The results show that the NodeSim method provides promising results for anomalous link detection for medium to large-scale networks.
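The oversampling step can be illustrated with a small plain-Python sketch of the SMOTE idea: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. The helper names `smote` and `balance` are hypothetical, edge feature vectors are assumed to be plain lists of floats, and label 1 is assumed to mark the minority (anomalous) class; the logistic regression step is not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Indices of the k nearest minority neighbours of x (excluding x).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position on the segment from x to the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def balance(features, labels, k=2, seed=0):
    """Oversample class 1 until both classes have the same size."""
    minority = [x for x, y in zip(features, labels) if y == 1]
    n_majority = len(labels) - len(minority)
    extra = smote(minority, n_majority - len(minority), k, seed)
    # New lists are returned; the inputs are left unchanged.
    return features + extra, labels + [1] * len(extra)
```

Only the training split should be passed to `balance`, so that the testing data keeps its original class distribution.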

Table 5 ROC-AUC for anomalous link detection
Table 6 Micro-F1 for anomalous link detection

We have shown detailed experiments for anomalous link detection as this is a more suitable application of link prediction; however, the NodeSim embedding can also be used for detecting anomalous nodes. For anomalous node detection, one can use either of the following two approaches: (i) directly train a machine learning model on the network embedding to identify anomalous nodes, or (ii) first classify the edges as anomalous and non-anomalous, and then use this information further to classify anomalous nodes. One of the main limitations in anomalous link prediction is the limited availability of real-world datasets. In our analysis, we consider that each link connected to an anomalous node is anomalous; however, in real-world applications, an anomalous node might have both types of connections, anomalous as well as non-anomalous. The proposed method uses only the network structure, and it provides good results compared to the baselines in settings where no additional information is available due to the privacy concerns of the users. Given the promising performance of the NodeSim method, one can use it further for designing improved anomalous link and node detection methods using additional information about the nodes and the network. For example, in attributed networks, the anomalous link detection method can use the network embedding and the nodes’ attributes to achieve improved performance.
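As a rough illustration of approach (ii), edge-level predictions can be aggregated into node-level scores. The aggregation rule is not specified in this work; the sketch below assumes, purely for illustration, that a node's score is the mean predicted anomaly probability of its incident edges and that nodes above a threshold are flagged. The function names and the threshold value are hypothetical.

```python
from collections import defaultdict


def node_scores_from_edges(edge_scores):
    """Average the predicted anomaly probability over each node's edges.

    edge_scores: dict mapping an edge (u, v) to a probability in [0, 1],
    e.g. the output of an edge classifier trained on network embeddings.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (u, v), score in edge_scores.items():
        for node in (u, v):
            totals[node] += score
            counts[node] += 1
    return {node: totals[node] / counts[node] for node in totals}


def flag_nodes(edge_scores, threshold=0.5):
    """Return the nodes whose mean incident-edge score exceeds the threshold."""
    scores = node_scores_from_edges(edge_scores)
    return {node for node, s in scores.items() if s > threshold}
```

Other aggregation rules, such as the maximum edge score or the fraction of incident edges classified as anomalous, would fit the same interface.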

5 Conclusion

In this work, we have proposed the NodeSim network embedding method, which considers both the nodes’ similarity and their community membership while learning the feature representation of the nodes. The NodeSim embedding method efficiently learns the embedding of diverse nodes, which is further verified using link prediction. We proposed a link prediction method that trains a logistic regression model using nodes’ features and their community information. The results showed that the proposed link prediction method outperforms baseline methods for both intra-community and inter-community link prediction. We further studied the impact of different parameters and showed that a higher value of β provides higher inter-community link prediction accuracy, as the NodeSim method embeds the more similar diverse nodes closer than the others. We further showed the application of the proposed method in anomaly detection and network visualization.

In the future, we would like to extend the proposed method to generate embeddings of dynamic networks in order to predict inter- and intra-community links with high accuracy and thereby increase diversity. Such embeddings can be used for several downstream tasks in dynamic networks, such as anomaly detection, network visualization, and recommendation systems for suggesting content, posts, or advertisements.