Incremental community discovery via latent network representation and probabilistic inference
Abstract
Most of the community detection algorithms assume that the complete network structure \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) is available in advance for analysis. However, in reality this may not be true due to several reasons, such as privacy constraints and restricted access, which result in a partial snapshot of the entire network. In addition, we may be interested in identifying the community information of only a selected subset of nodes (denoted by \(\mathcal {V}_{{\mathrm{T}}} \subseteq \mathcal {V}\)), rather than obtaining the community structure of all the nodes in \(\mathcal {G}\). To this end, we propose an incremental community detection method that repeats two stages—(i) network scan and (ii) community update. In the first stage, our method selects an appropriate node in such a way that the discovery of its local neighborhood structure leads to an accurate community detection in the second stage. We propose a novel criterion, called Information Gain, based on existing network embedding algorithms (Deepwalk and node2vec) to scan a node. The proposed community update stage consists of expectation–maximization and Markov Random Field-based denoising strategy. Experiments with 5 diverse networks with known ground-truth community structure show that our algorithm achieves 10.2% higher accuracy on average over state-of-the-art algorithms for both network scan and community update steps.
Keywords
Community detection · Incremental community detection · Network embedding · Probabilistic inference
1 Introduction
In social network analysis, the task of community detection has been widely studied [1]. In many cases, instead of discovering the community structure of the entire network, we may wish to detect communities within a target set of nodes [2, 3]. For instance, a telecommunication company may want to find communities that its valuable customers are part of, in order to provide better facilities.
Most studies in this direction assume that the overall network structure is known in advance [4]. However, in real networks, complete information is difficult or even impossible to obtain [5]. In the Facebook network, for instance, the complete linkage structure of a user is often unobtainable due to privacy constraints. Moreover, in many online social network sites, such as Twitter, many new edges are created daily. In such cases, the entire network structure available to users is incomplete.
This leads us to tackle a more realistic problem setting—given an initial sub-network and a set of target nodes, our task is to progressively scan the network and explore the communities^{1} where the target nodes reside. Scanning a node means checking its social profile and retrieving its neighborhood. By adding the scanned neighbors of a node, we keep accumulating knowledge about the topology of the network. This problem setting is realistic for several reasons—first, network information is usually incomplete and hard to acquire. Second, many new relationships are created every day; even if the network information is complete at a particular time, the network structure may need to be scanned and updated again as time goes by.
However, the cost to scan a network is generally limited, which creates two further challenges. The first is that we have to scan the network carefully. With no constraint on scanning, we could explore the network in a brute-force manner until enough nodes are scanned and the correct community membership of the target nodes is revealed. However, when there is an upper limit on the cost (i.e., budget) to scan nodes, it is necessary to scan nodes judiciously in order to explore the best community information of the target node set. The second challenge is to incrementally update communities of target nodes based on partial information. After a network scan, more nodes are discovered, and the algorithm needs to quickly and correctly update the communities of the target nodes.
To address the first challenge, we propose a metric, called information gain, for selecting a new node to scan. The key idea behind this metric is to use network embedding to find latent representations of nodes and to compute which node (and its associated edges) has the largest information gain. The metric tracks how the latent vector representations of nodes change over a series of network updates in order to decide which node to scan next. If a node's latent representation drastically changes over successive updates, it is likely that its neighborhood information has also changed, and thus the node is a potential candidate to scan.
To address the second challenge, we propose a three-step incremental community update method—Step 1: an incremental local update based on expectation–maximization [6]; Step 2: an intermittent global update to correct the local update; and Step 3: the Markov Random Field (MRF)-based denoising [7] to further adjust both local and global updates.
- We investigate a contemporary problem which is more realistic than the traditional settings of community detection and has rarely been explored in the past [8].
- We propose a novel method that interweaves network discovery and community detection by first mapping the discovered network into an embedding space, followed by an incremental community update that adjusts the current community structure by leveraging the new information acquired in network discovery.
- We compare our method with 4 different variations of the existing state-of-the-art method [8] along with 3 other commonly used baselines on 5 diverse datasets. We observe that our proposed algorithm significantly outperforms other baselines with a performance improvement of 8%\(\sim \)17% and an average improvement of 10.2%.
2 Related work
In network science, the task of detecting dense modules has been extensively studied for static networks [4, 9] (see [1, 10] for a comprehensive review). Several attempts have been made to detect local communities around a target node [2, 3, 11, 12, 13, 14, 15]. Some work has focused on detecting dynamic communities in evolving networks [16, 17, 18, 19]. Recently, [20, 21, 22] attempted to discover communities from incomplete/noisy networks. All these methods assume that the entire network structure is available a priori, whereas we assume that only target nodes are given, and one needs to scan nodes to explore the network structure. Every scan operation incurs a cost, and network exploration is possible only until a given budget is exhausted.
In the line of pure network discovery (without community detection), there has been plenty of research on general sampling techniques [23, 24], sampling web documents [25], and sampling social networks [26, 27]. These techniques are not applicable in our setting because they initially consider the entire network structure for sampling. Moreover, none of them focuses on discovering the underlying community structure in the sampled network. There is also a large body of work on incremental community discovery in dynamic graphs, such as [28, 29, 30, 31, 32, 33, 34, 35]. These decentralized algorithms consider scenarios similar to ours, but they do not account for the cost incurred when the algorithm queries or explores one or more nodes in the graph. This makes our scenario more difficult.
The existing work most similar to ours is NetDiscover [8], which also attempted to discover disjoint communities of a target node set. Although the problem definition is exactly the same, we differ from their method w.r.t. both candidate node selection and incremental community detection. In the former case, NetDiscover selects a query node based on one of two community scoring metrics (modularity, normalized cut), whereas our node selection algorithm is based on a novel metric, information gain. A close inspection of NetDiscover reveals an information leakage while selecting the query node, in that it uses the entire network to compute the node scoring function. In the latter case, NetDiscover detects initial communities using spectral clustering [36] (where the number of communities needs to be given a priori) and adopts a generative model (GM) [6] to update communities, whereas we combine the EM and MRF algorithms for the community update. To compare with existing work, we consider NetDiscover as well as the commonly used random sampling and greedy algorithms as baselines (see Sect. 7.3). We also compare our method with Quick Community Adaptation (QCA) [37] and Dynamic Permanence (DyPerm) [38]. QCA is an adaptive modularity-based approach for identifying and tracking the community structure of a dynamic network. DyPerm is another modularity maximization-based approach that incrementally updates the community structure of a network at \(t_i\) based on the community structure at \(t_{i-1}\), without detecting the community structure from scratch. We adapt these algorithms to our setting (incomplete network) to make them comparable.
3 Problem statement
We denote a network as \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), where \(\mathcal {V}\) is a set of nodes and \(\mathcal {E}\) is a set of edges. Assume that initially we do not know the entire network \(\mathcal {G}\); only a partial subset of the network \(\mathcal {G}_{{\mathrm{s}}}=(\mathcal {V}_{{\mathrm{s}}},\mathcal {E}_{{\mathrm{s}}})\) (where \(\mathcal {V}_{{\mathrm{s}}} \subseteq \mathcal {V}\) and \(\mathcal {E}_{{\mathrm{s}}} \subseteq \mathcal {E}\)) is known. Among all the nodes in \(\mathcal {V}_{{\mathrm{s}}}\), we are particularly interested in detecting the community structure of a given set of target nodes \(\mathcal {V}_{{\mathrm{T}}} \subseteq \mathcal {V}_{{\mathrm{s}}}\).
We iteratively scan nodes and explore the network with the neighborhood information of the scanned nodes. We use \(\mathcal {G}_i=(\mathcal {V}_i, \mathcal {E}_i)\) to denote an intermediate network at the ith scan iteration (thus, initially \(\mathcal {G}_0 = \mathcal {G}_{{\mathrm{s}}}\)). The performance of community detection and the cost incurred by scanning nodes greatly depend upon the node selection strategy (i.e., which node to scan next). We assume a function Q(v) denoting the cost to scan a node v. If v is a private node^{2} and does not allow a scan, then \(Q(v) = \infty \). In the simplest case, the cost to scan all non-private nodes may be the same, i.e., \(Q(v)=1\). However, we also adopt a more general setting with a heterogeneous cost per node. Another general aspect of the setting is that the available budget B we invest for scanning is limited.
To this end, we consider two sub-problems: One is to decide candidate nodes^{3} to scan next, and the other is to update the community structure incrementally.
Network scan Given a budget B, an intermediate network \(\mathcal {G}_i\) with a cost function Q associated with it, and a target node set \(\mathcal {V}_{{\mathrm{T}}}\), the aim of candidate selection is to decide the next candidates whose exploration of the local neighborhoods leads to the best community detection. After that, we actually scan the selected candidates, and \(\mathcal {G}_{i+1}\) is generated from \(\mathcal {G}_i\) and scan results of the candidates.
Community update Given \(\mathcal {G}_i\) and \(\mathcal {G}_{i+1}\), the task of community update is to efficiently and effectively discover the community structure in \(\mathcal {G}_{i+1}\) considering the community structure in \(\mathcal {G}_i\) from the last iteration.
4 Overall algorithm
5 Network scan
The purpose of network scan is to discover unexplored parts of the network, where a node is scanned to know its full neighborhood structure. It mainly aims at exploring and acquiring more nodes and their connections, leading to the detection of better community structure. Since the effectiveness of the algorithm mainly depends on the scan sequence, the main focus of our network scan approach is to decide the best candidate nodes to be processed next given an intermediate network. The parameter k denotes the number of nodes chosen to be scanned.
Candidates for network scan Let \(S_{i}\) be a set of scanned nodes till iteration (\(i-1\)). The neighbor set is defined as \(\mathcal {N}_i=\{v | \exists u \in S_i, (v,u) \in \mathcal {E}\}\). The scan candidate set \(\mathcal {C}_i = \mathcal {N}_i \setminus S_i\) contains only nodes that are not scanned among all nodes in \(\mathcal {N}_i\).
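The bookkeeping above can be sketched in a few lines of Python; the adjacency-dict representation and all names are ours, for illustration only:

```python
# Sketch of the candidate-set definition C_i = N_i \ S_i from Sect. 5.
# The (partially discovered) graph is a plain adjacency dict of sets.

def candidate_set(adj, scanned):
    """Neighbors of scanned nodes that have not themselves been scanned."""
    neighbors = set()
    for u in scanned:
        neighbors.update(adj.get(u, ()))  # N_i: union of scanned nodes' neighborhoods
    return neighbors - scanned            # C_i = N_i \ S_i

# Toy intermediate network.
adj = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a"},
    "d": {"b"},
}
print(sorted(candidate_set(adj, {"a"})))       # -> ['b', 'c']
print(sorted(candidate_set(adj, {"a", "b"})))  # -> ['c', 'd']
```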
There are many candidate selection algorithms for network scan [39, 40]. Two of them are simple but widely used in various fields: random sampling and greedy sampling [41]. We will use these methods as baselines in Sect. 7. Soundarajan et al. [40] suggested to select the node that maximizes the total number of nodes scanned with the aim of exploring the complete network structure. Other methods were specifically designed for community detection [8]. These methods dynamically sample nodes from an intermediate network in such a way that a certain community quality measurement metric is expected to be improved. Liu et al. [8] selected nodes in such a way that the value of ‘normalized cut’ decreases or ‘modularity’ increases. Here, we briefly describe these two metrics and how we use them to design baseline methods.
Normalized cut Given K communities at iteration i, the normalized cut is defined as: \(\sum _{k=1}^{K} \frac{\text {cut}(\mathsf {C}_{k}, \mathcal {G}_{i}-\mathsf {C}_{k})}{\text {assoc}(\mathsf {C}_{k}, \mathcal {G}_{i})}\), where \(\text {assoc}(\mathsf {C}_{k}, \mathcal {G}_{i})\) is the total degree of the nodes in \(\mathsf {C}_{k}\) within \(\mathcal {G}_{i}\), and \(\text {cut}(\mathsf {C}_{k}, \mathcal {G}_{i}-\mathsf {C}_{k})\) is the number of edge-cuts between \(\mathsf {C}_k\) and all other communities. Optimizing this cost function minimizes the edge-cuts (connections) among different communities. The baseline algorithms used in the experiments follow Liu et al. [8]. We calculate the minimum normalized cut cost for each node in the candidate set \(\mathcal {C}_i\)—at this step, the correct community membership of a candidate is not known, so all possible community assignments of candidates have to be tested after fixing the community structure of the scanned nodes in \(S_{i}\). Among all candidates in \(\mathcal {C}_i\), the k nodes leading to the minimum normalized cut are selected and added to the intermediate network. Here, we assume that the community structure does not change for any node except the newly added ones.
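A minimal sketch of this normalized-cut score, assuming an edge-list graph and a node-to-community map (both representations are our own choices, not the paper's):

```python
from collections import defaultdict

def normalized_cut(edges, membership):
    """Sum over communities of cut(C_k, G - C_k) / assoc(C_k, G)."""
    cut = defaultdict(int)    # inter-community edges incident to each community
    assoc = defaultdict(int)  # total degree of each community
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        assoc[cu] += 1
        assoc[cv] += 1
        if cu != cv:
            cut[cu] += 1
            cut[cv] += 1
    return sum(cut[c] / assoc[c] for c in assoc if assoc[c] > 0)

# Two triangles joined by one bridge edge: each community has total degree 7
# and one cut edge, so the score is 1/7 + 1/7.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(normalized_cut(edges, membership))  # -> 2/7, about 0.2857
```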
Modularity Modularity evaluates the strength of a partition of a network into communities: \(\sum _{k=1}^{K} (e(\mathsf {C}^{k},S) - a(\mathsf {C}^{k},S)^2)\), where \(e(\mathsf {C}^{k},S)\) is the fraction of edges whose both endpoints are in the same community \(\mathsf {C}^{k}\), and \(a(\mathsf {C}^{k},S)\) is the fraction of edge endpoints attached to nodes in \(\mathsf {C}^{k}\). A high modularity value means dense connections between the nodes within a community but sparse connections between nodes in different communities. We take the same approach as for normalized cut, i.e., testing all possible community membership assignments of candidates and choosing the k candidate nodes that lead to the maximum modularity [8].
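A matching sketch for modularity under the same assumed edge-list representation, with \(a\) computed as the fraction of edge endpoints attached to the community (the standard Newman form):

```python
from collections import defaultdict

def modularity(edges, membership):
    """Sum over communities of e_k - a_k^2."""
    m = len(edges)
    e = defaultdict(int)  # edges fully inside community k
    a = defaultdict(int)  # edge endpoints attached to community k
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        a[cu] += 1
        a[cv] += 1
        if cu == cv:
            e[cu] += 1
    return sum(e[k] / m - (a[k] / (2 * m)) ** 2 for k in a)

# Same toy graph as for normalized cut: two triangles plus a bridge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(round(modularity(edges, membership), 4))  # -> 0.3571
```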
Here, we propose a new candidate selection method that does not incur any additional hidden cost. We utilize Deepwalk [42] and node2vec [43] to obtain latent feature vectors of nodes, i.e., network embedding into feature space. We first briefly introduce Deepwalk and node2vec and then discuss our proposed algorithm.
5.1 Network embedding algorithms
Deepwalk [42] and node2vec [43] are deep learning-based embedding methods to learn latent representations of nodes in a network. Deepwalk encodes social relations into a continuous vector space after modeling a series of random walks with a Natural Language Processing method. The key idea is that each visited node during a random walk can be considered as a word, and the random walk corresponds to a sentence. Node2vec learns a latent feature vector that maximizes the likelihood of maintaining the neighborhoods of the node.
The feature representation framework generally consists of two main parts: a random walk generator and a representation update. Both frameworks share a common generator; for the representation update, Deepwalk uses SkipGram [44], while node2vec uses a modified SkipGram. The generator takes a graph as input and randomly samples a path of a given length from a starting node chosen uniformly over all nodes in the network; each node on the path is a neighbor of the previous node. Deepwalk and node2vec are both scalable, and their effectiveness for community detection is shown in [42, 43, 45, 46].
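The walk generator can be sketched as follows; the walk length, walks per node, and the toy graph are illustrative choices of ours, and each generated walk would then be fed to SkipGram as a "sentence":

```python
import random

def random_walk(adj, start, length, rng=random):
    """Uniform random walk of `length` nodes; stops early at a dead end."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(adj.get(walk[-1], ()))
        if not nbrs:
            break
        walk.append(rng.choice(nbrs))
    return walk

def generate_corpus(adj, walks_per_node=2, length=5, seed=0):
    """Several walks per node, starting nodes shuffled each pass."""
    rng = random.Random(seed)
    nodes = list(adj)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        corpus.extend(random_walk(adj, v, length, rng) for v in nodes)
    return corpus  # each walk plays the role of a 'sentence' for SkipGram

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = generate_corpus(adj)
print(len(corpus), corpus[0])  # 8 walks in total
```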
Terminology 1
(Latent representation) A latent vector representation of a node v in \(\mathcal {G}_i\) generated by a network embedding algorithm encodes abstracted neighborhood information of v. Thus, two nodes with similar latent vectors have similar (close) neighborhoods.
5.2 Proposed candidate selection method—information gain
The proposed algorithm is based on the latent representation described above. In each iteration, we run a network embedding algorithm to update the representation of the nodes in the network. Let \(LR_{i}(v)\) be a vector representing the latent information of a node \(v \in \mathcal {V}_i\).
Information gain The information gain of a node v at iteration i is defined as \(Gain_i(v)=||LR_{i-1}(v) - LR_{i}(v)||_{1\text { or }2}\), where \(||\cdot ||_1\) (resp. \(||\cdot ||_2\)) denotes the \(L_1\) (resp. \(L_2\)) vector norm. The higher the information gain, the greater the change in the network structure around v after the last scan. If \(LR_{i-1}(v)\) and \(LR_{i}(v)\) are the same, there has been no change around v.
If a candidate node c is discovered and added in the ith iteration (i.e., \(LR_{i-1}(c)\) is not defined), then \(Gain_i(c)=\alpha ||LR_{i}(c)||\), where \(\alpha \) accounts for the penalty for missing information in the last iteration. A node with low information gain has a very stable and rigid neighborhood structure; it is unlikely that scanning such stable neighborhoods brings any drastic update to the community structure. Thus, we choose the top k nodes with the highest information gain. Throughout our experiments, the \(L_{2}\) norm is used for information gain, and \(\alpha \) is set to 1.
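Putting the definitions together, candidate ranking by information gain can be sketched as follows; the 2-D toy vectors stand in for real embedding vectors, and `alpha` is the new-node penalty described above:

```python
import math

def information_gain(lr_prev, lr_curr, alpha=1.0):
    """Gain_i(v) = ||LR_{i-1}(v) - LR_i(v)||_2; alpha * ||LR_i(c)||_2 for new nodes."""
    gains = {}
    for v, vec in lr_curr.items():
        if v in lr_prev:
            gains[v] = math.dist(lr_prev[v], vec)   # L2 distance between snapshots
        else:
            gains[v] = alpha * math.hypot(*vec)     # node absent at iteration i-1
    return gains

def top_k_candidates(lr_prev, lr_curr, candidates, k, alpha=1.0):
    gains = information_gain(lr_prev, lr_curr, alpha)
    return sorted(candidates, key=lambda v: -gains[v])[:k]

lr_prev = {"a": (1.0, 0.0), "b": (0.0, 1.0)}
lr_curr = {"a": (1.0, 0.1), "b": (0.9, 0.9), "c": (0.3, 0.4)}
# b moved the most (gain ~0.905), c is new (gain 0.5), a barely moved (0.1).
print(top_k_candidates(lr_prev, lr_curr, {"a", "b", "c"}, k=2))  # -> ['b', 'c']
```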
5.3 Comparison of node selection metrics
A good scan metric is expected to:
- 1.
Scan as many communities as possible, and make the number of scanned nodes in each community as even as possible.
- 2.
Scan actively around community boundaries (rather than the cores of the communities).
To compare the metrics, we measure the following two statistics for each selected node:
- 1.
Given a selected node v and its ground-truth community \(C_v\), we calculate the ratio of the discovered nodes to the size of \(C_v\), i.e., \(\frac{\text {number of discovered nodes so far in}~C_v}{|C_{v}|}\).
- 2.
The number of ground-truth communities that v’s neighbors belong to.
Statistics of various metrics to scan a selected node in Coauthorship dataset [48]: (i) the average ratio of the number of discovered nodes to the number of all nodes in the same ground-truth community of top-5 candidates, and (ii) the average number of communities that the neighbors of top-5 candidates belong to
Candidate ranking | Algorithm | Discovered/total nodes | No. of neigh.’ comm. |
---|---|---|---|
1st | Normalized Cut | 0.06 | 5.5 |
1st | Modularity | 0.08 | 5.6 |
1st | Information Gain | 0.05 | 6.3 |
2nd | Normalized Cut | 0.08 | 5.0 |
2nd | Modularity | 0.10 | 5.2 |
2nd | Information Gain | 0.06 | 5.8 |
3rd | Normalized Cut | 0.09 | 4.9 |
3rd | Modularity | 0.12 | 4.7 |
3rd | Information Gain | 0.07 | 5.5 |
4th | Normalized Cut | 0.11 | 4.5 |
4th | Modularity | 0.15 | 4.3 |
4th | Information Gain | 0.10 | 5.0 |
5th | Normalized Cut | 0.12 | 4.0 |
5th | Modularity | 0.17 | 3.9 |
5th | Information Gain | 0.12 | 4.7 |
6 Community structure update
The task of this step is to update the community membership of nodes in an intermediate network \(\mathcal {G}_i\) based on (i) the community structure of \(\mathcal {G}_{i-1}\), and (ii) the new nodes discovered after scanning. Existing approaches involve both local and global updates: local updates consider only new edges and nodes, while global updates consider the whole intermediate network \(\mathcal {G}_i\). The proposed update process consists of three steps—Step 1: an incremental local update based on expectation–maximization [6]; Step 2: an intermittent global update to correct the local update; and Step 3: MRF denoising [7] to further adjust both the local and the global updates. The expectation–maximization (EM) algorithm is originally part of the generative model suggested in [6]. The local update has lower computational complexity, but it may introduce errors due to the lack of information about the whole intermediate network.
The EM algorithm itself is very efficient but has a known limitation: when the number of hidden variables (i.e., community memberships in our case) to learn is large, it tends to be sub-optimal. In our targeted community detection, however, the number of hidden variables is far smaller than in a full community detection problem. Alternatives are moment-based or spectral methods and models such as the hidden Markov model (HMM), but HMM inference is not as cheap as that of the EM method [49]. Considering the relatively small number of hidden variables to infer, we regard the EM algorithm as a good choice in our case.
Thus, the EM algorithm is applied in every iteration, and a global update is run periodically. After that, we further reduce errors introduced by the network updates using a Markov Random Field (MRF). We fully customize both the EM and the MRF algorithms in the proposed community structure update method.
6.1 Expectation–maximization algorithm
However, it is also important to ensure that such errors do not become cumulative. Therefore, in Algorithm 2, a global update process (lines 10–12) is executed when the number of edges increases by at least 10% (i.e., the global update condition in line 8). Note that as network size increases, the global update happens less frequently.
We are mainly interested in detecting the community structure of \(\mathcal {V}_{{\mathrm{T}}}\). This does not mean that only \(\theta _{vh}\) (where \(v \in \mathcal {V}_{{\mathrm{T}}}\)) needs to be calculated, because \(\theta _{vh}\) is strongly entangled with \(\theta _{uh}\), where \(u \in \mathcal {V}\setminus \mathcal {V}_{{\mathrm{T}}}\) is a neighbor of v.
After the EM step, we assign an updated community label to each node in \(\mathcal {G}_i\) (lines 13–15 of Algorithm 2). Lastly, we perform one more denoising process (line 17) after updating for scanned nodes.
6.2 Markov random field (MRF) denoising
Note that the (observed) community results of EM algorithm are obtained from the (hidden) noise-free community structure. Our goal is to infer the noise-free hidden community structure from the results observed in the EM algorithm. Let \(o_v \in \{1, 2, \ldots , K\}\) be the observed label of a node \(v \in \mathcal {G}_i\), and \(h_v\) be the hidden actual community label. Given the observed noisy labels of all nodes, our goal is to recover the original noise-free community labels, considering the network connectivity of \(\mathcal {G}_i\).
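As one concrete realization (a standard Potts-style formulation, not necessarily the paper's exact energy function), Iterated Conditional Modes can recover the hidden labels; we borrow the parameter names \(\beta \) and \(\eta \) from Sect. 7.4, with \(\eta \) weighting fidelity to the observed EM label and \(\beta \) weighting agreement with neighbors:

```python
def icm_denoise(adj, observed, beta=1.0, eta=1.0, sweeps=5):
    """Iterated Conditional Modes on a Potts-style energy:
    cost(v, k) = eta * [k != o_v] + beta * #{neighbors of v with label != k}."""
    labels = dict(observed)
    K = set(observed.values())
    for _ in range(sweeps):
        changed = False
        for v in adj:
            best, best_cost = labels[v], float("inf")
            for k in K:
                cost = eta * (k != observed[v])
                cost += beta * sum(labels[u] != k for u in adj[v])
                if cost < best_cost:
                    best, best_cost = k, cost
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:  # converged: no label flipped in this sweep
            break
    return labels

# A 5-node path where the middle node's EM label is flipped; with beta > eta
# the neighborhood vote overrides the noisy observation.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
observed = {0: "A", 1: "A", 2: "B", 3: "A", 4: "A"}
print(icm_denoise(adj, observed, beta=2.0, eta=1.0))  # -> all nodes labeled 'A'
```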
7 Experiments
In this section, we start by presenting the metrics used for evaluation, followed by detailed experimental results.
7.1 Evaluation metrics
The effectiveness should be measured against the true community labels of nodes. Note that the measure considers only the target nodes, since the goal of our community update is to obtain a better community structure for the target node set. There are at most K different communities in the network; let \(n_1, n_2, \ldots , n_m\) be the numbers of target nodes in the m different communities detected by the algorithms. The value of m may not equal the total number of communities K. Let \(f_{ij}\) be the fraction of target nodes in the estimated ith community that belong to the jth true community. We can then find the true community most likely equivalent to the ith predicted community by \({\mathrm{arg}\,\mathrm{max}}_{j \in \{1, 2, \ldots , K\}}{f_{ij}}\), and define \(F_i = \max _{j \in \{1, 2, \ldots , K\}}{f_{ij}}\). This mapping is necessary because the estimated communities may not be indexed in the same order as the true labels. Note that \(F_i\) is always in (0, 1]; in the ideal case when all the nodes in the ith estimated community share the same true label, \(F_i\) equals 1. The Average Cluster Purity [8] is: \( \hbox {ACP} = \frac{\sum _{i=1}^{m}{n_i \ F_i}}{{\sum _{i=1}^{m}{n_i}}} \). A higher value of ACP indicates a better community structure.
As the algorithm produces more estimated communities (i.e., as m grows), ACP tends to improve, so ACP alone is not always a reliable metric; in particular, ACP values are hard to compare fairly between runs that produce very different numbers of communities. We therefore also evaluate the performance with a second measure, Average Cluster Entropy (ACE), which, unlike \(F_i\), takes the full distribution over true communities into account. The entropy \(E_i\) for an estimated community i is \(E_i = 1 - \sum _{j = 1}^{K}{f_{ij}^{2}}\), and the Average Cluster Entropy is defined as: \( \text {ACE} = \frac{\sum _{i=1}^{m}{n_i \ E_i}}{{\sum _{i=1}^{m}{n_i}}}\). A low value of ACE implies high purity and a better community structure. An estimated community i consisting only of nodes from the same true community has the lowest entropy \(E_i=0\); if the true labels of the estimated community i are evenly distributed over the K true communities, its entropy attains the maximum \(1-1/K\).
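Both metrics can be computed directly from the predicted and true label maps of the target nodes; a sketch under our assumed dict representation:

```python
from collections import Counter, defaultdict

def acp_ace(predicted, truth):
    """predicted/truth map each target node to a community label.
    Returns (ACP, ACE) as n_i-weighted averages of F_i and E_i."""
    groups = defaultdict(list)
    for v, c in predicted.items():
        groups[c].append(truth[v])
    acp_num = ace_num = total = 0.0
    for members in groups.values():
        n_i = len(members)
        f = [cnt / n_i for cnt in Counter(members).values()]  # f_ij over true labels
        F_i = max(f)                       # purity of this estimated community
        E_i = 1 - sum(x * x for x in f)    # entropy; 0 when perfectly pure
        acp_num += n_i * F_i
        ace_num += n_i * E_i
        total += n_i
    return acp_num / total, ace_num / total

# 8 target nodes, two estimated communities; P1 mixes labels A and B (3:1),
# P2 is pure B.
predicted = {v: ("P1" if v < 4 else "P2") for v in range(8)}
truth = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "B", 7: "B"}
acp, ace = acp_ace(predicted, truth)
print(acp, ace)  # -> 0.875 0.1875
```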
7.2 Datasets
Statistics of the datasets used in our experiments
Dataset | Nodes | Edges | Targets | Total Com. | Target Com. |
---|---|---|---|---|---|
DBLP | 28,702 | 66,831 | 115 | 4 | 4 |
Coauthorship | 90,302 | 352,184 | 1374 | 24 | 10 |
Synthetic | 36,000 | 291,424 | 715 | 10 | 10 |
Amazon | 334,863 | 925,872 | 602 | 75,149 | 20 |
YouTube | 1,134,890 | 2,987,624 | 800 | 8385 | 40 |
(i) DBLP network was collected by [8]. In this dataset, authors are considered as nodes and pairs of co-authors are connected with edges if they collaborated in a paper. Liu et al. [8] considered 115 authors from four real research groups led by Prof. Jiawei Han, Prof. Christos Faloutsos, Prof. Dan Roth, and Prof. Michael Jordan as target nodes.
(ii) Coauthorship network was released by [48]. It contains Computer Science authors as nodes, and edges represent the co-author relationship. There are 24 disjoint ground-truth communities representing different research areas (Algorithm, AI, NLP, ML, etc.). An author may have worked in multiple fields, which would cause the communities to overlap; to make them disjoint, we follow the approach in [48] and assign each author to the research community in which he/she has published the most papers. A total of 1374 target nodes constituting 10 communities are randomly selected.
(iii) Synthetic network is an LFR network [51], consisting of 36,000 nodes. The average degree of a node is set to 8, and the number of nodes in a community is set in the range of [50, 100]. The target nodes are randomly sampled from 10 communities, with at least 20 nodes in each selected target community.
Further in our experiments, we adopt 2 standard networks which contain known overlapping community structure [52], and pre-process the networks as follows—from each such network, we select those nodes as target nodes whose communities are completely disjoint. Then even though the underlying community structure of the entire network is overlapping, the ground-truth communities around the target nodes are disjoint. The networks are as follows:
(iv) Amazon network [52] is a co-purchase network, consisting of nodes as products, and two products are connected if they have been co-purchased by at least one customer. Products from the same category define a community. We randomly select 602 nodes constituting 20 communities that have no overlap.
(v) YouTube network [52] consists of users as nodes and friendships as edges. The ground-truth communities are user-defined groups. We randomly select 800 nodes constituting 40 communities that have no overlap.
7.3 Baseline algorithms
Random sampling This algorithm randomly picks k unscanned nodes in the intermediate network \(\mathcal {G}_i\) according to a Zipf (power-law) distribution, and retrieves their neighbors.
Greedy sampling This algorithm selects the k highest-degree nodes in the candidate node set.
Ratio of degree and entropy combination algorithm This approach combines the greedy algorithm and community membership of the one-hop neighbors of the scanned node. The metric we compute for each node is \(\frac{\hbox {entropy}}{\hbox {degree}}\), where entropy is computed with the community distribution of neighbors [8]. Specifically, when the neighbors are from one cluster, the entropy is small; it is large otherwise. We choose k candidates with the smallest metric value.
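This baseline metric can be sketched as follows, using the Shannon entropy of the neighbors' community distribution as one plausible choice (the exact entropy formula is not pinned down above, so treat this as an assumption):

```python
import math
from collections import Counter

def entropy_over_degree(neighbors, community_of):
    """entropy/degree score: smaller is better, so low-entropy (one-community)
    high-degree candidates are preferred."""
    degree = len(neighbors)
    counts = Counter(community_of[u] for u in neighbors)
    probs = [c / degree for c in counts.values()]
    entropy = sum(-p * math.log(p) for p in probs)  # Shannon entropy
    return entropy / degree

community_of = {1: "A", 2: "A", 3: "A", 4: "B"}
print(entropy_over_degree([1, 2, 3], community_of))          # -> 0.0 (one cluster)
print(entropy_over_degree([1, 2, 3, 4], community_of) > 0)   # -> True (mixed)
```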
Normalized cut-based algorithm This is a variation of [8] (see Sect. 5).
Modularity-based algorithm This is another variation of [8] (see Sect. 5).
7.4 Sensitivity of parameters
In this section, we briefly describe our parameter selection strategy. The proposed method has three major parameters: the number of nodes to scan, k, and the two MRF parameters \(\beta \) and \(\eta \). We varied k from 1 to 10. Theoretically, \(k=1\) gives the best result; in many cases, however, we could not find any distinctive differences even for \(k=10\). Thus, we chose the median value \(k=5\). The MRF denoising performance varies by up to 2% across different parameter settings (different values of \(\beta \) and \(\eta \)), which is not significant. We therefore conclude that our method is not very sensitive to parameter selection. In the rest of the section, we use the following parameter values as default: \(k=5\), \(\beta =8.8\) and \(\eta =1.9\).
7.5 Evaluation results
We run each experiment 20 times with random initialization of all parameters, and the average performance is shown. We conduct a threefold experimental setup—(i) the cost of scanning each node is equally set to 1 (Constant Cost); (ii) the cost varies across nodes (Varying Cost), and (iii) the impact of MRF denoising in our method.
7.5.1 Results without denoising
We discuss the experimental results of our method without MRF denoising as follows.
The greedy algorithm shows the worst performance even compared to random sampling (sometimes 20% less) as it adds a lot of noisy information to the network structure. Furthermore, it is the least stable one among all algorithms. On the other hand, modularity and normalized cut-based algorithms are expected to have better results as they use the information of candidates’ neighbors. However, our algorithm (with both node2vec and Deepwalk) outperforms these baselines for all the datasets—Deepwalk shows slightly better performance than node2vec.
Varying cost To simulate real scenarios, we conduct experiments with various costs. The cost is generated according to the Zipf distribution as suggested in [8]—all nodes are randomly shuffled and \(z(v) = 1/ind(v)^{\lambda }\), where ind(v) is the index of node v after shuffling. The cost Q(v) is the normalization of z(v) over all nodes.
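The cost generation can be sketched as follows (function and variable names are ours; `lam` is the Zipf exponent \(\lambda \)):

```python
import random

def zipf_costs(nodes, lam=1.0, seed=0):
    """z(v) = 1 / ind(v)^lambda after a random shuffle; Q(v) normalizes z over
    all nodes so the costs sum to 1."""
    order = list(nodes)
    random.Random(seed).shuffle(order)           # ind(v) = shuffled index + 1
    z = {v: 1.0 / (i + 1) ** lam for i, v in enumerate(order)}
    total = sum(z.values())
    return {v: zv / total for v, zv in z.items()}

Q = zipf_costs(range(5))
print(round(sum(Q.values()), 6))  # -> 1.0 (costs are normalized)
```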
The ACP/ACE performance with varying costs shows a similar pattern—our method achieves better performance than the baseline methods. While there are small fluctuations at some points, the overall trend remains the same. This again confirms that our strategy is superior to its competitors in community detection.
7.5.2 Performance with denoising
7.6 Summary of the experimental results
Summary of the experimental results for all the datasets
| Method | Metric | Coauthorship | Synthetic | Amazon | DBLP | YouTube | Average |
|---|---|---|---|---|---|---|---|
| [8]+Ncut | ACP | 0.76 | 0.87 | 0.52 | 0.96 | 0.62 | 0.74 |
| [8]+Ncut | ACE | 0.34 | 0.21 | 0.59 | 0.07 | 0.55 | 0.35 |
| [8]+Modu | ACP | 0.78 | 0.86 | 0.56 | 0.96 | 0.62 | 0.75 |
| [8]+Modu | ACE | 0.33 | 0.22 | 0.58 | 0.06 | 0.55 | 0.35 |
| DyPerm [38] | ACP | 0.76 | 0.85 | 0.56 | 0.96 | 0.62 | 0.75 |
| DyPerm [38] | ACE | 0.36 | 0.22 | 0.59 | 0.09 | 0.55 | 0.36 |
| QCA [37] | ACP | 0.71 | 0.83 | 0.55 | 0.94 | 0.61 | 0.73 |
| QCA [37] | ACE | 0.37 | 0.24 | 0.56 | 0.09 | 0.57 | 0.37 |
| Random | ACP | 0.69 | 0.84 | 0.55 | 0.94 | 0.62 | 0.73 |
| Random | ACE | 0.42 | 0.24 | 0.57 | 0.10 | 0.56 | 0.38 |
| Greedy | ACP | 0.72 | 0.82 | 0.52 | 0.92 | 0.62 | 0.72 |
| Greedy | ACE | 0.39 | 0.28 | 0.59 | 0.15 | 0.55 | 0.39 |
| Entropy | ACP | 0.72 | 0.85 | 0.56 | 0.96 | 0.62 | 0.74 |
| Entropy | ACE | 0.39 | 0.24 | 0.56 | 0.07 | 0.56 | 0.36 |
| Our+Deepwalk | ACP | 0.80 | 0.88 | 0.56 | 0.96 | 0.64 | 0.77 |
| Our+Deepwalk | ACE | 0.30 | 0.22 | 0.56 | 0.08 | 0.54 | 0.34 |
| Our+node2vec | ACP | 0.79 | 0.87 | 0.57 | 0.96 | 0.64 | 0.77 |
| Our+node2vec | ACE | 0.32 | 0.21 | 0.56 | 0.08 | 0.54 | 0.34 |
One thing to note is that the proposed method usually takes 5–20 times longer than the existing NetDiscover [8] algorithm. However, NetDiscover assumes that every candidate node is scanned and then selects the best one, while counting only the cost of the selected node in its scanning step. A strict time comparison between these methods is therefore not very meaningful.
8 Conclusion
In this paper, we studied a realistic setup for community detection in which the entire network is not known a priori, so one needs to progressively scan unknown nodes and update the community structure around a target node set. The problem is divided into two sub-problems, network scan and community update, and we proposed a novel method for each. In the network scan step, a new metric, information gain, was designed to decide the best node to scan. For the community update step, a combination of the EM and MRF algorithms was proposed to recover the actual community labels of the nodes.
There have been very few attempts to process an incomplete network in an incremental way, where network construction reinforces community detection, which in turn helps discover a better network structure. Most state-of-the-art algorithms assume that a static snapshot of the network is available beforehand, which may not be a realistic setting; our problem definition is therefore novel. The use of the EM framework together with Markovian denoising is our major technical novelty.
Acknowledgements
T. Chakraborty would like to acknowledge the support of Ramanujan Fellowship, Early Career Research Award (ECR/2017/001691) (SERB, DST) and the centre for Design and New Media (supported by TCS), IIIT-Delhi. Noseong Park is the corresponding author.
References
- 1. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3):75–174
- 2. Clauset A (2005) Finding local community structure in networks. Phys Rev E 72(2):026132
- 3. Luo F, Wang JZ, Promislow E (2008) Exploring local community structures in large networks. Web Intell Agent Syst Int J 6(4):387–400
- 4. Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):P10008
- 5. Kim M, Leskovec J (2011) The network completion problem: inferring missing nodes and edges in networks. In: Proceedings of the 2011 SIAM international conference on data mining, pp 47–58
- 6. Ball B, Karrer B, Newman MEJ (2011) Efficient and principled method for detecting communities in networks. Phys Rev E 84:036103
- 7. Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
- 8. Liu J, Aggarwal C, Han J (2015) On integrating network and community discovery. In: WSDM, Shanghai, China, pp 117–126
- 9. Chakraborty T, Srinivasan S, Ganguly N, Mukherjee A, Bhowmick S (2016) Permanence and community structure in complex networks. ACM Trans Knowl Discov Data 11(2):14:1–14:34
- 10. Chakraborty T, Dalmia A, Mukherjee A, Ganguly N (2017) Metrics for community analysis: a survey. ACM Comput Surv 50(4):54:1–54:37
- 11. Nassar H, Kloster K, Gleich DF (2015) Strong localization in personalized PageRank vectors. In: International workshop on algorithms and models for the web-graph. Springer, pp 190–202
- 12. Yin H, Benson AR, Leskovec J, Gleich DF (2017) Local higher-order graph clustering. In: ACM conference on knowledge discovery and data mining, pp 555–564
- 13. Liu C, Liu J, Jiang Z (2014) A multiobjective evolutionary algorithm based on similarity for community detection from signed social networks. IEEE Trans Cybern 44(12):2274–2287
- 14. Wang X, Liu J (2017) A layer reduction based community detection algorithm on multiplex networks. Physica A 471:244–252
- 15. Li Z, Liu J (2016) A multi-agent genetic algorithm for community detection in complex networks. Physica A 449:336–347
- 16. Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. In: ACM conference on knowledge discovery and data mining, pp 554–560
- 17. Zhang J, Yu PS (2015) Community detection for emerging networks. In: Proceedings of the 2015 SIAM international conference on data mining, Vancouver, Canada, pp 127–135
- 18. Cheng J, Wu X, Zhou M, Gao S, Huang Z, Liu C (2018) A novel method for detecting new overlapping community in complex evolving networks. IEEE Trans Syst Man Cybern Syst 99(99):1–13
- 19. Wang Z, Zhang D, Zhou X, Yang D, Yu Z, Yu Z (2014) Discovering and profiling overlapping communities in location-based social networks. IEEE Trans Syst Man Cybern Syst 44(4):499–509
- 20. Lin W, Kong X, Yu PS, Wu Q, Jia Y, Li C (2012) Community detection in incomplete information networks. In: International conference on world wide web, Lyon, France, pp 341–350
- 21. Wang L, Wang J, Bi Y, Wu W, Xu W, Lian B (2014) Noise-tolerance community detection and evolution in dynamic social networks. J Comb Optim 28(3):600–612
- 22. Koujaku S, Kudo M, Takigawa I, Imai H (2015) Community change detection in dynamic networks in noisy environment. In: Proceedings of the international conference on World Wide Web, pp 793–798
- 23. Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: International conference on knowledge discovery and data mining, pp 631–636
- 24. Ahmed NK, Neville J, Kompella R (2014) Network sampling: from static to streaming graphs. ACM Trans Knowl Discov Data 8(2):1–56
- 25. Baykan E, Henzinger M, Weber I (2013) A comprehensive study of techniques for URL-based web page language classification. ACM Trans Web 7(1):1–37
- 26. Gabielkov M, Rao A, Legout A (2014) Sampling online social networks: an experimental study of Twitter. ACM Comput Commun Rev 44(4):127–128
- 27. Lu J, Li D (2012) Sampling online social networks by random walk. In: International workshop on hot topics on interdisciplinary social networks research, pp 33–40
- 28. Yun S-Y, Proutiere A (2014) Community detection via random and adaptive sampling. In: Conference on learning theory, pp 138–175
- 29. Mahoney MW, Orecchia L, Vishnoi NK (2012) A local spectral method for graphs: with applications to improving graph partitions and exploring data graphs locally. J Mach Learn Res 13(8):2339–2365
- 30. Meng F, Zhang F, Zhu M, Xing Y, Wang Z, Shi J (2016) Incremental density-based link clustering algorithm for community detection in dynamic networks. Math Probl Eng 2016:1873504
- 31. Xie J, Chen M, Szymanski BK (2013) LabelRankT: incremental community detection in dynamic networks via label propagation. In: Proceedings of the workshop on dynamic networks management and mining, pp 25–32
- 32. Takaffoli M, Rabbany R, Zaïane OR (2013) Incremental local community identification in dynamic social networks. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining, pp 90–94
- 33. Zakrzewska A, Bader DA (2015) Fast incremental community detection on dynamic graphs. In: International conference on parallel processing and applied mathematics, pp 207–217
- 34. Clementi A, Di Ianni M, Gambosi G, Natale E, Silvestri R (2015) Distributed community detection in dynamic graphs. Theor Comput Sci 584:19–41
- 35. Becchetti L, Clementi A, Natale E, Pasquale F, Trevisan L (2017) Find your place: simple distributed algorithms for community detection. In: Proceedings of the 28th annual ACM-SIAM symposium on discrete algorithms, pp 940–959
- 36. Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems, Vancouver, British Columbia, Canada, pp 849–856
- 37. Nguyen NP, Dinh TN, Xuan Y, Thai MT (2011) Adaptive algorithms for detecting community structure in dynamic social networks. In: IEEE international conference on computer communications, pp 2282–2290
- 38. Agarwal P, Verma R, Agarwal A, Chakraborty T (2018) DyPerm: maximizing permanence for dynamic community detection. In: Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD), pp 437–449
- 39. Li X, Wu B, Guo Q, Zeng X, Shi C (2015) Dynamic community detection algorithm based on incremental identification. In: 2015 IEEE international conference on data mining workshop, pp 900–907
- 40. Soundarajan S, Eliassi-Rad T, Gallagher B, Pinar A (2016) MaxReach: reducing network incompleteness through node probes. In: ASONAM, San Francisco, CA, USA, pp 152–157
- 41. Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1):37–57
- 42. Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: SIGKDD, New York, USA, pp 701–710
- 43. Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: SIGKDD, San Francisco, CA, USA, pp 855–864
- 44. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. CoRR, arXiv:1301.3781
- 45. Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) LINE: large-scale information network embedding. In: WWW, Florence, Italy, pp 1067–1077
- 46. Chang S, Han W, Tang J, Qi G-J, Aggarwal CC, Huang TS (2015) Heterogeneous network embedding via deep architectures. In: SIGKDD, Sydney, Australia, pp 119–128
- 47. Seifi M, Junier I, Rouquier J-B, Iskrov S, Guillaume J-L (2013) Stable community cores in complex networks. In: Menezes R, Evsukoff A, González MC (eds) Complex networks. Springer, Berlin, Heidelberg, pp 87–98
- 48. Chakraborty T, Srinivasan S, Ganguly N, Mukherjee A, Bhowmick S (2014) On the permanence of vertices in network communities. In: Proceedings of the international conference on knowledge discovery and data mining, pp 1396–1405
- 49. Khreich W, Granger E, Miri A, Sabourin R (2010) On the memory complexity of the forward–backward algorithm. Pattern Recogn Lett 31(2):91–99
- 50. Besag J (1986) On the statistical analysis of dirty pictures. J R Stat Soc Ser B (Methodol) 48(3):259–302
- 51. Lancichinetti A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80:016118
- 52. Leskovec J, Krevl A (2014) SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data. Accessed May 2018
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.