Incremental community discovery via latent network representation and probabilistic inference

Most community detection algorithms assume that the complete network structure G = (V, E) is available in advance for analysis. However, in reality this may not be true for several reasons, such as privacy constraints and restricted access, which leave only a partial snapshot of the entire network. In addition, we may be interested in identifying the community information of only a selected subset of nodes (denoted by V_T ⊆ V), rather than obtaining the community structure of all the nodes in G. To this end, we propose an incremental community detection method that repeats two stages: (i) network scan and (ii) community update. In the first stage, our method selects an appropriate node in such a way that the discovery of its local neighborhood structure leads to accurate community detection in the second stage. We propose a novel criterion, called Information Gain, based on existing network embedding algorithms (Deepwalk and node2vec) to select the node to scan. The proposed community update stage consists of expectation–maximization and a Markov Random Field-based denoising strategy.
Experiments with 5 diverse networks with known ground-truth community structure show that our algorithm achieves 10.2% higher accuracy on average over state-of-the-art algorithms for both network scan and community update steps.


Introduction
In social network analysis, the task of community detection has been widely studied [1]. In many cases, instead of discovering the community structure of the entire network, we may wish to detect communities within a target set of nodes [2,3]. For instance, a telecommunication company may want to find the communities that its valuable customers are part of, in order to provide better facilities.
Noseong Park npark9@gmu.edu 1 Dept. of Electrical and Computer Engineering, University of Maryland, College Park, USA
Most studies in this direction assume that the overall network structure is known in advance [4]. However, in real networks, complete information is difficult or even impossible to obtain [5]. In the Facebook network, for instance, the complete linkage structure of a user is often unobtainable due to privacy constraints. Moreover, in many online social network sites, such as Twitter, many new edges are created daily. In such cases, the network structure available to the users is incomplete.
This leads us to tackle a more realistic problem setting: given an initial sub-network and a set of target nodes, our task is to progressively scan the network and explore the communities where the target nodes reside. Scanning a node means checking its social profile and retrieving its neighborhood. By adding the scanned neighbors of a node, we keep accumulating knowledge about the topology of the network. This problem setting is realistic for two reasons. First, the network information obtained is usually incomplete and hard to acquire. Second, many new relationships are created every day. Even if the network information is complete at a particular time, the network structure may need to be scanned and updated as time goes by.
However, in general the budget for scanning a network is limited, which creates two challenges. The first is that we have to scan the network carefully. With no constraint on scanning, we can explore the network in a brute-force manner until, at some point, enough nodes are scanned and the correct community membership of the target nodes is revealed. However, when there is an upper limit on the cost (i.e., a budget) to scan nodes, it is necessary to choose the nodes to scan judiciously in order to obtain the best community information for the target node set. The second challenge is to incrementally update the communities of the target nodes based on partial information. After a network scan, more nodes are discovered, and the algorithm needs to quickly and correctly update the communities of the target nodes.
To address the first challenge, we propose a metric, called information gain, for selecting a new node to scan. The key idea behind this metric is to use network embedding to find latent representations of nodes, and to compute which node (and its associated edges) has the largest information gain. This metric tracks how the latent vector representations of nodes change over a series of network updates in order to decide which node to scan that has closed edge links. If a node's latent representation changes drastically over successive updates, it is likely that its neighborhood information has also changed, and thus the node is a potential candidate to scan.
To address the second challenge, we propose a three-step incremental community update method. Step 1: an incremental local update based on expectation-maximization [6]; Step 2: an intermittent global update to correct the local update; and Step 3: Markov Random Field (MRF)-based denoising [7] to further adjust both local and global updates.
These three components, (i) network scan, (ii) EM, and (iii) MRF, are systematically combined into one framework. To maximize the community structure inference performance of the EM method, our information gain-based network scan actively searches community boundaries rather than community cores. Connections are weaker around boundaries than around cores, so our network scan method is designed to help the EM algorithm better infer community structures. Because we reveal the community structures of selected targets only, the number of hidden variables (community memberships) is not large when formulated in the EM method. This is one of the main reasons why we have chosen the EM method. Our MRF-based denoising supports the EM algorithm.
Briefly, the contributions of our paper are as follows:
- We investigate a contemporary problem which is more realistic than the traditional settings of community detection and has rarely been explored in the past [8].
- We propose a novel method that interweaves network discovery and community detection by first mapping the discovered network into an embedding space, followed by an incremental community update that adjusts the current community structure by leveraging the new information acquired in network discovery.
- We compare our method with 4 different variations of the existing state-of-the-art method [8] along with 3 other commonly used baselines on 5 diverse datasets. We observe that our proposed algorithm significantly outperforms other baselines, with a performance improvement of 8% to 17% and an average improvement of 10.2%.

Related work
In network science, the task of detecting dense modules from the network has been extensively studied for static networks [4,9] (see [1,10] for a comprehensive review). Several attempts have been made to detect local communities around a target node [2,3,11-15]. Some work has focused on detecting dynamic communities in evolving networks [16-19]. Recently, [20-22] attempted to discover communities from incomplete/noisy networks. All these methods assume that the entire network structure is available a priori, whereas we assume that only target nodes are given, and one needs to scan the nodes to explore the network structure. Every scan operation incurs a cost, and the network exploration is possible only until a certain budget is exhausted.
In the line of network discovery alone (without community detection), there has been plenty of research focusing on general sampling techniques [23,24], sampling web documents [25], and sampling social networks [26,27]. These techniques are not applicable in our setting because they initially consider the entire network structure for sampling. Moreover, none of them really focuses on discovering the underlying community structure in the sampled network. There is also a lot of work on incremental community discovery in dynamic graphs, such as [28-35]. These decentralized algorithms address scenarios similar to ours, but they do not consider the cost incurred when the algorithm makes a query or explores one or more nodes in the graph. This implies that our scenario is more difficult.
The existing work most similar to ours is NetDiscover [8], which also attempted to discover disjoint communities of a target node set. Although the problem definition is exactly the same, our method differs from theirs w.r.t. both candidate node selection and incremental community detection. In the former case, NetDiscover selects a query node based on one of two community scoring metrics (modularity and normalized cut), whereas the proposed node selection algorithm is based on a novel metric, information gain. A close inspection of NetDiscover reveals that there is information leakage while selecting the query node, in that it uses the entire network to compute the node scoring function. In the latter case, NetDiscover detects initial communities using spectral clustering [36] (where the number of communities needs to be given a priori) and adopts a generative model (GM) [6] to update communities, whereas we combine both EM and MRF algorithms for community update. To compare with existing work, we consider NetDiscover as well as the commonly used random sampling and greedy algorithms as baselines (see Sect. 7.3). We also compare our method with the Quick Community Adaptation (QCA) [37] and Dynamic Permanence (DyPerm) [38] methods. QCA is an adaptive modularity-based approach for identifying and tracking the community structure of dynamic networks. DyPerm is another modularity maximization-based approach, which incrementally updates the community structure of a network at t_i based on the community structure at t_{i−1} without detecting the community structure from scratch. We adapt these algorithms to our setting (incomplete network) to make them comparable.

Problem statement
We denote a network as G = (V, E), where V is a set of nodes and E is a set of edges. Assume that initially we do not know the entire network G; only a partial sub-network G_s = (V_s, E_s) is known. Among all the nodes in V_s, we are particularly interested in detecting the community structure of a given set of target nodes V_T ⊆ V_s. We iteratively scan nodes and explore the network with the neighborhood information of the scanned nodes. We use G_i = (V_i, E_i) to denote the intermediate network at the ith scan iteration (thus, initially G_0 = G_s). The performance of community detection and the cost incurred by scanning nodes greatly depend upon the node selection strategy (i.e., which node to scan next). We assume a function Q(v) denoting the cost to scan a node v. If v is a private node and does not allow a scan, then Q(v) = ∞. In the simplest case, the cost to scan every non-private node is the same, i.e., Q(v) = 1. However, we also adopt a more general setting with heterogeneous cost per node. In addition, the available budget B that we can invest in scanning is limited.
To this end, we consider two sub-problems: one is to decide the candidate nodes to scan next, and the other is to update the community structure incrementally.
Network scan Given a budget B, an intermediate network G i with a cost function Q associated with it, and a target node set V T , the aim of candidate selection is to decide the next candidates whose exploration of the local neighborhoods leads to the best community detection. After that, we actually scan the selected candidates, and G i+1 is generated from G i and scan results of the candidates.
Community update Given G i and G i+1 , the task of community update is to efficiently and effectively discover the community structure in G i+1 considering the community structure in G i from the last iteration.

Overall algorithm
The purpose of integrating both network scan and community update steps is to effectively determine the community structure of a set of target nodes V_T while incrementally obtaining network structure through scanning the nodes within a given budget B. Even when a budget is not specified, the proposed method should be able to progressively enhance the community structure as network scan proceeds. Our method undergoes the following steps (see Algorithm 1): (i) choose a set of k candidate nodes to scan; (ii) scan the selected nodes to obtain information that has not been discovered so far; and (iii) simultaneously update the network G_i to generate G_{i+1} and its community structure. While iterating the above procedures, the currently available network structure G_i is used as the main information source to decide candidate nodes to scan. It is critical to scan the nodes in the network in such an order that it facilitates the most efficient discovery of network communities while maintaining a low cost.
Algorithm 1 (excerpt)
5: Choose a set V of k nodes to scan
6: Scan the selected nodes in V
8: Update G_{i+1} from G_i with the scan results
Algorithm 1 alternately solves the two sub-problems we mentioned earlier. It starts by querying all the nodes in V_T (line 1). This is to ensure that all the target nodes and their local network structures are fully scanned for an initial community assignment. We handle the two sub-problems in two parts: the network scan part in lines 5-8, and the community update part in line 9. C_i is the community structure for V_T at the ith iteration, and is incrementally modified using the CommunityUpdate() function (Algorithm 2). K is the maximum number of communities the target set can have. The parameter K is internally used in the network scan and community update steps. We will discuss these two parts in detail in the following sections.
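The alternation between network scan and community update can be sketched as a short loop. This is only an illustrative outline under stated assumptions, not the authors' implementation: the callables `scan`, `cost`, `select`, and `update` are hypothetical placeholders for the scan operation, the cost function Q, the candidate selection metric, and CommunityUpdate(), and the sketch assumes that the initial scan of the target nodes is not charged against the budget.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Node = Hashable
Graph = Dict[Node, Set[Node]]  # adjacency sets of the intermediate network G_i


def incremental_discovery(
    targets: Set[Node],
    scan: Callable[[Node], Set[Node]],                  # hypothetical: full neighborhood of a node
    cost: Callable[[Node], float],                      # Q(v); float("inf") for private nodes
    select: Callable[[Graph, Set[Node], int], List[Node]],  # hypothetical candidate ranking
    update: Callable[[Graph, Set[Node], dict], dict],   # hypothetical community update
    budget: float,
    k: int = 1,
) -> Tuple[Graph, dict]:
    graph: Graph = {}
    scanned: Set[Node] = set()

    def add_scan(v: Node) -> None:
        # Merge the scanned neighborhood of v into the intermediate network.
        neighbors = scan(v)
        graph.setdefault(v, set()).update(neighbors)
        for u in neighbors:
            graph.setdefault(u, set()).add(v)
        scanned.add(v)

    # Initial step: scan every target node (assumed here to be free of charge).
    for v in targets:
        add_scan(v)
    communities = update(graph, set(targets), {})

    while True:
        # Candidates are discovered but unscanned nodes that still fit the budget.
        affordable = {v for v in graph if v not in scanned and cost(v) <= budget}
        chosen = select(graph, affordable, k) if affordable else []
        newly_scanned: Set[Node] = set()
        for v in chosen:
            if cost(v) <= budget:
                budget -= cost(v)
                add_scan(v)
                newly_scanned.add(v)
        if not newly_scanned:
            break  # budget exhausted or nothing left to scan
        communities = update(graph, newly_scanned, communities)
    return graph, communities
```

Passing the metric and the update step as callables keeps the loop independent of any particular selection criterion, so random, greedy, or information gain-based selection can be swapped in without changing the control flow.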

Network scan
The purpose of network scan is to discover unexplored parts of the network, where a node is scanned to know its full neighborhood structure. It mainly aims at exploring and acquiring more nodes and their connections, leading to the detection of better community structure. Since the effectiveness of the algorithm mainly depends on the scan sequence, the main focus of our network scan approach is to decide the best candidate nodes to be processed next given an intermediate network. The parameter k denotes the number of nodes chosen to be scanned.

Candidates for network scan Let S_i be the set of nodes scanned up to iteration (i − 1). The neighbor set is defined as N_i = {u : (u, v) ∈ E_i, v ∈ S_i}, and the candidate set C_i = N_i \ S_i contains only nodes that are not scanned among all nodes in N_i.
There are many candidate selection algorithms for network scan [39,40]. Two of them are simple but widely used in various fields: random sampling and greedy sampling [41]. We will use these methods as baselines in Sect. 7. Soundarajan et al. [40] suggested selecting the node that maximizes the total number of nodes scanned, with the aim of exploring the complete network structure. Other methods were specifically designed for community detection [8]. These methods dynamically sample nodes from an intermediate network in such a way that a certain community quality metric is expected to improve. Liu et al. [8] selected nodes in such a way that the value of 'normalized cut' decreases or 'modularity' increases. Here, we briefly describe these two metrics and how we use them to design baseline methods.
Normalized cut Given K communities at iteration i, the normalized cut, in its standard form, is defined as:
$$\mathrm{NCut} = \sum_{k=1}^{K} \frac{\mathrm{cut}(C_k)}{\mathrm{vol}(C_k)},$$
where vol(C_k) is the total degree of the nodes in community C_k and cut(C_k) is the number of edge-cuts between C_k and all other remaining communities. Minimizing this cost function minimizes the edge-cuts (connections) among different communities. The baseline algorithms used in the experiment follow Liu et al. [8]. We calculate the minimum normalized cut cost for each node in the candidate set C_i. At this step, the correct community membership of the candidates is not known, and thus all possible community assignments of the candidates have to be tested after fixing the community structures of the scanned nodes in S_i. Among all the candidates in C_i, the k nodes leading to the minimum normalized cut are selected and added to the intermediate network. Here, we assume that the community structure does not change for any node in the intermediate network except the newly added ones.
Modularity Modularity is used to evaluate the strength of partitioning a network into different communities:
$$Q = \sum_{k=1}^{K} \left[ e(C_k, S) - a(C_k, S)^2 \right],$$
where e(C_k, S) is the fraction of edges for which both nodes are in the same community C_k, and a(C_k, S) is the fraction of edges for which at least one node is in C_k. A high modularity value means dense connections between the nodes in a community but sparse connections between the nodes in different communities. We take the same approach as in normalized cut, i.e., testing all possible community membership assignments of the candidates, and choosing the k candidate nodes that lead to the maximum modularity [8].
Here, we propose a new candidate selection method that does not incur any additional hidden cost. We utilize Deepwalk [42] and node2vec [43] to obtain latent feature vectors of nodes, i.e., network embedding into feature space. We first briefly introduce Deepwalk and node2vec and then discuss our proposed algorithm.

Network embedding algorithms
Deepwalk [42] and node2vec [43] are deep learning-based embedding methods to learn latent representations of nodes in a network. Deepwalk encodes social relations into a continuous vector space after modeling a series of random walks with a Natural Language Processing method. The key idea is that each visited node during a random walk can be considered as a word, and the random walk corresponds to a sentence. Node2vec learns a latent feature vector that maximizes the likelihood of maintaining the neighborhoods of the node.
The feature representation framework generally consists of two main parts: a random walk generator and a representation update. Both of the above two frameworks have the common generator, and for representation updates, Deepwalk uses SkipGram [44] and node2vec uses a modified SkipGram with customizations. The generator takes a graph as input and randomly samples a path of a given length from the starting node which is uniformly chosen over all possible nodes in the network. Each node is a neighbor of the previous node in the path. Deepwalk and node2vec are both scalable, and their effectiveness for community detection is shown in [42,43,45,46].
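The random walk generator described above can be illustrated with a few lines of code. This is a minimal sketch of a uniform (Deepwalk-style) walk only; the function names and the adjacency-list input format are assumptions made for illustration, and the biased walk of node2vec and the SkipGram update are not covered.

```python
import random
from typing import Dict, Hashable, List, Sequence

Node = Hashable


def random_walk(adj: Dict[Node, Sequence[Node]], start: Node,
                walk_length: int, rng: random.Random) -> List[Node]:
    """Sample one path; every node is a neighbor of the previous node."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj.get(walk[-1], ())
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(adj: Dict[Node, Sequence[Node]], num_walks: int,
                   walk_length: int, seed: int = 0) -> List[List[Node]]:
    """Start each walk from a node chosen uniformly over all nodes."""
    rng = random.Random(seed)
    nodes = list(adj)
    return [random_walk(adj, rng.choice(nodes), walk_length, rng)
            for _ in range(num_walks)]
```

Each returned walk plays the role of a sentence whose words are node identifiers, which is the form of input a SkipGram-style model expects.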

Terminology 1 (Latent representation)
A latent vector representation of a node v in G i generated by a network embedding algorithm is an abstracted neighborhood information of v. Thus, two nodes with similar latent vectors are close neighbors.

Proposed candidate selection method-information gain
The proposed algorithm is based on the latent representation described above. In each iteration, we run a network embedding algorithm to update the representation of the nodes in the network. Let LR_i(v) be a vector representing the latent information of a node v ∈ V_i.

Information gain The information gain of a node v at iteration i is defined as
$$IG_i(v) = \left\| LR_i(v) - LR_{i-1}(v) \right\|_p, \quad p \in \{1, 2\},$$
where || · ||_1 (resp. || · ||_2) means the L_1 (resp. L_2) vector norm. The higher the information gain, the greater the changes in network structures around v after the last scan.
If a candidate node c is discovered and added in the ith iteration (i.e., c ∉ V_{i−1}), it has no latent representation from the previous iteration, and its information gain is computed with a factor α, where α accounts for the penalty for missing information in the last iteration. A node with low information gain has a very stable and rigid neighborhood structure. It is unlikely that scanning such stable neighborhoods brings any drastic update in community structure. Thus, we choose the top k nodes with the highest information gain. Throughout our experiments, the L_2 norm is used for information gain, and α is set to 1.
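The selection rule can be made concrete with a small numerical sketch: information gain is taken as the norm of the change in a node's latent vector between two consecutive iterations, and the k candidates with the largest values are chosen. The dictionary-of-vectors input and the function names are hypothetical. The treatment of newly discovered nodes is an assumption made only for illustration: the sketch compares their vector against a zero vector and scales the result by α, which may differ from the exact rule used in the paper.

```python
from typing import Dict, Hashable, Iterable, List

import numpy as np

Node = Hashable


def information_gain(prev: Dict[Node, np.ndarray], curr: Dict[Node, np.ndarray],
                     alpha: float = 1.0, order: int = 2) -> Dict[Node, float]:
    """Norm of the change in each node's latent vector (order=1 or 2)."""
    gains = {}
    for v, vec in curr.items():
        if v in prev:
            gains[v] = float(np.linalg.norm(vec - prev[v], ord=order))
        else:
            # ASSUMPTION: a node without a previous vector is compared
            # against the zero vector and penalized by alpha.
            gains[v] = alpha * float(np.linalg.norm(vec, ord=order))
    return gains


def top_k_candidates(gains: Dict[Node, float], candidates: Iterable[Node],
                     k: int) -> List[Node]:
    """Return the k unscanned candidates with the highest information gain."""
    return sorted(candidates, key=lambda v: gains.get(v, 0.0), reverse=True)[:k]
```

For example, a node whose vector moves from (0, 0) to (3, 4) has an L_2 information gain of 5, while a node whose vector does not move has a gain of 0 and is ranked last.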

Comparison of node selection metrics
In this section, we analyze the characteristics of three node selection metrics: normalized cut, modularity, and information gain. We first suggest two key factors to be considered while scanning a network:
1. Scan as many communities as possible, and make the number of scanned nodes in each community as even as possible.
2. Scan actively around community boundaries (rather than the cores of the communities).
These two actions are crucial for the EM algorithm to better identify community structure. In general, community cores have very dense connections and are easier to detect [47]. However, connections are weak around community boundaries because their community memberships are ambiguous in many cases [9]. To help the EM algorithm in this hostile situation, we scan more around those community boundaries and evenly for each community. Our information gain is designed to meet these key factors.
In Table 1, we summarize key statistics of the three metrics that show how good a metric is w.r.t. the above two factors. We define the key statistics as follows:
1. Given a selected node v and its ground-truth community C_v, the ratio of the discovered nodes to the size of C_v, i.e., (number of discovered nodes so far in C_v) / |C_v|.
2. The number of ground-truth communities that v's neighbors belong to.
Table 1 (Information Gain row): first statistic 0.12, second statistic 4.7. The best methods are shown in bold.
In each iteration, the statistics of the top-5 candidates are collected, and the average across iterations is reported. The proposed information gain shows the lowest value for the first statistic and the highest for the second statistic, which indicates that it can (i) scan more communities than the others (so that the average number of discovered nodes in a community is smaller than for the others given the same budget) and (ii) scan around community boundaries more actively.
Algorithm 2 CommunityUpdate(Intermediate network: G_i, A set of scanned nodes: V, Community structure: C_i, Max community numbers: K)
1: for each c ∈ V do
2:     // Let N_c be a set of neighbors of c. N_c was updated after scanning c.
3:     Initialize edge parameters q_cw(h), where w ∈ N_c
4:     while parameters have not converged do
5:         for each node v in N_c do
6:             Update θ_vh
7:             Update q_vw(h), where w ∈ N_c
8:     if the global update condition is met then
9:         // This part is a global update.
10:        for each community h in C_v do
11:            Update θ_vh for all nodes in G_i
12:            Update q_vw(h) for all edges in G_i
13: for each node w in V_i do
14:     // Label_w means the community of w.
15:     Update Label_w = arg max_k θ_wk
16: // Enhance the local and global update results using the Markov Random Field (MRF) denoising technique.
17: MRFdenoising(Label, G_i, C_i, K)

Community structure update
The task of this step is to update the community membership of nodes in an intermediate network G_i based on (i) the community structure of G_{i−1}, and (ii) new nodes discovered after scanning. Existing approaches involve both local and global updates. While local updates only consider new edges and nodes, global updates consider the whole intermediate network G_i. The proposed update process consists of three steps. Step 1: an incremental local update based on expectation-maximization [6]; Step 2: an intermittent global update to correct the local update; and Step 3: MRF denoising [7] to further adjust both the local and the global updates. The expectation-maximization (EM) algorithm is originally a part of the generative model suggested in [6]. The local update has lower computational complexity. However, it may introduce errors due to the lack of information about the whole intermediate network.
The EM algorithm itself is very efficient but has a limitation: when the number of hidden variables (i.e., community memberships in our case) to learn is large, it is known to be suboptimal. For our targeted community detection, however, the number of hidden variables is not as large as in the usual full community detection problem, so we think that the EM method is still affordable. Alternatives are other moment-based or spectral-based methods, e.g., hidden Markov model (HMM). However, the inference algorithm of HMM is not as cheap as that of the EM method [49]. We think that the EM algorithm is a good choice in our case considering the relatively small number of hidden variables to infer.
Thus, the EM algorithm is applied in every iteration, and a global update is run periodically. After that, we further reduce errors introduced by the network updates using the Markov Random Field (MRF). We fully customize both the EM and the MRF algorithms in the proposed community structure update method.

Expectation-maximization algorithm
The EM algorithm is an iterative method to find the maximum likelihood or maximum a posteriori estimate of parameters. It consists of parameter learning and estimation processes. For nodes in the network, the model is parameterized by θ_vh, which represents the propensity of a node v to have edges in a community h. θ_vh can be understood as a parameter that characterizes the number of edges. The product θ_vh · θ_wh is the expected number of edges in the community h that lie between nodes v and w. Let A be a matrix whose elements represent the number of edges between nodes. The number of edges, i.e., A_vw, is Poisson distributed around the expected value, according to the generative model in [6]. Thus, the probability of generating a graph G (omitting the self-edge terms of [6] for brevity) is:
$$P(G \mid \theta) = \prod_{v<w} \frac{\left(\sum_{h} \theta_{vh}\theta_{wh}\right)^{A_{vw}}}{A_{vw}!} \exp\left(-\sum_{h} \theta_{vh}\theta_{wh}\right).$$
We follow [6] for updates in the EM step of Algorithm 2. q_vw(h) is a parameter in the update process:
$$q_{vw}(h) = \frac{\theta_{vh}\theta_{wh}}{\sum_{h'} \theta_{vh'}\theta_{wh'}}, \qquad \theta_{vh} = \frac{\sum_{w} A_{vw}\, q_{vw}(h)}{\sqrt{\sum_{v,w} A_{vw}\, q_{vw}(h)}}.$$
The community label of a node v is the h that maximizes the parameter θ_vh. The CommunityUpdate() function iterates the procedure for each scanned node c ∈ V. In Algorithm 2, for the local update, only edges that are linked to the nodes in N_c are initialized and updated (lines 3-7). θ_vh and q_vw(h) can be updated accordingly. Strictly speaking, in the maximization step the parameters θ_wh of all nodes w ∈ V_i should be updated as θ_vh changes, because they are also affected by nodes that are not in the locality of v. The local update usually produces only small errors because G_i and G_{i+1} are very similar in many cases.
However, it is also important to ensure that such errors do not become cumulative. Therefore, in Algorithm 2, a global update process (lines 10-12) is executed when the number of edges increases by at least 10% (i.e., the global update condition in line 8). Note that as network size increases, the global update happens less frequently.
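The global update condition can be written as a one-line check. The sketch below assumes that the 10% growth is measured relative to the number of edges at the previous global update, which the text does not state explicitly; the function and parameter names are hypothetical.

```python
def should_run_global_update(num_edges_now: int, num_edges_at_last_global: int,
                             threshold: float = 0.10) -> bool:
    """True when the edge count has grown by at least `threshold` (10% by default)."""
    if num_edges_at_last_global == 0:
        # ASSUMPTION: with no previous reference point, any edge triggers an update.
        return num_edges_now > 0
    growth = (num_edges_now - num_edges_at_last_global) / num_edges_at_last_global
    return growth >= threshold
```

Because the threshold is relative, a network with 100 edges needs about 10 new edges to trigger a global update while one with 10,000 edges needs about 1,000, which is why the global update becomes less frequent as the network grows.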
We are mainly interested in detecting the community structure of V_T. This does not mean that only θ_vh (where v ∈ V_T) needs to be calculated, because θ_vh is strongly entangled with the parameters of the other nodes. After the EM step, we assign an updated community label to each node in G_i (lines 13-15 of Algorithm 2). Lastly, we perform one more denoising process (line 17) after updating for the scanned nodes.
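For readers who want to experiment with the EM step, the following sketch runs the update rules of the Poisson generative model of [6] globally on a small dense adjacency matrix: the E-step sets q_vw(h) proportional to θ_vh · θ_wh, and the M-step sets θ_vh to the sum over w of A_vw q_vw(h) divided by the square root of the same sum over all node pairs. It is an assumption-laden illustration rather than the customized local/global procedure of Algorithm 2; the function name, the optional `theta0` start, and the random initialization are not taken from the paper.

```python
import numpy as np


def em_communities(A, K, theta0=None, n_iter=200, tol=1e-8, seed=0):
    """Fit theta[v, h] for a symmetric adjacency matrix A (n x n) and K communities."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if theta0 is None:
        # ASSUMPTION: random positive start; the source does not fix an initialization.
        theta = np.random.default_rng(seed).random((n, K)) + 0.1
    else:
        theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        # E-step: q[v, w, h] = theta[v, h] * theta[w, h] / sum over h'.
        prod = theta[:, None, :] * theta[None, :, :]
        q = prod / np.maximum(prod.sum(axis=2, keepdims=True), 1e-300)
        # M-step: theta[v, h] = sum_w A[v, w] q[v, w, h] / sqrt(sum_{v, w} A[v, w] q[v, w, h]).
        weighted = A[:, :, None] * q
        denom = np.sqrt(np.maximum(weighted.sum(axis=(0, 1)), 1e-300))
        new_theta = weighted.sum(axis=1) / denom
        converged = np.abs(new_theta - theta).max() < tol
        theta = new_theta
        if converged:
            break
    # The community label of each node is the h that maximizes theta[v, h].
    return theta, theta.argmax(axis=1)
```

The dense (n, n, K) arrays keep the sketch short but are only practical for small networks; an edge-list formulation would be needed at the scale of the datasets used in the experiments.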

Markov random field (MRF) denoising
The results of the community update operation from the EM algorithm provide a good indication of actual community membership labels. However, errors naturally inhere in the process because it is an approximation of the true community assignment. We attempt to further eliminate the errors from the estimated community assignment. In this paper, we utilize the conditional independence and clique factorization properties [7] of the Markov Random Field (MRF). Simply put, we can consider MRF as a generalization of the Markov chain concept to graphs (Fig. 2). Thus, a node's community is decided by its neighbors' community memberships. In fact, MRF is one of the most popular graphical inference methods, alongside Bayesian networks and belief propagation.
Note that the (observed) community results of EM algorithm are obtained from the (hidden) noise-free community structure. Our goal is to infer the noise-free hidden community structure from the results observed in the EM algorithm. Let o v ∈ {1, 2, . . . , K } be the observed label of a node v ∈ G i , and h v be the hidden actual community label. Given the observed noisy labels of all nodes, our goal is to recover the original noise-free community labels, considering the network connectivity of G i .
Since the noise level is likely to be small, there is a strong correlation among o_v, h_v and h_u, where v and u are neighbors of each other. This prior knowledge can be captured using MRF. This graphical model has two types of cliques, and each of them involves two nodes. The cliques of the form {h_v, o_v} have an energy function that expresses the correlation between the two. Since there are multiple community labels, a closed form of the energy function between all observed and hidden label pairs is as follows:
$$\eta \sum_{v \in V_i} \mathbb{1}[h_v \neq o_v].$$
For each pair of labels, the energy penalty is equal to 1 only if the community membership is different, and 0 otherwise. This is desirable because it leads to a lower energy (i.e., a higher probability) when the labels are the same, and a higher energy otherwise. η is a positive constant that needs calibration.
The other cliques contain pairs of neighboring hidden node labels. Thus, (v, u) ∈ E i for a pair h v and h u . Similarly, the energy is expected to be low when two neighbors have the same community label.
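The complete energy function used to define a joint distribution is as follows:
$$E(h, o) = \eta \sum_{v \in V_i} \mathbb{1}[h_v \neq o_v] + \beta \sum_{(v,u) \in E_i} \mathbb{1}[h_v \neq h_u], \qquad p(h, o) = \frac{1}{Z} \exp\left(-E(h, o)\right),$$
where β is a positive constant weighting the cliques of neighboring hidden labels and Z is a normalizing constant. To achieve better community updates, we wish to find h having a high probability p(h, o). There are many algorithms to solve this optimization problem. Here, we use the iterated conditional modes (ICM) technique [50], which is a coordinate-wise gradient ascent method.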
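The coordinate-wise nature of ICM can be seen in a short sketch: each node in turn takes the label that minimizes its local energy, given the observed label of the node and the current hidden labels of its neighbors. The code assumes labels numbered 0 to K − 1, a weight `beta` on the neighbor cliques (a name introduced here for illustration), and ties broken in favor of the current label; none of these details is confirmed by the source.

```python
from typing import Dict, Hashable, Iterable

Node = Hashable


def icm_denoise(adj: Dict[Node, Iterable[Node]], observed: Dict[Node, int], K: int,
                eta: float = 1.0, beta: float = 1.0,
                max_sweeps: int = 20) -> Dict[Node, int]:
    """Iterated conditional modes on a Potts-style energy with labels 0..K-1."""
    hidden = dict(observed)  # start from the noisy EM labels

    def local_energy(v: Node, label: int) -> float:
        # Penalty for disagreeing with the observed label of v ...
        energy = eta * (label != observed[v])
        # ... plus a penalty for every neighbor carrying a different hidden label.
        energy += beta * sum(label != hidden[u] for u in adj[v])
        return energy

    for _ in range(max_sweeps):
        changed = False
        for v in adj:
            best_label = hidden[v]
            best_energy = local_energy(v, best_label)
            for label in range(K):
                energy = local_energy(v, label)
                if energy < best_energy:  # strict: ties keep the current label
                    best_label, best_energy = label, energy
            if best_label != hidden[v]:
                hidden[v] = best_label
                changed = True
        if not changed:
            break  # a full sweep without changes is a local optimum
    return hidden
```

Every accepted change strictly lowers the total energy, so the sweeps terminate at a local optimum; a larger `eta` keeps the result closer to the EM labels, while a larger `beta` smooths labels more aggressively across edges.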

Experiments
In this section, we start by presenting the metrics used for evaluation, followed by detailed experimental results.

Evaluation metrics
The effectiveness should be measured in comparison with the true community labels of nodes. Note that the measure considers only the target nodes, since the goal of our community update is to obtain a better community structure of the target set of nodes. There are in total at most K different communities in the network; let n_1, n_2, ..., n_m be the numbers of target nodes in the m different communities detected by the algorithm. The value of m may not be equal to the total number of communities K. Let f_ij be the fraction of target nodes in the estimated ith community that belong to the jth true community. Thus, we can find the true community that is most likely equivalent to the ith predicted community by arg max_{j∈{1,2,...,K}} f_ij. In particular, F_i = max_{j∈{1,2,...,K}} f_ij. The reason we do this is that the estimated communities may not be in the same order as the true labels; we have to find the mapping between estimated and true communities. Also, F_i is always in (0, 1]. In the ideal situation, when all the nodes in the ith estimated community have the same true label, F_i is equal to 1. The Average Cluster Purity (ACP) [8] is
$$\mathrm{ACP} = \frac{\sum_{i=1}^{m} n_i F_i}{\sum_{i=1}^{m} n_i}.$$
A higher value of ACP indicates a better quality of the community structure. As the algorithm produces more estimated communities (i.e., m gets larger), ACP tends to improve. Therefore, ACP may not always be a good metric; in particular, when the number of communities is large, the ACP metric may be smaller compared with a small number of communities. We therefore also evaluate the performance using another measure, called Average Cluster Entropy (ACE), which, unlike F_i, takes all the fractions f_ij of an estimated community into account. Simply put, the entropy E_i for an estimated community i is
$$E_i = 1 - \sum_{j=1}^{K} f_{ij}^2,$$
and ACE is the average of E_i over the estimated communities, weighted by n_i in the same way as ACP. A lower value of ACE indicates a better community structure. An estimated community i that consists only of nodes from the same true community has the lowest entropy E_i = 0, and if the true labels of the estimated community i are evenly distributed over K different true communities, it has entropy (1 − 1/K).
We use five different networks with various sizes and characteristics.
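Both evaluation measures are easy to compute once the predicted and true labels of the target nodes are available. The sketch below rests on two assumptions that should be checked against [8]: the per-community values are averaged with weights proportional to community size n_i, and the entropy of an estimated community is the Gini-style quantity 1 − Σ_j f_ij², which is 0 for a pure community and 1 − 1/K for an even spread over K true communities. The function name and the dictionary inputs are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Tuple

Node = Hashable


def purity_and_entropy(pred: Dict[Node, Hashable],
                       true: Dict[Node, Hashable]) -> Tuple[float, float]:
    """Return (ACP, ACE) over the target nodes listed in `pred`."""
    groups = defaultdict(list)
    for v, community in pred.items():
        groups[community].append(true[v])  # true labels inside each estimated community

    total = sum(len(labels) for labels in groups.values())
    acp = 0.0
    ace = 0.0
    for labels in groups.values():
        n_i = len(labels)
        fractions = [count / n_i for count in Counter(labels).values()]  # f_ij
        acp += n_i * max(fractions)                          # F_i = max_j f_ij
        ace += n_i * (1.0 - sum(f * f for f in fractions))   # E_i = 1 - sum_j f_ij^2
    return acp / total, ace / total
```

For instance, with one pure estimated community of two nodes and one estimated community split evenly between two true communities, the sketch gives ACP = 0.75 and ACE = 0.25.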

Datasets
There are very few publicly available networks with disjoint ground-truth communities. We use the following five networks, of various sizes and characteristics, in our experiments (see Table 2 for the statistics): (i) The DBLP network was collected by Liu et al. [8]. In this dataset, authors are the nodes, and two authors are connected by an edge if they have co-authored a paper. Liu et al. [8] considered 115 authors from four real research groups, led by Prof. Jiawei Han, Prof. Christos Faloutsos, Prof. Dan Roth, and Prof. Michael Jordan, as target nodes.
(ii) The Coauthorship network was released by [48]. It contains authors in Computer Science as nodes, and edges represent the co-author relationship. There are 24 ground-truth communities representing different research areas (Algorithm, AI, NLP, ML, etc.). An author may have worked in multiple fields, which causes the communities to overlap. To make them disjoint, we follow the approach in [48]: each author is assigned to the research community in which they have published the most papers. A total of 1374 target nodes constituting 10 communities are randomly selected.
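The majority-based assignment described above can be sketched as follows. The input format (one area label per paper for each author) and the function name are hypothetical, and the tie-breaking rule is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def assign_primary_area(paper_areas):
    """Map each author to the area in which they published the most papers.

    paper_areas: dict author -> list of area labels, one per paper
    (hypothetical input format). Ties go to the area encountered first,
    which is an assumption; authors without papers are skipped.
    """
    return {
        author: Counter(areas).most_common(1)[0][0]
        for author, areas in paper_areas.items()
        if areas
    }
```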
(iii) The Synthetic network is an LFR network [51] consisting of 36,000 nodes. The average degree of a node is set to 8, and the number of nodes in a community is set in the range [50, 100]. The target nodes are randomly sampled from 10 communities, with at least 20 nodes in each selected target community.
Further, in our experiments we adopt two standard networks that contain known overlapping community structure [52], and pre-process them as follows: from each such network, we select as target nodes those nodes whose communities are completely disjoint. Thus, even though the underlying community structure of the entire network is overlapping, the ground-truth communities around the target nodes are disjoint. The networks are as follows: (iv) The Amazon network [52] is a co-purchase network in which the nodes are products, and two products are connected if they have been co-purchased by at least one customer. Products from the same category define a community. We randomly select 602 nodes constituting 20 communities that have no overlap.
(v) The YouTube network [52] consists of users as nodes and friendships as edges. The ground-truth communities are user-defined groups. We randomly select 800 nodes constituting 40 communities that have no overlap.
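One possible reading of this pre-processing step is to keep only those ground-truth communities that share no node with any other community, and then sample the target nodes from them. The sketch below illustrates only the filtering part; the function name and the input format are hypothetical, and the exact selection procedure used in the experiments may differ.

```python
def disjoint_communities(communities):
    """Keep only communities that share no node with any other community.

    communities: dict community_id -> set of node ids (hypothetical format).
    """
    # Count in how many communities each node appears.
    membership_count = {}
    for nodes in communities.values():
        for node in nodes:
            membership_count[node] = membership_count.get(node, 0) + 1

    # A community is kept only if all of its nodes belong to it alone.
    return {
        cid: nodes
        for cid, nodes in communities.items()
        if all(membership_count[node] == 1 for node in nodes)
    }
```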

Baseline algorithms
We consider several baseline methods for the two sub-problems (network scan and community update) separately. As the performance of a community detection algorithm depends critically on the order in which nodes are scanned, we test commonly used network scan algorithms while keeping the same community update, stated in Lines 3-15 of Algorithm 2, without the MRF denoising step. In particular, the following strategies mentioned in Sect. 5 were tested.
- Random sampling: This algorithm randomly picks $k$ nodes from Zipf (exponential) distribution in the intermediate network $G_i$ that have not been scanned, and searches their neighbors.
- Greedy sampling: This algorithm selects the $k$ nodes with the largest degree in the candidate node set.
- Ratio of degree and entropy combination algorithm: This approach combines the greedy algorithm with the community membership of the one-hop neighbors of the scanned node. The metric we compute for each node is entropy/degree, where the entropy is computed from the community distribution of the neighbors [8]. Specifically, when the neighbors are from one cluster, the entropy is small; it is large otherwise. We choose the $k$ candidates with the smallest metric value.
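The greedy and ratio-based strategies can be sketched as follows. This is an illustration under stated assumptions and not the original implementation of [8]: the entropy is taken to be the Shannon entropy of the community labels among a candidate's known neighbors, candidates without labeled neighbors are given entropy 0, and all function and argument names are hypothetical.

```python
import math
from collections import Counter


def greedy_sampling(candidates, degree, k):
    """Pick the k candidates with the largest degree."""
    return sorted(candidates, key=lambda v: degree[v], reverse=True)[:k]


def neighbor_entropy(v, neighbors, community):
    """Shannon entropy (assumed form) of community labels among v's neighbors.

    Neighbors without a known community label are ignored; a node with no
    labeled neighbor gets entropy 0, which is a simplification.
    """
    labels = [community[u] for u in neighbors[v] if u in community]
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def ratio_sampling(candidates, degree, neighbors, community, k):
    """Pick the k candidates with the smallest entropy/degree ratio."""
    def score(v):
        # max(..., 1) guards against division by zero for isolated nodes.
        return neighbor_entropy(v, neighbors, community) / max(degree[v], 1)
    return sorted(candidates, key=score)[:k]
```

A candidate whose neighbors all lie in one community has entropy 0 and is therefore preferred by the ratio criterion, while a high degree lowers the score of candidates with mixed neighborhoods.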
We once again emphasize that the last two strategies were mistakenly utilized by Liu et al. [8]; however, we use their original implementation without any modification. We will show that, despite the information leakage in [8], our method still outperforms them across different datasets. We also compare our network scan algorithm with and without the MRF denoising step to better illustrate the improvement due to the MRF denoising. In addition, we compare our method with two other incremental community detection methods: (i) QCA, a framework that uses a modularity-based approach for incremental community detection [37], and (ii) DyPerm, which maximizes permanence, a local community-centric measure, to detect communities [38]. DyPerm was shown to outperform most of the state-of-the-art incremental community detection methods (Fig. 3).

Sensitivity of parameters
In this section, we briefly describe our parameter selection strategy. The proposed method has three major parameters: the number of nodes to scan, $k$, and two MRF parameters, $\beta$ and $\eta$. We varied $k$ from 1 to 10. Of course, $k = 1$ theoretically gives the best result. In many cases, however, we could not find any distinctive differences even for $k = 10$. Thus, we have chosen the median value $k = 5$. The MRF denoising performance varies by up to 2% across different parameter settings (different values of $\beta$ and $\eta$), which may not be significant. We can therefore conclude that the result of our method is not very sensitive to parameter selection. In the rest of the section, we use the following parameter values as default: $k = 5$, $\beta = 8.8$ and $\eta = 1.9$.

Evaluation results
We run each experiment 20 times with random initialization of all parameters and report the average performance. We conduct a threefold experimental setup: (i) the cost of scanning each node is equally set to 1 (Constant Cost); (ii) the cost varies across nodes (Varying Cost); and (iii) the impact of MRF denoising in our method.

Results without denoising
We discuss the experimental results of our method without MRF denoising as follows.
Constant cost In this experiment, the cost of scanning a node is set to 1. Figure 4 shows the performance of different network scan algorithms for all the networks. The performance of each algorithm is shown with increasing values of the budget on the x-axis and the ACP/ACE on the y-axis. Our proposed algorithm outperforms the other baselines significantly in all cases. This supports the advantage of information gain over the other scan metrics.
The greedy algorithm shows the worst performance, even compared to random sampling (sometimes 20% less), as it adds a lot of noisy information to the network structure. Furthermore, it is the least stable among all algorithms. On the other hand, modularity and normalized cut-based algorithms are expected to have better results, as they use the information of the candidates' neighbors. However, our algorithm (with both node2vec and Deepwalk) outperforms these baselines for all the datasets, with Deepwalk showing slightly better performance than node2vec.
Varying cost To simulate real scenarios, we conduct experiments with various costs. The cost is generated according to the Zipf distribution as suggested in [8]: all nodes are randomly shuffled and $z(v) = 1/\mathrm{ind}(v)^{\lambda}$, where $\mathrm{ind}(v)$ is the index of node $v$ after shuffling. The cost $Q(v)$ is the normalization of $z(v)$ over all nodes.
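A minimal sketch of this cost generation is given below. It assumes that "normalization" means dividing each $z(v)$ by the sum over all nodes, so that the costs add up to 1, and that indices start at 1; the function name, the default value of $\lambda$, and the seed argument are hypothetical.

```python
import random


def zipf_costs(nodes, lam=1.0, seed=None):
    """Assign each node a cost Q(v) proportional to 1 / ind(v)**lam.

    ind(v) is the 1-based position of v after a random shuffle. Q is
    normalized to sum to 1, which is one reading of "normalization"
    (an assumption). Nodes are expected to be unique and hashable.
    """
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    z = {v: 1.0 / (rank ** lam) for rank, v in enumerate(order, start=1)}
    total = sum(z.values())
    return {v: value / total for v, value in z.items()}
```

With $\lambda = 1$ and five nodes, for instance, the most expensive node costs five times as much as the cheapest one.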
The ACP/ACE performance with varying costs shows a similar pattern: our method achieves better performance than the baseline methods. While there are small fluctuations at some points, the overall trend remains almost the same. This again indicates that our strategy is superior to its competitors in community detection.

Performance with denoising
Here, we show the results after including the MRF denoising mentioned in Sect. 6.2. Figure 5 shows the results before and after applying the MRF denoising step. For simplicity, we only show the ACP results (the ACE results have similar trends). We observe that the MRF denoising can improve the ACP/ACE by 2% to 8%. The MRF denoising also improves the performance of baseline methods, e.g., random sampling in Fig. 5. We argue that the MRF denoising has a positive effect on recovering the actual hidden community memberships, and that it can be generalized to many other strategies.

Summary of the experimental results
For all five networks, we report the performance (in terms of both ACP and ACE) of all the competing methods. We consider our complete method (node2vec/Deepwalk + EM algorithm + MRF denoising) and the other existing baselines (without any modification), using the largest budget with constant cost and the default parameter setting. Table 3 shows that, for all the networks, our method is as good as the best baseline [8] or even better; bold values highlight the best performance, and the higher (lower) the value of ACP (ACE), the better the performance. The ACP (ACE) values of our Deepwalk-based and node2vec-based methods, averaged over all the datasets, are the same, 0.77 (0.34), followed by [8]+Modu and [8]+NCut. In short, our method matches or beats the existing baselines on all the datasets. Note that the proposed method usually takes 5 to 20 times longer than the existing NetDiscover algorithm [8]. However, NetDiscover assumes that every candidate node has been scanned and selects the best one, while counting only the cost of the selected node for the scanning step. A strict comparison of running time between these methods may therefore not be very useful.

Conclusion
In this paper, we studied a realistic setup for community detection: the entire network is not known a priori, and therefore one needs to progressively scan unknown nodes and update the community structure around a target node set. The problem is divided into two sub-problems, network scan and community update, and we proposed a novel method for each. In the network scan step, a new metric, information gain, was designed to decide the best node to scan. A combination of the EM and MRF algorithms was proposed for the community update step to further recover the actual community labels of the nodes.
The major advances that the present work provides in the field of community detection are as follows:
- Very few attempts have been made to process an incomplete network incrementally, such that network construction reinforces community detection, which in turn helps discover better network structure. Most of the state-of-the-art algorithms assume that a static snapshot of the network is available beforehand, which may not be a realistic setting. Therefore, our problem definition is novel.
- The use of the EM framework and Markovian denoising is the major technical novelty.
Our proposed method consistently achieved better performance (on average 10.2% higher than the best baseline) across five different datasets. We also make our experimental code available in the spirit of reproducible research: https://github.com/ZheCui/MRFCommDetect.