Abstract
Core decomposition is a classic technique for discovering densely connected regions in a graph, with a large range of applications. Formally, a k-core is a maximal subgraph where each vertex has at least k neighbors. A natural extension of a k-core is a (k, h)-core, where each node must be able to reach at least k other nodes with a path of length at most h. The downside in using (k, h)-core decomposition is the significant increase in the computational complexity: whereas the standard core decomposition can be done in \({{\mathcal {O}}}{\left( m\right) }\) time, the generalization can require \({{\mathcal {O}}}{\left( n^2m\right) }\) time, where n and m are the number of nodes and edges in the given graph. In this paper, we propose a randomized algorithm that produces an \(\epsilon \)-approximation of (k, h)-core decomposition with a probability of \(1 - \delta \) in \({{\mathcal {O}}}{\left( \epsilon ^{-2} hm (\log ^2 n - \log \delta )\right) }\) time. The approximation is based on sampling the neighborhoods of nodes, and we use a Chernoff bound to prove the approximation guarantee. We also study distance-generalized dense subgraphs, show that the problem is NP-hard, provide an algorithm for discovering such graphs with approximate core decompositions, and provide theoretical guarantees for the quality of the discovered subgraphs. We demonstrate empirically that approximating the decomposition complements the exact computation: computing the approximation is significantly faster than computing the exact solution for the networks where computing the exact solution is slow.
1 Introduction
Core decomposition is a classic technique for discovering densely connected regions in a graph. The appeal of core decomposition is a simple and intuitive definition, and the fact that the core decomposition can be computed in linear time. Core decomposition has a large range of applications such as graph visualization [1], graph modeling [4], social network analysis [23], internet topology modeling [7], influence analysis [19, 27], bioinformatics [2, 16], and team formation [5].
More formally, a k-core is a maximal subgraph in which every vertex has degree at least k. The k-cores form a nested structure: the \((k + 1)\)-core is a subset of the k-core, and the core decomposition can be discovered in linear time [3]. Core decomposition has been extended to directed [14], multilayer [12], temporal [13], and weighted [24] networks.
A natural extension of core decomposition, proposed by Bonchi et al. [6], is a distance-generalized core decomposition or (k, h)-core decomposition, where the degree is replaced by the number of nodes that can be reached with a path of length h. Here, h is a user parameter and \(h = 1\) reduces to a standard core decomposition. Using distance-generalized core decomposition may produce a more refined decomposition [6]. Moreover, it can be used when discovering h-clubs, distance-generalized dense subgraphs, and distance-generalized chromatic numbers [6].
Studying such structures may be useful in graphs where paths of length h reveal interesting information. For example, assume an authorship network, where an edge between a paper and a researcher indicates that the researcher was an author of the paper. Then, paths of length 2 contain co-authorship information.
The major downside in using the distance-generalized core decomposition is the significant increase in the computational complexity: whereas the standard core decomposition can be done in \({{\mathcal {O}}}{\left( m\right) }\) time, the generalization can require \({{\mathcal {O}}}{\left( n^2m\right) }\) time, where n and m are the number of nodes and edges in the given graph.
To combat this problem, we propose a randomized algorithm that produces an \(\epsilon \)-approximation of (k, h)-core decomposition with a probability of \(1 - \delta \) in \({{\mathcal {O}}}{\left( \epsilon ^{-2} hm (\log ^2 n - \log \delta )\right) }\) time.
The intuition behind our approach is as follows. In order to compute the distance-generalized core decomposition, we need to discover and maintain h-neighborhoods for each node. We can discover the h-neighborhood of a node v by taking the union of the \((h - 1)\)-neighborhoods of the adjacent nodes, which leads to a simple dynamic program. The computational bottleneck comes from the fact that these neighborhoods may become too large. So, instead of computing the complete neighborhood, we maintain a carefully selected budget M. The moment a neighborhood becomes too large, we delete (roughly) half of its nodes, and to compensate for the sampling we multiply our size estimate by 2. This procedure is repeated as often as needed. Since the neighborhood samples stay small, we can compute the decomposition faster.
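The delete-half-and-double idea can be demonstrated in isolation with a small sketch (the function names and the concrete budget below are our own illustration, not the paper's pseudo-code): every element draws a rank from \({geo}{\left( 1/2\right) }\), only elements with rank at least a threshold k are kept, and the set size is estimated as the sample size times \(2^k\).

```python
import random

def geo_rank(rng):
    """Sample from geo(1/2): count heads before the first tails."""
    r = 0
    while rng.random() < 0.5:
        r += 1
    return r

def sampled_size_estimate(items, M, seed=0):
    """Estimate the number of items while storing at most M of them.

    Only items with rank >= k are kept.  Whenever the sample exceeds the
    budget M, the threshold k is raised by one, which deletes roughly
    half of the sample; the multiplier 2**k compensates for the deletions.
    """
    rng = random.Random(seed)
    k = 0
    sample = set()
    for x in items:
        r = geo_rank(rng)
        if r >= k:
            sample.add((x, r))
        while len(sample) > M:
            k += 1
            sample = {(y, s) for (y, s) in sample if s >= k}
    return len(sample) * 2 ** k

# The estimate concentrates around the true size for a large enough budget.
est = sampled_size_estimate(range(10000), M=256, seed=42)
```

Raising k by one drops roughly half of the sample, which is exactly the "delete half, multiply by 2" step described above; as long as the true size is at most M, no halving occurs and the count is exact.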
We use Chernoff bounds to determine an appropriate value for M and provide algorithms for maintaining the h-neighborhoods. The maintenance requires special attention since if the h-neighborhood becomes too small we need to bring back the deleted nodes.
Finally, we study distance-generalized dense subgraphs, a notion proposed by Bonchi et al. [6] that extends the notion of dense subgraphs. Here, the density is the ratio of the number of h-connected node pairs to the number of nodes. We show that the problem is NP-hard and propose an algorithm based on approximate core maps, extending the results by Bonchi et al. [6].
The rest of the paper is organized as follows: In Sect. 2, we introduce preliminary notation and formalize the problem. In Sect. 3, we present a naive version of the algorithm that yields approximate results but is too slow. We prove the approximation guarantee in Sect. 4 and speed up the algorithm in Sect. 5. In Sect. 6, we study distance-generalized dense subgraphs. We discuss the related work in Sect. 7. Finally, we compare our method empirically against the baselines in Sect. 8 and conclude the paper with discussion in Sect. 9.
This work extends the conference paper [26].
2 Preliminaries and problem definition
In this section, we establish preliminary notation and define our problem.
Assume an undirected graph \(G = (V, E)\) with n nodes and m edges. We will write A(v) to be the set of nodes adjacent to v. Given an integer h, we define an h-path to be a sequence of at most \(h + 1\) adjacent nodes. An h-neighborhood N(v; h, X) is then the set of nodes that are reachable with an h-path in a set of nodes X. If \(X = V\) or otherwise clear from context, we will drop it from the notation. Note that \(N(v; 1) = A(v) \cup \left\{ v\right\} \).
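As a concrete reading of the notation, N(v; h, X) can be computed with a breadth-first search truncated at depth h (a minimal sketch with our own function names):

```python
from collections import deque

def h_neighborhood(adj, v, h, X=None):
    """Return N(v; h, X): the nodes reachable from v with a path of at
    most h edges that stays inside the node set X (all nodes if omitted)."""
    if X is None:
        X = set(adj)
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == h:
            continue  # do not expand past depth h
        for w in adj[u]:
            if w in X and w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return seen

# Path graph 0-1-2-3-4: from node 0, a 2-path reaches {0, 1, 2}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

In particular, `h_neighborhood(adj, v, 1)` returns \(A(v) \cup \left\{ v\right\} \), matching the note above.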
We will write \(\deg _h(v; X) = {\left| N(v; h, X)\right| } - 1\), where X is a set of nodes and \(v \in X\). We will often drop X from the notation if it is clear from the context.
A k-core is the maximal subgraph of G for which all nodes have a degree of at least k. Discovering the cores can be done in \({{\mathcal {O}}}{\left( m\right) }\) time by iteratively deleting the vertex with the smallest degree [23].
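The peeling procedure can be sketched as follows (a quadratic-time illustration with our own names; the \({{\mathcal {O}}}{\left( m\right) }\) algorithm replaces the minimum scan with bucket queues [3]):

```python
def core_numbers(adj):
    """Standard core decomposition: repeatedly delete a minimum-degree
    vertex; the core number of v is the largest minimum degree seen up
    to the moment v is deleted."""
    deg = {v: len(adj[v]) for v in adj}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        v = min(alive, key=lambda u: deg[u])  # O(n) scan; bucket queues give O(m)
        k = max(k, deg[v])
        core[v] = k
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
    return core

# Triangle with a pendant vertex: the triangle is the 2-core.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
```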
Bonchi et al. [6] proposed to extend the notion of k-cores to (k, h)-cores. Here, given an integer h, a (k, h)-core is the maximal subgraph H of G such that \({\left| N(v; h)\right| } - 1 \ge k\) for each \(v \in V(H)\), that is, we can reach at least k nodes from v with an h-path. The core number c(v) of a vertex v is the largest k such that v is contained in a (k, h)-core H. We will call H the core graph of v and we will refer to c as the core map.
Note that discovering (k, 1)-cores is equivalent to discovering standard k-cores. We follow the same strategy when computing (k, h)-cores as with standard cores: we iteratively find and delete the vertex with the smallest h-degree [6]. We will refer to the exact algorithm as ExactCore. While ExactCore is guaranteed to produce the correct result, the computational complexity deteriorates to \({{\mathcal {O}}}{\left( n^2m\right) }\). The main reason here is that the neighborhoods N(v; h) can be significantly larger than just the adjacent nodes A(v).
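A naive rendering of this peeling strategy for (k, h)-cores might look as follows (our own sketch, recomputing every h-neighborhood from scratch after each deletion; this recomputation is precisely the bottleneck discussed above):

```python
from collections import deque

def reach(adj, v, h, alive):
    """Nodes reachable from v with a path of at most h edges within alive."""
    seen = {v}
    queue = deque([(v, 0)])
    while queue:
        u, d = queue.popleft()
        if d == h:
            continue
        for w in adj[u]:
            if w in alive and w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return seen

def exact_kh_cores(adj, h):
    """Naive ExactCore: peel the vertex with the smallest h-degree,
    recomputing all h-neighborhoods after every deletion."""
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        degs = {v: len(reach(adj, v, h, alive)) - 1 for v in alive}
        v = min(alive, key=lambda u: degs[u])
        k = max(k, degs[v])
        core[v] = k
        alive.remove(v)
    return core

# Path 0-1-2-3-4: with h = 2 every vertex reaches at least 2 others,
# so the whole path is a (2, 2)-core; with h = 1 it is only a 1-core.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```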
In this paper, we consider approximating cores.
Definition 2.1
(approximative (k, h)-core) Given a graph G, an integer h, and an approximation guarantee \(\epsilon \), an \(\epsilon \)-approximative core map \({c'}:{V} \rightarrow {{\mathbb {N}}}\) maps a node to an integer such that \({\left| c'(v) - c(v)\right| } \le \epsilon c(v)\) for each \(v \in V\).
We will introduce an algorithm that computes an \(\epsilon \)-approximative core map with a probability of \(1 - \delta \) in quasilinear time.
3 Naive, slow algorithm
In this section, we introduce the basic idea of our approach. This version of the algorithm will still be too slow but will approximate the cores accurately. We will prove the accuracy in the next section and then refine the subroutines to obtain the needed computational complexity.
The bottleneck for computing the cores is maintaining the h-neighborhood N(v; h) for each node v as we delete the nodes. Instead of maintaining the complete h-neighborhood, we will keep only certain nodes if the neighborhood becomes too large. We then compensate the sampling when estimating the size of the h-neighborhood.
Assume that we are given a graph G, an integer h, an approximation guarantee \(\epsilon \), and a probability threshold \(\delta \). Let us define the number \(C = \log (2n / \delta )\) and the budget M as given in Eq. 1.
The quantity M will act as an upper bound for the sampled h-neighborhood, while C will be useful when analyzing the properties of the algorithm. We will see later that these specific values will yield the approximation guarantees.
We start the algorithm by sampling the rank of each node from a geometric distribution, \(r[v] = {geo}{\left( 1/2\right) }\). Note that ties are allowed. During the algorithm, we maintain two key variables B[v, i] and k[v, i] for each \(v \in V\) and each index \(i = 1, \ldots , h\). Here,
\(B[v, i] = \left\{ u \in N(v; i) \mid r[u] \ge k[v, i]\right\} \)
is the subset of the i-neighborhood N(v; i) consisting of nodes whose rank \(r[u] \ge k[v, i]\). The threshold k[v, i] is set to be as small as possible such that \({\left| B[v, i]\right| } \le M\).
We can estimate c(v) from B[v, h] and k[v, h] as follows: Consider the quantity \(d = {\left| B[v, h] \setminus \left\{ v\right\} \right| }2^{k[v, h]}\). Note that for an integer k the probability of a vertex v having a rank \(r[v] \ge k\) is \(2^{-k}\). This hints that d is a good estimate for c(v). We show in the next section that this is indeed the case, but d is lacking an important property that we need in order to prove the correctness of the algorithm. Namely, d can increase while we are deleting nodes. To fix this pathological case, we estimate c(v) with \(\max (d, M2^{k[v, h] - 1})\) if \(k[v, h] > 0\), and with d if \(k[v, h] = 0\). The pseudo-code for the estimate is given in Algorithm 1.
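As a sketch, the estimate of the previous paragraph can be written as follows (our own function name; B stands for B[v, h] and k for k[v, h]):

```python
def estimate(B, k, M, v):
    """Estimate of c(v) as described above: d = |B - {v}| * 2**k,
    clamped from below by M * 2**(k - 1) when k > 0 so that the
    estimate cannot increase while nodes are being deleted."""
    d = len(B - {v}) * 2 ** k
    if k > 0:
        return max(d, M * 2 ** (k - 1))
    return d
```

With \(k = 0\) the sample is the full h-neighborhood and the estimate is exact; with \(k > 0\) the clamp \(M2^{k - 1}\) kicks in whenever the sample has shrunk below M/2 nodes.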
To compute B[v, i], we have the following observation.
Proposition 3.1
For any \(v \in V\) and any \(i = 1, \ldots , h\), we have
\(B[v, i] = \left\{ u \in T \mid r[u] \ge k[v, i]\right\} , \quad \text {where} \quad T = \left\{ v\right\} \cup \bigcup _{w \in A(v)} B[w, i - 1].\)
Moreover, \(k[v, i] \ge k[w, i - 1]\) for any \(w \in A(v)\).
Proof
Let \(w \in A(v)\). Since \(N(w, i - 1) \subseteq N(v, i)\), we have \(k[v, i] \ge k[w, i - 1]\). Consequently, \(B[v, i] \subseteq T \subseteq N(v, i)\), and by definition of B[v, i], the claim follows. \(\square \)
The proposition leads to Compute, an algorithm for computing B[v, i] given in Algorithm 2. Here, we form a set T, a union of sets \(B[w, i - 1]\), where \(w \in A(v)\). After T is formed, we search for the threshold \(k[v, i] \ge \max _{w \in A(v)} k[w, i - 1]\) that yields at most M nodes in T and store the resulting set in B[v, i].
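A simplified rendition of this step might look as follows (a sketch with our own names; `neighbor_samples` plays the role of the sets \(B[w, i - 1]\) and `k_min` plays the role of \(\max _{w \in A(v)} k[w, i - 1]\)):

```python
def compute(neighbor_samples, rank, v, M, k_min=0):
    """Sketch of Compute(v, i): form T, the union of the neighbor samples
    together with v itself, then raise the rank threshold k (starting
    from the largest neighbor threshold) until at most M nodes of T
    survive.  Returns the surviving sample and the threshold."""
    T = {v}
    for B in neighbor_samples:
        T |= B
    k = k_min
    sample = {u for u in T if rank[u] >= k}
    while len(sample) > M:
        k += 1
        sample = {u for u in sample if rank[u] >= k}
    return sample, k
```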
When a node, say u, is deleted, we need to update the affected nodes. We do this in Algorithm 3 by recomputing B[v, i] for the neighbors \(v \in A(u)\) and checking whether B[v, i] and k[v, i] have changed; if they have, then we recompute \(B[w, i + 1]\) for all \(w \in A(v)\), and so on.
The main algorithm Core, given in Algorithm 4, initializes B[v, i] using Compute, and then iteratively deletes the node with the smallest estimate d[v] while updating the sets B[v, i] with Update.
4 Approximation guarantee
In this section, we will prove the approximation guarantee of our algorithm. The key step is to show that Estimate produces an accurate estimate. For notational convenience, we need the following definition.
Definition 4.1
Assume d integers \(X = \left( x_1 ,\ldots , x_d\right) \) and an integer M. Define
\(S_i = {\left| \left\{ j \mid x_j \ge i\right\} \right| }\)
to be the number of integers larger than or equal to i. Let \(k \ge 0\) be the smallest integer for which \(S_k \le M\). Define \(T_i = {\left| \left\{ j \ge 2 \mid x_j \ge i\right\} \right| }\) and set \({\varDelta }{\left( X; M\right) } = T_0\) if \(k = 0\), and \({\varDelta }{\left( X; M\right) } = \max {\left( T_k 2^k, M 2^{k - 1}\right) }\) if \(k > 0\).
Note that if \(R = \left( r[w] \mid w \in N(v; h)\right) \) with r[v] being the first element in R, then \({\varDelta }{\left( R; M\right) }\) is equal to the output of \(\textsc {Estimate} (v)\).
Our first step is to show that \({\varDelta }{\left( X; M\right) }\) is monotonic.
Proposition 4.1
Assume \(M > 0\). Let \(x_1, \ldots , x_d\) be a set of integers. Select \(d' \le d\). Then
\({\varDelta }{\left( x_1, \ldots , x_{d'}; M\right) } \le {\varDelta }{\left( x_1, \ldots , x_d; M\right) }.\)
Note that this claim would not hold if we did not have the \(M2^{k - 1}\) term in the definition of \({\varDelta }{\left( X; M\right) }\).
Proof
Let k, \(S_i\), and \(T_i\) be as defined for \({\varDelta }{\left( x_1, \ldots , x_d; M\right) }\) in Definition 4.1. Also, let \(k'\), \(S_i'\), and \(T_i'\) be as defined for \({\varDelta }{\left( x_1, \ldots , x_{d'}; M\right) }\) in Definition 4.1.
Since \(S_i' \le S_i\), we have \(k' \le k\). If \(k' = k\), the claim follows immediately since also \(T_i' \le T_i\). If \(k' < k\), then
\({\varDelta }{\left( x_1, \ldots , x_{d'}; M\right) } \le M 2^{k'} \le M 2^{k - 1}\)
and
\(M 2^{k - 1} \le {\varDelta }{\left( x_1, \ldots , x_d; M\right) },\)
proving the claim. \(\square \)
Next, we formalize the accuracy of \({\varDelta }{\left( X; M\right) }\). We prove the claim in the Appendix.
Proposition 4.2
Assume \(0 < \epsilon \le 1/2\). Let \({\mathcal {R}} = R_1, \ldots , R_d\) be independent random variables sampled from a geometric distribution, \({geo}{\left( 1/2\right) }\). Assume \(C > 0\) and define M as in Eq. 1. Then
\((1 - \epsilon )(d - 1) \le {\varDelta }{\left( {\mathcal {R}}; M\right) } \le (1 + \epsilon )(d - 1)\)
with probability \(1 - \exp \left( -C\right) \).
We are now ready to state the main claim.
Proposition 4.3
Assume a graph G with n nodes, \(\epsilon > 0\), and \(C > 0\). For each node \(v \in V\), let c(v) be the core number reported by ExactCore and let \(c'(v)\) be the core number reported by Core. Then with probability \(1 - 2ne^{-C}\),
\((1 - \epsilon ) c(v) \le c'(v) \le (1 + \epsilon ) c(v)\)
for every node in V. Moreover, if \(c(v) \le M\), where M is given in Eq. 1, then \(c(v) = c'(v)\).
We will prove the main claim of the proposition with two lemmas. In both proofs, we will use the variable \(\tau _v\) which we define to be the value of d[v] when v is deleted by Core.
The first lemma establishes a lower bound.
Lemma 4.1
The lower bound \(c'(v) \ge (1 - \epsilon )c(v)\) holds with probability \(1 - ne^{-C}\).
Proof
For each node \(v \in V\), let \(R_v\) be a rank, an independent random variable sampled from geometric distribution, \({geo}{\left( 1/2\right) }\).
Let \(H_v\) be the core graph of v as solved by ExactCore. Define \(S_v = N(v, h) \cap H_v\) to be the h-neighborhood of v in \(H_v\). Note that \(c(v) \le {\left| S_v\right| } - 1\). Let \({\mathcal {R}}_v\) be the list of ranks \(\left( R_w ; w \in S_v\right) \) such that \(R_v\) is always the first element.
Proposition 4.2 combined with the union bound states that
\({\varDelta }{\left( {\mathcal {R}}_v; M\right) } \ge (1 - \epsilon )({\left| S_v\right| } - 1)\)
holds with probability \(1 - ne^{-C}\) for every node v. Assume that these events hold.
To prove the claim, select a node v and let w be the first node in \(H_v\) deleted by Core. Let F be the graph right before deleting w by Core. Then,
\(c'(v) \ge \tau _w = d[w] \ge {\varDelta }{\left( {\mathcal {R}}_w; M\right) } \ge (1 - \epsilon )({\left| S_w\right| } - 1) \ge (1 - \epsilon ) c(v),\)
proving the lemma. \(\square \)
Next, we establish the upper bound.
Lemma 4.2
The upper bound \(c'(v) \le (1 + \epsilon )c(v)\) holds with probability \(1 - ne^{-C}\).
Proof
For each node \(v \in V\), let \(R_v\) be an independent random variable sampled from geometric distribution, \({geo}{\left( 1/2\right) }\).
Consider the exact algorithm ExactCore for solving the (k, h) core problem. Let \(H_v\) be the graph induced by the existing nodes right before v is removed by ExactCore. Define \(S_v = N(v, h) \cap H_v\) to be the h-neighborhood of v in \(H_v\). Note that \(c(v) \ge {\left| S_v\right| } - 1\). Let \({\mathcal {R}}_v\) be the list of ranks \(\left( R_w ; w \in S_v\right) \) such that \(R_v\) is the first element.
Proposition 4.2 combined with the union bound states that
\({\varDelta }{\left( {\mathcal {R}}_v; M\right) } \le (1 + \epsilon )({\left| S_v\right| } - 1)\)
holds with probability \(1 - ne^{-C}\) for every node v. Assume that these events hold.
Select a node v. Let W be the set containing v and the nodes selected before v by Core. Select \(w \in W\). Let F be the graph right before deleting w by Core. Let u be the node in F that is deleted first by ExactCore. Let \(\beta \) be the value of d[u] when w is deleted by Core. Then,
\(\tau _w \le \beta \le {\varDelta }{\left( {\mathcal {R}}_u; M\right) } \le (1 + \epsilon )({\left| S_u\right| } - 1) \le (1 + \epsilon ) c(u) \le (1 + \epsilon ) c(v).\)
Since this bound holds for any \(w \in W\), we have
\(c'(v) = \max _{w \in W} \tau _w \le (1 + \epsilon ) c(v),\)
proving the lemma. \(\square \)
We are now ready to prove the proposition.
Proof of Proposition 4.3
By the union bound, the probability that at least one of the two above lemmas fails is at most \(2ne^{-C}\), proving the main claim.
To prove the second claim, note that when \(d[v] \le M\), then d[v] exactly matches the number of the remaining nodes that can be reached by an h-path from the node v. On the other hand, if there is a node w that reaches more than M nodes, we are guaranteed that \(d[w] \ge M\) and \(k[w, h] > 0\), implying that Core will always prefer deleting v before w. Consequently, at the beginning Core will select the nodes in the same order as ExactCore and report the same core numbers as long as there are nodes with \(d[v] \le M\) or, equivalently, as long as \(c(v) \le M\). \(\square \)
5 Updating data structures faster
Now that we have proven the accuracy of Core, our next step is to address the computational complexity. The key problem is that Compute is called too often and the implementation of Update is too slow.
As Core progresses, the set B[v, i] is modified in two ways. The first case is when some nodes become too far away, and we need to delete these nodes from B[v, i]. The second case is when we have deleted enough nodes so that we can lower k[v, i] and introduce new nodes. Our naive version of Update calls Compute for both cases. We will modify the algorithms so that Compute is called only to handle the second case, and the first case is handled separately. Note that these modifications do not change the output of the algorithm.
First, we change the information stored in B[v, i]. Instead of storing just a node u, we will store (u, z), where z is the number of neighbors \(w \in A(v)\), such that u is in \(B[w, i - 1]\). We will store B[v, i] as a linked list sorted by the rank. In addition, each node \(u \in B[w, i - 1]\) is augmented with an array \(Q = (q_v \mid v \in A(w))\). An entry \(q_v\) points to the location of u in B[v, i] if u is present in B[v, i]. Otherwise, \(q_v\) is set to null.
We will need two helper functions to maintain B[v, i]. The first function is a standard merge sort, \(\textsc {MergeSort} (X, Y)\), that merges two sorted lists in \({{\mathcal {O}}}{\left( {\left| X\right| } + {\left| Y\right| }\right) }\) time, maintaining the counters and the pointers.
The other function is \(\textsc {Delete} (X, Y)\) that removes the nodes in Y from X, which we will use to remove nodes from B[v, i]. The deletion is done by reducing the counters of the corresponding nodes in X by 1 and removing them when the counter reaches 0. It is vital that we can process a single node \(y \in Y\) in constant time. This is possible because we can use the pointer array described above.
Let us now consider calling Compute. We would like to minimize the number of calls of Compute. In order to do that, we need additional bookkeeping. The first additional information is m[v, i] which is the number of neighboring nodes \(w \in A(v)\) for which \(k[w, i - 1] = k[v, i]\). Proposition 3.1 states that \(k[v, i] \ge k[w, i - 1]\), for all \(w \in A(v)\). Thus, if \(m[v, i] > 0\), then there is a node \(u \in A(v)\) with \(k[v, i] = k[u, i - 1]\) and so recomputing B[v, i] will not change k[v, i] and will not add new nodes in B[v, i].
Unfortunately, maintaining just m[v, i] is not enough. We may have \(k[v, i] > k[w, i - 1]\) for every \(w \in A(v)\) immediately after Compute. In such a case, we compute the sets of nodes
\(X_w = \left\{ u \in B[w, i - 1] \mid r[u] = k[v, i] - 1\right\} , \quad w \in A(v),\)
and combine them in D[v, i], a union of the \(X_w\) along with the counter information similar to B[v, i], that is,
\(D[v, i] = \left\{ (u, z) \mid u \in \bigcup _{w \in A(v)} X_w,\ z = {\left| \left\{ w \in A(v) \mid u \in X_w\right\} \right| }\right\} .\)
The key observation is that as long as \({\left| B[v, i]\right| } + {\left| D[v, i]\right| } > M\), the level k[v, i] does not need to be updated.
There is one complication, namely, we need to compute D[v, i] in \({{\mathcal {O}}}{\left( M\deg v\right) }\) time. Note that, unlike B[v, i], the set D[v, i] can have more than M elements. Hence, using MergeSort will not work. Moreover, a stock k-ary merge sort requires \({{\mathcal {O}}}{\left( M\deg (v) \log \deg (v)\right) }\) time. The key observation to avoid the additional \(\log \) factor is that D[v, i] does not need to be sorted. More specifically, we first compute an array
\(z[u] = {\left| \left\{ w \in A(v) \mid u \in X_w\right\} \right| }\)
and then extract the nonzero entries to form D[v, i]. We only need to compute the nonzero entries so we can compute these entries in \({{\mathcal {O}}}{\left( \sum {\left| X_w\right| }\right) } \subseteq {{\mathcal {O}}}{\left( M\deg v\right) }\) time. Moreover, since we do not need to keep them in order we can extract the nonzero entries in the same time. We will refer to this procedure as Union, taking the sets \(X_w\) as input and forming D[v, i].
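The counting trick behind Union can be sketched as follows (our own minimal version; a Python dict plays the role of the array indexed by nodes, so only nonzero entries are ever touched):

```python
def union_with_counts(sets):
    """Linear-time union of the sets X_w without sorting: a single
    counting pass over the elements.  The keys are the union and the
    values are the multiplicities needed for constant-time deletions."""
    count = {}
    for X in sets:
        for u in X:
            count[u] = count.get(u, 0) + 1
    return count

D = union_with_counts([{1, 2}, {2, 3}, {3, 4}])
# Deleting one occurrence of a node is a counter decrement; the node
# leaves the union when its counter reaches zero.
```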
We need to maintain D[v, i] efficiently. In order to do that, we augment each node \(u \in B[w, i - 1]\) with an array \((q_v \mid v \in A(w))\), where \(q_v\) points to the location of u in D[v, i] if \(u \in D[v, i]\).
The pseudo-code for the updated Compute is given in Algorithm 5. Here we compute B[v, i] and k[v, i] first by using MergeSort iteratively and trimming the resulting set if it has more than M elements. We proceed to compute m[v, i] and D[v, i]. If \(m[v, i] = 0\), we compute D[v, i] with Union. Note that if \(m[v, i] > 0\), we leave D[v, i] empty. The above discussion leads immediately to the computational complexity of Compute.
Proposition 5.1
\(\textsc {Compute} (v, i)\) runs in \({{\mathcal {O}}}{\left( M \deg v\right) }\) time.
The pseudo-code for Update is given in Algorithm 6. Here, we maintain a stack U of tuples (v, Y), where v is the node that requires an update, and Y are the nodes that have been deleted from B[v, i] during the previous round. First, if \({\left| B[v, i]\right| } + {\left| D[v, i]\right| } \le M\) and \(m[v, i] = 0\), we run \(\textsc {Compute} (v, i)\). Next, we proceed by reducing the counters of the nodes of Y in \(B[w, i + 1]\) and \(D[w, i + 1]\) for each \(w \in A(v)\). We also update \(m[w, i + 1]\). Finally, we add (w, Z) to the next stack, where Z are the nodes deleted from \(B[w, i + 1]\).
Proposition 5.2
Update maintains B[v, i] correctly.
Proof
As Core deletes nodes from the graph, Proposition 3.1 guarantees that B[v, i] can be modified only in two ways: either node u is deleted from B[v, i] when u is no longer present in any \(B[w, i - 1]\) where \(w \in A(v)\), or k[v, i] changes and new nodes are added.
The first case is handled properly as Update uses Delete whenever a node is deleted from \(B[w, i - 1]\).
The second case follows since if \({\left| B[v, i]\right| } + {\left| D[v, i]\right| } > M\) or \(m[v, i] > 0\), then we know that Compute will not change k[v, i] and will not introduce new nodes in B[v, i]. \(\square \)
Proposition 5.3
Assume a graph G with n nodes and m edges. Assume \(0 < \epsilon \le 1/2\), constant C, and the maximum path length h. The running time of Core is bounded by
\({{\mathcal {O}}}{\left( hmM \log \frac{n}{M}\right) }\)
with a probability of \(1 - n\exp (-C)\), where M is defined in Eq. 1.
Proof
We will prove the proposition by bounding \(R_1 + R_2\), where \(R_1\) is the total time needed by Compute and \(R_2\) is the total time needed by the inner loop in Update.
We will bound \(R_1\) first. Note that a single call of \(\textsc {Compute} (v, i)\) requires \({{\mathcal {O}}}{\left( M\deg v\right) }\) time.
To bound the number of Compute calls, let us first bound k[v, i]. Proposition 4.2 and the union bound imply that
\(M 2^{k[v, i] - 1} \le (1 + \epsilon ) n\)
holds for all nodes \(v \in V\) with probability \(1 - n\exp (-C)\). Solving for k[v, i] leads to
\(k[v, i] \in {{\mathcal {O}}}{\left( \log \frac{n}{M}\right) }.\)
We claim that \(\textsc {Compute} (v, i)\) is called at most twice for each value of k[v, i]. To see this, consider that \(\textsc {Compute} (v, i)\) sets \(m[v, i] = 0\). Then, we also set D[v, i] and we are guaranteed by the first condition on Line 9 of Update that the next call of \(\textsc {Compute} (v, i)\) will lower k[v, i]. Assume now that \(\textsc {Compute} (v, i)\) sets \(m[v, i] > 0\). Then, the second condition on Line 9 of Update guarantees that the next call of \(\textsc {Compute} (v, i)\) either sets m[v, i] to 0 (and computes D[v, i]) or lowers k[v, i].
In summary, the time needed by Compute is bounded by
\(R_1 \in {{\mathcal {O}}}{\left( \sum _{v} h M \deg (v) \log \frac{n}{M}\right) } = {{\mathcal {O}}}{\left( hmM \log \frac{n}{M}\right) }.\)
Let us now consider \(R_2\). For each deleted node in B[v, i] or for each lowered k[v, i] the inner for-loop requires \({{\mathcal {O}}}{\left( \deg v\right) }\) time. Equation 5 implies that the total number of deletions from B[v, i] is in \({{\mathcal {O}}}{\left( M\log \frac{n}{M}\right) }\) and that we can lower k[v, i] at most \({{\mathcal {O}}}{\left( \log \frac{n}{M}\right) }\) times. Consequently,
\(R_2 \in {{\mathcal {O}}}{\left( \sum _{v} h \deg (v) M \log \frac{n}{M}\right) } = {{\mathcal {O}}}{\left( hmM \log \frac{n}{M}\right) }.\)
We have bounded both \(R_1\) and \(R_2\) proving the main claim. \(\square \)
Corollary 5.1
Assume real values \(\epsilon > 0\), \(\delta > 0\), and a graph G with n nodes and m edges. Let \(C = \log (2n / \delta )\). Then Core yields an \(\epsilon \)-approximation in
\({{\mathcal {O}}}{\left( \epsilon ^{-2} hm (\log ^2 n - \log \delta )\right) }\)
time with \(1 - \delta \) probability.
Proposition 5.4
Core requires \({{\mathcal {O}}}{\left( hmM\right) }\) memory.
Proof
An entry in B[v, i] requires \({{\mathcal {O}}}{\left( \deg v\right) }\) memory for the pointer information. An entry in D[v, i] only requires \({{\mathcal {O}}}{\left( 1\right) }\) memory. Since \({\left| B[v, i]\right| } \le M\) and \({\left| D[v, i]\right| } \le M \deg v\), the claim follows. \(\square \)
In order to speed up the algorithm further, we employ two additional heuristics. First, we can safely delay the initialization of B[v, i] until every \(B[w, i - 1]\), where \(w \in A(v)\), yields a core estimate that is below the current core number. Delaying the initialization allows us to ignore B[v, i] during Update. Second, if the current core number exceeds the number of remaining nodes, then we can stop and use the current core number for the remaining nodes. While these heuristics do not provide any additional theoretical guarantees, they speed up the algorithm in practice.
6 Distance-generalized dense subgraphs
In this section, we will study distance-generalized dense subgraphs, a notion introduced by Bonchi et al. [6].
In order to define the problem, let us first define \(E_h(X)\) to be the node pairs in X that are connected with an h-path in X. We exclude the node pairs of the form (u, u). Note that \(E(X) = E_1(X)\).
We define the h-density of X to be the ratio of \({\left| E_h(X)\right| }\) and \({\left| X\right| }\),
\({dns}{\left( X; h\right) } = \frac{{\left| E_h(X)\right| }}{{\left| X\right| }}.\)
We will sometimes drop h from the notation if it is clear from the context.
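A direct, brute-force reading of the definition can be sketched as follows (our own names; quadratic in \({\left| X\right| }\) and for illustration only):

```python
from collections import deque
from itertools import combinations

def h_connected(adj, u, v, h, X):
    """True if u and v are joined by a path of at most h edges inside X."""
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        x, d = queue.popleft()
        if x == v:
            return True
        if d == h:
            continue
        for w in adj[x]:
            if w in X and w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return False

def h_density(adj, X, h):
    """dns(X; h): the number of h-connected unordered node pairs of X,
    excluding pairs (u, u), divided by |X|."""
    pairs = sum(1 for u, v in combinations(sorted(X), 2)
                if h_connected(adj, u, v, h, X))
    return pairs / len(X)

# Star with center 0: every leaf pair is connected by a 2-path via 0,
# so all 6 pairs count for h = 2, but only the 3 edges count for h = 1.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```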
Problem 6.1
(Dense) Given a graph G and an integer h, find the subgraph X maximizing \({dns}{(X; h)}\).
Dense can be solved for \(h = 1\) in polynomial time using fractional programming combined with minimum cut problems [15]. However, the distance-generalized version of the problem is NP-hard.
Proposition 6.1
Dense is NP-hard even for \(h = 2\).
To prove the result, we will use extensively the following lemma.
Lemma 6.1
Let X be the densest subgraph. Let \(Y \subseteq X\) and \(Z \cap X = \emptyset \). Then
\({\left| E_h(X)\right| } - {\left| E_h(X \setminus Y)\right| } \ge {\left| Y\right| }\, {dns}{\left( X\right) } \quad \text {and} \quad {\left| E_h(X \cup Z)\right| } - {\left| E_h(X)\right| } \le {\left| Z\right| }\, {dns}{\left( X\right) }.\)
Proof
Due to optimality \({dns}{(}{X}) \ge {dns}{(}{X \setminus Y})\). Then,
\(\frac{{\left| E_h(X)\right| }}{{\left| X\right| }} \ge \frac{{\left| E_h(X \setminus Y)\right| }}{{\left| X\right| } - {\left| Y\right| }} \quad \text {implies} \quad {\left| E_h(X)\right| } - {\left| E_h(X \setminus Y)\right| } \ge {\left| Y\right| } \frac{{\left| E_h(X)\right| }}{{\left| X\right| }} = {\left| Y\right| }\, {dns}{\left( X\right) }.\)
Similarly, \({dns}{(}{X}) \ge {dns}{(}{X \cup Z})\) implies
\({\left| E_h(X \cup Z)\right| } - {\left| E_h(X)\right| } \le {\left| Z\right| }\, {dns}{\left( X\right) },\)
proving the claim. \(\square \)
Proof of Proposition 6.1
To prove the claim, we will reduce 3Dmatch to our problem. In an instance of 3Dmatch, we are given a universe \(U = u_1, \ldots , u_n\) of size n and a collection \({\mathcal {C}}\) of m sets of size 3, and we ask whether there is an exact cover of U in \({\mathcal {C}}\).
We can safely assume that \(C_1\) does not intersect with any other set. Otherwise, we can add a new set and 3 new items without changing the outcome of the instance.
In order to define the graph, let us first define \(k = 12m\) and \(\ell = 3k(3k - 1)/2 + 6k - 1\). Note that \(k \ge 12\).
For each \(u_i \in U\), we add k nodes \(a_{ij}\), where \(j = 1, \ldots , k\). For each \(a_{ij}\), we add \(2\ell \) unique nodes that are all connected to \(a_{ij}\). We will denote the resulting star with \(S_{ij}\). We will select a non-center node from \(S_{ij}\) and denote it by \(b_{ij}\). Finally, we write \(S'_{ij} = S_{ij} \setminus \left\{ a_{ij}, b_{ij}\right\} \).
For each set \(C_t \in {\mathcal {C}}\), we add a node, say \(p_t\), and connect it to \(b_{ij}\) for every \(u_i \in C_t\) and \(j = 1, \ldots , k\). We will denote the collection of these nodes with P. We connect every node in P to \(p_1\).
Let X be the nodes of the densest subgraph for \(h = 2\). Let \(Q = P \cap X\) and let \({\mathcal {R}}\) be the corresponding sets in \({\mathcal {C}}\).
To simplify the notation, we will need the following counts of node pairs. First, let us define \(\alpha \) to be the number of node pairs in a single \(S_{ij}\) connected with a 2-path,
\(\alpha = 2\ell + \ell (2\ell - 1) = \ell (2\ell + 1).\)
Second, let us define the number of node pairs connected with a 2-path using a single node \(p_t \in P\). Since \(p_t\) connects 3k nodes \(b_{ij}\) and reaches 3k nodes \(b_{ij}\) and 3k nodes \(a_{ij}\), we have
\(\beta = \frac{3k(3k - 1)}{2} + 6k = \ell + 1.\)
Finally, consider W consisting of a single \(p_t\) and the corresponding 3k stars. Let us write \(\gamma = 3k\alpha + \beta \) to be the number of node pairs connected by a 2-path in W.
We will prove the proposition with a series of claims.
Claim 1: \({dns}{(}{X}) > \ell \). The density of W as defined above is
\({dns}{\left( W\right) } = \frac{3k\alpha + \beta }{3k(2\ell + 1) + 1} = \frac{3k\ell (2\ell + 1) + \ell + 1}{3k(2\ell + 1) + 1} = \ell + \frac{1}{3k(2\ell + 1) + 1} > \ell .\)
Thus, \({dns}{(}{X}) \ge {dns}{(}{W}) > \ell \).
Claim 2: \({\mathcal {R}}\) is disjoint. To prove the claim, assume otherwise and let \(C_t\), with \(t > 1\), be a set overlapping with some other set in \({\mathcal {R}}\).
Let us bound the number of node pairs that are solely connected with \(p_t\). The node \(p_t\) connects \(3k + 1\) nodes in V. Out of these nodes at least \(k + 1\) nodes are connected by another node in X. In addition, \(p_t\) reaches \(a_{ij}\) and \(b_{ij}\), where \(u_i \in C_t\) and \(j = 1, \ldots , k\), that is, 6k nodes in total. Finally, \(p_t\) may connect to every other node in P, at most \(m - 1\) nodes, and every \(a_{ij}\) connected to \(p_1\), at most 3k nodes. In summary, we have
\({\left| E_2(X)\right| } - {\left| E_2(X \setminus \left\{ p_t\right\} )\right| } < \ell < {dns}{\left( X\right) }.\)
Lemma 6.1 with \(Y = \left\{ p_t\right\} \) now contradicts the optimality of X. Thus, \({\mathcal {R}}\) is disjoint.
Claim 3: Either \(S_{ij} \subseteq X\) or \(S_{ij} \cap X = \emptyset \). To prove the claim assume that \(S_{ij} \cap X \ne \emptyset \).
Assume that \(b_{ij} \notin X\). Then, \(S_{ij} \cap X\) is a disconnected component with density less than \(\ell \), contradicting Lemma 6.1. Assume that \(b_{ij} \in X\) and \(a_{ij} \notin X\). Then, deleting \(b_{ij}\) will reduce at most \(3k + m - 1 < \ell \) connected node pairs, contradicting Lemma 6.1.
Assume that \(b_{ij}, a_{ij} \in X\). If \(S'_{ij} \cap X = \emptyset \), then deleting \(a_{ij}\) will reduce at most 2 connected node pairs, contradicting Lemma 6.1. Assume now there are \(u \in S'_{ij} \cap X\) and \(w \in S'_{ij} \setminus X\). Then \({\left| E_2(X \cup \left\{ w\right\} )\right| } - {\left| E_2(X)\right| } > {\left| E_2(X)\right| } - {\left| E_2(X \setminus \left\{ u\right\} )\right| }\), contradicting Lemma 6.1. Consequently, \(S_{ij} \subseteq X\).
Claim 4: If \(p_t \in X\), then X contains every corresponding \(S_{ij}\). To prove the claim assume otherwise.
Assume first that there are no corresponding \(S_{ij}\) in X for \(p_t\). If \(t > 1\), then \(p_t\) reaches to at most \(m - 1 + 3k\) nodes. If \(t = 1\), then \(p_1\) connects at most \(m - 1\) nodes and reaches to at most \((m - 1)(3k + 1)\) nodes.
Both cases lead to
\({\left| E_2(X)\right| } - {\left| E_2(X \setminus \left\{ p_t\right\} )\right| } < \ell < {dns}{\left( X\right) },\)
contradicting Lemma 6.1.
Assume there is at least one corresponding \(S_{ij}\) in X but not all, say \(S_{i'j'}\) is missing. Then
\({\left| E_2(X \cup S_{i'j'})\right| } - {\left| E_2(X)\right| } > {\left| S_{i'j'}\right| }\, {dns}{\left( X\right) },\)
contradicting Lemma 6.1.
Claim 5: No \(S_{ij}\) without a corresponding \(p_t\) is included in X. To prove the claim, note that such \(S_{ij}\) is disconnected from the rest of X and has a density of \(\ell \),
\(\frac{{\left| E_2(S_{ij})\right| }}{{\left| S_{ij}\right| }} = \frac{\alpha }{2\ell + 1} = \ell < {dns}{\left( X\right) },\)
contradicting Lemma 6.1.
The previous claims together show that the density of X is equal to
which is an increasing function of \({\left| Q\right| }\). Since \({\mathcal {R}}\) is disjoint and maximal, the 3Dmatch instance has a solution if and only if \({\mathcal {R}}\) is a solution. \(\square \)
One of the appealing aspects of \({dns}{\left( X; h\right) }\) for \(h = 1\) is that it can be 2-approximated in linear time [8]. This is done by ordering the nodes with ExactCore, say \(v_1, \ldots , v_n\), and then selecting the densest subgraph of the form \(v_1, \ldots , v_i\).
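The peeling idea for \(h = 1\) can be sketched as follows. This is a minimal illustration of the 2-approximation of [8], not the paper's implementation; density is measured as edges over nodes, and the function name is illustrative:

```python
def peel_densest(adj):
    """Repeatedly remove a minimum-degree node and track the densest subgraph
    seen along the way. `adj` maps each node to the set of its neighbors
    (undirected graph). Returns (density, size) of the best subgraph."""
    deg = {v: len(ns) for v, ns in adj.items()}
    edges = sum(deg.values()) // 2
    alive = set(adj)
    best_density, best_size = edges / len(adj), len(adj)
    for size in range(len(adj), 1, -1):
        u = min(alive, key=deg.get)      # vertex with the smallest degree
        alive.remove(u)
        edges -= deg[u]                  # deg[u] counts only alive neighbors
        for w in adj[u]:
            if w in alive:
                deg[w] -= 1
        if edges / (size - 1) > best_density:
            best_density, best_size = edges / (size - 1), size - 1
    return best_density, best_size
```

On a triangle with one pendant node attached, both the whole graph and the triangle have density 1, and the method returns density 1.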
The approximation guarantee for \(h > 1\) is weaker even if we use ExactCore. Bonchi et al. [6] showed that \(2{dns}{\left( Y\right) } \ge \sqrt{2{dns}{\left( X\right) } + 1/4} - 1/2\) when ExactCore is used.
Using Core instead of ExactCore poses additional challenges. In order to select a subgraph among the n candidates, we need to estimate the density of each candidate. We cannot reuse the values d[v] maintained by Core, as these are the values that Core uses to determine the order.
Assume that Core produced the vertex order \(v_1, \ldots , v_n\), first-deleted vertices first. To find the densest graph among the candidates, we essentially repeat Core, except that now we delete the nodes in the order \(v_1, \ldots , v_n\). We then estimate the number of edges with the identity
We will refer to this algorithm as EstDense. The pseudo-code for EstDense is given in Algorithm 7.
The algorithm yields the following guarantee.
Proposition 6.2
Assume \(\epsilon > 0\), \(C > 0\), and h. Define \(\gamma = \frac{1 - \epsilon }{1 + \epsilon }\). For any given k, let \(C_k\) be the (k, h)-core. Define
to be the smallest size ratio between \(C_k\) and \(C_{k\gamma }\).
Let X be the h-densest subgraph.
Let \(c'\) be an \(\epsilon \)-approximative core map and let \(v_1, \ldots , v_n\) be the corresponding vertex order. Let \(Y = \textsc {EstDense} (G, v_1, \ldots , v_n, \epsilon , C)\). Then,
with probability \(1 - n^2\exp \left( -C\right) \).
To prove the result, we need the following lemma.
Lemma 6.2
For any given k, define \(C'_k = \left\{ v \mid c'(v) \ge k\right\} \). Then
Proof
Write \(F = C_{k(1 - \epsilon )}'\). Let \(v \in C_k\). Then \(c'(v) \ge (1 - \epsilon ) c(v) \ge (1 - \epsilon )k\) and so \(v \in F\). Thus, \(C_k \subseteq F\). Conversely, let \(v \in F\). Then \((1 + \epsilon )c(v) \ge c'(v) \ge k(1 - \epsilon )\) and so \(v \in C_{\gamma k}\). Thus, \(F \subseteq C_{\gamma k}\). The definition of \(\beta \) now implies
proving the claim. \(\square \)
Proof of Proposition 6.2
Let c be the core map produced by ExactCore. For any given k, define \(C'_k = \left\{ v \mid c'(v) \ge k\right\} \).
Let \(u \in X\) be the first vertex deleted by ExactCore. Let \(b = \deg _h(u; X)\) be its h-degree. Write \(X' = X \setminus \left\{ u\right\} \). Since X is optimal,
Deleting u from X will delete b node pairs from \(E_h(X)\) containing u. In addition, every node in the h-neighborhood of u may be disconnected from each other, potentially reducing the node pairs by \({b \atopwithdelims ()2}\). In summary,
Combining the two inequalities leads to
Solving for b results in
Let Z be the nodes right before u is deleted by ExactCore. Note that \(c(u) \ge \deg _h(u; Z) \ge \deg _h(u; X) = b\).
Let \(C_k\) be the smallest core containing u, that is, \(c(u) = k\). By definition, \(\deg _h(v; C_k) \ge k \ge b\), for all \(v \in C_k\).
Let \(F = C'_{k(1 - \epsilon )}\). Lemma 6.2 now states that
Let \(d'(Z)\) be the estimated density for a subgraph Z.
Proposition 4.2 and the union bound state that
with probability \(1 - n^2e^{-C}\). Equations 6–8 prove the inequality in the claim. \(\square \)
EstDense is essentially Core so we can apply Proposition 5.3.
Corollary 6.1
Assume real values \(\epsilon > 0\), \(\delta > 0\), a graph G with n nodes and m edges. Let \(C = \log (n^2 / \delta )\). Then EstDense runs in
time and Proposition 6.2 holds with \(1 - \delta \) probability.
Finally, let us describe a potentially faster variant of the algorithm that we will use in our experiments. The above proof works even if we replace \(C_k\) with the innermost (exact) core. Since \(F = C'_{k(1 - \epsilon )}\), we can prune all vertices with \(c'(v) < k(1 - \epsilon )\). The problem is that we do not know k, but we can lower bound it by \(k \ge k'/(1 + \epsilon )\), where \(k' = \max _v c'(v)\). In summary, before running Estimate we remove all vertices with \(c'(v) < \gamma k'\).
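The pruning step can be sketched as follows, assuming the approximate core map \(c'\) is given as a dictionary; the function name is illustrative:

```python
def prune_candidates(core_est, eps):
    """Remove all vertices with c'(v) < gamma * k', where k' = max_v c'(v)
    and gamma = (1 - eps) / (1 + eps), as described in the text. The
    surviving set still contains the innermost exact core."""
    gamma = (1.0 - eps) / (1.0 + eps)
    k_prime = max(core_est.values())
    return {v for v, c in core_est.items() if c >= gamma * k_prime}
```

For example, with estimates \(\{10, 4, 9\}\) and \(\epsilon = 0.2\), the threshold is \(10 \cdot 0.8/1.2 \approx 6.7\), so only the vertices with estimates 10 and 9 survive.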
7 Related work
The notion of distance-generalized core decomposition was proposed by Bonchi et al. [6]. The authors provide several heuristics to significantly speed up the baseline algorithm (a variant of an algorithm proposed by Batagelj and Zaveršnik [3]). Despite being significantly faster than the baseline approach, these heuristics still have a computational complexity in \({{\mathcal {O}}}{\left( nn'(n' + m')\right) }\), where \(n'\) and \(m'\) are the numbers of nodes and edges in the largest h-neighborhood. For dense graphs and large values of h, the sizes \(n'\) and \(m'\) can be close to n and m, leading to a computational time of \({{\mathcal {O}}}{\left( n^2m\right) }\). We will use these heuristics as baselines in Sect. 8.
All these algorithms, as well as ours, rely on the same idea of iteratively deleting the vertex with the smallest \(\deg _h(v)\) and updating these counters upon the deletion. The difference is that the previous works maintain these counters exactly—and use some heuristics to avoid updating unnecessary nodes—whereas we approximate \(\deg _h(v)\) by sampling, thus reducing the computational time at the cost of accuracy.
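As a reference point, the shared peeling idea can be sketched as follows: a naive exact version that recomputes h-degrees by breadth-first search after every deletion. This is illustrative only; the algorithms discussed above exist precisely to avoid this kind of recomputation:

```python
def h_degree(adj, v, h, alive):
    """Number of nodes within distance h of v in the subgraph induced by `alive`."""
    seen, frontier = {v}, [v]
    for _ in range(h):
        nxt = []
        for u in frontier:
            for w in adj[u]:
                if w in alive and w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return len(seen) - 1

def kh_cores(adj, h):
    """Peel the vertex with the smallest h-degree; the (k, h)-core number of
    a vertex is the largest minimum h-degree seen up to its deletion."""
    alive, core, current = set(adj), {}, 0
    while alive:
        u = min(alive, key=lambda v: h_degree(adj, v, h, alive))
        current = max(current, h_degree(adj, u, h, alive))
        core[u] = current
        alive.remove(u)
    return core
```

On a path of five nodes, every vertex has (1, 1)-core number 1 but (2, 2)-core number 2, since every vertex reaches at least two others within two hops.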
A popular variant of decomposition is the k-truss, where each edge is required to participate in at least k triangles [9, 17, 28, 29, 30]. Sarıyüce and Pinar [21] and Sarıyüce et al. [22] proposed the (r, s) nucleus decomposition, an extension of k-cores where the notions of nodes and edges are replaced with r-cliques and s-cliques, respectively. Sarıyüce and Pinar [21] point out that there are several variants of k-trusses, depending on the connectivity requirements: Huang et al. [17] require the trusses to be triangle-connected, Cohen [9] requires them to be connected, and Zhang and Parthasarathy [29] allow the trusses to be disconnected.
A k-core is the largest subgraph whose smallest degree is at least k. A similar concept is the densest subgraph, a subgraph whose average degree is the largest [15]. Such graphs are convenient variants for discovering dense communities as they can be discovered in polynomial time [15], as opposed to, e.g., cliques that are inapproximable [31].
Interestingly, the same peeling algorithm that is used for core decomposition can be used to 2-approximate the densest subgraph [8]. Tatti [25] proposed a variant of core decomposition in which the densest subgraph equals the innermost core. This decomposition is solvable in polynomial time and can be approximated using the same peeling strategy.
A distance-generalized clique is known as an h-club, a subgraph in which every node can reach every other node by a path of length at most h [20]. Here the path must stay inside the subgraph. Since cliques are 1-clubs, discovering maximum h-clubs is immediately inapproximable. Bonchi et al. [6] argued that the (k, h) decomposition can be used to aid in discovering maximum h-clubs.
Using sampling for parallelizing (normal) core computation was proposed by Esfandiari et al. [10]. Here, the authors sparsify the graph multiple times by sampling edges. The sampling probability depends on the core numbers: larger core numbers allow for more aggressive sparsification. The authors then use Chernoff bounds to prove the approximation guarantees. The authors were able to sample edges since the degree in the sparsified graph is an estimate of the degree in the original graph (multiplied by the sampling probability). This does not hold for (k, h) core decomposition because a node \(w \in N(v; h)\) can reach v through several paths.
Approximating h-neighborhoods can be seen as an instance of a cardinality estimation problem. A classic approach for solving such problems is HyperLogLog [11]. Adopting HyperLogLog or an alternative approach, such as [18], is a promising direction for future work, potentially speeding up the algorithm further. The challenge here is to maintain the estimates as the nodes are removed by Core.
8 Experimental evaluation
Our two main goals in the experimental evaluation are to study the accuracy and the computational time of Core.
8.1 Datasets and setup
We used 8 publicly available benchmark datasets. CaAstro and CaHep are collaboration networks between researchers.\(^{1}\) RoadPa and RoadTX are road networks in Pennsylvania and Texas.\(^{1}\) Amazon contains product pairs that are often co-purchased at a popular online retailer.\(^{1}\) Youtube contains user-to-user links in a popular video streaming service.\(^{2}\) Hyves and Douban contain friendship links in a Dutch and a Chinese social network, respectively.\(^{3}\) The sizes of the graphs are given in Table 1.
We implemented Core in C++\(^{4}\) and conducted the experiments using a single core (2.4 GHz). For Core, we used 8 GB of RAM, and for EstDense we used 50 GB. In all experiments, we set \(\delta = 0.05\).
8.2 Accuracy
In our first experiment, we compared the accuracy of our estimate \(c'(v)\) against the correct core numbers c(v). As a measure, we used the maximum relative error
Note that Proposition 4.3 states that the error should be less than \(\epsilon \) with high probability.
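Assuming the standard definition of the maximum relative error (the displayed formula is not reproduced here), the measure is a one-liner:

```python
def max_relative_error(exact, approx):
    """Maximum relative error between the exact core numbers c(v) and the
    estimates c'(v): max over v of |c'(v) - c(v)| / c(v). Assumes c(v) > 0."""
    return max(abs(approx[v] - exact[v]) / exact[v] for v in exact)
```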
The error as a function of \(\epsilon \) for CaHep and CaAstro datasets is shown in Fig. 1 for \(h = 3, 4\). We see from the results that the error tends to increase as a function of \(\epsilon \). As \(\epsilon \) decreases, the internal value M increases, reaching the point where the maximum core number is smaller than M. For such values, Proposition 4.3 guarantees that Core produces correct results. We see, for example, that this value is reached with \(\epsilon = 0.20\) for CaHep, and \(\epsilon = 0.15\) for CaAstro when \(h = 3\), and \(\epsilon = 0.35\) for Amazon when \(h = 4\).
8.3 Computational time
Our next experiment studies the computational time as a function of \(\epsilon \); the results are shown in Fig. 1. From the results, we see that the computational time generally increases as \(\epsilon \) decreases. The computational time flattens when we reach the point where \(c(v) \le M\) for every node v. In that case, the lists B[v, i] match the neighborhoods N(v, i) exactly and do not change if M is increased further. Consequently, decreasing \(\epsilon \) further does not change the running time. Interestingly, the running time increases slightly for Amazon and \(h = 4\) as \(\epsilon \) increases. This is most likely due to the increased number of Compute calls for smaller values of M.
Next, we compare the computational time of our method against the baselines lb and lub proposed by Bonchi et al. [6]. As our hardware setup is similar, we used the running times for the baselines reported by Bonchi et al. [6]. Here, we fixed \(\epsilon = 0.5\). The results are shown in Table 1.
We see from the results that for \(h = 2\) the results are dominated by lb. This is due to the fact that most, if not all, nodes have \(c(v) \le M\). In such cases, Core does not use any sampling and does not provide any speed-up. This is especially the case for the road networks, where the core numbers stay low even for larger values of h. On the other hand, Core outperforms the baselines in cases where c(v) is large, whether due to a larger h or due to denser networks. As an extreme example, lub required over 13 hours with 52 CPU cores to process Hyves, while Core provided an estimate in about 12 minutes using only 1 CPU core.
Interestingly enough, Core solves CaAstro faster when \(h = 4\) than when \(h = 3\). This is due to the fact that we stop when the current core value plus one is equal to the number of remaining nodes.
To further demonstrate the effect of the network size on the computation time, we generated a series of synthetic datasets. Each dataset is a stochastic block model with 10 blocks of equal size, \(C_1, \ldots , C_{10}\). To add a hierarchical structure, we set the probability of an edge between nodes in \(C_i\) and \(C_j\) with \(i < j\) to \(10^{-6} i^2\). We vary the number of nodes from 10 000 to 100 000. The computational times for our method, with \(h = 2, 3, 4\) and \(\epsilon = 0.5\), are shown in Fig. 2. As expected, the running times increase as the number of edges increases. Moreover, larger values of h require more processing time. We should stress that while Corollary 5.1 bounds the running time as quasi-linear, in practice the trend depends on the underlying model.
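The generator can be sketched as follows. The within-block edge probability is not stated above, so `p_within` here is an assumption, and the function name is illustrative:

```python
import random

def sbm_edges(n, blocks=10, p_scale=1e-6, p_within=1e-4, seed=0):
    """Stochastic block model with `blocks` equal-size blocks C_1, ..., C_b.
    An edge between nodes in C_i and C_j with i < j appears with probability
    p_scale * i**2; edges inside a block use the (assumed) p_within."""
    rng = random.Random(seed)
    size = n // blocks                    # n is assumed divisible by blocks
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            bu, bv = u // size + 1, v // size + 1   # 1-based block indices
            p = p_within if bu == bv else p_scale * min(bu, bv) ** 2
            if rng.random() < p:
                edges.append((u, v))
    return edges
```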
8.4 Dense subgraphs
Finally, we used EstDense to estimate the densest subgraph for \(h = 2, 3, 4\). We set \(\epsilon = 0.5\) and \(\delta = 0.05\). The results, shown in Table 2, are as expected. Both the density and the size of the h-densest subgraphs increase as a function of h. The dense subgraphs are generally smaller and less dense for the sparse graphs, such as the road networks.
In our experiments, the running times for EstDense were generally smaller but comparable to the running times of Core. The speed-up is largely due to the pruning of nodes with smaller core numbers. The exception was Youtube with \(h = 3\), where EstDense required over 23 minutes. The slowdown is due to Core using lazy initialization of B[v, i], whereas EstDense needs B[v, h] to be computed in order to obtain d[v]. This is also the reason why EstDense requires more memory in practice.
9 Concluding remarks
In this paper, we introduced a randomized algorithm for approximating the distance-generalized core decomposition. The major advantage over the exact computation is that the approximation can be done in \({{\mathcal {O}}}{\left( \epsilon ^{-2} hm (\log ^2 n - \log \delta )\right) }\) time, whereas the exact computation may require \({{\mathcal {O}}}{\left( n^2m\right) }\) time. We also studied distance-generalized dense subgraphs, proving that the problem is NP-hard, and extended the guarantee results of [6] to approximate core decompositions.
The algorithm is based on sampling the h-neighborhoods of the nodes. We prove the approximation guarantee with Chernoff bounds. Maintaining the sampled h-neighborhood requires carefully designed bookkeeping in order to obtain the needed computational complexity. This is especially the case since the sampling probability may change as the graph gets smaller during the decomposition.
In practice, the sampling complements the exact algorithm. For the setups where the exact algorithm struggles, our algorithm outperforms the exact approach by a large margin. Such setups include well-connected networks and values of h larger than 3.
An interesting direction for future work is to study whether the heuristics introduced by Bonchi et al. [6] can be incorporated into the sampling approach in order to obtain an even faster decomposition method.
References
Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2006) Large scale networks fingerprinting and visualization using the k-core decomposition. In: Advances in neural information processing systems, pp 41–50
Bader GD, Hogue CWV (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform 4(1):1–27
Batagelj V, Zaveršnik M (2011) Fast algorithms for determining (generalized) core groups in social networks. Adv Data Anal Classif 5(2):129–145
Bollobás B (1984) The evolution of random graphs. Trans Am Math Soc 286(1):257–274
Bonchi F, Gullo F, Kaltenbrunner A, Volkovich Y (2014) Core decomposition of uncertain graphs. In: proceedings of the international conference on knowledge discovery and data mining (KDD), pp 1316–1325
Bonchi F, Khan A, Severini L (2019) Distance-generalized core decomposition. In: proceedings of the 2019 international conference on management of data, pp 1006–1023
Carmi S, Havlin S, Kirkpatrick S, Shavitt Y, Shir E (2007) A model of internet topology using k-shell decomposition. Proc Natl Acad Sci 104(27):11150–11154
Charikar M (2000) Greedy approximation algorithms for finding dense components in a graph. In: APPROX
Cohen J (2008) Trusses: cohesive subgraphs for social network analysis. National security agency technical report, 16(3.1)
Esfandiari H, Lattanzi S, Mirrokni V (2018) Parallel and streaming algorithms for k-core decomposition. In: international conference on machine learning, pp 1397–1406
Flajolet P, Fusy É, Gandouet O, Meunier F (2007) Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In: Discrete mathematics and theoretical computer science, pp 137–156
Galimberti E, Bonchi F, Gullo F (2017) Core decomposition and densest subgraph in multilayer networks. In: proceedings of the 2017 ACM on conference on information and knowledge management, pp 1807–1816
Galimberti E, Barrat A, Bonchi F, Cattuto C, Gullo F (2018) Mining (maximal) span-cores from temporal networks. In: proceedings of the 27th ACM international conference on information and knowledge management, pp 107–116
Giatsidis C, Thilikos DM, Vazirgiannis M (2013) D-cores: measuring collaboration of directed graphs based on degeneracy. Knowl Inf Syst 35(2):311–343
Goldberg AV (1984) Finding a maximum density subgraph. University of California Berkeley Technical report
Hagmann P, Cammoun L, Gigandet X, Meuli R, Honey CJ, Wedeen VJ, Sporns O (2008) Mapping the structural core of human cerebral cortex. PLoS Biol 6(7):888–893
Huang X, Cheng H, Qin L, Tian W, Yu J (2014) Querying k-truss community in large and dynamic graphs. In: proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp 1311–1322
Kane DM, Nelson J, Woodruff DP (2010) An optimal algorithm for the distinct elements problem. In: proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp 41–52
Kitsak M, Gallos LK, Havlin S, Liljeros F, Muchnik L, Stanley HE, Makse HA (2010) Identification of influential spreaders in complex networks. Nat Phys 6(11):888–893
Mokken RJ et al (1979) Cliques, clubs and clans. Qual Quant 13(2):161–173
Sarıyüce AE, Pinar A (2016) Fast hierarchy construction for dense subgraphs. Proceedings of the VLDB Endowment, 10(3)
Sariyuce AE, Seshadhri C, Pinar A, Catalyurek UV (2015) Finding the hierarchy of dense subgraphs using nucleus decompositions. In: proceedings of the 24th international conference on world wide web, pp 927–937
Seidman SB (1983) Network structure and minimum degree. Soc Netw 5(3):269–287
Serrano MÁ, Boguná M, Vespignani A (2009) Extracting the multiscale backbone of complex weighted networks. Proc Natl Acad Sci 106(16):6483–6488
Tatti N (2019) Density-friendly graph decomposition. ACM Trans Knowl Discov Data (TKDD) 13(5):1–29
Tatti N (2021) Fast computation of distance-generalized cores using sampling. In: ICDM
Ugander J, Backstrom L, Marlow C, Kleinberg J (2012) Structural diversity in social contagion. Proc Natl Acad Sci 109(16):5962–5966
Wang J, Cheng J (2012) Truss decomposition in massive networks. Proceedings of the VLDB Endowment, 5(9)
Zhang Y, Parthasarathy S (2012) Extracting analyzing and visualizing triangle k-core motifs within networks. In: 2012 IEEE 28th international conference on data engineering, IEEE. pp 1049–1060
Zhao F, Tung AKH (2012) Large scale cohesive subgraphs discovery for social network visual analysis. Proc VLDB Endow 6(2):85–96
Zuckerman D (2006) Linear degree extractors and the inapproximability of max clique and chromatic number. In: proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp 681–690
Acknowledgements
This research was supported by the Academy of Finland project MALSOME (343045).
Funding
Open Access funding provided by University of Helsinki including Helsinki University Central Hospital.
Proof of Proposition 4.2
We start with stating several Chernoff bounds.
Lemma A.1
(Chernoff bounds) Let \(X_1, \ldots , X_d\) be d independent Bernoulli random variables with \(P(X_i = 1) = \mu \). Let \(S = \sum _{i = 1}^d X_i\). Then,
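The displayed bounds (Eqs. 9–11) are the standard multiplicative Chernoff bounds; a reconstruction consistent with the proof below, stated for \(0 < \epsilon \le 1\), reads:

```latex
P\left(S \ge (1 + \epsilon) d\mu\right) \le \exp\left(-\epsilon^2 d\mu / 3\right), \qquad (9)
P\left(S \le (1 - \epsilon) d\mu\right) \le \exp\left(-\epsilon^2 d\mu / 2\right), \qquad (10)
P\left(\left|S - d\mu\right| \ge \epsilon d\mu\right) \le 2\exp\left(-\epsilon^2 d\mu / 3\right). \qquad (11)
```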
Proof
Equations 9–10 are standard multiplicative Chernoff bounds. Equation 11 follows from Eqs. 9–10 by a union bound, completing the claim. \(\square \)
To prove Proposition 4.2, we first need the following technical lemma.
Lemma A.2
Assume \(0 < \epsilon \le 1/2\). Let \(R_1, \ldots , R_d\) be independent random variables sampled from the geometric distribution \({geo}{\left( 1/2\right) }\). Define
to be the number of variables \(\left\{ R_j\right\} \) that are larger than or equal to i. Assume \(C > 0\) and define M as in Eq. 1. Assume that \(M \le d\). Let \(\ell \ge 1\) be an integer such that
Then with probability \(1 - \exp \left( -C\right) \), we have
and
where \(k = \ell -1, \ell , \ell + 1\) and \(\mu _k = 2^{-k}\).
Proof
First, note that Eq. 12 implies
To prove the lemma, let us define the events
and
We will prove the result with union bound by showing that
To bound \(P(A_k)\), observe that \(P(R_j \ge k) = \mu _k\). The Chernoff bound now states that for \(k \le \ell + 1\) we have
Next, we bound \(B_1\), assuming \(\ell > 1\) as otherwise we can ignore the term, with
and \(B_2\) with
The bounds for \(P(B_1)\), \(P(B_2)\), and \(P(A_k)\) complete the proof. \(\square \)
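The concentration in Lemma A.2 can be checked empirically with a small simulation. Here \({geo}{\left( 1/2\right) }\) is taken as the number of failed fair-coin flips before the first success, so that \(P(R \ge k) = 2^{-k}\) matches the convention used in the proof; this convention is an assumption about the elided definition:

```python
import random

def tail_counts(d, i_max, seed=7):
    """Draw d samples from geo(1/2) and return T_i = #{j : R_j >= i} for
    i = 0, ..., i_max. By the lemma, T_i concentrates around d * 2^{-i}."""
    rng = random.Random(seed)
    def geo():
        r = 0
        while rng.random() < 0.5:        # keep flipping while we see "tails"
            r += 1
        return r
    draws = [geo() for _ in range(d)]
    return [sum(1 for r in draws if r >= i) for i in range(i_max + 1)]
```

With \(d = 10\,000\), the count \(T_1\) lands near 5 000 and roughly halves with each further step of i.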
Proof of Proposition 4.2
Let \(S_i\), \(T_i\), and k be as defined in Definition 4.1 for \({\varDelta }{\left( {\mathcal {R}}; M\right) }\). Let \(\ell \) be as defined in Eq. 12. We can safely assume that \(M \le d\).
Assume that the events in Lemma A.2 hold. Then, Eq. 13 guarantees that \(k = \ell - 1, \ell , \ell + 1\).
Write \(Y_i = T_i2^i\) and \(Z_i = M2^{i - 1}\). Equation 14 guarantees that
for \(i = \ell - 1, \ell , \ell + 1\). If \(k = 0\) or \(Y_k \ge Z_k\), then \({\varDelta }{\left( {\mathcal {R}}; M\right) } = Y_k\) and we are done.
Assume \(k > 0\) and \(Y_k < Z_k\). Then immediately
To prove the other direction, first assume that \(k > \ell - 1\). By definition of k, we have \(S_{k - 1} > M\) and consequently \(T_{k - 1} \ge M\). Thus,
where the last inequality is given by Eq. 16. On the other hand, if \(k = \ell - 1\), then
where the second inequality holds since \(k > 0\) implies that \(d \ge 2\). In summary, Eq. 2 holds.
Since the events in Lemma A.2 hold with probability of \(1 - \exp \left( -C\right) \), the claim follows. \(\square \)
Tatti, N. Fast computation of distance-generalized cores using sampling. Knowl Inf Syst 65, 2429–2453 (2023). https://doi.org/10.1007/s10115-023-01830-9