Fast computation of distance-generalized cores using sampling

Core decomposition is a classic technique for discovering densely connected regions in a graph, with a large range of applications. Formally, a k-core is a maximal subgraph where each vertex has at least k neighbors. A natural extension of a k-core is a (k, h)-core, where each node must have at least k nodes that can be reached with a path of length h. The downside of using (k, h)-core decomposition is the significant increase in the computational complexity: whereas the standard core decomposition can be done in O(m) time, the generalization can require O(n^2 m) time, where n and m are the number of nodes and edges in the given graph.
In this paper, we propose a randomized algorithm that produces an ε-approximation of (k, h)-core decomposition with a probability of 1 − δ in O(ε^{−2} h m (log^2 n − log δ)) time. The approximation is based on sampling the neighborhoods of nodes, and we use the Chernoff bound to prove the approximation guarantee. We also study distance-generalized dense subgraphs, show that the problem is NP-hard, provide an algorithm for discovering such graphs with approximate core decompositions, and provide theoretical guarantees for the quality of the discovered subgraphs. We demonstrate empirically that approximating the decomposition complements the exact computation: computing the approximation is significantly faster than computing the exact solution for the networks where computing the exact solution is slow.


Introduction
Core decomposition is a classic technique for discovering densely connected regions in a graph. The appeal of core decomposition lies in its simple and intuitive definition, and in the fact that the decomposition can be computed in linear time. Core decomposition has a large range of applications such as graph visualization [1], graph modeling [4], social network analysis [23], internet topology modeling [7], influence analysis [19,27], bioinformatics [2,16], and team formation [5].
More formally, a k-core is a maximal subgraph in which every vertex has degree at least k. It can be shown that k-cores form a nested structure: the (k + 1)-core is a subset of the k-core, and the core decomposition can be discovered in linear time [3]. Core decomposition has been extended to directed [14], multi-layer [12], temporal [13], and weighted [24] networks.
A natural extension of core decomposition, proposed by Bonchi et al. [6], is a distance-generalized core decomposition, or (k, h)-core decomposition, where the degree is replaced by the number of nodes that can be reached with a path of length h. Here, h is a user parameter, and h = 1 reduces to the standard core decomposition. Using distance-generalized core decomposition may produce a more refined decomposition [6]. Moreover, it can be used when discovering h-clubs, distance-generalized dense subgraphs, and distance-generalized chromatic numbers [6].
Studying such structures may be useful in graphs where paths of length h reveal interesting information. For example, consider an authorship network, where an edge between a paper and a researcher indicates that the researcher was an author of the paper. Then paths of length 2 contain co-authorship information.
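To make the example concrete, the following sketch (with a hypothetical input format) extracts the pairs of researchers connected by a 2-path, i.e., the co-authorship pairs, from such a bipartite paper-author network:

```python
from itertools import combinations

def coauthor_pairs(paper_authors):
    """Pairs of researchers joined by a 2-path (researcher-paper-researcher).

    paper_authors: dict mapping a paper to the list of its authors.
    """
    pairs = set()
    for authors in paper_authors.values():
        # Any two authors of the same paper are connected by a 2-path
        # through the paper node.
        for a, b in combinations(sorted(set(authors)), 2):
            pairs.add((a, b))
    return pairs
```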
The major downside of using the distance-generalized core decomposition is the significant increase in the computational complexity: whereas the standard core decomposition can be done in O(m) time, the generalization can require O(n^2 m) time, where n and m are the number of nodes and edges in the given graph.
To combat this problem we propose a randomized algorithm that produces an ε-approximation of (k, h)-core decomposition with a probability of 1 − δ in O(ε^{−2} h m (log^2 n − log δ)) time.
The intuition behind our approach is as follows. In order to compute the distance-generalized core decomposition we need to discover and maintain h-neighborhoods for each node. We can discover the h-neighborhood of a node v by taking the union of the (h − 1)-neighborhoods of the adjacent nodes, which leads to a simple dynamic program. The computational bottleneck comes from the fact that these neighborhoods may become too large. So, instead of computing the complete neighborhood, we have a carefully selected budget M.
The moment the neighborhood becomes too large, we delete (roughly) half of the nodes, and to compensate for the sampling we multiply our size estimate by 2. This procedure is repeated as often as needed. Since we are able to keep the neighborhood samples small, we are able to compute the decomposition faster.
We use Chernoff bounds to determine an appropriate value for M, and provide algorithms for maintaining the h-neighborhoods. The maintenance requires special attention since if the h-neighborhood becomes too small we need to bring back the deleted nodes.
Finally, we study distance-generalized dense subgraphs, a notion proposed by Bonchi et al. [6] that extends the notion of dense subgraphs. Here the density is the ratio of h-connected node pairs to nodes. We show that the problem is NP-hard and propose an algorithm based on approximate core maps, extending the results by Bonchi et al. [6].
The rest of the paper is organized as follows. In Section 2 we introduce preliminary notation and formalize the problem. In Section 3 we present a naive version of the algorithm that yields approximate results but is too slow. We prove the approximation guarantee in Section 4, and speed up the algorithm in Section 5. In Section 6 we study distance-generalized dense subgraphs. We discuss the related work in Section 7. Finally, we compare our method empirically against the baselines in Section 8 and conclude the paper with a discussion in Section 9. This work extends the conference paper [26].

Preliminaries and problem definition
In this section we establish preliminary notation and define our problem. Assume an undirected graph G = (V, E) with n nodes and m edges. We will write A(v) for the set of nodes adjacent to v. Given an integer h, we define an h-path to be a sequence of at most h + 1 adjacent nodes. An h-neighborhood N(v; h, X) is then the set of nodes that are reachable from v with an h-path in a set of nodes X. If X = V or otherwise clear from context, we will drop it from the notation. Note that N(v; 1) = A(v) ∪ {v}.
We will write deg(v; h, X) = |N(v; h, X)| − 1, where X is a set of nodes and v ∈ X. We will often drop X from the notation if it is clear from the context.
A k-core is the maximal subgraph of G in which all nodes have degree at least k. Discovering the cores can be done in O(m) time by iteratively deleting the vertex with the smallest degree [23].
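As an illustration, this linear-time peeling procedure for standard cores can be sketched as follows (a minimal Python sketch with bucket queues, not the implementation used in the paper):

```python
from collections import defaultdict

def core_numbers(adj):
    """Compute k-core numbers by repeatedly peeling a minimum-degree vertex.

    adj: dict mapping each node to a set of neighbors (undirected graph).
    Returns a dict mapping each node to its core number.
    """
    deg = {v: len(adj[v]) for v in adj}
    # Bucket the nodes by their current degree.
    buckets = defaultdict(set)
    for v, d in deg.items():
        buckets[d].add(v)
    core = {}
    removed = set()
    k = 0
    for _ in range(len(adj)):
        # Advance to the smallest non-empty bucket; the minimum degree never
        # drops below k because degrees are clamped at k below.
        while not buckets[k]:
            k += 1
        v = buckets[k].pop()
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed and deg[w] > k:
                buckets[deg[w]].remove(w)
                deg[w] -= 1
                buckets[deg[w]].add(w)
    return core
```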
Bonchi et al. [6] proposed to extend the notion of k-cores to (k, h)-cores. Here, given an integer h, a (k, h)-core is the maximal subgraph H such that deg(v; h, H) ≥ k for every node v in H, that is, we can reach at least k nodes from v with an h-path. The core number c(v) of a vertex v is the largest k such that v is contained in a (k, h)-core H. We will call H the core graph of v and we will refer to c as the core map.
Note that discovering (k, 1)-cores is equal to discovering standard k-cores. We follow the same strategy when computing (k, h)-cores as with standard cores: we iteratively find and delete the vertex with the smallest degree [6]. We will refer to this exact algorithm as ExactCore. While ExactCore is guaranteed to produce the correct result, the computational complexity deteriorates to O(n^2 m). The main reason is that the neighborhoods N(v; h) can be significantly larger than just the adjacent nodes A(v).
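A direct, unoptimized sketch of this peeling strategy (illustrative Python that recomputes the h-degrees with breadth-first searches; this exhibits the naive O(n^2 m) behavior, not the optimized variant of [6]):

```python
from collections import deque

def h_degree(adj, v, h, alive):
    """Number of nodes (other than v) reachable from v with a path of
    at most h edges, restricted to the nodes in `alive`."""
    seen = {v}
    frontier = deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == h:
            continue
        for w in adj[u]:
            if w in alive and w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    return len(seen) - 1

def exact_kh_cores(adj, h):
    """Naive (k, h)-core numbers: repeatedly delete a vertex with the
    smallest h-degree, keeping a running maximum as the core number."""
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        v = min(alive, key=lambda u: h_degree(adj, u, h, alive))
        k = max(k, h_degree(adj, v, h, alive))
        core[v] = k
        alive.remove(v)
    return core
```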
In this paper we consider approximating cores.
Definition 2.1 (approximative (k, h)-core). Given a graph G, an integer h, and an approximation guarantee ε, an ε-approximative core map c′ : V → N maps a node to an integer such that |c′(v) − c(v)| ≤ ε c(v) for every node v. We will introduce an algorithm that computes an ε-approximative core map with a probability of 1 − δ in quasilinear time.

Naive, slow algorithm
In this section we introduce the basic idea of our approach. This version of the algorithm will still be too slow but will approximate the cores accurately. We will prove the accuracy in the next section, and then refine the subroutines to obtain the needed computational complexity.
The bottleneck for computing the cores is maintaining the h-neighborhood N(v; h) for each node v as we delete the nodes. Instead of maintaining the complete h-neighborhood we will keep only certain nodes if the neighborhood becomes too large. We then compensate for the sampling when estimating the size of the h-neighborhood.
Assume that we are given a graph G, an integer h, an approximation guarantee ε, and a probability threshold δ. Let us define the numbers C = log(2n/δ) and M as given in Eq. 1. The quantity M will act as an upper bound for the size of the sampled h-neighborhood, while C will be useful when analyzing the properties of the algorithm. We will see later that these specific values yield the approximation guarantees. We start the algorithm by sampling the rank of each node from a geometric distribution, r[v] = geo(1/2). Note that ties are allowed. During the algorithm we maintain two key variables B[v, i] and k[v, i] for each v ∈ V and each index i = 1, ..., h. Here, B[v, i] is a sample of the i-neighborhood of v containing only the nodes u with r[u] ≥ k[v, i]. We can estimate c(v) from B[v, h] and k[v, h] as follows: consider the quantity d = 2^{k[v,h]} |B[v, h]|. Note that for an integer k the probability of a vertex v having a rank r[v] ≥ k is 2^{−k}. This hints that d is a good estimate for c(v). We show in the next section that this is indeed the case, but d lacks an important property that we need in order to prove the correctness of the algorithm. Namely, d can increase while we are deleting nodes. To fix this pathological case we estimate c(v) with max(d, M 2^{k[v,h]−1}) if k[v, h] > 0, and with d if k[v, h] = 0. The pseudo-code for the estimate is given in Algorithm 1.
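The rank-based estimate can be illustrated in isolation as follows (a toy sketch with hypothetical helper names; the actual algorithm maintains the samples B[v, i] incrementally per node and level rather than from a full node set):

```python
import random

def sample_rank(rng=random):
    """Draw a rank from geo(1/2): the number of heads before the first tails."""
    r = 0
    while rng.random() < 0.5:
        r += 1
    return r

def estimate_size(nodes, ranks, M):
    """Estimate |nodes| while keeping at most M sampled nodes.

    Keep only the nodes whose rank is at least k, raising the threshold k
    whenever the sample exceeds the budget M.  Since P(rank >= k) = 2^-k,
    the estimate is 2^k times the sample size.
    """
    k = 0
    sample = set(nodes)
    while len(sample) > M:
        k += 1
        sample = {v for v in sample if ranks[v] >= k}
    return (2 ** k) * len(sample), sample, k
```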
To compute B[v, i] we have the following observation.
Proposition 3.1. For any v ∈ V and any i = 1, ..., h, we have N(v; i) = {v} ∪ ⋃_{w ∈ A(v)} N(w; i − 1).
The proposition leads to Compute, an algorithm for computing B[v, i], given in Algorithm 2. Here, we form a set T, a union of the sets B[w, i − 1], where w ∈ A(v). After T is formed we search for the smallest threshold k[v, i] ≥ max_{w ∈ A(v)} k[w, i − 1] that yields at most M nodes in T, and store the resulting nodes as B[v, i]. When a node, say u, is deleted we need to update the affected nodes. We do this update in Algorithm 3 by recomputing the neighbors v ∈ A(u) and checking whether B[v, i] and k[v, i] have changed; if they have, then we recompute B[w, i + 1] for all w ∈ A(v), and so on.
The main algorithm Core, given in Algorithm 4, initializes B[v, i] using Compute, then iteratively deletes the node with the smallest estimate d[v] while updating the sets B[v, i] with Update.

Approximation guarantee
In this section we will prove the approximation guarantee of our algorithm. The key step is to show that Estimate produces an accurate estimate. For notational convenience, we need the following definition.
Definition 4.1. Given a list of ranks X = (x_1, ..., x_d) and a budget M, define S_i to be the number of entries larger than or equal to i. Let k ≥ 0 be the smallest integer for which S_k is at most M.
Our first step is to show that ∆(X; M) is monotonic.
Note that this claim would not hold if we did not have the M 2^{k−1} term in the definition of ∆(X; M).
Proof Let k, S_i, and T_i be as defined for ∆(x_1, ..., x_d; M) in Definition 4.1. Also, let k′, S′_i, and T′_i be as defined for ∆(x_1, ..., x_{d′}; M) in Definition 4.1. Since S′_i ≤ S_i, we have k′ ≤ k. If k′ = k, the claim follows immediately.
Next we formalize the accuracy of ∆(X; M). We prove the claim in the Appendix.
Proposition 4.3. Assume a graph G with n nodes, ε > 0, and C > 0. For each node v ∈ V, let c(v) be the core number reported by ExactCore and let c′(v) be the core number reported by Core. Then with probability 1 − 2ne^{−C} we have |c′(v) − c(v)| ≤ ε c(v) for every node v.
We will prove the main claim of the proposition with two lemmas. In both proofs we will use the variable τ_v, which we define to be the value of d[v] when v is deleted by Core.
The first lemma establishes a lower bound.
Proof For each node v ∈ V, let R_v be a rank, an independent random variable sampled from the geometric distribution geo(1/2).
Let H_v be the core graph of v as solved by ExactCore. Define S_v = N(v; h) ∩ H_v to be the h-neighborhood of v in H_v. Note that c(v) ≤ |S_v| − 1. Let R_v be the list of ranks (R_w; w ∈ S_v) such that R_v is always the first element.
Proposition 4.2 combined with the union bound states that the required bound holds with probability 1 − ne^{−C} for every node v. Assume that these events hold.
To prove the claim, select a node v and let w be the first node in H_v deleted by Core. Let F be the graph right before w is deleted by Core. The bound then follows, proving the lemma.
Next, we establish the upper bound.
Proof For each node v ∈ V, let R_v be an independent random variable sampled from the geometric distribution geo(1/2). Consider the exact algorithm ExactCore for solving the (k, h)-core problem. Let H_v be the graph induced by the existing nodes right before v is removed by ExactCore. Define S_v = N(v; h) ∩ H_v to be the h-neighborhood of v in H_v. Note that c(v) ≥ |S_v| − 1. Let R_v be the list of ranks (R_w; w ∈ S_v) such that R_v is the first element.
Proposition 4.2 combined with the union bound states that the required bound holds with probability 1 − ne^{−C} for every node v. Assume that these events hold. Select a node v. Let W be the set containing v and the nodes selected before v by Core. Select w ∈ W. Let F be the graph right before w is deleted by Core. Let u be the node in F that is deleted first by ExactCore. Let β be the corresponding value of d. Since this bound holds for any w ∈ W, the lemma follows.
We are now ready to prove the proposition.
Proof of Proposition 4.3 The probability that one of the two above lemmas does not hold is bounded, by the union bound, by 2ne^{−C}, proving the main claim.
To prove the second claim, note that when c(v) ≤ M the estimate d[v] matches accurately the number of remaining nodes that can be reached by an h-path from a node v. On the other hand, if there is a node w that reaches more than M nodes, we are guaranteed that d[w] ≥ M and k[w, h] > 0, implying that Core will always prefer deleting v before w. Consequently, at the beginning Core will select the nodes in the same order as ExactCore and report the same core numbers as long as there are nodes with d[v] ≤ M or, equally, as long as c(v) ≤ M.

Updating data structures faster
Now that we have proven the accuracy of Core, our next step is to address the computational complexity. The key problem is that Compute is called too often and the implementation of Update is too slow.
As Core progresses, the set B[v, i] is modified in two ways. The first case is when some nodes become too far away, and we need to delete these nodes from B[v, i]. The second case is when we have deleted enough nodes so that we can lower k[v, i] and introduce new nodes. Our naive version of Update calls Compute for both cases. We will modify the algorithms so that Compute is called only to handle the second case, and the first case is handled separately. Note that these modifications do not change the output of the algorithm.
First, we change the information stored in B[v, i]. Instead of storing just a node u, we will store a pair (u, z), where z is the number of neighbors w ∈ A(v) such that u is in B[w, i − 1]. We will store B[v, i] as a linked list sorted by the rank. In addition, each node u ∈ B[w, i − 1] is augmented with an array of pointers to the corresponding entries.
We will need two helper functions to maintain B[v, i]. The first function is a standard merge sort, MergeSort(X, Y), that merges two sorted lists in O(|X| + |Y|) time, maintaining the counters and the pointers.
The other function is Delete(X, Y), which removes the nodes in Y from X; we will use it to remove nodes from B[v, i]. The deletion is done by reducing the counters of the corresponding nodes in X by 1, and removing a node when its counter reaches 0. It is vital that we can process a single node y ∈ Y in constant time. This is possible because we can use the pointer array described above.
Let us now consider calling Compute. We would like to minimize the number of calls to Compute. In order to do that, we need additional bookkeeping. The first additional piece of information is the counter m[v, i]. Unfortunately, maintaining just m[v, i] is not enough. We may have k[v, i] > k[w, i − 1] for some w ∈ A(v) immediately after Compute. In such a case, we compute sets of nodes X_w and combine them in D[v, i], a union of the sets X_w along with counter information similar to B[v, i]. The key observation is that, as long as the corresponding condition holds, B[v, i] does not need to be updated.
There is one complication, namely, that the set we need to compute can have more than M elements. Hence, using MergeSort will not work. Moreover, a stock k-ary merge sort requires O(M deg(v) log deg(v)) time. The key observation for avoiding the additional log factor is that D[v, i] does not need to be sorted. More specifically, we first compute an array of counters and then extract the non-zero entries to form D[v, i]. We only need to compute the non-zero entries, so we can compute these entries in O(Σ_w |X_w|) ⊆ O(M deg v) time. Moreover, since we do not need to keep them in order, we can extract the non-zero entries in the same time. We will refer to this procedure as Union; it takes the sets X_w as input and forms D[v, i].
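The counter-based bookkeeping behind Union and Delete can be sketched as follows (illustrative Python with dictionaries standing in for the arrays and linked lists described above):

```python
def union_counts(neighbor_sets):
    """Union of the sets X_w with multiplicities, without sorting.

    Runs in O(sum of the set sizes): one counter update per occurrence.
    """
    count = {}
    for X in neighbor_sets:
        for u in X:
            count[u] = count.get(u, 0) + 1
    return count

def delete_nodes(count, Y):
    """Remove one occurrence of each node in Y; drop a node when its
    counter reaches 0 (constant time per processed node)."""
    for u in Y:
        count[u] -= 1
        if count[u] == 0:
            del count[u]
```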
We need to maintain D[v, i] efficiently. In order to do that we augment each node u ∈ B[w, i − 1] with an array (q_v | v ∈ A(w)), where q_v points to the location of u in D[v, i]. The pseudo-code for the updated Compute is given in Algorithm 5. Here we compute B[v, i] and k[v, i] first by using MergeSort iteratively and trimming the resulting set if it has more than M elements. We then proceed to compute D[v, i]. The above discussion leads immediately to the computational complexity of Compute.
The pseudo-code for Update is given in Algorithm 6. Here, we maintain a stack U of tuples (v, Y), where v is a node that requires an update, and Y are the nodes that have been deleted from B[v, i] during the previous round.
with a probability of 1 − n exp(−C), where M is defined in Eq. 1.
Algorithm 5: Refined version of Compute(v, i). Recomputes B[v, i] and k[v, i] from scratch.
Proof We will prove the proposition by bounding R_1 + R_2, where R_1 is the total time needed by Compute and R_2 is the total time needed by the inner loop in Update.
We will bound R_1 first. Note that a single call of Compute(v, i) requires O(M deg v) time.
To bound the number of Compute calls, let us first bound k[v, i]. Proposition 4.2 and the union bound imply a bound on k[v, i]. We claim that Compute(v, i) is called at most twice for each value of k[v, i]. To see this, note that Compute(v, i) sets m[v, i] = 0 and also sets D[v, i]; the first condition on Line 9 of Update then governs the next call of Compute(v, i), and the second condition on Line 9 of Update governs the call after that. In summary, this bounds the time needed by Compute.
Algorithm 6: Refined version of Update(u). Deletes u and updates the affected B[v, i] and k[v, i].
In order to speed up the algorithm further we employ two additional heuristics. First, we can safely delay the initialization of B[v, i] until every B[w, i − 1], where w ∈ A(v), yields a core estimate that is below the current core number. Delaying the initialization allows us to ignore B[v, i] during Update. Second, if the current core number exceeds the number of remaining nodes, then we can stop and use the current core number for the remaining nodes. While these heuristics do not provide any additional theoretical advantage, they speed up the algorithm in practice.

Distance-generalized dense subgraphs
In this section we will study distance-generalized dense subgraphs, a notion introduced by Bonchi et al. [6].
In order to define the problem, let us first define E_h(X) to be the set of node pairs in X that are connected with an h-path in X. We exclude node pairs of the form (u, u). Note that E(X) = E_1(X).
We define the h-density of X to be the ratio of |E_h(X)| and |X|, that is, dns(X; h) = |E_h(X)| / |X|. We will sometimes drop h from the notation if it is clear from the context.
Problem 6.1 (Dense). Given a graph G and an integer h, find the subgraph X maximizing dns(X; h).
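For concreteness, the h-density of a node set can be computed directly with breadth-first searches restricted to X (an illustrative sketch; the paper's algorithms avoid this expensive direct computation):

```python
from collections import deque

def h_density(adj, X, h):
    """dns(X; h): number of distinct node pairs of X connected by a path
    of at most h edges inside X, divided by |X|."""
    X = set(X)
    pairs = 0
    for v in X:
        # BFS inside X up to depth h.
        seen = {v}
        frontier = deque([(v, 0)])
        while frontier:
            u, d = frontier.popleft()
            if d == h:
                continue
            for w in adj[u]:
                if w in X and w not in seen:
                    seen.add(w)
                    frontier.append((w, d + 1))
        pairs += len(seen) - 1
    # Each unordered pair was counted once from each endpoint.
    return (pairs // 2) / len(X)
```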
Dense can be solved for h = 1 in polynomial time using fractional programming combined with minimum cut problems [15]. However, the distance-generalized version of the problem is NP-hard.
Proposition 6.1. Dense is NP-hard even for h = 2.
To prove the result we will use extensively the following lemma.
Lemma 6.1. Let X be the densest subgraph. Let Y ⊆ X and Z ∩ X = ∅. Then the claimed density inequalities hold.
Proof of Proposition 6.1 To prove the claim we will reduce 3Dmatch to our problem.
In an instance of 3Dmatch we are given a universe U = {u_1, ..., u_n} of size n and a collection C of m sets of size 3, and we ask whether there is an exact cover of U in C. We can safely assume that C_1 does not intersect with any other set. Otherwise, we can add a new set and 3 new items without changing the outcome of the instance.
For each u_i ∈ U, we add k nodes a_ij, where j = 1, ..., k. For each a_ij, we add 2ℓ unique nodes that are all connected to a_ij. We will denote the resulting star by S_ij. We will select a non-center node from S_ij and denote it by b_ij. For each set C_t ∈ C, we add a node, say p_t, and connect it to b_ij for every u_i ∈ C_t and j = 1, ..., k. We will denote the collection of these nodes by P. We connect every node in P to p_1.
Let X be the nodes of the densest subgraph for h = 2. Let Q = P ∩ X and let R be the corresponding sets in C.
To simplify the notation we will need the following counts of node pairs. First, let us define α to be the number of node pairs in a single S_ij connected with a 2-path. Second, let us define β to be the number of node pairs connected with a 2-path using a single node p_t ∈ P; since p_t connects 3k nodes b_ij and reaches 3k nodes b_ij and 3k nodes a_ij, β is straightforward to compute. Finally, consider W consisting of a single p_t and the corresponding 3k stars. Let us write γ = 3kα + β for the number of node pairs connected by a 2-path in W.
We will prove the proposition with a series of claims.
Claim 1: dns(X) > ℓ. The density of W as defined above is larger than ℓ. Thus, dns(X) ≥ dns(W) > ℓ.
Claim 2: R is disjoint. To prove the claim, assume otherwise and let C_t, with t > 1, be a set overlapping with some other set in R.
Let us bound the number of node pairs that are connected solely through p_t. The node p_t connects 3k + 1 nodes in V. Out of these nodes, at least k + 1 are connected by another node in X. In addition, p_t reaches a_ij and b_ij, where u_i ∈ C_t and j = 1, ..., k, that is, 6k nodes in total. Finally, p_t may connect to every other node in P, at most m − 1 nodes, and every a_ij connected to p_1, at most 3k nodes. In summary, we obtain an upper bound on these node pairs. Lemma 6.1 with Y = {p_t} now contradicts the optimality of X. Thus, R is disjoint.
To prove the claim assume that S_ij ∩ X ≠ ∅. Assume that b_ij ∉ X. Then S_ij ∩ X is a disconnected component with density less than ℓ, contradicting Lemma 6.1. Assume that b_ij ∈ X and a_ij ∉ X. Then deleting b_ij will reduce at most 3k + m − 1 < ℓ connected node pairs, contradicting Lemma 6.1.
Assume that b_ij, a_ij ∈ X. If S′_ij ∩ X = ∅, then deleting a_ij will reduce at most 2 connected node pairs, contradicting Lemma 6.1.
Claim 4: If p_t ∈ X, then X contains every corresponding S_ij. To prove the claim assume otherwise.
Assume first that there is no corresponding S_ij in X for p_t. If t > 1, then p_t reaches at most m − 1 + 3k nodes. If t = 1, then p_1 connects at most m − 1 nodes and reaches at most (m − 1)(3k + 1) nodes.
Both cases contradict Lemma 6.1. Assume there is at least one corresponding S_ij in X but not all. Then such an S_ij is disconnected and has density ℓ, contradicting Lemma 6.1.
The previous claims together show that the density of X is an increasing function of |Q|. Since R is disjoint and maximal, the 3Dmatch instance has a solution if and only if R is a solution.
One of the appealing aspects of dns(X; h) for h = 1 is that we can 2-approximate it in linear time [8]. This is done by ordering the nodes with ExactCore, say v_1, ..., v_n, and then selecting the densest subgraph of the form v_1, ..., v_i.
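For h = 1 this approach can be sketched as follows (a minimal Python sketch of the peeling 2-approximation; the candidate with the best density among the peeling suffixes is returned):

```python
def densest_peel(adj):
    """Peeling 2-approximation of the densest subgraph for h = 1:
    repeatedly remove a minimum-degree vertex and keep the best density
    |E(X)| / |X| seen among the remaining node sets."""
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    edges = sum(deg.values()) // 2
    best = edges / len(alive)
    best_set = set(alive)
    while len(alive) > 1:
        v = min(alive, key=deg.get)
        alive.remove(v)
        edges -= deg[v]
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
        d = edges / len(alive)
        if d > best:
            best = d
            best_set = set(alive)
    return best, best_set
```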
Using Core instead of ExactCore poses additional challenges. In order to select a subgraph among the n candidates, we need to estimate the density of each candidate subgraph. We cannot use the values d[v] maintained by Core, as these are the values that Core uses to determine the order.
Assume that Core produced the order of vertices v_1, ..., v_n, first-deleted vertices first. To find the densest graph among the candidates, we essentially repeat Core, except that now we delete the nodes using the order v_1, ..., v_n. We then estimate the number of edges with the identity used by the algorithm.
Proof of Proposition 6.2 Let c be the core map produced by ExactCore. For any given k, define C_k = {v : c(v) ≥ k}. Let b be the number of node pairs in E_h(X) containing u. Deleting u from X will delete these b node pairs. In addition, every node in the h-neighborhood of u may be disconnected from every other, potentially reducing the node pairs by an additional b(b − 1)/2. In summary, b + b(b − 1)/2 ≥ dns(X). Solving for b results in
b ≥ √(2 dns(X) + 1/4) − 1/2. (6)
Let Z be the nodes right before u is deleted by ExactCore. Let C_k be the smallest core containing u, that is, c(u) = k. Let d′(Z) be the estimated density for a subgraph Z. Proposition 4.2 and the union bound state that the corresponding bounds hold with probability 1 − n^2 e^{−C}. Eqs. 6-8 prove the inequality in the claim.
EstDense is essentially Core, so we can apply Proposition 5.3. Finally, let us describe a potentially faster variant of the algorithm that we will use in our experiments. The above proof works even if we replace C_k with the innermost (exact) core. Since F = C′_{k(1−ε)}, we can prune all the vertices for which c′(v) < k(1 − ε). The problem is that we do not know k, but we can lower bound it with k ≥ k′/(1 + ε), where k′ = max_v c′(v). In summary, before running Estimate we remove all the vertices for which c′(v) < γk′, where γ = (1 − ε)/(1 + ε).

Related work
The notion of distance-generalized core decomposition was proposed by Bonchi et al. [6]. The authors provide several heuristics to significantly speed up the baseline algorithm (a variant of an algorithm proposed by Batagelj and Zaveršnik [3]). Despite being significantly faster than the baseline approach, these heuristics still have a computational complexity in O(nn′(n′ + m′)), where n′ and m′ are the numbers of nodes and edges in the largest h-neighborhood. For dense graphs and large values of h, the sizes n′ and m′ can be close to n and m, leading to a computational time of O(n^2 m). We will use these heuristics as baselines in Section 8.
All these algorithms, as well as ours, rely on the same idea of iteratively deleting the vertex with the smallest deg_h(v) and updating these counters upon deletion. The difference is that the previous works maintain these counters exactly, and use some heuristics to avoid updating unnecessary nodes, whereas we approximate deg_h(v) by sampling, thus reducing the computational time at the cost of accuracy.
A popular variant of the decomposition is a k-truss, where each edge is required to be in at least k triangles [9,17,28-30]. Sarıyüce and Pinar [21] and Sariyuce et al. [22] proposed the (r, s) nucleus decomposition, an extension of k-cores where the notions of nodes and edges are replaced with r-cliques and s-cliques, respectively. Sarıyüce and Pinar [21] point out that there are several variants of k-trusses, depending on the connectivity requirements: Huang et al. [17] require the trusses to be triangle-connected, Cohen [9] requires them to be connected, and Zhang and Parthasarathy [29] allow the trusses to be disconnected.
A k-core is the largest subgraph whose smallest degree is at least k. A similar concept is the densest subgraph, a subgraph whose average degree is the largest [15]. Such graphs are convenient for discovering dense communities as they can be found in polynomial time [15], as opposed to, e.g., cliques, which are inapproximable [31].
Interestingly, the same peeling algorithm that is used for core decomposition can be used to 2-approximate the densest subgraph [8]. Tatti [25] proposed a variant of core decomposition in which the densest subgraph is equal to the inner core. This decomposition is solvable in polynomial time and can be approximated using the same peeling strategy.
A distance-generalized clique is known as an h-club, a subgraph where every node is reachable by an h-path from every other node [20]. Here the path must stay inside the subgraph. Since cliques are 1-clubs, discovering maximum h-clubs is immediately an inapproximable problem. Bonchi et al. [6] argued that (k, h)-decomposition can be used to aid discovering maximum h-clubs.
Using sampling for parallelizing (normal) core computation was proposed by Esfandiari et al. [10]. Here, the authors sparsify the graph multiple times by sampling edges. The sampling probability depends on the core numbers: larger core numbers allow for more aggressive sparsification. The authors then use Chernoff bounds to prove the approximation guarantees. The authors were able to sample edges since the degree in the sparsified graph is an estimate of the degree in the original graph (multiplied by the sampling probability). This does not hold for (k, h)-core decomposition because a node w ∈ N(v; h) can reach v with several paths.
Approximating h-neighborhoods can be seen as an instance of a cardinality estimation problem. A classic approach for solving such problems is HyperLogLog [11]. Adopting HyperLogLog or an alternative approach, such as [18], is a promising direction for future work, potentially speeding up the algorithm further. The challenge here is to maintain the estimates as the nodes are removed by Core.

Experimental evaluation
Our two main goals in the experimental evaluation are to study the accuracy and the computational time of Core.

Datasets and setup
We used 8 publicly available benchmark datasets. CaAstro and CaHep are collaboration networks between researchers. RoadPa and RoadTX are road networks in Pennsylvania and Texas. Amazon contains product pairs that are often co-purchased in a popular online retailer. Youtube contains user-to-user links in a popular video streaming service. Hyves and Douban contain friendship links in Dutch and Chinese social networks, respectively. The sizes of the graphs are given in Table 1.
We implemented Core in C++ and conducted the experiments using a single core (2.4GHz). For Core we used 8GB RAM and for EstDense we used 50GB RAM. In all experiments, we set δ = 0.05.

Accuracy
In our first experiment we compared the accuracy of our estimate c′(v) against the correct core numbers c(v). As a measure we used the maximum relative error, max_v |c′(v) − c(v)| / c(v). Note that Proposition 4.3 states that the error should be less than ε with high probability.
Table 1: Sizes and computational times for the benchmark datasets. Here, n is the number of nodes, m is the number of edges, and M is the internal parameter of Core given in Eq. 1. The running times for the baselines lub and lb are taken from [6]. Dashes indicate that the experiments did not finish in 20.
The error as a function of ϵ for the CaHep, CaAstro, and Amazon datasets is shown in Figure 1 for h = 3, 4. We see from the results that the error tends to increase as a function of ϵ. As ϵ decreases, the internal value M increases, eventually reaching the point where the maximum core number is smaller than M. For such values, Proposition 4.3 guarantees that Core produces correct results. We see, for example, that this value is reached with ϵ = 0.20 for CaHep and ϵ = 0.15 for CaAstro when h = 3, and with ϵ = 0.35 for Amazon when h = 4.

Computational time
Our next experiment studies the computational time as a function of ϵ; the results are shown in Figure 1. From the results we see that the computational time generally increases as ϵ decreases. The computational time flattens when we reach the point where c(v) ≤ M for every node v. In this case, the lists B[v, i] match exactly the neighborhoods N(v, i) and do not change if M is increased further. Consequently, decreasing ϵ further does not change the running time. Interestingly, the running time increases slightly for Amazon and h = 4 as ϵ increases. This is most likely due to the increased number of Compute calls for smaller values of M.
Next, we compare the computational time of our method against the baselines lb and lub proposed by Bonchi et al. [6]. As our hardware setup is similar, we used the running times for the baselines reported by Bonchi et al. [6]. Here, we fixed ϵ = 0.5. The results are shown in Table 1.
We see from the results that for h = 2 the results are dominated by lb. This is due to the fact that most, if not all, nodes have c(v) ≤ M. In this case, Core does not use any sampling and does not provide any speed-up. This is especially the case for the road networks, where the core numbers stay low even for larger values of h. On the other hand, Core outperforms the baselines in cases where c(v) is large, whether due to a larger h or due to denser networks. As an extreme example, lub required over 13 hours with 52 CPU cores to compute the decomposition for Hyves, while Core provided an estimate in about 12 minutes using only 1 CPU core.
Interestingly enough, Core solves CaAstro faster when h = 4 than when h = 3. This is due to the fact that we stop when the current core value plus one is equal to the number of remaining nodes. To further demonstrate the effect of the network size on the computation time we generated a series of synthetic datasets. Each dataset is a stochastic block model with 10 blocks of equal size, C_1, ..., C_10. To add a hierarchical structure we set the probability of an edge between nodes in C_i and C_j with i < j to 10^{-6} i^2. We varied the number of nodes from 10 000 to 100 000. The computational times for our method, with h = 2, 3, 4 and ϵ = 0.5, are shown in Figure 2. As expected, the running times increase as the number of edges increases. Moreover, larger values of h require more processing time. We should stress that while Corollary 5.1 bounds the running time as quasi-linear, in practice the trend depends on the underlying model.
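A generator for such data can be sketched as follows (illustrative; the within-block edge probability is not stated above, so reusing the cross-block formula with i = j is our assumption):

```python
import random

def hierarchical_sbm(n, blocks=10, seed=0):
    """Stochastic block model with `blocks` equal-size blocks C_1, ..., C_10.

    Cross-block edge probability for C_i-C_j with i < j is 1e-6 * i^2, as in
    the text; the within-block probability (i = j) is an assumption.
    """
    rng = random.Random(seed)
    size = n // blocks
    # 1-indexed block label per node; the remainder goes to the last block.
    block = [min(v // size, blocks - 1) + 1 for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            i = min(block[u], block[v])  # smaller block index drives p
            if rng.random() < 1e-6 * i * i:
                edges.append((u, v))
    return edges

edges = hierarchical_sbm(1000)
print(len(edges), "edges")
```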

Dense subgraphs
Finally, we used EstDense to estimate the densest subgraph for h = 2, 3, 4. We set ϵ = 0.5 and δ = 0.05. The results, shown in Table 2, are as expected. Both the density and the size of the h-densest subgraphs increase as a function of h. The dense subgraphs are generally smaller and less dense for the sparse graphs, such as the road networks.
In our experiments, the running times for EstDense were generally smaller than but comparable to the running times of Core. The speed-up is largely due to the pruning of nodes with smaller core numbers. The exception was Youtube with h = 3, where EstDense required over 23 minutes. The slowdown is due to Core using lazy initialization of B[v, i], whereas EstDense needs B[v, h] to be computed in order to obtain d[v]. This is also the reason why EstDense requires more memory in practice.

Concluding remarks
In this paper we introduced a randomized algorithm for approximating the distance-generalized core decomposition. The major advantage over the exact computation is that the approximation can be done in O(ϵ^{-2} hm (log^2 n − log δ)) time, whereas the exact computation may require O(n^2 m) time. We also studied distance-generalized dense subgraphs, proving that the problem is NP-hard, and extended the guarantee results of [6] to approximate core decompositions.
The algorithm is based on sampling the h-neighborhoods of the nodes. We prove the approximation guarantee with Chernoff bounds. Maintaining the sampled h-neighborhoods requires carefully designed bookkeeping in order to obtain the needed computational complexity. This is especially the case since the sampling probability may change as the graph gets smaller during the decomposition.
In practice, the sampling complements the exact algorithm. For the setups where the exact algorithm struggles, our algorithm outperforms the exact approach by a large margin. Such setups include well-connected networks and values of h larger than 3.
An interesting direction for future work is to study whether the heuristics introduced by Bonchi et al. [6] can be incorporated into the sampling approach in order to obtain an even faster decomposition method.
A Proof of Proposition 4.2

We start by stating several Chernoff bounds.
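For reference, the standard multiplicative Chernoff bound (textbook form, not quoted from the paper) reads:

```latex
% Multiplicative Chernoff bound (standard form):
% X = X_1 + ... + X_N with independent 0/1 summands, \mu = E[X], 0 < \epsilon \le 1.
\[
  \Pr\bigl[\, |X - \mu| \ge \epsilon \mu \,\bigr]
  \;\le\; 2\exp\!\left(-\frac{\epsilon^2 \mu}{3}\right).
\]
```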
To prove Proposition 4.2 we first need the following technical lemma.
Proof First, note that Eq. 12 implies ... To prove the lemma, let us define the events (Eq. 1) ... The bounds for P(B_1), P(B_2), and P(A_k) complete the proof.
Proof of Proposition 4.2 Let S_i, T_i, and k be as defined in Definition 4.1 for ∆(R; M). Let ℓ be as defined in Eq. 12. We can safely assume that M ≤ d.
Write Y_i = T_i 2^i and Z_i = M 2^{i−1}. Eq. 14 guarantees that ... To prove the other direction, first assume that k > ℓ − 1. By definition of k, we have S_{k−1} > M and consequently T_{k−1} ≥ M. Thus, ... where the last inequality is given by Eq. 16. On the other hand, if k = ℓ − 1, then ... where the second inequality holds since k > 0 implies that d ≥ 2. In summary, Eq. 2 holds.
Since the events in Lemma A.2 hold with probability 1 − exp(−C), the claim follows.
... i), and by definition of B[v, i], the claim follows.

Algorithm 3: Naive version of Update(u). Deletes u and updates the affected B[v, i] and k[v, i].

8  c ← 0;
9  while graph is not empty do
10     u ← arg min_v d[v] (use k[v, h] as a tie breaker);
11     c ← max(c, d[u]);
12     output u with c as the core number;
13     Update(u);
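The peeling loop above can be sketched as a naive exact procedure in Python (illustrative only; the paper's Core maintains sampled lists B[v, i], whereas this version recomputes h-degrees from scratch, which is where the O(n^2 m) exact cost comes from):

```python
from collections import deque

def h_degree(adj, v, h):
    """|N(v, h)|: number of nodes within distance h of v, via BFS."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        if dist[u] < h:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
    return len(dist) - 1  # exclude v itself

def kh_core_decomposition(adj, h):
    """Naive (k, h)-core decomposition: peel the minimum-h-degree node."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # mutable copy
    core = {}
    c = 0
    while adj:
        u = min(adj, key=lambda v: h_degree(adj, v, h))
        c = max(c, h_degree(adj, u, h))
        core[u] = c
        for w in adj[u]:
            adj[w].discard(u)
        del adj[u]
    return core

# Triangle plus a pendant node: for h = 1 the triangle is the 2-core.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(kh_core_decomposition(adj, 1))  # → {3: 1, 0: 2, 1: 2, 2: 2}
```

For h = 2 on the same graph every node reaches all others within distance 2, so all nodes receive core number 3.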

Next, we proceed by reducing the counters of Z in B[w, i + 1] and D[w, i + 1] for each w ∈ A(v). We also update m[w, i + 1]. Finally, we add (w, Z) to the next stack, where Z are the deleted nodes in B[w, i + 1].

Proposition 5.2. Update maintains B[v, i] correctly.

Proof As Core deletes nodes from the graph, Proposition 3.1 guarantees that B[v, i] can be modified only in two ways: either node u is deleted from B[v, i] when u is no longer present in any B[w, i − 1] where w ∈ A(v), or k[v, i] changes and new nodes are added. The first case is handled properly as Update uses Delete whenever a node is deleted from B[w, i − 1]. The second case follows since if |B[v, i]| + |D[v, i]| > M or m[v, i] > 0, then we know that Compute will not change k[v, i] and will not introduce new nodes in B[v, i].

21     add (w, Z) to W;
22     U ← W;
23     delete u from G;

Proposition 5.3. Assume a graph G with n nodes and m edges. Assume 0 < ϵ ≤ 1/2, a constant C, and the maximum path length h. The running time of Core is bounded by ...

... We have bounded both R_1 and R_2, proving the main claim.

Corollary 5.1. Assume real values ϵ > 0, δ > 0, and a graph G with n nodes and m edges. Let C = log(2n/δ). Then Core yields an ϵ-approximation in O(hm log(n/δ) ϵ^{-2} log(n ϵ^2)) time with probability 1 − δ.

Proposition 5.4. Core requires O(hmM) memory.

Proof An entry in B[v, i] requires O(deg v) memory for the pointer information. An entry in D[v, i] only requires O(1) memory. Since |B[v, i]| ≤ M and |D[v, i]| ≤ M deg v, the claim follows.
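As a sanity check on the O(hmM) memory bound, summing the per-entry cost O(deg v) over all nodes and all levels i = 1, ..., h (our sketch, using |B[v, i]| ≤ M as stated) gives:

```latex
% Each of the at most M entries of B[v, i] costs O(\deg v) memory:
\[
  \sum_{v} \sum_{i=1}^{h} M \cdot \deg(v)
  \;=\; M h \sum_{v} \deg(v)
  \;=\; 2hmM
  \;\in\; \mathcal{O}(hmM).
\]
```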

Figure 1: Relative error and computational time as a function of ϵ for the CaHep, CaAstro, and Amazon datasets and h = 3 (top row) and h = 4 (bottom row).

Figure 2: Computational time as a function of the number of edges on synthetic data.
Let us now consider R_2. For each deleted node in B[v, i] and for each lowered k[v, i] the inner for-loop requires O(deg v) time. Equation 5 implies that the total number of deletions from B[v, i] is in O(M log(n/M)), and that we can lower k[v, i] at most O(log(n/M)) times. Consequently, ...

For Youtube and Hyves, lub was run with 52 CPU cores. The remaining experiments were done with a single CPU core.

Table 2: Densities and sizes of discovered dense subgraphs for the benchmark datasets.