1 Introduction

Core decomposition is a classic technique for discovering densely connected regions in a graph. Its appeal lies in a simple and intuitive definition and in the fact that the decomposition can be computed in linear time. Core decomposition has a wide range of applications, such as graph visualization [1], graph modeling [4], social network analysis [23], internet topology modeling [7], influence analysis [19, 27], bioinformatics [2, 16], and team formation [5].

More formally, a k-core is a maximal subgraph in which every vertex has degree at least k. The k-cores form a nested structure: the \((k + 1)\)-core is a subset of the k-core, and the core decomposition can be discovered in linear time [3]. Core decomposition has been extended to directed [14], multilayer [12], temporal [13], and weighted [24] networks.

A natural extension of core decomposition, proposed by Bonchi et al. [6], is the distance-generalized core decomposition, or (k, h)-core decomposition, where the degree is replaced by the number of nodes that can be reached with a path of length at most h. Here, h is a user parameter, and \(h = 1\) reduces to the standard core decomposition. Using distance-generalized core decomposition may produce a more refined decomposition [6]. Moreover, it can be used for discovering h-clubs, distance-generalized dense subgraphs, and distance-generalized chromatic numbers [6].

Studying such structures may be useful in graphs where paths of length h reveal interesting information. For example, consider an authorship network, where an edge between a paper and a researcher indicates that the researcher was an author of the paper. Then, paths of length 2 contain co-authorship information.

The major downside of using the distance-generalized core decomposition is the significant increase in computational complexity: whereas the standard core decomposition can be computed in \({{\mathcal {O}}}{\left( m\right) }\) time, the generalization can require \({{\mathcal {O}}}{\left( n^2m\right) }\) time, where n and m are the number of nodes and edges in the given graph.

To combat this problem, we propose a randomized algorithm that produces an \(\epsilon \)-approximation of the (k, h)-core decomposition with probability \(1 - \delta \) in

$$\begin{aligned} {{\mathcal {O}}}{\left( \frac{hm\log (n / \delta )}{\epsilon ^{2}}\log \frac{n\epsilon ^{2}}{\log (n / \delta )}\right) } \subseteq {{\mathcal {O}}}{\left( \epsilon ^{-2} hm (\log ^2 n - \log \delta )\right) } \end{aligned}$$

time.

The intuition behind our approach is as follows. In order to compute the distance-generalized core decomposition, we need to discover and maintain h-neighborhoods for each node. We can discover the h-neighborhood of a node v by taking the union of the \((h - 1)\)-neighborhood of the adjacent nodes, which leads to a simple dynamic program. The computational bottleneck comes from the fact that these neighborhoods may become too large. So, instead of computing the complete neighborhood, we have a carefully selected budget M. The moment the neighborhood becomes too large, we delete (roughly) half of the nodes, and to compensate for the sampling we multiply our size estimate by 2. This procedure is repeated as often as needed. Since we are able to keep the neighbor samples small, we are able to compute the decomposition faster.

We use Chernoff bounds to determine an appropriate value for M and provide algorithms for maintaining the h-neighborhoods. The maintenance requires special attention since if the h-neighborhood becomes too small we need to bring back the deleted nodes.

Finally, we study distance-generalized dense subgraphs, a notion proposed by Bonchi et al. [6] that extends the notion of dense subgraphs. Here, the density is the ratio of the number of h-connected node pairs to the number of nodes. We show that the problem is NP-hard and propose an algorithm based on approximate core maps, extending the results by Bonchi et al. [6].

The rest of the paper is organized as follows: In Sect. 2, we introduce preliminary notation and formalize the problem. In Sect. 3, we present a naive version of the algorithm that yields approximate results but is too slow. We prove the approximation guarantee in Sect. 4 and speed up the algorithm in Sect. 5. In Sect. 6, we study distance-generalized dense subgraphs. We discuss the related work in Sect. 7. Finally, we compare our method empirically against the baselines in Sect. 8 and conclude the paper with discussion in Sect. 9.

This work extends the conference paper [26].

2 Preliminaries and problem definition

In this section, we establish preliminary notation and define our problem.

Assume an undirected graph \(G = (V, E)\) with n nodes and m edges. We will write A(v) for the set of nodes adjacent to v. Given an integer h, we define an h-path to be a sequence of at most \(h + 1\) adjacent nodes. The h-neighborhood \(N(v; h, X)\) is then the set of nodes that are reachable from v with an h-path inside a set of nodes X. If \(X = V\) or X is otherwise clear from the context, we will drop it from the notation. Note that \(N(v; 1) = A(v) \cup \left\{ v\right\} \).

We will write \(\deg _h(v; X) = {\left| N(v; h, X)\right| } - 1\), where X is a set of nodes and \(v \in X\). We will often drop X from the notation if it is clear from the context.

A k-core is the maximal subgraph of G in which every node has degree at least k. Discovering the cores can be done in \({{\mathcal {O}}}{\left( m\right) }\) time by iteratively deleting the vertex with the smallest degree [23].

Bonchi et al. [6] proposed to extend the notion of k-cores to (k, h)-cores. Here, given an integer h, a (k, h)-core is the maximal subgraph H of G such that \({\left| N(v; h)\right| } - 1 \ge k\) for each \(v \in V(H)\), that is, we can reach at least k nodes from v with an h-path. The core number c(v) of a vertex v is the largest k such that v is contained in a (k, h)-core H. We will call H the core graph of v, and we will refer to c as the core map.

Note that discovering (k, 1)-cores is equivalent to discovering standard k-cores. We follow the same strategy when computing (k, h)-cores as with standard cores: we iteratively find and delete the vertex with the smallest \(\deg _h(v)\) [6]. We will refer to the exact algorithm as ExactCore. While ExactCore is guaranteed to produce the correct result, the computational complexity deteriorates to \({{\mathcal {O}}}{\left( n^2m\right) }\). The main reason is that the neighborhoods \(N(v; h)\) can be significantly larger than the sets of adjacent nodes A(v).
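
To make the peeling strategy concrete, the following Python sketch implements the strategy in its most naive form; it is a simplified illustration rather than the pseudo-code of [6] or our C++ implementation. The graph is assumed to be given as a dictionary adj mapping each node to a list of its neighbors, and every h-degree is recomputed by breadth-first search at every step, so the sketch conveys only the logic, not the complexity.

from collections import deque

def h_degree(adj, v, h, alive):
    # Number of nodes of `alive` reachable from v with a path of at most h edges.
    seen = {v}
    queue = deque([(v, 0)])
    while queue:
        u, d = queue.popleft()
        if d == h:
            continue
        for w in adj[u]:
            if w in alive and w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return len(seen) - 1

def exact_core(adj, h):
    # Peeling: repeatedly delete a node with the smallest h-degree.  The core
    # number of a node is the largest minimum h-degree seen up to its deletion.
    alive = set(adj)
    core, best = {}, 0
    while alive:
        degs = {v: h_degree(adj, v, h, alive) for v in alive}
        v = min(alive, key=degs.get)
        best = max(best, degs[v])
        core[v] = best
        alive.remove(v)
    return core

For \(h = 1\), the sketch reduces to the standard core decomposition.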

In this paper, we consider approximating cores.

Definition 2.1

(approximative (k, h)-core) Given a graph G, an integer h, and an approximation guarantee \(\epsilon \), an \(\epsilon \)-approximative core map \({c'}:{V} \rightarrow {{\mathbb {N}}}\) maps each node to an integer such that \({\left| c'(v) - c(v)\right| } \le \epsilon c(v)\) for each \(v \in V\).

We will introduce an algorithm that computes an \(\epsilon \)-approximative core map with a probability of \(1 - \delta \) in quasilinear time.

3 Naive, slow algorithm

In this section, we introduce the basic idea of our approach. This version of the algorithm will still be too slow but will approximate the cores accurately. We will prove the accuracy in the next section and then refine the subroutines to obtain the needed computational complexity.

The bottleneck in computing the cores is maintaining the h-neighborhood \(N(v; h)\) of each node v as we delete nodes. Instead of maintaining the complete h-neighborhood, we keep only a sample of its nodes when the neighborhood becomes too large. We then compensate for the sampling when estimating the size of the h-neighborhood.

Assume that we are given a graph G, an integer h, approximation guarantee \(\epsilon \), and a probability threshold \(\delta \). Let us define numbers \(C = \log (2n / \delta )\) and

$$\begin{aligned} M = 1 + \frac{4(2 + \epsilon )}{\epsilon ^2}(C + \log 8) \quad . \end{aligned}$$
(1)

The quantity M will act as an upper bound for the sampled h-neighborhood, while C will be useful when analyzing the properties of the algorithm. We will see later that these specific values will yield the approximation guarantees.
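
For reference, both quantities can be computed directly from n, \(\epsilon \), and \(\delta \); the small helper below follows Eq. 1, assuming natural logarithms, which matches the \(\exp (-C)\) failure probabilities used in the analysis.

import math

def budget(n, eps, delta):
    # C and M as defined above; natural logarithms throughout.
    C = math.log(2 * n / delta)
    M = 1 + 4 * (2 + eps) / eps**2 * (C + math.log(8))
    return C, M

For instance, with \(n = 10^6\), \(\epsilon = 0.5\), and \(\delta = 0.05\), the budget is roughly \(M \approx 784\), so the sampled neighborhoods stay small even for large graphs.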

We start the algorithm by sampling, for each node v, a rank from a geometric distribution, \(r[v] = {geo}{\left( 1/2\right) }\). Note that ties are allowed. During the algorithm, we maintain two key variables B[v, i] and k[v, i] for each \(v \in V\) and each index \(i = 1, \ldots , h\). Here,

$$\begin{aligned} B[v, i] = \left\{ u \in N(v; i) \mid r[u] \ge k[v, i]\right\} \end{aligned}$$

is the subset of the i-neighborhood \(N(v; i)\) consisting of the nodes whose rank \(r[u] \ge k[v, i]\). The threshold k[v, i] is set to be as small as possible such that \({\left| B[v, i]\right| } \le M\).

We can estimate c(v) from B[v, h] and k[v, h] as follows: Consider the quantity \(d = {\left| B[v, h] \setminus \left\{ v\right\} \right| }2^{k[v, h]}\). Note that for an integer k the probability of a vertex v having a rank \(r[v] \ge k\) is \(2^{-k}\). This hints that d is a good estimate for c(v). We show in the next section that this is indeed the case, but d is lacking an important property that we need in order to prove the correctness of the algorithm. Namely, d can increase while we are deleting nodes. To fix this pathological case, we estimate c(v) with \(\max (d, M2^{k[v, h] - 1})\) if \(k[v, h] > 0\), and with d if \(k[v, h] = 0\). The pseudo-code for the estimate is given in Algorithm 1.

[Algorithm 1: Estimate]
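
The following Python sketch spells out the estimate; it is a simplified illustration written from the description above, with variable names of our own choosing. Given the ranks of the nodes in the sampled h-neighborhood, with the rank of v first, it finds the smallest threshold k that leaves at most M nodes and applies the \(M2^{k - 1}\) floor discussed above.

def estimate(ranks, M):
    # ranks[0] is the rank of v itself; the rest are the ranks of the other
    # nodes in the sampled h-neighborhood.  P(rank >= k) = 2**(-k).
    k = 0
    while sum(1 for r in ranks if r >= k) > M:
        k += 1
    d = sum(1 for r in ranks[1:] if r >= k) * 2**k
    if k == 0:
        return d
    # The floor keeps the estimate monotone under deletions (Proposition 4.1).
    return max(d, M * 2**(k - 1))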

To compute B[v, i], we have the following observation.

Proposition 3.1

For any \(v \in V\) and any \(i = 1, \ldots , h\), we have

$$\begin{aligned} \begin{aligned} B[v, i]&= \left\{ u \in T \mid r[u] \ge k[v, i]\right\} , \quad \text {where}\quad \\ T&= \left\{ v\right\} \cup \left\{ u \in B[w, i - 1] \mid w \in A(v)\right\} \quad . \end{aligned} \end{aligned}$$

Moreover, \(k[v, i] \ge k[w, i - 1]\) for any \(w \in A(v)\).

Proof

Let \(w \in A(v)\). Since \(N(w, i - 1) \subseteq N(v, i)\), we have \(k[v, i] \ge k[w, i - 1]\). Consequently, \(B[v, i] \subseteq T \subseteq N(v, i)\), and by definition of B[vi], the claim follows. \(\square \)

The proposition leads to Compute, an algorithm for computing B[v, i], given in Algorithm 2. Here, we form a set T as the union of \(\left\{ v\right\} \) and the sets \(B[w, i - 1]\), where \(w \in A(v)\). After T is formed, we search for the smallest threshold \(k[v, i] \ge \max _{w \in A(v)} k[w, i - 1]\) that yields at most M nodes in T and store the resulting set in B[v, i].

[Algorithm 2: Compute]
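
A Python sketch of this recomputation is given below; it is a simplified illustration under the notation above. It assumes that B and k are dictionaries indexed by pairs (node, level) and that level 0 is initialized as \(B[w, 0] = \left\{ w\right\} \) with threshold 0.

def compute(v, i, adj, rank, B, k, M):
    # Proposition 3.1: it suffices to merge the neighbors' sampled sets.
    T = {v}
    lo = 0
    for w in adj[v]:
        T |= B[(w, i - 1)]
        lo = max(lo, k[(w, i - 1)])
    # Smallest threshold, at least lo, that keeps at most M sampled nodes.
    t = lo
    while sum(1 for u in T if rank[u] >= t) > M:
        t += 1
    B[(v, i)] = {u for u in T if rank[u] >= t}
    k[(v, i)] = t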

When a node, say u, is deleted, we need to update the affected nodes. We do this in Algorithm 3 by recomputing B[v, i] for the neighbors \(v \in A(u)\) and checking whether B[v, i] and k[v, i] have changed; if they have, we recompute \(B[w, i + 1]\) for all \(w \in A(v)\), and so on.

[Algorithm 3: Update]

The main algorithm Core, given in Algorithm 4, initializes B[v, i] using Compute and then iteratively deletes the node with the smallest estimate d[v] while updating the sets B[v, i] with Update.

[Algorithm 4: Core]
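
To make the control flow of Algorithm 4 concrete, here is a deliberately naive end-to-end Python sketch. The ranks are drawn once, and at every step the estimate is recomputed from scratch on the full h-neighborhood, which, as noted after Definition 4.1 below, gives the same value as maintaining B[v, i] and k[v, i] incrementally. None of the bookkeeping of Sect. 5 is present, so this version is far slower than the actual algorithm.

import math
import random
from collections import deque

def geometric_rank(rng):
    # Rank with P(rank >= k) = 2**(-k).
    r = 0
    while rng.random() < 0.5:
        r += 1
    return r

def h_neighborhood(adj, v, h, alive):
    seen = {v}
    queue = deque([(v, 0)])
    while queue:
        u, d = queue.popleft()
        if d == h:
            continue
        for w in adj[u]:
            if w in alive and w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return seen

def estimate(ranks, M):
    # Same as the Estimate sketch after Algorithm 1.
    k = 0
    while sum(1 for r in ranks if r >= k) > M:
        k += 1
    d = sum(1 for r in ranks[1:] if r >= k) * 2**k
    return d if k == 0 else max(d, M * 2**(k - 1))

def naive_core(adj, h, eps, delta, seed=0):
    rng = random.Random(seed)
    n = len(adj)
    C = math.log(2 * n / delta)
    M = 1 + 4 * (2 + eps) / eps**2 * (C + math.log(8))
    rank = {v: geometric_rank(rng) for v in adj}
    alive, core, best = set(adj), {}, 0
    while alive:
        # The actual Core maintains these estimates incrementally with Update.
        est = {}
        for v in alive:
            nb = h_neighborhood(adj, v, h, alive)
            est[v] = estimate([rank[v]] + [rank[u] for u in nb if u != v], M)
        v = min(alive, key=est.get)
        best = max(best, est[v])
        core[v] = best
        alive.remove(v)
    return core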

4 Approximation guarantee

In this section, we will prove the approximation guarantee of our algorithm. The key step is to show that Estimate produces an accurate estimate. For notational convenience, we need the following definition.

Definition 4.1

Assume d integers \(X = \left( x_1 ,\ldots , x_d\right) \) and an integer M. Define

$$\begin{aligned} S_i = {\left| \left\{ j \in \left[ d\right] \mid x_j \ge i\right\} \right| } \text { and } T_i = {\left| \left\{ j \in \left[ d\right] \mid x_j \ge i, j \ge 2\right\} \right| } \end{aligned}$$

to be the number of integers that are at least i; the count \(T_i\) additionally ignores the first integer \(x_1\). Let \(k \ge 0\) be the smallest integer for which \(S_k \le M\). Define

$$\begin{aligned} {\varDelta }{\left( X; M\right) } = {\left\{ \begin{array}{ll} \max \left( T_k2^k, M2^{k - 1}\right) , &{} \text { if } k > 0, \\ T_k2^k, &{} \text { if } k = 0\quad .\\ \end{array}\right. } \end{aligned}$$

Note that if \(R = \left( r[w] \mid w \in N(v; h)\right) \) with r[v] being the first element in R, then \({\varDelta }{\left( R; M\right) }\) is equal to the output of \(\textsc {Estimate} (v)\).

Our first step is to show that \({\varDelta }{\left( X; M\right) }\) is monotonic.

Proposition 4.1

Assume \(M > 0\). Let \(x_1, \ldots , x_d\) be a set of integers. Select \(d' \le d\). Then

$$\begin{aligned} {\varDelta }{\left( x_1, \ldots , x_{d'}; M\right) } \le {\varDelta }{\left( x_1, \ldots , x_d; M\right) }\quad . \end{aligned}$$

Note that this claim would not hold if we did not have the \(M2^{k - 1}\) term in the definition of \({\varDelta }{\left( X; M\right) }\).

Proof

Let k, \(S_i\), and \(T_i\) be as defined for \({\varDelta }{\left( x_1, \ldots , x_d; M\right) }\) in Definition 4.1. Also, let \(k'\), \(S_i'\), and \(T_i'\) be as defined for \({\varDelta }{\left( x_1, \ldots , x_{d'}; M\right) }\) in Definition 4.1.

Since \(S_i' \le S_i\), we have \(k' \le k\). If \(k' = k\), the claim follows immediately since also \(T_i' \le T_i\). If \(k' < k\), then

$$\begin{aligned} {\varDelta }{\left( x_1, \ldots , x_d; M\right) } \ge M2^{k - 1} \ge M2^{k'} \ge T_{k'}' 2^{k'} \end{aligned}$$

and

$$\begin{aligned} {\varDelta }{\left( x_1, \ldots , x_d; M\right) } \ge M2^{k - 1} \ge M2^{k' - 1}, \end{aligned}$$

proving the claim. \(\square \)

Next, we formalize the accuracy of \({\varDelta }{\left( X; M\right) }\). We prove the claim in the Appendix.

Proposition 4.2

Assume \(0 < \epsilon \le 1/2\). Let \({\mathcal {R}} = R_1, \ldots , R_d\) be independent random variables sampled from geometric distribution, \({geo}{\left( 1/2\right) }\). Assume \(C > 0\) and define M as in Eq. 1. Then

$$\begin{aligned} {\left| {\varDelta }{\left( {\mathcal {R}}; M\right) } - (d - 1)\right| } \le \epsilon (d - 1) \end{aligned}$$
(2)

with probability \(1 - \exp \left( -C\right) \).

We are now ready to state the main claim.

Proposition 4.3

Assume graph G with n nodes, \(\epsilon > 0\), and \(C > 0\). For each node \(v \in V\), let c(v) be the core number reported by ExactCore and let \(c'(v)\) be the core number reported by Core. Then with probability \(1 - 2ne^{-C}\)

$$\begin{aligned} {\left| c(v) - c'(v)\right| } \le \epsilon c(v), \end{aligned}$$

for every node in V. Moreover, if \(c(v) \le M\), where M is given in Eq. 1, then \(c(v) = c'(v)\).

We will prove the main claim of the proposition with two lemmas. In both proofs, we will use the variable \(\tau _v\) which we define to be the value of d[v] when v is deleted by Core.

The first lemma establishes a lower bound.

Lemma 4.1

The lower bound \(c'(v) \ge (1 - \epsilon )c(v)\) holds with probability \(1 - ne^{-C}\).

Proof

For each node \(v \in V\), let \(R_v\) be a rank, an independent random variable sampled from geometric distribution, \({geo}{\left( 1/2\right) }\).

Let \(H_v\) be the core graph of v as solved by ExactCore. Define \(S_v = N(v, h) \cap H_v\) to be the h-neighborhood of v in \(H_v\). Note that \(c(v) \le {\left| S_v\right| } - 1\). Let \({\mathcal {R}}_v\) be the list of ranks \(\left( R_w ; w \in S_v\right) \) such that \(R_v\) is always the first element.

Proposition 4.2 combined with the union bound states that

$$\begin{aligned} {\left| {\varDelta }{\left( {\mathcal {R}}_v; M\right) } - ({\left| S_v\right| } - 1)\right| } \le \epsilon ({\left| S_v\right| } - 1)\quad . \end{aligned}$$
(3)

holds with probability \(1 - ne^{-C}\) for every node v. Assume that these events hold.

To prove the claim, select a node v and let w be the first node in \(H_v\) deleted by Core. Let F be the graph right before deleting w by Core. Then,

$$\begin{aligned} c'(v)&\ge c'(w)&(\textsc {Core} \text { picked }\ w\ \text { before }\ v\ \text { or } w = v)\\&\ge \tau _w \\&\ge {\varDelta }{\left( {\mathcal {R}}_w; M\right) }&(H_w \subseteq H_v \subseteq F\ \text {and Prop.}~4.1\text {)} \\&\ge (1 - \epsilon ) ({\left| S_w\right| } - 1)&\text {(Eq.}~3\text {)} \\&\ge (1 - \epsilon )c(w)&{(S_w = N(w, h) \cap H_w)} \\&\ge (1 - \epsilon ) c(v),&{(w \in H_v)} \\ \end{aligned}$$

proving the lemma. \(\square \)

Next, we establish the upper bound.

Lemma 4.2

The upper bound \(c'(v) \le (1 + \epsilon )c(v)\) holds with probability \(1 - ne^{-C}\).

Proof

For each node \(v \in V\), let \(R_v\) be an independent random variable sampled from geometric distribution, \({geo}{\left( 1/2\right) }\).

Consider the exact algorithm ExactCore for solving the (k, h)-core problem. Let \(H_v\) be the graph induced by the existing nodes right before v is removed by ExactCore. Define \(S_v = N(v, h) \cap H_v\) to be the h-neighborhood of v in \(H_v\). Note that \(c(v) \ge {\left| S_v\right| } - 1\). Let \({\mathcal {R}}_v\) be the list of ranks \(\left( R_w ; w \in S_v\right) \) such that \(R_v\) is the first element.

Proposition 4.2 combined with the union bound states that

$$\begin{aligned} {\left| {\varDelta }{\left( {\mathcal {R}}_v; M\right) } - ({\left| S_v\right| } - 1)\right| } \le \epsilon ({\left| S_v\right| } - 1)\quad . \end{aligned}$$
(4)

holds with probability \(1 - ne^{-C}\) for every node v. Assume that these events hold.

Select a node v. Let W be the set containing v and the nodes selected before v by Core. Select \(w \in W\). Let F be the graph right before deleting w by Core. Let u be the node in F that is deleted first by ExactCore. Let \(\beta \) be the value of d[u] when w is deleted by Core. Then,

$$\begin{aligned} \tau _w&\le \beta&(\textsc {Core} \text { picked }\ w\ \text {over}\ u\ \text {or}\ w = u)\\&\le {\varDelta }{\left( {\mathcal {R}}_u; M\right) }&(F \subseteq H_u\ \text {and Proposition}~4.1\text {)}\\&\le (1 + \epsilon ) ({\left| S_u\right| } - 1)&\text {(Eq.}~4\text {)}\\&\le (1 + \epsilon ) c(u)&{(S_u = N(u, h) \cap H_u)} \\&\le (1 + \epsilon ) c(v) \quad .&{(v \in F \subseteq H_u)} \end{aligned}$$

Since this bound holds for any \(w \in W\), we have

$$\begin{aligned} c'(v) = \max _{w \in W} \tau _w \le (1 + \epsilon )c(v), \end{aligned}$$

proving the lemma. \(\square \)

We are now ready to prove the proposition.

Proof of Proposition 4.3

The probability that one of the two above lemmas does not hold is bounded by the union bound with \(2ne^{-C}\), proving the main claim.

To prove the second claim, note that when \(d[v] \le M\), the estimate d[v] accurately matches the number of remaining nodes that can be reached by an h-path from v. On the other hand, if there is a node w that reaches more than M nodes, we are guaranteed that \(d[w] \ge M\) and \(k[w, h] > 0\), implying that Core will always prefer deleting v before w. Consequently, at the beginning Core selects the nodes in the same order as ExactCore and reports the same core numbers as long as there are nodes with \(d[v] \le M\) or, equivalently, as long as \(c(v) \le M\). \(\square \)

5 Updating data structures faster

Now that we have proven the accuracy of Core, our next step is to address the computational complexity. The key problem is that Compute is called too often and the implementation of Update is too slow.

As Core progresses, the set B[v, i] is modified in two ways. The first case is when some nodes become too far away, and we need to delete these nodes from B[v, i]. The second case is when we have deleted enough nodes so that we can lower k[v, i] and introduce new nodes. Our naive version of Update calls Compute for both cases. We will modify the algorithms so that Compute is called only to handle the second case, and the first case is handled separately. Note that these modifications do not change the output of the algorithm.

First, we change the information stored in B[v, i]. Instead of storing just a node u, we will store (u, z), where z is the number of neighbors \(w \in A(v)\) such that u is in \(B[w, i - 1]\). We will store B[v, i] as a linked list sorted by the rank. In addition, each node \(u \in B[w, i - 1]\) is augmented with an array \(Q = (q_v \mid v \in A(w))\). An entry \(q_v\) points to the location of u in B[v, i] if u is present in B[v, i]. Otherwise, \(q_v\) is set to null.

We will need two helper functions to maintain B[v, i]. The first function is a standard merge sort, \(\textsc {MergeSort} (X, Y)\), that merges two sorted lists in \({{\mathcal {O}}}{\left( {\left| X\right| } + {\left| Y\right| }\right) }\) time, maintaining the counters and the pointers.

The other function is \(\textsc {Delete} (X, Y)\), which removes the nodes in Y from X; we will use it to remove nodes from B[v, i]. The deletion is done by reducing the counters of the corresponding nodes in X by 1 and removing them when the counter reaches 0. It is vital that we can process a single node \(y \in Y\) in constant time. This is possible because we can use the pointer array described above.
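
A dictionary-based sketch of the counter mechanism follows; the sorted linked list and the pointer arrays are what make each step constant time in the actual data structure, and a plain dictionary only emulates the counting part.

def delete(X, Y):
    # X maps a node u to its counter z, i.e., in how many neighbor sets
    # B[w, i - 1] the node u still appears.  Removing the nodes of Y
    # decrements the counters and drops entries that reach zero.
    for u in Y:
        if u in X:
            X[u] -= 1
            if X[u] == 0:
                del X[u]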

Let us now consider calling Compute. We would like to minimize the number of calls of Compute. In order to do that, we need additional bookkeeping. The first additional information is m[v, i], which is the number of neighboring nodes \(w \in A(v)\) for which \(k[w, i - 1] = k[v, i]\). Proposition 3.1 states that \(k[v, i] \ge k[w, i - 1]\), for all \(w \in A(v)\). Thus, if \(m[v, i] > 0\), then there is a node \(u \in A(v)\) with \(k[v, i] = k[u, i - 1]\), and so recomputing B[v, i] will not change k[v, i] and will not add new nodes to B[v, i].

Unfortunately, maintaining just m[v, i] is not enough. We may have \(k[v, i] > k[w, i - 1]\) for every \(w \in A(v)\) immediately after Compute. In that case, we compute the sets of nodes

$$\begin{aligned} X_w = \left\{ u \in B[w, i - 1] \mid r[u] = k[v, i] - 1\right\} , \end{aligned}$$

and combine them into D[v, i], the union of the sets \(X_w\) along with counter information similar to B[v, i], that is,

$$\begin{aligned} D[v, i] = \left\{ (u, z) \mid z = {\left| \left\{ w \in A(v) \mid u \in X_w\right\} \right| } > 0\right\} \quad . \end{aligned}$$

The key observation is that as long as \({\left| B[v, i]\right| } + {\left| D[v, i]\right| } > M\), the level k[v, i] does not need to be updated.

There is one complication, namely, we need to compute D[v, i] in \({{\mathcal {O}}}{\left( M\deg v\right) }\) time. Note that, unlike B[v, i], the set D[v, i] can have more than M elements. Hence, using MergeSort will not work. Moreover, a stock k-ary merge sort requires \({{\mathcal {O}}}{\left( M\deg (v) \log \deg (v)\right) }\) time. The key observation to avoid the additional \(\log \) factor is that D[v, i] does not need to be sorted. More specifically, we first compute an array

$$\begin{aligned} a[u] = {\left| \left\{ w \in A(v) \mid u \in X_w\right\} \right| }, \end{aligned}$$

and then extract the nonzero entries to form D[v, i]. We only need to compute the nonzero entries, so we can compute them in \({{\mathcal {O}}}{\left( \sum {\left| X_w\right| }\right) } \subseteq {{\mathcal {O}}}{\left( M\deg v\right) }\) time. Moreover, since we do not need to keep them in order, we can extract the nonzero entries within the same time. We will refer to this procedure as Union; it takes the sets \(X_w\) as input and forms D[v, i].
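
A short sketch of Union is given below; a dictionary plays the role of the array a and automatically keeps only the nonzero entries, so no sorting is needed.

def union(X_lists):
    # X_lists: the sets X_w, one per neighbor w of v.
    a = {}
    for X_w in X_lists:
        for u in X_w:
            a[u] = a.get(u, 0) + 1
    # D[v, i]: nodes with a nonzero count, in no particular order.
    return list(a.items())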

We need to maintain D[v, i] efficiently. In order to do that, we augment each node \(u \in B[w, i - 1]\) with an array \((q_v \mid v \in A(w))\), where \(q_v\) points to the location of u in D[v, i] if \(u \in D[v, i]\).

The pseudo-code for the updated Compute is given in Algorithm 5. Here, we first compute B[v, i] and k[v, i] by using MergeSort iteratively and trimming the resulting set if it has more than M elements. We proceed to compute m[v, i] and D[v, i]. If \(m[v, i] = 0\), we compute D[v, i] with Union. Note that if \(m[v, i] > 0\), we leave D[v, i] empty. The above discussion leads immediately to the computational complexity of Compute.

Proposition 5.1

\(\textsc {Compute} (v, i)\) runs in \({{\mathcal {O}}}{\left( M \deg v\right) }\) time.

The pseudo-code for Update is given in Algorithm 6. Here, we maintain a stack U of tuples (v, Y), where v is the node that requires an update, and Y are the nodes that have been deleted from B[v, i] during the previous round. First, if \({\left| B[v, i]\right| } + {\left| D[v, i]\right| } \le M\) and \(m[v, i] = 0\), we run \(\textsc {Compute} (v, i)\). Next, we reduce the counters of the deleted nodes in \(B[w, i + 1]\) and \(D[w, i + 1]\) for each \(w \in A(v)\). We also update \(m[w, i + 1]\). Finally, we add (w, Z) to the next stack, where Z are the nodes deleted from \(B[w, i + 1]\).

Proposition 5.2

Update maintains B[v, i] correctly.

Proof

As Core deletes nodes from the graph, Proposition 3.1 guarantees that B[v, i] can be modified only in two ways: either a node u is deleted from B[v, i] when u is no longer present in any \(B[w, i - 1]\) where \(w \in A(v)\), or k[v, i] changes and new nodes are added.

The first case is handled properly as Update uses Delete whenever a node is deleted from \(B[w, i - 1]\).

The second case follows since if \({\left| B[v, i]\right| } + {\left| D[v, i]\right| } > M\) or \(m[v, i] > 0\), then we know that Compute will not change k[v, i] and will not introduce new nodes in B[v, i]. \(\square \)

[Algorithm 5: Compute (updated)]
[Algorithm 6: Update (updated)]

Proposition 5.3

Assume a graph G with n nodes and m edges. Assume \(0 < \epsilon \le 1/2\), constant C, and the maximum path length h. The running time of Core is bounded by:

$$\begin{aligned} {{\mathcal {O}}}{\left( hmM\log \frac{n}{M}\right) } = {{\mathcal {O}}}{\left( hmC\epsilon ^{-2}\log \frac{n\epsilon ^{2}}{C}\right) } \end{aligned}$$

with a probability of \(1 - n\exp (-C)\), where M is defined in Eq. 1.

Proof

We will prove the proposition by bounding \(R_1 + R_2\), where \(R_1\) is the total time needed by Compute and \(R_2\) is the total time needed by the inner loop of Update.

We will bound \(R_1\) first. Note that a single call of \(\textsc {Compute} (v, i)\) requires \({{\mathcal {O}}}{\left( M\deg v\right) }\) time.

To bound the number of Compute calls, let us first bound k[v, i]. Proposition 4.2 and the union bound imply that

$$\begin{aligned} M2^{k[v, i] - 1} \le (1 + \epsilon )c(v) \le 2n \end{aligned}$$

holds for all nodes \(v \in V\) with probability \(1 - n\exp (-C)\). Solving for k[v, i] leads to

$$\begin{aligned} k[v, i] \le 2 + \log _2 \frac{n}{M} \in {{\mathcal {O}}}{\left( \log \frac{n}{M}\right) } \quad . \end{aligned}$$
(5)

We claim that \(\textsc {Compute} (v, i)\) is called at most twice per value of k[v, i]. To see this, suppose first that \(\textsc {Compute} (v, i)\) sets \(m[v, i] = 0\). Then D[v, i] is also computed, and we are guaranteed by the first condition on Line 9 of Update that the next call of \(\textsc {Compute} (v, i)\) will lower k[v, i]. Assume now that \(\textsc {Compute} (v, i)\) sets \(m[v, i] > 0\). Then, by the second condition on Line 9 of Update, the next call of \(\textsc {Compute} (v, i)\) happens only once m[v, i] has dropped to 0, and that call either keeps m[v, i] at 0 (and computes D[v, i]) or lowers k[v, i].

In summary, the time needed by Compute is bounded by

$$\begin{aligned} R_1 \in {{\mathcal {O}}}{\left( \sum _{i, v} M\deg (v) \log \frac{n}{M}\right) } = {{\mathcal {O}}}{\left( hmM\log \frac{n}{M}\right) } \quad . \end{aligned}$$

Let us now consider \(R_2\). For each node deleted from B[v, i] and for each time k[v, i] is lowered, the inner for-loop requires \({{\mathcal {O}}}{\left( \deg v\right) }\) time. Equation 5 implies that the total number of deletions from B[v, i] is in \({{\mathcal {O}}}{\left( M\log \frac{n}{M}\right) }\) and that we can lower k[v, i] at most \({{\mathcal {O}}}{\left( \log \frac{n}{M}\right) }\) times. Consequently,

$$\begin{aligned} R_2 \in {{\mathcal {O}}}{\left( h\sum _v (M + 1)\log \frac{n}{M} \deg v\right) } = {{\mathcal {O}}}{\left( hmM\log \frac{n}{M} \right) }\ . \end{aligned}$$

We have bounded both \(R_1\) and \(R_2\) proving the main claim. \(\square \)

Corollary 5.1

Assume real values \(\epsilon > 0\) and \(\delta > 0\), and a graph G with n nodes and m edges. Let \(C = \log (2n / \delta )\). Then Core yields an \(\epsilon \)-approximation in

$$\begin{aligned} {{\mathcal {O}}}{\left( \frac{hm\log (n / \delta )}{\epsilon ^{2}}\log \frac{n\epsilon ^{2}}{\log (n / \delta )}\right) } \end{aligned}$$

time with \(1 - \delta \) probability.

Proposition 5.4

Core requires \({{\mathcal {O}}}{\left( hmM\right) }\) memory.

Proof

An entry in B[v, i] requires \({{\mathcal {O}}}{\left( \deg v\right) }\) memory for the pointer information. An entry in D[v, i] only requires \({{\mathcal {O}}}{\left( 1\right) }\) memory. Since \({\left| B[v, i]\right| } \le M\) and \({\left| D[v, i]\right| } \le M \deg v\), the claim follows. \(\square \)

In order to speed up the algorithm further, we employ two additional heuristics. First, we can safely delay the initialization of B[v, i] until every \(B[w, i - 1]\), where \(w \in A(v)\), yields a core estimate that is below the current core number. Delaying the initialization allows us to ignore B[v, i] during Update. Second, if the current core number exceeds the number of remaining nodes, then we can stop and use the current core number for the remaining nodes. While these heuristics do not provide any additional theoretical advantage, they speed up the algorithm in practice.

6 Distance-generalized dense subgraphs

In this section, we will study distance-generalized dense subgraphs, a notion introduced by Bonchi et al. [6].

In order to define the problem, let us first define \(E_h(X)\) to be the node pairs in X that are connected with an h-path in X. We exclude the node pairs of the form (u, u). Note that \(E(X) = E_1(X)\).

We define the h-density of X to be the ratio of \(E_h(X)\) and \({\left| X\right| }\),

$$\begin{aligned} {dns}{(}{X; h})= \frac{{\left| E_h(X)\right| }}{{\left| X\right| }}\quad . \end{aligned}$$

We will sometimes drop h from the notation if it is clear from the context.
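
For concreteness, the h-density can be computed directly by counting, from every node of X, the nodes it reaches with at most h edges inside X, summing these counts, and dividing by \(2{\left| X\right| }\); the Python sketch below does exactly that (in quadratic time, for illustration only).

from collections import deque

def h_density(adj, X, h):
    # dns(X; h) = |E_h(X)| / |X|; each connected pair is counted from both
    # endpoints, hence the division by two.
    X = set(X)
    twice_pairs = 0
    for v in X:
        seen = {v}
        queue = deque([(v, 0)])
        while queue:
            u, d = queue.popleft()
            if d == h:
                continue
            for w in adj[u]:
                if w in X and w not in seen:
                    seen.add(w)
                    queue.append((w, d + 1))
        twice_pairs += len(seen) - 1
    return twice_pairs / (2 * len(X))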

Problem 6.1

(Dense) Given a graph G and an integer h, find the subgraph X maximizing \({dns}{(}{X; h})\).

Dense can be solved for \(h = 1\) in polynomial time using fractional programming combined with minimum cut problems [15]. However, the distance-generalized version of the problem is NP-hard.

Proposition 6.1

Dense is NP-hard even for \(h = 2\).

To prove the result, we will use extensively the following lemma.

Lemma 6.1

Let X be the densest subgraph. Let \(Y \subseteq X\) and \(Z \cap X = \emptyset \). Then

$$\begin{aligned} \frac{{\left| E_h(X)\right| } - {\left| E_h(X \setminus Y)\right| }}{{\left| Y\right| }} \ge {dns}{(}{X}) \ge \frac{{\left| E_h(X \cup Z)\right| } - {\left| E_h(X)\right| }}{{\left| Z\right| }} \end{aligned}$$

Proof

Due to optimality \({dns}{(}{X}) \ge {dns}{(}{X \setminus Y})\). Then,

$$\begin{aligned} \frac{{\left| E_h(X)\right| } - {\left| E_h(X \setminus Y)\right| }}{{\left| Y\right| }} \ge \frac{{\left| E_h(X)\right| } - {dns}{(}{X}) ({\left| X\right| } - {\left| Y\right| })}{{\left| X\right| } - ({\left| X\right| } - {\left| Y\right| })} = {dns}{(}{X})\quad . \end{aligned}$$

Similarly, \({dns}{(}{X}) \ge {dns}{(}{X \cup Z})\) implies

$$\begin{aligned} \frac{{\left| E_h(X \cup Z)\right| } - {\left| E_h(X)\right| }}{{\left| Z\right| }} \le \frac{{dns}{(}{X})({\left| X\right| } + {\left| Z\right| }) - {\left| E_h(X)\right| }}{({\left| X\right| } + {\left| Z\right| }) - {\left| X\right| }} = {dns}{(}{X}), \end{aligned}$$

proving the claim. \(\square \)

Proof of Proposition 6.1

To prove the claim, we will reduce 3Dmatch to our problem. In an instance of 3Dmatch, we are given a universe \(U = u_1, \ldots , u_n\) of size n and a collection \({\mathcal {C}}\) of m sets of size 3, and we ask whether there is an exact cover of U in \({\mathcal {C}}\).

We can safely assume that \(C_1\) does not intersect with any other set. Otherwise, we can add a new set and 3 new items without changing the outcome of the instance.

In order to define the graph, let us first define \(k = 12m\) and \(\ell = 3k(3k - 1)/2 + 6k - 1\). Note that \(k \ge 12\).

For each \(u_i \in U\), we add k nodes \(a_{ij}\), where \(j = 1, \ldots , k\). For each \(a_{ij}\), we add \(2\ell \) unique nodes that are all connected to \(a_{ij}\). We will denote the resulting star with \(S_{ij}\). We will select a non-center node from \(S_{ij}\) and denote it by \(b_{ij}\). Finally, we write \(S'_{ij} = S_{ij} \setminus \left\{ a_{ij}, b_{ij}\right\} \).

For each set \(C_t \in {\mathcal {C}}\), we add a node, say \(p_t\), and connect it to \(b_{ij}\) for every \(u_i \in C_t\) and \(j = 1, \ldots , k\). We will denote the collection of these nodes with P. We connect every node in P to \(p_1\).

Let X be the nodes of the densest subgraph for \(h = 2\). Let \(Q = P \cap X\) and let \({\mathcal {R}}\) be the corresponding sets in \({\mathcal {C}}\).

To simplify the notation, we will need the following counts of node pairs. First, let us define \(\alpha \) to be the number of node pairs in a single \(S_{ij}\) connected with a 2-path,

$$\begin{aligned} \alpha = E_2(S_{ij}) = {2 \ell + 1 \atopwithdelims ()2}\quad . \end{aligned}$$

Second, let us define the number of node pairs connected with a 2-path using a single node \(p_t \in P\). Since \(p_t\) connects 3k nodes \(b_{ij}\) and reaches 3k nodes \(b_{ij}\) and 3k nodes \(a_{ij}\), we have

$$\begin{aligned} \beta = {3k \atopwithdelims ()2} + 6k\quad . \end{aligned}$$

Finally, consider W consisting of a single \(p_t\) and the corresponding 3k stars. Let us write \(\gamma = 3k\alpha + \beta \) to be the number of node pairs connected by a 2-path in W.

We will prove the proposition with a series of claims.

Claim 1: \({dns}{(}{X}) > \ell \). The density of W as defined above is:

$$\begin{aligned} {dns}{(}{W}) = \frac{3k \alpha + \beta }{3k(2\ell + 1) + 1} > \frac{3k \alpha + \ell }{3k(2\ell + 1) + 1} = \ell \quad . \end{aligned}$$

Thus, \({dns}{(}{X}) \ge {dns}{(}{W}) > \ell \).

Claim 2: \({\mathcal {R}}\) is disjoint. To prove the claim, assume otherwise and let \(C_t\), with \(t > 1\), be a set overlapping with some other set in \({\mathcal {R}}\).

Let us bound the number of node pairs that are connected solely through \(p_t\). The node \(p_t\) connects \(3k + 1\) nodes in V. Out of these nodes, at least \(k + 1\) nodes are connected by another node in X. In addition, \(p_t\) reaches \(a_{ij}\) and \(b_{ij}\), where \(u_i \in C_t\) and \(j = 1, \ldots , k\), that is, 6k nodes in total. Finally, \(p_t\) may connect to every other node in P, at most \(m - 1\) nodes, and to every \(b_{ij}\) adjacent to \(p_1\), at most 3k nodes. In summary, we have

$$\begin{aligned} \begin{aligned} {\left| E_2(X)\right| } - {\left| E_2(X \setminus \left\{ p_t\right\} )\right| }&\le {3k + 1 \atopwithdelims ()2} - {k + 1 \atopwithdelims ()2} + 6k + m - 1 + 3k\\&= \ell - k^2/2 + 5k/2 + m + 3k < \ell \le {dns}{(}{X})\quad . \end{aligned} \end{aligned}$$

Lemma 6.1 with \(Y = \left\{ p_t\right\} \) now contradicts the optimality of X. Thus, \({\mathcal {R}}\) is disjoint.

Claim 3: Either \(S_{ij} \subseteq X\) or \(S_{ij} \cap X = \emptyset \). To prove the claim assume that \(S_{ij} \cap X \ne \emptyset \).

Assume that \(b_{ij} \notin X\). Then, \(S_{ij} \cap X\) is a disconnected component with density less than \(\ell \), contradicting Lemma 6.1. Assume that \(b_{ij} \in X\) and \(a_{ij} \notin X\). Then, deleting \(b_{ij}\) will reduce at most \(3k + m - 1 < \ell \) connected node pairs, contradicting Lemma 6.1.

Assume that \(b_{ij}, a_{ij} \in X\). If \(S'_{ij} \cap X = \emptyset \), then deleting \(a_{ij}\) will reduce at most 2 connected node pairs, contradicting Lemma 6.1. Assume now there are \(u \in S'_{ij} \cap X\) and \(w \in S'_{ij} \setminus X\). Then \({\left| E_2(X \cup \left\{ w\right\} )\right| } - {\left| E_2(X)\right| } > {\left| E_2(X)\right| } - {\left| E_2(X \setminus \left\{ u\right\} )\right| }\), contradicting Lemma 6.1. Consequently, \(S_{ij} \subseteq X\).

Claim 4: If \(p_t \in X\), then X contains every corresponding \(S_{ij}\). To prove the claim assume otherwise.

Assume first that X contains no \(S_{ij}\) corresponding to \(p_t\). If \(t > 1\), then \(p_t\) reaches at most \(m - 1 + 3k\) nodes. If \(t = 1\), then \(p_1\) is adjacent to at most \(m - 1\) nodes and reaches at most \((m - 1)(3k + 1)\) nodes.

Both cases lead to

$$\begin{aligned} \begin{aligned} {\left| E_2(X)\right| } - {\left| E_2(X \setminus \left\{ p_t\right\} )\right| }&\le {m - 1 \atopwithdelims ()2} + (m - 1)(3k + 1)< \ell < {dns}{(}{X}), \end{aligned} \end{aligned}$$

contradicting Lemma 6.1.

Assume there is at least one corresponding \(S_{ij}\) in X but not all, say \(S_{i'j'}\) is missing. Then

$$\begin{aligned} {\left| E_2(X)\right| } - {\left| E_2(X \setminus S_{ij})\right| } < {\left| E_2(X \cup S_{i'j'})\right| } - {\left| E_2(X)\right| }, \end{aligned}$$

contradicting Lemma 6.1.

Claim 5: No \(S_{ij}\) without a corresponding \(p_t\) is included in X. To prove the claim, note that such an \(S_{ij}\) is disconnected and has a density of \(\ell \), contradicting Lemma 6.1.

The previous claims together show that the density of X is equal to

$$\begin{aligned} {dns}{(}{X}) = \frac{{\left| Q\right| } \gamma + ({\left| Q\right| } - 1)(6k) + {{\left| Q\right| } \atopwithdelims ()2}}{{\left| Q\right| }(3k(2\ell + 1) + 1)}, \end{aligned}$$

which is an increasing function of \({\left| Q\right| }\). Since \({\mathcal {R}}\) is disjoint and, by optimality, of maximum size, the 3Dmatch instance has a solution if and only if \({\mathcal {R}}\) covers U. \(\square \)

One of the appealing aspects of \({dns}{(}{X; h})\) for \(h = 1\) is that we can 2-approximate it in linear time [8]. This is done by ordering the nodes with ExactCore, say \(v_1, \ldots , v_n\), and then selecting the densest subgraph of the form \(v_1, \ldots , v_i\).

The approximation guarantee for \(h > 1\) is weaker even if we use ExactCore: Bonchi et al. [6] showed that \(2{dns}{(}{Y}) \ge \sqrt{2{dns}{(}{X}) + 1/4} - 1/2\), where X is the h-densest subgraph and Y is the subgraph selected from the ExactCore order.

Using Core instead of ExactCore poses additional challenges. In order to select a subgraph among the n candidates, we need to estimate the density of each candidate. We cannot reuse the values d[v] computed by Core, as these are the values that Core used to determine the order.

Assume that Core produced the vertex order \(v_1, \ldots , v_n\), with the first vertices deleted first. To find the densest graph among the candidates, we essentially repeat Core, except that now we delete the nodes in the order \(v_1, \ldots , v_n\). We then estimate the number of edges with the identity

$$\begin{aligned} 2{\left| E_h(X)\right| } = \sum _{v \in X} \deg _h(v; X)\quad . \end{aligned}$$

We will refer to this algorithm as EstDense. The pseudo-code for EstDense is given in Algorithm 7.

[Algorithm 7: EstDense]
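
The selection step can be sketched as follows; the sketch uses exact h-degrees and the degree-sum identity above in place of the sampled estimates, so it shows the structure of the selection rather than the complexity of Algorithm 7. The candidates are the subgraphs that remain right before each deletion in the given order.

from collections import deque

def h_deg(adj, v, h, alive):
    # Nodes of `alive` reachable from v with at most h edges inside `alive`.
    seen = {v}
    queue = deque([(v, 0)])
    while queue:
        u, d = queue.popleft()
        if d == h:
            continue
        for w in adj[u]:
            if w in alive and w not in seen:
                seen.add(w)
                queue.append((w, d + 1))
    return len(seen) - 1

def best_candidate(adj, order, h):
    # order: the deletion order produced by Core, first deleted first.
    alive = set(order)
    best, best_set = -1.0, set()
    for v in order:
        dens = sum(h_deg(adj, u, h, alive) for u in alive) / (2 * len(alive))
        if dens > best:
            best, best_set = dens, set(alive)
        alive.remove(v)
    return best_set, best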

The algorithm yields the following guarantee.

Proposition 6.2

Assume \(\epsilon > 0\), \(C > 0\), and an integer h. Define \(\gamma = \frac{1 - \epsilon }{1 + \epsilon }\). For any given k, let \(C_k\) be the (k, h)-core. Define

$$\begin{aligned} \beta = \min _{k} \frac{{\left| C_k\right| }}{{\left| C_{k\gamma }\right| }} \end{aligned}$$

to be the smallest size ratio between \(C_k\) and \(C_{k\gamma }\).

Let X be the h-densest subgraph.

Let \(c'\) be an \(\epsilon \)-approximative core map and let \(v_1, \ldots , v_n\) be the corresponding vertex order. Let \(Y = \textsc {EstDense} (G, v_1, \ldots , v_n, \epsilon , C)\). Then,

$$\begin{aligned} 2{dns}{(}{Y}) \ge \gamma \beta \left( \sqrt{2 {dns}{(}{X}) + 1/4 } - 1/2\right) \end{aligned}$$

with probability \(1 - n^2\exp \left( -C\right) \).

To prove the result, we need the following lemma.

Lemma 6.2

For any given k, define \(C'_k = \left\{ v \mid c'(v) \ge k\right\} \). Then

$$\begin{aligned} {dns}{(}{C_{k(1 - \epsilon )}'}) \ge \beta {dns}{(}{C_k})\quad . \end{aligned}$$

Proof

Write \(F = C_{k(1 - \epsilon )}'\). Let \(v \in C_k\). Then \(c'(v) \ge (1 - \epsilon ) c(v) \ge (1 - \epsilon )k\) and so \(v \in F\). Thus, \(C_k \subseteq F\). Conversely, let \(v \in F\). Then \((1 + \epsilon )c(v) \ge c'(v) \ge k(1 - \epsilon )\) and so \(v \in C_{\gamma k}\). Thus, \(F \subseteq C_{\gamma k}\). The definition of \(\beta \) now implies

$$\begin{aligned} {dns}{(}{F}) = \frac{{\left| E_h(F)\right| }}{{\left| F\right| }} \ge \beta \frac{{\left| E_h(F)\right| }}{{\left| C_k\right| }} \ge \beta \frac{{\left| E_h(C_k)\right| }}{{\left| C_k\right| }} \end{aligned}$$

proving the claim. \(\square \)

Proof of Proposition 6.2

Let c be the core map produced by ExactCore. For any given k, define \(C'_k = \left\{ v \mid c'(v) \ge k\right\} \).

Let \(u \in X\) be the first vertex deleted by ExactCore. Let \(b = \deg _h(u; X)\) be its h-degree. Write \(X' = X \setminus \left\{ u\right\} \). Since X is optimal,

$$\begin{aligned} \frac{{\left| E_h(X)\right| }}{{\left| X\right| }} \ge \frac{{\left| E_h(X')\right| }}{{\left| X'\right| }}\quad . \end{aligned}$$

Deleting u from X will delete b node pairs from \(E_h(X)\) containing u. In addition, every node in the h-neighborhood of u may be disconnected from each other, potentially reducing the node pairs by \({b \atopwithdelims ()2}\). In summary,

$$\begin{aligned} {\left| E_h(X)\right| } - {\left| E_h(X')\right| } \le b + {b \atopwithdelims ()2} = {b + 1 \atopwithdelims ()2}\quad . \end{aligned}$$

Combining the two inequalities leads to

$$\begin{aligned} {b + 1 \atopwithdelims ()2} \ge {\left| E_h(X)\right| } - \frac{{\left| E_h(X)\right| }({\left| X\right| } - 1)}{{\left| X\right| }} = \frac{{\left| E_h(X)\right| }}{{\left| X\right| }} = {dns}{(}{X})\quad . \end{aligned}$$

Solving for b results in

$$\begin{aligned} b \ge \sqrt{2 {dns}{(}{X}) + 1/4 } - 1/2\quad . \end{aligned}$$
(6)

Let Z be the nodes right before u is deleted by ExactCore. Note that \(c(u) \ge \deg _h(u; Z) \ge \deg _h(u; X) = b\).

Let \(C_k\) be the smallest core containing u, that is, \(c(u) = k\). By definition, \(\deg _h(v; C_k) \ge k \ge b\), for all \(v \in C_k\).

Let \(F = C'_{k(1 - \epsilon )}\). Lemma 6.2 now states that

$$\begin{aligned} 2{dns}{(}{F}) \ge 2 \beta {dns}{(}{C_k}) = \beta \frac{1}{{\left| C_k\right| }} \sum _{v \in C_k} \deg _h(v; C_k) \ge \beta k \ge \beta b\quad . \end{aligned}$$
(7)

Let \(d'(W)\) denote the estimated density of a candidate subgraph W, as computed by EstDense.

Proposition 4.2 and the union bound state that

$$\begin{aligned} {dns}{(}{Y}) \ge \frac{1}{1 + \epsilon } d'(Y) \ge \frac{1}{1 + \epsilon } d'(F) \ge \gamma {dns}{(}{F}) \end{aligned}$$
(8)

with probability \(1 - n^2e^{-C}\). Equations 6–8 prove the inequality in the claim. \(\square \)

EstDense is essentially Core so we can apply Proposition 5.3.

Corollary 6.1

Assume real values \(\epsilon > 0\) and \(\delta > 0\), and a graph G with n nodes and m edges. Let \(C = \log (n^2 / \delta )\). Then EstDense runs in

$$\begin{aligned} {{\mathcal {O}}}{\left( \frac{hm\log (n / \delta )}{\epsilon ^{2}}\log \frac{n\epsilon ^{2}}{\log (n / \delta )}\right) } \end{aligned}$$

time and Proposition 6.2 holds with \(1 - \delta \) probability.

Finally, let us describe a potentially faster variant of the algorithm that we will use in our experiments. The above proof works even if we replace \(C_k\) with the innermost (exact) core. Since \(F = C'_{k(1 - \epsilon )}\), we can prune all the vertices for which \(c'(v) < k(1 - \epsilon )\). The problem is that we do not know k, but we can lower bound it with \(k \ge k'/(1 + \epsilon )\), where \(k' = \max _v c'(v)\). In summary, before running EstDense we remove all the vertices for which \(c'(v) < \gamma k'\).
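
Under this reading, the pruning step is a one-liner; the sketch below assumes that core_est maps each vertex to its approximate core number \(c'(v)\).

def prune(core_est, eps):
    # Keep only the vertices that may belong to the innermost exact core:
    # c'(v) >= gamma * k' with gamma = (1 - eps) / (1 + eps) and k' = max c'.
    gamma = (1 - eps) / (1 + eps)
    k_max = max(core_est.values())
    return {v for v, c in core_est.items() if c >= gamma * k_max}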

7 Related work

The notion of distance-generalized core decomposition was proposed by Bonchi et al. [6]. The authors provide several heuristics to significantly speed up the baseline algorithm (a variant of an algorithm proposed by Batagelj and Zaveršnik [3]). Despite being significantly faster than the baseline approach, these heuristics still have a computational complexity of \({{\mathcal {O}}}{\left( nn'(n' + m')\right) }\), where \(n'\) and \(m'\) are the numbers of nodes and edges in the largest h-neighborhood. For dense graphs and large values of h, the sizes \(n'\) and \(m'\) can be close to n and m, leading to a computational time of \({{\mathcal {O}}}{\left( n^2m\right) }\). We will use these heuristics as baselines in Sect. 8.

All these algorithms, as well as ours, rely on the same idea of iteratively deleting the vertex with the smallest \(\deg _h(v)\) and updating these counters upon the deletion. The difference is that the previous works maintain these counters exactly—and use some heuristics to avoid updating unnecessary nodes—whereas we approximate \(\deg _h(v)\) by sampling, thus reducing the computational time at the cost of accuracy.

A popular variant of decomposition is the k-truss, where each edge is required to be in at least k triangles [9, 17, 28,29,30]. Sarıyüce and Pinar [22] and Sarıyüce et al. [21] proposed the (r, s) nucleus decomposition, an extension of k-cores where the notions of nodes and edges are replaced with r-cliques and s-cliques, respectively. Sarıyüce and Pinar [21] point out that there are several variants of k-trusses, depending on the connectivity requirements: Huang et al. [17] require the trusses to be triangle-connected, Cohen [9] requires them to be connected, and Zhang and Parthasarathy [29] allow the trusses to be disconnected.

A k-core is the largest subgraph whose smallest degree is at least k. A similar concept is the densest subgraph, a subgraph whose average degree is the largest [15]. Such subgraphs are convenient for discovering dense communities, as they can be found in polynomial time [15], as opposed to, e.g., maximum cliques, which are inapproximable [31].

Interestingly, the same peeling algorithm that is used for core decomposition can be used to 2-approximate the densest subgraph [8]. Tatti [25] proposed a variant of core decomposition in which the densest subgraph is equal to the innermost core. This decomposition is solvable in polynomial time and can be approximated using the same peeling strategy.

A distance-generalized clique is known as an h-club, a subgraph where every node is reachable by an h-path from every other node [20]. Here, the path must stay inside the subgraph. Since cliques are 1-clubs, discovering maximum h-clubs is immediately an inapproximable problem. Bonchi et al. [6] argued that the (k, h)-core decomposition can be used to aid in discovering maximum h-clubs.

Using sampling for parallelizing (normal) core computation was proposed by Esfandiari et al. [10]. Here, the authors sparsify the graph multiple times by sampling edges. The sampling probability depends on the core numbers: larger core numbers allow for more aggressive sparsification. The authors then use Chernoff bounds to prove the approximation guarantees. The authors were able to sample edges since the degree in the sparsified graph is an estimate of the degree in the original graph (multiplied by the sampling probability). This does not hold for (k, h)-core decomposition because a node \(w \in N(v; h)\) can reach v with several paths.

Approximating h-neighborhoods can be seen as an instance of a cardinality estimation problem. A classic approach for solving such problems is HyperLogLog [11]. Adopting HyperLogLog or an alternative approach, such as [18], is a promising direction for future work, potentially speeding up the algorithm further. The challenge here is to maintain the estimates as the nodes are removed by Core.

Table 1 Sizes and computational times for the benchmark datasets

8 Experimental evaluation

Our two main goals in experimental evaluation are to study the accuracy and the computational time of Core.

8.1 Datasets and setup

We used 8 publicly available benchmark datasets. CaAstro and CaHep are collaboration networks between researchers.\(^{1}\) RoadPa and RoadTX are road networks in Pennsylvania and Texas.\(^{1}\) Amazon contains product pairs that are often co-purchased in a popular online retailer.\(^{1}\) Youtube contains user-to-user links in a popular video streaming service.\(^{2}\) Hyves and Douban contain friendship links in Dutch and Chinese social networks, respectively.\(^{3}\) The sizes of the graphs are given in Table 1.

We implemented Core in C++\(^{4}\) and conducted the experiments using a single core (2.4 GHz). For Core, we used 8 GB of RAM, and for EstDense we used 50 GB of RAM. In all experiments, we set \(\delta = 0.05\).

8.2 Accuracy

In our first experiment, we compared the accuracy of our estimate \(c'(v)\) against the correct core numbers c(v). As a measure, we used the maximum relative error

$$\begin{aligned} \max _{v \in V} \frac{{\left| c'(v) - c(v)\right| }}{c(v)}\quad . \end{aligned}$$

Note that Proposition 4.3 states that the error should be less than \(\epsilon \) with high probability.
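
For completeness, the measure is a one-liner over the exact and approximate core maps (assuming \(c(v) > 0\) for every node):

def max_relative_error(exact, approx):
    # exact, approx: dictionaries from node to core number.
    return max(abs(approx[v] - exact[v]) / exact[v] for v in exact)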

The error as a function of \(\epsilon \) for the CaHep, CaAstro, and Amazon datasets is shown in Fig. 1 for \(h = 3, 4\). We see from the results that the error tends to increase as a function of \(\epsilon \). As \(\epsilon \) decreases, the internal value M increases, reaching the point where the maximum core number is smaller than M. For such values, Proposition 4.3 guarantees that Core produces correct results. We see, for example, that this value is reached with \(\epsilon = 0.20\) for CaHep, and \(\epsilon = 0.15\) for CaAstro when \(h = 3\), and \(\epsilon = 0.35\) for Amazon when \(h = 4\).

Fig. 1: Relative error and computational time as a function of \(\epsilon \) for CaHep, CaAstro, and Amazon datasets and \(h = 3\) (top row) and \(h = 4\) (bottom row)

8.3 Computational time

Our next experiment is to study the computational time as a function of \(\epsilon \); the results are shown in Fig. 1. From the results, we see that the computational time generally increases as \(\epsilon \) decreases. The computational time flattens when we reach the point where \(c(v) \le M\) for every v. In that case, the lists B[v, i] match exactly the neighborhoods \(N(v; i)\) and do not change if M is increased further. Consequently, decreasing \(\epsilon \) further will not change the running time. Interestingly, the running time increases slightly for Amazon and \(h = 4\) as \(\epsilon \) increases. This is most likely due to the increased number of Compute calls for smaller values of M.

Next, we compare the computational time of our method against the baselines lb and lub proposed by Bonchi et al. [6]. As our hardware setup is similar, we used the running times for the baselines reported by Bonchi et al. [6]. Here, we fixed \(\epsilon = 0.5\). The results are shown in Table 1.

We see from the results that for \(h = 2\) the baseline lb dominates. This is due to the fact that most, if not all, nodes will have \(c(v) \le M\). In that case, Core does not use any sampling and does not provide any speed-up. This is especially the case for the road networks, where the core number stays low even for larger values of h. On the other hand, Core outperforms the baselines in cases where c(v) is large, whether due to a larger h or due to denser networks. As an extreme example, lub required over 13 hours with 52 CPU cores to compute the core decomposition of Hyves, while Core provided an estimate in about 12 minutes using only 1 CPU core.

Interestingly enough, Core solves CaAstro faster when \(h = 4\) than when \(h = 3\). This is due to the fact that we stop when the current core value plus one is equal to the number of remaining nodes.

Fig. 2: Computational time as a function of the number of edges for the synthetic data

To further demonstrate the effect of the network size on the computation time, we generate a series of synthetic datasets. Each dataset is a stochastic blockmodel with 10 blocks of equal size, \(C_1, \ldots , C_{10}\). To add a hierarchical structure, we set the probability of an edge between nodes in \(C_i\) and \(C_j\) with \(i < j\) to be \(10^{-6} i^2\). We vary the number of nodes from \(10\,000\) to \(100\,000\). The computational times for our method, with \(h = 2, 3, 4\) and \(\epsilon = 0.5\), are shown in Fig. 2. As expected, the running times increase as the number of edges increases. Moreover, a larger h requires more processing time. We should stress that while Corollary 5.1 bounds the running time as quasi-linear, in practice the trend depends on the underlying model.

8.4 Dense subgraphs

Table 2 Densities and sizes of discovered dense subgraphs for the benchmark datasets

Finally, we used EstDense to estimate the densest subgraph for \(h = 2, 3, 4\). We set \(\epsilon = 0.5\) and \(\delta = 0.05\). The results, shown in Table 2, are as expected. Both the density and the size of the h-densest subgraphs increase as a function of h. The dense subgraphs are generally smaller and less dense for the sparse graphs, such as the road networks.

In our experiments, the running times for EstDense were generally smaller but comparable to the running times of Core. The speed-up is largely due to the pruning of nodes with smaller core numbers. The exception was Youtube with \(h = 3\), where EstDense required over 23 minutes. The slowdown is due to Core using lazy initialization of B[v, i], whereas EstDense needs B[v, h] to be computed in order to obtain d[v]. This is also the reason why EstDense requires more memory in practice.

9 Concluding remarks

In this paper, we introduced a randomized algorithm for approximating the distance-generalized core decomposition. The major advantage over the exact computation is that the approximation can be done in \({{\mathcal {O}}}{\left( \epsilon ^{-2} hm (\log ^2 n - \log \delta )\right) }\) time, whereas the exact computation may require \({{\mathcal {O}}}{\left( n^2m\right) }\) time. We also studied distance-generalized dense subgraphs by proving that the problem is NP-hard and extended the guarantee results of [6] to approximate core decompositions.

The algorithm is based on sampling the h-neighborhoods of the nodes. We prove the approximation guarantee with Chernoff bounds. Maintaining the sampled h-neighborhood requires carefully designed bookkeeping in order to obtain the needed computational complexity. This is especially the case since the sampling probability may change as the graph gets smaller during the decomposition.

In practice, the sampling complements the exact algorithm. For the setups where the exact algorithm struggles, our algorithm outperforms the exact approach by a large margin. Such setups include well-connected networks and values of h larger than 3.

An interesting direction for future work is to study whether the heuristics introduced by Bonchi et al. [6] can be incorporated into the sampling approach in order to obtain an even faster decomposition method.