1 Introduction

Finding dense subgraphs is a common tool for analyzing networks, with potential applications in diverse domains such as bioinformatics (Fratkin et al., 2006; Langston et al., 2005), finance (Du et al., 2009), social media (Angel et al., 2014), and web graph analysis (Fratkin et al., 2006).

While useful on their own, dense subgraphs without any additional explanation may be difficult for domain experts to interpret, which consequently may limit their usability.

Fortunately, it is often the case that the network has additional information, such as labels associated with nodes and/or edges. For example, in social networks, users may have tags describing themselves. In networks arising from communication, for example by email or Twitter, the labels associated with an edge can be tags attached to or extracted from the underlying message.

Using the available label information to provide explainable dense subgraphs may ease the burden of domain experts when, for example, studying social networks. In this paper, we consider finding dense subgraphs in networks with labeled edges. More formally, we are looking for a label set that induces a dense subgraph. As a measure of density of a subgraph (W, F) with nodes W and edges F, we will use \({\left| F\right| } / {\left| W\right| }\), the ratio of edges to nodes, a popular choice for measuring the density of a subgraph.

We consider two cases: conjunctive-induced and disjunctive-induced dense subgraphs. In the former, the induced subgraph consists of all the edges that have the given label set. In the latter, the induced subgraph consists of all the edges that have at least one label in common with the given label set. We give an example of both cases in Fig. 1.

Fig. 1

Example graphs with labels on the edges. Edge labels are indicated by colours: for example, orange edges have one label, purple edges another, and dashed edges with both colours carry both labels. Left figure: label \(\ell _1\) induces a subgraph with 6 nodes and 9 edges, and \(\ell _2\) induces a subgraph with 7 nodes and 11 edges, while the conjunction of \(\ell _1\) and \(\ell _2\) induces a subgraph with 5 nodes and 8 edges, resulting in the highest density of \(8/5=1.6\). Right figure: labels \(\ell _1\), \(\ell _2\), and \(\ell _3\) each induce subgraphs with 5 nodes and 4 edges, while the disjunction of \(\ell _1\) and \(\ell _2\) induces a subgraph with 5 nodes and 7 edges, resulting in a density of \(7/5=1.4\)

Finding the densest subgraph—with no label constraints—can be done in polynomial time (Goldberg et al., 1984) and can be 2-approximated in linear time (Charikar, 2000). Unfortunately, additional requirements on the labels will make solving the optimization problem exactly computationally intractable: we show that both problems are NP-hard, which forces us to resort to heuristics. We propose a greedy algorithm for both problems: we start with an empty label set and keep adding the best possible label until no additions are possible. We then return the best observed induced subgraph.

The computational bottleneck of the greedy method is selecting a new label. If done naively, evaluating a single label candidate requires enumerating over all the edges. Since this needs to be done for every candidate during every addition, the running time is \(\mathcal {O} \mathopen {}\left( p {\left| L\right| }\right)\), where \({\left| L\right| }\) is the number of labels and p is the number of edge-label pairs in the network. By keeping certain counters we can speed up the running time. We show that conjunctive-induced graphs can be discovered in \(\mathcal {O} \mathopen {}\left( p \log {\left| L\right| }\right)\) time using a balanced search tree, and that disjunctive-induced graphs can be discovered in \(\mathcal {O} \mathopen {}\left( p \log ^2 {\left| L\right| }\right)\) time with the aid of an algorithm originally used to maintain convex hulls.

This is an extended version of our previously published conference paper (Kumpulainen & Tatti, 2022). We extend our earlier work by considering an alternative definition of density: namely, we search for label-induced subgraphs (W, F) with high \(\alpha\)-density \({\left| F\right| } - \alpha {\left| W\right| }\). This density is closely related to the problem of finding a subgraph with maximum density (Goldberg et al., 1984), but it has also been used to decompose graphs (Tatti, 2019; Danisch et al., 2017). We show that there are values of \(\alpha\) such that the \(\alpha\)-densest label-induced graph also has the highest density. We then modify the greedy algorithms to find subgraphs with high \(\alpha\)-density in \(\mathcal {O} \mathopen {}\left( p \log {\left| L\right| }\right)\) time for both the conjunctive and disjunctive cases.

The remainder of the paper is organized as follows. In Sect. 2 we introduce the notation and formalize the optimization problem. In Sects. 3 and 4 we present our algorithms. In Sect. 5, we analyze the case of using an alternative density measure and adapt the previous algorithms to this problem. Section 6 is devoted to related work. Finally, we present the experimental evaluation in Sect. 7 and conclude with a discussion in Sect. 8. The computational complexity proofs are given in Appendix 1.

2 Preliminary notation and problem definition

In this section, we first describe the common notation and then introduce the formal definition of our problem.

Assume that we are given an edge-labeled graph, that is, a tuple \(G=(V,E, lab )\), where V is the set of vertices, \(E \subseteq \{(x,y) \mid (x, y) \in V^2, x \ne y\}\) is the set of undirected edges, and \(lab : E \rightarrow 2^L\) is a function that maps each edge \(e \in E\) to the set of labels \(lab (e)\). Here L is the set of all possible labels.

Given a label \(\ell \in L\), let us write \(E(\ell )\) to be the edges having the label \(\ell\). In addition, let us write \(V(\ell )\) to be the nodes adjacent to \(E(\ell )\).

Our goal is to search for dense regions of graphs that can be explained using the labels. In other words, we are looking for a set of labels that induces a dense graph. More formally, we define an inducing function to be a function f that maps two sets of labels to a binary value. An example of such a function is \(f(A; B) = [B \subseteq A]\), which returns 1 if and only if B is a subset of A.

Given a set of labels \(B \subseteq L\), an inducing function f, and a graph G, we define the label-induced subgraph \(H=G(f, B)\) as \((V(B), E(B), lab )\), where

$$\begin{aligned} E(B) = \{e \in E \mid f( lab (e); B) = 1\} \end{aligned}$$

is the subset of edges that satisfy f, and V(B) is the set of vertices that are adjacent to E(B).

Given a graph G with vertices V and edges E, we measure the density of the graph d(G) as the number of edges divided by the number of vertices: \(d(G) = \frac{|E|}{|V|}\).

We should point out that there are alternative choices for a notion of density. For example, one option is to consider the fraction of edges \(|E|/{|V| \atopwithdelims ()2}\). However, this measure is problematic since a single edge already yields the maximum value. Consequently, either a size term needs to be incorporated into the objective, which leads to discovering maximum cliques, an NP-hard problem with bad approximation guarantees (Håstad, 1996), or all pseudo-cliques need to be enumerated with an exponential-time algorithm (Uno, 2010; Abello et al., 2002). On the other hand, finding a subgraph with maximum d(G) can be done in polynomial time (Goldberg et al., 1984) and 2-approximated in linear time (Charikar, 2000). See the related work for additional discussion.

We are now ready to state our generic problem.

Problem 1

(LD) Let \(G=(V,E, lab )\) be an edge-labeled graph over a set of labels L with multiple labels being possible for each edge. Assume an inducing function f. Find a set of labels \(L^*\) such that the density d(H) of the label-induced subgraph \(H=G(f, L^*)\) is maximized.

We consider two special cases of LD. Firstly, let us define \(f_{ AND } \mathopen {}\left( A; B\right) = [B \subseteq A]\), that is, the induced edges need to contain every label in B. We will denote the problem LD paired with \(f_{ AND }\) as LDand. Secondly, we define \(f_{ OR } \mathopen {}\left( A; B\right) = [B \cap A \ne \emptyset ]\), that is, the induced edges need to have one common label with B. Then, we denote the corresponding problem as LDor. In other words, LDand is the problem of finding dense conjunctive-induced subgraphs, and LDor is the problem of finding disjunctive-induced subgraphs.
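To make the two inducing functions concrete, the following minimal Python sketch implements \(f_{ AND }\), \(f_{ OR }\), and the density of the resulting label-induced subgraph. The (u, v, labels) edge-list representation is our own illustration and is not prescribed by the paper.

```python
# A minimal sketch of the two inducing functions and of the density of a
# label-induced subgraph; edges are (u, v, labels) triples, labels a set.

def f_and(edge_labels, B):
    return B <= edge_labels          # every label of B must be present

def f_or(edge_labels, B):
    return bool(B & edge_labels)     # at least one label in common with B

def induced_density(edges, B, f):
    E = [(u, v) for u, v, lab in edges if f(lab, B)]
    V = {x for e in E for x in e}    # nodes adjacent to the induced edges
    return len(E) / len(V) if V else 0.0
```

For instance, on the left graph of Fig. 1, induced_density(edges, {'l1', 'l2'}, f_and) would evaluate to \(8/5 = 1.6\).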

In addition, we consider an alternative measure to the density d(G) of a graph by instead measuring the \(\alpha\)-density of the graph \(g(G; \alpha )\) as the number of edges minus \(\alpha\) times the number of vertices: \(g(G; \alpha ) = |E| - \alpha |V|\). This measure is closely related to finding the densest subgraph: the algorithm of Goldberg et al. (1984) finds a series of \(\alpha\)-densest subgraphs when searching for the densest subgraph. However, this measure has also been studied on its own, as it can be used to decompose graphs (Tatti, 2019; Danisch et al., 2017). Our optimization problem is as follows.

Problem 2

(LD-\(\alpha\)) Let \(G=(V,E, lab )\) be an edge-labeled graph over a set of labels L with multiple labels being possible for each edge. Assume an inducing function f and a constant \(\alpha \in \mathbb {R}\). Find a set of labels \(L^*\) such that the \(\alpha\)-density \(g(H; \alpha )\) of the label-induced subgraph \(H=G(f, L^*)\) is maximized.

3 Finding dense conjunctive-induced graphs

In this section, we focus on LDand, that is, finding conjunctive-induced graphs that are dense. We will first prove that LDand is NP-hard.

Theorem 1

LDand is NP-hard.

Proof

The proof is given in Appendix 1. \(\square\)

The NP-hardness forces us to resort to heuristics. Here, we use the algorithm for 2-approximating dense subgraphs (Charikar, 2000) as a starting point. The algorithm iteratively removes a node with the smallest degree, and returns the best solution among the observed subgraphs. We propose a similar greedy algorithm, where we greedily add the best possible label, and repeat until the induced subgraph is empty. We then select the best observed labels as the output.

To avoid enumerating over the edges every time we look for a new label, we maintain several counters. Let A be the current set of labels. For each label k, we maintain the number of nodes \(n_k\) and edges \(m_k\) of the candidate graph, that is, \(n_k = {\left| V(A \cup \left\{ k\right\} )\right| }\) and \(m_k = {\left| E(A \cup \left\{ k\right\} )\right| }\). We store the densities \(m_k / n_k\) in a balanced search tree (for example, a red-black tree), which allows us to obtain the largest element quickly. Once we update the set A, we also update the counters and the search tree. Maintaining the node counts \(n_k\) requires us to maintain the counters \(r_{v, k}\), the number of edges labeled with k adjacent to v: once such a counter drops to 0, we decrease \(n_k\) by 1. The pseudo-code of the algorithm is given in Algorithm 1.

Algorithm 1: GreedyAnd
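Since the pseudo-code figure is not reproduced here, we give a deliberately naive Python sketch of the greedy loop, reusing induced_density and f_and from the earlier sketch. It re-evaluates every candidate label from scratch, which corresponds to the \(\mathcal {O} \mathopen {}\left( p {\left| L\right| }\right)\) baseline; Algorithm 1 instead maintains the counters \(n_k\), \(m_k\), and \(r_{v,k}\) together with a balanced search tree.

```python
# A naive sketch of GreedyAnd: add the best label until the induced
# subgraph becomes empty, and return the best observed label set.
# labels is the set of all labels L.

def greedy_and(edges, labels):
    A = set()
    best_A, best_d = set(), 0.0
    while True:
        # candidates whose addition still induces a non-empty subgraph
        cand = [(induced_density(edges, A | {k}, f_and), k)
                for k in labels - A]
        cand = [(d, k) for d, k in cand if d > 0]
        if not cand:
            break
        d, k = max(cand)             # greedily add the best label
        A.add(k)
        if d > best_d:
            best_d, best_A = d, set(A)
    return best_A, best_d
```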

We conclude with an analysis of the computational complexity of GreedyAnd.

Theorem 2

GreedyAnd runs in \(\mathcal {O} \mathopen {}\left( p\log {{\left| L\right| }}+{\left| V\right| }+{\left| E\right| }\right)\) time, where p is the number of edge-label pairs \(p={\left| \{(e,k)\mid e\in E, k\in lab (e)\}\right| }\).

Proof

Initializing counters in GreedyAnd can be done in \(\mathcal {O} \mathopen {}\left( {\left| V\right| } + {\left| E\right| } + {\left| L\right| }\right)\) time while initializing the tree can be done in \(\mathcal {O} \mathopen {}\left( {\left| L\right| } \log {\left| L\right| }\right)\) time.

Let us consider the inner for-loop. Since an edge is deleted once it is processed, the inner for-loop is executed at most p times during the search. Since this is the only way the counters get updated, the tree T is updated p times, each update requiring \(\mathcal {O} \mathopen {}\left( \log {\left| L\right| }\right)\) time.

The outer loop is executed at most \({\left| L\right| }\) times. During each round, selecting and removing the label requires \(\mathcal {O} \mathopen {}\left( \log {\left| L\right| }\right)\) time.

In summary, the algorithm requires

$$\begin{aligned} \mathcal {O} \mathopen {}\left( {\left| V\right| } + {\left| E\right| } + {\left| L\right| } + {\left| L\right| } \log {\left| L\right| } + p \log {\left| L\right| }\right) \subseteq \mathcal {O} \mathopen {}\left( {\left| V\right| } + {\left| E\right| } + p \log {\left| L\right| }\right) \end{aligned}$$

time, completing the proof. \(\square\)

4 Finding dense disjunctive-induced graphs

In this section, we focus on LDor, that is, finding disjunctive-induced graphs that are dense. We will first prove that LDor is NP-hard.

Theorem 3

LDor is NP-hard.

Proof

The proof is given in Appendix 1. \(\square\)

Similar to LDand, we resort to a greedy search to find good subgraphs: We start with an empty label set, and iteratively add the best possible label. Once done, we return the best observed label set.

However, we maintain a different set of counters as compared to GreedyAnd. The reason for having different counters is to avoid a significantly higher number of updates: the inner loop would need to go over the edge-label pairs that are not present in the graph. More formally, we maintain values n and m representing the number of nodes and edges in the subgraph induced by the current set of labels, say A. We also maintain \(n_k\) and \(m_k\), the number of additional nodes and edges if k is added to A. At each iteration, we select the label optimizing \(\frac{m + m_k}{n + n_k}\). We will discuss the selection process later. Once the label is selected, we update the counters \(m_k\) and \(n_k\). To maintain \(n_k\) properly, we keep track of what nodes are already in V(A), using an indicator \(r_v\) with \(r_v = 1\) if \(v \in V(A)\). The pseudo-code for the algorithm is given in Algorithm 2.

Algorithm 2: GreedyOr
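Again, a naive recompute-from-scratch sketch of the greedy loop, reusing induced_density and f_or from the earlier sketch; this \(\mathcal {O} \mathopen {}\left( p {\left| L\right| }\right)\) variant is in fact the one we use in the experiments of Sect. 7, whereas Algorithm 2 maintains the counters and the convex hull described below.

```python
# A naive sketch of GreedyOr: add labels until every label is used, and
# return the best observed label set.

def greedy_or(edges, labels):
    A = set()
    best_A, best_d = set(), 0.0
    while A != labels:
        # the label maximizing (m + m_k) / (n + n_k) in the text's notation
        d, k = max((induced_density(edges, A | {k}, f_or), k)
                   for k in labels - A)
        A.add(k)
        if d > best_d:
            best_d, best_A = d, set(A)
    return best_A, best_d
```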

During each iteration, we need to select the label maximizing \(\frac{m+m_k}{n+n_k}\). We can no longer use a priority queue, since n and m change at every iteration. However, we can speed up the selection using a convex hull, a classic concept from computational geometry; see, for example, Li and Klette (2011). First, let us formally define a lower-right convex hull.

Definition 1

Given a set of points \(X = \left\{ (x_i, y_i)\right\}\) in a plane, we define the lower-right convex hull \(H = hull \mathopen {}\left( X\right)\) to be a subset of X such that \(q = (x_q, y_q) \in X\) is not in H if and only if there is a point \(r = (x_r, y_r) \in H\) such that \(x_q \le x_r\) and \(y_q \ge y_r\), or there are two points \(p, r \in H\) such that q is above or on the segment joining p and r.

If we were to plot X on a plane, then \(hull \mathopen {}\left( X\right)\) is the lower-right portion of the complete convex hull, that is, a set of points in X that form a convex polygon containing X. For notational simplicity, we will refer to \(hull \mathopen {}\left( X\right)\) as the convex hull. Note that if we order the points in \(hull \mathopen {}\left( X\right)\) by their x-coordinates, then the y-coordinates and the slopes of the intermediate segments are also increasing.
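For concreteness, the following sketch extracts the lower-right hull of a static point set with Andrew's monotone chain, a standard construction; the algorithm itself maintains the hull dynamically with the structure of Overmars and Van Leeuwen (1981), so this static version is only meant to illustrate Definition 1.

```python
# A sketch: the lower-right convex hull of a static set of (x, y) points.

def cross(o, a, b):
    return (a[0]-o[0]) * (b[1]-o[1]) - (a[1]-o[1]) * (b[0]-o[0])

def lower_right_hull(points):
    lowest = {}                            # keep the lowest y for each x
    for x, y in points:
        lowest[x] = min(y, lowest.get(x, y))
    pts = sorted(lowest.items())
    lower = []
    for p in pts:                          # Andrew's monotone chain
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    # keep the suffix starting at the minimum-y vertex, along which the
    # y-coordinates and the slopes of the segments increase
    i = min(range(len(lower)), key=lambda j: lower[j][1])
    return lower[i:]
```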

We will first argue that we only need to search the convex hull when looking for the optimal label.

Theorem 4

Let X be a set of positive points \((m_i, n_i)\), and let \(H = hull \mathopen {}\left( X\right)\) be its lower-right convex hull. Select \(m, n \ge 0\). Then \(\max _{(m_i, n_i) \in X} \frac{m+m_i}{n+n_i} = \max _{(m_i, n_i) \in H} \frac{m+m_i}{n+n_i}\).

Proof

Let \(k = (m_k, n_k)\) be the optimal point in X. Assume that \(k \notin H\). Assume that there is a point \(q = (m_q, n_q)\) in H such that \(m_q \ge m_k\) and \(n_q \le n_k\). Then \(\frac{m+m_k}{n+n_k} \le \frac{m+m_q}{n+n_q}\), so the point q is also optimal.

Assume there is no such point q. Then the x-coordinate of point k falls between two consecutive points p and q in H, that is, \(m_p< m_k < m_q\). Then k must be above the segment between p and q, as otherwise k would also be a part of H. Therefore, the slope of the segment between p and k must be greater than the slope of the segment between p and q, and the slope of the segment between k and q must be smaller,

$$\begin{aligned} \frac{n_q-n_k}{m_q-m_k} \le \frac{n_q-n_p}{m_q-m_p} \le \frac{n_k-n_p}{m_k-m_p}. \end{aligned}$$
(1)

Furthermore, since \(k \notin H\), we must have \(n_k > n_p\). By assumption, we also have \(n_k < n_q\). In summary, we have \(n_p< n_k < n_q\) and \(m_p< m_k < m_q\), which means that the slopes in Eq. 1 are all positive. Taking the reciprocals then gives

$$\begin{aligned} \frac{m_q-m_k}{n_q-n_k} \ge \frac{m_q-m_p}{n_q-n_p} \ge \frac{m_k-m_p}{n_k-n_p}. \end{aligned}$$
(2)

Denote the objective value at point k by \(c = \frac{m+m_k}{n+n_k}\). Let \(x_1 = c(n+n_p) - m\). Then, the optimality of k implies \(\frac{m+x_1}{n+n_p} = c \ge \frac{m+m_p}{n+n_p}\), which means \(x_1 \ge m_p\). The definition of c leads to \(m = c(n+n_k) - m_k\), which in turn leads to \(x_1 = c(n_p-n_k) + m_k\). Solving for c, we get \(c = \frac{m_k-x_1}{n_k-n_p}\). Substituting \(x_1 \ge m_p\) yields \(c \le \frac{m_k-m_p}{n_k-n_p}\); using Eq. 2 then yields \(c \le \frac{m_q-m_k}{n_q-n_k}\).

Next, let \(x_2 = c(n_q-n_k) + m_k\) which means that \(c = \frac{x_2-m_k}{n_q-n_k}\). Now since \(c \le \frac{m_q-m_k}{n_q-n_k}\) we must have \(x_2 \le m_q\). Since \(m_k = c(n+n_k) - m\), we also have \(x_2 = c(n_q+n) - m\), yielding \(c = \frac{m+x_2}{n+n_q} \le \frac{m+m_q}{n+n_q}\), thus q is also optimal. \(\square\)

Theorem 4 states that we need to only consider the convex hull H of the set \(\left\{ (m_i, n_i)\right\}\) when searching for the optimal new label. Note that H does not depend on n or m. Moreover, we can use the algorithm by Overmars and Van Leeuwen (1981) to maintain H as \(n_k\) and \(m_k\) are updated in \(\mathcal {O} \mathopen {}\left( \log ^2 {\left| L\right| }\right)\) time per update. We will see that the number of needed updates is bounded by the number of edge-label pairs.

However, the convex hull can be as large as the original set, so our goal is to avoid enumerating over the whole set. To this end, we design a binary search strategy over the hull. We will first introduce two quantities used in our search.

Definition 2

Given two points \(p, q \in hull \mathopen {}\left( X\right)\), we define the inverse slope as \(s(p,q) = \frac{m_q-m_p}{n_q-n_p}\) and the bias term as \(b(p,q) = \frac{m_q n_p - m_p n_q}{n_q-n_p}\).

First, let us prove that both s and b are monotonically decreasing.

Lemma 1

Let p, q, and r be three consecutive points in \(hull \mathopen {}\left( X\right)\). Then we have \(n \times s(q,r) + b(q,r) \le n \times s(p,q) + b(p,q)\), for any \(n \ge 0\).

Proof

The slope for the segment between p and q is less than or equal to the slope for the segment between q and r. Inversing the slopes leads to

$$\begin{aligned} s(q, r) = \frac{m_r-m_q}{n_r-n_q} \le \frac{m_q-m_p}{n_q-n_p} = s(p, q). \end{aligned}$$

By cross-multiplying, adding \(m_q n_q - m_q n_p - m_q n_r + \frac{m_q n_p n_r}{n_q}\) to both sides, multiplying by \(\frac{n_q}{(n_r-n_q)(n_q-n_p)}\), and simplifying, we get

$$\begin{aligned} b(q,r) = \frac{m_r n_q - m_q n_r}{n_r-n_q} \le \frac{m_q n_p - m_p n_q}{n_q-n_p} = b(p,q). \end{aligned}$$

Combining the two equations proves the claim. \(\square\)

Next, we show the key necessary condition for the optimal point.

Lemma 2

Let p, q, and r be 3 consecutive points in \(hull \mathopen {}\left( X\right)\). Select \(n, m \ge 0\). If q optimizes \(\frac{m_q + m}{n_q + n}\), then \(n \times s(q,r) + b(q,r) \le m \le n \times s(p,q) + b(p,q)\).

Proof

Since q is optimal, we have \(\frac{m+m_p}{n+n_p} \le \frac{m+m_q}{n+n_q}\). Solving for m gives us \(m \le n\frac{m_q-m_p}{n_q-n_p} + \frac{m_q n_p - m_p n_q}{n_q-n_p} = n \times s(p,q) + b(p,q)\). Similarly, due to optimality, \(\frac{m+m_r}{n+n_r} \le \frac{m+m_q}{n+n_q}\), and solving for m leads to \(m \ge n \times s(q,r) + b(q,r)\), proving the claim. \(\square\)

The two lemmas allow us to use binary search as follows. Given two consecutive points p and q, we test whether \(m \le n \times s(p,q) + b(p,q)\). If true, then the optimal label is q or to the right of q; if false, the optimal point is to the left of q. To perform the binary search, we can directly use the structure maintained by the algorithm of Overmars and Van Leeuwen (1981), since it stores the current convex hull in a balanced search tree. Moreover, the algorithm allows evaluating any function based on the neighboring points; specifically, we can maintain s and b. In summary, we can find the optimal label in \(\mathcal {O} \mathopen {}\left( \log {\left| L\right| }\right)\) time.
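As an illustration, a minimal sketch of this binary search over an array-based hull is given below; the tree-based structure of Overmars and Van Leeuwen (1981) supports the same test, and the plain-list representation here is only for exposition.

```python
# A sketch of the binary search for the optimal label over the lower-right
# convex hull, stored as a list of (m_i, n_i) points with increasing m_i.

def select_label(hull, n, m):
    lo, hi = 0, len(hull) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        (mp, np_), (mq, nq) = hull[mid], hull[mid + 1]  # consecutive p, q
        s = (mq - mp) / (nq - np_)                      # inverse slope
        b = (mq * np_ - mp * nq) / (nq - np_)           # bias term
        if m <= n * s + b:
            lo = mid + 1   # the optimal point is q or to the right of q
        else:
            hi = mid       # the optimal point is to the left of q
    return hull[lo]        # the point maximizing (m + m_i) / (n + n_i)
```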

Our next result formalizes the above discussion.

Theorem 5

GreedyOr runs in \(\mathcal {O} \mathopen {}\left( p\log ^2{{\left| L\right| }}+{\left| V\right| }+{\left| E\right| }\right)\) time, where p is the number of edge-label pairs \(p={\left| \{(e,k)\mid e\in E, k\in lab (e)\}\right| }\).

Proof

The proof is similar to the proof of Theorem 2, except we have replaced a search tree with the convex hull structure by Overmars and Van Leeuwen (1981). The inner for-loops are evaluated at most \(\mathcal {O} \mathopen {}\left( p\right)\) times since an edge or a node is visited only once, and \(\sum _v {\left| S_v\right| } \in \mathcal {O} \mathopen {}\left( p\right)\). Maintaining the hull requires \(\mathcal {O} \mathopen {}\left( \log ^2 {\left| L\right| }\right)\) time, and there are at most \(\mathcal {O} \mathopen {}\left( p\right)\) such updates. Searching for an optimal label requires \(\mathcal {O} \mathopen {}\left( \log {\left| L\right| }\right)\) time, and there are at most \({\left| L\right| }\) such searches. \(\square\)

We should point out that a faster algorithm by Brodal and Jacob (2002) maintains the convex hull in \(\mathcal {O} \mathopen {}\left( \log {\left| L\right| }\right)\) time. However, this algorithm does not provide a search tree structure that we can use to search for the optimal addition.

5 Finding subgraphs with high \(\alpha\)-density

In this section, we focus on the problem LD-\(\alpha\) of finding subgraphs with high \(\alpha\)-density.

The following classic result in fractional programming (Dinkelbach, 1967) shows how the problem of finding the maximum density subgraph reduces to maximizing the \(\alpha\)-density of a subgraph for a large enough value of \(\alpha\). An immediate consequence of this result is that solving LD-\(\alpha\) is NP-hard.

Theorem 6

Write \(H_\alpha\) to be the solution to LD-\(\alpha\). There is \(\tau\) such that \(H_\tau\) also solves LD. Moreover, for any \(\alpha > \tau\), the graph \(H_\alpha\) either solves LD or is empty.

Proof

Let \(H^*\) be a solution to LD with \(\sigma = d(H^*)\). Since there are a finite number of subgraphs, there is \(\tau < \sigma\) such that any graph H with \(d(H) \ge \tau\) has \(d(H) = \sigma\).

Since \(g(H^*; \tau ) > 0\) and \(H_\tau\) maximizes \(g(\cdot ; \tau )\), we have \(g(H_\tau ; \tau ) > 0\), that is, \({\left| E(H_\tau )\right| } - \tau {\left| V(H_\tau )\right| } > 0\), which implies \(d(H_\tau ) > \tau\). By the definition of \(\tau\), the subgraph \(H_\tau\) solves LD.

Similarly, for any \(\alpha > \tau\), we have \(g(H_\alpha ; \alpha ) \ge 0\). Consequently, either \(H_\alpha\) is empty or \(d(H_\alpha ) \ge \alpha > \tau\), that is, \(H_\alpha\) solves LD. \(\square\)

Corollary 1

LD-\(\alpha\) is NP-hard for both \(f_{ OR }\) and \(f_{ AND }\). Moreover, both problems are inapproximable unless \({\textbf {P}}={\textbf {NP}}\).

Proof

The proof is given in Appendix 1. \(\square\)

To find solutions to LD-\(\alpha\) in practice, we adapt the previous greedy algorithms to find subgraphs with high \(\alpha\)-density. In the conjunctive case, we get the GreedyAnd-\(\alpha\) algorithm by simply changing the density on line 4 of Algorithm 1 from \(\frac{m_\ell }{n_\ell }\) to \(m_\ell -\alpha n_\ell\). This leads to the same computational complexity as for GreedyAnd.

In the disjunctive case, we again keep track of the counters to find the number of additional nodes and edges when a label is added to the current set of labels. However, the \(\alpha\)-density to maximize now becomes \((m+m_k)-\alpha (n+n_k)\). As \(m-\alpha n\) does not depend on the label, we only need to find the label k that maximizes \(m_k-\alpha n_k\). We may thus use a balanced search tree as in the conjunctive case. The pseudo-code for this algorithm is given in Algorithm 3.

Algorithm 3: GreedyOr-\(\alpha\)
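A naive sketch of the corresponding greedy loop, reusing f_or from the earlier sketches; Algorithm 3 replaces the per-candidate recomputation below with the counters \(m_k\), \(n_k\) and a balanced search tree keyed on \(m_k - \alpha n_k\).

```python
# A naive sketch of GreedyOr-alpha: maximize the alpha-density
# g = |E| - alpha * |V| instead of the ratio |E| / |V|.

def alpha_density(edges, B, f, alpha):
    E = [(u, v) for u, v, lab in edges if f(lab, B)]
    V = {x for e in E for x in e}
    return len(E) - alpha * len(V)

def greedy_or_alpha(edges, labels, alpha):
    A = set()
    best_A, best_g = set(), 0.0      # the empty label set gives g = 0
    while A != labels:
        g, k = max((alpha_density(edges, A | {k}, f_or, alpha), k)
                   for k in labels - A)
        A.add(k)
        if g > best_g:
            best_g, best_A = g, set(A)
    return best_A, best_g
```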

As GreedyOr-\(\alpha\) does not need to use a convex hull but uses a balanced search tree instead, the running time becomes the same as for the conjunctive case.

Theorem 7

GreedyAnd-\(\alpha\) and GreedyOr-\(\alpha\) run in \(\mathcal {O} \mathopen {}\left( p\log {{\left| L\right| }}+{\left| V\right| }+{\left| E\right| }\right)\) time, where p is the number of edge-label pairs \(p={\left| \{(e,k)\mid e\in E, k\in lab (e)\}\right| }\).

Proof

The proofs for both cases are virtually the same as the proof of Theorem 2. \(\square\)

We conclude this section by considering the (lack of the) hierarchy property of \(\alpha\)-density. Tatti (2019) showed that the subgraphs (without label constraints) optimizing \(g(\cdot , \alpha )\) form a nested structure, that is, if we write \(H_\alpha\) to be the optimal solution, then \(H_\beta \subseteq H_\alpha\) for any \(\beta > \alpha\). Such a decomposition may be useful as it partitions the nodes into increasingly dense regions. Unfortunately, this is not the case for us as shown in Fig. 2.

Fig. 2

Subgraphs with optimal \(\alpha\)-density are not nested. Left figure: \(\ell _1\) is optimal for \(\alpha = 3/4\) and \(\ell _2\) is optimal for \(\alpha = 1/4\) when using \(f_{ AND }\). Right figure: \(\ell _1\) is optimal for \(\alpha = 2.25\) and \(\ell _2\) is optimal for \(\alpha = 1.75\) when using \(f_{ OR }\)

Interestingly enough, if we allow more flexible queries, we again obtain a nested structure. More formally, given a Boolean formula B over the labels, we define G(B) to be the subgraph consisting of the edges whose labels satisfy B, together with the incident vertices. The optimization problem is then to find the Boolean formula B maximizing \(g(G(B); \alpha )\). We then have the following proposition.

Proposition 1

Let \(H_\alpha\) be a subgraph induced by a Boolean formula \(B_\alpha\) that optimizes \(g(\cdot ; \alpha )\). Then \(H_\alpha \subseteq H_\beta\) for any \(\alpha > \beta\).

Proof

Assume otherwise. Write \(X = H_\alpha \cup H_\beta\) and \(Y = H_\alpha \cap H_\beta\). Note that X is induced by \(B_\alpha \vee B_\beta\) and Y is induced by \(B_\alpha \wedge B_\beta\). Then

$$\begin{aligned} g(X; \beta ) - g(H_\beta ; \beta ) > g(X; \alpha ) - g(H_\beta ; \alpha ) = g(H_\alpha ; \alpha ) - g(Y; \alpha ) \ge 0, \end{aligned}$$

where the last inequality is due to the optimality of \(H_\alpha\). Thus, \(g(X; \beta ) > g(H_\beta ; \beta )\) violating the optimality of \(H_\beta\). \(\square\)

6 Related work

A work closely related to our method is the approach proposed by Galbrun et al. (2014). Here, the authors search for multiple dense subgraphs that can be explained by a conjunction of (or a majority of) the node labels. The authors propose a greedy algorithm for finding such subgraphs. Interestingly enough, the authors do not show that the underlying problem is NP-hard—although we conjecture that this is indeed the case—but instead show that the subproblem arising from the greedy approach is NP-hard.

Another closely related work is an approach proposed by Pool et al. (2014), where the authors search for dense subgraphs that can be explained by queries on the nodes. The quality of the subgraphs is a ratio S/C, where S measures the goodness of a subgraph using the edges within the subgraph as well as the cross-edges, and C measures the complexity of the query.

The major difference between our work and the aforementioned works is that our method uses labels on the edges. While conceptually a small difference, this distinction leads to different algorithms and different analyses of those algorithms. Moreover, we cannot directly apply the previously discussed methods to networks that only have labels on edges.

An appealing property of finding subgraphs that maximize \({\left| E(W)\right| }/{\left| W\right| }\), or equivalently the average degree, is that the optimal solution can be found in polynomial time (Goldberg et al., 1984). Furthermore, the graph can be 2-approximated with a simple linear algorithm (Charikar, 2000). The algorithm iteratively removes the node with the smallest degree and then selects the best observed graph. This algorithm is essentially the same as the algorithm used to discover k-cores, subgraphs with minimum degree at least k. The connection between k-cores and dense subgraphs is further explored by Tatti (2019), where the dense subgraphs are extended to create an increasingly dense structure. A variant of the quality measure was proposed by Tsourakakis (2015), where the quality of the subgraph is the ratio of triangles over the vertices. In another variant, by Bonchi et al. (2019), the edges were replaced with paths of length at most k. Finding such structures in labeled graphs poses an interesting line of future work.

While finding dense subgraphs is polynomial, finding cliques is an NP-hard problem with a very strong inapproximability bound (Håstad, 1996). Finding cliques may also be impractical since they do not allow any absent edges. To relax the requirement, Abello et al. (2002) and Uno (2010) proposed searching for quasi-cliques, that is, subgraphs with a high proportion of edges, \({\left| E(W)\right| } / \left( {\begin{array}{c}{\left| W\right| }\\ 2\end{array}}\right)\). Another relaxation of cliques is the k-plex, where k absent edges are allowed for a vertex (Seidman, 1983). Finding k-plexes remains an NP-hard problem (Balasundaram et al., 2011). Alternatively, we can relax the definition by considering n-cliques, where vertices must be connected with an n-path (Bron & Kerbosch, 1973), or n-clans, where we also require that the diameter of the subgraph is n (Mokken, 1979). Since a 1-clique (and a 1-clan) is a clique, these problems remain computationally intractable.

7 Experimental evaluation

In this section, we describe our experimental evaluation of the GreedyAnd and GreedyOr algorithms. First, we observe how the algorithms behave on synthetic data with increasing randomness. Then we apply the algorithms to real-world datasets and analyze the results.

We implement our algorithms in Python, and the source code is available online.Footnote 1 Since the number of labels in our experiments was not exceedingly large, we did not use the convex-hull speed-up when implementing the discovery of disjunctive-induced graphs. Instead, we search for the optimal label from scratch, leading to a running time of \(\mathcal {O} \mathopen {}\left( p{\left| L\right| }\right)\).

Experiments with synthetic data: We evaluate the greedy algorithms on synthetic graphs of 200 vertices and 50 labels. We select 5 of the labels as target labels and construct graphs for the conjunctive and disjunctive cases such that selecting the subgraph induced by these 5 labels gives the best density. We then add random noise to the network by introducing a noise parameter \(\epsilon\), which controls the probability of randomly adding and removing edges as well as adding new labels to the edges.

For the conjunctive case, we create five disjoint cliques of 10 vertices such that all edges of the kth clique have all target labels except the kth one. Finally, we add one more clique of 20 vertices whose edges have all of the target labels. Since each of the smaller cliques is missing one of the target labels, the conjunction of all five target labels induces the 20-vertex clique, which is the densest subgraph.

Given the noise parameter \(\epsilon\), we then add noise: each clique edge is removed with probability \(\epsilon\), and an edge between any other pair of vertices is added with probability \(\epsilon\). Finally, each remaining clique edge gains each of the other labels with probability \(\epsilon\), with the exception that we never add the missing target labels to the clique edges.

For the disjunctive case, we create one clique with 40 vertices. The edges of the clique are split into five sets, such that each set of edges receives one of the target labels. Now, selecting the disjunction of the five target labels induces the whole clique as the subgraph and results in the highest density.

We then add noise by removing edges from the clique and adding new edges between any other pair of vertices with probability \(\epsilon\). In addition, each edge gains each of the other labels with probability \(\epsilon\).

Fig. 3

Density of the subgraph induced by the target labels and of the subgraph induced by the labels chosen by the greedy algorithms as a function of the noise \(\epsilon\) in the network. The lines show the mean density over 10 runs for each \(\epsilon\), and the vertical error bars show the standard deviation. The results for the GreedyAnd algorithm are on the left and for GreedyOr on the right

Fig. 4

Running time of the GreedyAnd algorithm as a function of the number of vertices (left) and the number of labels (right) in our synthetic graphs

Fig. 5

Running time of the GreedyOr algorithm as a function of the number of vertices (left) and the number of labels (right) in our synthetic graphs.

We repeat the experiments with increasing values of \(\epsilon\) and compare the density of the subgraph induced by the target labels to the density of the subgraph induced by the labels of the greedy algorithms. For each \(\epsilon\), we run the experiment 10 times and compute the mean and standard deviation of the runs. The results are shown in Fig. 3.

In both cases, the greedy algorithms correctly find the target labels for small values of \(\epsilon\). For \(\epsilon > 0.25\) for GreedyAnd and \(\epsilon > 0.35\) for GreedyOr, the algorithms start to find other sets of labels, which yield higher densities than the target labels, as many of the edges in the target clique have been removed and other edges have been added. However, at \(\epsilon = 0.30\), GreedyOr returns a suboptimal solution that yields a slightly lower density than the target labels.

We confirm the theoretical running times of the algorithms by setting \(\epsilon =0.2\) and performing experiments with increasingly large graphs, where the total number of vertices grows from 10 000 up to 100 000 while the other aspects of the experiments remain constant. Similarly, we test how the running times of the algorithms scale as the total number of labels in our synthetic graph increases from 1000 to 10 000. The results for GreedyAnd are shown in Fig. 4 and the results for GreedyOr in Fig. 5.

As expected, the running times of both algorithms scale linearly with the number of vertices in the graph. Furthermore, the running time of our naive implementation of GreedyOr appears to scale quadratically with the number of labels, while the scaling for GreedyAnd is close to linear. These results confirm our theoretical analysis and show that our algorithms can be applied to large graphs in practice.

Experiments with real-world datasets: We test the greedy algorithms by running experiments on four real-world datasets. The first dataset is the Enron Email Dataset,Footnote 2 which consists of publicly available emails from employees of the former Enron Corporation. We collect the emails in the sent-mail folders and construct a graph where an edge is added between the sender and each recipient of an email. Each edge has labels consisting of the stemmed words in the email's title, with stop words and words containing numbers removed.

The second dataset consists of high energy physics theory publications (HEP-TH) from the years 1992 to 2003. The data was originally released in the KDD Cup,Footnote 3 but we use a preprocessed version of the data available on GitHub.Footnote 4 We create the network by adding the authors as vertices, and an edge between two authors is added if they share at least two publications. Each such edge is then given labels consisting of the stemmed words in the titles of the articles shared by the two authors. We exclude stop words and words containing numbers from the titles in the same way as for the Enron dataset.

The third dataset consists of publications from the DBLPFootnote 5 dataset (Tang et al., 2008). From this dataset, we chose publications from ECMLPKDD, ICDM, KDD, NIPS, SDM, and WWW conferences. The network is constructed in the same way as for the HEP-TH data, with authors as vertices, two or more shared publications as edges, and stemmed and filtered words from the titles as labels.

The fourth and final dataset consists of the latest 10000 tweets collected from Twitter APIFootnote 6 with the hashtag #metoo by the 27th of May, 23:59 UTC. We create the network by having users as vertices with an edge between a pair of users if one of them has retweeted or responded to one of the other’s tweets. The labels on the edge are then any hashtags in the retweets or response tweets between the two users.

We construct the networks by filtering out labels that appear in less than 0.1% of the edges in the Enron and Twitter datasets, or labels that occur in less than 0.5% of the papers in the case of the HEP-TH and DBLP datasets. The sizes, label counts, and densities of the resulting graphs are shown in Table 1.

We run the greedy algorithms on each of these graphs, and compare the results against the densest subgraph ignoring the labels (Dense). We report the statistics for the label-induced subgraphs and the densest subgraphs in Table 2.

For each of the datasets, both algorithms find label-induced subgraphs with higher densities than in the original graphs. In most cases, the restriction of constructing label-induced subgraphs results in clearly lower densities compared to the densest label-ignorant subgraphs. Interestingly, for the DBLP dataset GreedyAnd finds a label-induced subgraph with a very high density that is close to the density of the densest subgraph ignoring the labels. The running times are practical: the algorithm processes networks with 100 000 edge-label pairs in seconds.

For the Enron and HEP-TH datasets, GreedyOr returns large sets of labels resulting in large subgraphs, whereas GreedyAnd selects only a few labels with smaller induced subgraphs in each case. For the Twitter dataset, both greedy algorithms select only one label, which induces a small subgraph with a notably higher density than the original graph.

Table 1 Basic characteristics of the networks: number of vertices \({\left| V\right| }\), number of edges \({\left| E\right| }\), number of labels \({\left| L\right| }\), number of edge-label pairs p, and the density \(d(G) = {\left| E\right| } / {\left| V\right| }\)
Table 2 Statistics for the resulting subgraphs for the greedy algorithms and the label-ignorant densest subgraph algorithm. For the label-induced subgraphs, we have the number of vertices n, the number of edges m, the size of the best set of labels \({\left| A\right| }\), density d, and running time t in seconds. For the densest subgraph, we show the number of vertices n and density \(d = m/n\)

Experiments with \(\alpha\)-density: Next we consider finding \(\alpha\)-dense subgraphs by running the GreedyAnd-\(\alpha\) and GreedyOr-\(\alpha\) algorithms on the same datasets. The results are shown in Tables 3 and 4, respectively.

As pointed out by Theorem 6, the optimal \(\alpha\)-dense subgraph is also the densest for sufficiently large \(\alpha\). We use a binary search to find the maximum \(\alpha\) for which the greedy algorithm yields a non-empty graph. The values of \(\alpha\) in these tables are chosen by the binary search process while searching for the maximum. Additionally, we experiment with using a smaller \(\alpha\) value of 0.25 times the maximum. For clarity, we exclude duplicated results where different values of \(\alpha\) yield the same subgraph.
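A sketch of this search, assuming a hypothetical wrapper greedy_alpha(alpha) that runs one of the greedy algorithms and returns the found subgraph (empty, and hence falsy, when no non-empty subgraph is found):

```python
# A sketch of the binary search for the largest alpha that still yields a
# non-empty greedy solution; greedy_alpha is a hypothetical helper.

def find_max_alpha(greedy_alpha, hi=1.0, iters=30):
    while greedy_alpha(hi):          # first grow the upper bound
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if greedy_alpha(mid):        # non-empty subgraph found
            lo = mid
        else:
            hi = mid
    return lo
```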

We see that the greedy algorithms for the two problems often find the same solution, as suggested by Theorem 6. However, this is not always the case due to the heuristic nature of these algorithms. Interestingly, with \(\alpha =2.5\) for the HEP-TH dataset, GreedyAnd-\(\alpha\) finds a denser subgraph than the one found by GreedyAnd, while an additional manual experiment with \(\alpha =1.4\) results in the greedy algorithm suboptimally returning an empty graph. For the DBLP dataset, using \(\alpha = 3.6\) leads to the same solution as GreedyAnd, but larger values of \(\alpha\) lead the greedy algorithm to choose a suboptimal first label, resulting in less dense subgraphs. For the Enron and HEP-TH datasets, GreedyOr-\(\alpha\) only finds subgraphs with a slightly lower density than the ones found by GreedyOr.

In general, we observe that using smaller values of \(\alpha\) results in subgraphs with more vertices and edges in both the disjunctive and conjunctive cases. Thus having \(\alpha\) as a parameter gives us more control over the size of the resulting subgraph and allows us to look for both smaller and larger groups of densely connected nodes.

Table 3 Results for running GreedyAnd-\(\alpha\) on the four datasets with the different values of \(\alpha\). For each resulting subgraph, we have the density \(d = m/n\), the chosen labels, the number of nodes n, and the number of edges m.
Table 4 Results for running GreedyOr-\(\alpha\) on the four datasets with the different values of \(\alpha\).

Case study: We analyze the label-induced dense subgraphs for the Twitter and DBLP datasets by repeatedly running the GreedyAnd algorithm on these graphs. After each run, we remove the edges of the returned label-induced subgraph and run the algorithm again on the remaining graph. The first 8 resulting sets of labels, together with the densities and sizes of the induced subgraphs, are shown in Table 5.

For the DBLP graph, the algorithm finds a group of 25 authors that have each written at least two papers together with a shared topic, as well as other relatively large groups of authors whose edges form almost perfect cliques. The labels representing stemmed words can be used to interpret the topics of publications for these groups of authors having tight collaboration.

For the Twitter data of #metoo tweets, the densest label-induced subgraphs are formed by mostly looking at individual hashtags. This detects groups of people tweeting about #MeTooASE referring to the French Me Too movement for foster children, as well as groups closely discussing other topics in the context of the Me Too movement such as live streaming or the recent trial between Johnny Depp and Amber Heard.

We see that the same labels also appear when searching for \(\alpha\)-dense subgraphs. For example, by looking at the labels for \(\alpha =1.548\) for the Twitter dataset in Table 4 and comparing them with the labels in Table 5, we can see that this subgraph found by the GreedyOr-\(\alpha\) algorithm in fact consists of multiple smaller groups of people discussing a variety of topics that we previously discovered.

Table 5 Label sets with corresponding subgraph densities and sizes selected by running the GreedyAnd algorithm repeatedly on the graphs for DBLP and Twitter datasets.

8 Concluding remarks

In this paper, we considered the problem of finding dense subgraphs that are induced by labels on the edges. More specifically, we considered two cases: conjunctive-induced dense subgraphs, where the edges need to contain the given label set, and disjunctive-induced dense subgraphs, where the edges need to have at least one label in common with the given label set. As a measure of quality, we used the average degree of the subgraph. We showed that both problems are NP-hard, and we proposed a greedy heuristic to find dense induced subgraphs. By maintaining suitable counters, we were able to find subgraphs in quasi-linear time: \(\mathcal {O} \mathopen {}\left( p \log {\left| L\right| }\right)\) for conjunctive-induced graphs and \(\mathcal {O} \mathopen {}\left( p \log ^2 {\left| L\right| }\right)\) for disjunctive-induced graphs. In addition, we analyzed the related problem of maximizing the number of edges minus \(\alpha\) times the number of vertices and showed how the optimal solutions to these problems are connected. We proved that the problem of maximizing this \(\alpha\)-density is NP-hard and inapproximable unless \({\textbf {P}}={\textbf {NP}}\). We adapted the greedy algorithms to the conjunctive and disjunctive cases of this problem, resulting in a running time of \(\mathcal {O} \mathopen {}\left( p \log {\left| L\right| }\right)\) for the disjunctive case as well. We then demonstrated that the algorithms are practical, that they can recover the ground truth in synthetic datasets, and that they find interpretable results in real-world networks.

While this paper focused on the conjunctive and disjunctive cases, future work could explore other ways to induce graphs from a label set and design efficient algorithms for such tasks. Another direction for future work is to relax the requirement that every edge/node must be induced from labels. Instead, we can allow some deviation from this requirement but then penalize the deviations appropriately when assessing the quality of the subgraph.