Dense subgraphs induced by edge labels

Finding densely connected groups of nodes in networks is a widely used tool for analysis in graph mining. A popular choice for finding such groups is to find subgraphs with a high average degree. While useful, interpreting such subgraphs may be difficult. On the other hand, many real-world networks have additional information, and we are specifically interested in networks with labels on edges. In this paper, we study finding sets of labels that induce dense subgraphs. We consider two notions of density: average degree and the number of edges minus the number of nodes weighted by a parameter $\alpha$. There are many ways to induce a subgraph from a set of labels, and we study two cases: First, we study conjunctive-induced dense subgraphs, where the subgraph edges need to have all labels. Secondly, we study disjunctive-induced dense subgraphs, where the subgraph edges need to have at least one label. We show that both problems are NP-hard. Because of the hardness, we resort to greedy heuristics. We show that we can implement the greedy search efficiently: the respective running times for finding conjunctive-induced and disjunctive-induced dense subgraphs are in $O(p \log k)$ and $O(p \log^2 k)$, where $p$ is the number of edge-label pairs and $k$ is the number of labels. Our experimental evaluation demonstrates that we can find the ground truth in synthetic graphs and that we can find interpretable subgraphs from real-world networks.


Introduction
Finding dense subgraphs in networks is a common tool for analyzing networks with potential applications in diverse domains, such as bioinformatics [11,16], finance [10], social media [2], or web graph analysis [11].
While useful on their own, analyzing dense subgraphs without any additional explanation may be difficult for domain experts and consequently may limit its usability.
Fortunately, it is often the case that the network has additional information, such as labels associated with nodes and/or edges. For example, in social networks, users may have tags describing themselves. In networks arising from communication, for example by email or Twitter, the labels associated with an edge can be tags extracted from the underlying message.
Using the available label information to provide explainable dense subgraphs may ease the burden of domain experts when, for example, studying social networks. In this paper, we consider finding dense subgraphs in networks with labeled edges. More formally, we are looking for a label set that induces a dense subgraph. As a measure of density for a subgraph (W, F) with nodes W and edges F we use |F|/|W|, the ratio of edges over nodes, a popular choice for measuring the density of a subgraph.
We consider two cases: conjunctive-induced and disjunctive-induced dense subgraphs. In the former, the induced subgraph consists of all the edges that have the given label set. In the latter, the induced subgraph consists of all the edges that have at least one label in common with the label set. We give an example of both cases in Figure 1.
Fig. 1 Example graphs with labels on the edges. Edge labels are indicated by colors; dashed edges indicate edges with 2 labels. Left figure: Label ℓ_1 induces a subgraph with 6 nodes and 9 edges, and ℓ_2 induces a subgraph with 7 nodes and 11 edges, while the conjunction of ℓ_1 and ℓ_2 induces a subgraph with 5 nodes and 8 edges, resulting in the highest density of 8/5 = 1.6. Right figure: Labels ℓ_1, ℓ_2, and ℓ_3 each induce subgraphs with 5 nodes and 4 edges, while the disjunction of ℓ_1 and ℓ_2 induces a subgraph with 5 nodes and 7 edges, resulting in a density of 7/5 = 1.4.
Finding the densest subgraph (with no label constraints) can be done in polynomial time [13] and can be 2-approximated in linear time [7]. Unfortunately, additional requirements on the labels make the optimization problem computationally intractable: we show that both problems are NP-hard, which forces us to resort to heuristics. We propose a greedy algorithm for both problems: we start with an empty label set and keep adding the best possible label until no additions are possible. We then return the best observed induced subgraph. The computational bottleneck of the greedy method is selecting a new label. If done naively, evaluating a single label candidate requires enumerating over all the edges. Since this needs to be done for every candidate during every addition, the running time is O(p|L|), where |L| is the number of labels and p is the number of edge-label pairs in the network. By keeping certain counters we can speed up the search. We show that conjunctive-induced graphs can be discovered in O(p log |L|) time using a balanced search tree, and that disjunctive-induced graphs can be discovered in O(p log² |L|) time with the aid of an algorithm originally used to maintain convex hulls. This is an extended version of our previously published conference paper [15]. We extend our earlier work by considering an alternative definition of density: namely, we search for label-induced subgraphs (W, F) with high α-density |F| − α|W|. This density is closely related to the problem of finding a subgraph with maximum density [13] but has also been used to decompose graphs [8,23]. We show that there are values of α for which the α-densest label-induced graph also has the highest density. We then modify the greedy algorithms to find subgraphs with high α-density in O(p log |L|) time for both the conjunctive and disjunctive cases.
The remainder of the paper is organized as follows. In Section 2 we introduce the notation and formalize the optimization problem. In Sections 3-4 we present our algorithms. In Section 5, we analyze the case of using an alternative density metric and adapt the previous algorithms to this problem. Section 6 is devoted to the related work. Finally, we present the experimental evaluation in Section 7 and conclude with a discussion in Section 8. The computational complexity proofs are given in Appendix A.

Preliminary notation and problem definition
In this section, we first describe the common notation and then introduce the formal definition of our problem.
Assume that we are given an edge-labeled graph, that is, a tuple G = (V, E, lab), where V is the set of vertices, E ⊆ {(x, y) ∈ V² | x ≠ y} is the set of undirected edges, and lab : E → 2^L is a function that maps each edge e ∈ E to a set of labels lab(e). Here L is the set of all possible labels.
Given a label ℓ ∈ L, let us write E(ℓ) for the edges having the label ℓ. In addition, let us write V(ℓ) for the nodes adjacent to E(ℓ).
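For concreteness, an edge-labeled graph together with the sets E(ℓ) and V(ℓ) can be represented as follows. This is an illustrative sketch of the definitions only, not the authors' implementation; the toy graph and the helper names E and V are our own.

```python
# A sketch of an edge-labeled graph: an undirected edge is a frozenset of two
# nodes, and lab maps each edge to its set of labels (lab: E -> 2^L).
lab = {
    frozenset({1, 2}): {"a"},
    frozenset({2, 3}): {"a", "b"},
    frozenset({3, 4}): {"b"},
}

def E(label):
    """E(label): the edges carrying the given label."""
    return {e for e, labels in lab.items() if label in labels}

def V(label):
    """V(label): the nodes adjacent to the edges in E(label)."""
    return {v for e in E(label) for v in e}
```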
Our goal is to search for dense regions of graphs that can be explained using the labels. In other words, we are looking for a set of labels that induces a dense graph. More formally, we define an inducing function to be a function f that maps two sets of labels to a binary number; examples of such functions are given below. Given a set of labels B ⊆ L, an inducing function f, and a graph G, we define the label-induced subgraph H = G(f, B) as (V(B), E(B), lab), where E(B) = {e ∈ E | f(lab(e), B) = 1} is the subset of edges that satisfy f, and V(B) is the set of vertices that are adjacent to E(B).
Given a graph G with vertices V and edges E, we measure the density of the graph d(G) as the number of edges divided by the number of vertices, d(G) = |E|/|V|. We should point out that there are alternative choices for a notion of density. For example, one option is to consider the fraction of edges present, $|E| / \binom{|V|}{2}$. However, this measure is problematic since a single edge already yields the maximum value. Consequently, either a size term needs to be incorporated into the objective, which leads to discovering maximum cliques (an NP-hard problem with bad approximation guarantees [14]), or to enumerating all pseudo-cliques with an exponential-time algorithm [1,25]. On the other hand, finding a graph with maximum d(G) can be done in polynomial time [13] and 2-approximated in linear time [7]. See the related work for additional discussion.
We are now ready to state our generic problem.
Problem 1 (LD) Let G = (V, E, lab) be an edge-labeled graph over a set of labels L, with multiple labels being possible for each edge. Assume an inducing function f. Find a set of labels L* such that the density d(H) of the label-induced subgraph H = G(f, L*) is maximized.
We consider two special cases of LD. Firstly, let us define f_AND(A; B) = [B ⊆ A], that is, the induced edges need to contain every label in B. We denote the problem LD paired with f_AND as LDand. Secondly, we define f_OR(A; B) = [A ∩ B ≠ ∅], that is, the induced edges need to have at least one label in common with B. We denote the corresponding problem as LDor. In other words, LDand is the problem of finding dense conjunctive-induced subgraphs, and LDor is the problem of finding dense disjunctive-induced subgraphs.
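The two inducing functions can be sketched directly from their definitions. The representation below (edges as frozensets, lab as a dict) and the toy graph are illustrative assumptions, not the authors' code.

```python
def induced_edges(lab, B, mode):
    """Edges induced by label set B: mode 'and' keeps edges whose label set
    contains all of B (f_AND), mode 'or' keeps edges sharing at least one
    label with B (f_OR)."""
    if mode == "and":
        return {e for e, A in lab.items() if B <= A}   # f_AND(A; B) = [B subset of A]
    return {e for e, A in lab.items() if A & B}        # f_OR(A; B) = [A meets B]

def density(edges):
    """d(H) = |F| / |W| for the subgraph with edges F and adjacent nodes W."""
    nodes = {v for e in edges for v in e}
    return len(edges) / len(nodes) if nodes else 0.0

lab = {
    frozenset({1, 2}): {"a"},
    frozenset({1, 3}): {"a", "b"},
    frozenset({2, 3}): {"a", "b"},
    frozenset({3, 4}): {"b"},
}
```

For instance, the conjunction {a, b} induces the two doubly-labeled edges, while the disjunction with {b} additionally picks up the edge {3, 4}.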
In addition, we consider an alternative to the density d(G) by instead measuring the α-density of the graph g(G; α), the number of edges minus α times the number of vertices: g(G; α) = |E| − α|V|. This measure is closely related to finding the densest subgraph: Goldberg [13] finds a series of α-densest subgraphs when searching for the densest subgraph. However, this measure has also been studied on its own, as it can be used to decompose graphs [8,23]. Our optimization problem is as follows.
Problem 2 (LD-α) Let G = (V, E, lab) be an edge-labeled graph over a set of labels L, with multiple labels being possible for each edge. Assume an inducing function f and a constant α ∈ ℝ. Find a set of labels L* such that the α-density g(H; α) of the label-induced subgraph H = G(f, L*) is maximized.
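The two objectives are directly linked: for a non-empty subgraph, the sign of the α-density encodes a threshold on the density, as the following one-line derivation shows.

```latex
g(G; \alpha) = |E| - \alpha |V| \;\ge\; 0
\quad\Longleftrightarrow\quad
\frac{|E|}{|V|} \;\ge\; \alpha
\quad\Longleftrightarrow\quad
d(G) \;\ge\; \alpha .
```

In particular, a non-empty subgraph with non-negative α-density certifies the existence of a label-induced subgraph of density at least α.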

Finding dense conjunctive-induced graphs
In this section, we focus on LDand, that is, finding conjunctive-induced graphs that are dense.We will first prove that LDand is NP-hard.
Theorem 1 LDand is NP-hard.
Proof The proof is given in Appendix A. □
The NP-hardness forces us to resort to heuristics. Here, we use the algorithm for 2-approximating dense subgraphs [7] as a starting point. That algorithm iteratively removes a node with the smallest degree and returns the best solution among the observed subgraphs. We propose a similar greedy algorithm: we greedily add the best possible label and repeat until the induced subgraph is empty. We then select the best observed label set as the output.
To avoid enumerating over the edges every time we look for a new label, we maintain several counters. Let A be the current set of labels. For each label k, we maintain the number of nodes n_k and edges m_k of the candidate graph, that is, n_k = |V(A ∪ {k})| and m_k = |E(A ∪ {k})|. We store the densities m_k/n_k in a balanced search tree (for example, a red-black tree), which allows us to obtain the largest element quickly. Once we update the set A, we also update the counters and the search tree. Maintaining the node counts n_k requires us to maintain counters r_{v,k}, the number of edges labeled with k adjacent to v: once the counter drops to 0, we decrease n_k by 1. The pseudo-code of the algorithm is given in Algorithm 1.
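The greedy loop can be sketched as follows. This is a simplified, self-contained illustration under our own toy graph representation, not the paper's Algorithm 1: for clarity it rescans all labels in every iteration (the O(p|L|) naive variant), whereas the paper maintains the counters n_k, m_k in a balanced search tree to reach O(p log |L|).

```python
def greedy_and(lab):
    """Greedy search for a dense conjunctive-induced subgraph: repeatedly add
    the label k maximizing the density m_k / n_k of the graph induced by the
    conjunction A | {k}, and return the best label set observed."""
    all_labels = {l for labels in lab.values() for l in labels}
    A, best, best_density = set(), set(), 0.0
    while True:
        scored = []
        for k in all_labels - A:
            # Edges whose label sets contain every label of A | {k} (f_AND).
            edges = [e for e, labels in lab.items() if (A | {k}) <= labels]
            if edges:
                nodes = {v for e in edges for v in e}
                scored.append((len(edges) / len(nodes), k))
        if not scored:
            break
        d, k = max(scored)          # best density; ties broken by label name
        A = A | {k}
        if d > best_density:
            best, best_density = set(A), d
    return best, best_density

# Toy graph: the conjunction {a, b} induces a triangle of density 1,
# denser than either label alone.
lab = {
    frozenset({1, 2}): {"a", "b"},
    frozenset({1, 3}): {"a", "b"},
    frozenset({2, 3}): {"a", "b"},
    frozenset({3, 4}): {"a"},
    frozenset({4, 5}): {"b"},
    frozenset({5, 6}): {"a"},
}
best_labels, best_density = greedy_and(lab)
```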
We conclude with an analysis of the computational complexity of GreedyAnd.
Theorem 2 GreedyAnd runs in O(p log |L|) time, where p is the number of edge-label pairs and L is the set of labels.
(Algorithm 1: GreedyAnd, greedy search for conjunctive-induced dense subgraphs; its final step returns the set of labels A_i that yields the highest density.)

Finding dense disjunctive-induced graphs
In this section, we focus on LDor, that is, finding disjunctive-induced graphs that are dense. We will first prove that LDor is NP-hard.
Theorem 3 LDor is NP-hard.
Proof The proof is given in Appendix A. □
As with LDand, we resort to a greedy search to find good subgraphs: we start with an empty label set and iteratively add the best possible label. Once done, we return the best observed label set.
However, we maintain a different set of counters than in GreedyAnd. The reason is to avoid a significantly higher number of updates: otherwise the inner loop would need to go over the edge-label pairs that are not present in the current graph. More formally, we maintain values n and m representing the number of nodes and edges in the subgraph induced by the current set of labels, say A. We also maintain n_k and m_k, the number of additional nodes and edges if k were added to A. At each iteration, we select the label optimizing (m + m_k)/(n + n_k); we will discuss the selection process later. Once the label is selected, we update the counters m_k and n_k. To maintain n_k properly, we keep track of which nodes are already in V(A) using an indicator r_v, with r_v = 1 if v ∈ V(A). The pseudo-code for the algorithm, GreedyOr, is given in Algorithm 2.
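The disjunctive greedy loop can be sketched as follows. Again this is an illustrative, self-contained version under our own toy representation: it recomputes the counters and scans all candidate labels per step, whereas the paper maintains the counters incrementally and selects the label via a convex hull.

```python
def greedy_or(lab):
    """Greedy search for a dense disjunctive-induced subgraph: repeatedly add
    the label k maximizing (m + m_k) / (n + n_k), where m, n count the edges
    and nodes already induced and m_k, n_k count the additions."""
    all_labels = {l for ls in lab.values() for l in ls}
    A, best, best_density = set(), set(), 0.0
    cur_edges, cur_nodes = set(), set()
    while all_labels - A:
        scored = []
        for k in all_labels - A:
            new_edges = {e for e, ls in lab.items() if k in ls} - cur_edges
            new_nodes = {v for e in new_edges for v in e} - cur_nodes
            m = len(cur_edges) + len(new_edges)   # m + m_k
            n = len(cur_nodes) + len(new_nodes)   # n + n_k
            scored.append((m / n, k, new_edges, new_nodes))
        d, k, new_edges, new_nodes = max(scored, key=lambda t: (t[0], t[1]))
        A.add(k)
        cur_edges |= new_edges
        cur_nodes |= new_nodes
        if d > best_density:
            best, best_density = set(A), d
    return best, best_density

# Toy graph mimicking Figure 1 (right): l1 and l2 alone give density 4/5,
# while their disjunction induces 7 edges on 5 nodes, density 1.4.
lab = {
    frozenset({1, 2}): {"l1", "l2"},
    frozenset({2, 3}): {"l1"},
    frozenset({3, 4}): {"l1"},
    frozenset({4, 5}): {"l1"},
    frozenset({1, 3}): {"l2"},
    frozenset({1, 4}): {"l2"},
    frozenset({2, 5}): {"l2"},
    frozenset({5, 6}): {"l3"},
}
best_labels, best_density = greedy_or(lab)
```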
During each iteration, we need to select the label maximizing (m + m_k)/(n + n_k). We cannot use priority queues any longer since n and m change every iteration. However, we can speed up the selection using a convex hull, a classic concept from computational geometry; see, for example, [17]. First, let us formally define a lower-right convex hull.
Definition 1 Given a set of points X = {(x_i, y_i)} in a plane, we define the lower-right convex hull H = hull(X) to be the subset of X such that a point q = (x_q, y_q) ∈ X is not in H if and only if there is a point r = (x_r, y_r) ∈ H such that x_q ≤ x_r and y_q ≥ y_r, or there are two points p, r ∈ H such that q is above or on the segment joining p and r.
If we were to plot X on a plane, then hull(X) is the lower-right portion of the complete convex hull, that is, of the set of points in X that form a convex polygon containing X. For notational simplicity, we will refer to hull(X) as the convex hull. Note that if we order the points in hull(X) by their x-coordinates, then the y-coordinates and the slopes of the intermediate segments are also increasing.
We will first argue that we only need to search the convex hull when looking for the optimal label.
Theorem 4 Let X be a set of positive points p_i = (m_i, n_i), and let H = hull(X) be the convex hull. Select m, n ≥ 0. Then max_{p_i ∈ X} (m + m_i)/(n + n_i) = max_{p_i ∈ H} (m + m_i)/(n + n_i).
Proof Let k = (m_k, n_k) be the optimal point in X, and assume that k ∉ H. Assume first that there is a point q = (m_q, n_q) in H such that m_q ≥ m_k and n_q ≤ n_k. Then (m + m_k)/(n + n_k) ≤ (m + m_q)/(n + n_q), so the point q is also optimal. Assume then that there is no such point q. Then the x-coordinate of k falls between two consecutive points p and q in H, that is, m_p < m_k < m_q. Then k must be above the segment between p and q, as otherwise k would be part of H. Therefore, the slope of the segment between p and k must be greater than the slope of the segment between p and q, and the slope of the segment between k and q must be smaller,

(n_k − n_p)/(m_k − m_p) > (n_q − n_p)/(m_q − m_p) > (n_q − n_k)/(m_q − m_k).    (1)

Furthermore, since k ∉ H, we must have n_k > n_p. By assumption, we also have n_k < n_q. In summary, we have n_p < n_k < n_q and m_p < m_k < m_q, which means that the slopes in Equation 1 are all positive. Taking the reciprocals then gives

(m_k − m_p)/(n_k − n_p) < (m_q − m_p)/(n_q − n_p) < (m_q − m_k)/(n_q − n_k).

Recall that a mediant (a + c)/(b + d) of two positive fractions a/b and c/d always lies between them. Since k is optimal, (m + m_p)/(n + n_p) ≤ (m + m_k)/(n + n_k); because (m + m_k)/(n + n_k) is the mediant of (m + m_p)/(n + n_p) and (m_k − m_p)/(n_k − n_p), this yields (m + m_k)/(n + n_k) ≤ (m_k − m_p)/(n_k − n_p) < (m_q − m_k)/(n_q − n_k). Since (m + m_q)/(n + n_q) is the mediant of (m + m_k)/(n + n_k) and (m_q − m_k)/(n_q − n_k), it follows that (m + m_k)/(n + n_k) ≤ (m + m_q)/(n + n_q), thus q is also optimal. □

Theorem 4 states that we only need to consider the convex hull H of the set {(m_i, n_i)} when searching for the optimal new label. Note that H does not depend on n or m. Moreover, we can use the algorithm by Overmars and Van Leeuwen [19] to maintain H as the values n_k and m_k are updated, in O(log² |L|) time per update. We will see that the number of needed updates is bounded by the number of edge-label pairs. However, the convex hull can be as large as the original set, so our goal is to avoid enumerating over the whole hull. To this end, we design a binary search strategy over the hull. We will first introduce two quantities used in our search.
Definition 2 Given two points p, q ∈ hull(X), we define the inverse slope as s(p, q) = (m_q − m_p)/(n_q − n_p) and the bias term as b(p, q) = (m_q n_p − m_p n_q)/(n_q − n_p).
First, let us prove that both s and b are monotonically decreasing.
Lemma 1 Let p, q, and r be three consecutive points in hull(X). Then we have n × s(q, r) + b(q, r) ≤ n × s(p, q) + b(p, q) for any n ≥ 0.
Proof The slope of the segment between p and q is less than or equal to the slope of the segment between q and r, that is, (n_q − n_p)/(m_q − m_p) ≤ (n_r − n_q)/(m_r − m_q). Inverting the slopes leads to s(q, r) ≤ s(p, q), and a similar manipulation yields b(q, r) ≤ b(p, q). Combining the two inequalities proves the claim for any n ≥ 0. □
Next, we show a key necessary condition for the optimal point.
Lemma 2 Let p, q, and r be three consecutive points in hull(X). Select n, m ≥ 0. If q optimizes (m_q + m)/(n_q + n), then n × s(q, r) + b(q, r) ≤ m ≤ n × s(p, q) + b(p, q).
Proof Since q is optimal, we have (m + m_q)/(n + n_q) ≥ (m + m_p)/(n + n_p) and (m + m_q)/(n + n_q) ≥ (m + m_r)/(n + n_r). Rearranging the first inequality yields m ≤ n × s(p, q) + b(p, q), and rearranging the second yields n × s(q, r) + b(q, r) ≤ m. □
The two lemmas allow us to use binary search as follows. Given two consecutive hull points p and q, we test whether m ≤ n × s(p, q) + b(p, q). If true, then the optimal label is q or to the right of q; if false, the optimal point is to the left of q. To perform the binary search, we can directly use the structure maintained by the algorithm of Overmars and Van Leeuwen [19], since it stores the current convex hull in a balanced search tree. Moreover, the algorithm allows evaluating any function based on neighboring points; specifically, we can maintain s and b. In summary, we can find the optimal label in O(log |L|) time.
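This binary search can be sketched compactly over a static hull stored as a sorted list (the paper maintains the hull dynamically with the structure of Overmars and Van Leeuwen [19]; the hull points and helper names below are illustrative).

```python
def s(p, q):
    """Inverse slope s(p, q) = (m_q - m_p) / (n_q - n_p) for points (m, n)."""
    return (q[0] - p[0]) / (q[1] - p[1])

def b(p, q):
    """Bias term b(p, q) = (m_q * n_p - m_p * n_q) / (n_q - n_p)."""
    return (q[0] * p[1] - p[0] * q[1]) / (q[1] - p[1])

def best_on_hull(hull, m, n):
    """Binary search for the hull point maximizing (m + m_k) / (n + n_k).
    `hull` lists the lower-right hull sorted by increasing m_k (and n_k).
    By Lemma 2, m <= n * s(p, q) + b(p, q) means the optimum is q or to
    the right of q; otherwise it lies to the left of q."""
    lo, hi = 0, len(hull) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        p, q = hull[mid - 1], hull[mid]
        if m <= n * s(p, q) + b(p, q):
            lo = mid        # optimum is q or further right
        else:
            hi = mid - 1    # optimum is strictly left of q
    return hull[lo]
```

For example, on the hull [(1, 1), (3, 2), (4, 4)] with m = n = 0 the search returns (3, 2), the point with the largest ratio m_k/n_k.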
Our next result formalizes the above discussion.
Theorem 5 GreedyOr runs in O(p log² |L|) time, where p is the number of edge-label pairs.
We should point out that a faster algorithm by Brodal and Jacob [5] maintains the convex hull in O(log |L|) time per update. However, this algorithm does not provide a search tree structure that we can use to search for the optimal addition.

Finding subgraphs with high α-density
In this section, we focus on the problem LD-α of finding subgraphs with high α-density.
The following classic result in fractional programming [9] shows how the problem of finding the maximum-density subgraph reduces to maximizing the α-density of a subgraph for a large enough value of α. An immediate consequence of this result is that solving LD-α is NP-hard. The concluding step of the argument runs as follows: write H_α for an optimal solution of LD-α, and let τ be the largest density of a label-induced subgraph below the optimum of LD. Then for any α > τ that does not exceed the optimum, we have g(H_α; α) ≥ 0. Consequently, either H_α is empty or d(H_α) ≥ α > τ, that is, H_α solves LD. □
Corollary 1 LD-α is NP-hard for both f_OR and f_AND. Moreover, both problems are inapproximable unless P = NP.
Proof The proof is given in Appendix A. □
To find solutions to LD-α in practice, we adapt the previous greedy algorithms to find subgraphs with high α-density. In the conjunctive case, we obtain the GreedyAnd-α algorithm by simply changing the density on line 4 of Algorithm 1 from m_ℓ/n_ℓ to m_ℓ − αn_ℓ. This leads to the same computational complexity as for GreedyAnd.
In the disjunctive case, we again keep track of counters giving the number of additional nodes and edges when a label is added to the current set of labels. However, the α-density to maximize now becomes (m + m_k) − α(n + n_k). As m − αn does not depend on the label, we only need to find the label k that maximizes m_k − αn_k. We may thus use a balanced search tree as in the conjunctive case, for example by keeping the labels sorted by the values m_ℓ − αn_ℓ in a red-black tree. The pseudo-code for this algorithm, GreedyOr-α, is given in Algorithm 3.
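The α-density variant of the disjunctive greedy search can be sketched as follows; as before, this is an illustrative re-scanning version over our own toy representation, while the paper keeps the marginal gains m_k − αn_k in a balanced search tree.

```python
def greedy_or_alpha(lab, alpha):
    """Greedy search for a disjunctive-induced subgraph with high alpha-density
    g = |E| - alpha * |V|: at each step add the label k maximizing the marginal
    gain m_k - alpha * n_k, and return the best label set observed."""
    all_labels = {l for ls in lab.values() for l in ls}
    A, cur_edges, cur_nodes = set(), set(), set()
    best, best_g = set(), float("-inf")
    while all_labels - A:
        def gain(k):
            new_edges = {e for e, ls in lab.items() if k in ls} - cur_edges
            new_nodes = {v for e in new_edges for v in e} - cur_nodes
            return len(new_edges) - alpha * len(new_nodes)
        k = max(all_labels - A, key=lambda k: (gain(k), k))
        A.add(k)
        cur_edges |= {e for e, ls in lab.items() if k in ls}
        cur_nodes = {v for e in cur_edges for v in e}
        g = len(cur_edges) - alpha * len(cur_nodes)
        if g > best_g:
            best, best_g = set(A), g
    return best, best_g

# Same toy graph as before: the disjunction of l1 and l2 gives 7 edges on
# 5 nodes, so with alpha = 1 the best alpha-density reached is 2.
lab = {
    frozenset({1, 2}): {"l1", "l2"},
    frozenset({2, 3}): {"l1"},
    frozenset({3, 4}): {"l1"},
    frozenset({4, 5}): {"l1"},
    frozenset({1, 3}): {"l2"},
    frozenset({1, 4}): {"l2"},
    frozenset({2, 5}): {"l2"},
    frozenset({5, 6}): {"l3"},
}
best_labels, best_g = greedy_or_alpha(lab, 1.0)
```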
As GreedyOr-α does not need a convex hull but uses a balanced search tree instead, its running time becomes the same as in the conjunctive case. We conclude this section by considering the (lack of the) hierarchy property of α-density. Tatti [23] showed that the subgraphs (without label constraints) optimizing g(·; α) form a nested structure, that is, if we write H_α for the optimal solution, then H_β ⊆ H_α for any β > α. Such a decomposition may be useful as it partitions the nodes into increasingly dense regions. Unfortunately, this is not the case for us, as shown in Figure 2.

Fig. 2 Subgraphs with optimal α-density are not nested. Left figure: ℓ_1 is optimal for α = 3/4 and ℓ_2 is optimal for α = 1/4 when using f_AND. Right figure: ℓ_1 is optimal for α = 2.25 and ℓ_2 is optimal for α = 1.75 when using f_OR.
Interestingly enough, if we allow more flexible queries, we can show that we do obtain a nested structure. More formally, given a Boolean formula B over the labels, we define G(B) to be the subgraph consisting of the edges whose labels satisfy B, together with the incident vertices. The optimization problem is then to find the Boolean formula B maximizing g(G(B); α); one can show that the optimal solutions of this more flexible problem are nested.

Related work
A work closely related to our method is the approach proposed by Galbrun et al. [12], where the authors search for multiple dense subgraphs that can be explained by a conjunction of (or a majority of) the node labels. The authors propose a greedy algorithm for finding such subgraphs. Interestingly enough, the authors do not show that the underlying problem is NP-hard (although we conjecture that this is indeed the case); instead, they show that the subproblem arising from the greedy approach is NP-hard. Another closely related work is the approach proposed by Pool et al. [20], where the authors search for dense subgraphs that can be explained by queries on the nodes. The quality of a subgraph is a ratio S/C, where S measures the goodness of the subgraph using the edges within the subgraph as well as the cross-edges, and C measures the complexity of the query.
The major difference between our work and the aforementioned works is that our method uses labels on the edges. While conceptually a small difference, this distinction leads to different algorithms and different analyses of those algorithms. Moreover, we cannot directly apply the previously discussed methods to networks that only have labels on edges.
An appealing property of finding subgraphs that maximize |E(W)|/|W|, or equivalently the average degree, is that we can find the optimal solution in polynomial time [13]. Furthermore, we can 2-approximate the optimum with a simple linear-time algorithm [7]. The algorithm iteratively removes the node with the smallest degree and then selects the best available graph. This algorithm is essentially the same as the algorithm used to discover k-cores, subgraphs that have a minimum degree of at least k. The connection between k-cores and dense subgraphs is further explored by Tatti [23], where the dense subgraphs are extended to create an increasingly dense structure. A variant of the quality measure was proposed by Tsourakakis [24], where the quality of the subgraph is the ratio of triangles over vertices. In another variant, by Bonchi et al. [4], the edges are replaced with paths of length at most k. Finding such structures in labeled graphs poses an interesting line of future work.
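The peeling 2-approximation of [7] is simple enough to sketch. For clarity, the minimum-degree node below is found by a linear scan rather than the bucket structure that yields the linear running time; edges are plain tuples, an assumption of this sketch.

```python
def peel_densest(edges):
    """2-approximation for the densest subgraph [7]: repeatedly delete a
    minimum-degree node and return the best intermediate subgraph under
    d = |E| / |V|."""
    nodes = {v for e in edges for v in e}
    cur = set(edges)
    best_density, best_nodes = 0.0, set(nodes)
    while nodes:
        d = len(cur) / len(nodes)
        if d > best_density:
            best_density, best_nodes = d, set(nodes)
        deg = {v: 0 for v in nodes}
        for u, w in cur:
            deg[u] += 1
            deg[w] += 1
        v = min(nodes, key=lambda x: (deg[x], x))   # min degree, ties by id
        nodes.remove(v)
        cur = {e for e in cur if v not in e}
    return best_nodes, best_density

# A 4-clique on {1, 2, 3, 4} plus a pendant edge (4, 5); peeling the pendant
# node exposes the clique, whose density 6/4 = 1.5 beats the full graph's 1.4.
best_nodes, best_d = peel_densest({(1, 2), (1, 3), (1, 4),
                                   (2, 3), (2, 4), (3, 4), (4, 5)})
```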
While finding dense subgraphs is polynomial, finding cliques is an NP-hard problem with a very strong inapproximability bound [14]. Finding cliques may be impractical as they do not allow any absent edges. To relax the requirement, Abello et al. [1] and Uno [25] proposed searching for quasi-cliques, that is, subgraphs with a high proportion of edges. Another relaxation of cliques is the k-plex, where k absent edges are allowed per vertex [21]. Finding k-plexes remains an NP-hard problem [3]. Alternatively, we can relax the definition by considering n-cliques, where vertices must be connected with an n-path [6], or n-clans, where we also require that the diameter of the subgraph is n [18]. Since a 1-clique (and a 1-clan) is a clique, these problems remain computationally intractable.

Experimental evaluation
In this section, we describe our experimental evaluation of the GreedyAnd and GreedyOr algorithms. First, we observe how the algorithms behave on synthetic data with increasing randomness. Then we apply the algorithms to real-world datasets and analyze the results.
We implemented our algorithms in Python, and the source code is available online1. Since the number of labels in our experiments was not exceedingly large, we did not use the convex-hull speedup when implementing the search for disjunctive-induced graphs. Instead, we search for the optimal label from scratch, leading to a running time of O(p|L|).
Experiments with synthetic data: We evaluate the greedy algorithms on synthetic graphs of 200 vertices and 50 labels. We select 5 of the labels as target labels and construct graphs for the conjunctive and disjunctive cases such that selecting the subgraph induced by these 5 labels gives the best density. We then add random noise to the network through a noise parameter ϵ, which controls the probability of randomly adding and removing edges as well as adding new labels to the edges. For the conjunctive case, we create five disjoint cliques of 10 vertices such that all edges in the kth clique have all except the kth of the target labels. Finally, we add one more 20-vertex clique that has all of the target labels. Since each of the smaller cliques is missing one of the target labels, selecting the conjunction of all target labels yields the densest subgraph, namely the clique of 20 vertices.
Given the noise parameter ϵ, we then add noise by removing each of the clique edges with probability ϵ, as well as adding an edge between any other pair of vertices with probability ϵ. Finally, for each of the clique edges, we add each of the other labels with probability ϵ, except that we never add the remaining target labels to the clique edges.
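The noise-free base graph of the conjunctive experiment can be sketched as follows; this is our own illustrative reconstruction of the construction described above (the target label names and node numbering are assumptions), omitting the ϵ-noise step.

```python
from itertools import combinations

def conjunctive_planted():
    """Five disjoint 10-cliques, the k-th carrying all target labels except
    the k-th, plus one 20-clique carrying all five target labels."""
    targets = ["t0", "t1", "t2", "t3", "t4"]
    lab = {}
    for k in range(5):
        clique = range(10 * k, 10 * k + 10)
        labels = set(targets) - {targets[k]}
        for u, v in combinations(clique, 2):
            lab[frozenset({u, v})] = set(labels)
    for u, v in combinations(range(50, 70), 2):   # the planted 20-clique
        lab[frozenset({u, v})] = set(targets)
    return lab, targets

lab, targets = conjunctive_planted()
# The conjunction of all five target labels induces exactly the 20-clique,
# since each smaller clique misses one target label.
edges = {e for e, ls in lab.items() if set(targets) <= ls}
nodes = {v for e in edges for v in e}
```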
For the disjunctive case, we create one clique with 40 vertices. The edges in the clique are split into five sets, such that each set of edges gets one of the target labels. Now, selecting the disjunction of the five target labels induces the clique as the subgraph and results in the highest density.
We then add noise by removing edges from the clique and adding new edges between any other pair of vertices with probability ϵ. In addition, each edge gains each of the other labels with probability ϵ.
We repeat the experiments with increasing values of ϵ and compare the density of the subgraph induced by the target labels to the density of the subgraph induced by the labels chosen by the greedy algorithms. For each ϵ, we run the experiment 10 times and compute the mean and standard deviation of the runs. The results are shown in Figure 3.
In both cases, the greedy algorithms correctly find the target labels for small values of ϵ. After ϵ > 0.25 for GreedyAnd and ϵ > 0.35 for GreedyOr, the algorithms start to find other sets of labels, which yield higher densities than the target labels, as many of the edges in the target clique have been removed and other edges have been added. However, at ϵ = 0.30, GreedyOr returns a suboptimal solution that yields a slightly lower density than the target labels.

We confirm the theoretical running times of the algorithms by setting ϵ = 0.2 and performing experiments with increasingly large graphs, where the total number of vertices grows from 10000 up to 100000 while other aspects of the experiments remain constant. Similarly, we test how the running times of the algorithms scale as the total number of labels in our synthetic graph increases from 1000 to 10000. The results for GreedyAnd are shown in Figure 4 and the results for GreedyOr in Figure 5.
As expected, the running times of both algorithms scale linearly with the number of vertices in the graph. Furthermore, the running time of our naive implementation of GreedyOr appears to scale quadratically with the number of labels, while the scaling for GreedyAnd is close to linear. These results confirm our theoretical analysis and show that our algorithms can be applied to large graphs in practice.
Experiments with real-world datasets: We test the greedy algorithms by running experiments on four real-world datasets. The first dataset is the Enron Email Dataset2, which consists of publicly available emails from employees of the former Enron Corporation. We collect the emails in the sent-mail folders and construct a graph where edges are added between the sender and the recipients of each email. Each edge has labels consisting of the stemmed words in the email's title, with stop words and words containing numbers removed.
The second dataset consists of high-energy physics theory publications (HEP-TH) from the years 1992 to 2003. The data was originally released in the KDD Cup3, but we use a preprocessed version of the data available on GitHub4. We create the network by adding authors as vertices, and an edge between two authors is added if they share at least two publications. The edges between authors are then given labels consisting of the stemmed words in the titles of the articles shared by the two authors. We exclude stop words and words containing numbers from the titles in the same way as for the Enron dataset.
The third dataset consists of publications from the DBLP5 dataset [22]. From this dataset, we chose publications from the ECMLPKDD, ICDM, KDD, NIPS, SDM, and WWW conferences. The network is constructed in the same way as for the HEP-TH data, with authors as vertices, two or more shared publications as edges, and stemmed and filtered title words as labels.
The fourth and final dataset consists of the latest 10000 tweets collected from the Twitter API6 with the hashtag #metoo by the 27th of May, 23:59 UTC. We create the network with users as vertices and an edge between a pair of users if one of them has retweeted or responded to one of the other's tweets. The labels on an edge are then the hashtags in the retweets or response tweets between the two users.
We construct the networks by filtering out labels that appear in less than 0.1% of the edges for the Enron and Twitter datasets, or labels that occur in less than 0.5% of the papers for the HEP-TH and DBLP datasets. The sizes, label counts, and densities of the resulting graphs are shown in Table 1.
We run the greedy algorithms on each of these graphs and compare the results against the densest subgraph ignoring the labels (Dense). We report the statistics for the label-induced subgraphs and the densest subgraphs in Table 2.
For each of the datasets, both algorithms find label-induced subgraphs with higher densities than the original graphs. In most cases, the restriction to label-induced subgraphs results in clearly lower densities compared to the densest label-ignorant subgraphs. Interestingly, for the DBLP dataset GreedyAnd finds a label-induced subgraph with a very high density, close to the density of the densest subgraph ignoring the labels. The running times are practical: the algorithm processes networks with 100 000 edge-label pairs in seconds.
For the Enron and HEP-TH datasets, GreedyOr returns large sets of labels resulting in large subgraphs, whereas the GreedyAnd algorithm selects only a few labels with smaller induced subgraphs in each case. For the Twitter dataset, both greedy algorithms select only one label, which induces a small subgraph with a notably higher density than the original graph.
Experiments with α-density: Next we consider finding α-dense subgraphs by running the GreedyAnd-α and GreedyOr-α algorithms on the same datasets. The results are shown in Tables 3 and 4, respectively.
As pointed out by Theorem 6, the optimal α-dense subgraph is also the densest subgraph for a suitable, sufficiently large α. We use binary search to find the maximum α for which the greedy algorithm yields a non-empty graph. The values of α in these tables are those chosen by the binary search process while searching for this maximum. Additionally, we experiment with a smaller α value of 0.25 times the maximum. For clarity, we exclude duplicated results where different values of α yield the same subgraph.
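The binary search for the largest feasible α can be sketched generically as follows; `greedy` stands for either greedy variant and is replaced by a stub here, and the search bounds are illustrative assumptions.

```python
def max_alpha(greedy, lo=0.0, hi=1024.0, iters=50):
    """Binary search for (approximately) the largest alpha for which the
    greedy algorithm still returns a non-empty subgraph; greedy(alpha)
    returns the selected edge set."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if greedy(mid):
            lo = mid      # non-empty result: try a larger alpha
        else:
            hi = mid      # empty result: shrink alpha
    return lo

# A stub standing in for GreedyAnd-alpha / GreedyOr-alpha: it returns a
# non-empty edge set exactly when alpha <= 2.5.
alpha_star = max_alpha(lambda a: {("e",)} if a <= 2.5 else set())
```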
We see that the greedy algorithms for the two problems often find the same solution, as suggested by Theorem 6. However, this is not always the case due to the heuristic nature of these algorithms. Interestingly, with α = 2.5 for the HEP-TH dataset, GreedyAnd-α finds a denser subgraph than the one found by GreedyAnd, while an additional manual experiment with α = 1.4 results in the greedy algorithm suboptimally returning an empty graph. For the DBLP dataset, using α = 3.6 leads to the same solution as GreedyAnd, but larger values of α lead the greedy algorithm to choose a suboptimal first label, resulting in less dense subgraphs. For the Enron and HEP-TH datasets, GreedyOr-α finds subgraphs with only a slightly lower density than the ones found by GreedyOr.
In general, we observe that smaller values of α yield subgraphs with more vertices and edges in both the disjunctive and conjunctive cases. Thus, having α as a parameter gives us more control over the size of the resulting subgraph and allows us to look for both smaller and larger groups of densely connected nodes.
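The size effect can be seen directly from the objective |E| − α|V|: a minimal sketch with illustrative numbers (the subgraph sizes below are made up for the example, not taken from the experiments).

```python
def alpha_density(m, n, alpha):
    # alpha-density of a subgraph with m edges and n vertices.
    return m - alpha * n

# A large sparse subgraph (30 edges, 20 nodes) versus a small dense
# one (a 5-clique: 10 edges, 5 nodes):
low = (alpha_density(30, 20, 0.5), alpha_density(10, 5, 0.5))     # large wins
high = (alpha_density(30, 20, 1.75), alpha_density(10, 5, 1.75))  # small wins
```

At α = 0.5 the large subgraph scores 20.0 against 7.5, while at α = 1.75 the vertex penalty flips the ordering (−5.0 against 1.25), matching the observation that larger α favors smaller, denser groups.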
Case study: We analyze the label-induced dense subgraphs for the Twitter and DBLP datasets by repeatedly running the GreedyAnd algorithm on these graphs. After each run, we remove the edges of the output edge-induced subgraph and run the algorithm again on the remaining graph. The first 8 resulting sets of labels, together with the densities and sizes of the induced subgraphs, are shown in Table 5.
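This peeling procedure can be sketched with the greedy run as a black box. The names `peel` and `toy_greedy` and the triple-based edge encoding are assumptions for the sketch, not the paper's implementation; the toy greedy simply returns the most frequent label and its edges.

```python
from collections import Counter

def peel(edges, greedy, rounds=8):
    # Repeatedly run the greedy on the remaining edges and strip the
    # edges of each reported subgraph, mimicking the case-study setup.
    # `greedy(edges)` is assumed to return (label_set, subgraph_edges).
    found, remaining = [], set(edges)
    for _ in range(rounds):
        labels, sub_edges = greedy(remaining)
        if not sub_edges:
            break
        found.append(labels)
        remaining -= set(sub_edges)
    return found

def toy_greedy(edges):
    # Toy stand-in: pick the single most common label and its edges.
    counts = Counter(lab for (_, _, lab) in edges)
    if not counts:
        return set(), set()
    best, _ = counts.most_common(1)[0]
    return {best}, {e for e in edges if e[2] == best}

edges = {(1, 2, "a"), (2, 3, "a"), (1, 4, "a"),
         (1, 3, "b"), (3, 4, "b"), (4, 5, "c")}
rounds = peel(edges, toy_greedy)   # [{"a"}, {"b"}, {"c"}]
```

Because the edges of earlier answers are removed, later rounds cannot rediscover the same subgraph, but, as Table 5 notes, the reported densities need not decrease monotonically.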
For the DBLP graph, the algorithm finds a group of 25 authors who have each written at least two papers together on a shared topic, as well as other relatively large groups of authors whose edges form almost perfect cliques. The labels, representing stemmed words, can be used to interpret the publication topics of these tightly collaborating groups of authors.
For the Twitter data of #metoo tweets, the densest label-induced subgraphs are mostly formed by individual hashtags. This detects groups of people tweeting about #MeTooASE, referring to the French Me Too movement for foster children, as well as groups closely discussing other topics in the context of the Me Too movement, such as live streaming or the recent trial between Johnny Depp and Amber Heard.
We see that the same labels also appear when searching for α-dense subgraphs. For example, comparing the labels for α = 1.548 for the Twitter dataset in Table 4 with the labels in Table 5, we can see that the subgraph found by the GreedyOr-α algorithm in fact consists of multiple smaller groups of people discussing a variety of topics that we discovered previously.

Concluding remarks
In this paper, we considered the problem of finding dense subgraphs induced by labels on the edges. More specifically, we considered two cases: conjunctive-induced dense subgraphs, where the edges need to contain the given label set, and disjunctive-induced dense subgraphs, where the edges need to have at least one label in common with the given set. As a measure of quality, we used the average degree of a subgraph. We showed that both problems are NP-hard, and we proposed greedy heuristics to find dense induced subgraphs. By maintaining suitable counters, we were able to find subgraphs in quasi-linear time: O(p log |L|) for conjunctive-induced graphs and O(p log² |L|) for disjunctive-induced graphs. In addition, we analyzed the related problem of maximizing the number of edges minus α times the number of vertices and showed how the optimal solutions to these problems are connected. We proved that maximizing this α-density is NP-hard and inapproximable unless P = NP. We adapted the greedy algorithms to the conjunctive and disjunctive cases of this problem, resulting in a running time of O(p log |L|) for the disjunctive case as well. We then demonstrated that the algorithms are practical: they can find the ground truth in synthetic datasets and find interpretable results from real-world networks.

While this paper focused on the conjunctive and disjunctive cases, future work could explore other ways to induce graphs from a label set and design efficient algorithms for such tasks. Another direction for future work is to relax the requirement that every edge/node must be induced from the labels. Instead, we could allow some deviation from this requirement but penalize the deviations appropriately when assessing the quality of the subgraph.
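The two induction modes summarized above can be stated in a few lines. The function name `induce` and the dict-of-label-sets edge encoding are assumptions for this sketch; the paper's f_AND / f_OR notation corresponds to the two branches.

```python
def induce(edges, labels, mode):
    # Edge-induced subgraph from a label set A.
    # Conjunctive ("and"): keep edges whose label set contains all of A.
    # Disjunctive ("or"): keep edges sharing at least one label with A.
    if mode == "and":
        keep = [e for e in edges if labels <= edges[e]]
    else:
        keep = [e for e in edges if labels & edges[e]]
    nodes = {v for e in keep for v in e}
    return nodes, keep

# Edges as {(u, v): set_of_labels}:
E = {(1, 2): {"x", "y"}, (2, 3): {"x"}, (3, 4): {"y"}}
n_and, e_and = induce(E, {"x", "y"}, "and")  # only edge (1, 2) has both labels
n_or, e_or = induce(E, {"x", "y"}, "or")     # all three edges have one of them
```

The conjunctive subgraph shrinks as labels are added while the disjunctive one grows, which is why the two greedy searches proceed in opposite directions over the label set.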

Declarations
overlapping C_i and C_j, we introduce 4N additional vertices and 2N edges, each edge connecting two unique nodes and labeled with L \ {i, j}.
We claim that, for |X| ≥ 5, 3ExactCover has a solution if and only if there is an induced graph H with d(H) ≥ |X|/(|X| + 3).
Assume that we are given a set of labels A ⊂ L, and let B = L \ A. Let k be the number of overlapping set pairs in B, that is, k = |{{i, j} ⊆ B : C_i ∩ C_j ≠ ∅}|; the density of the corresponding graph H = G(f_AND, A) can then be expressed in terms of |B| and k. If k > 0, then since |B| ≤ N we can bound the density from above. If k = 0, then the density is equal to |B|/(|B| + 1). Let U = {C_i | i ∈ B}. Since U is disjoint, 3|B| ≤ |X|, and the equality holds if and only if U covers X.
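The step from the k = 0 density to the threshold in the claim uses only the monotonicity of t ↦ t/(t + 1); spelled out:

```latex
\frac{|B|}{|B|+1} \;\le\; \frac{|X|/3}{|X|/3+1} \;=\; \frac{|X|}{|X|+3},
```

since 3|B| ≤ |X| and t/(t + 1) is increasing in t, with equality precisely when 3|B| = |X|, that is, when U covers X.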
Assume that there is a subgraph H = G(f_AND, A) with d(H) ≥ |X|/(|X| + 3). Since we assume that |X| ≥ 5, we have d(H) ≥ 5/8 > 3/5, and the preceding discussion shows that the sets corresponding to A form an exact cover of X.
On the other hand, if there is an exact cover in C, then d(G(f_AND, A)) = |X|/(|X| + 3), where A is the set of labels corresponding to the cover. This shows that maximizing the density of the label-induced subgraph is an NP-hard problem. □

Proof of Theorem 3 We will prove the claim by reducing 3ExactCover to the densest subgraph problem. In 3ExactCover, we are given a set X and a family C of subsets of size 3 over X, and we are asked whether there is a disjoint subfamily of C whose union is X.
Assume that we are given a set X and a family C = {C_1, . . ., C_N} of N subsets. The vertex set V consists of the set X, N additional vertices y_1, . . ., y_N, and 2 more vertices Z = {z_1, z_2}. We have N labels, L = {1, . . ., N}.
Next, we define the edges E. We connect each x ∈ X to both vertices of Z and label these edges with {i | x ∈ C_i}. Then, for each C_i, we connect z_1 to y_i with an edge labeled i.
We claim that 3ExactCover has a solution if and only if the optimal label-induced graph has density 7|X|/(4|X| + 6).
Given a non-empty set of labels A ⊆ L, the density of the corresponding graph H is equal to g(k, |A|), where g(s, t) = (2s + t)/(2 + s + t) and k is the size of the union of the sets in C corresponding to A.
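To sanity-check the target value in the claim, one can evaluate g at the exact-cover point k = |X|, |A| = |X|/3 with exact rational arithmetic (a verification sketch, not part of the proof):

```python
from fractions import Fraction

def g(s, t):
    # Density expression from the reduction: g(s, t) = (2s + t) / (2 + s + t).
    return Fraction(2 * s + t) / (2 + s + t)

# At an exact cover, k = |X| and |A| = |X|/3, giving 7|X| / (4|X| + 6):
for x in (6, 9, 12, 30):
    assert g(x, Fraction(x, 3)) == Fraction(7 * x, 4 * x + 6)
```

Algebraically, g(|X|, |X|/3) = (7|X|/3) / ((4|X| + 6)/3) = 7|X|/(4|X| + 6), matching the claimed optimum.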
Since each set in C is of size 3, we have |A| ≥ k/3, and thus d(H) = g(k, |A|) ≤ g(|X|, |X|/3) = 7|X|/(4|X| + 6). Moreover, there are N + 2 nodes in the graph, so the difference between two distinct densities is at least (N + 2)^{-2}. Consequently, if we set τ = 7|X|/(4|X| + 6) − 0.5(N + 2)^{-2}, then Theorem 6 implies that 3ExactCover has a solution if and only if there is an H with g(H, τ) > 0. This proves the hardness, and the inapproximability follows since any algorithm with a multiplicative guarantee would find the optimal solution. The proof for f_AND is similar. □

Algorithm listing fragments (recovered from the greedy selection loop):
pick and remove the label k with the maximum density in T; A_{i+1} ← A_i ∪ {k};
for each edge e = (u, v) with label k:
    for each label ℓ of e: m_ℓ ← m_ℓ − 1;
    m ← m + 1; remove edge e;
    update T for all labels ℓ with changed values of m_ℓ or n_ℓ;
i ← i + 1;

Theorem 7
GreedyAnd-α and GreedyOr-α run in O(p log |L| + |V| + |E|) time, where p is the number of edge-label pairs, p = |{(e, k) | e ∈ E, k ∈ lab(e)}|.

Proof The proofs for both cases are virtually the same as the proof of Theorem 2. □

Fig. 3 Density of the subgraph induced by the target labels and of the subgraph induced by the labels chosen by the greedy algorithms, as a function of the noise ϵ in the network. The line shows the mean density over 10 runs for each ϵ, and the vertical error bars show the standard deviation. The results for the GreedyAnd algorithm are on the left and for GreedyOr on the right.

Fig. 4 Running time of the GreedyAnd algorithm as a function of the number of vertices (left) and the number of labels (right) in our synthetic graphs.

Theorem 2
GreedyAnd runs in O(p log |L| + |V| + |E|) time, where p is the number of edge-label pairs, p = |{(e, k) | e ∈ E, k ∈ lab(e)}|.

Proof Initializing the counters in GreedyAnd can be done in O(|V| + |E| + |L|) time, while initializing the tree can be done in O(|L| log |L|) time. Each edge-label pair triggers at most a constant number of counter updates, each followed by an O(log |L|) update of the tree. In total, the running time is in O(|V| + |E| + |L| + |L| log |L| + p log |L|) ⊆ O(|V| + |E| + p log |L|). □

Theorem 5 GreedyOr runs in O(p log² |L| + |V| + |E|) time, where p is the number of edge-label pairs, p = |{(e, k) | e ∈ E, k ∈ lab(e)}|.

Proof The proof is similar to the proof of Theorem 2, except that the search tree is replaced with the convex hull structure of Overmars and Van Leeuwen [19]. The inner for-loops are evaluated at most O(p) times, since an edge or a node is visited only once and Σ_v |S_v| ∈ O(p). Maintaining the hull requires O(log² |L|) time per update, and there are at most O(p) such updates. Searching for an optimal label requires O(log |L|) time, and there are at most |L| such searches. □
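The counter structure behind these bounds can be sketched with a lazy max-heap in place of the balanced search tree (an assumption for the sketch; the paper uses a tree, and a convex-hull structure for the disjunctive case). Stale heap entries are skipped on pop, giving the same amortized logarithmic cost per counter update.

```python
import heapq
from itertools import count

class LabelDensityIndex:
    # Lazy max-heap over per-label densities m_l / n_l. Each update
    # pushes a fresh entry; outdated entries are discarded on pop.
    def __init__(self):
        self.latest = {}   # label -> id of its most recent heap entry
        self.heap = []     # (-density, entry_id, label)
        self.ids = count()

    def update(self, label, m, n):
        # Record new counters for `label`; its old heap entries go stale.
        eid = next(self.ids)
        self.latest[label] = eid
        heapq.heappush(self.heap, (-m / n, eid, label))

    def pop_best(self):
        # Return the label with the maximum current density, or None.
        while self.heap:
            neg_d, eid, label = heapq.heappop(self.heap)
            if self.latest.get(label) == eid:   # entry still current?
                del self.latest[label]          # label leaves the index
                return label, -neg_d
        return None

idx = LabelDensityIndex()
idx.update("a", 4, 2)   # density 2.0
idx.update("b", 6, 2)   # density 3.0
idx.update("b", 2, 2)   # b revised down to 1.0; old entry is now stale
best = idx.pop_best()   # ("a", 2.0) -- the stale 3.0 entry is skipped
```

Each update costs O(log of the heap size), and every popped entry was pushed exactly once, so the total cost stays within the O(p log |L|)-style budget of the theorems.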

Table 1
Basic characteristics of the networks: number of vertices |V|, number of edges |E|, number of labels |L|, number of edge-label pairs p, and the density d(G) = |E|/|V|.

Table 2
Statistics for the resulting subgraphs for the greedy algorithms and the label-ignorant densest subgraph algorithm. For the label-induced subgraphs, we have the number of vertices n, the number of edges m, the size of the best set of labels |A|, the density d, and the running time t in seconds. For the densest subgraph, we show the number of vertices n and the density d = m/n.

Table 3
Results for running GreedyAnd-α on the four datasets with different values of α. For each resulting subgraph, we have the density d = m/n, the chosen labels, the number of nodes n, and the number of edges m. Results matching the densest subgraph found by GreedyAnd are shown in bold.

Table 4
Results for running GreedyOr-α on the four datasets with different values of α. For each resulting subgraph, we have the density d = m/n, the chosen labels or their number, the number of nodes n, and the number of edges m. Results matching the densest subgraph found by GreedyOr are shown in bold.

Table 5
Label sets with corresponding subgraph densities and sizes, selected by running the GreedyAnd algorithm repeatedly on the graphs for the DBLP and Twitter datasets. The labels are stemmed words from publication titles for DBLP, and tweet hashtags for Twitter. The densities are not monotonically decreasing, as the greedy algorithm does not always find the optimal solution.
Since each set in C is of size 3, we have |A| ≥ k/3. Thus, d(H) = g(k, |A|) ≤ g(|X|, |X|/3) = 7|X|/(4|X| + 6), where the equalities hold if and only if k = |X| and 3|A| = k, that is, A corresponds to an exact cover of X. □

Proof of Corollary 1 Let us adopt the notation of the proof of Theorem 3. The proof shows that 3ExactCover has a solution if and only if there is an induced graph H with d(H) ≥ 7|X|/(4|X| + 6). □