1 Introduction

Real-world networks are highly dynamic in nature, with new relations (edges) being continuously established among entities (nodes) and old relations being broken. Analyzing the temporal dimension of networks can provide valuable insights about their structure and function; for instance, it can reveal temporal patterns, concept drift, periodicity, temporal events, etc. In this paper, we focus on the problem of finding dense subgraphs, a fundamental graph-mining primitive. Applications include community detection in social networks [16, 18, 48], gene expression and drug interaction analysis in bioinformatics [22, 45], graph compression and summarization [21, 30, 32], spam and security threat detection [13, 26], and more.

When working with temporal networks, one has first to define how to deal with the temporal dimension, i.e., how to identify the temporal intervals in which the dense structures should be sought. Instead of defining those intervals a priori, in this paper we study the problem of automatically identifying the intervals that provide the most interesting structures. We consider a subgraph interesting if it exhibits high density. As a result, we are able to discover a sequence of dense subgraphs in the temporal network, capturing the evolution of interesting events that occur during the network lifetime. As a concrete example, consider the problem of story identification in online social media [3, 8]: The main goal is to automatically discover emerging stories by finding dense subgraphs induced by some entities, such as Twitter hashtags, co-occurring in a social media stream.

In our case, we are also interested in finding different stories over the network lifetime. For instance, as one story wanes and another one emerges, one dense subgraph among entities dissipates and another one appears. Thus, by segmenting the timeline of the temporal network into intervals, and identifying dense subgraphs in each interval, we can capture the evolution and progression of the main stories over time. As another example, consider a collaboration network, where a sequence of dense subgraphs in the network can reveal information about the main trends and topics over time, along with the corresponding time intervals.

Challenges and contributions The problem of finding the k densest subgraphs in a static graph has been considered in the literature from different perspectives. One natural idea is to iteratively (and greedily) find and remove the densest subgraphs [49], which unfortunately does not provide any theoretical guarantee. More recent works study the problem of finding k densest subgraphs with limited overlap, and provide theoretical guarantees in some cases of interest [7, 24]. However, these approaches do not generalize to temporal networks.

For temporal networks, to our knowledge, there are only a few papers that consider the task of finding temporally coherent densest subgraphs. The most similar to our work aims at finding a heavy subgraph present in all, or k, snapshots [46]. Another related work focuses on finding a dense subgraph covered by k scattered intervals in a temporal network [44]. Both methods, however, focus on finding a single densest subgraph.

In this paper, instead, we aim at producing a partition of the temporal network that (i) captures dense structures in the network; (ii) exhibits temporal cohesion; and (iii) is amenable to direct inspection and temporal interpretation. To accomplish our objective, we formulate the problem of \(k\)-Densest-Episodes (Sect. 2), which asks to find a partition of the temporal domain into k non-overlapping intervals, such that the intervals span subgraphs with maximum total density. The output is a sequence of dense subgraphs along with corresponding time intervals, capturing the most interesting events during the network lifetime.

For example, consider the simple temporal network shown in Fig. 1. It consists of five nodes \(\{A,B,C,D,E\}\), which interact at seven different time stamps (1, 2, 4, 5, 7, 8, 10). Our goal is to discover time intervals that provide the densest subgraphs. One interesting interval is \(I=[7,10]\), in that four different interactions \(\{(A,C),(C,D),(C,E),(D,E)\}\) occur during I, with three of them forming a prominently dense subgraph: a clique \(\{C,D,E\}\). Thus, the pair (interval, a subgraph covered by the interval) \(([7,10], \{(C,D),(C,E),(D,E)\})\) summarizes an interesting episode in the history of interactions of this toy network. Another interesting interval is [1, 4], as it contains a clique \(\{B,C,D\}\). Thus, our network partition would be \((([1,4], \{(B,C),(B,D),(C,D)\}), ([7,10], \{(C,D),(C,E),(D,E)\}))\). Note that interaction (A, C) is completely ignored as it does not contribute to any dense subgraph.

Fig. 1 An example temporal network with five nodes and seven time stamps. The solid lines depict interactions that occur at a given time stamp, while the dotted lines depict interactions that occur at different time stamps. The highlighted time intervals, nodes, and interactions depict events discovered in the network

A naïve solution to this problem has polynomial but prohibitively high running time. Thus, we adapt existing recent work on dynamic densest subgraph discovery [19] and approximate dynamic programming [47] to design a fast approximation algorithm (Sect. 3).

Next (Sect. 4), we shift our attention to encouraging coverage of a larger set of nodes, so as to produce richer and more interesting structures. The resulting new problem formulation turns out to be NP-hard. However, on static graphs a simple greedy algorithm yields an approximate solution due to the submodularity of the objective function. Following this observation, we extend this greedy approach to the case of temporal networks. Although the approximation guarantee does not carry over to the temporal case, our experimental evaluation indicates that the method produces solutions of very high quality.

Experiments on synthetic and real-world datasets (Sect. 5) and a case study on Twitter data (Sect. 6) confirm that our methods are efficient and produce meaningful and high-quality results.

2 Problem formulation

We are given a temporal graph \(G = (V,\mathcal {T},E)\), where V denotes the set of nodes, \(\mathcal {T}= [0, 1, \ldots , t_\mathrm{max}] \subset \mathbb {N}\) is a discrete time domain, and \(E \subseteq V \times V \times \mathcal {T}\) is the set of all temporal edges. Given a temporal interval \(T=[t_1, t_2]\) with \( t_1, t_2 \in \mathcal {T}\), let \(G[T] = (V[T],E[T])\) be the subgraph induced by the set of temporal edges \(E[T] = \left\{ (u,v) \mid (u,v,t) \in E, t \in T\right\} \) with V[T] being the set of endpoints of edges E[T].

Definition 1

(Episode) Given a temporal graph \(G = (V,\mathcal {T}, E)\), we define an episode as a pair \((I, H)\), where \(I = [t_1, t_2]\) is a temporal interval with \(t_1, t_2 \in \mathcal {T}\) and H is a subgraph of G[I].

Our goal is to find a set of interesting episodes along the lifetime of the temporal graph. In particular, our measure of interestingness is the density of the subgraph in the episodes. We adopt the widely used notion of density of a subgraph \(H = (V(H),E(H))\) as the average degree of the nodes in the subgraph, i.e., \(d(H)=\frac{|E(H)|}{|V(H)|}\). While several definitions for density have been studied in the literature, the one we focus on enjoys the following nice properties: It can be optimized exactly [27] and approximated efficiently [15], while a densest subgraph can be computed in real-world graphs containing up to tens of billions of edges [17].
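To make this notion concrete, the following minimal Python sketch computes \(d(H)\) for a subgraph given as an edge list; the two subgraphs are taken from the toy example of Fig. 1 purely for illustration.

def density(edges):
    """Average-degree density d(H) = |E(H)| / |V(H)| of the subgraph
    spanned by an edge list."""
    nodes = {u for edge in edges for u in edge}
    return len(edges) / len(nodes) if nodes else 0.0

# Subgraphs from the toy example of Fig. 1:
triangle = [("C", "D"), ("C", "E"), ("D", "E")]   # the clique {C, D, E}
path = [("A", "C"), ("C", "D")]
print(density(triangle))  # 1.0
print(density(path))      # 0.666...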

Problem 1

(k-Densest-Episodes) Given a temporal graph \(G = (V,\mathcal {T}, E)\) and an integer \(k \in \mathbb {N}\), find a set of k episodes \(S=\left\{ (I_\ell , H_\ell )\right\} \), for \(\ell = 1, \ldots , k\) such that \(\{I_\ell \}\) are disjoint intervals and the profit \(\sum _{\ell = 1}^k d(H_\ell )\) is maximized.

We can solve Problem 1 in polynomial time. To see this, let \(S^*\) be an optimum solution for Problem 1 and let \(\mathcal {I}(S^*)=\{I_1, \ldots , I_k\}\) and \(\mathcal {G}(S^*)=\{H_1, \ldots , H_k\}\). Observe that without loss of generality, we can assume that the union of the intervals in \(\mathcal {I}(S^*)\) is equal to the set of time stamps \(\mathcal {T}\), that is, \(\mathcal {I}(S^*)\) is a k-segmentation of \(\mathcal {T}\). This follows from the fact that by increasing the length of the \(I_\ell \)’s, the density of the corresponding densest subgraphs cannot decrease.

Given an interval \(I_\ell \), a densest subgraph in \(G[I_\ell ]\) can be found by running any algorithm for computing a densest subgraph: in \(\mathcal {O}(nm \log n)\) time by the easy-to-implement algorithm of Goldberg et al. [27, 43] or in \(\mathcal {O}(nm\log (n^2/m))\) time by the more involved algorithm of Gallo et al. [25], where n and m denote the number of nodes and edges in \(G[I_\ell ]\), respectively. An optimal segmentation can be computed by a standard dynamic-programming approach, requiring \(\mathcal {O}(k|\mathcal {T}|^2)\) steps [9]. By combining the subroutine for computing an optimal segmentation with either subroutine for computing a densest subgraph for each candidate interval, one can find a solution to Problem 1 in \(\mathcal {O}(k|\mathcal {T}|^2nm \log n)\) or \(\mathcal {O}(k|\mathcal {T}|^2nm\log (n^2/m))\) time, respectively.
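For illustration, the following Python sketch outlines this naive solution; densest_density is a stand-in oracle for any exact densest-subgraph routine (e.g., Goldberg's algorithm) applied to \(G[[t_1,t_2]]\), and the function names are ours, not part of the original implementation.

def k_densest_episodes(r, k, densest_density):
    """Exact dynamic program for Problem 1: O(k * r^2) oracle calls.
    r                -- number of time stamps, enumerated 1..r
    densest_density  -- oracle: densest-subgraph density of G[[t1, t2]]
    Returns the optimal profit and the k interval boundaries."""
    NEG = float("-inf")
    o = [[NEG] * (r + 1) for _ in range(k + 1)]
    back = [[0] * (r + 1) for _ in range(k + 1)]
    o[0][0] = 0.0
    for l in range(1, k + 1):
        for i in range(l, r + 1):
            for j in range(l - 1, i):
                cand = o[l - 1][j] + densest_density(j + 1, i)
                if cand > o[l][i]:
                    o[l][i] = cand
                    back[l][i] = j
    intervals, i = [], r                      # recover the segmentation
    for l in range(k, 0, -1):
        j = back[l][i]
        intervals.append((j + 1, i))
        i = j
    return o[k][r], list(reversed(intervals))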

As a post-processing step, we can trim the intervals in an optimal solution \(S^* = \left\{ (I_\ell , H_\ell )\right\} \) by calculating the minimum subinterval of \(I_\ell \), which spans all edges of \(H_\ell \), for each \(\ell = 1, \ldots , k\).

3 Approximate dynamic programming

The simple algorithm discussed in the previous section has a running time that is prohibitively expensive for large graphs. In this section, we develop a fast algorithm with approximation guarantees.

The derivations below closely follow the ones in [47], which improves [29]. However, we cannot use those results directly: Both papers work with minimization problems and use the fact that the profit of an interval is not less than the profit of its subintervals. In contrast, our problem is a maximization problem and requires a tailored solution.

Given a time interval \(T = [t_1, t_2]\), we write \(d^*(T)\) to denote the density of the densest subgraph in T, that is, \(d^*(T)=\max _{H\subseteq G[T]} d(H)\). For simplicity, we define \(d^*([t_1, t_2]) = 0\) if \(t_2<t_1\). Problem 1 is now a classic k-segmentation problem of \(\mathcal {T}\), maximizing the total sum of scores \(d^*(T)\) for the individual time intervals. For notational simplicity, we assume that all time stamps \(\mathcal {T}\) are enumerated by integers from 1 to r.

Let \(o[i, \ell ]\) be the profit of the optimal \(\ell \)-segmentation using only the first i time stamps. Then,

$$\begin{aligned} o[i, \ell ] = \max _{j < i} \left\{ o[j, \ell - 1] + d^*([j + 1, i]) \right\} , \end{aligned}$$
(1)

and \(o[i, k]\) can be computed recursively.

Our goal is to approximate \(o[i, \ell ]\) quickly with a score which we will denote by \(s[i, \ell ]\). The main idea behind the speedup is not to test all possible values of j in Eq. 1. Instead, we are going to keep a small set of candidates, denoted by A, and only use those values for testing.

The challenge is how to keep A small enough, while at the same time guaranteeing the approximation ratio. The pseudo-code achieving this balance is given in Algorithm 1, while the subroutine that keeps the candidate list short is given in Algorithm 2. Algorithm 1 executes a standard dynamic-programming search: It assumes that the partitions of the first \(i'<i\) data points into \(\ell -1\) intervals have already been calculated, and finds the best last interval [a, i] for partitioning the first i points into \(\ell \) intervals. However, it does not consider all possible candidate intervals [a, i], but only a sparsified list, which is guaranteed to preserve the approximation quality. The sparsified list is built for a fixed number of intervals \(\ell \), starting from an empty list. Intuitively, it keeps only candidates \(A=\left\{ a_j\right\} \) with significant differences in \(s[a_j,\ell -1]\). The significance of the difference depends on the current best profit \(s[i,\ell ]\): The larger the value of the solution found, the less cautious we need to be about discarded candidates, and the coarser A becomes. Thus, we refine A with Algorithm 2 after each processed i.

Algorithm 1: ApproxDP (pseudocode figure)
Algorithm 2: SPRS (pseudocode figure)
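Since Algorithms 1 and 2 are given only as pseudocode figures, the following Python sketch is our reconstruction of ApproxDP and SPRS from the textual description above; the exact coarsening step \(\delta \) and the sparsification rule are assumptions guided by Eq. 2 and Lemma 1, not the authors' verbatim pseudocode.

def sprs(A, s_prev, delta):
    """Sparsify the candidate list: keep a candidate only when its
    s[a-1, l-1] value exceeds the previously kept one by more than delta,
    and never drop the last element (cf. Lemma 1). Assumed reconstruction."""
    if len(A) <= 2:
        return list(A)
    kept, last_val = [A[0]], s_prev[A[0] - 1]
    for a in A[1:-1]:
        if s_prev[a - 1] - last_val > delta:
            kept.append(a)
            last_val = s_prev[a - 1]
    kept.append(A[-1])
    return kept

def approx_dp(r, k, eps, dstar):
    """ApproxDP-style sketch; dstar(a, i) returns (an estimate of) the
    densest-subgraph density of the interval [a, i]."""
    s = [[0.0] * (r + 1) for _ in range(k + 1)]
    for i in range(1, r + 1):                  # first row computed exactly
        s[1][i] = dstar(1, i)
    for l in range(2, k + 1):
        A = []
        for i in range(1, r + 1):
            A.append(i)
            best = s[l][i - 1]                 # keep s[i, l] monotone in i
            for a in A:
                best = max(best, s[l - 1][a - 1] + dstar(a, i))
            s[l][i] = best
            delta = s[l - 1][i] * eps / (k + eps * l)   # assumed coarsening step
            A = sprs(A, s[l - 1], delta)
    return s[k][r]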

We first study the approximation guarantee of ApproxDP, assuming that \(d^*(\cdot )\) is calculated exactly.

Proposition 1

Let \(s[i, \ell ]\) be the profit table constructed by \(\texttt {ApproxDP} (k, \epsilon )\). Then, \(s[i,\ell ](\frac{\ell \epsilon }{k}+1)\ge o[i,\ell ]\).

To prove the proposition, let us first fix \(\ell \) and let \(A_i\) be the set of candidates in A to be tested on line 6 of round i. Let \(\delta _i\) be the value of \(\delta \) in Algorithm 2, called on iteration i. Then, \(\delta _{i-1}\) is the coarsening parameter used to sparsify \(A_i\).

Lemma 1

For every \(b\in [1,i]\), there is \(a_j \in A_i\), such that

$$\begin{aligned} s[a_j-1,\ell -1]+d^*([a_j,i])\ge s[b-1,\ell -1]+d^*([b,i])-\delta _{i-1}. \end{aligned}$$

Proof

We say that a list of numbers \(A = (a_j)\) is i-dense, if

$$\begin{aligned} s[a_{j+1}-1,\ell -1] - s[a_j-1,\ell -1]\le \delta _{i-1} \text { or } a_{j+1} = a_j+1, \end{aligned}$$

for every \(a_j \in A\) with \(j < {\left| A\right| }\). We first prove by induction over i that \(A_i\) is i-dense.

Assume that \(A_{i - 1}\) is \((i - 1)\)-dense. The SPRS procedure never deletes the last element, so \((i-1) \in A_{i - 1}\), and \(A_{i - 1} \cup \left\{ i\right\} \) is \((i - 1)\)-dense. Note that \(\delta _{i-2}\le \delta _{i-1}\), because \(s[i,\ell ]\) is monotone and \(s[i,\ell ]\ge s[i-1,\ell ]\), due to the explicit check in line 6 of procedure ApproxDP. Thus, \(A_{i - 1} \cup \left\{ i\right\} \) is i-dense. Since \(A_i=\texttt {SPRS} (A_{i-1})\cup \{i\}\), and \(\texttt {SPRS} \) does not create gaps larger than \(\delta _{i-1}\), the list \(A_i\) is i-dense.

Let \(a_j\) be the largest element in \(A_i\), such that \(a_j\le b\). Then, either \(a_j\le b<a_{j+1}\) or \(b=a_{|A_i|}\) and \(a_j=a_{|A_i|}\). In the first case, due to monotonicity, we have \(s[a_{j+1}, \ell -1]\ge s[b,\ell -1]\), which gives \(s[b-1, \ell -1]-s[a_j-1, \ell -1]\le \delta _{i-1}\). The second case is trivial.

Due to monotonicity, \(d^*([a_j,i])\ge d^*([b,i])\). This concludes the proof. \(\square \)

We can now complete the proof of Proposition 1.

Proof of Proposition 1

We will prove the result with induction over \(\ell \). The claim holds for \(\ell =1\) and any i as we initialize s[i, 1] by optimal values (on line 1 of Algorithm 1). We assume that the approximation guarantee holds for \(\ell - 1\), that is,

$$\begin{aligned} s[i,\ell -1](1+\frac{\epsilon }{k}(\ell -1))\ge o[i,\ell -1] \end{aligned}$$

and we prove the result for \(\ell \).

Let \(\alpha =(1+\frac{\epsilon }{k}(\ell - 1))\). Let b be the starting point of the last interval of optimal solution \(o[i,\ell ]\), and let \(a_j\) be as given by Lemma 1. We upper bound

$$\begin{aligned} \delta _{i-1} = s[i-1, \ell -1]\frac{\epsilon }{k+\epsilon \ell }\le s[i, \ell ]\frac{\epsilon }{k+\epsilon \ell } \le s[i, \ell ]\frac{\epsilon }{\alpha k}. \end{aligned}$$
(2)

Then,

$$\begin{aligned} \alpha s[i,\ell ]&\ge \alpha (s[a_j-1,\ell -1] + d^*([a_j,i]))&(a_j \in A_i) \\&\ge \alpha (s[b-1,\ell -1] + d^*([b,i]) - \delta _{i-1})&(\text {Lemma}~1) \\&= \alpha s[b-1,\ell -1] + \alpha d^*([b,i]) - \alpha \delta _{i-1} \\&\ge o[b-1,\ell -1] + \alpha d^*([b,i]) - \alpha \delta _{i-1}&(\text {induction})\\&\ge o[b-1,\ell -1] + d^*([b,i]) - \alpha \delta _{i-1}&(\alpha \ge 1)\\&\ge o[b-1,\ell -1] + d^*([b,i]) - s[i, \ell ]\frac{\epsilon }{k}&(Eq.~2)\\&= o[i,\ell ] - s[i, \ell ]\frac{\epsilon }{k}. \\ \end{aligned}$$

As a result, \(s[i,\ell ](1+\frac{\epsilon }{k}\ell )\ge o[i,\ell ]\). \(\square \)

Let us now address the running time of the approximate dynamic programming.

Proposition 2

The running time of ApproxDP is \(\mathcal {O}(\frac{k^2}{\epsilon }r)\).

Proof

Let us fix i and \(\ell \), and count the number of candidates in \(A_i\). Note that \(|A_i| = |\texttt {SPRS} (A_{i-1})|+1\). The list of candidates \(\texttt {SPRS} (A_{i-1})\) corresponds to a monotonically increasing sequence of \(s[a, \ell ]\), with consecutive elements being at least \(\delta _{i - 1}\) apart. Thus, \(|\texttt {SPRS} (A_{i-1})|\le \frac{s[i-1,\ell ]}{\delta _{i - 1}} = \frac{k+\ell \epsilon }{\epsilon }\le \frac{k(1+\epsilon )}{\epsilon }\) and the number of operations in one call of the inner loop (lines 4–8) of Algorithm 1 is \(\mathcal {O}(k/\epsilon )\). Since this loop is called kr times, the result follows. \(\square \)

Since computing \(d^*\) requires time \(\mathcal {O}(nm \log n)\), the total running time is \(\mathcal {O}(r \frac{k^2}{\epsilon }nm\log {n})\), where \(r=|\mathcal {T}|\). We further speed up our algorithm by approximating the value \(d^*\) by means of one of the approaches developed by [19]. In particular, we employ the algorithm that maintains a \(2(1 + \epsilon )\)-approximate solution for the incremental densest subgraph problem (i.e., edge insertions only), while having poly-logarithmic amortized cost. We shall refer to such an algorithm as ApprDens.

ApprDens allows us to efficiently maintain the approximate density of the densest subgraph \(d^*([a,i])\) for each a in \(A_i\) in ApproxDP, as larger values of i are processed and edges are added. Whenever we remove an item a from \(A_i\) in SPRS, we also drop the corresponding instance of ApprDens.

From the fact that an approximate densest subgraph can be maintained with poly-logarithmic amortized cost, it follows that our algorithm has quasi-linear running time.

Proposition 3

ApproxDP combined with ApprDens runs in \(\mathcal {O}(\frac{k^2}{\epsilon _1 \epsilon _2^2} |\mathcal {T}|m_t \log ^2 n )\) time, where \(\epsilon _1\) and \(\epsilon _2\) are the respective approximation parameters for ApproxDP and ApprDens and \(m_t\) is the maximum number of edges per time stamp.

For real-world highly dynamic temporal networks, we can safely assume that \(m_t\) is a small constant.

Proof

To fill in cell \(s[i, \ell ]\), we need to update \(|A_i|=\mathcal {O}(k/\epsilon )\) graphs by adding at most \(m_i\) edges (some edges may already be present in the graphs), where \(m_i\) is the number of edges with time stamp \(t_i\). Let \(m_t\) be the maximum number of edges per time stamp. Theorem 4 in [19] states that maintaining the graph under the insertion of \(m_i\) edges requires \(\mathcal {O}(m_i \epsilon _{2}^{-2} \log ^2 n)\) time. We still need to fill \(k|\mathcal {T}|\) cells in the DP matrix. Combining these two results proves the proposition. \(\square \)

When combining ApproxDP with ApprDens, we wish to maintain the same approximation guarantee of ApprDens. Recall that ApproxDP leverages the fact that the profit function is monotone non-decreasing. Unfortunately, ApprDens does not necessarily yield a monotone score function, as the density of the computed subgraph might decrease when a new edge is inserted. This can be easily circumvented by keeping track of the best solution found so far, i.e., the subgraph with the highest density. The following proposition holds.

Proposition 4

ApproxDP combined with ApprDens yields a \(2(1+\epsilon _1)(1 + \epsilon _2)\)-approximation guarantee.

Proof

Let \(d_a^*(T)\) be the density of the graph returned by ApprDens for a time interval T. Let O be the optimal k-segmentation, let \(\mathcal {I}(O)\) be the intervals of this solution, and let \(q_1 = \sum _{I \in \mathcal {I}(O)} d^*(I)\) be its score. Let also \(q_2 = \sum _{I \in \mathcal {I}(O)} d_a^*(I)\). Let \(q_3\) be the score of the optimal k-segmentation \(O_a\) using \(d_a^*\). Note that the intervals constituting the solution \(O_a\) may not be the same as in O, as they are optimal solutions for different interval scoring functions \(d^*\) and \(d_a^*\). Thus, \(q_3\) may not be equal to \(q_2\). Let \(q_4\) be the score of the segmentation produced by ApproxDP. Then,

$$\begin{aligned} q_1 \le 2(1 + \epsilon _2)q_2 \le 2(1 + \epsilon _2)q_3 \le 2(1 + \epsilon _2)(1 + \epsilon _1)q_4, \end{aligned}$$

completing the proof. \(\square \)
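The monotonicity fix mentioned before Proposition 4 amounts to a thin wrapper around the incremental estimator; the estimator interface assumed below (add_edge, density, subgraph) is illustrative and not prescribed by [19].

class MonotoneDensest:
    """Wrap an incremental densest-subgraph estimator and report the best
    (highest) density seen so far, restoring monotonicity under insertions.
    The estimator is assumed to expose add_edge(), density(), subgraph()."""

    def __init__(self, estimator):
        self.estimator = estimator
        self.best_density = 0.0
        self.best_subgraph = None

    def add_edge(self, u, v):
        self.estimator.add_edge(u, v)
        d = self.estimator.density()
        if d > self.best_density:
            self.best_density = d
            self.best_subgraph = self.estimator.subgraph()

    def density(self):
        return self.best_density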

We will refer to this combination of ApproxDP with ApprDens as Algorithm kGapprox.

4 Encouraging larger and more diverse subgraphs

Problem 1 is focused on total density maximization; thus, its solution can contain subgraphs that are dense, but whose union of node sets covers only a small part of the network. Such a segmentation is useful when we are interested in the densest temporally coherent subgraphs, which can be understood as tight cores of temporal clusters. However, segmentations with larger but less dense subgraphs, covering a larger fraction of nodes, can be useful to obtain a high-level explanation of the whole temporal network. To allow for such segmentations, we extend Problem 1 to take node coverage into account.

We denote the set of subgraphs \(G_i\) included in the solution episodes \(S=\left\{ (I_i, G_i)\right\} \), \(i = 1, \ldots , k\), as \(\mathcal {G}=\{G_i\}\). Given a collection of subgraphs \(\mathcal {G}\), let \(x_v(\mathcal {G})=|\{G_i\in \mathcal {G}: v\in V(G_i)\}|\) be the number of subgraphs in \(\mathcal {G}\) that include node v. We consider generalized cover functions of the type

$$\begin{aligned} \mathrm {cover} (\mathcal {G}\mid w) = \sum _{v \in V}w(x_v(\mathcal {G})), \end{aligned}$$

where w is a nonnegative, non-decreasing, concave function of \(x_v(\mathcal {G})\). If w is the 0–1 indicator function, then \(\mathrm {cover} (\mathcal {G}\mid w)\) is the standard cover, which is intuitive and easy to optimize with a greedy algorithm. Another instance of the generalized cover function, inspired by text summarization research [35], is \(w(x_v(\mathcal {G}))=\sqrt{x_v(\mathcal {G})}\); it ensures that the marginal gain of a node decreases with the number of times the node is already covered. Adding the cover term to the objective of Problem 1 yields the following problem formulation.

Problem 2

(k-Densest-Episodes-EC) Given a temporal graph \(G = (V,\mathcal {T}, E)\), integer k, parameter \(\lambda \ge 0\), find a k-segmentation \(S=\{(I_i, G_i)\}\) of G, such that \(\mathrm {profit} (S) = \sum _{G_i\in \mathcal {G}}d(G_i) + \lambda \,\mathrm {cover} (\mathcal {G}\mid w)\) is maximized.
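The generalized cover term in this objective is straightforward to evaluate; a minimal sketch with the two weight functions mentioned above (indicator and square root) follows.

from collections import Counter
from math import sqrt

def cover(subgraph_node_sets, w):
    """Generalized cover: sum over nodes v of w(x_v), where x_v is the number
    of subgraphs containing v; nodes with x_v = 0 contribute w(0) = 0 for the
    weight functions below."""
    counts = Counter(v for nodes in subgraph_node_sets for v in set(nodes))
    return sum(w(x) for x in counts.values())

indicator = lambda x: 1 if x > 0 else 0     # standard 0-1 cover
sqrt_cover = lambda x: sqrt(x)              # square-root cover of [35]

G = [{"a", "b", "c"}, {"b", "c", "d"}]
print(cover(G, indicator))   # 4 nodes covered at least once
print(cover(G, sqrt_cover))  # 2 + 2*sqrt(2), approx. 4.83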

Unlike Problem 1, this problem cannot be solved in polynomial time.

Proposition 5

Problem 2 is NP-hard.

Proof

We will prove the hardness by reducing the set packing problem to \(k\)-Densest-Episodes-EC. In the set packing problem, we are given a collection \(\mathcal {C} = \left\{ C_1 ,\ldots , C_\ell \right\} \) of sets and are asked whether there are p disjoint sets. We can safely assume that \({\left| C_i\right| } = 3\).

Assume that we are given such a collection, and let us construct the temporal graph. The node set V consists of two sets \(V_1\) and \(V_2\). The first set \(V_1\) corresponds to the elements in \(\bigcup _i C_i\). The second set \(V_2\) consists of \(q = 6\ell + 3\) nodes. There are \(2\ell \) time stamps. At the 2i-th time stamp, we connect the nodes corresponding to \(C_i\), while at odd time stamps we place a clique on \(V_2\). Finally, we set \(k = \ell + p\) and \(\lambda = 1/({\left| V\right| } + 1)\). We use the 0–1 indicator function for w.

We claim that there is a solution to the set packing problem if and only if there is a solution to \(k\)-Densest-Episodes-EC with the profit of at least \(\ell (q - 1) / 2 + p + \lambda (3p + q)\).

To prove the only-if direction, assume there is a collection \(\mathcal {C}'\) of p disjoint sets. Build a k-segmentation by selecting each clique spanning \(V_2\) to be in its own segment, as well as the p 3-cliques corresponding to the sets in \(\mathcal {C}'\). This solution has the required profit.

Let us now prove the if direction. Assume we are given an optimal k-segmentation S. It is easy to see that if the ith segment contains an odd time stamp, then \(G_i\) must be the clique spanning \(V_2\). On the other hand, if the ith segment is equal to [2j, 2j], then \(G_i\) is the 3-clique connecting \(C_j\).

Let a be the number of segments containing odd time stamps; we can safely assume that \(a > 0\). Let b be the number of segments containing only even time stamps. Let c be the total number of nodes in \(V_1\) covered by at least one segment. Then,

$$\begin{aligned} \mathrm {profit} (S) = a(q - 1) / 2 + b + \lambda (c + q). \end{aligned}$$

We assume that \(\mathrm {profit} (S) \ge \ell (q - 1) / 2 + p + \lambda (3p + q)\). Since \(b + \lambda (c + q) \le \ell + 1 < (q - 1) / 2\) and \(\lambda (c + q) < 1\), this is only possible if \(a = \ell \), \(b = p\), and \(c = 3p\). This completes the proof. \(\square \)

4.1 k static overlapping densest subgraphs

Given the complexity of Problem 2, we start with an analysis of the static-graph case. We formulate the k-overlapping-densest-subgraphs problem and design a linear algorithm with an approximation guarantee. We will later apply the developed approach to temporal graphs; however, the algorithm can also be used as an efficient stand-alone method for finding overlapping dense subgraphs.

Problem 3

(k static overlapping densest subgraphs) Given a static graph \(H=(V,E')\), an integer k, and a real \(\lambda \ge 0\), find a set of k subgraphs \(\mathcal {H}= \{H_i \subseteq H\}\), such that \(\mathrm {profit} _{ST}(\mathcal {H}) = \sum _{H_i\in \mathcal {H}}d(H_i) + \lambda \cdot \mathrm {cover} (\mathcal {H}\mid w)\) is maximized.

We show below how to obtain a constant-factor approximate solution. We start by showing that the generalized cover function has beneficial combinatorial properties: It is submodular, nonnegative, and non-decreasing with respect to the set of subgraphs. The density term of the cost function of Problem 2 (and Problem 3) is a linear function of the subgraphs, and thus the whole cost function is nonnegative, non-decreasing, and submodular.

Proposition 6

Function \(\mathrm {cover} (\mathcal {G}\mid w)\) is a nonnegative, non-decreasing, and submodular function of subgraphs.

Proof

For a fixed \(v\in V\), the function \(x_v(\mathcal {G})\) is non-decreasing and modular (and hence submodular): for any set of subgraphs X and a new subgraph x, it holds that \(x_v(X\cup \{x\})-x_v(X)=1\) if v belongs to x, and 0 otherwise. By the properties of submodular functions, the composition of a concave non-decreasing function with a submodular non-decreasing function is non-decreasing and submodular. The function \(\mathrm {cover} (\mathcal {G}\mid w)\) is submodular and non-decreasing as a nonnegative linear combination of such compositions. Nonnegativity follows from the nonnegativity of w. \(\square \)

To solve Problem 3, we can search greedily over subgraphs. Let \(\mathcal {H}_{i-1} = \{H_1,\dots , H_{i-1}\}\), and define marginal node gain, given weight function w, as

$$\begin{aligned} \delta (v \mid \mathcal {H}_{i-1}, w) = w(x_v(\mathcal {H}_{i-1}\cup \{v\}))-w(x_v(\mathcal {H}_{i-1})). \end{aligned}$$
(3)

Here, \(\left\{ v\right\} \) refers to a graph containing only v. Then, denote the marginal gain of subgraph \(H_i\) given already selected graphs \(\mathcal {H}_{i-1}\) as

$$\begin{aligned} \chi (H_i \mid \mathcal {H}_{i-1}, w)=d(H_i)+\lambda \sum _{v\in H_i}\delta (v \mid \mathcal {H}_{i-1}, w). \end{aligned}$$
(4)

The greedy algorithm for Problem 3 sequentially builds the set \(\mathcal {H}\) by adding the subgraph \(H_i\) that maximizes the gain \(\chi (H_i \mid \mathcal {H}_{i-1}, w)\). If each \(H_i\) can be found optimally, this algorithm yields a \(1-1/e\) approximation due to the classic result on submodular maximization with cardinality constraints (see [42]).

To find the optimal \(H_i\), we need to solve the following problem.

Problem 4

Given a static graph \(H=(V,E')\), a set of subgraphs \(\mathcal {H}_{i-1} = \{H_1,\dots , H_{i-1}\}\), find a graph \(F \subseteq H\), such that \(\chi (F \mid \mathcal {H}_{i-1})\) is maximized.

Luckily, Problem 4 can be transformed into a (weighted) densest subgraph problem. In order to do so, we will define a weighted fully connected graph \(R = (V, V\times V, a)\) having the same nodes V as H with the weights a(uv) defined as

$$\begin{aligned} a(u, v) = I[(u, v) \in E'] + \frac{\lambda }{1 + I[u = v]}(\delta (u \mid \mathcal {H}_{i-1}, w) + \delta (v \mid \mathcal {H}_{i-1}, w)). \end{aligned}$$

Here, \(I[\cdot ]\) is an indicator function, returning 1 if the condition is true, and 0 otherwise. Note that we allow self-loop edges. Let \(R'\) be a subgraph in R and let F be the induced subgraph in H having the same nodes as \(R'\). Then, it is now straightforward to see that

$$\begin{aligned} \chi (F \mid \mathcal {H}_{i-1}, w) = d(R'). \end{aligned}$$

In other words, solving Problem 4 is equivalent to solving the densest subgraph problem in R. Consequently, we can solve Problem 4 exactly in \(\mathcal {O}(|V|^3)\) time [25]. Alternatively, we can approximate it within a factor of 1/2 in \(O(|V|^2)\) time by the greedy algorithm of Charikar [15]. We will use the latter algorithm and refer to it as StaticGreedy.
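The reduction and the peeling-style 1/2-approximation can be sketched as follows; delta_gain stands for \(\delta (v \mid \mathcal {H}_{i-1}, w)\) from Eq. 3, and the code is an illustrative Charikar-style peeling on the implicit weighted graph R, not the authors' StaticGreedy implementation.

def solve_problem4(nodes, H_edges, delta_gain, lam):
    """Peel the minimum weighted-degree node of the implicit complete
    weighted graph R and keep the best density seen (1/2-approximation)."""
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for u, v in H_edges:
        adj[u].add(v)
        adj[v].add(u)

    def a(u, v):
        """Weight of the pair {u, v} in R; self-loops carry lam*delta(v)."""
        if u == v:
            return lam * delta_gain[u]
        edge = 1.0 if v in adj[u] else 0.0
        return edge + lam * (delta_gain[u] + delta_gain[v])

    deg = {v: sum(a(v, u) for u in nodes) for v in nodes}
    total = sum(a(u, v) for i, u in enumerate(nodes) for v in nodes[i:])
    alive = list(nodes)
    best_density, best_nodes = total / len(alive), list(alive)
    while len(alive) > 1:
        x = min(alive, key=deg.get)          # peel min weighted-degree node
        alive.remove(x)
        total -= deg[x]
        for u in alive:
            deg[u] -= a(u, x)
        d = total / len(alive)
        if d > best_density:
            best_density, best_nodes = d, list(alive)
    return best_nodes, best_density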

Now we have everything to design and analyze an approximation algorithm for Problem 3. Algorithm 3 greedily finds k subgraphs to solve Problem 3.

Each subgraph is sought with a 1/2-approximation guarantee, and, due to submodularity, greedy search with an optimal subgraph oracle would yield a \((1-1/e)\)-approximation. Combining these results leads to the following statement.

Proposition 7

Algorithm 3 is a \(1/2(1-1/e)\approx 0.31606\) approximation for Problem 3.

Proof

Let y be the value of the \(profit_{ST}\) score of the k greedily sought subgraphs, assuming that each subgraph was sought optimally. The ith subgraph has marginal gain \(y_i\); thus, \(y = \sum _i^k y_i\). Let the value of an optimal solution of Problem 3 be \(y^*\). Due to greedy submodular optimization, \(y\ge (1-1/e)y^*\). Algorithm 3 uses the 1/2-approximation algorithm StaticGreedy for the subgraph search; thus, \(\bar{y_i}\ge y_i/2\), where \(\bar{y_i}\) is the marginal gain of the ith subgraph included in the solution. Let \(\bar{y}\) be the value of the final solution output by Algorithm 3. Putting everything together, we have \(\bar{y}=\sum _i^k\bar{y_i}\ge \sum _i^ky_i/2 = y/2 \ge 1/2(1-1/e)y^*\). This concludes the proof. \(\square \)

The running time of Algorithm 3 is defined by the running time of the greedy subroutine and is \(\Theta (k|V|^2)\).

Algorithm 3: Greedy algorithm for Problem 3 (pseudocode figure)

4.2 Greedy dynamic programming

Similarly to Problem 1, we will use dynamic programming for Problem 2. However, as the problem is hard, we have to rely on greedy choices of the subgraphs. Thus, the obtained solution does not have any quality guarantee.

Let \(M[\ell ,i]\) be the profit of segmenting the first i points into \(\ell \) intervals, and let \(C[\ell ,i]\) be the set of subgraphs \(\mathcal {G}_{\ell }=\{G_1, \dots , G_{\ell }\}\) selected on these \(\ell \) intervals, for \(1\le \ell \le k\) and \(0\le i\le m\).

Define the marginal gain of the interval [j, i], given that the first \(j-1\) points are already segmented into \(\ell - 1\) intervals (here \(\chi \) is defined in Eq. 4):

$$\begin{aligned} \mathrm {gain} ([j,i], C[\ell -1, j-1]) = \max _{G'\subseteq G([j,i])}\chi (G' \mid C[\ell -1, j-1]). \end{aligned}$$
(5)

This leads to a dynamic program

$$\begin{aligned} \begin{aligned} M[\ell ,i] =&\max _{1\le j\le i+1} M[\ell -1,j-1] + \mathrm {gain} ([j,i], C[\ell -1, j-1]) \text { for } 1 < \ell \le k, \\ M[1,i] =\,&d^*([0, i]) \text { for } 0\le i\le m, \\ M[k',0] =\,&0 \text { for } 1\le k'\le k. \end{aligned} \end{aligned}$$

After filling this table, \(M[k,m]\) contains the profit of the k-segmentation with subgraph overlaps, and \(C[k,m]\) contains the selected subgraphs; the intervals and subgraphs can be reconstructed if we keep track of the starting points of the selected last intervals. Note that the profit \(M[k,m]\) is not optimal, because the choice of subgraph \(G_i\) depends on the interval and on the previous choices.
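Schematically, the greedy DP can be written as follows; best_gain is a stand-in for an (approximate) maximizer of Eq. 5, e.g., obtained via the reduction of Sect. 4.1, and the backtracking of interval boundaries is omitted for brevity (time stamps are enumerated 1..r here).

def greedy_cover_dp(r, k, best_gain):
    """Greedy DP for Problem 2.
    best_gain(j, i, chosen) -- (approximate) maximizer of Eq. 5 for the
                               interval [j, i], given the already chosen
                               subgraphs; returns (gain, subgraph)."""
    NEG = float("-inf")
    M = [[NEG] * (r + 1) for _ in range(k + 1)]
    C = [[None] * (r + 1) for _ in range(k + 1)]
    for l in range(k + 1):
        M[l][0], C[l][0] = 0.0, []
    for l in range(1, k + 1):
        for i in range(1, r + 1):
            for j in range(1, i + 1):
                if M[l - 1][j - 1] == NEG:
                    continue
                gain, subgraph = best_gain(j, i, C[l - 1][j - 1])
                if M[l - 1][j - 1] + gain > M[l][i]:
                    M[l][i] = M[l - 1][j - 1] + gain
                    C[l][i] = C[l - 1][j - 1] + [subgraph]
    return M[k][r], C[k][r]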

We perform the dynamic programming with the approximation algorithm ApproxDP, and the densest subgraph for each candidate interval is retrieved with the incremental algorithm of Epasto et al. [19]. We refer to the resulting algorithm as kGCvr.

To keep track of the counts \(x_v\) while constructing \(\mathcal {G}\), we need to maintain the frequency of each node. To avoid excessive memory costs, in the experiments we use Count-Min sketches.
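For completeness, a minimal Count-Min sketch supporting the approximate node counts used here; the hash construction and default dimensions are illustrative.

import random

class CountMinSketch:
    """Count-Min sketch: counts are over-estimated by at most eps * (total
    count) with probability 1 - delta, for width ~ ceil(e/eps) and
    depth ~ ceil(ln(1/delta))."""

    def __init__(self, width=272, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))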

5 Experimental evaluation

We evaluate the performance of the proposed algorithms on synthetic graphs and real-world social networks. The datasets are described below. Unless specified otherwise, we post-process the output of all algorithms and report the optimal densest subgraphs in the output intervals. Our datasets and implementations are publicly available.

5.1 Synthetic data

We generate a temporal network with k planted communities and a background network. All graphs are Erdős–Rényi. The communities \(G'\) have the same density, disjoint sets of nodes, and are planted in consecutive non-overlapping intervals. The background network G includes the nodes from all planted communities \(G'\). The edges of G are distributed uniformly on the timeline. In a typical setup, the length of the whole time interval T is \({\left| T\right| }=1000\) time units, while the edges of each \(G'\) are generated in intervals of length \({\left| T'\right| }=100\) time units. The densities of the communities and of the background network vary. The number of nodes in G is set to 100.
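A hedged sketch of this generator is shown below; the particular parameter values (numbers of nodes, edge probabilities) are illustrative defaults and may differ from the exact experimental setup.

import random

def planted_temporal_network(k=10, n=100, T=1000, T_event=100,
                             event_size=8, p_event=0.7, p_bg=0.02, seed=0):
    """Generate a temporal edge list (u, v, t): k planted Erdos-Renyi
    communities on disjoint node subsets of the background node set, each
    active in its own consecutive interval of length T_event, plus uniform
    Erdos-Renyi background noise over the timeline [0, T).
    Requires k * event_size <= n."""
    rng = random.Random(seed)
    edges = []
    for u in range(n):                       # background G(n, p_bg)
        for v in range(u + 1, n):
            if rng.random() < p_bg:
                edges.append((u, v, rng.randrange(T)))
    for c in range(k):                       # planted communities
        start = c * (T // k)
        members = range(c * event_size, (c + 1) * event_size)
        for u in members:
            for v in members:
                if u < v and rng.random() < p_event:
                    edges.append((u, v, start + rng.randrange(T_event)))
    return sorted(edges, key=lambda e: e[2])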

We produced two families of synthetic temporal networks: Synthetic1 and Synthetic2. In the first setting (dataset family Synthetic1), we vary the average degree of the background network from 0.5 to 4 and fix the density of the planted 5-cliques to 4. Synthetic1 allows us to test the robustness of our algorithms against background noise. In the second setting (dataset family Synthetic2), we vary the density of the planted eight-node graphs from 2 to 7, while the average degree of the background network is fixed to 2. A separate synthetic dataset, Synthetic3, is designed to test the effect of setting different parameters k in the algorithms. The dataset contains \(k=10\) intervals with the activity of eight-node subgraphs with average degree 5, and the background noise has average degree 2.

5.2 Real-world data

We use the following real-world datasets: Facebook [51] is a subset of Facebook activity in the New Orleans regional community. Interactions are posts of users on each other's walls. The data cover the time period from 9.05.06 to 20.08.06. The Twitter dataset tracks the activity of Twitter users in Helsinki during 2013. As interactions, we consider tweets that contain mentions of other users. The Students dataset logs activity in a student online network at the University of California, Irvine. Nodes represent students, and edges represent messages with directions ignored. Enron is a popular dataset that contains e-mail communication of senior management in a large company and spans several years.

For a case study, we create a hashtag network from the Twitter dataset (the same tweets from users in Helsinki in 2013): Nodes represent hashtags, and there is an interaction if two hashtags occur in the same tweet. The time stamp of the interaction corresponds to the time stamp of the tweet. We denote this dataset as Twitter#.

5.3 Optimal baseline

A natural baseline for kGapprox is Optimal, which combines exact dynamic programming with finding the optimal densest subgraph for each candidate interval. Due to the high running time of Optimal, we generate a very small dataset with 60 time stamps, where each time stamp contains a random graph with 3–6 nodes and random density. We vary the number of intervals k and report the value of the solution (without any post-processing) and the running time in Fig. 2. On this toy dataset, kGapprox finds a near-optimal solution, while being significantly faster than Optimal.

Fig. 2 Comparison between optimum and approximate solutions (Optimal and kGapprox). Approximate algorithm was run with \(\epsilon _1=\epsilon _2=0.1\). Running time is in seconds

5.4 Results on synthetic datasets

Next, we evaluate the performance of kGapprox on the synthetic datasets Synthetic1 and Synthetic2 by assessing how well the algorithm finds the planted subgraphs. We report mean precision, recall, and F-measure, calculated with respect to the ground-truth subgraphs. All results are averaged over 100 independent runs.

First, Fig. 3a depicts the quality of the solution as a function of background noise. Recall that the Synthetic1 dataset contains planted eight-node subgraphs with average degree 5. Precision and recall are generally high for all values of average degree in the background network. However, precision degrades as the density of the background network increases, as it then becomes cost-beneficial to add more nodes to the discovered densest subgraphs.

Second, Fig. 3b shows the quality of the solution of kGapprox as a function of the density of the planted subgraphs. Note that in Synthetic2 the density of the background network is 2. Similarly to the previous results, the quality of the solution, especially recall, degrades noticeably only when the densities of the planted subgraphs and the background network become similar.

Fig. 3 Precision, recall, and F-measure on synthetic datasets. For plot a, the community average degree is fixed to 5 (Synthetic1 dataset), and for plot b, the background network degree is fixed to 2 (Synthetic2 dataset). Plot a: the mean standard deviation for precision is 0.193, for recall 0.183, and for F-measure 0.180. Plot b: the mean standard deviation for precision is 0.188, for recall 0.178, and for F-measure 0.173

Figure 4 demonstrates how well the true event intervals are recovered in the case of the synthetic Synthetic3 dataset with \(k=10\) planted event intervals. The true value of k was treated as unknown, and kGapprox was run with all possible integer values of k in [2, 20].

Figure 4a shows the quality of the intervals; precision and recall are calculated with respect to the length of the overlap between the true interval and the output one.

Since the numbers of intervals in the segmentation and in the ground truth differ, we compare each output interval to its best match in terms of F-measure. That is, let \((I_1, \dots , I_k)\) and \((I_1', \dots , I_{k'}')\) be the sets of ground-truth intervals and solution intervals, with k not necessarily equal to \(k'\). For each \(I_i'\) in the solution, we find the best matching interval in the ground truth, \(I^*_i=\mathop {\arg \max }_{I_i\in (I_1, \dots , I_k)} F(I_i',I_i)\). Here, F is the F-measure, with precision and recall calculated with respect to time stamps: the number of time stamps from the ground-truth interval \(I_i\) that also belong to the interval \(I_i'\), divided by the number of time stamps in \(I_i'\) (precision) or by the number of time stamps in \(I_i\) (recall). Once such a matching interval \(I^*_i\) is found for each \(I_i'\), we calculate and report the precision \(P(I_i',I^*_i)\), recall \(R(I_i',I^*_i)\), and F-measure \(F(I_i',I^*_i)\), defined with respect to the time stamps as described above.
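The matching procedure can be summarized by the following sketch, where intervals are closed ranges of integer time stamps; the function names are ours.

def interval_prf(pred, truth):
    """Precision/recall/F of a predicted interval against a ground-truth
    interval, measured in overlapping time stamps (closed integer intervals)."""
    overlap = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]) + 1)
    prec = overlap / (pred[1] - pred[0] + 1)
    rec = overlap / (truth[1] - truth[0] + 1)
    f = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f

def best_match_scores(predicted, ground_truth):
    """For each output interval, take the scores of its best-matching
    ground-truth interval (by F-measure) and average over the output."""
    scores = [max((interval_prf(p, t) for t in ground_truth),
                  key=lambda prf: prf[2]) for p in predicted]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))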

All the reported measures are averaged over the output intervals (and over 100 runs). After matching the intervals, we also evaluate the quality of the densest subgraphs and compare their node sets to the ground-truth events in the corresponding intervals (Fig. 4b). As we can see, the intervals are in general recovered quite well, even though the algorithm is given an incorrect value of k. The quality of the subgraph recovery is generally lower, which is a result of the shifted borders of the intervals.

Fig. 4 Quality of the solutions in the case of unknown k. Planted \(k=10\) intervals with eight-node subgraphs each (Synthetic3 dataset). Plot a shows the quality of the solution segmentation, and plot b shows the quality of the subgraphs sought in the intervals

5.5 Results on real-world datasets

As the optimal partition algorithm Optimal does not scale to real datasets, we present comparative results of kGapprox with the baselines kGoptDP and kGoptDS. The kGoptDP algorithm performs exact dynamic programming, but uses an approximate incremental algorithm for the densest subgraph search (the incremental framework by Epasto et al. [19]). Vice versa, kGoptDS performs approximate dynamic programming while calculating the densest subgraph optimally for each candidate interval (by Goldberg's algorithm [27]). Note that kGoptDP has a \(2(1+\epsilon _{\text {ds}})^2\) approximation guarantee and kGoptDS has a \((1+\epsilon _{\text {dp}})\) approximation guarantee. However, even these non-optimal baselines are quite slow in practice, and we use a subset of \(1\,000\) interactions of the Students and Enron datasets for comparative reporting.

To ensure fairness, we report the total density of the optimal densest subgraphs in the intervals returned by the algorithms.

In Table 1, we report the density of the solutions reported by kGapprox, kGoptDP, and kGoptDS, and Table 2 shows their running time. We experiment with different parameters for the approximate densest subgraph search (\(\epsilon _{\text {ds}} \)) and for approximate dynamic programming (\(\epsilon _{\text {dp}} \)).

For both datasets, the best solution (i.e., the solution with the highest value of the profit function of Problem 1) was found by kGoptDS. This is expected, as this algorithm has the best approximation factor. The value of the solution decreases as \(\epsilon _{\text {dp}} \) increases. On the other hand, kGoptDS has the largest running time, which decreases with increasing \(\epsilon _{\text {dp}} \), but even with the largest parameter value (\(\epsilon _{\text {dp}} =2\)) kGoptDS takes about an hour.

The kGoptDP algorithm typically finds the second-best solution; however, it only marginally outperforms kGapprox (e.g., for \(\epsilon _{\text {ds}} =0.1\)), while requiring up to several orders of magnitude more computational time. Naturally, the quality of the solution degrades with increasing \(\epsilon _{\text {ds}} \).

The solution quality degrades for all algorithms as the approximation parameters increase. However, the degradation is not as dramatic as the worst-case bound suggests, while using larger approximation parameters offers a significant speedup. kGapprox provides the fastest estimates of good quality for a wide range of approximation parameters. Note that kGapprox is more sensitive to changes in the quality of the densest subgraph search, regulated by \(\epsilon _{\text {ds}} \).

Table 1 Comparison of kGapprox with kGoptDP and kGoptDS baselines: total community density
Table 2 Comparison of kGapprox with kGoptDP and kGoptDS baselines: running time

5.6 Running time and scalability

Figure 5 shows the running time of kGapprox as a function of the approximation parameters \(\epsilon _{\text {ds}} \) and \(\epsilon _{\text {dp}} \). The figure confirms the theory: \(\epsilon _{\text {ds}} \) has a significant impact on the running time, while the algorithm scales very well with \(\epsilon _{\text {dp}}\).

We demonstrate scalability in Fig. 6, plotting the running time for an increasing number of interactions for the Facebook and Twitter datasets. Recall that the theoretical running time is \(\mathcal {O}(k^2 m\log n)\), where n is the number of nodes and m the number of interactions. In practice, the running time grows fast for the first thousand interactions and then saturates to a linear dependence. This happens because in the beginning of the network history the number of nodes grows fast. In addition, new subgraphs that are denser than the previously seen ones are more likely to occur, so the approximate densest subgraph subroutine has to be invoked more often. Furthermore, the number of intervals k contributes to the running time as expected.

Fig. 5 Effect of different approximation parameters in kGapprox. \(k=20\)

Fig. 6 Scalability testing with \(\epsilon _{\text {ds}} =\epsilon _{\text {dp}} =0.1\)

Figure 7 shows how the value of the solution changes as the network evolves. Setting a larger k results in a larger total density. However, the relative change of the solution values is approximately the same for all k: As the number of time stamps goes from 100 to 100000, the total density increases about 2.5 times for the Facebook dataset and 3.5 times for the Twitter dataset. This means that while different k lead to technically different segmentations, they capture the rate of network evolution similarly.

Naturally, setting a larger k results in discovering subgraphs of smaller individual density, as Fig. 8 shows. However, the relative difference between the mean densities for different k is typically smaller than the relative difference between the values of k themselves. This means that the algorithm tends not to split intervals of dense subgraphs to achieve a better total density, but rather discovers new dense-subgraph intervals as k increases.

Fig. 7 Total density of the solution subgraphs for different values of k and different lengths of the time series (\(\epsilon _{\text {ds}} =\epsilon _{\text {dp}} =0.1\))

Fig. 8 Mean density of the solution subgraphs for different values of k and different lengths of the time series (\(\epsilon _{\text {ds}} =\epsilon _{\text {dp}} =0.1\))

5.7 Subgraphs with larger node coverage—static graphs

Next, we evaluate StaticGreedy. To measure coverage, we simply count the number of distinct nodes in the output subgraphs. We use the first 10K interactions of the Students dataset, set \(k=20\), and test different values of \(\lambda \). Figure 9 shows the density and the pairwise Jaccard similarity of the node sets of the retrieved subgraphs. The subgraphs are shown in the order they are discovered. Smaller values of \(\lambda \) give higher density, and larger values of \(\lambda \) give larger coverage. We observe that, for all values of \(\lambda \), StaticGreedy initially returns diverse and dense subgraphs, but soon after it starts outputting graphs that have already been selected into the solution in previous iterations. We speculate that the algorithm finds all dense subgraphs that exist in the dataset. Regarding the setting of \(\lambda \), we observe that \(\lambda =0.002\) offers a good trade-off between finding subgraphs of high density and moderate overlap.

Fig. 9 Pairwise similarities (three heatmap plots on the left) and densities (right plot) of subgraphs returned by StaticGreedy

5.8 Subgraphs with larger node coverage—dynamic graphs

Finally, we evaluate the performance of the kGCvr algorithm. We vary the parameter \(\lambda \) and compare different characteristics of the solution with those of the solution returned by kGapprox. For different values of \(\lambda \), Table 3 shows the average density and the total number of covered nodes, and Table 4 shows the average size of the subgraphs and the average pairwise Jaccard similarity. Although kGCvr does not have an approximation guarantee, for small values of \(\lambda \) it finds subgraphs with density close to that of kGapprox. Similarly to the static case, \(\lambda \) provides an efficient trade-off between density and coverage.

Table 3 Total density and total cover size of kGCvr ’s outputs with \(k=5\) and \(\epsilon _{\text {ds}} = \epsilon _{\text {dp}} = 0.1\)
Table 4 Average subgraph size and average Jaccard similarity between the subgraphs in the output of kGCvr with \(k=5\) and \(\epsilon _{\text {ds}} = \epsilon _{\text {dp}} = 0.1\)

5.9 Parameter selection

Both problem formulations, \(k\)-Densest-Episodes and \(k\)-Densest-Episodes-EC, follow the classic sequence segmentation setting [10] and take as input the number of segments (k) in the timeline partition. We primarily assume that the value of k can be specified based on prior knowledge and user expectations. For the problem formulations \(k\)-Densest-Episodes and \(k\)-Densest-Episodes-EC, we can show (“Appendix A”) that the total profit is a strictly increasing function of the number of segments and reaches its maximum for the largest possible number of segments. Thus, the choice of k cannot be guided by the optimal objective value. Furthermore, it is hard to assess the quality of the subgraphs in the segmentation: Larger intervals with denser subgraphs correspond to larger events, while splitting an interval in favor of less dense subgraphs corresponds to sub-events. Duplicated events in neighboring intervals can also arise from different sub-segmentations as k increases; thus, we cannot simply recommend decreasing k when duplicates occur. However, we do not view this uncertainty with respect to the choice of k as a weakness of the approach: It allows the user to explore the data at different granularity levels and possibly observe a hierarchy of events.

The algorithms kGapprox and kGCvr require the approximation parameters \(\epsilon _{\text {dp}}\) and \(\epsilon _{\text {ds}}\). As discussed in the comparison of kGapprox, kGoptDP, and kGoptDS (Table 1), by design our approximation algorithms are more sensitive to changes in the quality of the densest subgraph search. The parameter \(\epsilon _{\text {ds}}\) affects the calculation of the interval profits, and these values guide the dynamic programming algorithm; loose values of this approximation parameter are likely to misguide it. As the scalability results show (Fig. 5), the algorithms scale better with the change of \(\epsilon _{\text {dp}} \) than with the change of \(\epsilon _{\text {ds}} \). However, both parameters contribute equally to the solution quality guarantee \(2(1+\epsilon _{\text {ds}})(1+\epsilon _{\text {dp}})\) of Problem \(k\)-Densest-Episodes, and the order of the approximation factor depends on the larger of \(\epsilon _{\text {ds}} \) and \(\epsilon _{\text {dp}} \). Thus, it is not guaranteed (and not fully supported by the empirical results) that reducing only \(\epsilon _{\text {ds}} \) will lead to better results faster. As a rule of thumb, in most of our experiments we use \(\epsilon _{\text {ds}} =\epsilon _{\text {dp}} =0.1\), which gives a satisfactory guarantee of 2.42 and is sufficiently fast. We use the same parameters for the experiments with kGCvr.

The last parameter to discuss is \(\lambda \) in problem \(k\)-Densest-Episodes-EC, which controls the node coverage in the solution. The sensitivity and the range of meaningful values of this parameter depend non-trivially on the topological and temporal properties of the network. To select a good value for \(\lambda \), one can sample different values and plot the density of the resulting subgraphs, similarly to Fig. 9. Then, one can choose a value of \(\lambda \) that provides a good trade-off between diversity and density: A too-small value of \(\lambda \) may lead to dense but repeating structures, while a too-large value may yield overly large subgraphs that are not dense enough.
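Such a sweep is easy to script; in the sketch below, static_greedy is a stand-in for the Sect. 4.1 routine (assumed to return a list of k node sets), so the call signature is an assumption.

def subgraph_density(node_set, edges):
    """d(S): number of edges of H with both endpoints in S, divided by |S|."""
    inside = sum(1 for u, v in edges if u in node_set and v in node_set)
    return inside / len(node_set) if node_set else 0.0

def sweep_lambda(edges, k, lambdas, static_greedy):
    """Record (lambda, mean subgraph density, node coverage) for each
    candidate lambda, mirroring the procedure behind Fig. 9."""
    rows = []
    for lam in lambdas:
        subgraphs = static_greedy(edges, k, lam)
        mean_d = sum(subgraph_density(S, edges) for S in subgraphs) / len(subgraphs)
        coverage = len(set().union(*subgraphs))
        rows.append((lam, mean_d, coverage))
    return rows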

6 Case study

We present a case study using graphs of co-occurring hashtags from Twitter messages in the Helsinki region. We create two subsets of the Twitter# dataset: one covering all tweets in November 2013 and another covering December 2013. The November dataset consists of 4758 interactions and 917 nodes, and the corresponding static graph has an average degree density of 3.546. The December dataset has 5559 interactions and 1039 nodes, and its density is 3.290.

Figures 10 and 11 show the dense subgraphs discovered by the kGapprox algorithm on these datasets, with \(k=4\) and \(\epsilon _{\text {ds}} = \epsilon _{\text {dp}} = 0.1\).

For the November dataset, kGapprox creates a small 1-day interval in the beginning and then splits the remaining time almost evenly. This first interval includes the nodes movember, liiga, halloween, and digiexpo, which cover a broad range of global events (e.g., Movember and Halloween) and local events (e.g., the game industry event DigiExpo and the Finnish ice hockey league). The next interval is represented by a large variety of well-connected tags related to mtv and media, corresponding to the MTV Europe Music Awards 2013 on November 10. There are also other ice hockey-related tags (e.g., leijonat) and Father's Day tags (e.g., isänpäivä), which was on November 13. The third interval is mostly represented by Slush-related tags; Slush is an annual large startup and tech event in Helsinki. The last interval is completely dedicated to ice hockey, with many team names.

There are three major public holidays in December: Finland's Independence Day on December 6, Christmas on December 25, and New Year's Eve on December 31. kGapprox allocates one interval for Christmas and New Year, from December 21 to 31. Ice hockey is also represented in this interval, as well as in the third interval. Remarkably, the Independence Day holiday is split into two intervals. The first one runs from December 1 to December 6, 3:30pm, and the corresponding graph has two clusters: the first contains general holiday-related tags, and the second is focused on the Independence Day President's reception (Itsenäisyyspäivän vastaanotto, colloquially Linnan juhlat / Slottsbalen). This is a large event that starts on December 6 at 6pm, is broadcast live, and is discussed in the media for the following days. The second interval, December 6–9, is a faithful representation of this event.

Fig. 10 Subgraphs discovered in the network of Twitter hashtags Twitter# from November 2013 by the kGapprox algorithm with \(k=4\), \(\epsilon _{\text {ds}} = \epsilon _{\text {dp}} = 0.1\)

Fig. 11 Subgraphs discovered in the network of Twitter hashtags Twitter# from December 2013 by the kGapprox algorithm with \(k=4\), \(\epsilon _{\text {ds}} = \epsilon _{\text {dp}} = 0.1\)

To demonstrate the qualitative performance of kGapprox for different parameters, we consider three parameter settings: \( case _1\): \(\epsilon _{\text {ds}} =0.1\), \(\epsilon _{\text {dp}} =0.1\); \( case _2\): \(\epsilon _{\text {ds}} =0.01\), \(\epsilon _{\text {dp}} =0.1\); and \( case _3\): \(\epsilon _{\text {ds}} =0.1\), \(\epsilon _{\text {dp}} =0.01\). Table 5 shows the characteristics of the solution graphs \((H_1, H_2, H_3, H_4)\) discovered in the different settings.

The first two rows show the average degree density and the number of nodes (size) of each graph. Rows 3 and 4 compare the ith graph in one solution (i.e., in one parameter setting) with the ith graph in the other solutions (i.e., in the other parameter settings). We report the average overlap in nodes and the average Jaccard similarity of the node sets. Larger overlap and larger Jaccard similarity values provide evidence that the algorithm outputs similar ith episode graphs for different settings. For the November dataset, the first two episodes are identical for all settings. Episodes 3 and 4 are similar for cases \( case _1\) and \( case _3\), but different for \( case _2\): As discussed before, changes in the densest subgraph search contribute to changes in the solution. There is a similar trend for the December dataset, although the similarity values are typically lower.

Rows 5 and 6 present the similarities between the graphs within one solution. We compare the ith graph in a solution to all other episode graphs in that solution. We report the average overlap in nodes and the average Jaccard similarity of the node sets. Lower overlap and smaller Jaccard similarity values indicate that the graphs in the solution differ. All similarity values for both datasets are quite low. Although the average overlap in nodes can be as high as 7.333, such an overlap is not prominent when the sizes of the graphs are taken into consideration, as shown by the Jaccard similarity metric.

We can conclude that, for all parameters, the solution for the case study consists of diverse graphs. However, changing the accuracy of the densest subgraph search may lead to differences in the output episode graphs.

Table 5 Characteristics of the episode graphs \(H_i\) discovered for different parameters of \(\epsilon _{\text {ds}} \) and \(\epsilon _{\text {dp}} \) in the case-study dataset

7 Related work

Partitioning a graph into dense subgraphs is a well-established problem. Many of the existing works adopt the average-degree notion as the density definition [2, 23, 33, 50]. The densest subgraph, under this definition, can be found in polynomial time [27]. Moreover, there is a 2-approximation greedy algorithm by Charikar [15] and Asahiro et al. [4], which runs in time linear in the graph size. Many recent works develop methods to maintain the average-degree densest subgraph in a streaming scenario [14, 19, 20, 38, 39]. Alternative density definitions, such as variants of quasi-cliques, are often hard to approximate or are tackled with efficient heuristics, due to connections to the NP-complete Maximum Clique problem [1, 37, 49].

A line of work focuses on dynamic graphs, which model node/edge additions and deletions. Different aspects of network evolution, including the evolution of dense groups, were studied in this setting [6, 11, 34, 41]. However, here we use the interaction network model, which differs from dynamic graphs in that it captures the instantaneous interactions between nodes.

Another classic approach to modeling temporal graphs is to consider graph snapshots, find structures in each snapshot separately (or by incorporating information from previous snapshots), and then summarize the historical behavior of the discovered structures [5, 12, 28, 36, 40]. These approaches usually focus on the temporal coherence of the dense structures discovered in the snapshots and assume that the snapshots are given. In this work, we aggregate instantaneous interactions into timeline partitions of arbitrary lengths.

To the best of our knowledge, the following works are the most closely aligned with our approach. The work of Rozenshtein et al. [44] considers the problem of finding the densest subgraph in a temporal network. However, first, they do not aim at creating a temporal partitioning. Second, they are interested in finding a single dense subgraph whose edges occur in k short time intervals. On the contrary, in this work we search for an interval partitioning and consider only subgraphs that span contiguous intervals. Other close works are by Jethava and Beerenwinkel [31] and Semertzidis et al. [46]. However, these works consider a set of snapshots and search for a single heavy subgraph induced by one or several intervals. The work of Semertzidis et al. [46] explores different formulations for the persistent heavy subgraph problem, including maximum average density, while Jethava and Beerenwinkel [31] focus solely on maximum average density.

8 Conclusions

In this work, we consider the problem of finding a sequence of dense subgraphs in a temporal network. We search for a partition of the network timeline into k non-overlapping intervals, such that the intervals span subgraphs with maximum total density. To provide a fast solution for this problem, we adapt recent work on dynamic densest subgraph and approximate dynamic programming. In order to ensure that the episodes we discover consist of a diverse set of nodes, we adjust the problem formulation to encourage coverage of a larger set of nodes. While the modified problem is NP-hard, we provide a greedy heuristic, which performs well on empirical tests.

The problems of temporal event detection and timeline segmentation can be formulated in various ways depending on the type of structures that are considered to be interesting. Here, we propose segmentation with respect to maximizing subgraph density. The intuition is that those dense subgraphs provide a sequence of interesting events that occur in the lifetime of the temporal network. However, other notions of interesting structures, such as frequency of the subgraphs, or statistical non-randomness of the subgraphs, can be considered for future work. In addition, it could be meaningful to allow more than one structure per interval. Another possible extension is to consider overlapping intervals instead of a segmentation.