Online summarization of dynamic graphs using subjective interestingness for sequential data

Many real-world phenomena can be represented as dynamic graphs, i.e., networks that change over time. The problem of dynamic graph summarization, i.e., to succinctly describe the evolution of a dynamic graph, has been widely studied. Existing methods typically use objective measures to find fixed structures such as cliques, stars, and cores. Most of the methods, however, do not consider the problem of online summarization, where the summary is incrementally conveyed to the analyst as the graph evolves, and (thus) do not take into account the knowledge of the analyst at a specific moment in time. We address this gap in the literature through a novel, generic framework for subjective interestingness for sequential data. Specifically, we iteratively identify atomic changes, called ‘actions’, that provide most information relative to the current knowledge of the analyst. For this, we introduce a novel information gain measure, which is motivated by the minimum description length (MDL) principle. With this measure, our approach discovers compact summaries without having to decide on the number of patterns. As such, we are the first to combine approaches for data mining based on subjective interestingness (using the maximum entropy principle) with pattern-based summarization (using the MDL principle). We instantiate this framework for dynamic graphs and dense subgraph patterns, and present DSSG, a heuristic algorithm for the online summarization of dynamic graphs by means of informative actions, each of which represents an interpretable change to the connectivity structure of the graph. The experiments on real-world data demonstrate that our approach effectively discovers informative summaries. We conclude with a case study on data from an airline network to show its potential for real-world applications.


Introduction
Many real-world phenomena, including interactions between people (e.g., social media, e-mail), web browsing, transport and logistics operations, and asset management, can be modelled in terms of the relationships between entities. That is, the corresponding data can be naturally represented as a network or graph, where vertices represent the entities and edges represent their relationships. When these relationships change over time, the graphs are called dynamic graphs.
The problem of static graph summarization has been widely studied, e.g., to efficiently store large volumes of data (Navlakha et al. 2008); improve query efficiency (LeFevre and Terzi 2010); visualize large graphs; and provide high-level descriptions (Goebl et al. 2016). Some of the popular methods rely on compression, aggregation of vertices/edges (LeFevre and Terzi 2010), or finding meaningful patterns (Goebl et al. 2016).
The need to incorporate the temporal dimension has led to the introduction of the problem of dynamic graph summarization. Lately, this problem has gained much attention. Here, the focus is on finding a minimal set of temporal structures that describe a dynamic network or graph. A typical way to achieve this is by considering a dynamic network as a sequence of static graph states/snapshots (Sun et al. 2007;Shah et al. 2015;Adhikari et al. 2017) and subjecting those to static graph summarization methods. Such sequences of static graphs, constructed by segmenting a dynamic graph into different states, can be referred to as sequential data. For instance, the method proposed by Shah et al. (2015), namely TimeCrunch, extends VoG, a method for static graph summarization by Koutra et al. (2014). It creates a summary by stitching together the graph structures found in different snapshots while minimizing the global description length of the dynamic network. TimeCrunch uses a predefined vocabulary of graph structures, including cliques, stars, cores, and bipartite cores.
Most existing methods, however, do not consider the problem of subjective online summarization, where the summary is iteratively and incrementally conveyed to the analyst as the graph evolves. In this setting, the analyst is progressively updated on all changes up to the current state of the network, relative to their prior knowledge. This problem has two key characteristics that differentiate it from post hoc summarization and therefore require a different approach. First, at any state, it is only possible to use data that has been observed up to that moment; it is impossible to use parts of the dynamic graph that lie in the future. Second, each change that is observed and communicated to the analyst should be relative to what that analyst already knows about the graph.
One motivation for such an approach comes from airline network analysis, where vertices represent airports and (directed) edges represent operating flights or routes between two airports. As the edges in an airline network change with time, it can be considered a dynamic network. Here, an analyst may be interested in learning the informative changes, for example, how the traffic load between different airports is changing in real time. An airline schedule is generated based on comprehensive knowledge of air traffic load management (Bazargan 2016). Hence, a domain analyst may well have prior knowledge/expectations, at the block-hour level, of the total number of routes operated by an airline, the total number of flights, the number of unique routes from each airport, or even the densely connected sets of airports. However, delays are a reality, as schedules are not necessarily robust enough to fully anticipate and accommodate them. Hence, a compact and subjective online summarization has real-time utility for airlines. Note that the application and utility of this approach are not limited to the airline domain but extend to many other real-world scenarios, including evolving co-authorship networks, co-actor networks, and interaction networks.
Our first significant contribution is the introduction of a novel, generic framework for subjective interestingness for sequential data. For this, we build on previous work by De Bie (2011), who first introduced a formalization of subjective interestingness for exploratory data mining, in which the analyst's prior beliefs are modelled as constraints and a background distribution, representing the current knowledge of the analyst, is derived using the maximum entropy principle. The novelty of our framework for sequential data is two-fold. First, the patterns that we define, called 'actions', represent atomic changes to the data that provide information relative to the current knowledge of the analyst. Second, we introduce a novel information gain measure that is motivated by the minimum description length (MDL) principle (Grünwald 2007). With this measure, our approach can automatically discover compact summaries without having to decide on the number of patterns. As such, we are the first to combine approaches for data mining based on subjective interestingness (using the maximum entropy principle) with pattern-based summarization (using the MDL principle).
Our second significant contribution is the instantiation of this generic framework for dynamic graphs. Here we build on the results of van Leeuwen et al. (2016), who instantiated subjective interestingness for dense subgraph discovery in (static) graphs. The concrete actions that we define are add, remove, update, shrink, split, and merge. An instance of each action type is presented in Fig. 1a-f, for a toy example depicting an evolving airline network. Each of these actions adds, updates, and/or removes one or more dense subgraphs to/in/from the current summary, represented by the set $\mathcal{C}_s$ for each state $s$. The set $\mathcal{C}_s$ comprises the analyst's prior beliefs (represented by $\mathcal{B}$) and the dense subgraphs as patterns (represented by $P_i$). In Fig. 1a-f, we indicate the initial summary $\mathcal{C}^I_s$ and the final summary $\mathcal{C}^F_s$ after performing the actions in each state. By iteratively communicating these actions to the analyst, the analyst learns about the relevant changes in the graph (as shown in Fig. 1g) relative to what they already know. The use of our information gain measure ensures that we only communicate actions that provide more information about the data than is required to describe the patterns and corresponding actions, effectively making sure that the analyst always gains information.
Our third and final significant contribution is DSSG, a heuristic algorithm for the online summarization of dynamic graphs by means of iteratively discovering actions. Guided by the information gain criterion, it always considers all possible types of actions but only returns the action that provides the largest gain.
The remainder of the paper is organized as follows. The relevant literature is summarized in Sect. 2, followed by notation and preliminaries in Sect. 3. Our framework for subjective interestingness for sequential data and its online summarization is presented in Sects. 4.1 and 4.2, respectively, leading to the introduction of the problem of online summarization of dynamic graphs in Sect. 4.3. In this context, the DSSG algorithm is presented in Sect. 5. The experimental results on publicly available real-world datasets are discussed in Sect. 6, followed by a case study in the airline domain in Sect. 7. Important features of the proposed framework, key observations, limitations and future scope are discussed in Sect. 8, after which we conclude in Sect. 9.
Fig. 1 (caption fragment) … and $\mathcal{C}^F_s$, respectively; (g) Patterns P1-P5 and corresponding add/merge/shrink/split/update/remove actions can be used to summarize the six consecutive states of the dynamic graph as depicted in a-f

Related work
We divide the relevant literature into the following categories: static graph mining; static graph summarization; dynamic graph mining; and dynamic graph summarization. The dynamic graph summarization category is most closely related to our work; we discuss the other categories for completeness.
Static graph mining Dense subgraph mining is a well-researched problem. The terms cliques, quasi-cliques (Abello et al. 2002; Matsuda et al. 1999), k-cores (Seidman 1983), k-plex (Seidman and Foster 1978), kD-cliques (Luce 1950) and k-club (Mokken 1979) in static graphs have been systematically defined and explored in the literature. Recent work on identifying quasi-cliques includes Tsourakakis et al. (2013) and Veremyev et al. (2016), while Wu and Hao (2015) summarize all methods for solving the maximum clique problem. Although these measures to identify graph structures are objective, van Leeuwen et al. (2016) argued that the interestingness of each graph structure or pattern is subject to prior information in most applications. Along similar lines, Bendimerad et al. (2020) defined subjectively interesting attributed subgraphs.
In line with ideas given by van Leeuwen et al. (2016) and Bendimerad et al. (2020), we also consider the analyst's prior beliefs.
Another popular sub-category of static graph mining is clustering or partitioning of the graph. Most of those methods focus on discovering splits, cuts, or partitions in a graph to identify different regions or communities of interest using spectral partitioning (Alpert et al. 1999), min-max cut (Ding et al. 2001), minimum cut trees (Flake et al. 2004), betweenness measures (Newman and Girvan 2004), or modularity maximization (Newman 2006). These methods cover the graph as a whole, while pattern mining in graph data restricts the knowledge discovery to some areas of interest.
Static graph summarization The idea of static graph summarization is to compress a graph (Navlakha et al. 2008; Koutra et al. 2014) or aggregate nodes/edges in a graph (LeFevre and Terzi 2010; Toivonen et al. 2011; Goebl et al. 2016). It is found to improve query efficiency (LeFevre and Terzi 2010), speed up clustering algorithms (Toivonen et al. 2011), effectively compress a graph dataset (Navlakha et al. 2008), and provide better visualization of a graph dataset. Koutra et al. (2014) describe a graph by identifying structures using a predefined vocabulary of graph structures such as stars, full and near cliques, full and near bipartite cores, and chains, which minimizes the total encoded length of the graph along with the model (based on the minimum description length principle). Another popular objective of static graph summarization is to find influential dynamics in a network through patterns (Goebl et al. 2016). These patterns provide a high-level description of a graph and are considered relevant and informative in the case of real datasets such as social networks, where information propagation is an essential characteristic of the data. Cook and Holder (1994) subjectively summarize a graph by providing a hierarchical description of structural regularities guided by the background knowledge in terms of rules, including compactness, connectivity, coverage and other types of domain-dependent rules. Similar to our proposed approach, the authors also combine the concept of minimum description length with background knowledge. However, we model background knowledge using constraints and the maximum entropy principle.
Dynamic graph mining This category covers methods that identify temporal graph patterns in a dynamic network. Rozenshtein et al. (2017) study interaction networks to find dense and temporally compact patterns. The authors introduce the k-Densest episode identification problem on temporal graphs (Rozenshtein et al. 2018), where an episode is defined as a pair of a time interval and a subgraph. Galimberti et al. (2018) propose the idea of maximal span-cores and span-cores decomposition of temporal networks.
Dynamic graph summarization This category is different from dynamic graph mining: graph summarization methods identify structures and evolution that provide a succinct description of a network, while graph mining methods identify all possible patterns in the network. As our proposed method fits this category, Table 1 shows an overview of both existing methods and ours; we will elaborate on this comparison in the last paragraph of this section.
GraphScope (Sun et al. 2007) was one of the very first methods that focused on summarizing temporal graphs. It partitions the graph into bipartite cores and cliques. Simultaneously, it identifies segments by detecting changes in the encoding cost of a graph segment when a new graph snapshot is presented. The size of the summary depends on the size of the candidate structures generated in each static snapshot of the given dynamic graph, which is not necessarily automatic. Com2 (Araujo et al. 2014) identifies temporal edge-labelled communities in a graph and uses the minimum description length (MDL) principle with Canonical Polyadic (CP) or PARAFAC decomposition. TimeCrunch (Shah et al. 2015) also uses the MDL principle to summarize a temporal graph. The authors identify graph structures, using the vocabulary of graph structures given by Koutra et al. (2014), along with their corresponding temporal presence in terms of one-shot, periodic, flickering, and ranged. Adhikari et al. (2017) summarize a dynamic network by aggregating nodes into supernodes and time pairs into 'super time'. This method creates a flattened (static) graph after aggregation. Each of these methods concerns an instance of MDL-based dynamic graph compression (either lossy or lossless), but none of them directly summarizes how a dynamic graph changes and evolves.
Various methods in the literature have directly or indirectly addressed the problem of summarizing the evolution of a dynamic graph. You et al. (2009) capture repeated addition and removal of subgraphs between two consecutive graph snapshots in a dynamic graph. Scharwächter et al. (2016) proposed to find frequent structural changes, such as triadic closure and homophilic rewiring, in the form of evolution rules. Ahmed and Karypis (2015) summarize graph evolution by capturing co-evolving relational motifs, which occur when all or a majority of the occurrences of a relational pattern (or motif) evolve similarly over time. Robardet (2009) proposed to capture the evolution of isolated pseudo-cliques over time by means of a sequence of five temporal events: formation, dissolution, growth, diminution and stability.
Similarly, Ahmed and Karypis (2012) proposed to epitomize an evolving graph by identifying Evolving Induced Relational States (EIRS). The authors defined EIRS as a sequence of Induced Relational States (IRS), which are a set of vertices that remain connected by similar edges having the same direction and label for several consecutive snapshots (based on a threshold). In EIRS, the time interval of each IRS cannot overlap with other IRS and has several or at least a certain number of common vertices. Lin et al. (2011) focus on discovering evolving communities by analyzing the dynamic interactions between vertices by representing the multi-dimensional and multi-relational characteristics as a relational hypergraph called a 'metagraph'. Another recent method based on TimeCrunch (Shah et al. 2015) that aims to capture the evolution of graph structures is given in the preliminary work by Saran and Vreeken (2019). They capture evolving graph patterns by capturing dynamic events such as growth, split, merge, and change in structure type (e.g., from clique to star) of a pattern. Based on their characteristics, these methods can be referred to as methods for discovering evolving graph patterns.
All methods mentioned in this category thus far are defined for a 'fixed' dynamic graph, i.e., over a fixed time interval, and not for a 'streaming' dynamic graph that is generated on-the-fly and should also be analysed on-the-fly, where the summary should change upon the presentation of a new snapshot of a graph. In other words, these methods do not support online summarization. Recent methods for online dynamic graph summarization, discussed next, include Tang et al. (2016); Khan and Aggarwal (2016); Qu et al. (2016); Tsalouchidou et al. (2020). Tang et al. (2016) and Khan and Aggarwal (2016) generate a graphical sketch of a dynamic graph, aggregating vertices and edge weights, which is updated after each snapshot of a graph sequence. These graphical sketches are useful to improve the efficiency of graph-based queries. Qu et al. (2016) summarize a diffusion network, i.e., a dynamic graph where information propagates with time, by discovering spreading trees (n-ary) as cascades, which grows with a change in state. Recently, Tsalouchidou et al. (2020) proposed the Scalable Dynamic Graph summarization Method (SDGM) to generate an online summary by extending the static graph summarization approach of LeFevre and Terzi (2010). Although these methods provide online summarization, they do not summarize informative state-to-state relative changes in a dynamic graph. That is, they do not provide incremental summaries, where each relative change in the structure of the graph is summarized and communicated to the analyst step by step.
To bridge this gap in the literature, we consider the problem of discovering informative changes in a streaming dynamic graph in an incremental manner. As we are interested in finding all informative changes, we require our method to automatically determine the number of returned patterns. To this end we propose to identify subgraphs that maximally deviate from the current knowledge of the analyst. For this we build on the notion of subjective interestingness proposed by De Bie (2011). To the best of our knowledge, we are the first to consider the problem of subjective, incremental, online graph summarization. This is corroborated by the qualitative comparison in Table 1, which shows the relevant characteristics for all dynamic graph summarization methods discussed in this section.
Since we propose to summarize a dynamic graph by means of dense patterns, we will adapt TimeCrunch (Shah et al. 2015) and SDGM (Tsalouchidou et al. 2020) to establish two baseline methods for empirical comparison in Sect. 6.

Preliminaries
This section defines the notation adopted in this paper, and briefly describes the two most closely related works on which we build in this paper. These are (1) the framework for FORmalizing Subjective Interestingness in Exploratory Data mining (FORSIED) introduced by De Bie (2011), and instantiated for different types of data and patterns; and (2) the work on Subjective interestingness of SubGraph patterns (SSG) in static graphs by van Leeuwen et al. (2016).

Data and notation
A rectangular dataset is a matrix $\mathbf{D} \in \mathcal{D}^{M \times N}$, where the dimension of the dataset is given by $M \times N$ and $\mathcal{D}$ is the domain of an individual cell. A (simple) graph is denoted as $G = (V, E)$, where $V$ is a set of vertices and $E$ is a set of edges such that $u, v \in V$ for each edge $(u, v) \in E$. Its adjacency matrix is a rectangular dataset and is hence represented by $\mathbf{D} \in \mathcal{D}^{|V| \times |V|}$, where $\mathcal{D} = \{0, 1\}$.
A dynamic (rectangular) dataset $\mathbf{D}_T$ changes with time, where $T$ is the timespan of the dataset. This time interval can be segmented into several consecutive intervals, where each interval $t = (t^b, t^f) \subset T$ represents a state $s$, such that $t^b$ is the begin time and $t^f$ is the finish time. For any two consecutive states, $s$ and $s + 1$, time $t^f_s$ is equal to time $t^b_{s+1}$. Thus, a sequence of snapshots $\mathbf{D}_1, \ldots, \mathbf{D}_S$ is observed, indexed by state $s \in \{1, \ldots, S\}$, where $S$ is the total number of states. Note that, in a sequence of snapshots, each $\mathbf{D}_s$ is a static rectangular dataset, such that $\mathbf{D}_s \in \mathcal{D}^{M \times N}$. We refer to such a sequence of snapshots as sequential data.
A dynamic graph, denoted $G_T = (V, E_T)$, is a graph in which each edge is present for a given period within time interval $T$, i.e., $E_T$ is the set of edges that occur in time interval $T$. More specifically, each $e = (u, v, t^b, t^f) \in E_T$ defines an edge between $u, v \in V$ that appears at start time $t^b$ and continues to exist until it disappears at finish time $t^f$. Again, the time interval $T$ can be segmented into several intervals, as seen earlier for dynamic datasets. This segmentation implies that each $t \subset T$ defines a static state $s$ of the dynamic graph, which is essentially a (simple) graph: each edge either exists or not. We denote the static graph obtained by projecting the dynamic graph onto the interval $t$ of state $s$ by $G_s$, and its corresponding adjacency matrix by $\mathbf{D}_s \in \mathcal{D}^{|V| \times |V|}$, such that $\mathcal{D} = \{0, 1\}$. Hence, a dynamic graph $G_T$ can be represented as a sequential dataset $\mathbf{D}_T$, with a sequence of static graph snapshots $G_1, \ldots, G_S$ and a corresponding sequence of adjacency matrices $\mathbf{D}_1, \ldots, \mathbf{D}_S$.
Notably, even when time is not discrete, one can easily discretize it by segmenting it into equal-length intervals (e.g., seconds, minutes, …). As we will see, the length of these intervals determines the granularity at which the approach will identify changes in the data. For instance, in the airline case, it is implausible that (relevant) changes will occur within seconds or even minutes, hence, it may be reasonable to segment time in hours.
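As an illustration of this segmentation, the following minimal sketch turns a list of timed edges into a sequence of adjacency matrices. The function name `snapshots`, the treatment of edges as undirected, and the overlap rule (an edge belongs to a state if its lifetime overlaps the state's interval) are assumptions made for this example and are not prescribed by the framework.

```python
import numpy as np

def snapshots(edges, num_vertices, t_start, t_end, width):
    """Segment timed edges (u, v, t_b, t_f) into per-state adjacency matrices.

    Hypothetical helper: returns an array of shape (S, |V|, |V|) with 0/1 entries.
    """
    num_states = int(np.ceil((t_end - t_start) / width))
    mats = np.zeros((num_states, num_vertices, num_vertices), dtype=np.uint8)
    for u, v, t_b, t_f in edges:
        for s in range(num_states):
            lo = t_start + s * width
            hi = lo + width
            # Assumption: an edge is present in state s if its lifetime
            # [t_b, t_f) overlaps the state's interval [lo, hi).
            if t_b < hi and t_f > lo:
                mats[s, u, v] = 1
                mats[s, v, u] = 1  # assumption: undirected graph
    return mats

# Usage: two edges over a timespan of three unit-length states.
example = snapshots([(0, 1, 0.0, 1.5), (1, 2, 2.0, 3.0)], 3, 0.0, 3.0, 1.0)
```

Choosing a larger `width` merges more activity into each snapshot, which corresponds to the coarser granularity discussed above.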

Subjectively interesting patterns in static graphs
Informally, the FORSIED framework (De Bie 2011) defines the subjective interestingness of a pattern as the information it provides with regard to the analyst's expectations (or prior knowledge), normalized by its complexity. Given a dataset $\mathbf{D}$, the analyst's background distribution $P^*$ is the distribution that maximizes entropy subject to the analyst's prior beliefs, and is given by

$$P^* = \arg\max_{P} \; -\sum_{\mathbf{D}} P(\mathbf{D}) \log P(\mathbf{D}), \tag{1}$$
$$\text{s.t.} \quad \sum_{\mathbf{D}} P(\mathbf{D}) \, f_i(\mathbf{D}) = c_i, \quad \forall B_i \in \mathcal{B}, \tag{2}$$
$$\sum_{\mathbf{D}} P(\mathbf{D}) = 1. \tag{3}$$

The set of constraints enforced in Eq. 2 is presented in a generalized form, where each constraint $B_i \in \mathcal{B}$ is a pair consisting of a function $f_i$ over $\mathbf{D}$ (a property of the data) and a corresponding constant $c_i$, i.e., $B_i = (f_i, c_i)$. The set of constraints $\mathcal{B}$ represents the analyst's prior knowledge or expectations about the data. The exact type(s) of constraints and their interpretation depend on the type and nature of the dataset $\mathbf{D}$.
Next, the interestingness of a pattern $\theta$ is defined as the ratio of the pattern's self-information (denoted SI) to its description length (denoted DL). Self-information is the negative log-probability that the pattern is present in the data, i.e., $-\log(P(\theta \in \mathbf{D}))$, while description length is the number of bits required to describe or communicate the pattern to the analyst.
Instantiating these generic concepts for dense subgraph patterns in static graphs, van Leeuwen et al. (2016) defined the interestingness $I$ of a static graph pattern $\theta = (W, k_W)$, denoting a vertex set $W$ having $k_W$ edges, as

$$I(W, k_W) = \frac{SI(W, k_W)}{DL(W)} = \frac{n_W \, KL\!\left(\frac{k_W}{n_W} \,\middle\|\, p_W\right)}{|W| \log\frac{1-q}{q} + |V| \log\frac{1}{1-q}}, \tag{4}$$

where $n_W$ is the number of possible edges in subgraph $W$, $q$ is a hyperparameter representing the 'expected' probability of a random node to be present in $W$, $KL(a \,\|\, b)$ is the Kullback-Leibler divergence between two Bernoulli distributions with success probabilities $a$ and $b$, and $p_W$ is the probability of the subgraph occurring given background distribution $P^*$. The latter probability is computed as $p_W = \frac{1}{n_W} \sum_{u,v \in W} p_{u,v}$, where $p_{u,v}$ is the probability that an edge between vertices $u$ and $v$ exists as given by $P^*$.
Iterative learning The framework above can be motivated by the observation that compression equates to learning (Grünwald 2007): in order to learn as much as possible about the data, the implicit goal of the analyst is to (internally) represent the data using as few bits as possible. This observation implies minimizing $-\log P^*(\mathbf{D})$, i.e., the length of the data encoded by the background distribution. This can be accomplished by changing the analyst's knowledge of $\mathbf{D}$. Here, a change in the analyst's knowledge of $\mathbf{D}$ implies that a new set of constraints $\mathcal{C}$, with one constraint for each discovered pattern, must be constructed, which is used to update the background distribution $P^*$. Specifically, when a graph pattern is discovered, a constraint is added to ensure that the updated expectations of the analyst conform with the actual number of edges. For instance, when a graph pattern $(W, k_W)$ is presented to the analyst, a new constraint $C_W = (f_W, k_W)$ is added to $\mathcal{C}$, where $f_W$ is a function over the vertices in $W$ that counts the number of edges, i.e., $f_W(\mathbf{D}) = \sum_{u,v \in W, u<v} \mathbf{D}[u, v]$, and $k_W$ is the actual number of edges in the vertex-induced subgraph of $W$. Notably, the solution to the following problem provides the updated background distribution (van Leeuwen et al. 2016):

$$P^* = \arg\max_{P} \; -\sum_{\mathbf{D}} P(\mathbf{D}) \log P(\mathbf{D}), \tag{5}$$
$$\text{s.t.} \quad \sum_{\mathbf{D}} P(\mathbf{D}) \, f(\mathbf{D}) = c, \quad \forall (f, c) \in \mathcal{B} \cup \mathcal{C}, \tag{6}$$
$$\sum_{\mathbf{D}} P(\mathbf{D}) = 1. \tag{7}$$

Hence, the analyst can learn everything about the data by iteratively discovering the most interesting pattern and updating the background distribution after each iteration.
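To make the ratio of self-information to description length concrete, the sketch below scores a vertex set against a background distribution with independent edge probabilities. It assumes that SI is approximated by $n_W \cdot KL(k_W / n_W \,\|\, p_W)$ and that DL is $|W| \log\frac{1-q}{q} + |V| \log\frac{1}{1-q}$; these closed forms, the function names, and the default value of `q` are assumptions of this example rather than a definitive implementation of the cited method.

```python
import math
import numpy as np

def kl_bernoulli(a, b):
    """KL divergence in bits between Bernoulli(a) and Bernoulli(b), 0 < b < 1."""
    total = 0.0
    if a > 0:
        total += a * math.log2(a / b)
    if a < 1:
        total += (1 - a) * math.log2((1 - a) / (1 - b))
    return total

def interestingness(adj, probs, W, q=0.01):
    """Hypothetical scorer: SI / DL for vertex set W in an undirected graph."""
    W = sorted(W)
    pairs = [(u, v) for i, u in enumerate(W) for v in W[i + 1:]]
    n_W = len(pairs)                                   # possible edges in W
    k_W = sum(int(adj[u, v]) for u, v in pairs)        # observed edges in W
    p_W = sum(probs[u, v] for u, v in pairs) / n_W     # mean edge probability
    si = n_W * kl_bernoulli(k_W / n_W, p_W)            # assumed SI approximation
    num_vertices = adj.shape[0]
    dl = len(W) * math.log2((1 - q) / q) + num_vertices * math.log2(1 / (1 - q))
    return si / dl

# Usage: a triangle on {0, 1, 2} plus one extra edge, with a prior belief
# on overall density only (so every edge has the same probability).
adj = np.zeros((6, 6), dtype=int)
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    adj[u, v] = adj[v, u] = 1
probs = np.full((6, 6), 4 / 15)
score_dense = interestingness(adj, probs, {0, 1, 2})
score_sparse = interestingness(adj, probs, {3, 4, 5})
```

Under these assumptions the triangle scores higher than the sparse vertex set, because its density deviates more from what the background distribution expects.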

Proposed approach
In this section, we introduce our novel framework for subjective interestingness for sequential data, which extends the FORSIED framework but also incorporates crucial changes. We introduce the problem of subjective summarization of sequential data, and to solve this problem we propose the method of online summarization of sequential data. Finally, we instantiate this generic problem for dynamic graphs.

Subjective interestingness for sequential data
Given a sequential dataset $\mathbf{D}_T$, we consider the setting where an analyst is interested in learning informative patterns about the data as the snapshots unfold in an online fashion. As with static data, the analyst may already have prior beliefs about the data before the first snapshot; these are represented by a set of constraints $\mathcal{B}$.
When the snapshot corresponding to the first state is analyzed, we aim to find a compact set of constraints, i.e., patterns, that, together with the prior beliefs, minimize the negative log-probability of the data given the implied background distribution. To avoid finding either too many or too complex patterns, we draw inspiration from the minimum description length principle (Grünwald 2007) and use a two-part code to balance the goodness of fit of the data with the complexity of the constraint set. More precisely, we aim to find a new set of constraints $\mathcal{C}_1$ with corresponding background distribution $P^*_1$ that minimizes $-\log P^*_1(\mathbf{D}_1) + L(\mathcal{C}_1)$, where $L$ is a function that computes the encoded length for any given set of constraints. Note that we require an additional set of constraints $\mathcal{C}_1$, beyond the existing set of constraints $\mathcal{B}$, to achieve the optimal (feasible) solution of the above problem. The set of constraints $\mathcal{C}_1$ ensures that the knowledge conveyed by the discovered patterns is reflected in the background distribution $P^*_1$.
For any subsequent snapshot, we now want to adapt what the analyst has learned before; by only providing the analyst with information about changes that have occurred in the data since the previous state, the analyst requires minimal effort, and we obtain a minimal summary. Given the previous, this implies that, for each snapshot $s$ after the first, we need to find a set of constraints $\mathcal{C}_s$ with corresponding background distribution $P^*_s$ that minimizes $-\log P^*_s(\mathbf{D}_s) + L(\mathcal{C}_s \mid \mathcal{C}_{s-1})$, where $L(\cdot \mid \cdot)$ is a function that computes the encoded length for any given set of constraints given another set of constraints; i.e., smaller changes require fewer bits.
With the given discussion, we formally introduce the following problem statement.
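Problem 1 (Subjective Summarization of Sequential Data) Given a sequential dataset $\mathbf{D}_T$, i.e., a sequence of snapshots $\mathbf{D}_1, \ldots, \mathbf{D}_S$, and prior beliefs $\mathcal{B}$, find:
- for $\mathbf{D}_1$: a set of constraints $\mathcal{C}_1$ that minimizes $-\log P^*_1(\mathbf{D}_1) + L(\mathcal{C}_1)$, where $P^*_1$ is computed using constraints $\mathcal{B} \cup \mathcal{C}_1$;
- for $\mathbf{D}_s$, with $s \in \{2, \ldots, S\}$: a set of constraints $\mathcal{C}_s$ that minimizes $-\log P^*_s(\mathbf{D}_s) + L(\mathcal{C}_s \mid \mathcal{C}_{s-1})$, where $P^*_s$ is computed using constraints $\mathcal{B} \cup \mathcal{C}_s$.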

Online summarization of sequential data
Optimally solving each iteration of Problem 1 would require considering a very large search space, i.e., that of all possible constraint sets. Moreover, we do not want to present unordered sets of constraints to the analyst: this would very likely overwhelm the analyst and therefore cause confusion. Instead, we prefer to present atomic changes to $\mathcal{C}$ to the analyst one by one, as is also done in the framework for static data. We will therefore now derive an approach that heuristically approximates Problem 1 by iteratively looking for the largest changes and communicating those to the analyst immediately. After each atomic change $\alpha$, also called an action, the set of constraints $\mathcal{C}$ is updated to a new set $\mathcal{C}'$, and hence the background distribution $P^*$ is updated accordingly. An action $\alpha$ reduces the negative log-probability of the data by updating the background distribution, and we define this reduction as Information Content, IC.
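Definition 1 (Information Content) Given an action $\alpha$, and constraint sets $\mathcal{C}$ (original) and $\mathcal{C}'$ (updated), we define the information content of $\alpha$, denoted by IC, as the difference between the length of the data encoded by the background distributions specified by constraint sets $\mathcal{C}$ and $\mathcal{C}'$:

$$IC(\alpha) = -\log P^*_{\mathcal{C}}(\mathbf{D}) - \left(-\log P^*_{\mathcal{C}'}(\mathbf{D})\right) = \log P^*_{\mathcal{C}'}(\mathbf{D}) - \log P^*_{\mathcal{C}}(\mathbf{D}), \tag{8}$$

where $P^*_X$ is the MaxEnt probability distribution given a set of constraints $X$ (i.e., using Eqs. 1-3).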
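The following sketch illustrates the information content as a difference of two encoded lengths. It assumes that each background distribution is available as a matrix of independent edge probabilities for an undirected graph; the function names and this representation are assumptions of the example, and computing the MaxEnt distributions themselves is out of its scope.

```python
import numpy as np

def neg_log_prob(adj, probs):
    """-log2 P(D): encoded length of an undirected graph under independent edges."""
    iu = np.triu_indices(adj.shape[0], k=1)            # each vertex pair once
    d = adj[iu].astype(float)
    p = np.clip(probs[iu], 1e-12, 1 - 1e-12)           # guard against log(0)
    return float(-(d * np.log2(p) + (1 - d) * np.log2(1 - p)).sum())

def information_content(adj, probs_before, probs_after):
    """IC of an action: reduction in encoded length of the data (in bits)."""
    return neg_log_prob(adj, probs_before) - neg_log_prob(adj, probs_after)

# Usage: a triangle, first under an uninformed background distribution and
# then under one that expects the three edges with high probability.
adj = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)
before = np.full((3, 3), 0.5)
after = np.full((3, 3), 0.9)
ic = information_content(adj, before, after)
```

A positive value means that the updated distribution encodes the observed snapshot in fewer bits than the original one.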
An action on $\mathcal{C}$ can be categorized as one of the following:
1. Addition of a new constraint $C$, i.e., $\mathcal{C}' = \mathcal{C} \cup \{C\}$,
2. Deletion of a constraint $C$, i.e., $\mathcal{C}' = \mathcal{C} \setminus \{C\}$,
3. Update of an already present constraint $C \in \mathcal{C}$, i.e., replacing $C$ with a constraint $C'$, and hence $\mathcal{C}' = (\mathcal{C} \setminus \{C\}) \cup \{C'\}$.
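These three kinds of change can be sketched as pure functions on an immutable constraint set. The `Constraint` class below (a vertex set with an expected edge count) and the function names are hypothetical and only serve to illustrate the set operations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint: vertex set W and its expected edge count k_W."""
    vertices: frozenset
    num_edges: int

def add(constraints, c):
    """Addition: C' = C ∪ {c}."""
    return constraints | {c}

def delete(constraints, c):
    """Deletion: C' = C \\ {c}."""
    return constraints - {c}

def update(constraints, old, new):
    """Update: replace an existing constraint, C' = (C \\ {old}) ∪ {new}."""
    return (constraints - {old}) | {new}

# Usage: add a pattern, then revise its edge count in a later state.
c_old = Constraint(frozenset({0, 1, 2}), 3)
c_new = Constraint(frozenset({0, 1, 2}), 2)
summary = update(add(frozenset(), c_old), c_old, c_new)
```

Returning a new set instead of mutating the old one keeps both $\mathcal{C}$ and $\mathcal{C}'$ available, which is convenient when the two background distributions need to be compared.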
Definition 2 (Description Length) The description length of an action α, denoted DL(α), is defined as the (minimum) number of bits required to encode the changes in the set C when communicated to the analyst.
Remark 1 Given a set of constraints $\mathcal{C}_{s-1}$, let $A$ be an ordered set of actions performed on $\mathcal{C}_{s-1}$ to get an updated set $\mathcal{C}_s$; then the encoded length $L$ of $\mathcal{C}_s$ is computed as:

$$L(\mathcal{C}_s \mid \mathcal{C}_{s-1}) = \sum_{\alpha \in A} DL(\alpha). \tag{9}$$

We now have two different quantities associated with each atomic change $\alpha$, i.e., IC and DL. Maximizing IC and minimizing DL leads to our overall goal of minimizing $-\log P^*_s(\mathbf{D}_s) + L(\mathcal{C}_s \mid \mathcal{C}_{s-1})$. Thus, we discount IC with DL and perform the action with maximal difference at each step. We call this difference information gain and denote it by IG.

Definition 3 (Information Gain) Let $\alpha$ be an action that transforms a given set of constraints $\mathcal{C}$ into an updated set $\mathcal{C}'$. Then, the information gain IG on performing $\alpha$ on $\mathcal{C}$ is given by

$$IG(\alpha) = IC(\alpha) - DL(\alpha). \tag{10}$$

The process of online summarization begins with the initialization of the background distribution $P^*_{\mathcal{B}}$ using the prior belief(s) $\mathcal{B}$ that an analyst may have. At the start of state 1, no patterns have been discovered yet, i.e., $\mathcal{C}_1 = \emptyset$, which implies $P^*_{\mathcal{B} \cup \mathcal{C}_1} = P^*_{\mathcal{B}}$. Then patterns with maximum IG (Eq. 10) are discovered iteratively; for each such pattern a corresponding constraint $C$ is added to $\mathcal{C}_1$, and hence the background distribution $P^*_{\mathcal{B} \cup \mathcal{C}_1}$ is updated (using Eqs. 5-7). Note that $\mathcal{C}_1$ is initially an empty set, thus the only action that can be performed on $\mathcal{C}_1$ is the addition of a new pattern. The process continues until no feasible action can be performed on set $\mathcal{C}_1$. Here, a feasible action is any action that satisfies a user-provided criterion; for example, to be in agreement with the MDL principle, the recommended default is that an action $\alpha$ is feasible if $IG(\alpha) > 0$. The process then moves to the following state. For any state $s$ (except state 1), $\mathcal{C}_s$ is initialized to the final $\mathcal{C}_{s-1}$ and $P^*_{\mathcal{C}_s}$ to the final $P^*_{\mathcal{C}_{s-1}}$. This is followed by iterative actions on $\mathcal{C}_s$ with maximal IG, until no feasible action can be performed.
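The iterative process described above can be sketched as a greedy loop over states. The sketch takes the feasibility criterion to be a strictly positive information gain, computed as IC minus DL; the function `summarize_online` and its four callbacks (candidate generation, IC, DL, and applying an action) are hypothetical placeholders that a concrete instantiation would have to supply, and the sketch does not reflect how DSSG searches for candidates.

```python
def summarize_online(snapshots, candidate_actions, info_content, desc_length, apply_action):
    """Greedy online summarization sketch.

    For each state, repeatedly perform the action with maximal information
    gain IG = IC - DL until no action with IG > 0 remains.
    """
    constraints = frozenset()   # C_1 starts empty; later states reuse the final set
    log = []                    # actions communicated to the analyst, in order
    for s, data in enumerate(snapshots, start=1):
        while True:
            best, best_gain = None, 0.0
            for action in candidate_actions(constraints, data):
                gain = info_content(action, constraints, data) - desc_length(action)
                if gain > best_gain:            # feasibility: IG must be positive
                    best, best_gain = action, gain
            if best is None:                    # no feasible action left in this state
                break
            constraints = apply_action(constraints, best)
            log.append((s, best, best_gain))
    return constraints, log
```

Because every performed action has a strictly positive gain, each step shortens the two-part code of the current snapshot, which is why the number of patterns does not need to be fixed in advance.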

Online summarization of dynamic graphs
The concept of subjective summarization of sequential data can be directly adapted to dynamic graphs by segmenting such a graph into a sequence of static graph snapshots (see Sect. 3.1). By making the data type more specific, however, we can also instantiate the other components of the generic framework (e.g., actions, prior beliefs, constraints, and description length) with more precise definitions. As discussed earlier, a graph pattern θ = (W, k_W) is a subgraph on a set of vertices W ⊂ V that are connected by k_W edges. Thus, by definition, a graph pattern is connected, i.e., there exists a path from every vertex to every other vertex. Note that, since we consider graph patterns, the definition of constraints follows the discussion in Sect. 3.2. We now introduce the following problem statement as an instance of Problem 1.
Problem 2 (Subjective Summarization of Dynamic Graphs) Given a dynamic graph G T consisting of a sequence of snapshots G 1 , . . . , G S , with D s the corresponding adjacency matrix for a state s and prior beliefs B, solve Problem 1 such that each pattern in every set C s is a connected subgraph pattern.
As discussed previously, optimally solving Problem 2 requires considering a very large number of possible constraint sets. Similarly, we address Problem 2 heuristically, by iteratively communicating to the analyst the atomic changes, or actions, that have maximal IG. Based on the properties of a graph pattern and the possible structural changes, we now formalize six specific types of actions, which we use to communicate changes in graph data, as initially depicted in Fig. 1.
The add action communicates a newly discovered subjectively dense subgraph pattern. In Fig. 1a, two patterns, P1 and P2, are identified and added in state S1. A remove action deletes a pattern that no longer holds in the current state, i.e., when the pattern is no longer connected and/or its density decreases substantially. An example is shown in Fig. 1f, where a sparse pattern P5 is removed in state S6; removing a constraint is informative when it has a positive IC.
The other actions are update, merge, shrink, and split, which all represent modifications of constraint(s) already present in $\mathcal{C}$. When the density of a pattern corresponding to an existing constraint increases, this is communicated via update. Thus, a constraint $C = (f_W, k_W) \in \mathcal{C}$ is replaced by a similar but updated constraint $C' = (f_W, k'_W)$. In Fig. 1e, pattern P5 is updated to pattern P5' in state S5, when its density increases compared to its density in state S4 (Fig. 1d). By applying a merge action, two previous patterns are merged to form one new pattern. An example is shown in Fig. 1b, where two patterns, P1 and P2, are merged to create a new pattern P3 in S2.
Actions shrink and split either reduce an existing constraint or decompose one into multiple constraints. A constraint is shrunk when the density of a pattern decreases with the evolution of the graph (see Fig. 1c, where pattern P3 shrinks to form pattern P3' in state S3). Similarly, a constraint can be decomposed into multiple new constraints if the pattern corresponding to an original constraint consists of two or more connected components (see Fig. 1d, where pattern P3' splits into two new patterns P4 and P5 in state S4). In shrink, the original constraint is replaced by a reduced constraint that covers a subset of the original vertices. The different conditions that must be satisfied for each of the six types of actions to be applicable are summarized in Table 2.
Next, the formulation of information content IC and description length DL of each action type is summarized in Table 3. We extend the abstract definition of description length given in the previous section (Definition 2). The description length of an action is the sum of two parts, the first of which encodes the type of action, represented by type(α), and the second of which encodes the details, represented by details(α). For all quantities where the upper limit is not known, we use the universal integer code (Rissanen 1983), which is given by $L_N(n) = \log(2.865064) + \log(n) + \log\log(n) + \ldots$ and sums over all positive terms. If the upper limit is known, we use the uniform code (Grünwald 2007), given by log(n). Note that all logarithms are to base two.
Table 2 Conditions that must be met to perform an action α on a constraint C present in constraint set $\mathcal{C}$, with initial pattern θ_i, resultant pattern θ_f, and density function ρ (defined as the ratio of the number of edges to the maximum possible number of edges in a graph). ✓: true, ✗: false, ?: may or may not be true, -: not applicable
In the description length of α, we describe the type of action using the uniform code over all possible action types, as there is no priority or bias towards any action. Thus, $DL(type(\alpha)) = T_a = \log(l)$, as we require $-\log\frac{1}{l}$ bits. Here, l = 6, as we have defined six action types above. The computation of DL(details(α)) for each action type is shown in Table 3. For instance, details(add) is the sum of the number of bits required to describe the set of vertices ($T_p = DL[(W, k_W)]$, see Eq. 4) and the number of edges in the corresponding vertex-induced subgraph. Instead of describing the number of edges in a subgraph directly, we describe how many edges the subgraph is short of a clique on the same number of vertices. That is, for a subgraph on |W| vertices, where $n_W$ is the maximum number of edges possible between these vertices and $k_W$ is the actual number of edges, we describe the difference between $n_W$ and $k_W$, which takes $T_c$ bits. Thus, a dense subgraph with a high number of edges has a smaller description length, which favours the discovery of dense subgraph patterns. Note that the hyperparameter q in $T_p$ can be used to influence the size of a pattern (see Sect. 3.2).
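The two codes used above can be illustrated with a short Python sketch. The function names are hypothetical and chosen for this example only; the sketch covers the universal integer code, the uniform code, and the cost of encoding the action type, but not the entries of Table 3.

```python
import math

NUM_ACTION_TYPES = 6  # add, remove, update, merge, shrink, split


def universal_code_length(n: int) -> float:
    """L_N(n) = log(2.865064) + log(n) + log log(n) + ..., positive terms only (base 2)."""
    if n < 1:
        raise ValueError("the universal integer code is defined for n >= 1")
    length = math.log2(2.865064)
    term = math.log2(n)
    while term > 0:  # stop as soon as a term is no longer positive
        length += term
        term = math.log2(term)
    return length


def uniform_code_length(n: int) -> float:
    """Bits needed to select one of n equally likely options."""
    return math.log2(n)


def action_type_length() -> float:
    """T_a: bits needed to state which of the six action types is used."""
    return uniform_code_length(NUM_ACTION_TYPES)
```

For example, `action_type_length()` evaluates to log2(6), roughly 2.58 bits, which is paid once for every communicated action regardless of its type.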
In remove, update, shrink, and split, the index of the constraint to be removed is communicated in $T_b$ bits. Similarly, in the case of merge, the indices of the two constraints are communicated in $2 \times T_b$ bits. Since we only consider the merge of two constraints at a time, the term $L_N(|2|)$ is omitted. In addition, for all actions except remove, the information about the edges is communicated in $T_c$ bits. In the case of shrink, the terms $T_d$ and $T_e$ indicate the number of bits required to describe the number of vertices removed from the original pattern and the removed vertices themselves, respectively. In split, the number of resulting constraints is described in $T_f$ bits, each constraint in $T_g$ bits, the vertices in each constraint in $T_h$ bits, and the information about the edges in each component in $T_i$ bits.

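Lemma 1 For an action α that updates a set of constraints $\mathcal{C}$ to $\mathcal{C}'$, IC(α) as defined in Definition 1 is equal to
$$IC(\alpha) = \log P^*_{\mathcal{C}'}(R) - \log P^*_{\mathcal{C}}(R) = \sum_{i,j \in W} \left[ \log P^*_{\mathcal{C}'}(D_{ij}) - \log P^*_{\mathcal{C}}(D_{ij}) \right], \qquad (11)$$
where R is a submatrix of D given by $R = D[W_1, \ldots, W_M; W_1, \ldots, W_M]$, such that W is the set of M vertices covered by the affected constraint(s) $C_\alpha$.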
Proof The proof is straightforward; for completeness, we provide the following details. In Eq. 8, $\log P^*_X(D) = \sum_{i,j \in V} \log P^*_X(D_{ij})$ is a sum over all pairs of vertices. These pairs can be categorized into three groups: (1) both vertices lie in W, (2) neither of the vertices lies in W, and (3) exactly one of the vertices lies in W. Only in the first case is the probability updated on performing the action α; the remaining probability terms are unchanged, i.e., $\log P^*_{\mathcal{C}}(D_{ij}) = \log P^*_{\mathcal{C}'}(D_{ij})$, and hence cancel each other out. Thus, the result follows.
By virtue of Lemma 1, we obtain the following result.

Theorem 1 The complexity of computing the information content IC of an action α is $O(|W|^2)$.
Proof The proof follows from Eq. 11, which is a sum over all pairs of vertices (i, j) with i, j ∈ W. Hence, computing IC has a complexity of $O(|W|^2)$.
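The locality argument behind Lemma 1 and Theorem 1 can be checked numerically with the sketch below. It assumes an undirected simple graph, independent Bernoulli edge probabilities stored as nested lists, and IC taken as the gain in log-probability of the data when moving from the old to the new background distribution; the function names are hypothetical.

```python
import math
from itertools import combinations


def log_prob(adj, prob, vertices):
    """Sum of Bernoulli log-probabilities (base 2) over unordered vertex pairs."""
    total = 0.0
    for i, j in combinations(sorted(vertices), 2):
        p = prob[i][j]
        total += math.log2(p if adj[i][j] else 1.0 - p)
    return total


def information_content(adj, prob_before, prob_after, affected):
    """IC restricted to the pairs inside `affected`, i.e., O(|W|^2) terms."""
    return (log_prob(adj, prob_after, affected)
            - log_prob(adj, prob_before, affected))


# Toy check: only pairs inside W = {0, 1, 2} change their probability.
n, W = 4, {0, 1, 2}
edges = {(0, 1), (0, 2), (1, 2), (2, 3)}
adj = [[1 if (min(i, j), max(i, j)) in edges else 0 for j in range(n)]
       for i in range(n)]
before = [[0.5] * n for _ in range(n)]
after = [[0.9 if i in W and j in W else 0.5 for j in range(n)]
         for i in range(n)]

ic_submatrix = information_content(adj, before, after, W)
ic_full = information_content(adj, before, after, range(n))
```

In this toy example `ic_submatrix` and `ic_full` agree, because all terms for pairs outside W cancel; restricting the sum to W therefore loses nothing.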
As discussed above, we solve Problem 2 by iteratively performing that action (of one of the six types defined above) with maximal IG. Thus, we introduce the problem of online summarization of dynamic graphs (Problem 3). Hence, we heuristically unfold Problem 2 by iteratively solving Problem 3 at each step.
Problem 3 (Online Summarization of Dynamic Graphs) Given the current state s, graph snapshot $G_s$, corresponding adjacency matrix $D_s$, and current constraint set $\mathcal{C}_s$, perform the action α that has maximal information gain IG among all possible actions, such that the pattern(s) obtained after performing α are connected subgraph(s).

Prior beliefs
1. Belief-c: In this case, we model the analyst's knowledge about the total number of edges in the initial snapshot of the data. In other words, the analyst has prior knowledge about the relative edge density of the graph dataset. Solving Eqs. 1-3, De Bie (2011) showed that $P^*$ turns out to be a product of independent Bernoulli distributions, one for each random variable $a_{u,v}$. This distribution is best represented as a matrix $P^* \in [0, 1]^{|V| \times |V|}$ with row and column indices indicating the vertices, such that $p_{u,v} = \frac{\exp(2\lambda)}{1+\exp(2\lambda)} = \rho(G_0)$ is the probability that $a_{u,v} = 1$, i.e., that there is an edge between vertices u and v.
2. Belief-i: Similarly, here the user possesses a belief about the individual degree of each vertex in a snapshot of the data. The maximum entropy distribution turns out to be a similar product of independent Bernoulli distributions, where $p_{u,v} = \frac{\exp(\lambda_u+\lambda_v)}{1+\exp(\lambda_u+\lambda_v)}$ is the probability that $a_{u,v} = 1$.
Updating the background distribution When a pattern θ = (W, k_W) is discovered (through action add), a constraint $C = (f_W, k_W)$ is added to the set $\mathcal{C}$, and $P^*$ is updated using Eqs. 5-7 (van Leeuwen et al. 2016). Upon updating the background distribution, a unique Lagrangian multiplier $\lambda_W$ is introduced for all pairs (u, v) with u, v ∈ W (computed using the bisection method). Similarly, if multiple constraints are present in $\mathcal{C}$, then $p_{u,v}$ is computed as $\frac{\exp(\lambda_u+\lambda_v+\sum_{C \in \mathcal{C}: u,v \in W} \lambda_W)}{1+\exp(\lambda_u+\lambda_v+\sum_{C \in \mathcal{C}: u,v \in W} \lambda_W)}$. Hence, it is efficient to store only the Lagrangian multipliers and compute the probability whenever required.
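Storing only the multipliers and fitting a new $\lambda_W$ by bisection can be sketched as follows. The sketch assumes an undirected graph, that a constraint fixes the expected number of edges inside W to k_W, and a search bracket of [-30, 30]; these assumptions, and all function names, are illustrative and not taken from the DSSG code.

```python
import math
from itertools import combinations


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def edge_probability(u, v, vertex_lambda, pattern_lambdas):
    """p_uv from stored multipliers; pattern_lambdas is a list of (vertex_set, lambda_W)."""
    total = vertex_lambda[u] + vertex_lambda[v]
    total += sum(lam for W, lam in pattern_lambdas if u in W and v in W)
    return sigmoid(total)


def fit_pattern_lambda(W, k_W, vertex_lambda, pattern_lambdas,
                       lo=-30.0, hi=30.0, tol=1e-9):
    """Bisection for lambda_W so that the expected edge count inside W equals k_W."""
    pairs = list(combinations(sorted(W), 2))

    def expected_edges(lam):
        extended = pattern_lambdas + [(set(W), lam)]
        return sum(edge_probability(u, v, vertex_lambda, extended)
                   for u, v in pairs)

    # expected_edges is increasing in lam, so the bracket can be halved safely.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_edges(mid) < k_W:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Removing a constraint then amounts to deleting its entry from `pattern_lambdas`; no probability matrix has to be recomputed or stored.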
If a remove action is performed, the corresponding Lagrangian multiplier is removed from the list to update the background distribution. Similarly, for all other actions, the Lagrangian multiplier(s) corresponding to the original constraint(s) are first removed, and new Lagrangian multiplier(s) are then computed using Eqs. 5-7. Hence, this is an efficient way to update the background distribution.
Feasibility constraint In order to provide the user with a concise summary, we introduce a feasibility constraint to limit the number of actions performed in each state. That is, we consider an action feasible if its information gain is positive, i.e., IG(α) > 0. Although this may be altered according to user preference, the choice is motivated by the MDL principle and ensures that an action always provides more information about the data than it costs to describe the action.

The DSSG algorithm
In this section, we introduce an algorithm called DSSG, of which the step by step procedure is outlined in Algorithm 1. DSSG is a heuristic approach to solve Problem 2 that works in an iterative manner, solving Problem 3 in each step. The overall procedure of DSSG can be summarized as follows.
DSSG starts with an initial graph snapshot G_0, an initial set of constraints B (the analyst's prior beliefs), and a set of constraints C (which is usually ∅ initially). Given this, the maximum entropy distribution P is computed (Line 2). For each state s (Line 3), actions are performed iteratively to solve Problem 3 (Lines 5-10). The process continues until no action can be performed (Line 14). Each performed action consists of an update of the background knowledge (updating P and C), followed by communication of the performed action, B, to the analyst (Line 12). An example can be seen in Fig. 1, where in each state the initial and final sets of constraints (indicated by superscripts I and F, respectively) are represented by C (indexed by subscript s ∈ [1, T]).
The feasibility constraint comes into effect while searching for the best action to be performed in each step (Line 11). The overall best action with the maximal value of IG is selected and returned. If the best action violates the feasibility constraint, then null is returned and the process continues with the next graph snapshot.
The EvaluateAdd procedure is used to discover the best new subgraph pattern with maximum IG, which is a complex problem: the discovery of a new pattern requires the evaluation of all $2^{|V|}$ possible candidate subgraphs. Hence, we use a hill-climber-based search algorithm (SearchPattern, see Algorithm 2) based on the SSG algorithm (van Leeuwen et al. 2016), which was proposed for finding a subjectively interesting subgraph in a static graph. This algorithm starts with a seed pattern H* and recursively adds (Line 3-6) or removes (Line 10-13) vertices to find a pattern with a maximal value of IG. The search stops if a vertex can neither be added nor removed (Line 17). To ensure the connectedness constraint, only vertices neighboring the vertices present in the pattern are checked when adding vertices (Line 3). As this hill climber is likely to suffer from convergence to local optima, we independently run Algorithm 2 for a list of seed patterns (van Leeuwen et al. 2016) and select the single best pattern as the search result. Further, note that the computational cost of naïvely computing IG(add) at each step of the hill climber would be prohibitive, as it would require computing a new Lagrangian multiplier to update the background distribution at each step.
Algorithm 1 DSSG
1: procedure DSSG(G_T, C, B)
2:   Compute Maximum Entropy Distribution for G_0 given B as P
3:   for each G_s ∈ G_T do   ▷ here G_T is a sequence of static graphs (snapshots)
4:     repeat
5:       A ← EvaluateAdd(G_s, P)
6:       R ← EvaluateRemove(G_s, P, C)
7:       U ← EvaluateUpdate(G_s, P, C)
8:       S ← EvaluateShrink(G_s, P, C)
9:       M ← EvaluateMerge(G_s, P, C)
10:      T ← EvaluateSplit(G_s, P, C)
11:      B ← GetBestAction(A, R, U, S, M, T)   ▷ Returns action with max. IG
12:      if B ≠ ∅ then
13:        Update C and P using B and Communicate B to the analyst
14:    until B = ∅   ▷ Move to next snapshot if nothing is to be learned
Algorithm 2 SearchPattern (final step only)
return (H*, I*)   ▷ If nothing increases I*, return the found graph pattern
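The hill climber described above can be sketched as follows. This is an illustration under assumptions rather than a transcription of Algorithm 2: the scoring function `score` is a hypothetical stand-in for IG (or its proxy), each round takes the single best move, and a vertex is only removed if the remaining pattern stays connected.

```python
from collections import deque


def is_connected(neighbors, vertices):
    """Breadth-first check that `vertices` induce a connected subgraph."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen, queue = {start}, deque([start])
    while queue:
        x = queue.popleft()
        for y in (neighbors[x] & vertices) - seen:
            seen.add(y)
            queue.append(y)
    return seen == vertices


def search_pattern(neighbors, seed, score):
    """Greedy hill climber: add a neighboring vertex or drop a vertex while the score improves.

    neighbors: dict mapping each vertex to the set of its adjacent vertices
    seed:      non-empty iterable of vertices forming a connected seed pattern
    score:     callable taking a frozenset of vertices and returning a float
    """
    current = frozenset(seed)
    best = score(current)
    while True:
        # Only vertices adjacent to the pattern are candidates for addition.
        frontier = set().union(*(neighbors[v] for v in current)) - current
        candidates = [current | {v} for v in frontier]
        candidates += [current - {v} for v in current
                       if len(current) > 1
                       and is_connected(neighbors, current - {v})]
        scored = [(score(c), c) for c in candidates]
        if not scored:
            return current, best
        top_score, top = max(scored, key=lambda sc: sc[0])
        if top_score <= best:
            return current, best  # local optimum reached
        current, best = top, top_score
```

Running this function once per seed and keeping the best result mirrors the multi-seed strategy described above; the score always increases strictly, so the loop terminates.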
As this is the same problem as van Leeuwen et al. (2016) faced, we adopt the same solution. That is, the information content IC of a pattern θ = (W, k_W), as defined in Eq. 8, is approximated by SI (Eq. 16). We empirically show in Fig. 2 that Eq. 16 is an adequate approximation of Eq. 8. To obtain Fig. 2, we created a random graph of 20 vertices using the Barabási-Albert model and computed the values of SI (Eq. 16) and IC (Eq. 8) of all possible connected subgraphs, considering the two types of prior belief discussed in Section 4.4. For all candidate subgraphs (and for both types of prior belief), the value of SI is always less than or equal to IC. Although they are not exactly equal, the correlations r = 0.9999 (in Fig. 2a) and r = 0.9948 (in Fig. 2b) are high enough to suggest that SI can be successfully used as a proxy for IC, as is also argued by van Leeuwen et al. (2016). Moreover, computing SI is clearly much faster than computing IC, as it does not require updating the background distribution at each step. Hence, this allows us to discover surprisingly densely connected graph patterns from snapshots of the graph in an efficient way.
Algorithm 3 ShrinkPattern (excerpt)
2: for u ∈ W do
3:   …
6: if I > I* then
7:   return ShrinkPattern(H, P, I)
8: else
9:   return (H*, I*)
EvaluateRemove and EvaluateUpdate are used to evaluate each constraint in C for removal or update, respectively. In these procedures, each constraint in C is independently evaluated by computing the corresponding IG. To compute IC (as in Table 3), we update the background distribution assuming that the action would take place. Of note, the update of the background distribution is rolled back after the evaluation of each constraint. Both of these methods return the respective constraint with maximal IG.
Similarly, EvaluateMerge returns two constraints (in C) or patterns which, when merged, result in a connected graph pattern with maximal IG.
EvaluateShrink is used to evaluate each constraint in C for shrink and the reduced constraint with maximal IG is returned. To shrink a pattern or constraint, we use the procedure ShrinkPattern (Algorithm 3), which recursively removes vertices (Line 2-7) until no increase in IG is observed (Line 9).
EvaluateSplit is used to find the constraint that produces maximal IG upon split. Note that a new pattern that results from a split may shrink in a subsequent iteration; hence, we also evaluate a possible reduction of each resulting pattern upon split using the procedure ShrinkPattern. Thus, EvaluateSplit consists of two parts: (1) first, the different connected components in the original pattern are identified (each component acts as a new pattern or constraint), and (2) then each new pattern is evaluated for shrink.
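The first part of EvaluateSplit, identifying the connected components of a pattern in the current snapshot, can be sketched with a breadth-first search. The function name and the edge-list input format are assumptions made for this example; the subsequent shrink evaluation of each component is not shown.

```python
from collections import deque


def connected_components(pattern, edges):
    """Split the vertex set `pattern` into connected components.

    pattern: iterable of vertices covered by the original constraint
    edges:   iterable of (u, v) pairs present in the current snapshot
    """
    adjacency = {v: set() for v in pattern}
    for u, v in edges:
        if u in adjacency and v in adjacency:  # ignore edges leaving the pattern
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            x = queue.popleft()
            for y in adjacency[x] - seen:
                seen.add(y)
                component.add(y)
                queue.append(y)
        components.append(component)
    return components
```

A pattern is a candidate for split only if this function returns two or more components; each component would then be passed on as a new candidate pattern.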
Complexity In a single iteration of DSSG, six different procedures are executed sequentially; hence, we discuss the complexity of each procedure. The EvaluateAdd procedure runs the hill climber SearchPattern independently k times, for k different seeds. In each iteration of this hill climber, the computation of IG is the most computationally expensive part, with a time complexity of $O(|W|^2)$ (from Theorem 1), where W is the set of vertices in a pattern. This hill climber is a direct adaptation of SSG, and van Leeuwen et al. (2016) showed that this complexity can be reduced to $O(|W|)$. Hence, if the number of neighbors in Algorithm 2 is N, then each iteration takes $O(N|W|)$. Thus, the worst-case complexity of running SearchPattern becomes $O(IN|W|)$, assuming that the hill climber runs for at most I iterations.
In the other procedures, evaluating each constraint in C requires the computation of IG, which takes $O(|W|^2)$ (from Theorem 1). Note that computing the Lagrangian multiplier corresponding to a revised constraint in C requires running the bisection method, which has a complexity of $O(n|W|^2)$. Here, n is the number of iterations required, computed as $\log(\epsilon_0/\epsilon)$, where $\epsilon$ is the given error tolerance and $\epsilon_0$ is the initial bracket size. Thus, the other procedures have a complexity of $O(n|W|^2)$.
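As a small worked example of the iteration count of the bisection method, the sketch below rounds the base-two logarithm of the ratio between the initial bracket size and the tolerance up to an integer. The function name and the example values (a bracket of width 60 and a tolerance of 1e-9) are illustrative assumptions.

```python
import math


def bisection_iterations(initial_bracket: float, tolerance: float) -> int:
    """Halvings needed to shrink a bracket of the given size below the tolerance."""
    if initial_bracket <= 0 or tolerance <= 0:
        raise ValueError("bracket size and tolerance must be positive")
    return max(0, math.ceil(math.log2(initial_bracket / tolerance)))


n_iterations = bisection_iterations(60.0, 1e-9)
```

Each of these iterations evaluates the expected number of edges inside the pattern, which is where the additional factor of $|W|^2$ in the complexity comes from.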
Given that the complexity of the overall algorithm strongly depends on the actual number of iterations, which cannot be computed in advance, we will instead mention empirical runtimes in the experiment section.

Experiments and results
In this section, we will demonstrate the efficacy of the proposed framework and corresponding algorithm, DSSG, by means of quantitative (Sect. 6.3) and qualitative (Sect. 6.5) results on seven publicly available real-world datasets (Sect. 6.1). We also compare the proposed method to baselines based on two recent methods for dynamic graph summarization (Sect. 6.4).

Datasets
In this section, we will use the following seven publicly available datasets, also summarized in Table 4.
High-School Interaction 3 : This dataset has a total timespan of 5 days. Nine hours of interactions are available per day, except for the first day, for which 5 h are available, and the total timespan is segmented into 41 different states of 1 h each.
Workplace Interaction 3 : This is an interaction network of employees at a workplace. It has a total timespan of 10 working days, where interactions for 9 h are available for each day, except for the first day where 10 h of interactions are available. It is segmented into 91 different states of 1 h each. Although the interactions are instantaneous in nature, an edge exists for each interaction which occurred in a state (snapshot).
Table 4 Datasets along with some of their properties. Type indicates if the dataset is a Directed (D) or Undirected (U) graph, |V| is the total number of nodes in the graph, |E_S| is the total number of unique edges without timestamp, |E_T| is the total number of unique edges with timestamp, T is the total time period for which the edges in the graph are considered, t is the time period covered by each individual state, and |S| is the total number of states considered for each dataset
MathOverFlow 4 : This network captures the communication between users on the MathOverFlow website. A timestamped undirected edge exists between two users if one user answers another user's question, comments on another user's question, or comments on another user's answer to any question. The dataset has a total duration of 2560 days. Here we consider a total time period of 6.5 years, segmented into 26 states of 1 quarter (3 months) each. The lifespan of any edge is considered to be three months, i.e., an edge disappears 3 months after the time it appeared in the network.
Reuters Terror Network 5 : This dataset contains words that are present in each news article following the 9/11 terror attack. We build a network of words (as vertices) with a link between them (undirected edge) wherever they appear in the same article. The total time period considered is 66 days, with segments (snapshots) of 1 day each. In each state, the snapshot of the network contains all the words (and edges between them) if they appeared in any news article published on that day.
TheMovieDB: A network of actors (vertices) is considered, with an edge corresponding to a co-acted movie. The data is fetched using the TheMovieDB API. 6 The time period of the network is from year 2009 to 2016, and is segmented into 8 states of 1 year each. All movies in the 8 year time period having actors with popularity score more than 2 are included. Each snapshot contains edges corresponding to movies released in the same year.
DBLP: This is a co-author network, created using the DBLP 7 data of all publications in top-20 Machine Learning and Data Mining conferences 8 over a period of 10 years. The dataset is segmented into 10 states of 1 year each, adding an edge between two authors if they have co-authored at least one publication in the given year.
WebClicks: A network of click requests (directed edges) is created from referrer host to target host (nodes) for the time period between 1 November 2009 and 22 November 2009. To prune the data, 9 we only consider edges with more than 25 requests in a day. Also, the network is segmented into 22 states of 1 day each. That is, the edge remains only for 1 day, given that at least 25 requests were made from referrer host to target host.
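The segmentation of timestamped interactions into snapshots, as used for the datasets above, can be sketched as follows. The function name, the event format, and the `min_count` threshold parameter are assumptions for this illustration; the sketch uses fixed-width windows and does not model dataset-specific preprocessing.

```python
from collections import Counter


def build_snapshots(events, start, width, num_states, min_count=1):
    """Segment timestamped interactions into per-state edge sets.

    events:    iterable of (u, v, t) with a numeric timestamp t
    start:     timestamp at which state 0 begins
    width:     duration covered by each state, in the same unit as t
    min_count: minimum number of interactions for an edge to be kept in a state
    """
    counts = [Counter() for _ in range(num_states)]
    for u, v, t in events:
        state = int((t - start) // width)
        if 0 <= state < num_states:  # drop events outside the considered period
            counts[state][(u, v)] += 1
    return [{edge for edge, c in counter.items() if c >= min_count}
            for counter in counts]
```

For an undirected dataset the caller would normalize each pair first, for example by sorting (u, v), so that both directions map to the same edge.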

Experimental setup
The prior belief for each of the datasets, except for the TheMovieDB dataset, used in this paper is type belief-c. For TheMovieDB type belief-i is used.
Since we use an adaptation of the hill climber given by van Leeuwen et al. (2016), we fix the following parameters as suggested in that article.
1. The parameter q used in the computation of the description length of a pattern (see Table 3) is fixed at 0.01.
2. We use the 'interestingness'-based 'TopK' seeding strategy with k = 10.
The experiments are executed on an Apple MacBook Pro 2018 with a 2.3 GHz Quad-Core Intel Core i5 processor and 8 GB of RAM. The source code and binaries of DSSG, implemented in Python, are available for download at https://github.com/skkapoor/MiningSubjectiveSubgraphPatterns.

Quantitative analysis
In this subsection, we demonstrate the performance of the proposed method on the above mentioned datasets. We evaluate the results in terms of (1) the type of actions performed in each state, (2) the number of patterns (or constraints) required to summarize each state, (3) the densities of the patterns found in each state, (4) the ratio of the vertices covered by the patterns in the dataset in each state, and (5) the compression ratio between the encoding cost of the data given the initial background distribution and given the final background distribution in each state. We also showcase the feasibility of the proposed approach by presenting the time taken for online summarization of all states in each graph dataset. Table 5 presents the results and summarizes the set of found patterns for each dataset by the proposed method.
Number of patterns required to summarize each state We observe the lowest median number of patterns, i.e., 5, for Workplace and the highest, i.e., 62, for DBLP. This is expected, as Workplace has the smallest number of vertices and DBLP the second-largest number of vertices among all considered datasets. WebClicks has the largest number of vertices, yet surprisingly few patterns, between 2 and 7, are found to summarize each state. This is because WebClicks is sparse, with the number of unique edges |E_S| almost equal to the number of vertices |V| (see Table 4). For TheMovieDB, a high number of patterns is observed in the summary of each state, as TheMovieDB is a relatively dense dataset.
We also observe that the patterns found cover a high number of vertices, even though only a limited number of actions is performed. The largest coverage, 58.41%, is observed for HighSchool and the smallest, 0.49%, for DBLP. In the case of WebClicks, a reasonable coverage in the range of 2.19%-4.24% is observed. Hence, we conclude that, depending on the size and density of a dataset, our method adequately identifies the number of patterns required to summarize each state of a dynamic graph.
Table 5 Properties of the found set of patterns (or constraints) C_s in each state s. The minimum, median, and maximum value of each property over all states in each dataset is shown, where the number of constraints in each state is denoted by |C_s|; the total number of performed actions in each state by |A|; the difference between two sets of constraints (C_{s−1} and C_s), in terms of the number of edges added and removed that are covered by patterns in either set, by Δ_C; the overall changes in the dataset between two consecutive states (s − 1 and s), in terms of the number of edges added and removed, by Δ_S; the average of the average density of all the patterns in C_s by ρ; the compression ratio by CR; and coverage, i.e., the fraction of vertices of the dataset covered by all patterns combined. Runtime (in seconds) is the time required to process all states of the dataset, i.e., to obtain a complete solution of Problem 2

Number and type of actions performed
We observe that the number of actions (|A|) performed in each state is consistent with the number of changes taking place in the network upon evolution from one state to another (i.e., the total number of new edges added and old edges removed, denoted by Δ_S). That is, when Δ_S is smaller, a smaller value of |A| is observed, and vice versa. For example, in Reuters only 1 action is performed when changes in the network are small, i.e., a total of 322 edges are either added or removed, and 27 actions are performed when the changes are much larger, i.e., 13,494 edges either appeared or disappeared from the network. The fraction of each type of action performed can be seen in Fig. 3. It is found that add and remove are the two most frequently performed actions, whereas the other types of actions depend on the nature of evolution of the network. It is seen that update is performed only for the High-School network. For WebClicks, no merge or split actions are observed. Hence, the types of actions carried out depend on the topology of the network and the nature of its evolution, to which our proposed algorithm effectively adapts itself.
Fig. 3 The fraction of each type of action used to summarize each dataset (Color figure online)
Quality of patterns We assess the identified patterns through average density 10 ρ and compression ratios CR. 11 Minimizing the encoding cost of the data is only one part of our objective, and we use it to signify the information contained in the patterns: the higher the compression ratio, the more information about the data is provided by the patterns. The maximum compression ratio is observed for TheMovieDB, which is 48.30%, and the minimum of 0.16% is obtained for Reuters. This is accompanied by the observed high values for the average of the average densities of all identified patterns, including the minima of 0.0062 and 0.0070 in case of TheMovieDB and MathOverFlow respectively, which are also higher than the average densities of snapshots of the data. Thus, our method finds subjectively dense and informative patterns.
We also observe this for TheMovieDB, where a more sophisticated belief, i.e., belief-i, is used. In that case, the background distribution closely represents a snapshot of the dataset, so that upon a change of state any action results in a high compression ratio, as is also observed in Table 5.
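The two quality measures used here can be computed as in the sketch below. It assumes that the compression ratio compares the final encoding cost to the initial one, so that a larger value means that more bits are saved; the function names are hypothetical.

```python
def density(num_vertices: int, num_edges: int, directed: bool = False) -> float:
    """Fraction of possible edges that are present in a simple graph."""
    if num_vertices < 2:
        return 0.0
    ordered_pairs = num_vertices * (num_vertices - 1)
    return num_edges / ordered_pairs if directed else 2 * num_edges / ordered_pairs


def compression_ratio(bits_initial: float, bits_final: float) -> float:
    """1 minus the final encoding cost relative to the initial encoding cost."""
    return 1.0 - bits_final / bits_initial
```

For instance, a snapshot that costs 1000 bits under the initial background distribution and 517 bits under the final one has a compression ratio of 0.483, i.e., 48.3%.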
Quality of actions performed We next investigate the sequential approach taken in Problem 3. From the nature of the problem, it is expected that with each performed action, the codelength of the data should decrease and the average of average densities of the identified set of patterns should increase. This is confirmed by Fig. 4, where the codelength is found to be always decreasing and the density is mostly increasing for the DBLP and TheMovieDB networks. We also observe in Table 5 that there is a correlation between the changes captured by the actions (i.e., Δ_C) and the changes in the overall state (i.e., Δ_S) compared to the previous state. For TheMovieDB, we observed a relatively larger value of Δ_C, i.e., 15,556, when Δ_S is also large, i.e., 74,499; for Workplace, we observed a smaller value of Δ_C, i.e., 5, when Δ_S is smaller, i.e., 35. Therefore, the actions capture the changes in the graph state appropriately.
Runtime analysis Last, we discuss the (computation) time taken to run the experiment for each dataset. This comprises various factors, including the time required to compute the background distribution, to execute the hill climber with different numbers of seeds to discover patterns, to create a candidate list for each type of atomic change to be performed, and to update the background distribution. In Table 5, the factors visibly affecting runtime are the size and density of a dataset, and the number of segments considered in a dataset. Overall, the maximum runtime of 79,049 s, which is approximately 22 h for Reuters, appears practical. However, this could be further reduced by optimization and parallelization of the proposed algorithm. Also note that all experiments have been run on a standard laptop.
10 For a graph G = (V, E), ρ = |E| / (|V| (|V| − 1)) (directed) or ρ = 2|E| / (|V| (|V| − 1)) (undirected).
11 CR is 1 minus the ratio of the encoding cost (number of bits, computed as − log_2 P(D)) of the data given the final background distribution to that given the initial background distribution.

Comparison with baseline methods
In Sect. 2 (and specifically in Table 1) we have described in detail how DSSG differs from existing methods for dynamic graph summarization, i.e., it solves a (slightly yet crucially) different problem. Specifically, unlike other methods DSSG summarizes a dynamic graph by discovering state-to-state relative changes in the form of evolving patterns and incrementally updates the analyst's knowledge after each graph snapshot. To empirically demonstrate that DSSG provides good solutions to this problem, we here compare its results to those obtained by two baseline methods adapted from TimeCrunch (TC) (Shah et al. 2015) and Scalable Dynamic Graph summarization Method (SDGM) (Tsalouchidou et al. 2020).
Baseline methods Next we describe how we adapt TC and SDGM to match our problem setting. For TC, at each state s we compute two summaries using TC, for (1) graph sequence G 1 , . . . , G s−1 , and (2) graph sequence G 1 , . . . , G s−1 , G s . The difference between the two resulting summaries is the incremental information communicated to the analyst using two action types, namely add, to communicate patterns that appeared in state s, and remove, to communicate patterns present in state s − 1 but not in s. These actions are encoded as with DSSG (see Table 3), except that the number of action types is two. SDGM provides a summary after each state, hence for SDGM we use the same two action types and encoding to communicate the changes between each two consecutive summaries.
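The adaptation of the baselines, which turns the difference between two consecutive summaries into add and remove actions, can be sketched as follows. The sketch assumes that each pattern is identified by its vertex set, so that two patterns count as the same only if their vertex sets are identical; the function name is hypothetical.

```python
def diff_summaries(previous, current):
    """Turn two consecutive summaries into a list of add and remove actions.

    previous, current: iterables of patterns, each given as a frozenset of vertices
    """
    previous, current = set(previous), set(current)
    added = sorted(current - previous, key=sorted)    # appeared in state s
    removed = sorted(previous - current, key=sorted)  # present in s - 1 only
    return [("add", p) for p in added] + [("remove", p) for p in removed]
```

Patterns present in both summaries generate no action, so only the incremental information is communicated to the analyst.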
As neither TC nor SDGM considers prior knowledge, for consistency in comparison we start from the same initial background distribution as for DSSG. The background distribution is updated after each action, exactly as in our approach. Whereas our approach automatically selects the number of patterns needed to summarize the changes, TC and SDGM do not. For TC we set the (maximum) number of actions to be the number of patterns found by DSSG in each state. For SDGM, which identifies a preset number of supernodes when producing a summary in each state, we set the number of supernodes for each dataset to the maximum number of patterns found by DSSG among all states. Note that providing this information from DSSG is potentially favorable to TC and SDGM.
Evaluation criterion The objective of our main problem, i.e., Problem 2, is to minimize $-\log P^*_{C_s}(D_s) + L(C_s \mid C_{s-1})$ for each state s of a dynamic graph. Hence, we assess the three methods using this function as a measure. For simplicity, we denote the number of bits required to encode a graph snapshot $D_s$ given background distribution $P^*_{C_s}$, i.e., $-\log P^*_{C_s}(D_s)$, by $L(D_s)$, and the number of bits required to encode all atomic changes, i.e., $L(C_s \mid C_{s-1})$, by DL. In the following, we use superscripts I and F to represent the Initial and Final values of a state, respectively. Apart from absolute values, we also investigate the difference between these initial and final values, which is equivalent to the total IG achieved by performing all actions found for a state s.

Results Starting with Fig. 5f, depicting overall compression by means of normalized $L^F(D_s) + \mathrm{DL}$ over all states, we observe that DSSG yields lower (=better) values for four datasets; TC has better results for Workplace and DBLP. This shows that DSSG generally succeeds in finding better solutions to the overall problem, with TC often close and occasionally better. If we now turn our attention to the results for IG in Fig. 5e, we observe that DSSG is the only method that always finds actions having positive information gain. For Workplace, where TC seemed to perform better on a high level, TC finds patterns that provide negative information gain, which conflicts with our problem statement, and therefore the results are suboptimal from the perspective of our problem. For DBLP, however, TC performs better than DSSG with regard to both $L^F(D_s) + \mathrm{DL}$ and IG. We therefore investigate the individual components of IG, i.e., $L^I(D_s)$, $L^F(D_s)$, IC, and DL, in Fig. 5a-d respectively. In Fig. 5a, b, we see that the encoded sizes of the data at the start and end of each state are (logically) very similar to each other, but they are also quite similar to those in Fig. 5f: the size of the data is a relatively large part of the total compressed size. The (normalized) encoded sizes of the data obtained by DSSG are typically smaller than those obtained by the other methods.
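To make the evaluation measure concrete, the sketch below computes the per-state objective and the resulting total information gain. It rests on an assumption that is ours rather than stated in this section: before any action is performed the description length is zero, so that the total IG is the initial codelength minus the final codelength minus DL. Function and parameter names are illustrative.

```python
def objective(data_bits: float, description_bits: float) -> float:
    """Per-state objective: bits to encode the snapshot plus bits to encode the actions."""
    return data_bits + description_bits


def information_gain(bits_initial: float, bits_final: float, description_bits: float) -> float:
    """Total IG of a state: initial objective minus final objective.

    Assumption: no actions have been encoded at the start of a state,
    so the initial description length is 0.
    """
    return objective(bits_initial, 0.0) - objective(bits_final, description_bits)


# Actions that save 300 bits on the data at a cost of 120 bits are worthwhile.
gain_positive = information_gain(1000.0, 700.0, 120.0)
# Actions that save only 50 bits but cost 80 bits yield a negative gain.
gain_negative = information_gain(1000.0, 950.0, 80.0)
```

A negative value, as in the second call, corresponds to the situation described for TC on Workplace: the patterns cost more bits to communicate than they save on the data.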
When we study the distributions of information content in Fig. 5c, we observe that both DSSG and TC succeed in identifying patterns with high information content. The high IC values for TC come at the cost of higher values for DL though: Fig. 5d shows that the description lengths required to communicate the patterns are larger for TC than for DSSG, also for DBLP. Given that the same encoding is used for TC as for DSSG, and the number of patterns is fixed for TC, this means that the patterns found by TC are larger and often (but not always) less informative. (Also, note that DSSG is able to employ other action types, enabling compact yet informative summaries.) SDGM clearly solves a (very) different problem than DSSG, as it finds patterns with both low information content and large description lengths.
In summary, we conclude that when we adapt TC and SDGM for the problem that we consider in this paper, they perform less well than DSSG. TC finds summaries that are similarly informative yet more complex than those of DSSG. SDGM, on the other hand, generally finds complex summaries that are far less informative than those identified by the other methods. We would like to stress that this should not come as a surprise though, and should certainly not disqualify either TC or SDGM: they have been designed to solve other problems than DSSG. The above results demonstrate that our problem and approach are indeed different from those considered by TC and SDGM, corroborating our proposed approach.

Qualitative analysis
In this subsection, we discuss how the summary created by our proposed approach can be meaningful to a domain expert. Since we provide a summary of the changes in a dataset, the effectiveness of the discovered patterns can be assessed by the information captured in the sets of patterns and the actions performed on them. We analyze the results obtained for DBLP and TheMovieDB.

DBLP For the DBLP graph, we discuss one of the various captured chains of subgraph patterns, which demonstrates the evolution of the communities of 92 authors centered mainly around Christos Faloutsos from Year 2010 to 2015. This evolution is shown in Fig. 6. Initially (Fig. 6a), in Year 2010, two surprisingly dense communities, shown as patterns 'A' and 'B', are discovered, with Christos Faloutsos as a common link between the two. In the following year, these two communities condense and merge to form a single community, shown as pattern 'C' (Fig. 6b), with Christos Faloutsos and U Kang among the prominent names. This collaboration network shrinks the next year (Fig. 6c). In Year 2013 another very densely connected set of authors is discovered, shown as pattern 'D' in Fig. 6d. Surprisingly, in the subsequent year, this set of authors splits into two different communities of three authors each: (1) Lisa Friedland, David D. Jensen and Amanda Gentzel, and (2) Christos Faloutsos, Jay Yoon Lee and Danai Koutra. However, the latter set of authors merges with a newly discovered densely connected set of authors centered around Christos Faloutsos and Evangelos E. Papalexakis, shown as pattern 'H' in Fig. 6e. Finally, in Year 2015 the two communities in which Christos Faloutsos is the common link, i.e., pattern 'C' and a part of pattern 'H', merge to form one community, with Neil Shah starting to collaborate with Leman Akoglu and others. In short, we captured how the community around one author with a large number of collaborations evolves over time.
TheMovieDB In this network, we discuss the discovered evolution of different patterns or communities of 1019 actors from Year 2012 to 2017, as shown in Fig. 7. For each found pattern, we also find the associated genres using the hypergeometric test. A genre is considered to be significant if the p-value after Bonferroni correction (with factor 19) is less than 1e−1. During Year 2012, two patterns 'A' and 'B' are discovered (Fig. 7a). Pattern 'A', with significant genres Action and Comedy, includes vertices such as Liam Neeson, Josh Pence, David Gyasi and Nick Holder, all with high vertex degree. Pattern 'B' comprises Sally Field and Lee Pace as high-degree vertices and has Adventure and Fantasy as significant genres. In Year 2013, pattern 'A' splits into 8 resulting patterns ('C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'). This suggests that these 8 patterns represent 8 different communities of actors. Surprisingly, among these 8 patterns (which are all disjoint), 6 patterns (all except 'F' and 'G') merge to form pattern 'S' only after pattern 'K' is discovered (Fig. 7b). Hence, the actors of patterns 'C', 'D', 'E', 'H', 'I' and 'J' are indirectly connected through the actors in pattern 'K'. Some of the notable actors of pattern 'S' include James Badge Dale, Kyle Chandler, Kirsten Dunst and Will Smith. Pattern 'S' has Romance, Crime and Western as its three significantly associated genres. Pattern 'S' is decomposed into 12 different patterns in Year 2014 (Fig. 7c). All 12 resulting patterns have different significantly associated genres, such as Action with pattern 'T', Science Fiction with 'U', Documentary with 'W', Fantasy with 'X', Animation and Family with 'Y' and 'AE', War with 'Z' and 'AD', Crime with 'AA', Drama with 'AB' and Comedy with 'AC'. Most of the patterns disappear in the following two years, i.e., 2015 and 2016, except patterns 'T' and 'AE'. In Year 2017, pattern 'AE' merges with a newly discovered pattern 'AF', resulting in pattern 'AG'. Thus, patterns 'T' and 'AG' are observed in Year 2017 (Fig. 7d). Some of the prominent actors in pattern 'T' are Sylvester Stallone, Lady Gaga, Ben Kingsley and Alison Brie. Pattern 'AG' includes actors like Dustin Hoffman and Oprah Winfrey and has Animation, Fantasy and Adventure as significant genres.
This case presents how the collaboration between actors evolves over time. The genres that are significantly associated with each pattern imply that our algorithm successfully identifies different and evolving subgroups (or communities) in the network.
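The genre association test used for TheMovieDB can be sketched as a one-sided hypergeometric test followed by a Bonferroni correction. The text does not spell out what is counted, so the sketch assumes that we count how many actors in a pattern are associated with a genre, compared with the whole network. The function names are illustrative; the factor 19 and the threshold 0.1 are taken from the text.

```python
from math import comb


def hypergeom_upper_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) when drawing `draws` items without replacement from `population`
    items, of which `successes` carry the property of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def is_significant(p_value: float, num_tests: int = 19, alpha: float = 0.1) -> bool:
    """Bonferroni correction: multiply the p-value by the number of tests (capped at 1)."""
    return min(1.0, p_value * num_tests) < alpha


# Hypothetical counts: 1000 actors, 50 of them linked to a genre,
# and a pattern of 20 actors containing 10 of those 50.
p_strong = hypergeom_upper_tail(10, 1000, 50, 20)
# A much weaker association: 3 of 3 pattern actors in a population of 10 with 4 linked.
p_weak = hypergeom_upper_tail(3, 10, 4, 3)
```

Using exact integer binomials from the standard library avoids a dependency on a statistics package; for large populations a library routine would be the more usual choice.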

Case study: airline flight network
To explore how the proposed approach and algorithm could be used in a real-world scenario, we now present a case study on the US flight network. Flight networks are typical examples of dynamic graphs that one would like to analyze on the fly, e.g., to detect and monitor delays as early as possible.
Dataset We use the scheduled and actual flight operating data for the month of January 2017, with 298 airports (considered as vertices) and 450,017 flights operated in that month. The dataset has features such as scheduled departure and arrival time, along with actual departure and arrival time, for each flight. Using these features we create two types of networks: (1) in a scheduled flight network, a directed edge for a given time interval is included from origin to destination airport if at least one flight was scheduled to depart or arrive in that interval; (2) in an actual flight network, a directed edge is included between two airports if at least one flight actually departed or arrived in that interval.
For either type of network we create 31 independent instances, one for each day of January 2017. Each network is segmented into 20 sequential snapshots, or states, of 1 h each (from 0400 h to 2400 h, all converted to UTC−7). The motivation behind choosing 1 h as the length of a snapshot is that airlines manage their operations in blocks of 1 h duration each. For simplicity, we do not consider cancelled flights in this case study.
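The construction of the hourly snapshots can be sketched as follows. This is an illustrative reading of the description above, not the code used for the case study: flights are assumed to be given as (origin, destination, departure minute, arrival minute) tuples, with times expressed as minutes since midnight of the same day in a single time zone, and events outside the 0400-2400 h window are simply ignored.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed record layout: (origin, destination, departure minute, arrival minute).
Flight = Tuple[str, str, int, int]
Edge = Tuple[str, str]


def hourly_snapshots(
    flights: Iterable[Flight], start_hour: int = 4, end_hour: int = 24
) -> Dict[int, Set[Edge]]:
    """Build one directed edge set per 1 h block.

    An edge (origin, destination) is included in a block if at least one
    flight on that route departs or arrives within the block.
    """
    snapshots: Dict[int, Set[Edge]] = {h: set() for h in range(start_hour, end_hour)}
    for origin, destination, departure_min, arrival_min in flights:
        for minute in (departure_min, arrival_min):
            hour = minute // 60
            if hour in snapshots:
                snapshots[hour].add((origin, destination))
    return snapshots


flights = [
    ("MSP", "ATW", 14 * 60 + 10, 15 * 60 + 5),  # departs 1410, arrives 1505
    ("ORD", "SFO", 3 * 60 + 30, 4 * 60 + 20),   # departs before the window opens
]
snapshots = hourly_snapshots(flights)
```

Feeding scheduled times into the same function yields the scheduled network, and feeding actual times yields the actual network, so the two summaries are built on directly comparable snapshots.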
Approach We use DSSG to independently summarize both the scheduled and the actual flight network. In both cases, we assume the analyst to have a prior belief on the number of routes scheduled to be operated from each airport in the initial snapshot (i.e., the total number of airports from which at least one flight is arriving and the total number of airports to which at least one flight is departing). We then inspect the resulting summaries.

Summaries As the data is large and dynamic, visualizing all patterns or the complete summary at once is not practical. Instead, Fig. 8 visualizes the sequence of actions identified by our method on a given flight network, to provide a high-level overview, or fingerprint, of the summary. Such fingerprints can then be compared to spot deviations between the scheduled and actual dynamic networks.
Comparing summaries An analyst could investigate the discovered patterns (as shown in Sect. 6.5), but here we first investigate the differences between the obtained summaries, to learn about unexpected events (here: delays) causing the observed network to differ from the expected network. For illustrative purposes, we use the scheduled network of day 14, and actual networks of days 14 and 21.
Inspecting the fingerprints in Fig. 8 shows that the actual flight network of day 14 behaves differently from both the scheduled flight network of day 14 (Fig. 8a) and the actual flight network of the same day one week later (Fig. 8b). For example, in the initial snapshot (0400-0500 h) in Fig. 8a, the prior distribution sufficiently described the scheduled flight network of day 14 and hence no new patterns are discovered. In the actual flight network of that day, however, two patterns are discovered for that snapshot. A closer look at the data reveals that this is caused by flights that operated either ahead of time or delayed. In Fig. 8b, similar observations can be made for the actual flight networks of two days exactly one week apart. To further investigate the causes of deviations, an analyst could inspect the patterns and actions. DSSG provides a sequence of actions (in descending order of IG) that an analyst could learn from, especially when supported by an environment for interactive data and pattern exploration (see Discussion).
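Comparing two fingerprints can also be done programmatically. In the sketch below, a fingerprint is assumed to be a mapping from snapshot index to the list of action types performed in that snapshot; this representation and the action names in the example are hypothetical, chosen only to illustrate the comparison.

```python
from collections import Counter
from typing import Dict, List

# Assumed representation: snapshot index -> action types performed in that snapshot.
Fingerprint = Dict[int, List[str]]


def deviating_snapshots(first: Fingerprint, second: Fingerprint) -> List[int]:
    """Return the snapshots in which the two fingerprints use a different
    multiset of action types (the order within a snapshot is ignored)."""
    deviations = []
    for snapshot in sorted(set(first) | set(second)):
        if Counter(first.get(snapshot, [])) != Counter(second.get(snapshot, [])):
            deviations.append(snapshot)
    return deviations


scheduled = {0: [], 1: ["add", "add"], 2: ["merge"]}
actual = {0: ["add", "add"], 1: ["add", "add"], 2: ["merge", "remove"]}
deviations = deviating_snapshots(scheduled, actual)
```

Snapshots flagged in this way are candidates for closer inspection of the underlying patterns; the comparison itself says nothing about the cause of a deviation.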
Inspecting patterns To further understand the differences between the flight networks, we consider two typical block hours, i.e., 1400-1500 and 1500-1600 h. Figure 9 shows the top 5 patterns with regard to information content (IC) for the same three networks as above. Note that this means that we only show patterns that are newly discovered or revised in the current state.

Fig. 9 The top 5 patterns with regard to information content discovered from each respective flight network. Color coding: the pattern with highest information content is shown in red, followed in order by magenta, green, blue, and orange. Labels indicate airport codes (Color figure online)

From Fig. 9a we observe that, for the scheduled flight network of day 14 during 1400-1500 h, four out of the five patterns are star-shaped, with hub airports. In the first pattern (shown in red) MSP is the hub, with flights departing for airports such as ATW, LNK and MEM. Similarly, patterns 2 (magenta) and 4 (blue) have SNA and ORD as hubs, respectively. Pattern 3 (green) has STL and EWR as hubs, where flights are departing, and two other airports, OMA and OAK, where flights are arriving. These patterns indicate that a large number of flights are scheduled to depart from hubs like MSP, SNA, ORD and STL, while flights are expected to arrive at OAK and OMA. Finally, pattern 5 is a connected set of airports including SFO, XNA, ORD and SCE.
In the actual flight network for the same timeslot in Fig. 9c, the most informative pattern is a set of densely connected airports including SLC, DEN, LAS, and SEA (shown in red). The second pattern (magenta) is similar to the most informative pattern found in the scheduled network (red in Fig. 9a), with MSP as hub. Patterns 3, 4, and 5 are also star-shaped, with hubs SLC, JAX, and SEA respectively. Upon investigating the underlying data we find that patterns 1 and 3 comprise flights having a combined positive delay (flights departing and/or arriving late) of 1083 min and 23 min, respectively. This is a relevant discovery, as 1083 min is a large combined delay and pattern 1 was not found in the scheduled data. For pattern 2, which we did find in the scheduled network, no positive delay is observed (instead we find a combined negative 'delay' of roughly 9 min, which is very moderate). For patterns 4 and 5 negative delays are observed. Similar observations can be made for the block hour in Fig. 9c, d.
The fingerprints of Fig. 8b already suggested that the actual flight networks of days 14 and 21 differ, and this is confirmed by the different top 5 patterns shown in Fig. 9e, f. Interestingly, none of these patterns is present in either the scheduled or actual flight network of day 14, and these patterns are also found to correspond to substantial positive and/or negative delays.
Together, these observations indicate that by comparing the summaries and patterns discovered by DSSG, an analyst can learn about sets of connected airports where structural operational deviations from the schedule occurred, which often resulted in delays. As such, this case study served to illustrate how our approach could be used in a real-world scenario where online and incremental analysis of structural changes in dynamic graphs can render valuable insights.

Discussion
We propose a framework for summarizing sequential datasets in an online setting. We define information gain using both the maximum entropy principle and the minimum description length principle. This measure makes it possible to quantify the informativeness not only of a pattern, but also of the proposed actions (or atomic changes) in our framework, which in turn allows us to capture the evolution of a graph by means of evolving patterns. The proposed generic framework for subjective summarization of sequential data can be further instantiated for different types of evolving datasets, such as event sequence databases. In this paper, we instantiated the proposed generic framework for dynamic (simple) graphs.
This work focuses on the discovery of an online summary of dynamic graphs, by iteratively identifying actions with maximum information gain. The summary of a dynamic network contains a set of subgraph patterns (or constraints) along with captured changes in those (chains of) patterns over time. The findings from the experiments performed on different networks indicate that (1) the generated summaries are informative with regard to the analyst's prior knowledge about the data, with relatively high observed compression ratios; (2) the sets of subgraph patterns identified to summarize the networks are found to be relatively dense; and (3) the discovered evolving patterns provide an informative sequence that can be further inspected and analyzed. Also, with the proposed measures of information gain and information content, we show in the airline case study that our method can be used to rank the found patterns.
We observe during the experiments that a pattern might appear regularly or sporadically in different snapshots of a dynamic network. This leads to a situation where our method learns and forgets the same pattern multiple times. However, on each occasion, our method treats the same pattern as newly acquired knowledge. It would be interesting to identify these instances while summarizing a network over time. A way to address this limitation could be to label each subgraph pattern and explore the similarity between two subgraph patterns. Thus, similar to TimeCrunch (Shah et al. 2015), the periodicity of a pattern could be explored. Another limitation of our work is the consideration of prior belief of the analyst. In this setting, we only consider that the analyst has prior knowledge on the initial snapshot and is interested in observing the changes in the network. A different setting may consider that the analyst knows about the different snapshots of the network.
One future opportunity is to improve the scalability of the proposed framework. The runtime of the proposed algorithm is currently higher than that of the two methods to which we compared DSSG, namely TimeCrunch (Shah et al. 2015) and SDGM (Tsalouchidou et al. 2020). Notably, these two methods have highly optimized implementations that use parallel and distributed computing capabilities. For now, DSSG sequentially executes multiple procedures, including the independent seed runs of the hill climber. These procedures are largely independent of each other and could be executed simultaneously. Hence, DSSG has several inherent features that may allow a parallelized implementation. This would significantly reduce the runtime and improve the scalability of the algorithm. Another future opportunity is the development of a tool based on the proposed framework, for interactive visualization and exploration of the changes identified in a dynamic network. Such a tool would provide a user-friendly platform for analysts to learn how a network evolves over time.
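As an illustration of how independent seed runs could be executed concurrently, the sketch below runs a toy hill climber from several seeds using Python's standard `concurrent.futures` module and keeps the best result. The hill climber here is a placeholder that maximizes a simple one-dimensional function; it is not the DSSG pattern search, and all names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Tuple


def toy_score(x: float) -> float:
    """Placeholder objective with a single maximum at x = 3."""
    return -(x - 3.0) ** 2


def hill_climb(seed: int) -> Tuple[float, float]:
    """One seeded run; returns (score, solution). Runs share no state."""
    rng = random.Random(seed)
    x = rng.uniform(-10.0, 10.0)
    score, step = toy_score(x), 1.0
    while step > 1e-3:
        improved = False
        for candidate in (x - step, x + step):
            if toy_score(candidate) > score:
                x, score, improved = candidate, toy_score(candidate), True
        if not improved:
            step /= 2  # refine the search once no neighbour improves
    return score, x


def best_of_seeds(seeds: Iterable[int], max_workers: int = 4) -> Tuple[float, float]:
    """Run all seeds concurrently and return the best (score, solution)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return max(pool.map(hill_climb, seeds))


best_score, best_x = best_of_seeds(range(8))
```

A thread pool is used here only to keep the example portable; for CPU-bound searches in Python, `ProcessPoolExecutor` offers the same `map` interface and would be the more likely choice for an actual speed-up.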

Conclusion
We presented the novel problem of subjective summarization of sequential data in an online manner. As a specific instance of this generic problem, online summarization of dynamic graphs was introduced. We presented a framework to solve this problem, which builds on existing ideas related to the maximum entropy principle, the minimum description length principle, and subjectively interesting subgraph patterns. We then introduced an efficient algorithm, called DSSG, and evaluated it in extensive experiments on real-world datasets. Through experimental results, we demonstrated the effectiveness of the proposed algorithm. The generated summaries are found to be informative with regard to the analyst's prior knowledge about the data. We conclude this from the observed substantial compression ratios and the fact that compression equates to learning. We have also found different sequences of patterns that evolved over time in a network. We also presented a case study and demonstrated a potential use of the proposed method in the airline domain. Comparison of two different summaries of the airline network, using the scheduled and the actual flight data, revealed potentially informative events. As part of future work, it would be interesting to extend the proposed method to capture the periodicity of patterns, and to extend it to multigraphs, weighted graphs, and attributed graphs. Finally, as part of our ongoing and future work, we aim to develop a tool for interactive visualization and exploration of the found patterns.