Fast filtering and animation of large dynamic networks
Cite this article as: Grabowicz, P.A., Aiello, L.M. & Menczer, F. EPJ Data Sci. (2014) 3: 27. doi:10.1140/epjds/s13688-014-0027-8
Detecting and visualizing the most relevant changes in an evolving network is an open challenge in several domains. We present a fast algorithm that filters subsets of the strongest nodes and edges representing an evolving weighted graph and visualizes it either by creating a movie or by streaming it to an interactive network visualization tool. The algorithm is an approximation of the exponential sliding time-window that scales linearly with the number of interactions. We compare the algorithm against rectangular and exponential sliding time-window methods. Our network filtering algorithm: (i) captures persistent trends in the structure of dynamic weighted networks, (ii) smoothens transitions between the snapshots of a dynamic network, and (iii) uses limited memory and processor time. The algorithm is publicly available as open-source software.
Network visualization is widely adopted to make sense of, and gain insight from, complex and large interaction data. These visualizations are typically static and incapable of dealing with quickly changing networks. Dynamic graphs, where nodes and edges churn and change over time, can be an effective means of visualizing evolving networked systems such as social media, similarity graphs, or interaction networks between real-world entities. The recent availability of live data streams from online social media motivated the development of interfaces to process and visualize evolving graphs. Dynamic visualization is supported by several tools. In particular, Gephi supports graph streaming with a dedicated API based on JSON events and enables the association of timestamps with each graph component.
While there is some literature on dynamic layout of graphs, little work has been done so far on developing information filtering techniques for dynamic visualization of large and quickly changing networks. Yet, for large networks in which the rate of structural changes in time could be very high, determining the nodes and edges that can represent and transmit the salient structural properties of the network at a certain time is crucial to produce meaningful visualizations of the graph evolution.
We contribute to filling this gap by presenting a new graph filtering and visualization tool called fastviz that processes a chronological sequence of weighted interactions between the graph nodes and dynamically filters the most relevant parts of the network to visualize. Our algorithm:
• captures persistent trends in structural properties of dynamic networks, while removing no longer relevant portions of the networks and emphasizing old nodes and links that show fresh activity;
• smoothens transitions between the snapshots of a dynamic network by leveraging short-term and long-term node activity;
• uses limited memory and processor time and is fast enough to be applied to large live data streams and visualize their representation in the form of a network.
The remainder of this paper is structured as follows. First, we introduce related studies in Section 2. Next, we introduce the fastviz filtering method for dynamic networks in Section 3. We compare this method against rectangular and exponential sliding time-window approaches and show the advantages of our method. Finally, we present visualizations created with our filtering method for four different real datasets in Section 4, and conclude the study.
2 Related work
Graph drawing is a branch of information visualization that has acquired great importance in complex systems analysis. A good pictorial representation of a graph can highlight its most important structural components, logically partition its different regions, and point out the most central nodes and the edges on which the information flows more frequently or quickly. The rapid development of computer-aided visualization tools and the refinement of graph layout algorithms allowed increasingly higher-quality visualizations of large graphs. As a result, many open tools for static graph analysis and visualization have been developed in the last decade. Among the best known we mention Walrus, Pajek, Visone, GUESS, Networkbench, NodeXL, and Tulip. Comparative studies of different tools have also been published recently.
The interest in depicting the shape of online social networks and the availability of live data streams from online social media motivated the development of tools for animated visualizations of dynamic graphs, in offline contexts, where temporal graph evolution is known in advance, as well as in online scenarios, where the graph updates are received in a streaming fashion. Several tools supporting dynamic visualization emerged, including GraphAEL (http://graphael.cs.arizona.edu/), GleamViz (www.gleamviz.org), Gephi (gephi.org), and GraphStream (graphstream-project.org). Although static visualizations based on time-windows, alluvial diagrams, or matrices have been explored as solutions to capture the graph evolution, dynamic graph drawing remains the technique that has attracted the most interest in the research community so far. Compared to static visualizations, dynamic animations present additional challenges: user studies have shown that they can be perceived as harder to parse visually, even though they have the potential to be more informative and engaging.
As a result, a large corpus of work about the theoretical concepts on good visualization practices, especially for dynamic graphs, has been produced in the last two decades. Besides the work done in defining efficient update operations on graphs, several principles about good graph visualizations have been proposed and explored in different studies. Friedrich and Eades defined high-level guidelines for a good visualization of graph evolution with animations, including uniform, smooth and symmetrical movement of graph elements with minimization of edge crossings, and gave an overview of some techniques that make the visualization more enjoyable, such as fadeout deletion of nodes. Graph readability has been measured in user studies in relation to several tasks; the experimental findings highlight the importance of visualization criteria such as minimizing bends and edge crossings and maximizing cluster separation in facilitating the viewer’s interpretation and understanding of the graph. A general concept that has long been studied in relation to the quality of dynamic graph visualization is the mental map that the viewer has of the graph structure. In practical terms, the placement of existing nodes and edges should change as little as possible when a change is made to the graph, under the hypothesis that if the mental map is preserved the parsing of the visual information is faster and more accurate. More recent work has reappraised the importance of the mental map in the comprehension of a dynamic graph series, while identifying some cases in which it may help (e.g., memorability of the graph evolution, following long paths, recognition of recurrent patterns, tracking a large number of moving objects).
More generally, there are several open fronts in empirical research in graph visualization to identify the impact of certain factors on the quality of the animation (e.g., speed, interactivity). An extensive overview of this aspect has been conducted recently by Kriglstein et al. Methods to preserve the stability of nodes and the consistency of the network structure by leveraging a hierarchical organization of nodes have been proposed. User studies have shown that hierarchical approaches that collapse several nodes into larger meta-nodes can improve graph readability in cases of high edge density. The graph layout also has a significant impact on the readability of graphs. Some work has been done to adapt spectral and force-directed graph layouts to incremental layouts that recompute the position of nodes at time t based on the previous positions at time t − 1, minimizing displacement of vertices, or to propose new “stress-minimization” strategies to map the changes in the graph.
Although much exploration has been done in the visualization principles to achieve highly-readable animations, two aspects have been overlooked so far.
First, not many techniques to extract and visualize the most relevant information from very large graphs have been studied yet. Graph decomposition has been used in a static context to increase the readability of the network by splitting it into modules to be visualized separately, while sliding time-windows have been employed to discard older nodes and edges in visualization of graph evolution. A hierarchical organization of nodes according to some authority or centrality measure allows the graph to be visualized at different levels of detail, eliminating the need to display all nodes and edges at once. Some work has been done on interactive exploration by blending different visualization paradigms and time-varying clustering. Indices to measure the relevance of events in a dynamic graph at both node and community level have also been proposed, even if they have not been applied to any graph animation task. Yet, none of these techniques has been tested on very large data and none of the modern visualization tools provide features for the detection of the most relevant components of a graph at a given time. On the other hand, quantitative studies on the characterization of temporal networks have been conducted, but with no direct connection with the dynamic visualization task.
Last, the visualization of large graphs in an online scenario, where node and edge updates are received in a live stream, and the related practical implications of dynamic visualizations, have rarely been considered. In this context, only some exploratory work has been carried out on information selection techniques for dynamic graph visualization, including solutions based on temporal decay of nodes and edges, node clustering, and centrality indices.
3 Network filtering
Each entry of the input represents the occurrence of interactions of a given weight between nodes at a given epoch time. Entries with more than two nodes are interpreted as interactions happening between each pair of members of the clique with the respective weight. Multiple interactions between the same pair of nodes are aggregated by summing their corresponding weights. The advantage of the clique-wise format over the pairwise format is that the size of input files is smaller.
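The clique-wise interpretation described above can be illustrated with a minimal Python sketch. The whitespace-separated entry layout (`<epoch> <node> ... <weight>`) and the names `expand_cliques` and `aggregate` are assumptions made for this example; they are not taken from the released tool.

```python
from collections import defaultdict
from itertools import combinations


def expand_cliques(lines):
    """Yield (time, node_a, node_b, weight) for every node pair of each entry.

    Assumed (hypothetical) entry layout: "<epoch> <node> <node> ... <weight>".
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 4:  # need a time, at least two nodes, and a weight
            continue
        time, nodes, weight = int(fields[0]), fields[1:-1], float(fields[-1])
        # An entry with more than two nodes is a clique: one interaction per pair.
        for a, b in combinations(sorted(set(nodes)), 2):
            yield time, a, b, weight


def aggregate(interactions):
    """Sum the weights of repeated interactions between the same pair."""
    totals = defaultdict(float)
    for _, a, b, weight in interactions:
        totals[(a, b)] += weight
    return dict(totals)


sample = ["100 a b 1", "101 a b c 2"]
weights = aggregate(expand_cliques(sample))
```

In this sample the second entry is a three-node clique, so it contributes to all three pairs, and the pair (a, b) accumulates the weights of both entries.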
3.2 Filtering criterion
In the first stage of the algorithm, at most $N_b$ nodes with the highest strengths are saved in the buffer together with the interactions among them, where $N_b$ denotes the size of the buffer. The strength $s_i$ of a node $i$ is the sum of the weights of all connections of that node, i.e., $s_i = \sum_j w_{ij}$, where $w_{ij}$ is the weight of an undirected connection between nodes $i$ and $j$. Whenever a new node, which does not appear in the buffer yet, is read from the input, it replaces the node in the buffer with the lowest strength. If an incoming input involves a node that is already in the buffer, then the strength of the node is increased by the weight of the incoming connection. To emphasize the most recent events and penalize stale ones, a forgetting mechanism that decreases the strengths of all nodes and the weights of all edges is run periodically, once every forgetting period, by multiplying their current values by a forgetting factor. This process leads to the removal of old inactive nodes having low strength and the retention of old nodes with fresh activity and high strength.
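The buffering and forgetting steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the class and parameter names are hypothetical, a dictionary replaces the adjacency matrix, and the strengths of an evicted node's neighbors are left unchanged for simplicity.

```python
class StrengthBuffer:
    """Bounded buffer of the strongest nodes (illustrative sketch only)."""

    def __init__(self, max_nodes, forgetting_factor, forgetting_period):
        self.max_nodes = max_nodes          # buffer size
        self.factor = forgetting_factor     # multiplier applied when forgetting
        self.period = forgetting_period     # time between forgetting steps
        self.strength = {}                  # node -> strength
        self.weight = {}                    # frozenset({a, b}) -> edge weight
        self.last_forget = None

    def _forget(self, time):
        """Decay all strengths and weights once per elapsed forgetting period."""
        if self.last_forget is None:
            self.last_forget = time
        while time - self.last_forget >= self.period:
            for node in self.strength:
                self.strength[node] *= self.factor
            for edge in self.weight:
                self.weight[edge] *= self.factor
            self.last_forget += self.period

    def _admit(self, node, keep=()):
        """Insert a new node, evicting the weakest buffered node if full."""
        if node in self.strength:
            return
        if len(self.strength) >= self.max_nodes:
            candidates = [n for n in self.strength if n not in keep]
            weakest = min(candidates, key=self.strength.get)
            del self.strength[weakest]
            self.weight = {e: w for e, w in self.weight.items()
                           if weakest not in e}
        self.strength[node] = 0.0

    def add(self, time, a, b, w):
        """Process one pairwise interaction of weight w at the given time."""
        self._forget(time)
        self._admit(a)
        self._admit(b, keep=(a,))  # never evict the partner just admitted
        self.strength[a] += w
        self.strength[b] += w
        edge = frozenset((a, b))
        self.weight[edge] = self.weight.get(edge, 0.0) + w


buf = StrengthBuffer(max_nodes=2, forgetting_factor=0.5, forgetting_period=10)
buf.add(0, "a", "b", 4.0)
buf.add(10, "a", "c", 1.0)  # one forgetting step, then "c" replaces "b"
```

After the second interaction, the strengths of "a" and "b" have been halved by forgetting, "b" has been evicted as the weakest node, and "a" has gained the weight of the new connection.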
3.3 Filtering criterion versus rectangular and exponential sliding time-windows
In general, the forgetting period is fixed, therefore there is only one free parameter controlling the filtering, namely the forgetting factor, which we assign according to the dynamic network, i.e., the faster the network densifies in time, the more aggressive the forgetting we use (see Appendix B for more details about the values of parameters). In the following paragraphs, we analyze the dynamics of several structural properties of the networks produced with fastviz, rectangular, and exponential sliding time-window methods having equal aggregating areas.
Due to this fact the computational complexity of sliding time-window methods increases in time, whereas it is bounded in fastviz. Since network structural properties such as average degree and clustering depend on the size of the network, we calculate these properties for the subgraphs of equal size, i.e., for the $N_b$ strongest nodes of the full network produced by each of the sliding time-window methods, where $N_b$ is the size of the fastviz buffer (Figures 3C-J). For simplicity, we refer to these subgraphs of $N_b$ nodes as the buffered networks.
Second, we find that the networks produced with our filtering method do not experience drastic fluctuations of the global and local clustering coefficients and degree assortativity, which are especially evident for the rectangular time-window (Figures 3E, G, H, and I). We conclude that the fastviz filtering produces smoother transitions between network snapshots than the rectangular sliding time-window. This property of our method may improve readability of visualizations of such dynamic networks.
Finally, fastviz captures persistent trends in the values of the properties by leveraging the short-term and long-term node activity. For instance, it captures the trends in degree, clustering coefficients, and assortativity that are less visible with the rectangular time-window, while they are well visible with the exponential time-window (Figures 3C-F, I, and J). Note that the high average degree obtained for networks produced with the exponential time-window corresponds to the nodes that are active over a prolonged time-span, whose activity is aggregated over an unbounded aggregation period, and the number of nodes is unbounded as well. On the contrary, the rectangular sliding time-window shows the degree aggregated over a finite time-window, while fastviz limits the number of tracked nodes, leading to a lower reported average degree.
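The comparison in this section relies on time-windows with equal aggregating areas. One plausible way to make this concrete, stated here as an assumption rather than as the definition used in the analysis, is to equate the total weight-time contributed by a single unit interaction under each scheme. With step-wise forgetting, an interaction counts fully for one forgetting period, then is multiplied by the forgetting factor in each subsequent period, which gives a geometric series. A rectangular window of a given width and an exponential window with a given time constant each have an area equal to that width or constant. The function names below are hypothetical.

```python
def stepwise_area(forgetting_period, forgetting_factor):
    """Area under a unit weight multiplied by the factor once per period.

    period * (1 + factor + factor**2 + ...) = period / (1 - factor)
    """
    if not 0.0 <= forgetting_factor < 1.0:
        raise ValueError("forgetting factor must lie in [0, 1)")
    return forgetting_period / (1.0 - forgetting_factor)


def matched_windows(forgetting_period, forgetting_factor):
    """Window parameters whose aggregating area equals the step-wise one."""
    area = stepwise_area(forgetting_period, forgetting_factor)
    return {
        "rectangular_width": area,          # unit weight for `area` seconds
        "exponential_time_constant": area,  # integral of exp(-t / tau) is tau
    }


windows = matched_windows(forgetting_period=10.0, forgetting_factor=0.5)
```

Under this reading, a forgetting period of 10 seconds with a forgetting factor of 0.5 would be matched by a rectangular window 20 seconds wide.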
3.4 Network updates for visualization
In the second stage, for the purpose of visualization, the algorithm selects the $N_v$ nodes with the highest strength, where $N_v$ is the number of visualized nodes, and creates a differential update to the visualized network consisting of these nodes and the connections between them. Each such differential update is meant to be visualized in the resulting animation of the network, e.g., as a frame of a movie.
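A differential update of this kind can be sketched in a few lines of Python. The function names, the dictionary layout of the update, and the tie-breaking by node name are assumptions made for illustration; the sketch also omits any edge-weight threshold.

```python
def top_nodes(strength, count):
    """Return the `count` strongest nodes; ties are broken by node name."""
    ranked = sorted(strength, key=lambda n: (-strength[n], n))
    return set(ranked[:count])


def differential_update(previous, strength, weight, count):
    """Compare the new visualized subgraph with the previous one.

    `previous` is a (nodes, edges) pair of sets; edges are frozensets of
    two node names. Returns the update and the new (nodes, edges) state.
    """
    nodes = top_nodes(strength, count)
    edges = {edge for edge in weight if edge <= nodes}  # both ends selected
    prev_nodes, prev_edges = previous
    update = {
        "add_nodes": sorted(nodes - prev_nodes),
        "delete_nodes": sorted(prev_nodes - nodes),
        "add_edges": sorted(tuple(sorted(e)) for e in edges - prev_edges),
        "delete_edges": sorted(tuple(sorted(e)) for e in prev_edges - edges),
    }
    return update, (nodes, edges)


strength = {"a": 3.0, "b": 2.0, "c": 1.0}
weight = {frozenset(("a", "b")): 1.0, frozenset(("b", "c")): 1.0}
first, state = differential_update((set(), set()), strength, weight, 2)

strength["c"] = 5.0  # node "c" becomes the strongest
second, state = differential_update(state, strength, weight, 2)
```

The first update adds nodes "a" and "b" with their edge; after "c" overtakes "b", the second update removes "b" together with the edge that depended on it and adds "c".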
The comparison of the evolution of structural properties of the corresponding buffered and visualized networks shows that these networks differ significantly for each of the filtering methods (compare Figure 3 vs. Figure 5). This difference is the most salient in the case of the rectangular time-window, which yields considerably larger fluctuations of structural properties than the other methods. In the cases of fastviz and the exponential time-window some structural properties show evolution that is qualitatively similar for buffered and visualized networks, e.g., the average degree and the global clustering coefficient (Figures 3C-F vs. Figures 5C-F). We conclude that the structure of the visualized network differs significantly from the structure of the buffered network, although this difference is smaller for fastviz than for the rectangular sliding time-window.
3.5 Computational complexity
The computational complexity of the buffering stage of the algorithm is $O(E N_b)$, where $E$ is the total number of the pairwise interactions read (the cliques are made of multiple pairwise interactions) and $N_b$ is the number of nodes in the buffer. Each time an interaction includes a node that is not yet stored in the buffered graph, the adjacency matrix of the graph needs to be updated. Specifically, the weakest node is replaced with the new node, so on the order of $N_b$ entries in the adjacency matrix are zeroed, which corresponds to $O(N_b)$ operations. The memory usage scales as $O(N_b^2)$, accounting for the adjacency matrix of the buffered graph. (For certain real dynamic networks, the buffered graph is sparse. In such cases, one can propose more optimized implementations of fastviz. Here, we focus on limiting the time complexity so that it scales linearly with the number of interactions and describe the generic implementation that achieves it.) The second, update-generating, stage has computational complexity of $O(U N_b \log N_b)$, where $U$ is the total number of differential updates, which is a fraction of $E$ and is commonly many times smaller than $E$. (Typically, a large number of interactions is aggregated to create one differential update to the visualized network. In the examples that we show in the next section, one update aggregates from 400 to 2 million interactions. Therefore, $U$ is from 400 to 2 million times smaller than $E$.) This term corresponds to the fact that the strengths of all buffered nodes are sorted each time an update to the visualized network is prepared. The memory trace of this stage is very low and scales as $O(N_v)$, where $N_v$ is the number of visualized nodes. We conclude that our method has computational complexity that scales linearly with the number of interactions. It is therefore fast, that is, able to deal with extremely large dynamic networks efficiently.
In this section, we describe animations of exemplary dynamic graphs filtered with fastviz. Principally, the sequence of graph updates can be converted into image frames that are combined into a movie depicting the network evolution. We implement this visualization technique and use it to create the network animations described below. Alternatively, the updates can be fed directly to the Gephi Streaming API to produce an interactive visualization of the evolving network. The Gephi Streaming API allows graph streaming from a client to a server where Gephi is running. In such a case, the graphs are streamed directly from our filtering system to the Gephi server without any third-party modules. In Appendix A, we introduce implementation details of both approaches. Finally, corresponding animations can be created by other visualization tools fed with the fastviz updates; we highly encourage their development.
Table 1 Statistics of the experimental datasets (Osama bin Laden’s death, IMDB movie keywords, US patent title words)
We use data obtained through the Twitter gardenhose streaming API, which covers around 10% of the tweet volume. We focus on two events: the announcement of Osama bin Laden’s death and the 2013 Super Bowl. We consider user mentions and hashtags as entities and their co-occurrence in the same tweet as interactions between them.
The first video (Figure 2A) shows how the anticipation for the Super Bowl steadily grows on early Sunday morning and afternoon, and how it explodes when the game is about to start. Hashtags related to #commercials and concerts (e.g., #beyonce) are evident. Later, the impact of the #blackout is clearly visible. The interest about the event drops rapidly after the game is over and stays low during the next day.
The video about the announcement of Osama bin Laden’s death (Figure 2B) shows the initial burst caused by @keithurbahn and how the breaking news was spread by users @brianstelter and @jacksonjk. The video shows that the news appears later via #cnn and is announced by @obama. The breaking of this event on Twitter is described in detail by Lotan.
4.3 IMDB movies
We use a dataset from IMDB of all movies, their year of release and all the keywords assigned to them (from imdb.to/11SZD). We create a network of keywords that are assigned to the same movies. Our video (Figure 2C) shows an interesting evolution of the keywords from “character-name-in-title” and “based-on-novel” (first half of the 20th century), through “martial-arts” (70s and 80s) to “independent-film” (90s and later), “anime” and “surrealism” (2000s).
We use a set of US patents issued between 1976 and 2010. We analyze the appearance of words in their titles. Whenever two or more words appear in the title of a patent we create a link between them at the moment when the patent was issued. To improve readability we filter out stopwords and the generic frequent words: “method,” “device” and “apparatus.” Our video (Figure 2D) shows that at the beginning of the period techniques related to “engine” and “combustion” were popular, and later start to cluster together with “motor” and “vehicle.” Another cluster is sparked by patents about “magnetic” “recording” and “image” “processing.” It merges with a cluster of words related to “semiconductor” and “liquid” “crystal” to form the largest cluster of connected keywords at the end of the period.
4.5 Other visualizations
Other than these experimental datasets, on-demand animations of Twitter hashtag co-occurrence and diffusion (retweet and mention) networks can be generated with our tool via the Truthy service (truthy.indiana.edu/movies). Hundreds of videos have already been generated by the users of the platform and are available to view on YouTube (youtube.com/user/truthyatindiana/videos).
The datasets in our case studies are fairly diverse in topicality, time span, and size, as shown in Table 1. Nevertheless, our method is able to narrow down the visualization to meaningful small subgraphs with fewer than 600 distinct nodes in all cases. The high performance of the algorithm makes it viable for real-time visualizations of large live data streams. On a desktop machine the algorithm producing differential updates of the network took several minutes to finish for the US patents and less than two minutes for the other datasets. Given such performance, it is possible to visualize in real time highly popular events such as the Super Bowl, which produced up to 4,500 tweets per second.
Tools for dynamic graph visualization developed so far do not provide specialized ways to dynamically select the most important portions of large evolving graphs. We contribute to filling this gap by proposing an algorithm to filter nodes and edges that best represent the network structure at a given time. Our method captures trends and smoothens the dynamics of structural properties of weighted networks by leveraging the short-term and long-term node activity. Furthermore, our filtering method uses limited memory and processor time, making it viable for large live data streams. We implemented our filtering algorithm in open-source tools that take as input a stream of interaction data and output a movie of the network evolution or a live Gephi animation. As future work, we wish to improve our algorithm by means of further optimization and to enhance the tools by providing a standalone module for live visualization of graph evolution.
Appendix A: Implementation details and source code
We have implemented two independent tools described in the manuscript. The first tool is the fastviz algorithm. The second tool converts the sequence of updates into image frames that are combined into a movie depicting the network evolution. We release the source code of both tools (see the project website github.com/WICI/fastviz). Here, we describe the two tools in more detail.
The first tool is the fastviz algorithm. It takes as input a chronological stream of interactions between nodes and converts it into a set of graph updates, in the JSON format, that account only for the most relevant part of the network. In the network filtering stage, the algorithm stores a buffered network of bounded size $N_b$, limiting the computational complexity and memory usage of the algorithm. In the second stage, for the purpose of visualization, the algorithm selects the $N_v$ nodes with the highest strength and all edges between these nodes that have weight above a certain threshold, the minimal edge weight. The subgraph induced by the $N_v$ nodes is compared with the subgraph in the previous state and a differential update is created. The updates are created at every time interval that is determined by the time contraction parameter. A value of 10 for this parameter means that the time will flow in the visualization 10 times faster than in the data given as the input (see Appendix B). The differential updates are written to the output in the form of a JSON file formatted according to the Gephi Streaming API. We choose the JSON format specifically for its compatibility with the Gephi Streaming API. In short, each line of the JSON file corresponds to one update of the graph structure and contains a sequence of JSON objects that specify the addition/deletion/attribute change of nodes and edges. We also introduced a new type of object to deal with labels on the screen, for example, to write the date and time on the screen.
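To give a feeling for what such update objects may look like, the following Python sketch serializes a differential update into node and edge events. The event keys (`an`, `ae`, `de`, `dn` for adding or deleting nodes and edges) follow our understanding of the Gephi graph streaming format; the input dictionary layout, the edge identifiers, the attributes, and the function name are assumptions for illustration and do not reproduce the exact output of the released tool.

```python
import json


def to_streaming_events(update):
    """Serialize a differential update into JSON event strings.

    `update` is a hypothetical dictionary with the keys "add_nodes",
    "delete_nodes", "add_edges" and "delete_edges"; edges are (a, b) pairs.
    """
    events = []
    for node in update["add_nodes"]:            # nodes first, so edges can attach
        events.append({"an": {node: {"label": node}}})
    for a, b in update["add_edges"]:
        edge_id = f"{a}-{b}"
        events.append({"ae": {edge_id: {"source": a, "target": b,
                                        "directed": False}}})
    for a, b in update["delete_edges"]:         # edges before their end nodes
        events.append({"de": {f"{a}-{b}": {}}})
    for node in update["delete_nodes"]:
        events.append({"dn": {node: {}}})
    return [json.dumps(event, sort_keys=True) for event in events]


update = {
    "add_nodes": ["a", "b"],
    "delete_nodes": ["z"],
    "add_edges": [("a", "b")],
    "delete_edges": [("y", "z")],
}
events = to_streaming_events(update)
```

Emitting node additions before edge additions, and edge deletions before node deletions, keeps every intermediate state of the streamed graph consistent.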
The second tool converts the sequence of updates into image frames that are combined into a movie depicting the network evolution. To this end, the sequence of updates produced by the filtering algorithm is fed to a Python module that builds a representation of a dynamic graph, namely an object that handles each of the updates and reflects the changes to its current structure. The transition between the structural states of the graph determined by the received updates is depicted by a sequence of image frames. Each differential update corresponds to one visualization frame, i.e., one frame of an animation. In its initial state, the nodes in the network are arranged according to the Fruchterman-Reingold graph layout algorithm. The choice of the layout is arbitrary and other layouts can be used and compared. However, due to the focus of this study on the filtering method, rather than the quality of the visualization, we do not explore any other layout algorithms. For each new incoming event, a new layout is computed by running N iterations of the layout algorithm, using the previous layout as a seed. Intermediate layouts are produced at each iteration of the algorithm. Every intermediate layout is converted to a PNG frame, and the frames are combined with the mencoder tool to produce a movie that shows a smooth transition between different states. The movie is encoded at 30 frames per second. To prevent nodes and edges from appearing or disappearing abruptly in the movie, we use animations that smoothly collapse dying nodes and expand new ones. A configuration file allows modification of the default movie appearance (e.g., resolution, colors) and layout parameters (see the project website).
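The idea of seeding each layout with the previous one and keeping every intermediate iteration as a frame can be sketched with a simplified force-directed layout in the spirit of Fruchterman and Reingold. This is a pure-Python illustration under stated assumptions: the function name, the cooling schedule, and the constants are hypothetical and are not those of the released module.

```python
import math
import random


def layout_frames(nodes, edges, seed_positions, iterations, rng=None):
    """Return one {node: (x, y)} dictionary per layout iteration.

    Nodes present in `seed_positions` start from their previous position;
    new nodes start at a random position.
    """
    rng = rng or random.Random(0)
    pos = {}
    for n in nodes:
        if n in seed_positions:
            pos[n] = list(seed_positions[n])
        else:
            pos[n] = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    k = 1.0 / math.sqrt(max(len(nodes), 1))  # ideal distance between nodes
    frames = []
    for step in range(iterations):
        temperature = 0.1 * (1.0 - step / iterations)  # maximum move, cooling
        disp = {n: [0.0, 0.0] for n in nodes}
        for i, a in enumerate(nodes):                   # pairwise repulsion
            for b in nodes[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force
        for a, b in edges:                              # attraction along edges
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force
        for n in nodes:                                 # move, capped by temperature
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            move = min(length, temperature)
            pos[n][0] += disp[n][0] / length * move
            pos[n][1] += disp[n][1] / length * move
        frames.append({n: (p[0], p[1]) for n, p in pos.items()})
    return frames


previous = {"a": (0.0, 0.0), "b": (0.5, 0.0)}
frames = layout_frames(["a", "b", "c"], [("a", "b")], previous, iterations=5)
```

Because each move is capped by the temperature, nodes that already had a position drift only slightly between consecutive frames, which is what makes the resulting transition look smooth.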
We release the source code of both tools with the documentation under the GNU General Public License (see the project website github.com/WICI/fastviz). Together with the tools we release the datasets used in this paper and instructions on how to recreate all the examples of animations presented in this manuscript. Additionally, the updates created with fastviz can be fed directly to the Gephi Streaming API to produce an interactive visualization of the evolving network. The corresponding instructions can be found on the project website.
Appendix B: Algorithm parameters
Table: Values of the parameters of the fastviz algorithm for the introduced case studies (Osama bin Laden’s death, IMDB movie keywords, US patent title words)
The time contraction corresponds to the number of seconds in the data time scale that are contracted to one second of the visualization. The larger the time span of the dataset, the larger this parameter should be in order to keep the length of the visualization fixed. For instance, if the time span of the network is 10 hours, and one wants to see its evolution in a 10-second-long animation, then the time contraction should be set to 3,600. It is crucial to provide an appropriate value for this parameter, because a value that is too large will create just a few network updates and a very short animation, while a value that is too small will create a large number of updates, making the JSON file very big and the animation very long.
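The arithmetic of the example above can be written as a small helper. The function names are hypothetical and serve only to illustrate the relation between the time span of the data, the desired animation length, and the time contraction.

```python
def time_contraction(data_span_seconds, animation_seconds):
    """Seconds of data time contracted into one second of animation."""
    if animation_seconds <= 0:
        raise ValueError("animation length must be positive")
    return data_span_seconds / animation_seconds


def animation_length(data_span_seconds, contraction):
    """Length of the animation, in seconds, for a given time contraction."""
    return data_span_seconds / contraction


ten_hours = 10 * 3600
contraction = time_contraction(ten_hours, animation_seconds=10)
```

For the 10-hour example this gives a time contraction of 3,600; doubling the parameter would halve the animation to 5 seconds.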
The minimal edge weight is a threshold above which edges appear in the visualization. A low value of this parameter may result in many edges of low weight appearing in the animation, while a high value may prevent any edges from being visualized. In case a user does not have any information about the visualized network, we recommend leaving this parameter at its default value of 0.95, which will visualize all edges of standard weight 1 or higher.
The forgetting factor decides how fast older interactions among nodes are forgotten in comparison with more recent interactions. This parameter can be tuned individually for the purpose of the visualization. In general, the faster the network densifies in time, the more aggressive the forgetting should be, i.e., the lower the forgetting factor should be. Keeping the default value of this parameter is safe, although adjusting it can improve the quality of the visualization.
We are grateful to André Panisson for inspiration and to Jacob Ratkiewicz, Bruno Gonçalves, Mark Meiss, and other members of the Truthy project (cnets.indiana.edu/groups/nan/truthy) for helpful discussions and suggestions. PAG acknowledges funding from the JAE-Predoc program of CSIC and partial financial support from the MINECO under project MODASS (FIS2011-24785). This work is supported in part by the NSF (ICES award CCF-1101743) and the James S. McDonnell Foundation and by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.
This article is published under license to BioMed Central Ltd. Open Access: This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.