Much of big data comes with relational information. People are friends with or follow each other on social media platforms, send each other emails, or call each other. Researchers around the world copublish their work, and large-scale technology networks like power grids and the Internet are the basis for worldwide connectivity. Big data networks are ubiquitous and increasingly available to researchers and companies seeking to extract knowledge about our society or to leverage new business models based on data analytics. These networks consist of millions of interconnected entities and form complex socio-technical systems that are the fundamental structures governing our world, yet they defy easy understanding. Instead, we must turn to network analytics to understand the structure and dynamics of these large-scale networked systems, to identify important or critical elements, and to reveal groups. However, in the context of big data, network analytics faces certain challenges.
Network Analytical Methods
Networks are defined as a set of nodes and a set of edges connecting the nodes. The major questions for network analytics, independent of network size, are “Who is important?” and “Where are the groups?” Stanley Wasserman and Katherine Faust have authored a seminal work on network analytical methods. Even though this work was published in the mid-1990s, it can still be seen as the standard book on methods for network analytics, and it also provides the foundation for many contemporary methods and metrics. With respect to identifying the most important nodes in a given network, a diverse array of centrality metrics has been developed over the last decades. Marina Hennig and her coauthors classified centrality metrics into four groups. “Activity” metrics purely count the number or summarize the volume of connections. For “radial” metrics, a node is important if it is close to other nodes, and “medial” metrics account for being in the middle of flows in networks or for bridging different areas of the network. “Feedback” metrics are based on the idea that centrality can result from the fact that a node is connected (directly or even indirectly) to other central nodes. For the first three groups, Linton C. Freeman has defined “degree centrality,” “closeness centrality,” and “betweenness centrality” as the most intuitive metrics. These metrics are used in almost every network analytical research project nowadays. The fourth metric category comprises mathematically advanced methods based on eigenvector computation. Phillip Bonacich presented eigenvector centrality, which led to important developments of metrics for web analytics like Google’s PageRank algorithm and the HITS algorithm by Jon Kleinberg, which is incorporated into several search engines to rank search results based on a website’s structural importance on the Internet.
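The four metric groups can be illustrated with the open-source networkx library (a library choice of this sketch, not named in the text), computed on the small Zachary karate club graph that ships with it:

```python
# A sketch of the four centrality groups, using networkx (an assumption,
# not a tool discussed in the text) on a small built-in example network.
import networkx as nx

G = nx.karate_club_graph()  # 34 nodes, a classic test network

degree = nx.degree_centrality(G)            # "activity": counts connections
closeness = nx.closeness_centrality(G)      # "radial": proximity to all others
betweenness = nx.betweenness_centrality(G)  # "medial": position on flows/bridges
eigenvector = nx.eigenvector_centrality(G)  # "feedback": linked to central nodes

# The two club leaders (nodes 0 and 33) dominate these rankings.
print(max(degree, key=degree.get))
```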
The second major group of research questions related to networks is about identifying groups. Groups can refer to a broad array of definitions, e.g., nodes sharing certain socioeconomic attributes, membership affiliations, or geographic proximity. When analyzing networks, we are often interested in structurally identifiable groups, i.e., sets of nodes of a network that are more densely connected among themselves and more sparsely connected to all other nodes. The most obvious group of nodes in a network would be a clique – a set of nodes in which each node is connected to all other nodes. Other definitions of groups are more relaxed. A k-core is a set of nodes in which every node is connected to at least k other nodes of the set. It turns out that k-cores are more realistic for real-world data than cliques and much faster to calculate. For any form of group identification in networks, we are often interested in evaluating the “goodness” of the identified groups. The most common approach to assess the quality of grouping algorithms is to calculate the modularity index developed by Michelle Girvan and Mark Newman.
The most widely used algorithms in network analytics were developed in the context of small groups of (less than 100) humans. When we study big networks with millions of nodes, several major challenges emerge. To begin with, most network algorithms run in Θ(n²) time or slower. This means that if we double the number of nodes, the calculation time is quadrupled. For instance, let us assume we have a network with 1,000 nodes and a second network with one million nodes, i.e., a thousandfold increase. If a certain centrality calculation with quadratic algorithmic complexity takes 1 min on the first network, the same calculation would take 1 million minutes (approximately 2 years) on the second network – a millionfold increase. This property of many network metrics makes it nearly impossible to apply them to big data networks within reasonable time. Consequently, optimization and approximation algorithms of traditional metrics are developed and used to speed up analysis for big data networks.
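The back-of-envelope arithmetic behind this example is worth making explicit:

```python
# Quadratic scaling arithmetic from the example above: a thousandfold
# increase in nodes yields a millionfold increase in running time.
base_nodes = 1_000
base_minutes = 1.0          # assumed runtime on the small network

big_nodes = 1_000_000       # a thousandfold larger network
ratio = (big_nodes / base_nodes) ** 2   # quadratic complexity: square the size ratio
big_minutes = base_minutes * ratio
years = big_minutes / (60 * 24 * 365)

print(int(ratio), round(years, 1))  # → 1000000 1.9 (approximately 2 years)
```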
A straightforward approach to algorithmic optimization of network algorithms for big data is parallelization. The abovementioned closeness and betweenness centrality metrics are based on all-pairs shortest-path calculations. In other words, the algorithm starts at a node, follows its links, and visits all other nodes in concentric circles. The calculation for one node is independent of the calculation for all other nodes; thus, different processors or different computers can jointly calculate a metric with very little coordination overhead.
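A minimal sketch of this pattern, assuming networkx and the standard library: each single-source shortest-path search is an independent unit of work, so source nodes can be farmed out to a worker pool (for truly CPU-bound Python code, processes rather than threads would be used; the thread pool here only illustrates the coordination structure).

```python
# Parallelization sketch: closeness centrality decomposes into independent
# single-source shortest-path searches (networkx is an assumed library choice).
import networkx as nx
from concurrent.futures import ThreadPoolExecutor

G = nx.karate_club_graph()  # small connected example network
n = G.number_of_nodes()

def closeness_of(source):
    # One independent unit of work: a BFS from a single source node.
    dist = nx.single_source_shortest_path_length(G, source)
    total = sum(dist.values())
    return source, (n - 1) / total if total > 0 else 0.0

# Workers need no coordination beyond collecting the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    closeness = dict(pool.map(closeness_of, G.nodes()))
```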
Approximation algorithms try to estimate a centrality metric based on a small part of the actual calculations. The all-pairs shortest-path computation can be restricted in two ways. First, we can limit the centrality calculation to the k-step neighborhood of nodes, i.e., instead of visiting all other nodes in concentric circles, we stop at a distance k. Second, instead of all nodes, we just select a small proportion of nodes as starting points for the shortest-path calculations. Both approaches can speed up calculation time tremendously as just a small proportion of the calculations is needed to create these results. Surprisingly, these approximated results have very high accuracy. This is because real-world networks are far from random and have specific characteristics. For instance, networks created from social interactions among people often have a core-periphery structure and are highly clustered. These characteristics facilitate the accuracy of centrality approximation calculations. In the context of optimizing and approximating traditional network metrics, a major future challenge will be to estimate time/fidelity trade-offs (e.g., to develop confidence intervals for network metrics) and to build systems that incorporate the constraints of users and infrastructure into the calculations. This is especially crucial as certain network metrics are very sensitive, and small changes in the data can lead to large changes in the results.
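Both restrictions are directly available in networkx (an assumed library choice; the text does not prescribe an implementation):

```python
# The two approximation strategies described above, sketched with networkx
# (an assumption) on a synthetic small-world network.
import networkx as nx

G = nx.connected_watts_strogatz_graph(500, 6, 0.1, seed=42)

# (1) Sampled starting points: estimate betweenness from shortest paths
# of only 50 randomly chosen source nodes instead of all 500.
approx_bc = nx.betweenness_centrality(G, k=50, seed=42)

# (2) k-step neighborhood: truncate one node's shortest-path search at
# distance 3 instead of exploring the whole network.
dist = nx.single_source_shortest_path_length(G, 0, cutoff=3)
```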
New algorithms with sub-quadratic complexity are being developed specifically for very large networks. Vladimir Batagelj and Andrej Mrvar have developed a broad array of new metrics and a network analytical tool called “Pajek” to analyze networks with tens of millions of nodes.
However, some networks are too big to fit into the memory of a single computer. Imagine a network with 1 billion nodes and 100 billion edges – social media networks have already reached this size. Such a network would require a computer with about 3,000 gigabytes of RAM to hold the pure network structure with no additional information. Even though supercomputer installations that can cope with these requirements already exist, they are rare and expensive. Instead, researchers make use of computer clusters and analytical software optimized for distributed systems, like Hadoop.
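The memory figure can be reproduced with a back-of-envelope calculation; the per-entry storage cost is an assumption of this sketch (16 bytes per adjacency entry, covering a node identifier plus structural overhead), chosen to match the order of magnitude given in the text:

```python
# Back-of-envelope memory estimate for a network with 100 billion edges.
edges = 100_000_000_000
entries = edges * 2       # each undirected edge appears in both adjacency lists
bytes_per_entry = 16      # node id plus per-entry overhead; an assumption

gigabytes = entries * bytes_per_entry / 1e9
print(round(gigabytes))   # → 3200, i.e., about 3,000 GB of RAM
```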
Most modern big data networks come from streaming data of interactions. Messages are sent among nodes, people call each other, and data flows are measured among servers. The observed data consist of dyadic interactions. As the nodes of these dyads overlap over time, we can extract networks. Even though networks extracted from streaming data are inherently dynamic, the actual analysis of these networks is often done with static metrics, e.g., by comparing the networks created from daily aggregations of the data. The most interesting research questions with respect to streaming data are related to change detection. Centrality metrics for every node or network-level indices that describe the structure of the network can be calculated for every time interval. Looking at these values as time series can help to identify structural change in the dynamically changing networks over time.
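The daily-aggregation approach can be sketched as follows; the tiny interaction stream and the choice of density as the network-level index are illustrative assumptions:

```python
# Sketch of extracting per-day networks from a stream of dyadic
# interactions and tracking a network-level index as a time series.
# Uses networkx (an assumption); the stream data is made up.
import networkx as nx

# (day, sender, receiver) dyadic interaction records
stream = [(1, "a", "b"), (1, "b", "c"), (1, "a", "c"),
          (2, "a", "b"), (2, "c", "d")]

densities = []
for day in sorted({t for t, _, _ in stream}):
    G = nx.Graph([(u, v) for t, u, v in stream if t == day])
    densities.append(nx.density(G))  # one static metric per time interval

# A jump or drop in this time series signals structural change.
print(densities)
```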
Visualizing Big Data Networks
Visualizing networks can be a very efficient analytical approach as human perception is capable of identifying complex structures and patterns. To facilitate visual analytics, algorithms are needed that present network data in an interpretable way. One of the major challenges for network visualization algorithms is to calculate the positions of the nodes of the network in a way that reveals the structure of the network, i.e., showing communities and putting important nodes in the center of the figure. The algorithmic challenges for visualizing big networks are very similar to the ones discussed above. Most commonly used layout algorithms scale very poorly. Ulrik Brandes and Christian Pich developed a layout algorithm based on eigenvector analysis that can be used to visualize networks with millions of nodes. The method that they applied is similar to the aforementioned approximation approaches. As real-world networks normally have a certain topology that is far from random, calculating just a part of the actual layout algorithm can be a good enough approximation to reveal interesting aspects of a network.
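An eigenvector-based layout in this spirit is available off the shelf; networkx's spectral layout (a library choice of this sketch, not the authors' own implementation) derives node positions from eigenvectors of the graph Laplacian instead of running an expensive iterative force-directed simulation:

```python
# Sketch of an eigenvector-based layout, using networkx's spectral_layout
# (an assumption; not the Brandes-Pich implementation itself).
import networkx as nx

G = nx.random_geometric_graph(200, 0.15, seed=42)

# Positions come from eigenvectors of the graph Laplacian; no iterative
# force simulation is needed, which is what makes the approach scale.
pos = nx.spectral_layout(G)
# pos maps each node to (x, y) coordinates, e.g., for nx.draw(G, pos).
```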
Networks are often enriched with additional information about the nodes or the edges. We often know the gender or the location of people. Nodes might represent different types of infrastructure elements. We can incorporate this information by mapping data to visual elements of our network visualization. Nodes can be visualized with different shapes (circles, boxes, etc.) and colored differently, resulting in multivariate network drawings. Adding contextual information to compelling network visualizations can make the difference between pretty pictures and valuable pieces of information visualization.
Besides algorithmic issues, we also face serious conceptual challenges when analyzing big data networks. Many “traditional” network analytical metrics were developed for groups of tens of people. Applying the same metrics to very big networks raises the question whether the algorithmic assumptions or the interpretations of results are still valid. For instance, the abovementioned closeness and betweenness centrality metrics incorporate just the shortest paths between every pair of nodes, ignoring possible flows of information on non-shortest paths. Moreover, these metrics do not take path length into account: a node on a shortest path of length two is treated identically to a node on a shortest path of length eight. Most likely, this does not reflect real-world assumptions about information flow. All these issues can be addressed by applying different metrics that incorporate all possible paths or a random selection of paths of length k. In general, when conducting network analytics, we need to ask: Which of the existing network algorithms are suitable, and under which assumptions, for very large networks? Moreover, what research questions are appropriate for very large networks? Does being a central actor in a group of high school kids have the same interpretation as being a central user of an online social network with millions of users?
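One concrete example of a metric that incorporates non-shortest paths is random-walk ("current-flow") betweenness, which spreads flow over all paths rather than geodesics only. The comparison below uses networkx (an illustrative choice; the current-flow variant additionally requires scipy):

```python
# Comparing shortest-path betweenness with an all-paths alternative:
# current-flow (random-walk) betweenness credits nodes on detour paths
# as well. Uses networkx, and scipy under the hood (assumed available).
import networkx as nx

G = nx.karate_club_graph()

shortest_path_bc = nx.betweenness_centrality(G)           # geodesics only
all_paths_bc = nx.current_flow_betweenness_centrality(G)  # all paths, weighted

# The two metrics generally rank nodes differently, reflecting the
# different underlying assumptions about how information flows.
```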
Networks are everywhere in big data. Analyzing these networks can be challenging. Due to the very nature of network data and algorithms, many traditional approaches to handling and analyzing these networks are not scalable. Nonetheless, it is worthwhile to cope with these challenges. Researchers from different academic areas have been optimizing existing metrics and developing new ones, as network analytics can provide unique insights into big data.
- Batagelj, V., Mrvar, A., & de Nooy, W. (2011). Exploratory social network analysis with Pajek (Expanded ed.). New York: Cambridge University Press.
- Brandes, U., & Pich, C. (2007). Eigensolver methods for progressive multidimensional scaling of large data. In Proceedings of the 14th International Symposium on Graph Drawing (GD’06) (pp. 42–53).
- Hennig, M., Brandes, U., Pfeffer, J., & Mergel, I. (2012). Studying social networks: A guide to empirical research. Frankfurt: Campus Verlag.