Abstract
Complex networks are heterogeneous relational data sets with nontrivial substructures and statistical properties. They are typically represented as graphs consisting of vertices and edges. The analysis of their intricate structure is relevant to many areas of science and commerce, and data sets may reach sizes that require distributed storage and processing. We describe and compare programming models for distributed computing with a focus on graph algorithms for large-scale complex network analysis. Four frameworks-GraphLab, Apache Giraph, Giraph++ and Apache Flink—are used to implement algorithms for the representative problems connected components, community detection, PageRank and clustering coefficients. The implementations are executed on a computer cluster to evaluate the frameworks’ suitability in practice and to compare their performance to that of the single-machine, shared-memory parallel network analysis package NetworKit. Out of the distributed frameworks, GraphLab and Apache Giraph generally show the best performance. In our experiments a cluster of eight computers running Apache Giraph enables the analysis of a network with ca. 2 billion edges, which is too large for a single machine of the same type. However, for networks that fit into memory of one machine, the performance of the shared-memory parallel implementation is usually far better than the distributed ones. The study provides experimental evidence for selecting the appropriate framework depending on the task and data volume.
Similar content being viewed by others
References
Apache (2014) Giraph++ patch for apache giraph. https://issues.apache.org/jira/browse/GIRAPH-818. Accessed 31 July 2014
Apache (2015a) Website of the framework Apache Flink. https://flink.apache.org/
Apache (2015b) Website of the framework Apache Giraph. http://giraph.apache.org/
Apache (2015c) Website of the research project Stratosphere. http://stratosphere.eu/
Apache (2016) Website of GraphX. https://spark.apache.org/graphx/
Avery Ching (2013) Scaling apache giraph to a trillion edges. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920. Accessed 30 July 2014
Battré D, Ewen S, Hueske F, Kao O, Markl V, Warneke D (2010) Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In: Proceedings of 1st ACM symposium on cloud computing, SoCC ’10. ACM, New York, pp 119–130
Boldi P, Vigna S (2004) The WebGraph framework I: compression techniques. In: Proceedings of the thirteenth international World Wide Web Conference (WWW 2004). ACM Press, Manhattan, USA, pp 595–601
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Computer networks and ISDN systems. Elsevier Science Publishers B. V, Amsterdam, pp 107–117
Cha M, Haddadi H, Benevenuto F, Gummadi KP (2010) Measuring user influence in Twitter: the million follower fallacy. In: Proceedings of the 4th international AAAI conference on Weblogs and Social Media (ICWSM)
Costa LdF, Oliveira ON, Travieso G, Rodrigues FA, Villas Boas PR, Antiqueira L, Viana MP, Correa Rocha LE (2011) Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Adv Phys 60(3):329–412
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Gonzalez JE, Low Y, Gu H, Bickson D, Guestrin C (2012) Powergraph: Distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX conference on operating systems design and implementation, OSDI’12. USENIX Association, Berkeley, CA, USA, pp 17–30
Karloff H, Suri S, Vassilvitskii S (2010) A model of computation for mapreduce. In: Proceedings of the twenty-first annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 938–948
Koch J, Staudt CL, Vogel M, Meyerhenke H (2015) Complex network analysis on distributed systems: an empirical comparison. In: Pei J, Silvestri F, Tang J (eds) Proceedings of 2015 IEEE/ACM international conference on advances in social networks analysis and mining, ASONAM 2015. ACM, pp 1169–1176
Kunegis J (2013) Konect: the koblenz network collection. In: Proceedings of 22nd international conference on World Wide Web companion. International World Wide Web Conferences Steering Committee, pp 1343–1350
Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: WWW ’10: Proceedings of the 19th international conference on World wide web. ACM, New York, NY, USA, pp 591–600
Lin J, Dyer C (2010) Data-intensive text processing with MapReduce. G-Reference, Information and Interdisciplinary Subjects Series. Morgan & Claypool
Lin J, Schatz M (2010) Design patterns for efficient graph algorithms in mapreduce. In: Proceedings of the eighth workshop on mining and learning with graphs, MLG ’10. ACM, New York, NY, USA, pp 78–85
Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein JM (2012) Distributed GraphLab: a framework for machine learning in the cloud. CoRR, abs/1204.6078
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146
McColl RC, Ediger D, Poovey J, Campbell D, Bader DA (2014) A performance evaluation of open source graph databases. In: Proceedings of 1st workshop on parallel programming for analytics applications, PPAA ’14. ACM, New York, NY, USA, pp 11–18
Meyerhenke H, Sanders P, Schulz C (2014) Partitioning complex networks via size-constrained clustering. In: Proceedings of 13th international symposium on experimental algorithms (SEA 2014), vol 8504 of LNCS. Springer, Berlin, pp 351–363
Newman M (2010) Networks: an introduction. Oxford University Press, Oxford
Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76(3):036106
Satish N, Sundaram N, Patwary MMA, Seo J, Park J, Hassaan MA, Sengupta S, Yin Z, Dubey P (2014). Navigating the maze of graph analytics frameworks using massive graph datasets. In: Proceedings 2014 ACM SIGMOD international conference on management of data, SIGMOD ’14. ACM, New York, NY, USA, pp 979–990
Schank T, Wagner D (2005) Approximating clustering-coefficient and transitivity. J Gr Algorithm Appl 9(2):265–275
Slota GM, Madduri K, Rajamanickam S (2014) Pulp: scalable multi-objective multi-constraint partitioning for small-world networks. In: Lin J, Pei J, Hu X, Chang W, Nambiar R, Aggarwal C, Cercone N, Honavar V, Huan J, Mobasher B, Pyne S (eds) 2014 IEEE international conference on big data, Big Data 2014, pp 481–490
Staudt CL, Sazonovs A, Meyerhenke H (2016) NetworKit: a tool suite for large-scale complex network analysis. Netw Sci, To Appear
Tian Y, Balmin A, Corsten SA, Tatikonda S, McPherson J (2013) From “think like a vertex” to “think like a graph”. PVLDB 7(3):193–204
Turi (2016). Website of the company distributing GraphLab
Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111
Zhang Y, Gao Q, Gao L, Wang C (2012). Accelerate large-scale iterative computation through asynchronous accumulative updates. In: Proceedings of the 3rd workshop on scientific cloud computing date, ACM, pp 13–22
Acknowledgments
This work is partially supported by German Research Foundation (DFG) grant ME 3619/3-1 within the Priority Programme 1736 Algorithms for Big Data.
Author information
Authors and Affiliations
Corresponding author
Additional information
Parts of this paper have been published in preliminary form as Koch et al. (2015).
Rights and permissions
About this article
Cite this article
Koch, J., Staudt, C.L., Vogel, M. et al. An empirical comparison of Big Graph frameworks in the context of network analysis. Soc. Netw. Anal. Min. 6, 84 (2016). https://doi.org/10.1007/s13278-016-0394-1
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s13278-016-0394-1