An empirical comparison of Big Graph frameworks in the context of network analysis

  • Jannis Koch
  • Christian L. Staudt
  • Maximilian Vogel
  • Henning Meyerhenke
Original Article

Abstract

Complex networks are heterogeneous relational data sets with nontrivial substructures and statistical properties. They are typically represented as graphs consisting of vertices and edges. The analysis of their intricate structure is relevant to many areas of science and commerce, and data sets may reach sizes that require distributed storage and processing. We describe and compare programming models for distributed computing with a focus on graph algorithms for large-scale complex network analysis. Four frameworks (GraphLab, Apache Giraph, Giraph++ and Apache Flink) are used to implement algorithms for four representative problems: connected components, community detection, PageRank and clustering coefficients. The implementations are executed on a computer cluster to evaluate the frameworks’ suitability in practice and to compare their performance to that of the single-machine, shared-memory parallel network analysis package NetworKit. Among the distributed frameworks, GraphLab and Apache Giraph generally show the best performance. In our experiments, a cluster of eight computers running Apache Giraph enables the analysis of a network with approximately 2 billion edges, which is too large for a single machine of the same type. However, for networks that fit into the memory of one machine, the performance of the shared-memory parallel implementation is usually far better than that of the distributed ones. The study provides experimental evidence for selecting the appropriate framework depending on the task and data volume.

Keywords

Big Graph frameworks; Distributed computing; Graph algorithms; Complex networks; Algorithmic network analysis

Copyright information

© Springer-Verlag Wien 2016

Authors and Affiliations

  • Jannis Koch (1)
  • Christian L. Staudt (1)
  • Maximilian Vogel (1)
  • Henning Meyerhenke (1)

  1. Department of Informatics, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
