Volume 14, Issue 2, pp 107–117

Iterative Computation of Connected Graph Components with MapReduce

  • Lars Kolb
  • Ziad Sehili
  • Erhard Rahm


The use of the MapReduce framework for iterative graph algorithms is challenging. To achieve high performance, it is critical to limit both the amount of intermediate results and the number of iterations. We address these issues for the important problem of finding connected components in large graphs. We analyze an existing MapReduce algorithm, CC-MR, and present techniques to improve its performance, including a memory-based connection of subgraphs in the map phase. Our evaluation with several large graph datasets shows that the improvements reduce the amount of generated data by up to a factor of 8.8 and the runtime by up to a factor of 3.5.
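To make the two cost factors named above concrete, the following is a minimal in-memory sketch of iterative connected-component labeling in a map/reduce style. It is not CC-MR and not the improved approach of the paper; it is plain minimum-label propagation, chosen only because it is short. The function names (`map_phase`, `reduce_phase`, `connected_components`) are hypothetical, and the "framework" is simulated with ordinary Python data structures rather than Hadoop.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Emit (vertex, candidate_label) pairs.

    Each vertex keeps its own label and is offered the current
    label of every neighbour (edges are treated as undirected).
    """
    for v, label in labels.items():
        yield v, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]


def reduce_phase(pairs):
    """Group pairs by vertex and keep the smallest candidate label."""
    grouped = defaultdict(list)
    for v, label in pairs:
        grouped[v].append(label)
    return {v: min(candidates) for v, candidates in grouped.items()}


def connected_components(vertices, edges):
    """Repeat map/reduce rounds until no label changes.

    Returns the final vertex -> component label mapping and the
    number of rounds executed (including the final check round).
    """
    labels = {v: v for v in vertices}  # every vertex starts as its own component
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        rounds += 1
        if new_labels == labels:
            return labels, rounds
        labels = new_labels


if __name__ == "__main__":
    labels, rounds = connected_components(
        vertices=[1, 2, 3, 4, 5, 6],
        edges=[(1, 2), (2, 3), (4, 5)],
    )
    print(labels, rounds)
```

In this sketch every round emits one pair per vertex plus two pairs per edge, and the number of rounds grows with the longest shortest path inside a component. Those are the two quantities, intermediate data and iteration count, that the abstract identifies as critical to limit.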


MapReduce · Hadoop · Connected graph components · Transitive closure



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Institut für Informatik, Universität Leipzig, Leipzig, Germany
