CC-MR – Finding Connected Components in Huge Graphs with MapReduce

  • Thomas Seidl
  • Brigitte Boden
  • Sergej Fries
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7523)


Detecting connected components in graphs is a well-known problem that arises in many applications, including data mining, social network analysis, and image analysis. Although very efficient serial algorithms exist, the problem remains a subject of research because modern information systems produce data volumes that single workstations cannot handle. Only highly parallelized approaches on multi-core servers or computer clusters can deal with these large-scale data sets. In this work we present a solution to this problem for distributed-memory architectures and provide an implementation for the well-known MapReduce framework developed by Google. As our experimental evaluation on synthetic and real-world data shows, our algorithm CC-MR significantly outperforms the existing approaches for the MapReduce framework in terms of the number of necessary iterations, communication costs, and execution runtime. Furthermore, we present a technique for accelerating our implementation on data sets with very heterogeneous component sizes, which often appear in real data.
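To make the notion of "number of necessary iterations" concrete, the following is a minimal single-process sketch of a generic iterative scheme: minimum-label propagation written as a map step and a reduce step. It is not the CC-MR algorithm, which the abstract does not specify. The function names (`map_phase`, `reduce_phase`, `connected_components`) and the edge-list input format are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Emit (vertex, candidate_label) pairs.

    Each vertex keeps its own label and receives the current label of
    every neighbour (edges are treated as undirected).
    """
    for v, label in labels.items():
        yield v, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]


def reduce_phase(pairs):
    """Group candidate labels by vertex and keep the smallest one."""
    grouped = defaultdict(list)
    for v, label in pairs:
        grouped[v].append(label)
    return {v: min(candidates) for v, candidates in grouped.items()}


def connected_components(vertices, edges):
    """Repeat map/reduce rounds until no label changes.

    Returns the final vertex -> component label mapping and the number
    of rounds executed (including the last round that detects
    convergence).
    """
    labels = {v: v for v in vertices}  # every vertex starts as its own component
    rounds = 0
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        rounds += 1
        if new_labels == labels:
            return labels, rounds
        labels = new_labels


if __name__ == "__main__":
    labels, rounds = connected_components(
        [1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (3, 4), (5, 6)]
    )
    print(labels, rounds)
```

In this naive scheme the smallest label travels only one hop per round, so the round count grows with the diameter of the largest component. That dependence is the kind of cost that iteration-count comparisons between MapReduce approaches measure.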


Keywords: Parallel Algorithm, Component Size, Star Graph, MapReduce Framework, Forward Edge



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Thomas Seidl (1)
  • Brigitte Boden (1)
  • Sergej Fries (1)
  1. Data Management and Data Exploration Group, RWTH Aachen University, Germany
