Understanding parallelism in graph traversal on multi-core clusters

  • Huiwei Lv
  • Guangming Tan
  • Mingyu Chen
  • Ninghui Sun
Special Issue Paper


There is an ever-increasing need to explore large-scale graph data sets in computational sciences, social networks, and business analytics. However, because of their irregular and memory-intensive nature, graph applications are notorious for poor performance on parallel computer systems. In this paper we propose a new hybrid MPI/Pthreads breadth-first search (BFS) algorithm that features (i) overlapping computation and communication by separating them into multiple threads, (ii) maximizing multi-threading parallelism on multi-cores with massive threads to improve throughput, and (iii) exploiting pipeline parallelism using lock-free queues for asynchronous communication. By comparing it with a traditional MPI-only BFS algorithm, we learned several valuable lessons that help to understand and exploit parallelism in graph traversal applications. Experiments show that our algorithm is 1.9× faster than the MPI-only version and is capable of processing 1.45 billion edges per second on a 32-node SMP cluster. At a large scale, our algorithm is 1.49× faster than the MPI-only BFS algorithm in the Combinatorial BLAS Library with 6,144 cores.
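
The idea of separating computation and communication into threads joined by a queue can be illustrated with a minimal single-process sketch. It is not the algorithm of the paper: there is no MPI, Python's `queue.Queue` (which uses locks internally) stands in for the lock-free queues, and the names `expand`, `pipelined_bfs`, and `depth` are hypothetical. One thread expands the current frontier while a second stage consumes the discovered (neighbor, parent) pairs as they arrive, which is roughly the role a communication thread would play when forwarding vertices to their owners.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of one BFS level


def expand(graph, frontier, out_q):
    """Computation stage: scan the frontier and emit (neighbor, parent) pairs."""
    for u in frontier:
        for v in graph[u]:
            out_q.put((v, u))
    out_q.put(DONE)


def pipelined_bfs(graph, source):
    """Level-synchronous BFS; returns {vertex: parent} for reachable vertices."""
    parent = {source: source}
    frontier = [source]
    while frontier:
        q = queue.Queue()  # assumption: a locking queue replaces a lock-free one
        worker = threading.Thread(target=expand, args=(graph, frontier, q))
        worker.start()
        next_frontier = []
        # Consumer stage: drain pairs while the expansion thread is still running.
        while True:
            item = q.get()
            if item is DONE:
                break
            v, u = item
            if v not in parent:  # first visit fixes the BFS parent
                parent[v] = u
                next_frontier.append(v)
        worker.join()
        frontier = next_frontier
    return parent


def depth(parent, v):
    """Number of edges from v back to the BFS source along parent links."""
    d = 0
    while parent[v] != v:
        v = parent[v]
        d += 1
    return d


if __name__ == "__main__":
    # Small illustrative graph as an adjacency dict; vertex 5 is isolated.
    example = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
    tree = pipelined_bfs(example, 0)
    print(tree, depth(tree, 4))
```

Only the consumer stage writes to `parent`, so the two threads never update shared state concurrently. In a distributed setting the consumer would presumably route each pair to the process that owns the vertex instead of updating a local dictionary.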


Keywords: Breadth-first search; Graph algorithms; Hybrid MPI/Pthreads programming; Lock-free queues


Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Huiwei Lv (1, 2)
  • Guangming Tan (1)
  • Mingyu Chen (1)
  • Ninghui Sun (1)
  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. Graduate School of Chinese Academy of Sciences, Beijing, China