
The Journal of Supercomputing

Volume 56, Issue 1, pp 1–24

Analysis and performance results of computing betweenness centrality on IBM Cyclops64

  • Guangming Tan
  • Vugranam C. Sreedhar
  • Guang R. Gao

Abstract

This paper presents a joint study of application and architecture aimed at improving the performance and scalability of an irregular application, the computation of betweenness centrality, on a many-core architecture, the IBM Cyclops64. Betweenness centrality exhibits unstructured parallelism, dynamically non-contiguous memory access, and low arithmetic intensity, all of which obstruct an efficient mapping of parallel algorithms onto such many-core architectures. By identifying several key architectural features, we propose and evaluate efficient strategies for achieving scalability on a massively multi-threaded many-core architecture. We demonstrate several optimization strategies, including multi-grain parallelism, just-in-time locality with an explicit memory hierarchy and non-preemptive thread execution, and fine-grain data synchronization. Compared with a conventional parallel algorithm, we obtain a 4X to 50X improvement in performance and a 16X improvement in scalability on a 128-core IBM Cyclops64 simulator.
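For readers unfamiliar with the kernel being optimized, the following is a minimal sequential sketch of betweenness centrality on an unweighted graph, using the widely known BFS-plus-dependency-accumulation scheme due to Brandes. It is not the parallel Cyclops64 implementation evaluated in the paper; the function name `betweenness_centrality` and the adjacency-dictionary input format are assumptions made for illustration. The data-dependent neighbour lookups in the inner loops illustrate the non-contiguous memory access and low arithmetic intensity the abstract mentions.

```python
from collections import deque


def betweenness_centrality(adj):
    """Sequential betweenness centrality for an unweighted graph.

    adj maps each vertex to a list of its out-neighbours (hypothetical
    input format chosen for this sketch). Scores count ordered
    source/target pairs, so an undirected graph stored with both edge
    directions yields values twice the unordered-pair convention.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}    # BFS distance from s (-1 = unseen)
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        sigma[s] = 1
        dist[s] = 0
        order = []                     # vertices in non-decreasing distance
        queue = deque([s])

        # Phase 1: breadth-first search that counts shortest paths.
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:           # irregular, data-dependent accesses
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Phase 2: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On the three-vertex path a, b, c stored with edges in both directions, this sketch assigns 2.0 to the middle vertex b and 0.0 to the endpoints, because the pairs (a, c) and (c, a) are each counted once. The outer loop over sources and the per-level work inside each search are independent enough to be split at different granularities, which is the kind of structure a multi-grain parallel scheme can exploit.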

Keywords: Many-core architecture · Betweenness centrality · Just-in-time locality · Multi-grain parallelism


Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Guangming Tan (1, 2)
  • Vugranam C. Sreedhar (3)
  • Guang R. Gao (2)

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. Computer Architecture and Parallel Systems Laboratory, University of Delaware, Newark, USA
  3. IBM T. J. Watson Research Center, Cambridge, USA
