Unbalanced tree search on a manycore system using the GPI programming model

  • Rui Machado
  • Carsten Lojewski
  • Salvador Abreu
  • Franz-Josef Pfreundt
Special Issue Paper


Recent developments in computer architecture are moving towards systems with large core counts (Manycore), which expose more parallelism to applications. Some applications, termed irregular and unbalanced, demand a dynamic and asynchronous load balancing implementation to utilize the full performance of a Manycore system. The recently established Graph500 benchmark, for example, targets such applications. The UTS benchmark characterizes the performance of such irregular and unbalanced computations with a tree-structured search space that requires continuous dynamic load balancing. GPI is a PGAS API that delivers the full performance of RDMA-enabled networks directly to the application. Its programming model focuses on one-sided asynchronous communication, which allows computation and communication to overlap. In this paper we address the dynamic load balancing requirements of unbalanced applications using the GPI programming model. Using the UTS benchmark, we detail the implementation of a work stealing algorithm in GPI and present its performance results. Our evaluation shows significant improvements over the optimized MPI version, with a maximum performance of 9.5 billion nodes per second on 3072 cores.
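The idea of work stealing on an unbalanced tree can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the implementation described in the paper: it does not use GPI or any RDMA communication, and the branching rule, the names `child_count`, `child_id`, `count_sequential` and `count_with_stealing`, and the parameters `chunk`, `MAX_DEPTH` and `ROOT_CHILDREN` are all hypothetical. The tree shape is derived from SHA-1 digests, loosely in the spirit of UTS, so it is deterministic yet irregular. Each simulated worker pushes and pops at the tail of its own queue, while an idle worker takes the oldest nodes from the head of a randomly chosen victim's queue.

```python
import hashlib
import random
from collections import deque

MAX_DEPTH = 10        # hypothetical depth cut-off that keeps the tree finite
ROOT_CHILDREN = 64    # hypothetical fan-out of the root node
ROOT = (b"root", 0)   # a node is a pair (identifier, depth)


def child_count(node_id: bytes, depth: int) -> int:
    """Hypothetical branching rule: the node's SHA-1 digest decides its fan-out."""
    if depth == 0:
        return ROOT_CHILDREN
    if depth >= MAX_DEPTH:
        return 0
    digest = hashlib.sha1(node_id).digest()
    # Roughly half of the nodes are leaves; the others get 0 to 4 children.
    return digest[0] % 5 if digest[1] % 2 == 0 else 0


def child_id(node_id: bytes, index: int) -> bytes:
    """Derive a child's identifier from its parent's identifier and its index."""
    return hashlib.sha1(node_id + bytes([index])).digest()


def count_sequential() -> int:
    """Reference result: depth-first traversal by a single worker."""
    stack, total = [ROOT], 0
    while stack:
        node_id, depth = stack.pop()
        total += 1
        for i in range(child_count(node_id, depth)):
            stack.append((child_id(node_id, i), depth + 1))
    return total


def count_with_stealing(num_workers: int, chunk: int = 4, seed: int = 0):
    """Round-robin simulation of work stealing between per-worker queues."""
    rng = random.Random(seed)
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(ROOT)          # all work starts on worker 0
    visited = [0] * num_workers
    steals = 0
    while any(queues):              # stop once every queue is empty
        for w in range(num_workers):
            own = queues[w]
            if not own:
                # Idle worker: try one random victim and steal from its head.
                victim = rng.randrange(num_workers)
                other = queues[victim]
                if victim != w and len(other) > 1:
                    for _ in range(min(chunk, len(other) // 2)):
                        own.append(other.popleft())
                    steals += 1
                continue
            # Busy worker: process one node taken from the tail of its own queue.
            node_id, depth = own.pop()
            visited[w] += 1
            for i in range(child_count(node_id, depth)):
                own.append((child_id(node_id, i), depth + 1))
    return visited, steals


if __name__ == "__main__":
    per_worker, steals = count_with_stealing(num_workers=4)
    print("nodes per worker:", per_worker)
    print("successful steals:", steals)
    print("total matches sequential run:", sum(per_worker) == count_sequential())
```

In this simulation termination is trivial because all queues are visible to a single loop. In a distributed setting, detecting that every worker is idle is a separate problem that the sketch deliberately leaves out.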


Keywords: GPI, Work stealing, Load balancing, UTS, Manycore



Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Rui Machado (1)
  • Carsten Lojewski (1)
  • Salvador Abreu (2)
  • Franz-Josef Pfreundt (1)

  1. Fraunhofer Institut Techno- und Wirtschaftsmathematik, Competence Center for High Performance Computing, Kaiserslautern, Germany
  2. University of Evora, Evora, Portugal
