Scalable Work-Stealing Load-Balancer for HPC Distributed Memory Systems

  • Clement Fontenaille
  • Eric Petit
  • Pablo de Oliveira Castro
  • Seijilo Uemura
  • Devan Sohier
  • Piotr Lesnicki
  • Ghislain Lartigue
  • Vincent Moureau
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)

Abstract

Work-stealing schedulers are common in shared-memory environments. However, their use on large-scale distributed-memory systems has been limited to specific ad hoc implementations, which has prevented broader adoption. In this paper we introduce a new scalable work-stealing algorithm for distributed-memory systems, together with its implementation in the TITUS_DLB library. The algorithm is based on Kleinberg’s small-world graph, which makes it possible to control communication patterns and the associated runtime overheads while providing efficient heuristics for victim selection and result routing. To validate our approach, we present the DLB_Bench benchmark, which emulates arbitrary workload distributions and imbalance characteristics. Finally, we compare TITUS_DLB to the ad hoc solution developed for the YALES2 computational fluid dynamics and combustion solver. We achieve up to a 54% performance gain on thousands of cores.
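
As a rough illustration of the idea of restricting steal attempts to a small-world neighborhood, the following sketch builds a Kleinberg-style neighbor set on a ring of ranks and picks a victim from it. It is not the TITUS_DLB implementation: the ring topology, the function names (`ring_distance`, `small_world_neighbors`, `pick_victim`), the number of long-range links, and the uniform random victim choice are all assumptions made for this example.

```python
import random


def ring_distance(a, b, n):
    """Shortest hop distance between ranks a and b on a ring of n ranks."""
    d = abs(a - b) % n
    return min(d, n - d)


def small_world_neighbors(rank, n, long_links=2, r=1.0, rng=None):
    """Return the sorted neighbor set of `rank` (hypothetical helper).

    Assumption: ranks sit on a 1D ring. Each rank keeps its two ring
    neighbors plus `long_links` long-range contacts, each drawn with
    probability proportional to distance**(-r), as in Kleinberg's model.
    """
    if n < 3:
        raise ValueError("this sketch assumes at least 3 ranks")
    if rng is None:
        rng = random.Random(rank)
    neighbors = {(rank - 1) % n, (rank + 1) % n}
    candidates = [v for v in range(n) if v != rank and v not in neighbors]
    weights = [ring_distance(rank, v, n) ** (-r) for v in candidates]
    target = len(neighbors) + min(long_links, len(candidates))
    # Resample until enough distinct long-range contacts have been added.
    while len(neighbors) < target:
        neighbors.add(rng.choices(candidates, weights=weights, k=1)[0])
    return sorted(neighbors)


def pick_victim(neighbors, rng):
    """Assumed heuristic: steal from a uniformly random neighbor."""
    return rng.choice(neighbors)


if __name__ == "__main__":
    nbrs = small_world_neighbors(rank=0, n=64, long_links=3)
    print("neighbors of rank 0:", nbrs)
    print("victim:", pick_victim(nbrs, random.Random(0)))
```

Because each rank only ever contacts its fixed neighbor set, the number of communication partners per rank stays constant as the rank count grows, while the distance-biased long-range links keep paths between distant ranks short.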

Notes

Acknowledgment

This work has been funded by the European FP7 Exa2ct project, ATOS, and the ECR lab, a collaboration between CEA, UVSQ, and Intel. The authors thank the GASPI and GPI-2 development team for their very good support and advice. The authors also thank CRIANN, IT4I and BSC for the compute resources and assistance.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Clement Fontenaille (1, 2)
  • Eric Petit (3)
  • Pablo de Oliveira Castro (1)
  • Seijilo Uemura (1)
  • Devan Sohier (1)
  • Piotr Lesnicki (2)
  • Ghislain Lartigue (4)
  • Vincent Moureau (4)

  1. Li-PaRAD, University of Versailles, Versailles, France
  2. Atos-Bull, Paris, France
  3. Intel Corporation, Santa Clara, USA
  4. CORIA-CNRS, University of Normandie, Saint-Étienne-du-Rouvray, France