
An Illustration of Extending Hedgehog to Multi-Node GPU Architectures Using GEMM

  • Original Research
  • Published in: SN Computer Science


Asynchronous task-based systems offer the possibility of simplifying the use of scalable heterogeneous architectures. This paper extends previous work demonstrating how Hedgehog, a dataflow graph-based model developed at the National Institute of Standards and Technology, can be used to obtain high performance for numerical linear algebra operations as a starting point for more complex algorithms. While those results were promising, it was unclear how they would scale to larger matrices and higher compute-node counts. The aim here is to show that a new, improved algorithm inspired by DPLASMA performs equally well when implemented with Hedgehog. The results are compared against the leading library DPLASMA to illustrate the relative performance of different asynchronous dataflow models. The work demonstrates that general-purpose, high-level abstractions, such as Hedgehog's dataflow graphs, make it possible to achieve performance comparable to specialized linear algebra codes such as DPLASMA.
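The tiled-GEMM dataflow discussed above can be illustrated in miniature. The sketch below is not Hedgehog's API (Hedgehog is a C++ library and streams tiles through an explicit graph); it only shows, under simplified assumptions, the task decomposition that systems like Hedgehog and DPLASMA exploit: each tile of C is owned by exactly one task, so all tile updates can run asynchronously without locking.

```python
# Illustrative sketch only (not Hedgehog's API): a tiled GEMM expressed as
# independent per-tile tasks, the decomposition asynchronous dataflow
# runtimes build on.
from concurrent.futures import ThreadPoolExecutor

def tile_gemm(A, B, C, n, t):
    """C += A * B for n x n matrices stored as nested lists, tile size t."""
    def update(i, j):
        # This task owns one C tile: C[i:i+t, j:j+t] += sum_k A[i,k] * B[k,j]
        for k0 in range(0, n, t):
            for ii in range(i, min(i + t, n)):
                for kk in range(k0, min(k0 + t, n)):
                    a = A[ii][kk]
                    for jj in range(j, min(j + t, n)):
                        C[ii][jj] += a * B[kk][jj]

    with ThreadPoolExecutor() as pool:
        # One task per C tile; tasks are independent, so no locking is needed.
        futures = [pool.submit(update, i, j)
                   for i in range(0, n, t) for j in range(0, n, t)]
        for f in futures:
            f.result()  # propagate any exception from a task
    return C
```

Because each C tile has a single writer, the only synchronization is the final join; a real runtime replaces that join with dataflow edges so downstream tasks start as soon as their input tiles arrive.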


Figures 1–5 are available in the full article.


Data availability

The matrices used in the benchmarks are generated by a random-number generator function, which is included as code in the repository linked in the Code Availability section.
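The generator in the repository is the authoritative source; as a minimal sketch of what seeded random matrix generation looks like for reproducible benchmarking (the name `random_matrix` is illustrative, not taken from the repository):

```python
# Hypothetical sketch of reproducible random input generation; the actual
# generator used in the benchmarks lives in the linked repository.
import random

def random_matrix(n, m, seed=0):
    rng = random.Random(seed)  # fixed seed -> identical data on every run
    return [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(n)]
```

Seeding the generator means every node and every run benchmarks against identical data, which keeps timing comparisons meaningful.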

Code Availability

Our code is available here. The application v3_benchmark2, at the commit also tagged v3_benchmark2, was used for benchmarking on all the systems.


  1. Compared to other AMT systems, HPX brings a "future-proof C++ conforming API" and an exposed asynchronous programming model.


  1. Shingde N, Berzins M, Blattner T, Keyrouz W, Bardakoff A. Extending Hedgehog's dataflow graphs to multi-node GPU architectures. Lecture Notes in Computer Science. 2023;1–12.

  2. Bardakoff A, Bachelet B, Blattner T, Keyrouz W, Kroiz GC, Yon L. Hedgehog: Understandable Scheduler-Free Heterogeneous Asynchronous Multithreaded Data-Flow Graphs. In 2020 IEEE/ACM 3rd Annual Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), 2020;1–15.

  3. Herault T, Robert Y, Bosilca G, Dongarra J. Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC. In 2019 IEEE/ACM 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2019;33–41.

  4. Gates M, Kurzak J, Charara A, YarKhan A, Dongarra J. SLATE: design of a modern distributed and accelerated linear algebra library. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 26, 2019;1-18.

  5. Bauer M, Treichler S, Slaughter E, Aiken A. Legion: Expressing locality and independence with logical regions. In Proc. of the Int. Conf. on High Perf. Comput., Networking, Storage and Analysis. IEEE Computer Society Press, 2012;66.

  6. Berzins M, Beckvermit J, Harman T, Bezdjian A, Humphrey A, Meng Q, Schmidt J, Wight C. Extending the Uintah Framework through the Petascale Modeling of Detonation in Arrays of High Explosive Devices. SIAM Journal on Scientific Computing. 2016;38(5):101–22.


  7. Bosilca G, Bouteiller A, Danalis A, Faverge M, Herault T, Dongarra JJ. PaRSEC: Exploiting Heterogeneity to Enhance Scalability. Computing in Science Engineering. 2013;15(6):36–45.


  8. Edwards HC, Trott CR, Sunderland D. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J Parallel and Distrib Comput. 2014;74(12):3202–16.


  9. Holmen JK, Sahasrabudhe D, Berzins M. “A Heterogeneous MPI+PPL Task Scheduling Approach for Asynchronous Many-Task Runtime Systems,” In Proceedings of the Practice and Experience in Advanced Research Computing 2021 on Sustainability, Success and Impact (PEARC21), ACM, (2021)

  10. Holmen JK, Peterson B, Berzins M. “An Approach for Indirectly Adopting a Performance Portability Layer in Large Legacy Codes,” In 2nd International Workshop on Performance, Portability, and Productivity in HPC (P3HPC), SC19, 2019.

  11. Kaiser H, Heller T, Adelstein-Lelbach B, Serio A, Fey D. HPX: A Task Based Programming Model in a Global Address Space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (Eugene, OR, USA) (PGAS ’14). ACM, New York, NY, USA, Article 6 2014.

  12. Kale LV, Krishnan S. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications (Washington, D.C., USA) (OOPSLA ’93). ACM, New York, NY, USA, 1993;91-108.

  13. Meng Q, Humphrey A, Berzins M. The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System. In Digital Proceedings of The International Conference for High Performance Computing, Networking, Storage and Analysis, SC'12, WOLFHPC 2012 Workshop, 2012;2441–2448.

  14. Holmen JK, Sahasrabudhe D, Berzins M. “Porting Uintah to Heterogeneous Systems,” In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC22) Best Paper Award, ACM, 2022.

  15. Augonnet C, Thibault S, Namyst R, Wacrenier P. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009. 2011;23(2):187–98.


  16. Blumofe RD, Leiserson CE. Space-Efficient Scheduling of Multithreaded Computations. SIAM Journal on Computing. 1998;27(1):202–29.


  17. Bardakoff A. Analysis and Execution of a Data-Flow Graph Explicit Model Using Static Metaprogramming. Université Clermont Auvergne, 2021.

  18. NIST. Computation Platform for AI/ML. December 17, 2019.

  19. Center for High Performance Computing, The University of Utah. (n.d.).

  20. Kaiser H, et al. HPX - The C++ Standard Library for Parallelism and Concurrency. Journal of Open Source Software. 2020;5(53):2352.

  21. Bauer M, Treichler S, Slaughter E, Aiken A. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012; 1-11. Supercomputing, IEEE.

  22. Augonnet C, Thibault S, Namyst R, Wacrenier P-A. Starpu: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience. 2011;23(2):187–98.


  23. Garland M, et al. Parallel Computing Experiences with CUDA. IEEE Micro. 2008;28(4):13–27.

  24. Kale LV, Krishnan S. Charm++: A portable concurrent object oriented system based on c++. SIGPLAN Notices. 1993;28(10):91–108.


  25. Bennett J, Clay R, Baker G, Gamell M, Hollman D, Knight S, Kolla H, Sjaardema G, Slattengren N, Teranishi K, et al. ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms. Technical Report SAND2015-8312, US Department of Energy, Sandia National Laboratories, 2015.

  26. Alperen A, Afibuzzaman M, Rabbi F, Ozkaya MY, Catalyurek U, Aktulga HM. An Evaluation of Task-Parallel Frameworks for Sparse Solvers on Multicore and Manycore CPU Architectures. In 50th International Conference on Parallel Processing, 1–11. Lemont, IL, USA: ACM, 2021.

  27. Gu R, Becchi M. A Comparative Study of Parallel Programming Frameworks for Distributed GPU Applications. In Proceedings of the 16th ACM International Conference on Computing Frontiers (CF '19), 268–73. New York, NY, USA: Association for Computing Machinery, 2019.

  28. Agullo E, Buttari A, Guermouche A, Herrmann J, Jego A. Task-Based Parallel Programming for Scalable Matrix Product Algorithms. ACM Transactions on Mathematical Software. 2023;49(2):1–23.

  29. Rohr D, Lindenstruth V. A Flexible and Portable Large-Scale DGEMM Library for Linpack on Next-Generation Multi-GPU Systems. In 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015;664–68.

  30. Baker GM, Bettencourt MT, Bova SW, Franko K, Gamell M, Grant R, Hammond SD, Hollman DS, Knight S, Kolla H, et al. ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms. United States, 2015.

  31. Wu N, Gonidelis I, Liu S, Fink Z, Gupta N, Mohammadiporshokooh K, Diehl P, Kaiser H, Kale LV. Quantifying Overheads in Charm++ and HPX Using Task Bench. In Euro-Par 2022: Parallel Processing Workshops, edited by Singer J, Elkhatib Y, Blanco Heras D, Diehl P, Brown N, Ilic A, 5–16. Lecture Notes in Computer Science. Cham: Springer Nature Switzerland, 2023.


Author information

Authors and Affiliations


Corresponding author

Correspondence to Nitish Shingde.

Ethics declarations

Conflict of interest

Certain equipment, instruments, software, or materials, commercial or non-commercial, are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement of any product or service by NIST, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Applications and Frameworks using the Asynchronous Many Task Paradigm” guest edited by Patrick Diehl, Hartmut Kaiser, Peter Thoman, Steven R. Brandt and “Ram” Ramanujam.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Shingde, N., Blattner, T., Bardakoff, A. et al. An Illustration of Extending Hedgehog to Multi-Node GPU Architectures Using GEMM. SN COMPUT. SCI. 5, 654 (2024).
