
Simulating stellar merger using HPX/Kokkos on A64FX on Supercomputer Fugaku

The Journal of Supercomputing

Abstract

The increasing availability of machines based on non-GPU architectures such as ARM A64FX in high-performance computing presents a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to writing compute kernels with the Kokkos abstraction layer so that they can be executed on x86 and A64FX CPUs as well as NVIDIA GPUs. Beyond applying Kokkos as an abstraction over the execution of compute kernels in different heterogeneous execution environments, we show that the use of standard C++ constructs, as exposed by the HPX runtime system, enables platform portability, using the real-world Octo-Tiger astrophysics application as an example. We report our experience with porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook's Ookami and RIKEN's Supercomputer Fugaku and compare the resulting performance with that achieved on well-established GPU-oriented HPC machines such as ORNL's Summit, NERSC's Perlmutter, and CSCS's Piz Daint. Thanks to the abstraction levels provided by HPX and Kokkos, Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes. Adding vectorization support for ARM's SVE to Octo-Tiger was trivial thanks to the use of standard C++ interfaces.
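To make the approach described above concrete, the following minimal sketch (not taken from Octo-Tiger; all function names and sizes are illustrative placeholders) shows the two ingredients the abstract refers to: a compute kernel written once against the Kokkos abstraction layer, which runs on whichever execution space Kokkos was built for (e.g., OpenMP on x86/A64FX or CUDA on NVIDIA GPUs), and a CPU-side loop using std::experimental::simd, which lowers to SVE on A64FX and to AVX on x86 without architecture-specific intrinsics. The asynchronous launch of such kernels through HPX (via hpx-kokkos) is omitted for brevity.

// Minimal sketch, assuming a Kokkos installation and a compiler shipping
// <experimental/simd> (e.g., GCC 11+). Not Octo-Tiger code.
#include <Kokkos_Core.hpp>
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

// Portable element-wise kernel c(i) = a(i) + b(i), executed by Kokkos on
// the default execution space selected at build time.
void add_vectors(Kokkos::View<double*> a, Kokkos::View<double*> b,
                 Kokkos::View<double*> c) {
  Kokkos::parallel_for(
      "add_vectors", Kokkos::RangePolicy<>(0, a.extent(0)),
      KOKKOS_LAMBDA(const int i) { c(i) = a(i) + b(i); });
}

// CPU-side variant: std::experimental::simd provides portable vector widths
// (SVE on A64FX, AVX on x86) without architecture-specific intrinsics.
void add_vectors_simd(const double* a, const double* b, double* c,
                      std::size_t n) {
  using simd_t = stdx::native_simd<double>;
  std::size_t i = 0;
  for (; i + simd_t::size() <= n; i += simd_t::size()) {
    simd_t va(&a[i], stdx::element_aligned);
    simd_t vb(&b[i], stdx::element_aligned);
    (va + vb).copy_to(&c[i], stdx::element_aligned);
  }
  for (; i < n; ++i) c[i] = a[i] + b[i];  // scalar remainder loop
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double*> a("a", n), b("b", n), c("c", n);
    add_vectors(a, b, c);  // runs on the default Kokkos execution space
    Kokkos::fence();       // wait for the (possibly asynchronous) kernel
  }
  Kokkos::finalize();
  return 0;
}

When Kokkos is configured with the CUDA backend, add_vectors executes on the GPU while the SIMD variant remains a host-side code path; this split mirrors the portability argument made in the abstract, where the same high-level kernels are mapped to CPUs and GPUs through the Kokkos and HPX abstractions.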




Acknowledgments

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231. This work used computational resources of the Supercomputer Fugaku provided by RIKEN through the HPCI System Research Project (Project ID: hp210311). A grant from the Swiss National Supercomputing Centre (CSCS) supported this work under Project ID s1078. The authors would like to thank Stony Brook Research Computing and Cyberinfrastructure and the Institute for Advanced Computational Science at Stony Brook University for access to the innovative high-performance Ookami computing system, which was made possible by a $5M National Science Foundation grant (#1927880).

Author information


Corresponding author

Correspondence to Patrick Diehl.

Ethics declarations

Disclaimer

The results on NERSC's Perlmutter were obtained during Phase 1 of the system and should not be taken to reflect or imply the system's final performance. Numerous upgrades planned for Phase 2 will substantially change Perlmutter's final size and network capabilities.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Diehl, P., Daiß, G., Huck, K. et al. Simulating stellar merger using HPX/Kokkos on A64FX on Supercomputer Fugaku. J Supercomput (2024). https://doi.org/10.1007/s11227-024-06113-w



