
Achieving Near-Native Runtime Performance and Cross-Platform Performance Portability for Random Number Generation Through SYCL Interoperability

  • Conference paper
  • In: Accelerator Programming Using Directives (WACCPD 2021)

Abstract

High-performance computing (HPC) is a major driver accelerating scientific research and discovery, from quantum simulations to medical therapeutics. While the increasing availability of HPC resources is in many cases pivotal to successful science, even the largest collaborations lack the computational expertise required for maximal exploitation of current hardware capabilities. The need to maintain multiple platform-specific codebases further complicates matters, potentially adding constraints on the machines that can be utilized. Fortunately, numerous programming models are under development that aim to facilitate portable codes for heterogeneous computing. One in particular is SYCL, an open-standard, C++-based single-source programming paradigm. Among the new features available in the most recent specification, SYCL 2020, is interoperability, a mechanism through which applications and third-party libraries coordinate data sharing and execute collaboratively. In this paper, we leverage the SYCL programming model to demonstrate cross-platform performance portability across heterogeneous resources. We detail our NVIDIA and AMD random number generator extensions to the oneMKL open-source interfaces library. Performance portability is measured relative to platform-specific baseline applications executed on four major hardware platforms using two different compilers supporting SYCL. The utility of our extensions is exemplified in a real-world setting via a high-energy physics simulation application. We show that the performance of implementations that capitalize on SYCL interoperability is on par with that of native implementations, attesting to the cross-platform performance portability of a SYCL-based approach to scientific codes.
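The interoperability mechanism the abstract refers to can be illustrated with a hedged sketch: a SYCL 2020 `host_task` receives an `interop_handle`, from which the native stream backing the SYCL queue is extracted and handed to a native library such as cuRAND. This is a minimal sketch, not the paper's implementation; the backend enumerator name (here `sycl::backend::ext_oneapi_cuda`) is implementation-specific and has varied across DPC++ releases, and the code requires a CUDA-enabled SYCL toolchain to build.

```cpp
#include <sycl/sycl.hpp>
#include <curand.h>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  const std::size_t n = 1 << 20;
  float* d = sycl::malloc_device<float>(n, q);

  // Create a cuRAND generator on the host; the Philox4x32-10 engine
  // matches one of the generators benchmarked in the paper.
  curandGenerator_t gen;
  curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);

  q.submit([&](sycl::handler& cgh) {
     // host_task runs on the host but is scheduled within the SYCL DAG,
     // so the native call is ordered with respect to other queue work.
     cgh.host_task([=](sycl::interop_handle ih) {
       // Obtain the native CUDA stream backing this SYCL queue
       // (backend enumerator name is an implementation detail).
       auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
       curandSetStream(gen, stream);
       curandGenerateUniform(gen, d, n);  // fills d on the device
     });
   }).wait();

  curandDestroyGenerator(gen);
  sycl::free(d, q);
  return 0;
}
```

The same pattern applies to hipRAND on AMD GPUs, with the corresponding HIP backend enumerator and stream type.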


Notes

  1. This is not the case when using unified shared memory, as explained later.

  2. Use of proprietary data that cannot be made publicly available.


Acknowledgement

This work was completed while V.R.P. was at Lawrence Berkeley National Laboratory, and was funded in part by the DOE HEP Center for Computational Excellence at Lawrence Berkeley National Laboratory under B&R KA2401045.

Author information


Corresponding author

Correspondence to Vincent R. Pascuzzi.


Data Availability Statement

1.1 Summary of the Experiments Reported

We ran two benchmark applications on a variety of hardware:

1. Intel Core i7-1080H, Intel UHD Graphics 630 (Razer Blade Studio Edition 2020)
2. AMD Rome 7742, NVIDIA A100 (DGX node from NERSC)
3. MSI Radeon RX Vega 56 (private Intel Xeon Gold 5220 node)

Both applications are freely available (GitHub link below), but the inputs to FastCaloSim are proprietary data of the ATLAS Experiment that we unfortunately cannot share publicly (special access may be granted upon request).

We used Intel LLVM sycl-nightly/20210330, nvcc 10.2, and hipSYCL 0.9.0 for the various targets. oneMKL is used for all RNG, but the hipRAND backend is not publicly available due to DOE restrictions on software developed by employees; we are happy to make arrangements for it to be made available.

1.2 Artifact Availability

Software Artifact Availability: All author-created software artifacts are maintained in a public repository under an OSI-approved license.

Hardware Artifact Availability: All author-created hardware artifacts are maintained in a public repository under an OSI-approved license.

Data Artifact Availability: Some author-created data artifacts are NOT maintained in a public repository or are NOT available under an OSI-approved license.

Proprietary Artifacts: There are associated proprietary artifacts that are not created by the authors. Some author-created artifacts are proprietary.

List of URLs and/or DOIs where artifacts are available:

figure c

1.3 Baseline Experimental Setup, and Modifications Made for the Paper

Relevant hardware details: DGX A100, Intel Core i7-1080H, Intel UHD Graphics 630, MSI Radeon RX Vega 56, NVIDIA A100, Intel Xeon Gold 5220

Operating systems and versions: Ubuntu 20.04 with kernel 5.8.18, OpenSUSE 15.0 with kernel 4.12, CentOS7 with kernel 3.10

Compilers and versions: GNU 8.2, nvcc 10.2, hipSYCL 0.9.0, Clang 12.0.0

Libraries and versions: oneMKL v0.1.0, CUDA 10.2.89, HIP 4.0

Key algorithms: Philox4\(\times \)32-10, MRG32k3a

Input datasets and versions: ATLAS FastCaloSim single-electron and top-antitop quark n-tuple inputs

Paper Modifications: We added random number generator (RNG) support for AMD (hipRAND) and NVIDIA (cuRAND) GPUs to the oneMKL open-source interfaces library through SYCL interoperability. This provides a single entry point for executing scientific and other codes that utilize RNGs on a wide range of available HPC systems.
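The "single entry point" means user code calls one portable API regardless of which vendor library serves as the backend. A minimal sketch against the oneMKL open-source interfaces RNG domain, assuming a DPC++ toolchain and an appropriate backend linked at build time (the seed value and buffer size are arbitrary illustrations):

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/mkl/rng.hpp>

int main() {
  // The same code targets Intel, NVIDIA, or AMD devices; the backend
  // (MKL, cuRAND, or hipRAND) is selected when the library is built/linked.
  sycl::queue q{sycl::gpu_selector_v};
  const std::size_t n = 1 << 20;
  float* r = sycl::malloc_shared<float>(n, q);

  // Philox4x32-10 engine and uniform distribution on [0, 1),
  // as benchmarked in the paper.
  oneapi::mkl::rng::philox4x32x10 engine(q, /*seed=*/777);
  oneapi::mkl::rng::uniform<float> dist(0.0f, 1.0f);

  // generate() enqueues the fill and returns a sycl::event.
  oneapi::mkl::rng::generate(dist, engine, n, r).wait();

  sycl::free(r, q);
  return 0;
}
```

Swapping `philox4x32x10` for `mrg32k3a` exercises the other generator reported in the key-algorithms list above.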

Output from scripts that gather execution environment information:

figure d

1.4 Artifact Evaluation

Verification and validation studies: Each experiment was run hundreds of times over the course of several weeks to account for day-to-day and operational fluctuations of the systems used for benchmarking.

Accuracy and precision of timings: Each experiment was run hundreds of times over the course of several weeks to account for day-to-day and operational fluctuations of the systems used for benchmarking.

Used manufactured solutions or spectral properties: N/A

Quantified the sensitivity of results to initial conditions and/or parameters of the computational environment: Each experiment was run hundreds of times over the course of several weeks to account for day-to-day and operational fluctuations of the systems used for benchmarking.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Pascuzzi, V.R., Goli, M. (2022). Achieving Near-Native Runtime Performance and Cross-Platform Performance Portability for Random Number Generation Through SYCL Interoperability. In: Bhalachandra, S., Daley, C., Melesse Vergara, V. (eds) Accelerator Programming Using Directives. WACCPD 2021. Lecture Notes in Computer Science(), vol 13194. Springer, Cham. https://doi.org/10.1007/978-3-030-97759-7_2


  • DOI: https://doi.org/10.1007/978-3-030-97759-7_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-97758-0

  • Online ISBN: 978-3-030-97759-7

  • eBook Packages: Computer Science; Computer Science (R0)
