Abstract
High-performance computing (HPC) is a major driver accelerating scientific research and discovery, from quantum simulations to medical therapeutics. While the increasing availability of HPC resources is in many cases pivotal to successful science, even the largest collaborations lack the computational expertise required for maximal exploitation of current hardware capabilities. The need to maintain multiple platform-specific codebases further complicates matters, potentially adding constraints on machines that can be utilized. Fortunately, numerous programming models are under development that aim to facilitate portable codes for heterogeneous computing. One in particular is SYCL, an open standard, C++-based single-source programming paradigm. Among the new features available in the most recent specification, SYCL 2020, is interoperability, a mechanism through which applications and third-party libraries coordinate sharing data and execute collaboratively. In this paper, we leverage the SYCL programming model to demonstrate cross-platform performance portability across heterogeneous resources. We detail our NVIDIA and AMD random number generator extensions to the oneMKL open-source interfaces library. Performance portability is measured relative to platform-specific baseline applications executed on four major hardware platforms using two different compilers supporting SYCL. The utility of our extensions is exemplified in a real-world setting via a high-energy physics simulation application. We show that the performance of implementations capitalizing on SYCL interoperability is on par with that of native implementations, attesting to the cross-platform performance portability of a SYCL-based approach to scientific codes.
Notes
1. This is not the case when using unified shared memory, as explained later.
2. Use of proprietary data that cannot be made publicly available.
Acknowledgement
This work was completed while V.R.P. was at Lawrence Berkeley National Laboratory, and was funded in part by the DOE HEP Center for Computational Excellence at Lawrence Berkeley National Laboratory under B&R KA2401045.
Data Availability Statement
1.1 Summary of the Experiments Reported
We ran two benchmark applications on a variety of hardware:
1. Intel Core i7-1080H, Intel UHD Graphics 630 (Razer Blade Studio Edition 2020)
2. AMD Rome 7742, NVIDIA A100 (DGX node from NERSC)
3. MSI Radeon RX Vega 56 (private Intel Xeon Gold 5220 node)
Both applications are freely available (GitHub link below), but the inputs to FastCaloSim are proprietary data of the ATLAS Experiment that we unfortunately cannot share publicly (special access may be granted upon request).
We used Intel LLVM sycl-nightly/20210330, nvcc 10.2, and hipSYCL 0.9.0 for the various targets. oneMKL is used for all RNG, but the hipRAND backend is not publicly available due to DOE restrictions on software developed by employees; we are happy to make arrangements for it to be made available.
1.2 Artifact Availability
Software Artifact Availability: All author-created software artifacts are maintained in a public repository under an OSI-approved license.
Hardware Artifact Availability: All author-created hardware artifacts are maintained in a public repository under an OSI-approved license.
Data Artifact Availability: Some author-created data artifacts are NOT maintained in a public repository or are NOT available under an OSI-approved license.
Proprietary Artifacts: There are associated proprietary artifacts that are not created by the authors. Some author-created artifacts are proprietary.
List of URLs and/or DOIs where artifacts are available:
1.3 Baseline Experimental Setup, and Modifications Made for the Paper
Relevant hardware details: DGX A100, Intel Core i7-1080H, Intel UHD Graphics 630, MSI Radeon RX Vega 56, NVIDIA A100, Intel Xeon Gold 5220
Operating systems and versions: Ubuntu 20.04 with kernel 5.8.18, OpenSUSE 15.0 with kernel 4.12, CentOS7 with kernel 3.10
Compilers and versions: GNU 8.2, nvcc 10.2, hipSYCL 0.9.0, Clang 12.0.0
Libraries and versions: oneMKL v0.1.0, CUDA 10.2.89, hip 4.0
Key algorithms: Philox4x32-10, MRG32k3a
Input datasets and versions: ATLAS FastCaloSim single-electron and top-antitop quark n-tuple inputs
Paper Modifications: We added random number generator (RNG) support for AMD (hipRAND) and NVIDIA (cuRAND) GPUs to the oneMKL open-source interfaces library through SYCL interoperability. This provides a single entry point through which scientific and other codes that utilize RNGs can execute on a wide range of available HPC systems.
Output from scripts that gather execution environment information
1.4 Artifact Evaluation
Verification and validation studies: Each experiment was run hundreds of times over the course of several weeks to validate day-to-day and operational fluctuations of the systems used for benchmarking.
Accuracy and precision of timings: Each experiment was run hundreds of times over the course of several weeks to validate day-to-day and operational fluctuations of the systems used for benchmarking.
Used manufactured solutions or spectral properties: N/A
Quantified the sensitivity of results to initial conditions and/or parameters of the computational environment: Each experiment was run hundreds of times over the course of several weeks to validate day-to-day and operational fluctuations of the systems used for benchmarking.
© 2022 Springer Nature Switzerland AG
Cite this paper
Pascuzzi, V.R., Goli, M. (2022). Achieving Near-Native Runtime Performance and Cross-Platform Performance Portability for Random Number Generation Through SYCL Interoperability. In: Bhalachandra, S., Daley, C., Melesse Vergara, V. (eds) Accelerator Programming Using Directives. WACCPD 2021. Lecture Notes in Computer Science(), vol 13194. Springer, Cham. https://doi.org/10.1007/978-3-030-97759-7_2
Print ISBN: 978-3-030-97758-0
Online ISBN: 978-3-030-97759-7