Abstract
High-performance computing (HPC) is a major driver accelerating scientific research and discovery, from quantum simulations to medical therapeutics. While the increasing availability of HPC resources is in many cases pivotal to successful science, even the largest collaborations lack the computational expertise required for maximal exploitation of current hardware capabilities. The need to maintain multiple platform-specific codebases further complicates matters, potentially adding constraints on machines that can be utilized. Fortunately, numerous programming models are under development that aim to facilitate portable codes for heterogeneous computing. One in particular is SYCL, an open standard, C++-based single-source programming paradigm. Among the new features available in the most recent specification, SYCL 2020, is interoperability, a mechanism through which applications and third-party libraries coordinate sharing data and execute collaboratively. In this paper, we leverage the SYCL programming model to demonstrate cross-platform performance portability across heterogeneous resources. We detail our NVIDIA and AMD random number generator extensions to the oneMKL open-source interfaces library. Performance portability is measured relative to platform-specific baseline applications executed on four major hardware platforms using two different compilers supporting SYCL. The utility of our extensions is exemplified in a real-world setting via a high-energy physics simulation application. We show that the performance of implementations capitalizing on SYCL interoperability is on par with that of native implementations, attesting to the cross-platform performance portability of a SYCL-based approach to scientific codes.
Notes
1. This is not the case when using unified shared memory, as explained later.
2. Use of proprietary data that cannot be made publicly available.
Acknowledgement
This work was completed while V.R.P. was at Lawrence Berkeley National Laboratory, and was funded in part by the DOE HEP Center for Computational Excellence at Lawrence Berkeley National Laboratory under B&R KA2401045.
Data Availability Statement
1.1 Summary of the Experiments Reported
We ran two benchmark applications on a variety of hardware:
1. Intel Core i7-1080H, Intel UHD Graphics 630 (Razer Blade Studio Edition 2020)
2. AMD Rome 7742, NVIDIA A100 (DGX node from NERSC)
3. MSI Radeon RX Vega 56 (private Intel Xeon Gold 5220 node)
Both applications are freely available (GitHub link below), but the inputs to FastCaloSim are proprietary data of the ATLAS Experiment that we unfortunately cannot share publicly (special access may be granted upon request).
We used Intel LLVM sycl-nightly/20210330, nvcc 10.2, and hipSYCL 0.9.0 for the various targets. oneMKL is used for all RNG, but the hipRAND backend is not publicly available due to DOE restrictions on software developed by employees; we are happy to make arrangements for it to be made available.
1.2 Artifact Availability
Software Artifact Availability: All author-created software artifacts are maintained in a public repository under an OSI-approved license.
Hardware Artifact Availability: All author-created hardware artifacts are maintained in a public repository under an OSI-approved license.
Data Artifact Availability: Some author-created data artifacts are NOT maintained in a public repository or are NOT available under an OSI-approved license.
Proprietary Artifacts: There are associated proprietary artifacts that are not created by the authors. Some author-created artifacts are proprietary.
List of URLs and/or DOIs where artifacts are available:
1.3 Baseline Experimental Setup, and Modifications Made for the Paper
Relevant hardware details: DGX A100, Intel Core i7-1080H, Intel UHD Graphics 630, MSI Radeon RX Vega 56, NVIDIA A100, Intel Xeon Gold 5220
Operating systems and versions: Ubuntu 20.04 with kernel 5.8.18, OpenSUSE 15.0 with kernel 4.12, CentOS7 with kernel 3.10
Compilers and versions: GNU 8.2, nvcc 10.2, hipSYCL 0.9.0, Clang 12.0.0
Libraries and versions: oneMKL v0.1.0, CUDA 10.2.89, hip 4.0
Key algorithms: Philox4x32-10, MRG32k3a
Input datasets and versions: ATLAS FastCaloSim single-electron and top-antitop quark n-tuple inputs
Paper Modifications: We added random number generator (RNG) support for AMD (hipRAND) and NVIDIA (cuRAND) GPUs to the oneMKL open-source interfaces library through SYCL interoperability. This provides a single entry point through which scientific and other codes that utilize RNGs can execute on a wide range of available HPC systems.
Output from scripts that gather execution environment information
1.4 Artifact Evaluation
Verification and validation studies: Each experiment was run hundreds of times over the course of several weeks to validate day-to-day and operational fluctuations of the systems used for benchmarking.
Accuracy and precision of timings: Each experiment was run hundreds of times over the course of several weeks to validate day-to-day and operational fluctuations of the systems used for benchmarking.
Used manufactured solutions or spectral properties: N/A
Quantified the sensitivity of results to initial conditions and/or parameters of the computational environment: Each experiment was run hundreds of times over the course of several weeks to validate day-to-day and operational fluctuations of the systems used for benchmarking.
© 2022 Springer Nature Switzerland AG
Cite this paper
Pascuzzi, V.R., Goli, M. (2022). Achieving Near-Native Runtime Performance and Cross-Platform Performance Portability for Random Number Generation Through SYCL Interoperability. In: Bhalachandra, S., Daley, C., Melesse Vergara, V. (eds) Accelerator Programming Using Directives. WACCPD 2021. Lecture Notes in Computer Science(), vol 13194. Springer, Cham. https://doi.org/10.1007/978-3-030-97759-7_2
Print ISBN: 978-3-030-97758-0
Online ISBN: 978-3-030-97759-7