oclCUB: an OpenCL parallel computing library for deep learning operators

  • Regular Paper
  • Published in CCF Transactions on High Performance Computing

Abstract

Deep learning (DL) frameworks rely on parallel computing libraries to accelerate model training. The underlying computations of DL operators typically include essential primitives such as reduction and prefix scan, whose efficiency can be greatly improved on parallel accelerators. However, acceleration of these computations is mainly provided by collective-primitive libraries such as NVIDIA CUB and AMD hipCUB, which run only on their vendors' hardware because the computing ecosystems of different vendors are highly segregated. To address this issue, we propose oclCUB, an OpenCL parallel computing library that runs on different heterogeneous platforms. oclCUB abstracts the OpenCL execution environment, implements the reusable underlying computations common to DL, and provides two types of interfaces matched to the heterogeneous acceleration patterns of operators, enabling users to design and optimize DL operators efficiently. We evaluate oclCUB on four hardware accelerators: an NVIDIA Tesla V100S with OpenCL 1.2, an AMD Radeon Pro V520 with OpenCL 2.0, an MT-3000 with MOCL 3, and a Kunpeng 920 with POCL 1.6. Our experiments show that oclCUB-based operators produce correct results on all four platforms. The results also demonstrate that oclCUB maintains a small, acceptable performance gap relative to CUB and performs comparably to hipCUB.
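
The full text sits behind the journal paywall, so oclCUB's actual interfaces are not shown on this page. The abstract does, however, name reduction and prefix scan as the core primitives the library accelerates. As a rough illustration of such a primitive, and not of oclCUB's real API, the following is a minimal portable work-group reduction written against OpenCL 1.2 (the oldest runtime version in the evaluation); the kernel and argument names are hypothetical.

__kernel void block_reduce_sum(__global const float *in,
                               __global float *partials,
                               __local float *scratch,
                               const unsigned int n)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    /* Each work-item loads one element, padding out-of-range
       items with the identity of the sum (0). */
    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction in local memory; assumes the work-group
       size is a power of two. The active half shrinks each step. */
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Work-item 0 writes one partial sum per work-group; a second
       kernel launch (or the host) reduces the partials to the final value. */
    if (lid == 0)
        partials[get_group_id(0)] = scratch[0];
}

Because this sketch uses only local memory and barriers from OpenCL 1.2, it would run unchanged on all four platforms in the evaluation; what a primitive library such as CUB, hipCUB, or oclCUB adds over this pattern is architecture-aware tuning of work-group sizes, memory access, and instruction selection.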

References

  • Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)

  • Adinets, A., Merrill, D.: Onesweep: a faster least significant digit radix sort for GPUs. arXiv preprint arXiv:2206.01784 (2022)

  • AMD ROCm: A thin wrapper library on top of rocPRIM or CUB. https://github.com/ROCmSoftwarePlatform/hipCUB (2019a)

  • AMD ROCm: A C++ Runtime API and Kernel Language. https://github.com/ROCm-Developer-Tools/HIP (2019b)

  • AMD ROCm: AMD ROCm Platform Documentation. https://rocmdocs.amd.com/ (2022a)

  • AMD ROCm: A header-only library providing HIP parallel primitives. https://github.com/ROCmSoftwarePlatform/rocPRIM (2022b)

  • Bell, N., Hoberock, J.: Thrust: a productivity-oriented library for CUDA. In: GPU Computing Gems Jade Edition, pp. 359–371. Morgan Kaufmann (2012)

  • Cao, C., et al.: clMAGMA: high performance dense linear algebra with OpenCL. In: Proceedings of the International Workshop on OpenCL 2013 & 2014 (2014)

  • Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)

  • Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)

  • NVIDIA: CUBLAS library. NVIDIA Corporation, Santa Clara (2008)

  • Dagum, L., Menon, R.: OpenMP: an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)

  • Fang, J., Varbanescu, A.L., Sips, H.: A comprehensive performance comparison of CUDA and OpenCL. In: 2011 International Conference on Parallel Processing. IEEE (2011)

  • Intel: oneAPI Deep Neural Network Library. https://github.com/oneapi-src/oneDNN (2019)

  • Jääskeläinen, P., de La Lama, C.S., Schnetter, E., et al.: pocl: A performance-portable OpenCL implementation. Int. J. Parallel Prog. 43, 752–785 (2015)

  • Khan, J., et al.: MIOpen: an open source library for deep learning primitives. arXiv preprint arXiv:1910.00078 (2019)

  • Kirk, D.: NVIDIA CUDA software and GPU parallel computing architecture. In: ISMM. Vol. 7 (2007)

  • Komatsu, K., et al.: Evaluating performance and portability of OpenCL programs. In: The Fifth International Workshop on Automatic Performance Tuning. Vol. 66 (2010)

  • Lu, K., Wang, Y., Guo, Y., et al.: MT-3000: a heterogeneous multi-zone processor for HPC. CCF Trans. High Perform. Comput. 4(2), 150–164 (2022)

  • Martín, P.J., Ayuso, L.F., Torres, R., et al.: Algorithmic strategies for optimizing the parallel reduction primitive in CUDA. In: 2012 International Conference on High Performance Computing & Simulation (HPCS). IEEE, pp. 511–519 (2012)

  • Merrill, D.: CUB v1.5.3: CUDA Unbound, a library of warp-wide, block-wide, and device-wide GPU parallel primitives. NVIDIA Research (2015)

  • Nichols, D., et al.: MagmaDNN: accelerated deep learning using MAGMA. In: Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) (2019)

  • Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (2019)

  • Pheatt, C.: Intel® threading building blocks. J. Comput. Sci. Coll. 23(4), 298–298 (2008)

  • Rupp, K., et al.: ViennaCL: linear algebra library for multi- and many-core architectures. SIAM J. Sci. Comput. 38(5), S412–S439 (2016)

  • Stone, J.E., Gohara, D., Shi, G.: OpenCL: A parallel programming standard for heterogeneous computing systems. Comput. Sci. Eng. 12(3), 66 (2010)

  • Zhang, P., Fang, J., Yang, C., et al.: Mocl: an efficient OpenCL implementation for the matrix-2000 architecture. In: Proceedings of the 15th ACM International Conference on Computing Frontiers, pp. 26–35 (2018)

Acknowledgements

This research is supported by the National Key R&D Program of China under grant 2021YFB0300104, and by the Tianjin Research Innovation Project for Postgraduate Students under grant 2022BKY023.

Author information

Correspondence to Yufei Sun.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shi, C., Sun, Y., Sui, Y. et al. oclCUB: an OpenCL parallel computing library for deep learning operators. CCF Trans. HPC (2024). https://doi.org/10.1007/s42514-024-00181-3
