oclCUB: an OpenCL parallel computing library for deep learning operators

  • Regular Paper
  • Published in CCF Transactions on High Performance Computing

Abstract

Deep learning (DL) frameworks rely on parallel computing libraries to accelerate model training. The underlying computations of DL operators typically include essential primitives such as reduction and prefix scan, whose efficiency can be greatly improved on parallel accelerators. However, acceleration of these computations is mainly provided by collective-primitive libraries such as NVIDIA CUB and AMD hipCUB, which run only on their vendors' hardware because the computing ecosystems of different vendors are highly segregated. To address this issue, we propose oclCUB, an OpenCL parallel computing library that runs on different heterogeneous platforms. oclCUB abstracts the OpenCL execution environment, implements the reusable underlying computations common to DL, and provides two types of interfaces matched to the heterogeneous acceleration patterns of operators, enabling users to design and optimize DL operators efficiently. We evaluate oclCUB on four hardware accelerators: an NVIDIA Tesla V100S with OpenCL 1.2, an AMD Radeon Pro V520 with OpenCL 2.0, an MT-3000 with MOCL 3, and a Kunpeng 920 with POCL 1.6. Our experiments show that oclCUB-based operators produce correct results on all four platforms. The results also demonstrate that oclCUB maintains a small, acceptable performance gap relative to CUB and performs comparably to hipCUB.
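
The full text sits behind the journal paywall, so oclCUB's actual interfaces are not shown on this page. The abstract does, however, name reduction and prefix scan as the core primitives the library accelerates. As a rough illustration of such a primitive, and not of oclCUB's real API, the following is a minimal portable work-group reduction written against OpenCL 1.2 (the oldest runtime version in the evaluation); the kernel and argument names are hypothetical.

__kernel void block_reduce_sum(__global const float *in,
                               __global float *partials,
                               __local float *scratch,
                               const unsigned int n)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    /* Each work-item loads one element, padding out-of-range
       items with the identity of the sum (0). */
    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Tree reduction in local memory; assumes the work-group
       size is a power of two. The active half shrinks each step. */
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* Work-item 0 writes one partial sum per work-group; a second
       kernel launch (or the host) reduces the partials to the final value. */
    if (lid == 0)
        partials[get_group_id(0)] = scratch[0];
}

Because this sketch uses only local memory and barriers from OpenCL 1.2, it would run unchanged on all four platforms in the evaluation; what a primitive library such as CUB, hipCUB, or oclCUB adds over this pattern is architecture-aware tuning of work-group sizes, memory access, and instruction selection.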

References

  • Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)

  • Adinets, A., Merrill, D.: Onesweep: a faster least significant digit radix sort for GPUs. arXiv preprint arXiv:2206.01784 (2022)

  • AMD ROCm: A thin wrapper library on top of rocPRIM or CUB. https://github.com/ROCmSoftwarePlatform/hipCUB (2019a)

  • AMD ROCm: A C++ Runtime API and Kernel Language. https://github.com/ROCm-Developer-Tools/HIP (2019b)

  • AMD ROCm: AMD ROCm Platform Documentation. https://rocmdocs.amd.com/ (2022a)

  • AMD ROCm: A header-only library providing HIP parallel primitives. https://github.com/ROCmSoftwarePlatform/rocPRIM (2022b)

  • Bell, N., Hoberock, J.: Thrust: a productivity-oriented library for CUDA. In: GPU Computing Gems Jade Edition, pp. 359–371. Morgan Kaufmann (2012)

  • Cao, C., et al.: clMAGMA: high performance dense linear algebra with OpenCL. In: Proceedings of the International Workshop on OpenCL 2013 & 2014 (2014)

  • Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015)

  • Chetlur, S., et al.: cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)

  • NVIDIA: CUBLAS library. NVIDIA Corporation, Santa Clara (2008)

  • Dagum, L., Menon, R.: OpenMP: an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)

  • Fang, J., Varbanescu, A.L., Sips, H.: A comprehensive performance comparison of CUDA and OpenCL. In: 2011 International Conference on Parallel Processing. IEEE (2011)

  • Intel: oneAPI Deep Neural Network Library. https://github.com/oneapi-src/oneDNN (2019)

  • Jääskeläinen, P., de La Lama, C.S., Schnetter, E., et al.: pocl: A performance-portable OpenCL implementation. Int. J. Parallel Prog. 43, 752–785 (2015)

  • Khan, J., et al.: MIOpen: an open source library for deep learning primitives. arXiv preprint arXiv:1910.00078 (2019)

  • Kirk, D.: NVIDIA CUDA software and GPU parallel computing architecture. In: ISMM. Vol. 7 (2007)

  • Komatsu, K., et al.: Evaluating performance and portability of OpenCL programs. In: The Fifth International Workshop on Automatic Performance Tuning. Vol. 66 (2010)

  • Lu, K., Wang, Y., Guo, Y., et al.: MT-3000: a heterogeneous multi-zone processor for HPC. CCF Trans. High Perform. Comput. 4(2), 150–164 (2022)

  • Martín, P.J., Ayuso, L.F., Torres, R., et al.: Algorithmic strategies for optimizing the parallel reduction primitive in CUDA. In: 2012 International Conference on High Performance Computing & Simulation (HPCS). IEEE, pp. 511–519 (2012)

  • Merrill, D.: CUB v1.5.3: CUDA Unbound, a library of warp-wide, block-wide, and device-wide GPU parallel primitives. NVIDIA Research (2015)

  • Nichols, D., et al.: MagmaDNN: accelerated deep learning using MAGMA. In: Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) (2019)

  • Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (2019)

  • Pheatt, C.: Intel® threading building blocks. J. Comput. Sci. Coll. 23(4), 298–298 (2008)

  • Rupp, K., et al.: ViennaCL: linear algebra library for multi- and many-core architectures. SIAM J. Sci. Comput. 38(5), S412–S439 (2016)

  • Stone, J.E., Gohara, D., Shi, G.: OpenCL: A parallel programming standard for heterogeneous computing systems. Comput. Sci. Eng. 12(3), 66 (2010)

  • Zhang, P., Fang, J., Yang, C., et al.: Mocl: an efficient OpenCL implementation for the matrix-2000 architecture. In: Proceedings of the 15th ACM International Conference on Computing Frontiers, pp. 26–35 (2018)

Acknowledgements

This research is supported by the National Key R&D Program of China under grant 2021YFB0300104, and by the Tianjin Research Innovation Project for Postgraduate Students under grant 2022BKY023.

Author information

Correspondence to Yufei Sun.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shi, C., Sun, Y., Sui, Y. et al. oclCUB: an OpenCL parallel computing library for deep learning operators. CCF Trans. HPC (2024). https://doi.org/10.1007/s42514-024-00181-3
