Enhancing the Programmability and Performance Portability of GPU Tensor Operations

  • Arya Mazaheri
  • Johannes Schulte
  • Matthew W. Moskewicz
  • Felix Wolf
  • Ali Jannesari
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11725)

Abstract

Deep-learning models with convolutional networks are widely used for many artificial-intelligence tasks, thanks to the increasing adoption of high-throughput GPUs, even in mobile phones. CUDA and OpenCL are the two most widely used programming interfaces for accessing the computing power of GPUs. However, attaining code portability has always been a challenge, until the introduction of the Vulkan API. Even so, performance portability is not guaranteed. In this paper, we investigate the unique characteristics of CUDA, OpenCL, and Vulkan kernels and propose a method for abstracting away their syntactic differences. This abstraction creates a single-source kernel from which we generate code for each GPU programming interface. In addition, we expose auto-tuning parameters to further enhance performance portability. We implemented a selection of convolution operations, covering the core operations needed for deploying three common image-processing neural networks, and tuned them for NVIDIA, AMD, and ARM Mali GPUs. Our experiments show that we can generate deep-learning kernels with minimal effort for new platforms and achieve reasonable performance. Specifically, our Vulkan backend provides competitive performance compared to vendor deep-learning libraries.
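To make the single-source idea concrete, the sketch below shows one simple way such an abstraction could work: a kernel template in which backend-specific constructs, such as kernel qualifiers and thread-index intrinsics, are substituted at code-generation time. This is a hypothetical illustration, not the paper's actual implementation; the template, backend tables, and `generate_kernel` function are assumptions, and a real system (as described in the paper, including the Vulkan/GLSL backend) requires much deeper abstractions than pure text substitution.

```python
# Hypothetical sketch of a single-source GPU kernel abstraction:
# one template, per-backend syntax tables, and a generator that
# instantiates the template for CUDA or OpenCL.

KERNEL_TEMPLATE = """\
{kernel_qualifier} void add_bias({global_qualifier} float* out,
                                 {global_qualifier} const float* bias,
                                 int n) {{
    int i = {global_id};
    if (i < n) {{
        out[i] += bias[i];
    }}
}}
"""

# Syntax tables mapping abstract constructs to backend-specific spellings.
BACKENDS = {
    "cuda": {
        "kernel_qualifier": "__global__",
        "global_qualifier": "",
        "global_id": "blockIdx.x * blockDim.x + threadIdx.x",
    },
    "opencl": {
        "kernel_qualifier": "__kernel",
        "global_qualifier": "__global",
        "global_id": "get_global_id(0)",
    },
}

def generate_kernel(backend: str) -> str:
    """Instantiate the single-source template for one backend."""
    return KERNEL_TEMPLATE.format(**BACKENDS[backend])

if __name__ == "__main__":
    print(generate_kernel("cuda"))
    print(generate_kernel("opencl"))
```

Auto-tuning parameters such as tile sizes or unroll factors could be exposed in the same way, as additional template substitutions whose values are chosen per target GPU.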

Keywords

GPU · Deep learning · Performance portability

Acknowledgment

This research has been supported by the Klaus Tschira Foundation, the Hessian LOEWE initiative within the Software-Factory 4.0 project, and the German Research Foundation (DFG) through the Program Performance Engineering for Scientific Software.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Arya Mazaheri¹ (corresponding author)
  • Johannes Schulte¹
  • Matthew W. Moskewicz²
  • Felix Wolf¹
  • Ali Jannesari³

  1. Technische Universität Darmstadt, Darmstadt, Germany
  2. Deepscale Inc., Mountain View, USA
  3. Iowa State University, Ames, USA