OpenMP Task Generation for Batched Kernel APIs

  • Jinpil Lee
  • Yutaka Watanabe
  • Mitsuhisa Sato
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11718)


The demand for computing many small kernels is growing rapidly in the HPC area, not only in traditional numerical applications but also in recent machine-learning workloads. Many-core accelerators such as GPUs are power-efficient compute platforms, but exploiting them requires substantial code modification. Batched kernel APIs such as batched BLAS can schedule numerical kernels efficiently on the target hardware, yet they still demand manual code changes. In this paper, we propose a code-translation technique that generates calls to batched kernel APIs from a high-level programming model. We use OpenMP task parallelism to specify dependencies among numerical kernels: the user adds task directives so that the compiler can recognize the kernels. The compiler detects conventional numerical kernels in the code and assigns each a unique batch ID; when the task runtime encounters tasks with the same batch ID, it merges them into a batch. The current implementation supports NVIDIA GPUs and the batched BLAS routines in cuBLAS: DGEMM kernels are detected and translated into batched DGEMM. A trivial DGEMM loop and a blocked Cholesky decomposition code are used for performance evaluation. The results show that batched DGEMM improves performance when the matrices are small and the number of DGEMM kernels is large. With batched DGEMM, the time spent in DGEMMs of blocked Cholesky decomposition is 4 times shorter than with sequential execution (\(4096 \times 4096\) matrix, tile size 128); however, the overall performance improves by only 36% because of task/batch management overhead.
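A minimal, hypothetical sketch of the source-level pattern the proposed translator targets: independent DGEMM kernels expressed as OpenMP tasks with `depend` clauses. In the paper's scheme, tasks of the same kernel shape would receive the same batch ID and be merged by the runtime into one batched cuBLAS call (e.g. `cublasDgemmBatched`); here a naive `dgemm_naive` stands in for the BLAS call so the example is self-contained, and all function names, shapes, and batch sizes are illustrative, not taken from the paper.

```c
#include <assert.h>

#define N 4        /* small matrix dimension */
#define BATCH 8    /* number of independent DGEMM tasks */

/* Naive stand-in for DGEMM: C = A * B, row-major N x N matrices.
 * A real program would call cblas_dgemm / cuBLAS here. */
static void dgemm_naive(const double *A, const double *B, double *C) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i * N + k] * B[k * N + j];
            C[i * N + j] = s;
        }
}

/* Each iteration becomes an OpenMP task. The first element of each
 * matrix serves as the dependency handle (a common OpenMP idiom);
 * tasks with identical kernel shapes are candidates for one batch. */
void batched_dgemm_tasks(double A[][N * N], double B[][N * N],
                         double C[][N * N]) {
    #pragma omp parallel
    #pragma omp single
    {
        for (int b = 0; b < BATCH; b++) {
            #pragma omp task firstprivate(b) \
                depend(in: A[b][0], B[b][0]) depend(out: C[b][0])
            dgemm_naive(A[b], B[b], C[b]);
        }
        #pragma omp taskwait
    }
}
```

Compiled without `-fopenmp` the pragmas are ignored and the loop runs sequentially, which makes the sequential baseline and the task version share one source, mirroring the incremental-porting goal of the approach.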


Keywords: OpenMP · Task parallelism · Accelerator · Batched BLAS



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. RIKEN Center for Computational Science, Kobe, Japan
  2. Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
