
Optimizing CUDA code by kernel fusion: application on BLAS


Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory bound and cannot exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into faster but distributed on-chip memory. In this paper, we show how kernels performing map, reduce, or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusion. Compared with similar sequences using CUBLAS, our compiler generates code that is up to 2.24× faster for the examples tested.
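To make the idea of fusing a map with a reduce concrete, the following is a minimal CPU-side sketch in C++, not output of the compiler described here. It assumes a hypothetical two-step BLAS-1 sequence, z = a·x + y followed by the dot product z·z; the function names `axpy`, `dot`, and `axpy_dot_fused` are illustrative only. In the unfused version, z is written to memory by the first routine and read back by the second; in the fused version, each element of z is consumed while it is still in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused step 1 (map): z = a * x + y. Every z[i] is written to memory.
void axpy(float a, const std::vector<float>& x, const std::vector<float>& y,
          std::vector<float>& z) {
    assert(x.size() == y.size() && x.size() == z.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        z[i] = a * x[i] + y[i];
    }
}

// Unfused step 2 (reduce): dot product. Both operands are read from memory again.
float dot(const std::vector<float>& u, const std::vector<float>& v) {
    assert(u.size() == v.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < u.size(); ++i) {
        acc += u[i] * v[i];
    }
    return acc;
}

// Fused (hypothetical): one pass computes z and accumulates z . z.
// The freshly computed element zi is reused from a local variable
// instead of being reloaded from the z array.
float axpy_dot_fused(float a, const std::vector<float>& x,
                     const std::vector<float>& y, std::vector<float>& z) {
    assert(x.size() == y.size() && x.size() == z.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float zi = a * x[i] + y[i];
        z[i] = zi;        // still stored, in case z is needed later
        acc += zi * zi;   // consumed immediately by the reduction
    }
    return acc;
}
```

On a GPU the same transformation replaces two kernel launches with one, and the intermediate vector travels through registers or shared memory rather than global memory; whether that pays off in practice depends on the kernels involved.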




  1. The first CUDA processor, G80, has a flop-to-word ratio of about 24; GT200 has 27, GF110 has 33, GK110 has 63, and GM204 has 82.

  2. For more details about CUDA, we refer to [17].

  3. Some programming languages use map only for unary functions and introduce zipwith for n-ary functions.

  4. Data elements can be placed in registers only if their indexing can be determined at compile time [17].

  5. This is trivially fulfilled in the code generation stage, as the outputs of all reductions are used outside of the fusion implementation performing the reduction; thus, the global barrier is performed by finishing the kernel.

  6. For more details about shared memory bank conflicts, we refer to [17].

  7. It is naturally possible to use rectangular tiles, which decrease the reduction overhead. However, such tiles prevent efficient fusion of operations working with a matrix and its transpose.

  8. When the function f performs a reduction on each row of the matrix and the reduction’s result is an input of function g processing the same row, a CPU is able to hold the row in cache and reuse it after the reduction finishes (thus, the outer loops in f and g going over rows are fused, whereas the inner loops are unfused). On a GPU, the row needs to be partitioned among several thread blocks when it is read into on-chip memory by f, so the thread blocks need to be synchronized before the result of the reduction is available. Our compiler performs the synchronization by a new kernel invocation, so all on-chip data are lost before the result of the reduction is available for g, and no row data can be reused. The only way to reuse row data on a GPU is to use persistent threads [10], but it is not clear whether this would have a positive performance impact.
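The CPU-side loop structure described in note 8 can be sketched as follows. This is an illustrative C++ example under stated assumptions, not code from the paper: f is taken to be a row sum, g scales each row by its own sum, and the `Matrix` type and all function names are hypothetical. The outer loop over rows is shared by f and g, while the two inner loops stay separate because g needs the finished reduction result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix used only for this illustration.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // rows * cols elements
};

// f, unfused: reduce every row to its sum (one full pass over the matrix).
std::vector<float> row_sums(const Matrix& a) {
    std::vector<float> s(a.rows, 0.0f);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < a.cols; ++j)
            s[i] += a.data[i * a.cols + j];
    return s;
}

// g, unfused: scale every row by its sum (a second full pass).
void scale_rows(const Matrix& a, const std::vector<float>& s, Matrix& out) {
    assert(s.size() == a.rows && out.data.size() == a.data.size());
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < a.cols; ++j)
            out.data[i * a.cols + j] = a.data[i * a.cols + j] * s[i];
}

// Outer loops fused, inner loops unfused: the row is reduced and then
// immediately reused by g, while it is likely still resident in cache.
void sum_and_scale_rows(const Matrix& a, Matrix& out) {
    assert(out.data.size() == a.data.size());
    for (std::size_t i = 0; i < a.rows; ++i) {
        const float* row = a.data.data() + i * a.cols;
        float* out_row = out.data.data() + i * a.cols;
        float s = 0.0f;
        for (std::size_t j = 0; j < a.cols; ++j) s += row[j];               // f
        for (std::size_t j = 0; j < a.cols; ++j) out_row[j] = row[j] * s;   // g
    }
}
```

The GPU difficulty described in the note corresponds to the point between the two inner loops: when one row is spread over several thread blocks, that point requires a synchronization across blocks.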


  1. Belter G, Jessup ER, Karlin I, Siek JG (2009) Automating the generation of composed linear algebra kernels. In: Proceedings of the conference on high performance computing, networking, storage and analysis (SC09), pp 1–12

  2. Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley RC (2002) An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Softw 28:135–151

  3. Catanzaro B, Garland M, Keutzer K (2011) Copperhead: compiling an embedded data parallel language. In: The 16th ACM symposium on principles and practice of parallel programming (PPoPP)

  4. Cole M (1989) Algorithmic skeletons: structural management of parallel computation. Research monographs in parallel and distributed computing. MIT Press, Cambridge

  5. Dehnavi MM, Fernandez DM, Giannacopoulos D (2011) Enhancing the performance of conjugate gradient solvers on graphic processing units. IEEE Trans Magn 47:1162–1165

  6. Filipovič J, Fousek J, Lakomý B, Madzin M (2012) Automatically optimized GPU acceleration of element subroutines in finite element method. In: Symposium on application accelerators in high-performance computing (SAAHPC)

  7. Fousek J, Filipovič J, Madzin M (2011) Automatic fusions of CUDA-GPU kernels for parallel map. In: Second international workshop on highly-efficient accelerators and reconfigurable technologies (HEART), pp 42–47

  8. González-Vélez H, Leyton M (2010) A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers. Softw Pract Exp 40:1135–1160

  9. Gulati K, Khatri SP (2009) An automated approach for SIMD kernel generation for GPU based software acceleration. In: Symposium on application accelerators in high performance computing (SAAHPC)

  10. Gupta K, Stuart JA, Owens JD (2012) A study of persistent threads style GPU programming for GPGPU workloads. In: Innovative parallel computing

  11. Hoberock J, Bell N (2009) Thrust: a parallel template library

  12. Howell GW, Demmel JW, Fulton CT, Hammarling S, Marmol K (2008) Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans Math Softw (TOMS) 34:1–14

  13. Iverson KE (1962) A programming language. In: Spring joint computer conference (AIEE-IRE)

  14. Larsen B (2011) Simple optimizations for an applicative array language for graphics processors. In: Proceedings of the sixth workshop on declarative aspects of multicore programming (DAMP)

  15. Meng J, Morozov VA, Kumaran K, Vishwanath V, Uram TD (2011) Grophecy: GPU performance projection from CPU code skeletons. In: International conference for high performance computing, networking, storage and analysis (SC11)

  16. Meng J, Morozov VA, Vishwanath V, Kumaran K (2012) Dataflow-driven GPU performance projection for multi-kernel transformations. In: International conference for high performance computing, networking, storage and analysis (SC12)

  17. NVIDIA (2014) CUDA C programming guide, version 6.5

  18. Russell FP, Mellor MR, Kelly PH, Beckmann O (2011) DESOLA: an active linear algebra library using delayed evaluation and runtime code generation. Sci Comput Program 76:227–242

  19. Sato S, Iwasaki H (2009) A skeletal parallel framework with fusion optimizer for GPGPU programming. In: Programming languages and systems, vol 5904 of Lecture Notes in Computer Science. Springer Berlin

  20. Tabik S, Ortega G, Garzón EM (2014) Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. J Supercomput 70:577–587

  21. Tarditi D, Puri S, Oglesby J (2006) Accelerator: using data parallelism to program GPUs for general-purpose uses. SIGARCH Comput Archit News 34

  22. Wahib M, Maruyama N (2014) Scalable kernel fusion for memory-bound GPU applications. In: International conference for high performance computing, networking, storage and analysis (SC14)

  23. Wang G, Lin Y, Yi W (2010) Kernel fusion: an effective method for better power efficiency on multithreaded GPU. In: IEEE/ACM international conference on green computing and communications and international conference on cyber, physical and social computing (GREENCOM–CPSCOM)



This work was supported by the Ministry of Education, Youth and Sport of the Czech Republic under the project “CERIT Scientific Cloud” (No. ED3.2.00/08.0144). The first author was supported by the Ministry of Education, Youth and Sport project CZ.1.07/2.3.00/30.0037 (Employment of Best Young Scientists for International Cooperation Empowerment).

Corresponding author

Correspondence to Jiří Filipovič.


Cite this article

Filipovič, J., Madzin, M., Fousek, J. et al. Optimizing CUDA code by kernel fusion: application on BLAS. J Supercomput 71, 3934–3957 (2015).


  • CUDA
  • BLAS
  • Kernel fusion
  • Code generation