Abstract
Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory bound and cannot fully exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into faster, but distributed, on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusions. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24\(\times \) faster for the examples tested.
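To make the idea concrete, the following hand-written sketch (not the compiler's generated code; the kernel name axpy_dot_fused and all parameter names are illustrative) fuses an AXPY (y = a*x + y) with a subsequent DOT (r = y·z). Unfused, y would be written by one kernel and re-read from global memory by the next; fused, each element of y is reused directly from a register while the per-block partial dot products are reduced in shared memory. The sketch assumes blockDim.x is a power of two, a launch with blockDim.x * sizeof(float) bytes of dynamic shared memory, and a second step that sums the per-block partials.

```cuda
// Hypothetical sketch of fusing AXPY (y = a*x + y) with DOT (r = y . z).
__global__ void axpy_dot_fused(int n, float a,
                               const float *x, float *y, const float *z,
                               float *partial)        // one partial sum per block
{
    extern __shared__ float sdata[];
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    float acc = 0.0f;
    if (i < n) {
        float yi = a * x[i] + y[i];   // map step of AXPY
        y[i]     = yi;                // result still needed in global memory
        acc      = yi * z[i];         // reuse yi from the register for DOT
    }
    sdata[tid] = acc;
    __syncthreads();

    // standard shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}
```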
Notes
The first CUDA-capable GPU, G80, has a flop-to-word ratio of about 24; GT200 has 27, GF110 has 33, GK110 has 63, and GM204 has 82.
For more details about CUDA, we refer to [17].
Some programming languages use map only for unary functions and introduce zipwith for n-ary functions.
Data elements can be placed in registers only if their indexing can be determined at compile time [17].
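A minimal sketch (not from the paper; the kernel and buffer names are made up) of what this rule means in practice: the array indexed only with compile-time constants can be kept in registers, whereas the array indexed with the run-time value k is typically spilled to local (off-chip) memory.

```cuda
// Hypothetical illustration of the compile-time indexing rule.
__global__ void indexing_example(const float *in, float *out, int k)
{
    float reg_buf[4];     // indexed only with compile-time constants
    float local_buf[4];   // also indexed with a run-time value

    #pragma unroll
    for (int j = 0; j < 4; ++j) {
        reg_buf[j]   = in[4 * threadIdx.x + j];
        local_buf[j] = reg_buf[j] * 2.0f;
    }

    // All indices into reg_buf are known at compile time after unrolling,
    // so the compiler can keep it entirely in registers.
    float s = reg_buf[0] + reg_buf[1] + reg_buf[2] + reg_buf[3];

    // local_buf is indexed with the run-time value k, so the compiler
    // typically places it in local (off-chip) memory to make it addressable.
    out[threadIdx.x] = s + local_buf[k & 3];
}
```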
This is trivially fulfilled in the code generation stage, as the outputs of all reductions are used outside of the fusion implementation performing the reduction; thus, the global barrier is realized by finishing the kernel.
For more details about shared memory bank conflicts, we refer to [17].
It is naturally possible to use rectangular tiles, which decrease the reduction overhead. However, such tiles prevent efficient fusion of operations working with a matrix and its transposition.
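For illustration, a hypothetical sketch of what a square tile buys (kernel and parameter names are ours, not the compiler's): one TILE x TILE block of A, loaded into shared memory once, can be traversed row-wise for an operation on A and column-wise for an operation on A^T, which is exactly the access pattern a rectangular tile would break. The sketch accumulates partial row sums of both A and A^T from the same tile; it assumes n is a multiple of TILE, a TILE x TILE thread block, and zero-initialized output vectors.

```cuda
#define TILE 32

// Hypothetical sketch: partial row sums of A and of A^T computed from a single
// square shared-memory tile. The +1 padding keeps the strided accesses in the
// loop below conflict-free (see the previous note). Assumes n is a multiple of
// TILE, blockDim = (TILE, TILE), and zero-initialized rowSumA/rowSumAt.
__global__ void row_sums_A_and_At(const float *A, int n,
                                  float *rowSumA, float *rowSumAt)
{
    __shared__ float tile[TILE][TILE + 1];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = A[row * n + col];   // coalesced load
    __syncthreads();

    if (threadIdx.y == 0) {                // one warp finishes the tile
        float sA = 0.0f, sAt = 0.0f;
        for (int j = 0; j < TILE; ++j) {
            sA  += tile[threadIdx.x][j];   // tile row    = row of A
            sAt += tile[j][threadIdx.x];   // tile column = row of A^T
        }
        atomicAdd(&rowSumA[blockIdx.y * TILE + threadIdx.x], sA);
        atomicAdd(&rowSumAt[blockIdx.x * TILE + threadIdx.x], sAt);
    }
}
```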
When the function f performs a reduction on each row of the matrix and the reduction's result is an input of a function g processing the same row, a CPU is able to hold the row in the cache and reuse it after the reduction finishes (thus, the outer loops of f and g iterating over rows are fused, whereas the inner loops remain unfused). On a GPU, the row needs to be partitioned among multiple thread blocks when it is read into on-chip memory by f, so the thread blocks need to be synchronized before the result of the reduction is available. Our compiler performs this synchronization by a new kernel invocation; thus, all on-chip data are lost before the result of the reduction is available to g, and no row data can be reused. The only way to reuse row data on the GPU is to use persistent threads [10], but it is not clear whether this would have a positive performance impact.
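A minimal sketch of this split (not the compiler's generated code; kernel names f_reduce_rows and g_scale_rows are hypothetical): f computes one sum per matrix row, g consumes it. The boundary between the two launches is the global barrier, because kernels in the same stream run in order, and all on-chip data held by f is lost before g starts.

```cuda
// One thread block per row; blockDim.x must be a power of two.
__global__ void f_reduce_rows(const float *A, float *rowSum, int n)
{
    extern __shared__ float s[];
    float acc = 0.0f;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        acc += A[blockIdx.x * n + j];
    s[threadIdx.x] = acc;
    __syncthreads();
    for (int d = blockDim.x / 2; d > 0; d >>= 1) {   // tree reduction
        if (threadIdx.x < d) s[threadIdx.x] += s[threadIdx.x + d];
        __syncthreads();
    }
    if (threadIdx.x == 0) rowSum[blockIdx.x] = s[0];
}

// Runs only after f_reduce_rows has finished: the row data read by f is no
// longer on chip, so each element of A is read again from global memory.
__global__ void g_scale_rows(float *A, const float *rowSum, int n)
{
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        A[blockIdx.x * n + j] /= rowSum[blockIdx.x];
}

void normalize_rows(float *d_A, float *d_rowSum, int n)
{
    int block = 256;
    f_reduce_rows<<<n, block, block * sizeof(float)>>>(d_A, d_rowSum, n);
    g_scale_rows<<<n, block>>>(d_A, d_rowSum, n);   // kernel boundary = global barrier
}
```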
References
Belter G, Jessup ER, Karlin I, Siek JG (2009) Automating the generation of composed linear algebra kernels. In: Proceedings of the conference on high performance computing, networking, storage and analysis (SC09), ACM, 2009, pp 1–12
Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley RC (2002) An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Softw 28:135–151
Catanzaro B, Garland M, Keutzer K (2011) Copperhead: compiling an embedded data parallel language. In: The 16th ACM symposium on principles and practice of parallel programming (PPoPP)
Cole M (1989) Algorithmic skeletons: structural management of parallel computation. Research monographs in parallel and distributed computing. MIT Press, Cambridge
Dehnavi MM, Fernandez DM, Giannacopoulos D (2011) Enhancing the performance of conjugate gradient solvers on graphic processing units. IEEE Trans Magn 47:1162–1165
Filipovič J, Fousek J, Lakomý B, Madzin M (2012) Automatically optimized GPU acceleration of element subroutines in finite element method. In: Symposium on application accelerators in high-performance computing (SAAHPC)
Fousek J, Filipovič J, Madzin M (2011) Automatic fusions of CUDA-GPU kernels for parallel map. In: Second international workshop on highly-efficient accelerators and reconfigurable technologies (HEART), pp 42–47
González-Vélez H, Leyton M (2010) A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers. Softw Pract Exp 40:1135–1160
Gulati K, Khatri SP (2009) An automated approach for SIMD kernel generation for GPU based software acceleration. In: Symposium on application accelerators in high performance computing (SAAHPC)
Gupta K, Stuart JA, Owens JD (2012) A study of persistent threads style GPU programming for GPGPU workloads. In: Innovative parallel computing
Hoberock J, Bell N (2009) Thrust: a parallel template library
Howell GW, Demmel JW, Fulton CT, Hammarling S, Marmol K (2008) Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans Math Softw (TOMS) 34:1–14
Iverson KE (1962) A programming language. In: Spring joint computer conference (AIEE-IRE)
Larsen B (2011) Simple optimizations for an applicative array language for graphics processors. In: Proceedings of the sixth workshop on declarative aspects of multicore programming (DAMP)
Meng J, Morozov VA, Kumaran K, Vishwanath V, Uram TD (2011) Grophecy: GPU performance projection from CPU code skeletons. In: International conference for high performance computing, networking, storage and analysis (SC11)
Meng J, Morozov VA, Vishwanath V, Kumaran K (2012) Dataflow-driven GPU performance projection for multi-kernel transformations. In: International conference for high performance computing, networking, storage and analysis (SC12)
NVIDIA (2014) CUDA C programming guide, version 6.5
Russell FP, Mellor MR, Kelly PH, Beckmann O (2011) DESOLA: an active linear algebra library using delayed evaluation and runtime code generation. Sci Comput Program 76:227–242
Sato S, Iwasaki H (2009) A skeletal parallel framework with fusion optimizer for GPGPU programming. In: Programming languages and systems. Lecture notes in computer science, vol 5904. Springer, Berlin
Tabik S, Ortega G, Garzón EM (2014) Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. J Supercomput 70:577–587
Tarditi D, Puri S, Oglesby J (2006) Accelerator: using data parallelism to program GPUs for general-purpose uses. SIGARCH Comput Archit News 34
Wahib M, Maruyama N (2014) Scalable kernel fusion for memory-bound GPU applications. In: International conference for high performance computing, networking, storage and analysis (SC14)
Wang G, Lin Y, Yi W (2010) Kernel fusion: an effective method for better power efficiency on multithreaded GPU. In: IEEE/ACM international conference on green computing and communications and international conference on cyber, physical and social computing (GREENCOM–CPSCOM)
Acknowledgments
This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic under the Project “CERIT Scientific Cloud” (No. ED3.2.00/08.0144). The first author was supported by the Ministry of Education, Youth and Sports Project CZ.1.07/2.3.00/30.0037 (Employment of Best Young Scientists for International Cooperation Empowerment).
Cite this article
Filipovič, J., Madzin, M., Fousek, J. et al. Optimizing CUDA code by kernel fusion: application on BLAS. J Supercomput 71, 3934–3957 (2015). https://doi.org/10.1007/s11227-015-1483-z
Keywords
- GPGPU
- CUDA
- BLAS
- Kernel fusion
- Code generation