Abstract
Contemporary GPUs have significantly higher arithmetic throughput than memory throughput. Hence, many GPU kernels are memory bound and cannot fully exploit the arithmetic power of the GPU. Examples of memory-bound kernels are BLAS-1 (vector–vector) and BLAS-2 (matrix–vector) operations. However, when kernels share data, kernel fusion can improve memory locality by placing shared data, originally passed via off-chip global memory, into faster, but distributed, on-chip memory. In this paper, we show how kernels performing map, reduce or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusions. Compared with similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.24\(\times \) faster for the examples tested.
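To make the idea concrete, the following hand-written sketch (not the compiler's generated code; the kernel name axpy_dot_fused and all parameter names are illustrative) fuses an AXPY (y = a*x + y) with a subsequent DOT (r = y·z). Unfused, y would be written by one kernel and re-read from global memory by the next; fused, each element of y is reused directly from a register while the per-block partial dot products are reduced in shared memory. The sketch assumes blockDim.x is a power of two, a launch with blockDim.x * sizeof(float) bytes of dynamic shared memory, and a second step that sums the per-block partials.

```cuda
// Hypothetical sketch of fusing AXPY (y = a*x + y) with DOT (r = y . z).
__global__ void axpy_dot_fused(int n, float a,
                               const float *x, float *y, const float *z,
                               float *partial)        // one partial sum per block
{
    extern __shared__ float sdata[];
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    float acc = 0.0f;
    if (i < n) {
        float yi = a * x[i] + y[i];   // map step of AXPY
        y[i]     = yi;                // result still needed in global memory
        acc      = yi * z[i];         // reuse yi from the register for DOT
    }
    sdata[tid] = acc;
    __syncthreads();

    // standard shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}
```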
Notes
The first CUDA-capable GPU, G80, has a flop-to-word ratio of about 24; GT200 has 27, GF110 has 33, GK110 has 63, and GM204 has 82.
For more details about CUDA, we refer to [17].
Some programming languages use map only for unary functions and introduce zipwith for n-ary functions.
Data elements can be placed in registers only if their indexing can be determined at compile time [17].
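A minimal sketch (not from the paper; the kernel and buffer names are made up) of what this rule means in practice: the array indexed only with compile-time constants can be kept in registers, whereas the array indexed with the run-time value k is typically spilled to local (off-chip) memory.

```cuda
// Hypothetical illustration of the compile-time indexing rule.
__global__ void indexing_example(const float *in, float *out, int k)
{
    float reg_buf[4];     // indexed only with compile-time constants
    float local_buf[4];   // also indexed with a run-time value

    #pragma unroll
    for (int j = 0; j < 4; ++j) {
        reg_buf[j]   = in[4 * threadIdx.x + j];
        local_buf[j] = reg_buf[j] * 2.0f;
    }

    // All indices into reg_buf are known at compile time after unrolling,
    // so the compiler can keep it entirely in registers.
    float s = reg_buf[0] + reg_buf[1] + reg_buf[2] + reg_buf[3];

    // local_buf is indexed with the run-time value k, so the compiler
    // typically places it in local (off-chip) memory to make it addressable.
    out[threadIdx.x] = s + local_buf[k & 3];
}
```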
This is trivially fulfilled in the code generation stage, as the outputs of all reductions are used outside of the fusion implementation performing the reduction; thus, the global barrier is realized by finishing the kernel.
For more details about shared memory bank conflicts, we refer to [17].
It is naturally possible to use rectangular tiles, which decrease the reduction overhead. However, such tiles prevent efficient fusion of operations working with a matrix and its transposition.
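For illustration, a hypothetical sketch of what a square tile buys (kernel and parameter names are ours, not the compiler's): one TILE x TILE block of A, loaded into shared memory once, can be traversed row-wise for an operation on A and column-wise for an operation on A^T, which is exactly the access pattern a rectangular tile would break. The sketch accumulates partial row sums of both A and A^T from the same tile; it assumes n is a multiple of TILE, a TILE x TILE thread block, and zero-initialized output vectors.

```cuda
#define TILE 32

// Hypothetical sketch: partial row sums of A and of A^T computed from a single
// square shared-memory tile. The +1 padding keeps the strided accesses in the
// loop below conflict-free (see the previous note). Assumes n is a multiple of
// TILE, blockDim = (TILE, TILE), and zero-initialized rowSumA/rowSumAt.
__global__ void row_sums_A_and_At(const float *A, int n,
                                  float *rowSumA, float *rowSumAt)
{
    __shared__ float tile[TILE][TILE + 1];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = A[row * n + col];   // coalesced load
    __syncthreads();

    if (threadIdx.y == 0) {                // one warp finishes the tile
        float sA = 0.0f, sAt = 0.0f;
        for (int j = 0; j < TILE; ++j) {
            sA  += tile[threadIdx.x][j];   // tile row    = row of A
            sAt += tile[j][threadIdx.x];   // tile column = row of A^T
        }
        atomicAdd(&rowSumA[blockIdx.y * TILE + threadIdx.x], sA);
        atomicAdd(&rowSumAt[blockIdx.x * TILE + threadIdx.x], sAt);
    }
}
```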
When the function f performs a reduction on each row of the matrix and the reduction's result is an input of a function g processing the same row, a CPU is able to hold the row in the cache and reuse it after the reduction finishes (thus, the outer loops of f and g iterating over rows are fused, whereas the inner loops remain unfused). On a GPU, the row needs to be partitioned among multiple thread blocks when it is read into on-chip memory by f, so the thread blocks need to be synchronized before the result of the reduction is available. Our compiler performs this synchronization by a new kernel invocation; thus, all on-chip data are lost before the result of the reduction is available to g, and no row data can be reused. The only way to reuse row data on the GPU is to use persistent threads [10], but it is not clear whether this would have a positive performance impact.
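A minimal sketch of this split (not the compiler's generated code; kernel names f_reduce_rows and g_scale_rows are hypothetical): f computes one sum per matrix row, g consumes it. The boundary between the two launches is the global barrier, because kernels in the same stream run in order, and all on-chip data held by f is lost before g starts.

```cuda
// One thread block per row; blockDim.x must be a power of two.
__global__ void f_reduce_rows(const float *A, float *rowSum, int n)
{
    extern __shared__ float s[];
    float acc = 0.0f;
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        acc += A[blockIdx.x * n + j];
    s[threadIdx.x] = acc;
    __syncthreads();
    for (int d = blockDim.x / 2; d > 0; d >>= 1) {   // tree reduction
        if (threadIdx.x < d) s[threadIdx.x] += s[threadIdx.x + d];
        __syncthreads();
    }
    if (threadIdx.x == 0) rowSum[blockIdx.x] = s[0];
}

// Runs only after f_reduce_rows has finished: the row data read by f is no
// longer on chip, so each element of A is read again from global memory.
__global__ void g_scale_rows(float *A, const float *rowSum, int n)
{
    for (int j = threadIdx.x; j < n; j += blockDim.x)
        A[blockIdx.x * n + j] /= rowSum[blockIdx.x];
}

void normalize_rows(float *d_A, float *d_rowSum, int n)
{
    int block = 256;
    f_reduce_rows<<<n, block, block * sizeof(float)>>>(d_A, d_rowSum, n);
    g_scale_rows<<<n, block>>>(d_A, d_rowSum, n);   // kernel boundary = global barrier
}
```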
References
Belter G, Jessup ER, Karlin I, Siek JG (2009) Automating the generation of composed linear algebra kernels. In: Proceedings of the conference on high performance computing, networking, storage and analysis (SC09), ACM, 2009, pp 1–12
Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley RC (2002) An updated set of basic linear algebra subprograms (BLAS). ACM Trans Math Softw 28:135–151
Catanzaro B, Garland M, Keutzer K (2011) Copperhead: compiling an embedded data parallel language. In: The 16th ACM symposium on principles and practice of parallel programming (PPoPP)
Cole M (1989) Algorithmic skeletons: structural management of parallel computation. Research monographs in parallel and distributed computing. MIT Press, Cambridge
Dehnavi MM, Fernandez DM, Giannacopoulos D (2011) Enhancing the performance of conjugate gradient solvers on graphic processing units. IEEE Trans Magn 47:1162–1165
Filipovič J, Fousek J, Lakomý B, Madzin M (2012) Automatically optimized GPU acceleration of element subroutines in finite element method. In: Symposium on application accelerators in high-performance computing (SAAHPC)
Fousek J, Filipovič J, Madzin M (2011) Automatic fusions of CUDA-GPU kernels for parallel map. In: Second international workshop on highly-efficient accelerators and reconfigurable technologies (HEART), pp 42–47
González-Vélez H, Leyton M (2010) A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers. Softw Pract Exp 40:1135–1160
Gulati K, Khatri SP (2009) An automated approach for SIMD kernel generation for GPU based software acceleration. In: Symposium on application accelerators in high performance computing (SAAHPC)
Gupta K, Stuart JA, Owens JD (2012) A study of persistent threads style GPU programming for GPGPU workloads. In: Innovative parallel computing
Hoberock J, Bell N (2009) Thrust: a parallel template library
Howell GW, Demmel JW, Fulton CT, Hammarling S, Marmol K (2008) Cache efficient bidiagonalization using BLAS 2.5 operators. ACM Trans Math Softw (TOMS) 34:1–14
Iverson KE (1962) A programming language. In: Spring joint computer conference (AIEE-IRE)
Larsen B (2011) Simple optimizations for an applicative array language for graphics processors. In: Proceedings of the sixth workshop on declarative aspects of multicore programming (DAMP)
Meng J, Morozov VA, Kumaran K, Vishwanath V, Uram TD (2011) Grophecy: GPU performance projection from CPU code skeletons. In: International conference for high performance computing, networking, storage and analysis (SC11)
Meng J, Morozov VA, Vishwanath V, Kumaran K (2012) Dataflow-driven GPU performance projection for multi-kernel transformations. In: International conference for high performance computing, networking, storage and analysis (SC12)
NVIDIA (2014) CUDA C programming guide, version 6.5
Russell FP, Mellor MR, Kelly PH, Beckmann O (2011) DESOLA: an active linear algebra library using delayed evaluation and runtime code generation. Sci Comput Program 76:227–242
Sato S, Iwasaki H (2009) A skeletal parallel framework with fusion optimizer for GPGPU programming. In: Programming languages and systems. Lecture notes in computer science, vol 5904. Springer, Berlin
Tabik S, Ortega G, Garzón EM (2014) Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. J Supercomput 70:577–587
Tarditi D, Puri S, Oglesby J (2006) Accelerator: using data parallelism to program GPUs for general-purpose uses. SIGARCH Comput Archit News 34
Wahib M, Maruyama N (2014) Scalable kernel fusion for memory-bound GPU applications. In: International conference for high performance computing, networking, storage and analysis (SC14)
Wang G, Lin Y, Yi W (2010) Kernel fusion: an effective method for better power efficiency on multithreaded GPU. In: IEEE/ACM international conference on green computing and communications and international conference on cyber, physical and social computing (GREENCOM–CPSCOM)
Acknowledgments
This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic under the Project “CERIT Scientific Cloud” (No. ED3.2.00/08.0144). The first author was supported by the Ministry of Education, Youth and Sports Project CZ.1.07/2.3.00/30.0037 (Employment of Best Young Scientists for International Cooperation Empowerment).
Cite this article
Filipovič, J., Madzin, M., Fousek, J. et al. Optimizing CUDA code by kernel fusion: application on BLAS. J Supercomput 71, 3934–3957 (2015). https://doi.org/10.1007/s11227-015-1483-z
Keywords
- GPGPU
- CUDA
- BLAS
- Kernel fusion
- Code generation