uBench: exposing the impact of CUDA block geometry in terms of performance
The choice of thread-block size and shape is one of the most important decisions a programmer makes when writing a parallel program for any CUDA architecture, because thread-block geometry has a significant impact on the overall performance of the program. Unfortunately, programmers often lack sufficient information about the subtle interactions between this choice of parameters and the underlying hardware.
This paper presents uBench, a complete suite of micro-benchmarks designed to explore the performance impact of (1) the criteria used to choose the thread-block geometry, and (2) the GPU hardware resources and configurations. Each micro-benchmark is as simple as possible, so that it isolates a single effect arising from the hardware and the choice of thread-block parameters.
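To make the notion of thread-block geometry concrete, the following minimal Python sketch enumerates the two-dimensional shapes available for a fixed number of threads per block. It is an illustration only and is not part of uBench: the function names `block_geometries` and `warps_per_block` are hypothetical, and the constants assume the 32-thread warp and the 1024-thread per-block limit that apply to both Fermi and Kepler.

```python
WARP_SIZE = 32                 # threads per warp (assumed: Fermi and Kepler)
MAX_THREADS_PER_BLOCK = 1024   # per-block limit (assumed: compute capability 2.x/3.x)


def block_geometries(threads_per_block):
    """Return every 2D (rows, columns) shape holding exactly this many threads."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size outside the per-block thread limit")
    return [(rows, threads_per_block // rows)
            for rows in range(1, threads_per_block + 1)
            if threads_per_block % rows == 0]


def warps_per_block(threads_per_block):
    """Number of warps needed for one block, rounding up partial warps."""
    return -(-threads_per_block // WARP_SIZE)


if __name__ == "__main__":
    size = 192  # hypothetical block size chosen for illustration
    for rows, cols in block_geometries(size):
        print(f"{rows:4d} x {cols:<4d} warps per block: {warps_per_block(size)}")
```

All shapes listed for a given size use the same number of threads and warps, so any performance difference between them comes from the shape alone. This is the kind of single-parameter variation that a micro-benchmark can isolate.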
As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.
Keywords: GPU, Benchmarking, CUDA, Fermi, Kepler, Performance measurement