The Journal of Supercomputing, Volume 65, Issue 3, pp 1150–1163

uBench: exposing the impact of CUDA block geometry in terms of performance

  • Yuri Torres
  • Arturo Gonzalez-Escribano
  • Diego R. Llanos


The choice of thread-block size and shape is one of the most important user decisions when a parallel problem is coded for any CUDA architecture, because thread-block geometry has a significant impact on the global performance of the program. Unfortunately, the programmer does not have enough information about the subtle interactions between this choice of parameters and the underlying hardware.

This paper presents uBench, a complete suite of micro-benchmarks for exploring the performance impact of (1) the thread-block geometry selection criteria and (2) the GPU hardware resources and configurations. Each micro-benchmark is designed to be as simple as possible, so that it isolates a single effect derived from the hardware and the thread-block parameter choice.
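To make the notion of thread-block geometry concrete, the following minimal sketch lists the two-dimensional shapes available for one fixed block size and the number of warps each block occupies. It is an illustration only and is not taken from uBench: the helper names `block_geometries` and `warps_per_block` are hypothetical, and the constants assume the 32-thread warp and the 1024-threads-per-block limit that apply to Fermi and Kepler devices.

```python
# Illustrative sketch only; these helpers are hypothetical, not part of uBench.
WARP_SIZE = 32                # threads per warp on CUDA GPUs
MAX_THREADS_PER_BLOCK = 1024  # assumed limit for Fermi and Kepler devices


def block_geometries(threads_per_block):
    """Return every (rows, cols) shape whose product is threads_per_block."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size outside the assumed hardware limit")
    return [
        (rows, threads_per_block // rows)
        for rows in range(1, threads_per_block + 1)
        if threads_per_block % rows == 0
    ]


def warps_per_block(threads_per_block):
    """Number of warps needed to hold one block, rounded up."""
    return -(-threads_per_block // WARP_SIZE)


if __name__ == "__main__":
    # All shapes below hold 256 threads, yet they map threads to data differently.
    for rows, cols in block_geometries(256):
        print(f"{rows:4d} x {cols:<4d} -> {warps_per_block(rows * cols)} warps")
```

Every shape listed for a given size uses the same number of threads and warps, so any performance difference between them comes from the shape alone, which is the kind of single effect each micro-benchmark is meant to isolate.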

As an example of the capabilities of this benchmark suite, this paper shows an experimental evaluation and comparison of Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.


Keywords: GPU, Benchmarking, CUDA, Fermi, Kepler, Performance measurement



This research is partly supported by the Ministerio de Industria, Spain (CENIT OCEANLIDER), MINECO (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H network TIN2010-12011-E and TIN2011-15734-E), Junta de Castilla y León (VA172A12-2), and the HPC-EUROPA2 project (project number: 228398) with the support of the European Commission—Capacities Area—Research Infrastructures Initiative.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Yuri Torres (1)
  • Arturo Gonzalez-Escribano (1)
  • Diego R. Llanos (1)
  1. Universidad de Valladolid, Valladolid, Spain
