uBench: exposing the impact of CUDA block geometry in terms of performance

Torres, Yuri; Gonzalez-Escribano, Arturo; Llanos, Diego R.

doi:10.1007/s11227-013-0921-z


Published: 03 April 2013

Volume 65, pages 1150–1163, (2013)

The Journal of Supercomputing


Abstract

Choosing the thread-block size and shape is one of the most important decisions a programmer makes when coding a parallel problem for any CUDA architecture, because thread-block geometry has a significant impact on the overall performance of the program. Unfortunately, the programmer does not have enough information about the subtle interactions between this choice of parameters and the underlying hardware.

This paper presents uBench, a complete suite of micro-benchmarks for exploring the performance impact of (1) the thread-block geometry choice criteria and (2) the GPU hardware resources and configurations. Each micro-benchmark is designed to be as simple as possible, so that it focuses on a single effect derived from the hardware and the thread-block parameter choice.

As an example of the capabilities of this benchmark suite, this paper presents an experimental evaluation and comparison of the Fermi and Kepler architectures. Our study reveals that, in spite of the new hardware details introduced by Kepler, the principles underlying the block geometry selection criteria are similar for both architectures.
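To make the notion of "thread-block geometry" concrete, the following minimal Python sketch enumerates the two-dimensional block shapes available for a fixed number of threads per block. It is an illustration only and is not taken from uBench: the function names (`block_shapes`, `warps_per_block`, `rows_in_first_warp`) are hypothetical, and the sketch assumes the warp size of 32 threads and the limit of 1024 threads per block that apply to both Fermi and Kepler, with threads linearized along x first.

```python
WARP_SIZE = 32                 # threads per warp on Fermi and Kepler
MAX_THREADS_PER_BLOCK = 1024   # per-block thread limit on Fermi and Kepler


def block_shapes(threads_per_block):
    """Return every 2D shape (x, y) with x * y == threads_per_block."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be in 1..1024")
    return [
        (x, threads_per_block // x)
        for x in range(1, threads_per_block + 1)
        if threads_per_block % x == 0
    ]


def warps_per_block(shape):
    """Number of warps needed for a block (ceiling division by the warp size)."""
    x, y = shape
    return -(-(x * y) // WARP_SIZE)


def rows_in_first_warp(shape):
    """Rows of the block covered by the first warp (linear thread IDs 0..31).

    Threads are numbered x-fastest, so a narrow block spreads one warp
    over several rows, while a block at least 32 threads wide keeps it in one.
    """
    x, y = shape
    return min(y, -(-WARP_SIZE // x))


if __name__ == "__main__":
    for shape in block_shapes(256):
        print(shape, warps_per_block(shape), rows_in_first_warp(shape))
```

All shapes listed for a given thread count occupy the same number of warps, yet they differ in how many rows of the block a single warp touches, which is one reason shape can matter independently of size.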



Acknowledgements

This research is partly supported by the Ministerio de Industria, Spain (CENIT OCEANLIDER), MINECO (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, CAPAP-H network TIN2010-12011-E and TIN2011-15734-E), Junta de Castilla y León (VA172A12-2), and the HPC-EUROPA2 project (project number: 228398) with the support of the European Commission—Capacities Area—Research Infrastructures Initiative.

Author information

Authors and Affiliations

Universidad de Valladolid, Valladolid, Spain
Yuri Torres, Arturo Gonzalez-Escribano & Diego R. Llanos


Corresponding author

Correspondence to Arturo Gonzalez-Escribano.


About this article

Cite this article

Torres, Y., Gonzalez-Escribano, A. & Llanos, D.R. uBench: exposing the impact of CUDA block geometry in terms of performance. J Supercomput 65, 1150–1163 (2013). https://doi.org/10.1007/s11227-013-0921-z


Published: 03 April 2013
Issue Date: September 2013
DOI: https://doi.org/10.1007/s11227-013-0921-z
