An analytical GPU performance model for 3D stencil computations from the angle of data traffic

Su, Huayou; Cai, Xing; Wen, Mei; Zhang, Chunyuan

doi:10.1007/s11227-015-1392-1


Published: 26 February 2015

Volume 71, pages 2433–2453, (2015)

The Journal of Supercomputing

Huayou Su¹,
Xing Cai^2,3,
Mei Wen¹ &
…
Chunyuan Zhang¹


Abstract

The achievable GPU performance of many scientific computations is determined not by a GPU’s peak floating-point rate, but by how fast data are moved through the different stages of the memory hierarchy. We take low-order 3D stencil computations as a representative class to study the reachable GPU performance from the angle of data traffic. Specifically, we propose a simple analytical model that estimates the execution time by quantifying the data traffic volume at three stages: (1) between registers and on-streaming-multiprocessor (on-SMX) storage, (2) between on-SMX storage and L2 cache, and (3) between L2 cache and the GPU’s device memory. Three associated granularities are used: a CUDA thread, a thread block, and a set of simultaneously active thread blocks. For four chosen 3D stencil computations, NVIDIA’s profiling tools are used to verify the accuracy of the quantified data traffic volumes by examining a large number of executions with different problem sizes and thread block configurations. Moreover, by introducing an imbalance coefficient and using the known realistic memory bandwidths, we can predict the execution time from the quantified data traffic volumes. For the four 3D stencils, the average error of the time predictions is 6.9% for a baseline implementation approach and 9.5% for a blocking implementation approach.
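To make the idea of a traffic-based time estimate concrete, the following is a minimal Python sketch. It is not the model from the article: the abstract does not state the formula, so the sketch assumes a simple bottleneck form in which the predicted time is the largest per-stage transfer time (traffic volume divided by realistic bandwidth) multiplied by an imbalance coefficient. The names `Stage` and `predict_time`, the multiplicative use of the coefficient, and all numbers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One stage of the memory hierarchy (hypothetical container)."""
    name: str
    traffic_bytes: float          # quantified data traffic volume at this stage
    bandwidth_bytes_per_s: float  # realistic (measured) bandwidth of this stage


def predict_time(stages, imbalance=1.0):
    """Assumed bottleneck estimate, not the article's formula.

    Returns the transfer time of the slowest stage, scaled by an
    imbalance coefficient, together with that stage's name.
    """
    per_stage = {s.name: s.traffic_bytes / s.bandwidth_bytes_per_s for s in stages}
    bottleneck = max(per_stage, key=per_stage.get)
    return imbalance * per_stage[bottleneck], bottleneck


# Placeholder volumes and bandwidths, not measurements from the article.
stages = [
    Stage("registers <-> on-SMX storage", 8.0e9, 1.0e12),
    Stage("on-SMX storage <-> L2", 4.0e9, 4.0e11),
    Stage("L2 <-> device memory", 2.0e9, 1.6e11),
]

t, where = predict_time(stages, imbalance=1.1)
print(f"predicted time: {t * 1e3:.2f} ms, limited by {where}")
```

In such a sketch the traffic volumes would come from counting loads and stores per thread, per thread block, and per set of active thread blocks, and the bandwidths from micro-benchmarks rather than vendor peak figures.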



Acknowledgments

The authors gratefully acknowledge the support from the National Natural Science Foundation of China under NSFC Nos. 61033008, 61103080 and 61272145, SRFDP Nos. 20104307110002 and 20124307130004, Innovation in Graduate School of NUDT Nos. B100603 and B120605, and the FRINATEK program of the Research Council of Norway under No. 214113/F20.

Author information

Authors and Affiliations

School of Computer, National University of Defense Technology, Changsha, China
Huayou Su, Mei Wen & Chunyuan Zhang
Simula Research Laboratory, Oslo, Norway
Xing Cai
Department of Informatics, University of Oslo, Oslo, Norway
Xing Cai


Corresponding author

Correspondence to Huayou Su.


About this article

Cite this article

Su, H., Cai, X., Wen, M. et al. An analytical GPU performance model for 3D stencil computations from the angle of data traffic. J Supercomput 71, 2433–2453 (2015). https://doi.org/10.1007/s11227-015-1392-1


Published: 26 February 2015
Issue Date: July 2015
DOI: https://doi.org/10.1007/s11227-015-1392-1
