An analytical GPU performance model for 3D stencil computations from the angle of data traffic

The Journal of Supercomputing

Abstract

The achievable GPU performance of many scientific computations is determined not by a GPU’s peak floating-point rate, but by how fast data are moved through the stages of the memory hierarchy. We take low-order 3D stencil computations as a representative class to study the attainable GPU performance from the angle of data traffic. Specifically, we propose a simple analytical model that estimates the execution time by quantifying the data traffic volume at three stages: (1) between registers and on-streaming multiprocessor (SMX) storage, (2) between on-SMX storage and L2 cache, and (3) between L2 cache and the GPU’s device memory. Three associated granularities are used: a CUDA thread, a thread block, and a set of simultaneously active thread blocks. For four chosen 3D stencil computations, NVIDIA’s profiling tools are used to verify the accuracy of the quantified data traffic volumes by examining a large number of executions with different problem sizes and thread block configurations. Moreover, by introducing an imbalance coefficient and using the known realistic memory bandwidths, we can predict the execution time from the quantified data traffic volumes. For the four 3D stencils, the average error of the time predictions is 6.9% for a baseline implementation approach and 9.5% for a blocking implementation approach.
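The abstract does not state the model's formula, so the following is only a minimal sketch of how a traffic-based time estimate of this kind might be organized. It assumes, hypothetically, that each of the three stages contributes a time of volume divided by realistic bandwidth, that the slowest stage bounds the execution time, and that the imbalance coefficient enters as a single multiplicative factor. The names `Stage`, `stage_time`, `bottleneck` and `predict_time` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    """One stage of the memory hierarchy (hypothetical container)."""
    name: str
    bytes_moved: float   # quantified data traffic volume, in bytes
    bandwidth: float     # realistic (measured) bandwidth, in bytes per second


def stage_time(stage: Stage) -> float:
    """Time to move the stage's traffic at its realistic bandwidth."""
    if stage.bandwidth <= 0:
        raise ValueError(f"bandwidth of stage {stage.name!r} must be positive")
    return stage.bytes_moved / stage.bandwidth


def bottleneck(stages: Sequence[Stage]) -> Stage:
    """Return the stage with the largest transfer time."""
    if not stages:
        raise ValueError("at least one stage is required")
    return max(stages, key=stage_time)


def predict_time(stages: Sequence[Stage], imbalance: float = 1.0) -> float:
    """Assumed bottleneck estimate: slowest stage time scaled by an
    imbalance coefficient. This combination rule is an assumption of
    this sketch, not the formula from the paper."""
    return imbalance * stage_time(bottleneck(stages))


if __name__ == "__main__":
    # Placeholder volumes and bandwidths, chosen only to exercise the code.
    stages = [
        Stage("registers <-> on-SMX storage", 4.0e9, 1.0e12),
        Stage("on-SMX storage <-> L2 cache", 1.0e9, 4.0e11),
        Stage("L2 cache <-> device memory", 3.0e8, 1.5e11),
    ]
    print("bottleneck stage:", bottleneck(stages).name)
    print("predicted time (s):", predict_time(stages, imbalance=1.2))
```

In a sketch like this, the per-stage volumes would come from counting traffic at the granularity matching each stage (a thread, a thread block, or a set of simultaneously active thread blocks), and the bandwidths from measurements rather than from peak specifications.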

Acknowledgments

The authors gratefully acknowledge the support from the National Natural Science Foundation of China under NSFC Nos. 61033008, 61103080 and 61272145, SRFDP Nos. 20104307110002 and 20124307130004, Innovation in Graduate School of NUDT Nos. B100603 and B120605, and the FRINATEK program of the Research Council of Norway under No. 214113/F20.

Author information

Corresponding author

Correspondence to Huayou Su.

About this article

Cite this article

Su, H., Cai, X., Wen, M. et al. An analytical GPU performance model for 3D stencil computations from the angle of data traffic. J Supercomput 71, 2433–2453 (2015). https://doi.org/10.1007/s11227-015-1392-1

  • DOI: https://doi.org/10.1007/s11227-015-1392-1
