The Journal of Supercomputing, Volume 71, Issue 7, pp 2433–2453

An analytical GPU performance model for 3D stencil computations from the angle of data traffic

  • Huayou Su
  • Xing Cai
  • Mei Wen
  • Chunyuan Zhang


The achievable GPU performance of many scientific computations is determined not by a GPU’s peak floating-point rate, but rather by how fast data are moved through the different stages of the memory hierarchy. We take low-order 3D stencil computations as a representative class to study the reachable GPU performance from the angle of data traffic. Specifically, we propose a simple analytical model that estimates the execution time by quantifying the data traffic volume at three stages: (1) between registers and on-streaming multiprocessor (SMX) storage, (2) between on-SMX storage and L2 cache, and (3) between L2 cache and the GPU’s device memory. Three associated granularities are used: a CUDA thread, a thread block, and a set of simultaneously active thread blocks. For four chosen 3D stencil computations, NVIDIA’s profiling tools are used to verify the accuracy of the quantified data traffic volumes by examining a large number of executions with different problem sizes and thread block configurations. Moreover, by introducing an imbalance coefficient and using the known realistic memory bandwidths, we can predict the execution time from the quantified data traffic volumes. For the four 3D stencils, the average error of the time predictions is 6.9% for a baseline implementation approach and 9.5% for a blocking implementation approach.
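The abstract does not state the model's formulas, so the following is only a minimal sketch of the general idea, assuming a simple bottleneck form: each stage's time is its traffic volume divided by its realistic bandwidth, the slowest stage dominates, and the result is scaled by an imbalance coefficient. The bottleneck form, the names `Stage` and `predict_time`, and every number below are illustrative assumptions, not values or definitions taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One stage of the memory hierarchy (hypothetical container)."""
    name: str
    traffic_bytes: float          # quantified data traffic volume at this stage
    bandwidth_bytes_per_s: float  # realistic (measured) bandwidth of this stage


def predict_time(stages, imbalance=1.0):
    """Estimate execution time in seconds.

    Assumption: the slowest stage (volume / bandwidth) bounds the run time,
    and an imbalance coefficient >= 1 inflates that bound. The paper's actual
    model may combine the stages differently.
    """
    if imbalance < 1.0:
        raise ValueError("imbalance coefficient is assumed to be >= 1")
    slowest = max(s.traffic_bytes / s.bandwidth_bytes_per_s for s in stages)
    return imbalance * slowest


# Illustrative, made-up traffic volumes and bandwidths for the three stages.
stages = [
    Stage("registers <-> on-SMX storage", 8.0e9, 1000e9),
    Stage("on-SMX storage <-> L2 cache", 2.0e9, 500e9),
    Stage("L2 cache <-> device memory", 1.5e9, 150e9),
]

estimate = predict_time(stages, imbalance=1.2)
print(f"predicted time: {estimate * 1e3:.2f} ms")
```

In this sketch the device-memory stage has the largest volume-to-bandwidth ratio, so it alone sets the estimate; changing a traffic volume or bandwidth shows which stage becomes the bottleneck.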


Analytical performance modeling · GPU · Stencil computation · Data traffic



The authors gratefully acknowledge the support from the National Natural Science Foundation of China under NSFC Nos. 61033008, 61103080 and 61272145, SRFDP Nos. 20104307110002 and 20124307130004, Innovation in Graduate School of NUDT Nos. B100603 and B120605, and the FRINATEK program of the Research Council of Norway under No. 214113/F20.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. School of Computer, National University of Defense Technology, Changsha, China
  2. Simula Research Laboratory, Oslo, Norway
  3. Department of Informatics, University of Oslo, Oslo, Norway
