
A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

Abstract

In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods, which rely on circular queues implemented predominantly in indirectly addressable shared memory, our hybrid method exploits a new reuse pattern that spans multiple time steps in stencil computations, so that circular queues can be implemented effectively with a balanced combination of shared memory and registers. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93X over methods that use circular queues implemented with shared memory only.



Author information

Correspondence to Yang Yang.

Electronic Supplementary Material


(PDF 102 kb)


About this article

Cite this article

Yang, Y., Cui, H., Feng, X. et al. A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs. J. Comput. Sci. Technol. 27, 57–74 (2012). https://doi.org/10.1007/s11390-012-1206-3


Keywords

  • stencil computation
  • circular queue
  • GPU
  • occupancy
  • register