A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

Yang, Yang; Cui, Hui-Min; Feng, Xiao-Bing; Xue, Jing-Ling

doi:10.1007/s11390-012-1206-3

Regular Paper
Published: 09 January 2012

Volume 27, pages 57–74, (2012)

Journal of Computer Science and Technology

Yang Yang^1,2, Hui-Min Cui^1,2, Xiao-Bing Feng^1 & Jing-Ling Xue^3


Abstract

In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods, which rely on circular queues implemented predominantly in indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning multiple time steps in stencil computations, so that circular queues can be implemented effectively, and in a balanced manner, using both shared memory and registers. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93× over methods that use circular queues implemented with shared memory only.
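
As a rough illustration of the circular-queue idea mentioned in the abstract (not the authors' hybrid method or their placement framework), the sketch below sweeps a 5-point 2D stencil row by row on the CPU in plain Python. Each input row is fetched once and kept in a three-slot queue until it is no longer needed. The function names, the choice of a 2D 5-point stencil, and the coefficients `c0` and `c1` are assumptions made for this example only.

```python
def stencil_direct(grid, c0, c1):
    """Reference 5-point stencil on interior points; boundary values are copied."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = c0 * grid[y][x] + c1 * (
                grid[y - 1][x] + grid[y + 1][x] + grid[y][x - 1] + grid[y][x + 1]
            )
    return out


def stencil_circular_queue(grid, c0, c1):
    """Same stencil, but every input row is read from `grid` exactly once and
    held in a 3-slot circular queue while it is still needed (hypothetical helper)."""
    ny, nx = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    if ny < 3:
        return out                      # no interior rows to update
    queue = [None, None, None]          # slot i holds the row y with y % 3 == i
    queue[0] = grid[0][:]
    queue[1] = grid[1][:]
    for y in range(1, ny - 1):
        # Load the next row into the slot of the oldest row (y - 2), which is dead.
        queue[(y + 1) % 3] = grid[y + 1][:]
        up, mid, down = queue[(y - 1) % 3], queue[y % 3], queue[(y + 1) % 3]
        for x in range(1, nx - 1):
            out[y][x] = c0 * mid[x] + c1 * (up[x] + down[x] + mid[x - 1] + mid[x + 1])
    return out
```

In a GPU version with one thread per column `x`, the `up` and `down` slots are read only at the thread's own `x`, so they could in principle live in registers, whereas `mid` is also read at `x - 1` and `x + 1` and would need shared memory. Because registers cannot be indexed indirectly, the `% 3` rotation would have to be resolved at compile time, for example by unrolling the sweep loop.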



Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Yang Yang, Hui-Min Cui & Xiao-Bing Feng (Member, CCF)
Graduate University of Chinese Academy of Sciences, Beijing, 100190, China
Yang Yang & Hui-Min Cui
Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, 2052, Australia
Jing-Ling Xue (Senior Member, IEEE)

Corresponding author

Correspondence to Yang Yang.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

(PDF 102 kb)

About this article

Cite this article

Yang, Y., Cui, HM., Feng, XB. et al. A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs. J. Comput. Sci. Technol. 27, 57–74 (2012). https://doi.org/10.1007/s11390-012-1206-3

Received: 09 May 2011
Revised: 08 October 2011
Published: 09 January 2012
Issue Date: January 2012
DOI: https://doi.org/10.1007/s11390-012-1206-3

