On the GPU Performance of 3D Stencil Computations Implemented in OpenCL

Su, Huayou; Wu, Nan; Wen, Mei; Zhang, Chunyuan; Cai, Xing

doi:10.1007/978-3-642-38750-0_10

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7905)

Included in the following conference series: International Supercomputing Conference

Abstract

Aiming at a close examination of the OpenCL performance myth, we study in this paper OpenCL implementations of several representative 3D stencil computations. We find that typical optimization techniques such as array padding, plane sweeping and chunking give the OpenCL implementations performance boosts similar to those obtained in the corresponding CUDA programs. The key to good performance lies in maximizing the use of a GPU’s on-chip resources, which holds equally for OpenCL and CUDA programming. In most cases, the achieved FLOPS rates on NVIDIA’s Fermi and Kepler GPUs are fully comparable between the two programming alternatives. For four typical 3D stencil computations, the OpenCL implementations are on average 9% and 2% faster than their CUDA counterparts on the GTX590 and the Tesla K20, respectively. At the moment, the only clear advantage of CUDA programming for stencil computations arises from CUDA’s ability to use the read-only data cache on NVIDIA’s Kepler GPUs. The skepticism about OpenCL’s GPU performance thus seems unjustified for 3D stencil computations.
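
The abstract names array padding and plane sweeping without describing them. As a rough illustration only, the following CPU-side sketch in plain Python shows one common reading of the two ideas for a 7-point 3D stencil: rows are padded to a fixed multiple of elements, and each (i, j) column marches along z while carrying its three z-values in local variables instead of re-reading them. The function names, the alignment of 32 elements, the grid size and the coefficients c0 and c1 are all assumptions made for this example. None of it is taken from the paper, whose kernels are written in OpenCL and CUDA.

```python
def padded_pitch(nx, align=32):
    """Array padding: round the row length up to a multiple of `align` (assumed value)."""
    return (nx + align - 1) // align * align


def index(i, j, k, ny, pitch):
    """Flat offset of grid point (i, j, k) in a row-padded 3D array."""
    return (k * ny + j) * pitch + i


def stencil7_naive(u, nx, ny, nz, pitch, c0, c1):
    """Reference version: all seven values are read from the array at every point."""
    out = list(u)  # boundary points are copied unchanged
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                out[index(i, j, k, ny, pitch)] = c0 * u[index(i, j, k, ny, pitch)] + c1 * (
                    u[index(i - 1, j, k, ny, pitch)] + u[index(i + 1, j, k, ny, pitch)]
                    + u[index(i, j - 1, k, ny, pitch)] + u[index(i, j + 1, k, ny, pitch)]
                    + u[index(i, j, k - 1, ny, pitch)] + u[index(i, j, k + 1, ny, pitch)])
    return out


def stencil7_plane_sweep(u, nx, ny, nz, pitch, c0, c1):
    """Plane-sweeping version: each (i, j) column advances along z, plane by plane."""
    out = list(u)
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # The three z-values of this column live in local variables,
            # standing in for the registers of one GPU thread.
            below = u[index(i, j, 0, ny, pitch)]
            center = u[index(i, j, 1, ny, pitch)]
            for k in range(1, nz - 1):
                above = u[index(i, j, k + 1, ny, pitch)]  # the only new z-read per step
                out[index(i, j, k, ny, pitch)] = c0 * center + c1 * (
                    u[index(i - 1, j, k, ny, pitch)] + u[index(i + 1, j, k, ny, pitch)]
                    + u[index(i, j - 1, k, ny, pitch)] + u[index(i, j + 1, k, ny, pitch)]
                    + below + above)
                below, center = center, above  # slide the window one plane up
    return out


# Small demonstration on a linear field u(i, j, k) = i + 2j + 3k.
nx, ny, nz = 5, 4, 6
pitch = padded_pitch(nx)
u = [0.0] * (pitch * ny * nz)  # padding slots stay zero and are never read
for k in range(nz):
    for j in range(ny):
        for i in range(nx):
            u[index(i, j, k, ny, pitch)] = float(i + 2 * j + 3 * k)

naive = stencil7_naive(u, nx, ny, nz, pitch, 0.4, 0.1)
swept = stencil7_plane_sweep(u, nx, ny, nz, pitch, 0.4, 0.1)
print(max(abs(a - b) for a, b in zip(naive, swept)))
```

Both functions compute the same update, and the sweep only changes where the z-neighbours come from. In a GPU kernel the in-plane neighbours would typically also be staged in on-chip memory, which this sketch does not model.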

Author information

Authors and Affiliations

School of Computer Science, National University of Defense Technology, Changsha, Hunan, 410073, China
Huayou Su, Nan Wu, Mei Wen & Chunyuan Zhang
Simula Research Laboratory, P.O. Box 134, 1325, Lysaker, Norway
Huayou Su, Nan Wu & Xing Cai
Department of Informatics, University of Oslo, P.O. Box 1080, Blindern, 0316, Oslo, Norway
Huayou Su & Xing Cai

Editor information

Editors and Affiliations

University of Hamburg, Department of Informatics, Bundesstraße 45a, 20146, Hamburg, Germany
Julian Martin Kunkel
Deutsches Klimarechenzentrum, Bundesstraße 45a, 20146, Hamburg, Germany
Thomas Ludwig
University of Mannheim, Germany and Prometeus GmbH, Fliederstraße 2, 74915, Waibstadt, Germany
Hans Werner Meuer

About this paper

Cite this paper

Su, H., Wu, N., Wen, M., Zhang, C., Cai, X. (2013). On the GPU Performance of 3D Stencil Computations Implemented in OpenCL. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2013. Lecture Notes in Computer Science, vol 7905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38750-0_10

DOI: https://doi.org/10.1007/978-3-642-38750-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38749-4
Online ISBN: 978-3-642-38750-0
eBook Packages: Computer Science, Computer Science (R0)
