Abstract
To examine the OpenCL performance myth closely, this paper studies OpenCL implementations of several representative 3D stencil computations. We find that typical optimization techniques, such as array padding, plane sweeping, and chunking, give the OpenCL implementations performance boosts similar to those obtained in the corresponding CUDA programs. The key to good performance is maximizing the use of a GPU’s on-chip resources, and this holds for both OpenCL and CUDA programming. In most cases, the FLOPS rates achieved on NVIDIA’s Fermi and Kepler GPUs are fully comparable between the two programming alternatives. For four typical 3D stencil computations, the OpenCL implementations are on average 9% and 2% faster than their CUDA counterparts on the GTX590 and Tesla K20, respectively. At the moment, the only clear advantage of CUDA programming for stencil computations is CUDA’s ability to use the read-only data cache on NVIDIA’s Kepler GPUs. The skepticism about OpenCL’s GPU performance thus seems unjustified for 3D stencil computations.
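The abstract names plane sweeping without showing what it looks like. The following is a minimal CPU-side sketch in plain Python, not the paper's OpenCL or CUDA kernels. It assumes a 7-point stencil as one representative case, an x-fastest memory layout, and a grid with at least three points per dimension. The function names and the coefficients `c0` and `c1` are hypothetical, chosen only for illustration. Each (i, j) column plays the role of one GPU thread that marches along z and keeps the values below, at, and above the current plane in local variables, much as a kernel would hold them in registers.

```python
def stencil_naive(u, nx, ny, nz, c0, c1):
    """Reference 7-point stencil; boundary values are copied unchanged."""
    out = list(u)
    plane = nx * ny
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                p = k * plane + j * nx + i  # x is the fastest-varying index
                out[p] = c0 * u[p] + c1 * (
                    u[p - 1] + u[p + 1]
                    + u[p - nx] + u[p + nx]
                    + u[p - plane] + u[p + plane])
    return out


def stencil_plane_sweep(u, nx, ny, nz, c0, c1):
    """Same stencil, but each (i, j) column sweeps along z.

    The three z-neighbours are carried in local variables and shifted
    after every step, so each input value along the column is loaded
    from the array only once for the z-direction terms.
    """
    out = list(u)
    plane = nx * ny
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):      # one column, akin to one GPU thread
            p = j * nx + i              # position in plane k = 0
            below = u[p]
            center = u[p + plane]
            for k in range(1, nz - 1):
                p += plane              # advance to plane k
                above = u[p + plane]
                out[p] = c0 * center + c1 * (
                    u[p - 1] + u[p + 1]
                    + u[p - nx] + u[p + nx]
                    + below + above)
                below, center = center, above  # shift the window upward
    return out
```

Both functions add the terms in the same order, so they should produce identical results on the same input. On a GPU, the in-plane neighbours would typically be staged in on-chip local or shared memory as well, which this sketch does not model.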
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Su, H., Wu, N., Wen, M., Zhang, C., Cai, X. (2013). On the GPU Performance of 3D Stencil Computations Implemented in OpenCL. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2013. Lecture Notes in Computer Science, vol 7905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38750-0_10
DOI: https://doi.org/10.1007/978-3-642-38750-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38749-4
Online ISBN: 978-3-642-38750-0
eBook Packages: Computer Science (R0)