Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs
When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and is therefore widely applied. However, the locality optimizations embedded in GPU-specific code are usually inherited without analysis, which can hurt CPU performance: local-memory arrays no longer match the CPU's memory hierarchy, and the associated barrier synchronizations are costly. To resolve this, we analyze memory access patterns using array-access descriptors derived from the GPU-specific kernels, and adapt the kernels for CPUs by removing the unnecessary local-memory arrays together with the obsolete barrier statements. Experiments show that this automated transformation substantially improves OpenCL kernel performance on a Sandy Bridge CPU and on Intel's Many-Integrated-Core coprocessor.
Keywords: OpenCL · Performance portability · Multi-core/many-core CPU · Code transformation and optimization