Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs

  • Dafei Huang
  • Mei Wen
  • Changqing Xun
  • Dong Chen
  • Xing Cai
  • Yuran Qiao
  • Nan Wu
  • Chunyuan Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)

Abstract

When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus extensively used. However, locality optimizations embedded in GPU-specific OpenCL code are usually inherited without analysis, which can degrade CPU performance: local-memory arrays no longer match the CPU memory hierarchy, and the associated barrier synchronizations are costly. To resolve this mismatch, we analyze the memory access patterns using array-access descriptors derived from the GPU-specific kernels, so that the kernels can be adapted for CPUs by removing all unwanted local-memory arrays together with the obsolete barrier statements. Experiments show that the automated transformation markedly improves OpenCL kernel performance on a Sandy Bridge CPU and on Intel's Many-Integrated-Core (MIC) coprocessor.
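
To make the proposed transformation concrete, the following minimal OpenCL sketch contrasts a GPU-specific kernel with a CPU-adapted counterpart of the kind such an analysis could produce. This is not code from the paper: the kernel names, the TILE size, and the one-work-item-per-former-work-group launch convention of the coarsened kernel are illustrative assumptions.

    /* GPU-specific kernel: stages data through local memory, a common GPU
     * idiom. Assumes the work-group size equals TILE. */
    #define TILE 256

    __kernel void scale_gpu(__global const float *in,
                            __global float *out,
                            const float alpha)
    {
        __local float tile[TILE];          /* local-memory staging array   */
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        tile[lid] = in[gid];               /* cooperative load             */
        barrier(CLK_LOCAL_MEM_FENCE);      /* work-group synchronization   */
        out[gid] = alpha * tile[lid];      /* compute from the staged copy */
    }

    /* CPU-adapted kernel: an array-access analysis shows each work-item
     * reads back exactly the element it wrote, so the local array and the
     * barrier can be removed, and the work-group coarsened into one serial
     * loop. Assumes a launch with one work-item per former work-group. */
    __kernel void scale_cpu(__global const float *in,
                            __global float *out,
                            const float alpha)
    {
        int base = get_group_id(0) * TILE; /* start of this group's slice  */
        for (int lid = 0; lid < TILE; ++lid) {
            int gid = base + lid;
            out[gid] = alpha * in[gid];    /* direct, contiguous access    */
        }
    }

With the local array gone, the coarsened loop streams through contiguous global memory, which suits CPU caches far better than emulated local-memory copies separated by barriers.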

Keywords

OpenCL · Performance portability · Multi-core/many-core CPU · Code transformation and optimization

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Dafei Huang (1, 2)
  • Mei Wen (1, 2)
  • Changqing Xun (1, 2)
  • Dong Chen (1, 2)
  • Xing Cai (3)
  • Yuran Qiao (1, 2)
  • Nan Wu (2, 3)
  • Chunyuan Zhang (1, 2)

  1. Department of Computer, National University of Defense Technology, China
  2. State Key Laboratory of High Performance Computing, Changsha, China
  3. Simula Research Laboratory, Oslo, Norway