pocl: A Performance-Portable OpenCL Implementation

International Journal of Parallel Programming


OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions. This separation enables support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This information can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.
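As a rough illustration of what parallel region formation means, the following sketch models a single work-group in plain Python. It is not pocl's compiler output or API: the kernel (a buffer reversal through a local scratch array), the function names, and the work-group size are all hypothetical, chosen only to show the idea. The assumption is that a work-group barrier splits the kernel body into regions, and that each region can then be run for all work-items before the next region starts. How the per-region loop is mapped to hardware (SIMD lanes, multi-issue slots, plain serial execution) would be a separate, target-specific decision.

```python
# Hypothetical kernel, written per work-item in OpenCL-like pseudocode:
#     tmp[lid] = data[lid]
#     barrier()
#     data[lid] = tmp[n - 1 - lid]
# The barrier splits the body into two parallel regions.

def region_before_barrier(lid, data, tmp, n):
    """Code a work-item runs before the barrier."""
    tmp[lid] = data[lid]


def region_after_barrier(lid, data, tmp, n):
    """Code a work-item runs after the barrier."""
    data[lid] = tmp[n - 1 - lid]


def run_work_group(data):
    """Run every region for all work-items before moving to the next region.

    Finishing one work-item loop before starting the next one is what
    satisfies the barrier. Iterations inside each loop are independent,
    so a later stage is free to run them in any order or in parallel.
    """
    n = len(data)
    tmp = [0] * n  # stands in for local (work-group shared) memory
    for region in (region_before_barrier, region_after_barrier):
        for lid in range(n):  # work-item loop over local ids
            region(lid, data, tmp, n)
    return data


def run_naive(data):
    """Incorrect contrast: each work-item runs the whole body at once,
    ignoring the barrier, so later work-items' writes to tmp are missed."""
    n = len(data)
    tmp = [0] * n
    for lid in range(n):
        region_before_barrier(lid, data, tmp, n)
        region_after_barrier(lid, data, tmp, n)
    return data


if __name__ == "__main__":
    print(run_work_group([1, 2, 3, 4, 5, 6, 7, 8]))
    print(run_naive([1, 2, 3, 4, 5, 6, 7, 8]))
```

The region-by-region version reverses the buffer, while the naive version reads scratch entries that have not been written yet. The design point of the sketch is that region formation only establishes which code may run in parallel across work-items; it does not decide how that parallelism is executed.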


  1. At the time of this writing, pocl does not yet support popular commercial GPU targets. However, the SPMD/GPU path of the kernel compiler has been tested by using research targets to ensure GPU-like devices can be supported using pocl.




The work has been financially supported by the Academy of Finland (funding decision 253087), Finnish Funding Agency for Technology and Innovation (Project “Parallel Acceleration”, funding decision 40115/13), ARTEMIS joint undertaking under Grant Agreement No. 641439 (ALMARVI), by NSF awards 0905046, 0941653, and 1212401, as well as NSERC grant 2012-RGPIN-1505. In addition to the financial supporters, the authors would also like to thank the reviewers for their constructive comments and for the references they pointed out.

Author information


Corresponding author

Correspondence to Pekka Jääskeläinen.

About this article

Cite this article

Jääskeläinen, P., de La Lama, C.S., Schnetter, E. et al. pocl: A Performance-Portable OpenCL Implementation. Int J Parallel Prog 43, 752–785 (2015).
