International Journal of Parallel Programming, Volume 43, Issue 5, pp 752–785

pocl: A Performance-Portable OpenCL Implementation

  • Pekka Jääskeläinen
  • Carlos Sánchez de La Lama
  • Erik Schnetter
  • Kalle Raiskila
  • Jarmo Takala
  • Heikki Berg

Abstract

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, which enables support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.
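
The separation between parallel region formation and parallel mapping described above can be illustrated with a small sketch. It is not taken from the pocl source: the kernel, the fixed `LOCAL_SIZE`, and the names `region_a`, `region_b` and `run_work_group` are all hypothetical, and the loop-based mapping shown is only one possible target-specific mapping, assumed here for a plain CPU. A kernel with one barrier is split into two regions, and each region is then executed for every work-item before the next region starts.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE 8 /* hypothetical fixed work-group size */

/*
 * Conceptual OpenCL kernel being compiled (shown for reference only):
 *
 *   __kernel void reverse_in_group(__global int *data, __local int *tmp) {
 *       size_t lid = get_local_id(0);
 *       tmp[lid] = data[lid];                          // region A
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       data[lid] = tmp[get_local_size(0) - 1 - lid];  // region B
 *   }
 */

/* Region A: the code before the barrier, for a single work-item. */
static void region_a(size_t lid, const int *data, int *tmp)
{
    tmp[lid] = data[lid];
}

/* Region B: the code after the barrier, for a single work-item. */
static void region_b(size_t lid, int *data, const int *tmp)
{
    data[lid] = tmp[LOCAL_SIZE - 1 - lid];
}

/* Executes one whole work-group on a single thread (assumed mapping). */
void run_work_group(int *data)
{
    int tmp[LOCAL_SIZE]; /* stands in for __local memory */

    assert(data != NULL);

    /* Iterations are independent, so this loop is data parallel. */
    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        region_a(lid, data, tmp);

    /* The barrier is implied: every work-item has finished region A. */

    for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
        region_b(lid, data, tmp);
}
```

Calling `run_work_group` on an array of `LOCAL_SIZE` integers reverses it in place. Because the iterations within each loop are independent, a target with SIMD or multi-issue resources could map them to lanes or issue slots instead of running them sequentially, without changing how the regions were formed.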

Keywords

OpenCL, LLVM, GPGPU, VLIW, SIMD, Parallel programming, Heterogeneous platforms, Performance portability

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Pekka Jääskeläinen (1)
  • Carlos Sánchez de La Lama (2)
  • Erik Schnetter (3, 4, 5)
  • Kalle Raiskila (6)
  • Jarmo Takala (1)
  • Heikki Berg (6)

  1. Tampere University of Technology, Tampere, Finland
  2. Knowledge Development for POF, Madrid, Spain
  3. Perimeter Institute for Theoretical Physics, Waterloo, Canada
  4. Department of Physics, University of Guelph, Guelph, Canada
  5. Center for Computation and Technology, Louisiana State University, Baton Rouge, USA
  6. Nokia Research Center, Espoo, Finland
