Abstract
OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, which reduces the program porting effort. While the standard brings the obvious benefit of platform portability, performance portability is largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions. This separation enables support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This information can be exploited by later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.
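To make the idea of parallel region formation more concrete, the following is a minimal sketch in plain C, not pocl's actual output. It assumes a hypothetical kernel that copies its input into local memory, executes one barrier, and then writes the values back in reverse order. The code before and after the barrier forms two parallel regions, and one common way to run such regions on a CPU is to wrap each in a loop over the work-items of a work-group. The names `reverse_work_group` and `WG_SIZE` are illustrative and do not come from the paper.

```c
#include <stddef.h>

/* Hypothetical fixed work-group size, chosen only for this illustration. */
enum { WG_SIZE = 8 };

/*
 * Sketch of a "work-group function" for a hypothetical kernel of the form
 *
 *     tmp[lid] = in[lid];
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[lid] = tmp[WG_SIZE - 1 - lid];
 *
 * The barrier splits the kernel body into two parallel regions. Each region
 * is executed for every work-item before the next region starts, so the
 * barrier semantics hold without any thread synchronization.
 */
static void reverse_work_group(int *out, const int *in)
{
    int tmp[WG_SIZE]; /* stands in for __local memory */

    /* Parallel region 1: everything before the barrier. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        tmp[lid] = in[lid];

    /* The barrier is implicit: region 1 has finished for all work-items. */

    /* Parallel region 2: everything after the barrier. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        out[lid] = tmp[WG_SIZE - 1 - lid];
}
```

Because the iterations inside each loop are independent, a later target-specific stage is free to vectorize them, unroll them for static multi-issue, or leave them to the hardware on an SPMD device. In this sketch that information is only implied by the loop structure, whereas the abstract describes recording it explicitly in the LLVM IR and its metadata.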
Notes
At the time of this writing, pocl does not yet support popular commercial GPU targets. However, the SPMD/GPU path of the kernel compiler has been tested by using research targets to ensure GPU-like devices can be supported using pocl.
Acknowledgments
The work has been financially supported by the Academy of Finland (funding decision 253087), Finnish Funding Agency for Technology and Innovation (Project “Parallel Acceleration”, funding decision 40115/13), ARTEMIS joint undertaking under Grant Agreement No. 641439 (ALMARVI), by NSF awards 0905046, 0941653, and 1212401, as well as NSERC grant 2012-RGPIN-1505. In addition to the financial supporters, the authors would like to thank the reviewers for their constructive comments and for the references they pointed out.
Cite this article
Jääskeläinen, P., de La Lama, C.S., Schnetter, E. et al. pocl: A Performance-Portable OpenCL Implementation. Int J Parallel Prog 43, 752–785 (2015). https://doi.org/10.1007/s10766-014-0320-y
DOI: https://doi.org/10.1007/s10766-014-0320-y