pocl: A Performance-Portable OpenCL Implementation

Jääskeläinen, Pekka; de La Lama, Carlos Sánchez; Schnetter, Erik; Raiskila, Kalle; Takala, Jarmo; Berg, Heikki

doi:10.1007/s10766-014-0320-y

Published: 19 August 2014

International Journal of Parallel Programming, Volume 43, pages 752–785 (2015)

Pekka Jääskeläinen¹,
Carlos Sánchez de La Lama²,
Erik Schnetter³,⁴,⁵,
Kalle Raiskila⁶,
Jarmo Takala¹ &
Heikki Berg⁶

Abstract

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, which enables support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. We test the two aspects of portability by using the kernel compiler and the OpenCL implementation to run OpenCL applications on various platforms with different styles of parallel resources. The results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.
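To make the idea of parallel region formation more concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes a hypothetical kernel that copies its input into work-group local storage, waits at a barrier, and then writes the values back in reversed order within each work-group. It also assumes that regions are delimited by barriers, which the abstract does not state explicitly. The names `reverse_workgroup` and `run` are invented for illustration. Each region becomes a loop over all work-items of the group, and the iterations of each loop are independent, which is the property a later target-specific pass could map to SIMD lanes or multi-issue slots.

```python
def reverse_workgroup(inp, out, group_id, local_size):
    """Work-group function for a hypothetical kernel (illustration only):

        tmp[lid] = inp[gid]
        barrier()
        out[gid] = tmp[local_size - 1 - lid]
    """
    base = group_id * local_size
    tmp = [None] * local_size  # stands in for work-group local memory

    # Parallel region 1: the code before the barrier, run for every work-item.
    # Iterations do not depend on each other.
    for lid in range(local_size):
        tmp[lid] = inp[base + lid]

    # The barrier needs no code here: region 1 has completed for all
    # work-items before region 2 starts.

    # Parallel region 2: the code after the barrier, run for every work-item.
    for lid in range(local_size):
        out[base + lid] = tmp[local_size - 1 - lid]


def run(inp, local_size):
    """Launch the hypothetical kernel over all work-groups, one after another."""
    if len(inp) % local_size != 0:
        raise ValueError("global size must be a multiple of the local size")
    out = [None] * len(inp)
    for group_id in range(len(inp) // local_size):
        reverse_workgroup(inp, out, group_id, local_size)
    return out


if __name__ == "__main__":
    print(run([1, 2, 3, 4, 5, 6, 7, 8], local_size=4))
```

In this sketch the regions are ordinary sequential loops; the point is only that the split at the barrier leaves two loops whose iterations can be executed in any order or in parallel.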

Notes

At the time of this writing, pocl does not yet support popular commercial GPU targets. However, the SPMD/GPU path of the kernel compiler has been tested by using research targets to ensure GPU-like devices can be supported using pocl.

Acknowledgments

The work has been financially supported by the Academy of Finland (funding decision 253087), Finnish Funding Agency for Technology and Innovation (Project “Parallel Acceleration”, funding decision 40115/13), ARTEMIS joint undertaking under Grant Agreement No. 641439 (ALMARVI), by NSF awards 0905046, 0941653, and 1212401, as well as NSERC grant 2012-RGPIN-1505. In addition to the financial supporters, the authors would also like to thank the reviewers for their constructive comments and for the references they pointed out.

Author information

Authors and Affiliations

Tampere University of Technology, Tampere, Finland
Pekka Jääskeläinen & Jarmo Takala
Knowledge Development for POF, Madrid, Spain
Carlos Sánchez de La Lama
Perimeter Institute for Theoretical Physics, Waterloo, ON, Canada
Erik Schnetter
Department of Physics, University of Guelph, Guelph, ON, Canada
Erik Schnetter
Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, USA
Erik Schnetter
Nokia Research Center, Espoo, Finland
Kalle Raiskila & Heikki Berg

Corresponding author

Correspondence to Pekka Jääskeläinen.

About this article

Cite this article

Jääskeläinen, P., de La Lama, C.S., Schnetter, E. et al. pocl: A Performance-Portable OpenCL Implementation. Int J Parallel Prog 43, 752–785 (2015). https://doi.org/10.1007/s10766-014-0320-y

Received: 07 February 2014
Accepted: 05 August 2014
Published: 19 August 2014
Issue Date: October 2015
DOI: https://doi.org/10.1007/s10766-014-0320-y
