Towards High-Level Programming for Systems with Many Cores

  • Sergei Gorlatch
  • Michel Steuwer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8974)


Application development for modern high-performance systems with many cores, i.e., systems comprising multiple Graphics Processing Units (GPUs) and multi-core CPUs, currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. In this paper, we advocate a high-level programming approach for such systems, which rests on the following two main principles: (a) the model is based on the current OpenCL standard, so that programs remain portable across various many-core systems, independently of the vendor, and all low-level code optimizations can still be applied; (b) the model extends OpenCL with three high-level features which simplify many-core programming and are automatically translated by the system into OpenCL code. The high-level features of our programming model are as follows: (1) memory management is simplified and automated using parallel container data types (vectors and matrices); (2) a data (re)distribution mechanism supports data partitioning and generates automatic data movements between multiple GPUs; (3) computations are expressed precisely and concisely using parallel algorithmic patterns (skeletons). The well-defined semantics of the skeletons allows for semantics-preserving transformations of SkelCL programs, which can be applied during program development as well as in the compilation and optimization phase. We demonstrate how our programming model and its implementation are used to express several parallel applications, and we report first experimental results evaluating our approach in terms of program size and runtime performance.


Keywords: SkelCL · Graphics Processing Units (GPUs) · Multiple GPUs · Container data types · Parallel containers



This work is partially supported by the OFERTIE (FP7) and MONICA projects. We would like to thank the anonymous reviewers for their valuable comments, as well as NVIDIA for their generous hardware donation used in our experiments.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. University of Münster, Münster, Germany
