International Journal of Parallel Programming, Volume 42, Issue 4, pp 601–618

Introducing and Implementing the Allpairs Skeleton for Programming Multi-GPU Systems

  • Michel Steuwer
  • Malte Friese
  • Sebastian Albers
  • Sergei Gorlatch


Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide efficient implementations of them, allowing the application developer to focus on the structure of algorithms rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs), whose programming is complex and error-prone because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations, which occur in real-world applications ranging from bioinformatics to physics. We develop the skeleton’s generic parallel implementation for multi-GPU systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90% of lines of code compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher than that of the generic implementation and is competitive with the performance of manually written, optimized OpenCL code.
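As a rough illustration of what an allpairs computation means, the following sequential C++ sketch applies a customizing function to every pair of rows taken from two matrices and then expresses matrix multiplication in those terms. It is not the SkelCL API and not the OpenCL implementation described above; the names `allpairs`, `dot`, `transpose`, and `matmul` are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Row = std::vector<float>;
using Matrix = std::vector<Row>;  // row-major: one Row per matrix row

// Sequential reference semantics of an allpairs computation (illustrative,
// not the SkelCL interface): C[i][j] = f(row i of A, row j of B).
template <typename F>
Matrix allpairs(const Matrix& a, const Matrix& b, F f) {
    Matrix c(a.size(), Row(b.size()));
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            c[i][j] = f(a[i], b[j]);
        }
    }
    return c;
}

// Customizing function for matrix multiplication: the dot product of two rows.
float dot(const Row& x, const Row& y) {
    assert(x.size() == y.size());
    return std::inner_product(x.begin(), x.end(), y.begin(), 0.0f);
}

// Helper that turns the columns of m into rows.
Matrix transpose(const Matrix& m) {
    if (m.empty()) return {};
    Matrix t(m[0].size(), Row(m.size()));
    for (std::size_t i = 0; i < m.size(); ++i) {
        for (std::size_t j = 0; j < m[i].size(); ++j) {
            t[j][i] = m[i][j];
        }
    }
    return t;
}

// A * B as an allpairs computation over the rows of A and the columns of B.
Matrix matmul(const Matrix& a, const Matrix& b) {
    return allpairs(a, transpose(b), dot);
}
```

Calling `matmul` on two 2×2 matrices produces the ordinary matrix product. Each output element depends only on one row from each input, so all pairs can in principle be evaluated independently, which is what makes the pattern a candidate for parallel execution on GPUs.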


Keywords: High-level programming models, Algorithmic skeletons, GPU computing, Allpairs computation, SkelCL



We thank the anonymous reviewers for their valuable comments and NVIDIA for donating hardware.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Michel Steuwer (1, corresponding author)
  • Malte Friese (1)
  • Sebastian Albers (1)
  • Sergei Gorlatch (1)

  1. Department of Mathematics and Computer Science, University of Muenster, Münster, Germany
