International Journal of Parallel Programming, Volume 46, Issue 1, pp 42–61

Simultaneous CPU–GPU Execution of Data Parallel Algorithmic Skeletons



Parallel programming has become ubiquitous; however, it is still a low-level and error-prone task, especially when accelerators such as GPUs are used. Algorithmic skeletons have therefore been proposed: they provide well-defined programming patterns that assist programmers and shield them from low-level aspects. As the complexity of problems, and consequently the need for computing capacity, grows, we have directed our research toward simultaneous CPU–GPU execution of data parallel skeletons to achieve a performance gain. GPUs are optimized for throughput and designed for massively parallel computations. We therefore analyze whether additionally using the CPU for data parallel skeletons in the Muenster Skeleton Library leads to speedups or instead reduces performance because of the smaller computational capacity of CPUs compared to GPUs. We present a C++ implementation based on a static distribution approach. To evaluate the implementation, four benchmarks have been conducted: matrix multiplication, N-body simulation, Frobenius norm, and ray tracing. The ratio of CPU to GPU execution has been varied manually to observe the effects of different distributions. The results show that a speedup can be achieved by distributing the execution among CPUs and GPUs. However, both the results and the optimal distribution depend strongly on the available hardware and the specific algorithm.
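The static distribution idea described in the abstract can be illustrated with a minimal C++ sketch. It is not the Muenster Skeleton Library API: the names `cpu_partition_size`, `hybrid_map`, and `cpu_share` are hypothetical, and the "accelerator" partition is only simulated by an asynchronous host task so that the example runs without a GPU. In a real implementation that partition would be processed by a GPU kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical helper: number of leading elements assigned to the CPU
// for a manually chosen CPU share in [0, 1].
std::size_t cpu_partition_size(std::size_t n, double cpu_share) {
    assert(cpu_share >= 0.0 && cpu_share <= 1.0);
    return static_cast<std::size_t>(static_cast<double>(n) * cpu_share);
}

// Hypothetical map skeleton with a static split: elements [0, split) are
// processed by the calling thread (the CPU part), elements [split, n) by a
// stand-in for the accelerator part. The split is fixed before execution
// starts and is not adjusted at run time.
// Assumption: f is safe to call from two threads concurrently.
std::vector<double> hybrid_map(const std::vector<double>& in,
                               const std::function<double(double)>& f,
                               double cpu_share) {
    const std::size_t n = in.size();
    const std::size_t split = cpu_partition_size(n, cpu_share);
    std::vector<double> out(n);

    // Stand-in for the GPU partition: runs asynchronously on a host thread.
    auto accel = std::async(std::launch::async, [&] {
        for (std::size_t i = split; i < n; ++i) out[i] = f(in[i]);
    });

    // CPU partition, executed at the same time as the task above.
    for (std::size_t i = 0; i < split; ++i) out[i] = f(in[i]);

    accel.get();  // wait until both partitions are complete
    return out;
}
```

Calling `hybrid_map(data, f, 0.25)` assigns the first quarter of the elements to the CPU part and the rest to the accelerator stand-in; varying the last argument corresponds to varying the CPU–GPU ratio manually. On some platforms the program must be linked with thread support (for example `-pthread`).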


Keywords: High-level parallel programming; Data parallel algorithmic skeletons; Simultaneous CPU–GPU execution



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Münster, Germany
