Simultaneous CPU–GPU Execution of Data Parallel Algorithmic Skeletons

Abstract

Parallel programming has become ubiquitous; however, it remains a low-level and error-prone task, especially when accelerators such as GPUs are involved. Algorithmic skeletons have therefore been proposed to provide well-defined programming patterns that assist programmers and shield them from low-level details. As the complexity of problems, and consequently the demand for computing capacity, grows, we have directed our research toward the simultaneous CPU–GPU execution of data parallel skeletons to achieve a performance gain. GPUs are optimized for throughput and designed for massively parallel computation. Nevertheless, we analyze whether additionally utilizing the CPU for the data parallel skeletons of the Muenster Skeleton Library yields a speedup or degrades performance owing to the smaller computational capacity of CPUs compared to GPUs. We present a C++ implementation based on a static distribution approach. To evaluate the implementation, four benchmarks have been conducted: matrix multiplication, N-body simulation, Frobenius norm, and ray tracing. The ratio of CPU to GPU execution has been varied manually in order to observe the effects of different distributions. The results show that a speedup can be achieved by distributing the execution among CPUs and GPUs; however, both the results and the optimal distribution depend strongly on the available hardware and the specific algorithm.
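
To make the static distribution approach concrete, the following minimal CUDA/OpenMP sketch splits a simple data parallel map (squaring every element) between GPU and CPU according to a fixed ratio. All names, the kernel, and the 0.75 ratio are illustrative assumptions for this sketch and do not reflect the actual Muesli interface.

    // Hypothetical sketch, not the Muesli API: a map skeleton instance
    // f(x) = x * x, statically split between GPU and CPU by a fixed ratio.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void mapSquare(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= data[i];
    }

    int main() {
        const int n = 1 << 20;
        const float gpuRatio = 0.75f;  // fraction of elements assigned to the GPU (illustrative)
        const int nGpu = static_cast<int>(gpuRatio * n);

        std::vector<float> host(n, 2.0f);

        // GPU partition: copy the first nGpu elements and launch the kernel.
        // The launch is asynchronous, so the host thread continues immediately.
        float* dev = nullptr;
        cudaMalloc(&dev, nGpu * sizeof(float));
        cudaMemcpy(dev, host.data(), nGpu * sizeof(float), cudaMemcpyHostToDevice);
        const int threads = 256;
        mapSquare<<<(nGpu + threads - 1) / threads, threads>>>(dev, nGpu);

        // CPU partition: process the remaining elements while the kernel runs
        // (compile with: nvcc -Xcompiler -fopenmp).
        #pragma omp parallel for
        for (int i = nGpu; i < n; ++i)
            host[i] *= host[i];

        // Collect the GPU results; this cudaMemcpy synchronizes with the kernel.
        cudaMemcpy(host.data(), dev, nGpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);

        std::printf("host[0] = %.1f, host[n-1] = %.1f\n", host[0], host[n - 1]);
        return 0;
    }

For a rigorous overlap, pinned host memory and explicit CUDA streams would be preferable; in this sketch, the asynchronous kernel launch alone already allows the OpenMP loop to run concurrently with the device computation.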

Notes

  1. Following CUDA terminology, we also refer to GPUs as devices.

Author information

Corresponding author

Correspondence to Fabian Wrede.

About this article

Cite this article

Wrede, F., Ernsting, S. Simultaneous CPU–GPU Execution of Data Parallel Algorithmic Skeletons. Int J Parallel Prog 46, 42–61 (2018). https://doi.org/10.1007/s10766-016-0483-9
