Simultaneous CPU–GPU Execution of Data Parallel Algorithmic Skeletons

Abstract

Parallel programming has become ubiquitous; however, it remains a low-level and error-prone task, especially when accelerators such as GPUs are involved. Algorithmic skeletons have therefore been proposed to provide well-defined programming patterns that assist programmers and shield them from low-level details. As the complexity of problems, and consequently the demand for computing capacity, grows, we have directed our research toward the simultaneous CPU–GPU execution of data parallel skeletons to achieve a performance gain. GPUs are optimized for throughput and designed for massively parallel computation. Nevertheless, we analyze whether additionally utilizing the CPU for the data parallel skeletons of the Muenster Skeleton Library yields a speedup or degrades performance owing to the smaller computational capacity of CPUs compared to GPUs. We present a C++ implementation based on a static distribution approach. To evaluate the implementation, four benchmarks have been conducted: matrix multiplication, N-body simulation, Frobenius norm, and ray tracing. The ratio of CPU to GPU execution has been varied manually in order to observe the effects of different distributions. The results show that a speedup can be achieved by distributing the execution among CPUs and GPUs; however, both the results and the optimal distribution depend strongly on the available hardware and the specific algorithm.
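
To make the static distribution approach concrete, the following minimal CUDA/OpenMP sketch splits a simple data parallel map (squaring every element) between GPU and CPU according to a fixed ratio. All names, the kernel, and the 0.75 ratio are illustrative assumptions for this sketch and do not reflect the actual Muesli interface.

    // Hypothetical sketch, not the Muesli API: a map skeleton instance
    // f(x) = x * x, statically split between GPU and CPU by a fixed ratio.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void mapSquare(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= data[i];
    }

    int main() {
        const int n = 1 << 20;
        const float gpuRatio = 0.75f;  // fraction of elements assigned to the GPU (illustrative)
        const int nGpu = static_cast<int>(gpuRatio * n);

        std::vector<float> host(n, 2.0f);

        // GPU partition: copy the first nGpu elements and launch the kernel.
        // The launch is asynchronous, so the host thread continues immediately.
        float* dev = nullptr;
        cudaMalloc(&dev, nGpu * sizeof(float));
        cudaMemcpy(dev, host.data(), nGpu * sizeof(float), cudaMemcpyHostToDevice);
        const int threads = 256;
        mapSquare<<<(nGpu + threads - 1) / threads, threads>>>(dev, nGpu);

        // CPU partition: process the remaining elements while the kernel runs
        // (compile with: nvcc -Xcompiler -fopenmp).
        #pragma omp parallel for
        for (int i = nGpu; i < n; ++i)
            host[i] *= host[i];

        // Collect the GPU results; this cudaMemcpy synchronizes with the kernel.
        cudaMemcpy(host.data(), dev, nGpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);

        std::printf("host[0] = %.1f, host[n-1] = %.1f\n", host[0], host[n - 1]);
        return 0;
    }

For a rigorous overlap, pinned host memory and explicit CUDA streams would be preferable; in this sketch, the asynchronous kernel launch alone already allows the OpenMP loop to run concurrently with the device computation.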

Notes

  1. Following CUDA terminology, we also refer to GPUs as devices.

Author information

Corresponding author

Correspondence to Fabian Wrede.

About this article

Cite this article

Wrede, F., Ernsting, S. Simultaneous CPU–GPU Execution of Data Parallel Algorithmic Skeletons. Int J Parallel Prog 46, 42–61 (2018). https://doi.org/10.1007/s10766-016-0483-9
