Boosting CUDA Applications with CPU–GPU Hybrid Computing

Lee, Changmin; Ro, Won Woo; Gaudiot, Jean-Luc

doi:10.1007/s10766-013-0252-y

Published: 22 May 2013

International Journal of Parallel Programming, Volume 42, pages 384–404 (2014)

Changmin Lee¹,
Won Woo Ro¹ &
Jean-Luc Gaudiot²

Abstract

This paper presents a cooperative heterogeneous computing framework that enables CUDA kernels, which are designed to run only on the GPU, to make efficient use of the available computing resources of the host CPU cores. The proposed system exploits coarse-grain thread-level parallelism across the CPU and GPU at runtime, without any source recompilation. To this end, the paper describes three features: a work distribution module, a transparent memory space, and a global scheduling queue. With completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08× in the best case and 1.42× on average over the baseline GPU-only processing.
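As a rough illustration of the global scheduling queue idea, the following minimal Python sketch lets two workers pull coarse-grain units of work (block indices) from one shared queue until it is empty. It is not the authors' implementation, which the abstract does not detail. The names `run_hybrid`, `kernel`, and `workers` are hypothetical, the "cpu" and "gpu" workers are ordinary threads standing in for devices, and the per-worker chunk size is an assumed way to model a faster device claiming more blocks per queue access.

```python
import queue
import threading


def run_hybrid(num_blocks, kernel, workers):
    """Run kernel(block_id) once per block, sharing blocks between workers.

    workers maps a hypothetical device name to its chunk size, i.e. how many
    blocks that worker tries to claim on each visit to the shared queue.
    """
    # Global scheduling queue: every block index is enqueued exactly once.
    work = queue.Queue()
    for block_id in range(num_blocks):
        work.put(block_id)

    results = [None] * num_blocks              # one output slot per block
    executed = {name: 0 for name in workers}   # blocks completed per worker

    def worker(name, chunk):
        while True:
            claimed = []
            try:
                for _ in range(chunk):
                    claimed.append(work.get_nowait())
            except queue.Empty:
                pass  # queue drained; finish whatever was already claimed
            if not claimed:
                return
            for block_id in claimed:
                results[block_id] = kernel(block_id)
            executed[name] += len(claimed)     # only this thread writes this key

    threads = [
        threading.Thread(target=worker, args=(name, chunk))
        for name, chunk in workers.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, executed


if __name__ == "__main__":
    # Stand-in "kernel": square the block index.
    results, executed = run_hybrid(64, lambda b: b * b, {"cpu": 1, "gpu": 8})
    print(executed)  # the split between workers varies from run to run
```

Because both workers draw from the same queue, the split adapts to whichever worker finishes its chunk first, which is the sense in which such a distribution is automatic. A real framework would additionally have to keep the CPU and GPU views of memory consistent, which this sketch sidesteps by using a single shared Python list.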

Acknowledgments

We thank all of the anonymous reviewers for their comments. This work was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea, which is funded by the Ministry of Education, Science and Technology [2009-0070364]. This work is also supported in part by the US National Science Foundation under Grant No. CCF-1065448. Any opinions, findings, and conclusions as well as recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Yonsei University, Seoul, 120-749, Republic of Korea
Changmin Lee & Won Woo Ro
University of California, Irvine, CA, 92697-2625, USA
Jean-Luc Gaudiot

Corresponding author

Correspondence to Won Woo Ro.

About this article

Cite this article

Lee, C., Ro, W.W. & Gaudiot, JL. Boosting CUDA Applications with CPU–GPU Hybrid Computing. Int J Parallel Prog 42, 384–404 (2014). https://doi.org/10.1007/s10766-013-0252-y

Received: 10 September 2012
Accepted: 14 May 2013
Published: 22 May 2013
Issue Date: April 2014
DOI: https://doi.org/10.1007/s10766-013-0252-y
