Abstract
This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08\(\times \) in the best case and 1.42\(\times \) on average compared to the baseline GPU-only processing.
Similar content being viewed by others
References
Acar, U.A., Blelloch, G.E., Blumofe, R.D.: The data locality of work stealing. In: Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’00, pp. 1–12. ACM, New York, NY, USA (2000)
Bakhoda, A., Yuan, G., Fung, W., Wong, H., Aamodt, T.: Analyzing cuda workloads using a detailed gpu simulator. In: Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pp. 163–174 (2009). doi:10.1109/ISPASS.2009.4919648
Bell, N., Garland, M.: Cusp: Generic parallel algorithms for sparse matrix and graph computations (2010). http://cusp-library.googlecode.com
Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM 46, 720–748 (1999)
Brodtkorb, A.R., Dyken, C., Hagen, T.R., Hjelmervik, J.M., Storaasli, O.O.: State-of-the-art in heterogeneous computing. Sci. Program. 18, 1–33 (2010)
Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.H., Skadron, K.: Rodinia: a benchmark suite for heterogeneous computing. In: Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pp. 44–54 (2009). doi:10.1109/IISWC.2009.5306797
Cifuentes, C., Malhotra, V.M.: Binary translation: static, dynamic, retargetable? In: Proceedings of the 1996 International Conference on Software Maintenance, ICSM ’96, pp. 340–349. IEEE Computer Society, Washington, DC, USA (1996)
Diamos, G.F., Kerr, A.R., Yalamanchili, S., Clark, N.: Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pp. 353–364. ACM, New York, NY, USA (2010)
Garland, M., Kirk, D.B.: Understanding throughput-oriented architectures. Commun. ACM 53, 58–66 (2010)
Gummaraju, J., Morichetti, L., Houston, M., Sander, B., Gaster, B.R., Zheng, B.: Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pp. 205–216. ACM, New York, NY, USA (2010)
Juric, M.: Cuda md5 hashing. http://majuric.org/software/cudamd5
Kerr, A., Diamos, G., Yalamanchili, S.: A characterization and analysis of ptx kernels. In: Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC ’09, pp. 3–12. IEEE Computer Society, Washington, DC, USA (2009)
Kumar, R., Tullsen, D., Jouppi, N., Ranganathan, P.: Heterogeneous chip multiprocessors. Computer 38(11), 32–38 (2005)
Lattner, C., Adve, V.: Llvm: A compilation framework for lifelong program analysis & transformation. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, CGO ’04, pp. 75. IEEE Computer Society, Washington, DC, USA (2004)
Lee, J., Kim, J., Seo, S., Kim, S., Park, J., Kim, H., Dao, T.T., Cho, Y., Seo, S.J., Lee, S.H., Cho, S.M., Song, H.J., Suh, S.B., Choi, J.D.: An opencl framework for heterogeneous multicores with local memory. In: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT ’10, pp. 193–204. ACM, New York, NY, USA (2010)
Linderman, M.D., Collins, J.D., Wang, H., Meng, T.H.: Merge: a programming model for heterogeneous multi-core systems. In: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XIII, pp. 287–296. ACM, New York, NY, USA (2008)
Luk, C.K., Hong, S., Kim, H.: Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In: Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pp. 45–55 (2009)
Nickolls, J., Dally, W.: The gpu computing era. Micro IEEE 30(2), 56–69 (2010)
NVIDIA: Cuda parallel computing platform. http://developer.nvidia.com/category/zone/cuda-zone
NVIDIA: Ptx: Parallel thread execution isa. http://developer.nvidia.com/cuda/nvidia-gpu-computing-documentation
OpenCL: The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl
Ravi, V.T., Ma, W., Chiu, D., Agrawal, G.: Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10, pp. 137–146. ACM, New York, NY, USA (2010)
Saha, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Rajagopalan, M., Fang, J., Zhang, P., Ronen, R., Mendelson, A.: Programming model for a heterogeneous x86 platform. In: Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’09, pp. 431–440. ACM, New York, NY, USA (2009)
Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: a many-core x86 architecture for visual computing. In: ACM SIGGRAPH 2008 papers, SIGGRAPH ’08, pp. 18:1–18:15. ACM, New York, NY, USA (2008)
Stratton, J., Stone, S., Hwu, W.m.: Mcuda: An efficient implementation of cuda kernels for multi-core cpus. In: Amaral, J. (ed.) Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, vol. 5335, pp. 16–30. Springer, Berlin (2008)
Tian, C., Feng, M., Gupta, R.: Supporting speculative parallelization in the presence of dynamic data structures. In: Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pp. 62–73. ACM, New York, NY, USA (2010)
Wang, P.H., Collins, J.D., Chinya, G.N., Jiang, H., Tian, X., Girkar, M., Yang, N.Y., Lueh, G.Y., Wang, H.: Exochi: architecture and programming environment for a heterogeneous multi-core multithreaded system. In: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’07, pp. 156–166. ACM, New York, NY, USA (2007)
Acknowledgments
We thank all of the anonymous reviewers for their comments. This work was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea, which is funded by the Ministry of Education, Science and Technology [2009-0070364]. This work is also supported in part by the US National Science Foundation under Grant No. CCF-1065448. Any opinions, findings, and conclusions as well as recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Lee, C., Ro, W.W. & Gaudiot, JL. Boosting CUDA Applications with CPU–GPU Hybrid Computing. Int J Parallel Prog 42, 384–404 (2014). https://doi.org/10.1007/s10766-013-0252-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10766-013-0252-y