CoreTSAR: Adaptive Worksharing for Heterogeneous Systems
The popularity of heterogeneous computing continues to increase rapidly due to the high peak performance, favorable energy efficiency, and comparatively low cost of accelerators. However, heterogeneous programming models still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP models, including OpenMP 4.0 and OpenACC, ease the migration of code from CPUs to GPUs but lack much of OpenMP's flexibility: OpenMP applications can run on any number of CPUs without extra user effort, but GPU implementations offer no comparable adaptive worksharing across the GPUs in a node, nor can they employ a mix of CPUs and GPUs. To address these shortcomings, we present CoreTSAR, our library for scheduling cores via a task-size adapting runtime, which supports worksharing of loop nests across arbitrary heterogeneous resources. Beyond scheduling the computational load across devices, CoreTSAR includes a memory-management system that operates based on task association, enabling the runtime to dynamically manage memory movement and task granularity. Our evaluation shows that CoreTSAR can provide nearly linear scaling to four GPUs and all cores in a node without modifying the code within the parallel region. Furthermore, CoreTSAR provides portable performance across a variety of system configurations.
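The core idea of adaptive worksharing, splitting a loop's iterations across devices in proportion to their observed throughput and refining that split as the application runs, can be illustrated with a minimal sketch. This is not CoreTSAR's actual API; the function names, the exponential-smoothing update, and the pass-by-pass structure are assumptions made purely for illustration.

```python
# Illustrative sketch (not CoreTSAR's implementation): partition loop
# iterations across heterogeneous devices by measured throughput, and
# refine the per-device estimates after each parallel pass.

def split_iterations(total, throughput):
    """Partition `total` iterations proportionally to per-device throughput."""
    weight_sum = sum(throughput.values())
    shares = {dev: int(total * w / weight_sum) for dev, w in throughput.items()}
    # Assign any rounding remainder to the fastest device.
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += total - sum(shares.values())
    return shares

def update_throughput(throughput, shares, elapsed, alpha=0.5):
    """Blend newly observed rates (iterations/second) into running estimates
    with exponential smoothing; `alpha` is an assumed tuning parameter."""
    for dev, t in elapsed.items():
        observed = shares[dev] / t if t > 0 else throughput[dev]
        throughput[dev] = alpha * observed + (1 - alpha) * throughput[dev]
    return throughput
```

For example, starting from equal estimates, a GPU that finishes its half of the iterations in half the CPU's time receives a larger share on the next pass, so the split converges toward the devices' relative speeds without any change to the loop body itself.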
Keywords: Static Schedule · Adaptive Schedule · CUDA Implementation · OpenMP Version · Task Granularity
- 1.Anandakrishnan, R., Scogland, T.R.W., Fenley, A.T., Gordon, J.C., Feng, W.-c., Onufriev, A.V.: Accelerating Electrostatic Surface Potential Calculation with Multi-Scale Approximation on Graphics Processing Units. Journal of Molecular Graphics and Modelling 28(8), 904–910 (2009)Google Scholar
- 4.Berkelaar, M., Notebaert, P., Eikland, K.: lp_solve(mixed integer) linear programming problem solver (2003), http://lpsolve.sourceforge.net/5.0/
- 6.CAPS Enterprise, Cray Inc., NVIDIA and the Portland Group. The openacc application programming interface, v1.0. (November 2011), http://www.openacc-standard.org
- 7.Daga, M., Scogland, T., Feng, W.: Architecture-aware mapping and optimization on a 1600-core gpu. In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), pp. 316–323. IEEE (2011)Google Scholar
- 10.Grauer-Gray, S., Xu, L., Searles, R., Ayalasomayajula, S.: Auto-tuning a High-Level Language Targeted to GPU Codes. cis.udel.eduGoogle Scholar
- 11.Munshi, A.: Khronos OpenCL Working Group and others. The opencl specification (2008)Google Scholar
- 12.OpenMP Architecture Review Board. OpenMP application program interface version 4.0 (2013)Google Scholar
- 13.Ravi, V.T., Agrawal, G.: A dynamic scheduling framework for emerging heterogeneous systems. In: 2011 18th International Conference on High Performance Computing (HiPC), pp. 1–10 (2011)Google Scholar
- 14.Ravi, V.T., Ma, W., Chiu, D., Agrawal, G.: Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In: ICS 2010: Proceedings of the 24th ACM International Conference on Supercomputing, ACM Request Permissions (June 2010)Google Scholar
- 15.Reinders, J.: Intel Threading Building Blocks (2007)Google Scholar
- 16.Scogland, T.R.W., Rountree, B., Feng, W.-c., de Supinski, B.R.: Heterogeneous Task Scheduling for Accelerated OpenMP. In: 2012 IEEE International Parallel & Distributed Processing Symposium (IPDPS), Shanghai, China (2012)Google Scholar