Load balancing in a heterogeneous world: CPU-Xeon Phi co-execution of data-parallel kernels
Heterogeneous systems composed of a CPU and a set of different hardware accelerators are very compelling thanks to their excellent performance and energy consumption features. One of the most important problems in these systems is distributing the workload among their devices. This paper describes an extension of the Maat library that allows the co-execution of a data-parallel OpenCL kernel on a heterogeneous system composed of a CPU and an Intel Xeon Phi. Maat provides an abstract view of the heterogeneous system as well as a set of load balancing algorithms to squeeze the performance out of the node. It automatically partitions the data and distributes it among the devices, generates the kernels and efficiently merges the partial outputs. Experimental results show that this approach always outperforms the baseline that uses only a Xeon Phi, giving excellent performance and energy efficiency. Furthermore, selecting the right load balancing algorithm is essential because it has a huge impact on system performance and energy consumption.
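The partition, distribute and merge steps described above can be illustrated with a minimal sketch. It is not Maat's API and it is not OpenCL code: the function names `static_partition` and `co_execute`, the `speeds` list of relative device throughputs and the `group_size` parameter are all hypothetical, and a static proportional split is only one possible load balancing strategy. The sketch assumes each device processes one contiguous range of work-items and that each range must be a multiple of the work-group size.

```python
def static_partition(total_items, speeds, group_size=1):
    """Split total_items into contiguous (offset, size) chunks, one per device.

    speeds holds assumed relative throughputs (hypothetical values, e.g. from
    a profiling run). Chunk sizes are multiples of group_size; the last device
    absorbs the rounding remainder so every work-item is assigned exactly once.
    """
    if total_items % group_size != 0:
        raise ValueError("total_items must be a multiple of group_size")
    total_groups = total_items // group_size
    total_speed = sum(speeds)
    chunks, offset, assigned = [], 0, 0
    for i, speed in enumerate(speeds):
        if i == len(speeds) - 1:
            groups = total_groups - assigned  # remainder goes to the last device
        else:
            groups = int(total_groups * speed / total_speed)
        chunks.append((offset, groups * group_size))
        offset += groups * group_size
        assigned += groups
    return chunks


def co_execute(kernel, data, speeds, group_size=1):
    """Run kernel on each chunk and merge the partial outputs in place.

    kernel stands in for a per-device launch; here the chunks run one after
    another, whereas a real runtime would run them concurrently.
    """
    output = [None] * len(data)
    for offset, size in static_partition(len(data), speeds, group_size):
        output[offset:offset + size] = kernel(data[offset:offset + size])
    return output


if __name__ == "__main__":
    # Two devices, the second assumed three times faster than the first.
    print(static_partition(1024, [1.0, 3.0], group_size=64))
```

With the assumed 1:3 speed ratio, the slower device receives a quarter of the work-groups and the faster one the rest. A static split like this depends entirely on how accurate the speed estimates are, which is one reason the choice of load balancing algorithm matters.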
Keywords: Heterogeneous computing, Co-execution, CPU-Xeon Phi, Load balancing, OpenCL, Performance portability, Energy efficiency
This work has been supported by the Spanish Ministry of Education, FPU grant FPU16/03299, the University of Cantabria, grant CVE-2014-18166, the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and TIN2016-81840-REDT (CAPAP-H6 network), the European Research Council (G.A. No. 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 671697.