
The Journal of Supercomputing, Volume 75, Issue 3, pp 1123–1136

Load balancing in a heterogeneous world: CPU-Xeon Phi co-execution of data-parallel kernels

  • Raúl Nozal
  • Borja Perez
  • Jose Luis Bosque
  • Ramón Beivide
Article

Abstract

Heterogeneous systems composed of a CPU and a set of different hardware accelerators are very compelling thanks to their excellent performance and energy efficiency. One of the most important problems in these systems is distributing the workload among their devices. This paper describes an extension of the Maat library that allows the co-execution of a data-parallel OpenCL kernel on a heterogeneous system composed of a CPU and an Intel Xeon Phi. Maat provides an abstract view of the heterogeneous system as well as a set of load balancing algorithms to extract the most performance from the node. It automatically partitions the data and distributes it among the devices, generates the kernels and efficiently merges the partial outputs. Experimental results show that this approach always outperforms the baseline that uses only a Xeon Phi, giving excellent performance and energy efficiency. Furthermore, selecting the right load balancing algorithm is essential, because it has a huge impact on system performance and energy consumption.
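
As a rough illustration of the partition-and-merge idea described in the abstract, the following Python sketch splits a one-dimensional range of work-items between devices in proportion to assumed relative speeds and then writes each partial result back into a shared output buffer. It is not Maat's API, and the abstract does not say which load balancing algorithms Maat uses: the names `static_partition` and `co_execute`, the `speeds` weights and the work-group alignment rule are all hypothetical, and the "devices" are simulated by ordinary Python calls.

```python
from typing import Callable, List, Sequence, Tuple


def static_partition(total_items: int, speeds: Sequence[float],
                     group_size: int) -> List[Tuple[int, int]]:
    """Split [0, total_items) into one contiguous (offset, size) chunk per device.

    Hypothetical static scheme: chunk sizes are proportional to the assumed
    relative speeds, rounded down to a multiple of group_size. The last
    device absorbs whatever is left over.
    """
    if total_items % group_size != 0:
        raise ValueError("total_items must be a multiple of group_size")
    total_speed = sum(speeds)
    groups = total_items // group_size
    chunks = []
    offset = 0
    for i, speed in enumerate(speeds):
        if i == len(speeds) - 1:
            size = total_items - offset  # remainder goes to the last device
        else:
            size = int(groups * speed / total_speed) * group_size
        chunks.append((offset, size))
        offset += size
    return chunks


def co_execute(data: Sequence[float], speeds: Sequence[float], group_size: int,
               kernel: Callable[[float], float]) -> List[float]:
    """Run the kernel on each chunk (one per simulated device) and merge results."""
    out: List[float] = [0.0] * len(data)
    for offset, size in static_partition(len(data), speeds, group_size):
        # Each chunk would be enqueued on a different device; here it runs inline.
        partial = [kernel(x) for x in data[offset:offset + size]]
        out[offset:offset + size] = partial  # merge the partial output in place
    return out


if __name__ == "__main__":
    # Assumed weights: second device three times faster than the first.
    print(static_partition(1024, [1.0, 3.0], 64))
```

A static split like this only balances well when the relative speeds are known in advance and the kernel is regular; for irregular kernels, a dynamic scheme that hands out smaller chunks on demand would be the usual alternative.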

Keywords

Heterogeneous computing · Co-execution · CPU-Xeon Phi · Load balancing · OpenCL · Performance portability · Energy efficiency

Notes

Acknowledgements

This work has been supported by the Spanish Ministry of Education, FPU grant FPU16/03299, the University of Cantabria, grant CVE-2014-18166, the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and TIN2016-81840-REDT (CAPAP-H6 network), the European Research Council (G.A. No. 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 671697.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Computer Science and Electronics Department, University of Cantabria, Santander, Spain
