Developing adaptive multi-device applications with the Heterogeneous Programming Library
The use of heterogeneous devices presents two main problems. The first is that they are complex to program, a difficulty that grows when multiple devices are used. The second is that, even though code for these devices can be made portable on top of OpenCL, it lacks performance portability, so good performance effectively requires a specialized implementation for each device. In this paper we extend the Heterogeneous Programming Library (HPL), which improves the usability of heterogeneous systems on top of OpenCL, to better handle both issues. First, we provide HPL with mechanisms to support the implementation of any multi-device application that requires arbitrary patterns of communication between several devices and host memory. Second, we add to HPL an adaptive scheme that optimizes communications between devices depending on the execution environment. An evaluation using benchmarks of very different natures shows that HPL reduces the SLOCs and programming effort of OpenCL applications by 27% and 43%, respectively, while improving the performance of applications that exchange data between devices by 28% on average.
Keywords: Programmability, Heterogeneity, Parallelism, Portability, Libraries, OpenCL
This work was supported by the Xunta de Galicia under the Consolidation Program of Competitive Reference Groups (GRC2013/055) and by the Ministry of Economy and Competitiveness of Spain (TIN2013-42148-P), both of them cofunded by FEDER funds of the EU, and by the Scientific and Technological Research Council of Turkey (TUBITAK; 112E191). This work is also partially supported by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).