The Journal of Supercomputing, Volume 71, Issue 6, pp 2204–2220

Developing adaptive multi-device applications with the Heterogeneous Programming Library

  • Moisés Viñas
  • Zeki Bozkus
  • Basilio B. Fraguela
  • Diego Andrade
  • Ramón Doallo

Abstract

The use of heterogeneous devices presents two main problems. The first is their complex programming, which becomes harder when multiple devices are used. The second is that, even though code for these devices can be made portable on top of OpenCL, it lacks performance portability, so each device effectively requires a specialized implementation to achieve good performance. In this paper we extend the Heterogeneous Programming Library (HPL), which improves the usability of heterogeneous systems on top of OpenCL, to better handle both issues. First, we provide HPL with mechanisms to support the implementation of any multi-device application that requires arbitrary patterns of communication between several devices and host memory. Second, we add to HPL an adaptive scheme that optimizes communications between devices depending on the execution environment. An evaluation using benchmarks of very different natures shows that HPL reduces the source lines of code (SLOCs) and the programming effort of OpenCL applications by 27% and 43%, respectively, while improving the performance of applications that exchange data between devices by 28% on average.

Keywords

Programmability, Heterogeneity, Parallelism, Portability, Libraries, OpenCL

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Moisés Viñas (1)
  • Zeki Bozkus (2)
  • Basilio B. Fraguela (1)
  • Diego Andrade (1)
  • Ramón Doallo (1)
  1. Grupo de Arquitectura de Computadores, Universidade da Coruña, A Coruña, Spain
  2. Department of Computer Engineering, Kadir Has Üniversitesi, Istanbul, Turkey
