The Journal of Supercomputing, Volume 74, Issue 11, pp 5628–5642

Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models

  • Adrián Castelló
  • Antonio J. Peña
  • Rafael Mayo
  • Judit Planas
  • Enrique S. Quintana-Ortí
  • Pavan Balaji


Directive-based programming models, such as OpenMP, OpenACC, and OmpSs, enable users to accelerate applications by using coprocessors with little effort. These devices offer significant computing power, but their use can introduce two problems: an increase in the total cost of ownership and underutilization, because not all codes match their architecture. Remote accelerator virtualization frameworks address those problems. In particular, rCUDA provides transparent access to any graphics processing unit (GPU) installed in a cluster, reducing the number of accelerators required and increasing their utilization ratio. Combining these two technologies, directive-based programming models and rCUDA, is thus highly appealing. In this work, we study the integration of OmpSs and OpenACC with rCUDA, describing and analyzing several applications over three different hardware configurations that include two InfiniBand interconnects and three NVIDIA accelerators. Our evaluation reveals favorable performance results, showing low overhead and similar scaling factors when using remote accelerators instead of local devices.


Keywords: GPUs, Directive-based programming models, OpenACC, OmpSs, Remote virtualization, rCUDA



The researchers from the Universitat Jaume I de Castelló were supported by the Universitat Jaume I research project (P11B2013-21), project TIN2014-53495-R, a Generalitat Valenciana grant and FEDER. The researcher from the Barcelona Supercomputing Center (BSC-CNS) was supported by the European Commission (HiPEAC-3 Network of Excellence, FP7-ICT 287759), the Intel-BSC Exascale Lab collaboration, the IBM/BSC Exascale Initiative collaboration agreement, Computación de Altas Prestaciones VI (TIN2012-34557) and the Generalitat de Catalunya (2014-SGR-1051). This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DE-AC02-06CH11357. The initial version of rCUDA was jointly developed by Universitat Politècnica de València (UPV) and Universitat Jaume I de Castellón (UJI) until 2010. This initial development was later split into two branches. Part of the UPV version was used in this paper. The development of the UPV branch was supported by Generalitat Valenciana under Grants PROMETEO 2008/060 and Prometeo II 2013/009. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Adrián Castelló (1, corresponding author)
  • Antonio J. Peña (2)
  • Rafael Mayo (1)
  • Judit Planas (3)
  • Enrique S. Quintana-Ortí (1)
  • Pavan Balaji (4)

  1. Universitat Jaume I de Castelló, Castellón de la Plana, Spain
  2. Barcelona Supercomputing Center (BSC-CNS), Barcelona, Spain
  3. École Polytechnique Fédérale de Lausanne, Geneva, Switzerland
  4. Argonne National Laboratory, Lemont, USA
