On the effect of using rCUDA to provide CUDA acceleration to Xen virtual machines


Nowadays, many data centers use virtual machines (VMs) in order to achieve a more efficient use of hardware resources. The use of VMs provides a reduction in equipment and maintenance expenses as well as a lower electricity consumption. Nevertheless, current virtualization solutions, such as Xen, do not easily provide graphics processing units (GPUs) to applications running in the virtualized domain with the flexibility usually required in data centers (i.e., managing virtual GPU instances and concurrently sharing them among several VMs). Therefore, the execution of GPU-accelerated applications within VMs is hindered by this lack of flexibility. In this regard, remote GPU virtualization solutions may address this concern. In this paper we analyze the use of the remote GPU virtualization mechanism to accelerate scientific applications running inside Xen VMs. We conduct our study with six different applications, namely CUDA-MEME, CUDASW++, GPU-BLAST, LAMMPS, a triangle count application, referred to as TRICO, and a synthetic benchmark used to emulate different application behaviors. Our experiments show that the use of remote GPU virtualization is a feasible approach to address the current concerns of sharing GPUs among several VMs, featuring a very low overhead if an InfiniBand fabric is already present in the cluster.

This is a preview of subscription content, access via your institution.

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15


  1. 1.

    CUDA (compute unified device architecture) is a technology created by NVIDIA which comprises a parallel compute platform (CUDA-enabled graphics processing units) as well as an application programming interface (API) and a compiler.

  2. 2.

    In addition to the GRID K1 GPU, NVIDIA has also brought to market the GRID K2 model, which features 1536 CUDA cores per GPU and 4 GB of memory. However, this amount of resources per GPU is still noticeable smaller than the ones available in current NVIDIA Tesla K20 and K40 GPUs, featuring, respectively, 2496 and 2880 CUDA cores and 5GB and 12GB of memory. Therefore, using the GRID K2 device for providing acceleration to scientific applications instead of providing desktop virtualization would deliver a significantly lower performance than current mainstream GPUs used in HPC servers, such as the K20 or K40 GPUs.

  3. 3.

    KVMGT is the open source implementation of Intel’s GPU Virtualization Technology for KVM VMs.

  4. 4.

    In order to interact with the virtualized GPU, some kind of interface is required so that the application can access the virtual device. This interface could be placed at different levels. For instance, it could be placed at the driver level. However, GPU drivers usually employ low-level protocols which, additionally, are proprietary and strictly closed by GPU vendors. Therefore, a higher-level boundary must be used. This is why the GPU API is commonly selected for placing the virtualization boundary, given that these APIs are public.

  5. 5.

    The native domain refers to a scenario where virtualization is not used, that is, a real computer is leveraged. On the other hand, the virtual domain refers to the virtual machine.

  6. 6.

    In this work, we will refer to main memory as host memory or just host, while GPU memory will be referred as device memory or simply device, according to the well-established usage defined in the CUDA ecosystem.


  1. 1.

    Kernel-Based Virtual Machine, KVM. http://www.linux-kvm.org (2015). Accessed 19 Oct 2015

  2. 2.

    Xen Project. http://www.xenproject.org/ (2015). Accessed 19 Oct 2015

  3. 3.

    VMware Virtualization. http://www.vmware.com/ (2015). Accessed 19 Oct 2015

  4. 4.

    Oracle VM VirtualBox. http://www.virtualbox.org/ (2015). Accessed 19 Oct 2015

  5. 5.

    Semnanian, A., Pham, J., Englert, B., Wu, X.: Virtualization technology and its impact on computer hardware architecture. In: Proceedings of the Information Technology: New Generations, ITNG, pp. 719–724 (2011)

  6. 6.

    Felter, W., Ferreira, A., Rajamony, R., Rubio, J.: An updated performance comparison of virtual machines and linux containers. In: IBM Research Report (2014)

  7. 7.

    Zhang, J., Lu, X., Arnold, M., Panda, D.: MVAPICH2 over OpenStack with SR-IOV: an efficient approach to build HPC Clouds. In: Proceedings of the IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid, pp. 71–80 (2015)

  8. 8.

    Wu, H., Diamos, G., Sheard, T., Aref, M., Baxter, S., Garland, M., Yalamanchili, S.: Red Fox: an execution environment for relational query processing on GPUs. In: Proceedings of the International Symposium on Code Generation and Optimization, CGO (2014)

  9. 9.

    Playne, D.P., Hawick, K.A.: Data parallel three-dimensional Cahn-Hilliard field equation simulation on GPUs with CUDA. In: Proceedings of the Parallel and Distributed Processing Techniques and Applications, PDPTA, pp. 104–110 (2009)

  10. 10.

    Yamazaki, I., Dong, T., Solcà, R., Tomov, S., Dongarra, J., Schulthess, T.: Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems. Concurr. Comput.: Pract. Exp. 26(16), 2652–2666 (2014)

    Article  Google Scholar 

  11. 11.

    Luo, D.Y.: Canny edge detection on NVIDIA CUDA. In: Proceedings of the Computer Vision and Pattern Recognition Workshops, CVPR Workshops, pp. 1–8 (2008)

  12. 12.

    Surkov, V.: Parallel option pricing with Fourier space time-stepping method on graphics processing units. Parallel Comput. 36(7), 372–380 (2010)

    MathSciNet  MATH  Article  Google Scholar 

  13. 13.

    Agarwal, P.K., Hampton, S., Poznanovic, J., Ramanthan, A., Alam, S.R., Crozier, P.S.: Performance modeling of microsecond scale biological molecular dynamics simulations on heterogeneous architectures. Concurr. Comput.: Pract. Exp. 25(10), 1356–1375 (2013)

    Article  Google Scholar 

  14. 14.

    Luo, G.H., Huang, S.K., Chang, Y.S., Yuan, S.M.: A parallel bees algorithm implementation on GPU. J. Syst. Arch. 60(3), 271–279 (2014)

    Article  Google Scholar 

  15. 15.

    NVIDIA GRID Technology. http://www.nvidia.com/object/grid-technology.html (2015). Accessed 19 Oct 2015

  16. 16.

    Song, J., et al: KVMGT: a full GPU virtualization solution. In: KVM Forum (2014)

  17. 17.

    AMD Multiuser GPU, Hardware-Based Virtualized Solution. http://www.amd.com/Documents/Multiuser-GPU-Datasheet.pdf (2015). Accessed 19 Oct 2015

  18. 18.

    V-GPU: GPU Virtualization. https://github.com/zillians/platform_manifest_vgpu (2015). Accessed 19 Oct 2015

  19. 19.

    Oikawa, M., Kawai, A., Nomura, K., Yasuoka, K., Yoshikawa, K., Narumi, T.: DS-CUDA: a middleware to use many GPUs in the cloud environment. In: Proceedings of the SC Companion: High Performance Computing, Networking Storage and Analysis, SCC, pp. 1207–1214 (2012)

  20. 20.

    Reaño, C., Silla, F., Shainer, G., Schultz, S.: Local and remote GPUs perform similar with EDR 100G InfiniBand. In: Proceedings of the Industrial Track of the 16th International Middleware Conference, ACM, Middleware Industry ’15, pp. 4:1–4:7 (2015)

  21. 21.

    Reaño, C., Silla, F., Duato, J.: Enhancing the rCUDA remote GPU virtualization framework: from a prototype to a production solution. In: Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, IEEE Press, CCGrid ’17, pp. 695–698 (2017)

  22. 22.

    Shi, L., Chen, H., Sun, J.: vCUDA: GPU accelerated high performance computing in virtual machines. In: Proceedings of the IEEE Parallel and Distributed Processing Symposium, IPDPS, pp. 1–11 (2009)

  23. 23.

    Liang, T.Y., Chang, Y.W.: GridCuda: A grid-enabled CUDA programming toolkit. In: Proceedings of the IEEE Advanced Information Networking and Applications Workshops, WAINA, pp. 141–146 (2011)

  24. 24.

    Giunta, G., Montella, R., Agrillo, G., Coviello, G.: A GPGPU transparent virtualization component for high performance computing clouds. In: Proceedings of the Euro-Par Parallel Processing, Euro-Par, pp. 379–391 (2010)

  25. 25.

    Gupta, V., Gavrilovska, A., Schwan, K., Kharche, H., Tolia, N., Talwar, V., Ranganathan, P. GViM: GPU-accelerated virtual machines. In: Proceedings of the ACM Workshop on System-level Virtualization for High Performance Computing, HPCVirt, pp. 17–24 (2009)

  26. 26.

    Merritt, A.M., Gupta, V., Verma, A., Gavrilovska, A., Schwan, K.: Shadowfax: scaling in heterogeneous cluster systems via GPGPU assemblies. In: Proceedings of the International Workshop on Virtualization Technologies in Distributed Computing, VTDC, pp. 3–10 (2011)

  27. 27.

    Shadowfax II—Scalable Implementation of GPGPU Assemblies. http://keeneland.gatech.edu/software/keeneland/kidron (2015). Accessed 19 Oct 2015

  28. 28.

    Walters, J.P., Younge, A.J., Kang, D.I., Yao, K.T., Kang, M., Crago, S.P., Fox, G.C.: GPU-passthrough performance: a comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL applications. In: Proceedings of the IEEE International Conference on Cloud Computing, CLOUD (2014)

  29. 29.

    Yang, C.T., Wang, H.Y., Ou, W.S., Liu, Y.T., Hsu, C.H.: On implementation of GPU virtualization using PCI pass-through. In: Proceedings of the IEEE Cloud Computing Technology and Science, CloudCom, pp. 711–716 (2012)

  30. 30.

    Jo, H., Jeong, J., Lee, M., Choi, D.H.: Exploiting GPUs in virtual machine for BioCloud. BioMed Res. Int. 2013, 11 (2013). https://doi.org/10.1155/2013/939460

    Article  Google Scholar 

  31. 31.

    NVIDIA: CUDA C Programming Guide 7.5. http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (2015a). Accessed 19 Oct 2015

  32. 32.

    NVIDIA: CUDA Runtime API Reference Manual 7.5. http://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf (2015b). Accessed 19 Oct 2015

  33. 33.

    NVIDIA: The NVIDIA GPU Computing SDK Version 5.5 (2013)

  34. 34.

    iperf3: A TCP, UDP, and SCTP Network Bandwidth Measurement Tool. https://github.com/esnet/iperf (2015). Accessed 19 Oct 2015

  35. 35.

    Reaño, C., Silla, F.: Reducing the performance gap of remote GPU virtualization with InfiniBand Connect-IB. In: 2016 IEEE Symposium on Computers and Communication (ISCC), pp. 920–925 (2016)

  36. 36.

    Mellanox: Connect-IB Single and Dual QSFP+ Port PCI Express Gen3 x16 Adapter Card User Manual. http://www.mellanox.com/related-docs/user_manuals/Connect-IB_Single_and_Dual_QSFP+_Port_PCI_Express_Gen3_%20x16_Adapter_Card_User_Manual.pdf (2014a). Accessed 19 Oct 2015

  37. 37.

    Mellanox: ConnectX-3 VPI Single and Dual QSFP+ Port Adapter Card User Manual 1.7. http://www.mellanox.com/related-docs/user_manuals/ConnectX-3_VPI_Single_and_Dual_QSFP_Port_Adapter_Card_User_Manual.pdf (2013). Accessed 19 Oct 2015

  38. 38.

    Pérez, F., Reaño, C., Silla, F.: Providing CUDA acceleration to KVM virtual machines in InfiniBand clusters with rCUDA. In: 16th International Conference Distributed Applications and Interoperable Systems (DAIS), pp. 82–95. Springer International Publishing (2016)

  39. 39.

    Mellanox: Mellanox OFED for Linux User Manual. http://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_User_Manual_v2.3-1.0.1.pdf (2014b). Accessed 19 Oct 2015

  40. 40.

    Reaño, C., Mayo, R., Quintana-Ortí, E., Silla, F., Duato, J., Peña, A.: Influence of InfiniBand FDR on the performance of remote GPU virtualization. In: Proceedings of the IEEE International Conference on Cluster Computing, CLUSTER, pp. 1–8 (2013)

  41. 41.

    Laboratories, S.N.: LAMMPS Molecular Dynamics Simulator. http://lammps.sandia.gov/ (2013). Accessed 19 Oct 2015

  42. 42.

    Liu, Y., Schmidt, B., Liu, W., Maskell, D.L.: CUDA-MEME: accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units. Pattern Recognit. Lett. 31(14), 2170–2177 (2010)

    Article  Google Scholar 

  43. 43.

    Liu, Y., Wirawan, A., Schmidt, B.: CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinformat. 14(1), 1–10 (2013)

    Article  Google Scholar 

  44. 44.

    Vouzis, P.D., Sahinidis, N.V.: GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics 27(2), 182–188 (2011)

    Article  Google Scholar 

  45. 45.

    NVIDIA: NVIDIA Popular GPU-Accelerated Applications Catalog. http://www.nvidia.com/content/gpu-applications/PDF/GPU-apps-catalog-mar2015.pdf (2015c). Accessed 19 Oct 2015

  46. 46.

    Liu, Y. CUDA-MEME. https://sites.google.com/site/yongchaosoftware/mcuda-meme (2014). Accessed 19 Oct 2015

  47. 47.

    Polak, A.: Counting triangles in large graphs on GPU. In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 740–746 (2016)

  48. 48.

    Prades, J., Silla, F.: Turning GPUs into floating devices over the cluster: the Beauty of GPU Migration. In: Proceedings of the 6th Workshop on Heterogeneous and Unconventional Cluster Architectures and Applications (HUCAA) (2017)

Download references


This work was funded by the Generalitat Valenciana under Grant PROMETEO/2017/077. Authors are also grateful for the generous support provided by Mellanox Technologies Inc.

Author information



Corresponding author

Correspondence to Javier Prades.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Prades, J., Reaño, C. & Silla, F. On the effect of using rCUDA to provide CUDA acceleration to Xen virtual machines. Cluster Comput 22, 185–204 (2019). https://doi.org/10.1007/s10586-018-2845-0

Download citation


  • Virtualization
  • CUDA
  • Xen
  • InfiniBand
  • HPC
  • Performance