A Lightweight Approach to GPU Resilience

  • Max Baird
  • Christian Fensch
  • Sven-Bodo Scholz
  • Artjoms Šinkarovs
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)


Resilience for HPC applications is typically implemented as a CPU-based rollback-recovery technique. In this context, long-running accelerator computations on GPUs pose a major challenge, as these devices usually offer no means of interrupting a running kernel. This paper proposes a solution to this problem: a novel approach that rewrites GPU kernels so that a soft interrupt of their execution becomes possible. Our approach is based on Nvidia’s Compute Unified Device Architecture (CUDA) and takes advantage of CUDA’s execution model, which partitions threads into blocks. In essence, we rewrite the kernel so that each block determines whether it should continue execution or return control to the CPU. This allows us to interrupt kernels prematurely.
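The block-level check described above can be illustrated with a minimal CUDA sketch. The abstract does not say how a block decides whether to continue, so everything here is an assumption made for illustration: a stop flag held in host-mapped pinned memory, a per-block completion array used to resume later, and the names `scale_interruptible`, `launch_and_count` and `interrupt_then_resume`. It is not the authors' implementation.

```cuda
#include <cuda_runtime.h>

// Hypothetical rewritten kernel: before doing any work, each block decides
// whether to run. It exits early if it already finished in a previous launch
// or if the host has raised the stop flag (assumed mechanism, not the paper's).
__global__ void scale_interruptible(float *data, int n,
                                    volatile const int *stop_flag,
                                    int *block_done)
{
    __shared__ int skip;
    if (threadIdx.x == 0) {
        skip = block_done[blockIdx.x] || *stop_flag;
    }
    __syncthreads();
    if (skip) return;  // uniform for the whole block, so no thread is left waiting

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;

    __syncthreads();   // all threads of the block have written their element
    if (threadIdx.x == 0) block_done[blockIdx.x] = 1;
}

// Launch once, wait for the kernel, and count the blocks marked complete.
static int launch_and_count(float *data, int n, int *block_done,
                            const int *stop_dev, int blocks, int threads)
{
    scale_interruptible<<<blocks, threads>>>(data, n, stop_dev, block_done);
    cudaDeviceSynchronize();
    int done = 0;
    for (int b = 0; b < blocks; ++b) done += block_done[b];
    return done;
}

// Runs the kernel with the flag raised, then resumes it with the flag cleared.
// Returns true if no block ran while stopped and every block ran exactly once.
bool interrupt_then_resume()
{
    const int threads = 128, blocks = 32, n = threads * blocks;
    float *data = nullptr;
    int *block_done = nullptr, *stop_host = nullptr, *stop_dev = nullptr;

    if (cudaMallocManaged((void **)&data, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged((void **)&block_done, blocks * sizeof(int)) != cudaSuccess) return false;
    // Mapped pinned memory lets the host write the flag while the GPU reads it.
    if (cudaHostAlloc((void **)&stop_host, sizeof(int), cudaHostAllocMapped) != cudaSuccess) return false;
    if (cudaHostGetDevicePointer((void **)&stop_dev, stop_host, 0) != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) data[i] = 1.0f;
    for (int b = 0; b < blocks; ++b) block_done[b] = 0;

    *stop_host = 1;  // interrupt requested: every block should return at once
    int done_while_stopped = launch_and_count(data, n, block_done, stop_dev, blocks, threads);

    *stop_host = 0;  // resume: the unfinished blocks now do their work
    int done_after_resume = launch_and_count(data, n, block_done, stop_dev, blocks, threads);

    // A further launch must be a no-op because every block is already marked done.
    int done_after_repeat = launch_and_count(data, n, block_done, stop_dev, blocks, threads);

    bool ok = done_while_stopped == 0 && done_after_resume == blocks &&
              done_after_repeat == blocks;
    for (int i = 0; i < n; ++i) ok = ok && (data[i] == 2.0f);

    cudaFree(data);
    cudaFree(block_done);
    cudaFreeHost(stop_host);
    return ok;
}
```

Calling `interrupt_then_resume()` from a `main` function in a file compiled with `nvcc` exercises both phases. To keep the outcome deterministic, the sketch raises the flag before the launch. In a real setting the host would set it while the kernel is running, so blocks already executing would finish and only blocks not yet scheduled would return early. How many blocks complete then depends on timing.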


Keywords: HPC, GPU, Resilience



This work was supported in part by grants EP/N028201/1 and EP/L00058X/1 from the Engineering and Physical Sciences Research Council (EPSRC) as well as the James Watt Scholarship of Heriot-Watt University.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Max Baird (1, corresponding author)
  • Christian Fensch (1)
  • Sven-Bodo Scholz (1)
  • Artjoms Šinkarovs (1)
  1. Heriot-Watt University, Edinburgh, Scotland
