A Lightweight Approach to GPU Resilience

  • Max Baird
  • Christian Fensch
  • Sven-Bodo Scholz
  • Artjoms Šinkarovs
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)

Abstract

Resilience for HPC applications is typically implemented as a CPU-based rollback-recovery technique. In this context, long-running accelerator computations on GPUs pose a major challenge, as these devices usually do not offer any means of interrupting a running kernel. This paper proposes a solution to this problem: a novel approach that rewrites GPU kernels so that a soft interrupt of their execution becomes possible. Our approach is based on Nvidia's Compute Unified Device Architecture (CUDA) and takes advantage of CUDA's execution model of partitioning threads into blocks. In essence, we rewrite the kernel so that each block determines whether it should continue execution or return control to the CPU. This allows us to interrupt kernels prematurely.
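
To make the block-level interrupt mechanism described above concrete, the following minimal CUDA sketch shows one way such a rewritten kernel could look. The flag name (d_interrupt_flag), kernel name (interruptible_kernel) and the host-side control stream are hypothetical illustrations, not the authors' actual implementation: each block checks a device-side flag on entry and returns immediately if the host has requested an interrupt.

    // Minimal sketch of a block-level soft interrupt; all names are illustrative.
    #include <cuda_runtime.h>

    __device__ int d_interrupt_flag = 0;       // host sets this to request a stop

    __global__ void interruptible_kernel(float *data, int n)
    {
        // Each block first checks whether the host has requested an interrupt.
        // Blocks scheduled after the flag is raised return immediately, so the
        // grid drains quickly and control returns to the CPU.
        if (d_interrupt_flag != 0)
            return;

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                   // stand-in for the real per-element work
    }

    int main()
    {
        const int n = 1 << 24;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaStream_t compute, control;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&control);

        interruptible_kernel<<<(n + 255) / 256, 256, 0, compute>>>(d_data, n);

        // On a separate stream, the host asynchronously raises the flag while
        // the kernel is still executing on the compute stream.
        int stop = 1;
        cudaMemcpyToSymbolAsync(d_interrupt_flag, &stop, sizeof(int), 0,
                                cudaMemcpyHostToDevice, control);

        cudaStreamSynchronize(compute);
        cudaFree(d_data);
        cudaStreamDestroy(compute);
        cudaStreamDestroy(control);
        return 0;
    }

Because blocks that are already resident on the GPU run to completion, the interrupt is "soft": the kernel stops at block granularity rather than instantly.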

Keywords

HPC · GPU · Resilience

Notes

Acknowledgements

This work was supported in part by grants EP/N028201/1 and EP/L00058X/1 from the Engineering and Physical Sciences Research Council (EPSRC) as well as the James Watt Scholarship of Heriot-Watt University.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Max Baird (1)
  • Christian Fensch (1)
  • Sven-Bodo Scholz (1)
  • Artjoms Šinkarovs (1)

  1. Heriot-Watt University, Edinburgh, Scotland