
Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems

  • Conference paper
Parallel Computational Fluid Dynamics (ParCFD 2013)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 405)


Abstract

Fault tolerance has become a major concern in exascale computing, especially for large-scale CPU/GPU heterogeneous clusters. The performance/cost benefit of GPU-based systems depends on their ability to provide high reliability, availability, and serviceability. Traditional CPU-based checkpoint technologies have been deployed on GPU platforms, but all of them treat the GPU as a second-class controllable and shared entity. Because existing GPU checkpoint/restart implementations do not support checkpointing the internal GPU state, code running on the GPU (a kernel) cannot be checkpointed and restored the way CPU code can; all checkpoint operations are performed outside the kernel. In this paper, we propose a hybrid checkpoint technology, HKC (Hybrid Kernel Checkpoint). HKC combines PTX stub injection with a dynamic library hijacking mechanism to save and restore the internal state of a GPU kernel. Our evaluation shows that HKC increases the reliability of CPU/GPU hybrid systems at a reasonable cost and is more resilient than other checkpoint schemes.



Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shi, L., Chen, H., Li, T. (2014). Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems. In: Li, K., Xiao, Z., Wang, Y., Du, J., Li, K. (eds) Parallel Computational Fluid Dynamics. ParCFD 2013. Communications in Computer and Information Science, vol 405. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53962-6_42

  • DOI: https://doi.org/10.1007/978-3-642-53962-6_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-53961-9

  • Online ISBN: 978-3-642-53962-6

  • eBook Packages: Computer Science, Computer Science (R0)
