Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems

Shi, Lin; Chen, Hao; Li, Ting

doi:10.1007/978-3-642-53962-6_42


Part of the book series: Communications in Computer and Information Science (CCIS, volume 405)

Included in the following conference series:

International Conference on Parallel Computing in Fluid Dynamics


Abstract

Fault tolerance has become a major concern in exascale computing, especially for large-scale CPU/GPU heterogeneous clusters. The performance/cost benefit of GPU-based systems depends on their ability to provide high reliability, availability, and serviceability. Traditional CPU-based checkpoint technologies have been deployed on GPU platforms, but all of them treat the GPU as a second-class controllable and shared entity. Because existing GPU checkpoint/restart implementations do not support checkpointing the internal GPU state, code running on the GPU (a kernel) cannot be checkpointed and restored the way CPU code can; all checkpoint operations are performed outside the kernel. In this paper, we propose a hybrid checkpoint technology, HKC (Hybrid Kernel Checkpoint). HKC combines PTX stub injection with a dynamic library hijacking mechanism to save and restore the internal state of a GPU kernel. Our evaluation shows that HKC increases the reliability of CPU/GPU hybrid systems at a reasonable cost and is more resilient than other checkpoint schemes.
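The difference between checkpointing outside a kernel and checkpointing its internal state can be illustrated with a small CPU-only sketch. This is not the authors' implementation and involves no PTX or CUDA calls: the function names, the `checkpoint_every` interval, and the in-memory `store` are all hypothetical, chosen only to show a long-running loop that saves its own progress at stub points so that a restart resumes mid-computation instead of rerunning from the beginning.

```python
import pickle


def run_kernel(data, state=None, checkpoint_every=4, fail_at=None, store=None):
    """Sum the squares of `data`, saving internal state at stub points.

    Conceptual stand-in for a GPU kernel; everything here is an assumption
    made for illustration, not part of HKC itself.
    """
    # Internal "kernel" state: next index to process and the partial result.
    state = dict(state) if state else {"i": 0, "acc": 0}
    store = store if store is not None else {}
    while state["i"] < len(data):
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated fault inside the kernel")
        state["acc"] += data[state["i"]] ** 2
        state["i"] += 1
        # Hypothetical stub point: serialize the internal state periodically.
        if state["i"] % checkpoint_every == 0:
            store["ckpt"] = pickle.dumps(state)
    return state["acc"]


def run_with_restart(data, fail_at):
    """Run once with an injected fault, then resume from the last checkpoint."""
    store = {}
    try:
        return run_kernel(data, fail_at=fail_at, store=store)
    except RuntimeError:
        # Restore the saved internal state if one exists; otherwise start over.
        saved = pickle.loads(store["ckpt"]) if "ckpt" in store else None
        return run_kernel(data, state=saved, store=store)


if __name__ == "__main__":
    print(run_with_restart(list(range(10)), fail_at=6))
```

With a fault injected at index 6, the second run resumes from the state saved after index 3 rather than from index 0; a scheme that can only checkpoint between kernel launches would have to repeat the whole loop.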



Author information

Authors and Affiliations

School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, China
Lin Shi & Ting Li
School of Computer and Communication, Hunan University, Changsha, China
Lin Shi & Hao Chen


Editor information

Editors and Affiliations

College of Information Science and Engineering, Hunan University, 410082, Changsha, China
Kenli Li
College of Information Science and Engineering, Hunan University, #2, South Lushan Road, Yuelu District, 410082, Changsha, China
Zheng Xiao & Jiayi Du
College of Information Science and Engineering, Northeastern University, 110004, Shenyang, China
Yan Wang
Hunan University, State University of New York at New Paltz, 12561, New Paltz, NY, USA
Keqin Li


About this paper

Cite this paper

Shi, L., Chen, H., Li, T. (2014). Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems. In: Li, K., Xiao, Z., Wang, Y., Du, J., Li, K. (eds) Parallel Computational Fluid Dynamics. ParCFD 2013. Communications in Computer and Information Science, vol 405. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53962-6_42


DOI: https://doi.org/10.1007/978-3-642-53962-6_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53961-9
Online ISBN: 978-3-642-53962-6
eBook Packages: Computer Science (R0)
