Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization

Chen, Cheng; Du, Yunfei; Zuo, Ke; Fang, Jianbin; Yang, Canqun

doi:10.1007/s11227-017-2116-5

Published: 20 August 2017

The Journal of Supercomputing, Volume 75, pages 4226–4247 (2019)

Cheng Chen¹, Yunfei Du², Ke Zuo¹, Jianbin Fang¹ & Canqun Yang¹

314 Accesses
7 Citations

Abstract

Massively heterogeneous architectures are widely adopted in the design of modern petascale and future exascale systems. Because such heterogeneous clusters involve an ever-increasing number of components, fault tolerance is essential to the reliability of the whole system. However, existing programming models for heterogeneous clusters (e.g., MPI+X) focus on performance rather than reliability. In this paper, we design and implement a fault tolerance framework, based on the in-memory checkpointing technique, for hybrid programs that leverage heterogeneous hardware architectures. We provide new capabilities for programming heterogeneous applications that greatly simplify the implementation of application-level checkpointing. We also optimize checkpoint saving and loading to increase scalability. We validate the effectiveness of the framework with various benchmarks and real-world applications on the Tianhe-2 supercomputer. Our experimental results show that the framework improves the resilience of long-running applications and reduces checkpointing overhead.
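
To make the idea of application-level in-memory checkpointing concrete, the following is a minimal single-process sketch. It is not the framework described in the paper: the class `InMemoryCheckpoint`, the function `run`, the buddy placement (each rank keeps a second copy on its right-hand neighbour) and the coordinated rollback are all illustrative assumptions, and "ranks" are simulated with plain Python lists rather than MPI processes or accelerators.

```python
import copy


class InMemoryCheckpoint:
    """Hypothetical store: each rank keeps its latest checkpoint in memory
    and places a second copy in the memory of a buddy rank."""

    def __init__(self, num_ranks):
        self.num_ranks = num_ranks
        # local[r]: rank r's own checkpoint.
        # remote[r]: the copy that rank r holds for rank (r - 1) % num_ranks.
        self.local = [None] * num_ranks
        self.remote = [None] * num_ranks

    def buddy_of(self, rank):
        # Assumed placement policy: the right-hand neighbour is the buddy.
        return (rank + 1) % self.num_ranks

    def save(self, rank, state):
        snapshot = copy.deepcopy(state)
        self.local[rank] = snapshot
        self.remote[self.buddy_of(rank)] = copy.deepcopy(snapshot)

    def fail(self, rank):
        # Simulate losing a node: everything held in its memory is gone.
        self.local[rank] = None
        self.remote[rank] = None

    def load(self, rank):
        # Prefer the local copy; fall back to the copy held by the buddy.
        if self.local[rank] is not None:
            return copy.deepcopy(self.local[rank])
        saved = self.remote[self.buddy_of(rank)]
        if saved is None:
            raise RuntimeError(f"no surviving checkpoint for rank {rank}")
        return copy.deepcopy(saved)


def run(num_ranks=4, steps=10, interval=3, fail_at=(7, 2)):
    """Iterate a toy computation, checkpointing every `interval` steps.

    fail_at = (step, rank) injects one simulated failure; all ranks then
    roll back to the last checkpoint (coordinated restart).
    """
    ckpt = InMemoryCheckpoint(num_ranks)
    states = [{"step": 0, "value": float(r)} for r in range(num_ranks)]
    step = 0
    failed_once = False
    while step < steps:
        if step % interval == 0:
            for r in range(num_ranks):
                ckpt.save(r, states[r])
        if not failed_once and step == fail_at[0]:
            failed_once = True
            ckpt.fail(fail_at[1])
            states = [ckpt.load(r) for r in range(num_ranks)]
            step = states[0]["step"]  # resume from the checkpointed step
            continue
        for s in states:  # the "computation": one unit of work per step
            s["value"] += 1.0
            s["step"] += 1
        step += 1
    return states
```

Calling `run()` injects a failure of rank 2 at step 7, rolls every rank back to the checkpoint taken at step 6, and finishes with the same state as a fault-free run. Keeping the second copy on a different rank is what lets the failed rank recover without touching disk; a real implementation would also have to handle device memory and the loss of both copies at once.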



Acknowledgements

This work is supported by the National High Technology R&D Program of China (863 Program) under Grant 2015AA01A301 and by the National Natural Science Foundation of China (NSFC) under Grants 61402488 and 61602501.

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, 410073, China
Cheng Chen, Ke Zuo, Jianbin Fang & Canqun Yang
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
Yunfei Du

Corresponding author

Correspondence to Yunfei Du.

About this article

Cite this article

Chen, C., Du, Y., Zuo, K. et al. Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization. J Supercomput 75, 4226–4247 (2019). https://doi.org/10.1007/s11227-017-2116-5

Published: 20 August 2017
Issue Date: 01 August 2019
DOI: https://doi.org/10.1007/s11227-017-2116-5
