
Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization

The Journal of Supercomputing

Abstract

Massively heterogeneous architectures are widely adopted in the design of modern peta-scale and future exa-scale systems. Because such heterogeneous clusters involve an ever-increasing number of components, fault tolerance is essential to the reliability of the whole system. However, existing programming models for heterogeneous clusters (e.g., MPI+X) focus on performance rather than reliability. In this paper, we design and implement a fault tolerance framework, based on the in-memory checkpointing technique, for hybrid programs that leverage heterogeneous hardware architectures. We provide new capabilities for programming heterogeneous applications that greatly simplify the implementation of application-level checkpointing. We also optimize checkpoint saving and loading to increase scalability. We validate the effectiveness of the framework with various benchmarks and real-world applications on the Tianhe-2 supercomputer. Our experimental results show that the framework improves the resilience of long-running applications and reduces checkpointing overhead.
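The general idea of application-level in-memory checkpointing can be illustrated with a minimal single-process sketch. It is not the framework described in this paper: the names `InMemoryCheckpoint`, `run`, `interval`, and `fail_at` are hypothetical, the failure is injected artificially, and the snapshot is kept in the same process, whereas a real in-memory scheme would typically hold a copy in the memory of another node so that it survives a node failure.

```python
import copy


class InMemoryCheckpoint:
    """Holds the most recent snapshot of the application state in RAM.

    Hypothetical helper for illustration only; no disk I/O is involved.
    """

    def __init__(self):
        self._snapshot = None

    def save(self, state):
        # Deep copy so later updates to `state` do not alter the snapshot.
        self._snapshot = copy.deepcopy(state)

    def load(self):
        if self._snapshot is None:
            raise RuntimeError("no checkpoint available")
        # Return a copy so the snapshot can be reused after another failure.
        return copy.deepcopy(self._snapshot)


def run(total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    If `fail_at` is given, one failure is injected the first time that step
    is reached, and the computation rolls back to the last checkpoint.
    Returns (result, number_of_restarts).
    """
    ckpt = InMemoryCheckpoint()
    state = {"step": 0, "acc": 0}
    ckpt.save(state)  # initial checkpoint so a restart is always possible
    failed = False
    restarts = 0

    while state["step"] < total_steps:
        if fail_at is not None and not failed and state["step"] == fail_at:
            failed = True
            restarts += 1
            state = ckpt.load()  # roll back to the last saved state
            continue

        # One unit of "application work".
        state["acc"] += state["step"]
        state["step"] += 1

        if state["step"] % interval == 0:
            ckpt.save(state)

    return state["acc"], restarts


if __name__ == "__main__":
    print(run(10, 3))             # failure-free run
    print(run(10, 3, fail_at=7))  # one injected failure, same final result
```

In this sketch the run with an injected failure redoes only the work performed since the last checkpoint and then produces the same result as the failure-free run. The checkpoint interval therefore trades the cost of saving against the amount of work lost per failure.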




Acknowledgements

This work is supported by the National High Technology R&D Program of China (863 Program) under grant 2015AA01A301 and by the National Natural Science Foundation of China (NSFC) under grants 61402488 and 61602501.

Author information


Corresponding author

Correspondence to Yunfei Du.


About this article


Cite this article

Chen, C., Du, Y., Zuo, K. et al. Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization. J Supercomput 75, 4226–4247 (2019). https://doi.org/10.1007/s11227-017-2116-5


  • DOI: https://doi.org/10.1007/s11227-017-2116-5
