FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing

Hu, Wei; Liu, Guang-Ming; Jiang, Yan-Huang

doi:10.1631/FITEE.1601450

Published: 28 November 2018

Volume 19, pages 1273–1290, (2018)

Frontiers of Information Technology & Electronic Engineering

Abstract

As the scale of supercomputers grows rapidly, reliability problems increasingly dominate system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot solve this problem effectively. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), which combines the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve application execution efficiency. The cost model, called the ‘work-most’ (WM) model, makes runtime decisions that adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. We observe, for the first time, a failure locality phenomenon in supercomputers that is similar to program locality. The new proactive fault tolerance mechanism builds on this failure locality: it combines process replication with process prefetching, and significantly reduces the losses caused by failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms, with up to 10% improvement in application efficiency at common failure prediction accuracies, and that it is effective for petascale systems and beyond.
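
The abstract does not give the equations of the WM model, so the following is only a toy sketch of the general idea it describes: choosing, at runtime, the fault tolerance action with the largest expected useful work, given a predicted failure probability. The `Action` fields, the `expected_work` formula, and every number below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A candidate fault tolerance action (all fields are illustrative assumptions)."""
    name: str
    overhead: float         # seconds spent performing the action
    loss_if_failure: float  # seconds of work lost if a failure still occurs


def expected_work(action: Action, window: float, p_fail: float) -> float:
    """Expected useful work (seconds) in a window, under a simple linear assumption."""
    return (window - action.overhead) - p_fail * action.loss_if_failure


def choose_action(actions, window: float, p_fail: float) -> Action:
    """Pick the action that maximizes expected useful work for this window."""
    return max(actions, key=lambda a: expected_work(a, window, p_fail))


# Hypothetical costs: doing nothing is free but risky, replication is costly but safe.
ACTIONS = [
    Action("no_action", overhead=0.0, loss_if_failure=600.0),
    Action("checkpoint", overhead=60.0, loss_if_failure=200.0),
    Action("replicate", overhead=120.0, loss_if_failure=0.0),
]

if __name__ == "__main__":
    for p in (0.05, 0.2, 0.5):
        print(p, choose_action(ACTIONS, window=1000.0, p_fail=p).name)
```

With these made-up costs, the chosen action moves from doing nothing, to checkpointing, to replication as the predicted failure probability rises. That is the kind of adaptive behaviour the abstract attributes to a runtime cost model, although the actual WM model may weigh quite different quantities.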

Author information

Authors and Affiliations

College of Computer, National University of Defense Technology, Changsha, 410073, China
Wei Hu, Guang-Ming Liu & Yan-Huang Jiang
National Supercomputer Center in Tianjin, Tianjin, 300457, China
Wei Hu & Guang-Ming Liu

Corresponding author

Correspondence to Wei Hu.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61272141, 61120106005, and 61303068) and the National High-Tech R&D Program of China (No. 2012AA01A301)

About this article

Cite this article

Hu, W., Liu, GM. & Jiang, YH. FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing. Frontiers Inf Technol Electronic Eng 19, 1273–1290 (2018). https://doi.org/10.1631/FITEE.1601450

Received: 03 August 2016
Revised: 03 March 2017
Accepted: 09 October 2018
Published: 28 November 2018
Issue Date: October 2018
DOI: https://doi.org/10.1631/FITEE.1601450

Key words

CLC number

TP338.6
