Abstract
Understanding how failures behave in large-scale systems is essential for designing techniques to tolerate them. Scientists or system administrators can use reliability knowledge about resources in several ways: (1) to improve the quality of service of the machine; (2) to reduce performance loss due to unexpected failures, through either reliability-aware scheduling or reliability-aware checkpointing; and (3) to design more resilient applications, programming models, or machines in the future. This chapter gives an overview of failures observed in real large-scale systems and their characteristics, with an emphasis on modeling, detection, and prediction.
Acknowledgments
Ana Gainaru’s work is supported by the Blue Waters sustained-petascale computing project, funded by the National Science Foundation (award number OCI 07-25070) and the state of Illinois. This chapter builds on material from publications co-authored with numerous colleagues. The authors would like to thank Leonardo Bautista-Gomez, Mohamed Slim Bouguerra, Jeremy Enos, Joshi Fullop, Eric Heien, Derrick Kondo, and William Kramer.
Copyright information
© 2015 Springer International Publishing Switzerland (outside the USA)
About this chapter
Cite this chapter
Gainaru, A., Cappello, F. (2015). Errors and Faults. In: Herault, T., Robert, Y. (eds) Fault-Tolerance Techniques for High-Performance Computing. Computer Communications and Networks. Springer, Cham. https://doi.org/10.1007/978-3-319-20943-2_2
DOI: https://doi.org/10.1007/978-3-319-20943-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20942-5
Online ISBN: 978-3-319-20943-2
eBook Packages: Computer Science (R0)