Abstract
This chapter presents a case study on characterizing the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed through a joint analysis of several data sources, including workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates the data preprocessing and the computation of metrics measuring the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories.
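As a concrete illustration of the kind of metric involved, the following is a minimal sketch, with invented timestamps and not reflecting LogDiver's actual implementation, of computing an application's mean time between interruptions (MTBI) from the timestamps of its interruptions:

```python
from datetime import datetime

# Hypothetical interruption timestamps for one application; the values
# below are invented purely for illustration.
interruptions = [
    datetime(2013, 3, 1, 0, 0),
    datetime(2013, 3, 2, 10, 0),
    datetime(2013, 3, 3, 20, 0),
    datetime(2013, 3, 5, 6, 0),
]

# MTBI = total observed span divided by the number of intervals between
# consecutive interruptions.
span = interruptions[-1] - interruptions[0]
mtbi_hours = span.total_seconds() / 3600 / (len(interruptions) - 1)
print(f"MTBI = {mtbi_hours:.1f} h")  # 34.0 h for these sample timestamps
```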
Notes
- 1.
- 2.
An example of an efficient checkpoint/restart at full scale is that of the rhmd application, which shows an MTBI of 34 h when running on 20,000 nodes.
- 3.
Given a monomial equation \(y=ax^k\), taking the logarithm of both sides (with any base) yields \(\log y = k\,\log x + \log a\). Setting \(X = \log x\) and \(Y = \log y\) (i.e., moving to a log–log graph) yields the linear equation \(Y = mX + b\), with slope \(m = k\) and intercept \(b = \log a\); a power law therefore appears as a straight line on a log–log plot.
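To make the note concrete, here is a minimal sketch in Python (synthetic data; variable names are illustrative) showing how the slope and intercept of a straight-line fit in log–log coordinates recover \(k\) and \(a\):

```python
import numpy as np

# Assumed power-law parameters for the example: y = a * x**k.
a_true, k_true = 3.0, -0.5
x = np.logspace(0, 3, 50)   # sample points spanning three decades
y = a_true * x**k_true

# In log-log coordinates the model is Y = k*X + log(a): a line with
# slope k and intercept log(a), so an ordinary linear fit recovers both.
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

print(f"estimated k = {slope:.3f}")              # ~ -0.5
print(f"estimated a = {np.exp(intercept):.3f}")  # ~ 3.0
```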
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Di Martino, C., Kalbarczyk, Z., Iyer, R. (2016). Measuring the Resiliency of Extreme-Scale Computing Environments. In: Fiondella, L., Puliafito, A. (eds) Principles of Performance and Reliability Modeling and Evaluation. Springer Series in Reliability Engineering. Springer, Cham. https://doi.org/10.1007/978-3-319-30599-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30597-4
Online ISBN: 978-3-319-30599-8
eBook Packages: Engineering, Engineering (R0)