Abstract
This chapter presents a case study on characterizing the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed through a joint analysis of several data sources, including workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates the data preprocessing and the computation of metrics measuring the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories.
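As a concrete illustration of the kind of metric involved, the following is a minimal sketch, with invented timestamps and not reflecting LogDiver's actual implementation, of computing an application's mean time between interruptions (MTBI) from the timestamps of its interruptions:

```python
from datetime import datetime

# Hypothetical interruption timestamps for one application; the values
# below are invented purely for illustration.
interruptions = [
    datetime(2013, 3, 1, 0, 0),
    datetime(2013, 3, 2, 10, 0),
    datetime(2013, 3, 3, 20, 0),
    datetime(2013, 3, 5, 6, 0),
]

# MTBI = total observed span divided by the number of intervals between
# consecutive interruptions.
span = interruptions[-1] - interruptions[0]
mtbi_hours = span.total_seconds() / 3600 / (len(interruptions) - 1)
print(f"MTBI = {mtbi_hours:.1f} h")  # 34.0 h for these sample timestamps
```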
Notes
- 1.
- 2.
An example of an efficient checkpoint/restart at full scale is that of the rhmd application, which shows an MTBI of 34 h when running on 20,000 nodes.
- 3.
Given a monomial equation \(y=ax^k\), taking the logarithm of both sides (with any base) yields \(\log y = k\,\log x + \log a\). Setting \(X = \log x\) and \(Y = \log y\) (i.e., moving to a log–log graph) yields the linear equation \(Y = mX + b\), with slope \(m = k\) and intercept \(b = \log a\); a power law therefore appears as a straight line on a log–log plot.
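To make the note concrete, here is a minimal sketch in Python (synthetic data; variable names are illustrative) showing how the slope and intercept of a straight-line fit in log–log coordinates recover \(k\) and \(a\):

```python
import numpy as np

# Assumed power-law parameters for the example: y = a * x**k.
a_true, k_true = 3.0, -0.5
x = np.logspace(0, 3, 50)   # sample points spanning three decades
y = a_true * x**k_true

# In log-log coordinates the model is Y = k*X + log(a): a line with
# slope k and intercept log(a), so an ordinary linear fit recovers both.
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

print(f"estimated k = {slope:.3f}")              # ~ -0.5
print(f"estimated a = {np.exp(intercept):.3f}")  # ~ 3.0
```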
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Di Martino, C., Kalbarczyk, Z., Iyer, R. (2016). Measuring the Resiliency of Extreme-Scale Computing Environments. In: Fiondella, L., Puliafito, A. (eds) Principles of Performance and Reliability Modeling and Evaluation. Springer Series in Reliability Engineering. Springer, Cham. https://doi.org/10.1007/978-3-319-30599-8_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30597-4
Online ISBN: 978-3-319-30599-8
eBook Packages: Engineering, Engineering (R0)