Measuring the Resiliency of Extreme-Scale Computing Environments

Chapter in Principles of Performance and Reliability Modeling and Evaluation

Abstract

This chapter presents a case study on how to characterize the resiliency of large-scale computers. The analysis focuses on the failures and errors of Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The characterization is performed through a joint analysis of several data sources, including workload and error/failure logs as well as manual failure reports. We describe LogDiver, a tool that automates the data preprocessing and the computation of metrics that measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. Results include (i) a characterization of the root causes of single-node failures; (ii) a direct assessment of the effectiveness of system-level failover and of memory, processor, network, GPU accelerator, and file system error resiliency; (iii) an analysis of system-wide outages; (iv) an analysis of application resiliency to system-related errors; and (v) insight into the relationship between application scale and resiliency across different error categories.
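
The kind of log–workload matching that such a tool must perform can be illustrated with a short sketch. The following Python fragment is a minimal, hypothetical example (it is not the LogDiver implementation, and all field names are assumptions): it pairs application runs with the system errors that struck their nodes during execution, which is the first step toward per-application impact metrics.

    # Minimal illustrative sketch (not the LogDiver implementation): flag application
    # runs that overlap in time and node placement with logged system errors.
    # All field names below (app, node, start, end, category) are hypothetical.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Run:
        app: str
        node: str
        start: float   # epoch seconds when the run began using this node
        end: float     # epoch seconds when the run released this node

    @dataclass
    class ErrorEvent:
        node: str
        time: float
        category: str  # e.g., "memory", "network", "GPU"

    def affected_runs(runs: List[Run], errors: List[ErrorEvent]) -> List[Tuple[Run, ErrorEvent]]:
        """Pair each run with every error that hit one of its nodes during its lifetime."""
        return [(run, err)
                for run in runs
                for err in errors
                if err.node == run.node and run.start <= err.time <= run.end]

    if __name__ == "__main__":
        runs = [Run(app="wrf", node="nid00042", start=0.0, end=3600.0)]
        errors = [ErrorEvent(node="nid00042", time=1800.0, category="memory")]
        for run, err in affected_runs(runs, errors):
            print(f"{run.app} on {run.node} hit by a {err.category} error at t={err.time}")

In practice one record per (run, node) pair and indexing errors by node would be needed at scale; the sketch only shows the time/placement join on which the impact metrics rest.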

Notes

  1. A detailed description of the comparative analysis between Tables 14 and 15 is reported in [5, 9].

  2. An example of efficient checkpoint/restart at full scale is the rhmd application, which shows an MTBI of 34 h when running on 20,000 nodes.

  3. Given a monomial equation \(y = ax^k\), taking the logarithm of both sides (in any base) yields \(\log y = k \log x + \log a\). Setting \(X = \log x\) and \(Y = \log y\) (i.e., moving to a log–log plot) gives the linear equation \(Y = kX + \log a\), a straight line with slope \(k\) and intercept \(\log a\).
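
The transformation in Note 3 turns a power law \(y = ax^k\) into a straight line, so \(k\) and \(a\) can be estimated with an ordinary linear fit. A short Python sketch on synthetic data (illustrative only, not taken from the chapter):

    # Worked example of Note 3: fit y = a * x**k by linear regression in log-log space.
    # Data are synthetic; numpy is the only dependency.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
    y = 3.0 * x**1.5 * rng.lognormal(mean=0.0, sigma=0.05, size=x.size)  # true a = 3, k = 1.5

    # In log space the model is Y = k*X + log(a): a straight line with slope k.
    X, Y = np.log(x), np.log(y)
    k_hat, log_a_hat = np.polyfit(X, Y, deg=1)

    print(f"estimated k ≈ {k_hat:.3f}, estimated a ≈ {np.exp(log_a_hat):.3f}")

The fitted slope is the exponent \(k\), and exponentiating the fitted intercept recovers \(a\); plotted on log–log axes, power-law data fall on a straight line, which is the visual check the note alludes to.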

References

  1. www.cray.com/Assets/PDF/products/xe/CrayXE6Brochure.pdf

  2. http://www.cray.com/Products/Storage/Sonexion/Specifications.aspx

  3. Karo M, Lagerstrom R, Kohnke M, Albing C (2008) The application level placement scheduler. In: Cray User Group (CUG)

  4. http://www.adaptivecomputing.com/products/hpc-products/moab-hpc-suite-enterprise-edition

  5. Di Martino C, Kramer W, Kalbarczyk Z, Iyer R (2015) Measuring and understanding extreme-scale application resilience: a field study of 5,000,000 HPC application runs. In: 45th annual IEEE/IFIP international conference on dependable systems and networks (DSN), pp 25–36

  6. Advanced Micro Devices, Inc. BIOS and kernel developer's guide for AMD Family 16h

  7. Dell TJ (1997) A white paper on the benefits of chipkill-correct ECC for PC server main memory. IBM Microelectronics Division

  8. Johnsen P, Straka M, Shapiro M, Norton A, Galarneau T (2013) Petascale WRF simulation of Hurricane Sandy: deployment of NCSA's Cray XE6 Blue Waters. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC '13), ACM, New York, NY, USA, pp 63:1–63:7

  9. Di Martino C, Jha S, Kramer W, Kalbarczyk Z, Iyer RK (2015) LogDiver: a tool for measuring resilience of extreme-scale systems and applications. In: Proceedings of the 5th workshop on fault tolerance for HPC at eXtreme scale (FTXS '15), ACM, New York, NY, USA, pp 11–18

  10. Golyandina N, Nekrutkin V, Zhigljavsky A (2001) Analysis of time series structure: SSA and related techniques. Chapman & Hall/CRC

  11. Schroeder B, Gibson G (2010) A large-scale study of failures in high-performance computing systems. IEEE Trans Depend Secure Comput 7(4):337–350

  12. Schroeder B, Gibson GA (2007) Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? In: Proceedings of the 5th USENIX conference on file and storage technologies (FAST '07), USENIX Association, Berkeley, CA, USA

  13. Sridharan V, Stearley J, DeBardeleben N, Blanchard S, Gurumurthi S (2013) Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults. In: Proceedings of SC13: international conference for high performance computing, networking, storage and analysis, ACM, New York, NY, USA, pp 22:1–22:11

  14. Schroeder B, Pinheiro E, Weber W (2009) DRAM errors in the wild: a large-scale field study. SIGMETRICS Perform Eval Rev 37:193–204

  15. http://www.olcf.ornl.gov/titan/, number 2 on top500.org

  16. Sahoo RK, Sivasubramaniam A, Squillante MS, Zhang Y (2004) Failure data analysis of a large-scale heterogeneous server environment. In: Proceedings of the 2004 international conference on dependable systems and networks (DSN '04), pp 772–781

  17. Liang Y, Sivasubramaniam A, Moreira J, Zhang Y, Sahoo R, Jette M (2005) Filtering failure logs for a BlueGene/L prototype. In: Proceedings of the 2005 international conference on dependable systems and networks (DSN '05), pp 476–485

  18. Liang Y, Zhang Y, Jette M, Sivasubramaniam A, Sahoo R (2006) BlueGene/L failure analysis and prediction models. In: Proceedings of the 2006 international conference on dependable systems and networks (DSN 2006), pp 425–434

  19. Oliner A, Stearley J (2007) What supercomputers say: a study of five system logs. In: 37th annual IEEE/IFIP international conference on dependable systems and networks (DSN '07), pp 575–584

  20. Di Martino C, Cinque M, Cotroneo D (2012) Assessing time coalescence techniques for the analysis of supercomputer logs. In: Proceedings of the 42nd annual IEEE/IFIP international conference on dependable systems and networks (DSN), pp 1–12

  21. Pecchia A, Cotroneo D, Kalbarczyk Z, Iyer RK (2011) Improving log-based field failure data analysis of multi-node computing systems. In: Proceedings of the 2011 IEEE/IFIP 41st international conference on dependable systems & networks (DSN '11), IEEE Computer Society, Washington, DC, USA, pp 97–108

  22. Di Martino C (2013) One size does not fit all: clustering supercomputer failures using a multiple time window approach. In: Kunkel J, Ludwig T, Meuer H (eds) International supercomputing conference, vol 7905 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, pp 302–316

  23. Di Martino C, Baccanico F, Fullop J, Kramer W, Kalbarczyk Z, Iyer R (2014) Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. In: Proceedings of the 44th annual IEEE/IFIP international conference on dependable systems and networks (DSN)

  24. Di Martino C, Chen D, Goel G, Ganesan R, Kalbarczyk Z, Iyer R (2014) Analysis and diagnosis of SLA violations in a production SaaS cloud. In: IEEE 25th international symposium on software reliability engineering (ISSRE), pp 178–188

  25. Tiwari D, Gupta S, Gallarno G, Rogers J, Maxwell D (2015) Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge Leadership Computing Facility. In: Proceedings of the international conference for high performance computing, networking, storage and analysis (SC), ACM, p 38

  26. Heien E, Kondo D, Gainaru A, LaPine A, Kramer W, Cappello F (2011) Modeling and tolerating heterogeneous failures in large parallel systems. In: Proceedings of the 2011 international conference for high performance computing, networking, storage and analysis (SC '11), ACM, New York, NY, USA, pp 45:1–45:11

  27. Bruneo D, Longo F, Ghosh R, Scarpa M, Puliafito A, Trivedi K (2015) Analytical modeling of reactive autonomic management techniques in IaaS clouds. In: IEEE 8th international conference on cloud computing (CLOUD), pp 797–804

  28. Di Martino C, Kalbarczyk Z, Iyer RK, Goel G, Sarkar S, Ganesan R (2014) Characterization of operational failures from a business data processing SaaS platform. In: Companion proceedings of the 36th international conference on software engineering (ICSE Companion 2014), ACM, New York, NY, USA, pp 195–204

  29. Chen X, Lu C, Pattabiraman K (2013) Predicting job completion times using system logs in supercomputing clusters. In: 43rd annual IEEE/IFIP conference on dependable systems and networks workshop (DSN-W), pp 1–8

  30. Gainaru A, Cappello F, Kramer W (2012) Taming of the shrew: modeling the normal and faulty behaviour of large-scale HPC systems. In: IEEE 26th international parallel and distributed processing symposium (IPDPS), pp 1168–1179

  31. Gainaru A, Cappello F, Snir M, Kramer W (2012) Fault prediction under the microscope: a closer look into HPC systems. In: International conference for high performance computing, networking, storage and analysis (SC), pp 1–11

  32. Oppenheimer D, Patterson DA (2002) Studying and using failure data from large-scale internet services. In: Proceedings of the 10th workshop on ACM SIGOPS European workshop (EW 10), ACM, New York, NY, USA, pp 255–258

Author information

Correspondence to Catello Di Martino.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Di Martino, C., Kalbarczyk, Z., Iyer, R. (2016). Measuring the Resiliency of Extreme-Scale Computing Environments. In: Fiondella, L., Puliafito, A. (eds) Principles of Performance and Reliability Modeling and Evaluation. Springer Series in Reliability Engineering. Springer, Cham. https://doi.org/10.1007/978-3-319-30599-8_24

  • DOI: https://doi.org/10.1007/978-3-319-30599-8_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-30597-4

  • Online ISBN: 978-3-319-30599-8

  • eBook Packages: Engineering (R0)
