Characterizing Memory Failures Using Benford’s Law

  • Conference paper
Euro-Par 2021: Parallel Processing Workshops (Euro-Par 2021)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13098))

Abstract

Fault tolerance is a key challenge as high performance computing systems continue to increase component counts, individual component reliability decreases, and hardware and software complexity increases. To better understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis in an attempt to identify statistical properties of the failure data.

In this paper, we examine the lifetime of failures on the Cielo supercomputer that was located at Los Alamos National Laboratory, looking specifically at the time between faults on this system. Through this analysis, we show that the time between uncorrectable faults for this system obeys Benford’s law. This law applies to a number of naturally occurring collections of numbers and states that the leading digit is more likely to be small; for example, a leading digit of 1 is more likely than a leading digit of 9. We also show that a number of common distributions used to model failures follow this law. This work provides critical analysis of the distribution of times between failures for extreme-scale systems. Specifically, the analysis in this work could be used as a simple form of failure prediction or for modeling realistic failures.
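As a rough illustration of the law described above, Benford’s law in its usual form assigns the leading digit d the probability log10(1 + 1/d), which gives about 30.1% for a leading 1 and about 4.6% for a leading 9. The sketch below is not the paper's analysis and does not use Cielo data: the function names are hypothetical, and the log-normal sample with a wide spread is an assumed stand-in for a set of times between failures.

```python
import math
import random
from collections import Counter


def benford_expected():
    """Expected leading-digit probabilities: P(d) = log10(1 + 1/d), d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Return the first significant decimal digit of a positive number."""
    # Scientific notation always starts with the leading digit, e.g. '3.21...e-03'.
    return int(f"{x:.17e}"[0])


def leading_digit_frequencies(values):
    """Observed fraction of each leading digit among the positive values."""
    counts = Counter(leading_digit(v) for v in values if v > 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


if __name__ == "__main__":
    rng = random.Random(0)
    # ASSUMPTION: synthetic inter-failure times drawn from a log-normal
    # distribution spanning several orders of magnitude (not real failure data).
    samples = [rng.lognormvariate(0.0, 3.0) for _ in range(100_000)]

    expected = benford_expected()
    observed = leading_digit_frequencies(samples)
    for d in range(1, 10):
        print(f"digit {d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

Comparing the two columns printed by the script gives a quick visual check of how closely a sample tracks the law; a formal goodness-of-fit test would be the natural next step for real failure logs.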

Acknowledgment

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Author information

Corresponding author

Correspondence to Kurt B. Ferreira.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ferreira, K.B., Levy, S. (2022). Characterizing Memory Failures Using Benford’s Law. In: Chaves, R., et al. Euro-Par 2021: Parallel Processing Workshops. Euro-Par 2021. Lecture Notes in Computer Science, vol 13098. Springer, Cham. https://doi.org/10.1007/978-3-031-06156-1_25

  • DOI: https://doi.org/10.1007/978-3-031-06156-1_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06155-4

  • Online ISBN: 978-3-031-06156-1

  • eBook Packages: Computer Science (R0)
