European Conference on Parallel Processing

Euro-Par 2011: Parallel Processing Workshops, pp 241–250

Cooperative Application/OS DRAM Fault Recovery

  • Patrick G. Bridges,
  • Mark Hoemmen,
  • Kurt B. Ferreira,
  • Michael A. Heroux,
  • Philip Soltero &
  • Ron Brightwell

Part of the Lecture Notes in Computer Science book series (LNTCS, volume 7156)

Abstract

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application / OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.
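As a rough illustration of why an iterative solver can tolerate an uncorrected memory error without rolling back, the sketch below wraps an unreliable inner solve in an outer loop that checks every correction before accepting it. It is a toy under stated assumptions, not the paper's fault-tolerant GMRES and not a Trilinos API: the inner solver is plain Jacobi, the fault is simulated by overwriting one entry, and all names (`unreliable_inner_solve`, `outer_solve`, `faulty_steps`) are hypothetical.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(vi * vi for vi in v))


def residual(A, x, b):
    """Return b - A x."""
    return [bi - yi for bi, yi in zip(b, matvec(A, x))]


def unreliable_inner_solve(A, r, sweeps=5, corrupt=False):
    """Approximate A d = r with a few Jacobi sweeps.

    If corrupt is True, one entry of the result is overwritten to mimic
    an uncorrected memory error (a simulated fault, not a real one).
    """
    n = len(r)
    d = [0.0] * n
    for _ in range(sweeps):
        d = [
            (r[i] - sum(A[i][j] * d[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    if corrupt:
        d[0] = 1.0e30
    return d


def outer_solve(A, b, faulty_steps=(), tol=1e-10, max_outer=50):
    """Outer loop assumed to run reliably around an unreliable inner solve.

    A correction is accepted only if it yields a finite, strictly smaller
    residual norm; otherwise it is discarded and the loop simply continues.
    Returns (solution, final residual norm, number of rejected corrections).
    """
    x = [0.0] * len(b)
    r = residual(A, x, b)
    rejected = 0
    for k in range(max_outer):
        if norm(r) <= tol * norm(b):
            break
        d = unreliable_inner_solve(A, r, corrupt=(k in faulty_steps))
        candidate = [xi + di for xi, di in zip(x, d)]
        r_new = residual(A, candidate, b)
        if math.isfinite(norm(r_new)) and norm(r_new) < norm(r):
            x, r = candidate, r_new
        else:
            rejected += 1
    return x, norm(r), rejected


if __name__ == "__main__":
    # Small diagonally dominant system, chosen so that Jacobi converges.
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0]
    x, res, rejected = outer_solve(A, b, faulty_steps=(1,))
    print("solution:", x)
    print("residual norm:", res, "rejected corrections:", rejected)
```

In this toy, the corrupted correction costs one extra outer step instead of a restart from a checkpoint. The residual check is the only part that must be trusted, which is the design choice the sketch is meant to highlight.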

Keywords

  • Fault Tolerance
  • DRAM Failure
  • Fault-Tolerant GMRES

This work was supported in part by a faculty sabbatical appointment from Sandia National Laboratories and a grant from the U.S. Department of Energy Office of Science, Advanced Scientific Computing Research, under award number DE-SC0005050, program manager Sonia Sachs.

Author information

Authors and Affiliations

  1. Department of Computer Science, University of New Mexico, Albuquerque, NM, 87131, USA

    Patrick G. Bridges, Kurt B. Ferreira & Philip Soltero

  2. Sandia National Laboratories, Albuquerque, NM, 87123, USA

    Mark Hoemmen, Kurt B. Ferreira, Michael A. Heroux & Ron Brightwell

Editor information

Editors and Affiliations

  1. Scilytics, Koellnerhofgasse 3/15A, 1010, Vienna, Austria

    Michael Alexander

  2. ICAR-CNR, Via P. Castellino, 111, 80131, Napoli, Italy

    Pasqua D’Ambra

  3. University of Amsterdam, 1090, Amsterdam, Netherlands

    Adam Belloum

  4. Innovative Computing Laboratory, The University of Tennessee, US

    George Bosilca

  5. Department of Experimental Medicine and Clinic, University Magna Græcia, 88100, Catanzaro, Italy

    Mario Cannataro

  6. Computer Science Department, University of Pisa, Italy

    Marco Danelutto

  7. Second University of Naples, Italy

    Beniamino Di Martino

  8. TU München, Boltzmannstr. 3, 85748, Garching, Germany

    Michael Gerndt

  9. Equipe Runtime, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Emmanuel Jeannot & Raymond Namyst

  10. Equipe HIEPACS, INRIA Bordeaux Sud-Ouest, 33405, Talence Cedex, France

    Jean Roman

  11. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831-6164, Oak Ridge, TN, USA

    Stephen L. Scott

  12. Department of Scientific Computing, University of Vienna, Nordbergstr. 15/3C, 1090, Vienna, Austria

    Jesper Larsson Träff

  13. Computer Science and Mathematics Division, Oak Ridge National Laboratory, 37831, Oak Ridge, TN, USA

    Geoffroy Vallée

  14. Technische Universität München, Germany

    Josef Weidendorfer

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bridges, P.G., Hoemmen, M., Ferreira, K.B., Heroux, M.A., Soltero, P., Brightwell, R. (2012). Cooperative Application/OS DRAM Fault Recovery. In: Alexander, M., et al. Euro-Par 2011: Parallel Processing Workshops. Euro-Par 2011. Lecture Notes in Computer Science, vol 7156. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29740-3_28

  • DOI: https://doi.org/10.1007/978-3-642-29740-3_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29739-7

  • Online ISBN: 978-3-642-29740-3

  • eBook Packages: Computer Science, Computer Science (R0)
