A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI

  • Wesley Bland
  • Peng Du
  • Aurelien Bouteiller
  • Thomas Herault
  • George Bosilca
  • Jack Dongarra
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7484)

Abstract

Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even under extremely optimistic assumptions about hardware reliability, probabilistic amplification means that failures will be unavoidable at this scale. Consequently, software fault tolerance is paramount to maintaining future scientific productivity. Two major problems hinder the ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint-based approaches incur a steep overhead on failure-free operations, and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support for software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high-quality implementation, as defined by the current MPI standard, to enable algorithm-based recovery without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large-scale systems, using the QR factorization as an example.
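
The probabilistic amplification argument can be illustrated with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes that node failures are independent and exponentially distributed, and the 10-year per-node MTBF and the node counts are hypothetical values chosen only for illustration. Under those assumptions the MTBF of the whole machine is the node MTBF divided by the number of nodes, so even very reliable nodes yield a machine that fails frequently.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """Machine-wide MTBF, assuming independent exponential node failures."""
    return node_mtbf_hours / num_nodes


def prob_failure_within(hours: float, node_mtbf_hours: float, num_nodes: int) -> float:
    """Probability that at least one node fails within the given time window."""
    return 1.0 - math.exp(-hours * num_nodes / node_mtbf_hours)


if __name__ == "__main__":
    # Hypothetical per-node MTBF of 10 years (an assumption, not a measured value).
    node_mtbf = 10 * 365.25 * 24
    for nodes in (1, 1_000, 100_000):
        mtbf = system_mtbf(node_mtbf, nodes)
        p_day = prob_failure_within(24.0, node_mtbf, nodes)
        print(f"{nodes:>7} nodes: system MTBF = {mtbf:10.2f} h, "
              f"P(failure within 24 h) = {p_day:.6f}")
```

With these assumed numbers, a single node almost never fails within a day, while a machine of 100,000 such nodes is expected to see a failure in under an hour. This is the regime in which the cost of fault tolerance during failure-free operation becomes a central concern.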

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Wesley Bland (1)
  • Peng Du (1)
  • Aurelien Bouteiller (1)
  • Thomas Herault (1)
  • George Bosilca (1)
  • Jack Dongarra (1)

  1. Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
