Parallel Fault Tolerant Algorithms for Parabolic Problems

  • Hatem Ltaief
  • Marc Garbey
  • Edgar Gabriel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4128)

Abstract

With the increasing number of processors available on today's high performance computing systems, the mean time between failures of these machines is decreasing. The ability of hardware and software components to handle process failures is therefore becoming increasingly important. The objective of this paper is to present a fault tolerant approach for the implicit forward time integration of parabolic problems using explicit formulas. This technique allows the application to recover from process failures and to reconstruct the lost data of the failed process(es), avoiding the roll-back operation required in most checkpoint-restart schemes. The benchmark used to highlight the new algorithms is the two-dimensional heat equation solved with a first-order implicit Euler scheme.
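
For orientation, the benchmark named in the abstract can be sketched in a few lines. This is a serial illustration of one first-order implicit (backward) Euler step for the two-dimensional heat equation, and it is not the authors' code or their fault tolerant recovery scheme. The function name, the grid size, the time step, the zero Dirichlet boundary, and the use of Gauss-Seidel sweeps as the linear solver are all assumptions made for this example.

```python
import math


def implicit_euler_step(u, dt, h, sweeps=100):
    """One backward Euler step for u_t = u_xx + u_yy (hypothetical helper).

    u holds the interior grid values; the boundary is assumed to be zero.
    Solves (I - dt*L) u_new = u, with L the 5-point Laplacian, by a fixed
    number of Gauss-Seidel sweeps.
    """
    n = len(u)
    r = dt / h**2
    new = [row[:] for row in u]  # initial guess: the previous time level
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                # Sum of the four neighbours; values outside the grid are zero.
                nb = 0.0
                if i > 0:
                    nb += new[i - 1][j]
                if i < n - 1:
                    nb += new[i + 1][j]
                if j > 0:
                    nb += new[i][j - 1]
                if j < n - 1:
                    nb += new[i][j + 1]
                new[i][j] = (u[i][j] + r * nb) / (1.0 + 4.0 * r)
    return new


# Assumed test problem: unit square, 15 x 15 interior points, one sine mode.
n = 15
h = 1.0 / (n + 1)
dt = 1e-3
u0 = [[math.sin(math.pi * (i + 1) * h) * math.sin(math.pi * (j + 1) * h)
       for j in range(n)] for i in range(n)]
u1 = implicit_euler_step(u0, dt, h)
```

The sine mode is an eigenvector of the discrete Laplacian, so a single step should scale every grid value by the same factor 1 / (1 - dt * lam), where lam is the corresponding discrete eigenvalue. That gives a simple check on the solver.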

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hatem Ltaief (1)
  • Marc Garbey (1)
  • Edgar Gabriel (1)
  1. Department of Computer Science, University of Houston, Houston, USA
