Dynamic Failure Management for Parallel Applications on Grids

  • Hyungsoo Jung
  • Dongin Shin
  • Hyeongseog Kim
  • Hyuck Han
  • Inseon Lee
  • Heon Y. Yeom
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3470)

Abstract

Today's computational grids are vulnerable to node failures, and the probability of a node failure grows rapidly as the size of the grid increases. There have been several attempts to provide fault tolerance using checkpointing and message logging in conjunction with the MPI library. However, the grid itself should take an active role in dealing with failures. We propose a dynamically reconfigurable architecture in which applications can regroup in the face of a failure. The proposed architecture removes the single point of failure from computational grids and provides flexibility in grid configuration.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Hyungsoo Jung (1)
  • Dongin Shin (1)
  • Hyeongseog Kim (1)
  • Hyuck Han (1)
  • Inseon Lee (1)
  • Heon Y. Yeom (1)

  1. School of Computer Science and Engineering, Institute of Computer Technology, Seoul National University, Seoul, Korea
