SHIELD: A Fault-Tolerant MPI for an Infiniband Cluster

  • Hyuck Han
  • Hyungsoo Jung
  • Jai Wug Kim
  • Jongpil Lee
  • Youngjin Yu
  • Shin Gyu Kim
  • Heon Y. Yeom
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4208)

Abstract

Today’s high performance cluster computing technologies demand extreme robustness against unexpected failures so that aggressively parallelized work can finish within a given time constraint. Although there has been a steady effort to develop hardware and software tools that increase the fault-resilience of cluster environments, a successful solution has yet to be delivered to commercial vendors. This paper presents SHIELD, a practical and easily deployable fault-tolerant MPI and MPI management system for an Infiniband cluster. SHIELD provides a novel framework that can be easily used in real cluster systems, and its design perspectives differ from those of other fault-tolerant MPI implementations. We show that SHIELD provides robust fault-resilience to fault-vulnerable cluster systems and that its design features are useful wherever fault-resilience is of utmost importance.

Keywords

Checkpoint; Consistent recovery; Fault-tolerance; MPI; Infiniband

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hyuck Han (1)
  • Hyungsoo Jung (1)
  • Jai Wug Kim (1)
  • Jongpil Lee (1)
  • Youngjin Yu (1)
  • Shin Gyu Kim (1)
  • Heon Y. Yeom (1)

  1. School of Computer Science and Engineering, Seoul National University, Seoul, South Korea
