Experimental Assessment of the Practicality of a Fault-Tolerant System

  • Jai Wug Kim
  • Jongpil Lee
  • Heon Y. Yeom
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4362)


Fault tolerance has gained renewed importance with the proliferation of high-performance clusters. However, fault-tolerant systems have not yet seen wide commercial adoption because they are hard to deploy, use, manage, maintain, or justify. To satisfy the demand for a practical fault-tolerant system, we have developed M3, an easily deployable, multiple-fault-tolerant MPI system for Myrinet.

In this paper, we run rigorous tests with real-world applications to validate that M3 can be used in commercial clusters. We also describe improvements made to the system to resolve various problems that arose when deploying it on a commercial cluster.

This paper models our system’s checkpoint overhead and presents the results of a series of tests using computation- and communication-intensive MPI applications employed commercially in various fields of science. The experimental results show that our system not only adapts well to various types of running environment but can also be practically deployed in commercial clusters.
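The abstract mentions a model of checkpoint overhead but does not reproduce it on this page. As an illustrative stand-in only, the sketch below uses Young’s classic first-order approximation for the checkpoint interval that minimizes expected overhead; the function names, example numbers (a 30 s checkpoint cost and a 24 h cluster MTBF), and the choice of Young’s formula are assumptions of this sketch, not the authors’ actual model.

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation T_opt ~= sqrt(2 * C * MTBF), where C is the cost
    in seconds of taking one checkpoint and MTBF is the cluster's mean time
    between failures. Valid when T_opt is small relative to the MTBF."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def checkpoint_overhead_fraction(checkpoint_cost_s: float, interval_s: float) -> float:
    """First-order fraction of wall-clock time lost to checkpointing alone
    (failure-free case): one checkpoint of cost C per interval T."""
    return checkpoint_cost_s / interval_s

# Example: a 30 s checkpoint on a cluster with a 24 h MTBF.
interval = optimal_checkpoint_interval(30.0, 24 * 3600.0)
print(round(interval))                              # optimal interval, seconds -> 2277
print(round(checkpoint_overhead_fraction(30.0, interval), 4))
```

Under this model, cheaper checkpoints or a less reliable cluster both shorten the optimal interval, which matches the intuition that checkpoint cost and failure rate jointly determine how aggressively a system should checkpoint.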


Keywords: Node Failure · Experimental Assessment · Real Cluster · Synchronization Message · Checkpoint Interval





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Jai Wug Kim (1)
  • Jongpil Lee (1)
  • Heon Y. Yeom (1)
  1. School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea
