Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF

  • Namyoon Woo
  • Hyungsoo Jung
  • Dongin Shin
  • Hyuck Han
  • Heon Y. Yeom
  • Taesoon Park
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3463)

Abstract

This paper presents an implementation of several consistent recovery protocols at the abstract device level and compares their performance. We performed experiments using three NAS Parallel Benchmark applications with class C datasets on state-of-the-art equipment. The interesting result is that the causal message logging protocol has the highest recovery cost for communication-intensive applications, since it suffers from a concentrated overload of simultaneous message replaying. Receiver-based optimistic message logging has the lowest recovery cost, with the drawback of extensive disk access overhead in failure-free executions. Coordinated checkpointing seems the most practical choice among them.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Namyoon Woo (1)
  • Hyungsoo Jung (1)
  • Dongin Shin (1)
  • Hyuck Han (1)
  • Heon Y. Yeom (1)
  • Taesoon Park (2)
  1. School of Computer Science and Engineering, Seoul National University, Seoul, Korea
  2. Department of Computer Engineering, Sejong University, Seoul, Korea
