Performance of consistent checkpointing in a modular operating system: Results of the FTM experiment

  • Gilles Muller
  • Mireille Hue
  • Nadine Peyrouze
Session 11: Measurement
Part of the Lecture Notes in Computer Science book series (LNCS, volume 852)

Abstract

This paper presents an evaluation of the performance of a consistent checkpointing mechanism that has been integrated into a modular, Mach microkernel-based operating system. We have measured the performance overhead of checkpointing for several typical workstation applications: number crunching and office tools. This has been done using specific servers which were added to a standard Mach 3.0/BSD system. Measurements are performed for failure-free executions by varying the number of checkpoints and thus the amount of computation lost in the event of a crash. Our initial results show a time overhead of about 3% for up to 20% work lost in the event of a crash, while the overhead is between 16% and 23% for up to 1% computation lost. Also, when porting interactive office tools such as the micro-emacs text editor, we get a maximal checkpoint duration of 1.4 seconds on our prototype machine, which is as powerful as a Sun 3/60. Based on these results, we argue that checkpointing can be integrated into a modular micro-kernel based operating system without degrading system performance.
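
The trade-off between the number of checkpoints and the amount of work lost can be illustrated with a small calculation. This is only a sketch under simplifying assumptions that do not come from the paper: checkpoints are evenly spaced, each one costs a fixed amount of time, and the function names and numeric inputs are hypothetical.

```python
def max_work_lost_fraction(num_intervals: int) -> float:
    """Worst-case fraction of a run lost on a crash, assuming the run is
    split into num_intervals equal intervals with a checkpoint at the end
    of each (an assumption, not the paper's measurement setup)."""
    if num_intervals < 1:
        raise ValueError("num_intervals must be at least 1")
    return 1.0 / num_intervals


def time_overhead_fraction(num_checkpoints: int,
                           checkpoint_cost_s: float,
                           base_runtime_s: float) -> float:
    """Relative time overhead, assuming every checkpoint costs the same
    fixed time (hypothetical cost model)."""
    if base_runtime_s <= 0:
        raise ValueError("base_runtime_s must be positive")
    return num_checkpoints * checkpoint_cost_s / base_runtime_s


if __name__ == "__main__":
    # Hypothetical inputs: a 1000 s run and a 2 s checkpoint.
    for n in (5, 100):
        lost = max_work_lost_fraction(n)
        overhead = time_overhead_fraction(n, 2.0, 1000.0)
        print(f"{n} checkpoints: up to {lost:.0%} lost, {overhead:.0%} overhead")
```

Under this simple model, bounding the lost work more tightly requires proportionally more checkpoints, so the overhead grows as the tolerated loss shrinks, which is the qualitative behaviour the abstract reports.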

Keywords

Fault-tolerance, consistent checkpointing, modular operating systems, micro-kernel, stable transactional memory, performance evaluation

Copyright information

© Springer-Verlag Berlin Heidelberg 1994

Authors and Affiliations

  • Gilles Muller (1)
  • Mireille Hue (2)
  • Nadine Peyrouze (2)

  1. IRISA/INRIA, Rennes Cedex, France
  2. BULL Research, IRISA, Rennes Cedex, France
