Scheduling saves in fault-tolerant computations

Acta Informatica

Abstract

Users running very long computations risk losing work to machine failures. Such losses can often be reduced by scheduling saves of successfully completed work to secure storage devices. In the model studied here, the user leaves the computation unattended for extended periods of time, after which he or she returns to check whether a machine failure has occurred. When a check reveals a failure, the user resets the computation so that it resumes from the point of the last successful save.

Saves are themselves time-consuming, so any strategy for scheduling saves must strike a balance between the computing time spent on saves and the computing time occasionally lost because of a failure since the last successful save.

For a given time to the next check and given constant save times, this paper computes schedules that maximize the expected amount of work successfully done before the next check, under the uniform and exponential failure laws. Explicit formulas are obtained for the uniform law. A recurrence leads to routine numerical calculations for the more difficult case of an exponential failure law.
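The objective described above can be illustrated with a small Python sketch. It is not the paper's method: the abstract does not state the formal model, so the code assumes that a schedule is a list of work intervals, each followed by a save of constant duration, that an interval counts only if its save completes before the failure, and that nothing further is saved after a failure until the next check. All function names are hypothetical, and the search over equally spaced saves is a simple stand-in for the optimal schedules the paper computes.

```python
import math
from typing import Callable, Sequence, Tuple

def expected_saved_work(work: Sequence[float], save_time: float,
                        survival: Callable[[float], float]) -> float:
    """Expected work secured by a schedule (assumed model, see lead-in).

    work      -- lengths of the work intervals, each followed by one save
    save_time -- constant duration of a save
    survival  -- survival(t) = probability that no failure occurs before time t
    """
    t = 0.0
    total = 0.0
    for w in work:
        t += w + save_time        # time at which this interval's save completes
        total += w * survival(t)  # the interval counts only if no failure by then
    return total

def exponential_survival(rate: float) -> Callable[[float], float]:
    """Survival function of an exponential failure law with the given rate."""
    return lambda t: math.exp(-rate * t)

def uniform_survival(horizon: float) -> Callable[[float], float]:
    """Survival function of a failure time uniform on [0, horizon]."""
    return lambda t: max(0.0, 1.0 - t / horizon)

def best_equal_schedule(check_time: float, save_time: float,
                        survival: Callable[[float], float],
                        max_saves: int = 50) -> Tuple[int, float]:
    """Try n = 1..max_saves equally spaced saves that fill the time to the check.

    Returns (number of saves, expected saved work) for the best n tried.
    Equal spacing is an illustrative restriction, not claimed to be optimal.
    """
    best = (0, 0.0)
    for n in range(1, max_saves + 1):
        w = check_time / n - save_time  # work per interval when n saves fill check_time
        if w <= 0:
            break                       # saves alone would use up all the time
        value = expected_saved_work([w] * n, save_time, survival)
        if value > best[1]:
            best = (n, value)
    return best
```

For example, `best_equal_schedule(10.0, 0.1, exponential_survival(0.2))` compares schedules with one, two, three, and more equally spaced saves over a 10-unit period and returns the best count together with its expected saved work, which makes the trade-off between save overhead and lost work concrete.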




Cite this article

Coffman, E.G., Flatto, L. & Kreinin, A.Y. Scheduling saves in fault-tolerant computations. Acta Informatica 30, 409–423 (1993). https://doi.org/10.1007/BF01210593


  • DOI: https://doi.org/10.1007/BF01210593
