Part of the book series: Synthesis Lectures on Computer Architecture ((SLCA))

Abstract

In Chapter 2, we learned how to detect errors. Detecting an error is sufficient for providing safety, but we would also like the system to recover from the error. Recovery hides the effects of the error from the user. After recovery, the system can resume operation and ideally remain live. For many systems, availability is the most important metric, and achieving high availability requires the system to be able to recover from its errors without user intervention. If the error was due to a permanent fault, recovery may not be sufficient for liveness because execution after recovery will keep reencountering the same permanent fault. The solutions to this problem—permanent fault diagnosis and self-repair—are the topics of the next two chapters.

© 2009 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sorin, D. (2009). Error Recovery. In: Fault Tolerant Computer Architecture. Synthesis Lectures on Computer Architecture. Springer, Cham. https://doi.org/10.1007/978-3-031-01723-0_3
