Part of the book series: Synthesis Lectures on Computer Architecture (SLCA)

Abstract

In the past two chapters, we have discussed how to detect errors and recover from them. For transient errors, detection and recovery are sufficient. After recovery, the transient error is no longer present and execution can resume without a problem. However, if an error is due to a permanent fault, detection and recovery may not be sufficient.
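The distinction can be illustrated with a minimal sketch. It assumes a simple recovery scheme that re-executes the operation after each detected error; the function name `run_with_recovery` and the two fault models are hypothetical and are not taken from the chapter. Under that assumption, a transient fault disappears on re-execution, while a permanent fault is hit again on every attempt.

```python
def run_with_recovery(error_detected, max_attempts=5):
    """Re-execute an operation after each detected error.

    error_detected(attempt) -> bool is a hypothetical fault model that
    returns True when the error is detected on the given attempt.
    Returns the attempt number that completed without a detected error,
    or None if every attempt hit an error.
    """
    for attempt in range(1, max_attempts + 1):
        if not error_detected(attempt):
            return attempt  # execution resumes normally
        # Error detected: recover (roll back) and try again.
    return None


# Transient fault (assumed model): the error appears once, then is gone.
def transient(attempt):
    return attempt == 1


# Permanent fault (assumed model): the error appears on every attempt.
def permanent(attempt):
    return True


transient_result = run_with_recovery(transient)
permanent_result = run_with_recovery(permanent)
print(transient_result, permanent_result)
```

In this sketch the permanent fault never lets execution make forward progress, which is why detection and recovery on their own may not be enough in that case.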

© 2009 Springer Nature Switzerland AG

Cite this chapter

Sorin, D. (2009). Diagnosis. In: Fault Tolerant Computer Architecture. Synthesis Lectures on Computer Architecture. Springer, Cham. https://doi.org/10.1007/978-3-031-01723-0_4
