Abstract
In the past two chapters, we have discussed how to detect errors and recover from them. For transient errors, detection and recovery are sufficient. After recovery, the transient error is no longer present and execution can resume without a problem. However, if an error is due to a permanent fault, detection and recovery may not be sufficient.
Copyright information
© 2009 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sorin, D. (2009). Diagnosis. In: Fault Tolerant Computer Architecture. Synthesis Lectures on Computer Architecture. Springer, Cham. https://doi.org/10.1007/978-3-031-01723-0_4
DOI: https://doi.org/10.1007/978-3-031-01723-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-00595-4
Online ISBN: 978-3-031-01723-0
eBook Packages: Synthesis Collection of Technology (R0), eBColl Synthesis Collection 2