Abstract
CMOS downscaling trends, manifested in smaller transistor feature sizes and lower supply voltages, make microprocessors increasingly vulnerable to transient errors with each new technology generation. One architectural approach to detecting and recovering from such errors is to execute two copies of the same program and compare their results. While comparing only the store instructions is sufficient for error detection, register values must also be compared to support fault recovery. In this paper, we propose novel checkpoint-assisted mechanisms for efficient fault recovery that dramatically reduce the number of register values that must be compared to detect soft errors. We also present a comprehensive investigation of these and other existing recovery schemes from the standpoint of performance, power, and design complexity.
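The general idea of redundant execution with store comparison and checkpoint-based rollback can be illustrated with a toy model. The sketch below is not the mechanism proposed in the paper; the function name `run_redundant`, the single `acc` register, the running-sum workload, and the `checkpoint_every` parameter are all assumptions made for illustration. Two copies of a computation run in lockstep, each store value is compared before it is committed, and the register state is saved only at periodic checkpoints, so a mismatch triggers a rollback to the last verified checkpoint.

```python
from typing import Dict, List, Optional, Tuple


def run_redundant(
    values: List[int],
    flip_at: Optional[int] = None,
    checkpoint_every: int = 2,
) -> Tuple[Dict[int, int], int]:
    """Toy model: run two copies of a running-sum program in lockstep.

    Every store is compared before it is committed to memory. Register
    state is saved only at checkpoints (hypothetical policy: every
    `checkpoint_every` committed stores). `flip_at` injects a one-time
    bit flip into the trailing copy's register at that iteration.
    """
    lead = {"acc": 0}
    trail = {"acc": 0}
    checkpoint = (0, dict(lead))  # (next index to execute, verified registers)
    memory: Dict[int, int] = {}
    recoveries = 0
    i = 0
    while i < len(values):
        lead["acc"] += values[i]
        trail["acc"] += values[i]
        if flip_at == i:
            trail["acc"] ^= 1  # simulated soft error
            flip_at = None     # transient: does not recur on re-execution
        # Store comparison: the value about to be written must match.
        if lead["acc"] != trail["acc"]:
            i, saved = checkpoint          # roll back to verified state
            lead, trail = dict(saved), dict(saved)
            recoveries += 1
            continue
        memory[i] = lead["acc"]            # commit the verified store
        i += 1
        # Register comparison happens only at checkpoint boundaries.
        if i % checkpoint_every == 0 and lead == trail:
            checkpoint = (i, dict(lead))
    return memory, recoveries


if __name__ == "__main__":
    mem, n = run_redundant([1, 2, 3, 4], flip_at=2)
    print(mem, n)
```

In this model, stores committed after the last checkpoint are simply rewritten with the same values during re-execution. A real processor would need more care with stores that have already left the pipeline.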
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sharkey, J., Abu-Ghazeleh, N., Ponomarev, D., Ghose, K., Aggarwal, A. (2006). Trade-Offs in Transient Fault Recovery Schemes for Redundant Multithreaded Processors. In: Robert, Y., Parashar, M., Badrinath, R., Prasanna, V.K. (eds) High Performance Computing - HiPC 2006. HiPC 2006. Lecture Notes in Computer Science, vol 4297. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11945918_18
DOI: https://doi.org/10.1007/11945918_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68039-0
Online ISBN: 978-3-540-68040-6
eBook Packages: Computer Science, Computer Science (R0)