Trade-Offs in Transient Fault Recovery Schemes for Redundant Multithreaded Processors

Sharkey, Joseph; Abu-Ghazeleh, Nayef; Ponomarev, Dmitry; Ghose, Kanad; Aggarwal, Aneesh

doi:10.1007/11945918_18

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4297)

Included in the following conference series:

International Conference on High-Performance Computing

Abstract

CMOS downscaling trends, manifested in smaller transistor feature sizes and lower supply voltages, make microprocessors increasingly vulnerable to transient errors with each new technology generation. One architectural approach to detecting and recovering from such errors is to execute two copies of the same program and compare their results. While comparing only the store instructions is sufficient for error detection, register values must also be compared to support fault recovery. In this paper, we propose novel checkpoint-assisted mechanisms for efficient fault recovery that dramatically reduce the number of register values that must be compared to detect soft errors, and we perform a comprehensive investigation of these and other existing recovery schemes from the standpoint of performance, power, and design complexity.
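
The distinction the abstract draws between detection and recovery can be illustrated with a toy model. The sketch below is not the paper's mechanism: the three-instruction machine, the names `execute` and `run_redundant`, the interval boundaries, and the injected bit flip are all assumptions made for illustration. Two copies run the same interval of instructions; their stores are always compared, and their registers are optionally compared before the register state is saved as a checkpoint.

```python
def execute(regs, program, fault_at=None):
    """Run `program` on a copy of `regs`; return (new_regs, stores)."""
    regs = list(regs)
    stores = []
    for i, (op, dst, src) in enumerate(program):
        if op == "addi":            # regs[dst] += immediate value src
            regs[dst] += src
        elif op == "add":           # regs[dst] += regs[src]
            regs[dst] += regs[src]
        elif op == "store":         # record (address dst, value of regs[src])
            stores.append((dst, regs[src]))
        if i == fault_at and op != "store":
            regs[dst] ^= 0x4        # hypothetical single-bit upset in the result
    return regs, stores


def run_redundant(intervals, fault=None, compare_regs=True):
    """Run two copies interval by interval; roll back to the checkpoint on mismatch.

    `fault` is an assumed (interval_index, instruction_index) pair that strikes
    the leading copy once. Returns (memory, number_of_detected_mismatches).
    """
    checkpoint = [0, 0, 0, 0]       # register state saved at the last interval boundary
    lead, trail = list(checkpoint), list(checkpoint)
    memory, detected, k = {}, 0, 0
    pending = fault                 # transient: the fault is injected at most once
    while k < len(intervals):
        inject = None
        if pending is not None and pending[0] == k:
            inject, pending = pending[1], None
        lead_next, lead_stores = execute(lead, intervals[k], inject)
        trail_next, trail_stores = execute(trail, intervals[k])
        mismatch = lead_stores != trail_stores or (
            compare_regs and lead_next != trail_next
        )
        if mismatch:
            detected += 1           # discard the interval and restart both copies
            lead, trail = list(checkpoint), list(checkpoint)
            continue
        for addr, value in lead_stores:
            memory[addr] = value    # stores leave the core only after they match
        lead, trail = lead_next, trail_next
        checkpoint = list(lead)     # trusted only if the registers were compared
        k += 1
    return memory, detected


PROGRAM = [
    [("addi", 0, 5), ("addi", 1, 7), ("store", 100, 0)],
    [("add", 0, 1), ("store", 104, 0)],
]

clean = run_redundant(PROGRAM)
recovered = run_redundant(PROGRAM, fault=(0, 1))
stores_only = run_redundant(PROGRAM, fault=(0, 1), compare_regs=False)
```

In this toy run the fault corrupts a register that is not stored until the next interval. With register comparison the mismatch is caught before the checkpoint is taken, so `recovered` equals `clean` apart from one rollback. With store comparison alone, the error is still detected at the later store, but the checkpoint already contains the corrupted value, so re-execution commits a wrong result in `stores_only`.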

Author information

Authors and Affiliations

Department of Computer Science
Joseph Sharkey, Nayef Abu-Ghazeleh, Dmitry Ponomarev & Kanad Ghose

Department of Electrical Engineering, State University of New York at Binghamton
Aneesh Aggarwal

Editor information

Editors and Affiliations

Yves Robert

Department of Electrical and Computer Engineering, Rutgers, the State University of New Jersey, 94 Brett Road, Piscataway, NJ 08854, USA
Manish Parashar

Hewlett-Packard ISO, Sy 192, Whitefield Road, Mahadevapura Post, Bangalore 560048, India
Ramamurthy Badrinath

Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2562, USA
Viktor K. Prasanna

About this paper

Cite this paper

Sharkey, J., Abu-Ghazeleh, N., Ponomarev, D., Ghose, K., Aggarwal, A. (2006). Trade-Offs in Transient Fault Recovery Schemes for Redundant Multithreaded Processors. In: Robert, Y., Parashar, M., Badrinath, R., Prasanna, V.K. (eds) High Performance Computing - HiPC 2006. HiPC 2006. Lecture Notes in Computer Science, vol 4297. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11945918_18

DOI: https://doi.org/10.1007/11945918_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68039-0
Online ISBN: 978-3-540-68040-6
eBook Packages: Computer Science, Computer Science (R0)
