Trade-Offs in Transient Fault Recovery Schemes for Redundant Multithreaded Processors

  • Joseph Sharkey
  • Nayef Abu-Ghazeleh
  • Dmitry Ponomarev
  • Kanad Ghose
  • Aneesh Aggarwal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4297)


CMOS downscaling trends, manifested in smaller transistor feature sizes and lower supply voltages, make microprocessors increasingly vulnerable to transient errors with each new technology generation. One architectural approach to detecting and recovering from such errors is to execute two copies of the same program and compare their results. While comparing only the store instructions is sufficient for error detection, register values must also be compared to support fault recovery. In this paper, we propose novel checkpoint-assisted mechanisms for efficient fault recovery that dramatically reduce the number of register values that must be compared to detect soft errors, and we perform a comprehensive investigation of these and other existing recovery schemes from the standpoint of performance, power, and design complexity.
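
To make the detection/recovery distinction concrete, the following is a minimal software sketch, not the mechanism proposed in the paper. It assumes a hypothetical three-instruction toy ISA, two copies run in lockstep, a single injected bit flip, and a fixed checkpoint interval; the names `PROGRAM`, `step`, and `run_redundant` are illustrative only. Stores are compared before they reach memory (detection), while full register files are compared only at checkpoint boundaries so that each checkpoint is known to be clean (recovery).

```python
# Hypothetical toy ISA (an assumption for illustration):
#   ("li", rd, imm), ("add", rd, rs1, rs2), ("store", addr, rs)
PROGRAM = [
    ("li", 0, 5),
    ("li", 1, 7),
    ("add", 2, 0, 1),
    ("store", 100, 2),
    ("add", 3, 2, 2),
    ("store", 104, 3),
]

def step(regs, instr):
    """Execute one instruction on a private register list.
    Returns (addr, value) for a store, otherwise None."""
    op = instr[0]
    if op == "li":
        regs[instr[1]] = instr[2]
    elif op == "add":
        regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
    elif op == "store":
        return (instr[1], regs[instr[2]])
    return None

def run_redundant(program, fault=None, interval=2):
    """Run two copies in lockstep and return (memory, recoveries).
    fault=(pc, reg, bit) flips one bit in copy A's register file once,
    right after the instruction at index pc executes."""
    a, b = [0] * 4, [0] * 4              # private register files of the two copies
    memory, recoveries = {}, 0
    ckpt_pc, ckpt_regs = 0, [0] * 4      # last checkpoint known to be clean
    pc = 0
    while pc < len(program):
        store_a = step(a, program[pc])
        store_b = step(b, program[pc])
        if fault is not None and fault[0] == pc:
            a[fault[1]] ^= 1 << fault[2]  # inject the transient fault into copy A
            fault = None
        pc += 1
        at_boundary = pc % interval == 0
        # Detection: stores are always compared; registers only at boundaries.
        if store_a != store_b or (at_boundary and a != b):
            a, b = list(ckpt_regs), list(ckpt_regs)  # recovery: roll both copies back
            pc = ckpt_pc
            recoveries += 1
            continue
        if store_a is not None:
            memory[store_a[0]] = store_a[1]          # commit only a verified store
        if at_boundary:
            ckpt_pc, ckpt_regs = pc, list(a)         # registers matched: new checkpoint
    return memory, recoveries
```

Under these assumptions, `run_redundant(PROGRAM, fault=(2, 2, 0))` returns the same memory contents as the fault-free run, with one recovery recorded. The store comparison catches the corrupted value before it is committed, and the rollback target is trustworthy only because the register files were compared when the checkpoint was taken.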


Register File; Cache Line; Soft Error; Transient Fault; Main Thread
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Joseph Sharkey (1)
  • Nayef Abu-Ghazeleh (1)
  • Dmitry Ponomarev (1)
  • Kanad Ghose (1)
  • Aneesh Aggarwal (2)
  1. Department of Computer Science, State University of New York at Binghamton
  2. Department of Electrical Engineering, State University of New York at Binghamton
