Tolerating Radiation-Induced Transient Faults in Modern Processors

Abstract

As MOS device sizes continue to shrink, ever smaller charges, such as those carried by a single ionizing particle of naturally occurring radiation, are sufficient to upset the operation of complex modern microprocessors. To handle these inevitable errors, designs should include fault-tolerant features so that processors continue to operate correctly when errors occur. The main goal of this work is to develop architectural mechanisms that protect processors against the effects of such radiation-induced transient faults. First, from a program execution perspective, many faults manifest themselves as control flow errors, which cause the processor to violate the correct sequencing of instructions. We present a basic compile-time signature assignment algorithm and then describe a novel approach that improves its fault detection coverage. Moreover, to allow the processor to check the run-time instruction sequence efficiently and detect control flow errors, we introduce an on-chip assigned-signature checker capable of executing three additional instructions (SIC, SIJ, SIJC). Second, because simultaneous multi-threading (SMT) inherently provides the necessary redundancy, several proposals run two copies of the same thread on an SMT platform in order to detect and correct soft errors. Upon detection of an error, the processor state is rolled back to a known safe point and the affected instructions are retried, thereby achieving error-free execution. This paper focuses on two crucial implementation issues introduced by this scheme: (1) the design trade-off between fault detection coverage and design cost; (2) the possible occurrence of deadlock.
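
To make the idea of assigned-signature control flow checking concrete, the following is a minimal software sketch, not the algorithm or the hardware checker of this paper. It assumes that each basic block receives a distinct signature ahead of time and that a checker accepts a run-time transition only if it matches an edge of the control flow graph. The example graph and the names `assign_signatures`, `legal_transitions`, and `check_trace` are hypothetical, and the sketch does not model the SIC, SIJ, and SIJC instructions, whose semantics are not given in the abstract.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical control flow graph: basic block name -> successor blocks.
CFG: Dict[str, List[str]] = {
    "entry": ["loop"],
    "loop": ["body", "exit"],
    "body": ["loop"],
    "exit": [],
}


def assign_signatures(cfg: Dict[str, List[str]]) -> Dict[str, int]:
    """Give every basic block a distinct integer signature (the compile-time step)."""
    return {block: index + 1 for index, block in enumerate(sorted(cfg))}


def legal_transitions(
    cfg: Dict[str, List[str]], sigs: Dict[str, int]
) -> Set[Tuple[int, int]]:
    """Collect the (source signature, destination signature) pairs allowed by the graph."""
    return {(sigs[src], sigs[dst]) for src, dsts in cfg.items() for dst in dsts}


def check_trace(
    trace: List[str], cfg: Dict[str, List[str]], sigs: Dict[str, int]
) -> int:
    """Return the index of the first illegal transition in the trace, or -1 if none."""
    allowed = legal_transitions(cfg, sigs)
    for i in range(1, len(trace)):
        if (sigs[trace[i - 1]], sigs[trace[i]]) not in allowed:
            return i
    return -1


SIGS = assign_signatures(CFG)
GOOD_TRACE = ["entry", "loop", "body", "loop", "exit"]
BAD_TRACE = ["entry", "loop", "body", "exit"]  # body -> exit is not an edge

if __name__ == "__main__":
    print(check_trace(GOOD_TRACE, CFG, SIGS))
    print(check_trace(BAD_TRACE, CFG, SIGS))
```

In this sketch a fault that diverts execution from `body` directly to `exit` is flagged at the first transition that is not an edge of the graph. A fault that follows a legal but wrong edge would go unnoticed, which illustrates why fault detection coverage is a design concern.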

Keywords

Soft-error, Computer architecture, Fault-tolerant, Control flow checking, Multi-threading

Copyright information

© The Author(s) 2009

Authors and Affiliations

  1. Enterprise Microprocessor Group, Intel Corporation, Santa Clara, USA
  2. Department of Electrical Engineering and Computer Science, University of California, Irvine, USA
