Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach

  • Research Paper
Science China Information Sciences

Abstract

If your computer crashes, you can revive it by rebooting, an empirical solution that usually proves effective. The rationale is that transient faults, in either hardware or software, can be cleared by refreshing the machine state. Such a “silver bullet”, however, could become futile in the future because faults, especially those residing in hardware such as Integrated Circuit (IC) chips, cannot be eliminated by refreshing. What we need is a more sophisticated mechanism to steer the system back on track. The “magic cure” is the Fault Tolerance On-Chip (FTOC) mechanism, which relies on a suite of built-in design-for-reliability logic, covering fault detection, fault diagnosis, and error recovery, that works in a self-supportive manner. We have exploited FTOC to build a holistic solution, ranging from on-chip fault detection to error recovery, that addresses faults caused by progressive chip aging. Beyond fault detection, the FTOC paradigm offers further benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61532017, 61572470, 61521092, 61522406, 61432017, 61376043), and in part by the Youth Innovation Promotion Association, CAS (Grant No. Y404441000).

Author information

Corresponding author

Correspondence to Guihai Yan.

About this article

Cite this article

Li, X., Yan, G., Ye, J. et al. Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach. Sci. China Inf. Sci. 61, 112102 (2018). https://doi.org/10.1007/s11432-017-9290-4

  • DOI: https://doi.org/10.1007/s11432-017-9290-4
