Abstract
If your computer crashes, you can usually revive it with a reboot, an empirical remedy that tends to work. The rationale is that transient faults, whether in hardware or software, can be cleared by refreshing the machine state. Such a “silver bullet”, however, could prove futile in the future, because some faults, especially those residing in hardware such as integrated circuit (IC) chips, cannot be eliminated by refreshing. What is needed is a more sophisticated mechanism to steer the system back onto the right track. The “magic cure” is the Fault Tolerance On-Chip (FTOC) mechanism, which relies on a suite of built-in design-for-reliability logic, including fault detection, fault diagnosis, and error recovery, working in a self-supportive manner. We have exploited FTOC to build a holistic solution, ranging from on-chip fault detection to error recovery mechanisms, that addresses faults caused by progressive chip aging. Besides fault detection, the FTOC paradigm provides attractive benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield.
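To make the detect, diagnose, and recover loop described above more concrete, the following is a minimal software sketch. It is not the authors' on-chip logic: the `Core` class, the `slack_used` aging indicator, and the two thresholds are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"   # still usable, e.g. at reduced performance
    DISABLED = "disabled"   # isolated from the rest of the system


@dataclass
class Core:
    core_id: int
    # Hypothetical aging indicator: fraction of timing slack already consumed.
    slack_used: float = 0.0
    health: Health = Health.HEALTHY


def self_test(core, warn=0.8):
    """Detection: flag a core whose assumed slack usage reaches a warning level."""
    return core.slack_used >= warn


def self_diagnose(core, fail=1.0):
    """Diagnosis: decide whether a flagged core can keep running in degraded mode."""
    return Health.DISABLED if core.slack_used >= fail else Health.DEGRADED


def self_repair(core, verdict):
    """Recovery: apply the verdict by degrading or isolating the core."""
    core.health = verdict


def ftoc_cycle(cores):
    """Run one detect -> diagnose -> repair pass; return ids of usable cores."""
    for core in cores:
        if self_test(core):
            self_repair(core, self_diagnose(core))
    return [c.core_id for c in cores if c.health is not Health.DISABLED]


cores = [Core(0, 0.2), Core(1, 0.85), Core(2, 1.1)]
usable = ftoc_cycle(cores)
print(usable)  # [0, 1]
```

In this toy run, core 1 stays in service in a degraded state while core 2 is isolated, which is one simple way to picture graceful performance degradation rather than a whole-system failure.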
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61532017, 61572470, 61521092, 61522406, 61432017, 61376043), and in part by the Youth Innovation Promotion Association, CAS (Grant No. Y404441000).
About this article
Cite this article
Li, X., Yan, G., Ye, J. et al. Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach. Sci. China Inf. Sci. 61, 112102 (2018). https://doi.org/10.1007/s11432-017-9290-4
DOI: https://doi.org/10.1007/s11432-017-9290-4