Abstract
The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.
Acknowledgements
This work is partially supported by Competitive Research Funding (CRF), The University of Aizu, Reference P-11 (2016), and JSPS KAKENHI Grant Number JP30453020. This work is also supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, Japan, in collaboration with Synopsys, Inc. and Cadence Design Systems, Inc. The first and last authors in the author list are the main contributors to this work.
About this article
Cite this article
Dang, K.N., Meyer, M., Okuyama, Y. et al. A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems. J Supercomput 73, 2705–2729 (2017). https://doi.org/10.1007/s11227-016-1951-0
DOI: https://doi.org/10.1007/s11227-016-1951-0