The Journal of Supercomputing, Volume 73, Issue 6, pp 2705–2729

A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

  • Khanh N. Dang (Email author)
  • Michael Meyer
  • Yuichi Okuyama
  • Abderazek Ben Abdallah


The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because it corrupts data silently, which can spread erroneous data over a large area through error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO detects and recovers from soft errors that occur in the routing pipeline stages, and it leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation while minimizing additional hardware complexity and remaining power efficient.


3D NoCs · Fault-tolerance · Soft–hard faults · Reliability · Architecture · Design



This work is partially supported by Competitive Research Funding (CRF), The University of Aizu, Reference P-11 (2016), and JSPS KAKENHI Grant Number JP30453020. This work is also supported by VLSI Design and Education Center (VDEC), the University of Tokyo, Japan, in Collaboration with Synopsys, Inc. and Cadence Design Systems, Inc. The first and the last authors in the author list are the main contributors of this work.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Khanh N. Dang (1, Email author)
  • Michael Meyer (1)
  • Yuichi Okuyama (1)
  • Abderazek Ben Abdallah (1)

  1. Adaptive Systems Laboratory, Graduate School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu, Japan
