A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

The Journal of Supercomputing

Abstract

The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to the strict communication requirements among the increasingly large number of cores on a single chip. However, NoC systems are exposed to aggressive transistor scaling, low operating voltages, and high integration and power densities, which make them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because it corrupts data silently, which can leave a large region of erroneous data through error propagation and can cause packet re-transmission and deadlock. In this paper, we present the architecture and design of a comprehensive soft-error and hard-fault tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO detects and recovers from soft errors that occur in the routing pipeline stages, and it leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that 3D-FETO works around different kinds of hard faults and soft errors and ensures graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.

Acknowledgements

This work is partially supported by Competitive Research Funding (CRF), The University of Aizu, Reference P-11 (2016), and JSPS KAKENHI Grant Number JP30453020. This work is also supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, Japan, in collaboration with Synopsys, Inc. and Cadence Design Systems, Inc. The first and last authors in the author list are the main contributors to this work.

Author information

Corresponding author

Correspondence to Khanh N. Dang.

Additional information

This project is partially supported by Competitive Research Funding (CRF), The University of Aizu, Reference P-11 (2016), and JSPS KAKENHI Grant Number JP30453020.

About this article

Cite this article

Dang, K.N., Meyer, M., Okuyama, Y. et al. A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems. J Supercomput 73, 2705–2729 (2017). https://doi.org/10.1007/s11227-016-1951-0

  • DOI: https://doi.org/10.1007/s11227-016-1951-0
