Abstract
This paper proposes a fully distributed fault-tolerant routing methodology for tori and meshes. It supports a dynamic fault model, so the network remains fully operational at all times. Unlike most previous proposals that support a dynamic fault model, the methodology tolerates concave fault regions, which avoids disabling healthy nodes in most practical scenarios. It achieves high network performance through adaptive routing and degrades gracefully in the presence of faults.
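To make the cost of excluding concave fault regions concrete, the following minimal sketch counts the healthy nodes that would have to be disabled if an L-shaped (concave) fault region in a 2D mesh were padded out to a rectangle. This is not the routing methodology proposed in the paper. It assumes a rectangular block fault model of the kind often used in earlier fault-tolerant routing work, a single fault region, and a mesh without wraparound links. The names `bounding_block` and `healthy_nodes_sacrificed` are hypothetical and exist only for this illustration.

```python
from typing import Set, Tuple

Node = Tuple[int, int]  # (x, y) coordinate of a node in a 2D mesh


def bounding_block(faulty: Set[Node]) -> Set[Node]:
    """Smallest axis-aligned rectangle of nodes containing every faulty node.

    Assumption: a single fault region in a mesh (no torus wraparound).
    """
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {
        (x, y)
        for x in range(min(xs), max(xs) + 1)
        for y in range(min(ys), max(ys) + 1)
    }


def healthy_nodes_sacrificed(faulty: Set[Node]) -> Set[Node]:
    """Healthy nodes a rectangular fault model would disable to make the region convex."""
    return bounding_block(faulty) - faulty


# An L-shaped (concave) fault region: five faulty nodes.
l_shape = {(1, 1), (1, 2), (1, 3), (2, 1), (3, 1)}

print(sorted(healthy_nodes_sacrificed(l_shape)))
# -> [(2, 2), (2, 3), (3, 2), (3, 3)]
```

In this example five faulty nodes force four healthy nodes out of service under the rectangular model. A methodology that tolerates the concave region directly keeps those four nodes usable.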
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nordbotten, N.A., Skeie, T. (2007). A Routing Methodology for Dynamic Fault Tolerance in Meshes and Tori. In: Aluru, S., Parashar, M., Badrinath, R., Prasanna, V.K. (eds) High Performance Computing – HiPC 2007. HiPC 2007. Lecture Notes in Computer Science, vol 4873. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77220-0_47
DOI: https://doi.org/10.1007/978-3-540-77220-0_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77219-4
Online ISBN: 978-3-540-77220-0
eBook Packages: Computer Science, Computer Science (R0)