
A Routing Methodology for Dynamic Fault Tolerance in Meshes and Tori

  • Conference paper
High Performance Computing – HiPC 2007 (HiPC 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4873)

Abstract

This paper proposes a fully distributed fault-tolerant routing methodology for tori and meshes. It supports a dynamic fault model, so the network remains fully operational at all times. Unlike most previous proposals that support a dynamic fault model, the methodology tolerates concave fault regions and therefore avoids disabling healthy nodes in most practical scenarios. Adaptive routing provides high network performance, and performance degrades gracefully in the presence of faults.
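To make the point about disabling healthy nodes concrete, the following is a minimal sketch and not the paper's method. It assumes a 2D mesh without wraparound and assumes a scheme that only tolerates rectangular fault blocks, so that a concave fault region has to be grown to its bounding rectangle. The names `bounding_block` and `healthy_nodes_disabled` are hypothetical and chosen only for this illustration.

```python
from typing import Set, Tuple

Node = Tuple[int, int]  # (x, y) coordinate of a node in a 2D mesh


def bounding_block(faulty: Set[Node]) -> Set[Node]:
    """Return every node inside the smallest rectangle enclosing the faults."""
    if not faulty:
        return set()
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {
        (x, y)
        for x in range(min(xs), max(xs) + 1)
        for y in range(min(ys), max(ys) + 1)
    }


def healthy_nodes_disabled(faulty: Set[Node]) -> Set[Node]:
    """Healthy nodes that a rectangular-block-only scheme (an assumption of
    this sketch) would have to mark as faulty to make the region rectangular."""
    return bounding_block(faulty) - faulty


# A U-shaped (concave) fault region: two vertical arms joined at the bottom.
u_shape = {(0, 0), (0, 1), (0, 2), (1, 0), (2, 0), (2, 1), (2, 2)}

print(sorted(healthy_nodes_disabled(u_shape)))  # [(1, 1), (1, 2)]
```

In this example the two healthy nodes inside the "U" would be sacrificed under the assumed rectangular-block restriction; a methodology that tolerates the concave region directly could leave them in service.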



Editor information

Srinivas Aluru, Manish Parashar, Ramamurthy Badrinath, Viktor K. Prasanna

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nordbotten, N.A., Skeie, T. (2007). A Routing Methodology for Dynamic Fault Tolerance in Meshes and Tori. In: Aluru, S., Parashar, M., Badrinath, R., Prasanna, V.K. (eds) High Performance Computing – HiPC 2007. HiPC 2007. Lecture Notes in Computer Science, vol 4873. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77220-0_47

  • DOI: https://doi.org/10.1007/978-3-540-77220-0_47

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77219-4

  • Online ISBN: 978-3-540-77220-0

  • eBook Packages: Computer Science, Computer Science (R0)
