A New Adaptive Fault-Tolerant Routing Methodology for Direct Networks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3296)

Abstract

Interconnection networks play a key role in the fault tolerance of massively parallel computers, since faults may isolate a large fraction of the machine containing many healthy nodes. In this paper, we present a methodology for designing fully adaptive fault-tolerant routing algorithms for direct interconnection networks that can be applied to different regular topologies. The methodology is based mainly on selecting, when needed, an intermediate node for each source-destination pair. Packets are adaptively routed to the intermediate node and, from there, adaptively forwarded to their destination. The methodology requires only one additional virtual channel, even for tori. Evaluation results show that the methodology is 7-fault tolerant and that, for up to 14 faults, more than 99% of the fault combinations are tolerated, without significantly degrading performance in the presence of faults.
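The intermediate-node idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a mesh (no wraparound links), node faults only, and a simplified safety criterion under which a segment is usable only if every minimal path between its endpoints avoids all faulty nodes. The function names (`minimal_box`, `segment_is_safe`, `choose_intermediate`) and the exhaustive search are hypothetical choices made for illustration.

```python
from itertools import product


def minimal_box(a, b):
    """Nodes lying on at least one minimal path between a and b in a mesh.

    In a mesh this is the axis-aligned bounding box spanned by a and b.
    """
    ranges = [range(min(x, y), max(x, y) + 1) for x, y in zip(a, b)]
    return set(product(*ranges))


def segment_is_safe(a, b, faulty):
    """Assumed criterion: fully adaptive minimal routing may take any
    minimal path, so the whole bounding box must be fault-free."""
    return minimal_box(a, b).isdisjoint(faulty)


def distance(a, b):
    """Hop count between two mesh nodes (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def choose_intermediate(src, dst, faulty, dims):
    """Return None when no intermediate node is needed, otherwise an
    intermediate node that minimizes the total path length.

    Raises ValueError if this simplified criterion finds no candidate.
    """
    if segment_is_safe(src, dst, faulty):
        return None  # direct adaptive routing already avoids all faults

    best = None
    for node in product(*(range(k) for k in dims)):
        if node in faulty or node in (src, dst):
            continue
        if segment_is_safe(src, node, faulty) and segment_is_safe(node, dst, faulty):
            cost = distance(src, node) + distance(node, dst)
            if best is None or cost < best[0]:
                best = (cost, node)

    if best is None:
        raise ValueError("no intermediate node found for this fault set")
    return best[1]


if __name__ == "__main__":
    # 4x4 mesh with one faulty node inside the source-destination box.
    print(choose_intermediate((0, 0), (3, 3), {(1, 1)}, (4, 4)))
```

In this sketch a packet would first be routed adaptively from `src` to the returned node and then adaptively to `dst`. The criterion is deliberately strict: a fault lying on the straight line between a source and destination in the same row, for example, leaves no candidate and raises `ValueError`. The sketch does not model how the methodology itself treats such cases, nor its virtual-channel requirements.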

This work was supported by the Spanish Ministry of Science and Technology under Grant TIC2003-08154-C06-01.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gómez, M.E. et al. (2004). A New Adaptive Fault-Tolerant Routing Methodology for Direct Networks. In: Bougé, L., Prasanna, V.K. (eds) High Performance Computing - HiPC 2004. HiPC 2004. Lecture Notes in Computer Science, vol 3296. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30474-6_49

  • DOI: https://doi.org/10.1007/978-3-540-30474-6_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24129-4

  • Online ISBN: 978-3-540-30474-6

  • eBook Packages: Computer Science, Computer Science (R0)
