
Toward Understanding Soft Faults in High Performance Cluster Networks


Part of the IFIP — The International Federation for Information Processing book series (IFIPAICT,volume 118)


Fault management in high performance cluster networks has focused on hard faults (i.e., link or node failures). Network degradations that hurt performance but do not cause failures often go unnoticed. In this paper, we classify such degradations as soft faults. In addition, we identify consistent performance as an important service requirement in cluster networks. Using this requirement, we describe a comprehensive strategy for cluster fault management.


  • Cluster
  • fault management
  • interconnection networks
  • soft faults

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-0-387-35674-7_66




Copyright information

© 2003 IFIP International Federation for Information Processing

About this chapter

Cite this chapter

Evans, J.J., Baik, S., Hood, C.S., Gropp, W. (2003). Toward Understanding Soft Faults in High Performance Cluster Networks. In: Goldszmidt, G., Schönwälder, J. (eds) Integrated Network Management VIII. IM 2003. IFIP — The International Federation for Information Processing, vol 118. Springer, Boston, MA.

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4757-5521-3

  • Online ISBN: 978-0-387-35674-7

  • eBook Packages: Springer Book Archive