Toward Understanding Soft Faults in High Performance Cluster Networks

  • Jeffrey J. Evans
  • Seongbok Baik
  • Cynthia S. Hood
  • William Gropp
Part of the IFIP — The International Federation for Information Processing book series (IFIPAICT, volume 118)

Abstract

Fault management in high performance cluster networks has focused on hard faults (i.e., link or node failures). Network degradations that negatively impact performance but do not result in failures often go unnoticed. In this paper, we classify such degradations as soft faults. In addition, we identify consistent performance as an important service requirement in cluster networks. Using this requirement, we describe a comprehensive strategy for cluster fault management.

Keywords

Cluster fault management • Interconnection networks • Soft faults

Copyright information

© IFIP International Federation for Information Processing 2003

Authors and Affiliations

  • Jeffrey J. Evans¹
  • Seongbok Baik¹
  • Cynthia S. Hood¹
  • William Gropp²

  1. Department of Computer Science, Illinois Institute of Technology, Chicago, USA
  2. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, USA