An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis

  • J. Altmann
  • F. Balbach
  • A. Hein
Session 9: Parallel systems
Part of the Lecture Notes in Computer Science book series (LNCS, volume 852)


The primary focus in the analysis of massively parallel supercomputers has traditionally been on their performance. However, their complex network topologies, large number of processors, and sophisticated system software can make them very unreliable. If every failure of one of the many components of a massively parallel computer could shut down the machine, the machine would be useless. Therefore fault tolerance is required. The basis of effective mechanisms for fault tolerance is an efficient diagnosis.

This paper deals with concurrent, hierarchical system level diagnosis for a particular massively parallel architecture and with a simulation-based method for validating the proposed diagnosis algorithm. It presents the diagnosis algorithm and describes how simulation can be used to test and verify the fault tolerance algorithms as early as the design phase of the target machine.


Keywords: massively parallel computers, system level diagnosis, simulation-based analysis, scalable and object-oriented simulation models





Copyright information

© Springer-Verlag Berlin Heidelberg 1994

Authors and Affiliations

  • J. Altmann (1)
  • F. Balbach (1)
  • A. Hein (1)
  1. Institut für Mathematische Maschinen und Datenverarbeitung (IMMD) III, Universität Erlangen-Nürnberg, Erlangen, Germany
