An approach for hierarchical system level diagnosis of massively parallel computers combined with a simulation-based method for dependability analysis
The primary focus in the analysis of massively parallel supercomputers has traditionally been on their performance. However, their complex network topologies, large number of processors, and sophisticated system software can make them very unreliable. If the failure of any one of its many components could shut down a massively parallel computer, the machine would be useless. Fault tolerance is therefore required, and the basis of effective fault tolerance mechanisms is efficient diagnosis.
This paper deals with concurrent and hierarchical system level diagnosis for a particular massively parallel architecture and with a simulation-based method to validate the proposed diagnosis algorithm. We present the diagnosis algorithm and describe a simulation-based method for testing and verifying fault tolerance algorithms as early as the design phase of the target machine.
Keywords: massively parallel computers, system level diagnosis, simulation-based analysis, scalable and object-oriented simulation models