The FTMPS-project: Design and implementation of fault-tolerance techniques for massively parallel systems

Monitoring, Debugging, and Fault Tolerance
Part of the Lecture Notes in Computer Science book series (LNCS, volume 797)

Abstract

The FTMPS-project addresses the need for fault tolerance in large systems. A complete fault-tolerance approach has been developed and is being implemented. The built-in hardware error-detection features, combined with software error-detection techniques, provide high coverage of transient as well as permanent failures. In combination with the diagnosis software, the information needed for the OSS (statistics and visualisation) and for possible reconfiguration is collected. Backward error recovery based on checkpointing and rollback is implemented.

Keywords

Parallel System, Error Recovery, Faulty Component, Concurrent Error Detection, Control Processor
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1994

Authors and Affiliations

  1. Katholieke Universiteit Leuven, ESAT, Heverlee, Belgium
  2. Parsytec GmbH, Germany
  3. Universidade de Coimbra, Portugal
  4. F.A. Universität Erlangen-Nürnberg, Germany
  5. Universität-GH Paderborn, Germany
