Reconfiguration and checkpointing in massively parallel systems

  • Bernd Bieker
  • Geert Deconinck
  • Erik Maehle
  • Johan Vounckx
Session 9: Parallel systems
Part of the Lecture Notes in Computer Science book series (LNCS, volume 852)


Despite improvements in hardware design, massively parallel systems lack dependability because of the very large number of components they consist of. One way to introduce fault tolerance into such systems is backward error recovery, in which failed modules are replaced by spares. The ESPRIT Project 6731 “A Practical Approach to Fault-Tolerant Massively Parallel Systems” follows this approach and covers error detection, diagnosis, checkpointing and reconfiguration. The target systems are multicomputers consisting of grid-connected modules that communicate by message passing. A first implementation will be made for the Parsytec GCel under PARIX. This paper focuses on recovery by reconfiguration and checkpointing. The project relies on switching in spares and routing around failed components via virtual links (interval routing). For recovery, both a user-driven and a user-transparent approach are provided, based on the new recovery-line manager.
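To illustrate the idea behind interval routing mentioned above, the following is a minimal sketch and not the project's implementation. It assumes a fault-free ring of n >= 3 nodes rather than the grid topology targeted by the project, and the names `ring_tables`, `route` and `in_cyclic_interval` are hypothetical. Each outgoing link is labelled with one cyclic interval of destination labels, and a message is forwarded on the link whose interval contains its destination.

```python
def in_cyclic_interval(x, lo, hi, n):
    """Return True if label x lies in the cyclic interval [lo, hi] modulo n."""
    return (x - lo) % n <= (hi - lo) % n


def ring_tables(n):
    """Build interval tables for a ring of n >= 3 nodes labelled 0..n-1.

    Assumption for illustration: each node has two links, 'cw' (to node+1)
    and 'ccw' (to node-1), and each link carries exactly one interval.
    """
    half = n // 2
    tables = {}
    for node in range(n):
        tables[node] = {
            # Destinations up to half the ring away go clockwise.
            "cw": ((node + 1) % n, (node + half) % n),
            # All remaining destinations go counter-clockwise.
            "ccw": ((node + half + 1) % n, (node - 1) % n),
        }
    return tables


def route(src, dst, n, tables):
    """Forward hop by hop, choosing the link whose interval contains dst."""
    path = [src]
    node = src
    while node != dst:
        for link, (lo, hi) in tables[node].items():
            if in_cyclic_interval(dst, lo, hi, n):
                node = (node + 1) % n if link == "cw" else (node - 1) % n
                break
        path.append(node)
    return path


if __name__ == "__main__":
    tables = ring_tables(8)
    print(route(0, 6, 8, tables))
    print(route(0, 3, 8, tables))
```

The appeal of the scheme is that each node stores only one interval per link instead of a full routing table. Routing around failed components, as the project does, would require relabelling or additional intervals, which this sketch does not attempt.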


Keywords: fault-tolerant parallel computers; user-driven checkpointing; user-transparent checkpointing; reconfiguration; routing





Copyright information

© Springer-Verlag Berlin Heidelberg 1994

Authors and Affiliations

  • Bernd Bieker (1)
  • Geert Deconinck (2)
  • Erik Maehle (1)
  • Johan Vounckx (2)
  1. FG Datentechnik, Universität-GH Paderborn, Germany
  2. Dept. Elektrotechniek-ESAT, Katholieke Universiteit Leuven, Belgium
