Reconfiguration and checkpointing in massively parallel systems

  • Session 9: Parallel systems
  • Conference paper
Dependable Computing — EDCC-1 (EDCC 1994)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 852)

Abstract

Despite improvements in hardware design, massively parallel systems lack dependability because of the huge number of components they contain. One way to introduce fault tolerance into such systems is backward error recovery, in which failed modules are replaced by spares. The ESPRIT Project 6731, “A Practical Approach to Fault-Tolerant Massively Parallel Systems”, follows this approach and covers error detection, diagnosis, checkpointing and reconfiguration. The target systems are multicomputers consisting of modules connected in a grid and communicating by message passing. A first implementation will be made for the Parsytec GCel under PARIX. This paper focuses on recovery by reconfiguration and checkpointing. The project relies on switching in spares and routing around failed components via virtual links (interval routing). For recovery, both a user-driven and a user-transparent approach are provided, based on the new recovery-line-manager.

This research was partially supported by the EC as ESPRIT project 6731 (FTMPS).

Editor information

Klaus Echtle, Dieter Hammer, David Powell

Copyright information

© 1994 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bieker, B., Deconinck, G., Maehle, E., Vounckx, J. (1994). Reconfiguration and checkpointing in massively parallel systems. In: Echtle, K., Hammer, D., Powell, D. (eds) Dependable Computing — EDCC-1. EDCC 1994. Lecture Notes in Computer Science, vol 852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-58426-9_141

  • DOI: https://doi.org/10.1007/3-540-58426-9_141

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-58426-1

  • Online ISBN: 978-3-540-48785-2

  • eBook Packages: Springer Book Archive
