Fault tolerance in distributed shared memory multiprocessors

  • Conference paper
Parallel Computer Architectures

Part of the book series: Lecture Notes in Computer Science (volume 732)

Abstract

Massively parallel systems pose a new challenge for fault tolerance. Designers of such systems cannot assume that every part of the system will remain fault-free. As the complexity and the number of components grow, the probability of single or multiple failures is no longer negligible. Redundancy, reconfigurability and diagnosis techniques must therefore be incorporated at the design stage rather than added on afterwards. In this paper we discuss the fault tolerance techniques developed for MEMSY, a massively parallel architecture. These techniques can, in principle, be easily transferred to other distributed shared memory multiprocessors.

Guest researcher from TU Budapest, Dept. Measurement and Instrumentation Engineering

Editor information

Arndt Bode, Mario Dal Cin

Copyright information

© 1993 Springer-Verlag

About this paper

Cite this paper

Dal Cin, M. et al. (1993). Fault tolerance in distributed shared memory multiprocessors. In: Bode, A., Dal Cin, M. (eds) Parallel Computer Architectures. Lecture Notes in Computer Science, vol 732. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57307-0_24

  • DOI: https://doi.org/10.1007/3-540-57307-0_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-57307-4
