Fault tolerance in distributed shared memory multiprocessors

Dal Cin, M.; Grygier, A.; Hessenauer, H.; Hildebrand, U.; Hönig, J.; Hohl, W.; Michel, E.; Pataricza, A.

doi:10.1007/3-540-57307-0_24

Part of the book series: Lecture Notes in Computer Science (volume 732)

Abstract

Massively parallel systems pose a new challenge for fault tolerance. Designers of such systems must expect that some parts of the system will fail. As the complexity and the number of components increase, the chance of single or multiple failures is no longer negligible. Redundancy, reconfigurability, and diagnosis techniques must therefore be incorporated at the design stage rather than added on afterwards. In this paper we discuss the fault tolerance techniques developed for MEMSY, a massively parallel architecture. These techniques can, in principle, be easily transferred to other distributed shared memory multiprocessors.

Guest researcher from TU Budapest, Dept. Measurement and Instrumentation Engineering

Author information

Authors and Affiliations

Informatik III, Universität Erlangen-Nürnberg, Deutschland
M. Dal Cin, A. Grygier, H. Hessenauer, U. Hildebrand, J. Hönig, W. Hohl, E. Michel & A. Pataricza

Editor information

Arndt Bode, Mario Dal Cin

Cite this paper

Dal Cin, M. et al. (1993). Fault tolerance in distributed shared memory multiprocessors. In: Bode, A., Dal Cin, M. (eds) Parallel Computer Architectures. Lecture Notes in Computer Science, vol 732. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57307-0_24

DOI: https://doi.org/10.1007/3-540-57307-0_24
Published: 12 July 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57307-4
