Abstract
With the increasing throughput of local area networks, Networks Of Workstations (NOW) have become a convenient and cheaper alternative to parallel architectures for the execution of long-running parallel applications. However, because they are made up of a large number of components, they may experience failures. ICARE is a recoverable distributed shared memory (RDSM), based on backward error recovery, implemented on an ATM-based platform running the CHORUS microkernel. This paper presents the implementation and performance evaluation of ICARE, which exhibits a low overhead. ICARE takes advantage of the existing features of a DSM system to combine availability and efficiency: shared data are stored in standard memories and are managed by extending the coherence protocol.
This work has been partially funded by the DRET research contract number 93.124.
© 1998 Springer Science+Business Media New York
About this chapter
Cite this chapter
Kermarrec, AM., Morin, C. (1998). An Efficient Recoverable DSM on a Network of Workstations: Design and Implementation. In: Fault-Tolerant Parallel and Distributed Systems. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-5449-3_7
DOI: https://doi.org/10.1007/978-1-4615-5449-3_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7488-6
Online ISBN: 978-1-4615-5449-3