Agreement on the group membership in synchronous distributed systems
When a group of processors in a distributed system cooperates on a common task, the non-faulty processors often need mutually consistent knowledge of which processors can be considered non-faulty. The set of non-faulty processors in the group, known as the group membership, changes, for example, when a processor crashes or when a crashed processor restarts and rejoins the group. All non-faulty processors should learn of these changes as quickly as possible, within a known bounded time interval. We present an algorithm by which the non-faulty processors of a group of bounded size can maintain consistent and timely knowledge of the group membership. Processors in the group are assumed to execute the algorithm synchronously, in periodic cycles of fixed length. In an execution of the proposed algorithm, every non-faulty processor learns of any processor failure within at most two cycles following the cycle in which the failure occurred, and a restarted processor can join the group in two cycles. Fewer than half of the processors are assumed to fail in any three consecutive cycles.
Keywords: group membership, distributed algorithms, broadcast networks, fault-tolerance
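The fault assumption and the timing bounds stated in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, only a check of its stated preconditions. It assumes a fixed group size and a hypothetical failure schedule given as a list of cycle numbers. The function names `fault_assumption_holds` and `detection_deadline` are invented for illustration.

```python
from collections import Counter


def fault_assumption_holds(group_size, failure_cycles, window=3):
    """Check that fewer than half of the processors fail in any
    `window` consecutive cycles.

    group_size     -- assumed fixed number of processors in the group
    failure_cycles -- hypothetical schedule: one cycle number per failure
    """
    per_cycle = Counter(failure_cycles)
    if not per_cycle:
        return True
    # Windows starting before the first failure contain no extra failures,
    # so it is enough to slide from the first to the last failure cycle.
    for start in range(min(per_cycle), max(per_cycle) + 1):
        failures = sum(per_cycle[c] for c in range(start, start + window))
        if 2 * failures >= group_size:  # "fewer than half" is violated
            return False
    return True


def detection_deadline(failure_cycle, bound=2):
    """Latest cycle by which every non-faulty processor knows of a failure
    that occurred in `failure_cycle`, per the two-cycle bound."""
    return failure_cycle + bound
```

For example, in a group of five processors, failures in cycles 1 and 3 satisfy the assumption (two failures in one three-cycle window), whereas failures in cycles 1, 2 and 3 do not. A failure in cycle 7 would be known to all non-faulty processors by cycle 9.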