Efficient Reduction for Wait-Free Termination Detection in a Crash-Prone Distributed System

Mittal, Neeraj; Freiling, Felix C.; Venkatesan, S.; Penso, Lucia Draque

doi:10.1007/11561927_9

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3724)

Abstract

We investigate the problem of detecting termination of a distributed computation in systems where processes can fail by crashing. Specifically, when the communication topology is fully connected, we describe a way to transform any termination detection algorithm \(\mathcal{A}\) that has been designed for a failure-free environment into a termination detection algorithm \(\mathcal{B}\) that can tolerate process crashes. Our transformation assumes the existence of a perfect failure detector. We show that a perfect failure detector is in fact necessary to solve the termination detection problem in a crash-prone distributed system even if at most one process can crash.

Let μ(n,M) and δ(n,M) denote the message complexity and detection latency, respectively, of \(\mathcal{A}\) when the system has n processes and the underlying computation exchanges M application messages. The message complexity of \(\mathcal{B}\) is at most O(n + μ(n,0)) messages per failure more than the message complexity of \(\mathcal{A}\). Also, its detection latency is at most O(δ(n,0)) per failure more than that of \(\mathcal{A}\). Furthermore, the overhead (that is, the amount of control data piggybacked) on an application message increases by only O(log n) bits per failure.
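
As a compact restatement of these bounds, suppose f processes crash during an execution, with 0 ≤ f ≤ n–1. The symbol f and the subscripted names \(\mu_{\mathcal{B}}\) and \(\delta_{\mathcal{B}}\) are introduced here only for illustration and do not appear in the abstract. Read literally, the per-failure costs then add up as follows:

\[
\mu_{\mathcal{B}}(n, M, f) \le \mu(n, M) + O\big(f \cdot (n + \mu(n, 0))\big),
\qquad
\delta_{\mathcal{B}}(n, M, f) \le \delta(n, M) + O\big(f \cdot \delta(n, 0)\big)
\]

Under the same reading, the control data piggybacked on an application message grows by O(f log n) bits. With f = 0, every additive term vanishes and the bounds reduce to those of \(\mathcal{A}\).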

The fault-tolerant termination detection algorithm resulting from the transformation satisfies two desirable properties. First, it can tolerate failure of up to n–1 processes, that is, it is wait-free. Second, it does not impose any overhead on the fault-sensitive termination detection algorithm until one or more processes crash, that is, it is fault-reactive. Our transformation can be extended to arbitrary communication topologies provided process crashes do not partition the system.

Author information

Authors and Affiliations

Department of Computer Science, The University of Texas at Dallas, Richardson, TX, 75083, USA
Neeraj Mittal & S. Venkatesan
Department of Computer Science, RWTH Aachen University, D-52056, Aachen, Germany
Felix C. Freiling & Lucia Draque Penso

Editor information

Editors and Affiliations

CNRS and University Paris Diderot,
Pierre Fraigniaud

About this paper

Cite this paper

Mittal, N., Freiling, F.C., Venkatesan, S., Penso, L.D. (2005). Efficient Reduction for Wait-Free Termination Detection in a Crash-Prone Distributed System. In: Fraigniaud, P. (ed.) Distributed Computing. DISC 2005. Lecture Notes in Computer Science, vol 3724. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11561927_9

DOI: https://doi.org/10.1007/11561927_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29163-3
Online ISBN: 978-3-540-32075-3
eBook Packages: Computer Science, Computer Science (R0)
