Finding missing synchronization in a distributed computation using controlled re-execution

Mittal, Neeraj; Garg, Vijay K.

doi:10.1007/s00446-003-0104-x

Published: August 2004

Distributed Computing, Volume 17, pages 107–130 (2004)

Neeraj Mittal¹ &
Vijay K. Garg²

Abstract.

Correct distributed programs are hard to write. Not surprisingly, distributed systems are especially vulnerable to software faults. Testing and debugging is an important way to improve the reliability of distributed systems. A distributed debugger equipped with the mechanism to re-execute the traced computation in a controlled fashion can greatly facilitate the detection and localization of bugs. This approach gives rise to a general problem of predicate control, which takes a computation and a safety property specified on the computation as inputs, and produces a controlled computation, with added synchronization, that maintains the given safety property as output. We devise efficient control algorithms for two classes of useful predicates, namely region predicates and disjunctive predicates. For the former, we prove that the control algorithm is optimal in the sense that it guarantees maximum concurrency possible in the controlled computation. For the latter, we prove that our control algorithm generates the least number of synchronization dependencies and therefore has optimal message-complexity. Furthermore, we provide a necessary and sufficient condition under which it is possible to efficiently compute a minimal controlling synchronization for a general predicate. We also give an algorithm to compute such a synchronization under the condition provided.
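
To make the problem statement concrete, the following is a minimal brute-force sketch of predicate control on a toy computation. It is not one of the control algorithms from the paper: it enumerates every consistent cut, which is exponential in the number of processes, and merely checks whether a given set of synchronization dependencies maintains a predicate. The names `consistent_cuts`, `violations`, `mutex`, and the two-process trace are all hypothetical and chosen for illustration only.

```python
from itertools import product


def consistent_cuts(lengths, edges):
    """Yield every consistent cut of a computation.

    lengths[p] is the number of events on process p. A cut is a tuple whose
    p-th entry is the number of events of p that have been executed.
    An edge ((p, i), (q, j)) means event i of p happens before event j of q
    (events are numbered from 1), so any cut containing event j of q must
    also contain event i of p.
    """
    for cut in product(*(range(n + 1) for n in lengths)):
        if all(cut[q] < j or cut[p] >= i for (p, i), (q, j) in edges):
            yield cut


def violations(lengths, edges, predicate):
    """Return the consistent cuts on which the predicate is false."""
    return [c for c in consistent_cuts(lengths, edges) if not predicate(c)]


# Assumed toy trace: each process enters a critical section (event 1)
# and then leaves it (event 2). No messages are exchanged.
lengths = (2, 2)


def in_cs(executed):
    # A process is inside the critical section after exactly one event.
    return executed == 1


def mutex(cut):
    # Disjunctive predicate: at least one process is outside the critical section.
    return not in_cs(cut[0]) or not in_cs(cut[1])


traced = []                      # the traced computation has no dependencies
controlled = [((0, 2), (1, 1))]  # added: P0 leaves before P1 enters

print(violations(lengths, traced, mutex))      # [(1, 1)]
print(violations(lengths, controlled, mutex))  # []
```

In this sketch the single added dependency removes the one violating global state while leaving five consistent cuts reachable. Choosing such dependencies efficiently, and with as few of them as possible, is the problem the paper addresses.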

References

Chandy KM, Lamport L: Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems 3(1):63-75 (1985)
Chase C, Garg VK: Detection of Global Predicates: Techniques and their Limitations. Distributed Computing (DC) 11(4):191-201 (1998)
Cooper R, Marzullo K: Consistent Detection of Global Predicates. In: Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, pp 163-173, Santa Cruz, California, 1991
Cormen TH, Leiserson CE, Rivest RL: Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, 1991
Fidge C: Logical Time in Distributed Computing Systems. IEEE Computer 24(8):28-33 (1991)
Huang Y, Kintala C: Software Implemented Fault Tolerance: Technologies and Experience. In: Proceedings of the IEEE Fault-Tolerant Computing Symposium (FTCS), pp 138-144, June 1993
Hurfin M, Mizuno M, Raynal M, Singhal M: Efficient Distributed Detection of Conjunctions of Local Predicates in Asynchronous Computations. In: Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing (SPDP), pp 588-594, New Orleans, October 1996
Johnson DB, Zwaenepoel W: Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing. In: Proceedings of the 6th ACM Symposium on Principles of Distributed Computing (PODC), pp 171-181, August 1988
Kilgore R, Chase C: Testing Distributed Programs Containing Racing Messages. The Computer Journal 40(8):489-498 (1997)
Lamport L: Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM (CACM) 21(7):558-565 (1978)
LeBlanc TJ, Mellor-Crummey JM: Debugging Programs with Instant Replay. IEEE Transactions on Computers C-36(4):471-482 (1987)
Maggiolo-Schettini A, Wedde H, Winkowski J: Modeling a Solution for a Control Problem in Distributed Systems by Restrictions. Theoretical Computer Science 13(1):61-83 (1981)
Mattern F: Virtual Time and Global States of Distributed Systems. In: Parallel and Distributed Algorithms: Proceedings of the Workshop on Distributed Algorithms (WDAG), pp 215-226. Elsevier Science Publishers B. V. (North-Holland), 1989
Miller BP, Choi J: Breakpoints and Halting in Distributed Programs. In: Proceedings of the 8th IEEE International Conference on Distributed Computing Systems (ICDCS), pp 316-323, 1988
Mittal N, Garg VK: Debugging Distributed Programs Using Controlled Re-execution. In: Proceedings of the 19th ACM Symposium on Principles of Distributed Computing (PODC), pp 239-248, Portland, Oregon, July 2000
Sen A, Garg VK: Detecting Temporal Logic Predicates in the Happened-Before Model. In: Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), Florida, April 2002
Stoller SD, Liu YA: Efficient Symbolic Detection of Global Properties in Distributed Systems. In Hu AJ, Vardi MY (eds) Proceedings of the 10th International Conference on Computer-Aided Verification (CAV), volume 1427 of Lecture Notes in Computer Science (LNCS), pp 357-368. Springer-Verlag, 1998
Stoller SD, Unnikrishnan L, Liu YA: Efficient Detection of Global Properties in Distributed Systems Using Partial-Order Methods. In: Proceedings of the 12th International Conference on Computer-Aided Verification (CAV), volume 1855 of Lecture Notes in Computer Science (LNCS), pp 264-279. Springer-Verlag, July 2000
Tarafdar A: Software Fault Tolerance in Distributed Systems Using Controlled Re-execution. PhD thesis, The University of Texas at Austin, August 2000
Tarafdar A, Garg VK: Predicate Control for Active Debugging of Distributed Programs. In: Proceedings of the 9th IEEE Symposium on Parallel and Distributed Processing (SPDP), pp 763-769, Orlando, 1998
Tarafdar A, Garg VK: Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution. In: Proceedings of the 13th Symposium on Distributed Computing (DISC), pp 210-224, Bratislava, Slovak Republic, September 1999
Torres-Pomales W: Software Fault Tolerance: A Tutorial, 2000. NASA Langley Research Center
Wang Y-M, Huang Y, Fuchs WK, Kintala C, Suri G: Progressive Retry for Software Failure Recovery in Message-Passing Applications. IEEE Transactions on Computers 46(10):1137-1141 (1997)

Author information

Authors and Affiliations

Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083, USA
Neeraj Mittal
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, USA
Vijay K. Garg

Corresponding author

Correspondence to Neeraj Mittal.

Additional information

Received: 19 June 2002, Accepted: 7 November 2003, Published online: 1 March 2004

Vijay K. Garg: Supported in part by the NSF Grants ECS-9907213, CCR-9988225, Texas Education Board Grant ARP-320, an Engineering Foundation Fellowship, and an IBM grant.

A preliminary version of the results in this paper first appeared in [15] (Mittal and Garg, PODC 2000, the fifteenth entry under References).

About this article

Cite this article

Mittal, N., Garg, V.K. Finding missing synchronization in a distributed computation using controlled re-execution. Distrib. Comput. 17, 107–130 (2004). https://doi.org/10.1007/s00446-003-0104-x

Issue Date: August 2004
DOI: https://doi.org/10.1007/s00446-003-0104-x
