Summary
The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The issue of observing distributed computations in a causally consistent way and the basic problems of detecting global predicates are discussed. To illustrate the major difficulties, some typical monitoring and debugging approaches are assessed, and it is demonstrated how their feasibility is severely limited by the fundamental problem to master the complexity of causal relationships.
Similar content being viewed by others
References
Acharya A, Badrinath BR: Recording distributed snapshots based on causal order of message delivery. Inf Process Lett 44: 317–321 (1992)
Ahmad M, Hutto PW, John R: Implementing and programming causal distributed shared memory. Proc 11th Int Conference on Distributed Computing Systems, Arlington, Texas, pp 274–281, 1991
Ahuja M, Carlson T, Gahlot A, Shands D: Timestamping events for inferring ‘affects’ relation and potential causality. Proc 15th IEEE Int Computer Software and Application Conference COMPSAC 91, Tokyo, pp 606–611, 1991
Baldy P, Dicky H, Medina R, Morvan M, Vilarem JF: Efficient reconstruction of the causal relationship in distributed computations. Technical Report 92-013, Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, Montpellier 1992
Bates PC, Wileden JC: High-level debugging of distributed systems: the behavioral abstraction approach. J Syst Softw 4(3): 255–264 (1983)
Birman K, Joseph T: Exploiting virtual synchrony in distributed systems. Operating Syst Rev 22(1): 123–138 (1987)
Birman K: The process group approach to reliable distributed computing. Technical Report, Computer Science Department, Cornell University, Ithaca, New York 1991
Birman K, Schiper A, Stephenson P: Lightweight causal and atomic group multicast. ACM Trans Comput Syst 9(3): 272–314 (1991)
Bruegge B, Hibbard P: Generalized Path Expressions. J Syst Softw 2(2): 265–276 (1983)
Chandy KM, Lamport L: Distributed snapshots: determining global states of distributed systems. ACM Trans Comput Syst 31(1): 63–75 (1985)
Charron-Bost B: Combinatorics and geometry of consistent cuts: application to concurrency theory. In: Bermond JC, Raynal M (eds) Distributed algorithms. LNCS, vol 392. Springer, Berlin Heidelberg New York 1989, pp 45–56
Charron-Bost B: Concerning the size of logical clocks in distributed systems. Inf Process Lett 39: 11–16 (1991)
Charron-Bost B, Delporte-Gallet C, Fauconnier H: Local and temporal predicates in distributed systems. Technical Report, LITP, IBP, Université Paris 7, Paris 1992
Charron-Bost B, Mattern F, Tel G: Synchronous, asynchronous, and causally ordered communication. Distrib Comput (to appear)
Cooper R, Marzullo K: Consistent detection of global predicates. Proc ACM/ONR Workshop on Parallel and Distributed Debugging, Santa Cruz, California 1991, pp 163–173
Diehl C, Jard C: Interval approximations and message causality in distributed systems. In: Finkel A, Jantzen M (eds) Proc of the 9th Annual Symposium on Theoretical Aspects of Computer Science STACS '92, LNCS, vol 577, Springer, Berlin Heidelberg New York 1992, pp 363–374
Fidge CJ: Timestamps in message-passing systems that preserve the partial ordering. Proc 11th Australian Computer Science Conference, University of Queensland, pp 55–66, 1988
Fidge CJ: Partial orders for parallel debugging. ACM SIGPLAN Notices 24(1): 183–194 (1989)
Fidge CJ: Dynamic analysis of event orderings in message passing systems. PhD Thesis, Department of Computer Science, Australian National University, Canberra 1989
Fidge CJ: Logical time in distributed computing systems. IEEE Comput 24(8): 28–33 (1991)
Fischer MJ, Michael A: Sacrificing serializability to attain high availability of data in an unreliable network. Proc ACM SIGACT-SIGOPS Symposium on Principles of Database Systems, pp 70–75, 1982
Fishburn PC: Interval orders and interval graphs. Wiley, New York 1985
Floyd RW: Algorithm 97, shortest path. Commun ACM 5: 345 (1962)
Fowler J, Zwaenepoel W: Causal distributed breakpoints. Proc 10th Int Conference on Distributed Computing Systems, Paris, pp 134–141, 1990
Gait J: A probe effect in concurrent programs. Softw Pract Exper 16 (3): 225–233 (1986)
Garg VK, Waldecker B: Detection of unstable predicates in distributed programs. In: Shyamasundra R (ed) Proc 12th Conference on the Foundation of Software Technology and Theoretical Computer Science, LNCS, vol 652. Springer, Berlin Heidelberg New York 1992, pp 253–264
Haban D, Weigel W: Global events and global breakpoints in distributed systems. Proc 21st Annual Hawaii Int Conference on System Sciences, pp 166–175, 1988
Haban D, Zhou S, Maurer D, Wilhelm R: Specification and detection of global breakpoints in distributed systems. Technical Report SFB124-08/1991, Universität des Saarlandes, Saarbrücken 1991
Harel D, Pnueli A: On the development of reactive systems. In: Apt K (ed) Logics and models of concurrent systems, NATO ASI Series F, vol 13. Springer, Berlin Heidelberg New York 1985, pp 477–498
Hseush W, Kaiser GE: Modeling concurrency in parallel debugging. ACM SIGPLAN Notices 25(3): 11–20 (1990)
Hutto P, Ahamad M: Slow memory: weakening consistency to enhance concurrency in distributed shared memories. Proc 10th Int Conference on Distributed Computing Systems, Paris, pp 302–309, 1990
Johnson DB, Zwaenepoel W: Recovery in distributed systems using optimistic message logging and checkpointing. J Algorithms 11(3): 462–491 (1990)
Katz S, Peled D: Interleaving set temporal logic. Theor Comput Sci 75: 263–287 (1990)
Katz S, Peled D: Verification of distributed programs using representative interleaving sequences. Distrib Comput 6: 107–120 (1992)
Kearns JP, Koodalattupuram B: Immediate ordered service in distributed systems. Proc 9th Int Conference on Distributed Computing Systems, Newport Beach, California, 611–618, 1989
Lamport L: Time, clocks, and the ordering of events in a distributed system. Commun ACM 21 (7): 558–565 (1978)
LeBlanc TJ, Mellor-Crummey JM: Debugging parallel programs with instant replay. IEEE Trans Comput 36(4): 471–482 (1987)
Leu E, Schiper A, Zramdini A: Efficient excution replay for distributed memory architectures. In: Bode A (ed) Proc 2nd European Distributed Memory Computing Conference, Munich, Germany, LNCS, vol 487. Springer, Berlin Heidelberg New York 1991, pp 315–324
Manabe Y, Imase M: Global conditions in debugging distributed programs. J Parallel Distrib Comput 15: 62–69 (1992)
Manna Z, Pnueli A: The temporal logic of reactive and concurrent systems. Springer, Berlin Heidelberg New York 1992
Marzullo K, Neiger G: Detection of global state predicates. In: Toueg S, Spirakis PG, Kirousis L (eds) Proc 5th Workshop on Distributed Algorithms (WDAG-91), Delphi, Greece, LNCS, vol 579. Springer, Berlin Heidelberg New York 1991, pp 254–272
Masuzawa T, Tokura N: A causal distributed breakpoint algorithm for distributed debugger. Proc ICICE Fall Conf (SD-1-8) 6: 373–374 (1992)
Mattern F: Algorithms for distributed termination detection. Distrib Comput 2: 161–175 (1987)
Mattern F: Virtual time and global states in distributed systems. In: Cosnard M et al. (eds) Proc Workshop on Parallel and Distributed Algorithms, Chateau de Bonas, Oct. 1988, Elsevier, North Holland, 1989, pp 215–226
Mattern F: Efficient algorithms for distributed snapshots and global virtual time approximation. J Parallel Distrib Comput 18: 423–434 (1993)
Medina R: Incremental garbage collection of causal relationship computation in distributed systems. Proc 5th IEEE Symposium on Parallel and Distributed Processing, Irving, Texas, 1993
Meldal S, Sankar S, Vera J: Exploiting locality in maintaining potential causality. Proc 10th Annual ACM Symposium on Principles of Distributed Computing, Montreal, Canada, pp 231–239, 1991
Miller BP, Choi JD: Breakpoints and halting in distributed programs. Proc 8th Int Conference on Distributed Computing Systems, pp 316–323, 1988
Minkowski H: Raum und Zeit. In: Lorentz HA, Einstein A, Minkowski H: Das Relativitätsprinzip. Eine Sammlung von Abhandlungen. Teubner, Leipzig 1915, pp 56–68
Netzer RHB, Miller BP: Optimal tracing and replay for debugging message-passing parallel programs. Proc Supercomputing '92, Minneapolis, pp 502–511, 1992
Nielsen M, Plotkin G, Winskel G: Petri nets, event structures and domains, part I. Theor Comput Sci 13: 85–108 (1981)
Ochmanski E: Inevitability in concurrent systems. Inf Process Lett 25: 221–225 (1987)
Panangaden P, Taylor K: Concurrent common knowledge: defining agreement for asynchronous systems. Distrib Comput 6(2): 73–93 (1992)
Ponamgi MK, Hseush W, Kaiser GE: Debugging multithreaded programs with MPD. IEEE Software, 8(3): 37–43 (1991)
Pratt V: Modeling concurrency with partial orders. Int J Parallel Program 15(1): 33–71 (1986)
Pratt V: Modeling concurrency with geometry. Proc 18th Annual Symposium on Principles of Programming Languages (POPL-91), pp 311–322, 1991
Pratt V: Arithmetic+Logic+Geometry=Concurrency. In: Simon I (ed) Proc LATIN '92, LNCS, vol 583. Springer, Berlin Heidelberg New York 1992, pp 430–447
Raynal M, Schiper A, Toueg S: The causal ordering abstraction and a simple way to implement it. Inf Process Lett 39: 343–350 (1991)
Reisig W: A strong part of concurreney. In: Rozenberg G (ed) Advances in Petri nets, LNCS, vol 266. Springer, Berlin Heidelberg New York 1987, pp 238–272
Reisig W: Temporal logic and causality in concurrent systems. In: Vogt FH (ed) Proc Concurrency '88, LNCS, vol 335. Springer, Berlin Heidelberg New York 1988, pp 121–139
Reisig W: Parallel composition of liveness. Technical Report SFB342/30/91A, Technische Universität München, München 1991
van Renesse R: Causal controversy at Le Mont St.-Michel. ACM Operating Syst Rev 27(2):44–53 (1993)
Sandoz A, Schiper A: A characterization of consistent distributed snapshots using causal order. Technical Report 92-14, Département d'Informatique, Ecole Polytechnique Fédérale de Lausanne, Lausanne 1992
Schiper A, Eggli J, Sandoz A: A new algorithm to implement causal ordering. In: Bermond JC, Raynal M (eds) Proc Workshop on Distributed Algorithms, Nice, France, LNCS, vol 392. Springer, Berlin Heidelberg New York 1989, pp 219–232
Singhal M, Kshemkalyani A: An efficient implementation of vector clocks. Inf Process Lett 43: 47–52 (1992)
Spezialetti M, Kearns JP: Simultaneous regions: a framework for the consistent monitoring of distributed systems. Proc 9th Int Conference on Distributed Computing Systems, Newport Beach, California, pp 61–68, 1989
Stone JM: Visualizing concurrent processes. Technical Report RC 12973, IBM TJ Watson Research Center 1987
Stone JM: Debugging concurrent processes: a case study. Proc SIGPLAN Conf on Programming Language Design and Implementation, Atlanta, Georgia, pp 145–153, 1988
Stone JM: A graphical representation of concurrent processes. ACM SIGPLAN Notices 24(1): 226–235 (1989)
Strom R, Yemini S: Optimistic recovery in distributed systems. ACM Trans Comput Syst 3 (3): 204–226 (1985)
Szpilrajn E: Sur l'extension de l'ordre partiel. Fund Math 16: 386–389 (1930)
Warshall S: A theorem on Boolean matrices. J ACM 9: 11–12 (1962)
Winskel G: An introduction to event structures. In: de Bakker JW, de Roever WP, Rozenberg G (eds) Proc Workshop on Linear Time, Branching Time and Partial Order in Logics and Models for Concurrency, Noordwijkerhout, The Netherlands, LNCS, vol 354. Springer, Berlin Heidelberg New York 1988, pp 364–397
Wuu GTJ, Bernstein AJ: Efficient solutions to the replicated log and dictionary problems. Proc ACM Symposium on Principles of Distributed Computing, pp 233–242, 1984
Author information
Authors and Affiliations
Additional information
Reinhard Schwarz received a diploma in computer science from the University of Kaiserslautern, Germany, in 1990. Since then, he is working as a research assistant at the computer science department. His research interests include debugging and monitoring of distributed systems, runtime support for object-oriented distributed programming, and distributed algorithms.
Friedemann Mattern received the diploma in computer science from Bonn University, Germany, and the Ph.D. degree from the University of Kaiserslautern, Germany, in 1983 and 1989, respectively. Since 1991 he is a professor of computer science at the University of Saarland in Saarbrücken, Germany. His current research interests include programming of distributed systems, distributed applications, and distributed algorithms.
The work presented in this paper was carried out as part of the PARAWAN project supported by the Bundesministerium für Forschung und Technologie (BMFT)
Rights and permissions
About this article
Cite this article
Schwarz, R., Mattern, F. Detecting causal relationships in distributed computations: In search of the holy grail. Distrib Comput 7, 149–174 (1994). https://doi.org/10.1007/BF02277859
Received:
Accepted:
Issue Date:
DOI: https://doi.org/10.1007/BF02277859