Skip to main content
Log in

Detecting causal relationships in distributed computations: In search of the holy grail

  • Published:
Distributed Computing Aims and scope Submit manuscript

Summary

The paper shows that characterizing the causal relationship between significant events is an important but non-trivial aspect for understanding the behavior of distributed programs. An introduction to the notion of causality and its relation to logical time is given; some fundamental results concerning the characterization of causality are presented. Recent work on the detection of causal relationships in distributed computations is surveyed. The issue of observing distributed computations in a causally consistent way and the basic problems of detecting global predicates are discussed. To illustrate the major difficulties, some typical monitoring and debugging approaches are assessed, and it is demonstrated how their feasibility is severely limited by the fundamental problem to master the complexity of causal relationships.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Acharya A, Badrinath BR: Recording distributed snapshots based on causal order of message delivery. Inf Process Lett 44: 317–321 (1992)

    Google Scholar 

  2. Ahmad M, Hutto PW, John R: Implementing and programming causal distributed shared memory. Proc 11th Int Conference on Distributed Computing Systems, Arlington, Texas, pp 274–281, 1991

  3. Ahuja M, Carlson T, Gahlot A, Shands D: Timestamping events for inferring ‘affects’ relation and potential causality. Proc 15th IEEE Int Computer Software and Application Conference COMPSAC 91, Tokyo, pp 606–611, 1991

  4. Baldy P, Dicky H, Medina R, Morvan M, Vilarem JF: Efficient reconstruction of the causal relationship in distributed computations. Technical Report 92-013, Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, Montpellier 1992

    Google Scholar 

  5. Bates PC, Wileden JC: High-level debugging of distributed systems: the behavioral abstraction approach. J Syst Softw 4(3): 255–264 (1983)

    Google Scholar 

  6. Birman K, Joseph T: Exploiting virtual synchrony in distributed systems. Operating Syst Rev 22(1): 123–138 (1987)

    Google Scholar 

  7. Birman K: The process group approach to reliable distributed computing. Technical Report, Computer Science Department, Cornell University, Ithaca, New York 1991

    Google Scholar 

  8. Birman K, Schiper A, Stephenson P: Lightweight causal and atomic group multicast. ACM Trans Comput Syst 9(3): 272–314 (1991)

    Google Scholar 

  9. Bruegge B, Hibbard P: Generalized Path Expressions. J Syst Softw 2(2): 265–276 (1983)

    Google Scholar 

  10. Chandy KM, Lamport L: Distributed snapshots: determining global states of distributed systems. ACM Trans Comput Syst 31(1): 63–75 (1985)

    Google Scholar 

  11. Charron-Bost B: Combinatorics and geometry of consistent cuts: application to concurrency theory. In: Bermond JC, Raynal M (eds) Distributed algorithms. LNCS, vol 392. Springer, Berlin Heidelberg New York 1989, pp 45–56

    Google Scholar 

  12. Charron-Bost B: Concerning the size of logical clocks in distributed systems. Inf Process Lett 39: 11–16 (1991)

    Google Scholar 

  13. Charron-Bost B, Delporte-Gallet C, Fauconnier H: Local and temporal predicates in distributed systems. Technical Report, LITP, IBP, Université Paris 7, Paris 1992

    Google Scholar 

  14. Charron-Bost B, Mattern F, Tel G: Synchronous, asynchronous, and causally ordered communication. Distrib Comput (to appear)

  15. Cooper R, Marzullo K: Consistent detection of global predicates. Proc ACM/ONR Workshop on Parallel and Distributed Debugging, Santa Cruz, California 1991, pp 163–173

  16. Diehl C, Jard C: Interval approximations and message causality in distributed systems. In: Finkel A, Jantzen M (eds) Proc of the 9th Annual Symposium on Theoretical Aspects of Computer Science STACS '92, LNCS, vol 577, Springer, Berlin Heidelberg New York 1992, pp 363–374

    Google Scholar 

  17. Fidge CJ: Timestamps in message-passing systems that preserve the partial ordering. Proc 11th Australian Computer Science Conference, University of Queensland, pp 55–66, 1988

  18. Fidge CJ: Partial orders for parallel debugging. ACM SIGPLAN Notices 24(1): 183–194 (1989)

    Google Scholar 

  19. Fidge CJ: Dynamic analysis of event orderings in message passing systems. PhD Thesis, Department of Computer Science, Australian National University, Canberra 1989

    Google Scholar 

  20. Fidge CJ: Logical time in distributed computing systems. IEEE Comput 24(8): 28–33 (1991)

    Google Scholar 

  21. Fischer MJ, Michael A: Sacrificing serializability to attain high availability of data in an unreliable network. Proc ACM SIGACT-SIGOPS Symposium on Principles of Database Systems, pp 70–75, 1982

  22. Fishburn PC: Interval orders and interval graphs. Wiley, New York 1985

    Google Scholar 

  23. Floyd RW: Algorithm 97, shortest path. Commun ACM 5: 345 (1962)

    Google Scholar 

  24. Fowler J, Zwaenepoel W: Causal distributed breakpoints. Proc 10th Int Conference on Distributed Computing Systems, Paris, pp 134–141, 1990

  25. Gait J: A probe effect in concurrent programs. Softw Pract Exper 16 (3): 225–233 (1986)

    Google Scholar 

  26. Garg VK, Waldecker B: Detection of unstable predicates in distributed programs. In: Shyamasundra R (ed) Proc 12th Conference on the Foundation of Software Technology and Theoretical Computer Science, LNCS, vol 652. Springer, Berlin Heidelberg New York 1992, pp 253–264

    Google Scholar 

  27. Haban D, Weigel W: Global events and global breakpoints in distributed systems. Proc 21st Annual Hawaii Int Conference on System Sciences, pp 166–175, 1988

  28. Haban D, Zhou S, Maurer D, Wilhelm R: Specification and detection of global breakpoints in distributed systems. Technical Report SFB124-08/1991, Universität des Saarlandes, Saarbrücken 1991

    Google Scholar 

  29. Harel D, Pnueli A: On the development of reactive systems. In: Apt K (ed) Logics and models of concurrent systems, NATO ASI Series F, vol 13. Springer, Berlin Heidelberg New York 1985, pp 477–498

    Google Scholar 

  30. Hseush W, Kaiser GE: Modeling concurrency in parallel debugging. ACM SIGPLAN Notices 25(3): 11–20 (1990)

    Google Scholar 

  31. Hutto P, Ahamad M: Slow memory: weakening consistency to enhance concurrency in distributed shared memories. Proc 10th Int Conference on Distributed Computing Systems, Paris, pp 302–309, 1990

  32. Johnson DB, Zwaenepoel W: Recovery in distributed systems using optimistic message logging and checkpointing. J Algorithms 11(3): 462–491 (1990)

    Google Scholar 

  33. Katz S, Peled D: Interleaving set temporal logic. Theor Comput Sci 75: 263–287 (1990)

    Google Scholar 

  34. Katz S, Peled D: Verification of distributed programs using representative interleaving sequences. Distrib Comput 6: 107–120 (1992)

    Google Scholar 

  35. Kearns JP, Koodalattupuram B: Immediate ordered service in distributed systems. Proc 9th Int Conference on Distributed Computing Systems, Newport Beach, California, 611–618, 1989

  36. Lamport L: Time, clocks, and the ordering of events in a distributed system. Commun ACM 21 (7): 558–565 (1978)

    Google Scholar 

  37. LeBlanc TJ, Mellor-Crummey JM: Debugging parallel programs with instant replay. IEEE Trans Comput 36(4): 471–482 (1987)

    Google Scholar 

  38. Leu E, Schiper A, Zramdini A: Efficient excution replay for distributed memory architectures. In: Bode A (ed) Proc 2nd European Distributed Memory Computing Conference, Munich, Germany, LNCS, vol 487. Springer, Berlin Heidelberg New York 1991, pp 315–324

    Google Scholar 

  39. Manabe Y, Imase M: Global conditions in debugging distributed programs. J Parallel Distrib Comput 15: 62–69 (1992)

    Google Scholar 

  40. Manna Z, Pnueli A: The temporal logic of reactive and concurrent systems. Springer, Berlin Heidelberg New York 1992

    Google Scholar 

  41. Marzullo K, Neiger G: Detection of global state predicates. In: Toueg S, Spirakis PG, Kirousis L (eds) Proc 5th Workshop on Distributed Algorithms (WDAG-91), Delphi, Greece, LNCS, vol 579. Springer, Berlin Heidelberg New York 1991, pp 254–272

    Google Scholar 

  42. Masuzawa T, Tokura N: A causal distributed breakpoint algorithm for distributed debugger. Proc ICICE Fall Conf (SD-1-8) 6: 373–374 (1992)

    Google Scholar 

  43. Mattern F: Algorithms for distributed termination detection. Distrib Comput 2: 161–175 (1987)

    Google Scholar 

  44. Mattern F: Virtual time and global states in distributed systems. In: Cosnard M et al. (eds) Proc Workshop on Parallel and Distributed Algorithms, Chateau de Bonas, Oct. 1988, Elsevier, North Holland, 1989, pp 215–226

  45. Mattern F: Efficient algorithms for distributed snapshots and global virtual time approximation. J Parallel Distrib Comput 18: 423–434 (1993)

    Google Scholar 

  46. Medina R: Incremental garbage collection of causal relationship computation in distributed systems. Proc 5th IEEE Symposium on Parallel and Distributed Processing, Irving, Texas, 1993

  47. Meldal S, Sankar S, Vera J: Exploiting locality in maintaining potential causality. Proc 10th Annual ACM Symposium on Principles of Distributed Computing, Montreal, Canada, pp 231–239, 1991

  48. Miller BP, Choi JD: Breakpoints and halting in distributed programs. Proc 8th Int Conference on Distributed Computing Systems, pp 316–323, 1988

  49. Minkowski H: Raum und Zeit. In: Lorentz HA, Einstein A, Minkowski H: Das Relativitätsprinzip. Eine Sammlung von Abhandlungen. Teubner, Leipzig 1915, pp 56–68

    Google Scholar 

  50. Netzer RHB, Miller BP: Optimal tracing and replay for debugging message-passing parallel programs. Proc Supercomputing '92, Minneapolis, pp 502–511, 1992

  51. Nielsen M, Plotkin G, Winskel G: Petri nets, event structures and domains, part I. Theor Comput Sci 13: 85–108 (1981)

    Google Scholar 

  52. Ochmanski E: Inevitability in concurrent systems. Inf Process Lett 25: 221–225 (1987)

    Google Scholar 

  53. Panangaden P, Taylor K: Concurrent common knowledge: defining agreement for asynchronous systems. Distrib Comput 6(2): 73–93 (1992)

    Google Scholar 

  54. Ponamgi MK, Hseush W, Kaiser GE: Debugging multithreaded programs with MPD. IEEE Software, 8(3): 37–43 (1991)

    Google Scholar 

  55. Pratt V: Modeling concurrency with partial orders. Int J Parallel Program 15(1): 33–71 (1986)

    Google Scholar 

  56. Pratt V: Modeling concurrency with geometry. Proc 18th Annual Symposium on Principles of Programming Languages (POPL-91), pp 311–322, 1991

  57. Pratt V: Arithmetic+Logic+Geometry=Concurrency. In: Simon I (ed) Proc LATIN '92, LNCS, vol 583. Springer, Berlin Heidelberg New York 1992, pp 430–447

    Google Scholar 

  58. Raynal M, Schiper A, Toueg S: The causal ordering abstraction and a simple way to implement it. Inf Process Lett 39: 343–350 (1991)

    Google Scholar 

  59. Reisig W: A strong part of concurreney. In: Rozenberg G (ed) Advances in Petri nets, LNCS, vol 266. Springer, Berlin Heidelberg New York 1987, pp 238–272

    Google Scholar 

  60. Reisig W: Temporal logic and causality in concurrent systems. In: Vogt FH (ed) Proc Concurrency '88, LNCS, vol 335. Springer, Berlin Heidelberg New York 1988, pp 121–139

    Google Scholar 

  61. Reisig W: Parallel composition of liveness. Technical Report SFB342/30/91A, Technische Universität München, München 1991

    Google Scholar 

  62. van Renesse R: Causal controversy at Le Mont St.-Michel. ACM Operating Syst Rev 27(2):44–53 (1993)

    Google Scholar 

  63. Sandoz A, Schiper A: A characterization of consistent distributed snapshots using causal order. Technical Report 92-14, Département d'Informatique, Ecole Polytechnique Fédérale de Lausanne, Lausanne 1992

    Google Scholar 

  64. Schiper A, Eggli J, Sandoz A: A new algorithm to implement causal ordering. In: Bermond JC, Raynal M (eds) Proc Workshop on Distributed Algorithms, Nice, France, LNCS, vol 392. Springer, Berlin Heidelberg New York 1989, pp 219–232

    Google Scholar 

  65. Singhal M, Kshemkalyani A: An efficient implementation of vector clocks. Inf Process Lett 43: 47–52 (1992)

    Google Scholar 

  66. Spezialetti M, Kearns JP: Simultaneous regions: a framework for the consistent monitoring of distributed systems. Proc 9th Int Conference on Distributed Computing Systems, Newport Beach, California, pp 61–68, 1989

  67. Stone JM: Visualizing concurrent processes. Technical Report RC 12973, IBM TJ Watson Research Center 1987

  68. Stone JM: Debugging concurrent processes: a case study. Proc SIGPLAN Conf on Programming Language Design and Implementation, Atlanta, Georgia, pp 145–153, 1988

  69. Stone JM: A graphical representation of concurrent processes. ACM SIGPLAN Notices 24(1): 226–235 (1989)

    Google Scholar 

  70. Strom R, Yemini S: Optimistic recovery in distributed systems. ACM Trans Comput Syst 3 (3): 204–226 (1985)

    Google Scholar 

  71. Szpilrajn E: Sur l'extension de l'ordre partiel. Fund Math 16: 386–389 (1930)

    Google Scholar 

  72. Warshall S: A theorem on Boolean matrices. J ACM 9: 11–12 (1962)

    Google Scholar 

  73. Winskel G: An introduction to event structures. In: de Bakker JW, de Roever WP, Rozenberg G (eds) Proc Workshop on Linear Time, Branching Time and Partial Order in Logics and Models for Concurrency, Noordwijkerhout, The Netherlands, LNCS, vol 354. Springer, Berlin Heidelberg New York 1988, pp 364–397

    Google Scholar 

  74. Wuu GTJ, Bernstein AJ: Efficient solutions to the replicated log and dictionary problems. Proc ACM Symposium on Principles of Distributed Computing, pp 233–242, 1984

Download references

Author information

Authors and Affiliations

Authors

Additional information

Reinhard Schwarz received a diploma in computer science from the University of Kaiserslautern, Germany, in 1990. Since then, he is working as a research assistant at the computer science department. His research interests include debugging and monitoring of distributed systems, runtime support for object-oriented distributed programming, and distributed algorithms.

Friedemann Mattern received the diploma in computer science from Bonn University, Germany, and the Ph.D. degree from the University of Kaiserslautern, Germany, in 1983 and 1989, respectively. Since 1991 he is a professor of computer science at the University of Saarland in Saarbrücken, Germany. His current research interests include programming of distributed systems, distributed applications, and distributed algorithms.

The work presented in this paper was carried out as part of the PARAWAN project supported by the Bundesministerium für Forschung und Technologie (BMFT)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Schwarz, R., Mattern, F. Detecting causal relationships in distributed computations: In search of the holy grail. Distrib Comput 7, 149–174 (1994). https://doi.org/10.1007/BF02277859

Download citation

  • Received:

  • Accepted:

  • Issue Date:

  • DOI: https://doi.org/10.1007/BF02277859

Key words

Navigation