Real-Time Systems, Volume 6, Issue 3, pp 289–316

Replica determinism in distributed real-time systems: A brief survey

  • Stefan Poledna
Article

Abstract

Replication of entities is a convenient technique for achieving fault tolerance. The problem of replica determinism is to ensure that replicated entities show consistent behavior in the absence of failures. Possible sources of replica non-determinism, as well as basic requirements and strategies for enforcing replica determinism, are presented. The problem of enforcing replica determinism under real-time constraints is surveyed in the context of the communication problem for distributed systems. Furthermore, the close interdependence between replica determinism on the one hand and synchronization strategies, failure handling, and redundancy preservation on the other is reviewed. The impact of synchronous or asynchronous approaches on replication strategies is also discussed.

Copyright information

© Kluwer Academic Publishers 1994

Authors and Affiliations

  • Stefan Poledna (1)
  1. Institut für Technische Informatik, Technische Universität Wien, Vienna, Austria
