Dynamic Fault Tolerance in Distributed Simulation System

  • Min Ma
  • Shiyao Jin
  • Chaoqun Ye
  • Xiaojian Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3991)


Distributed simulation systems are widely used for forecasting, decision-making, and scientific computing, and multi-agent systems and Grids have been used as simulation platforms. To survive software or hardware failures and to guarantee a high success rate during agent migration, such a system must solve the fault tolerance problem. Classic fault tolerance techniques such as checkpointing and redundancy can be applied to distributed simulation systems, but they are not efficient. We present a novel fault tolerance protocol that combines causal message logging with the primary-backup technique. The proposed protocol uses an iterative backup location scheme and an adaptive update interval to reduce overhead and to balance the cost of fault tolerance against recovery time. The protocol creates no orphan states and does not require surviving agents to roll back. Most importantly, the recovery scheme can tolerate concurrent failures, and even the permanent failure of a single node. The correctness of the protocol is proved, and experiments show that the protocol is efficient.
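
As a rough illustration of the general idea of causal message logging combined with a backup checkpoint, the following Python sketch may help. It is not the protocol from the paper: the class names (`Determinant`, `Message`, `Agent`), the `recover` function, and the toy order-sensitive state update are all assumptions made for this example. Each message piggybacks the delivery records (determinants) its sender knows, so surviving agents can tell a restarted agent in which order to replay its messages, and no survivor has to roll back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Determinant:
    """Records that `receiver` delivered a given message as its recv_seq-th delivery."""
    receiver: str
    sender: str
    send_seq: int
    recv_seq: int


@dataclass(frozen=True)
class Message:
    sender: str
    send_seq: int
    payload: int
    determinants: frozenset  # determinants piggybacked by the sender


class Agent:
    """Hypothetical simulation agent with an order-sensitive state."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.send_seq = 0
        self.recv_seq = 0
        self.known = set()   # determinants this agent has learned so far
        self.sent_log = []   # sender-side copies: (destination name, message)

    def send(self, dest, payload):
        self.send_seq += 1
        msg = Message(self.name, self.send_seq, payload, frozenset(self.known))
        self.sent_log.append((dest.name, msg))
        dest.deliver(msg)

    def deliver(self, msg):
        self.recv_seq += 1
        self.known |= msg.determinants
        self.known.add(
            Determinant(self.name, msg.sender, msg.send_seq, self.recv_seq)
        )
        # Toy update whose result depends on delivery order.
        self.state = self.state * 31 + msg.payload


def recover(name, checkpoint, survivors):
    """Rebuild a failed agent from a backup checkpoint (state, recv_seq)
    by replaying logged messages in the order the survivors remember."""
    agent = Agent(name)
    agent.state, agent.recv_seq = checkpoint
    dets = {
        d for s in survivors for d in s.known
        if d.receiver == name and d.recv_seq > agent.recv_seq
    }
    logged = {
        (m.sender, m.send_seq): m
        for s in survivors for dest, m in s.sent_log if dest == name
    }
    for d in sorted(dets, key=lambda d: d.recv_seq):
        agent.deliver(logged[(d.sender, d.send_seq)])
    return agent


# Usage: b receives from a and c, sends to a, then fails.
a, b, c = Agent("a"), Agent("b"), Agent("c")
a.send(b, 1)
c.send(b, 2)
b.send(a, 7)  # piggybacks b's two determinants to a
recovered = recover("b", (0, 0), [a, c])
print(recovered.state == b.state)
```

In this sketch the checkpoint `(0, 0)` stands in for the copy held on a backup node. A delivery whose determinant never reached a survivor cannot be replayed in order, but by construction no survivor depends on it either, which is why this style of logging avoids orphan states.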


Simulation System; Fault Tolerance; Recovery Protocol; Orphan State; Simulation Node
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Min Ma (1)
  • Shiyao Jin (1)
  • Chaoqun Ye (1)
  • Xiaojian Liu (1)
  1. School of Computer Science, National University of Defense Technology, Hunan, China
