A Survey on Fault Management Techniques in Distributed Computing

  • Selvani Deepthi Kavila
  • G. S. V. Prasada Raju
  • Suresh Chandra Satapathy
  • Alekhya Machiraju
  • G. V. L. Kinnera
  • K. Rasly
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 199)


With the rapid growth of distributed computing systems, faults are growing in scale as well, despite the many fault detection techniques that have been proposed. Designing and implementing distributed computing systems is challenging because of their ever-increasing scale and complexity. A distributed system that becomes faulty, for whatever reason, while executing its processes can cause damage. A fault management system helps a distributed system by detecting malfunctions, errors, and faults. We investigate different fault tolerance techniques used in real-time distributed systems, concentrating on the types of faults, the fault detection techniques, and the recovery techniques used. Link failures, resource failures, and any other failures must be detected and rectified so that the system works accurately and without disruption. Fault management applications are thereby enabled to determine the root cause of a distributed system failure automatically. To address fault detection in distributed systems, we propose combining proactive and reactive techniques in an expert system for managing faults.


Keywords: Distributed computing systems; Fault management; Reactive and proactive





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Selvani Deepthi Kavila (1)
  • G. S. V. Prasada Raju (2)
  • Suresh Chandra Satapathy (1)
  • Alekhya Machiraju (1)
  • G. V. L. Kinnera (1)
  • K. Rasly (1)
  1. Department of Computer Science and Engineering, Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, India
  2. Department of Computer Science, SDE, Andhra University, Visakhapatnam, India
