An Eminent Approach of Fault Management Using Proactive and Reactive Techniques in Distributed Computing
Many existing distributed systems perform remote monitoring and even generate alarms when a fault occurs. However, they fail to pinpoint the exact location of the fault or to automatically execute appropriate fault recovery actions. In large distributed systems with many computing nodes, finding the cause of a failure by exhaustive search is therefore time-consuming, and it is difficult to resolve problems quickly. When a failure occurs, fault messages may be lost or delayed, and a single failure may trigger a number of unreliable alarms. A well-designed fault management system should therefore work efficiently even when the available data is redundant and incomplete. We propose a system comprising several components, including a fault detection engine for which various techniques for detecting faults in distributed computing are proposed. Once faults are detected, the system can be diagnosed to track down the root cause. For diagnosis, we implement an expert system, and we combine reactive and proactive techniques for fault management.
Keywords: Distributed Systems, Distributed computing, Fault Management, Expert System, Reactive, Proactive
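As an illustration of the kind of design the abstract describes, the following is a minimal sketch, not the authors' implementation. It assumes a small hypothetical rule base (`RULES`), a hypothetical `Alarm` record, and invented symptom names. `diagnose` plays the role of a rule-based expert system reacting to alarms that may be duplicated or incomplete, while `proactive_warnings` stands in for a proactive check that flags nodes before they fail. The utilisation threshold is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record: which node reported which symptom."""
    node: str
    symptom: str


# Hypothetical rule base, ordered from most to least specific.
# Each rule pairs a set of required symptoms with a diagnosed root cause.
RULES = [
    ({"heartbeat_missed", "link_down"}, "network partition"),
    ({"high_latency", "disk_errors"}, "failing disk"),
    ({"heartbeat_missed"}, "node crash"),
]


def diagnose(alarms):
    """Reactive step: map each node to a root cause using the rule base.

    Duplicate alarms collapse because symptoms are stored in a set.
    Nodes whose symptoms match no rule (incomplete data) are reported
    as "unknown" rather than guessed.
    """
    symptoms_by_node = {}
    for alarm in alarms:
        symptoms_by_node.setdefault(alarm.node, set()).add(alarm.symptom)

    diagnosis = {}
    for node, symptoms in symptoms_by_node.items():
        diagnosis[node] = "unknown"
        for required, cause in RULES:
            if required <= symptoms:  # all required symptoms were observed
                diagnosis[node] = cause
                break
    return diagnosis


def proactive_warnings(utilisation, threshold=0.9):
    """Proactive step: list nodes at or above an assumed utilisation threshold."""
    return sorted(node for node, load in utilisation.items() if load >= threshold)


if __name__ == "__main__":
    alarms = [
        Alarm("n1", "heartbeat_missed"),
        Alarm("n1", "heartbeat_missed"),  # redundant alarm
        Alarm("n1", "link_down"),
        Alarm("n2", "heartbeat_missed"),
        Alarm("n3", "high_latency"),      # incomplete evidence
    ]
    print(diagnose(alarms))
    print(proactive_warnings({"n1": 0.95, "n2": 0.40, "n3": 0.90}))
```

Ordering the rules from most to least specific lets the first match win, which is one simple way to keep a general rule from masking a more precise diagnosis. A production rule engine would typically need explicit conflict resolution and handling of delayed or lost fault messages.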