Removal of all faulty nodes from a fault-tolerant service by means of distributed diagnosis with imperfect fault coverage
In general, offering a fault-tolerant service boils down to executing replicas of a service process on different nodes in a distributed system. The service is fault-tolerant in the sense that it is still performed correctly even if some of the nodes on which a replica resides behave maliciously. To guarantee the correctness of a fault-tolerant service despite the presence of maliciously functioning nodes, it is of key importance that all faulty nodes are removed from the service in a timely manner. Faulty nodes are detected by tests performed by the nodes offering the service. In practice, tests always have imperfect fault coverage. This paper describes a distributed diagnosis algorithm with imperfect tests, by means of which all detectably faulty nodes are removed from a fault-tolerant service. Inevitably, however, this may also remove a number of correctly functioning nodes from the service. The maximum number of correctly functioning nodes removed by the algorithm is calculated. Finally, the minimum number of nodes that a fault-tolerant service needs in order to perform this diagnosis algorithm is given.
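The trade-off described above, in which removing faulty nodes may cost some correct ones, can be illustrated with a small sketch. It is not the algorithm of this paper and does not reproduce its bounds. It assumes a simplified test model in which a correct tester never fails a correct node, so every failed test involves at least one faulty node, while a faulty node may still pass a test (imperfect coverage) or falsely accuse a correct node. The function name `remove_suspect_pairs` and the node labels are hypothetical.

```python
def remove_suspect_pairs(nodes, failed_tests):
    """Conservatively remove both endpoints of each failed test.

    Assumption (not taken from the paper): a correct tester never fails
    a correct node, so each failed (tester, tested) pair contains at
    least one faulty node. Removing the whole pair therefore sacrifices
    at most one correct node per removed faulty node.
    """
    remaining = set(nodes)
    removed_pairs = []
    for tester, tested in failed_tests:
        # Only act on results whose two endpoints are both still in the service.
        if tester in remaining and tested in remaining:
            remaining -= {tester, tested}
            removed_pairs.append((tester, tested))
    return remaining, removed_pairs


# Hypothetical scenario: node C is faulty and behaves maliciously.
# A (correct) detects C; C falsely reports that B failed its test.
nodes = ["A", "B", "C", "D", "E"]
failed_tests = [("A", "C"), ("C", "B")]

remaining, removed_pairs = remove_suspect_pairs(nodes, failed_tests)
print(sorted(remaining))   # nodes still offering the service
print(removed_pairs)       # pairs removed together
```

In this scenario the faulty node C is removed together with the correct node A, while C's false accusation of B has no effect because C has already left the service. The sketch only guarantees that no failed test remains between two surviving nodes; a faulty node that passes all tests applied to it is not detectable and stays in the service.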