Crash-Quiescent Failure Detection

  • Srikanth Sastry
  • Scott M. Pike
  • Jennifer L. Welch
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5805)

Abstract

A distributed algorithm is crash quiescent if it eventually stops sending messages to crashed processes. An algorithm can be made crash quiescent by providing it with either a crash notification service or a reliable communication service. Both services can be implemented in practical environments with failure detectors. Therefore, crash-quiescent failure detection is fundamental to system-wide crash quiescence. We establish necessary and sufficient conditions for crash-quiescent failure detection in partially synchronous environments where a bounded, but unknown, number of consecutive messages can be arbitrarily late or lost. Without a correct majority of processes, not even the weakest oracle for fault-tolerant consensus, \(\Diamond\mathcal{W}\), can be implemented crash quiescently. With a correct majority, however, the eventually perfect failure detector, \(\Diamond\mathcal{P}\), is possible. Our \(\Diamond\mathcal{P}\) algorithm is correct in all runs, but improves performance via crash quiescence in any run with a correct majority. We also present a refinement of our \(\Diamond\mathcal{P}\) algorithm to mitigate the overhead of achieving crash quiescence; the resulting bit complexity per utilized link is asymptotically better than or equal to that of non-crash-quiescent counterparts.
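
The definition of crash quiescence in the abstract can be illustrated with a toy timeout-based detector. This is a minimal sketch under stated assumptions, not the algorithm from the paper: the class `HeartbeatDetector`, its methods, and the timeout-doubling rule are all hypothetical names and choices made for illustration. Each process tracks when it last heard from each peer, suspects a peer whose silence exceeds an adaptive timeout, and sends heartbeats only to unsuspected peers, so a permanently suspected (crashed) peer eventually receives no further messages.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Toy detector for one process. All names here are hypothetical."""

    peers: list
    timeout: float = 2.0  # assumed initial timeout; not a value from the paper
    last_heard: dict = field(default_factory=dict)
    timeouts: dict = field(default_factory=dict)
    suspected: set = field(default_factory=set)

    def __post_init__(self):
        # Start every peer as trusted, with the same initial timeout.
        for p in self.peers:
            self.last_heard[p] = 0.0
            self.timeouts[p] = self.timeout

    def on_heartbeat(self, peer, now):
        """Record a heartbeat; a heartbeat from a suspect reveals a false suspicion."""
        self.last_heard[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)
            # Double the timeout so the same mistake becomes less likely.
            self.timeouts[peer] *= 2

    def tick(self, now):
        """Update suspicions and return the peers that should get a heartbeat."""
        for p in self.peers:
            if now - self.last_heard[p] > self.timeouts[p]:
                self.suspected.add(p)
        # Crash quiescence in miniature: send nothing to suspected peers.
        return [p for p in self.peers if p not in self.suspected]
```

This naive rule has an obvious gap: two correct processes that falsely suspect each other would both fall silent and never recover, since neither sends the heartbeat that would clear the suspicion. The sketch therefore only illustrates the definition; it does not show how the paper reconciles quiescence with the accuracy required of \(\Diamond\mathcal{P}\).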

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Srikanth Sastry (1)
  • Scott M. Pike (1)
  • Jennifer L. Welch (1)
  1. Department of Computer Science and Engineering, Texas A&M University, College Station, USA
