Abstract
Before turning to the question of how to make systems reliable, it is useful to understand briefly why distributed systems fail. This chapter discusses some of the thinking around failure, a surprisingly rich and varied technical topic.
Copyright information
© 2012 Springer-Verlag London Limited
About this chapter
Cite this chapter
Birman, K.P. (2012). How and Why Computer Systems Fail. In: Guide to Reliable Distributed Systems. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-2416-0_9
DOI: https://doi.org/10.1007/978-1-4471-2416-0_9
Publisher Name: Springer, London
Print ISBN: 978-1-4471-2415-3
Online ISBN: 978-1-4471-2416-0
eBook Packages: Computer Science, Computer Science (R0)