How and Why Computer Systems Fail

  • Chapter
Guide to Reliable Distributed Systems

Part of the book series: Texts in Computer Science ((TCS))

Abstract

Before turning to the question of how to make systems reliable, it is useful to understand briefly why distributed systems fail. This chapter discusses some of the thinking around failure, a surprisingly rich and varied technical topic.

Copyright information

© 2012 Springer-Verlag London Limited

About this chapter

Cite this chapter

Birman, K.P. (2012). How and Why Computer Systems Fail. In: Guide to Reliable Distributed Systems. Texts in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-2416-0_9

  • DOI: https://doi.org/10.1007/978-1-4471-2416-0_9

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-2415-3

  • Online ISBN: 978-1-4471-2416-0

  • eBook Packages: Computer Science, Computer Science (R0)
