Abstract
It is sometimes argued that with the right training, discipline, and tools it should be possible to produce zero-defect code. Very few things in life, though, are zero-defect, not even things that can be considered life critical. If you practice skydiving, your main parachute could fail to open, no matter how carefully you check it before each jump. A parachutist would be wise not to trust a company that tries to sell him a zero-defect parachute. The jumper is more likely to avoid problems by bringing a spare chute. That is, the seasoned parachutist takes the possibility of component failure into account by adopting a system that has a relatively low probability of system failure. We can provide system reliability even when none of the system components are zero-defect. In many cases, though, mere redundancy does not solve the problem (e.g., multiple skydivers jumping in parallel). Reliable systems are designed with the possibility of component failure in mind, and with remedies in place to reduce the odds of system failure. Component failure is rarely an isolated event, though. In this chapter we will consider the nature of failure in complex software systems, and how we can develop methods to leverage these insights.
Acknowledgements
The research described in this chapter was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.
Copyright information
© 2012 Springer-Verlag London Limited
About this chapter
Cite this chapter
Holzmann, G.J. (2012). Conquering Complexity. In: Hinchey, M., Coyle, L. (eds) Conquering Complexity. Springer, London. https://doi.org/10.1007/978-1-4471-2297-5_3
DOI: https://doi.org/10.1007/978-1-4471-2297-5_3
Publisher Name: Springer, London
Print ISBN: 978-1-4471-2296-8
Online ISBN: 978-1-4471-2297-5
eBook Packages: Computer Science (R0)