Abstract
Because major software failures can cause harm ranging from financial loss to loss of life, preventing their recurrence is essential. A post-mortem investigation is required to identify the root cause of a failure and to implement appropriate countermeasures. Current approaches to software failure investigation are limited: they often focus on returning the software system to normal operation as quickly as possible, which leaves the system vulnerable to a recurrence of the same type of failure. This chapter defines the concept of a software failure and then reviews the problems posed by major software failures. The aim is to determine how to improve the accuracy of root-cause analysis in order to prevent major accidents from recurring. Recent cases of major software failures from different industries, such as the medical domain, are reviewed to demonstrate the reality and seriousness of the problem. These failures are then analysed to identify limitations of the software investigation process and to establish requirements for improving it. These requirements form the basis for the design of a near-miss management system (NMS).
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Eloff, J., Bella, M.B. (2018). Software Failures: An Overview. In: Software Failure Investigation. Springer, Cham. https://doi.org/10.1007/978-3-319-61334-5_2
DOI: https://doi.org/10.1007/978-3-319-61334-5_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61333-8
Online ISBN: 978-3-319-61334-5
eBook Packages: Engineering, Engineering (R0)