Abstract
Enterprise-system upgrades are unreliable and often cause downtime or data loss. Errors in the upgrade procedure, such as broken dependencies, are the leading cause of upgrade failures. We propose a novel upgrade-centric fault model, based on data from three independent sources, which focuses on the impact of procedural errors rather than software defects. We show that current approaches for upgrading enterprise systems, such as rolling upgrades, are vulnerable to these faults because the upgrade is not an atomic operation and risks breaking hidden dependencies among the distributed system components. We also present a mechanism for tolerating complex procedural errors during an upgrade. Our system, called Imago, improves availability in the fault-free case by performing an online upgrade, and in the faulty case by reducing the risk of failure due to broken hidden dependencies. Imago performs an end-to-end upgrade atomically and dependably by dedicating separate resources to the new version and by isolating the old version from the upgrade procedure. Through fault injection, we show that Imago is more reliable than online-upgrade approaches that rely on dependency tracking and that create system states with mixed versions.
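The idea of an atomic upgrade on separate resources can be illustrated with a minimal sketch. This is not Imago's implementation: the names `Universe`, `Frontend`, `migrate`, and `validate` are hypothetical, and the sketch assumes, for simplicity, that no request modifies the data while the upgrade is running. It shows only that the new version is built from a copy of the data, that the old version is never touched by the upgrade procedure, and that traffic moves in a single switchover, so no mixed-version state is ever live.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Universe:
    """One complete, single-version deployment with its own resources (hypothetical)."""
    version: str
    data: Dict[str, str] = field(default_factory=dict)


class Frontend:
    """Routes all requests to exactly one universe at a time (hypothetical)."""

    def __init__(self, old: Universe) -> None:
        self.live = old

    def upgrade(
        self,
        new_version: str,
        migrate: Callable[[Universe], None],
        validate: Callable[[Universe], bool],
    ) -> bool:
        # Build the new version on separate resources, from a copy of the data.
        # The old universe is only read, never modified, by the upgrade.
        candidate = Universe(new_version, dict(self.live.data))
        try:
            migrate(candidate)          # procedural errors surface here...
            ok = validate(candidate)    # ...or are caught by validation
        except Exception:
            ok = False
        if ok:
            self.live = candidate       # single switchover to the new version
        # On failure the candidate is discarded and the old version keeps serving.
        return ok
```

If `migrate` raises or `validate` returns `False`, `upgrade` returns `False` and `Frontend.live` still refers to the unmodified old universe. A rolling upgrade, by contrast, would replace components in place and could leave the system partly upgraded when the same error occurs.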
Copyright information
© 2009 IFIP International Federation for Information Processing
About this paper
Cite this paper
Dumitraş, T., Narasimhan, P. (2009). Why Do Upgrades Fail and What Can We Do about It? In: Bacon, J.M., Cooper, B.F. (eds) Middleware 2009. Lecture Notes in Computer Science, vol 5896. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10445-9_18
DOI: https://doi.org/10.1007/978-3-642-10445-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10444-2
Online ISBN: 978-3-642-10445-9
eBook Packages: Computer Science (R0)