Why Do Upgrades Fail and What Can We Do about It?

Dumitraş, Tudor; Narasimhan, Priya

doi:10.1007/978-3-642-10445-9_18

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5896)

Included in the following conference series:

ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing

Abstract

Enterprise-system upgrades are unreliable and often produce downtime or data loss. Errors in the upgrade procedure, such as broken dependencies, constitute the leading cause of upgrade failures. We propose a novel upgrade-centric fault model, based on data from three independent sources, which focuses on the impact of procedural errors rather than software defects. We show that current approaches for upgrading enterprise systems, such as rolling upgrades, are vulnerable to these faults because the upgrade is not an atomic operation and risks breaking hidden dependencies among the distributed system components. We also present a mechanism for tolerating complex procedural errors during an upgrade. Our system, called Imago, improves availability in the fault-free case by performing an online upgrade, and in the faulty case by reducing the risk of failure due to breaking hidden dependencies. Imago performs an end-to-end upgrade atomically and dependably by dedicating separate resources to the new version and by isolating the old version from the upgrade procedure. Through fault injection, we show that Imago is more reliable than online-upgrade approaches that rely on dependency tracking and that create system states with mixed versions.
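The contrast the abstract draws between an in-place rolling upgrade and an isolated, atomic upgrade can be illustrated with a toy simulation. The following is a minimal sketch under stated assumptions, not Imago's implementation: the names `Cluster`, `install`, `rolling_upgrade`, `isolated_upgrade`, and `UpgradeError` are hypothetical, a "procedural error" is modeled as one upgrade step that raises an exception, and "dedicating separate resources" is modeled as working on a copy of the system state.

```python
from dataclasses import dataclass


class UpgradeError(Exception):
    """Raised when a simulated procedural error interrupts an upgrade."""


@dataclass
class Cluster:
    """A toy system: maps each component name to its installed version."""
    versions: dict

    def is_consistent(self) -> bool:
        # True when every component runs the same version (no mixed state).
        return len(set(self.versions.values())) == 1


def install(cluster, name, new_version, broken):
    """Upgrade one component; `broken` names the step that fails, if any."""
    if name == broken:
        raise UpgradeError(f"procedural error while upgrading {name}")
    cluster.versions[name] = new_version


def rolling_upgrade(live, new_version, broken=None):
    """Upgrade the live system in place, one component at a time."""
    for name in list(live.versions):
        install(live, name, new_version, broken)
    return live


def isolated_upgrade(live, new_version, broken=None):
    """Build the new version on a separate copy, then switch over.

    The live system is never modified. It is replaced only if every
    step succeeded, so callers see either the old or the new system.
    """
    staged = Cluster(dict(live.versions))  # separate copy stands in for dedicated resources
    try:
        for name in list(staged.versions):
            install(staged, name, new_version, broken)
    except UpgradeError:
        return live  # discard the staged copy; the old version keeps serving
    return staged  # switchover is a single reference swap


if __name__ == "__main__":
    components = {"web": 1, "app": 1, "db": 1}

    # Rolling upgrade with a failure at "app": the live system is left mixed.
    live = Cluster(dict(components))
    try:
        rolling_upgrade(live, 2, broken="app")
    except UpgradeError:
        pass
    print(live.versions, live.is_consistent())
    # prints: {'web': 2, 'app': 1, 'db': 1} False

    # Isolated upgrade with the same failure: the old version is untouched.
    live = Cluster(dict(components))
    serving = isolated_upgrade(live, 2, broken="app")
    print(serving.versions, serving.is_consistent())
    # prints: {'web': 1, 'app': 1, 'db': 1} True
```

In this sketch the rolling upgrade exposes a mixed-version state as soon as one step fails, while the isolated variant confines the failure to the staged copy. A real system would also have to transfer persistent data to the new version and handle requests arriving during the switchover, which this toy model leaves out.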


Author information

Authors and Affiliations

Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Tudor Dumitraş & Priya Narasimhan

Editor information

Editors and Affiliations

Computer Laboratory, University of Cambridge, William Gates Building, JJ Thomson Avenue, CB3 0FD, Cambridge, UK
Jean M. Bacon
Yahoo! Research, 4401 Great America Parkway, 95054, Santa Clara, CA, USA
Brian F. Cooper

Cite this paper

Dumitraş, T., Narasimhan, P. (2009). Why Do Upgrades Fail and What Can We Do about It? In: Bacon, J.M., Cooper, B.F. (eds) Middleware 2009. Lecture Notes in Computer Science, vol 5896. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10445-9_18

DOI: https://doi.org/10.1007/978-3-642-10445-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10444-2
Online ISBN: 978-3-642-10445-9
eBook Packages: Computer Science (R0)
