Supplemental Information for:

Why Do Upgrades Fail And What Can We Do About It?

T. Dumitraş and P. Narasimhan



Abstract
Enterprise-system upgrades are unreliable and often result in downtime or data loss. Errors in the upgrade procedure, such as broken dependencies, constitute the leading cause of upgrade failures. We propose a novel upgrade-centric fault model, based on data from three independent sources, which focuses on the impact of procedural errors rather than software defects. We show that current approaches for upgrading enterprise systems, such as rolling upgrades, are vulnerable to these faults because the upgrade is not an atomic operation and risks breaking hidden dependencies among the distributed system components.

Research paper:

T. Dumitraş and P. Narasimhan. Why Do Upgrades Fail And What Can We Do About It? Toward Dependable, Online Upgrades in Enterprise Systems. In ACM/IFIP/USENIX Conference on Middleware, Urbana-Champaign, IL, Nov.-Dec. 2009.

In this clustering dendrogram, the leaves correspond to the 55 faults reported in the user study (u), survey (s) and field study (f). Each vertical line links two clusters into a larger cluster, and its position on the X-axis indicates the mean inter-fault distance between the two clusters it joins. For example, two or three identical faults, reported in different experiments or studies, form a cluster with mean distance = 0. A link with a significantly larger distance than the links below it suggests the presence of a natural cluster.

Each cluster, highlighted by a rectangle (left side), corresponds to a specific type of upgrade fault. This fault model has four distinct categories: (1) simple configuration errors (e.g. typos); (2) semantic configuration errors (e.g. misunderstood effects of parameters); (3) broken environmental dependencies (e.g. library or port conflicts); and (4) data-access errors, which render the persistent data partially unavailable. The cophenetic correlation coefficient, which measures the correlation between the inter-fault distance and the distance in the dendrogram, is 0.85. Principal-component analysis (right side) creates a two-dimensional shadow of the fault clusters, which suggests that the four types of upgrade faults do not overlap.
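To make the dendrogram construction and the cophenetic correlation coefficient concrete, here is a minimal sketch in Python. It rests on assumptions that are not taken from the paper: the link height is computed with average linkage (suggested by the phrase "mean inter-fault distance", but not confirmed here), and the 5x5 distance matrix D is invented toy data, not the 55 faults from the study. The function names are likewise hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_linkage(dist):
    """Agglomerative clustering with average linkage (an assumed linkage rule).

    dist is a symmetric matrix of pairwise distances between leaves.
    Returns (merges, coph): merges lists (cluster_a, cluster_b, height) in
    merge order; coph[i][j] is the height of the link that first places
    leaves i and j in the same cluster (their distance in the dendrogram).
    """
    n = len(dist)
    clusters = [frozenset([i]) for i in range(n)]
    coph = [[0.0] * n for _ in range(n)]
    merges = []

    def mean_dist(a, b):
        # Mean distance over all leaf pairs drawn from the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the closest pair of clusters; ties go to the first pair found.
        a, b = min(combinations(clusters, 2), key=lambda p: mean_dist(*p))
        height = mean_dist(a, b)
        for i in a:
            for j in b:
                coph[i][j] = coph[j][i] = height
        merges.append((sorted(a), sorted(b), height))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges, coph


def cophenetic_correlation(dist, coph):
    """Pearson correlation between original and dendrogram distances."""
    pairs = list(combinations(range(len(dist)), 2))
    x = [dist[i][j] for i, j in pairs]
    y = [coph[i][j] for i, j in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    var_x = sum((u - mx) ** 2 for u in x)
    var_y = sum((v - my) ** 2 for v in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: faults 0 and 1 are identical (distance 0), fault 2
# is close to them, and faults 3 and 4 form a second, well-separated group.
D = [
    [0, 0, 2, 8, 9],
    [0, 0, 2, 8, 9],
    [2, 2, 0, 7, 8],
    [8, 8, 7, 0, 1],
    [9, 9, 8, 1, 0],
]

merges, coph = average_linkage(D)
for left, right, height in merges:
    print(f"link {left} + {right} at mean distance {height:.2f}")
print(f"cophenetic correlation: {cophenetic_correlation(D, coph):.3f}")
```

In this toy example the two identical faults are linked at distance 0, and the final link, which joins the two groups, sits far above the links below it. That gap is the kind of signal the text describes as evidence of a natural cluster. A coefficient close to 1 means the dendrogram preserves the original pairwise distances well.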



Annotated fault list: [XLS]
A preliminary version of this fault model is described in the technical report CMU-PDL-08-115: [PDF]



Contact: Tudor Dumitraş