A Model for Space-Correlated Failures in Large-Scale Distributed Systems
Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, in distributed systems a single failure can trigger within a short time span several more failures, forming a group of time-correlated failures. To eliminate or alleviate the significant effects of failures on performance and functionality, the techniques for dealing with failures require good failure models. However, not many such models are available, and the available models are valid for few or even a single distributed system. In contrast, in this work we propose a model that considers groups of time-correlated failures and is valid for many types of distributed systems. Our model includes three components: the group size, the group inter-arrival time, and the resource downtime caused by the group. To validate this model, we use failure traces corresponding to fifteen distributed systems. We find that space-correlated failures are dominant in terms of resource downtime in seven of the fifteen studied systems. For each of these seven systems, we provide a set of model parameters that can be used in research studies or for tuning distributed systems. Last, as a result of our work, six of the studied traces have been made available through the Failure Trace Archive (http://fta.inria.fr).
Keywords: Window Size, Failure Event, Common Distribution, Failure Group, Desktop Grid
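The three model components named in the abstract (group size, group inter-arrival time, and resource downtime caused by the group) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the input format of `(start_time, duration)` pairs, the names `FailureGroup`, `group_failures`, and `inter_arrival_times`, the rule that a failure joins the current group when it starts within `window` seconds of the previous failure, and the choice to count group downtime as the sum of the individual downtimes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FailureGroup:
    start: float     # start time of the first failure in the group (seconds)
    size: int        # number of failure events in the group
    downtime: float  # assumed here: sum of the downtimes of the group's failures


def group_failures(events: List[Tuple[float, float]], window: float) -> List[FailureGroup]:
    """Group (start_time, duration) failure events.

    Hypothetical grouping rule: a failure joins the current group if it
    starts at most `window` seconds after the previous failure started.
    """
    groups: List[FailureGroup] = []
    last_start = 0.0
    for start, duration in sorted(events):
        if groups and start - last_start <= window:
            groups[-1].size += 1
            groups[-1].downtime += duration
        else:
            groups.append(FailureGroup(start=start, size=1, downtime=duration))
        last_start = start
    return groups


def inter_arrival_times(groups: List[FailureGroup]) -> List[float]:
    """Time between the starts of consecutive groups."""
    return [b.start - a.start for a, b in zip(groups, groups[1:])]


# Made-up events, not data from any real trace.
events = [(0, 100), (30, 50), (50, 20), (1000, 10), (1040, 5)]
groups = group_failures(events, window=60)
print([(g.size, g.downtime) for g in groups])  # [(3, 170), (2, 15)]
print(inter_arrival_times(groups))             # [1000]
```

With a 60-second window, the five made-up events fall into two groups. Statistical distributions could then be fitted to the per-group sizes, inter-arrival times, and downtimes. The window size affects how events are grouped, so any result from a sketch like this depends on that choice.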