A Model for Space-Correlated Failures in Large-Scale Distributed Systems

  • Matthieu Gallet
  • Nezih Yigitbasi
  • Bahman Javadi
  • Derrick Kondo
  • Alexandru Iosup
  • Dick Epema
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6271)

Abstract

Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, a single failure in a distributed system can trigger several more failures within a short time span, forming a group of time-correlated failures. To eliminate or alleviate the significant effects of failures on performance and functionality, the techniques for dealing with failures require good failure models. However, not many such models are available, and the available models are valid for only a few distributed systems, or even a single one. In contrast, in this work we propose a model that considers groups of time-correlated failures and is valid for many types of distributed systems. Our model includes three components: the group size, the group inter-arrival time, and the resource downtime caused by the group. To validate this model, we use failure traces corresponding to fifteen distributed systems. We find that space-correlated failures are dominant in terms of resource downtime in seven of the fifteen studied systems. For each of these seven systems, we provide a set of model parameters that can be used in research studies or for tuning distributed systems. Last, as a result of our work, six of the studied traces have been made available through the Failure Trace Archive (http://fta.inria.fr).
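
The three model components named above can be illustrated with a minimal sketch. The abstract does not say how failure groups are delimited, so the sketch assumes a simple chained time window: a failure joins the current group if it starts within `window` seconds of the previous failure's start. The event format and the names `FailureGroup`, `group_failures`, and `inter_arrival_times` are hypothetical and are not taken from the paper or from the Failure Trace Archive.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical event format: (start, end) of one resource's unavailability, in seconds.
Event = Tuple[float, float]


@dataclass
class FailureGroup:
    start: float     # start time of the first failure in the group
    size: int        # number of failure events in the group
    downtime: float  # summed downtime of all events in the group


def group_failures(events: List[Event], window: float) -> List[FailureGroup]:
    """Chain events into groups (assumed rule, not the paper's definition):
    an event joins the current group if it starts within `window` seconds
    of the previous event's start; otherwise it opens a new group."""
    groups: List[FailureGroup] = []
    last_start = None
    for start, end in sorted(events):
        if last_start is not None and start - last_start <= window:
            current = groups[-1]
            current.size += 1
            current.downtime += end - start
        else:
            groups.append(FailureGroup(start, 1, end - start))
        last_start = start
    return groups


def inter_arrival_times(groups: List[FailureGroup]) -> List[float]:
    """Time between the starts of consecutive groups."""
    return [b.start - a.start for a, b in zip(groups, groups[1:])]
```

Applied to a trace, the per-group sizes, inter-arrival times, and downtimes are the empirical samples to which distributions for the three components could then be fitted. The choice of `window` affects all three and would need to be tuned per system.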

Keywords

Window Size, Failure Event, Common Distribution, Failure Group, Desktop Grid
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Matthieu Gallet (1, 3)
  • Nezih Yigitbasi (1, 3)
  • Bahman Javadi (2, 3)
  • Derrick Kondo (2, 3)
  • Alexandru Iosup (1, 3)
  • Dick Epema (1, 3)

  1. Delft University of Technology, The Netherlands
  2. INRIA Grenoble, France
  3. The Failure Trace Archive, France (Email: contact@fta.inria.fr)
