Failure Analysis and Modeling in Large Multi-site Infrastructures

  • Tran Ngoc Minh
  • Guillaume Pierre
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7891)

Abstract

Every large multi-site infrastructure, such as a Grid or a Cloud, must implement fault-tolerance mechanisms and smart schedulers to keep operating when resource failures occur. Evaluating the efficiency of such mechanisms and schedulers requires representative failure models that capture the properties of real-world failure data. This paper shows that failures in multi-site infrastructures are far from being randomly distributed. We propose a failure model that captures features observed in real failure traces.

Copyright information

© IFIP International Federation for Information Processing 2013

Authors and Affiliations

  • Tran Ngoc Minh (1)
  • Guillaume Pierre (1)
  1. IRISA / University of Rennes 1, France
