Repair Time in Distributed Storage Systems

  • Frédéric Giroire
  • Sandeep Kumar Gupta
  • Remigiusz Modrzejewski
  • Julian Monteiro
  • Stéphane Perennes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8059)

Abstract

In this paper, we analyze a highly distributed backup storage system realized by means of nano datacenters (NaDa). NaDa have recently been proposed as a way to mitigate the growing energy, bandwidth and device costs of traditional data centers, following the popularity of cloud computing. These service provider-controlled peer-to-peer systems take advantage of resources already committed to always-on set-top boxes, of the fact that these boxes generate no heat dissipation costs, and of their proximity to users.

In this kind of system, redundancy is introduced to preserve the data in case of peer failures or departures. To ensure long-term fault tolerance, the storage system must have a self-repair service that continuously reconstructs the fragments of redundancy that are lost. In the literature, the reconstruction times are modeled as independent. In practice, however, numerous reconstructions start at the same time (when the system detects that a peer has failed).
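The gap between the two modeling assumptions can be illustrated with a minimal sketch. It is not the model of the paper: the function names, the FIFO service discipline, and the parameters `num_fragments`, `unit_time` and `num_workers` are all assumptions made for illustration. It compares the completion times obtained when each reconstruction is assumed to run alone with those obtained when all reconstructions triggered by one peer failure start together and compete for a limited number of reconstructing peers.

```python
def independent_repair_times(num_fragments, unit_time):
    """Completion times if every reconstruction ran alone (independence assumption)."""
    return [unit_time] * num_fragments


def queued_repair_times(num_fragments, unit_time, num_workers):
    """Completion times if all reconstructions start at once and are served
    FIFO by `num_workers` peers, each rebuilding one fragment at a time.
    (Hypothetical service discipline, chosen only to keep the sketch simple.)"""
    finish_times = []
    for i in range(num_fragments):
        # Fragment i is served in round i // num_workers (0-based),
        # so it completes at the end of that round.
        finish_times.append((i // num_workers + 1) * unit_time)
    return finish_times


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    independent = independent_repair_times(100, 1.0)
    queued = queued_repair_times(100, 1.0, 10)
    print("mean repair time, independent model:", mean(independent))
    print("mean repair time, simultaneous start:", mean(queued))
    print("slowest repair, simultaneous start:", max(queued))
```

Under these assumed numbers (100 lost fragments, 10 reconstructing peers, one time unit per reconstruction), the mean repair time is 5.5 units instead of 1 and the last fragment waits 10 units, which is the kind of effect an independence assumption hides.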

We propose a new analytical framework that takes this correlation into account when estimating the repair time and the probability of data loss. We show that the load is unbalanced among peers (young peers inherently store less data than the old ones). The models and schemes proposed are validated by mathematical analysis, an extensive set of simulations, and experimentation using the GRID5000 test-bed platform. This new model allows system designers to choose system parameters more accurately as a function of their target data durability.
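The load imbalance between young and old peers can be reproduced with a toy simulation. This is a sketch under stated assumptions, not the simulator used by the authors: it assumes that a failed peer is replaced by an empty one and that each lost fragment is rebuilt on a peer chosen uniformly at random. All names and default values (`simulate_loads`, `fragments_per_peer`, and so on) are hypothetical.

```python
import random


def simulate_loads(num_peers=50, fragments_per_peer=100, failures=200, seed=1):
    """Return (ages, loads) after a sequence of peer failures.

    Assumptions for illustration only: one uniformly random peer fails per
    step, it is replaced by an empty peer of age 0, and every fragment it
    held is rebuilt on a uniformly random peer."""
    rng = random.Random(seed)
    ages = [0] * num_peers
    loads = [fragments_per_peer] * num_peers
    for _step in range(failures):
        ages = [age + 1 for age in ages]
        failed = rng.randrange(num_peers)
        lost = loads[failed]
        ages[failed] = 0   # the replacement peer is new...
        loads[failed] = 0  # ...and starts empty
        for _fragment in range(lost):
            loads[rng.randrange(num_peers)] += 1
    return ages, loads


def mean_load_by_age_half(ages, loads):
    """Mean load of the younger half and of the older half of the peers."""
    order = sorted(range(len(ages)), key=lambda p: ages[p])
    half = len(order) // 2
    young = sum(loads[p] for p in order[:half]) / half
    old = sum(loads[p] for p in order[half:]) / (len(order) - half)
    return young, old


if __name__ == "__main__":
    ages, loads = simulate_loads()
    young, old = mean_load_by_age_half(ages, loads)
    print("mean load, younger half:", young)
    print("mean load, older half:", old)
```

The total number of fragments is conserved, so any fragments missing from recently arrived peers sit on the older ones. An old peer therefore has more to reconstruct when it fails, which is one reason the repair load is not uniform.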

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Frédéric Giroire (1)
  • Sandeep Kumar Gupta (2)
  • Remigiusz Modrzejewski (1)
  • Julian Monteiro (3)
  • Stéphane Perennes (1)

  1. Project MASCOTTE, I3S (CNRS/Univ. of Nice)/INRIA, Sophia Antipolis, France
  2. IIT Delhi, New Delhi, India
  3. Department of Computer Science, IME, University of São Paulo, Brazil