Repair Time in Distributed Storage Systems
In this paper, we analyze a highly distributed backup storage system built on nano datacenters (NaDa). NaDa were recently proposed as a way to mitigate the energy, bandwidth, and device costs of traditional data centers, which have grown with the popularity of cloud computing. These service-provider-controlled peer-to-peer systems take advantage of resources already committed to always-on set-top boxes, of the fact that these boxes generate no heat dissipation costs, and of their proximity to users.
In systems of this kind, redundancy is introduced to preserve the data when peers fail or leave. To ensure long-term fault tolerance, the storage system must have a self-repair service that continuously reconstructs the redundancy fragments that are lost. In the literature, reconstruction times are modeled as independent. In practice, however, many reconstructions start at the same time, namely when the system detects that a peer has failed.
We propose a new analytical framework that takes this correlation into account when estimating the repair time and the probability of data loss. We show that the load is unbalanced among peers (young peers inherently store less data than old ones). The proposed models and schemes are validated by mathematical analysis, an extensive set of simulations, and experimentation on the GRID5000 test-bed platform. This new model allows system designers to choose system parameters more accurately as a function of their targeted data durability.
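As a toy illustration of why the independence assumption matters, the sketch below compares two ways of timing the reconstructions triggered by a single peer failure. It is not the model developed in this paper: the single shared first-in-first-out link, the function names, and the numbers (100 lost blocks, 2 minutes per block) are all assumptions chosen only to make the effect visible.

```python
def independent_repair_times(num_blocks, block_time):
    """Repair time of each block if every reconstruction had the link to itself
    (the independence assumption)."""
    return [block_time] * num_blocks


def queued_repair_times(num_blocks, block_time):
    """Repair time of each block if all reconstructions start together and are
    served one after another on one shared link (hypothetical FIFO assumption)."""
    return [block_time * (position + 1) for position in range(num_blocks)]


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    blocks_lost = 100   # blocks held by the failed peer (assumed)
    block_time = 2.0    # minutes to rebuild one block alone (assumed)

    independent = independent_repair_times(blocks_lost, block_time)
    queued = queued_repair_times(blocks_lost, block_time)

    print("mean repair time, independent:", mean(independent))  # 2.0
    print("mean repair time, queued:     ", mean(queued))       # 101.0
    print("last block repaired, queued:  ", max(queued))        # 200.0
```

Under these assumptions the last block waits 100 times longer than the independent estimate suggests, and a block that stays under-replicated for longer is exposed to further failures for longer.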