REDU: reducing redundancy and duplication for multi-failure recovery in erasure-coded storages

The Journal of Supercomputing

Abstract

Data reliability is a major concern in large-scale storage systems. Erasure codes provide high data reliability through data recovery, but recovery generates a large amount of network traffic, and the bandwidth it consumes significantly degrades the performance of the cluster hosting the storage system. Existing work treats single failure as the most common failure pattern and focuses mainly on reducing the data transmission cost of single-failure recovery; as a result, it does not support multi-failure recovery efficiently. In this work, we first introduce the Mean Time To Multi-Failure metric, based on a Markov model, to characterize the frequency and pattern of multi-failures in erasure-coded storage. We then propose REDU to reduce duplication and redundancy in multi-failure recovery. REDU comprises three techniques: merging-based de-duplication, which reduces duplicated data transmission; aggregating-based de-redundancy, which reduces redundant information transmission; and cooperative routing, which applies the two schemes efficiently according to the practical cluster topology. Analysis and experimental results demonstrate the importance of the multi-failure recovery problem and the efficiency of REDU.
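
The idea behind aggregation can be illustrated with a minimal sketch. It is not REDU's algorithm; it only assumes a linear erasure code, in which each lost block is a linear combination of surviving blocks. The rack layout, the block contents, the decoding coefficients, the field size `P`, and all function names below are hypothetical and chosen purely for illustration. Because the combination is linear, each rack can send one partial sum per lost block instead of every surviving block, and the recovered result is the same.

```python
P = 257  # small prime modulus, an illustrative choice rather than a real code's field


def combine(coeffs, blocks):
    """Element-wise linear combination sum_i coeffs[i] * blocks[i], mod P."""
    length = len(blocks[0])
    return [sum(c * b[j] for c, b in zip(coeffs, blocks)) % P for j in range(length)]


def add(x, y):
    """Element-wise sum of two blocks, mod P."""
    return [(a + b) % P for a, b in zip(x, y)]


# Hypothetical surviving blocks, grouped by the rack that stores them.
racks = {
    "rack_a": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "rack_b": [[10, 11, 12], [13, 14, 15], [16, 17, 18]],
}

# Hypothetical decoding coefficients: for each lost block, one coefficient
# per surviving block, listed in the same order as the blocks in each rack.
coeffs = {
    "lost_0": {"rack_a": [3, 1, 4], "rack_b": [1, 5, 9]},
    "lost_1": {"rack_a": [2, 7, 1], "rack_b": [8, 2, 8]},
}


def recover_direct(lost):
    """Ship every surviving block to the replacement node and combine there."""
    all_blocks = [b for r in racks for b in racks[r]]
    all_coeffs = [c for r in racks for c in coeffs[lost][r]]
    return combine(all_coeffs, all_blocks)


def recover_aggregated(lost):
    """Combine inside each rack first, then add the per-rack partial results."""
    partials = [combine(coeffs[lost][r], racks[r]) for r in racks]
    result = partials[0]
    for partial in partials[1:]:
        result = add(result, partial)
    return result


# Cross-rack transfers, assuming the replacement nodes sit outside both racks
# and each surviving block is shipped only once in the direct scheme.
direct_transfers = sum(len(blocks) for blocks in racks.values())
aggregated_transfers = len(racks) * len(coeffs)

for lost in coeffs:
    assert recover_direct(lost) == recover_aggregated(lost)
print(direct_transfers, aggregated_transfers)
```

In this toy layout the direct scheme moves six blocks across racks while the aggregated scheme moves four partial results. The saving depends on how many surviving blocks share a rack relative to the number of lost blocks, which is why topology-aware routing matters.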

Acknowledgments

This work is supported in part by the Natural Science Foundation of China under Grant Nos. 61379146, 61272483, 61402511, and 61402514, and by the Fund of NUDT under Grant No. JC13-06-03.

Author information

Corresponding author

Correspondence to Jing Zhang.

Cite this article

Zhang, J., Li, S. & Liao, X. REDU: reducing redundancy and duplication for multi-failure recovery in erasure-coded storages. J Supercomput 72, 3281–3296 (2016). https://doi.org/10.1007/s11227-015-1397-9

  • DOI: https://doi.org/10.1007/s11227-015-1397-9
