REDU: reducing redundancy and duplication for multi-failure recovery in erasure-coded storages

Zhang, Jing; Li, Shanshan; Liao, Xiangke

doi:10.1007/s11227-015-1397-9

Published: 10 March 2015

Volume 72, pages 3281–3296, (2016)
The Journal of Supercomputing

Abstract

Data reliability is a significant concern in large-scale storage systems. Erasure codes provide high reliability through data recovery, but recovery generates a large amount of network traffic, and the bandwidth it consumes significantly degrades the performance of the cluster that hosts the storage. Existing work treats single failure as the most common failure pattern and focuses mainly on reducing the transmission cost of single-failure recovery, so it does not support multi-failure recovery efficiently. In this work, we first introduce the Mean Time To Multi-Failure metric, based on a Markov model, to characterize the frequency and pattern of multi-failures in erasure-coded storage. We then propose REDU, which reduces duplication and redundancy in multi-failure recovery. REDU combines three techniques: merging-based de-duplication, which reduces duplicated data transmission; aggregating-based de-redundancy, which reduces redundant information transmission; and cooperative routing, which applies the two schemes efficiently on a practical cluster topology. Analysis and experimental results demonstrate the importance of the multi-failure recovery problem and the efficiency of REDU.
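To make the two traffic-reduction ideas concrete, here is a minimal Python sketch. It is not the algorithm from the paper: the toy XOR code, the block names, and the assumption that every surviving block sits behind one shared network link (with both replacement nodes on the other side) are all illustrative. It only counts how many blocks would cross that link under three strategies.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Hypothetical toy stripe: four data blocks and two XOR parities.
d = [bytes([v] * 4) for v in (1, 2, 3, 4)]
p0 = xor_blocks(d)             # p0 = d0 ^ d1 ^ d2 ^ d3
p1 = xor_blocks([d[0], d[1]])  # p1 = d0 ^ d1

# Assume d0 and d2 fail at the same time.
surviving = {"d1": d[1], "d3": d[3], "p0": p0, "p1": p1}

# Which surviving blocks each lost block is the XOR of (toy code only).
recipes = {
    "d0": ["p1", "d1"],        # d0 = p1 ^ d1
    "d2": ["p0", "p1", "d3"],  # d2 = p0 ^ p1 ^ d3
}

# Naive: every recovery fetches all of its blocks independently.
naive_transfers = sum(len(names) for names in recipes.values())

# Merging (de-duplication): a block needed by both recoveries crosses once.
merged_transfers = len(set().union(*recipes.values()))

# Aggregating (de-redundancy): blocks are combined before the shared link,
# so only one combined block per lost block crosses it.
aggregated = {
    lost: xor_blocks([surviving[name] for name in names])
    for lost, names in recipes.items()
}
aggregated_transfers = len(aggregated)

print(naive_transfers, merged_transfers, aggregated_transfers)
```

In this toy setting the shared link carries five blocks in the naive case, four after merging, and two after aggregation, and the aggregated blocks equal the lost data. Whether merging or aggregating is possible in a real cluster depends on where the blocks and replacement nodes sit, which is the role the abstract assigns to cooperative routing.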

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (Nos. 61379146, 61272483, 61402511, and 61402514) and by the Fund of NUDT (No. JC13-06-03).

Author information

Authors and Affiliations

School of Computer Science and Technology, National University of Defence Technology, Changsha, China
Jing Zhang, Shanshan Li & Xiangke Liao

Corresponding author

Correspondence to Jing Zhang.

About this article

Cite this article

Zhang, J., Li, S. & Liao, X. REDU: reducing redundancy and duplication for multi-failure recovery in erasure-coded storages. J Supercomput 72, 3281–3296 (2016). https://doi.org/10.1007/s11227-015-1397-9

Issue Date: September 2016
