Computing, Volume 98, Issue 3, pp 319–341

A study of the performance of novel storage-centric repairable codes

  • Anwitaman Datta
  • Lluis Pamies-Juarez
  • Frédérique Oggier

Abstract

Erasure coding has become an integral part of the storage infrastructure in data centers and cloud backends, since it provides significantly higher fault tolerance for substantially lower storage overhead than a naive approach such as n-way replication. Fault tolerance refers to the ability to achieve very high availability despite (temporary) failures, but for long-term data durability, the redundancy provided by erasure coding needs to be replenished as storage nodes fail or are retired. Traditional erasure codes are not easily amenable to repairs, and their repair process is usually both expensive and slow. Consequently, in recent years, numerous novel codes tailor-made for distributed storage have been proposed to optimize the repair process. Broadly, most of these codes belong to one of two families: network-coding-inspired regenerating codes, which aim to minimize the traffic per repair, and locally repairable codes (LRC), which minimize the number of nodes contacted per repair (which in turn reduces repair traffic and latency). Existing studies of these codes, however, restrict themselves to the repair of individual data objects in isolation. They ignore many practical issues that a real system storing multiple objects needs to take into account. Our goal is to explore a subset of such issues, particularly those pertaining to the scenario where multiple objects are stored in the system. We use a simulation-based approach, which models the network bottlenecks at the edges of a distributed storage system, and the nodes' load and (un)availability. Specifically, we abstract the key features of both regenerating codes and LRC, and examine the effect of data placement and the corresponding (de)correlation of failures, as well as the competition for limited network resources when multiple objects need to be repaired simultaneously, by exploring the interplay of code parameters and the trade-offs between bandwidth usage and speed of repairs.
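
The trade-offs named in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the parameters (3-way replication, a (9, 6) code, d = 8 helper nodes, locality 3) and all function names are assumptions chosen for the example. It uses the standard textbook figures: an (n, k) maximum distance separable (MDS) code tolerates n - k failures at overhead n/k and naively reads k blocks to rebuild one, while a minimum-storage regenerating code contacting d helpers downloads d/(d - k + 1) times the size of the lost block.

```python
from fractions import Fraction


def replication(copies: int) -> dict:
    """n-way replication: each copy is a full replica of the object."""
    return {
        "storage_overhead": Fraction(copies),
        "failures_tolerated": copies - 1,
        "nodes_contacted": 1,                    # copy from any surviving replica
        "traffic_per_lost_unit": Fraction(1),
    }


def mds_erasure_code(n: int, k: int) -> dict:
    """(n, k) MDS code with naive repair: decode from k blocks, re-encode one."""
    return {
        "storage_overhead": Fraction(n, k),
        "failures_tolerated": n - k,
        "nodes_contacted": k,
        "traffic_per_lost_unit": Fraction(k),    # k blocks read per block rebuilt
    }


def msr_regenerating_repair(k: int, d: int) -> dict:
    """Repair cost at the minimum-storage regenerating point, d >= k helpers."""
    if d < k:
        raise ValueError("a regenerating repair needs at least k helpers")
    return {
        "nodes_contacted": d,
        "traffic_per_lost_unit": Fraction(d, d - k + 1),
    }


def lrc_repair(locality: int) -> dict:
    """Locally repairable code: one block is rebuilt from `locality` others."""
    return {
        "nodes_contacted": locality,
        "traffic_per_lost_unit": Fraction(locality),
    }


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to illustrate the comparison.
    for name, figures in [
        ("3-way replication", replication(3)),
        ("(9, 6) MDS code", mds_erasure_code(9, 6)),
        ("regenerating, k=6, d=8", msr_regenerating_repair(6, 8)),
        ("LRC, locality 3", lrc_repair(3)),
    ]:
        print(name, {key: str(value) for key, value in figures.items()})
```

Under these assumed parameters the regenerating repair contacts more nodes than the naive MDS repair but moves less data, while the LRC repair contacts the fewest nodes among the coded schemes, which is the contrast between the two families that the abstract describes.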

Keywords

Storage systems, Erasure codes, Repairability, Regenerating codes, Locally repairable codes, Self-repairing codes

Mathematics Subject Classification

94Bxx 

Copyright information

© Springer-Verlag Wien 2015

Authors and Affiliations

  • Anwitaman Datta (1)
  • Lluis Pamies-Juarez (2)
  • Frédérique Oggier (3)
  1. School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
  2. HGST Research, San Jose, USA
  3. School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
