Leach: an automatic learning cache for inline primary deduplication system

  • Research Article
  • Frontiers of Computer Science

Abstract

Deduplication technology is increasingly used to reduce storage costs. Although it has been applied successfully to backup and archival systems, existing techniques are hard to deploy in primary storage systems because of the latency cost of detecting duplicated data: every unit must be checked against a very large fingerprint index before it is written. In this paper we introduce Leach, a self-learning in-memory fingerprint cache for inline primary storage that reduces the write cost of a deduplication system. Leach is motivated by a characteristic of real-world I/O workloads: the access patterns of duplicated data are highly skewed. Leach uses a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns, and maintains hot working sets in cache memory, with the goal of servicing the majority of duplicate-detection lookups. Leveraging the working set property, Leach provides optimizations that reduce the cost of splay operations on the fingerprint index and of cache updates. In comprehensive experiments on several real-world datasets, Leach outperforms a conventional LRU (least recently used) cache policy by reducing the number of cache misses, and significantly improves write performance without a great impact on cache hits.
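
The general idea of pairing a splay-tree fingerprint index with a small in-memory cache can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not describe Leach's data layout or its optimizations, so everything below is an assumption made for illustration. In particular, the names `FingerprintIndex`, `write`, and `top_keys` are hypothetical, the "on-disk" index is simply an in-memory tree, SHA-1 is an arbitrary fingerprint choice, and the policy of caching the keys nearest the root is one plausible reading of "maintains hot working sets", not a documented rule.

```python
import hashlib
from collections import deque


class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def rotate_right(h):
    x = h.left
    h.left, x.right = x.right, h
    return x


def rotate_left(h):
    x = h.right
    h.right, x.left = x.left, h
    return x


def splay(h, key):
    """Move `key` (or the last node on its search path) to the root."""
    if h is None or key == h.key:
        return h
    if key < h.key:
        if h.left is None:
            return h
        if key < h.left.key:                      # zig-zig
            h.left.left = splay(h.left.left, key)
            h = rotate_right(h)
        elif key > h.left.key:                    # zig-zag
            h.left.right = splay(h.left.right, key)
            if h.left.right is not None:
                h.left = rotate_left(h.left)
        return h if h.left is None else rotate_right(h)
    if h.right is None:
        return h
    if key > h.right.key:                         # zig-zig (mirror)
        h.right.right = splay(h.right.right, key)
        h = rotate_left(h)
    elif key < h.right.key:                       # zig-zag (mirror)
        h.right.left = splay(h.right.left, key)
        if h.right.left is not None:
            h.right = rotate_right(h.right)
    return h if h.right is None else rotate_left(h)


def top_keys(root, limit):
    """Assumed cache policy: keep the `limit` keys closest to the root."""
    keys = set()
    queue = deque([root] if root is not None else [])
    while queue and len(keys) < limit:
        n = queue.popleft()
        keys.add(n.key)
        queue.extend(c for c in (n.left, n.right) if c is not None)
    return keys


class FingerprintIndex:
    """Hypothetical write path: check the cache first, then the splay tree."""

    def __init__(self, cache_size):
        self.root, self.cache, self.cache_size = None, set(), cache_size
        self.hits = self.misses = 0

    def write(self, block):
        """Return True if `block` duplicates previously written data."""
        fp = hashlib.sha1(block).hexdigest()
        if fp in self.cache:                      # served from memory
            self.hits += 1
            return True
        self.misses += 1                          # fall back to the full index
        self.root = splay(self.root, fp)
        duplicate = self.root is not None and self.root.key == fp
        if not duplicate:
            self._insert_at_root(fp)
        self.cache = top_keys(self.root, self.cache_size)
        return duplicate

    def _insert_at_root(self, fp):
        # After splay(), the root is the predecessor or successor of fp.
        node, old = Node(fp), self.root
        if old is not None and fp < old.key:
            node.left, node.right, old.left = old.left, old, None
        elif old is not None:
            node.left, node.right, old.right = old, old.right, None
        self.root = node
```

Calling `FingerprintIndex(cache_size=2).write(b"some block")` twice returns `False` and then `True`, with the second call answered from the cache. Because splaying moves each accessed fingerprint to the root, repeatedly written blocks tend to stay near the top of the tree, which is why a skewed workload lets a small cache absorb most lookups in this sketch.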

Author information

Corresponding author

Correspondence to Bin Lin.

Additional information

Bin Lin is currently a PhD candidate in computer science and technology at the School of Computer, National University of Defense Technology, China. His research interests include operating systems and storage management.

Shanshan Li is currently an associate professor in computer science and technology at the School of Computer, National University of Defense Technology, China. Her research interests include operating systems, wireless sensor networks, and storage management.

Xiangke Liao is currently a professor in computer science and technology at the School of Computer, National University of Defense Technology, China. His research interests include operating systems and high performance computing.

Jing Zhang is currently a PhD candidate in computer science and technology at the School of Computer, National University of Defense Technology, China. His research interests include operating systems and storage management.

Xiaodong Liu is currently an associate professor in computer science and technology at the School of Computer, National University of Defense Technology, China. His research interests include operating systems and social networks.

About this article

Cite this article

Lin, B., Li, S., Liao, X. et al. Leach: an automatic learning cache for inline primary deduplication system. Front. Comput. Sci. 8, 175–183 (2014). https://doi.org/10.1007/s11704-014-3377-2

  • DOI: https://doi.org/10.1007/s11704-014-3377-2
