Middleware 2009, pp. 103–122

Efficient Locally Trackable Deduplication in Replicated Systems

  • João Barreto
  • Paulo Ferreira
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5896)

Abstract

We propose a novel technique for distributed data deduplication in distributed storage systems. We combine version tracking with high-precision, local similarity detection techniques. Compared with the prominent techniques of delta encoding and compare-by-hash, our solution borrows most of the advantages that distinguish each of these alternatives. A thorough experimental evaluation, comparing a full-fledged implementation of our technique against popular systems based on delta encoding and compare-by-hash, confirms gains in performance and transferred volumes for a wide range of real workloads and scenarios.

Keywords

Data deduplication; data replication; distributed file systems; compare-by-hash; delta encoding

Copyright information

© IFIP International Federation for Information Processing 2009

Authors and Affiliations

  • João Barreto (1)
  • Paulo Ferreira (1)
  1. Distributed Systems Group, INESC-ID/Technical University of Lisbon, Portugal
