Middleware 2009, pp 103-122
Efficient Locally Trackable Deduplication in Replicated Systems
Conference paper
Abstract
We propose a novel technique for distributed data deduplication in distributed storage systems. We combine version tracking with high-precision, local similarity detection techniques. When compared with the prominent techniques of delta encoding and compare-by-hash, our solution retains most of the advantages that distinguish each of these alternatives. A thorough experimental evaluation, comparing a full-fledged implementation of our technique against popular systems based on delta encoding and compare-by-hash, confirms gains in performance and reductions in transferred volume for a wide range of real workloads and scenarios.
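For readers unfamiliar with the compare-by-hash baseline named above, the following is a minimal sketch of the general idea, not the technique proposed in this paper. It assumes fixed-size chunks, SHA-1 digests, and a sender that can consult the receiver's chunk index directly; in a real system the two sides would exchange hashes over the network. All function names (`split_chunks`, `plan_transfer`, `apply_transfer`) are hypothetical and chosen only for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, chosen only to keep the example readable


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digest(chunk: bytes) -> str:
    """Identify a chunk by the hex SHA-1 digest of its content."""
    return hashlib.sha1(chunk).hexdigest()


def plan_transfer(new_version: bytes, receiver_store: dict):
    """Return (recipe, payload).

    recipe  : ordered list of chunk hashes describing new_version
    payload : only the chunks whose hash the receiver does not hold yet
    Assumption: the sender can look up the receiver's index directly.
    """
    recipe = []
    payload = {}
    for chunk in split_chunks(new_version):
        h = digest(chunk)
        recipe.append(h)
        if h not in receiver_store:
            payload[h] = chunk
    return recipe, payload


def apply_transfer(recipe: list, payload: dict, receiver_store: dict) -> bytes:
    """Store the newly received chunks, then rebuild the file from the recipe."""
    receiver_store.update(payload)
    return b"".join(receiver_store[h] for h in recipe)


# The receiver already holds the old version; only one chunk differs in the new one.
old = b"aaaabbbbccccdddd"
new = b"aaaabbbbXXXXdddd"
store = {digest(c): c for c in split_chunks(old)}

recipe, payload = plan_transfer(new, store)
rebuilt = apply_transfer(recipe, payload, store)
sent_bytes = sum(len(c) for c in payload.values())
```

In this toy run only the changed chunk is shipped as data, while the recipe of hashes still has to be transmitted for every chunk. Delta encoding, by contrast, would compute the difference against a known earlier version of the same file.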
Keywords
Data deduplication, data replication, distributed file systems, compare-by-hash, delta encoding
Copyright information
© IFIP International Federation for Information Processing 2009