Optimization for data de-duplication algorithm based on file content

Nie, Xuejun; Qin, Leihua; Zhou, Jingli; Liu, Ke; Zhu, Jianfeng; Wang, Yu

doi:10.1007/s12200-010-0103-z

Research Article
Published: 21 July 2010

Frontiers of Optoelectronics in China, Volume 3, pages 308–316 (2010)

Abstract

Content-defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in archival storage systems. Current research on CDC does not consider the distinct content characteristics of different file types: chunk boundaries are determined in a random way, and a single strategy is applied to all file types. Such a method has been shown not to achieve optimal performance for compound archival data. We analyze the content characteristics of different file types and propose the candidate anchor histogram (CAH) to capture them. We propose an improved strategy for determining chunk boundaries based on CAH, and we tune some key parameters of CDC based on the data layout of the underlying data de-duplication file system (TriDFS), which can efficiently store variable-sized chunks on fixed-sized physical blocks. These strategies are evaluated with representative archival data, and the results indicate that, on average, they increase the compression ratio by 16.3% and write throughput by 13.7%, while decreasing read throughput by only 2.5%.
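For readers unfamiliar with CDC, the baseline technique that the abstract sets out to improve can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a simple polynomial rolling hash in place of Rabin fingerprints, and every constant and function name (WINDOW, MASK, MIN_CHUNK, MAX_CHUNK, chunk_boundaries, split_chunks, deduplicate) is hypothetical. It does not attempt to model CAH or TriDFS, whose details are not given in the abstract.

```python
import hashlib

# All constants are illustrative assumptions, not values from the paper.
WINDOW = 16            # sliding-window width in bytes
MASK = (1 << 6) - 1    # candidate anchor when the low 6 hash bits are zero
MIN_CHUNK = 32         # no boundary is declared before this many bytes
MAX_CHUNK = 256        # a boundary is forced at this many bytes
BASE = 257             # polynomial rolling-hash base
MOD = (1 << 31) - 1    # polynomial rolling-hash modulus


def chunk_boundaries(data: bytes) -> list:
    """Return the end offset of every chunk in data."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    boundaries = []
    start = 0  # start offset of the current chunk
    h = 0      # hash of the last WINDOW bytes
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so the hash depends only on the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = i + 1 - start
        is_anchor = (h & MASK) == 0
        if (is_anchor and length >= MIN_CHUNK) or length >= MAX_CHUNK:
            boundaries.append(i + 1)
            start = i + 1
    if start < len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries


def split_chunks(data: bytes) -> list:
    """Cut data into variable-sized chunks at the computed boundaries."""
    chunks, prev = [], 0
    for end in chunk_boundaries(data):
        chunks.append(data[prev:end])
        prev = end
    return chunks


def deduplicate(chunks: list, store: dict) -> list:
    """Store each distinct chunk once; return the file's fingerprint recipe."""
    recipe = []
    for chunk in chunks:
        fingerprint = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # keep only the first copy
        recipe.append(fingerprint)
    return recipe
```

Because a boundary depends only on the bytes inside the sliding window, an insertion or deletion tends to disturb only nearby chunks, which is what lets unchanged regions de-duplicate against earlier versions. The fixed MASK, MIN_CHUNK, and MAX_CHUNK here correspond to the "single strategy for all file types" that the abstract criticizes.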

Author information

Authors and Affiliations

School of Computer Science and Technology, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, 430074, China
Xuejun Nie, Leihua Qin, Jingli Zhou, Ke Liu, Jianfeng Zhu & Yu Wang

Corresponding author

Correspondence to Leihua Qin.

About this article

Nie, X., Qin, L., Zhou, J. et al. Optimization for data de-duplication algorithm based on file content. Front. Optoelectron. China 3, 308–316 (2010). https://doi.org/10.1007/s12200-010-0103-z

Received: 26 February 2010
Accepted: 06 April 2010
Published: 21 July 2010
Issue Date: September 2010
DOI: https://doi.org/10.1007/s12200-010-0103-z
