A Scalable Inline Cluster Deduplication Framework for Big Data Protection

Fu, Yinjin; Jiang, Hong; Xiao, Nong

doi:10.1007/978-3-642-35170-9_18

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 7662)

Included in the following conference series:

ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing

Abstract

Cluster deduplication has become a widely deployed technology in data protection services for Big Data to satisfy the requirements of service level agreements (SLAs). However, it remains a great challenge for cluster deduplication to strike a sensible tradeoff between the conflicting goals of scalable deduplication throughput and a high duplicate elimination ratio in cluster systems with low-end individual secondary storage nodes. We propose ∑-Dedupe, a scalable inline cluster deduplication framework, as middleware deployable in cloud data centers, to meet this challenge by exploiting data similarity and locality to optimize cluster deduplication in inter-node and intra-node scenarios, respectively. Governed by a similarity-based stateful data routing scheme, ∑-Dedupe uses a handprinting technique to assign similar data to the same backup server at super-chunk granularity, which maintains high cluster-deduplication efficiency without cross-node deduplication while balancing the workload that backup clients place on the servers. Meanwhile, ∑-Dedupe builds a similarity index over the traditional locality-preserved caching design to alleviate the chunk index-lookup bottleneck in each node. Extensive evaluation of our ∑-Dedupe prototype against state-of-the-art schemes, driven by real-world datasets, demonstrates that ∑-Dedupe achieves a cluster-wide duplicate elimination ratio almost as high as that of the high-overhead and poorly scalable traditional stateful routing scheme, at an overhead only slightly higher than that of the scalable but low duplicate-elimination-ratio stateless routing approaches.
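
The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it assumes several details the abstract does not spell out: a handprint is taken to be the k smallest chunk fingerprints of a super-chunk, candidate servers are chosen by taking each representative fingerprint modulo the number of nodes, and the winner is the candidate with the most handprint matches discounted by its relative load. The names `fingerprint`, `handprint`, `Node`, and `route`, the use of SHA-1, and the default k = 4 are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(chunk: bytes) -> int:
    """Chunk fingerprint as an integer (SHA-1 is an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def handprint(chunks: list[bytes], k: int = 4) -> list[int]:
    """Assumed handprint: the k smallest distinct chunk fingerprints of a super-chunk."""
    return sorted({fingerprint(c) for c in chunks})[:k]


@dataclass
class Node:
    """Hypothetical backup server with an in-memory similarity index."""
    similarity_index: set[int] = field(default_factory=set)
    load: int = 0  # number of chunks routed here, used as a simple load proxy

    def matches(self, hp: list[int]) -> int:
        """Count how many representative fingerprints this node has already seen."""
        return sum(1 for fp in hp if fp in self.similarity_index)

    def store(self, hp: list[int], n_chunks: int) -> None:
        self.similarity_index.update(hp)
        self.load += n_chunks


def route(chunks: list[bytes], nodes: list[Node], k: int = 4) -> int:
    """Pick a target node for one super-chunk and record the handprint there."""
    hp = handprint(chunks, k)
    # Only a handful of candidate nodes are consulted, not the whole cluster.
    candidates = sorted({fp % len(nodes) for fp in hp})
    avg_load = sum(n.load for n in nodes) / len(nodes)

    def score(i: int) -> tuple[float, int]:
        node = nodes[i]
        relative_load = (node.load + 1) / (avg_load + 1)
        # Prefer similar nodes; break ties in favour of the less loaded node.
        return (node.matches(hp) / relative_load, -node.load)

    best = max(candidates, key=score)
    nodes[best].store(hp, len(chunks))
    return best
```

In this sketch, routing the same super-chunk twice lands on the same node, because the second lookup finds every representative fingerprint in that node's similarity index. Consulting at most k candidates per super-chunk is what keeps the message overhead close to that of stateless routing in this simplified model.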

Author information

Authors and Affiliations

State Key Laboratory of High Performance Computing, National University of Defense Technology, China
Yinjin Fu & Nong Xiao
Department of Computer Science and Engineering, University of Nebraska-Lincoln, USA
Yinjin Fu & Hong Jiang

Editor information

Editors and Affiliations

Electrical and Computer Engineering Department, Carnegie Mellon University, 4720 Forbes Avenue, 15213, Pittsburgh, PA, USA
Priya Narasimhan
Department of Computer Engineering and Informatics, University of Patras, University Campus, 26504, Rio, Greece
Peter Triantafillou

About this paper

Cite this paper

Fu, Y., Jiang, H., Xiao, N. (2012). A Scalable Inline Cluster Deduplication Framework for Big Data Protection. In: Narasimhan, P., Triantafillou, P. (eds) Middleware 2012. Lecture Notes in Computer Science, vol 7662. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35170-9_18

DOI: https://doi.org/10.1007/978-3-642-35170-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35169-3
Online ISBN: 978-3-642-35170-9
eBook Packages: Computer Science (R0)
