Journal of Computer Science and Technology, Volume 31, Issue 4, pp 820–835

A Data Deduplication Framework of Disk Images with Adaptive Block Skipping

  • Bing Zhou
  • Jiang-Tao Wen
Regular Paper


We describe an efficient and easily applicable data deduplication framework for real-world datasets such as disk images. The framework uses adaptive block skipping based on heuristic prediction to reduce deduplication-related overheads and improve deduplication throughput while maintaining good deduplication efficiency. Under the framework, deduplication operations are skipped for data chunks that heuristic prediction determines to be likely non-duplicates. Skipping is combined with a hit and matching extension process, which identifies duplicates within skipped blocks, and a hysteresis-based hash indexing process, which updates the hash indices for skipped chunks that are re-encountered. For performance evaluation, the proposed framework was implemented in and integrated with the existing Data Domain and Sparse Indexing deduplication algorithms. Experimental results on a real-world dataset of 1.0 TB of disk images showed that adaptive block skipping significantly reduced deduplication-related overheads. For Data Domain, with deduplication metadata stored on disk, deduplication throughput improved by 30%~80%. For Sparse Indexing, with an in-RAM sparse index, RAM usage fell by 25%~40% and deduplication throughput improved by 15%~20%. In both cases, the corresponding reduction in deduplication ratio was below 5%.
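The trade-off described in the abstract, fewer deduplication operations in exchange for a small loss in deduplication ratio, can be illustrated with a minimal sketch. This is not the paper's algorithm: the run-length heuristic, the function name `dedup`, and the parameters `miss_threshold` and `skip_len` are assumptions made for illustration only, and the hit and matching extension and hysteresis indexing steps are omitted.

```python
import hashlib


def dedup(chunks, miss_threshold=None, skip_len=4):
    """Return (stored_chunks, index_lookups) for a stream of byte chunks.

    Hypothetical heuristic (an assumption, not taken from the paper):
    after `miss_threshold` consecutive index misses, the next `skip_len`
    chunks are predicted to be non-duplicates and are stored without
    hashing or index lookup. miss_threshold=None disables skipping.
    """
    index = set()      # fingerprints of chunks that went through the index
    stored = 0         # chunks written to storage
    lookups = 0        # fingerprint index lookups performed
    misses = 0         # current run of consecutive index misses
    to_skip = 0        # chunks still to be skipped
    for chunk in chunks:
        if to_skip > 0:
            # Predicted non-duplicate: store it, skip hashing and lookup.
            to_skip -= 1
            stored += 1
            continue
        digest = hashlib.sha1(chunk).digest()
        lookups += 1
        if digest in index:
            misses = 0                 # duplicate found, reset the run
        else:
            index.add(digest)
            stored += 1
            misses += 1
            if miss_threshold is not None and misses >= miss_threshold:
                to_skip = skip_len     # start skipping
                misses = 0
    return stored, lookups


if __name__ == "__main__":
    # Toy stream: 12 unique chunks followed by 4 repeats of earlier chunks.
    unique = [bytes([i]) * 8 for i in range(12)]
    stream = unique + [unique[0], unique[2], unique[0], unique[2]]
    print("baseline:", dedup(stream))
    print("skipping:", dedup(stream, miss_threshold=2, skip_len=2))
```

In this toy model, skipping lowers the number of index lookups, but a duplicate of a chunk that was skipped earlier is missed on its first re-encounter because its fingerprint was never indexed. The abstract's hit and matching extension and hysteresis indexing processes are the mechanisms the authors describe for limiting that kind of loss.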


data deduplication, metadata, adaptive block skipping





Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. State Key Laboratory on Intelligent Technology and Systems, Tsinghua University, Beijing, China
  2. Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China
  3. Department of Computer Science and Technology, Tsinghua University, Beijing, China
