Data deduplication techniques for efficient cloud storage management: a systematic review

The Journal of Supercomputing

Abstract

The exponential growth of digital data in cloud storage systems is a critical issue, because the large amount of duplicate data they hold places an extra load on them. Deduplication is an efficient technique that has gained attention in large-scale storage systems: it eliminates redundant data, improves storage utilization and reduces storage cost. This paper presents a broad, methodical literature review of existing data deduplication techniques for cloud data storage, along with the existing taxonomies of those techniques. Furthermore, the paper investigates deduplication techniques for text and multimedia data, along with their corresponding taxonomies, as these data types pose different challenges for duplicate data detection. This research work is useful for identifying deduplication techniques for text, image and video data. It also discusses existing challenges and significant research directions in deduplication for future researchers, and concludes with a summary of valuable suggestions for future enhancements in deduplication.
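
As a rough illustration of how deduplication eliminates redundant data, the following minimal sketch splits a byte string into fixed-size chunks, fingerprints each chunk with SHA-256, and stores each distinct chunk only once. It is not taken from the paper: the function names `deduplicate` and `restore`, the fixed-size chunking strategy and the 4-byte chunk size are all assumptions chosen for brevity, and real systems typically use much larger or content-defined chunks.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Hypothetical helper for illustration only; chunk_size is an assumed value.
    """
    store = {}   # fingerprint -> chunk bytes, stored a single time
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # skip chunks already stored
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


data = b"ABCDABCDABCDWXYZ"
store, recipe = deduplicate(data, chunk_size=4)
stored_bytes = sum(len(chunk) for chunk in store.values())
```

In this toy input, three of the four chunks are identical, so the store holds two chunks while the recipe still lists four fingerprints; `restore(store, recipe)` returns the original bytes.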

Acknowledgements

This research was supported by the Department of Science and Technology, Government of India, under the WOS (Women Scientists Scheme) sponsored research project entitled “Distributed Data Deduplication Technique for efficient Cloud Based Storage System”, File No. SR/WOS-A/ET-119/2016.

Author information

Corresponding author

Correspondence to Ravneet Kaur.

About this article

Cite this article

Kaur, R., Chana, I. & Bhattacharya, J. Data deduplication techniques for efficient cloud storage management: a systematic review. J Supercomput 74, 2035–2085 (2018). https://doi.org/10.1007/s11227-017-2210-8

  • DOI: https://doi.org/10.1007/s11227-017-2210-8
