A Survey on Efficient Data Deduplication in Data Analytics

  • Ch. Prathima
  • L. S. S. Reddy
Part of the SpringerBriefs in Applied Sciences and Technology book series (BRIEFSAPPLSCIENCES)


The demand for data storage capacity is increasing dramatically. As storage requirements grow, the computing world is turning toward cloud storage, where data security and cost are essential concerns. A duplicate document not only wastes storage but also increases access time, so recognizing and removing duplicate data is an essential task. Data deduplication, an efficient data reduction method, has gained increasing attention and acceptance in large-scale storage systems. It eliminates redundant data at the file or subfile level and identifies duplicated content by its cryptographically secure hash signature. Detecting duplicates is difficult because duplicate records often share no common key and may contain errors. This paper presents the background and key features of data deduplication, then summarizes and classifies the data deduplication process according to its key workflow.
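As an illustration of identifying duplicated content by a cryptographically secure hash signature at the subfile level, the following is a minimal sketch and not the authors' method. It assumes fixed-size chunking with a 4096-byte chunk size and SHA-256 as the hash; the function names `fixed_size_chunks`, `deduplicate`, and `restore` are hypothetical and chosen only for this example.

```python
import hashlib


def fixed_size_chunks(data: bytes, size: int = 4096):
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def deduplicate(data: bytes, size: int = 4096):
    """Store each distinct chunk once, keyed by its SHA-256 fingerprint.

    Returns the chunk store and a "recipe": the ordered list of
    fingerprints needed to rebuild the original data.
    """
    store = {}    # fingerprint -> chunk bytes (each unique chunk kept once)
    recipe = []   # fingerprints in original order
    for chunk in fixed_size_chunks(data, size):
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # skip chunks already stored
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    # Three identical chunks of "A" followed by one chunk of "B".
    data = b"A" * 4096 * 3 + b"B" * 4096
    store, recipe = deduplicate(data)
    print(len(recipe), "chunks referenced,", len(store), "chunks stored")
    assert restore(store, recipe) == data
```

Fixed-size chunking is the simplest choice for a sketch, but inserting a single byte shifts every later chunk boundary. Content-defined chunking (the CDC keyword) places boundaries based on the data itself to avoid that shift.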


Keywords: Deduplication, Chunking, Hashing, CDC, Encryption



Copyright information

© The Author(s) 2019

Authors and Affiliations

  • Ch. Prathima (1, 2)
  • L. S. S. Reddy (1)
  1. K L University, Vaddeswaram, Guntur, India
  2. Data Analytics Research Lab, Department of IT, Sree Vidyanikethan Engineering College, Tirupati, India
