Distributed deduplication with fingerprint index management model for big data storage in the cloud

  • Special Issue
  • Evolutionary Intelligence

Abstract

As data volumes grow within data centers, cloud storage models face several challenges in storing data and in providing the capacity to move it within an acceptable time frame. This study aims to develop a distributed deduplication model that achieves scalable throughput and capacity by using many data servers to deduplicate data in parallel with minimal loss. This paper proposes a new cloud storage model based on distributed deduplication with fingerprint index management (DDFI). The DDFI model operates in three main stages. In the first stage, it applies an effective routing technique based on the similarity level of the data, which keeps network overhead low by rapidly identifying storage locations. In the second stage, duplicate data is identified using the MD5 algorithm. In the final stage, a fingerprint index management process is executed, in which the fingerprint index comprises the fingerprint and the corresponding position details of every written chunk. To optimize deduplication performance, the DDFI model manages the fingerprint index in storage space and writes it to disk only occasionally, while the cloud database scheme is idle. The simulation results show that the DDFI model achieves a higher deduplication ratio (DR) with minimal network bandwidth overhead. The detailed comparative analysis indicates that the DDFI model offers the maximum relative DR and deduplication performance together with the minimum read and write bandwidth.
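
The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the chunking scheme, the similarity measure, or the index format, so fixed-size chunks, a sampled-fingerprint overlap score, an in-memory dictionary, and a JSON flush are all assumptions, and every name (`Node`, `route`, `store`, `flush_if_idle`) is hypothetical. Only the use of MD5 for fingerprints and the idea of deferring index writes until the system is idle come from the abstract.

```python
import hashlib
import json


def chunk_fingerprint(chunk: bytes) -> str:
    """Stage 2: MD5 fingerprint of one chunk (MD5 is named in the abstract)."""
    return hashlib.md5(chunk).hexdigest()


def fixed_size_chunks(data: bytes, size: int = 4096):
    """Split data into fixed-size chunks (the chunking scheme is an assumption)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Node:
    """One data server holding chunks and their fingerprint index (hypothetical)."""

    def __init__(self, name: str):
        self.name = name
        self.index = {}               # fingerprint -> (offset, length) of the chunk
        self.container = bytearray()  # stand-in for the node's chunk storage
        self.dirty = False            # True when the index has unflushed entries

    def write(self, chunk: bytes) -> bool:
        """Store the chunk only if its fingerprint is new; return True if stored."""
        fp = chunk_fingerprint(chunk)
        if fp in self.index:
            return False              # duplicate: nothing is written
        self.index[fp] = (len(self.container), len(chunk))
        self.container += chunk
        self.dirty = True
        return True

    def flush_if_idle(self, idle: bool, path: str) -> bool:
        """Stage 3: persist the index only when the system reports being idle."""
        if idle and self.dirty:
            with open(path, "w") as f:
                json.dump(self.index, f)
            self.dirty = False
            return True
        return False


def route(fingerprints, nodes, sample: int = 4):
    """Stage 1: choose the node most similar to the incoming data.

    Assumed similarity: how many of the `sample` smallest fingerprints the
    node already indexes. Ties go to the node holding the fewest chunks.
    """
    representatives = sorted(set(fingerprints))[:sample]

    def score(node):
        hits = sum(1 for fp in representatives if fp in node.index)
        return (-hits, len(node.index))

    return min(nodes, key=score)


def store(data: bytes, nodes, chunk_size: int = 4096):
    """Route the data to one node and return (node, bytes actually written)."""
    chunks = fixed_size_chunks(data, chunk_size)
    node = route([chunk_fingerprint(c) for c in chunks], nodes)
    written = sum(len(c) for c in chunks if node.write(c))
    return node, written


def deduplication_ratio(logical_bytes: int, stored_bytes: int) -> float:
    """DR taken here as logical size divided by physically stored size."""
    return logical_bytes / stored_bytes
```

Storing the same payload twice with `store(data, nodes)` sends both copies to the same node, and the second call writes no new bytes, which is what raises the deduplication ratio. Routing on a small sample of fingerprints rather than the full set is one way to keep the routing message small, in the spirit of the low network overhead the abstract describes.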

Author information

Corresponding author

Correspondence to S. Sabeetha Saraswathi.

About this article

Cite this article

Saraswathi, S.S., Malarvizhi, N. Distributed deduplication with fingerprint index management model for big data storage in the cloud. Evol. Intel. 14, 683–690 (2021). https://doi.org/10.1007/s12065-020-00395-8

  • DOI: https://doi.org/10.1007/s12065-020-00395-8
