Abstract
Big data is growing at an unprecedented rate, with text data accounting for a large share, and redundancy is commonly introduced to ensure the availability of this data. The rapid growth of unstructured text data, however, works against the primary purpose of big data, making it difficult to store and search. Data compression is one way to optimize the use of storage space, and deduplication is among the most effective compression techniques. This paper proposes a two-phase data deduplication mechanism for text data. In the syntactic phase, a combination of clustering and a Bloom filter is used; in the semantic phase, a combination of SVD and WordNet synsets is employed. Experimental results show the efficacy of the proposed system.
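The syntactic phase relies on a Bloom filter to test cheaply whether a text chunk has been seen before. The sketch below is not the paper's implementation; it is a minimal illustration of the general technique, with an assumed bit-array size, hash count, and `BloomFilter` class name chosen for the example. A "not seen" answer is always correct, while a "seen" answer may be a false positive, so true deduplication systems verify positives before discarding data.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for approximate duplicate detection of text chunks.

    Illustrative parameters only: 1024 bits and 3 hash functions are
    arbitrary choices, not values taken from the paper.
    """

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly seen" (false positives possible);
        # False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))

# Deduplicate a small stream of text chunks.
bf = BloomFilter()
unique = []
for chunk in ["big data", "text data", "big data"]:
    if not bf.might_contain(chunk):
        bf.add(chunk)
        unique.append(chunk)
print(unique)
```

With a bit array sized generously relative to the number of items, the false-positive rate stays low; the trade-off between memory and accuracy is the classic space/time result for Bloom filters.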
© 2018 Springer Nature Singapore Pte Ltd.
Cite this paper
Podder, S., Mukherjee, S. (2018). A Bloom Filter-Based Data Deduplication for Big Data. In: Kolhe, M., Trivedi, M., Tiwari, S., Singh, V. (eds) Advances in Data and Information Sciences. Lecture Notes in Networks and Systems, vol 38. Springer, Singapore. https://doi.org/10.1007/978-981-10-8360-0_15
Print ISBN: 978-981-10-8359-4
Online ISBN: 978-981-10-8360-0
eBook Packages: Engineering (R0)