Abstract
Big data is growing at an unprecedented rate, with text data accounting for a large share, and redundancy is commonly introduced to ensure the availability of this data. The rapid growth of unstructured text data, however, works against the primary purpose of big data, making it difficult to store and search. Data compression is one way to optimize the use of storage space, and deduplication is among the most effective compression techniques. This paper proposes a two-phase data deduplication mechanism for text data. In the syntactic phase, a combination of clustering and a Bloom filter is used; in the semantic phase, a combination of SVD and WordNet synsets is employed. Experimental results show the efficacy of the proposed system.
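The syntactic phase relies on a Bloom filter to test cheaply whether a text chunk has been seen before. The sketch below is not the paper's implementation; it is a minimal illustration of the general technique, with an assumed bit-array size, hash count, and `BloomFilter` class name chosen for the example. A "not seen" answer is always correct, while a "seen" answer may be a false positive, so true deduplication systems verify positives before discarding data.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for approximate duplicate detection of text chunks.

    Illustrative parameters only: 1024 bits and 3 hash functions are
    arbitrary choices, not values taken from the paper.
    """

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "possibly seen" (false positives possible);
        # False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))

# Deduplicate a small stream of text chunks.
bf = BloomFilter()
unique = []
for chunk in ["big data", "text data", "big data"]:
    if not bf.might_contain(chunk):
        bf.add(chunk)
        unique.append(chunk)
print(unique)
```

With a bit array sized generously relative to the number of items, the false-positive rate stays low; the trade-off between memory and accuracy is the classic space/time result for Bloom filters.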
© 2018 Springer Nature Singapore Pte Ltd.
Cite this paper
Podder, S., Mukherjee, S. (2018). A Bloom Filter-Based Data Deduplication for Big Data. In: Kolhe, M., Trivedi, M., Tiwari, S., Singh, V. (eds) Advances in Data and Information Sciences. Lecture Notes in Networks and Systems, vol 38. Springer, Singapore. https://doi.org/10.1007/978-981-10-8360-0_15
Print ISBN: 978-981-10-8359-4
Online ISBN: 978-981-10-8360-0
eBook Packages: Engineering (R0)