Abstract
The web is the largest repository of documents available, and retrieving them for various purposes requires crawlers that navigate autonomously, select documents, and process them according to the objectives pursued. Even intuitively, however, it is clear that crawling yields replicated copies, in greater or lesser number, of a significant share of documents. Detecting these duplicates is important because it lightens databases and improves the efficiency of information retrieval engines, and it also improves the precision of cybermetric analyses, web mining studies, etc. Standard hashing techniques detect only exact duplicates, at the bit level. However, many of the duplicates found in the real world are not exactly alike. For example, we can find web pages with the same content but with different headers or meta tags, or rendered with different style sheets. A frequent case is that of the same document in different formats; in these cases the documents are completely different at the binary level. The obvious solution is to compare plain-text conversions of all these formats, but these conversions are never identical, because converters treat formatting elements differently (character handling, diacritics, spacing, paragraphs, ...). In this work we introduce the possibility of using what is known as fuzzy hashing. The idea is to produce fingerprints of files (or documents, etc.), so that comparing two fingerprints gives an estimate of the closeness or distance between the two files or documents. Based on the concept of a “rolling hash”, fuzzy hashing has been used successfully in computer security tasks such as identifying malware and spam, virus scanning, etc.
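The contrast between exact hashing and fingerprint comparison can be illustrated with a minimal Python sketch. It is not the authors' implementation and is not compatible with ssdeep: the function names `fuzzy_fingerprint` and `similarity`, the 7-byte window, the fixed `block_size`, and the use of `difflib` to score fingerprints are all assumptions made for illustration. A rolling sum over the last few bytes decides where chunks end, and each chunk contributes one character to the fingerprint, so a local change tends to alter only a few characters.

```python
import hashlib
from difflib import SequenceMatcher

# Illustrative constants (assumptions, not values taken from the paper).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7


def fuzzy_fingerprint(data: bytes, block_size: int = 16) -> str:
    """Simplified piecewise hash: one character per content-defined chunk."""
    signature = []
    window_sum = 0            # rolling hash: sum of the last WINDOW bytes
    chunk = hashlib.md5()     # hash of the bytes in the current chunk
    chunk_has_data = False
    for i, byte in enumerate(data):
        window_sum += byte
        if i >= WINDOW:
            window_sum -= data[i - WINDOW]   # drop the byte leaving the window
        chunk.update(bytes([byte]))
        chunk_has_data = True
        # The trigger depends only on the last WINDOW bytes, so chunk
        # boundaries line up again shortly after a local change.
        if window_sum % block_size == block_size - 1:
            signature.append(ALPHABET[chunk.digest()[0] % 64])
            chunk = hashlib.md5()
            chunk_has_data = False
    if chunk_has_data:        # emit the trailing partial chunk
        signature.append(ALPHABET[chunk.digest()[0] % 64])
    return "".join(signature)


def similarity(fp_a: str, fp_b: str) -> int:
    """Score two fingerprints from 0 (unrelated) to 100 (identical)."""
    matcher = SequenceMatcher(None, fp_a, fp_b, autojunk=False)
    return round(100 * matcher.ratio())


if __name__ == "__main__":
    body = " ".join(f"Sentence {i} of the shared page body." for i in range(30))
    page_a = ("<title>Mirror A</title>" + body).encode("utf-8")
    page_b = ("<title>Copy on host B</title>" + body).encode("utf-8")
    same_bits = hashlib.sha256(page_a).digest() == hashlib.sha256(page_b).digest()
    print("exact hashes equal:", same_bits)
    print("fingerprint similarity:",
          similarity(fuzzy_fingerprint(page_a), fuzzy_fingerprint(page_b)))
```

A production tool would typically also adapt the block size to the input length and use a stronger rolling hash; the fixed values here only keep the sketch short.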
We have added fuzzy hashing capabilities to a lightweight crawler and have run several tests in a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
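How a similarity threshold might be applied to crawled documents can be sketched as follows. This is an illustration under stated assumptions, not the crawler described here: the helpers `normalize`, `score`, and `group_duplicates` are hypothetical, `difflib` stands in for a fuzzy-hash comparison score, and the threshold of 90 is an arbitrary value, since the abstract does not report the thresholds that were measured.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Dict, List


def normalize(text: str) -> str:
    """Fold case, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().split())


def score(text_a: str, text_b: str) -> int:
    """Similarity from 0 to 100 (a stand-in for a fuzzy-hash comparison)."""
    matcher = SequenceMatcher(None, normalize(text_a), normalize(text_b),
                              autojunk=False)
    return round(100 * matcher.ratio())


def group_duplicates(docs: Dict[str, str], threshold: int) -> List[List[str]]:
    """Greedy grouping: a document joins the first group whose first
    member scores at or above the threshold, else it starts a new group."""
    groups: List[List[str]] = []
    for url, text in docs.items():
        for group in groups:
            if score(docs[group[0]], text) >= threshold:
                group.append(url)
                break
        else:
            groups.append([url])
    return groups
```

Lowering the threshold merges more documents into each group, while raising it approaches exact matching, which is why the threshold has to be calibrated against real data.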
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Figuerola, C.G., Díaz, R.G., Alonso Berrocal, J.L., Zazo Rodríguez, A.F. (2011). Web Document Duplicate Detection Using Fuzzy Hashing. In: Corchado, J.M., Pérez, J.B., Hallenborg, K., Golinska, P., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19931-8_15
DOI: https://doi.org/10.1007/978-3-642-19931-8_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19930-1
Online ISBN: 978-3-642-19931-8
eBook Packages: Engineering, Engineering (R0)