Web Document Duplicate Detection Using Fuzzy Hashing

  • Carlos G. Figuerola
  • Raquel Gómez Díaz
  • José L. Alonso Berrocal
  • Angel F. Zazo Rodríguez
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 90)

Abstract

The web is the largest repository of documents available, and retrieving them for any purpose requires crawlers that navigate autonomously, select documents, and process them according to the objectives pursued. Even intuitively, however, we can see that a significant number of documents are retrieved in more or less abundant replicas. Detecting these duplicates is important because it lightens databases and improves the efficiency of information retrieval engines, and it also improves the precision of cybermetric analyses, web mining studies, etc. Standard hashing techniques detect only exact duplicates, at the bit level. However, many of the duplicates found in the real world are not exactly alike. For example, we can find web pages with the same content but with different headers or meta tags, or viewed with different style sheets. A frequent case is the same document in different formats; in these cases the documents are completely different at the binary level. The obvious solution is to compare plain-text conversions of all these formats, but these conversions are never identical, because converters treat various formatting elements differently (textual characters, diacritics, spacing, paragraphs ...). In this work we introduce the possibility of using what is known as fuzzy hashing. The idea is to produce fingerprints of files (or documents, etc.), so that comparing two fingerprints gives an estimate of the closeness or distance between the two files or documents. Based on the concept of a “rolling hash”, fuzzy hashing has been used successfully in computer security tasks such as identifying malware, spam, virus scanning, etc.
We have added fuzzy hashing capabilities to a lightweight crawler and have run several tests in a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers.
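
The contrast between exact hashes and fuzzy fingerprints can be illustrated with a minimal Python sketch. It is not the authors' crawler code, and it is not the algorithm of any production tool such as ssdeep: the window size, the block size, the byte-sum rolling hash, the sample documents, and the helper names (`exact_fingerprint`, `fuzzy_fingerprint`, `edit_distance`, `similarity`) are all assumptions chosen for illustration. A rolling hash over the last few bytes decides where each chunk ends, every chunk contributes one character to the fingerprint, and two fingerprints are compared by edit distance.

```python
import hashlib

WINDOW = 7  # bytes covered by the rolling hash (assumed value)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def exact_fingerprint(data: bytes) -> str:
    """Conventional hash: any single-bit change yields a different digest."""
    return hashlib.sha256(data).hexdigest()


def _chunk_char(chunk: bytes) -> str:
    """Map one chunk to a single character of the fingerprint."""
    return ALPHABET[hashlib.sha256(chunk).digest()[0] % len(ALPHABET)]


def fuzzy_fingerprint(data: bytes, block_size: int = 16) -> str:
    """Toy piecewise hash: chunk boundaries depend only on the last WINDOW bytes."""
    signature = []
    rolling = 0
    chunk_start = 0
    for i, byte in enumerate(data):
        # Rolling update: add the incoming byte, drop the one leaving the window.
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        # Trigger point: close the current chunk and emit one character for it.
        if rolling % block_size == block_size - 1:
            signature.append(_chunk_char(data[chunk_start:i + 1]))
            chunk_start = i + 1
    if chunk_start < len(data):
        signature.append(_chunk_char(data[chunk_start:]))  # trailing chunk
    return "".join(signature)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(sig_a: str, sig_b: str) -> int:
    """Score from 0 (unrelated) to 100 (identical fingerprints)."""
    longest = max(len(sig_a), len(sig_b))
    if longest == 0:
        return 100
    return round(100 * (1 - edit_distance(sig_a, sig_b) / longest))


# Hypothetical sample documents: a page, a mirror with a different header,
# and an unrelated page.
base = " ".join(
    f"Section {i}: the crawler fetched page number {i * 37} from server {i % 5}."
    for i in range(40)
).encode()
mirror = b"<html><head><title>Mirror</title></head> " + base
other = " ".join(
    f"Entry {i}: unrelated notes about cooking recipe {i * 53}, batch {i % 7}!"
    for i in range(40)
).encode()

print("exact digests equal:", exact_fingerprint(base) == exact_fingerprint(mirror))
print("base vs mirror:", similarity(fuzzy_fingerprint(base), fuzzy_fingerprint(mirror)))
print("base vs other: ", similarity(fuzzy_fingerprint(base), fuzzy_fingerprint(other)))
```

Because chunk boundaries depend only on a short window of preceding bytes, a change near the start of a document alters only the first few characters of its fingerprint, whereas the exact digest changes completely. In a crawler, a similarity threshold on such a score would decide whether two pages count as duplicates; the value of that threshold is what the tests described above were designed to measure.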

Keywords

Web Crawling, Fuzzy Hashing, Document Duplicate Detection, Information Retrieval

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Carlos G. Figuerola (1)
  • Raquel Gómez Díaz (1)
  • José L. Alonso Berrocal (1)
  • Angel F. Zazo Rodríguez (1)
  1. University of Salamanca, Salamanca, Spain
