Identifying and Filtering Near-Duplicate Documents

Broder, Andrei Z.

doi:10.1007/3-540-45123-4_1


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1848)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching


Abstract

The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed-size “sketch” for each document. For a large collection of documents (say, hundreds of millions), the size of this sketch is on the order of a few hundred bytes per document.

However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a “sample” of less than 50 bytes per document.

The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest.
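As an illustration of the two aspects described above, the following is a minimal sketch, not the implementation used at AltaVista. It assumes that documents are reduced to sets of w-word "shingles" and that the random sampling is done by min-wise hashing, one common way to realize it. The names `shingles`, `sketch`, `estimate_resemblance`, and `is_near_duplicate`, the shingle width of 4, the 128 hash functions, the use of BLAKE2b, and the 0.5 threshold are all assumptions chosen for the example; the abstract does not specify them.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def exact_resemblance(text_a, text_b, w=4):
    """Resemblance as a set problem: |A intersect B| / |A union B|."""
    sa, sb = shingles(text_a, w), shingles(text_b, w)
    return len(sa & sb) / len(sa | sb)


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; each seed acts as a
    stand-in for one random permutation (an illustrative assumption)."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, num_hashes=128, w=4):
    """Fixed-size sketch: the minimum hash value under each seeded hash.
    It is computed from one document alone, independently of all others."""
    s = shingles(text, w)
    if not s:
        raise ValueError("cannot sketch an empty document")
    return [min(_hash(sh, seed) for sh in s) for seed in range(num_hashes)]


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree; this estimates
    the relative size of the intersection of the two shingle sets."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def is_near_duplicate(sketch_a, sketch_b, threshold=0.5):
    """Threshold test: is the estimated resemblance at least threshold?"""
    return estimate_resemblance(sketch_a, sketch_b) >= threshold
```

Each sketch is built from one document only, so sketches can be stored at indexing time and compared later without revisiting the text. Note that this example keeps 128 eight-byte values per document, which is far larger than the sub-50-byte sample mentioned in the abstract; it shows the estimation principle only, not the compact representation.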

The algorithm for filtering near-duplicate documents discussed here has been successfully implemented and has been used for the last three years in the context of the AltaVista search engine.

Most of this work was done while the author was at Compaq’s System Research Center in Palo Alto. A preliminary version of this work was presented (but not published) at the “Fun with Algorithms” conference, Isola d’Elba, 1998.



Author information

Authors and Affiliations

AltaVista Company, San Mateo, CA, 94402, USA
Andrei Z. Broder


Editor information

Editors and Affiliations

Dipartimento di Matematica ed Applicazioni, Università di Palermo, Via Archirafi 34, 90123, Palermo, Italy
Raffaele Giancarlo
Centre de recherches mathématiques, Université de Montréal, CP 6128, succursale Centre-Ville, Montréal, Québec, Canada, H3C 3J7
David Sankoff


About this paper

Cite this paper

Broder, A.Z. (2000). Identifying and Filtering Near-Duplicate Documents. In: Giancarlo, R., Sankoff, D. (eds) Combinatorial Pattern Matching. CPM 2000. Lecture Notes in Computer Science, vol 1848. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45123-4_1


DOI: https://doi.org/10.1007/3-540-45123-4_1
Published: 07 November 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67633-1
Online ISBN: 978-3-540-45123-5
eBook Packages: Springer Book Archive
