Compact Features for Detection of Near-Duplicates in Distributed Retrieval

  • Yaniv Bernstein
  • Milad Shokouhi
  • Justin Zobel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4209)


In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
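The abstract does not describe how a grainy hash vector is built, so the following is only a minimal sketch of the general idea under stated assumptions: each document is reduced to a short vector of small min-hash values over word trigrams, each server returns that vector alongside its results, and the merging node drops any result whose vector agrees with an already kept result in enough slots. The slot count, bits per slot, shingle size, threshold, and all function names are hypothetical choices for illustration and are not taken from the paper.

```python
import hashlib

SLOTS = 16   # hypothetical: number of small hashes packed into one vector
BITS = 4     # hypothetical: bits kept per hash, so the vector is 64 bits
MASK = (1 << BITS) - 1


def shingles(text, n=3):
    """Return the set of word n-grams of the text (assumed feature choice)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def slot_hash(shingle, slot):
    """Hash a shingle with a per-slot salt, giving one hash function per slot."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=slot.to_bytes(2, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def grainy_vector(text):
    """Pack the low BITS bits of each slot's minimum hash into one integer."""
    grams = shingles(text)
    vector = 0
    for slot in range(SLOTS):
        smallest = min(slot_hash(g, slot) for g in grams)
        vector |= (smallest & MASK) << (slot * BITS)
    return vector


def matching_slots(a, b):
    """Count slots in which two vectors hold the same value."""
    diff = a ^ b
    return sum(1 for s in range(SLOTS) if (diff >> (s * BITS)) & MASK == 0)


def prune(results, threshold=12):
    """Keep each (doc_id, vector) unless it matches an earlier kept result."""
    kept = []
    for doc_id, vector in results:
        if all(matching_slots(vector, v) < threshold for _, v in kept):
            kept.append((doc_id, vector))
    return kept
```

With these assumed parameters each result carries only 64 extra bits, and the merge step needs no access to document text. Because each slot holds only a few bits, unrelated documents can agree in some slots by chance, so the threshold has to be set well above the expected number of accidental matches.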


Hash Function; Server Selection; Compact Feature; Result List; Distribute Information Retrieval
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Yaniv Bernstein
  • Milad Shokouhi
  • Justin Zobel

All authors: School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
