Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Web Search Result De-duplication and Clustering

  • Xuehua Shen
  • Cheng Xiang Zhai
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_326

Definition

Web search result de-duplication and clustering are both techniques for improving the organization and presentation of Web search results. De-duplication refers to the removal of duplicate or near-duplicate web pages from the search result page. Because a user is unlikely to be interested in seeing redundant information, de-duplication can improve search results by decreasing the redundancy and increasing the diversity among them.
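As an illustration only, near-duplicate removal can be sketched with word shingles and Jaccard similarity. This is a minimal sketch, not a method prescribed by this entry: the function names (`shingles`, `jaccard`, `deduplicate`), the shingle size `k=3`, and the threshold `0.8` are all assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(results, threshold=0.8, k=3):
    """Keep a result unless it is too similar to one that was already kept.

    Results are scanned in rank order, so the highest-ranked copy survives.
    The threshold and k are illustrative assumptions, not tuned values.
    """
    kept, kept_shingles = [], []
    for text in results:
        s = shingles(text, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


# Hypothetical result texts, listed in rank order.
results = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "stock markets fell sharply after the central bank raised rates",
]
unique = deduplicate(results)
```

Here the second text shares 11 of its 12 shingles with the first, so it is dropped as a near-duplicate, while the unrelated third text is kept. Comparing every result against every kept result is quadratic, which is acceptable for a single result page but would likely need hashing or sketching at collection scale.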

Web search result clustering means that, given a set of web search results, the search engine partitions them into subsets (clusters) according to the similarity between search results and presents the results in a structured way. Clustering improves the organization of search results because similar pages are grouped together in a cluster, so a user can easily navigate to the most relevant cluster to find relevant pages. Hierarchical clustering is often used to generate a hierarchical tree structure which facilitates...
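The partitioning idea above can be sketched as a small agglomerative (bottom-up) hierarchical clustering over result snippets. This is a hedged illustration under stated assumptions: snippets are represented as raw term-count vectors, similarity is cosine similarity between summed vectors, and the names `cosine`, `cluster`, and the cutoff `min_sim=0.3` are hypothetical choices made for the example rather than anything specified by this entry.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster(snippets, min_sim=0.3):
    """Repeatedly merge the two most similar clusters until none reach min_sim.

    Each cluster is a (tree, vector) pair: tree is a snippet index or a
    nested pair of subtrees, and vector is the summed term counts.
    Returns the list of remaining trees, one per cluster.
    """
    clusters = [(i, Counter(s.lower().split())) for i, s in enumerate(snippets)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < min_sim:
            break  # remaining clusters are too dissimilar to merge
        i, j = pair
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return [tree for tree, _ in clusters]


# Hypothetical snippets for an ambiguous query.
snippets = [
    "jaguar car dealer prices",
    "jaguar car reviews and prices",
    "jaguar cat habitat rainforest",
    "jaguar big cat rainforest predator",
]
groups = cluster(snippets)
full_tree = cluster(snippets, min_sim=0.0)
```

With the default cutoff the four snippets fall into two clusters, one about cars (indices 0 and 1) and one about the animal (indices 2 and 3). Setting `min_sim=0.0` lets merging continue to a single root, and the nested pairs then record the full tree of merges.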


Recommended Reading

  1. Broder AZ, Glassman SC, Manasse MS, Zweig G. Syntactic clustering of the web. Comput Netw. 1997;29(8–13):1157–66.
  2. Chowdhury A, Frieder O, Grossman DA, McCabe MC. Collection statistics for fast duplicate document detection. ACM Trans Inf Syst. 2002;20(2):171–91.
  3. Cutting DR, Pedersen JO, Karger D, Tukey JW. Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1992. p. 318–29.
  4. Dumais ST, Cutrell E, Chen H. Optimizing search by showing results in context. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; 2001. p. 277–84.
  5. Ferragina P, Gulli A. A personalized search engine based on Web-snippet hierarchical clustering. In: Proceedings of the 14th International World Wide Web Conference; 2005. p. 801–10.
  6. Hearst MA, Pedersen JO. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1996. p. 76–84.
  7. Hoad T, Zobel J. Methods for identifying versioned and plagiarised documents. J Am Soc Inf Sci Technol. 2003;54(3):203–15.
  8. Huffman S, Lehman A, Stolboushkin A, Wong-Toi H, Yang F, Roehrig H. Multiple-signal duplicate detection for search evaluation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2007. p. 223–30.
  9. Jardine N, van Rijsbergen C. The use of hierarchic clustering in information retrieval. Inf Storage Retr. 1971;7(5):217–40.
  10. Manber U. Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference; 1994. p. 1–10.
  11. Mei Q, Shen X, Zhai C. Automatic labeling of multinomial topic models. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2007. p. 490–9.
  12. Shivakumar N, Garcia-Molina H. SCAM: a copy detection mechanism for digital documents. In: Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries; 1995.
  13. Wang X, Zhai C. Learn from web search logs to organize search results. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2007. p. 87–94.
  14. Willett P. Recent trends in hierarchic document clustering: a critical review. Inf Process Manag. 1988;24(5):577–97.
  15. Zamir O, Etzioni O. Grouper: a dynamic clustering interface to Web search results. In: Proceedings of the 8th International World Wide Web Conference; 1999.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Google, Inc., Mountain View, USA
  2. University of Illinois at Urbana-Champaign, Urbana, USA

Section editors and affiliations

  • Cong Yu (1)
  1. Google Research, New York, USA