Web Search Result De-duplication and Clustering

Shen, Xuehua; Zhai, Cheng Xiang

doi:10.1007/978-1-4614-8265-9_326

Definition

Web search result de-duplication and clustering are two techniques for improving the organization and presentation of Web search results. De-duplication is the removal of duplicate or near-duplicate Web pages from the search result page. Because a user is unlikely to want to see redundant information, de-duplication improves search results by reducing redundancy and increasing diversity among them.
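As an illustration of near-duplicate removal, the following is a minimal sketch in Python. It assumes that each result is represented by a short text (for example, its snippet), that similarity is measured by Jaccard overlap between sets of word shingles, and that a result is dropped when it is too similar to one already kept. The function names, the shingle length k, and the threshold of 0.8 are assumptions chosen for the example; the entry itself does not prescribe a particular method.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(results, threshold=0.8, k=3):
    """Keep results in ranked order, skipping any result whose shingle set
    has Jaccard similarity >= threshold with an already kept result."""
    kept = []
    kept_shingles = []
    for text in results:
        s = shingles(text, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    ranked = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",        # exact duplicate
        "the quick brown fox jumps over the lazy dog today",  # near-duplicate
        "web search result clustering groups similar pages",
    ]
    for text in deduplicate(ranked):
        print(text)
```

Comparing every result against every kept result is quadratic in the number of results, which is tolerable for a single result page; the fingerprinting and sampling techniques in the literature exist to avoid this cost at collection scale.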

Web search result clustering means that, given a set of Web search results, the search engine partitions them into subsets (clusters) according to the similarity between results and presents them in a structured way. Clustering improves the organization of search results because similar pages are grouped together in a cluster, and a user can navigate directly to the most relevant cluster to find relevant pages. Hierarchical clustering is often used to generate a hierarchical tree structure which facilitates...
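The partitioning step can be sketched as a small agglomerative (bottom-up hierarchical) procedure. The sketch below is an illustration under stated assumptions, not the method of any particular search engine: results are represented as sets of words, similarity is Jaccard overlap, clusters are compared by single link (the most similar pair of members), and merging stops once the best similarity falls below a threshold. The function names and the threshold of 0.3 are hypothetical.

```python
from itertools import combinations


def word_set(text):
    """Represent a result as the set of its lowercased words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def single_link(c1, c2, vectors):
    """Similarity of two clusters = similarity of their closest members."""
    return max(jaccard(vectors[i], vectors[j]) for i in c1 for j in c2)


def cluster_results(results, threshold=0.3):
    """Repeatedly merge the two most similar clusters until no pair reaches
    the threshold. Returns clusters as sorted lists of result indices."""
    vectors = [word_set(r) for r in results]
    clusters = [[i] for i in range(len(results))]
    while len(clusters) > 1:
        best_sim, (a, b) = max(
            (single_link(clusters[a], clusters[b], vectors), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    snippets = [
        "jaguar car price and dealer",
        "jaguar car review and price",
        "jaguar cat habitat in rainforest",
        "jaguar cat diet and habitat",
    ]
    for cluster in cluster_results(snippets):
        print([snippets[i] for i in cluster])
```

Recording the sequence of merges instead of stopping at a threshold would yield the full tree; the threshold is used here only to obtain a flat partition that is easy to inspect.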

Recommended Reading

Broder AZ, Glassman SC, Manasse MS, Zweig G. Syntactic clustering of the web. Comput Netw. 1997;29(8–13):1157–66.
Chowdhury A, Frieder O, Grossman DA, McCabe MC. Collection statistics for fast duplicate document detection. ACM Trans Inf Syst. 2002;20(2):171–91.
Cutting DR, Pedersen JO, Karger D, Tukey JW. Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1992. p. 318–29.
Dumais ST, Cutrell E, Chen H. Optimizing search by showing results in context. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; 2001. p. 277–84.
Ferragina P, Gulli A. A personalized search engine based on Web-snippet hierarchical clustering. In: Proceedings of the 14th International World Wide Web Conference; 2005. p. 801–10.
Hearst MA, Pedersen JO. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1996. p. 76–84.
Hoad T, Zobel J. Methods for identifying versioned and plagiarised documents. J Am Soc Inf Sci Technol. 2003;54(3):203–15.
Huffman S, Lehman A, Stolboushkin A, Wong-Toi H, Yang F, Roehrig H. Multiple-signal duplicate detection for search evaluation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2007. p. 223–30.
Jardine N, van Rijsbergen C. The use of hierarchic clustering in information retrieval. Inf Storage Retr. 1971;7(5):217–40.
Manber U. Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference; 1994. p. 1–10.
Mei Q, Shen X, Zhai C. Automatic labeling of multinomial topic models. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2007. p. 490–9.
Shivakumar N, Garcia-Molina H. SCAM: a copy detection mechanism for digital documents. In: Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries; 1995.
Wang X, Zhai C. Learn from web search logs to organize search results. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2007. p. 87–94.
Willett P. Recent trends in hierarchic document clustering: a critical review. Inf Process Manag. 1988;24(5):577–97.
Zamir O, Etzioni O. Grouper: a dynamic clustering interface to Web search results. In: Proceedings of the 8th International World Wide Web Conference; 1999.

Author information

Authors and Affiliations

Google, Inc., Mountain View, CA, USA
Xuehua Shen
University of Illinois at Urbana-Champaign, Urbana, IL, USA
Cheng Xiang Zhai

Corresponding author

Correspondence to Xuehua Shen.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

Google Research, New York, NY, USA
Cong Yu

About this entry

Cite this entry

Shen, X., Zhai, C. (2018). Web Search Result De-duplication and Clustering. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_326

DOI: https://doi.org/10.1007/978-1-4614-8265-9_326
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science, Reference Module Computer Science and Engineering
