Definition
Web search result de-duplication and clustering are both techniques for improving the organization and presentation of Web search results. De-duplication refers to the removal of duplicate or near-duplicate web pages from the search result page. Because users are unlikely to want to see the same information repeated, de-duplication improves search results by reducing redundancy and increasing diversity among the results shown.
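As a concrete illustration of near-duplicate removal, the following is a minimal sketch, not the method of any particular search engine. It assumes that each result is represented by its text, that near-duplicates are detected by comparing sets of word shingles with Jaccard similarity, and that a result is dropped when its similarity to an already kept result reaches a threshold. The function names, the shingle size `k`, and the threshold value are all hypothetical choices made for this example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts are represented by a single shingle of all their words.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(results, threshold=0.8, k=3):
    """Keep a result only if it is not a near-duplicate of an earlier kept one.

    `threshold` and `k` are illustrative values (assumptions), not tuned settings.
    """
    kept = []
    kept_shingles = []
    for text in results:
        s = shingles(text, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


results = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",        # exact duplicate
    "The quick brown fox jumps over the lazy dog today",  # near-duplicate
    "web search result clustering groups similar pages",
]
unique = deduplicate(results)
```

Results are processed in ranked order, so the highest-ranked copy of each group of near-duplicates is the one that survives. Comparing every result against every kept result is quadratic, which is acceptable for a single result page but would need a fingerprinting or hashing scheme at collection scale.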
Web search result clustering means that, given a set of web search results, the search engine partitions them into subsets (clusters) according to the similarity between results and presents them in a structured way. Clustering improves the organization of search results because similar pages are grouped together, so a user can navigate directly to the most relevant cluster to find relevant pages. Hierarchical clustering is often used to generate a hierarchical tree structure which facilitates...
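The partitioning described above can be sketched as bottom-up (agglomerative) clustering over result snippets. This is a minimal sketch under stated assumptions: snippets are compared by Jaccard similarity of their word sets, clusters are merged with single-link similarity, and merging stops at a similarity threshold, which cuts the hierarchy into a flat partition. The function names and the threshold value are hypothetical and chosen only for illustration.

```python
def word_set(text):
    """Represent a snippet as the set of its lowercased words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_results(snippets, threshold=0.3):
    """Single-link agglomerative clustering; returns lists of snippet indices.

    Starts with one cluster per snippet and repeatedly merges the two most
    similar clusters until no pair reaches `threshold` (an assumed cut-off).
    """
    bags = [word_set(s) for s in snippets]
    clusters = [[i] for i in range(len(snippets))]

    def link(c1, c2):
        # Single link: similarity of the closest pair of members.
        return max(jaccard(bags[i], bags[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, (a, b) = max(
            (link(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


snippets = [
    "jaguar car dealer prices",
    "jaguar car reviews and prices",
    "jaguar big cat habitat",
    "jaguar cat facts and habitat",
]
groups = cluster_results(snippets)
```

For the ambiguous query in this example, the car-related and animal-related snippets end up in separate clusters. Recording the sequence of merges instead of stopping at a threshold would yield the full tree rather than a flat partition.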
© 2018 Springer Science+Business Media, LLC, part of Springer Nature
About this entry
Cite this entry
Shen, X., Zhai, C. (2018). Web Search Result De-duplication and Clustering. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_326
DOI: https://doi.org/10.1007/978-1-4614-8265-9_326
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering