Information Retrieval Technology

Volume 3689 of the series Lecture Notes in Computer Science, pp. 388-400

Subsite Retrieval: A Novel Concept for Topic Distillation

  • Tao Qin (Carnegie Mellon University; MSP Laboratory, Dept. Electronic Engineering, Tsinghua University; Microsoft Research Asia)
  • Tie-Yan Liu (Carnegie Mellon University; Microsoft Research Asia)
  • Xu-Dong Zhang (Carnegie Mellon University; MSP Laboratory, Dept. Electronic Engineering, Tsinghua University)
  • Guang Feng (Carnegie Mellon University; MSP Laboratory, Dept. Electronic Engineering, Tsinghua University; Microsoft Research Asia)
  • Wei-Ying Ma (Carnegie Mellon University; Microsoft Research Asia)

Abstract

Topic distillation is one of the main information needs when users search the Web. Previous approaches to topic distillation treat the single page as the basic search unit. This strategy is inherited from general information retrieval and does not fully exploit the structural information of the Web. In this paper, we propose a novel concept for topic distillation, named subsite retrieval, in which the basic search unit is the subsite instead of the single page. As the name indicates, a subsite is a subset of a website, consisting of a structured collection of pages. The key to subsite retrieval is extracting effective features to represent a subsite, using both the content of each page and the structural information within the subsite. Specifically, we propose the so-called PI algorithm for this purpose, which is based on a model of website growth. In tests on the topic distillation tasks of TREC 2003 and TREC 2004, subsite retrieval significantly improves retrieval performance over previous single-page-based methods.
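
To make the idea of a subsite as the retrieval unit concrete, the following is a minimal illustrative sketch. It is not the PI algorithm, whose details the abstract does not give. Everything in it is an assumption made for illustration: the grouping rule (host plus first directory of the URL path), the per-page relevance scores supplied as input, the depth-based `decay` weight, and the function names `subsite_key`, `page_depth`, and `rank_subsites`.

```python
from collections import defaultdict
from urllib.parse import urlparse


def _directories(url: str) -> list[str]:
    """Return the directory segments of a URL path (file name excluded)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # If the path does not end with "/", the last segment is treated as a file.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return segments


def subsite_key(url: str) -> str:
    """Hypothetical grouping rule: host plus the first directory, if any."""
    dirs = _directories(url)
    host = urlparse(url).netloc
    return f"{host}/{dirs[0]}/" if dirs else f"{host}/"


def page_depth(url: str) -> int:
    """Number of directory levels below the (assumed) subsite root."""
    dirs = _directories(url)
    return max(len(dirs) - 1, 0)


def rank_subsites(page_scores: dict[str, float], decay: float = 0.5):
    """Aggregate page scores per subsite, down-weighting deeper pages.

    page_scores maps a URL to a content relevance score from any page-level
    ranker; decay is an assumed structural weight, not a value from the paper.
    """
    totals: dict[str, float] = defaultdict(float)
    for url, score in page_scores.items():
        totals[subsite_key(url)] += score * decay ** page_depth(url)
    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    example = {
        "http://a.org/sports/": 0.5,
        "http://a.org/sports/tennis/news.html": 0.4,
        "http://a.org/about.html": 0.3,
        "http://b.org/sports/index.html": 0.6,
    }
    for key, score in rank_subsites(example):
        print(key, round(score, 3))
```

In this toy setup the two pages under `a.org/sports/` are scored together as one unit, so the subsite can outrank a single stronger page elsewhere. This only illustrates combining page content with site structure; the paper's actual feature extraction may differ substantially.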