Subsite Retrieval: A Novel Concept for Topic Distillation

  • Tao Qin
  • Tie-Yan Liu
  • Xu-Dong Zhang
  • Guang Feng
  • Wei-Ying Ma
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3689)


Topic distillation is one of the main information needs when users search the Web. Previous approaches to topic distillation treat the single page as the basic search unit. This strategy is inherited from general information retrieval and does not fully exploit the structural information of the Web. In this paper, we propose a novel concept for topic distillation, named subsite retrieval, in which the basic search unit is the subsite instead of the single page. As the name indicates, a subsite is a subset of a website, consisting of a structured collection of pages. The key to subsite retrieval is to extract effective features that represent a subsite by using both the content of each page and the structural information in the subsite. Specifically, we propose the so-called PI algorithm for this purpose, which is based on a model of website growth. Tested on the topic distillation tasks of TREC 2003 and TREC 2004, subsite retrieval significantly improves retrieval performance over previous single-page-based methods.
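The idea of ranking subsites rather than single pages can be illustrated with a minimal sketch. The abstract does not describe the PI algorithm, so the code below is not that algorithm: it assumes, purely for illustration, that a subsite is identified by a page's host plus URL directory and that a subsite's score is the sum of its pages' relevance scores. The names `subsite_key` and `rank_subsites` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def subsite_key(url: str) -> str:
    """Hypothetical subsite identifier: host plus the directory part of the path."""
    parsed = urlparse(url)
    directory = parsed.path.rsplit("/", 1)[0]  # drop the file name, keep the directory
    return parsed.netloc + directory + "/"


def rank_subsites(page_scores: dict[str, float]) -> list[tuple[str, float]]:
    """Group pages by subsite key and sum their per-page relevance scores.

    Summation is an assumed aggregation chosen for simplicity; it is not
    the feature extraction proposed in the paper.
    """
    totals: dict[str, float] = defaultdict(float)
    for url, score in page_scores.items():
        totals[subsite_key(url)] += score
    # Highest-scoring subsite first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up per-page scores, e.g. from any single-page ranking function.
example_scores = {
    "http://example.org/a/index.html": 0.5,
    "http://example.org/a/b.html": 0.25,
    "http://example.org/c/x.html": 0.6,
}
ranking = rank_subsites(example_scores)
```

In this toy example the page under `/c/` has the highest individual score, but the two pages under `/a/` together outrank it once the subsite is the unit of retrieval.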





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Tao Qin (1, 2)
  • Tie-Yan Liu (2)
  • Xu-Dong Zhang (1)
  • Guang Feng (1, 2)
  • Wei-Ying Ma (2)
  1. MSP Laboratory, Dept. Electronic Engineering, Tsinghua University, Beijing, P.R. China
  2. Microsoft Research Asia, Beijing, P.R. China
