
Site-Independent Template-Block Detection

  • Aleksander Kołcz
  • Wen-tau Yih
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4702)

Abstract

Detection of template and noise blocks in web pages is an important step in improving the performance of information retrieval and content extraction. Of the many approaches proposed, most either assume operation within the confines of a single website or require expensive hand-labeling of relevant and non-relevant blocks for model induction. This reduces their applicability, since in many practical scenarios template blocks need to be detected in arbitrary web pages, with no prior knowledge of the site structure. In this work we propose to bridge these two approaches by using within-site template discovery techniques to drive the induction of a site-independent template detector. Our approach eliminates the need for human annotation and produces highly effective models. Experimental results demonstrate the usefulness of the proposed methodology for the important application of keyword extraction, with relative performance gains as high as 20%.
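
The idea of letting within-site discovery supply training labels for a site-independent detector can be illustrated with a small sketch. The abstract does not specify the labeling rule or the features, so everything below is an assumption made for illustration: blocks are plain strings, a block counts as a template if it recurs on at least `min_fraction` of the pages of its own site, and the two features in `block_features` are hypothetical stand-ins for whatever site-independent features a real detector would use.

```python
from collections import Counter


def label_blocks_by_site_frequency(site_pages, min_fraction=0.8):
    """Label each block occurrence of one site as template or not.

    site_pages: list of pages, each page a list of block strings.
    Assumed rule (not taken from the paper): a block is a template if
    it appears on at least `min_fraction` of the site's pages.
    """
    n_pages = len(site_pages)
    # Within-site document frequency: pages on which each block occurs.
    doc_freq = Counter()
    for page in site_pages:
        doc_freq.update(set(page))

    labeled = []
    for page in site_pages:
        for block in page:
            # A single page gives no evidence of repetition.
            is_template = n_pages > 1 and doc_freq[block] / n_pages >= min_fraction
            labeled.append((block, is_template))
    return labeled


def block_features(block):
    """Hypothetical features that do not depend on any particular site."""
    words = block.split()
    n_words = len(words)
    n_caps = sum(1 for w in words if w[:1].isupper())
    return {
        "n_words": n_words,
        "frac_capitalized": n_caps / n_words if n_words else 0.0,
    }


def build_training_set(sites, min_fraction=0.8):
    """Pool automatically labeled blocks from many sites.

    sites: dict mapping a site name to its list of pages.
    Returns (features, is_template) pairs with no human annotation.
    """
    examples = []
    for pages in sites.values():
        for block, is_template in label_blocks_by_site_frequency(pages, min_fraction):
            examples.append((block_features(block), is_template))
    return examples


if __name__ == "__main__":
    sites = {
        "example-site": [
            ["Home | About | Contact", "Story one about sports."],
            ["Home | About | Contact", "Story two about politics."],
        ],
    }
    for features, is_template in build_training_set(sites):
        print(is_template, features)
```

The pooled pairs could then be passed to any standard classifier. Because the features make no reference to a specific site, the resulting model can be applied to a single page from a site it has never seen, which is the setting the abstract targets.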

Keywords

Random Forest, Document Frequency, Keyword Extraction, Open Directory Project, Keyphrase Extraction
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Aleksander Kołcz (1)
  • Wen-tau Yih (2)
  1. Microsoft Live Labs, Redmond, WA, USA
  2. Microsoft Research, Redmond, WA, USA
