Site-Independent Template-Block Detection

  • Aleksander Kołcz
  • Wen-tau Yih
Conference paper

DOI: 10.1007/978-3-540-74976-9_17

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4702)
Cite this paper as:
Kołcz A., Yih W. (2007) Site-Independent Template-Block Detection. In: Kok J.N., Koronacki J., Lopez de Mantaras R., Matwin S., Mladenič D., Skowron A. (eds) Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science, vol 4702. Springer, Berlin, Heidelberg


Detection of template and noise blocks in web pages is an important step in improving the performance of information retrieval and content extraction. Of the many approaches proposed, most either assume operation within the confines of a single website or require expensive hand-labeling of relevant and non-relevant blocks for model induction. This reduces their applicability, since in many practical scenarios template blocks need to be detected in arbitrary web pages, with no prior knowledge of the site structure. In this work we propose to bridge these two approaches by using within-site template discovery techniques to drive the induction of a site-independent template detector. Our approach eliminates the need for human annotation and produces highly effective models. Experimental results demonstrate the usefulness of the proposed methodology for the important application of keyword extraction, with relative performance gains as high as 20%.
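
The bridging idea can be illustrated with a minimal sketch of the automatic labeling step. This is not the authors' procedure, which the abstract does not specify. It assumes that each page is already segmented into text blocks, that blocks are matched by whitespace- and case-normalized text, and that a block counts as a template if it recurs on at least a hypothetical `min_fraction` of one site's pages. The function name and the threshold are illustrative only.

```python
from collections import Counter


def _normalize(block):
    # Assumption: blocks match if equal after collapsing whitespace and case.
    return " ".join(block.split()).lower()


def label_blocks_by_site_frequency(site_pages, min_fraction=0.5):
    """Label blocks of ONE site as template (True) or content (False).

    site_pages: list of pages, each page a list of block strings.
    min_fraction: hypothetical threshold on the share of pages a block
    must appear on to be treated as a template.
    """
    n_pages = len(site_pages)
    pages_containing = Counter()
    for blocks in site_pages:
        # Count each distinct block at most once per page.
        pages_containing.update({_normalize(b) for b in blocks})

    labeled = []
    for blocks in site_pages:
        for b in blocks:
            share = pages_containing[_normalize(b)] / n_pages
            labeled.append((b, share >= min_fraction))
    return labeled
```

Labels produced this way for many sites could then be pooled to train any standard classifier on features that do not depend on a particular site, which is the sense in which no human annotation is needed.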



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Aleksander Kołcz (1)
  • Wen-tau Yih (2)
  1. Microsoft Live Labs, Redmond, WA, USA
  2. Microsoft Research, Redmond, WA, USA
