Clustering Template Based Web Documents

  • Thomas Gottron
Conference paper

DOI: 10.1007/978-3-540-78646-7_7

Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)
Cite this paper as:
Gottron T. (2008) Clustering Template Based Web Documents. In: Macdonald C., Ounis I., Plachouras V., Ruthven I., White R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg

Abstract

More and more documents on the World Wide Web are based on templates. On a technical level, this gives those documents very similar source code and DOM tree structure. Grouping documents that are based on the same template is an important task for applications that analyse template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. By combining those distance measures with different clustering approaches, we show which combination of methods leads to the desired result.
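
To make the idea of a structural distance measure concrete, the following is a minimal sketch and not the paper's method: the abstract does not name its measures or clustering algorithms, so everything here is an assumption chosen for illustration. It compares documents by the Jaccard distance between sets of opening-tag bigrams and groups them with a simple threshold-based single-linkage step. The names `tag_shingles`, `jaccard_distance`, `single_linkage_clusters` and the threshold of 0.5 are hypothetical.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, n=2):
    """Return the set of n-grams over the document's opening-tag sequence."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(docs, threshold=0.5):
    """Group documents whose distance is at most `threshold` (assumed value),
    merging transitively with a small union-find structure."""
    shingles = [tag_shingles(d) for d in docs]
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard_distance(shingles[i], shingles[j]) <= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Two pages that share a tag skeleton but differ in text content end up close together under this measure, while a page with a different skeleton stays apart. A real evaluation would compare several such measures and clustering methods against labelled data, as the abstract describes.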


Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Thomas Gottron
    • Institut für Informatik, Johannes Gutenberg-Universität Mainz, Mainz, Germany
