Clustering Template Based Web Documents

  • Thomas Gottron
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)

Abstract

More and more documents on the World Wide Web are based on templates. On a technical level, this causes those documents to have very similar source code and DOM tree structures. Grouping together documents that are based on the same template is an important task for applications that analyse the template structure and need clean training data. This paper develops and compares several distance measures for clustering web documents according to their underlying templates. Combining those distance measures with different clustering approaches, we show which combination of methods leads to the desired result.
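
To make the idea of a structural distance measure concrete, here is a minimal Python sketch. It is not taken from the paper: the choice of measure (Jaccard distance over k-grams of the opening-tag sequence), the single-linkage threshold clustering, and all names (`tag_shingles`, `structural_distance`, `cluster_by_threshold`, the default `k` and `threshold`) are assumptions made purely for illustration.

```python
from html.parser import HTMLParser
from itertools import combinations


class TagSequenceParser(HTMLParser):
    """Collects the sequence of opening tag names, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, k=3):
    """Return the set of k-grams over the document's opening-tag sequence."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    if len(tags) < k:
        # Too short for a full k-gram: use the whole sequence as one shingle.
        return {tuple(tags)}
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def structural_distance(html_a, html_b, k=3):
    """Jaccard distance of shingle sets: 0.0 = same set, 1.0 = disjoint."""
    a, b = tag_shingles(html_a, k), tag_shingles(html_b, k)
    return 1.0 - len(a & b) / len(a | b)


def cluster_by_threshold(docs, threshold=0.5, k=3):
    """Single-linkage grouping: link any pair whose distance <= threshold.

    Returns a sorted list of clusters, each a list of document indices.
    The threshold is an illustrative assumption, not a tuned value.
    """
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if structural_distance(docs[i], docs[j], k) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Calling `cluster_by_threshold` on a list of HTML strings returns groups of indices whose tag structure overlaps heavily. Because only tag names are compared, two pages with the same layout but different text end up close together, which is the property a template-oriented measure needs; a pairwise comparison like this is quadratic in the number of documents, so it is suited to small collections only.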

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Thomas Gottron (1)
  1. Institut für Informatik, Johannes Gutenberg-Universität Mainz, Mainz, Germany
