World Wide Web, Volume 12, Issue 2, pp 171–211

On Finding Templates on Web Collections

  • Karane Vieira
  • André Luiz da Costa Carvalho
  • Klessius Berlt
  • Edleno S. de Moura
  • Altigran S. da Silva
  • Juliana Freire

Abstract

Templates are pieces of HTML code common to a set of web pages, usually adopted by content providers to enhance the uniformity of layout and navigation of their Web sites. They are usually generated using authoring/publishing tools or by programs that build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively affect the quality of results produced by systems that automatically process information available in web sites, such as search engines, clustering and automatic categorization programs. Further, the information available in templates is redundant, and thus processing and storing such information just once for a set of pages may save computational resources. In this paper, we present and evaluate methods for detecting templates in a scenario where multiple templates can be found in a collection of Web pages. Most previous work has studied template detection algorithms in a scenario where the collection has just a single template. The scenario with multiple templates is more realistic and, as discussed here, it raises important questions that may require extensions and adjustments to previously proposed template detection algorithms. We show how to apply and evaluate two template detection algorithms in this scenario, creating solutions for detecting multiple templates. The methods studied partition the input collection into clusters that contain common HTML paths and share a high number of HTML nodes, and then apply a single-template detection procedure over each cluster. We also propose a new algorithm for single-template detection based on a restricted form of bottom-up tree-mapping that requires only a small set of pages to correctly identify a template and has worst-case linear complexity. Our experimental results over a representative set of Web pages show that our approach is efficient and scalable while obtaining accurate results.
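The two-stage idea described in the abstract (cluster pages by shared HTML paths, then run a single-template detector on each cluster) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it stands in Jaccard similarity over tag-path sets for the clustering step and a plain set intersection for the single-template step, where the paper uses a restricted bottom-up tree-mapping. All names here (`html_paths`, `cluster_pages`, `detect_templates`) and the 0.5 threshold are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of one HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray close tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def html_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def cluster_pages(path_sets, threshold=0.5):
    """Greedy clustering (an assumption of this sketch): a page joins the
    first cluster whose first member is similar enough, else starts a new one."""
    clusters = []  # each cluster is a list of page indices
    for i, paths in enumerate(path_sets):
        for cluster in clusters:
            if jaccard(paths, path_sets[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def detect_templates(pages, threshold=0.5):
    """Return (page indices, candidate template paths) for each cluster.
    The candidate template is simply the set of paths shared by every page."""
    path_sets = [html_paths(page) for page in pages]
    result = []
    for cluster in cluster_pages(path_sets, threshold):
        common = set.intersection(*(path_sets[i] for i in cluster))
        result.append((cluster, common))
    return result
```

Calling `detect_templates` on a list of HTML strings returns one entry per cluster, each pairing the indices of the clustered pages with the tag paths they all share. Paths that occur in only some pages of a cluster are treated as content rather than template, which is a much cruder criterion than the tree-mapping approach the abstract proposes.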

Keywords

web template detection tree-mapping web IR 

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Karane Vieira (1)
  • André Luiz da Costa Carvalho (1)
  • Klessius Berlt (1)
  • Edleno S. de Moura (1)
  • Altigran S. da Silva (1)
  • Juliana Freire (2)

  1. Department of Computer Science, Federal University of Amazonas, Manaus, Brazil
  2. School of Computing, University of Utah, Salt Lake City, USA
