Visual Similarity of Web Pages

  • Conference paper

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 67)

Abstract

In this paper we describe an experiment with two methods for evaluating the similarity of Web pages. The results of these methods can be used in different ways for reordering and clustering a set of Web pages. Both methods belong to the field of Web content mining. The first method focuses purely on the visual similarity of Web pages: it segments the pages and compares their layouts using image processing and graph matching. The second method is based on detecting objects that reflect the user's point of view of the Web page; the similarity of two pages is then measured as the match between the objects detected on them.
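As an illustration of the second idea only, similarity as an object match can be sketched as a set overlap between the objects detected on two pages. The abstract does not state which measure the authors use, so the Jaccard coefficient, the function name `object_match_similarity`, and the object labels below are all assumptions made for this sketch.

```python
def object_match_similarity(objects_a, objects_b):
    """Jaccard overlap between the sets of objects detected on two pages.

    Assumption: each page is already reduced to a collection of object
    labels; how those objects are detected is outside this sketch.
    """
    a, b = set(objects_a), set(objects_b)
    if not a and not b:
        # Convention chosen here: two pages with no detected objects match fully.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical object labels for two product-like pages.
page_1 = {"price", "product_image", "purchase_button", "review"}
page_2 = {"price", "product_image", "discussion"}

score = object_match_similarity(page_1, page_2)
print(f"object match similarity: {score:.2f}")
```

Here two of the five distinct objects are shared, so the score is 2/5. A pairwise score of this kind could feed a reordering or clustering step, which is the use the abstract mentions.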

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kudělka, M., Takama, Y., Snášel, V., Klos, K., Pokorný, J. (2010). Visual Similarity of Web Pages. In: Snášel, V., Szczepaniak, P.S., Abraham, A., Kacprzyk, J. (eds) Advances in Intelligent Web Mastering - 2. Advances in Intelligent and Soft Computing, vol 67. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10687-3_13

  • DOI: https://doi.org/10.1007/978-3-642-10687-3_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-10686-6

  • Online ISBN: 978-3-642-10687-3

  • eBook Packages: Engineering, Engineering (R0)