Visual Similarity of Web Pages

  • Miloš Kudělka
  • Yasufumi Takama
  • Václav Snášel
  • Karel Klos
  • Jaroslav Pokorný
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 67)

Abstract

In this paper we describe an experiment with two methods for evaluating the similarity of Web pages. The results of these methods can be used in different ways for reordering and clustering a set of Web pages. Both methods belong to the field of Web content mining. The first method focuses purely on the visual similarity of Web pages: it segments the pages and compares their layouts using image processing and graph matching. The second method is based on detecting objects that follow from the user's point of view of the Web page; the similarity of two pages is then measured as the match between the objects detected on them.
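As a rough illustration of what "similarity as an object match" could look like, the sketch below scores two pages by the overlap of their detected object sets. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not specify the measure, so a plain Jaccard overlap is swapped in here, and the function name and the object labels are hypothetical.

```python
def object_match_similarity(objects_a, objects_b):
    """Jaccard overlap between the sets of objects detected on two pages.

    Assumption: each page is already reduced to a collection of object
    labels; how those objects are detected is outside this sketch.
    """
    a, b = set(objects_a), set(objects_b)
    if not a and not b:
        # Assumption: two pages with no detected objects count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical object labels, chosen only for illustration.
page_a = {"price", "product_image", "purchase_button"}
page_b = {"price", "product_image", "discussion"}

score = object_match_similarity(page_a, page_b)
print(score)
```

A set-based measure like this ignores where objects appear on the page, which is exactly the aspect the first, layout-oriented method is meant to capture.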

Keywords

Web Mining, Multimedia, Semantics of the Images, Automatic Understanding

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Miloš Kudělka (1)
  • Yasufumi Takama (2)
  • Václav Snášel (1)
  • Karel Klos (1)
  • Jaroslav Pokorný (3)
  1. Department of Computer Science, VŠB - Technical University of Ostrava, Ostrava-Poruba, Czech Republic
  2. Tokyo Metropolitan University, Japan
  3. Charles University, Czech Republic
