Abstract
In this paper we present an experiment with two methods for evaluating the similarity of Web pages. The results of these methods can be used in different ways for reordering and clustering a set of Web pages. Both methods belong to the field of Web content mining. The first method focuses purely on the visual similarity of Web pages: it segments the pages and compares their layouts using image processing and graph matching. The second method is based on detecting objects that reflect the user's point of view of the Web page; the similarity of Web pages is then measured as the match between the objects detected on the analyzed pages.
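To make the idea of an "object match" concrete, the following is a minimal sketch in Python. It assumes that each page has already been reduced to a set of detected object labels and that the match is scored as a Jaccard overlap of those sets. The abstract does not specify the actual measure, so the scoring function, the name `object_match_similarity`, and the object labels are all hypothetical and serve only as an illustration.

```python
from typing import Iterable


def object_match_similarity(objects_a: Iterable[str], objects_b: Iterable[str]) -> float:
    """Jaccard overlap of the object labels detected on two pages.

    Assumption: this is an illustrative measure, not the one used in the paper.
    """
    a, b = set(objects_a), set(objects_b)
    if not a and not b:
        # Two pages with no detected objects are treated as identical here.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical object labels for two product-like pages.
page_a = {"price", "product_image", "purchase_button"}
page_b = {"price", "product_image", "review"}
similarity = object_match_similarity(page_a, page_b)
```

A pairwise score of this kind could feed a clustering or reordering step, since it gives a symmetric value between 0 and 1 for any two pages.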
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kudělka, M., Takama, Y., Snášel, V., Klos, K., Pokorný, J. (2010). Visual Similarity of Web Pages. In: Snášel, V., Szczepaniak, P.S., Abraham, A., Kacprzyk, J. (eds) Advances in Intelligent Web Mastering - 2. Advances in Intelligent and Soft Computing, vol 67. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10687-3_13
DOI: https://doi.org/10.1007/978-3-642-10687-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10686-6
Online ISBN: 978-3-642-10687-3
eBook Packages: Engineering, Engineering (R0)