Visual Similarity of Web Pages

Kudělka, Miloš; Takama, Yasufumi; Snášel, Václav; Klos, Karel; Pokorný, Jaroslav

doi:10.1007/978-3-642-10687-3_13


Conference paper


Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 67)

Abstract

In this paper we describe an experiment with two methods for evaluating the similarity of Web pages. The results of these methods can be used in different ways for reordering and clustering a set of Web pages. Both methods belong to the field of Web content mining. The first method focuses purely on the visual similarity of Web pages: it segments Web pages and compares their layouts using image processing and graph matching. The second method is based on detecting objects that reflect the user's point of view of the Web page; the similarity of Web pages is then measured as the match between the objects detected on the analyzed pages.
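
The idea of measuring similarity as an object match can be illustrated with a minimal sketch. The abstract does not give the actual measure, so the Jaccard overlap used here is an assumption, not the authors' formula, and the function name and object labels are hypothetical.

```python
def object_match_similarity(objects_a, objects_b):
    """Jaccard overlap between the object labels detected on two pages.

    Assumption: each page is reduced to a set of detected object labels;
    the paper's real measure may weight or compare objects differently.
    """
    a, b = set(objects_a), set(objects_b)
    if not a and not b:
        # Convention chosen for this sketch: two pages with no detected
        # objects are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical object labels, for illustration only.
page_a = {"price", "product photo", "purchase button"}
page_b = {"price", "product photo", "discussion"}
print(object_match_similarity(page_a, page_b))
```

A set-based score of this kind ignores where objects appear on the page, which is the aspect the layout-based method addresses.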



Author information

Authors and Affiliations

Department of Computer Science, VŠB - Technical University of Ostrava, 17. listopadu 15, 708 33, Ostrava-Poruba, Czech Republic
Miloš Kudělka, Václav Snášel & Karel Klos
Tokyo Metropolitan University, Japan
Yasufumi Takama
Charles University, Czech Republic
Jaroslav Pokorný


Editor information

Editors and Affiliations

Dept. Computer Science, Technical University Ostrava, Tr. 17. Listopadu 15, 708 33, Ostrava, Czech Republic
Václav Snášel
Inst. Computer Science, Technical University of Łódź, ul. Wólczańska 215, 93-005, Łódź, Poland
Piotr S. Szczepaniak
Machine Intelligence Research Labs (MIR), Scientific Network for Innovation & Research Excellence, P.O.Box 2259, 98071-2259, Auburn, WA, USA
Ajith Abraham
PAN Warszawa, Systems Research Institute, Newelska 6, 01-447, Warszawa, Poland
Janusz Kacprzyk


Kudělka, M., Takama, Y., Snášel, V., Klos, K., Pokorný, J. (2010). Visual Similarity of Web Pages. In: Snášel, V., Szczepaniak, P.S., Abraham, A., Kacprzyk, J. (eds) Advances in Intelligent Web Mastering - 2. Advances in Intelligent and Soft Computing, vol 67. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10687-3_13


DOI: https://doi.org/10.1007/978-3-642-10687-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10686-6
Online ISBN: 978-3-642-10687-3
eBook Packages: Engineering, Engineering (R0)
