Abstract
Measuring web page similarity is an important task in web mining and information retrieval. This paper introduces a method for measuring web page similarity that considers both the textual and the visual properties of pages. The textual properties of a page are described by a vector space model with modified weights. General visual properties are captured by segmenting the page into visual blocks, whose properties are stored in a vector of visual properties. Both vectors are then used to compute the overall web page similarity. The method is described in detail, and the results of several experiments are presented.
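As a rough illustration of the idea of combining a textual vector and a visual vector into one similarity score, the following Python sketch may help. It is not the paper's method: the abstract does not give the weighting scheme, the visual features, or the combination rule, so standard smoothed TF-IDF, cosine similarity, an inverse Euclidean distance for the visual vectors, and a linear mix with a hypothetical parameter `alpha` are all assumptions made here for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from a list of token lists.

    Assumption: plain smoothed TF-IDF stands in for the paper's
    modified weighting, which the abstract does not specify.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append({
            term: (count / length) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def visual_similarity(a, b):
    """Map the Euclidean distance of two visual vectors into (0, 1].

    Assumption: the vectors hold hypothetical, already normalized
    block features (for example block count or mean font size).
    """
    return 1.0 / (1.0 + math.dist(a, b))


def combined_similarity(text_sim, vis_sim, alpha=0.5):
    """Linear mix of both scores; alpha is a hypothetical weight."""
    return alpha * text_sim + (1 - alpha) * vis_sim


pages = [
    ["web", "page", "similarity"],
    ["web", "page", "layout"],
    ["cooking", "recipe"],
]
text_vectors = tfidf_vectors(pages)
text_sim = cosine(text_vectors[0], text_vectors[1])
vis_sim = visual_similarity([0.4, 0.7], [0.5, 0.6])
print(combined_similarity(text_sim, vis_sim, alpha=0.6))
```

In this sketch, pages that share no terms get a textual similarity of zero, so their combined score depends only on the visual part. How the two parts should actually be weighted is a question the experiments in the paper would have to answer.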
Keywords
- Web Page Similarity
- Clustering
- Vector Space Model
- Vector Distance
- Term Weighting
- Visual Blocks
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bartík, V. (2012). Measuring Web Page Similarity Based on Textual and Visual Properties. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol 7268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29350-4_2
DOI: https://doi.org/10.1007/978-3-642-29350-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29349-8
Online ISBN: 978-3-642-29350-4
eBook Packages: Computer Science (R0)
