Measuring Web Page Similarity Based on Textual and Visual Properties

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7268)

Abstract

Measuring web page similarity is an important task in web mining and information retrieval. This paper introduces a method for measuring web page similarity that considers both the textual and the visual properties of pages. The textual properties of a page are described by a modified weighted vector space model. General visual properties are captured by segmenting the page into visual blocks, whose properties are stored in a vector of visual properties. Both vectors are then used to compute the overall web page similarity. The paper describes the method in detail and presents the results of several experiments.
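To make the idea of combining the two vectors concrete, here is a minimal sketch in Python. It is not the paper's method: the abstract does not give the modified term weighting, the contents of the visual vector, or the combination formula. The sketch assumes plain term frequencies, cosine similarity for both vectors, and a linear mix controlled by a hypothetical parameter `alpha`; the function names are likewise illustrative.

```python
import math
from collections import Counter


def tf_vector(text):
    """Plain term-frequency vector (assumption: the paper's modified
    weighting is not specified in the abstract)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def page_similarity(text_a, visual_a, text_b, visual_b, alpha=0.5):
    """Linear mix of textual and visual similarity.

    visual_a / visual_b are equal-length lists of numeric block
    properties (hypothetical layout); alpha is an assumed mixing weight.
    """
    text_sim = cosine(tf_vector(text_a), tf_vector(text_b))
    visual_sim = cosine(dict(enumerate(visual_a)), dict(enumerate(visual_b)))
    return alpha * text_sim + (1 - alpha) * visual_sim
```

With `alpha=1.0` the score reduces to a purely textual comparison and with `alpha=0.0` to a purely visual one, so the parameter makes the trade-off between the two kinds of properties explicit.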

Keywords

  • Web Page Similarity
  • Clustering
  • Vector Space Model
  • Vector Distance
  • Term Weighting
  • Visual Blocks

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bartík, V. (2012). Measuring Web Page Similarity Based on Textual and Visual Properties. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol 7268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29350-4_2

  • DOI: https://doi.org/10.1007/978-3-642-29350-4_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29349-8

  • Online ISBN: 978-3-642-29350-4

  • eBook Packages: Computer Science, Computer Science (R0)