Factors Affecting Web Page Similarity

  • Conference paper
Advances in Information Retrieval (ECIR 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)

Abstract

Tools that allow effective information organisation, access and navigation are becoming increasingly important on the Web. Similarity between web pages is a concept that is central to such tools. In this paper, we examine the effect that content and layout-related aspects of web pages have on web page similarity. We consider the textual content contained within common HTML tags, the structural layout of pages, and the query terms contained within pages. Our study shows that combinations of factors can yield more promising results than individual factors, and that different aspects of web pages affect similarities between pages in different ways. We found a number of factors that, when taken into account, can result in effective measures of similarity between web pages. Query information, in particular, proved to be important for the effective organisation of web pages.
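
The idea of combining several factors into one similarity score can be illustrated with a minimal sketch. It is not the measure used in the paper, which the abstract does not specify. It assumes that each page is represented as a mapping from a factor name to a list of tokens, that each factor is compared with cosine similarity over term frequencies, and that factors are combined by a weighted average. The factor names ("title", "body", "query"), the weights, and the function names are all hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def combined_similarity(page_a: dict, page_b: dict, weights: dict) -> float:
    """Weighted average of per-factor cosine similarities.

    page_a, page_b: factor name -> list of tokens (assumed representation).
    weights: factor name -> non-negative weight (hypothetical values).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for factor, weight in weights.items():
        vec_a = Counter(page_a.get(factor, []))
        vec_b = Counter(page_b.get(factor, []))
        score += weight * cosine(vec_a, vec_b)
    return score / total


# Illustrative pages: tokens from the <title> tag, the body text, and the
# query terms that occur in each page (all made-up data).
page_1 = {"title": ["web", "clustering"], "body": ["pages", "similarity", "web"], "query": ["web"]}
page_2 = {"title": ["web", "search"], "body": ["pages", "ranking", "web"], "query": ["web"]}

weights = {"title": 1.0, "body": 1.0, "query": 2.0}  # assumed, not from the paper
print(combined_similarity(page_1, page_2, weights))
```

Setting a weight to zero drops that factor, so the same function can compare a single factor against a combination of factors.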

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tombros, A., Ali, Z. (2005). Factors Affecting Web Page Similarity. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_35

  • DOI: https://doi.org/10.1007/978-3-540-31865-1_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25295-5

  • Online ISBN: 978-3-540-31865-1

  • eBook Packages: Computer Science, Computer Science (R0)