Factors Affecting Web Page Similarity

Tombros, Anastasios; Ali, Zeeshan

doi:10.1007/978-3-540-31865-1_35


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)

Included in the following conference series:

European Conference on Information Retrieval


Abstract

Tools that allow effective information organisation, access and navigation are becoming increasingly important on the Web. Similarity between web pages is a concept that is central to such tools. In this paper, we examine the effect that content and layout-related aspects of web pages have on web page similarity. We consider the textual content contained within common HTML tags, the structural layout of pages, and the query terms contained within pages. Our study shows that combinations of factors can yield more promising results than individual factors, and that different aspects of web pages affect similarities between pages in different ways. We found a number of factors that, when taken into account, can result in effective measures of similarity between web pages. Query information, in particular, proved to be important for the effective organisation of web pages.
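
As an illustration only, the three kinds of factors named in the abstract (text inside HTML tags, structural layout, and query terms) could be combined into a single score roughly as follows. This is a minimal sketch and not the authors' method: the chosen tags (`title`, `h1`, `b`), the use of cosine similarity, the tag-count proxy for layout, the restriction to query terms, and the equal default weights are all assumptions made for this example, and every name in it is hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collects term counts per tag of interest, page-wide term counts, and tag counts."""

    def __init__(self, tags=("title", "h1", "b")):  # tag choice is an assumption
        super().__init__()
        self.tags = tags
        self.open = []                               # tags of interest currently open
        self.tag_terms = {t: Counter() for t in tags}
        self.all_terms = Counter()                   # every term seen on the page
        self.layout = Counter()                      # how often each tag name occurs

    def handle_starttag(self, tag, attrs):
        self.layout[tag] += 1
        if tag in self.tags:
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open:
            self.open.remove(tag)

    def handle_data(self, data):
        terms = re.findall(r"[a-z0-9]+", data.lower())
        self.all_terms.update(terms)
        for tag in set(self.open):
            self.tag_terms[tag].update(terms)


def cosine(a, b):
    """Cosine similarity between two Counters; 0.0 if either one is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract(html):
    features = PageFeatures()
    features.feed(html)
    features.close()
    return features


def page_similarity(html_a, html_b, query, weights=(1.0, 1.0, 1.0)):
    """Weighted average of content, layout and query similarity (weights are assumptions)."""
    a, b = extract(html_a), extract(html_b)
    # Factor 1: text inside the chosen tags, averaged over those tags.
    content = sum(cosine(a.tag_terms[t], b.tag_terms[t]) for t in a.tags) / len(a.tags)
    # Factor 2: a crude layout proxy, comparing how often each tag name occurs.
    layout = cosine(a.layout, b.layout)
    # Factor 3: similarity computed over the query terms only.
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    qa = Counter({t: c for t, c in a.all_terms.items() if t in q})
    qb = Counter({t: c for t, c in b.all_terms.items() if t in q})
    query_sim = cosine(qa, qb)
    w_c, w_l, w_q = weights
    return (w_c * content + w_l * layout + w_q * query_sim) / (w_c + w_l + w_q)
```

Changing `weights` makes it possible to compare a single factor, for example `(1.0, 0.0, 0.0)` for tag content alone, against a combination of factors, which is the kind of comparison the abstract describes.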


Author information

Authors and Affiliations

Department of Computer Science, Queen Mary University of London, London, U.K.
Anastasios Tombros & Zeeshan Ali

Editor information

Editors and Affiliations

Departamento de Electrónica y Computación, Universidad de Santiago de Compostela, Spain
David E. Losada
Departamento de Ciencias de la Computación e Inteligencia Artificial E.T.S.I. Informática y de Telecomunicación, Universidad de Granada, 18071, Granada, Spain
Juan M. Fernández-Luna

Cite this paper

Tombros, A., Ali, Z. (2005). Factors Affecting Web Page Similarity. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_35

DOI: https://doi.org/10.1007/978-3-540-31865-1_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science, Computer Science (R0)
