THESUS: Organizing Web document collections based on link semantics

The VLDB Journal

Abstract.

The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified based on their content, without taking into account the fact that these documents are connected to each other by links. We claim that a page’s classification is enriched by the detection of its incoming links’ semantics. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed and closely related to the tasks of browsing and searching is the similarity of documents at the semantic level. These observations lead us to adopt a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. This richer characterization makes operations such as clustering and labeling particularly worthwhile. To this end, we devised a system called THESUS. The system takes an initial set of Web documents, extracts keywords from all pages’ incoming links, and converts them to semantics by mapping them to a domain’s ontology. A clustering algorithm is then applied to discover groups of Web documents. The effectiveness of the clustering process rests on a novel similarity measure between documents characterized by sets of terms. Web documents are organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
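As an illustration of the first steps described in the abstract (keywords from incoming links, mapping to ontology concepts, similarity between sets of terms), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy ontology `PARENT`, the toy thesaurus `THESAURUS`, and all function names are hypothetical. The concept similarity is the Wu and Palmer measure, and the set similarity is a simple symmetric best-match average; neither should be read as the novel measure the article introduces. The clustering and labeling steps are not shown.

```python
# Hypothetical toy ontology: each concept maps to its parent (root -> None).
PARENT = {
    "thing": None,
    "science": "thing",
    "sport": "thing",
    "physics": "science",
    "biology": "science",
    "football": "sport",
    "tennis": "sport",
}

# Hypothetical thesaurus: link keyword -> ontology concept.
THESAURUS = {
    "quantum": "physics",
    "physics": "physics",
    "genetics": "biology",
    "soccer": "football",
    "tennis": "tennis",
}


def concepts_from_link_text(link_texts):
    """Map the words of incoming-link texts to ontology concepts."""
    concepts = set()
    for text in link_texts:
        for word in text.lower().split():
            if word in THESAURUS:
                concepts.add(THESAURUS[word])
    return concepts


def path_to_root(concept):
    """Return the chain concept, parent, ..., root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lca) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first concept shared with a is the lowest common ancestor.
    lca = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lca) / (depth(a) + depth(b))


def set_similarity(s, t):
    """Symmetric average of each concept's best match in the other set (illustrative only)."""
    if not s or not t:
        return 0.0
    s_to_t = sum(max(wu_palmer(a, b) for b in t) for a in s) / len(s)
    t_to_s = sum(max(wu_palmer(a, b) for a in s) for b in t) / len(t)
    return (s_to_t + t_to_s) / 2


page_a = concepts_from_link_text(["Quantum physics lab"])
page_b = concepts_from_link_text(["Genetics department"])
page_c = concepts_from_link_text(["Soccer scores"])
print(set_similarity(page_a, page_b), set_similarity(page_a, page_c))
```

In this toy setup the two science pages score higher against each other than either does against the sports page, which is the kind of signal a clustering step could then use to form thematic subsets.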

Author information

Correspondence to Maria Halkidi.

Additional information

Received: 16 December 2002, Accepted: 16 April 2003, Published online: 17 September 2003

About this article

Cite this article

Halkidi, M., Nguyen, B., Varlamis, I. et al. THESUS: Organizing Web document collections based on link semantics. VLDB 12, 320–332 (2003). https://doi.org/10.1007/s00778-003-0100-6

  • DOI: https://doi.org/10.1007/s00778-003-0100-6
