The VLDB Journal, Volume 12, Issue 4, pp 320–332

THESUS: Organizing Web document collections based on link semantics

  • Maria Halkidi
  • Benjamin Nguyen
  • Iraklis Varlamis
  • Michalis Vazirgiannis


The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified based on their content, without taking into account the fact that these documents are connected to each other by links. We claim that a page’s classification is enriched by the detection of its incoming links’ semantics. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed and closely related to the tasks of browsing and searching is the similarity of documents at the semantic level. These observations lead us to adopt a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. The enhanced document characterization makes operations such as clustering and labeling very interesting. To this end, we devised a system called THESUS. The system takes an initial set of Web documents, extracts keywords from all pages’ incoming links, and converts them to semantics by mapping them to a domain’s ontology. A clustering algorithm is then applied to discover groups of Web documents. The effectiveness of the clustering process rests on a novel similarity measure between documents characterized by sets of terms. Web documents are organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
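
The abstract does not define the similarity measure, so the following is only a minimal sketch of the general idea of comparing documents described by sets of ontology terms. It assumes a hypothetical toy hierarchy (`PARENT`), uses the well-known Wu-Palmer concept similarity as a stand-in, and combines term scores with a simple symmetric best-match average; none of these choices should be read as the measure THESUS actually uses.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy concept hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "science": "entity",
    "sport": "entity",
    "database": "science",
    "clustering": "science",
    "football": "sport",
}


def path_to_root(concept: str) -> List[str]:
    """Return the chain [concept, parent, ..., root]."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(concept: str) -> int:
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def set_similarity(x: Set[str], y: Set[str]) -> float:
    """Assumed aggregation: symmetric average of each term's best match in the other set."""
    if not x or not y:
        return 0.0
    x_to_y = sum(max(wu_palmer(a, b) for b in y) for a in x) / len(x)
    y_to_x = sum(max(wu_palmer(a, b) for a in x) for b in y) / len(y)
    return 0.5 * (x_to_y + y_to_x)
```

With such a function, two documents whose incoming-link keywords map to `{"database", "clustering"}` and `{"football"}` score lower than two documents that both map to concepts under `science`, which is the kind of signal a clustering step could then use to form thematic groups.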


Keywords: World Wide Web, Link analysis, Similarity measure, Document clustering, Link management, Semantics





Copyright information

© Springer-Verlag Berlin/Heidelberg 2003

Authors and Affiliations

  • Maria Halkidi (1)
  • Benjamin Nguyen (2)
  • Iraklis Varlamis (1)
  • Michalis Vazirgiannis (1)

  1. 76 Patision Street, Athens University of Economics and Business, Athens, Greece
  2. Domaine de Voluceau, INRIA, Le Chesnay, France
