Soft Computing, Volume 9, Issue 7, pp 481–492

Contextual weighted representations and indexing models for the retrieval of HTML documents

  • R. A. Marques Pereira
  • A. Molinari
  • G. Pasi
Article

Abstract

The diffusion of the World Wide Web (WWW) and the consequent increase in the production and exchange of textual information demand the development of effective information retrieval systems. The HyperText Markup Language (HTML) constitutes a common basis for generating documents on the Internet and on intranets. HTML allows the author to organize the text into subparts delimited by special tags; the HTML browser then renders these subparts in distinct ways, i.e. with distinct typographical formats. In this paper we propose a model for indexing HTML documents that exploits the role of tags in encoding the importance of the text they delimit. Central to our model is a method for computing the significance degree of a term in a document by weighting the term's instances according to the tags in which they occur. The proposed indexing model is based on a contextual weighted representation of the document under consideration, by means of which a set of (normalized) numerical weights is assigned to the various tags containing the text. The weighted representation is contextual in the sense that the numerical weights assigned to the various tags and their respective text depend not only on the tags themselves but also on the particular document considered. By means of the contextual weighted representation, our indexing model reflects not only the general syntactic structure of the HTML language but also the information conveyed by the particular way in which the author instantiates that general structure in the document under consideration. We discuss two different forms of contextual weighting: the first is based on a linear weighted representation and is closer to the standard model of universal (i.e. non-contextual) weighting; the second is based on a more complex nonlinear weighted representation and has a number of novel and interesting features.
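
The idea of tag-dependent term weighting can be illustrated with a minimal Python sketch. It is not the model defined in the paper: the tag importance values, the function names, the use of within-tag relative frequency, and the rule that makes the weights contextual (renormalizing over the tags that actually occur in the document) are all assumptions chosen for illustration. Only the linear case is sketched; the nonlinear representation is not attempted here.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical importance scores per tag. These values are illustrative
# assumptions, not the weights used in the paper.
TAG_IMPORTANCE = {"title": 6.0, "h1": 5.0, "h2": 4.0, "b": 3.0, "em": 2.0, "p": 1.0}
DEFAULT_TAG = "p"  # text outside any listed tag is treated as plain text


class TagTermCounter(HTMLParser):
    """Counts term occurrences under the innermost enclosing listed tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = defaultdict(Counter)  # tag -> term -> count

    def handle_starttag(self, tag, attrs):
        if tag in TAG_IMPORTANCE:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else DEFAULT_TAG
        for raw in data.lower().split():
            term = raw.strip(".,;:!?()")
            if term:
                self.counts[tag][term] += 1


def count_terms_by_tag(html):
    """Parse an HTML string and return per-tag term counts."""
    parser = TagTermCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def universal_weights():
    """Non-contextual weights: normalized over all known tags."""
    total = sum(TAG_IMPORTANCE.values())
    return {tag: w / total for tag, w in TAG_IMPORTANCE.items()}


def contextual_weights(counts):
    """Assumed contextual rule: normalize over the tags present in this document."""
    present = [tag for tag in counts if counts[tag]]
    total = sum(TAG_IMPORTANCE[tag] for tag in present)
    return {tag: TAG_IMPORTANCE[tag] / total for tag in present}


def significance(term, counts, weights):
    """Linear combination of the term's relative frequency within each tag."""
    score = 0.0
    for tag, weight in weights.items():
        tag_total = sum(counts[tag].values())
        if tag_total:
            score += weight * counts[tag][term] / tag_total
    return score


doc = "<title>Fuzzy retrieval</title><p>retrieval of documents</p>"
counts = count_terms_by_tag(doc)
for name, weights in [("universal", universal_weights()),
                      ("contextual", contextual_weights(counts))]:
    print(name, significance("retrieval", counts, weights))
```

In this sketch the same document yields a higher significance degree under the contextual weights than under the universal ones, because the weight mass of tags the author did not use is redistributed over the tags that do appear.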

Keywords

  • Information retrieval
  • Adaptive representation of documents
  • Contextual weighting

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  1. Dipartimento di Informatica e Studi Aziendali, Università degli Studi di Trento, Trento, Italy
  2. Istituto Tecnologie della Costruzione, Consiglio Nazionale delle Ricerche (CNR), Milano, Italy
