Abstract
This chapter proposes a new model of information retrieval system for searching the hypertexts that underlie Web sites. The model builds a two-level index: one level describes each HTML page individually, and the other describes the context of that page. We assume that the textual content of an HTML page alone is not sufficient for an indexing process to grasp the information the page conveys. Contextual information is located in complementary pages, which are identified for a given page by means of a measure of complementarity. This measure draws on both content and link analysis and assesses how complementary two pages are. Using both local and contextual information when indexing pages improves the quality of their index and, in turn, the effectiveness of the search engine.
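The idea of a two-level index can be illustrated with a minimal Python sketch. It is not the chapter's actual method: the abstract does not define the measure, so `complementarity_score` below is a hypothetical stand-in that blends cosine similarity of term counts (content analysis) with a direct-link indicator (link analysis). The weight `alpha`, the `threshold`, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a page (whitespace tokenisation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def complementarity_score(p, q, pages, links, alpha=0.5):
    """Hypothetical measure: weighted mix of content similarity and a
    link indicator (1.0 if either page links directly to the other)."""
    content = cosine(term_vector(pages[p]), term_vector(pages[q]))
    linked = 1.0 if q in links.get(p, set()) or p in links.get(q, set()) else 0.0
    return alpha * content + (1 - alpha) * linked


def build_two_level_index(pages, links, threshold=0.5, alpha=0.5):
    """Build, for each page, a local level (its own terms) and a
    contextual level (terms of pages scoring at or above the threshold)."""
    index = {}
    for p, text in pages.items():
        context = Counter()
        for q in pages:
            if q != p and complementarity_score(p, q, pages, links, alpha) >= threshold:
                context.update(term_vector(pages[q]))
        index[p] = {"local": term_vector(text), "context": context}
    return index


# Toy site: page "a" links to page "b"; page "c" is unrelated.
pages = {
    "a": "solar energy policy",
    "b": "solar panels cost",
    "c": "football scores",
}
links = {"a": {"b"}}
index = build_two_level_index(pages, links)
```

In this toy site, the contextual level of page "a" receives the terms of page "b", so a query for "panels" could match "a" through its context even though the word does not appear on the page itself. Page "c" has no complementary page under these assumed settings and keeps an empty contextual level.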
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Aguiar, F. (2003). Improving Web Search by the Identification of Contextual Information. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (eds) Intelligent Exploration of the Web. Studies in Fuzziness and Soft Computing, vol 111. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-1772-0_13
DOI: https://doi.org/10.1007/978-3-7908-1772-0_13
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-2519-0
Online ISBN: 978-3-7908-1772-0
eBook Packages: Springer Book Archive