Using Document Dimensions for Enhanced Information Retrieval

Jayasooriya, Thimal; Manandhar, Suresh

doi:10.1007/978-3-540-30176-9_19

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3285)

Included in the following conference series:

Asian Applied Computing Conference

Abstract

Conventional document search techniques are constrained by matching individual keywords or phrases against source documents. As a result, they miss documents that contain semantically similar terms and achieve relatively low recall. At the same time, processing capabilities and tools for syntactic and semantic analysis of language have advanced to the point where index-time linguistic analysis of source documents is feasible. In this paper, we introduce document dimensions, a means of classifying or grouping terms discovered in documents. Using an enhanced version of Jakarta Lucene [1], we demonstrate that supplementing keyword analysis with some syntactic and semantic information can enhance the quality of information retrieval results.
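The recall argument in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Lucene API; it assumes a hand-written, hypothetical `DIMENSIONS` table that groups related terms, and every function name is invented for illustration. Each document is indexed under its keywords and under the dimensions those keywords belong to, so a query can also match documents that share a dimension with the query term.

```python
from collections import defaultdict

# Hypothetical dimension table (an assumption for illustration only):
# each dimension groups terms that are treated as semantically related.
DIMENSIONS = {
    "vehicle": {"car", "automobile", "truck"},
    "illness": {"flu", "influenza", "cold"},
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each token."""
    tokens = (t.strip(".,;:!?").lower() for t in text.split())
    return [t for t in tokens if t]


def build_index(docs):
    """Index documents by keyword and, at index time, by term dimension."""
    term_index = defaultdict(set)
    dim_index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term_index[token].add(doc_id)
            for dim, members in DIMENSIONS.items():
                if token in members:
                    dim_index[dim].add(doc_id)
    return term_index, dim_index


def keyword_search(term_index, query):
    """Plain keyword lookup: only exact term matches are returned."""
    return set(term_index.get(query.lower(), set()))


def dimension_search(term_index, dim_index, query):
    """Keyword lookup plus documents sharing a dimension with the query."""
    q = query.lower()
    hits = set(term_index.get(q, set()))
    for dim, members in DIMENSIONS.items():
        if q in members:
            hits |= dim_index.get(dim, set())
    return hits


docs = {
    1: "The car was parked outside.",
    2: "An automobile needs regular servicing.",
    3: "Influenza spreads in winter.",
}
term_index, dim_index = build_index(docs)
```

With this toy collection, `keyword_search(term_index, "car")` finds only document 1, while `dimension_search(term_index, dim_index, "car")` also finds document 2 through the shared "vehicle" dimension. A real system would derive the groupings from linguistic analysis rather than a fixed table.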

References

[1] Jakarta Lucene, http://jakarta.apache.org/lucene/docs/index.html
[2] van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1980)
[3] Salton, G., Y.C.: On the specification of term values in automatic indexing. Journal of Documentation 29, 351–372 (1973)
[4] Brin, S., Page, L.: Anatomy of a hypertextual web search engine. In: WWW7 (1998)
[5] Brooks, T.: The semantic distance model of relevance assessment. In: Proceedings of the 61st Annual Meeting of ASIS, Pittsburgh, PA. Information Access in the Global Information Economy, vol. 35, pp. 33–44 (1998)
[6] Budanitsky, A.: Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In: Workshop on WordNet and Other Lexical Resources, in NAACL 2000, Pittsburgh, PA, June 2001 (2000)
[7] Dixon, M.: (An overview of document mining technology)
[8] Rijke, M.V.: Beyond document retrieval. In: Trento, Nice (2003)
[9] Yang, K.: Combining Text-, Link-, and Classification-based Retrieval Methods to Enhance Information Discovery on the Web. PhD thesis, University of North Carolina (2002)
[10] Modelling and mining of network information systems, http://www.mathstat.dal.ca/~mominis/
[11] Lawrence, S., Giles, C.: Indexing and retrieval of scientific literature. In: Eighth International Conference on Information and Knowledge Management (1999)
[12] Lawrence, S.: Context in web search. In: IEEE Data Engineering Bulletin (2000)
[13] Hu, W.: An overview of world wide web search technologies. In: International Conference on Information Systems, Analysis and Synthesis, vol. 12 (2001)
[14] Etzioni, O.: On the instability of search engines. In: Content-Based Multimedia Information Access (RIAO), Paris, France (2000)
[15] WebFountain, http://www.almaden.ibm.com/webfountain/
[16] Eder, J., Koncilia, C.: Evolution of dimension data in temporal data warehouses. Springer, Heidelberg (1998)
[17] Roellke, T.: The accessibility dimension for structured document retrieval. Journal of Documentation (1998)
[18] Mothé, J.: Information mining: using document dimensions to analyse a document set interactively. In: European Colloquium on IR Research: ECIR, pp. 66–77 (2001)
[19] Mothé, J.: Doccube: Multi-dimensional visualization and exploration of large document sets. In: JASIST (Journal of American Society for Information Science and Technology) (2003)
[20] Tsang, V., Stevenson, S.: Calculating semantic distance between word sense probability distributions. In: Proceedings of CoNLL 2004, Boston, MA, USA (2004)
[21] Heydon, A., Najork, M.: Mercator: A scalable, extensible web crawler. World Wide Web 2, 219–229 (1999)
[22] Mailing list archives of nutch.org, http://sourceforge.net/mailarchive/forum.php?forum_id=13068&viewmonth=%200404&viewday=26

Author information

Authors and Affiliations

Department of Computer Science, University of York, UK
Thimal Jayasooriya & Suresh Manandhar

Editor information

Editors and Affiliations

Department of Computer Science, The University of York, YO10 5NG, Heslington, York, UK
Suresh Manandhar
Advanced Computer Architectures Group, Department of Computer Science, University of York, York, UK
Jim Austin
Department of Electrical Engineering, Indian Institute of Technology, 400076, Powai, Bombay, India
Uday Desai
Department of Computer Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-8654, Tokyo, Japan
Yoshio Oyanagi
Indian Institute of Information Technology, Bangalore, 26/C, Electronics City, Hosur Road, 560100, Bangalore, India
Asoke K. Talukder

About this paper

Cite this paper

Jayasooriya, T., Manandhar, S. (2004). Using Document Dimensions for Enhanced Information Retrieval. In: Manandhar, S., Austin, J., Desai, U., Oyanagi, Y., Talukder, A.K. (eds) Applied Computing. AACC 2004. Lecture Notes in Computer Science, vol 3285. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30176-9_19

DOI: https://doi.org/10.1007/978-3-540-30176-9_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23659-7
Online ISBN: 978-3-540-30176-9
eBook Packages: Springer Book Archive
