How to find the nearest by evaluating only few? Clustering techniques used to improve the efficiency of an Information Retrieval system based on Distributional Semantics

  • Conference paper

Summary

The first objective of this contribution is to describe our textual information retrieval system based on distributional semantics. The central idea of the approach is to represent the retrievable units and the user queries in a unified way, as projections in a vector space of pertinent terms. The projections are derived from a co-occurrence matrix that is computed on large reference text corpora and captures the distributional semantic information. A similarity computation based on the cosine measure is then used to characterize the semantic proximity between queries and documents.
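As a rough illustration of this idea, the sketch below builds co-occurrence profiles from a toy reference corpus, projects queries and documents into the space of index terms, and ranks documents by cosine similarity. The function names, the sentence-level co-occurrence window, and the frequency-weighted sum used for the projection are assumptions made for this example; the summary does not specify these details.

```python
import math
from collections import Counter


def cooccurrence_profiles(corpus, index_terms):
    """Map each word to a vector counting how often it shares a sentence
    with each index term (assumed co-occurrence window: one sentence)."""
    position = {term: k for k, term in enumerate(index_terms)}
    profiles = {}
    for sentence in corpus:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            vec = profiles.setdefault(word, [0.0] * len(index_terms))
            for j, other in enumerate(words):
                if i != j and other in position:
                    vec[position[other]] += 1.0
    return profiles


def project(text, profiles, dim):
    """Represent a text as the frequency-weighted sum of its words' profiles."""
    vec = [0.0] * dim
    for word, freq in Counter(text.lower().split()).items():
        for k, value in enumerate(profiles.get(word, ())):
            vec[k] += freq * value
    return vec


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, documents, profiles, dim):
    """Score every document against the query, best match first."""
    q = project(query, profiles, dim)
    scored = [(cosine(q, project(doc, profiles, dim)), i)
              for i, doc in enumerate(documents)]
    return sorted(scored, reverse=True)
```

In this representation a query and a document can be similar without sharing any word, as long as their words occur in similar contexts in the reference corpus.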

Retrieval effectiveness can be further improved by the use of relevance feedback techniques. A simple feedback method, in which the user's document relevance judgments are interactively integrated into the original query, is also presented and evaluated.
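One common way to integrate relevance judgments into a query vector is a Rocchio-style update using positive feedback only, sketched below. This is an assumption chosen for illustration, not necessarily the method evaluated in the paper; the function name and the weights `alpha` and `beta` are hypothetical.

```python
def feedback(query_vec, relevant_vecs, alpha=1.0, beta=0.5):
    """Move the query toward the centroid of the documents judged relevant.

    alpha weights the original query, beta weights the centroid; both
    default values are arbitrary choices for this sketch.
    """
    if not relevant_vecs:
        return list(query_vec)
    n = len(relevant_vecs)
    centroid = [sum(column) / n for column in zip(*relevant_vecs)]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]
```

The updated vector is then used in place of the original query for the next retrieval round.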

Although our first experiments led to quite promising results, one major drawback of our IR system in its original form is that answering a query requires evaluating the similarity between that query and every document in the textual base. Therefore, the second objective of this contribution is to investigate how clustering techniques can be applied to the textual database so that the documents satisfying a query can be retrieved through a partial exploration of the base. A tentative solution based on hierarchical clustering is suggested.
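The partial exploration can be sketched as follows, assuming a binary tree built by agglomerative clustering on centroid similarity and a greedy descent toward the most similar child. The summary does not describe the actual clustering or search procedure, so the node layout, the merge criterion, and the `max_leaf_size` parameter are all assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_tree(doc_vectors):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    centroids are most similar, until a single root remains."""
    nodes = [{"ids": [i], "centroid": list(v), "children": []}
             for i, v in enumerate(doc_vectors)]
    while len(nodes) > 1:
        best = None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                s = cosine(nodes[a]["centroid"], nodes[b]["centroid"])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        left, right = nodes[a], nodes[b]
        ids = left["ids"] + right["ids"]
        merged = {"ids": ids,
                  "centroid": centroid([doc_vectors[i] for i in ids]),
                  "children": [left, right]}
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]


def search(query_vec, tree, doc_vectors, max_leaf_size=2):
    """Descend toward the child with the most similar centroid, then score
    only the documents in the cluster where the descent stops."""
    node = tree
    while node["children"] and len(node["ids"]) > max_leaf_size:
        node = max(node["children"],
                   key=lambda child: cosine(query_vec, child["centroid"]))
    scored = [(cosine(query_vec, doc_vectors[i]), i) for i in node["ids"]]
    return sorted(scored, reverse=True)
```

A greedy descent of this kind scores only the documents of one cluster, so it is approximate: the true nearest document can be missed when it lies in a branch whose centroid is less similar to the query.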

© 1998 Springer Japan

About this paper

Cite this paper

Rajman, M., Rungsawang, A. (1998). How to find the nearest by evaluating only few? Clustering techniques used to improve the efficiency of an Information Retrieval system based on Distributional Semantics. In: Hayashi, C., Yajima, K., Bock, HH., Ohsumi, N., Tanaka, Y., Baba, Y. (eds) Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Tokyo. https://doi.org/10.1007/978-4-431-65950-1_54

  • DOI: https://doi.org/10.1007/978-4-431-65950-1_54

  • Publisher Name: Springer, Tokyo

  • Print ISBN: 978-4-431-70208-5

  • Online ISBN: 978-4-431-65950-1

  • eBook Packages: Springer Book Archive
