How to find the nearest by evaluating only few? Clustering techniques used to improve the efficiency of an Information Retrieval system based on Distributional Semantics

Rajman, Martin; Rungsawang, Arnon

doi:10.1007/978-4-431-65950-1_54

Conference paper

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Summary

The first objective of this contribution is to describe our textual information retrieval system based on distributional semantics. The central idea of the approach is to represent the retrievable units and the user queries in a unified way, as projections in a vector space of pertinent terms. The projections are derived from a co-occurrence matrix computed on large reference text corpora, which capture the distributional semantic information. A similarity computation based on the cosine measure is then used to characterize the semantic proximity between queries and documents.
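
As an illustration only, the representation described above can be sketched as follows. The summary does not give the exact weighting scheme, so this sketch assumes that a text is projected as the frequency-weighted sum of the co-occurrence profiles of its words, and that co-occurrence is counted per sentence; the function names (cooccurrence_profiles, project, cosine) and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_profiles(corpus, index_terms):
    """Map each word to its co-occurrence counts with the index terms.

    Assumption: two words co-occur when they appear in the same unit
    (here, one string of the reference corpus). A word that is itself an
    index term also counts on its own dimension.
    """
    profiles = {}
    for unit in corpus:
        present = set(unit.lower().split())
        for word in present:
            row = profiles.setdefault(word, [0.0] * len(index_terms))
            for j, term in enumerate(index_terms):
                if term in present:
                    row[j] += 1.0
    return profiles


def project(text, profiles, dim):
    """Project a text (document or query) into the space of index terms."""
    vec = [0.0] * dim
    for word, freq in Counter(text.lower().split()).items():
        row = profiles.get(word)
        if row is None:
            continue  # word never seen in the reference corpus
        for j in range(dim):
            vec[j] += freq * row[j]
    return vec


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy reference corpus and index terms (illustrative data, not from the paper).
corpus = [
    "cats chase mice",
    "dogs chase cats",
    "stocks fell sharply",
    "markets and stocks fell",
]
index_terms = ["cats", "stocks"]
profiles = cooccurrence_profiles(corpus, index_terms)

query = project("mice", profiles, len(index_terms))
doc_a = project("dogs chase", profiles, len(index_terms))
doc_b = project("markets fell", profiles, len(index_terms))
print(cosine(query, doc_a))  # -> 1.0
print(cosine(query, doc_b))  # -> 0.0
```

In this toy example the query "mice" and the document "dogs chase" share no word, yet they are maximally similar because both co-occur with "cats" in the reference corpus. This is the kind of proximity a purely lexical match would miss.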

Retrieval effectiveness can be further improved by relevance feedback techniques. A simple feedback method, in which document relevance judgments are interactively integrated into the original query, is also presented and evaluated.
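
The summary does not specify how relevance is integrated into the query. One common way to do it, shown here purely as an assumption, is a Rocchio-style update that moves the query vector toward the centroid of the documents the user marked as relevant. The function name feedback_query and the weights alpha and beta are hypothetical and are not taken from the paper.

```python
def feedback_query(query_vec, relevant_vecs, alpha=1.0, beta=0.5):
    """Return a new query vector pulled toward the relevant documents.

    Assumed update rule (Rocchio-style, positive feedback only):
        new_query = alpha * query + beta * centroid(relevant documents)
    """
    if not relevant_vecs:
        return list(query_vec)  # no feedback given: keep the query as is
    dim = len(query_vec)
    centroid = [
        sum(vec[j] for vec in relevant_vecs) / len(relevant_vecs)
        for j in range(dim)
    ]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]


# The user marks two documents as relevant after a first retrieval pass.
original = [1.0, 0.0]
relevant = [[0.0, 2.0], [0.0, 4.0]]
print(feedback_query(original, relevant))  # -> [1.0, 1.5]
```

The updated vector would then be compared with the documents using the same cosine measure as the original query.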

Although our first experiments led to quite promising results, one major drawback of our IR system in its original form is that answering a query requires evaluating the similarity between that query and every document in the textual base. Therefore, the second objective of this contribution is to investigate how clustering techniques can be applied to the textual database so that the documents satisfying a query can be retrieved through a partial exploration of the base. A tentative solution based on hierarchical clustering is suggested.
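
To make the idea of partial exploration concrete, here is a minimal sketch under stated assumptions; it is not the authors' algorithm, which the summary does not detail. It assumes an agglomerative clustering that repeatedly merges the two clusters with the most similar centroids, and a search that descends the resulting tree toward the most similar child centroid and ranks only the documents in the cluster it reaches. The names build_tree, search, and max_leaf_size are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(v[j] for v in vectors) / len(vectors)
            for j in range(len(vectors[0]))]


def build_tree(doc_vectors):
    """Build a binary cluster tree by merging the most similar centroids."""
    nodes = [{"docs": [i], "centroid": list(v), "children": []}
             for i, v in enumerate(doc_vectors)]
    while len(nodes) > 1:
        best = None  # (similarity, index_a, index_b) of the best pair so far
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                sim = cosine(nodes[a]["centroid"], nodes[b]["centroid"])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        docs = nodes[a]["docs"] + nodes[b]["docs"]
        merged = {
            "docs": docs,
            "centroid": centroid([doc_vectors[i] for i in docs]),
            "children": [nodes[a], nodes[b]],
        }
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)]
        nodes.append(merged)
    return nodes[0]


def search(tree, query_vec, doc_vectors, max_leaf_size=2):
    """Descend toward the closest centroid, then rank only that cluster."""
    node = tree
    while node["children"] and len(node["docs"]) > max_leaf_size:
        node = max(node["children"],
                   key=lambda child: cosine(query_vec, child["centroid"]))
    return sorted(node["docs"],
                  key=lambda i: cosine(query_vec, doc_vectors[i]),
                  reverse=True)


# Toy document vectors forming two obvious groups (illustrative data).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
tree = build_tree(docs)
print(search(tree, [1.0, 0.05], docs))  # -> [0, 1]
print(search(tree, [0.05, 1.0], docs))  # -> [2, 3]
```

This greedy descent is a heuristic: it compares the query with only a few centroids and one cluster of documents, so it can miss a nearest document that lies in a branch it did not explore. The naive pairwise merge loop is also far too slow for a large collection and is used here only for readability.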

Author information

Authors and Affiliations

Department of Computer Science, ENST-Paris, 46 Rue Barrault, F-75634, Paris Cedex 13, France
Martin Rajman & Arnon Rungsawang

Editor information

Editors and Affiliations

The Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan
Chikio Hayashi, Noboru Ohsumi & Yasumasa Baba
School of Management, Science University of Tokyo, 500 Shimokiyoku, Kuki, Saitama 346, Japan
Keiji Yajima
Institut für Statistik, Rheinisch-Westfälische Technische Hochschule (RWTH), D-52056, Aachen, Germany
Hans-Hermann Bock
Faculty of Environmental Science & Technology, Okayama University, 2-1-1 Tsushima-naka, Okayama 700, Japan
Yutaka Tanaka

About this paper

Cite this paper

Rajman, M., Rungsawang, A. (1998). How to find the nearest by evaluating only few? Clustering techniques used to improve the efficiency of an Information Retrieval system based on Distributional Semantics. In: Hayashi, C., Yajima, K., Bock, H.-H., Ohsumi, N., Tanaka, Y., Baba, Y. (eds) Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Tokyo. https://doi.org/10.1007/978-4-431-65950-1_54

DOI: https://doi.org/10.1007/978-4-431-65950-1_54
Publisher Name: Springer, Tokyo
Print ISBN: 978-4-431-70208-5
Online ISBN: 978-4-431-65950-1
eBook Packages: Springer Book Archive
