Summary
The first objective of this contribution is to describe our textual information retrieval (IR) system based on distributional semantics. The central idea of the approach is to represent the retrievable units and the user queries in a unified way, as projections in a vector space of pertinent terms. The projections are derived from a co-occurrence matrix that is computed on large reference text corpora and captures the distributional semantic information. A similarity computation based on the cosine measure is then used to characterize the semantic proximity between queries and documents.
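The representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the window size, the frequency weighting, and the function names `cooccurrence_profiles`, `project`, and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter

def cooccurrence_profiles(corpus, terms, window=2):
    """Map each word to a vector counting its co-occurrences with each indexing term.

    The symmetric window of 2 words is an assumption, not taken from the paper.
    """
    index = {t: i for i, t in enumerate(terms)}
    profiles = {}
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            vec = profiles.setdefault(word, [0.0] * len(terms))
            lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
            for j in range(lo, hi):
                if j != pos and sentence[j] in index:
                    vec[index[sentence[j]]] += 1.0
    return profiles

def project(tokens, profiles, dim):
    """Represent a text as the frequency-weighted sum of its words' profiles."""
    out = [0.0] * dim
    for word, freq in Counter(tokens).items():
        if word in profiles:
            for i, value in enumerate(profiles[word]):
                out[i] += freq * value
    return out

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical reference corpus and term set, for illustration only.
corpus = [
    ["cat", "chases", "mouse"],
    ["dog", "chases", "cat"],
    ["stock", "market", "falls"],
    ["market", "prices", "rise"],
]
terms = ["cat", "mouse", "dog", "market", "stock", "prices"]
profiles = cooccurrence_profiles(corpus, terms)

query = project(["dog"], profiles, len(terms))
doc_a = project(["mouse"], profiles, len(terms))
doc_b = project(["stock"], profiles, len(terms))
sim_a = cosine(query, doc_a)
sim_b = cosine(query, doc_b)
```

In this toy setting the query and `doc_a` share no word, yet they come out as similar because "dog" and "mouse" both co-occur with "cat". That is the effect the co-occurrence projection is meant to capture.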
Retrieval effectiveness can be further improved by the use of relevance feedback techniques. A simple feedback method, in which document relevance judgments are interactively integrated into the original query, will also be presented and evaluated.
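The summary does not give the feedback formula. As a rough illustration only, the sketch below uses a Rocchio-style update, a standard technique swapped in here rather than the authors' method. The function name `feedback` and the weights `alpha`, `beta`, and `gamma` are assumptions.

```python
def feedback(query, relevant, nonrelevant=(), alpha=1.0, beta=0.75, gamma=0.0):
    """Rocchio-style update: move the query toward documents judged relevant.

    alpha, beta and gamma are illustrative defaults, not values from the paper.
    """
    dim = len(query)

    def centroid(vectors):
        # Mean vector of a (possibly empty) list of document vectors.
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    pos = centroid(relevant)
    neg = centroid(nonrelevant)
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]

# The user marks one retrieved document as relevant; the query shifts toward it.
new_query = feedback([1.0, 0.0], relevant=[[0.0, 1.0]])
```

The updated query vector is then compared with the documents again using the same cosine measure as before.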
Although our first experiments led to quite promising results, one major drawback of our IR system in its original form is that answering a query requires evaluating the similarity between that query and every document in the textual base. Therefore, the second objective of this contribution is to investigate how clustering techniques can be applied to the textual database in order to retrieve the documents satisfying a query through a partial exploration of the base. A tentative solution based on hierarchical clustering will be suggested.
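One possible reading of partial exploration through hierarchical clustering is sketched below: documents are merged bottom-up into a binary tree, and a query descends the tree by comparing itself with cluster centroids, so only the documents in the final cluster are scored. The centroid-based merging, the greedy descent, the `max_leaf` parameter, and all names are assumptions. The summary does not specify the actual algorithm.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class Node:
    """A cluster: document indices, their mean vector, and two optional children."""
    def __init__(self, members, centroid, left=None, right=None):
        self.members = members
        self.centroid = centroid
        self.left = left
        self.right = right

def build_tree(vectors):
    """Agglomerative clustering: repeatedly merge the two most similar clusters."""
    dim = len(vectors[0])
    nodes = [Node([i], list(v)) for i, v in enumerate(vectors)]
    while len(nodes) > 1:
        best = None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                s = cosine(nodes[a].centroid, nodes[b].centroid)
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        members = nodes[a].members + nodes[b].members
        centroid = [sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)]
        merged = Node(members, centroid, nodes[a], nodes[b])
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]

def search(root, query, vectors, max_leaf=2):
    """Greedy descent to a small cluster, then rank only that cluster's documents."""
    node = root
    while node.left is not None and len(node.members) > max_leaf:
        left_sim = cosine(query, node.left.centroid)
        right_sim = cosine(query, node.right.centroid)
        node = node.left if left_sim >= right_sim else node.right
    return sorted(node.members,
                  key=lambda i: cosine(query, vectors[i]),
                  reverse=True)

# Hypothetical document vectors forming two obvious groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tree = build_tree(vectors)
ranked = search(tree, [1.0, 0.05], vectors)
```

The greedy descent is approximate: a relevant document sitting in a branch that was not followed is never scored. The quadratic merging loop is only practical for small examples, so a real textual base would need a more efficient clustering procedure.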
© 1998 Springer Japan
About this paper
Cite this paper
Rajman, M., Rungsawang, A. (1998). How to find the nearest by evaluating only few? Clustering techniques used to improve the efficiency of an Information Retrieval system based on Distributional Semantics. In: Hayashi, C., Yajima, K., Bock, HH., Ohsumi, N., Tanaka, Y., Baba, Y. (eds) Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Tokyo. https://doi.org/10.1007/978-4-431-65950-1_54
DOI: https://doi.org/10.1007/978-4-431-65950-1_54
Publisher Name: Springer, Tokyo
Print ISBN: 978-4-431-70208-5
Online ISBN: 978-4-431-65950-1
eBook Packages: Springer Book Archive