Abstract
A major problem with most current information retrieval models is that individual words provide unreliable evidence about the content of a text. When the document is short, e.g. when only the abstract is available, this word-use variability has a substantial impact on Information Retrieval (IR) performance. To address the problem, this letter proposes a new approach to short document retrieval named the Reference Document Model (RDM). RDM obtains the statistical semantics of the query and the document through pseudo feedback from reference documents, applied to both the query and the document. The contributions of this model are three-fold: (1) pseudo feedback for both the query and the document; (2) building the query model and the document model from reference documents; (3) flexible indexing units, which can be any linguistic elements such as documents, paragraphs, sentences, n-grams, terms or characters. For short document retrieval, RDM achieves significant improvements over the classical probabilistic models on the ad hoc retrieval task on Text REtrieval Conference (TREC) test sets. The results also show that the shorter the document, the better RDM performs.
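The abstract does not give the RDM formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a query and a short document can each be described by their similarity to a shared set of reference documents, and that the two descriptions are then compared. The term-frequency weighting, the cosine measure, and every name here (`reference_profile`, `rank`, the toy texts) are illustrative assumptions.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term frequencies (assumed indexing unit: whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reference_profile(text, reference_docs):
    """Describe a text by its similarity to each reference document."""
    vec = tf_vector(text)
    return [cosine(vec, tf_vector(ref)) for ref in reference_docs]


def profile_similarity(p, q):
    """Cosine similarity between two dense reference profiles."""
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank(query, short_docs, reference_docs):
    """Return indices of short_docs, best match first, compared via reference profiles."""
    q_profile = reference_profile(query, reference_docs)
    scored = [
        (profile_similarity(q_profile, reference_profile(doc, reference_docs)), i)
        for i, doc in enumerate(short_docs)
    ]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy data (hypothetical): the query and the first document share no words.
REFERENCES = [
    "car automobile repair maintenance garage",
    "fruit salad recipe kitchen",
]
SHORT_DOCS = ["automobile maintenance", "fruit salad"]
QUERY = "car repair"

if __name__ == "__main__":
    print("direct word overlap:", cosine(tf_vector(QUERY), tf_vector(SHORT_DOCS[0])))
    print("ranking via references:", rank(QUERY, SHORT_DOCS, REFERENCES))
```

In this toy setup the query and the first short document have no terms in common, so a direct word-matching score between them is zero, yet both resemble the same reference document and the first document is ranked on top. This is the word-use variability effect the abstract describes, shown under the stated assumptions only.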
Additional information
Supported by the Funds of Heilongjiang Outstanding Young Teacher (1151G037).
About this article
Cite this article
Qi, H., Li, M., Gao, J. et al. Information Retrieval for short documents. J. of Electron.(China) 23, 933–936 (2006). https://doi.org/10.1007/s11767-006-0044-2
DOI: https://doi.org/10.1007/s11767-006-0044-2