A Linguistically Motivated Probabilistic Model of Information Retrieval
This paper presents a new probabilistic model of information retrieval. Its most important modeling assumption is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but it is essential in the field of statistical natural language processing. The paper uses advances already made in statistical natural language processing to formulate a probabilistic justification for tf×idf term weighting. It shows that this new probabilistic interpretation of tf×idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the Cranfield test collection indicates that the presented model outperforms the vector space model with classical tf×idf weighting and cosine length normalisation.
Keywords: Information Retrieval Theory; Statistical Information Retrieval; Statistical Natural Language Processing
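For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of vector space ranking with tf×idf weights and cosine length normalisation. It is not the probabilistic model the paper proposes, and it is not taken from the paper. The function names `tfidf_vectors` and `score`, the idf form log(N/df), and the choice to treat the query as a set of equally weighted terms are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document.

    Weights are tf * log(N / df), divided by the vector's Euclidean
    length (cosine length normalisation). The idf form is an assumption;
    many variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # A document made only of terms that occur everywhere has norm 0.
        vectors.append(
            {t: (w / norm if norm > 0 else 0.0) for t, w in weights.items()}
        )
    return vectors


def score(query, doc_vector):
    """Sum the normalised document weights of the query terms.

    Assumes every query term has weight 1; the query's own length is
    ignored because it does not change the ranking for a fixed query.
    """
    return sum(doc_vector.get(t, 0.0) for t in query)


# Hypothetical toy collection, already tokenised.
docs = [
    ["information", "retrieval", "model"],
    ["language", "model"],
    ["information", "retrieval", "retrieval"],
]
vectors = tfidf_vectors(docs)
query = ["information", "retrieval"]
ranking = sorted(range(len(docs)), key=lambda i: score(query, vectors[i]), reverse=True)
```

In this toy collection the second document shares no term with the query and scores zero, while the third document ranks above the first because its weight is concentrated on the query terms.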