A Linguistically Motivated Probabilistic Model of Information Retrieval

  • Djoerd Hiemstra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1513)


This paper presents a new probabilistic model of information retrieval. Its most important modeling assumption is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but it is essential in the field of statistical natural language processing. The paper uses advances already made in statistical natural language processing to formulate a probabilistic justification for tf×idf term weighting. It shows that the new probabilistic interpretation of tf×idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the Cranfield test collection indicates that the presented model outperforms the vector space model with classical tf×idf and cosine length normalisation.
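For orientation, the baseline named in the abstract (a vector space model with tf×idf weights and cosine length normalisation) can be sketched roughly as follows. This is not the paper's probabilistic model, and the abstract does not say which tf×idf variant was used: the raw term frequency, the idf formula log(N/df), and the function names `tfidf_vectors`, `cosine` and `rank` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one {term: weight} dict per tokenised document.

    Assumed weighting: raw term frequency times log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection carry no weight.
    q_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Toy collection, invented for this sketch (not the Cranfield data).
docs = [
    ["probabilistic", "model", "retrieval"],
    ["vector", "space", "model"],
    ["statistical", "language", "processing"],
]
for score, index in rank(["probabilistic", "retrieval"], docs):
    print(index, round(score, 3))
```

Dividing by both vector lengths is what "cosine length normalisation" refers to here: it keeps long documents from scoring higher merely because they contain more terms.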


Keywords: Information Retrieval Theory; Statistical Information Retrieval; Statistical Natural Language Processing





Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Djoerd Hiemstra (1)
  1. Centre for Telematics and Information Technology, The Parlevink Language Engineering Group, University of Twente, AE Enschede, The Netherlands
