Probabilistic Document Length Priors for Language Models

  • Roi Blanco
  • Alvaro Barreiro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)

Abstract

This paper addresses the issue of devising a new document prior for the language modeling (LM) approach to Information Retrieval. The prior is based on term statistics, is derived in a probabilistic fashion, and offers a novel way of accounting for document length. Furthermore, we develop a new way of combining document length priors with the query likelihood estimate, based on the risk of accepting the latter as a score. The prior has been combined with a document retrieval language model that uses Jelinek-Mercer (JM) smoothing, a technique which does not take document length into account. Adding the prior boosts retrieval performance, so that the resulting model outperforms both an LM with a document-length-dependent smoothing component (Dirichlet prior) and another state-of-the-art, high-performing scoring function (BM25). The improvements are significant and robust across different collections and query sizes.
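As a rough illustration of the setting described in the abstract, the sketch below scores a document by adding a log document prior to a JM-smoothed query log-likelihood. It is a minimal sketch under stated assumptions, not the method of the paper: the abstract does not give the form of the proposed prior or of the risk-based combination, so the code substitutes a simple prior proportional to document length and the plain Bayesian combination (a sum in log space). The function names and the value `lam=0.5` are hypothetical.

```python
import math
from collections import Counter


def jm_log_likelihood(query, doc, coll_tf, coll_len, lam=0.5):
    """Query log-likelihood log P(q|d) under Jelinek-Mercer smoothing.

    Each query term gets (1 - lam) * P_ml(t|d) + lam * P(t|C).
    The mixing weight lam does not depend on document length.
    """
    tf = Counter(doc)
    dl = len(doc)
    total = 0.0
    for t in query:
        p_doc = tf[t] / dl if dl else 0.0
        p_coll = coll_tf.get(t, 0) / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: likelihood is zero.
            return float("-inf")
        total += math.log(p)
    return total


def length_log_prior(doc, coll_len):
    """ASSUMPTION: stand-in prior P(d) proportional to document length.

    This is not the prior proposed in the paper; it only shows where a
    length-based prior enters the score.
    """
    if not doc:
        return float("-inf")
    return math.log(len(doc) / coll_len)


def score(query, doc, coll_tf, coll_len, lam=0.5):
    """ASSUMPTION: plain combination log P(d) + log P(q|d)."""
    return length_log_prior(doc, coll_len) + jm_log_likelihood(
        query, doc, coll_tf, coll_len, lam
    )


if __name__ == "__main__":
    docs = {
        "d1": ["a", "b", "a"],
        "d2": ["a", "b", "c", "d", "a", "b"],
    }
    coll_tf = Counter(t for d in docs.values() for t in d)
    coll_len = sum(coll_tf.values())
    for name, doc in docs.items():
        print(name,
              jm_log_likelihood(["a"], doc, coll_tf, coll_len),
              score(["a"], doc, coll_tf, coll_len))
```

In this toy collection the shorter document ranks first on the JM likelihood alone, while the longer one ranks first once the stand-in length prior is added, which shows how a length prior can change a ranking produced by a length-independent smoothing method.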

Keywords

Information Retrieval; Language Modeling; Retrieval Model; Retrieval Performance; Mean Average Precision
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Roi Blanco (1)
  • Alvaro Barreiro (1)
  1. IRLab, Computer Science Department, University of A Coruña, Spain