Frequentist and Bayesian Approach to Information Retrieval

  • Giambattista Amati
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)

Abstract

We introduce the hypergeometric models KL, DLH and DLLH using the divergence from randomness (DFR) approach, and we compare them with other relevant models of information retrieval (IR). The hypergeometric models are based on the probability of observing two probabilities: the relative within-document term frequency and the term frequency in the entire collection. They are parameter-free models of IR. Experiments show that these models perform very well on both small and very large collections. We derive their foundations from the same IR probability space used in language modelling (LM). Finally, we discuss the difference between DFR and LM. Briefly, DFR is a frequentist (Type I), or combinatorial, approach, whilst language models use a Bayesian (Type II) approach to mix the two probabilities and are thus inherently parametric.
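
The contrast between a parametric mixture and a parameter-free score can be illustrated with a minimal sketch. It is not the paper's definition of KL, DLH or DLLH, which the abstract does not give. It assumes maximum likelihood estimates for the two probabilities, uses Jelinek-Mercer interpolation as one common example of LM smoothing, and uses a single Kullback-Leibler summand as a stand-in for a parameter-free weight. All function names and the numbers are hypothetical.

```python
import math


def relative_tf(tf, doc_len):
    """Relative within-document term frequency (maximum likelihood estimate)."""
    return tf / doc_len


def collection_prob(coll_tf, coll_len):
    """Relative term frequency in the entire collection."""
    return coll_tf / coll_len


def jelinek_mercer(p_doc, p_coll, lam):
    """LM-style mixture of the two probabilities; lam is a free parameter to tune."""
    return (1.0 - lam) * p_doc + lam * p_coll


def kl_term_weight(p_doc, p_coll):
    """One Kullback-Leibler summand, in bits; needs no tuning parameter.

    Illustrative stand-in only, not the paper's KL model.
    """
    if p_doc == 0.0:
        return 0.0  # convention: 0 * log(0) = 0
    return p_doc * math.log2(p_doc / p_coll)


# Hypothetical counts: a term occurring 4 times in a 200-token document
# and 1000 times in a collection of one million tokens.
p_d = relative_tf(tf=4, doc_len=200)
p_c = collection_prob(coll_tf=1000, coll_len=1_000_000)

# The mixture changes with every choice of lam ...
mixtures = {lam: jelinek_mercer(p_d, p_c, lam) for lam in (0.1, 0.5, 0.9)}

# ... whereas the divergence-based weight is fixed by the two probabilities.
weight = kl_term_weight(p_d, p_c)
```

In this sketch the only quantity that has to be chosen by hand is `lam`; the divergence-based weight depends solely on the two observed frequencies, which is the sense in which such a score is parameter-free.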

Keywords

Maximum Likelihood Estimate; Information Retrieval; Language Modelling; Query Expansion; Informative Term
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Giambattista Amati (1)
  1. Fondazione Ugo Bordoni, Rome, Italy
