Term Impacts as Normalized Term Frequencies for BM25 Similarity Scoring

  • Vo Ngoc Anh
  • Raymond Wan
  • Alistair Moffat
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5280)

Abstract

The BM25 similarity computation has been shown to provide effective document retrieval. In operational terms, the formulae which form the basis for BM25 employ both term frequency and document length normalization. This paper considers an alternative form of normalization using document-centric impacts, and shows that the new normalization simplifies BM25 and reduces the number of tuning parameters. Motivation is provided by a preliminary analysis of a document collection that shows that impacts are more likely to identify documents whose lengths resemble those of the relevant judgments. Experiments on TREC data demonstrate that impact-based BM25 is as good as or better than the original term frequency-based BM25 in terms of retrieval effectiveness.
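
For readers unfamiliar with the baseline, the following is a minimal sketch of the classical term frequency-based BM25 weight that the abstract refers to, showing where term frequency and document length normalization enter. It is not the impact-based variant proposed in the paper. The function name, the argument names, the non-negative IDF form, and the default values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Classical BM25 weight of one query term in one document (illustrative sketch).

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document (e.g. in tokens)
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    k1, b       -- the two tuning parameters (assumed defaults)
    """
    # Inverse document frequency; the "1 +" keeps the value non-negative
    # (one common variant, assumed here).
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

    # Document length normalization: b = 0 disables it, b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)

    # Saturating term frequency component, controlled by k1.
    tf_component = (tf * (k1 + 1.0)) / (tf + k1 * length_norm)

    return idf * tf_component


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, n_docs, doc_freqs):
    """Sum the per-term weights over the query terms present in the document."""
    return sum(
        bm25_term_weight(doc_tf[t], doc_len, avg_doc_len, n_docs, doc_freqs[t])
        for t in query_terms
        if t in doc_tf and t in doc_freqs
    )
```

In this form both k1 and b must be tuned, which is the parameter burden the abstract says the impact-based normalization reduces.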



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Vo Ngoc Anh (1)
  • Raymond Wan (2)
  • Alistair Moffat (1)
  1. Department of Computer Science and Software Engineering, The University of Melbourne, Victoria, Australia
  2. Bioinformatics Center, Kyoto University, Kyoto, Japan
