A Log-Logistic Model-Based Interpretation of TF Normalization of BM25

  • Yuanhua Lv
  • ChengXiang Zhai
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7224)

Abstract

The effectiveness of the BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived from the classic probabilistic retrieval model, it has so far been unclear how to interpret k1 probabilistically, which makes it hard to optimize the setting of this parameter. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1, based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to estimating k1 based solely on the current collection, without requiring any training data, thus effectively eliminating one free parameter from BM25. Our experimental results show that the proposed approaches can accurately predict the optimal k1 without training data and achieve retrieval performance better than or comparable to that of a well-tuned BM25 in which k1 is optimized on training data.
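
As an illustration of the sub-linear TF normalization mentioned above, the following is a minimal sketch, not the authors' derivation or estimation method. It assumes the commonly used BM25 TF component (k1 + 1) * tf / (k1 + tf) with document length normalization omitted, and the standard log-logistic CDF x^shape / (scale^shape + x^shape). The function names and the value k1 = 1.2 are hypothetical choices made only for this example. With shape 1 and scale k1, the CDF equals the TF component divided by (k1 + 1), which is one way to see why a log-logistic reading of k1 is natural.

```python
def bm25_tf(tf: float, k1: float) -> float:
    """Commonly used BM25 TF component, without length normalization (assumption)."""
    return (k1 + 1.0) * tf / (k1 + tf)


def log_logistic_cdf(x: float, scale: float, shape: float = 1.0) -> float:
    """CDF of the log-logistic distribution with the given scale and shape."""
    return x ** shape / (scale ** shape + x ** shape)


K1 = 1.2  # illustrative value only, not a recommendation from the paper

# The TF component grows sub-linearly and is bounded above by k1 + 1.
for tf in (1, 2, 5, 10, 100):
    weight = bm25_tf(tf, K1)
    cdf = log_logistic_cdf(tf, scale=K1)
    # With shape = 1, weight / (k1 + 1) coincides with the log-logistic CDF.
    print(f"tf={tf:>3}  bm25_tf={weight:.3f}  scaled={weight / (K1 + 1):.3f}  cdf={cdf:.3f}")
```

Under these assumptions, the scale parameter of a log-logistic distribution is its median, so in this sketch k1 plays the role of the TF value at which the normalized weight reaches one half of its upper bound.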

Keywords

BM25; term frequency; log-logistic model; percentile term frequency normalization; automatic parameter tuning

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Yuanhua Lv (1)
  • ChengXiang Zhai (1)
  1. University of Illinois at Urbana-Champaign, Urbana, USA
