Abstract
The effectiveness of the BM25 retrieval function is mainly due to its sub-linear term frequency (TF) normalization component, which is controlled by a parameter k1. Although BM25 was derived from the classic probabilistic retrieval model, it has so far been unclear how to interpret k1 probabilistically, which makes the parameter hard to set optimally. In this paper, we provide a novel probabilistic interpretation of the BM25 TF normalization and its parameter k1, based on a log-logistic model for the probability of seeing a document in the collection with a given level of TF. The proposed interpretation allows us to derive different approaches to estimating k1 from the current collection alone, without any training data, thus effectively eliminating one free parameter from BM25. Our experimental results show that the proposed approaches can accurately predict the optimal k1 without training data and achieve retrieval performance better than or comparable to that of a well-tuned BM25 in which k1 is optimized on training data.
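The abstract gives no formulas, so the following is only a sketch under stated assumptions. It uses the commonly cited form of the BM25 TF component with document-length normalization omitted, and the standard log-logistic cumulative distribution function. The function names bm25_tf and log_logistic_cdf are hypothetical, chosen for illustration. With shape 1 and scale k1, the two expressions differ only by the constant factor (k1 + 1), which suggests how k1 could play the role of a distribution parameter. Whether this matches the paper's exact construction is not confirmed by the text above.

```python
def bm25_tf(tf: float, k1: float) -> float:
    """Sub-linear TF component of BM25.

    Assumption: document-length normalization (the b parameter) is left out,
    so only the saturation behavior controlled by k1 is shown.
    """
    return (k1 + 1.0) * tf / (k1 + tf)


def log_logistic_cdf(x: float, scale: float, shape: float = 1.0) -> float:
    """Standard log-logistic CDF for x >= 0: x^shape / (scale^shape + x^shape)."""
    xs = x ** shape
    return xs / (scale ** shape + xs)


if __name__ == "__main__":
    k1 = 1.2  # a conventional default value, used here only for illustration
    for tf in (0, 1, 2, 5, 10, 100):
        # The BM25 TF component equals (k1 + 1) times the shape-1 CDF.
        print(tf, bm25_tf(tf, k1), (k1 + 1.0) * log_logistic_cdf(tf, k1))
```

The TF component grows quickly for small counts and saturates below k1 + 1, so a larger k1 delays saturation. Under the assumed correspondence, fitting the scale of a log-logistic distribution to collection statistics would be one route to a collection-specific k1 without relevance judgments.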
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lv, Y., Zhai, C. (2012). A Log-Logistic Model-Based Interpretation of TF Normalization of BM25. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_21
DOI: https://doi.org/10.1007/978-3-642-28997-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28996-5
Online ISBN: 978-3-642-28997-2
eBook Packages: Computer Science (R0)