# Retrieval constraints and word frequency distributions: a log-logistic model for IR

## Abstract

We first present an analytical view of heuristic retrieval constraints that yields simple tests for determining whether a retrieval function satisfies them. We then review empirical findings on word frequency distributions and the central role that burstiness plays in this context. This leads us to propose a formal definition of burstiness that can be used to characterize probability distributions with respect to this phenomenon. We then introduce the family of information-based IR models, which naturally captures heuristic retrieval constraints when the underlying probability distribution is bursty, and propose a new IR model within this family, based on the log-logistic distribution. The experiments we conduct on several collections illustrate the good behavior of the log-logistic IR model: it significantly outperforms the Jelinek-Mercer and Dirichlet prior language models on most of the collections we used, with both short and long queries, and for both MAP and precision at 10 documents. It also compares favorably to BM25 and performs similarly to classical DFR models such as InL2 and PL2.

## Keywords

Retrieval constraints, Burstiness, Information retrieval theory, Log-logistic distribution

## Notes

### Acknowledgments

This research was partly supported by the Pascal-2 Network of Excellence ICT-216886-NOE and the French project Fragrances ANR-08-CORD-008. We thank the anonymous reviewers for their comments on the first version of this paper.
