Information Retrieval, Volume 11, Issue 2, pp 109–138

An analysis on document length retrieval trends in language modeling smoothing

  • David E. Losada
  • Leif Azzopardi
Abstract

Document length is widely recognized as an important factor for adjusting retrieval systems. Many models tend to favor the retrieval of either short or long documents and, thus, a length-based correction needs to be applied to avoid any length bias. In Language Modeling for Information Retrieval, smoothing methods are applied to move probability mass from document terms to unseen words, and the amount moved often depends on document length. In this article, we perform an in-depth study of this behavior, characterized by the document length retrieval trends, for three popular smoothing methods across a number of factors, and of its impact on the length of the documents retrieved and on retrieval performance. First, we theoretically analyze the Jelinek–Mercer, Dirichlet prior and two-stage smoothing strategies and then conduct an empirical analysis. In our analysis we show how Dirichlet prior smoothing caters for document length more appropriately than Jelinek–Mercer smoothing, which leads to its superior retrieval performance. In a follow-up analysis, we posit that length-based priors can be used to offset any bias in the length retrieval trends stemming from the retrieval formula derived from the smoothing technique. We show that the performance of Jelinek–Mercer smoothing can be significantly improved by using such a prior, which provides a natural and simple alternative for decoupling the query and document modeling roles of smoothing. With the analysis of retrieval behavior conducted in this article, it is possible to understand why Dirichlet prior smoothing performs better than Jelinek–Mercer smoothing, and why the performance of the Jelinek–Mercer method is improved by including a length-based prior.
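
The length dependence described in the abstract can be illustrated with the standard textbook forms of the Jelinek–Mercer and Dirichlet prior estimators. The sketch below is not taken from the article itself: the function names and the parameter values for `lam` and `mu` are illustrative assumptions, chosen only to show how the weight given to the collection model varies with document length.

```python
def jelinek_mercer(tf, doc_len, p_coll, lam=0.5):
    """Standard Jelinek-Mercer estimate of p(w|d).

    Linear interpolation with a fixed weight `lam` on the collection
    model; the weight does not depend on document length.
    (`lam=0.5` is an illustrative value, not one from the article.)
    """
    return (1.0 - lam) * tf / doc_len + lam * p_coll


def dirichlet(tf, doc_len, p_coll, mu=1000.0):
    """Standard Dirichlet prior estimate of p(w|d).

    Adds `mu` pseudo-counts distributed according to the collection
    model. (`mu=1000.0` is an illustrative value.)
    """
    return (tf + mu * p_coll) / (doc_len + mu)


def dirichlet_collection_weight(doc_len, mu=1000.0):
    """Effective weight Dirichlet smoothing puts on the collection model.

    Rewriting the Dirichlet estimate as an interpolation gives a
    collection weight of mu / (doc_len + mu), which shrinks as the
    document gets longer.
    """
    return mu / (doc_len + mu)


if __name__ == "__main__":
    p_coll = 0.01  # hypothetical collection probability of an unseen term
    for doc_len in (100, 1000, 10000):
        print(
            doc_len,
            jelinek_mercer(0, doc_len, p_coll),
            dirichlet(0, doc_len, p_coll),
            dirichlet_collection_weight(doc_len),
        )
```

With these assumed settings, the Jelinek–Mercer estimate for an unseen term is the same for every document length, whereas under Dirichlet smoothing with `mu = 1000` a 100-term document places roughly 0.91 of its weight on the collection model and a 10,000-term document roughly 0.09.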

Keywords

Language models Smoothing Document length 

Acknowledgements

The authors would like to thank Dr. Mark Baillie and the anonymous reviewers for their useful comments and suggestions, which have been incorporated into this article. David E. Losada acknowledges the support obtained from projects TIN2005-08521-C02-01 (Ministerio de Educación y Ciencia), PGIDIT06PXIC206023PN and 07SIN005206PR (Xunta de Galicia). David E. Losada is funded by a “Ramón y Cajal” research fellowship, whose funds come from the Ministerio de Educación y Ciencia and the FEDER program.

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Departamento de Electrónica y Computación, Universidad de Santiago de Compostela, Santiago, Spain
  2. Department of Computing Science, University of Glasgow, Glasgow, Scotland