An Unbiased Generative Model for Setting Dissemination Thresholds

  • Chapter
Language Modeling for Information Retrieval

Part of the book series: The Springer International Series on Information Retrieval (INRE, volume 13)

Abstract

Information filtering systems based on statistical retrieval models usually compute a numeric score that indicates how well each document matches each profile. Documents with scores above profile-specific dissemination thresholds are delivered. Optimal dissemination thresholds are usually difficult to determine a priori, so they are often learned during filtering, using relevance feedback about disseminated documents. However, the scores of disseminated documents are a biased sample of the complete distribution of document scores, which causes some algorithms to learn suboptimal thresholds.
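The sampling bias described above can be illustrated with a small simulation. The sketch below is not taken from the chapter: it assumes, purely for illustration, that scores are normally distributed, and the function name `simulate_bias` and every parameter value are hypothetical. It compares the mean of all scores with the mean of only those scores that exceed the threshold, which is all a filtering system gets feedback on.

```python
import random
from statistics import fmean


def simulate_bias(mu=0.0, sigma=1.0, threshold=0.5, n=20000, seed=0):
    """Return (mean of all scores, mean of delivered scores, number delivered).

    Assumption for illustration only: scores are drawn from N(mu, sigma).
    """
    rng = random.Random(seed)
    scores = [rng.gauss(mu, sigma) for _ in range(n)]
    # Only documents scoring above the threshold are disseminated,
    # so only their scores are available for learning.
    delivered = [s for s in scores if s > threshold]
    return fmean(scores), fmean(delivered), len(delivered)


all_mean, delivered_mean, n_delivered = simulate_bias()
print(f"mean of all scores:       {all_mean:.3f}")
print(f"mean of delivered scores: {delivered_mean:.3f} ({n_delivered} documents)")
```

Because every delivered score lies above the threshold, the second mean is necessarily larger than the threshold itself, however the full distribution is centred. An estimator that treats delivered scores as a random sample inherits this shift.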

This chapter presents a generative method of adjusting dissemination thresholds that explicitly models and compensates for this bias. The new algorithm, which is based on the Maximum Likelihood principle, jointly estimates the parameters of the density distributions for relevant and non-relevant documents and the ratio of relevant to non-relevant documents in the region around the dissemination threshold. Experiments demonstrate its effectiveness when its underlying assumptions about document scores are true, and illustrate its behavior when its assumptions don’t match the actual distribution of document scores.
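One way to see how a maximum likelihood estimate can compensate for this kind of bias is to condition the likelihood on the fact that every observed score exceeded the threshold. The sketch below is a simplified illustration, not the chapter's algorithm: it fits a single normal density (an assumption made here for the example) to truncated data by grid search, whereas the method summarized above jointly estimates densities for both relevant and non-relevant documents and their ratio. The names `truncated_log_likelihood` and `fit_truncated_normal`, the grid ranges, and all numeric values are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def truncated_log_likelihood(n, s1, s2, threshold, mu, sigma):
    """Log-likelihood of n scores, all observed above `threshold`, under N(mu, sigma).

    s1 and s2 are the sum and the sum of squares of the observed scores.
    """
    tail = 1.0 - NormalDist(mu, sigma).cdf(threshold)  # P(score > threshold)
    if tail <= 0.0:
        return -math.inf
    squared_error = s2 - 2.0 * mu * s1 + n * mu * mu
    log_density = (-n * math.log(sigma)
                   - 0.5 * n * math.log(2.0 * math.pi)
                   - squared_error / (2.0 * sigma * sigma))
    # Dividing each density by the tail mass conditions on delivery.
    return log_density - n * math.log(tail)


def fit_truncated_normal(scores, threshold):
    """Grid-search the (mu, sigma) pair that maximizes the truncated likelihood."""
    n = len(scores)
    s1 = sum(scores)
    s2 = sum(s * s for s in scores)
    best_ll, best_mu, best_sigma = -math.inf, None, None
    for i in range(201):          # mu from -1.00 to 3.00 in steps of 0.02
        mu = -1.0 + 0.02 * i
        for j in range(136):      # sigma from 0.30 to 3.00 in steps of 0.02
            sigma = 0.3 + 0.02 * j
            ll = truncated_log_likelihood(n, s1, s2, threshold, mu, sigma)
            if ll > best_ll:
                best_ll, best_mu, best_sigma = ll, mu, sigma
    return best_mu, best_sigma


# Synthetic data: the true parameters are known only because we generate them.
rng = random.Random(0)
true_mu, true_sigma, threshold = 1.0, 1.0, 0.5
all_scores = (rng.gauss(true_mu, true_sigma) for _ in range(5000))
delivered = [s for s in all_scores if s > threshold]

naive_mu = fmean(delivered)  # ignores the truncation
mu_hat, sigma_hat = fit_truncated_normal(delivered, threshold)
print(f"naive mean: {naive_mu:.2f}, truncation-aware estimate: {mu_hat:.2f}")
```

A grid search is used only to keep the example dependency-free; a gradient-based optimizer would be the more usual choice for a model with more parameters.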

Copyright information

© 2003 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Zhang, Y., Callan, J. (2003). An Unbiased Generative Model for Setting Dissemination Thresholds. In: Croft, W.B., Lafferty, J. (eds) Language Modeling for Information Retrieval. The Springer International Series on Information Retrieval, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-0171-6_9

  • DOI: https://doi.org/10.1007/978-94-017-0171-6_9

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-6263-5

  • Online ISBN: 978-94-017-0171-6

  • eBook Packages: Springer Book Archive