An Unbiased Generative Model for Setting Dissemination Thresholds

Zhang, Yi; Callan, Jamie

doi:10.1007/978-94-017-0171-6_9

Part of the book series: The Springer International Series on Information Retrieval (INRE, volume 13)

Abstract

Information filtering systems based on statistical retrieval models usually compute a numeric score that indicates how well each document matches each profile. Documents with scores above profile-specific dissemination thresholds are delivered. Optimal dissemination thresholds are usually difficult to determine a priori, so they are often learned during filtering, using relevance feedback about disseminated documents. However, the scores of disseminated documents are a biased sample of the complete distribution of document scores, which causes some algorithms to learn suboptimal thresholds.

This chapter presents a generative method of adjusting dissemination thresholds that explicitly models and compensates for this bias. The new algorithm, which is based on the Maximum Likelihood principle, jointly estimates the parameters of the density distributions for relevant and non-relevant documents and the ratio of relevant to non-relevant documents in the region around the dissemination threshold. Experiments demonstrate its effectiveness when its underlying assumptions about document scores are true, and illustrate its behavior when its assumptions don’t match the actual distribution of document scores.
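
The sampling bias described above can be illustrated with a small sketch. It is not the chapter's algorithm: it assumes a single Gaussian score distribution with a known standard deviation and fits only the mean by grid search, whereas the chapter jointly estimates both distributions and their ratio. All names (`truncated_nll`, `fit_mu`, `THRESHOLD`, and so on) and all numeric settings are hypothetical choices made for illustration.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def truncated_nll(scores, mu, sigma, threshold):
    """Negative log-likelihood of scores that are observed only when
    they exceed the threshold (a left-truncated Gaussian)."""
    tail = 1.0 - normal_cdf(threshold, mu, sigma)  # P(score > threshold)
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    nll = 0.0
    for s in scores:
        log_pdf = -0.5 * ((s - mu) / sigma) ** 2 - log_norm
        # Condition on delivery: divide the density by the tail mass.
        nll -= log_pdf - math.log(tail)
    return nll


def fit_mu(scores, sigma, threshold, grid):
    """Pick the candidate mean with the highest truncated likelihood."""
    return min(grid, key=lambda mu: truncated_nll(scores, mu, sigma, threshold))


# Assumed toy setting: scores ~ N(0, 1), threshold at 0.5.
random.seed(0)
TRUE_MU, SIGMA, THRESHOLD = 0.0, 1.0, 0.5
all_scores = [random.gauss(TRUE_MU, SIGMA) for _ in range(20000)]

# Feedback is available only for delivered documents (score > threshold).
delivered = [s for s in all_scores if s > THRESHOLD]

# Naive estimate: treats the delivered scores as an unbiased sample.
naive_mu = sum(delivered) / len(delivered)

# Bias-aware estimate: maximizes the likelihood conditioned on delivery.
grid = [i / 50.0 for i in range(-50, 51)]  # candidate means from -1.0 to 1.0
corrected_mu = fit_mu(delivered, SIGMA, THRESHOLD, grid)

print("naive mean:", naive_mu)
print("truncation-aware mean:", corrected_mu)
```

In this toy setting the naive mean necessarily lies above the threshold, because every delivered score does, while the truncation-aware estimate should land much closer to the true mean. That gap is the kind of bias the chapter's method is designed to compensate for.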


Author information

Authors and Affiliations

Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Yi Zhang & Jamie Callan

Editor information

Editors and Affiliations

Department of Computer Science, University of Massachusetts, Amherst, USA
W. Bruce Croft (Distinguished Professor)
Computer Science Department, Carnegie Mellon University, Pittsburgh, USA
John Lafferty (Associate Professor)

About this chapter

Cite this chapter

Zhang, Y., Callan, J. (2003). An Unbiased Generative Model for Setting Dissemination Thresholds. In: Croft, W.B., Lafferty, J. (eds) Language Modeling for Information Retrieval. The Springer International Series on Information Retrieval, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-0171-6_9

DOI: https://doi.org/10.1007/978-94-017-0171-6_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-6263-5
Online ISBN: 978-94-017-0171-6
eBook Packages: Springer Book Archive
