Abstract
Information filtering systems based on statistical retrieval models usually compute a numeric score that indicates how well each document matches each profile. Documents with scores above profile-specific dissemination thresholds are delivered. Optimal dissemination thresholds are usually difficult to determine a priori, so they are often learned during filtering, using relevance feedback about disseminated documents. However, the scores of disseminated documents are a biased sample of the complete distribution of document scores, which causes some algorithms to learn suboptimal thresholds.
This chapter presents a generative method of adjusting dissemination thresholds that explicitly models and compensates for this bias. The new algorithm, which is based on the Maximum Likelihood principle, jointly estimates the parameters of the density distributions for relevant and non-relevant documents and the ratio of relevant to non-relevant documents in the region around the dissemination threshold. Experiments demonstrate its effectiveness when its underlying assumptions about document scores hold, and illustrate its behavior when those assumptions do not match the actual distribution of document scores.
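The sampling bias described in the abstract can be illustrated with a small sketch. It is not the chapter's algorithm: the abstract does not state which score distributions are used, so the code assumes, purely for illustration, a single normal distribution of scores, a fixed threshold, and a coarse grid search in place of a real optimizer. The function name `truncated_normal_mle` and all parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def truncated_normal_mle(delivered, threshold):
    """Grid-search ML fit of N(mu, sigma) to scores seen only above `threshold`.

    Illustrative assumption: scores are normal. Each observation's likelihood
    is the normal density divided by the probability of clearing the threshold.
    """
    n = len(delivered)
    sum_x = sum(delivered)
    sum_xx = sum(x * x for x in delivered)
    best = (-math.inf, None, None)
    for i in range(101):          # candidate mu in [-1.0, 1.0]
        mu = -1.0 + 0.02 * i
        for j in range(51):       # candidate sigma in [0.5, 1.5]
            sigma = 0.5 + 0.02 * j
            # Probability that a score exceeds the threshold and is observed.
            tail = 1.0 - NormalDist(mu, sigma).cdf(threshold)
            if tail <= 0.0:
                continue
            # Sum of squared deviations from mu, via sufficient statistics.
            sq_err = sum_xx - 2.0 * mu * sum_x + n * mu * mu
            # Log-likelihood up to an additive constant that does not
            # depend on mu or sigma.
            log_lik = (-n * math.log(sigma)
                       - sq_err / (2.0 * sigma * sigma)
                       - n * math.log(tail))
            if log_lik > best[0]:
                best = (log_lik, mu, sigma)
    return best[1], best[2]


# Simulated scores with true mean 0 and standard deviation 1 (assumed values).
rng = random.Random(0)
threshold = 0.0
all_scores = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

# Feedback is available only for delivered documents: a biased sample.
delivered = [s for s in all_scores if s > threshold]

naive_mu = fmean(delivered)
ml_mu, ml_sigma = truncated_normal_mle(delivered, threshold)

print(f"naive mean of delivered scores: {naive_mu:.3f}")
print(f"truncation-aware ML estimate:   mu={ml_mu:.2f}, sigma={ml_sigma:.2f}")
```

Under these assumptions the naive mean overshoots the true mean of 0, because a standard normal truncated at 0 has an expected value of about 0.80. The estimate that divides each observation's likelihood by the probability of being delivered should land much closer to the true parameters. A model with separate distributions for relevant and non-relevant documents, as the abstract describes, would need a joint likelihood over both classes.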
Copyright information
© 2003 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Zhang, Y., Callan, J. (2003). An Unbiased Generative Model for Setting Dissemination Thresholds. In: Croft, W.B., Lafferty, J. (eds) Language Modeling for Information Retrieval. The Springer International Series on Information Retrieval, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-0171-6_9
DOI: https://doi.org/10.1007/978-94-017-0171-6_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-6263-5
Online ISBN: 978-94-017-0171-6
eBook Packages: Springer Book Archive