Information Retrieval, Volume 14, Issue 1, pp 47–67

Variational Bayes for modeling score distributions

  • Keshi Dai
  • Evangelos Kanoulas
  • Virgil Pavlu
  • Javed A. Aslam
The Second International Conference on the Theory of Information Retrieval (ICTIR2009)

Abstract

Empirical modeling of the score distributions associated with retrieved documents is an essential task for many retrieval applications. In this work, we propose modeling the relevant documents’ scores by a mixture of Gaussians and the non-relevant documents’ scores by a Gamma distribution. By applying Variational Bayes, we automatically trade off goodness-of-fit against the complexity of the model. We test our model on traditional retrieval functions and on actual search engines submitted to TREC, and we demonstrate its utility in inferring precision-recall curves. In all experiments, our model outperforms the dominant exponential-Gaussian model.
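As a rough illustration of how a score-distribution model of this kind can be turned into a precision-recall curve, the Python sketch below evaluates recall and precision at a score threshold, given a Gaussian mixture for relevant scores and a Gamma distribution for non-relevant scores. The function names, the parameter values, and the relevant-document fraction `rel_frac` are illustrative assumptions rather than anything taken from the paper, and the sketch does not perform the Variational Bayes fitting itself; it only shows one common way to read precision and recall off distributions that have already been fitted.

```python
import math


def gaussian_mixture_sf(t, weights, means, stds):
    """P(score > t) under a mixture of Gaussians (weights sum to 1)."""
    return sum(
        w * 0.5 * math.erfc((t - m) / (s * math.sqrt(2.0)))
        for w, m, s in zip(weights, means, stds)
    )


def gamma_cdf(x, shape, scale, max_terms=1000, tol=1e-12):
    """P(score <= x) under a Gamma distribution.

    Uses the series expansion of the regularized lower incomplete
    gamma function; adequate for moderate x / scale.
    """
    if x <= 0.0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= z / (shape + n)
        total += term
        if term < tol * total:
            break
    log_prefactor = -z + shape * math.log(z) - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def precision_recall(t, rel_frac, weights, means, stds, shape, scale):
    """Precision and recall when retrieving every document scoring above t.

    rel_frac is the assumed fraction of relevant documents in the list.
    """
    recall = gaussian_mixture_sf(t, weights, means, stds)
    fallout = 1.0 - gamma_cdf(t, shape, scale)
    retrieved = rel_frac * recall + (1.0 - rel_frac) * fallout
    # Convention: precision is 1.0 when nothing is retrieved.
    precision = rel_frac * recall / retrieved if retrieved > 0.0 else 1.0
    return precision, recall


# Hypothetical parameters, chosen only to make the example concrete.
params = dict(
    rel_frac=0.05,
    weights=[0.6, 0.4], means=[8.0, 12.0], stds=[1.5, 2.0],  # relevant
    shape=2.0, scale=1.5,                                     # non-relevant
)
thresholds = [0.5 * i for i in range(41)]  # score thresholds 0.0 .. 20.0
curve = [precision_recall(t, **params) for t in thresholds]
```

Sweeping the threshold from high to low traces the curve from low recall to high recall; in this setup the mixture weights, means, standard deviations, and Gamma parameters would come from whatever fitting procedure is used.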

Keywords

Score distributions, Gaussian mixtures, Variational inference, Recall-precision curves

Notes

Acknowledgments

We would like to thank Avi Arampatzis, Jaap Kamps and Stephen Robertson for many useful discussions. Further, we gratefully acknowledge the support provided by NSF grants IIS-0533625 and IIS-0534482 and by the European Commission, which funded parts of this research within the Accurat project under contract number FP7-ICT-248347.

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Keshi Dai (1)
  • Evangelos Kanoulas (2)
  • Virgil Pavlu (1)
  • Javed A. Aslam (1)
  1. College of Computer and Information Science, Northeastern University, Boston, USA
  2. Department of Information Studies, University of Sheffield, Sheffield, UK
