Measuring the Variability in Effectiveness of a Retrieval System

  • Mehdi Hosseini
  • Ingemar J. Cox
  • Natasa Milic-Frayling
  • Vishwa Vinay
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6107)

Abstract

A typical evaluation of a retrieval system involves computing an effectiveness metric, e.g., average precision, for each topic of a test collection and then using the average of that metric, e.g., mean average precision, to express overall effectiveness. However, averages do not capture all the important aspects of effectiveness and, used alone, may not be an informative measure of a system's effectiveness. Indeed, in addition to the average, we need to consider the variation of effectiveness across topics. We refer to this variation as the variability in effectiveness. In this paper we explore how the variance of a metric can be used as a measure of variability. We define a variability metric and illustrate how it can be used in practice.
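
To make the distinction concrete, the following Python sketch (illustrative only, not the paper's own implementation) computes both summaries for a hypothetical set of per-topic average precision (AP) scores: the mean (MAP) as the conventional overall measure, and the sample variance of AP across topics as the kind of variability measure the abstract describes.

    import statistics

    # Hypothetical per-topic average precision (AP) scores for one system
    # on a ten-topic test collection; the values are illustrative only.
    ap_scores = [0.42, 0.31, 0.77, 0.12, 0.55, 0.48, 0.60, 0.25, 0.39, 0.71]

    # Mean average precision (MAP): the usual single-number summary.
    map_score = statistics.mean(ap_scores)

    # Variability in effectiveness: the spread of the metric across topics,
    # here the sample variance and standard deviation of the AP scores.
    ap_var = statistics.variance(ap_scores)
    ap_sd = statistics.stdev(ap_scores)

    print(f"MAP = {map_score:.3f}, Var(AP) = {ap_var:.3f}, SD(AP) = {ap_sd:.3f}")

Two systems with identical MAP can differ sharply in Var(AP); the abstract's point is that this second number is needed alongside the first to characterise a system's effectiveness.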

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Mehdi Hosseini¹
  • Ingemar J. Cox¹
  • Natasa Milic-Frayling²
  • Vishwa Vinay²
  1. Computer Science Department, University College London
  2. Microsoft Research Cambridge
