Information Retrieval Journal, Volume 19, Issue 3, pp. 313–350

Test collection reliability: a study of bias and robustness to statistical assumptions via stochastic simulation

Part of the topical collection on Information Retrieval Evaluation Using Test Collections

Abstract

The number of topics that a test collection contains has a direct impact on how well the evaluation results reflect the true performance of systems. However, large collections can be prohibitively expensive, so researchers must balance reliability and cost. This issue arises both when researchers have an existing collection and want to know how much they can trust its results, and when they are building a new collection and need to know how many topics it should contain before the results can be trusted. Several measures have been proposed in the literature to quantify how accurately a collection estimates the true system scores, along with different ways to estimate the expected accuracy of hypothetical collections with a certain number of topics. These include ad hoc measures, such as the Kendall tau correlation and swap rates, and statistical measures, such as statistical power and indices from generalizability theory. Each measure focuses on different aspects of evaluation, has a different theoretical basis, and makes assumptions that are not met in practice, such as normality of distributions, homoscedasticity, uncorrelated effects and random sampling. How good these estimates are in practice therefore remains a largely open question. In this paper we first compare measures and estimators of test collection accuracy and propose unbiased statistical estimators of the Kendall tau and tau AP correlation coefficients. Second, we detail a method for the stochastic simulation of evaluation results under different statistical assumptions, which can be used in a variety of evaluation research where the true scores of systems need to be known. Third, through large-scale simulation from TREC data, we analyze the bias of a range of estimators of test collection accuracy. Fourth, we analyze the robustness of these estimators to statistical assumptions, in order to understand which aspects of an evaluation are affected by which assumptions and to guide the development of new collections and new measures. All the results in this paper are fully reproducible with data and code available online.
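To make the two rank correlation coefficients mentioned in the abstract concrete, the following is a minimal sketch, not the paper's estimators or code, of the plain Kendall tau and tau AP coefficients between the true ranking of systems and the ranking produced by a collection. The system names and rankings are made up for illustration, and ties are ignored.

```python
from itertools import combinations

def kendall_tau(truth, estimate):
    """Kendall tau between two rankings given as lists of system ids, best first.
    Assumes both rankings contain the same items and there are no ties."""
    pos_t = {s: i for i, s in enumerate(truth)}
    pos_e = {s: i for i, s in enumerate(estimate)}
    n = len(truth)
    concordant = discordant = 0
    for a, b in combinations(truth, 2):
        if (pos_t[a] - pos_t[b]) * (pos_e[a] - pos_e[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def tau_ap(truth, estimate):
    """tau AP: like Kendall tau, but swaps near the top of the estimated
    ranking are penalized more heavily."""
    pos_t = {s: i for i, s in enumerate(truth)}
    n = len(estimate)
    total = 0.0
    for i in range(1, n):                  # positions 2..n of the estimated ranking
        above = estimate[:i]               # items ranked above position i
        correct = sum(pos_t[s] < pos_t[estimate[i]] for s in above)
        total += correct / i
    return 2 * total / (n - 1) - 1

# Hypothetical example: the "true" ranking of five systems and the ranking
# obtained from a small topic subset, with one swap at the top and one at the bottom.
truth    = ["S1", "S2", "S3", "S4", "S5"]
estimate = ["S2", "S1", "S3", "S5", "S4"]
print(kendall_tau(truth, estimate))  # 0.6
print(tau_ap(truth, estimate))       # 0.375
```

tau AP discounts swaps by their position in the estimated ranking, so the S1/S2 swap at the top costs more than the S4/S5 swap at the bottom, which is why it comes out lower (0.375) than Kendall tau (0.6).

As a rough illustration of what stochastic simulation of evaluation results can look like, the toy sketch below generates per-topic effectiveness scores from a simple additive model that satisfies exactly the assumptions questioned above: normality, homoscedasticity, uncorrelated effects and random sampling. The grand mean, variance components and clipping to [0, 1] are arbitrary choices for illustration and are not the paper's simulation method.

```python
import numpy as np

rng = np.random.default_rng(42)
n_systems, n_topics = 20, 50
mu = 0.30                                        # hypothetical grand mean
sys_eff   = rng.normal(0, 0.05, n_systems)       # assumed system variance component
topic_eff = rng.normal(0, 0.10, n_topics)        # assumed topic variance component
noise = rng.normal(0, 0.08, (n_systems, n_topics))   # homoscedastic, uncorrelated error
scores = np.clip(mu + sys_eff[:, None] + topic_eff[None, :] + noise, 0, 1)

true_means     = mu + sys_eff         # the "true" system scores are known by design
observed_means = scores.mean(axis=1)  # what a 50-topic collection would report
```

Because the system effects are generated explicitly, the true mean score of every system is known by construction, so one can measure directly how well a collection with any given number of topics recovers the true ranking.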

Keywords

Information retrieval · Evaluation · Test collection · Reliability · Simulation


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Universitat Pompeu Fabra, Barcelona, Spain
