Metrics, Statistics, Tests

  • Tetsuya Sakai

Abstract

This lecture is intended to serve as an introduction to Information Retrieval (IR) effectiveness metrics and their use in test-collection-based IR experiments. Evaluation metrics are important because they are inexpensive tools for monitoring technological advances. This lecture covers a wide variety of IR metrics (except for those designed for XML retrieval, since a separate lecture is dedicated to that topic) and discusses some methods for evaluating evaluation metrics. It also briefly covers computer-based statistical significance testing. The takeaways for IR experimenters are: (1) it is important to understand the properties of IR metrics and to choose or design appropriate ones for the task at hand; (2) computer-based statistical significance tests are simple and useful, although statistical significance does not necessarily imply practical significance, and statistical insignificance does not necessarily imply practical insignificance; and (3) several methods exist for discussing which metrics are “good,” although none of them is perfect.
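
To illustrate the kind of computer-based statistical significance test mentioned above, the sketch below runs a paired sign-flip randomization test on per-topic scores of two systems. This is a minimal illustration of the general idea, not code from the lecture; the function name, the hypothetical per-topic nDCG values and the number of trials are all illustrative assumptions.

    import random

    def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
        """Two-sided paired sign-flip randomization test on per-topic scores.

        scores_a, scores_b: per-topic effectiveness scores (e.g. nDCG) for two
        systems, aligned by topic. Returns an estimated p-value for the
        observed mean difference.
        """
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        observed = abs(sum(diffs) / len(diffs))
        count = 0
        for _ in range(trials):
            # Randomly flip the sign of each per-topic difference, which amounts
            # to swapping the two systems' labels on that topic under the null
            # hypothesis of no difference.
            permuted = [d if rng.random() < 0.5 else -d for d in diffs]
            if abs(sum(permuted) / len(permuted)) >= observed:
                count += 1
        return count / trials

    # Hypothetical per-topic nDCG scores for two runs over ten topics.
    run_x = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.47, 0.50, 0.44]
    run_y = [0.40, 0.50, 0.33, 0.58, 0.45, 0.49, 0.35, 0.46, 0.48, 0.41]
    print(paired_randomization_test(run_x, run_y))

The estimated p-value is the fraction of random relabelings whose mean difference is at least as extreme as the observed one; a bootstrap test resamples topics with replacement instead of flipping signs, but serves the same purpose.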


Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Tetsuya Sakai
    1. Waseda University, Japan
