Tolerance of Effectiveness Measures to Relevance Judging Errors

  • Le Li
  • Mark D. Smucker
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)


Crowdsourcing relevance judgments for test collection construction is attractive because it can be more affordable than hiring high-quality assessors. A problem faced by all crowdsourced judgments, even those formed from the consensus of multiple workers, is that they will differ from the judgments produced by high-quality assessors. For two TREC test collections, we simulated errors in sets of judgments and then measured the effect of these errors on effectiveness measures. We found that some measures appear to be more tolerant of errors than others. We also found that achieving high rank correlation in the ranking of retrieval systems requires conservative judging for average precision (AP) and nDCG, while precision at rank 10 requires neutral judging behavior. Conservative judging avoids mistakenly judging non-relevant documents as relevant at the cost of judging some relevant documents as non-relevant. In addition, we found that while conservative judging behavior maximizes rank correlation for AP and nDCG, minimizing the error in the measures' values requires more liberal behavior. Depending on the nature of a set of crowdsourced judgments, the judgments may be more suitable for some effectiveness measures than others, and some effectiveness measures will require higher levels of judgment quality than others.
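The paper's full simulation procedure is not shown in this preview. As a rough sketch under simple assumptions (independent per-document judging errors parameterized by a true positive rate and false positive rate, in the signal-detection framing the abstract alludes to), the core computations might look like the following; all function names and the toy data are illustrative, not taken from the paper:

```python
import random

def simulate_judgments(qrels, tpr, fpr, rng):
    """Corrupt gold-standard judgments with independent errors.

    Each relevant document is kept relevant with probability `tpr`
    (true positive rate); each non-relevant document is mistakenly
    judged relevant with probability `fpr` (false positive rate).
    Conservative judging corresponds to a low fpr, liberal judging
    to a high tpr (and typically a higher fpr).
    """
    noisy = {}
    for doc, rel in qrels.items():
        if rel:
            noisy[doc] = 1 if rng.random() < tpr else 0
        else:
            noisy[doc] = 1 if rng.random() < fpr else 0
    return noisy

def precision_at_10(ranking, qrels):
    """Fraction of the top 10 ranked documents judged relevant."""
    return sum(qrels.get(d, 0) for d in ranking[:10]) / 10.0

def average_precision(ranking, qrels):
    """AP: mean of precision values at each relevant document's rank."""
    hits, total, n_rel = 0, 0.0, sum(qrels.values())
    for i, d in enumerate(ranking, start=1):
        if qrels.get(d, 0):
            hits += 1
            total += hits / i
    return total / n_rel if n_rel else 0.0

def kendall_tau(a, b):
    """Kendall's tau between two score lists over the same systems
    (pairwise concordant minus discordant pairs, normalized)."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Illustrative use: score systems under true and noisy judgments,
# then compare the two system rankings with Kendall's tau.
rng = random.Random(42)
qrels = {'d1': 1, 'd2': 0, 'd3': 1, 'd4': 0}
noisy = simulate_judgments(qrels, tpr=0.9, fpr=0.05, rng=rng)
systems = [['d1', 'd3', 'd2', 'd4'], ['d2', 'd4', 'd1', 'd3']]
true_scores = [average_precision(s, qrels) for s in systems]
noisy_scores = [average_precision(s, noisy) for s in systems]
tau = kendall_tau(true_scores, noisy_scores)
```

In this framing, a sweep over (tpr, fpr) pairs, averaged over many random seeds and topics, would show how each effectiveness measure's system ranking degrades as judging errors grow.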


Root Mean Square Error · True Positive Rate · Average Precision · Mean Average Precision · Discrimination Ability
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Le Li (1)
  • Mark D. Smucker (2)
  1. David R. Cheriton School of Computer Science, University of Waterloo, Canada
  2. Department of Management Sciences, University of Waterloo, Canada
