Information Retrieval Journal, Volume 19, Issue 3, pp 284–312

Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation

  • Thomas Demeester
  • Robin Aly
  • Djoerd Hiemstra
  • Dong Nguyen
  • Chris Develder
Part of the topical collection: Information Retrieval Evaluation Using Test Collections

Abstract

Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions about the relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor who provided the judgments. A factor that cannot be ignored when extending conclusions from assessors to users is the possible disagreement on relevance, assuming that a single gold-truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which predicts a particular result’s relevance for a random user, based on an observed assessment and on knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust, effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain, which nowadays often rely on heuristic, data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems under different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
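As a rough, hypothetical illustration of the idea described above (the full PRM estimator is defined in the body of the paper; the data, thresholds, and function names below are assumptions, not the authors' implementation), the following Python sketch estimates a grade-to-grade disagreement distribution from doubly judged documents, converts a single observed assessor grade into the probability that a random user would consider the document relevant, and plugs those probabilities in as data-driven gains for nDCG.

```python
import numpy as np

# Illustrative sketch only. We assume a simple conditional distribution
# P(other assessor's grade | observed grade), estimated from documents
# that happen to have been judged by two assessors (hypothetical data).

GRADES = [0, 1, 2]  # e.g. non-relevant, relevant, highly relevant


def estimate_disagreement_matrix(dual_judgments):
    """P(other assessor's grade | observed grade), with Laplace smoothing."""
    counts = np.ones((len(GRADES), len(GRADES)))
    for g_observed, g_other in dual_judgments:
        counts[g_observed, g_other] += 1
    return counts / counts.sum(axis=1, keepdims=True)


def predicted_relevance(observed_grade, disagreement, threshold=1):
    """Probability that a random user rates the document at or above
    `threshold`, given a single observed assessor grade."""
    return disagreement[observed_grade, threshold:].sum()


def ndcg(gains, k=10):
    """nDCG@k with the predicted relevance probabilities used as gains."""
    gains = np.asarray(gains, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float((gains * discounts).sum())
    ideal = np.sort(gains)[::-1]          # ideal reordering of the same gains
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical doubly judged documents: (observed grade, second grade).
pairs = [(2, 2), (2, 1), (1, 1), (1, 0), (0, 0), (0, 0), (2, 2), (1, 2)]
P = estimate_disagreement_matrix(pairs)

ranking_grades = [2, 0, 1, 1, 0]  # single-assessor labels for a ranked list
gains = [predicted_relevance(g, P) for g in ranking_grades]
print("predicted gains:", gains)
print("nDCG@5:", ndcg(gains, k=5))
```

The sketch only shows where data-driven gains would replace heuristic ones; in the actual model the disagreement statistics and relevance thresholds are handled more carefully (e.g., per relevance level and per collection), which is what allows the different evaluation scenarios mentioned above.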

Keywords

Information retrieval evaluation · Test collections · Graded relevance assessments for information retrieval · Assessor disagreement

Acknowledgments

First of all, we would like to thank the reviewers. Their particularly detailed comments and suggestions lifted the paper’s overall quality and coherence, and played an important role in shaping the formulation and interpretation of the PRM in its current form. We would also like to thank Dolf Trieschnigg for his work on the FedWeb13 data and the many technical discussions, Tetsuya Sakai for providing us with the NTCIR INTENT-2 data, and Bart Deygers for the valuable suggestions to improve the manuscript. This work was supported by Ghent University—iMinds in Belgium, and by the Dutch national program COMMIT and the NWO-Catch project Folktales As Classifiable Texts (FACT) in the Netherlands.


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Thomas Demeester (1)
  • Robin Aly (2)
  • Djoerd Hiemstra (2)
  • Dong Nguyen (2)
  • Chris Develder (1)

  1. Ghent University - iMinds, Ghent, Belgium
  2. University of Twente, Enschede, The Netherlands
