Dissimilarity Based Query Selection for Efficient Preference Based IR Evaluation

  • Gabriella Kazai
  • Homer Sung
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)


The evaluation of Information Retrieval (IR) systems has recently been exploring the use of preference judgments over two lists of search results, presented side-by-side to judges. Such preference judgments have been shown to capture a richer set of relevance criteria than traditional methods of collecting relevance labels per single document. However, preference judgments over lists are expensive to obtain and are less reusable as any change to either side necessitates a new judgment. In this paper, we propose a way to measure the dissimilarity between two sides in side-by-side evaluation experiments and show how this measure can be used to prioritize queries to be judged in an offline setting. Our proposed measure, referred to as Weighted Ranking Difference (WRD), takes into account both the ranking differences and the similarity of the documents across the two sides, where a document may, for example, be a URL or a query suggestion. We empirically evaluate our measure on a large-scale, real-world dataset of crowdsourced preference judgments over ranked lists of auto-completion suggestions. We show that the WRD score is indicative of the probability of tie preference judgments and can, on average, save 25% of the judging resources.


Keywords: Side-by-side evaluation · Preference judgments · Weighted ranking difference measure · Query prioritization · Judging cost reduction
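The abstract describes WRD only at a high level: it combines the rank displacement of items across the two sides with how similar the matched items are, so near-duplicate documents that merely shift position count for less than genuinely different ones. The sketch below is an illustrative reconstruction of that general idea, not the paper's actual formula; the function name `wrd` and the use of string similarity (via `difflib`) as a stand-in for document similarity are assumptions made here for demonstration.

```python
from difflib import SequenceMatcher

def wrd(left, right):
    """Illustrative weighted-ranking-difference score for two ranked lists.

    For each item on the left side, find its best-matching item on the
    right side (string similarity stands in for document similarity),
    then combine the normalised rank displacement with the residual
    dissimilarity of the matched pair. Identical lists score 0; the
    score grows as rankings diverge or contents differ. This is a
    sketch of the general idea only, not the formula from the paper.
    """
    n = max(len(left), len(right))
    total = 0.0
    for i, doc in enumerate(left):
        # Best match on the right, by simple string similarity.
        sims = [SequenceMatcher(None, doc, other).ratio() for other in right]
        j = max(range(len(right)), key=lambda k: sims[k])
        rank_diff = abs(i - j) / n   # normalised rank displacement
        dissim = 1.0 - sims[j]       # how different the matched items are
        total += rank_diff + dissim  # combine both signals
    return total / len(left)

# Identical sides produce no dissimilarity at all.
print(wrd(["a", "b", "c"], ["a", "b", "c"]))  # 0.0
```

Under this reading, a low score would signal that the two sides are so alike that judges are likely to return a tie, which is what makes such queries candidates for skipping during judging.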





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Gabriella Kazai (1, 2)
  • Homer Sung (1, 2)

  1. Microsoft, Cambridge, UK
  2. Bellevue, US
