International Symposium on String Processing and Information Retrieval

SPIRE 2015: String Processing and Information Retrieval pp 137-148

Selective Labeling and Incomplete Label Mitigation for Low-Cost Evaluation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9309)


Information retrieval evaluation relies heavily on human effort to assess the relevance of result documents. Recent years have seen good progress in reducing this effort and thus lowering the cost of evaluation. Selective labeling strategies carefully choose a subset of result documents to label, for instance, based on their aggregate rank in results; strategies to mitigate incomplete labels seek to make up for missing labels, for instance, by predicting them using machine learning methods. How different strategies interact, though, is unknown.

In this work, we study the interaction of several state-of-the-art strategies for selective labeling and incomplete label mitigation on four years of TREC Web Track data (2011–2014). Moreover, we propose and evaluate MaxRep, a novel selective labeling strategy designed to select effective training data for missing label prediction.
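As a rough illustration of selective labeling by aggregate rank, the following minimal sketch picks the documents to label under a fixed budget. The abstract does not specify how ranks are aggregated, so the sum of reciprocal ranks used here is an assumption, and the function name `select_by_aggregate_rank` and the toy runs are hypothetical. This is not the MaxRep strategy proposed in the paper.

```python
from collections import defaultdict


def select_by_aggregate_rank(runs, budget):
    """Pick `budget` documents to label, favoring those ranked highly by many systems.

    runs: list of ranked lists of document ids, one list per retrieval system.
    The aggregate score is the sum of reciprocal ranks (an assumed aggregation).
    """
    scores = defaultdict(float)
    for ranking in runs:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / rank
    # Sort by descending score; break ties by document id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:budget]


# Toy example: three systems, each returning three documents for one query.
runs = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
]
selected = select_by_aggregate_rank(runs, budget=2)
print(selected)
```

Documents left unselected remain unlabeled; a mitigation strategy would then have to account for them, for instance by predicting their labels.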





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Max Planck Institute for Informatics, Saarbrücken, Germany
