Knowledge and Information Systems, Volume 16, Issue 2, pp 173–211

Estimating average precision when judgments are incomplete

Regular Paper

Abstract

We consider the problem of evaluating retrieval systems with incomplete relevance judgments. Recently, Buckley and Voorhees showed that standard measures of retrieval performance are not robust to incomplete judgments, and they proposed a new measure, bpref, that is much more robust to incomplete judgments. Although bpref is highly correlated with average precision when the judgments are effectively complete, the value of bpref deviates from average precision and from its own value as the judgment set degrades, especially at very low levels of assessment. In this work, we propose three new evaluation measures (induced AP, subcollection AP, and inferred AP) that are equivalent to average precision when the relevance judgments are complete and that are statistical estimates of average precision when the relevance judgments are a random subset of complete judgments. We consider natural scenarios that yield highly incomplete judgments, such as random judgment sets or very shallow depth pools. We compare and contrast the robustness of the three measures proposed in this work with bpref for both of these scenarios. Through the use of TREC data, we demonstrate that these measures are more robust to incomplete relevance judgments than bpref, both in terms of how well the measures estimate average precision (as measured with complete relevance judgments) and how well they estimate themselves (as measured with complete relevance judgments). Finally, since inferred AP is the most accurate approximation to average precision and the most robust measure in the presence of incomplete judgments, we provide a detailed analysis of this measure, both in terms of its behavior in theory and its implementation in practice.
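To make the abstract's central idea concrete, the sketch below contrasts standard average precision (computed from complete judgments) with a simplified expected-precision estimate computed from a random subset of judgments, in the spirit of inferred AP. This is a minimal, assumption-laden illustration, not the authors' exact estimator or the trec_eval implementation: the function names, the toy data, and the smoothing constant `EPSILON` are hypothetical.

```python
# Illustrative sketch only: standard average precision versus a simplified
# estimate of expected precision from a random subset of the judgments.
# Names and the smoothing constant EPSILON are assumptions for illustration.

EPSILON = 1e-5  # small additive smoothing constant (assumed value)


def average_precision(ranking, relevant):
    """Standard AP: mean of precision-at-k over ranks k holding a relevant doc."""
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0


def estimated_ap_from_sample(ranking, judged_relevant, judged_nonrelevant):
    """Rough AP estimate when only a random subset of documents is judged.

    For each judged relevant document at rank k, precision at k is estimated
    from the judged documents ranked above it (with additive smoothing); the
    estimates are averaged over the judged relevant documents.
    """
    estimates = []
    for k, doc in enumerate(ranking, start=1):
        if doc not in judged_relevant:
            continue
        if k == 1:
            estimates.append(1.0)
            continue
        above = ranking[: k - 1]
        rel_above = sum(1 for d in above if d in judged_relevant)
        nonrel_above = sum(1 for d in above if d in judged_nonrelevant)
        judged_above = rel_above + nonrel_above
        # Expected precision at k: the document itself (1/k) plus the expected
        # contribution of the k-1 documents above it, estimated from the
        # judged subset with smoothing.
        frac_judged = judged_above / (k - 1)
        prec_above = (rel_above + EPSILON) / (judged_above + 2 * EPSILON)
        estimates.append(1.0 / k + ((k - 1) / k) * frac_judged * prec_above)
    return sum(estimates) / len(judged_relevant) if judged_relevant else 0.0


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
    relevant = {"d1", "d3", "d6"}            # hypothetical complete judgments
    print("AP (complete judgments):", average_precision(ranking, relevant))
    # Suppose only d1, d2, and d6 were judged (a random subset of the pool).
    print("Estimated AP (sampled judgments):",
          estimated_ap_from_sample(ranking, {"d1", "d6"}, {"d2"}))
```

The exact definition of inferred AP, its derivation as a statistical estimate under random sampling, and its practical implementation are given in the paper itself; the sketch above only illustrates the estimation idea.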

Keywords

Evaluation · Incomplete judgments · Robustness · Average precision · bpref · infAP


References

  1. Allan J (2004) HARD track overview in TREC 2004: high accuracy retrieval from documents. In: Proceedings of the 13th text REtrieval conference (TREC 2004)
  2. Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm. In: Callan J, Cormack G, Clarke C, Hawking D, Smeaton A (eds) Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 393–394
  3. Aslam JA, Pavlu V, Savell R (2003) A unified model for metasearch, pooling, and system evaluation. In: Frieder O, Hammer J, Qureshi S, Seligman L (eds) Proceedings of the 12th international conference on information and knowledge management. ACM Press, pp 484–491
  4. Aslam JA, Pavlu V, Yilmaz E (2006) A statistical method for system evaluation using incomplete judgments. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, pp 541–548
  5. Buckley C (2006) trec_eval. http://trec.nist.gov/trec_eval/trec_eval.8.1.tar.gz
  6. Buckley C, Voorhees EM (2004) Retrieval evaluation with incomplete information. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 25–32
  7. Büttcher S, Clarke C, Soboroff I (2006) The TREC 2006 terabyte track. In: Proceedings of the 15th text REtrieval conference (TREC 2006)
  8. Carterette B, Allan J, Sitaraman R (2006) Minimal test collections for retrieval evaluation. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 268–275
  9. Chen SF, Goodman J (1996) An empirical study of smoothing techniques for language modeling. In: Proceedings of the 34th annual meeting of the Association for Computational Linguistics. Morgan Kaufmann, San Francisco, pp 310–318
  10. Clarke CLA, Scholer F, Soboroff I (2005) The TREC 2005 terabyte track. In: Proceedings of the 14th text REtrieval conference (TREC 2005)
  11. Cormack GV, Palmer CR, Clarke CLA (1998) Efficient construction of large test collections. In: Croft et al. (1998), pp 282–289
  12. Croft WB, Moffat A, van Rijsbergen CJ, Wilkinson R, Zobel J (eds) (1998) Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York
  13. Harman D (1995) Overview of the third text REtrieval conference (TREC-3). In: Harman D (ed) Overview of the third text REtrieval conference (TREC-3). US Government Printing Office, Washington, DC, pp 1–19
  14. Hawking D, Robertson S (2003) On collection size and retrieval effectiveness. Info Retr 6(1): 99–105
  15. Kagolovsky Y, Moehr JR (2003) Current status of the evaluation of information retrieval. J Med Syst 27(5): 409–424
  16. Kraaij W, Over P, Smeaton A (2006) TRECVID 2006: an introduction. In: TREC video retrieval evaluation online proceedings
  17. Kukar M (2006) Quality assessment of individual classifications in machine learning and data mining. Knowl Info Syst 9(3): 364–384
  18. Raghavan V, Bollmann P, Jung GS (1989) A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans Info Syst 7(3): 205–229
  19. Tombros A, van Rijsbergen CJ (2004) Query-sensitive similarity measures for information retrieval. Knowl Info Syst 6(5): 617–642
  20. Voorhees EM (2001) Evaluation by highly relevant documents. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 74–82
  21. Voorhees EM (2002) The philosophy of information retrieval evaluation. In: CLEF '01: revised papers from the 2nd workshop of the cross-language evaluation forum on evaluation of cross-language information retrieval systems. Springer, London, pp 355–370
  22. Voorhees EM, Harman D (1999) Overview of the 7th text retrieval conference (TREC-7). In: Proceedings of the 7th text REtrieval conference (TREC-7), pp 1–24
  23. Yilmaz E, Aslam JA (2006) Estimating average precision with incomplete and imperfect judgments. In: Proceedings of the 15th ACM international conference on information and knowledge management. ACM Press, New York
  24. Zobel J (1998) How reliable are the results of large-scale retrieval experiments? In: Croft et al. (1998), pp 307–314

Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  1. Emine Yilmaz, Javed A. Aslam: College of Computer and Information Science, Northeastern University, Boston, USA