The Philosophy of Information Retrieval Evaluation

  • Conference paper
Evaluation of Cross-Language Information Retrieval Systems (CLEF 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2406)

Abstract

Evaluation conferences such as TREC, CLEF, and NTCIR are modern examples of the Cranfield evaluation paradigm. In Cranfield, researchers perform experiments on test collections to compare the relative effectiveness of different retrieval approaches. The test collections allow the researchers to control the effects of different system parameters, increasing the power and decreasing the cost of retrieval experiments as compared to user-based evaluations. This paper reviews the fundamental assumptions and appropriate uses of the Cranfield paradigm, especially as they apply in the context of the evaluation conferences.
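
In practical terms, a Cranfield-style experiment is a batch computation: each system's ranked output for a set of topics is scored against the collection's fixed relevance judgments, and systems are compared by an aggregate effectiveness measure such as mean average precision. The sketch below illustrates only that scoring step; the topic identifiers, document identifiers, and judgments are hypothetical and not drawn from the paper.

```python
# Minimal sketch of Cranfield-style batch scoring: ranked lists are evaluated
# against fixed relevance judgments (qrels) and systems are compared by
# mean average precision (MAP). All identifiers and data are hypothetical.
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list given the relevant document ids."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(run: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """MAP over all topics; `run` maps a topic id to that system's ranked doc ids."""
    return sum(average_precision(run[t], qrels.get(t, set())) for t in run) / len(run)


# Hypothetical relevance judgments and two systems' rankings for two topics.
qrels = {"t1": {"d1", "d4"}, "t2": {"d2"}}
system_a = {"t1": ["d1", "d2", "d4"], "t2": ["d2", "d3"]}
system_b = {"t1": ["d3", "d1", "d4"], "t2": ["d1", "d2"]}

print("System A MAP:", mean_average_precision(system_a, qrels))  # ~0.917
print("System B MAP:", mean_average_precision(system_b, qrels))  # ~0.542
```

Because the judgments and the measure are held fixed, the only varying factor is the retrieval approach itself, which is what lets the paradigm control system parameters at far lower cost than a user-based study.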

References

  1. Martin Braschler. CLEF 2000 - Overview of results. In Carol Peters, editor, Cross-Language Information Retrieval and Evaluation, Lecture Notes in Computer Science 2069, pages 89–101. Springer, 2001.

  2. Chris Buckley and Ellen M. Voorhees. Evaluating evaluation measure stability. In N. Belkin, P. Ingwersen, and M.K. Leong, editors, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33–40, 2000.

  3. C. W. Cleverdon. The Cranfield tests on index language devices. In Aslib Proceedings, volume 19, pages 173–192, 1967. (Reprinted in Readings in Information Retrieval, K. Sparck Jones and P. Willett, editors, Morgan Kaufmann, 1997).

  4. Cyril W. Cleverdon. The significance of the Cranfield tests on index languages. In Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 3–12, 1991.

  5. Gordon V. Cormack, Christopher R. Palmer, and Charles L.A. Clarke. Efficient construction of large test collections. In Croft et al. [6], pages 282–289.

  6. W. Bruce Croft, Alistair Moffat, C.J. van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998. ACM Press, New York.

  7. C. A. Cuadra and R. V. Katter. Opening the black box of relevance. Journal of Documentation, 23(4):291–303, 1967.

  8. Donna Harman. Overview of the fourth Text REtrieval Conference (TREC-4). In D. K. Harman, editor, Proceedings of the Fourth Text REtrieval Conference (TREC-4), pages 1–23, October 1996. NIST Special Publication 500–236.

  9. Stephen P. Harter. Variations in relevance assessments and the measurement of retrieval effectiveness. Journal of the American Society for Information Science, 47(1):37–49, 1996.

  10. William Hersh, Andrew Turpin, Susan Price, Benjamin Chan, Dale Kraemer, Lynetta Sacherek, and Daniel Olson. Do batch and user evaluations give the same results? In N. Belkin, P. Ingwersen, and M.K. Leong, editors, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 17–24, 2000.

  11. Noriko Kando, Kazuko Kuriyama, Toshihiko Nozue, Koji Eguchi, Hiroyuki Kato, and Souichiro Hidaka. Overview of IR tasks at the first NTCIR workshop. In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, pages 11–44, 1999.

  12. M.E. Lesk and G. Salton. Relevance assessments and retrieval system evaluation. Information Storage and Retrieval, 4:343–359, 1969.

  13. G. Salton, editor. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

  14. Linda Schamber. Relevance and information behavior. Annual Review of Information Science and Technology, 29:3–48, 1994.

  15. K. Sparck Jones and C. van Rijsbergen. Report on the need for and provision of an “ideal” information retrieval test collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge, 1975.

  16. Karen Sparck Jones. The Cranfield tests. In Karen Sparck Jones, editor, Information Retrieval Experiment, chapter 13, pages 256–284. Butterworths, London, 1981.

  17. Karen Sparck Jones. Information Retrieval Experiment. Butterworths, London, 1981.

  18. Karen Sparck Jones and Peter Willett. Evaluation. In Karen Sparck Jones and Peter Willett, editors, Readings in Information Retrieval, chapter 4, pages 167–174. Morgan Kaufmann, 1997.

  19. Alan Stuart. Kendall’s tau. In Samuel Kotz and Norman L. Johnson, editors, Encyclopedia of Statistical Sciences, volume 4, pages 367–369. John Wiley & Sons, 1983.

  20. M. Taube. A note on the pseudomathematics of relevance. American Documentation, 16(2):69–72, April 1965.

  21. Andrew H. Turpin and William Hersh. Why batch and user evaluations do not give the same results. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 225–231, 2001.

  22. C.J. van Rijsbergen. Information Retrieval, chapter 7. Butterworths, second edition, 1979.

  23. Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 36:697–716, 2000.

  24. Ellen M. Voorhees and Donna Harman. Overview of the eighth Text REtrieval Conference (TREC-8). In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Eighth Text REtrieval Conference (TREC-8), pages 1–24, 2000. NIST Special Publication 500–246. Electronic version available at http://trec.nist.gov/pubs.html.

  25. Ellen M. Voorhees and Donna Harman. Overview of TREC 2001. In Proceedings of TREC 2001 (Draft), 2001. To appear.

  26. Justin Zobel. How reliable are the results of large-scale information retrieval experiments? In Croft et al. [6], pages 307–314.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Voorhees, E.M. (2002). The Philosophy of Information Retrieval Evaluation. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds) Evaluation of Cross-Language Information Retrieval Systems. CLEF 2001. Lecture Notes in Computer Science, vol 2406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45691-0_34

  • DOI: https://doi.org/10.1007/3-540-45691-0_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44042-0

  • Online ISBN: 978-3-540-45691-9
