
The Evolution of Cranfield

  • Ellen M. Voorhees
Chapter
Part of The Information Retrieval Series book series (INRE, volume 41)

Abstract

Evaluating search system effectiveness is a foundational hallmark of information retrieval research. Doing so requires infrastructure appropriate for the task at hand, which generally follows the Cranfield paradigm: test collections and associated evaluation measures. A primary purpose of Information Retrieval (IR) evaluation campaigns such as the Text REtrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF) is to build this infrastructure. The first TREC collections targeted the same task as the original Cranfield tests and used measures that were familiar to test collection users of the time. But as evaluation tasks have multiplied and diversified, test collection construction techniques and evaluation measure definitions have also been forced to evolve. This chapter examines how the Cranfield paradigm has been adapted to meet the changing requirements for search systems, enabling it to continue to support a vibrant research community.
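As a concrete illustration of the kind of Cranfield-style evaluation the abstract refers to, the minimal Python sketch below (not taken from the chapter; the function name and the toy run and qrels are hypothetical) scores a single ranked result list against a set of relevance judgments using Average Precision, one classic test-collection measure.

    # Illustrative sketch only: score one system run for one topic against
    # a set of relevance judgments (qrels), Cranfield/TREC style.

    def average_precision(ranked_docs, relevant_docs):
        """Average Precision for a single topic.

        ranked_docs   -- document ids in the order the system returned them
        relevant_docs -- set of document ids judged relevant for the topic
        """
        if not relevant_docs:
            return 0.0
        hits = 0
        precision_sum = 0.0
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant_docs:
                hits += 1
                precision_sum += hits / rank  # precision at this relevant doc
        # Unretrieved relevant documents contribute zero to the sum.
        return precision_sum / len(relevant_docs)

    # Hypothetical example: a three-document run, two judged-relevant documents.
    run = ["d3", "d7", "d1"]
    qrels = {"d3", "d1"}
    print(average_precision(run, qrels))  # 0.833... = (1/1 + 2/3) / 2

In a full test-collection evaluation this per-topic score would be averaged over all topics (yielding Mean Average Precision); the chapter discusses how such measures and the judgment pools behind them have evolved.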

Copyright information

© 2019. This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply.

Authors and Affiliations

  1. National Institute of Standards and Technology, Gaithersburg, USA
