On Collection Size and Retrieval Effectiveness

Hawking, David; Robertson, Stephen

doi:10.1023/A:1022904715765

Published: January 2003

Volume 6, pages 99–105 (2003)

Journal: Information Retrieval

Abstract

The relationship between collection size and retrieval effectiveness is particularly important in the context of Web search. We investigate it first analytically and then experimentally, using samples and subsets of test collections. Different retrieval systems vary in how the score assigned to an individual document in a sample collection relates to the score it receives in the full collection; we identify four cases.

We apply signal detection (SD) theory to retrieval from samples, taking into account the four cases and using a variety of shapes for relevant and irrelevant distributions. We note that the SD model subsumes several earlier hypotheses about the causes of the decreased precision in samples. We also discuss other models which contribute to an understanding of the phenomenon, particularly relating to the effects of discreteness. Different models provide complementary insights.

We make extensive use of test data, some from official submissions to the TREC-6 VLC track and some new, to illustrate the effects and to test hypotheses. We empirically confirm predictions, based on SD theory, that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant. SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves. We plot OC curves of this type for a real retrieval system and query set and show that curves for sample collections are similar but not identical to the curve for the full collection.
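
The SD prediction that P@n declines in a sample while recall and fallout at a fixed score threshold stay roughly unchanged can be illustrated with a small simulation. The sketch below is not the authors' experimental setup. It assumes hypothetical unit-variance normal score distributions (relevant mean 2, irrelevant mean 0), a uniform 10% random sample, and the simplest of the scoring cases, in which a document receives the same score in the sample as in the full collection. All function names and parameter values are illustrative.

```python
import random


def measures(docs, n, threshold):
    """Return (P@n, recall, fallout) for a list of (score, is_relevant) pairs."""
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)
    p_at_n = sum(rel for _, rel in ranked[:n]) / n
    n_rel = sum(rel for _, rel in docs)
    n_irr = len(docs) - n_rel
    # Recall and fallout at a fixed score threshold give one point on the OC curve.
    rel_above = sum(1 for s, rel in docs if rel and s >= threshold)
    irr_above = sum(1 for s, rel in docs if not rel and s >= threshold)
    recall = rel_above / n_rel if n_rel else 0.0
    fallout = irr_above / n_irr if n_irr else 0.0
    return p_at_n, recall, fallout


def simulate_query(rng, n_rel=200, n_irr=20000, sample_rate=0.1, n=20, threshold=2.0):
    """Simulate one query on a full collection and on a random sample of it."""
    # Assumed score model: relevant ~ N(2, 1), irrelevant ~ N(0, 1).
    docs = [(rng.gauss(2.0, 1.0), True) for _ in range(n_rel)]
    docs += [(rng.gauss(0.0, 1.0), False) for _ in range(n_irr)]
    # Assumed sampling case: uniform sample, scores unchanged by sampling.
    sample = [d for d in docs if rng.random() < sample_rate]
    return measures(docs, n, threshold), measures(sample, n, threshold)


def run(trials=20, seed=0):
    """Average (P@n, recall, fallout) over several simulated queries."""
    rng = random.Random(seed)
    full_avg = [0.0, 0.0, 0.0]
    sample_avg = [0.0, 0.0, 0.0]
    for _ in range(trials):
        full, sample = simulate_query(rng)
        for i in range(3):
            full_avg[i] += full[i] / trials
            sample_avg[i] += sample[i] / trials
    return tuple(full_avg), tuple(sample_avg)


if __name__ == "__main__":
    full, sample = run()
    print("full collection   P@20=%.2f recall=%.2f fallout=%.4f" % full)
    print("10%% sample        P@20=%.2f recall=%.2f fallout=%.4f" % sample)
```

Under these assumptions the sample has fewer relevant documents in the high-score tail, so the top n reaches further down the score range and P@n drops, while the fraction of each class above a fixed threshold is unaffected by uniform sampling apart from noise.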

Author information

Authors and Affiliations

CSIRO Mathematical and Information Sciences, Canberra, Australia
David Hawking
Microsoft Research, Cambridge, UK
Stephen Robertson

About this article

Cite this article

Hawking, D., Robertson, S. On Collection Size and Retrieval Effectiveness. Information Retrieval 6, 99–105 (2003). https://doi.org/10.1023/A:1022904715765
