Abstract
Due to the popularity of Web search engines, a large proportion of real text retrieval queries are now processed over collections measured in tens or hundreds of gigabytes. A new Very Large test Collection (VLC) has been created to support qualification, measurement and comparison of systems operating at this level and to permit the study of the properties of very large collections. The VLC is an extension of the well-known TREC collection and has been distributed under the same conditions. A simple set of efficiency and effectiveness measures has been defined to encourage comparability of reporting. The 20 gigabyte first edition of the VLC and a representative 10% sample have been used in a special interest track of the 1997 Text Retrieval Conference (TREC-6). The unaffordable cost of obtaining complete relevance assessments over collections of this scale is avoided by concentrating on early precision and relying on the core TREC collection to support detailed effectiveness studies. Results obtained by TREC-6 VLC track participants are presented here. All groups observed a significant increase in early precision as collection size increased. Explanatory hypotheses are advanced for future empirical testing. A 100 gigabyte second edition VLC (VLC2) has recently been compiled and distributed for use in TREC-7 in 1998.
Cite this article
Hawking, D., Thistlewaite, P. & Harman, D. Scaling Up the TREC Collection. Information Retrieval 1, 115–137 (1999). https://doi.org/10.1023/A:1009938405269
DOI: https://doi.org/10.1023/A:1009938405269