Information Retrieval, Volume 1, Issue 1–2, pp 115–137

Scaling Up the TREC Collection

  • David Hawking
  • Paul Thistlewaite
  • Donna Harman

Abstract

Due to the popularity of Web search engines, a large proportion of real text retrieval queries are now processed over collections measured in tens or hundreds of gigabytes. A new Very Large test Collection (VLC) has been created to support qualification, measurement and comparison of systems operating at this level and to permit the study of the properties of very large collections. The VLC is an extension of the well-known TREC collection and has been distributed under the same conditions. A simple set of efficiency and effectiveness measures has been defined to encourage comparability of reporting. The 20 gigabyte first edition of the VLC and a representative 10% sample have been used in a special interest track of the 1997 Text Retrieval Conference (TREC-6). The unaffordable cost of obtaining complete relevance assessments over collections of this scale is avoided by concentrating on early precision and relying on the core TREC collection to support detailed effectiveness studies. Results obtained by TREC-6 VLC track participants are presented here. All groups observed a significant increase in early precision as collection size increased. Explanatory hypotheses are advanced for future empirical testing. A 100 gigabyte second edition VLC (VLC2) has recently been compiled and distributed for use in TREC-7 in 1998.

Keywords: test collection, very large databases, text retrieval
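
The early precision mentioned in the abstract can be illustrated as precision over the first k ranked documents. The following is a minimal sketch and not the track's scoring code: the function name, the document identifiers and the default cutoff k=20 are assumptions made for illustration. It shows why the approach keeps assessment affordable: only the top k documents of each run need a relevance judgment, however large the collection is.

```python
from typing import Sequence, Set


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Return the fraction of the top-k ranked documents judged relevant.

    `ranking` is a system's ranked list of document ids (best first) and
    `relevant` is the set of ids an assessor judged relevant. Both are
    hypothetical inputs; the cutoff k=20 is an assumed default.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranking[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divide by k rather than len(top_k), so a run that returns fewer
    # than k documents is not rewarded for its brevity.
    return hits / k


if __name__ == "__main__":
    # Made-up ranking and judgments, for illustration only.
    run = ["d7", "d2", "d9", "d4"]
    judged_relevant = {"d2", "d4", "d5"}
    print(precision_at_k(run, judged_relevant, k=4))
```

Computing this value for the same queries on the 10% sample and on the full collection gives the kind of comparison the abstract describes.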

Copyright information

© Kluwer Academic Publishers 1999
