
Searchable words on the Web

  • Regular contribution
International Journal on Digital Libraries

Abstract

In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
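As an illustration of the vocabulary accumulation described above, the following is a minimal sketch, not the authors' implementation. The word definition used here (a case-folded run of ASCII letters) is an assumption made for the example; the definitions proposed in the paper are not given in this abstract. The helper names `accumulate_vocabulary` and `occurrences_per_new_word` are likewise hypothetical.

```python
import re
from typing import Iterable, Set, Tuple

# Assumed word definition for illustration only: a run of ASCII letters,
# folded to lower case. The paper's own definitions may differ.
WORD_RE = re.compile(r"[A-Za-z]+")


def accumulate_vocabulary(documents: Iterable[str]) -> Tuple[Set[str], int]:
    """Return the set of distinct words and the total number of word occurrences."""
    vocabulary: Set[str] = set()
    occurrences = 0
    for text in documents:
        for match in WORD_RE.finditer(text):
            occurrences += 1
            # A set insert is a no-op for words that have already been seen.
            vocabulary.add(match.group().lower())
    return vocabulary, occurrences


def occurrences_per_new_word(distinct: int, occurrences: int) -> float:
    """Average number of word occurrences inspected per distinct word found."""
    return occurrences / distinct


if __name__ == "__main__":
    vocab, total = accumulate_vocabulary(["The cat sat", "the dog sat down"])
    print(len(vocab), total)
    # Figures quoted in the abstract: about two billion occurrences,
    # just over 9.74 million distinct words.
    print(occurrences_per_new_word(9_740_000, 2_000_000_000))
```

With the rounded figures from the abstract, the ratio comes out at a little over 200 occurrences per distinct word, consistent with the stated "1 word in 200 was new". A Python set stands in here for the main-memory structures the abstract refers to; at the scale studied, the choice of structure matters for both space and speed.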

Author information

Corresponding author

Correspondence to Hugh E. Williams.

About this article

Cite this article

Williams, H., Zobel, J. Searchable words on the Web. Int J Digit Libr 5, 99–105 (2005). https://doi.org/10.1007/s00799-003-0050-z

  • DOI: https://doi.org/10.1007/s00799-003-0050-z
