Abstract
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
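The measurement described above, counting distinct words and tracking how often a new one appears as more text is inspected, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the word definition used here (a case-folded run of ASCII letters, at most 32 characters long) and the names `tokenize`, `vocabulary_growth`, and `MAX_LEN` are assumptions chosen for the example.

```python
import re
from typing import Iterable, List, Tuple

# Assumed word definition (hypothetical, not taken from the paper):
# a maximal run of ASCII letters, case-folded, at most MAX_LEN characters.
WORD_RE = re.compile(r"[A-Za-z]+")
MAX_LEN = 32


def tokenize(text: str) -> List[str]:
    """Return the word occurrences in text under the assumed definition."""
    return [w.lower() for w in WORD_RE.findall(text) if len(w) <= MAX_LEN]


def vocabulary_growth(
    docs: Iterable[str], step: int = 1000
) -> Tuple[List[Tuple[int, int]], int, int]:
    """Accumulate a vocabulary over docs.

    Returns (samples, occurrences, distinct), where samples holds
    (occurrences seen so far, distinct words so far) every `step` occurrences.
    """
    vocab = set()
    occurrences = 0
    samples = []
    for doc in docs:
        for word in tokenize(doc):
            occurrences += 1
            vocab.add(word)  # a set ignores words that are already present
            if occurrences % step == 0:
                samples.append((occurrences, len(vocab)))
    return samples, occurrences, len(vocab)


docs = ["the cat sat on the mat", "the dog sat on the log"]
samples, occurrences, distinct = vocabulary_growth(docs, step=4)
print(samples)                 # vocabulary size after every 4 occurrences
print(occurrences / distinct)  # average occurrences per new word
```

Applying the same ratio to the figures reported above, about two billion occurrences and 9.74 million distinct words give roughly 205 occurrences per new word, consistent with the stated "1 word in 200". Tightening the pattern or lowering `MAX_LEN` is one way to experiment with stricter word definitions.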
Cite this article
Williams, H., Zobel, J. Searchable words on the Web. Int J Digit Libr 5, 99–105 (2005). https://doi.org/10.1007/s00799-003-0050-z
DOI: https://doi.org/10.1007/s00799-003-0050-z