Searchable words on the Web
Abstract
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
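The vocabulary accumulation described above can be illustrated with a minimal sketch. It is not the authors' method: the word definition used here (maximal runs of ASCII letters, case-folded) is only one assumed possibility, a plain Python dictionary stands in for whatever main-memory structure a real indexer would use, and the names `WORD_RE`, `accumulate_vocabulary`, and `occurrences_per_new_word` are hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed word definition (not taken from the paper): a maximal run of
# ASCII letters, folded to lower case.
WORD_RE = re.compile(r"[A-Za-z]+")


def accumulate_vocabulary(documents: Iterable[str]) -> Tuple[Dict[str, int], int]:
    """Return (word -> occurrence count, total word occurrences)."""
    vocabulary: Dict[str, int] = {}
    occurrences = 0
    for text in documents:
        for match in WORD_RE.finditer(text):
            word = match.group().lower()
            occurrences += 1
            # A word that is absent from the dictionary is a new word.
            vocabulary[word] = vocabulary.get(word, 0) + 1
    return vocabulary, occurrences


def occurrences_per_new_word(distinct_words: int, occurrences: int) -> float:
    """Average number of word occurrences inspected per distinct word found."""
    return occurrences / distinct_words


if __name__ == "__main__":
    docs = ["The cat sat on the mat.", "The dog sat."]
    vocab, total = accumulate_vocabulary(docs)
    print(len(vocab), "distinct words in", total, "occurrences")

    # Rounded figures quoted in the abstract: about two billion occurrences,
    # just over 9.74 million distinct words.
    print(round(occurrences_per_new_word(9_740_000, 2_000_000_000)))
```

With the rounded figures from the abstract, the ratio comes to roughly one new word per 205 occurrences, which is consistent with the stated "1 word in 200". Swapping in a stricter pattern for `WORD_RE` is one way to experiment with alternative word definitions.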
Keywords
Web search, Terms, Word occurrences, Indexing