International Journal on Digital Libraries, Volume 5, Issue 2, pp 99–105

Searchable words on the Web

Regular contribution

Abstract

In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
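As a rough illustration of the vocabulary accumulation described above, the following minimal sketch counts word occurrences and distinct words in a small collection. The word definition used here (maximal runs of ASCII letters, lowercased, with an optional length cap standing in for a stricter definition) is an assumption made for the example and is not one of the definitions proposed in the paper; the function name `accumulate_vocabulary` and the parameter `max_len` are likewise hypothetical.

```python
import re

# Assumed word definition for this sketch only: a maximal run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")


def accumulate_vocabulary(documents, max_len=None):
    """Return (vocabulary, occurrences) for an iterable of document strings.

    vocabulary  -- set of distinct lowercased words seen so far
    occurrences -- total number of word occurrences that were counted
    max_len     -- optional cap on word length, a stand-in for a
                   stricter definition of what counts as a word
    """
    vocabulary = set()
    occurrences = 0
    for text in documents:
        for match in WORD_RE.finditer(text):
            word = match.group().lower()
            if max_len is not None and len(word) > max_len:
                continue  # rejected by the stricter definition
            occurrences += 1
            vocabulary.add(word)  # in-memory set plays the role of the hash table
    return vocabulary, occurrences


if __name__ == "__main__":
    docs = ["The cat sat on the mat", "The dog sat on the log"]
    vocab, total = accumulate_vocabulary(docs)
    # Ratio of occurrences to distinct words: "1 word in N was new".
    print(len(vocab), total, total / len(vocab))
```

Applied to the figures reported in the abstract, the same ratio is about 2,000,000,000 / 9,740,000 ≈ 205 occurrences per distinct word, which matches the stated rate of roughly 1 new word in 200.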

Keywords

Web search, Terms, Word occurrences, Indexing

Copyright information

© Springer-Verlag 2005

Authors and Affiliations

  1. Department of Computer Science, RMIT University, Melbourne, Australia
