Searchable words on the Web

Williams, Hugh E.; Zobel, Justin

doi:10.1007/s00799-003-0050-z

Regular contribution
Published: 01 April 2005

Volume 5, pages 99–105 (2005)
International Journal on Digital Libraries

Abstract

In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
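
The vocabulary accumulation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the tokenizer `WORD_RE` (case-folded runs of ASCII letters), the `max_len` cutoff standing in for a stricter definition of a word, and the function name `accumulate_vocabulary` are all hypothetical choices made for this example.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed word definition, for illustration only: a maximal run of ASCII
# letters, case-folded. The abstract does not state the paper's definitions.
WORD_RE = re.compile(r"[A-Za-z]+")


def accumulate_vocabulary(
    documents: Iterable[str], max_len: Optional[int] = None
) -> Tuple[int, int]:
    """Return (word occurrences, distinct words) over a stream of documents.

    max_len is a hypothetical stricter rule: longer words are ignored.
    """
    vocabulary = set()   # in-memory structure holding every distinct word
    occurrences = 0
    for text in documents:
        for match in WORD_RE.finditer(text):
            word = match.group().lower()
            if max_len is not None and len(word) > max_len:
                continue  # rejected by the stricter definition
            occurrences += 1
            vocabulary.add(word)  # no effect unless the word is new
    return occurrences, len(vocabulary)


docs = ["The cat sat on the mat", "A cat sat on a hat"]
occurrences, distinct = accumulate_vocabulary(docs)
print(occurrences, distinct, occurrences / distinct)
```

Dividing occurrences by distinct words gives the average number of occurrences per new word. Applied to the figures reported above, about two billion occurrences and 9.74 million distinct words give roughly one new word per 205 occurrences, in line with the stated 1 in 200.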

Author information

Authors and Affiliations

Department of Computer Science, RMIT University, GPO Box 2476V, Melbourne 3001, Australia
Hugh E. Williams & Justin Zobel

Corresponding author

Correspondence to Hugh E. Williams.

About this article

Cite this article

Williams, H., Zobel, J. Searchable words on the Web. Int J Digit Libr 5, 99–105 (2005). https://doi.org/10.1007/s00799-003-0050-z

Issue Date: April 2005
