The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper shows how to compute many statistics of interest for all substrings in a large corpus. The method computes not only the standard corpus frequency, freq, and document frequency, df, but also generalizes naturally to df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy and adaptation.
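To make the definitions concrete, here is a minimal Python sketch of freq, df and df_k for all substrings up to a fixed length. It is a brute-force illustration for toy inputs, not the method the paper develops for large corpora. The function names, the `max_len` and `max_k` cutoffs, and the reading of adaptation as df_2/df_1 are assumptions made for this example.

```python
from collections import Counter
import math


def substring_counts(doc, max_len):
    """Count every substring of length 1..max_len in a single document."""
    counts = Counter()
    for i in range(len(doc)):
        for j in range(i + 1, min(i + max_len, len(doc)) + 1):
            counts[doc[i:j]] += 1
    return counts


def corpus_stats(docs, max_len=3, max_k=3):
    """Return (freq, df) where df[s][k] is the number of documents
    that contain substring s at least k times (k = 1..max_k)."""
    freq = Counter()
    df = {}
    for doc in docs:
        for s, c in substring_counts(doc, max_len).items():
            freq[s] += c
            row = df.setdefault(s, [0] * (max_k + 1))  # index 0 is unused
            for k in range(1, min(c, max_k) + 1):
                row[k] += 1
    return freq, df


def summarize(s, freq, df, n_docs):
    """Estimate the per-document count distribution of s from df_k."""
    dfk = df[s]
    max_k = len(dfk) - 1
    # P(count == k) = (df_k - df_{k+1}) / N; the last bucket means "max_k or more".
    probs = [1 - dfk[1] / n_docs]
    for k in range(1, max_k):
        probs.append((dfk[k] - dfk[k + 1]) / n_docs)
    probs.append(dfk[max_k] / n_docs)
    mean = freq[s] / n_docs  # average occurrences per document
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    # Assumption: adaptation taken as P(count >= 2 | count >= 1) = df_2 / df_1.
    adaptation = dfk[2] / dfk[1] if max_k >= 2 and dfk[1] else 0.0
    return {"probs": probs, "mean": mean, "entropy": entropy,
            "adaptation": adaptation}


# Toy corpus of three "documents" (hypothetical data).
docs = ["abab", "ab", "ba"]
freq, df = corpus_stats(docs, max_len=2, max_k=3)
print(freq["ab"], df["ab"][1:], summarize("ab", freq, df, len(docs)))
```

The sketch enumerates every substring explicitly, so its cost grows with document length times `max_len`; it is meant only to pin down what each statistic counts.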


Keywords: Binary Search; Class Tree; Document Frequency; Substring Statistics; Concordance Line





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Kyoji Umemura (1)
  • Kenneth Church (2)
  1. Toyohashi University of Technology, Tempaku, Toyohashi, Japan
  2. Microsoft, One Microsoft Way, Redmond, USA
