CICLing 2009: Computational Linguistics and Intelligent Text Processing, pp. 53-71
Substring Statistics
Abstract
The goal of this work is to make it practical to compute corpus-based statistics for all substrings (ngrams). Anything you can do with words, we ought to be able to do with substrings. This paper shows how to compute many statistics of interest for all substrings in a large corpus. The method not only computes the standard corpus frequency, freq, and document frequency, df, but also generalizes naturally to df_k(str), the number of documents that mention the substring str at least k times. df_k can be used to estimate the probability distribution of str across documents, as well as summary statistics of this distribution, e.g., mean, variance (and other moments), entropy, and adaptation.
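To make the definitions in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the paper's method: it enumerates every substring of every document, which is quadratic per document and only workable on toy inputs. The function names (`per_document_counts`, `freq`, `df_k`, `count_distribution`, `adaptation`) are hypothetical, occurrences are counted with overlaps, and two choices are assumptions made for illustration: the probability that str occurs exactly k times in a document is estimated as (df_k - df_{k+1}) / N, where N is the number of documents, and adaptation is estimated as df_2 / df_1.

```python
from collections import Counter, defaultdict


def substring_counts(doc):
    """Count every non-empty substring of doc, overlaps included (naive)."""
    counts = Counter()
    for i in range(len(doc)):
        for j in range(i + 1, len(doc) + 1):
            counts[doc[i:j]] += 1
    return counts


def per_document_counts(docs):
    """Map each substring to the list of its counts in the documents containing it."""
    table = defaultdict(list)
    for doc in docs:
        for s, c in substring_counts(doc).items():
            table[s].append(c)
    return table


def freq(table, s):
    """Corpus frequency: total occurrences of s over all documents."""
    return sum(table.get(s, []))


def df_k(table, s, k=1):
    """Number of documents that contain s at least k times (k=1 is plain df)."""
    return sum(1 for c in table.get(s, []) if c >= k)


def count_distribution(table, s, n_docs):
    """Assumed estimate: P(exactly k occurrences) = (df_k - df_{k+1}) / N."""
    dist = {}
    k, current = 0, n_docs  # df_0 = N: every document has at least 0 occurrences
    while current > 0:
        following = df_k(table, s, k + 1)
        dist[k] = (current - following) / n_docs
        k, current = k + 1, following
    return dist


def adaptation(table, s):
    """Assumed estimate: chance of a second mention given a first, df_2 / df_1."""
    d1 = df_k(table, s, 1)
    return df_k(table, s, 2) / d1 if d1 else 0.0


docs = ["abab", "ab", "ba"]
table = per_document_counts(docs)
print(freq(table, "ab"), df_k(table, "ab", 1), df_k(table, "ab", 2))  # 3 2 1
print(adaptation(table, "ab"))  # 0.5
```

In this toy corpus the mean of the estimated distribution for "ab" equals freq / N = 3 / 3 = 1, which is a quick sanity check that the df_k differences account for every occurrence. Scaling the same statistics to a large corpus needs an index over substrings rather than explicit enumeration.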
Keywords
Binary Search, Class Tree, Document Frequency, Substring Statistics, Concordance Line