Abstract
Low-frequency words tend to be rich in content, and high-frequency words tend to carry little. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like "somewhat" and "boycott". Both "somewhat" and "boycott" appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but "boycott" is a better keyword because its IDF is farther from the value that would be expected by chance under a Poisson model.
This work was accomplished at AT&T.
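The comparison described in the abstract can be illustrated with a short sketch. It assumes the usual definition IDF = -log2(df / D), where df is the number of documents containing the word and D is the number of documents in the collection. It also assumes that "expected by chance" means a Poisson model whose rate is the word's total count divided by D, so that the chance of a document containing the word at least once is 1 - exp(-rate). The function names and the counts in the usage lines are hypothetical and are not figures from the Associated Press corpus.

```python
import math


def observed_idf(doc_freq, num_docs):
    """IDF computed from the observed document frequency: -log2(df / D)."""
    return -math.log2(doc_freq / num_docs)


def poisson_idf(coll_freq, num_docs):
    """IDF predicted if occurrences were scattered by a Poisson process.

    The rate is the average number of occurrences per document. Under
    Poisson, P(at least one occurrence) = 1 - exp(-rate).
    """
    rate = coll_freq / num_docs
    p_at_least_once = -math.expm1(-rate)  # numerically stable 1 - exp(-rate)
    return -math.log2(p_at_least_once)


def idf_deviation(coll_freq, doc_freq, num_docs):
    """Observed IDF minus Poisson-predicted IDF (larger = more clumped)."""
    return observed_idf(doc_freq, num_docs) - poisson_idf(coll_freq, num_docs)


# Hypothetical counts, for illustration only: two words with the same
# total frequency, one spread thinly and one concentrated in fewer documents.
NUM_DOCS = 100_000
spread_out = idf_deviation(coll_freq=1000, doc_freq=950, num_docs=NUM_DOCS)
clumped = idf_deviation(coll_freq=1000, doc_freq=500, num_docs=NUM_DOCS)
print(f"spread-out word deviation: {spread_out:.3f}")
print(f"clumped word deviation:    {clumped:.3f}")
```

With equal total counts, the word concentrated in fewer documents shows the larger deviation from the Poisson prediction, which is the sense in which a word like "boycott" would be the better keyword.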
Copyright information
© 1999 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Church, K., Gale, W. (1999). Inverse Document Frequency (IDF): A Measure of Deviations from Poisson. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol 11. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2390-9_18
DOI: https://doi.org/10.1007/978-94-017-2390-9_18
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5349-7
Online ISBN: 978-94-017-2390-9
eBook Packages: Springer Book Archive