Counting Words

Language Processing with Perl and Prolog

Part of the book series: Cognitive Technologies (COGTECH)

Abstract

We saw in Chap. 2 that words have specific contexts of use. Pairs of words like strong and tea or powerful and computer are not random associations but the result of a preference. A native speaker will use them naturally, while a learner will have to learn them from books – dictionaries – where they are explicitly listed. Similarly, the words rider and writer sound much alike in American English, but they are likely to occur with different surrounding words. Hence, hearing an ambiguous phonetic sequence, a listener will discard the improbable rider of books or writer of horses and prefer writer of books or rider of horses (Church and Mercer 1993).
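
The chapter's starting point is that such preferences show up directly in corpus counts. As a minimal sketch of the idea (not the author's program), the following Perl script counts word bigrams in a plain-text corpus; the file name corpus.txt and the crude tokenization are assumptions made for the example. Frequent pairs such as strong tea then surface at the top of the list.

    # count_bigrams.pl -- sketch: count word bigram frequencies in a corpus.
    # The corpus file name and the tokenization are illustrative assumptions.
    use strict;
    use warnings;

    my %bigram_count;
    my $previous;

    open my $fh, '<', 'corpus.txt' or die "Cannot open corpus.txt: $!";
    while (my $line = <$fh>) {
        # Crude tokenization: lowercased sequences of letters only
        foreach my $word (map { lc } $line =~ /([a-zA-Z]+)/g) {
            $bigram_count{"$previous $word"}++ if defined $previous;
            $previous = $word;
        }
    }
    close $fh;

    # Print the 20 most frequent bigrams
    my @bigrams = sort { $bigram_count{$b} <=> $bigram_count{$a} } keys %bigram_count;
    foreach my $bigram (@bigrams[0 .. 19]) {
        last unless defined $bigram;
        printf "%-30s %d\n", $bigram, $bigram_count{$bigram};
    }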

Notes

  1. Some authors now use the term pointwise mutual information to mean mutual information. Neither Fano (1961) nor Church and Hanks (1990) used this term, and we kept the original one.
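
For reference, the association score that Church and Hanks (1990) simply call mutual information compares the joint probability of a word pair with what independence would predict: I(x, y) = log2( P(x, y) / (P(x) P(y)) ), with the probabilities estimated from corpus counts. A small Perl sketch follows; the counts in the example call are invented purely for illustration.

    # mutual_information.pl -- sketch of the Church and Hanks (1990) score
    # I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from counts over
    # a corpus of N words. The counts below are invented for illustration.
    use strict;
    use warnings;

    sub mutual_information {
        my ($count_xy, $count_x, $count_y, $n) = @_;
        my $p_xy = $count_xy / $n;
        my $p_x  = $count_x  / $n;
        my $p_y  = $count_y  / $n;
        return log($p_xy / ($p_x * $p_y)) / log(2);    # convert to base 2
    }

    # Hypothetical counts for the pair (strong, tea) in a 1,000,000-word corpus
    printf "I(strong, tea) = %.2f\n", mutual_information(30, 500, 200, 1_000_000);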

References

  • Apache OpenNLP Development Community. (2012). Apache OpenNLP developer documentation. The Apache Software Foundation, 1.5.2 ed.

  • Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search (2nd ed.). New York: Addison-Wesley.

  • Bentley, J., Knuth, D., & McIlroy, D. (1986). Programming pearls. Communications of the ACM, 29(6), 471–483.

  • Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague (pp. 858–867).

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), 107–117. Proceedings of WWW7.

  • Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C., & Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–489.

  • Chen, S. F., & Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Technical report TR-10-98, Harvard University, Cambridge, MA.

  • Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.

  • Church, K. W., & Mercer, R. L. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1), 1–24.

  • Clarkson, P. R., & Rosenfeld, R. (1997). Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings ESCA Eurospeech, Rhodes.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Fano, R. M. (1961). Transmission of information: A statistical theory of communications. New York: MIT.

  • Francis, W. N., & Kucera, H. (1982). Frequency analysis of English usage. Boston: Houghton Mifflin.

  • Franz, A., & Brants, T. (2006). All our n-gram are belong to you. Retrieved November 7, 2013, from http://googleresearch.blogspot.se/2006/08/all-our-n-gram-are-belong-to-you.html

  • Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(16), 237–264.

  • Grefenstette, G., & Tapanainen, P. (1994). What is a word, what is a sentence? Problems of tokenization. MLTT technical report 4, Xerox.

  • Huang, J., Gao, J., Miao, J., Li, X., Wang, K., & Behr, F. (2010). Exploring web scale language models for search query processing. In Proceedings of the 19th international World Wide Web conference, Raleigh (pp. 451–460).

  • Jelinek, F. (1990). Self-organized language modeling for speech recognition. In A. Waibel & K.-F. Lee (Eds.), Readings in speech recognition. San Mateo: Morgan Kaufmann. Reprinted from an IBM report, 1985.

  • Jelinek, F. (1997). Statistical methods for speech recognition. Cambridge, MA: MIT.

  • Jelinek, F., & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema & L. N. Kanal (Eds.), Pattern recognition in practice (pp. 381–397). Amsterdam: North-Holland.

  • Joachims, T. (2002). Learning to classify text using support vector machines. Boston: Kluwer Academic.

  • Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400–401.

  • Kiss, T., & Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), 485–525.

  • Laplace, P. (1820). Théorie analytique des probabilités (3rd ed.). Paris: Courcier.

  • Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.

  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.

  • Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT.

  • Marcus, M., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

  • Mauldin, M. L., & Leavitt, J. R. R. (1994). Web-agent related research at the center for machine translation. In Proceedings of the ACM SIG on networked information discovery and retrieval, McLean.

  • Mikheev, A. (2002). Periods, capitalized words, etc. Computational Linguistics, 28(3), 289–318.

  • Orwell, G. (1949). Nineteen eighty-four. London: Secker and Warburg.

  • Palmer, H. E. (1933). Second interim report on English collocations, submitted to the tenth annual conference of English teachers under the auspices of the institute for research in English teaching. Tokyo: Institute for Research in English Teaching.

  • Reynar, J. C. (1998). Topic segmentation: Algorithms and applications. PhD thesis, University of Pennsylvania, Philadelphia.

  • Reynar, J. C., & Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the fifth conference on applied natural language processing, Washington, DC (pp. 16–19).

  • Salton, G. (1988). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading: Addison-Wesley.

  • Salton, G., & Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical report TR87-881, Department of Computer Science, Cornell University, Ithaca.

  • Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. In Proceedings of international conference spoken language processing, Denver.

  • Wang, K., Thrasher, C., Viegas, E., Li, X., & Hsu, B. (2010). An overview of Microsoft web n-gram corpus and applications. In Proceedings of the NAACL HLT 2010: Demonstration session, Los Angeles (pp. 45–48).

  • Yu, H.-F., Ho, C.-H., Juan, Y.-C., & Lin, C.-J. (2013). LibShortText: A library for short-text classification and analysis. Retrieved November 1, 2013, from http://www.csie.ntu.edu.tw/~cjlin/libshorttext

  • Zaragoza, H., Craswell, N., Taylor, M., Saria, S., & Robertson, S. (2004). Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC-2004, Gaithersburg.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Nugues, P.M. (2014). Counting Words. In: Language Processing with Perl and Prolog. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41464-0_5

  • DOI: https://doi.org/10.1007/978-3-642-41464-0_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41463-3

  • Online ISBN: 978-3-642-41464-0
