Abstract
We saw in Chap. 2 that words have specific contexts of use. Pairs of words like strong and tea or powerful and computer are not random associations but the result of a preference. A native speaker will use them naturally, while a learner will have to learn them from books – dictionaries – where they are explicitly listed. Similarly, the words rider and writer sound much alike in American English, but they are likely to occur with different surrounding words. Hence, hearing an ambiguous phonetic sequence, a listener will discard the improbable rider of books or writer of horses and prefer writer of books or rider of horses (Church and Mercer 1993).
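The preference the abstract describes can be modeled with word co-occurrence counts. The sketch below is not from the chapter itself (the book works in Perl and Prolog): it is a minimal Python illustration over an invented toy corpus, scoring the two readings of the ambiguous phonetic sequence by chaining maximum-likelihood n-gram estimates.

```python
from collections import Counter

# Toy corpus, invented for illustration; a real system would draw
# counts from a large corpus of millions of words.
corpus = (
    "the writer of books met a rider of horses . "
    "a writer of books and a writer of novels . "
    "the rider of horses won ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def phrase_prob(w1, w2, w3):
    """Chain-rule score P(w1) * P(w2|w1) * P(w3|w1,w2)
    with maximum-likelihood estimates from the toy counts."""
    p1 = unigrams[w1] / total
    p2 = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    p3 = (trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
          if bigrams[(w1, w2)] else 0.0)
    return p1 * p2 * p3

# The two candidate readings of the ambiguous sequence:
print(phrase_prob("writer", "of", "books"))  # nonzero: attested reading
print(phrase_prob("rider", "of", "books"))   # zero: never observed
```

Because rider of books never occurs in the corpus, its trigram count is zero and the reading is discarded, mirroring the listener's preference. In practice zero counts are softened with smoothing (Chen and Goodman 1998; Katz 1987) rather than ruling a phrase out entirely.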
References
Apache OpenNLP Development Community. (2012). Apache OpenNLP developer documentation. The Apache Software Foundation, 1.5.2 ed.
Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search (2nd ed.). New York: Addison-Wesley.
Bentley, J., Knuth, D., & McIlroy, D. (1986). Programming pearls. Communications of the ACM, 29(6), 471–483.
Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague (pp. 858–867).
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), 107–117. Proceedings of WWW7.
Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C., & Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–489.
Chen, S. F., & Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Technical report TR-10-98, Harvard University, Cambridge, MA.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
Church, K. W., & Mercer, R. L. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1), 1–24.
Clarkson, P. R., & Rosenfeld, R. (1997). Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings ESCA Eurospeech, Rhodes.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications. New York: MIT.
Francis, W. N., & Kucera, H. (1982). Frequency analysis of English usage. Boston: Houghton Mifflin.
Franz, A., & Brants, T. (2006). All our n-gram are belong to you. Retrieved November 7, 2013, from http://googleresearch.blogspot.se/2006/08/all-our-n-gram-are-belong-to-you.html
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3–4), 237–264.
Grefenstette, G., & Tapanainen, P. (1994). What is a word, what is a sentence? Problems of tokenization. MLTT technical report 4, Xerox.
Huang, J., Gao, J., Miao, J., Li, X., Wang, K., & Behr, F. (2010). Exploring web scale language models for search query processing. In Proceedings of the 19th international World Wide Web conference, Raleigh (pp. 451–460).
Jelinek, F. (1990). Self-organized language modeling for speech recognition. In A. Waibel & K.-F. Lee (Eds.), Readings in speech recognition. San Mateo: Morgan Kaufmann. Reprinted from an IBM report, 1985.
Jelinek, F. (1997). Statistical methods for speech recognition. Cambridge, MA: MIT.
Jelinek, F., & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema & L. N. Kanal (Eds.), Pattern recognition in practice (pp. 381–397). Amsterdam: North-Holland.
Joachims, T. (2002). Learning to classify text using support vector machines. Boston: Kluwer Academic.
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400–401.
Kiss, T., & Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), 485–525.
Laplace, P. (1820). Théorie analytique des probabilités (3rd ed.). Paris: Coursier.
Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT.
Marcus, M., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Mauldin, M. L., & Leavitt, J. R. R. (1994). Web-agent related research at the center for machine translation. In Proceedings of the ACM SIG on networked information discovery and retrieval, McLean.
Mikheev, A. (2002). Periods, capitalized words, etc. Computational Linguistics, 28(3), 289–318.
Orwell, G. (1949). Nineteen eighty-four. London: Secker and Warburg.
Palmer, H. E. (1933). Second interim report on English collocations, submitted to the tenth annual conference of English teachers under the auspices of the institute for research in English teaching. Tokyo: Institute for Research in English Teaching.
Reynar, J. C. (1998). Topic segmentation: Algorithms and applications. PhD thesis, University of Pennsylvania, Philadelphia.
Reynar, J. C., & Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the fifth conference on applied natural language processing, Washington, DC (pp. 16–19).
Salton, G. (1988). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading: Addison-Wesley.
Salton, G., & Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical report TR87-881, Department of Computer Science, Cornell University, Ithaca.
Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. In Proceedings of the international conference on spoken language processing, Denver.
Wang, K., Thrasher, C., Viegas, E., Li, X., & Hsu, B. (2010). An overview of Microsoft web n-gram corpus and applications. In Proceedings of the NAACL HLT 2010: Demonstration session, Los Angeles (pp. 45–48).
Yu, H.-F., Ho, C.-H., Juan, Y.-C., & Lin, C.-J. (2013). LibShortText: A library for short-text classification and analysis. Retrieved November 1, 2013, from http://www.csie.ntu.edu.tw/~cjlin/libshorttext
Zaragoza, H., Craswell, N., Taylor, M., Saria, S., & Robertson, S. (2004). Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC-2004, Gaithersburg.
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Nugues, P.M. (2014). Counting Words. In: Language Processing with Perl and Prolog. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41464-0_5
Print ISBN: 978-3-642-41463-3
Online ISBN: 978-3-642-41464-0