Counting Words

Language Processing with Perl and Prolog

Part of the book series: Cognitive Technologies (COGTECH)

Abstract

We saw in Chap. 2 that words have specific contexts of use. Pairs of words like strong and tea or powerful and computer are not random associations but the result of a preference. A native speaker will use them naturally, while a learner will have to learn them from books – dictionaries – where they are explicitly listed. Similarly, the words rider and writer sound much alike in American English, but they are likely to occur with different surrounding words. Hence, hearing an ambiguous phonetic sequence, a listener will discard the improbable rider of books or writer of horses and prefer writer of books or rider of horses (Church and Mercer 1993).
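
The chapter's starting point is that such preferences show up directly in corpus counts. As a minimal sketch of the idea (not the author's program), the following Perl script counts word bigrams in a plain-text corpus; the file name corpus.txt and the crude tokenization are assumptions made for the example. Frequent pairs such as strong tea then surface at the top of the list.

    # count_bigrams.pl -- sketch: count word bigram frequencies in a corpus.
    # The corpus file name and the tokenization are illustrative assumptions.
    use strict;
    use warnings;

    my %bigram_count;
    my $previous;

    open my $fh, '<', 'corpus.txt' or die "Cannot open corpus.txt: $!";
    while (my $line = <$fh>) {
        # Crude tokenization: lowercased sequences of letters only
        foreach my $word (map { lc } $line =~ /([a-zA-Z]+)/g) {
            $bigram_count{"$previous $word"}++ if defined $previous;
            $previous = $word;
        }
    }
    close $fh;

    # Print the 20 most frequent bigrams
    my @bigrams = sort { $bigram_count{$b} <=> $bigram_count{$a} } keys %bigram_count;
    foreach my $bigram (@bigrams[0 .. 19]) {
        last unless defined $bigram;
        printf "%-30s %d\n", $bigram, $bigram_count{$bigram};
    }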

Notes

  1. Some authors now use the term pointwise mutual information to mean mutual information. Neither Fano (1961) nor Church and Hanks (1990) used this term, and we kept the original one.
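
For reference, the association score that Church and Hanks (1990) simply call mutual information compares the joint probability of a word pair with what independence would predict: I(x, y) = log2( P(x, y) / (P(x) P(y)) ), with the probabilities estimated from corpus counts. A small Perl sketch follows; the counts in the example call are invented purely for illustration.

    # mutual_information.pl -- sketch of the Church and Hanks (1990) score
    # I(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from counts over
    # a corpus of N words. The counts below are invented for illustration.
    use strict;
    use warnings;

    sub mutual_information {
        my ($count_xy, $count_x, $count_y, $n) = @_;
        my $p_xy = $count_xy / $n;
        my $p_x  = $count_x  / $n;
        my $p_y  = $count_y  / $n;
        return log($p_xy / ($p_x * $p_y)) / log(2);    # convert to base 2
    }

    # Hypothetical counts for the pair (strong, tea) in a 1,000,000-word corpus
    printf "I(strong, tea) = %.2f\n", mutual_information(30, 500, 200, 1_000_000);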

References

  • Apache OpenNLP Development Community. (2012). Apache OpenNLP developer documentation. The Apache Software Foundation, 1.5.2 ed.

  • Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search (2nd ed.). New York: Addison-Wesley.

  • Bentley, J., Knuth, D., & McIlroy, D. (1986). Programming pearls. Communications of the ACM, 29(6), 471–483.

  • Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, Prague (pp. 858–867).

  • Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1–7), 107–117. Proceedings of WWW7.

  • Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C., & Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–489.

  • Chen, S. F., & Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Technical report TR-10-98, Harvard University, Cambridge, MA.

  • Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.

  • Church, K. W., & Mercer, R. L. (1993). Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1), 1–24.

  • Clarkson, P. R., & Rosenfeld, R. (1997). Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings ESCA Eurospeech, Rhodes.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Fano, R. M. (1961). Transmission of information: A statistical theory of communications. New York: MIT.

  • Francis, W. N., & Kucera, H. (1982). Frequency analysis of English usage. Boston: Houghton Mifflin.

  • Franz, A., & Brants, T. (2006). All our n-gram are belong to you. Retrieved November 7, 2013, from http://googleresearch.blogspot.se/2006/08/all-our-n-gram-are-belong-to-you.html

  • Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(16), 237–264.

  • Grefenstette, G., & Tapanainen, P. (1994). What is a word, what is a sentence? Problems of tokenization. MLTT technical report 4, Xerox.

  • Huang, J., Gao, J., Miao, J., Li, X., Wang, K., & Behr, F. (2010). Exploring web scale language models for search query processing. In Proceedings of the 19th international World Wide Web conference, Raleigh (pp. 451–460).

  • Jelinek, F. (1990). Self-organized language modeling for speech recognition. In A. Waibel & K.-F. Lee (Eds.), Readings in speech recognition. San Mateo: Morgan Kaufmann. Reprinted from an IBM report, 1985.

  • Jelinek, F. (1997). Statistical methods for speech recognition. Cambridge, MA: MIT.

  • Jelinek, F., & Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema & L. N. Kanal (Eds.), Pattern recognition in practice (pp. 381–397). Amsterdam: North-Holland.

  • Joachims, T. (2002). Learning to classify text using support vector machines. Boston: Kluwer Academic.

  • Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400–401.

  • Kiss, T., & Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), 485–525.

  • Laplace, P. (1820). Théorie analytique des probabilités (3rd ed.). Paris: Courcier.

  • Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.

  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.

  • Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT.

  • Marcus, M., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.

  • Mauldin, M. L., & Leavitt, J. R. R. (1994). Web-agent related research at the center for machine translation. In Proceedings of the ACM SIG on networked information discovery and retrieval, McLean.

  • Mikheev, A. (2002). Periods, capitalized words, etc. Computational Linguistics, 28(3), 289–318.

  • Orwell, G. (1949). Nineteen eighty-four. London: Secker and Warburg.

  • Palmer, H. E. (1933). Second interim report on English collocations, submitted to the tenth annual conference of English teachers under the auspices of the institute for research in English teaching. Tokyo: Institute for Research in English Teaching.

  • Reynar, J. C. (1998). Topic segmentation: Algorithms and applications. PhD thesis, University of Pennsylvania, Philadelphia.

  • Reynar, J. C., & Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the fifth conference on applied natural language processing, Washington, DC (pp. 16–19).

  • Salton, G. (1988). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading: Addison-Wesley.

  • Salton, G., & Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical report TR87-881, Department of Computer Science, Cornell University, Ithaca.

  • Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. In Proceedings of international conference spoken language processing, Denver.

  • Wang, K., Thrasher, C., Viegas, E., Li, X., & Hsu, B. (2010). An overview of Microsoft web n-gram corpus and applications. In Proceedings of the NAACL HLT 2010: Demonstration session, Los Angeles (pp. 45–48).

  • Yu, H.-F., Ho, C.-H., Juan, Y.-C., & Lin, C.-J. (2013). LibShortText: A library for short-text classification and analysis. Retrieved November 1, 2013, from http://www.csie.ntu.edu.tw/~cjlin/libshorttext

  • Zaragoza, H., Craswell, N., Taylor, M., Saria, S., & Robertson, S. (2004). Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC-2004, Gaithersburg.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Nugues, P.M. (2014). Counting Words. In: Language Processing with Perl and Prolog. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41464-0_5

  • DOI: https://doi.org/10.1007/978-3-642-41464-0_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41463-3

  • Online ISBN: 978-3-642-41464-0
