Basic Corpus Statistics

  • Chapter
Text Mining with MATLAB®

Abstract

This chapter opens the second part of the book, which focuses on mathematical models used for representing textual data. As already mentioned in Chap. 1, the basic objective of text mining (and of data mining in general) can be reduced to the discovery and extraction of relevant and valuable information from large volumes of data. In this chapter we begin our description of text models by presenting methods based on the observation of basic properties and regularities in large volumes of text, which are generally referred to as corpus statistics. First, in Sect. 6.1, we describe some fundamental properties of natural language as they are reflected in the textual representation of the language. Then, in Sect. 6.2, we introduce the concept of word co-occurrences, as well as the commonly used measure of pointwise mutual information, for studying dependencies between words. Finally, in Sect. 6.3, we focus on word co-occurrences at shorter distances while also taking word order into account.
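For readers unfamiliar with the measure named above, pointwise mutual information is commonly defined as PMI(x, y) = log2[ p(x, y) / (p(x) p(y)) ], where p(x, y) is the probability of the two words occurring together and p(x), p(y) are their individual probabilities. The following is a minimal sketch of that standard definition, not the chapter's own code (the preview does not show it): it is written in Python rather than MATLAB, the function name `pmi` and its parameters are hypothetical, and probabilities are assumed to be estimated as simple relative frequencies over some unit of context (for example, documents or sentences).

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information, in bits, estimated from raw counts.

    count_xy -- number of contexts in which both words occur (assumed unit: e.g. documents)
    count_x  -- number of contexts in which the first word occurs
    count_y  -- number of contexts in which the second word occurs
    total    -- total number of contexts
    """
    if count_xy == 0:
        # The words never co-occur, so the log of the ratio diverges to -infinity.
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    # Positive values: the words co-occur more often than independence would predict.
    # Values near zero: the counts are consistent with independence.
    return math.log2(p_xy / (p_x * p_y))
```

As a usage example, `pmi(10, 20, 20, 100)` evaluates log2(0.1 / 0.04) = log2(2.5), roughly 1.32 bits, while `pmi(4, 20, 20, 100)` is zero up to floating-point rounding, since 4 joint occurrences is exactly what independence would predict for these counts.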

Notes

  1. Here, we will use the term-document matrix only as an intermediate step for computing the term-to-term co-occurrence matrix; we will study this kind of representation in more detail in Chap. 8, as it constitutes a fundamental element in the construction of geometrical models.
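One common way to go from a term-document matrix to a term-to-term co-occurrence matrix is to multiply the matrix by its own transpose. The sketch below illustrates that idea under stated assumptions; it is not taken from the chapter, whose actual procedure is not visible in this preview. It assumes a binary matrix `X` with one row per term and one column per document, uses Python instead of MATLAB, and the function name and toy data are hypothetical.

```python
def cooccurrence_from_term_document(X):
    """Return C, where C[i][j] counts the documents containing both term i and term j.

    X is a list of rows, one per term; X[i][d] is 1 if term i occurs in
    document d and 0 otherwise (binary occurrence is an assumption here).
    This is the product of X with its transpose, written out explicitly.
    """
    n_terms = len(X)
    C = [[0] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        for j in range(n_terms):
            # Dot product of the two term rows: documents shared by both terms.
            C[i][j] = sum(a * b for a, b in zip(X[i], X[j]))
    return C


# Toy data (hypothetical): 3 terms observed over 4 documents.
X = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
C = cooccurrence_from_term_document(X)
```

With binary entries, the diagonal of `C` holds each term's document frequency and the off-diagonal entries hold the joint document counts, which are the quantities needed to estimate co-occurrence probabilities. In matrix notation this is simply the product of X and its transpose.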

Author information

Corresponding author

Correspondence to Rafael E. Banchs.

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Banchs, R.E. (2013). Basic Corpus Statistics. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_6

  • DOI: https://doi.org/10.1007/978-1-4614-4151-9_6

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-4150-2

  • Online ISBN: 978-1-4614-4151-9

  • eBook Packages: Computer Science, Computer Science (R0)
