Basic Corpus Statistics

Banchs, Rafael E.

doi:10.1007/978-1-4614-4151-9_6

Abstract

This chapter opens the second part of the book, which focuses on mathematical models for representing textual data. As mentioned in Chap. 1, the basic objective of text mining (and of data mining in general) can be reduced to the discovery and extraction of relevant and valuable information from large volumes of data. In this chapter we begin our description of text models by presenting methods based on the observation of basic properties and regularities in large volumes of text, generally referred to as corpus statistics. First, in Sect. 6.1, we describe some fundamental properties of natural language as they are reflected in its textual representation. Then, in Sect. 6.2, we introduce the concept of word co-occurrences, as well as the commonly used measure of pointwise mutual information, for studying dependencies between words. Finally, in Sect. 6.3, we focus on word co-occurrences at shorter distances while also taking word order into account.
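As a rough illustration of the pointwise mutual information measure mentioned above, the sketch below computes it from raw counts using the standard definition, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The function name `pmi`, its parameters, and the toy counts are assumptions made for this example; they are not taken from the chapter, whose own code is written in MATLAB.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) estimated from raw counts.

    count_xy : number of contexts in which both words occur
    count_x  : number of contexts in which word x occurs
    count_y  : number of contexts in which word y occurs
    total    : total number of contexts (assumed to be positive)
    """
    if count_xy == 0:
        # The words never co-occur, so the ratio is zero and its log is -inf.
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Toy counts, invented purely for illustration.
score = pmi(count_xy=20, count_x=40, count_y=50, total=1000)
```

A positive score suggests the two words co-occur more often than independence would predict, a score near zero suggests independence, and a negative score suggests they tend to avoid each other. Estimates based on very small counts are known to be unreliable, so in practice a minimum count threshold is often applied.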

Notes

1. Here, we just use the term-document matrix as an intermediate step for computing the term-to-term co-occurrence matrix. We will study this kind of representation in more detail in Chap. 8, as it constitutes a fundamental element in the construction of geometrical models.
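The step described in this note can be sketched as follows, under the assumption that the term-document matrix is binary (1 if a term appears in a document, 0 otherwise). Multiplying that matrix by its own transpose then yields, for every pair of terms, the number of documents containing both. The helper names, the whitespace tokenization, and the toy documents are all hypothetical choices for this example and are not the chapter's implementation.

```python
def term_document_matrix(docs):
    """Build a binary term-document matrix (rows: terms, columns: documents).

    Tokenization is a plain whitespace split, which is an assumption of
    this sketch rather than a recommendation.
    """
    doc_tokens = [set(doc.split()) for doc in docs]
    vocab = sorted(set().union(*doc_tokens)) if doc_tokens else []
    matrix = [[1 if term in tokens else 0 for tokens in doc_tokens]
              for term in vocab]
    return vocab, matrix


def cooccurrence_matrix(td):
    """Term-to-term co-occurrence counts, i.e. the product of td and its transpose.

    Entry [i][j] is the number of documents containing both term i and term j;
    the diagonal holds each term's document frequency.
    """
    n = len(td)
    return [[sum(a * b for a, b in zip(td[i], td[j])) for j in range(n)]
            for i in range(n)]


# Toy corpus, invented purely for illustration.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, td = term_document_matrix(docs)
cooc = cooccurrence_matrix(td)
```

The pure-Python loops keep the sketch free of dependencies; for a realistic corpus a sparse matrix library would be the more practical choice.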

Author information

Authors and Affiliations

Rafael E. Banchs, Barcelona

Corresponding author

Correspondence to Rafael E. Banchs.

About this chapter

Cite this chapter

Banchs, R.E. (2013). Basic Corpus Statistics. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_6

DOI: https://doi.org/10.1007/978-1-4614-4151-9_6
Published: 14 August 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-4150-2
Online ISBN: 978-1-4614-4151-9
eBook Packages: Computer Science (R0)
