Abstract
This chapter opens the second part of the book, which focuses on mathematical models used for representing textual data. As already mentioned in Chap. 1, the basic objective of text mining (and of data mining in general) can be reduced to the discovery and extraction of relevant and valuable information from large volumes of data. In this chapter we begin our description of text models by presenting methods based on the observation of basic properties and regularities in large volumes of text, which are generally referred to as corpus statistics. First, in Sect. 6.1, we describe some fundamental properties of natural language as they are reflected in the textual representation of the language. Then, in Sect. 6.2, we introduce the concept of word co-occurrences, as well as the commonly used measure of pointwise mutual information, for studying dependencies between words. Finally, in Sect. 6.3, we focus on word co-occurrences at shorter distances while also taking word order into account.
Notes
1. Here, we will just use the term-document matrix as an intermediate step for computing the term-to-term co-occurrence matrix, but we will study this kind of representation in more detail later in Chap. 8, as it constitutes a fundamental element in the construction of geometrical models.
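The step described in this note can be illustrated with a minimal sketch. It assumes binary, document-level occurrence (a term either appears in a document or it does not) and a made-up four-document corpus; the function names are hypothetical and are not taken from the chapter, and Python is used here only for illustration. Under these assumptions, the term-to-term co-occurrence matrix is the product of the term-document matrix with its transpose, and pointwise mutual information compares the joint document frequency of two terms with the product of their individual document frequencies.

```python
import math

def term_document_matrix(docs):
    """Binary term-document matrix: one row per term, one column per document."""
    tokenized = [set(d.split()) for d in docs]
    vocab = sorted(set().union(*tokenized))
    matrix = [[1 if w in doc else 0 for doc in tokenized] for w in vocab]
    return vocab, matrix

def cooccurrence(matrix):
    """Term-to-term counts: entry (i, j) is the number of documents
    containing both term i and term j (the matrix times its transpose)."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
            for row_i in matrix]

def pmi(cooc, n_docs, i, j):
    """Pointwise mutual information (base 2) from document-level counts.
    The diagonal of cooc holds each term's document frequency."""
    if cooc[i][j] == 0:
        return float("-inf")  # the two terms never co-occur
    p_ij = cooc[i][j] / n_docs
    p_i = cooc[i][i] / n_docs
    p_j = cooc[j][j] / n_docs
    return math.log2(p_ij / (p_i * p_j))

# Toy corpus, invented for illustration only.
docs = [
    "new york city",
    "new york times",
    "city council meeting",
    "times of change",
]
vocab, X = term_document_matrix(docs)
C = cooccurrence(X)

new, york, city = (vocab.index(w) for w in ("new", "york", "city"))
print(pmi(C, len(docs), new, york))   # positive: the terms always appear together
print(pmi(C, len(docs), city, york))  # zero: consistent with independence
```

In this toy corpus "new" and "york" each occur in half of the documents and always together, so their PMI is positive, whereas "city" and "york" co-occur exactly as often as independence would predict. Real corpora would call for proper tokenization and some smoothing of low counts, which this sketch omits.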
Copyright information
© 2013 Springer Science+Business Media New York
About this chapter
Cite this chapter
Banchs, R.E. (2013). Basic Corpus Statistics. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_6
DOI: https://doi.org/10.1007/978-1-4614-4151-9_6
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-4150-2
Online ISBN: 978-1-4614-4151-9
eBook Packages: Computer Science, Computer Science (R0)