Abstract
In the previous chapter we reviewed the basic elements of the statistical language model framework. An alternative and commonly used modeling paradigm, originally developed in the field of Information Retrieval, is the geometrical framework. Within this framework, vector spaces are used to construct mathematical representations of documents, words and any other type of textual unit. The units of analysis are modeled as vectors in the given vector space, and basic geometrical concepts, such as distances, angles and projections, are then used to assess the degree of difference or similarity between them. In this chapter we focus on the geometrical framework of language modeling. First, in Sect. 8.1, the term-document matrix is presented and described in detail. Then, in Sect. 8.2, the vector space model approach is studied along with the well-known TF-IDF (term frequency-inverse document frequency) weighting scheme. Finally, in Sect. 8.3, the association scores and distance functions most commonly used in vector space model representations are described.
Notes
1. In the most common case, a single row of the transposed term-document matrix (a single document) is multiplied by the complete term-document matrix (the whole data collection). Vocabulary reduction is also a common practice, for several different reasons; this issue will be discussed in detail in Sect. 9.1.
2. Here the word "informative" is used in a general sense. Strictly speaking, from a mathematical point of view and according to the principles of information theory, the most informative words would be the rarest words in the data collection. "Relevant" or "discriminative" would probably be more appropriate terms here.
3. It was proposed by the Greek mathematician Euclid of Alexandria in his fundamental treatise on geometry, "The Elements", about 23 centuries ago. The definition of the Euclidean distance is based on Pythagoras' theorem.
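Notes 1 and 3 can be illustrated with a minimal sketch, written here in plain Python rather than in the book's MATLAB. The toy counts, the vocabulary and the function names (`document_vector`, `scores_against_collection`, `euclidean_distance`) are assumptions made up for this illustration and do not come from the chapter. The sketch multiplies one row of the transposed term-document matrix (one document) by the whole matrix, and computes a Euclidean distance between two document vectors.

```python
import math

# Toy term-document matrix: rows are vocabulary terms, columns are documents.
# The counts are invented for illustration only.
vocabulary = ["vector", "space", "model", "language"]
term_document = [
    [2, 0, 1],  # "vector"
    [1, 0, 1],  # "space"
    [0, 3, 1],  # "model"
    [0, 2, 0],  # "language"
]


def document_vector(matrix, j):
    """Return column j of the matrix, i.e. row j of its transpose."""
    return [row[j] for row in matrix]


def scores_against_collection(matrix, j):
    """Multiply row j of the transposed matrix by the complete matrix.

    The result holds one inner product per document in the collection.
    """
    query = document_vector(matrix, j)
    n_docs = len(matrix[0])
    return [
        sum(q * row[k] for q, row in zip(query, matrix))
        for k in range(n_docs)
    ]


def euclidean_distance(u, v):
    """Square root of the summed squared differences (Pythagoras' theorem)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Inner products of document 0 with every document in the collection.
scores = scores_against_collection(term_document, 0)

# Euclidean distance between documents 0 and 2.
dist = euclidean_distance(
    document_vector(term_document, 0),
    document_vector(term_document, 2),
)

print(scores)
print(dist)
```

With raw counts, the inner product of a document with itself is simply its squared length, so these scores are not yet normalized similarities; a weighting scheme and a normalization step would typically be applied first.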
Copyright information
© 2013 Springer Science+Business Media New York
About this chapter
Cite this chapter
Banchs, R.E. (2013). Geometrical Models. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_8
DOI: https://doi.org/10.1007/978-1-4614-4151-9_8
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-4150-2
Online ISBN: 978-1-4614-4151-9
eBook Packages: Computer Science (R0)