Geometrical Models

Banchs, Rafael E.

doi:10.1007/978-1-4614-4151-9_8

Abstract

In the previous chapter we reviewed the basic elements of the statistical language model framework. An alternative and commonly used modeling paradigm, originally developed in the field of Information Retrieval, is the geometrical framework. Within this framework, vector spaces are used to construct mathematical representations of documents, words, and any other type of textual unit. Basic geometrical concepts, such as distances, angles, and projections, are then used to assess the degree of difference or similarity among the units of analysis, each of which is modeled as a vector in the given vector space. In this chapter we focus on the geometrical framework of language modeling. First, in Sect. 8.1, the term-document matrix is presented and described in detail. Then, in Sect. 8.2, the vector space model approach is studied along with the widely used TF-IDF (term frequency-inverse document frequency) weighting scheme. Finally, in Sect. 8.3, the association scores and distance functions most commonly used in vector space model representations are described.
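The three ingredients named in the abstract (term-document matrix, TF-IDF weighting, and a similarity score) can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not taken from the chapter. The toy documents, the whitespace tokenizer, the `log(N / df)` IDF variant, and the helper names `idf`, `column`, and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy collection (hypothetical data, not from the chapter).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Naive whitespace tokenization (an assumption; real pipelines do more).
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Term-document matrix: one row per term, one column per document.
counts = [Counter(doc) for doc in tokenized]
term_doc = [[c[t] for c in counts] for t in vocab]


def idf(row, n_docs):
    """Inverse document frequency, log(N / df); one common variant."""
    df = sum(1 for x in row if x > 0)
    return math.log(n_docs / df)


# TF-IDF weighting: raw term frequency times the term's IDF.
n_docs = len(docs)
tfidf = [[tf * idf(row, n_docs) for tf in row] for row in term_doc]


def column(matrix, j):
    """Extract document j as a vector over the vocabulary."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


sim_01 = cosine(column(tfidf, 0), column(tfidf, 1))  # documents share terms
sim_02 = cosine(column(tfidf, 0), column(tfidf, 2))  # no terms in common
```

In this toy collection the first two documents share the terms "the", "sat", and "on", so their cosine similarity is positive, while the first and third documents share no terms and score zero.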

Notes

1. The most common situation is to compute the product of a single row of the transposed term-document matrix (one single document) against the complete term-document matrix (the whole data collection). Vocabulary reduction is also a common practice for several different reasons; this issue will be discussed in detail in Sect. 9.1.
2. Here the word "informative" is used in a general sense. Strictly speaking, from a mathematical point of view and according to the principles of information theory, the most informative words would be the rarest words in the data collection. "Relevant" or "discriminative" would probably be more appropriate terms here.
3. It was proposed by the Greek mathematician Euclid of Alexandria in his fundamental treatise on geometry, "The Elements", about 23 centuries ago. The definition of the Euclidean distance is based on Pythagoras' theorem.
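The operations mentioned in Notes 1 and 3 can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the chapter's own code: the 4-term by 3-document count matrix is made up, and the helpers `transpose`, `row_times_matrix`, and `euclidean` are hypothetical names introduced here.

```python
import math

# Hypothetical 4-term x 3-document matrix of raw counts (rows = terms).
term_doc = [
    [2, 0, 1],
    [1, 1, 0],
    [0, 3, 0],
    [0, 0, 2],
]


def transpose(m):
    """Swap rows and columns: term-document becomes document-term."""
    return [list(col) for col in zip(*m)]


def row_times_matrix(row, m):
    """Multiply a 1 x T row vector by a T x D matrix, giving D scores."""
    n_cols = len(m[0])
    return [sum(r * m[t][d] for t, r in enumerate(row)) for d in range(n_cols)]


# Note 1: one row of the transposed matrix (a single document)
# multiplied against the complete term-document matrix.
doc_term = transpose(term_doc)
query = doc_term[0]                           # document 0 as a term vector
scores = row_times_matrix(query, term_doc)    # dot product with every document


def euclidean(u, v):
    """Euclidean distance: Pythagoras' theorem applied coordinate-wise."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Note 3: distance between documents 0 and 1.
d01 = euclidean(doc_term[0], doc_term[1])
```

Here `scores` holds the dot product of document 0 with each document in the collection, including itself, and `d01` is the straight-line distance between the first two document vectors.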

Author information

Authors and Affiliations

Barcelona
Rafael E. Banchs

Corresponding author

Correspondence to Rafael E. Banchs.

About this chapter

Cite this chapter

Banchs, R.E. (2013). Geometrical Models. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_8

DOI: https://doi.org/10.1007/978-1-4614-4151-9_8
Published: 14 August 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-4150-2
Online ISBN: 978-1-4614-4151-9
eBook Packages: Computer Science, Computer Science (R0)