Geometrical Models

Text Mining with MATLAB®

Abstract

In the previous chapter we reviewed the basic elements of the statistical language model framework. An alternative and commonly used modeling paradigm, originally developed in the field of Information Retrieval, is the geometrical framework. Within this framework, vector spaces are used to construct mathematical representations of documents, words and any other type of textual unit. Basic geometrical concepts, such as distances, angles and projections, are then used to assess the degree of difference or similarity among the units of analysis, which are modeled as vectors in the given vector space. This chapter focuses on the geometrical framework of language modeling. First, in Sect. 8.1, the term-document matrix is presented and described in detail. Then, in Sect. 8.2, the vector space model approach is studied along with the widely used TF-IDF (term frequency-inverse document frequency) weighting scheme. Finally, in Sect. 8.3, the association scores and distance functions most commonly used in vector space model representations are described.
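
The three steps named in the abstract (term-document matrix, TF-IDF weighting, and a similarity score between document vectors) can be illustrated with a minimal sketch. It is written in plain Python rather than MATLAB so that it runs without any toolbox, and it is not the chapter's own code. The toy documents, the function names, and the particular TF-IDF variant (raw count times log(N/df)) are all assumptions made for illustration; other weighting variants are in common use.

```python
import math
from collections import Counter

# Hypothetical toy collection, not taken from the book.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

def term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per term, one column per document."""
    counts = [Counter(doc.split()) for doc in documents]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def tf_idf(matrix):
    """Weight raw counts by log(N / df); this is one common variant (assumed here)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)  # documents containing the term
        idf = math.log(n_docs / df)
        weighted.append([count * idf for count in row])
    return weighted

def column(matrix, j):
    """Extract document j as a vector over the vocabulary."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vocab, tdm = term_document_matrix(docs)
weights = tf_idf(tdm)
sim_01 = cosine(column(weights, 0), column(weights, 1))  # documents share several terms
sim_02 = cosine(column(weights, 0), column(weights, 2))  # documents share no terms
```

In this toy collection the first two documents share the terms "the", "sat" and "on", so their cosine similarity is positive, while the first and third documents have no term in common and score zero. Note that "cat" and "cats" are treated as distinct terms because the sketch applies no normalization.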

Notes

  1. The most common situation is to compute the product of a single row of the transposed term-document matrix (one single document) against the complete term-document matrix (the whole data collection). Vocabulary reduction is also a common practice for several different reasons; this issue will be discussed in detail in Sect. 9.1.

  2. Here the word "informative" is used in a general sense. Strictly speaking, from a mathematical point of view and according to the principles of information theory, the most informative words would be the rarest words in the data collection. Probably, "relevant" or "discriminative" would be more appropriate terms here.

  3. It was proposed by the Greek mathematician Euclid of Alexandria in his fundamental treatise on geometry, the "Elements", about 23 centuries ago! The definition of the Euclidean distance is based on Pythagoras' theorem.
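
Two of the operations mentioned in these notes can be sketched in a few lines: the product of one document (a row of the transposed term-document matrix) against the whole collection, and the Euclidean distance obtained by applying Pythagoras' theorem coordinate by coordinate. The example below is an illustrative sketch in plain Python, not the chapter's MATLAB code; the small count matrix and the function names are hypothetical.

```python
import math

# Hypothetical 4-term x 3-document matrix of term counts (rows are terms).
tdm = [
    [2, 0, 1],
    [1, 1, 0],
    [0, 3, 0],
    [0, 0, 2],
]

def document_vector(matrix, j):
    """Column j of the term-document matrix, i.e. row j of its transpose."""
    return [row[j] for row in matrix]

def scores_against_collection(matrix, j):
    """Inner product of document j with every document in the collection."""
    query = document_vector(matrix, j)
    n_docs = len(matrix[0])
    return [
        sum(q * row[k] for q, row in zip(query, matrix))
        for k in range(n_docs)
    ]

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

scores = scores_against_collection(tdm, 0)
dist = euclidean(document_vector(tdm, 0), document_vector(tdm, 1))
```

With this matrix, document 0 is the vector (2, 1, 0, 0), so its scores against the three documents are 5, 1 and 2, and its Euclidean distance to document 1, the vector (0, 1, 3, 0), is the square root of 13.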

References

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. The MIT Press, Cambridge

  • Roelleke T, Wang J (2008) TF-IDF uncovered: a study of theories and probabilities. In Proceedings of the 31st annual international ACM SIGIR conference, pp 435–442

  • Salton G (ed) (1971) The SMART retrieval system—experiments in automatic document retrieval. Prentice Hall Inc., Englewood Cliffs

  • Salton G, Wong A, Yang CS (1975) A vector space model for information retrieval. Commun ACM 18(11):613–620

  • Spärck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21

  • Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37(1):141–188

  • van Rijsbergen CJ (1979) Information retrieval. Butterworths, London

  • Widdows D (2004) Geometry and meaning. CSLI Publications, Center for the Study of Language and Information, Stanford

Author information

Correspondence to Rafael E. Banchs.

Copyright information

© 2013 Springer Science+Business Media New York

About this chapter

Cite this chapter

Banchs, R.E. (2013). Geometrical Models. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_8

  • DOI: https://doi.org/10.1007/978-1-4614-4151-9_8

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-4150-2

  • Online ISBN: 978-1-4614-4151-9

  • eBook Packages: Computer Science (R0)
