Counting Lumps in Word Space: Density as a Measure of Corpus Homogeneity

  • Magnus Sahlgren
  • Jussi Karlgren
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3772)


This paper introduces a measure of corpus homogeneity that indicates the amount of topical dispersion in a corpus. The measure is based on the density of neighborhoods in semantic word spaces. We evaluate the measure by comparing the results for five different corpora. Our initial results indicate that the proposed density measure can indeed identify differences in topical dispersion.
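The abstract does not give the exact definition of the density measure, so the following is only a minimal sketch of one way neighborhood density in a word space could be quantified. It assumes each word is already represented by a context vector, and it counts how many other words lie above a cosine-similarity threshold. The function names, the threshold value of 0.5, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighborhood_density(vectors, threshold=0.5):
    """Mean number of neighbors per word whose cosine similarity exceeds `threshold`.

    `vectors` maps each word to its context vector. The threshold is an
    illustrative assumption, not a value from the paper.
    """
    words = list(vectors)
    if not words:
        return 0.0
    total = 0
    for w in words:
        total += sum(
            1
            for other in words
            if other != w and cosine(vectors[w], vectors[other]) > threshold
        )
    return total / len(words)


# Toy word spaces (hypothetical vectors): one tightly clustered, one spread out.
tight = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "pet": [1.0, 0.0]}
spread = {"cat": [1.0, 0.0], "tax": [0.0, 1.0], "ion": [-1.0, 0.0]}

print(neighborhood_density(tight))
print(neighborhood_density(spread))
```

In this sketch, a corpus-level score could be obtained by building one word space per corpus and comparing the resulting averages. How the paper relates density to topical dispersion is not stated in the abstract and is left open here.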


Latent Semantic Analysis; Homogeneous Data; Dimensionality Reduction Technique; Random Indexing; Context Vector





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Magnus Sahlgren (1)
  • Jussi Karlgren (1)
  1. SICS, Swedish Institute of Computer Science, Kista, Sweden
