Counting Lumps in Word Space: Density as a Measure of Corpus Homogeneity

  • Magnus Sahlgren
  • Jussi Karlgren
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3772)

Abstract

This paper introduces a measure of corpus homogeneity that indicates the amount of topical dispersion in a corpus. The measure is based on the density of neighborhoods in semantic word spaces. We evaluate the measure by comparing the results for five different corpora. Our initial results indicate that the proposed density measure can indeed identify differences in topical dispersion.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Magnus Sahlgren (1)
  • Jussi Karlgren (1)
  1. SICS, Swedish Institute of Computer Science, Kista, Sweden
