Mixture distributions

  • Chapter
Word Frequency Distributions

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 18))

Abstract

There are many ways in which texts are non-homogeneous. In the next chapter, we will see that the topical development in discourse may require adjustments to the standard LNRE models. Non-randomness in the way texts unfold through 'text time' N introduces a kind of non-homogeneity that is at odds with the basic assumptions underlying the urn model on which LNRE models are based. However, there are other kinds of non-homogeneity in texts that we have thus far not considered at all. Corpora are collections of texts, generally from different authors and covering a wide range of topics. Differences in register and authorial structure may make it impossible to fit a simple LNRE model to corpus-derived word frequency distributions. Even texts that are non-composite with respect to authorship, style, and register are nevertheless non-homogeneous in the sense that they are composed of words that differ widely with respect to their internal structure and quantitative properties. Some words, e.g., dog, tree, write, have no internal structure at all. Most words have some kind of internal structure, e.g., dogs consists of the base word dog and the plural suffix -s, and unwillingness has a layered structure that starts with the verb will and successively adds the affixes -ing, un-, and -ness:

will            (verb)
willing         (gerund in -ing)
unwilling       (adjective in un-)
unwillingness   (abstract noun in -ness)
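
As a rough illustration of this layered structure, the derivation can be sketched in a few lines of Python. The helper `derive` and its (kind, affix) encoding are hypothetical names introduced only for this sketch; they are not taken from the chapter, and real morphological analysis involves spelling changes and category constraints that are ignored here.

```python
# Hypothetical helper: build each layer by attaching one affix at a time,
# in the order described in the text (-ing, un-, -ness).
def derive(base, affixes):
    forms = [base]
    for kind, affix in affixes:
        current = forms[-1]
        if kind == "prefix":
            forms.append(affix + current)
        else:  # treat anything else as a suffix
            forms.append(current + affix)
    return forms


steps = derive("will", [("suffix", "ing"), ("prefix", "un"), ("suffix", "ness")])
print(" -> ".join(steps))
```

Each element of `steps` corresponds to one layer of the word, so subsets of a frequency list could in principle be grouped by the outermost affix of each entry.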

Figures 4.1 and 4.2 illustrate how different the quantitative properties of morphologically defined subsets of words can be. Both figures are based on the frequency lists in the CELEX lexical database (Baayen, Piepenbrock, & Van Rijn, 1993), which for Dutch is based on a corpus of 42 million words. Figure 4.1 presents a series of summary plots for the monomorphemic nouns in this corpus, and Figure 4.2 presents the same kind of summary plots for the Dutch suffix -heid, which is used in much the same way as the English suffix -ness.

© 2001 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Baayen, R.H. (2001). Mixture distributions. In: Word Frequency Distributions. Text, Speech and Language Technology, vol 18. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0844-0_4

  • DOI: https://doi.org/10.1007/978-94-010-0844-0_4

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-0927-3

  • Online ISBN: 978-94-010-0844-0

  • eBook Packages: Springer Book Archive
