Mixture distributions

Baayen, R. Harald

doi:10.1007/978-94-010-0844-0_4

Part of the book series: Text, Speech and Language Technology (TLTB, volume 18)

Abstract

There are many ways in which texts are non-homogeneous. In the next chapter, we will see that the topical development in discourse may require adjustments to the standard LNRE models. Non-randomness in the way texts unfold through 'text time' N introduces a kind of non-homogeneity that is at odds with the basic assumptions underlying the urn model on which LNRE models are based. However, there are other kinds of non-homogeneity in texts that we have thus far not considered at all. Corpora are collections of texts, generally from different authors and covering a wide range of topics. Differences in register and authorial structure may make it impossible to fit a simple LNRE model to corpus-derived word frequency distributions. Even texts that are non-composite with respect to authorship, style, and register are nevertheless non-homogeneous in the sense that they are composed of words that differ widely with respect to their internal structure and quantitative properties. Some words, e.g., dog, tree, write, have no internal structure at all. Most words have some kind of internal structure, e.g., dogs consists of the base word dog and the plural suffix -s, and unwillingness has a layered structure that starts with the verb will and successively adds the affixes -ing, un-, and -ness:

will	(verb)
willing	(gerund in -ing)
unwilling	(adjective in un-)
unwillingness	(abstract noun in -ness)
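
The layered derivation above can be sketched as a small Python routine that applies one affix per step. This is only an illustration: the names STEPS and derive are hypothetical and not from the chapter, and simple string concatenation is assumed to be enough for this one example (it ignores spelling changes that other words would need).

```python
# Illustrative only: STEPS and derive are hypothetical names, not from the chapter.
# Each step is (affix, kind, resulting category), following the derivation above.
STEPS = [
    ("-ing", "suffix", "gerund"),
    ("un-", "prefix", "adjective"),
    ("-ness", "suffix", "abstract noun"),
]


def derive(base, base_category, steps):
    """Return the (form, category) layers, starting from the base word."""
    layers = [(base, base_category)]
    form = base
    for affix, kind, category in steps:
        morph = affix.strip("-")  # drop the hyphen that marks the attachment side
        # Assumption: plain concatenation, no spelling adjustments.
        form = morph + form if kind == "prefix" else form + morph
        layers.append((form, category))
    return layers


layers = derive("will", "verb", STEPS)
for form, category in layers:
    print(f"{form}\t({category})")
```

Each layer wraps the previous one, so the order of the steps matters: un- attaches to willing, not to will.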

Figures 4.1 and 4.2 illustrate how different the quantitative properties of morphologically defined subsets of words can be. Both figures are based on the frequency lists in the CELEX lexical database (Baayen, Piepenbrock, & Van Rijn, 1993), which for Dutch is based on a corpus of 42 million words. Figure 4.1 presents a series of summary plots for the monomorphemic nouns in this corpus, and Figure 4.2 presents the same kind of summary plots for the Dutch suffix -heid, which is used in much the same way as the English suffix -ness.

Author information

Authors and Affiliations

University of Nijmegen, The Netherlands
R. Harald Baayen

About this chapter

Cite this chapter

Baayen, R.H. (2001). Mixture distributions. In: Word Frequency Distributions. Text, Speech and Language Technology, vol 18. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0844-0_4

DOI: https://doi.org/10.1007/978-94-010-0844-0_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-0927-3
Online ISBN: 978-94-010-0844-0
eBook Packages: Springer Book Archive
