Abstract
Texts are non-homogeneous in many ways. In the next chapter, we will see that the topical development in discourse may require adjustments to the standard LNRE models. Non-randomness in the way texts unfold through 'text time' N introduces a kind of non-homogeneity that is at odds with the basic assumptions of the urn model on which LNRE models are based. However, there are other kinds of non-homogeneity in texts that we have not yet considered. Corpora are collections of texts, generally by different authors and covering a wide range of topics. Differences in register and authorial structure may make it impossible to fit a simple LNRE model to corpus-derived word frequency distributions. Even texts that are non-composite with respect to authorship, style, and register are nevertheless non-homogeneous in the sense that they are composed of words that differ widely in their internal structure and quantitative properties. Some words, e.g., dog, tree, write, have no internal structure at all. Most words have some kind of internal structure: dogs consists of the base word dog and the plural suffix -s, and unwillingness has a layered structure that starts with the verb will and successively adds the affixes -ing, un-, and -ness:
will | (verb)
willing | (gerund in -ing)
unwilling | (adjective in un-)
unwillingness | (abstract noun in -ness)
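The layering of unwillingness can be sketched in a few lines of Python. This is only an illustration of the successive affixation listed above, not a morphological analyser: the function name `derive` is hypothetical, affixes are attached by plain string concatenation, and spelling adjustments at morpheme boundaries are ignored.

```python
def derive(base, affixes):
    """Return the successive word forms obtained by attaching affixes in order.

    A suffix is written with a leading hyphen ("-ness"), a prefix with a
    trailing hyphen ("un-"). Attachment is plain concatenation (assumption:
    no spelling changes at the morpheme boundary).
    """
    forms = [base]
    for affix in affixes:
        current = forms[-1]
        if affix.startswith("-"):
            forms.append(current + affix[1:])   # suffix: attach on the right
        elif affix.endswith("-"):
            forms.append(affix[:-1] + current)  # prefix: attach on the left
        else:
            raise ValueError(f"affix must start or end with a hyphen: {affix!r}")
    return forms


layers = derive("will", ["-ing", "un-", "-ness"])
for form in layers:
    print(form)
```

Each element of `layers` corresponds to one row of the listing, from the base verb to the abstract noun.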
Figures 4.1 and 4.2 illustrate how different the quantitative properties of morphologically defined subsets of words can be. Both figures are based on the frequency lists in the CELEX lexical database (Baayen, Piepenbrock, & Van Rijn, 1993), which for Dutch is based on a corpus of 42 million words. Figure 4.1 presents a series of summary plots for the monomorphemic nouns in this corpus, and Figure 4.2 presents the same kind of summary plots for the Dutch suffix -heid, which is used in much the same way as the English suffix -ness.
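As a rough illustration of how such subsets might be compared, the Python sketch below computes a few summary quantities for two word lists. Everything in it is an assumption made for illustration: the function name `summarize` is hypothetical, the labels N (tokens), V (types), and V1 (types occurring once) are conventional notation not defined in the text above, and the counts are invented toy values, not CELEX data.

```python
from collections import Counter


def summarize(freqs):
    """Summarize a frequency list given as a dict mapping word -> token count."""
    spectrum = Counter(freqs.values())  # m -> number of types occurring m times
    return {
        "N": sum(freqs.values()),       # total number of tokens
        "V": len(freqs),                # number of distinct types
        "V1": spectrum.get(1, 0),       # number of types occurring exactly once
        "spectrum": dict(sorted(spectrum.items())),
    }


# Invented toy counts, for illustration only (NOT taken from CELEX).
simplex_nouns = {"dog": 120, "tree": 95, "house": 210, "fox": 7, "elm": 1}
ness_nouns = {"happiness": 30, "kindness": 12, "vagueness": 2,
              "unwillingness": 1, "greenness": 1, "oddness": 1}

for label, freqs in [("simplex", simplex_nouns), ("-ness", ness_nouns)]:
    stats = summarize(freqs)
    print(label, stats["N"], stats["V"], stats["V1"], stats["spectrum"])
```

Applying the same function to each morphologically defined subset of a real frequency list gives directly comparable summaries for the subsets.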
Copyright information
© 2001 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Baayen, R.H. (2001). Mixture distributions. In: Word Frequency Distributions. Text, Speech and Language Technology, vol 18. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0844-0_4
DOI: https://doi.org/10.1007/978-94-010-0844-0_4
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-0927-3
Online ISBN: 978-94-010-0844-0
eBook Packages: Springer Book Archive