Small Worlds of Natural Language
In this chapter, power-law distributions and Small World graphs arising from natural language data are examined in the tradition of Quantitative Linguistics. After presenting several data sources that exhibit power-law rank-frequency distributions in Section 3.1, graphs with Small World properties in language data are discussed in Section 3.2. We shall see that these characteristics are omnipresent in language data and should be taken into account when designing Structure Discovery processes. Knowing, for example, that a few hundred words account for the bulk of the tokens in a text, it is safe to use only these as contextual features without losing much text coverage. Knowing that word co-occurrence networks possess the scale-free Small World property has implications for clustering these networks. An interesting question is whether these characteristics are inherent only to real natural language data, or whether they can be produced by generators of linear sequences in a much simpler way than our intuition about language complexity would suggest; in other words, we shall see how distinctive these characteristics are with respect to tests that decide whether a given sequence is natural language or not. Finally, an emergent random text generation model that captures many characteristics of natural language is defined and quantitatively verified in Section 3.3.
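The coverage claim above can be checked directly on any corpus: under a Zipfian rank-frequency distribution, the frequency of a word falls off roughly as the inverse of its rank, so a small number of high-rank types covers a large fraction of all tokens. The following is a minimal sketch (not from the chapter; the function names and whitespace tokenization are illustrative assumptions) of how one might compute the rank-frequency list and the token coverage of the top-n word types.

```python
from collections import Counter


def rank_frequency(text):
    """Token frequencies of a text, sorted in descending order,
    i.e. the frequency at list index r-1 belongs to rank r."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [freq for _, freq in counts.most_common()]


def coverage(text, top_n):
    """Fraction of all tokens covered by the top_n most frequent types."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)
```

On a toy text such as `"the the the cat cat sat"`, `rank_frequency` yields `[3, 2, 1]` and `coverage(text, 1)` is `0.5`: a single type already covers half the tokens. On realistic corpora the same computation with a few hundred types typically covers well over half of the running text, which is the property exploited when restricting contextual features to frequent words.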
Keywords: Degree Distribution · Small World · Sentence Length · Word Generator · Random Text