Abstract
In this chapter, power-law distributions and Small World Graphs originating from natural language data are examined in the fashion of Quantitative Linguistics. Section 3.1 presents several data sources that exhibit power-law rank-frequency distributions, and Section 3.2 discusses graphs with Small World properties in language data. We shall see that these characteristics are omnipresent in language data, and that we should be aware of them when designing Structure Discovery processes. For example, knowing that a few hundred words account for the bulk of the words in a text, it is safe to use only these as contextual features without losing much text coverage. Likewise, knowing that word co-occurrence networks possess the scale-free Small World property has implications for clustering these networks. An interesting question is whether these characteristics are inherent only to real natural language data, or whether they can be produced by generators of linear sequences in a much simpler way than our intuition about language complexity would suggest. In other words, we shall see how distinctive these characteristics are with respect to tests that decide whether a given sequence is natural language or not. Finally, an emergent random text generation model that captures many of the characteristics of natural language is defined and quantitatively verified in Section 3.3.
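The coverage argument can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the chapter: the function names `rank_frequency`, `coverage`, and `zipf_slope` are hypothetical, whitespace tokenization is assumed, and the least-squares fit on log-log values is only a rough way to estimate a power-law exponent.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return word frequencies sorted from most to least frequent."""
    return [count for _, count in Counter(tokens).most_common()]


def coverage(freqs, k):
    """Fraction of all tokens covered by the k most frequent word types."""
    total = sum(freqs)
    return sum(freqs[:k]) / total if total else 0.0


def zipf_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Assumes at least two word types. A slope near -1 is what a
    Zipf-like rank-frequency distribution would give; this simple
    fit is a rough estimate, not a rigorous power-law test.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input; a real experiment would use a large corpus.
tokens = "the cat and the dog and the bird".split()
freqs = rank_frequency(tokens)
print(coverage(freqs, 2))
print(zipf_slope(freqs))
```

On a real corpus, one would plot or tabulate `coverage(freqs, k)` for growing `k` to see how quickly a small set of frequent words accounts for most of the running text.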
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Biemann, C. (2012). Small Worlds of Natural Language. In: Structure Discovery in Natural Language. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25923-4_3
DOI: https://doi.org/10.1007/978-3-642-25923-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25922-7
Online ISBN: 978-3-642-25923-4
eBook Packages: Computer Science, Computer Science (R0)