Distributions of Functional and Content Words Differ Radically
We consider statistical properties of prepositions—the most numerous and important functional words in European languages. Usually, they syntactically link verbs and nouns to nouns. It is shown that their rank distributions in Russian differ radically from those of content words, being much more compact. The Zipf law distribution commonly used for content words fails for them, and thus approximations flatter at first ranks and steeper at higher ranks are applicable. For these purposes, the Mandelbrot family and an expo-logarithmic family of distributions are tested, and an insignificant difference between the two least-square approximations is revealed. It is proved that the first dozen of ranks cover more than 80% of all preposition occurrences in the DB of Russian collocations of Verb-Preposition-Noun and Noun-Preposition-Noun types, thus hardly leaving room for the rest two hundreds of available Russian prepositions.
KeywordsNatural Language Processing Content Word Rank Distribution Accusative Case Page Statistic
Unable to display preview. Download preview PDF.