Statistical analysis of lexical data using chi-squared and related distributions

Brainerd, Barron

doi:10.1007/BF02402331

Published: July 1975

Volume 9, pages 161–178, (1975)

Computers and the Humanities

Barron Brainerd¹

Notes

See W.G. Cochran, “Some methods for strengthening the common χ² tests,” Biometrics, 10 (1954), 417–451, especially page 420. This rule of thumb, according to Cochran, is adequate when the degrees of freedom are greater than one and less than thirty. A more conservative rule is often used, and is suggested for one degree of freedom: choose cells so that the expected cell counts are not less than five, except for at most 20 percent of the cells, where the expected counts can be as low as one. For thirty or more degrees of freedom, a normal approximation is often suggested when too many of the expected cell counts are lower than five.
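The conservative rule quoted in this note can be checked mechanically. The following is a minimal sketch, not taken from the article; the function name and its parameters (`min_count`, `floor`, `max_fraction`) are illustrative assumptions that simply encode the thresholds five, one, and 20 percent stated above.

```python
def satisfies_conservative_rule(expected, min_count=5.0, floor=1.0, max_fraction=0.20):
    """Check the conservative rule for expected cell counts.

    Every expected count must be at least `floor`, and at most
    `max_fraction` of the cells may fall below `min_count`.
    (Names and defaults are assumptions encoding the rule in the note.)
    """
    if not expected:
        raise ValueError("need at least one cell")
    # No cell may have an expected count below the floor (one).
    if any(e < floor for e in expected):
        return False
    # Count the cells whose expected value is below five.
    small = sum(1 for e in expected if e < min_count)
    return small <= max_fraction * len(expected)
```

For example, `satisfies_conservative_rule([12.0, 8.5, 6.1, 5.0, 1.4])` passes because only one of the five cells is below five and none is below one. If the check fails, adjacent cells would normally be pooled before computing chi-squared.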
Barron Brainerd, “An exploratory study of pronouns and articles as indices of genre in English,” Language and Style, 5 (1972), 239–259.
In general, the power of the test increases with the number of degrees of freedom. The power of a test is, roughly speaking, the chance of rejecting a hypothesis when it is false. This varies with the choice of critical level and the choice of alternative to the null hypothesis. One test is more powerful than another if, no matter what the choice of critical level, the first test has a higher probability of rejecting the hypothesis when it is false. Thus the chi-squared obtained in Table 3.3 should be given more consideration than that of Table 3.4.
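How power depends on the critical level and on the alternative can be illustrated in the simplest case. The sketch below is an assumption-laden illustration, not the article's calculation: it takes a statistic Z that is normal with mean `delta` and unit variance under the alternative (`delta` is a hypothetical effect size), and uses the fact that rejecting when Z² exceeds the chi-squared critical value with one degree of freedom is the same as rejecting when |Z| exceeds the corresponding normal quantile.

```python
from statistics import NormalDist

def power_one_df(delta, alpha=0.05):
    """Power of the two-sided test when Z ~ N(delta, 1) under the alternative.

    `delta` (shift of the mean) and `alpha` (critical level) are
    illustrative parameters, not quantities taken from the article.
    """
    std = NormalDist()
    # Two-sided critical value: reject when |Z| > z_crit.
    z_crit = std.inv_cdf(1.0 - alpha / 2.0)
    # P(Z < -z_crit) + P(Z > z_crit) for Z ~ N(delta, 1).
    return std.cdf(-z_crit - delta) + 1.0 - std.cdf(z_crit - delta)
```

With `delta = 0` the function returns the critical level itself, since the hypothesis is then true. Increasing `delta` raises the power, and tightening `alpha` from 0.05 to 0.01 lowers it for the same `delta`.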
Barron Brainerd, “Article use as an indicator of style among English-language authors” in Linguistik und Statistik, ed. S. Jäger (Braunschweig: Vieweg, 1972) pp. 11–32.
In some cases it can be shown that a certain modification of the binomial distribution, based on a Markov model of text generation, yields a better fit than the Poisson. However, one of the applications of the knowledge that a sample is Poisson lies in the remark that if a random variable X is Poisson, then the random variable \(Y = \sqrt{X + 3/8}\) is approximately normally distributed with variance 1/4. The classical hypothesis tests can then be applied to Y. The sort of approximation that we achieve here is adequate for these purposes.
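The variance claim for the square-root transformation can be checked numerically. The following minimal sketch, with a hypothetical function name and truncation rule, sums the Poisson probabilities directly to obtain the variance of Y = sqrt(X + 3/8); it is meant for moderate means only, since `math.exp(-lam)` underflows for very large `lam`.

```python
import math

def sqrt_transform_variance(lam):
    """Variance of Y = sqrt(X + 3/8) for X ~ Poisson(lam), by direct summation."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Truncation point (an assumption): the tail mass beyond it is negligible.
    k_max = int(lam + 20.0 * math.sqrt(lam) + 50)
    p = math.exp(-lam)          # P(X = 0)
    mean = 0.0                  # accumulates E[Y]
    mean_sq = 0.0               # accumulates E[Y^2] = E[X] + 3/8
    for k in range(k_max + 1):
        if k > 0:
            p *= lam / k        # P(X = k) from P(X = k - 1)
        mean += p * math.sqrt(k + 0.375)
        mean_sq += p * (k + 0.375)
    return mean_sq - mean * mean
```

For means such as 10 or 20 the result should lie close to 0.25, in line with the variance of 1/4 stated in the note. The agreement is weaker for very small means, where the normal approximation itself is poor.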
Since articles are not, strictly speaking, random (an article cannot be succeeded immediately by another), the kind of justification presented here can only work as a first approximation. However, it can be shown that the probabilities P(X = k) are still approximately Poisson even when dependence is present.
See footnote 3 for a discussion of the power of a test.
See M.G. Kendall and A. Stuart, The advanced theory of statistics, Vol. II (London: Charles Griffin, 1961), pp. 578–579 for a derivation of this statistic.
These ranges can be obtained from a table of the percentage points or quantiles of the χ²-distribution with 49 degrees of freedom. If you are using chi-squared to test the goodness of fit of data to a continuous distribution, best results can be obtained by choosing the cells so that they have equal expected values.
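Cells with equal expected values can be built from the quantiles of the hypothesized distribution. The sketch below is an illustration under stated assumptions: it uses a standard normal as the hypothesized distribution and 50 cells (chosen here only because 50 cells give 49 degrees of freedom when no parameters are estimated from the data); the function names are hypothetical.

```python
from statistics import NormalDist

def equal_probability_edges(dist, n_cells):
    """Interior cut points splitting `dist` into n_cells cells of equal probability."""
    if n_cells < 2:
        raise ValueError("need at least two cells")
    # The i-th cut point is the quantile at probability i / n_cells.
    return [dist.inv_cdf(i / n_cells) for i in range(1, n_cells)]

def expected_counts(dist, edges, n_obs):
    """Expected count in each cell for a sample of n_obs observations."""
    # The outermost cells extend to minus and plus infinity.
    cdf_values = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
    return [n_obs * (hi - lo) for lo, hi in zip(cdf_values, cdf_values[1:])]

# Assumed example: standard normal hypothesis, 50 cells, 500 observations.
edges = equal_probability_edges(NormalDist(0.0, 1.0), 50)
counts = expected_counts(NormalDist(0.0, 1.0), edges, 500)
```

Each of the 50 cells then has an expected count of 500/50 = 10, up to rounding, so no cell falls below the thresholds discussed in the first note.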

Author information

Authors and Affiliations

University of Toronto, Canada
Barron Brainerd (professor of mathematics and linguistics)

About this article

Cite this article

Brainerd, B. Statistical analysis of lexical data using chi-squared and related distributions. Comput Hum 9, 161–178 (1975). https://doi.org/10.1007/BF02402331

Issue Date: July 1975
DOI: https://doi.org/10.1007/BF02402331
