Notes
See W.G. Cochran, “Some methods for strengthening the common χ² tests,” Biometrics, 10 (1954), 417–451, especially page 420. According to Cochran, this rule of thumb is adequate when the number of degrees of freedom is greater than one and less than thirty. A more conservative rule is often used, and is suggested for one degree of freedom: choose cells so that the expected cell counts are not less than five, except for at most 20 percent of the cells, where the expected counts can be as low as one. For thirty or more degrees of freedom, a normal approximation is often suggested when too many of the expected cell counts are lower than five.
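The conservative rule quoted above is easy to check mechanically. The following is a minimal sketch, not Cochran's own procedure: the function name `satisfies_conservative_rule` and its parameter names are hypothetical, and the thresholds (five, one, 20 percent) are taken directly from the wording of the rule.

```python
def satisfies_conservative_rule(expected, usual_minimum=5.0,
                                absolute_minimum=1.0, max_fraction_small=0.2):
    """Check a list of expected cell counts against the conservative rule.

    The rule as stated: no expected count below `usual_minimum`, except for
    at most `max_fraction_small` of the cells, whose expected counts may be
    as low as `absolute_minimum`. All names here are illustrative.
    """
    if not expected:
        raise ValueError("at least one cell is required")
    # No cell may fall below the absolute floor of one.
    if any(e < absolute_minimum for e in expected):
        return False
    # Count the cells that fall below the usual minimum of five.
    small_cells = sum(1 for e in expected if e < usual_minimum)
    return small_cells <= max_fraction_small * len(expected)


# One small cell out of five (20 percent) is allowed.
ok_cells = satisfies_conservative_rule([10, 8, 6, 5, 2])
# Two small cells out of four (50 percent) are too many.
too_many_small = satisfies_conservative_rule([10, 8, 3, 2])
# An expected count below one is never allowed.
below_floor = satisfies_conservative_rule([10, 8, 6, 5, 0.5])
```

When the check fails, the usual remedy is to merge adjacent cells until it passes, at the cost of degrees of freedom.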
Barron Brainerd, “An exploratory study of pronouns and articles as indices of genre in English,” Language and Style, 5 (1972), 239–259.
In general, the power of the test increases with the number of degrees of freedom. The power of a test is, roughly speaking, the chance of rejecting a hypothesis when it is false. It varies with the choice of critical level and the choice of alternative to the null hypothesis. One test is more powerful than another if, no matter what the choice of critical level, the first test has a higher probability of rejecting the hypothesis when it is false. Thus the chi-squared obtained in Table 3.3 should be given more consideration than that of Table 3.4.
Barron Brainerd, “Article use as an indicator of style among English-language authors,” in Linguistik und Statistik, ed. S. Jäger (Braunschweig: Vieweg, 1972), pp. 11–32.
In some cases it can be shown that a certain modification of the binomial distribution, based on a Markov model of text generation, yields a better fit than the Poisson distribution. However, one application of the knowledge that a sample is Poisson lies in the remark that if a random variable X is Poisson, then the random variable \(Y = \sqrt{X + 3/8}\) is approximately normally distributed with variance 1/4. The classical hypothesis tests can then be applied to Y. The sort of approximation that we achieve here is adequate for these purposes.
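The variance-stabilizing effect of the square-root transformation Y = sqrt(X + 3/8) can be seen in a small simulation. This is a sketch under assumptions not found in the note: the Poisson mean of 10, the sample size, and the seed are arbitrary illustrative choices, and `poisson_sample` and `stabilize` are hypothetical helper names. The sampler uses Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random
import statistics


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    # Multiply uniforms until the running product drops to exp(-lam) or below.
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def stabilize(x):
    """The transformation from the note: Y = sqrt(X + 3/8)."""
    return math.sqrt(x + 3.0 / 8.0)


rng = random.Random(12345)   # fixed seed, chosen arbitrarily
mean_count = 10.0            # assumed Poisson mean, for illustration only
counts = [poisson_sample(mean_count, rng) for _ in range(20000)]
transformed = [stabilize(x) for x in counts]

# The raw counts have variance near the mean; the transformed values
# should have variance near 1/4 regardless of the mean.
raw_variance = statistics.pvariance(counts)
transformed_variance = statistics.pvariance(transformed)
print(f"variance of X: {raw_variance:.3f}")
print(f"variance of Y: {transformed_variance:.3f}")
```

The approximation deteriorates for very small Poisson means, so the simulation is best repeated with a mean close to the one observed in the data.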
Since articles are not, strictly speaking, random (an article cannot be succeeded immediately by another), the kind of justification presented here can only work as a first approximation. However, it can be shown that the probabilities P(X = k) are still approximately Poisson even when dependence is present.
See footnote 3 for a discussion of the power of a test.
See M.G. Kendall and A. Stuart, The advanced theory of statistics, Vol. II (London: Charles Griffin, 1961), pp. 578–579, for a derivation of this statistic.
These ranges can be obtained from a table of the percentage points or quantiles of the χ²-distribution with 49 degrees of freedom. If you are using chi-squared to test the goodness of fit of data to a continuous distribution, best results can be obtained by choosing the cells so that they have equal expected values.
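Where no table is at hand, both points in this note can be approximated with the standard library. The sketch below makes several assumptions that are not in the text: it uses the Wilson-Hilferty cube-root approximation rather than exact table values, it takes 0.025 and 0.975 as example probability levels, and it uses a standard normal as the example continuous distribution. The names `chi2_quantile_wh` and `equal_probability_edges` are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(p, df):
    """Approximate the p-quantile of chi-squared with df degrees of freedom.

    Uses the Wilson-Hilferty approximation; this is NOT an exact table value.
    """
    z = NormalDist().inv_cdf(p)          # standard normal quantile
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def equal_probability_edges(dist, cells):
    """Interior cell boundaries giving every cell probability 1/cells.

    With n observations, each cell then has expected count n / cells.
    """
    return [dist.inv_cdf(i / cells) for i in range(1, cells)]


# Approximate central 95 percent range for 49 degrees of freedom
# (the 0.025 and 0.975 levels are assumed for illustration).
lower = chi2_quantile_wh(0.025, 49)
upper = chi2_quantile_wh(0.975, 49)
print(f"approximate range: {lower:.2f} to {upper:.2f}")

# Boundaries of four equal-probability cells for a standard normal.
edges = equal_probability_edges(NormalDist(mu=0.0, sigma=1.0), 4)
print([round(e, 3) for e in edges])
```

For a fitted distribution, the same boundaries would be computed from the distribution with its estimated parameters, after which the observed counts per cell are compared with the common expected count.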
Cite this article
Brainerd, B. Statistical analysis of lexical data using chi-squared and related distributions. Comput Hum 9, 161–178 (1975). https://doi.org/10.1007/BF02402331