Statistical analysis of lexical data using chi-squared and related distributions

Computers and the Humanities

Notes

  1. See W.G. Cochran, “Some methods for strengthening the common \(\chi^2\) tests,” Biometrics, 10 (1954), 417–451, especially page 420. This rule of thumb, according to Cochran, is adequate when the degrees of freedom are greater than one and less than thirty. A more conservative rule is often used, and is suggested for one degree of freedom: choose cells so that the expected cell counts are not less than five, except for at most 20 percent of the cells, where the expected counts can be as low as one. For thirty or more degrees of freedom a normal approximation is often suggested when too many of the expected cell counts are lower than five.

  2. Barron Brainerd, “An exploratory study of pronouns and articles as indices of genre in English,” Language and Style, 5 (1972), 239–259.

  3. In general, the power of the test increases with the number of degrees of freedom. The power of a test is, roughly speaking, the chance of rejecting a hypothesis when it is false. This varies with the choice of critical level and the choice of alternative to the null hypothesis. One test is more powerful than another if, whatever the choice of critical level, the first test has a higher probability of rejecting the hypothesis when it is false. Thus the chi-squared obtained in Table 3.3 should be given more consideration than that of Table 3.4.

  4. Barron Brainerd, “Article use as an indicator of style among English-language authors,” in Linguistik und Statistik, ed. S. Jäger (Braunschweig: Vieweg, 1972), pp. 11–32.

  5. In some cases it can be shown that a certain modification of the Binomial distribution, based on a Markov model of text generation, yields a better fit than the Poisson fit. However, one of the applications of the knowledge that a sample is Poisson lies in the remark that if a random variable \(X\) is Poisson, then the random variable \(Y = \sqrt{X + 3/8}\) is approximately normally distributed with variance 1/4. The classical hypothesis tests can then be applied to \(Y\). The sort of approximation that we achieve here is adequate for these purposes.

  6. Since articles are not, strictly speaking, random (an article cannot be succeeded immediately by another), the kind of justification presented here can only work as a first approximation. However, it can be shown that the probabilities \(P(X = k)\) are still approximately Poisson even when dependence is present.

  7. See footnote 3 for a discussion of the power of a test.

  8. See M.G. Kendall and A. Stuart, The advanced theory of statistics, Vol. II (London: Charles Griffin, 1961), pp. 578–579, for a derivation of this statistic.

  9. These ranges can be obtained from a table of the percentage points or quantiles of the \(\chi^2\)-distribution with 49 degrees of freedom. If you are using chi-squared to test the goodness of fit of data to a continuous distribution, best results can be obtained by choosing the cells so that they have equal expected values.
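
The more conservative rule quoted in note 1 can be checked mechanically before a chi-squared test is run. The following is only a minimal sketch of that check: the function name `meets_conservative_rule`, its parameter names, and the sample expected counts are illustrative assumptions, not taken from the article.

```python
def meets_conservative_rule(expected, floor=1.0, usual_min=5.0, max_small_share=0.20):
    """Check the conservative rule of thumb quoted in note 1.

    expected: expected cell counts (hypothetical input, one number per cell).
    The rule passes when no expected count is below `floor` and at most
    `max_small_share` of the cells have an expected count below `usual_min`.
    """
    if not expected:
        raise ValueError("at least one cell is required")
    # No cell may have an expected count lower than one.
    if any(count < floor for count in expected):
        return False
    # At most 20 percent of the cells may fall below five.
    small_cells = sum(1 for count in expected if count < usual_min)
    return small_cells <= max_small_share * len(expected)


# Illustrative expected counts (made up for this sketch).
print(meets_conservative_rule([12.0, 8.0, 6.0, 5.0, 1.5]))  # one small cell out of five
print(meets_conservative_rule([12.0, 8.0, 4.0, 3.0, 6.0]))  # two small cells out of five
```

When the check fails, the usual remedy is to merge adjacent cells until it passes, which reduces the degrees of freedom accordingly.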

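The remark in note 5, that \(Y = \sqrt{X + 3/8}\) has variance close to 1/4 when \(X\) is Poisson, can be checked numerically. The sketch below sums the Poisson probabilities directly using only the standard library; the function names, the truncation at 400 terms, and the means tried are assumptions chosen for illustration.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable, computed in log space to avoid overflow."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def transformed_variance(mean, terms=400):
    """Variance of Y = sqrt(X + 3/8) when X is Poisson with the given mean.

    The infinite sum is truncated at `terms`, which is ample for the
    moderate means used here (an assumption of this sketch).
    """
    probs = [poisson_pmf(k, mean) for k in range(terms)]
    values = [math.sqrt(k + 3.0 / 8.0) for k in range(terms)]
    first_moment = sum(p * y for p, y in zip(probs, values))
    second_moment = sum(p * y * y for p, y in zip(probs, values))
    return second_moment - first_moment ** 2


for mean in (10.0, 20.0):
    print(mean, round(transformed_variance(mean), 4))
```

The variance stays near 1/4 regardless of the mean, which is what allows tests that assume a known, constant variance to be applied to \(Y\).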
Brainerd, B. Statistical analysis of lexical data using chi-squared and related distributions. Comput Hum 9, 161–178 (1975). https://doi.org/10.1007/BF02402331
