On the Development of a Model for Determining the Frequency of Occurrence of English Language Words

  • Conference paper
Pattern Recognition Theory and Applications

Part of the book series: NATO Advanced Study Institutes Series (ASIC, volume 81)

Abstract

This paper proposes a scheme for estimating the frequency of occurrence of English words from the product of position-dependent letter frequencies. A sampling method is described for computing these frequencies at a given confidence limit from a minimum number of words. Computations for words of different lengths can be normalised under the assumption of a log-normal distribution of word size within the language. The normalised position-dependent letter frequency plots for 2-, 3- and 4-letter English words are presented. These plots are derived from the set of types of a given length that account for 80% of the observed tokens of the same length within a large corpus. The frequency of occurrence of English words can be approximated when modified conditional probability plots are used in conjunction with a scheme of transition diagrams for finite automata that synthesise these words. The three transition diagrams for all 2-letter English words contained in the Oxford English Dictionary are presented, along with statistics on their observed and estimated word frequencies.
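The core idea in the abstract, estimating a word's frequency from the product of position-dependent letter frequencies, can be illustrated with a minimal sketch. This is not the paper's procedure: the function names (`types_covering`, `position_letter_frequencies`, `estimate_relative_frequency`) and the toy token counts are hypothetical, the letters are treated as independent across positions, and the log-normal normalisation, the confidence-limit sampling, and the modified conditional probabilities mentioned above are not reproduced.

```python
from collections import Counter
from math import prod


def types_covering(token_counts, length, coverage=0.8):
    """Return the most frequent types of the given length whose tokens
    together account for at least `coverage` of all tokens of that length."""
    same_length = {w: c for w, c in token_counts.items() if len(w) == length}
    total = sum(same_length.values())
    selected, cumulative = [], 0
    # Walk the types from most to least frequent until the target is reached.
    for word, count in sorted(same_length.items(), key=lambda kv: -kv[1]):
        if cumulative >= coverage * total:
            break
        selected.append(word)
        cumulative += count
    return selected


def position_letter_frequencies(token_counts, length):
    """Return a list with one dict per letter position, mapping each letter
    to its relative frequency at that position among tokens of `length`."""
    per_position = [Counter() for _ in range(length)]
    n_tokens = 0
    for word, count in token_counts.items():
        if len(word) != length:
            continue
        n_tokens += count
        for i, letter in enumerate(word):
            per_position[i][letter] += count
    if n_tokens == 0:
        raise ValueError(f"no tokens of length {length}")
    return [{ch: c / n_tokens for ch, c in pos.items()} for pos in per_position]


def estimate_relative_frequency(word, freqs):
    """Estimate a word's relative frequency among words of its length as the
    product of its letters' position-dependent frequencies (independence
    between positions is an assumption of this sketch)."""
    return prod(freqs[i].get(letter, 0.0) for i, letter in enumerate(word))


# Toy token counts, invented purely for illustration (not corpus data).
toy_counts = {"of": 40, "to": 30, "in": 20, "on": 10}
freqs = position_letter_frequencies(toy_counts, 2)
for candidate in ("of", "to", "if", "ox"):
    print(candidate, estimate_relative_frequency(candidate, freqs))
```

On this toy input the product assigns a non-zero estimate to the unseen string "if" because both of its letters occur at the right positions in other words, and it underestimates "of" relative to its observed share. Effects of this kind are one reason a plain product would need correcting, which is consistent with the abstract's pairing of conditional probability plots with transition diagrams that restrict the estimate to strings the automata can actually synthesise.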

Copyright information

© 1982 D. Reidel Publishing Company

About this paper

Cite this paper

O’Mara, K. (1982). On the Development of a Model for Determining the Frequency of Occurrence of English Language Words. In: Kittler, J., Fu, K.S., Pau, LF. (eds) Pattern Recognition Theory and Applications. NATO Advanced Study Institutes Series, vol 81. Springer, Dordrecht. https://doi.org/10.1007/978-94-009-7772-3_32

  • DOI: https://doi.org/10.1007/978-94-009-7772-3_32

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-009-7774-7

  • Online ISBN: 978-94-009-7772-3

  • eBook Packages: Springer Book Archive
