On the Development of a Model for Determining the Frequency of Occurrence of English Language Words

O’Mara, Kevin

doi:10.1007/978-94-009-7772-3_32

Kevin O’Mara

Part of the book series: NATO Advanced Study Institutes Series (ASIC, volume 81)

Abstract

This paper proposes a scheme for estimating the frequency of occurrence of English words from the product of position-dependent letter frequencies. A sampling method is described for computing these frequencies, at a given confidence limit, from a minimum number of words. Computations for words of different lengths can be normalised under the assumption that word length is log-normally distributed within the language. Normalised position-dependent letter frequency plots are presented for 2-, 3- and 4-letter English words. These plots are derived from the set of types of a given length that account for 80% of the observed tokens of the same length within a large corpus. The frequency of occurrence of English words can be approximated when modified conditional probability plots are used in conjunction with a scheme of transition diagrams for finite automata that synthesise these words. The three transition diagrams for all 2-letter English words contained in the Oxford English Dictionary are presented, along with statistics on their observed and estimated word frequencies.
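The core idea of the abstract, scoring a word by the product of position-dependent letter frequencies, can be illustrated with a minimal Python sketch. This is not the paper's procedure: the sampling method, the log-normal normalisation and the modified conditional probabilities are not reproduced here. The function names and the toy token counts are hypothetical, and letters are treated as independent across positions purely for illustration.

```python
from collections import Counter


def top_types_by_coverage(token_counts, coverage=0.80):
    """Return the most frequent types whose tokens together reach
    `coverage` of all tokens (the abstract uses 80%)."""
    total = sum(token_counts.values())
    selected, running = {}, 0
    ranked = sorted(token_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for word, count in ranked:
        if running >= coverage * total:
            break
        selected[word] = count
        running += count
    return selected


def positional_letter_freqs(token_counts, length):
    """Token-weighted relative frequency of each letter at each position,
    computed over words of exactly `length` letters."""
    tables = [Counter() for _ in range(length)]
    total = 0
    for word, count in token_counts.items():
        if len(word) != length:
            continue
        total += count
        for position, letter in enumerate(word):
            tables[position][letter] += count
    if total == 0:
        raise ValueError("no words of the requested length")
    return [{letter: c / total for letter, c in table.items()}
            for table in tables]


def product_estimate(word, freqs):
    """Estimate a word's relative frequency as the product of its
    position-dependent letter frequencies (independence assumed)."""
    score = 1.0
    for position, letter in enumerate(word):
        score *= freqs[position].get(letter, 0.0)
    return score


# Hypothetical token counts for four 2-letter words (invented toy data).
toy_counts = {"of": 40, "to": 30, "in": 20, "it": 10}
toy_freqs = positional_letter_freqs(toy_counts, length=2)

for candidate in ("of", "it", "on"):
    print(candidate, round(product_estimate(candidate, toy_freqs), 4))
```

In this toy corpus the string "on" receives a non-zero score although it never occurs, which suggests why a plain product needs additional structure, such as the transition diagrams mentioned in the abstract, to restrict estimates to actual words.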

Author information

Authors and Affiliations

Department of Computer Science, Concordia University, Montreal, Quebec, Canada
Kevin O’Mara

Editor information

Editors and Affiliations

Science and Engineering Research Council, Rutherford and Appleton Laboratories, Chilton, Didcot, England
Josef Kittler
Oxford, UK
Josef Kittler
Purdue University, West Lafayette, Indiana, USA
King Sun Fu
Ecole Nationale Superieure des Telecommunications, Paris, France
Louis-François Pau

Cite this paper

O’Mara, K. (1982). On the Development of a Model for Determining the Frequency of Occurrence of English Language Words. In: Kittler, J., Fu, K.S., Pau, LF. (eds) Pattern Recognition Theory and Applications. NATO Advanced Study Institutes Series, vol 81. Springer, Dordrecht. https://doi.org/10.1007/978-94-009-7772-3_32

DOI: https://doi.org/10.1007/978-94-009-7772-3_32
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-009-7774-7
Online ISBN: 978-94-009-7772-3
eBook Packages: Springer Book Archive
