Abstract
This paper proposes a scheme for estimating the frequency of occurrence of English words from the product of position-dependent letter frequencies. A sampling method is described for computing these frequencies at a given confidence limit from a minimum number of words. Computations for words of different lengths can be normalised under the assumption of a log-normal distribution of word size within the language. The normalised position-dependent letter frequency plots for 2-, 3- and 4-letter English words are presented. These plots are derived from the set of types of a given length that account for 80% of the observed tokens of the same length within a large corpus. The frequency of occurrence of English words can be approximated when modified conditional probability plots are used in conjunction with a scheme of transition diagrams for finite automata that synthesise these words. The three transition diagrams for all 2-letter English words contained in the Oxford English Dictionary are presented, along with statistics on their observed and estimated word frequencies.
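The central idea, estimating a word's frequency from the product of position-dependent letter frequencies, can be illustrated with a minimal sketch. It is not the paper's procedure: the token counts are invented toy data, the function names `position_letter_freqs` and `estimate_word_freq` are hypothetical, and the sketch omits the sampling method, the log-normal normalisation and the modified conditional probabilities mentioned above.

```python
from collections import Counter
from math import prod

def position_letter_freqs(token_counts, length):
    """Build one {letter: relative frequency} table per letter position,
    weighting each word of the given length by its token count."""
    tables = [Counter() for _ in range(length)]
    total = 0
    for word, count in token_counts.items():
        if len(word) != length:
            continue
        total += count
        for i, ch in enumerate(word):
            tables[i][ch] += count
    if total == 0:
        return [{} for _ in range(length)]
    return [{ch: c / total for ch, c in t.items()} for t in tables]

def estimate_word_freq(word, tables):
    """Estimate a word's relative frequency as the product of its
    per-position letter frequencies (positions treated as independent)."""
    return prod(tables[i].get(ch, 0.0) for i, ch in enumerate(word))

# Toy token counts for 2-letter words (illustrative values, not corpus data).
toy_counts = {"of": 40, "to": 30, "in": 20, "it": 10}
tables = position_letter_freqs(toy_counts, 2)

for w in ("of", "to", "on"):
    observed = toy_counts.get(w, 0) / sum(toy_counts.values())
    print(w, round(estimate_word_freq(w, tables), 4), observed)
```

In this toy data the plain product underestimates "of" (0.16 against an observed 0.40) and gives a non-zero estimate to the unseen word "on". Gaps of this kind suggest why a plain product is only a rough approximation; the abstract itself states that word frequencies can be approximated when modified conditional probability plots are used together with the transition diagrams.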
© 1982 D. Reidel Publishing Company
O’Mara, K. (1982). On the Development of a Model for Determining the Frequency of Occurrence of English Language Words. In: Kittler, J., Fu, K.S., Pau, LF. (eds) Pattern Recognition Theory and Applications. NATO Advanced Study Institutes Series, vol 81. Springer, Dordrecht. https://doi.org/10.1007/978-94-009-7772-3_32
DOI: https://doi.org/10.1007/978-94-009-7772-3_32
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-009-7774-7
Online ISBN: 978-94-009-7772-3
eBook Packages: Springer Book Archive