Computer programs for spelling correction: An experiment in program design

  • James L. Peterson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 96)


Hash Table; Program Design; Optical Character Recognition; Spelling Error; Spell Correction




  1. C. N. Alberga, “String Similarity and Misspellings”, Communications of the ACM, Volume 10, Number 5, (May 1967), pages 302–313. Master's Thesis. Reviews previous work. Mentions two researchers at IBM Watson who suggest finding the longest common substrings and assigning probabilities based on the portion of the correct string matched. Does rather extensive but unreadable analysis of different algorithms, but with no real results. Reviewed in Computing Reviews, Volume 8, Number 5, Review 12,712.
  2. W. W. Bledsoe and I. Browning, “Pattern Recognition and Reading by Machine”, Proceedings of the Eastern Joint Computer Conference, (1959), pages 225–232. Uses a small dictionary with probability of each word for OCR.
  3. C. R. Blair, “A Program for Correcting Spelling Errors”, Information and Control, Volume 3, Number 1, (March 1960), pages 60–67. Weights the letters to create a four or five letter abbreviation for each word. If abbreviations match, the words are assumed to be the same. Mentions the possibility (impossibility) of building in rules like: “i before e except after c, and when sounded like a as in neighbor and weigh”, ...
  4. C. P. Bourne, “Frequency and Impact of Spelling Errors in Bibliographic Data Bases”, Information Processing and Management, Volume 13, Number 1, (1977), pages 1–12. Examines the frequency of spelling errors in a sample drawn from 11 machine-readable bibliographic data bases. Finds that spelling errors are sufficiently severe that they should influence the search strategy to find information in the data base. Errors are not only in the input queries, but also in the data base itself.
  5. G. Carlson, “Techniques for Replacing Characters that are Garbled on Input”, Proceedings of the 1966 Spring Joint Computer Conference, (1966), pages 189–192. Uses trigrams to correct OCR input of genealogical records.
  6. R. W. Cornew, “A Statistical Method of Spelling Correction”, Information and Control, Volume 12, Number 2, (February 1968), pages 79–93. Uses digrams first, then a dictionary search to correct one-character substitutions. The dictionary already exists for the speech output problem.
  7. F. J. Damerau, “A Technique for Computer Detection and Correction of Spelling Errors”, Communications of the ACM, Volume 7, Number 3, (March 1964), pages 171–176. The four errors (wrong, missing, extra, and transposed letters) are mentioned here as accounting for 80% of errors. Uses a bit vector for a preliminary compare (bit[i] is on if letter i is in the word). Reviewed in Computing Reviews, Volume 5, Number 4, Review 5,962.
  8. L. Davidson, “Retrieval of Misspelled Names in an Airlines Passenger Record System”, Communications of the ACM, Volume 5, Number 3, (March 1962), pages 169–171. Abbreviates name to match stored name. Either name (token or dictionary) may be wrong.
  9. E. G. Fisher, “The Use of Context in Character Recognition”, COINS TR 76-12, Department of Computer and Information Sciences, University of Massachusetts, Amherst, (July 1976), 189 pages. Considers the problem of automatically reading addresses from letters by the Post Office; also Morse code recognition.
  10. D. N. Freeman, Error Correction in CORC: The Cornell Computing Language, Ph.D. Thesis, Department of Computer Science, Cornell University, (September 1963).
  11. E. J. Galli and H. M. Yamada, “An Automatic Dictionary and Verification of Machine-Readable Text”, IBM Systems Journal, Volume 6, Number 3, (1967), pages 192–207. Good discussion of the general problem of token identification and verification.
  12. E. J. Galli and H. M. Yamada, “Experimental Studies in Computer-Assisted Correction of Unorthographic Text”, IEEE Transactions on Engineering Writing and Speech, Volume EWS-11, Number 2, (August 1968), pages 75–84. Good review and explanation of techniques and problems.
  13. J. J. Giangardella, J. Hudson and R. S. Roper, “Spelling Correction by Vector Representation Using a Digital Computer”, IEEE Transactions on Engineering Writing and Speech, Volume EWS-10, Number 2, (December 1967), pages 57–62. Defines hash functions to give a vector representation of a word as: Norm, Angle, and Distance. This speeds search time (over linear search) and aids in localizing the search for correct spellings, since interchanged characters have the same Norm and an extra or deleted letter is within a fixed range.
  14. H. T. Glantz, “On the Recognition of Information with a Digital Computer”, Journal of the ACM, Volume 4, Number 2, (April 1957), pages 178–188. Seems to want either exact match or greatest number of equal characters in equal positions. Good for OCR.
  15. A. R. Hanson, E. M. Riseman and E. G. Fisher, “Context in Word Recognition”, Pattern Recognition, Volume 8, Number 1, (January 1976), pages 35–45.
  16. L. D. Harmon, “Automatic Reading of Cursive Script”, Proceedings of a Symposium on Optical Character Recognition, Spartan Books, (January 1962), pages 151–152. Uses digrams and a “confusion matrix” to give the probability of letter substitutions.
  17. L. D. Harmon, “Automatic Recognition of Print and Script”, Proceedings of the IEEE, Volume 60, Number 10, (October 1972), pages 1165–1176. Surveys the techniques for computer input of print, including a section on error detection and correction. Indicates that digrams can catch 70 percent of incorrect letter errors.
  18. E. B. James and D. P. Partridge, “Tolerance to Inaccuracy in Computer Programs”, Computer Journal, Volume 19, Number 3, (August 1976), pages 207–212.
  19. D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, (1973), 722 pages.
  20. H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English, Brown University Press, (1967), 424 pages. Gives frequency and statistical information for the Brown Corpus of over a million tokens.
  21. L. A. Leslie, 20,000 Words, McGraw-Hill, (1977), 282 pages. Representative of several books which list words.
  22. R. Lowrance and R. A. Wagner, “An Extension of the String-to-String Correction Problem”, Journal of the ACM, Volume 22, Number 2, (April 1975), pages 175–183. Extends Wagner and Fischer [1974] to include adjacent transpositions as an edit operation.
  23. C. K. McElwain and M. E. Evans, “The Degarbler — A Program for Correcting Machine-Read Morse Code”, Information and Control, Volume 5, Number 4, (December 1962), pages 368–384. Uses digrams, trigrams and a dictionary to correct up to 70% of errors in machine-recognized Morse code. Uses 5 special rules for the types of errors which can occur (dot interpreted as dash, ...).
  24. L. E. McMahon, L. L. Cherry, and R. Morris, “Statistical Text Processing”, The Bell System Technical Journal, Volume 57, Number 6, Part 2, (July–August 1978), pages 2137–2154. Good description of how computer systems can be used to process text, including spelling correction and an attempt at a syntax checker.
  25. H. L. Morgan, “Spelling Correction in Systems Programs”, Communications of the ACM, Volume 13, Number 2, (February 1970), pages 90–94. Use of spelling correction for compilers and operating system JCL. Uses a dictionary with the four classes of errors. Also possible to use syntax and semantics to narrow the search space. Reports on the CORC compiler [Freeman 1963], which associated a probability with each possible misspelling.
  26. R. Morris and L. L. Cherry, “Computer Detection of Typographical Errors”, IEEE Transactions on Professional Communications, Volume PC-18, Number 1, (March 1975), pages 54–64. Describes the TYPO program for the UNIX system.
  27. F. Muth and A. L. Tharp, “Correcting Human Error in Alphanumeric Terminal Input”, Information Processing and Management, Volume 13, Number 6, (1977), pages 329–337. Suggests a tree structure (like a trie) with special search procedures to allow corrections to be found. Damerau's review points out that their search strategies need improvement and that their tree is much too big to be practical. Each node of the tree has one character (data) and three pointers (father, brother, son). Reviewed in Computing Reviews, Volume 19, Number 6, Review 33,119.
  28. J. A. O'Brien, “Computer Program for Automatic Spelling Correction”, Technical Report RADC-TR-66-696, Rome Air Development Center, New York, (March 1967).
  29. T. Okuda, E. Tanaka and T. Kasai, “A Method for the Correction of Garbled Words Based on the Levenshtein Metric”, IEEE Transactions on Computers, Volume C-25, Number 2, (February 1976), pages 172–177.
  30. D. P. Partridge and E. B. James, “Natural Information Processing”, International Journal of Man-Machine Studies, Volume 6, Number 2, (March 1974), pages 205–235. Uses a tree structure representation of words to allow checks for incorrect input words. Done in the context of correcting keywords in a Fortran program, but the paper covers more than that. Frequencies are kept with tree branches to allow the tree to modify itself to optimize search.
  31. E. M. Riseman and R. W. Ehrich, “Contextual Word Recognition Using Binary Digrams”, IEEE Transactions on Computers, Volume C-20, Number 4, (April 1971), pages 397–403. Indicates that the important property of digrams is only their zero or non-zero nature.
  32. E. M. Riseman and A. R. Hanson, “A Contextual Postprocessing System for Error Correction Using Binary n-Grams”, IEEE Transactions on Computers, Volume C-23, Number 5, (May 1974), pages 480–493. Suggests using digrams (2-grams), trigrams (3-grams), or in general n-grams, but only storing whether the probability is zero or non-zero (1 bit). Also suggests positional n-grams, which keep a separate n-gram table for each pair of positions (for each i and j we have the digram table for characters in position i and position j in a word).
  33. W. S. Rosenbaum and J. J. Hilliard, “Multifont OCR Postprocessing System”, IBM Journal of Research and Development, Volume 19, Number 4, (July 1975), pages 398–421. Very specific emphasis on OCR problems. Some on searching with a match-any character.
  34. B. A. Sheil, “Median Split Trees: A Fast Look-up Technique for Frequently Occurring Keys”, Communications of the ACM, Volume 21, Number 11, (November 1978), pages 947–958.
  35. A. J. Szanser, “Error-Correcting Methods in Natural Language Processing”, Information Processing 68 — Proceedings of IFIP 68, North Holland, Amsterdam, (August 1968), pages 1412–1416. Confused paper dealing with correction for machine translation and automatic interpretation of shorthand transcript tapes. Suggests “elastic” matching.
  36. A. J. Szanser, “Automatic Error-Correction in Natural Languages”, Information Storage and Retrieval, Volume 5, Number 4, (February 1970), pages 169–174.
  37. E. Tanaka and T. Kasai, “Correcting Method of Garbled Languages Using Ordered Key Letters”, Electronics and Communications in Japan, Volume 55, Number 6, (1972), pages 127–133.
  38. P. J. Tenczar and W. W. Golden, “Spelling, Word and Concept Recognition”, Report CERL-X-35, University of Illinois, Urbana, Illinois, (October 1962).
  39. R. B. Thomas and M. Kassler, “Character Recognition in Context”, Information and Control, Volume 10, Number 1, (January 1967), pages 43–64. Considers tetragrams (sequences of 4 letters). Of 27^4 possible tetragrams, only 12 percent (61,273) are legal.
  40. L. Thorelli, “Automatic Correction of Errors in Text”, BIT, Volume 2, Number 1, (1962), pages 45–62. Sort of a survey/tutorial. Mentions digrams and dictionary look-up. Suggests maximizing probabilities.
  41. C. M. Vossler and N. M. Branston, “The Use of Context for Correcting Garbled English Text”, Proceedings of the 19th ACM National Convention, (August 1964), pages D2.4-1 to D2.4-13. Uses a confusion matrix and word probabilities to select the most probable input word. Also uses digrams. Aims to improve OCR input.
  42. R. A. Wagner and M. J. Fischer, “The String-to-String Correction Problem”, Journal of the ACM, Volume 21, Number 1, (January 1974), pages 168–173. Gives an algorithm for determining the similarity of two strings as the minimum number of edit operations needed to transform one into the other. Allowed edit operations are add, delete or substitute one character.
  43. C. K. Wong and A. K. Chandra, “Bounds for the String Editing Problem”, Journal of the ACM, Volume 23, Number 1, (January 1976), pages 13–16. Shows that the complexity bounds of [Wagner and Fischer 1974] are not only sufficient but also necessary.
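
The string similarity measure summarized under Wagner and Fischer [1974] above (the minimum number of add, delete, or substitute operations that turn one string into another) can be illustrated with a short dynamic-programming sketch. This is only an illustration, not code from the chapter or from the cited paper: the function name `edit_distance` is hypothetical, and every operation is assumed to cost 1.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character adds, deletes, and substitutions
    needed to turn `source` into `target` (unit costs are an assumption)."""
    # previous[j] = distance between the processed prefix of source
    # and the first j characters of target; starts with the empty prefix.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning i source characters into an empty target takes i deletes.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (s_char != t_char)  # free if equal
            delete = previous[j] + 1       # drop s_char from source
            add = current[j - 1] + 1       # add t_char to source
            current.append(min(substitute, delete, add))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for pair in [("kitten", "sitting"), ("teh", "the"), ("spell", "spell")]:
        print(pair, edit_distance(*pair))
```

With only these three operations, a transposed pair such as "teh" for "the" counts as two edits; the extension noted under Lowrance and Wagner [1975] adds adjacent transposition as a single operation.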

Copyright information

© Springer-Verlag Berlin Heidelberg 1980

Authors and Affiliations

  • James L. Peterson
    The Department of Computer Sciences, The University of Texas, Austin, USA
