Fast string correction with Levenshtein automata

  • Klaus U. Schulz
  • Stoyan Mihov
Original Research Paper

DOI: 10.1007/s10032-002-0082-8

Cite this article as:
Schulz, K. & Mihov, S. IJDAR (2002) 5: 67. doi:10.1007/s10032-002-0082-8

Abstract.

The Levenshtein distance between two words is the minimal number of insertions, deletions or substitutions needed to transform one word into the other. A Levenshtein automaton of degree n for a word W is a finite state automaton that recognizes the set of all words V whose Levenshtein distance to W does not exceed n. We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein automaton of degree n for W in time linear in the length of W. Given an electronic dictionary implemented as a trie or a finite state automaton, the Levenshtein automaton for W can be used to control the search in the lexicon so that exactly those lexical words V are generated whose Levenshtein distance to W does not exceed the given bound. This leads to a very fast method for correcting corrupted input words of unrestricted text using large electronic dictionaries. We then introduce a second method that avoids the explicit computation of Levenshtein automata and is even more efficient. Evaluation results are given that also cover variants of both methods based on modified Levenshtein distances, in which further primitive edit operations (transpositions, merges and splits) are used.
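To make the definitions concrete, the following Python sketch computes the Levenshtein distance and then walks a trie-shaped lexicon, returning exactly the words within distance n of a query word. It is not the automaton construction described in the abstract: it swaps in the simpler, well-known technique of carrying one dynamic-programming row per trie node and pruning branches whose row minimum already exceeds n. It yields the same set of words but makes no claim to the linear-time construction or the efficiency reported in the paper. All names (`levenshtein`, `build_trie`, `search`, `END`) and the sample lexicon are illustrative assumptions.

```python
END = "$end"  # hypothetical end-of-word marker; multi-character, so it cannot clash with a letter


def levenshtein(v, w):
    """Minimal number of insertions, deletions and substitutions turning v into w."""
    prev = list(range(len(w) + 1))
    for i, cv in enumerate(v, start=1):
        cur = [i]
        for j, cw in enumerate(w, start=1):
            cur.append(min(prev[j] + 1,                # delete cv
                           cur[j - 1] + 1,             # insert cw
                           prev[j - 1] + (cv != cw)))  # substitute or match
        prev = cur
    return prev[-1]


def build_trie(words):
    """Store the lexicon as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def search(trie, w, n):
    """Return sorted (word, distance) pairs for all lexicon words within distance n of w."""
    results = []

    def walk(node, prefix, row):
        # row[j] is the distance between the current trie prefix and w[:j].
        if END in node and row[-1] <= n:
            results.append((prefix, row[-1]))
        for ch, child in node.items():
            if ch == END:
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(w) + 1):
                new_row.append(min(row[j] + 1,
                                   new_row[j - 1] + 1,
                                   row[j - 1] + (w[j - 1] != ch)))
            # The row minimum never decreases along a branch, so pruning is safe.
            if min(new_row) <= n:
                walk(child, prefix + ch, new_row)

    walk(trie, "", list(range(len(w) + 1)))
    return sorted(results)


lexicon = ["chase", "cheese", "chose", "choose", "house", "horse"]
matches = search(build_trie(lexicon), "chose", 1)
```

With this sample lexicon and bound n = 1, `matches` contains "chase", "choose" and "chose"; the other three words are at distance 2 from the query and are never fully generated.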

Keywords: Spelling correction – Levenshtein distance – Optical character recognition – Electronic dictionaries 


Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Klaus U. Schulz (1)
  • Stoyan Mihov (2)
  1. CIS, University of Munich, Oettingenstr. 67, 80538 Munich, Germany (e-mail: schulz@cis.uni-muenchen.de)
  2. Linguistic Modelling Laboratory, LPDP, Bulgarian Academy of Sciences, Bulgaria (e-mail: stoyan@lml.bas.bg)
