Abstract

This paper deals with the problem of estimating a transmitted string X* by processing the corresponding string Y, which is a noisy version of X*. We assume that Y contains substitution, insertion and deletion errors, and that X* is an element of a finite (but possibly large) dictionary, H. The best estimate X+ of X* is defined as that element of H which minimizes the Generalized Levenshtein Distance D(X, Y) between X and Y, over all X ∈ H. All existing techniques for computing X+ require a separate evaluation of the edit distance between Y and every X ∈ H. In this paper, we show how D(X, Y) can be evaluated for every X ∈ H simultaneously, without resorting to any parallel computations. This is achieved by using an additional data structure called the Linked List of Prefixes (LLP), which is built “on top of” the trie representation of the dictionary. For a dictionary made from the set of the 1023 most common words augmented by computer-related words, the computational advantage gained is at least 50% measured in terms of time and at least 80% measured in terms of the number of operations required. The accuracy forfeited is negligible.
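The idea of sharing work across dictionary words with common prefixes can be illustrated with a minimal sketch. It is not the LLP-based method of the paper, whose details are not given in this abstract: it assumes a plain trie, constant per-operation costs in place of the generalized (symbol-dependent) costs, and no pruning, so it returns the exact minimizer. The names `TrieNode`, `build_trie` and `nearest_word` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of a character trie over the dictionary H (illustrative)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set when a dictionary word ends here


def build_trie(words: List[str]) -> TrieNode:
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def nearest_word(
    root: TrieNode,
    y: str,
    sub_cost: float = 1.0,  # assumed constant costs, not the paper's
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
) -> Tuple[Optional[str], float]:
    """Return (X, D(X, Y)) minimizing the edit distance over all words in the trie.

    Each trie node owns one row of the dynamic-programming table, so the row
    for a shared prefix is computed once and reused by every word below it.
    """
    # Row for the empty prefix: turning "" into y[:j] takes j insertions.
    first_row = [j * ins_cost for j in range(len(y) + 1)]
    best: Tuple[Optional[str], float] = (None, float("inf"))
    if root.word is not None:  # the empty string is in the dictionary
        best = (root.word, first_row[-1])

    def visit(node: TrieNode, ch: str, prev_row: List[float]) -> None:
        nonlocal best
        # Column 0: the current prefix must be deleted entirely to reach "".
        row = [prev_row[0] + del_cost]
        for j in range(1, len(y) + 1):
            row.append(min(
                row[j - 1] + ins_cost,       # insert y[j-1]
                prev_row[j] + del_cost,      # delete ch
                prev_row[j - 1] + (0.0 if y[j - 1] == ch else sub_cost),
            ))
        if node.word is not None and row[-1] < best[1]:
            best = (node.word, row[-1])
        for c, child in node.children.items():
            visit(child, c, row)

    for c, child in root.children.items():
        visit(child, c, first_row)
    return best
```

With a dictionary such as `["for", "form", "fort", "forth", "fourth"]`, the rows for the prefixes "f", "fo" and "for" are each computed once, whereas a word-by-word evaluation would recompute them for every word that shares them.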

Keywords

Edit Distance; Edit Operation; Longest Common Subsequence; Dynamic Programming Principle

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • B. John Oommen (1)
  • Ghada Badr (2)
  1. Fellow of the IEEE, School of Computer Science, Carleton University, Ottawa, Canada
  2. Ph.D. student, School of Computer Science, Carleton University, Ottawa, Canada
