Pattern Analysis and Applications, Volume 10, Issue 1, pp 1–13

Breadth-first search strategies for trie-based syntactic pattern recognition

Theoretical Advances

Abstract

Dictionary-based syntactic pattern recognition of strings attempts to recognize a transmitted string X* by processing its noisy version, Y, without sequentially comparing Y with every element X in the finite (but possibly large) dictionary, H. The best estimate X+ of X* is defined as the element of H that minimizes the generalized Levenshtein distance (GLD) D(X, Y) between X and Y, over all X ∈ H. The non-sequential pattern recognition (PR) computation of X+ involves a compact trie-based representation of H. In this paper, we show how this computation can be optimized by incorporating breadth-first search schemes on the underlying graph structure. This heuristic emerges from the trie-based dynamic programming recursive equations, which can be effectively implemented using a new data structure called the linked list of prefixes, which can be built separately or “on top of” the trie representation of H. The new scheme does not restrict the number of errors in Y to be merely a small constant, as is done in most of the available methods. The main contribution is that our new approach can be used for general GLD costs and not merely for 0/1 costs. It is also applicable when all possible correct candidates need to be known, and not just the best match. These constitute the cases in which the “cutoffs” cannot be used in the DFS trie-based technique (Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996). The new technique is compared with the DFS trie-based technique (Risvik in United States Patent 6377945 B1, 23 April 2002; Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996) using three benchmark dictionaries, large and small, with different errors. In each case, we demonstrate marked improvements of up to 21% in the number of operations needed, while maintaining the same accuracy. Additionally, some further improvements can be obtained by introducing knowledge of the maximum number or percentage of errors in Y.
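The setting described in the abstract can be illustrated with a small sketch. It is not the authors' linked list of prefixes, and it applies no cutoffs; it only shows, under assumed cost functions, how a breadth-first walk over a trie lets all dictionary words that share a prefix reuse the same dynamic programming row, and that the result agrees with the sequential comparison the paper wants to avoid. The names `sub_cost`, `ins_cost`, `del_cost`, `best_sequential` and `best_trie_bfs`, the cost values and the toy dictionary are all hypothetical.

```python
from collections import deque

# Hypothetical symbol-dependent costs (illustrative values, not from the paper).
def sub_cost(a, b):
    if a == b:
        return 0.0
    vowels = "aeiou"
    return 0.5 if (a in vowels) == (b in vowels) else 1.0

def ins_cost(b):  # cost of inserting symbol b of Y
    return 0.8

def del_cost(a):  # cost of deleting symbol a of X
    return 0.8

def first_row(y):
    """DP row for the empty prefix of X: every symbol of Y is an insertion."""
    row = [0.0]
    for b in y:
        row.append(row[-1] + ins_cost(b))
    return row

def next_row(prev, a, y):
    """Row for prefix+a, computed from the row of the prefix alone."""
    row = [prev[0] + del_cost(a)]
    for j in range(1, len(y) + 1):
        row.append(min(prev[j - 1] + sub_cost(a, y[j - 1]),  # substitute or match
                       prev[j] + del_cost(a),                # delete a
                       row[j - 1] + ins_cost(y[j - 1])))     # insert y[j-1]
    return row

def gld(x, y):
    """Generalized Levenshtein distance D(X, Y) under the costs above."""
    row = first_row(y)
    for a in x:
        row = next_row(row, a, y)
    return row[-1]

def best_sequential(words, y):
    """Baseline: compare Y with every X in H, one word at a time."""
    return min(words, key=lambda x: (gld(x, y), x))

def build_trie(words):
    root = {"children": {}, "word": None}
    for w in words:
        node = root
        for a in w:
            node = node["children"].setdefault(a, {"children": {}, "word": None})
        node["word"] = w
    return root

def best_trie_bfs(words, y):
    """Breadth-first walk of the trie; each node's row is computed once
    and shared by every word that passes through that node."""
    queue = deque([(build_trie(words), first_row(y))])
    best = None
    while queue:
        node, row = queue.popleft()
        if node["word"] is not None:
            cand = (row[-1], node["word"])
            if best is None or cand < best:
                best = cand
        for a, child in node["children"].items():
            queue.append((child, next_row(row, a, y)))
    return best[1], best[0]

H = ["pattern", "patent", "parent", "trie", "tree", "string"]
word, dist = best_trie_bfs(H, "patern")
print(word, dist)
```

In this toy dictionary the prefix "pat" is processed once for both "pattern" and "patent", whereas the sequential baseline recomputes it for each word. Because no pruning is applied, the walk can also be adapted to report every candidate and its distance instead of only the best one.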

Keywords

Trie-based syntactic pattern recognition; Approximate string matching; Noisy syntactic recognition using tries

References

  1. Acharya A, Zhu H, Shen K (1999) Adaptive algorithms for cache-efficient trie search. In: ACM and SIAM workshop on algorithm engineering and experimentation, January 1999, pp 296–311
  2. Amengual JC, Vidal E (1998) Efficient error-correcting Viterbi parsing. IEEE Trans Commun 20(10):1109–1116
  3. Amengual JC, Vidal E (1998) The Viterbi algorithm. IEEE Trans Pattern Anal Mach Intell 20(10):268–278
  4. Baeza-Yates R, Navarro G (1998) Fast approximate string matching in a dictionary. In: Proceedings of the 5th South American symposium on string processing and information retrieval (SPIRE’98), IEEE CS Press, pp 14–22
  5. Bentley J, Sedgewick R (1997) Fast algorithms for sorting and searching strings. In: 8th annual ACM-SIAM symposium on discrete algorithms, New Orleans, January 1997, pp 360–369
  6. Bouloutas A, Hart GW, Schwartz M (1991) Two extensions of the Viterbi algorithm. IEEE Trans Inf Theory 37(2):430–436
  7. Bucher P, Hoffmann K (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: Proceedings of the 4th international conference on intelligent systems for molecular biology, ISMB, vol 96. AAAI Press, Menlo Park, pp 44–51
  8. Bunke H (1993) Structural and syntactic pattern recognition. In: Chen CH, Pau LF, Wang PSP (eds) Handbook of pattern recognition and computer vision. World Scientific, Singapore
  9. Bunke H (1995) Fast approximate matching of words against a dictionary. Computing 55(1):75–89
  10. Bunke H, Csirik J (1993) Parametric string edit distance and its application to pattern recognition. IEEE Trans Syst Man Cybern 25(1):202–206
  11. Clement J, Flajolet P, Vallee B (1998) The analysis of hybrid trie structures. In: Proceedings of the annual ACM-SIAM symposium on discrete algorithms, San Francisco, California, 1998, pp 531–539
  12. Cole R, Gottlieb L, Lewenstein M (2004) Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th annual ACM symposium on theory of computing, Chicago, IL, USA, June 2004, pp 91–100
  13. Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to algorithms. The MIT Press, Cambridge
  14. Dewey G (1923) Relative frequency of English speech sounds. Harvard University Press, MA
  15. Du M, Chang S (1994) An approach to designing very fast approximate string matching algorithms. IEEE Trans Knowl Data Eng 6(4):620–633
  16. Forney GD (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278
  17. Fuketa M, Sumitomo T, Shishibori M, Aoe J (1999) A suffix compression algorithm of tries. In: ICCPOL’99: 18th international conference on computer processing of oriental languages, vol 18, pp 345–348
  18. Kashyap RL, Oommen BJ (1981) An effective algorithm for string correction using generalized edit distances: I. Description of the algorithm and its optimality. Inf Sci 23(2):123–142
  19. Kashyap RL, Oommen BJ (1984) String correction using probabilistic methods. Pattern Recognit Lett, pp 147–154
  20. Levenshtein A (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Dokl 10:707–710
  21. Masek WJ, Paterson MS (1980) A faster algorithm computing string edit distances. J Comput Syst Sci 20:18–31
  22. Mihov S, Schulz K (2002) Fast approximate string matching in large dictionaries. Available: www.cis.uni-muenchen.de//people//schulz//pub//fastapproxsearch.pdf
  23. Miclet L (1990) Grammatical inference. Syntactic Struct Pattern Recognit Appl 237–290
  24. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surveys 33(1):31–88
  25. Oflazer K (1996) Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–89
  26. Okuda T, Tanaka E, Kasai T (1976) A method of correction of garbled words based on the Levenshtein metric. IEEE Trans Comput 25:172–177
  27. Oommen BJ (1987) Constrained string editing. Inf Sci 40(3):267–284
  28. Oommen BJ (1987) Recognition of noisy subsequences using constrained edit distances. IEEE Trans Pattern Anal Mach Intell 9:676–685
  29. Oommen BJ, Badr G (2004) Dictionary-based syntactic pattern recognition using tries. In: Proceedings of the joint IAPR international workshops SSPR 2004 and SPR 2004, Lisbon, Portugal, August 2004, pp 251–259
  30. Oommen BJ, Kashyap RL (1998) A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognit 31:1159–1177
  31. Oommen BJ, Loke RKS. Syntactic pattern recognition involving traditional and generalized transposition errors: attaining the information theoretic bound. (Submitted)
  32. Oommen BJ, Loke RKS (1997) Pattern recognition of strings with substitutions, insertions, deletions and generalized transposition. Pattern Recognit 30:789–800
  33. Oommen BJ, Loke RKS (1999) Designing syntactic pattern classifiers using vector quantization and parametric string editing. IEEE Trans Syst Man Cybern 29:881–888
  34. Perez-Cortes JC, Amengual JC, Arlandis J, Llobet R (2000) Stochastic error correcting parsing for OCR post-processing. In: International conference on pattern recognition ICPR-2000, Barcelona, 2000, pp 4405–4408
  35. Peterson JL (1980) Computer programs for detecting and correcting spelling errors. Commun Assoc Comput Mach 23:676–687
  36. Risvik KM (2002) Search system and method for retrieval of data, and the use thereof in a search engine. United States Patent 6377945 B1, April 23 2002
  37. Sankoff D, Kruskal JB (1983) Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading
  38. Schulz K, Mihov S (2002) Fast string correction with Levenshtein-automata. Int J Doc Anal Recognit 5(1):67–85
  39. Shang H, Merrett TH (1996) Tries for approximate string matching. IEEE Trans Knowl Data Eng 8(4):540–547
  40. Stephen GA (1989) String searching. Prentice-Hall, Englewood Cliffs
  41. Stephen GA (2000) String searching algorithms, vol 6. Lecture Notes Series on Computing, World Scientific, Singapore
  42. Ukkonen E (1985) Algorithm for approximate string matching. Inf Control 64:100–118
  43. Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269
  44. Wagner R, Fischer A (1974) The string-to-string correction problem. J Assoc Comput Mach 21:168–173
  45. Wagner RA (1974) Order-n correction for regular languages. Commun ACM 17:265–268

Copyright information

© Springer-Verlag London Limited 2006

Authors and Affiliations

  1. School of Computer Science, Carleton University, Ottawa, Canada
