A dictionary matching algorithm fast on the average for terms of varying length

  • Michal Ziv-Ukelson
  • Aaron Kershenbaum
Session I
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1448)

Abstract

We examine the exact dictionary matching problem with dynamic text and static terms, and propose a simple but efficient algorithm whose average performance is sublinear in the size of the text for a wide range of practical problems. The algorithm is based on the Commentz-Walter-Horspool algorithm (CWH), presented by Baeza-Yates and Régnier [10]. Typically, our refinement prunes more than 30% of the characters scanned by CWH when searching natural language text for all occurrences of tags that vary in length and belong to a set of moderate size. This problem arises frequently in practice when scanning text downloaded from the internet, and it accounts for a major portion of the preprocessing time associated with indexing such text for later retrieval. Our approach, which we refer to as layering, keeps track of an upper bound on the maximal length of potential term prefixes ending at each given position in the text. This information is then used to mask out some of the terms and to filter out unnecessary character comparisons during the search. We describe a practical implementation that increases the size of the existing data structures, as well as the preprocessing cost, by only a factor of the size of the longest term in the set.
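The abstract does not spell out CWH or the layering refinement, so the following is only a rough illustration of the general idea they build on: a Horspool-style shift table shared by a set of terms of varying length, combined with a trie over the reversed terms. It is a minimal sketch under stated assumptions, not the authors' implementation, and it omits layering entirely. The function names (`build_shift_table`, `build_reverse_trie`, `find_all`) are hypothetical, and terms are assumed to be non-empty strings.

```python
def build_shift_table(terms):
    """Return (shift, lmin): a safe shift per character, capped at the shortest term length."""
    lmin = min(len(t) for t in terms)
    shift = {}
    for t in terms:
        m = len(t)
        # Every character except the last contributes its distance to the end of the term.
        for j in range(m - 1):
            dist = m - 1 - j
            shift[t[j]] = min(shift.get(t[j], lmin), dist, lmin)
    return shift, lmin


def build_reverse_trie(terms):
    """Trie over reversed terms as nested dicts; the key None marks a complete term."""
    root = {}
    for t in terms:
        node = root
        for c in reversed(t):
            node = node.setdefault(c, {})
        node[None] = t
    return root


def find_all(text, terms):
    """Report (start_index, term) for every occurrence of every term in text."""
    shift, lmin = build_shift_table(terms)
    trie = build_reverse_trie(terms)
    matches = []
    pos = lmin - 1  # earliest position at which any term can end
    while pos < len(text):
        node, i = trie, pos
        # Scan the text backwards from pos while the reversed trie still matches.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            i -= 1
            if None in node:
                term = node[None]
                matches.append((pos - len(term) + 1, term))
        # Horspool-style shift driven by the character under the window's end.
        pos += shift.get(text[pos], lmin)
    return matches


if __name__ == "__main__":
    print(find_all("the cat sat on the mat", ["cat", "the", "mat"]))
```

Characters that appear in no term trigger the maximal shift (the length of the shortest term), which is where the sublinear average behaviour of this family of algorithms comes from. A short term in the set caps every shift, which is the kind of cost that a refinement for terms of varying length aims to reduce.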

Keywords

Match Phase, String Match, Dynamic Text, Natural Language Text, Term Length
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. [1]
    A.V. Aho. Algorithms for Finding Patterns in Strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, pages 257–300. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1990.
  2. [2]
    A.V. Aho and M. Corasick. Efficient string matching: An aid to bibliographic search. Comm. of the ACM, 18(6):333–340, June 1975.
  3. [3]
    A. Amir and M. Farach. Adaptive dictionary matching. Proc. 32nd IEEE FOCS, pages 760–766, 1991.
  4. [4]
    A. Amir, M. Farach, R. Giancarlo, Z. Galil, and K. Park. Dynamic dictionary matching. Journal of Computer and System Sciences, 49(2):208–222, 1994.
  5. [5]
    A. Amir, M. Farach, R.M. Idury, J.A. La Poutré, and A.A. Schaffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258–282, 1995.
  6. [6]
    A. Amir, M. Farach, and Y. Matias. Efficient randomized dictionary matching algorithms. Proc. of 3rd Combinatorial Pattern Matching Conference, pages 259–272, 1992, Tucson, Arizona.
  7. [7]
    A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.
  8. [8]
    R.S. Boyer and J.S. Moore. A fast string matching algorithm. Comm. of the ACM, 20:762–772, 1977.
  9. [9]
    R. Baeza-Yates and G.H. Gonnet. On Boyer-Moore automata. Research report, University of Waterloo, 1989.
  10. [10]
    R. Baeza-Yates and M. Régnier. Fast Algorithms for Two Dimensional and Multiple Pattern Matching (Preliminary version). In Proc. 2nd Scandinavian Workshop on Algorithm Theory (SWAT 90), number 447 in Lecture Notes in Computer Science, pages 332–347. Springer-Verlag, Bergen, Norway, July 1990.
  11. [11]
    D. Breslauer. Dictionary-Matching on Unbounded Alphabets: Uniform-Length Dictionaries. Proc. of 5th Combinatorial Pattern Matching Conference, pages 184–197, 1994, Asilomar, CA, USA.
  12. [12]
    V. Bruyère, R. Baeza-Yates, O. Delgrange, and R. Scheihing. On the size of Boyer-Moore automata. Proceedings of Third South American Workshop on String Processing, pages 31–46. Recife, Brazil, August 1996.
  13. [13]
    M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Fast Practical Multi-Pattern Matching. Technical Report 93–3, Institut Gaspard Monge, Université de Marne-la-Vallée, Marne-la-Vallée, France, 1993.
  14. [14]
    L. Colussi, Z. Galil, and R. Giancarlo. The exact complexity of string matching. 31st Symposium on Foundations of Computer Science I, pages 135–143, IEEE, October 22–24, 1990.
  15. [15]
    R. Cole. Tight bounds on the complexity of the Boyer-Moore pattern matching algorithm. Technical Report 512, Computer Science Dept, New York University, June 1990.
  16. [16]
    B. Commentz-Walter. A string matching algorithm fast on the average. Technical Report 79.09.007, IBM Wissenschaftliches Zentrum, Heidelberg, Germany, 1979.
  17. [17]
    B. Commentz-Walter. A string matching algorithm fast on the average. Proc. 6th International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science, pages 118–132. Springer-Verlag, Berlin, Germany, 1979.
  18. [18]
    J.J. Fan and K.Y. Su. An efficient algorithm for matching multiple patterns. IEEE Transactions on Knowledge and Data Engineering, 5(2):339–351, April 1993.
  19. [19]
    Z. Galil. On improving the worst case running time of the Boyer-Moore string matching algorithm. Comm. of the ACM, 22(9):505–508, 1979.
  20. [20]
    L.J. Guibas and A.M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9:672–682, 1980.
  21. [21]
    D. Gusfield. Algorithms on Strings, Trees and Sequences. Press Syndicate of the University of Cambridge, 1997, pages 157–164.
  22. [22]
    A. Hume and D. Sunday. Fast String Searching. Software-Practice and Experience, 21(11):1221–1248, November 1991.
  23. [23]
    T. Hagerup. On saving space in parallel computation. Information Processing Letters, 29:327–329, 1988.
  24. [24]
    R.N. Horspool. Practical fast searching in strings. Software-Practice and Experience, 10:501–506, 1980.
  25. [25]
    R.M. Idury and A.A. Schaffer. Dynamic dictionary matching with failure functions. Proc. 3rd Annual Symposium on Combinatorial Pattern Matching, pages 273–284, 1992.
  26. [26]
    J.Y. Kim and J. Shawe-Taylor. Fast Multiple Keyword Searching. Proc. of 3rd Combinatorial Pattern Matching Conference, pages 41–51, 1992, Tucson, Arizona.
  27. [27]
    G. Kowalski and A. Meltzer. New multi-term high speed text search algorithms. 1st Conference on Computers and Applications, IEEE, 1984.
  28. [28]
    D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6:322–350, 1977.
  29. [29]
    T. Lecroq. A variation on the Boyer-Moore algorithm. Theoretical Computer Science, 92:119–144, Elsevier.

Copyright information

© Springer-Verlag Berlin Heidelberg 1998

Authors and Affiliations

  • Michal Ziv-Ukelson (1)
  • Aaron Kershenbaum (2)
  1. IBM T.J. Watson Research Center, Hawthorne
  2. Dept. of C.S., Polytechnic University, Brooklyn, NY