
Tuning String Matching for Huge Pattern Sets

  • Jari Kytöjoki
  • Leena Salmela
  • Jorma Tarhio
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2676)

Abstract

We present three algorithms for exact string matching of multiple patterns. Our algorithms are filtering methods that apply q-grams and bit parallelism. We ran extensive experiments with them and compared them with various versions of earlier algorithms, e.g. different trie implementations of the Aho-Corasick algorithm. Our algorithms proved to be substantially faster than earlier solutions for sets of 1,000–100,000 patterns. The gain is due to the improved filtering efficiency that q-grams provide.
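
The general idea of q-gram filtering followed by verification can be illustrated with a minimal sketch. This is not one of the three algorithms of the paper, whose details are not given in the abstract, and it does not use bit parallelism. It assumes a simple scheme in which the first q characters of each pattern are indexed in a hash table, and a text position is verified only when its q-gram appears in that table. The names `build_qgram_index` and `find_all` and the default `q=3` are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_qgram_index(patterns, q):
    """Map the first q characters of each pattern to the patterns sharing them.

    Assumption for this sketch: every pattern has at least q characters.
    """
    index = defaultdict(list)
    for p in patterns:
        if len(p) < q:
            raise ValueError("every pattern must have at least q characters")
        index[p[:q]].append(p)
    return index


def find_all(text, patterns, q=3):
    """Return sorted (position, pattern) pairs for every exact occurrence."""
    index = build_qgram_index(patterns, q)
    matches = []
    for i in range(len(text) - q + 1):
        # Filtering phase: skip positions whose q-gram starts no pattern.
        candidates = index.get(text[i:i + q])
        if not candidates:
            continue
        # Verification phase: compare each surviving candidate in full.
        for p in candidates:
            if text.startswith(p, i):
                matches.append((i, p))
    return sorted(matches)
```

For example, `find_all("abracadabra", ["abra", "cada", "braz"])` reports "abra" at positions 0 and 7 and "cada" at position 4. In a sketch like this, a larger q lets fewer text positions pass the filter, at the cost of requiring longer patterns.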

Keywords

Hash Table, Memory Usage, Binary Search, String Match, Single Pattern
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Jari Kytöjoki
    • 1
  • Leena Salmela
    • 1
  • Jorma Tarhio
    • 1
  1. Department of Computer Science and Engineering, Helsinki University of Technology, Finland
