Tuning String Matching for Huge Pattern Sets
We present three algorithms for exact string matching of multiple patterns. Our algorithms are filtering methods that apply q-grams and bit parallelism. We ran extensive experiments with them and compared them with various versions of earlier algorithms, e.g. different trie implementations of the Aho-Corasick algorithm. Our algorithms proved to be substantially faster than earlier solutions for sets of 1,000–100,000 patterns. The gain is due to the improved filtering efficiency provided by q-grams.
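To make the idea of a q-gram filter with bit parallelism concrete, here is a minimal sketch. It is not the paper's implementation: it assumes all patterns have the same length, uses a Shift-And style state over overlapping q-grams, and verifies candidates with a plain set lookup. The names `build_filter` and `search` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def build_filter(patterns, q):
    """Map each q-gram to a bitmask of the positions where it occurs
    in ANY pattern (assumption: all patterns share one length m >= q)."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns) and m >= q
    masks = defaultdict(int)
    for p in patterns:
        for i in range(m - q + 1):
            masks[p[i:i + q]] |= 1 << i
    return masks, m


def search(text, patterns, q=2):
    """Return (start, pattern) pairs for every exact occurrence.

    Filtering phase: a Shift-And automaton over the text's q-grams.
    Bit i of `state` is set when the last i+1 text q-grams each occur
    at positions 0..i of some pattern (not necessarily the same one).
    Verification phase: candidates are checked against the pattern set,
    because the merged masks can produce false positives.
    """
    masks, m = build_filter(patterns, q)
    pattern_set = set(patterns)
    accept = 1 << (m - q)  # bit of the last q-gram position
    state = 0
    hits = []
    for j in range(len(text) - q + 1):
        state = ((state << 1) | 1) & masks.get(text[j:j + q], 0)
        if state & accept:
            start = j - (m - q)
            candidate = text[start:start + m]
            if candidate in pattern_set:  # verify the candidate
                hits.append((start, candidate))
    return hits


print(search("abracadabra", ["abra", "cada"]))
```

In this sketch, a larger q makes each table entry more selective, so fewer text positions reach the verification step; that is the sense in which q-grams improve filtering efficiency for large pattern sets. For example, with the patterns `abcd` and `xbcy`, the text `abcy` passes the filter (its q-grams each occur at the right positions in some pattern) but is rejected by verification.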
Keywords: Hash Table, Memory Usage, Binary Search, String Match, Single Pattern